A. Reinefeld, F. Schintke, ZIB 1
Replication and Consistency
in Cloud File Systems

Alexander Reinefeld and Florian Schintke
Zuse-Institut Berlin

Cloud Computing Day at the IKMZ of BTU Cottbus, 14 April 2011
Let’s start with a little quiz
Who invented Cloud Computing?
a) Werner Vogels
b) Ian Foster
c) Konrad Zuse
The correct answer is … c)
“Eventually, computing centers too will be interconnected
via telecommunication lines, …”
Konrad Zuse in “Rechnender Raum” (1969)
Konrad Zuse, 22 June 1910 - 18 December 1995
Zuse Institute Berlin
Research institute for applied mathematics
and computer science
• Peter Deuflhard
chair for scientific computing, FU Berlin
• Martin Grötschel
chair for discrete mathematics, TU Berlin
• Alexander Reinefeld
chair for computer science, HU Berlin
HPC Systems @ ZIB
1984
Cray 1M
160 MFlops

1987
Cray X-MP
471 MFlops

1994
Cray T3D
38 GFlops

1997
Cray T3E
486 GFlops

2002
IBM p690
2.5 TFlops

2008/09
SGI ICE, XE
150 TFlops

1,000,000-fold performance increase in 25 years (1984-2009)
HLRN
2 sites
98 computer racks
26,112 CPU cores
128 TB memory
1,620 TB disk
300 TFlops peak performance
STORAGE
3 SL8500 robots
39 tape drives
19,000 slots
What is Cloud Computing?
Cloud Computing = Grid Computing on Datacenters?
• not that simple …
Cloud and Grid both abstract resources through interfaces.
• Grid: via new middleware.
Requires Grid APIs.
• Cloud: via virtualization.
Allows legacy APIs.
Software as a Service (SaaS): Applications, Application Services
Platform as a Service (PaaS): Programming Environment, Execution Environment
Infrastructure as a Service (IaaS): Infrastructure Services, Resource Set
Why Cloud?
Pros
• It scales
  because it's their resources, not yours
• It's simple
  because they operate it
• Pay for what you need
  don't pay for empty spinning disks

Cons
• It's expensive
  Amazon S3 charges $0.15 / GB / month
  = $1,800 / TB / year
• It's not 100% secure
  S3 now lets you bring your own RSA key pair.
  But: would you put your bank account in the cloud?
• It's not 100% available
  S3 provides “service credits” if availability drops
  (10% for 99.0-99.9% availability)
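The yearly figure follows directly from the quoted per-GB price (using 1 TB = 1000 GB):

```python
# Amazon S3 price quoted on the slide: $0.15 per GB per month
price_per_gb_month = 0.15

# One terabyte stored for a full year
cost_per_tb_year = price_per_gb_month * 1000 * 12

print(cost_per_tb_year)  # 1800.0 -> $1,800 / TB / year, as stated
```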
File System Landscape
PC, local system: ext3, ZFS, NTFS
Network FS / centralized: NFS, SMB, AFS/Coda
Cluster FS / datacenter: Lustre, Panasas, GPFS, Ceph, ...
Cloud/Grid: Grid File System (GFarm), GDM, "gridftp"
Consistency, Availability, Partition tolerance:
Pick two of three!
Consistency: All clients have the
same view of the data.
Availability: Each client can
always read and write.
Partition tolerance: Operations
will complete, even if individual
components are unavailable.
Brewer, Eric. Towards Robust Distributed Systems. PODC Keynote, 2000.
(CAP triangle: A, C, P - pick any two)
A + P: Amazon S3, Mercurial, Coda/AFS
C + A: single server, Linux HA (one data center)
C + P: distributed databases, distributed file systems
Which semantics do you expect?
Distributed file systems should provide C + P
But recent hype was on A + P + eventual consistency (e.g. Amazon S3)
Grid File System
• provides access to heterogeneous
storage resources, but
• middleware causes additional
complexity, vulnerability
• requires explicit file transfer
• whole file: latency to 1st access,
bandwidth, disk storage
• also partial file access (gridftp)
and pattern access (falls)
• no consistency among replicas
• user must take care
• no access control on replicas
Cloud File System: XtreemFS
Focus on
• data distribution
• data replication
• object based
Key features
• MRCs are separated
from OSDs
• the fat client is the “link”
MRC = metadata & replica catalogue
OSD = object storage device
Client = file system interface
A closer look at XtreemFS
• Features
• distributed, replicated POSIX compliant file system
• Server software (Java) runs on Linux, OS X, Solaris
• Client software (C++) runs on Linux, OS X, Windows
• secure: X.509 and SSL
• open source (GPL)
• Assumptions
• synchronized clocks with bounded drift (needed for OSD lease
negotiation; a reasonable assumption in clouds)
• an upper bound on the round-trip time
• no need for FIFO channels (runs on either TCP or UDP)
XtreemFS Interfaces
File access protocol
(diagram: user application → Linux VFS → XtreemFS client (FUSE) → MRC and OSD;
after writing, the client reports the new file size of 128k to the MRC
via Update(Cap, FileSize=128k))
Client
• gets list of OSDs from MRC
• gets a capability (signed by the MRC) per file
• selects best OSD(s) for parallel I/O
• various striping policies: scatter/gather, RAIDx, erasure codes
• scalable and fast access
• no communication between OSD and MRC needed
• client is the “missing link”
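A capability can be thought of as a token signed by the MRC that an OSD verifies locally, which is why no OSD-MRC communication is needed. A minimal sketch using an HMAC (the real XtreemFS capability format and its X.509/SSL-based signing differ; all names here are illustrative):

```python
import hashlib
import hmac

MRC_SECRET = b"shared-secret-between-mrc-and-osds"  # illustrative only

def issue_capability(file_id: str, mode: str):
    """MRC signs (file_id, mode) so OSDs can verify it offline."""
    payload = f"{file_id}:{mode}".encode()
    sig = hmac.new(MRC_SECRET, payload, hashlib.sha256).hexdigest()
    return payload, sig

def osd_verify(payload: bytes, sig: str) -> bool:
    """OSD checks the signature; no round trip to the MRC needed."""
    expected = hmac.new(MRC_SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)

cap = issue_capability("volume1/file42", "rw")
print(osd_verify(*cap))               # True: OSD accepts the capability
print(osd_verify(b"forged", cap[1]))  # False: tampered payload is rejected
```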
MRC – Metadata and Replica Catalogue
• provides
• open(), close(), readdir(), rename(), …
• attributes per file: size, last access, access rights, location (OSDs), …
• capability (file handle) to authorize a client to access objects on OSDs
• implemented with a key/value store (BabuDB)
• fast index
• append-only DB
• allows snapshots
OSD – Object Storage Device
• serves file content operations
• read(), write(), truncate(), flush(), …
• implements object replication
• also partial replicas for read access
• data is filled on demand
• gets OSD list from MRC
• slave OSD redirects to master OSD
• write ops only on master OSD
• POSIX requires linearizable reads, hence reads are also redirected
OSD – Object Storage Device
• Which OSD to select?
• object list
• bandwidth
• rarest first
• network coordinates, datacenter map, …
• prefetching (for partial replicas)
OSD – Object Storage Device
• implements concurrency control for replica consistency
• POSIX compliant
• master/slave replication with failover
• group membership service provided by MRC
• lease service “Flease”: distributed, scalable and failure-tolerant
• 50,000 leases/sec with 30 OSDs
• based on quorum consensus (Paxos)
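Lease safety rests on the bounded clock-drift assumption stated earlier: a node may act as master only while its lease remains valid even under worst-case drift. A sketch of that check (the drift bound and timings are illustrative, not Flease's actual constants):

```python
import time

MAX_DRIFT = 0.5  # assumed upper bound on clock drift between nodes (seconds)

class Lease:
    def __init__(self, holder: str, expires_at: float):
        self.holder = holder
        self.expires_at = expires_at

    def is_valid_for(self, node: str, now: float) -> bool:
        # The holder must stop acting as master MAX_DRIFT early, so that
        # even a node whose clock runs fast never sees two masters at once.
        return node == self.holder and now < self.expires_at - MAX_DRIFT

now = time.time()
lease = Lease(holder="osd1", expires_at=now + 10.0)
print(lease.is_valid_for("osd1", now))        # safely within the lease
print(lease.is_valid_for("osd1", now + 9.8))  # inside drift margin: give up
```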
Quorum consensus
• Basic idea
• When a majority is informed, every other majority has at least one
member with up-to-date information.
• A minority may crash at any time.
• Paxos consensus
• Step 1: check whether a consensus c has already been established
• Step 2: re-establish c, or try to establish an own proposal
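For a small cluster the intersection property is easy to verify exhaustively: any two majorities of n nodes always share at least one member.

```python
from itertools import combinations

def majorities(n: int):
    """All subsets of {0..n-1} of majority size."""
    maj = n // 2 + 1
    return [set(c) for c in combinations(range(n), maj)]

ms = majorities(5)
# Every pair of majorities has a non-empty intersection
print(all(a & b for a in ms for b in ms))  # True
```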
Proposer
Init
  r = 1              // local round number
  r_latest = 0       // number of the latest acknowledged round
  latest_v = ⊥       // value of the latest acknowledged round

// Send a new proposal (phase 1)
  acknum = 0         // number of valid acknowledgements
  Send prepare(r) to all acceptors

On receiving ack(r_ack, v_i, r_i) from acceptor i
  If r == r_ack
    acknum++
    If r_i > r_latest
      r_latest = r_i   // more recently accepted round
      latest_v = v_i   // more recent value
  If acknum ≥ maj      // end of phase 1
    If latest_v == ⊥, propose an own value latest_v ≠ ⊥
    Send accept(r, latest_v) to all acceptors

Learner
Init
  numaccepted = 0    // number of collected accepts

On receiving accepted(r, v) from acceptor i
  If r has increased: numaccepted = 0
  numaccepted++
  If numaccepted == maj
    decide v; inform client   // v is the consensus

Acceptor
Init
  r_ack = 0          // latest acknowledged round
  r_accepted = 0     // latest accepted round
  v = ⊥              // current local value

On receiving prepare(r) from a proposer
  If r > r_ack ∧ r > r_accepted   // higher round
    r_ack = r
  Send ack(r_ack, v, r_accepted) to the proposer

On receiving accept(r, w)
  If r ≥ r_ack ∧ r > r_accepted
    r_accepted = r
    v = w
    Send accepted(r_accepted, v) to the learners
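The two phases above can be made executable. A minimal single-decree Paxos sketch with in-memory calls standing in for the network (not the Flease implementation; names are illustrative, and the learner role is folded into the proposer):

```python
class Acceptor:
    def __init__(self):
        self.r_ack = 0        # latest acknowledged (promised) round
        self.r_accepted = 0   # latest accepted round
        self.value = None     # current local value

    def prepare(self, r):
        # Promise round r only if it is higher than anything seen so far
        if r > self.r_ack and r > self.r_accepted:
            self.r_ack = r
        return (self.r_ack, self.value, self.r_accepted)

    def accept(self, r, w):
        if r >= self.r_ack and r > self.r_accepted:
            self.r_accepted = r
            self.value = w
            return (self.r_accepted, self.value)
        return None  # rejected: a higher round was promised meanwhile

def propose(acceptors, r, my_value):
    """One proposer round: phase 1 (prepare/ack), phase 2 (accept/accepted)."""
    maj = len(acceptors) // 2 + 1
    # Phase 1: collect acks; adopt the value of the latest accepted round
    acks, r_latest, latest_v = 0, 0, None
    for a in acceptors:
        r_ack, v, r_acc = a.prepare(r)
        if r_ack == r:
            acks += 1
            if r_acc > r_latest:
                r_latest, latest_v = r_acc, v
    if acks < maj:
        return None  # no majority promised: retry with a higher round
    value = latest_v if latest_v is not None else my_value
    # Phase 2: ask acceptors to accept; decide once a majority has accepted
    accepted = sum(1 for a in acceptors if a.accept(r, value) is not None)
    return value if accepted >= maj else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, 1, "lease->osd1"))  # decided: lease->osd1
# A later proposer with a higher round must re-establish the same consensus:
print(propose(acceptors, 2, "lease->osd2"))  # still: lease->osd1
```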
Striping Performance on Cluster
• Striping
• parallel transfer from/to
many OSDs
• bandwidth scales with the
number of OSDs
• client is the bottleneck
(the slower reads are caused by a TCP ingress problem)

Benchmark (read/write): one client writes/reads a single 4 GB file using
asynchronous writes, read-ahead, a 1 MB chunk size, and 29 OSDs. Nodes are
connected with IP over InfiniBand (1.2 GB/s).
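The round-robin placement of a file's objects onto OSDs can be sketched as a simple offset calculation (a RAID0-style sketch using the benchmark's parameters; not the actual XtreemFS striping-policy code):

```python
CHUNK_SIZE = 1 << 20  # 1 MB objects, as in the benchmark
NUM_OSDS = 29

def locate(offset: int):
    """Map a byte offset to (object number, OSD index, offset in object)."""
    obj = offset // CHUNK_SIZE
    return obj, obj % NUM_OSDS, offset % CHUNK_SIZE

print(locate(0))                     # (0, 0, 0): first object on OSD 0
print(locate(30 * CHUNK_SIZE + 17))  # (30, 1, 17): object 30 wraps to OSD 1
```

Because consecutive objects land on different OSDs, a client reading sequentially pulls from many servers in parallel, which is why bandwidth scales with the number of OSDs until the client itself saturates.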
Snapshots & Backups
Metadata snapshots (MRC)
• need atomic operation without service interrupt
• asynchronous consolidation in background
• granularity: subdirectories or volumes
• implemented by BabuDB or Scalaris
File snapshots (OSD)
• taken implicitly when the file is idle,
or explicitly on close or fsync()
• versioning of file objects: copy-on-write
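The copy-on-write scheme can be sketched as: a snapshot freezes the current object version, and the next write creates a new version instead of overwriting it (an illustrative structure, not the OSD's on-disk layout):

```python
class CowObject:
    """One file object with copy-on-write versions, newest last."""
    def __init__(self, data: bytes):
        self.versions = [data]
        self.snapshotted = False

    def snapshot(self):
        # Freeze the current version; the next write must copy
        self.snapshotted = True

    def write(self, data: bytes):
        if self.snapshotted:
            self.versions.append(data)  # copy-on-write: keep the old version
            self.snapshotted = False
        else:
            self.versions[-1] = data    # no snapshot pending: update in place

obj = CowObject(b"v1")
obj.snapshot()
obj.write(b"v2")
print(obj.versions)  # [b'v1', b'v2']: the snapshot still sees b'v1'
```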
Atomic Snapshots in MRC
• implemented with BabuDB backend
• a large-scale DB for data that exceeds the system’s main memory
• 2 components:
• small mutable overlay trees (LSM trees)
• large immutable memory-mapped index on disk
• non-transactional key-value store
• prefix and range queries
• primary design goal: Performance!
• 300,000 lookups/sec (30M entries)
• fast crash recovery
• fast start-up
Log-Structured Merge Trees
A lookup takes O(s log n), with s = number of snapshots, n = number of files
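The O(s log n) bound follows from probing each of the s trees, newest first, with an O(log n) search. A sketch with sorted lists standing in for the overlay and on-disk indices (illustrative, not BabuDB's data structures):

```python
from bisect import bisect_left

def lookup(key, overlays):
    """Probe s overlays newest-first; each probe is O(log n),
    so a lookup costs O(s log n) in total.
    Each overlay is a sorted list of (key, value) pairs."""
    for tree in overlays:
        i = bisect_left(tree, (key,))
        if i < len(tree) and tree[i][0] == key:
            return tree[i][1]
    return None  # key present in no overlay

base = [("a", 1), ("b", 2), ("c", 3)]  # immutable on-disk index
overlay = [("b", 20), ("d", 4)]        # small mutable in-memory overlay
print(lookup("b", [overlay, base]))  # 20: the newer overlay wins
print(lookup("a", [overlay, base]))  # 1: falls through to the base index
```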
Replicating MRC, OSDs
Master/Slave Scheme
• Pros
• fast local read
• no distributed transactions
• easy to implement
• Cons
• master is performance bottleneck
• interrupt when master fails: needs stable master election
Replicated State Machine (Paxos)
• Pros
• no master, no single point of failure
• no extra latency on failure
• Cons
• slower: 2 round trips per op
• needs distrib. consensus
XtreemFS Features
Release 1.2.1 (current)
• RAID and parallel I/O
• POSIX compatibility
• Read-only replication
• Partial replicas (on-demand)
• Security (SSL, X.509)
• Internet ready
• Checksums
• Extensions
• OSD and replica selection (Vivaldi,
datacenter maps)
• Asynchronous MRC backups
• Metadata caching
• Graphical admin console
• Hadoop file system driver (experimental)
Release 1.3 (very soon)
• DIR and MRC replication with automatic
failover
• Read/write replication
Release 2.x
• Consistent Backups
• Snapshots
• Automatic replica creation, deletion and
maintenance
Source Code
• XtreemFS
• http://code.google.com/p/xtreemfs
• 35,000 lines of C++ and Java code
• GNU GPL v2 license
• BabuDB
• http://code.google.com/p/babudb
• 10,000 lines of Java code
• new BSD license
• Scalaris
• http://code.google.com/p/scalaris
• 28,214 lines of Erlang and C++ code
• Apache 2.0 license
Summary
• Cloud file systems require replication
• availability
• fast access, striping
• Replication requires consistency algorithm
• when crashes are rare: use master/slave replication
• with frequent crashes: use Paxos
• Only Consistency + Partition tolerance from CAP theorem
• Our next step: Faster high-level data services
• for MapReduce, Dryad, key/value store, SQL, …