CMPT 401 Summer 2007 Dr. Alexandra Fedorova Lecture XIII: Replication-II.
CMPT 401 2008
Dr. Alexandra Fedorova
Lecture XIII: Replication-II
CMPT 401 2008 © A. Fedorova
Outline
• Harp – A replicated research file system
• Google File System – A real replicated file system
• Amazon Distributed Data Store – A distributed database
Questions about Harp
• Does HARP use the two-phase commit protocol? If so, when and how? How does it differ from the 2PC protocol we studied in class?
• How many replicas that keep copies of data do we need to survive n failures? How many total participants must we have to survive n failures?
• Describe normal operation in Harp. Explain the following:
– What the primary does
– What the replica does
– What the witness does
• How does Harp survive failures without flushing updates to disk before responding to the client?
Overview of Harp
• Uses primary copy replication for:
– Reliability
– Availability
• Single primary server, backups and witness
• Accessed via NFS interface
• Performance was a concern – the operations log is kept in memory only:
– To guard against machine failures: other replicas will have the log in memory
– To guard against power failures: each machine has a UPS; upon power failure there is time to flush the log to persistent storage
Access via NFS Interface
[Figure: a client stack (user application, OS, NFS client) talks to a server stack (OS, NFS server) that fronts the replicated FS: primary, backup, witness.]
Failover Transparent to Clients
[Figure: one client stack (user application, OS, NFS client) and three server stacks (OS, NFS server) labeled primary, backup and witness; the address 192.168.51.2 appears in the diagram.]
• Data is sent to a multicast address
• Reaches all potential primaries
• Discarded by hardware at all except the primary
Goals and Environment of Harp
• Provide highly available file system service via replication
• Assume failstop failures
• Survive network partitions
• Assume synchronous system (?) – probably, because they rely on timeouts when detecting node failure
• In many systems, replication caused performance degradation – replica communication slowed down the sending of the response to the client
• Harp’s goal was to provide reliability and availability without performance loss
Harp’s Components
• In the presence of network partitions, must have 2n + 1 replicated components to survive n failures
• The quorum (a majority, i.e., n+1 servers) gets to form a new group and elect a new primary
• Usually data is replicated on 2n+1 replicas
• In Harp, data is replicated on n+1 servers
• The other servers are used to create quorum
• They are called witnesses
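The counts on this slide can be worked out with a few lines of arithmetic. The sketch below is only an illustration; the function name `harp_group_size` and the returned dictionary are made up for this example and are not part of Harp.

```python
def harp_group_size(n_failures: int) -> dict:
    """Return the component counts needed to survive n_failures (illustrative helper)."""
    if n_failures < 0:
        raise ValueError("n_failures must be non-negative")
    total = 2 * n_failures + 1      # participants needed so that a majority always survives
    replicas = n_failures + 1       # servers that actually store file data
    witnesses = total - replicas    # remaining members only help to form a quorum
    quorum = n_failures + 1         # majority of 2n + 1
    return {"total": total, "replicas": replicas,
            "witnesses": witnesses, "quorum": quorum}
```

For n = 1 this gives the familiar three-member group: two data-holding servers (primary and backup) and one witness, with a quorum of two.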
Harp’s Witness
[Figure: two partition scenarios among primary, backup and witness.]
• Backup and primary cannot communicate
• Who should be the primary?
• Scenario 1: the witness resolves the tie in favor of the primary; data survives at the primary
• Scenario 2: the witness resolves the tie in favor of the backup; data survives at the backup
Harp: Normal Operation
[Figure: message flow among the client, primary, backup and witness.]
1. Client: send request to the primary
2. Primary: record the operation in the in-memory log
3. Primary: forward request to backup
4. Backup: record the operation in the in-memory log
5. Backup: respond to primary
6. Primary: “commit” the operation – mark it as committed in memory
7. Primary: respond to client
8. Primary: tell the backup to commit
Two-phase Protocol for Updates
• Phase 1:
– Send updates to all backups
– Wait for backups to respond
– Send response to the client
• Phase 2:
– Backups are informed about commit
– Backups commit the operation locally
• Phase 1 is in the critical path
• Phase 2 happens in the background
• Phase 1 is quick, because updates do not have to be written to disk
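A minimal in-process sketch of the two phases, assuming one primary and a list of backups that never lose messages. The class and method names (`Primary`, `Backup`, `prepare`, `flush_commits`) are hypothetical; real Harp exchanges network messages and handles failures through view changes.

```python
class Backup:
    def __init__(self):
        self.log = []          # in-memory log of operations
        self.committed = 0     # number of log records known to be committed

    def prepare(self, op):
        self.log.append(op)    # phase 1: record in memory only, no disk write
        return True            # acknowledgement sent back to the primary

    def commit_up_to(self, count):
        self.committed = max(self.committed, count)   # phase 2


class Primary:
    def __init__(self, backups):
        self.backups = backups
        self.log = []
        self.committed = 0
        self.pending_commits = []   # commit notices not yet sent to backups

    def handle_update(self, op):
        self.log.append(op)
        if not all(b.prepare(op) for b in self.backups):
            raise RuntimeError("a backup did not acknowledge")
        self.committed = len(self.log)
        self.pending_commits.append(self.committed)
        return "ok"                 # reply to the client: phase 1 ends here

    def flush_commits(self):
        # Phase 2, meant to run in the background, off the critical path.
        for count in self.pending_commits:
            for b in self.backups:
                b.commit_up_to(count)
        self.pending_commits.clear()
```

Note that the client gets its reply before any backup learns about the commit, which is why phase 2 does not add to the response time.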
In-Memory Logging
• Client operations are recorded in the in-memory logs (at the primary and at the backup) by the time the response is sent to the client
• Operations are applied to the file system later, in the background
• This is done to keep disk access off the critical path when communicating with the client
• What if the primary fails?
– That’s okay, because the in-memory log survives at the backup
• What if there is a power failure?
– The machine will operate for a while on UPS – this time will be used to apply operations in the log to the file system
Write-Behind Logging
[Figure: the log drawn as a sequence of records (Record n … Record n+6) with four pointers into it.]
• CP – commit pointer – most recently committed event record
• AP – most recently applied event record
• LB – most recent event that has reached the local disk
• GLB – most recent event that has reached the local disk at both primary and backup
On failure the server restores the log and redoes all committed operations in the log
Log Updates: Commit Pointer
• Primary receives the client request
– A log record is created at the primary
• Primary forwards request to the backups
– Backups add records to their logs
• Backups acknowledge receipt of records to the primary
• Primary commits the operation
– Advances commit pointer CP
– Sends the commit decision to the backup
• Backup advances its own CP
Log Updates: Application Pointer
• The “Apply” process
• Runs in the background
• Applies committed records to disk
• Advances AP pointer
• Can we discard records before the AP pointer?
• No! Writes are asynchronous
• A committed record may not necessarily be on disk
Log Updates: LB and GLB pointers
• Another process checks when the writes associated with log records have been applied to the file system
• When the writes have finished, it advances the LB pointer
• GLB: Global LB pointer: all records up to this pointer have been applied to disk at both the primary and the backup
• Records below the GLB pointer can be discarded
• Log invariant:
GLB <= LB <= AP <= CP
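The four pointers and the invariant can be sketched as a small class. This is an illustration only: the name `HarpLog`, its methods, and the use of -1 for "nothing yet" are assumptions, and each advance is clamped so that the invariant cannot be violated.

```python
class HarpLog:
    def __init__(self):
        self.records = {}    # record index -> operation
        self.next = 0        # index the next appended record will get
        self.cp = self.ap = self.lb = self.glb = -1   # -1 means "nothing yet"

    def append(self, op):
        self.records[self.next] = op
        self.next += 1
        return self.next - 1

    def check(self):
        # The log invariant from the slide, plus "CP points at an existing record".
        assert self.glb <= self.lb <= self.ap <= self.cp < self.next

    def commit(self, idx):           # backups have acknowledged up to idx
        self.cp = max(self.cp, min(idx, self.next - 1))
        self.check()

    def apply(self, idx):            # writes issued to the file system up to idx
        self.ap = max(self.ap, min(idx, self.cp))
        self.check()

    def on_disk(self, idx):          # writes known to have reached the local disk
        self.lb = max(self.lb, min(idx, self.ap))
        self.check()

    def global_on_disk(self, peer_lb):
        # GLB is bounded by the slower disk among primary and backup.
        self.glb = max(self.glb, min(self.lb, peer_lb))
        for i in [i for i in self.records if i <= self.glb]:
            del self.records[i]      # records at or below GLB are safe to discard
        self.check()
```

Only `global_on_disk` ever discards records, which mirrors the rule that a record may be dropped only after it is on disk at both servers.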
Non-modification Operations
• Performed entirely at the primary
• No communication with backups
• Problem: what if the backup becomes disconnected from the primary and forms a new view?
• Then the primary may respond to a read operation with old state (i.e., it may not know that a file has been updated)
• How does Harp solve this problem?
• The backup sends a promise to the primary not to change the view within time t + σ. Within that time, the primary can respond to read operations without talking to the backup.
• After that, it must contact the backup before performing a non-modification operation, to get a new promise.
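The promise behaves like a read lease. The sketch below shows one way the primary could track it; the class `PrimaryReadLease`, its parameters, and the explicit `clock_skew_margin` are assumptions made for illustration, and time is passed in explicitly so the logic is easy to follow.

```python
class PrimaryReadLease:
    def __init__(self):
        # Time until which the backup promised not to change the view.
        self.promise_expires = 0.0

    def renew(self, now, promise_length):
        # Called when the backup sends a fresh promise.
        self.promise_expires = now + promise_length

    def can_read_locally(self, now, clock_skew_margin):
        # Be conservative: stop trusting the promise slightly early,
        # to allow for clocks that do not agree exactly.
        return now < self.promise_expires - clock_skew_margin

    def read(self, now, clock_skew_margin, local_read, contact_backup):
        if not self.can_read_locally(now, clock_skew_margin):
            # contact_backup is a hypothetical callback returning a new promise length.
            self.renew(now, contact_backup())
        return local_read()
```

While the promise is valid, `read` never touches the backup; once it lapses, a read costs one round trip to obtain a new promise.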
Handling Failures: View Changes
• View – the composition of the group and the roles of its members
• When some members fail, the view has to change
• A view change selects the members of the new view and makes sure that the state of the new view reflects all committed operations from previous views
• The designated primary and backup monitor other group members to detect changes in communication ability
• If they cannot communicate with some of the members, a view change is needed
• Either a primary or a backup can initiate a view change (not witness)
View Change
[Figure: two scenarios among primary, backup and witness.]
• Scenario 1: the primary cannot reach the backup, but can reach the witness
• The primary initiates a view change
• Scenario 2: the backup cannot reach the primary, but can reach the witness
• The backup initiates the view change
Causes and Outcomes of View Changes
• A primary fails, so a new primary is needed
– A backup will become the primary after a view change
• A backup fails, someone else needs to replicate the state at the primary
– Witness is configured to act as a backup – the witness is promoted
• A primary that had failed comes back
– It will bring itself up-to-date (using other servers’ logs) and will become the primary again
• A backup that had failed comes back
– It will bring itself up-to-date; the previously promoted witness will no longer act as backup – the witness is demoted
View Change: The Algorithm
• The node that starts the view change acts as coordinator
• Phase 1:
– Coordinator tells others it wants to start a view change
– Others stop processing any operations and send the coordinator their state, i.e., log records (that the coordinator does not already have)
– The coordinator applies the log records to bring itself up-to-date
View Change: The Algorithm (cont.)
• Phase 2:
– The coordinator writes the new view number to disk
– Sends the view state to all participants
– If both backup and witness responded, witness will be demoted
– If only the witness responded, witness will be promoted
– Other nodes write the view number to disk
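The decision the coordinator makes at the end of a view change can be sketched for a three-member group (primary, backup, witness). The function `new_view` and its return format are hypothetical, and the sketch ignores log transfer and disk writes; it only shows who ends up in the view and whether the witness is promoted.

```python
def new_view(coordinator, responders, view_number):
    """coordinator: 'primary' or 'backup'; responders: other members that replied."""
    members = {coordinator} | set(responders)
    if len(members) < 2:
        return None                       # no majority of the 3-member group
    data_servers = members & {"primary", "backup"}
    # The witness stands in as a backup only when one data server is missing.
    witness_promoted = "witness" in members and len(data_servers) < 2
    # The designated primary leads whenever it is in the view.
    acting_primary = "primary" if "primary" in members else "backup"
    return {"view": view_number + 1,
            "members": sorted(members),
            "acting_primary": acting_primary,
            "witness_promoted": witness_promoted}
```

For example, `new_view("backup", {"witness"}, 4)` describes the case where the primary is unreachable: the backup becomes the acting primary and the witness is promoted.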
A Promoted Witness
• Does not have a copy of the file system state
• Under normal operation, does not update the file system
• A promoted witness begins logging filesystem state
• Upon promotion receives all log records that have not made it to disk (everything later than the GLB pointer)
• Promoted witness never discards log records
• When the log becomes too large, it is stored on disk or tape
Simultaneous View Changes
• Suppose primary and backup cannot communicate with each other
• They both initiate a view change simultaneously
• One view change will be redundant – don’t want to waste time/resources on a useless view change
• Solution: delay the view change at the backup
• This way the primary is most likely to “win the race” for the view change
• What happens if simultaneous view changes are in place?
Optimizations for Fast View Changes
• User operations are not processed during a view change, so view changes must be fast
• A view change may be slow if the server that must bring itself up-to-date must receive lots of log records from other servers
• Therefore, the server that must bring itself up-to-date in a new view (i.e., the primary that comes back after failure) brings itself up-to-date before initiating the view change
• If the server’s disk is intact, it gets log records from the witness
• If the disk is damaged, it gets FS state from the backup and then it gets log records from the witness
Other Optimizations
• When the witness is promoted, it must receive all log entries beyond GLB
• The number of entries is likely to be large, so the view change may be slow
• To expedite the view change, the witness is kept in hot standby
• The primary sends all updates to the witness. The witness logs them, but does not acknowledge them. It discards old entries from memory and does not log them to disk or tape
Guarding Against a “Killer Packet”
• Many crashes are due to software bugs
• Some bugs may cause simultaneous failure at the primary and backup – i.e., an OS bug is triggered by a certain FS operation
• To guard against this, the backup waits with applying changes to the FS until they have been applied at the primary: AP_backup ≤ AP_primary
• If the primary fails after applying a certain change, the backup will likely initiate the view change and will send the log to the witness
• So even if the backup fails after applying the same operation that crashed the primary, the record of that operation won’t be lost
A Potential Failure Scenario
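[Figure: timeline of steps at the primary and the backup.]
1. Primary: receive operation from the client
2. Primary: forward it to backup
3. Backup: record the operation in the log
4. Backup: respond to the primary
5. Primary: commit the operation
6. Primary: respond to the client
7. Primary: crash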
• Backup does not know if the operation was committed
• Does it assume it was not committed and discard log entries?
• Does it assume it was committed and apply the results?
Let’s Play Harp!
• Let’s go over all the steps
• During normal operation
• And with failures
Summary
• Primary-copy file system
• Unlike other replicated file systems, provides good performance, because disk writes are not in the critical path
• Needs at least 2n+1 participants to handle n failures
• Data is replicated only on n+1 servers, to save disk space
• Would like to see more evidence/discussion on:
– How the system works with view changes
– What happens if a component crashes during a view change?
– What happens with log records of uncommitted operations?
Google File System
• A real massive distributed file system
• Hundreds of servers and clients
– The largest cluster has >1000 storage nodes, over 300 TB of disk storage, hundreds of clients
• Metadata replication
• Data replication
• Design driven by application workload and technological environment
• Avoided many of the difficulties traditionally associated with replication by designing for a specific use case
Specifics of the Google Environment
• The FS consists of hundreds of storage machines, built of inexpensive commodity parts
• Component failures are the norm:
– Application and OS bugs
– Human errors
– Hardware failures: disks, memory, network, power supplies
• Millions of files, each 100 MB or larger
• Multi-GB files are common
• Applications are written for GFS
• Allows co-design of the file system and applications
Specifics of the Google Workload
• Most files are mutated by appending new data – large sequential writes
• Random writes are very uncommon
• Files are written once, then they are only read
• Reads are sequential
• Large streaming reads and small random reads
• High bandwidth is more important than low latency
• Google applications:
– Data analysis programs that scan through data repositories
– Data streaming applications
– Archiving
– Applications producing (intermediate) search results
GFS Architecture
GFS Architecture (cont.)
• Single master
• Multiple chunk servers
• Multiple clients
• Each is a commodity Linux machine, a server is a user-level process
• Files are divided into chunks
• Each chunk has a handle (an ID assigned by the master)
• Each chunk is replicated (on three machines by default)
• Master stores metadata, manages chunks, does garbage collection, etc.
• Clients communicate with master for metadata operations, but with chunkservers for data operations
• No additional caching (besides the Linux in-memory buffer caching)
Client/GFS Interaction
• Client:
– Takes file and offset
– Translates it into the chunk index within the file
– Sends request to master, containing file name and chunk index
• Master:
– Replies with the corresponding chunk handle and location of the replicas (the master must know where the replicas are)
• Client:
– Caches this information
– Contacts one of the replicas (i.e., a chunkserver) for data
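The lookup path above can be sketched as follows. The 64 MB chunk size is an assumption (it is not stated on the slides), and `FakeMaster`, `Client` and `locate` are hypothetical names standing in for the real master and client library.

```python
CHUNK_SIZE = 64 * 2**20   # assumed chunk size (64 MB); not stated on the slides


class FakeMaster:
    """Hypothetical stand-in for the master's lookup table."""
    def __init__(self, table):
        self.table = table      # (file name, chunk index) -> (handle, replica locations)
        self.lookups = 0

    def lookup(self, name, chunk_index):
        self.lookups += 1
        return self.table[(name, chunk_index)]


class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}

    def locate(self, name, offset):
        chunk_index = offset // CHUNK_SIZE        # translate byte offset to chunk index
        key = (name, chunk_index)
        if key not in self.cache:                 # ask the master only on a cache miss
            self.cache[key] = self.master.lookup(name, chunk_index)
        handle, replicas = self.cache[key]
        # Return the handle, one replica to contact, and the offset inside the chunk.
        return handle, replicas[0], offset % CHUNK_SIZE
```

Repeated reads within the same chunk hit the client's cache, so the master is consulted once per chunk rather than once per read.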
Master
• Stores metadata
– The file and chunk namespaces
– Mapping from files to chunks
– Locations of each chunk’s replicas
• Interacts with clients
• Creates chunk replicas
• Orchestrates chunk modifications across multiple replicas
– Ensures atomic concurrent appends
– Locks concurrent operations
• Deletes old files (via garbage collection)
Metadata On Master
• Metadata – data about the data:
– File names
– Mapping of file names to chunk IDs
– Chunk locations
• Metadata is kept in memory
• File names and chunk mappings are also kept persistently in an operation log
• Chunk locations are kept in memory only:
– They will be lost during a crash
– The master asks chunk servers about their chunks at startup and builds a table of chunk locations
Why Keep Metadata In Memory?
• To keep master operations fast
• Master can periodically scan its internal state in the background, in order to implement:
– Garbage collection
– Re-replication (in case of chunk server failures)
– Chunk migration (for load balancing)
• But the file system size is limited by the amount of memory on the master?
– This has not been a problem for GFS – metadata is compact
Why Not Keep Chunk Locations Persistent?
• Chunk location – which chunk server has a replica of a given chunk
• Master polls chunk servers for that information on startup
• Thereafter, master keeps itself up-to-date:
– It controls all initial chunk placement, migration and re-replication
– It monitors chunkserver status with regular HeartBeat messages
• Motivation: simplicity
• Eliminates the need to keep master and chunkservers synchronized
• Synchronization would be needed when chunkservers:
– Join and leave the cluster
– Change names
– Fail and restart
Operation Log
• Historical record of metadata changes
• Maintains logical order of concurrent operations
• Log is used for recovery – the master replays it in the event of failures
• Master periodically checkpoints the log
• Checkpoint is a B-tree data structure
– Can be loaded into memory
– Used for namespace lookup without extra parsing
• Checkpointing can be done in the background
Data Consistency in GFS
• Loose data consistency – applications are designed for it
• Applications may see inconsistent data – data is different on different replicas
• Applications may see data from partially completed writes – undefined file region
• On successful modification the file region is consistent
• A write may leave the region undefined – if the client reads the file before another client’s write is complete
• Replicas are not guaranteed to be bytewise identical (we’ll see why later, and how clients deal with this)
Data Consistency in GFS (cont.)
• Failures:
– A modification may fail at one or more replicas
– On modification failure, the file region is inconsistent
• Successes:
– Modifications are applied to a chunk in the same order on all replicas
– After a number of successful modifications, the file region is guaranteed to be defined:
  • All replicas have the same data
  • All replicas contain all the data written by all the write operations
Implications of Loose Data Consistency For Applications
• Applications are designed to handle loose data consistency
• Example 1: a file is generated from beginning to end
– An application creates a file with a temporary name
– Atomically renames the file
– May periodically checkpoint the file while it is written
– File is written via appends – more resilient to failures than random writes
• Example 2: producer-consumer file
– Many writers concurrently append to one file (for merged results)
– Each record is self-validating (contains a checksum)
– Client filters out padding and duplicate records
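Example 2 can be made concrete with a reader that validates each record and drops junk and duplicates. The record layout below (length, CRC32, record id, payload) is an assumption chosen for illustration and is not the format Google applications use; `encode` and `read_valid_records` are hypothetical names.

```python
import struct
import zlib

# Assumed header layout: payload length, CRC32 of the payload, record id.
HEADER = struct.Struct(">IIQ")


def encode(record_id: int, payload: bytes) -> bytes:
    return HEADER.pack(len(payload), zlib.crc32(payload), record_id) + payload


def read_valid_records(data: bytes):
    """Scan a byte region, skipping padding/junk and dropping duplicate record ids."""
    seen, out, pos = set(), [], 0
    while pos + HEADER.size <= len(data):
        length, crc, record_id = HEADER.unpack_from(data, pos)
        payload = data[pos + HEADER.size: pos + HEADER.size + length]
        if length > 0 and len(payload) == length and zlib.crc32(payload) == crc:
            if record_id not in seen:        # a retried append shows up as a duplicate
                seen.add(record_id)
                out.append(payload)
            pos += HEADER.size + length
        else:
            pos += 1                         # not a valid record here; resynchronize
    return out
```

The checksum lets the reader tell valid records from padding, and the record id lets it discard the extra copies left behind by retried appends.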
Updates of Replicated Data
• Each mutation (modification) is performed at all the replicas
• Modifications are applied in the same order across all replicas
• Master grants a chunk lease to one replica – i.e., the primary
• The primary picks a serial order for all mutations to the chunk
• The client pushes data to all replicas
• The primary tells the replicas in which order they should apply modifications
Updates of Replicated Data (cont.)
1. Client asks master for replica locations
2. Master responds
3. Client pushes data to all replicas; replicas store it in a buffer cache
4. Client sends a write request to the primary (identifying the data that had been pushed)
5. Primary forwards request to the secondaries (identifies the order)
6. The secondaries respond to the primary
7. The primary responds to the client
Failure Handling During Updates
• If a write fails at the primary:
– The primary may report failure to the client – the client will retry
– If the primary does not respond, the client retries from Step 1 by contacting the master
• If a write succeeds at the primary, but fails at several replicas:
– The client retries several times (Steps 3-7)
Data Flow
• Data flow is decoupled from control flow
• Data is pushed linearly across all chunkservers in a pipelined fashion (not necessarily from client to primary and from primary to secondary)
• Client forwards data to the closest replica; that replica forwards to the next closest replica, etc.
• Pipelined fashion: while the data is incoming, the server begins forwarding it to the next replica
• This design ensures good network utilization
Atomic Record Appends
• Atomic append is a write – but GFS (the primary replica) chooses the offset where the append happens; returns the offset to the client
• This way GFS can decide on serial order of concurrent appends without client synchronization
• If an append fails at some replicas – the client retries
• As a result, the file may contain multiple copies of the same record, plus replicas may be bytewise different
• But after a successful update all replicas will be defined – they will all have the data written by the client at the same offset
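A toy model shows how a failed and retried append leaves replicas bytewise different yet agreeing at the returned offset. Everything here is a simplification: `Replica`, `record_append` and the `fail_next` hook are hypothetical, chunks are plain byte arrays, and chunk boundaries are ignored.

```python
class Replica:
    def __init__(self):
        self.data = bytearray()
        self.fail_next = False      # test hook: simulate one failed write

    def write_at(self, offset, record):
        if self.fail_next:
            self.fail_next = False
            return False
        if len(self.data) < offset:
            # Gap left by an earlier failed write: fill it with padding.
            self.data.extend(b"\x00" * (offset - len(self.data)))
        self.data[offset:offset + len(record)] = record
        return True


def record_append(primary, secondaries, record, max_retries=3):
    """Return the offset at which the record is present on every replica."""
    for _ in range(max_retries):
        offset = len(primary.data)      # the primary, not the client, picks the offset
        results = [r.write_at(offset, record) for r in [primary] + secondaries]
        if all(results):
            return offset
    raise IOError("append failed on some replica after retries")
```

If one secondary misses the first attempt, the retry lands at a later offset: the primary ends up with two copies of the record, the failed secondary with padding followed by one copy, and all replicas hold the record at the offset returned to the client.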
Non-Identical Replicas
• Because of failed and retried record appends, replicas may be non-identical bytewise
• Some replicas may have duplicate records (because of failed and retried appends)
• Some replicas may have padded file space (empty space filled with junk) – if the primary chooses a record offset higher than the first available offset at a replica
• Clients must deal with it: they write self-identifying records so they can distinguish valid data from junk
• If clients cannot tolerate duplicates, they must insert version numbers in records
• GFS pushes complexity to the client; without this, complex failure recovery scheme would need to be in place
Snapshot
• Copy of a file or a directory tree – used by applications for fast copies of data sets and for checkpointing
• Steps involved to snapshot directory A:
1. Master revokes leases on directory A
2. Logs the operation to disk, copies metadata for A to A’ in its memory: both A and A’ point to the same files on disk
3. When a client wants to write to chunk C in A, master defers replying to the client; creates a new chunk handle C’
4. Master asks each chunkserver that has replica C to create a copy in chunk C’ – this ensures that copies are created locally, not over the network
5. All new modifications go to chunk C’
Namespace Management and Locking
• Each file or directory has an associated read/write lock
• Each operation on the master acquires a set of read/write locks before it runs
• Read locks are acquired on all files/directories that are being accessed, i.e., each intermediate directory in /d1/d2/…/dn
• Write locks are acquired on:
– Snapshots (to prevent creation of new files in a directory during the snapshot)
– File names – when that file is created
• No write lock on the directory is needed on file creation – no directory inode to modify; multiple file creations can be done concurrently
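The locking rule can be illustrated by computing which names an operation locks and checking two operations for conflicts. The helpers `locks_for` and `conflicts` are hypothetical and model only the lock sets, not the actual locking or lock ordering.

```python
def locks_for(path: str, write_leaf: bool):
    """Return (read_locked, write_locked) name sets for an operation on `path`."""
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    ancestors, leaf = prefixes[:-1], prefixes[-1]
    if write_leaf:
        # Read locks on every intermediate directory, write lock on the full name.
        return set(ancestors), {leaf}
    return set(prefixes), set()


def conflicts(op_a, op_b):
    """Two operations conflict if one write-locks a name the other locks at all."""
    (ra, wa), (rb, wb) = op_a, op_b
    return bool(wa & (rb | wb)) or bool(wb & (ra | wa))
```

Two file creations in the same directory take only read locks on the directory, so they do not conflict; a snapshot of that directory takes a write lock on it and therefore excludes both.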
Garbage Collection
• File deletion is not done immediately – space from deleted files is garbage collected lazily
• When a file is deleted – the master logs the operation and renames it to a hidden name
• During regular metadata scan the master deletes that file’s metadata (after at least three days)
• During regular scan of chunk namespace, the master identifies orphaned chunks and deletes their metadata
• Master tells chunk replicas to delete orphaned chunks
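The lazy deletion scheme can be sketched with an in-memory namespace. `MasterNamespace`, the `.deleted.` prefix for hidden names, and passing the current time explicitly are all assumptions made for this illustration; only the three-day grace period comes from the slide.

```python
GRACE_SECONDS = 3 * 24 * 3600     # the "at least three days" from the slide


class MasterNamespace:
    def __init__(self):
        self.files = {}       # file name -> list of chunk handles
        self.hidden = {}      # hidden name -> (deletion time, chunk handles)
        self.chunks = set()   # all chunk handles the master knows about

    def create(self, name, handles):
        self.files[name] = list(handles)
        self.chunks.update(handles)

    def delete(self, name, now):
        # Deletion only renames: the data stays recoverable until a scan reclaims it.
        self.hidden[".deleted." + name] = (now, self.files.pop(name))

    def scan(self, now):
        """Background scan: drop expired hidden files, then return orphaned chunks."""
        for hidden_name, (deleted_at, _) in list(self.hidden.items()):
            if now - deleted_at >= GRACE_SECONDS:
                del self.hidden[hidden_name]
        live = {h for hs in self.files.values() for h in hs}
        live |= {h for _, hs in self.hidden.values() for h in hs}
        orphans = self.chunks - live
        self.chunks -= orphans        # forget orphaned chunk metadata
        return orphans                # chunkservers would be told to delete these
```

A scan shortly after a deletion reclaims nothing; a scan after the grace period returns the deleted file's chunks as orphans.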
Load Balancing
• Goals:
– Maximize data availability and reliability
– Maximize network bandwidth utilization
• Google infrastructure:
– Cluster consists of hundreds of racks
– Each rack has a dozen machines
– Racks are connected by network switches
– A rack is on a single power circuit
• Must balance load across machines and across racks
![Page 55: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/55.jpg)
Creation, Re-replication, Rebalancing
• Creation (initial replica placement):
  – On chunkservers with low disk space utilization
  – Limit the number of recent creations on each chunkserver – recent creations mean heavy write traffic
  – Spread replicas across racks
• Re-replication
  – When the number of replicas falls below the replication target
  – When a chunkserver becomes unavailable
  – When a replica becomes corrupted
  – A new replica is copied directly from an existing one
• Re-balancing
  – Master periodically examines replica distribution and moves replicas to meet load-balancing criteria
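The three creation criteria can be combined in a simple greedy placement. This is a sketch under assumptions: `Chunkserver`, `place_replicas`, and the `max_recent` threshold are hypothetical, and the real master weighs these factors in ways the slides do not describe.

```python
from dataclasses import dataclass

@dataclass
class Chunkserver:
    name: str
    rack: str
    disk_used: float       # fraction of disk in use, 0.0 to 1.0
    recent_creations: int  # creations in some recent window

def place_replicas(servers, n=3, max_recent=5):
    # Skip servers that already absorbed many recent creations (heavy write traffic).
    eligible = [s for s in servers if s.recent_creations < max_recent]
    eligible.sort(key=lambda s: s.disk_used)  # prefer the emptiest disks
    chosen, racks = [], set()
    # First pass: the emptiest eligible server from each distinct rack.
    for s in eligible:
        if len(chosen) < n and s.rack not in racks:
            chosen.append(s)
            racks.add(s.rack)
    # Second pass: if there are fewer racks than replicas, reuse racks.
    for s in eligible:
        if len(chosen) < n and s not in chosen:
            chosen.append(s)
    return chosen
```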
![Page 56: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/56.jpg)
Fault Tolerance
• Fast recovery
  – No distinction between normal and abnormal shutdown
  – Servers are routinely restarted by “killing” a server process
  – Servers are designed for fast recovery – all state can be recovered from the log
• Chunk replication
• Master replication
• Data integrity
• Diagnostic tools
![Page 57: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/57.jpg)
Chunk Replication
• Each chunk is replicated on multiple chunkservers on different racks
• Users can specify different replication levels for different parts of the file namespace (default is 3)
• The master clones existing replicas as needed to keep each chunk fully replicated
![Page 58: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/58.jpg)
Single Master
• Simplifies design
• Master can make sophisticated load-balancing decisions involving chunk placement using global knowledge
• To prevent master from becoming the bottleneck
  – Clients communicate with master only for metadata
  – Master keeps metadata in memory
  – Clients cache metadata
  – File data is transferred from chunkservers
![Page 59: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/59.jpg)
Master Replication
• Master state is replicated on multiple machines, so a new server can become master if the old master fails
• What is replicated: operation logs and checkpoints
• A modification is considered successful only after it has been logged on all master replicas
• A single master is in charge; if it fails, it restarts almost instantaneously
• If the master’s machine fails and the master cannot restart itself, a failure detector outside GFS starts a new master with a replicated operation log (no master election)
• Master replicas are the master’s shadows – they operate similarly to the master w.r.t. updating the log, maintaining the in-memory metadata, and polling the chunkservers
![Page 60: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/60.jpg)
Data Integrity
• Disks often fail – may cause data corruption
• Detect corrupt replicas by comparing with other chunk servers?
  – Not a good idea – divergent replicas may be legal
• Each chunkserver verifies its own replicas using checksums
• Checksums are kept in memory and stored persistently in the log
• Small effect on read performance – checksums are kept in memory, checksum computation can be overlapped with I/O
• Write performance: checksum computation optimized for appends
• Checksum can be computed incrementally for a checksum block (64KB)
• If corruption is detected, the master creates new replicas using data from correct chunks
• During idle periods chunkservers scan inactive chunks for corruption
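Per-block checksums with incremental updates on append can be sketched as below. The slides do not name the checksum function, so CRC32 from Python's `zlib` is an assumption here, and `block_checksums`, `append`, and `verify` are hypothetical helpers. The sketch relies on `zlib.crc32(data, start)` continuing a running checksum.

```python
import zlib

BLOCK = 64 * 1024  # checksum block size from the slide

def block_checksums(data: bytes):
    """Checksum every 64KB block from scratch."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def append(data: bytes, sums: list, new: bytes):
    """Append `new` and update checksums without rereading old data."""
    sums = list(sums)
    offset = len(data)
    while new:
        used = offset % BLOCK
        piece, new = new[:BLOCK - used], new[BLOCK - used:]
        if used:
            # Last block is partially filled: extend its running checksum.
            sums[-1] = zlib.crc32(piece, sums[-1])
        else:
            sums.append(zlib.crc32(piece))
        data += piece
        offset += len(piece)
    return data, sums

def verify(data: bytes, sums: list) -> bool:
    """What a chunkserver would check before returning data to a reader."""
    return block_checksums(data) == sums
```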
![Page 61: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/61.jpg)
Detecting Stale Replicas
• A replica may become stale if it misses a modification while its chunkserver is down
• Each chunk has a version number; version numbers are used to detect stale replicas
• The master never gives a stale replica to a client as a chunk location, and a stale replica never participates in a mutation
• A client may still read from a stale replica (because the client caches metadata)
  – But this window is limited, because cache entries time out
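Version-based staleness detection can be sketched like this. The slides do not say when the version is bumped; the sketch assumes once per mutation round, and `ChunkRecord` is a hypothetical name for the master's per-chunk bookkeeping.

```python
class ChunkRecord:
    def __init__(self, replicas):
        self.version = 1
        self.replica_version = {r: 1 for r in replicas}

    def mutate(self, reachable):
        """Start a mutation round: bump the version on the master and on
        every replica that is currently reachable."""
        self.version += 1
        for r in reachable:
            self.replica_version[r] = self.version

    def locations(self):
        """Replicas the master may hand to clients: only up-to-date ones."""
        return sorted(r for r, v in self.replica_version.items() if v == self.version)

    def stale(self):
        return sorted(r for r, v in self.replica_version.items() if v < self.version)
```

A chunkserver that was down during `mutate` keeps its old version number, so it shows up in `stale()` when it reports back.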
![Page 62: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/62.jpg)
Diagnostic Tools
• GFS servers perform diagnostic logging
• Helps debugging and performance analysis
• Diagnostic logs record:
  – Chunk servers going up and down
  – All RPC requests and replies
• RPC requests and responses from different machine logs can be collated and analyzed to determine exact interaction between machines
• Logs are also used for load testing and performance analysis
![Page 63: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/63.jpg)
GFS Summary
• Real replicated file system
• Uses commodity hardware – hundreds of commodity PCs and disks
• Two levels of replication:
  – Metadata is replicated via replicated masters
  – Data is replicated on replicated chunkservers
• Designed for specific use case – for Google applications
  – And applications are designed for GFS
• This is why it is simple and it actually works
![Page 64: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/64.jpg)
GFS Summary (cont.)
• Design philosophy:
  – A replicated FS can’t do all things right and all things well:
    – Strong data consistency?
    – Identical replicas?
    – Fast concurrent operations?
  – That’s too hard…
  – So make several operations fast, make them common case
• Common case operations – atomic appends
• Clients deal with weak consistency
  – Write self-identifying records
  – Deal with duplicate records and padding
• Something to learn: if generic design is hard, design for your specific use case!
![Page 65: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/65.jpg)
Outline
• Harp – A replicated research file system
• Google File System – A real replicated file system
• Amazon Distributed Data Store – A distributed database
![Page 66: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/66.jpg)
Problem Solving
• Design GFS over Dynamo
  – A system layer that presents GFS interface to Dynamo key-value store
  – Present your design
  – How would you write a GFS-over-Dynamo application? Would you need to change it?
• Dynamo over GFS
  – As above
• Discussion
  – Is this a good idea?
  – What system properties make this a fundamentally good/bad idea?
![Page 67: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/67.jpg)
Dynamo: Amazon’s Key-Value Store
• A distributed database
• Contains data about:
  – Customer shopping carts
  – Customer sessions
  – Amazon search engine
• Highly replicated
  – Across data centers
  – Across continents
  – “A customer must be able to update a shopping cart even if the world is being destroyed by a tornado”.
![Page 68: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/68.jpg)
Dynamo: A Database?
• This is basically a database
• But not your conventional database
• Conventional (relational) database:
  – Data organized in tables
  – Primary and secondary keys
  – Tables sorted by primary/secondary keys
  – Designed to answer any imaginable query
  – Does not scale to thousands of nodes
  – Difficult to replicate
• Amazon’s Dynamo
  – Access by primary key only
![Page 69: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/69.jpg)
ACID Properties
• Atomicity – yes
  – Updates are atomic by definition
  – There are no transactions
• Consistency – no
  – Data is eventually consistent
  – Loose consistency is tolerated
  – Reconciliation is performed by the client
• Isolation
  – No isolation – one update at a time
• Durability – yes
  – Durability is provided via replication
![Page 70: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/70.jpg)
High Availability
• Good service time is key for Amazon
• Not good when a credit card transaction times out
• Service-level agreement: the client’s request must be answered within 300 ms
• Must provide this service for 99.9% of transactions at a load of 500 requests/second
![Page 71: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/71.jpg)
The Cost of Respecting the SLA
• Loose consistency
• Synchronous replica reconciliation during the request cannot be done
• We contact a few replicas; if some do not reply, the request is considered failed
• When to resolve conflicting updates? During reads or during writes?
• Usually resolved during writes
• Dynamo resolves them during reads
• Motivation: must have an always-writable data store (can’t lose customer shopping cart data)
![Page 72: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/72.jpg)
System Interface
• get (key)
  – Locate object replicas
  – Return:
    • A single object
    • A list of objects with conflicting versions
    • Context (opaque information about object versioning)
• put (key, value, context)
  – Determines where the replicas should be placed
  – Writes them to disk
![Page 73: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/73.jpg)
Key System Architecture Components
• Partitioning
• Replication
• Versioning
• Membership
• Failure Handling
• Scaling
![Page 74: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/74.jpg)
Partitioning
• How to partition data among nodes?
• Use consistent hashing
• Output of the hash maps to a circular space
• The largest hash value wraps to the smallest hash value
• Each node is assigned a random value in the space
• This represents its position in the ring
![Page 75: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/75.jpg)
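The clockwise walk can be sketched with a sorted list and binary search. This is a minimal sketch under assumptions: node positions are derived by hashing node names rather than chosen at random, MD5 is used only as a convenient hash, and `ring_hash` and `Ring` are hypothetical names.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string to a position in the circular hash space."""
    return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")

class Ring:
    def __init__(self, nodes):
        # Sorted (position, node) pairs represent the ring.
        self.points = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [p for p, _ in self.points]

    def node_for(self, key: str) -> str:
        # First node whose position is greater than the key's position;
        # the modulo wraps from the largest position back to the smallest.
        i = bisect.bisect_right(self.positions, ring_hash(key))
        return self.points[i % len(self.points)][1]
```

The useful property is that removing a node only remaps the keys that node owned; every other key keeps its node.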
![Page 76: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/76.jpg)
Problems With Consistent Hashing
• May lead to unbalanced data and load distribution
• Solution:
  – Each position on the ring is a virtual node
  – Assign multiple virtual nodes to one physical node
![Page 77: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/77.jpg)
Replication
• Each key has a coordinator (the node determined by the hash)
• The coordinator replicates the key at N-1 other nodes, for N replicas in total
• These are the nodes that follow the coordinator in the ring in the clockwise direction
• Virtual nodes that belong to an already chosen physical node are skipped to ensure that replicas are located on different physical nodes
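Virtual nodes and the replica list can be sketched together. This is an illustration under assumptions: `build_ring` and `preference_list` are hypothetical helpers, virtual node positions come from hashing `"name#i"` instead of random tokens, and the list holds N physical nodes in total with the coordinator first.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")

def build_ring(physical_nodes, vnodes_per_node=8):
    """Each physical node owns several virtual positions on the ring."""
    return sorted((ring_hash(f"{n}#{i}"), n)
                  for n in physical_nodes for i in range(vnodes_per_node))

def preference_list(ring, key, n):
    """Walk clockwise from the key and collect n distinct physical nodes."""
    positions = [p for p, _ in ring]
    start = bisect.bisect_right(positions, ring_hash(key))
    chosen = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in chosen:  # skip virtual nodes of an already chosen machine
            chosen.append(node)
            if len(chosen) == n:
                break
    return chosen
```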
![Page 78: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/78.jpg)
Versioning
• Dynamo stores multiple versions of each data item
• Each update creates a new immutable version of the data item
• Versions are reconciled
  – By the system
  – By the client
• Versioning is achieved using vector clocks
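A vector clock can be sketched as a dictionary from node name to counter. The helper names `increment`, `descends`, and `reconcile` are hypothetical; the sketch shows the system-side reconciliation (dropping versions that another version supersedes) and leaves truly concurrent versions for the client.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a new clock with `node`'s counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def descends(a: dict, b: dict) -> bool:
    """True if clock `a` has seen every update recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(versions):
    """Keep only versions not superseded by another version.
    `versions` is a list of (value, clock) pairs."""
    return [(value, clock) for value, clock in versions
            if not any(other != clock and descends(other, clock)
                       for _, other in versions)]
```

For example, if node Sx writes twice and then Sy and Sz each update the second version independently, the two resulting versions are concurrent and both survive `reconcile`.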
![Page 79: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/79.jpg)
Routing of Requests
• Through a generic load balancer
  – May forward the request to a node that is NOT the coordinator
  – The recipient node will forward the request to the coordinator
• Through a partition-aware client library that directly selects a coordinator
![Page 80: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/80.jpg)
Maintaining Consistency Via Quorum
• Dynamo is configured with two parameters: R and W
• R is the minimum number of nodes that must participate in a successful read operation
• W is the minimum number of nodes that must participate in a successful write operation
• Request handling protocol (for writes):
  – Coordinator receives request
  – Coordinator computes vector clock and writes new version to disk
  – Coordinator sends the new version and vector clock to the N replicas
  – If at least W-1 respond, the request is successful
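The write protocol above can be sketched as follows. This is a simplified illustration: `Replica` and `coordinate_write` are hypothetical, replicas are contacted one after another rather than in parallel, and an unavailable replica is modeled by raising `ConnectionError`.

```python
class Replica:
    def __init__(self, name, up=True):
        self.name, self.up, self.store = name, up, {}

    def put(self, key, value, clock):
        if not self.up:
            raise ConnectionError(self.name)
        self.store[key] = (value, clock)

def coordinate_write(coordinator, others, w, key, value, clock):
    """Return True if the write reaches the coordinator plus W-1 other replicas."""
    coordinator.put(key, value, clock)  # the local write counts toward W
    acks = 0
    for replica in others:
        try:
            replica.put(key, value, clock)
            acks += 1
        except ConnectionError:
            pass  # an unreachable replica simply does not acknowledge
    return acks >= w - 1
```

With N = 3 and one replica down, a write succeeds for W = 2 but fails for W = 3, which is the availability cost of a larger W.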
![Page 81: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/81.jpg)
Sloppy Quorum
• What if some of the N replicas are temporarily unavailable?
• This could limit system’s availability
• Cannot use strict quorum
• Use sloppy quorum
• If one of N replicas is unavailable, use another node that is not a replica
• That node will temporarily store the data
• Will forward it to the real replica when the replica is back up
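The temporary storage and later forwarding can be sketched as below. `Node`, `sloppy_write`, and `hand_off` are hypothetical names, and the way a stand-in node is picked (first available fallback) is an assumption made for the sketch.

```python
class Node:
    def __init__(self, name, up=True):
        self.name, self.up = name, up
        self.store = {}
        self.hints = []  # (intended replica name, key, value) held for others

def sloppy_write(preferred, fallbacks, key, value):
    """Write to each preferred replica, or park the data on a stand-in node."""
    spare = iter([n for n in fallbacks if n.up])
    written = 0
    for node in preferred:
        if node.up:
            node.store[key] = value
            written += 1
        else:
            stand_in = next(spare, None)
            if stand_in is not None:
                stand_in.hints.append((node.name, key, value))
                written += 1
    return written

def hand_off(stand_in, nodes_by_name):
    """Forward parked data to its real replica once that replica is back up."""
    remaining = []
    for owner, key, value in stand_in.hints:
        target = nodes_by_name[owner]
        if target.up:
            target.store[key] = value
        else:
            remaining.append((owner, key, value))
    stand_in.hints = remaining
```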
![Page 82: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/82.jpg)
Replica Synchronization
• Use Merkle trees
• Leaves are hashes of the values of individual keys
• Can compare trees incrementally, without transferring the whole tree
• If a part of the tree is not modified, the parent nodes’ hashes will be identical
• So parts of the tree can be compared without sending data between two replicas
• Only keys that are out of sync are transferred
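The incremental comparison can be sketched with a small fixed-shape tree. This is a sketch under assumptions: keys are grouped into a fixed number of hash buckets so both replicas build trees of the same shape, SHA-256 is an arbitrary choice, and `digest`, `bucket_of`, `build_tree`, and `differing_buckets` are hypothetical names.

```python
import hashlib

def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def bucket_of(key: str, buckets: int) -> int:
    return int.from_bytes(digest(key.encode())[:4], "big") % buckets

def build_tree(store: dict, buckets: int = 8):
    """Return the tree as a list of levels, leaves first (buckets: power of two)."""
    groups = [[] for _ in range(buckets)]
    for key in sorted(store):
        groups[bucket_of(key, buckets)].append(key)
    level = [digest(b"".join(digest(k.encode()) + digest(store[k].encode()) for k in g))
             for g in groups]
    levels = [level]
    while len(level) > 1:
        level = [digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def differing_buckets(a, b, depth=None, index=0):
    """Descend only into subtrees whose hashes differ; return leaf bucket indexes."""
    if depth is None:
        depth = len(a) - 1  # start at the root
    if a[depth][index] == b[depth][index]:
        return []  # identical subtree: nothing below needs comparing
    if depth == 0:
        return [index]
    return (differing_buckets(a, b, depth - 1, 2 * index)
            + differing_buckets(a, b, depth - 1, 2 * index + 1))
```

Only the keys in the returned buckets would need to be exchanged between the two replicas.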
![Page 83: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/83.jpg)
Membership
• Membership is always explicit
• Nodes are added/removed by the operator
• So there is no need for “coordinator election”
• If a node is unavailable, this is considered temporary
• A node that starts up chooses a set of tokens (virtual nodes) and maps virtual nodes to physical nodes
• Membership information is eventually propagated via gossip protocol
• Mapping is persisted on disk
![Page 84: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/84.jpg)
Preventing Logical Partitions
• A new node may be unaware of other nodes before memberships are propagated
• If several such nodes are added simultaneously, we may have a logical partition
• Partitions are prevented using seed nodes
• Seed nodes are obtained from a static source, and they are known to everyone
• Memberships are propagated to everyone via seed nodes
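Gossip-based propagation with a seed node can be simulated in a few lines. This is a toy model, not Dynamo's protocol: each node's view is just a set of names, `gossip_round` and `rounds_to_converge` are hypothetical helpers, and peers are picked with Python's `random` module.

```python
import random

def gossip_round(views, rng):
    """Each node contacts one random peer it knows; both merge their views."""
    for node in sorted(views):
        peers = sorted(views[node] - {node})
        if peers:
            peer = rng.choice(peers)
            merged = views[node] | views[peer]
            views[node], views[peer] = set(merged), set(merged)

def rounds_to_converge(views, rng, limit=100):
    """Return the round in which every node knows every other node, or None."""
    everyone = set(views)
    for r in range(1, limit + 1):
        gossip_round(views, rng)
        if all(view == everyone for view in views.values()):
            return r
    return None
```

When every new node starts out knowing the seed, views converge after a few rounds; nodes that know nobody never exchange anything, which is the logical partition the seed is there to prevent.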
![Page 85: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/85.jpg)
Failure Detection
• Failure discovery is local
• Node A discovers that Node B has failed if Node B does not respond
• Failures (like memberships) are propagated via gossip protocol
![Page 86: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/86.jpg)
Problem Solving (I)
• Design GFS over Dynamo
  – A system layer that presents GFS interface to Dynamo key-value store
  – Present your design
  – How would you write a GFS-over-Dynamo application? Would you need to change it?
• Dynamo over GFS
  – As above
• Discussion
  – Is this a good idea?
  – What system properties make this a fundamentally good/bad idea?
![Page 87: Lecture XIII: Replication-II](https://reader035.fdocuments.in/reader035/viewer/2022070501/56816979550346895de16ef1/html5/thumbnails/87.jpg)
Problem Solving (II)
• Can you name similarities between GFS and Dynamo?
• Can you name the differences?
• Play in teams!