Google File System Log-Structured Merge Trees Marco Serafini COMPSCI 590S Lecture 9

Google File System
Log-Structured Merge Trees

Marco Serafini

COMPSCI 590S
Lecture 9

Peculiar Requirements

• Huge files
  • Files can span multiple servers
  • Coarse-granularity blocks to keep metadata manageable
• Failures
  • Many servers → many failures
• Workload
  • Append-only writes, reads mostly sequential

Design Choices

• Optimized for bandwidth, not latency
• Weak consistency
  • Supports multiple concurrent appends to a file
  • Best-effort attempt to guarantee atomicity of each append
  • Minimal attempts to “fix” state after failures
  • No locks
• How to deal with weak consistency
  • Application-level mechanisms to deal with inconsistent data
• Clients cache only metadata

Implementation

• Distributed layer on top of Linux servers
• Use local Linux file system to actually store data

Master-Slave Architecture

• Master
  • Keeps file chunk metadata (e.g. mapping to chunkservers)
  • Failure detection of chunkservers
• Procedure
  • Client contacts master to get metadata (small size)
  • Client contacts chunkserver(s) to get data (large size)
  • Master is not in the “critical path” and is thus not overloaded
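The procedure above can be sketched as a small in-memory model. This is only an illustration, not the GFS interface: `Master`, `ChunkServer`, `gfs_read`, and the tiny `CHUNK_SIZE` are hypothetical names and values chosen to keep the example short.

```python
CHUNK_SIZE = 4  # bytes; hypothetical and tiny so the example is easy to follow


class ChunkServer:
    """Stores chunk contents, keyed by chunk handle."""

    def __init__(self):
        self.chunks = {}

    def read(self, handle, start, length):
        return self.chunks[handle][start:start + length]


class Master:
    """Holds only metadata: (file, chunk index) -> (handle, replica list)."""

    def __init__(self):
        self.chunk_table = {}
        self.lookups = 0

    def lookup(self, filename, chunk_index):
        self.lookups += 1
        return self.chunk_table[(filename, chunk_index)]


def gfs_read(master, filename, offset, length, cache):
    """Read bytes at `offset`, asking the master only on metadata cache misses."""
    out = b""
    while length > 0:
        index, start = divmod(offset, CHUNK_SIZE)
        key = (filename, index)
        if key not in cache:                       # small metadata request
            cache[key] = master.lookup(filename, index)
        handle, replicas = cache[key]
        n = min(length, CHUNK_SIZE - start)
        out += replicas[0].read(handle, start, n)  # large data request
        offset += n
        length -= n
    return out


server = ChunkServer()
server.chunks = {"h0": b"abcd", "h1": b"efgh"}
master = Master()
master.chunk_table = {("f", 0): ("h0", [server]), ("f", 1): ("h1", [server])}
cache = {}
data = gfs_read(master, "f", 2, 4, cache)
again = gfs_read(master, "f", 0, 8, cache)   # served from cached metadata
print(data, again, master.lookups)
```

The second read never touches the master: data always flows between client and chunkserver, and cached metadata keeps the master off the critical path.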

Advantages of Large Chunks

• Small metadata
  • All metadata fits in memory at the master → no bottleneck
  • Clients cache lots of metadata → low load on master
• Batching when transferring data
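A back-of-the-envelope calculation shows why chunk size matters for the master. The slides give no numbers; the chunk size (64 MB) and the bound of 64 bytes of metadata per chunk are the figures reported in the GFS paper, while the 1 PiB of stored data and the 4 KiB comparison block are assumptions made up for this example.

```python
CHUNK_SIZE = 64 * 2**20      # 64 MiB chunk (figure from the GFS paper)
METADATA_PER_CHUNK = 64      # bytes per chunk, upper bound (GFS paper)
SMALL_BLOCK = 4 * 2**10      # 4 KiB block, assumed here for comparison


def master_metadata_bytes(stored_bytes, block_size):
    """Upper bound on metadata for `stored_bytes` split into fixed-size blocks."""
    blocks = -(-stored_bytes // block_size)   # ceiling division
    return blocks * METADATA_PER_CHUNK


stored = 2**50               # 1 PiB of file data (hypothetical)
large = master_metadata_bytes(stored, CHUNK_SIZE)
small = master_metadata_bytes(stored, SMALL_BLOCK)
print(large // 2**30, "GiB of metadata with 64 MiB chunks")
print(small // 2**40, "TiB of metadata with 4 KiB blocks")
```

Under these assumptions the metadata for a petabyte fits in the memory of a single machine with large chunks, but not with small blocks.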

Master Metadata

• Persisted data
  • File and chunk namespaces
  • File-to-chunks mapping
  • Operation log (Write-Ahead Log)
  • Stored externally for fault tolerance
• Q: Why not simply restart master from scratch?
  • This is what MapReduce does, after all
• Non-persisted data: location of chunks
  • Fetched at startup from chunkservers
  • Updated periodically

Operation Log

• Persists state
• Memory-mapped file: use only offsets as pointers
• Log is a WAL (we will discuss it)
• Trimmed using checkpoints
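The interplay between a write-ahead log and checkpoints can be sketched as follows. This is a minimal sketch, not the GFS master's actual format: the class `LoggedState`, the file names, and the JSON encoding are assumptions, and a plain file replaces the memory-mapped file mentioned on the slide.

```python
import json
import os
import tempfile


class LoggedState:
    """Key-value state made durable with a write-ahead log plus checkpoints."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "oplog")
        self.ckpt_path = os.path.join(directory, "checkpoint")
        self.state = {}
        self._recover()

    def _recover(self):
        # Load the latest checkpoint, then replay the log records written after it.
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path) as f:
                self.state = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.state[key] = value

    def put(self, key, value):
        # Write ahead: the record reaches disk before the in-memory update.
        with open(self.log_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.state[key] = value

    def checkpoint(self):
        # Persist the whole state atomically, then trim the log.
        # Replaying an untrimmed log after a crash is harmless: puts are idempotent.
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.ckpt_path)
        open(self.log_path, "w").close()


directory = tempfile.mkdtemp()
master_state = LoggedState(directory)
master_state.put("/logs/a", ["chunk-1"])
master_state.checkpoint()
master_state.put("/logs/b", ["chunk-2"])
recovered = LoggedState(directory)   # simulates a restart after a crash
print(recovered.state)
```

Recovery cost is bounded by the amount of log written since the last checkpoint, which is why the log is trimmed.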

Chunkserver Replication

• Mutations are sent to all replicas
  • One replica is primary for a lease
  • Within that lease, it totally orders mutations and sends the order to backups
  • After old lease expires, master assigns new primary
• Separation of data and control flow
  • Data dissemination to all replicas (data flow)
  • Ordering through primary (control flow)
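The lease rule above can be sketched with explicit timestamps. `LeaseManager` and `primary_for` are hypothetical names; the 60-second duration is the initial lease timeout reported in the GFS paper, and picking the first live replica is an arbitrary choice made for this sketch.

```python
LEASE_SECONDS = 60   # initial lease timeout reported in the GFS paper


class LeaseManager:
    """Master-side bookkeeping: at most one primary per chunk at any time."""

    def __init__(self, duration=LEASE_SECONDS):
        self.duration = duration
        self.leases = {}          # chunk -> (primary, expiry time)

    def primary_for(self, chunk, live_replicas, now):
        primary, expiry = self.leases.get(chunk, (None, 0))
        if primary is not None and now < expiry:
            return primary        # the old lease must expire before reassignment
        primary = live_replicas[0]
        self.leases[chunk] = (primary, now + self.duration)
        return primary


lease_master = LeaseManager()
first = lease_master.primary_for("chunk-1", ["A", "B"], now=0)
during = lease_master.primary_for("chunk-1", ["B"], now=30)   # A failed, lease still valid
after = lease_master.primary_for("chunk-1", ["B"], now=60)    # lease expired
print(first, during, after)
```

Waiting for expiry, even when the primary looks dead, is what prevents two replicas from ordering mutations for the same chunk at once.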

Replication Protocol

1. Client disseminates data to chunkservers
2. Client contacts primary replica for ordering
3. Primary determines order (offset of write)
   • Also persists order to disk for recovery
4. Primary sends offset to backups
5. Backups apply write and ack back to primary
6. Primary acks to client
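The six steps can be traced in a failure-free, single-process sketch. The classes `Replica` and `Primary` and their methods are hypothetical, network messages are modeled as method calls, and an in-memory list stands in for the on-disk order record.

```python
class Replica:
    def __init__(self):
        self.buffer = {}          # data_id -> bytes pushed but not yet ordered
        self.chunk = bytearray()

    def push(self, data_id, data):            # step 1: data flow
        self.buffer[data_id] = data

    def apply(self, data_id, offset):         # step 5: write at the given offset
        data = self.buffer.pop(data_id)
        if len(self.chunk) < offset:
            self.chunk.extend(b"\0" * (offset - len(self.chunk)))
        self.chunk[offset:offset + len(data)] = data
        return True                           # ack


class Primary(Replica):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups
        self.next_offset = 0
        self.order_log = []       # stands in for the persisted order

    def append(self, data_id):                # steps 2-6: control flow
        offset = self.next_offset             # step 3: choose the offset
        self.next_offset += len(self.buffer[data_id])
        self.order_log.append((data_id, offset))
        self.apply(data_id, offset)
        acks = [b.apply(data_id, offset) for b in self.backups]  # steps 4-5
        return offset if all(acks) else None  # step 6: ack to client


backups = [Replica(), Replica()]
primary = Primary(backups)
for data_id, data in [("w1", b"hello"), ("w2", b"gfs")]:
    for replica in [primary] + backups:
        replica.push(data_id, data)
offsets = [primary.append("w1"), primary.append("w2")]
print(offsets, bytes(backups[0].chunk))
```

Because the primary alone picks offsets, every replica that applies a write places it at the same position.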

Weak Consistency

• In the presence of failures
  • There can be inconsistencies (e.g. failed backup)
  • Client simply retries the write → duplicate data
• A successful write (acknowledged back to the client) is
  • Atomic: all data written (but may later be partially overwritten)
  • Consistent: same offset at all replicas
    • This is because the primary proposes a specific offset
• Result: file contains stretches of “good” data interspersed with inconsistent and/or duplicated data

Implications for Applications

• Applications must deal with inconsistency
  • Atomic file renaming after finishing writing to a file (single writer)
  • Add checksums to data to detect incomplete writes
  • Add unique record ids to detect duplication
• More difficult to program!
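Checksums and unique record ids can be combined in a self-describing record format. The layout below (8-byte id, 4-byte length, 4-byte CRC32, then the payload) and the byte-by-byte resynchronization are assumptions made for this sketch, not a format prescribed by GFS.

```python
import struct
import zlib

HEADER = struct.Struct(">QII")   # record id, payload length, CRC32 of payload


def encode(record_id, payload):
    return HEADER.pack(record_id, len(payload), zlib.crc32(payload)) + payload


def read_valid_records(data):
    """Yield each intact payload once, skipping garbage and duplicates."""
    seen = set()
    pos = 0
    while pos + HEADER.size <= len(data):
        record_id, length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            if record_id not in seen:          # drop duplicates from retries
                seen.add(record_id)
                yield payload
            pos = start + length
        else:
            pos += 1   # no valid record here: resynchronize byte by byte


rec1, rec2 = encode(1, b"alpha"), encode(2, b"beta")
# A partial write of rec2 followed by retried appends of both records.
file_bytes = rec1 + rec2[:7] + rec1 + rec2
payloads = list(read_valid_records(file_bytes))
print(payloads)
```

The reader, not the file system, filters out incomplete and duplicated records, which is the extra programming burden the slide refers to.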

Log-Structured Merge Trees

LSMT Data Structures

• Memtable
  • Binary tree or skiplist → sorted
  • Receives writes and serves reads
  • Persistence through a Write-Ahead Log
• Log files (runs)
  • L0: dump of memtable
  • Li: merge of multiple Li-1 runs
• Goal: make disk accesses sequential!

Operations

• Writes go to memtable
• Reads
  • Search memtables and read caches (if available)
  • Search log files in reverse chronological order
  • Use Bloom filters and indices in log files
• Periodically dump memtable to L0
• Periodically merge from Li-1 to Li
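The write and read paths above fit in a few lines. `TinyLSM` is a hypothetical toy: a dict that is sorted at flush time stands in for the tree or skiplist, runs are in-memory lists instead of files, and the write-ahead log, Bloom filters, and merging are left out.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.runs = []            # L0 runs, oldest first; each is a sorted list

    def put(self, key, value):
        self.memtable[key] = value            # writes only touch the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Dump the memtable as one sorted, immutable run (a sequential write).
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):       # newest run first
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


db = TinyLSM()
db.put("a", 1)
db.put("b", 2)      # memtable full: flushed as the first run
db.put("a", 3)
db.put("c", 4)      # flushed as the second run
db.put("b", 5)      # stays in the memtable
print(db.get("a"), db.get("b"), db.get("c"), db.get("z"))
```

Searching in reverse chronological order is what makes the newest value for a key win without ever updating an old run in place.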

Optimizing Reads

• Binary search in each run
• Use a block index
• Bloom filter
  • Over-approximation of a set: a query x ∈ S can return
    • False positives
    • No false negatives
  • Much smaller than storing the entire set (e.g. HashSet)
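A Bloom filter can be sketched with a bit array and a few hash functions. The parameters (1024 bits, 3 hashes) and the use of salted SHA-256 are arbitrary choices for illustration; production LSM stores typically use faster non-cryptographic hashes.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


keys = [f"user{i}" for i in range(50)]
bloom = BloomFilter()
for k in keys:
    bloom.add(k)
no_false_negatives = all(bloom.might_contain(k) for k in keys)
print(no_false_negatives, len(bloom.bits), "bytes")
```

A read can skip any run whose filter answers False, so most lookups for absent keys avoid disk accesses at a cost of 128 bytes per run in this configuration.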

Merging

• Starting from L1, every run is related to a key partition
• Merging runs
  • Take two Li runs
  • Merge with the relevant Li+1 runs (sequential)
  • Create new run Li+1 to replace the merged one
  • If too large, create a new Li+1 run
  • Merge to Li+2 if needed (too many Li+1 runs)
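The merge step is a sequential multi-way merge in which the newest value for each key wins. `merge_runs` and `split_run` are hypothetical helpers operating on in-memory lists; deletions (tombstones) and key-range selection of the relevant runs are left out of this sketch.

```python
import heapq


def _tag(run, age):
    # For equal keys, a larger age (newer run) must sort first, hence -age.
    return [(key, -age, value) for key, value in run]


def merge_runs(runs):
    """Merge sorted runs of (key, value) pairs; `runs` is ordered oldest first."""
    tagged = [_tag(run, age) for age, run in enumerate(runs)]
    merged, last_key = [], object()
    for key, _, value in heapq.merge(*tagged):
        if key != last_key:       # first entry per key comes from the newest run
            merged.append((key, value))
            last_key = key
    return merged


def split_run(run, max_entries):
    """If the merged run is too large, cut it into several runs."""
    return [run[i:i + max_entries] for i in range(0, len(run), max_entries)]


older = [("a", 1), ("c", 3), ("e", 5)]   # existing run at the lower level
newer = [("b", 20), ("c", 30)]           # run being merged down
merged = merge_runs([older, newer])
pieces = split_run(merged, 3)
print(merged)
print(pieces)
```

Both inputs are read front to back and the output is written front to back, so the whole merge uses only sequential I/O.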