L22: Distributed File Systems (Continued)
Theophilus Benson
CS1380 Spring 20
Today's Agenda
• General Distributed File Systems
• Industry Use Cases
  • Google File System (GFS)
  • MegaStore/BigTable
  • Zookeeper
GFS: Google File System
• Designed for Google Search
  • Indexing of the world wide web by hundreds of clients
  • Periodic batch analysis (e.g., MapReduce) to build the index and analyze web data
• Huge data sets
  • 1000 servers
  • 300 TB of data
  • A few million files, 100 MB+ in size
• Workload characteristics
  • Reads: small random reads + large streaming reads
  • Writes: large appends (or files written only once, at creation time)
  • No support for random writes → delete the whole file and rewrite a new file
ChunkServer: Appending/Writing to Files
• Appends with at-least-once semantics
• Consistency model: no consistency
• Implications of at-least-once
  • Different chunkservers may hold different data (e.g., duplicate records)
[Figure: write path. A Gmail client tells the GFSMaster "I want to append"; the master replies "use server 1" and tells that chunkserver "you are in charge of writes to the file"; the client's write goes to the lead chunkserver, which forwards it to the replica chunkservers and acknowledges "write OK". The replicas may end up holding different copies (e.g., A A B B vs. A B vs. A A B).]
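A minimal sketch of this write path, assuming a hypothetical client API (master.lookup and leader.write are illustrative names, not GFS's real interface):

def append(master, filename, data):
    # Ask the GFSMaster which chunkserver is in charge of writes.
    leader, replicas = master.lookup(filename)   # "use server 1"
    # Send the data to the leader, which forwards it to the replicas.
    ack = leader.write(data, forward_to=replicas)
    if ack != "OK":
        raise IOError("append not acknowledged")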
Common Usage Pattern: Atomic Record Append
• GFS appends the record to the file atomically, at least once
  • GFS picks the offset
  • Works for concurrent writers
• Problems with "at least once"
  • If a concurrent write fails, some replicas will have the data and others will not!
  • The client library must retry: some replicas will end up with two copies
• Used heavily by Google applications
  • Applications can suppress duplicates using record IDs (see the sketch after the figure below)
  • Duplicates have the same ID; the application must be able to handle duplicates
Common Usage Pattern: Atomic Record Append
[Figure: a client appends record ID=4 to three replicas; records 1, 2, 3 are on all replicas, but the append of record 4 only partially succeeds, so the client retries "Append ID=4" and one replica ends up with two copies of record 4.]
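A hedged sketch of at-least-once append with client-side retry and reader-side duplicate suppression (the chunkserver.append and chunkserver.scan calls are assumptions for illustration):

import uuid

def append_record(chunkserver, data, max_retries=3):
    # Tag the record with a unique ID so readers can suppress duplicates.
    record = {"id": uuid.uuid4().hex, "data": data}
    for _ in range(max_retries):
        try:
            chunkserver.append(record)   # GFS picks the offset
            return record["id"]
        except TimeoutError:
            continue   # may have landed anyway; retrying re-appends it
    raise RuntimeError("append failed")

def read_records(chunkserver):
    # Reader-side dedup: skip any record ID seen before.
    seen = set()
    for record in chunkserver.scan():
        if record["id"] not in seen:
            seen.add(record["id"])
            yield record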
ChunkServer: Reading from Files
• Appends with at-least-once semantics
• Consistency model: no consistency
• Implications of at-least-once
  • Different chunkservers may hold different data (e.g., duplicate records)
• On reads:
  • Application developers must suppress duplicates using record IDs
[Figure: read path. A Gmail client tells the GFSMaster "I want to read"; the master replies "use server 3"; the client fetches the data from chunkserver 3 and must suppress duplicate records itself.]
Masters: Metadata Operations and Fault Tolerance
• Maintain a log of all events
  • Replicate to shadows before acknowledging to the client
• Periodically create a checkpoint
  • On failure, restore the checkpoint and replay the log
• On master failure
  • Shadows use Chubby (i.e., Google's Zookeeper-like service) to determine the new primary master
[Figure: the GFSMaster's shadow/backup masters ask Chubby "Who is master??" to elect a new primary.]
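A minimal sketch of leader election through a Chubby/Zookeeper-style lock service, assuming a hypothetical lock_service API (try_acquire and renew are illustrative, not Chubby's real interface):

import time

def run_shadow(lock_service, my_id):
    # Every shadow tries to grab the "master" lock; the holder is primary.
    while True:
        if lock_service.try_acquire("master", owner=my_id, ttl=10):
            serve_as_master(lock_service, my_id)
        time.sleep(1)   # someone else is primary: stand by and retry

def serve_as_master(lock_service, my_id):
    # Handle metadata ops and replicate the log to shadows for as long
    # as the lease can be renewed; losing the lease demotes this master.
    while lock_service.renew("master", owner=my_id):
        time.sleep(5)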
Masters: Metadata Operations and Fault Tolerance (continued)
• Maintain a log of all events
  • Replicate to shadows before acknowledging to the client
• Periodically create a checkpoint
  • A checkpoint is a snapshot of the in-memory data structures
  • On failure, restore the checkpoint and replay the log
  • Checkpoints speed up recovery: no need to replay all ops to rebuild state
• On master failure
  • Shadows use Chubby (i.e., Google's Zookeeper-like service) to determine the new primary master
[Figure: log timeline "op op op | Checkpoint | op op op op op". The checkpoint captures all prior events in the log; a new log file holds post-checkpoint events.]
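A hedged sketch of checkpoint-plus-log recovery, assuming JSON-lines logs and a dictionary of file-to-chunk metadata (the formats are illustrative, not GFS's):

import json

def recover(checkpoint_path, log_path):
    # Restore the snapshot of in-memory metadata.
    with open(checkpoint_path) as f:
        state = json.load(f)
    # Replay only the ops logged after the checkpoint.
    with open(log_path) as f:
        for line in f:
            apply_op(state, json.loads(line))
    return state

def apply_op(state, op):
    # Each logged op mutates the metadata, e.g. file-to-chunk mappings.
    if op["type"] == "add_chunk":
        state.setdefault(op["file"], []).append(op["chunk"])
    elif op["type"] == "delete_file":
        state.pop(op["file"], None)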
Issues with GFS
• GFS scaled to ~50 million files, ~10 PB
• Latency-sensitive applications suffered
  • Append-only semantics and large files (64 MB chunks)
  • Developers had to rewrite applications to use append-only semantics
• Developers needed a different abstraction
  • Want transactions
  • Want something similar to SQL
Colossus Fixes (GFS 2.0: GFS Improved for Interactive Flows)
• Reducing chunk size → more metadata → scaling issues with a single master
  • Solution: multiple masters → partition metadata across multiple masters
  • New chunk size: 1 MB instead of 64 MB
• Data fault tolerance → 3 copies → requires lots of storage
  • Solution: erasure coding (Reed-Solomon encoding)
  • Pros: less data to write (faster writes)
  • Cons: recovery is slower because you need the other partitions to reconstruct the data (a toy sketch follows the links below)
http://googleappengine.blogspot.com/2010/06/datastore-performance-growing-pains.html
http://googleappengine.blogspot.com/2009/09/migration-to-better-datastore.html
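A toy sketch of erasure coding's tradeoff using single-parity XOR (Reed-Solomon generalizes this to tolerate multiple losses): storing k data blocks plus one parity block costs (k+1)/k of the data size, versus 3x for triple replication, but rebuilding a lost block needs all the surviving partitions:

def encode(blocks):
    # XOR equal-sized blocks together to form one parity block.
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def recover(surviving, parity):
    # Rebuild the lost block: needs every other partition, which is
    # why recovery is slower than copying a full replica.
    return encode(surviving + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = encode(data)
assert recover([data[0], data[2]], parity) == data[1]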
Motivation to Move Beyond GFS/NoSQL-DB
• Explosion in online services
  • More data, more users
  • Large geographic footprint
• Fast application development cycles
  • ACID [linearizable] is easy to program
  • NoSQL-DB [eventual consistency] is not
• ACID-based DBs (e.g., RDBMS) are slow
  • Do not scale!!
https://www.getfilecloud.com/blog/2014/08/leading-nosql-databases-to-consider/#.XKyW1ZNKjwc
http://sahrzad.net/blog/problems-with-google-services/
https://www.cbronline.com/what-is/what-is-rdbms-4945418/
[Figure: the storage stack: App on top of MegaStore, on top of BigTable, on top of GFS.]
Industry Storage Systems
[Figure: timeline of industry storage systems. GFS [2003] → Dynamo [2007] → BigTable [2008] and Cassandra [2008] (NoSQL DBs) → MegaStore [2011] (NoSQL + RDBMS) → Spanner [2012] and Orleans [2013] (fast distributed transactions).]
NoSQL vs. RDBMS
• NoSQL (BigTable, HBase, Cassandra)
  • Merits:
    + Highly scalable
  • Limitations:
    - Fewer features for building applications (transactions only at the granularity of a single key, poor schema support and query capability, limited API (put/get))
    - Loose consistency models with asynchronous replication (eventual consistency)
• RDBMS (e.g., MySQL)
  • Merits:
    + Mature, rich data-management features; easy to build applications
    + Synchronous replication comes with strong transactional semantics
  • Limitations:
    - Hard to scale
    - Synchronous replication may have performance and scalability issues
    - May not have fault-tolerant replication mechanisms
BigTable (Google's Dynamo-ish NoSQL DB) → Builds on GFS
• Data model
  • A big sparse table
  • (row, column, timestamp) → content
  • e.g., an email is the content
• Optimized for fast lookup on keys
  • Rows are ordered lexicographically, so scans return rows in order (a sketch follows below)
• Divide the table into tablets (~100 MB) grouped by a range of sorted rows
• Each tablet is stored on a tablet server that manages 10-1000 tablets
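A hedged sketch of this data model as a sorted map (MiniBigTable is illustrative only; the real BigTable persists SSTable files in GFS):

import bisect

class MiniBigTable:
    # Sparse, sorted map from (row, column, timestamp) to content.
    def __init__(self):
        self._keys = []    # sorted (row, column, timestamp) tuples
        self._cells = {}   # key -> content

    def put(self, row, column, timestamp, content):
        key = (row, column, timestamp)
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = content

    def scan(self, start_row, end_row):
        # Lexicographic ordering makes a row-range scan a contiguous
        # slice of the sorted key list (a tablet is such a range).
        lo = bisect.bisect_left(self._keys, (start_row,))
        for key in self._keys[lo:]:
            if key[0] >= end_row:
                break
            yield key, self._cells[key]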
[Figure: BigTable architecture. An App's client library talks to tablet servers, each holding a range of sorted rows; a BigTableMaster (with a BigTableShadowMaster) coordinates tablets, and tablet data is stored in GFS via the GFSMaster.]
BigTable: Semantics
• Durability/atomicity
  • Writes go to GFS
• Consistency: strong/sequential
  • Ops are processed by a single server, in order
• Writes to individual rows
  • Guarantees at the row level: isolated changes to a row
  • No support for transactions across rows
• Isolated transactions: single-row only, e.g., compare-and-swap (sketched below)
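A minimal sketch of single-row compare-and-swap, assuming a per-row lock (illustrative only, not BigTable's client API):

import threading

class Row:
    def __init__(self):
        self._lock = threading.Lock()
        self._cells = {}

    def compare_and_swap(self, column, expected, new):
        # Atomically update one cell only if it still holds the
        # expected value; atomicity never spans multiple rows.
        with self._lock:
            if self._cells.get(column) != expected:
                return False
            self._cells[column] = new
            return True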
[Figure: the full stack. An App's client library talks to BigTable (master, shadow master, and tablet servers holding sorted row ranges), which stores its data in GFS. On Open(), the GFSMaster (with shadow/backup masters) returns the list of chunkservers; it grants a leader lease to one chunkserver per chunk and collects heartbeats carrying each server's chunk list; the client's writes go to the leader for the chunk, which forwards them to the replicas.]
[Figure: entities on Nodes A-C form one entity group; entities on Nodes D-F form another; a transaction across entity groups spans the two.]
MegaStore = RDBMS (consistency) + NoSQL (scale)
Partition the data and apply a replication scheme per partition
• NoSQL:
  • Partition data into entity groups
  • Poor consistency across groups
• RDBMS:
  • Each group was a key/value pair in Cassandra/Dynamo/BigTable
  • Each group is a DB
  • ACID within a group
  • Optimized for reads
Key Questions for MegaStore
• How do you partition data across entity boundaries?
• How are ACID semantics provided?
• How do you deal with transactions that need to span entity-group boundaries?
• How do you optimize for reads?
[Figure labels: these questions are divided between the Application Developer and the Framework Developer.]
How Do You Partition Data? Entity-group boundaries are natural for a class of applications
• Gmail:
  • Each user has a mailbox of messages
  • User A does not modify User B's mailbox
  • User A can send email to User B, and this mail goes over the internet (sketched below)
• Google Maps:
  • The world map can be partitioned into "tiles"
  • With few changes across tiles
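A hedged sketch of entity groups as the unit of ACID, using per-user mailboxes (the names and the asynchronous outbox are illustrative assumptions, not MegaStore's API):

import threading, queue

class EntityGroup:
    # One user's mailbox: operations within the group are serialized.
    def __init__(self):
        self._lock = threading.Lock()
        self.messages = []

    def transact(self, fn):
        with self._lock:          # ACID within a single entity group
            return fn(self.messages)

mailboxes = {"alice": EntityGroup(), "bob": EntityGroup()}
outbox = queue.Queue()            # cross-group delivery is asynchronous

def send_mail(sender, recipient, body):
    # No cross-group transaction: enqueue now, deliver later.
    outbox.put((recipient, "from " + sender + ": " + body))

def deliver_one():
    recipient, msg = outbox.get()
    mailboxes[recipient].transact(lambda msgs: msgs.append(msg))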
Distributed Transactions No Longer Considered Dead!
• 2007: avoid distributed transactions
  https://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf
• 2012: Google's Spanner: distributed transactions!!!
• 2019: MS's Orleans brings distributed transactions to the cloud!!!
  https://thenewstack.io/microsoft-orleans-brings-distributed-transactions-to-cloud/
Industry Storage Systems
[Figure: the GFS-to-Spanner/Orleans timeline shown earlier, revisited.]
Other File Systems @ Google
• GFS → the newer version is called Colossus
  • Initial GFS: optimized for large files (i.e., for Google Search/indexing)
  • GFS 2.0 (Colossus): supports interactive services (i.e., Maps, Docs, etc.)
• Dynamo → Google's BigTable
• Google's Chubby → Zookeeper