L22: Distributed File Systems (Continued)
Theophilus Benson
CS1380 Spring 20
Today's Agenda
• General Distributed File Systems
• Industry Use Cases
  • Google File System (GFS)
  • MegaStore/BigTable
  • Zookeeper
GFS: Google File System
• Designed for Google Search
  • Indexing of the world wide web by hundreds of clients
  • Periodic batch analysis (e.g., MapReduce) to build the index and analyze web data
• Huge data sets
  • 1000 servers
  • 300 TB of data
  • A few million files, 100 MB+ in size
• Workload characteristics
  • Reads: small random reads + large streaming reads
  • Writes: large appends (or files written only once, at creation time)
  • No support for random writes → delete the whole file and rewrite a new file
ChunkServer: Appending/Writing to Files
• Appends with at-least-once semantics
• Consistency model: no consistency
• Implications of at-least-once
  • Different chunkservers may hold different data (e.g., duplicate records)
[Figure: write path. A Gmail client tells the GFSMaster "I want to append"; the master replies "use server 1" and tells that chunkserver "you are in charge of writes to the file"; the client's write goes to the lead chunkserver, which forwards it to the replica chunkservers and acknowledges "write OK". The replicas may end up holding different copies (e.g., A A B B vs. A B vs. A A B).]
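A minimal sketch of this write path, assuming a hypothetical client API (master.lookup and leader.write are illustrative names, not GFS's real interface):

def append(master, filename, data):
    # Ask the GFSMaster which chunkserver is in charge of writes.
    leader, replicas = master.lookup(filename)   # "use server 1"
    # Send the data to the leader, which forwards it to the replicas.
    ack = leader.write(data, forward_to=replicas)
    if ack != "OK":
        raise IOError("append not acknowledged")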
Common Usage Pattern: Atomic Record Append
• GFS appends the record to the file atomically, at least once
  • GFS picks the offset
  • Works for concurrent writers
• Problems with "at least once"
  • If a concurrent write fails, some replicas will have the data and others will not!
  • The client library must retry: some replicas will end up with two copies
• Used heavily by Google applications
  • Applications can suppress duplicates using record IDs (see the sketch after the figure below)
  • Duplicates have the same ID; the application must be able to handle duplicates
Common Usage Pattern: Atomic Record Append
[Figure: a client appends record ID=4 to three replicas; records 1, 2, 3 are on all replicas, but the append of record 4 only partially succeeds, so the client retries "Append ID=4" and one replica ends up with two copies of record 4.]
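A hedged sketch of at-least-once append with client-side retry and reader-side duplicate suppression (the chunkserver.append and chunkserver.scan calls are assumptions for illustration):

import uuid

def append_record(chunkserver, data, max_retries=3):
    # Tag the record with a unique ID so readers can suppress duplicates.
    record = {"id": uuid.uuid4().hex, "data": data}
    for _ in range(max_retries):
        try:
            chunkserver.append(record)   # GFS picks the offset
            return record["id"]
        except TimeoutError:
            continue   # may have landed anyway; retrying re-appends it
    raise RuntimeError("append failed")

def read_records(chunkserver):
    # Reader-side dedup: skip any record ID seen before.
    seen = set()
    for record in chunkserver.scan():
        if record["id"] not in seen:
            seen.add(record["id"])
            yield record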
ChunkServer: Reading from Files
• Appends with at-least-once semantics
• Consistency model: no consistency
• Implications of at-least-once
  • Different chunkservers may hold different data (e.g., duplicate records)
• On reads:
  • Application developers must suppress duplicates using record IDs
[Figure: read path. A Gmail client tells the GFSMaster "I want to read"; the master replies "use server 3"; the client fetches the data from chunkserver 3 and must suppress duplicate records itself.]
Masters: Metadata Operations and Fault Tolerance
• Maintain a log of all events
  • Replicate to shadows before acknowledging to the client
• Periodically create a checkpoint
  • On failure, restore the checkpoint and replay the log
• On master failure
  • Shadows use Chubby (i.e., Google's Zookeeper-like service) to determine the new primary master
[Figure: the GFSMaster's shadow/backup masters ask Chubby "Who is master??" to elect a new primary.]
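A minimal sketch of leader election through a Chubby/Zookeeper-style lock service, assuming a hypothetical lock_service API (try_acquire and renew are illustrative, not Chubby's real interface):

import time

def run_shadow(lock_service, my_id):
    # Every shadow tries to grab the "master" lock; the holder is primary.
    while True:
        if lock_service.try_acquire("master", owner=my_id, ttl=10):
            serve_as_master(lock_service, my_id)
        time.sleep(1)   # someone else is primary: stand by and retry

def serve_as_master(lock_service, my_id):
    # Handle metadata ops and replicate the log to shadows for as long
    # as the lease can be renewed; losing the lease demotes this master.
    while lock_service.renew("master", owner=my_id):
        time.sleep(5)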
Masters: Metadata Operations and Fault Tolerance (continued)
• Maintain a log of all events
  • Replicate to shadows before acknowledging to the client
• Periodically create a checkpoint
  • A checkpoint is a snapshot of the in-memory data structures
  • On failure, restore the checkpoint and replay the log
  • Checkpoints speed up recovery: no need to replay all ops to rebuild state
• On master failure
  • Shadows use Chubby (i.e., Google's Zookeeper-like service) to determine the new primary master
[Figure: log timeline "op op op | Checkpoint | op op op op op". The checkpoint captures all prior events in the log; a new log file holds post-checkpoint events.]
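A hedged sketch of checkpoint-plus-log recovery, assuming JSON-lines logs and a dictionary of file-to-chunk metadata (the formats are illustrative, not GFS's):

import json

def recover(checkpoint_path, log_path):
    # Restore the snapshot of in-memory metadata.
    with open(checkpoint_path) as f:
        state = json.load(f)
    # Replay only the ops logged after the checkpoint.
    with open(log_path) as f:
        for line in f:
            apply_op(state, json.loads(line))
    return state

def apply_op(state, op):
    # Each logged op mutates the metadata, e.g. file-to-chunk mappings.
    if op["type"] == "add_chunk":
        state.setdefault(op["file"], []).append(op["chunk"])
    elif op["type"] == "delete_file":
        state.pop(op["file"], None)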
Issues with GFS
• GFS scaled to ~50 million files, ~10 PB
• Latency-sensitive applications suffered
  • Append-only semantics and large files (64 MB chunks)
  • Developers had to rewrite applications to use append-only semantics
• Developers needed a different abstraction
  • Want transactions
  • Want something similar to SQL
Colossus Fixes (GFS 2.0: GFS Improved for Interactive Flows)
• Reducing chunk size → more metadata → scaling issues with a single master
  • Solution: multiple masters → partition metadata across multiple masters
  • New chunk size: 1 MB instead of 64 MB
• Data fault tolerance → 3 copies → requires lots of storage
  • Solution: erasure coding (Reed-Solomon encoding)
  • Pros: less data to write (faster writes)
  • Cons: recovery is slower because you need the other partitions to reconstruct the data (a toy sketch follows the links below)
http://googleappengine.blogspot.com/2010/06/datastore-performance-growing-pains.html
http://googleappengine.blogspot.com/2009/09/migration-to-better-datastore.html
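A toy sketch of erasure coding's tradeoff using single-parity XOR (Reed-Solomon generalizes this to tolerate multiple losses): storing k data blocks plus one parity block costs (k+1)/k of the data size, versus 3x for triple replication, but rebuilding a lost block needs all the surviving partitions:

def encode(blocks):
    # XOR equal-sized blocks together to form one parity block.
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def recover(surviving, parity):
    # Rebuild the lost block: needs every other partition, which is
    # why recovery is slower than copying a full replica.
    return encode(surviving + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = encode(data)
assert recover([data[0], data[2]], parity) == data[1]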
Motivation to Move Beyond GFS/NoSQL-DB
• Explosion in online services
  • More data, more users
  • Large geographic footprint
• Fast application development cycles
  • ACID [linearizable] is easy to program
  • NoSQL-DB [eventual consistency] is not
• ACID-based DBs (e.g., RDBMS) are slow
  • Do not scale!!
https://www.getfilecloud.com/blog/2014/08/leading-nosql-databases-to-consider/#.XKyW1ZNKjwc
http://sahrzad.net/blog/problems-with-google-services/
https://www.cbronline.com/what-is/what-is-rdbms-4945418/
[Figure: the storage stack: App on top of MegaStore, on top of BigTable, on top of GFS.]
Industry Storage Systems
[Figure: timeline of industry storage systems. GFS [2003] → Dynamo [2007] → BigTable [2008] and Cassandra [2008] (NoSQL DBs) → MegaStore [2011] (NoSQL + RDBMS) → Spanner [2012] and Orleans [2013] (fast distributed transactions).]
NoSQL vs. RDBMS
• NoSQL (BigTable, HBase, Cassandra)
  • Merits:
    + Highly scalable
  • Limitations:
    - Fewer features for building applications (transactions only at the granularity of a single key, poor schema support and query capability, limited API (put/get))
    - Loose consistency models with asynchronous replication (eventual consistency)
• RDBMS (e.g., MySQL)
  • Merits:
    + Mature, rich data-management features; easy to build applications
    + Synchronous replication comes with strong transactional semantics
  • Limitations:
    - Hard to scale
    - Synchronous replication may have performance and scalability issues
    - May not have fault-tolerant replication mechanisms
BigTable (Google's Dynamo-ish NoSQL DB) → Builds on GFS
• Data model
  • A big sparse table
  • (row, column, timestamp) → content
  • e.g., an email is the content
• Optimized for fast lookup on keys
  • Rows are ordered lexicographically, so scans return rows in order (a sketch follows below)
• Divide the table into tablets (~100 MB) grouped by a range of sorted rows
• Each tablet is stored on a tablet server that manages 10-1000 tablets
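A hedged sketch of this data model as a sorted map (MiniBigTable is illustrative only; the real BigTable persists SSTable files in GFS):

import bisect

class MiniBigTable:
    # Sparse, sorted map from (row, column, timestamp) to content.
    def __init__(self):
        self._keys = []    # sorted (row, column, timestamp) tuples
        self._cells = {}   # key -> content

    def put(self, row, column, timestamp, content):
        key = (row, column, timestamp)
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = content

    def scan(self, start_row, end_row):
        # Lexicographic ordering makes a row-range scan a contiguous
        # slice of the sorted key list (a tablet is such a range).
        lo = bisect.bisect_left(self._keys, (start_row,))
        for key in self._keys[lo:]:
            if key[0] >= end_row:
                break
            yield key, self._cells[key]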
[Figure: BigTable architecture. An App's client library talks to tablet servers, each holding a range of sorted rows; a BigTableMaster (with a BigTableShadowMaster) coordinates tablets, and tablet data is stored in GFS via the GFSMaster.]
BigTable: Semantics
• Durability/atomicity
  • Writes go to GFS
• Consistency: strong/sequential
  • Ops are processed by a single server, in order
• Writes to individual rows
  • Guarantees at the row level: isolated changes to a row
  • No support for transactions across rows
• Isolated transactions: single-row only, e.g., compare-and-swap (sketched below)
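A minimal sketch of single-row compare-and-swap, assuming a per-row lock (illustrative only, not BigTable's client API):

import threading

class Row:
    def __init__(self):
        self._lock = threading.Lock()
        self._cells = {}

    def compare_and_swap(self, column, expected, new):
        # Atomically update one cell only if it still holds the
        # expected value; atomicity never spans multiple rows.
        with self._lock:
            if self._cells.get(column) != expected:
                return False
            self._cells[column] = new
            return True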
[Figure: the full stack. An App's client library talks to BigTable (master, shadow master, and tablet servers holding sorted row ranges), which stores its data in GFS. On Open(), the GFSMaster (with shadow/backup masters) returns the list of chunkservers; it grants a leader lease to one chunkserver per chunk and collects heartbeats carrying each server's chunk list; the client's writes go to the leader for the chunk, which forwards them to the replicas.]
[Figure: entities on Nodes A-C form one entity group; entities on Nodes D-F form another; a transaction across entity groups spans the two.]
MegaStore = RDBMS (consistency) + NoSQL (scale)
Partition the data and apply a replication scheme per partition
• NoSQL:
  • Partition data into entity groups
  • Poor consistency across groups
• RDBMS:
  • Each group was a key/value pair in Cassandra/Dynamo/BigTable
  • Each group is a DB
  • ACID within a group
  • Optimized for reads
Key Questions for MegaStore
• How do you partition data across entity boundaries?
• How are ACID semantics provided?
• How do you deal with transactions that need to span entity-group boundaries?
• How do you optimize for reads?
[Figure labels: these questions are divided between the Application Developer and the Framework Developer.]
How Do You Partition Data? Entity-group boundaries are natural for a class of applications
• Gmail:
  • Each user has a mailbox of messages
  • User A does not modify User B's mailbox
  • User A can send email to User B, and this mail goes over the internet (sketched below)
• Google Maps:
  • The world map can be partitioned into "tiles"
  • With few changes across tiles
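A hedged sketch of entity groups as the unit of ACID, using per-user mailboxes (the names and the asynchronous outbox are illustrative assumptions, not MegaStore's API):

import threading, queue

class EntityGroup:
    # One user's mailbox: operations within the group are serialized.
    def __init__(self):
        self._lock = threading.Lock()
        self.messages = []

    def transact(self, fn):
        with self._lock:          # ACID within a single entity group
            return fn(self.messages)

mailboxes = {"alice": EntityGroup(), "bob": EntityGroup()}
outbox = queue.Queue()            # cross-group delivery is asynchronous

def send_mail(sender, recipient, body):
    # No cross-group transaction: enqueue now, deliver later.
    outbox.put((recipient, "from " + sender + ": " + body))

def deliver_one():
    recipient, msg = outbox.get()
    mailboxes[recipient].transact(lambda msgs: msgs.append(msg))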
Distributed Transactions No Longer Considered Dead!
• 2007: avoid distributed transactions
  https://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf
• 2012: Google's Spanner: distributed transactions!!!
• 2019: MS's Orleans brings distributed transactions to the cloud!!!
  https://thenewstack.io/microsoft-orleans-brings-distributed-transactions-to-cloud/
Industry Storage Systems
[Figure: the GFS-to-Spanner/Orleans timeline shown earlier, revisited.]
Other File Systems @ Google
• GFS → the newer version is called Colossus
  • Initial GFS: optimized for large files (i.e., for Google Search/indexing)
  • GFS 2.0 (Colossus): supports interactive services (i.e., Maps, Docs, etc.)
• Dynamo → Google's BigTable
• Google's Chubby → Zookeeper