
Transcript of Google file system

1

THE GOOGLE FILE SYSTEM
S. GHEMAWAT, H. GOBIOFF AND S. LEUNG

APRIL 7, 2015

CSI5311: Distributed Databases and Transaction Processing, Winter 2015

Prof. Iluju Kiringa, University of Ottawa

Presented By:

Ajaydeep Grewal

Roopesh Jhurani

2

AGENDA
• Introduction
• Design Overview
• System Interactions
• Master Operations
• Fault Tolerance and Diagnosis
• Measurements
• Conclusion
• References

3

Introduction
The Google File System (GFS) is a distributed file system developed by Google for its own use.
It is a scalable file system for large, distributed, data-intensive applications.
It is widely used within Google as a storage platform for the generation and processing of data.

4

Inspirational factors

• Multiple clusters distributed worldwide.
• Thousands of queries served per second.
• A single query reads hundreds of MB of data.
• Google stores dozens of copies of the entire Web.

Conclusion: Google needs a large, distributed, highly fault-tolerant file system. Large-scale data processing demands performance, reliability, scalability, and availability.

5

Design Assumptions
Component Failures
• The file system consists of hundreds of machines built from commodity parts.
• The quantity and quality of the machines guarantee that some nodes are non-functional at any given time.
Huge File Sizes
Workload
• Large streaming reads.
• Small random reads.
• Large, sequential writes that append data to files.
Applications and the API are co-designed
• Increases flexibility.
• The goal is a simple file system that places a light burden on applications.

6

GFS Architecture
Components: Master, Chunk Servers, GFS Client API

7

GFS Architecture

Master
Contains the system metadata, such as:
• Namespaces
• Access control information
• Mappings from files to chunks
• Current locations of chunks

It also handles:
◦ Garbage collection
◦ Syncing state with chunk servers via heartbeat messages

8

GFS Architecture
Chunk Servers
• Machines containing the physical files, divided into chunks.
• Each master can have a number of associated chunk servers.
• For reliability, each chunk is replicated on multiple chunk servers.

Chunk Handle
• An immutable 64-bit chunk handle assigned by the master at the time of chunk creation.

9

GFS Architecture

GFS Client Code
• Code at the client machine that interacts with GFS.
• Interacts with the master for metadata operations.
• Interacts with chunk servers for all read/write operations.

10

GFS Architecture: Read Flow
1. The GFS client requests a particular file (and chunk index).
2. The master replies with the location of the corresponding chunk server(s).
3. The client caches this information and interacts directly with the chunk server.
4. Changes are periodically replicated across all the replicas.
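The read path above can be condensed into a short Python sketch. This is only an illustrative model of the steps on this slide, not GFS's actual implementation (which is written in C++ and is not public); the Master, ChunkServer, and Client classes and the CHUNK_SIZE constant are invented for the example.

# Illustrative sketch of the GFS read path (hypothetical classes, not real GFS code).
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

class Master:
    def __init__(self):
        self.file_chunks = {}      # file name -> list of chunk handles
        self.chunk_locations = {}  # chunk handle -> list of chunk server ids

    def lookup(self, filename, chunk_index):
        handle = self.file_chunks[filename][chunk_index]
        return handle, self.chunk_locations[handle]

class ChunkServer:
    def __init__(self):
        self.chunks = {}           # chunk handle -> bytearray with the chunk data

    def read(self, handle, chunk_offset, length):
        return self.chunks[handle][chunk_offset:chunk_offset + length]

class Client:
    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers   # chunk server id -> ChunkServer
        self.cache = {}                    # (file, chunk index) -> (handle, locations)

    def read(self, filename, offset, length):
        chunk_index = offset // CHUNK_SIZE          # step 1: offset -> chunk index
        key = (filename, chunk_index)
        if key not in self.cache:                   # step 2: ask the master on a cache miss
            self.cache[key] = self.master.lookup(filename, chunk_index)
        handle, locations = self.cache[key]         # step 3: cached for later reads
        server = self.chunkservers[locations[0]]
        return server.read(handle, offset % CHUNK_SIZE, length)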

11

Chunk Size

Having a large, uniform chunk size of 64 MB has the following advantages (a rough back-of-the-envelope check follows):
• Reduced client-master interaction.
• Reduced network overhead.
• Reduced size of the metadata stored on the master.
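As a rough check of the metadata savings, the snippet below compares 1 MB and 64 MB chunks for a single large file. The figure of under 64 bytes of master metadata per chunk comes from the GFS paper; the 1 TiB file size is an arbitrary example.

# Rough metadata estimate for one 1 TiB file under different chunk sizes.
FILE_SIZE = 2**40              # 1 TiB (example value)
BYTES_PER_CHUNK_METADATA = 64  # the paper's upper bound per chunk on the master

for chunk_size_mb in (1, 64):
    chunks = FILE_SIZE // (chunk_size_mb * 2**20)
    metadata = chunks * BYTES_PER_CHUNK_METADATA
    print(f"{chunk_size_mb:>3} MB chunks: {chunks:>9,} chunks, "
          f"~{metadata / 2**20:.1f} MB of master metadata")
# 1 MB chunks -> ~64 MB of metadata; 64 MB chunks -> ~1 MB of metadata.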

12

Metadata
The master stores three major types of metadata:
• The file and chunk namespaces.
• The mappings from files to chunks.
• The location of each chunk's replicas.

The first two are kept persistently in the operation log to ensure reliability and recoverability.
Chunk locations are not persisted; they are held by the chunk servers.
The master polls the chunk servers at start-up and periodically thereafter.
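A minimal sketch of how this metadata might be laid out in the master's memory. The field names are hypothetical, chosen only to mirror the three items above; only the first two structures would be recorded in the operation log, while chunk locations are rebuilt by polling the chunk servers.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FileMetadata:
    chunk_handles: List[int] = field(default_factory=list)  # one handle per chunk
    acl: str = ""                                            # access control info

@dataclass
class MasterState:
    # 1-2. Namespaces and file -> chunk mapping (persisted via the operation log).
    namespace: Dict[str, FileMetadata] = field(default_factory=dict)
    # 3. Chunk handle -> replica locations (not persisted; refreshed from chunk servers).
    chunk_locations: Dict[int, List[str]] = field(default_factory=dict)

    def heartbeat(self, chunkserver_id: str, held_chunks: List[int]) -> None:
        # Refresh chunk locations from a chunk server's periodic report.
        for handle in held_chunks:
            locations = self.chunk_locations.setdefault(handle, [])
            if chunkserver_id not in locations:
                locations.append(chunkserver_id)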

13

Operation Logs
• The operation log contains a historical record of critical metadata changes.
• Metadata updates are logged, e.g., as (old value, new value) pairs.
• Because the operation log is so important, it is replicated on remote machines.
• Global snapshots (checkpoints):
  ◦ A checkpoint is in a compact B-tree-like form that can be mapped directly into memory.
  ◦ A new checkpoint is created whenever the log grows beyond a certain size, so recovery only replays the log records written after the latest checkpoint.
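A toy sketch of this log-then-apply discipline and of checkpointing, assuming a simple key/value view of the metadata; the JSON encoding and file names are illustrative choices, not GFS's on-disk format.

import json

class OperationLog:
    # Toy write-ahead log for master metadata: append (old, new) records,
    # replay them on recovery, and checkpoint to bound replay time.

    def __init__(self, path="oplog.jsonl"):
        self.path = path

    def append(self, key, old_value, new_value):
        # The record is flushed (and, in real GFS, also replicated remotely)
        # before the in-memory state is changed and the client is answered.
        with open(self.path, "a") as f:
            f.write(json.dumps({"key": key, "old": old_value, "new": new_value}) + "\n")
            f.flush()

    def replay(self, state):
        # On restart: load the latest checkpoint, then apply the newer log records.
        with open(self.path) as f:
            for line in f:
                record = json.loads(line)
                state[record["key"]] = record["new"]
        return state

    def checkpoint(self, state, ckpt_path="checkpoint.json"):
        # Written when the log grows large, so recovery only replays the tail.
        with open(ckpt_path, "w") as f:
            json.dump(state, f)
        open(self.path, "w").close()  # truncate the already-checkpointed log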

14

System Interactions
Mutation
A mutation is an operation that changes the contents or metadata of a chunk, such as a write or an append operation.

Lease Mechanism
Leases are used to maintain a consistent mutation order across replicas.
◦ First, the master grants a chunk lease to one of the replicas, which becomes the primary.
◦ The primary determines the order in which mutations are applied at all the other replicas.
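Lease bookkeeping on the master can be sketched as below. The 60-second initial lease timeout is the value given in the paper; the class, the choice of the first replica as primary, and the extension-on-heartbeat method are illustrative.

import time

LEASE_TIMEOUT = 60.0  # initial lease duration in seconds (value from the paper)

class LeaseTable:
    def __init__(self):
        self.leases = {}  # chunk handle -> (primary replica id, expiry time)

    def grant(self, handle, replica_ids):
        # Grant (or return the still-valid) lease for a chunk; the holder is the primary.
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if primary is None or expiry <= time.time():
            primary = replica_ids[0]                  # pick one replica as primary
            self.leases[handle] = (primary, time.time() + LEASE_TIMEOUT)
        return primary

    def extend(self, handle):
        # Primaries request extensions, piggybacked on heartbeat messages.
        primary, _ = self.leases[handle]
        self.leases[handle] = (primary, time.time() + LEASE_TIMEOUT)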

15

Write Control and Data Flow
1. The client requests a write operation.
2. The master replies with the locations of the chunk's primary and the other replicas.
3. The client caches this information and pushes the write data to all replicas.
4. The primary and the secondary replicas store the data in a buffer and send a confirmation.
5. The primary assigns the mutation a serial order and forwards it to all the secondaries.
6. The secondaries commit the mutation and send a confirmation to the primary.
7. The primary sends a confirmation to the client.
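The control flow above, compressed into an illustrative Python sketch. The class and method names are invented, and the data flow is simplified: in GFS the data is pushed along a pipelined chain of chunk servers, which is not modelled here.

class Replica:
    def __init__(self):
        self.buffer = {}          # data id -> bytes, staged by steps 3-4
        self.chunk = bytearray()  # committed chunk contents

    def push_data(self, data_id, data):
        self.buffer[data_id] = data
        return "ack"              # step 4: data buffered, confirmation sent

    def apply(self, data_id):
        self.chunk += self.buffer.pop(data_id)   # commit in the primary's order
        return "done"

class Primary(Replica):
    def write(self, data_id, secondaries):
        self.apply(data_id)                              # step 5: primary orders and applies
        acks = [s.apply(data_id) for s in secondaries]   # step 6: secondaries commit and confirm
        return "ok" if all(a == "done" for a in acks) else "error"   # step 7: reply to client

# Minimal run of steps 3-7 (steps 1-2, the master lookup, are omitted):
primary, secondaries = Primary(), [Replica(), Replica()]
for replica in [primary] + secondaries:
    replica.push_data("d1", b"hello")        # step 3: client pushes data to all replicas
print(primary.write("d1", secondaries))      # prints "ok"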

16

Consistency
Consistent: All the replicated chunks have the same data.
Inconsistent: A failed mutation makes the region inconsistent, i.e., different clients may see different data.

Master Operations

1. Namespace Management and Locking

2. Replica Placement

3. Creation, Re-replication and Rebalancing

4. Garbage Collection

5. Stale Replica Detection

17

Master Operations
Namespace Management and Locking

Separate locks over regions of the namespace ensure:
• Proper serialization.
• Multiple operations can be active at the master, avoiding delays.

Each master operation acquires a set of locks before it runs. To operate on /dir1/dir2/dir3/leaf it requires:
• Read locks on /dir1, /dir1/dir2, /dir1/dir2/dir3
• A read lock or a write lock on /dir1/dir2/dir3/leaf

File creation does not require a write lock on the parent directory: a read lock is enough to protect the parent from being deleted, renamed, or snapshotted.
Write locks on file names serialize attempts to create the same file twice.
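A sketch of this lock-acquisition rule for a single path. Acquiring locks in one consistent order prevents deadlock between concurrent master operations; the RLock used here is only a stand-in for the reader-writer locks a real implementation would need.

import threading

class NamespaceLocks:
    # Illustrative per-path locks for master namespace operations.
    # An operation on /d1/d2/.../leaf takes read locks on each ancestor
    # directory and a read or write lock on the leaf itself.

    def __init__(self):
        self.locks = {}  # path -> lock (RLock as a stand-in for a read/write lock)

    def _lock_for(self, path):
        return self.locks.setdefault(path, threading.RLock())

    def acquire(self, leaf_path):
        parts = leaf_path.strip("/").split("/")
        ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
        # Acquire in one consistent total order (ancestors first, leaf last)
        # so that two concurrent operations can never deadlock on each other.
        plan = ancestors + [leaf_path]
        for path in plan:
            self._lock_for(path).acquire()
        return plan

    def release(self, acquired):
        for path in reversed(acquired):
            self.locks[path].release()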

18

Master Operations
Locking Mechanism Example

Example: /home/user is being snapshotted to /save/user while a client concurrently creates the file /home/user/foo.

The snapshot operation acquires:
• Read locks on /home and /save
• Write locks on /home/user and /save/user

The file creation acquires:
• Read locks on /home and /home/user
• A write lock on /home/user/foo

The two operations conflict on /home/user (read lock vs. write lock), so they are properly serialized.

19

Master Operations
Replica Placement

Serves two purposes:
• Maximize data reliability and availability.
• Maximize network bandwidth utilization.

Spread chunk replicas across racks:
• To ensure chunk survivability even if an entire rack fails.
• To exploit the aggregate read bandwidth of multiple racks.
• The trade-off: write traffic has to flow through multiple racks.

20

Master Operations
Creation, Re-replication and Rebalancing

Creation: the master considers several factors (see the placement sketch below).
• Place new replicas on chunk servers with below-average disk utilization.
• Limit the number of "recent" creations on each chunk server.
• Spread replicas of a chunk across racks.

Re-replication:
• The master re-replicates a chunk when the number of replicas falls below a goal level.
• Chunks needing re-replication are prioritized based on several factors.
• The master limits the number of active clone operations, both for the cluster and for each chunk server.
• Each chunk server limits the bandwidth it spends on each clone operation.

Rebalancing:
• The master rebalances replicas periodically for better disk and load balancing.
• The master gradually fills up a new chunk server rather than instantly swamping it with new chunks.
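The creation-time criteria can be sketched as a scoring function over candidate chunk servers. The below-average-utilization, recent-creation, and rack-spreading criteria come from the list above; the concrete scoring and data layout are invented for illustration.

def choose_chunkservers(servers, num_replicas=3, recent_creation_limit=5):
    # Pick chunk servers for a new chunk's replicas (illustrative policy).
    # servers: list of dicts like {"id": "cs1", "rack": "r1",
    #                              "disk_used": 0.4, "recent_creations": 2}
    avg_used = sum(s["disk_used"] for s in servers) / len(servers)
    # Prefer below-average disk utilization and few recent creations.
    candidates = sorted(
        (s for s in servers if s["recent_creations"] < recent_creation_limit),
        key=lambda s: (s["disk_used"] > avg_used, s["disk_used"], s["recent_creations"]),
    )
    chosen, racks = [], set()
    for s in candidates:                      # spread replicas across racks first
        if s["rack"] not in racks:
            chosen.append(s)
            racks.add(s["rack"])
        if len(chosen) == num_replicas:
            return chosen
    for s in candidates:                      # reuse racks only if there are too few
        if s not in chosen:
            chosen.append(s)
        if len(chosen) == num_replicas:
            break
    return chosen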

21

Master Operations
Garbage Collection

GFS garbage-collects a deleted file lazily. Mechanism:
• The master logs the deletion like any other change.
• The file is renamed to a hidden name that includes the deletion timestamp.
• During its regular namespace scan, the master removes hidden files older than a configurable grace period, erasing their in-memory metadata.
• A similar scan of the chunk namespace identifies orphaned chunks and erases their metadata.
• Chunk servers may delete any chunks not present in the master's metadata, learned during the regular heartbeat message exchange.
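A compact sketch of the lazy-deletion mechanism. The three-day grace period is the default mentioned in the paper (and is configurable); the hidden-name format and the in-memory namespace dictionary are illustrative.

import time

GRACE_PERIOD = 3 * 24 * 3600  # default grace period from the paper, configurable

def delete_file(namespace, path):
    # "Deletion" = log the change and rename to a hidden name with a timestamp.
    hidden = f"{path}.deleted.{int(time.time())}"
    namespace[hidden] = namespace.pop(path)
    return hidden

def namespace_scan(namespace, now=None):
    # Periodic scan: drop hidden files older than the grace period, which also
    # erases their in-memory metadata (their chunks then become orphans).
    now = now or time.time()
    for name in list(namespace):
        if ".deleted." in name and now - int(name.rsplit(".", 1)[1]) > GRACE_PERIOD:
            del namespace[name]

def heartbeat_reply(master_chunk_handles, reported_chunk_handles):
    # Tell a chunk server which of its chunks no longer exist in master metadata;
    # the chunk server is then free to delete them.
    return [h for h in reported_chunk_handles if h not in master_chunk_handles]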

22

Master Operations
Stale Replica Detection

Problem: a chunk replica may become stale if its chunk server fails and misses mutations.

Solution: for each chunk, the master maintains a version number.
• Whenever the master grants a new lease on a chunk, it increases the version number and informs the up-to-date replicas (the version number is stored persistently on the master and the associated chunk servers).
• The master detects that a chunk server has a stale replica when the chunk server restarts and reports its set of chunks and their version numbers.
• The master removes stale replicas in its regular garbage collection.
• The master includes the chunk version number when it informs clients which chunk server holds a lease on a chunk, or when it instructs a chunk server to read the chunk from another chunk server in a cloning operation.
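Version-number bookkeeping can be sketched as follows. The version bump on lease grant and the comparison against a restarting chunk server's report follow the description above; the class structure and the replica objects' record_version method are illustrative.

class VersionTable:
    def __init__(self):
        self.version = {}   # chunk handle -> latest version known to the master
        self.stale = set()  # (chunk server id, chunk handle) pairs awaiting GC

    def grant_lease(self, handle, up_to_date_replicas):
        # Bump the chunk version when a new lease is granted and tell the
        # up-to-date replicas, which also record it persistently.
        self.version[handle] = self.version.get(handle, 0) + 1
        for replica in up_to_date_replicas:
            replica.record_version(handle, self.version[handle])

    def chunkserver_report(self, chunkserver_id, reported_versions):
        # On restart a chunk server reports {handle: version}; any replica behind
        # the master's version is marked stale and later garbage-collected.
        for handle, version in reported_versions.items():
            if version < self.version.get(handle, 0):
                self.stale.add((chunkserver_id, handle))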

23

Fault Tolerance and Diagnosis
High Availability

Strategies: fast recovery and replication.

Fast Recovery:
• The master and chunk servers are designed to restore their state in seconds.
• There is no distinction between normal and abnormal termination (servers are routinely shut down just by killing the process).
• Clients and servers experience a minor timeout on outstanding requests, reconnect to the restarted server, and retry.

Chunk Replication:
• Each chunk is replicated on multiple chunk servers on different racks (different parts of the file namespace can have different replication levels).
• The master clones existing replicas as chunk servers go offline or when it detects corrupted replicas (via checksum verification).

Master Replication:
• A shadow master provides read-only access to the file system even when the master is down.
• The master's operation log and checkpoints are replicated on multiple machines for reliability.

24

Fault Tolerance and Diagnosis
Data Integrity

Each chunk server uses checksumming to detect corruption of stored chunks.
• Each chunk is broken into 64 KB blocks, each with an associated 32-bit checksum.
• Checksums are metadata: they are kept in memory and stored persistently with logging, separate from user data.

For READS: the chunk server verifies the checksums of the data blocks that overlap the read range before returning any data.

For WRITES: the chunk server verifies the checksums of the first and last data blocks that overlap the write range before performing the write, and finally computes and records the new checksums.
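A sketch of per-block checksumming on a chunk server. The 64 KB block size and 32-bit checksums are from the slide above; the use of CRC-32 as the concrete checksum is an assumption, since the paper does not name the algorithm.

import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, each with its own 32-bit checksum

def compute_checksums(chunk_data):
    # One CRC-32 (assumed 32-bit checksum) per 64 KB block of the chunk.
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verified_read(chunk_data, checksums, offset, length):
    # Verify every block overlapping [offset, offset + length) before returning data.
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for i in range(first, last + 1):
        block = chunk_data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[i]:
            raise IOError(f"checksum mismatch in block {i}; report corruption to the master")
    return chunk_data[offset:offset + length]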

25

Measurements
Micro-benchmarks: GFS Cluster

• One master, two master replicas, 16 chunk servers, and 16 clients.
• Each machine: dual 1.4 GHz PIII processors, 2 GB RAM, two 80 GB 5400 RPM disks, and a Fast Ethernet NIC connected to an HP 2524 switch (10/100 ports plus a Gigabit uplink).

26

Measurements
Micro-benchmarks: READS

• Each client reads a randomly selected 4 MB region from a 320 MB file, 256 times (= 1 GB of data).
• Aggregate chunk server memory is 32 GB, so at most a 10% hit rate in the Linux buffer cache is expected.

27

Measurements
Micro-benchmarks: WRITES

• Each client writes 1 GB of data to a new file in a series of 1 MB writes.
• The network stack does not interact very well with the pipelining scheme used for pushing data to the chunk replicas: network congestion is more likely for 16 writers than for 16 readers, because each write involves three different replicas.

28

Measurements
Micro-benchmarks: RECORD APPENDS

• Each client appends simultaneously to a single file.
• Performance is limited by the network bandwidth of the three chunk servers that store the last chunk of the file, independent of the number of clients.

29

Conclusion
• The Google File System supports large-scale data processing workloads on commodity (COTS) x86 servers.
• Component failures are the norm rather than the exception.
• GFS is optimized for huge files that are mostly appended to and then read sequentially.
• Fault tolerance is achieved by constant monitoring, replication of crucial data, and fast, automatic recovery.
• GFS delivers high aggregate throughput to many concurrent readers and writers.

Future Improvements
• Networking stack limits: write throughput can be improved in the future.

30

References

1. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. "The Google File System." In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), ACM SIGOPS Operating Systems Review 37(5): 29–43, 2003.

2. Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee. Frangipani: A scalable distributed file system. In Proceedings of the 16th ACM Symposium on Operating System Principles, pages 224–237, Saint-Malo, France, October 1997.

3. http://en.wikipedia.org/wiki/Google_File_System 

4. http://computer.howstuffworks.com/internet/basics/google-file-system.htm

5. http://en.wikiversity.org/wiki/Big_Data/Google_File_System

6. http://storagemojo.com/google-file-system-eval-part-i/

7. https://www.youtube.com/watch?v=d2SWUIP40Nw

31

Thank You!!

32