The Hadoop Distributed File System - CS | Computer...

Presented by Haoran Ma, Yifan Qiao

The Hadoop Distributed File System

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo!

Sunnyvale, California USA

Outline

• Introduction

• Architecture

• File I/O Operations and Replica Management

• Practice at YAHOO!

• Future Work

Introduction

• A single dataset is too large —> Divide it and store them on a cluster of commodity hardwares.

- What if one of the physical machines fails?

• Some applications like MapReduce need high throughput of data access.

Introduction

• HDFS is the file system component of Hadoop. It is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications.

• These are achieved by replicating file contents on multiple machines(DataNodes).

Introduction• Very Large Distributed File System

• Assumes Commodity Hardware

- Files are replicated to handle hardware failure

• Optimized for Batch Processing

- Data locations exposed so that computations can move to where data resides

Introduction

Usually 128 MB

Source: HDFS Tutorial – A Complete Hadoop HDFS Overview. DATAFLAIR TEAM.

Architecture

Source: Hadoop HDFS Architecture Explanation and Assumptions. DATAFLAIR TEAM.

Architecture

• Stores meta-data such as number of data blocks, replicas and other details in memory

• Maintains and manages the DataNodes, and assigns tasks to them

Architecture

Store Application Data

Source: Hadoop HDFS Architecture Explanation and Assumptions. DATAFLAIR TEAM.

ArchitectureHDFS Client: a code library that exports the HDFS file system interface

Architecture

• How does this architecture achieve high fault-tolerance?

• DataNodes Failure

• NameNode Failure

Architecture: Failure Recovery for DataNodes

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.


Block Report

Heartbeat



What if NameNode fails?

Architecture: Failure Recovery for NameNode

Image = Checkpoint + Journal

• Image: The file system metadata that describes the organization of application data as directories and files.

• Checkpoint: A persistent record of the image written to disk.

• Journal: The modification log of the image. It is also stored in the local host’s native file system.


• CheckpointNode:

• Periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal.

• BackupNode:

• A read-only NameNode.

• Maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.

• If the NameNode fails, the BackupNode’s image in memory and the checkpoint on disk is a record of the latest namespace state.


• Snapshots

• To minimize potential damage to the data stored in the system during upgrades.

• Persistently save the current state of the file system(both data and metadata).

• , so that if the upgrade results in data loss or corruption, it is possible to rollback the upgrade and return HDFS to the namespace and storage state as they were at the time of the snapshot.


• Snapshots

• To minimize potential damage to the data stored in the system during upgrades.

• Persistently save the current state of the file system(both data and metadata).

• , so that if the upgrade results in data loss or corruption, it is possible to rollback the upgrade and return HDFS to the namespace and storage state as they were at the time of the snapshot.Copy on Write


Combine Return

NameNode

DataNode (Example)

BackupNode

CheckpointNode

Memory:Image

Disk:

Memory:

SynchronizeImage

CheckpointJournal

New CheckpointEmpty Journal

Snapshot Snapshot (Only Hard Links)

Disk:

Disk:

Usage & Management of HDFS Cluster

• Basic File I/O operations

• Rack Awareness

• Replication Management

File I/O Operations• Write Files to HDFS: Single writer, multiple reader

(1) addBlock

(2) uniqueblock ids

(3) writeto block


• Write Files to HDFS: Single writer, multiple reader

(1) addBlock

(2) uniqueblock ids

(3) writeto block

1. Client consults the NameNode to get a lease and destination DataNodes

2. Client writes a block to DataNodes in a pipeline way

3. DataNodes replicate blocks 4. Client writes a new block after finishing the previous block

The visibility of the modification is not guaranteed!


File I/O Operations

1. the client consults the NameNode to get the list of blocks and their replicas' locations

2. try the nearest replica first, then the second nearest replica, and so on

• Identifying corrupted data • Checksums-CRC32

• Read Files in HDFS

File I/O Operations

In-cluster Client Reads a File


Outside Client Reads a File


Rack Awareness


Rack Awareness• Benefits:

• higher throughput • higher reliability:

• an entire rack failure never loses all replicas of a block

• better network bandwidth utilization: • reduce inter-rack and inter-node write traffic as

much as possible

• The default HDFS replica placement policy: 1. No DataNode contains more than one

replica of any block 2. No rack contains more than two replicas of

the same block, provided there are sufficient racks on the cluster

Rack Awareness

Replication Management• to avoid blocks to be under- or over-replicated


Practice at Yahoo!

• Clusters at Yahoo! can be as large as ~3500 nodes with typical configuration: • 2 quad core Xeon processors @ 2.5Ghz • 4 directly attached SATA drives (1TB each, 4TB total) • 16GB RAM• 1-Gbit Ethernet

• Total 9.8PB of storage available, 3.3PB available for user applications when replicating blocks 3 times

Cluster Basic Information

Practice at Yahoo!• Uncorrelated nodes failure:

• Chance of a node failed during a month ~0.8% (Naive estimation for a node failure probability during a year is ~9.2%)

• Chance of losing a block during a year < 0.5%• Correlated nodes failure:

• HDFS tolerates a rack switch failure • But a core switch failure or cluster power loss can

lose some blocks

Data Durability

Practice at Yahoo!• Benchmarks

Scenario Read (MB/sper node)

Write (MB/sper node)

DFSIO 66 407200 RPM

Desktop HDD[6]

< 130(typical 50-120)

< 130(typical 50-120)

Table1: Contrived benchmark compared with typical HDD performance

Scenario Read (MB/sper node)

Write (MB/sper node)

Busy Cluster 1.02 1.09Table2: HDFS performance in a production cluster


Bytes (TB) Nodes Maps Reduces Time / s

HDFS I/O Bytes/s

Aggregate (GB) Per Node (MB)

1 1460 8000 2700 62 32 22.11000 3658 80000 20000 58500 34.2 9.35

Table 3: Sort benchmark

1000TB is too large to fit in the node memory intermediate results spill to disks and occupy disk bandwidth


Operation Throughput (ops/s)

Open file for read 126 100Create file 5600Rename file 8300Delete file 20 700

DataNode Heartbeat 300 000

Blocks report (blocks/s) 639 700

Table4: NameNode throughput benchmark

involve modifying nodes,

can be the bottleneck in large scale

Summary: HDFS: Two Easy Pieces*

• Reliability

• Throughput

*: The title is from two great books: Six Easy Pieces: Essentials Of Physics Explained By Its Most Brilliant Teacher, by Richard P. Feynman, and Operating Systems: Three Easy Pieces, by Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau

Summary: HDFS: Reliability• System Design:

• Split files into blocks and replicate them (typical 3) • For NameNode:

• Checkpoint + Journal can restore the latest image • BackupNode • Snapshot • NameNode is the single point of failure of the whole system - NOT GOOD!

• For DataNodes: • Rack Awareness + Replica Placement Policy, never lose a block if a rack

fails • Replica Management, to avoid blocks to be under-replicated • Snapshot

Summary: HDFS: Throughput

• System Design • Split files into large blocks (128MB) - good for streaming access and

parallel access • Provide APIs that expose the location of blocks - facilitating applications

to schedule computation tasks to where the data reside • NameNode - Not Good for High Throughput and Scalability

• Single node handles all requests from clients and manages all DataNodes • DataNodes

• Rack Awareness & replica placement policy - better utilizing network bandwidth

• Write files in a pipeline way • Read files from the nearest DataNode first

Future Work (Out of Date!)• Automated failover solution

• Zookeeper

• Scalability of the NameNode

• multiple namespaces to share physical storage• Advantages:

• isolate namespaces • improve the overall availability • generalize the block storage

abstraction

• Drawbacks: • management cost

Thank you.

• Reference:

[1] “The Hadoop Distributed File System”. Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler.

[2] “Hadoop HDFS Architecture Explanation and Assumptions”. DATAFLAIR TEAM. https://data-flair.training/blogs/hadoop-hdfs-architecture/

[3] “HDFS Tutorial – A Complete Hadoop HDFS Overview”. DATAFLAIR TEAM. https://data-flair.training/blogs/hadoop-hdfs-tutorial/

[4] “HDFS Architecture”. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

[5] “Understanding Hadoop Clusters and the Network”. Brad Hedlund. http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/

[6] "Speed Considerations". Seagate. https://web.archive.org/web/20110920075313/http://www.seagate.com/www/en-us/support/before_you_buy/speed_considerations

https://data-flair.training/blogs/hadoop-hdfs-architecture/

https://data-flair.training/blogs/hadoop-hdfs-tutorial/

https://data-flair.training/blogs/hadoop-hdfs-tutorial/

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html



http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/

http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/

https://web.archive.org/web/20110920075313/http://www.seagate.com/www/en-us/support/before_you_buy/speed_considerations



The Hadoop Distributed File System - CS | Computer...

Documents

Transcript of The Hadoop Distributed File System - CS | Computer...