The Hadoop Distributed File System - CS | Computer...

41
Presented by Haoran Ma, Yifan Qiao The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA

Transcript of The Hadoop Distributed File System - CS | Computer...

Page 1: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Presented by Haoran Ma, Yifan Qiao

The Hadoop Distributed File System

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo!

Sunnyvale, California USA

Page 2: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Outline

• Introduction

• Architecture

• File I/O Operations and Replica Management

• Practice at YAHOO!

• Future Work

Page 3: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Introduction

• A single dataset is too large —> Divide it and store them on a cluster of commodity hardwares.

- What if one of the physical machines fails?

• Some applications like MapReduce need high throughput of data access.

Page 4: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Introduction

• HDFS is the file system component of Hadoop. It is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications.

• These are achieved by replicating file contents on multiple machines(DataNodes).

Page 5: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Introduction• Very Large Distributed File System

• Assumes Commodity Hardware

- Files are replicated to handle hardware failure

• Optimized for Batch Processing

- Data locations exposed so that computations can move to where data resides

Page 6: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Introduction

Usually 128 MB

Source: HDFS Tutorial – A Complete Hadoop HDFS Overview. DATAFLAIR TEAM.

Page 7: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Architecture

Source: Hadoop HDFS Architecture Explanation and Assumptions. DATAFLAIR TEAM.

Page 8: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Architecture

• Stores meta-data such as number of data blocks, replicas and other details in memory

• Maintains and manages the DataNodes, and assigns tasks to them

Page 9: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Architecture

Store Application Data

Source: Hadoop HDFS Architecture Explanation and Assumptions. DATAFLAIR TEAM.

Page 10: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

ArchitectureHDFS Client: a code library that exports the HDFS file system interface

Page 11: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Architecture

• How does this architecture achieve high fault-tolerance?

• DataNodes Failure

• NameNode Failure

Page 12: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Architecture: Failure Recovery for DataNodes

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

Page 13: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Architecture: Failure Recovery for DataNodes

Block Report

Heartbeat

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

Page 14: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Architecture: Failure Recovery for DataNodes

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

Page 15: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Architecture: Failure Recovery for DataNodes

What if NameNode fails?

Page 16: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Architecture: Failure Recovery for NameNode

Image = Checkpoint + Journal

• Image: The file system metadata that describes the organization of application data as directories and files.

• Checkpoint: A persistent record of the image written to disk.

• Journal: The modification log of the image. It is also stored in the local host’s native file system.

Page 17: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Architecture: Failure Recovery for NameNode

• CheckpointNode:

• Periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal.

• BackupNode:

• A read-only NameNode.

• Maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.

• If the NameNode fails, the BackupNode’s image in memory and the checkpoint on disk is a record of the latest namespace state.

Page 18: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Architecture: Failure Recovery for NameNode

• Snapshots

• To minimize potential damage to the data stored in the system during upgrades.

• Persistently save the current state of the file system(both data and metadata).

• , so that if the upgrade results in data loss or corruption, it is possible to rollback the upgrade and return HDFS to the namespace and storage state as they were at the time of the snapshot.

Page 19: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Architecture: Failure Recovery for NameNode

• Snapshots

• To minimize potential damage to the data stored in the system during upgrades.

• Persistently save the current state of the file system(both data and metadata).

• , so that if the upgrade results in data loss or corruption, it is possible to rollback the upgrade and return HDFS to the namespace and storage state as they were at the time of the snapshot.Copy on Write

Page 20: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Architecture: Failure Recovery for NameNode

Combine Return

NameNode

DataNode (Example)

BackupNode

CheckpointNode

Memory:Image

Disk:

Memory:

SynchronizeImage

CheckpointJournal

New CheckpointEmpty Journal

Snapshot Snapshot (Only Hard Links)

Disk:

Disk:

Page 21: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Usage & Management of HDFS Cluster

• Basic File I/O operations

• Rack Awareness

• Replication Management

Page 22: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

File I/O Operations• Write Files to HDFS: Single writer, multiple reader

(1) addBlock

(2) uniqueblock ids

(3) writeto block

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

Page 23: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

• Write Files to HDFS: Single writer, multiple reader

(1) addBlock

(2) uniqueblock ids

(3) writeto block

1. Client consults the NameNode to get a lease and destination DataNodes

2. Client writes a block to DataNodes in a pipeline way

3. DataNodes replicate blocks 4. Client writes a new block after finishing the previous block

The visibility of the modification is not guaranteed!

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

File I/O Operations

Page 24: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

1. the client consults the NameNode to get the list of blocks and their replicas' locations

2. try the nearest replica first, then the second nearest replica, and so on

• Identifying corrupted data • Checksums-CRC32

• Read Files in HDFS

File I/O Operations

Page 25: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

In-cluster Client Reads a File

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

Page 26: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Outside Client Reads a File

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

Page 27: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Rack Awareness

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

Page 28: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Rack Awareness• Benefits:

• higher throughput • higher reliability:

• an entire rack failure never loses all replicas of a block

• better network bandwidth utilization: • reduce inter-rack and inter-node write traffic as

much as possible

Page 29: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

• The default HDFS replica placement policy: 1. No DataNode contains more than one

replica of any block 2. No rack contains more than two replicas of

the same block, provided there are sufficient racks on the cluster

Rack Awareness

Page 30: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Replication Management• to avoid blocks to be under- or over-replicated

Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

Page 31: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Practice at Yahoo!

• Clusters at Yahoo! can be as large as ~3500 nodes with typical configuration: • 2 quad core Xeon processors @ 2.5Ghz • 4 directly attached SATA drives (1TB each, 4TB total) • 16GB RAM• 1-Gbit Ethernet

• Total 9.8PB of storage available, 3.3PB available for user applications when replicating blocks 3 times

Cluster Basic Information

Page 32: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Practice at Yahoo!• Uncorrelated nodes failure:

• Chance of a node failed during a month ~0.8% (Naive estimation for a node failure probability during a year is ~9.2%)

• Chance of losing a block during a year < 0.5%• Correlated nodes failure:

• HDFS tolerates a rack switch failure • But a core switch failure or cluster power loss can

lose some blocks

Data Durability

Page 33: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Practice at Yahoo!• Benchmarks

Scenario Read (MB/sper node)

Write (MB/sper node)

DFSIO 66 407200 RPM

Desktop HDD[6]

< 130(typical 50-120)

< 130(typical 50-120)

Table1: Contrived benchmark compared with typical HDD performance

Scenario Read (MB/sper node)

Write (MB/sper node)

Busy Cluster 1.02 1.09Table2: HDFS performance in a production cluster

Page 34: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Practice at Yahoo!• Benchmarks

Bytes (TB) Nodes Maps Reduces Time / s

HDFS I/O Bytes/s

Aggregate (GB) Per Node (MB)

1 1460 8000 2700 62 32 22.11000 3658 80000 20000 58500 34.2 9.35

Table 3: Sort benchmark

1000TB is too large to fit in the node memory intermediate results spill to disks and occupy disk bandwidth

Page 35: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Practice at Yahoo!• Benchmarks

Operation Throughput (ops/s)

Open file for read 126 100Create file 5600Rename file 8300Delete file 20 700

DataNode Heartbeat 300 000

Blocks report (blocks/s) 639 700

Table4: NameNode throughput benchmark

involve modifying nodes,

can be the bottleneck in large scale

Page 36: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Summary: HDFS: Two Easy Pieces*

• Reliability

• Throughput

*: The title is from two great books: Six Easy Pieces: Essentials Of Physics Explained By Its Most Brilliant Teacher, by Richard P. Feynman, and Operating Systems: Three Easy Pieces, by Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau

Page 37: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Summary: HDFS: Reliability• System Design:

• Split files into blocks and replicate them (typical 3) • For NameNode:

• Checkpoint + Journal can restore the latest image • BackupNode • Snapshot • NameNode is the single point of failure of the whole system - NOT GOOD!

• For DataNodes: • Rack Awareness + Replica Placement Policy, never lose a block if a rack

fails • Replica Management, to avoid blocks to be under-replicated • Snapshot

Page 38: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Summary: HDFS: Throughput

• System Design • Split files into large blocks (128MB) - good for streaming access and

parallel access • Provide APIs that expose the location of blocks - facilitating applications

to schedule computation tasks to where the data reside • NameNode - Not Good for High Throughput and Scalability

• Single node handles all requests from clients and manages all DataNodes • DataNodes

• Rack Awareness & replica placement policy - better utilizing network bandwidth

• Write files in a pipeline way • Read files from the nearest DataNode first

Page 39: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Future Work (Out of Date!)• Automated failover solution

• Zookeeper

• Scalability of the NameNode

• multiple namespaces to share physical storage• Advantages:

• isolate namespaces • improve the overall availability • generalize the block storage

abstraction

• Drawbacks: • management cost

Page 40: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

Thank you.

Page 41: The Hadoop Distributed File System - CS | Computer Scienceweb.cs.ucla.edu/~harryxu/courses/239/fall19/slides/HDFS... · 2019-09-30 · Architecture: Failure Recovery for NameNode

• Reference:

[1] “The Hadoop Distributed File System”. Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler.

[2] “Hadoop HDFS Architecture Explanation and Assumptions”. DATAFLAIR TEAM. https://data-flair.training/blogs/hadoop-hdfs-architecture/

[3] “HDFS Tutorial – A Complete Hadoop HDFS Overview”. DATAFLAIR TEAM. https://data-flair.training/blogs/hadoop-hdfs-tutorial/

[4] “HDFS Architecture”. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

[5] “Understanding Hadoop Clusters and the Network”. Brad Hedlund. http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/

[6] "Speed Considerations". Seagate. https://web.archive.org/web/20110920075313/http://www.seagate.com/www/en-us/support/before_you_buy/speed_considerations