
HADOOP DISTRIBUTED FILE SYSTEM HDFS Reliability Based on “The Hadoop Distributed File System” K. Shvachko et al., MSST 2010 Michael Tsitrin 26/05/13


Page 1

HADOOP DISTRIBUTED FILE SYSTEM: HDFS Reliability

Based on “The Hadoop Distributed File System”

K. Shvachko et al., MSST 2010

Michael Tsitrin

26/05/13

Page 2

Topics
• Introduction
• HDFS Overview
  • Basics
  • Architecture
• Data reliability
  • Block replicas
• NameNode reliability
  • NameNode failure
  • Journal
  • Checkpoint
• Conclusion

Page 3

INTRODUCTION

Page 4

Introduction
• HDFS is a cloud-based file system that allows storage of large data sets on clusters of commodity hardware
• With a huge number of components, each having a non-trivial probability of failure, hardware failure is the norm rather than the exception
• This presentation covers the techniques HDFS uses to keep the system and its data fully reliable

Page 5

HDFS OVERVIEW

Page 6

HDFS Basics
• An open-source implementation of a distributed file system, based on the Google File System
• Designed to store very large data sets reliably across large clusters of computers
• Optimized for MapReduce applications:
  • Large files, some several GB in size
  • Reads are performed in a large streaming fashion
  • High throughput is favored over low latency

Page 7

HDFS Architecture
[Architecture diagram: the NameNode holds the metadata (name, replicas, block mapping, e.g. /home/foo/data, 6, ...). Clients send metadata ops to the NameNode and read/write blocks directly from/to the DataNodes, which are spread across racks (Rack 1, Rack 2). The NameNode issues block ops to the DataNodes, and blocks are replicated between DataNodes.]

Page 8

HDFS NameNode
• The HDFS NameNode keeps the metadata for each data block in the system
• Implemented as a single master server for a cluster
• To achieve high performance, the entire namespace is kept in RAM
• Manages the replication logic for the DataNodes
• Serves clients with file block locations for reads
• Metadata includes:
  • Files and directories hierarchy
  • Permissions, modification time, etc.
  • Mapping of file blocks to DataNodes
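The metadata split described above can be sketched as two structures: a namespace tree that is persisted (hierarchy, permissions, block lists) and a block-to-DataNode map that is not, because it can be rebuilt from block reports. The field and variable names here are illustrative assumptions, not the NameNode's actual classes:

```python
from dataclasses import dataclass, field

@dataclass
class INode:
    """One entry in the in-memory namespace tree (a simplified sketch)."""
    name: str
    is_dir: bool
    permissions: str = "rw-r--r--"
    mtime: float = 0.0
    block_ids: list = field(default_factory=list)  # files only
    children: dict = field(default_factory=dict)   # name -> INode, dirs only

# The block-to-DataNode mapping is kept separately: it is rebuilt from
# block reports at runtime, so it need not be persisted with the namespace.
block_locations: dict = {}  # block ID -> set of DataNode IDs
```

Keeping the location map out of the persistent image is what lets DataNodes join or leave without the NameNode rewriting its namespace record.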

Page 9

HDFS DataNode
• A cluster can contain thousands of DataNodes
• The DataNode is where the actual file blocks are kept
• User data is divided into blocks and replicated across DataNodes
• A DataNode identifies the block replicas in its possession to the NameNode by sending a block report
• DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the NameNode

Page 10

DATA RELIABILITY
Block replicas

Page 11

NameNode & Data Replication
• All data-replication information is stored and managed by the NameNode
• The NameNode makes all decisions regarding the replication of blocks
• It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster
  • A Blockreport contains a list of all blocks on a DataNode
  • Receipt of a Heartbeat implies that the DataNode is functioning properly
  • DataNodes without a recent Heartbeat are marked as dead
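The NameNode's per-DataNode bookkeeping for Heartbeats and Blockreports can be sketched as below. The class name and the 600-second timeout are illustrative assumptions, not HDFS's actual implementation:

```python
import time

HEARTBEAT_EXPIRE_SECS = 600  # illustrative timeout, not the actual HDFS default

class DataNodeInfo:
    """NameNode-side view of one DataNode (a simplified sketch)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.last_heartbeat = time.monotonic()
        self.blocks = set()

    def on_heartbeat(self):
        # A heartbeat proves the node is alive; just refresh the timestamp.
        self.last_heartbeat = time.monotonic()

    def on_block_report(self, block_ids):
        # A full block report replaces the NameNode's view of this node's replicas.
        self.blocks = set(block_ids)

    def is_dead(self, now=None):
        now = time.monotonic() if now is None else now
        return now - self.last_heartbeat > HEARTBEAT_EXPIRE_SECS
```

A node declared dead this way is not contacted again for reads, and every block it held has one fewer live replica, which feeds directly into the re-replication logic.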

Page 12

Re-replication
• The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary
• The need for re-replication may arise for many reasons:
  • a DataNode may become unavailable
  • the replication factor of a file may be increased
  • a replica may become corrupted
  • a hard disk on a DataNode may fail
• Re-replication is fast because it is a parallel problem that scales with the size of the cluster
• This lowers the probability of block loss while replication is carried out
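The tracking step above amounts to scanning the block-to-replica map and ordering the shortfalls so the blocks closest to being lost are fixed first. A minimal sketch (the function name and the plain-dict input are assumptions for illustration):

```python
def blocks_to_replicate(block_locations, replication_factor=3):
    """Return under-replicated block IDs, fewest live replicas first,
    so the most at-risk blocks are re-replicated soonest.

    block_locations maps block ID -> set of DataNodes holding a live replica.
    """
    needy = [(len(nodes), block_id)
             for block_id, nodes in block_locations.items()
             if len(nodes) < replication_factor]
    # Sorting by replica count puts one-replica blocks ahead of two-replica ones.
    return [block_id for _, block_id in sorted(needy)]
```

Because each under-replicated block can be copied from any surviving replica to any target node, the resulting copy jobs run in parallel across the cluster, which is why re-replication scales with cluster size.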

Page 13

Replica placement
• To protect against rack failure (e.g. a power outage), the NameNode can manage replicas so that they are stored in different racks
• Besides data reliability, this can also improve network bandwidth utilization and clients’ latency
• Common case (replication factor == 3):
  • Put one replica on one node in the local rack
  • Another on a different node in the local rack
  • The last on a different node in a different rack
• This policy doesn’t compromise data reliability or availability
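The common-case policy above can be sketched as a small target-selection function. This is an illustrative simplification, not HDFS's actual BlockPlacementPolicy code; the `racks` mapping is an assumed input shape:

```python
import random

def place_replicas(local_rack, racks):
    """Pick targets for the 3-replica common case: two distinct nodes on
    the writer's rack, plus one node on a different rack.

    racks maps rack name -> list of node names on that rack.
    """
    # Two different nodes on the local rack (cheap writes, node-failure safety).
    first, second = random.sample(racks[local_rack], 2)
    # One node on any other rack, so a whole-rack failure cannot lose the block.
    remote_rack = random.choice([r for r in racks if r != local_rack])
    third = random.choice(racks[remote_rack])
    return [first, second, third]
```

Two local replicas keep the write pipeline mostly within one rack's switch, while the single remote replica is what survives a rack-wide power outage.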

Page 14

NAMENODE RELIABILITY

Page 15

NameNode Failure
• The NameNode is a Single Point of Failure for an HDFS cluster
  • If it becomes unavailable to clients, the whole cluster is unusable
  • On corruption / loss of its metadata, data blocks become unavailable
• The NameNode keeps its data in RAM, so a power failure means full data loss
  • A persistent solution is needed

Page 16

NameNode Persistence
• The persistent record of the image, stored in the local host’s native file system, is called a checkpoint
• The NameNode also stores the modification log of the image, called the journal, in the local host’s native file system
• For improved durability, redundant copies of the checkpoint and journal can be made at other servers

Page 17

Journal
• The journal persistently records every change that occurs to the file system metadata (not including the block mapping)
• Implemented as a write-ahead commit log for changes to the file system that must be persistent
  • To avoid becoming a bottleneck, several transactions are batched and committed together

Page 18

Checkpoint
• A checkpoint is a persistent record of the NameNode’s state written to disk
• The checkpoint file is never changed in place by the NameNode
  • Either a new checkpoint is created, or the namespace is loaded from a previous checkpoint by the NameNode
• When the NameNode starts, it performs the checkpoint process:
  • reads the current checkpoint and journal from disk
  • applies all the transactions from the journal to the in-memory representation of the namespace
  • flushes this new version out into a new checkpoint on disk
  • truncates the old journal

Page 19

Creating a Checkpoint
• A new checkpoint file can be created either at startup only, or periodically
• Creating a checkpoint empties the journal:
  • A long journal increases the probability of loss or corruption of the journal file
  • A very large journal extends the time required to restart the NameNode
• To create periodic checkpoints, a dedicated server is required (the Checkpoint Node), since it has the same memory requirements as the NameNode

Page 20

CONCLUSION

Page 21

Conclusion
• HDFS has a good reliability model, which can handle the expected hardware failures
• While several techniques are in use to achieve namespace fault tolerance, the NameNode is still a single point of failure in the system
• Many reliability parameters are configurable and can be changed to fit system demands:
  • Replica count
  • Rack scattering policy
  • Checkpoint and journal redundancy