Hadoop at a glance

HDFS at a glance

Students: An Du – Tan Tran – Toan Do – Vinh Nguyen

Instructor: Professor Lothar Piepmayer

Agenda

1. Design of HDFS
2.1 HDFS Concepts - Blocks
2.2 HDFS Concepts - Namenode and datanodes
3.1 Dataflow - Anatomy of a file read
3.2 Dataflow - Anatomy of a file write
3.3 Dataflow - Coherency model
4. Parallel copying
5. Demo - Command line

The Design of HDFS

- Very large distributed file system: up to 10K nodes, 1 billion files, 100 PB
- Streaming data access: write once, read many times
- Commodity hardware: files are replicated to handle hardware failure
- Detects failures and recovers from them

Poor fit for:

- Low-latency data access
- Lots of small files
- Multiple writers, arbitrary file modifications

HDFS Blocks

- Blocks in a normal filesystem are a few kilobytes
- HDFS has a large block size: 64 MB by default, 128 MB typical
- Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of storage
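The block arithmetic above can be sketched in a few lines. This is only an illustration: the function name split_into_blocks is made up for this example, and the 64 MB default is the one quoted on the slide.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes would use."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover bytes
    return sizes

print([s // MB for s in split_into_blocks(200 * MB)])  # [64, 64, 64, 8]
print([s // MB for s in split_into_blocks(1 * MB)])    # [1]
```

The 1 MB file ends up as a single 1 MB block, not a 64 MB one, which is the point made in the last bullet.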

HDFS Blocks

- A file is stored as blocks on various nodes in the Hadoop cluster
- HDFS creates several replicas of each data block
- Every data block is replicated to multiple nodes across the cluster

Dhruba Borthakur - Design and Evolution of the Apache Hadoop File System HDFS.pdf
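To make the idea of replication concrete, here is a toy placement sketch. It is not how HDFS chooses nodes (the real policy takes racks into account); the round-robin rule, the function name place_replicas and the node names are assumptions for illustration, and 3 is used as the replication factor because it is the usual HDFS default.

```python
def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes, round-robin (toy policy)."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive positions modulo the node count are always distinct nodes.
        placement[block_id] = [
            datanodes[(block_id + i) % len(datanodes)] for i in range(replication)
        ]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical datanode names
for block_id, replicas in place_replicas(4, nodes).items():
    print(block_id, replicas)
```

Losing any single node in this sketch still leaves at least two copies of every block.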

HDFS Blocks

Why are blocks in HDFS so large?

- To minimize the cost of seeks
- With large enough blocks, transfer time dominates seek time, so a large file is read at close to the disk transfer rate
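A quick calculation shows why the block size ends up in the tens of megabytes. The numbers are illustrative assumptions, not measurements: a seek time of about 10 ms, a transfer rate of 100 MB/s, and a target of keeping seek time at 1% of transfer time.

```python
def block_size_for_seek_overhead(seek_time_s, transfer_rate_bytes_per_s, overhead):
    """Block size at which seek time equals `overhead` (a fraction) of transfer time."""
    transfer_time_s = seek_time_s / overhead
    return transfer_time_s * transfer_rate_bytes_per_s

MB = 10**6  # decimal megabytes, to match a transfer rate quoted in MB/s
size = block_size_for_seek_overhead(0.010, 100 * MB, 0.01)
print(size / MB)  # 100.0
```

Under these assumptions a block of roughly 100 MB keeps the seek cost negligible, which is in the same range as the 64 MB and 128 MB sizes mentioned earlier.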

Benefits of the block abstraction

- A file can be larger than any single disk in the network
- Simplifies the storage subsystem
- Provides fault tolerance and availability

Namenode & Datanodes

Namenode (master)
- manages the filesystem namespace
- maintains the filesystem tree and the metadata for all the files and directories in the tree

Datanodes (slaves)
- store data in the local file system
- periodically report back to the namenode with lists of all existing blocks

Clients communicate with both the namenode and the datanodes.
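The split of responsibilities can be sketched with a toy namenode that holds only metadata. The class ToyNamenode, its method names, and the block and node names are all invented for this example; the real namenode is far more involved.

```python
from collections import defaultdict

class ToyNamenode:
    """Holds metadata only: path -> block ids, and block id -> datanodes."""

    def __init__(self):
        self.namespace = {}                      # path -> ordered list of block ids
        self.block_locations = defaultdict(set)  # block id -> datanode names

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        # Forget what this datanode reported before, then record its current blocks.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        return [(b, sorted(self.block_locations[b])) for b in self.namespace[path]]

nn = ToyNamenode()
nn.add_file("/test/a.txt", ["blk_1", "blk_2"])
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_1"])
print(nn.locate("/test/a.txt"))
# [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn1'])]
```

Note that no file data ever passes through the toy namenode; it only answers the question of where each block lives.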

Anatomy of a File Read

Benefits:
- Avoids a bottleneck at the namenode: clients read data directly from the datanodes
- Scales to many concurrent clients
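A rough sketch of the read path: the client asks the namenode only for block locations and then pulls the bytes from datanodes. Everything here (read_file, the dictionaries standing in for the namenode and datanodes, the block names) is a simplified assumption, not the HDFS client API.

```python
def read_file(path, namenode_index, datanodes):
    """Read a file block by block; the namenode is consulted only for locations."""
    data = b""
    for block_id, holders in namenode_index[path]:  # metadata lookup only
        for name in holders:                        # try the replicas in order
            store = datanodes.get(name, {})
            if block_id in store:
                data += store[block_id]             # bytes come from the datanode
                break
        else:
            raise IOError(f"no live replica for {block_id}")
    return data

namenode_index = {
    "/test/a.txt": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])],
}
datanodes = {  # dn1 is missing, simulating a node that is down
    "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn3": {"blk_2": b"world"},
}
print(read_file("/test/a.txt", namenode_index, datanodes))  # b'hello world'
```

Because the data traffic goes to the datanodes, many clients can read at once without the namenode becoming the limiting factor.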

Writing in HDFS

[Figure labels: Namenode, Datanode, Block]

Writing in HDFS

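Exceptions: a datanode fails during a write
- The pipeline is closed, and the failed node's address and its partial block are removed
- The namenode arranges for a new datanode to hold the missing replica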
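The failure handling can be sketched as a small simulation. This is a heavily simplified model under stated assumptions: write_block and replicas_to_add are hypothetical helpers, packets are plain strings, and acknowledgements and queues are left out entirely.

```python
def write_block(packets, pipeline, failing_node=None, fail_at=None):
    """Send packets through a datanode pipeline; drop a node that fails mid-write."""
    pipeline = list(pipeline)
    stored = {name: [] for name in pipeline}
    for i, packet in enumerate(packets):
        if failing_node in pipeline and i == fail_at:
            pipeline.remove(failing_node)  # the failed node leaves the pipeline
            stored.pop(failing_node)       # its partial block is discarded
        for name in pipeline:              # the remaining nodes keep receiving data
            stored[name].append(packet)
    return pipeline, stored

def replicas_to_add(good_nodes, replication=3):
    """How many further replicas the namenode would still need to arrange."""
    return max(0, replication - len(good_nodes))

good, stored = write_block(["p0", "p1", "p2"], ["dn1", "dn2", "dn3"],
                           failing_node="dn2", fail_at=1)
print(good, replicas_to_add(good))  # ['dn1', 'dn3'] 1
```

The write completes on the two healthy nodes, and the block is brought back up to the target replication afterwards.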

Coherency Model

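- Data being written is not guaranteed to be visible to readers while the copy is in progress
- Use sync() to force written data to become visible
- Applications must decide where to call sync()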
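The same idea can be tried on a local file, as an analogy only. The snippet below uses Python's own flush() and os.fsync(), not the HDFS API; in HDFS the sync() call on the output stream plays the corresponding role.

```python
import os
import tempfile

content = b"content"
path = os.path.join(tempfile.mkdtemp(), "f")

with open(path, "wb") as out:
    out.write(content)
    size_before = os.path.getsize(path)  # may still be 0: data can sit in a buffer
    out.flush()                          # push Python's buffer to the operating system
    os.fsync(out.fileno())               # ask the OS to commit the data to disk
    size_after = os.path.getsize(path)

print(size_before, size_after)
```

After the flush and fsync the full length is visible to other readers of the file. The trade-off is the same as in HDFS: syncing more often loses less data on failure but costs throughput.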

Parallel copying in HDFS

Transfer data between clusters:

% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar

- Implemented as a MapReduce job; each file is copied by a single map
- Each map copies at least 256 MB
- The default maximum is 20 maps per node
- Copying between clusters running different HDFS versions is only supported over the webhdfs protocol:

% hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
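The two sizing rules above (at least 256 MB per map, at most 20 maps per node) can be combined into a rough estimate. The helper estimate_maps is a sketch for this talk, not part of distcp, and it ignores the fact that the number of maps cannot exceed the number of files.

```python
MB = 1024 * 1024
GB = 1024 * MB

def estimate_maps(total_bytes, cluster_nodes, min_bytes_per_map=256 * MB,
                  max_maps_per_node=20):
    """Rough number of map tasks for a copy, derived from the two rules above."""
    by_size = max(1, total_bytes // min_bytes_per_map)  # at least 256 MB per map
    cap = cluster_nodes * max_maps_per_node             # at most 20 maps per node
    return min(by_size, cap)

print(estimate_maps(1 * GB, 3))       # 4
print(estimate_maps(1000 * GB, 100))  # 2000
```

In the second case the cap applies, so each map copies about 512 MB on average.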

Setup

Cluster with 3 nodes:
- 4 GB RAM
- 2 CPUs @ 2.0 GHz+
- 100 GB HDD

- Running in VMware on 3 different servers
- Network: 100 Mbps
- Operating system: Ubuntu 11.04
- Windows: not tested

Setup Guide - Single Node

- Java runtime
- ssh
- http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
- /etc/hadoop/core-site.xml
- /etc/hadoop/hdfs-site.xml

Cluster

- /etc/hadoop/masters
- /etc/hadoop/slaves
- http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html






Command Line

Similar to *nix:

% hadoop fs -ls /
% hadoop fs -mkdir /test
% hadoop fs -rmr /test
% hadoop fs -cp /1 /2
% hadoop fs -copyFromLocal /3 hdfs://localhost/

Namenode-specific:

% hadoop namenode -format
% start-all.sh

Command Line

Sorting: the standard method to test a cluster
- TeraGen: generates dummy data
- TeraSort: sorts the data
- TeraValidate: validates the sorted result

Command line:

% hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted41
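TeraGen is given a number of rows rather than a byte size, and each row is 100 bytes, so the input for a run has to be converted first. The sketch below assumes decimal gigabytes and reuses the jar path and input directory from the command above; teragen_rows is a hypothetical helper, and the command is only printed, not executed.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows

def teragen_rows(target_bytes):
    """Number of rows to request from TeraGen for roughly target_bytes of data."""
    return target_bytes // ROW_BYTES

rows = teragen_rows(10 * 10**9)  # 10 GB, assuming decimal gigabytes
command = [
    "hadoop", "jar", "/usr/share/hadoop/hadoop-examples-1.0.3.jar",
    "teragen", str(rows), "hdfs://ubuntu/10GdataUnsorted",
]
print(" ".join(command))
```

For 10 GB this requests 100,000,000 rows.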

Benchmark Result

- 2 nodes, 1 GB data: 0:03:38
- 3 nodes, 1 GB data: 0:03:13
- 2 nodes, 10 GB data: 0:38:07
- 3 nodes, 10 GB data: 0:31:28

The virtual machines' hard disks are the bottleneck.
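The speedup from adding the third node can be worked out directly from the four timings. The helper to_seconds is just for this calculation; the 1.50x figure is the ideal speedup when going from 2 to 3 nodes.

```python
def to_seconds(hms):
    """Convert an h:mm:ss string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

results = {
    "1GB": {"2 nodes": "0:03:38", "3 nodes": "0:03:13"},
    "10GB": {"2 nodes": "0:38:07", "3 nodes": "0:31:28"},
}
for size, times in results.items():
    speedup = to_seconds(times["2 nodes"]) / to_seconds(times["3 nodes"])
    print(f"{size}: {speedup:.2f}x faster with 3 nodes (ideal would be 1.50x)")
# 1GB: 1.13x faster with 3 nodes (ideal would be 1.50x)
# 10GB: 1.21x faster with 3 nodes (ideal would be 1.50x)
```

Both runs fall well short of the ideal, which is consistent with the disks rather than the node count limiting throughput.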

Who wins…?

References

Tom White, Hadoop: The Definitive Guide
