Hadoop at a glance


Description

"An Elephan can't jump. But can carry heavy load". Besides Facebook and Yahoo!, many other organizations are using Hadoop to run large distributed computations: Amazon.com, Apple, eBay, IBM, ImageShack, LinkedIn, Microsoft, Twitter, The New York Times...

Transcript of Hadoop at a glance

Page 1: Hadoop at a glance

HDFS at a glance

Students: An Du – Tan Tran – Toan Do – Vinh Nguyen

Instructor: Professor Lothar Piepmayer

Page 2: Hadoop at a glance

Agenda

1. Design of HDFS
2.1. HDFS Concepts – Blocks
2.2. HDFS Concepts – Namenode and Datanode
3.1. Dataflow – Anatomy of a file read
3.2. Dataflow – Anatomy of a file write
3.3. Dataflow – Coherency model
4. Parallel copying
5. Demo – Command line

Page 3: Hadoop at a glance

The Design of HDFS

Very large distributed file system: up to 10K nodes, 1 billion files, 100 PB

Streaming data access: write once, read many times

Commodity hardware: files are replicated to handle hardware failure

Detect failures and recover from them

Page 4: Hadoop at a glance

Poor fit for

Low-latency data access
Lots of small files
Multiple writers, arbitrary file modifications

Page 5: Hadoop at a glance

HDFS Blocks

Normal filesystem blocks are a few kilobytes; HDFS uses a much larger block size

Default 64 MB, typically configured to 128 MB

Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage

Page 6: Hadoop at a glance

HDFS Blocks

A file is stored as blocks on various nodes in the Hadoop cluster.

HDFS creates several replicas of each data block.

Every data block is replicated to multiple nodes across the cluster.
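Block size and replication are per-file settings. As a rough illustration (not from the slides), the Java sketch below creates a file with an explicit 128 MB block size and 3 replicas through the FileSystem API; the namenode URI and the path are placeholder values.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode URI; adjust to the cluster's fs.default.name
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);

        // Create a file with an explicit 128 MB block size and 3 replicas.
        // A file smaller than the block size still only occupies the space it needs.
        FSDataOutputStream out = fs.create(
                new Path("/demo/large-file"),  // placeholder path
                true,                          // overwrite if the file exists
                4096,                          // I/O buffer size in bytes
                (short) 3,                     // replication factor
                128L * 1024 * 1024);           // block size in bytes
        out.writeBytes("hello HDFS blocks\n");
        out.close();
        fs.close();
    }
}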

Page 7: Hadoop at a glance

Source: Dhruba Borthakur, "Design and Evolution of the Apache Hadoop File System (HDFS)"

HDFS Blocks

Page 8: Hadoop at a glance

Why are HDFS blocks so large?

To minimize the cost of seeks: when a block is large enough, the time spent transferring the data dominates the time spent seeking, so a large file is read at close to the disk transfer rate. For example, with a 10 ms seek time and a 100 MB/s transfer rate, a block of about 100 MB keeps the seek overhead at roughly 1% of the transfer time.

Page 9: Hadoop at a glance

Benefits of the block abstraction

A file can be larger than any single disk in the network

Simplifies the storage subsystem
Provides fault tolerance and availability

Page 10: Hadoop at a glance

Namenode & Datanodes

Page 11: Hadoop at a glance

Namenode (master)
– manages the filesystem namespace
– maintains the filesystem tree and metadata for all the files and directories in the tree

Datanodes (slaves)
– store block data in the local filesystem
– periodically report back to the namenode with lists of the blocks they are storing

Clients communicate with both the namenode and the datanodes.

Namenode & Datanodes

Page 12: Hadoop at a glance

Anatomy of a File Read

Page 13: Hadoop at a glance

Anatomy of a File Read

Benefits:
– avoids a bottleneck at the namenode, since block data is read directly from the datanodes
– scales to many concurrent clients
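A minimal Java sketch of a client read, assuming a local single-node cluster and a placeholder path: open() consults the namenode for block locations, and the returned stream then reads the bytes directly from the datanodes.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode URI and path
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), new Configuration());

        // open() asks the namenode for the locations of the file's first blocks;
        // the returned stream then reads block data directly from the closest
        // datanodes, so bulk traffic never passes through the namenode.
        FSDataInputStream in = fs.open(new Path("/demo/large-file"));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
        fs.close();
    }
}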

Page 14: Hadoop at a glance

Writing in HDFS

(Diagram: client, namenode, datanodes, and blocks in the write pipeline)

Page 15: Hadoop at a glance
Page 16: Hadoop at a glance

Writing in HDFS

Exceptions: when a datanode in the pipeline fails
– the pipeline is closed, and the failed node (with its partial block) is removed from the pipeline
– the namenode arranges a new datanode so the block reaches its target replication
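A minimal Java sketch of the write path, again with placeholder URI and path: create() registers the file with the namenode, the stream writes packets along the datanode pipeline, and close() waits for the pipeline acknowledgements. Replication and pipeline recovery happen behind this API.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode URI and path
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), new Configuration());

        // create() asks the namenode to record the new file in the namespace;
        // the stream then pushes packets along a pipeline of datanodes, which
        // replicate each block and acknowledge back up the pipeline.
        FSDataOutputStream out = fs.create(new Path("/demo/output.txt"));
        for (int i = 0; i < 1000; i++) {
            out.writeBytes("line " + i + "\n");
        }
        // close() flushes the remaining packets, waits for the acknowledgements,
        // and signals the namenode that the file is complete.
        out.close();
        fs.close();
    }
}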

Page 17: Hadoop at a glance

Coherency Model

Content written to a file is not guaranteed to be visible to readers until the current block is complete
Use sync() to force the written data to be visible to new readers
Applications should call sync() at suitable points, or risk losing a window of data on failure
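A hedged Java sketch of the coherency model, with a placeholder path: without sync(), a new reader may not see data beyond the last complete block; sync() (renamed hflush() in later Hadoop releases) makes the written bytes visible.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncExample {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode URI and path
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), new Configuration());

        FSDataOutputStream out = fs.create(new Path("/demo/log.txt"));
        out.writeBytes("record 1\n");
        // At this point a new reader may still see an empty file.

        out.sync(); // forces the data to the datanodes and makes it visible
                    // to new readers (renamed hflush() in Hadoop 2 and later)

        out.writeBytes("record 2\n");
        out.close(); // closing the file also makes the remaining data visible
        fs.close();
    }
}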

Page 18: Hadoop at a glance

Parallel copying in HDFS

Transfer data between clusters:
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar

Implemented as a MapReduce job with no reducers; each file is copied by a single map
Each map copies at least 256 MB of data
The default maximum is 20 maps per node
Copying between different HDFS versions is only supported over the webhdfs protocol:
% hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar

Page 19: Hadoop at a glance

Setup

Cluster with 3 nodes, each with 4 GB RAM, 2 CPUs @ 2.0 GHz+, 100 GB HDD

VMware virtual machines on 3 different servers
Network: 100 Mbps
Operating system: Ubuntu 11.04

Windows: Not tested

Page 20: Hadoop at a glance

Setup Guide - Single Node

Prerequisites: Java runtime, ssh
http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
Configuration files: /etc/hadoop/core-site.xml, /etc/hadoop/hdfs-site.xml

Page 22: Hadoop at a glance

Command Line

Similar to *nix:
hadoop fs -ls /
hadoop fs -mkdir /test
hadoop fs -rmr /test
hadoop fs -cp /1 /2
hadoop fs -copyFromLocal /3 hdfs://localhost/

Namenode-specific:
hadoop namenode -format
start-all.sh

Page 23: Hadoop at a glance

Command Line

Sorting: the standard method to test a cluster
TeraGen: generate dummy data
TeraSort: sort the data
TeraValidate: validate the sort result

Command line:
% hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted

Page 24: Hadoop at a glance

Benchmark Results

2 Nodes, 1GB data: 0:03:38 3 Nodes, 1GB data: 0:03:13

2 Nodes, 10GB data: 0:38:07 3 Nodes, 10GB data: 0:31:28

The virtual machines' hard disks are the bottleneck

Page 25: Hadoop at a glance

Who wins…

?

Page 26: Hadoop at a glance

References

Tom White, Hadoop: The Definitive Guide (O'Reilly)