Hadoop at a glance
HDFS at a glance
Students: An Du – Tan Tran – Toan Do – Vinh Nguyen
Instructor: Professor Lothar Piepmayer
Agenda
1. Design of HDFS
2.1 HDFS Concepts - Blocks
2.2 HDFS Concepts - Namenode and datanodes
3.1 Dataflow - Anatomy of a file read
3.2 Dataflow - Anatomy of a file write
3.3 Dataflow - Coherency model
4. Parallel copying
5. Demo - Command line
The Design of HDFS
Very large distributed file system
- Up to 10K nodes, 1 billion files, 100 PB
Streaming data access
- Write once, read many times
Commodity hardware
- Files are replicated to handle hardware failure
- Detect failures and recover from them
Poor fit for
- Low-latency data access
- Lots of small files
- Multiple writers, arbitrary file modifications
HDFS Blocks
Normal filesystem blocks are a few kilobytes
HDFS has a large block size
- Default 64 MB
- Typical 128 MB
Unlike in a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block of underlying storage.
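The block rule above can be illustrated with a small Python sketch. The function name `block_layout` is hypothetical and is not part of any Hadoop API; it only models how a file's bytes would be divided into blocks, assuming the 64 MB default.

```python
MB = 1024 * 1024

def block_layout(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the remaining bytes, not a full block.
        sizes.append(remainder)
    return sizes

layout = block_layout(200 * MB)
print(len(layout), [size // MB for size in layout])  # 4 [64, 64, 64, 8]
```

A 1 MB file in this model yields a single 1 MB block, which matches the point that a small file does not consume a full block of storage.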
HDFS Blocks
A file is stored as blocks on various nodes in a Hadoop cluster.
HDFS creates several replicas of each data block.
Every data block is replicated to multiple nodes across the cluster.
Source: Dhruba Borthakur, Design and Evolution of the Apache Hadoop File System (HDFS)
HDFS Blocks
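Why are blocks in HDFS so large?
Minimize the cost of seeks
=> With large enough blocks, seek time is small compared to transfer time, so a large file is transferred at close to the disk transfer rate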
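As a rough worked example of the seek-cost argument: the figures below (10 ms seek time, 100 MB/s transfer rate, seek time limited to 1% of transfer time) are illustrative assumptions, not measurements from this deck, and `min_block_size_mb` is a hypothetical helper.

```python
def min_block_size_mb(seek_ms, transfer_mb_per_s, seek_fraction):
    """Smallest block size (MB) for which one seek costs at most
    seek_fraction of the time spent transferring the block."""
    transfer_seconds = (seek_ms / 1000.0) / seek_fraction
    return transfer_seconds * transfer_mb_per_s

# Assumed disk: 10 ms seek, 100 MB/s transfer, seek limited to 1% of transfer time.
print(min_block_size_mb(10, 100, 0.01))
```

Under these assumptions the block needs to be about 100 MB, which is in the same range as the 64 MB default and 128 MB typical sizes.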
Benefits of the block abstraction
A file can be larger than any single disk in the network
Simplifies the storage subsystem
Provides fault tolerance and availability
Namenode & Datanodes
Namenode (master)
- Manages the filesystem namespace
- Maintains the filesystem tree and the metadata for all files and directories in the tree
Datanodes (slaves)
- Store data in the local file system
- Periodically report back to the namenode with lists of all existing blocks
Clients communicate with both the namenode and the datanodes.
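The division of labour above can be sketched as a toy model. `ToyNamenode` and its methods are hypothetical names for illustration only; they do not correspond to the real Hadoop classes, and the sketch ignores persistence, heartbeats, and replication targets.

```python
from collections import defaultdict

class ToyNamenode:
    """Holds metadata only: file -> block IDs, and block ID -> datanodes."""

    def __init__(self):
        self.file_blocks = {}                    # path -> [block_id, ...]
        self.block_locations = defaultdict(set)  # block_id -> {datanode, ...}

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Replace what is known about one datanode with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return [(block_id, sorted datanode names)] in file order."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]

namenode = ToyNamenode()
namenode.add_file("/foo", ["blk_1", "blk_2"])
namenode.block_report("dn1", ["blk_1", "blk_2"])
namenode.block_report("dn2", ["blk_1"])
print(namenode.locate("/foo"))
```

Note that the block locations are rebuilt purely from datanode reports; the toy namenode never stores file contents.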
Namenode & Datanodes
Anatomy of a File Read
Anatomy of a File Read
Benefits:
- Avoids a bottleneck at the namenode: data flows directly between clients and datanodes
- Supports many concurrent clients
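A minimal sketch of the read path, assuming the client has already obtained the block locations from the namenode and reads each block from the closest datanode. The function `read_file` and the `distance` table are hypothetical stand-ins; the real client works with streams and network topology rather than dictionaries.

```python
def read_file(block_locations, datanodes, distance):
    """Assemble a file block by block.

    block_locations: ordered [(block_id, [datanode names])], as a namenode would return
    datanodes:       {datanode name: {block_id: bytes}}
    distance:        {datanode name: number}, smaller means closer to the client
    """
    data = b""
    served_by = []
    for block_id, holders in block_locations:
        closest = min(holders, key=lambda name: distance[name])
        data += datanodes[closest][block_id]  # bytes come from the datanode, not the namenode
        served_by.append(closest)
    return data, served_by

locations = [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])]
nodes = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn3": {"blk_2": b"world"},
}
print(read_file(locations, nodes, {"dn1": 4, "dn2": 2, "dn3": 0}))
```

The namenode only appears here as the source of `block_locations`, which is why it can serve many clients at once.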
Writing in HDFS
Namenode, Datanode, Block
Writing in HDFS
Exceptions: a datanode fails
- The pipeline is closed; the failed node's address is removed from it and its partial block is discarded
- The namenode arranges for a replica on a new datanode
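The failure handling can be sketched as a toy simulation. Everything here (`write_block`, its parameters, the `spares` list) is hypothetical and greatly simplified: the real client also re-queues unacknowledged packets, which this sketch does not model.

```python
def write_block(packets, pipeline, failed_node=None, fail_at_packet=None, spares=()):
    """Send packets down a datanode pipeline, dropping one node if it fails.

    Returns (stored, replacement): the packets held per datanode, and the
    datanode chosen afterwards to restore the replica count (or None).
    """
    pipeline = list(pipeline)
    stored = {node: [] for node in pipeline}
    for index, packet in enumerate(packets):
        if failed_node in pipeline and index == fail_at_packet:
            pipeline.remove(failed_node)  # close pipeline, drop the failed node's address
            stored.pop(failed_node)       # its partial block is discarded
        for node in pipeline:
            stored[node].append(packet)
    replacement = None
    if failed_node is not None and failed_node not in pipeline and pipeline and spares:
        # Later, the namenode arranges a further replica on another datanode.
        replacement = spares[0]
        stored[replacement] = list(stored[pipeline[0]])
    return stored, replacement

stored, replacement = write_block(
    [b"p0", b"p1", b"p2"], ["dn1", "dn2", "dn3"],
    failed_node="dn2", fail_at_packet=1, spares=["dn4"],
)
print(sorted(stored), replacement)
```

In this run the write completes on the two healthy datanodes, and the spare node receives a copy afterwards.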
Coherency Model
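- Data being written is not guaranteed to be visible to other readers while the write is in progress
- Use sync() to force written data to become visible
- Call it at suitable points in applications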
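The visibility rule can be illustrated with a toy stream. `ToyOutputStream` is a hypothetical class, not the Hadoop output stream; it assumes the simplest possible model in which nothing is visible until `sync()` or `close()` is called (real HDFS also makes completed blocks visible).

```python
class ToyOutputStream:
    """Models a writer whose data is hidden from readers until it is synced."""

    def __init__(self):
        self._buffer = b""   # written but not yet visible
        self._visible = b""  # what a concurrent reader would see

    def write(self, data):
        self._buffer += data

    def sync(self):
        self._visible += self._buffer
        self._buffer = b""

    def close(self):
        self.sync()  # closing the stream also makes the data visible

    def visible_to_readers(self):
        return self._visible

out = ToyOutputStream()
out.write(b"content")
print(out.visible_to_readers())  # b''
out.sync()
print(out.visible_to_readers())  # b'content'
```

The trade-off for applications is that each sync adds overhead, so it is usually called at meaningful points, such as after a batch of records, rather than after every write.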
Parallel copying in HDFS
Transfer data between clusters:
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
Implemented as a MapReduce job; each file is copied by a single map
Each map copies at least 256 MB
Default maximum is 20 maps per node
Between clusters running different HDFS versions, use an HTTP-based protocol such as webhdfs:
% hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
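The two sizing rules (at least 256 MB per map, at most 20 maps per node) can be combined into a rough estimate. `distcp_map_count` is a hypothetical helper that only models these two defaults; the actual job may choose differently depending on version and options.

```python
MB = 1024 * 1024
GB = 1024 * MB

def distcp_map_count(total_bytes, nodes, min_bytes_per_map=256 * MB, max_maps_per_node=20):
    """Estimate the number of map tasks from the two default limits."""
    wanted = max(1, total_bytes // min_bytes_per_map)  # each map gets at least 256 MB
    return min(wanted, nodes * max_maps_per_node)      # capped at 20 maps per node

print(distcp_map_count(1 * GB, nodes=3))       # 4
print(distcp_map_count(1000 * GB, nodes=100))  # 2000
```

In the second case the cap applies, so each map ends up copying about 512 MB on average.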
Setup
Cluster with 3 nodes: 4 GB RAM, 2 CPUs @ 2.0 GHz+, 100 GB HDD
Running on VMware on 3 different servers
Network: 100 Mbps
Operating system: Ubuntu 11.04
Windows: not tested
Setup Guide - Single Node
Prerequisites: Java runtime, ssh
Guide: http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
Configuration files:
/etc/hadoop/core-site.xml
/etc/hadoop/hdfs-site.xml
Cluster
Configuration files:
/etc/hadoop/masters
/etc/hadoop/slaves
Guide: http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html
Command Line
Similar to *nix:
hadoop fs -ls /
hadoop fs -mkdir /test
hadoop fs -rmr /test
hadoop fs -cp /1 /2
hadoop fs -copyFromLocal /3 hdfs://localhost/
Namenode-specific:
hadoop namenode -format
start-all.sh
Command Line
Sorting: standard method to test a cluster
TeraGen: generate dummy data
TeraSort: sort
TeraValidate: validate the sort result
Command line:
hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted41
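The input for such a run has to be generated first. TeraGen takes a row count rather than a byte size, and each row it writes is 100 bytes, so a small helper can work out the argument. The helper names below are hypothetical, and treating "10G" as 10 * 10^9 bytes is an assumption; the jar and output paths are the ones from the slide.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows

def teragen_rows(target_bytes):
    """Number of rows needed to generate roughly target_bytes of data."""
    return target_bytes // ROW_BYTES

def teragen_command(jar, target_bytes, output_dir):
    """Build the argument list for a TeraGen run (the command is not executed here)."""
    return ["hadoop", "jar", jar, "teragen", str(teragen_rows(target_bytes)), output_dir]

command = teragen_command(
    "/usr/share/hadoop/hadoop-examples-1.0.3.jar",
    10 * 10**9,  # assumed meaning of "10G"
    "hdfs://ubuntu/10GdataUnsorted",
)
print(" ".join(command))
```

The printed line can be pasted into a shell on a node where the hadoop command is available.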
Benchmark Result
2 nodes, 1 GB data: 0:03:38
3 nodes, 1 GB data: 0:03:13
2 nodes, 10 GB data: 0:38:07
3 nodes, 10 GB data: 0:31:28
The virtual machines' hard disks are the bottleneck
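To put the timings in perspective, the speedup from adding the third node can be computed directly from the figures above. The helper names are hypothetical; the times are the measured results from this slide.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

results = {
    ("1GB", 2): "0:03:38",
    ("1GB", 3): "0:03:13",
    ("10GB", 2): "0:38:07",
    ("10GB", 3): "0:31:28",
}

def speedup(size):
    """Ratio of the 2-node time to the 3-node time for one data size."""
    return to_seconds(results[(size, 2)]) / to_seconds(results[(size, 3)])

for size in ("1GB", "10GB"):
    print(size, round(speedup(size), 2))
```

The speedups come out at roughly 1.13x and 1.21x, well below the 1.5x that perfect scaling from 2 to 3 nodes would give, which is consistent with the disks being the limiting factor.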
Who wins…
?
References
Tom White, Hadoop: The Definitive Guide