Hands-On Hadoop Tutorial


Transcript of Hands-On Hadoop Tutorial

Page 1: Hands-On Hadoop Tutorial

Jian Wang

Based on “Meet Hadoop! Open Source Grid Computing” by Devaraj Das

Yahoo! Inc. Bangalore & Apache Software Foundation

Page 2: Hands-On Hadoop Tutorial

Need to process 10TB datasets.

On 1 node:

◦ scanning @ 50MB/s = 2.3 days

On a 1000-node cluster:

◦ scanning @ 50MB/s = 3.3 min

Need an efficient, reliable, and usable framework:
◦ Google File System (GFS) paper
◦ Google's MapReduce paper
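The scan-time arithmetic above can be checked directly; a quick sketch (decimal units, 1 TB = 10^6 MB):

```python
# Back-of-the-envelope check of the scan times quoted above.
dataset_mb = 10 * 10**6        # 10 TB expressed in MB (decimal units)
rate_mb_per_s = 50             # single-node sequential scan rate

one_node_s = dataset_mb / rate_mb_per_s    # 200,000 seconds
print(one_node_s / 86400)                  # ~2.3 days

cluster_s = one_node_s / 1000              # ideal 1000-way parallel scan
print(cluster_s / 60)                      # ~3.3 minutes
```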

Page 3: Hands-On Hadoop Tutorial

Hadoop uses HDFS, a distributed file system based on GFS, as its shared file system.
◦ Files are divided into large blocks (64MB) and distributed across the cluster
◦ Blocks are replicated to handle hardware failure
◦ Current block replication is 3 (configurable)
◦ HDFS cannot be directly mounted by an existing operating system

Once you use the DFS (put something in it), relative paths are resolved from /user/{your user id}. E.g., if your id is jwang30, your "home dir" is /user/jwang30.
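To make the block and replication numbers concrete, here is a small sketch of how HDFS would account for a hypothetical 200MB file under the defaults above (the file size is a made-up example):

```python
import math

# HDFS defaults quoted above; the 200 MB file size is illustrative.
block_mb = 64
replication = 3
file_mb = 200

blocks = math.ceil(file_mb / block_mb)   # 4 blocks (the last one is partial)
replicas = blocks * replication          # 12 block replicas stored cluster-wide
print(blocks, replicas)
```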

Page 4: Hands-On Hadoop Tutorial

Master-Slave Architecture

The master node (irkm-1) runs the Namenode (HDFS) and the Jobtracker (MapReduce):
◦ Accepts MR jobs submitted by users
◦ Assigns Map and Reduce tasks to Tasktrackers
◦ Monitors task and tasktracker status, re-executes tasks upon failure

The slave nodes (irkm-1 to irkm-6) run Datanodes (HDFS) and Tasktrackers (MapReduce):
◦ Run Map and Reduce tasks upon instruction from the Jobtracker
◦ Manage storage and transmission of intermediate output

Page 5: Hands-On Hadoop Tutorial
Page 6: Hands-On Hadoop Tutorial

Hadoop is locally "installed" on each machine:
◦ Version 0.19.2
◦ Installed location is /home/tmp/hadoop
◦ Slave nodes store their data in /tmp/hadoop-${user.name} (configurable)

Page 7: Hands-On Hadoop Tutorial

If it is the first time you use it, you need to format the namenode:
◦ log in to irkm-1
◦ cd /home/tmp/hadoop
◦ bin/hadoop namenode -format

Most commands look similar:
◦ bin/hadoop "some command" options
◦ If you just type hadoop you get a list of all possible commands (including undocumented ones)

Page 8: Hands-On Hadoop Tutorial

hadoop dfs
◦ [-ls <path>]
◦ [-du <path>]
◦ [-cp <src> <dst>]
◦ [-rm <path>]
◦ [-put <localsrc> <dst>]
◦ [-copyFromLocal <localsrc> <dst>]
◦ [-moveFromLocal <localsrc> <dst>]
◦ [-get [-crc] <src> <localdst>]
◦ [-cat <src>]
◦ [-copyToLocal [-crc] <src> <localdst>]
◦ [-moveToLocal [-crc] <src> <localdst>]
◦ [-mkdir <path>]
◦ [-touchz <path>]
◦ [-test -[ezd] <path>]
◦ [-stat [format] <path>]
◦ [-help [cmd]]

Page 9: Hands-On Hadoop Tutorial

bin/start-all.sh – starts the Hadoop daemons on the master node and all slave nodes

bin/stop-all.sh – stops the Hadoop daemons on the master node and all slave nodes

Run jps to check which daemons are running

Page 10: Hands-On Hadoop Tutorial

Log in to irkm-1
rm -fr /tmp/hadoop/$userID
cd /home/tmp/hadoop
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal example example

After that:

bin/hadoop dfs -ls

Page 11: Hands-On Hadoop Tutorial
Page 12: Hands-On Hadoop Tutorial
Page 13: Hands-On Hadoop Tutorial
Page 14: Hands-On Hadoop Tutorial

Mapper.py
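The slide's mapper code is not reproduced in the transcript. Below is a sketch of what a Hadoop Streaming word-count mapper.py typically looks like; the function name map_line is illustrative, not from the original slide:

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming word-count mapper (not the original slide's code).
import sys

def map_line(line):
    """Emit a (word, 1) pair for each whitespace-separated word."""
    return [(word, 1) for word in line.strip().split()]

if __name__ == "__main__":
    # Hadoop Streaming feeds the input split on stdin, one line at a time,
    # and expects tab-separated key/value pairs on stdout.
    for line in sys.stdin:
        for word, count in map_line(line):
            print("%s\t%d" % (word, count))
```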

Page 15: Hands-On Hadoop Tutorial

Reducer.py
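Likewise, a sketch of a matching Streaming reducer.py; it relies on the framework sorting mapper output by key before the reducer sees it, so identical words arrive on consecutive lines (reduce_sorted is an illustrative name, not from the slide):

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming word-count reducer (not the original slide's code).
import sys

def reduce_sorted(pairs):
    """Sum counts for consecutive identical keys (input must be sorted by key)."""
    totals = []
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                totals.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        totals.append((current_word, current_count))
    return totals

if __name__ == "__main__":
    # Parse the tab-separated "word\tcount" lines the mapper emitted.
    def parse(line):
        word, count = line.rstrip("\n").split("\t", 1)
        return word, int(count)
    for word, total in reduce_sorted(parse(line) for line in sys.stdin):
        print("%s\t%d" % (word, total))
```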

Page 16: Hands-On Hadoop Tutorial

bin/hadoop dfs -ls

bin/hadoop dfs -copyFromLocal example example

bin/hadoop jar contrib/streaming/hadoop-0.19.2-streaming.jar \
  -file wordcount-py.example/mapper.py -mapper wordcount-py.example/mapper.py \
  -file wordcount-py.example/reducer.py -reducer wordcount-py.example/reducer.py \
  -input example -output java-output
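What that streaming job does can be simulated locally before touching the cluster; a self-contained sketch of the same map, sort-by-key, reduce data flow (the input lines are made up):

```python
# Local simulation of the Streaming word-count pipeline: map -> sort -> reduce.
from itertools import groupby

lines = ["the quick brown fox", "the lazy dog"]  # stand-in for the input files

# Map phase: emit (word, 1) for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort phase: Streaming sorts mapper output by key before reducing.
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: sum the counts for each run of identical keys.
counts = {key: sum(c for _, c in group)
          for key, group in groupby(mapped, key=lambda kv: kv[0])}
print(counts["the"])  # 2
```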

bin/hadoop dfs -cat java-output/part-00000

bin/hadoop dfs -copyToLocal java-output/part-00000 java-output-local

Page 17: Hands-On Hadoop Tutorial

Hadoop job tracker◦ http://irkm-1.soe.ucsc.edu:50030/jobtracker.jsp

Hadoop task tracker◦ http://irkm-1.soe.ucsc.edu:50060/tasktracker.jsp

Hadoop dfs checker◦ http://irkm-1.soe.ucsc.edu:50070/dfshealth.jsp 

Page 18: Hands-On Hadoop Tutorial