Transcript of Hadoop, Hadoop, Hadoop!!! Jerome Mitchell, Indiana University

Page 1:

Hadoop, Hadoop, Hadoop!!!

Jerome Mitchell, Indiana University

Page 2:

Outline

• BIG DATA
• Hadoop MapReduce
• The Hadoop Distributed File System (HDFS)
• Workflow
• Conclusions
• References
• Hands-On

Page 3:

LOTS OF DATA EVERYWHERE

Page 4:

Why Should You Care? Even Grocery Stores Care

Page 5:
Page 6:

What is Hadoop?

• At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.
• GFS is not open source.
• Doug Cutting and Yahoo! reverse engineered GFS and called it the Hadoop Distributed File System (HDFS).
• The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop.
• It is open source and distributed by Apache.

Page 7:

What exactly is Hadoop?

• A growing collection of subprojects

Page 8:

Motivation for MapReduce

• Large-Scale Data Processing
  – Want to use 1000s of CPUs
  – But don't want the hassle of managing things
• MapReduce Architecture provides
  – Automatic Parallelization and Distribution
  – Fault Tolerance
  – I/O Scheduling
  – Monitoring and Status Updates

Page 9:


MapReduce Model

• Input & Output: a set of key/value pairs
• Two primitive operations
  – map: (k1, v1) → list(k2, v2)
  – reduce: (k2, list(v2)) → list(k3, v3)
• Each map operation processes one input key/value pair and produces a set of intermediate key/value pairs
• Each reduce operation
  – Merges all intermediate values (produced by map operations) for a particular key
  – Produces the final key/value pairs
• Operations are organized into tasks
  – Map tasks: apply the map operation to a set of key/value pairs
  – Reduce tasks: apply the reduce operation to intermediate key/value pairs
  – Each MapReduce job comprises a set of map and (optional) reduce tasks
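A concrete instance of these two signatures, written against Hadoop's org.apache.hadoop.mapreduce Java API, might look like the word-count sketch below; the class names and the whitespace tokenization are our own choices for illustration, not something prescribed by the slides.

----------------------------------------------------
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // map: (k1, v1) -> list(k2, v2); here (line offset, line text) -> list of (word, 1)
  public static class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);             // one intermediate pair per word
        }
      }
    }
  }

  // reduce: (k2, list(v2)) -> list(k3, v3); here (word, [1, 1, ...]) -> (word, count)
  public static class WordCountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();                          // merge all intermediate values for this key
      }
      context.write(word, new IntWritable(sum)); // emit the final key/value pair
    }
  }
}
----------------------------------------------------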

Page 10:

HDFS Architecture

Page 11:

The Workflow

• Load data into the Cluster (HDFS writes)
• Analyze the data (MapReduce)
• Store results in the Cluster (HDFS)
• Read the results from the Cluster (HDFS reads)
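For the two HDFS steps of this workflow, a rough Java sketch using Hadoop's FileSystem API is shown below; the paths (/tmp/File.txt, /user/hadoop/...) are hypothetical, and the MapReduce step itself is assumed to be submitted separately.

----------------------------------------------------
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWorkflow {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // 1. Load data into the cluster (HDFS write)
    fs.copyFromLocalFile(new Path("/tmp/File.txt"), new Path("/user/hadoop/File.txt"));

    // Steps 2 and 3: analyze the data (MapReduce) and store the results in the
    // cluster; the job is submitted separately and writes to /user/hadoop/output.

    // 4. Read the results from the cluster (HDFS read)
    FSDataInputStream in = fs.open(new Path("/user/hadoop/output/part-00000"));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);  // stream the result to stdout
    } finally {
      in.close();
    }
  }
}
----------------------------------------------------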

Page 12:

Distributed Data Processing

Slaves

Master

Client

Distributed Data Storage

Name NodeSecondary Name

Node

Data Nodes &Task Tracker

Data Nodes &Task Tracker

Data Nodes &Task Tracker

Data Nodes &Task Tracker

Data Nodes &Task Tracker

Data Nodes &Task Tracker

MapReduce HDFS

Job Tracker

Hadoop Server Roles

Page 13:

Hadoop Cluster

[Diagram: Racks 1 through N, each holding several Data Node + Task Tracker (DN + TT) machines behind a rack-level switch; the rack switches connect to higher-level switches.]

Page 14:

Hadoop Rack Awareness – Why?

[Diagram: Data Nodes spread over Rack 1, Rack 5, and Rack 9, each rack behind its own switch. The Rack Aware Name Node keeps two kinds of metadata: the block locations for each file (e.g., Results.txt = BLK A: DN1, DN5, DN6; BLK B: DN7, DN1, DN2; BLK C: DN5, DN8, DN9) and a map of which Data Nodes belong to which rack.]

Page 15:

Sample Scenario

Huge file containing all emails sent to Indiana University… File.txt

How many times did our customers type the word refund into emails sent to Indiana University?

Page 16:

Writing Files to HDFS

[Diagram: the Client consults the NameNode and writes blocks BLK A, BLK B, and BLK C of File.txt to Data Nodes 1, 5, 6, …, N, with the blocks replicated across Data Nodes.]

Page 17:

Data Nodes Reading Files from HDFS

[Diagram: Data Nodes in Rack 1, Rack 5, and Rack 9, each rack behind its own switch. The Rack Aware Name Node serves the block metadata for Results.txt (BLK A: DN1, DN5, DN6; BLK B: DN7, DN1, DN2; BLK C: DN5, DN8, DN9) together with the rack membership of the Data Nodes, so each block can be read from a nearby replica.]

Page 18:

MapReduce: Three Phases

1. Map
2. Sort
3. Reduce

Page 19:

Data Processing: Map

[Diagram: the Job Tracker consults the Name Node for the block locations of File.txt and launches a Map Task on Data Node 1, Data Node 5, and Data Node 9, each working on its local block (BLK A, BLK B, BLK C).]

Page 20:

MapReduce: The Map Step

[Diagram: input key-value pairs (k, v) are fed to map calls, each of which emits a set of intermediate key-value pairs.]

Page 21:

Data Processing: Reduce

[Diagram: the intermediate output of the three Map Tasks (from blocks BLK A, BLK B, BLK C) is fed to a Reduce Task, which writes Results.txt back to HDFS.]

Page 22:

MapReduce: The Reduce Step

[Diagram: intermediate key-value pairs are grouped by key into key-value groups; each reduce call takes one key with its list of values and emits the output key-value pairs.]

Page 23:

Clients Reading Files from HDFS

[Diagram: the Client asks the NameNode for the block metadata of Results.txt (BLK A: DN1, DN5, DN6; BLK B: DN7, DN1, DN2; BLK C: DN5, DN8, DN9) and reads each block from one of the Data Nodes holding it, across Racks 1, 5, and 9.]

Page 24:

Conclusions

• We introduced the MapReduce programming model for processing large-scale data

• We discussed the supporting Hadoop Distributed File System

• The concepts were illustrated using a simple example

• We reviewed some important parts of the source code for the example.

Page 25:

References

1. Apache Hadoop Tutorial: http://hadoop.apache.org, http://hadoop.apache.org/core/docs/current/mapred_tutorial.html

2. Dean, J. and Ghemawat, S. 2008. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.

3. Cloudera Videos by Aaron Kimball: http://www.cloudera.com/hadoop-training-basic

4. http://www.cse.buffalo.edu/faculty/bina/mapreduce.html

Page 26:

FINALLY, A HANDS-ON ASSIGNMENT!

Page 27:
Page 28:

The MapReduce Framework (pioneered by Google)

Page 29:

Automatic Parallel Execution in MapReduce (Google)

Handles failures automatically, e.g., restarts tasks if a node fails, and runs multiple copies of the same task to avoid a slow task slowing down the whole job.

Page 30:

The Map (Example)

inputs → map tasks (M = 3) → partitions (intermediate files) (R = 2)

Map task 1
  input: "When in the course of human events it …" and "It was the best of times and the worst of times…"
  partition 1: (when,1) (course,1) (human,1) (events,1) (best,1) …
  partition 2: (in,1) (the,1) (of,1) (it,1) (it,1) (was,1) (the,1) (of,1) …

Map task 2
  input: "This paper evaluates the suitability of the …"
  partition 1: (this,1) (paper,1) (evaluates,1) (suitability,1) …
  partition 2: (the,1) (of,1) (the,1) …

Map task 3
  input: "Over the past five years, the authors and many…"
  partition 1: (over,1) (past,1) (five,1) (years,1) (authors,1) (many,1) …
  partition 2: (the,1) (the,1) (and,1) …
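Which of the R = 2 partitions each intermediate (word, 1) pair lands in is decided by a partition function. The sketch below reproduces the behavior of Hadoop's default HashPartitioner for the word-count key/value types; writing it out by hand is purely illustrative, since Hadoop applies the same logic when no partitioner is set.

----------------------------------------------------
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate key to one of R partitions (one per reduce task),
// mirroring Hadoop's default HashPartitioner.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask the sign bit so the result always falls in [0, numPartitions).
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
----------------------------------------------------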

Page 31:

The Reduce (Example)

partition (intermediate files) (R = 2) → sort → reduce task

Intermediate pairs arriving at this reduce task (one batch per map task):
(in,1) (the,1) (of,1) (it,1) (it,1) (was,1) (the,1) (of,1) …
(the,1) (of,1) (the,1) …
(the,1) (the,1) (and,1) …

After the sort (values grouped by key):
(and,(1)) (in,(1)) (it,(1,1)) (the,(1,1,1,1,1,1)) (of,(1,1,1)) (was,(1))

After the reduce:
(and,1) (in,1) (it,2) (of,3) (the,6) (was,1)

Note: only one of the two reduce tasks is shown.
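The sort-then-reduce step on this slide can be reproduced with a few lines of plain Java. The sketch below groups the intermediate pairs shown above by key (the framework's sort) and then sums each group (the reduce), yielding exactly (and,1) (in,1) (it,2) (of,3) (the,6) (was,1). It is only a local simulation of what a reduce task does, not Hadoop code.

----------------------------------------------------
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceByHand {
  public static void main(String[] args) {
    // Intermediate (word, 1) pairs from the slide's partition, one batch per map task.
    List<String> intermediate = Arrays.asList(
        "in", "the", "of", "it", "it", "was", "the", "of",   // from map task 1
        "the", "of", "the",                                   // from map task 2
        "the", "the", "and");                                 // from map task 3

    // Sort/group phase: collect all values for each key, with keys in sorted order.
    Map<String, List<Integer>> groups = new TreeMap<>();
    for (String word : intermediate) {
      groups.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
    }
    // groups is now {and=[1], in=[1], it=[1, 1], of=[1, 1, 1], the=[1, ...], was=[1]}

    // Reduce phase: sum the list of values for each key.
    for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
      int sum = e.getValue().stream().mapToInt(Integer::intValue).sum();
      System.out.println("(" + e.getKey() + ", " + sum + ")");
    }
    // Prints (and, 1) (in, 1) (it, 2) (of, 3) (the, 6) (was, 1), one pair per line.
  }
}
----------------------------------------------------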

Page 32:

Lifecycle of a MapReduce Job

[Diagram: the user supplies a Map function and a Reduce function and runs the program as a MapReduce job.]

Page 33:

Job Configuration Parameters

• 190+ parameters in Hadoop
• Set manually, or defaults are used
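As a small example of setting a few of those parameters in code (the parameter names below are the classic, pre-YARN ones that match the JobTracker/TaskTracker setup used in this tutorial, and the values are arbitrary):

----------------------------------------------------
import org.apache.hadoop.conf.Configuration;

public class TunedConfiguration {
  // Override a handful of Hadoop's 190+ job parameters; anything not set here
  // keeps the default value from the *-site.xml configuration files.
  public static Configuration create() {
    Configuration conf = new Configuration();
    conf.setInt("mapred.reduce.tasks", 4);  // number of reduce tasks for the job
    conf.setInt("io.sort.mb", 200);         // map-side sort buffer size, in MB
    conf.setInt("dfs.replication", 3);      // replication factor for files the job writes
    return conf;
  }
}
----------------------------------------------------

The same overrides can usually be passed on the command line as -D name=value when the driver is run through Hadoop's ToolRunner.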

Page 34:

Formatting the NameNode

Before we start, we have to format Hadoop's distributed filesystem (HDFS) for the namenode. You need to do this the first time you set up Hadoop. Do not format a running Hadoop namenode; this will cause all your data in the HDFS filesystem to be erased.

To format the filesystem, run the command (from the master):

--------------------------------------------

bin/hadoop namenode -format

---------------------------------------------

Starting Hadoop:

Starting Hadoop is done in two steps: First, the HDFS daemons are started: the namenode daemon is started on master, and datanode daemons are started on all slaves (here: master and slave). Second, the MapReduce daemons are started: the jobtracker is started on master, and tasktracker daemons are started on all slaves (here: master and slave).

Page 35:

HDFS daemons:
Run the command <HADOOP_HOME>/bin/start-dfs.sh on the machine you want the namenode to run on. This will bring up HDFS with the namenode running on the machine you ran the command on, and datanodes on the machines listed in the conf/slaves file. In our case, we will run bin/start-dfs.sh on master:
-------------------------
bin/start-dfs.sh
-------------------------
On slave, you can examine the success or failure of this command by inspecting the log file <HADOOP_HOME>/logs/hadoop-hadoop-datanode-slave.log.

Now, the following Java processes should run on master:
------------------------------------
root@ubuntu:$HADOOP_HOME/bin$ jps
14799 NameNode
15314 Jps
14880 DataNode
14977 SecondaryNameNode
------------------------------------

Page 36:

MapReduce Daemons:

In our case, we will run bin/start-mapred.sh on master:

-------------------------------------

bin/start-mapred.sh

-------------------------------------

On slave, you can examine the success or failure of this command by inspecting the log file <HADOOP_HOME>/logs/hadoop-hadoop-tasktracker-slave.log.

At this point, the following Java processes should run on master:

----------------------------------------------------

root@ubuntu:$HADOOP_HOME/bin$ jps

16017 Jps

14799 NameNode

15686 TaskTracker

14880 DataNode

15596 JobTracker

14977 SecondaryNameNode

----------------------------------------------------

Page 37:

We will now execute our first Hadoop MapReduce job. We will use the WordCount example, which reads a text file and counts the frequency of words. The input is a text file, and the output is a text file in which each line contains a word and the count of how often it occurred, separated by a tab. Download example input data:

Copy local data file to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop's HDFS:
-----------------------------
root@ubuntu:$HADOOP_HOME/bin$ hadoop dfs -copyFromLocal /tmp/source destination
-----------------------------

Page 38:

Build WordCount
Execute build.sh

Run the MapReduce job
Now, we actually run the WordCount example job. This command will read all the files in the HDFS “destination” directory, process them, and store the result in the HDFS directory “output”.
-----------------------------------------
root@ubuntu:$HADOOP_HOME/bin$ bin/hadoop hadoop-example wordcount destination output
-----------------------------------------
You can check that the result was successfully stored in the HDFS directory “output”.

Retrieve the job result from HDFS
To inspect the file, you can copy it from HDFS to the local file system.
-------------------------------------
root@ubuntu:/usr/local/hadoop$ mkdir /tmp/output
root@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyToLocal output/part-00000 /tmp/output
-------------------------------------
Alternatively, you can read the file directly from HDFS without copying it to the local file system by using the command:
-------------------------------------
root@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat output/part-00000
-------------------------------------

Page 39:

Even though the Hadoop framework is written in Java, programs for Hadoop need not be coded in Java; they can also be developed in other languages like Python or C++.

Creating a launching program for your application
The launching program configures:
– The Mapper and Reducer to use
– The output key and value types (input types are inferred from the InputFormat)
– The locations of your input and output
The launching program then submits the job and typically waits for it to complete.

A Map/Reduce job may specify how its input is to be read by specifying an InputFormat to be used. A Map/Reduce job may specify how its output is to be written by specifying an OutputFormat to be used.
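Putting those pieces together, a launching program along the lines described above could look roughly like this. It assumes the WordCount.WordCountMapper and WordCount.WordCountReducer classes from the earlier sketch, and it names the (default) TextInputFormat and TextOutputFormat explicitly just to show where the InputFormat and OutputFormat are chosen.

----------------------------------------------------
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");    // Job.getInstance(conf, ...) on newer versions
    job.setJarByClass(WordCountDriver.class);

    // The Mapper and Reducer to use
    job.setMapperClass(WordCount.WordCountMapper.class);
    job.setReducerClass(WordCount.WordCountReducer.class);

    // The output key and value types (input types are inferred from the InputFormat)
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // How the input is read and how the output is written
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    // The locations of the input and output
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit the job and wait for it to complete
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
----------------------------------------------------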