Intro to HDFS and MapReduce


Transcript of Intro to HDFS and MapReduce

Page 1: Intro to HDFS and MapReduce

Introduction to HDFS and MapReduce

Page 2: Intro to HDFS and MapReduce

Who Am I - Ryan Tabora

- Data Developer at Think Big Analytics

- Big Data Consulting

- Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.


Page 4: Intro to HDFS and MapReduce

Think Big Analytics

• One of Silicon Valley's fastest growing Big Data start-ups.

• 100% focus on Big Data consulting & Data Science solution services.

• Management background: Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture; C-bridge Internet Solutions (CBIS) founder (1996) & executives, IPO 1999.

• Clients: 40+

• North America locations:

  - US East: Boston, New York, Washington D.C.

  - US Central: Chicago, Austin

  - US West: HQ Mountain View, San Diego, Salt Lake City

• EMEA & APAC

Think Big is the leading professional services firm that's purpose-built for Big Data.

Page 5: Intro to HDFS and MapReduce

Think Big Recognized as a Top Pure-Play Big Data Vendor

Source: Forbes February 2012

Page 6: Intro to HDFS and MapReduce

Agenda

- Big Data

- Hadoop Ecosystem

- HDFS

- MapReduce in Hadoop

- The Hadoop Java API

- Conclusions

Page 7: Intro to HDFS and MapReduce

Big Data

Page 8: Intro to HDFS and MapReduce

A Data Shift...

Source: EMC Digital Universe Study*

Page 9: Intro to HDFS and MapReduce

Motivation

"Simple algorithms and lots of data trump complex models."

- Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems

Page 10: Intro to HDFS and MapReduce

Pioneers

• Google and Yahoo:

- Index 850+ million websites, over one trillion URLs.

• Facebook ad targeting:

- 840+ million users, > 50% of whom are active daily.

Page 11: Intro to HDFS and MapReduce

Hadoop Ecosystem

Page 12: Intro to HDFS and MapReduce

Common Tool?

• Hadoop

- Cluster: distributed computing platform.

- Commodity*, server-class hardware.

- Extensible platform.

Page 13: Intro to HDFS and MapReduce

Hadoop Origins

• MapReduce and Google File System (GFS) pioneered at Google.

• Hadoop is the commercially-supported open-source equivalent.

Page 14: Intro to HDFS and MapReduce

What Is Hadoop?

• Hadoop is a platform.

• Distributes and replicates data.

• Manages parallel tasks created by users.

• Runs as several processes on a cluster.

• The term Hadoop generally refers to a toolset, not a single tool.

Page 15: Intro to HDFS and MapReduce

Why Hadoop?

• Handles unstructured to semi-structured to structured data.

• Handles enormous data volumes.

• Flexible data analysis and machine learning tools.

• Cost-effective scalability.

Page 16: Intro to HDFS and MapReduce

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.

• Map/Reduce - A distributed framework for executing work in parallel.

• Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.

• Pig - A top-down scripting language for manipulating data.

• HBase - A NoSQL, non-sequential data store.


Page 18: Intro to HDFS and MapReduce

HDFS

Page 19: Intro to HDFS and MapReduce

What Is HDFS?

• Hadoop Distributed File System.

• Stores files in blocks across many nodes in a cluster.

• Replicates the blocks across nodes for durability.

• Master/Slave architecture.
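
For a concrete look at the block and replica layout described above, the fsck tool that ships with Hadoop reports where each block of a file lives (a sketch; the path below is hypothetical, not from the slides):

  hadoop fsck /user/ryan/docs/file.txt -files -blocks -locations

The -blocks and -locations flags list every block of the file and the DataNodes holding its replicas.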

Page 20: Intro to HDFS and MapReduce

HDFS Traits

• Not fully POSIX compliant.

• No file updates.

• Write once, read many times.

• Large blocks, sequential read patterns.

• Designed for batch processing.

Page 21: Intro to HDFS and MapReduce

HDFS Master

• NameNode

- Runs on a single node as a master process

‣ Holds file metadata (which blocks are where)

‣ Directs client access to files in HDFS

• SecondaryNameNode

- Not a hot failover

- Maintains a copy of the NameNode metadata

Page 22: Intro to HDFS and MapReduce

HDFS Slaves

• DataNode

- Generally runs on all nodes in the cluster

‣ Block creation/replication/deletion/reads

‣ Takes orders from the NameNode

Pages 23-29: Intro to HDFS and MapReduce

HDFS Illustrated

[Diagram, built up over several slides: a client issues a "Put File" request to the NameNode, which coordinates six DataNodes. The file is split into blocks 1, 2, and 3; each block is written to a DataNode and replicated to others, and the NameNode records the resulting block-to-DataNode mapping (e.g., block 1 on DataNodes 2, 4, and 6).]

Pages 30-37: Intro to HDFS and MapReduce

Power of Hadoop

[Diagram, built up over several slides: a client issues a "Read File" request; the NameNode returns the block-to-DataNode mapping, and the client reads blocks 1, 2, and 3 in parallel from the DataNodes that hold them.]

Because the blocks are read in parallel, the effective read rate = transfer rate x number of machines*

100 MB/s x 3 = 300 MB/s

Page 38: Intro to HDFS and MapReduce

HDFS Shell

• Easy-to-use command-line interface.

• Create, copy, move, and delete files.

• Administrative duties - chmod, chown, chgrp.

• Set replication factor for a file.

• Head, tail, cat to view files.
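
A few representative commands (illustrative only; the paths and file names are hypothetical, not from the slides):

  hadoop fs -put access.log /user/ryan/logs/          # copy a local file into HDFS
  hadoop fs -ls /user/ryan/logs                       # list a directory
  hadoop fs -mv /user/ryan/logs/access.log /archive/  # move a file
  hadoop fs -rm /archive/access.log                   # delete a file
  hadoop fs -chmod 640 /user/ryan/data.txt            # administrative duties
  hadoop fs -setrep -w 3 /user/ryan/data.txt          # set the replication factor
  hadoop fs -tail /user/ryan/data.txt                 # view the end of a file
  hadoop fs -cat /user/ryan/data.txt                  # print a whole file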

Pages 39-40: Intro to HDFS and MapReduce

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.

• Map/Reduce - A distributed framework for executing work in parallel.

• Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.

• Pig - A top-down scripting language for manipulating data.

• HBase - A NoSQL, non-sequential data store.

Page 41: Intro to HDFS and MapReduce

MapReduce in Hadoop

Page 42: Intro to HDFS and MapReduce

MapReduce Basics

• Logical functions: Mappers and Reducers.

• Developers write map and reduce functions, then submit a jar to the Hadoop cluster.

• Hadoop handles distributing the Map and Reduce tasks across the cluster.

• Typically batch oriented.
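
Submitting the jar is typically done with the hadoop jar command (a sketch; the jar name, class name, and paths are hypothetical, not from the slides):

  hadoop jar wordcount.jar com.example.WordCount /user/ryan/input /user/ryan/output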

Page 43: Intro to HDFS and MapReduce

MapReduce Daemons

• JobTracker (Master)

- Manages MapReduce jobs, giving tasks to different nodes, managing task failure

• TaskTracker (Slave)

- Creates individual map and reduce tasks

- Reports task status to JobTracker

Pages 44-45: Intro to HDFS and MapReduce

MapReduce in Hadoop

Let's look at how MapReduce actually works in Hadoop, using WordCount.

Pages 46-56: Intro to HDFS and MapReduce

[WordCount data-flow diagram, built up step by step across these slides.]

We need to convert the Input into the Output.

Input - each document line arrives as a (docN, "...") record:

  "Hadoop uses MapReduce"
  "There is a Map phase"
  "There is a Reduce phase"

Mappers - emit a (word, 1) pair for every word:

  (hadoop, 1), (uses, 1), (mapreduce, 1)
  (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1)
  (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1)

Sort, Shuffle - group the pairs by key and partition them across three reducers by key range (0-9, a-l / m-q / r-z):

  (a, [1,1]), (hadoop, [1]), (is, [1,1])
  (map, [1]), (mapreduce, [1]), (phase, [1,1])
  (reduce, [1]), (there, [1,1]), (uses, [1])

Reducers - sum each value list and write the Output:

  a 2, hadoop 1, is 2
  map 1, mapreduce 1, phase 2
  reduce 1, there 2, uses 1

Map: Transform one input to 0-N outputs.

Reduce: Collect multiple inputs into one output.
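
The same three steps can be mimicked in a few lines of plain Java with no Hadoop involved. This sketch is purely illustrative (it is not from the slides) but may make the flow above concrete before we look at the real Hadoop API:

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;
  import java.util.Map;
  import java.util.TreeMap;

  public class WordCountFlow {
    public static void main(String[] args) {
      List<String> input = Arrays.asList(
          "Hadoop uses MapReduce", "There is a Map phase", "There is a Reduce phase");

      // "Map": emit a (word, 1) pair for every word of every line.
      List<String[]> pairs = new ArrayList<String[]>();
      for (String line : input) {
        for (String word : line.toLowerCase().split("\\s+")) {
          pairs.add(new String[] { word, "1" });
        }
      }

      // "Sort, Shuffle": group the pair values by key (a TreeMap keeps keys sorted).
      Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
      for (String[] pair : pairs) {
        if (!grouped.containsKey(pair[0])) {
          grouped.put(pair[0], new ArrayList<Integer>());
        }
        grouped.get(pair[0]).add(Integer.parseInt(pair[1]));
      }

      // "Reduce": sum each key's values and print the output.
      for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
        int sum = 0;
        for (int count : entry.getValue()) {
          sum += count;
        }
        System.out.println(entry.getKey() + "\t" + sum);
      }
    }
  }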

Pages 57-67: Intro to HDFS and MapReduce

Cluster View of MapReduce

[Diagram, built up over several slides: a cluster with a JobTracker and a NameNode as masters, and three slave nodes each running a TaskTracker and a DataNode.]

- The client submits a jar containing the Map (M) and Reduce (R) code to the JobTracker.

- Map Phase: the JobTracker has the TaskTrackers run map tasks (M), which emit intermediate (k,v) pairs. *Intermediate data is stored locally.

- Shuffle/Sort: the intermediate (k,v) pairs are sorted and moved to the nodes that will run the reduce tasks.

- Reduce Phase: the TaskTrackers run reduce tasks (R) over the shuffled (k,v) pairs and write the final output.

- Job Complete!

Page 68: Intro to HDFS and MapReduce

The Hadoop Java API

Pages 69-70: Intro to HDFS and MapReduce

MapReduce in Java

Let's look at WordCount written in the MapReduce Java API.

Pages 71-77: Intro to HDFS and MapReduce

Map Code

  public class SimpleWordCountMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    static final Text word = new Text();
    static final IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text documentContents,
        OutputCollector<Text, IntWritable> collector, Reporter reporter)
        throws IOException {
      String[] tokens = documentContents.toString().split("\\s+");
      for (String wordString : tokens) {
        if (wordString.length() > 0) {
          word.set(wordString.toLowerCase());
          collector.collect(word, one);
        }
      }
    }
  }

Let's drill into this code...

- Mapper class with 4 type parameters for the input key-value types and output key-value types.

- Output key-value objects we'll reuse.

- Map method with input, output "collector", and reporting object.

- Tokenize the line, "collect" each (word, 1).

Pages 78-83: Intro to HDFS and MapReduce

Reduce Code

  public class SimpleWordCountReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> counts,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int count = 0;
      while (counts.hasNext()) {
        count += counts.next().get();
      }
      output.collect(key, new IntWritable(count));
    }
  }

Let's drill into this code...

- Reducer class with 4 type parameters for the input key-value types and output key-value types.

- Reduce method with input, output "collector", and reporting object.

- Count the counts per word and emit (word, N).
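
The slides stop at the mapper and reducer. For completeness, a minimal driver for this classic org.apache.hadoop.mapred API might look like the sketch below; the driver class name and the reliance on the default text input/output formats are assumptions, not part of the original deck.

  import java.io.IOException;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  // Hypothetical driver (not from the slides): wires the mapper and reducer
  // above into a job and submits it to the cluster.
  public class SimpleWordCount {
    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(SimpleWordCount.class);
      conf.setJobName("wordcount");

      conf.setMapperClass(SimpleWordCountMapper.class);
      conf.setReducerClass(SimpleWordCountReducer.class);

      // Key/value types emitted by the mapper and reducer.
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      // Input and output locations in HDFS, passed on the command line.
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);  // blocks until the job completes
    }
  }

Packaged into a jar, it would be launched with the hadoop jar command shown earlier.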

Pages 84-86: Intro to HDFS and MapReduce

Other Options

• HDFS - Hadoop Distributed File System.

• Map/Reduce - A distributed framework for executing work in parallel.

• Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.

• Pig - A top-down scripting language for manipulating data.

• HBase - A NoSQL, non-sequential data store.

Page 87: Intro to HDFS and MapReduce

Conclusions

Page 88: Intro to HDFS and MapReduce

Hadoop Benefits

• A cost-effective, scalable way to:

- Store massive data sets.

- Perform arbitrary analyses on those data sets.

Page 89: Intro to HDFS and MapReduce

Hadoop Tools

• Offers a variety of tools for:

- Application development.

- Integration with other platforms (e.g., databases).

Page 90: Intro to HDFS and MapReduce

Hadoop Distributions

• A rich, open-source ecosystem.

- Free to use.

- Commercially-supported distributions.

Page 91: Intro to HDFS and MapReduce

Thank You!

- Feel free to contact me at [email protected]

- Or our solutions consultant [email protected]

- As always, THINK BIG!

Page 92: Intro to HDFS and MapReduce

Bonus Content

Pages 93-94: Intro to HDFS and MapReduce

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.

• Map/Reduce - A distributed framework for executing work in parallel.

• Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.

• Pig - A top-down scripting language for manipulating data.

• HBase - A NoSQL, non-sequential data store.

Page 95: Intro to HDFS and MapReduce

Hive: SQL for Hadoop

Pages 96-97: Intro to HDFS and MapReduce

Hive

Let's look at WordCount written in Hive, the SQL for Hadoop.

Pages 98-103: Intro to HDFS and MapReduce

  CREATE TABLE docs (line STRING);

  LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;

  CREATE TABLE word_counts AS
  SELECT word, count(1) AS count
  FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w
  GROUP BY word
  ORDER BY word;

Let's drill into this code...

- Create a table to hold the raw text we're counting. Each line of text becomes a row with a single "line" column.

- Load the text in the 'docs' directory into the table.

- Create the final table and fill it with the results of a nested query over the docs table that performs WordCount on the fly.
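
If these statements were saved to a file, they could be run non-interactively with the Hive CLI (the file name is hypothetical, not from the slides):

  hive -f wordcount.hql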

Pages 104-105: Intro to HDFS and MapReduce

Hive

Because so many Hadoop users come from SQL backgrounds, Hive is one of the most essential tools in the ecosystem!

Pages 106-107: Intro to HDFS and MapReduce

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.

• Map/Reduce - A distributed framework for executing work in parallel.

• Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.

• Pig - A top-down scripting language for manipulating data.

• HBase - A NoSQL, non-sequential data store.

Page 108: Intro to HDFS and MapReduce

Pig: Data Flow for Hadoop

Pages 109-110: Intro to HDFS and MapReduce

Pig

Let's look at WordCount written in Pig, the Data Flow language for Hadoop.

Pages 111-118: Intro to HDFS and MapReduce

  inpt = LOAD 'docs' using TextLoader AS (line:chararray);

  words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;

  grpd = GROUP words BY word;

  cntd = FOREACH grpd GENERATE group, COUNT(words);

  STORE cntd INTO 'output';

Let's drill into this code...

- Like the Hive example, load the 'docs' content; each line is a "field".

- Tokenize into words (an array) and "flatten" into separate records.

- Collect the same words together.

- Count each word.

- Save the results. Profit!
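
Saved to a script, this could be run on the cluster with the pig command (the file name is hypothetical, not from the slides):

  pig -x mapreduce wordcount.pig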

Pages 119-120: Intro to HDFS and MapReduce

Pig

Pig and Hive overlap, but Pig is popular for ETL, e.g., data transformation, cleansing, ingestion, etc.

Page 121: Intro to HDFS and MapReduce

Questions?