Intro to HDFS and MapReduce


Transcript of Intro to HDFS and MapReduce

Page 1: Intro to HDFS and MapReduce

Introduction to HDFS and MapReduce

Page 2: Intro to HDFS and MapReduce

Who Am I - Ryan Tabora

- Data Developer at Think Big Analytics

- Big Data Consulting

- Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.


Page 4: Intro to HDFS and MapReduce

Think Big Analytics

• One of Silicon Valley's fastest growing Big Data start-ups.

• 100% focus on Big Data consulting & Data Science solution services.

• Management background: Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture; C-bridge Internet Solutions (CBIS) founder (1996) & executives, IPO 1999.

• Clients: 40+

• North America locations:

  - US East: Boston, New York, Washington D.C.

  - US Central: Chicago, Austin

  - US West: HQ Mountain View, San Diego, Salt Lake City

• EMEA & APAC

Think Big is the leading professional services firm that's purpose-built for Big Data.

Page 5: Intro to HDFS and MapReduce

Think Big Recognized as a Top Pure-Play Big Data Vendor

Source: Forbes February 2012

Page 6: Intro to HDFS and MapReduce

Agenda

- Big Data

- Hadoop Ecosystem

- HDFS

- MapReduce in Hadoop

- The Hadoop Java API

- Conclusions

Page 7: Intro to HDFS and MapReduce

Big Data

Page 8: Intro to HDFS and MapReduce

A Data Shift...

Source: EMC Digital Universe Study*

Page 9: Intro to HDFS and MapReduce

Motivation

"Simple algorithms and lots of data trump complex models."

- Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems

Page 10: Intro to HDFS and MapReduce

Pioneers

• Google and Yahoo:

- Index 850+ million websites, over one trillion URLs.

• Facebook ad targeting:

- 840+ million users, > 50% of whom are active daily.

Page 11: Intro to HDFS and MapReduce

Hadoop Ecosystem

Page 12: Intro to HDFS and MapReduce

Common Tool?

• Hadoop

- Cluster: distributed computing platform.

- Commodity*, server-class hardware.

- Extensible platform.

Page 13: Intro to HDFS and MapReduce

Hadoop Origins

• MapReduce and Google File System (GFS) pioneered at Google.

• Hadoop is the commercially-supported open-source equivalent.

Page 14: Intro to HDFS and MapReduce

What Is Hadoop?

• Hadoop is a platform.

• Distributes and replicates data.

• Manages parallel tasks created by users.

• Runs as several processes on a cluster.

• The term Hadoop generally refers to a toolset, not a single tool.

Page 15: Intro to HDFS and MapReduce

Why Hadoop?

• Handles unstructured to semi-structured to structured data.

• Handles enormous data volumes.

• Flexible data analysis and machine learning tools.

• Cost-effective scalability.

Page 16: Intro to HDFS and MapReduce

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.

• Map/Reduce - A distributed framework for executing work in parallel.

• Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.

• Pig - A top-down scripting language for manipulating data.

• HBase - A NoSQL, non-sequential data store.


Page 18: Intro to HDFS and MapReduce

HDFS

Page 19: Intro to HDFS and MapReduce

What Is HDFS?

• Hadoop Distributed File System.

• Stores files in blocks across many nodes in a cluster.

• Replicates the blocks across nodes for durability.

• Master/Slave architecture.
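
For a concrete look at the block and replica layout described above, the fsck tool that ships with Hadoop reports where each block of a file lives (a sketch; the path below is hypothetical, not from the slides):

  hadoop fsck /user/ryan/docs/file.txt -files -blocks -locations

The -blocks and -locations flags list every block of the file and the DataNodes holding its replicas.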

Page 20: Intro to HDFS and MapReduce

HDFS Traits

• Not fully POSIX compliant.

• No file updates.

• Write once, read many times.

• Large blocks, sequential read patterns.

• Designed for batch processing.

Page 21: Intro to HDFS and MapReduce

HDFS Master

• NameNode

- Runs on a single node as a master process

‣ Holds file metadata (which blocks are where)

‣ Directs client access to files in HDFS

• SecondaryNameNode

- Not a hot failover

- Maintains a copy of the NameNode metadata

Page 22: Intro to HDFS and MapReduce

HDFS Slaves

• DataNode

- Generally runs on all nodes in the cluster

‣ Block creation/replication/deletion/reads

‣ Takes orders from the NameNode

Pages 23-29: Intro to HDFS and MapReduce

HDFS Illustrated

[Diagram, built up over several slides: a client issues a "Put File" request to the NameNode, which coordinates six DataNodes. The file is split into blocks 1, 2, and 3; each block is written to a DataNode and replicated to others, and the NameNode records the resulting block-to-DataNode mapping (e.g., block 1 on DataNodes 2, 4, and 6).]

Pages 30-37: Intro to HDFS and MapReduce

Power of Hadoop

[Diagram, built up over several slides: a client issues a "Read File" request; the NameNode returns the block-to-DataNode mapping, and the client reads blocks 1, 2, and 3 in parallel from the DataNodes that hold them.]

Because the blocks are read in parallel, the effective read rate = transfer rate x number of machines*

100 MB/s x 3 = 300 MB/s

Page 38: Intro to HDFS and MapReduce

HDFS Shell

• Easy-to-use command-line interface.

• Create, copy, move, and delete files.

• Administrative duties - chmod, chown, chgrp.

• Set replication factor for a file.

• Head, tail, cat to view files.
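
A few representative commands (illustrative only; the paths and file names are hypothetical, not from the slides):

  hadoop fs -put access.log /user/ryan/logs/          # copy a local file into HDFS
  hadoop fs -ls /user/ryan/logs                       # list a directory
  hadoop fs -mv /user/ryan/logs/access.log /archive/  # move a file
  hadoop fs -rm /archive/access.log                   # delete a file
  hadoop fs -chmod 640 /user/ryan/data.txt            # administrative duties
  hadoop fs -setrep -w 3 /user/ryan/data.txt          # set the replication factor
  hadoop fs -tail /user/ryan/data.txt                 # view the end of a file
  hadoop fs -cat /user/ryan/data.txt                  # print a whole file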

Pages 39-40: Intro to HDFS and MapReduce

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.

• Map/Reduce - A distributed framework for executing work in parallel.

• Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.

• Pig - A top-down scripting language for manipulating data.

• HBase - A NoSQL, non-sequential data store.

Page 41: Intro to HDFS and MapReduce

MapReduce in Hadoop

Page 42: Intro to HDFS and MapReduce

MapReduce Basics

• Logical functions: Mappers and Reducers.

• Developers write map and reduce functions, then submit a jar to the Hadoop cluster.

• Hadoop handles distributing the Map and Reduce tasks across the cluster.

• Typically batch oriented.
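
Submitting the jar is typically done with the hadoop jar command (a sketch; the jar name, class name, and paths are hypothetical, not from the slides):

  hadoop jar wordcount.jar com.example.WordCount /user/ryan/input /user/ryan/output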

Page 43: Intro to HDFS and MapReduce

MapReduce Daemons

• JobTracker (Master)

- Manages MapReduce jobs, giving tasks to different nodes, managing task failure

• TaskTracker (Slave)

- Creates individual map and reduce tasks

- Reports task status to JobTracker

Pages 44-45: Intro to HDFS and MapReduce

MapReduce in Hadoop

Let's look at how MapReduce actually works in Hadoop, using WordCount.

Pages 46-56: Intro to HDFS and MapReduce

[WordCount data-flow diagram, built up step by step across these slides.]

We need to convert the Input into the Output.

Input - each document line arrives as a (docN, "...") record:

  "Hadoop uses MapReduce"
  "There is a Map phase"
  "There is a Reduce phase"

Mappers - emit a (word, 1) pair for every word:

  (hadoop, 1), (uses, 1), (mapreduce, 1)
  (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1)
  (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1)

Sort, Shuffle - group the pairs by key and partition them across three reducers by key range (0-9, a-l / m-q / r-z):

  (a, [1,1]), (hadoop, [1]), (is, [1,1])
  (map, [1]), (mapreduce, [1]), (phase, [1,1])
  (reduce, [1]), (there, [1,1]), (uses, [1])

Reducers - sum each value list and write the Output:

  a 2, hadoop 1, is 2
  map 1, mapreduce 1, phase 2
  reduce 1, there 2, uses 1

Map: Transform one input to 0-N outputs.

Reduce: Collect multiple inputs into one output.
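
The same three steps can be mimicked in a few lines of plain Java with no Hadoop involved. This sketch is purely illustrative (it is not from the slides) but may make the flow above concrete before we look at the real Hadoop API:

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;
  import java.util.Map;
  import java.util.TreeMap;

  public class WordCountFlow {
    public static void main(String[] args) {
      List<String> input = Arrays.asList(
          "Hadoop uses MapReduce", "There is a Map phase", "There is a Reduce phase");

      // "Map": emit a (word, 1) pair for every word of every line.
      List<String[]> pairs = new ArrayList<String[]>();
      for (String line : input) {
        for (String word : line.toLowerCase().split("\\s+")) {
          pairs.add(new String[] { word, "1" });
        }
      }

      // "Sort, Shuffle": group the pair values by key (a TreeMap keeps keys sorted).
      Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
      for (String[] pair : pairs) {
        if (!grouped.containsKey(pair[0])) {
          grouped.put(pair[0], new ArrayList<Integer>());
        }
        grouped.get(pair[0]).add(Integer.parseInt(pair[1]));
      }

      // "Reduce": sum each key's values and print the output.
      for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
        int sum = 0;
        for (int count : entry.getValue()) {
          sum += count;
        }
        System.out.println(entry.getKey() + "\t" + sum);
      }
    }
  }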

Pages 57-67: Intro to HDFS and MapReduce

Cluster View of MapReduce

[Diagram, built up over several slides: a cluster with a JobTracker and a NameNode as masters, and three slave nodes each running a TaskTracker and a DataNode.]

- The client submits a jar containing the Map (M) and Reduce (R) code to the JobTracker.

- Map Phase: the JobTracker has the TaskTrackers run map tasks (M), which emit intermediate (k,v) pairs. *Intermediate data is stored locally.

- Shuffle/Sort: the intermediate (k,v) pairs are sorted and moved to the nodes that will run the reduce tasks.

- Reduce Phase: the TaskTrackers run reduce tasks (R) over the shuffled (k,v) pairs and write the final output.

- Job Complete!

Page 68: Intro to HDFS and MapReduce

The Hadoop Java API

Pages 69-70: Intro to HDFS and MapReduce

MapReduce in Java

Let's look at WordCount written in the MapReduce Java API.

Pages 71-77: Intro to HDFS and MapReduce

Map Code

  public class SimpleWordCountMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    static final Text word = new Text();
    static final IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text documentContents,
        OutputCollector<Text, IntWritable> collector, Reporter reporter)
        throws IOException {
      String[] tokens = documentContents.toString().split("\\s+");
      for (String wordString : tokens) {
        if (wordString.length() > 0) {
          word.set(wordString.toLowerCase());
          collector.collect(word, one);
        }
      }
    }
  }

Let's drill into this code...

- Mapper class with 4 type parameters for the input key-value types and output key-value types.

- Output key-value objects we'll reuse.

- Map method with input, output "collector", and reporting object.

- Tokenize the line, "collect" each (word, 1).

Pages 78-83: Intro to HDFS and MapReduce

Reduce Code

  public class SimpleWordCountReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> counts,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int count = 0;
      while (counts.hasNext()) {
        count += counts.next().get();
      }
      output.collect(key, new IntWritable(count));
    }
  }

Let's drill into this code...

- Reducer class with 4 type parameters for the input key-value types and output key-value types.

- Reduce method with input, output "collector", and reporting object.

- Count the counts per word and emit (word, N).
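
The slides stop at the mapper and reducer. For completeness, a minimal driver for this classic org.apache.hadoop.mapred API might look like the sketch below; the driver class name and the reliance on the default text input/output formats are assumptions, not part of the original deck.

  import java.io.IOException;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  // Hypothetical driver (not from the slides): wires the mapper and reducer
  // above into a job and submits it to the cluster.
  public class SimpleWordCount {
    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(SimpleWordCount.class);
      conf.setJobName("wordcount");

      conf.setMapperClass(SimpleWordCountMapper.class);
      conf.setReducerClass(SimpleWordCountReducer.class);

      // Key/value types emitted by the mapper and reducer.
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      // Input and output locations in HDFS, passed on the command line.
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);  // blocks until the job completes
    }
  }

Packaged into a jar, it would be launched with the hadoop jar command shown earlier.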

Pages 84-86: Intro to HDFS and MapReduce

Other Options

• HDFS - Hadoop Distributed File System.

• Map/Reduce - A distributed framework for executing work in parallel.

• Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.

• Pig - A top-down scripting language for manipulating data.

• HBase - A NoSQL, non-sequential data store.

Page 87: Intro to HDFS and MapReduce

Conclusions

Page 88: Intro to HDFS and MapReduce

Hadoop Benefits

• A cost-effective, scalable way to:

- Store massive data sets.

- Perform arbitrary analyses on those data sets.

Page 89: Intro to HDFS and MapReduce

Hadoop Tools

• Offers a variety of tools for:

- Application development.

- Integration with other platforms (e.g., databases).

Page 90: Intro to HDFS and MapReduce

Hadoop Distributions

• A rich, open-source ecosystem.

- Free to use.

- Commercially-supported distributions.

Page 91: Intro to HDFS and MapReduce

Thank You!

- Feel free to contact me at [email protected]

- Or our solutions consultant [email protected]

- As always, THINK BIG!

Page 92: Intro to HDFS and MapReduce

Bonus Content

Pages 93-94: Intro to HDFS and MapReduce

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.

• Map/Reduce - A distributed framework for executing work in parallel.

• Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.

• Pig - A top-down scripting language for manipulating data.

• HBase - A NoSQL, non-sequential data store.

Page 95: Intro to HDFS and MapReduce

Hive: SQL for Hadoop

Pages 96-97: Intro to HDFS and MapReduce

Hive

Let's look at WordCount written in Hive, the SQL for Hadoop.

Pages 98-103: Intro to HDFS and MapReduce

  CREATE TABLE docs (line STRING);

  LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;

  CREATE TABLE word_counts AS
  SELECT word, count(1) AS count
  FROM (SELECT explode(split(line, '\s')) AS word FROM docs) w
  GROUP BY word
  ORDER BY word;

Let's drill into this code...

- Create a table to hold the raw text we're counting. Each line of text becomes a row with a single "line" column.

- Load the text in the 'docs' directory into the table.

- Create the final table and fill it with the results of a nested query over the docs table that performs WordCount on the fly.
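
If these statements were saved to a file, they could be run non-interactively with the Hive CLI (the file name is hypothetical, not from the slides):

  hive -f wordcount.hql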

Pages 104-105: Intro to HDFS and MapReduce

Hive

Because so many Hadoop users come from SQL backgrounds, Hive is one of the most essential tools in the ecosystem!

Pages 106-107: Intro to HDFS and MapReduce

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.

• Map/Reduce - A distributed framework for executing work in parallel.

• Hive - A SQL-like syntax with a metastore to allow SQL manipulation of data stored on HDFS.

• Pig - A top-down scripting language for manipulating data.

• HBase - A NoSQL, non-sequential data store.

Page 108: Intro to HDFS and MapReduce

Pig: Data Flow for Hadoop

Pages 109-110: Intro to HDFS and MapReduce

Pig

Let's look at WordCount written in Pig, the Data Flow language for Hadoop.

Pages 111-118: Intro to HDFS and MapReduce

  inpt = LOAD 'docs' using TextLoader AS (line:chararray);

  words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;

  grpd = GROUP words BY word;

  cntd = FOREACH grpd GENERATE group, COUNT(words);

  STORE cntd INTO 'output';

Let's drill into this code...

- Like the Hive example, load the 'docs' content; each line is a "field".

- Tokenize into words (an array) and "flatten" into separate records.

- Collect the same words together.

- Count each word.

- Save the results. Profit!
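
Saved to a script, this could be run on the cluster with the pig command (the file name is hypothetical, not from the slides):

  pig -x mapreduce wordcount.pig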

Pages 119-120: Intro to HDFS and MapReduce

Pig

Pig and Hive overlap, but Pig is popular for ETL, e.g., data transformation, cleansing, ingestion, etc.

Page 121: Intro to HDFS and MapReduce

Questions?