Intro to HDFS and MapReduce


Copyright © 2012-2013, Think Big Analytics, All Rights Reserved

Introduction to HDFS and MapReduce

Thursday, January 10, 2013

Who Am I? Ryan Tabora

- Data Developer at Think Big Analytics
- Big Data Consulting
- Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.

Think Big Analytics

• One of Silicon Valley's fastest-growing Big Data startups.
• 100% focus on Big Data consulting & Data Science solution services.
• Management background: Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture.
  - C-bridge Internet Solutions (CBIS): founders & executives; founded 1996, IPO 1999.
• Clients: 40+.
• North America locations:
  - US East: Boston, New York, Washington D.C.
  - US Central: Chicago, Austin
  - US West: HQ Mountain View, San Diego, Salt Lake City
• EMEA & APAC

Think Big is the leading professional services firm that's purpose-built for Big Data.

Think Big Recognized as a Top Pure-Play Big Data Vendor

Source: Forbes, February 2012

Agenda

- Big Data
- Hadoop Ecosystem
- HDFS
- MapReduce in Hadoop
- The Hadoop Java API
- Conclusions

Big Data

A Data Shift...

[Figure: growth of the digital universe. Source: EMC Digital Universe Study]

Motivation

"Simple algorithms and lots of data trump complex models."
- Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems

Pioneers

• Google and Yahoo:
  - Index 850+ million websites, over one trillion URLs.
• Facebook ad targeting:
  - 840+ million users, > 50% of whom are active daily.

Hadoop Ecosystem

Common Tool?

• Hadoop
  - Cluster: distributed computing platform.
  - Commodity, server-class hardware.
  - Extensible platform.

Hadoop Origins

• MapReduce and Google File System (GFS) pioneered at Google.
• Hadoop is the commercially-supported open-source equivalent.

What Is Hadoop?

• Hadoop is a platform.
• Distributes and replicates data.
• Manages parallel tasks created by users.
• Runs as several processes on a cluster.
• The term Hadoop generally refers to a toolset, not a single tool.

Why Hadoop?

• Handles unstructured to semi-structured to structured data.
• Handles enormous data volumes.
• Flexible data analysis and machine learning tools.
• Cost-effective scalability.

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS.
• Pig - A top-down data flow scripting language for manipulating data.
• HBase - A NoSQL data store built for random, non-sequential access.

HDFS

What Is HDFS?

• Hadoop Distributed File System.
• Stores files in blocks across many nodes in a cluster.
• Replicates the blocks across nodes for durability.
• Master/Slave architecture.

HDFS Traits

• Not fully POSIX compliant.
• No file updates.
• Write once, read many times.
• Large blocks, sequential read patterns.
• Designed for batch processing.

HDFS Master

• NameNode
  - Runs on a single node as a master process
    ‣ Holds file metadata (which blocks are where)
    ‣ Directs client access to files in HDFS
• SecondaryNameNode
  - Not a hot failover
  - Maintains a copy of the NameNode metadata

HDFS Slaves

• DataNode
  - Generally runs on all nodes in the cluster
    ‣ Block creation/replication/deletion/reads
    ‣ Takes orders from the NameNode

HDFS Illustrated

[Figure, animated: a client issues "Put File" against a cluster of one NameNode and six DataNodes. The file is split into numbered blocks; for each block the NameNode tells the client which DataNodes to write to, and every block ends up replicated on several DataNodes.]
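To make the client side of "Put File" concrete, here is a minimal sketch (not from the original deck) using Hadoop's Java FileSystem API; the class name and paths are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster settings (NameNode address, etc.) from the config.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One call: the NameNode chooses the DataNodes for each block,
    // and the blocks are written and replicated behind the scenes.
    fs.copyFromLocalFile(new Path("/tmp/report.txt"),        // hypothetical local path
                         new Path("/user/demo/report.txt")); // hypothetical HDFS path
    fs.close();
  }
}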

Power of Hadoop

[Figure, animated: a client issues "Read File"; the NameNode returns each block's locations, and the client reads different blocks from several DataNodes in parallel. When one DataNode drops out of the cluster, the read continues from replicas on the remaining nodes.]

Aggregate read throughput = transfer rate x number of machines, e.g. 100 MB/s x 3 = 300 MB/s.

HDFS Shell

• Easy-to-use command line interface.
• Create, copy, move, and delete files.
• Administrative duties - chmod, chown, chgrp.
• Set replication factor for a file.
• Head, tail, cat to view files.
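As a quick illustration (not from the original deck), a session with the HDFS shell might look like this; the paths and file names are hypothetical:

hadoop fs -mkdir /user/demo                  # create a directory
hadoop fs -put local.txt /user/demo/         # copy a local file into HDFS
hadoop fs -ls /user/demo                     # list a directory
hadoop fs -cat /user/demo/local.txt          # print a file
hadoop fs -setrep -w 2 /user/demo/local.txt  # change a file's replication factor
hadoop fs -chmod 640 /user/demo/local.txt    # administrative duties
hadoop fs -rm /user/demo/local.txt           # delete a file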

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS.
• Pig - A top-down data flow scripting language for manipulating data.
• HBase - A NoSQL data store built for random, non-sequential access.

MapReduce in Hadoop

MapReduce Basics

• Logical functions: Mappers and Reducers.
• Developers write map and reduce functions, then submit a jar to the Hadoop cluster.
• Hadoop handles distributing the Map and Reduce tasks across the cluster.
• Typically batch oriented.

MapReduce Daemons

• JobTracker (Master)
  - Manages MapReduce jobs, giving tasks to different nodes, managing task failure
• TaskTracker (Slave)
  - Creates individual map and reduce tasks
  - Reports task status to JobTracker

MapReduce in Hadoop

Let's look at how MapReduce actually works in Hadoop, using WordCount.

[Figure, animated: WordCount data flow - Input → Mappers → Sort/Shuffle → Reducers → Output. We need to convert the Input into the Output.]

Input (one document per mapper; one of the four documents is empty):

  "Hadoop uses MapReduce"
  "There is a Map phase"
  "There is a Reduce phase"

Mapper output (a (word, 1) pair for every word):

  (hadoop, 1), (uses, 1), (mapreduce, 1)
  (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1)
  (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1)

Sort/Shuffle (group values by key; keys partitioned across three reducers by range):

  0-9, a-l: (a, [1,1]), (hadoop, [1]), (is, [1,1])
  m-q:      (map, [1]), (mapreduce, [1]), (phase, [1,1])
  r-z:      (reduce, [1]), (there, [1,1]), (uses, [1])

Reducer output:

  a 2, hadoop 1, is 2
  map 1, mapreduce 1, phase 2
  reduce 1, there 2, uses 1

Map:
• Transform one input to 0-N outputs.

Reduce:
• Collect multiple inputs into one output.

Cluster View of MapReduce

[Figure, animated: a jar containing Map (M) and Reduce (R) code is submitted to the JobTracker; each worker node runs a TaskTracker alongside a DataNode.]

• The client submits the job jar to the JobTracker.
• Map Phase: the JobTracker launches map tasks (M) on the TaskTrackers.
• The map tasks emit intermediate (k,v) pairs. Intermediate data is stored locally, not in HDFS.
• Shuffle/Sort: the intermediate (k,v) pairs are sorted and moved to the nodes that will run the reducers.
• Reduce Phase: reduce tasks (R) aggregate the values for each key.
• Job Complete!

The Hadoop Java API

MapReduce in Java

Let's look at WordCount written in the MapReduce Java API.

Map Code

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SimpleWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Output key-value objects we'll reuse across calls.
  static final Text word = new Text();
  static final IntWritable one = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text documentContents,
      OutputCollector<Text, IntWritable> collector, Reporter reporter)
      throws IOException {
    // Tokenize the line on whitespace and "collect" each (word, 1).
    String[] tokens = documentContents.toString().split("\\s+");
    for (String wordString : tokens) {
      if (wordString.length() > 0) {
        word.set(wordString.toLowerCase());
        collector.collect(word, one);
      }
    }
  }
}

Let's drill into this code...

• Mapper class with 4 type parameters for the input key-value types and output types.
• Output key-value objects we'll reuse.
• Map method with input, output "collector", and reporting object.
• Tokenize the line, "collect" each (word, 1).

Reduce Code

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SimpleWordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> counts,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Sum the counts for this word and emit (word, N).
    int count = 0;
    while (counts.hasNext()) {
      count += counts.next().get();
    }
    output.collect(key, new IntWritable(count));
  }
}

Let's drill into this code...

• Reducer class with 4 type parameters for the input key-value types and output types.
• Reduce method with input, output "collector", and reporting object.
• Count the counts per word and emit (word, N).
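The deck shows the mapper and reducer but not the driver that wires them together and submits the jar. Here is a minimal sketch, assuming the same classic org.apache.hadoop.mapred API; the class name SimpleWordCount and the command-line argument handling are our own additions:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SimpleWordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SimpleWordCount.class);
    conf.setJobName("wordcount");

    // Key-value types the job emits.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // The classes from the two previous slides.
    conf.setMapperClass(SimpleWordCountMapper.class);
    conf.setReducerClass(SimpleWordCountReducer.class);

    // HDFS input and output locations, taken from the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submit the job and block until it finishes.
    JobClient.runJob(conf);
  }
}

Packaged into a jar, this would be launched with something like: hadoop jar wordcount.jar SimpleWordCount input output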

Other Options

• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS.
• Pig - A top-down data flow scripting language for manipulating data.
• HBase - A NoSQL data store built for random, non-sequential access.

Conclusions

Hadoop Benefits

• A cost-effective, scalable way to:
  - Store massive data sets.
  - Perform arbitrary analyses on those data sets.

Hadoop Tools

• Offers a variety of tools for:
  - Application development.
  - Integration with other platforms (e.g., databases).

Hadoop Distributions

• A rich, open-source ecosystem.
  - Free to use.
  - Commercially-supported distributions.

Thank You!

- Feel free to contact me at
  ‣ ryan.tabora@thinkbiganalytics.com
- Or our solutions consultant
  ‣ matt.mcdevitt@thinkbiganalytics.com
- As always, THINK BIG!

Bonus Content

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS.
• Pig - A top-down data flow scripting language for manipulating data.
• HBase - A NoSQL data store built for random, non-sequential access.

Hive: SQL for Hadoop

Let's look at WordCount written in Hive, the SQL for Hadoop.

CREATE TABLE docs (line STRING);

LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

Let's drill into this code...

• Create a table to hold the raw text we're counting. Each line of text becomes one row in the single line column.
• Load the text in the 'docs' directory into the table.
• Create the final table and fill it with the results from a nested query of the docs table that performs WordCount on the fly.
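Saved to a script file, the whole thing can be run in one shot (Hive compiles the query into MapReduce jobs like the ones shown earlier); the file name wordcount.hql is hypothetical:

hive -f wordcount.hql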

Hive

Because so many Hadoop users come from SQL backgrounds, Hive is one of the most essential tools in the ecosystem!

The Hadoop Ecosystem

• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore that allows SQL manipulation of data stored on HDFS.
• Pig - A top-down data flow scripting language for manipulating data.
• HBase - A NoSQL data store built for random, non-sequential access.

Pig: Data Flow for Hadoop

Let's look at WordCount written in Pig, the Data Flow language for Hadoop.

inpt = LOAD 'docs' USING TextLoader AS (line:chararray);

words = FOREACH inpt GENERATE FLATTEN(TOKENIZE(line)) AS word;

grpd = GROUP words BY word;

cntd = FOREACH grpd GENERATE group, COUNT(words);

STORE cntd INTO 'output';

Let's drill into this code...

• Like the Hive example, load the 'docs' content; each line is a "field".
• Tokenize each line into words (a bag of words) and "flatten" the bag into separate records.
• Collect the same words together.
• Count each word.
• Save the results. Profit!
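As with Hive, the script can be run from a file; the name wordcount.pig is hypothetical:

pig wordcount.pig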

Pig

Pig and Hive overlap, but Pig is popular for ETL, e.g., data transformation, cleansing, ingestion, etc.

Questions?