Introduction to MapReduce and Hadoop
IT 332 Distributed Systems

What is MapReduce?

Data-parallel programming model for clusters of commodity machines.

Pioneered by Google, which uses it to process 20 PB of data per day.

Popularized by the open-source Hadoop project; used by Yahoo!, Facebook, Amazon, …

What is MapReduce used for?

At Google: index building for Google Search; article clustering for Google News; statistical machine translation.

At Yahoo!: index building for Yahoo! Search; spam detection for Yahoo! Mail.

At Facebook: data mining; ad optimization; spam detection.

What is MapReduce used for?

In research: analyzing Wikipedia conflicts (PARC); natural language processing (CMU); bioinformatics (Maryland); astronomical image analysis (Washington); ocean climate simulation (Washington); <Your application here>.

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop: Pig and Hive

MapReduce Design Goals

1. Scalability to large data volumes: scanning 100 TB on one node at 50 MB/s takes about 23 days (100 TB / 50 MB/s ≈ 2 million seconds); scanning on a 1000-node cluster takes about 33 minutes.

2. Cost-efficiency: commodity nodes (cheap, but unreliable); commodity network; automatic fault tolerance (fewer admins); easy to use (fewer programmers).

Typical Hadoop Cluster

Aggregation switch

Rack switch

40 nodes/rack, 1000-4000 nodes per cluster

1 Gbps bandwidth within a rack, 8 Gbps out of the rack

Node specs (Yahoo! terasort): 8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB)

Typical Hadoop Cluster

Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf

Challenges

Cheap nodes fail, especially if you have many of them: mean time between failures (MTBF) for one node ≈ 3 years, so MTBF for a 1000-node cluster ≈ 1 day (with 1000 nodes, each failing about once per 1000 days, you expect roughly one failure somewhere every day). Solution: build fault tolerance into the system.

Commodity network = low bandwidth. Solution: push computation to the data.

Programming distributed systems is hard. Solution: a data-parallel programming model; users write "map" and "reduce" functions, and the system handles work distribution and fault tolerance.

Hadoop Components

Distributed file system (HDFS): a single namespace for the entire cluster; replicates data 3x for fault tolerance.

MapReduce implementation: executes user jobs specified as "map" and "reduce" functions; manages work distribution and fault tolerance.

Hadoop Distributed File System

Files are split into 128 MB blocks.

Blocks are replicated across several datanodes (usually 3).

A single namenode stores metadata (file names, block locations, etc.).

Optimized for large files and sequential reads.

Files are append-only.

[Figure: HDFS architecture. The namenode records that File1 consists of blocks 1, 2, 3, 4; the datanodes each hold a subset of those blocks (e.g., 1,2,4 / 2,1,3 / 1,4,3 / 3,2,4), so every block is replicated on several nodes.]
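As a rough illustration of how a client uses this split between namenode metadata and datanode blocks, the hedged sketch below uses the standard org.apache.hadoop.fs.FileSystem API. The path is a made-up placeholder, the configuration is whatever core-site.xml/hdfs-site.xml the client picks up, and append() only succeeds on clusters where appends are enabled.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);         // client handle; metadata operations go to the namenode
    Path p = new Path("/tmp/example.txt");        // hypothetical path

    // Writing creates the file as a sequence of blocks, each replicated on several datanodes.
    FSDataOutputStream out = fs.create(p);
    out.writeBytes("hello hdfs\n");
    out.close();

    // Files are append-only: existing bytes cannot be overwritten in place.
    FSDataOutputStream more = fs.append(p);       // requires append support to be enabled
    more.writeBytes("another line\n");
    more.close();

    FileStatus st = fs.getFileStatus(p);          // metadata (size, replication) comes from the namenode
    System.out.println(st.getLen() + " bytes, replication " + st.getReplication());
    fs.close();
  }
}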

MapReduce Programming Model

Data type: key-value records

Map function:

(K_in, V_in) -> list(K_inter, V_inter)

Reduce function:

(K_inter, list(V_inter)) -> list(K_out, V_out)

Example Word Count

def mapper(line):
    foreach word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))

Word Count Execution

[Figure: Word Count execution. Input splits ("the quick brown fox", "the fox ate the mouse", "how now brown cow") are processed by three Map tasks, which emit (word, 1) pairs; after Shuffle & Sort, two Reduce tasks sum the counts, producing: ate 1, brown 2, cow 1, fox 2, how 1, mouse 1, now 1, quick 1, the 3. Stages: Input, Map, Shuffle & Sort, Reduce, Output.]

MapReduce Execution Details

A single master controls job execution on multiple slaves, as well as user scheduling.

Mappers are preferentially placed on the same node or the same rack as their input block, pushing computation to the data and minimizing network use.

Mappers save their outputs to local disk rather than pushing them directly to reducers; this allows having more reducers than nodes and allows recovery if a reducer crashes.

An Optimization The Combiner

A combiner is a local aggregation function for repeated keys produced by the same map task.

Useful for associative operations like sum, count, and max.

Decreases the size of the intermediate data.

Example: local counting for Word Count:

def combiner(key, values):
    output(key, sum(values))
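In Hadoop, the combiner is simply another Reducer class registered on the job; the WordCount listing later in this document does exactly this. A minimal sketch of the relevant driver lines (old mapred API, class names as in that listing):

JobConf conf = new JobConf(WordCount.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);   // local aggregation applied to each map's output
conf.setReducerClass(Reduce.class);    // the same class works because summing is associative
// Note: Hadoop may run the combiner zero, one, or several times per map,
// so the job must produce correct results even if it never runs.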

Word Count with Combiner

[Figure: Word Count with a combiner. Same input and Map/Reduce tasks as before, but each map combines its own output before the shuffle; for example, the map that reads "the fox ate the mouse" now emits (the, 2) once instead of (the, 1) twice, so less data flows through Shuffle & Sort. The final output is unchanged. Stages: Input, Map & Combine, Shuffle & Sort, Reduce, Output.]

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop: Pig and Hive

Fault Tolerance in MapReduce

1. If a task crashes: retry it on another node.

This is okay for a map because it has no dependencies; it is okay for a reduce because the map outputs are on disk.

If the same task repeatedly fails, fail the job or ignore that input block (user-controlled).

Note: for this and the other fault tolerance features to work, your map and reduce tasks must be side-effect-free.

Fault Tolerance in MapReduce

2. If a node crashes: relaunch its current tasks on other nodes, and relaunch any maps the node previously ran.

Relaunching earlier maps is necessary because their output files were lost along with the crashed node.

Fault Tolerance in MapReduce

3. If a task is going slowly (a straggler): launch a second copy of the task on another node, take the output of whichever copy finishes first, and kill the other one.

This is critical for performance in large clusters: stragglers occur frequently due to failing hardware, bugs, misconfiguration, etc.
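Hadoop calls this mechanism speculative execution, and it can be toggled per job. A hedged sketch using the old mapred API (the driver class name is a placeholder; property names are those used by Hadoop 1.x, while newer releases use mapreduce.map.speculative / mapreduce.reduce.speculative):

JobConf conf = new JobConf(MyJob.class);    // MyJob is a hypothetical driver class
conf.setMapSpeculativeExecution(true);      // allow backup copies of slow map tasks
conf.setReduceSpeculativeExecution(true);   // and of slow reduce tasks
// Equivalent configuration properties (Hadoop 1.x names):
//   mapred.map.tasks.speculative.execution = true
//   mapred.reduce.tasks.speculative.execution = true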

Takeaways

By providing a data-parallel programming model, MapReduce can control job execution in useful ways: automatic division of a job into tasks, automatic placement of computation near data, automatic load balancing, and recovery from failures and stragglers.

The user focuses on the application, not on the complexities of distributed computing.

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop: Pig and Hive

1 Search

Input: (lineNumber, line) records

Output: lines matching a given pattern

Map: if (line matches pattern): output(line)

Reduce: identity function. (Alternative: no reducer at all, i.e., a map-only job.)
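A hedged sketch of the map-only variant in Java (old mapred API, matching the WordCount listing later in this document): the pattern is hard-coded purely for illustration, and setting the number of reduce tasks to zero makes Hadoop write the map output directly, with no shuffle.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class Grep {
  public static class MatchMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, NullWritable> output,
                    Reporter reporter) throws IOException {
      // Emit only the lines that match the pattern; everything else is dropped.
      if (line.toString().contains("error")) {           // "error" is a placeholder pattern
        output.collect(line, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(Grep.class);
    conf.setJobName("grep");
    conf.setMapperClass(MatchMapper.class);
    conf.setNumReduceTasks(0);                            // map-only: no shuffle, no reduce
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(NullWritable.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}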


2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce: identity function

Trick: pick a partitioning function h such that k1 < k2 => h(k1) < h(k2), so that concatenating the reducers' outputs yields a globally sorted result (see the partitioner sketch after the figure below).

[Figure: Sort execution. Input files ("pig sheep yak zebra" and "aardvark ant bee cow elephant") are read by Map tasks; the partitioner sends keys in [A-M] to one reducer and keys in [N-Z] to the other, so the concatenated reducer outputs (aardvark, ant, bee, cow, elephant, ..., pig, sheep, yak, zebra) are globally sorted.]
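A hedged sketch of such a partitioner in Java (old mapred API): it splits Text keys on their first letter, with [A-M] going to reducer 0 and everything later to reducer 1, mirroring the figure. Real jobs would usually sample the key distribution (for example with Hadoop's TotalOrderPartitioner) rather than hard-coding the split point.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Order-preserving partitioner: if k1 < k2 then partition(k1) <= partition(k2),
// so concatenating the reducers' outputs gives a globally sorted result.
public class AlphabetPartitioner implements Partitioner<Text, Text> {
  public void configure(JobConf job) { }                 // no configuration needed

  public int getPartition(Text key, Text value, int numPartitions) {
    if (numPartitions < 2) return 0;
    char first = Character.toUpperCase(key.toString().charAt(0));
    return (first <= 'M') ? 0 : 1;                        // [A-M] -> reducer 0, [N-Z] -> reducer 1
  }
}

// In the driver: conf.setPartitionerClass(AlphabetPartitioner.class); conf.setNumReduceTasks(2);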

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map: foreach word in text.split(): output(word, filename)

Combine: uniquify (de-duplicate) the filenames for each word

Reduce: def reduce(word, filenames): output(word, sort(filenames))

Inverted Index Example

[Figure: Inverted index example. Inputs: hamlet.txt containing "to be or not to be" and 12th.txt containing "be not afraid of greatness". The maps emit (word, filename) pairs such as (to, hamlet.txt), (be, hamlet.txt), (be, 12th.txt), (afraid, 12th.txt); the reduce produces the index: afraid (12th.txt); be (12th.txt, hamlet.txt); greatness (12th.txt); not (12th.txt, hamlet.txt); of (12th.txt); or (hamlet.txt); to (hamlet.txt).]

4 Most Popular Words

Input (filename text) records

Output: the 100 words occurring in the most files

Two-stage solution:

Job 1: create an inverted index, giving (word, list(files)) records.

Job 2: map each (word, list(files)) record to (count, word), then sort these records by count as in the sort job.
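A hedged sketch of the Job 2 mapper (old mapred API). It assumes Job 1 wrote one line per word, with the word and a comma-separated file list separated by a tab, readable with KeyValueTextInputFormat; emitting the count as the key lets the framework's shuffle sort the records by count, after which a single reducer can keep just the top 100.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper for Job 2 of "most popular words"; names and input layout are assumptions.
public class FileCountMapper extends MapReduceBase
    implements Mapper<Text, Text, IntWritable, Text> {
  public void map(Text word, Text fileList,
                  OutputCollector<IntWritable, Text> output,
                  Reporter reporter) throws IOException {
    // Number of files containing this word = number of entries in the comma-separated list.
    int count = fileList.toString().split(",").length;
    output.collect(new IntWritable(count), word);   // key by count so records sort by popularity
  }
}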

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop: Pig and Hive

Cloudera Virtual Machine

The Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop.

Requirements:

A 64-bit host OS.

Virtualization software: VMware Player, KVM, or VirtualBox. The virtualization software requires a machine that supports virtualization; if you are unsure, one way to check is to look in your BIOS and see whether virtualization is enabled.

4 GB of total RAM. The total system memory required varies depending on the size of your data set and on the other processes that are running.

Installation

Step 1: Download and run VMware Player.

Step 2: Download the Cloudera VM.

Step 3: Extract it to the Cloudera folder.

Step 4: Open cloudera-quickstart-vm-4.4.0-1-vmware.

Once you have the software installed, fire up the VirtualBox image of the Cloudera QuickStart VM and you should see an initial screen similar to the one below.

WordCount Example

This example computes the frequency of occurrence of each word in a text file.

Steps:

1. Set up the Hadoop environment.
2. Upload the input files into HDFS.
3. Execute the Java MapReduce functions in Hadoop.

Tutorial: https://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1.html

The complete WordCount class (classic org.apache.hadoop.mapred API), with the imports and class wrapper that the slides omit:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      // Emit (word, 1) for every token in the line.
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      // Sum the counts for each word.
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
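To run the example, the class is typically compiled against the Hadoop client libraries, packaged into a jar, and submitted with a command along the lines of hadoop jar wordcount.jar WordCount <input dir> <output dir>, where the two arguments are the HDFS paths consumed by main() above (the jar and path names here are illustrative).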

How it Works

Purpose: simple serialization for keys, values, and other data.

Interface: Writable. Read and write the binary format; convert to String for text formats.

Classes

OutputCollector: collects data output by either the Mapper or the Reducer, i.e., intermediate outputs or the output of the job. Methods: collect(K key, V value).

Interface

Text: stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level. Methods: set(byte[] utf8).
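As a hedged illustration of the Writable contract described above, the sketch below defines a hypothetical value type that serializes two fields in binary form; the class name and fields are made up, but write() and readFields() are the two methods the interface actually requires.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A hypothetical record carrying a URL and a click count.
public class PageStat implements Writable {
  private String url = "";
  private long clicks = 0;

  public void set(String url, long clicks) { this.url = url; this.clicks = clicks; }

  public void write(DataOutput out) throws IOException {
    out.writeUTF(url);        // binary serialization of each field, in a fixed order
    out.writeLong(clicks);
  }

  public void readFields(DataInput in) throws IOException {
    url = in.readUTF();       // fields must be read back in exactly the same order
    clicks = in.readLong();
  }

  public String toString() { return url + "\t" + clicks; }  // used by text output formats
}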

Demo


Exercises

Get WordCount running

Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words.

For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc.

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop: Pig and Hive (reading)

Motivation

MapReduce is great, as many algorithms can be expressed as a series of MapReduce jobs.

But it's low-level: you must think about keys, values, partitioning, etc.

Can we capture common "job patterns"?

Pig

Started at Yahoo Research

Now runs about 30% of Yahoo!'s jobs

Features: expresses sequences of MapReduce jobs; data model of nested "bags" of items; provides relational (SQL-like) operators (JOIN, GROUP BY, etc.); easy to plug in Java functions; Pig Pen, a development environment for Eclipse.

An Example Problem

Suppose you have user data in one file and website data in another, and you need to find the top 5 most-visited pages by users aged 18-25.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {

    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {
        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc, Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // Drop the rest of the record; just pass a 1 for the combiner/reducer to sum.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {
        public void reduce(Text key, Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc, Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {
        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
            oc.collect((LongWritable) val, (Text) key);
        }
    }

    public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;
        public void reduce(LongWritable key, Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;

store Top5 into 'top5sites';

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of Translation

Notice how naturally the components of the job translate into Pig Latin:

Load Users -> Users = load …
Load Pages -> Pages = load …
Filter by age -> Fltrd = filter …
Join on name -> Joined = join …
Group on url -> Grouped = group …
Count clicks -> Summed = … count()…
Order by clicks -> Sorted = order …
Take top 5 -> Top5 = limit …

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation (2)

Notice how naturally the components of the job translate into Pig Latin, and how the operators group into MapReduce jobs (roughly three in this case):

Job 1: Load Users, Load Pages, Filter by age, Join on name (Users = load …; Pages = load …; Fltrd = filter …; Joined = join …)

Job 2: Group on url, Count clicks (Grouped = group …; Summed = … count()…)

Job 3: Order by clicks, Take top 5 (Sorted = order …; Top5 = limit …)

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Hive

Developed at Facebook

Used for the majority of Facebook jobs

A "relational database" built on Hadoop: maintains a list of table schemas; offers an SQL-like query language (HQL); can call Hadoop Streaming scripts from HQL; supports table partitioning, clustering, complex data types, and some optimizations.

Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
                        page_url STRING, referrer_url STRING,
                        ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitioning breaks the table into separate files for each (dt, country) pair. Ex: /hive/page_view/dt=2008-06-08/country=US and /hive/page_view/dt=2008-06-08/country=CA

Simple Query

Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

Hive only reads the partitions matching the date, instead of scanning the entire table.

Aggregation and Joins

Count the users who visited each page, broken down by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

Sample output:

page_url    gender   count(userid)
home.php    MALE     12141412
home.php    FEMALE   15431579
photo.php   MALE     23941451
photo.php   FEMALE   21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;

Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.

Principal philosophies: make it scale, so you can throw hardware at problems; make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance).

Hive and Pig further simplify programming.

MapReduce is not suitable for all problems, but when it works it may save you a lot of time.

Resources

Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig

Hive: http://hadoop.apache.org/hive

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training


What is MapReduce

Data-parallel programming model for clusters of commodity machines

Pioneered by Google Processes 20 PB of data per day

Popularized by open-source Hadoop project Used by Yahoo Facebook Amazon hellip

What is MapReduce used for

bull At Googlendash Index building for Google Searchndash Article clustering for Google Newsndash Statistical machine translation

bull At Yahoondash Index building for Yahoo Searchndash Spam detection for Yahoo Mail

bull At Facebookndash Data miningndash Ad optimizationndash Spam detection

What is MapReduce used for

In research Analyzing Wikipedia conflicts (PARC) Natural language processing (CMU) Bioinformatics (Maryland) Astronomical image analysis (Washington) Ocean climate simulation (Washington) ltYour application heregt

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

MapReduce Design Goals

1 Scalability to large data volumes Scan 100 TB on 1 node 50 MBs = 23 days Scan on 1000-node cluster = 33 minutes

2 Cost-efficiency Commodity nodes (cheap but unreliable) Commodity network Automatic fault-tolerance (fewer admins) Easy to use (fewer programmers)

Typical Hadoop Cluster

Aggregation switch

Rack switch

40 nodesrack 1000-4000 nodes in cluster

1 GBps bandwidth in rack 8 GBps out of rack

Node specs (Yahoo terasort)8 x 20 GHz cores 8 GB RAM 4 disks (= 4 TB)

Typical Hadoop Cluster

Image from httpwikiapacheorghadoop-dataattachmentsHadoopPresentationsattachmentsaw-apachecon-eu-2009pdf

Challenges

Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system

Commodity network = low bandwidth Solution Push computation to the data

Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and

ldquoreducerdquo functions system handles work distribution and fault tolerance

Hadoop Components

Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance

MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance

Hadoop Distributed File System Files split into 128MB blocks

Blocks replicated across several datanodes (usually 3)

Single namenode stores metadata (file names block locations etc)

Optimized for large files sequential reads

Files are append-only

Namenode

Datanodes

1234

124

213

143

324

File1

MapReduce Programming Model Data type key-value records

Map function

(Kin Vin) list(Kinter Vinter)

Reduce function

(Kinter list(Vinter)) list(Kout Vout)

Example Word Count

def mapper(line) foreach word in linesplit)(

output(word 1)

def reducer(key values) output(key sum(values))

Word Count Execution

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 1fox 1the 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Input Map Shuffle amp Sort Reduce Output

MapReduce Execution Details

Single master controls job execution on multiple slaves as well as user scheduling

Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use

Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes

An Optimization The Combiner

A combiner is a local aggregation function for repeated keys produced by same map

For associative ops like sum count max

Decreases size of intermediate data

Example local counting for Word Count

def combiner(key values) output(key sum(values))

Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 2fox 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1 If a task crashes Retry on another node

Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk

If the same task repeatedly fails fail the job or ignore that input block (user-controlled)

Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

What is MapReduce used for

bull At Googlendash Index building for Google Searchndash Article clustering for Google Newsndash Statistical machine translation

bull At Yahoondash Index building for Yahoo Searchndash Spam detection for Yahoo Mail

bull At Facebookndash Data miningndash Ad optimizationndash Spam detection

What is MapReduce used for

In research Analyzing Wikipedia conflicts (PARC) Natural language processing (CMU) Bioinformatics (Maryland) Astronomical image analysis (Washington) Ocean climate simulation (Washington) ltYour application heregt

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

MapReduce Design Goals

1 Scalability to large data volumes Scan 100 TB on 1 node 50 MBs = 23 days Scan on 1000-node cluster = 33 minutes

2 Cost-efficiency Commodity nodes (cheap but unreliable) Commodity network Automatic fault-tolerance (fewer admins) Easy to use (fewer programmers)

Typical Hadoop Cluster

Aggregation switch

Rack switch

40 nodesrack 1000-4000 nodes in cluster

1 GBps bandwidth in rack 8 GBps out of rack

Node specs (Yahoo terasort)8 x 20 GHz cores 8 GB RAM 4 disks (= 4 TB)

Typical Hadoop Cluster

Image from httpwikiapacheorghadoop-dataattachmentsHadoopPresentationsattachmentsaw-apachecon-eu-2009pdf

Challenges

Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system

Commodity network = low bandwidth Solution Push computation to the data

Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and

ldquoreducerdquo functions system handles work distribution and fault tolerance

Hadoop Components

Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance

MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance

Hadoop Distributed File System Files split into 128MB blocks

Blocks replicated across several datanodes (usually 3)

Single namenode stores metadata (file names block locations etc)

Optimized for large files sequential reads

Files are append-only

Namenode

Datanodes

1234

124

213

143

324

File1

MapReduce Programming Model Data type key-value records

Map function

(Kin Vin) list(Kinter Vinter)

Reduce function

(Kinter list(Vinter)) list(Kout Vout)

Example Word Count

def mapper(line) foreach word in linesplit)(

output(word 1)

def reducer(key values) output(key sum(values))

Word Count Execution

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 1fox 1the 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Input Map Shuffle amp Sort Reduce Output

MapReduce Execution Details

Single master controls job execution on multiple slaves as well as user scheduling

Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use

Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes

An Optimization The Combiner

A combiner is a local aggregation function for repeated keys produced by same map

For associative ops like sum count max

Decreases size of intermediate data

Example local counting for Word Count

def combiner(key values) output(key sum(values))

Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 2fox 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1 If a task crashes Retry on another node

Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk

If the same task repeatedly fails fail the job or ignore that input block (user-controlled)

Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);       // emit (word, 1) for every token
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();      // add up the counts for this word
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);   // the reducer doubles as a combiner
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

How it Works

Purpose: simple serialization for keys, values, and other data

Interface Writable: read and write the binary format; convert to String for text formats

Classes

OutputCollector: collects data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job
Methods: collect(K key, V value)

Interface

Text: stores text using standard UTF-8 encoding; provides methods to serialize, deserialize, and compare texts at the byte level
Methods: set(byte[] utf8)
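As a rough analogy only (this is not the Hadoop API, just a Python sketch of the idea), a "writable" type knows how to serialize itself to a binary stream, read itself back, and render itself as text:

import io
import struct

class IntWritableLike:
    """Toy analogue of a Writable: a fixed binary encoding plus a text form."""
    def __init__(self, value=0):
        self.value = value
    def write(self, out):
        # serialize as a 4-byte big-endian integer
        out.write(struct.pack(">i", self.value))
    def read_fields(self, inp):
        # deserialize from the same binary format
        (self.value,) = struct.unpack(">i", inp.read(4))
    def __str__(self):
        # convert to a string for text formats
        return str(self.value)

# usage sketch
buf = io.BytesIO()
IntWritableLike(42).write(buf)
buf.seek(0)
w = IntWritableLike()
w.read_fields(buf)   # w.value == 42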

Demo


Exercises

Get WordCount running

Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words.

For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc.
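One possible starting point, sketched in the same Python pseudocode style as the word count example; the reducer is unchanged from word count.

def mapper(line):
    words = line.split()
    # each consecutive pair of words is a bigram
    for first, second in zip(words, words[1:]):
        yield ((first, second), 1)

def reducer(bigram, counts):
    yield (bigram, sum(counts))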

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop: Pig and Hive (reading)

Motivation

MapReduce is great, as many algorithms can be expressed as a series of MR jobs

But it's low-level: you must think about keys, values, partitioning, etc.

Can we capture common "job patterns"?

Pig

Started at Yahoo Research

Now runs about 30% of Yahoo's jobs

Features:
Expresses sequences of MapReduce jobs
Data model: nested "bags" of items
Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
Easy to plug in Java functions
Pig Pen: dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {

  public static class LoadPages extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text val,
        OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
      // Pull the key out
      String line = val.toString();
      int firstComma = line.indexOf(',');
      String key = line.substring(0, firstComma);
      String value = line.substring(firstComma + 1);
      Text outKey = new Text(key);
      // Prepend an index to the value so we know which file it came from
      Text outVal = new Text("1" + value);
      oc.collect(outKey, outVal);
    }
  }

  public static class LoadAndFilterUsers extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text val,
        OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
      // Pull the key out
      String line = val.toString();
      int firstComma = line.indexOf(',');
      String value = line.substring(firstComma + 1);
      int age = Integer.parseInt(value);
      if (age < 18 || age > 25) return;
      String key = line.substring(0, firstComma);
      Text outKey = new Text(key);
      // Prepend an index to the value so we know which file it came from
      Text outVal = new Text("2" + value);
      oc.collect(outKey, outVal);
    }
  }

  public static class Join extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> iter,
        OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
      // For each value, figure out which file it's from and store it accordingly
      List<String> first = new ArrayList<String>();
      List<String> second = new ArrayList<String>();
      while (iter.hasNext()) {
        Text t = iter.next();
        String value = t.toString();
        if (value.charAt(0) == '1') first.add(value.substring(1));
        else second.add(value.substring(1));
      }
      reporter.setStatus("OK");
      // Do the cross product and collect the values
      for (String s1 : first) {
        for (String s2 : second) {
          String outval = key + "," + s1 + "," + s2;
          oc.collect(null, new Text(outval));
          reporter.setStatus("OK");
        }
      }
    }
  }

  public static class LoadJoined extends MapReduceBase
      implements Mapper<Text, Text, Text, LongWritable> {
    public void map(Text k, Text val,
        OutputCollector<Text, LongWritable> oc, Reporter reporter) throws IOException {
      // Find the url
      String line = val.toString();
      int firstComma = line.indexOf(',');
      int secondComma = line.indexOf(',', firstComma);
      String key = line.substring(firstComma, secondComma);
      // drop the rest of the record, I don't need it anymore,
      // just pass a 1 for the combiner/reducer to sum instead
      Text outKey = new Text(key);
      oc.collect(outKey, new LongWritable(1L));
    }
  }

  public static class ReduceUrls extends MapReduceBase
      implements Reducer<Text, LongWritable, WritableComparable, Writable> {
    public void reduce(Text key, Iterator<LongWritable> iter,
        OutputCollector<WritableComparable, Writable> oc, Reporter reporter) throws IOException {
      // Add up all the values we see
      long sum = 0;
      while (iter.hasNext()) {
        sum += iter.next().get();
        reporter.setStatus("OK");
      }
      oc.collect(key, new LongWritable(sum));
    }
  }

  public static class LoadClicks extends MapReduceBase
      implements Mapper<WritableComparable, Writable, LongWritable, Text> {
    public void map(WritableComparable key, Writable val,
        OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
      oc.collect((LongWritable) val, (Text) key);
    }
  }

  public static class LimitClicks extends MapReduceBase
      implements Reducer<LongWritable, Text, LongWritable, Text> {
    int count = 0;
    public void reduce(LongWritable key, Iterator<Text> iter,
        OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
      // Only output the first 100 records
      while (count < 100 && iter.hasNext()) {
        oc.collect(key, iter.next());
        count++;
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf lp = new JobConf(MRExample.class);
    lp.setJobName("Load Pages");
    lp.setInputFormat(TextInputFormat.class);
    lp.setOutputKeyClass(Text.class);
    lp.setOutputValueClass(Text.class);
    lp.setMapperClass(LoadPages.class);
    FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
    FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
    lp.setNumReduceTasks(0);
    Job loadPages = new Job(lp);

    JobConf lfu = new JobConf(MRExample.class);
    lfu.setJobName("Load and Filter Users");
    lfu.setInputFormat(TextInputFormat.class);
    lfu.setOutputKeyClass(Text.class);
    lfu.setOutputValueClass(Text.class);
    lfu.setMapperClass(LoadAndFilterUsers.class);
    FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
    FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
    lfu.setNumReduceTasks(0);
    Job loadUsers = new Job(lfu);

    JobConf join = new JobConf(MRExample.class);
    join.setJobName("Join Users and Pages");
    join.setInputFormat(KeyValueTextInputFormat.class);
    join.setOutputKeyClass(Text.class);
    join.setOutputValueClass(Text.class);
    join.setMapperClass(IdentityMapper.class);
    join.setReducerClass(Join.class);
    FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
    FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
    FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
    join.setNumReduceTasks(50);
    Job joinJob = new Job(join);
    joinJob.addDependingJob(loadPages);
    joinJob.addDependingJob(loadUsers);

    JobConf group = new JobConf(MRExample.class);
    group.setJobName("Group URLs");
    group.setInputFormat(KeyValueTextInputFormat.class);
    group.setOutputKeyClass(Text.class);
    group.setOutputValueClass(LongWritable.class);
    group.setOutputFormat(SequenceFileOutputFormat.class);
    group.setMapperClass(LoadJoined.class);
    group.setCombinerClass(ReduceUrls.class);
    group.setReducerClass(ReduceUrls.class);
    FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
    FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
    group.setNumReduceTasks(50);
    Job groupJob = new Job(group);
    groupJob.addDependingJob(joinJob);

    JobConf top100 = new JobConf(MRExample.class);
    top100.setJobName("Top 100 sites");
    top100.setInputFormat(SequenceFileInputFormat.class);
    top100.setOutputKeyClass(LongWritable.class);
    top100.setOutputValueClass(Text.class);
    top100.setOutputFormat(SequenceFileOutputFormat.class);
    top100.setMapperClass(LoadClicks.class);
    top100.setCombinerClass(LimitClicks.class);
    top100.setReducerClass(LimitClicks.class);
    FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
    FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
    top100.setNumReduceTasks(1);
    Job limit = new Job(top100);
    limit.addDependingJob(groupJob);

    JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
    jc.addJob(loadPages);
    jc.addJob(loadUsers);
    jc.addJob(joinJob);
    jc.addJob(groupJob);
    jc.addJob(limit);
    jc.run();
  }
}

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;

store Top5 into 'top5sites';

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation: Notice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation: Notice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Job 1

Job 2

Job 3

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

"Relational database" built on Hadoop:
Maintains list of table schemas
SQL-like query language (HQL)
Can call Hadoop Streaming scripts from HQL
Supports table partitioning, clustering, complex data types, some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair
Ex: /hive/page_view/dt=2008-06-08/country=US
    /hive/page_view/dt=2008-06-08/country=CA

Simple Query

• Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

• Hive reads only the matching partitions (e.g. dt=2008-03-01) instead of scanning the entire table

Aggregation and Joins

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Count users who visited each page, by gender

• Sample output:

page_url    gender    count(userid)
home.php    MALE      12141412
home.php    FEMALE    15431579
photo.php   MALE      23941451
photo.php   FEMALE    21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies:
Make it scale, so you can throw hardware at problems
Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig/

Hive: http://hadoop.apache.org/hive/

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

What is MapReduce used for

In research Analyzing Wikipedia conflicts (PARC) Natural language processing (CMU) Bioinformatics (Maryland) Astronomical image analysis (Washington) Ocean climate simulation (Washington) ltYour application heregt

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

MapReduce Design Goals

1 Scalability to large data volumes Scan 100 TB on 1 node 50 MBs = 23 days Scan on 1000-node cluster = 33 minutes

2 Cost-efficiency Commodity nodes (cheap but unreliable) Commodity network Automatic fault-tolerance (fewer admins) Easy to use (fewer programmers)

Typical Hadoop Cluster

Aggregation switch

Rack switch

40 nodesrack 1000-4000 nodes in cluster

1 GBps bandwidth in rack 8 GBps out of rack

Node specs (Yahoo terasort)8 x 20 GHz cores 8 GB RAM 4 disks (= 4 TB)

Typical Hadoop Cluster

Image from httpwikiapacheorghadoop-dataattachmentsHadoopPresentationsattachmentsaw-apachecon-eu-2009pdf

Challenges

Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system

Commodity network = low bandwidth Solution Push computation to the data

Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and

ldquoreducerdquo functions system handles work distribution and fault tolerance

Hadoop Components

Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance

MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance

Hadoop Distributed File System Files split into 128MB blocks

Blocks replicated across several datanodes (usually 3)

Single namenode stores metadata (file names block locations etc)

Optimized for large files sequential reads

Files are append-only

Namenode

Datanodes

1234

124

213

143

324

File1

MapReduce Programming Model Data type key-value records

Map function

(Kin Vin) list(Kinter Vinter)

Reduce function

(Kinter list(Vinter)) list(Kout Vout)

Example Word Count

def mapper(line) foreach word in linesplit)(

output(word 1)

def reducer(key values) output(key sum(values))

Word Count Execution

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 1fox 1the 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Input Map Shuffle amp Sort Reduce Output

MapReduce Execution Details

Single master controls job execution on multiple slaves as well as user scheduling

Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use

Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes

An Optimization The Combiner

A combiner is a local aggregation function for repeated keys produced by same map

For associative ops like sum count max

Decreases size of intermediate data

Example local counting for Word Count

def combiner(key values) output(key sum(values))

Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 2fox 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1 If a task crashes Retry on another node

Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk

If the same task repeatedly fails fail the job or ignore that input block (user-controlled)

Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

MapReduce Design Goals

1 Scalability to large data volumes Scan 100 TB on 1 node 50 MBs = 23 days Scan on 1000-node cluster = 33 minutes

2 Cost-efficiency Commodity nodes (cheap but unreliable) Commodity network Automatic fault-tolerance (fewer admins) Easy to use (fewer programmers)

Typical Hadoop Cluster

Aggregation switch

Rack switch

40 nodesrack 1000-4000 nodes in cluster

1 GBps bandwidth in rack 8 GBps out of rack

Node specs (Yahoo terasort)8 x 20 GHz cores 8 GB RAM 4 disks (= 4 TB)

Typical Hadoop Cluster

Image from httpwikiapacheorghadoop-dataattachmentsHadoopPresentationsattachmentsaw-apachecon-eu-2009pdf

Challenges

Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system

Commodity network = low bandwidth Solution Push computation to the data

Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and

ldquoreducerdquo functions system handles work distribution and fault tolerance

Hadoop Components

Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance

MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance

Hadoop Distributed File System Files split into 128MB blocks

Blocks replicated across several datanodes (usually 3)

Single namenode stores metadata (file names block locations etc)

Optimized for large files sequential reads

Files are append-only

Namenode

Datanodes

1234

124

213

143

324

File1

MapReduce Programming Model Data type key-value records

Map function

(Kin Vin) list(Kinter Vinter)

Reduce function

(Kinter list(Vinter)) list(Kout Vout)

Example Word Count

def mapper(line) foreach word in linesplit)(

output(word 1)

def reducer(key values) output(key sum(values))

Word Count Execution

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 1fox 1the 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Input Map Shuffle amp Sort Reduce Output

MapReduce Execution Details

Single master controls job execution on multiple slaves as well as user scheduling

Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use

Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes

An Optimization The Combiner

A combiner is a local aggregation function for repeated keys produced by same map

For associative ops like sum count max

Decreases size of intermediate data

Example local counting for Word Count

def combiner(key values) output(key sum(values))

Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 2fox 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1 If a task crashes Retry on another node

Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk

If the same task repeatedly fails fail the job or ignore that input block (user-controlled)

Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of Translation

Notice how naturally the components of the job translate into Pig Latin.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation

Notice how naturally the components of the job translate into Pig Latin.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Job 1

Job 2

Job 3

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

"Relational database" built on Hadoop: maintains a list of table schemas; provides an SQL-like query language (HQL); can call Hadoop Streaming scripts from HQL; supports table partitioning, clustering, complex data types, and some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair. Ex: /hive/page_view/dt=2008-06-08/country=US, /hive/page_view/dt=2008-06-08/country=CA

Simple Query

• Find all page views coming from xyz.com on March 31st:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

• Hive only reads partition 2008-03-01 instead of scanning the entire table

Aggregation and Joins

• Count users who visited each page, by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Sample output:

page_url    gender    count(userid)
home.php    MALE      12141412
home.php    FEMALE    15431579
photo.php   MALE      23941451
photo.php   FEMALE    21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;
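The slides reference a map_script.py but do not show its contents. As a rough, hypothetical sketch (assuming Hive streams each input row to the script as tab-separated userid and date fields on stdin, and expects the script to emit the tab-separated dt and uid columns named in the AS clause on stdout), such a transform script might look like:

#!/usr/bin/env python
# Hypothetical map_script.py sketch; the actual script is not part of the slides.
import sys

for line in sys.stdin:
    # Each input row arrives as tab-separated fields: userid, date
    userid, date = line.rstrip('\n').split('\t')
    # Emit the output columns positionally, matching "AS dt, uid"
    print(date + '\t' + userid)

CLUSTER BY dt then routes all rows sharing the same dt value to the same reducer.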

Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.

Principal philosophies: make it scale, so you can throw hardware at problems, and make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance).

Hive and Pig further simplify programming

MapReduce is not suitable for all problems, but when it works, it may save you a lot of time.

Resources

Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig/

Hive: http://hadoop.apache.org/hive/

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training

MapReduce Design Goals

1 Scalability to large data volumes Scan 100 TB on 1 node 50 MBs = 23 days Scan on 1000-node cluster = 33 minutes

2 Cost-efficiency Commodity nodes (cheap but unreliable) Commodity network Automatic fault-tolerance (fewer admins) Easy to use (fewer programmers)

Typical Hadoop Cluster

Aggregation switch

Rack switch

40 nodesrack 1000-4000 nodes in cluster

1 GBps bandwidth in rack 8 GBps out of rack

Node specs (Yahoo terasort)8 x 20 GHz cores 8 GB RAM 4 disks (= 4 TB)

Typical Hadoop Cluster

Image from httpwikiapacheorghadoop-dataattachmentsHadoopPresentationsattachmentsaw-apachecon-eu-2009pdf

Challenges

Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system

Commodity network = low bandwidth Solution Push computation to the data

Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and

ldquoreducerdquo functions system handles work distribution and fault tolerance

Hadoop Components

Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance

MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance

Hadoop Distributed File System Files split into 128MB blocks

Blocks replicated across several datanodes (usually 3)

Single namenode stores metadata (file names block locations etc)

Optimized for large files sequential reads

Files are append-only

Namenode

Datanodes

1234

124

213

143

324

File1

MapReduce Programming Model Data type key-value records

Map function

(Kin Vin) list(Kinter Vinter)

Reduce function

(Kinter list(Vinter)) list(Kout Vout)

Example Word Count

def mapper(line) foreach word in linesplit)(

output(word 1)

def reducer(key values) output(key sum(values))

Word Count Execution

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 1fox 1the 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Input Map Shuffle amp Sort Reduce Output

MapReduce Execution Details

Single master controls job execution on multiple slaves as well as user scheduling

Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use

Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes

An Optimization The Combiner

A combiner is a local aggregation function for repeated keys produced by same map

For associative ops like sum count max

Decreases size of intermediate data

Example local counting for Word Count

def combiner(key values) output(key sum(values))

Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 2fox 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1 If a task crashes Retry on another node

Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk

If the same task repeatedly fails fail the job or ignore that input block (user-controlled)

Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Typical Hadoop Cluster

Aggregation switch

Rack switch

40 nodesrack 1000-4000 nodes in cluster

1 GBps bandwidth in rack 8 GBps out of rack

Node specs (Yahoo terasort)8 x 20 GHz cores 8 GB RAM 4 disks (= 4 TB)

Typical Hadoop Cluster

Image from httpwikiapacheorghadoop-dataattachmentsHadoopPresentationsattachmentsaw-apachecon-eu-2009pdf

Challenges

Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system

Commodity network = low bandwidth Solution Push computation to the data

Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and

ldquoreducerdquo functions system handles work distribution and fault tolerance

Hadoop Components

Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance

MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance

Hadoop Distributed File System Files split into 128MB blocks

Blocks replicated across several datanodes (usually 3)

Single namenode stores metadata (file names block locations etc)

Optimized for large files sequential reads

Files are append-only

Namenode

Datanodes

1234

124

213

143

324

File1

MapReduce Programming Model Data type key-value records

Map function

(Kin Vin) list(Kinter Vinter)

Reduce function

(Kinter list(Vinter)) list(Kout Vout)

Example Word Count

def mapper(line) foreach word in linesplit)(

output(word 1)

def reducer(key values) output(key sum(values))

Word Count Execution

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 1fox 1the 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Input Map Shuffle amp Sort Reduce Output

MapReduce Execution Details

Single master controls job execution on multiple slaves as well as user scheduling

Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use

Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes

An Optimization The Combiner

A combiner is a local aggregation function for repeated keys produced by same map

For associative ops like sum count max

Decreases size of intermediate data

Example local counting for Word Count

def combiner(key values) output(key sum(values))

Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 2fox 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1 If a task crashes Retry on another node

Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk

If the same task repeatedly fails fail the job or ignore that input block (user-controlled)

Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Typical Hadoop Cluster

Image from httpwikiapacheorghadoop-dataattachmentsHadoopPresentationsattachmentsaw-apachecon-eu-2009pdf

Challenges

Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system

Commodity network = low bandwidth Solution Push computation to the data

Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and

ldquoreducerdquo functions system handles work distribution and fault tolerance

Hadoop Components

Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance

MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance

Hadoop Distributed File System Files split into 128MB blocks

Blocks replicated across several datanodes (usually 3)

Single namenode stores metadata (file names block locations etc)

Optimized for large files sequential reads

Files are append-only

Namenode

Datanodes

1234

124

213

143

324

File1

MapReduce Programming Model Data type key-value records

Map function

(Kin Vin) list(Kinter Vinter)

Reduce function

(Kinter list(Vinter)) list(Kout Vout)

Example Word Count

def mapper(line) foreach word in linesplit)(

output(word 1)

def reducer(key values) output(key sum(values))

Word Count Execution

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 1fox 1the 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Input Map Shuffle amp Sort Reduce Output

MapReduce Execution Details

Single master controls job execution on multiple slaves as well as user scheduling

Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use

Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes

An Optimization The Combiner

A combiner is a local aggregation function for repeated keys produced by same map

For associative ops like sum count max

Decreases size of intermediate data

Example local counting for Word Count

def combiner(key values) output(key sum(values))

Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 2fox 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1 If a task crashes Retry on another node

Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk

If the same task repeatedly fails fail the job or ignore that input block (user-controlled)

Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps:
1. Set up the Hadoop environment
2. Upload files into HDFS
3. Execute Java MapReduce functions in Hadoop

• Tutorial:

https://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1.html

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts for each word
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

How it Works

Purpose: simple serialization for keys, values, and other data

Interface Writable: read and write a binary format; convert to String for text formats

Classes

OutputCollector: collects data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job

Methods: collect(K key, V value)

Interface

Text: stores text using standard UTF-8 encoding; provides methods to serialize, deserialize, and compare texts at the byte level

Methods: set(byte[] utf8)
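To make the Writable contract concrete, here is a minimal custom value type; the PageView class is invented for illustration, and only write and readFields are required by the interface:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Minimal custom Writable: serializes itself to a binary stream (write)
// and re-populates itself from one (readFields).
public class PageView implements Writable {
    private long userId;
    private int viewTime;

    public PageView() { }                       // no-arg constructor required by Hadoop

    public PageView(long userId, int viewTime) {
        this.userId = userId;
        this.viewTime = viewTime;
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(userId);
        out.writeInt(viewTime);
    }

    public void readFields(DataInput in) throws IOException {
        userId = in.readLong();
        viewTime = in.readInt();
    }

    @Override
    public String toString() {                  // used by text output formats
        return userId + "\t" + viewTime;
    }
}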

Demo


Exercises

Get WordCount running

Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words.

For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc.
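One possible starting point for the bigram exercise, as a hedged sketch: only the map side changes relative to WordCount (the same sum-of-counts reducer and combiner can be reused), and bigrams that span line boundaries are ignored here for simplicity:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emit each pair of consecutive words in a line as one key,
// e.g. "Bigrams are simply ..." -> ("Bigrams are", 1), ("are simply", 1), ...
public class BigramMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text bigram = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer tok = new StringTokenizer(value.toString());
        String prev = null;
        while (tok.hasMoreTokens()) {
            String word = tok.nextToken();
            if (prev != null) {
                bigram.set(prev + " " + word);
                output.collect(bigram, ONE);
            }
            prev = word;
        }
    }
}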

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great, as many algorithms can be expressed by a series of MR jobs

But it's low-level: you must think about keys, values, partitioning, etc.

Can we capture common "job patterns"?

Pig

Started at Yahoo Research

Now runs about 30% of Yahoo's jobs

Features: expresses sequences of MapReduce jobs; data model of nested "bags" of items; provides relational (SQL) operators (JOIN, GROUP BY, etc.); easy to plug in Java functions; Pig Pen development environment for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;

store Top5 into 'top5sites';

In Pig Latin

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation: Notice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation: Notice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Job 1

Job 2

Job 3

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Hive

Developed at Facebook

Used for the majority of Facebook jobs

"Relational database" built on Hadoop: maintains a list of table schemas; SQL-like query language (HQL); can call Hadoop Streaming scripts from HQL; supports table partitioning, clustering, complex data types, and some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair. Ex: /hive/page_view/dt=2008-06-08/country=US and /hive/page_view/dt=2008-06-08/country=CA

Simple Query

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

• Hive only reads the matching dt partitions instead of scanning the entire table

• Find all page views coming from xyz.com during March 2008

Aggregation and Joins

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Count users who visited each page, by gender

• Sample output:

page_url    gender   count(userid)
home.php    MALE     12141412
home.php    FEMALE   15431579
photo.php   MALE     23941451
photo.php   FEMALE   21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;

Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance

Principal philosophies: make it scale, so you can throw hardware at problems; make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems, but when it works it may save you a lot of time

Resources

Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig

Hive: http://hadoop.apache.org/hive

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Machine (VM)
  • Installation
  • Once you got the software installed, fire up the VirtualBox image
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Challenges

Cheap nodes fail especially if you have many Mean time between failures for 1 node = 3 years MTBF for 1000 nodes = 1 day Solution Build fault-tolerance into system

Commodity network = low bandwidth Solution Push computation to the data

Programming distributed systems is hard Solution Data-parallel programming model users write ldquomaprdquo and

ldquoreducerdquo functions system handles work distribution and fault tolerance

Hadoop Components

Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance

MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance

Hadoop Distributed File System Files split into 128MB blocks

Blocks replicated across several datanodes (usually 3)

Single namenode stores metadata (file names block locations etc)

Optimized for large files sequential reads

Files are append-only

Namenode

Datanodes

1234

124

213

143

324

File1

MapReduce Programming Model Data type key-value records

Map function

(Kin Vin) list(Kinter Vinter)

Reduce function

(Kinter list(Vinter)) list(Kout Vout)

Example Word Count

def mapper(line) foreach word in linesplit)(

output(word 1)

def reducer(key values) output(key sum(values))

Word Count Execution

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 1fox 1the 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Input Map Shuffle amp Sort Reduce Output

MapReduce Execution Details

Single master controls job execution on multiple slaves as well as user scheduling

Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use

Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes

An Optimization The Combiner

A combiner is a local aggregation function for repeated keys produced by same map

For associative ops like sum count max

Decreases size of intermediate data

Example local counting for Word Count

def combiner(key values) output(key sum(values))

Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 2fox 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1 If a task crashes Retry on another node

Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk

If the same task repeatedly fails fail the job or ignore that input block (user-controlled)

Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Hadoop Components

Distributed file system (HDFS) Single namespace for entire cluster Replicates data 3x for fault-tolerance

MapReduce implementation Executes user jobs specified as ldquomaprdquo and ldquoreducerdquo functions Manages work distribution amp fault-tolerance

Hadoop Distributed File System Files split into 128MB blocks

Blocks replicated across several datanodes (usually 3)

Single namenode stores metadata (file names block locations etc)

Optimized for large files sequential reads

Files are append-only

Namenode

Datanodes

1234

124

213

143

324

File1

MapReduce Programming Model Data type key-value records

Map function

(Kin Vin) list(Kinter Vinter)

Reduce function

(Kinter list(Vinter)) list(Kout Vout)

Example Word Count

def mapper(line) foreach word in linesplit)(

output(word 1)

def reducer(key values) output(key sum(values))

Word Count Execution

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 1fox 1the 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Input Map Shuffle amp Sort Reduce Output

MapReduce Execution Details

Single master controls job execution on multiple slaves as well as user scheduling

Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use

Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes

An Optimization The Combiner

A combiner is a local aggregation function for repeated keys produced by same map

For associative ops like sum count max

Decreases size of intermediate data

Example local counting for Word Count

def combiner(key values) output(key sum(values))

Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 2fox 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1 If a task crashes Retry on another node

Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk

If the same task repeatedly fails fail the job or ignore that input block (user-controlled)

Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

• Hive reads only the matching date partitions instead of scanning the entire table.

• Find all page views coming from xyz.com during March 2008.

Aggregation and Joins

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Count users who visited each page, by gender.

• Sample output:

page_url   gender  count(userid)
home.php   MALE    12141412
home.php   FEMALE  15431579
photo.php  MALE    23941451
photo.php  FEMALE  21231314
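To make the semantics of the query concrete: for every (page_url, gender) pair it counts the number of distinct users. A minimal stand-alone Python sketch of that computation over an already-joined, tab-separated input is shown below; this is illustrative only, and not how Hive actually executes the query (Hive compiles it into MapReduce jobs).

import sys
from collections import defaultdict

# Sketch of COUNT(DISTINCT userid) ... GROUP BY page_url, gender.
# Assumes stdin carries tab-separated lines: page_url <TAB> gender <TAB> userid.
distinct_users = defaultdict(set)
for line in sys.stdin:
    page_url, gender, userid = line.rstrip("\n").split("\t")
    distinct_users[(page_url, gender)].add(userid)

for (page_url, gender), users in sorted(distinct_users.items()):
    print("%s\t%s\t%d" % (page_url, gender, len(users)))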

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;
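The slides use map_script.py without showing it. As an illustration only (the script is not part of the original deck): a Hive TRANSFORM script reads the selected columns from standard input as tab-separated lines and writes tab-separated rows back to standard output. The sketch below assumes the script turns each (userid, date) pair into (dt, uid) by keeping only the date part of a timestamp.

#!/usr/bin/env python
# map_script.py -- illustrative sketch, not from the original slides.
# Hive streams each selected row to stdin as tab-separated fields
# (userid, date) and reads tab-separated (dt, uid) rows from stdout.
import sys

def main():
    for line in sys.stdin:
        userid, date = line.rstrip("\n").split("\t")
        dt = date[:10]  # assumes 'YYYY-MM-DD hh:mm:ss'; keep only the date part
        sys.stdout.write(dt + "\t" + userid + "\n")

if __name__ == "__main__":
    main()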

Conclusions

MapReduce's data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies:
- Make it scale, so you can throw hardware at problems
- Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems, but when it works, it may save you a lot of time

Resources

Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig/

Hive: http://hadoop.apache.org/hive/

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training

Example Word Count

def mapper(line) foreach word in linesplit)(

output(word 1)

def reducer(key values) output(key sum(values))

Word Count Execution

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 1fox 1the 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Input Map Shuffle amp Sort Reduce Output

MapReduce Execution Details

Single master controls job execution on multiple slaves as well as user scheduling

Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use

Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes

An Optimization The Combiner

A combiner is a local aggregation function for repeated keys produced by same map

For associative ops like sum count max

Decreases size of intermediate data

Example local counting for Word Count

def combiner(key values) output(key sum(values))

Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 2fox 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1 If a task crashes Retry on another node

Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk

If the same task repeatedly fails fail the job or ignore that input block (user-controlled)

Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {

    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {
        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc, Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // Drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {
        public void reduce(Text key, Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc, Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {
        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
            oc.collect((LongWritable) val, (Text) key);
        }
    }

    public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;
        public void reduce(LongWritable key, Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation. Notice how naturally the components of the job translate into Pig Latin.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation. Notice how naturally the components of the job translate into Pig Latin, and how they map onto the underlying MapReduce jobs.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Job 1

Job 2

Job 3

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

"Relational database" built on Hadoop:
Maintains list of table schemas
SQL-like query language (HQL)
Can call Hadoop Streaming scripts from HQL
Supports table partitioning, clustering, complex data types, some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair
Ex: /hive/page_view/dt=2008-06-08,country=US
    /hive/page_view/dt=2008-06-08,country=CA

Simple Query

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

• Hive only reads partition 2008-03-01 instead of scanning the entire table

• Find all page views coming from xyz.com on March 31st

Aggregation and Joins

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Count users who visited each page, by gender

• Sample output:
page_url   gender   count(userid)
home.php   MALE     12,141,412
home.php   FEMALE   15,431,579
photo.php  MALE     23,941,451
photo.php  FEMALE   21,231,314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid,
                 page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;

Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.

Principal philosophies:
Make it scale, so you can throw hardware at problems
Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems, but when it works, it may save you a lot of time.

Resources

Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig

Hive: http://hadoop.apache.org/hive

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

MapReduce Execution Details

Single master controls job execution on multiple slaves as well as user scheduling

Mappers are preferentially placed on the same node or same rack as their input block, to push computation to the data and minimize network use.

Mappers save their outputs to local disk rather than pushing them directly to reducers. This allows having more reducers than nodes, and allows recovery if a reducer crashes.

An Optimization The Combiner

A combiner is a local aggregation function for repeated keys produced by the same map

For associative ops like sum, count, max

Decreases size of intermediate data

Example: local counting for Word Count:

def combiner(key, values): output(key, sum(values))
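In Hadoop's old-API Java code the combiner is simply a Reducer class registered on the JobConf; a minimal wiring sketch (it reuses the WordCount Map and Reduce classes shown later in this document, which is safe only because summing is associative and commutative):

// Inside the job setup (cf. the WordCount main method later in this document):
JobConf conf = new JobConf(WordCount.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);   // run the reducer on each map's local output before the shuffle
conf.setReducerClass(Reduce.class);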

Word Count with Combiner

Input | Map & Combine | Shuffle & Sort | Reduce | Output

Input splits:
the quick brown fox
the fox ate the mouse
how now brown cow

Map & Combine output:
Map 1: the 1, brown 1, fox 1, quick 1
Map 2: the 2, fox 1, ate 1, mouse 1 (the combiner merged the two "the 1" pairs)
Map 3: how 1, now 1, brown 1, cow 1

Reduce output:
Reduce 1: brown 2, fox 2, how 1, now 1, the 3
Reduce 2: ate 1, cow 1, mouse 1, quick 1

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1. If a task crashes: retry on another node

Okay for a map because it has no dependencies; okay for a reduce because map outputs are on disk

If the same task repeatedly fails, fail the job or ignore that input block (user-controlled)

Note: For this and the other fault tolerance features to work, your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2. If a node crashes: relaunch its current tasks on other nodes, and relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3. If a task is going slowly (straggler): launch a second copy of the task on another node, take the output of whichever copy finishes first, and kill the other one

Critical for performance in large clusters: stragglers occur frequently due to failing hardware, bugs, misconfiguration, etc.
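This speculative execution of backup copies can be toggled per job; a small sketch using the old mapred API (MyJob is a placeholder class name; the setters shown are standard JobConf methods):

JobConf conf = new JobConf(MyJob.class);   // MyJob is a placeholder
conf.setMapSpeculativeExecution(true);     // allow backup copies of slow map tasks
conf.setReduceSpeculativeExecution(true);  // and of slow reduce tasks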

Takeaways

By providing a data-parallel programming model, MapReduce can control job execution in useful ways: automatic division of the job into tasks, automatic placement of computation near data, automatic load balancing, and recovery from failures & stragglers.

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input: (lineNumber, line) records

Output: lines matching a given pattern

Map: if (line matches pattern): output(line)

Reduce: identity function. Alternative: no reducer (map-only job); see the sketch below.

pig sheep yak zebra

aardvark ant bee cow elephant
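A sketch of the Search example as a map-only Hadoop job; the class name and hard-coded pattern are illustrative, and with TextInputFormat the input key is actually the byte offset of the line:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class GrepMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {

    private static final String PATTERN = "fox";   // illustrative pattern

    public void map(LongWritable offset, Text line,
            OutputCollector<Text, NullWritable> out, Reporter reporter) throws IOException {
        // Emit only matching lines; with conf.setNumReduceTasks(0) the
        // map output is written directly as the job output.
        if (line.toString().contains(PATTERN)) {
            out.collect(line, NullWritable.get());
        }
    }
}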

2 Sort

Input: (key, value) records

Output: same records, sorted by key

Map: identity function

Reduce: identity function

Trick: pick a partitioning function h such that k1 < k2 => h(k1) < h(k2); a sketch of such a partitioner follows the figure below.

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvark elephant

cow

pig

sheep yak

[A-M]

[N-Z]
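A sketch of such a partitioning function for the two-reducer [A-M]/[N-Z] split in the figure above, using the old mapred Partitioner interface (class name illustrative; it assumes exactly two reduce tasks and non-empty keys):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class AlphabetPartitioner implements Partitioner<Text, Text> {

    public void configure(JobConf job) { }   // no per-job configuration needed

    public int getPartition(Text key, Text value, int numPartitions) {
        // Keys starting with A-M go to reducer 0, N-Z to reducer 1, so
        // concatenating the two sorted reducer outputs gives a total order.
        char first = Character.toUpperCase(key.toString().charAt(0));
        return (first <= 'M') ? 0 : 1;
    }
}

It would be registered with conf.setPartitionerClass(AlphabetPartitioner.class) on a job configured with two reduce tasks.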

3 Inverted Index

Input: (filename, text) records

Output: list of files containing each word

Map: foreach word in text.split(): output(word, filename)

Combine: uniquify filenames for each word (eliminate duplicates)

Reduce: def reduce(word, filenames): output(word, sort(filenames))
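A minimal Java sketch of the inverted index job; the class names are illustrative, and it assumes an input format such as KeyValueTextInputFormat that supplies (filename, text) pairs as in the pseudocode above:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class InvertedIndex {

    public static class IndexMapper extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        public void map(Text filename, Text text,
                OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            StringTokenizer tok = new StringTokenizer(text.toString());
            while (tok.hasMoreTokens()) {
                out.collect(new Text(tok.nextToken()), filename);   // (word, filename)
            }
        }
    }

    public static class IndexReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text word, Iterator<Text> filenames,
                OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            TreeSet<String> files = new TreeSet<String>();   // sorted, duplicates removed
            while (filenames.hasNext()) {
                files.add(filenames.next().toString());
            }
            out.collect(word, new Text(files.toString()));
        }
    }
}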

Inverted Index Example

Inputs:
hamlet.txt: to be or not to be
12th.txt: be not afraid of greatness

Map output:
from hamlet.txt: to hamlet.txt, be hamlet.txt, or hamlet.txt, not hamlet.txt
from 12th.txt: be 12th.txt, not 12th.txt, afraid 12th.txt, of 12th.txt, greatness 12th.txt

Reduce output:
afraid (12th.txt)
be (12th.txt, hamlet.txt)
greatness (12th.txt)
not (12th.txt, hamlet.txt)
of (12th.txt)
or (hamlet.txt)
to (hamlet.txt)

4 Most Popular Words

Input: (filename, text) records

Output: the 100 words occurring in the most files

Two-stage solution:
Job 1: Create an inverted index, giving (word, list(file)) records
Job 2: Map each (word, list(file)) to (count, word); sort these records by count as in the sort job (a sketch of this map step follows below)
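A sketch of the Job 2 map step, which swaps each inverted-index record into a (count, word) pair so that the framework's sort by key does the ranking; the class name is illustrative, and it assumes Job 1's output is read back as (word, file list) text pairs:

public static class CountSwapMapper extends MapReduceBase
        implements Mapper<Text, Text, LongWritable, Text> {

    public void map(Text word, Text fileList,
            OutputCollector<LongWritable, Text> out, Reporter reporter) throws IOException {
        // fileList is assumed to be a comma-separated list of files from Job 1.
        long count = fileList.toString().split(",").length;
        out.collect(new LongWritable(count), word);   // the shuffle sorts by count
        // (a descending sort comparator would put the most frequent words first)
    }
}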

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

The Cloudera VM contains a single-node Apache Hadoop cluster, along with everything you need to get started with Hadoop.

Requirements: a 64-bit host OS; virtualization software (VMware Player, KVM, or VirtualBox)

The virtualization software requires a machine that supports virtualization. If you are unsure, one way to check is to look in your BIOS and see whether virtualization is enabled.

4 GB of total RAM. The total system memory required varies depending on the size of your data set and on the other processes that are running.

Installation

Step 1: Download & run VMware

Step 2: Download the Cloudera VM

Step 3: Extract it to the Cloudera folder

Step 4: Open the cloudera-quickstart-vm-4.4.0-1-vmware image

Once you have the software installed, fire up the VirtualBox image of the Cloudera QuickStart VM and you should see an initial screen similar to the one below.

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps:
1. Set up the Hadoop environment
2. Upload files into HDFS
3. Execute the Java MapReduce functions in Hadoop

• Tutorial: https://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1.html

// (imports and the enclosing WordCount class added so the snippet compiles as-is)
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);        // emit (word, 1) for every token
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();       // add up the 1s for this word
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);      // the reducer doubles as a combiner
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

How it Works

Purpose: simple serialization for keys, values, and other data

Interface Writable: read and write binary format; convert to String for text formats

Classes

OutputCollector: collects data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job.
Method: collect(K key, V value)

Interface

Text: stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level.
Method: set(byte[] utf8)
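As an illustration of the Writable contract described above, a hypothetical pair-of-ints value type:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value type: a pair of ints serialized in binary form.
public class IntPairWritable implements Writable {
    private int first;
    private int second;

    public IntPairWritable() { }                       // required no-arg constructor
    public IntPairWritable(int first, int second) { this.first = first; this.second = second; }

    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    public String toString() { return first + "\t" + second; }   // text form for TextOutputFormat
}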

Demo


Exercises

Get WordCount running

Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words.

For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc.
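A possible starting point for the bigram exercise (a sketch, not the only solution): replace only the Map class of the WordCount code above so it emits consecutive word pairs; the Reduce class, job setup, and imports stay the same as in WordCount.

// Same imports as the WordCount code above; only the Map class changes.
public static class BigramMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String[] words = value.toString().trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            // Emit the pair "word_i word_i+1"; the unchanged Reduce class sums the 1s.
            output.collect(new Text(words[i] + " " + words[i + 1]), one);
        }
    }
}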

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

MapReduce Execution Details

Single master controls job execution on multiple slaves as well as user scheduling

Mappers preferentially placed on same node or same rack as their input block Push computation to data minimize network use

Mappers save outputs to local disk rather than pushing directly to reducers Allows having more reducers than nodes Allows recovery if a reducer crashes

An Optimization The Combiner

A combiner is a local aggregation function for repeated keys produced by same map

For associative ops like sum count max

Decreases size of intermediate data

Example local counting for Word Count

def combiner(key values) output(key sum(values))

Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 2fox 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1 If a task crashes Retry on another node

Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk

If the same task repeatedly fails fail the job or ignore that input block (user-controlled)

Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

An Optimization The Combiner

A combiner is a local aggregation function for repeated keys produced by same map

For associative ops like sum count max

Decreases size of intermediate data

Example local counting for Word Count

def combiner(key values) output(key sum(values))

Word Count with CombinerInput Map amp Combine Shuffle amp Sort Reduce Output

the quickbrown fox

the fox atethe mouse

how nowbrown cow

Map

Map

Map

Reduce

Reduce

brown 2fox 2how 1now 1the 3

ate 1cow 1

mouse 1quick 1

the 1brown 1

fox 1

quick 1

the 2fox 1

how 1now 1

brown 1ate 1

mouse 1

cow 1

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1 If a task crashes Retry on another node

Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk

If the same task repeatedly fails fail the job or ignore that input block (user-controlled)

Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

// Imports and the enclosing WordCount class declaration are omitted here, as on the original slides.
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}

How it Works

Purpose: simple serialization for keys, values, and other data

Interface Writable: read and write binary format; convert to String for text formats

Classes

OutputCollector: collects data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job. Methods: collect(K key, V value)

Interface

Text: stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level. Methods: set(byte[] utf8)

Demo


Exercises

Get WordCount running

Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words.

For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc.
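One possible starting point for the bigram exercise, sketched as a plain Python mapper (the reducer is the same summing reducer used for WordCount):

def bigram_mapper(line):
    # Emit ("word1 word2", 1) for each pair of consecutive words in the line.
    words = line.split()
    for first, second in zip(words, words[1:]):
        yield (first + " " + second, 1)

print(list(bigram_mapper("Bigrams are simply sequences of two consecutive words")))
# [('Bigrams are', 1), ('are simply', 1), ('simply sequences', 1), ('sequences of', 1), ...]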

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great, as many algorithms can be expressed as a series of MR jobs

But it's low-level: you must think about keys, values, partitioning, etc.

Can we capture common "job patterns"?

Pig

Started at Yahoo Research

Now runs about 30% of Yahoo!'s jobs

Features:
Expresses sequences of MapReduce jobs
Data model: nested "bags" of items
Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
Easy to plug in Java functions
Pig Pen: development environment for Eclipse

An Example Problem

Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation

Notice how naturally the components of the job translate into Pig Latin.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation

Notice how naturally the components of the job translate into Pig Latin.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Job 1

Job 2

Job 3

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

"Relational database" built on Hadoop
Maintains a list of table schemas
SQL-like query language (HQL)
Can call Hadoop Streaming scripts from HQL
Supports table partitioning, clustering, complex data types, and some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair. Ex: /hive/page_view/dt=2008-06-08/country=US and /hive/page_view/dt=2008-06-08/country=CA

Simple Query

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

• Hive only reads partition 2008-03-01 instead of scanning the entire table

• Find all page views coming from xyz.com on March 31st

Aggregation and Joins

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Count users who visited each page, by gender

• Sample output:

page_url   gender   count(userid)
home.php   MALE     12141412
home.php   FEMALE   15431579
photo.php  MALE     23941451
photo.php  FEMALE   21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;
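The slides do not show map_script.py itself; the sketch below is only an assumed example of what such a Hadoop Streaming script could look like. Hive passes the selected columns to the script as tab-separated lines on stdin and reads tab-separated lines back from stdout; the column order and the swap to (dt, uid) here are assumptions for illustration.

#!/usr/bin/env python
# Hypothetical map_script.py: read "userid<TAB>date" lines, emit "date<TAB>userid" (dt, uid).
import sys

for line in sys.stdin:
    userid, date = line.rstrip("\n").split("\t")
    print(date + "\t" + userid)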

Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance

Principal philosophies:
Make it scale, so you can throw hardware at problems
Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems, but when it works it may save you a lot of time

Resources

Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig

Hive: http://hadoop.apache.org/hive

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training


Word Count with Combiner

[Diagram: Input → Map & Combine → Shuffle & Sort → Reduce → Output. The same three input lines as before ("the quick brown fox", "the fox ate the mouse", "how now brown cow") are processed, but each map task now runs the combiner on its own output, so the second mapper emits (the, 2), (fox, 1), (ate, 1), (mouse, 1) instead of two separate (the, 1) records. The final reduce output is unchanged: brown 2, fox 2, how 1, now 1, the 3, ate 1, cow 1, mouse 1, quick 1.]
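A small plain-Python sketch (not the Hadoop API) of what the combiner changes for the second map task above: counts are pre-summed inside the map task, so fewer records cross the network during the shuffle, while the final reduce output is unchanged.

from collections import Counter

def mapper(line):
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Combiner: locally sum counts inside the map task (same logic as the reducer).
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

raw = mapper("the fox ate the mouse")
print(raw)           # [('the', 1), ('fox', 1), ('ate', 1), ('the', 1), ('mouse', 1)]
print(combine(raw))  # [('the', 2), ('fox', 1), ('ate', 1), ('mouse', 1)]: fewer records to shuffle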

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1. If a task crashes: retry it on another node

Okay for a map because it has no dependencies; okay for a reduce because map outputs are on disk

If the same task repeatedly fails, fail the job or ignore that input block (user-controlled)

Note: for this and the other fault tolerance features to work, your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2. If a node crashes: relaunch its current tasks on other nodes, and relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3. If a task is going slowly (a straggler): launch a second copy of the task on another node, take the output of whichever copy finishes first, and kill the other one

Critical for performance in large clusters: stragglers occur frequently due to failing hardware, bugs, misconfiguration, etc.

Takeaways

By providing a data-parallel programming model, MapReduce can control job execution in useful ways: automatic division of the job into tasks, automatic placement of computation near data, automatic load balancing, and recovery from failures & stragglers

The user focuses on the application, not on the complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Fault Tolerance in MapReduce

1 If a task crashes Retry on another node

Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk

If the same task repeatedly fails fail the job or ignore that input block (user-controlled)

Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Fault Tolerance in MapReduce

1 If a task crashes Retry on another node

Okay for a map because it had no dependencies Okay for reduce because map outputs are on disk

If the same task repeatedly fails fail the job or ignore that input block (user-controlled)

Note For this and the other fault tolerance features to work your map and reduce tasks must be side-effect-free

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

Input files:
  hamlet.txt: "to be or not to be"
  12th.txt:   "be not afraid of greatness"

Map output (after combining duplicates per file):
  (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt)
  (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt)

Final inverted index:
  afraid     (12th.txt)
  be         (12th.txt, hamlet.txt)
  greatness  (12th.txt)
  not        (12th.txt, hamlet.txt)
  of         (12th.txt)
  or         (hamlet.txt)
  to         (hamlet.txt)
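
A runnable local sketch of the same pipeline (plain Python; the dictionary of per-word lists stands in for Hadoop's shuffle, and the file contents are the two toy files above):

from collections import defaultdict

files = {
    "hamlet.txt": "to be or not to be",
    "12th.txt": "be not afraid of greatness",
}

def mapper(filename, text):
    for word in text.split():
        yield word, filename

def combiner(word, filenames):
    # Eliminate duplicate filenames emitted by the same map task.
    yield word, sorted(set(filenames))

def reducer(word, filename_lists):
    # Merge the per-map lists and sort, giving the final posting list.
    merged = sorted({f for lst in filename_lists for f in lst})
    return word, merged

# Simulate map -> combine -> shuffle -> reduce.
shuffled = defaultdict(list)
for filename, text in files.items():
    per_file = defaultdict(list)
    for word, fname in mapper(filename, text):
        per_file[word].append(fname)
    for word, fnames in per_file.items():
        for w, uniq in combiner(word, fnames):
            shuffled[w].append(uniq)

index = dict(reducer(w, lists) for w, lists in shuffled.items())
for word in sorted(index):
    print(word, index[word])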

4 Most Popular Words

Input: (filename, text) records

Output: the 100 words occurring in the most files

Two-stage solution:
Job 1: create an inverted index, giving (word, list(file)) records.
Job 2: map each (word, list(file)) to (count, word), then sort these records by count as in the sort job.
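
A compact local sketch of the two jobs (plain Python; TOP_N is a stand-in for the 100-word cutoff so the toy data produces output, and the inverted index plays the role of Job 1):

from collections import defaultdict

TOP_N = 3  # stand-in for 100 on this toy input

files = {
    "hamlet.txt": "to be or not to be",
    "12th.txt": "be not afraid of greatness",
}

# Job 1: inverted index (word -> set of files), as in the previous example.
index = defaultdict(set)
for filename, text in files.items():
    for word in text.split():
        index[word].add(filename)

# Job 2: map each (word, files) record to (count, word), then sort by count,
# exactly as the sort job orders records by key.
counted = [(len(fnames), word) for word, fnames in index.items()]
counted.sort(reverse=True)

for count, word in counted[:TOP_N]:
    print(word, count)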

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

The Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop.

Requirements: a 64-bit host OS and virtualization software (VMware Player, KVM, or VirtualBox).

The virtualization software requires a machine that supports virtualization; if you are unsure, one way to check is to look in your BIOS and see whether virtualization is enabled.

About 4 GB of total RAM; the total system memory required varies depending on the size of your data set and on the other processes that are running.

Installation

Step 1: Download & run VMware

Step 2: Download the Cloudera VM

Step 3: Extract it to the Cloudera folder

Step 4: Open cloudera-quickstart-vm-4.4.0-1-vmware

Once you have the software installed, fire up the VirtualBox image of the Cloudera QuickStart VM and you should see an initial screen similar to the one below.

WordCount Example

This example computes the occurrence frequency of each word in a text file.

Steps:
1. Set up the Hadoop environment
2. Upload files into HDFS
3. Execute the Java MapReduce functions in Hadoop

• Tutorial:

https://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1.html

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);          // emit (word, 1) for every token
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();         // add up the 1s for this word
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);    // the reducer doubles as a combiner
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

How it Works

Purpose: simple serialization for keys, values, and other data.

Interface Writable: read and write a binary format, and convert to String for text formats.

Classes

OutputCollector: collects data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job.

Methods: collect(K key, V value)

Interface

Text: stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level.

Methods: set(byte[] utf8)

Demo


Exercises

Get WordCount running.

Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words.

For example, the previous sentence contains the bigrams "Bigrams are", "are simply", "simply sequences", "sequences of", etc. (a sketch of the required mapper change follows below).
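
A rough sketch of the only change the exercise needs, written as runnable Python in the deck's pseudocode style rather than as the Java solution itself:

def bigram_mapper(line):
    # Emit (bigram, 1) for each pair of consecutive words instead of (word, 1).
    words = line.split()
    for first, second in zip(words, words[1:]):
        yield (first + " " + second), 1

def reducer(key, values):
    # Unchanged from WordCount: sum the counts for each key.
    return key, sum(values)

# Quick local check of the mapper on the sentence from the exercise.
for bigram, one in bigram_mapper("Bigrams are simply sequences of two consecutive words"):
    print(bigram, one)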

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great, as many algorithms can be expressed by a series of MR jobs.

But it's low-level: you must think about keys, values, partitioning, etc.

Can we capture common "job patterns"?

Pig

Started at Yahoo Research

Now runs about 30% of Yahoo's jobs

Features: expresses sequences of MapReduce jobs; data model of nested "bags" of items; relational (SQL-like) operators (JOIN, GROUP BY, etc.); easy to plug in Java functions; Pig Pen, a development environment for Eclipse.

An Example Problem

Suppose you have user data in one file and website visit data in another, and you need to find the top 5 most visited pages by users aged 18-25.

Dataflow: Load Users → Filter by age; Load Pages → Join on name (with the filtered users) → Group on url → Count clicks → Order by clicks → Take top 5.

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In Pig Latin:

Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, count(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;

store Top5 into 'top5sites';

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation: notice how naturally the components of the job translate into Pig Latin.

Dataflow: Load Users → Filter by age; Load Pages → Join on name → Group on url → Count clicks → Order by clicks → Take top 5.

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation (2): notice how naturally the components of the job translate into Pig Latin, and how the statements group into MapReduce jobs.

Dataflow: Load Users → Filter by age; Load Pages → Join on name → Group on url → Count clicks → Order by clicks → Take top 5.

Job 1: Users = load …; Fltrd = filter …; Pages = load …; Joined = join …
Job 2: Grouped = group …; Summed = … count()…
Job 3: Sorted = order …; Top5 = limit …

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

A "relational database" built on Hadoop: maintains a list of table schemas; offers a SQL-like query language (HQL); can call Hadoop Streaming scripts from HQL; supports table partitioning, clustering, complex data types, and some optimizations.

Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
                        page_url STRING, referrer_url STRING,
                        ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair.
  Ex: /hive/page_view/dt=2008-06-08/country=US
      /hive/page_view/dt=2008-06-08/country=CA

Simple Query

• Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url LIKE '%xyz.com';

• Hive only reads the matching date partitions instead of scanning the entire table.

Aggregation and Joins

• Count users who visited each page, broken down by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Sample output:

page_url   gender  count(userid)
home.php   MALE    12141412
home.php   FEMALE  15431579
photo.php  MALE    23941451
photo.php  FEMALE  21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;

Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.

Principal philosophies: make it scale, so you can throw hardware at problems; make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance).

Hive and Pig further simplify programming.

MapReduce is not suitable for all problems, but when it works it may save you a lot of time.

Resources

Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig

Hive: http://hadoop.apache.org/hive

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Fault Tolerance in MapReduce

2 If a node crashes Relaunch its current tasks on other nodes Relaunch any maps the node previously ran

Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Fault Tolerance in MapReduce

3 If a task is going slowly (straggler) Launch second copy of task on another node Take the output of whichever copy finishes first and kill the

other one

Critical for performance in large clusters stragglers occur frequently due to failing hardware bugs misconfiguration etc

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Takeaways

By providing a data-parallel programming model MapReduce can control job execution in useful ways Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures amp stragglers

User focuses on application not on complexities of distributed computing

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

1 Search

Input (lineNumber line) records

Output lines matching a given pattern

Map if(line matches pattern) output(line)

Reduce identify function Alternative no reducer (map-only job)

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input: (filename, text) records

Output: the 100 words occurring in the most files

Two-stage solution:

Job 1: create an inverted index, giving (word, list(file)) records

Job 2: map each (word, list(file)) to (count, word); sort these records by count, as in the sort job
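A sketch of Job 2's map step, assuming Job 1 wrote its inverted index as text lines of the form "word <TAB> file1,file2,..."; the class name and input format are assumptions, not from the slides.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Turn each inverted-index record into (fileCount, word) so the shuffle
// sorts words by how many files they appear in.
public class PopularWordsMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    String[] parts = value.toString().split("\t");
    if (parts.length != 2) {
      return;  // skip malformed lines
    }
    long fileCount = parts[1].split(",").length;
    output.collect(new LongWritable(fileCount), new Text(parts[0]));
  }
}

One way to finish the job: set conf.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class) and use a single reducer, so the reducer sees words in decreasing order of file count and can emit just the first 100.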

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop: Pig and Hive

Cloudera Virtual Manager

The Cloudera VM contains a single-node Apache Hadoop cluster, along with everything you need to get started with Hadoop.

Requirements: a 64-bit host OS; virtualization software (VMware Player, KVM, or VirtualBox)

The virtualization software requires a machine that supports virtualization. If you are unsure, one way to check is to look in your BIOS and see whether virtualization is enabled.

4 GB of total RAM. The total system memory required varies depending on the size of your data set and on the other processes that are running.

Installation

Step 1: Download and run VMware Player

Step 2: Download the Cloudera VM

Step 3: Extract it to the Cloudera folder

Step 4: Open cloudera-quickstart-vm-4.4.0-1-vmware

Once you have the software installed, fire up the VirtualBox image of the Cloudera QuickStart VM; you should see an initial screen similar to the one below.

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps: 1. Set up the Hadoop environment. 2. Upload files into HDFS. 3. Execute the Java MapReduce functions in Hadoop.

Tutorial: https://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1.html

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

How it Works

Purpose: simple serialization for keys, values, and other data.

Interface Writable: read and write in binary format; convert to String for text formats.

Classes

OutputCollector: collects data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job.

Methods: collect(K key, V value)

Interface

Text: stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level.

Methods: set(byte[] utf8)
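The slides only list the method names, so here is a small illustration of a custom Writable (the PageView class is hypothetical, not part of Hadoop): the fields must be written and read back in exactly the same order.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical value type, serialized field by field in a fixed order.
public class PageView implements Writable {
  private Text url = new Text();
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    url.write(out);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    url.readFields(in);
    timestamp = in.readLong();
  }

  public String toString() {  // used when TextOutputFormat writes the value
    return url + "\t" + timestamp;
  }
}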

Demo


Exercises

Get WordCount running

Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words.

For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc.
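A possible starting point for the bigram exercise, reusing the structure of the WordCount code above (this is a sketch, not an official solution; only the mapper changes):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emit ("previous current", 1) for every pair of consecutive words in a line.
public class BigramMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text bigram = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    String previous = null;
    while (tokenizer.hasMoreTokens()) {
      String current = tokenizer.nextToken();
      if (previous != null) {
        bigram.set(previous + " " + current);
        output.collect(bigram, one);
      }
      previous = current;
    }
  }
}

The existing Reduce class from WordCount can be reused unchanged, since it simply sums the counts for each key.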

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop: Pig and Hive (reading)

Motivation

MapReduce is great, as many algorithms can be expressed as a series of MR jobs.

But it's low-level: you must think about keys, values, partitioning, etc.

Can we capture common "job patterns"?

Pig

Started at Yahoo Research

Now runs about 30% of Yahoo!'s jobs

Features: expresses sequences of MapReduce jobs; data model of nested "bags" of items; provides relational (SQL) operators (JOIN, GROUP BY, etc.); easy to plug in Java functions; Pig Pen development environment for Eclipse.

An Example Problem

Suppose you have user data in one file, website data in another, and you need to find the top 5 most-visited pages by users aged 18-25.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {

    public static class LoadPages extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();

            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }

            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
        implements Mapper<Text, Text, Text, LongWritable> {

        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc, Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // Drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
        implements Reducer<Text, LongWritable, WritableComparable, Writable> {

        public void reduce(Text key, Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc, Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
        implements Mapper<WritableComparable, Writable, LongWritable, Text> {

        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
            oc.collect((LongWritable) val, (Text) key);
        }
    }

    public static class LimitClicks extends MapReduceBase
        implements Reducer<LongWritable, Text, LongWritable, Text> {

        int count = 0;

        public void reduce(LongWritable key, Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation

Notice how naturally the components of the job translate into Pig Latin:

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load ...
Fltrd = filter ...
Pages = load ...
Joined = join ...
Grouped = group ...
Summed = ... count()...
Sorted = order ...
Top5 = limit ...

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation (2)

Notice how naturally the components of the job translate into Pig Latin:

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load ...
Fltrd = filter ...
Pages = load ...
Joined = join ...
Grouped = group ...
Summed = ... count()...
Sorted = order ...
Top5 = limit ...

[Slide diagram: the operators above are grouped into Job 1, Job 2, and Job 3, the MapReduce jobs that Pig compiles the script into.]

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Hive

Developed at Facebook

Used for the majority of Facebook jobs

"Relational database" built on Hadoop: maintains a list of table schemas; SQL-like query language (HQL); can call Hadoop Streaming scripts from HQL; supports table partitioning, clustering, complex data types, and some optimizations.

Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Partitioning breaks the table into separate files for each (dt, country) pair. Ex: /hive/page_view/dt=2008-06-08,country=US and /hive/page_view/dt=2008-06-08,country=CA

Simple Query

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

Find all page views coming from xyz.com during March 2008.

Hive reads only the partitions for the matching dates, instead of scanning the entire table.

Aggregation and Joins

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

Count users who visited each page, broken down by gender.

Sample output:
page_url    gender   count(userid)
home.php    MALE     12141412
home.php    FEMALE   15431579
photo.php   MALE     23941451
photo.php   FEMALE   21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;

Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.

Principal philosophies: make it scale, so you can throw hardware at problems; make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance).

Hive and Pig further simplify programming.

MapReduce is not suitable for all problems, but when it works it may save you a lot of time.

Resources

Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig/

Hive: http://hadoop.apache.org/hive/

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training

pigsheepyakzebra

aardvarkantbeecowelephant

2 Sort

Input (key value) records

Output same records sorted by key

Map identity function

Reduce identify function

Trick Pick partitioningfunction h such thatk1ltk2 =gt h(k1)lth(k2)

Map

Map

Map

Reduce

Reduce

ant bee

zebra

aardvarkelephant

cow

pig

sheep yak

[A-M]

[N-Z]

3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

// WordCount using the old org.apache.hadoop.mapred API. These fragments live inside
// a public class WordCount; imports are omitted here, as on the original slides.

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);
}

How it Works

Purpose: simple serialization for keys, values, and other data.

Interface: Writable. Reads and writes the binary format; convert to String for text formats.

Classes

OutputCollector: collects data output by either the Mapper or the Reducer, i.e. intermediate outputs or the output of the job. Methods:

collect(K key, V value)

Interface

Text: stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level. Methods:

set(byte[] utf8)
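For illustration, here is a minimal sketch (not from the original slides) of a hypothetical custom Writable value type pairing a URL with a click count; it simply implements the binary write/readFields methods that Hadoop's serialization layer calls.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PageClicksWritable implements Writable {
    private String url;
    private long clicks;

    public PageClicksWritable() { }                        // Writable types need a no-arg constructor

    public PageClicksWritable(String url, long clicks) {
        this.url = url;
        this.clicks = clicks;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);                                 // fields written in a fixed order...
        out.writeLong(clicks);
    }

    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();                                // ...and read back in the same order
        clicks = in.readLong();
    }

    public String toString() {                             // used when written by TextOutputFormat
        return url + "\t" + clicks;
    }
}

Such a type could then be used as a map or reduce value class in the jobs above.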

Demo


Exercises

Get WordCount running

Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words.

For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc. (A possible starting point is sketched below.)
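One possible starting point for the bigram exercise, sketched against the same old mapred API as the WordCount code above; the class would sit inside the same WordCount class and use the same imports. Only the map function changes, and the existing Reduce class can be reused unchanged.

public static class BigramMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text bigram = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String[] words = value.toString().split("\\s+");
        // Emit each pair of consecutive words on the line as one key, e.g. "Bigrams are".
        for (int i = 0; i + 1 < words.length; i++) {
            bigram.set(words[i] + " " + words[i + 1]);
            output.collect(bigram, one);
        }
    }
}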

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great, as many algorithms can be expressed as a series of MR jobs.

But it's low-level: you must think about keys, values, partitioning, etc.

Can we capture common "job patterns"?

Pig

Started at Yahoo! Research

Now runs about 30% of Yahoo!'s jobs

Features:
• Expresses sequences of MapReduce jobs
• Data model: nested "bags" of items
• Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
• Easy to plug in Java functions (see the UDF sketch below)
• Pig Pen: development environment for Eclipse
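To illustrate the "easy to plug in Java functions" point, here is a minimal sketch (not from the original slides) of a hypothetical Pig UDF that upper-cases its first argument; it assumes Pig's EvalFunc API, and in Pig Latin it would be used after REGISTERing the jar, e.g. generate myudfs.ToUpper(name).

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Returns the first field of the input tuple upper-cased, or null for empty input.
public class ToUpper extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) return null;
        return input.get(0).toString().toUpperCase();
    }
}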

An Example Problem

Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In Pig Latin

Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, count(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;

store Top5 into 'top5sites';

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation

Notice how naturally the components of the job translate into Pig Latin.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation (2)

Notice how naturally the components of the job translate into Pig Latin.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Job 1

Job 2

Job 3

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

"Relational database" built on Hadoop:
• Maintains a list of table schemas
• SQL-like query language (HQL)
• Can call Hadoop Streaming scripts from HQL
• Supports table partitioning, clustering, complex data types, some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
                        page_url STRING, referrer_url STRING,
                        ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair
  Ex: /hive/page_view/dt=2008-06-08/country=US
      /hive/page_view/dt=2008-06-08/country=CA

Simple Query

• Find all page views coming from xyz.com during March 2008:

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url LIKE '%xyz.com'

• Hive only reads the matching date partitions instead of scanning the entire table

Aggregation and Joins

• Count users who visited each page, by gender:

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender

• Sample output:

page_url    gender    count(userid)
home.php    MALE      12141412
home.php    FEMALE    15431579
photo.php   MALE      23941451
photo.php   FEMALE    21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid,
                 page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;

Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.

Principal philosophies:
• Make it scale, so you can throw hardware at problems
• Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems, but when it works, it may save you a lot of time.

Resources

Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig/

Hive: http://hadoop.apache.org/hive/

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training


3 Inverted Index

Input (filename text) records

Output list of files containing each word

Map foreach word in textsplit() output(word filename)

Combine uniquify filenames for each word (eliminate duplicates)

Reducedef reduce(word filenames) output(word sort(filenames))

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Inverted Index Example

to be or not to be afraid (12thtxt)

be (12thtxt hamlettxt)greatness (12thtxt)not (12thtxt hamlettxt)of (12thtxt)or (hamlettxt)to (hamlettxt)

hamlettxt

be not afraid of greatness

12thtxt

to hamlettxtbe hamlettxtor hamlettxtnot hamlettxt

be 12thtxtnot 12thtxtafraid 12thtxtof 12thtxtgreatness 12thtxt

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

4 Most Popular Words

Input (filename text) records

Output the 100 words occurring in most files

Two-stage solution Job 1

Create inverted index giving (word list(file)) records Job 2

Map each (word list(file)) to (count word) Sort these records by count as in sort job

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {

  public static class LoadPages extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text val,
        OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
      // Pull the key out
      String line = val.toString();
      int firstComma = line.indexOf(',');
      String key = line.substring(0, firstComma);
      String value = line.substring(firstComma + 1);
      Text outKey = new Text(key);
      // Prepend an index to the value so we know which file it came from
      Text outVal = new Text("1" + value);
      oc.collect(outKey, outVal);
    }
  }

  public static class LoadAndFilterUsers extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text val,
        OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
      // Pull the key out
      String line = val.toString();
      int firstComma = line.indexOf(',');
      String value = line.substring(firstComma + 1);
      int age = Integer.parseInt(value);
      if (age < 18 || age > 25) return;
      String key = line.substring(0, firstComma);
      Text outKey = new Text(key);
      // Prepend an index to the value so we know which file it came from
      Text outVal = new Text("2" + value);
      oc.collect(outKey, outVal);
    }
  }

  public static class Join extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> iter,
        OutputCollector<Text, Text> oc, Reporter reporter) throws IOException {
      // For each value, figure out which file it's from and store it accordingly
      List<String> first = new ArrayList<String>();
      List<String> second = new ArrayList<String>();
      while (iter.hasNext()) {
        Text t = iter.next();
        String value = t.toString();
        if (value.charAt(0) == '1') first.add(value.substring(1));
        else second.add(value.substring(1));
        reporter.setStatus("OK");
      }
      // Do the cross product and collect the values
      for (String s1 : first) {
        for (String s2 : second) {
          String outval = key + "," + s1 + "," + s2;
          oc.collect(null, new Text(outval));
          reporter.setStatus("OK");
        }
      }
    }
  }

  public static class LoadJoined extends MapReduceBase
      implements Mapper<Text, Text, Text, LongWritable> {
    public void map(Text k, Text val,
        OutputCollector<Text, LongWritable> oc, Reporter reporter) throws IOException {
      // Find the url
      String line = val.toString();
      int firstComma = line.indexOf(',');
      int secondComma = line.indexOf(',', firstComma);
      String key = line.substring(firstComma, secondComma);
      // Drop the rest of the record, it is not needed anymore;
      // just pass a 1 for the combiner/reducer to sum instead
      Text outKey = new Text(key);
      oc.collect(outKey, new LongWritable(1L));
    }
  }

  public static class ReduceUrls extends MapReduceBase
      implements Reducer<Text, LongWritable, WritableComparable, Writable> {
    public void reduce(Text key, Iterator<LongWritable> iter,
        OutputCollector<WritableComparable, Writable> oc, Reporter reporter) throws IOException {
      // Add up all the values we see
      long sum = 0;
      while (iter.hasNext()) {
        sum += iter.next().get();
        reporter.setStatus("OK");
      }
      oc.collect(key, new LongWritable(sum));
    }
  }

  public static class LoadClicks extends MapReduceBase
      implements Mapper<WritableComparable, Writable, LongWritable, Text> {
    public void map(WritableComparable key, Writable val,
        OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
      oc.collect((LongWritable) val, (Text) key);
    }
  }

  public static class LimitClicks extends MapReduceBase
      implements Reducer<LongWritable, Text, LongWritable, Text> {
    int count = 0;
    public void reduce(LongWritable key, Iterator<Text> iter,
        OutputCollector<LongWritable, Text> oc, Reporter reporter) throws IOException {
      // Only output the first 100 records
      while (count < 100 && iter.hasNext()) {
        oc.collect(key, iter.next());
        count++;
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf lp = new JobConf(MRExample.class);
    lp.setJobName("Load Pages");
    lp.setInputFormat(TextInputFormat.class);
    lp.setOutputKeyClass(Text.class);
    lp.setOutputValueClass(Text.class);
    lp.setMapperClass(LoadPages.class);
    FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
    FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
    lp.setNumReduceTasks(0);
    Job loadPages = new Job(lp);

    JobConf lfu = new JobConf(MRExample.class);
    lfu.setJobName("Load and Filter Users");
    lfu.setInputFormat(TextInputFormat.class);
    lfu.setOutputKeyClass(Text.class);
    lfu.setOutputValueClass(Text.class);
    lfu.setMapperClass(LoadAndFilterUsers.class);
    FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
    FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
    lfu.setNumReduceTasks(0);
    Job loadUsers = new Job(lfu);

    JobConf join = new JobConf(MRExample.class);
    join.setJobName("Join Users and Pages");
    join.setInputFormat(KeyValueTextInputFormat.class);
    join.setOutputKeyClass(Text.class);
    join.setOutputValueClass(Text.class);
    join.setMapperClass(IdentityMapper.class);
    join.setReducerClass(Join.class);
    FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
    FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
    FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
    join.setNumReduceTasks(50);
    Job joinJob = new Job(join);
    joinJob.addDependingJob(loadPages);
    joinJob.addDependingJob(loadUsers);

    JobConf group = new JobConf(MRExample.class);
    group.setJobName("Group URLs");
    group.setInputFormat(KeyValueTextInputFormat.class);
    group.setOutputKeyClass(Text.class);
    group.setOutputValueClass(LongWritable.class);
    group.setOutputFormat(SequenceFileOutputFormat.class);
    group.setMapperClass(LoadJoined.class);
    group.setCombinerClass(ReduceUrls.class);
    group.setReducerClass(ReduceUrls.class);
    FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
    FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
    group.setNumReduceTasks(50);
    Job groupJob = new Job(group);
    groupJob.addDependingJob(joinJob);

    JobConf top100 = new JobConf(MRExample.class);
    top100.setJobName("Top 100 sites");
    top100.setInputFormat(SequenceFileInputFormat.class);
    top100.setOutputKeyClass(LongWritable.class);
    top100.setOutputValueClass(Text.class);
    top100.setOutputFormat(SequenceFileOutputFormat.class);
    top100.setMapperClass(LoadClicks.class);
    top100.setCombinerClass(LimitClicks.class);
    top100.setReducerClass(LimitClicks.class);
    FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
    FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
    top100.setNumReduceTasks(1);
    Job limit = new Job(top100);
    limit.addDependingJob(groupJob);

    JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
    jc.addJob(loadPages);
    jc.addJob(loadUsers);
    jc.addJob(joinJob);
    jc.addJob(groupJob);
    jc.addJob(limit);
    jc.run();
  }
}


In Pig Latin

Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;
store Top5 into 'top5sites';


Ease of Translation

Notice how naturally the components of the job translate into Pig Latin.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …


Ease of Translation (2)

Notice how naturally the components of the job translate into Pig Latin.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load …
Fltrd = filter …
Pages = load …
Joined = join …
Grouped = group …
Summed = … count()…
Sorted = order …
Top5 = limit …

Job 1

Job 2

Job 3


Hive

Developed at Facebook

Used for majority of Facebook jobs

"Relational database" built on Hadoop: maintains a list of table schemas; SQL-like query language (HQL); can call Hadoop Streaming scripts from HQL; supports table partitioning, clustering, complex data types, and some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair. Ex: /hive/page_view/dt=2008-06-08/country=US, /hive/page_view/dt=2008-06-08/country=CA

Simple Query

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

• Hive only reads the partitions for the requested dates instead of scanning the entire table

• The query finds all page views coming from xyz.com during March 2008

Aggregation and Joins

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Count users who visited each page, by gender
• Sample output:

page_url    gender   count(userid)
home.php    MALE     12141412
home.php    FEMALE   15431579
photo.php   MALE     23941451
photo.php   FEMALE   21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid,
                 page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;

Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance

Principal philosophies: make it scale, so you can throw hardware at problems; make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems, but when it works, it may save you a lot of time

Resources

Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig

Hive: http://hadoop.apache.org/hive

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training


Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop: Pig and Hive

Cloudera Virtual Machine

The Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements: a 64-bit host OS; virtualization software (VMware Player, KVM, or VirtualBox)

The virtualization software requires a machine that supports virtualization. If you are unsure, one way to check is to look in your BIOS and see whether virtualization is enabled.

4 GB of total RAM. The total system memory required varies depending on the size of your data set and on the other processes that are running.

Installation

Step 1: Download & run VMware Player

Step 2: Download the Cloudera VM

Step 3: Extract it to the Cloudera folder

Step 4: Open the cloudera-quickstart-vm-4.4.0-1-vmware image

Once you have the software installed, fire up the VirtualBox image of the Cloudera QuickStart VM and you should see an initial screen similar to the one below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps: 1. Set up the Hadoop environment. 2. Upload files into HDFS. 3. Execute the Java MapReduce functions in Hadoop.

• Tutorial: https://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_wordcount1.html

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

How it Works

Purpose: simple serialization for keys, values, and other data.

Interface Writable: read and write a binary format; convert to String for text formats. (A small sketch of a custom Writable follows below.)
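
As a concrete illustration (not from the slides), a minimal custom type implementing Writable might look like the following; the IntPair name and its two fields are assumptions made for the example.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical Writable holding two ints, e.g. (year, count).
public class IntPair implements Writable {
  private int first;
  private int second;

  public void set(int first, int second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(first);    // binary format: two 4-byte ints
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    first = in.readInt();   // must read in the same order as write()
    second = in.readInt();
  }

  @Override
  public String toString() {  // text form, e.g. for TextOutputFormat
    return first + "\t" + second;
  }
}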

Classes

OutputCollector: collects data output by either the Mapper or the Reducer, i.e. the intermediate outputs or the output of the job.

Methods: collect(K key, V value)

Interface

Text: stores text using standard UTF-8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level. (A short usage sketch follows below.)

Methods: set(byte[] utf8)
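
A tiny illustrative usage of Text and set(byte[]), assumed for this write-up and not part of the slides:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.io.Text;

public class TextDemo {
  public static void main(String[] args) {
    Text t = new Text();
    t.set("hadoop".getBytes(StandardCharsets.UTF_8));  // set(byte[] utf8)
    Text u = new Text("hbase");
    System.out.println(t);                    // prints: hadoop
    System.out.println(t.compareTo(u) < 0);   // byte-level comparison: true
  }
}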

Demo


Exercises

Get WordCount running

Extend WordCount to count bigrams. Bigrams are simply sequences of two consecutive words.

For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequences of", etc. (A starter mapper sketch follows below.)
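
As a starting point for this exercise, here is a minimal sketch of a bigram mapper in the same old-style API as the WordCount example above; the BigramMap class name is an assumption, and WordCount's Reduce class can be reused unchanged.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: emits ("word1 word2", 1) for each pair of consecutive words in a line.
public class BigramMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text bigram = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    String prev = null;
    while (tokenizer.hasMoreTokens()) {
      String word = tokenizer.nextToken();
      if (prev != null) {
        bigram.set(prev + " " + word);  // the bigram "prev word"
        output.collect(bigram, one);
      }
      prev = word;
    }
  }
}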

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Cloudera Virtual Manager

Cloudera VM contains a single-node Apache Hadoop cluster along with everything you need to get started with Hadoop

Requirements A 64-bit host OS A virtualization software VMware Player KVM or VirtualBox

Virtualization Software will require a laptop that supports virtualization If you are unsure one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled

A 4 GB of total RAM The total system memory required varies depending on the size

of your data set and on the other processes that are running

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Installation

Step1 Download amp Run Vmware

Step2 Download Cloudera VM

Step3 Extract to the Cloudera folder

Step4 Open the cloudera-quickstart-vm-440-1-vmware

Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

• Count users who visited each page by gender

• Sample output:

page_url    gender   count(userid)
home.php    MALE     12141412
home.php    FEMALE   15431579
photo.php   MALE     23941451
photo.php   FEMALE   21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;
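A minimal sketch of what map_script.py might look like (the script itself is not shown in the slides; it assumes Hive's default TRANSFORM behaviour of streaming each row to the script as tab-separated fields on stdin and reading tab-separated fields back from stdout):

#!/usr/bin/env python
import sys

for line in sys.stdin:
    userid, date = line.rstrip("\n").split("\t")
    # emit the columns in the order named by "AS dt, uid"
    print(date + "\t" + userid)

CLUSTER BY dt then distributes the transformed rows by dt and sorts them within each reducer.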

Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance

Principal philosophies:
– Make it scale, so you can throw hardware at problems
– Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems, but when it works, it may save you a lot of time

Resources

Hadoop: http://hadoop.apache.org/core/

Hadoop docs:

http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig

Hive: http://hadoop.apache.org/hive

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training


Once you got the software installed fire up the VirtualBox image of Cloudera QuickStart VM and you should see the initial screen similar to below

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

WordCount Example

This example computes the occurrence frequency of each word in a text file

Steps1 Set up Hadoop environment2 Upload files into HDFS3 Executing Java MapReduce functions in Hadoop

bull Tutorial

httpswwwclouderacomcontentcloudera-contentcloudera-docsHadoopTutorialCDH4Hadoop-Tutorialht_wordcount1html

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

public static class Map extends MapReduceBase implements MapperltLongWritable Text Text IntWritable gt

private final static IntWritable one = new IntWritable(1) private Text word = new Text)(

public void map(LongWritable key Text value OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

String line = valuetoString)(

StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())

outputcollect(word one)

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

public static class Reduce extends MapReduceBase implements ReducerltText IntWritable Text IntWritable gt

public void reduce(Text key IteratorltIntWritablegt values OutputCollectorltText IntWritablegt output Reporter reporter) throws IOException

int sum = 0

while (valueshasNext()) sum += valuesnext()get)(

outputcollect(key new IntWritable(sum))

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In Pig Latin

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt
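
The same pipeline can also be driven from Java through Pig's embedding API rather than typed into the Grunt shell. The sketch below is illustrative only: PigServer, registerQuery and store are standard Pig classes and methods, but the execution mode and the input/output paths are assumptions for this example, not something shown on the slides.

import java.io.IOException;
import org.apache.pig.PigServer;

public class Top5Sites {
    public static void main(String[] args) throws IOException {
        // "mapreduce" runs on the Hadoop cluster; "local" runs on the local machine.
        PigServer pig = new PigServer("mapreduce");
        pig.registerQuery("Users = load 'users' as (name, age);");
        pig.registerQuery("Filtered = filter Users by age >= 18 and age <= 25;");
        pig.registerQuery("Pages = load 'pages' as (user, url);");
        pig.registerQuery("Joined = join Filtered by name, Pages by user;");
        pig.registerQuery("Grouped = group Joined by url;");
        pig.registerQuery("Summed = foreach Grouped generate group, COUNT(Joined) as clicks;");
        pig.registerQuery("Sorted = order Summed by clicks desc;");
        pig.registerQuery("Top5 = limit Sorted 5;");
        pig.store("Top5", "top5sites");   // materializes the result on HDFS
    }
}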

Ease of Translation: Notice how naturally the components of the job translate into Pig Latin.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users   = load …
Fltrd   = filter …
Pages   = load …
Joined  = join …
Grouped = group …
Summed  = … count()…
Sorted  = order …
Top5    = limit …

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation (2): Notice how naturally the components of the job translate into Pig Latin.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users   = load …
Fltrd   = filter …
Pages   = load …
Joined  = join …
Grouped = group …
Summed  = … count()…
Sorted  = order …
Top5    = limit …

Job 1

Job 2

Job 3

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

"Relational database" built on Hadoop
 Maintains list of table schemas
 SQL-like query language (HQL)
 Can call Hadoop Streaming scripts from HQL
 Supports table partitioning, clustering, complex data types, some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair
  Ex: /hive/page_view/dt=2008-06-08/country=US
      /hive/page_view/dt=2008-06-08/country=CA

Simple Query

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

• Find all page views coming from xyz.com during March 2008
• Hive only reads the partitions for 2008-03-01 through 2008-03-31 instead of scanning the entire table

Aggregation and Joins

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Count users who visited each page, by gender
• Sample output:

page_url    gender   count(userid)
home.php    MALE     12141412
home.php    FEMALE   15431579
photo.php   MALE     23941451
photo.php   FEMALE   21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid,
                 page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;
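
TRANSFORM ... USING hands each input row to the script as one tab-separated line on standard input, and every tab-separated line the script prints to standard output becomes an output row (here named dt and uid). The slides do not show map_script.py itself; as a rough sketch of that streaming contract, a hypothetical stand-in written in Java could look like this:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Hypothetical stand-in for map_script.py: Hive streams each row as
// tab-separated fields (userid, date) on stdin and parses the
// tab-separated lines written to stdout as the (dt, uid) output rows.
public class MapScript {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split("\t");
            String userid = fields[0];
            String date = fields[1];
            // Reorder the columns so the output matches "AS dt, uid"
            System.out.println(date + "\t" + userid);
        }
    }
}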

Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance

Principal philosophies:
 Make it scale, so you can throw hardware at problems
 Make it cheap, saving hardware, programmer and administration costs (but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems, but when it works, it may save you a lot of time

Resources

Hadoop: http://hadoop.apache.org/core

Hadoop docs: http://hadoop.apache.org/core/docs/current

Pig: http://hadoop.apache.org/pig

Hive: http://hadoop.apache.org/hive

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training

public static void main(String[] args) throws Exception JobConf conf = new JobConf(WordCountclass)

confsetJobName(wordcount)

confsetOutputKeyClass(Textclass) confsetOutputValueClass(IntWritableclass)

confsetMapperClass(Mapclass) confsetCombinerClass(Reduceclass)

confsetReducerClass(Reduceclass)

confsetInputFormat(TextInputFormatclass) confsetOutputFormat(TextOutputFormatclass)

FileInputFormatsetInputPaths(conf new Path(args[0])) FileOutputFormatsetOutputPath(conf new Path(args[1]))

JobClientrunJob(conf)

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

How it Works

Purpose Simple serialization for keys values and other data

Interface Writable Read and write binary format Convert to String for text formats

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Classes

OutputCollector Collects data output by either the Mapper or the Reducer ie

intermediate outputs or the output of the job Methods

collect(K key V value)

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Interface

Text Stores text using standard UTF8 encoding It provides methods to

serialize deserialize and compare texts at byte level Methods

set(byte[] utf8)

Demo

40

Exercises

Get WordCount running

Extend WordCount to count Bigrams Bigrams are simply sequences of two consecutive words

For example the previous sentence contains the following bigrams Bigrams are are simply simply sequences sequence of etc

Outline

MapReduce architecture

Fault tolerance in MapReduce

Sample applications

Getting started with Hadoop

Higher-level languages on top of Hadoop Pig and Hive (reading)

Motivation

MapReduce is great as many algorithms can be expressed by a series of MR jobs

But itrsquos low-level must think about keys values partitioning etc

Can we capture common ldquojob patternsrdquo

Pig

Started at Yahoo Research

Now runs about 30 of Yahoorsquos jobs

Features Expresses sequences of MapReduce jobs Data model nested ldquobagsrdquo of items Provides relational (SQL) operators

(JOIN GROUP BY etc) Easy to plug in Java functions Pig Pen dev env for Eclipse

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users   = load …
Fltrd   = filter …
Pages   = load …
Joined  = join …
Grouped = group …
Summed  = … count()…
Sorted  = order …
Top5    = limit …

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Ease of Translation (2)
Notice how naturally the components of the job translate into Pig Latin.

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users   = load …
Fltrd   = filter …
Pages   = load …
Joined  = join …
Grouped = group …
Summed  = … count()…
Sorted  = order …
Top5    = limit …

Job 1

Job 2

Job 3

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

"Relational database" built on Hadoop: maintains a list of table schemas; SQL-like query language (HQL); can call Hadoop Streaming scripts from HQL; supports table partitioning, clustering, complex data types, and some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT, userid BIGINT,
                        page_url STRING, referrer_url STRING,
                        ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

• Partitioning breaks the table into separate files for each (dt, country) pair. Ex: /hive/page_view/dt=2008-06-08/country=US, /hive/page_view/dt=2008-06-08/country=CA

Simple Query

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01'
  AND page_views.date <= '2008-03-31'
  AND page_views.referrer_url like '%xyz.com';

• Hive reads only the partitions matching the date filter instead of scanning the entire table

• Find all page views coming from xyz.com during March 2008

Aggregation and Joins

SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
FROM page_views pv JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
GROUP BY pv.page_url, u.gender;

• Count users who visited each page, by gender

• Sample output:

page_url   gender  count(userid)
home.php   MALE    12141412
home.php   FEMALE  15431579
photo.php  MALE    23941451
photo.php  FEMALE  21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid CLUSTER BY dt
FROM page_views;
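
The TRANSFORM clause hands each selected row to an external script over standard input and reads the script's standard output back as the columns named in AS. The slides do not show map_script.py itself, so the following is only an assumed, minimal sketch of what such a script could look like; here it simply reorders the tab-separated fields so the date comes out first:

#!/usr/bin/env python
# Hypothetical sketch of map_script.py (not part of the original example).
# Hive Streaming feeds each selected row as one tab-separated line on stdin;
# each tab-separated line written to stdout becomes a (dt, uid) output row.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue  # skip malformed rows
    userid, date = fields[0], fields[1]
    # Emit the date first, then the user id, matching "AS dt, uid"
    print(date + "\t" + userid)

Such a script can be smoke-tested without Hadoop by piping a few tab-separated lines into it on the command line.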

Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance

Principal philosophies: make it scale, so you can throw hardware at problems; make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems, but when it works it may save you a lot of time

Resources

Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig

Hive: http://hadoop.apache.org/hive

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training


store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

An Example Problem

Suppose you have user data in one file website data in another and you need to find the top 5 most visited pages by users aged 18 - 25

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

In MapReduceimport javaioIOException import javautilArrayList import javautilIterator import javautilList import orgapachehadoopfsPath import orgapachehadoopioLongWritable import orgapachehadoopioText import orgapachehadoopioWritable import orgapachehadoopioWritableComparable import orgapachehadoopmapredFileInputFormat import orgapachehadoopmapredFileOutputFormat import orgapachehadoopmapredJobConf import orgapachehadoopmapredKeyValueTextInputFormat import orgapachehadoopmapredMapper import orgapachehadoopmapredMapReduceBase import orgapachehadoopmapredOutputCollector import orgapachehadoopmapredRecordReader import orgapachehadoopmapredReducer import orgapachehadoopmapredReporter import orgapachehadoopmapredSequenceFileInputFormat import orgapachehadoopmapredSequenceFileOutputFormat import orgapachehadoopmapredTextInputFormat import orgapachehadoopmapredjobcontrolJob import orgapachehadoopmapredjobcontrolJobControl import orgapachehadoopmapredlibIdentityMapper public class MRExample public static class LoadPages extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String key = linesubstring(0 firstComma) String value = linesubstring(firstComma + 1) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(1 + value) occollect(outKey outVal) public static class LoadAndFilterUsers extends MapReduceBase implements MapperltLongWritable Text Text Textgt public void map(LongWritable k Text val OutputCollectorltText Textgt oc Reporter reporter) throws IOException Pull the key out String line = valtoString() int firstComma = lineindexOf() String value = linesubstring(firstComma + 1) int age = IntegerparseInt(value) if (age lt 18 || age gt 25) return String key = linesubstring(0 firstComma) Text outKey = new Text(key) Prepend an index to the value so we know which file it came from Text outVal = new Text(2 + value) occollect(outKey outVal) public static class Join extends MapReduceBase implements ReducerltText Text Text Textgt public void reduce(Text key IteratorltTextgt iter OutputCollectorltText Textgt oc Reporter reporter) throws IOException For each value figure out which file its from and store it accordingly ListltStringgt first = new ArrayListltStringgt() ListltStringgt second = new ArrayListltStringgt() while (iterhasNext()) Text t = iternext() String value = ttoString() if (valuecharAt(0) == 1) firstadd(valuesubstring(1)) else secondadd(valuesubstring(1))

reportersetStatus(OK) Do the cross product and collect the values for (String s1 first) for (String s2 second) String outval = key + + s1 + + s2 occollect(null new Text(outval)) reportersetStatus(OK) public static class LoadJoined extends MapReduceBase implements MapperltText Text Text LongWritablegt public void map( Text k Text val OutputCollectorltText LongWritablegt oc Reporter reporter) throws IOException Find the url String line = valtoString() int firstComma = lineindexOf() int secondComma = lineindexOf( firstComma) String key = linesubstring(firstComma secondComma) drop the rest of the record I dont need it anymore just pass a 1 for the combinerreducer to sum instead Text outKey = new Text(key) occollect(outKey new LongWritable(1L)) public static class ReduceUrls extends MapReduceBase implements ReducerltText LongWritable WritableComparable Writablegt public void reduce( Text key IteratorltLongWritablegt iter OutputCollectorltWritableComparable Writablegt oc Reporter reporter) throws IOException Add up all the values we see long sum = 0 while (iterhasNext()) sum += iternext()get() reportersetStatus(OK) occollect(key new LongWritable(sum)) public static class LoadClicks extends MapReduceBase implements MapperltWritableComparable Writable LongWritable Textgt public void map( WritableComparable key Writable val OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException occollect((LongWritable)val (Text)key) public static class LimitClicks extends MapReduceBase implements ReducerltLongWritable Text LongWritable Textgt int count = 0 public void reduce( LongWritable key IteratorltTextgt iter OutputCollectorltLongWritable Textgt oc Reporter reporter) throws IOException Only output the first 100 records while (count lt 100 ampamp iterhasNext()) occollect(key iternext()) count++ public static void main(String[] args) throws IOException JobConf lp = new JobConf(MRExampleclass) lpsetJobName(Load Pages) lpsetInputFormat(TextInputFormatclass)

lpsetOutputKeyClass(Textclass) lpsetOutputValueClass(Textclass) lpsetMapperClass(LoadPagesclass) FileInputFormataddInputPath(lp new Path(usergatespages)) FileOutputFormatsetOutputPath(lp new Path(usergatestmpindexed_pages)) lpsetNumReduceTasks(0) Job loadPages = new Job(lp) JobConf lfu = new JobConf(MRExampleclass) lfusetJobName(Load and Filter Users) lfusetInputFormat(TextInputFormatclass) lfusetOutputKeyClass(Textclass) lfusetOutputValueClass(Textclass) lfusetMapperClass(LoadAndFilterUsersclass) FileInputFormataddInputPath(lfu new Path(usergatesusers)) FileOutputFormatsetOutputPath(lfu new Path(usergatestmpfiltered_users)) lfusetNumReduceTasks(0) Job loadUsers = new Job(lfu) JobConf join = new JobConf(MRExampleclass) joinsetJobName(Join Users and Pages) joinsetInputFormat(KeyValueTextInputFormatclass) joinsetOutputKeyClass(Textclass) joinsetOutputValueClass(Textclass) joinsetMapperClass(IdentityMapperclass) joinsetReducerClass(Joinclass) FileInputFormataddInputPath(join new Path(usergatestmpindexed_pages)) FileInputFormataddInputPath(join new Path(usergatestmpfiltered_users)) FileOutputFormatsetOutputPath(join new Path(usergatestmpjoined)) joinsetNumReduceTasks(50) Job joinJob = new Job(join) joinJobaddDependingJob(loadPages) joinJobaddDependingJob(loadUsers) JobConf group = new JobConf(MRExampleclass) groupsetJobName(Group URLs) groupsetInputFormat(KeyValueTextInputFormatclass) groupsetOutputKeyClass(Textclass) groupsetOutputValueClass(LongWritableclass) groupsetOutputFormat(SequenceFileOutputFormatclass) groupsetMapperClass(LoadJoinedclass) groupsetCombinerClass(ReduceUrlsclass) groupsetReducerClass(ReduceUrlsclass) FileInputFormataddInputPath(group new Path(usergatestmpjoined)) FileOutputFormatsetOutputPath(group new Path(usergatestmpgrouped)) groupsetNumReduceTasks(50) Job groupJob = new Job(group) groupJobaddDependingJob(joinJob) JobConf top100 = new JobConf(MRExampleclass) top100setJobName(Top 100 sites) top100setInputFormat(SequenceFileInputFormatclass) top100setOutputKeyClass(LongWritableclass) top100setOutputValueClass(Textclass) top100setOutputFormat(SequenceFileOutputFormatclass) top100setMapperClass(LoadClicksclass) top100setCombinerClass(LimitClicksclass) top100setReducerClass(LimitClicksclass) FileInputFormataddInputPath(top100 new Path(usergatestmpgrouped)) FileOutputFormatsetOutputPath(top100 new Path(usergatestop100sitesforusers18to25)) top100setNumReduceTasks(1) Job limit = new Job(top100) limitaddDependingJob(groupJob) JobControl jc = new JobControl(Find top 100 sites for users 18 to 25) jcaddJob(loadPages) jcaddJob(loadUsers) jcaddJob(joinJob) jcaddJob(groupJob) jcaddJob(limit) jcrun()

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Users = load lsquousersrsquo as (name age)Filtered = filter Users by age gt= 18 and age lt= 25 Pages = load lsquopagesrsquo as (user url)Joined = join Filtered by name Pages by userGrouped = group Joined by urlSummed = foreach Grouped generate group count(Joined) as clicksSorted = order Summed by clicks descTop5 = limit Sorted 5

store Top5 into lsquotop5sitesrsquo

In Pig Latin

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Ease of TranslationNotice how naturally the components of the job translate into Pig Latin

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5

Users = load hellipFltrd = filter hellip Pages = load hellipJoined = join hellipGrouped = group hellipSummed = hellip count()hellipSorted = order hellipTop5 = limithellip

Job 1

Job 2

Job 3

Example from httpwikiapacheorgpig-dataattachmentsPigTalksPapersattachmentsApacheConEurope09ppt

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Hive

Developed at Facebook

Used for majority of Facebook jobs

ldquoRelational databaserdquo built on Hadoop Maintains list of table schemas SQL-like query language (HQL) Can call Hadoop Streaming scripts from HQL Supports table partitioning clustering complex

data types some optimizations

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReducersquos data-parallel programming model hides complexity of distribution and fault tolerance

Principal philosophies Make it scale so you can throw hardware at problems Make it cheap saving hardware programmer and administration costs

(but requiring fault tolerance)

Hive and Pig further simplify programming

MapReduce is not suitable for all problems but when it works it may save you a lot of time

Resources

Hadoop httphadoopapacheorgcore

Hadoop docs

httphadoopapacheorgcoredocscurrent

Pig httphadoopapacheorgpig

Hive httphadoopapacheorghive

Hadoop video tutorials from Cloudera httpwwwclouderacomhadoop-training

  • Introduction to MapReduce and Hadoop
  • What is MapReduce
  • What is MapReduce used for
  • What is MapReduce used for (2)
  • Outline
  • MapReduce Design Goals
  • Typical Hadoop Cluster
  • Typical Hadoop Cluster (2)
  • Challenges
  • Hadoop Components
  • Hadoop Distributed File System
  • MapReduce Programming Model
  • Example Word Count
  • Word Count Execution
  • MapReduce Execution Details
  • An Optimization The Combiner
  • Word Count with Combiner
  • Outline (2)
  • Fault Tolerance in MapReduce
  • Fault Tolerance in MapReduce (2)
  • Fault Tolerance in MapReduce (3)
  • Takeaways
  • Outline (3)
  • 1 Search
  • 2 Sort
  • 3 Inverted Index
  • Inverted Index Example
  • 4 Most Popular Words
  • Outline (4)
  • Cloudera Virtual Manager
  • Installation
  • Once you got the software installed fire up the VirtualBox ima
  • WordCount Example
  • Slide 34
  • Slide 35
  • Slide 36
  • How it Works
  • Classes
  • Interface
  • Demo
  • Exercises
  • Outline (5)
  • Motivation
  • Pig
  • An Example Problem
  • In MapReduce
  • In Pig Latin
  • Ease of Translation
  • Ease of Translation (2)
  • Hive
  • Creating a Hive Table
  • Simple Query
  • Aggregation and Joins
  • Using a Hadoop Streaming Mapper Script
  • Conclusions
  • Resources

Creating a Hive Table

CREATE TABLE page_views(viewTime INT userid BIGINT

page_url STRING referrer_url STRING

ip STRING COMMENT User IP address)

COMMENT This is the page view table

PARTITIONED BY(dt STRING country STRING)

STORED AS SEQUENCEFILE

bullPartitioning breaks table into separate files for each (dt country) pairEx hivepage_viewdt=2008-06-08country=US hivepage_viewdt=2008-06-08country=CA

Simple Query

SELECT page_views

FROM page_views

WHERE page_viewsdate gt= 2008-03-01

AND page_viewsdate lt= 2008-03-31

AND page_viewsreferrer_url like xyzcom

bullHive only reads partition 2008-03-01 instead of scanning entire table

bullFind all page views coming from xyzcom on March 31st

Aggregation and Joins

SELECT pvpage_url ugender COUNT(DISTINCT uid)

FROM page_views pv JOIN user u ON (pvuserid = uid)

GROUP BY pvpage_url ugender

WHERE pvdate = 2008-03-03

bullCount users who visited each page by gender

bullSample outputpage_url gender count(userid)

homephp MALE 12141412

homephp FEMALE 15431579

photophp MALE 23941451

photophp FEMALE 21231314

Using a Hadoop Streaming Mapper Script

SELECT TRANSFORM(page_viewsuserid

page_viewsdate)

USING map_scriptpy

AS dt uid CLUSTER BY dt

FROM page_views

Conclusions

MapReduce's data-parallel programming model hides the complexity of distribution and fault tolerance.

Principal design philosophies:

• Make it scale, so you can throw hardware at problems.

• Make it cheap, saving hardware, programmer, and administration costs (but requiring fault tolerance).

Hive and Pig further simplify programming.

MapReduce is not suitable for all problems, but when it works, it may save you a lot of time.

Resources

Hadoop: http://hadoop.apache.org/core/

Hadoop docs: http://hadoop.apache.org/core/docs/current/

Pig: http://hadoop.apache.org/pig/

Hive: http://hadoop.apache.org/hive/

Hadoop video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
