MapReduce
Transcript of "MapReduce" training slides
TABLE OF CONTENTS
Map Reduce
Map Reduce Features
Mapper
Mapper – An Example
Reducer
Reducer - An Example
Map Reduce – The Big Picture
Word Count – A Map Reduce Example
Word Count, Code Walk Through
How Map Reduce works in Word Count
Word Count – Execution
Only for TCS Internal Training - NextGen Solutions, Kochi
Map Reduce
MapReduce is the system used to process data in a Hadoop cluster. It is essentially a
software framework for writing applications that process vast amounts of data in parallel
on large clusters
MapReduce works by breaking the processing into two phases: the map phase and the
reduce phase
Each Map task operates on a discrete portion of the overall dataset – typically one
HDFS block of data
After all Maps are complete, the MapReduce system distributes the intermediate data to
the nodes that perform the Reduce phase
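The two phases can be sketched as a small in-memory simulation in plain Java. This is not the Hadoop API, just an illustration of the data flow: the map phase turns input records into intermediate key/value pairs, and the reduce phase aggregates all values per key (the class and method names here are hypothetical).

```java
import java.util.*;

// A minimal in-memory sketch of the two MapReduce phases, using word
// counting as the workload. Real jobs use the Hadoop API instead.
public class MapReducePhases {

    // Map phase: each input line produces (word, 1) pairs.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : lines) {
            for (String token : line.split("\\s+")) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Reduce phase: all intermediate values for a given key are summed.
    static Map<String, Integer> reducePhase(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> result = new TreeMap<>(); // sorted key order, as in Hadoop
        for (Map.Entry<String, Integer> p : pairs) {
            result.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Hello World", "Hello Hadoop");
        System.out.println(reducePhase(mapPhase(input))); // {Hadoop=1, Hello=2, World=1}
    }
}
```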
Map Reduce - Features
Automatic parallelization and distribution
Fault-tolerance
A clean abstraction for programmers
- Developers can concentrate simply on writing the Map and Reduce functions
- Jobs can be written in scripting languages other than Java via Hadoop Streaming
Mapper
Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data
locally, to avoid network traffic
Multiple Mappers run in parallel, each processing a portion of the input data
The Mapper reads data in the form of key/value pairs
The Mapper outputs zero or more key/value pairs
If the Mapper writes anything out, the output must be in the form of key/value pairs
The Mapper may use or completely ignore the input key
– For example, a standard pattern is to read a line of a file at a time
– The key is the byte offset into the file at which the line starts
– The value is the contents of the line itself
– Typically the key is considered irrelevant
Mapper – An Example
A Mapper that turns the input key and value to their corresponding upper case
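The slide's code is not reproduced in this transcript. The core logic can be sketched in plain Java as below; a real implementation would extend org.apache.hadoop.mapreduce.Mapper and emit the pair via context.write(). The class name is hypothetical.

```java
import java.util.AbstractMap;
import java.util.Map;

// Sketch of the map() logic described on the slide: emit the input key
// and value converted to upper case. In a real Hadoop Mapper this would
// be context.write(new Text(upperKey), new Text(upperValue)).
public class UpperCaseMapSketch {

    static Map.Entry<String, String> map(String key, String value) {
        return new AbstractMap.SimpleEntry<>(key.toUpperCase(), value.toUpperCase());
    }

    public static void main(String[] args) {
        Map.Entry<String, String> out = map("hadoop", "map reduce");
        System.out.println(out.getKey() + " -> " + out.getValue()); // HADOOP -> MAP REDUCE
    }
}
```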
Reducer
After the Map phase is over, all the intermediate values for a given intermediate key are
combined together into a list
This list is given to a Reducer
– There may be a single Reducer, or multiple Reducers
– This is specified as part of the job configuration
– All values associated with a particular intermediate key are guaranteed to go to the same Reducer
– The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
– This step is known as the ‘shuffle and sort’
The Reducer outputs zero or more final key/value pairs
– These are written to HDFS
– In practice, the Reducer usually emits a single key/value pair for each input key
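The 'shuffle and sort' guarantees above can be illustrated with a small sketch in plain Java (a simulation of what the framework does between the phases, not the Hadoop API; the names are hypothetical): all values for one key end up in a single list, and the lists are delivered in sorted key order.

```java
import java.util.*;

// Sketch of shuffle and sort: intermediate (key, value) pairs are grouped
// into one value list per key, and the lists are handed to the Reducer in
// sorted key order (TreeMap iterates its keys in sorted order).
public class ShuffleSortSketch {

    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> intermediate) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : intermediate) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
            Map.entry("World", 1), Map.entry("Hello", 1), Map.entry("World", 1));
        System.out.println(shuffle(pairs)); // {Hello=[1], World=[1, 1]}
    }
}
```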
Reducer – An Example
A Reducer that adds up all the values associated with each intermediate key
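The slide's code is not reproduced in this transcript. The summing logic can be sketched in plain Java as below; a real implementation would extend org.apache.hadoop.mapreduce.Reducer, iterate an Iterable&lt;IntWritable&gt;, and emit the total via context.write(). The class name is hypothetical.

```java
import java.util.List;

// Sketch of the reduce() logic described on the slide: add up all the
// values associated with one intermediate key and return the total.
public class SumReduceSketch {

    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("Hadoop\t" + reduce("Hadoop", List.of(1, 1))); // Hadoop	2
    }
}
```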
Map Reduce – The Big Picture
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
WordCount - A Map Reduce Example
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    if (args.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
WordCount - A Map Reduce Example
Word Count, Code Walk Through - main() (1)
Configuration conf = new Configuration();
Provides access to configuration parameters; here we create a new Configuration object
Job job = new Job(conf, "word count");
Create a new MapReduce Job with the name "word count"
Once the job runs, the name "word count" is visible in the JobTracker
job.setJarByClass(WordCount.class);
Specify the main class for the execution of the job.
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
Specify the Mapper, Combiner, and Reducer classes for the Job
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
Since the expected output is a "word" and its occurrence "count", set the Reducer output key class to Text and the value class to IntWritable
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
Specify the input and output directories for the Job using the command-line arguments
System.exit(job.waitForCompletion(true) ? 0 : 1);
Submits the job to the cluster and waits for it to complete.
Word Count, Code Walk Through - main() (2)
Word Count, Code Walk Through – Map/Reduce Logic
Map
Objects of type Text and IntWritable are created as word and one respectively
Each line of the input file arrives as the variable value (defined in the map function)
Each token of the input string is extracted and assigned to word
The Context is populated with the pair (word, one) for each token
Reduce
An intermediate shuffle and sort mechanism ensures that the Reducer receives each key
together with its associated list of values
The Reducer iterates through the values for each key, adding each value to the local
variable sum
Output is generated for every word in the input file
How Map Reduce Works in Word Count
Consider the program executing with two Map tasks: the first two lines of the input go to one map and the rest to the other
Map 1 emits:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
Map 2 emits:
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
The combiner performs local aggregation after the pairs have been sorted on their keys
Combiner 1 output:
<Bye, 1> <Hello, 1> <World, 2>
Combiner 2 output:
<Goodbye, 1> <Hadoop, 2> <Hello, 1>
The Reducer simply sums up the values, which are the occurrence counts for each key (i.e. each word)
Reducer Output:
<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
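The combiner's local aggregation can be reproduced with a short sketch in plain Java (a simulation, not the Hadoop API; the class name is hypothetical). Each map output pair carries a count of 1, so aggregating Map 1's keys locally yields one sorted pair per key.

```java
import java.util.*;

// Sketch of the combiner: locally aggregate a single Map task's output,
// producing at most one (key, count) pair per key, in sorted key order.
public class CombinerSketch {

    static SortedMap<String, Integer> combine(List<String> mapOutputKeys) {
        SortedMap<String, Integer> combined = new TreeMap<>();
        for (String key : mapOutputKeys) {
            combined.merge(key, 1, Integer::sum); // each map output pair carried a count of 1
        }
        return combined;
    }

    public static void main(String[] args) {
        // Map 1 emitted <Hello,1> <World,1> <Bye,1> <World,1>
        System.out.println(combine(List.of("Hello", "World", "Bye", "World")));
        // prints {Bye=1, Hello=1, World=2}
    }
}
```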
Word Count - Execution
Create an input directory in HDFS:
bin/hadoop fs -mkdir <input-directory>
Copy input.txt from the local machine to the HDFS <input-directory>:
hadoop fs -put input.txt <input-directory>
Create the word-count jar, wordCount.jar (the Eclipse IDE can be used)
Run the jar using:
bin/hadoop jar wordCount.jar <MainClassName> <input-directory> <output-directory>
Input file (input.txt): "Hello World Hello Hadoop Bye World Goodbye Hadoop"
Output file:
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
Check the output using
hadoop fs -cat <output-directory-path>/<generated-output-file>
THANK YOU