Introduction to MapReduce - Geoinsyssoft
Uploaded by anandh-kumar
-
8/13/2019 Introduction to MapReduce -Geoinsyssoft
-
What is MapReduce?
A programming model (and its associated implementation)
For processing large data sets
Exploits a large set of commodity computers
Executes processes in a distributed manner
Offers a high degree of transparency
In other words: simple, and maybe suitable for your tasks!
-
Distributed Grep
Diagram: very big data -> split data (x4) -> grep on each split -> matches from each split -> cat -> all matches
-
Distributed Word Count
Diagram: very big data -> split data (x4) -> count on each split -> counts from each split -> merge -> merged count
-
Partitioning Function
-
Partitioning Function (2)
Default: hash(key) mod R
Guarantees:
Relatively well-balanced partitions
Ordering guarantee within each partition
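The default partitioning rule above can be sketched in plain Java. This is a minimal illustration, not Hadoop code: the class name PartitionSketch and the helpers partition and assign are hypothetical, and String.hashCode() stands in for hash(key). The sign bit is masked off because hashCode() may return a negative number.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the default partitioning rule: hash(key) mod R.
class PartitionSketch {
    // Returns the index (0 .. numReducers - 1) of the reducer that receives the key.
    static int partition(String key, int numReducers) {
        // Mask the sign bit so the result of the modulo is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Groups keys into numReducers buckets, one bucket per reducer.
    static List<List<String>> assign(List<String> keys, int numReducers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partition(key, numReducers)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("apple", "banana", "cherry", "apple");
        // Equal keys always land in the same bucket.
        System.out.println(assign(keys, 3));
    }
}
```

Because the bucket depends only on the key, every value for a given key reaches the same reducer, which is what lets a reducer see all values for that key.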
Distributed Sort
Map: emit(key, value)
Reduce (with R=1): emit(key, value)
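The sort works because of the ordering guarantee within a partition: with R=1 there is only one partition, so the single reducer sees every key in order. The sketch below simulates that in memory. It is an assumption-laden illustration, not Hadoop code: SortSketch and sort are hypothetical names, records are {key, value} string pairs, and a TreeMap stands in for the framework's sorted shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of the distributed sort with R = 1.
class SortSketch {
    // Each record is a two-element array: {key, value}.
    static List<String> sort(List<String[]> records) {
        // Map: emit(key, value) unchanged. With R = 1 every pair lands in the
        // same partition; the TreeMap keeps that partition ordered by key.
        TreeMap<String, List<String>> partition = new TreeMap<>();
        for (String[] record : records) {
            partition.computeIfAbsent(record[0], k -> new ArrayList<>()).add(record[1]);
        }
        // Reduce: emit(key, value) for each value, visiting keys in sorted order.
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : partition.entrySet()) {
            for (String value : entry.getValue()) {
                output.add(entry.getKey() + "\t" + value);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] {"pear", "3"},
                new String[] {"apple", "1"},
                new String[] {"fig", "2"});
        for (String line : sort(records)) {
            System.out.println(line);
        }
    }
}
```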
-
MapReduce
Distributed Grep
Map: if match(value, pattern) emit(value, 1)
Reduce: emit(key, sum(value*))
Distributed Word Count
Map: for all w in value do emit(w, 1)
Reduce: emit(key, sum(value*))
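Both jobs share the same reduce step and differ only in the map step. The following single-process sketch shows this without any Hadoop dependency. All names (MapReduceSketch, wordCountMap, grepMap, reduce) are hypothetical, and grouping pairs by key inside reduce is a stand-in for the shuffle that a real cluster performs between the two phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical in-memory versions of the grep and word count jobs.
class MapReduceSketch {
    // Word count map: for all w in value do emit(w, 1).
    static List<Map.Entry<String, Integer>> wordCountMap(String value) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : value.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // Grep map: if match(value, pattern) emit(value, 1).
    static List<Map.Entry<String, Integer>> grepMap(String value, Pattern pattern) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        if (pattern.matcher(value).find()) {
            out.add(new AbstractMap.SimpleEntry<>(value, 1));
        }
        return out;
    }

    // Shared reduce: emit(key, sum(value*)). Grouping by key simulates the shuffle.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(wordCountMap(line));
        }
        return reduce(pairs);
    }

    static Map<String, Integer> grep(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(grepMap(line, pattern));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not", "to be");
        System.out.println(wordCount(lines));
        System.out.println(grep(lines, "not"));
    }
}
```

Note that in the grep job the reduce output maps each matching line to the number of times that exact line occurred.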
-
MapReduce Transparencies
Plus Google Distributed File System:
Parallelization
Fault tolerance
Locality optimization
Load balancing
-
Suitable for your task if
You have a cluster
You are working with a large dataset
You are working with independent data (or data assumed to be independent)
The task can be cast into map and reduce
-
MapReduce outside Google
Hadoop (Java)
Emulates MapReduce and GFS
The architecture of Hadoop MapReduce and DFS is master/slave
            Master       Slave
MapReduce   jobtracker   tasktracker
DFS         namenode     datanode
-
Example Word Count (1)
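Map
// Written against the early, pre-generics org.apache.hadoop.mapred API.
// Imports needed at the top of the enclosing file:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public static class MapClass extends MapReduceBase implements Mapper {
  // Reused for every token so that no new object is allocated per emit.
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    String line = ((Text) value).toString();
    StringTokenizer itr = new StringTokenizer(line);
    // Emit (token, 1) for each whitespace-separated token in the line.
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}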
-
Example Word Count (3)
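Main
public static void main(String[] args) throws IOException {
  // Argument checking goes here.
  JobConf conf = new JobConf();
  // Types of the keys and values the job emits.
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(MapClass.class);
  // The reducer also serves as the combiner: it pre-sums counts on each map node.
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  // args[0] is the input path, args[1] is the output path.
  conf.setInputPath(new Path(args[0]));
  conf.setOutputPath(new Path(args[1]));
  JobClient.runJob(conf);
}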
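The job above registers the same Reduce class as both combiner and reducer. That is safe for word count because addition is associative and commutative, so summing per split first and then summing the partial sums gives the same totals. The sketch below checks this in memory. It is not Hadoop code, and CombinerSketch, map, reduce, withCombiner and withoutCombiner are hypothetical names introduced for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical check that a summing reducer can double as a combiner.
class CombinerSketch {
    // Map: emit (w, 1) for each word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : split.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // Reduce (also used as the combiner): sum the values for each key.
    static TreeMap<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    // Combine each split locally, then reduce the much smaller partial sums.
    static TreeMap<String, Integer> withCombiner(List<String> splits) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            shuffled.addAll(reduce(map(split)).entrySet());
        }
        return reduce(shuffled);
    }

    // Send every (w, 1) pair straight to the reducer.
    static TreeMap<String, Integer> withoutCombiner(List<String> splits) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            shuffled.addAll(map(split));
        }
        return reduce(shuffled);
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("a b a", "b a c");
        System.out.println(withCombiner(splits).equals(withoutCombiner(splits)));
    }
}
```

The benefit on a real cluster is less data sent over the network between the map and reduce phases; the result itself does not change.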
-
One time setup
set hadoop-site.xml and slaves
Initiate namenode
Run Hadoop MapReduce and DFS
Upload your data to DFS
Run your process
Download your data from DFS
-
Summary
A simple programming model for processing large datasets on large clusters of computers
Fun to use: focus on the problem, and let the library deal with the messy details
-
References
Original paper (http://labs.google.com/papers/mapreduce.html)
On Wikipedia (http://en.wikipedia.org/wiki/MapReduce)
Hadoop - MapReduce in Java (http://lucene.apache.org/hadoop/)
Starfish - MapReduce in Ruby (http://rufy.com/starfish/)