Presented by: SailiGhavat ShikhaSoni A STUDY ON...
A STUDY ON MAPREDUCE
Presented by: Saili Ghavat, Shikha Soni
What is common between all of them?
MapReduce: Introduction
Framework for parallel computing
Prominent parallel data processing tool
Uses clustered resources
Offers a good level of abstraction: programmers do not have to deal with parallelization, load balancing, or fault tolerance
Used for data analysis
Built on Map and Reduce functions
Map: emits key-value pairs
Reduce: runs once for each unique key and combines that key's values
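As a minimal sketch of the two functions, here is a word count written in plain Python. The names map_fn, reduce_fn, and run_job are illustrative assumptions, and the grouping that a real framework performs across machines is simulated here in a single process.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit one (word, 1) key-value pair per word in the input line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(key, values):
    """Reduce: called once per unique key; combines that key's values."""
    return key, sum(values)

def run_job(lines):
    # Group intermediate values by key. A real framework does this step
    # between the map and reduce phases; here it is a local dictionary.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

counts = run_job(["the quick fox", "the lazy dog", "The end"])
```

The programmer writes only map_fn and reduce_fn; everything inside run_job is what the framework is meant to hide.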
Step 1: Splitting the Input
The large input data set is first divided into many smaller portions (splits). There are at least as many splits as worker machines, so every worker has something to work on.
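The splitting step can be sketched as follows, for the simplest case of one split per worker. The function name split_input and the in-memory list of records are assumptions for illustration; a real system splits files on a distributed file system.

```python
def split_input(records, num_workers):
    """Divide records into num_workers contiguous splits of near-equal size."""
    base, extra = divmod(len(records), num_workers)
    splits, start = [], 0
    for i in range(num_workers):
        # The first `extra` splits take one additional record each.
        size = base + (1 if i < extra else 0)
        splits.append(records[start:start + size])
        start += size
    return splits

splits = split_input(list(range(10)), 4)
```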
Step 2: Master and worker coordination
Diagram: one master coordinating four workers.
Step 3: Mapping by each worker
Each worker now generates key-value pairs from the data assigned to it. The map function discards the irrelevant data and passes on only the key-value pairs for the data that is to be filtered or sorted. Because the workers run in parallel, performance scales roughly linearly with the number of workers.
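A rough sketch of the mapping step, assuming threads stand in for worker machines and a hypothetical map_split function that keeps only log lines containing the word "error":

```python
from concurrent.futures import ThreadPoolExecutor

def map_split(split):
    """Map task for one split: drop irrelevant lines, emit (key, value) pairs."""
    return [("error", line) for line in split if "error" in line]

# Three splits, one per simulated worker.
splits = [
    ["ok start", "error disk full"],
    ["ok running"],
    ["error timeout", "ok done"],
]

# Each split is mapped by its own worker; pool.map keeps the input order.
with ThreadPoolExecutor(max_workers=len(splits)) as pool:
    per_worker_output = list(pool.map(map_split, splits))
```

Threads only illustrate the structure here; the speedup described above comes from running the map tasks on separate machines.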
Step 4: Partitioning within the workers
Diagram: a map worker writing its output into Partition 1, Partition 2, and Partition 3.
Partitioning segregates the data further. The partition function is simply the hash of the key modulo the number of reduce workers, so every pair with the same key lands in the same partition. This makes the reduce stage easier.
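A small sketch of such a partition function. It assumes zlib.crc32 as the hash (Python's built-in hash() for strings changes between runs) and uses illustrative names partition and partition_output.

```python
import zlib

def partition(key, num_reducers):
    """Return the partition index for a key: hash(key) mod R."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_output(pairs, num_reducers):
    """Split one map worker's (key, value) pairs into R partitions."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    return partitions

pairs = [("apple", 1), ("pear", 1), ("apple", 1), ("fig", 1)]
parts = partition_output(pairs, 3)
```

Because the index depends only on the key, both "apple" pairs end up in the same partition regardless of which map worker produced them.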
Step 5: Reduce sort
Once the map workers have finished, the reduce workers are notified to start working on the data the map workers produced. Each reduce worker contacts every map worker via remote procedure calls to fetch the (key, value) data targeted for its partition. This data is then sorted by key, so at the end of this step all data with the same key is grouped together.
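The fetch-and-sort behaviour of one reduce worker can be sketched like this. The remote procedure calls are replaced by plain list lookups, and the layout of map_outputs (one list of partitions per map worker) is an assumption for illustration.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(map_outputs, partition_id):
    """Collect one partition from every map worker, sort by key, group equal keys."""
    fetched = []
    for worker_partitions in map_outputs:
        # Stand-in for the RPC that fetches this reducer's partition.
        fetched.extend(worker_partitions[partition_id])
    fetched.sort(key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(fetched, key=itemgetter(0))
    ]

# Two map workers, each holding two partitions.
map_outputs = [
    [[("b", 1), ("a", 1)], [("z", 1)]],
    [[("a", 1)], []],
]
grouped = shuffle_and_sort(map_outputs, 0)
```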
Step 6: Final reducing step
The reduce function is applied to each key and its grouped values, and its output is the result the MapReduce job was run for. This can be finding a word in a large file, studying web search logs, or any other kind of data processing.
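As a hedged illustration of this last step, the sketch below reduces already-grouped search-log data into a count per query. The grouped input and the names reduce_fn and run_reduce are hypothetical.

```python
def reduce_fn(query, counts):
    """Reduce: combine all values recorded for one search query."""
    return query, sum(counts)

def run_reduce(grouped):
    """Apply the reduce function once per unique key."""
    return [reduce_fn(key, values) for key, values in grouped]

# Output of the sort step: each key appears once with all of its values.
grouped = [("hadoop", [1]), ("mapreduce", [1, 1, 1])]
result = run_reduce(grouped)
```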
These functions run on distributed file systems such as GFS and AFS.
Diagram: intermediate files from the map workers being read by the reduce worker.
Understanding Map Reduce better
Lisp Map & Reduce
The map function takes a function and a set of values as parameters.
The function is then applied to each value in the data.
(map 'length '(() (a) (ab) (abc)))
The call above applies length to each of the values and returns their lengths: (0 1 1 1), since ab and abc are each a single symbol.
The reduce function is given a binary function and a set of values, and uses the binary function to combine all the values into a single result.
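For readers more familiar with Python, roughly the same idea can be sketched with the built-in map and functools.reduce. Python lists of strings stand in for the Lisp lists of symbols, which is an approximation rather than an exact translation.

```python
from functools import reduce

# Counterpart of '(() (a) (ab) (abc)): each inner list holds zero or one item.
values = [[], ["a"], ["ab"], ["abc"]]

# map applies len to every value.
lengths = list(map(len, values))

# reduce combines the mapped values with a binary function (here, addition).
total = reduce(lambda x, y: x + y, lengths, 0)
```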
Google’s Observation
Keyword search
MAP: the basic work of the search engine, finding the keyword and returning the URL of each page that contains it
REDUCE: combines all the resulting URLs that contain the keyword and returns them
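A toy sketch of keyword search in this style. The pages dictionary, the example.com URLs, and the whole-word matching rule are assumptions made for illustration; they are not how a production search engine works.

```python
def map_fn(url, text, keyword):
    """Map: emit (keyword, url) if the page text contains the keyword."""
    if keyword in text.lower().split():
        return [(keyword, url)]
    return []

def reduce_fn(keyword, urls):
    """Reduce: combine all URLs that matched the keyword."""
    return keyword, sorted(set(urls))

pages = {
    "http://example.com/a": "MapReduce is a framework",
    "http://example.com/b": "Unrelated page",
    "http://example.com/c": "Hadoop implements mapreduce",
}

intermediate = []
for url, text in pages.items():
    intermediate.extend(map_fn(url, text, "mapreduce"))

keyword, urls = reduce_fn("mapreduce", [u for _, u in intermediate])
```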
Google’s need for MapReduce
Largest search engine, with huge volumes of client request logs and web resources
Large derived data
Computations are simple, but the amount of data to process is large
Processing has to be distributed among machines
Google’s new abstraction allows simple computations to be expressed while hiding the messy details of parallelism
Hadoop: Open Source MapReduce
Google holds a patent on MapReduce
Google's implementation is not open source
Hadoop was created as an open-source implementation
Basis of Hadoop and MapReduce
http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf
Scheduling in Hadoop
Two types: Fair Scheduling and Capacity Scheduling
Fair Scheduling: works with a single queue of jobs
Each job gets an equal share of the physical resources
A single MR job running alone occupies the whole cluster
Capacity Scheduling: a more sophisticated type of scheduling
Multiple queues can work alongside one another
Each job in each queue is guaranteed to get cluster resources
MRShare: framework for sharing work across multiple query executions in MapReduce
Finds an optimal way of grouping a set of queries using dynamic programming
Transforms a batch of queries into a new batch that will be executed more efficiently
Does so by merging jobs into groups and evaluating each group as a single query
Suppose |D| is the size of the input data shared by n MR jobs.
The complexity of sorting the combined map output of all jobs is then O(n ∙ |D| log(n ∙ |D|)).
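Expanding the logarithm makes this bound easier to read. The baseline on the second line is an assumption added for comparison, not something stated on the slide: it supposes each of the n jobs sorts its own map output of size at most |D| separately.

```latex
\begin{align*}
O\big(n\,|D|\log(n\,|D|)\big) &= O\big(n\,|D|\log|D| \;+\; n\,|D|\log n\big) \\
\text{$n$ separate sorts (assumed baseline):}\quad & O\big(n\,|D|\log|D|\big)
\end{align*}
```

Under that assumption the combined sort adds a term of order n ∙ |D| log n. For example, with n = 8 and |D| = 2^20, the base-2 logarithm grows from 20 to 23.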
Some problems where MapReduce has been used
Distributed grep (search for words)
1. Map: emit a line if it matches a given pattern
2. Reduce: just copy the intermediate data to the output
Count URL access frequency
1. Map: process logs of web page access; output (URL, 1)
2. Reduce: add all values for the same URL
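Both problems can be sketched with one tiny single-process driver. The driver run_job, the "error" pattern, the two-field log format, and the assumption that the URL-counting map emits a (URL, 1) pair are all illustrative choices, not details from the slides.

```python
import re
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Tiny single-process driver: map, group by key, then reduce in key order."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(grouped.items())]

# Distributed grep: map emits matching lines, reduce copies them through.
def grep_map(line, pattern=r"error"):
    return [(line, None)] if re.search(pattern, line) else []

def grep_reduce(line, _values):
    return line

# URL access frequency: map emits (URL, 1), reduce sums the values per URL.
def url_map(log_line):
    url = log_line.split()[1]  # assumed log format: "<client> <url>"
    return [(url, 1)]

def url_reduce(url, counts):
    return url, sum(counts)

matches = run_job(["ok", "error: disk", "error: net"], grep_map, grep_reduce)
hits = run_job(["c1 /home", "c2 /about", "c3 /home"], url_map, url_reduce)
```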
Debates
Called a major step backwards in parallel processing compared to DBMS
Hadoop is scalable but achieves very low efficiency
Hadoop stands out in the “Gray Sort Benchmark test” for 100 TB sorting
Not a cheap solution
Maintaining a cluster is costly and difficult
Increases fault tolerance
Advantages
Simple and easy to use
Flexible
Independent of the storage
Fault tolerant
High scalability
Limitations of MapReduce
Low efficiency
Cannot be used when computations depend on previously calculated values
Can handle large data sets, but constrains a program's ability to work with smaller data items
Software in place of Hadoop and MapReduce
DISCO
Greenplum
Aster Data
Companies that have adopted similar algorithms
A9.com
AOL
Facebook
The New York Times
Last.fm
Baidu.com
Joost
Veoh
References
http://www.cs.arizona.edu/~bkmoon/papers/sigmodrec11.pdf
http://csci89802.blogspot.com/2012/10/limitations-of-mapreduce-where-not-to.html
http://www.dbms2.com/2008/01/18/the-great-mapreduce-debate/
http://www.cse.buffalo.edu/~stevko/courses/cse704/fall10/papers/cse704-mapreduce.pdf
http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf
http://blog.cloudera.com/blog/2014/03/the-truth-about-mapreduce-performance-on-ssds/
https://www.cs.rutgers.edu/~pxk/417/notes/content/mapreduce.html