MapReduce: Simplified Data Processing on Large Clusters
Authors: Jeffrey Dean and Sanjay Ghemawat
Presenter: Guangdong Liu
Jan 28th, 2011
Presentation Outline
Motivation
Goal
Programming Model
Implementation
Refinement
Motivation
Large-scale data processing
Many data-intensive applications process huge amounts of data and produce large amounts of derived data
Certain common themes are shared when executing such applications
Hundreds or thousands of machines are used
Two categories of basic operation on the input data:
1) Map(): process a key/value pair to generate a set of intermediate key/value pairs
2) Reduce(): merge all intermediate values with the same key
Goal
MapReduce: an abstraction that lets users perform simple computations across large data sets distributed over large clusters of commodity PCs, while hiding the details of parallelization, data distribution, load balancing, and fault tolerance
User-defined functions
Automatic parallelization and distribution
Fault tolerance
I/O scheduling
Status monitoring
Programming Model
Inspired by Lisp primitives map and reduce
Map(key, val): written by the user
Processes a key/value pair to generate intermediate key/value pairs
The MapReduce library groups all intermediate values associated with the same key and passes them to the reduce function
Reduce(key, vals): also written by the user
Merges all intermediate values associated with the same key
Programming Model
Count words in docs
Input consists of (doc_url, doc_contents) pairs
Map(key=doc_url, val=doc_contents): for each word w in doc_contents, emit(w, “1”)
Reduce(key=word, values=counts_list): sum all “1”s in the value list and emit the result (word, sum)
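The word-count functions above can be sketched in Python. This is a minimal single-process illustration, not the MapReduce library itself: `run_word_count` is a hypothetical stand-in for the grouping the library performs, and the regular-expression tokenizer is an assumption about what counts as a word.

```python
import re
from collections import defaultdict


def map_fn(doc_url, doc_contents):
    """Emit (word, "1") for every word in the document."""
    # Assumption: a word is a run of ASCII letters; punctuation is dropped.
    for word in re.findall(r"[A-Za-z]+", doc_contents):
        yield word, "1"


def reduce_fn(word, counts):
    """Sum the "1" strings emitted for one word."""
    return word, sum(int(c) for c in counts)


def run_word_count(docs):
    """Hypothetical driver: map every document, group by key, then reduce."""
    grouped = defaultdict(list)
    for url, contents in docs:
        for key, value in map_fn(url, contents):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


docs = [
    ("doc1", "Hello World, Bye World!"),
    ("doc2", "Hello MapReduce, Goodbye to MapReduce."),
    ("doc3", "Welcome to UNL, Goodbye to UNL."),
]
counts = run_word_count(docs)
print(counts["Hello"], counts["to"], counts["World"])  # 2 3 2
```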
Programming Model
Example: word count with three map tasks (M1 to M3) and two reduce tasks (R1, R2)
Input (read from DFS)
M1: Hello World, Bye World!
M2: Hello MapReduce, Goodbye to MapReduce.
M3: Welcome to UNL, Goodbye to UNL.
Map Phase: Intermediate Result (partition for R1 | partition for R2)
M1: (Hello, 1) (Bye, 1) | (World, 1) (World, 1)
M2: (Hello, 1) (to, 1) | (Goodbye, 1) (MapReduce, 1) (MapReduce, 1)
M3: (Welcome, 1) (to, 1) (to, 1) | (Goodbye, 1) (UNL, 1) (UNL, 1)
Reduce Phase: output (written to DFS)
R1: (Hello, 2) (Bye, 1) (Welcome, 1) (to, 3)
R2: (World, 2) (UNL, 2) (Goodbye, 2) (MapReduce, 2)
Implementation
User to-do list
Indicate input and output files
M: number of map tasks
R: number of reduce tasks
W: number of machines
Write map and reduce functions
Submit jobs
This requires no knowledge of parallel/distributed systems!!!
Implementation
[Figure: execution overview. The Master assigns map tasks and reduce tasks. In the Map Phase, mappers M1 … Mn read input blocks B1 … Bn from DFS and write their intermediate results, partitioned into P1 … Pr, to local disk. In the Reduce Phase, reducers R1 … Rr read the partitions remotely and write Output 1 … Output r to DFS.]
Implementation
1. Input files split (M splits)
Each block is typically 16 to 64 MB
Start up many copies of the user program on a cluster of machines
2. Master & Workers
One special instance becomes the master
Workers are assigned tasks by the master
There are M map tasks and R reduce tasks to assign
Master finds idle workers and assigns map or reduce tasks to them
Implementation
3. Map tasks
Map workers read the contents of the corresponding input partition
Perform the user-defined map computation to create intermediate <key,value> pairs
The intermediate <key,value> pairs produced by the map function are buffered in memory
4. Writing intermediate data to disk (R regions)
Buffered output pairs are written to local disk periodically
Partitioned into R regions by a partitioning function
Locations of these buffered pairs on the local disk are passed back to the master
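Step 4 can be illustrated with a small partitioning sketch. The MapReduce paper's default partitioning function is hash(key) mod R; the use of `zlib.crc32` as the hash and the helper names `partition` and `partition_buffer` are assumptions made here so the example is deterministic and runnable.

```python
import zlib


def partition(key, num_reduce_tasks):
    """Map an intermediate key to one of R regions: hash(key) mod R."""
    # Assumption: CRC32 stands in for the hash so results are stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def partition_buffer(pairs, num_reduce_tasks):
    """Split buffered (key, value) pairs into R regions, one per reduce task."""
    regions = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        regions[partition(key, num_reduce_tasks)].append((key, value))
    return regions


pairs = [("Hello", "1"), ("World", "1"), ("Bye", "1"), ("World", "1")]
regions = partition_buffer(pairs, num_reduce_tasks=2)
for index, region in enumerate(regions):
    print(index, region)
```

Because the region depends only on the key, every pair with the same key ends up at the same reduce task, whichever map worker produced it.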
Implementation
5. Read & sort
Reduce workers use remote procedure calls to read the buffered data from the local disks of map workers
Sort intermediate data by the intermediate keys
6. Reduce tasks
Reduce worker iterates over the sorted intermediate data
For each unique key encountered, the key and its values are passed to the user's reduce function
Output of the user's reduce function is written to an output file on a global file system
7. When all tasks have completed, the master wakes up the user program
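Steps 5 and 6 amount to sort, group by key, and reduce. The sketch below shows that sequence in one process; `run_reduce_task` and `sum_counts` are hypothetical names, and the remote reads are replaced by an in-memory list.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce_task(intermediate_pairs, reduce_fn):
    """Sort pairs by key, then call reduce_fn once per unique key."""
    ordered = sorted(intermediate_pairs, key=itemgetter(0))
    output = []
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]
        output.append(reduce_fn(key, values))
    return output


def sum_counts(key, values):
    """Example user reduce function: add up the counts for one key."""
    return key, sum(values)


pairs = [("to", 1), ("Hello", 1), ("to", 1), ("Hello", 1), ("to", 1), ("Bye", 1)]
result = run_reduce_task(pairs, sum_counts)
print(result)  # [('Bye', 1), ('Hello', 2), ('to', 3)]
```

Sorting first is what lets `groupby` see all values for a key as one consecutive run.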
Implementation
Fault tolerance: in a word, redo
Workers are periodically pinged by the master
No response = failed worker
Reschedule failed tasks
Note: map tasks completed by the failed worker need to be re-executed because their output is stored on that worker's local disk
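The rescheduling rule can be sketched as master-side bookkeeping. Everything here is illustrative: the `Task` and `Master` classes, the timeout value, and the explicit `now` argument are assumptions, not the real implementation. Completed reduce tasks are left alone because their output already sits in the global file system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in_progress", or "completed"
    worker: Optional[str] = None  # worker the task was assigned to, if any


@dataclass
class Master:
    timeout: float
    tasks: list = field(default_factory=list)
    last_reply: dict = field(default_factory=dict)  # worker -> time of last ping reply

    def record_reply(self, worker, now):
        """Remember when a worker last answered a ping."""
        self.last_reply[worker] = now

    def check_workers(self, now):
        """Mark silent workers as failed and reset the tasks that must be redone."""
        failed = [w for w, t in self.last_reply.items() if now - t > self.timeout]
        for worker in failed:
            del self.last_reply[worker]
            for task in self.tasks:
                if task.worker != worker:
                    continue
                # Completed map output lived on the failed worker's local disk.
                lost_output = task.kind == "map" and task.state == "completed"
                if task.state == "in_progress" or lost_output:
                    task.state, task.worker = "idle", None
        return failed


master = Master(timeout=10.0, tasks=[
    Task("map", "completed", "w1"),
    Task("reduce", "completed", "w1"),
    Task("map", "in_progress", "w2"),
])
master.record_reply("w1", now=0.0)
master.record_reply("w2", now=8.0)
failed = master.check_workers(now=12.0)
print(failed, [t.state for t in master.tasks])
```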
Implementation
Locality
Input data is managed by GFS and has several replicas
Schedule a task on a machine containing a local replica or near a replica
Task Granularity
M map tasks and R reduce tasks
Make M and R much larger than the number of worker machines
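The locality preference can be sketched as a three-level choice. This is only an illustration: `pick_worker` is a hypothetical helper, and treating "near a replica" as "in the same rack" (via the assumed `rack_of` mapping) is an assumption about the network topology.

```python
def pick_worker(replica_hosts, idle_workers, rack_of):
    """Prefer an idle worker holding a replica, then one in the same rack, then any."""
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker
    replica_racks = {rack_of[host] for host in replica_hosts}
    for worker in idle_workers:
        if rack_of[worker] in replica_racks:
            return worker
    return idle_workers[0] if idle_workers else None


# Hypothetical cluster: four machines in three racks, one replica on machine "a".
rack_of = {"a": "r1", "b": "r1", "c": "r2", "d": "r3"}
local_choice = pick_worker({"a"}, ["d", "a"], rack_of)
nearby_choice = pick_worker({"a"}, ["c", "b"], rack_of)
fallback_choice = pick_worker({"a"}, ["d"], rack_of)
print(local_choice, nearby_choice, fallback_choice)  # a b d
```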
Implementation
Backup tasks
Straggler: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation
Cause: bad disk, competition for CPU, …
Resolution: schedule backup executions of in-progress tasks when a MapReduce operation is close to completion
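A toy model shows why backup executions help. A task is done as soon as either its primary or its backup execution finishes, so a single straggler no longer sets the job's finish time. The function name and all timings below are made-up illustrative values, not measurements.

```python
def job_finish_time(primary_finish, backup_start=None, backup_duration=0.0):
    """Return when the last task finishes.

    primary_finish: finish time of each task's primary execution.
    backup_start: time at which backups are launched for tasks still in
        progress (None disables backups).
    backup_duration: how long a backup execution takes on a healthy machine.
    """
    finish_times = []
    for primary in primary_finish:
        if backup_start is not None and primary > backup_start:
            # The task completes when either execution finishes first.
            primary = min(primary, backup_start + backup_duration)
        finish_times.append(primary)
    return max(finish_times)


# Hypothetical job: three normal tasks and one straggler.
primary_finish = [50.0, 52.0, 55.0, 300.0]
without_backup = job_finish_time(primary_finish)
with_backup = job_finish_time(primary_finish, backup_start=60.0, backup_duration=50.0)
print(without_backup, with_backup)  # 300.0 110.0
```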
Source
The example is quoted from: Wei Wei, Juan Du, Ting Yu, and Xiaohui Gu, "SecureMR: A Service Integrity Assurance Framework for MapReduce," Annual Computer Security Applications Conference (ACSAC '09), pp. 73-82, 7-11 Dec. 2009
Making Cluster Applications Energy-Aware
Authors: Nedeljko Vasic, Martin Barisits and Vincent Salzgeber
Jan 28th, 2011
Outline
Introduction
Case Study
Approach
Introduction
Power consumption
A critical issue in large-scale clusters
Data centers consume as much energy as a city
7.4 billion dollars per year
Current techniques for efficiency
Consolidate workload onto fewer machines
Minimize energy consumption while keeping the same overall performance level
Problems
Cannot operate at multiple power levels
Cannot deal with energy consumption limits
Case Study
Google’s Server Utilization and Energy Consumption
Case Study
Hadoop Distributed File System (HDFS)
Case Study
MapReduce
Case Study
Conclusion
Aggregating load onto fewer machines is a sound way to save energy
Distributed applications must actively participate in power management to avoid poor performance
Approach
On the Energy (In)efficiency of Hadoop Clusters
Authors: Jacob Leverich, Christos Kozyrakis
Jan 28th, 2011
Introduction
Improving the energy efficiency of a cluster
Place some nodes into low-power standby modes
Avoid wasting energy on oversized components in each node
Problems
Approach
Hadoop data layout overview
Distribute replicas across different nodes to improve performance and reliability
The user specifies a block replication factor n to ensure n identical copies of any data-block are stored across a cluster (typically n=3)
The largest number of nodes that can be disabled without impacting data availability is n-1
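The n-1 bound can be checked by brute force on a small layout. The block placement below is a made-up example with n=3, and `all_blocks_available` is a hypothetical helper. The check confirms that disabling any n-1 nodes leaves every block readable, while some choice of n nodes loses a block.

```python
from itertools import combinations


def all_blocks_available(placement, disabled):
    """True if every block still has at least one replica on an enabled node."""
    disabled = set(disabled)
    return all(bool(set(nodes) - disabled) for nodes in placement.values())


# Hypothetical layout: 3 blocks, replication factor n = 3, 4 nodes.
placement = {
    "b1": ["n1", "n2", "n3"],
    "b2": ["n2", "n3", "n4"],
    "b3": ["n1", "n3", "n4"],
}
nodes = ["n1", "n2", "n3", "n4"]
n = 3

# Any n-1 nodes can be disabled without losing a block.
safe_with_n_minus_1 = all(
    all_blocks_available(placement, group) for group in combinations(nodes, n - 1)
)
# Some set of n nodes takes a block offline.
unsafe_with_n = any(
    not all_blocks_available(placement, group) for group in combinations(nodes, n)
)
print(safe_with_n_minus_1, unsafe_with_n)  # True True
```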
Approach
Covering subset
At least one replica of each data-block must be stored in a subset of nodes called the covering subset
This ensures that a large number of nodes can be gracefully removed from a cluster without affecting the availability of data or interrupting the normal operation of the cluster
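The covering-subset invariant can be expressed as a simple check. The layout, node names, and helper functions below are hypothetical. The sketch only verifies the invariant for a given placement and does not show how Hadoop would place the replicas.

```python
def is_covering_subset(placement, subset):
    """True if every block has at least one replica on a node in the subset."""
    subset = set(subset)
    return all(bool(subset & set(nodes)) for nodes in placement.values())


def nodes_safe_to_disable(placement, all_nodes, covering):
    """Nodes outside a valid covering subset can be powered down together."""
    if not is_covering_subset(placement, covering):
        raise ValueError("subset does not hold a replica of every block")
    return sorted(set(all_nodes) - set(covering))


# Hypothetical layout: 4 blocks, replication factor 3, 6 nodes.
placement = {
    "b1": ["n1", "n4", "n5"],
    "b2": ["n2", "n5", "n6"],
    "b3": ["n1", "n3", "n6"],
    "b4": ["n2", "n3", "n4"],
}
all_nodes = ["n1", "n2", "n3", "n4", "n5", "n6"]
can_disable = nodes_safe_to_disable(placement, all_nodes, covering=["n1", "n2"])
print(can_disable)  # ['n3', 'n4', 'n5', 'n6']
```

In this layout four nodes can be disabled at once, more than the two that plain three-way replication would allow.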