SLIDES CREATED BY: SHRIDEEP PALLICKARA L10.1
CS555: Distributed Systems [Fall 2019]Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS
[MAPREDUCE]
Shrideep Pallickara
Computer Science
Colorado State University
September 26, 2019 L10.1
Frequently asked questions from the previous class survey
Topics covered in this lecture
¨ MapReduce
MAPREDUCE
MapReduce: Topics that we will cover
¨ Why?
¨ What it is and what it is not?
¨ The core framework and original Google paper
¨ Development of simple programs using Hadoop
  ¤ The dominant MapReduce implementation
MapReduce
¨ It’s a framework for processing data residing on a large number of computers
¨ Very powerful framework
  ¤ Excellent for some problems
  ¤ Challenging or not applicable in other classes of problems
What is MapReduce?
¨ More a framework than a tool
¨ You are required to fit (some folks shoehorn it) your solution into the MapReduce framework
¨ MapReduce is not a feature, but rather a constraint
What does this constraint mean?
¨ It makes problem solving easier and harder
¨ Clear boundaries for what you can and cannot do
  ¤ You actually need to consider fewer options than what you are used to
¨ But solving problems with constraints requires planning and a change in your thinking
But what does this get us?
¨ Tradeoff of being confined to the MapReduce framework?
  ¤ Ability to process data on a large number of computers
  ¤ But, more importantly, without having to worry about concurrency, scale, fault tolerance, and robustness
A challenge in writing MapReduce programs
¨ Design!
  ¤ Good programmers can produce bad software due to poor design
  ¤ Good programmers can produce bad MapReduce algorithms
¨ Only in this case your mistakes will be amplified
  ¤ Your job may be distributed on 100s or 1000s of machines and operating on a petabyte of data
MapReduce: Origins of the design
¨ Process crawled data and logs of web requests
¨ Several computations work on this raw data to compute derived data
  ¤ Inverted indices
  ¤ Representation of graph structure of web documents
  ¤ Pages crawled per host
  ¤ Most frequent queries in a day …
Most computations are conceptually straightforward
¨ But data is large
¨ Computations must be scalable
  ¤ Distributed across thousands of machines
  ¤ To complete in a reasonable amount of time
Complexity of managing distributed computations can …
¨ Obscure simplicity of original computation
¨ Contributing factors:
① How to parallelize computation
② Distribute the data
③ Handle failures
MapReduce was developed to cope with this complexity
¨ Express simple computations
¨ Hide messy details of
  ¤ Parallelization
  ¤ Data distribution
  ¤ Fault tolerance
  ¤ Load balancing
MapReduce
¨ Programming model
¨ Associated implementation for
  ¤ Processing and generating large data sets
Programming model
¨ Computation takes a set of input key/value pairs
¨ Produces a set of output key/value pairs
¨ Express the computation as two functions:
  ¤ Map
  ¤ Reduce
Map
¨ Takes an input pair
¨ Produces a set of intermediate key/value pairs
MapReduce library
¨ Groups all intermediate values with the same intermediate key
¨ Passes them to the Reduce function
Reduce function
¨ Accepts an intermediate key I and
  ¤ The set of values for that key
¨ Merges these values together to get
  ¤ A smaller set of values
Counting the number of occurrences of each word in a large collection of documents

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
Counting the number of occurrences of each word in a large collection of documents

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Sums together all counts emitted for a particular word
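
The word-count pseudocode can be tried out on a single machine. The sketch below is only an illustration: `run_word_count` is a hypothetical stand-in for the MapReduce library's grouping step, and the two sample documents are made up.

```python
from collections import defaultdict

def map_fn(key, value):
    """key: document name; value: document contents."""
    for word in value.split():
        yield word, "1"

def reduce_fn(key, values):
    """key: a word; values: an iterable of counts encoded as strings."""
    return str(sum(int(v) for v in values))

def run_word_count(documents):
    # Hypothetical stand-in for the library: group intermediate values by key.
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            grouped[word].append(count)
    return {word: reduce_fn(word, counts) for word, counts in grouped.items()}

docs = {"a.txt": "the cat sat", "b.txt": "the dog sat down"}
word_counts = run_word_count(docs)
```

Counts stay as strings here only to mirror the pseudocode, where user code converts between strings and the types it needs.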
MapReduce specification object contains
¨ Names of
  ¤ Input
  ¤ Output
¨ Tuning parameters
Map and reduce functions have associated types drawn from different domains
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)
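
One way to see that the input domain (k1, v1) can differ from the intermediate domain (k2, v2) is to write the signatures with type variables. This is a single-process sketch; `map_reduce`, `length_mapper`, and `sum_reducer` are names invented for the illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")

def map_reduce(
    inputs: Iterable[Tuple[K1, V1]],
    mapper: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reducer: Callable[[K2, List[V2]], List[V2]],
) -> Dict[K2, List[V2]]:
    # map(k1, v1) -> list(k2, v2), then group the v2 values by k2
    groups: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    # reduce(k2, list(v2)) -> list(v2)
    return {k2: reducer(k2, v2s) for k2, v2s in groups.items()}

# Input domain: (line number, text). Intermediate domain: (word length, count).
def length_mapper(line_no: int, text: str) -> Iterable[Tuple[int, int]]:
    for word in text.split():
        yield len(word), 1

def sum_reducer(length: int, ones: List[int]) -> List[int]:
    return [sum(ones)]

histogram = map_reduce([(1, "to be or not"), (2, "to be")], length_mapper, sum_reducer)
```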
What’s passed to-and-from user-defined functions
¨ Strings
¨ User code converts between
  ¤ String
  ¤ Appropriate types
Programs expressed as MapReduce computations: Distributed Grep
¨ Map
  ¤ Emit line if it matches specified pattern
¨ Reduce
  ¤ Just copy intermediate data to the output
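
A minimal single-machine sketch of distributed grep follows. Using the file name as the intermediate key is an assumption made for this example (the slide does not say what the key is), and `distributed_grep` is a hypothetical driver standing in for the library.

```python
import re
from collections import defaultdict

def grep_map(filename, contents, pattern):
    # Emit a line only if it matches the pattern.
    for line in contents.splitlines():
        if re.search(pattern, line):
            yield filename, line

def grep_reduce(key, values):
    # Identity reduce: copy the intermediate data to the output.
    return list(values)

def distributed_grep(files, pattern):
    grouped = defaultdict(list)
    for filename, contents in files.items():
        for key, line in grep_map(filename, contents, pattern):
            grouped[key].append(line)
    return {key: grep_reduce(key, lines) for key, lines in grouped.items()}

logs = {
    "web1.log": "GET /index 200\nGET /missing 404\n",
    "web2.log": "POST /login 200\nGET /gone 404\n",
}
matches = distributed_grep(logs, r" 404$")
```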
Term-Vector per Host
¨ A term vector summarizes the important terms that occur in a set of documents as <word, frequency> pairs
¨ Map
  ¤ Emit <hostname, term vector> for each input document
¨ Reduce function
  ¤ Has all per-document vectors for a given host
  ¤ Add term vectors; discard infrequent terms
  ¤ Emit the final <hostname, term vector>
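
The term-vector computation could look roughly like this on one machine. The `min_count` threshold for "infrequent" and the whitespace tokenization are assumptions for the example, as are the sample URLs.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def tv_map(url, text):
    # One <hostname, term vector> pair per input document.
    host = urlparse(url).hostname
    yield host, Counter(text.lower().split())

def tv_reduce(host, vectors, min_count=2):
    # Add the per-document vectors, then drop infrequent terms.
    total = Counter()
    for vector in vectors:
        total += vector
    return {term: n for term, n in total.items() if n >= min_count}

def term_vectors_per_host(pages, min_count=2):
    grouped = defaultdict(list)
    for url, text in pages.items():
        for host, vector in tv_map(url, text):
            grouped[host].append(vector)
    return {host: tv_reduce(host, vs, min_count) for host, vs in grouped.items()}

pages = {
    "http://example.org/a": "map reduce map",
    "http://example.org/b": "reduce shuffle",
    "http://example.com/c": "grep grep sort",
}
result = term_vectors_per_host(pages)
```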
IMPLEMENTATION OF THE RUNTIME
Implementation
¨ Machines are commodity machines
¨ GFS is used to manage the data stored on the disks
Execution Overview – Part I
¨ Maps distributed across multiple machines
¨ Automatic partitioning of data into M splits
¨ Splits processed concurrently on different machines
Execution Overview – Part II
¨ Partition intermediate key space into R pieces
¨ E.g. hash(key) mod R
¨ User specified parameters
  ¤ Partitioning function
  ¤ Number of partitions (R)
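
A partitioning function of the form hash(key) mod R might be sketched as below. The choice of CRC32 is an assumption for the example: Python's built-in `hash` for strings varies between processes, so a stable hash is used to make every worker agree on the partition for a given key.

```python
import zlib

def partition(key: str, R: int) -> int:
    """Map an intermediate key to one of R reduce partitions."""
    return zlib.crc32(key.encode("utf-8")) % R

R = 4
keys = ["apple", "banana", "cherry", "apple"]
assignments = [partition(k, R) for k in keys]
```

Identical keys always land in the same partition, which is what lets a single reduce task see every value for a key.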
Execution Overview
[Figure: execution overview diagram showing input splits 0 to 4, the User Program, the Master, five Workers, and output files 0 and 1]
Execution Overview: Step I
The MapReduce library
¨ Splits input files into M pieces
  ¤ 16-64 MB per piece
¨ Starts up copies of the program on a cluster of machines
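
To get a feel for how large M becomes, the number of splits can be computed from the input size and the split size. The 1 TiB input below is an arbitrary example value, and `num_splits` is a helper invented for the illustration.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4

def num_splits(total_bytes: int, split_bytes: int) -> int:
    """Number of input splits (M), rounding up for a partial last split."""
    return (total_bytes + split_bytes - 1) // split_bytes

m_with_64mb = num_splits(1 * TIB, 64 * MIB)   # 2**40 / 2**26 = 16384
m_with_16mb = num_splits(1 * TIB, 16 * MIB)   # 2**40 / 2**24 = 65536
```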
Execution Overview: Step II
Program copies
¨ One of the copies is a Master
¨ There are M map tasks and R reduce tasks to assign
¨ Master
  ¤ Picks idle workers
  ¤ Assigns each worker a map or reduce task
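
The assignment step can be pictured as pairing a queue of pending tasks with a queue of idle workers. This is a deliberately simplified sketch: `assign_tasks` is hypothetical, hands out map tasks before reduce tasks, and ignores locality and failures.

```python
from collections import deque

def assign_tasks(M, R, idle_workers):
    """Give each idle worker one pending task; return assignments and leftovers."""
    pending = deque([("map", i) for i in range(M)] + [("reduce", j) for j in range(R)])
    idle = deque(idle_workers)
    assignments = {}
    while pending and idle:
        worker = idle.popleft()
        assignments[worker] = pending.popleft()
    return assignments, list(pending)

assignments, still_pending = assign_tasks(M=3, R=2, idle_workers=["w0", "w1", "w2", "w3"])
```

Tasks left in `still_pending` would be handed out as workers finish and become idle again.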
Execution Overview: Step III
Workers that are assigned a map task
¨ Read contents of their input split
¨ Parse <key, value> pairs out of input data
¨ Pass each pair to user-defined Map function
¨ Intermediate <key, value> pairs from Maps
  ¤ Buffered in memory
Execution Overview: Step IV
Writing to disk
¨ Periodically, buffered pairs are written to disk
¨ These writes are partitioned
  ¤ By the partitioning function
¨ Locations of buffered pairs on local disk
  ¤ Reported back to Master
  ¤ Master forwards these locations to reduce workers
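
Buffering and partitioned writes on a map worker might be modeled as follows. Everything here is a stand-in: the `MapWorker` class is hypothetical, in-memory lists play the role of the R regions on local disk, the buffer limit is arbitrary, and CRC32 is an assumed partitioning hash.

```python
import zlib

class MapWorker:
    def __init__(self, R, buffer_limit=2):
        self.R = R
        self.buffer_limit = buffer_limit
        self.buffer = []                              # pairs buffered in memory
        self.regions = {r: [] for r in range(R)}      # stand-in for R regions on local disk

    def emit_intermediate(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        # Write buffered pairs out, routed by the partitioning function.
        for key, value in self.buffer:
            r = zlib.crc32(key.encode("utf-8")) % self.R
            self.regions[r].append((key, value))
        self.buffer.clear()

    def report_locations(self):
        # What would be reported to the Master: region index -> number of pairs.
        return {r: len(pairs) for r, pairs in self.regions.items()}

worker = MapWorker(R=2)
for word in "the cat sat on the mat".split():
    worker.emit_intermediate(word, "1")
worker.flush()
report = worker.report_locations()
```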
Execution Overview: Step V
Reading intermediate data
¨ Master notifies Reduce worker about locations
¨ Reduce worker reads buffered data from the local disks of the map workers
¨ Read all intermediate data; sort by intermediate key
  ¤ All occurrences of same key grouped together
  ¤ Many different keys map to the same Reduce task
Execution Overview: Step VI
Processing data at the Reduce worker
¨ Iterate over sorted intermediate data
¨ For each unique key pass
  ¤ Key + set of intermediate values to Reduce function
¨ Output of Reduce function is appended
  ¤ To output file of reduce partition
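
The sort, group, and reduce sequence at a reduce worker can be sketched with a sort followed by `itertools.groupby`. The function `run_reduce_task` is hypothetical, and a list stands in for the output file of the reduce partition.

```python
from itertools import groupby
from operator import itemgetter

def run_reduce_task(intermediate_pairs, reduce_fn):
    """Sort by key, then call reduce_fn once per unique key."""
    output = []  # stand-in for the output file of this reduce partition
    ordered = sorted(intermediate_pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]
        output.append((key, reduce_fn(key, values)))
    return output

pairs = [("sat", "1"), ("the", "1"), ("cat", "1"), ("the", "1")]
output = run_reduce_task(pairs, lambda key, values: str(sum(int(v) for v in values)))
```

Sorting first is what makes all occurrences of a key adjacent, so a single pass is enough to hand each key and its values to the Reduce function.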
Execution Overview: Step VII
Waking up the user
¨ After all Map & Reduce tasks have been completed
¨ Control returns to the user code
TASK GRANULARITY
Task Granularity
¨ Subdivide map phase into M pieces
¨ Subdivide reduce phase into R pieces
¨ M, R >> number of worker machines
¨ Each worker performing many different tasks
  ¤ Improves dynamic load balancing
  ¤ Speeds up recovery during failures
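
A toy simulation can illustrate why many small tasks help load balancing. The worker speeds and the total amount of work are made-up numbers, and `makespan` models only a greedy policy in which each free worker takes the next task.

```python
import heapq

def makespan(total_work, num_tasks, speeds):
    """Finish time when equal-sized tasks go to whichever worker frees up first."""
    task_size = total_work / num_tasks
    free_at = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_tasks):
        t, w = heapq.heappop(free_at)          # earliest available worker
        done = t + task_size / speeds[w]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, w))
    return finish

speeds = [1.0, 1.0, 2.0]                # the third worker is twice as fast
coarse = makespan(12.0, 3, speeds)      # one task per worker
fine = makespan(12.0, 12, speeds)       # many more tasks than workers
```

With one task per worker the fast machine sits idle while the slow ones finish; with finer tasks it simply picks up more of them.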
Master Data Structures
¨ For each Map and Reduce task
  ¤ State: {idle, in-progress, completed}
  ¤ Worker machine identity
¨ For each completed Map task store
  ¤ Location and sizes of R intermediate file regions
¨ Information pushed incrementally to in-progress Reduce tasks
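
The bookkeeping listed above could be represented along these lines. The class and method names, the worker identifiers, and the file paths are all hypothetical; the sketch only shows the shape of the state, not how a real master communicates.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class Task:
    kind: str                                  # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None               # worker machine identity
    # Completed map tasks only: (location, size in bytes) for each of the R regions
    regions: List[Tuple[str, int]] = field(default_factory=list)

class Master:
    def __init__(self, M, R):
        self.R = R
        self.map_tasks = [Task("map") for _ in range(M)]
        self.reduce_tasks = [Task("reduce") for _ in range(R)]

    def start(self, task, worker):
        task.state = State.IN_PROGRESS
        task.worker = worker

    def complete_map(self, index, regions):
        if len(regions) != self.R:
            raise ValueError("expected one region per reduce partition")
        task = self.map_tasks[index]
        task.regions = regions
        task.state = State.COMPLETED
        # Push each region's location to the reduce tasks already in progress.
        return [(j, regions[j]) for j, t in enumerate(self.reduce_tasks)
                if t.state is State.IN_PROGRESS]

master = Master(M=2, R=2)
master.start(master.map_tasks[0], "w0")
master.start(master.reduce_tasks[1], "w1")
notices = master.complete_map(0, [("w0:/tmp/m0-r0", 120), ("w0:/tmp/m0-r1", 80)])
```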
Practical bounds on how large M and R can be
¨ Master must make O(M + R) scheduling decisions
¨ Keep O(MR) state in memory
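
A quick calculation shows how the two bounds grow. The values of M and R and the one byte per map/reduce pair are assumptions chosen for illustration, not measurements.

```python
def master_costs(M, R, bytes_per_pair=1):
    """Return (scheduling decisions, bytes of state) for the O(M + R) and O(MR) bounds."""
    scheduling_decisions = M + R
    state_bytes = M * R * bytes_per_pair
    return scheduling_decisions, state_bytes

decisions, state_bytes = master_costs(M=200_000, R=5_000)
# 205,000 decisions, and 10**9 bytes (about 1 GB) of state at one byte per pair
```

The scheduling work grows additively, but the state grows with the product, so R in particular cannot be raised without limit.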
The contents of this slide-set are based on the following references
¨ Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150.
¨ Donald Miner and Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. 1st Edition. O'Reilly Media. ISBN: 978-1449327170. [Chapter 1]