
Page 1: MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat (OSDI '04)

Seong Hoon Seo, Hyunji Choi
December 1st, 2020

Page 2: Contents

● Introduction and Motivation

● Programming Model

● Execution Flow

● Implementation

● Details and Refinements

● Performance

● Experience

● Conclusion


Page 4: Introduction and Motivation

● Computation at Google: derived data = F(large raw data)

○ Input: crawled documents, web request logs

○ Output: inverted indices, set of most frequent queries

● Example: Inverted Index

Source: Lucidworks, https://www.slideshare.net/erikhatcher/introduction-to-solr-9213241

Page 5: Introduction and Motivation

● Characteristics of the Computation

○ Conceptually straightforward

○ Distributed computation is necessary

○ Implementation in a distributed environment is complex

● Challenges of Distributed Computation

○ Parallelization

○ Fault-tolerance

○ Data distribution

○ Load balancing

Page 6: Introduction and Motivation

Solution: MapReduce Programming Model

● Interface

○ Enables automatic parallelization and distribution

● Implementation

○ Resolves the challenges of distributed computation and achieves high performance


Page 8: Programming Model

● Input and Output: Set of key/value pairs (i.e., (k, v))

● Map: (k1, v1) → list of (k2, v2)

● Reduce: (k2, list of (v2)) → list of (v2)

*k1, k2, v1, v2 are types (e.g., Int, String)

● Implementation Detail

○ How intermediate values associated with a given key are grouped and passed to Reduce

Page 9: Programming Model

Map: (k1, v1) → list of (k2, v2)

Reduce: (k2, list of (v2)) → list of (v2)

● Example 1: Word Count

○ Map: (document name, contents) → list of (word, 1)

○ Reduce: (word, list of ("1")) → (word, count) (a runnable sketch follows)
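As a concrete illustration, here is a minimal Python sketch of the word-count pair (illustrative names; the paper's actual interface is C++ with an Emit call). The driver at the bottom stands in for the framework's grouping step on a single machine.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Map: (document name, contents) -> list of (word, 1)
    for word in contents.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: (word, list of 1s) -> (word, total count)
    return (word, sum(counts))

# Stand-in for the framework: group intermediate values by key, then reduce.
groups = defaultdict(list)
for k, v in map_fn("doc1", "A red dog and a blue cat"):
    groups[k].append(v)
print([reduce_fn(k, vs) for k, vs in sorted(groups.items())])
# [('a', 2), ('and', 1), ('blue', 1), ('cat', 1), ('dog', 1), ('red', 1)]
```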

Page 10: Programming Model

Map: (k1, v1) → list of (k2, v2)

Reduce: (k2, list of (v2)) → list of (v2)

● Example 2: Inverted Index

○ Map: (document, words) → list of (word, document ID)

○ Reduce: (word, list of (document IDs)) → (word, sorted list of (document IDs)) (sketched below)
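The inverted-index example admits the same shape of sketch; the integer doc IDs and the deduplication via set() are illustrative choices, not mandated by the paper.

```python
from collections import defaultdict

def map_fn(doc_id, contents):
    # Map: (document, words) -> list of (word, document ID)
    for word in set(contents.split()):   # emit each word once per document
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    # Reduce: (word, list of document IDs) -> (word, sorted list of IDs)
    return (word, sorted(doc_ids))

groups = defaultdict(list)
for doc_id, text in [(1, "red dog"), (2, "blue dog")]:
    for k, v in map_fn(doc_id, text):
        groups[k].append(v)
print(sorted(reduce_fn(k, vs) for k, vs in groups.items()))
# [('blue', [2]), ('dog', [1, 2]), ('red', [1])]
```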


Page 12: Execution Flow

Overview figure: # of Map tasks M = 5, # of Reduce tasks R = 2

Page 13: Execution Flow

● Step 1: Input Split

○ M pieces, usually 16 ~ 64 MB per piece (configurable)

○ Each piece corresponds to a "map task" (a minimal splitting sketch follows)
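A minimal sketch of the splitting step, under the simplifying assumption that a split is a plain byte range (the real implementation also respects record boundaries and GFS block placement):

```python
import os

SPLIT_BYTES = 64 * 1024 * 1024  # within the 16-64 MB range above

def input_splits(path):
    # Chop one input file into byte ranges; each range becomes a map task.
    size = os.path.getsize(path)
    return [(path, offset, min(SPLIT_BYTES, size - offset))
            for offset in range(0, size, SPLIT_BYTES)]
```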

Page 14: Execution Flow

● Step 2: Master and Worker Generation

○ A single master node; the remaining nodes are workers, to which the master assigns map and reduce tasks

Page 15: Execution Flow

● Two types of tasks

○ M pieces → M map tasks

○ Intermediate key space partitioned into R pieces → R reduce tasks

■ e.g., hash(key) mod R (see the sketch below)
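A sketch of the default partitioning scheme; md5 stands in for whatever stable hash the real system uses (an assumption), since a stable hash keeps the key-to-reducer mapping consistent across worker processes:

```python
import hashlib

def partition(key: str, R: int) -> int:
    # Default partitioning: hash(key) mod R.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % R

# Every occurrence of a key lands on the same reduce task:
assert partition("dog", 2) == partition("dog", 2)
```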

Page 16: Execution Flow

● Step 3: Map Phase

○ Parse key/value pairs out of the input split and pass each to the user's Map function

○ Intermediate key/value pairs are buffered in memory

Page 17: Execution Flow

● Step 4: Periodic Store

○ Buffered pairs written to local disk

○ Each local disk is partitioned into R regions

○ Locations of the buffered pairs on local disk are passed back to the master

Page 18: Execution Flow

● Step 5: Reduce Phase - Read

○ Master forwards these locations to the reduce workers

○ Reduce workers use remote procedure calls (RPC) to read the data from the map workers' local disks

Page 19: Execution Flow

● Step 6: Reduce Phase - Process

○ Sort and group by intermediate key

○ Apply the Reduce function to each unique key

○ Append results to the final output file (a minimal sketch follows)
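A single-machine sketch of this step, with itertools.groupby standing in for the external sort a real reduce worker performs on its fetched data:

```python
from itertools import groupby
from operator import itemgetter

def run_reduce(pairs, reduce_fn):
    # Sort by intermediate key, group, apply Reduce once per unique key,
    # and append each result to the (here in-memory) output file.
    output = []
    sorted_pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        output.append(reduce_fn(key, [v for _, v in group]))
    return output

print(run_reduce([("dog", 1), ("cat", 1), ("dog", 1)],
                 lambda k, vs: (k, sum(vs))))
# [('cat', 1), ('dog', 2)]
```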


Page 21: Implementation

A. Master Data Structures

● State of each map and reduce task (idle / in-progress / completed)

○ Assigned worker node identity (for non-idle tasks)

● Location and size of intermediate file regions for each map task

B. Task Granularity

● Factors to Consider

○ Scheduling decisions: O(M + R)

○ Master state kept in memory: O(M * R)

○ User preference on the number of output files (a back-of-envelope example follows)
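A back-of-envelope example using the typical figures the paper reports (M = 200,000 and R = 5,000 on about 2,000 machines, with roughly one byte of master state per map/reduce task pair):

```python
M, R = 200_000, 5_000  # typical values from the paper

scheduling_decisions = M + R  # O(M + R) scheduling decisions at the master
state_pairs = M * R           # O(M * R) pieces of location/size state

# At ~1 byte per map/reduce task pair, master state stays around 1 GiB:
print(f"{scheduling_decisions:,} decisions, "
      f"~{state_pairs / 2**30:.1f} GiB of master state")
# 205,000 decisions, ~0.9 GiB of master state
```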

Page 22: Implementation

C. Fault tolerance

1. Worker Failure

● Detection: periodic ping

● Recovery: Reset task to idle and reassign

2. Master Failure

● Retry the entire MapReduce operation

● Alternative: have the master write periodic checkpoints of its data structures, so a new copy can restart from the last checkpoint

Reset required on worker failure?

              Map Task   Reduce Task
In-Progress   O          O
Completed     O          X

(O) Intermediate pairs stored on the local disk of the failed machine are no longer accessible

(X) Output of Reduce is stored in a global file system

(The sketch below encodes this rule.)
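A minimal sketch encoding the reset rule in the table, assuming a simple list-of-dicts task table rather than the paper's actual data structures:

```python
def on_worker_failure(failed_worker, tasks):
    for task in tasks:
        if task["worker"] != failed_worker or task["state"] == "idle":
            continue
        if task["kind"] == "map":
            # Map output lives on the failed machine's local disk, so even
            # completed map tasks are reset to idle and re-executed.
            task["state"] = "idle"
        elif task["state"] == "in-progress":
            # Reduce output goes to the global file system, so only
            # in-progress reduce tasks need to be reset.
            task["state"] = "idle"

tasks = [{"kind": "map", "state": "completed", "worker": "w1"},
         {"kind": "reduce", "state": "completed", "worker": "w1"}]
on_worker_failure("w1", tasks)
print([t["state"] for t in tasks])  # ['idle', 'completed']
```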


Page 24: Implementation Details - Locality

● Locality Optimization

○ GFS divides each file into 64MB blocks and stores several copies.

○ Master attempts to schedule a map task on a machine that contains a replica of the corresponding input data, or near one (same network switch).

○ Conserve network bandwidth.

Page 25: Implementation Details - Backup Tasks

● Problem: “Straggler” workers

○ Workers that take an unusually long time to complete a task

● Solution: schedule “backup” executions of the remaining tasks

○ When a MapReduce operation is close to completion

● Gain: significant on large operations

○ 44% slower without backup tasks for Sort

Page 26: Refinements - M splits to R outputs

Figure: M = 5 map tasks feeding R = 2 reduce outputs.

Page 27: Refinements - M splits to R outputs

Word-count walkthrough (M = 3, R = 2): Input → Split → Map → Reduce → Output

Input: "A red dog and a blue cat and a blue dog and a red cat"

Input split (M = 3):
  Split 1: "A red dog and a"
  Split 2: "blue cat and a blue"
  Split 3: "dog and a red cat"

Map output:
  Split 1 → (a, 1) (red, 1) (dog, 1) (and, 1) (a, 1)
  Split 2 → (blue, 1) (cat, 1) (and, 1) (a, 1) (blue, 1)
  Split 3 → (dog, 1) (and, 1) (a, 1) (red, 1) (cat, 1)

Reduce output (R = 2):
  Output 1: (a, 4) (and, 3) (blue, 2)
  Output 2: (cat, 2) (dog, 2) (red, 2)

Page 28: Refinements - M splits to R outputs

The same walkthrough, stage by stage: Map → Partition → Combiner → Shuffle → Sort → Reduce

Map output (15 k-v pairs):
  Worker 1: (a, 1) (red, 1) (dog, 1) (and, 1) (a, 1)
  Worker 2: (blue, 1) (cat, 1) (and, 1) (a, 1) (blue, 1)
  Worker 3: (dog, 1) (and, 1) (a, 1) (red, 1) (cat, 1)

Partition (R = 2; keys a/and/blue go to reducer 1, cat/dog/red to reducer 2):
  Worker 1: [(a, 1) (and, 1) (a, 1)] / [(red, 1) (dog, 1)]
  Worker 2: [(blue, 1) (and, 1) (a, 1) (blue, 1)] / [(cat, 1)]
  Worker 3: [(and, 1) (a, 1)] / [(dog, 1) (red, 1) (cat, 1)]

Combiner (13 k-v pairs; duplicate keys merged locally on each worker):
  Worker 1: [(a, 2) (and, 1)] / [(red, 1) (dog, 1)]
  Worker 2: [(blue, 2) (and, 1) (a, 1)] / [(cat, 1)]
  Worker 3: [(and, 1) (a, 1)] / [(dog, 1) (red, 1) (cat, 1)]

Shuffle and sort:
  Reducer 1: (a, 2) (a, 1) (a, 1) (and, 1) (and, 1) (and, 1) (blue, 2)
  Reducer 2: (cat, 1) (cat, 1) (dog, 1) (dog, 1) (red, 1) (red, 1)

Reduce:
  Reducer 1: (a, 4) (and, 3) (blue, 2)
  Reducer 2: (cat, 2) (dog, 2) (red, 2)

Page 29: Refinements - M splits to R outputs

● Partitioning Function

○ Default: hash(key) mod R

○ Custom: e.g., hash(Hostname(urlkey)) mod R keeps all URLs from the same host in the same output file

● Ordering Guarantees

○ Within a partition, intermediate pairs are processed in increasing key order.

● Combiner Function

○ Partial reduction applied on the map workers (sketch below).

○ Reduces network overhead.
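A sketch of a word-count combiner, applied to worker 2's output from the page 28 walkthrough; this is valid because word-count's Reduce is associative and commutative:

```python
from collections import Counter

def combine(pairs):
    # Partial reduction on the map worker: merge duplicate keys locally
    # before anything crosses the network.
    merged = Counter()
    for word, count in pairs:
        merged[word] += count
    return list(merged.items())

worker2 = [("blue", 1), ("cat", 1), ("and", 1), ("a", 1), ("blue", 1)]
print(combine(worker2))
# [('blue', 2), ('cat', 1), ('and', 1), ('a', 1)] -- 5 pairs become 4
```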

Page 30: Refinements - Interaction with Master

Page 31: Skipping Bad Records

● Each map/reduce worker installs a signal handler while it processes records (e.g., Record 34, Record 35, Record 36, ...).

● When a record causes a crash, the handler reports that record's sequence number to the master before the worker dies.

● The master tallies failures per record (here: Record 34 → 0, Record 35 → 2, Record 36 → 0); after it sees more than one failure for the same record, it sends a skip signal so re-executions pass over that record.

Page 32: Status Information

● The master exports status pages over an internal HTTP server: progress of map/reduce tasks, bytes of input, intermediate, and output data, processing rates, and which workers have failed.

Page 33: Counters

● A facility for counting occurrences of events: user code creates named counter objects and increments them inside Map/Reduce; workers piggyback counter values on their ping responses, and the master aggregates and displays them.
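The paper's own example counts capitalized words; below is a hedged Python rendering of it (the real API is a C++ Counter object obtained via GetCounter; a plain Counter stands in for the worker-to-master propagation):

```python
from collections import Counter

counters = Counter()  # stand-in for the framework's counter facility

def map_fn(name, contents):
    for word in contents.split():
        if word[:1].isupper():
            counters["uppercase"] += 1  # user-defined counter
        yield (word, 1)

list(map_fn("doc", "The quick Fox"))  # drain the generator
print(counters["uppercase"])  # 2
```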

Page 34: Refinements

● Input/Output Types

○ e.g., "text" mode input: <offset, contents of line>

○ User can define custom reader/writer interface.

● Side-effects

○ Produce auxiliary files as additional outputs.

● Local Execution

○ Sequentially executes all of the work on the local machine.

○ Easily use any debugging or testing tools.


Page 36: Performance

● Cluster Configuration

○ 1800 nodes.

○ Two 2 GHz Intel Xeon processors with Hyper-Threading, 2.5-3 GB of available memory.

○ 100-200 Gbps of aggregate bandwidth.

● Benchmarks

○ Grep: search for a rare pattern (92K matching records) in 10^10 100-byte records (~1 TB).

○ Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark).

Page 37: Performance - Grep

Figure: input scan rate over time.

● The rate peaks once 1764 workers have been assigned, then falls as reads complete.

● About a minute of startup overhead*.

* Copying the program to all workers and the locality optimization (opening input files and fetching block locations).

Page 38: Performance - Sort

Pipeline: text line → (key, text line) → sorted text lines (Map extracts the sort key; Reduce is the identity)

● Input rate less than Grep

○ Intermediate data is larger (Grep emits only matching records; Sort re-emits every line).

● Input > Shuffle > Output rate

○ Input rate benefits from locality optimization.

○ Output rate is low due to the reliability policy of GFS (it writes 2 copies).

Page 39: Performance - Effect of Backup Tasks

● Without backup tasks, the last 5 straggler tasks increase total execution time by 44%.

Page 40: Performance - Machine failures

● Completed map tasks on killed workers had to be re-executed, since their intermediate files were lost.

● Only a 5% increase in total execution time.


Page 42: Experience in Google

● Broadly applicable, including:

○ Large-scale machine learning problems.

○ Extraction of properties of web pages.

● Rewrite of the production indexing system

○ Code is simpler, hiding details regarding fault tolerance and parallelization.

○ Keep conceptually unrelated computations separate.

○ Easy to operate and scale.

Page 43: Conclusion

● MapReduce programming model

○ Is easy to use.

○ A large variety of problems are easily expressible as MapReduce computations.

○ Scales to large clusters of machines.