Topic 9: MR+


Cloud Computing Workshop 2013, ITU


9: MR+

Zubair Nabi

zubair.nabi@itu.edu.pk

April 19, 2013


Outline

1 Introduction

2 MR+

3 Implementation

4 Code-base


Introduction

Implicit MapReduce Assumptions

The input data has no structure

The distribution of intermediate data is balanced

Results materialize when all the map and reduce tasks complete

The number of values per key is small enough to be processed by a single reduce task

Processing at the reduce stage is, in most cases, a simple aggregation function


Zipf distributions are everywhere


Reduce-intensive applications

Image and speech correlation

Backpropagation in neural networks

Co-clustering

Tree learning

Computation of node diameter and radii in Tera-scale graphs

. . .


MR+

Design Goals

Negate skew in intermediate data

Exploit structure in input data

Estimate results

Favour commodity clusters

Maintain original functional model of MapReduce


Design

Maintains the simple MapReduce programming model

Instead of implementing MapReduce as a sequential two-stage architecture, MR+ allows map and reduce stages to interleave and iterate over intermediate results

Leading to a multi-level inverted tree of reduce workers


Architecture

Figure: Architectural comparison of MapReduce and MR+. (a) In MapReduce, a brick-wall separates the map phase from the reduce phase. (b) In MR+, a 5-10% estimation cycle at the start of the job prioritizes data.

Architectural Flexibility

1 Instead of waiting for all maps to finish before scheduling a reduce task, MR+ permits a model where a reduce task can be scheduled for every n invocations of the map function

2 A densely populated key can be recursively reduced by repeated invocation of the reduce function at multiple reduce workers


Advantages

Resilient to TCP Incast by amortizing data copying over the course of the job

Early materialization of partial results for queries with thresholds or confidence intervals

Finds structure in the data by running a sample cycle to learn the distribution of information, and prioritizes input data with respect to the user query


Programming Model

Retains the 2-stage MapReduce API

MR+ reducers can be likened to distributed combiners

Repeated invocation of the reducer by default rules out non-associative functions

But reducers can be designed in such a way that the non-associative operation is applied only at the very last reduce, as in the sketch below
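A minimal sketch of this design, not taken from the MR+ code-base: computing a per-key mean, where intermediate levels apply only the associative part (merging (sum, count) pairs) and the single last-level reducer applies the non-associative division.

```python
def reduce_partial(key, pairs):
    """Associative step: merge (sum, count) pairs. Safe to re-apply at any level."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return key, [(total, count)]

def reduce_final(key, pairs):
    """Non-associative step: applied only by the single last-level reducer."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return key, total / count

# Map emits (value, 1) pairs; any tree of reduce_partial calls followed
# by one reduce_final yields the exact mean:
_, a = reduce_partial("temp", [(21.5, 1), (18.0, 1)])
_, b = reduce_partial("temp", [(24.1, 1)])
print(reduce_final("temp", a + b))  # ('temp', 21.2)
```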


Implementation

Scheduling

Tasks are scheduled according to a configurable map_to_reduce_schedule_ratio parameter

For every map_to_reduce_schedule_ratio map tasks, 1 reduce task is scheduled

For instance, if map_to_reduce_schedule_ratio is 4, then the first reduce task is scheduled when 4 map tasks complete
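A hedged sketch of this scheduling rule (the real JobTracker logic is more involved; the callback name is illustrative):

```python
map_to_reduce_schedule_ratio = 4  # configurable, as above

completed_maps = 0
scheduled_reduces = 0

def on_map_completed(schedule_reduce):
    """Call once per finished map task; launches reduces at the configured ratio."""
    global completed_maps, scheduled_reduces
    completed_maps += 1
    if completed_maps // map_to_reduce_schedule_ratio > scheduled_reduces:
        scheduled_reduces += 1
        schedule_reduce(level=1)

# With a ratio of 4, the first reduce is scheduled after map 4 completes,
# the second after map 8, and so on.
```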


Level-1 reducers

Each reduce is assigned the output of map_to_reduce_ratio number of maps

The location of their inputs is communicated by the JobTracker

Each reduce task pulls its input via HTTP, as sketched below

After the reduce logic has been applied to all keys, the output is earmarked for level > 1 reducers
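A sketch of the level-1 pull step under stated assumptions: the JobTracker is taken to hand each reducer a list of HTTP URLs for its assigned map outputs, and payloads are taken to be JSON (JSON is used for serialization in the code-base, but the wire format here is a guess).

```python
import json
from urllib.request import urlopen

def fetch_map_outputs(urls):
    """Pull intermediate key/value lists from the given map-output URLs."""
    merged = {}
    for url in urls:
        with urlopen(url) as resp:
            for key, values in json.load(resp).items():  # assumed JSON payloads
                merged.setdefault(key, []).extend(values)
    return merged

def run_level1_reduce(urls, reduce_fn):
    """Apply the user's reduce logic to every key, then hand the result upward."""
    merged = fetch_map_outputs(urls)
    return {key: reduce_fn(key, values) for key, values in merged.items()}
```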


Level > 1 reducers

Assigned the input of reduce_input_ratio number of reduce tasks

Eventually all key/value pairs make their way to the final level, which has a single worker

This final reduce can also be used to apply any non-associative operation; a sketch of the resulting reduce tree follows
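A minimal sketch of the inverted reduce tree: each level merges the outputs of reduce_input_ratio lower-level reducers until one worker remains. In the real system the levels run on separate workers; here they are folded into one loop for illustration.

```python
def tree_reduce(partitions, reduce_fn, reduce_input_ratio=2, final_fn=None):
    """partitions: list of {key: [values]} dicts, one per level-1 reducer."""
    level = partitions
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), reduce_input_ratio):
            group, merged = level[i:i + reduce_input_ratio], {}
            for part in group:
                for key, values in part.items():
                    merged.setdefault(key, []).extend(values)
            # apply the (associative) reduce at this level
            next_level.append({k: [reduce_fn(k, v)] for k, v in merged.items()})
        level = next_level
    # the single final reducer may apply a non-associative operation
    final = final_fn or reduce_fn
    return {k: final(k, v) for k, v in level[0].items()}

# e.g. word count: tree_reduce(parts, lambda k, vs: sum(vs))
```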


Structural comparison

Figure: Structural comparison of MapReduce and MR+. (a) In MapReduce, ω map workers feed a shuffler behind the brick-wall, which assigns keys to θ reduce workers. (b) In MR+, ω map workers feed α = ω/mr level-1 reducers; each further level shrinks by the reduce ratio (β = α/rr, γ = β/rr, ...) down to a single final reducer.

Reduce Locality

MR+ does not rely on key/values for input assignment

Reduce inputs are assigned on the basis of locality:

1 Node-local

2 Rack-local

3 Any
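A sketch of this three-tier preference; `pending` is assumed to be a list of (input_id, host, rack) tuples known to the JobTracker, and the requesting reducer runs on (host, rack). The names are illustrative, not the code-base's.

```python
def pick_reduce_input(pending, host, rack):
    """Prefer node-local input, then rack-local, then any."""
    for input_id, in_host, _ in pending:
        if in_host == host:              # 1. node-local
            return input_id
    for input_id, _, in_rack in pending:
        if in_rack == rack:              # 2. rack-local
            return input_id
    return pending[0][0] if pending else None  # 3. any
```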


Fault Tolerance

Deterministic input assignment simplifies failure recovery in MapReduce

In case of MR+, if a map task or a level-1 reduce fails, it is simply re-executed

For level > 1 reduce tasks, MR+ implements three strategies, which expose the trade-off between computation and storage (sketched below):

1 Chain re-execution: The entire chain is re-executed

2 Local replication: The output of each reduce is replicated on the local file system of a rack-local neighbour

3 Distributed replication: The output of each reduce is replicated on the distributed file system
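The three strategies can be read as a persistence choice made when a level > 1 reducer finishes. This is a sketch under assumed names (persist_reduce_output, hdfs_client.upload are hypothetical), not the code-base's actual API.

```python
import pickle, shutil

def persist_reduce_output(output, strategy, path, neighbour_path=None, hdfs_client=None):
    with open(path, "wb") as f:
        pickle.dump(output, f)            # assumed: output always kept locally
    if strategy == "chain":               # 1. nothing extra; on failure the
        pass                              #    whole chain re-executes
    elif strategy == "local":             # 2. copy to a rack-local neighbour
        shutil.copy(path, neighbour_path)
    elif strategy == "distributed":       # 3. replicate on the DFS
        hdfs_client.upload(path)          # hypothetical client call
```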


Input Prioritization

User-defined map and reduce functions are applied to a sample_percentage amount of input, taken at random

This sampling cycle yields a representative distribution of the data

Used to exploit structure: data with semantic grouping or clusters of relevant information

The distribution is used to generate a priority queue to assign to map tasks

A full-fledged MR+ job is then run, in which map tasks read input from the priority queue


Input Prioritization (2)

Due to this prioritization, relevant clusters of information are processed first

As a result, the computation can be stopped mid-way if a threshold condition is satisfied; both steps are sketched below
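An end-to-end sketch of the estimation cycle and early stop: sample a sample_percentage of the splits, score every split against the user query, build a priority queue, and halt the full job once the threshold is met. Here score_split, process_split, and threshold are illustrative stand-ins, not names from the code-base.

```python
import heapq, random

def prioritize_splits(splits, sample_percentage, score_split):
    """Sample, learn the distribution, and rank all splits by relevance."""
    sample = random.sample(splits, max(1, len(splits) * sample_percentage // 100))
    queue = [(-score_split(s, sample), s) for s in splits]  # max-heap via negation
    heapq.heapify(queue)
    return queue

def run_prioritized(queue, process_split, threshold):
    """Process splits in priority order; stop mid-way once the threshold holds."""
    result = 0.0
    while queue:
        _, split = heapq.heappop(queue)
        result += process_split(split)
        if result >= threshold:   # relevant clusters come first, so stopping
            break                 # early still satisfies the query
    return result
```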



Code-base

Around 15,000 lines of Python code

Code implements both vanilla MapReduce and MR+

Written over the course of roughly 5 years at LUMS

Publicly available at: https://code.google.com/p/mrplus/source/browse/?name=BRANCH_VER_0_0_0_4_PY2x


Storage

Abstracts away the underlying storage system

Currently supports HDFS and Amazon's S3

Also supports the local OS file system (for unit testing)
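A sketch of what such a storage abstraction can look like; the class and method names are assumptions, not the code-base's actual interface.

```python
from abc import ABC, abstractmethod

class Store(ABC):
    @abstractmethod
    def read(self, path: str) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

class LocalStore(Store):
    """Local OS file system backend, handy for unit tests."""
    def read(self, path):
        with open(path, "rb") as f:
            return f.read()
    def write(self, path, data):
        with open(path, "wb") as f:
            f.write(data)

# An HDFS and an S3 backend would implement the same interface on top of
# their client libraries, so job code never names a concrete backend.
```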


Structure

Modular structure, so most of the code is re-used across MapReduce and MR+

Google Protobufs and JSON are used for serialization

All configuration options live within two files: siteconf.xml (site-wide) and jobconf.xml (job-specific)
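A sketch of loading the two configuration files; only the file names and a few option names appear in the slides, so the Hadoop-style <property> layout assumed here is a guess.

```python
import xml.etree.ElementTree as ET

def load_conf(path):
    """Parse <property><name>...</name><value>...</value></property> entries
    (an assumed, Hadoop-style layout) into a flat dict."""
    return {p.findtext("name"): p.findtext("value")
            for p in ET.parse(path).getroot().iter("property")}

# Site-wide defaults overridden by job-specific settings, e.g.:
# conf = {**load_conf("siteconf.xml"), **load_conf("jobconf.xml")}
# ratio = int(conf["map_to_reduce_schedule_ratio"])
```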
