Adaptive MapReduce using Situation-Aware Mappers

Post on 11-May-2015


We propose new adaptive runtime techniques for MapReduce that improve performance and simplify job tuning. We implement these techniques by breaking a key assumption of MapReduce that mappers run in isolation. Instead, our mappers communicate through a distributed meta-data store and are aware of the global state of the job. However, we still preserve the fault-tolerance, scalability, and programming API of MapReduce. We utilize these situation-aware mappers to develop a set of techniques that make MapReduce more dynamic: (a) Adaptive Mappers dynamically take multiple data partitions (splits) to amortize mapper start-up costs; (b) Adaptive Combiners improve local aggregation by maintaining a cache of partial aggregates for the frequent keys; (c) Adaptive Sampling and Partitioning sample the mapper outputs and use the obtained statistics to produce balanced partitions for the reducers. Our experimental evaluation shows that adaptive techniques provide up to 3x performance improvement, in some cases, and dramatically improve performance stability across the board.

Transcript of Adaptive MapReduce using Situation-Aware Mappers

Adaptive MapReduce using Situation-Aware Mappers

Rares Vernica¹ (HP Labs), Andrey Balmin, Kevin S. Beyer, Vuk Ercegovac (IBM Research)

¹ Work done at IBM Research.

15th International Conference on Extending Database Technology, March 26-30, 2012

Rares Vernica (HP Labs) Adaptive MapReduce EDBT 2012 1 / 25

Outline

1 Motivation

2 Problem Statement

3 Situation-Aware Mappers
- Adaptive Mappers
- Adaptive Combiners
- Adaptive Sampling and Partitioning

4 Summary


MapReduce Review

map (k,v) → list(k,v)
reduce (k,list(v)) → list(k,v)
combine (k,list(v)) → list(k,v)

[Figure: MapReduce data flow. Input splits (1/3, 2/3, 3/3) are read from the DFS by MAP tasks (input: (k,v); output: list(k,v)). Map outputs pass through SHUFFLE and MERGE to REDUCE tasks (input: (k, list(v)); output: list(k,v)), which write outputs (1/2, 2/2) to the DFS.]
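The three signatures above can be illustrated with a small single-process sketch. This is not Hadoop code: the word-count job, the function names (`map_fn`, `combine_fn`, `reduce_fn`, `run_job`), and the CRC-based static partitioner are assumptions chosen only to make the data flow concrete.

```python
import zlib
from collections import defaultdict


def map_fn(key, value):
    # map (k, v) -> list(k, v): emit (word, 1) for every word in the line.
    return [(word, 1) for word in value.split()]


def combine_fn(key, values):
    # combine (k, list(v)) -> list(k, v): local pre-aggregation per mapper.
    return [(key, sum(values))]


def reduce_fn(key, values):
    # reduce (k, list(v)) -> list(k, v): final aggregation per key.
    return [(key, sum(values))]


def group_by_key(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups


def run_job(splits, num_reducers=2):
    # One partition per reducer; partitions are statically predetermined.
    partitions = [[] for _ in range(num_reducers)]
    for split in splits:  # one regular mapper per split
        mapped = []
        for key, value in split:
            mapped.extend(map_fn(key, value))
        for k, vs in group_by_key(mapped).items():
            for out_k, out_v in combine_fn(k, vs):
                # Shuffle: route each pair by a hash of its key.
                part = zlib.crc32(out_k.encode()) % num_reducers
                partitions[part].append((out_k, out_v))
    output = {}
    for part in partitions:  # one reducer per partition
        for k, vs in sorted(group_by_key(part).items()):
            for out_k, out_v in reduce_fn(k, vs):
                output[out_k] = out_v
    return output
```

Calling `run_job([[(0, "a b a")], [(1, "b c")]])` runs two mappers and two reducers and returns the word counts as a dictionary.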


Motivation: MapReduce Issues

MapReduce
- Parallel data-processing framework
- Open-source implementation (Hadoop)
- Simple programming environment

MapReduce: “simplicity over performance”
Limited choice of execution strategies:
- Mappers checkpoint after every split
- Map outputs are sorted and written to file
- Reducers read statically predetermined partitions


Solutions to MapReduce Issues

MapReduce-inspired alternatives
- Dryad (Microsoft)
- Spark (UC Berkeley)
- Hyracks (UC Irvine)
- Nephele (TU Berlin)

Have more choices in runtime execution


Our Solution: Adaptive MapReduce

Make MapReduce (Hadoop) more flexible

Leverage existing investment in:
- Framework (Hadoop)
- Query processing systems (Jaql, Pig, Hive)

Techniques for:
- Dynamic checkpoint intervals (Map)
- Best-effort hash-based aggregation (Combine)
- Dynamic, sample-based partitioning (Reduce)

Performance tuning:
- Cardinality and cost estimation (due to UDFs)
- Adaptive to runtime environment


Problem Statement: Adaptive MapReduce

Goals

Improve MapReduce (Hadoop) performance by:
- New runtime options
- Adaptive to runtime environment

Preserve Hadoop’s:
- Fault-tolerance
- Scalability
- Programmability


Situation-Aware Mappers

Main idea
- Make MapReduce more dynamic
- Mappers:
  - Aware of the global state of the job
  - Communicate through a distributed meta-data store
  - Break assumption: isolation

Situation-Aware Mappers


Adaptive MapReduce

[Figure: MapReduce data flow (DFS, MAP tasks, REDUCE tasks, DFS), extended with a DMDS and the adaptive techniques AM, AS, AP, and AC.]

Adaptive Techniques
- AM: Adaptive Mappers
- AC: Adaptive Combiners
- AS: Adaptive Sampling
- AP: Adaptive Partitioning


Distributed Meta-Data Store (DMDS)
- Distributed read/write
- Transactional
- e.g., ZooKeeper


Adaptive Mappers Motivation

- Input data is divided into splits
- One-to-one correspondence of mappers and splits
- AM decouple # splits from # mappers

[Figure: mapper timelines. Legend: startup cost (e.g., scheduling, loading reference data) and split processing cost.]

Configuration      Startup cost   Workload
Small splits       Large          Balanced
Large splits       Small          Imbalanced
Adaptive Mappers   Small          Balanced
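The trade-off above can be checked with a toy cost model. The sketch below is an illustration under stated assumptions, not the authors' evaluation: every mapper pays the same fixed startup cost, splits are handed to whichever slot frees up first, and all numbers in the usage note are made up.

```python
import heapq


def regular_makespan(split_costs, slots, startup):
    """Regular mappers: one mapper per split, so every split pays startup."""
    free_at = [0.0] * slots  # time at which each map slot becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for cost in split_costs:
        start = heapq.heappop(free_at)  # earliest free slot takes the split
        end = start + startup + cost
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


def adaptive_makespan(split_costs, slots, startup):
    """Adaptive mappers: one mapper per slot pays startup once, then keeps
    taking the next unprocessed split until none remain."""
    if not split_costs:
        return 0.0
    free_at = [float(startup)] * slots  # each mapper is ready after startup
    heapq.heapify(free_at)
    finish = float(startup)
    for cost in split_costs:
        start = heapq.heappop(free_at)
        end = start + cost
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish
```

With two slots, a startup cost of 2, and skewed small splits `[5, 1, 1, 1, 1, 1, 1, 1]`, regular mappers finish at time 15. Merging the same data into two large splits `[8, 4]` finishes at 10 because one mapper carries the skew, while adaptive mappers over the small splits finish at 8.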


Adaptive Mappers Algorithm

[Figure: interaction between the MapReduce Client, ZooKeeper, and the mappers (Map1, Map2, ...) initializing on Host1, Host2, ..., built up in steps numbered 1 to 5. Step 4 is labeled "assigned Split1 {Map2}"; step 5 is labeled "Split1" with the result OK/Fail.]

ZooKeeper meta-data (tree rooted at Root):
  JobID
    locations
      Host1 [Split1, Split2, ...]
      Host2 ...
    assigned
      Split1 {Map2}

- Store meta-data in ZooKeeper
- Implemented as a new InputFormat
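The "assigned ... OK/Fail" exchange can be sketched as an atomic create-if-absent operation. The code below is a minimal single-process illustration: `InMemoryMetaStore` is a hypothetical stand-in, not the ZooKeeper client API, and the path layout (`/<job>/locations`, `/<job>/assigned/<split>`) is an assumption modeled on the labels in the figure.

```python
import threading


class InMemoryMetaStore:
    """Hypothetical stand-in for a distributed meta-data store."""

    def __init__(self):
        self._nodes = {}
        self._lock = threading.Lock()

    def create(self, path, data):
        # Atomic create-if-absent: True means OK, False means Fail.
        with self._lock:
            if path in self._nodes:
                return False
            self._nodes[path] = data
            return True

    def get(self, path):
        with self._lock:
            return self._nodes.get(path)


def publish_locations(store, job_id, locations):
    # Client side: record which host stores which splits.
    store.create(f"/{job_id}/locations", locations)


def claim_splits(store, job_id, mapper_id):
    """Lazily yield the splits this mapper manages to claim.

    A split is claimed only when the mapper asks for its next one, so a
    slow mapper leaves the remaining splits to faster mappers.
    """
    locations = store.get(f"/{job_id}/locations")
    for host in sorted(locations):
        for split in locations[host]:
            if store.create(f"/{job_id}/assigned/{split}", mapper_id):
                yield split
```

Because each claim goes through `create`, a split listed under several hosts is still processed by exactly one mapper.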


Adaptive Mappers Algorithm

Additional Features
- Process local splits first, then remote splits
- Fault tolerance
  - Restarted task unlocks splits
  - Split reprocessing is shared
- Scheduler aware (FIFO, FAIR, and FLEX)
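Two of the features above, local-first ordering and unlocking splits after a failure, can be sketched with plain dictionaries. The helper names `claim_order` and `release_splits` and the data layout (host to split list, split to owning mapper) are assumptions for illustration; the slides do not show how the implementation stores this state.

```python
def claim_order(locations, my_host):
    """Return candidate splits: those stored on my_host first, then the
    splits stored only on other hosts, without duplicates."""
    local = list(locations.get(my_host, []))
    seen = set(local)
    remote = []
    for host in sorted(locations):
        if host == my_host:
            continue
        for split in locations[host]:
            if split not in seen:
                seen.add(split)
                remote.append(split)
    return local + remote


def release_splits(assigned, failed_mapper):
    """Unlock every split held by a failed mapper so that any running
    mapper (or the restarted task) can claim and reprocess it."""
    released = sorted(s for s, m in assigned.items() if m == failed_mapper)
    for split in released:
        del assigned[split]
    return released
```

Here `assigned` plays the role of the lock table: once an entry is deleted, the split is again available to whichever mapper asks next, which is one way to read "split reprocessing is shared".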


Experimental Setting

Hardware
- 40-node IBM System x iDataPlex dx340
- Two quad-core Intel Xeon E5540 64-bit 2.83GHz
- 32GB RAM
- Four SATA disks
- 160 map and 160 reduce slots

Software
- Ubuntu Linux, kernel 2.6.32-24 64-bit server edition
- Java 1.6 64-bit server edition
- Hadoop 0.20.2
- ZooKeeper 3.3.1


Start-up Cost vs. ZooKeeper Overhead

[Figure: time (seconds) vs. number of splits (20, 200, 2000) for Regular Mappers and Adaptive Mappers.]

- 2000 1-byte records
- Sleep 1s/record
- 5 nodes, 20 map slots
- 20-2000 Reg. Mappers
- 20 Adaptive Mappers

- Small ZooKeeper overhead
- Large Map startup cost ∼2s/map


Adaptive Mappers Workloads

1 Set-Similarity Join [Vernica et al., 2010]
- Publication datasets
- DBLP: 1.2M records, 310MB
- CITESEERX: 1.3M records, 1,750MB
- Increased to ×10 and ×100

2 JOIN
- Single dataset (“fact” table), Sort Benchmark data generator
- Fan-out coefficient (“dimension” table)
- Average join fan-out 1 : 30
- TERASORT: 1B records, 93GB


Adaptive Mappers Experiments - Set-Similarity Join

[Figure: time (seconds) vs. split size (2048, 1024, 512, 256, 128, 64, and 32 MB) for Regular Mappers, compared with Adaptive Mappers (AM).]

- Stage 3: One-Phase Record Join
- Broadcast join equivalent
- DBLP and CITESEERX ×10
- Single wave of AM

- ×3 speedup over default Hadoop split size (64MB)
- Optimal with no tuning


Adaptive Mappers Experiments - JOIN

[Figure: time (seconds) vs. split size (1024, 512, 256, 128, 64, 32, 16, and 8 MB) for Regular Mappers, compared with Adaptive Mappers (AM).]

- Map-only job
- 1B TERASORT records
- Models a skewed join
- Single wave of AM

Regular Mappers:
- Large split: data skew
- Small split: scheduling and start-up overhead

Optimal with no tuning


Adaptive Combiners

Main idea
- Replace sort with hashing
- Reduce serialization, sort, and IO

[Figure: Regular Combiners (Map, Sort Buffer, Sort, Combine, Merge) and Adaptive Combiners (Hash-group and Combine). Legend: user code, data.]


Adaptive Combiners Details

- “Best-effort” aggregation
- Never spill to disk
- Hash-table replacement policies:
  - No-Replacement (NR)
  - Least-Recently-Used (LRU)

Implemented as:
- Library for Hadoop
- Optimization choice for Jaql
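A best-effort combiner with the two replacement policies can be sketched as a bounded in-memory cache of partial aggregates. This is an illustration, not the Hadoop library or the Jaql optimization: the class name `AdaptiveCombiner`, the `emitted` list standing in for the regular map output path, and the behavior on a full cache are assumptions consistent with "never spill to disk".

```python
from collections import OrderedDict


class AdaptiveCombiner:
    """Bounded cache of partial aggregates; never spills to disk."""

    def __init__(self, capacity, combine=lambda a, b: a + b, policy="LRU"):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        if policy not in ("NR", "LRU"):
            raise ValueError("policy must be 'NR' or 'LRU'")
        self.capacity = capacity
        self.combine = combine
        self.policy = policy
        self.cache = OrderedDict()  # key -> partial aggregate
        self.emitted = []           # pairs handed to the regular output path

    def add(self, key, value):
        if key in self.cache:
            # Cache hit: fold the value into the partial aggregate.
            self.cache[key] = self.combine(self.cache[key], value)
            if self.policy == "LRU":
                self.cache.move_to_end(key)
        elif len(self.cache) < self.capacity:
            self.cache[key] = value
        elif self.policy == "NR":
            # No-Replacement: cache is full, pass the pair through as-is.
            self.emitted.append((key, value))
        else:
            # LRU: emit the least recently used aggregate, cache the new key.
            self.emitted.append(self.cache.popitem(last=False))
            self.cache[key] = value

    def flush(self):
        # End of map task: emit whatever is still cached.
        self.emitted.extend(self.cache.items())
        self.cache.clear()
        return self.emitted
```

Either policy may emit several partial aggregates for the same key, which is safe because the reducer (or a regular combiner downstream) still merges them; the cache only reduces how many pairs reach the sort and serialization path.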


Adaptive Combiners Experiments

GROUP-BY
- Synthetic dataset with 3 dimensions (A1, A2, and A3) and 1 fact
- Group records and apply aggregation function
- TWL: 10B records, 120GB

[Figure: GROUP-BY on A1. Time (seconds) for Regular Combiners, Adaptive Combiners NR, and Adaptive Combiners LRU; x-axis labels as extracted: Reg.; AM; AC; AM, AC. ×2.5 speedup.]

[Figure: GROUP-BY on A1 and A2. Time (seconds) and miss ratio (%) vs. cache size (K) for Regular Combiners, Adaptive Combiners NR, and Adaptive Combiners LRU; x-axis labels as extracted: Reg.; AM; 1; 25; 100. ×3 speedup.]


Adaptive Sampling and Partitioning

Step 1 Compute and publish local histogram

Step 2 Collect local histograms and compute partitioning function

Step 3 Broadcast partitioning function

[Figure: MAP tasks, REDUCE tasks, and the DMDS at each of the three steps.]
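The three steps can be sketched in a single process as follows. The slides do not say how the partitioning function is derived from the histograms, so the equal-depth range partitioning used here is an assumption, as are the function names; publishing and broadcasting through the DMDS are reduced to passing Python objects around.

```python
from bisect import bisect_left
from collections import Counter


def local_histogram(keys):
    # Step 1: each mapper counts the keys it has produced so far.
    return Counter(keys)


def merge_histograms(histograms):
    # Step 2a: collect the published local histograms into a global one.
    total = Counter()
    for h in histograms:
        total.update(h)
    return total


def compute_boundaries(histogram, num_partitions):
    # Step 2b: choose boundary keys so that each range partition receives
    # roughly the same number of records.
    total = sum(histogram.values())
    boundaries = []
    running = 0
    for key in sorted(histogram):
        running += histogram[key]
        target = total * (len(boundaries) + 1) / num_partitions
        if len(boundaries) < num_partitions - 1 and running >= target:
            boundaries.append(key)
    return boundaries


def partition(key, boundaries):
    # Step 3: every mapper applies the broadcast boundaries. Keys up to and
    # including boundaries[i] go to partition i; larger keys go to the last.
    return bisect_left(boundaries, key)
```

For example, two mappers that saw keys `[1, 1, 1, 2, 3]` and `[1, 1, 4, 5, 6]` yield a merged histogram in which key 1 alone accounts for half of the records, so with two reducers the single boundary is 1 and each reducer receives five records.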


Summary

Adaptive runtime techniques for MapReduce
- Situation-Aware Mappers
- Make MapReduce more dynamic

- Up to ×3 speedup for well-tuned jobs
- Orders of magnitude speedup for badly tuned jobs
- Never hurt performance
- Configure themselves
- Part of IBM InfoSphere BigInsights


Vernica, R., Carey, M., and Li, C. (2010). Efficient parallel set-similarity joins using MapReduce. In SIGMOD Conference.
