Adaptive MapReduce using Situation-Aware Mappers

Rares Vernica (HP Labs; work done at IBM Research), Andrey Balmin, Kevin S. Beyer, Vuk Ercegovac (IBM Research)
15th International Conference on Extending Database Technology (EDBT 2012), March 26-30, 2012

description

We propose new adaptive runtime techniques for MapReduce that improve performance and simplify job tuning. We implement these techniques by breaking a key assumption of MapReduce that mappers run in isolation. Instead, our mappers communicate through a distributed meta-data store and are aware of the global state of the job. However, we still preserve the fault-tolerance, scalability, and programming API of MapReduce. We utilize these situation-aware mappers to develop a set of techniques that make MapReduce more dynamic: (a) Adaptive Mappers dynamically take multiple data partitions (splits) to amortize mapper start-up costs; (b) Adaptive Combiners improve local aggregation by maintaining a cache of partial aggregates for the frequent keys; (c) Adaptive Sampling and Partitioning sample the mapper outputs and use the obtained statistics to produce balanced partitions for the reducers. Our experimental evaluation shows that adaptive techniques provide up to 3x performance improvement, in some cases, and dramatically improve performance stability across the board.

Transcript of Adaptive MapReduce using Situation-Aware Mappers

Page 1: Adaptive MapReduce using Situation-Aware Mappers

Adaptive MapReduce using Situation-Aware Mappers

Rares Vernica¹ (HP Labs), Andrey Balmin, Kevin S. Beyer, Vuk Ercegovac (IBM Research)

¹ Work done at IBM Research.

15th International Conference on Extending Database Technology, March 26-30, 2012

Rares Vernica (HP Labs) Adaptive MapReduce EDBT 2012 1 / 25

Page 2: Adaptive MapReduce using Situation-Aware Mappers

Outline

1 Motivation
2 Problem Statement
3 Situation-Aware Mappers
  Adaptive Mappers
  Adaptive Combiners
  Adaptive Sampling and Partitioning
4 Summary

Page 3: Adaptive MapReduce using Situation-Aware Mappers

MapReduce Review

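map (k,v) → list(k,v)
combine (k,list(v)) → list(k,v)
reduce (k,list(v)) → list(k,v)

[Figure: MapReduce data flow. Three input splits (1/3, 2/3, 3/3) are read from the DFS by three MAP tasks (input: (k,v); output: list(k,v)). Map outputs go through SHUFFLE and MERGE into two REDUCE tasks (input: (k,list(v)); output: list(k,v)), which write outputs 1/2 and 2/2 to the DFS.]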
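
The three signatures above can be illustrated with a small, single-process word count. This is only a sketch of the programming model, not Hadoop code: `run_job`, `group_by_key`, and the sample splits are hypothetical names and data chosen for the illustration.

```python
from collections import defaultdict
from itertools import chain


def map_fn(key, value):
    # map (k, v) -> list(k, v): emit (word, 1) for every word in a line.
    return [(word, 1) for word in value.split()]


def combine_fn(key, values):
    # combine (k, list(v)) -> list(k, v): partial sum on the map side.
    return [(key, sum(values))]


def reduce_fn(key, values):
    # reduce (k, list(v)) -> list(k, v): final sum per key.
    return [(key, sum(values))]


def group_by_key(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups


def run_job(splits):
    shuffled = []
    for split in splits:  # one (simulated) mapper per split
        mapped = list(chain.from_iterable(map_fn(k, v) for k, v in split))
        # The combiner only sees the local output of one mapper.
        for k, vs in group_by_key(mapped).items():
            shuffled.extend(combine_fn(k, vs))
    output = {}
    # Shuffle and merge: group all map outputs by key, then reduce.
    for k, vs in sorted(group_by_key(shuffled).items()):
        for out_k, out_v in reduce_fn(k, vs):
            output[out_k] = out_v
    return output


splits = [[(0, "a b a")], [(0, "b c")], [(0, "a")]]
result = run_job(splits)
print(result)  # {'a': 3, 'b': 2, 'c': 1}
```

The combiner is optional: dropping it changes how many pairs are shuffled, not the final result.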

Page 5: Adaptive MapReduce using Situation-Aware Mappers

Motivation: MapReduce Issues

MapReduce
  Parallel data-processing framework
  Open-source implementation (Hadoop)
  Simple programming environment

MapReduce: “simplicity over performance”
  Limited choice of execution strategies:
    Mappers checkpoint after every split
    Map outputs are sorted and written to file
    Reducers read statically predetermined partitions

Page 6: Adaptive MapReduce using Situation-Aware Mappers

Solutions to MapReduce Issues

MapReduce-inspired alternatives
  Dryad (Microsoft)
  Spark (UC Berkeley)
  Hyracks (UC Irvine)
  Nephele (TU Berlin)

Have more choices in runtime execution

Page 7: Adaptive MapReduce using Situation-Aware Mappers

Our Solution: Adaptive MapReduce

Make MapReduce (Hadoop) more flexible
Leverage existing investment in:
  Framework (Hadoop)
  Query processing systems (Jaql, Pig, Hive)

Techniques for:
  Dynamic checkpoint intervals (Map)
  Best-effort hash-based aggregation (Combine)
  Dynamic, sample-based partitioning (Reduce)

Performance tuning:
  Cardinality and cost estimation (due to UDFs)
  Adaptive to runtime environment

Page 8: Adaptive MapReduce using Situation-Aware Mappers

Problem Statement: Adaptive MapReduce

Goals
  Improve MapReduce (Hadoop) performance by:
    New runtime options
    Adaptive to runtime environment
  Preserve Hadoop’s
    Fault-tolerance
    Scalability
    Programmability

Page 10: Adaptive MapReduce using Situation-Aware Mappers

Situation-Aware Mappers

Main idea
  Make MapReduce more dynamic

Mappers:
  Aware of the global state of the job
  Communicate through a distributed meta-data store
  Break assumption: isolation

Situation-Aware Mappers

Page 15: Adaptive MapReduce using Situation-Aware Mappers

Adaptive MapReduce

[Figure: MapReduce data flow (DFS, three MAP tasks, two REDUCE tasks), shown first in its regular form and then extended with a distributed meta-data store (DMDS) and the adaptive techniques AM, AS, AP, and AC.]

Adaptive Techniques
  AM: Adaptive Mappers
  AC: Adaptive Combiners
  AS: Adaptive Sampling
  AP: Adaptive Partitioning

Distributed Meta-Data Store (DMDS)
  Distributed read/write
  Transactional
  e.g., ZooKeeper

Page 18: Adaptive MapReduce using Situation-Aware Mappers

Adaptive Mappers Motivation

Input data is divided into splits
One-to-one correspondence of mappers and splits
AM decouple # splits from # mappers

[Figure: mapper timelines. Legend: startup cost (e.g., scheduling, loading reference data) and split processing cost.]

Approach           Startup cost   Workload
Small splits       Large          Balanced
Large splits       Small          Imbalanced
Adaptive Mappers   Small          Balanced
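
The trade-off above can be made concrete with a toy scheduling model. It is a sketch under stated assumptions, not the authors' evaluation: the split costs, the startup cost of 2, the 4 slots, and the function names `makespan`, `regular`, and `adaptive` are all made up for the illustration. Regular mappers pay the startup cost once per split, while adaptive mappers pay it once per mapper and then pull small splits until none are left.

```python
import heapq


def makespan(task_costs, slots):
    # Greedy list scheduling: the next task goes to the slot that frees up first.
    finish = [0] * slots
    heapq.heapify(finish)
    for cost in task_costs:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + cost)
    return max(finish)


def regular(split_costs, slots, startup):
    # One mapper per split: every split pays the startup cost.
    return makespan([startup + c for c in split_costs], slots)


def adaptive(split_costs, slots, startup):
    # One mapper per slot: startup is paid once, then splits are pulled on demand.
    return startup + makespan(split_costs, slots)


# Hypothetical skewed input: one expensive small split among 16.
small = [9] + [1] * 15
# Large splits are groups of four consecutive small splits.
large = [sum(small[i:i + 4]) for i in range(0, len(small), 4)]

print(regular(large, slots=4, startup=2))   # 14: low startup, imbalanced
print(regular(small, slots=4, startup=2))   # 15: balanced, startup paid 16 times
print(adaptive(small, slots=4, startup=2))  # 11: low startup and balanced
```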

Page 20: Adaptive MapReduce using Situation-Aware Mappers

Adaptive Mappers Algorithm

[Figure: the Adaptive Mappers algorithm, built up over numbered steps 1-5. Recoverable labels: a MapReduce Client; a ZooKeeper tree (Root → JobID → locations → Host1 [Split1, Split2, ...], Host2 ...; assigned → Split1 {Map2}); mappers Map1 and Map2 in their Init phase on Host1 and Host2; step 4 next to "assigned Split1 {Map2}"; step 5 next to Split1, labeled OK/Fail.]

Store meta-data in ZooKeeper
Implemented as a new InputFormat

Page 25: Adaptive MapReduce using Situation-Aware Mappers

Adaptive Mappers Algorithm

Additional Features
  Process local splits first, then remote splits
  Fault tolerance
    Restarted task unlocks splits
    Split reprocessing is shared
  Scheduler aware (FIFO, FAIR, and FLEX)
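
The split-claiming idea from the algorithm slides, together with the local-first order and the unlock-on-restart behaviour listed above, can be sketched as follows. This is an illustration under assumptions, not the authors' InputFormat: `MetaStore` is a hypothetical in-memory stand-in for the meta-data store (it is not the ZooKeeper client API), and the path layout and function names are invented for the example. The only property it relies on is an atomic "create if absent" operation.

```python
import threading


class MetaStore:
    """Hypothetical in-memory stand-in for a distributed meta-data store."""

    def __init__(self):
        self._lock = threading.Lock()
        self._nodes = {}

    def create(self, path, value):
        # Atomic create: succeeds only if the node does not exist yet.
        with self._lock:
            if path in self._nodes:
                return False
            self._nodes[path] = value
            return True

    def get(self, path):
        with self._lock:
            return self._nodes.get(path)

    def delete_owned_by(self, prefix, owner):
        # Remove every node under `prefix` whose value is `owner`.
        with self._lock:
            stale = [p for p, v in self._nodes.items()
                     if p.startswith(prefix) and v == owner]
            for p in stale:
                del self._nodes[p]
            return len(stale)


def publish_locations(store, job, locations):
    # The client publishes which host stores which splits.
    store.create(f"/{job}/locations", locations)


def claim_next_split(store, job, mapper, host):
    # Try local splits first, then remote ones; the atomic create is the lock.
    locations = store.get(f"/{job}/locations")
    local = locations.get(host, [])
    remote = [s for h, splits in locations.items() if h != host for s in splits]
    for split in local + remote:
        if store.create(f"/{job}/assigned/{split}", mapper):
            return split  # "OK": this mapper now owns the split
    return None  # every split is already assigned


def unlock_splits(store, job, mapper):
    # A restarted task releases the splits its failed attempt had locked,
    # so any mapper can pick them up again.
    return store.delete_owned_by(f"/{job}/assigned/", mapper)


store = MetaStore()
publish_locations(store, "job1", {"host1": ["s1", "s2"], "host2": ["s3"]})
first = claim_next_split(store, "job1", "map2", "host2")   # local split
second = claim_next_split(store, "job1", "map2", "host2")  # remote split
third = claim_next_split(store, "job1", "map1", "host1")
released = unlock_splits(store, "job1", "map2")
fourth = claim_next_split(store, "job1", "map1", "host1")
print(first, second, third, released, fourth)  # s3 s1 s2 2 s1
```

Because the claim is a single atomic create, two mappers racing for the same split cannot both succeed; the loser simply moves on to the next candidate.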


Page 26: Adaptive MapReduce using Situation-Aware Mappers

Experimental Setting

Hardware
  40-node IBM System x iDataPlex dx340
  Two quad-core Intel Xeon E5540 64-bit 2.83GHz
  32GB RAM
  Four SATA disks
  160 map and 160 reduce slots

Software
  Ubuntu Linux, kernel 2.6.32-24 64-bit server edition
  Java 1.6 64-bit server edition
  Hadoop 0.20.2
  ZooKeeper 3.3.1

Page 27: Adaptive MapReduce using Situation-Aware Mappers

Start-up Cost vs. ZooKeeper Overhead

[Figure: running time (seconds) vs. number of splits (20, 200, 2000) for Regular Mappers and Adaptive Mappers.]

2000 1-byte records
Sleep 1s/record
5 nodes, 20 map slots
20-2000 Regular Mappers
20 Adaptive Mappers

Small ZooKeeper overhead
Large map startup cost, ∼2s/map
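
A back-of-envelope model shows why the startup cost dominates as the number of splits grows. The sketch below uses only the parameters stated on the slide (2000 records, 1s/record, 20 map slots, roughly 2s startup per map task); the formula itself is a simplifying assumption (full waves of equal tasks, no ZooKeeper or scheduling overhead), so its outputs are estimates, not the measured results.

```python
import math


def regular_time(num_splits, records=2000, sec_per_record=1.0,
                 slots=20, startup=2.0):
    # Regular mappers: one task per split, executed in waves of `slots` tasks.
    waves = math.ceil(num_splits / slots)
    per_task = startup + (records / num_splits) * sec_per_record
    return waves * per_task


def adaptive_time(records=2000, sec_per_record=1.0, mappers=20, startup=2.0):
    # Adaptive mappers: startup is paid once per mapper, whatever the split count.
    return startup + (records / mappers) * sec_per_record


for n in (20, 200, 2000):
    print(n, regular_time(n), adaptive_time())
# 20 102.0 102.0
# 200 120.0 102.0
# 2000 300.0 102.0
```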


Page 28: Adaptive MapReduce using Situation-Aware Mappers

Adaptive Mappers Workloads

1 Set-Similarity Join [Vernica et al., 2010]
  Publication datasets
  DBLP: 1.2M records, 310MB
  CITESEERX: 1.3M records, 1,750MB
  Increased to ×10 and ×100

2 JOIN
  Single dataset (“fact” table), Sort Benchmark data generator
  Fan-out coefficient (“dimension” table)
  Average join fan-out 1 : 30
  TERASORT: 1B records, 93GB

Page 29: Adaptive MapReduce using Situation-Aware Mappers

Adaptive Mappers Experiments - Set-Similarity Join

[Figure: running time (seconds) vs. split size (2048, 1024, 512, 256, 128, 64, 32 MB) for Regular Mappers, compared with Adaptive Mappers (AM).]

Stage 3: One-Phase Record Join
  Broadcast join equivalent
  DBLP and CITESEERX ×10
  Single wave of AM

×3 speedup over default Hadoop split size (64MB)
Optimal with no tuning

Page 30: Adaptive MapReduce using Situation-Aware Mappers

Adaptive Mappers Experiments - JOIN

[Figure: running time (seconds) vs. split size (1024, 512, 256, 128, 64, 32, 16, 8 MB) for Regular Mappers, compared with Adaptive Mappers (AM).]

Map-only job
1B TERASORT records
Models a skewed join
Single wave of AM

Regular Mappers:
  Large split: data skew
  Small split: scheduling and start-up overhead

Optimal with no tuning

Page 32: Adaptive MapReduce using Situation-Aware Mappers

Adaptive Combiners

Main idea
  Replace sort with hashing
  Reduce serialization, sort, and IO

[Figure: the Regular Combiners pipeline (Map, Sort Buffer, Sort, Combine, Merge) compared with Adaptive Combiners, which use a Hash-group and Combine step. The legend distinguishes user code from data.]

Page 38: Adaptive MapReduce using Situation-Aware Mappers

Adaptive Combiners Details

“Best-effort” aggregation
Never spill to disk
Hash-table replacement policies:
  No-Replacement (NR)
  Least-Recently-Used (LRU)

Implemented as:
  Library for Hadoop
  Optimization choice for Jaql
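
The two replacement policies can be sketched as a bounded in-memory cache of partial aggregates. This is a minimal illustration, not the Hadoop library or the Jaql optimization mentioned above: the class name `CombinerCache`, its parameters, and the `emit` callback are assumptions made for the example. Whatever the cache cannot hold is passed downstream instead of being spilled, which is what makes the aggregation "best-effort".

```python
from collections import OrderedDict


class CombinerCache:
    """Bounded hash table of partial aggregates (hypothetical sketch)."""

    def __init__(self, capacity, combine, emit, policy="LRU"):
        if policy not in ("NR", "LRU"):
            raise ValueError("policy must be 'NR' or 'LRU'")
        self.capacity = capacity
        self.combine = combine  # (partial, value) -> partial
        self.emit = emit        # called with (key, value) for downstream output
        self.policy = policy
        self.cache = OrderedDict()

    def add(self, key, value):
        if key in self.cache:
            # Hit: fold the value into the cached partial aggregate.
            self.cache[key] = self.combine(self.cache[key], value)
            if self.policy == "LRU":
                self.cache.move_to_end(key)
        elif len(self.cache) < self.capacity:
            self.cache[key] = value
        elif self.policy == "NR":
            # No-Replacement: cache is full, pass the pair through as-is.
            self.emit(key, value)
        else:
            # LRU: evict the least recently used aggregate and emit it.
            old_key, old_value = self.cache.popitem(last=False)
            self.emit(old_key, old_value)
            self.cache[key] = value

    def flush(self):
        # At the end of the map task, emit whatever is still cached.
        for key, value in self.cache.items():
            self.emit(key, value)
        self.cache.clear()


def run(policy, pairs, capacity=2):
    out = []
    cache = CombinerCache(capacity, lambda a, b: a + b,
                          lambda k, v: out.append((k, v)), policy)
    for k, v in pairs:
        cache.add(k, v)
    cache.flush()
    return out


pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)]
print(run("NR", pairs))   # [('c', 1), ('a', 3), ('b', 2)]
print(run("LRU", pairs))  # [('b', 1), ('c', 1), ('a', 3), ('b', 1)]
```

Both outputs sum to the same per-key totals after the reducer; the policies differ only in how many pairs leave the mapper.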


Page 39: Adaptive MapReduce using Situation-Aware Mappers

Adaptive Combiners Experiments

GROUP-BY
  Synthetic dataset with 3 dimensions (A1, A2, and A3) and 1 fact
  Group records and apply aggregation function
  TWL: 10B records, 120GB

[Figure: GROUP-BY on A1. Running time (seconds) for the configurations Reg., AM, AC, and AM, AC; series: Regular Combiners, Adaptive Combiners NR, Adaptive Combiners LRU. ×2.5 speedup.]

[Figure: GROUP-BY on A1 and A2. Running time (seconds) and miss ratio (%) for Reg., AM, and cache sizes of 1, 25, and 100 (K); series: Regular Combiners, Adaptive Combiners NR, Adaptive Combiners LRU, Miss Ratio NR, Miss Ratio LRU. ×3 speedup.]

Page 41: Adaptive MapReduce using Situation-Aware Mappers

Adaptive Sampling and Partitioning

Step 1 Compute and publish local histogram
Step 2 Collect local histograms and compute partitioning function
Step 3 Broadcast partitioning function

[Figure: MAP and REDUCE tasks exchanging local histograms and the partitioning function through the DMDS, shown step by step.]
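
The three steps can be sketched with key-count histograms and a range partitioner. This is an assumption-laden illustration rather than the authors' implementation: the slide does not say which histogram type or partitioning function is used, so the `Counter`-based histogram, the equal-count range boundaries, and the function names below are all hypothetical. The exchange through the DMDS is replaced by plain function calls.

```python
from bisect import bisect_left
from collections import Counter


def local_histogram(keys):
    # Step 1: each mapper summarizes the keys it has produced so far.
    return Counter(keys)


def partition_boundaries(histograms, num_reducers):
    # Step 2: merge the local histograms and pick range boundaries so that
    # each reducer receives roughly the same number of records.
    merged = Counter()
    for hist in histograms:
        merged.update(hist)
    total = sum(merged.values())
    boundaries = []
    running = 0
    for key in sorted(merged):
        running += merged[key]
        next_target = total * (len(boundaries) + 1)
        if len(boundaries) < num_reducers - 1 and running * num_reducers >= next_target:
            boundaries.append(key)
    return boundaries


def partition(key, boundaries):
    # Step 3: every mapper applies the broadcast boundaries.
    # Keys <= boundaries[i] go to reducer i; larger keys go to the last reducer.
    return bisect_left(boundaries, key)


hist1 = local_histogram([1, 1, 1, 2, 3])
hist2 = local_histogram([1, 1, 4, 5, 6])
bounds = partition_boundaries([hist1, hist2], num_reducers=2)
loads = Counter(partition(k, bounds) for k in [1, 1, 1, 2, 3, 1, 1, 4, 5, 6])
print(bounds, dict(loads))  # [1] {0: 5, 1: 5}
```

A plain hash partitioner could send the frequent key 1 and several other keys to the same reducer; with the sampled boundaries the heavy key gets a partition of its own.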

Page 45: Adaptive MapReduce using Situation-Aware Mappers

Summary

Adaptive runtime techniques for MapReduce
  Situation-Aware Mappers
  Make MapReduce more dynamic

Up to ×3 speedup for well-tuned jobs
Orders of magnitude speedup for badly tuned jobs
Never hurt performance
Configure themselves
Part of IBM InfoSphere BigInsights

Page 46: Adaptive MapReduce using Situation-Aware Mappers

Vernica, R., Carey, M., and Li, C. (2010). Efficient parallel set-similarity joins using MapReduce. In SIGMOD Conference.