Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary:...

Technische Universitat Munchen

Handing Data Skew in MapReduce

B. Gufler♠, N. Augsten♣, A. Reiser♠, A. Kemper♠

♠Technische Universitat Munchen ♣Free University of Bolzano-Bozen

May 8, 2011

Cloud Data Processing with MapReduce

MapReduce for BusinessI usage statistics, index building, . . .I simple reducer tasks

MapReduce for eScience: new challenges

1. heavy data skew

2. complex reducer tasks

; How to deal with these problems?

1. heavy data skew

Data Skew

I values of an attribute not evenly distributedI very common in scientific dataI causes load imbalance if attribute is used for partitioning

Example: Node Masses in Millennium Simulation

400nodes (∗1M)

node mass1–

51–5

501–

5k–5

50k–

Complex Reducer Tasks

I complex scientific analysis algorithmsI often polynomial or even exponential complexity

I amplifies load imbalance

Reduce

6 · 22 = 24 42 + 82 = 80

Reduce

6 · 22 = 24

42 + 82 = 80

Reduce

6 · 22 = 24 42 + 82 = 80

Example: Frequent Subtree Mining

I find common substructures in(large parts of) a forest

I exponential complexity forunordered subtrees

Summary: MapReduce for eScience

ChallengesI load imbalance due to data skewI complex analysis amplifies execution time differencesI example: frequent subtree mining (PathJoin algorithm)

I 16 GB data set (subset of the Millennium simulation)I cluster of 16 nodesI total execution time: 8 hoursI execution time difference of reducers: up to 6 hours

Agenda

I MotivationI Load Balancing in MapReduce: Big PictureI Estimation of Partition CostI Assignment of Partitions to ReducersI Number of PartitionsI Evaluation

Data Partitioning in MapReduce

cluster: all itemswith the samekey

partition: all keyswith the samehash values

hash function fordata partitioning

Load Balancing: Current MapReduceP

P0 P1 cont

I hash partitioning on each mapperI static 1:1 assignment of partitions to

reducersI balances number of keys per

reducer instead of workloadI no flexibility

Improved Load Balancing

I create more partitions than there are reducersI cost-based assignment of partitions to reducers

P0+P2 P1 cont

I flexibility for load balancingI challenges

1. compute number ofpartitions

2. estimate partition cost3. assign partitions to

reducers

Estimate Partition Cost

partition cost depends on1. number and size of the clusters within the partition

I locally monitored on each mapperI centrally aggregated on one controller

2. complexity of the reducer algorithmI specified by the user

; estimate cluster cost based on this information

; sum up cluster costs to obtain partition cost

Estimate Partition Cost: Monitoring

I tuple countI monitor local count on every mapperI sum up on controller

I cluster countI create Bloom filter on every mapperI employ Linear Counting on controller

Estimate Partition Cost: MonitoringM

tuples: 17

clusters: [10011]

tuples: 31

clusters: [01011]

tuples: 17

clusters: [10011]

tuples: 31

clusters: [01011]

tuples: 17

clusters: [10011]

tuples: 31

clusters: [01011]

tuples: 17

clusters: [10011]

tuples: 31

clusters: [01011]

tetuples: 48∑

tuples: 17

clusters: [10011]

tuples: 31

clusters: [01011]

tetuples: 48∑|

tuples: 17

clusters: [10011]

tuples: 31

clusters: [01011]

tetuples: 48∑| lc

K. Whang, B. Vander-Zanden, H. Taylor: A Linear-Time Probabilistic Counting Algorithm for Databa-se Applications, ACM TODS 15(2), 1990

tuples: 17

clusters: [10011]

tuples: 31

clusters: [01011]

tetuples: 48

clusters: 8

∑| lc

K. Whang, B. Vander-Zanden, H. Taylor: A Linear-Time Probabilistic Counting Algorithm for Databa-se Applications, ACM TODS 15(2), 1990

Estimate Partition Cost: Calculation

te tuples: 48

clusters: 8

reducer complexity: quadratic(e. g., pairwise comparison)

I average cluster size: 488 = 6

I cluster cost: 62 = 36I partition cost: 8 · 36 = 288

te tuples: 48

clusters: 8

te tuples: 48

clusters: 8

I cluster cost: 62 = 36

I partition cost: 8 · 36 = 288

te tuples: 48

clusters: 8

Assign Partitions to Reducers

I assign partitions to reducers balancing the partition costI bin packing problem, but: NP-hardI rely on a heuristic instead

50 40 20 15 3

reducer 1 reducer 20 0

50 40 20 15 3

reducer 1 reducer 2050

50 40 20 15 3

reducer 1 reducer 26065

50 40 20 15 3

How many Partitions?

I static approachI number of partitions fixed beforehand by controllerI matching partitions created on every mapper

I dynamic approachI only initial number of partitions fixedI mappers may split some partitions further

How many Partitions?

I static approachI number of partitions fixed beforehand by controllerI matching partitions created on every mapper

I dynamic approachI only initial number of partitions fixedI mappers may split some partitions further

Evaluation

I load balancing qualityI influence on execution time

Synthetic Data SetsI 520M tuplesI 2 000 clustersI Zipf distributions

Millennium Data SetI 760M tuplesI ≈ 32 000 clusters

Evaluation: Load Balancing

I measure standard deviation of partition cost per reducerI small is good: small differences between reducers

01020304050

10 20 25 50

stdev [% of mean]

reducers

Synthetic Data, Moderate Skew

01020304050

10 20 25 50

stdev [% of mean]

reducers

Millennium Data

Standard MapReduce Fine Partitioning

Evaluation: Execution Time

I measure execution time per reducerI max time: overall time of reduce job

101520

10 20 25 50

Synthetic Execution Time

reducers

Synthetic Data, Moderate Skew

010203040

10 20 25 50

Synthetic Execution Time

reducers

Millennium Data

Standard MapReduce Fine PartitioningTime for Most Expensive Cluster

Summary

I scientific data analysis with MapReduceI partition-based load balancingI distributed monitoring for cost estimationI experimental evaluation

I better load balancingI reduced overall execution time ∑

Thank you!

Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary:...

Documents

Transcript of Handing Data Skew in MapReduce · 2013. 9. 12. · Technische Universitat Munc¨ hen¨ Summary:...

The MUNC Times #1

Skew Angle

Skew Stuff

Implementing Useful Clock Skew Using Skew Groups

New MapReduce and Pregel limits in BigData processingtesson.julien.free.fr/LaMHA/2017/Bamha_LaMHA_2017-03-27.pdf · 2017. 4. 3. · Randomised keys: A solution for data skew in Join

Handling Partitioning Skew in MapReduce using LEEN - … · Handling Partitioning Skew in MapReduce using LEEN 5 1 2 3 k1 k2 k3 k4 k5 k6 k7 k8 k9 Keys output in order of appearance

[HOST. SEXUAL] Inés M. Jelú v. Munc. Guaynabo-Demanda Federal

MapReduce. MapReduce Outline MapReduce Architecture MapReduce Internals MapReduce Examples JobTracker Interface.

SkewTune: Mitigating Skew in MapReduce Applicationsmagda/papers/kwon-sigmod12.pdfreconstructed by concatenation. We implement SkewTune as an extension to Hadoop and evaluate its e

Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals

Gerhard Tutz & Gunther Schauberger - uni-muenchen.deGerhard Tutz & Gunther Schauberger Ludwig-Maximilians-Universit at Munc hen Akademiestra e 1, 80799 Munc hen September 24, 2018

Skew Guide

THE CBOE SKEW INDEX ®SM SKEW®SM · new benchmark, the CBOE Skew Index ® (SKEW). SKEW is a global, strike-independent measure of the slope of the implied volatility curve that increases

MapReduce-MPI Library Users Manualmapreduce.sandia.gov/doc/Manual.pdf · MapReduce-MPI WWW Site - MapReduce-MPI Documentation What is a MapReduce? The canonical example of a MapReduce

Web-scale Entity Annotation Using MapReduce · standard map-reduce implementations like Hadoop, and also recent skew mitigation techniques, will not match up to the workload offered

Skew lines

Anti-Skew: Single-Key Data Skew Mitigation for MapReduce Yue Chen Florida State University Advanced Database Systems.

Technische Universit at Munc hen Fakult at fur Informatik - TUM · Technische Universit at Munc hen Fakult at fur Informatik Bachelor’s Thesis in Wirtschaftsinformatik Modeling

Handling partitioning skew in MapReduce using LEENhebs/pub/shadihandling_ppna13.pdf · Handling partitioning skew in MapReduce using LEEN ... a hash partitioning function is performed,

BRAIDED SKEW MONOIDAL CATEGORIES · skew monoidal categories are also called left skew.) If Cis skew monoidal then there are induced right skew monoidal structures on the opposite