The Power of Choice in Data-Aware Cluster …...The Power of Choice in ! Data-Aware Cluster...

The Power of Choice in !Data-Aware Cluster Scheduling

Shivaram Venkataraman1 , Aurojit Panda1 Ganesh Ananthanarayanan2, Michael Franklin1, Ion Stoica1

1UC Berkeley, 2Microsoft Research

amplab

Trends: Big Data

2 [Kathy Yelick LBNL, VLDBJ 2012, Dhruba Borthakur]

Challenge&1:&Data&is&Big&Projected&Growth&

Increa

se&ove

r&201

0&

0$

10$

20$

30$

40$

50$

60$

2010$ 2011$ 2012$ 2013$ 2014$ 2015$

Moore's$Law$

Overall$Data$

Particle$Accel.$

DNA$Sequencers$

Data$Grows$faster$than$Moore’s$Law$[IDC$report,$Kathy$Yelick,$LBNL]$

Data grows faster than Moore’s Law!

Trends: Big Data

3 [Kathy Yelick LBNL, VLDBJ 2012, Dhruba Borthakur]

Projected&Growth&Increa

se&ove

r&201

0&

0$

10$

20$

30$

40$

50$

60$

2010$ 2011$ 2012$ 2013$ 2014$ 2015$

Moore's$Law$

Overall$Data$

Particle$Accel.$

DNA$Sequencers$

Facebook Hive cluster Last 4 years:

data growth 2500x ! queries/day 60x!

Microsoft Scope Cluster “The number of daily jobs has doubled every six months for the past two years.”

Trends: Low Latency

4

10 min

2004: MapReduce

batch job

2009: Hive

2010: Dremel

2012: In-memory

Spark

1 min

10s 2s

Big Data or Low Latency ? SQL Query : 2.5 TB on 100 machines

> 15 minutes 1 - 5 Minutes

5

< 10s

?

Sampling

6

Applications

Machine learning algorithms stochastic gradient, coordinate descent

Approximate Query Processing blinkdb, presto, minitable

7

Choices

N Any K

8

Choices

N Any K

9

Sampling à Smaller Inputs + Choice

Example

10

M

M

R N = 4 K = 2

1

4

Rack

Available (N) = 2 Required (K) = 2

Time

Available Data Running

Busy

Existing

11

2

3

Unavailable Data

1

4

Rack


Time

Choice-Aware

2

3

12


Busy

1

4

Rack


Time

Choice-Aware

2

3

Launched (M) = 3

13


Busy

KMN Scheduler - How much can KMN improve locality - Propagate benefits across stages - Handling stragglers

14

Job à DAG

KMN Scheduler

One-to-One Many-to-One 15

One-to-One Stages Locality

Disk ~ 100MB/s Network ~ 10 Gbps (~1GB/s) Memory ~ 50GB/s

16

KMN Locality

1 2

1 3 1 4

2 3

2 4

3 4 Any K 1

4

2

3 N

Choices NK!

"#

$

%&

17

Locality, K=100

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

Prob

. of L

ocali

ty

Utilization

K/N=1.0 K/N=0.5 K/N=0.1 K – Number of blocks chosen N – Number of blocks available

18

KMN significantly improves locality

Many-to-One Stages

KMN Scheduler

19

Many-to-One Stage

M1

M2

M4

M5

M3

R3

R2

R1

15 transfers

20

Many-To-One Transfers

R3

M1 M2 M4 M5

R2

M3

Core

R1

21

M1R3

M1R2

M2R3

M2 R2

M3 R3

M3 R3

Bottleneck Link

R3

M1 M2 M4 M5

R2

M3

6

Core

2 4 2

R1

4 2

Bottleneck Link Link with Max. transfers

Cross Rack Data Skew

Minimum transfers Maximum transfers

22

=62= 3

Facebook Trace Cross Rack Data Skew

Minimum transfers

Maximum transfers

0 0.2 0.4 0.6 0.8

1

0 5 10 15 20 25 30

CDF

Cross Rack Data Skew

<50 tasks 50-150 tasks >150 tasks

23

Power of Choice Load balancing: balls and bins Insight: Run extra tasks (M > K) M1

M2

M4 M5

M3

M6 M7

Cross Rack Data Skew = 3

24

Power of Choice

Technique: Spread out choice of K tasks to reduce skew

M1

M2

M4 M5

M3

M6 M7

M = 7, K = 5 Cross Rack Data Skew = 2

25

26

Handling Stragglers M1

M2

M4

M5

M3

M6

M7

Rack

Rack

Stragglers

vs.

Cross-Rack Data Skew

Time

Using KMN

// Create Spark RDD file = sc.textFile(“tpc-‐h.data”) // Select a 10% sample using KMN sample = file.blockSample(0.1) // RDD operations sample.map { li => (li.linestatus, li.quantity) }.collect()

27

Also in the paper

User-defined sampling functions Placing reduce tasks

28

Killing extra tasks

Evaluation

Baseline: Use a pre-selected random sample Setup: 100 m2.4xlarge EC2 machines, 60GB RAM/mc

Facebook traces replay Long DAGs (Stochastic Gradient Descent) SQL queries from Conviva Reducer placement Varying Utilization

29

Facebook Overall

0 10 20 30 40 50

0-10

11-100

>100

Job Completion Time (s)

Job

Size

Baseline KMN-M/K=1.05

30

Cross Rack Skew

0 10 20 30

<=4

4-to-8

>8

Shuffle Stage Time (seconds)

Cros

s Ra

ck S

kew

Baseline KMN-M/K=1.0 KMN-M/K=1.05 KMN-M/K=1.1

31

How many extra tasks ?

32

0

0.2

0.4

0.6

0.8

1

0 10 20 Cross-Rack Skew

M/K=1.0 M/K=1.1 M/K=2.0

0

0.2

0.4

0.6

0.8

1

0 10 20 30 Cross Rack Skew

M/K=1.0 M/K=1.1 M/K=2.0

50 – 150 tasks > 150 tasks

KMN: How many stages ?

33 Gradient

Aggregate1

Aggregate2

Aggregate3

Stochastic Gradient Descent


34 Gradient

Aggregate1

Aggregate2

Aggregate3

KMN Stages Time (s)

Gradient 15.27


35 Gradient

Aggregate1

Aggregate2

Aggregate3

KMN Stages Time (s)

Gradient 15.27

Gradient + Agg1 12.72


36 Gradient

Aggregate1

Aggregate2

Aggregate3

KMN Stages Time (s)

Gradient 15.27




37 Gradient

Aggregate1

Aggregate2

Aggregate3

KMN Stages Time (s)

Gradient 15.27




Related Work Power of Choice

Power-of-Two choices [TPDS’01] Sparrow [SOSP’13]

Improving Cluster Scheduling

Quincy [SOSP’09] alsched [SOCC’12] Dolly [NSDI’13]

38

KMN Scheduler

Emerging applications: ML algorithms, AQP Improves locality, Balances network transfers

39

N Any K

The Power of Choice in Data-Aware Cluster …...The Power of Choice in ! Data-Aware Cluster...

Documents

Transcript of The Power of Choice in Data-Aware Cluster …...The Power of Choice in ! Data-Aware Cluster...