The Power of Choice in Data-Aware Cluster …...The Power of Choice in ! Data-Aware Cluster...

39
The Power of Choice in Data-Aware Cluster Scheduling Shivaram Venkataraman 1 , Aurojit Panda 1 Ganesh Ananthanarayanan 2 , Michael Franklin 1 , Ion Stoica 1 1 UC Berkeley, 2 Microsoft Research amplab

Transcript of The Power of Choice in Data-Aware Cluster …...The Power of Choice in ! Data-Aware Cluster...

The Power of Choice in !Data-Aware Cluster Scheduling

Shivaram Venkataraman1 , Aurojit Panda1 Ganesh Ananthanarayanan2, Michael Franklin1, Ion Stoica1

1UC Berkeley, 2Microsoft Research

amplab

Trends: Big Data

2  [Kathy Yelick LBNL, VLDBJ 2012, Dhruba Borthakur]

Challenge&1:&Data&is&Big&Projected&Growth&

Increa

se&ove

r&201

0&

0$

10$

20$

30$

40$

50$

60$

2010$ 2011$ 2012$ 2013$ 2014$ 2015$

Moore's$Law$

Overall$Data$

Particle$Accel.$

DNA$Sequencers$

Data$Grows$faster$than$Moore’s$Law$[IDC$report,$Kathy$Yelick,$LBNL]$

Data grows faster than Moore’s Law!

Trends: Big Data

3  [Kathy Yelick LBNL, VLDBJ 2012, Dhruba Borthakur]

Projected&Growth&Increa

se&ove

r&201

0&

0$

10$

20$

30$

40$

50$

60$

2010$ 2011$ 2012$ 2013$ 2014$ 2015$

Moore's$Law$

Overall$Data$

Particle$Accel.$

DNA$Sequencers$

Facebook Hive cluster Last 4 years:

data growth 2500x ! queries/day 60x!

Microsoft Scope Cluster “The number of daily jobs has doubled every six months for the past two years.”

Trends: Low Latency

4  

10 min

2004: MapReduce

batch job

2009: Hive

2010: Dremel

2012: In-memory

Spark

1 min

10s 2s

Big Data or Low Latency ? SQL Query : 2.5 TB on 100 machines

> 15 minutes 1 - 5 Minutes

5  

< 10s

?

Sampling

6  

Applications

Machine learning algorithms stochastic gradient, coordinate descent

Approximate Query Processing blinkdb, presto, minitable

7  

Choices

N Any K

8  

Choices

N Any K

9  

Sampling à Smaller Inputs + Choice

Example

10  

M

M

R N = 4 K = 2

1

4

Rack

Available (N) = 2 Required (K) = 2

Time

Available Data Running

Busy

Existing

11  

2

3

Unavailable Data

1

4

Rack

Available (N) = 4 Required (K) = 2

Time

Choice-Aware

2

3

12  

Available Data Running

Busy

1

4

Rack

Available (N) = 4 Required (K) = 2

Time

Choice-Aware

2

3

Launched (M) = 3

13  

Available Data Running

Busy

KMN Scheduler - How much can KMN improve locality - Propagate benefits across stages - Handling stragglers

14  

Job à DAG

KMN Scheduler

One-to-One Many-to-One 15  

One-to-One Stages Locality

Disk ~ 100MB/s Network ~ 10 Gbps (~1GB/s) Memory ~ 50GB/s

16  

KMN Locality

1   2  

1   3   1   4  

2   3  

2   4  

3   4  Any K 1  

4  

2  

3  N

Choices NK!

"#

$

%&

17  

Locality, K=100

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

Prob

. of L

ocali

ty

Utilization

K/N=1.0 K/N=0.5 K/N=0.1 K – Number of blocks chosen N – Number of blocks available

18  

KMN significantly improves locality

Many-to-One Stages

KMN Scheduler

19  

Many-to-One Stage

M1

M2

M4

M5

M3

R3

R2

R1

15 transfers

20  

Many-To-One Transfers

R3

M1 M2 M4 M5

R2

M3

Core

R1

21  

M1R3

M1R2

M2R3

M2 R2

M3 R3

M3 R3

Bottleneck Link

R3

M1 M2 M4 M5

R2

M3

6

Core

2 4 2

R1

4 2

Bottleneck Link Link with Max. transfers

Cross Rack Data Skew

Minimum transfers Maximum transfers

22  

=62= 3

Facebook Trace Cross Rack Data Skew

Minimum transfers

Maximum transfers

0 0.2 0.4 0.6 0.8

1

0 5 10 15 20 25 30

CDF

Cross Rack Data Skew

<50 tasks 50-150 tasks >150 tasks

23  

Power of Choice Load balancing: balls and bins Insight: Run extra tasks (M > K) M1

M2

M4 M5

M3

M6 M7

Cross Rack Data Skew = 3

24  

Power of Choice

Technique: Spread out choice of K tasks to reduce skew

M1

M2

M4 M5

M3

M6 M7

M = 7, K = 5 Cross Rack Data Skew = 2

25  

26  

Handling Stragglers M1

M2

M4

M5

M3

M6

M7

Rack

Rack

Stragglers

vs.

Cross-Rack Data Skew

Time

Using KMN

//  Create  Spark  RDD  file  =  sc.textFile(“tpc-­‐h.data”)    //  Select  a  10%  sample  using  KMN  sample  =  file.blockSample(0.1)    //  RDD  operations    sample.map  {  li  =>      (li.linestatus,  li.quantity)  }.collect()  

27  

Also in the paper

User-defined sampling functions Placing reduce tasks

28  

Killing extra tasks

Evaluation

Baseline: Use a pre-selected random sample Setup: 100 m2.4xlarge EC2 machines, 60GB RAM/mc

Facebook traces replay Long DAGs (Stochastic Gradient Descent) SQL queries from Conviva Reducer placement Varying Utilization

29  

Facebook Overall

0 10 20 30 40 50

0-10

11-100

>100

Job Completion Time (s)

Job

Size

Baseline KMN-M/K=1.05

30  

Cross Rack Skew

0 10 20 30

<=4

4-to-8

>8

Shuffle Stage Time (seconds)

Cros

s Ra

ck S

kew

Baseline KMN-M/K=1.0 KMN-M/K=1.05 KMN-M/K=1.1

31  

How many extra tasks ?

32  

0

0.2

0.4

0.6

0.8

1

0 10 20 Cross-Rack Skew

M/K=1.0 M/K=1.1 M/K=2.0

0

0.2

0.4

0.6

0.8

1

0 10 20 30 Cross Rack Skew

M/K=1.0 M/K=1.1 M/K=2.0

50 – 150 tasks > 150 tasks

KMN: How many stages ?

33  Gradient

Aggregate1

Aggregate2

Aggregate3

Stochastic Gradient Descent

KMN: How many stages ?

34  Gradient

Aggregate1

Aggregate2

Aggregate3

KMN Stages Time (s)

Gradient 15.27

KMN: How many stages ?

35  Gradient

Aggregate1

Aggregate2

Aggregate3

KMN Stages Time (s)

Gradient 15.27

Gradient + Agg1 12.72

KMN: How many stages ?

36  Gradient

Aggregate1

Aggregate2

Aggregate3

KMN Stages Time (s)

Gradient 15.27

Gradient + Agg1 12.72

Gradient + Agg2 11.79

KMN: How many stages ?

37  Gradient

Aggregate1

Aggregate2

Aggregate3

KMN Stages Time (s)

Gradient 15.27

Gradient + Agg1 12.72

Gradient + Agg2 11.79

Gradient + Agg3 12.09

Related Work Power of Choice

Power-of-Two choices [TPDS’01] Sparrow [SOSP’13]

Improving Cluster Scheduling

Quincy [SOSP’09] alsched [SOCC’12] Dolly [NSDI’13]

38  

KMN Scheduler

Emerging applications: ML algorithms, AQP Improves locality, Balances network transfers

39  

N Any K