Preemptive ReduceTask Scheduling for Fair and Fast Job ... · Preemptive ReduceTask Scheduling for...

ICAC 2013 - 1

Preemptive ReduceTask Scheduling for Fair and Fast Job Completion

Yandong Wang, Jian Tan, Weikuan Yu,

Li Zhang, Xiaoqiao Meng

ICAC 2013 - 2

•  Background & Motivation

•  Issues in Hadoop Scheduler

•  Preemptive ReduceTask

•  Fair Completion Scheduler

•  Performance Evaluation

•  Conclusion and Future Work

Outline

ICAC 2013 - 3

Overview

•  MapReduce is a programming model for processing massive-scale data.

–  Hadoop: Open-source implementation of MapReduce

•  Hadoop has been widely adopted by leading companies.

–  Providing high scalability and strong fault tolerance.

•  Data consolidation can be highly beneficial.

–  Co-location of disparate data sets and avoiding data replication cost.

•  Mixed workloads of long batch jobs and small interactive queries.

–  Interactive queries are expected to return quickly.

–  Hadoop Fair Scheduler was introduced to allow fair sharing among concurrent jobs.

ICAC 2013 - 4

Split

Split

Split

MapTask

MapTask

MapTask

Map

ReduceTask

ReduceTask

Output

Output

Shuffle Reduce

•  Hadoop schedulers strive to overlap the map and shuffle phases to accelerate data processing pipeline.

High-Level Hadoop Overview

ICAC 2013 - 5

•  A widely used Hadoop scheduler for sharing a Hadoop cluster.

•  Providing fairness among concurrently running jobs via max-min fair sharing.

•  Delay scheduling policy are used to provide data locality awareness.

•  Tasks occupy slots until successful completion or failure.

Hadoop Fair Scheduler

ICAC 2013 - 6







Outline

ICAC 2013 - 7

•  On average, last 5 small jobs are severely slowed down by 15×.

Unfair Reduce Slots Allocation

Map slots are fairly shared by 6 jobs

Job4 is slowed down by 19×

•  Monopolizing behavior of long ReduceTasks from the large job (Job3).

ICAC 2013 - 8

Distinctions MapTask ReduceTask

Execution Time Short-lived Long-lived

Execution Phase Single-phase Multi-phase

Execution Dependency None Map phase

Distinct Execution Pattern between Map and Reduce Tasks

•  Current Hadoop schedulers treat map and reduce tasks similarly.

ICAC 2013 - 9

Distinctions MapTask ReduceTask

Execution Time Short-lived Long-lived

Execution Phase Single-phase Multi-phase

Execution Dependency None Map phase

Distinct Execution Pattern between Map and Reduce Tasks

•  Current Hadoop schedulers treat map and reduce tasks similarly.

It is critical for Hadoop schedulers to be aware of these different patterns.

ICAC 2013 - 10

•  Hadoop introduces slow start [1]

–  Mitigating the starvation but at the cost of slowing down the data processing pipeline.

–  Impacting the execution time of small jobs.

•  Coupling scheduling policy from IBM[2]

–  Similar to slow start which let monopolization progressively happen •  Copy-Compute Splitting[3]

–  Performance is unknown, no results was reported.

Existing Efforts

[1]: “mapred.reduce.slowstart.completed.maps” . [2]: Jian Tan, Xiaoqiao Meng, Li Zhang, “Coupling scheduler for MapReduce/Hadoop”, HPDC’12. [3]: “Job Scheduling for Multi-User MapReduce Cluster”, Berkeley, Technical Report UCB/EECS-2009-55.

ICAC 2013 - 11

How to achieve both high Efficiency and Fairness ?

•  How to tackle monopolizing behavior of long running ReduceTasks ? –  Existing schedulers ignore long-lasting ReduceTasks, once they are launched, they

occupy resource until completion or failure.

–  Introducing a new mechanism: Preemptive ReduceTask.

•  How to coordinate two-phase job scheduling ? –  MapReduce adopts two-phase scheme (map and reduce) to schedule tasks. However

less contemplation has been given to coordinate them.

–  A new scheduler: Fair Completion Scheduler.

Fundamental Solutions

ICAC 2013 - 12







Outline

ICAC 2013 - 13

•  Lightweight work-conserving preemption mechanism.

–  Preserving previous computation and I/O.

–  Providing lightweight preemption with no noticeable performance impact.

•  Different from Linux process suspend commend (“Kill -STOP $PID”).

–  Preemptive ReduceTask releases the reduce slot.

•  Superior to current killing preemption mechanism.

–  Killing can lead to significant waste of computation and I/O.

Preemptive ReduceTask

ICAC 2013 - 14

TaskTracker

Heap

Seg

men

t

Merge

seg seg seg Heap

seg

Retrieve

R1: Before Preempt R1: After Resume

Index

Seg

men

t

Preemption During Shuffle Phase •  Only merging the in-memory intermediate data, while maintaining on-disk

intermediate data untouched.

ICAC 2013 - 15

R1: Before Preempt

MPQ

R1: After Resume

MPQ

DFS

Index

Ret

rieve

offset

Preemption During Reduce Phase •  Recording the current offset of each segment and minimum priority queue

•  Preemption occurs at the boundary of intermediate <key,value> pairs.

TaskTracker

ICAC 2013 - 16

Evaluation of Preemptive ReduceTask

3000

4000

5000

6000

7000

10% 30% 50% 70% 90%

Job

Exec

utio

n Ti

me

(sec

)

Finished Percentage of ReduceTask Workload

Work-Conserving Preemption Killing Preemption No Preemption (Baseline)

•  Terasort benchmark with 512GB input data on a cluster of 20 worker nodes.

ICAC 2013 - 17







Outline

ICAC 2013 - 18

•  Prioritizing ReduceTasks from jobs with the shortest remaining map phases. –  Allowing small jobs to preempt long-running ReduceTasks from large jobs.

–  MapTask scheduling follows max-min fair sharing policy.

•  When remaining map phases are equal, prioritizing ReduceTasks from jobs with least remaining reduce data.

•  Detecting the job execution slowdown caused by preemptions. –  Preventing ReduceTasks of large jobs from being preempted for too long and too many

times.

Fair Completion Scheduler

ICAC 2013 - 19

Slave Node 1

FCS

R1 of J1 R2 of J1

Slave Node 2

R3 of J1 R1 of J2

Slave Node 3

R4 of J1 R5 of J1

Slave Node 4

R6 of J1 R2 of J2

Job J1

Remaining Map Phase

Remaining Reduce Data Reduce

J1 1000 s 100GB 6

J2 200 s 10GB 2

Job J2

Sort Running Jobs: (1): According to remaining map time (2): According to remaining reduce data

Fair Completion Scheduling Details

ICAC 2013 - 20

Slave Node 1

FCS

R1 of J1 R2 of J1

Slave Node 2

R3 of J1 R1 of J2

Slave Node 3

R4 of J1 R5 of J1

Slave Node 4

R6 of J1 R2 of J2

Job J1

Remaining Map Phase


J1 1000 s 100GB 6

J2 200 s 10GB 2

Job J2 Job J3


J3 80 s 8GB 4


ICAC 2013 - 21

Slave Node 1

FCS

R1 of J1 R2 of J1

Slave Node 2

R3 of J1 R1 of J2

Slave Node 3

R4 of J1 R5 of J1

Slave Node 4

R6 of J1 R2 of J2

Job J1

Remaining Map Phase


J1 1000 s 100GB 6

J2 200 s 10GB 2

Job J2 Job J3


J3 80 s 8GB 4

Preempt ReduceTask of Job1


ICAC 2013 - 22

Slave Node 1

FCS

R1 of J1 R2 of J1

Slave Node 2

R3 of J1 R1 of J2

Slave Node 3

R4 of J1 R5 of J1

Slave Node 4

R6 of J1 R2 of J2

Remaining Map Phase


J1 1000 s 100GB 6

J2 200 s 10GB 2

Job J2


J3 80 s 8GB 4

R1 of J3 R2 of J3 R3 of J3 R4 of J3


Job J1 Job J3

Launch ReduceTasks of Job3

ICAC 2013 - 23

Slave Node 1

FCS

R1 of J1 R2 of J1

Slave Node 2

R3 of J1 R1 of J2

Slave Node 3

R4 of J1 R5 of J1

Slave Node 4

R6 of J1 R2 of J2

Job J1

Remaining Map Phase


J1 800 s 100GB 6

J2 120 s 10GB 2

Job J2


Resume ReduceTasks of Job 1


Job3 completes

ICAC 2013 - 24







Outline

ICAC 2013 - 25

•  Hardware configuration –  A cluster of 46 nodes. 4 2.67GHz hex-core Intel Xeon CPUs, 24GB memory and two

hard disks.

•  Software configuration:

–  Hadoop 1.0.0 and its Fair Scheduler. 8 map slots and 4 reduce slots on each nodes.

•  Gridmix2 and Tarazu benchmarks: –  Map-heavy workload

–  Reduce-heavy workload

–  Scalability evaluation

Testbed and Benchmarks/Metrics

ICAC 2013 - 26

1

10

100

1000

10000

1 2 3 4 5 6 7 8 9 10

FCS HFS

•  FCS reduces average execution time by 31% (171 jobs).

•  Significantly speeds up small jobs, slightly slow down large jobs.

Aver

age

Exe

cutio

n Ti

me

(sec

)

10 Groups of Jobs

1.9 2.4 1.9 2.3

1.6

1.9

1.1 2.2 0.79

Results for Map-heavy Workload

ICAC 2013 - 27

1

10

100

1000

10000

1 2 3 4 5 6 7 8 9 10 Aver

age

Red

uceT

ask

Wai

t Ti

me

(sec

)

FCS HFS

•  Small jobs are benefited from significantly shortened reduce wait time.

•  Waiting time are reduced by 22× for the jobs in the first 6 groups.

Average ReduceTask Wait Time

19.5 12.4

22 21 32.2

1.22 4.5 0.5

0.8

10 Groups of Jobs

27.2

ICAC 2013 - 28

•  FCS controls the preemption frequency to avoid excessive preemptions.

Preemption Frequency

10 Groups of Jobs

0

0.4

0.8

1.2

1.6

1 2 3 4 5 6 7 8 9 10

Pre

empt

ion

Freq

uenc

y

ICAC 2013 - 29

0 2 4 6 8

10 12 14 16 18 20

1 2 3 4 5 6 7 8 9 10

Max

imum

Slo

wdo

wn Fair Completion

Hadoop Fair

•  FCS improves the fairness by 66.7% on average.

•  Achieving nearly uniform maximum slowdown for all groups of jobs.

10 groups of jobs

Fairness Evaluation: Maximum Slowdown

ICAC 2013 - 30

•  FCS reduces average execution time by 28% (171 jobs).

•  FCS accelerates all types of jobs in the reduce-heavy workload. –  Impact of preemption on large job is not heavy due to they are still in map phases.

Aver

age

Exe

cutio

n Ti

me

(sec

)

10 Groups of Jobs

Results for Reduce-heavy Workload

1

10

100

1000

10000

100000

1 2 3 4 5 6 7 8 9 10

FCS HFS

ICAC 2013 - 31

•  FCS improves the fairness by 35.2% on average.

Max

imum

Slo

wdo

wn

10 Groups of Jobs

Fairness of Reduce-heavy Workload

0

5

10

15

20

25

30

1 2 3 4 5 6 7 8 9 10

Fair Completion

Hadoop Fair

ICAC 2013 - 32

0

400

800

1200

1600

60 120 180 240 300

Aver

age

Exe

cutio

n Ti

me

(sec

)

Different Number of Jobs

Fair Completion Hadoop Fair

•  FCS reduces the average execution time by 39.7%.

•  Small improvement at 60 due to dominant number of small jobs.

Scalability Evaluation with GridMix-2

ICAC 2013 - 33





•  System Evaluation


Outline

ICAC 2013 - 34

•  Identify the inefficiencies in existing Hadoop schedulers.

•  Preemptive ReduceTask provides an efficient preemption approach. •  Fair Completion Scheduler is introduced to improve the efficiency and

fairness of the concurrently running jobs.

•  Preemptive ReduceTask provides opportunities to improve the fault tolerance mechanism.

•  More preemptive scheduling policy can be implemented based on Preemptive ReduceTask.

Conclusion and Future Work

ICAC 2013 - 35

Sponsors of Our Research

ICAC 2013 - 36

Thank You and Questions ?

Preemptive ReduceTask Scheduling for Fair and Fast Job ... · Preemptive ReduceTask Scheduling for...

Documents

Transcript of Preemptive ReduceTask Scheduling for Fair and Fast Job ... · Preemptive ReduceTask Scheduling for...