Preemptive ReduceTask Scheduling for Fair and Fast Job ... · Preemptive ReduceTask Scheduling for...
Transcript of Preemptive ReduceTask Scheduling for Fair and Fast Job ... · Preemptive ReduceTask Scheduling for...
ICAC 2013 - 1
Preemptive ReduceTask Scheduling for Fair and Fast Job Completion
Yandong Wang, Jian Tan, Weikuan Yu,
Li Zhang, Xiaoqiao Meng
ICAC 2013 - 2
• Background & Motivation
• Issues in Hadoop Scheduler
• Preemptive ReduceTask
• Fair Completion Scheduler
• Performance Evaluation
• Conclusion and Future Work
Outline
ICAC 2013 - 3
Overview
• MapReduce is a programming model for processing massive-scale data.
– Hadoop: Open-source implementation of MapReduce
• Hadoop has been widely adopted by leading companies.
– Providing high scalability and strong fault tolerance.
• Data consolidation can be highly beneficial.
– Co-location of disparate data sets and avoiding data replication cost.
• Mixed workloads of long batch jobs and small interactive queries.
– Interactive queries are expected to return quickly.
– Hadoop Fair Scheduler was introduced to allow fair sharing among concurrent jobs.
ICAC 2013 - 4
Split
Split
Split
MapTask
MapTask
MapTask
Map
ReduceTask
ReduceTask
Output
Output
Shuffle Reduce
• Hadoop schedulers strive to overlap the map and shuffle phases to accelerate data processing pipeline.
High-Level Hadoop Overview
ICAC 2013 - 5
• A widely used Hadoop scheduler for sharing a Hadoop cluster.
• Providing fairness among concurrently running jobs via max-min fair sharing.
• Delay scheduling policy are used to provide data locality awareness.
• Tasks occupy slots until successful completion or failure.
Hadoop Fair Scheduler
ICAC 2013 - 6
• Background & Motivation
• Issues in Hadoop Scheduler
• Preemptive ReduceTask
• Fair Completion Scheduler
• Performance Evaluation
• Conclusion and Future Work
Outline
ICAC 2013 - 7
• On average, last 5 small jobs are severely slowed down by 15×.
Unfair Reduce Slots Allocation
Map slots are fairly shared by 6 jobs
Job4 is slowed down by 19×
• Monopolizing behavior of long ReduceTasks from the large job (Job3).
ICAC 2013 - 8
Distinctions MapTask ReduceTask
Execution Time Short-lived Long-lived
Execution Phase Single-phase Multi-phase
Execution Dependency None Map phase
Distinct Execution Pattern between Map and Reduce Tasks
• Current Hadoop schedulers treat map and reduce tasks similarly.
ICAC 2013 - 9
Distinctions MapTask ReduceTask
Execution Time Short-lived Long-lived
Execution Phase Single-phase Multi-phase
Execution Dependency None Map phase
Distinct Execution Pattern between Map and Reduce Tasks
• Current Hadoop schedulers treat map and reduce tasks similarly.
It is critical for Hadoop schedulers to be aware of these different patterns.
ICAC 2013 - 10
• Hadoop introduces slow start [1]
– Mitigating the starvation but at the cost of slowing down the data processing pipeline.
– Impacting the execution time of small jobs.
• Coupling scheduling policy from IBM[2]
– Similar to slow start which let monopolization progressively happen • Copy-Compute Splitting[3]
– Performance is unknown, no results was reported.
Existing Efforts
[1]: “mapred.reduce.slowstart.completed.maps” . [2]: Jian Tan, Xiaoqiao Meng, Li Zhang, “Coupling scheduler for MapReduce/Hadoop”, HPDC’12. [3]: “Job Scheduling for Multi-User MapReduce Cluster”, Berkeley, Technical Report UCB/EECS-2009-55.
ICAC 2013 - 11
How to achieve both high Efficiency and Fairness ?
• How to tackle monopolizing behavior of long running ReduceTasks ? – Existing schedulers ignore long-lasting ReduceTasks, once they are launched, they
occupy resource until completion or failure.
– Introducing a new mechanism: Preemptive ReduceTask.
• How to coordinate two-phase job scheduling ? – MapReduce adopts two-phase scheme (map and reduce) to schedule tasks. However
less contemplation has been given to coordinate them.
– A new scheduler: Fair Completion Scheduler.
Fundamental Solutions
ICAC 2013 - 12
• Background & Motivation
• Issues in Hadoop Scheduler
• Preemptive ReduceTask
• Fair Completion Scheduler
• Performance Evaluation
• Conclusion and Future Work
Outline
ICAC 2013 - 13
• Lightweight work-conserving preemption mechanism.
– Preserving previous computation and I/O.
– Providing lightweight preemption with no noticeable performance impact.
• Different from Linux process suspend commend (“Kill -STOP $PID”).
– Preemptive ReduceTask releases the reduce slot.
• Superior to current killing preemption mechanism.
– Killing can lead to significant waste of computation and I/O.
Preemptive ReduceTask
ICAC 2013 - 14
TaskTracker
Heap
Seg
men
t
Merge
seg seg seg Heap
seg
Retrieve
R1: Before Preempt R1: After Resume
Index
Seg
men
t
Preemption During Shuffle Phase • Only merging the in-memory intermediate data, while maintaining on-disk
intermediate data untouched.
ICAC 2013 - 15
R1: Before Preempt
MPQ
R1: After Resume
MPQ
DFS
Index
Ret
rieve
offset
Preemption During Reduce Phase • Recording the current offset of each segment and minimum priority queue
• Preemption occurs at the boundary of intermediate <key,value> pairs.
TaskTracker
ICAC 2013 - 16
Evaluation of Preemptive ReduceTask
3000
4000
5000
6000
7000
10% 30% 50% 70% 90%
Job
Exec
utio
n Ti
me
(sec
)
Finished Percentage of ReduceTask Workload
Work-Conserving Preemption Killing Preemption No Preemption (Baseline)
• Terasort benchmark with 512GB input data on a cluster of 20 worker nodes.
ICAC 2013 - 17
• Background & Motivation
• Issues in Hadoop Scheduler
• Preemptive ReduceTask
• Fair Completion Scheduler
• Performance Evaluation
• Conclusion and Future Work
Outline
ICAC 2013 - 18
• Prioritizing ReduceTasks from jobs with the shortest remaining map phases. – Allowing small jobs to preempt long-running ReduceTasks from large jobs.
– MapTask scheduling follows max-min fair sharing policy.
• When remaining map phases are equal, prioritizing ReduceTasks from jobs with least remaining reduce data.
• Detecting the job execution slowdown caused by preemptions. – Preventing ReduceTasks of large jobs from being preempted for too long and too many
times.
Fair Completion Scheduler
ICAC 2013 - 19
Slave Node 1
FCS
R1 of J1 R2 of J1
Slave Node 2
R3 of J1 R1 of J2
Slave Node 3
R4 of J1 R5 of J1
Slave Node 4
R6 of J1 R2 of J2
Job J1
Remaining Map Phase
Remaining Reduce Data Reduce
J1 1000 s 100GB 6
J2 200 s 10GB 2
Job J2
Sort Running Jobs: (1): According to remaining map time (2): According to remaining reduce data
Fair Completion Scheduling Details
ICAC 2013 - 20
Slave Node 1
FCS
R1 of J1 R2 of J1
Slave Node 2
R3 of J1 R1 of J2
Slave Node 3
R4 of J1 R5 of J1
Slave Node 4
R6 of J1 R2 of J2
Job J1
Remaining Map Phase
Remaining Reduce Data Reduce
J1 1000 s 100GB 6
J2 200 s 10GB 2
Job J2 Job J3
Sort Running Jobs: (1): According to remaining map time (2): According to remaining reduce data
J3 80 s 8GB 4
Fair Completion Scheduling Details
ICAC 2013 - 21
Slave Node 1
FCS
R1 of J1 R2 of J1
Slave Node 2
R3 of J1 R1 of J2
Slave Node 3
R4 of J1 R5 of J1
Slave Node 4
R6 of J1 R2 of J2
Job J1
Remaining Map Phase
Remaining Reduce Data Reduce
J1 1000 s 100GB 6
J2 200 s 10GB 2
Job J2 Job J3
Sort Running Jobs: (1): According to remaining map time (2): According to remaining reduce data
J3 80 s 8GB 4
Preempt ReduceTask of Job1
Fair Completion Scheduling Details
ICAC 2013 - 22
Slave Node 1
FCS
R1 of J1 R2 of J1
Slave Node 2
R3 of J1 R1 of J2
Slave Node 3
R4 of J1 R5 of J1
Slave Node 4
R6 of J1 R2 of J2
Remaining Map Phase
Remaining Reduce Data Reduce
J1 1000 s 100GB 6
J2 200 s 10GB 2
Job J2
Sort Running Jobs: (1): According to remaining map time (2): According to remaining reduce data
J3 80 s 8GB 4
R1 of J3 R2 of J3 R3 of J3 R4 of J3
Fair Completion Scheduling Details
Job J1 Job J3
Launch ReduceTasks of Job3
ICAC 2013 - 23
Slave Node 1
FCS
R1 of J1 R2 of J1
Slave Node 2
R3 of J1 R1 of J2
Slave Node 3
R4 of J1 R5 of J1
Slave Node 4
R6 of J1 R2 of J2
Job J1
Remaining Map Phase
Remaining Reduce Data Reduce
J1 800 s 100GB 6
J2 120 s 10GB 2
Job J2
Sort Running Jobs: (1): According to remaining map time (2): According to remaining reduce data
Resume ReduceTasks of Job 1
Fair Completion Scheduling Details
Job3 completes
ICAC 2013 - 24
• Background & Motivation
• Issues in Hadoop Scheduler
• Preemptive ReduceTask
• Fair Completion Scheduler
• Performance Evaluation
• Conclusion and Future Work
Outline
ICAC 2013 - 25
• Hardware configuration – A cluster of 46 nodes. 4 2.67GHz hex-core Intel Xeon CPUs, 24GB memory and two
hard disks.
• Software configuration:
– Hadoop 1.0.0 and its Fair Scheduler. 8 map slots and 4 reduce slots on each nodes.
• Gridmix2 and Tarazu benchmarks: – Map-heavy workload
– Reduce-heavy workload
– Scalability evaluation
Testbed and Benchmarks/Metrics
ICAC 2013 - 26
1
10
100
1000
10000
1 2 3 4 5 6 7 8 9 10
FCS HFS
• FCS reduces average execution time by 31% (171 jobs).
• Significantly speeds up small jobs, slightly slow down large jobs.
Aver
age
Exe
cutio
n Ti
me
(sec
)
10 Groups of Jobs
1.9 2.4 1.9 2.3
1.6
1.9
1.1 2.2 0.79
Results for Map-heavy Workload
ICAC 2013 - 27
1
10
100
1000
10000
1 2 3 4 5 6 7 8 9 10 Aver
age
Red
uceT
ask
Wai
t Ti
me
(sec
)
FCS HFS
• Small jobs are benefited from significantly shortened reduce wait time.
• Waiting time are reduced by 22× for the jobs in the first 6 groups.
Average ReduceTask Wait Time
19.5 12.4
22 21 32.2
1.22 4.5 0.5
0.8
10 Groups of Jobs
27.2
ICAC 2013 - 28
• FCS controls the preemption frequency to avoid excessive preemptions.
Preemption Frequency
10 Groups of Jobs
0
0.4
0.8
1.2
1.6
1 2 3 4 5 6 7 8 9 10
Pre
empt
ion
Freq
uenc
y
ICAC 2013 - 29
0 2 4 6 8
10 12 14 16 18 20
1 2 3 4 5 6 7 8 9 10
Max
imum
Slo
wdo
wn Fair Completion
Hadoop Fair
• FCS improves the fairness by 66.7% on average.
• Achieving nearly uniform maximum slowdown for all groups of jobs.
10 groups of jobs
Fairness Evaluation: Maximum Slowdown
ICAC 2013 - 30
• FCS reduces average execution time by 28% (171 jobs).
• FCS accelerates all types of jobs in the reduce-heavy workload. – Impact of preemption on large job is not heavy due to they are still in map phases.
Aver
age
Exe
cutio
n Ti
me
(sec
)
10 Groups of Jobs
Results for Reduce-heavy Workload
1
10
100
1000
10000
100000
1 2 3 4 5 6 7 8 9 10
FCS HFS
ICAC 2013 - 31
• FCS improves the fairness by 35.2% on average.
Max
imum
Slo
wdo
wn
10 Groups of Jobs
Fairness of Reduce-heavy Workload
0
5
10
15
20
25
30
1 2 3 4 5 6 7 8 9 10
Fair Completion
Hadoop Fair
ICAC 2013 - 32
0
400
800
1200
1600
60 120 180 240 300
Aver
age
Exe
cutio
n Ti
me
(sec
)
Different Number of Jobs
Fair Completion Hadoop Fair
• FCS reduces the average execution time by 39.7%.
• Small improvement at 60 due to dominant number of small jobs.
Scalability Evaluation with GridMix-2
ICAC 2013 - 33
• Background & Motivation
• Issues in Hadoop Scheduler
• Preemptive ReduceTask
• Fair Completion Scheduler
• System Evaluation
• Conclusion and Future Work
Outline
ICAC 2013 - 34
• Identify the inefficiencies in existing Hadoop schedulers.
• Preemptive ReduceTask provides an efficient preemption approach. • Fair Completion Scheduler is introduced to improve the efficiency and
fairness of the concurrently running jobs.
• Preemptive ReduceTask provides opportunities to improve the fault tolerance mechanism.
• More preemptive scheduling policy can be implemented based on Preemptive ReduceTask.
Conclusion and Future Work
ICAC 2013 - 35
Sponsors of Our Research
ICAC 2013 - 36
Thank You and Questions ?