
Simulation and Performance Evaluation of the Hadoop Capacity Scheduler

Jagmohan Chauhan, Dwight Makaroff and Winfried Grassmann

Department of Computer Science, University of Saskatchewan, Saskatoon, SK, CANADA, S7N 5C9

[email protected], {makaroff, grassman}@cs.usask.ca

Abstract

Hadoop task schedulers like Fair Share and Capacity have been specially designed to share hardware resources among multiple organizations. The Capacity Scheduler provides a complex set of parameters to give fine control over resource allocation of a shared MapReduce cluster. Administrators and users often run into performance problems because they do not understand the performance influence of the task scheduler parameters on MapReduce workloads. Interaction between parameter settings is particularly problematic.

In this paper, we implemented a Capacity Scheduler simulator component, integrated it into an existing simulator and then validated the simulator with small test cases consisting of standard benchmark sort programs. We then studied the impact of Capacity Scheduler parameters on different MapReduce workload submission patterns with a more complex set of benchmark programs. Among other results, we found maxCapacity and minUserLimitPCT, suggested by previous work, to be influential parameters, and that using separate queues for short and long jobs provides the best performance in terms of response ratio, execution time and makespan compared to submitting both types of jobs in the same queue.

1 Introduction

Research and government organizations as well as corporations produce huge amounts of data for varying internal and external purposes, contributing to the world-wide data explosion often referred to as “Big Data”. Applications that access and process Big Data in a parallel manner to produce meaningful information are called data-intensive computing applications. Analysis and processing of Big Data is crucial to these organizations, and data-intensive computing addresses this need. MapReduce [6] was the most popular of the new processing frameworks that emerged. An open-source framework based on MapReduce called Hadoop^1 emerged in 2007 and is used by organizations like Yahoo! and Facebook, and many others. Organizations that do not have the need for a private cluster often utilize a shared cluster on demand [17].

Copyright © 2014 Mr. Jagmohan Chauhan, Dr. Dwight Makaroff, and Dr. Winfried Grassmann. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.

The Task Scheduler is an important part of the MapReduce framework. Initially, MapReduce was designed to handle batch-oriented jobs, and a FIFO task scheduler was suited to handle such jobs. The emergent needs for better data locality, better response time for short jobs, and cluster sharing between different organizations/users led to the development of various other schedulers, like Fair Share,^2 Delay aware [22], Capacity,^3 Quincy [10], etc. Recently, YARN [17] was introduced to the Hadoop framework, which fundamentally redesigns higher-level shared cluster resource management.

Several Capacity Scheduler configuration parameters are provided that affect how the resources in the cluster are shared between organizations and users within an organization. Unfortunately, setting these parameters properly to provide the desired sharing behaviour is a challenging task. Our previous work measuring the influence of Capacity Scheduler parameter settings found that only some parameters have a significant impact on the performance of jobs [4]. It is important to estimate the magnitude of these consequences before making any such changes in a real production system; a Capacity Scheduler simulator would save both time and cost. Simulation can characterize the behaviour of running jobs on a MapReduce cluster across different job submission patterns, job mixes, network topologies and cluster configurations.

^1 http://hadoop.apache.org
^2 http://hadoop.apache.org/common/docs/r0.20.2/fairscheduler.html
^3 http://hadoop.apache.org/common/docs/r0.20.2/capacityscheduler.html

In this paper, we modified an existing simulator named MRPerf [20] and then validated it against a real cluster setup using representative MapReduce applications. The validation experiments exposed the factors which limited prediction accuracy in the simulator. The factors identified can be used to design a better simulator.

We then conducted a simulation study to determine the degree to which the parameters identified as being influential in our previous experimental study [4] affect performance measures when the size and configuration of a cluster and its workload are varied to larger scales. In particular, we study three different job submission settings: (i) a sequence of long jobs followed by a sequence of short jobs, (ii) interleaving jobs, and (iii) separate queues for long jobs and short jobs. For input to the simulation, job types are characterized by a small number of parameters: the number of map/reduce tasks and the size of input and output data. The current design of MRPerf abstracts away other facets of job and task functionality.

Our simulation results confirm that Capacity, maxCapacity, and minUserLimitPCT influence the execution and waiting time, but do not affect the data locality of the tasks scheduled on the processing nodes. Finally, we observe that separating long jobs and short jobs into disjoint queues with proper parameter settings improves the overall execution time (makespan) of the short job queues. These experiments are a case study and not a comprehensive evaluation of the variety of MapReduce use cases.

The remainder of this paper is organized as follows. Section 2 introduces the Capacity Scheduler and its various configuration parameters. Related work is covered in Section 3. Section 4 describes the Capacity Scheduler simulator implementation. In Section 5, we provide the simulator validation results. In Section 6 we present and analyze the results of additional experiments on a simulated medium-sized cluster. Finally, Section 7 shows our conclusions and future work.

2 Hadoop Scheduling

A Hadoop cluster consists of 3 main control components: a dedicated NameNode server to provide file system support, a secondary NameNode, which generates snapshots of the NameNode's metadata, and a JobTracker server to manage job scheduling decisions using a task scheduling algorithm. The task scheduler runs on the JobTracker and decides where and when the tasks of a job will be executed.

There are at least three well-known Hadoop task schedulers:^4

• FIFO: All the users submit jobs to a single queue, and jobs are executed sequentially.

• Fair Scheduler: Resources are evenly allocated to multiple jobs with capacity guarantees. Jobs are placed in pools; each pool has a guaranteed capacity. Excess capacity is allocated between jobs using fair sharing.

• Capacity Scheduler: A number of named queues are configured. Each queue has a guaranteed capacity (i.e. number of map and reduce slots). Any unused capacity is shared between the remaining queues. FIFO scheduling with priorities is used in each queue. Limits can be placed on the percent of running tasks per user, so that users share a cluster equally.

The Capacity Scheduler operates according to the following principles:

1. At cluster startup, a configuration file with all the task scheduler settings for the queue characteristics and other operations is used to initialize the system.

2. An initialization poller thread is started along with worker threads. The poller thread wakes up at specified intervals, distributes jobs to worker threads and then sleeps.

3. At job submission time, the scheduler checks job submission limits to determine if the job can be accepted for the queue and user.

4. If the job can be accepted, the initialization poller checks initialization limits. If initialization is permitted, a worker thread is assigned to initialize the job.

5. Whenever the JobTracker gets a heartbeat signal from a TaskTracker indicating that a node is available, a queue is selected from all the queues with available jobs by sorting the queues by current settings and load. Queue- and user-specific limits are checked for capacity limits. The first permissible job is chosen from the chosen queue. Next, a task is picked from the job, with preference given to tasks with data closer to the nodes. This procedure is repeated until all the jobs complete.

^4 http://en.wikipedia.org/wiki/Apache_Hadoop

Note that the Capacity Scheduler dynamically assigns tasks to threads and nodes, respectively. In particular, tasks have the following relationships with data location: node-local tasks have the data on the local disk, rack-local tasks have data on a machine in the same rack and network data transfer is necessary to complete the task, and finally, remote tasks have data further away in the cluster.
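The heartbeat-driven flow described above can be summarized in a short sketch. The code below is our own simplified illustration of the listed steps (queue ordering, capacity and per-user limit checks, and locality preference), not the actual Hadoop or MRPerf implementation; all field names and data structures are hypothetical.

```python
# Hypothetical sketch of Capacity Scheduler task assignment on a heartbeat.
# Queues, jobs and tasks are plain dicts; this is not the real Hadoop code.

LOCALITY_ORDER = ("node-local", "rack-local", "remote")

def pick_task(node, queues):
    """Choose one pending task for the node that just sent a heartbeat."""
    # Step 5a: consider the least-loaded queues (relative to Capacity) first.
    for queue in sorted(queues, key=lambda q: q["used_slots"] / q["capacity_slots"]):
        if queue["used_slots"] >= queue["max_capacity_slots"]:
            continue  # maxCapacity limit reached for this queue
        # Step 5b: jobs are considered in FIFO-with-priority order within the queue.
        for job in queue["jobs"]:
            user = job["user"]
            if queue["user_slots"][user] >= queue["user_slot_limit"]:
                continue  # per-user limit (minUserLimitPCT / userLimitFactor) reached
            # Step 5c: prefer tasks whose input data is closest to this node.
            for locality in LOCALITY_ORDER:
                for task in job["pending_tasks"]:
                    if task["locality"].get(node, "remote") == locality:
                        return task
    return None  # nothing schedulable on this heartbeat
```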

Many configuration parameters are provided, allowing administrators to tune scheduling parameters for the jobs. Table 1 shows the configurable resource allocation parameters that we consider in our experiments. Our initial exploration of the effect of task scheduler parameters via experimentation on a small cluster [4] showed that these were the most influential in elapsed time performance. A value of -1 for a parameter means the parameter is not used. There are other parameters that are not shown which are categorized under memory management and job initialization.

Table 1: Capacity Scheduler Parameters

Parameter Name              | Brief Description/Use                                                                                   | Default Value
queue.qname.Capacity        | Percentage of cluster's slots available for jobs in queue.                                              | 1 queue @ 100%
queue.qname.maxCapacity     | Maximum percentage of cluster capacity for a single queue.                                              | -1
queue.qname.minUserLimitPCT | A limit on the percentage of resources allocated to a user at any given time, if there is competition.  | 100
queue.qname.userLimitFactor | Capacity multiple for user to acquire more slots.                                                       | 1
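As a rough illustration of how these percentages translate into slot counts, consider the sketch below. It is our own simplified reading of Table 1, not code from the paper or the Hadoop source, and the per-user limit formula in particular is an approximation.

```python
def queue_slot_limits(cluster_slots, capacity_pct, max_capacity_pct,
                      min_user_limit_pct, user_limit_factor, active_users):
    """Simplified interpretation of the Table 1 parameters for one queue.

    capacity_pct:        guaranteed share of cluster slots (Capacity).
    max_capacity_pct:    hard cap on the share; -1 means uncapped (maxCapacity).
    min_user_limit_pct:  share of the queue a user is limited to when there is
                         competition (minUserLimitPCT).
    user_limit_factor:   multiple of the queue capacity a single user may take
                         when slots are idle (userLimitFactor).
    """
    guaranteed = cluster_slots * capacity_pct / 100
    cap = cluster_slots if max_capacity_pct == -1 else cluster_slots * max_capacity_pct / 100
    # Per-user limit: the larger of an equal split among active users and the
    # minUserLimitPCT share, bounded by userLimitFactor times the queue capacity.
    per_user = max(guaranteed / max(active_users, 1),
                   guaranteed * min_user_limit_pct / 100)
    per_user = min(per_user, guaranteed * user_limit_factor)
    return round(guaranteed), round(cap), round(per_user)

# Example: 60 map slots, a 50% queue with maxCapacity unset, minUserLimitPCT=25,
# userLimitFactor=1, and 4 users competing.
print(queue_slot_limits(60, 50, -1, 25, 1, 4))  # -> (30, 60, 8)
```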

3 Related Work

Jiang, Ooi, Shi and Wu [12] studied the performance of MapReduce in a very detailed manner. They identified five design factors that affect Hadoop performance: 1) grouping schemes, 2) I/O modes, 3) data parsing, 4) indexing, and 5) block-level scheduling. By carefully tuning these factors, the overall performance improved by a factor of 2.5 to 3.5. Babu [2] measured a real system and found that the presence of too many job configuration parameters in Hadoop is cumbersome, highlighting the importance of an automated tool to optimize parameter values. Herodotou et al. [9] introduced Starfish, which profiles and optimizes MapReduce programs based on cost. Starfish aims to relieve users from having to fine tune job configuration parameters for different cluster settings and input data sets. All these profiling tools require a real system on which to test potential workloads and are tuned to the job scheduler and not the task scheduler.

Experiments with YARN utilized the Capacity Scheduler with particular parameter settings [17]. Production results at Yahoo! for a small cluster (10 machines) show the benefits of the YARN framework with a recently developed capability of YARN: work-preserving preemption. This is not modelled in current simulators, but shows the need for some method of evaluating scheduler performance at scale.

Several discrete event simulators for MapReduce application workloads have been developed. Each provides different levels of support for integrating new task schedulers and different details in the computation and communication models.

SimMR [18], MRSim [8] and SimMapReduce [16] are designed to evaluate different schedulers/provisioning strategies. SimMR focuses on JobTracker decisions and task/slot allocations among different jobs. MRSim measures scalability easily and captures the effects of different configurations of Hadoop setup on job completion times and hardware utilization. SimMapReduce allows researchers to evaluate different scheduling algorithms and resource allocation policies.

HSim [13] simulates the dynamic behaviours of Hadoop environments and models a large number of Hadoop parameters, such as node parameters (related to processors, memory, hard disk, network interface, map and reduce instances), cluster parameters (number of nodes, node configurations, network routers, job queues and schedulers), and Hadoop system parameters. Mumak [14] can estimate the performance of MapReduce applications under different schedulers. It takes a job trace derived from a production workload and a cluster definition as input, and simulates the execution of the jobs using different schedulers. The detailed job execution trace (recorded in relation to virtual simulated time) is the output, which can be analyzed to capture various traits of individual schedulers (turnaround time, throughput, fairness, capacity guarantee, etc.).

MRPerf [20, 21] analyzes application performance on a given Hadoop setup, enabling the evaluation of design decisions for fine-tuning and creating Hadoop clusters. MRPerf was made open source to be used by the research community to enable exploration of design issues, validation of new algorithms and optimization in MapReduce.

The need for production-level traces by some simulators makes them inappropriate for general research, since often the traces are proprietary information and not easily available to the academic community. Only recently have MapReduce simulators supported schedulers other than FIFO, Mumak being the first.

A new simulation environment, SLS,^5 has recently been introduced by Yahoo! for the YARN framework; it includes support for many components within Hadoop. SLS provides detailed execution traces as well as resource usage metrics. It even provides analysis of low-level scheduler operations, measuring their overhead and assessing scalability. The environment is designed in a modular fashion to incorporate new scheduler development. There are many parameters in the simulator itself. The goals and approach of our work and SLS are similar, but SLS was not a mature tool at the beginning of our investigation.

Optimizing cluster utilization and reducing makespan has been studied using job scheduler adjustments. HFSP [15] uses a run-time estimate of job size and age to order jobs; pre-emption prioritizes shorter jobs without starvation. Verma et al. [19] use the FIFO scheduler and choose to order the jobs optimally to reduce makespan. While there has been a rich history of scheduling research in the area of cluster and grid computing, it is beyond the scope of this work to evaluate the design of such scheduling algorithms.

^5 http://hadoop.apache.org/docs/r2.3.0/hadoop-sls/SchedulerLoadSimulator.html

4 Simulator Implementation

The main code to implement the Capacity Scheduler simulator includes 1) queue setup, 2) initialization poller configuration, 3) job launching and initialization, 4) the main scheduler driver and 5) job removal. A new config file was created containing parameter settings, and a conversion program was written to support multiple users/new job types. The existing main logic in MRPerf interfaces with the scheduler code. The stages of MapReduce processing are coded as front-end TCL files. Interface code was required for the Capacity Scheduler code, as was support for multiple heartbeats for reduce tasks. The Capacity Scheduler component required 500 lines of Python/TCL interface code and 2000 lines of C++ code for processing.

The MRPerf simulator lacked variability between simulation runs. A fixed set of jobs always produced the same results, because the heartbeat arrival time at the JobTracker was at fixed times and in the same order as the DataNodes generated initial heartbeats. This is not realistic, as there is always some randomness in heartbeat arrivals at the JobTracker due to operating system scheduling and network communication delays. A random seed was added to the time at which a node can generate its first heartbeat to create randomness in the system in two ways: a node can provide an initial heartbeat to the JobTracker at a random time, and no fixed order was placed on the nodes with respect to the initial heartbeat. The subsequent heartbeat from a node is generated a multiple of 300 ms later.
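A minimal sketch of the randomized heartbeat schedule just described (our own illustration of the idea, not the actual MRPerf patch; function names and the bound on the initial delay are assumptions). Each node's first heartbeat is offset randomly and in no fixed node order, and later heartbeats follow at multiples of the 300 ms interval:

```python
import random

HEARTBEAT_INTERVAL = 0.3  # seconds (300 ms), as in the modified simulator

def first_heartbeat_times(node_ids, max_initial_delay=0.3, seed=None):
    """Assign each node a random initial heartbeat time, in no fixed order.

    max_initial_delay is an assumed bound; the text does not specify one.
    """
    rng = random.Random(seed)
    return {node: rng.uniform(0.0, max_initial_delay) for node in node_ids}

def next_heartbeat(first_time, k):
    """The k-th subsequent heartbeat arrives a multiple of 300 ms after the first."""
    return first_time + k * HEARTBEAT_INTERVAL

# Example: 4 DataNodes with randomized, unordered initial heartbeats.
starts = first_heartbeat_times(["node1", "node2", "node3", "node4"], seed=42)
for node, t0 in starts.items():
    print(node, [round(next_heartbeat(t0, k), 3) for k in range(3)])
```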

5 Simulator Validation

5.1 Setup

The experimental cluster used for validation experiments consisted of 5 nodes on local lab machines. One node was the JobTracker and NameNode, while the other nodes served as TaskTrackers and DataNodes. Each node had a 2.4 GHz CPU, 4 GB of RAM and a 250 GB hard disk, running Ubuntu 10.04 Linux and executing Hadoop 0.20.203. The data replication factor was 3. All nodes were on the same rack and connected through a 1 Gbps switch. The simulator validation was conducted on a similarly-configured Linux laptop.

In the validation experiments, Sort and TeraSort (called TSort hereafter) were used as the jobs, as their output data is proportional to the input data (a requirement for MRPerf); this kept the validation simple to implement and evaluate. Two queues were used, with 2 users per queue. Jobs in each queue were submitted with an inter-arrival time of 5 seconds, while jobs of the same type in different queues were submitted at the same time. TeraGen generated the input data for TSort (RandomWriter for Sort). TSort used 2 GB of input data, while Sort used 1 GB of input data, so as to have different input data sizes for different types of jobs, but to keep the dataset size intentionally small and limit the time required for validation. We generated jobs in the following order:

• TSort for user 1/user 3 in Queue 1/Queue 2.

• TSort for user 2/user 4 in Queue 1/Queue 2, respectively, after 5 seconds.

• Sort for user 1/user 3 in Queue 1/Queue 2, respectively, when their TSort job is finished. The same is done for user 2/user 4 in Queue 1/Queue 2, respectively.

Two different node configurations were used: one reduce slot per node and two reduce slots per node. We captured map, reduce, execution and waiting time. The graphs show the mean of 10 runs and the 95% confidence interval as error bars. “Job/Number” is used for identification purposes in both the graphs and the descriptions. Detailed comparison of phase execution times is necessary to identify sources of simulator inaccuracy. Chauhan [3] contains more experimental results.

5.2 One Reduce Slot per Node

Changing numQueues. In the first experiment, a single queue with 2 users and 4 jobs in total was used. In the second experiment, an identical queue was added (Capacity set at 50% for each queue, but not modifying maxCapacity). Figure 1 shows the results. With two queues, the map/reduce/total execution times increase for all jobs.

The confidence interval for the map execution time for TSort/2 and TSort/4 was large. The reason for this was delayed map execution, which occurs when a job is started before it should be, given the scheduling discipline in place. If job 1 is submitted before job 2, then job 2's map tasks do not execute until job 1's map tasks are finished or there are extra slots remaining idle. In some runs, job 2 gets one or two map slots quickly (even though job 1's map tasks were not finished). Later, when the JobTracker received new heartbeats from the TaskTrackers, job 1 took preference as it was submitted earlier, and until job 1 completed all its map tasks, there were no job 2 map tasks scheduled.

This occurred only once for TSort/2 and TSort/4 in the real system. These jobs almost always exhibited delayed map execution behaviour in the simulator. This leads to huge differences in execution and waiting time. The results match closely on the runs without delayed map execution. Delayed map execution differences are caused by heartbeat arrival times. The other major factor for TSort's variation was straggler tasks in some real system runs, when resources cannot be obtained in time to complete with the other map tasks [1].

Delayed map execution also leads to increased reduce time in some executions. The reduce phase is started when a certain percentage of map tasks are completed; by default, this threshold is 5% (0.05) of the total map tasks. Sort with 1 GB of input data has 16 map tasks; one map task completion triggers the reduce phase. However, the reduce phase cannot proceed fully until the map phase is completed, which will be delayed until any preceding job completes all its map tasks. Straggler map tasks affect the reduce execution of all jobs.

Changing minUserLimitPCT. The minUserLimitPCT parameter was changed to 50%, allowing concurrent execution of map tasks for different jobs. All map, reduce and overall execution times increased. This trend was captured by the simulator. The high variation for reduce tasks in the real cluster shown in Figure 2 is due to stragglers.

5.3 Two Reduce Slots per Node

In this configuration, some jobs' reduce tasks were executed in parallel on the same node (the same-reduce-node effect). Two different graphs are shown, one with reduce phases in isolation, and one when paired with another job on the same node, as the reduce time increases substantially in the paired configuration, skewing the average elapsed times. Table 2 shows the number of isolated-paired cases for the real cluster and simulation runs when the parameters were varied, respectively. Heartbeat arrival times affect the relative frequency of the same-reduce-node effect.

(a) 1 queue 100% capacity  (b) 2 queues 50% capacity each
Figure 1: numQueues (One Reduce Slot per Node)

Figure 2: minUserLimitPCT=50%, Capacity=50% (One Reduce Slot per Node)

Changing numQueues. The same experimental parameters were used as with one reduce slot per node. The results are shown in Figure 3. Single queue results matched well for Sort, but not for TSort. A small standard deviation exists because of straggler tasks in some real cluster runs. With the additional queue, map, reduce and total execution time increases for all jobs, and differences appear between the real system and the simulation in both applications. Delayed map execution happened once in the real cluster for TSort/2 and TSort/4. Delayed map execution occurred in all simulation runs. This helps explain the variation for TSort/2 and TSort/4. In the real cluster, TSort/2 and TSort/4 started when the map phase for TSort/1 and TSort/3 ended. In the simulation runs, they started map execution at submission time, leading to different waiting times.

Differences in the reduce execution time in Figure 3(c) can be attributed to the same-reduce-node effect. The different scale on the graph shows that the paired reduce is much slower. Two independent factors cause the variation: 1) which job executes in parallel with which other job, and 2) when the jobs pair with each other. In the real cluster, TSort/3 in particular was always paired with another TSort job, and as TSort jobs execute longer, the reduce execution time for TSort/3 is much higher than for all other jobs. All other TSort jobs were paired with TSort as well as Sort jobs, so their reduce execution times were lower than TSort/3's. If two jobs start their reduce phase on the same node simultaneously, they will be affected by competition for resources. If the reduce phase starts later for the second job, there will be less overlap. Discrepancies between simulation and the real system can be attributed to heartbeat arrivals.

(a) 1 queue 100% capacity Cluster  (b) 2 queues 50% capacity Cluster - Isolated reduce  (c) 2 queues 50% capacity Cluster - Paired reduce
Figure 3: numQueues (Two Reduce Slots per Node)

Changing minUserLimitPCT. The identical experiment was conducted as with one reduce slot per node. The results in Figure 4 show increasing map, reduce and execution times, captured well by the simulator only in the isolated reduce case. The difference in the means and the large variation for reduce tasks in the paired-reduce phase are due to the same-reduce-node effect and delayed map execution.

(a) Isolated reduce phase  (b) Paired reduce phase
Figure 4: minUserLimitPCT (Two Reduce Slots per Node)

5.4 Analysis/Summary

The simulator depicts all behaviour observed on the real cluster for which there is a scheduler component modelled. The best results occur with minUserLimitPCT set to 50% and 2 reduce slots per node, as all simulator averages were within the confidence interval of the real system. The worst results came with paired-reduce behaviour.

Several factors limited the simulator accuracy. First, straggler tasks (which are not modelled in the simulator) and the same-reduce-node effect, which occurs more frequently in the simulator, increased variability. Second, the simulator does not model all the different sub-phases of the map/reduce phase. Factors like disk access time, CPU and network contention, prefetching, etc. are not modelled in the simulator. Modelling stragglers and the same-reduce-node effect is part of future work.

Table 2: Isolated-Paired Reduce Runs

Real Cluster
Factor          | TSort1 | Sort1 | TSort2 | Sort2 | TSort3 | Sort3 | TSort4 | Sort4
Capacity        | 10/0   | 6/4   | 5/5    | 8/2   | 7/3    | 7/3   | 7/3    | 7/3
maxCapacity     | 9/1    | 6/4   | 7/3    | 8/2   | 8/2    | 7/3   | 5/5    | 9/1
minUserLimitPCT | 6/4    | 8/2   | 7/3    | 7/3   | 6/4    | 9/1   | 7/3    | 9/1

Simulation
Factor          | TSort1 | Sort1 | TSort2 | Sort2 | TSort3 | Sort3 | TSort4 | Sort4
Capacity        | 2/8    | 4/6   | 3/7    | 7/3   | 3/7    | 5/5   | 5/5    | 4/6
maxCapacity     | 2/8    | 3/7   | 7/3    | 4/6   | 3/7    | 6/4   | 4/6    | 4/6
minUserLimitPCT | 4/6    | 3/7   | 3/7    | 5/5   | 2/8    | 4/4   | 4/6    | 6/4

6 Simulation Experiments

A 31-node cluster was simulated under different job submission patterns. Each node was configured identically to the real cluster nodes. This is similar in size and configuration to Pastorelli et al. [15]. We limited ourselves to the parameters defined in Table 1 [4]. There were no discernible effects of userLimitFactor in the experiments, and so no results are presented based on this parameter.

The experiments were designed based on three questions: 1) What Capacity Scheduler settings can optimize the performance of short jobs when they arrive after a sequence of long jobs? In particular, are there settings that allow some short jobs to be executed while the majority of the cluster's resources are occupied by long jobs? 2) What are the effects of Capacity Scheduler settings when a long job and a series of short jobs arrive in an interleaved fashion? 3) Does providing different queues for short and long jobs improve system performance, and the performance of short jobs in particular?

Two queues were used, each containing 50 jobs as described in Table 3, motivated by Chen et al. [5]. A 1 second inter-arrival time was used. The shuffle ratio is the amount of data generated by the map phase that needs to be sent to the reduce phase, relative to the input data, while the output ratio is the size of the output file compared to the input file. Map compute and Reduce compute are multiplicative factors representing the processing time needed per task compared to the smallest job.
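To make these ratios concrete, the sketch below (our own illustration, not part of the paper's simulator) derives the shuffle and output data volumes of a job from its input size and the Table 3 factors; the example job parameters are hypothetical.

```python
def derive_data_volumes(input_gb, shuffle_ratio, output_ratio):
    """Apply the Table 3 ratios to an input size (in GB).

    shuffle_ratio: map-output data sent to the reduce phase, relative to input.
    output_ratio:  final output size, relative to input.
    """
    shuffle_gb = input_gb * shuffle_ratio
    output_gb = input_gb * output_ratio
    return shuffle_gb, output_gb

# Example: a hypothetical Job Type 2 instance ("aggregate and expand")
# with 10 GB of input, shuffle ratio 0.025 and output ratio 3.
shuffle, output = derive_data_volumes(10.0, shuffle_ratio=0.025, output_ratio=3.0)
print(f"shuffle = {shuffle:.2f} GB, output = {output:.2f} GB")  # 0.25 GB, 30.00 GB
```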

A full-factorial experiment with 10 replications was performed with settings as shown in Table 4.

Table 3: Job Types

Job Type | Description          | % of job mix | Map Tasks | Reduce Tasks | Shuffle ratio | Output ratio | Map compute | Reduce compute
1        | Data transformation  | 8            | 100       | 15           | 1             | 1            | 5           | 40
2        | Aggregate and expand | 6            | 200       | 1            | 0.025         | 3            | 40          | 5
3        | Expand and aggregate | 4            | 400       | 30           | 3             | 0.025        | 40          | 40
4        | Data summary         | 2            | 800       | 8            | 0.075         | 0.0005       | 20          | 20
5        | Small job            | 48           | 1         | 1            | 1             | 1            | 1           | 1
6        | Small job            | 24           | 2         | 1            | 1             | 1            | 1           | 1
7        | Small job            | 8            | 10        | 1            | 1             | 1            | 1           | 1

We defined a short job as one having 10 or fewer map tasks. We restricted long job sizes to those considered large (less than 50 GB), ignoring those which were denoted as very large or huge [5]. Including these job types would have totally dominated any scheduler settings effect, and they are so rare (less than .002% in the Facebook trace) that the workload would need to be 3 orders of magnitude larger to consider their relative occurrence. We believe this does not affect our ability to evaluate scheduler performance on common workloads.

Table 4: Parameter Settings: Sequential Job Types

Exp No | Num Queues | Capacity | Max Capacity | User Limit Factor | Min User Limit PCT
1      | 2          | 50-50    | -1           | 1                 | 100
2      | 2          | 50-50    | 50-50        | 1                 | 100
3      | 2          | 50-50    | -1           | 2                 | 100
4      | 2          | 50-50    | -1           | 1                 | 25
5      | 2          | 50-50    | 50-50        | 2                 | 100
6      | 2          | 50-50    | 50-50        | 1                 | 25
7      | 2          | 50-50    | -1           | 2                 | 25
8      | 2          | 50-50    | 50-50        | 2                 | 25
9      | 2          | 70-30    | -1           | 1                 | 100
10     | 2          | 70-30    | 50-50        | 1                 | 100
11     | 2          | 70-30    | -1           | 2                 | 100
12     | 2          | 70-30    | -1           | 1                 | 25
13     | 2          | 70-30    | 50-50        | 2                 | 100
14     | 2          | 70-30    | 50-50        | 1                 | 25
15     | 2          | 70-30    | -1           | 2                 | 25
16     | 2          | 70-30    | 50-50        | 2                 | 25
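A minimal sketch (our own, not from the paper) of how the 16 settings in Table 4 can be enumerated as a full-factorial design; the factor levels are taken directly from the table, although the enumeration order need not match the table's experiment numbering.

```python
from itertools import product

# Factor levels from Table 4 (two-queue, sequential-submission experiments).
capacities = ["50-50", "70-30"]      # Capacity split across the two queues
max_capacities = ["-1", "50-50"]     # -1 means maxCapacity is not used
user_limit_factors = [1, 2]
min_user_limit_pcts = [100, 25]

experiments = [
    {"numQueues": 2, "Capacity": cap, "maxCapacity": maxcap,
     "userLimitFactor": ulf, "minUserLimitPCT": mul}
    for cap, maxcap, ulf, mul in product(
        capacities, max_capacities, user_limit_factors, min_user_limit_pcts)
]
assert len(experiments) == 16  # matches the 16 rows of Table 4
# Each configuration would then be simulated 10 times (the 10 replications).
```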

To answer the first question, long jobs were submitted followed by short jobs. To answer the last two questions, two types of job submission patterns were submitted. In the interleaved case, 4 short jobs were submitted and then a long job, until all 50 jobs were submitted, with two equal-capacity queues. Each user had a single job in the system. In the separate queue case, each queue was split in a 4:1 ratio, providing 4 queues and allocating more slots to the long job queues. Table 5 contains parameter settings for these experiments.

Table 5: Parameter Settings: Separate Job Queues

Exp No | Num Queues | Capacity    | Max Capacity | User Limit Factor | Min User Limit PCT
17     | 2          | 50-50       | -1           | 1                 | 100
18     | 2          | 50-50       | 50-50        | 1                 | 100
19     | 2          | 50-50       | -1           | 2                 | 100
20     | 2          | 50-50       | -1           | 1                 | 25
21     | 2          | 50-50       | 50-50        | 2                 | 100
22     | 2          | 50-50       | 50-50        | 1                 | 25
23     | 2          | 50-50       | -1           | 2                 | 25
24     | 2          | 50-50       | 50-50        | 2                 | 25
25     | 4          | 10-10-40-40 | -1           | 1                 | 100
26     | 4          | 10-10-40-40 | 10-10-40-40  | 1                 | 100
27     | 4          | 10-10-40-40 | -1           | 2                 | 100
28     | 4          | 10-10-40-40 | -1           | 1                 | 25
29     | 4          | 10-10-40-40 | 10-10-40-40  | 2                 | 100
30     | 4          | 10-10-40-40 | 10-10-40-40  | 1                 | 25
31     | 4          | 10-10-40-40 | -1           | 2                 | 25
32     | 4          | 10-10-40-40 | 10-10-40-40  | 2                 | 25

Five important metrics are reported: data locality, response ratio, elapsed time, coefficient of variation in job execution time, and makespan. Data locality is the percentage of node-local tasks. Response ratio is the ratio of a job's response time (waiting plus execution time) to its execution time. Finally, makespan is the elapsed time until all jobs are completed.
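A minimal sketch (our own illustration, not the paper's code) of how these metrics can be computed from per-job records. It assumes the response-ratio reading above (waiting plus execution time over execution time, consistent with the tables' values of 1.0 and higher) and uses hypothetical field names.

```python
from statistics import mean, pstdev

def job_metrics(jobs):
    """Compute the reported metrics from per-job records.

    Each job is a dict with hypothetical fields: 'submit', 'start', 'finish'
    (timestamps in seconds), plus 'node_local_tasks' and 'total_tasks' counts.
    """
    exec_times = [j["finish"] - j["start"] for j in jobs]
    wait_times = [j["start"] - j["submit"] for j in jobs]

    response_ratios = [(w + e) / e for w, e in zip(wait_times, exec_times)]
    locality_pct = 100.0 * sum(j["node_local_tasks"] for j in jobs) \
                         / sum(j["total_tasks"] for j in jobs)
    cov_exec = pstdev(exec_times) / mean(exec_times)   # coefficient of variation
    makespan = max(j["finish"] for j in jobs) - min(j["submit"] for j in jobs)

    return {"mean_response_ratio": mean(response_ratios),
            "data_locality_pct": locality_pct,
            "cov_execution_time": cov_exec,
            "makespan": makespan}
```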

We model the system from an initially empty setting. Comprehensive full-scale steady state simulations would require a much larger job pool and computation effort to evaluate. As well, this only addresses an interleaved scenario, in which jobs of any type can arrive. With temporally segregated job types, steady state measurements are not informative, and makespan is not relevant. Our results will be relevant in a system that experiences temporary load reductions, where resources are not always fully subscribed and queues occasionally empty. We show how long the system will be utilized when given a specific set of tasks. For example, our input could be the daily set of jobs from 2 organizations sharing the cluster, submitted at a certain time of day. We are interested in confirming which parameter settings have an effect on these sample job submission patterns.

6.1 Equal Capacity Queues

Data Locality. The experiments in the simulation were done on a single rack. Data locality plays an important role in MapReduce environments. Reduced network activity for node-local tasks should reduce latency. For small jobs, the percentage of node-local map tasks scheduled is moderate to small (between 39% and 47% for Job Type 7, and between 6% and 12% for Job Type 5). Fewer node-local tasks means less data locality. For jobs with 1 map task, the distribution was bi-modal (either node-local or rack-local). Similarly, for jobs with 2 map tasks, the distribution was tri-modal (0, 50 and 100%). For large jobs, the percentage of node-local tasks is high (from 88% for Job Type 1 to 98% for Job Type 2). As the number of map tasks increases, the data is more widely distributed and nodes that have the data locally can be found.

Changing parameter settings does not significantly alter data locality in nearly all cases. The parameters provide stringent limits to ensure that a single job/user/queue cannot consume a disproportionate amount of resources. Data locality depends on which TaskTracker gets assigned a task and whether the data is local. The decision to give a task to a TaskTracker is made by the JobTracker running the task scheduler, and is designed to be independent of Capacity Scheduler parameter settings. The absolute percentage of node-local tasks varies less than 2% for long jobs. For short jobs, the relative difference is at most 20% (from 39% to 47% for Job Type 7 between experiment 1 and experiment 2).

Response Ratio. In Figure 5, we present the execution and waiting times for selected job types. The smallest execution time is for Job Type 5, but in many of the configurations, an extreme amount of time is spent waiting. Table 6 shows the average response ratios for all the job types. The waiting times for all job types decrease when minUserLimitPCT is modified (Experiments 4, 6, 7 and 8). As well, the response ratio is improved, though less for the long jobs. Setting minUserLimitPCT allows more jobs to execute concurrently. The lower value of the response ratio in experiment 1 for Job Type 4 is an artifact of delayed map execution. The benefit in elapsed time for most of the job types (in particular, Job Types 4 and 5) indicates these are preferred operating configurations.

Table 6: Response Ratio: Equal Capacity

Job Type | Exp 1 | Exp 2 | Exp 3 | Exp 4 | Exp 5 | Exp 6 | Exp 7 | Exp 8
1        | 1.2   | 1.2   | 1.2   | 1.0   | 1.2   | 1.0   | 1.0   | 1.0
2        | 1.9   | 2.0   | 2.0   | 1.5   | 2.0   | 1.5   | 1.5   | 1.5
3        | 1.2   | 1.5   | 1.5   | 1.1   | 1.5   | 1.2   | 1.1   | 1.1
4        | 2.1   | 6.7   | 6.7   | 1.5   | 6.8   | 1.5   | 1.5   | 1.5
5        | 15.0  | 15.0  | 14.0  | 3.0   | 14.0  | 3.6   | 3.2   | 3.5
6        | 12.8  | 12.9  | 12.8  | 3.0   | 12.7  | 3.0   | 2.7   | 3.0
7        | 10.3  | 10.4  | 10.3  | 1.8   | 10.5  | 1.8   | 1.8   | 1.9

Elapsed Time. Users are interested in elapsed time. Does a low response ratio lead to reduced elapsed time? Execution time increases for all job types when the response ratio decreases due to minUserLimitPCT changes, because of increased competition between the jobs. The improvement in response ratio does not change elapsed times for Job Type 2 (in fact, it increases slightly). For Job Type 4, elapsed time decreases greatly, because of the factor by which the response ratio improves. The response ratio changes by only 25% for Job Type 2, while the response ratio for Job Type 4 improves by a factor of 4, thus improving elapsed time. This improvement averages a reduction to 50% of the original elapsed time. For Job Type 5, elapsed time decreases by a factor of 4.

Execution Time Variation. The coefficient of variation in execution time for long jobs was less than 0.1. For short jobs, it lies in the range of 0.1-0.5 (Table 7), due to data locality. Node-local data usually provides very fast map times, but node-local tasks may take longer than rack-local tasks if there are co-located map/reduce tasks from the same or different jobs, due to disk contention. Finally, remote disk contention may occur.

(a) Job Type 2  (b) Job Type 4  (c) Job Type 5  (d) Job Type 7
Figure 5: Execution And Wait Times For Selected Job Types: Equal Capacity Queues

Table 7: Coefficient of Variation: Short Job Types

Job Type | Exp 1 | Exp 2 | Exp 3 | Exp 4 | Exp 5 | Exp 6 | Exp 7 | Exp 8
5        | 0.18  | 0.18  | 0.18  | 0.18  | 0.19  | 0.18  | 0.49  | 0.18
6        | 0.14  | 0.12  | 0.14  | 0.41  | 0.12  | 0.31  | 0.31  | 0.25
7        | 0.14  | 0.13  | 0.12  | 0.50  | 0.15  | 0.49  | 0.46  | 0.42

Makespan. Makespan is similar for both queues, at 1230 seconds for nearly every experiment. The difference in makespan between the queues is less than 3 seconds across all experiments (within 0.2%, except experiment 2, which had a 1.5% smaller makespan at 1200 seconds). Makespan for the system is determined by the longest jobs.

6.2 Differential Capacity Queues

The next set of experiments had the relative queue capacity changed to 70-30%, so that the smaller queue is not squeezed for capacity but the larger queue still dominates resource allocation. The minUserLimitPCT setting reduces waiting time and improves response ratio, as in the previous scenario, for the high-capacity queue in most cases, as shown in Figure 6. Approximate values of response ratios can be inferred from the diagram. Job Types 1, 2 and 3 did not show elapsed time reduction. Job Type 4 had an increased response ratio for experiment 9, but other experiments showed slight improvements. For the short jobs, there was significant improvement in response ratio for all experiments. Unfortunately, the improved response ratio in the high-capacity queue was occasionally accompanied by a significant increase in elapsed time for the shorter jobs, most notably experiment 15.

In the low-capacity queue, response ratios are largely unaffected for the long jobs, except for experiments 10 and 13, where they are over 10, as compared with around 6.7 for the equal capacity case. This is when userLimitFactor was set to 2. For the short job types, most response ratios increased, some substantially. In particular, Job Type 5 had an increase in response ratio from 15 to 26 from experiment 2 to experiment 10. Figure 7 shows the execution/waiting times for selected Job Types for the low-capacity queue.

(a) Job Type 2  (b) Job Type 4  (c) Job Type 5  (d) Job Type 7
Figure 6: Execution And Wait Times For Selected Job Types: Differential Capacity (70% Queue)

(a) Job Type 2  (b) Job Type 4  (c) Job Type 5  (d) Job Type 7
Figure 7: Execution And Wait Times For Selected Job Types: Differential Capacity (30% Queue)

In experiments 10, 13, 14 and 16, where maxCapacity was set to Capacity, neither queue interfered with the other. Elapsed times were lower for the high-capacity queue than the low-capacity queue. The magnitude of the differences changed with the experiment and job type. In particular, experiments 14 and 16 had decreased execution times in both queues, but the high-capacity queue had decreased waiting times and the low-capacity queue had increased waiting times.

In all other experiments, there was queue interference. For Job Type 7, both execution and waiting time increase in experiments 12 and 15 in the high-capacity queue, when compared with the corresponding experiments in Figure 5. In the low-capacity queue, execution time and wait time decrease. With minUserLimitPCT used, more users executed tasks in parallel, sometimes increasing elapsed times of the high-capacity queue to longer than the low-capacity queue.

The coefficient of variation of execution time for long jobs was less than 0.1. Data locality and disk contention (on either local or remote disk) again contribute to higher variation for short job types. Makespan is similar for both queues when maxCapacity is not set; the low-capacity queue can get additional capacity from the high-capacity queue. When the maxCapacity limit is used, the high-capacity queue makespan decreases by an average of 200 seconds. The low-capacity queue makespan increases by between 160 and 260 seconds.

6.3 Interleaved Jobs

This section examines performance in a controlled, but more realistic scenario in a Hadoop cluster: an interleaved job submission pattern, where 4 short jobs are followed by 1 long job, and so on. Again, the most important parameter is minUserLimitPCT. It improved the response ratio for all job types. A job arriving earlier under default settings gets minimal improvement in response ratio, as its waiting time does not change much. Table 8 shows the response ratios. For a job arriving later, the huge improvement in waiting time due to minUserLimitPCT improves the response ratio significantly (Experiments 20, 22, 23, and 24). Unfortunately, it does not affect makespan. In a system running over the longer term, more short jobs could be executed and those arriving late would not suffer from long waiting times; thus, jobs would appear to be treated fairly, as suggested by the analysis in the single queue case. The fact that the response ratio does not increase for the other job types indicates that resources are being used more evenly.

Table 8: Response Ratio: Interleaved Scenario

Job Type | Exp 17 | Exp 18 | Exp 19 | Exp 20 | Exp 21 | Exp 22 | Exp 23 | Exp 24
1        | 1.18   | 1.18   | 1.18   | 1.02   | 1.18   | 1.02   | 1.01   | 1.02
2        | 1.90   | 1.94   | 1.93   | 1.46   | 1.91   | 1.46   | 1.47   | 1.47
3        | 1.35   | 1.51   | 1.51   | 1.15   | 1.50   | 1.15   | 1.15   | 1.15
4        | 8.46   | 8.70   | 8.91   | 1.79   | 8.48   | 1.79   | 1.77   | 1.80
5        | 2.03   | 2.02   | 2.03   | 1.41   | 2.02   | 1.43   | 1.43   | 1.42
6        | 3.19   | 4.16   | 4.24   | 2.47   | 4.10   | 2.53   | 2.53   | 2.48
7        | 4.20   | 6.91   | 6.85   | 1.90   | 6.70   | 1.89   | 1.90   | 1.90

6.4 Separate Job Queues

Queues 1 and 2 are the short job queues and receive Job Types 5, 6, and 7. Queues 3 and 4 are the long job queues and receive Job Types 1 through 4.

Setting minUserLimitPCT to a value other than the default has improved the response ratio in all the experiments conducted so far. Separate queue submission reveals something different. Instead of minUserLimitPCT, maxCapacity improves the response ratio for short jobs. For long jobs, minUserLimitPCT improves the response ratio. Table 9 shows the response ratio for the different Job Types.

Table 9: Response Ratio: Separate Queues

Job Type | Exp 25 | Exp 26 | Exp 27 | Exp 28 | Exp 29 | Exp 30 | Exp 31 | Exp 32
1        | 1.14   | 1.23   | 1.25   | 1.01   | 1.25   | 1.01   | 1.02   | 1.02
2        | 1.76   | 2.21   | 1.94   | 1.39   | 2.18   | 1.56   | 1.38   | 1.54
3        | 1.19   | 1.59   | 1.51   | 1.12   | 1.58   | 1.16   | 1.12   | 1.15
4        | 2.46   | 9.36   | 6.51   | 1.55   | 9.06   | 1.73   | 1.56   | 1.72
5        | 13.1   | 1.20   | 5.85   | 4.84   | 1.18   | 1.06   | 5.00   | 1.08
6        | 14.1   | 1.65   | 7.03   | 5.22   | 1.48   | 1.20   | 5.18   | 1.20
7        | 9.04   | 1.53   | 4.18   | 4.88   | 1.45   | 1.27   | 4.87   | 1.25

With separate queues, short jobs should finish faster, with lower response ratios. In experiments 25, 27, 28 and 31, where the maxCapacity limit is not imposed, however, the response ratio for short jobs (Job Types 5, 6, and 7) is high because of the long job queues. The long jobs take slots from the short job queues, leading to huge waiting times for short jobs. Imposition of limits reverses the trend. The maxCapacity limit puts an upper bound on the capacity a queue can use; lower waiting times and lower response ratios are the result.

Only jobs from a queue should determine that queue's makespan, but the short job queues were affected by the long job queues when maxCapacity was not set. Makespan was between 380 and 430 seconds for Queues 1 and 2 with maxCapacity set, but within 5% of the long job queue in the other cases, and even 3% longer in experiment 31. Limiting maxCapacity prevents long job queues from stealing slots from short job queues. It also decreases waiting times for short jobs and reduces the short queue makespan.

Figure 8(a) shows that the long job queue interferes with the short job queue, but in Figure 8(b), isolation is achieved along with good elapsed time for the shorter jobs. While Job Types 2, 3, and 4 experience longer wait times, only Job Type 4 has a drastic change in response ratio. This job type is relatively rare (2% of the job mix), and therefore most jobs are positively affected by this configuration and the elapsed time of Job Type 4 is unaffected.

(a) Experiment 25  (b) Experiment 26
Figure 8: Separate Queue Scenario wait/execution times

6.5 Analysis

We cannot make significant conclusions about the behaviour and timing of workload scenarios in general, due to the limited nature of the workloads and the size of the simulated system. Variability in execution time caused by factors not modelled by the simulator (e.g. the same-reduce-node effect) prevents accurate predictions, but our results are consistent within the restricted mode of operation.

With a long sequence of long jobs followed by small jobs, minUserLimitPCT plays the most important role in determining response ratio for specific job types and arrival times. Changing minUserLimitPCT also increases competition for resources and increases execution times for the jobs, with a reduction in elapsed time for some job types.

For queues with unequal capacity, maxCapacity can be used for real performance gains. Otherwise, long jobs in the low-capacity queue interfere more often with the high-capacity queue. For interleaved jobs, minUserLimitPCT improves the response ratio for all jobs, especially later arrivals.

For separate queues, two different behaviours were observed. With no maxCapacity limits imposed, long jobs affect the short job queue by stealing slots, increasing makespan and response ratio. With maxCapacity limits imposed, the short jobs are isolated from long jobs. Their response time is small and makespan is not affected by the long jobs. Setting minUserLimitPCT to guarantee sharing when there are more jobs than instantaneous cluster capacity improves the response ratio, but less than setting maxCapacity.

If queue capacity is to be divided among short and long jobs, long jobs must be given more capacity to obtain more slots over time. More accurate job size estimation [15] may provide an additional increase in performance if jobs can be moved between queues. The examination of job size in terms of map and reduce tasks distinguishes our analysis from that of the general HPC community [11].

7 Conclusions/Future Work

There is limited knowledge about the effects of parameter settings on the performance of the Capacity Scheduler. In this paper, we integrated the Capacity Scheduler into the open-source Hadoop simulator MRPerf and validated its performance against a real test cluster. This exposed some inaccuracies in the simulation model, namely stragglers and same-reduce-node effects, but the simulator was sufficiently accurate. Finally, a simulation study was performed to understand and quantify the impact of Capacity Scheduler parameter settings on a sample workload under different job submission patterns. From this limited set of experiments, the performance improvement obtained by treating long and short jobs differentially suggests new schedulers that could make use of previous work in cluster scheduling [7].

MRPerf can be extended to have multiple-disk support on a single node, memory configuration and speculative execution. Some sub-phases are not currently modelled; such sub-phase omissions lead to prediction errors in the simulator, as certain real system behaviours cannot be reproduced. As well, straggler support and stochastic heartbeat arrivals can be added to MRPerf.

References

[1] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in map-reduce clusters using Mantri. In OSDI, pages 1–16, Vancouver, Canada, Oct. 2010.

[2] S. Babu. Towards automatic optimization of MapReduce programs. In IEEE SoCC, pages 137–142, Indianapolis, IN, June 2010.

[3] J. Chauhan. Simulation and Performance Evaluation of Hadoop Capacity Scheduler. Master's thesis, University of Saskatchewan, 2013.

[4] J. Chauhan, D. Makaroff, and W. Grassmann. The Impact of Capacity Scheduler Configuration Settings on MapReduce Jobs. In IEEE CGC, pages 667–674, Xiangtan, China, Nov. 2012.

[5] Y. Chen, S. Alspaugh, and R. Katz. Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads. VLDB Endowment, 5(12):1802–1813, Aug. 2012.

[6] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. CACM, 51(1):107–113, Jan. 2008.

[7] N. Fallenbeck, H.-J. Picht, M. Smith, and B. Freisleben. Xen and the Art of Cluster Scheduling. In VTDC, pages 237–244, Tampa, FL, Nov. 2006.

[8] S. Hammoud, M. Li, Y. Liu, N. K. Alham, and Z. Liu. MRSim: A discrete event based MapReduce simulator. In Fuzzy Systems and Knowledge Discovery, pages 2993–2997, Yantai, China, Aug. 2010.

[9] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A Self-tuning System for Big Data Analytics. In CIDR, pages 261–272, Asilomar, CA, Jan. 2011.

[10] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: fair scheduling for distributed computing clusters. In SOSP, pages 261–276, Big Sky, MT, Oct. 2009.

[11] D. B. Jackson, Q. Snell, and M. J. Clement. Core Algorithms of the Maui Scheduler. In Revised Papers from JSSPP, pages 87–102, Cambridge, MA, June 2001.

[12] D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The performance of MapReduce: an in-depth study. VLDB Endowment, 3(1-2):472–483, Sept. 2010.

[13] Y. Liu, M. Li, N. K. Alham, and S. Hammoud. HSim: A MapReduce Simulator in Enabling Cloud Computing. Future Gener. Comput. Syst., 29(1):300–308, Jan. 2013.

[14] A. Murthy. Mumak: Map-Reduce Simulator. MAPREDUCE-728, 2009.

[15] M. Pastorelli, A. Barbuzzi, D. Carra, M. Dell'Amico, and P. Michiardi. HFSP: Size-based scheduling for Hadoop. In 2013 IEEE Intl. Conf. on Big Data, pages 51–59, Silicon Valley, CA, Oct. 2013.

[16] F. Teng, L. Yu, and F. Magoulès. SimMapReduce: A simulator for modeling MapReduce framework. In FTRA Mobile and Ubiquitous Engineering, pages 277–282, Crete, Greece, June 2011.

[17] V. Vavilapalli, A. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler. Apache Hadoop YARN: Yet Another Resource Negotiator. In IEEE SoCC, pages 5:1–5:16, Santa Clara, CA, Oct. 2013.

[18] A. Verma, L. Cherkasova, and R. H. Campbell. Play it Again, SimMR! In IEEE CLUSTER, pages 253–261, Austin, TX, Sept. 2011.

[19] A. Verma, L. Cherkasova, and R. H. Campbell. Two Sides of a Coin: Optimizing the Schedule of MapReduce Jobs to Minimize Their Makespan and Improve Cluster Performance. In MASCOTS, pages 11–18, Washington, DC, Aug. 2012.

[20] G. Wang, A. R. Butt, P. Pandey, and K. Gupta. A simulation approach to evaluating design decisions in MapReduce setups. In MASCOTS, pages 1–11, London, UK, Sept. 2009.

[21] G. Wang, A. R. Butt, P. Pandey, and K. Gupta. Using realistic simulation for performance analysis of MapReduce setups. In LSAP Workshop, pages 19–26, Garching, Germany, June 2009.

[22] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In EUROSYS, pages 265–278, Paris, France, Apr. 2010.