A Multi-criteria Class-based Job Scheduler for Large Computing Farms

R. Baraglia, P. Dazzi, and R. Ferrini
ISTI "A. Faedo", CNR, Pisa, Italy

Abstract: In this paper we propose a new multi-criteria class-based job scheduler able to dynamically schedule a stream of batch jobs on large-scale computing farms. It is driven by several configuration parameters that allow the scheduler to be customized with respect to the goals of an installation. The proposed scheduling policies make it possible to maximize resource usage and to guarantee the QoS requirements of applications. The proposed solution has been evaluated by simulations using different streams of synthetically generated jobs. To analyze the quality of our solution we propose a new methodology to estimate whether, at a given time, the resources in the system are really sufficient to meet the service level requested by the submitted jobs. Moreover, the proposed solution was also evaluated by comparing it with the Backfilling and Flexible backfilling algorithms. Our scheduler proved able to make good scheduling choices.

Keywords: Job scheduling; Deadline Scheduling; Software License Scheduling; Computing Farm.

1. Introduction

In large computing farms providing utility computing for a large number of users with different functional and non-functional requirements, a scheduler plays a basic role in efficiently and effectively scheduling submitted jobs on the available resources. The objective of scheduling is to assign tasks to specific resources, maximizing the overall resource utilization and guaranteeing the QoS required by applications. The scheduling problem has been shown to be NP-complete in its general form as well as in some restricted forms [3]. Scheduling in utility computing environments is multi-criteria in nature [8]. In fact, these environments generally manage computational requests that vary dynamically over time, have different computational requirements and constraints, and compete to access shared resources. Even if in the past research effort has been devoted to developing multi-criteria job scheduling algorithms [7], [6], [1], [2], there is still a need to improve scheduling techniques so that they can manage an increasing number of jobs and address all the application and installation requirements, as well as the sets of constraints, of the aforementioned computational environments. In this paper, we propose a scheduler able to schedule a continuous stream of batch jobs on large-scale computing farms. As a typical scenario we consider a computing farm made up of heterogeneous, single-processor or SMP machines, linked by a low-latency, high-bandwidth network. Some characteristics of the computing nodes (e.g. processor type, memory size, number of CPUs, link bandwidth) are static and known, whereas others are dynamic (e.g. floating sw licenses). The adopted scheduling policies permit us to optimize the scheduling with respect to different, even contrasting, objectives, such as maximizing resource usage and guaranteeing the non-functional requirements of applications. The rest of this paper is organized as follows. Section 2 describes some of the most common job scheduling algorithms. Section 3 gives a description of the problem. Section 4 describes our solution. Section 5 outlines and evaluates our job scheduler. Finally, conclusions and future work are described in Section 6.

2. Related work

Batch job scheduling algorithms are mainly divided into two classes: on-line and offline. On-line algorithms are those that do not have any knowledge about the whole input job stream; they take decisions for each arriving job without knowing future inputs. Conversely, offline algorithms know all the jobs before taking scheduling decisions. Many of these algorithms are exploited in commercial and open source job schedulers [4]. The Backfilling algorithm [9] is a widely adopted scheduling approach; it is an optimization of the FCFS algorithm [10]. It requires that each job specifies its execution time, so that the scheduler can estimate when jobs finish and when other ones can be started. The main goal of Backfilling is to exploit a resource reservation approach to improve the FCFS policy by increasing system resource usage and by decreasing the average job waiting time in the scheduler's queue. In order to improve performance, some backfilling variants, such as Flexible backfilling [5], have been proposed. The Flexible backfilling algorithm is obtained by exploiting a different ordering of queued jobs: jobs are prioritized according to the scheduler's goals, queued according to their priority value, and selected for scheduling. Even if the multi-criteria approach seems to be the most viable one for solving the resource management and scheduling problem in heterogeneous and distributed computational environments, only a few research efforts have gone in this direction [1], [7], [2], [11]. In [1] a multi-criteria job scheduler for scheduling a continuous stream of batch jobs on large scale computing

farms is proposed. It exploits a set of heuristics that drive the scheduler in taking decisions. Each heuristic manages a specific constraint, and contributes to computing the degree of matching between a job and a machine. The scheduler can be extended to manage a wide set of requirements and constraints. In [7] K. Kurowski et al. propose a two-level hierarchical multi-criteria scheduling approach for Grid environments. All participants in a scheduling process, i.e. end users, Grid administrators and resource providers, express their requirements and preferences by using two sets of parameters: hard constraints and soft constraints. A Grid broker at the higher level exploits the hard constraints to compute a set of feasible solutions, which can then be optimized by using soft constraints describing preferences regarding multiple criteria, such as various performance factors, QoS-based parameters, and characteristics of local schedulers. In [2] a bi-criteria algorithm for scheduling moldable jobs on cluster computing platforms is proposed. It exploits two pre-existing algorithms to simultaneously optimize two criteria: job makespan and weighted minimal average completion time. Such criteria are complementary, and well represent the objectives of both users and system administrators. The algorithm was evaluated by simulations using two different synthetic workloads. In [11] a solution based on advanced resource reservation that optimizes resource utilization and user QoS constraints for Grid environments is proposed. It supports advanced reservations to deal with the dynamics of Grids and provides a solution for agreement enforcement. The proposed advanced reservation solution is structured according to a 3-layered negotiation protocol. The preferences of end users are taken into account to start a negotiation to select the resources to reserve. The user can select the most suitable offer or can decide to re-negotiate by changing some of the constraints.
End-user preferences are modeled as utility functions for which end users have to specify required values and negotiation levels. In [6] a schedule-based solution for scheduling a continuous stream of batch jobs on computational Grids is proposed. The solution is based on the Earliest Deadline First (EG-EDF) rule and on a Tabu search technique. The EG-EDF rule incrementally builds the schedule for all jobs by applying a technique that fills the earliest existing gaps in the schedule with newly arriving jobs. If no gap is available for an arriving job, the EG-EDF rule uses the Earliest Deadline First (EDF) strategy to include the new job in the existing schedule. The schedule is then optimized by using a Tabu search algorithm to move jobs into earlier gaps. Scheduling choices are taken to meet the QoS requested by the submitted jobs, and to optimize hardware resource usage.

3. Problem Description

We consider jobs and machines annotated with information describing their requirements and features, respectively. Jobs in a stream can be sequential or multi-threaded, and all the jobs are independent of one another. To each job is attached a description containing an identifier and a set of functional and non-functional requirements. Functional requirements include the number of processors, the RAM size and the software licenses a job needs in order to be executed. Non-functional requirements (also referred to as QoS) are a job slowdown equal to one, a job deadline, and advanced resource reservation. The description also includes an estimate of the time required to compute the job and the features of the processor used to obtain that estimate (benchmark score). Each job is executed on a single machine, and all jobs are preemptable. Job preemption can be performed when either a job submission or a job ending event takes place. The machines composing the farm are described by a benchmark score, the number and type of CPUs, the size of the installed RAM, and the non-floating (i.e. bound to a machine) and floating (i.e. not bound to any specific machine) software licenses they can run. The processors installed on each machine have an associated weight. Every machine can execute multiple jobs at the same time in a space-sharing fashion. All the machines support two basic forms of job preemption: stop/restart and suspend/resume. The checkpoint/restart form is possible only if the running job is properly instrumented. Machines are assigned to jobs in the shape of sub-machines, namely subsets of a machine's processors. A sub-machine is managed by the scheduler as an instance of the machine from which it originates. Floating sw licenses can be assigned to any machine able to run them. The only limit is that the total number of licenses in use cannot be greater than their availability. In our study, we consider the association of licenses with machines. As a consequence, if a set of jobs requiring the same license can be executed on the same machine, only one license copy is accounted for.
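The job and machine descriptions above can be modeled, for illustration, as simple records. This is a minimal sketch; all field names are our own, not taken from the paper:

```python
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class Job:
    job_id: str
    processors: int                      # functional: number of processors
    ram_mb: int                          # functional: RAM size
    licenses: Set[str] = field(default_factory=set)  # functional: sw licenses
    slowdown_one: bool = False           # non-functional: slowdown equal to one
    deadline: Optional[float] = None     # non-functional: deadline, if any
    reservation: Optional[float] = None  # non-functional: advanced reservation
    est_time: float = 0.0                # estimated execution time
    bench_score: float = 1.0             # processor score used for the estimate

@dataclass
class Machine:
    machine_id: str
    bench_score: float
    processors: int
    ram_mb: int
    licenses: Set[str] = field(default_factory=set)  # non-floating licenses

j = Job("j1", processors=4, ram_mb=2000, licenses={"solverX"}, deadline=3600.0)
print(j.processors, j.deadline)  # 4 3600.0
```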

4. The scheduler architecture

The proposed scheduler is based on multiple job classes. Each job is assigned to a class on the basis of its functional and/or non-functional requirements. Figure 1 depicts the architecture of our scheduler. Three main components are represented: Job-Dispatcher, Class-Scheduler and Control-Scheduler. The Job-Dispatcher receives, classifies, and dispatches each job to the proper class. A class is an entity characterized by a set of dynamically assigned computational resources, a job queue and a Class-Scheduler. The classes are ranked according to a priority value assigned statically by the installation on the basis of the functional and non-functional requirements they manage. To each class is associated a Class-Scheduler (CLS). This component is specialized in managing a specific class of job requirements. To each CLS is associated a job queue and a set of resources. The CLS extracts jobs from its queue and allocates resources to them so that they can run. In case of resource shortage, it issues a request for additional resources to the Control-Scheduler. A class releases its assigned resources when they have not been used for a predefined quantum of time, fixed by the installation.

Fig. 1: The scheduler architecture.

The Control-Scheduler (CNS) is devoted to managing the requests issued by the CLSs. The CNS allocates resources to CLSs in the form of sub-machines or floating sw licenses. As an example, consider a job asking for four processors in a computing farm composed only of eight-processor machines. The CNS assigns four free processors from an available suitable machine to the requesting class. That class becomes the temporary owner of the assigned resources until it releases them. Requests issued by higher ranked classes are scheduled before the other ones, and requests issued by the same CLS are managed in FCFS order. The CNS defines new sub-machines according to two alternatives: 1) the sub-machine is defined on a machine already assigned to the requesting class; 2) the sub-machine is defined on a machine not assigned to any class, i.e. a free machine. If no machine is available, the CNS can decide to enact a resource stealing process. The definition of a new sub-machine is performed by exploiting the principle of least privilege, namely, from all the available machines the one with the least amount of resources sufficient to satisfy the requested assignment is chosen. Sub-machines are managed using a data structure consisting of a vector VFM whose length equals the number of processors P of the largest machine in the computing farm. Each element of the vector contains a list in which each element represents a farm machine. A machine belongs to the list at index i if its current availability of processors equals i. The lists are arranged in increasing order with respect to machine memory size, and then by the number of floating sw licenses the machine can run. In order to find a machine with p processors, a RAM of size r and l licenses, the CNS starts its search from the VFM[p] element and continues until it finds a machine meeting the assignment requirements or the vector ends.
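The VFM lookup just described can be sketched as follows. This is a hedged sketch assuming VFM is a list of lists of machines indexed by the number of currently free processors, with each list already ordered by memory size and then by runnable licenses; the helper names are ours:

```python
# Sketch of the VFM search: find a machine with at least p free processors,
# RAM >= r, and able to run the required licenses.

def find_machine(vfm, p, r, licenses):
    """Scan VFM[p], VFM[p+1], ... until a machine meets all requirements."""
    for free_procs in range(p, len(vfm)):
        for m in vfm[free_procs]:
            if m["ram"] >= r and licenses <= m["licenses"]:
                return m
    return None  # the vector ended without a suitable machine

# Toy farm: index = number of free processors on the machine.
vfm = [[] for _ in range(9)]
vfm[4].append({"name": "m1", "ram": 2000, "licenses": {"licA"}})
vfm[8].append({"name": "m2", "ram": 8000, "licenses": {"licA", "licB"}})

print(find_machine(vfm, 4, 1000, {"licA"})["name"])  # m1 (smallest that fits)
print(find_machine(vfm, 4, 4000, {"licB"})["name"])  # m2
```

Starting the scan at index p and walking upward is what keeps larger machines in reserve, as the paper notes below.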
When a proper machine is found and a new sub-machine is created, the number of available processors on that machine decreases correspondingly, and the data structure is updated accordingly. The idea behind this data structure is to keep larger machines available for subsequent requests and to reduce machine fragmentation. Floating sw licenses are managed using a specific data structure consisting of a vector whose length equals the number of available floating sw licenses S. Each vector entry addresses a list storing the number of currently usable copies of a specific license. Each such list is structured as three sublists storing, respectively, the number of available copies, the number of copies assigned to a class but not in use, and the number of copies assigned to a class and in use. Floating licenses belonging to the first two sublists are available for assignment to classes, whereas the copies in the third sublist are already assigned. When a free copy of a floating sw license is assigned to a class, it is moved from the first sublist to the third sublist. When a job using a floating sw license finishes its execution, the license copy is moved to the second sublist. After an installation-defined quantum of time, licenses in the second sublist are released and moved back to the first sublist. In our study we considered six job classes (0÷5). The first three classes (0, 1 and 2) manage jobs with both functional and non-functional requirements, whereas the other three (3, 4 and 5) manage jobs with only functional requirements. Jobs are assigned to the classes according to the following criteria:

• Class 0: jobs requiring a slowdown equal to 1. Jobs are managed by the related CLS in First Come First Served (FCFS) order.

• Class 1: jobs requiring advanced resource reservation. Jobs are scheduled according to their closeness to the reservation. Jobs for which the resource reservation fails are discarded. Alternatively, but not in our study, they could be moved to Class 3, 4 or 5 depending on their functional requirements.

• Class 2: jobs with a deadline. Jobs are scheduled according to the expected time at which they have to start in order to meet their deadline. The queue position of a job is determined by exploiting the solution proposed in [6]: the closer the deadline of a job is, the higher its position in the job queue.

• Class 3: sequential or parallel jobs requiring floating sw licenses.

• Class 4: parallel jobs not requiring any floating sw license.

• Class 5: sequential jobs not asking for floating sw licenses.

Jobs within classes 3, 4 and 5 are selected by the related CLS in FCFS order. If a job has requirements that would assign it to two different classes, it is assigned to the class with the higher priority. The assignment of resources to classes permits the exploitation of locality in job requirements. In fact, after an initial period, it is highly probable that a class managing jobs with similar requirements already owns the resources needed to run the jobs assigned to it.
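The three-sublist lifecycle of a floating-license copy described earlier in this section (free, assigned to a class but unused, assigned and in use) can be sketched with per-license counters. The counter and method names are ours:

```python
# Sketch of the floating-license data structure: one counter per sublist.

class FloatingLicense:
    def __init__(self, copies):
        self.free = copies       # first sublist: available copies
        self.unused = 0          # second sublist: assigned to a class, not used
        self.in_use = 0          # third sublist: assigned and in use

    def assign(self):
        """A copy (free, or already assigned but unused) is put in use."""
        source = "free" if self.free > 0 else "unused"
        if getattr(self, source) == 0:
            raise RuntimeError("no copies available")
        setattr(self, source, getattr(self, source) - 1)
        self.in_use += 1

    def job_finished(self):
        """The copy stays with the class but is no longer used."""
        self.in_use -= 1
        self.unused += 1

    def quantum_expired(self):
        """Unused copies are released back to the free sublist."""
        self.free += self.unused
        self.unused = 0

lic = FloatingLicense(copies=3)
lic.assign(); lic.assign()
lic.job_finished()
print(lic.free, lic.unused, lic.in_use)  # 1 1 1
lic.quantum_expired()
print(lic.free, lic.unused, lic.in_use)  # 2 0 1
```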

Resource stealing: When a class is experiencing a lack of free resources to satisfy a request, its CLS issues a request to the CNS. As a consequence, the execution of jobs belonging to classes with a lower priority may have to be interrupted to release the needed resources. However, the interruption of a job has a cost for the computing farm. Actually, interrupting a job is convenient only if the gain resulting from the use of the released resources overcomes this cost. To evaluate this cost several parameters should be considered, e.g. the time the candidate job has already spent in execution, the number of sw licenses assigned to that job, etc. In our model a resource r can be moved from a class A to a class B if the following expression holds: Rank_A > Cost_B(r), where Rank_A is the rank value of class A and Cost_B(r) is the cost associated with the interruption of the jobs running on resource r and belonging to B. Considering a generic class C, a resource r, and the number k of jobs running on r, such cost can be computed as:

Cost_C(r) = Rank_C + Σ_{i=1..k} P_C(i), where

P_C(i) = W_i · (T_ex(i)/T_tot(i)) · W_dead + (W_pr · Pr(i)) + (W_l · L(i)).

T_ex(i) is the time spent executing job i and T_tot(i) is the total estimated execution time of job i. W_i is the weight associated with the form of preemption adopted to interrupt the execution of job i; it is small if the job supports checkpoint/restart, larger in the case of suspend/resume, and maximal if the only option is stop/restart. W_dead is the weight associated with jobs having a deadline; it equals 1 for jobs without a deadline. W_pr is the weight associated with a processor and Pr(i) is the number of processors assigned to i. W_l is the weight associated with a floating sw license and L(i) is the number of floating sw licenses used by i. The idea of this approach is to allow an installation to tune the W_i, W_pr, W_l and W_dead values and the class ranks according to its objectives. As an example, suppose that an installation goal is to respect in a very strict way the prioritization given by the job classes. To this end, the ranks associated with two consecutive classes have to differ by a value greater than the maximum value that P_C(i) can assume. Such a value is obtained when a job i (it makes no difference whether there is one job or several if the overall resource usage is the same) is using all the processors of the largest farm machine and all the available floating sw licenses, and is approaching the end of its execution. Let us assume that the largest farm machine has 1024 processors, and that the total number of floating sw licenses is 20. Moreover, consider weights assuming these values: W_i = 200, W_pr = 1, W_l = 0.5 and W_dead = 2. Hence, the maximum value P_C(i) can assume is: P_C_max = 200 · 1 · 2 + 1 · 1024 + 0.5 · 20 = 1434. As a consequence, the rank values of the six classes can be fixed as follows: Rank_Class5 = 0, Rank_Class4 = 1500, Rank_Class3 = 3000, Rank_Class2 = 4500, Rank_Class1 = 6000, and Rank_Class0 = 7500.¹

¹These rank values are the ones used in the conducted tests.
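The cost model and the rank-spacing example above can be checked with a short worked computation. The weights and ranks are the ones from the paper's example; the function names are ours:

```python
# Worked example of the resource-stealing cost model.

def preemption_cost(w_i, t_ex, t_tot, w_dead, w_pr, n_procs, w_l, n_lic):
    """P_C(i) = W_i * (T_ex/T_tot) * W_dead + W_pr*Pr(i) + W_l*L(i)."""
    return w_i * (t_ex / t_tot) * w_dead + w_pr * n_procs + w_l * n_lic

def class_cost(rank_c, per_job_costs):
    """Cost_C(r) = Rank_C + sum of P_C(i) over the k jobs running on r."""
    return rank_c + sum(per_job_costs)

# Maximum P_C(i): a job near completion (T_ex/T_tot ~ 1), deadline weight 2,
# all 1024 processors of the largest machine, all 20 floating licenses.
p_max = preemption_cost(w_i=200, t_ex=1.0, t_tot=1.0, w_dead=2,
                        w_pr=1, n_procs=1024, w_l=0.5, n_lic=20)
print(p_max)  # 1434.0

# A resource r moves from class A to class B iff Rank_A > Cost_B(r).
rank_a = 3000                       # Rank_Class3
cost_b = class_cost(1500, [p_max])  # Class 4 running one worst-case job
print(rank_a > cost_b)              # True: 3000 > 1500 + 1434 = 2934
```

Since consecutive ranks differ by 1500 > 1434, a higher class can always steal even from a worst-case job of the class immediately below it, which is the strict-prioritization behavior the example aims for.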

Resource search: Considering a job i belonging to a class A and requiring Prx ≤ P processors and Lx ⊆ S floating sw licenses, the resource search algorithm is structured according to the following steps:

1) Starting from the entry Prx, the VFM data structure is analyzed to find jobs to be interrupted.

2) For every machine m suitable for executing i indexed by Prx, a list of the jobs that could be interrupted is created. The list also includes free processors.

3) The first Nx jobs whose interruption permits us to obtain the required Prx processors are selected.

4) If selecting the first Nx jobs yields a number of processors greater than Prx, a refinement step is conducted to adjust the number of selected processors: the list of selected jobs is visited in reverse order to remove exceeding processors.

5) The cost Costr is computed as the sum of the costs related to the jobs to be interrupted on m. If Costr is smaller than the costs computed for the other analyzed machines, machine m is selected, and the jobs executing on it are selected to be interrupted.

6) Steps 2 to 5 are repeated from Prx + 1 to P to find other machines suitable for executing i.

7) At the end of step 6, the list of jobs that could be interrupted (i.e. the jobs running on the machine with the lowest associated cost) is produced.

The interruption of the selected jobs may free some of the required licenses. In this case, the freed licenses are removed from Lx and the following steps are executed:

1) The floating sw license search starts from the queue l ∈ Lx of the floating sw license data structure.

2) The cost Costr due to the interruption of a job using l is computed. The job corresponding to the smallest Costr is selected and its execution interrupted.

3) Steps 1 and 2 are repeated until all the licenses needed to run job i are found.

4) At the end of step 3, the set of jobs to interrupt has been determined.
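The license-search loop (steps 1–4 above) can be sketched as follows. The data layout is an assumption of ours: for each license, the list of jobs currently holding a copy together with their interruption costs:

```python
# Sketch of the floating-license search: for each still-missing license,
# interrupt the job whose interruption cost Costr is smallest.

def search_licenses(missing, users):
    """missing: set of license names still needed (Lx).
    users: dict license -> list of (job_name, interruption_cost)."""
    to_interrupt = []
    for lic in sorted(missing):                          # step 1: per-license queue
        job, cost = min(users[lic], key=lambda x: x[1])  # step 2: cheapest job
        to_interrupt.append((lic, job, cost))            # steps 3-4: until all found
    return to_interrupt

users = {
    "licA": [("j1", 9.0), ("j2", 4.0)],
    "licB": [("j3", 7.0)],
}
print(search_licenses({"licA", "licB"}, users))
# [('licA', 'j2', 4.0), ('licB', 'j3', 7.0)]
```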

This phase is the most computationally expensive one. In fact, the search for free processors requires, in the worst case, analyzing all the available machines and floating sw licenses. The search for processors needs to sort the N jobs running on each of the M machines in the farm. Since the sort operation has complexity N log N, in the worst case the resource search algorithm has complexity C = M · N log N. The search for floating sw licenses has, in the worst case, complexity |L|.

5. Performance Evaluation

The evaluation of the proposed scheduler was conducted by simulations using different streams of jobs and farms of different sizes. Job and machine parameters were randomly generated from a uniform distribution in the ranges shown in Table 1. Moreover, we compared our solution with the Backfilling and Flexible backfilling algorithms. The job priorities of the Flexible backfilling algorithm are updated at each job submission or ending event, and the reservation for the first queued job is maintained across events.

Table 1: Parameters used to generate jobs and machines.

Description                              Range
Processor type                           1 ÷ 5
Number of processors                     1 ÷ 128
Benchmark score                          0.5 ÷ 2
RAM                                      500 MB ÷ 5 GB
Job estimated execution time (secs)      16000 ÷ 20000
Number of license copies                 50 ÷ 70
Number of different licenses             20

For each simulation, the percentages of jobs with specific functional and non-functional requirements were generated according to the values shown in Table 2.

Table 2: Percentages used to generate the job streams.

Percentage   Description
5%           requires a slowdown equal to 1
30%          has a deadline
5%           needs advanced resource reservation
60%          needs a software license
30%          needs a floating software license
10%          needs specific hardware
2%           supports checkpointing
40%          needs 1 processor
40%          needs 2 processors
10%          needs 4 processors
0.8%         needs 8 processors
0.08%        needs 16 processors
0.06%        needs 32 processors
0.04%        needs 64 processors
0.02%        needs 128 processors

The duration of each simulation was set to 43200 time units (i.e. the number of seconds in 12 hours). For each simulation unit the system: (1) generates a job and puts it in the Dispatcher's job queue; (2) updates the status of the running jobs; (3) updates the status of the resources; (4) executes the CLSs; (5) executes the CNS; (6) stores the simulation statistics. In the conducted experiments the number of generated machines varied from 1000 to 1200, and to obtain stable values each simulation was repeated 50 times with different farm configurations and job streams. The performance metrics were evaluated versus the system contention. Usually, this value is roughly computed as Resource_R / Resource_A, where Resource_R is the amount of a specific resource requested by the jobs in the system, and Resource_A is the available amount of that resource. This ratio does not provide accurate information on resource availability, because it ignores the constraints implied by job allocation. In fact, all the requirements of a job must be satisfied to allocate it, so the variables describing the available resources cannot be considered independently.
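The per-time-unit loop described above can be sketched as a simulation skeleton. All step functions here are trivial placeholders standing in for the components of Section 4, not the paper's code:

```python
# Skeleton of one simulation run with the six per-unit steps listed above.

def generate_job(t, queue): queue.append(t)   # (1) placeholder job generator
def update_running_jobs(t): pass              # (2) placeholder
def update_resources(t): pass                 # (3) placeholder
def run_class_schedulers(t): pass             # (4) placeholder CLSs
def run_control_scheduler(t): pass            # (5) placeholder CNS

def run_simulation(duration=43200):
    queue, stats = [], []
    for t in range(duration):
        generate_job(t, queue)
        update_running_jobs(t)
        update_resources(t)
        run_class_schedulers(t)
        run_control_scheduler(t)
        stats.append(len(queue))              # (6) store statistics
    return stats

stats = run_simulation(duration=10)           # shortened run for illustration
print(len(stats))  # 10
```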

To clarify this point, let us suppose that the value computed by the above expression is less than 1. In principle, it indicates an availability of the considered resource; as a consequence, a scheduler should be able to properly allocate the jobs on the available resources. However, this is not always true. As an example, consider an availability of 20 processors in the system and a job to schedule requiring 16 processors. Clearly, if at least 16 of the 20 free processors are not available on the same machine, the job cannot be scheduled, even if a rough analysis would suggest sufficient processor availability. Unfortunately, in general it can be hard to understand whether a resource shortage is caused by the ineffectiveness of the adopted scheduler or by an insufficient number of available resources. To overcome this problem we introduce the RRI index. Its aim is to exploit a simple allocator to measure, with a certain degree of approximation, whether at a certain time the resources in the system are sufficient to meet all the job requirements. In particular, in this paper we only consider the processor resource when computing the RRI index. To this end, we considered the following four job scheduling algorithms (but in principle others could also be considered), each one basing its strategy on a different job allocation policy: 1) Largest Machine, which allocates a job on the machine with the largest number of free processors; 2) Smallest Machine, which allocates a job on the machine with the smallest number of free processors; 3) Smallest Residue, which allocates a job on the machine on which, after the allocation, the lowest number of free processors remains; 4) Largest Residue, which allocates a job on the machine on which, after the allocation, the largest number of free processors remains. These algorithms were evaluated to find the one leading to the best processor usage in the simulated environment.
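The four allocation policies, and the fragmentation pitfall in the 20-free-processors example above, can be sketched as follows. A machine is reduced here to its number of free processors; the function and policy names are ours:

```python
# Sketch of the four single-resource allocation policies: each returns the
# index of the chosen machine, or None if no single machine can host the job.

def allocate(free, need, policy):
    fitting = [i for i, f in enumerate(free) if f >= need]
    if not fitting:
        return None  # job cannot be placed on any single machine
    key = {
        "largest_machine":  lambda i: -free[i],         # most free processors
        "smallest_machine": lambda i: free[i],          # fewest free processors
        "smallest_residue": lambda i: free[i] - need,   # least leftover
        "largest_residue":  lambda i: -(free[i] - need) # most leftover
    }[policy]
    return min(fitting, key=key)

# 20 free processors in total, but scattered: a 16-processor job cannot run,
# even though the rough ratio Resource_R / Resource_A would be below 1.
free = [8, 6, 4, 2]
print(allocate(free, 16, "smallest_residue"))  # None

free = [8, 6, 4, 2, 16, 20]
print(allocate(free, 16, "smallest_residue"))  # 4 (residue 0 on the 16-proc machine)
print(allocate(free, 16, "largest_machine"))   # 5 (the 20-proc machine)
```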
To this end, a workload able to use all the available processors of the simulated farm was designed according to the following four steps: 1) a random generation of a set of machines; 2) for each machine, the generation of a proper set of jobs; 3) a random distribution of all the processors belonging to each machine among the generated set of jobs; 4) the assignment of the generated jobs to free computation slots in such a way that they all finish their execution on the target machine at a fixed time. RRI is computed as: (Processors_r + Processors_q)/Processors_a, where Processors_r are the processors requested by the allocated jobs, Processors_q are the processors requested by the not yet allocated jobs, and Processors_a are the available processors. The higher the RRI value, the higher the system contention. Smallest Residue is the method that obtained the best results in the 500 simulations we conducted, varying the number and the type of the machines inside the simulated farm. This is the allocator we used for computing the RRI index: it behaves as a sort of probe measuring the processor availability throughout a simulation, and it is executed each time a job execution is started. To evaluate the scheduler efficiency, we analyzed the algorithms exploited by CNS to handle sub-machines.

Figure 2 shows the percentages of new sub-machine definitions and expansions obtained in the simulations. When the RRI value is low (RRI < 0.4) there is a high rate of new sub-machine definitions, because the classes have only a few resources assigned. When the value of RRI is between 0.4 and 0.8 there is a higher sub-machine expansion rate, because the classes already have a large number of sub-machines and, at the same time, there are enough available processors on the farm machines. When the value of RRI is greater than 0.8, the percentages of the expansion and new definition processes tend to stabilize at values of 30% and 40%, respectively. This result means that, when the system is heavily loaded, for example when RRI is equal to 2, in about 35% of cases the classes already have a sub-machine able to execute a submitted job, while in the other 65% of cases the CLSs ask the CNS either to extend a sub-machine (25%) or to define a new sub-machine (40%).
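The RRI computation itself reduces to a one-line ratio; a minimal, hypothetical sketch (function name and figures are illustrative, not from the simulator):

```python
def rri(procs_running, procs_queued, procs_available):
    """Resource Requirements Index: (Processors_r + Processors_q) / Processors_a.

    Values above 1 indicate that the jobs in the system request more
    processors than the farm provides (high contention).
    """
    return (procs_running + procs_queued) / procs_available

# e.g. 60 processors held by running jobs, 30 requested by queued jobs,
# on a farm with 64 available processors:
print(rri(60, 30, 64))  # -> 1.40625
```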

[Figure: percentage (0–100) vs. RRI (0–2); curves: Machine Expansion, New Machines]

Fig. 2: New sub-machines definition or expansion.

The graph of Figure 3 shows the percentage of processors used by the running jobs. When the RRI index is greater than 1, i.e. when the requested resources begin to be unavailable, the processor utilization approaches 100%. However, the shape of the curve clearly shows that in some cases there are free processors even when the values of RRI are greater than 1. In fact, even when the value of RRI is greater than 1.2 (i.e. when the estimated number of requested resources exceeds the available ones), the figure shows that some processors are not used. This happens because, even when the number of requests is much greater than the available resources, the free processors may not be able to run any of the waiting requests. However, it is worth pointing out that our scheduler is able to schedule jobs in a way that keeps the number of unused resources low.

We also investigated the degree of satisfaction of non-functional job requirements. We evaluated the quality of service provided by the Control-Scheduler on the basis of the decisions it has made to allocate resources to the classes. This analysis has been conducted to assess the choices made in the following areas: (1) the job classification policies and the rank values assigned to classes; (2) the resource stealing technique. Bad choices can cause long queuing times for some types of jobs, in particular the ones belonging to low-ranked classes.

To evaluate the satisfaction level of the resource demands

[Figure: percentage (0–100) vs. RRI (0–2)]

Fig. 3: Processor usage.

we used the slowdown metric. This measures the ratio between the response time of a job (i.e. the time elapsed between its submission and its termination) and its execution time. It is computed as (Tw + Te)/Te, with Tw the time that a job spends waiting to start and/or restart its execution, and Te the job execution time [9]. Figure 4(a) shows the average slowdown obtained executing jobs belonging to the following job classes: Class 0, i.e. jobs requiring a slowdown equal to 1; Class 3, i.e. sequential or parallel jobs requiring floating licenses; Class 4, i.e. parallel jobs not requiring floating licenses; and Class 5, i.e. sequential jobs not asking for floating licenses. In this evaluation, jobs requesting advanced reservation or a deadline were not considered, because for such jobs the slowdown is not indicative. In fact, depending on their characteristics, such jobs can spend some time enqueued before being executed without affecting their performance. In our tests, all the requests of advanced resource reservation were satisfied. Figure 4(a) shows that, when the resources are available (i.e. RRI < 1), all the jobs obtain a slowdown equal to 1. When RRI > 1, i.e. when the job competition to access the available computational resources increases, jobs are forced to spend some time in the queues, resulting in an increase of the average job slowdown. It can be seen that in the conducted tests the job requirement slowdown = 1 is satisfied even when the value of RRI reaches 1.6 (i.e. high system contention). The slowdown of jobs asking for floating software licenses remains under 1.2 even with high system contention, while the slowdown increases only up to 20% for parallel jobs. The slowdown of the serial jobs is the worst: their completion time can increase up to 80%. Figure 4(b) shows the percentage of jobs executed respecting their deadline.
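The slowdown metric defined above can be made concrete with a tiny helper (names and figures are illustrative):

```python
def slowdown(wait_time, exec_time):
    """Slowdown of a job: (Tw + Te) / Te, following [9].

    A job that starts immediately (Tw = 0) has slowdown 1;
    any time spent queued inflates the ratio.
    """
    return (wait_time + exec_time) / exec_time

print(slowdown(0, 120))   # job starts at once -> 1.0
print(slowdown(60, 120))  # one minute queued, two running -> 1.5
```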
The results obtained by the proposed scheduler were compared with those obtained by running the Backfilling and Flexible backfilling algorithms under the same simulation conditions, i.e. with both the same machines and the same job streams used to evaluate the proposed scheduler. As expected, the lower the system contention (RRI ≤ 1), the higher the percentage of jobs meeting their deadline, and all the schedulers are able to satisfy all deadline requests. The proposed scheduler obtains better results than the other algorithms. It obtains a percentage of

[Figure: three panels, each plotted against RRI (0–1.6): (a) Job slowdown (curves: Slowdown = 1, License, Parallel, Serial); (b) Job deadline, percentage of jobs meeting their deadline (curves: Class Scheduler, Backfilling, Flexible Backfilling); (c) Slowdown of jobs requiring slowdown = 1 (curves: Class Scheduler, Backfilling, Flexible Backfilling)]

Fig. 4: Slowdown.

jobs that respect their deadline very close to 100% even with high system contention (RRI ≥ 1.5). As the system contention increases, the Flexible backfilling algorithm reaches a performance that is 16% lower than the one obtained by the proposed scheduler, while the Backfilling algorithm obtains a performance significantly lower than that of our scheduler. In Figure 4(c) we show the results obtained from the execution of jobs requiring a slowdown value equal to 1. It can be seen that when the resources are no longer available, the Backfilling and Flexible backfilling algorithms are not able to guarantee this QoS. The proposed class scheduler, by using the resource stealing technique, makes the needed resources available even when the system contention is high (RRI ≥ 1.5). It is worth pointing out that the Flexible backfilling algorithm maintains the slowdown value within an acceptable level, offering in this test a performance comparable to the one obtained by the proposed scheduler.

6. Conclusion

In this paper, we propose a new multi-criteria scheduler

to dynamically schedule a continuous stream of batch jobs on large-scale, non-dedicated computing farms made of heterogeneous, single-processor or SMP machines, linked by a low-latency, high-bandwidth network. The proposed solution aims at scheduling arriving jobs respecting several functional and non-functional job requirements and optimizing the hardware and software resource usage. Several configuration parameters allow the customization of the scheduler with respect to the goals of an installation. The scheduler was evaluated by simulations using different, synthetically generated job streams. To conduct the evaluation, a technique to measure the system contention throughout a simulation was adopted. The scheduler has also been evaluated by comparing it with the Backfilling and Flexible backfilling schedulers. In the conducted tests, the proposed scheduler demonstrated its ability to make good scheduling choices. As future work, we plan: (1) to enhance the current scheduler by refining the adopted advanced resource reservation technique, and to manage jobs requiring co-allocation to be executed on more than one machine,

(2) to introduce energy efficiency policies dispatching workloads to more energy-efficient machines, (3) to evaluate the scheduler when applied to computing platforms made of distributed computing farms, and (4) to investigate the feasibility of different scheduling criteria to estimate the RRI index.

7. Acknowledgment

This work has been supported by the Projects CONTRAIL

(EU-FP7-257438) and S-CUBE (EU-FP7-215483).

References

[1] G. Capannini, R. Baraglia, D. Puppin, L. Ricci, and M. Pasquali. A

job scheduling framework for large computing farms. In SC, page 54, 2007.

[2] P.-F. Dutot, L. Eyraud, G. Mounié, and D. Trystram. Bi-criteria algorithm for scheduling jobs on cluster platforms. In Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures, SPAA '04, pages 125–132, New York, NY, USA, 2004. ACM.

[3] H. El-Rewini, T. G. Lewis, and H. H. Ali. Task Scheduling in Parallel and Distributed Systems. PTR Prentice Hall, Englewood Cliffs, New Jersey, 1994.

[4] Y. Etsion and D. Tsafrir. A short survey of commercial cluster batch schedulers. Technical Report 2005-13, School of Computer Science and Engineering, The Hebrew University of Jerusalem, May 2005.

[5] D. Feitelson, L. Rudolph, and U. Schwiegelshohn. Parallel job scheduling—a status report. In Job Scheduling Strategies for Parallel Processing, pages 1–16. Springer, 2005.

[6] D. Klusáček, H. Rudová, R. Baraglia, M. Pasquali, and G. Capannini. Comparison of multi-criteria scheduling techniques. In Grid Computing: Achievements and Prospects. Springer, 2008.

[7] K. Kurowski, J. Nabrzyski, A. Oleksiak, and J. Weglarz. A multicriteria approach to two-level hierarchy scheduling in grids. Journal of Scheduling, 11:371–379, October 2008.

[8] K. Kurowski, J. Nabrzyski, A. Oleksiak, and J. Weglarz. Scheduling jobs on the grid—multicriteria approach. Computational Methods in Science and Technology, 12(2):123–138, 2006.

[9] A. Mu'alem and D. Feitelson. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Transactions on Parallel and Distributed Systems, 12(6):529–543, 2001.

[10] U. Schwiegelshohn and R. Yahyapour. Analysis of first-come-first-serve parallel job scheduling. In Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, pages 629–638. Society for Industrial and Applied Mathematics, 1998.

[11] M. Siddiqui, A. Villazón, and T. Fahringer. Grid capacity planning with negotiation-based advance reservation for optimized QoS. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 103. ACM, 2006.