
Page 1: [IEEE 19th International Symposium on High Performance Computing Systems and Applications (HPCS'05) - Guelph, ON, Canada (15-18 May 2005)] 19th International Symposium on High Performance

Scheduling Based on the Impact over Process Communication of Parallel Applications

Renato Porfírio Ishii, Rodrigo Fernandes de Mello, Luciano José Senger, Marcos José Santana, Regina Helena Carlucci Santana

Universidade de São Paulo – Instituto de Ciências Matemáticas e de Computação, São Carlos, São Paulo, Brazil

E-mail: [email protected], [email protected], [email protected], {mjs, rcs}@icmc.usp.br

Abstract

This paper presents a new model for evaluating the impact that inter-process communication has on processing operations. The model quantifies the traffic volume imposed on the communication network by means of latency and overhead parameters. These parameters represent the load that each process imposes on the network and the delay caused on the CPU by network operations. The delay is represented in the model through the slowdown metric. Equations are defined that quantify the costs involved in processing and message exchange; likewise, equations that determine the maximum network bandwidth are used in scheduling decision-making. The proposed model uses a constant that delimits the maximum allowed usage of the communication network; this constant selects between two possible scheduling techniques: group scheduling or distribution through the communication network. These techniques are incorporated into the DPWP policy, generating an extension of that policy. Results confirm the resulting performance improvement of parallel applications.

1. Introduction

Process scheduling is one of the most relevant issues in distributed computing systems. To address this problem, several scheduling policies have been proposed [3, 12, 1]. Such policies distribute the processes that compose parallel applications over the available processing elements (PEs), with objectives such as load balancing, reduction of application response time and better use of resources.

According to Feitelson and Rudolph [7], scheduling policies may consider several parameters that affect performance, such as CPU occupation load, the quantity of available memory, input/output and communication, in order to minimize process execution time. These parameters may be extended by considering the different classes of parallel applications. Based on resource utilization, applications may be subdivided into two classes: CPU-Bound and I/O-Bound. I/O-Bound applications may be further organized into Disk-Bound and Network-Bound classes. Disk-Bound applications perform many accesses to secondary memory, whereas Network-Bound applications are characterized by many message sending and receiving operations through the communication network.

The use and development of Network-Bound applications have been driven mainly by the continuous evolution of computer networks and by the consolidation of distributed systems such as clusters, NOWs (Networks of Workstations) and Grids. Examples of applications that use such architectures are corporate systems, fluid dynamics, weather forecasting and image processing.

Scheduling policies may be adapted to the different classes of applications being scheduled. Nevertheless, few works have aimed at the development of policies oriented to I/O-Bound applications. Those works have evaluated both Disk-Bound and Network-Bound applications, which handle large data volumes (data intensive). In such evaluations, it becomes evident that these applications require high bandwidth and low latency, both for secondary memory access and for the communication network. Frequently, the communication network is the main performance bottleneck during the execution of such applications, in view of its bandwidth and latency limitations.

Network-Bound applications compete for the communication network. This competition, known as network contention, affects the information transfer time (latency) and, consequently, lengthens process execution time. This fact has motivated research on the evaluation and analysis of the load status of the resources involved in the communication

Proceedings of the 19th International Symposium on High Performance Computing Systems and Applications (HPCS’05)

1550-5243/05 $20.00 © 2005 IEEE


subsystem. Such motivation has led to a careful study of the scheduling of parallel applications that perform constant information transfer over the network, generating an extension of the scheduling policy named DPWP [1].

This study has considered communication overhead parameters, latency and the volume transmitted by messages, which, combined with CPU occupation parameters, allow the definition of load indexes. These indexes quantify the status of system resources and thus contribute to scheduling decision-making. Well-founded decisions based on such parameters improve both resource utilization in the system and the execution time of parallel applications.

Using these indexes, which are based on CPU and communication network occupation, two rules for scheduling decision-making are defined: allocating a group of processes to the same PE, or distributing the processes among the available PEs, which increases network utilization. It has been observed that, under certain load situations, the first strategy increases the performance of communication-intensive applications. This happens because, when all processes are allocated to a single PE, the synchronization delay among communicating processes decreases. Thus, the time spent on communication is restricted to the resource capacity (local communication blocking and interfaces) of the receiving PE, leaving the network free for other scheduling operations and, consequently, minimizing the response time of applications whose bottleneck is the communication network.

This paper presents a study of the process load behavior over the communication network and of parallel application scheduling based on the communication volume among processes. From these studies, a simulator and an extension of the DPWP policy have been defined. This extension supports the performance evaluation of the scheduling of Network-Bound parallel applications. The paper is organized into the following sections: 2) related works; 3) scheduling based on the evaluation of the communication impact among processes; 4) experimental and simulation results; 5) conclusion.

2. Related Works

Scheduling is responsible for distributing processes over the PEs with the aim of increasing application performance. The definition of good scheduling policies considerably enhances the final application performance, as may be observed in [3, 12, 1]. These studies on the development of scheduling policies have taken into account a series of parameters; however, they have not considered the features related to communication among processes of the same parallel application. Such features comprise the number of messages exchanged among processes, the distribution of those messages during the application execution, and the process performance slowdown when using the network.

With the aim of overcoming these limitations in characterizing communication, other works have been published, such as [10, 11, 13, 8, 2], which analyze the impact of information transmission and reception among parallel application processes. From such analysis, it is possible to define scheduling strategies that take into account both the application requirements and the computing resources involved in the communication among processes.

Ni et al. [11] proposed a technique for scheduling processes on point-to-point multiprocessor systems with low-scale growth. A PE that is idle or lightly loaded sends requests to its neighbors asking to receive processes. The neighbors answer the request by informing their status through two kinds of messages: busy or not busy. The idle PE then selects a neighbor and sends a delivery message. The neighbor may answer in two ways: by sending a new process, or with a delay message, in which case the request is forwarded to another PE. This technique aims at decreasing the amount of information transmitted over the communication channel. Ryou and Wong [13] presented a model for process grouping. From such groups, techniques are defined to deal with message sending and receiving, so that the communication overload may be decreased, as well as the application response time. A strategy named set attempts to reduce the number of messages exchanged on the communication network and the CPU overhead of each PE. These works have not evaluated the occupation of the communication network: the physical environment has not been quantified by any parameter, and no analysis has been conducted of the impact caused by communication on process scheduling.

Keren and Barak [8] proposed an opportunity cost policy, which takes into account application overhead both for input/output and for communication among processes. In that work, a marginal cost function is applied to reallocate processes among the PEs, using circuit routing concepts to determine the CPU overhead related to communication. Each PE in the system is provided with multiple resources: CPU, memory, input/output devices, etc. The objective is to minimize the utilization overload of such resources.

The overhead evaluation model proposed in [8] does not take into account the impact that each process causes on communication network operations. In addition, it does not evaluate the weight of such overhead through a measurement that quantifies the CPU delay, such as the number of MIPS (millions of instructions per second) consumed by the CPU to accomplish operations related to message sending and receiving over the network.


Chodnekar et al. [2] defined a methodology to characterize the communication among parallel applications. Three attributes are considered for traffic capture: message generation frequency (or rate), message spatial distribution and message size. The message generation rate represents the temporal component of the communication subsystem, while the spatial distribution represents the traffic pattern, that is, with which other processes a given process wishes to communicate. The work is limited to the evaluation of communication requirements, not considering the communication impact on process scheduling.

The limitations of the previous works have motivated the study of the occupation features of the communication network, as well as of the impact caused by such occupation on process scheduling. The limitations related to the applications' communication features (volume, traffic, overhead) and to the impact of information exchange on the CPU have motivated the development of this paper.

3. Scheduling based on the evaluation of the impact over process communication

The limitations related to the quantification of the communication load, and the absence in [10, 11, 13, 8, 2] of an analysis of the influence of communication on scheduling, have motivated the study of the impacts caused by communication on processing.

Such impacts result from the overhead, latency and volume (message size and exchange frequency) transmitted as messages. These parameters are used in this work to model the communication behavior of processes. Overhead represents the time spent packing and unpacking messages. Latency is the time spent transmitting a given message over the network physical layer. Volume is the number of bytes transferred over the network in a given time window (for instance, in bytes/second).

Thus, a model has been defined to quantify the process communication load on the system's communication network and, based on this model, an extension of the DPWP policy has been proposed. This approach aims to use the communication network state as a parameter for scheduling decision-making.

3.1. Communication Model

The communication model defines the behavior of the traffic volume on the communication network. Such volume is parameterized by means of latency and overhead. The model attempts to quantify the load imposed on the communication network and the delay that network operations cause on the PEs. This delay is defined as the slowdown caused by message packing and unpacking operations on the PEs.

The model also defines equations that quantify the processing and communication costs of the PEs during parallel application execution. Equations are also defined to determine the maximum bandwidth used for scheduling decision-making.

The equations that characterize the communication impacts are based on the LogP model [4] and are formally described by equation (1), where: Dm,i,j represents the delay (in seconds) that a message m causes to the processes located on computer i, defined as the message emitter, and on computer j, defined as the message receiver; Om is the overhead to pack the message m on computer i and unpack it on j, which justifies its multiplication by the constant 2; and Lm is the latency (in seconds), which represents the message traffic time through the physical environment.

Dm,i,j = 2 × Om + Lm    (1)

Another significant feature of this model is the definition of a constant k, which determines the maximum utilization of the communication network. Process distribution is performed in accordance with this constant. Based on k and on the relative occupation that each system computer imposes on the network, two techniques are adopted for scheduling processes to PEs: the first, named group scheduling, allocates all processes of the same application to the same PE; the second distributes the processes equally among the available PEs through the communication network.

The adoption of one of these techniques depends on the constant k, which delimits the amount of bandwidth a computer may use on the network. The used bandwidth is quantified by equation (2), where: α indexes each of the processes allocated on computer i; nprocs is the number of processes on computer i; and Nα is the number of bytes per second being transferred by process α at the moment.

NCi = Σα=1..nprocs Nα    (2)

If the communication load NCi imposed by the processes on PE i is higher than the value defined for k (k < NCi), group scheduling is used. Otherwise, when the communication load is lower than k (k > NCi), the processes are allocated among the available PEs through the communication network.
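The decision rule above can be sketched as follows. This is a minimal illustration under assumed data shapes (per-process byte rates as a plain list); none of the names below come from the paper.

```python
# Sketch of the scheduling decision based on equations (1) and (2).
# Names and data shapes are illustrative assumptions, not the
# authors' implementation.

def message_delay(overhead_s, latency_s):
    """Equation (1): Dm,i,j = 2*Om + Lm (pack on i plus unpack on j)."""
    return 2.0 * overhead_s + latency_s

def network_load(byte_rates):
    """Equation (2): NCi = sum of the bytes/s sent by each process on PE i."""
    return sum(byte_rates)

def choose_technique(byte_rates, k):
    """Group scheduling when NCi exceeds k, distribution otherwise."""
    return "group" if network_load(byte_rates) > k else "distribute"
```

For example, with two processes transferring 100 and 200 bytes/s and k = 250, the load of 300 bytes/s exceeds k and group scheduling is chosen.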

Distributing processes according to group scheduling may seem inefficient, as PEs may remain idle. Nevertheless, studies confirm that, as time passes and new processes are submitted and start to occupy the idle PEs, the system utilization improves [6].


3.2. Extension of DPWP scheduling policy

In order to evaluate, within the process scheduler, the behavior of the parameters defined by the model presented above, an extension of the DPWP (Dynamic Policy Without Preemption) scheduling policy [1] has been proposed, a policy developed by the research group of which this project is part.

The DPWP policy originally creates a vector that describes the idle capacity Ci of each processor (or PE), given by equation (3), where i ∈ [1, n] indexes each PE; CPUlength,i is the size of the process queue on PE i at a given instant; and Benchi is the total capacity of PE i, measured by an adopted benchmark.

Ci = CPUlength,i / Benchi    (3)

This vector is maintained in ascending order on each of the system processors. When a processor receives a request to execute a group of processes, the DPWP policy distributes them in sequential order over the processors of the vector. If the number of processes is higher than the number of processors, the DPWP policy restarts the distribution from the processor located at the first position of the vector. Thus, process allocation primarily favors the idlest processors. The DPWP policy has incorporated two changes proposed by the communication model: the adoption of the equations that characterize the communication impacts; and the distribution of processes in accordance with the constant k, which represents how much the communication network is being used.
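The distribution step just described can be sketched as below; the data layout (tuples of PE id, queue length and benchmark capacity) is an assumption for illustration, not the original implementation.

```python
# Sketch of the original DPWP distribution step: sort PEs by the idle
# capacity index of equation (3) and deal processes out round-robin,
# wrapping back to the idlest PE when processes outnumber PEs.

def idle_index(queue_len, bench):
    """Equation (3): Ci = CPUlength,i / Benchi (lower means idler)."""
    return queue_len / bench

def dpwp_assign(processes, pes):
    """pes: list of (pe_id, queue_len, bench) tuples.
    Returns a dict mapping each process to a PE id."""
    order = sorted(pes, key=lambda p: idle_index(p[1], p[2]))
    return {proc: order[i % len(order)][0]
            for i, proc in enumerate(processes)}
```

With PEs A (queue 4), B (queue 1) and C (queue 2), all of capacity 100, the ascending order is B, C, A; four processes are then assigned to B, C, A and B again.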

In the experiments, the value of the constant k is varied so that its best configuration may be determined. Improvements may be noted in the scheduling decision-making, consequently improving the performance of the parallel applications.

4. Experimental and simulation results

In order to evaluate the processing impacts caused by communication, the simulator developed by Mello and Senger [5] has been extended. This extension has added specific parameters to quantify the load of the physical environment for message transmission: communication overhead, latency, volume and transmission frequency.

The overhead is represented as the ratio of MIPS used per byte transferred. The latency is represented by the packet transfer rate through the network, and the volume by probability distribution functions.

With the aim of capturing the ratio of MIPS used per byte transferred (equation 6) and the latency (equation 5), a client/server program has been developed. The client code implements routines for sending fixed-size packets (64 Kbytes, the maximum size for an Ethernet package; this value allows the measurement of response times under maximum channel occupation conditions).

PE  CPU               Memory  Network card
1   P4, 2.4 GHz       512 MB  10/100
2   Athlon, 1.67 GHz  256 MB  10/100
3   P3, 600 MHz       256 MB  10/100
4   P2, 350 MHz       128 MB  10/100

Table 1. Configuration of the PEs used in the experiments

Routines for receiving the packets have been implemented on the server. To measure the transmission rate through the network interface, the client and server programs have been executed on two distinct computers.

In order to conduct the experiments in a heterogeneous environment, four computers with distinct processing capacities (table 1) have been analyzed. On each computer, the TSCP benchmark [9] has been executed under two load conditions: idle, and while executing packet transfer operations through the network interface.

TSCP returns the processing capacity in MIPS. The communication load is established by the client/server program, which keeps sending (client) and receiving (server) packets through the network interface until the Δt time interval (equation 5) is reached.

Using this technique, it is possible to identify the performance slowdown of the PE during packet transfer over the network. To compute this slowdown, equation (4) is defined, where Mci is the number of MIPS used by PE i while transferring packets through the communication network; Moi is the result of executing the TSCP benchmark on the idle PE i; and Mti is the TSCP result on PE i while transferring packets over the network.

Mci = Moi − Mti    (4)

Equation (5) quantifies the network bandwidth occupied by the client/server program, where: Tx is the network transfer rate, pcts is the total number of packets transmitted in the Δt time interval and len is the packet length, in bytes.

Tx = (pcts × len / 1024) / Δt    (5)

Equation (6) relates equations (4) and (5), defining the communication capacity of a given PE in Kbytes/second/MIPS, where: Cci is the communication capacity of PE i, Tx is the transmission rate (equation 5)


Figure 1. PE capacity values, in MIPS, for PE1–PE4: idle processing vs. data transference.

and Mci is the number of MIPS used by the PE while transferring packets through the network (equation 4).

Cci = Tx / Mci    (6)

Equation (6) therefore expresses a ratio per MIPS used in data transmission. For example, 1 MIPS may be used by a given PE for the transfer of 32.55 KBytes per second.
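Equations (4)–(6) can be combined into a small worked sketch; the function names are assumptions, and the numbers in the usage example are invented for illustration rather than measurements from the paper.

```python
# Worked sketch of equations (4)-(6). Function names are assumptions.

def mips_consumed(mo, mt):
    """Equation (4): Mci = Moi - Mti, MIPS lost to packing/unpacking."""
    return mo - mt

def transfer_rate_kbps(pcts, pkt_len_bytes, dt_s):
    """Equation (5): Tx = (pcts * len / 1024) / dt, in KBytes/s."""
    return (pcts * pkt_len_bytes / 1024.0) / dt_s

def comm_capacity(tx_kbps, mc_mips):
    """Equation (6): Cci = Tx / Mci, in KBytes/s per MIPS."""
    return tx_kbps / mc_mips
```

For instance, a hypothetical PE that benchmarks at 1,500 MIPS when idle and 1,140 MIPS while transferring (Mc = 360 MIPS) at a measured rate of 11,718 KBytes/s yields Cc = 32.55 KBytes/s per MIPS, matching the ratio cited above.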

This use of MIPS by the PE during network data transfer is considered a slowdown caused by message packing and unpacking. Such slowdown follows the overhead definition adopted in the LogP model [4].

The observation of this slowdown and overhead has motivated experiments that allow the parameterization of the simulator. In addition, these experiments allow results closer to executions on real environments to be obtained. The experiments have been conducted on computers with the features listed in table 1.

The first graph (figure 1) shows the processing capacity of each PE. The first column shows the capacity of the PEs in the idle state, that is, when they are not executing operations on the communication network. The second column shows the MIPS values of the PEs during network transfer, while executing the client/server program that performs sending/receiving operations through the communication network and computes the transfer rate (equation 5).

From these results, it is possible to define the KBytes/second/MIPS ratio of equation (6), which helps parameterize the simulator, providing it with features closer to the real system.

In addition to these experiments, evaluations have been conducted to obtain the transmission rate of the PEs. These evaluations consist of executing the client/server program on each PE in two situations: idle PEs, and PEs executing the TSCP benchmark. The resulting transmission rates are between 11,500 and 12,000 KBytes/s. It may be observed that the transmission rate remains without significant variation on all PEs, even when simultaneously executing operations on the CPU (TSCP) and on the communication network (client/server program).

The previously mentioned analysis of the MIPS used to transmit a given rate in Kbytes/second allows the parameterization of the simulator. The corresponding parameter is the number of MIPS that a given process consumes from the CPU due to the overhead caused by message packing and unpacking. To this number is added the total MIPS spent by the process on network transmission (equation 6).

In the simulator, the interval between process arrivals to the system is defined by a Poisson probability distribution function with mean 1,500 seconds. This distribution has been adopted based on the experiments by Feitelson [6], which demonstrate that process arrivals at parallel machines follow this distribution.

The occupation of the system by processes follows the model by Feitelson [7, 6]. This model defines the execution time of processes with a heavy-tailed probability distribution function. This means that, eventually, the workload on the system becomes high enough to counterbalance the situations of low occupation load. The model provides two kinds of information to the simulator: load and number of processes. The load represents the number of MIPS each process consumes from the CPU; the number of processes is the number of processes of a given parallel application. Each log line represents an application submitted to the system. The simulator generates as output the average response time of the processes, in seconds.
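The arrival side of this workload can be sketched as below. The paper states Poisson arrivals with a 1,500-second mean interval; in a Poisson arrival process the inter-arrival times are exponential, so `random.expovariate` is used here. The function name and seed are assumptions.

```python
# Sketch of a synthetic arrival stream with a 1,500 s mean interval,
# in the spirit of the simulator setup described above. Illustrative only.

import random

def arrival_times(n, mean_interval_s=1500.0, seed=0):
    """Return n increasing submission times (in seconds)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(1.0 / mean_interval_s)  # exponential gap
        times.append(t)
    return times
```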

In addition to the previously defined overhead and latency, the proposed model requires other parameters: the message size, that is, the number of bytes transferred through the network; and the time spent transmitting those messages. These parameters are used in equation (7) to measure the average volume of messages transmitted through the communication channel.

To understand the use of these parameters in the simulator, consider a situation in which the number of bytes that a process transmits is represented by a Poisson function with mean 1,000 Kbytes. Likewise, the transfer time is represented by a Poisson function with mean 10 seconds.

In this situation, each process presents a network occupation of about 100 Kbytes/s. This value is obtained through equation (7), where Weightcom expresses how much of the bandwidth a given process consumes; load is the value transmitted, in Kbytes; and time is the time interval


Figure 2. Average response time (hours) vs. number of applications (1,000–5,000), for net 0%, 25%, 50%, 75%, 100% and DPWP; Poisson load = 500 KBytes; time = 60 seconds.

spent on the load transference through the communicationnetwork.

Weightcom = load / time    (7)

The simulator adopts a Poisson probability distribution function to represent these parameters. Combined with the process allocation techniques (scheduling policies), these parameters allow the analysis of the influence of communication on scheduling.
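A tiny sketch of equation (7), together with one way such per-process weights could be drawn. The exponential draws stand in for the Poisson parameterization quoted above (1,000 KBytes, 10 s); every name here is an assumption.

```python
# Sketch of equation (7) and of sampling per-process network weights.
# Names and the exponential stand-in distributions are assumptions.

import random

def comm_weight(load_kbytes, time_s):
    """Equation (7): Weightcom = load / time, in KBytes/s."""
    return load_kbytes / time_s

def sample_weights(n, mean_load_kb=1000.0, mean_time_s=10.0, seed=0):
    """Draw n (load, time) pairs and return the resulting weights."""
    rng = random.Random(seed)
    return [comm_weight(rng.expovariate(1.0 / mean_load_kb),
                        rng.expovariate(1.0 / mean_time_s))
            for _ in range(n)]
```

With the mean values themselves, comm_weight(1000, 10) gives the 100 KBytes/s occupation cited above.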

In order to investigate this influence, comparative experiments have been conducted between the DPWP policy and a version of it that aggregates the alterations proposed in this paper, named extended DPWP. Both policies have been implemented in the simulator developed in [5], which allowed the comparison of the average process response time. The value of the constant k was varied so that the behavior of the extended policy under distinct network occupation loads could be observed.

The results of these experiments are shown in the graphs of figures 2, 3 and 4. The y axis represents the average response time of the processes (in hours) and the x axis represents the number of applications submitted to the system.

Figure 2 shows the results of the experiment that uses the Poisson probability distribution function to represent the network occupation load of processes, with an average of 500 Kbytes and a transference time of 60 seconds. In addition, for the extended DPWP policy, different values of the constant k are considered, representing maximum network utilization limits of 0%, 25%, 50%, 75% and 100%. This constant influences the scheduling decision-making: the policy, by means of equation (2), verifies the total bandwidth used by the processes and, depending on whether the value returned by equation (2) is higher or lower than the constant k, it

Figure 3. Average response time (hours) versus number of applications, for net 0%, 25%, 50%, 75%, 100% and DPWP. Poisson load = 10,000 Kbytes; time = 60 seconds.

decides how to distribute the processes among the available PEs: in group (on a single PE) or through the communication network.
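The decision rule just described can be sketched as follows; the function name, the threshold comparison and the numeric values are illustrative assumptions, with `total_usage` standing for the aggregate bandwidth given by equation (2):

```python
def choose_placement(total_usage, bandwidth, k):
    """Sketch of an extended-DPWP-style decision: if the aggregate
    bandwidth used by the processes (equation (2)) exceeds the
    fraction k of the maximum network bandwidth, schedule the
    application's processes in group on a single PE; otherwise
    distribute them across the PEs through the communication
    network, as plain DPWP does."""
    if total_usage > k * bandwidth:
        return "group"    # co-locate to avoid network contention
    return "network"      # distribute to balance the load

# Illustrative values: k = 0.25 (the 25% curve) on a 10,000 Kbytes/s link.
print(choose_placement(3000, 10000, 0.25))  # group: 3000 > 2500
print(choose_placement(1000, 10000, 0.25))  # network: 1000 <= 2500
```

Note that with k = 0 any positive communication load triggers group scheduling, which matches the 0% configuration discussed below, where all processes of an application are allocated to a single PE.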

Observing figure 2, we note that the performances of the DPWP and extended DPWP policies are equivalent, except that, in the 0% configuration, the performance of extended DPWP is lower. This is due to the allocation of all processes of the same application to a single PE, unlike the scheduling ruled by the DPWP policy, which distributes the processes among the PEs, improving resource occupation and, consequently, balancing the system load.

The extended DPWP policy obtains better results as the system communication load increases. Figure 3 presents the results of the experiment parameterized with a Poisson probability distribution function with an average occupation load of 10,000 Kbytes and a transference time of 60 seconds.

Under such load conditions, once processes are no longer distributed among the computers and thus do not occupy the communication subsystem, the waiting time for synchronization among processes on different PEs decreases and, consequently, the response time improves for applications that perform many operations over the communication network.

From figure 3, it may also be concluded that the extended DPWP policy outperforms DPWP from the 25% network utilization limit onwards. As the network occupation load of the processes increases, the extended DPWP policy imposes lower delays on the PEs and decreases the average response time of the applications.

Figure 4 shows the results of the experiment parameterized according to the Poisson probability distribution function, used to represent the occupation load with an average of 50,000 Kbytes


Figure 4. Average response time (hours) versus number of applications, for net 0%, 25%, 50%, 75%, 100% and DPWP. Poisson load = 50,000 Kbytes; time = 60 seconds.

and a transference time of 60 seconds. For all network utilization rates but 0%, the extended DPWP policy considerably increases application performance when compared to the DPWP policy. This may be observed even when the system has a low number of processes (600 processes).

5. Conclusion

This paper presents the influence of communication on the performance of network-bound parallel applications. It is worthwhile to analyze such influence taking the communication subsystem into account as a parameter for application scheduling decisions. In order to evaluate this parameter (network bandwidth use per process), an extended DPWP policy is proposed that also considers latency, communication overhead and message transmission volume.

Experiments comparing the DPWP and extended DPWP policies were conducted by means of simulations. They confirm that the extended DPWP policy significantly decreases the average response time of processes in situations of high communication network occupation. Conversely, in system load situations where network occupation is low, scheduling supervised by the DPWP policy is recommended, since it performs load balancing and, consequently, improves system performance in terms of the average response time of the applications.

Acknowledgment

The authors thank the Brazilian foundations CAPES and FAPESP (process number 04/02411-9).

References

[1] A. P. F. Araujo, M. J. Santana, R. H. C. Santana, and P. S. L. Souza. DPWP: A new load balancing algorithm. In 5th International Conference on Information Systems Analysis and Synthesis - ISAS'99, Orlando, U.S.A., 1999.

[2] S. Chodnekar, V. Srinivasan, A. S. Vaidya, A. Sivasubramaniam, and C. R. Das. Towards a communication characterization methodology for parallel applications. In Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture (HPCA '97). IEEE Computer Society, 1997.

[3] T. Choe and C. Park. A task duplication based scheduling algorithm with optimality condition in heterogeneous systems. In International Conference on Parallel Processing Workshops (ICPPW'02), pages 531-536. IEEE Computer Society, 2002.

[4] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 1-12, San Diego, California, United States, 1993. ACM Press.

[5] R. F. de Mello and L. J. Senger. A new migration model based on the evaluation of processes load and lifetime on heterogeneous computing environments. In 16th Symposium on Computer Architecture and High Performance Computing, Foz do Iguacu - PR - Brazil, 2004.

[6] D. G. Feitelson. Metrics for parallel job scheduling and their convergence. In D. G. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 188-205. Springer, 2001. Lect. Notes Comput. Sci. vol. 2221.

[7] D. G. Feitelson and L. Rudolph. Metrics and benchmarking for parallel job scheduling. In D. G. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing, volume 1459, pages 1-24. Springer, 1998.

[8] A. Keren and A. Barak. Opportunity cost algorithms for reduction of I/O and interprocess communication overhead in a computing cluster. IEEE Transactions on Parallel and Distributed Systems, 14(1):39-50, 2003.

[9] T. C. Kerrigan. TSCP benchmark scores, 2004. http://home.comcast.net/~tckerrigan/bench.html.

[10] W. Mao, J. Chen, and W. W. III. On-line algorithms for a parallel job scheduling problem. In Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems, pages 753-757, 1999.

[11] L. M. Ni, C. Xu, and T. B. Gendreau. A distributed drafting algorithm for load balancing. IEEE Transactions on Software Engineering, 11(10):1153-1161, 1985.

[12] A. Radulescu, A. van Gemund, and H. Lin. LLB: A fast and effective scheduling algorithm for distributed-memory systems. In 13th International and 10th Symposium on Parallel and Distributed Processing - IPPS/SPDP, pages 525-530, 1999.

[13] J. Ryou and J. S. K. Wong. A task migration algorithm for load balancing in a distributed system. In XXII Annual Hawaii International Conference on System Sciences, pages 1041-1048, 1989.
