[IEEE Computer Society Proceedings, Fifth International Conference on High Performance Computing]

GLB: A Low-Cost Scheduling Algorithm for Distributed-Memory Architectures

Andrei Radulescu, Arjan J.C. van Gemund
Department of Information Technology and Systems

Delft University of Technology, P.O. Box 5031, 2600 GA Delft, The Netherlands

Abstract

This paper proposes a new compile-time scheduling algorithm for distributed-memory systems, called Global Load Balancing (GLB). GLB is intended as the second step in the multi-step class of scheduling algorithms. Experimental results show that, compared with known scheduling algorithms of the same low-cost complexity, the proposed algorithm improves schedule lengths by up to 30%. Compared to algorithms with higher-order complexities, the typical schedule lengths obtained with the proposed algorithm are at most twice as long.

1. Introduction

One of the main problems in the field of scheduling algorithms for distributed-memory systems is finding time-efficient heuristics that produce good schedules. The goal of scheduling is to minimize the parallel execution time of the scheduled program. Except for very restricted cases, the scheduling problem has been shown to be NP-complete [3]. Consequently, much research effort has been spent on finding good heuristics. For shared-memory architectures, it has been proven that even a low-cost scheduling algorithm is guaranteed to produce acceptable performance [4]. For distributed-memory systems, however, such a guarantee does not exist.

The heuristic algorithms used for task scheduling on distributed-memory machines can be divided into (a) scheduling algorithms for an unbounded number of processors and (b) scheduling algorithms for a bounded number of processors. Scheduling for an unbounded number of processors is easier, because the constraint on the number of processors need not be considered. Within this class a distinction can be made between clustering and duplication-based algorithms. Clustering algorithms, such as Dominant Sequence Clustering (DSC) [14] and Edge Zeroing (EZ) [10], group tasks together to reduce communication. Duplication-based algorithms, such as Scalable Task Duplication based Scheduling (STDS) [2] and Duplication First Reduction Next (DFRN) [8], further reduce the communication delays by task duplication.

An important class of scheduling algorithms for a bounded number of processors is the class of list scheduling algorithms, such as Modified Critical Path (MCP) [12] and Earliest Task First (ETF) [5], which sequentially schedule "ready" tasks to the task's "best" processor. A ready task is defined to be a task with all its dependencies satisfied, and a best processor is determined by the criteria used to select processors (e.g., the processor where the task can start the earliest). Secondly, duplication can also be performed for a bounded number of processors (e.g., Duplication Scheduling Heuristic (DSH) [6] and Critical Path Fast Duplication (CPFD) [1]). A third approach is to use a multi-step method. In such a method, three steps can be defined: (1) clustering, (2) cluster mapping and (3) task ordering. The clustering step groups tasks in clusters, the cluster mapping step maps the clusters to the available number of processors, and the task ordering step orders the tasks' execution within processors, according to task dependencies.

In practical situations, where the number of tasks may be extremely large, the time complexity of a scheduling algorithm is very important. Scheduling for an unbounded number of processors can be performed with low complexity. However, the necessary number of processors is rarely available. Within the class of scheduling algorithms for a bounded number of processors, duplication-based algorithms have high complexities, because they perform a backward search in order to duplicate tasks. Compared to clustering algorithms, list scheduling algorithms have higher complexities, because they must additionally satisfy the constraint of a limited number of processors. Multi-step scheduling methods achieve the same low complexity as clustering, provided the other two steps, cluster mapping and task ordering, have the same or lower complexity than the clustering step. Because task ordering is basically a topological sort, it can be performed fast (e.g., Ready Critical Path (RCP) [13], Free Critical Path (FCP) [13]). Cluster mapping can also be performed at a low cost, using a

Symbol      Meaning
V           The number of nodes (tasks) in the task graph.
E           The number of edges (task dependencies) in the task graph.
P           The number of processors.
Tex(Ti)     The execution time of task Ti, i = 1..V.
C           The number of clusters obtained in the clustering step.
ST_C(Ti)    The start time of Ti obtained in the clustering step.
ST(Ck)      The start time of cluster Ck.
WC(Ck)      The workload of cluster Ck.
WP(Pk)      The workload mapped to processor Pk.

Table 1. Symbols and their meanings

load balancing scheme [7, 14]. Thus, multi-step methods are promising from a complexity point of view.

Within the multi-step method, the low-cost cluster mapping algorithms known thus far use from the clustering step only the information of task grouping in clusters. However, the clustering step produces more information that can be used to improve the cluster mapping. This paper describes a new approach to the cluster mapping step, called Global Load Balancing (GLB), which also uses the tasks' start times produced during the clustering step. GLB aims to improve the performance of the existing low-cost cluster mapping algorithms while maintaining the same complexity.

This paper is organized as follows. The next section discusses previous work on cluster mapping. In Section 3, the proposed algorithm is presented. Section 4 illustrates the functionality of the algorithm through an execution trace. In Section 5 we compare our algorithm with previous approaches. Section 6 concludes the paper.

2. Related Work

In this section, three existing cluster mapping algorithms and their characteristics are described: List Cluster Assignment (LCA) [10], Wrap Cluster Merging (WCM) [13] and Load Balancing (LB).

Table 1 lists the symbols and their meanings as used in the rest of the paper.

Fig. 1 shows a task graph that has already been clustered using DSC. While small and comprehensible, it allows us to show the different characteristics of the algorithms that will be described.

2.1. List Cluster Assignment (LCA)

LCA is an incremental algorithm that performs both cluster mapping and task ordering in a single step. The tasks are sorted in descending order of priority. At each step, LCA considers the unmapped task with the highest priority. The task, along with the cluster it belongs to, is mapped to the processor that yields the minimum increase of the parallel completion time. The parallel completion time is calculated as if each unmapped cluster would be executed on a separate processor. The communication delays between tasks running on the same processor are considered zero. If the task graph is modified to reflect the above communication delay zeroing, the parallel completion time is simply the graph's critical path. The time complexity of the LCA algorithm is O(PC(V + E)).

One can note that if each task had been mapped to a separate cluster, LCA would degenerate to a traditional list scheduling algorithm. If the clustering step produces large clusters, the time spent mapping the clusters to processors decreases, since only one task from each cluster (the task with the highest priority) is considered. However, the complexity of LCA is still high, because at each mapping step it is necessary to compute the total parallel completion time.

For the task graph in Fig. 1, the schedule for three processors using LCA is presented in Fig. 2. It yields the best mapping for this example. However, LCA is not practical because of its high cost.

2.2. Wrap Cluster Merging (WCM)

In WCM, cluster mapping is based only on the task execution times. For each cluster, the workload is calculated as the sum of the execution times of the tasks in the cluster. First, each cluster with a workload higher than the average is mapped to a separate virtual processor. Second, the remaining clusters are sorted in increasing order of workload. Assuming the remaining virtual processors are renumbered as 0, 1, ..., Q-1 and the remaining clusters as 0, 1, ..., R-1, the virtual processor for cluster Ck is determined using the following wrap-around formula:

PE(Ck) = k mod Q,   k = 0, 1, ..., R-1

Figure 1. A clustered task graph (nodes N0-N9 with execution times and communication delays, grouped into clusters C0-C5; time axis 0-81)

Figure 2. LCA schedule (PE0-PE2; time axis 0-69)

Figure 3. WCM schedule (PE0-PE2; time axis 0-74)

Figure 4. LB schedule (PE0-PE2; time axis 0-87)

Figure 5. GLB schedule (PE0-PE2; time axis 0-69)

The time complexity of the WCM algorithm is O(V + C log C).

Under the assumption that the previous clustering step eliminated the effect of the largest communication delays, the communication delays are not considered for cluster mapping. Consequently, the communication between tasks has a small impact on the cluster mapping step. The time complexity of WCM is very low. However, it can lead to processor load imbalances at different stages of the execution. That is, the cluster mapping is performed considering only the sum of the execution times of the tasks in clusters, irrespective of the tasks' start times. As a result, some processors may be temporarily overloaded.

For example, when scheduling the clustered task graph in Fig. 1 on three processors, WCM will not map cluster C3 to PE1, because the index of the cluster after renumbering (after C0 was mapped) is 2, which determines that the destination processor is PE0 (Fig. 3).
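For concreteness, the following Python sketch implements the WCM scheme described above. It is a minimal sketch, not the paper's code: the helper name wcm_map, the list-of-workloads data layout, the interpretation of "the average" as the total workload divided by the number of processors, and the renumbering of the remaining processors are our assumptions.

def wcm_map(workloads, num_procs):
    # Wrap Cluster Merging (WCM): workloads[c] is WC(Cc);
    # returns proc[c], the processor assigned to cluster c.
    avg = sum(workloads) / num_procs   # assumed averaging; see above
    proc = [None] * len(workloads)

    # Phase 1: each cluster heavier than the average gets its own
    # (virtual) processor.
    heavy = [c for c in range(len(workloads)) if workloads[c] > avg]
    for p, c in enumerate(heavy):
        proc[c] = p

    # Phase 2: the remaining clusters, in increasing order of workload,
    # are wrapped around the remaining processors: PE(Ck) = k mod Q.
    rest = sorted((c for c in range(len(workloads)) if proc[c] is None),
                  key=lambda c: workloads[c])
    q = num_procs - len(heavy)         # Q remaining processors
    for k, c in enumerate(rest):
        proc[c] = len(heavy) + k % q
    return proc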

2.3. Load Balancing (LB)

Load balancing techniques are based on computing the workload of each cluster. A possible algorithm is the following: at each step, the unmapped cluster with the highest workload is selected and mapped to the least loaded processor. The workload of a processor is defined as the sum of the execution times of the tasks already mapped to it. The time complexity of the LB algorithm considered here is O(V + C log C).

When only the workload of the clusters is considered as the cluster mapping criterion, the remarks made for WCM also hold here. For the task graph and clustering in Fig. 1, the LB schedule for three processors is even worse than that of WCM (Fig. 4). The first cluster mapped is C0, which has the highest workload, 61. After mapping the next cluster, C1, the workloads of processors PE0, PE1 and PE2 are 61, 10 and 0, respectively. The next two clusters, C4 and C5, are successively mapped to PE2, raising its workload to 17. The least loaded processor is then PE1, and the last two clusters are mapped to it. Using such a load balancing algorithm, the potential parallelism between clusters C1, C2 and C3, and between C4 and C5, is not exploited.
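A minimal Python sketch of this LB scheme is given below (again our own formulation; lb_map and the heap-based selection of the least loaded processor are implementation choices, not the paper's code):

import heapq

def lb_map(workloads, num_procs):
    # Load Balancing (LB): repeatedly map the heaviest unmapped
    # cluster to the currently least loaded processor.
    order = sorted(range(len(workloads)), key=lambda c: -workloads[c])
    heap = [(0, p) for p in range(num_procs)]   # (WP(Pp), p) min-heap
    heapq.heapify(heap)
    proc = [None] * len(workloads)
    for c in order:
        load, p = heapq.heappop(heap)           # least loaded processor
        proc[c] = p
        heapq.heappush(heap, (load + workloads[c], p))
    return proc

# With the Fig. 1 clusters (workloads 61, 10, 5, 5, 9, 8) and three
# processors this reproduces the mapping discussed above:
print(lb_map([61, 10, 5, 5, 9, 8], 3))          # [0, 1, 1, 1, 2, 2]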

3. The GLB Algorithm

Although the three algorithms described in the previous section can produce good cluster mappings, each of them has its specific shortcomings. The algorithm described in what follows is intended to overcome these drawbacks while preserving a low complexity.

In the GLB algorithm (Fig. 6), a new strategy to interpret the information from the clustering step is proposed.

GLB()
1. BEGIN
2.   Compute ST and WC of the clusters.
3.   WHILE NOT all clusters scheduled DO
4.     Ck := cluster with the minimum ST, breaking ties by
         choosing the cluster with the maximum WC.
5.     Px := the least loaded processor.
6.     Map Ck to Px.
7.   END WHILE
8. END

Figure 6. The GLB algorithm

Unlike current approaches to cluster mapping, which rely only on the grouping of tasks in clusters, GLB additionally uses the tasks' start times as determined during the clustering step to improve performance. In the clustering step, the tasks are grouped in clusters and ordered according to their dependencies. Hence, each task Ti is mapped to a cluster and has a start time ST_C(Ti) associated with it. ST_C(Ti) is a good measure of the relative order in which the tasks have to be executed in the final mapping, because the tasks' start times obtained in the clustering step satisfy the task dependency constraints. Moreover, they reflect the importance of each task; e.g., a task with a higher priority is clustered earlier.

The start times of the tasks obtained in the clustering step are further used to derive a start time for each cluster. The start time of a cluster is the earliest start time of a task in the cluster:

ST(Ck) = min_{Ti ∈ Ck} ST_C(Ti)

The second measure associated with a cluster is its workload, which is the sum of the execution times of the tasks in the cluster:

WC(Ck) = Σ_{Ti ∈ Ck} Tex(Ti)
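The two definitions translate directly into code. In the following sketch, the clustering-step output is assumed to be a list of clusters, each a list of task ids, together with per-task start and execution times (this data layout and the helper name cluster_measures are ours):

def cluster_measures(clusters, start_time, exec_time):
    # ST(Ck): the earliest clustering-step start time of a task in Ck.
    st = [min(start_time[t] for t in tasks) for tasks in clusters]
    # WC(Ck): the total execution time of the tasks in Ck.
    wc = [sum(exec_time[t] for t in tasks) for tasks in clusters]
    return st, wc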

The start times of the clusters determine their priorities in the cluster mapping step. The cluster that starts the earliest is selected. If two clusters have the same start time, the cluster with the highest workload is selected. If the workloads are also equal, a cluster is selected at random.

The start time of the first task is used to determine the priority of a cluster in order to carry over information from the dependence analysis in the previous step. Topologically ordering the tasks yields a natural cluster ordering. Scheduling clusters in the order in which they become available for execution will generally yield a better schedule than one obtained by considering only the amount of computation in a cluster.

Similar to the cluster workload, the processor workload can be defined as the sum of the execution times of the tasks mapped to the processor:

WP(Pk) = Σ_{Ti ∈ Pk} Tex(Ti)

Task  Start time
T0    0
T1    15
T2    20
T3    15
T4    15
T5    30
T6    45
T7    40
T8    45
T9    59

Table 2. Tasks' start times obtained in the clustering step.

Cluster  Start time  Workload
C0       0           61
C1       15          10
C2       15          5
C3       15          5
C4       45          9
C5       45          8

Table 3. Start times and workloads of the clusters.

Unscheduled clusters        WP(P0)  WP(P1)  WP(P2)  Mapping
{C0,C1,C2,C3,C4,C5}         0       0       0       C0 -> P0
{C1,C2,C3,C4,C5}            61      0       0       C1 -> P1
{C2,C3,C4,C5}               61      10      0       C2 -> P2
{C3,C4,C5}                  61      10      5       C3 -> P2
{C4,C5}                     61      10      10      C4 -> P1
{C5}                        61      19      10      C5 -> P2

Table 4. Execution trace of the GLB algorithm.


The processor workload is the selection criterion for the target processor in the cluster mapping process. As in the LB algorithm, the least loaded processor is selected.

The processor with the smallest workload is selected, as opposed to choosing the processor where the cluster can start the earliest. In the first case, global information about the processors is used to map clusters, while in the latter, local information is used. Using global information is better when mapping clusters, because a cluster is not a single schedulable unit: in the task ordering step, the clusters on the same processor are expected to be interleaved. Consequently, a processor selection criterion such as the least loaded processor is better suited to such cases.

Sorting the clusters by their priorities takes O(C log C) time. Mapping cluster Ck takes O(log P + |Ck|) time: O(log P) to keep the processor list ordered by workload and O(|Ck|) to map the tasks in the cluster to the selected processor, where |Ck| denotes the number of tasks in cluster Ck. The complexity of mapping all clusters is therefore O(C log P + Σ|Ck|) = O(C log P + V), as Σ|Ck| = V. Thus, the total complexity of the algorithm is O(C log C + C log P + V) = O(C log C + V).

4. An Execution Trace of the Algorithm

In this section, we illustrate the steps of the GLB algorithm using the clustered task graph in Fig. 1. Table 2 lists the start times of the tasks obtained in the clustering step. These start times are used to calculate the cluster start times (Table 3). Table 3 also lists the workload of each cluster.

In the GLB algorithm, cluster priorities are static. The main criterion is the start time of the cluster, the cluster with the earliest start time being selected. In case of a tie, the cluster with the highest workload is selected. If the tie persists, a cluster is selected at random. Following these rules, the order in which the clusters are mapped to processors is: C0, C1, C2, C3, C4 and C5.

The execution trace of the GLB algorithm is presented in Table 4. In the first step, the GLB algorithm maps the cluster with the highest priority, C0, to the first available processor, P0, since all processors have zero workload. The next cluster, C1, is mapped to the first of the processors with zero workload, P1. The following two clusters are mapped to P2, since it is the least loaded processor. Then, both P1 and P2 have a workload of 10, so the next cluster, C4, is mapped to P1. Finally, C5 is mapped to the currently least loaded processor, P2. The resulting schedule, after ordering the tasks, is shown in Fig. 5.
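For concreteness, here is a Python sketch of the complete GLB mapping step (our own formulation of Fig. 6, with persistent cluster ties broken by index instead of randomly, and processor ties broken by processor index, as in the trace above):

import heapq

def glb_map(st, wc, num_procs):
    # GLB cluster mapping (Fig. 6): st[c] = ST(Cc), wc[c] = WC(Cc).
    # Static priorities: minimum ST first, ties broken by maximum WC.
    order = sorted(range(len(st)), key=lambda c: (st[c], -wc[c]))
    procs = [(0, p) for p in range(num_procs)]  # (WP(Pp), p) min-heap
    heapq.heapify(procs)
    mapping = [None] * len(st)
    for c in order:
        load, p = heapq.heappop(procs)          # least loaded processor
        mapping[c] = p
        heapq.heappush(procs, (load + wc[c], p))
    return mapping

# The Table 2/3 data reproduces the trace of Table 4:
print(glb_map([0, 15, 15, 15, 45, 45], [61, 10, 5, 5, 9, 8], 3))
# [0, 1, 2, 2, 1, 2], i.e. C0 -> P0, C1 -> P1, C2 -> P2,
# C3 -> P2, C4 -> P1, C5 -> P2.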

5. Performance and Comparison

The GLB algorithm is compared with the three algorithms described in Section 2: LCA, WCM and LB. The algorithms are compared as part of a full multi-step scheduling method, to obtain a realistic test environment. The algorithm used for the clustering step is DSC [14], because of both its low cost and its good performance. For task ordering, we use RCP [13]. We also included a list scheduling algorithm, MCP [12], in our comparisons, to obtain a reference to other types of scheduling algorithms.

We consider a large number of task graphs representing various types of parallel algorithms. The selected problems are block LU decomposition, row LU decomposition, a stencil algorithm and a divide and conquer problem. For each of these problems, we adjusted the problem size to obtain task graphs of about 2000 nodes. We varied the task graph granularities by varying the communication-to-computation ratio (CCR). The values used for CCR were 0.2, 0.5, 1.0, 2.0 and 5.0. For each problem and each CCR value, we generated 50 graphs with random execution times and communication delays, resulting in a total of 1000 task graphs. Because of the very high cost of LCA, we restricted our experiments to 5 task graphs in this particular case. More detailed experimental results are presented in [9].

As a distributed system, we assume a homogeneous clique topology with no contention on communication operations and non-preemptive execution of tasks. Our experiments are performed by simulating the execution of the task graph on such a topology. The execution and communication times on the simulated machine are selected to be the same as those used in the scheduling algorithms.

For performance comparison, we use the efficiency measure. In order to define efficiency, we first define the ideal time as the ratio between the sequential time and the number of processors: Ti = Tseq / P. Efficiency is defined as the ratio between the ideal time and the actual parallel time Tp obtained by measurement: E = Ti / Tp. It can easily be seen that efficiency can be expressed as the ratio between the speedup and the number of processors:

E = Ti / Tp = (Tseq / P) / Tp = (Tseq / Tp) / P = S / P

Efficiency is used as opposed to other measures like schedule length or speedup because efficiency values scale well in the plots and offer a good separation of the interesting areas. Both speedup and schedule length take values that vary across a large interval, making the graphical comparison difficult for small values of the measured function.
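As a worked check of these definitions (the numbers below are illustrative only, not measurements from the paper):

def efficiency(t_seq, t_p, p):
    # E = Ti / Tp = (Tseq / P) / Tp = S / P, with S = Tseq / Tp.
    return (t_seq / p) / t_p

# For example, a 2000-unit sequential workload that runs in 125 time
# units on 32 processors has speedup S = 16 and efficiency E = 0.5.
assert efficiency(2000, 125, 32) == 0.5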

One of our objectives is to observe the trade-off between the performance, i.e. the efficiency, and the cost of obtaining these results, i.e. the running time necessary to generate a schedule. We ran our experiments on a Pentium Pro 233 MHz PC with 64 MB RAM running Linux 2.0.32.

Figure 7. Algorithms' running times (seconds vs. number of processors P, for MCP, DSC-WCM, DSC-LB and DSC-GLB)

In Fig. 7 the average running time of the algorithms is shown. For the multi-step scheduling methods, the total scheduling time is displayed (i.e. the sum of the clustering, cluster mapping and task ordering times). The running time of MCP grows linearly with the number of processors. For a small number of processors, the running time of MCP is comparable with the running times of the three low-cost multi-step scheduling methods (DSC-WCM, DSC-LB and DSC-GLB). However, for a larger number of processors, the running time of MCP is much higher. The three low-cost multi-step scheduling methods all have comparable running times, which do not vary significantly with the number of processors. DSC-WCM has the smallest running time, because it has the simplest mapping scheme. Both DSC-LB and DSC-GLB have to maintain a priority queue for the processors, which leads to an increase in running time of 5-10% in our experiments. DSC-LCA is not displayed, because its running times are much higher than the others, varying from 1h for P = 2 to 11h for P = 32.

Fig. 8 shows the mean values of the efficiency obtained for each of the scheduling algorithms presented. One can distinguish two different classes of algorithms in terms of output performance. The first class contains the expensive algorithms, MCP and DSC-LCA, which produce good schedules. The second class contains the low-cost algorithms, DSC-WCM, DSC-LB and DSC-GLB, which trade performance for cost.

Within the class of high-cost algorithms, the output performance of DSC-LCA is still lower than that of MCP. Here, after the clustering step, the degree of freedom in placing tasks on processors is decreased by grouping tasks together. Mapping a single task from a cluster to a processor forces all the other tasks within the same cluster to be mapped to the same processor. In this case, using a higher-cost algorithm (DSC-LCA vs. MCP) does not imply an increase in scheduling quality.

While for a small number of processors the performance of the low-cost algorithms is comparable with that of the high-cost algorithms, for a large number of processors the performance drops.

Figure 8. Scheduling algorithms' efficiencies (E vs. number of processors P, for MCP, DSC-LCA, DSC-WCM, DSC-LB and DSC-GLB) for (a) block LU decomposition, (b) row LU decomposition, (c) a stencil-like problem and (d) a divide and conquer problem, using different CCR values.

If the computation time dominates the communication, the difference in the quality of the schedules increases in favor of the high-cost algorithms, going up to a factor of 2 in some cases (row LU decomposition, CCR = 0.2, P = 32). For very large task graphs, however, we consider this difference acceptable, since the running time of the low-cost algorithms is an order of magnitude lower than that of MCP. In the case of fine-grain task graphs, the low-cost algorithms obtain results comparable with the high-cost algorithms. Minimizing the communication delays in the clustering step is the key factor in obtaining these results.

Compared to the other low-cost algorithms presented, DSC-GLB obtains better schedules. However, the quality increase varies both with the value of CCR and with the type of problem. For small values of CCR (coarse-grain task graphs), the improvements in efficiency are 8-9% for block and row LU decomposition, 15-30% for stencil problems and 9-19% for divide and conquer problems. The reason for these different improvement values lies in the clustering step. For block and row LU decomposition, the number of clusters obtained is very large (992-1723 for a task graph of about 2000 nodes). Consequently, little information is extracted from the clustering step and GLB cannot obtain much improvement.

Better improvements are obtained for the stencil and divide and conquer problems, where we have 460-569 and 469-476 clusters, respectively. Both the clustering and cluster mapping steps are important in this case, because the number of clusters is large but not comparable to the number of nodes (460-569 clusters vs. 2000 nodes). Consequently, GLB benefits from the information accumulated in the clustering step, in terms of the start times of the tasks, thus obtaining better mappings than WCM and LB.

For high values of CCR (fine-grain task graphs), the overall behavior of GLB compared to WCM and LB with respect to the problem type is the same, but the improvements in efficiency are less significant (0-6% for block and row LU decomposition and for divide and conquer problems, and 0-17% for stencil problems). The improvements become smaller because the cluster mapping step becomes less important in the overall scheduling process. For fine-grain task graphs, the most important step is the clustering step, which reduces the communication delays. This phenomenon can also be inferred from the fact that the multi-step scheduling methods produce schedules that are closer in length to MCP's schedules than in the coarse-grain case.

6. Conclusion

In this paper we describe a new algorithm for cluster mapping, called Global Load Balancing (GLB). GLB is intended as the second step in the multi-step class of scheduling algorithms. Unlike current approaches to cluster mapping that rely only on the grouping of tasks in clusters, GLB also uses the start times of the tasks obtained in the clustering step to improve performance. The time complexity of the GLB algorithm does not increase compared to an ordinary load balancing algorithm.

Experimental results show that, compared with known cluster mapping algorithms of the same complexity, the proposed algorithm improves schedule lengths by up to 30%. Compared to algorithms with higher-order complexities, the proposed algorithm obtains comparable results for a small number of processors. As the number of processors increases, the relative performance of the proposed algorithm degrades with respect to the higher-order complexity algorithms. Yet, the schedule lengths obtained with the proposed algorithm are at most twice as long, while reducing the cost by an order of magnitude.

Acknowledgements

The authors thank Hai Xiang Lin for stimulating discussions. This research is part of the Automap project [11], granted by the Netherlands Computer Science Foundation (SION) with financial support from the Netherlands Organization for Scientific Research (NWO) under grant number SION-2519/612-33-005.

References

[1] I. Ahmad and Y-K. Kwok. A new approach to scheduling parallel programs using task duplication. In Int'l Conf. on Parallel Processing, pages 47-51, 1994.

[2] S. Darbha and D.P. Agrawal. A scalable and optimal scheduling algorithm for distributed memory machines. In Int'l Conf. on Parallel Processing, 1997.

[3] M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Co., 1979.

[4] R.L. Graham. Bounds on multiprocessing timing anomalies. SIAM J. on Applied Mathematics, 17(2):416-429, March 1969.

[5] J-J. Hwang, Y-C. Chow, F.D. Anger, and C-Y. Lee. Scheduling precedence graphs in systems with interprocessor communication times. SIAM J. on Computing, 18:244-257, April 1989.

[6] B. Kruatrachue and T.G. Lewis. Grain size determination for parallel processing. IEEE Software, pages 23-32, January 1988.

[7] J-C. Liou and M.A. Palis. A comparison of general approaches to multiprocessor scheduling. In Int'l Parallel Processing Symposium, pages 152-156, 1997.

[8] G-L. Park, B. Shirazi, and J. Marquis. DFRN: A new approach for duplication based scheduling for distributed memory multiprocessor systems. In Int'l Parallel Processing Symposium, pages 157-166, 1997.

[9] A. Radulescu and A.J.C. van Gemund. GLB: A low-cost scheduling algorithm for distributed-memory architectures. Technical Report 1-68340-44(1998)02, Delft Univ. of Technology, April 1998.

[10] V. Sarkar. Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors. Ph.D. thesis, MIT, 1989.

[11] K. van Reeuwijk, H.J. Sips, H.-X. Lin, and A.J.C. van Gemund. Automap: A parallel coordination-based programming system. Technical Report 1-68340-44(1997)04, Delft Univ. of Technology, April 1997.

[12] M-Y. Wu and D.D. Gajski. Hypertool: A programming aid for message-passing systems. IEEE Trans. on Parallel and Distributed Systems, 1(3):330-343, July 1990.

[13] T. Yang. Scheduling and Code Generation for Parallel Architectures. Ph.D. thesis, Dept. of CS, Rutgers Univ., May 1993.

[14] T. Yang and A. Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Trans. on Parallel and Distributed Systems, 5(9):951-967, September 1994.