
Low-Cost Static Performance Prediction of Parallel Stochastic Task Compositions

Hasyim Gautama and Arjan J.C. van Gemund, Member, IEEE Computer Society

- H. Gautama is with Statistical Information Systems, Statistics Indonesia, PO Box 1003, 10010 Jakarta, Indonesia. E-mail: [email protected].
- A.J.C. van Gemund is with the Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, PO Box 5031, NL-2600 GA Delft, The Netherlands. E-mail: [email protected].

Manuscript received 28 Nov. 2004; revised 29 Mar. 2005; accepted 31 Mar. 2005; published online 28 Nov. 2005.

Abstract—Current analytic solutions to the execution time distribution of a parallel composition of tasks having stochastic execution times are computationally complex, except for a limited number of distributions. In this paper, we present an analytical solution based on approximating execution time distributions in terms of the first four statistical moments. This low-cost approach allows the parallel execution time distribution to be approximated at ultra-low solution complexity for a wide range of execution time distributions. The accuracy of our method is experimentally evaluated for synthetic distributions as well as for task execution time distributions found in real parallel programs and kernels (NAS-EP, SSSP, APSP, Splash2-Barnes, PSRS, and WATOR). Our experiments show that the prediction error of the mean value of the parallel execution time for N-ary parallel composition is in the order of percents, provided the task execution time distributions are sufficiently independent and unimodal.

Index Terms—Performance prediction, stochastic graphs, workload distribution.

1 INTRODUCTION

In parallel program performance prediction, the choice between static and dynamic prediction approaches represents a fundamental trade-off between the amount of predictive information and its accuracy. Dynamic techniques, such as simulation, can provide accurate execution time information. However, the predictive value of this (numeric) information is limited to the specific parameter setting of the program and/or input data used. Static approaches focus on compile-time program information, such as synchronization and control flow structure. Consequently, they offer the potential advantage of producing (analytical) information on the performance effects of symbolic program/machine parameters (e.g., problem size, number of processors, and computation/communication bandwidths), without requiring costly simulation runs for each different parameter setting or input data set. However, their limitations with respect to modeling dynamic program behavior may have a profound negative impact on prediction accuracy. One source of dynamic program behavior is the nondeterminism related to dynamically scheduling tasks onto a limited number of processors, and other forms of contention for operating system services and resources, such as communication links, disks, memories, etc. Another form of dynamic behavior comes from the dependency of a program on the input data set. Especially for, e.g., simulation programs and content processing applications such as sorting, the program execution time distribution across the space of input data sets can be considerable, even when problem parameters such as the data set size are kept constant.

A static technique that cannot model execution time distributions is of limited practical use for two reasons. First, additional distribution information provides valuable feedback, in particular when, e.g., execution time bounds (e.g., percentiles) on real-time applications are to be reliably predicted. As parallel and distributed processing is increasingly being applied within the real-time embedded systems domain, there is a growing interest in such additional information on execution time besides just mean values. Second, as will be elaborated throughout the paper, distribution information is essential to accurately predict the execution time of a parallel composition of tasks that have nonzero execution time variance.

In this paper, we present a static technique aimed at predicting the execution time distribution of stochastic parallel programs. In our prediction method, we assume an unbounded availability of resources. Similar to data dependency, the effects of scheduling and resource contention are accounted for by each task having an execution time distribution. Often, a parallel program is represented in terms of a directed acyclic graph (DAG), with nodes representing tasks and edges representing task interdependencies. In terms of this DAG, our static prediction problem corresponds to computing the distribution of the critical path of the stochastic DAG, where each node represents a task execution time distribution.

For arbitrary parallel programs, the analysis of corresponding DAGs is complex, since for general DAGs the computation of the critical path involves combining path length distributions which are not always independent. In general, only bounding approximations are possible [7], [17], [36], [41], or solution techniques that are based on the assumption of an exponential execution time distribution [30], [39] or on a combination of deterministic and exponential distributions [37]. For the well-known subset of series-parallel DAGs (SP DAGs) that have an arbitrarily nested fork-join structure, this dependency problem can be circumvented, which allows for more practical solution techniques. As many parallel algorithms can be modeled in terms of SP DAGs [13], [15], in this paper we will consider (arbitrarily nested) fork-join programs only.

Fork-join (sub)programs can be expressed in terms of parallel, sequential, and conditional task compositions. In this paper, we present a compositional technique that analytically predicts the execution time distribution of an arbitrary composition given the distribution of the constituent tasks and (where applicable) the branching probability. Unlike sequential and conditional compositions, for which our solution technique is exact [9], parallel composition poses analytical problems. While treating all three composition types, in this paper we therefore focus on solving the parallel composition problem in particular.

In fork-join programs, parallel task compositions can be distinguished into N-ary and binary compositions, both of which will be considered in the paper. N-ary compositions typically result from data parallelism (e.g., parallel loops), where each task essentially involves the same computation on different data. Binary compositions, in contrast, typically result from task parallelism, where the computation involved in the composition may be totally different. For a parallel composition of N tasks, each having a stochastic execution time X_i, the resulting execution time Y is given by

Y = \max_{i=1}^{N} X_i.    (1)

Many authors have used (1) as part of a static prediction technique [1], [5], [6], [8], [14], [24], [32]. In these approaches, X_i (and Y) are implicitly assumed to be deterministic. For stochastic X_i, however, naively interpreting (1) in terms of, e.g., mean values (i.e., E[Y] = \max_{i=1}^{N} E[X_i]) would introduce a prediction error that is unbounded. From order statistics, it is known that the prediction error generally increases linearly with the variance of X_i and logarithmically with N [16]. The use of distribution information on X instead of just mean values is therefore crucial in order to accurately predict even just the mean of Y, let alone its distribution. Next to the use of distributions in interpreting application execution time (bounds), this order-statistical phenomenon is a compelling reason for our choice to use distributions in performance prediction of stochastic parallel programs, rather than just a simple, deterministic approach.
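To make this order-statistics effect concrete, the following Monte Carlo sketch (our illustration, not from the paper; the task parameters are arbitrary) compares the naive mean-value estimate with the actual mean of (1) for 128 iid normal task times:

```python
# Illustration: the naive estimate max_i E[X_i] underestimates E[max_i X_i]
# whenever the X_i have nonzero variance. Parameters are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
N, mu, sigma = 128, 1.0, 0.25
samples = rng.normal(mu, sigma, size=(20_000, N))   # 20,000 simulated runs
print("naive  max_i E[X_i]:", mu)                   # -> 1.0
print("actual E[max_i X_i]:", samples.max(axis=1).mean())  # -> roughly 1.6
```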

Analytically solving (1) when X_i are stochastic, however, presents a major problem, well known in the field of order statistics. The trade-off between the accuracy and cost involved in solving Y is largely determined by how the distributions are represented. An exact, closed-form solution for the distribution of Y can be expressed in terms of the cumulative distribution function (cdf). Let F_X(x) denote the cdf of X. Then, the cdf of Y is given by

F_Y(x) = \prod_{i=1}^{N} F_{X_i}(x).    (2)

While (2) yields an exact prediction, only low-cost, parametric solutions are of practical use. Moreover, many execution time distributions do not have a closed-form cdf, such as the normal distribution and most of the distributions obtained from task execution time measurements.
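For reference, one of the few cases where (2) is tractable in closed form is the iid negative exponential distribution (a standard order-statistics result, not specific to this paper): with F_X(x) = 1 - e^{-\lambda x},

```latex
F_Y(x) = \left(1 - e^{-\lambda x}\right)^{N},
\qquad
E[Y] = \frac{1}{\lambda} \sum_{k=1}^{N} \frac{1}{k},
```

which also exhibits the logarithmic growth of E[Y] with N mentioned above.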

Many approaches have been proposed to solve the above problem, typically restricting the range of distributions that can be treated (see Section 2). A more general, and also more accurate, approach, as we shall show in Section 3, is to approximate execution time distributions in terms of a number of statistical moments. In [9], such an approach is presented, which allows the moments of the execution time of a sequential program to be expressed at O(1) solution complexity in terms of the moments of the basic block execution times, the loop bounds, and the branching probabilities. While the moment approach is straightforward in the sequential domain, for parallel composition there unfortunately exists no analytic, closed-form solution for the moments of Y in terms of the moments of X, due to an integration problem described in Section 3.3.

In this paper, we present a novel method to solve this integration problem for N-ary as well as for binary parallel task compositions. Our approach is based on the use of generalized lambda distributions (GLD) as an intermediate, four-parameter modeling vehicle to approximate execution time distributions in terms of their first four moments. In particular, the specific, inverse definition of the GLD is the key to solving the integration problem. Furthermore, our contribution has the advantage that the range of distribution shapes that can be accurately approximated by the four-parameter GLD is much wider than for distributions such as the negative exponential or the normal distribution, which have fewer parameters. As a result, the approximation of the moments of Y has much better accuracy, while for N-ary composition the O(1) analytical complexity benefits of, e.g., the negative exponential distribution are fully retained. This result effectively extends the use of our analytical moment approach, shown to be successful in the sequential domain [9], into the fork-join parallel programming domain.

The remainder of the paper is organized as follows: In the next section, we review existing approaches to the parallel composition problem. In Section 3, we present our four-moments approach along with an elaborate rationale, and formulate the integration problem related to parallel composition mentioned earlier. In Section 4, we present our GLD approach and describe how our closed-form solution for E[Y^r] is derived. In Section 5, we experimentally study the accuracy of our approach using synthetic distributions as well as distributions obtained from measurements on real applications. In Section 6, we summarize our contributions.

2 RELATED WORK

In this section, we review related analytical approaches toward the parallel composition problem. As mentioned earlier, the trade-off between the accuracy and cost involved in solving Y is largely determined by how the distributions are represented. Consequently, we review related work in terms of the distributions and/or distribution representations considered.

An approach using the cdf has been described by Gelenbe [12] to determine the completion times of SP DAGs. However, the high-cost numerical integration presents a serious drawback regarding practical use. Lester [19] uses the z-transform to approximate the probability density function (pdf). While the real pdf can be approximated well, the solution complexity of the underlying numeric process is also still high. Schopf [33] uses histograms with a limited number of intervals. However, the analysis complexity grows with the number of histogram intervals needed to accurately characterize a distribution. Schopf and Berman [34] use intervals defined from the minimum and maximum of the possible values. While such intervals are easily defined, outliers in the data can affect the size of the interval, and no details about the shape of the data are included in a simple range. Another way of characterizing the pdf is based on series approximation, for example, the Gram-Charlier series of type A [11]. While the analysis is asymptotically exact, the number of Gram-Charlier terms needed for a sufficiently accurate approximation is prohibitive [11].

To decrease solution complexity, there have been many approaches based on the use of representations other than the cdf or pdf, where generality is traded for cost reduction. Some approaches introduce restrictions to specific distributions which are characterized by a limited number of parameters. Thomasian and Bay [39] consider exponential distributions. Mak and Lundstrom [23] use Erlang distributions, while Liang and Tripathi [20] use Erlang as well as hyperexponential distributions. Sahner and Trivedi [30] use exponomial distributions. While exponential, Erlang, hyperexponential, and exponomial distributions offer tractability and are appropriate for, e.g., reliability modeling, such distributions are not always an adequate representation of the execution times measured in real programs (see Section 5). As shown in Section 3.1, the modeling errors are therefore often quite large in practice. Sotz [37] uses an exponential distribution combined with a deterministic offset. While the analysis is straightforward, the approach may introduce significant errors. Schopf and Berman [34] use normal distributions. While the application to sequential programs is straightforward, binary parallelism is approximated heuristically, entailing large errors when both task execution times are similar.

Aimed at extending the analysis to more arbitrary execution times, other approaches approximate distributions in terms of, e.g., mean, variance, and/or bounds. Gumbel [16] approximates the mean execution time for identical and independent general symmetric distributions, provided their mean and variance are known. Apart from the restriction to symmetric distributions, the approach is only asymptotically exact. Under the same assumptions, Robinson [29] introduces upper and lower bounds on mean execution time while allowing dependencies among subtasks. These bounds have later been improved by Madala and Sinclair [22]. To include wider-than-symmetric distributions, Kruskal and Weiss [18] use increasing failure rate (IFR) distributions to approximate the mean execution time of parallel compositions for independent, identically distributed (iid) subtasks. While the approximation error of these approaches is quite reasonable, only the first moment can be obtained. As, in turn, mean and variance are required inputs to parallel composition analysis, it is impossible to analyze applications with nested parallelism (i.e., modeled by SP DAGs).

Although not aimed at analyzing parallel composition, other approaches also characterize the execution time distribution in terms of the first and second moments. These approaches include the work of Sarkar [31] for sequential compositions and the work of Adve and Vernon [2] who, similar to us, also allow sequential loop bounds to be stochastic. Although aimed at the analysis of sequential composition, their work is mentioned since it bears resemblance to our approach of modeling distributions in terms of moments.

3 STATISTICAL MOMENT APPROACH

In this section, we provide a rationale for our statistical moment approach. Next, we summarize our results for sequential and conditional compositions. Finally, we formulate the parallel composition problem, which is the main focus of this paper.

3.1 Rationale

Our choice of using statistical moments to characterize the pdf is primarily based on the following motivation. As mentioned in Section 2, our moment approach is effectively a generalization of the use of mean and variance in distribution characterization. Like mean and variance, the associated benefit is a low analysis complexity. Unlike mean and variance, however, our extended approach captures essential information on the pdf (as will be shown later on), while the complexity benefit is retained. Second, the method of moments is a general approach to estimate the parameters from a data set by equating the sample moments to their population counterparts. An alternative approach such as Maximum Likelihood (ML) is unsuited for many practical purposes because it is not known how the actual values of the ML parameters arose [38]. Although, in general, the method of moments does not completely determine a pdf, in our case knowledge of all moments is equivalent to knowledge of the distribution, since execution time distributions can be assumed to be finite [38].

The reason to limit our approach to the first four moments is motivated by the following. First, and foremost, the GLD, whose inverse definition is the essential key to solving the integration problem as shown in Section 4, is a four-parameter distribution. Fitting an execution time distribution characterized by more than four moments with the GLD would degrade the quality of the GLD fit in terms of the first four moments. In statistics, lower moments are more important than higher moments to characterize a distribution, as, e.g., for the Pearson and Johnson distributions. Furthermore, it has been shown [28] that the first four moments allow us to reconstruct the original distribution while introducing an acceptable error, and to distinguish between well-known standard distributions. Finally, in measurements, lower moments are also more robust than higher moments. The measured values of higher moments are so sensitive to sampling fluctuations that including higher moments does not always imply that the prediction accuracy will be improved (provided an analytically tractable distribution model that accommodates these higher moments would exist in the first place).

The reason to extend our approach to four moments, compared to well-known one- or two-moment approaches, is based on the following motivation. As mentioned in the introduction, an important premise underlying our approach is that accurate stochastic modeling is essential to 1) accurately capture the execution time distribution of tasks involved in parallel compositions, due to the distribution information requirements in order statistics, as well as to 2) accurately predict, e.g., the 99th percentile (the execution time that only 1 percent of the execution runs will exceed). Strongly determining factors are the choice of distribution model and the number of parameters (moments) involved. In the following, we show that approaches based on modeling execution times with standard distributions, such as the popular 1-parameter negative exponential distribution or the 2-parameter normal distribution, do not perform as well as our proposed moments method, despite the fact that our method requires the use of the third and fourth moments, which are known to be more prone to inaccurate measurement.

In Table 1, we summarize the results of an experiment in which we model four sample execution time distributions X using seven candidate distribution models. The sample execution time distributions are E (negative exponential), U (uniform), N (normal), and W (a task execution time distribution measured from the WATOR program at t = 64, described in Section 5.2.6). The candidate distribution models are E (negative exponential), U (uniform), N (normal), GLD (our 4-moment method), GC3 (Gram-Charlier, 3 moments), GC4 (Gram-Charlier, 4 moments), and GC5 (Gram-Charlier, 5 moments). From the models of X, we subsequently compute the distribution for an N-ary parallel composition (Y in terms of (1)), where N = 128. Subsequently, we take the 99th percentile of Y as the target for our comparison. For all model distributions, the associated prediction for Y is analytic (exact), except for the N and W distributions, in which case we have used an accurate numeric procedure. The prediction is compared to the real solution of Y, which is also computed analytically or numerically, depending on the distribution. The numbers in the table are the relative prediction errors (in percent) for the 99th percentile of the Y distribution.

TABLE 1. Percentile Error (%) for Various Distributions and Models (N = 128)

As an example, consider the negative exponential distribution (E model, parameter λ) as the vehicle to model a normal distribution X (zero mean, unit variance). In order to obtain the best fit, we choose λ = 1 and translate the E model to the left. Subsequently, we predict the moments of Y using the well-known (analytical) solutions for N-ary parallel composition of the negative exponential distribution and obtain the 99th percentile. Finally, we compare this predicted percentile with the actual 99th percentile that is directly measured from the real Y (i.e., the parallel composition of 128 normal distributions). The prediction error appears in the second column (E model), fourth row (N sample, 86.8 percent).

The table clearly indicates the advantage of increasing the number of parameters in the distribution models. The exponential distribution (1 parameter) is not very accurate as a model to predict the distribution and percentiles of Y (nor X), as shown by the numbers in the second column. Despite the use of an additional parameter in the U and N distributions, the prediction error is still much greater than for our 4-parameter approach (GLD). The results of the Gram-Charlier models (3-5 parameters) are included to illustrate that adding more parameters does not always improve the prediction accuracy. In some cases, the GC models are not applicable (N/A) due to the very large variance of the execution time.

3.2 Sequential and Conditional Composition

A parallel section can be part of a higher-level sequential, conditional, or parallel section. Conversely, within a parallel section, a task can be composed of sequential and/or conditional tasks. Due to space limitations, we only summarize our previous results on sequential and conditional compositions. Proofs can be found in [9], [11]. Our results are expressed in terms of the rth raw moment of random variable X, denoted E[X^r], which is defined by the Stieltjes integral [38]

E[X^r] = \int_{-\infty}^{\infty} x^r \, dF_X(x).    (3)

Definition 3.1: Binary sequential composition. Let Y be defined as the sum of two independent random variables X_1 and X_2,

Y = X_1 + X_2,    (4)

where E[X_1^r] and E[X_2^r] exist. Then, the rth raw moment of Y is given by

E[Y^r] = \sum_{j=0}^{r} \binom{r}{j} E[X_1^j] \, E[X_2^{r-j}].    (5)
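A direct transcription of (5) (a minimal sketch; the moment-array convention, with the jth entry holding E[X^j] and entry 0 equal to 1, is ours):

```python
# Sketch of Eq. (5): raw moments of Y = X1 + X2 from the task moments.
from math import comb

def seq_moment(r, m1, m2):
    """m1[j] = E[X1^j] for j = 0..r (m1[0] = 1); likewise m2 for X2."""
    return sum(comb(r, j) * m1[j] * m2[r - j] for j in range(r + 1))

# Example: two unit-rate exponentials, E[X^j] = j!, giving E[Y] = 2, E[Y^2] = 6.
m = [1, 1, 2, 6, 24]
print(seq_moment(1, m, m), seq_moment(2, m, m))  # -> 2 6
```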

Next, we present the results for N-ary sequential composition.

Definition 3.2: N-ary sequential composition. Let Y be defined as the sum of a random number N of random variables X_i,

Y = \sum_{i=1}^{N} X_i,    (6)

where X_i are iid variates of random variable X and N is independent of X_i. Let also E[X^r] and E[N^r] exist. Then, the rth raw moment of Y is given by

E[Y^r] = E\left[ \left. \frac{d^r}{dt^r} \left( \sum_{j=0}^{r} \frac{t^j E[X^j]}{j!} \right)^{\!N} \right|_{t=0} \right].    (7)

The derivation for specific values of r (r = 1, 2, ...) is straightforward, leading to low-complexity solutions in which the exponent N is reduced to a single multiplication term [9].
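For instance, the first two raw moments that follow from (7) are the standard random-sum identities (assuming, as in Definition 3.2, that N is independent of the iid X_i):

```latex
E[Y] = E[N]\,E[X],
\qquad
E[Y^2] = E[N]\,E[X^2] + \bigl(E[N^2] - E[N]\bigr)\,E[X]^2 .
```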

For conditional composition, the following definition applies.

Definition 3.3: Conditional composition. Let Y be defined as

Y = \begin{cases} X_1, & \text{if } C = \text{true} \\ X_2, & \text{if } C = \text{false}, \end{cases}    (8)

where C and X_i are independent. Let C have truth and false probability P and Q = 1 - P, respectively, and let E[P^r], E[Q^r], and E[X_i^r] exist. Then, the rth raw moment of Y is given by

E[Y^r] = E\left[ \left. \frac{d^r}{dt^r} \left( \sum_{j=0}^{r} \frac{t^j E[X_1^j]}{j!} \right)^{\!P} \right|_{t=0} \right] + E\left[ \left. \frac{d^r}{dt^r} \left( \sum_{j=0}^{r} \frac{t^j E[X_2^j]}{j!} \right)^{\!Q} \right|_{t=0} \right].    (9)
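For example, for r = 1, (9) collapses to the intuitive mixture mean (our specialization, under the stated independence assumptions):

```latex
E[Y] = E[P]\,E[X_1] + E[Q]\,E[X_2].
```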

Again, the derivation for specific values of r is relatively straightforward, leading to simple scalar operations [9].

In summary, for (binary and N-ary) sequential and conditional compositions, our moment analysis produces E[Y^r] as an exact function of the input moments, merely comprising a constant number of addition and multiplication operations (i.e., O(1), independent of N [9]).

3.3 Parallel Composition

As mentioned earlier, in contrast to the sequential domain, for parallel composition there is, in general, no analytic, closed-form solution for E[Y^r] in terms of E[X_i^r], due to the following integration problem. Recall the cdf in (2). In the case where X_i are N iid variates (N-ary parallel composition where X_i = X), (2) reduces to

F_Y(x) = F_X(x)^N.    (10)

From (3) and (10), we obtain

E[Y^r] = \int_{-\infty}^{\infty} x^r \, dF_Y(x) = \int_{-\infty}^{\infty} x^r \, d(F_X(x))^N = N \int_{-\infty}^{\infty} x^r (F_X(x))^{N-1} \, dF_X(x).    (11)

The right-hand side integral of (11) cannot be solved analytically in such a way that E[Y^r] is expressed explicitly as a function of E[X^r]. In the binary case, the related integration problem for E[Y^r] is even more complicated, since two different execution times are involved in the analysis.

4 PARALLEL COMPOSITION

In this section, we describe our approach to the analysis of parallel composition based on statistical moments. We summarize the characteristics of the GLD, after which we derive the analytical solutions for N-ary and binary parallel compositions.

As mentioned earlier, our moment approach to parallel composition is based on the use of the GLD as distribution model. The specific, inverse definition of the GLD allows us to solve the integration problem in (11) so that it becomes straightforward to obtain the moments of Y. Furthermore, unlike other approximating distributions such as the Pearson and Johnson distributions, the GLD uses only one function (see (12)) and is computationally simpler [28]. As shown in Section 3.1, the GLD also covers a wide range of distributions that are useful in modeling execution times.

The analysis for N-ary and binary parallel compositions is illustrated in Fig. 1. Given the first four input moments E[X_i^r], the corresponding GLD parameters (λ values) can be evaluated by looking up a table [28] or by running a Nelder-Mead (NM) simplex procedure [25] for more precision (see Section 4.1). Based on the λ (and N) values, we directly compute the resulting moments E[Y^r] for N-ary or binary parallel composition, respectively (see Sections 4.2 and 4.3).

4.1 Generalized Lambda Distributions

The generalized lambda distribution, GLD(λ_l) for l = 1, ..., 4, is a four-parameter distribution defined by the percentile function R as a function of the cdf u, 0 ≤ u ≤ 1, according to [28]

R(u) = \lambda_1 + \frac{u^{\lambda_3} - (1-u)^{\lambda_4}}{\lambda_2},    (12)

where λ_1 is a location parameter, λ_2 is a scale parameter, and λ_3 and λ_4 are shape parameters. While the cdf does not exist in simple closed form, the probability density function is given by

f(R(u)) = \left( \frac{dR(u)}{du} \right)^{-1} = (R'(u))^{-1} = \frac{\lambda_2}{\lambda_3 u^{\lambda_3 - 1} + \lambda_4 (1-u)^{\lambda_4 - 1}}.    (13)

This four-parameter distribution includes a wide range of curve shapes. The distribution can provide good approximations to well-known densities, such as the continuous uniform distribution U(a, b), the exponential distribution E(λ), and the normal distribution N(μ, σ).

In turn, let Y be distributed according to a GLD. Given the λ values, obtaining the statistical moments proceeds as follows.

Fig. 1. Proposed solution procedure.


Without loss of generality, let λ_1 = 0. Then, from (3) and (12), the rth raw moment of Y is derived according to

E[Y^r] = \frac{1}{\lambda_2^r} \sum_{i=0}^{r} \binom{r}{i} (-1)^i B(\lambda_3 (r-i) + 1, \lambda_4 i + 1),    (14)

where B denotes the beta function, as defined by

B(a, b) = \int_0^1 t^{a-1} (1-t)^{b-1} \, dt.    (15)
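A direct transcription of (14)-(15) (a minimal sketch assuming λ_1 = 0, as in the text; the function names are ours):

```python
# Sketch of Eq. (14): raw moments of GLD(0, l2, l3, l4); scipy.special.beta
# implements the beta function of Eq. (15).
from math import comb
from scipy.special import beta

def gld_moment(r, l2, l3, l4):
    """r-th raw moment of a GLD with lambda1 = 0, per Eq. (14)."""
    s = sum((-1)**i * comb(r, i) * beta(l3 * (r - i) + 1, l4 * i + 1)
            for i in range(r + 1))
    return s / l2**r

# Check: GLD(0, 2, 1, 1) is U(-1/2, 1/2), so E[Y] = 0 and E[Y^2] = 1/12.
print(gld_moment(1, 2.0, 1.0, 1.0), gld_moment(2, 2.0, 1.0, 1.0))
```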

In practice, the computation cost of (14) is negligible.

To obtain the four λ values from the moments, we can refer to a table, such as provided in [28]. If more precise λ values are required, an alternative is to apply a method based on function minimization, such as the Nelder-Mead simplex procedure [25] (objective functions (12) and (13), respectively). This procedure requires a constant number of iteration steps. Even for a 10^{-10} precision for the λ values, our experiments show that no more than 100 iteration steps involving evaluating (14) are required (in total fewer than 10^5 floating point operations). Thus, given the four moments, the cost of obtaining the λ values is also negligible.

As mentioned before, the cdf of the GLD does not exist in closed form, while in the analysis of binary parallel composition a closed form is required (see Section 4.3). As an approximation, we compute the cdf u = F_X(x) of the GLD by solving R(u) = x with Newton's method (relative convergence criterion 10^{-8}) according to

u_0 = \frac{1}{K}, \qquad u_{m+1} = u_m - \frac{R(u_m) - x}{R'(u_m)},    (16)

where m is the iteration index and R'(u) denotes the derivative of R(u) with respect to u. As for all distributions u_m is close to u_{m+1}, the method converges (quadratically) to a solution within two iterations.
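The two numeric steps just described might be sketched as follows (our reading of the text, not the authors' code: the least-squares objective, starting points, and parameter guards are assumptions; gld_moment is the helper from the previous sketch, and λ_1 = 0 throughout):

```python
# Sketch: fit (l2, l3, l4) to four target moments with Nelder-Mead, and
# invert u -> R(u) with Newton's method as in Eq. (16).
from scipy.optimize import minimize

def fit_gld(target):  # target = [E[X], E[X^2], E[X^3], E[X^4]], lambda1 = 0
    def err(p):
        l2, l3, l4 = p
        if l2 <= 0 or l3 <= -0.2 or l4 <= -0.2:  # keep the beta arguments valid
            return 1e12
        return sum((gld_moment(r, l2, l3, l4) - target[r - 1]) ** 2
                   for r in range(1, 5))
    return minimize(err, x0=[1.0, 0.5, 0.5], method='Nelder-Mead').x

def gld_cdf(x, l2, l3, l4, tol=1e-8):
    """Solve R(u) = x for u by Newton's method, cf. Eq. (16)."""
    R  = lambda u: (u ** l3 - (1 - u) ** l4) / l2
    dR = lambda u: (l3 * u ** (l3 - 1) + l4 * (1 - u) ** (l4 - 1)) / l2
    u = 0.5
    for _ in range(50):
        step = (R(u) - x) / dR(u)
        u = min(max(u - step, 1e-12), 1 - 1e-12)  # stay inside (0, 1)
        if abs(step) < tol:
            break
    return u
```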

4.2 N-ary Parallel Composition

In this section, we present how the GLD is applied in the analysis of N-ary parallel composition. We first present a more generalized result in terms of Theorem 4.1.

Theorem 4.1: rth raw moment of the nth order statistic. Let Y_1 ≤ Y_2 ≤ ... ≤ Y_n ≤ ... ≤ Y_N be random variables obtained by permuting N iid variates of a continuous random variable X, i.e., X_1, X_2, ..., X_N, in increasing order. Let E[X^r], r = 1, 2, 3, 4, exist, while X can be approximated in terms of GLD(λ_1, λ_2, λ_3, λ_4), where the λ_j are functions of E[X^r]. Without loss of generality, let λ_1 = 0. Then, for n = 1, 2, ..., N, the rth raw moment of the nth order statistic Y_n is given by

E[Y_n^r] = \frac{n}{\lambda_2^r} \binom{N}{n} \sum_{i=0}^{r} \binom{r}{i} (-1)^i B(\lambda_3 (r-i) + n, \lambda_4 i + N - n + 1),    (17)

where B denotes the beta function.

Proof. The rth raw moment of Y_n, expressed in terms of its cdf F_{Y_n}(x), is given by (3), i.e.,

E[Y_n^r] = \int_{-\infty}^{\infty} x^r \, dF_{Y_n}(x),    (18)

where F_{Y_n}(x), expressed in terms of F_X(x), is given by [38]

F_{Y_n}(x) = \sum_{k=n}^{N} \binom{N}{k} (F_X(x))^k (1 - F_X(x))^{N-k}.    (19)

The derivative of (19) with respect to x is given by

\frac{dF_{Y_n}(x)}{dx} = \sum_{k=n}^{N} \binom{N}{k} \left( k (F_X(x))^{k-1} (1 - F_X(x))^{N-k} - (N-k) (F_X(x))^{k} (1 - F_X(x))^{N-k-1} \right) \frac{dF_X(x)}{dx}
= N \sum_{k=n}^{N} \binom{N-1}{k-1} (F_X(x))^{k-1} (1 - F_X(x))^{N-k} \frac{dF_X(x)}{dx} - N \sum_{k=n+1}^{N} \binom{N-1}{k-1} (F_X(x))^{k-1} (1 - F_X(x))^{N-k} \frac{dF_X(x)}{dx}
= N \binom{N-1}{n-1} (F_X(x))^{n-1} (1 - F_X(x))^{N-n} \frac{dF_X(x)}{dx}.    (20)

Since the GLD is conveniently expressed as a function of the cdf, i.e., X = R(u), x^r may be directly substituted by (R(u))^r. As F_X(x) = u, we substitute dF_X(x) by du. Consequently, the lower and upper bounds follow from 0 ≤ u ≤ 1, since F_X(x) = 0 for x → -∞ and F_X(x) = 1 for x → ∞. Substituting (20) and the lower and upper bounds into (18), it follows that

E[Y_n^r] = N \binom{N-1}{n-1} \int_0^1 (R(u))^r u^{n-1} (1-u)^{N-n} \, du,    (21)

where (R(u))^r can be expanded into a power series according to

(R(u))^r = \frac{1}{\lambda_2^r} \sum_{i=0}^{r} \binom{r}{i} (-1)^i u^{\lambda_3 (r-i)} (1-u)^{\lambda_4 i}.    (22)

Substituting (22) into (21) yields

E[Y_n^r] = \frac{N}{\lambda_2^r} \binom{N-1}{n-1} \sum_{i=0}^{r} \binom{r}{i} (-1)^i \int_0^1 u^{\lambda_3 (r-i) + n - 1} (1-u)^{\lambda_4 i + N - n} \, du.

Reducing the integral to a beta function results in

E[Y_n^r] = \frac{N}{\lambda_2^r} \binom{N-1}{n-1} \sum_{i=0}^{r} \binom{r}{i} (-1)^i B(\lambda_3 (r-i) + n, \lambda_4 i + N - n + 1). ∎


Theorem 4.1 enables us to express E[Y_n^r] in terms of E[X^r] and the problem size N (the number of tasks involved in the parallel section), while the value of n ranges from 1 to N.

As in parallel composition Y is determined by the largest X_i, the specific instance n = N of Theorem 4.1 applies (i.e., the largest order statistic). This result for N-ary parallel composition is stated in terms of the following corollary.

Corollary 4.1: N-ary parallel composition. Under the same assumptions as in Theorem 4.1, let random variable Y be defined as

Y = max(X_1, X_2, ..., X_N).    (23)

Then, the rth raw moment of Y is given by

E[Y^r] = \frac{N}{\lambda_2^r} \sum_{i=0}^{r} \binom{r}{i} (-1)^i B(\lambda_3 (r-i) + N, \lambda_4 i + 1).    (24)

The result is immediately obtained from (17) by substituting n = N. Similar to sequential composition, the solution complexity of (24) is entirely independent of N (i.e., O(1)), while its computation cost is negligible (cf. (14)).
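A transcription of (24) (a minimal sketch assuming λ_1 = 0; the sanity check exploits the fact that GLD(0, 2, 1, 1) is exactly U(-1/2, 1/2)):

```python
# Sketch of Eq. (24): E[Y^r] for Y = max of N iid tasks fitted by
# GLD(0, l2, l3, l4).
from math import comb
from scipy.special import beta

def nary_par_moment(r, N, l2, l3, l4):
    s = sum((-1)**i * comb(r, i) * beta(l3 * (r - i) + N, l4 * i + 1)
            for i in range(r + 1))
    return N * s / l2**r

# Check: the max of N iid U(-1/2, 1/2) variates has mean (N-1) / (2(N+1)).
N = 8
print(nary_par_moment(1, N, 2.0, 1.0, 1.0), (N - 1) / (2 * (N + 1)))
```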

4.3 Binary Parallel Composition

When two different, independent execution times are involved in the analysis, deriving E[Y^r] is slightly more complicated. We present our result for general binary parallel composition in terms of Theorem 4.2.

Theorem 4.2: Binary parallel composition. Let random variable Y be defined as

Y = max(X_1, X_2),

where X_i are independent random variables for which E[X_i^r], r = 1, 2, 3, 4, exist. Let X_i be approximated in terms of GLD(λ_{i,1}, λ_{i,2}, λ_{i,3}, λ_{i,4}), where the λ_{i,l} are functions of E[X_1^r] and E[X_2^r]. Then, the rth raw moment of Y is given by

E[Y^r] = \int_0^1 (R_{X_2}(u))^r F_{X_1}(R_{X_2}(u)) \, du + \int_0^1 (R_{X_1}(u))^r F_{X_2}(R_{X_1}(u)) \, du.    (25)

Proof. The rth raw moment of Y, expressed in terms of its cdf, is given by (3),

E[Y^r] = \int_{-\infty}^{\infty} x^r \, dF_Y(x),    (26)

where F_Y(x), expressed in terms of F_{X_i}(x), is given by

F_Y(x) = F_{X_1}(x) F_{X_2}(x).    (27)

The derivative of (27) with respect to x is given by

\frac{dF_Y(x)}{dx} = F_{X_1}(x) \frac{dF_{X_2}(x)}{dx} + F_{X_2}(x) \frac{dF_{X_1}(x)}{dx}.    (28)

Since the GLD is conveniently expressed as a function of the cdf, i.e., X_i = R_{X_i}(u), x^r may be directly substituted by (R_{X_i}(u))^r. As F_{X_i}(x) = u_{X_i}, we substitute dF_{X_i}(x) by du_{X_i}. Consequently, the lower and upper bounds follow from 0 ≤ u_{X_i} ≤ 1, since u_{X_i} = 0 for x → -∞ and u_{X_i} = 1 for x → ∞. Substituting (28) and the lower and upper bounds into (26), it follows that

E[Y^r] = \int_0^1 (R_{X_2}(u))^r F_{X_1}(R_{X_2}(u)) \, du + \int_0^1 (R_{X_1}(u))^r F_{X_2}(R_{X_1}(u)) \, du. ∎

In (25), the integration proceeds from u = 0 to u = 1 rather than from x = -∞ to x = ∞ due to the specific formulation of the GLD. An implementation of (25) using the Riemann sum rule is given by

E[Y^r] = \frac{1}{K} \sum_{k=1}^{K} \left( (R_{X_2}(k/K))^r F_{X_1}(R_{X_2}(k/K)) + (R_{X_1}(k/K))^r F_{X_2}(R_{X_1}(k/K)) \right).    (29)

In practice, K = 10^4 is sufficiently large for the numerical approximation error to be in the percent range, while F_{X_i}(R_{X_j}(k/K)) can be computed using Newton's method as given in (16). For K = 10^4, the prediction requires 680,000 floating point operations in total, which took 30 ms on a 1 GHz Pentium III processor.

Note that our results, (24) and (29), are specifically designed for N-ary and binary parallel composition, respectively. Applying (29) to N-ary parallel composition would require ⌈log_2(N)⌉ evaluations of (29). While this would indeed be required for the parallel composition of N different tasks, for iid tasks (24) provides an O(1) shortcut. In turn, applying (24) to binary parallel composition with different execution times would degrade prediction accuracy. This is due to the fact that the X_i with the greatest mean will primarily determine Y (for equal variance of X_i). However, this is only partially accounted for by the input variable X of (24), which represents all X_i (from the iid assumption).

5 EXPERIMENTAL RESULTS

In this section, we study the quality of our prediction approach when applied to synthetic distributions and those measured from real applications. The prediction quality is expressed in terms of the relative error ε_r, which is defined according to

\varepsilon_r = \frac{|E[Y_m^r] - E[Y_p^r]|}{E[Y_m^r]},    (30)

where E[Y_m^r] and E[Y_p^r] are the measured and predicted moments, respectively. For synthetic distributions (Section 5.1), E[Y_m^r] is obtained analytically (or numerically). For empirical distributions (Section 5.2), E[Y_m^r] is obtained by measuring the execution time of the parallel composition over 6,000 sample program runs. E[Y_p^r] is calculated using (24) and (29) for N-ary and binary parallel compositions, respectively. The input moments E[X_i^r] are obtained by measuring the execution time of the tasks involved in the parallel composition over 6,000 sample program runs. In the N-ary case, all X_i samples are taken to estimate X.


5.1 Synthetic Distributions

Synthetic distributions such as the uniform, exponential, and normal distributions are frequently used for modeling execution times. The pdf of these distributions is available in closed form, from which the moments E[Y_m^r] for parallel composition can be obtained. While for the uniform and exponential distributions E[Y_m^r] can be expressed explicitly in terms of the input moments, for normal distributions E[Y_m^r] is obtained by numerical integration, since the cdf is not available in closed form.

When the input execution times X_i are uniformly distributed, E[Y_m^r] is exactly equal to E[Y_p^r] for N-ary and binary parallel compositions, since X_i are exactly modeled by the GLD.

For exponentially distributed X_i, Fig. 2 shows ε_r with λ = 1 for N ranging from 2 to 1,000. The figure shows that the error decreases monotonically as a function of N because the right tail of the GLD asymptotically approaches the exponential distribution (hence, N = 1,000 suffices). The largest error, for N = 2, is less than 1 percent. For binary parallel composition, we consider two exponential distributions with parameters λ_1 = 1 and 0.1 ≤ λ_2 ≤ 10. Fig. 3 shows that ε_r is in the percent range and not very sensitive to λ_2 (the fluctuation as a function of λ_2 is due to the interval discretization K = 10,000 in (29)). The error ε_1 is less than 1 percent.

For normally distributed X_i, Fig. 4 shows ε_r for N up to 10^6, where E[X] = VAR[X] = 1. Memory limitations of the mathematical tool used to compute the exponentiation of the derivative of the normal pdf (10) in the derivation of Y_m prohibit larger values of N. Due to the exponentiation (10) in terms of (24), the (initially very small) approximation error in the right tail of f (and, therefore, F) produces an error that tends to grow with N. This phenomenon is common to all approaches based on approximating distribution models (although the errors are usually much worse, as illustrated in Table 1). In the figure, we also compare our approach with Gumbel's well-known approximation [16],

E[Y_N] \approx E[X] + \mathrm{Std}[X] \sqrt{2 \log(0.4 N)},    (31)

which, in contrast, is asymptotically exact (in this particular case, since the normal distribution is symmetric). Our approximation is better for N up to 10,000. For binary parallel composition, we consider two normal distributions where E[X_1] = VAR[X_1] = 1, while E[X_2] = μ and VAR[X_2] = σ^2.


Fig. 2. ε_r (%) for exponentially distributed (N-ary).
Fig. 3. ε_r (%) for exponentially distributed (binary).
Fig. 4. ε_r (%) for normally distributed (N-ary).
Fig. 5. ε_r (%) for normally distributed (binary).


and "r < 1 percent. In Fig. 6 (� ¼ 1), "r is again below 1 percentand increases slightly as function of �. In both cases, this isrelated to the amount of overlap in both distributions (nooverlap gives zero prediction error).

5.2 Empirical Distributions

In this section, we study the accuracy of our analytic approach when applied to distributions as measured from well-known benchmark codes. As deterministic task execution times would entail zero prediction error, we have selected applications with stochastic task execution times in order to evaluate our prediction method. As mentioned in the introduction, stochastic execution times can be a result of scheduling nondeterminism (e.g., workload imbalance, contention) and data dependency. When processing on nonshared grids, processor and communication scheduling nondeterminism is an insignificant factor in task execution time variance. In such cases, the only source of significant stochastic behavior is the program's dependency on the input data set. This has therefore been the basis for our selection.

Our measurement approach is based on instrumenting the code at each (stochastic) parallel section to incrementally update the four moments of X (typically, X_1 in an N-ary section) at each new execution run (i.e., new data set), the application's moments profile being stored on file. As the number of application runs required to obtain a representative repertoire of data sets is application-specific, the determination of what constitutes a representative data set training corpus with respect to X is beyond the scope of this paper. As the well-known benchmark codes do not come with a large array of input data sets, and since detailed, application-specific input data modeling research is beyond the scope of this paper, we have restricted the selection of applications to those codes that either come with an internal input data model (e.g., simulation codes with a randomized initial scene) or where the input data model is trivial (e.g., general-purpose sorting code). For the latter category, we synthesize the required data sets. For all applications, our moment measurements are based on 6,000 runs.

For our experiments, we selected the following codes: NAS-EP (embarrassingly parallel [4]), SSSP (single source shortest path [27]), APSP (all pairs shortest path [27]), Splash2-Barnes (hierarchical n-body simulation [40]), PSRS (parallel sort with regular sampling [35]), and WATOR (fish-shark population simulation [3]). For SSSP, APSP, and PSRS, we synthesized the 6,000 input data sets, which is discussed in the corresponding sections. Overall, we study a total of 14 parallel task compositions, featuring a wide range of X_i distributions. Apart from distribution shape, some of the compositions also feature correlation effects between the X_i. Correlation has a profound negative effect on the accuracy of any approach based on an iid distribution model. When all the X_i were totally correlated, Y would simply equal X. Compared to the iid situation, the error of the mean value grows approximately logarithmically with N, as can be observed from (31). Hence, next to studying the ability of the GLD to fit real-world distributions, it is of particular interest to observe to what extent other real-world characteristics, such as correlation, pose a limit to the practical applicability of our method.

5.2.1 EP

NAS-EP constitutes one N-ary parallel composition. The computation within each task is based on an internal random generator. As a result of the fact that a constant fraction of the randomly generated data is involved in the time-consuming computation part of each task, the input distribution of X_1, presented in Fig. 7, approximates the normal distribution due to the central limit theorem. As shown in Fig. 8, the prediction error is indeed similar in magnitude to that of the standard normal distribution shown earlier. Fig. 8 also shows the error of (31).

5.2.2 SSSP

The input to SSSP is a V × V adjacency matrix A, where V = 1,000. The matrices are generated to model connection locality, such as in topographic maps. The edge weights a_ij are assigned infinity with probability |i - j|/V, while the remaining a_ij are sampled from the uniform distribution U(1, |i - j|). A comparison with matrices where each a_ij is generated without any location dependence shows that our prediction results (also for APSP) are independent of the input model. We have parallelized SSSP using binary task parallelism. As a result of the internal farmer-worker partitioning [27], correlation of the two tasks is to be expected. The corresponding ε_r (%) is given in Fig. 9, where V ranges from 100 to 1,000. Indeed, due to the correlation effect, ε_r for binary parallel composition is larger than would be expected from Fig. 4.

Fig. 6. ε_r (%) for normally distributed (binary).
Fig. 7. pdf(X) for NAS-EP and SSSP.

5.2.3 APSP

APSP is based on distributing the SSSP adjacency matrix to all N processors, while each processor runs SSSP on the nodes it is assigned to (row-wise block distribution, V = 640). Thus, the kernel constitutes one N-ary parallel composition. The pdf of the normalized X is given in Fig. 7, which shows that X is only slightly left-skewed.

In Fig. 10, ε_r (%) is in the order of percents, which implies that the correlation effect of each task sharing the same adjacency matrix is quite small, but not negligible. The figure also shows that, on average, our mean value is better than Gumbel's mean value (31).

5.2.4 Barnes

In the default setting, Splash2-Barnes constitutes a sequence of four parallel compositions, one for each of the time steps t = 0, ..., 3. At each run, the bodies are initially (t = 0) randomly generated in space. Unlike the WATOR simulation, workload balance is maintained through dynamic repartitioning (the Costzones approach). As the evolution of the n-body system at t = 3 has hardly changed, we extended the benchmark run until t = 64. Instead of 6,000 sample runs, we chose 500 runs in view of the increased computational cost. Fig. 11 shows the distribution of X for both t = 0 and t = 64, which both approach the normal distribution. Figs. 12 and 13 show the prediction error for t = 0 and t = 64, respectively. Although the dynamic Costzones partitioning still induces a small yet systematic workload imbalance, such that X_N is always a bit greater than the other X_i (i.e., correlation), the errors are quite small, though indeed slightly increasing with N.

5.2.5 PSRS

PSRS comprises three N-ary data parallel compositions, coined sort, disjoin, and merge, which are based on block-partitioning the input and output data vectors. We generate the input data vectors from a uniform random variable using sample space [0, 1]. Consequently, each task in sort has an iid execution time. In contrast, the tasks in the subsequent parallel sections are correlated, since their workload depends on the same global pivot values. The pdf of X is given in Fig. 14. The distributions of X are quite different from any of the distributions we have examined previously. The ε_r for the three compositions is given in Figs. 15, 16, and 17, respectively. Fig. 15 shows that the error is less than 1 percent (cf. Fig. 4). Fig. 16 indicates that (24) still provides a good approximation, although the independence assumption does not quite hold. Fig. 17 shows the limitation of our approach when the X_i are highly correlated. Our measurements confirm that the coefficients of correlation between X_1 and X_j for merge are far from zero [11]. The figures also show the error of Gumbel's prediction (31), which suffers from the correlation effect as well.

Fig. 8. ε_r (%) for NAS-EP.
Fig. 9. ε_r (%) of SSSP.
Fig. 10. ε_r (%) of APSP.
Fig. 11. pdf of X for BARNES.

5.2.6 WATOR

WATOR constitutes a sequence of parallel compositions, one for each simulation step t. At t = 0, a number of fish and sharks is generated randomly over each location in the ocean. Each of the N processors is assigned a consecutive square block of 2,500 grid points, which corresponds to a static partitioning, in contrast to the dynamic workload partitioning in Splash2-Barnes. The normalized pdf of X is given in Fig. 18 for the simulation time (iteration) t ranging from 0 to 256. Due to the static partitioning, X_1 evolves into a bimodal distribution for t = 256, which corresponds to either clusters of fish (or sharks) or empty spaces [10] over the simulation runs.

The ε_r for t = 0, 16, 128, and 256 are presented in Figs. 19, 20, 21, and 22, respectively. The smallest ε_r appears in Fig. 19 for t = 0, as the X_i are still iid. The error slightly increases with t due to increasing correlation between the tasks, which is caused by the clustering mentioned earlier. In Fig. 23, the correlation between X_1 and X_j is shown for N = 16. Despite the small coefficient of correlation, the actual covariance is large due to the large variance values. The figure shows that the correlation is largest for the workloads nearest to X_1 (i.e., X_2 and X_16, as the processors are interconnected in a 1D torus). Apart from correlation effects, we observe a bimodal distribution for X_i (t = 256). Such variability of distribution seriously undermines the


Fig. 12. ε_r (%) for BARNES (t = 0).
Fig. 13. ε_r (%) for BARNES (t = 64).
Fig. 14. The pdf of X for PSRS.
Fig. 15. ε_r (%) for PSRS sort.


accuracy of our approach, since the GLD can only fit unimodal distributions. Consequently, for execution times with multimodal distributions, a different approach will have to be taken. (A method that deals with multimodal workloads is presented in [21] for queuing network models.)

Fig. 12. ε_r (%) for BARNES (t = 0).
Fig. 13. ε_r (%) for BARNES (t = 64).
Fig. 14. The pdf of X for PSRS.
Fig. 15. ε_r (%) for PSRS sort.
Fig. 16. ε_r (%) for PSRS disjoin.
Fig. 17. ε_r (%) for PSRS merge.
Fig. 18. The pdf of X for WATOR.
Fig. 19. ε_r (%) for WATOR, t = 0.
Fig. 20. ε_r (%) for WATOR, t = 16.
Fig. 21. ε_r (%) for WATOR, t = 128.
Fig. 22. ε_r (%) for WATOR, t = 256.
Fig. 23. WATOR coefficient of correlation.

6 CONCLUSION

In this paper, we have presented an analytical model of the execution time distribution Y of N-ary and binary parallel compositions of stochastic tasks. Our approach is based on approximating the distributions Xi and Y in terms of the first four moments, in conjunction with the use of the GLD as an intermediate modeling vehicle, to derive a closed-form, O(1) complexity expression for E[Y^r] in terms of E[Xi^r] and N. Our model is intended to cover the effects of data and task parallelism, and forms an integral part of an analytic, ultra-low complexity approach to program performance prediction that already covered sequential and conditional compositions.
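To make the O(1) claim concrete for the first moment (r = 1) only: writing the RS-parameterized GLD quantile function as Q(u) = λ1 + (u^λ3 − (1 − u)^λ4)/λ2, the maximum Y of N iid tasks has quantile function Q(u^(1/N)), so that E[Y] = N ∫_0^1 Q(v) v^(N−1) dv collapses to a single Beta-function term. The sketch below (Python; the parameter values are illustrative, and the formula is rederived here from the GLD definition rather than quoted from the text) evaluates the closed form and cross-checks it by simulation:

    import numpy as np
    from scipy.special import beta

    def gld_quantile(u, l1, l2, l3, l4):
        # RS-parameterized GLD inverse cdf Q(u).
        return l1 + (u ** l3 - (1.0 - u) ** l4) / l2

    def mean_parallel(l1, l2, l3, l4, N):
        # E[Y] = l1 + (N/l2) * (1/(N + l3) - B(N, l4 + 1)), the closed form
        # of N * Integral_0^1 Q(v) v^(N-1) dv; one Beta evaluation, O(1).
        return l1 + (N / l2) * (1.0 / (N + l3) - beta(N, l4 + 1.0))

    l = (10.0, 0.5, 0.2, 0.2)                  # illustrative task-time GLD
    N = 64
    u = np.random.rand(200_000) ** (1.0 / N)   # quantile levels of a max of N
    print(mean_parallel(*l, N))                # closed form
    print(gld_quantile(u, *l).mean())          # Monte Carlo cross-check

Higher-order moments E[Y^r] follow the same route, with the binomial expansion of Q(v)^r producing a short sum of Beta functions.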

We have shown that for many practical execution time distributions, our 4-parameter model is capable of delivering much better prediction accuracy than models with fewer parameters, such as the negative exponential and normal distributions, at ultra-low solution complexity. Experiments based on a large variety of task distributions, both synthetically generated and measured from real programs, indicate that the relative error for N-ary parallel composition of iid tasks is typically in the order of percents, provided the distributions are unimodal and uncorrelated. Due to the GLD approximation, the error often tends to increase with N. For normal distributions, the error of the mean grows from 1 percent to 10 percent for 1,000 and 1,000,000 processors, respectively. For different task execution times, the error for binary parallel composition is not very sensitive to the variation of parameter values. As the error is largest for equal distributions, the binary approach is complementary to the N-ary approach. The experimental results also confirm that our GLD approach, as well as the approaches based on alternative distribution models, only applies as long as correlation is limited, as the error due to correlation effects grows with N. In the programs we measured, the first-moment error due to correlation is usually in the order of percents, ranging up to 10 percent in two particular cases.

ACKNOWLEDGMENTS

The authors extend their sincere appreciation to the anonymous reviewers for their extremely valuable and insightful comments, which significantly helped to improve the paper.

REFERENCES

[1] V.S. Adve, "Analyzing the Behavior and Performance of Parallel Programs," PhD thesis #1201, Univ. of Wisconsin-Madison, Dec. 1993.

[2] V.S. Adve and M.K. Vernon, "The Influence of Random Delays on Parallel Execution Times," Proc. SIGMETRICS '93, pp. 61-73, May 1993.

[3] I. Angus et al., Solving Problems on Concurrent Processors, Software for Concurrent Processors, vol. 2, Prentice Hall, 1990.

[4] D. Bailey et al., "The NAS Parallel Benchmarks," Report RNR-94-007, NASA Ames Research Center, Mar. 1994.

[5] V. Balasundaram, G. Fox, K. Kennedy, and U. Kremer, "A Static Performance Estimator to Guide Data Partitioning Decisions," Proc. ACM SIGPLAN PPoPP, pp. 213-223, Apr. 1991.

[6] M.J. Clement and M.J. Quinn, "Multivariate Statistical Techniques for Parallel Performance Prediction," Proc. 28th Hawaii Int'l Conf. System Sciences, vol. 2, pp. 446-455, Jan. 1995.

[7] B. Dodin, "Bounding the Project Completion Time Distribution in PERT Networks," Operations Research, vol. 33, no. 4, pp. 862-881, 1985.

[8] T. Fahringer and H.P. Zima, "A Static Parameter-Based Performance Prediction Tool for Parallel Programs," Proc. Seventh ACM Int'l Conf. Supercomputing, pp. 207-219, July 1993.

[9] H. Gautama and A.J.C. van Gemund, "Static Performance Prediction of Data-Dependent Programs," Proc. ACM Int'l Workshop Software and Performance '00, pp. 216-226, Sept. 2000.

[10] H. Gautama and A.J.C. van Gemund, "Low-Cost Performance Prediction of Data-Dependent Data Parallel Programs," Proc. IEEE Int'l Symp. Modeling, Analysis and Simulation of Computer and Telecomm. Systems '01, pp. 173-182, Aug. 2001.

[11] H. Gautama, "A Statistical Approach to Performance Modeling of Parallel Systems," PhD thesis, Delft Univ. of Technology, Dec. 2004.

[12] E. Gelenbe, E. Montagne, R. Suros, and C.M. Woodside, "Performance of Block-Structured Parallel Programs," Parallel Algorithms and Architectures, pp. 127-138, North-Holland, 1986.

[13] A.J.C. van Gemund, "The Importance of Synchronization Structure in Parallel Program Optimization," Proc. ACM Int'l Conf. Supercomputing, pp. 164-171, 1997.

[14] A.J.C. van Gemund, "Symbolic Performance Modeling of Parallel Systems," IEEE Trans. Parallel and Distributed Systems, vol. 14, no. 2, pp. 154-165, Feb. 2003.

[15] A. Gonzalez-Escribano, A.J.C. van Gemund, and V. Cardenoso-Payo, "Mapping Unstructured Applications into Nested Parallelism," Proc. Int'l Meeting High Performance Computing for Computational Science '02, pp. 469-482, 2002.

[16] E.J. Gumbel, "Statistical Theory of Extreme Values (Main Results)," Contributions to Order Statistics, pp. 56-93, Wiley and Sons, 1962.

[17] F. Hartleb and V. Mertsiotakis, "Bounds for the Mean Runtime of Parallel Programs," Proc. Sixth Int'l Conf. Modelling Techniques and Tools for Computer Performance Evaluation, pp. 197-210, Sept. 1992.


[18] C.P. Kruskal and A. Weiss, "Allocating Independent Subtasks on Parallel Processors," IEEE Trans. Software Eng., vol. 11, pp. 1001-1016, Oct. 1985.

[19] B.P. Lester, "A System for the Speedup of Parallel Programs," Proc. Int'l Conf. Parallel Processing '86, pp. 145-152, 1986.

[20] D.-R. Liang and S.K. Tripathi, "On Performance Prediction of Parallel Computations with Precedent Constraints," IEEE Trans. Parallel and Distributed Systems, vol. 11, no. 5, pp. 491-508, May 2000.

[21] J. Luthi, S. Majumdar, G. Kotsis, and G. Haring, "Performance Bounds for Distributed Systems with Workload Variabilities and Uncertainties," Parallel Computing, vol. 22, pp. 1789-1806, Feb. 1997.

[22] S. Madala and J.B. Sinclair, "Performance of Synchronous Parallel Algorithms with Regular Structures," IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 2, pp. 105-116, Jan. 1991.

[23] V.W. Mak and S.F. Lundstrom, "Predicting Performance of Parallel Computations," IEEE Trans. Parallel and Distributed Systems, vol. 1, no. 3, pp. 257-270, July 1990.

[24] C.L. Mendes, J.-C. Wang, and D.A. Reed, "Automatic Performance Prediction and Scalability Analysis for Data Parallel Programs," Proc. Second Workshop Automatic Data Layout and Performance Prediction, Apr. 1995.

[25] D.M. Olsson and L.S. Nelson, "Nelder-Mead Simplex Procedure for Function Minimization," Technometrics, vol. 17, pp. 45-51, 1975.

[26] C.D. Polychronopoulos, M. Girkar, M.R. Haghighat, C.L. Lee, B. Leung, and D. Schouten, "Parafrase-2: An Environment for Parallelizing, Partitioning, Synchronizing, and Scheduling Programs on Multiprocessors," Proc. Int'l Conf. Parallel Processing '89, pp. 39-48, Aug. 1989.

[27] M.J. Quinn, Parallel Computing: Theory and Practice. McGraw-Hill, 1994.

[28] J.S. Ramberg, P.R. Tadikamalla, E.J. Dudewicz, and E.F. Mykytka, "A Probability Distribution and Its Uses in Fitting Data," Technometrics, vol. 21, pp. 201-214, 1979.

[29] J.T. Robinson, "Some Analysis Techniques for Asynchronous Multiprocessor Algorithms," IEEE Trans. Software Eng., vol. 5, pp. 24-31, Jan. 1979.

[30] R.A. Sahner and K.S. Trivedi, "Performance and Reliability Analysis Using Directed Acyclic Graphs," IEEE Trans. Software Eng., vol. 13, pp. 1105-1114, Oct. 1987.

[31] V. Sarkar, "Determining Average Program Execution Times and Their Variance," Proc. SIGPLAN Conf. Programming Language Design and Implementation '89, pp. 298-312, 1989.

[32] V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors. MIT Press, 1989.

[33] J.M. Schopf, "A Practical Methodology for Defining Histograms for Predictions and Scheduling," Proc. ParCo '99, pp. 664-671, Aug. 1999.

[34] J.M. Schopf and F. Berman, "Using Stochastic Information to Predict Application Behavior on Contended Resources," Int'l J. Foundations of Computer Science, vol. 12, no. 3, pp. 341-364, 2001.

[35] H. Shi and J. Schaeffer, "Parallel Sorting by Regular Sampling," J. Parallel and Distributed Computing, vol. 14, no. 4, pp. 361-372, 1992.

[36] A.W. Shogan, "Bounding Distributions for a Stochastic PERT Network," Networks, vol. 7, pp. 359-381, 1977.

[37] F. Sotz, "A Method for Performance Prediction of Parallel Programs," Proc. CONPAR 90-VAPP IV, Joint Int'l Conf. Vector and Parallel Processing, pp. 98-107, 1990.

[38] A. Stuart and J.K. Ord, Kendall's Advanced Theory of Statistics, vol. 1, sixth ed., New York: Halsted Press, 1994.

[39] A. Thomasian and P. Bay, "Analytic Queuing Network Models for Parallel Processing of Task Systems," IEEE Trans. Computers, vol. 35, no. 12, pp. 1045-1054, Dec. 1986.

[40] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, and A. Gupta, "The SPLASH-2 Programs: Characterization and Methodological Considerations," Proc. Int'l Symp. Computer Architecture '95, pp. 24-36, 1995.

[41] N. Yazici-Pekergin and J.-M. Vincent, "Stochastic Bounds on Execution Times of Parallel Programs," IEEE Trans. Software Eng., vol. 17, no. 10, pp. 1005-1012, Oct. 1991.

Hasyim Gautama received the MSc (cum laude) degree in electrical engineering in 1998 and the PhD degree in computer science, both from Delft University of Technology. Subsequently, he performed postdoctoral research on sensor network communication at Delft University of Technology. As of 2005, he works at the National Statistics Bureau in Bogor, Indonesia. His research interest is in the performance modeling of parallel and distributed systems.

Arjan J.C. van Gemund received the BSc degree in physics, the MSc degree (cum laude) in computer science, and the PhD degree (cum laude) in computer science, all from Delft University of Technology, the Netherlands. Between 1980 and 1992, he joined various Dutch companies and institutes, where he worked in embedded hardware and software development, acoustics research, fault diagnosis, and high-performance computing. In 1992, he joined the Information Technology and Systems Faculty at Delft University of Technology, where he is currently serving as an associate professor. He is also the cofounder of Science and Technology, a Dutch software and consultancy startup. His research interests are in the areas of parallel systems performance modeling, fault diagnosis, and parallel programming and scheduling. He is a member of the IEEE Computer Society.

