
Transcript of [IEEE Comput. Soc Fourth International Conference on High-Performance Computing - Bangalore, India...

Page 1: [IEEE Comput. Soc Fourth International Conference on High-Performance Computing - Bangalore, India (18-21 Dec. 1997)] Proceedings Fourth International Conference on High-Performance Computing

Supporting Unbounded Process Parallelism in the SPC Programming Model

Arjan J.C. van Gemund
http://duteppo.et.tudelft.nl/~gemund

Department of Electrical Engineering Delft University of Technology

P.O.Box 5031, NL-2600 GA Delft, The Netherlands

Abstract

In automatic mapping of parallel programs to target parallel machines the efficiency of the compile-time cost estimation needed to steer the optimization process is highly dependent on the choice of programming model. Recently a new parallel programming model, called SPC, has been introduced that specifically aims at the efficient computation of reliable cost estimates, paving the way for automatic mapping. In SPC all algorithm level parallelism is explicitly specified, relying on compile-time transformation of the possibly unbounded algorithm level (data) parallelism to that of the actual target machine. In this paper we present SPC’s process-algebraic framework in terms of which we demonstrate that the transformations needed to efficiently support unbounded process parallelism at program level are straightforward.

1 Introduction

Despite recent advances in parallel language compilation technology, programming models such as HPF still expose the high sensitivity of machine performance to programming decisions. The user is still confronted with complex optimization issues such as computation vectorization, communication pipelining, loop transformation, and, most notably, code and data partitioning, all of which require a thorough understanding of the complex interplay between program and target machine. A critical success factor for high-performance computing is the development of compilation techniques that perform optimizations automatically, steered by compile-time cost estimation, thus providing portability as well as performance.

Although a range of fine compile-time cost prediction methods currently exists (e.g., [3, 6, 8, 19]), either the underlying analysis (and the associated parameter space) is targeted at a particular type of parallel architecture, or the technique is not intended to produce reliable estimates throughout the entire parameter search space that may be covered by an optimization technique. The generally limited ability of compile-time cost estimation to be both highly efficient and reliable at the same time is due to the fact that the emphasis in parallel programming models is on expressiveness, i.e., the ability to express the inherent parallelism within the algorithm to the fullest, rather than analyzability, which in the automatic optimization context means the ability to derive highly efficient yet reliable cost estimates.

Recently a new parallel programming model, called SPC, has been proposed [12] that imposes specific restrictions on the synchronization structures that can be programmed. Imposing these restrictions enables the efficient computation of reliable cost estimates, thus paving the way for automatic program mapping. The limitation of expressiveness in favor of analyzability is related to the conjecture [12] that the loss of parallelism when programming according to the SPC model is typically limited to within a constant factor of 2 compared to the unrestricted case. Recent results [7] indeed provide compelling evidence in favor of this conjecture. We feel this limited loss of parallelism is outweighed by the unlocked potential of automatic performance optimization as well as the portability that is achieved.

In the SPC model all inherent algorithm parallelism is explicitly expressed at program level even though the actual machine parallelism may be orders of magnitude less. For example, in SPC, a software pipeline comprising M stages processing a stream of N data items requires specifying N processes instead of the M processes typically found in, e.g., message-oriented solutions. While M is typically small, N may even be infinite (as in stream processing), which implies that such a portable solution heavily relies on the compiler’s ability to efficiently reduce the N-process program to a practical, high-performance M-process implementation.

While in previous work we have shown how SPC enables automatic performance optimizations, in this paper we present the compile-time process reduction scheme. We show how SPC serves as an intermediate, process-algebraic formalism in terms of which algorithms can be cleanly expressed and transformed into efficient machine level implementations. The paper is organized as follows. In Section 2 we briefly outline the SPC programming model. In Section 3 we present the process calculus that is used to describe the transformations required to reduce SPC algorithms to efficient machine implementations. In Section 4 we discuss related work, followed by a summary in Section 5.

2 The SPC programming model

In this section we briefly introduce the SPC programming model as far as it is needed in this paper. A more detailed description can be found in [12].

SPC is a process-algebraic coordination language that comprises a small number of operators to control parallelism and synchronization. The data computations are expressed in a separate, imperative sequential host language (currently C), aiming at the separation of concerns typical of many coordination languages [10]. The “SP” prefix in SPC refers to the fact that the condition synchronization [2] (CS) patterns are restricted to series-parallel (SP) form. The “C” term refers to SPC’s specific programming paradigm, called contention programming, which is based on the specific use of mutual exclusion [2] (ME) to describe scheduling constraints. In the contention programming model all inherent algorithm parallelism is explicitly expressed. Parallelism is constrained by introducing contention for software resources (e.g., critical sections) or hardware resources (e.g., processing units), in a manner similar to the material-oriented paradigm used in the simulation domain [15].

2.1 Language

In SPC, programming, i.e., applying compositions, is done by specifying process equations using a straightforward substitution mechanism. By convention, the expression tree is rooted by a special process called main that represents the overall program. In order to ensure the correct binding, braces { and } can be used to delimit compound process expressions. Apart from the data types needed to express numeric computations, the main data type in SPC is the process.

Basic data computations are expressed in terms of processes. For example, the equation

task(i) = --C-- { y[i] = f(x[i]); }

describes a process task(i), where the --C-- { ... } is an inlining facility to provide an interface with the sequential host language (C in the case of our current simulation environment).

Sequential composition is described by the infix operator ; as in task_1 ; task_2. Loops are described by the reduction operator seq, as in seq (i = 1, N) task(i), which is equivalent to task(1) ; ... ; task(N). Unlike loops within inlined tasks, seq loops are involved in SPC’s compile-time optimization.

Parallel composition is described by the infix operator || as in task_1 || task_2. Parallel loops are described by the reduction operator par, as in par (i = 1, N) task(i), which is equivalent to task(1) || ... || task(N). The parallel operator has fork/join semantics, which implies a mutual barrier synchronization at the end of the parallel section. Note that this provides a structured form of condition synchronization.
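As an illustration (not part of SPC itself), a minimal C sketch of the fork/join behavior that par (i = 1, N) task(i) implies, assuming POSIX threads and a placeholder task function: all N activities are forked, and the construct only completes once every one of them has joined, i.e., at the barrier that ends the parallel section.

#include <pthread.h>
#include <stdio.h>

#define N 4

/* placeholder stand-in for the SPC process task(i) */
static void *task(void *arg) {
    int i = *(int *)arg;
    printf("task(%d)\n", i);
    return NULL;
}

int main(void) {
    pthread_t tid[N];
    int idx[N];

    /* "fork": start all N parallel activities */
    for (int i = 0; i < N; i++) {
        idx[i] = i + 1;
        pthread_create(&tid[i], NULL, task, &idx[i]);
    }
    /* "join": the implicit barrier at the end of the parallel section */
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
    return 0;
}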

Conditional composition is specified by the if operator, which has an optional else part, as in if (c) task_1 else task_2. SPC also includes a while construct, which, however, is not considered in this paper.

In order to express global synchronizations at machine level, explicit condition variables are supported that come with wait/signal operators. The operation wait(c) blocks any process on the condition variable c until some process executes signal(c). As this construct would allow the expression of non-SP (NSP) synchronization structures, its use is restricted to the intermediate level during the compilation phase.
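One possible machine-level realization of these condition variables (a sketch only, assuming POSIX threads; the actual SPC runtime may differ) is a sticky event built from a mutex and a condition variable, so that a wait(c) issued after the corresponding signal(c) does not block:

#include <pthread.h>

/* a condition variable in the SPC sense: once signalled, it stays signalled */
typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    int signalled;
} spc_cond;

/* initialize with: spc_cond c = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }; */

void spc_wait(spc_cond *c) {
    pthread_mutex_lock(&c->m);
    while (!c->signalled)            /* block until some process signals */
        pthread_cond_wait(&c->cv, &c->m);
    pthread_mutex_unlock(&c->m);
}

void spc_signal(spc_cond *c) {
    pthread_mutex_lock(&c->m);
    c->signalled = 1;                /* record the event */
    pthread_cond_broadcast(&c->cv);  /* release all waiting processes */
    pthread_mutex_unlock(&c->m);
}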

While the process abstract data type provides a means to specify both parallelism and condition synchronization, the resource abstract data type provides the means to impose mutual exclusion. A process that is to be executed under mutual exclusion is mapped to a resource as in

add(i) -> critical

where the -> operator denotes the mapping of the processes add(i) to a logical processor called critical. Consequently, all add(i) are executed under mutual exclusion as a result of the fact that they are mapped onto the same resource. Note that, again, synchronization is structured in terms of a single construct instead of, e.g., using two semaphore operations to achieve mutual exclusion.



An example application of the above resource mapping is the following program

main    = par (i = 1, N) { task(i) ; add(i) }
task(i) = --C-- { y[i] = f(x[i]); }
add(i)  = --C-- { z = z + y[i]; }
add(i)  -> critical

that computes f(x[1]) + ... + f(x[N]) using a simple, linear-time reduction scheme. Note that in the program the add(i) tasks are scheduled dynamically, avoiding unnecessary delays in view of the generally non-deterministic execution times of the task(i) processes. The default scheduling policy associated with resources is FCFS with non-deterministic, fair conflict arbitration, but other scheduling disciplines can be specified as well. As in the above example, at programming level a resource is typically logical (a critical software section) in order to retain machine independence. Physical resources, like hardware processors, are introduced at the intermediate (SPC) level during compilation.
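To make the contention reading of this program concrete, a minimal C sketch of the behavior it prescribes (POSIX threads assumed; f, x, y, z and N are placeholders): the task(i)/add(i) pairs run in parallel, while a single mutex plays the role of the logical resource critical and serializes the add(i) updates of z.

#include <pthread.h>

#define N 8

static double x[N], y[N], z;                 /* placeholder data */
static pthread_mutex_t critical = PTHREAD_MUTEX_INITIALIZER;

static double f(double v) { return v * v; }  /* placeholder computation */

/* one parallel instance: task(i) followed by add(i) */
static void *instance(void *arg) {
    int i = *(int *)arg;
    y[i] = f(x[i]);                 /* task(i), fully parallel          */
    pthread_mutex_lock(&critical);  /* add(i) contends for the resource */
    z = z + y[i];
    pthread_mutex_unlock(&critical);
    return NULL;
}

int main(void) {
    pthread_t t[N];
    int idx[N];
    for (int i = 0; i < N; i++) {
        idx[i] = i;
        pthread_create(&t[i], NULL, instance, &idx[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);   /* the implicit join of par */
    return 0;
}

Note that a plain mutex provides the mutual exclusion but not the FCFS arbitration order mentioned above; the sketch is only meant to illustrate the semantics of the resource mapping.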

2.2 Example

In the previous example the use of resource mapping was required to avoid numerical non-determinism. In the following example we show the use of resource mappings to express pipelining, a form of data parallelism where the actual schedule is due to the limited functional parallelism at machine level. Although this mapping is typically derived by the optimizing compilation phase, we show this example to illustrate that pipelining can also be explicitly programmed by the user, effectively introducing mapping constraints with regard to the subsequent compilation phase. Examples including divide-and-conquer schemes as well as other dynamically scheduled synchronization patterns, such as logarithmic reduction schemes, are presented in [12, 13].

Consider some data parallel computation involving a sequence of M operations task(i, j), j = 1, ..., M, to be applied for a set of (data) indices i = 1, ..., N, where N is typically large. The corresponding SPC program is given by

main = par (i = 1, N)
         seq (j = 1, M)
           task(i, j)

where the terms are spread across multiple lines to increase readability.

A straightforward mapping to a parallel machine would be based on partitioning the i axis, involving up to N processors. However, we consider a pipelined schedule restricting the parallelism to only M processing units. In many programming models the solution is typically programmed according to a machine-oriented paradigm [15], where each processing stage is explicitly programmed using a message-oriented producer-consumer scheme. The SPC solution only requires the additional mapping constraint

task(i, j) -> unit(j)

Thus we simply retain the inherent data parallelism in terms of the construct par (i = 1, N), just as before. However, we introduce the dynamic constraint that any unit(j) can only process one operation task(i, j) at a time. Again, unit(j) does not necessarily correspond to a physical processor.

The pipeline behavior automatically follows from the mutual exclusion induced by this resource assignment. Note that this is a natural way of programming that is also very portable. Each individual data process simply “contends” for the available computing resources, hence the name “contention programming”, a concept that lies at the heart of the SPC coordination model.
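As a sketch of how this contention reading yields a pipeline (again assuming POSIX threads; M, N, and the work inside stage() are placeholders, and plain mutexes stand in for the FCFS resources unit(j)): each data process i walks through the M stages in order and holds unit(j) only for the duration of task(i, j), so at most one element occupies a stage at any time and the elements stream through the stages.

#include <pthread.h>
#include <stdio.h>

#define N 6   /* data elements (placeholder)   */
#define M 3   /* pipeline stages (placeholder) */

/* plain mutexes stand in for the logical resources unit(1)..unit(M) */
static pthread_mutex_t unit[M];

static void stage(int i, int j) {            /* placeholder for task(i, j) */
    printf("task(%d,%d)\n", i, j);
}

/* one data process: seq (j = 1, M) task(i, j), each stage under contention */
static void *data_process(void *arg) {
    int i = *(int *)arg;
    for (int j = 0; j < M; j++) {
        pthread_mutex_lock(&unit[j]);        /* contend for unit(j) */
        stage(i, j);
        pthread_mutex_unlock(&unit[j]);
    }
    return NULL;
}

int main(void) {
    pthread_t t[N];
    int idx[N];
    for (int j = 0; j < M; j++)
        pthread_mutex_init(&unit[j], NULL);
    for (int i = 0; i < N; i++) {            /* par (i = 1, N) */
        idx[i] = i;
        pthread_create(&t[i], NULL, data_process, &idx[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return 0;
}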

The SPC solution still involves the use of N processes instead of M, where N can be extremely large, if not infinite. In the next section it is shown that the compile-time transformation to an efficient M-process implementation is straightforward and is expressed in terms of SPC itself, which illustrates SPC’s utility as an intermediate transformation formalism.

3 Compiling SPC

In the compilation of SPC programs two phases can be distinguished, called the optimization phase and the reduction phase. In the optimization phase, SPC transformations are considered that minimize the execution time cost. As a result of the specific choice of synchronization restrictions in SPC, the time cost of an SPC expression can be mechanically mapped into a closed-form, symbolic time cost expression that has linear solution complexity in the size of the SPC expression [11]. In general, the estimation process produces a lower bound. Yet, unlike cost estimation approaches based on analyzing CS only, all dominant terms are generated. Typical optimizations during this phase are code and data mapping. In [12] an example is given of how optimizations for distributed-memory machines are derived for the well-known line relaxation algorithm kernel ADI, yielding the same data layout and processor pipelining scheme as presented in, e.g., [1].
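To give a flavor of the kind of symbolic cost expressions involved (a simplified illustration only, assuming a single exclusive resource r; these are not the exact analysis rules of [11]): the cost of sequential composition is additive, parallel composition is bounded from below by its slowest branch, and mapping all branches onto one resource adds a serialization term:

\[
T(A \,;\, B) = T(A) + T(B), \qquad
T(A \parallel B) \;\ge\; \max\bigl(T(A),\, T(B)\bigr),
\]
\[
T\bigl(\mathrm{par}\,(i = 1, N)\ A_i \rightarrow r\bigr) \;\ge\;
\max\Bigl(\max_{1 \le i \le N} T(A_i),\ \sum_{i=1}^{N} T(A_i)\Bigr).
\]

The first argument of the outer max reflects the SP condition synchronization structure (the critical path), the second the mutual exclusion on r; taking the maximum over all such dominant terms is what makes the estimate a lower bound.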

In the optimization phase the number of processes in the SPC expression is unchanged (i.e., the inherent program parallelism); only different resource mappings are considered, aimed at decreasing the execution cost. When the optimal task and data mapping has been defined, however, the reduction phase commences, in which the typically large program level parallelism is reduced to the amount of parallelism available at machine level.

3.1 Process Reduction

In the reduction phase of the compilation process, SPMD code is generated from SPC program expressions. We show how a possibly unbounded number of logical SPC processes are mapped onto either machine threads (i.e., scheduled at runtime) or ordinary sequential SPMD code. The SPMD code generation process is partly similar to the code generation of data parallel programs, in which statement execution is based on the use of the well-known “owner-computes” rule. While in data parallel programs ownership is typically determined by the mapping of the LHS statement data, in our scheme the mapping is based on the task mapping that is computed in the optimization phase. As explained earlier, we will assume that this phase has already taken place, the result being given by SPC resource mapping statements such as

task(i, j) -> processor(owner(i, j))

where processor denotes the physical processing resource and owner denotes the computed mapping. We assume a target machine architecture based on P (multithreaded) processors.

The reduction process essentially comprises the fol- lowing transformations:

• introduce an outer par (p = 0, P-1) loop that models the SPMD machine parallelism.

• guard all basic blocks (tasks) with an “owner-computes” test according to the individual task mapping. Do not guard seq and if control constructs.

• ensure all control flow is global, i.e., replace all seq loops that cross processor bounds by global synchronization schemes.

• apply various reductions similar to the loop (index) optimizations found in data parallel compilation schemes.

Rather than presenting the transformations in formal terms, in the following we will illustrate the concepts by translating the previous pipelining example (see Section 2.2). Instead of discussing the trivial case of a block partitioning on the i axis, we will study the case of a block partitioning on the j axis (i.e., true pipelining). Despite the pipeline’s simplicity, this partitioning introduces problems that are quite typical of more complex programs. More details can be found in [13].

Under a j axis block partitioning the initial program is given by

main = par (i = 1, N)
         seq (j = 1, M)
           task(i, j)

task(i, j) -> processor(j/B)

where P ≤ M such that M/P = B, B > 1, and P | M for simplicity. Note that this generalizes over simply mapping the M computation stages to exactly P = M processors. Applying the first three steps of the above SPMD transformation yields

main = par (p = 0, P-1)
         par (i = 0, N-1)
           par (j = 0, M-1)
             if (j/B == p) {
               wait(c[i, j-1]) ;
               task(i, j) ;
               signal(c[i, j])
             }

task(i, j) -> processor(j/B)

where p denotes the SPMD processor index, if (j/B == p) denotes the guard of task(i, j) (with / denoting integer division), and c[i, j] denotes a global condition variable associated with each task instance (wait(c[i, -1]) is assumed not to block). Note that the 2-dimensional condition array is necessary as some upstream processes may produce data more quickly than the downstream processes can consume it. Also note that, because of the explicit synchronization scheme, the original seq (j) loop can safely be replaced by a par (j) loop.

Although the above SPC form is the basis for a correct SPMD scheme, a number of important reductions can be applied in order to obtain efficient execution. Given the regular partitioning in our example, the above expression immediately reduces to

main = par (p = 0, P-1)
         par (i = 0, N-1)
           par (k = 0, B-1) {
             wait(c[i, p*B+k-1]) ;
             task(i, p*B+k) ;
             signal(c[i, p*B+k])
           }

task(i, j) -> processor(j/B)

Essentially, the inner loop still represents a sequence of M operations chained by wait/signal precedence relations, of which blocks of B operations are local to the k loop. Now that the ownership test has been reduced in terms of loop bounds, the sequential inner loop can be partially transformed to local seq loops according to

main = par (p = 0, P-1)
         par (i = 0, N-1) {
           wait(c[i, p*B-1]) ;
           seq (k = 0, B-1)
             task(i, p*B+k) ;
           signal(c[i, p*B+B-1])
         }

task(i, j) -> processor(j/B)

At this point we can choose between generating multithreaded or sequential SPMD code. In a multithreaded version, the par (i) loop, combined with the task(i, j) mapping statement, generates a sequential for loop that forks N threads on each processor p (because of the guard, by definition p = j/B). However, because in stream processing N can be extremely large, we choose the sequential version, in which all N threads are statically scheduled in terms of a seq loop according to

main = par (p = 0, P-1)
         seq (i = 0, N-1) {
           wait(c[i, p*B-1]) ;
           seq (k = 0, B-1)
             task(i, p*B+k) ;
           signal(c[i, (p+1)*B-1])
         }

This form follows from the fact that every parallel i instance executes on the same processor p as defined by the task mapping (formally, the par and the mapping statement reduce to the seq schedule). Note that the 2-dimensional condition array is still necessary, although many of the original elements are no longer used due to the sequential block aggregation. Thus the expression can be further reduced to

main = par (p = 0, P-1)
         seq (i = 0, N-1) {
           wait(c[i, p-1]) ;
           seq (k = 0, B-1)
             task(i, p*B+k) ;
           signal(c[i, p])
         }

which is the final result in terms of the SPC intermediate form. For an asynchronous message-passing architecture, a number of straightforward code transformations [13] yield the following pseudo code

for (i = 0; i <= N-1; i++) {
  recv(p-1, ...) ;
  for (k = 0; k <= B-1; k++)
    task(i, p*B+k) ;
  send(p, ...)
}

where the data computed by processor p is stored at processor p+1. Similar to wait, recv(p-1, ...) is assumed not to block.

The above example shows how the SPC process algebra is applied to rewrite program level SPC expressions in terms of the underlying machine. The above procedure applies to any shared-memory or distributed-memory machine with a possibly multithreaded process layer. As can be seen from the reduction procedure, the source expressions need not be in SP form, i.e., there may also be explicit wait/signal constructs present at source level (which are subject to guarding). Hence, our reduction calculus is also applicable beyond the SPC programming model.
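For concreteness, a hedged C/MPI sketch of what such generated code might look like for one processor rank p (stage_compute, the single double per stream element, and the constants are placeholders introduced here for illustration; this is not the actual code generated by [13]): each rank receives a stream element from rank p-1, applies its block of B operations, and forwards the result to rank p+1.

#include <mpi.h>

#define N 1000   /* stream length (placeholder)            */
#define B 4      /* operations per processor (placeholder) */

/* placeholder for task(i, p*B+k) applied to one stream element */
static double stage_compute(double v, int i, int j) {
    return v + (double)j;   /* dummy work */
}

int main(int argc, char **argv) {
    int p, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &p);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    for (int i = 0; i <= N - 1; i++) {
        double v = 0.0;
        if (p > 0)                               /* recv(p-1, ...) */
            MPI_Recv(&v, 1, MPI_DOUBLE, p - 1, i,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int k = 0; k <= B - 1; k++)         /* the local block of B stages */
            v = stage_compute(v, i, p * B + k);
        if (p < P - 1)                           /* send(p, ...) */
            MPI_Send(&v, 1, MPI_DOUBLE, p + 1, i, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Unlike the pseudo code above, which assumes a non-blocking recv, this sketch uses blocking MPI calls for simplicity, so a rank simply stalls until its upstream neighbor has produced the next element.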

4 Related Work

In the following we briefly discuss some of the interesting parallel programming models that have been proposed, in terms of how parallelism is induced and how it is mapped to machine level.

In library approaches such as MPI [17] parallelism is provided through a message-oriented API. In the absence of a compile-time process mapping layer, each process is simply mapped one-to-one to an operating system process which, in turn, is statically mapped onto a processor.

The one-to-one relationship between a program level task and a machine level process (typically, a thread) is also found in language approaches such as the parallel object-oriented language CC++ [4], the task parallel extensions to data parallelism found in Fortran-M [9] and Opus [5], and the task-annotated language Jade [16]. Also in data parallel languages such as HPF [14], the amount of parallelism, induced from the data partitioning directives, is typically the same as the number of SPMD processes used at machine level.

Essentially, all the above programming models support a machine-oriented approach towards expressing parallelism. Unlike the above models, SPC's contention programming approach is based on the explicit expression of all task and data parallelism, the reduction to actual machine level parallelism being based on compile-time transformation techniques.



5 Conclusion

In this paper we have outlined the compile-time process transformation layer that supports the large program level process parallelism on which the SPC programming model is based. The reduction of parallelism needed to support unbounded process parallelism at program level is straightforward, which is illustrated by the example compilation of a highly data parallel computation to a coarse grain pipelined implementation on a message-passing architecture. Future research will include a formalization of the SPC process calculus. Parts of the SPC philosophy will be applied within the AUTOMAP research project [18], which aims at a parallel programming language and compilation system in which code and data mapping are entirely automated.

References

[1] V.S. Adve, A. Carle, E. Granston, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, J. Mellor-Crummey, S. Warren and C-W. Tseng, “Requirements for data-parallel programming environments,” IEEE Parallel and Distributed Technology, Fall 1994, pp. 48-58.

[2] G.R. Andrews and F.B. Schneider, “Concepts and notations for concurrent programming,” ACM Computing Surveys, vol. 15, no. 1, Mar. 1983, pp. 3-43.

[3] V. Balasundaram, G. Fox, K. Kennedy and U. Kremer, “A static performance estimator to guide data partitioning decisions,” in Proc. 3rd ACM SIGPLAN Symposium on PPoPP, Apr. 1991.

[4] K.M. Chandy and C. Kesselman, “CC++: A declarative concurrent object-oriented programming notation,” in Research Directions in Concurrent Object-Oriented Programming (G. Agha, P. Wegner and A. Yonezawa, eds.), MIT Press, 1993, pp. 281-313.

[5] B. Chapman, P. Mehrotra, J. Van Rosendale and H. Zima, “A software architecture for multidisciplinary applications: Integrating task and data parallelism,” in Proc. Fifth Workshop on Compilers for Parallel Computers, Malaga, June 1995, pp. 454-466.

[6] M.J. Clement and M.J. Quinn, “Multivariate statistical techniques for parallel performance prediction,” in Proc. 28th Hawaii Int. Conf. on System Sciences, Vol. II, IEEE, Jan. 1995, pp. 446-455.

[7] A. González Escribano, V. Cardeñoso Payo and A.J.C. van Gemund, “On the loss of parallelism by imposing synchronization structure,” in Proc. 1st EURO-PDS Int'l Conf. on Parallel and Distributed Systems, Barcelona, June 1997.

[8] T. Fahringer and H.P. Zima, “A static parameter-based performance prediction tool for parallel programs,” in Proc. 7th ACM Int'l Conf. on Supercomputing, Tokyo, July 1993, pp. 207-219.

[9] I. Foster, “Task parallelism and high-performance languages,” IEEE Parallel and Distributed Technology, Fall 1994, pp. 27-36.

[10] D. Gelernter and N. Carriero, “Coordination languages and their significance,” Communications of the ACM, vol. 35, Feb. 1992, pp. 97-107.

[11] A.J.C. van Gemund, “Compile-time performance prediction of parallel systems,” in Computer Performance Evaluation: Modelling Techniques and Tools, LNCS 977, Sept. 1995, pp. 299-313.

[12] A.J.C. van Gemund, “The importance of synchronization structure in parallel program optimization,” in Proc. 11th ACM Int'l Conf. on Supercomputing, Vienna, July 1997, pp. 164-171.

[13] A.J.C. van Gemund, “Notes on SPC: A parallel programming model,” Tech. Rep. 1-68340-44(1997)03, Delft University of Technology, 1997.

[14] C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr. and M. Zosel, The High Performance Fortran Handbook, MIT Press, 1994.

[15] W. Kreutzer, System Simulation: Programming Styles and Languages, Addison-Wesley, 1986.

[16] M.C. Rinard, D.J. Scales and M.S. Lam, “Jade: A high-level, machine-independent language for parallel programming,” Computer, June 1993, pp. 28-38.

[17] M. Snir, S. Otto, S. Huss-Lederman, D. Walker and J. Dongarra, MPI: The Complete Reference, Cambridge, MA: MIT Press, 1996.

[18] C. van Reeuwijk, H.J. Sips, H.X. Lin and A.J.C. van Gemund, “Automap: A parallel coordination-based programming system,” Tech. Rep. 1-68340-44(1997)04, Delft University of Technology, Apr. 1997.

[19] K-Y. Wang, “Precise compile-time performance prediction for superscalar-based computers,” in Proc. PLDI'94, Orlando, 1994, pp. 73-84.
