
Transcript of [IEEE Comput. Soc Fourth International Conference on High-Performance Computing - Bangalore, India...

Page 1: [IEEE Comput. Soc Fourth International Conference on High-Performance Computing - Bangalore, India (18-21 Dec. 1997)] Proceedings Fourth International Conference on High-Performance Computing

Supporting Unbounded Process Parallelism in the SPC Programming Model

Arjan J.C. van Gemund
http://duteppo.et.tudelft.nl/~gemund

Department of Electrical Engineering Delft University of Technology

P.O.Box 5031, NL-2600 GA Delft, The Netherlands

Abstract

In automatic mapping of parallel programs to target parallel machines the efficiency of the compile-time cost estimation needed to steer the optimization process is highly dependent on the choice of programming model. Recently a new parallel programming model, called SPC, has been introduced that specifically aims at the efficient computation of reliable cost estimates, paving the way for automatic mapping. In SPC all algorithm level parallelism is explicitly specified, relying on compile-time transformation of the possibly unbounded algorithm level (data) parallelism to that of the actual target machine. In this paper we present SPC’s process-algebraic framework in terms of which we demonstrate that the transformations needed to efficiently support unbounded process parallelism at program level are straightforward.

1 Introduction

Despite recent advances in parallel language compilation technology, programming models such as HPF still expose the high sensitivity of machine performance to programming decisions. The user is still confronted with complex optimization issues such as computation vectorization, communication pipelining, loop transformation, and, most notably, code and data partitioning, all of which require a thorough understanding of the complex interplay between program and target machine. A critical success factor for high-performance computing is the development of compilation techniques that perform optimizations automatically, steered by compile-time cost estimation, thus providing portability as well as performance.

Although a range of fine compile-time cost prediction methods currently exists (e.g., [3, 6, 8, 19]), either the underlying analysis (and the associated parameter space) is targeted at a particular type of parallel architecture, or the technique is not intended to produce reliable estimates throughout the entire parameter search space that may be covered by an optimization technique. The generally limited ability of compile-time cost estimation to be both highly efficient and reliable at the same time is due to the fact that the emphasis in parallel programming models is on expressiveness, i.e., the ability to express the inherent parallelism within the algorithm to the fullest, rather than analyzability, which in the automatic optimization context means the ability to derive highly efficient yet reliable cost estimates.

Recently a new parallel programming model, called SPC, has been proposed [12] that imposes specific restrictions on the synchronization structures that can be programmed. Imposing these restrictions enables the efficient computation of reliable cost estimates, thus paving the way for automatic program mapping. The limitation of expressiveness in favor of analyzability is related to the conjecture [12] that the loss of parallelism when programming according to the SPC model is typically limited to within a constant factor of 2 compared to the unrestricted case. Recent results [7] indeed provide compelling evidence in favor of this conjecture. We feel this limited loss of parallelism is outweighed by the unlocked potential of automatic performance optimization as well as the portability that is achieved.

In the SPC model all inherent algorithm parallelism is explicitly expressed at program level even though the actual machine parallelism may be orders of magnitude less. For example, in SPC, a software pipeline comprising M stages processing a stream of N data items requires specifying N processes instead of the M processes typically found in, e.g., message-oriented solutions. While M is typically small, N may even be infinite (as in stream processing), which implies that such a portable solution heavily relies on the compiler’s ability to efficiently reduce the N-process program to a practical, high-performance M-process implementation.

While in previous work we have shown how SPC enables automatic performance optimizations, in this paper we present the compile-time process reduction scheme. We show how SPC serves as an intermediate, process-algebraic formalism in terms of which algorithms can be cleanly expressed and transformed into efficient machine level implementations. The paper is organized as follows. In Section 2 we briefly outline the SPC programming model. In Section 3 we present the process calculus that is used to describe the transformations required to reduce SPC algorithms to efficient machine implementations. In Section 4 we discuss related work, followed by a summary in Section 5.

2 The SPC programming model

In this section we briefly introduce the SPC programming model as far as it is needed in this paper. A more detailed description can be found in [12].

SPC is a process-algebraic coordination language that comprises a small number of operators to control parallelism and synchronization. The data computations are expressed in a separate, imperative sequential host language (currently C), aiming at the separation of concerns typical of many coordination languages [10]. The “SP” prefix in SPC refers to the fact that the condition synchronization [2] (CS) patterns are restricted to series-parallel (SP) form. The “C” term refers to SPC’s specific programming paradigm, called contention programming, which is based on the specific use of mutual exclusion [2] (ME) to describe scheduling constraints. In the contention programming model all inherent algorithm parallelism is explicitly expressed. Parallelism is constrained by introducing contention for software resources (e.g., critical sections) or hardware resources (e.g., processing units), in a manner similar to the material-oriented paradigm used in the simulation domain [15].

2.1 Language

In SPC, programming, i.e., applying compositions, is done by specifying process equations using a straightforward substitution mechanism. By convention, the expression tree is rooted by a special process called main that represents the overall program. In order to ensure the correct binding, braces { and } can be used to delimit compound process expressions. Apart from the data types needed to express numeric computations, the main data type in SPC is the process.

Basic data computations are expressed in terms of processes. For example, the equation

task(i) = --C-- { y[i] = f(x[i]); }

describes a process task(i), where the --C-- { ... } is an inlining facility to provide an interface with the sequential host language (C in the case of our current simulation environment).

Sequential composition is described by the infix operator ; as in task_1 ; task_2. Loops are described by the reduction operator seq, as in seq (i = 1, N) task(i), which is equivalent to task(1) ; ... ; task(N). Unlike loops within inlined tasks, seq loops are involved in SPC’s compile-time optimization.

Parallel composition is described by the infix operator || as in task_1 || task_2. Parallel loops are described by the reduction operator par, as in par (i = 1, N) task(i), which is equivalent to task(1) || ... || task(N). The parallel operator has fork/join semantics, which implies a mutual barrier synchronization at the end of the parallel section. Note that this provides a structured form of condition synchronization.
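As an illustration (not part of SPC itself), a minimal C sketch of the fork/join behavior that par (i = 1, N) task(i) implies, assuming POSIX threads and a placeholder task function: all N activities are forked, and the construct only completes once every one of them has joined, i.e., at the barrier that ends the parallel section.

#include <pthread.h>
#include <stdio.h>

#define N 4

/* placeholder stand-in for the SPC process task(i) */
static void *task(void *arg) {
    int i = *(int *)arg;
    printf("task(%d)\n", i);
    return NULL;
}

int main(void) {
    pthread_t tid[N];
    int idx[N];

    /* "fork": start all N parallel activities */
    for (int i = 0; i < N; i++) {
        idx[i] = i + 1;
        pthread_create(&tid[i], NULL, task, &idx[i]);
    }
    /* "join": the implicit barrier at the end of the parallel section */
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
    return 0;
}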

Conditional composition is specified by the if operator, which has an optional else part, as in if (c) task_1 else task_2. SPC also includes a while construct, which, however, is not considered in this paper.

In order to express global synchronizations at machine level, explicit condition variables are supported that come with wait/signal operators. The operation wait(c) blocks any process on the condition variable c until some process executes signal(c). As this construct would allow the expression of non-SP (NSP) synchronization structures, its use is restricted to the intermediate level during the compilation phase.
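One possible machine-level realization of these condition variables (a sketch only, assuming POSIX threads; the actual SPC runtime may differ) is a sticky event built from a mutex and a condition variable, so that a wait(c) issued after the corresponding signal(c) does not block:

#include <pthread.h>

/* a condition variable in the SPC sense: once signalled, it stays signalled */
typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    int signalled;
} spc_cond;

/* initialize with: spc_cond c = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }; */

void spc_wait(spc_cond *c) {
    pthread_mutex_lock(&c->m);
    while (!c->signalled)            /* block until some process signals */
        pthread_cond_wait(&c->cv, &c->m);
    pthread_mutex_unlock(&c->m);
}

void spc_signal(spc_cond *c) {
    pthread_mutex_lock(&c->m);
    c->signalled = 1;                /* record the event */
    pthread_cond_broadcast(&c->cv);  /* release all waiting processes */
    pthread_mutex_unlock(&c->m);
}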

While the process abstract data type provides a means to specify both parallelism and condition synchronization, the resource abstract data type provides the means to impose mutual exclusion. A process that is to be executed under mutual exclusion is mapped to a resource as in

add(i) -> critical

where the -> operator denotes the mapping of the processes add(i) to a logical processor called critical. Consequently, all add(i) are executed under mutual exclusion as a result of the fact that they are mapped onto the same resource. Note that, again, synchronization is structured in terms of a single construct instead of, e.g., using two semaphore operations to achieve mutual exclusion.



An example application of the above resource mapping is the following program

main    = par (i = 1, N) { task(i) ; add(i) }
task(i) = --C-- { y[i] = f(x[i]); }
add(i)  = --C-- { z = z + y[i]; }
add(i)  -> critical

that computes f(x[1]) + ... + f(x[N]) using a simple, linear-time reduction scheme. Note that in the program the add(i) tasks are scheduled dynamically, avoiding unnecessary delays in view of the generally non-deterministic execution times of the task(i) processes. The default scheduling policy associated with resources is FCFS with non-deterministic, fair conflict arbitration, but other scheduling disciplines can be specified as well. As in the above example, at programming level a resource is typically logical (a critical software section) in order to retain machine independence. Physical resources, like hardware processors, are introduced at the intermediate (SPC) level during compilation.
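To make the contention reading of this program concrete, a minimal C sketch of the behavior it prescribes (POSIX threads assumed; f, x, y, z and N are placeholders): the task(i)/add(i) pairs run in parallel, while a single mutex plays the role of the logical resource critical and serializes the add(i) updates of z.

#include <pthread.h>

#define N 8

static double x[N], y[N], z;                 /* placeholder data */
static pthread_mutex_t critical = PTHREAD_MUTEX_INITIALIZER;

static double f(double v) { return v * v; }  /* placeholder computation */

/* one parallel instance: task(i) followed by add(i) */
static void *instance(void *arg) {
    int i = *(int *)arg;
    y[i] = f(x[i]);                 /* task(i), fully parallel          */
    pthread_mutex_lock(&critical);  /* add(i) contends for the resource */
    z = z + y[i];
    pthread_mutex_unlock(&critical);
    return NULL;
}

int main(void) {
    pthread_t t[N];
    int idx[N];
    for (int i = 0; i < N; i++) {
        idx[i] = i;
        pthread_create(&t[i], NULL, instance, &idx[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);   /* the implicit join of par */
    return 0;
}

Note that a plain mutex provides the mutual exclusion but not the FCFS arbitration order mentioned above; the sketch is only meant to illustrate the semantics of the resource mapping.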

2.2 Example

In the previous example the use of resource mapping was required to avoid numerical non-determinism. In the following example we show the use of resource mappings to express pipelining, a form of data parallelism where the actual schedule is due to the limited functional parallelism at machine level. Although this mapping is typically derived by the optimizing compilation phase, we show this example to illustrate that pipelining can also be explicitly programmed by the user, effectively introducing mapping constraints with regard to the subsequent compilation phase. Examples including divide-and-conquer schemes as well as other dynamically scheduled synchronization patterns, such as logarithmic reduction schemes, are presented in [12, 13].

Consider some data parallel computation involving a sequence of M operations task(i, j), j = 1, ..., M, to be applied for a set of (data) indices i = 1, ..., N, where N is typically large. The corresponding SPC program is given by

main = par (i = 1, N)
         seq (j = 1, M)
           task(i, j)

where the terms are spread across multiple lines to increase readability.

A straightforward mapping to a parallel machine would be based on partitioning the i axis, involving up to N processors. However, we consider a pipelined schedule restricting the parallelism to only M processing units. In many programming models the solution is typically programmed according to a machine-oriented paradigm [15], where each processing stage is explicitly programmed using a message-oriented producer-consumer scheme. The SPC solution only requires the additional mapping constraint

task(i, j) -> unit(j)

Thus we simply retain the inherent data parallelism in terms of the construct par (i = 1, N), just as before. However, we introduce the dynamic constraint that any unit(j) can only process one operation task(i, j) at a time. Again, unit(j) does not necessarily correspond to a physical processor.

The pipeline behavior automatically follows from the mutual exclusion induced by this resource assignment. Note that this is a natural way of programming that is also very portable. Each individual data process simply “contends” for the available computing resources, hence the name “contention programming”, a concept that lies at the heart of the SPC coordination model.
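As a sketch of how this contention reading yields a pipeline (again assuming POSIX threads; M, N, and the work inside stage() are placeholders, and plain mutexes stand in for the FCFS resources unit(j)): each data process i walks through the M stages in order and holds unit(j) only for the duration of task(i, j), so at most one element occupies a stage at any time and the elements stream through the stages.

#include <pthread.h>
#include <stdio.h>

#define N 6   /* data elements (placeholder)   */
#define M 3   /* pipeline stages (placeholder) */

/* plain mutexes stand in for the logical resources unit(1)..unit(M) */
static pthread_mutex_t unit[M];

static void stage(int i, int j) {            /* placeholder for task(i, j) */
    printf("task(%d,%d)\n", i, j);
}

/* one data process: seq (j = 1, M) task(i, j), each stage under contention */
static void *data_process(void *arg) {
    int i = *(int *)arg;
    for (int j = 0; j < M; j++) {
        pthread_mutex_lock(&unit[j]);        /* contend for unit(j) */
        stage(i, j);
        pthread_mutex_unlock(&unit[j]);
    }
    return NULL;
}

int main(void) {
    pthread_t t[N];
    int idx[N];
    for (int j = 0; j < M; j++)
        pthread_mutex_init(&unit[j], NULL);
    for (int i = 0; i < N; i++) {            /* par (i = 1, N) */
        idx[i] = i;
        pthread_create(&t[i], NULL, data_process, &idx[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return 0;
}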

The SPC solution still involves the use of N processes instead of M, where N can be extremely large, if not infinite. In the next section it is shown that the compile-time transformation to an efficient M-process implementation is straightforward and is expressed in terms of SPC itself, which illustrates SPC’s utility as an intermediate transformation formalism.

3 Compiling SPC

In the compilation of SPC programs two phases can be distinguished, called the optimization phase and the reduction phase. In the optimization phase, SPC transformations are considered that minimize the execution time cost. As a result of the specific choice of synchronization restrictions in SPC, the time cost of an SPC expression can be mechanically mapped into a closed-form, symbolic time cost expression that has linear solution complexity in the size of the SPC expression [11]. In general, the estimation process produces a lower bound. Yet, unlike cost estimation approaches based on analyzing CS only, all dominant terms are generated. Typical optimizations during this phase are code and data mapping. In [12] an example is given of how optimizations for distributed-memory machines are derived for the well-known line relaxation algorithm kernel ADI, yielding the same data layout and processor pipelining scheme as presented in, e.g., [1].
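To give a flavor of the kind of symbolic cost expressions involved (a simplified illustration only, assuming a single exclusive resource r; these are not the exact analysis rules of [11]): the cost of sequential composition is additive, parallel composition is bounded from below by its slowest branch, and mapping all branches onto one resource adds a serialization term:

\[
T(A \,;\, B) = T(A) + T(B), \qquad
T(A \parallel B) \;\ge\; \max\bigl(T(A),\, T(B)\bigr),
\]
\[
T\bigl(\mathrm{par}\,(i = 1, N)\ A_i \rightarrow r\bigr) \;\ge\;
\max\Bigl(\max_{1 \le i \le N} T(A_i),\ \sum_{i=1}^{N} T(A_i)\Bigr).
\]

The first argument of the outer max reflects the SP condition synchronization structure (the critical path), the second the mutual exclusion on r; taking the maximum over all such dominant terms is what makes the estimate a lower bound.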

In the optimization phase the number of processes in the SPC expression is unchanged (i.e., the inherent program parallelism); only different resource mappings are considered, aimed at decreasing the execution cost. When the optimal task and data mapping has been defined, however, the reduction phase commences, in which the typically large program level parallelism is reduced to the amount of parallelism available at machine level.

3.1 Process Reduction

In the reduction phase of the compilation process, SPMD code is generated from SPC program expressions. We show how a possibly unbounded number of logical SPC processes are mapped onto either machine threads (i.e., scheduled at runtime) or ordinary sequential SPMD code. The SPMD code generation process is partly similar to the code generation of data parallel programs, in which statement execution is based on the use of the well-known “owner-computes” rule. While in data parallel programs ownership is typically determined by the mapping of the LHS statement data, in our scheme the mapping is based on the task mapping that is computed in the optimization phase. As explained earlier, we will assume that this phase has already taken place, the result being given by SPC resource mapping statements such as

task(i, j) -> processor(owner(i, j))

where processor denotes the physical processing resource and owner denotes the computed mapping. We assume a target machine architecture based on P (multithreaded) processors.

The reduction process essentially comprises the fol- lowing transformations:

• introduce an outer par (p = 0, P-1) loop that models the SPMD machine parallelism.

• guard all basic blocks (tasks) with an “owner-computes” test according to the individual task mapping. Do not guard seq and if control constructs.

• ensure all control flow is global, i.e., replace all seq loops that cross processor bounds by global synchronization schemes.

• apply various reductions similar to the loop (index) optimizations found in data parallel compilation schemes.

Rather than presenting the transformations in formal terms, in the following we will illustrate the concepts by translating the previous pipelining example (see Section 2.2). Instead of discussing the trivial case of a block partitioning on the i axis, we will study the case of a block partitioning on the j axis (i.e., true pipelining). Despite the pipeline’s simplicity, this partitioning introduces problems that are quite typical of more complex programs. More details can be found in [13].

Under a j axis block partitioning the initial program is given by

main = par (i = 1, N)
         seq (j = 1, M)
           task(i, j)

task(i, j) -> processor(j/B)

where P ≤ M such that M/P = B, B > 1, and P | M for simplicity. Note that this generalizes over simply mapping the M computation stages to exactly P = M processors. Applying the first three steps of the above SPMD transformation yields

main = par (p = 0, P-1)
         par (i = 0, N-1)
           par (j = 0, M-1)
             if (j/B == p) {
               wait(c[i, j-1]) ;
               task(i, j) ;
               signal(c[i, j])
             }

task(i, j) -> processor(j/B)

where p denotes the SPMD processor index, if (j/B == p) denotes the guard of task(i, j) (with / denoting integer division), and c[i, j] denotes a global condition variable associated with each task instance (wait(c[i, -1]) is assumed not to block). Note that the 2-dimensional condition array is necessary as some upstream processes may produce data more quickly than the downstream processes can consume it. Also note that, because of the explicit synchronization scheme, the original seq (j) loop can safely be replaced by a par (j) loop.

Although the above SPC form is the basis for a correct SPMD scheme, a number of important reductions can be applied in order to obtain efficient execution. Given the regular partitioning in our example, the above expression immediately reduces to

main = par (p = 0, P-1)
         par (i = 0, N-1)
           par (k = 0, B-1) {
             wait(c[i, p*B+k-1]) ;
             task(i, p*B+k) ;
             signal(c[i, p*B+k])
           }

task(i, j) -> processor(j/B)

Essentially, the inner loop still represents a sequence of M operations chained by wait/signal precedence relations, of which blocks of B operations are local to the k loop. Now that the ownership test has been reduced in terms of loop bounds, the sequential inner loop can be partially transformed to local seq loops according to

main = par (p = 0, P-1)
         par (i = 0, N-1) {
           wait(c[i, p*B-1]) ;
           seq (k = 0, B-1)
             task(i, p*B+k) ;
           signal(c[i, p*B+B-1])
         }

task(i, j) -> processor(j/B)

At this point we can choose between generating multithreaded or sequential SPMD code. In a multithreaded version, the par (i) loop, combined with the task(i, j) mapping statement, generates a sequential for loop that forks N threads on each processor p (because of the guard, by definition p = j/B). However, because in stream processing N can be extremely large, we choose the sequential version, in which all N threads are statically scheduled in terms of a seq loop according to

main = par (p = 0, P-1)
         seq (i = 0, N-1) {
           wait(c[i, p*B-1]) ;
           seq (k = 0, B-1)
             task(i, p*B+k) ;
           signal(c[i, (p+1)*B-1])
         }

This form follows from the fact that every parallel i instance executes on the same processor p as defined by the task mapping (formally, the par and the mapping statement reduce to the seq schedule). Note that the 2-dimensional condition array is still necessary, although many of the original elements are no longer used due to the sequential block aggregation. Thus the expression can be further reduced to

main = par (p = 0, P-1)
         seq (i = 0, N-1) {
           wait(c[i, p-1]) ;
           seq (k = 0, B-1)
             task(i, p*B+k) ;
           signal(c[i, p])
         }

which is the final result in terms of the SPC intermediate form. For an asynchronous message-passing architecture, a number of straightforward code transformations [13] yield the following pseudo code

for (i = 0; i <= N-1; i++) {
  recv(p-1, ...) ;
  for (k = 0; k <= B-1; k++)
    task(i, p*B+k) ;
  send(p, ...)
}

where the data computed by processor p is stored at processor p+1. Similar to wait, recv(p-1, ...) is assumed not to block.

The above example shows how the SPC process algebra is applied to rewrite program level SPC expressions in terms of the underlying machine. The above procedure applies to any shared-memory or distributed-memory machine with a possibly multithreaded process layer. As can be seen from the reduction procedure, the source expressions need not be in SP form, i.e., there may also be explicit wait/signal constructs present at source level (which are subject to guarding). Hence, our reduction calculus is also applicable beyond the SPC programming model.
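For concreteness, a hedged C/MPI sketch of what such generated code might look like for one processor rank p (stage_compute, the single double per stream element, and the constants are placeholders introduced here for illustration; this is not the actual code generated by [13]): each rank receives a stream element from rank p-1, applies its block of B operations, and forwards the result to rank p+1.

#include <mpi.h>

#define N 1000   /* stream length (placeholder)            */
#define B 4      /* operations per processor (placeholder) */

/* placeholder for task(i, p*B+k) applied to one stream element */
static double stage_compute(double v, int i, int j) {
    return v + (double)j;   /* dummy work */
}

int main(int argc, char **argv) {
    int p, P;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &p);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    for (int i = 0; i <= N - 1; i++) {
        double v = 0.0;
        if (p > 0)                               /* recv(p-1, ...) */
            MPI_Recv(&v, 1, MPI_DOUBLE, p - 1, i,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int k = 0; k <= B - 1; k++)         /* the local block of B stages */
            v = stage_compute(v, i, p * B + k);
        if (p < P - 1)                           /* send(p, ...) */
            MPI_Send(&v, 1, MPI_DOUBLE, p + 1, i, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Unlike the pseudo code above, which assumes a non-blocking recv, this sketch uses blocking MPI calls for simplicity, so a rank simply stalls until its upstream neighbor has produced the next element.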

4 Related Work

In the following we briefly discuss some of the interesting parallel programming models that have been proposed, in terms of how parallelism is induced and how it is mapped to machine level.

In library approaches such as MPI [17] parallelism is provided through a message-oriented API. In the absence of a compile-time process mapping layer, each process is simply mapped one-to-one to an operating system process which, in turn, is statically mapped onto a processor.

The one-to-one relationship between a program level task and a machine level process (typically, a thread) is also found in language approaches such as the parallel object-oriented language CC++ [4], the task parallel extensions to data parallelism found in Fortran-M [9] and Opus [5], and the task-annotated language Jade [16]. Also in data parallel languages such as HPF [14], the amount of parallelism, induced from the data partitioning directives, is typically the same as the number of SPMD processes used at machine level.

Essentially, all the above programming models support a machine-oriented approach towards expressing parallelism. Unlike the above models, SPC's contention programming approach is based on the explicit expression of all task and data parallelism, the reduction to actual machine level parallelism being based on compile-time transformation techniques.



5 Conclusion

In this paper we have outlined the compile-time process transformation layer that supports the large program level process parallelism on which the SPC programming model is based. The reduction of parallelism needed to support unbounded process parallelism at program level is straightforward, which is illustrated by the example compilation of a highly data parallel computation to a coarse grain pipelined implementation on a message-passing architecture. Future research will include a formalization of the SPC process calculus. Parts of the SPC philosophy will be applied within the AUTOMAP research project [18], which aims at a parallel programming language and compilation system in which code and data mapping are entirely automated.

References

[1] V.S. Adve, A. Carle, E. Granston, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, J. Mellor-Crummey, S. Warren and C-W. Tseng, “Requirements for data-parallel programming environments,” IEEE Parallel and Distributed Technology, Fall 1994, pp. 48-58.

[2] G.R. Andrews and F.B. Schneider, “Concepts and notations for concurrent programming,” ACM Computing Surveys, vol. 15, no. 1, Mar. 1983, pp. 3-43.

[3] V. Balasundaram, G. Fox, K. Kennedy and U. Kremer, “A static performance estimator to guide data partitioning decisions,” in Proc. 3rd ACM SIGPLAN Symposium on PPoPP, Apr. 1991.

[4] K.M. Chandy and C. Kesselman, “CC++: A declarative concurrent object-oriented programming notation,” in Research Directions in Concurrent Object-Oriented Programming (G. Agha, P. Wegner and A. Yonezawa, eds.), MIT Press, 1993, pp. 281-313.

[5] B. Chapman, P. Mehrotra, J. Van Rosendale and H. Zima, “A software architecture for multidisciplinary applications: Integrating task and data parallelism,” in Proc. Fifth Workshop on Compilers for Parallel Computers, Malaga, June 1995, pp. 454-466.

[6] M.J. Clement and M.J. Quinn, “Multivariate statistical techniques for parallel performance prediction,” in Proc. 28th Hawaii Int. Conf. on System Sciences, Vol. II, IEEE, Jan. 1995, pp. 446-455.

[7] A. González Escribano, V. Cardeñoso Payo and A.J.C. van Gemund, “On the loss of parallelism by imposing synchronization structure,” in Proc. 1st EURO-PDS Int'l Conf. on Parallel and Distributed Systems, Barcelona, June 1997.

[8] T. Fahringer and H.P. Zima, “A static parameter-based performance prediction tool for parallel programs,” in Proc. 7th ACM Int'l Conf. on Supercomputing, Tokyo, July 1993, pp. 207-219.

[9] I. Foster, “Task parallelism and high-performance languages,” IEEE Parallel and Distributed Technology, Fall 1994, pp. 27-36.

[10] D. Gelernter and N. Carriero, “Coordination languages and their significance,” Communications of the ACM, vol. 35, Feb. 1992, pp. 97-107.

[11] A.J.C. van Gemund, “Compile-time performance prediction of parallel systems,” in Computer Performance Evaluation: Modelling Techniques and Tools, LNCS 977, Sept. 1995, pp. 299-313.

[12] A.J.C. van Gemund, “The importance of synchronization structure in parallel program optimization,” in Proc. 11th ACM Int'l Conf. on Supercomputing, Vienna, July 1997, pp. 164-171.

[13] A.J.C. van Gemund, “Notes on SPC: A parallel programming model,” Tech. Rep. 1-68340-44(1997)03, Delft University of Technology, 1997.

[14] C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr. and M. Zosel, The High Performance Fortran Handbook, MIT Press, 1994.

[15] W. Kreutzer, System Simulation: Programming Styles and Languages, Addison-Wesley, 1986.

[16] M.C. Rinard, D.J. Scales and M.S. Lam, “Jade: A high-level, machine-independent language for parallel programming,” Computer, June 1993, pp. 28-38.

[17] M. Snir, S. Otto, S. Huss-Lederman, D. Walker and J. Dongarra, MPI: The Complete Reference, Cambridge, MA: MIT Press, 1996.

[18] C. van Reeuwijk, H.J. Sips, H.X. Lin and A.J.C. van Gemund, “Automap: A parallel coordination-based programming system,” Tech. Rep. 1-68340-44(1997)04, Delft University of Technology, Apr. 1997.

[19] K-Y. Wang, “Precise compile-time performance prediction for superscalar-based computers,” in Proc. PLDI'94, Orlando, 1994, pp. 73-84.
