
SANDIA REPORT SAND92–2765 UC–405 Unlimited Release Printed March 1993


An Efficient Parallel Algorithm for Matrix–Vector Multiplication

Bruce Hendrickson, Robert Leland, Steve Plimpton

Prepared by Sandia National Laboratories, Albuquerque, New Mexico 87185 and Livermore, California 94550, for the United States Department of Energy under Contract DE-AC04-76DP00789


Issued by Sandia National Laboratories, operated for the United States Department of Energy by Sandia Corporation.

NOTICE: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, nor any of their contractors, subcontractors, or their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government, any agency thereof or any of their contractors or subcontractors. The views and opinions expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof or any of their contractors.

Printed in the United States of America. This report has been reproduced directly from the best available copy.

Available to DOE and DOE contractors from
Office of Scientific and Technical Information
PO Box 62
Oak Ridge, TN 37831

Prices available from (615) 576-8401, FTS 626-8401

Available to the public from
National Technical Information Service
US Department of Commerce
5285 Port Royal Rd
Springfield, VA 22161

NTIS price codes
Printed copy: A09
Microfiche copy: A01


SAND92-2765 Unlimited Release

Printed March 1993

Distribution Category UC-405

An Efficient Parallel Algorithm for Matrix–Vector Multiplication

Bruce Hendrickson, Robert Leland and Steve Plimpton Sandia National Laboratories

Albuquerque, NM 87185

Abstract. The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if we are to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix–vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/√p + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well–known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.

Key words. matrix-vector multiplication, parallel computing, hypercube, conjugate gradient method

AMS(MOS) subject classification. 65Y05, 65F10

Abbreviated title. Parallel Matrix–Vector Multiplication

This work was supported by the Applied Mathematical Sciences program, U.S. Department of

Energy, Office of Energy Research, and was performed at Sandia National Laboratories, operated for

the U.S. Department of Energy under contract No. DE-AC04-76DP00789.


1. Introduction. The multiplication of a vector by a matrix is the kernel computation in many

linear algebra algorithms, including, for example, the popular Krylov methods for solving linear and

eigen systems. Recent improvements in such methods, coupled with the increasing use of massively

parallel computers, require the development of efficient parallel algorithms for matrix–vector multiplica-

tion. This paper describes such an algorithm. Although the method works on all parallel architectures,

it is particularly well suited to machines with hypercube interconnection topology, for example the Intel

iPSC/860 and the nCUBE 2.

The algorithm described here was developed independently in connection with research on efficient

methods of organizing parallel many-body calculations (see [5]). We subsequently learned that our

algorithm is very similar in structure to a parallel matrix-vector multiplication algorithm described

in [4]. We have, nevertheless, chosen to present our algorithm because it improves upon that in [4] in

several ways: First, we specify how to overlap communication and computation and thereby reduce

the overall run time. Second, we show how to map the blocks of the matrix to processors in a novel

way which improves the performance of a critical communication operation on current hypercube

architectures. And third, we consider the actual use of the algorithm within the iterative conjugate

gradient solution method and show how in this context a small amount of redundant computation can

be used to further reduce the communication requirements. By integrating these improvements we

have been able to achieve significantly better performance on a well known benchmark than has been

previously possible with a massively parallel machine.

A very attractive property of the new algorithm is that its communication operations are indepen-

dent of the sparsity pattern of the matrix, making it applicable to all matrices. For an n x n matrix on p

processors, the cost of the communication is O(n/√p + log(p)). However, many sparse matrices exhibit

structure which allows for other algorithms with even lower communication requirements. Typically

this structure arises from the physical problem being modeled by the matrix equation and manifests

itself as the ability to reorder the rows and columns to obtain a nearly block–diagonal matrix, where

the p diagonal blocks are about equally sized, and the number of matrix elements not in the blocks is

small. This structure can also be expressed in terms of the size of the separator of the graph describing

the nonzero structure of the matrix. Our algorithm is clearly not optimal for such matrices, but there

are many contexts where the matrix structure is not helpful (e.g. dense matrices, random matrices),

or the effort required to identify the structure is too large to justify. It is these settings in which our

algorithm is most appropriate and provides high performance.

This paper is structured as follows. In the next section we describe the algorithm and its com-

munication primitives. In §3 we present refinements and improvements to the basic algorithm, and

develop a performance model. In §4 we apply the algorithm to the NAS conjugate gradient benchmark

problem to demonstrate its utility. Conclusions are drawn in §5.

2. A parallel matrix–vector multiplication algorithm. Iterative solution methods for linear

and eigen systems are one of the mainstays of scientific computation. These methods involve repeated

matrix–vector products or matvecs of the form yi = Axi, where the new iterate, xi+1, is generally


some simple function of the product vector yi. To sustain the iteration on a parallel computer, it is

necessary that xi+1 be distributed among processors in the same fashion as the previous iterate xi.

Hence, a good matvec routine will return a yi with the same distribution as xi so that xi+1 can be

constructed with a minimum of data movement. Our algorithm respects this distribution requirement.

We will simplify notation and consider the parallel matrix-vector product y = Ax where A is an

n x n matrix and x and y are n-vectors. The number of processors in the parallel machine is denoted

by p, and we assume for ease of exposition that n is evenly divisible by p and that p is an even power

of 2. It is fairly straightforward to relax these restrictions.

Let A be decomposed into square blocks of size (n/√p) x (n/√p), each of which is assigned to one of the p processors, as illustrated by Fig. 1. We introduce the Greek subscripts α and β, running from 0 to √p − 1, to index the row and column ordering of the blocks. The (α, β) block of A is denoted by A_αβ and owned by processor P_αβ. The input vector x and product vector y are also conceptually divided into √p pieces indexed by β and α respectively. Given this block decomposition, processor P_αβ must know x_β in order to compute its contribution to y_α. This contribution is a vector of length n/√p which we denote by z_αβ. Thus z_αβ = A_αβ x_β, and y_α = Σ_β z_αβ, where the sum is over all the processors sharing row block α of the matrix.

Fig. 1. Structure of matrix product y = Ax.

2.1. Communication primitives. Our algorithm requires three distinct patterns of communication. The first of these is an efficient method for summing elements of vectors owned by different

processors, and is called a fold operation in [4]. We will use this operation to combine contributions

to y owned by the processors that hold a block row of A. The fold operation is sketched in Fig. 2 for

communication among processors with the same block row index α. Each processor begins the fold operation with a vector z_αβ of length n/√p. The operation requires log2(√p) stages, halving the length of the vectors involved at each stage. Within each stage, a processor first divides its vector z into two equal sized subvectors, z1 and z2, as denoted by (z1 | z2). One of these subvectors is sent to another


processor, while the other processor sends back its contribution to the subvector which remained. The received subvector is summed element–by–element with the retained subvector to finish the stage. At the conclusion of the fold, each processor has a unique, length n/p portion of the fully summed vector. We denote this subvector with Greek superscripts, hence P_αβ owns portion y^αβ. The fold operation requires no redundant floating point operations, and the total number of values sent and received by each processor is n/√p − n/p.

Processor P_αβ knows z_αβ ∈ R^{n/√p}
z := z_αβ
For i = 0, ..., log2(√p) − 1
    (z1 | z2) = z
    P_αβ′ := P_αβ with i-th bit of β flipped
    If bit i of β is 1 Then
        Send z1 to processor P_αβ′
        Receive w2 from processor P_αβ′
        z := z2 + w2
    Else
        Send z2 to processor P_αβ′
        Receive w1 from processor P_αβ′
        z := z1 + w1
y^αβ := z
Processor P_αβ now owns y^αβ ∈ R^{n/p}

Fig. 2. The fold operation for processor P_αβ as part of block row α.
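To make the fold concrete, the following sketch expresses the loop of Fig. 2 in C. It is illustrative only: MPI calls stand in for whatever native message passing a given machine provides, and the names fold, row_comm and beta are ours. We assume beta equals the processor's rank within its block-row communicator and that the vector length is divisible by the number of processors in the row.

/* Illustrative C sketch of the fold of Fig. 2 (MPI used only for exposition).
 * z holds z_ab (length len); beta is this processor's rank in row_comm. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void fold(double *z, int len, int beta, MPI_Comm row_comm)
{
    int q;                                    /* q = sqrt(p) processors in the row */
    MPI_Comm_size(row_comm, &q);
    double *w = malloc((size_t)(len / 2) * sizeof(double));
    for (int i = 0; (1 << i) < q; i++) {      /* log2(sqrt(p)) stages */
        int half = len / 2;
        int partner = beta ^ (1 << i);        /* flip bit i of beta */
        /* bit i of beta set: send the first half z1 and keep the second half z2 */
        double *send = (beta & (1 << i)) ? z : z + half;
        double *keep = (beta & (1 << i)) ? z + half : z;
        MPI_Sendrecv(send, half, MPI_DOUBLE, partner, 0,
                     w,    half, MPI_DOUBLE, partner, 0,
                     row_comm, MPI_STATUS_IGNORE);
        for (int j = 0; j < half; j++)        /* sum received into retained half */
            keep[j] += w[j];
        if (keep != z)                        /* slide retained half to the front */
            memmove(z, keep, (size_t)half * sizeof(double));
        len = half;
    }
    free(w);                                  /* z[0..len) now holds y^ab */
}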

In the second communication operation each processor knows some information that must be

shared among all the processors in a column. We use a simple algorithm called expand [4], that essentially uses the inverse communication pattern of the fold operation. The expand operation is outlined in Fig. 3 for communication between processors with the same column index β. Each processor in the column begins with a subvector of length n/p, and when the operation finishes all processors in the column know all n/√p values in the union of their subvectors. At each step in the operation a processor sends all the values it knows to another processor and receives that processor's values. These two subvectors are concatenated, as indicated by the "|" notation. As with the fold operation, only a logarithmic number of stages are required, and the total number of values sent and received by each processor is n/√p − n/p.

The optimal implementation of the fold and expand operations depends on the machine topology

and various hardware considerations, e.g. the availability of multiport communication. There are,

however, efficient implementations on most architectures. On hypercubes, for example, these operations

can be implemented using only nearest neighbor communication if the blocks in each row and column of the matrix are owned by a subcube with √p processors. On meshes, if the blocks of the matrix are


Processor P_αβ knows y^βα ∈ R^{n/p}
z := y^βα
For i = log2(√p) − 1, ..., 0
    P_α′β := P_αβ with i-th bit of α flipped
    Send z to processor P_α′β
    Receive w from processor P_α′β
    If bit i of α is 1 Then
        z := w | z
    Else
        z := z | w
y_β := z
Processor P_αβ now knows y_β ∈ R^{n/√p}

Fig. 3. The expand operation for processor P_αβ as part of block column β.
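A matching sketch of the expand of Fig. 3, under the same illustrative assumptions: alpha is taken to be the processor's rank within the block-column communicator col_comm, and the buffer z must have room for all n/√p values on exit. These names are ours, not part of the algorithm's specification.

/* Illustrative C sketch of the expand of Fig. 3 (the inverse pattern of the fold).
 * On entry z holds the local length-seglen segment; on exit it holds the
 * seglen*q values known collectively by the block column. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void expand(double *z, int seglen, int alpha, MPI_Comm col_comm)
{
    int q, stages = 0;
    MPI_Comm_size(col_comm, &q);              /* q = sqrt(p) processors in the column */
    while ((1 << stages) < q) stages++;
    double *w = malloc((size_t)seglen * (size_t)(q / 2) * sizeof(double));
    int len = seglen;
    for (int i = stages - 1; i >= 0; i--) {   /* bits flipped in the reverse order */
        int partner = alpha ^ (1 << i);       /* flip bit i of alpha */
        MPI_Sendrecv(z, len, MPI_DOUBLE, partner, 0,
                     w, len, MPI_DOUBLE, partner, 0,
                     col_comm, MPI_STATUS_IGNORE);
        if (alpha & (1 << i)) {               /* z := w | z : received block first */
            memmove(z + len, z, (size_t)len * sizeof(double));
            memcpy(z, w, (size_t)len * sizeof(double));
        } else {                              /* z := z | w : our block first */
            memcpy(z + len, w, (size_t)len * sizeof(double));
        }
        len *= 2;
    }
    free(w);
}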

mapped in the natural way to a square grid of processors, then the fold and expand operations can be

implemented efficiently [9].

The third communication operation in our matvec algorithm requires each processor to send a

message to the processor owning the transpose portion of the matrix, i.e. P_αβ sends to P_βα. Since we

want row and column communication to be efficient for the fold and expand operations, this transpose

communication can be difficult to implement efficiently. This is because a large number of messages

must travel to architecturally distant processors, so the potential for message congestion is great. We

have devised an optimal, congestion-free algorithm for this operation on hypercubes which is discussed

in §3.1. Implementations of our matvec algorithm on other architectures may benefit from a similarly

tailored transpose algorithm. However, even if congestion is unavoidable, the length of the message

in the transpose communication step of our matvec algorithm is smaller by about a factor of √p than the volume of data exchanged in the fold and expand steps. Consequently, the transpose messages can be delayed by O(√p) without changing the overall scaling of the algorithm.

2.2. The matrix-vector multiplication algorithm. We can now present our algorithm for

computing y = Ax in Fig. 4. Further details and enhancements are presented in the following section.

All the numerical operations in the algorithm are performed in steps (1) and (2). First, in step (1),

each processor performs the local matrix-vector multiplication involving the portion of the matrix

it owns. These values are summed within processor rows in step (2) using the fold operation from

Fig. 2, after which each processor owns n/p of the values of y. Unfortunately, the values owned by

processor P_αβ are just a subvector of y_α, whereas to perform the next matvec P_αβ must know all of y_β. This is accomplished in steps (3) and (4). In step (3), each processor exchanges its subsegment of y with the processor owning the transpose block of the matrix. After the transposition, the values of y_β are distributed among the processors in column block β of A. The expand operation among these processors gives each of them all of y_β, so the result is distributed as required for a subsequent matvec.


We note that at this level of detail, the algorithm is identical to the one described in [4] for dense

matrices, but as we discuss in the next section, the details of steps (1), (2) and (3) are different and

result in a more efficient overall algorithm.

Processor P_αβ owns A_αβ and x_β
(1) Compute z_αβ = A_αβ x_β
(2) Fold z_αβ within rows to form y^αβ
(3) Transpose the y^αβ, i.e.
    a) Send y^αβ to P_βα
    b) Receive y^βα from P_βα
(4) Expand y^βα within columns to form y_β

Fig. 4. Parallel matrix–vector multiplication algorithm for processor P_αβ.
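Putting the pieces together, the sketch below walks through steps (1)-(4) of Fig. 4 for one processor, reusing the illustrative fold and expand sketches above. The local block is stored dense only to keep the example self-contained, and transpose_rank (the rank of P_βα in the world communicator) and the other names are ours.

/* Illustrative C sketch of Fig. 4 for processor P_ab.  Ab is the local
 * nb x nb block stored row-major; nb = n/sqrt(p), np = n/p. */
#include <mpi.h>
#include <stdlib.h>

void fold(double *z, int len, int beta, MPI_Comm row_comm);       /* sketch above */
void expand(double *z, int seglen, int alpha, MPI_Comm col_comm); /* sketch above */

void matvec(const double *Ab, const double *xb, int nb, int np,
            int alpha, int beta, int transpose_rank,
            MPI_Comm row_comm, MPI_Comm col_comm, MPI_Comm world,
            double *yb /* out: all of y_b, length nb */)
{
    double *z = malloc((size_t)nb * sizeof(double));
    for (int i = 0; i < nb; i++) {                  /* (1) z_ab = A_ab x_b */
        double s = 0.0;
        for (int j = 0; j < nb; j++) s += Ab[i * nb + j] * xb[j];
        z[i] = s;
    }
    fold(z, nb, beta, row_comm);                    /* (2) z[0..np) = y^ab */
    MPI_Sendrecv(z,  np, MPI_DOUBLE, transpose_rank, 0,   /* (3) swap with P_ba */
                 yb, np, MPI_DOUBLE, transpose_rank, 0,
                 world, MPI_STATUS_IGNORE);
    expand(yb, np, alpha, col_comm);                /* (4) yb = all of y_b */
    free(z);
}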

3. Algorithmic details and refinements.

3.1. Transposition on parallel computers. The expand and fold primitives used in the

mapped to subsets of processors that allow for fast communication. On a hypercube a natural subset

is a subcube, while on a 2-D mesh rows, columns or submeshes are possible. Unfortunately, such a

mapping can make the transpose operation inefficient since it requires communication between pro-

cessors that are architecturally distant. Modern parallel computers use cut-through routing so that a

single message can be transmitted between non-adjacent processors at nearly the same speed as if it

were sent between adjacent processors. Nevertheless, if multiple messages are simultaneously trying

to use the same wire, all but one of them must be delayed. Hence machines with cut through routing

can still suffer from serious message congestion.

On a hypercube the scheme for routing a message is usually to compare the bit addresses of the

sending and receiving processors and flip the bits in a fixed order (and transmit along the corresponding

channel) until the two addresses agree. On the nCUBE 2 and Intel iPSC/860 hypercubes the order

of comparisons is from lowest bit to highest, a procedure known as dimension order routing. Thus

a message from processor 1001 to processor 0100 will route from 1001 to 1000 to 1100 to 0100. The

usual scheme of assigning matrix blocks to processors uses low order bits to encode the column number

and the high order bits to encode the row number. Unfortunately, dimension order routing on this

mapping induces congestion since messages from all the √p processors in a row route through the

diagonal processor. A similar bottleneck occurs with mesh architectures where the usual routing

scheme is to move within a row before moving within a column. Fortunately, the messages being

transposed in our algorithm are shorter than those in the fold and expand operations by a factor of √p. So even if congestion delays the transpose messages by a factor of √p, the overall communication scaling of the algorithm will not be affected.

On a hypercube, a different mapping of matrix blocks to processors can avoid transpose congestion


altogether. With this mapping we still have nearest neighbor communication in the fold and expand

operations, but now the transpose operation is as fast as sending and receiving a single message of

length n/p. Consider a d-dimensional hypercube where the address of each processor is a d-bit string. For simplicity we assume that d is even. The row block number α is a d/2-bit string, as is the column block number β. For fast fold and expand operations, we require that the processors in each row and

column form a subcube. This is assured if any set of d/2 bits in the d-bit processor address encode the

block row number and the other d/2 bits encode the block column number. Now consider a mapping

where the bits of the block row and block column indices of the matrix are interleaved in the processor

address. For a 64-processor hypercube (with 3-bit row and column addresses for the 8 x 8 blocks of the matrix) this means the 6-bit processor address would be r2c2r1c1r0c0, where the three bits r2r1r0 encode the block row index and c2c1c0 encodes the block column index.
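The interleaving can be written as a short bit-manipulation routine. The helper below is our own illustration, not code from the report; it builds the node address r2c2r1c1r0c0 for the 64-processor example above and, with its arguments swapped, the address of the transpose partner.

/* Illustrative helper: interleave the block row index a and block column index b
 * into a hypercube node address, row bits in the odd positions and column bits
 * in the even positions (r2c2r1c1r0c0 when halfbits = 3). */
#include <stdio.h>

static unsigned node_of(unsigned a, unsigned b, int halfbits)
{
    unsigned id = 0;
    for (int i = 0; i < halfbits; i++) {
        id |= ((b >> i) & 1u) << (2 * i);       /* c_i -> bit 2i     */
        id |= ((a >> i) & 1u) << (2 * i + 1);   /* r_i -> bit 2i + 1 */
    }
    return id;
}

int main(void)
{
    unsigned a = 5, b = 2;                      /* example block indices, 64 nodes */
    printf("P_{%u,%u} is node %u; its transpose partner P_{%u,%u} is node %u\n",
           a, b, node_of(a, b, 3), b, a, node_of(b, a, 3));
    return 0;
}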

Note that in this mapping each row of blocks and column of blocks of the matrix still resides on

a subcube of the hypercube, so the expand and fold operations can be performed optimally. However,

the transpose operation is now contention-free as demonstrated by the following theorem. Although

the proof assumes a routing scheme where bits are flipped in order from lowest to highest, a similar contention-free mapping is possible for any fixed routing scheme as long as row and column bits are

forced to change alternately.

THEOREM 3.1. Consider a hypercube using dimension order routing, and map processors to elements of an array in such a way that the bit-representations of a processor's row number and column number are interleaved in the processor's bit-address. Then the wires used when each processor sends a message to the processor in the transpose location in the array are disjoint.

Proof. Consider a processor P with bit-address r_b c_b r_{b−1} c_{b−1} ... r_0 c_0, where the row number is encoded with r_b ... r_0, and the column number with c_b ... c_0. The processor P^T in the transpose array location will have bit-address c_b r_b c_{b−1} r_{b−1} ... c_0 r_0. Under dimension order routing, a message is transmitted in as many stages as there are bits, flipping bits in order from right to left to generate a sequence of intermediate patterns. After each stage, the message will have been routed to the intermediate processor denoted by the current intermediate bit pattern. The wires used in routing the message from P to P^T are those that connect two processors whose patterns occur consecutively in the sequence of intermediate patterns. After 2k stages, the intermediate processor will have the pattern r_b c_b ... r_k c_k c_{k−1} r_{k−1} ... c_0 r_0. The bits of this intermediate processor are a simple permutation of the original bits of P in which the lowest k pairs of bits have been swapped. Also, after 2k − 1 stages, the values in the bit positions 2k and 2k − 1 are equal.

Now consider another processor P′ ≠ P, and assume that the message being routed from P′ to P′^T uses the same wire employed in step i of the transmission from P to P^T. Denote the two processors connected by this wire by P1 and P2. Since they differ in bit position i, P1 and P2 can only be encountered consecutively in the transition between stages i − 1 and i of the routing algorithm. Either i − 1 or i is even, so a simple permutation of pairs of bits of P must generate either P1 or P2; say P*. Similarly, the same permutation applied to P′ must also yield either P1 or P2; say P′*. If P* = P′* then P = P′, which is a contradiction. Otherwise, both P1 and P2 must appear after an odd number of stages in one of the routing sequences. If i is odd then bits i and i + 1 of P must be equal, and if i is even then bits i and i − 1 of P are equal. In either case, P1 = P2, which again implies the contradiction that P = P′. □

3.2. Overlapping computation and communication. If a processor is able to both compute

and communicate simultaneously, then the algorithm in Fig. 4 has the shortcoming that once a proces-

sor has sent a message in the fold or expand operations, it is idle until the message from its neighbor

arrives. This can be alleviated in the fold operation in step (2) of the algorithm by interleaving com-

munication with computation from step (1). Rather than computing all the elements of z_αβ before

beginning the fold operation, we should compute just those that are about to be sent. Then whichever

values will be sent in the next pass through the fold loop get computed between the send and receive

operations in the current pass. In the final pass, the values that the processor will keep are computed.

In this way, the total run time is reduced on each pass through the fold loop by the minimum of the

message transmission time and the time to compute the next set of elements of z_αβ.
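The sketch below is a simplified rendering of this idea for the first pass of the fold, again using nonblocking MPI calls only for exposition; compute_entries is an assumed helper that forms the requested entries of z_αβ from the local matrix block. Later passes repeat the same pattern on successively smaller pieces of z.

/* Illustrative overlap of computation and communication in the first fold pass.
 * Only the half about to be sent is computed up front; the retained half is
 * computed while the message is in flight, then the received values are summed. */
#include <mpi.h>

void compute_entries(double *z, int lo, int hi);   /* assumed local-matvec helper */

void fold_first_pass(double *z, int len, int send_low_half, int partner,
                     double *w, MPI_Comm row_comm)
{
    int half = len / 2;
    int send_off = send_low_half ? 0 : half;
    int keep_off = send_low_half ? half : 0;
    MPI_Request req[2];

    compute_entries(z, send_off, send_off + half);  /* only what is about to go  */
    MPI_Irecv(w, half, MPI_DOUBLE, partner, 0, row_comm, &req[0]);
    MPI_Isend(z + send_off, half, MPI_DOUBLE, partner, 0, row_comm, &req[1]);
    compute_entries(z, keep_off, keep_off + half);  /* overlap with the transfer */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    for (int j = 0; j < half; j++)                  /* fold received values in   */
        z[keep_off + j] += w[j];
}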

3.3. Balancing the computational load. The discussion above has concentrated on the com-

munication requirements of our algorithm, but an efficient algorithm must also ensure that the com-

putational load is well balanced across the processors. For our algorithm, this requires balancing the

computations within each local matvec. If the region of the matrix owned by a processor has m’ nonze-

ros, the number of floating point operations (flops) required for the local matvec is 2m’ – n/@. These

will be balanced if m’ x m/p for each processor, where m is the total number of nonzero elements in

the matrix. For dense matrices or random matrices in which m >> n, the load is likely to be balanced.

However for matrices with some structure it may not be. For these problems, ogielski and Aiello have

shown that randomly permuting the rows and columns gives good balance with high probability [8]. A

random permutation has the additional advantage that zero values encountered when summing vectors

in the fold operation are likely to be distributed randomly among the processors.

Most matrices used in real applications have nonzero diagonal elements. We have found that when

this is the case, it may be advantageous to force an even distribution of these among processors and to

randomly map the remaining elements. This can be accomplished by first applying a random symmetric

permutation to the matrix. This preserves the diagonal while moving the off-diagonal elements. The

diagonal can now be mapped to processors to match the distribution of the y^αβ subsegment that each

processor owns. The contribution of the diagonal elements can then be computed in between the send

and receive operations in the transpose communication, saving either the transpose transmission time

or the diagonal computation time, whichever is smaller.

3.4. Complexity model. The algorithm described above can be implemented to require the

minimal 2m – n flops to perform a matrix-vector multiplication, where m is the number of nonzeros

in the matrix. Some of these flops will occur during the calculation of the local matvecs, and the rest

during the fold summations. We make no assumptions about the data structure used on each processor

to compute its local matrix-vector product. This allows for the implementation of whatever algorithm

works best on the particular hardware. If we assume the computational load is balanced by using the


techniques described in §3.3, the time to execute these floating point operations should be very nearly (2m − n)T_flop/p, where T_flop is the time required for a single floating point operation.

The algorithm requires log2(p) + 1 read/write pairs for each processor, and a total communication volume of n(2√p − 1) floating point numbers. Accounting for the natural parallelism in the communication operations, the effective communication volume is n(2√p − 1)/p. Unless the matrix is very sparse, the computational time required to form the local matvec will be sufficient to hide the transmission time in the fold operation, as discussed in §3.2. We will assume that this is the case. Furthermore, we will assume that the transpose transmission time can be hidden with computations involving the matrix diagonal, as described in §3.3. The effective communication volume therefore reduces to n(√p − 1)/p. The total run time, T_total, can now be expressed as

(1)    T_total = ((2m − n)/p) T_flop + (log2(p) + 1)(T_send + T_receive) + (n(√p − 1)/p) T_transmit ,

where T_flop is the time to execute a floating point operation, T_send and T_receive are the times to initiate a send and receive operation respectively, and T_transmit is the transmission time per floating point value. This model will be most accurate if message contention is insignificant, as it is with the mapping for hypercubes described in §3.1.
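For concreteness, equation (1) can be evaluated directly. The sketch below does so for caller-supplied machine parameters; the numbers in main() are placeholders for illustration, not measurements from this study.

/* Evaluate the run-time model of equation (1). */
#include <math.h>
#include <stdio.h>

double t_total(double m, double n, double p,
               double t_flop, double t_send, double t_receive, double t_transmit)
{
    return (2.0 * m - n) / p * t_flop
         + (log2(p) + 1.0) * (t_send + t_receive)
         + n * (sqrt(p) - 1.0) / p * t_transmit;
}

int main(void)
{
    /* Placeholder parameters: n = 14000, about 132 nonzeros per row, p = 1024,
       and invented per-operation times in seconds. */
    double t = t_total(14000.0 * 132.0, 14000.0, 1024.0,
                       1.0e-6, 1.0e-4, 1.0e-4, 2.0e-6);
    printf("predicted time per matvec: %g s\n", t);
    return 0;
}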

4. Application to the Conjugate Gradient algorithm. To examine the efficiency of our

parallel matrix-vector multiplication algorithm, we used it as the kernel of a conjugate gradient (CG)

solver. A version of the CG algorithm for solving the linear system Ax = b is depicted in Fig. 5. There

are a number of variants of the basic CG method; the one we present is a slightly modified version

of the algorithm given in the NAS benchmark [1, 3] discussed later. In addition to the matrix–vector

multiplication, the inner loop of the CG algorithm requires three vector updates (of x, r and p), as well as two inner products (forming γ and ρ′).

An efficient parallel implementation of the CG algorithm should divide the workload evenly among

processors while keeping the cost of communication small. Unfortunately, these goals are in conflict

because when the vector updates are distributed, the inner product calculations require communication

among all the processors. In addition, if the algorithm in Fig. 5 is implemented in parallel, each

processor must know the value of α before it can update r to compute ρ′ and hence β. The calculation of γ = p^T y, the distribution of γ, and the calculation of ρ′ = r^T r can actually be condensed into two

global operations because the first two operations can be accomplished simultaneously with a binary

exchange algorithm. However these global operations are still very costly. One way to reduce the

communication load of the algorithm is to modify it as shown in Fig. 6.

This modified algorithm is algebraically equivalent to the original, but instead of updating r and then calculating r^T r, the new algorithm exploits the identity r_{i+1}^T r_{i+1} = (r_i − αy)^T (r_i − αy) = r_i^T r_i − 2α y^T r_i + α² y^T y, as suggested by Van Rosendale [10]. The values of γ, φ and ψ can be summed with a single global communication, essentially halving the communication time required outside the matvec routine. In exchange for this communication reduction, there is a net increase of one inner product calculation since φ = y^T r and ψ = y^T y must now be computed, but ρ′ = r^T r need not be calculated explicitly.


x := 0
r := b
p := b
ρ := r^T r
For i = 1, ...
    y := Ap
    γ := p^T y
    α := ρ/γ
    x := x + αp
    r := r − αy
    ρ′ := r^T r
    β := ρ′/ρ
    ρ := ρ′
    p := r + βp

Fig. 5. A conjugate gradient algorithm.

Since the vectors are distributed across all the processors, this requires an additional 2n/p floating point operations by each processor in order to avoid a global communication. Whether this is a net gain depends upon the relative sizes of n and p, as well as the cost of flops and communication on a particular machine, but since communication is typically much more expensive per unit than computation, the modified algorithm should generally be faster. For the nCUBE 2, the machine used in this study, we estimate that this recasting of the algorithm is worthwhile when n < 5 × 10^5.
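The single global communication referred to above amounts to one reduction over a length-3 buffer. The sketch below uses MPI_Allreduce purely for exposition; local_dot is an assumed helper over the length-n/p local segments, and the names are ours.

/* Illustrative combined global sum for the modified CG loop: the three local
 * partial inner products are packed into one buffer and summed together. */
#include <mpi.h>

static double local_dot(const double *a, const double *b, int len)
{
    double s = 0.0;
    for (int i = 0; i < len; i++) s += a[i] * b[i];
    return s;
}

void cg_scalars(const double *p, const double *r, const double *y, int len,
                MPI_Comm comm, double *gamma, double *phi, double *psi)
{
    double loc[3], glob[3];
    loc[0] = local_dot(p, y, len);            /* contribution to gamma = p^T y */
    loc[1] = local_dot(y, r, len);            /* contribution to phi   = y^T r */
    loc[2] = local_dot(y, y, len);            /* contribution to psi   = y^T y */
    MPI_Allreduce(loc, glob, 3, MPI_DOUBLE, MPI_SUM, comm);
    *gamma = glob[0];  *phi = glob[1];  *psi = glob[2];
}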

This restructuring of the CG algorithm can in principle be carried further to hide more of the

communication cost of the linear solve. That is, by repeatedly substituting for the residual and search

vectors r and p we can express the current values of these vectors in terms of their values k steps

previously. (General formulas for this process are given in [7].) By proper choice of k it is possible to

completely hide the global communication in the CG algorithm. Unfortunately this leads to a serious

loss of stability in the CG process which is expensive to correct [6]. We therefore recommend only

limited application of this restructuring idea.

The vector and scalar operations associated with CG fit conveniently between steps (3) and (4)

of the matrix–vector multiplication algorithm outlined in Fig. 4. At the end of step (3) the product

vector y is distributed across all p processors, and it is trivial to achieve the identical distribution for

z, r and p. Now all the vector updates can proceed perfectly in parallel. At the end of the CG loop,

the vector p can be shared through an expand operation within columns and hence the processors will

be ready for the next matvec. The resulting algorithm is sketched in Fig. 7.

We implemented a double precision version of this algorithm in C on the 1024-processor nCUBE 2 hypercube at Sandia's Massively Parallel Computing Research Laboratory.

x := 0
r := b
p := b
ρ := r^T r
For i = 1, ...
    y := Ap
    γ := p^T y
    φ := y^T r
    ψ := y^T y
    α := ρ/γ
    ρ′ := ρ − 2αφ + α²ψ
    β := ρ′/ρ
    ρ := ρ′
    x := x + αp
    r := r − αy
    p := r + βp

Fig. 6. A modified conjugate gradient algorithm.

The resulting code was tested on the well-known NAS parallel benchmark problem proposed by researchers at NASA Ames

[1, 3]. The benchmark uses a conjugate gradient iteration to approximate the smallest eigenvalue of

a random, symmetric matrix of size 14,000, with an average of just over 132 nonzeros in each row.

The benchmark requires 15 calls to the conjugate gradient routine, each of which involves 25 passes

through the innermost loop containing the matvec.

This benchmark problem has been addressed by a number of different researchers on several

different machines [2]. A common theme in this previous work has been the search for some exploitable

structure within the benchmark matrix. Since arbitrary restructuring of the matrix is permitted by

the benchmark rules as a pre-processing step, the computational effort expended in this search for

structure is not counted in the benchmark timings.

In contrast, our algorithm is completely generic and does not require any special structure in the

matrix. The communication operations are entirely independent of the zero/nonzero pattern of the

matrix, and the only advantage of reordering would be to lessen the load on the most heavily burdened

processor. Because the benchmark matrix diagonal is dense, we did partition the diagonal across all

processors, as described in §3.3. Otherwise, we accepted the matrix as given, and made no effort to

exploit structure.

Our implementation solved the benchmark problem in 6.09 seconds, which compares quite favor-

ably with all other published results on massively parallel machines [3]. For comparison, the recently

published times for the 128 processor iPSC/860 and 32K CM-2 are 8.61 and 8.8 seconds respectively,

which is more than 40% longer than our result.


Processor P_μν owns A_μν
x, r, p, b, y ∈ R^{n/p};  p_ν ∈ R^{n/√p}
x := 0
r := b
p := b
ρ̃ := r^T r
Sum ρ̃ over all processors to form ρ
Expand p within columns to form p_ν
For i = 1, ...
    Compute z_μν = A_μν p_ν
    Fold z_μν within rows to form y^μν
    Transpose y^μν, i.e.
        Send y^μν to P_νμ
        Receive y := y^νμ from P_νμ
    γ̃ := p^T y
    φ̃ := y^T r
    ψ̃ := y^T y
    Sum γ̃, φ̃ and ψ̃ over all processors to form γ, φ and ψ
    α := ρ/γ
    ρ′ := ρ − 2αφ + α²ψ
    β := ρ′/ρ
    ρ := ρ′
    x := x + αp
    r := r − αy
    p := r + βp
    Expand p within columns to form p_ν

Fig. 7. A parallel CG algorithm for processor P_μν.

Although this problem is highly unstructured, our C code achieves about 250 Mflops, which is about 12% of the raw speed achievable by running pure assembly language BLAS on each processor without any communication.

5. Conclusions. We have presented a parallel algorithm for matrix–vector multiplication, and

shown how this algorithm can be used very effectively within the conjugate gradient algorithm. The

communication cost of this algorithm is independent of the zero/nonzero structure of the matrix and

scales as n/√p. Consequently, the algorithm is most appropriate for matrices in which structure is

either difficult or impossible to exploit. This is clearly the case for dense and random matrices, and

it is also true more generally for sparse matrices in many contexts. For example, our algorithm could

serve as an efficient black-box routine for prototyping sparse matrix linear algebra algorithms or could

be embedded in a sparse matrix library where few assumptions about matrix structure can be made.


On the NAS conjugate gradient benchmark, an nCUBE 2 implementation of this algorithm runs

more than 40’%0faster than any other reported algorithm running on any massively parallel machine.

The particular mapping we employ for hypercubes is likely to be of independent interest. This

mapping ensures that rows and columns of the matrix are owned entirely by subcubes, and that with

cut–through routing the transpose operation can be performed without message contention. This

mapping has already proved useful for parallel many-body calculations [5], and is probably applicable

to other linear algebra algorithms.

Acknowledgements. We are indebted to David Greenberg for assistance in developing the hy-

percube transposition algorithm in §3.1.

REFERENCES

[1] D. H. BAILEY, E. BARSZCZ, J. T. BARTON, D. S. BROWNING, R. L. CARTER, L. DAGUM, R. A. FATOOHI, P. O. FREDERICKSON, T. A. LASINSKI, R. S. SCHREIBER, H. D. SIMON, V. VENKATAKRISHNAN, AND S. K. WEERATUNGA, The NAS parallel benchmarks, Intl. J. Supercomputer Applications, 5 (1991), pp. 63-73.

[2] D. H. BAILEY, E. BARSZCZ, L. DAGUM, AND H. D. SIMON, NAS parallel benchmark results, in

Proc. Supercomputing ’92, IEEE Computer Society Press, 1992, pp. 386-393.

[3] D. H. BAILEY, J. T. BARTON, T. A. LASINSKI, AND H. D. SIMON, eds., The NAS parallel

benchmarks, Tech. Rep. RNR-91-02, NASA Ames Research Center, Moffett Field, CA, January

1991.

[4] G. C. Fox, M. A. JOHNSON, G. A. LYZENGA, S. W. OTTO, J. K. SALMON, AND D. W.

WALKER, Solving problems on concurrent processors: Volume 1, Prentice Hall, Englewood

Cliffs, NJ, 1988.

[5] B. HENDRICKSON AND S. PLIMPTON, Parallel many-body calculations without all-to-all com-

munication, Tech. Rep. SAND 92-2766, Sandia National Laboratories, Albuquerque, NM,

December 1992.

[6] R. W. LELAND, The Effectiveness of Parallel Iterative Algorithms for Solution of Large Sparse

Linear Systems, PhD thesis, University of Oxford, Oxford, England, October 1989.

[7] R. W. LELAND AND J. S. ROLLETT, Evaluation of a parallel conjugate gradient algorithm, in

Numerical methods in fluid dynamics III, K. W. Morton and M. J. Baines, eds., Oxford

University Press, 1988, pp. 478-483.

[8] A. T. OGIELSKI AND W. AIELLO, Sparse matrix computations on parallel processor arrays, SIAM

J. Sci. Stat. Comput., 14 (1993). To appear.

[9] R. VAN DE GEIJN, Efficient global combine operations, in Proc. 6th Distributed Memory Com-

puting Conf., IEEE Computer Society Press, 1991, pp. 291-294.

[10] J. VAN ROSENDALE, Minimizing inner product data dependencies in conjugate gradient iteration,

in 1983 International conference on parallel processing, H. J. Siegel et al., eds., IEEE, 1983,

pp. 44-46.

13

Page 16: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.

EXTERNAL DISTRIBUTION:Raymond A. BairMolecuhu Science Reach. Cntr.Pacific NW hborato~Richland, WA 99362

Falcon AFB, CO 80912-51MfJ

R W. AlewineDARPA/RMO1400 Wilmn Blvd.Arlington, VA 222o9

Bud Brewster410South PierceWheaton, IL 80187

R. E. BakDept. of MathematicalUniv. of CA at San DiegoLa Jolla, CA 92093

Carl N. BrooksBrc&a Associates414 Falls RoadChagrin Fafls, OH 44022

F~ AltabdiUS Air Force Weapono LabNuclear Technology Branch

Kirtfand+ AFB, NM 87117-6008Ken Bannister

US Army Bdfistic Res. LabAttrx SLCBIVIB-MAberdeen Prov. Gmd., MD 21005

John Brunet245 Fmt St.Cambridge, MA 02142

-5086John BnmoCntr for Comp. Sci and EngrCollege of EngineeringUnivemity of California

Santa BarbM~ CA 931-5110

Mad w. AllenThe Waif Street Joumaf1233 Regal hW

Daflas, TX 75247Edward BarragyDept. ASE/EMUniversity of TexasAustin, TX 78712

M. AlmeAlme and Associates6219 Bright PlumeC&unbii, MD 21044

E. BammNAS Applied Rewarch BranchNASA Ames Ikearch cent.Moffett Field, CA 94035

H. L. BuchananDARPA/DSO1400 Wifson Blvd.Arlington, VA 22209

ChaAa E. AndersonSOuthweat ~ InstitutePO Drawer 28610San Antonio, TX 78284

W. BeckAT&T Pixel Machinea, 4J-214

Crawfords Ccuner Rd.Homdel, NJ 07733-1%8

D. A. BudSupercornputing Reach. Cntr,

17100 Saence Dr.Bowie, MD 20715

Dan AndersonFord Motor Co., Suite 1100Vie Pke22400 Michigan Ave.Dearbcrn, MI 48124 David J. fkuuq

Dept. of AMES FLO1lUniv. of California at San DiegoLa Jolla, CA 92093

B. L. BuzbeeScientific Computing Dept.NCARPO Box 3000Boulder, CO 80307

Andy ArenthNational Security AgencySlwageRoadFt. Meade, MD 207ssAttn: C6

Mylea R. BergLockheed, 0/62-30, B/1501111 Lockheed Way%nqyvale, CA 9408%3.504

G. F. CareyDept. of Engineering MechanicsTICOM ASEEM WRW 305University of Texan at AustinAustin, TX 78712

Greg AstfdkConvex Computer (hp.3(N)0 Waterview ParkwqyPO Box 833851Richrmdao% TX 750s3-3851

Stephan BilykUS Army Baflist. Reach. LabSLCBKTB-AMBAberdeen Proving Ground, MD21OOWIO66

Art CarfmnNOSC Code 41New London, CT 06320Susan R. Atlas

TMC, MS B258

Center for Nonlinear StudiesLos Alarnos Nationaf LabsLos Ahunos, NM 87545

Rob BisselingShelf Research B.V.Postbus 30031003 AA Amsterdam

The Netherlands

Bonnie C. CarrolfSec. Dir.

CENDI, Information Iut’1PO Box 4141Oak Ridge, TN 37831L. Audander

DARPA/DSO1400 Wifson Blvd.

Arfington, VA 222o9Matt BlumrichDept. of Comp. SaencePrinceton UniversityPrinceto~ NJ 0s644

Charfm T. CasaleAberdeen Group, Inc.92 State StreetBoston, MA 02109D. M. Austin

Army High Per Comp. Rea. Cntr.University of Minnesota1100 S. Second St.Mirmeapolie, MN 5s41s

B. W. BoehmDARPA/ISTO1400 Wilson Blvd.Arlington, VA 222o9

J. M. CavdliniUS Department of EnergyOSC, ER-30, GTNWasbingto~ DC 20585

ScottBadmUnivemity of Cdiforni% Stm DiegoDept. of Computer Saence95OOGifman DriveEngirmxing0114

La Jofla, CA 92091-CX)14

R. R. BorchcrL-889Lawrence Lhennore Nat’1 LabsPO Box 808Livermore, CA 94550

John ChampineClay Rea. Inc., Software Div.655F Lone Oak Dr.Eagan, MN 55121

Tony ChanDepartment of Computer ScienceThe Chinese Univemity of Hong KongShatin, NTHong Kong

F. R. BaileyMS2Cxl-4NASA Ames Research Center

Moffett Field, CA 94o35

Dan BOWhISMail Code 8123Attm D. Bowlus & G. l..diecqNaval Underwater Sys. Cntr.Newport, RI 02841-5047

D. H. BaileyNAS Applied Research BranchNASA b- Research CenterMoffett F]eld, CA 94o35

J. ChandraArmy Research OfficePO Box 12211Resch Triangle Park, NC 27709

Dr. Donald BrandMS - N8930Geodynamica

14

Page 17: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.

Siddmtba ChatterjeeRIAcsNASA Ames Reaeara C.terMail Stop T04~lMofTett Field, CA 94~l(W

IBM Corporation472 Wheelers F- IbadMilford, CT 06460

Yale UniversityPO Box 2158New Haven, CT 06520

J. K. CrdlumThorruu J. Watmn Resch. CenterPo Box 218Yorktown Heights, NY 10598

H. ElmanComputer Science Dept.University of MarylaudCollege Park, MD 20842Wamm Chemock

Scientific Advisor DP.1US Department of EnergyForentd Bldg. 4A.045Washington, DC 20585

Leo DagmnComputer Sciences Corp.NASA Ames Research CenterMoffett Field, CA 94CL3S

M. A. EfmerDARPA/RMO1400 Wilson Blvd.Arfington, VA 222o9

R.C.Y. ChinL-321La wre.nceLiverrmwe Nat’1 LabPO Box 808

Livermore, CA 9455o

Kenneth I. DaughertyChief ScientistHQ DMA (5A), MS -A-16

8613 Lee HighwayFairfax, VA 22031-2138

J. N. EntzmingerDARPA/TTO1400 When Blvd.

Arlington, VA 222o9

Mark ChristenL122Lawreme Llvennore Nat’1 LabPO %X 808Livemnme, CA 9455o

A. M. ErisrnanMS 7L21Boeing Computer ServicesPO Box 24346Seattle, WA 98124-0346

L. DavisCray Research Inc.1168 Industnd Blvd.Chippawa Fafls, WI s472s

M. CirnentAdv. Sci. Comp/ Div. RM 417National Science FoundationWashington, DC 20ss0

Mr. Frank R. DeisMartin MariettaFalcon AFB, CO 80912-50t13

R. E. EwingMathernatica Dept.University of W yomingPO Box 3036 Univemity StationLammie, WY 82071R. A. DeWlllo

Comp. & Comput. Reach.,Rm. 304 ,National Science FoundationWashington DC 20550

Richard CfaytorUS Department of EnergyDef. Prog. DE1

Foreatd Bldg. 4A-014Washington, DC 20585

El Dabaghi FadiCharge of ResearchInstitute Nat’1 De Recherche enInformatique et en AutomatiqueDornaine de Voluceau RomuencourtBP 10578153 Le Chemay Cedex (France)

L. DengApplied Mathematics Dept.SUNY at Stony Brook

Stony Brook, NY 11794-3&lI

Andy ClearyCentm for Information ResearchAustrafia National UniversityGPA Box 4CanberrA ACT 26o1Australia

H. D. F&Institute for Adv. Tech.4032-2 W. Braker LaneAustin. TX 78759

A. Trent DePersiaProg. Mgr.DARPA/ASTO1400 Wifscm Blvd.Arlington, VA 22209-2308

T. ColeMS180-500Jet Prop. Lab4800 Oak Grove Dr.,P.amdena. CA 911OP

Kurt D. FickieUS Army Ballistic Resch. LabATTN: SLCBR-SEAberdeen Proving Ground, MD21OWA4M6

Sean DolannCUBE919 E. Hilbdafe Blvd.Foster City, CA 944o4Monte Cole-

US Army Bd. Reach LabSLCBR-SEA (Bldg. 394/216)Aberdeen Prov. Gmd., MD 21005-5006

Tom Finnegan

NORAD/USSPACECOM J2XSSTOP 35PETERSON AFB, CO 80914

Jack DongamaDepartment of Computer ScienceUnivemity of TermeaseeKnoxville, TN 37996Tom Coleman

Dept. of Computer ScienceUpeon HdlComelf UniversityIthaca, NY 14853

J. E. FlahertyComputer Science Dept.Remselaer Polytedr Inst.Troy, NY 12181

L. DowdyComputer Science DepartmentVanderbilt UniversityNashville, TN 37235

S. ConeyNCUBE19A Davis DriveBelmont, CA 94oO2

L. D. FoedickUniversity of ColoradoComputer Science DepartmentCampus Box 43oBoulder, CO 8030P

Joanne Downey-Burke8030 Sangor Dr.Colorado Springs, CO 80920

J. CoronesAmes Laboratory236 Wilhelm HdlIowa State UniversityAmes, IA 50011-3020

L S. DutTCSS DivisionHarwelf LaboratoryOxfordshire,OX11 ORAUnited Kingdom

G. C. FoxNortheast PamUef Archit. Cntr.111 College PlaceSyracuse, NY 13244

Steve COogrOveE6220KliOb Atomic power LabPO Box 1072Schenectady, NY 12301-1072

Alan EdelmanUniversity of California, BerkeleyDept. of MathematicsBerkefey, CA 94720

R. F. FreundNRaD- Code 423San D,ego, CA 991525000

Sveme FmyenSolar Energy Research Inst.1617 Cole Blvd.

S. C. EuenstatComputer Sdence Dept.

15

C. L. Crothera

Page 18: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.

Golden, CO 80401 Boa 2008Oak Ridge, TN 37831

United Kingdom

W. D. HillisThinking Machinea, Inc.245 First St.Cambridge, MA 02139

David GaleIntel Corp600 S. Cherry St.Denver, CO 802221~1

Anne GrcenbaumNew York UniversityCourant Institute251 Mercer StreetNew York, NY 10012-1185 Dan Hitchmck

US Department of EnergySCS, ER-30 GTNWashington, DC 20585

D. B. GrmnonComputer %ience Dept.Indhma UniversityBloomington, IN 47401

Satya Gupta

Intel SSDBldg. C)&O9, Zone 814924 NW Greenbriar,PwkyBeaverton, OR 97cE36

LTC Richard HochbewrgSDIO/SDAThe PentagonWaahingto~ DC 20301-7100

C. W. GearNEC Rcseadr Institute4 Independence WayPrinceton, NJ 08548 J. Guetafson

Computer Sdence Department

236 Wilhelm HallIowa State UniversityAmes, IA 50011

J. A. GeorgeNeedles HalfUniversity of Water400Waterloo, Ont., Can.N2L 3G1

C. J. HollandDirectorMath and Information SciencesAFOSR/NM, Boiling AFBWashingto~ DC 20332-6448R. Graysnrr Hall

USDOE/HQ11XH3Independence Ave, SWWashington, DC 20585

Shomit GhosenCUBE919 E. HilhuLale Blvd.Foster City, CA 94404

Dr. Albert C. HoltOft. of MunitionsOfc of Sec. of St.-ODDRE/TWPPentagon, Room 3B106oWashington, DC 203301-311Xl

Cuah HandenMinnesota Supercomputer Cntr.1200 Washington Ave. So.Minneapolis, MN 55415

Clarence Giese8135 Table Mesa WayColorado Springs, CO 80919

Mr. Daniel HoltzmanvanguardR?SearChInc.

10306 Eaton P]., Suite 450Fairfax, VA 22030-2201

Steve Hammond ,NCARPO Box 3000Borddm, CO 80307

Dr. Horst GietlnCUBE GmbHHammer Strame 858000 Mrmich 50Germany

David A. HopkinsUS Army Ballistic Resch. Lab.Attention: SLCBfVIB-MAber-decn Prov. Gmd., MD 21W5-5066

CDR. D. R. HamonChief, Space Integration Div.

ussPAcEcoM/J5sIPeterson AFB, CO 809145003

John GlbertXerox PARC3333 Coyote Hill Roadpalo Alto, CA 94304

Graimm HortonUnivemitat Erlangen-NurnbergIMMD IIIMartenmrtrase 38520 ErlangenGermany

Dr. James P. HardyNTBIC/GEODYNAMICSMS N 893oFalcon AFB$ CO 60912-LYYJo

Micbnel E. Giltrud

DNAHQ DNA/SPSD6801 Telegraph Rd.Alexan&la, VA 2231CL3398

Doug HarlessNCUBE2221 E~t Lamar Blvd., Suite 36oArlington, TX 76oo6

Fred HowesUS Department of EnergyOSC, ER-30, GTNWashington, DC 20585AktairM. Glass

AT&T Bell Labe Rm IA-1646(KI Mountain AvenueMurray Hill, NJ 07974

Mike HeathUniversity of Illinois4157 Beckman Institute405 N. Mathews Ave.Urbana, IL 61801

Chua-Huang Hu~Assist. Prof. Dept. Comp. & Info SciOhio State Uuiv.228 Bolz Hall-2036 Neil Ave.Columbus, OH 43210-1277

J. G. GlirnmDept. of App Math. & Stat.

State U. of NY at Stony BrookStony Brook, NY 11794 Greg Heileman

EECE DepartmentUnivemity of New MexicoAlbuquerque, NM 87131

R. E. HuddlestonL61Lawrence Llvermore Nat’1 LabPO Box 808Liverrnore. CA 9455o

Dr. Raphael GluckTRW-DSSG, R4/1408One Space Par-kRedondo Bea+ CA 90278 Brent Henrich

Mobile R &D Laboratory13777 Midway Rd.PO Box 819047DaU=, TX 75244-4312

Zdenek JohanThinking Machines Corp.245 First StreetCambridge, MA 02142-1264

G. H. GolubComputer =Ience Dept.Stanford UniversityStanford, CA 943o5

Michael HerouxCray Research Park655F Lone Oak DriveEagrm, MN 55121

Gary JohnsonUS Department of EnergySCS, ER30 GTNWashington, DC 20585

Marcia GrabowAT&T Bell Labe ID-1536CCIMountain Ave.

PO Box 636Murray Hill, NJ 07974-0636 A.J. Hey

University of SouthamptonDept. of Electronic and ComputerMountbatten Bldg., HighfieldSouthampton, S095NH

16

S. Lenmwt JohnssonThinking Machines Corp.

Science 245 First StreetCambridge, MA 02142-1264

Nancy GradyMS 6240Oak Ridge Nat’] Lab

Page 19: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.

G. S. Jones

Teeh Program Support Cntr.DARPA/AVSTO1515 Wifson BIvd.Arlington, VA 222o9

lIXIO Independence Ave.Washington, DC 20585

Kelly AFBSan Antonio, TX 782435~

Ann KrauseHQ AFOTEC/OANKirtl.and AFB, NM 87117-7(XI1

H. MairNaval Surface Warfare Center10901 New Hampshire Ave.Silver Springs, MD 2090&5000T. H. Jordan

Dept of Earth, Atmoa & Pla. Sci.MITCambridge, MA 02139

V. KumarComputer Science D~mentUnivemity of MinnesotaMinneapolis, MN 55455

Henry MakowitzMS - 2211-INELEG&E Idaho IncorporatedIdaho F.dlE, ID 83415M. H. Kdoa

Cornell Theory Center514A Eng. and Theory CenterHoy Road, Cornell UnivemityIthaca, NY 14853

J. LarmuttiMS B-166Director, SC. Reach. InstituteFlorida State UnivemityTallahassee, FL 32306

David ManddIMS F663Hydrodynamic App. Grp. X-3IAM Alamos Nat’] LabsLos Alanms, NM 87545H. G. Kaper

Math. and Comp. %]. Division

Argonne National LaboratoryArgonne, IL 60439

P. D. Lax

New York University-Courant251 Mercer St.

New York, NY 10012

T. A. ManteufTelDepartment of Mathematics

University of Co. at DenverDenver, CO 80202S. Karin

SuperComputing Department9.5MIGilman DriveUniversity of CA at San DiegoLa Jofla, CA 92093

Lawrence A. LeeNC Supercomputing CenterPO Box 126893021 Comwdlis Rd.Research Triangle Park, NC 27709

William S. MarkLockheed - Org, 96-01Bldg. 254E3251 Hanover StreetPalo Alto, CA 943031191Herb Keller

Applied Math 217-50Cd TechPaaadena, CA 91125

Dr. H.R. LelandCalspan CorporationPO Box 400,Buff.do, NY 14225

Kapit MathurThinking Machines Corporation245 Fimt Street

Cambridge, MA 0214>1214M. J. KelleyDARPA/DMO1400 Witson Blvd.Arlington, VA 222C9

David LevineMath & Comp. S&meArgonne National Laboratay9700 Cam Avenue SouthArgonne, IL 60439

John MayKanutn Sciences Corporation1500 Garden of the Gods RoadColorado Springn, CO 60933K. W. Kennedy

Computer Science DepartmentRice UniversityPO Box 1892Houston. TX 77251

Peter LlttlewoodTheoret. Phy. Dept.AT&T Belt LabeRan lD-?35Murray Hill, NJ 07974

William McColtOxford Univ. Computing Lab6-11 Keble RoadOxford, OX1 3QDUnited KingdomAram K. Kevorkian

Codje 73L14Naval Ocean Systems Center271 Catalina Blvd.San Diego, CA 92152-5000

Peter LomdafdT-II, MS B262Los Alamoa Nat’] LabLos AhllllOS, NM 87545

S. F. McCormick

Computer Mathematical GroupUnivemity of CO at Denver1200 Larimer St.Denver, CO 80204John E. Killougb

University of HoustonDept. of Chem. EngineeringHouston, TX 77204-4792

L-As S. LemeSDIO/TNI

The PentagonWashington, DC 20301.7tO0

J. R. McGrawL-316Lawrence Livermore Nat’1 Lab

D. R. KincaidCntr. for Num. Andy.,RLM l&150Univemity of Texaa at AustinAustin, TX 78712

PO Box 808

Col. Gordon A. LongDeptuty Director for Adv. Comp.HQ USSPACECOM/JOSDEPSPeterson AFB, CO 6(X)145(X)3

Livermore, CA 9455o

Jill MesirovThinking Machines Corporation245 F]mt StreetCambridge, MA 0214Z1214T. A. Kitchens

US Department of EnergyOSC, ER-30, GTNWashington, DC 20565

John LOll3258 Caminito AmecaLa Jolla, CA 92037 P. C. Messina

158-79Mathematics & Comp Sci. Dept.CaltechPasadena, CA 91125

Daniel LoyemKoninklijke/ShelL LaboratoriumPostbus 30031C02 AA AmsterdamThe Nethertamb

Thomas Klemas394 Briar LaneNewark, DE 19711

Prof. Ralph Metmlfe

Dept. of Mech. Engr.University of Houston4600 Calhoun Road

Houston, TX 772044792

Dr. Peter L. KnepellNTBIC/GEODYNAMICSMS N 8930Falcon AFB, CO 80912-50W

Robert E. LynchDept. of CSPurdue UniversityWest Lafayete, IN 479o7

Max KoontzDOE/OAC/DP 5.1Forestal Bldg

G. A. MichaelL306Lawrence Livermore Nat Lab

Kathy MacLeodAFEWC/SAT

17

Page 20: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.

PO Box 808Liverrnore, CA 94550

Lam K. MillerGoodyear Tire & RubberPO Box 3531Akron, OH 4430P-3531

Robert E. MilleteinTMC245 Fint StreetCambridge, MA 02142

G. MohnkemNOSC - Code 73San Diego, CA 92152-50W

C. MolerTbe Mathworke24 prime Pti WayNati~, MA 01760

Gery MontrySouthwcat software11812 Pemimrrton, NEAlbuquerque, NM 87111

N. R. MorseC-DO, MS B260Comp. & Comm .DivisionLoe Alamoe National LabLoe Alamoe, NM 87546

J. R. Medic

IBMThomas J. Wateon Raech Cntr.PO Box 704Yorktown Heighte, NY 10698

D. B. Neleon

US Department of EnergyOSC, E&30, GTNWeehington, DC 20s65

Jeff NewmeyerOrg. 81-04, Bldg. 1571111 Lockheed WaySunnyvale, CA 9406$3504

D. M. NoeenchuckMech. and Aero. Engr. Dept.D302 E QuadPrinceton UniversityPrinceton, NJ 08544

C. E. Oliver

Offc of Lab Comp Bldg. 4500N,Oak Ridge Nat’1 LaboratoryPO Box 2006Oak Ridge, TN 37631-6259

Dennis L. OrphalCdif Reach& Technology Inc.5117 Jolmeon Dr.Pleasanton, CA 94586

J. M. OrtegaApplied Math DepartmentUnivemity of VirginiaCharlottesville, VA 22903

John PalmerTMC245 Fimt St.Cambridge, MA 02142

Robert J . PaluckConvex Computer Corp.

3000 Waternew Parkw~

PO Box 733851Ricbrdaon, TX 75083-3851

Anthony C. ParmeeCo-nor and Attde (Def.)British Embsssy31W MaLw. Ave, NWWashingtorL DC 20008

s. v. ParterDepartment of Mathemati=Van Vleck HallUniversity of WisconsinMadison, WI 537o6

Dr. Nieheeth PatelUS Army Ballistic Ftesch. Lab.AMXBR-LFDAberdeen Prov. Gmd., MD 21005-5066

A. T. PateraMechanical Engineering Dept.77 Maseachueetto Ave.

MITCambridge, MA 02139

A. PatrinoeAtmoe. and Cfirn. Resch. DivOSice of Engy ResclL ER-74US Deprutment of EnergyWaAingtoaL D~ 20545

R. F. PeierfsMath. Saenca Group, Bldg. S1SBrookhawrt National Labupton, NY 11973

Donna PerezNOSC - MCAC Reeource Cntr.

Code 912San Diego, CA 92152-5LXKI

K. Perko, Supercomputing Solutions, Inc., 6175 Nancy Ridge Dr., San Diego, CA 92121

John Petresky, Ballistic Research Lab, SLCBR-LF-C, Aberdeen Proving Ground, MD 21005-5066

Linda Petzold, L-316, Lawrence Livermore Natl. Lab., Livermore, CA 94550

Wayne Pfeiffer, San Diego Supercomputer Center, PO Box 85608, San Diego, CA 92136

Frank X. Pfenneberger, Martin Marietta, MS-N33104, National Test Bed, Falcon AFB, CO 80912-5000

Dr. Leslie Pierre, SDIO/ENA, The Pentagon, Washington, DC 20301-7100

Paul Plassmann, Math and Computer Science Division, Argonne National Lab, Argonne, IL 60439

R. J. Plemmons, Dept. of Math. & Comp. Sci., Wake Forest University, PO Box 7311, Winston-Salem, NC 27109

Alex Pothen, Computer Science Department, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada

John K. Prentice, Amparo Corporation, 3700 Rio Grande NW, Suite 5, Albuquerque, NM 87107-3042

Peter P. F. Radkowski, PO Box 1121, Los Alamos, NM 87544

J. Rattner, Intel Scientific Computers, 15201 NW Greenbriar Pkwy., Beaverton, OR 97006

J. P. Retelle, Org. 94-90, Lockheed - Bldg. 254G, 3251 Hanover Street, Palo Alto, CA 94304

C. E. Rhoades, L-298, Computational Physics Div., PO Box 808, Lawrence Livermore Nat'l Lab, Livermore, CA 94550

J. R. Rice, Computer Science Dept., Purdue University, West Lafayette, IN 47907

John Richardson, Thinking Machines Corporation, 245 First Street, Cambridge, MA 02142-1214

Lt. Col. George E. Richie, Chief, Adv. Tech Plans, JOSDEPS, Peterson AFB, CO 80914-5003

John Rollett, Oxford University Computing Laboratory, 8-11 Keble Road, Oxford OX1 3QD, United Kingdom

R. Z. Roskies, Physics and Astronomy Dept., 100 Allen Hall, University of Pittsburgh, Pittsburgh, PA 15260

Diane Rover, Michigan State University, Dept. of Electrical Engineering, 260 Engineering Bldg., East Lansing, MI 48824

Y. Saad, University of Minnesota, 4-192 EE/CSci Bldg., 200 Union St., Minneapolis, MN 55455-0159



P. Sadayappan, Dept. of Comp. & Info. Science, Ohio State Univ., 228 Bolz Hall, 2036 Neil Ave., Columbus, OH 43210-1277

Anthony Skjellum, Lawrence Livermore National Laboratory, 7000 East Ave., Mail Code L-316, Livermore, CA 94550

Gligor A. Taddcovicb, PO Box 296, Pound Ridge, NY 10576-0296

H. Teuteberg, Cray Research, Suite 830, 6565 Americas Pkwy. NE, Albuquerque, NM 87110

L. Smarr, Director, Supercomputer Apps., 152 Supercomputer Applications Bldg., 605 E. Springfield, Champaign, IL 61820

Joel Saltz, Computer Science Department, A.V. Williams Building, University of Maryland, College Park, MD 20742

A. Thaler, Division of Math Sciences, National Science Foundation, Washington, DC 20550

W. R. Somsky, Ballistic Research Laboratory, SLCBR-SEA, Bldg. 394, Aberdeen Proving Ground, MD 21005

A. H. Sameh, CSRD, 305 Talbot Laboratory, University of Illinois, 104 S. Wright St., Urbana, IL 61801

Allan Torres, 125 Lincoln Ave., Suite 400, Santa Fe, NM 87501

D. C. Sorensen, Department of Math Sciences, Rice University, PO Box 1892, Houston, TX 77251

Harold Trease, Los Alamos National Lab, PO Box 1663, MS F663, Los Alamos, NM 87545

P. E. Saylor, Dept. of Comp. Science, 222 Digital Computation Lab, University of Illinois, Urbana, IL 61801

S. Squires, DARPA/ISTO, 1400 Wilson Blvd., Arlington, VA 22209

Randy Truman, Mechanical Engineering Dept., University of NM, Albuquerque, NM 87131

LCDR Robert J. Schoppe, Chief, Operations Rqmts., USSPACECOM/JOSDEPS (Stop 35), Peterson AFB, CO 80914

N. Srinivasan, AMOCO Corp., PO Box 87703, Chicago, IL 60680-0703

Ray Tuminaro, CERFACS, 42 Ave. Gustave Coriolis, 31057 Toulouse Cedex, France

Rob Schreiber, RIACS, NASA Ames Research Center, Mail Stop T045-1, Moffett Field, CA 94035-1000

Thomas Stegmann, Digital Equipment Corporation, 8085 S. Chester Street, Englewood, CO 80112

Mark Urrutia, Intel Corp., CO1-03, 5200 NE Elam Young Pkwy., Hillsboro, OR 97124-6497

M. H. Schultz, Department of Computer Science, Yale University, PO Box 2158, New Haven, CT 06520

D. E. Stein, AT&T, 100 South Jefferson Rd., Whippany, NJ 07981

Mike Uttormark, UW-Madison, 1500 Johnson Dr., Madison, WI 53706

M. Steuerwalt, Division of Math Sciences, National Science Foundation, Washington, DC 20550

D. Schwartz, NOSC, Code 733, San Diego, CA 92152-5000

R. van de Geijn, Computer Science Department, University of Texas, Austin, TX 78712

Mark Seager, LLNL, L-80, PO Box 808, Livermore, CA 94550

Mike Stevens, nCUBE, 919 E. Hillsdale Blvd., Foster City, CA 94404

George Vandergrift, Dist. Mgr., Convex Computer Corp., 3916 Juan Tabo NE, Suite 38, Albuquerque, NM 87111

A. H. Sherman, Sci. Computing Assoc. Inc., Suite 307, 246 Church Street, New Haven, CT 06510

G. W. Stewart, Computer Science Department, University of Maryland, College Park, MD 20742

H. van der Vorst, Delft University of Technology, Faculty of Mathematics, POB 356, 2600 AJ Delft, The Netherlands

O. Storaasli, MS-244, NASA Langley Research Cntr., Hampton, VA 23665

Horst Simon, NAS Systems Division, NASA Ames Research Center, Mail Stop T045-1, Moffett Field, CA 94035

C. Stuart, DARPA/TTO, 1400 Wilson Blvd., Arlington, VA 22209

C. Van Loan, Department of Computer Science, Cornell University, Rm. 5146, Ithaca, NY 14853

Richard Sincovec, Oak Ridge National Laboratory, P.O. Box 2008, Bldg. 6012, Oak Ridge, TN 37831-6367

John Van Rosendale, ICASE, NASA Langley Research Center, MS 132C, Hampton, VA 23665

LTC James Sweeder, SDIO/SDA, The Pentagon, Washington, DC 20301-7100

Vineet Singh, HP Labs, Bldg. 1U, MS 14, 1501 Page Mill Rd., Palo Alto, CA 94304

Steve Vavasis, Dept. of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853

R. A. Tapia, Mathematical Sci. Dept., Rice University, PO Box 1892, Houston, TX 77251

Ashok Singhal, CFD Research Center, 3325 Triana Blvd., Huntsville, AL 35805



R. G. Voigt, MS 132-C, NASA Langley Resch. Cntr., ICASE, Hampton, VA 23665

Phuong Vu, Cray Research, Inc., 19607 Franz Road, Houston, TX 77084

David Walker, Bldg. 6012, Oak Ridge National Lab, PO Box 2008, Oak Ridge, TN 37831

Steven J. Wallach, Convex Computer Corp., 3000 Waterview Parkway, PO Box 833851, Richardson, TX 75083-3851

R. C. Ward, Bldg. 9207-A, Mathematical Sciences, Oak Ridge National Lab, PO Box 4141, Oak Ridge, TN 37831-8083

Thomas A. Weber, National Science Foundation, 1800 G Street NW, Washington, DC 20550

G. W. Weigand, DARPA/CSTO, 3701 N. Fairfax Ave., Arlington, VA 22203-1714

M. F. Wheeler, Math Sciences Dept., Rice University, PO Box 1892, Houston, TX 77251

A. B. White, MS-265, Los Alamos National Lab, PO Box 1663, Los Alamos, NM 87544

B. Wilcox, DARPA/DSO, 1400 Wilson Blvd., Arlington, VA 22209

Roy Williams, California Institute of Technology, 206-49, Pasadena, CA 91125

C. W. Wilson, MS ML02-L3/B11, Digital Equipment Corp., 146 Main Street, Maynard, MA 01754

K. G. Wilson, Physics Dept., Ohio State University, Columbus, OH 43210

Leonard T. Wifncm, NSWC, Code G22, Dahlgren, VA 22448

Peter Wolochow, Intel Corp., CO1-03, 5200 NE Elam Young Pkwy., Hillsboro, OR 97124-6497

P. R. Woodward, University of Minnesota, Department of Astronomy, 116 Church Street SE, Minneapolis, MN 55455

M. Wunderlich, Math. Sciences Program, National Security Agency, Ft. George G. Meade, MD 20755

Hishashi Yasumori, KTEC - Kawasaki Steel Techno-research Corporation, Hibiya Kokusai Bldg., 2-3 Uchisaiwaicho 2-chome, Chiyoda-ku, Tokyo 100

David Young, Center for Numerical Analysis, RLM 13.150, The University of Texas, Austin, TX 78713-8510

Robert Young, Alcoa Laboratories, Alcoa Center, PA 15069

Attn: R. Young & J. McMichael

William Zierke, Applied Research Lab, Penn State, PO Box 30, State College, PA 16804

INTERNAL DISTRIBUTION:

Paul Fleury
Ed Barsis
Sudip Dosanjh
Bill Camp
Doug Cline
David Gardner
Grant Heffelfinger
Scott Hutchinson
Martin Lewitt
Steve Plimpton
Mark Sears
John Shadid
Julie Swisshelm
Dick Allen
Bruce Hendrickson (25)
David Womble
Ernie Brickell
Kevin McCurley
Robert Benner
Carl Diegert
Art Hale
Rob Leland (25)
Courtenay Vaughan
Steve Attaway
Johnny Biffle
Mark Blanford
Jim Schutt
Michael McGlaun
Allen Robinson
Paul Barrington
David Martinez
Dona Crawford
William Mason
Technical Library (5)
Technical Publications
Document Processing for DOE/OSTI (10)
Central Technical File
Charles Tong

