An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate...
Transcript of An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate...
![Page 1: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/1.jpg)
SANDIA REPORT SAND92–2765 UC–405 Unlimited Release Printed March 1993
RECORD
An Efficient Parallel Algorithm MICROFICHED
Bruce Hendrickson, Robert Leland, Steve Plimpton
Prepared by Sandia National Laboratories Albuquerque, New Mexico 87185 and Livermore, California 84560 for the United States Department of Energy under Contract DE-AC04-76DP00789
SF2900Q(8-81 )
![Page 2: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/2.jpg)
Iessued by Sandia National Laboratories, operated for the Unitid States Department of Energy by Sandia Corporation. NOTICE This report wee prepared aa an account of work eponeored by an agency of the Unitad States Government. Neither the United Statm Govern- ment nor any agency thereof, nor any of their employees, nor any of their contrac@e, subcontractor, or thei~ ernoployees, make? ~y warranty, express or un had, or aasumes any legai hablhty or responslbdity for the accuracy, r comp etaneaa, or uaefulneea of any information, apparatus, product, or proceea diecloeed, or re resents that ita use would not infringe privately
f ownad rights, Reference erein tQ any specific commercial product, proceae, or aeMce b trade name, trademark, manufacturer, or otherwise, does not
j neceaean conatituta or imply ite endorsement, recommendation, or favoring by the nited Statae Government, any ency thereof or any of their
3 contractor or mhcontract.ora. The viewa an opinions expressed herein do not neceaearily data or reflect those of the United States Government, any agency thereof or any of their contractme.
Printad in the United States of America, This report haa been reproduced directly from the beet available copy.
Available to DOE and DOE contractors from Office of Scientific and Technical Information PO Box 62 Oak Ridge, TN 97831
Pricee available from (615) 576-8401, FTS 626-8401
Available t.a the public from Nationai Technicai Information Service US Department of Commerce 5285 Port Ro al Rd
J Springfield, A 22161
NTIS rice codes a Printi copy A09
Microfiche COPF AO1
![Page 3: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/3.jpg)
SAND92-2765 Unlimited Release
Printed March 1993
Distribution Category UC-405
An Efficient Parallel Algorithm for Matrix–Vector Multiplication
Bruce Hendrickson, Robert Leland and Steve Plimpton Sandia National Laboratories
Albuquerque, NM 87185
Abstract
Abstract. The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific
computation. A fast parallel algorithm for this calculation is therefore necessary if we are to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix–vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/~ + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well–known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.
Key words. matrix-vector multiplication, parallel computing, hypercube, conjugate gradient method
AMS(MOS) subject classification. 65Y05, 65F1O
Abbreviated title. Parallel Matrix–Vector Multiplication
This work was supported by the Applied Mathematical Sciences program, U.S. Department of
Energy, Office of Energy Research, and was performed at Sandia National Laboratories, operated for
the U.S. Department of Energy under contract No. DE-AC04-76DP00789. 1
![Page 4: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/4.jpg)
1. Introduction. The multiplication of a vector by a matrix is the kernel computation in many
linear algebra algorithms, including, for example, the popular Krylov methods for solving linear and
eigen systems. Recent improvements in such methods, coupled with the increasing use of massively
parallel computers, require the development of efficient parallel algorithms for matrix–vector multiplica-
tion. This paper describes such an algorithm. Although the method works on all parallel architectures,
it is particularly well suited to machines with hypercube interconnection topology, for example the Intel
iPSC/860 and the nCUBE 2.
The algorithm described here was developed independently in connection with research on efficient
methods of organizing parallel many-body calculations (see [5]). We subsequently learned that our
algorithm is very similar in structure to a parallel matrix-vector multiplication algorithm described
in [4]. We have, nevertheless, chosen to present our algorithm because it improves upon that in [4] in
several ways: First, we specify how to overlap communication and computation and thereby reduce
the overall run time. Second, we show how to map, the blocks of the matrix to processors in a novel
way which improves the performance of a critical communication operation on current hypercube
architectures. And third, we consider the actual use of the algorithm within the iterative conjugate
gradient solution method and show how in this context a small amount of redundant computation can
be used to further reduce the communication requirements. By integrating these improvements we
have been able to achieve significantly better performance on a well known benchmark than has been
previously possible with a massively parallel machine.
A very attractive property of the new algorithm is that its communication operations are indepen-
dent of the sparsity pattern of the matrix, making it applicable to all matrices. For an n x n matrix on p
processors, the cost of the communication is O(n/@+ log(p)). However, many sparse matrices exhibit
structure which allows for other algorithms with even lower communication requirements. Typically
this structure arises from the physical problem being modeled by the matrix equation and manifests
itself as the ability to reorder the rows and columns to obtain a nearly block–diagonal matrix, where
the p diagonal blocks are about equally sized, and the number of matrix elements not in the blocks is
small. This structure can also be expressed in terms of the size of the separator of the graph describing
the nonzero structure of the matrix. Our algorithm is clearly not optimal for such matrices, but there
are many contexts where the matrix structure is not helpful (e.g. dense matrices, random matrices),
or the effort required to identify the structure is too large to justify. It is these settings in which our
algorithm is most appropriate and provides high performance.
This paper is structured as follows. In the next section we describe the algorithm and its com-
munication primitives. In §3 we present refinements and improvements to the basic algorithm, and
develop a performance model. In §4 we apply the algorithm to the NAS conjugate gradient benchmark
problem to demonstrate its utility. Conclusions are drawn in §5.
2. A parallel matrix–vector multiplication algorithm. Iterative solution methods for linear
and eigen systems are one of the mainstays of scientific computation. These methods involve repeated
matrix–vector products or matvecs of the form yi = Axi where the the new iterate, xi+1, is generally
2
![Page 5: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/5.jpg)
some simple function of the product vector yi. To sustain the iteration on a parallel computer, it is
necessary that xi+l be distributed among processors in the same fashion as the previous iterate xi.
Hence, a good matvec routine will return a yi with the same distribution as xi so that xi+1 can be
constructed with a minimum of data movement. Our algorithm respects this distribution requirement.
We will simplify notation and consider the parallel matrix-vector product y = Ax where A is an
n x n matrix and x and y are n-vectors. The number of processors in the parallel machine is denoted
by p, and we assume for ease of exposition that n is evenly divisible by p and that p is an even power
of 2. It is fairly straightforward to relax these restrictions.
Let A be decomposed into square blocks of size (n/@) x (n/@, each of which is assigned to
one of the p processors, as illustrated by Fig. 1. We introduce the Greek subscripts a and /3 running
from O to ~ – 1 to index the row and column ordering of the blocks. The (a, ~) block of A is denoted
by Aap and owned by processor Pap. The input vector z and product vector y are also conceptually
divided into @ pieces indexed by /3 and a respectively. Given this block decomposition, processor
Pap must know Z@ in order to compute its contribution to ya. This contribution is a vector of length
n/fi which we denote by zap. Thus Z.B = A.~zp, and y. = 2P .z.p where the sum is over all the
processors sharing row block a of the matrix.
Ya
2.1.
Fig. 1. Structure of matrix product y = Ax.
Communication primitives. Our algorithm requires three distinct patterns of communi-
cation. The first of these is an efficient method for summing elements of vectors owned by different
processors, and is called a fold operation in [4]. We will use this operation to combine contributions
to y owned by the processors that hold a block row of A. The fold operation is sketched in Fig. 2 for
communication among processors with the same block row index a. Each processor begins the fold
operation with a vector zap of length n/fi. The operation requires Iogz (W) stages, halving the length
of the vectors involved at each stage. Within each stage, a processor first divides its vector z into two
equal sized subvectors, Z1 and 22, aa denoted by (zl 122). One of these subvectors is sent to another
3
![Page 6: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/6.jpg)
processor, while the other processor sends back itscontribution tothesubvector which remained. The
received subvector is summed element–by-element with the retained subvector to finish the stage. At
the conclusion of the fold, each processor has a unique, length n/p portion of the fully summed vector.
We denote this subvector with Greek superscripts, hence Pap owns portion ya~. The fold operation
requires no redundant floating point operations, and the total number of values sent and received by
each processor is n/fi – n/p.
Fig. 2. The fo
?rocessor Pap knows zap E IRnl @
: := zap
?ori=O, . . ..log2(l-l
(Z, IZ2) = z
PaBI := PQO with i:h bit of D flipped
If bit i of/3 is 1 Then
Send Z1 to processor Pap,
Receive W2 from processor POPI
Zp := 22 i- W2
Else
Send Z2 to processor PaO,
Receive w 1 from processor P.OI
Z:=zl+wl
,Ufl := z
‘rocessor Pap now owns yafl E lRnJp
operation for processor PQOas part of block row a.
In the second communication operation each processor knows some information that must be
shared among all the processors in a column. We use a simple algorithm called ezpand [4], that
essentially uses the inverse communication pattern of the fold operation. The expand operation is
outlined in Fig. 3 for communication between processors with the same column index /3. Each processor
in the column begins with a subvector of length n/p, and when the operation finishes all processors
in the column know all n/@ values in the union of their subvectors. At each step in the operation a
processor sends all the values it knows to another processor and receives that processor’s values. These
two subvectors are concatenated, as indicated by the “1” notation. As with the fold operation, only a
logarithmic number of stages are required, and the total number of values sent and received by each
processor is n/fi – n/p.
The optimal implementation of the fold and expand operations depends on the machine topology
and various hardware considerations, e.g. the availability of multiport communication. There are,
however, efficient implementations on most architectures. On hypercubes, for example, these operations
can be implemented using only nearest neighbor communication if the blocks in each row and column
of the matrix are owned by a sub cube with W processors. On meshes, if the blocks of the matrix are
4
![Page 7: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/7.jpg)
Processor P08 knowsfl” EIR”IP
z:= @
Fori=log2(@) -1,...,0
Pot,@ := Papwithiih bitof a flipped
Send z to processor P0,,8
Receive w from processor P=t,D
If bit i of CYis 1 Then
z := W[z
Else
z := Zlw
Yo := z
Processor Pop now knows ya E lRnl@
Fig. 3. The expand operation for processor P&Oas part of block column /?,
mapped in the natural way to a square grid of processors, then the fold and expand operations can be
implemented efficiently [9].
The third communication operation in our matvec algorithm requires each processor to send a
message to the processor owning the transpase portion of the matrix, i.e. PQB sends to PB=. Since we
want row and column communication to be efficient for the fold and expand operations, this transpose
communication can be difficult to implement efficiently. This is because a large number of messages
must travel to architecturally distant processors, so the potential for message congestion is great. We
have devised an optimal, congestion-free algorithm for this operation on hypercubes which is discussed
in $3.1. Implementations of our matvec algorithm on other architectures may benefit from a similarly
tailored transpose algorithm. However, even if congestion is unavoidable, the length of the message
in the transpose communication step of our matvec algorithm is about @ less than the volume of
data exchanged in the fold and expand steps. Consequently, the transpose messages can be delayed by
O(@ without changing the overall scaling of the algorithm.
2.2. The matrix-vector multiplication algorithm. We can now present our algorithm for
computing y = Az in Fig. 4. Further details and enhancements are presented in the following section.
All the numerical operations in the algorithm are performed in steps (1) and (2). First, in step (1),
each processor performs the local matrix-vector multiplication involving the portion of the matrix
it owns. These values are summed within processor rows in step (2) using the fold operation from
Fig. 2, after which each processor owns n/p of the values of y. Unfortunately, the values owned by
processor Pap are just a subvector of y., whereas to perform the next matvec Pep must know all of
up. This is accomplished in steps (3) and (4). In step (3), each processor exch~ges its subsegment of
~ with the processor owning the transpose block of the matrix. After the transposition, the values of
VP are d~tributed among the procemors in COIUmII block 0 of A. The expand operation among these
procemors gives each of them all of yp, so the result is distributed as required for a subsequent matvec.
5
![Page 8: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/8.jpg)
We note that at this level of detail, the algorithm is identical to the one described in [4] for dense
matrices, but m we discuss in the next section, the detaila of steps (1), (2) and (3) are different and
result in a more efficient overall algorithm.
(1)
(2)
(3)
(4)
Processor Pap owns A.P and ZP
Compute zap = Aaezp
Fold zap within rows to form v“$
Transpose the y“~, i.e.
a) Send I@ to Ppe
b) Receive #o from Pp.
Expand #o within columns to form VP
Fig. 4. Parallel matrix-vector multiplication algorithm for processor Pep.
3. Algorithmic details
3.1. Transposition on
and refinements.
parallel computers. The expand and fold primitives used in the
matvec algorithm are most efficient on a parallel computer if rows and columns of the matrix are
mapped to subsets of processors that allow for fast communication. On a hypercube a natural subset
is a subcube, while on a 2-D mesh rows, columns or submeahes are possible. Unfortunately, such a
mapping can make the transpose operation inefficient since it requires communication between pro-
cessors that are architecturally distant. Modern parallel computers use cut-through routing so that a
single message can be transmitted between non-adjacent processors at nearly the same speed as if it
were sent between adjacent processors. Nevertheless, if multiple messages are simultaneously trying
to use the same wire, all but one of them must be delayed. Hence machines with cut through routing
can still suffer from serious message congestion.
On a hypercube the scheme for routing a message is usually to compare the bit addresses of the
sending and receiving processors and flip the bits in a fixed order (and transmit along the corresponding
channel) until the two addresses agree. On the nCUBE 2 and Intel iPSC/860 hypercubes the order
of comparisons is from lowest bit to higheat, a procedure known as dimension order muting. Thus
a message from processor 1001 to processor 0100 will route from 1001 to 1000 to 1100 to 0100. The
usual scheme of assigning matrix blocks to processors uses low order bits to encode the column number
and the high order bits to encode the row number. Unfortunately, dimension order routing on this
mapping induces congestion since messages from all the @ processors in a row route through the
diagonal processor. A similar bottleneck occurs with mesh architectures where the usual routing
scheme is to move within a row before moving within a column. Fortunately, the messages being
transposed in our algorithm are shorter than thcee in the fold and expand operations by a factor of
@ So even if congestion delays the transpose messages by @ the overall communication scaling of
the algorithm will not be affected.
On a hypercube, a different mapping of matrix blocks to processors can avoid transpose congestion
6
![Page 9: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/9.jpg)
altogether. With this mapping we still have nearest neighbor communication in the fold and expand
operations, but now the transpose operation is as fast as sending and receiving a single message of
length n/p. Consider a &dimensional hypercube where the address of each processor is a d-bit string.
For simplicity we assume that d is even. The row block number a is a d/2-bit string, w is the column
block number ~. For fast fold and expand operations, we require that the processors in each row and
column form a subcube. This is assured if any set of d/2 bits in the d-bit processor address encode the
block row number and the other d/2 bits encode the block column number. Now consider a mapping
where the bits of the block row and block column indices of the matrix are interleaved in the processor
address. For a 64–proceaaor hypercube (with 3-bit row and column addresses for the 8x8 blocks of
the matrix) this means the 6-bit processor address would be ~c2r1c1 roco where the three bits ~rl r.
encode the block row index and C2C1co encodes the block column index.
Note that in this mapping each row of blocks and column of blocks of the matrix still resides on
a subcube of the hypercube, so the expand and fold, operations can be performed optimally. However,
the transpose operation is now contention-free as demonstrated by the following theorem. Although
the proof assumes a routing scheme where bits are flipped in order from loweat to highest, a similar
contention free mapping is possible for any tlxed routing scheme as long as row and column bits are
forced to change alternately.
THEOREM 3.1. Consider a hypercube using dimension order routing, and map processors to
elements of an array in such a way thai the bit-repwsentations of a processor’s row number and column
number are interleaved in the processor’s bit-address id. Then ihe wires used when each processor sends
a message to the processor in the transpose location in the array are disjoint.
Proof. Consider a processor P with bit-address rbQr& lcb- 1... roco, where the row number is
encoded with rb . . . r., and the column number with cb . . . co. The processor PT in the transpose array
location will have with bit-address cbrbcb- 1r& 1 . . coro. Under dimension order routing, a message is
transmitted in as many stages as there are bits, flipping bits in order from right to left to generate
a sequence of intermediate patterns. After each stage, the message will have been routed to the
intermediate processor denoted by the current intermediate bit pattern. The wires used in routing the
message from P to P= are those that connect two processors whose patterns occur consecutively in the
sequence of intermediate patterns. After 2k stages, the intermediate processor will have the pattern
rbcb . e .rkckck_~rk_~ . . . coro. The bits of this intermediate processor are a simple permutation of the
original bits of P in which the lowest k pairs of bits have been swapped. Also, after 2k – 1 stages, the
values in the bit positions 2k and 2k – 1 are equal.
Now consider another processor P’ # P, and assume that the message being routed from P’
to Pm usea the same wire employed in step i of the transmission from P to PT. Denote the two
processors connected by this wire by P1 and Pz. Since they differ in bit position i, PI and Pz can only
be encountered consecutively in the transition between stages i - 1 and i of the routing algorithm.
Either i – 1 or i is even, so a simple permutation of pairs of bits of P must generate either P1 or P2; say
P.. Similarly, the same permutation applied to P’ must also yield either PI or Pz; say Pi. If P. = P!
then P = P’ which is a contradiction. Otherwise, both P1 and P2 must appear after an odd number of
7
![Page 10: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/10.jpg)
stages in one of the routing sequences. If i is odd then bits i and i + 1 of P must be equal, and if i is
even then bits i and i- 1 of P are equal. In either case, P1 = P2 which again implies the contradiction
that P = P’. Cl
3.2. Overlapping computation and communication. If a processor is able to both coinpute
and communicate simultaneously, then the algorithm in Fig. 4 has the shortcoming that once a proces-
sor has sent a message in the fold or expand operations, it is idle until the message from its neighbor
arrives. This can be alleviated in the fold operation in step (2) of the algorithm by interleaving com-
munication with computation from step (1). Rather than computing all the elements of Z=e before
beginning the fold operation, we should compute just those that are about to be sent. Then whichever
values will be sent in the next pass through the fold loop get computed between the send and receive
operations in the current pass. In the final pass, the values that the processor will keep are computed.
In this way, the total run time is reduced on each pass through the fold loop by the minimum of the
message transmission time and the time to compute” the next set of elements of za6.
3.3. Balancing the computational load. The discussion above has concentrated on the com-
munication requirements of our algorithm, but an efficient algorithm must also ensure that the com-
putational load is well balanced across the processors. For our algorithm, this requires balancing the
computations within each local matvec. If the region of the matrix owned by a processor has m’ nonze-
ros, the number of floating point operations (flops) required for the local matvec is 2m’ – n/@. These
will be balanced if m’ x m/p for each processor, where m is the total number of nonzero elements in
the matrix. For dense matrices or random matrices in which m >> n, the load is likely to be balanced.
However for matrices with some structure it may not be. For these problems, ogielski and Aiello have
shown that randomly permuting the rows and columns gives good balance with high probability [8]. A
random permutation has the additional advantage that zero values encountered when summing vectors
in the fold operation are likely to be distributed randomly among the processors.
Most matrices used in real applications have nonzero diagonal elements. We have found that when
this is the case, it may be advantageous to force an even distribution of these among processors and to
randomly map the remaining elements. This can be accomplished by first applying a random symmetric
permutation to the matrix. This preserves the diagonal while moving the off-diagonal elements. The
diagonal can now be mapped to processors to match the distribution of the y“~ subsegment that each
processor owns. The contribution of the diagonal elements can then be computed in between the send
and receive operations in the transpose communication, saving either the transpose transmission time
or the diagonal computation time, whichever is smaller.
3.4. Complexity model. The algorithm described above can be implemented to require the
minimal 2m – n flops to perform a matrix-vector multiplication, where m is the number of nonzeros
in the matrix. Some of these flops will occur during the calculation of the local matvecs, and the rest
during the fold summations. We make no assumptions about the data structure used on each processor
to compute its local matrix-vector product. This allows for the implementation of whatever algorithm
works best on the particular hardware. If we assume the computational load is balanced by using the
8
![Page 11: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/11.jpg)
techniques described in $3.3, the time to execute these floating point operations should be very nearly
(2m - n)THoP/p, where T~~P is the time required for a single floating point operation.
The algorithm requires Iogz (p) + 1 read/write pairs for each processor, and a total communi-
cation volume of n(2~ – 1) floating point numbers. Accounting for the natural parallelism in the
communication operations, the effective communication volume is n(2~ – 1)/p. Unless the matrix is
very sparse, the computational time required to form the local matvec will be sufficient to hide the
transmission time in the fold operation, as discussed in ~3,2. We will assume that this is the case.
Furthermore, we will assume that the transpose transmission time can be hidden with computations
involving the matrix diagonal, as described in $3.3. The effective communication volume therefore
reduces to n(@ – I)/p. The total run time, TtOtal can now be expressed as
(1) Ttot~ =2m–n—Gop + (1%2(P) + l)(T.end + ‘Geceive) + ‘(w – 1) Tt~~n*~i~ ,
P P
where TfiOPis the time to execute a floating point opdration, Tsend and &ceive are the times to initiate a
send and receive operation respectively, and TtrmBmit is the transmission time per floating point value.
This model will be most accurate if message contention is insignificant, as it is with the mapping for
hypercubes described in $3.1.
4. Application to the Conjugate Gradient algorithm. To examine the efficiency of our
parallel matrix-vector multiplication algorithm, we used it as the kernel of a conjugate gradient (CG)
solver. A version of the CG algorithm for solving the linear system Ax = b is depicted in Fig. 5. There
are a number of variants of the basic CG method; the one we present is a slightly modified version
of the algorithm given in the NAS benchmark [1, 3] discussed later. In addition to the matrix–vector
multiplication, the inner loop of the CG algorithm requires three vector updates (of z, r and p), as
well as two inner products (forming ~ and p’).
An efficient parallel implementation of the CG algorithm should divide the workload evenly among
processors while keeping the cost of communication small, Unfortunately, these goals are in conflict
because when the vector updates are distributed, the inner product calculations require communication
among all the processors. In addition, if the algorithm in Fig. 5 is implemented in parallel, each
processor must know the value of a before it can update r to compute p’ and hence ~. The calculation
of 7 = p~y, the distribution of -y, and the calculation of p’ = rTr can actually be condensed into two
global operations because the first two operations can be accomplished simultaneously with a binary
exchange algorithm. However these global operations are still very costly. One way to reduce the
communication load of the algorithm is to modify it as shown in Fig. 6.
This modified algorithm is algebraically equivalent to the original, but instead of updating r
and then calculating rTr, the new algorithm exploits the identity r~+lri+l = (r~ – @Y)T(ri – @Y) =
r~ri – ~yTri + cr2yTy, as suggested by Van Rosendale [10]. The values of 7, ~ and @ can be summed
with a single global communication, essentially halving the communication time required outside the
matvec routine. In exchange for this communication reduction, there is a net increase of one inner
product calculation since 4 = yTr and $ = yTy must now be computed, but @ = rTr need not
9
![Page 12: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/12.jpg)
X:=cl
r:=b
p:=b
p := rTr
Fori=l,. . .
y := Ap
‘y :=pTy
ff := p/7
x :=z+ap
r:=r — ay
p’ := rTr
P:= P’IP
P:= P’p:=r+pp
Fig. 5. A conjugate gradient algorithm.
be calculated explicitly. Since the vectors are distributed across all the processors, this requires an
additional 2n/p floating point operations by each processor in order to avoid a global communication.
Whether this is a net gain depends upon the relative sizes of n and p, as well sa the cost of flops and
communication on a particular machine, but since communication is typically much more expensive
per unit than computation, the modified algorithm should generally be faster. For the nCUBE 2,
the machine used in this study, we estimate that this recasting of the algorithm is worthwhile when
n< 5x105.
This restructuring of the CG algorithm can in principle be carried further to hide more of the
communication cost of the linear solve. That is, by repeatedly substituting for the residual and search
vectors r and p we can express the current values of these vectors in terms of their values k steps
previously. (General formulas for this process are given in [7].) By proper choice of k it is possible to
completely hide the global communication in the CG algorithm. Unfortunately this leads to a serious
loss of stability in the CG process which is expensive to correct [6]. We therefore recommend only
limited application of this restructuring idea.
The vector and scalar operations associated with CG fit conveniently between steps (3) and (4)
of the matrix–vector multiplication algorithm outlined in Fig. 4. At the end of step (3) the product
vector y is distributed across all p processors, and it is trivial to achieve the identical distribution for
z, r and p. Now all the vector updates can proceed perfectly in parallel. At the end of the CG loop,
the vector p can be shared through an expand operation within columns and hence the processors will
be ready for the next matvec. The resulting algorithm is sketched in Fig. 7.
We implemented a double precision version of this algorithm in C on the
hypercube at Sandia’s Massively Parallel Computing Research Laboratory.
10
1024 processor nCUBE 2
The resulting code was
![Page 13: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/13.jpg)
Fig. 6. A mc
E:=()
r := b
D:=b
o := rTr
Fori=l,.. .,n
y := Ap
~ := p2’y
~,= #’r
$ := #y
Q := p/~
p’:=p–m$+az$
p:= #/p
P:=P’ ‘
l?:=z+ap
r:=r–ffy
p:=r+~p
ified conjugate gradient algorithm.
tested on the well–known NAS parallel benchmark problem proposed by researchers at NASA Ames
[1, 3]. The benchmark uses a conjugate gradient iteration to approximate the smallest eigenvalue of
a random, symmetric matrix of size 14,000, with an average of just over 132 nonzeros in each row.
The benchmark requires 15 calls to the conjugate gradient routine, each of which involves 25 passes
through the innermost loop containing the matvec.
This benchmark problem has been addressed by a number of different researchers on several
different machines [2]. A common theme in this previous work has been the search for some exploitable
structure within the benchmark matrix. Since arbitrary restructuring of the matrix is permitted by
the benchmark rules as a pre-processing step, the computational effort expended in this search for
structure is not counted in the benchmark timings.
In contrast, our algorithm is completely generic and does not require any special structure in the
matrix. The communication operations are entirely independent of the zero/nonzero pattern of the
matrix, and the only advantage of reordering would be to lessen the load on the most heavily burdened
processor. Because the benchmark matrix diagonal is dense, we did partition the diagonal across all
processors, as described in ~3.3. Otherwise, we accepted the matrix as given, and made no effort to
exploit structure.
Our implementation solved the benchmark problem in 6.09 seconds, which compares quite favor-
ably with all other published results on massively parallel machines [3]. For comparison, the recently
published times for the 128 processor iPSC/860 and 32K CM-2 are 8.61 and 8.8 seconds respectively,
which is more than 4070 longer than our result. Although this problem is highly unstructured, our
11
![Page 14: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/14.jpg)
2rocefjsOr PPU owns ‘W
c,r, p,b, ~ E IR”IP, zP, p” G IRnlfi
E:=o..—.—b
>:=b
? := r=r
Sum ~ over all processors to form p
Expand p within columns to form p“
Tori =l,. . .
Compute ZP = AP.P.
Fold ZP within rows to form y~v
Transpose y~”, i.e.
Send y~” to P.P ‘
Receive y := y“f’ from P.P
~ := pTy
(j := y=,
J) := y=y
Sum ~, ~ and ~ over all processors to form -y, ~ and ~
Q := p/y
p’:= p–afj+cd$
/3:= p’/p
p := p’
Z:=z+ap
r:=r+(ry
p:=r+flp
Expand p within columns to form p.
Fig. 7. A parallel CG algorithm for processor F’PV.
C code achieves about 250 Mflops, which is about 12% of the raw speed achievable by running pure
assembly language BLAS on each processor without any communication.
5. Conclusions. We have presented a parallel algorithm for matrix–vector multiplication, and
shown how this algorithm can be used very effectively within the conjugate gradient algorithm. The
communication cost of this algorithm is independent of the zero/nonzero structure of the matrix and
scales aa n/@. Consequently, the algorithm is most appropriate for matrices in which structure is
either difficult or impossible to exploit. This is clearly the case for dense and random matrices, and
it is also true more generally for sparse matrices in many contexts. For example, our algorithm could
serve aa an efficient black-box routine for prototyping sparse matrix linear algebra algorithms or could
be embedded in a sparse matrix library where few assumptions about matrix structure can be made.
12
![Page 15: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/15.jpg)
On the NAS conjugate gradient benchmark, an nCUBE 2 implementation of this algorithm runs
more than 40’%0faster than any other reported algorithm running on any massively parallel machine.
The particular mapping we employ for hypercubes is likely to be of independent interest. This
mapping ensures that rows and columns of the matrix are owned entirely by subcubes, and that with
cut–through routing the transpose operation can be performed without message contention. This
mapping haa already proved useful for parallel many–body calculations [5], and is probably applicable
to other linear algebra algorithms.
Acknowledgements. We are indebted to David Greenberg for assistance in developing the hy-
percube transposition algorithm in $3.1.
REFERENCES
[1] D. H. BAILEY, E. BARSZCZ, J. T. BARTON; D. S. BROWNING, R. L. CARTER, L. DAGUM,
R. A. FATOOHI, P. O. FREDERICKSON, T. A. LASINSKI, R. S. SCHREIBER, , H. D. SIMON,
V. VENKATAKRISHNAN, AND S. K. WEERATUNGA, The NAS parallel benchmarks, Intl. J.
Supercomputing Applications, 5 (1991), pp. 63-73.
[2] D. H. BAILEY, E. BARSZCZ, L. DAGUM, AND H. D. SIMON, NAS parallel benchmark results, in
Proc. Supercomputing ’92, IEEE Computer Society Press, 1992, pp. 386-393.
[3] D. H. BAILEY, J. T. BARTON, T. A. LASINSKI, AND H. D. SIMON, EDITORS, The NAS parallel
benchmarks, Tech. Rep. RNR-91-02, NASA Ames Research Center, Moffett Field, CA, January
1991.
[4] G. C. Fox, M. A. JOHNSON, G. A. LYZENGA, S. W. OTTO, J. K. SALMON, AND D. W.
WALKER, Solving problems on concurrent processors: Volume 1, Prentice Hall, Englewood
Cliffs, NJ, 1988.
[5] B. HENDRICKSON AND S. PLIMPTON, Parallel many-body calculations without all-to-all com-
munication, Tech. Rep. SAND 92-2766, Sandia National Laboratories, Albuquerque, NM,
December 1992.
[6] R. W. LELAND, The Effectiveness of Parallel Iterative Algorithms for Solution of Large Sparse
Linear Systems, PhD thesis, University of Oxford, Oxford, England, October 1989.
[7] R. W. LELAND AND J. S. ROLLETT, Evaluation of a parallel conjugate gradient algorithm, in
Numerical methods in fluid dynamics III, K. W. Morton and M. J. Baines, eds., Oxford
University Press, 1988, pp. 478-483.
[8] A. T. OGIELSKI AND W. AIELLO, Sparse matriz computations on parallel processor arrays, SIAM
J. Sci. Stat. Comput., 14 (1993). To appear.
[9] R. VAN DE GEIJN, Eficient global combine operations, in Proc. 6th Distributed Memory Com-
puting Conf., IEEE Computer Society Press, 1991, pp. 291-294.
[10] J. VAN ROSENDALE, Minimizing inner product data dependencies in conjugate gradient iteration,
in 1983 International conference on parallel processing, H. J. Siegel et al., eds., IEEE, 1983,
pp. 44-46.
13
![Page 16: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/16.jpg)
EXTERNAL DISTRIBUTION:Raymond A. BairMolecuhu Science Reach. Cntr.Pacific NW hborato~Richland, WA 99362
Falcon AFB, CO 80912-51MfJ
R W. AlewineDARPA/RMO1400 Wilmn Blvd.Arlington, VA 222o9
Bud Brewster410South PierceWheaton, IL 80187
R. E. BakDept. of MathematicalUniv. of CA at San DiegoLa Jolla, CA 92093
Carl N. BrooksBrc&a Associates414 Falls RoadChagrin Fafls, OH 44022
F~ AltabdiUS Air Force Weapono LabNuclear Technology Branch
Kirtfand+ AFB, NM 87117-6008Ken Bannister
US Army Bdfistic Res. LabAttrx SLCBIVIB-MAberdeen Prov. Gmd., MD 21005
John Brunet245 Fmt St.Cambridge, MA 02142
-5086John BnmoCntr for Comp. Sci and EngrCollege of EngineeringUnivemity of California
Santa BarbM~ CA 931-5110
Mad w. AllenThe Waif Street Joumaf1233 Regal hW
Daflas, TX 75247Edward BarragyDept. ASE/EMUniversity of TexasAustin, TX 78712
M. AlmeAlme and Associates6219 Bright PlumeC&unbii, MD 21044
E. BammNAS Applied Rewarch BranchNASA Ames Ikearch cent.Moffett Field, CA 94035
H. L. BuchananDARPA/DSO1400 Wifson Blvd.Arlington, VA 22209
ChaAa E. AndersonSOuthweat ~ InstitutePO Drawer 28610San Antonio, TX 78284
W. BeckAT&T Pixel Machinea, 4J-214
Crawfords Ccuner Rd.Homdel, NJ 07733-1%8
D. A. BudSupercornputing Reach. Cntr,
17100 Saence Dr.Bowie, MD 20715
Dan AndersonFord Motor Co., Suite 1100Vie Pke22400 Michigan Ave.Dearbcrn, MI 48124 David J. fkuuq
Dept. of AMES FLO1lUniv. of California at San DiegoLa Jolla, CA 92093
B. L. BuzbeeScientific Computing Dept.NCARPO Box 3000Boulder, CO 80307
Andy ArenthNational Security AgencySlwageRoadFt. Meade, MD 207ssAttn: C6
Mylea R. BergLockheed, 0/62-30, B/1501111 Lockheed Way%nqyvale, CA 9408%3.504
G. F. CareyDept. of Engineering MechanicsTICOM ASEEM WRW 305University of Texan at AustinAustin, TX 78712
Greg AstfdkConvex Computer (hp.3(N)0 Waterview ParkwqyPO Box 833851Richrmdao% TX 750s3-3851
Stephan BilykUS Army Baflist. Reach. LabSLCBKTB-AMBAberdeen Proving Ground, MD21OOWIO66
Art CarfmnNOSC Code 41New London, CT 06320Susan R. Atlas
TMC, MS B258
Center for Nonlinear StudiesLos Alarnos Nationaf LabsLos Ahunos, NM 87545
Rob BisselingShelf Research B.V.Postbus 30031003 AA Amsterdam
The Netherlands
Bonnie C. CarrolfSec. Dir.
CENDI, Information Iut’1PO Box 4141Oak Ridge, TN 37831L. Audander
DARPA/DSO1400 Wifson Blvd.
Arfington, VA 222o9Matt BlumrichDept. of Comp. SaencePrinceton UniversityPrinceto~ NJ 0s644
Charfm T. CasaleAberdeen Group, Inc.92 State StreetBoston, MA 02109D. M. Austin
Army High Per Comp. Rea. Cntr.University of Minnesota1100 S. Second St.Mirmeapolie, MN 5s41s
B. W. BoehmDARPA/ISTO1400 Wilson Blvd.Arlington, VA 222o9
J. M. CavdliniUS Department of EnergyOSC, ER-30, GTNWasbingto~ DC 20585
ScottBadmUnivemity of Cdiforni% Stm DiegoDept. of Computer Saence95OOGifman DriveEngirmxing0114
La Jofla, CA 92091-CX)14
R. R. BorchcrL-889Lawrence Lhennore Nat’1 LabsPO Box 808Livermore, CA 94550
John ChampineClay Rea. Inc., Software Div.655F Lone Oak Dr.Eagan, MN 55121
Tony ChanDepartment of Computer ScienceThe Chinese Univemity of Hong KongShatin, NTHong Kong
F. R. BaileyMS2Cxl-4NASA Ames Research Center
Moffett Field, CA 94o35
Dan BOWhISMail Code 8123Attm D. Bowlus & G. l..diecqNaval Underwater Sys. Cntr.Newport, RI 02841-5047
D. H. BaileyNAS Applied Research BranchNASA b- Research CenterMoffett F]eld, CA 94o35
J. ChandraArmy Research OfficePO Box 12211Resch Triangle Park, NC 27709
Dr. Donald BrandMS - N8930Geodynamica
14
![Page 17: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/17.jpg)
Siddmtba ChatterjeeRIAcsNASA Ames Reaeara C.terMail Stop T04~lMofTett Field, CA 94~l(W
IBM Corporation472 Wheelers F- IbadMilford, CT 06460
Yale UniversityPO Box 2158New Haven, CT 06520
J. K. CrdlumThorruu J. Watmn Resch. CenterPo Box 218Yorktown Heights, NY 10598
H. ElmanComputer Science Dept.University of MarylaudCollege Park, MD 20842Wamm Chemock
Scientific Advisor DP.1US Department of EnergyForentd Bldg. 4A.045Washington, DC 20585
Leo DagmnComputer Sciences Corp.NASA Ames Research CenterMoffett Field, CA 94CL3S
M. A. EfmerDARPA/RMO1400 Wilson Blvd.Arfington, VA 222o9
R.C.Y. ChinL-321La wre.nceLiverrmwe Nat’1 LabPO Box 808
Livermore, CA 9455o
Kenneth I. DaughertyChief ScientistHQ DMA (5A), MS -A-16
8613 Lee HighwayFairfax, VA 22031-2138
J. N. EntzmingerDARPA/TTO1400 When Blvd.
Arlington, VA 222o9
Mark ChristenL122Lawreme Llvennore Nat’1 LabPO %X 808Livemnme, CA 9455o
A. M. ErisrnanMS 7L21Boeing Computer ServicesPO Box 24346Seattle, WA 98124-0346
L. DavisCray Research Inc.1168 Industnd Blvd.Chippawa Fafls, WI s472s
M. CirnentAdv. Sci. Comp/ Div. RM 417National Science FoundationWashington, DC 20ss0
Mr. Frank R. DeisMartin MariettaFalcon AFB, CO 80912-50t13
R. E. EwingMathernatica Dept.University of W yomingPO Box 3036 Univemity StationLammie, WY 82071R. A. DeWlllo
Comp. & Comput. Reach.,Rm. 304 ,National Science FoundationWashington DC 20550
Richard CfaytorUS Department of EnergyDef. Prog. DE1
Foreatd Bldg. 4A-014Washington, DC 20585
El Dabaghi FadiCharge of ResearchInstitute Nat’1 De Recherche enInformatique et en AutomatiqueDornaine de Voluceau RomuencourtBP 10578153 Le Chemay Cedex (France)
L. DengApplied Mathematics Dept.SUNY at Stony Brook
Stony Brook, NY 11794-3&lI
Andy ClearyCentm for Information ResearchAustrafia National UniversityGPA Box 4CanberrA ACT 26o1Australia
H. D. F&Institute for Adv. Tech.4032-2 W. Braker LaneAustin. TX 78759
A. Trent DePersiaProg. Mgr.DARPA/ASTO1400 Wifscm Blvd.Arlington, VA 22209-2308
T. ColeMS180-500Jet Prop. Lab4800 Oak Grove Dr.,P.amdena. CA 911OP
Kurt D. FickieUS Army Ballistic Resch. LabATTN: SLCBR-SEAberdeen Proving Ground, MD21OWA4M6
Sean DolannCUBE919 E. Hilbdafe Blvd.Foster City, CA 944o4Monte Cole-
US Army Bd. Reach LabSLCBR-SEA (Bldg. 394/216)Aberdeen Prov. Gmd., MD 21005-5006
Tom Finnegan
NORAD/USSPACECOM J2XSSTOP 35PETERSON AFB, CO 80914
Jack DongamaDepartment of Computer ScienceUnivemity of TermeaseeKnoxville, TN 37996Tom Coleman
Dept. of Computer ScienceUpeon HdlComelf UniversityIthaca, NY 14853
J. E. FlahertyComputer Science Dept.Remselaer Polytedr Inst.Troy, NY 12181
L. DowdyComputer Science DepartmentVanderbilt UniversityNashville, TN 37235
S. ConeyNCUBE19A Davis DriveBelmont, CA 94oO2
L. D. FoedickUniversity of ColoradoComputer Science DepartmentCampus Box 43oBoulder, CO 8030P
Joanne Downey-Burke8030 Sangor Dr.Colorado Springs, CO 80920
J. CoronesAmes Laboratory236 Wilhelm HdlIowa State UniversityAmes, IA 50011-3020
L S. DutTCSS DivisionHarwelf LaboratoryOxfordshire,OX11 ORAUnited Kingdom
G. C. FoxNortheast PamUef Archit. Cntr.111 College PlaceSyracuse, NY 13244
Steve COogrOveE6220KliOb Atomic power LabPO Box 1072Schenectady, NY 12301-1072
Alan EdelmanUniversity of California, BerkeleyDept. of MathematicsBerkefey, CA 94720
R. F. FreundNRaD- Code 423San D,ego, CA 991525000
Sveme FmyenSolar Energy Research Inst.1617 Cole Blvd.
S. C. EuenstatComputer Sdence Dept.
15
C. L. Crothera
![Page 18: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/18.jpg)
Golden, CO 80401 Boa 2008Oak Ridge, TN 37831
United Kingdom
W. D. HillisThinking Machinea, Inc.245 First St.Cambridge, MA 02139
David GaleIntel Corp600 S. Cherry St.Denver, CO 802221~1
Anne GrcenbaumNew York UniversityCourant Institute251 Mercer StreetNew York, NY 10012-1185 Dan Hitchmck
US Department of EnergySCS, ER-30 GTNWashington, DC 20585
D. B. GrmnonComputer %ience Dept.Indhma UniversityBloomington, IN 47401
Satya Gupta
Intel SSDBldg. C)&O9, Zone 814924 NW Greenbriar,PwkyBeaverton, OR 97cE36
LTC Richard HochbewrgSDIO/SDAThe PentagonWaahingto~ DC 20301-7100
C. W. GearNEC Rcseadr Institute4 Independence WayPrinceton, NJ 08548 J. Guetafson
Computer Sdence Department
236 Wilhelm HallIowa State UniversityAmes, IA 50011
J. A. GeorgeNeedles HalfUniversity of Water400Waterloo, Ont., Can.N2L 3G1
C. J. HollandDirectorMath and Information SciencesAFOSR/NM, Boiling AFBWashingto~ DC 20332-6448R. Graysnrr Hall
USDOE/HQ11XH3Independence Ave, SWWashington, DC 20585
Shomit GhosenCUBE919 E. HilhuLale Blvd.Foster City, CA 94404
Dr. Albert C. HoltOft. of MunitionsOfc of Sec. of St.-ODDRE/TWPPentagon, Room 3B106oWashington, DC 203301-311Xl
Cuah HandenMinnesota Supercomputer Cntr.1200 Washington Ave. So.Minneapolis, MN 55415
Clarence Giese8135 Table Mesa WayColorado Springs, CO 80919
Mr. Daniel HoltzmanvanguardR?SearChInc.
10306 Eaton P]., Suite 450Fairfax, VA 22030-2201
Steve Hammond ,NCARPO Box 3000Borddm, CO 80307
Dr. Horst GietlnCUBE GmbHHammer Strame 858000 Mrmich 50Germany
David A. HopkinsUS Army Ballistic Resch. Lab.Attention: SLCBfVIB-MAber-decn Prov. Gmd., MD 21W5-5066
CDR. D. R. HamonChief, Space Integration Div.
ussPAcEcoM/J5sIPeterson AFB, CO 809145003
John GlbertXerox PARC3333 Coyote Hill Roadpalo Alto, CA 94304
Graimm HortonUnivemitat Erlangen-NurnbergIMMD IIIMartenmrtrase 38520 ErlangenGermany
Dr. James P. HardyNTBIC/GEODYNAMICSMS N 893oFalcon AFB$ CO 60912-LYYJo
Micbnel E. Giltrud
DNAHQ DNA/SPSD6801 Telegraph Rd.Alexan&la, VA 2231CL3398
Doug HarlessNCUBE2221 E~t Lamar Blvd., Suite 36oArlington, TX 76oo6
Fred HowesUS Department of EnergyOSC, ER-30, GTNWashington, DC 20585AktairM. Glass
AT&T Bell Labe Rm IA-1646(KI Mountain AvenueMurray Hill, NJ 07974
Mike HeathUniversity of Illinois4157 Beckman Institute405 N. Mathews Ave.Urbana, IL 61801
Chua-Huang Hu~Assist. Prof. Dept. Comp. & Info SciOhio State Uuiv.228 Bolz Hall-2036 Neil Ave.Columbus, OH 43210-1277
J. G. GlirnmDept. of App Math. & Stat.
State U. of NY at Stony BrookStony Brook, NY 11794 Greg Heileman
EECE DepartmentUnivemity of New MexicoAlbuquerque, NM 87131
R. E. HuddlestonL61Lawrence Llvermore Nat’1 LabPO Box 808Liverrnore. CA 9455o
Dr. Raphael GluckTRW-DSSG, R4/1408One Space Par-kRedondo Bea+ CA 90278 Brent Henrich
Mobile R &D Laboratory13777 Midway Rd.PO Box 819047DaU=, TX 75244-4312
Zdenek JohanThinking Machines Corp.245 First StreetCambridge, MA 02142-1264
G. H. GolubComputer =Ience Dept.Stanford UniversityStanford, CA 943o5
Michael HerouxCray Research Park655F Lone Oak DriveEagrm, MN 55121
Gary JohnsonUS Department of EnergySCS, ER30 GTNWashington, DC 20585
Marcia GrabowAT&T Bell Labe ID-1536CCIMountain Ave.
PO Box 636Murray Hill, NJ 07974-0636 A.J. Hey
University of SouthamptonDept. of Electronic and ComputerMountbatten Bldg., HighfieldSouthampton, S095NH
16
S. Lenmwt JohnssonThinking Machines Corp.
Science 245 First StreetCambridge, MA 02142-1264
Nancy GradyMS 6240Oak Ridge Nat’] Lab
![Page 19: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/19.jpg)
G. S. Jones
Teeh Program Support Cntr.DARPA/AVSTO1515 Wifson BIvd.Arlington, VA 222o9
lIXIO Independence Ave.Washington, DC 20585
Kelly AFBSan Antonio, TX 782435~
Ann KrauseHQ AFOTEC/OANKirtl.and AFB, NM 87117-7(XI1
H. MairNaval Surface Warfare Center10901 New Hampshire Ave.Silver Springs, MD 2090&5000T. H. Jordan
Dept of Earth, Atmoa & Pla. Sci.MITCambridge, MA 02139
V. KumarComputer Science D~mentUnivemity of MinnesotaMinneapolis, MN 55455
Henry MakowitzMS - 2211-INELEG&E Idaho IncorporatedIdaho F.dlE, ID 83415M. H. Kdoa
Cornell Theory Center514A Eng. and Theory CenterHoy Road, Cornell UnivemityIthaca, NY 14853
J. LarmuttiMS B-166Director, SC. Reach. InstituteFlorida State UnivemityTallahassee, FL 32306
David ManddIMS F663Hydrodynamic App. Grp. X-3IAM Alamos Nat’] LabsLos Alanms, NM 87545H. G. Kaper
Math. and Comp. %]. Division
Argonne National LaboratoryArgonne, IL 60439
P. D. Lax
New York University-Courant251 Mercer St.
New York, NY 10012
T. A. ManteufTelDepartment of Mathematics
University of Co. at DenverDenver, CO 80202S. Karin
SuperComputing Department9.5MIGilman DriveUniversity of CA at San DiegoLa Jofla, CA 92093
Lawrence A. LeeNC Supercomputing CenterPO Box 126893021 Comwdlis Rd.Research Triangle Park, NC 27709
William S. MarkLockheed - Org, 96-01Bldg. 254E3251 Hanover StreetPalo Alto, CA 943031191Herb Keller
Applied Math 217-50Cd TechPaaadena, CA 91125
Dr. H.R. LelandCalspan CorporationPO Box 400,Buff.do, NY 14225
Kapit MathurThinking Machines Corporation245 Fimt Street
Cambridge, MA 0214>1214M. J. KelleyDARPA/DMO1400 Witson Blvd.Arlington, VA 222C9
David LevineMath & Comp. S&meArgonne National Laboratay9700 Cam Avenue SouthArgonne, IL 60439
John MayKanutn Sciences Corporation1500 Garden of the Gods RoadColorado Springn, CO 60933K. W. Kennedy
Computer Science DepartmentRice UniversityPO Box 1892Houston. TX 77251
Peter LlttlewoodTheoret. Phy. Dept.AT&T Belt LabeRan lD-?35Murray Hill, NJ 07974
William McColtOxford Univ. Computing Lab6-11 Keble RoadOxford, OX1 3QDUnited KingdomAram K. Kevorkian
Codje 73L14Naval Ocean Systems Center271 Catalina Blvd.San Diego, CA 92152-5000
Peter LomdafdT-II, MS B262Los Alamoa Nat’] LabLos AhllllOS, NM 87545
S. F. McCormick
Computer Mathematical GroupUnivemity of CO at Denver1200 Larimer St.Denver, CO 80204John E. Killougb
University of HoustonDept. of Chem. EngineeringHouston, TX 77204-4792
L-As S. LemeSDIO/TNI
The PentagonWashington, DC 20301.7tO0
J. R. McGrawL-316Lawrence Livermore Nat’1 Lab
D. R. KincaidCntr. for Num. Andy.,RLM l&150Univemity of Texaa at AustinAustin, TX 78712
PO Box 808
Col. Gordon A. LongDeptuty Director for Adv. Comp.HQ USSPACECOM/JOSDEPSPeterson AFB, CO 6(X)145(X)3
Livermore, CA 9455o
Jill MesirovThinking Machines Corporation245 F]mt StreetCambridge, MA 0214Z1214T. A. Kitchens
US Department of EnergyOSC, ER-30, GTNWashington, DC 20565
John LOll3258 Caminito AmecaLa Jolla, CA 92037 P. C. Messina
158-79Mathematics & Comp Sci. Dept.CaltechPasadena, CA 91125
Daniel LoyemKoninklijke/ShelL LaboratoriumPostbus 30031C02 AA AmsterdamThe Nethertamb
Thomas Klemas394 Briar LaneNewark, DE 19711
Prof. Ralph Metmlfe
Dept. of Mech. Engr.University of Houston4600 Calhoun Road
Houston, TX 772044792
Dr. Peter L. KnepellNTBIC/GEODYNAMICSMS N 8930Falcon AFB, CO 80912-50W
Robert E. LynchDept. of CSPurdue UniversityWest Lafayete, IN 479o7
Max KoontzDOE/OAC/DP 5.1Forestal Bldg
G. A. MichaelL306Lawrence Livermore Nat Lab
Kathy MacLeodAFEWC/SAT
17
![Page 20: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/20.jpg)
PO Box 808Liverrnore, CA 94550
Lam K. MillerGoodyear Tire & RubberPO Box 3531Akron, OH 4430P-3531
Robert E. MilleteinTMC245 Fint StreetCambridge, MA 02142
G. MohnkemNOSC - Code 73San Diego, CA 92152-50W
C. MolerTbe Mathworke24 prime Pti WayNati~, MA 01760
Gery MontrySouthwcat software11812 Pemimrrton, NEAlbuquerque, NM 87111
N. R. MorseC-DO, MS B260Comp. & Comm .DivisionLoe Alamoe National LabLoe Alamoe, NM 87546
J. R. Medic
IBMThomas J. Wateon Raech Cntr.PO Box 704Yorktown Heighte, NY 10698
D. B. Neleon
US Department of EnergyOSC, E&30, GTNWeehington, DC 20s65
Jeff NewmeyerOrg. 81-04, Bldg. 1571111 Lockheed WaySunnyvale, CA 9406$3504
D. M. NoeenchuckMech. and Aero. Engr. Dept.D302 E QuadPrinceton UniversityPrinceton, NJ 08544
C. E. Oliver
Offc of Lab Comp Bldg. 4500N,Oak Ridge Nat’1 LaboratoryPO Box 2006Oak Ridge, TN 37631-6259
Dennis L. OrphalCdif Reach& Technology Inc.5117 Jolmeon Dr.Pleasanton, CA 94586
J. M. OrtegaApplied Math DepartmentUnivemity of VirginiaCharlottesville, VA 22903
John PalmerTMC245 Fimt St.Cambridge, MA 02142
Robert J . PaluckConvex Computer Corp.
3000 Waternew Parkw~
PO Box 733851Ricbrdaon, TX 75083-3851
Anthony C. ParmeeCo-nor and Attde (Def.)British Embsssy31W MaLw. Ave, NWWashingtorL DC 20008
s. v. ParterDepartment of Mathemati=Van Vleck HallUniversity of WisconsinMadison, WI 537o6
Dr. Nieheeth PatelUS Army Ballistic Ftesch. Lab.AMXBR-LFDAberdeen Prov. Gmd., MD 21005-5066
A. T. PateraMechanical Engineering Dept.77 Maseachueetto Ave.
MITCambridge, MA 02139
A. PatrinoeAtmoe. and Cfirn. Resch. DivOSice of Engy ResclL ER-74US Deprutment of EnergyWaAingtoaL D~ 20545
R. F. PeierfsMath. Saenca Group, Bldg. S1SBrookhawrt National Labupton, NY 11973
Donna PerezNOSC - MCAC Reeource Cntr.
Code 912San Diego, CA 92152-5LXKI
K. PerkoSupercomputing Solutions, Inc.6175 Mancy Ridge Dr.San Diego, CA 92121
John PetreskyBallistic Research LabSLCBRLF-C
Aberdeen Prov. Gmd., MD 21MM-5006
Linda PetzoldL-316Lawrence Livennore Natl . Lab.Livermore, CA 9455o
Wayne PfeifferSan Diego SC CenterPO Box 856o8San Diego, CA 92136
Frank X. PfennebergerMartin MariettaMS-N33104National Test BedFalcon AFB, CO 80912-~
Dr. Leslie PierreSDIO/ENAThe PentagonWashington, DC 20301-7100
Paul PlaesmanMath and Computer Science DivisionArgonne National LabArgonne, IL 6043S
R. J. PlemmoneDept. of Math. & Comp Sci.Wake Forest UniversityPO Box 7311Wimtort Salem, NC 27109
Alex PothenComputer Science DepartmentUnivemity of WaterlooWaterloo, Ontario N2L 301Canada
John K. PrenticeAmparo Corporation37OORio Grande NW, Suite 5Albuq., NM 87107-3042
Peter P. F. RadkowekiPO Box 1121LOU Alamoe, NM 87644
J. RattncrIntel Scientific Computere
15201 NW Greenbriar Pkwy,Beaverton, OR 97oo6
J. P. RetelleOrg. 94-90Lakheed - Bldg. 254G32S1 Hanover StreetPalo Alto, CA 94304
C. E. RhoadeaL298Computational Phyeia Div.PO Box 808Lawrence Livermore Nat’1LabLivennore, CA 84550
J. R. RiceComputer Science Dept.Purdue UniversityWest Lafayette, IN 479o7
John RichardsonThinking Machines Corporation245 Fhat StreetCambridge, MA 02142-1214
Lt. Col. George E. RichieChief, ADv. Tech PlansJOSDEPSPeterson AFB, CO 80914-5fK13
John RollettOxford University Computing Laboratory
%11 Keble RoadOxford, OX1 3QDUnited Kingdom
R. Z. h!?kk
Physics and Astronomy Dept.100 Allen HallUniversity of PittsburgPittsburg, PA 15206
Diane RnverMichigan State UnivemityDept. of Electrical Engineering260 Engineering Bldg.East Lansing, MI 48824
Y. sadUniversity of Minnesota4-192 EE/CSci Bldg.2LMUnion St.Minneapolis, MN 5545&O159
18
![Page 21: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/21.jpg)
P. wappmDept. of Conlp. & Info SCienaOhio State Univ.-228 Bolz HaU2036 Neil Ave.Colurnbu, OH 43210-1277
Anthony SkjellumLawrence Lkmrnore National Laboratol7000 E-ad Ave., Mail Code L316Llvermore, CA 94550
Gligor A. Taddcovicb
“Y PO Box 2%Pound Ridge, NY 1057& cY296
H. TeutebergCray Research, Suite 8306565 American Pkway, NEAlbuquerque, NM 87110
L. S-Director, Supercbmputer Apps.152 sup —puter ApplicationsBldg. 605 E. SphgfieldChrunpaign, IL 618431
Jod SdtzComputer Science DepartmentA.V. Williams Building
university of MarylandCollege Park, MD 20742
A. TludcrDivision of Math SciencesNational ScienceFoun&IonWaahingto~ DC 20550
Wk R. SomakyBallistic Remueb LaboratorySLCBR-SEA, Bldg. 394Aberdeen Proving Ground, MD 21005
A. H. SarnehCSRD305Tdbot LaboratoryUniversity of Illinois104 s. wright St.Urbana, IL 61801
Allan Ton-es125 Lincoln Ave., Suite 400Santa Fe, NM 87501D. C. Sorenson
Department of Math Scien-Rice UniversityPO BOX1892Houston, TX 77261
Harold ‘lYeMe
Loe Alamoa National LabPO Box 1666, MS F663Los Ahrnos, NM 87545
P. E. SaylorDept. of Comp. Saence222 Digital Computation LabUnivunity of IllinoisUrbana, IL 61801
s. SquiresDARPA/ISTO1400 Wilson Blvd.Arlington, VA 11109
Randy ‘humanMechanical Engineering Dept.University of NMAlbuquerque, NM 87131LCDR Robert J. sCbOppG
Chief, Operatiom RqmtsUSSPACECOM/JOSDEPS (Stop 35)Peterson AFB, CO 80914
N. SrinivasanAMOCO CorpPO 87703Chicago, IL 60680-0703
Ray TuminaroCERFACS42 Ave Gustaw Coriolis31057 Toulouse Cedexl%nce
Rob SCbreibcrRIAcsNASA Am- ReeearcbCemterMail Stop T04$1Moffett Field, CA 9403$1OW
Thomas StegmannDigital Equipment Corporation8085 S. Cbeskr StreetEngiewood, CO 80112
Mark UrmtiaIntel Corp., CO1-035200 NE Elam Youns f%vy.Hilleboro, ORE 971246497M. H. Sclodtz
Department of Computer ScienceY.de Univen3ityPO Box 2158New Haven, CT 06520
D. E. SteinAT&T100 South Jefferson Rd.Whippally, NJ 07981
Mike Uttormark
UW-Madiaonlsw Johnson Dr.Madison, WI 537o6M. Steuerwdt
Division of Math Sciences
National Science FoundationWashington, DC 20550
Da= SchwartzNOSC, Code 733San Diego, CA 92152A5C4Kl
R. VanDeGeijnComputer Science DepartmentUniversity of TexasAustin, TX 78712Mark Seager
LLNL, L80PO box 803Livermore, CA 94550
Mike StevensnCUBE919 E. Hillsdale Blvd.Foster City, CA 94404
George VandergriftDist. Mgr.Convex Computer Corp.3916 Juan Tabo, NE, Suite 38Albuquerque, NM 87111
A. H. Sh—Sa. Computing Amoc. Inc.Suite 307, 246 Church StreetNew Haven, CT 06510
G. W. StewartComputer Stience DepartmentUniversity of MarylandColfege Park, MD 20742 H. VanDerVomt
Delft University of TechnologyFacufty of MathematicsPOB 35626oO AJ DdftThe Netherlands
O. StOrasdiMS-244NASA Langley Research Cntr.Hampton, VA 23665
Horst Simon
NAS Systems Divti]onNASA Amen Research CenterMail Stop T045-1MofTett Field, CA 94035
c. StwtDARPA/TTO1400 Wilson Blvd.Arlington, VA 22209
c. vanLaanDepartment of Computer ScienceCornell University, Rm. 5146Ithaca, NY 14853
Richard SincovecOak Ridge National Laboratory
P.O. Box 2008, Bldg 6012Oak Ridge, TN 37831-6367
John VanRosenddeICASE, NASA Langley Researrh CenterMS 132CHampton, VA 23665
LTC Jarnea SweederSDIO/SDAThe PentagonWashington, DC 20301-7100
Vineet SingbHP Labs, Bldg. lU, MS 141501 Page Mill RcAP.do Alto, CA 94304
Steve VavasisDept. of Computer ScienceUpson HallCOmeU UnivemityIthaca, NY 14853
R. A. TapiaMathematical =1. DeptRice UniversityPO Box 1892Houston, TX 77251
Aehok SingludCFD Reach. Center3325 ‘IMana Blvd.Huntsville, AL 35805
19
![Page 22: An Efficient Parallel Algorithm MICROFICHEDIn §4 we apply the algorithm to the NAS conjugate gradient benchmark problem to demonstrate its utility. Conclusions are drawn in §5. 2.](https://reader033.fdocuments.in/reader033/viewer/2022042023/5e7b26e0c12135404a25b495/html5/thumbnails/22.jpg)
R. G. VoigtMS 132-CNASA Langley Ikcb Cntr, ICASEHampton, VA 36665
Phuong Vu
Cray ReaeadL Inc.19607 pram RoadHouston, TX 77084
David WalkerBldg 6012Oak Ridge National LabPO Box 2008Oak Ridge, TN 37B31
Stevm J. WallachConvex Computer Corp.3000 WaterView ParkwWPO Box 833851Ricbardno~ TX 75083-3651
R. C. WardBid. 9207-AMathernatimf %iencenOak Ridge National LabPO Box 4141Oak Ridge, TN 37831-8083
Thomas A. WeberNational Science Found18Lx3G. Street, NWWashington, DC 20550
G. W. WeigandDARPA/CSTO37o1 N. Fairfax Ave.Arfington, VA 22203-1714
M. F. WheelerMath Sciences DeptRice universityPO Box 1892Houston. TX 77251
A. B. WhiteMS-265k Ahmms National LabPO Box 16433Los Alamm, NM 87544
B. WilcoxDARPA/DSO1400 Wilson Blvd.Arlington, VA 222o9
Roy WifliamsCalifornia Institute of Technology20&49Pasadena, CA 91104
C. W. Wif.90nMS MI02L3/BllDigitaf Equipment Corp.146 Main StreetMaynar~ MA IM175
K. G. WilsonPhysics Dept.Ohio State UniversityColumbus, OH 43210
Leonad T. WifncmNSWCCode G22Dahlgmn, VA 22448
Peter WOlochOw
fntef Corp., CO1-0352OONE Elam Young Pkwy.Hiflsboro, OR 97124-6497
P. R. WoodwardUniversity of MinnesotaDepartment of Astronomy
116 Chumh Street, SEMinneapolis, MN 55455
M. Wunderli&Math. Sciencm ProgramNational Security AgencyFt. George G. Mead, MD 20755
Hishashi YasumoriKTEC-Kawasaki SteelTechrwrenearch CorporationHiblya Kokusai Bldg. 2-3Uchisaiwaicho ‘khromeChiyoddm, Tokyo lCII
David YoungCenter for Numerical Analysis
RLM 13.150The University of TexaxAustin, TX 78713-8510
Robert YoungAlcoa LaboratoriesAlcoa Centw, PA 15069
Attm R. Youn8 & J. McMichaef
Wilfiam ZierkeApplied Research LabPenn State.PO Box 30State CoUege, PA 168C14
INTERNAL DISTRIBUTION:Pauf FleuryEd BamisSudip DosanjhBilf CampDoug ClineDavid Gardner
Grant HeffelfingerScott HutchinsonMartin LewittSteve PlimptonMark SearsJohn SbadidJulie SwisshelrnDick AllenBruce Hendrickson (25)David WombleErnie BrickellKevin McCurleyRobert BennerCarl DiegertArt HaleRob Leland (25)Courtenay VaughanSteve AttawayJohnny Bif3eMark BlanfordJim SchuttMichael McGlaunAllen RobinsonPauf BarringtonDavid MartinezDons CrawfordWilliam Mason
Technical Library (5)Technical PublicationDocument Processing forDOE/OSTI (10)Centraf Technical FiteCharles Tong
20
140014021421142114211421142114211421142114211421142214221422142314231424
14241424142414241425
1425142514251431143114321434lWO195271417151
7613-28523-28117