Parallel algorithms of the Purcell method for direct solution of linear systems

Parallel algorithms of the Purcell methodfor direct solution of linear systems q

Ke Chen a,*, Choi H. Lai b

a Department of Mathematical Sciences, University of Liverpool, Peach Street, Liverpool L69 7ZL, UKb School of Computing and Mathematical Sciences, University of Greenwich, Wellington Street, Woolwich,

London SE18 6PF, UK

Received 20 March 2001; received in revised form 27 January 2002; accepted 30 April 2002

Abstract

In this paper, we first demonstrate that the classical Purcell’s vector method when com-

bined with row pivoting yields a consistently small growth factor in comparison to the well-

known Gauss elimination method, the Gauss–Jordan method and the Gauss–Huard method

with partial pivoting. We then present six parallel algorithms of the Purcell method that may

be used for direct solution of linear systems. The algorithms differ in ways of pivoting and load

balancing. We recommend algorithms V and VI for their reliability and algorithms III and IV

for good load balance if local pivoting is acceptable. Some numerical results are presented.

� 2002 Elsevier Science B.V. All rights reserved

AMS: 65Y05; 65R20; 65F05

Keywords: Linear systems; Purcell elimination method; Gauss–Huard method; Parallel algorithms;

Distributed computing

1. Introduction

Direct solution of linear systems of equations is a well-studied subject. By ‘direct’,one is often referring to the Gaussian elimination method as discussed in matrixcomputation textbooks (see e.g. [7,9,12,24,25]). Parallelised algorithms based onthis method are also well developed and tested (see [8,22]). The Gauss–Jordan (GJ)

www.elsevier.com/locate/parco

Parallel Computing 28 (2002) 1275–1291

qThis research was partially supported by a London Mathematical Society Scheme 4 Grant (Ref. 4222).*Corresponding author.

E-mail addresses: [email protected] (K. Chen), [email protected] (C.H. Lai).

URLs: http://www.liv.ac.uk/�cmchenke, http://www.gre.ac.uk/.

0167-8191/02/$ - see front matter � 2002 Elsevier Science B.V. All rights reserved

PII: S0167-8191 (02 )00133-3

mail to: [email protected]

http://www.liv.ac.uk/~cmchenke, http://www.gre.ac.uk/

http://www.liv.ac.uk/~cmchenke, http://www.gre.ac.uk/

elimination method is often considered inferior to the Gaussian method because theformer requires more flops (floating point operations; more precisely a flop is definedas a multiplication operation plus an addition). Two variants of the GJ method, theGauss–Huard method [3,16] and the Purcell method [23], are of special interest be-cause both require a flop count comparable to the Gaussian method. The relation-ship of the former with the Gaussian method and GJ was shown in [6,14] whilethat of the latter with the GJ in [19]. The pivoting strategy used in these variantsmimics the GJ method with row pivoting the reliability of which was establishedin [4]. Parallel algorithms based on the Gauss–Huard method were developed in[5,6,15]. This paper will address parallel algorithms for the Purcell method––a topic,to our knowledge, not yet considered in the literature.

However, while it is clear that the Gauss–Huard method and the Purcell methodwithout pivoting give rise to identical results to the GJ method, the pivoted versionsare different. We have observed that the Purcell method with row pivoting yields aconsistently smaller growth factor than the Gaussian, GJ and the Gauss–Huardmethods. As the Purcell method itself is not well-known (see [1]), we believe thatthe above observation deserves more attention and justifies further study on the Pur-cell method.

In Section 2, we shall first give a short description of the Purcell method and thencompare it with the Gaussian, GJ and the Gauss–Huard methods to illustrate ourobservation. In Section 3, we shall present parallel algorithms based on the Purcellmethod, addressing the issues of pivoting and load balancing. In Section 4, we brieflydiscuss the complexity of the proposed algorithms. In Section 5, we show some nu-merical experiments from solving several dense linear systems.

To proceed, define a linear system in the usual notation

Ax ¼ b ð1Þ

where A 2 Rn�n, b 2 Rn. We remark that both the Gauss–Huard method and thePurcell method share the advantage that at step k of elimination, rows j ¼k þ 2; . . . ; n are neither required nor needed to be updated. Therefore both methodscan be considered as flexible elimination methods [20,21]. That is, the direct solutionprocess can be combined with the elimination and coefficients forming (rows of A)processes to achieve better performance. For some applications such as the solutionof boundary element equations [21], this flexible elimination aspect of methods ispotentially useful but remains to be fully explored.

2. The Purcell method and numerical experiments

The method is based on the concept of orthogonality of vectors. Suppose A in (1)is nonsingular and the coefficients of the ith equation of the system Ax ¼ b, where½Aij ¼ aij and ½bi ¼ bi, are written as the augmented vector Ci ¼ ½ai1; ai2; . . . ;ain biT, then a vector V is said to be the solution of the system if the last compo-

nent of V is one and V is orthogonal to all Ci, i.e., CTi V ¼ 0, for i ¼ 1; 2; . . . ; n.

1276 K. Chen, C.H. Lai / Parallel Computing 28 (2002) 1275–1291

The above orthogonality conditions will be satisfied in a step-by-step manner. Theorder in which such conditions are met can be arbitrary [21]. Here we assume thatconstraints C1;C2; . . . ;Cn are eliminated in turns, although any permutation of thisorder is allowed. Let C ¼ CðnÞ ¼ fC1;C2; . . . ;Cng. For i ¼ 1; 2; . . . ; n 1, define theset

CðiÞ ¼ fC1;C2; . . . ;Cig:Clearly we have CðiÞ ¼ Cði1Þ [ Ci with Cð0Þ ¼ f;g empty. Similarly for i ¼ 1; 2; . . . ; n,define RðiÞ as the subspace, of dimension i, of the vector space Rnþ1 which consists ofvectors orthogonal to Cðnþ1iÞ. Let Rðnþ1Þ ¼ Rnþ1. We shall use matrix VðiÞ to denotethe basis vectors of RðiÞ. That is,

VðiÞ ¼ V ðiÞ1 V ðiÞ

2

�� jV ðiÞi

h iðnþ1Þ�i

and therefore (note the solution vector V 2 Rð1Þ)

RðiÞ ¼ span V ðiÞ1 ; V ðiÞ

1 ; . . . ; V ðiÞi

� �:

The basis for the ðnþ 1Þ-dimensional space Rðnþ1Þ may be chosen as the naturalbasis, i.e.

Vðnþ1Þ ¼ V ðnþ1Þ1 � � � V ðnþ1Þ

nþ1

h i¼ ½ 1 0 � � � 0 T � � � ½ 0 � � � 0 1 T

� �:

ð2ÞWe are now ready to state the Purcell method. The objective is to reduce the largesolution space Rðnþ1Þ of dimension (nþ 1) to the final solution subspace Rð1Þ of di-mension 1.

Starting from this known space Rðnþ1Þ, at step ðnþ 1 iÞ for each i from n to 1,subspace RðiÞ can be constructed by performing linear combinations of a chosen vec-tor (the pivot) from the basis Vðiþ1Þ with the remaining vectors, subject to the condi-tion that the resulting vectors are orthogonal to Cnþ1i. More specifically, forCnþ1i 2 Cðnþ1iÞ, the main construction involves the following (for k ¼ 1; . . . ; iÞ

V ðiÞk :¼ akV

ðiþ1Þsð‘ðiÞÞ þ V ðiþ1Þ

mðkÞ ; CTnþ1iV

ðiÞk ¼ 0 ð3Þ

i.e.

ak ¼ CT

nþ1iVðiþ1ÞmðkÞ

CTnþ1iV

ðiþ1Þsð‘ðiÞÞ

ð4Þ

where 16 sð‘ðiÞÞ;mðkÞ6 iþ 1; sð‘ðiÞÞ 6¼ mðkÞ, and ‘ðiÞ ¼ nþ 1 i (i.e. ‘ðnÞ ¼ 1;‘ðn 1Þ ¼ 2 etc.). Here the pivot index sð‘ðiÞÞ is so selected that jakj6 1 i.e. thedenominator on the right hand side of (4) is the largest. Once the final subspace Rð1Þ,of dimension 1, is found, its basis vector V ð1Þ

1 is orthogonal to every vector inCðnÞ ¼ C. Thus this vector gives rise to the solution of the system Ax ¼ b.

We observe that, by construction, the vector V ðiÞk is orthogonal to each vector of

Cnþ1i � C. Pivoting by the above choice of sð‘ðiÞÞ and mðkÞ leads to a more reliable

K. Chen, C.H. Lai / Parallel Computing 28 (2002) 1275–1291 1277

method than the Gaussian, GJ and the Gauss–Huard [16] methods as illustratedshortly. For the pivoted version, the Purcell method as described can reduce tothe Gauss–Huard method [14,16] if we restrict the choice of sð‘ðiÞÞ and impose thecondition that 16 sð‘ðiÞÞ6 i. Recall that for a Purcell method we have16 sð‘ðiÞÞ6 iþ 1. This seemingly simple restriction that distinguishes the Purcellmethod from the Gauss–Huard method turns out to be a vital condition whichmakes the former a better method. In contrast the Purcell method with the choiceof sð‘ðiÞÞ ¼ 1 and mðkÞ ¼ k þ 1, is equivalent to a GJ elimination and the Gauss–Huard method [16], without pivoting. However, the unpivoted version is not useful.

Finally from Rðiþ1Þ ¼ RðnÞ� span ðCnþ1iÞ, we can summarise the Purcell methodin terms of subspace decomposition. We can derive the following

Rðnþ1Þ ¼ Rðnþ1Þ ¼ RðnÞ � spanðC1Þ

¼ Rðn1Þ � spanðC1;C2Þ

..

.

¼ RðjÞ � spanðC1;C2; . . . ;Cnþ1jÞ

..

.

¼ Rð2Þ � spanðC1; . . . ;Cn1Þ

¼ Rð1Þ � spanðC1; . . . ;CnÞ ¼ Rð1Þ � rangeðCÞ:

Before we discuss parallel algorithms, we give some examples.

2.1. Solution of a 4� 4 system by the Purcell method

To illustrate the sequential method, we now consider the following example

5 1 2 1

2 10 3 1

1 4 8 2

6 2 4 20

266664

377775

x1x2x3x4

2664

3775 ¼

173541102

2664

3775;

CT1

CT2

CT3

CT4

2664

3775 ¼

5 1 2 1 172 10 3 1 351 4 8 2 416 2 4 20 102

2664

3775:

Step 1, i ¼ n ¼ 4 : sð4Þ ¼ 5;mðkÞ ¼ 1; 2; 3; 4 CT1 Vðiþ1Þ ¼ ½ 5 1 2 1 17

VðiÞ ¼ Vðiþ1Þ

1

1

1

1

a1 a2 a3 a4

26666664

37777775¼

1 0 0 00 1 0 00 0 1 00 0 0 1

0:2941 0:0588 0:1176 0:0588

266664

377775:

Step 2, i ¼ n 1 ¼ 3 : sð3Þ ¼ 1;mðkÞ ¼ 2; 3; 4 CT2 Vðiþ1Þ ¼ ½8:2941 7:9412

1:1176 1:0588


VðiÞ ¼ Vðiþ1Þ

a1 a2 a3

11

1

2664

3775 ¼

0:9574 0:1348 0:12771 0 00 1 00 0 1

0:3404 0:0780 0:0213

266664

377775:

Step 3, i ¼ n 2 ¼ 2 : sð2Þ ¼ 1;mðkÞ ¼ 2; 3 CT3 Vðiþ1Þ ¼ ½9 4:6667 1

VðiÞ ¼ Vðiþ1Þa1 a2

11

24

35 ¼

0:3617 0:02130:5185 0:1111

1 00 1

0:2545 0:0591

266664

377775:

Step 4, i ¼ n 3 ¼ 1 : sð1Þ ¼ 1;mðkÞ ¼ 2 CT4 Viþ1 ¼ ½18:7549 14:0662

VðiÞ ¼ Vðiþ1Þ a1

1

� �¼

0:250:500:751:000:25

266664

377775; V ¼ V ð1Þ

1

V ð1Þ1

h inþ1

¼ VðiÞ

0:25¼

1:02:03:04:01:0

266664

377775:

2.2. Comparison of the Purcell method with other related methods

We now use four related examples from [9,12,13] to demonstrate that the Purcellmethod has better stability properties. These examples are often used to test growthfactors of the Gaussian method. Define the usual growth factor [9] by

q ¼ maxi;j;k

jAðkÞij j

kAk1;

where i; k ¼ 1; 2; . . . ; n and j ¼ 1; 2; . . . ; n; nþ 1 (we assume that Ai;nþ1 ¼ bi, see also[12,13]). For the Purcell method we measure the growth of the components of all V ðiÞ

k .

1. Example 1, with l ¼ 0:4 : ½Aij ¼ ½A1ij ¼1; if i ¼ j or j ¼ n;l; if i > j;0; otherwise:

8<:

2. Example 2, transpose of A1 : A ¼ AT1 .

3. Example 3 (with l ¼ 1 in A1): ½Aij ¼ ½A3ij ¼1; if i ¼ j or j ¼ n;1; if i > j;0; otherwise

8<:

4. Example 4, transpose of A3 : A ¼ AT3 .

Table 1 shows results of solving the linear system (1) for the above 4 examplesusing

Gaussian (c)––the Gaussian elimination method with complete pivoting,Gaussian (p)––the Gaussian elimination method with partial pivoting,


Huard [16]––the Gauss–Huard method with partial pivoting,Purcell [23]––the Purcell method with partial pivoting.

The exact solution is chosen as x�i ¼ 1 that defines the right hand side b. In thetable, ‘‘ffl’’ indicates failure by a numerical method and ‘‘_’’ shows a loss of accu-racy due to large growth factors. Note an accuracy is considered acceptable if it isless than 102e � 1013 with e associated with double precision.

Table 1

Comparison of the Purcell method with other well-known direct methods

Problem Size n Method Growth q Accuracy kx x�k2 Failure

1 30 Gaussian (c) 1.4 3:8� 1014

Gaussian (p) 1:7� 104 6:0� 1011 _

Huard [16] 1:1� 101 3:6� 1014

Purcell [23] 1:3� 101 3:9� 1014

60 Gaussian (c) 1.4 1:3� 1013

Gaussian (p) 4:2� 108 2:8� 106 fflHuard [16] 5:6� 102 1:2� 1013

Purcell [23] 6:6� 102 1:2� 1013

2 30 Gaussian (c) 1.4 4:6� 1014

Gaussian (p) 1.4 5:0� 1014

Huard [16] 5:8� 102 5:9� 1011 _

Purcell [23] 1:2� 101 3:9� 1014

60 Gaussian (c) 1.4 1:6� 1013

Gaussian (p) 1.4 1:9� 1014

Huard [16] 7:0� 106 2:4� 104 fflPurcell [23] 5:8� 102 1:6� 1013

3 30 Gaussian (c) 2.0 1:1� 1014

Gaussian (p) 5:4� 108 3:0� 107


60 Gaussian (c) 2.0 5:5� 1014

Gaussian (p) 5:8� 1017 1:5� 102 fflHuard [16] 3:4� 102 1:2� 1013

Purcell [23] 4:3� 102 1:5� 1013

4 30 Gaussian (c) 2.0 5:8� 1015

Gaussian (p) 2.0 5:8� 1015


60 Gaussian (c) 2.0 7:9� 1014

Gaussian (p) 2.0 7:9� 1014



The results clearly demonstrate that the performance of the Purcell method isclose to Gaussian (c), and much better than Gaussian (p) and Huard [16]. Moreover,the partial pivoting used by the Purcell method is inexpensive (unlike Gaussian (c)),making it a serious candidate for wider applications and more research and develop-ment.

3. Parallel methods

We now present six parallel methods implementing this sequential Purcell method.As also observed in [20,21], the method performs only three main calculations––scalar–vector products, vector products and vector additions. Therefore the paralleli-sation is about these data calculations.

Denote by p the number of parallel processors that are accessible, and nj ¼ n=pfor j ¼ 1; . . . ; p. Before starting Step 1, we shall decide on the first pivot processorp1 and allocate n1 þ 1 columns of Vðnþ1Þ to processor p1 and nj columns of Vðnþ1Þ

to other processors. Note that finding p1 is easy because pivoting at Step 1 amountsto finding the maximum element’s position in vector C1. This strategy will be used inAlgorithms IV–VI. However in Algorithms I–III we shall make the easy choice ofp1 ¼ 1.

Thus the amount of storage required on processor i is ðnþ 1Þ � ni, correspondingto ni column vectors. Specifically, it is appropriate to assume that after Step 1 thefollowing matrix of size ðnþ 1Þ � ni is stored on processor i

V ¼ V1jV2j � � � jVni½ ðnþ1Þ�ni

which corresponds (globally) to the vectors

V ðnÞtð1;iÞjV

ðnÞtð2;iÞj � � � jV

ðnÞtðni;iÞ

h i

where (note tðnp; pÞ ¼ ðp 1Þnp þ np ¼ n

tðk; iÞ ¼ ði 1Þni þ k for k ¼ 1; 2; . . . ; ni:

That is, the subspace RðnÞ is split into smaller subspaces in p processors. For each stepj ¼ 1; 2; . . . ; n, let the jth pivoting vector V ðnþ1jÞ

sðjÞ reside in processor pj; herepj 2 f1; 2; . . . ; pg.

Observe that once data are distributed as above the pivoting vector V ðnþ1jÞsðjÞ at

each step j has to be broadcast from processor pj to all other processors; we shalldenote such an 1-to-p communication operation by ‘bcast’. In implementation it isimportant to realize that V ðnþ1jÞ

sðjÞ has at most j nonzero positions (e.g. 1 nonzeroat step j ¼ 1 and 3 nonzeros at step j ¼ 3). One other observation is that the proces-sor that holds the pivot vector (i.e. uses ‘bcast’) will have one less vector to workwith; this fact may be called dimension reduction for simplicity. Therefore an idealload balancing will be achieved if pj (the pivoting processor) takes all values from theset f1; 2; . . . ; pg (in any order) every p steps.

We first consider how to achieve load balancing (Algorithms I–III) using localpivoting, setting aside exact implementation of the global pivoting strategy. Then


we consider similar ideas in trying more exact implementation of the global pivotingstrategy (Algorithms IV–VI).

3.1. Algorithm I

The first and simplest parallel algorithm is thus to allow each processor to workwith their own vectors. The pivoting vector VsðjÞ step j is selected from processor 1,and needs to be communicated to all other processors. Due to dimension reduction,processor 1 will be idle after n1 steps, and then the pivoting vector VsðjÞ will be selectedfrom processor 2 and so on.

The overall algorithm can be clearly illustrated in Fig. 1. There an empty box in-dicates an idle processor. Obviously since some processors become idle quite early,overall work load is not well balanced. Below we consider two better methods to im-prove on load balancing.

Fig. 1. Illustration of Algorithm I.


3.2. Algorithm II

The load unbalancing in Algorithm I is caused by dimension reduction in the ac-tive processor (in the natural order 1; 2; . . . ; p). One way to balance dimension reduc-tions is for other processors in turn to send a basis vector to the active processor.Thus all processors will reduce their dimensions at the same speed and consequentlywork load is balanced. Moreover, if the sent vector is the local pivot, then selectionof local pivot at the active processor is effective among two processors. The overallalgorithm can be illustrated in Fig. 2. Here ‘shift’ means sending a basis vector

Fig. 2. Illustration of Algorithm II.


(systolic) to the active processor. As mentioned, the choice of this basis vector islinked to local pivoting i.e. select that particular vector which has the largest productwith vector Cnþ1j at step j.

3.3. Algorithm III

In Algorithm II, sending a basis vector from one processor to the active processorrequires more communication time. Here we propose to avoid this communicationby alternating the active processor in the order f1; 2; . . . ; ng. Thus we can achieveload balancing by synchronizing dimension reduction every p steps. Note that onlylocal pivoting is used. The algorithm can be illustrated in Fig. 3.

3.4. Algorithm IV

To maintain load balance and work towards global pivoting across processors, wepropose a strategy of combining global and local pivoting. By ‘local’, we again meanthat the pivot processor alternates in all p processors every p steps of the Purcellmethod. However different from Algorithm III, the alternating sequence of the activeprocessor is determined by ‘global’ pivoting within every p steps of elimination i.e.pj’s are given a chance to take any permutation of f1; 2; . . . ; pg. Thus we can achieveload balancing by synchronizing dimension reduction every p steps.

More specifically we select that particular vector, among vectors in all processorsthat have not yet been used for pivots within every p steps, which has the largestproduct with vector Cnþ1j at step j. That is, at the start, the selection is among p pro-cessors and at the next step among p 1 processors. This process is repeated every psteps. To avoid possible breakdown of the approach i.e. when the active pivot ele-ment is zero or too small, we revert to Algorithm V (i.e. let the pivot be selectedacross all processors) for this particular step if this occurs.

The above four algorithms, though different in load balancing, implement localpivoting and thus do not enforce global (row) pivoting as the sequential methoddoes. Therefore, although useful for some problems where pivoting is not necessary(e.g. symmetric positive definite systems), they may run into problems if the under-lying growth factor is too large. This will be tested in Section 5.

Below we shall present two algorithms (V and VI) that attempt the global pivotingstrategy.

3.5. Algorithm V

We first present a direct approach for the global pivoting strategy. By ‘direct’, wemean that the pivot processor is globally selected which will be used to ‘bcast’ thepivot vector (and consequently reduces its dimension by one). The overall patternof pj’s will be irregular but pivoting is carried out globally. In general, we do not ex-pect a good load balance with this algorithm. Depending on values of pj’s, the algo-rithm can be identical to Algorithms I and III for some special linear systems.


3.6. Algorithm VI

Motivated by Algorithm II, where increased communication by shifting work en-sures load balance, we now try to achieve load balance by shifting vectors betweenprocessors. Our idea is to set up a parallel cyclic counter c that takes values fromf1; 2; . . . ; pg cyclically; whenever c ¼ 1 we expect the number of vectors Vj’s to be

Fig. 3. Illustration of Algorithm III.


reduced by one and if the underlying processor is not the pivot processor, we proposeto shift a vector V1 from it to the pivoting processor. This forces an even dimensionreduction across processors.

Initially c ¼ i on processor i i.e. at Step 1 we expect p1 ¼ 1 and at Step 2 we expectp2 ¼ 2 etc otherwise we shift a vector to the pivoting processor as in Algorithm II.Therefore this achieves the aim of load balance while maintaining the global row piv-oting. It is possible to adapt the idea further for specific problems with some predict-able pivoting patterns.

4. Analysis of complexity

Flop counts are usually proportional to execution time of an algorithm. Howeverfor parallel algorithms, apart from flop counts on each processor, communicationtime will depend on the amount of data communicated. Such an amount of data willbe measured by the communication count––the total number of data elements com-municated (mainly with the data type of double precision). As shown in [20], thetotal flop count for step i of the sequential method is Fi ¼ ði 1Þðnþ 1ði 1ÞÞ þ ðiþ 1Þðnþ 1 iÞ ¼ 2iðnþ 1Þ 2i2 þ i 1 and thus the total flop countis F ¼

Pni¼1 Fi ¼ n3=3þ 3n2=2þ n=6. This determines the sequential time ts. It re-

mains to consider the parallel solution time Ts by calculating the new flop countand comparing it with the sequential count F. Remark that our complexity analysis,aiming to compare algorithms, will not include assembly of matrix A as this task iscommon to all cases and can be done in parallel.

For Algorithm I, processor p is occupied throughout. Let i0 ¼ n np þ 1. Theflop count is

FI ¼Xnnp

i¼1

ði 1Þnp þðiþ 1Þnp þXn

i¼i0

f½ði 1Þþ ðiþ 1Þ½np þ 1ði i0Þ ði i0 þ 2Þg

¼ n3½1=p 1=p2 þ 1=ð3p3ÞþOðn2Þ:

The communication count is CI ¼Pn

i¼1 iþ 1 ¼ n2=2þ 3n=2.For Algorithms II and III, all processors are occupied throughout the calculation

except in the last few steps. Thus ignoring low order terms of n,

FII ¼ FIII ¼Xnnp

i¼1

2iðnp int½i=pÞ n3=ð3pÞ þOðn2Þ;

where int[i=p] denotes the integer part of ratio i=p i.e. int½i=p ¼ 0; . . . ;0; 1; . . . ; 1; 2; . . . ; 2; . . . as the dimension in processor p is reduced by one after p steps.Therefore the flop count of Algorithms II and III is 1/3 of Algorithm I. However thecommunication count for Algorithm II is more than I and III i.e. CIII ¼ CI andCII ¼ 2CI ¼ n2 þ 3n. For moderately large problems where communication timebecomes dominant, the performance of I and III should be similar and better than II.But this depends on the relative speed of the processors and communication.


The complexity analysis of Algorithm V depends on the unknown pivoting pat-tern (i.e. pj’s) and therefore the flop count of V satisfies: FI P FV P FIII. As this algo-rithm has the same amount of communication as I but implements global pivoting,we expect its performance to be better than Algorithm I.

By construction, Algorithms IV and VI should have an identical flop count to FIIIassuming the breakdown avoiding step in IV is not active i.e. FIV ¼ FVI ¼ FIII �n3=ð3pÞ. If the breakdown avoiding step is active, then FIV is close to FV i.e.FI P FIV P FIII because load imbalance will then be permitted to allow global pivot-ing. The communication cost of Algorithm VI should be similar to II whilst the com-munication cost of IV is comparable to III (or I).

In summary, this analysis suggests that

• If pivoting is not important but communication can be expensive, the followingalgorithms are fast: I, III, IV, V.

• If pivoting is important but communication is inexpensive, then use: V, VI.• If both pivoting and communication are not important issues, the following are

fast: II, III, IV, VI.• If both pivoting and communication are important issues, the following should be

used: V.

Therefore the most robust algorithm will be either V or VI and this preliminaryconclusion will be tested in the next section.

5. Numerical examples

As shown earlier (in Section 2), the Purcell method in its sequential form appearsto produce a consistently small growth factor in comparison to other well-known di-rect elimination methods, making it a potential winner of all practical direct meth-ods.

Here we concentrate on experimenting the parallel algorithms proposed and com-paring them with a parallel GJ method with global (partial) pivoting. We remarkthat favourable comparison results with the Gaussian method have been reportedin [6,14,19]. We shall use the four examples given in Section 2 plus the following ex-ample (Problem 5): computing the boundary element equations for the potential flowpast over an ellipse as tested in [21] (see also [2,11,17,18]). The tests were initially car-ried out on a SGI IP25 machine with 14 processors and then repeated on a SGI IP30with 2 processors; in both cases, we used Fortran 77 with double precision and MPIdirectives [10].

5.1. Reliability and accuracy test

Here we test as in Section 2 how accurate the different algorithms with our pivot-ing strategies are for the five problems; each problem is solved with n ¼ 32, 64, 128,256, 512, 1024, 2048 equations. Table 2 shows the details of our experiments, where


Table 2

Comparison of reliability and accuracy of parallel direct methods on SGI IP25

� Method GJ Algorithm I Algorithm II Algorithm III Algorithm IV Algorithm V Algorithm VI

p ¼ 1 2 4 8 1 2 4 8 2 4 8 2 4 8 2 4 8 2 4 8 2 4 8

1 . . . . M . . . . . . . . . . . . M M M M M M

2 . . . . M . . . . . . . . . M . . M M M M M M

3 . . . . M . . . . . . . . . . . . M M M M M M

4 . . . . M . . . . . . . . . M . . M M M M M M

5 M M M M M M M M M M M M M M M M M M M M M M M

1288

K.Chen,C.H

.Lai/Parallel

Computin

g28(2002)1275–1291

� denotes the problem number, M denotes an acceptable accuracy is achieved for alln and . denotes a solution failure or inaccuracy. By accuracy, we mean the machineprecision because these are direct methods. Clearly one can observe that only twoalgorithms are reliable: V, VI while IV only works for p ¼ 1, 2 because pivoting isimportant for problems 1–4. Observe that the boundary element example can besolved by all methods and we have also tested first kind integral equations and foundthat pivoting is not a big issue (i.e. all methods would work).

5.2. Scalability test

Having confirmed that only two algorithms (V and VI) are reliable methods, wenow test the scalability of these two methods versus the GJ method (which is awidely accepted method) for solving the boundary element example (Problem 5)and concentrate on timing issues (with regard to our complexity analysis). Our testshere were done on both SGI IP25 and SGI IP30 with results displayed in Table 3.There ‘Efficiency over p ¼ 1’ gain is defined as the ratio of CPU taken by a singleprocessor over the current timing. Note that the CPU also includes the parallelgeneration of coefficient vectors Ci which slightly complicates the scalability test.

Table 3

Performance of parallel direct algorithms on SGI IP25 (left) and on SGI IP30 (right most)

Algorithm

used

Processors

p

Problem

size n

IP25 CPU

(s)

Efficiency

over p ¼ 1

IP30 CPU

(s)

Efficiency

over p ¼ 1

GJ 1 1024 244 56

2048 1868 410

2 1024 122 2 32 1.8

2048 935 2 226 1.8

4 1024 61 4

2048 469 4

8 1024 31 7.9

2048 238 7.8

V and VI 1 1024 138 45

2048 1069 349

V 2 1024 118 1.2 38 1.2

2048 927 1.2 291 1.2

4 1024 76 1.8

2048 610 1.8

8 1024 42 3.3

2048 352 3.0

VI 2 1024 69 2.0 26 1.7

2048 544 2.0 195 1.7

4 1024 35 4.0

2048 277 3.9

8 1024 20 6.9

2048 143 7.5


Nevertheless, one can observe from Table 3 that the scalability of these methods isgenerally good with Algorithm V being the worst as expected.

Therefore one may conclude that Algorithm VI is the most robust parallel directmethod. Further work to optimize this method and tests on more and different com-puting platforms as well as applications will be done in the future.

6. Conclusions

Distributed algorithms of the powerful Purcell direct method are presented, testedand discussed. We have addressed the issues of load balance for fast execution andglobal pivoting for robustness. The better reliability offered by the Purcell methodthan other well-known direct methods such as the Gaussian, the GJ and theGauss–Huard methods is illustrated. The inclusion of the right hand side vector inpivoting seems an attractive idea for boundary element problems; however for prob-lems with multiple right hand sides it remains a challenge to adapt the Purcell ideaappropriately. We hope this paper will draw the attention of many researchers tofurther refinement, analysis and application of the Purcell method.

Acknowledgements

The authors are grateful to the three anonymous referees who gave critical andhelpful comments on many aspects of the previous drafts of the paper.

References

[1] M. Benzi, Who was E. Purcell?, NA Digest, 01, No. 4 (2001) http://www.netlib.org/na-net.

[2] D. Colton, R. Kress, Integral equation methods in scattering theory, Wiley, New York, 1983.

[3] M. Cosnard, Y. Robert, D. Trystram, Parallel solution of dense linear systems using diagonalization

methods, Int. J. Comput. Math. 22 (1987) 249–270.

[4] T.J. Dekker, W. Hoffmann, Rehabilitation of the Gauss–Jordan algorithm, Numer. Math. 54 (1989)

591–599.

[5] T.J. Dekker, W. Hoffmann, K. Prota, Parallel algorithms for solving large linear systems, J. Comput.

Appl. Math. 50 (1994) 221–232.

[6] T.J. Dekker, W. Hoffmann, K. Prota, Stability of the Gauss–Huard algorithm with partial pivoting,

Computing 58 (1997) 225–244.

[7] J.W. Demmel, Applied Numerical Linear Algebra, SIAM publications, USA, 1997.

[8] J.J. Dongarra, Performance of various computers using standard linear equations software, Comp.

Sci. Tech. Rep. CS-89-85, University of Tennessee, January 1999.

[9] G.H. Golub, C. van Loan, Matrix Computation, third ed., John Hopkins University Press, USA,

1996.

[10] W. Gropp, E. Lusk, A. Skjellum, Using MPI, second ed., MIT Press, USA, 1999.

[11] J.L. Hess, A. Smith, Calculations of potential flow about arbitrary bodies, in: D. Kucheman (Ed.),

Progress in Aeronautical Sciences, vol. 8, 1976.

[12] N. Higham, Accuracy and Stability of Numerical Algorithms, SIAM publications, USA, 1996.

[13] N. Higham, D. Higham, Large growth factors in Gaussian elimination with pivoting, SIAM J. Matr.

Anal. Appl. 10 (1989) 155–164.


http://www.netlib.org/na-net

[14] W. Hoffmann, The Gauss–Huard algorithm and LU factorization, Lin. Alg. Appl. 275–276 (1998)

281–286.

[15] W. Hoffmann, K. Potma, G. Pronk, Solving dense linear systems by Gauss–Huard’s method on a

distributed memory system, Future Gen. Comp. Sys. 10 (1994) 321–325.

[16] P. Huard, La methode du simplexe sans inverse explicite, E.D.F. Bull, de la Direction des Etudes at

des Recherches Serie C 2, 2 (1979) 79–98.

[17] M.A. Jaswon, G.T. Symm, Integral equation methods in potential theory and electrostatics,

Academic Press, New York, 1977.

[18] C.H. Lai, A parallel panel method for the solution of fluid flow past an aerofoil, in: C.R. Jesshope,

K.D. Reinartz (Eds.), CONPAR88, Cambridge University Press, 1989, pp. 711–718.

[19] C.H. Lai, On Purcell’s method and the Gauss–Jordan elimination method, Int. J. Math. Educ. Sci.

Technol. 25 (1994) 759–778.

[20] C.H. Lai, On an extension of Purcell’s vector method with applications to panel element equations,

Comput. Math. Appl. 33 (1997) 101–104.

[21] C.H. Lai, K. Chen, Solutions of boundary element equations by a flexible elimination process,

Contemp. Math. 218 (1998) 311–317.

[22] R. Melhem, Parallel Gauss–Jordan elimination for the solution of dense linear systems, Para. Compt.

4 (1987) 339–343.

[23] E.W. Purcell, The vector method of solving simultaneous linear systems, J. Math. Phys. 32 (1953)

180–183.

[24] G.W. Stewart, Matrix algorithms I: basic decompositions, SIAM publications, USA, 1998.

[25] L.N. Trefethen, D. Bau, Numerical Linear Algebra, SIAM publications, USA, 1997.


Parallel algorithms of the Purcell method for direct solution of linear systems

Documents

Transcript of Parallel algorithms of the Purcell method for direct solution of linear systems