Symbolic Factorization for Sparse Gaussian Elimination with Partial Pivoting


SIAM J. Sci. Stat. Comput., Vol. 8, No. 6, November 1987

© 1987 Society for Industrial and Applied Mathematics

SYMBOLIC FACTORIZATION FOR SPARSE GAUSSIAN ELIMINATION WITH PARTIAL PIVOTING*

ALAN GEORGE† AND ESMOND NG‡

Abstract. Let Ax = b be a large sparse nonsingular system of linear equations to be solved using Gaussian elimination with partial pivoting. The factorization obtained can be expressed in the form A = P_1 M_1 P_2 M_2 ⋯ P_{n-1} M_{n-1} U, where P_k is an elementary permutation matrix reflecting the row interchange that occurs at step k during the factorization, M_k is a unit lower triangular matrix whose kth column contains the multipliers, and U is an upper triangular matrix.

Consider the kth step of the elimination. Suppose we replace the structure of row k of the partially reduced matrix by the union of the structures of those rows which are candidates for the pivot row and then perform symbolically Gaussian elimination without partial pivoting. Assume that this is done at each step k, and let L̄ and Ū denote the resulting lower and upper triangular matrices, respectively. Then the structures of L̄ and Ū, respectively, contain the structures of Σ_{k=1}^{n-1} M_k and U. This paper describes an algorithm which determines the structures of L̄ and Ū, and sets up an efficient data structure for them. Since the algorithm depends only on the structure of A, the data structure can be created in advance of the actual numerical computation, which can then be performed very efficiently using the fixed storage scheme. Although the data structure is more generous than it needs to be for any specific sequence P_1, P_2, ..., P_{n-1}, experiments indicate that the approach is competitive with conventional methods. Another important point is that the storage scheme is large enough to accommodate the QR factorization of A, so it is also useful in the context of computing a sparse orthogonal decomposition of A. The algorithm is shown to execute in time bounded by |L̄| + |Ū|, where |M| denotes the number of nonzeros in the matrix M.

Key words. sparse matrix algorithms, Gaussian elimination, partial pivoting, symbolic factorization

AMS(MOS) subject classifications. 65F05, 65F50, 68R10

1. Introduction. Let A be an n × n nonsingular matrix and b be an n-vector, and consider the problem of finding an n-vector x such that

    Ax = b.

A standard approach for solving the problem involves reducing A to upper triangular form using elementary row eliminations (i.e., Gaussian elimination). In order to maintain numerical stability, one may have to interchange rows at each step of the elimination process (i.e., use partial pivoting). Thus, we may express the elimination as follows:

    A = P_1 M_1 P_2 M_2 ⋯ P_{n-1} M_{n-1} U,

where P_k is an n × n elementary permutation matrix corresponding to the row interchange at step k, M_k is an n × n unit lower triangular matrix whose kth column contains the multipliers used at the kth step, and U is an n × n upper triangular matrix. When

* Received by the editors March 22, 1985; accepted for publication (in revised form) January 26, 1987. This research was sponsored by the Natural Sciences and Engineering Research Council of Canada under grant A8111, and was also sponsored by the Applied Mathematical Sciences Research Program, Office of Energy Research, U.S. Department of Energy under contract DE-AC05-84OR21400 with the Martin Marietta Energy Systems Inc., and by the U.S. Air Force Office of Scientific Research under contract AFOSR-ISSA-84-00056.

" Department of Computer Science and Mathematics, University of Tennessee, Knoxville, Tennessee37996. This author is also associated with the Oak Ridge National Laboratory through the DistinguishedScientist program, jointly sponsored by the University ofTennessee, and the Oak Ridge National Laboratory.

‡ Mathematical Sciences Section, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831.


A is sparse, it is well known that fill-in occurs during the triangular decomposition. That is, there may be more nonzeros in Σ_{k=1}^{n-1} M_k and U than in A. Our objective is to develop a scheme that can create, from the structure of A and prior to the numerical decomposition, a data structure which can accommodate all the nonzeros in Σ_{k=1}^{n-1} M_k and U. This process is known as symbolic factorization. Note that for sparse Gaussian elimination with partial pivoting, the positions of nonzeros in the triangular factors M_k and U depend not only on the structure of the matrix A, but also on the row interchanges. Since the row interchanges are not known until the numerical decomposition is performed, the symbolic factorization process must be able to generate a data structure that is large enough to accommodate the nonzeros for any valid sequence of elementary row permutations P_1, P_2, ..., P_{n-1}. Note that some sequences will never occur because of the zeros in the matrix. That is, there will be far fewer than n! possible sequences.

There are important reasons why it is desirable to perform such a symbolic factorization. First, since a symbolic factorization produces a data structure that exploits the sparsity of the triangular factors, the numerical decomposition can be performed using a static storage scheme. There is no need to perform storage allocations for the fill-in during the numerical computation. This reduces both storage and execution time overhead for the numerical phase. Second, we obtain from the symbolic factorization a bound on the amount of space we need in order to solve the linear system. This immediately tells us if the numerical computation is feasible. (Of course, this is important only if the symbolic factorization can be performed cheaply both in terms of storage and execution time and if the bound on space is reasonably tight.)

In [10] we have proposed an implementation of Gaussian elimination with partial pivoting for sparse matrices using a static storage scheme. The approach is based on the following results. Assume the diagonal elements of A are nonzero. Then it can be shown that the structure of the Cholesky factor L_C of AᵀA contains the structure of Uᵀ. That is, if U_{ij} is (structurally) nonzero, then (L_C)_{ji} will also be (structurally) nonzero. Furthermore, the structure of the kth column of L_C contains the structure of the kth column of M_k. These results are independent of the choice of the pivot row at each step. Thus we can use the structures of L_C and L_Cᵀ to bound the structures of Σ_{k=1}^{n-1} M_k and U, respectively. The advantage of this approach is that if AᵀA is sparse (and this is the case in some classes of problems), the structure of the Cholesky factor L_C of AᵀA can be determined from that of AᵀA efficiently [9], and hence a data structure which exploits the sparsity of L_C and L_Cᵀ can be set up. Thus this provides an efficient symbolic factorization scheme for sparse Gaussian elimination with partial pivoting. Notice that permuting the columns of A corresponds to ordering the rows and columns of AᵀA symmetrically. It is well known that for sparse AᵀA, the sparsity of the Cholesky factor L_C depends very much on the choice of this symmetric ordering. Thus we may choose a column ordering for A so as to reduce the amount of fill-in in L_C. Numerical experiments have shown that an implementation using the static storage scheme described above, together with a good column ordering for A, is competitive with existing methods for solving certain classes of sparse systems of linear equations [10].

However, in the approach described above, it is often true that the structure of L_C substantially overestimates the structure of Σ_{k=1}^{n-1} M_k or Uᵀ. The objective of this paper is to propose a better approach for predicting where nonzeros will appear in sparse Gaussian elimination with partial pivoting, and to present a symbolic factorization algorithm that will generate a tighter data structure which exploits the sparsity of the triangular factors. The basic idea is to allocate space for nonzeros that would be introduced by all possible pivotal sequences that could occur when Gaussian elimination with partial pivoting is applied to a matrix having the structure of A.

An outline of the paper is as follows. In § 2, the new approach for predicting fill-in and a symbolic factorization algorithm are described. The complexity of the algorithm is also considered. Some enhancements to the algorithm are presented in § 3, and in § 4 some numerical experiments and comparisons are provided. Finally, concluding remarks and open problems are given in § 5.

2. A symbolic factorization algorithm. In the following discussion, it is convenient to assume that the diagonal elements of the n × n matrix A are nonzero (in which case the matrix A is said to have a zero-free diagonal). If A is nonsingular and does not have a zero-free diagonal, it is always possible to permute the rows (or columns) of A so that the permuted matrix has a zero-free diagonal [3]. We will use the notation Nonz (B) to denote the structure of a matrix B; that is,

    Nonz (B) = { (i, j) | B_{ij} ≠ 0 }.

When M is a matrix, |M| denotes the number of nonzeros in M, and when M is a set, |M| denotes the number of elements in M.
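As a concrete illustration, Nonz is straightforward to compute from a dense array representation. The following Python sketch (the function name nonz and the sample matrix are ours, for illustration only) returns the structure as a set of index pairs, using 0-based indices:

```python
def nonz(B):
    """Return Nonz(B): the set of positions (i, j) at which the
    matrix B, given as a list of rows, has a nonzero entry."""
    return {(i, j)
            for i, row in enumerate(B)
            for j, entry in enumerate(row)
            if entry != 0}

# A 3 x 3 matrix with a zero-free diagonal:
B = [[1, 0, 2],
     [0, 3, 0],
     [4, 0, 5]]
print(sorted(nonz(B)))  # [(0, 0), (0, 2), (1, 1), (2, 0), (2, 2)]
```

With this representation, |B| in the sense above is simply len(nonz(B)).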

Now partition A into

    A = ( α   uᵀ )
        ( v    B )

and consider applying one step of Gaussian elimination to A with partial pivoting. Writing P_1 A = ( α̃  ũᵀ ; ṽ  B̃ ) for the matrix obtained after the row interchange, we have

    P_1 A = ( α̃   ũᵀ )  =  ( 1      0 ) ( α̃        ũᵀ       )
            ( ṽ    B̃ )     ( ṽ/α̃   I ) ( 0   B̃ − ṽũᵀ/α̃ ),

and we write l = ṽ/α̃ and A_1 = B̃ − ṽũᵀ/α̃.

Our goal is to be able to allocate space for the nonzeros in l, ũ and A_1, regardless of the choice of the pivot row (that is, the choice of P_1). Our approach is based on the observation that only the first row of A and rows j of A, where v_j ≠ 0, are involved in the elimination. For convenience, row 1 and rows j of A, where v_j ≠ 0, are referred to as candidate pivot rows. Thus only the structures of the candidate pivot rows will be affected after the first step of Gaussian elimination with partial pivoting. Moreover, the new structures of these rows depend only on their original structures. In fact, assuming structural and numerical cancellations do not occur, the new structure of each of these rows must be contained in the union of the structures of all the candidate pivot rows.

The discussion above also explains why we prefer α to be nonzero. Suppose α is zero. Then we know that the structure of uᵀ is not affected during the elimination except that it is moved to another row in the matrix. However, in order to have sufficient space to store the nonzeros, we have to treat the row (α uᵀ) as a candidate pivot row as well. Thus the prediction of fill-in may be too generous. This problem is eliminated if we arrange the rows of A so that α is nonzero.

Assuming structural and numerical cancellations do not occur, the union of the structures of the candidate pivot rows is identical to the structure of

    ( α   uᵀ + vᵀB ).

Hence the discussion above can be summarized as follows.


    Nonz (l) ⊆ Nonz (v),
    Nonz (ũᵀ) ⊆ Nonz (uᵀ + vᵀB),
    Nonz (A_1) ⊆ Nonz (B − v(uᵀ + vᵀB)/α).

It is convenient to denote v by l̄, uᵀ + vᵀB by ūᵀ, and B − vūᵀ/α by Ā_1. We should emphasize that the right-hand sides, which can be computed easily from the structure of A, are independent of the choice of the actual pivot row (i.e., the choice of P_1) even though the left-hand sides do depend on P_1. Hence we can use the structures of l̄, ū and Ā_1 to bound those of l, ũ and A_1, respectively, regardless of the choice of P_1.

It is interesting to note that the decomposition

    ( 1     0 ) ( α    ūᵀ )
    ( v/α   I ) ( 0   Ā_1 )

can be obtained by applying one step of Gaussian elimination without row interchanges to the n × n matrix Ā below:

    Ā = ( α   ūᵀ )
        ( v    B ).

In terms of structures, what we have done is to replace the first row of A by a row whose structure is the union of the structures of the candidate pivot rows and then to apply one step of Gaussian elimination without partial pivoting.

Since B has a zero-free diagonal, Ā_1 = B − v(uᵀ + vᵀB)/α must also have a zero-free diagonal under the assumption that exact cancellation does not occur. Thus the argument and idea above can be applied recursively to A_1 and Ā_1, yielding a procedure for generating a sequence of vectors whose structures can be used to bound the structures of the triangular factors (M_k and U). As we shall see later, this approach often generates a storage scheme which is much tighter than that provided by the method in [10].
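The recursion can be simulated directly on the structure of A. In the Python sketch below (all names are ours; indices are 0-based), each row is kept as a set of column indices; at step k the candidate pivot rows are those rows at or below position k whose structure contains k, their union gives row k of Ū, and their indices give column k of L̄:

```python
def structure_bounds(rows):
    """rows[i]: set of column indices of nonzeros in row i of A,
    which is assumed to have a zero-free diagonal (i in rows[i]).
    Returns (Lcols, Urows): the column structures of Lbar and the
    row structures of Ubar."""
    n = len(rows)
    rows = [set(r) for r in rows]       # work on a copy
    Lcols, Urows = [], []
    for k in range(n):
        # candidate pivot rows: nonzero in column k, at or below row k
        cand = [i for i in range(k, n) if k in rows[i]]
        union = set().union(*(rows[i] for i in cand))
        Lcols.append(cand)                                # column k of Lbar
        Urows.append(sorted(j for j in union if j >= k))  # row k of Ubar
        for i in cand:
            # every candidate row now carries the union, minus column k
            rows[i] = {j for j in union if j > k}
    return Lcols, Urows

# A 5 x 5 example: dense first row, dense last column, and a corner
# entry in position (4, 0).
A = [{0, 1, 2, 3, 4}, {1, 4}, {2, 4}, {3, 4}, {0, 4}]
Lcols, Urows = structure_bounds(A)
```

For this example the bounds reproduce the envelope pattern one obtains by carrying out the row merging by hand: column k of L̄ is {k, 4} and row k of Ū runs from column k to column 4.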

An important advantage of this new approach is that at each step of the symbolic factorization, all we need is the ability to identify the candidate pivot rows and their structures, and the ability to update the structures of these candidate pivot rows by the union of their original structures. Notice that we do not have to be concerned with the actual numerical values or the actual row interchanges that may occur during the numerical factorization. Thus the symbolic factorization procedure can be used to set up a static data structure in advance of the numerical computation, which can then be performed using the fixed storage scheme.

The success of the symbolic factorization procedure described above relies heavily on the efficiency of identifying the candidate pivot rows and updating their structures at each step. Fortunately, these can be done very easily and efficiently. In particular, the symbolic factorization procedure can be implemented with no modification to the structure of A!

An important fact about the structure of the pivot row and column helps us achieve high efficiency. Specifically, it is straightforward to verify that the index of the first nonzero in the pivot column is always at least as large as the index of the first nonzero in the pivot row (after computing the union). This fact is illustrated in Fig. 2.1.

Consider the first step of the symbolic factorization. The structure of row j (j ≠ 1) of A is either unchanged (if the row is not a candidate pivot row and is not involved in the elimination process), or the same as that of ūᵀ (if the row is a candidate pivot row and is involved in the elimination process). Thus, all we need is an indicator


FIG. 2.1. Positions of nonzeros after the first step of symbolic factorization.

mask (j) for each row j that tells us whether row j is involved in step 1. In subsequent steps, if mask (j) is set, then the structure of row j has been used and its new structure should be obtained from (in this case) ūᵀ.

We must also maintain information that tells us when the new structure of the modified rows will be used again and where the new structure can be found. Suppose l̄_{σ(1)}, l̄_{σ(2)}, ..., l̄_{σ(p)} are nonzero, and ū_{ρ(1)}, ū_{ρ(2)}, ..., ū_{ρ(q)} are nonzero. Since the diagonal elements of A (and hence of Ā) are assumed to be nonzero, Ā_{σ(1),σ(1)} ≠ 0 and Ā_{ρ(1),ρ(1)} ≠ 0, and ρ(1) ≤ σ(1). Note that Ā_{ij} must be zero for 2 ≤ j < ρ(1) and i = σ(1), σ(2), ..., σ(p). See Fig. 2.1 for an illustration. Thus the new structures of the modified rows will be used again (at the earliest) at step ρ(1). We should emphasize that the new structures are identical to that of ūᵀ. Also, at step ρ(1), the candidate pivot rows will include those (whose row indices are greater than ρ(1)) of step 1, but the candidate pivot rows of step 1 could be conveniently found by examining the structure of l̄. To summarize, the only other information we have to record is the fact that at step ρ(1), we have to look at l̄ and ū to obtain the correct structure of the modified matrix. In the algorithm described below, we have used a set S_{ρ(1)} for this purpose. After the first step, we have S_{ρ(1)} = {1}.

At this point, it is appropriate to present the entire symbolic factorization algorithm, which is a generalization of what we have just described. The structure of A is represented by two sets.

(1) C_j contains the structure of column j of A:

    C_j = { i | A_{ij} ≠ 0 }.

(2) R_i contains the structure of row i of A:

    R_i = { j | A_{ij} ≠ 0 }.

Similarly, the structures of the sequence of vectors obtained from the symbolic factorization algorithm are represented by sets.

(1) L̄_k contains the structure of a column vector that bounds the structure of column k of M_k.

(2) Ū_k contains the structure of a row vector that bounds the structure of row k of U.

SYMBOLIC FACTORIZATION ALGORITHM

(1) Global initialization.
    (1.1) Set mask (k) = 0, for k = 1, 2, ..., n, to denote that no rows have been considered yet.


    (1.2) Set S_k = ∅, for k = 1, 2, ..., n, to denote that step k has not yet been influenced by any previous steps.

(2) For k = 1, 2, ..., n, do the following:
    (2.1) Local initialization: Set L̄_k = ∅ and Ū_k = ∅.
    (2.2) Obtain structures of those unprocessed rows: For i ∈ C_k, if mask (i) = 0, then set L̄_k = L̄_k ∪ {i} and Ū_k = Ū_k ∪ R_i, and set mask (i) = 1 (to indicate that row i has been looked at already).
    (2.3) If S_k = ∅, then go to Step (2.5). (Step k is not influenced by any previous steps.)
    (2.4) If S_k ≠ ∅, then L̄_k and Ū_k should include L̄_i and Ū_i, respectively, for i ∈ S_k. Hence, for i ∈ S_k, collect the influence caused by step i:

        L̄_k = L̄_k ∪ L̄_i − {1, ..., k − 1},
        Ū_k = Ū_k ∪ Ū_i − {1, ..., k − 1}.

    (2.5) Update S: Let σ = min { t | t ∈ L̄_k − {k} } and ρ = min { t | t ∈ Ū_k − {k} }. Since ρ ≤ σ, the structure of L̄_k and Ū_k will influence the determination of the structures at step ρ (if |L̄_k| > 1). Hence set S_ρ = S_ρ ∪ {k} if |L̄_k| > 1.
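Steps (1)-(2.5) can be transcribed almost line for line into Python. The sketch below uses 0-based indices, and the names symbolic_factorization, mask, S, Lbar and Ubar are ours:

```python
def symbolic_factorization(C, R):
    """Symbolic factorization for sparse Gaussian elimination with
    partial pivoting, following steps (1)-(2.5) above.
    C[j]: set of row indices of nonzeros in column j of A;
    R[i]: set of column indices of nonzeros in row i of A.
    A zero-free diagonal is assumed.  Returns (Lbar, Ubar)."""
    n = len(C)
    mask = [0] * n                      # (1.1) no rows considered yet
    S = [set() for _ in range(n)]       # (1.2) no influences recorded
    Lbar, Ubar = [], []
    for k in range(n):
        Lk, Uk = set(), set()           # (2.1) local initialization
        for i in C[k]:                  # (2.2) unprocessed candidate rows
            if mask[i] == 0:
                Lk.add(i)
                Uk |= R[i]
                mask[i] = 1
        for i in S[k]:                  # (2.3)/(2.4) influence of step i
            Lk |= {t for t in Lbar[i] if t >= k}
            Uk |= {t for t in Ubar[i] if t >= k}
        Lbar.append(Lk)
        Ubar.append(Uk)
        if len(Lk) > 1:                 # (2.5) step k influences step p
            p = min(t for t in Uk if t != k)
            S[p].add(k)
    return Lbar, Ubar

# A 5 x 5 example: dense first row, dense last column, and a corner
# entry in position (4, 0).
R = [{0, 1, 2, 3, 4}, {1, 4}, {2, 4}, {3, 4}, {0, 4}]
C = [{i for i in range(5) if j in R[i]} for j in range(5)]
Lbar, Ubar = symbolic_factorization(C, R)
```

Because each step k enters at most one set S_p, every L̄_i and Ū_i is scanned at most once in step (2.4); this is the source of the linear-time behavior discussed next.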

The complexity of the algorithm is linear in the number of elements in ∪_i L̄_i and ∪_i Ū_i, as the following discussion shows. Since each i, i = 1, 2, ..., n, will be included in at most one S_ρ, we have S_ρ ∩ S_{ρ′} = ∅ for ρ ≠ ρ′. Thus we access the elements of L̄_i and Ū_i at most once for each i (see Step (2.4)). Furthermore, we access the elements of C_i and R_i at most once in this algorithm (see Step (2.2)). Hence the complexity of the algorithm is

    O ( max { |A|, Σ_{k=1}^{n} ( |L̄_k| + |Ū_k| ) } ).

The overhead in storage is also very small. There are n elements in mask for marking the rows. Since at step k, k is included in at most one set S_ρ, Σ_i |S_i| ≤ n.

For convenience, we define a lower triangular matrix L̄ so that L̄_{ij} ≠ 0 if i ∈ L̄_j and L̄_{ij} = 0 otherwise. Similarly, let Ū be an upper triangular matrix such that Ū_{ij} ≠ 0 if j ∈ Ū_i and Ū_{ij} = 0 otherwise. Hence,

    Nonz (U) ⊆ Nonz (Ū)

and

    Nonz ( Σ_{k=1}^{n-1} M_k ) ⊆ Nonz (L̄).

In other words, a static storage scheme that exploits the sparsity of L̄ and Ū is always large enough to accommodate the fill-in in M_k and U, regardless of the choice of pivot rows.

It is interesting to note that the static data structure obtained from the symbolic factorization algorithm also has enough space to accommodate fill-in produced in computing an orthogonal decomposition of A using Householder transformations. Partition A as before into

    A = ( α   uᵀ )
        ( v    B ).


Suppose H is an n × n Householder transformation that reduces the first column of A to a multiple of the first column of the identity matrix. The matrix H has the form

    H = I − (1/β) wwᵀ,

where Nonz (w) = Nonz ( (α, vᵀ)ᵀ ). Writing w = (ω, w̃ᵀ)ᵀ, we have Nonz (w̃) = Nonz (v) = Nonz (l̄), and

    HA = ( I − (1/β) wwᵀ ) ( α   uᵀ )  =  ( γ   ûᵀ )
                           ( v    B )     ( 0    B̂ ).

Hence

    Nonz (ûᵀ) = Nonz ( uᵀ − (ω/β)(ωuᵀ + w̃ᵀB) ) ⊆ Nonz ( uᵀ + vᵀB ) = Nonz (ūᵀ)

and

    Nonz (B̂) = Nonz ( B − (1/β) w̃ (ωuᵀ + w̃ᵀB) ) ⊆ Nonz ( B − v(uᵀ + vᵀB)/α ) = Nonz (Ā_1).

The argument can again be applied recursively to Ā_1 and B̂. Thus, the symbolic factorization algorithm described above can be used to generate a static storage scheme for sparse orthogonal decomposition using Householder transformations. Note that both the upper triangular matrix and the Householder transformations can be saved in the data structure.
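As a numerical sanity check, the containments above can be exercised with a plain dense Householder step. The routine below is our own illustration in Python (not code from the paper), applied to a matrix whose first column has nonzeros only in rows 0 and 4 (0-based), so that rows 1-3 are not candidate pivot rows:

```python
import math

def householder_step(A):
    """Apply one dense Householder reflection H = I - (1/beta) w w^T,
    chosen to annihilate column 0 of A below the diagonal."""
    n = len(A)
    a = [A[i][0] for i in range(n)]
    alpha = math.copysign(math.sqrt(sum(x * x for x in a)), a[0])
    w = a[:]
    w[0] += alpha                        # w = a + sign(a_0) ||a|| e_1
    beta = sum(x * x for x in w) / 2.0   # then H a = -alpha e_1
    wTA = [sum(w[i] * A[i][j] for i in range(n)) for j in range(n)]
    return [[A[i][j] - w[i] * wTA[j] / beta for j in range(n)]
            for i in range(n)]

# Column 0 has nonzeros in rows 0 and 4 only; rows 1-3 are not
# candidate pivot rows and must come through unchanged.
A = [[2.0, 1.0, 1.0, 1.0, 1.0],
     [0.0, 3.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 4.0, 0.0, 1.0],
     [0.0, 0.0, 0.0, 5.0, 1.0],
     [1.0, 0.0, 0.0, 0.0, 6.0]]
HA = householder_step(A)
```

The reflection zeroes column 0 below the diagonal, leaves rows 1-3 untouched, and introduces fill only in rows 0 and 4, in agreement with the structures of l̄, ūᵀ and Ā_1.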

Although the symbolic factorization algorithm has been described in terms of n × n matrices, with some slight modifications it can also be applied to rectangular matrices. The lower trapezoidal portion of L̄ and the upper trapezoidal portion of Ū are stored, respectively, by columns and rows. If the matrix A is m × n, then the mask vector will be of size m and the number of elements in the sets S_i will be at most min {m, n}.

3. Enhancements.

3.1. Subscript compression. From the sets {L̄_i} and {Ū_i}, we can generate a storage scheme for storing the nonzeros of L̄ and Ū (and hence the nonzeros of M_k and U). For example, consider the upper triangular matrix Ū. One way to represent Ū is to store the nonzeros row by row in a floating-point array UNZ, since the structure of Ū is generated row by row. Thus we need an integer array XUNZ to store the beginning positions in UNZ of the nonzeros in the rows of Ū. More specifically, the nonzeros of row i of Ū are found in UNZ (p), for p = XUNZ (i), XUNZ (i) + 1, ..., XUNZ (i + 1) − 1. (It is assumed that XUNZ (n + 1) − 1 = Σ_i |Ū_i|.) It is also


necessary to store the column subscripts of the numbers in UNZ. Hence we need an additional integer array SUBU, where SUBU (p) contains the column index of the nonzero stored in UNZ (p). The lower triangular matrix L̄ is represented column by column using a similar storage scheme. We refer to this storage scheme as the uncompressed storage scheme. If we use this scheme, we need Σ_i (|L̄_i| + |Ū_i|) floating-point locations for the numerical values, Σ_i (|L̄_i| + |Ū_i|) integer locations for the subscripts, and 2n + 2 integer locations for the pointers. If an integer location and a floating-point location have the same number of bits, then the total number of locations required by the uncompressed storage scheme is approximately

    2 Σ_i ( |L̄_i| + |Ū_i| ).

It has been observed from experiments that very often the set of subscripts of a row (say row i) of Ū is a subsequence of the subscripts of a previous row (in particular, row i − 1). This is also true for the columns of the lower triangular matrix L̄. Thus, it makes sense to remove these redundancies using a process referred to as subscript compression, so that the total number of subscripts that have to be stored is smaller than Σ_i (|L̄_i| + |Ū_i|). The use of subscript compression is common in storage schemes for solving sparse symmetric positive definite systems, and is due to Sherman [12].

On the other hand, if the subscripts are compressed, there is no longer a one-to-one correspondence between the numerical values in, say, UNZ, and the subscripts in SUBU. We need additional information in order to be able to retrieve the subscripts. For example, consider the upper triangular matrix Ū again. We now need an extra integer array XSUBU of size n, with XSUBU (i) indicating where we can find the beginning of the subscript sequence for the nonzeros in row i. That is, the nonzeros of row i are stored in UNZ (p), for p = XUNZ (i), XUNZ (i) + 1, ..., XUNZ (i + 1) − 1. The corresponding column subscripts are stored in SUBU (q), for q = XSUBU (i), XSUBU (i) + 1, .... Of course, the length of the subscript sequence is simply given by XUNZ (i + 1) − XUNZ (i). This storage scheme is widely known as the compressed storage scheme, and was first employed by Sherman [12]. As we will see in the numerical experiments presented in the next section, the additional storage required by XSUBL and XSUBU is usually more than offset by the saving in storage for the subscripts. However, since additional indexing is needed in order to retrieve the nonzeros of a row of Ū or a column of L̄, the numerical computation phase using the compressed storage scheme is expected to be slower than that using the uncompressed storage scheme.
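The pointer arithmetic of the compressed scheme can be sketched in a few lines of Python (0-based indices; the array contents are a made-up four-row illustration of ours, not data from the paper). Rows 2 and 3 of Ū share the tail of row 1's subscript sequence, so XSUBU points into the middle of it:

```python
# Row structures of a 4 x 4 upper triangular Ubar:
#   row 0: columns 0, 2, 3      row 1: columns 1, 2, 3
#   row 2: columns 2, 3         row 3: column 3
XUNZ  = [0, 3, 6, 8, 9]        # start of each row's values in UNZ
SUBU  = [0, 2, 3, 1, 2, 3]     # compressed column subscripts
XSUBU = [0, 3, 4, 5]           # start of each row's subscripts in SUBU

def row_subscripts(i):
    """Column subscripts of the nonzeros in row i of Ubar."""
    length = XUNZ[i + 1] - XUNZ[i]   # row length from the value pointers
    return SUBU[XSUBU[i]:XSUBU[i] + length]

print(row_subscripts(2))  # [2, 3] -- shared with the tail of row 1
```

The uncompressed scheme would store 9 subscripts for this example; compression stores 6, at the cost of the extra array XSUBU.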

There are two techniques employed in subscript compression. First, after we have determined the structure of the kth row of Ū (or kth column of L̄), we can compare it with that of the (k − 1)st row of Ū (or (k − 1)st column of L̄) to find out if the "tail" of the latter sequence is the same as the "head" of the former one. If they are the same, then we do not have to store the "head" of the subscript sequence for the current row (or column).

Another observation not only allows us to quickly compress the subscripts, but also improves the execution time of the algorithm. Often, particularly during later stages of the algorithm, there remain no unprocessed rows at step i, and the set S_i (defined in § 2) has only one element. (In particular, it is often true that after the symbolic factorization algorithm has gone through far fewer than n steps, all n rows of A may have been processed.) Suppose S_i = {l}. This simply means that the structures of L̄_i and Ū_i are determined entirely by L̄_l − {1, ..., i − 1} and Ū_l − {1, ..., i − 1}, respectively. Hence, it is unnecessary to access the elements of L̄_l and Ū_l. This allows us to very quickly and


inexpensively compress the subscripts, which in turn usually reduces the execution time of the algorithm.

The techniques described above have been incorporated into our implementation, and numerical experiments have indicated that there is indeed a significant reduction in both the execution time of the symbolic factorization algorithm and the total number of subscripts that have to be stored.

3.2. Column ordering. At each step of the symbolic factorization algorithm, we replace the structures of the candidate pivot rows by the union of their original structures. This operation is independent of the initial row ordering. That is, the number of nonzeros (|L̄| and |Ū|) determined by the symbolic factorization algorithm is constant, regardless of the choice of the initial row ordering, as long as this row ordering preserves the zero-free diagonal. (However, the choice of row ordering may have an effect on the extent to which subscript compression can be achieved.)

Column orderings, on the other hand, may affect the sparsity of the triangular matrices L̄ and Ū, which is indirectly related to the sparsity of the triangular factors Σ_{k=1}^{n-1} M_k and U. (Again, the column ordering has to preserve the zero-free diagonal.) Consider the two matrices A and B below, where B is obtained by interchanging the first and the last columns of A.

        ( x  x  x  x  x )          ( x  x  x  x  x )
        ( .  x  .  .  x )          ( x  x  .  .  . )
    A = ( .  .  x  .  x )      B = ( x  .  x  .  . )
        ( .  .  .  x  x )          ( x  .  .  x  . )
        ( x  .  .  .  x )          ( x  .  .  .  x )

The union of the structures of the matrices L̄ and Ū in each case is displayed below.

                  ( x  x  x  x  x )                 ( x  x  x  x  x )
                  ( .  x  x  x  x )                 ( x  x  x  x  x )
    L̄_A + Ū_A  =  ( .  .  x  x  x )    L̄_B + Ū_B =  ( x  x  x  x  x )
                  ( .  .  .  x  x )                 ( x  x  x  x  x )
                  ( x  x  x  x  x )                 ( x  x  x  x  x )

Hence it is important to choose a column ordering for A so that the triangular matrices L̄ and Ū are sparse.
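The comparison can be checked mechanically. The Python sketch below (our own; bound_size is not a name from the paper; 0-based indices) runs the row-union recursion of § 2 on the structures of the two matrices above and counts the nonzeros allocated for L̄ + Ū:

```python
def bound_size(rows):
    """Total number of nonzeros in Lbar + Ubar predicted by the
    row-union recursion, for rows[i] = set of column indices of
    nonzeros in row i (zero-free diagonal assumed)."""
    n = len(rows)
    rows = [set(r) for r in rows]
    count = 0
    for k in range(n):
        cand = [i for i in range(k, n) if k in rows[i]]
        union = set().union(*(rows[i] for i in cand))
        # candidates give column k of Lbar, the union gives row k of
        # Ubar; the diagonal entry is counted once
        count += len(cand) + len([j for j in union if j >= k]) - 1
        for i in cand:
            rows[i] = {j for j in union if j > k}
    return count

A = [{0, 1, 2, 3, 4}, {1, 4}, {2, 4}, {3, 4}, {0, 4}]  # dense last column
B = [{0, 1, 2, 3, 4}, {0, 1}, {0, 2}, {0, 3}, {0, 4}]  # dense first column
print(bound_size(A), bound_size(B))  # 19 25
```

The counts 19 and 25 match the two patterns displayed above: the column interchange turns a matrix with no fill beyond the envelope into one whose bound is completely full.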

In our experiments, since we want to compare the storage scheme generated by the new symbolic factorization algorithm with that created using the approach described in [10], we have used a column ordering that is in general effective for that method, which uses the structures of the Cholesky factors of AᵀA to bound the structures of the triangular factors Σ_{k=1}^{n-1} M_k and U of A. Hence it is the sparsity of the Cholesky factors of AᵀA that has to be preserved. It is well known that when AᵀA is sparse, a minimum degree ordering is an effective ordering for reducing fill-in in its Cholesky factors [11]. Hence we use that ordering to permute the columns of A before we apply our symbolic factorization algorithm to A.

A natural alternative strategy for ordering the columns of A is as follows. Consider the first step of the symbolic factorization algorithm. Since we have to replace the structures of the candidate pivot rows by the union of their original structures, a p × q dense submatrix is introduced into the matrix, where p is the number of candidate pivot rows and q is the number of nonzeros in the union of these candidate pivot rows. By rearranging the columns of A, p and q will be changed. Thus we could


permute the columns (and rows) before we carry out this step of symbolic factorization so that p, q or even the product pq is small (or minimized). Efficient implementation of these ideas is under investigation.
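The quantities p and q at the first step are cheap to evaluate for each prospective leading column; the snippet below (our own illustration) measures them on the arrow-shaped 5 × 5 structures discussed in § 3.2.

```python
# For a prospective leading column, p = number of candidate pivot rows and
# q = number of distinct columns in the union of those rows.  The p x q
# dense block is what the first elimination step introduces.
def first_step_pq(rows, col):
    cand = [r for r in rows if col in r]
    union = set().union(*cand)
    return len(cand), len(union)

# Arrow matrix with a dense first column versus its column-interchanged form.
A = [{0, 1, 2, 3, 4}, {0, 1}, {0, 2}, {0, 3}, {0, 4}]
B = [{0, 1, 2, 3, 4}, {1, 4}, {2, 4}, {3, 4}, {0, 4}]
```

Here first_step_pq(A, 0) gives p = 5 and q = 5, a 25-entry dense block, while first_step_pq(B, 0) gives p = 2 and q = 5: rearranging the columns reduced pq from 25 to 10.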

3.3. Dense rows and dense columns. In some applications, the matrix A may have a few rows or columns that are relatively dense. Since the method proposed in [10] works with ATA, dense rows of A form large dense submatrices in ATA. In our present approach, the effect of having some dense rows in A may be less severe, so long as when one of these dense rows becomes a candidate pivot row, the number of candidate pivot rows at that time is small. However, since we order the rows and columns of ATA to obtain a column ordering for A, these dense rows may affect the quality of the column ordering because the sparsity of the remaining rows of A may be lost when we form ATA. One technique we have found useful is to remove dense rows from A when we order the columns of A and then to put them back into A when we carry out symbolic factorization. Some of the numerical experiments presented in the next section will illustrate the improvement that may be achieved.
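The effect of a dense row on ATA, and the withholding device, can be seen in a small sketch; this is our own illustration, and the withholding threshold is an assumed parameter (a cutoff of 50 nonzeros is the one used later in § 4).

```python
# Structure of A^T A from the row structures of A: columns i and j of A are
# coupled whenever some row of A contains both.  Rows with more than
# max_row_nnz nonzeros may be withheld before the ordering is computed.
def ata_structure(rows, max_row_nnz=None):
    kept = (r for r in rows
            if max_row_nnz is None or len(r) <= max_row_nnz)
    return {(i, j) for r in kept for i in r for j in r}

# Three sparse rows plus one dense row of a 4-column matrix.
rows = [{0, 1}, {1, 2}, {2, 3}, {0, 1, 2, 3}]
```

With the dense row included, every pair of columns is coupled (all 16 positions of a full 4 × 4 matrix); withholding it leaves only 10 structural positions.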

The effect of dense columns is not as clear as that of dense rows. Suppose the last column of A is dense. It should be clear that such a dense column should have no effect on the sparsity of L̄ and Ū. On the other hand, if the first column of A is dense, the number of candidate pivot rows at the first step of symbolic factorization is large and the union of these candidate pivot rows is likely to be quite dense. Even if the union is sparse, more nonzeros will likely be added to this union as the symbolic factorization algorithm proceeds. Thus, it would appear to be desirable to have any dense columns appear last.

Instead of working with ATA and A, we may work with AAT and AT, since a triangular factorization of AT is as useful as a triangular factorization of A. Dense rows and dense columns in A now become, respectively, dense columns and dense rows in AT. In terms of applying the symbolic factorization algorithm to A and AT, the effect of having dense rows and dense columns in each case is in general different.

As an example, suppose all the rows of A are sparse, but its first column is dense. Then applying the symbolic factorization algorithm to A may result in substantial fill-in in L̄ and Ū. However, if we work with AT, then the first row of AT is dense and AT has no dense columns. We can apply the symbolic factorization algorithm to AT. Together with the row withholding column ordering strategy mentioned above, this latter approach may generate a sparse L̄ and Ū.
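Structurally, passing to AT is just an exchange of row and column index sets; a minimal sketch (our own, not part of the paper's code):

```python
# Row structures of A^T from row structures of A: a dense column of A
# becomes a single dense row of A^T, which the symbolic algorithm and the
# row-withholding device handle more gracefully.
def transpose_structure(rows, ncols):
    cols = [set() for _ in range(ncols)]
    for i, r in enumerate(rows):
        for j in r:
            cols[j].add(i)
    return cols

# A 3 x 3 structure whose first column is dense.
rows = [{0}, {0, 1}, {0, 2}]
```

Here transpose_structure(rows, 3) returns [{0, 1, 2}, {1}, {2}]: the dense first column of A has become the single dense first row of AT.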

Clearly, there are two related problems associated with exploiting these ideas. One is to determine when a row or column should be considered dense, and the other is to decide whether to work with A or AT. Answers to these questions are probably problem dependent, and require further study. Our experiments provide some evidence about the potential effectiveness of the techniques.

4. Numerical experiments. In this section, we present some numerical experiments to illustrate the efficiency of our symbolic factorization algorithm and to compare its performance with the method proposed in [10], and with the well-known Harwell MA28 code [2].

For the new scheme, and the method proposed in [10], the experiments involved the solution of sparse systems of linear equations

    PT(QA)P PTx = PT(Qb),

where A and b are, respectively, the coefficient matrix and the right-hand side vector, Q is a permutation matrix chosen so that QA has a zero-free diagonal, and P is a


permutation matrix corresponding to a minimum degree ordering of ATA. (We premultiply QAP by PT since we want to preserve the zero-free diagonal.) Symbolic factorization is applied to the matrix B = PTQAP to create a data structure for the numerical decomposition of B. The triangular factors Mk and U, which are computed using Gaussian elimination with partial pivoting, are stored in the resulting data structure.
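The algebra behind this transformation is easy to check numerically. The sketch below is our own, with an arbitrary 3 × 3 integer example; it verifies that if Ax = b and B = PT(QA)P, then B(PTx) = PT(Qb), so a factorization of B solves the original system once the unknowns are permuted back.

```python
# Verify P^T(QA)P P^T x = P^T(Qb) using explicit permutation matrices.
def pmat(p):
    """Permutation matrix with (pmat(p) @ v)[i] = v[p[i]]."""
    n = len(p)
    return [[1 if j == p[i] else 0 for j in range(n)] for i in range(n)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matvec(X, v):
    return [sum(X[i][k] * v[k] for k in range(len(v))) for i in range(len(X))]

def transpose(X):
    return [list(r) for r in zip(*X)]

A = [[2, 1, 0], [0, 3, 1], [1, 0, 4]]
x = [1, 2, 3]
b = matvec(A, x)                           # so that Ax = b holds exactly
Q, P = pmat([2, 0, 1]), pmat([1, 2, 0])    # arbitrary permutations
B = matmul(transpose(P), matmul(matmul(Q, A), P))
c = matvec(transpose(P), matvec(Q, b))     # permuted right-hand side
y = matvec(transpose(P), x)                # the permuted unknowns P^T x
```

Since P PT = I, the product By collapses to PT(QA)x = PT(Qb) = c, which the assertions below confirm with exact integer arithmetic.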

All experiments were carried out on a DEC VAX 11/780 computer. The programs were written in ANSI standard FORTRAN using single precision floating-point arithmetic, and were compiled using the Berkeley FORTRAN 77 compiler. In the discussion below, storage requirements are expressed in terms of the number of storage locations required and execution times are in seconds. (Note that one floating-point location and one integer location have the same number of bits.)

Our experiments involved three sets of test problems typical of those arising in scientific and engineering applications. The first set consists of nine finite element problems having the structure of those described in [8]. These problems are unsymmetric, although they have symmetric structure. Their numerical values were generated using a uniform random number generator. The second set is a collection of problems arising from chemical engineering calculations, and the third set consists of problems arising from various applications, such as surveying and linear programming. The second and third sets of problems were kindly provided by Duff, Grimes, Lewis and Poole [5]. Numerical values were provided for most of the problems in the last two sets. For problems that do not have numerical values, we generated values using a uniform random number generator. The right-hand side vector for each problem was chosen so that the solution vector consisted of all ones. The characteristics of the problems are summarized in Table 4.1.

In presenting the results, we have employed the following notation:
(1) store denotes the number of storage locations required;
(2) time denotes the execution time (in seconds) required;
(3) lower and upper refer to the lower and upper triangular portions of a matrix, respectively;
(4) factor time and solve time are, respectively, the execution times (in seconds) required for the numerical factorization and numerical triangular solutions;
(5) percentage utilization is the ratio of the actual number of off-diagonal nonzeros obtained in the numerical factorization to the number of off-diagonal nonzeros accommodated by the static data structure obtained from a symbolic factorization algorithm.

Also for convenience, the approach described in [10] is referred to as "SF-ATA", and the new symbolic factorization algorithm described in § 2 of this paper is referred to as "SF-A." The new symbolic factorization algorithm with subscript compression as discussed in § 3.1 is abbreviated as "SFC-A."

For comparison purposes, we first summarize in Table 4.2 the results of using SF-ATA.

The first experiment we did involved implementing the symbolic factorization algorithm described in § 2, and then using the algorithm to generate a static data structure for sparse Gaussian elimination with partial pivoting. Numerical factorization and triangular solutions were performed using that fixed storage scheme. The results are presented in Table 4.3. A comparison of the results in Tables 4.2 and 4.3 leads us to the following observations.

Note that the number of nonzeros in the lower triangular matrix L̄ determined by the SF-A symbolic factorization algorithm is smaller than that in the Cholesky factor Lc of ATA predicted in SF-ATA. This is true in all test problems, and the


TABLE 4.1
Characteristics of test problems.

                  Number of   Nonzeros   Nonzeros
                              per row    per column
Problem   Order   nonzeros    Min  Max   Min  Max    Comment
   1        936     6,264      4    7     4    7     small hole square (finite element mesh)
   2      1,009     6,865      4    7     4    7     graded L (finite element mesh)
   3      1,089     7,361      3    7     3    7     plain square (finite element mesh)
   4      1,440     9,504      3    7     3    7     large hole square (finite element mesh)
   5      1,180     7,750      3    7     3    7     + shaped domain (finite element mesh)
   6      1,377     8,993      3    7     3    7     H shaped domain (finite element mesh)
   7      1,138     7,450      4    7     4    7     3 hole problem (finite element mesh)
   8      1,141     7,465      4    7     4    7     6 hole problem (finite element mesh)
   9      1,349     9,101      4    7     4    7     pinched hole problem (finite element mesh)
  10        156       371           7          6     simple chemical plant model (chemical engineering problem)
  11        167       507           9         19     rigorous model of a chemical stage (chemical engineering problem)
  12        381     2,157          25         50     multiply fed column, 24 components (chemical engineering problem)
  13        132       414           9         19     rigorous flash unit (chemical engineering problem)
  14         67       294           6     2   10     cavett problem with 5 components (chemical engineering problem)
  15        655     2,854          12         35     16 stage column section, some stages simplified (chemical engineering problem)
  16        479     1,910          12         35     8 stage column section, all sections rigorous (chemical engineering problem)
  17        497     1,727          28         55     rigorous flash unit with recycling (chemical engineering problem)
  18        989     3,537          12         26     7 stage column section, all sections rigorous (chemical engineering problem)
  19        113       655          20         27     unsymmetric pattern supplied by Morven Gentleman
  20        199       701           6     2    9     unsymmetric pattern of order 199 given by Willoughby
  21        130     1,282         124        124     unsymmetric matrix from laser problem given by A. R. Curtis
  22        663     1,712         426          4     unsymmetric basis from LP problem (Shell)
  23        363     3,279          33         34     unsymmetric basis from LP problem (Stair)
  24        822     4,841         304         21     unsymmetric basis from LP problem (BP)
  25        541     4,285          11     5  541     unsymmetric facsimile convergence matrix

reductions are large in some cases. As a result, the percentage utilization for the lower triangular matrix L̄ is significantly improved using the SF-A approach. That is, in terms of accommodating the fill-in in M1 M2 · · · Mn-1, the structure of L̄ is much tighter than the structure of the Cholesky factor Lc of ATA.

Note also that the number of nonzeros in the upper triangular matrix Ū determined by the SF-A symbolic factorization algorithm is either identical or close to that in Lc in SF-ATA. We believe the explanation for this is as follows. First, it is known that Lc and the R obtained from the QR factorization of A have identical structures when


TABLE 4.2
Symbolic factorization using ATA (SF-ATA).

              Symbolic factorization                              Numerical solution
                            Number of   Number of nonzeros                  Factor   Solve   Percentage utilization
Problem     Store    Time   subscripts   Lower      Upper       Store        time     time     Lower     Upper
   1       30,620   1.083     7,713      37,554    37,554       91,247     57.183    1.617     50.36     68.82
   2       34,159   1.233     8,914      46,945    46,945      111,887     78.617    1.983     50.13     67.29
   3       36,244   1.333     9,227      47,484    47,484      113,998     80.383    2.033     50.50     66.65
   4       45,900   1.600    11,317      49,273    49,273      122,825     62.683    2.150     50.01     65.52
   5       36,751   1.283     8,588      26,570    26,570       72,350     20.650    1.233     49.28     64.03
   6       42,304   1.417     9,699      28,556    28,556       79,206     20.550    1.350     49.24     64.61
   7       36,507   1.283     9,460      35,291    35,291       90,286     39.783    1.550     50.43     65.14
   8       37,169   1.250    10,040      38,748    38,748       97,807     49.867    1.700     50.42     66.82
   9       43,842   1.517    10,465      53,161    53,161      128,930     72.250    2.300     49.34     63.55
  10        2,348   0.050       417         566       566        2,955      0.150    0.033     29.68     38.52
  11        3,276   0.083       559         875       875        3,814      0.217    0.050     30.51     40.11
  12       23,172   1.050     6,993      18,334    18,334       47,092     14.933    0.833     32.47     36.27
  13        2,645   0.100       466         742       742        3,140      0.183    0.050     27.49     40.57
  14        1,746   0.067       385         843       843        2,676      0.267    0.033     34.99     42.35
  15       21,058   0.800     5,143      12,906    12,906       36,852      7.600    0.633     24.14     39.12
  16       13,617   0.500     3,118       7,383     7,383       22,197      4.267    0.350     31.79     38.48
  17       15,416   0.550     2,223       6,363     6,363       19,424      1.783    0.300     13.30     24.96
  18       23,907   0.750     4,746       8,143     8,143       29,935      2.500    0.450     23.04     37.34
  19        3,702   0.133       557       1,322     1,322        4,220      0.500    0.067     22.54     64.45
  20        4,364   0.117     1,143       2,194     2,194        7,324      1.017    0.117     36.46     56.61
  21       16,829   2.000       260       7,763     7,763       16,958      9.817    0.350     67.51     48.78
  22      188,207  59.700       964      91,080    91,080      189,093     14.567    3.733      0.55      1.54
  23       15,420   0.633     4,007       7,772     7,772       22,820      4.567    0.383     24.14     49.83
  24      118,904  21.250     6,809      57,817    57,817      129,843     37.467    2.417     11.79     15.02
  25       24,429   0.900     6,634      16,600    16,600       44,705     12.950    0.767     40.37     46.77

A is irreducible [1]. Moreover, we have shown at the end of § 2 that the structures of Ū and R are identical. Thus, when A is irreducible, we would expect the structure of the upper triangular matrix Ū to be identical to that of Lc.

When A is reducible, the situation is quite different, as the following example illustrates:

        x x x x x x x                       x x x x x x x
          x                                   x
            x                                   x
    A =       x          symb. fact.              x
                x                                   x
                  x                                   x
                    x                                   x

Clearly ATA is dense, and Lc is full lower triangular. Apparently, many of our test problems are irreducible, or are nearly so in the sense of having relatively few irreducible blocks.


TABLE 4.3
Symbolic factorization algorithm of § 2 (SF-A). (Without subscript compression.)

              Symbolic factorization                                        Numerical solution
                        Number of subscripts  Number of nonzeros                    Factor   Solve   Percentage utilization
Problem     Store  Time    Lower    Upper       Lower    Upper       Store           time     time     Lower     Upper
   1       79,755  3.550   20,310   37,554     20,310   37,554      124,154        39.833    1.267     93.13     68.82
   2       96,533  4.117   25,765   46,945     25,765   46,945      154,503        59.017    1.600     91.33     67.29
   3       98,974  4.450   25,875   47,484     25,875   47,484      156,521        57.617    1.550     92.67     66.65
   4      109,146  5.017   26,462   49,273     26,462   49,273      164,432        44.017    1.633     93.11     65.52
   5       68,025  3.033   14,152   26,570     14,152   26,570       92,066        14.383    0.917     92.52     64.03
   6       75,615  3.417   15,300   28,556     15,300   28,556      100,107        14.383    1.017     91.91     64.61
   7       80,603  3.550   19,029   35,291     19,029   35,291      118,884        28.350    1.150     93.53     65.14
   8       86,013  3.733   20,922   38,748     20,922   38,748      129,611        35.433    1.250     93.38     66.82
   9      113,251  4.900   28,395   53,161     28,395   53,161      175,255        49.917    1.700     92.38     63.55
  10        3,105  0.167      256      544        256      544        3,006         0.100    0.050     65.63     40.07
  11        3,866  0.217      349      830        349      830        3,863         0.150    0.033     76.50     42.29
  12       34,446  1.667    8,780   17,539      8,780   17,539       56,069        10.617    0.550     67.80     37.91
  13        3,109  0.150      268      690        268      690        3,106         0.117    0.050     76.12     43.62
  14        2,530  0.133      426      843        426      843        3,143         0.217    0.017     69.25     42.35
  15       29,413  1.517    5,358   11,794      5,358   11,794       40,201         5.233    0.383     58.16     42.81
  16       19,514  1.033    3,698    7,203      3,698    7,203       26,115         3.183    0.250     63.47     39.44
  17       15,622  0.783    1,447    5,748      1,447    5,748       18,865         0.950    0.183     58.47     27.63
  18       27,303  1.433    2,771    7,565      2,771    7,565       29,575         1.600    0.300     67.70     40.20
  19        4,078  0.217      313    1,322        313    1,322        4,289         0.300    0.050     95.21     64.45
  20        6,549  0.317      960    2,194        960    2,194        8,101         0.733    0.083     83.33     56.61
  21       19,032  0.867    7,402    7,763      7,402    7,763       31,502         8.567    0.317     70.81     48.78
  22      112,290  4.983   11,153   91,080     11,153   91,080      210,435         6.133    1.967      4.52      1.54
  23       19,725  0.950    2,562    6,972      2,562    6,972       22,337         2.267    0.217     73.22     55.55
  24       91,038  4.050   16,574   56,559     16,574   56,559      153,666        19.200    1.517     41.14     15.35
  25       38,193  1.800    7,610   16,600      7,610   16,600       53,291         9.017    0.517     88.07     46.77

The SF-A symbolic factorization algorithm described in § 2 is usually slower than the SF-ATA symbolic factorization algorithm. (There are a few instances where the SF-A algorithm runs faster than the SF-ATA algorithm. For example, see Problems 22 and 24.) In most cases, the execution time of the SF-A symbolic factorization algorithm is about two to three times that of the SF-ATA symbolic factorization algorithm. In the SF-ATA algorithm, the structures of the Cholesky factors Lc and LcT of ATA are used to bound the structures of the triangular factors M1 M2 · · · Mn-1 and U. Only one set of subscripts is needed to describe the structures of Lc and LcT. However, in the SF-A symbolic factorization algorithm, the structure of L̄ is in general not the same as the structure of Ū. Thus two sets of subscripts are needed, one set for describing the structure of the lower triangular matrix and another for describing the structure of the upper triangular matrix. Consequently, extra work is done in the SF-A algorithm to generate these two sets of subscripts, so it is not surprising that the SF-A algorithm runs slower than the SF-ATA algorithm. The discussion above also explains why the storage requirements for the symbolic factorization and numerical solution in Table 4.3 are larger than those in Table 4.2, since in SF-A, there is exactly one subscript


for every nonzero stored. This is not the case in SF-ATA because subscript compression is employed in the symbolic factorization algorithm.

A final observation in connection with Tables 4.2 and 4.3 is that the execution times of the numerical factorization and triangular solution are smaller for SF-A than those for SF-ATA. One reason is that there are fewer nonzeros in L̄ + Ū in SF-A than in Lc + LcT in SF-ATA. In addition, the uncompressed storage scheme created by the SF-A symbolic factorization algorithm is simpler than the compressed storage scheme generated by the SF-ATA symbolic factorization algorithm. We will return to this later.

The percentage utilization for Problem 22 is extremely small for both SF-ATA and SF-A. This is caused by a few dense rows in the coefficient matrix A, as one can see from Table 4.1. These dense rows effectively destroy the sparsity of some of the other rows when we determine a column ordering from ATA. As we shall see later in this section, there will be substantial improvements when we withhold these dense rows from A in determining a column ordering.

The analysis in § 2 showed that the complexity of the SF-A symbolic factorization algorithm is O(max(|A|, |L̄| + |Ū|)). That is, for large sparse problems, the complexity of the SF-A symbolic factorization algorithm is essentially proportional to the number of subscripts generated by the algorithm. To verify this, we applied the SF-A algorithm to a sequence of finite element problems defined on graded-L meshes. The results, given in Table 4.4, indeed suggest that the complexity of the new symbolic factorization algorithm is O(max(|A|, |L̄| + |Ū|)).

TABLE 4.4
Applying SF-A to a sequence of graded-L finite element problems.

         Number of             Number of subscripts               Ratio t/nz
Order    nonzeros in A   Time    Lower      Upper    Total (nz)    (×10^5)
  265        1,753       0.750     3,942      7,088     11,030      6.800
  406        2,716       1.300     7,377     13,104     20,481      6.347
  577        3,889       2.067    11,520     21,058     32,578      6.345
  778        5,272       3.100    18,697     33,433     52,130      5.947
1,009        6,865       4.150    25,765     46,945     72,710      5.708
1,270        8,668       5.500    34,376     62,783     97,159      5.661
1,561       10,681       7.350    47,797     86,720    134,517      5.464
1,882       12,904       8.967    57,384    105,836    163,220      5.494
2,233       15,337      11.367    73,112    134,773    207,885      5.468
2,614       17,980      13.650    91,393    167,508    258,901      5.272
3,025       20,833      16.467   111,237    203,579    314,816      5.231
3,466       23,896      19.417   132,385    242,870    375,255      5.174

In terms of the number of nonzeros accommodated by the data structure, the results presented so far have shown that the symbolic factorization algorithm SF-A is considerably better than SF-ATA. However, in terms of storage and time required, the SF-A algorithm is not as efficient as the SF-ATA algorithm. The high storage requirement is due to the fact that there is one subscript for every nonzero determined by the SF-A algorithm. If the number of nonzeros determined by the SF-A algorithm is η, then we require at least η locations to execute the SF-A symbolic factorization algorithm, since we need η locations to store the subscripts. Consequently, we need at least 2η locations to perform the numerical solution.


In § 3, subscript compression was proposed to reduce the storage and execution time required by the SF-A symbolic factorization algorithm. Our next experiment involved incorporating subscript compression in the implementation of the new symbolic factorization algorithm (SFC-A), and was intended to illustrate the improvement that may be achieved. The results in Table 4.5 show that the new symbolic factorization algorithm with subscript compression performs extremely well. The number of subscripts that have to be stored is much smaller than that required in SF-A. Also note that even though the storage scheme in SFC-A is more complicated, the reduction in the number of subscripts is large enough to offset the additional storage required for the more sophisticated storage scheme. As a result, the storage requirements for both the symbolic factorization phase and the numerical solution phase associated with SFC-A are much smaller than those required by SF-A. Indeed, in most cases the storage required is even less than that required by SF-ATA.
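As a rough illustration of the idea (the actual scheme of § 3.1 is more elaborate; this toy version pools only exactly matching subscript sets):

```python
# Toy subscript compression: rows with identical structure share one stored
# subscript list; each row keeps only a pointer into the pool.
def compress_subscripts(row_structs):
    pool, index, pointers = [], {}, []
    for s in row_structs:
        key = tuple(sorted(s))
        if key not in index:
            index[key] = len(pool)
            pool.append(key)
        pointers.append(index[key])
    return pool, pointers

structs = [{2, 5}, {2, 5}, {2, 5}, {7}, {2, 5}]
```

Here five rows need only two stored subscript lists, and the pointer vector [0, 0, 0, 1, 0] records which list each row uses; in a factorization with many structurally identical rows, the saving can be substantial.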

It is noteworthy that the SFC-A symbolic factorization algorithm runs faster than the SF-A symbolic factorization algorithm. This is because fewer subscripts have to be generated in the former case. As noted earlier, extra indexing is needed in using

TABLE 4.5
Symbolic factorization algorithm with subscript compression (SFC-A).

              Symbolic factorization                                       Numerical solution
                        Number of subscripts  Number of nonzeros                   Factor   Solve   Percentage utilization
Problem     Store  Time    Lower    Upper       Lower    Upper       Store          time     time     Lower     Upper
   1       35,459  1.667    3,981    7,713     20,310   37,554       79,858       44.033    1.250     93.13     68.82
   2       39,202  1.817    4,445    8,914     25,765   46,945       97,172       63.450    1.583     91.33     67.29
   3       42,143  1.967    5,121    9,227     25,875   47,484       99,690       64.133    1.583     92.67     66.65
   4       53,535  2.550    5,925   11,317     26,462   49,273      108,821       48.900    1.650     93.11     65.52
   5       42,636  2.000    4,383    8,588     14,152   26,570       66,677       16.450    1.000     92.52     64.03
   6       49,045  2.283    4,831    9,699     15,300   28,556       73,537       16.867    1.067     91.91     64.61
   7       42,623  1.950    4,602    9,460     19,029   35,291       80,904       32.100    1.217     93.53     65.14
   8       43,461  2.133    4,794   10,040     20,922   38,748       87,059       42.083    1.433     93.38     66.82
   9       50,791  2.600    5,931   10,465     28,395   53,161      112,795       62.317    1.950     92.38     63.55
  10        3,176  0.167      152      405        256      544        3,077        0.117    0.033     65.63     40.07
  11        3,720  0.167      173      524        349      830        3,717        0.167    0.033     76.50     42.29
  12       17,897  0.933    2,462    6,544      8,780   17,539       39,520       12.000    0.583     67.80     37.91
  13        2,992  0.150      140      435        268      690        2,989        0.133    0.033     76.12     43.62
  14        1,953  0.083      171      385        426      843        2,566        0.217    0.033     69.25     42.35
  15       20,140  1.017    1,579    4,988      5,358   11,794       30,928        6.133    0.417     58.16     42.81
  16       13,771  0.683    1,158    3,040      3,698    7,203       20,372        3.567    0.267     63.47     39.44
  17       12,082  0.650      558    2,101      1,447    5,748       15,325        1.083    0.217     58.47     27.63
  18       25,026  1.200    1,523    4,556      2,771    7,565       27,298        1.850    0.333     67.70     40.20
  19        3,351  0.200      111      569        313    1,322        3,562        0.300    0.050     95.21     64.45
  20        5,393  0.250      440    1,158        960    2,194        6,945        0.800    0.083     83.33     56.61
  21        4,780  0.283      391      260      7,402    7,763       17,250        9.633    0.333     70.81     48.78
  22       14,988  3.617    2,638      965     11,153   91,080      113,133        6.917    2.067      4.52      1.54
  23       15,337  0.733      800    3,618      2,562    6,972       17,949        2.633    0.217     73.22     55.55
  24       29,887  2.617    4,195    6,141     16,574   56,559       92,515       22.050    1.533     41.14     15.35
  25       23,868  1.133    2,167    6,634      7,610   16,600       38,966       10.200    0.533     88.07     46.77


the storage scheme created by the SFC-A algorithm. Thus, we should expect the execution times for the numerical solution associated with SFC-A to be larger than those in SF-A. The results in Table 4.5 show that this is indeed the case, although fortunately the increase is not very significant. Furthermore, even though the execution times for the numerical solution associated with SFC-A have been increased slightly, they are still less than those associated with SF-ATA.

For the SF-A symbolic factorization algorithm, our analysis in § 2 showed that its run time is proportional to the number of subscripts generated. We have not been able to establish that the SFC-A symbolic factorization algorithm behaves in a similar manner, although preliminary experiments contained in Table 4.6 for the graded-L problems suggest that the execution time of the SFC-A algorithm is proportional to the number of subscripts generated as well.

TABLE 4.6
Applying SFC-A to a sequence of graded-L finite element problems.

         Number of             Number of subscripts              Ratio t/nz
Order    nonzeros in A   Time    Lower     Upper    Total (nz)    (×10^5)
  265        1,753       0.467    1,029     2,087      3,116      14.987
  406        2,716       0.700    1,667     3,346      5,013      13.964
  577        3,889       1.033    2,459     5,102      7,561      13.662
  778        5,272       1.367    3,365     7,003     10,368      13.185
1,009        6,865       1.833    4,445     8,914     13,359      13.721
1,270        8,668       2.350    6,070    11,639     17,709      13.270
1,561       10,681       2.917    7,202    14,543     21,745      13.415
1,882       12,904       3.583    9,178    17,118     26,296      13.626
2,233       15,337       4.333   11,276    20,864     32,140      13.482
2,614       17,980       5.033   12,713    24,933     37,646      13.369
3,025       20,833       5.967   15,779    28,270     44,049      13.546
3,466       23,896       6.934   18,746    32,280     51,026      13.589

Recall that the column ordering we used for a matrix A in the experiments was an ordering obtained by applying Liu's variant of the minimum degree algorithm [11] to the matrix ATA. As observed at the end of § 3, dense rows in the matrix A may affect the quality of the column ordering since they form dense submatrices in ATA that may cause unnecessary fill-in in L̄ and Ū. (Problem 22 is an example.) A suggested solution to this problem is to withhold such rows from A when the ordering is determined from ATA. The column ordering is then applied to the original matrix A when we perform the symbolic factorization algorithm (with subscript compression). This strategy has been implemented and applied to the problems in Table 4.1 that have rows with more than 50 nonzeros. If we compare the results in Table 4.7 with the corresponding results in Table 4.5, we observe that there are indeed significant improvements when dense rows are withheld in determining the column orderings. In each case, there is a large reduction in the number of nonzeros determined by the symbolic factorization algorithm. Consequently, there is also a large reduction in the storage required by the numerical solution phase. Furthermore, the factorization and solution times are much smaller than those in Table 4.5. In some cases, the run time for the symbolic factorization algorithm and the number of (compressed) subscripts generated are also smaller.


TABLE 4.7
Symbolic factorization algorithm with subscript compression (SFC-A).
(With withholding of dense rows in determining a column ordering.)

              Symbolic factorization                                 Numerical solution
                        Number of subscripts  Number of nonzeros            Factor   Solve   Percentage utilization
Problem     Store  Time    Lower    Upper       Lower    Upper      Store    time     time     Lower     Upper
  21        4,561  0.667     131      301       1,213    7,804     10,883   1.100    0.200     54.33     11.38
  22       13,168  0.883     429    1,354         617    2,216     11,913   0.283    0.150     86.87     29.78
  24       28,855  1.617   2,164    7,140       7,501   18,756     44,607   7.750    0.650     56.14     37.09

Another strategy for possibly improving the performance of the symbolic factorization algorithm is to apply the symbolic factorization algorithm to the matrix AT and compute (numerically) a triangular factorization of it. This decomposition is equally useful in solving Ax = b. In Table 4.8, we have provided the results of applying SFC-A to the transpose of some of the matrices described in Table 4.1. Rows with more than 50 nonzeros are withheld when a column ordering for AT is determined. (Row withholding occurs in Problems 17, 21 and 25.) Comparing the results in Table 4.8 with those in Tables 4.5 and 4.7, we see that it may be desirable to use the transpose in some instances. For example, Problem 22 has a few very dense rows but no dense columns. Applying SFC-A to the matrix itself results in low percentage utilization,

TABLE 4.8
Symbolic factorization algorithm with subscript compression (SFC-A).
(The transpose is used and dense rows in the transpose are withheld in determining a column ordering.)

              Symbolic factorization                                 Numerical solution
                        Number of subscripts  Number of nonzeros             Factor   Solve   Percentage utilization
Problem     Store  Time    Lower    Upper       Lower    Upper      Store     time     time     Lower     Upper
  10        3,074  0.150     128      327         204      444      2,823    0.100    0.033     69.61     47.07
  11        3,860  0.150     224      613         507    1,083      4,268    0.250    0.050     67.85     51.89
  12       17,624  0.983   2,671    6,062      11,160   20,204     44,292   13.100    0.667     55.27     27.96
  13        3,123  0.133     195      511         481      993      3,636    0.250    0.050     55.51     45.72
  14        1,915  0.083     171      347         381      764      2,404    0.200    0.033     74.54     39.01
  15       19,926  1.033   1,696    4,657       6,942   16,269     36,773   10.817    0.533     73.93     50.45
  16       13,037  0.667     871    2,593       3,129    8,035     19,901    2.783    0.283     53.95     37.06
  17       11,944  0.617     480    2,041       1,304    5,405     14,701    1.100    0.183     60.58     31.51
  18       26,370  1.333   1,717    5,706       4,675   13,303     36,284    3.417    0.483     53.67     30.94
  19        3,475  0.200     238      566         710    1,522      4,283    0.683    0.050     96.48     75.89
  20        5,393  0.233     488    1,110       1,355    2,471      7,617    0.983    0.100     80.37     51.19
  21        4,767  0.517     146      492       1,325    8,058     11,455    1.250    0.183     78.49      5.85
  22       12,595  0.483     134    1,076         174    1,274      9,955    0.100    0.133     89.08     78.73
  23       17,050  1.000   1,235    4,896       4,358   12,341     26,827    4.917    0.417     53.85     36.58
  24       26,650  1.300   1,396    5,703       5,189   16,703     38,037    6.650    0.533     71.71     43.69
  25       22,560  1.333   1,957    5,536       7,614   16,484     37,546    9.750    0.533     86.77     44.86


high storage requirements and large execution time. Another interesting example is Problem 17, which has a few dense columns but no dense rows. Applying SFC-A to the transpose, with dense rows withheld in determining the column ordering, seems to be better than just applying SFC-A to the matrix itself.

However, the opposite can also hold. As an example, the results (including storage requirements and execution times) of applying SFC-A (with withholding of dense rows in determining the column ordering) to Problem 21 are better than the corresponding results in Table 4.8. At this point, we do not know an effective scheme for deciding whether to use A or AT, other than simply applying the SFC-A symbolic factorization algorithm to both A and AT, and choosing the one that requires the least storage. Since the SFC-A symbolic factorization algorithm is very efficient, this is not an unreasonable thing to do.

In [10], a number of comparisons of the SF-ATA algorithm with various standard alternative schemes was provided. However, for completeness, we include here some experimental results obtained by applying the "industry standard" Harwell library code MA28 to our set of test problems. (The version of MA28 used in our comparisons has since been replaced by a version with pivoting options that can significantly reduce storage and execution time when the matrix being factored has symmetric structure; see [4].)

This code employs row and column permutations to maintain both numerical stability and sparsity. Since the permutations depend on the numerical values, and the relative importance attached to stability and sparsity, one cannot predict in advance how much space will be needed for the factors. Moreover, since storage is allocated during the computation, the data structures employed must be quite flexible, and require substantial execution and storage "overhead." A threshold pivoting technique is used when searching for a pivot; that is, at the kth step of the decomposition, an element in the reduced matrix, say Aki, will be chosen as the pivot if it satisfies

    |Aki| >= u max_j |Akj|,

where u is a user specified parameter satisfying 0=< u-< 1. (See [2], [6] for details.)Increasing the threshold parameter may improve the accuracy, but this may alsoincrease both the storage requirements and execution times. In our experiments, wehave set u 0.1.
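The threshold test can be stated compactly in code. In this sketch the function name and the orientation of the scan are our assumptions (see [2], [6] for the precise criterion used in MA28); an entry is an acceptable pivot if its magnitude is at least u times the largest magnitude among the candidate entries.

```python
import numpy as np

def acceptable_pivots(candidates, u=0.1):
    # Return indices of entries passing the threshold test
    # |A_ki| >= u * max_j |A_kj|; u = 0.1 matches the experiments above.
    candidates = np.asarray(candidates, dtype=float)
    bound = u * np.max(np.abs(candidates))
    return [j for j, v in enumerate(candidates) if abs(v) >= bound]

idx = acceptable_pivots([0.05, -3.0, 0.4, 0.0])
# With u = 0.1 the bound is 0.3, so the entries at indices 1 and 2 qualify.
```

A larger u narrows the set of acceptable pivots toward the classical partial-pivoting choice, at a possible cost in fill.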

The amount of space reported in Table 4.9 for MA28 is the minimum amount required in order to solve each problem. With the minimum amount of space, MA28 has to perform numerous data compressions and consequently requires substantial execution time. Thus, in order to reduce the execution time, it is common practice to provide more space than the minimum amount. The extra space is sometimes known as "elbow room." See [7] for some numerical experiments. In our experiments, we provided enough elbow room that no data compressions occurred.

The results show that for the finite element problems, our new scheme enjoys a considerable advantage over MA28. For the nonfinite element problems, the comparison is less decisive, although the new scheme is comparable to MA28 in the majority of cases. An examination of the results in Tables 4.5, 4.7 and 4.8 suggests that reliable identification and exploitation of dense rows, and the appropriate choice of either A or Aᵀ, would make the new scheme comparable with or better than MA28 in all the test examples.

While it is true that the structure of Û often substantially overestimates the structure of the upper triangular factor U for any given pivot sequence, we contend



TABLE 4.9
Results obtained using the Harwell MA28 code.

                  Numerical solution
Problem    Minimal store    Total time
   1          116,630         428.267
   2          144,365         551.317
   3          141,168         326.200
   4          145,106         363.167
   5           96,390          92.167
   6          103,699         101.233
   7          110,327         305.583
   8          117,398         321.233
   9          181,826         830.100
  10            3,367           0.433
  11            4,087           0.550
  12           16,353           6.750
  13            3,348           0.500
  14            2,412           0.500
  15           23,526          10.233
  16           15,058           4.333
  17           12,795           1.717
  18           27,001          11.950
  19            3,905           0.883
  20            6,525           1.633
  21            5,934           1.517
  22           14,418           1.150
  23           17,297           5.233
  24           27,817           4.950
  25           45,191          35.450

that the important number in this connection is the total storage requirement (which includes storage for the numbers and the overhead storage). The use of a static data structure in our scheme allows for very low overhead storage, so that even though a significant number of zeros is stored in the data structure, the total storage requirement is competitive with the more elaborate dynamic storage schemes used by codes such as MA28.
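The contrast with a dynamic scheme can be made concrete. The sketch below (class and field names are hypothetical) shows a CSC-like static layout: the integer index arrays are fixed by the symbolic phase, so the overhead is known in advance and never grows during numerical factorization; only the value array is rewritten for each new matrix with the same structure.

```python
import numpy as np

class StaticFactorStore:
    # Static compressed-column storage: the structure (colptr, rowind)
    # is fixed once by the symbolic factorization; only `values` changes
    # between numeric factorizations of same-structure matrices.
    def __init__(self, colptr, rowind):
        self.colptr = np.asarray(colptr, dtype=np.int64)
        self.rowind = np.asarray(rowind, dtype=np.int64)
        self.values = np.zeros(len(self.rowind))

    def overhead_words(self):
        # Overhead = the two index arrays; no run-time reallocation needed.
        return len(self.colptr) + len(self.rowind)

# A 3-by-3 structure with 5 structural nonzeros:
# column 0 holds rows {0, 1}, column 1 holds {1}, column 2 holds {0, 2}.
store = StaticFactorStore(colptr=[0, 2, 3, 5], rowind=[0, 1, 1, 0, 2])
```

Positions that turn out to hold zeros for a particular pivot sequence simply carry the value 0.0; the structure itself is never modified.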

As noted earlier, an important feature of the new method is that the amount of space allocated to store the LU-decomposition is not sensitive to the numerical values in A and the row interchanges. The same data structure can be used for different coefficient matrices so long as they have the same structure. After the first system has been solved, it is not necessary to determine the data structure again when subsequent systems having the same structure are to be solved. This is not true for MA28, where the amount of space depends on the numerical values and the row interchanges (unless one wants to decompose a new matrix having the same structure using the pivotal sequence obtained during the decomposition of the old matrix).

The MA28 suite of subroutines provides a mechanism for decomposing a matrix having a structure identical to a previously decomposed matrix, and employing the previously determined pivotal sequence. In this mode, the storage requirement is in general smaller than that needed in the initial factorization, and the execution times are considerably less. Although we have not yet pursued its implementation, the same



strategy could be used in connection with our method. After the decomposition of the first matrix, we can postprocess the data structure to release any unused storage locations. We could then decompose a new matrix having the same structure using the same pivotal sequence.

The efficient implementation of this idea is complicated by the possibility of numerical cancellation occurring in the initial factorization. Since such cancellation might not occur in a subsequent factorization, one cannot simply examine the data structure containing the initial factorization and "compress out" all the zeros. Some of them might have been caused by cancellation, rather than being a consequence of the structure of the matrix. Thus, the postprocessing of the data structure mentioned above must include a mechanism that preserves space for the factors under the assumption that no numerical cancellation occurs.
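The caveat above can be illustrated with a minimal sketch, assuming the symbolic structure is available as a set of positions (all names here are hypothetical). Only positions absent from the symbolic structure may be discarded; a stored zero at a structural position must be kept, since it may be due to cancellation.

```python
def compress_factor(positions, values, symbolic):
    # Keep every position that is structurally nonzero, even if its
    # stored value is 0.0 (possible numerical cancellation); discard
    # only positions that lie outside the symbolic structure.
    return [(p, v) for p, v in zip(positions, values) if p in symbolic]

# Symbolic structure of a small factor, as a set of (row, col) positions.
symbolic = {(0, 0), (1, 0), (1, 1), (2, 1)}
positions = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]
values    = [2.0,    0.0,    1.5,    3.0,    0.0]   # (1,0) zero by cancellation

kept = compress_factor(positions, values, symbolic)
# (2,2) is dropped as non-structural; (1,0) is kept despite being zero.
```

In practice the symbolic structure would come from the symbolic factorization phase rather than being listed explicitly.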

5. Concluding remarks. In this paper, we have described a new symbolic factorization algorithm for Gaussian elimination with partial pivoting. Several enhancements have been proposed to improve the performance of the algorithm. Preliminary numerical experiments have shown that the data structure created by the new symbolic factorization algorithm (together with the enhancements) for sparse Gaussian elimination with partial pivoting is much tighter than that generated by the approach proposed in [10]. Storage required for the numerical solution phase is smaller in the new approach than in [10]. There are also substantial improvements in execution times for the numerical solution phase. The method appears to be quite competitive with alternative approaches for solving general sparse systems of equations.

Several problems are still outstanding. First, preliminary results indicate that the run time of the new symbolic factorization algorithm with subscript compression is proportional to the number of (compressed) subscripts generated. We continue to work on establishing this result. Second, the column ordering used for A in the experiments presented in this paper is obtained by applying the minimum degree ordering algorithm to the matrix AᵀA. As observed in § 3.2, there are other possibilities for determining "good" column orderings for A. We are currently designing efficient algorithms for generating these column orderings. Third, an important problem is to establish criteria for determining when a row should be regarded as "dense." This is a problem common to many sparse matrix computations. A related problem is to investigate when we should work with Aᵀ instead of A. Finally, if the matrix A is reducible, then A can be permuted symmetrically so that the permuted matrix has a block triangular form. In this case, it is only necessary to decompose the diagonal blocks. The ideas described in this paper can of course be applied to these diagonal blocks. Efficient implementation of this generalization is under consideration.
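The reducible case can be sketched with standard tools. Assuming the matrix already has a zero-free diagonal (in general a maximum transversal must be found first, cf. [3]), the strongly connected components of its directed graph give the diagonal blocks of the block triangular form; this example uses SciPy's graph routines on a small hypothetical matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical reducible matrix with a zero-free diagonal.
A = csr_matrix(np.array([[1, 0, 0],
                         [1, 1, 1],
                         [0, 1, 1]], dtype=float))

# Strongly connected components of the directed graph of A identify
# the diagonal blocks; only these blocks need to be decomposed.
nblocks, labels = connected_components(A, directed=True, connection='strong')
```

Here rows 1 and 2 form one block (they lie on a directed cycle), and row 0 forms a 1-by-1 block on its own.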

REFERENCES

[1] T. F. COLEMAN, A. EDENBRANDT AND J. R. GILBERT, Predicting fill for sparse orthogonal factorization, J. Assoc. Comput. Mach., 33 (1986), pp. 517-532.

[2] I. S. DUFF, MA28--A set of FORTRAN subroutines for sparse unsymmetric linear equations, Tech. Rpt. AERE R-8730, Harwell, England, 1977.

[3] I. S. DUFF, On algorithms for obtaining a maximum transversal, ACM Trans. Math. Software, 7 (1981), pp. 315-330.

[4] I. S. DUFF, The solution of nearly symmetric sparse linear equations, in Computing Methods in Applied Sciences and Engineering VI, R. Glowinski and J. L. Lions, eds., North-Holland, Amsterdam, 1984, pp. 57-74.

[5] I. S. DUFF, R. G. GRIMES, J. G. LEWIS AND W. G. POOLE, JR., Sparse matrix test problems, ACM SIGNUM Newsletter, 17, No. 2 (1982), p. 22.



[6] I. S. DUFF AND J. K. REID, Some design features of a sparse matrix code, ACM Trans. Math. Software, 5 (1979), pp. 18-35.

[7] J. A. GEORGE, M. T. HEATH AND E. G. NG, A comparison of some methods for solving sparse linear least squares problems, this Journal, 4 (1983), pp. 177-187.

[8] J. A. GEORGE AND J. W-H. LIU, Algorithms for matrix partitioning and the numerical solution of finite element systems, SIAM J. Numer. Anal., 15 (1978), pp. 297-327.

[9] J. A. GEORGE AND J. W-H. LIU, An optimal algorithm for symbolic factorization of symmetric matrices, SIAM J. Comput., 9 (1980), pp. 583-593.

[10] J. A. GEORGE AND E. G-Y. NG, An implementation of Gaussian elimination with partial pivoting for sparse systems, this Journal, 6 (1985), pp. 390-409.

[11] J. W-H. LIU, Modification of the minimum degree algorithm by multiple elimination, ACM Trans. Math. Software, 11 (1985), pp. 141-153.

[12] A. H. SHERMAN, On the efficient solution of sparse systems of linear and nonlinear equations, Research Rpt. 46, Dept. of Computer Science, Yale University, New Haven, CT, 1975.
