International Workshop on Collaboration between Numerical Methods and Large-Scale Scientific Computation 2006 (iWNMSC’06) October 25th, 2006. The University of Tokyo, Tokyo, Japan Supported by CREST/Japan Science and Technology Agency (JST) 21st Century Earth Science COE Program, The University of Tokyo


Goal of this Workshop

Solving large-scale linear equations [A]x=b is one of the most important, critical and expensive processes in every field of scientific computing. In recent years, various types of methods (direct, iterative) have been proposed and applied to various types of matrices (sparse/dense, symmetric/unsymmetric, etc.) derived from various types of applications and space discretization methods, such as FEM, FDM, BEM, spectral methods, etc. Moreover, recent advances in parallel computers and related technologies provide new technical issues in this area of research.

Over the past ten years, it has been said that strong collaboration among applications (physics), computational/computer science and applied mathematics is the most critical and important issue for scientific computing using parallel computers. For example, it is well known that preconditioning is the key technology for robust and efficient iterative linear solvers. Profound knowledge of, and experience with, applications, linear algebra and computer systems are all required for the development of robust and efficient (parallel) preconditioned iterative solvers. Therefore, collaboration among different research areas is essential. However, this type of collaboration is still not common in real development, especially in Japan.

The goal of this workshop is to address the complex issues related to the solution of general sparse/dense matrix problems in large-scale real applications. The workshop will bring together researchers in numerical methods and application scientists to discuss the latest developments and progress made, to exchange findings, and to explore possible new directions.

Seiji Fujino (Kyushu University)
Kengo Nakajima (University of Tokyo)
Workshop Co-Chairs


Program

Registration 09:00-

Session 1: Opening Session 09:20-09:30

Chair: Kengo Nakajima (University of Tokyo)

09:20-09:30 Welcome Kengo Nakajima (University of Tokyo)

09:30-10:10 SuperSolvers: Hybrid, Adaptive and Composite Solvers Padma Raghavan (Pennsylvania State University) ........................................1

Session 2: Applications and Performance 10:10-11:55

Chair: Kengo Nakajima (University of Tokyo)

10:10-10:50 Multigrid Preconditioners for Ill-posed Inverse Problems Omar Ghattas (University of Texas at Austin) ................................................2

10:50-11:15 Parallel Computing with ScaLAPACK for Large-scale Stress Inversion Analysis

Toshiko Terakawa (University of Tokyo) .........................................................3

11:15-11:55 The Performance of Key Scientific Computing Algorithms on the Sony/Toshiba/IBM Cell Broadband Engine

John Shalf (Lawrence Berkeley National Laboratory) ...................................5

(Lunch Break) 11:55-13:00

Session 3: Iterative Methods and Theory 13:00-13:50

Chair: Seiji Fujino (Kyushu University)

13:00-13:25 A Variant of the Orthomin(m) Method for Solving Linear Systems Kuniyoshi Abe (Gifu Shotoku Gakuen University)..........................................7

13:25-13:50 A QMR Method based on an A-Biorthogonalization Process Satsuki Minami (University of Tokyo).............................................................9

Session 4: Eigensolvers and Auto Tuning 13:50-14:40

Chair: Seiji Fujino (Kyushu University)

13:50-14:15 The Multi-section with Multiple Eigenvalues Method for Symmetric Tridiagonal Eigenproblem and Its Performance on LAPACK 4.0 MRRR Routine

Takahiro Katagiri (University of Electro-Communications).........................11

14:15-14:40 Toward an Automatically Tuned Dense Symmetric Eigensolver for Shared Memory Machines

Yusaku Yamamoto (Nagoya University) ........................................................13


(Coffee Break) 14:40-15:00

Session 5: Preconditioning Methods 15:00-17:20

Chair: Padma Raghavan (Pennsylvania State University)

15:00-15:40 ICT-SSAI: Scalable Parallel Tree-Based Drop-Threshold Incomplete Cholesky Preconditioning using Selective Sparse Approximate Inversion

Keita Teranishi (Pennsylvania State University) ..........................................14

15:40-16:05 Comparison Index for Parallel Ordering in ILU Preconditioning Techniques
Takeshi Iwashita (Kyoto University).............................................................15

16:05-16:30 Removal of Instability of the Standard IC Decomposition due to Inverse-based Dropping
Seiji Fujino (Kyushu University) ..................................................................17

16:30-16:55 Preconditioning Techniques for Saddle Point Problems arising from FEM Applications
Takumi Washio (University of Tokyo)............................................................21

16:55-17:20 Parallel Preconditioning Methods for Contact Problems with FEM
Kengo Nakajima (University of Tokyo).........................................................23

17:20-17:30 Closing Remarks

Seiji Fujino (Kyushu University)


SuperSolvers: Hybrid, Adaptive and Composite Solvers Padma Raghavan1, Sanjukta Bhowmick2, Keita Teranishi1 and Ingyu Lee1

1Department of Computer Science and Engineering, The Pennsylvania State University

111 IST Bldg. University Park, PA 16802, USA raghavan,teranish,[email protected]

2 Department of Applied Physics and Applied Mathematics, Columbia University 200 S.W. Mudd Bldg.New York, NY 10027, USA

[email protected]

Sparse linear system solution costs can often dominate the execution times of large-scale modeling and simulation applications using implicit or semi-implicit schemes. There are a large number of basic sparse solution schemes from classes such as direct, preconditioned-iterative (Krylov), domain-decomposition, and multigrid/multilevel methods. The number of methods grows multiplicatively when Krylov iterative methods are used as smoothers in multilevel methods, or the latter are used as preconditioners for the former. The performance of all these methods, including convergence, reliability, execution time, parallel efficiency, and scalability, can vary dramatically depending on the exact interaction of a specific method, the problem instance, and the computer architecture. It is typically neither possible nor practical to predict a priori which sparse linear solver algorithm performs best for a given problem.

In this presentation, we will discuss our approach to addressing this problem by developing multimethod supersolvers. We will discuss algorithms and software for automated method selection and composition to instantiate robust and scalable solvers tailored to meet application demands. We will consider the following aspects of hybrid, adaptive and composite solvers:

Hybrid Solvers use flexible incomplete sparse factorization preconditioners with a range of fill-in, from pure iterative to pure direct [1,2,3].
Adaptive Solvers dynamically select a sparse solution scheme to match the changing numerical attributes of systems generated across iterations of a long-running simulation [5].
Composite Solvers use a sequence of basic solution schemes on a single linear system to enable highly reliable solution with limited memory requirements [4,5].

References
[1] I. Lee, P. Raghavan and E. G. Ng: Effective Preconditioning through Ordering Interleaved with Incomplete Factorization, SIAM J. Matrix Anal. Appl., Vol. 27, pp. 1068-1088, 2006.
[2] P. Raghavan, K. Teranishi and E. G. Ng: A Latency Tolerant Hybrid Sparse Solver Using Incomplete Cholesky Factorization, Numer. Linear Algebra Appl., Vol. 10, pp. 541-560, 2003.
[3] K. Teranishi, P. Raghavan and B. F. Smith: Tree-Based Parallel Hybrid Sparse Solvers, in preparation; shorter version presented at Domain Decomposition 2006.
[4] S. Bhowmick, P. Raghavan, L. C. McInnes and B. Norris: Faster PDE-Based Simulations Using Robust Composite Linear Solvers, Future Generation Computer Systems, Vol. 20, pp. 373-387, 2004.
[5] S. Bhowmick: Multimethod Adaptive and Composite Solvers: Algorithms, Software and Applications, Ph.D. Thesis, The Pennsylvania State University, 2004.
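The composite-solver idea, applying a sequence of basic schemes ordered by increasing cost and reliability to a single system until one passes an acceptance test, can be sketched as follows. This is a toy illustration with an assumed interface (a Jacobi sweep falling back to a dense direct solve); the names jacobi, direct and composite_solve are ours, not the SuperSolvers API.

```python
import numpy as np

def jacobi(A, b, tol, maxit=200):
    """Cheap first attempt: Jacobi iteration; may fail to converge."""
    D = np.diag(A)
    x = np.zeros_like(b)
    with np.errstate(all="ignore"):   # divergence may overflow; we just report failure
        for _ in range(maxit):
            x = x + (b - A @ x) / D
            if np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b):
                return x, True
    return x, False

def direct(A, b, tol):
    """Robust fallback: dense direct solve (most reliable, most memory)."""
    x = np.linalg.solve(A, b)
    return x, np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b)

def composite_solve(A, b, tol=1e-10):
    """Try each basic scheme in turn until one passes the acceptance test."""
    for solver in (jacobi, direct):   # ordered by increasing cost/reliability
        x, ok = solver(A, b, tol)
        if ok:
            return x, solver.__name__
    raise RuntimeError("all solvers in the composite failed")

# Diagonally dominant system: the cheap method already suffices.
A1 = np.array([[4.0, 1.0], [1.0, 4.0]])
x1, used1 = composite_solve(A1, np.array([1.0, 1.0]))
# Indefinite, non-dominant system: Jacobi diverges, composite falls back.
A2 = np.array([[1.0, 3.0], [3.0, 1.0]])
x2, used2 = composite_solve(A2, np.array([1.0, 2.0]))
print(used1, used2)
```

The ordering of the tuple in composite_solve encodes the policy described in [4]: cheap, possibly unreliable methods first, with the most robust (and most memory-hungry) method as the last resort.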


Multigrid Preconditioners for Ill-posed PDE-based Inverse Problems

Omar Ghattas 1

1 Institute for Computational Engineering and Sciences, Jackson School of Geosciences,

Department of Mechanical Engineering, The University of Texas at Austin Austin, Texas, USA

[email protected]

We are interested in the design of fast, parallel, scalable solvers for ill-posed inverse problems governed by PDEs. These problems have very different solver requirements than do forward (differential) solvers. While forward operators are typically local and causal, inverse operators are typically nonlocal and noncausal. Forming the inverse operator is intractable for large-scale problems, and we must resort to matrix-free Krylov methods. While convergence of unpreconditioned Krylov methods can be mesh-independent for compact inverse operators, this is not sufficient for the design of a fast solver: each matvec requires a pair of forward/adjoint solves for each independent source in the inverse problem, and thus no more than a handful of Krylov iterations can be tolerated. Effective preconditioning is therefore essential.

In joint work with V. Akcelik, G. Biros, A. Borzi, A. Draganescu, J. Hill, and B. van Bloemen Waanders, we are working on multigrid methods for inverse operators that are related to those of Hackbusch, King, and Kaltenbacher for compact operators. Standard multigrid smoothers cannot be used, and special-purpose smoothers that exploit the spectral structure of the inverse operator must be developed. We demonstrate the performance of such a multigrid method on an inverse problem: reconstruction of the unknown initial concentration of an airborne contaminant in a scalar convection-diffusion transport model, from limited-time, spatially-discrete measurements of the contaminant concentration. Experiments demonstrate that a 17-million-parameter inversion can be effected at a cost of just 18 forward simulations, with high parallel efficiency. On 1024 AlphaServer EV68 processors, the turnaround time is just 29 minutes. Moreover, inverse problems with 135 million parameters, corresponding to 139 billion total space-time unknowns, are solved in less than 5 hours on the same number of processors. These results suggest that ultra-high resolution data-driven inversion can be carried out sufficiently rapidly for simulation-based "real-time" hazard assessment.
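The matrix-free setting described above, where the operator is available only through its action (each application standing in for a forward/adjoint solve pair), can be illustrated with a plain conjugate-gradient loop. This is a generic sketch on a toy "regularization plus low-rank compact term" SPD operator of our own devising, not the authors' solver; it also shows why only a handful of Krylov iterations are needed for such spectra.

```python
import numpy as np

def cg(matvec, b, tol=1e-10, maxit=200):
    """Conjugate gradients using only operator applications v -> H v."""
    x = np.zeros_like(b)
    r = b.copy()                 # residual for the zero initial guess
    p = r.copy()
    rs = r @ r
    for _ in range(maxit):
        if np.sqrt(rs) <= tol * np.linalg.norm(b):
            break
        Hp = matvec(p)           # the only access to H (stand-in for fwd/adjoint solves)
        alpha = rs / (p @ Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy operator: beta*I plus a rank-5 term, mimicking "regularization + compact misfit Hessian".
rng = np.random.default_rng(1)
n, k, beta = 200, 5, 1e-2
K = rng.standard_normal((n, k))
matvec = lambda v: beta * v + K @ (K.T @ v)
b = rng.standard_normal(n)
x = cg(matvec, b)
print(np.linalg.norm(matvec(x) - b))   # small residual after very few iterations
```

Because the toy operator is a shifted rank-5 perturbation of the identity, CG converges in about six iterations regardless of n, a caricature of the mesh-independent convergence noted in the abstract.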

References
[1] V. Akcelik, G. Biros, A. Draganescu, J. Hill, O. Ghattas, and B. van Bloemen Waanders: Dynamic data-driven inversion for terascale simulations: Real-time identification of airborne contaminants, Proceedings of SC05, IEEE/ACM, Seattle, WA, November 2005.


Parallel Computing with ScaLAPACK for Large-scale Stress Inversion Analysis

Toshiko Terakawa1,2, Kengo Nakajima1,2 and Mitsuhiro Matsu’ura 1,2

1Department of Earth and Planetary Science, The University of Tokyo

7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, JAPAN [email protected]

2CREST, Japan Science and Technology Agency Kawaguchi Center Building, 1-8 Honcho, Kawaguchi-shi, Saitama 332-0012, JAPAN

CMT Data Inversion Method: Estimating the stress state in the crust is essential for understanding earthquake generation. Since measuring the stress state directly is difficult, the stress inversion technique has played an important role in obtaining the stress state from seismic data. Recently we developed a new robust inversion method to estimate the stress field related to earthquake generation (the seismogenic stress field) from centroid moment tensor (CMT) data using Akaike's Bayesian Information Criterion (ABIC) [1]. This method is based on the idea that each seismic event releases a part of the seismogenic stress field at and around its hypocenter. We can reasonably relate the CMT of a seismic event to the seismogenic stress field τ(x) by a weighted volume integral of the seismogenic stress field, considering the hypocenter and the seismic moment of the event. The main inversion scheme is the Bayesian statistical inference algorithm [2].

Parallel Computing with ScaLAPACK: In the Bayesian modeling, our problem is to find the values of the model parameters and hyperparameters (σ², α²) which maximize the posterior probability density function for given observed data. For a certain fixed value of the hyperparameter α², we can obtain the solution a_i* (i = 1, ..., 6) for each component of stress by solving the following system of linear equations:

  [F^T F + α² G] a_i = F^T d_i    (i = 1, ..., 6)    (1)

where F is the coefficient matrix (N×M; N = the number of data, M = the number of model parameters) for the observation equations, G is the prior constraint matrix (M×M) for the model parameters, and d_i is the observed data vector (N×1) which consists of each component of the CMT (i = 1, ..., 6). The matrix F, which is determined by hypocenters and seismic moments, is basically a sparse matrix (Fig. 1(a)). The matrix G, which is determined by the basis functions, is a symmetric band matrix and usually has full rank (Fig. 1(b)). Neither of them depends on α². To determine the optimum value of the hyperparameter α², we can use ABIC as a criterion, which is defined by:

  ABIC(α²) = (6N + 6P − 6M) log[ Σ_{i=1}^{6} { (d_i − F a_i*)^T (d_i − F a_i*) + α² a_i*^T G a_i* } ] − 6P log α² + 6 log‖F^T F + α² G‖ − 6 log‖G‖ + C    (2)

where P is the rank of G and ‖·‖ represents the absolute value of the product of the non-zero eigenvalues of a square matrix. We have to solve the system of equations (Eq. 1), calculate eigenvalues of the coefficient matrix of the system equations, and evaluate ABIC for many candidates


of the best value of α². Such computations require much time and large memory for large-scale analysis. Therefore, we developed a parallel code for the CMT data inversion making use of ScaLAPACK. The matrix [F^T F + α² G] is symmetric, and it approaches a band matrix as the hyperparameter α² becomes larger. We first distribute the global matrix [F^T F + α² G] by a two-dimensional block-cyclic data layout scheme (2×2). For solving the system of linear equations, we use the PDGESV routine, which is the simplest driver routine for linear equations. For calculating eigenvalues of the global matrices, we use the PDSYEV routine. We also use PDCOPY, PDGEMV, PDNRM2 and PDDOT from PBLAS to calculate ABIC.

Large-scale Stress Inversion Analysis: With the parallel code for the CMT data inversion, we analyzed 1427 seismic events in the Hokkaido region (horizontal plane: 500 km × 600 km, depth: 100 km), Japan, to estimate the seismogenic stress field associated with plate subduction. We distributed about 6944 B-splines (every 20 km) to represent the stress fields in the study area. We executed the CMT data inversion analyses on 1 CPU, 4 CPUs and 8 CPUs (Fujitsu PRIMEPOWER HPC2500, Kyoto University). It took 5605 s (system equations: 3680 s, eigenvalues: 851 s), 1708 s (system equations: 471 s, eigenvalues: 514 s) and 720 s (system equations: 181 s, eigenvalues: 179 s) to estimate the stress field for one value of α² with 1, 4 and 8 CPUs, respectively.
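The hyperparameter search can be sketched serially for a single stress component: solve Eq. (1) on a grid of α² values and keep the α² minimizing ABIC. The toy below uses random stand-in data, a full-rank G, a single component, and drops the additive constant, so it illustrates the procedure only, not the parallel ScaLAPACK code; the single-component ABIC form used here is our simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 40, 10                         # toy sizes: N data, M model parameters
F = rng.standard_normal((N, M))       # observation matrix (random stand-in)
G = np.eye(M)                         # prior constraint matrix (full rank here)
d = rng.standard_normal(N)            # data vector for one CMT component
P = np.linalg.matrix_rank(G)          # rank of the prior

def abic(alpha2):
    """Single-component ABIC (constant dropped), in the spirit of Eqs. (1)-(2)."""
    A = F.T @ F + alpha2 * G
    a = np.linalg.solve(A, F.T @ d)   # Eq. (1) for this component
    s = (d - F @ a) @ (d - F @ a) + alpha2 * (a @ G @ a)
    return ((N + P - M) * np.log(s) - P * np.log(alpha2)
            + np.linalg.slogdet(A)[1] - np.linalg.slogdet(G)[1])

alpha2_grid = np.logspace(-3, 3, 25)  # candidate hyperparameter values
best = min(alpha2_grid, key=abic)     # optimum alpha^2 under ABIC
print(best)
```

In the real analysis each abic evaluation is the expensive step (a PDGESV solve plus PDSYEV eigenvalue computations on the distributed matrix), which is why one full stress-field estimate per α² took on the order of an hour on a single CPU.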

Fig. 1: Matrix structure plot. (a) F, (b) G. (M: the number of basis functions; N: the number of observed data.)

Fig. 2: Example of the inverted pattern of seismogenic stress fields in the Hokkaido-Tohoku region with the lower hemisphere stereographic projection of the focal sphere. The contour lines show the upper surface of the Pacific plate. The color scales show the estimation errors.

References
[1] T. Terakawa, M. Matsu'ura: CMT Data Inversion Using a Bayesian Information Criterion to Estimate Seismogenic Stress Fields, Geophys. J. Int., 2006 (submitted).
[2] T. Yabuki, M. Matsu'ura: Geodetic data inversion using a Bayesian information criterion for spatial distribution of fault slip, Geophys. J. Int., 109, 363-375, 1992.


The Performance of Key Scientific Computing Algorithms on the Sony/Toshiba/IBM Cell Broadband Engine

John Shalf, Sam Williams, Leonid Oliker, Kathy Yelick and Parry Husbands

National Energy Research Scientific Computing Center (NERSC), Lawrence Berkeley National Laboratory

One Cyclotron Road, MS: 50F-1650, Berkeley, CA 94720, [email protected]

Over the last decade the HPC community has moved towards machines composed of commodity microprocessors as a strategy for tracking the tremendous growth in processor performance in that market. As frequency scaling slows [1], and the power requirements of these mainstream processors continue to grow, the HPC community is looking for alternative architectures that provide high performance on scientific applications, yet have a healthy market outside the scientific community. In this work, we examine the potential of the Sony-Toshiba-IBM (STI) Cell Broadband Engine (heart of the forthcoming Sony Playstation-3) as a building block for future high-end computing systems, by investigating performance across several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations on regular grids, as well as 1D and 2D FFTs. Cell combines the considerable floating point resources required for demanding numerical algorithms with a power-efficient software-controlled memory hierarchy. Despite its radical departure from previous mainstream/commodity processor designs, Cell is particularly compelling because it will be produced at such high volumes for the Playstation3/computer-games market that it will be cost-competitive with commodity CPUs. The current implementation of Cell is most often noted for its extremely high performance single-precision (SP) arithmetic, which is widely considered insufficient for the majority of scientific applications. Although Cell's peak double precision performance is still impressive relative to its commodity peers (14.6 Gflop/s @ 3.2 GHz), we explore how modest hardware changes could significantly improve performance for computationally intensive DP applications, which we refer to as "Cell+".

We present quantitative performance data for scientific kernels that compares Cell performance to leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures [2]. This study examines the broadest array of scientific algorithms to date on Cell. We developed both analytical models and lightweight simulators to predict kernel performance, and demonstrated them to be accurate when compared against published Cell hardware results, against our own implementations on the Cell full system simulator, and finally against full applications running on Cell Blade systems at IBM. Using this approach allowed us to explore numerous algorithmic approaches without the effort of implementing each variation. We believe this analytical model is especially important given that the relatively immature software environment currently makes Cell time-consuming to program. The model proves to be quite accurate, because the programmer has explicit control over parallelism and features of the memory system.
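A minimal analytic kernel model in this spirit bounds runtime by whichever of compute and memory traffic binds. This roofline-style sketch is ours, not the paper's model: it reuses the 14.6 Gflop/s DP peak quoted above, but the 25.6 GB/s memory bandwidth and the per-nonzero SpMV traffic count are assumptions for illustration.

```python
def predicted_time(flops, bytes_moved, peak_flops, bandwidth):
    """Kernel time is bounded below by compute time and by memory-traffic time;
    a simple analytic model takes the larger of the two."""
    return max(flops / peak_flops, bytes_moved / bandwidth)

# Double-precision SpMV sketch: ~2 flops and ~12 bytes per nonzero (assumed traffic).
nnz = 5_000_000
t = predicted_time(flops=2 * nnz, bytes_moved=12 * nnz,
                   peak_flops=14.6e9, bandwidth=25.6e9)
print(f"predicted SpMV time: {t * 1e3:.2f} ms (memory-bound here)")
```

Even this crude model reproduces the qualitative conclusion that SpMV on Cell is limited by memory traffic rather than by its floating-point peak, which is why software-controlled memory matters so much on this architecture.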

Our work also explores the complexity of mapping several important scientific algorithms onto the Cell's unique architecture in order to leverage the large number of available functional units and the software-controlled memory. Additionally, we used our analytic performance model to predict the performance benefits conferred by modest microarchitectural modifications to Cell that could increase the efficiency of double-precision arithmetic calculations, and demonstrate significant performance improvements compared with the current Cell implementation. The results, summarized in Table 1, show an impressive improvement in performance and power efficiency relative to the X1E, Opteron, and Itanium2 for our evaluated suite of scientific kernels. Overall the study demonstrates the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency. We conclude that Cell's heterogeneous multi-core implementation is inherently better suited to the HPC environment than homogeneous commodity multicore processors.

Algorithm   | Cell Speedup vs.        | Cell Power Efficiency vs.
            | X1E    AMD64   IA64     | X1E    AMD64   IA64
GEMM        | 0.8x   3.7x    2.7x     | 2.4x   8.2x    8.8x
SpMV        | 2.7x   8.4x    8.4x     | 8.0x   18.7x   27.3x
Stencil     | 1.9x   12.7x   6.1x     | 5.7x   28.3x   19.8x
1D FFT      | 1.0x   4.6x    3.2x     | 3.0x   10.2x   10.4x
2D FFT      | 0.9x   5.5x    12.7x    | 2.7x   12.2x   41.3x

Table 1: Double precision speedup and increase in power efficiency of the STI Cell processor relative to the X1E, Opteron, and Itanium2 for our evaluated suite of scientific kernels. Results show an impressive improvement in performance and power efficiency.

References
[1] S. Borkar, Design challenges of technology scaling, IEEE Micro, 19(4):23-29, Jul-Aug 1999.
[2] S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil, K. Yelick, The Potential of the Cell Processor for Scientific Computing, Computing Frontiers, 2006.


A Variant of the Orthomin(m) Method for Solving Linear Systems

Kuniyoshi Abe ∗ and Shao-Liang Zhang†

∗ Faculty of Economics and Information, Gifu Shotoku University, Gifu, 500-8288 Japan
† Graduate School of Engineering, Nagoya University, Nagoya, 464-8603 Japan

E-mail: [email protected]; [email protected]

We consider solving a large sparse linear system Ax = b by a Krylov subspace method, where A and b stand for an n-by-n matrix and an n-vector, respectively.

The Generalized Conjugate Residual (GCR) method [2] has been proposed as a Krylov subspace method based on the minimum residual approach. The computational costs of GCR become expensive when a large number of iterations are required. Therefore, a restarted version called GCR(m) and a truncated version called Orthomin(m) [3] are used in implementation. The truncated version is roughly classified into the Conjugate Residual (CR) and Orthomin(m) methods. The approximation and the residual vector of CR are computed by short recurrence formulas (the last two terms), and those of Orthomin(m) are updated by long recurrence formulas (the last m terms); therefore their convergence behaviors are different. Moreover, a variant of CR, which is mathematically equivalent to the original CR but uses different recurrence formulas, has recently been developed [1]. On the other hand, a variant of Orthomin(m) has not previously been proposed.

Therefore, we propose a variant of Orthomin(m) for solving linear systems with nonsymmetric coefficient matrices by using the same analogy as that in [1].

According to [3], the Orthomin(m) algorithm is defined as follows.

Algorithm 1 (Orthomin(m)). Let x_0 be an initial guess, and put r_0 = b − A x_0. Set β_{−1,j} = 0. For k = 0, 1, ... repeat the following steps until the condition ‖r_k‖_2 ≤ ε_TOL ‖r_0‖_2 holds:

  q_k = A r_k + Σ_{j=k−m}^{k−1} β_{k−1,j} q_j,                    (1)
  p_k = r_k + Σ_{j=k−m}^{k−1} β_{k−1,j} p_j,
  α_k = (r_k, q_k) / (q_k, q_k),
  x_{k+1} = x_k + α_k p_k,
  r_{k+1} = r_k − α_k q_k,                                        (2)
  β_{k,j} = −(A r_{k+1}, q_j) / (q_j, q_j)    (k − m + 1 ≤ j ≤ k).
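Algorithm 1 translates almost line for line into code. The following is our own unoptimized NumPy sketch of Orthomin(m), storing only the last m direction pairs (p_j and q_j = A p_j), applied here to a small nonsymmetric positive-real test matrix of our choosing.

```python
import numpy as np

def orthomin(A, b, m=4, eps_tol=1e-10, maxit=500):
    """Orthomin(m) as in Algorithm 1: truncated GCR keeping the last m
    directions p_j and their images q_j = A p_j."""
    x = np.zeros_like(b)
    r = b - A @ x
    r0_norm = np.linalg.norm(r)
    P, Q = [], []                                  # last m direction pairs
    for _ in range(maxit):
        if np.linalg.norm(r) <= eps_tol * r0_norm:
            break
        Ar = A @ r
        beta = [-(Ar @ q) / (q @ q) for q in Q]    # beta_{k-1,j}
        p = r + sum(bj * pj for bj, pj in zip(beta, P))
        q = Ar + sum(bj * qj for bj, qj in zip(beta, Q))
        alpha = (r @ q) / (q @ q)
        x = x + alpha * p
        r = r - alpha * q
        P = (P + [p])[-m:]                         # truncate to the last m terms
        Q = (Q + [q])[-m:]
    return x

n = 20
A = 4.0 * np.eye(n) + np.diag(np.ones(n - 1), 1)   # nonsymmetric, positive real
b = np.ones(n)
x = orthomin(A, b, m=4)
print(np.linalg.norm(b - A @ x))
```

Keeping only the last m pairs is exactly the truncation that distinguishes Orthomin(m) from full GCR: storage and work per iteration stay bounded, at the price of losing global orthogonality of the q_j.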

The recurrence coefficients α_k and β_{k−1,j} of Orthomin(m) are replaced by the parameters ζ_k and (ζ_j / ζ_k) η_{j+1}, respectively. By introducing a new auxiliary vector y_k = ζ_{k−1} q_{k−1}, the formulas (1) and (2) are converted into y_{k+1} = ζ_k A r_k + Σ_{j=k−m+1}^{k} η_{k,j} y_j and r_{k+1} = r_k − y_{k+1}. Moreover, the approximation x_k and the direction p_k are updated by z_{k+1} = ζ_k r_k + Σ_{j=k−m+1}^{k} η_{k,j} z_j and x_{k+1} = x_k + z_{k+1}, when introducing a new auxiliary vector z_k = ζ_{k−1} p_{k−1}.


Next we determine the recurrence coefficients ζ_k and η_{k,j}, which are the solution of the two-dimensional minimization problem

  ‖r_{k+1}‖_2 = min_{ζ_k, η_{k,j}} ‖ r_k − ζ_k A r_k − Σ_{j=k−m+1}^{k} η_{k,j} y_j ‖_2.

Consequently, we propose the following variant of Orthomin(m). Our proposed algorithm is mathematically equivalent to the original, but the recurrence formulas are different from those of Orthomin(m). The variant implementation costs the same as Orthomin(m).

Algorithm 2 (Proposed implementation). Let x_0 be an initial guess, and put r_0 = b − A x_0. For k = 0, 1, ... until the condition ‖r_k‖_2 ≤ ε_TOL ‖r_0‖_2 holds, iterate:

  ζ_k = (A r_k, r_k) / [ (A r_k, A r_k) − Σ_{j=k−m+1}^{k} (1/ν_j) (A r_k, y_j)(A r_k, y_j) ],
  η_{k,j} = −(ζ_k / ν_j) (y_j, A r_k)    (k − m + 1 ≤ j ≤ k),
  (for k = 0: ζ_k = (A r_k, r_k) / (A r_k, A r_k), η_k = 0)
  ν_{k+1} = ζ_k (A r_k, r_k),
  z_{k+1} = ζ_k r_k + Σ_{j=k−m+1}^{k} η_{k,j} z_j,
  x_{k+1} = x_k + z_{k+1},
  y_{k+1} = ζ_k A r_k + Σ_{j=k−m+1}^{k} η_{k,j} y_j,
  r_{k+1} = r_k − y_{k+1}.

We present numerical experiments on model problems with nonsingular and singular matrices. We solve a system with a nonsingular coefficient matrix derived from 5-point central differences of the two-dimensional convection-diffusion equation

  −∂²u/∂x² − ∂²u/∂y² + γ(x ∂u/∂x + y ∂u/∂y) + βπ²u = f(x, y)

over the unit square Ω = (0, 1) × (0, 1) with zero Dirichlet boundary conditions. Applying 5-point central differences to the partial differential equation

  Δu + d ∂u/∂x = f(x, y)

over the unit square Ω = (0, 1) × (0, 1) with the periodic boundary conditions u(x, 0) = u(x, 1), u(0, y) = u(1, y) yields a singular system.

The convergence behavior of the variant accords with that of the original in numerical experiments on nonsingular systems. On the other hand, numerical experiments on singular systems show that our implementation is more accurate and less affected by rounding errors than Orthomin(m).

Acknowledgments

We would like to express sincere thanks to Professor Martin H. Gutknecht for his insightful and fruitful suggestions. This research is partly supported by Grant-in-Aid for Scientific Research (C) No. 18560064.

References

[1] Abe, K., Zhang, S.-L., Mitsui, T. and Jin, C.-H., A Variant of the Orthomin(2) Method for Singular Linear Systems, Numerical Algorithms, 36 (2004), 189-202.

[2] Eisenstat, S. C., Elman, H. C. and Schultz, M. H., Variational Iterative Methods for Nonsymmetric Systems of Linear Equations, SIAM J. Numer. Anal., 20 (1983), 345-357.

[3] Vinsome, P. K. W., Orthomin, an Iterative Method for Solving Sparse Sets of Simultaneous Linear Equations, in Proc. Fourth Symposium on Reservoir Simulation, Society of Petroleum Engineers of AIME, (1976), 149-159.


A QMR method based on an A-biorthogonalization process

Satsuki Minami1, Tomohiro Sogabe2, Masaaki Sugihara1, and Shao-Liang Zhang2

1Department of Applied Physics, Mathematical Informatics, The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-8656, JAPAN

[email protected], m [email protected]

2Department of Computational Science and Engineering, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8603, JAPAN

sogabe, [email protected]

The bi-conjugate gradient method (Bi-CG) [1][3], which is a natural extension of the conjugate gradient method (CG), is a well-known Krylov subspace method for solving non-symmetric linear systems. However, Bi-CG sometimes suffers from irregular convergence behavior in the residual norm, which often leads to numerical instabilities. To avoid this problem, the quasi-minimal residual method (QMR) has been proposed by Freund & Nachtigal. This method uses the same basis vectors as Bi-CG, based on a bi-Lanczos process, and produces residual vectors that have the quasi-minimal residual (QMR) property, which leads to a smooth convergence behavior [2].

On the other hand, Sogabe & Zhang have recently proposed the bi-conjugate residual method (Bi-CR) [4] as a natural extension of the conjugate residual method (CR). Since it has been experimentally shown that Bi-CR has faster and smoother convergence than Bi-CG, Bi-CR is expected to become a powerful basic method. In this presentation, we introduce a method that uses the same basis vectors as Bi-CR, based on an A-biorthogonalization process [5], and produces residual vectors that have the QMR property. We will also report the results of some numerical experiments.

Table 1. The relationship of each method

                                  Orthogonal condition   QMR property
  Bi-Lanczos process              Bi-CG                  QMR
  A-biorthogonalization process   Bi-CR                  Our method

References
[1] R. Fletcher, Conjugate gradient methods for indefinite systems, Lecture Notes in Mathematics, 506, 1976, pp. 73-89.
[2] R. W. Freund and N. M. Nachtigal, QMR: a quasi-minimal residual method for non-Hermitian linear systems, Numer. Math., 60, 1991, pp. 315-339.
[3] C. Lanczos, Solution of systems of linear equations by minimized iterations, J. Res. Nat. Bur. Standards, 49, 1952, pp. 33-53.
[4] T. Sogabe, M. Sugihara, and S.-L. Zhang, An extension of the conjugate residual method for solving nonsymmetric linear systems, Trans. JSIAM, 15(3), 2005, pp. 445-459. (in Japanese)
[5] T. Sogabe and S.-L. Zhang, An iterative method based on an A-biorthogonalization process for nonsymmetric linear systems, manuscript, 2005.


The Multi-section with Multiple Eigenvalues Method for Symmetric Tridiagonal Eigenproblem and Its Performance on LAPACK 4.0 MRRR Routine

Takahiro Katagiri1

1Graduate School of Information Systems, The University of Electro-Communications

1-5-1 Choufu-gaoka, Choufu-shi, Tokyo 182-8585, JAPAN [email protected]

We present the Multi-section with Multiple Eigenvalues (MME) Method for finding eigenvalues of symmetric tridiagonal matrices. Performance results on the HITACHI SR8000 with 8 processors per node yield (1) up to 6.3x speedup over a conventional multi-section method, and (2) up to 1.47x speedup over a statically tuned MME method.

Introduction
The bisection method is widely used to compute eigenvalues of a symmetric tridiagonal matrix T, especially when a subset of the spectrum is desired. Bisection is based on repeated evaluation of the function Count(x) = number of eigenvalues of T less than x; an evaluation of the function at two points a and b yields the number of eigenvalues in the interval [a, b). There can be a great deal of parallelism available in bisection. First, disjoint intervals may be searched independently, but the number of intervals available depends on the distribution of T's eigenvalues. Second, one can divide an interval into more than two equal parts (multisection as opposed to bisection), but the efficiency depends on how much faster Count(x(1 : k)) can be evaluated at k points x(1 : k) than at one point [1, 2], which in turn depends on the computer being used. So to optimize performance, an implementation of bisection could choose the points at which to evaluate Count(x) depending both on the intervals containing eigenvalues found so far, and on the relative speeds of Count(x(1 : k)) for different k. At one extreme, when the n-by-n matrix T has fairly uniformly distributed eigenvalues, one would eventually have n disjoint intervals, to each of which one would apply bisection. Until enough intervals become available to use up the available parallelism, or in the other extreme case where all of T's eigenvalues lie in a few tight clusters (so that there are never many intervals containing eigenvalues), one would run multisection. We propose a method to dynamically choose the right mix of multisection and disjoint intervals to use. In the spirit of other automatic tuning systems [3], at installation time we run a few benchmarks to determine the speed of Count(x(1 : k)), and use a simple performance model at runtime to decide what to do. We assume a shared memory parallel system here but consider the use of vectorization as well.
We call our method the MME method (Multisection with Multiple Eigenvalues).

The Kernel of the MME Method
Figure 1 shows part of the kernel of MME for LAPACK version 4.0 xSTEGR. The kernel assumes T is represented with a twisted decomposition [4]. In other words, the (shifted) matrix is represented by an array D of the diagonal entries of the diagonal factor in the twisted factorization of T, and by an array LLD containing the products of the squared entries of L and D; for a detailed explanation see [5]. For simplicity we assume there are EL disjoint intervals containing eigenvalues (indexed by K = 1 to EL), and that we multisect each such interval into ML + 1 > 2 parts by evaluating Count() at ML equally spaced points in each interval, indexed by I = 1 to ML. The set of all points at which we evaluate Count() is stored in the array σ(1 : ML, 1 : EL); if interval K is [aK, bK], this means σ(I, K) = aK + (bK − aK) I/(ML + 1). The count itself is stored in the array Count(1 : ML, 1 : EL). The variables S, T and DPLUS are temporaries, indexed to show that they take on different values for different points σ(I, K). The innermost loop (over J) is evaluated sequentially. All EL·ML iterations of the outer two loops can be evaluated independently, exploiting parallelization, vectorization, or both. Our goal is to pick EL and ML to minimize solution time. EL can be any number up to the total number of current intervals containing eigenvalues that have not yet converged.

<0> S(1 : ML, 1 : EL) = 0; Count(1 : ML, 1 : EL) = 0
<1> do K = 1, EL
<2>   do I = 1, ML
<3>     do J = 1, R − 1
<4>       T(I, K) = S(I, K) − σ(I, K)
<5>       DPLUS(I, K) = D(J) + T(I, K)
<6>       S(I, K) = T(I, K) * LLD(J) / DPLUS(I, K)
<7>       if ( DPLUS(I, K) .lt. ZERO ) Count(I, K) = Count(I, K) + 1
<8> enddo; enddo; enddo

Fig. 1. Part of the kernel of Multi-section with Multiple Eigenvalues (MME) Method. R is the twist point in the twisted factorization. The loops for K and I may be fused.
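The central operation, evaluating Count() at several shifts at once, can be sketched in Python with NumPy standing in for the vectorized loops. This is an illustration of the Sturm-count recurrence only, not the xSTEGR kernel itself, and it uses the plain LDL^T form rather than the twisted factorization:

```python
import numpy as np

def count_multi(d, e, sigma):
    """Number of eigenvalues of tridiag(e, d, e) below each shift in sigma.

    Evaluates the Sturm/LDL^T recurrence at all shifts simultaneously,
    mirroring the independent (I, K) iterations of the MME kernel.
    """
    sigma = np.asarray(sigma, dtype=float)
    q = d[0] - sigma                     # leading pivot for every shift
    count = (q < 0).astype(int)
    for j in range(1, len(d)):
        # assumes no pivot is exactly zero (true for generic shifts)
        q = d[j] - sigma - e[j - 1] ** 2 / q
        count += q < 0
    return count
```

For the tridiagonal matrix tridiag(−1, 2, −1) of order 5, whose eigenvalues are 2 − 2cos(kπ/6), the counts below the shifts 0.5, 2.5 and 3.9 come out as 1, 3 and 5.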

Results
We used the HITACHI SR8000 and SR11000 with 8 processors on one node at the Supercomputing Division, Information Technology Center, The University of Tokyo. The results show that in DQDS mode, which is one of the modes for computing eigenvalues in xSTEGR, dynamically tuned MME speeds up by up to 6.8x over bisection, 6.3x over basic multisection, and 1.47x over statically tuned MME. In one case it was 2% slower than statically tuned MME, and otherwise it was always fastest. In bisection mode, the corresponding maximum speedups are 7.4x, 3.9x, and 1.4x. Again, in just one case dynamically tuned MME was 4% slower than statically tuned MME, and otherwise it was always fastest.

Acknowledgement
This work is joint work with Prof. James Demmel and Dr. Christof Voemel at the University of California at Berkeley. This research is partially supported by the Kayamori Foundation of Information Science Advancement, Grant-in-Aid for Scientific Research (C) No. 18500018, and Grant-in-Aid for Scientific Research on Priority Areas (Cyber Infrastructure for the Information-explosion Era) No. 18049014.

References
[1] S.-S. Lo, B. Philippe, and A. Sameh. A multiprocessor algorithm for the symmetric tridiagonal eigenvalue problem. SIAM J. Sci. Stat. Comput., 8(2):s155-s165, 1987.
[2] H. D. Simon. Bisection is not optimal on vector processors. SIAM J. Sci. Stat. Comput., 10(1):205-209, 1989.
[3] J. Demmel, J. Dongarra, V. Eijkhout, E. Fuentes, A. Petitet, R. Vuduc, R. C. Whaley, and K. Yelick. Self-adapting linear algebra algorithms and software. Proceedings of the IEEE, Special Issue on Program Generation, Optimization, and Adaptation, 93(2), 2005.
[4] I. S. Dhillon, B. N. Parlett, and C. Voemel. LAPACK working note 162: The design and implementation of the MRRR algorithm. Technical Report UCB/CSD-04-1346, University of California, Berkeley, 2004.
[5] O. A. Marques, E. J. Riedy, and C. Voemel. LAPACK working note 172: Benefits of IEEE-754 features in modern symmetric tridiagonal eigensolvers. Technical Report UCB/CSD-05-1414, University of California, Berkeley, 2005.


Toward an Automatically Tuned Dense Symmetric Eigensolver for Shared Memory Machines

Yusaku Yamamoto

Department of Computational Science and Engineering, Nagoya University

Furo-cho, Chikusa, Nagoya, Aichi 464-8603, JAPAN [email protected]

Computation of the eigenvalues and eigenvectors of a dense symmetric matrix is one of the most important problems in numerical linear algebra. In applications such as molecular orbital methods and first-principles molecular dynamics, eigensolution of matrices of order more than 10,000 is now needed. One approach to solving such large problems in practical time is to use symmetric multi-processors (SMPs). To exploit the potential high performance of SMP machines, one has to consider two issues, namely, (i) minimizing the number of inter-processor synchronizations and (ii) utilizing the cache memory efficiently, thereby avoiding memory access conflicts among the processors.

The standard algorithm for the dense symmetric eigenproblem consists of three parts, namely, (A) reduction of the input matrix to tridiagonal form, (B) computation of the eigenvalues and eigenvectors of the tridiagonal matrix, and (C) back-transformation of the eigenvectors. Of these, part (C) has large-grain parallelism, and one can reorganize this part to use the level-3 BLAS to enhance cache utilization. For part (B), the divide-and-conquer method is widely used as an algorithm that is easily parallelizable and can use the level-3 BLAS efficiently. However, for part (A), the standard algorithm due to Dongarra et al. [1] can execute only half of the computation with the level-3 BLAS; the rest is done with the level-2 BLAS. As a result, the algorithm cannot use the cache efficiently and attains only 10 to 25% of peak performance on modern processors. On SMP machines, the efficiency decreases further due to memory access conflicts among the processors.

In contrast, the two-step tridiagonal reduction algorithm proposed by Bischof et al. [2] can execute almost all of the computation in the form of level-3 BLAS, and there is an improvement of this algorithm by Wu et al. [3]. However, with these algorithms, the computational work for back-transformation doubles, since the back-transformation also consists of two steps. As a result, it is not clear which of the three algorithms – Dongarra's, Bischof's, or Wu's – to choose for a given problem and computational environment.

In this talk, we compare the performance of the three algorithms on various SMP machines by varying the matrix size and the number of eigenvectors computed. The performance data thus obtained will be useful in developing an automatically tuned dense symmetric eigensolver for SMP machines.
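The benchmark-and-select idea behind such an automatically tuned eigensolver can be sketched as follows; the function name, the timing protocol, and the use of a median are illustrative assumptions, not the author's implementation:

```python
import time

def pick_fastest(candidates, problem, repeats=3):
    """Run each candidate solver on a small representative problem and
    return the name of the one with the best median wall-clock time."""
    best_name, best_t = None, float("inf")
    for name, solver in candidates.items():
        times = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            solver(problem)
            times.append(time.perf_counter() - t0)
        t = sorted(times)[len(times) // 2]   # median is robust to timing jitter
        if t < best_t:
            best_name, best_t = name, t
    return best_name
```

In a tuned solver, a harness like this would be run over a grid of matrix sizes and numbers of requested eigenvectors, and the stored winner for the nearest grid point would be chosen at run time.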

References [1] J. Dongarra, S. Hammarling and D. Sorensen: Block reduction of matrices to condensed forms for eigenvalue computation, J. Comput. Appl. Math., Vol. 27, pp. 215-227 (1989). [2] C. Bischof, B. Lang and X. Sun: A framework for symmetric band reduction, ACM Transactions on Mathematical Software, Vol. 26, No. 4, pp. 581-601 (2000). [3] Y-J. J. Wu, P. Alpatov, C. Bischof and R. van de Geijn: A parallel implementation of symmetric band reduction using PLAPACK, in Proceedings of the Scalable Parallel Libraries Conference (1996).


ICT-SSAI: Scalable Parallel Tree-Based Incomplete Cholesky Preconditioning using Selective Sparse Approximate Inversion

Keita Teranishi and Padma Raghavan

Department of Computer Science and Engineering, The Pennsylvania State University

111 IST Bldg. University Park, PA 16802, USA teranish,[email protected]

Many large scale simulations require robust and scalable sparse linear solvers. We have developed a flexible parallel drop-threshold incomplete Cholesky preconditioner (ICT) to accelerate convergence of Conjugate Gradients (CG). An ICT preconditioning scheme computes a preconditioner L̃ which approximates the true Cholesky factor L of the coefficient matrix A. Although such a scheme is widely used on uniprocessors, its scalable parallel implementation poses two major challenges. First, the performance of parallel triangular solution using L̃, applied at every CG iteration, typically suffers from the high latencies of interprocessor communication. Second, parallel construction of the preconditioner L̃ suffers from communication and data structure management overheads because its nonzero pattern cannot be determined in advance. To address these issues, our parallel ICT preconditioner is constructed using a supernodal elimination tree, as used in parallel sparse direct methods. Our tree-based scheme takes advantage of task-parallel computation using a subtree-to-processor mapping at the lower levels of the tree. For the higher levels of the tree, we devise a parallel fan-in left-looking factorization for data-parallel computation. We also use a selective inversion (SI) [1] scheme to replace triangular solution by parallel matrix-vector multiplication for latency tolerance. Furthermore, for efficient inversion, we consider a sparse approximate scheme based on Frobenius norm minimization.

We will discuss our tree-based ICT with "Selective Sparse Approximate Inversion" (SSAI) and provide results indicating that it enables scalable preconditioner construction and application [2] while retaining the reliability of sequential ICT.
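The Frobenius-norm minimization used for the sparse approximate inversion decouples column by column, which is what makes it attractive in parallel. A dense toy sketch of this decoupling (illustrative only, not the ICT-SSAI code; the hypothetical `patterns` argument lists the allowed nonzero rows of each column) is:

```python
import numpy as np

def spai_columns(L, patterns):
    """Sparse approximate inverse of L by Frobenius norm minimization.

    min ||L M - I||_F decouples into independent least-squares problems
    min ||L[:, J] m_J - e_j||_2, one per column j with sparsity set J,
    so the columns can be computed fully in parallel.
    """
    n = L.shape[0]
    M = np.zeros((n, n))
    for j, J in enumerate(patterns):
        e = np.zeros(n)
        e[j] = 1.0
        m, *_ = np.linalg.lstsq(L[:, J], e, rcond=None)
        M[J, j] = m
    return M
```

With the full pattern (all rows allowed in every column) the least-squares problems are square and the exact inverse is recovered; restricting the patterns trades accuracy for sparsity.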

References [1] P. Raghavan, K. Teranishi, and E. Ng: A Latency Tolerant Hybrid Sparse Solver Scheme Using Incomplete Cholesky Factorization, Numer. Linear Algebra Appl., Volume 10, pp. 541-560, 2003. [2] K. Teranishi and P. Raghavan: A Hybrid Parallel Preconditioner Using Incomplete Cholesky Factorization and Sparse Approximate Inversion, Proceedings of DD16, the 16th International Conference on Domain Decomposition Methods, New York, NY, Jan. 12-15, 2005. The Springer Verlag Lecture Notes in Computer Science (LNCS) series, Volume 55, pp. 749-756, 2006.



Removal of instability of the standard IC decomposition due to Inverse-based dropping

Seiji FUJINO∗ and Akira SHIODE∗∗

∗Computing and Communications Center, Kyushu University
∗∗Graduate School of Information Science and Electrical Engineering, Kyushu University

(E-mail: [email protected])

The numerous variants of Cholesky decomposition and the different dropping strategies used to ensure sparsity and a preferable computation cost can be combined to devise a number of preconditioners. Recently an efficient implementation of the incomplete LU decomposition derived from the Crout version of incomplete LU decomposition (ILUC) for solving a linear system of equations with a nonsymmetric matrix has been proposed by Li et al. [1]. ILUC decomposition is a useful preconditioner which estimates and utilizes the norms of the rows of L^{-1} and the columns of U^{-1} [2]. In addition, ILUC enables an efficient implementation of a rigorous dropping strategy based on estimating the norms of the inverse factors. Moreover, ILUC decomposition allows the development of a potentially more effective strategy for the usual incomplete Cholesky decomposition.

In this paper we extend the Crout version of incomplete LU decomposition to that of incomplete Cholesky decomposition for solving linear systems with a symmetric positive definite matrix.

The error in the inverse of the factors is more significant than the error in the factors themselves. Indeed, when A = U^T U exactly and A ≈ Ū^T Ū, writing Ū ≡ U + X (Ū^T = U^T + X^T) with error matrix X, the inverses can be written as

Ū^{-1} = U^{-1} + X̄,   Ū^{-T} = U^{-T} + X̄^T. (1)

However, the preconditioned matrix is given by

Ū^{-T} A Ū^{-1} = I + U X̄ + X̄^T U^T + X̄^T A X̄. (2)

Here it is remarked that the error matrix X of the factor cannot be found in this expression of the preconditioned matrix. Accordingly we should estimate the error matrix X̄ of U^{-1} instead of the error matrix X of U measured in the usual IC decomposition.
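Equation (2) is an exact algebraic identity once X̄ is defined as the error of the inverse factor, X̄ = Ū^{-1} − U^{-1}, which a short numerical check confirms (a verification sketch, with a random perturbation standing in for the incomplete-factorization error):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
U = np.triu(rng.normal(size=(n, n))) + 3 * np.eye(n)   # exact factor
A = U.T @ U                                            # so A = U^T U exactly
Ubar = U + 0.01 * np.triu(rng.normal(size=(n, n)))     # perturbed factor U + X
Xbar = np.linalg.inv(Ubar) - np.linalg.inv(U)          # error of the inverse

# Eq. (2): the preconditioned matrix expressed through Xbar, not X
lhs = np.linalg.inv(Ubar).T @ A @ np.linalg.inv(Ubar)
rhs = np.eye(n) + U @ Xbar + Xbar.T @ U.T + Xbar.T @ A @ Xbar
assert np.allclose(lhs, rhs)
```

The check works because Ū^{-T} A Ū^{-1} = (I + X̄^T U^T)(I + U X̄) when A = U^T U, and expanding this product gives exactly the four terms of Eq. (2).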

In the case of a symmetric matrix we use the representation L = U^T for convenience. Minimizing the error of L^{-1} leads to the criterion that an entry l_{j,k} should be dropped in the kth step if

|l_{j,k}/a_{j,j}| ||e_j e_k^T L^{-1}||_∞ = |l_{j,k}| |ξ_k| / |a_{j,j}| ≤ τ, (3)

where e_k is the kth unit vector and τ is a tolerance parameter for dropping. Since L^{-1} cannot be computed at low cost, the idea is to estimate ||e_j e_k^T L^{-1}||_∞ by

||e_j e_k^T L^{-1}||_∞ ≈ ||e_j e_k^T L^{-1} b||_∞ / ||b||_∞, (4)

for a suitable vector b. Choosing b recursively, we obtain the following algorithm:

set ξ_1 = 1/l_{1,1}; ν_i = ξ_1 l_{i,1} (i = 2, . . . , n)
for k = 2, n
    temp+ = 1 − ν_k
    temp− = −1 − ν_k
    if |temp+| > |temp−| then
        ξ_k = temp+ / l_{k,k}
    else
        ξ_k = temp− / l_{k,k}
    end if
    for j = k + 1, n (l_{j,k} ≠ 0)
        ν_j = ν_j + ξ_k l_{j,k}
    end for
end for

Here ξ_k is the approximation of e_j e_k^T L^{-1} b, so |ξ_k| estimates ||e_j e_k^T L^{-1}||_∞.
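The recursion above is a forward substitution L ξ = b in which each entry b_k ∈ {+1, −1} is chosen greedily to make |ξ_k| large. A direct transcription in Python (0-based indexing; the ν contribution of the first column is written out explicitly) is:

```python
import numpy as np

def inverse_norm_estimates(L):
    """Greedy estimate xi with L @ xi = b, b entries +/-1 chosen so that
    |xi[k]| approximates the norm of row k of inv(L)."""
    n = L.shape[0]
    xi = np.zeros(n)
    nu = np.zeros(n)                  # nu[j] = sum_{k<j} L[j, k] * xi[k]
    xi[0] = 1.0 / L[0, 0]
    nu[1:] += xi[0] * L[1:, 0]        # contribution of the first column
    for k in range(1, n):
        tp, tm = 1.0 - nu[k], -1.0 - nu[k]
        xi[k] = (tp if abs(tp) > abs(tm) else tm) / L[k, k]
        nu[k + 1:] += xi[k] * L[k + 1:, k]
    return xi
```

By construction L @ xi is a vector with entries ±1, so |xi[k]| never exceeds the 1-norm of row k of inv(L); it is the quantity |ξ_k| used in the dropping test (3).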

Numerical experiments will be presented. Linear systems of equations were solved by the preconditioned CG method. As preconditioning, we adopt the IC decomposition with the standard dropping technique and that with the Inverse-based dropping technique. All computations were done in double precision floating point arithmetic, and performed on a PC with a 3.2GHz CPU and two Gigabytes of main memory. Fortran compiler ver. 8.0 with the optimization option -O3 was used. The right-hand side vector b was set to all 1.0. The stopping criterion for successful convergence of the iterative method is a relative residual 2-norm ||r_{n+1}||_2/||r_0||_2 of less than 10^{-8}. The coefficient matrix A was normalized by diagonal scaling. In all cases the iteration of the CG method was started with the initial guess solution x_0 = 0. The maximum number of iterations is set equal to the dimension of each matrix. Tolerance values τ were varied from 0.005 up to 0.15 with an interval of 0.005; in total 30 cases were examined for each dropping strategy and matrix. The shift parameter α of the linear systems A + αI was fixed as 0.10, where I denotes the unit matrix. The test matrices were derived from the Matrix Market database of sparse matrices [3] and from a problem which stemmed from a realistic analysis. Table 1 shows a description of the test matrices; "total nnz" means the total number of nonzero entries of each matrix, and "ave. nnz" means the average number of nonzero entries per row.

Table 1 Description of test matrices.

matrix      dimensions   total nnz   ave. nnz   analysis
S3DKQ4M2    90,449       2,455,670   27.1       cylindrical shells
S3DKT3M2    90,449       1,921,955   21.2       cylindrical shells
CT20STIF    52,329       1,375,396   26.3       engine block
ENGINE      143,571      2,424,822   16.9       engine head
T_400000    417,524      6,053,860   14.5       structural analysis
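The preparation steps described above (symmetric diagonal scaling to unit diagonal, and the shift A + αI to which the incomplete factorization is applied) can be sketched as follows, with α = 0.10 and the 10^{-8} stopping test as in the experiments; the helper names are illustrative:

```python
import numpy as np

def prepare_system(A, alpha=0.10):
    """Diagonally scale A to unit diagonal (D^{-1/2} A D^{-1/2}),
    then form the shifted matrix A + alpha*I used for the IC factorization."""
    d = 1.0 / np.sqrt(np.diag(A))
    A_scaled = (A * d).T * d          # entry (i, j) becomes d_i * A_ij * d_j
    return A_scaled + alpha * np.eye(A.shape[0])

def converged(r, r0, tol=1e-8):
    """Stopping criterion ||r_n||_2 / ||r_0||_2 < tol used for the (IC)CG runs."""
    return np.linalg.norm(r) / np.linalg.norm(r0) < tol
```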

In Figures 1(a),(b) we present the numerical results of the shifted ICCG method with the standard dropping and that with Inverse-based dropping for matrices S3DKQ4M2 and S3DKT3M2, respectively. The vertical axis shows the total computation time in seconds, including preconditioning time and CG iteration time, and the horizontal axis shows the tolerance value. Red bars show computation times of the shifted ICCG method with Inverse-based dropping, and blue bars show computation times of that with the standard dropping.

We restrict our attention to robustness with respect to the tolerance value and stability of convergence rather than computation times. From Figure 1(a), at the larger tolerance of 0.095 the shifted ICCG method with the standard dropping could not converge. On the other hand, the shifted ICCG method with Inverse-based dropping converged over the whole range of tolerances. Furthermore, the former sometimes requires much more computation time, as at tolerances of 0.050, 0.070 and 0.095.

From Figure 1(b), it can be seen that at tolerances τ of 0.125, 0.130, 0.145 and 0.150 the shifted ICCG method with the standard dropping diverged. In contrast, the shifted ICCG method with Inverse-based dropping converged successfully. Moreover, the former requires irregularly long computation times at tolerances of 0.095, 0.120, 0.135 and 0.140, whereas the latter converged smoothly over the whole range of tolerances.

(a) matrix S3DKQ4M2   (b) matrix S3DKT3M2

Figure 1 Computation times versus tolerance value of the shifted ICCG method with the standard dropping and that with Inverse-based dropping for matrices S3DKQ4M2 and S3DKT3M2.

The Inverse-based dropping appears slightly less competitive in computation times at the optimal tolerance τ_opt, because it often requires more fill-in in order to yield convergent iterations.

In Figures 2(a),(b) we present the history of the relative residual of the shifted CG method with IC and ib_IC decompositions for matrix S3DKT3M2. Figure 2(a) shows the results of the CG method with the usual IC decomposition at tolerances of 0.09, 0.095 and 0.10. Similarly, Figure 2(b) depicts the results of the shifted CG method with ib_IC decomposition at the same tolerances. From Figure 2 the following observations are made:

• In the case of the usual IC decomposition, much more computation time is required at the tolerance of 0.095 than at tolerances of 0.09 and 0.10. This means that the error matrix X of U measured in the usual IC decomposition can give a misleading estimate.

• In the case of the ib_IC decomposition, the computation times for the three tolerances are approximately the same.

Additionally, in Figure 3 we exhibit the computation times of the shifted ICCG methods with the two types of dropping when the tolerance values are varied similarly. Figure 3 illustrates that at tolerances τ of 0.090 and 0.095 the shifted ICCG method with the standard dropping diverged. In contrast, the shifted ICCG method with Inverse-based dropping converges quite nicely.


(a) IC decomposition   (b) ib_IC decomposition

Figure 2 History of the relative residual of the CG method with IC and ib_IC decompositions for matrix S3DKT3M2 at tolerance values of 0.09, 0.095 and 0.10.

Figures 4(a),(b) show the computation times versus tolerance and the history of the relative residual of the CG method with shifted IC and shifted ib_IC decompositions at tolerance = 0.090 for matrix T_400000. Obviously, the ib_IC decomposition yields a better result than the usual IC decomposition.



(a) matrix S4DKT3M2   (b) matrix CT20STIF

Figure 3 Computation times versus tolerance value of the shifted ICCG method with the standard dropping and that with Inverse-based dropping for matrices S4DKT3M2 and CT20STIF.


(a) Computation times versus tolerance   (b) History of the relative residual (tolerance = 0.09)

Figure 4 Computation times versus tolerance and history of the relative residual of the CG method with shifted IC and shifted ib_IC decompositions for matrix T_400000.

Consequently, from the timing results presented graphically for the four test matrices above as the tolerance value is varied, we conclude that the Inverse-based dropping strategy is superior to the standard dropping from the viewpoint of robustness with respect to the tolerance value and stability of convergence. As for the optimal computation times at τ = τ_opt, the standard dropping may be competitive with the Inverse-based dropping. Hence, Inverse-based dropping is a promising alternative when the shifted ICCG method with the standard dropping fails, though it is unlikely to be the fastest choice in general cases.

References

[1] Li, N., Saad, Y., and Chow, E., Crout versions of ILU for general sparse matrices, SIAM J. Sci. Comput., 25 (2003), pp. 716-728.

[2] Mayer, J.: ILUCP: a Crout ILU preconditioner with pivoting, Numerical Linear Algebra with Applications, 12 (2005), pp. 941-955.

[3] Matrix Market web page: http://math.nist.gov/MatrixMarket/


Preconditioning Techniques for Saddle Point Problems arising from FEM Applications

Takumi Washio1 and Toshiaki Hisada 2

1Japan Science and Technology Agency,

Graduate School of Frontier Sciences, The University of Tokyo 5-1-5, Kashiwanoha, Kashiwa, Chiba 277-0882, JAPAN

[email protected]

2 Graduate School of Frontier Sciences, The University of Tokyo 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, JAPAN

hisada@mech.t.u-tokyo.ac.jp

In many finite element applications, the system of equations includes constraint conditions that stem from conservation laws or the incompressibility condition. These constrained problems are usually dealt with via the Lagrange multiplier method, in which we have to solve the following linear equations in each Newton-Raphson step in nonlinear cases:

[ A    B^T ] [ Δu ]   [ r_u ]
[ B    0   ] [ Δλ ] = [ r_λ ]    (1)

Here, the second row corresponds to the constraint conditions. The right hand side is the residual for the discretized nonlinear equations, and the unknowns are the updates of the solution u and the Lagrange multiplier λ. There are two characteristic properties in Eq. (1).

1. The second diagonal block is zero.

2. The (1,2)-block is equal to the transpose of the (2,1)-block and we can assume that the sub-matrix B is full-rank due to the independence of the constraint conditions.

The first property implies a loss of diagonal dominance, which leads to failure of convergence with standard preconditioned iterative methods. However, we can overcome this difficulty by exploiting the second property in the construction of the preconditioner. In this talk, we present two preconditioners [1] based on this idea.

ILU preconditioner with appropriate fill-ins in the (1,2), (2,1) and (2,2)-blocks:

P_1 = [ L_u + D_u    0          ] [ D_u^{-1}    0          ] [ U_u + D_u    B^T        ]
      [ B            L_λ + D̃_λ ] [ 0           D̃_λ^{-1}  ] [ 0            U_λ + D̃_λ ]    (2)

Inexact block LU preconditioner:

P_2 = [ Q_u    0    ] [ I    Q_u^{-1} B^T ]
      [ B     −Q_λ  ] [ 0    I            ]    (3)

In the first preconditioner, all fill-ins in the (2,2)-block generated during the elimination of the unknowns in the first row are kept in order to obtain a positive matrix there; however, any fill-ins during the elimination in the second row itself are ignored, to retain sparsity in L_λ and U_λ.


In the second preconditioner, the matrix Q_λ should be constructed so that it approximates the Schur complement B A^{-1} B^T [2] or B Q_u^{-1} B^T [1]. In the talk, the basic theoretical properties and experimental results for incompressible fluid and hyper-elasticity finite element problems are introduced.
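Applying a block LU preconditioner of this form costs one forward and one backward block substitution. A dense sketch (illustrative; generic direct solves stand in for the approximate solvers Q_u and Q_λ) is:

```python
import numpy as np

def apply_block_lu(A, B, Qu_solve, Ql_solve, ru, rl):
    """Solve P2 [du; dl] = [ru; rl] for the block factorization
    P2 = [[Qu, 0], [B, -Ql]] @ [[I, inv(Qu) B^T], [0, I]]."""
    tu = Qu_solve(ru)                  # forward:  Qu tu = ru
    tl = Ql_solve(B @ tu - rl)         # forward:  -Ql tl = rl - B tu
    du = tu - Qu_solve(B.T @ tl)       # backward: du = tu - inv(Qu) B^T dl
    return du, tl                      # tl is dl, since the (2,2) block of U is I
```

With Q_u = A and Q_λ = B A^{-1} B^T (the exact Schur complement), P_2 coincides with the saddle point matrix of Eq. (1), so the procedure returns the exact solution; inexact choices of Q_u and Q_λ give a preconditioner.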

Since these preconditioners are executed with forward and backward substitutions, their direct parallelization is not easy due to the recursiveness of these operations. Usually, a simple localized preconditioning technique under a given decomposition of the unknowns is applied for parallelization. In this parallel strategy, the ILU preconditioner is applied to each local sub-domain independently of the others, ignoring the couplings at the interfaces. In our case study, such an approach leads to considerable degradation of the convergence compared to the single-domain preconditioner, due to the existence of the Lagrange multipliers at the interfaces. However, the convergence is drastically improved once small overlaps exploiting the underlying element connectivity information are introduced [3]. In the talk, the convergence speeds and the eigenvalue distributions of the preconditioned matrices are compared between the overlapped and the non-overlapped preconditioners. The parallel performance of the solver in the fluid-structure interaction analysis of our heart simulator is also shown.

References
[1] T. Washio, T. Hisada, H. Watanabe and T. E. Tezduyar: A robust preconditioner for fluid-structure interaction problems, Comput. Methods Appl. Mech. Engrg., Vol. 194, pp. 4027-4047, 2005.
[2] J. H. Bramble, J. E. Pasciak and A. T. Vassilev: Analysis of the inexact Uzawa algorithm for saddle point problems, SIAM J. Numer. Anal., Vol. 34, pp. 1072-1092, 1997.
[3] T. Washio, K. Suzuki and T. Hisada: Efficient iterative linear solvers for strongly coupled FSI analysis, Proceedings of Computational Mechanics WCCM VI (in CD-ROM), 2004, Beijing, China.


Parallel Preconditioning Methods for Contact Problems with FEM

Kengo Nakajima1

1 The 21st Century Earth Science COE (Center of Excellence) Program

Department of Earth and Planetary Science, The University of Tokyo 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, JAPAN

nakajima@eps.s.u-tokyo.ac.jp

Contact phenomena are among the most important and critical issues in various types of science and engineering problems. The author has been developing preconditioning methods for contact problems in geophysics, where stress accumulation on plate boundaries (faults) is computed for estimating the earthquake generation cycle [1]. In [1], the augmented Lagrange method (ALM) and the penalty method are implemented, and a large penalty number λ is introduced for the constraint conditions around faults. The nonlinear process is solved iteratively by the Newton-Raphson (NR) method. A large λ provides an accurate solution and fast nonlinear convergence of the NR method, but it makes the condition number of the coefficient matrices large, so that many iterations are required for convergence of iterative solvers. Therefore, a robust preconditioning method is essential for such ill-conditioned problems.

Selective blocking is a special preconditioning method developed by the author for this type of application [1]. In the selective blocking method, finite element nodes in the same contact group, coupled through penalty constraints, are placed into a large block (a selective block, or super node) (Fig. 1). For symmetric positive definite matrices, incomplete block Cholesky factorization without inter-block fill-in using selective blocking (SB-BIC(0)) shows excellent performance and robustness over a wide range of penalty parameter values [1].
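Why grouping penalty-coupled nodes into one block helps can be sketched on the same toy chain model as above (not the paper's fault model; the partition and node indices are illustrative). Full coupling is kept inside each block and dropped between blocks, which is the essence of the selective-blocking idea; exact block solves stand in for the incomplete block factorization.

```python
import numpy as np

def block_diag_part(A, blocks):
    """Block-diagonal part of A for a given partition of the unknowns:
    full coupling inside each block, nothing between blocks."""
    M = np.zeros_like(A)
    for idx in blocks:
        ix = np.ix_(idx, idx)
        M[ix] = A[ix]
    return M

def precond_condition_number(A, M):
    """Condition number of M^{-1} A (real positive spectrum: A, M SPD)."""
    e = np.linalg.eigvals(np.linalg.solve(M, A)).real
    return e.max() / e.min()

n, lam = 20, 1e6
# chain of unit springs plus a penalty term coupling nodes 6 and 13
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A[6, 6] += lam; A[13, 13] += lam
A[6, 13] -= lam; A[13, 6] -= lam

jacobi = [[i] for i in range(n)]                         # point blocks
selective = [[i] for i in range(n) if i not in (6, 13)] + [[6, 13]]

k_jac = precond_condition_number(A, block_diag_part(A, jacobi))
k_sel = precond_condition_number(A, block_diag_part(A, selective))
print(f"point-Jacobi preconditioner : cond = {k_jac:.3e}")
print(f"selective block {{6, 13}}     : cond = {k_sel:.3e}")
```

Because the selective block contains the entire penalty coupling, the large λ terms cancel between the matrix and the preconditioner, and the preconditioned condition number stays essentially independent of λ, while the point preconditioner degrades as λ grows.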

In the previous work [1], the numbers of nodes in the contact groups are consistent, and infinitesimal deformation is assumed. Therefore, the positions of the nodes do not change, and the consistent relationship among nodes in each contact group is maintained throughout the simulation. Moreover, a special partitioning method, in which all nodes of the same contact group are located in the same domain, has been applied. This approach is not flexible and cannot be applied to fault contact simulations with

Fig. 1 Matrix operation for nodes in a contact group in selective blocking preconditioning. When 3 nodes (0, 1, 2) form one selective block: 2λux0 = λux1 + λux2, 2λuy0 = λuy1 + λuy2, 2λuz0 = λuz1 + λuz2. When 2 nodes (0, 1) form one selective block: λux0 = λux1, λuy0 = λuy1, λuz0 = λuz1.

Fig. 2 Parallel performance (speed-up ratio vs. number of PEs) on a Hitachi SR2201 with 16 to 256 PEs for the 3D elastic contact problem with MPC conditions (2,471,439 DOF), comparing BIC(0), BIC(1), BIC(2) and SB-BIC(0).

Fig. 3 Inconsistent node numbers at component boundaries in FEM models of an assembly structure


large slip/deformation, nor to simulations of assembly structures, where the numbers and positions of nodes in contact groups may be inconsistent, as shown in Fig. 3.

In this study, the author developed a new parallel preconditioning method for such general contact problems. One remedy is a method called preconditioning with selective fill-in, in which higher-order fill-ins are introduced only for nodes with contact conditions. This approach is very similar to selective blocking, but is more flexible. Another remedy is the extension of the overlapped zones between processors in parallel computing. In the original work [1], only a single layer of elements was overlapped; in this study, deeper layers of overlapped elements are considered. Generally speaking, deeper layers of overlapped elements provide faster convergence in block Jacobi-type parallel preconditioning methods, but the cost of computation and communication becomes more expensive.

Figure 4 shows the geometry and boundary conditions of the test problems used to validate the developed preconditioning method. The problem simulates contact in assembly structures, where the positions and numbers of nodes on the contact surfaces are inconsistent. There are four elastic blocks, discretized into cubic tri-linear finite elements. The blocks are connected through elastic torus elements generated on each node of the contact surfaces, and the torus elements cross one another. The elastic coefficient of the torus elements is set to 10^3-10^4 times that of the solid elements; this condition simulates constraint boundary conditions for contact.

Fig.4 Elastic blocks connected through torus elements

Figure 5 shows the effect of the depth of the overlapped layers on the number of iterations for convergence with 128 domains on a PC cluster, for the 3D elastic contact problem of the simple block model with the constraint condition in Fig. 4 (2,723,772 DOF). The preconditioned GPBiCG method has been applied. BILU(1/2) preconditioning, in which 2nd-order fill-ins are introduced for nodes connected to torus elements and 1st-order fill-ins are applied to the other nodes, provides better convergence than the other options.
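The deeper overlapped layers studied above are obtained by growing each sub-domain outward through the mesh connectivity, one layer at a time. A minimal sketch, with a hypothetical node-based adjacency on a toy 1D chain standing in for the element-connectivity data of the partitioner:

```python
def expand_overlap(adjacency, interior, depth):
    """Grow a sub-domain's node set by `depth` layers of neighbouring
    nodes, the way overlapped zones are deepened between processors.
    `adjacency` maps node -> list of neighbours (illustrative)."""
    nodes = set(interior)
    frontier = set(interior)
    for _ in range(depth):
        frontier = {m for v in frontier for m in adjacency[v]} - nodes
        nodes |= frontier
    return nodes

# toy 1D chain of 12 nodes; this sub-domain owns nodes 0..5
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 12] for i in range(12)}
for depth in (1, 2, 3):
    grown = expand_overlap(adj, range(6), depth)
    print(f"overlap depth {depth}: {sorted(grown)}")
```

Each extra layer adds another ring of off-processor nodes, which is exactly where the additional computation and communication cost of deeper overlaps comes from.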

References
[1] Nakajima, K.: Parallel Iterative Solvers of GeoFEM with Selective Blocking Preconditioning for Nonlinear Contact Problems on the Earth Simulator, ACM/IEEE Proceedings of SC2003, Phoenix, AZ, USA, 2003.

Fig. 5 Effect of the depth of overlapped layers on the number of iterations for convergence with 128 domains on a PC cluster, for the 3D elastic contact problem of the simple block model with the constraint condition in Fig. 4 (2,723,772 DOF), using the preconditioned GPBiCG method with BILU(0), BILU(1), BILU(2), BILU(1/2) and BILUT(2.50).
