PaStiX: how to reduce memory overhead
ASTER meeting, Bordeaux, Nov 12-14, 2007
PaStiX team, LaBRI, UMR CNRS 5800, Université Bordeaux I
ScAlApplix project, INRIA Futurs
PaStiX solver
• Current development team (SOLSTICE):
  • P. Hénon (researcher, INRIA)
  • F. Pellegrini (assistant professor, LaBRI/INRIA)
  • P. Ramet (assistant professor, LaBRI/INRIA)
  • J. Roman (professor, leader of the ScAlApplix INRIA project)
• PhD student & engineer:
  • M. Faverge (NUMASIS project)
  • X. Lacoste (INRIA)
• Other contributors since 1998:
  • D. Goudin (CEA-DAM)
  • D. Lecas (CEA-DAM)
• Main users:
  • Electromagnetism & structural mechanics codes at CEA-DAM CESTA
  • MHD plasma instabilities for ITER at CEA-Cadarache (ASTER)
  • Fluid mechanics at MAB Bordeaux
PaStiX solver: functionalities
• LLt, LDLt, LU factorization (symmetric pattern) with supernodal implementation
• Static pivoting (maximum weight matching) + iterative refinement / CG / GMRES
• 1D/2D block distribution + full BLAS3
• Supports external ordering libraries (Scotch ordering provided)
• MPI/Threads implementation (SMP node / cluster / multi-core / NUMA)
• Single/double precision + real/complex arithmetic
• Requires only C + MPI + POSIX threads
• Multiple RHS (direct factorization)
• Incomplete factorization ILU(k) preconditioner
• Available on INRIA Gforge:
  • All-in-one source code
  • Easy to install on Linux or AIX systems
  • Simple API (WSMP-like); see the calling sketch below
  • Thread-safe (can be called from multiple threads in multiple MPI communicators)
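To make the "simple API" bullet concrete, here is a minimal calling sketch in the spirit of the all-in-one, WSMP-like interface of PaStiX 5.x: a small symmetric matrix in 1-based CSC format is factorized and solved. The identifiers (pastix, iparm/dparm, the IPARM_*/API_* constants) follow the 5.x manual, but their exact spelling is an assumption here; check the pastix.h shipped with the Gforge distribution.

    /* Hedged sketch of the WSMP-like all-in-one interface; constant
     * names follow the PaStiX 5.x manual and may differ in your
     * release -- verify against pastix.h. */
    #include <stdio.h>
    #include <mpi.h>
    #include "pastix.h"

    int main(int argc, char **argv)
    {
        /* 3x3 symmetric matrix, CSC format, 1-based indices,
           lower triangular part only. */
        pastix_int_t   n        = 3;
        pastix_int_t   colptr[] = {1, 3, 5, 6};
        pastix_int_t   rows[]   = {1, 2, 2, 3, 3};
        pastix_float_t values[] = {4.0, 1.0, 4.0, 1.0, 4.0};
        pastix_float_t rhs[]    = {6.0, 9.0, 5.0}; /* overwritten by x */
        pastix_int_t   perm[3], invp[3];
        pastix_int_t   iparm[IPARM_SIZE];
        double         dparm[DPARM_SIZE];
        pastix_data_t *pastix_data = NULL;

        MPI_Init(&argc, &argv);

        /* First call: fill iparm/dparm with default values. */
        iparm[IPARM_MODIFY_PARAMETER] = API_NO;
        iparm[IPARM_START_TASK]       = API_TASK_INIT;
        iparm[IPARM_END_TASK]         = API_TASK_INIT;
        pastix(&pastix_data, MPI_COMM_WORLD, n, colptr, rows, values,
               perm, invp, rhs, 1, iparm, dparm);

        /* Second call: ordering, symbolic and numerical factorization,
           solve and clean-up, all in sequence. */
        iparm[IPARM_SYM]           = API_SYM_YES;
        iparm[IPARM_FACTORIZATION] = API_FACT_LDLT;
        iparm[IPARM_START_TASK]    = API_TASK_ORDERING;
        iparm[IPARM_END_TASK]      = API_TASK_CLEAN;
        pastix(&pastix_data, MPI_COMM_WORLD, n, colptr, rows, values,
               perm, invp, rhs, 1, iparm, dparm);

        printf("x = (%g, %g, %g)\n", (double)rhs[0], (double)rhs[1],
               (double)rhs[2]);
        MPI_Finalize();
        return 0;
    }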
• Current work:
  • Use of parallel ordering (PT-Scotch) and parallel symbolic factorization
  • Dynamic scheduling inside SMP nodes (static mapping)
  • Out-of-core implementation
  • Generic finite element assembly (domain decomposition associated with the matrix distribution)
pastix.gforge.inria.fr
• Latest publication (to appear in Parallel Computing): On finding approximate supernodes for an efficient ILU(k) factorization
• For more publications, see http://www.labri.fr/~ramet/
• Mapping by processor: static scheduling by processor
• Each processor owns its local part of the matrix (private user space)
• Message passing (MPI or MPI shared-memory) between any processors
• Aggregation of all contributions is done per processor (see the MPI sketch below)
• Data coherency ensured by MPI semantics
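To make the per-processor aggregation concrete, here is an illustrative C/MPI sketch, not PaStiX source; BLOCK_SZ, the message tag and the function names are invented for the example. Each processor sums all of its local contributions to a remote block into one buffer and ships a single message to the block's owner, instead of one message per elementary contribution.

    #include <mpi.h>

    #define BLOCK_SZ 256   /* #coefficients of the target block */
    #define TAG_AGGR 42

    /* Sum one local contribution into the aggregation buffer. */
    void aggregate(double *buf, const double *contrib)
    {
        for (int i = 0; i < BLOCK_SZ; i++)
            buf[i] += contrib[i];
    }

    /* One message per (processor, remote block) pair. */
    void send_aggregated_update(double *buf, int owner_rank)
    {
        MPI_Send(buf, BLOCK_SZ, MPI_DOUBLE, owner_rank, TAG_AGGR,
                 MPI_COMM_WORLD);
    }

    /* Only the owner applies updates to its private block, so data
       coherency follows from MPI message semantics. */
    void receive_and_apply(double *my_block)
    {
        double     recv[BLOCK_SZ];
        MPI_Status status;

        MPI_Recv(recv, BLOCK_SZ, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_AGGR,
                 MPI_COMM_WORLD, &status);
        for (int i = 0; i < BLOCK_SZ; i++)
            my_block[i] += recv[i];
    }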
• Mapping by SMP node: static scheduling by thread
• All the processors of an SMP node share a local part of the matrix (shared user space)
• Message passing (MPI) between processors on different SMP nodes
• Direct access to shared memory (pthread) between processors on the same SMP node
• Aggregation of non-local contributions is done per node
• Data coherency ensured by explicit mutexes (see the pthread sketch below)
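A corresponding sketch of the shared-memory side, again illustrative rather than PaStiX source (BLOCK_SZ and the names are invented): within a node the factor blocks live in shared memory, threads add their contributions in place, and an explicit per-block mutex provides the data coherency mentioned above.

    #include <pthread.h>

    #define BLOCK_SZ 256

    typedef struct {
        double          coef[BLOCK_SZ]; /* block shared by all threads */
        pthread_mutex_t lock;           /* protects coef[]             */
    } shared_block_t;

    void block_init(shared_block_t *b)
    {
        for (int i = 0; i < BLOCK_SZ; i++)
            b->coef[i] = 0.0;
        pthread_mutex_init(&b->lock, NULL);
    }

    /* Called concurrently by the computing threads of the node. */
    void block_add_contribution(shared_block_t *b, const double *contrib)
    {
        pthread_mutex_lock(&b->lock);   /* explicit mutex = coherency */
        for (int i = 0; i < BLOCK_SZ; i++)
            b->coef[i] += contrib[i];
        pthread_mutex_unlock(&b->lock);
    }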
MPI/Threads implementation for SMP clusters
MPI only:
• Processors 1 and 2 belong to the same SMP node
• Data exchanges when only MPI processes are used in the parallelization
MPI/Threads:
• Threads 1 and 2 are created by one MPI process
• Data exchanges when there is one MPI process per SMP node and one thread per processor
• Much less MPI communication (only between SMP nodes); no aggregation is needed for modifications of blocks owned by processors inside the same SMP node
[Figure: AUDI, 943·10³ unknowns (symmetric) — maximum memory overhead (% max overmem / max coeff.) versus the number of processors (16, 32, 64, 128), four unlabeled series.]
Requires MPI_THREAD_MULTIPLE!
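When every computing thread may call MPI directly (for instance, to send aggregated updates to other SMP nodes), MPI must be initialized with the MPI_THREAD_MULTIPLE support level. A minimal check, using the standard MPI-2 MPI_Init_thread call:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int provided;

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr,
                    "MPI library grants thread level %d only; concurrent "
                    "MPI calls from several threads are unsafe.\n",
                    provided);
            MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
        }

        /* ... spawn one computing thread per processor of the node ... */

        MPI_Finalize();
        return 0;
    }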
AUDI: 943·10³ unknowns (symmetric)
• SP4: 32-way Power4+ nodes (64 GB each). Factorization time (s) by number of threads per MPI process and total number of processors:

  Threads/MPI process    32 procs    64 procs
  4                      100.4       51.02
  8                       96.31      47.28
  16                      93.14      47.80
  32                      94.21      60.03

• SP5: 16-way Power5 nodes (32 GB each). Factorization time (s):

  Threads/MPI process    2 procs   4 procs   8 procs   16 procs   32 procs   64 procs
  1                      686       373       185       -          -          -
  2                      -         368       182       95         -          -
  4                      -         -         177       99.7       55.3       31.3
  8                      -         -         -         98.6       57.7       34.1
  16                     -         -         -         91         59.7       -

  (missing entries: allocation problem or combination without meaning)
MHD1: 485·10³ unknowns (unsymmetric)

• SP4: 32-way Power4+ nodes (64 GB each). Factorization time (s):

  Threads/MPI process    32 procs    64 procs
  4                      202.68      117.80
  8                      197.99      115.79
  16                     187.89      111.99
  32                     199.17      115.97

• SP5: 16-way Power5 nodes (32 GB each). Factorization time (s):

  Threads/MPI process    2 procs   4 procs   8 procs   16 procs   32 procs   64 procs
  1                      977       506       -         -          -          -
  2                      -         505       265       -          -          -
  4                      -         -         262       136        78.3       -
  8                      -         -         261       141        84.2       63.4
  16                     -         -         -         139        -          -
How to reduce memory resources
• Goal: adapt a (supernodal) parallel direct solver (PaStiX) to build an incomplete block factorization and benefit from all the features it provides:
  • Algorithms based on dense linear algebra kernels (BLAS)
  • Load balancing and task scheduling based on a fine modeling of computation and communication
  • Modern architecture management (SMP nodes): hybrid Threads/MPI implementation
Main steps of the incomplete factorization algorithm:
1. Find the partition P0 induced by the supernodes of A.
2. Compute the block symbolic incomplete factorization Q(G,P0)^k = Q(G^k,P0) (a pointwise sketch of the level-of-fill idea follows this list).
3. Find the exact supernode partition in Q(G,P0)^k.
4. Given an extra fill-in tolerance α, construct an approximate supernode partition Pα to improve the block structure of the incomplete factors.
5. Apply the block incomplete factorization using the parallelization techniques developed for our direct solver PaStiX.
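To illustrate the level-of-fill idea behind step 2, here is a small pointwise (scalar) sketch; it is not the block algorithm of the talk, and the toy graph, its size N and the bound K are made up. Entries of A start at level 0; eliminating column c can create fill (i,j) at level lev(i,c) + lev(c,j) + 1, and only entries with level <= k are kept (dropped entries generate no further fill).

    #include <stdio.h>

    #define N   6
    #define K   1          /* level-of-fill bound k */
    #define INF 1000000    /* "no entry" */

    /* Adjacency (with diagonal) of the symmetric graph G of A. */
    static const int A[N][N] = {
        {1,1,0,0,1,0},
        {1,1,1,0,0,0},
        {0,1,1,1,0,0},
        {0,0,1,1,1,0},
        {1,0,0,1,1,1},
        {0,0,0,0,1,1},
    };

    int main(void)
    {
        int lev[N][N];

        /* Level 0 for the entries of A, "infinite" otherwise. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                lev[i][j] = A[i][j] ? 0 : INF;

        /* Symbolic elimination on levels; entries dropped because
           their level exceeds K do not propagate. */
        for (int c = 0; c < N; c++)
            for (int i = c + 1; i < N; i++) {
                if (lev[i][c] > K) continue;
                for (int j = c + 1; j < N; j++) {
                    if (lev[c][j] > K) continue;
                    int l = lev[i][c] + lev[c][j] + 1;
                    if (l < lev[i][j]) lev[i][j] = l;
                }
            }

        /* Print the pattern of the ILU(K) factors. */
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++)
                putchar(lev[i][j] <= K ? 'x' : '.');
            putchar('\n');
        }
        return 0;
    }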
Parallel Time: AUDI (Power5). Times in seconds; Fact = factorization, TR solv = triangular solve, Total = complete solution:

              1 processor                  16 processors
  K    α      Fact    TR solv   Total      Fact    TR solv   Total
  1    20 %   74.5    4.59      690.1      21.4    0.51      91.5
  1    40 %   56.4    4.44      620.3      12.7    0.42      67.0
  3    20 %   331.1   7.97      936.8      39.2    0.91      108.7
  3    40 %   194.6   7.57      732.0      18.6    0.66      65.7
  5    20 %   518.5   8.86      1058.9     52.3    1.16      123.1
  5    40 %   258.1   7.80      679.3      21.2    0.78      63.3