PaStiX: how to reduce memory overhead
ASTER meeting, Bordeaux, Nov 12-14, 2007
PaStiX team, LaBRI, UMR CNRS 5800, Université Bordeaux I
ScAlApplix project, INRIA Futurs
PaStiX solver
• Current development team (SOLSTICE):
  • P. Hénon (researcher, INRIA)
  • F. Pellegrini (assistant professor, LaBRI/INRIA)
  • P. Ramet (assistant professor, LaBRI/INRIA)
  • J. Roman (professor, leader of the ScAlApplix INRIA project)
• PhD student & engineer:
  • M. Faverge (NUMASIS project)
  • X. Lacoste (INRIA)
• Other contributors since 1998:
  • D. Goudin (CEA-DAM)
  • D. Lecas (CEA-DAM)
• Main users:
  • Electromagnetism & structural mechanics codes at CEA-DAM CESTA
  • MHD plasma instabilities for ITER at CEA-Cadarache (ASTER)
  • Fluid mechanics at MAB Bordeaux
PaStiX solver: functionalities
• LLt, LDLt, LU factorization (symmetric pattern) with supernodal implementation
• Static pivoting (maximum weight matching) + iterative refinement / CG / GMRES
• 1D/2D block distribution + full BLAS3
• Supports external ordering libraries (Scotch ordering provided)
• MPI/Threads implementation (SMP node / cluster / multi-core / NUMA)
• Single/double precision + real/complex arithmetic
• Requires only C + MPI + POSIX threads
• Multiple RHS (direct factorization)
• Incomplete factorization ILU(k) preconditioner
• Available on INRIA Gforge:
  • All-in-one source code
  • Easy to install on Linux or AIX systems
  • Simple API (WSMP-like); see the calling sketch below
  • Thread-safe (can be called from multiple threads in multiple MPI communicators)
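To make the "simple API" bullet concrete, here is a minimal calling sketch in the spirit of the all-in-one, WSMP-like interface of PaStiX 5.x: a small symmetric matrix in 1-based CSC format is factorized and solved. The identifiers (pastix, iparm/dparm, the IPARM_*/API_* constants) follow the 5.x manual, but their exact spelling is an assumption here; check the pastix.h shipped with the Gforge distribution.

    /* Hedged sketch of the WSMP-like all-in-one interface; constant
     * names follow the PaStiX 5.x manual and may differ in your
     * release -- verify against pastix.h. */
    #include <stdio.h>
    #include <mpi.h>
    #include "pastix.h"

    int main(int argc, char **argv)
    {
        /* 3x3 symmetric matrix, CSC format, 1-based indices,
           lower triangular part only. */
        pastix_int_t   n        = 3;
        pastix_int_t   colptr[] = {1, 3, 5, 6};
        pastix_int_t   rows[]   = {1, 2, 2, 3, 3};
        pastix_float_t values[] = {4.0, 1.0, 4.0, 1.0, 4.0};
        pastix_float_t rhs[]    = {6.0, 9.0, 5.0}; /* overwritten by x */
        pastix_int_t   perm[3], invp[3];
        pastix_int_t   iparm[IPARM_SIZE];
        double         dparm[DPARM_SIZE];
        pastix_data_t *pastix_data = NULL;

        MPI_Init(&argc, &argv);

        /* First call: fill iparm/dparm with default values. */
        iparm[IPARM_MODIFY_PARAMETER] = API_NO;
        iparm[IPARM_START_TASK]       = API_TASK_INIT;
        iparm[IPARM_END_TASK]         = API_TASK_INIT;
        pastix(&pastix_data, MPI_COMM_WORLD, n, colptr, rows, values,
               perm, invp, rhs, 1, iparm, dparm);

        /* Second call: ordering, symbolic and numerical factorization,
           solve and clean-up, all in sequence. */
        iparm[IPARM_SYM]           = API_SYM_YES;
        iparm[IPARM_FACTORIZATION] = API_FACT_LDLT;
        iparm[IPARM_START_TASK]    = API_TASK_ORDERING;
        iparm[IPARM_END_TASK]      = API_TASK_CLEAN;
        pastix(&pastix_data, MPI_COMM_WORLD, n, colptr, rows, values,
               perm, invp, rhs, 1, iparm, dparm);

        printf("x = (%g, %g, %g)\n", (double)rhs[0], (double)rhs[1],
               (double)rhs[2]);
        MPI_Finalize();
        return 0;
    }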
• Current work:
  • Use of parallel ordering (PT-Scotch) and parallel symbolic factorization
  • Dynamic scheduling inside SMP nodes (static mapping)
  • Out-of-core implementation
  • Generic finite element assembly (domain decomposition associated with the matrix distribution)
pastix.gforge.inria.fr
• Latest publication (to appear in Parallel Computing): On finding approximate supernodes for an efficient ILU(k) factorization
• For more publications, see http://www.labri.fr/~ramet/
• Mapping by processor: static scheduling by processor
• Each processor owns its local part of the matrix (private user space)
• Message passing (MPI or MPI shared-memory) between any processors
• Aggregation of all contributions is done per processor (see the MPI sketch below)
• Data coherency ensured by MPI semantics
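To make the per-processor aggregation concrete, here is an illustrative C/MPI sketch, not PaStiX source; BLOCK_SZ, the message tag and the function names are invented for the example. Each processor sums all of its local contributions to a remote block into one buffer and ships a single message to the block's owner, instead of one message per elementary contribution.

    #include <mpi.h>

    #define BLOCK_SZ 256   /* #coefficients of the target block */
    #define TAG_AGGR 42

    /* Sum one local contribution into the aggregation buffer. */
    void aggregate(double *buf, const double *contrib)
    {
        for (int i = 0; i < BLOCK_SZ; i++)
            buf[i] += contrib[i];
    }

    /* One message per (processor, remote block) pair. */
    void send_aggregated_update(double *buf, int owner_rank)
    {
        MPI_Send(buf, BLOCK_SZ, MPI_DOUBLE, owner_rank, TAG_AGGR,
                 MPI_COMM_WORLD);
    }

    /* Only the owner applies updates to its private block, so data
       coherency follows from MPI message semantics. */
    void receive_and_apply(double *my_block)
    {
        double     recv[BLOCK_SZ];
        MPI_Status status;

        MPI_Recv(recv, BLOCK_SZ, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_AGGR,
                 MPI_COMM_WORLD, &status);
        for (int i = 0; i < BLOCK_SZ; i++)
            my_block[i] += recv[i];
    }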
• Mapping by SMP node: static scheduling by thread
• All the processors of an SMP node share a local part of the matrix (shared user space)
• Message passing (MPI) between processors on different SMP nodes
• Direct access to shared memory (pthread) between processors on the same SMP node
• Aggregation of non-local contributions is done per node
• Data coherency ensured by explicit mutexes (see the pthread sketch below)
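A corresponding sketch of the shared-memory side, again illustrative rather than PaStiX source (BLOCK_SZ and the names are invented): within a node the factor blocks live in shared memory, threads add their contributions in place, and an explicit per-block mutex provides the data coherency mentioned above.

    #include <pthread.h>

    #define BLOCK_SZ 256

    typedef struct {
        double          coef[BLOCK_SZ]; /* block shared by all threads */
        pthread_mutex_t lock;           /* protects coef[]             */
    } shared_block_t;

    void block_init(shared_block_t *b)
    {
        for (int i = 0; i < BLOCK_SZ; i++)
            b->coef[i] = 0.0;
        pthread_mutex_init(&b->lock, NULL);
    }

    /* Called concurrently by the computing threads of the node. */
    void block_add_contribution(shared_block_t *b, const double *contrib)
    {
        pthread_mutex_lock(&b->lock);   /* explicit mutex = coherency */
        for (int i = 0; i < BLOCK_SZ; i++)
            b->coef[i] += contrib[i];
        pthread_mutex_unlock(&b->lock);
    }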
MPI/Threads implementation for SMP clusters
MPI only:
• Processors 1 and 2 belong to the same SMP node
• Data exchanges when only MPI processes are used in the parallelization
MPI/Threads:
• Threads 1 and 2 are created by one MPI process
• Data exchanges when there is one MPI process per SMP node and one thread per processor
• Much less MPI communication (only between SMP nodes); no aggregation is needed for modifications of blocks owned by processors inside the same SMP node
[Figure: AUDI, 943·10³ unknowns (symmetric) — maximum memory overhead (% max overmem / max coeff.) versus the number of processors (16, 32, 64, 128), four unlabeled series.]
Requires MPI_THREAD_MULTIPLE!
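When every computing thread may call MPI directly (for instance, to send aggregated updates to other SMP nodes), MPI must be initialized with the MPI_THREAD_MULTIPLE support level. A minimal check, using the standard MPI-2 MPI_Init_thread call:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int provided;

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr,
                    "MPI library grants thread level %d only; concurrent "
                    "MPI calls from several threads are unsafe.\n",
                    provided);
            MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
        }

        /* ... spawn one computing thread per processor of the node ... */

        MPI_Finalize();
        return 0;
    }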
AUDI: 943·10³ unknowns (symmetric)
• SP4: 32-way Power4+ nodes (64 GB each). Factorization time (s) by number of threads per MPI process and total number of processors:

  Threads/MPI process    32 procs    64 procs
  4                      100.4       51.02
  8                       96.31      47.28
  16                      93.14      47.80
  32                      94.21      60.03

• SP5: 16-way Power5 nodes (32 GB each). Factorization time (s):

  Threads/MPI process    2 procs   4 procs   8 procs   16 procs   32 procs   64 procs
  1                      686       373       185       -          -          -
  2                      -         368       182       95         -          -
  4                      -         -         177       99.7       55.3       31.3
  8                      -         -         -         98.6       57.7       34.1
  16                     -         -         -         91         59.7       -

  (missing entries: allocation problem or combination without meaning)
MHD1: 485·10³ unknowns (unsymmetric)

• SP4: 32-way Power4+ nodes (64 GB each). Factorization time (s):

  Threads/MPI process    32 procs    64 procs
  4                      202.68      117.80
  8                      197.99      115.79
  16                     187.89      111.99
  32                     199.17      115.97

• SP5: 16-way Power5 nodes (32 GB each). Factorization time (s):

  Threads/MPI process    2 procs   4 procs   8 procs   16 procs   32 procs   64 procs
  1                      977       506       -         -          -          -
  2                      -         505       265       -          -          -
  4                      -         -         262       136        78.3       -
  8                      -         -         261       141        84.2       63.4
  16                     -         -         -         139        -          -
How to reduce memory resources
• Goal: adapt a (supernodal) parallel direct solver (PaStiX) to build an incomplete block factorization and benefit from all the features it provides:
  • Algorithms based on dense linear algebra kernels (BLAS)
  • Load balancing and task scheduling based on a fine modeling of computation and communication
  • Modern architecture management (SMP nodes): hybrid Threads/MPI implementation
Main steps of the incomplete factorization algorithm:
1. Find the partition P0 induced by the supernodes of A.
2. Compute the block symbolic incomplete factorization Q(G,P0)^k = Q(G^k,P0) (a pointwise sketch of the level-of-fill idea follows this list).
3. Find the exact supernode partition in Q(G,P0)^k.
4. Given an extra fill-in tolerance α, construct an approximate supernode partition Pα to improve the block structure of the incomplete factors.
5. Apply the block incomplete factorization using the parallelization techniques developed for our direct solver PaStiX.
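To illustrate the level-of-fill idea behind step 2, here is a small pointwise (scalar) sketch; it is not the block algorithm of the talk, and the toy graph, its size N and the bound K are made up. Entries of A start at level 0; eliminating column c can create fill (i,j) at level lev(i,c) + lev(c,j) + 1, and only entries with level <= k are kept (dropped entries generate no further fill).

    #include <stdio.h>

    #define N   6
    #define K   1          /* level-of-fill bound k */
    #define INF 1000000    /* "no entry" */

    /* Adjacency (with diagonal) of the symmetric graph G of A. */
    static const int A[N][N] = {
        {1,1,0,0,1,0},
        {1,1,1,0,0,0},
        {0,1,1,1,0,0},
        {0,0,1,1,1,0},
        {1,0,0,1,1,1},
        {0,0,0,0,1,1},
    };

    int main(void)
    {
        int lev[N][N];

        /* Level 0 for the entries of A, "infinite" otherwise. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                lev[i][j] = A[i][j] ? 0 : INF;

        /* Symbolic elimination on levels; entries dropped because
           their level exceeds K do not propagate. */
        for (int c = 0; c < N; c++)
            for (int i = c + 1; i < N; i++) {
                if (lev[i][c] > K) continue;
                for (int j = c + 1; j < N; j++) {
                    if (lev[c][j] > K) continue;
                    int l = lev[i][c] + lev[c][j] + 1;
                    if (l < lev[i][j]) lev[i][j] = l;
                }
            }

        /* Print the pattern of the ILU(K) factors. */
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++)
                putchar(lev[i][j] <= K ? 'x' : '.');
            putchar('\n');
        }
        return 0;
    }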
Parallel Time: AUDI (Power5). Times in seconds; Fact = factorization, TR solv = triangular solve, Total = complete solution:

              1 processor                  16 processors
  K    α      Fact    TR solv   Total      Fact    TR solv   Total
  1    20 %   74.5    4.59      690.1      21.4    0.51      91.5
  1    40 %   56.4    4.44      620.3      12.7    0.42      67.0
  3    20 %   331.1   7.97      936.8      39.2    0.91      108.7
  3    40 %   194.6   7.57      732.0      18.6    0.66      65.7
  5    20 %   518.5   8.86      1058.9     52.3    1.16      123.1
  5    40 %   258.1   7.80      679.3      21.2    0.78      63.3