August 28, 2014

Task-based linear solvers for modern architectures
E. Agullo, P. Ramet (and the HiePACS team)

7th ITER International School, High Performance Computing in Fusion Science, Aix-en-Provence

E. Agullo, P. Ramet - HiePACS team - Inria Bordeaux Sud-Ouest - LaBRI, Bordeaux University
Guideline
Introduction
Sparse direct factorization - PaStiX
Sparse solver on heterogenous architectures
Hybrid methods - HIPS and MaPHYS
Low-rank compression - H-PaStiX
Conclusion
E. Agullo, P. Ramet - ITER School August 28, 2014- 2
Introduction
Mixed/Hybrid direct-iterative methods
The "spectrum" of linear algebra solvers

Direct solvers:
- Robust/accurate for general problems
- BLAS-3 based implementation
- Memory/CPU prohibitive for large 3D problems
- Limited parallel scalability

Iterative solvers:
- Problem-dependent efficiency / controlled accuracy
- Only mat-vec required, fine-grain computation
- Less memory consumption, possible trade-off with CPU
- Attractive "built-in" parallel features
Direct
Major steps for solving sparse linear systems

1. Analysis: the matrix is preprocessed to improve its structural properties (A'x' = b' with A' = P_n P D_r A D_c Q P^T)
2. Factorization: the matrix is factorized as A = LU, LL^T, or LDL^T
3. Solve: the solution x is computed by means of forward and backward substitutions
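The same three-step workflow can be illustrated, at toy scale, with SciPy's SuperLU wrapper. This is a minimal sketch, not the PaStiX API: `splu` performs the analysis (fill-reducing column permutation) and the numerical LU factorization together, and `solve` does the triangular substitutions.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

# A small sparse system (toy example)
A = csc_matrix(np.array([[4.0, 1.0, 0.0],
                         [1.0, 3.0, 1.0],
                         [0.0, 1.0, 2.0]]))
b = np.array([1.0, 2.0, 3.0])

# Analysis + factorization: splu permutes rows/columns to limit fill-in,
# then computes the LU factors of the permuted matrix.
lu = splu(A)

# Solve: forward and backward substitutions with the triangular factors.
x = lu.solve(b)
```

In a production solver the analysis step is typically reused across many factorizations of matrices with the same sparsity pattern.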
Supernodal methods
Definition: a supernode (or supervariable) is a set of contiguous columns in the factors L that share essentially the same sparsity structure.

- All algorithms (ordering, symbolic factorization, factorization, solve) generalize to block versions.
- Use of efficient matrix-matrix kernels (improved cache usage).
- Same concept as supervariables for the elimination tree / minimum degree ordering.
- Supernodes and pivoting: pivoting inside a supernode does not increase fill-in.
PaStiX main Features
- LL^T, LDL^T, LU: supernodal implementation (BLAS-3)
- Static pivoting + refinement: CG/GMRES/BiCGstab
- Column-block or block mapping
- Single/double precision, real/complex arithmetic
- MPI/threads (cluster/multicore/SMP/NUMA)
- Multiple GPUs using DAG runtimes
- Support for external ordering libraries (PT-Scotch, METIS, ...)
- Multiple RHS (direct factorization)
- Incomplete factorization with an ILU(k) preconditioner
- Schur complement computation
- C/C++/Fortran/Python/PETSc/Trilinos/FreeFem... interfaces
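The "static pivoting + refinement" item deserves a word: small pivots are perturbed during factorization (rather than permuted at run time), and the resulting inexact factors drive a refinement loop or precondition CG/GMRES. A minimal NumPy sketch, with an explicitly perturbed matrix standing in for the perturbed factors (the perturbation size and matrix here are illustrative assumptions, not PaStiX's actual strategy):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned toy matrix
b = rng.standard_normal(n)

# Stand-in for the factors of a statically perturbed matrix
A_pert = A + 1e-6 * np.eye(n)

# Initial solve with the inexact "factors", then iterative refinement:
# x <- x + A_pert^{-1} (b - A x) until the true residual is tiny.
x = np.linalg.solve(A_pert, b)
for _ in range(10):
    r = b - A @ x
    if np.linalg.norm(r) <= 1e-12 * np.linalg.norm(b):
        break
    x += np.linalg.solve(A_pert, r)
```

Because the perturbation is small relative to A, each refinement step contracts the error strongly, so a handful of iterations recovers the accuracy of a full solve.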
Current works
- Astrid Casadei (PhD student): memory optimization to build a Schur complement in PaStiX, tight coupling between sparse direct and iterative solvers (HIPS) + graph partitioning with balanced halo
- Xavier Lacoste (PhD student) and the MUMPS team: GPU optimizations for sparse factorizations with StarPU
- Stojce Nakov (PhD student): tight coupling between sparse direct and iterative solvers (MaPHYS) + GPU optimizations for GMRES with StarPU
- Mathieu Faverge (Assistant Professor): redesign of the PaStiX static/dynamic scheduling with PaRSEC in order to get a generic framework for multicore/MPI/GPU/out-of-core + compression
Direct Solver Highlights
Fusion - ITER: PaStiX is used in the JOREK code developed by G. Huysmans at CEA/Cadarache, in a fully implicit time evolution scheme for numerical simulations of the ELM (Edge Localized Mode) instabilities commonly observed in the standard tokamak operating scenario.
MHD pellet injection simulated with the JOREK code
Tasks algorithms
Kernels: POTRF, TRSM, SYRK, GEMM
(a) Tasks for a dense tile; (b) tasks for a sparse supernode (figure)
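For the dense case, the four kernels combine into the classical tiled Cholesky task decomposition. A sketch that only enumerates the tasks of a right-looking tiled Cholesky (tile indices, no numerics, no runtime system):

```python
def tiled_cholesky_tasks(nt):
    """Yield the POTRF/TRSM/SYRK/GEMM tasks of an nt x nt tiled
    Cholesky factorization, in one valid sequential order."""
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", k))                # factorize diagonal tile k
        for i in range(k + 1, nt):
            tasks.append(("TRSM", i, k))          # triangular solve on panel tile
        for i in range(k + 1, nt):
            tasks.append(("SYRK", i, k))          # update diagonal tile i
            for j in range(k + 1, i):
                tasks.append(("GEMM", i, j, k))   # update off-diagonal tile (i, j)
    return tasks

tasks = tiled_cholesky_tasks(3)
```

A runtime system turns this flat list into a DAG by tracking which tiles each task reads and writes; the sparse variant replaces the regular tile loop by the supernode/block structure of the factors.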
DAG representation
(c) Dense DAG (figure): POTRF, TRSM, SYRK, and GEMM tasks linked by data dependencies
(d) Sparse DAG (figure): panel(k) and gemm(k) tasks linked by data dependencies
Direct Solver Highlights (multicore) - SGI 160 cores

Name  N         NNZA      Fill ratio  Fact
Audi  9.44e+05  3.93e+07  31.28       float LL^T
10M   1.04e+07  8.91e+07  75.66       complex LDL^T

10M (threads)  10    20    40    80    160
Facto (s)      3020  1750  654   356   260
Mem (GB)       122   124   127   133   146
Solve (s)      24.6  13.5  3.87  2.90  2.89

Audi       128   2x64    4x32    8x16
Facto (s)  17.8  18.6    13.8    13.4
Mem (GB)   13.4  2x7.68  4x4.54  8x2.69
Solve (s)  0.40  0.32    0.21    0.14
Direct Solver Highlights (cluster of multicore)
RC3 matrix - complex double precision
N = 730700 - NNZA = 41600758 - Fill-in = 50

Facto (s)   1 MPI  2 MPI  4 MPI  8 MPI
1 thread    6820   3520   1900   1890
6 threads   1020   639    337    287
12 threads  525    360    155    121

Mem (GB)    1 MPI  2 MPI  4 MPI  8 MPI
1 thread    34     19.2   12.5   9.22
6 threads   34.3   19.5   12.8   9.66
12 threads  34.6   19.7   13     9.14

Solve (s)   1 MPI  2 MPI  4 MPI  8 MPI
1 thread    6.97   3.75   1.93   1.03
6 threads   2.5    1.43   0.78   0.54
12 threads  1.33   0.93   0.66   0.59
Block ILU(k): supernode amalgamation algorithm
Derive a block incomplete LU factorization from the supernodal parallel direct solver:

- Based on the existing PaStiX package
- Level-3 BLAS incomplete factorization implementation
- Fill-in strategy based on level of fill among block structures, identified thanks to the quotient graph
- Amalgamation strategy to enlarge block sizes

Highlights
- Handles high levels of fill efficiently
- Solve time faster than with scalar ILU(k)
- Scalable parallel implementation
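To see what an incomplete factorization buys as a preconditioner, here is a hedged sketch using SciPy's scalar `spilu` (drop-tolerance based) inside GMRES. It stands in for PaStiX's block ILU(k); the block/amalgamation machinery of the slide is not reproduced.

```python
import numpy as np
from scipy.sparse import diags, csc_matrix
from scipy.sparse.linalg import spilu, gmres, LinearOperator

n = 100
# 1D Poisson-like tridiagonal test matrix
A = csc_matrix(diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)))
b = np.ones(n)

ilu = spilu(A, drop_tol=1e-4)                 # incomplete LU factors
M = LinearOperator((n, n), matvec=ilu.solve)  # preconditioner M ~ A^{-1}

x, info = gmres(A, b, M=M)                    # info == 0 means converged
```

With a good incomplete factorization the Krylov iteration count drops sharply; the block (BLAS-3) formulation in PaStiX additionally makes each application of the preconditioner cache-friendly.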
Manycore
Multiple layer approach
ALGORITHM / RUNTIME / KERNELS layers mapped on CPU and GPU (figure)

Governing idea: enable advanced numerical algorithms to be executed on a scalable unified runtime system, exploiting the full potential of future exascale machines.

Basics:
- Graph of tasks
- Out-of-order scheduling
- Fine granularity
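The "graph of tasks + out-of-order scheduling" idea can be sketched as a toy runtime: tasks run as soon as their predecessors complete, in whatever order becomes legal. Real runtimes such as StarPU or PaRSEC additionally manage workers, devices, and data transfers; none of that is modeled here.

```python
from collections import deque

def run_dag(deps, action):
    """deps: task -> set of predecessor tasks. Runs each task once all of
    its predecessors have run, returning the execution order."""
    pending = {t: set(d) for t, d in deps.items()}
    succs = {t: set() for t in deps}
    for t, d in deps.items():
        for p in d:
            succs[p].add(t)
    ready = deque(t for t, d in pending.items() if not d)
    order = []
    while ready:
        t = ready.popleft()          # a real scheduler picks by priority/device
        action(t)
        order.append(t)
        for s in succs[t]:
            pending[s].discard(t)
            if not pending[s]:
                ready.append(s)      # successor becomes ready
    return order

# Tiny Cholesky-like DAG: POTRF -> two TRSMs -> SYRK
deps = {"POTRF": set(), "TRSM1": {"POTRF"}, "TRSM2": {"POTRF"},
        "SYRK": {"TRSM1", "TRSM2"}}
order = run_dag(deps, lambda t: None)
```

The two independent TRSM tasks may execute in either order (or, on a real runtime, concurrently); only the dependency edges constrain the schedule.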
DAG schedulers considered

StarPU
- RunTime team - Inria Bordeaux Sud-Ouest
- C. Augonnet, R. Namyst, S. Thibault
- Dynamic task discovery
- Computes cost models on the fly
- Multiple kernels on the accelerators
- Heterogeneous Earliest Finish Time (HEFT) strategy

PaRSEC (formerly DAGuE)
- ICL - University of Tennessee, Knoxville
- G. Bosilca, A. Bouteiller, A. Danalis, T. Herault
- Parameterized task graph
- Only the most compute-intensive kernel on the accelerators
- Simple scheduling strategy based on computing capabilities
- GPU multi-stream enabled
Supernodal sequential algorithm
forall the supernodes S1 do
    panel(S1);                      /* update of the panel */
    forall the extra-diagonal blocks Bi of S1 do
        S2 ← supernode in front of Bi;
        gemm(S1, S2);               /* sparse GEMM B_{k, k≥i} × Bi^T subtracted from S2 */
    end
end
StarPU Tasks submission
forall the supernodes S1 do
    submit panel(S1);               /* update of the panel */
    forall the extra-diagonal blocks Bi of S1 do
        S2 ← supernode in front of Bi;
        submit gemm(S1, S2);        /* sparse GEMM B_{k, k≥i} × Bi^T subtracted from S2 */
    end
    wait for all tasks();
end
PaRSEC’s parameterized task graph
Task graph (figure): panel(k) and gemm(k) tasks with their data dependencies
panel(j)

/* Execution Space */
j = 0 .. cblknbr-1

/* Task Locality (Owner Compute) */
: A(j)

/* Data dependencies */
RW A <- (leaf) ? A(j) : C gemm(lastbrow)
     -> A gemm(firstblock+1 .. lastblock)
     -> A(j)
Panel Factorization in JDF Format
Matrices and Machines
Matrix    Prec  Method  Size    nnzA   nnzL     TFlop
FilterV2  Z     LU      0.6e+6  12e+6  536e+6   3.6
Flan      D     LL^T    1.6e+6  59e+6  1712e+6  5.3
Audi      D     LL^T    0.9e+6  39e+6  1325e+6  6.5
MHD       D     LU      0.5e+6  24e+6  1133e+6  6.6
Geo1438   D     LL^T    1.4e+6  32e+6  2768e+6  23
Pmldf     Z     LDL^T   1.0e+6  8e+6   1105e+6  28
Hook      D     LU      1.5e+6  31e+6  4168e+6  35
Serena    D     LDL^T   1.4e+6  32e+6  3365e+6  47

Table: Matrix description (Z: double complex, D: double).

Machine  Processors                       Frequency  GPUs              RAM
Mirage   Westmere Intel Xeon X5650 (2×6)  2.67 GHz   Tesla M2070 (×3)  36 GB
CPU scaling study: GFlop/s for the numerical factorization (figure) - native PaStiX scheduler vs. StarPU vs. PaRSEC on 1, 3, 6, 9, and 12 cores, for afshell10 (D, LU), FilterV2 (Z, LU), Flan (D, LL^T), audi (D, LL^T), MHD (D, LU), Geo1438 (D, LL^T), pmlDF (Z, LDL^T), and HOOK (D, LU).
GPU scaling study: GFlop/s for the numerical factorization (figure) - shown for HOOK (D, LU) and then for all eight test matrices, comparing Native (CPU only), StarPU (CPU only, 1-3 GPUs), PaRSEC with 1 stream (CPU only, 1-3 GPUs), and PaRSEC with 3 streams (1-3 GPUs).
Xeon Phi results (from the Max-Planck-Institut für Plasmaphysik) [Phi = 4 CPUs, GPU = 6 CPUs]
Hybrid
Hybrid Linear Solvers
Develop robust scalable parallel hybrid direct/iterative linear solvers
- Exploit the efficiency and robustness of sparse direct solvers
- Develop robust parallel preconditioners for iterative solvers
- Take advantage of scalable implementations of iterative solvers

Domain Decomposition (DD)
- Natural approach for PDEs
- Extends to general sparse matrices
- Partition the problem into subdomains
- Use a direct solver on the subdomains
- Robust preconditioned iterative solver
Method used in MaPHYS
- Partitioning the global matrix into several local matrices
  - MeTiS [G. Karypis and V. Kumar]
  - Scotch [Pellegrini et al.]
- Local factorization
  - MUMPS [P. Amestoy et al.] (with Schur option)
  - PaStiX [P. Ramet et al.] (with Schur option and multi-threaded version)
- Construction of the preconditioner
  - MKL library
- Solving the reduced system
  - CG/GMRES/FGMRES on the reduced system

[Figures: a 2D domain Ω (nz = 1920) partitioned into 2 and then 4 subdomains Ω1..Ω4 with interface Γ; 0, 48, and 88 cut edges respectively]
Experimental set-up

Hopper platform (hardware)
- Two twelve-core AMD "Magny-Cours" 2.1 GHz processors per node
- Memory: 32 GB DDR3
- Double precision

Matrices
Matrix  Tdr455K  Nachos4M
N       2,738K   4,147K
Nnz     112.7M   256.4M

Table: Overview of the sparse matrices used on the Hopper platform
Results on the Hopper platform (figures): time for all computational steps and peak memory per node, for the Tdr455K matrix on 96 to 3072 cores and the Nachos4M matrix on 768 to 24576 cores, with 3, 6, 12, and 24 threads per process.
HIPS: hybrid direct-iterative solver

Based on a domain decomposition: the interface is one node wide (no overlap, in DD lingo).

A = ( A_B  F   )
    ( E    A_C )

B: interior nodes of the subdomains (direct factorization).
C: interface nodes.

Special decomposition and ordering of the subset C.
Goal: build a global Schur complement preconditioner (ILU) from the local domain matrices only.
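The algebra behind the reduced interface system can be sketched densely with NumPy: eliminating the interior unknowns x_B yields the Schur complement S = A_C - E A_B^{-1} F on the interface. This only shows the block elimination; HIPS and MaPHYS never form A_B^{-1} explicitly and work with local, distributed blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
nB, nC = 6, 2                         # toy interior and interface sizes
A = rng.standard_normal((nB + nC, nB + nC)) + (nB + nC) * np.eye(nB + nC)
AB, F = A[:nB, :nB], A[:nB, nB:]      # 2x2 block partition of A
E, AC = A[nB:, :nB], A[nB:, nB:]
b = np.ones(nB + nC)
bB, bC = b[:nB], b[nB:]

# Reduced interface system: S x_C = bC - E A_B^{-1} bB
S = AC - E @ np.linalg.solve(AB, F)
xC = np.linalg.solve(S, bC - E @ np.linalg.solve(AB, bB))

# Back-substitution for the interior unknowns
xB = np.linalg.solve(AB, bB - F @ xC)
x = np.concatenate([xB, xC])
```

In the hybrid solvers, the interface system S x_C = ... is what the preconditioned Krylov method (CG/GMRES) iterates on, while each subdomain solve with A_B uses the sparse direct factorization.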
HIPS: preconditioners
Main features
- Iterative or "hybrid" direct/iterative methods are implemented.
- Mix direct supernodal (BLAS-3) and sparse ILUT factorizations in a seamless manner.
- Memory/load balancing: distribute the domains over the processors (more domains than processors).
H-PaStiX
Toward low-rank compression in supernodal solvers

Many works on hierarchical matrices and direct solvers:
- Eric Darve: hierarchical matrix classifications (Building O(N) Linear Solvers Using Nested Dissection)
- Sherry Li: multifrontal solver + HSS (Towards an Optimal-Order Approximate Sparse Factorization Exploiting Data-Sparseness in Separators)
- David Bindel: CHOLMOD + low rank (An Efficient Solver for Sparse Linear Systems Based on Rank-Structured Cholesky Factorization)
- Jean-Yves L'Excellent: MUMPS + Block Low Rank
FastLA associate team between Inria, Berkeley, and Stanford

Supernodal solver + hierarchical matrices: O(N log^a N)

1. Check the potential compression ratio on the top-level blocks
2. Develop a prototype with:
   - low-rank compression on the larger supernodes
   - a compression tree built at each update
   - a complexity analysis of the approach
3. Study the coupling between nested dissection and compression-tree ordering

Which algorithm to find the low-rank approximation? SVD, RR-LU, RR-QR, ACA, CUR, randomized methods, ...
Which family of hierarchical matrices? H, H2, HODLR, ...
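The first candidate on the list, the (truncated) SVD, already shows the payoff of low-rank compression. A hedged sketch: the block B below, a smooth kernel evaluated on two well-separated point clusters, is numerically low-rank, much like the far-field interaction blocks targeted in H-PaStiX, and is replaced by U_k V_k^T with k << n.

```python
import numpy as np

n = 200
# Smooth kernel on two well-separated clusters -> numerically low rank
x = np.linspace(0.0, 1.0, n)
y = np.linspace(10.0, 11.0, n)
B = 1.0 / np.abs(x[:, None] - y[None, :])

U, s, Vt = np.linalg.svd(B, full_matrices=False)
tol = 1e-8
k = int(np.sum(s > tol * s[0]))       # numerical rank at relative tolerance tol
Uk = U[:, :k] * s[:k]                 # keep only the k dominant singular triplets
Vk = Vt[:k, :]
# B ~ Uk @ Vk, stored in 2*n*k entries instead of n*n
```

The SVD gives the optimal rank-k approximation but is the most expensive option; RR-QR, ACA, or randomized sampling trade some accuracy for much cheaper compression, which matters when every admissible block of the factors must be compressed.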
Conclusion
Software

Graph/mesh partitioning and ordering:
http://scotch.gforge.inria.fr

Sparse linear system solvers:
http://pastix.gforge.inria.fr
http://hips.gforge.inria.fr
https://wiki.bordeaux.inria.fr/maphys/doku.php
Fast Multipole Method:
http://scalfmm-public.gforge.inria.fr/

Matrices Over Runtime Systems (with the University of Tennessee):
http://icl.cs.utk.edu/projectsdev/morse