Transcript of “Communication-avoiding & Pipelined Krylov solvers in Trilinos”
-
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.
Communication-avoiding & Pipelined Krylov solvers in Trilinos
Ichitaro Yamazaki (SNL), Mark Hoemmen (SNL),
Erik Boman (SNL), and Jack Dongarra (UTK)
Joint Laboratory for Extreme-Scale Computing Workshop,
Knoxville, TN, April 15, 2019
-
DOE ECP PEEKS Project
§ Krylov methods are powerful solvers for A*x = b
  § used in many applications, including DOE’s and ECP’s (some results later)
§ they use two main kernels (based on Krylov subspace projection)
  § sparse matrix-vector multiply (SpMV)
    – generates the Krylov subspace span(q, A*q, A^2*q, …)
    – combined with a preconditioner to improve convergence
    – black box, provided by users
  § orthogonalization
    – generates orthonormal basis vectors
    – our current focus
§ communication can be the bottleneck (time, and maybe power); a sketch follows the diagram below
  § P2P + irregular data access for SpMV + preconditioner
  § all-reduce + BLAS-1 or -2 for orthogonalization
  § becoming more expensive on newer architectures
[Diagram: multicore node with CPUs, caches, and memories, illustrating intra- and inter-node communication costs]
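To make the orthogonalization communication pattern concrete, here is a minimal sketch (illustrative, not Trilinos code): each dot product does BLAS-1 work locally, then pays for a latency-bound all-reduce. The function and variable names are placeholders.

#include <mpi.h>
#include <numeric>
#include <vector>

// Local BLAS-1 inner product followed by the all-reduce that every
// orthogonalization/normalization step requires.
double parallel_dot (const std::vector<double>& x,
                     const std::vector<double>& y, MPI_Comm comm)
{
  double local = std::inner_product (x.begin (), x.end (), y.begin (), 0.0);
  double global = 0.0;
  MPI_Allreduce (&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
  return global;
}
// SpMV, by contrast, needs only point-to-point halo exchanges with
// neighboring processes, plus irregular local memory access.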
-
DOE ECP PEEKS Project
§ Communication-avoiding & pipelined Krylov solvers
  § s-step aims to avoid all-reduces by generating s vectors at a time
    § latency reduced by a factor of s
  § pipelining tries to hide all-reduces by overlapping them with computation
    § max speedup of 2x from hiding one step, but maybe more through deeper pipelining
    § (a rough cost model follows the diagram below)
§ Make these solvers available in Trilinos
  § Goal: perform “well” for ECP applications on exascale computers
  § Also should be useful for non-exascale Trilinos users
[Diagram: communication timelines — standard GMRES (SpMV + all-reduce + P2P per iteration); s-step with s=2 (block orthogonalization once per two SpMVs); pipelined (the all-reduce started after the GEMVs is overlapped with the next SpMV and P2P)]
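A rough latency model makes the two claims above concrete (an illustrative sketch, not from the talk). With m iterations, all-reduce latency \alpha, and per-iteration compute time \gamma:

\begin{align*}
T_{\text{standard}}  &\approx m\,(\gamma + \alpha) \\
T_{s\text{-step}}    &\approx m\,\gamma + \tfrac{m}{s}\,\alpha && \text{(one block reduction per $s$ iterations)} \\
T_{\text{pipelined}} &\approx m\,\max(\gamma, \alpha) && \text{(reduction hidden behind compute; at best $2\times$, when $\gamma \approx \alpha$)}
\end{align*}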
-
What is Trilinos?
§ Parallel math libraries for science & engineering applications
  § Sparse matrices & parallel distributions
§ Linear solvers & preconditioners
§ Nonlinear solvers, optimization, …
§ ~ 20 years’ continuous development
§ Mostly C++11, some C & Fortran
§ Supports many different platforms
  § CPUs: x86, KNL, POWER, ARM, …
§ GPUs: NVIDIA, AMD in progress, …
§ Future architectures (e.g., exascale)
§ github.com/trilinos/Trilinos
§ Users inside & outside Sandia
-
Trilinos’ linear solvers
§ Iterative linear solvers (Belos)
§ Parallel linear algebra (Tpetra)
  § LA (sparse/dense mat/vec) operations
§ Thread parallelism (Kokkos)
  § Programming model for portability
§ Sparse direct solvers (Amesos2)
§ Direct+iterative solvers (ShyLU)
§ Algebraic preconditioners (Ifpack2)
§ Algebraic multigrid (MueLu)
§ Segregated block solvers (Teko)
[Diagram legend — Green: programming model; Yellow: provide data & kernels; Blue: use data & kernels directly; Red: use kernels abstractly]
Belos uses the underlying linear algebra implementation only through an abstract interface
-
Krylov methods we implemented
§ available now in Trilinos (Belos)
§ CA-GMRES (Chronopoulos & Kim ’90, Hoemmen 2010)
  § SpMV+Precond as black box
    – no matrix-powers kernel (so the SpMV itself is not communication-avoiding)
    – option to generate a Newton basis for stability
  § block orthogonalization (a few options)
    – standard GMRES: >2 all-reduces/iteration with BLAS-1 or -2; s-step: ~2 all-reduces per s steps with BLAS-3
for j = 1:s:m do
  // generate s new Krylov vectors
  for i = j+1 : 1 : j+s do
    q_i = A M^{-1} q_{i-1}   // SpMV + Precond (black box)
  end for
  // orthogonalize the new block against previous vectors, then within itself
  Q_{j+1:j+s} = Ortho(Q_{j+1:j+s}, Q_{1:j})
  Q_{j+1:j+s} = TSQR(Q_{j+1:j+s})
end for
-
Krylov methods we implemented
§ available now in Trilinos (Belos)
§ 1 all-reduce GMRES (Ghysels et al. 2016)
§ 1 all-reduce CG (Saad ’85, D’Azevedo ’93)
  § merges orthogonalization & normalization into one all-reduce
§ Pipelined GMRES (Ghysels et al. 2016)
§ Pipelined CG (Ghysels & Vanroose 2012)
  § non-blocking collective, MPI_Iallreduce (see the sketch below)
[Diagram: pipelined iteration timeline — the all-reduce started after the GEMVs is overlapped with the next SpMV and GEMV; P2P exchanges for SpMV proceed independently]
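As referenced above, a minimal sketch of the pipelining mechanism (illustrative, not the Belos implementation): start the dot-product reduction with MPI_Iallreduce, run the next SpMV while it is in flight, then wait. The spmv call is a hypothetical placeholder for the user’s black-box kernel.

#include <mpi.h>

void pipelined_step (const double* local_dots, double* global_dots,
                     int num_dots, MPI_Comm comm)
{
  MPI_Request req;
  // Start the global reduction without blocking.
  MPI_Iallreduce (local_dots, global_dots, num_dots,
                  MPI_DOUBLE, MPI_SUM, comm, &req);

  // Overlap: apply the matrix (and preconditioner) for the next
  // iteration while the reduction is in flight.
  // spmv (...);  // hypothetical user-provided black-box kernel

  // The reduced dot products are needed only now, to finish the
  // previous iteration's orthogonalization/normalization.
  MPI_Wait (&req, MPI_STATUS_IGNORE);
}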
-
Belos solver factory interface
§ available through the Belos solver factory
  § no code change needed for Belos users
§ available by run-time option (along with solver parameters, convergence criteria); a run-time selection sketch follows after the code below
// Create a solver manager (scalar_type, MV, OP are typedefs defined elsewhere)
Belos::SolverFactory<scalar_type, MV, OP> factory;
RCP<solver_manager_type> solverManager =
  factory.create (args.solverName, params);

// Set the problem
RCP<linear_problem_type> lp (new linear_problem_type (A, X, B));
solverManager->setProblem (lp);

// Solve the system
Belos::ReturnType belosResult = solverManager->solve ();
§ The same solver manager may be reused for multiple solves, e.g., for different b or A.
§ A, b, x may be templated, e.g., on scalar, global/local ordinal, and node types.
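A sketch of the run-time selection mentioned above, assuming the same scalar_type/MV/OP typedefs as the snippet. “Maximum Iterations” and “Convergence Tolerance” are standard Belos parameters; the exact name strings for the s-step/pipelined variants should be taken from the Belos documentation.

#include <BelosSolverFactory.hpp>
#include <Teuchos_ParameterList.hpp>

// Solver options, set entirely at run time.
Teuchos::RCP<Teuchos::ParameterList> params = Teuchos::parameterList ();
params->set ("Maximum Iterations", 200);
params->set ("Convergence Tolerance", 1.0e-8);

// Swapping the name string selects a different solver,
// with no other code changes.
Belos::SolverFactory<scalar_type, MV, OP> factory;
auto solverManager = factory.create ("GMRES", params);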
-
Nalu Wind performance results (collaboration with the ECP ExaWind project)
§ Nalu Wind (CFD)
  § github.com/exawind/nalu-wind
§ Low Mach, unstructured, C++
§ Sierra Tool Kit (STK) meshes
§ Can handle >> 10^9 dofs
§ Our experiments (mesh from NREL)
  § Simulate air flow around wind turbine(s)
§ Hybrid RANS-LES (RANS near blade, LES in wake)
§ Segregated physics
§ Trilinos (Belos Krylov)
§ 95 M dofs / linear system
  § Pressure and momentum systems
§ MueLu Multigrid / Ifpack Symmetric Gauss Seidel
§ NERSC Cori: Haswell, 32 cores/node
First test case: Vestas V27 (225 kW, 27 m rotor)
Image from Domino, Barone & Bruner; ExaWind Poster, SAND2018-1050D, Exascale Computing Project 2nd Annual Meeting, Knoxville, TN, February 6, 2018.
(Image taken from Thomas et al., “High-fidelity simulation of wind-turbine incompressible flows with classical and aggregation AMG preconditioners,” submitted to SIAM Journal on Scientific Computing, Copper Mountain Conference on Iterative Methods 2018 Special Issue, 15 June 2018.)
-
Time to solution: Pressure system
[Plot: total solution time (log scale, ~10^2–10^3 s) vs. process count (256–4096) for GMRES+ICGS; s-step, newton, CGS+CholQR; s-step, newton, low-synch CGS2x+CholQR2; s-step, newton, low-synch CGS+CholQR2; s-step, monom, low-synch CGS+CholQR2; and linear scaling. Annotated speedups over regular GMRES range from 1.3x to 1.5x (one point marked Inf x).]
• MueLu algebraic multigrid + GMRES(100)
  • GMRES converges in ~200 iterations
• CA-GMRES
  • Newton basis (Ritz values from the first s iterations as shifts) or monomial basis
  • converges in the same number of iterations
  • speedup of 1.4x over regular GMRES due to fewer all-reduces through block orthogonalization
-
Time to solution: Momentum system
• Symmetric Gauss-Seidel preconditioner + GMRES
  • GMRES converges in ~5 to ~20 iterations
• Pipelined GMRES, non-blocking collective (pipeline depth = 1)
[Plot: total solution time (~10–45 s) vs. process count (4096–16384) for GMRES+ICGS, pipelined+CGS, and linear scaling. Annotated speedups over regular GMRES: 0.9x, 1.0x, 1.1x.]
-
Conclusions
§ Communication-avoiding & pipelined Krylov solvers in Trilinos
§ more info at icl.utk.edu/peeks & trilinos.org
§ Improved solve performance in Nalu Wind by up to 1.5x
§ Did so without breaking software backwards compatibility
§ Working on performance improvements (e.g., communication/computation kernels on GPUs), algorithmic extensions (e.g., low-synch block orthogonalization), and other preconditioning
-
Thank you!!
§ Nalu developers (Stephen Thomas at NREL & Jonathan Hu at SNL)
§ ECP PEEKS, for funding
This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation’s exascale computing imperative.
-
-
Pipelined CA-GMRES (re)orthog.
§ CGS: Classical Gram-Schmidt
§ 1-reduce CGS + CholQR (a short derivation follows below):
  1. [C; G] = [Q, P]^T * P   // one all-reduce gives both Q^T*P and P^T*P
  2. P = P − Q * C,  G = G − C^T * C
  3. R = chol(G),  P = P / R
§ (The next iteration orthogonalizes P)
§ Above + reorthogonalize P:
  1. [T, C; G1, G] = [Q, P]^T * [P’, P]
  2. R’ = chol(G1),  P’ = P / R’
  3. Update C & G, then steps 2–3 above
§ MGS, CGS2, …: NREL talk today!
§ PDSEC’17; adding to Trilinos soon
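Why one reduction suffices (a short derivation, not on the slides): the single all-reduce in step 1 returns both C = Q^T P and G = P^T P, and with Q orthonormal the Gram matrix of the CGS-updated block comes for free:

\begin{align*}
(P - QC)^T (P - QC)
  &= P^T P - C^T (Q^T P) - (P^T Q)\,C + C^T (Q^T Q)\,C \\
  &= G - C^T C ,
\end{align*}

so R = chol(G − C^T C) normalizes the updated P (CholQR) without a second all-reduce.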
[Diagram: block layout of the orthogonalization — previous s vectors and more previous vectors form Q; P’ holds the start vector for the new s vectors P; the two dot-product blocks are fused into one reduction; CGS with Q and CholQR of P]