Classical Algebraic Multigrid for CFD using CUDA
Simon Layton, Lorena Barba
With thanks to Justin Luitjens, Jonathan Cohen (NVIDIA)
Problem Statement
‣Solve elliptic problems of the form $Au = b$
‣For instance, the pressure Poisson equation from fluid dynamics: $\nabla^2 \phi = \nabla \cdot u^*$
- This commonly takes 90%+ of the total run time
‣These systems arise in many other engineering applications
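To make the target system concrete, here is a minimal host-side sketch (not from the slides; all names are illustrative) that assembles the 1D Poisson operator with the 3-point stencil in CSR form, the storage format assumed by the kernel sketches later in this transcript:

```cuda
#include <cstdio>
#include <vector>

// Compressed Sparse Row storage (illustrative helper, not from the slides).
struct CSR {
    int n;                    // number of rows
    std::vector<int> rowPtr;  // size n+1; row i spans [rowPtr[i], rowPtr[i+1])
    std::vector<int> col;     // column index of each non-zero
    std::vector<double> val;  // value of each non-zero
};

// Assemble the 1D Poisson operator, i.e. the 3-point stencil [-1, 2, -1].
CSR poisson1d(int n) {
    CSR A{n, {0}, {}, {}};
    for (int i = 0; i < n; ++i) {
        if (i > 0)     { A.col.push_back(i - 1); A.val.push_back(-1.0); }
        A.col.push_back(i); A.val.push_back(2.0);
        if (i < n - 1) { A.col.push_back(i + 1); A.val.push_back(-1.0); }
        A.rowPtr.push_back((int)A.col.size());
    }
    return A;
}

int main() {
    CSR A = poisson1d(5);
    printf("n = %d, nnz = %d\n", A.n, (int)A.col.size());
    return 0;
}
```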
What is Multigrid?
‣Hierarchical
- Repeatedly reduce the size of the problem
- Solve these smaller problems & transfer corrections
‣ Iterative
- Based on a smoother (e.g. Jacobi, Gauss-Seidel); a minimal sweep is sketched below
- Solve using recursive ‘cycles’
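As a concrete example of a smoother, here is a hedged sketch of one weighted-Jacobi sweep over a CSR matrix in device memory; the kernel name and the damping parameter omega are illustrative assumptions, not the authors' code.

```cuda
// One weighted-Jacobi sweep: xNew = x + omega * D^{-1} (b - A x).
// One thread per matrix row; all arrays live in device memory.
__global__ void jacobiSweep(int n, const int* rowPtr, const int* col,
                            const double* val, const double* b,
                            const double* x, double* xNew, double omega) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double diag = 1.0, sum = 0.0;
    for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
        if (col[j] == i) diag = val[j];   // remember the diagonal entry
        sum += val[j] * x[col[j]];        // accumulate (A x)_i
    }
    xNew[i] = x[i] + omega * (b[i] - sum) / diag;
}
```

Repeated sweeps ping-pong x and xNew; a typical launch would be jacobiSweep<<<(n + 255) / 256, 256>>>(…).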
Why do we care?
‣Optimal complexity
- Convergence independent of problem size
- Consequently O(N) in time
‣Parallel & scalable
- Hypre package from LLNL scales to 100k+ cores
- “Scaling Hypre’s multigrid solvers to 100,000 cores”, A. H. Baker et al.
Classical vs. Aggregative AMG
‣Solve phase is the same; the only difference is in the setup
- Classical selects variables to be in the coarser level
- Aggregation combines variables to form the coarser level
[Figure. Classical: choose variables to keep on the coarse level. Aggregation: combine variables to form an entry on the coarse level]
AMG - Main Components
‣ Consists of 5 main parts (a minimal per-level data sketch follows the list):
1. Strength of Connection metric (SoC) - a measure of how strongly matrix variables affect each other
2. Selector / Aggregator - choose variables / aggregates to move to the coarser level
3. Interpolator / Restrictor - generate coarsened matrices & transfer residuals between levels
4. Galerkin product - generate the next coarsest level: $A_{k+1} = R_k A_k P_k$
5. Solver cycle - consists of sparse matrix-vector multiplications
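As a rough picture of what the setup phase produces per level, here is a minimal sketch of an assumed layout (not the authors' data structures), reusing the CSR struct from the earlier sketch; spmm() is a hypothetical CSR multiply standing in for the SpMM discussed later in the talk.

```cuda
// One level of the AMG hierarchy (assumed layout, for illustration only).
struct Level {
    CSR A;  // system matrix on this level
    CSR P;  // interpolation (prolongation): coarse -> fine
    CSR R;  // restriction: fine -> coarse (often P^T)
};

// The Galerkin product builds the next level's matrix as two sparse
// products; spmm() is hypothetical (e.g. a CUSP multiply could stand in).
CSR galerkin(const CSR& R, const CSR& A, const CSR& P) {
    return spmm(R, spmm(A, P));  // A_{k+1} = R_k A_k P_k
}
```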
Strength of Connection
‣ Variable i strongly depends on variable j if:
$$-a_{ij} \ge \alpha \max_{k \ne i}(-a_{ik})$$
‣ After the row’s max value is calculated, the strength test for each edge can be performed in parallel
[Figure: example row with diagonal 4 and off-diagonal entries −1, −1, −0.8, −0.5, −0.3, −0.1, −0.1; with α = 0.25 the threshold is −0.25]
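A minimal per-row kernel for this test, assuming CSR storage and one strength flag per non-zero (names illustrative). One thread handles a whole row; the per-edge parallelism mentioned above could equally be mapped to a warp per row.

```cuda
#include <math.h>

// Flag edge (i, j) as strong iff  -a_ij >= alpha * max_{k != i}(-a_ik).
__global__ void strengthOfConnection(int n, const int* rowPtr, const int* col,
                                     const double* val, double alpha,
                                     char* isStrong /* one flag per non-zero */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double rowMax = 0.0;
    for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j)
        if (col[j] != i) rowMax = fmax(rowMax, -val[j]);  // max off-diagonal
    double threshold = alpha * rowMax;
    for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j)
        isStrong[j] = (col[j] != i) && (-val[j] >= threshold);
}
```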
PMIS Selector
‣Select some variables to be coarse
- Use the strongest connections
- Neighbours of these variables are now fine
‣Repeat until all variables are classified (a sketch of one iteration follows below)
[Figure sequence: initial graph → select some variables to be coarse → neighbours of these coarse variables marked as fine]
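One common way to express a PMIS sweep as a pair of kernels, sketched under the assumptions that every variable carries a distinct random weight and that the strength flags from the SoC step are available; this is a generic formulation, not necessarily the authors' exact scheme.

```cuda
enum State { UNASSIGNED = 0, COARSE = 1, FINE = 2 };

// Phase 1: a still-unassigned node becomes coarse if its random weight
// beats every still-unassigned strong neighbour (independent-set test).
__global__ void pmisSelect(int n, const int* rowPtr, const int* col,
                           const char* isStrong, const double* weight,
                           const int* state, int* newState) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (state[i] != UNASSIGNED) { newState[i] = state[i]; return; }
    bool isMax = true;
    for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
        int k = col[j];
        if (isStrong[j] && state[k] == UNASSIGNED && weight[k] > weight[i])
            isMax = false;
    }
    newState[i] = isMax ? COARSE : UNASSIGNED;
}

// Phase 2: any unassigned node with a coarse strong neighbour becomes fine.
__global__ void pmisMarkFine(int n, const int* rowPtr, const int* col,
                             const char* isStrong, int* state) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || state[i] != UNASSIGNED) return;
    for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j)
        if (isStrong[j] && state[col[j]] == COARSE) { state[i] = FINE; return; }
}
```

The host alternates the two kernels, swapping state and newState, until no UNASSIGNED entries remain.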
Interpolator
‣Extended+i interpolator used
- Needed to ensure good convergence with our selector
- Considers neighbours of neighbours for the interpolation set
[Figures: the interpolation set at distance 1, then extended to distance 2]
Solve Cycle
‣Described recursively
- The simplest is the V-cycle
```
if (level k == M)                      // coarsest level
    solve A_M u_M = f_M
else
    apply smoother n1 times to A_k u_k = f_k
    // coarse-grid correction:
    set r_k = f_k - A_k u_k
    set r_{k+1} = R_k r_k
    call the cycle recursively for the next level
    interpolate e_k = P_k e_{k+1}
    correct the solution: u_k = u_k + e_k
    apply smoother n2 times to A_k u_k = f_k
```
[Diagram: V-cycle descending from k = 1 to k = M via restriction, returning via interpolation]
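The same cycle written as a recursive host-side function, extending the earlier Level sketch with assumed per-level device vectors u, f, r, e; smooth, residual, spmv, setZero, directSolve, and axpy are hypothetical wrappers around the corresponding kernels, named only for illustration.

```cuda
#include <vector>

// V-cycle driver mirroring the pseudocode above (sketch; helpers hypothetical).
void vCycle(std::vector<Level>& lv, int k, int n1, int n2) {
    Level& L = lv[k];
    if (k == (int)lv.size() - 1) {   // coarsest level M
        directSolve(L.A, L.u, L.f);  // solve A_M u_M = f_M
        return;
    }
    smooth(L.A, L.u, L.f, n1);       // pre-smooth n1 times
    residual(L.A, L.u, L.f, L.r);    // r_k = f_k - A_k u_k
    spmv(L.R, L.r, lv[k + 1].f);     // restrict: f_{k+1} = R_k r_k
    setZero(lv[k + 1].u);            // zero initial guess on the coarse level
    vCycle(lv, k + 1, n1, n2);       // recurse
    spmv(L.P, lv[k + 1].u, L.e);     // interpolate: e_k = P_k u_{k+1}
    axpy(1.0, L.e, L.u);             // correct: u_k += e_k
    smooth(L.A, L.u, L.f, n2);       // post-smooth n2 times
}
```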
GPU Implementation - Initial Thoughts
‣Most operations expose parallelism easily:
- Strength of connection
- PMIS selector
- Solve cycle - SpMV from CUSP / cuSPARSE
‣Others are less obvious or trickier:
- Galerkin product
- Interpolator
‣Want to ensure correctness
- Compare against identical algorithms in Hypre
Initial Results - Convergence
‣Regular Poisson grids
- Demonstrates that the convergence rate is independent of problem size
[Plot: average convergence rate vs. N (up to 10^6 unknowns) for 5-pt, 7-pt, 9-pt, and 27-pt stencils]
The 27-pt stencil runs out of memory on the C2050; we expect the same behaviour as for the other stencils
Initial Results - Immersed Boundary Method
‣Speedup compared to one node running Hypre (unoptimized)
[Figure: computed flow field from the immersed boundary simulation]

| Code | Time | Speedup |
| --- | --- | --- |
| CUDA | 3.57 s | 1.87x |
| Hypre (1 core) | 6.69 s | - |
| Hypre (2 cores) | 5.29 s | 1.26x |
| Hypre (4 cores) | 4.16 s | 1.61x |
| Hypre (6 cores) | 4.00 s | 1.67x |
Where to go from here?
‣Break down the 5-pt Poisson stencil by routine (the easy case)
[Pie chart: share of run time per routine (Prolongate & correct, Restrict residual, Compute R, Compute A, Compute P, Smooth, Other); the two largest shares are 44% and 39%]
Compute A is the Galerkin product: the sparse triple product R·A·P
Interpolation matrix P: can we form it as an existing pattern?
Compute P - Interpolation Matrix
‣ Interpolation weights depend on the distance-2 set:
‣This repeated set union turns out to be fairly tricky
- Can be formed as a specialized SpMM
$$C_i = C_i^s \cup \bigcup_{j \in F_i^s} C_j^s$$
(where $C_i^s$ and $F_i^s$ are the strongly connected coarse and fine neighbours of $i$)
SpMM / Repeated set union
‣SpMM conceptually:
- For each output row: if the influencing column is already in the row, accumulate into the current value; otherwise add the column to the row
‣Set union conceptually:
- For each set of coarse connections: if the entry is already in the result set, ignore it; otherwise add it to the set
$$C_i = C_i^s \cup \bigcup_{j \in F_i^s} C_j^s \quad\Longleftrightarrow\quad C = (I + F^s) \cdot C^s$$
SpMM: Idea
‣A similar approach is covered in more detail by Julien Demouth (S0285)
‣ Implement the accumulator as a hash table in shared memory
- Work on as many rows per block as will fit in shared memory
```
for each row i in matrix A:                 // warp (per the diagram)
    accumulator = []
    for each entry j in row i:              // thread
        rowB = column(j)
        for each entry k in row rowB of B:
            val = value(j) * value(k)
            accumulator.add(column(k), val)
    accumulator.write()
```
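A deliberately simplified sketch of the shared-memory hash accumulator: one block per row of A (the slides pack as many rows per block as fit, with table sizes chosen from the storage estimate), open addressing with linear probing, and preallocated row offsets for C. Double-precision atomicAdd in shared memory needs sm_60+, so this is illustrative rather than what ran on the C2050; all names are assumptions.

```cuda
#define TABLE_SIZE 512   // slots per row; must exceed the row's nnz estimate
#define EMPTY_KEY  (-1)

// Simplified C = A * B for one row of A per block. cRow holds preallocated
// row offsets for C (from the estimator); cCnt must be zero-initialized.
__global__ void hashSpmmRow(const int* aRow, const int* aCol, const double* aVal,
                            const int* bRow, const int* bCol, const double* bVal,
                            const int* cRow, int* cCol, double* cVal, int* cCnt) {
    __shared__ int    key[TABLE_SIZE];
    __shared__ double acc[TABLE_SIZE];
    int i = blockIdx.x;  // row of A (and of C)
    for (int t = threadIdx.x; t < TABLE_SIZE; t += blockDim.x) {
        key[t] = EMPTY_KEY;
        acc[t] = 0.0;
    }
    __syncthreads();
    // Threads split row i of A; each a_ij scatters row j of B into the table.
    for (int j = aRow[i] + threadIdx.x; j < aRow[i + 1]; j += blockDim.x) {
        int rowB = aCol[j];
        double a = aVal[j];
        for (int k = bRow[rowB]; k < bRow[rowB + 1]; ++k) {
            int c = bCol[k];
            int slot = c % TABLE_SIZE;              // hash, linear probe
            while (true) {
                int prev = atomicCAS(&key[slot], EMPTY_KEY, c);
                if (prev == EMPTY_KEY || prev == c) break;
                slot = (slot + 1) % TABLE_SIZE;
            }
            atomicAdd(&acc[slot], a * bVal[k]);     // needs sm_60+ for double
        }
    }
    __syncthreads();
    // Compact the occupied slots into the preallocated row of C (unsorted).
    for (int t = threadIdx.x; t < TABLE_SIZE; t += blockDim.x) {
        if (key[t] != EMPTY_KEY) {
            int dst = cRow[i] + atomicAdd(&cCnt[i], 1);
            cCol[dst] = key[t];
            cVal[dst] = acc[t];
        }
    }
}
```

A production version would fall back to a library SpMM when a row's estimate exceeds the table size, as the next slide discusses.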
Problem!
‣Performance is highly dependent on the size of the hash tables
- Too little shared memory: cannot solve the problem
- Too many values in shared memory: many collisions, so poor performance
‣Answer: estimate the storage needed
- Allocate the number of warps per block & the shared-memory size based on this estimate
- If all else fails, pass the multiplication to CUSP
‣Worst case: performance almost identical to CUSP
Storage Estimation
‣From Edith Cohen, “Structure prediction and computation of sparse matrix products” (1998)
‣Represent as a bipartite graph (2×2 grid, 5-pt Poisson stencil)
‣By looking at the connections, get an estimate of the number of non-zeros in C
[Figure: bipartite-graph layers labelled A.rows, A.cols/B.rows, and B.cols/C.rows over nodes 0-3; the following slides animate the traversal]
Storage Prediction - Implementation
```
for r in [0, R):
    for each row i:
        u[i] = rand()                         // exponential distribution
    for each row i in A:
        v[i] = +infinity
        for each non-zero j in row i:
            v[i] = min(v[i], u[col[j]])
    for each row i in B:
        w[i][r] = +infinity
        for each non-zero j in row i:
            w[i][r] = min(w[i][r], v[col[j]])
for each row i:
    nnz[i] = ceil((R - 1) / sum(w[i][:]))     // estimator
```
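Each of the two inner loops is a "generalized SpMV" over a (min, select) pattern rather than (+, ×). A minimal kernel sketch (names illustrative): it computes out[i] as the minimum of in[col[j]] over row i's non-zeros, so one launch with A's structure maps u to v and a second with B's structure maps v to w[r].

```cuda
#include <cfloat>

// Generalized SpMV with a (min, select) pattern: out[i] = min_j in[col[j]].
__global__ void minGatherSpmv(int n, const int* rowPtr, const int* col,
                              const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float m = FLT_MAX;
    for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j)
        m = fminf(m, in[col[j]]);
    out[i] = m;
}
```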
Storage Prediction - Optimizations
```
for r in [0, R):
    for each row i in A:
        v[i] = FLT_MAX
        for each non-zero j in row i:
            v[i] = min(v[i], hash(col[j]))    // pseudo-random hash replaces rand()
    for each row i in B:
        w[i][r] = FLT_MAX
        for each non-zero j in row i:
            w[i][r] = min(w[i][r], v[col[j]])
for each row i:
    nnz[i] = ceil((R - 1) / sum(w[i][:]))     // estimator
```
‣Optimizations:
- Use a pseudo-random hash instead of a stored random array
- Both inner loops are generalized SpMVs
- Only float precision is needed
Conclusions
‣ Implemented the full classical AMG algorithm in parallel
- Validated against the open-source code Hypre
- Tested on a CFD problem
‣Attempted to solve the bottleneck of the Galerkin product & repeated set union
- Predict the storage needed per row
- Use this to determine launch parameters for the hash-based SpMM in shared memory