EXPLOITING HIERARCHICAL ALGORITHMS ON EVERMORE HIERARCHICAL MACHINES
Kate Clark, HiPar @ SC20
OUTLINE
Hierarchical machines
Science motivator: Lattice QCD
Motifs
Hierarchical algorithm: multigrid
Hierarchical communication
Summary

HIERARCHICAL MACHINES
HPC IS GETTING MORE HIERARCHICAL
What does a node even mean?

Legacy: Cray XT4 (2007), https://www.nersc.gov/assets/NUG-Meetings/NERSCSystemOverview.pdf
Current: NVIDIA DGX-A100 (2020), https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf

[DGX-A100 node diagram: GPU0-GPU7 fully connected through six NVSwitches; two AMD Rome 64-core CPUs; PEX switches fanning out to 200G NICs and NVMe drives]
HPC IS GETTING MORE HIERARCHICAL
2009 Jaguar: Cray XT5 @ ORNL, 1.75 PFLOPS, 18,688 nodes, 2x AMD 6-core, SeaStar2+
2012 Titan: Cray XK7 @ ORNL, 17 PFLOPS, 18,688 nodes, 1x AMD 16-core + 1x K20X GPU (14 SMs), Cray Gemini
2017 Summit: IBM AC922 @ ORNL, 149 PFLOPS, 4,608 nodes, 2x IBM P9 22-core CPU + 6x V100 GPU (80 SMs), MLNX 2x EDR
2020 Selene: NVIDIA DGX A100 @ NVIDIA, 28 PFLOPS, 280 nodes, 2x AMD 64-core CPU + 8x A100 GPU (108 SMs), MLNX 8x HDR
EVOLVING HIERARCHY
[Charts comparing Jaguar, Titan, Summit, and Selene: FP64 throughput, memory bandwidth, inter-node bandwidth, and intra-node bandwidth, each normalized per socket and per node]
Data is intended to be illustrative, sourced from wikipedia.org; 1 node equals a single OS instance; 1 GPU = 1 socket where appropriate
MODERN HPC PROCESSORS
GPUs are extremely hierarchical processors
Many-core processor programmed using a massively threaded model; threads arranged as a Cartesian hierarchy of grids
Deep memory hierarchy: registers <-> L1 <-> L2 <-> device memory <-> host memory
Increasingly coupled instruction hierarchy: tensor cores <-> CUDA cores <-> shared memory atomics <-> L2 atomics
Synchronization possible at many levels: (sub-)warp <-> thread block <-> grid <-> node <-> cluster
[A100 sketch, bandwidths per level: >20 TB/s and >3 TB/s on-chip, 1.6 TB/s device memory, 64 GB/s bi-directional to host memory]
MAPPING TO HIERARCHY
Composable Parallelism
Streaming: no need to synchronize; likely bandwidth bound
Reduce: need to synchronize; likely bandwidth bound
Scatter: no need to synchronize; input locality
Streaming ⊗ Reduce = Multi-reduce: only local synchronization
SCIENCE MOTIVATOR: LATTICE QCD
LATTICE QCD MOTIVATION
Summit cycle breakdown in INCITE allocation and in INCITE use: LQCD ~40%
NERSC utilization (Aug ’17 - Jul ’18): LQCD ~13%
NERSC workload is extremely diverse but not evenly divided:
● … codes make up …% of workload
● … codes make up …% of workload
● … codes make up …% of workload
● Remaining codes (over …) make up …% of workload
[NERSC workload chart; Python among the top codes]
QUANTUM CHROMODYNAMICS
The strong force is one of the basic forces of nature (along with gravity, electromagnetism, and the weak force)
It’s what binds together the quarks and gluons in the proton and the neutron (as well as hundreds of other particles seen in accelerator experiments)
Responsible for the particle zoo seen at sub-nuclear scales (mass, decay rate, etc.)
QCD is the theory of the strong force
It’s a beautiful theory… …but

$\langle \Omega \rangle = \frac{1}{Z} \int [dU]\, e^{-\int d^4x\, \mathcal{L}(U)}\, \Omega(U)$

[Image credits: Thomas Jefferson National Accelerator Facility; Fermi National Accelerator Laboratory]
LATTICE QUANTUM CHROMODYNAMICS
Theory is highly non-linear ⇒ cannot solve directly
Must resort to numerical methods to make predictions
Lattice QCD: discretize spacetime ⇒ 4-dimensional lattice of size Lx × Ly × Lz × Lt
Finite spacetime ⇒ periodic boundary conditions
PDEs ⇒ finite difference equations ⇒ linear solvers, Ax = b
Consumer of 10+% of public supercomputer cycles; traditionally highly optimized on every HPC platform for the past 30 years; jobs often run at the 1000+ GPU scale
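The lattice above can be made concrete with a small indexing sketch: each site carries four periodic coordinates, and solvers rely on an even-odd (red-black) site coloring. This is a minimal, hypothetical layout for illustration, not QUDA's actual data ordering.

```cpp
#include <array>

// Lexicographic index of site (x,y,z,t) on an Lx*Ly*Lz*Lt lattice with
// periodic boundaries, plus its even-odd (red-black) parity, the basis
// for the checkerboard preconditioning used by most LQCD solvers.
struct Lattice {
  int Lx, Ly, Lz, Lt;
  int index(int x, int y, int z, int t) const {
    // wrap coordinates to implement the periodic boundary conditions
    auto wrap = [](int c, int L) { return ((c % L) + L) % L; };
    x = wrap(x, Lx); y = wrap(y, Ly); z = wrap(z, Lz); t = wrap(t, Lt);
    return ((t * Lz + z) * Ly + y) * Lx + x;
  }
  int parity(int x, int y, int z, int t) const { // 0 = even, 1 = odd
    return (x + y + z + t) & 1;
  }
  int volume() const { return Lx * Ly * Lz * Lt; }
};
```

In a GPU port each thread owns one such site index, so the wrap and parity logic above is exactly what ends up inside the kernel.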
QUDA
• “QCD on CUDA” – http://lattice.github.com/quda (C++14, open source, BSD license)
• Effort started at Boston University in 2008, now in wide use as the GPU backend for BQCD, Chroma**, CPS**, MILC**, TIFR, etc.
• Provides solvers for all major fermionic discretizations, with multi-GPU support
• Maximize performance
– Mixed-precision methods (runtime specification of precision for maximum flexibility)
– Exploit physical symmetries to minimize memory traffic
– Autotuning for high performance on all CUDA-capable architectures
– Domain-decomposed (Schwarz) preconditioners for strong scaling
– Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR)
– Multi-RHS solvers
– Multigrid solvers for optimal convergence
– NVSHMEM for improving strong scaling
• A research tool for how to reach the exascale (and beyond)
– Optimally mapping the problem to hierarchical processors and node topologies
**ECP benchmark apps
QUDA NODE PERFORMANCE OVER TIME
Multiplicative speedup through software and hardware
[Chart, 2008-2019: Wilson FP32 GFLOPS (0-3,000) and cumulative speedup (1-1000, log scale); milestones: multi-GPU capable, adaptive multigrid, optimized multigrid, deflated multigrid; ~300x overall]
Speedup determined by measured time to solution for solving the Wilson operator against a random source on a V=24³×64 lattice, β=5.5, Mπ = 416 MeV. One node is defined to be 3 GPUs.
QUDA CONTRIBUTORS
10+ years, lots of contributors:
Ron Babich (NVIDIA), Simone Bacchio (Cyprus), Kip Barros (LANL), Rich Brower (Boston University), Nuno Cardoso (NCSA), Kate Clark (NVIDIA), Michael Cheng (Boston University), Carleton DeTar (Utah University), Justin Foley (Utah -> NIH), Joel Giedt (Rensselaer Polytechnic Institute), Arjun Gambhir (William and Mary), Steve Gottlieb (Indiana University), Kyriakos Hadjiyiannakou (Cyprus), Dean Howarth (LLNL), Bálint Joó (JLab), Hyung-Jin Kim (BNL -> Samsung), Bartek Kostrzewa (Bonn), Claudio Rebbi (Boston University), Eloy Romero (William and Mary), Hauke Sandmeyer (Bielefeld), Guochun Shi (NCSA -> Google), Mario Schröck (INFN), Alexei Strelchenko (FNAL), Jiqun Tu (NVIDIA), Alejandro Vaquero (Utah University), Mathias Wagner (NVIDIA), André Walker-Loud (LBL), Evan Weinberg (NVIDIA), Frank Winter (JLab), Yi-bo Yang (CAS)
STEPS IN AN LQCD CALCULATION
1. Generate an ensemble of gluon field configurations (“gauge generation”)
Produced in sequence, with hundreds needed per ensemble
Strong scaling required, with 100-1000 TFLOPS sustained for several months
50-90% of the runtime is in the linear solver
O(1) solve per linear system
Target 16⁴-24⁴ per GPU
2. “Analyze” the configurations
Can be farmed out, assuming ~10 TFLOPS per job
Task parallelism means that clusters reign supreme here
80-99% of the runtime is in the linear solver
Many solves per system, e.g., O(10⁶)
Target 24⁴-32⁴ per GPU

$D^{\alpha\beta}_{ij}(x,y;U)\,\psi^{\beta}_{j}(y) = \eta^{\alpha}_{i}(x)$, or Ax = b
Simulation cost ~ a⁻⁶ V^{5/4}
LATTICE QCD IN A NUTSHELL
$\langle \Omega \rangle = \frac{1}{Z} \int [dU]\, e^{-\int d^4x\, \mathcal{L}(U)}\, \Omega(U)$
theory ↔ experiment
[Figures: multi-GPU parallelization with face exchange and wraparound; images from Brookhaven National Laboratory, Large Hadron Collider, Davies et al.]
HIERARCHICAL ALGORITHM: MULTIGRID
MULTIGRID
Linear scaling with problem size
Condition number independent
Two distinct flavours:
Algebraic: general sparse matrix
Geometric: stencil kernel
Pose problem as a hierarchy of grids
Fine grid is highly parallel
Coarse problem is essentially serial
The optimal method for solving PDE-based linear systems
[Chart: Dirac operator applications (10² to 10⁵, log scale) vs. quark mass for CG, Eig-CG, and MG-GCR solvers on 16³×64, 24³×64, and 32³×96 lattices]
MULTIGRID
GPU requirements very different from CPU
Each thread is slow, but O(10,000) threads per GPU
Fine grids run very efficiently: a high parallel-throughput problem
Coarse grids are the worst possible scenario:
More cores than degrees of freedom
Increasingly serial and latency bound
Little’s law (bytes in flight = bandwidth × latency)
Amdahl’s law limiter
The optimal method for solving PDE-based linear systems
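Little’s law makes the coarse-grid problem concrete: to keep a memory system busy you need bandwidth × latency bytes in flight. A back-of-envelope helper; the numbers in the comments are illustrative assumptions, not measurements.

```cpp
// Bytes that must be in flight to saturate a memory system, per Little's
// law: concurrency = bandwidth * latency.  Illustrative numbers only.
constexpr double bytes_in_flight(double bandwidth_GBs, double latency_us) {
  return bandwidth_GBs * 1e9 * latency_us * 1e-6; // bytes
}
// e.g. ~1.6 TB/s of device memory bandwidth at ~1 us of latency needs
// ~1.6 MB of independent traffic in flight; a 2x2x2x2 coarse lattice
// simply cannot expose that much, hence the latency-bound regime.
```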
HPGMG
Benchmark developed by LBNL as an alternative TOP500 benchmark, representative of adaptive mesh refinement computations
Port to CUDA by Nikolay Sakharnykh: https://devblogs.nvidia.com/high-performance-geometric-multi-grid-gpu-acceleration
Optimal performance when reverse offloading the coarse grids to the CPU
True even with Kepler (2012), never mind Volta / Ampere
(One of many prior GPU implementations of multigrid)
MULTIGRID INGREDIENTS
▪ Adaptive Multigrid setup – Block orthogonalization of null space vectors – Batched QR decomposition
▪ Smoothing (relaxation on a given grid) – Fine grid stencil and BLAS1 kernels
▪ Prolongation – interpolation from coarse grid to fine grid
▪ Restriction – restriction from fine grid to coarse grid
▪ Coarse Operator construction (setup) – Evaluate R A P on each block – Batched (small) dense matrix multiplication
▪ Coarse grid solver – Need optimal coarse-grid stencil application
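The ingredients above compose into the classic V-cycle. A self-contained 1D Poisson sketch (weighted Jacobi smoother, full-weighting restriction, linear prolongation); this is the textbook geometric scheme for illustration, not QUDA's adaptive setup.

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<double>;

// One weighted-Jacobi sweep for 1D Poisson -u'' = f, discretized as
// (2u[i] - u[i-1] - u[i+1])/h^2 = f[i], with u = 0 on the boundary.
static void smooth(Vec& u, const Vec& f, double h) {
  Vec old = u;
  for (size_t i = 1; i + 1 < u.size(); ++i)
    u[i] = old[i] + (2.0 / 3.0) * 0.5 *
           (old[i - 1] + old[i + 1] + h * h * f[i] - 2 * old[i]);
}

static Vec residual(const Vec& u, const Vec& f, double h) {
  Vec r(u.size(), 0.0);
  for (size_t i = 1; i + 1 < u.size(); ++i)
    r[i] = f[i] - (2 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
  return r;
}

// Recursive V-cycle: smooth, restrict the residual, recurse for the
// coarse-grid correction, prolongate it back, smooth again.
void vcycle(Vec& u, const Vec& f, double h) {
  size_t n = u.size() - 1;               // n intervals, n-1 interior points
  smooth(u, f, h);
  if (n <= 2) return;                    // coarsest level: smoothing only
  Vec r = residual(u, f, h);
  Vec rc(n / 2 + 1, 0.0);                // full-weighting restriction
  for (size_t i = 1; i + 1 < rc.size(); ++i)
    rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1];
  Vec ec(rc.size(), 0.0);
  vcycle(ec, rc, 2 * h);                 // coarse-grid correction
  for (size_t i = 1; i + 1 < rc.size(); ++i) {  // linear prolongation
    u[2 * i]     += ec[i];
    u[2 * i - 1] += 0.5 * (ec[i - 1] + ec[i]);
  }
  u[n - 1] += 0.5 * ec[rc.size() - 2];   // odd point next to the boundary
  smooth(u, f, h);
}
```

The fine levels are bandwidth-bound streaming sweeps; the recursion bottoms out on a grid so small that, on a GPU, it is pure latency, which is exactly the tension the following slides address.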
[Figure: Wilson stencil in the μ-ν plane: spinors at x and x ± μ̂ on the sites, link matrices U connecting them; one-to-one mapping of sites to threads; the operator A assembled from these couplings]
MAPPING THE DIRAC OPERATOR TO CUDA
• Finite difference operator in LQCD is known as “dslash”
• Assign a single space-time point to each thread: V = XYZT threads, e.g., V = 24⁴ ⇒ 3.3×10⁵ threads
• Looping over directions, each thread must
– Load the neighboring spinor (24 numbers x8)
– Load the color matrix connecting the sites (18 numbers x8)
– Do the computation
– Save the result (24 numbers)
• Each thread has (Wilson dslash) 0.92 naive arithmetic intensity
• QUDA reduces memory traffic:
Exact SU(3) matrix compression (18 ⇒ 12 or 8 real numbers)
Use 16-bit fixed-point representation with mixed-precision solver
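The 0.92 figure follows directly from the counts above. A minimal sketch, assuming the standard 1,320 flops per site for Wilson dslash and 4-byte FP32 reals:

```cpp
// Naive FP32 arithmetic intensity of Wilson dslash from per-site traffic:
// 8 neighbor spinors (24 reals each), 8 link matrices (18 reals each,
// or 12 with SU(3) compression), 1 spinor store (24 reals).
constexpr double wilson_ai(bool compress_links) {
  const double flops = 1320.0;                  // standard Wilson count
  const double link_reals = compress_links ? 12.0 : 18.0;
  const double reals = 8 * 24.0 + 8 * link_reals + 24.0;
  return flops / (reals * 4.0);                 // 4 bytes per FP32 real
}
// wilson_ai(false) ~ 1320 / 1440 bytes ~ 0.92 flop/byte, so on any modern
// GPU the kernel is memory-bandwidth bound; compression raises the ratio.
```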
II. LATTICE QCD
The necessity for a lattice discretized formulation of QCD arises due to the failure of perturbative approaches commonly used for calculations in other quantum field theories, such as electrodynamics. Quarks, the fundamental particles that are at the heart of QCD, are described by the Dirac operator acting in the presence of a local SU(3) symmetry. On the lattice, the Dirac operator becomes a large sparse matrix, M, and the calculation of quark physics is essentially reduced to many solutions to systems of linear equations given by

Mx = b. (1)

The form of M on which we focus in this work is the Sheikholeslami-Wohlert [6] (colloquially known as Wilson-clover) form, which is a central difference discretization of the Dirac operator. When acting in a vector space that is the tensor product of a 4-dimensional discretized Euclidean spacetime, spin space, and color space it is given by

$M_{x,x'} = -\frac{1}{2}\sum_{\mu=1}^{4}\left(P^{-\mu}\otimes U^{\mu}_{x}\,\delta_{x+\hat\mu,x'} + P^{+\mu}\otimes U^{\mu\dagger}_{x-\hat\mu}\,\delta_{x-\hat\mu,x'}\right) + (4+m+A_x)\,\delta_{x,x'} \equiv -\frac{1}{2}D_{x,x'} + (4+m+A_x)\,\delta_{x,x'}.$ (2)

Here $\delta_{x,y}$ is the Kronecker delta; $P^{\pm\mu}$ are $4\times4$ matrix projectors in spin space; U is the QCD gauge field, which is a field of special unitary $3\times3$ (i.e., SU(3)) matrices acting in color space that live between the spacetime sites (and hence are referred to as link matrices); $A_x$ is the $12\times12$ clover matrix field acting in both spin and color space,¹ corresponding to a first order discretization correction; and m is the quark mass parameter. The indices x and x' are spacetime indices (the spin and color indices have been suppressed for brevity). This matrix acts on a vector consisting of a complex-valued 12-component color-spinor (or just spinor) for each point in spacetime. We refer to the complete lattice vector as a spinor field.

Since M is a large sparse matrix, an iterative Krylov solver is typically used to obtain solutions to (1), requiring many repeated evaluations of the sparse matrix-vector product. The matrix is non-Hermitian, so either Conjugate Gradients [7] on the normal equations (CGNE or CGNR) is used, or more commonly, the system is solved directly using a non-symmetric method, e.g., BiCGstab [8]. Even-odd (also known as red-black) preconditioning is used to accelerate the solution-finding process, where the nearest neighbor property of the $D_{x,x'}$ matrix (see Fig. 1) is exploited to solve the Schur complement system [9]. This has no effect on the overall efficiency since the fields are reordered such that all components of a given parity are contiguous. The quark mass controls the condition number of the matrix, and hence the convergence of such iterative solvers. Unfortunately, physical quark masses correspond to nearly indefinite matrices. Given that current leading lattice volumes are $32^3\times256$, for $>10^8$ degrees of freedom in total, this represents an extremely computationally demanding task.

¹ Each clover matrix has a Hermitian block-diagonal, anti-Hermitian block off-diagonal structure, and can be fully described by 72 real numbers.

Fig. 1. The nearest neighbor stencil part of the lattice Dirac operator D, as defined in (2), in the μ-ν plane. The color-spinor fields are located on the sites. The SU(3) color matrices $U^{\mu}_{x}$ are associated with the links. The nearest neighbor nature of the stencil suggests a natural even-odd (red-black) coloring for the sites.

III. GRAPHICS PROCESSING UNITS
In the context of general-purpose computing, a GPU is effectively an independent parallel processor with its own locally-attached memory, herein referred to as device memory. The GPU relies on the host, however, to schedule blocks of code (or kernels) for execution, as well as for I/O. Data is exchanged between the GPU and the host via explicit memory copies, which take place over the PCI-Express bus. The low-level details of the data transfers, as well as management of the execution environment, are handled by the GPU device driver and the runtime system.

It follows that a GPU cluster embodies an inherently heterogeneous architecture. Each node consists of one or more processors (the CPU) that is optimized for serial or moderately parallel code and attached to a relatively large amount of memory capable of tens of GB/s of sustained bandwidth. At the same time, each node incorporates one or more processors (the GPU) optimized for highly parallel code attached to a relatively small amount of very fast memory, capable of 150 GB/s or more of sustained bandwidth. The challenge we face is that these two powerful subsystems are connected by a narrow communications channel, the PCI-E bus, which sustains at most 6 GB/s and often less. As a consequence, it is critical to avoid unnecessary transfers between the GPU and the host.
[Figure: the nearest-neighbor stencil $D_{x,x'}$ in the μ-ν plane, with coordinates X[0], X[1]]
SCALING OF FINE-GRID STENCIL
Wilson dslash on V100, strong scaling
[Chart: GFlop/s (0-3,000) vs. lattice length (32 down to 8), in half, single, and double precision]
COARSE GRID STENCIL
▪ Coarse operator looks like a Dirac operator (many dof per site)
– Link matrices have dimension 2Nv x 2Nv (e.g., 48 x 48)
▪ Fine vs. coarse grid parallelization
– Fine grid operator has plenty of grid-level parallelism
– E.g., 16x16x16x16 = 65,536 lattice sites
– Coarse grid operator has diminishing grid-level parallelism
– First coarse grid: 4x4x4x4 = 256 lattice sites
– Second coarse grid: 2x2x2x2 = 16 lattice sites
▪ Current GPUs have up to 10,752 CUDA cores
▪ Need to consider finer-grained parallelization
– Increase parallelism to use all GPU resources
– Load balancing
dofs (geometry). We start by defining the fields

$W^{\pm\mu}_{ksc,ls'c'} = V^{\dagger}_{ksc,ksc}\, P^{\pm\mu}_{s,s'}\, U^{\mu}_{(k+\hat\mu)c,lc'}\,\delta_{k+\hat\mu,l}\, V_{ls'c',ls'c'}$

note that here we are defining different links for forwards and backwards; they are not simply the conjugate of each other (because of the different spin projection between the two). Also note that these effective link matrices also have a spin index; this is because the vectors used to define the V rotation matrices now have spin dependence. In this form we can now write down the coarse Dirac operator as

$\hat D_{isc,js'c'} = -\left[\delta_{i,k/B}\,\delta_{s,s/B_s}\right] \sum_{\mu} \left( W^{-\mu}_{ksc,ls'c'}\,\delta_{k+\hat\mu,l} + W^{+\mu\dagger}_{ksc,ls'c'}\,\delta_{k-\hat\mu,l} \right) \left[\delta_{l/B,j}\,\delta_{s'/B_s,s'}\right] + M\,\delta_{isc,js'c'}.$

We now finish up by blocking the geometry and spin onto the coarse lattice, defining the effective link matrices $Y^{\pm\mu}$ that connect sites on the coarse lattice:

$Y^{\pm\mu}_{isc,js'c'} = \left[\delta_{i,k/B}\,\delta_{s,s/B_s}\right] W^{\pm\mu}_{ksc,ls'c'} \left[\delta_{l/B,j}\,\delta_{s'/B_s,s'}\right] \delta_{i\pm\hat\mu,j}$ (2)

$X_{isc,js'c'} = \left[\delta_{i,k/B}\,\delta_{s,s/B_s}\right] \sum_{\mu} \left( W^{-\mu}_{isc,ks'c'} + W^{+\mu\dagger}_{isc,ks'c'} \right) \left[\delta_{l/B,j}\,\delta_{s'/B_s,s'}\right] \delta_{i,j},$

where we note now that the matrix X is not Hermitian. Thus the coarse operator is written

$D_{isc,js'c'} = -\sum_{\mu} \left( Y^{-\mu}_{isc,js'c'}\,\delta_{i+\hat\mu,j} + Y^{+\mu\dagger}_{isc,js'c'}\,\delta_{i-\hat\mu,j} \right) + \left(M - X_{isc,js'c'}\right)\delta_{isc,js'c'}.$ (3)

For the explicit form of these matrices we refer the reader to Appendix A. After the first blocking, subsequent blockings require that $B_s = 1$, i.e., we cannot block the spin dimension again since we cannot remove the chirality. Apart from this observation, the next coarse operator will have a similar form to the current one: it will be a nearest-neighbour non-Hermitian operator connecting sites with $d_s = 2$ spin dimension (in 2d and 4d anyway).

We note here in passing that because the definition of the matrix field V includes explicit spin dependence, this destroys the tensor product structure of the spin and colour on the coarse operator, i.e., we have to define an effective link matrix that rotates in spin and colour space. If this were not the case, i.e., if V were to be spin independent, then this structure would be preserved.
[Figure: the coarse-grid stencil in the μ-ν plane, with coordinates X[0], X[1]]
SOURCE OF PARALLELISM
1. Grid parallelism: volume of threads
2. Link matrix-vector partitioning: 2 Nvec-way thread parallelism (spin * color)
3. Stencil direction: 8-way thread parallelism
[Figure: the stencil in the μ-ν plane with one warp assigned per stencil direction (warp 0, warp 1, warp 2, warp 3, ...)]
4. Dot-product partitioning: 4-way thread parallelism + ILP

$\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix} \Rightarrow \begin{pmatrix} a_{00} & a_{01} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} + \begin{pmatrix} a_{02} & a_{03} \end{pmatrix} \begin{pmatrix} b_2 \\ b_3 \end{pmatrix}$

$\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix} \mathrel{+}= \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \\ a_{30} & a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}$ (rows distributed over the thread y index)
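Source 4 can be sketched on the host: one dot product is split into partial products owned by different "threads" and then combined. On the GPU the combine would be a warp-level reduction; this is an illustrative host-side sketch only.

```cpp
#include <array>

// Source 4 above: split one length-4 dot product into two partial dot
// products ("threads"), then combine, trading one long dependency chain
// for extra thread parallelism and instruction-level parallelism.
double dot4_partitioned(const std::array<double, 4>& a,
                        const std::array<double, 4>& b) {
  // two "threads", each owning half of the dot product
  double p0 = a[0] * b[0] + a[1] * b[1];
  double p1 = a[2] * b[2] + a[3] * b[3];
  return p0 + p1; // final combine (in-kernel: a warp shuffle reduction)
}
```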
COARSE GRID OPERATOR PERFORMANCE
Tesla K20X (Titan), FP32, Nvec = 24
[Chart: GFLOPS (0-160) vs. lattice length (10 down to 2) for successive optimizations: baseline, color-spin, dimension + direction, dot product; parallelism ranges from 24,576-way down to 16-way]
c.f. fine grid operator ~300-400 GFLOPS
PARALLELISM ISN’T INFINITE
[Chart: coarse stencil performance, GFLOPS (0-1,600) vs. lattice length (2-10) on Kepler, Maxwell, Pascal, Volta, and Ampere; small lattices are latency limited, large lattices bandwidth limited; working set 36 MiB]

PROLONGATION
Interpolation from coarse grid to fine grid
One-to-many mapping: a multi-scatter, no synchronization required
Parallelize over fine grid points
Implement as trivial streaming kernel; coarse grid points cached in L1 and L2
RESTRICTOR
Restriction of fine grid to coarse grid
Many-to-one mapping ⇒ race conditions
Use atomics? Lack of determinism
Instead pose as a multi-reduction:
Assign each MG aggregate to a thread block
Many-to-one mapping now local to the SM
Synchronization only within the thread block
Minimizes total memory traffic
BLOCK ORTHOGONALIZATION
Block Gram-Schmidt orthogonalization over the set of n vectors
n² many-to-one and one-to-many mappings
Again assign each MG aggregate to a thread block
Multi-(reduce-scatters) local to a thread block
Whole computation in a single kernel
Minimizes total memory traffic: O(n) memory traffic, O(n²) compute
Forms the basis for the coarse space
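A host-side sketch of the block Gram-Schmidt step (classical projection per aggregate; QUDA's batched, single-kernel version differs in layout and precision handling):

```cpp
#include <cmath>
#include <vector>

// Block Gram-Schmidt sketch over one aggregate: orthonormalize n short
// vectors in place.  Each projection is a many-to-one reduce (the dot
// product) followed by a one-to-many scatter (the axpy), n^2 in total.
void block_gram_schmidt(std::vector<std::vector<double>>& v) {
  for (size_t i = 0; i < v.size(); ++i) {
    for (size_t j = 0; j < i; ++j) {               // project out v[j]
      double dot = 0;
      for (size_t k = 0; k < v[i].size(); ++k) dot += v[j][k] * v[i][k];
      for (size_t k = 0; k < v[i].size(); ++k) v[i][k] -= dot * v[j][k];
    }
    double norm = 0;                               // then normalize
    for (double x : v[i]) norm += x * x;
    norm = std::sqrt(norm);
    for (double& x : v[i]) x /= norm;
  }
}
```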
COARSE OPERATOR CONSTRUCTION
Need to evaluate RAP, a pseudo-batched triple matrix product
Need to employ fine-grained parallelization
Perform each batched matrix product in parallel
Each coarse matrix element has many contributions (many-to-one)
Cannot pose as a tree reduction, so we use atomics
Floating point atomics are non-deterministic
Use 32-bit integer atomics for associativity and determinism
Minimize contention by using the atomic hierarchy:
thread -> shared memory atomic -> L2 atomic
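Why integer atomics restore determinism: integer addition is associative, so any arrival order of the atomics yields the same bits, while floating-point addition is not. A sketch with an arbitrary, purely illustrative fixed-point scale:

```cpp
#include <cstdint>
#include <vector>

// Deterministic accumulation sketch: convert each contribution to fixed
// point, sum with integer adds (associative, so order-independent), and
// convert back.  The scale below is an illustrative choice, not QUDA's.
constexpr double kScale = 1 << 20;          // assumed fixed-point scale
int32_t to_fixed(double x) { return static_cast<int32_t>(x * kScale); }

double sum_fixed(std::vector<double> xs) {  // plays the atomic counter
  int64_t acc = 0;
  for (double x : xs) acc += to_fixed(x);   // any order gives same bits
  return acc / kScale;
}
```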
CHROMA HMC ON SUMMIT
From Titan running 2016 code to Summit running 2019 code we see >82x speedup in HMC throughput
Multiplicative speedup coming from mapping hierarchical algorithm to hierarchical machine
Highly optimized multigrid for gauge field evolution
Mixed precision an important piece of the puzzle
• double – outer defect correction
• single – GCR solver
• half – preconditioner
• int32 – deterministic parallel coarsening
KC, Bálint Joó, Mathias Wagner, Evan Weinberg, Frank Winter, Boram Yoon
[Bar chart, Chroma ECP benchmark wallclock seconds: 4006, 1878, 974, 439, 392 across Titan (1024x / 512x K20X) and Summit (128x V100, March / Nov 2019) configurations; 4.1x faster on 2x fewer GPUs, ~8x gain; 10.2x faster on 8x fewer GPUs, ~82x gain]
HIERARCHICAL COMMUNICATION
ONE COMMUNICATIONS API TO RULE THEM ALL?
For example, DGX-1 nodes: 8x V100 GPUs connected through NVLink, 4x EDR for inter-node communication, optimal placement of GPUs and NICs
Intra-node: DMA or load/store over NVLink @ 50 GB/s
Inter-node: message passing over IB @ 6 GB/s
Can we use a single exchange API across this hierarchy?
[Figure, from NVIDIA DGX-1 With Tesla V100 System Architecture, WP-08437-002_v01, p. 9: DGX-1 uses an 8-GPU hybrid cube-mesh interconnection network topology. The corners of the mesh-connected faces of the cube are connected to the PCIe tree network, which also connects to the CPUs and NICs.]
MULTI GPU BUILDING BLOCKS
Halo packing kernel -> interior kernel -> halo communication -> halo update kernel
[Figure: multi-GPU parallelization with face exchange and wraparound]
SOME TERMINOLOGY
P2P: direct exchange between GPUs
CUDA-aware MPI: MPI can take a GPU pointer; can do batched, overlapping transfers (high bandwidth)
GPUDirect RDMA (implies CUDA-aware MPI): RDMA transfer from/to GPU memory, here NIC <-> GPU
[Figure: DGX-1 hybrid cube-mesh topology as before, annotated with the P2P, CUDA-aware, and GPUDirect RDMA paths; MPI_Sendrecv timeline]
QUDA MULTI-GPU SCALING
Key technologies employed Peer-to-peer communication - skip MPI and directly utilize DMA copies GPU Direct RDMA (GDR) - GPU <-> NIC <-> GPU without traversing CPU memory Node topology awareness
Autotuning for Dslash communication policy Kernel launch parameters (dependent on local problem size)
Starting point (already multiple years of optimization)
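The autotuning loop can be sketched generically: time each candidate policy once, then cache the winner for all subsequent calls with the same problem size. This is a hypothetical structure for illustration, not QUDA's tuning API.

```cpp
#include <functional>
#include <string>
#include <vector>

// Autotuning sketch: benchmark every candidate (communication policy,
// launch geometry, ...) and keep the fastest one.
struct Candidate {
  std::string name;
  std::function<double()> run;  // returns measured time in seconds
};

std::string pick_best(const std::vector<Candidate>& candidates) {
  std::string best;
  double best_time = 1e300;
  for (const auto& c : candidates) {
    double t = c.run();         // in a real tuner: a timed kernel launch
    if (t < best_time) { best_time = t; best = c.name; }
  }
  return best;                  // cache this per problem size in practice
}
```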
MULTI-GPU PROFILE
Overlapping comms and compute
32⁴ local volume, single precision
[Profile: P2P copies, interior kernel, packing kernel, halo kernels (t, z, y); DGX-1, 1x2x2x2 partitioning]
REDUCING LOCAL PROBLEM SIZE
Single GPU performance, strong scaling
[Chart: GFlop/s (0-3,000) vs. lattice length (56 down to 8), in half, single, and double precision; smallest reasonable problem size marked]
STRONG SCALING PROFILE
Overlapping comms and compute
16⁴ local volume, half precision
[Profile: P2P copies, interior kernel, packing kernel, halo kernels (fused); DGX-1, 1x2x2x2 partitioning]
STRONG SCALING PROFILE
Latencies ate my scaling
16⁴ local volume, half precision
[Profile: P2P copies, interior kernel, packing kernel, halo kernels (fused), API calls]
WHAT IS THE PROBLEM?
It’s not just data movement we need to minimize
Task marshaling from a lower level of the hierarchy (e.g., the host) adds latency
Data consistency requires synchronization between CPU and GPU
Ideally: offload task marshaling to a GPU thread so it has the same locality as the data
[Figure: DGX-1 hybrid cube-mesh topology as before]
NVSHMEM
Implementation of OpenSHMEM¹, a Partitioned Global Address Space (PGAS) library
NVSHMEM features:
Symmetric memory allocations in device memory
Communication API calls on CPU (standard and stream-ordered)
Kernel-side communication (API and LD/ST) between GPUs
NVLink and PCIe support (intra-node); InfiniBand support (inter-node)
Interoperability with MPI and OpenSHMEM libraries
Available in the NVIDIA HPC toolkit
¹ SHMEM from Cray’s “shared memory” library, https://en.wikipedia.org/wiki/SHMEM
DSLASH NVSHMEM IMPLEMENTATION
First exploration
Keep the general structure of packing, interior, and exterior dslash
Use nvshmem_ptr for intra-node remote writes (fine-grained); the packing buffer is located on the remote device
Use nvshmem_putmem_nbi to write to the remote GPU over IB (1 RDMA transfer)
Need to make sure writes are visible: nvshmem_barrier_all_on_stream, or a barrier kernel that only waits for writes from neighbors
NVSHMEM DSLASH
First exploration
16⁴ local volume, half precision
[Profile: packing kernel, interior kernel, fused halo, barrier kernel; DGX-1, 1x2x2x2 partitioning]
FUSED DSLASH + PACKING KERNEL
Interior + pack kernel: pack_blocks do the packing, the remaining interior_blocks = grid_dim - pack_blocks do the interior; nvshmem_signal for each direction once its data is packed
Halo kernel: nvshmem_wait_until for the packed data, then apply the exterior (halo) update
NVSHMEM + FUSING KERNELS
No extra packing and barrier kernels needed
16⁴ local volume, half precision
[Profile: interior + pack + flag kernel, barrier + fused halo; DGX-1, 1x2x2x2 partitioning]
KERNEL FUSION REDUCES LATENCY
[Chart: GFlop/s per GPU (0-2,000) for IPC copies, IPC remote write, NVSHMEM w/ barrier, and NVSHMEM kernel fusion; DGX-1, 16⁴ local volume, Wilson, half precision, 1x2x2x2 partitioning]
CAN WE GO FURTHER?
One kernel to rule them all! Communication is handled in the kernel and latencies are hidden.
DSLASH ÜBER KERNEL
Packing: pack_blocks; nvshmem_signal for each direction
Interior: interior_blocks = grid_dim - pack_blocks - exterior_blocks; atomic flag set by the last block
Exterior (halo): exterior_blocks; atomic wait for the interior, nvshmem_wait_until for the remote halos
ÜBER KERNEL
16⁴ local volume, half precision
[Profile: a single über kernel (packing + interior + exterior); DGX-1, 1x2x2x2 partitioning]
LATENCY REDUCTIONS
[Chart: GFlop/s per GPU (0-2,000) for IPC copies, IPC remote write, NVSHMEM w/ barrier, NVSHMEM kernel fusion, and NVSHMEM über kernel; DGX-1, 16⁴ local volume, Wilson, half precision, 1x2x2x2 partitioning]
DGX-2: FULL NON-BLOCKING BANDWIDTH
2.4 TB/s bisection bandwidth
[Figure: 16 V100 GPUs (GPU0-GPU15) fully connected through twelve NVSwitches]
DGX-2 STRONG SCALING
Global volume 32⁴, Wilson dslash
[Chart: GFlop/s (0-25,000) vs. #GPUs (1-16), MPI vs. SHMEM in double, single, and half precision]
@16 GPUs: 16⁴ local volume
MULTI-NODE STRONG SCALING
DGX SuperPOD (DGX-2 nodes: 16x V100 (32 GB), 8x EDR IB)
Wilson dslash, 64³×128 global volume, half precision
[Chart: GFlop/s (0-400,000) vs. #GPUs (16-1024), MPI vs. NVSHMEM; NVSHMEM 1.3-1.4x faster at scale; the 16⁴ local volume (half precision) point marks the sweet spot for simulations]
SUMMARY
LQCD OUTLOOK
Reformulate vector-vector and matrix-vector problems as matrix-matrix: multi-RHS, communication-avoiding, and block solvers; deploy using tensor cores where possible
Keep data on chip and minimize hierarchy level jumping: use NVSHMEM to fuse across communication boundaries, increasingly possible with the latest generation of GPUs
Follow trends towards future architectures and aim for super-linear scaling
QUDA NODE PERFORMANCE OVER TIME
Multiplicative speedup through software and hardware
[Same chart as before: Wilson FP32 GFLOPS and cumulative ~300x speedup, 2008-2019, now extended with 2022 and 2024 marked “?”]
SUMMARY
Algorithms achieve maximal speedup when we
Expose their parallelism to map onto ever-more parallel processors
Identify their shape to map onto ever-more hierarchical processors
Communications reach optimal performance when we
Move task marshaling close to data, enabling a single communications API
Utilize fine-grained communications which maximize latency hiding

BACKUP

PARALLELISM AND LOCALITY
MULTIPLE RIGHT-HAND SIDES
48³×12, HISQ, single precision, one code
[Chart: GFlop/s (0-3,000) vs. number of right-hand sides (1-16) on Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), and Volta (2017); annotated gains of 2.5x and 3.5x from batching]
MULTIGRID IS FULL OF MATRIX MULTIPLICATION
Explicit
• Coarse-operator construction
• Block-orthogonalization
Implicit (through reformulation)
• Multi-RHS operators
• Coarse-grid LU solver
Reworking the pipeline to expose algorithms in matrix-matrix form
Maximize locality, parallelism for optimal mapping onto the GPU hierarchy
LATTICE QCD WITH TENSOR CORES
Modern GPUs have high-throughput tensor-core functionality
Kernel fusion: the 4d volume (~10⁸) is much larger than the size of the 5th dimension (~10¹); fusing the 4d-spatial operations with the 5th-dimension operations that immediately follow reduces global memory traffic:

$\bigl[\,1 - 2b\,\underbrace{M^{\dagger} D_w^{\dagger}}_{\text{fuse}}\, M_5^{-\dagger}\,\underbrace{M^{\dagger} D_w^{\dagger}}_{\text{fuse}}\, M_5^{-\dagger}\,\bigr]\bigl[\,1 - 2b\,\underbrace{M_5^{-1} D_w}_{\text{fuse}}\, M\,\underbrace{M_5^{-1} D_w}_{\text{fuse}}\, M\,\bigr]$

Tesla V100: FP64 7.5 TFLOPS, FP32 15 TFLOPS, Tensor 125 TFLOPS
KC, Jung, Mawhinney, Tu
TENSOR CORES FOR MULTIGRID
Multigrid is a preconditioner: 16-bit precision is perfectly adequate, with no impact on convergence rate
Majority of MG setup kernels now implemented using tensor cores
1.5x-10x kernel speedups observed
Ongoing WIP: expect integer speedup factors on Summit once fully reformulated
[Chart: Yhat kernel TFLOPS (0-30) on Quadro GV100 (32 null-space vectors) for 4x4x4x8 and 2x2x2x2 blocks, in single precision, half precision, and with tensor cores]
FINE-GRAINED SYNCHRONIZATION
Need replacement for kernel boundaries
Packing is independent of interior
Interior and exterior both update the boundary → possible race condition
Use cuda::atomic from libcu++; libcu++ gives us std::atomic in CUDA:

#include <atomic>
std::atomic<int> x;

#include <cuda/std/atomic>
cuda::std::atomic<int> x;

#include <cuda/atomic>
cuda::atomic<int, cuda::thread_scope_block> x;