EXPLOITING HIERARCHICAL ALGORITHMS ON EVERMORE HIERARCHICAL MACHINES
Kate Clark, HiPar @ SC20
OUTLINE
Hierarchical machines
Science motivator: Lattice QCD
Motifs
Hierarchical algorithm: multigrid
Hierarchical communication
Summary

HIERARCHICAL MACHINES
HPC IS GETTING MORE HIERARCHICAL
What does a node even mean?

Legacy: Cray XT4 (2007), https://www.nersc.gov/assets/NUG-Meetings/NERSCSystemOverview.pdf
Current: NVIDIA DGX-A100 (2020), https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf

[DGX-A100 node diagram: GPU0-GPU7 fully connected through six NVSwitches; two AMD Rome 64-core CPUs; PEX switches fanning out to 200G NICs and NVMe drives]
HPC IS GETTING MORE HIERARCHICAL
2009 Jaguar: Cray XT5 @ ORNL, 1.75 PFLOPS, 18,688 nodes, 2x AMD 6-core, SeaStar2+
2012 Titan: Cray XK7 @ ORNL, 17 PFLOPS, 18,688 nodes, 1x AMD 16-core + 1x K20X GPU (14 SMs), Cray Gemini
2017 Summit: IBM AC922 @ ORNL, 149 PFLOPS, 4,608 nodes, 2x IBM P9 22-core CPU + 6x V100 GPU (80 SMs), MLNX 2x EDR
2020 Selene: NVIDIA DGX A100 @ NVIDIA, 28 PFLOPS, 280 nodes, 2x AMD 64-core CPU + 8x A100 GPU (108 SMs), MLNX 8x HDR
EVOLVING HIERARCHY
[Charts comparing Jaguar, Titan, Summit, and Selene: FP64 throughput, memory bandwidth, inter-node bandwidth, and intra-node bandwidth, each normalized per socket and per node]
Data is intended to be illustrative, sourced from wikipedia.org; 1 node equals a single OS instance; 1 GPU = 1 socket where appropriate
MODERN HPC PROCESSORS
GPUs are extremely hierarchical processors
Many-core processor programmed using a massively threaded model; threads arranged as a Cartesian hierarchy of grids
Deep memory hierarchy: registers <-> L1 <-> L2 <-> device memory <-> host memory
Increasingly coupled instruction hierarchy: tensor cores <-> CUDA cores <-> shared memory atomics <-> L2 atomics
Synchronization possible at many levels: (sub-)warp <-> thread block <-> grid <-> node <-> cluster
[A100 sketch, bandwidths per level: >20 TB/s and >3 TB/s on-chip, 1.6 TB/s device memory, 64 GB/s bi-directional to host memory]
MAPPING TO HIERARCHY
Composable Parallelism
Streaming: no need to synchronize; likely bandwidth bound
Reduce: need to synchronize; likely bandwidth bound
Scatter: no need to synchronize; input locality
Streaming ⊗ Reduce = Multi-reduce: only local synchronization
SCIENCE MOTIVATOR: LATTICE QCD
LATTICE QCD MOTIVATION
Summit cycle breakdown in INCITE allocation and in INCITE use: LQCD ~40%
NERSC utilization (Aug ’17 - Jul ’18): LQCD ~13%
NERSC workload is extremely diverse but not evenly divided:
● … codes make up …% of workload
● … codes make up …% of workload
● … codes make up …% of workload
● Remaining codes (over …) make up …% of workload
[NERSC workload chart; Python among the top codes]
QUANTUM CHROMODYNAMICS
The strong force is one of the basic forces of nature (along with gravity, electromagnetism, and the weak force)
It’s what binds together the quarks and gluons in the proton and the neutron (as well as hundreds of other particles seen in accelerator experiments)
Responsible for the particle zoo seen at sub-nuclear scales (mass, decay rate, etc.)
QCD is the theory of the strong force
It’s a beautiful theory… …but

$\langle \Omega \rangle = \frac{1}{Z} \int [dU]\, e^{-\int d^4x\, \mathcal{L}(U)}\, \Omega(U)$

[Image credits: Thomas Jefferson National Accelerator Facility; Fermi National Accelerator Laboratory]
LATTICE QUANTUM CHROMODYNAMICS
Theory is highly non-linear ⇒ cannot solve directly
Must resort to numerical methods to make predictions
Lattice QCD: discretize spacetime ⇒ 4-dimensional lattice of size Lx × Ly × Lz × Lt
Finite spacetime ⇒ periodic boundary conditions
PDEs ⇒ finite difference equations ⇒ linear solvers, Ax = b
Consumer of 10+% of public supercomputer cycles; traditionally highly optimized on every HPC platform for the past 30 years; jobs often run at the 1000+ GPU scale
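The lattice above can be made concrete with a small indexing sketch: each site carries four periodic coordinates, and solvers rely on an even-odd (red-black) site coloring. This is a minimal, hypothetical layout for illustration, not QUDA's actual data ordering.

```cpp
#include <array>

// Lexicographic index of site (x,y,z,t) on an Lx*Ly*Lz*Lt lattice with
// periodic boundaries, plus its even-odd (red-black) parity, the basis
// for the checkerboard preconditioning used by most LQCD solvers.
struct Lattice {
  int Lx, Ly, Lz, Lt;
  int index(int x, int y, int z, int t) const {
    // wrap coordinates to implement the periodic boundary conditions
    auto wrap = [](int c, int L) { return ((c % L) + L) % L; };
    x = wrap(x, Lx); y = wrap(y, Ly); z = wrap(z, Lz); t = wrap(t, Lt);
    return ((t * Lz + z) * Ly + y) * Lx + x;
  }
  int parity(int x, int y, int z, int t) const { // 0 = even, 1 = odd
    return (x + y + z + t) & 1;
  }
  int volume() const { return Lx * Ly * Lz * Lt; }
};
```

In a GPU port each thread owns one such site index, so the wrap and parity logic above is exactly what ends up inside the kernel.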
QUDA
• “QCD on CUDA” – http://lattice.github.com/quda (C++14, open source, BSD license)
• Effort started at Boston University in 2008, now in wide use as the GPU backend for BQCD, Chroma**, CPS**, MILC**, TIFR, etc.
• Provides solvers for all major fermionic discretizations, with multi-GPU support
• Maximize performance
– Mixed-precision methods (runtime specification of precision for maximum flexibility)
– Exploit physical symmetries to minimize memory traffic
– Autotuning for high performance on all CUDA-capable architectures
– Domain-decomposed (Schwarz) preconditioners for strong scaling
– Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR)
– Multi-RHS solvers
– Multigrid solvers for optimal convergence
– NVSHMEM for improving strong scaling
• A research tool for how to reach the exascale (and beyond)
– Optimally mapping the problem to hierarchical processors and node topologies
**ECP benchmark apps
QUDA NODE PERFORMANCE OVER TIME
Multiplicative speedup through software and hardware
[Chart, 2008-2019: Wilson FP32 GFLOPS (0-3,000) and cumulative speedup (1-1000, log scale); milestones: multi-GPU capable, adaptive multigrid, optimized multigrid, deflated multigrid; ~300x overall]
Speedup determined by measured time to solution for solving the Wilson operator against a random source on a V=24³×64 lattice, β=5.5, Mπ = 416 MeV. One node is defined to be 3 GPUs.
QUDA CONTRIBUTORS
10+ years, lots of contributors:
Ron Babich (NVIDIA), Simone Bacchio (Cyprus), Kip Barros (LANL), Rich Brower (Boston University), Nuno Cardoso (NCSA), Kate Clark (NVIDIA), Michael Cheng (Boston University), Carleton DeTar (Utah University), Justin Foley (Utah -> NIH), Joel Giedt (Rensselaer Polytechnic Institute), Arjun Gambhir (William and Mary), Steve Gottlieb (Indiana University), Kyriakos Hadjiyiannakou (Cyprus), Dean Howarth (LLNL), Bálint Joó (JLab), Hyung-Jin Kim (BNL -> Samsung), Bartek Kostrzewa (Bonn), Claudio Rebbi (Boston University), Eloy Romero (William and Mary), Hauke Sandmeyer (Bielefeld), Guochun Shi (NCSA -> Google), Mario Schröck (INFN), Alexei Strelchenko (FNAL), Jiqun Tu (NVIDIA), Alejandro Vaquero (Utah University), Mathias Wagner (NVIDIA), André Walker-Loud (LBL), Evan Weinberg (NVIDIA), Frank Winter (JLab), Yi-bo Yang (CAS)
STEPS IN AN LQCD CALCULATION
1. Generate an ensemble of gluon field configurations (“gauge generation”)
Produced in sequence, with hundreds needed per ensemble
Strong scaling required, with 100-1000 TFLOPS sustained for several months
50-90% of the runtime is in the linear solver
O(1) solve per linear system
Target 16⁴-24⁴ per GPU
2. “Analyze” the configurations
Can be farmed out, assuming ~10 TFLOPS per job
Task parallelism means that clusters reign supreme here
80-99% of the runtime is in the linear solver
Many solves per system, e.g., O(10⁶)
Target 24⁴-32⁴ per GPU

$D^{\alpha\beta}_{ij}(x,y;U)\,\psi^{\beta}_{j}(y) = \eta^{\alpha}_{i}(x)$, or Ax = b
Simulation cost ~ a⁻⁶ V^{5/4}
LATTICE QCD IN A NUTSHELL
$\langle \Omega \rangle = \frac{1}{Z} \int [dU]\, e^{-\int d^4x\, \mathcal{L}(U)}\, \Omega(U)$
theory ↔ experiment
[Figures: multi-GPU parallelization with face exchange and wraparound; images from Brookhaven National Laboratory, Large Hadron Collider, Davies et al.]
HIERARCHICAL ALGORITHM: MULTIGRID
MULTIGRID
Linear scaling with problem size
Condition number independent
Two distinct flavours:
Algebraic: general sparse matrix
Geometric: stencil kernel
Pose problem as a hierarchy of grids
Fine grid is highly parallel
Coarse problem is essentially serial
The optimal method for solving PDE-based linear systems
[Chart: Dirac operator applications (10² to 10⁵, log scale) vs. quark mass for CG, Eig-CG, and MG-GCR solvers on 16³×64, 24³×64, and 32³×96 lattices]
MULTIGRID
GPU requirements very different from CPU
Each thread is slow, but O(10,000) threads per GPU
Fine grids run very efficiently: a high parallel-throughput problem
Coarse grids are the worst possible scenario:
More cores than degrees of freedom
Increasingly serial and latency bound
Little’s law (bytes in flight = bandwidth × latency)
Amdahl’s law limiter
The optimal method for solving PDE-based linear systems
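Little’s law makes the coarse-grid problem concrete: to keep a memory system busy you need bandwidth × latency bytes in flight. A back-of-envelope helper; the numbers in the comments are illustrative assumptions, not measurements.

```cpp
// Bytes that must be in flight to saturate a memory system, per Little's
// law: concurrency = bandwidth * latency.  Illustrative numbers only.
constexpr double bytes_in_flight(double bandwidth_GBs, double latency_us) {
  return bandwidth_GBs * 1e9 * latency_us * 1e-6; // bytes
}
// e.g. ~1.6 TB/s of device memory bandwidth at ~1 us of latency needs
// ~1.6 MB of independent traffic in flight; a 2x2x2x2 coarse lattice
// simply cannot expose that much, hence the latency-bound regime.
```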
HPGMG
Benchmark developed by LBNL as an alternative TOP500 benchmark, representative of adaptive mesh refinement computations
Port to CUDA by Nikolay Sakharnykh: https://devblogs.nvidia.com/high-performance-geometric-multi-grid-gpu-acceleration
Optimal performance when reverse offloading the coarse grids to the CPU
True even with Kepler (2012), never mind Volta / Ampere
(One of many prior GPU implementations of multigrid)
MULTIGRID INGREDIENTS
▪ Adaptive Multigrid setup – Block orthogonalization of null space vectors – Batched QR decomposition
▪ Smoothing (relaxation on a given grid) – Fine grid stencil and BLAS1 kernels
▪ Prolongation – interpolation from coarse grid to fine grid
▪ Restriction – restriction from fine grid to coarse grid
▪ Coarse Operator construction (setup) – Evaluate R A P on each block – Batched (small) dense matrix multiplication
▪ Coarse grid solver – Need optimal coarse-grid stencil application
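The ingredients above compose into the classic V-cycle. A self-contained 1D Poisson sketch (weighted Jacobi smoother, full-weighting restriction, linear prolongation); this is the textbook geometric scheme for illustration, not QUDA's adaptive setup.

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<double>;

// One weighted-Jacobi sweep for 1D Poisson -u'' = f, discretized as
// (2u[i] - u[i-1] - u[i+1])/h^2 = f[i], with u = 0 on the boundary.
static void smooth(Vec& u, const Vec& f, double h) {
  Vec old = u;
  for (size_t i = 1; i + 1 < u.size(); ++i)
    u[i] = old[i] + (2.0 / 3.0) * 0.5 *
           (old[i - 1] + old[i + 1] + h * h * f[i] - 2 * old[i]);
}

static Vec residual(const Vec& u, const Vec& f, double h) {
  Vec r(u.size(), 0.0);
  for (size_t i = 1; i + 1 < u.size(); ++i)
    r[i] = f[i] - (2 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
  return r;
}

// Recursive V-cycle: smooth, restrict the residual, recurse for the
// coarse-grid correction, prolongate it back, smooth again.
void vcycle(Vec& u, const Vec& f, double h) {
  size_t n = u.size() - 1;               // n intervals, n-1 interior points
  smooth(u, f, h);
  if (n <= 2) return;                    // coarsest level: smoothing only
  Vec r = residual(u, f, h);
  Vec rc(n / 2 + 1, 0.0);                // full-weighting restriction
  for (size_t i = 1; i + 1 < rc.size(); ++i)
    rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1];
  Vec ec(rc.size(), 0.0);
  vcycle(ec, rc, 2 * h);                 // coarse-grid correction
  for (size_t i = 1; i + 1 < rc.size(); ++i) {  // linear prolongation
    u[2 * i]     += ec[i];
    u[2 * i - 1] += 0.5 * (ec[i - 1] + ec[i]);
  }
  u[n - 1] += 0.5 * ec[rc.size() - 2];   // odd point next to the boundary
  smooth(u, f, h);
}
```

The fine levels are bandwidth-bound streaming sweeps; the recursion bottoms out on a grid so small that, on a GPU, it is pure latency, which is exactly the tension the following slides address.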
[Figure: Wilson stencil in the μ-ν plane: spinors at x and x ± μ̂ on the sites, link matrices U connecting them; one-to-one mapping of sites to threads; the operator A assembled from these couplings]
MAPPING THE DIRAC OPERATOR TO CUDA
• Finite difference operator in LQCD is known as “dslash”
• Assign a single space-time point to each thread: V = XYZT threads, e.g., V = 24⁴ ⇒ 3.3×10⁵ threads
• Looping over directions, each thread must
– Load the neighboring spinor (24 numbers x8)
– Load the color matrix connecting the sites (18 numbers x8)
– Do the computation
– Save the result (24 numbers)
• Each thread has (Wilson dslash) 0.92 naive arithmetic intensity
• QUDA reduces memory traffic:
Exact SU(3) matrix compression (18 ⇒ 12 or 8 real numbers)
Use 16-bit fixed-point representation with mixed-precision solver
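The 0.92 figure follows directly from the counts above. A minimal sketch, assuming the standard 1,320 flops per site for Wilson dslash and 4-byte FP32 reals:

```cpp
// Naive FP32 arithmetic intensity of Wilson dslash from per-site traffic:
// 8 neighbor spinors (24 reals each), 8 link matrices (18 reals each,
// or 12 with SU(3) compression), 1 spinor store (24 reals).
constexpr double wilson_ai(bool compress_links) {
  const double flops = 1320.0;                  // standard Wilson count
  const double link_reals = compress_links ? 12.0 : 18.0;
  const double reals = 8 * 24.0 + 8 * link_reals + 24.0;
  return flops / (reals * 4.0);                 // 4 bytes per FP32 real
}
// wilson_ai(false) ~ 1320 / 1440 bytes ~ 0.92 flop/byte, so on any modern
// GPU the kernel is memory-bandwidth bound; compression raises the ratio.
```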
II. LATTICE QCD
The necessity for a lattice discretized formulation of QCD arises due to the failure of perturbative approaches commonly used for calculations in other quantum field theories, such as electrodynamics. Quarks, the fundamental particles that are at the heart of QCD, are described by the Dirac operator acting in the presence of a local SU(3) symmetry. On the lattice, the Dirac operator becomes a large sparse matrix, M, and the calculation of quark physics is essentially reduced to many solutions to systems of linear equations given by

Mx = b. (1)

The form of M on which we focus in this work is the Sheikholeslami-Wohlert [6] (colloquially known as Wilson-clover) form, which is a central difference discretization of the Dirac operator. When acting in a vector space that is the tensor product of a 4-dimensional discretized Euclidean spacetime, spin space, and color space it is given by

$M_{x,x'} = -\frac{1}{2}\sum_{\mu=1}^{4}\left(P^{-\mu}\otimes U^{\mu}_{x}\,\delta_{x+\hat\mu,x'} + P^{+\mu}\otimes U^{\mu\dagger}_{x-\hat\mu}\,\delta_{x-\hat\mu,x'}\right) + (4+m+A_x)\,\delta_{x,x'} \equiv -\frac{1}{2}D_{x,x'} + (4+m+A_x)\,\delta_{x,x'}.$ (2)

Here $\delta_{x,y}$ is the Kronecker delta; $P^{\pm\mu}$ are $4\times4$ matrix projectors in spin space; U is the QCD gauge field, which is a field of special unitary $3\times3$ (i.e., SU(3)) matrices acting in color space that live between the spacetime sites (and hence are referred to as link matrices); $A_x$ is the $12\times12$ clover matrix field acting in both spin and color space,¹ corresponding to a first order discretization correction; and m is the quark mass parameter. The indices x and x' are spacetime indices (the spin and color indices have been suppressed for brevity). This matrix acts on a vector consisting of a complex-valued 12-component color-spinor (or just spinor) for each point in spacetime. We refer to the complete lattice vector as a spinor field.

Since M is a large sparse matrix, an iterative Krylov solver is typically used to obtain solutions to (1), requiring many repeated evaluations of the sparse matrix-vector product. The matrix is non-Hermitian, so either Conjugate Gradients [7] on the normal equations (CGNE or CGNR) is used, or more commonly, the system is solved directly using a non-symmetric method, e.g., BiCGstab [8]. Even-odd (also known as red-black) preconditioning is used to accelerate the solution-finding process, where the nearest neighbor property of the $D_{x,x'}$ matrix (see Fig. 1) is exploited to solve the Schur complement system [9]. This has no effect on the overall efficiency since the fields are reordered such that all components of a given parity are contiguous. The quark mass controls the condition number of the matrix, and hence the convergence of such iterative solvers. Unfortunately, physical quark masses correspond to nearly indefinite matrices. Given that current leading lattice volumes are $32^3\times256$, for $>10^8$ degrees of freedom in total, this represents an extremely computationally demanding task.

¹ Each clover matrix has a Hermitian block-diagonal, anti-Hermitian block off-diagonal structure, and can be fully described by 72 real numbers.

Fig. 1. The nearest neighbor stencil part of the lattice Dirac operator D, as defined in (2), in the μ-ν plane. The color-spinor fields are located on the sites. The SU(3) color matrices $U^{\mu}_{x}$ are associated with the links. The nearest neighbor nature of the stencil suggests a natural even-odd (red-black) coloring for the sites.

III. GRAPHICS PROCESSING UNITS
In the context of general-purpose computing, a GPU is effectively an independent parallel processor with its own locally-attached memory, herein referred to as device memory. The GPU relies on the host, however, to schedule blocks of code (or kernels) for execution, as well as for I/O. Data is exchanged between the GPU and the host via explicit memory copies, which take place over the PCI-Express bus. The low-level details of the data transfers, as well as management of the execution environment, are handled by the GPU device driver and the runtime system.

It follows that a GPU cluster embodies an inherently heterogeneous architecture. Each node consists of one or more processors (the CPU) that is optimized for serial or moderately parallel code and attached to a relatively large amount of memory capable of tens of GB/s of sustained bandwidth. At the same time, each node incorporates one or more processors (the GPU) optimized for highly parallel code attached to a relatively small amount of very fast memory, capable of 150 GB/s or more of sustained bandwidth. The challenge we face is that these two powerful subsystems are connected by a narrow communications channel, the PCI-E bus, which sustains at most 6 GB/s and often less. As a consequence, it is critical to avoid unnecessary transfers between the GPU and the host.
[Figure: the nearest-neighbor stencil $D_{x,x'}$ in the μ-ν plane, with coordinates X[0], X[1]]
SCALING OF FINE-GRID STENCIL
Wilson dslash on V100, strong scaling
[Chart: GFlop/s (0-3,000) vs. lattice length (32 down to 8), in half, single, and double precision]
COARSE GRID STENCIL
▪ Coarse operator looks like a Dirac operator (many dof per site)
– Link matrices have dimension 2Nv x 2Nv (e.g., 48 x 48)
▪ Fine vs. coarse grid parallelization
– Fine grid operator has plenty of grid-level parallelism
– E.g., 16x16x16x16 = 65,536 lattice sites
– Coarse grid operator has diminishing grid-level parallelism
– First coarse grid: 4x4x4x4 = 256 lattice sites
– Second coarse grid: 2x2x2x2 = 16 lattice sites
▪ Current GPUs have up to 10,752 CUDA cores
▪ Need to consider finer-grained parallelization
– Increase parallelism to use all GPU resources
– Load balancing
dofs (geometry). We start by defining the fields

$W^{\pm\mu}_{ksc,ls'c'} = V^{\dagger}_{ksc,ksc}\, P^{\pm\mu}_{s,s'}\, U^{\mu}_{(k+\hat\mu)c,lc'}\,\delta_{k+\hat\mu,l}\, V_{ls'c',ls'c'}$

note that here we are defining different links for forwards and backwards; they are not simply the conjugate of each other (because of the different spin projection between the two). Also note that these effective link matrices also have a spin index; this is because the vectors used to define the V rotation matrices now have spin dependence. In this form we can now write down the coarse Dirac operator as

$\hat D_{isc,js'c'} = -\left[\delta_{i,k/B}\,\delta_{s,s/B_s}\right] \sum_{\mu} \left( W^{-\mu}_{ksc,ls'c'}\,\delta_{k+\hat\mu,l} + W^{+\mu\dagger}_{ksc,ls'c'}\,\delta_{k-\hat\mu,l} \right) \left[\delta_{l/B,j}\,\delta_{s'/B_s,s'}\right] + M\,\delta_{isc,js'c'}.$

We now finish up by blocking the geometry and spin onto the coarse lattice, defining the effective link matrices $Y^{\pm\mu}$ that connect sites on the coarse lattice:

$Y^{\pm\mu}_{isc,js'c'} = \left[\delta_{i,k/B}\,\delta_{s,s/B_s}\right] W^{\pm\mu}_{ksc,ls'c'} \left[\delta_{l/B,j}\,\delta_{s'/B_s,s'}\right] \delta_{i\pm\hat\mu,j}$ (2)

$X_{isc,js'c'} = \left[\delta_{i,k/B}\,\delta_{s,s/B_s}\right] \sum_{\mu} \left( W^{-\mu}_{isc,ks'c'} + W^{+\mu\dagger}_{isc,ks'c'} \right) \left[\delta_{l/B,j}\,\delta_{s'/B_s,s'}\right] \delta_{i,j},$

where we note now that the matrix X is not Hermitian. Thus the coarse operator is written

$D_{isc,js'c'} = -\sum_{\mu} \left( Y^{-\mu}_{isc,js'c'}\,\delta_{i+\hat\mu,j} + Y^{+\mu\dagger}_{isc,js'c'}\,\delta_{i-\hat\mu,j} \right) + \left(M - X_{isc,js'c'}\right)\delta_{isc,js'c'}.$ (3)

For the explicit form of these matrices we refer the reader to Appendix A. After the first blocking, subsequent blockings require that $B_s = 1$, i.e., we cannot block the spin dimension again since we cannot remove the chirality. Apart from this observation, the next coarse operator will have a similar form to the current one: it will be a nearest-neighbour non-Hermitian operator connecting sites with $d_s = 2$ spin dimension (in 2d and 4d anyway).

We note here in passing that because the definition of the matrix field V includes explicit spin dependence, this destroys the tensor product structure of the spin and colour on the coarse operator, i.e., we have to define an effective link matrix that rotates in spin and colour space. If this were not the case, i.e., if V were to be spin independent, then this structure would be preserved.
[Figure: the coarse-grid stencil in the μ-ν plane, with coordinates X[0], X[1]]
SOURCE OF PARALLELISM
1. Grid parallelism: volume of threads
2. Link matrix-vector partitioning: 2 Nvec-way thread parallelism (spin * color)
3. Stencil direction: 8-way thread parallelism
[Figure: the stencil in the μ-ν plane with one warp assigned per stencil direction (warp 0, warp 1, warp 2, warp 3, ...)]
4. Dot-product partitioning: 4-way thread parallelism + ILP

$\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix} \Rightarrow \begin{pmatrix} a_{00} & a_{01} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} + \begin{pmatrix} a_{02} & a_{03} \end{pmatrix} \begin{pmatrix} b_2 \\ b_3 \end{pmatrix}$

$\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix} \mathrel{+}= \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \\ a_{30} & a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix}$ (rows distributed over the thread y index)
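Source 4 can be sketched on the host: one dot product is split into partial products owned by different "threads" and then combined. On the GPU the combine would be a warp-level reduction; this is an illustrative host-side sketch only.

```cpp
#include <array>

// Source 4 above: split one length-4 dot product into two partial dot
// products ("threads"), then combine, trading one long dependency chain
// for extra thread parallelism and instruction-level parallelism.
double dot4_partitioned(const std::array<double, 4>& a,
                        const std::array<double, 4>& b) {
  // two "threads", each owning half of the dot product
  double p0 = a[0] * b[0] + a[1] * b[1];
  double p1 = a[2] * b[2] + a[3] * b[3];
  return p0 + p1; // final combine (in-kernel: a warp shuffle reduction)
}
```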
COARSE GRID OPERATOR PERFORMANCE
Tesla K20X (Titan), FP32, Nvec = 24
[Chart: GFLOPS (0-160) vs. lattice length (10 down to 2) for successive optimizations: baseline, color-spin, dimension + direction, dot product; parallelism ranges from 24,576-way down to 16-way]
c.f. fine grid operator ~300-400 GFLOPS
PARALLELISM ISN’T INFINITE
[Chart: coarse stencil performance, GFLOPS (0-1,600) vs. lattice length (2-10) on Kepler, Maxwell, Pascal, Volta, and Ampere; small lattices are latency limited, large lattices bandwidth limited; working set 36 MiB]

PROLONGATION
Interpolation from coarse grid to fine grid
One-to-many mapping: a multi-scatter, no synchronization required
Parallelize over fine grid points
Implement as trivial streaming kernel; coarse grid points cached in L1 and L2
RESTRICTOR
Restriction of fine grid to coarse grid
Many-to-one mapping ⇒ race conditions
Use atomics? Lack of determinism
Instead pose as a multi-reduction:
Assign each MG aggregate to a thread block
Many-to-one mapping now local to the SM
Synchronization only within the thread block
Minimizes total memory traffic
BLOCK ORTHOGONALIZATION
Block Gram-Schmidt orthogonalization over the set of n vectors
n² many-to-one and one-to-many mappings
Again assign each MG aggregate to a thread block
Multi-(reduce-scatters) local to a thread block
Whole computation in a single kernel
Minimizes total memory traffic: O(n) memory traffic, O(n²) compute
Forms the basis for the coarse space
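A host-side sketch of the block Gram-Schmidt step (classical projection per aggregate; QUDA's batched, single-kernel version differs in layout and precision handling):

```cpp
#include <cmath>
#include <vector>

// Block Gram-Schmidt sketch over one aggregate: orthonormalize n short
// vectors in place.  Each projection is a many-to-one reduce (the dot
// product) followed by a one-to-many scatter (the axpy), n^2 in total.
void block_gram_schmidt(std::vector<std::vector<double>>& v) {
  for (size_t i = 0; i < v.size(); ++i) {
    for (size_t j = 0; j < i; ++j) {               // project out v[j]
      double dot = 0;
      for (size_t k = 0; k < v[i].size(); ++k) dot += v[j][k] * v[i][k];
      for (size_t k = 0; k < v[i].size(); ++k) v[i][k] -= dot * v[j][k];
    }
    double norm = 0;                               // then normalize
    for (double x : v[i]) norm += x * x;
    norm = std::sqrt(norm);
    for (double& x : v[i]) x /= norm;
  }
}
```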
COARSE OPERATOR CONSTRUCTION
Need to evaluate RAP, a pseudo-batched triple matrix product
Need to employ fine-grained parallelization
Perform each batched matrix product in parallel
Each coarse matrix element has many contributions (many-to-one)
Cannot pose as a tree reduction, so we use atomics
Floating point atomics are non-deterministic
Use 32-bit integer atomics for associativity and determinism
Minimize contention by using the atomic hierarchy:
thread -> shared memory atomic -> L2 atomic
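Why integer atomics restore determinism: integer addition is associative, so any arrival order of the atomics yields the same bits, while floating-point addition is not. A sketch with an arbitrary, purely illustrative fixed-point scale:

```cpp
#include <cstdint>
#include <vector>

// Deterministic accumulation sketch: convert each contribution to fixed
// point, sum with integer adds (associative, so order-independent), and
// convert back.  The scale below is an illustrative choice, not QUDA's.
constexpr double kScale = 1 << 20;          // assumed fixed-point scale
int32_t to_fixed(double x) { return static_cast<int32_t>(x * kScale); }

double sum_fixed(std::vector<double> xs) {  // plays the atomic counter
  int64_t acc = 0;
  for (double x : xs) acc += to_fixed(x);   // any order gives same bits
  return acc / kScale;
}
```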
CHROMA HMC ON SUMMIT
From Titan running 2016 code to Summit running 2019 code we see >82x speedup in HMC throughput
Multiplicative speedup coming from mapping hierarchical algorithm to hierarchical machine
Highly optimized multigrid for gauge field evolution
Mixed precision an important piece of the puzzle
• double – outer defect correction
• single – GCR solver
• half – preconditioner
• int32 – deterministic parallel coarsening
KC, Bálint Joó, Mathias Wagner, Evan Weinberg, Frank Winter, Boram Yoon
[Bar chart, Chroma ECP benchmark wallclock seconds: 4006, 1878, 974, 439, 392 across Titan (1024x / 512x K20X) and Summit (128x V100, March / Nov 2019) configurations; 4.1x faster on 2x fewer GPUs, ~8x gain; 10.2x faster on 8x fewer GPUs, ~82x gain]
HIERARCHICAL COMMUNICATION
ONE COMMUNICATIONS API TO RULE THEM ALL?
For example, DGX-1 nodes: 8x V100 GPUs connected through NVLink, 4x EDR for inter-node communication, optimal placement of GPUs and NICs
Intra-node: DMA or load/store over NVLink @ 50 GB/s
Inter-node: message passing over IB @ 6 GB/s
Can we use a single exchange API across this hierarchy?
[Figure, from NVIDIA DGX-1 With Tesla V100 System Architecture, WP-08437-002_v01, p. 9: DGX-1 uses an 8-GPU hybrid cube-mesh interconnection network topology. The corners of the mesh-connected faces of the cube are connected to the PCIe tree network, which also connects to the CPUs and NICs.]
MULTI GPU BUILDING BLOCKS
Halo packing kernel -> interior kernel -> halo communication -> halo update kernel
[Figure: multi-GPU parallelization with face exchange and wraparound]
SOME TERMINOLOGY
P2P: direct exchange between GPUs
CUDA-aware MPI: MPI can take a GPU pointer; can do batched, overlapping transfers (high bandwidth)
GPUDirect RDMA (implies CUDA-aware MPI): RDMA transfer from/to GPU memory, here NIC <-> GPU
[Figure: DGX-1 hybrid cube-mesh topology as before, annotated with the P2P, CUDA-aware, and GPUDirect RDMA paths; MPI_Sendrecv timeline]
QUDA MULTI-GPU SCALING
Key technologies employed Peer-to-peer communication - skip MPI and directly utilize DMA copies GPU Direct RDMA (GDR) - GPU <-> NIC <-> GPU without traversing CPU memory Node topology awareness
Autotuning for Dslash communication policy Kernel launch parameters (dependent on local problem size)
Starting point (already multiple years of optimization)
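The autotuning loop can be sketched generically: time each candidate policy once, then cache the winner for all subsequent calls with the same problem size. This is a hypothetical structure for illustration, not QUDA's tuning API.

```cpp
#include <functional>
#include <string>
#include <vector>

// Autotuning sketch: benchmark every candidate (communication policy,
// launch geometry, ...) and keep the fastest one.
struct Candidate {
  std::string name;
  std::function<double()> run;  // returns measured time in seconds
};

std::string pick_best(const std::vector<Candidate>& candidates) {
  std::string best;
  double best_time = 1e300;
  for (const auto& c : candidates) {
    double t = c.run();         // in a real tuner: a timed kernel launch
    if (t < best_time) { best_time = t; best = c.name; }
  }
  return best;                  // cache this per problem size in practice
}
```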
MULTI-GPU PROFILE
Overlapping comms and compute
32⁴ local volume, single precision
[Profile: P2P copies, interior kernel, packing kernel, halo kernels (t, z, y); DGX-1, 1x2x2x2 partitioning]
REDUCING LOCAL PROBLEM SIZE
Single GPU performance, strong scaling
[Chart: GFlop/s (0-3,000) vs. lattice length (56 down to 8), in half, single, and double precision; smallest reasonable problem size marked]
STRONG SCALING PROFILE
Overlapping comms and compute
16⁴ local volume, half precision
[Profile: P2P copies, interior kernel, packing kernel, halo kernels (fused); DGX-1, 1x2x2x2 partitioning]
STRONG SCALING PROFILE
Latencies ate my scaling
16⁴ local volume, half precision
[Profile: P2P copies, interior kernel, packing kernel, halo kernels (fused), API calls]
WHAT IS THE PROBLEM?
It’s not just data movement we need to minimize
Task marshaling from a lower level of the hierarchy (e.g., the host) adds latency
Data consistency requires synchronization between CPU and GPU
Ideally: offload task marshaling to a GPU thread so it has the same locality as the data
[Figure: DGX-1 hybrid cube-mesh topology as before]
NVSHMEM
Implementation of OpenSHMEM¹, a Partitioned Global Address Space (PGAS) library
NVSHMEM features:
Symmetric memory allocations in device memory
Communication API calls on CPU (standard and stream-ordered)
Kernel-side communication (API and LD/ST) between GPUs
NVLink and PCIe support (intra-node); InfiniBand support (inter-node)
Interoperability with MPI and OpenSHMEM libraries
Available in the NVIDIA HPC toolkit
¹ SHMEM from Cray’s “shared memory” library, https://en.wikipedia.org/wiki/SHMEM
DSLASH NVSHMEM IMPLEMENTATION
First exploration
Keep the general structure of packing, interior, and exterior dslash
Use nvshmem_ptr for intra-node remote writes (fine-grained); the packing buffer is located on the remote device
Use nvshmem_putmem_nbi to write to the remote GPU over IB (1 RDMA transfer)
Need to make sure writes are visible: nvshmem_barrier_all_on_stream, or a barrier kernel that only waits for writes from neighbors
NVSHMEM DSLASH
First exploration
16⁴ local volume, half precision
[Profile: packing kernel, interior kernel, fused halo, barrier kernel; DGX-1, 1x2x2x2 partitioning]
FUSED DSLASH + PACKING KERNEL
Interior + pack kernel: pack_blocks do the packing, the remaining interior_blocks = grid_dim - pack_blocks do the interior; nvshmem_signal for each direction once its data is packed
Halo kernel: nvshmem_wait_until for the packed data, then apply the exterior (halo) update
NVSHMEM + FUSING KERNELS
No extra packing and barrier kernels needed
16⁴ local volume, half precision
[Profile: interior + pack + flag kernel, barrier + fused halo; DGX-1, 1x2x2x2 partitioning]
KERNEL FUSION REDUCES LATENCY
[Chart: GFlop/s per GPU (0-2,000) for IPC copies, IPC remote write, NVSHMEM w/ barrier, and NVSHMEM kernel fusion; DGX-1, 16⁴ local volume, Wilson, half precision, 1x2x2x2 partitioning]
CAN WE GO FURTHER?
One kernel to rule them all! Communication is handled in the kernel and latencies are hidden.
DSLASH ÜBER KERNEL
Packing: pack_blocks; nvshmem_signal for each direction
Interior: interior_blocks = grid_dim - pack_blocks - exterior_blocks; atomic flag set by the last block
Exterior (halo): exterior_blocks; atomic wait for the interior, nvshmem_wait_until for the remote halos
ÜBER KERNEL
16⁴ local volume, half precision
[Profile: a single über kernel (packing + interior + exterior); DGX-1, 1x2x2x2 partitioning]
LATENCY REDUCTIONS
[Chart: GFlop/s per GPU (0-2,000) for IPC copies, IPC remote write, NVSHMEM w/ barrier, NVSHMEM kernel fusion, and NVSHMEM über kernel; DGX-1, 16⁴ local volume, Wilson, half precision, 1x2x2x2 partitioning]
DGX-2: FULL NON-BLOCKING BANDWIDTH
2.4 TB/s bisection bandwidth
[Figure: 16 V100 GPUs (GPU0-GPU15) fully connected through twelve NVSwitches]
DGX-2 STRONG SCALING
Global volume 32⁴, Wilson dslash
[Chart: GFlop/s (0-25,000) vs. #GPUs (1-16), MPI vs. SHMEM in double, single, and half precision]
@16 GPUs: 16⁴ local volume
MULTI-NODE STRONG SCALING
DGX SuperPOD (DGX-2 nodes: 16x V100 (32 GB), 8x EDR IB)
Wilson dslash, 64³×128 global volume, half precision
[Chart: GFlop/s (0-400,000) vs. #GPUs (16-1024), MPI vs. NVSHMEM; NVSHMEM 1.3-1.4x faster at scale; the 16⁴ local volume (half precision) point marks the sweet spot for simulations]
SUMMARY
LQCD OUTLOOK
Reformulate vector-vector and matrix-vector problems as matrix-matrix: multi-RHS, communication-avoiding, and block solvers; deploy using tensor cores where possible
Keep data on chip and minimize hierarchy level jumping: use NVSHMEM to fuse across communication boundaries, increasingly possible with the latest generation of GPUs
Follow trends towards future architectures and aim for super-linear scaling
QUDA NODE PERFORMANCE OVER TIME
Multiplicative speedup through software and hardware
[Same chart as before: Wilson FP32 GFLOPS and cumulative ~300x speedup, 2008-2019, now extended with 2022 and 2024 marked “?”]
SUMMARY
Algorithms achieve maximal speedup when we
Expose their parallelism to map onto ever-more parallel processors
Identify their shape to map onto ever-more hierarchical processors
Communications reach optimal performance when we
Move task marshaling close to data, enabling a single communications API
Utilize fine-grained communications which maximize latency hiding

BACKUP

PARALLELISM AND LOCALITY
MULTIPLE RIGHT-HAND SIDES
48³×12, HISQ, single precision, one code
[Chart: GFlop/s (0-3,000) vs. number of right-hand sides (1-16) on Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), and Volta (2017); annotated gains of 2.5x and 3.5x from batching]
MULTIGRID IS FULL OF MATRIX MULTIPLICATION
Explicit
• Coarse-operator construction
• Block-orthogonalization
Implicit (through reformulation)
• Multi-RHS operators
• Coarse-grid LU solver
Reworking the pipeline to expose algorithms in matrix-matrix form
Maximize locality, parallelism for optimal mapping onto the GPU hierarchy
LATTICE QCD WITH TENSOR CORES
Modern GPUs have high-throughput tensor-core functionality
Kernel fusion: the 4d volume (~10⁸) is much larger than the size of the 5th dimension (~10¹); fusing the 4d-spatial operations with the 5th-dimension operations that immediately follow reduces global memory traffic:

$\bigl[\,1 - 2b\,\underbrace{M^{\dagger} D_w^{\dagger}}_{\text{fuse}}\, M_5^{-\dagger}\,\underbrace{M^{\dagger} D_w^{\dagger}}_{\text{fuse}}\, M_5^{-\dagger}\,\bigr]\bigl[\,1 - 2b\,\underbrace{M_5^{-1} D_w}_{\text{fuse}}\, M\,\underbrace{M_5^{-1} D_w}_{\text{fuse}}\, M\,\bigr]$

Tesla V100: FP64 7.5 TFLOPS, FP32 15 TFLOPS, Tensor 125 TFLOPS
KC, Jung, Mawhinney, Tu
TENSOR CORES FOR MULTIGRID
Multigrid is a preconditioner: 16-bit precision is perfectly adequate, with no impact on convergence rate
Majority of MG setup kernels now implemented using tensor cores
1.5x-10x kernel speedups observed
Ongoing WIP: expect integer speedup factors on Summit once fully reformulated
[Chart: Yhat kernel TFLOPS (0-30) on Quadro GV100 (32 null-space vectors) for 4x4x4x8 and 2x2x2x2 blocks, in single precision, half precision, and with tensor cores]
FINE-GRAINED SYNCHRONIZATION
Need replacement for kernel boundaries
Packing is independent of interior
Interior and exterior both update the boundary → possible race condition
Use cuda::atomic from libcu++; libcu++ gives us std::atomic in CUDA:

#include <atomic>
std::atomic<int> x;

#include <cuda/std/atomic>
cuda::std::atomic<int> x;

#include <cuda/atomic>
cuda::atomic<int, cuda::thread_scope_block> x;