EXPLOITING HIERARCHICAL ALGORITHMS ON EVERMORE HIERARCHICAL MACHINES Kate Clark, HiPar @ SC20



2

OUTLINE

Hierarchical machines

Science Motivator: Lattice QCD

Motifs:
Hierarchical Algorithm: Multigrid
Hierarchical Communication

Summary


3

HIERARCHICAL MACHINES


4

HPC IS GETTING MORE HIERARCHICAL
What does a node even mean?

Legacy Current

Cray XT4 (2007) https://www.nersc.gov/assets/NUG-Meetings/NERSCSystemOverview.pdf

NVIDIA DGX-A100 (2020) https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf

[DGX-A100 node diagram: GPU0 through GPU7 connected by 6 NVSwitches; 4 PEX switches, each attached to 2x 200G NICs and 2x NVMe drives; 2x AMD Rome 64-core CPUs with 2 additional 200G NICs]


5

HPC IS GETTING MORE HIERARCHICAL

2009 Jaguar: Cray XT5 @ ORNL 1.75 PFLOPS, 18,688 nodes, 2x AMD 6-core, Seastar2+

2012 Titan: Cray XK7 @ ORNL 17 PFLOPS, 18,688 nodes, 1x AMD 16-core + 1x K20X GPU (14 SMs), Cray Gemini

2017 Summit: IBM AC922 @ ORNL 149 PFLOPS, 4356 nodes, 2x IBM P9 22-core CPU + 6 x V100 GPU 80 SMs, MLNX 2xEDR

2020 Selene: NVIDIA DGX A100 @ NVIDIA 28 PFLOPS, 280 nodes, 2x AMD 64-core CPU + 8 x A100 GPU 108 SMs, MLNX 8xHDR


6

EVOLVING HIERARCHY

[Charts comparing Jaguar, Titan, Summit, and Selene: Bandwidth / socket, FP64 / socket, Inter-node bandwidth / socket, Intra-node bandwidth / socket, Bandwidth / node, FP64 / node, Inter-node bandwidth / node, Intra-node bandwidth / node]

Data is intended to be illustrative, sourced from wikipedia.org; 1 node equals a single OS instance; 1 GPU = 1 socket where appropriate


7

MODERN HPC PROCESSORS

AKA GPUs

GPUs are extreme hierarchical processors

A many-core processor programmed using a massively threaded model; threads arranged as a Cartesian hierarchy of grids

Deep memory hierarchy: Registers <-> L1 <-> L2 <-> Device memory <-> Host memory

Increasingly coupled instruction hierarchy: Tensor cores <-> CUDA cores <-> shared-memory atomics <-> L2 atomics

Synchronization possible at many levels: (Sub-)Warp <-> Thread Block <-> Grid <-> Node <-> Cluster

[A100 sketch: Host Memory and Device Memory, with bandwidths of 64 GB/s bi-directional (host link), 1.6 TB/s, > 3 TB/s, and > 20 TB/s at successive levels of the hierarchy]


8

MAPPING TO HIERARCHY
Composable Parallelism

Streaming

No need to synchronize Likely bandwidth bound

Reduce

Need to synchronize Likely bandwidth bound

Scatter

No need to synchronize Input Locality

Streaming Reduce = Multi-reduce

Only local synchronization


9

SCIENCE MOTIVATOR LATTICE QCD


10

LATTICE QCD MOTIVATION

Summit cycle breakdown in

INCITE allocation

Summit cycle breakdown in

INCITE use

NERSC Utilization (Aug ’17 - Jul’18)

LQCD ~40%

NERSC workload is extremely diverse but not evenly divided:
• … codes make up …% of workload
• … codes make up …% of workload
• … codes make up …% of workload
• Remaining codes (over …) make up …% of workload
(Python is among the labelled codes)

LQCD ~13%


11

QUANTUM CHROMODYNAMICS

The strong force is one of the basic forces of nature (along with gravity, em and weak)

It’s what binds together the quarks and gluons in the proton and the neutron (as well as hundreds of other particles seen in accelerator experiments)

Responsible for the particle zoo seen at sub-nuclear scales (mass, decay rate, etc.)

QCD is the theory of the strong force It’s a beautiful theory… …but

[Slide images with duplicated caption text; labels: Thomas Jefferson National Accelerator Facility; Fermi National Accelerator Laboratory]

$$\langle \Omega \rangle = \frac{1}{Z} \int [dU]\, e^{-\int d^4x\, \mathcal{L}(U)}\, \Omega(U)$$


12

LATTICE QUANTUM CHROMODYNAMICS

Theory is highly non-linear ⇒ cannot solve directly

Must resort to numerical methods to make predictions

Lattice QCD: discretize spacetime ⇒ a 4-dimensional lattice of size Lx × Ly × Lz × Lt

Finite spacetime ⇒ periodic boundary conditions

PDEs ⇒ finite difference equations ⇒ Linear solvers Ax = b

Consumer of 10+% of public supercomputer cycles
Traditionally highly optimized on every HPC platform for the past 30 years
Jobs often run at the 1000+ GPU scale



13

QUDA

• “QCD on CUDA” – http://lattice.github.com/quda (C++14, open source, BSD license)
• Effort started at Boston University in 2008, now in wide use as the GPU backend for BQCD, Chroma**, CPS**, MILC**, TIFR, etc.
• Provides solvers for all major fermionic discretizations, with multi-GPU support
• Maximize performance
– Mixed-precision methods (runtime specification of precision for maximum flexibility)
– Exploit physical symmetries to minimize memory traffic
– Autotuning for high performance on all CUDA-capable architectures
– Domain-decomposed (Schwarz) preconditioners for strong scaling
– Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR)
– Multi-RHS solvers
– Multigrid solvers for optimal convergence
– NVSHMEM for improving strong scaling
• A research tool for how to reach the exascale (and beyond)
– Optimally mapping the problem to hierarchical processors and node topologies

**ECP benchmark apps


14

QUDA NODE PERFORMANCE OVER TIME
Multiplicative speedup through software and hardware

[Chart: Wilson FP32 GFLOPS (0 to 3,000) and cumulative speedup (1 to 1000x, log scale) vs. year from 2008 to 2019, with milestones marked: Multi-GPU capable, Adaptive Multigrid, Optimized Multigrid, Deflated Multigrid; roughly 300x overall speedup]

Speedup determined by measured time to solution for solving the Wilson operator against a random source on a V = 24³×64 lattice, β = 5.5, Mπ = 416 MeV. One node is defined to be 3 GPUs.


15

QUDA CONTRIBUTORS

Ron Babich (NVIDIA), Simone Bacchio (Cyprus), Kip Barros (LANL), Rich Brower (Boston University), Nuno Cardoso (NCSA), Kate Clark (NVIDIA), Michael Cheng (Boston University), Carleton DeTar (Utah University), Justin Foley (Utah -> NIH), Joel Giedt (Rensselaer Polytechnic Institute), Arjun Gambhir (William and Mary), Steve Gottlieb (Indiana University), Kyriakos Hadjiyiannakou (Cyprus), Dean Howarth (LLNL), Bálint Joó (JLab), Hyung-Jin Kim (BNL -> Samsung), Bartek Kostrzewa (Bonn), Claudio Rebbi (Boston University), Eloy Romero (William and Mary), Hauke Sandmeyer (Bielefeld), Guochun Shi (NCSA -> Google), Mario Schröck (INFN), Alexei Strelchenko (FNAL), Jiqun Tu (NVIDIA), Alejandro Vaquero (Utah University), Mathias Wagner (NVIDIA), André Walker-Loud (LBL), Evan Weinberg (NVIDIA), Frank Winter (JLab), Yi-bo Yang (CAS)

10+ years - lots of contributors


16

STEPS IN AN LQCD CALCULATION

1. Generate an ensemble of gluon field configurations ("gauge generation")
   Produced in sequence, with hundreds needed per ensemble
   Strong scaling required, with 100-1000 TFLOPS sustained for several months
   50-90% of the runtime is in the linear solver
   O(1) solves per linear system
   Target 16⁴-24⁴ per GPU

2. "Analyze" the configurations
   Can be farmed out, assuming ~10 TFLOPS per job
   Task parallelism means that clusters reign supreme here
   80-99% of the runtime is in the linear solver
   Many solves per system, e.g., O(10⁶)
   Target 24⁴-32⁴ per GPU

$$D^{\alpha\beta}_{ij}(x,y;U)\,\psi^{\beta}_{j}(y) = \eta^{\alpha}_{i}(x), \quad \text{or} \quad Ax = b$$

Simulation cost ~ a⁻⁶ V^{5/4}


17

LATTICE QCD IN A NUTSHELL

$$\langle \Omega \rangle = \frac{1}{Z} \int [dU]\, e^{-\int d^4x\, \mathcal{L}(U)}\, \Omega(U)$$

[Figure: multi-GPU parallelization of the lattice, with face exchange and wrap-around between neighbouring domains]

[Figures: lattice QCD theory vs. experiment comparison (Davies et al.); Brookhaven National Laboratory; Large Hadron Collider]


18

HIERARCHICAL ALGORITHM: MULTIGRID


19

MULTIGRID

Linear scaling with problem size

Condition number independent

Two distinct flavours

Algebraic: general sparse matrix

Geometric: stencil kernel

Pose problem as a hierarchy of grids

Fine grid is highly parallel

Coarse problem is essentially serial

The optimal method for solving PDE-based linear systems

[Chart: Dirac operator applications (10² to 10⁵, log scale) vs. quark mass (-0.43 to -0.40) for CG, Eig-CG, and MG-GCR solvers on 16³×64, 24³×64, and 32³×96 lattices]


20

MULTIGRID

GPU requirements very different from CPU

Each thread is slow, but O(10,000) threads per GPU

Fine grids run very efficiently

High parallel throughput problem

Coarse grids are worst possible scenario

More cores than degrees of freedom

Increasingly serial and latency bound

Little’s law (bytes = bandwidth * latency)

Amdahl’s law limiter

The optimal method for solving PDE-based linear systems


21

HPGMG

Benchmark developed by LBNL as an alternative TOP500 benchmark to be representative of adaptive mesh refinement computations

Port to CUDA handled by Nikolay Sakharnykh https://devblogs.nvidia.com/high-performance-geometric-multi-grid-gpu-acceleration

Optimal performance when reverse offloading the coarse grids to the CPU

True even with Kepler (2012) (never mind Volta / Ampere)

Many prior GPU implementations of Multigrid


MULTIGRID INGREDIENTS

▪ Adaptive Multigrid setup – Block orthogonalization of null space vectors – Batched QR decomposition

▪ Smoothing (relaxation on a given grid) – Fine grid stencil and BLAS1 kernels

▪ Prolongation – interpolation from coarse grid to fine grid

▪ Restriction – restriction from fine grid to coarse grid

▪ Coarse Operator construction (setup) – Evaluate R A P on each block – Batched (small) dense matrix multiplication

▪ Coarse grid solver – Need optimal coarse-grid stencil application

22


MULTIGRID INGREDIENTS

(Ingredient list repeated from the previous slide)

23

[Figure: nearest-neighbour stencil with link matrices U_μ(x) in the μ-ν plane; labels "One-to-one mapping" and "A ="]


24

MAPPING THE DIRAC OPERATOR TO CUDA

• Finite difference operator in LQCD is known as Dslash

• Assign a single space-time point to each thread: V = X·Y·Z·T threads, e.g., V = 24⁴ ⇒ 3.3×10⁶ threads

• Looping over the directions, each thread must (as sketched below)
– Load the neighboring spinor (24 numbers x8)
– Load the color matrix connecting the sites (18 numbers x8)
– Do the computation
– Save the result (24 numbers)

• Each thread has a naive arithmetic intensity of 0.92 (Wilson Dslash)

• QUDA reduces memory traffic:
– Exact SU(3) matrix compression (18 => 12 or 8 real numbers)
– Use a 16-bit fixed-point representation with a mixed-precision solver
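To make the one-thread-per-site mapping concrete, here is a minimal CUDA sketch of the loop structure described above. It is not QUDA code: the 24-component spinors and 18-number SU(3) links are collapsed to a single float per site/link, and the lattice extent and linear site indexing are illustrative assumptions; only the thread-per-site decomposition and the loop over the 8 stencil directions mirror the real kernel.

// Minimal Dslash-like sketch: one thread per lattice site, 8 stencil hops.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int L = 8;                       // assumed local lattice extent (L^4 sites)
constexpr int V = L * L * L * L;           // local volume

__device__ int neighbor(int idx, int dim, int dir)
{
    // decompose the linear index into 4-d coordinates, step, and recompose
    int coord[4], r = idx;
    for (int d = 0; d < 4; d++) { coord[d] = r % L; r /= L; }
    coord[dim] = (coord[dim] + dir + L) % L;   // periodic boundary conditions
    return ((coord[3] * L + coord[2]) * L + coord[1]) * L + coord[0];
}

__global__ void dslash_sketch(float* out, const float* in, const float* link)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per site
    if (x >= V) return;

    float result = 0.0f;
    for (int dim = 0; dim < 4; dim++) {              // 4 dimensions ...
        for (int dir = -1; dir <= +1; dir += 2) {    // ... x 2 directions = 8 hops
            int y = neighbor(x, dim, dir);
            // real code: load the 24-number spinor and 18-number link,
            // spin-project, multiply by U (or U^dagger) and accumulate
            result += link[8 * x + 2 * dim + (dir + 1) / 2] * in[y];
        }
    }
    out[x] = result;                                 // real code: save 24 numbers
}

int main()
{
    float *in, *out, *link;
    cudaMallocManaged(&in, V * sizeof(float));
    cudaMallocManaged(&out, V * sizeof(float));
    cudaMallocManaged(&link, 8 * V * sizeof(float));
    for (int i = 0; i < V; i++) in[i] = 1.0f;
    for (int i = 0; i < 8 * V; i++) link[i] = 0.125f;

    dslash_sketch<<<(V + 255) / 256, 256>>>(out, in, link);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);                 // expect 1.0 for this toy data
    return 0;
}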

[Slide background: excerpt from the QUDA paper, covering the lattice QCD problem and GPU architecture. Its key equation is the Sheikholeslami-Wohlert (Wilson-clover) form of the Dirac matrix, acting on the tensor product of 4-d discretized Euclidean spacetime, spin space, and color space:

$$M_{x,x'} = -\frac{1}{2}\sum_{\mu=1}^{4}\left( P^{-\mu}\otimes U^{\mu}_{x}\,\delta_{x+\hat\mu,x'} + P^{+\mu}\otimes U^{\mu\dagger}_{x-\hat\mu}\,\delta_{x-\hat\mu,x'}\right) + (4+m+A_x)\,\delta_{x,x'} \equiv -\frac{1}{2}D_{x,x'} + (4+m+A_x)\,\delta_{x,x'} \quad (2)$$

where δ is the Kronecker delta, P^{±μ} are 4×4 spin projectors, U is the SU(3) gauge (link) field, A_x is the 12×12 clover matrix field (72 real numbers per site), and m is the quark mass. M is a large sparse non-Hermitian matrix, solved with Krylov methods (CGNE/CGNR or BiCGstab) accelerated by even-odd (red-black) preconditioning; the quark mass controls the condition number, physical masses give nearly indefinite matrices, and leading volumes of 32³×256 correspond to more than 10⁸ degrees of freedom. The GPU section notes the heterogeneous node design: a CPU with tens of GB/s of memory bandwidth, a GPU with 150+ GB/s, and the two joined by a PCI-E link sustaining at most ~6 GB/s, so unnecessary host-device transfers must be avoided.

Fig. 1. The nearest neighbor stencil part of the lattice Dirac operator D, as defined in (2), in the μ-ν plane. The color-spinor fields are located on the sites. The SU(3) color matrices U^μ_x are associated with the links. The nearest neighbor nature of the stencil suggests a natural even-odd (red-black) coloring for the sites.]

[Stencil figure: D_{x,x'} drawn as the nearest-neighbour stencil with links U^μ_x along axes X[0], X[1]]


SCALING OF FINE-GRID STENCIL
Wilson-Dslash on V100

[Chart: GFlop/s vs. lattice length (32 down to 8) in half, single, and double precision; strong scaling]


COARSE GRID STENCIL

▪ Coarse operator looks like a Dirac operator (many dof per site)
– Link matrices have dimension 2Nv x 2Nv (e.g., 48 x 48)
▪ Fine vs. coarse grid parallelization
– Fine grid operator has plenty of grid-level parallelism
– e.g., 16x16x16x16 = 65,536 lattice sites
– Coarse grid operator has diminishing grid-level parallelism
– first coarse grid 4x4x4x4 = 256 lattice sites
– second coarse grid 2x2x2x2 = 16 lattice sites
▪ Current GPUs have up to 10,752 CUDA cores
▪ Need to consider finer-grained parallelization
– Increase parallelism to use all GPU resources
– Load balancing

[Slide background: excerpt describing the coarse operator construction. Effective link matrices W^{±μ} are built from the null-space rotation matrices V, the spin projectors P^{±μ}, and the fine links U; blocking geometry and spin onto the coarse lattice then defines coarse links Y^{±μ} and a (non-Hermitian) diagonal term X, giving a nearest-neighbour coarse operator

$$\hat D_{isc,js'c'} = -\sum_{\mu}\left( Y^{-\mu}_{isc,js'c'}\,\delta_{i+\hat\mu,j} + Y^{+\mu\dagger}_{isc,js'c'}\,\delta_{i-\hat\mu,j}\right) + \left(M - X_{isc,js'c'}\right)\delta_{isc,js'c'}. \quad (3)$$

After the first blocking the spin dimension cannot be blocked again (the chirality cannot be removed), so subsequent coarse operators have the same form: nearest-neighbour, non-Hermitian, with a two-component spin index; because V carries explicit spin dependence, the spin and colour structure no longer factorizes, and the effective links rotate in spin and colour space together.]

[Stencil figure: coarse-grid nearest-neighbour stencil with links U_μ in the μ-ν plane]

26


27

SOURCE OF PARALLELISM

1. Grid parallelism: volume of threads
2. Link matrix-vector partitioning: 2·Nvec-way thread parallelism (spin × color); each thread's y index owns one row of the matrix-vector product c += A b
3. Stencil direction: 8-way thread parallelism
4. Dot-product partitioning: 4-way thread parallelism + ILP, splitting each row dot product, e.g.

$$\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix} \Rightarrow \begin{pmatrix} a_{00} & a_{01} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} + \begin{pmatrix} a_{02} & a_{03} \end{pmatrix} \begin{pmatrix} b_2 \\ b_3 \end{pmatrix}$$

[Figures: the nearest-neighbour stencil repeated once per warp (warp 0 through warp 3), and a 4x4 matrix-vector product c += A b with rows assigned to the thread y index; axes X[0], X[1]]
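A minimal CUDA sketch of levels 2 and 4 above, under simplifying assumptions (a real-valued matrix-vector product with made-up sizes, not QUDA's coarse Dslash): threadIdx.y indexes the output row, threadIdx.x splits each dot product four ways, and the partial sums are combined with warp shuffles.

// Partitioned matrix-vector product: rows over threadIdx.y, dot products over threadIdx.x.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 8;       // assumed small matrix dimension (e.g., 2*Nvec)
constexpr int SPLIT = 4;   // dot-product split factor (4-way partitioning)

__global__ void matvec_partitioned(const float* A, const float* b, float* c)
{
    int row   = threadIdx.y;             // each y-thread owns one output row
    int chunk = threadIdx.x;             // each x-thread owns a slice of the dot product

    // partial dot product over this thread's slice of the row
    float partial = 0.0f;
    for (int k = chunk; k < N; k += SPLIT)
        partial += A[row * N + k] * b[k];

    // combine the SPLIT partial sums with warp shuffles (the 4 threads of a
    // row occupy consecutive lanes, so width = SPLIT keeps rows independent)
    for (int offset = SPLIT / 2; offset > 0; offset /= 2)
        partial += __shfl_down_sync(0xffffffff, partial, offset, SPLIT);

    if (chunk == 0) c[row] = partial;
}

int main()
{
    float hA[N * N], hb[N], hc[N];
    for (int i = 0; i < N * N; i++) hA[i] = 1.0f;
    for (int i = 0; i < N; i++) hb[i] = float(i);

    float *dA, *db, *dc;
    cudaMalloc(&dA, sizeof(hA)); cudaMalloc(&db, sizeof(hb)); cudaMalloc(&dc, sizeof(hc));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, sizeof(hb), cudaMemcpyHostToDevice);

    matvec_partitioned<<<1, dim3(SPLIT, N)>>>(dA, db, dc);   // one warp, 8 rows x 4 slices
    cudaMemcpy(hc, dc, sizeof(hc), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("c[%d] = %f\n", i, hc[i]);
    return 0;
}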


28

COARSE GRID OPERATOR PERFORMANCE
Tesla K20X (Titan), FP32, Nvec = 24

[Chart: GFLOPS vs. lattice length (10 down to 2) for successive parallelization strategies: baseline, + color/spin, + dimension/direction, + dot product; from 24,576-way parallel down to 16-way parallel]

c.f. fine grid operator: ~300-400 GFLOPS


29

PARALLELISM ISN'T INFINITE

[Chart: coarse stencil performance, GFLOPS vs. lattice length (2 to 10) on Kepler, Maxwell, Pascal, Volta, and Ampere; latency limited at small volumes, bandwidth limited at large volumes; working set 36 MiB]


30

PROLONGATION
Interpolation from coarse grid to fine grid

One-to-many mapping (multi-scatter): no synchronization required
Parallelize over fine grid points
Implement as a trivial streaming kernel; coarse grid points are cached in L1 and L2
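A minimal sketch of that streaming structure, under simplifying assumptions (one scalar degree of freedom per site and an explicit fine-to-coarse aggregate map; QUDA's real prolongator also applies the block null-space vectors and the spin/color structure):

// Prolongation sketch: one thread per *fine* site, pure streaming scatter.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void prolongate(float* fine, const float* coarse,
                           const int* aggregate_of_fine_site, int fine_volume)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= fine_volume) return;

    int agg = aggregate_of_fine_site[x];   // which coarse point this fine site belongs to
    // one-to-many mapping: many fine threads read the same coarse value
    // (served out of L1/L2), each writes only its own fine site -> no races
    fine[x] = coarse[agg];                 // real code: apply the block null-space vectors here
}

int main()
{
    const int fine_volume = 1 << 16, block = 4096;   // 16 aggregates of 4096 sites (illustrative)
    float *fine, *coarse; int *map;
    cudaMallocManaged(&fine, fine_volume * sizeof(float));
    cudaMallocManaged(&coarse, (fine_volume / block) * sizeof(float));
    cudaMallocManaged(&map, fine_volume * sizeof(int));
    for (int i = 0; i < fine_volume; i++) map[i] = i / block;
    for (int i = 0; i < fine_volume / block; i++) coarse[i] = float(i);

    prolongate<<<(fine_volume + 255) / 256, 256>>>(fine, coarse, map, fine_volume);
    cudaDeviceSynchronize();
    printf("fine[0]=%f fine[%d]=%f\n", fine[0], block, fine[block]);  // expect 0.0 and 1.0
    return 0;
}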


31

RESTRICTOR
Restriction from fine grid to coarse grid

Many-to-one mapping => race conditions
Use atomics? Lack of determinism
Instead pose as a multi-reduction:
Assign each MG aggregate to a thread block
Many-to-one mapping now local to an SM
Synchronization only within the thread block
Minimizes total memory traffic

[Memory hierarchy figure: 64 GB/s bi-directional host link; 1.6 TB/s; > 3 TB/s; > 20 TB/s]
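A minimal sketch of the multi-reduction formulation under simplifying assumptions (one scalar dof per site; the real restrictor also projects onto the null-space vectors and handles spin/color): one thread block per MG aggregate, a shared-memory tree reduction, and a single coarse write, so all the many-to-one traffic and synchronization stay inside one SM.

// Restriction as a block-local multi-reduction: one thread block per aggregate.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int AGG_SIZE = 256;   // fine sites per aggregate (assumed = blockDim.x)

__global__ void restrict_kernel(float* coarse, const float* fine)
{
    __shared__ float partial[AGG_SIZE];

    int agg  = blockIdx.x;                        // one block per MG aggregate
    int lane = threadIdx.x;                       // one thread per fine site in the aggregate
    partial[lane] = fine[agg * AGG_SIZE + lane];  // real code: project onto null-space vectors
    __syncthreads();

    // tree reduction in shared memory: synchronization is local to the block/SM
    for (int stride = AGG_SIZE / 2; stride > 0; stride >>= 1) {
        if (lane < stride) partial[lane] += partial[lane + stride];
        __syncthreads();
    }

    if (lane == 0) coarse[agg] = partial[0];      // single deterministic write per aggregate
}

int main()
{
    const int n_agg = 64, fine_volume = n_agg * AGG_SIZE;
    float *fine, *coarse;
    cudaMallocManaged(&fine, fine_volume * sizeof(float));
    cudaMallocManaged(&coarse, n_agg * sizeof(float));
    for (int i = 0; i < fine_volume; i++) fine[i] = 1.0f;

    restrict_kernel<<<n_agg, AGG_SIZE>>>(coarse, fine);
    cudaDeviceSynchronize();
    printf("coarse[0] = %f (expect %d)\n", coarse[0], AGG_SIZE);
    return 0;
}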


BLOCK ORTHOGONALIZATION

Block Gram-Schmidt orthogonalization over the vector set

n² many-to-one and one-to-many mappings

Again assign each MG aggregate to a thread block: multi-(reduce-scatters) local to a thread block

Whole computation in a single kernel; minimizes total memory traffic

O(n) memory traffic, O(n²) compute

Forms the basis for the coarse space

[Figure: n vectors; memory hierarchy bandwidths 64 GB/s bi-directional, 1.6 TB/s, > 3 TB/s, > 20 TB/s]


33

COARSE OPERATOR CONSTRUCTION

Need to evaluate RAP - pseudo batched triple matrix product

Need to employ fine-grained parallelization

Perform each batched matrix product in parallel

Each coarse matrix element has many contributions (many-to-one)

Cannot pose as a tree reduction, so we do use atomics

Floating point atomics are non-deterministic

Use 32-bit integer atomics for associativity and determinism

Minimize contention by using atomic hierarchy

thread -> shared memory atomic -> L2 atomic
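A minimal sketch of the deterministic accumulation idea; the fixed-point scale, the single accumulator, and the data are illustrative assumptions rather than QUDA's actual scheme. Contributions are converted to 32-bit fixed point and summed with integer atomics, first into shared memory within each block and then with one global (L2) atomic per block, so the result is bitwise independent of the order in which the contributions arrive.

// Deterministic many-to-one accumulation using integer (fixed-point) atomics,
// with a shared-memory -> global (L2) atomic hierarchy to reduce contention.
#include <cstdio>
#include <cuda_runtime.h>

constexpr float SCALE = 1 << 20;   // assumed fixed-point scale; choose from known value bounds

__global__ void accumulate(int* coarse_elem_fixed, const float* contributions, int n)
{
    __shared__ int block_sum;
    if (threadIdx.x == 0) block_sum = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // integer addition is associative, so the result does not depend on
        // the order in which the atomics land -> bitwise reproducible
        atomicAdd(&block_sum, __float2int_rn(contributions[i] * SCALE));
    }
    __syncthreads();

    if (threadIdx.x == 0) atomicAdd(coarse_elem_fixed, block_sum);  // one L2 atomic per block
}

int main()
{
    const int n = 1 << 20;
    float* contrib; int* acc;
    cudaMallocManaged(&contrib, n * sizeof(float));
    cudaMallocManaged(&acc, sizeof(int));
    for (int i = 0; i < n; i++) contrib[i] = 1.0f / n;
    *acc = 0;

    accumulate<<<(n + 255) / 256, 256>>>(acc, contrib, n);
    cudaDeviceSynchronize();
    printf("sum = %f (expect ~1.0)\n", *acc / SCALE);
    return 0;
}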


34

CHROMA HMC ON SUMMIT

From Titan running 2016 code to Summit running 2019 code we see >82x speedup in HMC throughput

Multiplicative speedup coming from mapping a hierarchical algorithm to a hierarchical machine

Highly optimized multigrid for gauge field evolution

Mixed precision is an important piece of the puzzle:
• double – outer defect correction
• single – GCR solver
• half – preconditioner
• int32 – deterministic parallel coarsening

KC, Bálint Joó, Mathias Wagner, Evan Weinberg, Frank Winter, Boram Yoon

[Bar chart: wallclock time (seconds) for the Chroma ECP benchmark on Titan (1024x K20X), Titan (512x K20X), Summit (128x V100), Summit (128x V100, March 2019), and Summit (128x V100, Nov 2019); bar values 4006, 1878, 974, 439, and 392 seconds. Annotations: 4.1x faster on 2x fewer GPUs (~8x gain); 10.2x faster on 8x fewer GPUs (~82x gain)]


35

HIERARCHICAL COMMUNICATION


36

ONE COMMUNICATIONS API TO RULE THEM ALL?

For example, DGX-1 nodes: 8x V100 GPUs connected through NVLink; 4x EDR IB for inter-node communication; optimal placement of GPUs and NICs

Intra-node: DMA or load/store over NVLink @ 50 GB/s

Inter-node: Message passing over IB @ 6 GB/s

Can we use a single exchange API across this hierarchy?

[DGX-1 topology figure: 8x V100 GPUs in a hybrid cube-mesh NVLink topology, with 2 CPUs, 4 NICs, and PCIe switches. Caption (Figure 4): DGX-1 uses an 8-GPU hybrid cube-mesh interconnection network topology; the corners of the mesh-connected faces of the cube are connected to the PCIe tree network, which also connects to the CPUs and NICs.]


37

MULTI GPU BUILDING BLOCKS

Halo packing kernel → Interior kernel → Halo communication → Halo update kernel

[Figure: multi-GPU parallelization of the lattice, with face exchange and wrap-around between neighbouring domains]
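A minimal host-side sketch of how these building blocks are typically overlapped; the kernels are empty stubs, the buffer names and sizes are assumptions, and this is a generic CUDA-aware MPI pattern rather than QUDA's actual scheduling. The faces are packed, the halo exchange is posted on device pointers, the interior kernel runs concurrently, and the halo update is applied once the messages have arrived.

// Pack / interior / communicate / halo-update pipeline with CUDA-aware MPI.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void pack_halo(float* sendbuf)  { /* gather boundary sites into sendbuf */ }
__global__ void interior(float* out)       { /* stencil on sites with no off-node deps */ }
__global__ void halo_update(float* out, const float* recvbuf) { /* apply received faces */ }

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int fwd = (rank + 1) % size, bwd = (rank - 1 + size) % size;

    const int halo_elems = 1 << 16;
    float *field, *sendbuf, *recvbuf;
    cudaMalloc(&field,   halo_elems * sizeof(float));
    cudaMalloc(&sendbuf, halo_elems * sizeof(float));
    cudaMalloc(&recvbuf, halo_elems * sizeof(float));

    cudaStream_t comm_stream, compute_stream;
    cudaStreamCreate(&comm_stream);
    cudaStreamCreate(&compute_stream);

    // 1. pack the faces on the communication stream
    pack_halo<<<64, 256, 0, comm_stream>>>(sendbuf);
    cudaStreamSynchronize(comm_stream);

    // 2. start the halo exchange (CUDA-aware MPI: device pointers passed directly) ...
    MPI_Request req[2];
    MPI_Irecv(recvbuf, halo_elems, MPI_FLOAT, bwd, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendbuf, halo_elems, MPI_FLOAT, fwd, 0, MPI_COMM_WORLD, &req[1]);

    // 3. ... while the interior kernel runs concurrently on another stream
    interior<<<256, 256, 0, compute_stream>>>(field);

    // 4. once the messages have arrived, apply the halo update
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    halo_update<<<64, 256, 0, compute_stream>>>(field, recvbuf);
    cudaStreamSynchronize(compute_stream);

    MPI_Finalize();
    return 0;
}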


38

SOME TERMINOLOGY

P2P: direct exchange between GPUs

CUDA-aware MPI: MPI can take GPU pointers and can do batched, overlapping transfers (high bandwidth)

GPU Direct RDMA (implies CUDA-aware MPI): RDMA transfer from/to GPU memory, here NIC <-> GPU

P2P, CUDA aware, GPU Direct RDMA

[DGX-1 topology figure, as on the earlier slide; timeline illustrating MPI_Sendrecv]


39

QUDA MULTI-GPU SCALING

Key technologies employed:
Peer-to-peer communication: skip MPI and directly utilize DMA copies
GPU Direct RDMA (GDR): GPU <-> NIC <-> GPU without traversing CPU memory
Node topology awareness
Autotuning for the Dslash communication policy and kernel launch parameters (dependent on local problem size)

Starting point (already multiple years of optimization)


40

MULTI-GPU PROFILE
Overlapping comms and compute

32⁴ local volume, single precision; DGX-1, 1x2x2x2 partitioning

[Profile timeline: packing kernel, interior kernel, P2P copies, halo kernels (t, z, y)]


41

REDUCING LOCAL PROBLEM SIZE
Single-GPU performance

[Chart: GFlop/s vs. lattice length (56 down to 8) in half, single, and double precision; strong scaling pushes towards the smallest reasonable problem size]


42

STRONG SCALING PROFILE
Overlapping comms and compute

16⁴ local volume, half precision; DGX-1, 1x2x2x2 partitioning

[Profile timeline: packing kernel, interior kernel, P2P copies, fused halo kernels]


43

STRONG SCALING PROFILE
Latencies ate my scaling

16⁴ local volume, half precision

[Profile timeline: packing kernel, interior kernel, P2P copies, fused halo kernels, API calls]


44

WHAT IS THE PROBLEM?

It’s not just data movement we need to minimize

Task marshaling from a lower level of hierarchy (e.g., host) adds latency

Data consistency requires synchronization between CPU and GPU

Ideally: offload of task marshaling to GPU thread to have same locality as data

[DGX-1 topology figure, as on the earlier slide]


45

NVSHMEM

NVSHMEM features:
Symmetric memory allocations in device memory
Communication API calls on CPU (standard and stream-ordered)
Kernel-side communication (API and LD/ST) between GPUs
NVLink and PCIe support (intra-node)
InfiniBand support (inter-node)
Interoperability with MPI and OpenSHMEM libraries

Implementation of OpenSHMEM¹, a Partitioned Global Address Space (PGAS) library

¹ SHMEM from Cray's "shared memory" library, https://en.wikipedia.org/wiki/SHMEM

Available in the NVIDIA HPC SDK


46

DSLASH NVSHMEM IMPLEMENTATION

Keep the general structure of packing, interior, and exterior Dslash

Use nvshmem_ptr for intra-node remote writes (fine-grained); the packing buffer is located on the remote device

Use nvshmem_putmem_nbi to write to a remote GPU over IB (1 RDMA transfer)

Need to make sure writes are visible: nvshmem_barrier_all_on_stream, or a barrier kernel that only waits for writes from neighbors

First exploration
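A minimal sketch of that structure for a single exchange direction; the buffer names, the scalar halo payload, and the host-synchronous ordering are simplifying assumptions (the real implementation keeps everything stream-ordered, e.g. the stream variant of the barrier is spelled nvshmemx_barrier_all_on_stream in the NVSHMEM API). Intra-node neighbours are written directly through nvshmem_ptr from the packing kernel; inter-node neighbours get one nvshmem_putmem_nbi; a barrier then makes the writes visible before the halo update runs.

// NVSHMEM halo-exchange sketch for one direction (not QUDA code).
#include <nvshmem.h>
#include <cuda_runtime.h>

constexpr int HALO = 4096;

__global__ void pack(float* remote_or_staging, const float* field)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < HALO) remote_or_staging[i] = field[i];     // real code: spin-project here
}

__global__ void halo_update(float* field, const float* halo)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < HALO) field[i] += halo[i];
}

int main()
{
    nvshmem_init();
    int pe = nvshmem_my_pe(), npes = nvshmem_n_pes();
    int neighbor = (pe + 1) % npes;

    // symmetric allocations: the same buffer exists on every PE
    float* halo_buf = (float*) nvshmem_malloc(HALO * sizeof(float));
    float* staging  = (float*) nvshmem_malloc(HALO * sizeof(float));
    float* field;   cudaMalloc(&field, HALO * sizeof(float));
    cudaMemset(field, 0, HALO * sizeof(float));

    // intra-node: nvshmem_ptr returns a load/store-able pointer to the
    // neighbour's halo_buf; inter-node it returns NULL and we stage locally
    float* remote = (float*) nvshmem_ptr(halo_buf, neighbor);

    pack<<<(HALO + 255) / 256, 256>>>(remote ? remote : staging, field);
    cudaDeviceSynchronize();

    if (!remote)   // one RDMA transfer of the packed face to the neighbour
        nvshmem_putmem_nbi(halo_buf, staging, HALO * sizeof(float), neighbor);

    nvshmem_barrier_all();   // make all remote writes visible (stream-ordered in the real code)

    halo_update<<<(HALO + 255) / 256, 256>>>(field, halo_buf);
    cudaDeviceSynchronize();

    nvshmem_free(halo_buf);  nvshmem_free(staging);
    nvshmem_finalize();
    return 0;
}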


47

NVSHMEM DSLASH
First exploration

16⁴ local volume, half precision; DGX-1, 1x2x2x2 partitioning

[Profile timeline: packing kernel, interior kernel, fused halo kernel, barrier kernel]


48

FUSED DSLASH + PACKING KERNEL

Interior + pack kernel: pack_blocks blocks do the packing (nvshmem_signal for each direction), while interior_blocks = grid_dim - pack_blocks blocks do the interior

Halo kernel: nvshmem_wait_until for the packed data, then applies the exterior (halo) update


49

NVSHMEM + FUSING KERNELS
No extra packing and barrier kernels needed

16⁴ local volume, half precision; DGX-1, 1x2x2x2 partitioning

[Profile timeline: Interior + Pack + Flag kernel, Barrier + Fused Halo kernel]


50

KERNEL FUSION REDUCES LATENCY

[Bar chart: GFlop/s per GPU for IPC copies, IPC remote write, NVSHMEM with barrier, and NVSHMEM kernel fusion. DGX-1, 16⁴ local volume, Wilson, half precision, 1x2x2x2 partitioning]


51

CAN WE GO FURTHER?

One kernel to rule them all! Communication is handled in the kernel and latencies are hidden.


52

DSLASH ÜBER KERNEL

Packing: pack_blocks blocks; nvshmem_signal for each direction
Interior: interior_blocks = grid_dim - pack_blocks - exterior_blocks blocks; an atomic flag is set by the last interior block
Exterior (Halo): exterior_blocks blocks; atomic wait for the interior flag and nvshmem_wait_until for the remote packed data
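A minimal single-kernel sketch of the block-role split in plain CUDA; an atomic counter plus __threadfence stands in for the NVSHMEM signal/wait machinery, the work in each role is a trivial stand-in, and the example assumes the whole grid is co-resident (which the real über kernel also arranges by sizing the grid to the GPU).

// Über-kernel sketch: one launch, with thread blocks taking different roles.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dslash_uber(float* out, float* sendbuf, const float* in, int n,
                            int pack_blocks, int exterior_blocks,
                            unsigned int* interior_done)
{
    int interior_blocks = gridDim.x - pack_blocks - exterior_blocks;

    if (blockIdx.x < pack_blocks) {
        // --- packing role: gather the faces (and nvshmem_signal per direction in real code) ---
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += pack_blocks * blockDim.x)
            sendbuf[i] = in[i];                              // stand-in for face packing
    } else if (blockIdx.x < pack_blocks + interior_blocks) {
        // --- interior role: stencil on sites with no off-node dependencies ---
        for (int i = (blockIdx.x - pack_blocks) * blockDim.x + threadIdx.x; i < n;
             i += interior_blocks * blockDim.x)
            out[i] = 2.0f * in[i];                           // stand-in for the interior stencil
        __threadfence();                                     // publish this thread's writes
        __syncthreads();
        if (threadIdx.x == 0) atomicAdd(interior_done, 1u);  // this block's interior work is done
    } else {
        // --- exterior role: wait for the interior (and nvshmem_wait_until for halos in real code) ---
        if (threadIdx.x == 0)
            while (atomicAdd(interior_done, 0u) < (unsigned) interior_blocks) { /* spin */ }
        __threadfence();   // fence + atomic stand in for proper release/acquire (cf. cuda::atomic)
        __syncthreads();
        for (int i = (blockIdx.x - pack_blocks - interior_blocks) * blockDim.x + threadIdx.x;
             i < n; i += exterior_blocks * blockDim.x)
            out[i] += 1.0f;                                  // stand-in for the halo update
    }
}

int main()
{
    const int n = 1 << 20;
    float *in, *out, *sendbuf;  unsigned int* flag;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    cudaMallocManaged(&sendbuf, n * sizeof(float));
    cudaMallocManaged(&flag, sizeof(unsigned int));
    for (int i = 0; i < n; i++) in[i] = 1.0f;
    *flag = 0;

    dslash_uber<<<16, 256>>>(out, sendbuf, in, n, 2, 2, flag);  // 2 pack, 12 interior, 2 exterior blocks
    cudaDeviceSynchronize();
    printf("out[0] = %f (expect 3)\n", out[0]);                 // 2 (interior) + 1 (halo)
    return 0;
}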


53

ÜBER KERNEL

16⁴ local volume, half precision; DGX-1, 1x2x2x2 partitioning

[Profile timeline: a single über kernel (packing + interior + exterior)]


54

LATENCY REDUCTIONS

[Bar chart: GFlop/s per GPU for IPC copies, IPC remote write, NVSHMEM with barrier, NVSHMEM kernel fusion, and NVSHMEM über kernel. DGX-1, 16⁴ local volume, Wilson, half precision, 1x2x2x2 partitioning]


55

DGX-2: FULL NON-BLOCKING BANDWIDTH
2.4 TB/s bisection bandwidth

[Figure: 16 V100 GPUs (GPU0 through GPU15) fully connected through 12 NVSwitches]


56

DGX-2 STRONG SCALING
Global volume 32⁴, Wilson-Dslash

[Chart: GFlop/s vs. #GPUs (1, 2, 4, 8, 16) for MPI and SHMEM in double, single, and half precision; at 16 GPUs the local volume is 16⁴]


57

MULTI-NODE STRONG SCALING
DGX SuperPOD (DGX-2 nodes: 16x V100 (32 GB), 8x EDR IB)

Wilson Dslash, 64³×128 global volume, half precision

[Chart: GFlop/s vs. #GPUs (16 to 1024) for MPI and NVSHMEM; speedups of 1.4x, 1.4x, and 1.3x marked; the "sweet spot for simulations" is at a 16⁴ local volume, half precision]


58

SUMMARY


59

LQCD OUTLOOK

Reformulate vector-vector and matrix-vector problems as matrix-matrix: multi-RHS, communication-avoiding, and block solvers; deploy using tensor cores where possible

Keep data on chip and minimize hierarchy-level jumping: use NVSHMEM to fuse across communication boundaries; increasingly possible with the latest generation of GPUs

Follow trends towards future architectures and aim for super-linear scaling


60

QUDA NODE PERFORMANCE OVER TIME
Multiplicative speedup through software and hardware

[Same chart as earlier: Wilson FP32 GFLOPS and cumulative speedup vs. year (2008 to 2019), with Multi-GPU capable, Adaptive Multigrid, Optimized Multigrid, and Deflated Multigrid milestones and ~300x overall speedup, now extended with 2022 and 2024 marked "?"]

Speedup determined by measured time to solution for solving the Wilson operator against a random source on a V = 24³×64 lattice, β = 5.5, Mπ = 416 MeV. One node is defined to be 3 GPUs.


61

SUMMARY

Algorithms achieve maximal speedup when we:
Expose their parallelism to map onto ever-more parallel processors
Identify their shape to map to ever-more hierarchical processors

Communications reach optimal performance when we:
Move task marshaling close to data, enabling a single communications API
Utilize fine-grained communications which maximize latency hiding


62

BACKUP


63

PARALLELISM AND LOCALITY

MULTIPLE RIGHT-HAND SIDES

[Chart: GFlop/s vs. number of right-hand sides (1 to 16) on Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), and Volta (2017); 48³×12, HISQ, single precision, one code; 2.5x and 3.5x gains marked]


64

MULTIGRID IS FULL OF MATRIX MULTIPLICATION

Explicit

• Coarse-operator construction

• Block-orthogonalization

Implicit (through reformulation)

• Multi-RHS operators

• Coarse-grid LU solver

Reworking the pipeline to expose algorithms in matrix-matrix form

Maximize locality, parallelism for optimal mapping onto the GPU hierarchy

LATTICE QCD WITH TENSOR CORES

Modern GPUs have high-throughput tensor-core functionality (Tesla V100: FP64 7.5 TFLOPS, FP32 15 TFLOPS, Tensor 125 TFLOPS)

Kernel fusion: the 4-d volume (~10⁸) is much larger than the size of the 5th dimension (~10¹); fusing the 4-d spatial operations with the 5th-dimension operations right after reduces global memory traffic

[Fused operator expressions for the domain-wall-type operator, grouping each M₅-type application with the neighbouring D_w application]

KC, Jung, Mawhinney, Tu
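As a minimal illustration of the tensor-core path, here is a generic 16x16x16 WMMA tile multiply in FP16 with FP32 accumulation; it is not QUDA's coarse-operator kernel, just the basic fragment API that such kernels build on (requires Volta or newer, compiled for sm_70+).

// Single-warp 16x16x16 tensor-core multiply via the CUDA WMMA API.
#include <cstdio>
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_tile(const half* A, const half* B, float* C)
{
    // one warp owns one 16x16 output tile
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);          // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // C += A * B on the tensor cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

int main()
{
    half *A, *B;  float* C;
    cudaMallocManaged(&A, 256 * sizeof(half));
    cudaMallocManaged(&B, 256 * sizeof(half));
    cudaMallocManaged(&C, 256 * sizeof(float));
    for (int i = 0; i < 256; i++) { A[i] = __float2half(1.0f); B[i] = __float2half(1.0f); }

    wmma_tile<<<1, 32>>>(A, B, C);                  // one warp
    cudaDeviceSynchronize();
    printf("C[0] = %f (expect 16)\n", C[0]);
    return 0;
}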


65

TENSOR CORES FOR MULTIGRID

Multigrid is a preconditioner

16-bit precision is perfectly adequate

No impact on convergence rate

Majority of MG setup kernels now implemented using tensor cores

1.5x-10x kernel speedups observed

Ongoing WIP

Expect integer speedup factors on Summit once fully reformulated

[Chart: TFLOPS of the Yhat kernel on Quadro GV100 (32 null-space vectors) for 4x4x4x8 and 2x2x2x2 coarse grids, comparing single precision, half precision, and tensor cores]


66

FINE-GRAINED SYNCHRONIZATION

Need replacement for kernel boundaries

Packing is independent of interior

interior and exterior update on boundary → possible race condition

use cuda::atomic from libcu++

libcu++ gives us std::atomic in CUDA

#include <atomic>
std::atomic<int> x;

#include <cuda/std/atomic>
cuda::std::atomic<int> x;

#include <cuda/atomic>
cuda::atomic<int, cuda::thread_scope_block> x;
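A minimal sketch of the idea, with a hypothetical flag name and assuming a CUDA 11-era libcu++: a device-scope cuda::atomic lets producer ("interior") blocks publish completion and consumer ("exterior") blocks wait for it inside a single kernel, replacing the implicit synchronization of a kernel boundary.

// Sketch (not QUDA code): a device-scope cuda::atomic flag as an in-kernel barrier substitute.
#include <cstdio>
#include <cuda/atomic>

// device-scope atomic visible to all blocks of the grid
__device__ cuda::atomic<int, cuda::thread_scope_device> interior_done(0);

__global__ void producer_consumer(float* data, int n_producer_blocks)
{
    if (blockIdx.x < n_producer_blocks) {
        // "interior" role: do some work, then publish completion (seq_cst by default)
        if (threadIdx.x == 0) {
            data[blockIdx.x] = 1.0f;
            interior_done.fetch_add(1);
        }
    } else {
        // "exterior" role: wait until all interior blocks are done, then consume
        if (threadIdx.x == 0) {
            while (interior_done.load() < n_producer_blocks) { /* spin */ }
            float sum = 0.0f;
            for (int i = 0; i < n_producer_blocks; i++) sum += data[i];
            printf("block %d sees sum = %f\n", blockIdx.x, sum);
        }
    }
}

int main()
{
    float* data;
    cudaMalloc(&data, 8 * sizeof(float));
    // 8 producer blocks + 2 consumer blocks; assumes the grid is co-resident
    producer_consumer<<<10, 32>>>(data, 8);
    cudaDeviceSynchronize();
    return 0;
}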

Page 67: EXPLOITING HIERARCHICAL ALGORITHMS ON EVERMORE … · 2020. 11. 28. · 5 HPC IS GETTING MORE HIERARCHICAL 2009 Jaguar: Cray XT5 @ ORNL 1.75 PFLOPS, 18,688 nodes, 2x AMD 6-core, Seastar2+