GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend...

20
PH - SFT GeantV Geometry: SIMD abstraction and interfacing with CUDA Johannes de Fine Licht (johannes.defi[email protected]) Sandro Wenzel ([email protected]) Thanks to Michal Husejko and Romain Wartel at TechLab for providing hardware.

Transcript of GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend...

Page 1: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

PH - SFT

GeantV Geometry: SIMD abstraction and interfacing with CUDA

Johannes de Fine Licht ([email protected]) Sandro Wenzel ([email protected]) !Thanks to Michal Husejko and Romain Wartel at TechLab for providing hardware.

Page 2: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT

Abstracted algorithms

Write common code for multiple types of SIMD “backends” (Vc/Cilk/CUDA).

Eases code maintenance; requires only one kernel to be written for a given functionality

Must retain performance of a low level implementation

���2PH - SFT

Page 3: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT ���3

Illustrating scalar/SIMD abstraction and kernelsSingle particle

interface

C-like abstract kernels

Vector interface

External CUDA kernels

Page 4: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT ���3

Illustrating scalar/SIMD abstraction and kernelsSingle particle

interface

C-like abstract kernels

Vector interface

External CUDA kernels

Vc Looper Thread access

Scalar Looper Cilk Plus Looper

Page 5: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT ���3

Illustrating scalar/SIMD abstraction and kernelsSingle particle

interface

C-like abstract kernels

Backend instantiation

Cilk Plus Backend

Vector interface

External CUDA kernels

Vc Looper Thread access

Scalar Looper Cilk Plus Looper

Scalar Backend Vc Backend CUDA Backend

Page 6: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT ���3

Illustrating scalar/SIMD abstraction and kernelsSingle particle

interface

C-like abstract kernels

Backend instantiation

Cilk Plus Backend

Vector interface

External CUDA kernels

Vc Looper Thread access

Scalar Looper Cilk Plus Looper

C-like specialized kernels

Scalar Backend Vc Backend CUDA Backend

Kernel instantiation

Page 7: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT

Generic kernels on abstract types

���4

Generic programming; write algorithms in a generic way on abstract types

!Implementation depends on the backend !Backend determines the types on which the high level operations are performed

Page 8: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT

Generic kernels on abstract types

���4

Generic programming; write algorithms in a generic way on abstract types

!Implementation depends on the backend !Backend determines the types on which the high level operations are performed

Page 9: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT

Implementations wrapped in types

Operations are abstracted

Backends implement all operations

Vc already supports arithmetics

CUDA uses primitive types

Wrapper structs are implemented if needed

���5

Page 10: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT

Interfacing with CUDA

���6

Involving the GPU should be easy !Principle of common code !Issue: GPU code needs nvcc, but we want a different compiler for CPU

vs.GNU/Clang:

C++11 Vc

Cilk

nvcc: CUDA

Page 11: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT

Host memory !vecgeom::CudaManager

vecgeom::UnplacedBox vecgeom::LogicalVolume …

Dual namespaces

���7

GPU memory !vecgeom_cuda::UnplacedBox vecgeom_cuda::LogicalVolume …

CUDA kernel

namespace vecgeom

namespace vecgeom_cuda

namespace vecgeom

Utilize distinct namespaces but same code for CPU and GPU !Classes live in one or both environments !Provide abstraction to interface between environments

Page 12: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT

Host memory !vecgeom::CudaManager

vecgeom::UnplacedBox vecgeom::LogicalVolume …

Dual namespaces

���7

GPU memory !vecgeom_cuda::UnplacedBox vecgeom_cuda::LogicalVolume …

CUDA kernel

namespace vecgeom

namespace vecgeom_cuda

Allocation

namespace vecgeom

Utilize distinct namespaces but same code for CPU and GPU !Classes live in one or both environments !Provide abstraction to interface between environments

Page 13: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT

Host memory !vecgeom::CudaManager

vecgeom::UnplacedBox vecgeom::LogicalVolume …

Dual namespaces

���7

GPU memory !vecgeom_cuda::UnplacedBox vecgeom_cuda::LogicalVolume …

Launch kernel CUDA kernel

namespace vecgeom

namespace vecgeom_cuda

Instantiate

Allocation

Synchronize content

namespace vecgeom

Utilize distinct namespaces but same code for CPU and GPU !Classes live in one or both environments !Provide abstraction to interface between environments

Page 14: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT

Compilation scheme

���8

namespace vecgeom .o

Compile for C++11 with vector backend

Source files .cu

Compile with NVCC with CUDA backend

namespace vecgeom_cuda

.o

Optional modules (ROOT, benchmarking…)

Source files .cpp

+ CUDA interface .cu

+

Executable

Linking

Interface header

Page 15: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT

Status

���9

Nu m b e r of p a rt ic le s loca te d

0 .0

0 .5

1 .0

1 .5

2 .0

2 .5

3 .0

3 .5

4 .0

4 .5

Sp

ee

du

p

Pa rt ic le loca t ion in b ox g e om e t ry

GPU

GPU with ove rh e a d

CPU

Preliminary benchmarks run in hybrid CPU/GPU environment for four-level box geometry.

Location of particles is a tree algorithm, so speedups are only seen at very high particle multiplicity.

Common code for box methods runs for scalar, Vc, Cilk and CUDA backends !Implemented GPU memory synchronization !Dual namespaces allows dispatching work to CPU or GPU !Location can run in either environment !Results are in agreement

!

Page 16: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT

Appendix: CPU/GPU same interface

���10

Main executable compiled with C++ compiler

CUDA kernel compiled with NVCC

Same code used

Page 17: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT

Appendix: Common code

���11

Page 18: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT

Appendix: Template hierarchy

���12

Page 19: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT ���13

External CUDA kernels

C-like abstract kernels

Vector interface

Page 20: GeantV Geometry: SIMD abstraction and interfacing with CUDA · C-like abstract kernels Backend instantiation Cilk Plus Backend Vector! interface External! CUDA kernels Vc Looper Thread

Johannes de Fine Licht ([email protected])

PH - SFT ���13

External CUDA kernels

C-like abstract kernels

Scalar Backend

Vc Backend

Cilk Plus Backend

CUDA Backend

Vector interface