Early Experiences in GPU-enabling the Gadget-2 Simulation Code


Syed Akbar Mehdi and Aamir Shafi
School of Electrical Engineering and Computer Science (SEECS)
National University of Sciences and Technology (NUST)

04/19/2023


Collaborators

•Dr Bryan Carpenter from the University of Portsmouth, UK


Presentation Outline

•Introduction to the HPC research group
•Emergence of Cluster of GPUs
•Programming GPUs
•Introduction to the Gadget-2 Code
•Preliminary Performance Evaluation
•Summary

HPC Research Group at SEECS NUST

•Research interests:
▫Conduct research, development, and evaluation of parallel programming languages, libraries, and paradigms
▫Support computational scientists in doing their job
•Parallel Computing Training:
▫Workshops
▫UG/PG courses at NUST
•Computational resources:
▫Three small-scale compute clusters

Computational Resources

A Tier-2 compliant data centre …

Cluster Name | Brand | # Procs | # Nodes | Memory | OS | Network
Raad | Sunfire v890 | 64 UltraSPARC IV | 4 | 128 GB | Solaris 10 | Myrinet
Barq | Custom Built | 36 Intel Xeon | 9 | 36 GB | OpenSuSE | Myrinet
Burraq | Sun v20z/HP DL 145 | 64 AMD Opterons | 32 | 128 GB | CentOS 5.3 | GigE

MPJ Express†
• MPJ Express is an MPI-like library for parallel Java applications for compute clusters/clouds and multicore processors
• The software is open source and is currently developed and maintained at NUST:
▫Available for free from the SourceForge website
▫http://mpj-express.org
• A recent success story:
▫MPJ Express has been adopted by SHARCNET (https://www.sharcnet.ca), which is an HPC consortium of Canadian academic institutions

† Shafi et al, Nested Parallelism for Multi-core HPC Systems using Java, JPDC, pp 532-545, 69(6), June 2009

Some MPJ Express Users ...


Cluster of GPUs
•Increasing interest in building parallel hardware using a mixture of heterogeneous computing devices:
▫Multicore CPUs
▫Manycore GPUs
•On such hardware, also known as Cluster of GPUs, compute intensive parts of the user application are executed on GPUs
•The current generation of GPUs is capable of executing general purpose computation in a massively parallel fashion

The TOP10 in the TOP500† List


† The TOP500 Project: http://top500.org


A Typical Cluster of GPUs

[Figure: a cluster of eight nodes (Node 1 to Node 8); labeled components of a node: Memory, CPU, GPU]

Graphics Processing Units (GPUs)

•A GPU is a specialized processor that offloads graphics rendering from the host CPU:
▫Or at least this used to be the “old” definition
•Modern GPUs are capable of executing general purpose computation
•Two leading GPU manufacturers are:
▫Nvidia
▫AMD

Floating-Point Operations per Second for the CPU and the GPU

Image courtesy: NVIDIA CUDA Programming Guide version 3.0

Comparison of a CPU and a GPU

Specifications | Core i7 960 | GTX285
Processing Elements | 4 cores, 4 way SIMD @ 3.2 GHz | 30 cores, 8 way SIMD @ 1.5 GHz
Resident Strands/Threads (max) | 4 cores, 2 threads, 4 way SIMD: 32 strands | 30 cores, 32 SIMD vectors, 32 way SIMD: 30720 threads
SP GFLOP/s | 102 | 1080
Memory Bandwidth | 25.6 GB/s | 159 GB/s
Register File | - | 1.875 MB
Local Store | - | 480 kB

CPU versus GPU

Images courtesy: NVIDIA CUDA Programming Guide version 3.0

GPU Architecture and its Programming

[Figure: GPU memory hierarchy. The Host connects to a (Device) Grid that holds Global Memory, Constant Memory, and Texture Memory. Each block, shown as Block (0, 0) and Block (1, 0), has its own Shared Memory. Each thread, shown as Thread (0, 0) and Thread (1, 0), has its own Registers and Local Memory.]

Image courtesy: Ibid

General Purpose GPU (GPGPU) Computing

Image courtesy: Ibid

Compute Unified Device Architecture (CUDA)

• CUDA is the computing engine in NVIDIA graphics processing units (GPUs) that is accessible to software developers through industry standard programming languages:
▫ CUDA offers simple extensions to the C++ language
▫ Nvidia’s way of programming GPUs
▫ Explicit memory management
• OpenCL:
▫ Vendor neutral way of programming GPUs

http://en.wikipedia.org/wiki/Graphics_processing_unit

Scientific Applications using CUDA


GPU-enabling Gadget-2

• Gadget-2:
▫A free production code for cosmological N-body and hydrodynamic computations†.
▫Written in C and already fully parallelized using MPI.
▫Versions used in various astrophysics research papers, including the Millennium Simulation.
• Our aim in this study is to execute compute intensive parts of this code on the GPU:
▫Gadget-2 is an N-body simulation and highly irregular

† http://www.mpa-garching.mpg.de/gadget

Dynamics in Gadget

•Gadget is “just” simulating the movement of (a lot of) representative particles under the influence of Newton’s law of gravity, plus hydrodynamic forces on gas

•Classical N-body problem


Gadget Main Loop

•Schematic view of the Gadget code:
    … Initialize …
    while (not done) {
        move_particles();              // update positions
        domain_Decomposition();
        compute_accelerations();
        advance_and_find_timesteps();  // update velocities
    }

•Most of the interesting work happens in domain_Decomposition() and compute_accelerations().

Computing Forces

•The function compute_accelerations() must compute forces experienced by all particles.
▫In particular, it must compute the gravitational force.
•Because gravity is a long range force, every particle affects every other. Naively, total cost is O(N^2).
•With N ≈ 10^10, this is infeasible.
•Need some kind of approximation ...


Barnes-Hut Tree

Recursively split space into an octree (a quadtree in this 2D example) until no node contains more than one particle.
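The recursive splitting can be sketched in C for the 2D (quadtree) case. This is an illustrative sketch, not Gadget-2's tree construction: the `QuadNode` layout and all function names are hypothetical, and the code assumes that no two particles share exactly the same position.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical quadtree node for a square cell. */
typedef struct QuadNode {
    double cx, cy, half;        /* cell centre and half-width */
    int count;                  /* particles stored in this subtree */
    double px, py;              /* the particle; meaningful only in a leaf */
    struct QuadNode *child[4];  /* all NULL for a leaf */
} QuadNode;

static QuadNode *quad_new(double cx, double cy, double half)
{
    QuadNode *n = calloc(1, sizeof *n);
    assert(n != NULL);
    n->cx = cx; n->cy = cy; n->half = half;
    return n;
}

/* Return (creating it if needed) the child cell that contains (x, y). */
static QuadNode *child_for(QuadNode *n, double x, double y)
{
    int q = (x >= n->cx ? 1 : 0) + (y >= n->cy ? 2 : 0);
    if (n->child[q] == NULL) {
        double h = n->half / 2.0;
        n->child[q] = quad_new(n->cx + ((q & 1) ? h : -h),
                               n->cy + ((q & 2) ? h : -h), h);
    }
    return n->child[q];
}

/* Insert (x, y), splitting cells until every leaf holds at most one
   particle. Assumes all particle positions are distinct. */
void quad_insert(QuadNode *n, double x, double y)
{
    if (n->count == 0) {             /* empty leaf: store the particle */
        n->px = x; n->py = y; n->count = 1;
        return;
    }
    if (n->count == 1)               /* occupied leaf: push resident down */
        quad_insert(child_for(n, n->px, n->py), n->px, n->py);
    quad_insert(child_for(n, x, y), x, y);
    n->count++;
}

/* Largest number of particles found in any leaf (1 for a valid tree). */
int quad_max_leaf_count(const QuadNode *n)
{
    int is_leaf = 1, best = 0;
    for (int q = 0; q < 4; q++) {
        if (n->child[q] != NULL) {
            int c = quad_max_leaf_count(n->child[q]);
            is_leaf = 0;
            if (c > best)
                best = c;
        }
    }
    return is_leaf ? n->count : best;
}

void quad_free(QuadNode *n)
{
    if (n == NULL)
        return;
    for (int q = 0; q < 4; q++)
        quad_free(n->child[q]);
    free(n);
}
```

Two particles that are very close together force several levels of splitting, which is one source of the irregularity that makes the tree hard to map onto a GPU.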


Distribution of BH Tree in Gadget†

† Springel et al, The cosmological simulation code GADGET-2, Mon.Not.Roy.Astron.Soc. 364 (2005) 1105-1134


GPU-enabling the Gadget-2 code

•The big idea is “to perform tree walk (force calculation) on the GPU instead of CPU”
•Steps:
▫Copy particles and tree information to the GPU memory
▫Perform tree walk in parallel for multiple particles on the GPU
▫Copy particles array back to the CPU

[Figure: four MPI processes (Process 0 on Node A, Process 1 on Node B, Process 2 on Node C, Process 3 on Node D) joined by the Cluster Interconnect. Each node has a CPU and a GPU with DRAM, and each GPU runs thread blocks (Block 0, Block 1, ...). Zoomed-in thread block on the GPU of Process 0: thread (0,0) does the tree walk for particle 1, thread (1,0) for particle 2, ..., thread (BlkWidth,0) for particle j; thread (0,1) for particle j+1, thread (1,1) for particle j+2, ..., thread (BlkWidth,1) for particle k.]

GPU-enabled Gadget-2 code executing on a cluster of four nodes. Each node has a CPU and a GPU …
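The figure's assignment of one tree walk per thread amounts to flattening a 2D thread coordinate into a particle index. A minimal sketch in C follows, assuming a row-major layout inside each block and 0-based particle indices (the figure counts from 1). The type `Dim2` and both function names are hypothetical, and the actual indexing in the authors' kernel may differ.

```c
#include <assert.h>

typedef struct { int x, y; } Dim2;   /* hypothetical block dimensions */

/* Particle handled by thread (tx, ty) of thread block `block`.
   Threads are numbered row by row inside a block, and blocks follow
   one another. Returns -1 for threads beyond the last particle. */
int particle_for_thread(int block, Dim2 block_dim, int tx, int ty,
                        int n_particles)
{
    assert(tx >= 0 && tx < block_dim.x && ty >= 0 && ty < block_dim.y);
    int threads_per_block = block_dim.x * block_dim.y;
    int idx = block * threads_per_block + ty * block_dim.x + tx;
    return idx < n_particles ? idx : -1;
}

/* Number of blocks needed so that every particle gets a thread. */
int blocks_needed(int n_particles, Dim2 block_dim)
{
    int threads_per_block = block_dim.x * block_dim.y;
    return (n_particles + threads_per_block - 1) / threads_per_block;
}
```

With 16 x 16 blocks and 1000 particles, four blocks are launched and the last 24 threads of the fourth block have no particle to process.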


Preliminary Performance Evaluation

[Figure: bar chart of Execution Time (in seconds), axis from 0 to 3000. Three bars: 3203.45 s (1x), 1674.05 s (1.91x), and 226.69 s (14.1x)]
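The speedup labels can be checked from the reported execution times, taking the slowest run (3203.45 s) as the baseline. The symbols S and T are introduced here only for this check:

```latex
S = \frac{T_{\text{baseline}}}{T}, \qquad
\frac{3203.45}{1674.05} \approx 1.91, \qquad
\frac{3203.45}{226.69} \approx 14.1
```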


Optimization
•Reduce the memory transfer between the CPU and the GPU
•“42 TFlops Hierarchical N-body Simulations on GPUs with Applications in both Astrophysics and Turbulence” by Hamada et al:
▫Multiple tree walks use the same “interaction list”
•Perform the tree walk “level by level”:
▫Currently implementing this algorithm

Summary
•Our group focuses on the Computer Science (CS) aspects of HPC:
▫Languages, libraries, and scientific software
•Need for collaboration between CS and the scientific community on a national scale before reaching out to the industry:
▫At NUST we are going in this direction
•“Cluster of GPUs” is an interesting trend, which is likely to continue
•We discussed GPU-enabling the Gadget-2 code

Thanks


TreePM

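•Artificially split gravitational potential into two parts:

\[
\frac{Gm}{r} =
\underbrace{\frac{Gm}{r}\,\mathrm{erfc}\!\left(\frac{r}{2 r_s}\right)}_{\Phi_{\mathrm{short}}(r)}
+ \underbrace{\frac{Gm}{r}\left[1 - \mathrm{erfc}\!\left(\frac{r}{2 r_s}\right)\right]}_{\Phi_{\mathrm{long}}(r)}
\]

Calculate Φshort using BH; calculate Φlong by projecting particle mass distribution onto a mesh, then working in Fourier space.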


Domain Decomposition

•Can’t just divide space in a fixed way, because some regions will have many more particles than others – poor load balancing.

•Can’t just divide particles in a fixed way, because particles move independently through space, and we want to keep physically close particles on the same processor, as far as practical – communication problem.

Peano-Hilbert Curve †

† Picture borrowed from http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf

Decomposition based on P-H Curve

•Sort particles by position on Peano-Hilbert curve, then divide evenly into P domains.

•Characteristics of this decomposition:
▫Good load balancing.
▫Domains simply connected and quite “compact” in real space, because particles that are close along P-H curve are close in real space.
▫Domains have relatively simple mapping to BH octree nodes.
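The sort-then-split idea can be sketched in C. As a simplification, the sketch uses the well-known 2D Hilbert curve index on an integer grid, whereas Gadget-2 works with a 3D Peano-Hilbert curve. The `GridParticle` struct and the function names are hypothetical, and the sort is a serial `qsort` rather than a distributed sort.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical particle on a side x side integer grid. */
typedef struct {
    unsigned x, y;   /* grid coordinates in [0, side) */
    unsigned key;    /* position along the Hilbert curve */
    int domain;      /* owning process after decomposition */
} GridParticle;

/* 2D Hilbert index of cell (x, y); side must be a power of two. */
unsigned hilbert_key(unsigned side, unsigned x, unsigned y)
{
    unsigned d = 0;
    for (unsigned s = side / 2; s > 0; s /= 2) {
        unsigned rx = (x & s) ? 1u : 0u;
        unsigned ry = (y & s) ? 1u : 0u;
        d += s * s * ((3u * rx) ^ ry);
        if (ry == 0) {               /* rotate/flip the quadrant */
            if (rx == 1) {
                x = side - 1 - x;
                y = side - 1 - y;
            }
            unsigned t = x; x = y; y = t;
        }
    }
    return d;
}

static int by_key(const void *a, const void *b)
{
    unsigned ka = ((const GridParticle *)a)->key;
    unsigned kb = ((const GridParticle *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Sort particles along the curve, then cut the sorted list into
   n_domains contiguous pieces of (nearly) equal particle count. */
void decompose(GridParticle *p, int n, unsigned side, int n_domains)
{
    assert(n_domains > 0);
    for (int i = 0; i < n; i++)
        p[i].key = hilbert_key(side, p[i].x, p[i].y);
    qsort(p, (size_t)n, sizeof *p, by_key);
    for (int i = 0; i < n; i++)
        p[i].domain = (int)((long long)i * n_domains / n);
}
```

Because each domain is a contiguous piece of the curve, equal particle counts give the load balance, and the locality of the curve keeps each domain compact in space.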


Communication in Gadget

•Can identify 4 recurring “non-trivial” patterns:
▫Distributed sort of particles, according to P-H key: implements domain decomposition.
▫Export of particles to other nodes, for calculation of remote contribution to force, density, etc, and retrieval of results.
▫Projection of particle density to regular grid for calculation of Φlong; distribution of results back to irregularly distributed particles.
▫Distributed Fast Fourier Transform.
