Early Experiences in GPU-enabling the Gadget-2 Simulation Code
Syed Akbar Mehdi and Aamir Shafi
School of Electrical Engineering and Computer Science (SEECS)
National University of Sciences and Technology (NUST)
04/19/2023
Collaborators
•Dr Bryan Carpenter from the University of Portsmouth, UK
Presentation Outline
•Introduction to the HPC research group
•Emergence of Cluster of GPUs
•Programming GPUs
•Introduction to the Gadget-2 Code
•Preliminary Performance Evaluation
•Summary
HPC Research Group at SEECS NUST
•Research interests:
▫Conduct research, development, and evaluation of parallel programming languages, libraries, and paradigms
▫Support computational scientists in doing their job
•Parallel Computing Training:
▫Workshops
▫UG/PG courses at NUST
•Computational resources:
▫Three small-scale compute clusters
Computational Resources
A Tier-2 compliant data centre …
Cluster Name   Brand                # Procs             # Nodes   Memory   OS            Network
Raad           Sunfire v890         64 Ultra SPARC IV   4         128 GB   Solaris 10    Myrinet
Barq           Custom Built         36 Intel XEON       9         36 GB    OpenSuSE      Myrinet
Burraq         Sun v20z/HP DL 145   64 AMD Opterons     32        128 GB   Cent OS 5.3   GigE
MPJ Express†
•MPJ Express is an MPI-like library for parallel Java applications for compute clusters/clouds and multicore processors
•The software is open source and is currently developed and maintained at NUST:
▫Available for free from the SourceForge website
▫http://mpj-express.org
•A recent success story:
▫MPJ Express has been adopted by SHARCNET (https://www.sharcnet.ca), which is an HPC consortium of Canadian academic institutions
† Shafi et al, Nested Parallelism for Multi-core HPC Systems using Java, JPDC, pp 532-545, 69(6), June 2009
Some MPJ Express Users ...
Cluster of GPUs
•Increasing interest in building parallel hardware using a mixture of heterogeneous computing devices:
▫Multicore CPUs
▫Manycore GPUs
•On such hardware, also known as a cluster of GPUs, the compute-intensive parts of the user application are executed on GPUs
•The current generation of GPUs is capable of executing general-purpose computation in a massively parallel fashion
A Typical Cluster of GPUs
[Figure: a cluster of eight nodes (Node 1 to Node 8); component labels: Memory, CPU, GPU]
Graphics Processing Units (GPUs)
•A GPU is a specialized processor that offloads graphics rendering from the host CPU:
▫Or at least this used to be the “old” definition
•Modern GPUs are capable of executing general-purpose computation
•The two leading GPU manufacturers are:
▫Nvidia
▫AMD
Floating-Point Operations per Second for the CPU and the GPU
Image courtesy: NVIDIA CUDA Programming Guide version 3.0
Comparison of a CPU and a GPU
Specifications                   Core i7 960                                  GTX285
Processing Elements              4 cores, 4 way SIMD @ 3.2 GHz                30 cores, 8 way SIMD @ 1.5 GHz
Resident Strands/Threads (max)   4 cores, 2 threads, 4 way SIMD: 32 strands   30 cores, 32 SIMD vectors, 32 way SIMD: 30720 threads
SP GFLOP/s                       102                                          1080
Memory Bandwidth                 25.6 GB/s                                    159 GB/s
Register File                    -                                            1.875 MB
Local Store                      -                                            480 kB
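The peak-throughput and thread-count rows of the comparison can be reproduced with simple arithmetic. The sketch below is illustrative only. It assumes clock rates of 3.2 GHz for the Core i7 960 and 1.5 GHz for the GTX285, and it assumes that each CPU SIMD lane retires 2 single-precision operations per cycle (a multiply plus an add) and each GPU lane 3 (a multiply-add plus a multiply). Those per-lane figures are not stated on the slide, and `peak_gflops` and `max_resident` are hypothetical helper names.

```python
def peak_gflops(cores, simd_width, flops_per_lane_per_cycle, clock_ghz):
    """Peak single-precision GFLOP/s = cores x lanes x flops/lane/cycle x GHz."""
    return cores * simd_width * flops_per_lane_per_cycle * clock_ghz


def max_resident(cores, contexts_per_core, simd_width):
    """Maximum resident strands/threads = cores x contexts per core x SIMD width."""
    return cores * contexts_per_core * simd_width


# Assumed flops per lane per cycle: 2 on the CPU (mul + add), 3 on the GPU (mad + mul).
cpu_gflops = peak_gflops(cores=4, simd_width=4, flops_per_lane_per_cycle=2, clock_ghz=3.2)
gpu_gflops = peak_gflops(cores=30, simd_width=8, flops_per_lane_per_cycle=3, clock_ghz=1.5)

cpu_strands = max_resident(cores=4, contexts_per_core=2, simd_width=4)
gpu_threads = max_resident(cores=30, contexts_per_core=32, simd_width=32)

print(f"CPU: {cpu_gflops:.1f} GFLOP/s, {cpu_strands} strands")
print(f"GPU: {gpu_gflops:.1f} GFLOP/s, {gpu_threads} threads")
```

Under these assumptions the products agree with the table entries (about 102 and 1080 GFLOP/s, 32 strands and 30720 threads).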
CPU versus GPU
Images courtesy: NVIDIA CUDA Programming Guide version 3.0
GPU Architecture and its Programming
[Figure: CUDA memory hierarchy. The Host exchanges data with the (Device) Grid, which holds Global Memory, Constant Memory, and Texture Memory. Each block (Block (0, 0), Block (1, 0)) has its own Shared Memory; each thread (Thread (0, 0), Thread (1, 0)) has its own Registers and Local Memory.]
Image courtesy: Ibid
General Purpose GPU (GPGPU) Computing
Image courtesy: Ibid
Compute Unified Device Architecture (CUDA)
• CUDA is the computing engine in NVIDIA graphics processing units (GPUs) that is accessible to software developers through industry standard programming languages:
▫ CUDA offers simple extensions to the C++ language
▫ Nvidia’s way of programming GPUs
▫ Explicit memory management
• OpenCL:
▫ Vendor neutral way of programming GPUs
Scientific Applications using CUDA
GPU-enabling Gadget-2
• Gadget-2:
▫A free production code for cosmological N-body and hydrodynamic computations†.
▫Written in C and already fully parallelized using MPI.
▫Versions used in various astrophysics research papers, including the Millennium Simulation.
• Our aim in this study is to execute the compute-intensive parts of this code on the GPU:
▫Gadget-2 is an N-body simulation and highly irregular
† http://www.mpa-garching.mpg.de/gadget
Dynamics in Gadget
•Gadget is “just” simulating the movement of (a lot of) representative particles under the influence of Newton’s law of gravity, plus hydrodynamic forces on gas
•Classical N-body problem
Gadget Main Loop
•Schematic view of the Gadget code:
    … Initialize …
    while (not done) {
        move_particles();                // update positions
        domain_Decomposition();
        compute_accelerations();
        advance_and_find_timesteps();    // update velocities
    }
•Most of the interesting work happens in domain_Decomposition() and compute_accelerations().
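The schematic loop can be made concrete with a toy, single-process version. This is a minimal sketch under stated assumptions, not Gadget's implementation: it works in 1D, uses a fixed timestep `dt` (Gadget chooses timesteps adaptively), replaces the tree-based force computation with direct summation, softens the force with a hypothetical `eps`, and has no real domain decomposition because there is only one process.

```python
def compute_accelerations(pos, mass, G=1.0, eps=1e-3):
    """Direct-summation stand-in for Gadget's tree-based force computation (1D)."""
    acc = [0.0] * len(pos)
    for i in range(len(pos)):
        for j in range(len(pos)):
            if i != j:
                d = pos[j] - pos[i]
                # Softened inverse-square attraction towards particle j.
                acc[i] += G * mass[j] * d / (d * d + eps * eps) ** 1.5
    return acc


def run(pos, vel, mass, dt, n_steps):
    """Schematic time loop: move, (decompose), accelerate, update velocities."""
    pos, vel = list(pos), list(vel)
    for _ in range(n_steps):
        pos = [x + v * dt for x, v in zip(pos, vel)]    # move_particles(): update positions
        # domain_Decomposition() would redistribute particles across MPI ranks here;
        # this single-process sketch has nothing to redistribute.
        acc = compute_accelerations(pos, mass)           # forces at the new positions
        vel = [v + a * dt for v, a in zip(vel, acc)]     # update velocities
    return pos, vel


# Two equal masses released at rest fall towards each other.
final_pos, final_vel = run([-1.0, 1.0], [0.0, 0.0], [1.0, 1.0], dt=0.01, n_steps=10)
```

Because the pairwise forces are equal and opposite, the total momentum of the two particles stays at zero, which is a cheap sanity check for any force routine plugged into the loop.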
Computing Forces
•The function compute_accelerations() must compute forces experienced by all particles.
▫In particular must compute gravitational force.
•Because gravity is a long range force, every particle affects every other. Naively, total cost is O(N^2).
•With N ≈ 10^10, this is infeasible.
•Need some kind of approximation ...
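A back-of-the-envelope calculation shows why the naive all-pairs approach is infeasible at this particle count. The sketch below is illustrative: `RATE` is a hypothetical sustained throughput chosen only to make the orders of magnitude visible, and the N log2 N figure is the usual rough scaling of hierarchical tree methods with constant factors ignored.

```python
import math


def direct_pairs(n):
    """Force evaluations in a naive all-pairs sum: every particle against every other."""
    return n * (n - 1)


def tree_interactions(n):
    """Order-of-magnitude N log2 N estimate for a tree method (constant factors ignored)."""
    return n * math.log2(n)


RATE = 1e9                           # hypothetical interactions per second (assumption)
SECONDS_PER_YEAR = 365.25 * 24 * 3600

n = 10**10
direct_years = direct_pairs(n) / RATE / SECONDS_PER_YEAR
tree_seconds = tree_interactions(n) / RATE

print(f"direct: {direct_years:.0f} years per force computation")
print(f"tree:   {tree_seconds:.0f} seconds per force computation")
```

At the assumed rate, one all-pairs force computation takes on the order of thousands of years, while the tree estimate is on the order of minutes.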
Barnes-Hut Tree
Recursively split space into octree (quadtree in this 2d example), until no node contains more than 1 particle.
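The recursive splitting rule can be sketched for the 2D (quadtree) case as follows. This is a minimal illustration, not Gadget's tree code: the `Node` class and its fields are hypothetical, particles are plain `(x, y)` tuples, and all positions are assumed to be distinct (two identical positions would split forever).

```python
class Node:
    """Square cell of a 2D Barnes-Hut quadtree."""

    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half   # cell centre and half-width
        self.particle = None    # (x, y) if this leaf holds exactly one particle
        self.children = None    # four sub-cells once the cell has been split

    def _child_for(self, x, y):
        index = (1 if x >= self.cx else 0) + (2 if y >= self.cy else 0)
        return self.children[index]

    def insert(self, x, y):
        if self.children is None and self.particle is None:
            self.particle = (x, y)            # empty leaf: store the particle
            return
        if self.children is None:             # occupied leaf: split into 4 quadrants
            h = self.half / 2
            self.children = [Node(self.cx + dx * h, self.cy + dy * h, h)
                             for dy in (-1, 1) for dx in (-1, 1)]
            old, self.particle = self.particle, None
            self._child_for(*old).insert(*old)   # push the old particle down
        self._child_for(x, y).insert(x, y)


def leaves(node):
    """Yield every leaf cell, occupied or empty."""
    if node.children is None:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


points = [(0.1, 0.1), (0.2, 0.15), (0.8, 0.3), (0.7, 0.9), (0.12, 0.11)]
root = Node(0.5, 0.5, 0.5)                    # unit square
for p in points:
    root.insert(*p)
occupied = [leaf.particle for leaf in leaves(root) if leaf.particle is not None]
```

After construction every leaf holds at most one particle, and densely populated regions end up with deeper subtrees than sparse ones.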
Distribution of BH Tree in Gadget†
† Springel et al, The cosmological simulation code GADGET-2, Mon.Not.Roy.Astron.Soc. 364 (2005) 1105-1134
GPU-enabling the Gadget-2 code
•The big idea is “to perform tree walk (force calculation) on the GPU instead of CPU”
•Steps:
▫Copy particles and tree information to the GPU memory
▫Perform tree walk in parallel for multiple particles on the GPU
▫Copy particles array back to the CPU
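The three steps can be mimicked on the CPU to show the data flow. The sketch below is an illustration under several assumptions and is not the authors' kernel: it works in 1D, stores the tree as a flat list (standing in for the array copied to GPU memory), walks it with an explicit stack rather than recursion, uses only the monopole (total mass at the centre of mass) of each node, and applies the classic Barnes-Hut opening test `size < theta * distance`, which may differ from the criterion Gadget-2 uses. All function names are hypothetical, and positions are assumed distinct.

```python
def build_flat_tree(ids, lo, hi, pos, mass, nodes):
    """Recursively bisect [lo, hi) and append nodes to a flat list; return the node index.

    Each node is (mass, centre_of_mass, size, child_indices, particle_id or -1).
    """
    index = len(nodes)
    nodes.append(None)                       # reserve the slot so parents precede children
    m = sum(mass[i] for i in ids)
    com = sum(mass[i] * pos[i] for i in ids) / m
    children = []
    if len(ids) > 1:
        mid = 0.5 * (lo + hi)
        halves = (([i for i in ids if pos[i] < mid], lo, mid),
                  ([i for i in ids if pos[i] >= mid], mid, hi))
        for part, a, b in halves:
            if part:
                children.append(build_flat_tree(part, a, b, pos, mass, nodes))
    nodes[index] = (m, com, hi - lo, children, ids[0] if len(ids) == 1 else -1)
    return index


def tree_walk(i, nodes, pos, theta):
    """Acceleration on particle i (G = 1); each call is independent of the others."""
    acc, stack = 0.0, [0]
    while stack:
        m, com, size, children, pid = nodes[stack.pop()]
        if pid == i:
            continue                         # skip self-interaction
        d = com - pos[i]
        if pid >= 0 or size < theta * abs(d):
            acc += m * d / abs(d) ** 3       # single particle, or far enough: use monopole
        else:
            stack.extend(children)           # too close: open the node
    return acc


def direct_sum(i, pos, mass):
    """Exact reference: sum over all other particles."""
    return sum(mass[j] * (pos[j] - pos[i]) / abs(pos[j] - pos[i]) ** 3
               for j in range(len(pos)) if j != i)


pos = [0.05, 0.2, 0.25, 0.6, 0.9]
mass = [1.0, 2.0, 1.0, 3.0, 1.5]
nodes = []                                   # step 1: flat tree that would be copied to the GPU
build_flat_tree(list(range(len(pos))), 0.0, 1.0, pos, mass, nodes)
acc = [tree_walk(i, nodes, pos, theta=0.5) for i in range(len(pos))]   # step 2: one walk per particle
# Step 3 would copy `acc` back to the host; here it is already a host-side list.
```

Each walk reads the shared tree and writes only its own result, which is what makes one GPU thread per particle a natural mapping. With `theta = 0` every node is opened and the walk reduces to the direct sum, which gives a simple correctness check.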
[Figure: four MPI processes (Process 0 on Node A, Process 1 on Node B, Process 2 on Node C, Process 3 on Node D) connected by the Cluster Interconnect. Each node has a CPU, a GPU, and DRAM, and each GPU runs thread blocks (Block 0, Block 1, …). Zoomed-in thread block on a GPU on Process 0: thread (0,0) performs the tree walk for particle 1, (1,0) for particle 2, …, (BlkWidth,0) for particle j, (0,1) for particle j+1, (1,1) for particle j+2, …, (BlkWidth,1) for particle k.]
GPU-enabled Gadget-2 code executing on a cluster of four nodes. Each node has a CPU and a GPU …
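The thread labels in the figure suggest that particles are assigned to the threads of a block row by row. A minimal sketch of such a mapping follows. It assumes 0-based particle indices, thread x-coordinates running from 0 to `blk_width - 1`, and a guard for surplus threads in the last block. The function names are hypothetical and the authors' actual indexing may differ.

```python
def particle_for_thread(block_id, tx, ty, blk_width, blk_height):
    """0-based particle index handled by thread (tx, ty) of block `block_id`.

    Assumes row-major numbering inside a block: thread (0, 0) takes the block's
    first particle, (1, 0) the next, and (0, 1) follows the last thread of row 0.
    """
    threads_per_block = blk_width * blk_height
    return block_id * threads_per_block + ty * blk_width + tx


def assignments(n_particles, blk_width, blk_height):
    """Map every particle to (block, tx, ty); surplus threads in the last block stay idle."""
    threads_per_block = blk_width * blk_height
    n_blocks = -(-n_particles // threads_per_block)       # ceiling division
    table = {}
    for b in range(n_blocks):
        for ty in range(blk_height):
            for tx in range(blk_width):
                p = particle_for_thread(b, tx, ty, blk_width, blk_height)
                if p < n_particles:                        # bounds guard, as in a CUDA kernel
                    table[p] = (b, tx, ty)
    return table


table = assignments(n_particles=10, blk_width=4, blk_height=2)
```

With 10 particles and 4 x 2 blocks, two blocks are launched and six threads of the second block have no particle to process.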
Preliminary Performance Evaluation
[Bar chart: Execution Time (in seconds) for three configurations, bar labels not recoverable: 3203.45 (1x), 1674.05 (1.91x), 226.69 (14.1x)]
Optimization
•Reduce the memory transfer between the CPU and the GPU
•“42 TFlops Hierarchical N-body Simulations on GPUs with Applications in both Astrophysics and Turbulence” by Hamada et al:
▫Multiple tree walks use the same “interaction list”
•Perform the tree walk “level by level”:
▫Currently implementing this algorithm
Summary
•Our group focuses on the Computer Science (CS) aspects of HPC:
▫Languages, libraries, and scientific software
•Need for collaboration between CS and the scientific community on a national scale before reaching out to industry:
▫At NUST we are going in this direction
•“Cluster of GPUs” is an interesting trend, which is likely to continue
•We discussed GPU-enabling the Gadget-2 code
Thanks
TreePM
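•Artificially split gravitational potential into two parts:
\[
\frac{Gm}{r} \;=\; \underbrace{\frac{Gm}{r}\,\mathrm{erfc}\!\left(\frac{r}{2 r_s}\right)}_{\Phi_{\mathrm{short}}(r)} \;+\; \underbrace{\frac{Gm}{r}\left[1 - \mathrm{erfc}\!\left(\frac{r}{2 r_s}\right)\right]}_{\Phi_{\mathrm{long}}(r)}
\]
•Calculate Φshort using BH; calculate Φlong by projecting particle mass distribution onto a mesh, then working in Fourier space.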
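The split can be checked numerically: the short-range part Gm/r · erfc(r / 2r_s) and the long-range part Gm/r · [1 − erfc(r / 2r_s)] must add up to the full Gm/r at every radius, with the short-range part dominating for r much smaller than r_s and vanishing for r much larger. The sketch below uses only the standard-library `math.erfc`. Setting G = m = r_s = 1 is an assumption made for illustration, and the overall sign convention of the potential is left as written on the slide.

```python
import math


def phi_short(r, r_s, G=1.0, m=1.0):
    """Short-range part: the full potential damped by erfc(r / (2 r_s))."""
    return G * m / r * math.erfc(r / (2.0 * r_s))


def phi_long(r, r_s, G=1.0, m=1.0):
    """Long-range part: the remainder, G m / r * [1 - erfc(r / (2 r_s))]."""
    return G * m / r * (1.0 - math.erfc(r / (2.0 * r_s)))


r_s = 1.0   # splitting scale (illustrative value)
for r in (0.1, 1.0, 5.0, 20.0):
    total = phi_short(r, r_s) + phi_long(r, r_s)
    print(f"r = {r:5.1f}  short-range fraction = {phi_short(r, r_s) / total:.3e}")
```

The rapid decay of the short-range fraction is what lets the tree walk ignore distant nodes, leaving the smooth long-range part to the mesh.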
Domain Decomposition
•Can’t just divide space in a fixed way, because some regions will have many more particles than others: poor load balancing.
•Can’t just divide particles in a fixed way, because particles move independently through space, and we want to keep physically close particles on the same processor as far as practical: communication problem.
Peano-Hilbert Curve †
† Picture borrowed from http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf
Decomposition based on P-H Curve
•Sort particles by position on Peano-Hilbert curve, then divide evenly into P domains.
•Characteristics of this decomposition:
▫Good load balancing.
▫Domains simply connected and quite “compact” in real space, because particles that are close along P-H curve are close in real space.
▫Domains have relatively simple mapping to BH octree nodes.
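The sort-then-cut idea can be illustrated in 2D. The sketch below is a simplification, not Gadget's code: it uses the standard bitwise construction of a 2D Hilbert curve on an n x n grid (Gadget works with a 3D Peano-Hilbert curve), represents each particle by the integer grid cell it falls in, and gives leftover particles to the first few domains. `hilbert_key` and `decompose` are hypothetical names.

```python
def hilbert_key(n, x, y):
    """Position of grid cell (x, y) along a 2D Hilbert curve on an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                       # rotate/flip the quadrant so the sub-curve lines up
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def decompose(cells, n, n_domains):
    """Sort particles (given as grid cells) by Hilbert key and cut into near-equal domains."""
    order = sorted(cells, key=lambda c: hilbert_key(n, *c))
    base, extra = divmod(len(order), n_domains)
    domains, start = [], 0
    for p in range(n_domains):
        size = base + (1 if p < extra else 0)    # first `extra` domains get one more
        domains.append(order[start:start + size])
        start += size
    return domains


particles = [(0, 0), (3, 3), (1, 2), (2, 1), (0, 3), (3, 0), (2, 2)]
domains = decompose(particles, n=4, n_domains=3)
```

Because consecutive keys always belong to neighbouring cells, each contiguous slice of the sorted list covers a spatially compact region, and cutting by particle count rather than by volume keeps the load balanced.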
Communication in Gadget
•Can identify 4 recurring “non-trivial” patterns:
▫Distributed sort of particles, according to P-H key: implements domain decomposition.
▫Export of particles to other nodes, for calculation of remote contribution to force, density, etc, and retrieval of results.
▫Projection of particle density to regular grid for calculation of Φlong; distribution of results back to irregularly distributed particles.
▫Distributed Fast Fourier Transform.