VORPAL Optimizations for Petascale Systems Paul Mullowney, Peter Messmer, Ben Cowan, Keegan Amyx,...
-
Upload
cecily-williams -
Category
Documents
-
view
214 -
download
1
Transcript of VORPAL Optimizations for Petascale Systems Paul Mullowney, Peter Messmer, Ben Cowan, Keegan Amyx,...
VORPAL Optimizations for Petascale Systems
Paul Mullowney, Peter Messmer, Ben Cowan, Keegan Amyx, Stefan Muszala
Tech-X Corporation
Boyana NorrisArgonne National Lab
Work supported by DOE Office of Science SBIR Phase II Award DE-FG02-07ER84731
VORPAL: Introduction
• VORPAL plasma framework widely used on leadership-class systems– Particle-in-cell (PIC) algorithm for kinetic plasma model– Finite-Difference Time-Domain (FDTD) for Maxwell solver– Access to various libraries (Trilinos, PETSc) for doing linear
system solves (Electrostatic PIC)– ADI methods for doing implicit Maxwell solve
• Self-consistent model for charged particles and electromagnetic field– Electromagnetic field discretized on 3D cartesian mesh– Particles located anywhere in space– Particles gather forces and scatter charges/currents to the
field• Parallelization via domain decomposition
VORPAL Optimization Challenges
• Petascale systems require strong scaling– FDTD computationally cheap to start with– Need good performance on small domain
Þ Efficient messaging on small computational domainsÞ Efficient computation and messaging for heterogeneous
architectures
• If particles present, they are the main contribution to compute time– Significant time savings via optimized particle-push algorithm– Key challenge: finding good data layout
Þ Optimize nearest neighbor messaging for small domain sizes
Þ Optimize for heterogeneous architectures … GPUs
JumpShot Messaging Patterns in an FDTD Simulation
PETSc on BG/L VORPAL on BG/P
Improving parallel efficiency with different messaging patterns
Conventional field messaging: Send and
receive messages to/from all neighbors at once
Staged messaging: Send in one direction at a time, waiting for one direction to complete before starting the next
Reduces overall number of messages: 6 instead of 26
Messaging results
• Staged messaging can be up to 5× faster for small domain sizes
• Similar performance on Cray XT4 and BG/P
Cray XT4 BG/P
Effect of E/B Field Memory Structure on FDTD Performance
Ex(i,j,k)
Ey(i,j,k+1)
Ey(i,j,k)
Ez(i,j,k)
Ex(i,j,k+1)
Ez(i,j,k+1)
Ex(i,j,k)
Ex(i,j,k+1)
Ey(i,j,k)
Ey(i,j,k+1)
Ez(i,j,k)
Ez(i,j,k+1)
…
…
…
Layout A Layout B
…
…
…
Memory Layout is a key consideration for GPU optimization
Using GPUs for Accelerating FDTD
• FERMI : 8x improvement over Tesla Series GPUs for double precision (> 500 GFlops), likely 2x improvement in single precision (~2 TFlops)
• FERMI available in Spring 2010 ???
GPU Implementations of FDTDImplementation 1: Generic 4 pt
Stencil Kernelsfor(int i = tid; i < nx; i +=
nThreads){ float r = res[i]; r += a1 * in1[i]; r += a2 * in2[i]; r += a3 * in3[i]; r += a4 * in4[i]; res[i] += r; }
Implementation 2: Yee Mesh Specific Kernels
float ex, ey, ez; for(int i = tid; i < n; i += nThreads){ ex = Ex[i], ey = Ey[i], ez = Ez[i]; Bx[i] += dtOverDy*(ez-EzYp1[i]) + dtOverDz*(EyZp1[i]-ey); By[i] += dtOverDz*(ex-ExZp1[i]) + dtOverDx*(EzXp1[i]-ez); Bz[i] += dtOverDx*(ey-EyXp1[i]) + dtOverDy*(ExYp1[i]-ex); }
• Requires 6 calls to Generic Kernel to do the full FDTD update
• Makes no use of GPU memory hierarchy
• All memory accesses are global
• Corresponding call to Ampere update• Reuse 3 global memory accesses• no use of shared memory or other
trickery
Timings: FDTD Performance
Speedup
Specifics
• Boundary Conditions/PMLs can be handled through a spatially dependent dielectric constant
• Dey-Mittra Cut Cell algorithms can be used through slight modifications to 4 pt Stencil Kernel.
• Collection of optimized vector routines that can perform all of the above mentioned algorithms – Accessible from within VORPAL or from High-
Level Languages like IDL and MATLAB– See Peter Messmer’s dinner-time presentation
on GPULib at the NUG meeting (Wed. 10/7)
GPULib (https://gpulib.txcorp.com)
Future Work
• Fully optimize the FDTD algorithm• Move to multiple GPUs so that VORPAL can
take advantage of new heterogeneous systems
• VORPAL ADI algorithm on GPUs: – GPULib has highly optimized tridiagonal and
pentadiagonal solvers for “small” linear systems (<1000 unknowns) that are easily tasked-farmed out to GPU thread blocks. Potential for huge speedup vs CPU implementation.
• Move Particle Push to GPU• Vlasov Solver??