VORPAL Optimizations for Petascale Systems Paul Mullowney, Peter Messmer, Ben Cowan, Keegan Amyx,...

VORPAL Optimizations for Petascale Systems

Paul Mullowney, Peter Messmer, Ben Cowan, Keegan Amyx, Stefan Muszala

Tech-X Corporation

Boyana NorrisArgonne National Lab

*[email protected]

*[email protected]

Work supported by DOE Office of Science SBIR Phase II Award DE-FG02-07ER84731

mailto:*[email protected]

mailto:*[email protected]

VORPAL: Introduction

• VORPAL plasma framework widely used on leadership-class systems– Particle-in-cell (PIC) algorithm for kinetic plasma model– Finite-Difference Time-Domain (FDTD) for Maxwell solver– Access to various libraries (Trilinos, PETSc) for doing linear

system solves (Electrostatic PIC)– ADI methods for doing implicit Maxwell solve

• Self-consistent model for charged particles and electromagnetic field– Electromagnetic field discretized on 3D cartesian mesh– Particles located anywhere in space– Particles gather forces and scatter charges/currents to the

field• Parallelization via domain decomposition

VORPAL Optimization Challenges

• Petascale systems require strong scaling– FDTD computationally cheap to start with– Need good performance on small domain

Þ Efficient messaging on small computational domainsÞ Efficient computation and messaging for heterogeneous

architectures

• If particles present, they are the main contribution to compute time– Significant time savings via optimized particle-push algorithm– Key challenge: finding good data layout

Þ Optimize nearest neighbor messaging for small domain sizes

Þ Optimize for heterogeneous architectures … GPUs

JumpShot Messaging Patterns in an FDTD Simulation

PETSc on BG/L VORPAL on BG/P

Improving parallel efficiency with different messaging patterns

Conventional field messaging: Send and

receive messages to/from all neighbors at once

Staged messaging: Send in one direction at a time, waiting for one direction to complete before starting the next

Reduces overall number of messages: 6 instead of 26

Messaging results

• Staged messaging can be up to 5× faster for small domain sizes

• Similar performance on Cray XT4 and BG/P

Cray XT4 BG/P

Effect of E/B Field Memory Structure on FDTD Performance

Ex(i,j,k)

Ey(i,j,k+1)

Ey(i,j,k)

Ez(i,j,k)

Ex(i,j,k+1)

Ez(i,j,k+1)

Ex(i,j,k)

Ex(i,j,k+1)

Ey(i,j,k)

Ey(i,j,k+1)

Ez(i,j,k)

Ez(i,j,k+1)

…

…

…

Layout A Layout B

…

…

…

Memory Layout is a key consideration for GPU optimization

Using GPUs for Accelerating FDTD

• FERMI : 8x improvement over Tesla Series GPUs for double precision (> 500 GFlops), likely 2x improvement in single precision (~2 TFlops)

• FERMI available in Spring 2010 ???

GPU Implementations of FDTDImplementation 1: Generic 4 pt

Stencil Kernelsfor(int i = tid; i < nx; i +=

nThreads){ float r = res[i]; r += a1 * in1[i]; r += a2 * in2[i]; r += a3 * in3[i]; r += a4 * in4[i]; res[i] += r; }

Implementation 2: Yee Mesh Specific Kernels

float ex, ey, ez; for(int i = tid; i < n; i += nThreads){ ex = Ex[i], ey = Ey[i], ez = Ez[i]; Bx[i] += dtOverDy*(ez-EzYp1[i]) + dtOverDz*(EyZp1[i]-ey); By[i] += dtOverDz*(ex-ExZp1[i]) + dtOverDx*(EzXp1[i]-ez); Bz[i] += dtOverDx*(ey-EyXp1[i]) + dtOverDy*(ExYp1[i]-ex); }

• Requires 6 calls to Generic Kernel to do the full FDTD update

• Makes no use of GPU memory hierarchy

• All memory accesses are global

• Corresponding call to Ampere update• Reuse 3 global memory accesses• no use of shared memory or other

trickery

Timings: FDTD Performance

Speedup

Specifics

• Boundary Conditions/PMLs can be handled through a spatially dependent dielectric constant

• Dey-Mittra Cut Cell algorithms can be used through slight modifications to 4 pt Stencil Kernel.

• Collection of optimized vector routines that can perform all of the above mentioned algorithms – Accessible from within VORPAL or from High-

Level Languages like IDL and MATLAB– See Peter Messmer’s dinner-time presentation

on GPULib at the NUG meeting (Wed. 10/7)

GPULib (https://gpulib.txcorp.com)

Future Work

• Fully optimize the FDTD algorithm• Move to multiple GPUs so that VORPAL can

take advantage of new heterogeneous systems

• VORPAL ADI algorithm on GPUs: – GPULib has highly optimized tridiagonal and

pentadiagonal solvers for “small” linear systems (<1000 unknowns) that are easily tasked-farmed out to GPU thread blocks. Potential for huge speedup vs CPU implementation.

• Move Particle Push to GPU• Vlasov Solver??

VORPAL Optimizations for Petascale Systems Paul Mullowney, Peter Messmer, Ben Cowan, Keegan Amyx,...

Documents

Transcript of VORPAL Optimizations for Petascale Systems Paul Mullowney, Peter Messmer, Ben Cowan, Keegan Amyx,...