Stencil Framework for Portable High Performance Computing
Naoya Maruyama, RIKEN Advanced Institute for Computational Science
April 2013 @ NCSA, IL, USA
Talk Outline
• Physis stencil framework
• MapReduce for K
• Mini-app development
Multi-GPU Application Development
• Non-unified programming models
– MPI for inter-node parallelism
– CUDA/OpenCL/OpenACC for accelerators
• Optimization
– Blocking
– Overlapped computation and communication
(Figure: a multi-node cluster, with CUDA used within each GPU node and MPI between nodes.)
Good performance with low programmer productivity
Goal
High performance, highly productive programming for heterogeneous clusters
Approach
High-level abstractions for structured parallel programming
– Simplifying programming models
– Portability across platforms
– Does not sacrifice too much performance
Physis (Φύσις) Framework [SC’11]
Stencil DSL
• Declarative
• Portable
• Global-view
• C-based

void diffusion(int x, int y, int z,
               PSGrid3DFloat g1, PSGrid3DFloat g2) {
  float v = PSGridGet(g1,x,y,z)
    + PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z)
    + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
    + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
  PSGridEmit(g2, v/7.0);
}
DSL Compiler
• Target-specific code generation and optimizations
• Automatic parallelization
(Figure: Physis source is compiled to one of several backends: C, C+MPI, CUDA, CUDA+MPI, OpenMP, OpenCL.)
DSL Overview
• C + custom data types and intrinsics
• Grid data types
– PSGrid3DFloat, PSGrid3DDouble, etc.
• Dense Cartesian domain types
– PSDomain1D, PSDomain2D, and PSDomain3D
• Intrinsics
– Runtime management
– Grid object management (PSGridFloat3DNew, etc.)
– Grid accesses (PSGridCopyin, PSGridGet, etc.)
– Applying stencils to grids (PSGridMap, PSGridRun)
– Grid reductions (PSGridReduce)
Writing Stencils
• Stencil kernel
– C functions describing a single flow of scalar execution on one grid element
– Executed over specified rectangular domains

void diffusion(const int x, const int y, const int z,
               PSGrid3DFloat g1, PSGrid3DFloat g2, float t) {
  float v = PSGridGet(g1,x,y,z)
    + PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z)
    + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
    + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
  PSGridEmit(g2, v/7.0*t);
}

PSGridEmit issues a write to grid g2; offsets passed to PSGridGet must be constant.
Periodic access is possible with PSGridGetPeriodic.
Applying Stencils to Grids
• Map: creates a stencil closure that encapsulates a stencil and grids
• Run: iteratively executes stencil closures

PSGrid3DFloat g1 = PSGrid3DFloatNew(NX, NY, NZ);
PSGrid3DFloat g2 = PSGrid3DFloatNew(NX, NY, NZ);
PSDomain3D d = PSDomain3DNew(0, NX, 0, NY, 0, NZ);
PSStencilRun(PSStencilMap(diffusion,d,g1,g2,0.5),
             PSStencilMap(diffusion,d,g2,g1,0.5),
             10);

Stencils grouped by one PSStencilRun are the target for kernel fusion optimization.
Implementation
• DSL translator
– Translates intrinsic calls to runtime API calls
– Generates GPU kernels with boundary exchanges based on static analysis
– Built on the ROSE compiler framework (LLNL)
• Runtime
– Provides a shared-memory-like interface for multidimensional grids over distributed CPU/GPU memory
(Figure: Physis code → implementation source code → executable code.)
CUDA Thread Blocking
• Each thread sweeps points in the Z dimension
• The X and Y dimensions are blocked with AxB thread blocks, where A and B are user-configurable parameters (64x4 by default)
Example: 7-point Stencil GPU Code

__device__ void kernel(const int x, const int y, const int z,
                       __PSGrid3DFloatDev *g, __PSGrid3DFloatDev *g2) {
  float v = *__PSGridGetAddrNoHaloFloat3D(g,x,y,z)
    + *__PSGridGetAddrFloat3D_0_fw(g,x+1,y,z)
    + *__PSGridGetAddrFloat3D_0_bw(g,x-1,y,z)
    + *__PSGridGetAddrFloat3D_1_fw(g,x,y+1,z)
    + *__PSGridGetAddrFloat3D_1_bw(g,x,y-1,z)
    + *__PSGridGetAddrFloat3D_2_bw(g,x,y,z-1)
    + *__PSGridGetAddrFloat3D_2_fw(g,x,y,z+1);
  *__PSGridEmitAddrFloat3D(g2,x,y,z) = v;
}

__global__ void __PSStencilRun_kernel(int offset0, int offset1, __PSDomain dom,
                                      __PSGrid3DFloatDev g, __PSGrid3DFloatDev g2) {
  int x = blockIdx.x * blockDim.x + threadIdx.x + offset0;
  int y = blockIdx.y * blockDim.y + threadIdx.y + offset1;
  if (x < dom.local_min[0] || x >= dom.local_max[0] ||
      y < dom.local_min[1] || y >= dom.local_max[1])
    return;
  for (int z = dom.local_min[2]; z < dom.local_max[2]; ++z) {
    kernel(x, y, z, &g, &g2);
  }
}
Example: 7-point Stencil CPU Code

static void __PSStencilRun_0(int iter, void **stencils) {
  struct dim3 block_dim(64,4,1);
  struct __PSStencil_kernel *s0 = (struct __PSStencil_kernel *)stencils[0];
  cudaFuncSetCacheConfig(__PSStencilRun_kernel, cudaFuncCachePreferL1);
  struct dim3 s0_grid_dim((int)ceil(__PSGetLocalSize(0) / (double)64),
                          (int)ceil(__PSGetLocalSize(1) / (double)4), 1);
  __PSDomainSetLocalSize(&s0->dom);
  s0->g = __PSGetGridByID(s0->__g_index);
  s0->g2 = __PSGetGridByID(s0->__g2_index);
  int i;
  for (i = 0; i < iter; ++i) {
    int fw_width[3] = {1L, 1L, 1L};
    int bw_width[3] = {1L, 1L, 1L};
    __PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
    __PSStencilRun_kernel<<<s0_grid_dim,block_dim>>>(
        __PSGetLocalOffset(0), __PSGetLocalOffset(1), s0->dom,
        *(__PSGrid3DFloatDev *)__PSGridGetDev(s0->g),
        *(__PSGrid3DFloatDev *)__PSGridGetDev(s0->g2));
  }
  cudaThreadSynchronize();
}
Optimization: Overlapped Computation and Communication
1. Copy boundaries from GPU to CPU for non-unit-stride cases
2. Compute interior points
3. Exchange boundaries with neighbors
4. Compute boundary points
(Figure: timeline showing boundary communication overlapping with inner-point computation.)
Optimization Example: 7-Point Stencil CPU Code

for (i = 0; i < iter; ++i) {
  // Computing interior points
  __PSStencilRun_kernel_interior<<<s0_grid_dim, block_dim, 0, stream_interior>>>(
      __PSGetLocalOffset(0), __PSGetLocalOffset(1),
      __PSDomainShrink(&s0->dom, 1),
      *(__PSGrid3DFloatDev *)__PSGridGetDev(s0->g),
      *(__PSGrid3DFloatDev *)__PSGridGetDev(s0->g2));
  // Boundary exchange
  int fw_width[3] = {1L, 1L, 1L};
  int bw_width[3] = {1L, 1L, 1L};
  __PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
  // Computing boundary planes concurrently
  __PSStencilRun_kernel_boundary_1_bw<<<1, dim3(1,128,4), 0,
      stream_boundary_kernel[0]>>>(
      __PSDomainGetBoundary(&s0->dom, 0, 0, 1, 5, 0),
      *(__PSGrid3DFloatDev *)__PSGridGetDev(s0->g),
      *(__PSGrid3DFloatDev *)__PSGridGetDev(s0->g2));
  __PSStencilRun_kernel_boundary_1_bw<<<1, dim3(1,128,4), 0,
      stream_boundary_kernel[1]>>>(
      __PSDomainGetBoundary(&s0->dom, 0, 0, 1, 5, 1),
      *(__PSGrid3DFloatDev *)__PSGridGetDev(s0->g),
      *(__PSGrid3DFloatDev *)__PSGridGetDev(s0->g2));
  …
  __PSStencilRun_kernel_boundary_2_fw<<<1, dim3(128,1,4), 0,
      stream_boundary_kernel[11]>>>(
      __PSDomainGetBoundary(&s0->dom, 1, 1, 1, 1, 0),
      *(__PSGrid3DFloatDev *)__PSGridGetDev(s0->g),
      *(__PSGrid3DFloatDev *)__PSGridGetDev(s0->g2));
  cudaThreadSynchronize();
}
cudaThreadSynchronize();
Local Optimization
• Register blocking
– Reuse loaded grid elements with registers

Original:
for (int k = 1; k < n-1; ++k) {
  g[i][j][k] = a*(f[i][j][k]+f[i][j][k-1]+f[i][j][k+1]);
}

Optimized:
double kc = f[i][j][0];
double kn = f[i][j][1];
for (int k = 1; k < n-1; ++k) {
  double kp = kc;
  kc = kn;
  kn = f[i][j][k+1];
  g[i][j][k] = a*(kc+kp+kn);
}
Local Optimization
• Common subexpression elimination in offset computation
– Eliminates intra- and inter-iteration common subexpressions

Original:
for (int k = 1; k < n-1; ++k) {
  g[i][j][k] = a*(f[i+j*n+k*n*n]+f[i+j*n+(k-1)*n*n]+f[i+j*n+(k+1)*n*n]);
}

Optimized:
int idx = i+j*n+n*n;
for (int k = 1; k < n-1; ++k) {
  g[i][j][k] = a*(f[idx]+f[idx-n*n]+f[idx+n*n]);
  idx += n*n;
}
Evaluation
• Performance and productivity
• Sample code
– 7-point diffusion kernel (#stencils: 1)
– Jacobi kernel from the Himeno benchmark (#stencils: 1)
– Seismic simulation (#stencils: 15)
• Platform: Tsubame 2.0
– Node: Westmere-EP 2.9 GHz x 2 + Tesla M2050 x 3
– Dual InfiniBand QDR with a full-bisection-bandwidth fat tree
Productivity
• Similar size to the sequential code in C
Optimization Effect
(Chart: achieved bandwidth in GB/s, 0 to 90, for the Hand-tuned, No-opt, Register Blocking, Offset CSE, and Full Opt variants; 7-point diffusion stencil on 1 GPU (Tesla M2050).)
Diffusion Weak Scaling
(Chart: GFlops vs. number of GPUs, up to 300, for two per-GPU problem sizes (512x256x256 and 256x128x128); y-axis up to 10000 GFlops.)
Seismic Weak Scaling
(Chart: GFLOPS vs. number of GPUs, up to 600, with 2 GPUs per node; y-axis up to 4500 GFLOPS. Problem size: 256x256x256 per GPU.)
Diffusion Strong Scaling
(Chart: GFlops vs. number of GPUs, up to 140, for 1-D, 2-D, and 3-D decompositions; y-axis up to 4000 GFlops. Problem size: 512x512x4096.)
Himeno Strong Scaling
(Chart: GFlops vs. number of GPUs, up to 140, with a 1-D decomposition; y-axis up to 3000 GFlops. Problem size XL (1024x1024x512).)
Ongoing Work
• Auto-tuning
– Preliminary AT for the CUDA backend available
• Supporting different accelerators
• Supporting more complex problems
– Stencils with limited data dependency
– Hierarchically organized problems
• Work unit: dense structured grids
• Overall problem domain: sparsely connected work units
• Examples
– NICAM: an icosahedral model of climate simulation
– UPACS: fluid simulation of engineering problems
Further Information
• Code is available at http://github.com/naoyam/physis
• Maruyama et al., “Physis: Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers,” SC’11, 2011.
Talk Outline
• Physis stencil framework
• MapReduce for K
• Mini-app development
MapReduce for K
• In-memory MPI-based MapReduce for the K computer
– Implemented as a C library
– Provides (some of the) Hadoop-like programming interfaces
– Strong focus on scalable processing of large data sets on K
– Supports standard MPI clusters too
• Application examples
– Metagenome sequence analysis
– Replica-exchange MD: runs hundreds of NAMD instances as Map tasks, with fast data loading over the Tofu network
Talk Outline
• Physis stencil framework
• MapReduce for K
• Mini-app development
HPC Mini Applications
• A set of mini-apps derived from national key applications
– Source code will be publicly released around Q1-Q2 ’14
– Expected final number of apps: < 20
• Part of the national pre-exascale projects
– Supported by the MEXT HPCI Feasibility Study program (PI: Hirofumi Tomita, RIKEN AICS)
Mini-App Methodology
1. Request for application submissions to current users of the Japanese HPC systems
– Documentation
• Mathematical background
• Target capability and capacity
– Input data sets
– Validation methods
– Application code
• Full-scale applications
– Capable of running complete end-to-end simulations
– 17 submissions so far
• Stripped-down applications
– Simplified applications with only the essential parts of the code
– 6 submissions so far
Mini-App Methodology
2. Deriving “mini” applications from the submitted applications
– Understanding performance-critical patterns
• Computation, memory access, file I/O, network I/O
– Reducing codebase size
• Removing code not used for target problems
• Applying standard software engineering practices (e.g., DRY)
– Refactoring into reference implementations
– Performance modeling
• In collaboration with the ASPEN project (Jeff Vetter at ORNL)
– (Optional) Versions optimized for specific architectures
Mini-App Example: Molecular Dynamics
• Kernels: pairwise force calculation + long-range updates
• Two alternative algorithms for solving equivalent problems
– FFT-based Particle Mesh Ewald
• Bottlenecked by all-to-all communication at scale
– Fast Multipole Method
• Tree-based problem formulation with no all-to-all communication
• Simplified problem settings
– Only simulates water molecules in the NVE setting
– Reduces the codebase significantly
– Easier to create input data sets of different scales
• Two reference implementations to study the performance implications of algorithmic differences
– MARBLE (20K SLOC)
– MODYLAS (16K SLOC)