Stencil Framework for Portable High Performance Computing
Naoya Maruyama, RIKEN Advanced Institute for Computational Science
April 2013 @ NCSA, IL, USA
Talk Outline
• Physis stencil framework
• MapReduce for K
• Mini-app development
Multi-GPU Application Development
• Non-unified programming models
– MPI for inter-node parallelism
– CUDA/OpenCL/OpenACC for accelerators
• Optimization
– Blocking
– Overlapped computation and communication
[Figure: a GPU cluster, with CUDA used within each node and MPI between nodes]
Good performance, but low programmer productivity
Goal
High performance, highly productive programming for heterogeneous clusters
Approach
High-level abstractions for structured parallel programming
– Simplifying programming models
– Portability across platforms
– Without sacrificing too much performance
Physis (Φύσις) Framework [SC'11]
Stencil DSL
• Declarative
• Portable
• Global-view
• C-based
void diffusion(int x, int y, int z,
               PSGrid3DFloat g1, PSGrid3DFloat g2) {
  float v = PSGridGet(g1,x,y,z)
    + PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z)
    + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
    + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
  PSGridEmit(g2, v/7.0);
}
DSL Compiler
• Target-specific code generation and optimizations
• Automatic parallelization
[Figure: Physis source code is translated to C, C+MPI, CUDA, CUDA+MPI, OpenMP, and OpenCL backends]
DSL Overview
• C + custom data types and intrinsics
• Grid data types
– PSGrid3DFloat, PSGrid3DDouble, etc.
• Dense Cartesian domain types
– PSDomain1D, PSDomain2D, and PSDomain3D
• Intrinsics
– Runtime management
– Grid object management (PSGrid3DFloatNew, etc.)
– Grid accesses (PSGridCopyin, PSGridGet, etc.)
– Applying stencils to grids (PSStencilMap, PSStencilRun)
– Grid reductions (PSGridReduce)
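To show how these intrinsics fit together, here is a minimal host-program sketch. PSInit, PSFinalize, PSGridCopyout, and PSGridFree are assumed names for the runtime- and grid-management intrinsics not spelled out on this slide; diffusion is the stencil kernel shown earlier.

/* Minimal sketch only: PSInit, PSFinalize, PSGridCopyout, and PSGridFree
   are assumed intrinsic names; the rest follows the slides. */
void run_diffusion(int argc, char *argv[], float *host_buf) {
  PSInit(&argc, &argv, 3, NX, NY, NZ);               /* runtime management */
  PSGrid3DFloat g1 = PSGrid3DFloatNew(NX, NY, NZ);   /* grid objects */
  PSGrid3DFloat g2 = PSGrid3DFloatNew(NX, NY, NZ);
  PSDomain3D d = PSDomain3DNew(0, NX, 0, NY, 0, NZ); /* iteration domain */
  PSGridCopyin(g1, host_buf);                        /* host -> grid */
  /* Apply the diffusion stencil 10 times, ping-ponging g1 and g2 */
  PSStencilRun(PSStencilMap(diffusion, d, g1, g2),
               PSStencilMap(diffusion, d, g2, g1), 10);
  PSGridCopyout(g1, host_buf);                       /* grid -> host */
  PSGridFree(g1);
  PSGridFree(g2);
  PSFinalize();
}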
Writing Stencils
• Stencil kernel
– A C function describing a single flow of scalar execution on one grid element
– Executed over specified rectangular domains

void diffusion(const int x, const int y, const int z,
               PSGrid3DFloat g1, PSGrid3DFloat g2, float t) {
  float v = PSGridGet(g1,x,y,z)
    + PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z)
    + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
    + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
  PSGridEmit(g2, v/7.0*t);
}
Notes: PSGridEmit issues a write to grid g2. Offsets passed to PSGridGet must be constant. Periodic access is possible with PSGridGetPeriodic.
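As a small illustration of periodic access, the kernel sketch below averages along X with wraparound at the domain boundary. Only the name PSGridGetPeriodic is taken from the slide; the kernel itself is a made-up example written in the same style as diffusion, assuming PSGridGetPeriodic takes the same arguments as PSGridGet.

void periodic_avg_x(const int x, const int y, const int z,
                    PSGrid3DFloat g1, PSGrid3DFloat g2) {
  /* Neighbor accesses wrap around at the X boundaries */
  float v = PSGridGetPeriodic(g1, x-1, y, z)
          + PSGridGet(g1, x, y, z)
          + PSGridGetPeriodic(g1, x+1, y, z);
  PSGridEmit(g2, v / 3.0f);
}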
Applying Stencils to Grids
• Map: creates a stencil closure that encapsulates a stencil and its grids
• Run: iteratively executes stencil closures
PSGrid3DFloat g1 = PSGrid3DFloatNew(NX, NY, NZ);
PSGrid3DFloat g2 = PSGrid3DFloatNew(NX, NY, NZ);
PSDomain3D d = PSDomain3DNew(0, NX, 0, NY, 0, NZ);

PSStencilRun(PSStencilMap(diffusion, d, g1, g2, 0.5),
             PSStencilMap(diffusion, d, g2, g1, 0.5),
             10);
Stencils grouped in one PSStencilRun call are the target for the kernel fusion optimization.
Implementation
• DSL translator
– Translates intrinsic calls to runtime API calls
– Generates GPU kernels with boundary exchanges based on static analysis
– Built on the ROSE compiler framework (LLNL)
• Runtime
– Provides a shared-memory-like interface for multidimensional grids over distributed CPU/GPU memory
[Figure: Physis code → implementation source code → executable code]
CUDA Thread Blocking
• Each thread sweeps points in the Z dimension
• The X and Y dimensions are blocked with AxB thread blocks, where A and B are user-configurable parameters (64x4 by default)
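A simplified, hand-written CUDA kernel following this blocking scheme might look like the sketch below. It assumes a flat, row-major array with a one-point halo handled by the bounds check, rather than the generated accessor functions shown on the next slide.

__global__ void diffusion_blocked(const float *f, float *g,
                                  int nx, int ny, int nz) {
  /* One thread per (x, y) column; launched with, e.g., 64x4 thread blocks */
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < 1 || x >= nx - 1 || y < 1 || y >= ny - 1) return;
  /* Each thread sweeps the Z dimension */
  for (int z = 1; z < nz - 1; ++z) {
    int idx = x + y * nx + z * nx * ny;
    g[idx] = (f[idx] + f[idx - 1] + f[idx + 1]
            + f[idx - nx] + f[idx + nx]
            + f[idx - nx * ny] + f[idx + nx * ny]) / 7.0f;
  }
}

A matching launch would use dim3 block(64, 4) and a grid of ceil(nx/64) x ceil(ny/4) blocks, mirroring the default configuration above.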
Example: 7-point Stencil GPU Code

__device__ void kernel(const int x, const int y, const int z,
                       __PSGrid3DFloatDev *g, __PSGrid3DFloatDev *g2) {
  float v = *__PSGridGetAddrNoHaloFloat3D(g,x,y,z)
    + *__PSGridGetAddrFloat3D_0_fw(g,(x + 1),y,z)
    + *__PSGridGetAddrFloat3D_0_bw(g,(x - 1),y,z)
    + *__PSGridGetAddrFloat3D_1_fw(g,x,(y + 1),z)
    + *__PSGridGetAddrFloat3D_1_bw(g,x,(y - 1),z)
    + *__PSGridGetAddrFloat3D_2_bw(g,x,y,(z - 1))
    + *__PSGridGetAddrFloat3D_2_fw(g,x,y,(z + 1));
  *__PSGridEmitAddrFloat3D(g2,x,y,z) = v;
}
__global__ void __PSStencilRun_kernel(int offset0, int offset1, __PSDomain dom,
                                      __PSGrid3DFloatDev g, __PSGrid3DFloatDev g2) {
  int x = blockIdx.x * blockDim.x + threadIdx.x + offset0,
      y = blockIdx.y * blockDim.y + threadIdx.y + offset1;
  if (x < dom.local_min[0] || x >= dom.local_max[0] ||
      (y < dom.local_min[1] || y >= dom.local_max[1]))
    return;
  int z;
  for (z = dom.local_min[2]; z < dom.local_max[2]; ++z) {
    kernel(x, y, z, &g, &g2);
  }
}
Example: 7-point Stencil CPU Code

static void __PSStencilRun_0(int iter, void **stencils) {
  struct dim3 block_dim(64, 4, 1);
  struct __PSStencil_kernel *s0 = (struct __PSStencil_kernel *)stencils[0];
  cudaFuncSetCacheConfig(__PSStencilRun_kernel, cudaFuncCachePreferL1);
  struct dim3 s0_grid_dim((int)(ceil(__PSGetLocalSize(0) / ((double)64))),
                          (int)(ceil(__PSGetLocalSize(1) / ((double)4))), 1);
  __PSDomainSetLocalSize(&s0->dom);
  s0->g = __PSGetGridByID(s0->__g_index);
  s0->g2 = __PSGetGridByID(s0->__g2_index);
  int i;
  for (i = 0; i < iter; ++i) {
    {
      int fw_width[3] = {1L, 1L, 1L};
      int bw_width[3] = {1L, 1L, 1L};
      __PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
    }
    __PSStencilRun_kernel<<<s0_grid_dim, block_dim>>>(
        __PSGetLocalOffset(0), __PSGetLocalOffset(1), s0->dom,
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
  }
  cudaThreadSynchronize();
}
Optimization: Overlapped Computation and Communication
1. Copy boundaries from GPU to CPU (for non-unit-stride cases)
2. Compute interior points
3. Exchange boundaries with neighbors
4. Compute boundary points
[Figure: timeline showing boundary transfers and exchanges overlapping with interior computation]
Optimization Example: 7-Point Stencil CPU Code
for (i = 0; i < iter; ++i) {
  __PSStencilRun_kernel_interior<<<s0_grid_dim, block_dim, 0, stream_interior>>>(
      __PSGetLocalOffset(0), __PSGetLocalOffset(1), __PSDomainShrink(&s0->dom, 1),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
  int fw_width[3] = {1L, 1L, 1L};
  int bw_width[3] = {1L, 1L, 1L};
  __PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
  __PSStencilRun_kernel_boundary_1_bw<<<1, (dim3(1,128,4)), 0, stream_boundary_kernel[0]>>>(
      __PSDomainGetBoundary(&s0->dom, 0, 0, 1, 5, 0),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
  __PSStencilRun_kernel_boundary_1_bw<<<1, (dim3(1,128,4)), 0, stream_boundary_kernel[1]>>>(
      __PSDomainGetBoundary(&s0->dom, 0, 0, 1, 5, 1),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
  …
  __PSStencilRun_kernel_boundary_2_fw<<<1, (dim3(128,1,4)), 0, stream_boundary_kernel[11]>>>(
      __PSDomainGetBoundary(&s0->dom, 1, 1, 1, 1, 0),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
  cudaThreadSynchronize();
}
cudaThreadSynchronize();
The launches above correspond to computing the interior points (the _interior kernel on stream_interior), the boundary exchange (__PSLoadNeighbor), and computing the boundary planes concurrently (the _boundary kernels on separate streams).
Local Optimization
• Register blocking
– Reuse loaded grid elements via registers
Original:
for (int k = 1; k < n-1; ++k) {
  g[i][j][k] = a*(f[i][j][k]+f[i][j][k-1]+f[i][j][k+1]);
}

Optimized:
double kc = f[i][j][0];
double kn = f[i][j][1];
for (int k = 1; k < n-1; ++k) {
  double kp = kc;
  kc = kn;
  kn = f[i][j][k+1];
  g[i][j][k] = a*(kc+kp+kn);
}
Local Optimization
• Common subexpression elimination in offset computation
– Eliminates intra- and inter-iteration common subexpressions
Original:
for (int k = 1; k < n-1; ++k) {
  g[i][j][k] = a*(f[i+j*n+k*n*n]+f[i+j*n+(k-1)*n*n]+f[i+j*n+(k+1)*n*n]);
}

Optimized:
int idx = i+j*n+n*n;
for (int k = 1; k < n-1; ++k) {
  g[i][j][k] = a*(f[idx]+f[idx-n*n]+f[idx+n*n]);
  idx += n*n;
}
Evaluation
• Performance and productivity
• Sample code
– 7-point diffusion kernel (1 stencil)
– Jacobi kernel from the Himeno benchmark (1 stencil)
– Seismic simulation (15 stencils)
• Platform
– Tsubame 2.0
  • Node: Westmere-EP 2.9 GHz x 2 + Tesla M2050 x 3
  • Dual InfiniBand QDR with a full-bisection-bandwidth fat tree
Productivity
Similar code size to the sequential C version
Optimization Effect
[Figure: bandwidth (GB/s) of the 7-point diffusion stencil on 1 GPU (Tesla M2050), comparing hand-tuned, no-opt, register blocking, offset CSE, and full-opt versions]
Diffusion Weak Scaling
[Figure: GFlops vs. number of GPUs for problem sizes 512x256x256 and 256x128x128]
Seismic Weak Scaling
[Figure: GFlops vs. number of GPUs (2 GPUs per node); problem size 256x256x256 per GPU]
Diffusion Strong Scaling
[Figure: GFlops vs. number of GPUs for 1-D, 2-D, and 3-D decompositions; problem size 512x512x4096]
Himeno Strong Scaling
[Figure: GFlops vs. number of GPUs (1-D decomposition); problem size XL (1024x1024x512)]
Ongoing Work
• Auto-tuning
– Preliminary auto-tuning for the CUDA backend is available
• Supporting different accelerators
• Supporting more complex problems
– Stencils with limited data dependency
– Hierarchically organized problems
  • Work unit: dense structured grids
  • Overall problem domain: sparsely connected work units
  • Examples
– NICAM: an icosahedral model for climate simulation
– UPACS: fluid simulation for engineering problems
Further Information
• Code is available at http://github.com/naoyam/physis
• Maruyama et al., "Physis: An Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers," SC'11, 2011.
Talk Outline
• Physis stencil framework
• MapReduce for K
• Mini-app development
MapReduce for K
• In-memory, MPI-based MapReduce for the K computer
– Implemented as a C library
– Provides (some of the) Hadoop-like programming interfaces
– Strong focus on scalable processing of large data sets on K
– Fast data loading via the Tofu network
– Also supports standard MPI clusters
• Application examples
– Metagenome sequence analysis
– Replica-exchange MD
  • Runs hundreds of NAMD instances as Map tasks
Talk Outline
• Physis stencil framework
• MapReduce for K
• Mini-app development
HPC Mini Applications
• A set of mini-apps derived from national key applications
– Source code will be publicly released around Q1-Q2 '14
– Expected final number of apps: fewer than 20
• Part of the national pre-exascale projects
– Supported by the MEXT HPCI Feasibility Study program (PI: Hirofumi Tomita, RIKEN AICS)
Mini-App Methodology
1. Request for application submissions from current users of Japanese HPC systems
– Documentation
  • Mathematical background
  • Target capability and capacity
– Input data sets
– Validation methods
– Application code
• Full-scale applications
– Capable of running complete end-to-end simulations
– 17 submissions so far
• Stripped-down applications
– Simplified applications containing only the essential parts of the code
– 6 submissions so far
Mini-App Methodology
2. Deriving "mini" applications from the submitted applications
– Understanding performance-critical patterns
  • Computation, memory access, file I/O, network I/O
– Reducing codebase size
  • Removing code not used for the target problems
  • Applying standard software engineering practices (e.g., DRY)
– Refactoring into reference implementations
– Performance modeling
  • In collaboration with the ASPEN project (Jeff Vetter, ORNL)
– (Optional) Versions optimized for specific architectures
Mini-App Example: Molecular Dynamics
• Kernels: pairwise force calculation + long-range updates (a rough sketch of the pairwise kernel follows after this slide)
• Two alternative algorithms solving equivalent problems
– FFT-based particle mesh Ewald
  • Bottlenecked by all-to-all communication at scale
– Fast multipole method
  • Tree-based formulation with no all-to-all communication
• Simplified problem setting
– Only simulates water molecules in the NVE setting
– Reduces the codebase significantly
– Makes it easier to create input data sets of different scales
• Two reference implementations to study the performance implications of algorithmic differences
– MARBLE (20K SLOC)
– MODYLAS (16K SLOC)
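As a rough illustration of the short-range kernel mentioned above, here is a naive O(N^2) pairwise Lennard-Jones force loop. It is illustrative only (no cutoff, cell lists, or parallelization) and is not code taken from MARBLE or MODYLAS; force is assumed to be zero-initialized.

#include <stddef.h>

/* Naive O(N^2) Lennard-Jones force accumulation; illustrative sketch only. */
void pairwise_forces(size_t n, const double (*pos)[3], double (*force)[3],
                     double epsilon, double sigma) {
  for (size_t i = 0; i < n; ++i)
    for (size_t j = i + 1; j < n; ++j) {
      double dx = pos[i][0] - pos[j][0];
      double dy = pos[i][1] - pos[j][1];
      double dz = pos[i][2] - pos[j][2];
      double r2 = dx * dx + dy * dy + dz * dz;
      double sr2 = sigma * sigma / r2;
      double sr6 = sr2 * sr2 * sr2;
      /* Force magnitude divided by r, applied along the displacement vector */
      double f = 24.0 * epsilon * sr6 * (2.0 * sr6 - 1.0) / r2;
      force[i][0] += f * dx; force[i][1] += f * dy; force[i][2] += f * dz;
      force[j][0] -= f * dx; force[j][1] -= f * dy; force[j][2] -= f * dz;
    }
}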