Transcript of "Stencil Framework for Portable High Performance Computing," Naoya Maruyama, RIKEN Advanced Institute for Computational Science, April 2013 @ NCSA, IL, USA (32 slides).

Page 1: Stencil Framework for Portable High Performance Computing Naoya Maruyama RIKEN Advanced Institute for Computational Science April 2013 @ NCSA, IL, USA.

Stencil Framework for Portable High Performance Computing

Naoya Maruyama, RIKEN Advanced Institute for Computational Science

April 2013 @ NCSA, IL, USA


Page 2

Talk Outline

• Physis stencil framework

• Map Reduce for K

• Mini-app development

Page 3

Multi-GPU Application Development

• Non-unified programming models
  – MPI for inter-node parallelism
  – CUDA/OpenCL/OpenACC for accelerators
• Optimization
  – Blocking
  – Overlapped computation and communication


Good performance with low programmer productivity

Page 4

Goal

High performance, highly productive programming for heterogeneous clusters

Approach

High level abstractions for structured parallel programming

– Simplifying programming models
– Portability across platforms
– Does not sacrifice too much performance

Page 5

Physis (Φύσις) Framework [SC’11]

Stencil DSL
• Declarative
• Portable
• Global-view
• C-based

void diffusion(int x, int y, int z,
               PSGrid3DFloat g1, PSGrid3DFloat g2) {
  float v = PSGridGet(g1,x,y,z) +
    PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z) +
    PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z) +
    PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
  PSGridEmit(g2, v/7.0);
}

DSL Compiler
• Target-specific code generation and optimizations
• Automatic parallelization

[Diagram: Physis source is translated to C, C+MPI, CUDA, CUDA+MPI, OpenMP, or OpenCL.]

Page 6

DSL Overview
• C + custom data types and intrinsics
• Grid data types
  – PSGrid3DFloat, PSGrid3DDouble, etc.
• Dense Cartesian domain types
  – PSDomain1D, PSDomain2D, and PSDomain3D
• Intrinsics
  – Runtime management
  – Grid object management (PSGrid3DFloatNew, etc.)
  – Grid accesses (PSGridCopyin, PSGridGet, etc.)
  – Applying stencils to grids (PSStencilMap, PSStencilRun)
  – Grid reductions (PSGridReduce)

Page 7

Writing Stencils
• Stencil kernel
  – C functions describing a single flow of scalar execution on one grid element
  – Executed over specified rectangular domains

void diffusion(const int x, const int y, const int z,
               PSGrid3DFloat g1, PSGrid3DFloat g2, float t) {
  float v = PSGridGet(g1,x,y,z) +
    PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z) +
    PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z) +
    PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
  PSGridEmit(g2, v/7.0*t);
}

PSGridEmit issues a write to grid g2. Offsets must be constant. Periodic access is possible with PSGridGetPeriodic.
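For readers unfamiliar with the DSL, the same 7-point diffusion step can be sketched in plain sequential C. The flat array layout and the function names here are illustrative assumptions, not part of the Physis API:

```c
/* Illustrative sequential sketch of the 7-point diffusion kernel above.
 * Plain C arrays stand in for Physis grids; layout is assumed x-major. */
static float at(const float *g, int nx, int ny, int x, int y, int z) {
  return g[x + nx * (y + ny * z)];
}

void diffusion_seq(const float *g1, float *g2,
                   int nx, int ny, int nz, float t) {
  for (int z = 1; z < nz - 1; ++z)
    for (int y = 1; y < ny - 1; ++y)
      for (int x = 1; x < nx - 1; ++x) {
        float v = at(g1, nx, ny, x, y, z)
                + at(g1, nx, ny, x - 1, y, z) + at(g1, nx, ny, x + 1, y, z)
                + at(g1, nx, ny, x, y - 1, z) + at(g1, nx, ny, x, y + 1, z)
                + at(g1, nx, ny, x, y, z - 1) + at(g1, nx, ny, x, y, z + 1);
        g2[x + nx * (y + ny * z)] = v / 7.0f * t;  /* like PSGridEmit */
      }
}
```

From the DSL kernel, the Physis translator generates equivalent loops (or CUDA kernels) automatically.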

Page 8

Applying Stencils to Grids

• Map: creates a stencil closure that encapsulates a stencil and its grids

• Run: Iteratively executes stencil closures

PSGrid3DFloat g1 = PSGrid3DFloatNew(NX, NY, NZ);
PSGrid3DFloat g2 = PSGrid3DFloatNew(NX, NY, NZ);
PSDomain3D d = PSDomain3DNew(0, NX, 0, NY, 0, NZ);

PSStencilRun(PSStencilMap(diffusion, d, g1, g2, 0.5),
             PSStencilMap(diffusion, d, g2, g1, 0.5),
             10);

Stencils grouped in a single PSStencilRun are the target for kernel fusion optimization.
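Conceptually, PSStencilRun executes its mapped stencils in order for the given number of iterations, ping-ponging between g1 and g2. A minimal sketch of that control flow in plain C, with a trivial stand-in step function (hypothetical, not the Physis runtime):

```c
/* Trivial stand-in for one mapped stencil application (hypothetical). */
static void step(const float *src, float *dst, int n) {
  for (int i = 0; i < n; ++i) dst[i] = src[i] + 1.0f;
}

/* Mirrors PSStencilRun(PSStencilMap(..., g1, g2, ...),
 *                      PSStencilMap(..., g2, g1, ...), iters). */
void run_ping_pong(float *g1, float *g2, int n, int iters) {
  for (int i = 0; i < iters; ++i) {
    step(g1, g2, n);  /* first mapped stencil: reads g1, writes g2 */
    step(g2, g1, n);  /* second mapped stencil: reads g2, writes g1 */
  }
}
```

Double-buffering this way avoids read-after-write hazards within one stencil sweep.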

Page 9

Implementation

• DSL translator
  – Translates intrinsics calls to runtime API calls
  – Generates GPU kernels with boundary exchanges based on static analysis
  – Built with the ROSE compiler framework (LLNL)
• Runtime
  – Provides a shared-memory-like interface for multidimensional grids over distributed CPU/GPU memory

[Diagram: Physis code → implementation source code → executable code]

Page 10

CUDA Thread Blocking
• Each thread sweeps points in the Z dimension
• X and Y dimensions are blocked with AxB thread blocks, where A and B are user-configurable parameters (64x4 by default)

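The launch configuration implied by this blocking scheme can be sketched as follows; the struct and helper names are illustrative, not Physis code:

```c
/* Illustrative CUDA launch configuration for X-Y thread blocking:
 * an A x B block tiles the X-Y plane; each thread then sweeps Z. */
typedef struct { int grid_x, grid_y, block_x, block_y; } launch_cfg;

launch_cfg make_cfg(int nx, int ny, int a, int b) {
  launch_cfg c;
  c.block_x = a;                /* A (64 by default) */
  c.block_y = b;                /* B (4 by default)  */
  c.grid_x = (nx + a - 1) / a;  /* ceil(nx / A) blocks in X */
  c.grid_y = (ny + b - 1) / b;  /* ceil(ny / B) blocks in Y */
  return c;
}
```

The generated kernel on the next slide derives each thread's (x, y) from exactly this block decomposition.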

Page 11

Example: 7-point Stencil GPU Code

__device__ void kernel(const int x, const int y, const int z,
                       __PSGrid3DFloatDev *g, __PSGrid3DFloatDev *g2) {
  float v = *__PSGridGetAddrNoHaloFloat3D(g,x,y,z)
    + *__PSGridGetAddrFloat3D_0_fw(g,(x + 1),y,z)
    + *__PSGridGetAddrFloat3D_0_bw(g,(x - 1),y,z)
    + *__PSGridGetAddrFloat3D_1_fw(g,x,(y + 1),z)
    + *__PSGridGetAddrFloat3D_1_bw(g,x,(y - 1),z)
    + *__PSGridGetAddrFloat3D_2_bw(g,x,y,(z - 1))
    + *__PSGridGetAddrFloat3D_2_fw(g,x,y,(z + 1));
  *__PSGridEmitAddrFloat3D(g2,x,y,z) = v;
}

__global__ void __PSStencilRun_kernel(int offset0, int offset1, __PSDomain dom,
                                      __PSGrid3DFloatDev g, __PSGrid3DFloatDev g2) {
  int x = blockIdx.x * blockDim.x + threadIdx.x + offset0;
  int y = blockIdx.y * blockDim.y + threadIdx.y + offset1;
  if (x < dom.local_min[0] || x >= dom.local_max[0] ||
      y < dom.local_min[1] || y >= dom.local_max[1])
    return;
  int z;
  for (z = dom.local_min[2]; z < dom.local_max[2]; ++z) {
    kernel(x, y, z, &g, &g2);
  }
}

Page 12

Example: 7-point Stencil CPU Code

static void __PSStencilRun_0(int iter, void **stencils) {
  struct dim3 block_dim(64,4,1);
  struct __PSStencil_kernel *s0 = (struct __PSStencil_kernel *)stencils[0];
  cudaFuncSetCacheConfig(__PSStencilRun_kernel, cudaFuncCachePreferL1);
  struct dim3 s0_grid_dim((int)(ceil(__PSGetLocalSize(0) / ((double)64))),
                          (int)(ceil(__PSGetLocalSize(1) / ((double)4))), 1);
  __PSDomainSetLocalSize(&s0->dom);
  s0->g = __PSGetGridByID(s0->__g_index);
  s0->g2 = __PSGetGridByID(s0->__g2_index);
  int i;
  for (i = 0; i < iter; ++i) {
    {
      int fw_width[3] = {1L, 1L, 1L};
      int bw_width[3] = {1L, 1L, 1L};
      __PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
    }
    __PSStencilRun_kernel<<<s0_grid_dim, block_dim>>>(
        __PSGetLocalOffset(0), __PSGetLocalOffset(1), s0->dom,
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
  }
  cudaThreadSynchronize();
}

Page 13

Optimization: Overlapped Computation and Communication

1. Copy boundaries from GPU to CPU for non-unit-stride cases
2. Compute interior points
3. Exchange boundaries with neighbors
4. Compute boundaries

[Diagram: timeline showing boundary and inner-point work overlapping.]

Page 14

Optimization Example: 7-Point Stencil CPU Code

for (i = 0; i < iter; ++i) {
  __PSStencilRun_kernel_interior<<<s0_grid_dim, block_dim, 0, stream_interior>>>(
      __PSGetLocalOffset(0), __PSGetLocalOffset(1),
      __PSDomainShrink(&s0->dom, 1),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
  int fw_width[3] = {1L, 1L, 1L};
  int bw_width[3] = {1L, 1L, 1L};
  __PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
  __PSStencilRun_kernel_boundary_1_bw<<<1, (dim3(1,128,4)), 0, stream_boundary_kernel[0]>>>(
      __PSDomainGetBoundary(&s0->dom, 0, 0, 1, 5, 0),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
  __PSStencilRun_kernel_boundary_1_bw<<<1, (dim3(1,128,4)), 0, stream_boundary_kernel[1]>>>(
      __PSDomainGetBoundary(&s0->dom, 0, 0, 1, 5, 1),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
  …
  __PSStencilRun_kernel_boundary_2_fw<<<1, (dim3(128,1,4)), 0, stream_boundary_kernel[11]>>>(
      __PSDomainGetBoundary(&s0->dom, 1, 1, 1, 1, 0),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
      *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
  cudaThreadSynchronize();
}
cudaThreadSynchronize();

Within each iteration, the interior-point kernel, the boundary exchange, and the boundary-plane kernels run concurrently on separate streams.

Page 15

Local Optimization
• Register blocking
  – Reuse loaded grid elements with registers

Original:

for (int k = 1; k < n-1; ++k) {
  g[i][j][k] = a*(f[i][j][k] + f[i][j][k-1] + f[i][j][k+1]);
}

Optimized:

double kc = f[i][j][0];
double kn = f[i][j][1];
for (int k = 1; k < n-1; ++k) {
  double kp = kc;
  kc = kn;
  kn = f[i][j][k+1];
  g[i][j][k] = a*(kc + kp + kn);
}
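That the transformation preserves results can be checked with a small self-contained sketch (a 1-D version of the sweep; the size and function names are illustrative):

```c
#define N 16  /* illustrative sweep length */

/* Original form: three loads of f per iteration. */
void sweep_original(const double *f, double *g, double a) {
  for (int k = 1; k < N - 1; ++k)
    g[k] = a * (f[k] + f[k-1] + f[k+1]);
}

/* Register-blocked form: one new load per iteration;
 * kp/kc/kn carry f[k-1], f[k], f[k+1] in registers. */
void sweep_blocked(const double *f, double *g, double a) {
  double kc = f[0], kn = f[1];
  for (int k = 1; k < N - 1; ++k) {
    double kp = kc;   /* f[k-1] */
    kc = kn;          /* f[k]   */
    kn = f[k + 1];    /* the only load in the loop body */
    g[k] = a * (kc + kp + kn);
  }
}
```

Both forms perform the same additions in the same order, so the results match bit-for-bit.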

Page 16

Local Optimization
• Common subexpression elimination in offset computation
  – Eliminates intra- and inter-iteration common subexpressions

Original:

for (int k = 1; k < n-1; ++k) {
  g[i][j][k] = a*(f[i+j*n+k*n*n] +
                  f[i+j*n+(k-1)*n*n] + f[i+j*n+(k+1)*n*n]);
}

Optimized:

int idx = i+j*n+n*n;
for (int k = 1; k < n-1; ++k) {
  g[i][j][k] = a*(f[idx] + f[idx-n*n] + f[idx+n*n]);
  idx += n*n;
}
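Again, equivalence of the two forms can be verified with a small sketch; the output is flattened to 1-D for brevity, and the names are illustrative:

```c
/* Original form: recompute the full offset i + j*n + k*n*n each iteration. */
void cse_original(const double *f, double *g, int i, int j, int n, double a) {
  for (int k = 1; k < n - 1; ++k)
    g[k] = a * (f[i + j*n + k*n*n] +
                f[i + j*n + (k-1)*n*n] + f[i + j*n + (k+1)*n*n]);
}

/* Optimized form: hoist the common subexpression i + j*n and
 * strength-reduce the per-iteration offset to a single addition. */
void cse_optimized(const double *f, double *g, int i, int j, int n, double a) {
  int idx = i + j*n + n*n;  /* offset of element (i, j, 1) */
  for (int k = 1; k < n - 1; ++k) {
    g[k] = a * (f[idx] + f[idx - n*n] + f[idx + n*n]);
    idx += n*n;             /* advance one z-plane */
  }
}
```

The loop body shrinks from several multiplies per access to pure additions, which matters on GPUs where integer address arithmetic competes with floating-point work.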

Page 17

Evaluation
• Performance and productivity
• Sample code
  – 7-point diffusion kernel (#stencils: 1)
  – Jacobi kernel from the Himeno benchmark (#stencils: 1)
  – Seismic simulation (#stencils: 15)
• Platform
  – Tsubame 2.0
    • Node: Westmere-EP 2.9 GHz x 2 + Tesla M2050 x 3
    • Dual InfiniBand QDR with a full-bisection-bandwidth fat tree

Page 18

Productivity

Code size is similar to sequential C code.

Page 19

Optimization Effect

[Chart: achieved memory bandwidth (GB/s), 0–90, for the Hand-tuned, No-opt, Register Blocking, Offset CSE, and Full Opt variants.]

7-point diffusion stencil on 1 GPU (Tesla M2050)

Page 20

Diffusion Weak Scaling

[Chart: GFlops (0–10000) vs. number of GPUs (0–300) for problem sizes 512x256x256 and 256x128x128.]

Page 21

Seismic Weak Scaling

[Chart: GFLOPS (0–4500) vs. number of GPUs (0–600, 2 GPUs per node).]

Problem size: 256x256x256 per GPU

Page 22

Diffusion Strong Scaling

[Chart: GFlops (0–4000) vs. number of GPUs (0–140) for 1-D, 2-D, and 3-D decompositions.]

Problem size: 512x512x4096

Page 23

Himeno Strong Scaling

[Chart: GFlops (0–3000) vs. number of GPUs (0–140), 1-D decomposition.]

Problem size: XL (1024x1024x512)

Page 24

Ongoing Work
• Auto-tuning
  – Preliminary AT for the CUDA backend available
• Supporting different accelerators
• Supporting more complex problems
  – Stencils with limited data dependency
  – Hierarchically organized problems
    • Work unit: dense structured grids
    • Overall problem domain: sparsely connected work units
• Examples
  – NICAM: an icosahedral climate model
  – UPACS: fluid simulation of engineering problems

Page 25

Further Information
• Code is available at http://github.com/naoyam/physis

• Maruyama et al., “Physis: Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers,” SC’11, 2011.

Page 26

Talk Outline

• Physis stencil framework

• Map Reduce for K

• Mini-app development

Page 27

Map Reduce for K
• In-memory MPI-based MapReduce for the K computer
  – Implemented as a C library
  – Provides (some of) Hadoop-like programming interfaces
  – Strong focus on scalable processing of large data sets on K
  – Supports standard MPI clusters too
• Application examples
  – Metagenome sequence analysis
  – Replica-exchange MD
    • Runs hundreds of NAMD instances as Map tasks; fast data loading via the Tofu network

Page 28

Talk Outline

• Physis stencil framework

• Map Reduce for K

• Mini-app development

Page 29

HPC Mini Applications
• A set of mini-apps derived from national key applications
  – Source code will be publicly released around Q1-Q2 ’14
  – Expected final number of apps: < 20
• Part of the national pre-exascale projects
  – Supported by the MEXT HPCI Feasibility Study program (PI: Hirofumi Tomita, RIKEN AICS)

Page 30

Mini-App Methodology
1. Request application submissions from current users of Japanese HPC systems
  – Documentation
    • Mathematical background
    • Target capability and capacity
  – Input data sets
  – Validation methods
  – Application code
• Full-scale applications
  – Capable of running complete end-to-end simulations
  – 17 submissions so far
• Stripped-down applications
  – Simplified applications with only the essential parts of the code
  – 6 submissions so far

Page 31

Mini-App Methodology
2. Derive "mini" applications from the submitted applications
  – Understanding performance-critical patterns
    • Computation, memory access, file I/O, network I/O
  – Reducing codebase size
    • Removing code not used for target problems
    • Applying standard software engineering practices (e.g., DRY)
  – Refactoring into reference implementations
  – Performance modeling
    • In collaboration with the ASPEN project (Jeff Vetter at ORNL)
  – (Optional) versions optimized for a specific architecture

Page 32

Mini-App Example: Molecular Dynamics
• Kernels: pairwise force calculation + long-range updates
• Two alternative algorithms for solving equivalent problems
  – FFT-based Particle Mesh Ewald
    • Bottlenecked by all-to-all communications at scale
  – Fast Multipole Method
    • Tree-based problem formulation with no all-to-all communications
• Simplified problem settings
  – Only simulates water molecules in the NVE ensemble
  – Reduces the codebase significantly
  – Easier to create input data sets of different scales
• Two reference implementations to study the performance implications of algorithmic differences
  – MARBLE (20K SLOC)
  – MODYLAS (16K SLOC)