OpenACC vs.OpenMP4: The Strong, the Weak, the Missing...

25
1 Center for High Performance Computing Shanghai Jiao Tong University OpenACC2 vs.OpenMP4 The Strong, the Weak, and the Missing to Develop Performance Portable Applica>ons on GPU and Xeon Phi 2 Satoshi MATSUOKA Laboratory Tokyo Institute of Technology James Lin 1,2 and Satoshi Matsuoka 2 C 2014@San Jose

Transcript of OpenACC vs.OpenMP4: The Strong, the Weak, the Missing...

Page 1: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

1Center for High Performance Computing Shanghai Jiao Tong University

 

OpenACC2  vs.OpenMP4      

The  Strong,  the  Weak,  and  the  Missing  to  Develop  Performance  Portable  Applica>ons  on  GPU  and  Xeon  Phi

2Satoshi MATSUOKA Laboratory Tokyo Institute of Technology

James  Lin1,2                        and                          Satoshi  Matsuoka2    

GTC  2014@San  Jose

Page 2: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Outline

•  OpenACC2 and OpenMP4

•  Systematic Optimizations to Portable Performance on GPU and Xeon Phi

•  Summary

2

Page 3: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

•  We may have many reasons to use directive-based programming, but for me, it can keep the code readable/maintainable from the application developers' point of view

CUDA  Experts  

Applica>on  Developers

Version  1 Version  2  is  based  on  develops’  own  version

Port  to  CUDA

CUDA  Version  1

Unmaintainable  to  applica>on  developers  

Why  Direc>ve-­‐based  Programming?

3

Page 4: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Direc>ve-­‐based  Programming  for  Accelerators[1]

•  Standard – OpenACC – OpenMP >=4.0

•  Product – PGI Accelerators – HMPP

•  Research Projects – R-Stream – HiCUDA – OpenMPC/OpenMP for accelerators

[1]  S.  Lee  and  J.  S.  VeTer,  “Early  evalua>on  of  direc>ve-­‐based  GPU  programming  models  for  produc>ve  exascale  compu>ng,”  presented  at  the  High  Performance  Compu>ng,  Networking,  Storage  and  Analysis  (SC),  2012  Interna>onal  Conference  for,  2012,  pp.  1–11.

4

Page 5: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

OpenACC2 •  Standard version evolution is much faster

than OpenMP and MPI : ~1.5year

•  Supported by PGI/CAPS/CRAY compiler

Standard/Version 1.0 2.0 3.0 4.0

OpenACC Nov  2011 July  2013   -­‐-­‐ -­‐-­‐

OpenMP-­‐Fortran Oct  1997 Nov  2000 May  2008 July  2013

MPI June  1994 July  1997 Sept  2012   -­‐-­‐

5

Page 6: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

OpenMP  4.0  for  Accelerators •  Released in July 2013, it supports on directive-

based programming on accelerators, such as GPU and Xeon Phi

•  Directives for –  Parallelism: target/parallel –  Data: target data –  Two levels of parallelism: teams/distribute

6

Page 7: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

OpenACC2 OpenMP4 Parallel   Target

Kernel N/A

Data Target  Data

Loop Distribute/Do/for/SIMD

Host  data N/A

Cache N/A

Update Target  Update

Wait N/A

Declare   Declare  Target

{enter,  exit}  data N/A

rou>ne Declare  target

Async  wait N/A

Tile   N/A

Device_type N/A

Atomic Atomic

N/A Cri>cal  sec>ons…  Barrier

OpenACC2  has  richer  features  than  OpenMP4   7

Page 8: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Experimental  Setup •  2 Target Devices

–  Accelerators used in Supercomputers, so NO AMD GPU –  GPU Kepler: K40 –  Xeon Phi KNC: 5110p

•  3 Compilers –  OpenACC: CAPS Compiler

•  V3.4.1, support OpenACC 2.0 •  OpenARC is not public available yet

–  OpenMP4: HOMP-ROSE for GPU, Intel Compiler for Xeon Phi

•  2 Test Cases –  HydroBench is a miniapp https://github.com/HydroBench/Hydro –  Copy benchmark

8

Page 9: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

HOMP,  OpenMP  compiler  for  CUDA

•  Developed by LLNL, it is build on ROSE[1], a source-to-source compiler

•  It is an early research implementation[2] of OpenMP4.0 on GPU for CUDA5.0

•  Support C/C++ only now

[1]    hTp://rosecompiler.org

[2]    C.  Liao,  Y.  Yan,  B.  R.  Supinski,  D.  J.  Quinlan,  and  B.  Chapman,  “Early  Experiences  with  the  OpenMP  Accelerator  Model,”  IWOMP13

9

Page 10: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Tools  used

CUDA5.5 OpenCL1.2

PTX  is  open

GPU  Assemble  Code  

OpenACC OpenMP4.0

Kepler  K20/20X/40 KNC  3/5/7110p

SPIR

X86  Assemble  Code

ptxas/ocelet

             asfemi+

nvcc*  based  on  LLVM

CAPS3.4 HOMP

OBJ

ICC13.0

ICC

Offload

10

Page 11: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

11

Page 12: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

12

Page 13: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

13

Performance  impact  of  K40  and  5110P  measure  with  data  transfer

Page 14: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Outline

•  OpenACC2 and OpenMP4

•  Systematic Optimizations to Portable Performance on GPU and Xeon Phi

•  Summary

14

Page 15: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Performance  Portability

•  Includes source code portability of a kernel implementation and performance that is commensurate with a device-specific implementation of that kernel [Kokks]

•  Portability for Accelerators used in HPC –  Different accelerators: GPU V.S. Xeon Phi –  Different generations: Kepler V.S. Fermi –  Different versions: K20 V.S. K40

15

Page 16: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Naïve  version,  simple  direc>ves,    0  op>miza>on    

Op>miza>ons   Ninja  version,  Best    op>miza>on,  Ideal  performance  

Frog  version,  achieving  reasonable  performance  on  both  GPU  and  Xeon  Phi

Algorithm  changes  

Compiler  technology  

Portable  Performance

16

Page 17: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Op>miza>on  for  Portable  Performance   SystemaDc  

OpDmizaDons   Accelerators Applica>ons Direc>ves  

SM

REG

SHMEM

GM

HM

SGEMM

FFT

LBM

Stencil

SPMV

Compiler  Technology

Arithme>c  Intensity SIMD  Friendly  Algo

Register  Level    Data  Locality    

SHMEM  Level  Data  Locality    

GM  Level  Data  Locality    

Data  Direc>ves

Loop  Direc>ves

Algorithm  Changes

17

Page 18: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Global  Memory  Level  

•  Memory Coalescing •  Thread-Grid Mapping

[ISCA09]

[MuCoCoS13]  18

Page 19: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Shared  Memory/L2  (SHMEM)  Level

•  SHMEM Blocking (data tiling)

•  Data Layout: AOS->SOA •  Avoid SHMEM bank

conflict

Avoid  SHMEM  bank  Conflict 19

Page 20: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Register  Level  

•  Prefetching •  Loop Unrolling and Jam •  Undocumented Register Bank Conflicts in

Kepler [CGO13]

20

Page 21: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

SIMD  Friendly  Algorithm  

[ISCA12] 21

Page 22: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Compiler  Technology •  Fast math, -fastmath

–  For __sin(), replace several FMAD (Floating Point Multiply-Add) with SFU Intrinsic functions

–  Faster, but SP only and loss some accuracy

•  Replace FMAD with FMA, -fmad=false –  FMAD -> RND(RND(a*b)+c) –  FMA (Fused Multiply-Add) -> RND(a*b+c), –  Improve both accuracy and performance

•  Asynchronous –  Using async clauses (OpenACC only)

22

Page 23: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Opportuni>es  for  auto-­‐tuning  with  single  source  code  base  

•  Code Level: self-embed code •  Compiler Level: support vendor intrinsic •  Tool Level: CAPS Auto-tuning tool •  Library Level:

– Lib for portable performance: Kokkos – Drop-in support for CUDA math Lib and MKL

23

Page 24: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Outline

•  OpenACC2 and OpenMP4

•  Systematic Optimizations to Portable Performance on GPU and Xeon Phi

•  Summary

24

Page 25: OpenACC vs.OpenMP4: The Strong, the Weak, the Missing …on-demand.gputechconf.com/.../S4595-openacc-openm… ·  · 2014-04-08Outline • OpenACC2 and OpenMP4 • Systematic Optimizations

Summary •  Portable performance would be an important feature for

directive programming approach

•  OpenACC2 is in early stage –  gridfication is recommended –  gang/work/vector directive is weak –  tile directive is weak –  cache directive support is still missing

•  OpenMP4 is not ready yet –  teams/distribute support on GPU is still missing

•  Systematic optimization would be needed

25