
Transcript of the SC|14 MetaMorph poster by Paul Sathre and Wu Feng (source: synergy.cs.vt.edu/pubs/papers/sathre-sc14-metamorph-poster.pdf)

http://synergy.cs.vt.edu/

Paul Sathre, Wu Feng

{sath6220, feng}@cs.vt.edu

Virginia Tech, Center for Synergistic Environments for Experimental Computing

Design

"Make-your-own" library from modular building blocks
Include only needed plugins and backends, selected with compile-time flags (see the sketch below)

Figure of MetaMorph's layered design: user apps call the C API directly, or the Fortran 2003 API through the Fortran compatibility layer; a plugin layer provides the Timer plugin, the MPI plugin, and future plugins; a backend layer provides the CUDA backend, the OpenCL backend, and future device backends. Each component is enabled at build time by a flag: -D WITH_CUDA, -D WITH_OPENCL, -D WITH_BACKENDX, -D WITH_TIMERS, -D WITH_MPI, -D WITH_FORTRAN, -D WITH_PLUGINX, ...
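To make the compile-time assembly concrete, here is a minimal C sketch, not MetaMorph's actual source: the init_* functions are hypothetical, and only the -D WITH_* flags themselves come from the design above.

    /* build_sketch.c -- hypothetical illustration of compile-time modularity.
     * Build with, e.g., cc -D WITH_CUDA -D WITH_TIMERS build_sketch.c */
    #include <stdio.h>

    /* Each initializer would live in its own backend/plugin source file,
     * compiled only when its flag is given. */
    #ifdef WITH_CUDA
    static void init_cuda_backend(void)   { printf("CUDA backend enabled\n"); }
    #endif
    #ifdef WITH_OPENCL
    static void init_opencl_backend(void) { printf("OpenCL backend enabled\n"); }
    #endif
    #ifdef WITH_TIMERS
    static void init_timer_plugin(void)   { printf("Timer plugin enabled\n"); }
    #endif
    #ifdef WITH_MPI
    static void init_mpi_plugin(void)     { printf("MPI plugin enabled\n"); }
    #endif

    int main(void) {
        /* Only the components selected at compile time exist in the binary. */
    #ifdef WITH_CUDA
        init_cuda_backend();
    #endif
    #ifdef WITH_OPENCL
        init_opencl_backend();
    #endif
    #ifdef WITH_TIMERS
        init_timer_plugin();
    #endif
    #ifdef WITH_MPI
        init_mpi_plugin();
    #endif
        return 0;
    }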

Plugins

MPI

Figure of a five-node heterogeneous cluster: each node has eight CPU cores, a NIC, and a PCI-E-attached compute device (Node 0 and Node 1: NVIDIA GPU via CUDA; Node 2: CPU via OpenCL; Node 3: Intel MIC via OpenCL; Node 4: AMD GPU via OpenCL), with the nodes joined by an interconnect over which MPI exchanges take place.

On-device packing of ghost regions

GPU Direct option, if available with the MPI implementation and backend

Fallback to host-staged transfers otherwise

Transparently exchange device buffers between processes, regardless of backend (see the sketch below)
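To make the exchange flow above concrete, here is a hedged C sketch, not MetaMorph's actual code: the GPUDIRECT macro, the buffer names, and the assumption that an on-device kernel has already packed the face into d_sendbuf are all illustrative. A CUDA-aware MPI (e.g., MVAPICH2 [6]) can consume device pointers directly; otherwise the transfer is staged through host buffers.

    /* face_exchange_sketch.c -- hypothetical ghost-face exchange with one neighbor. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange_face(double *d_sendbuf, double *d_recvbuf,   /* device buffers */
                       double *h_sendbuf, double *h_recvbuf,   /* host staging buffers */
                       int count, int neighbor, MPI_Comm comm)
    {
        MPI_Request reqs[2];
    #ifdef GPUDIRECT
        /* GPU Direct path: hand device pointers straight to MPI. */
        MPI_Irecv(d_recvbuf, count, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
        MPI_Isend(d_sendbuf, count, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    #else
        /* Fallback: stage the packed face through host memory. */
        cudaMemcpy(h_sendbuf, d_sendbuf, count * sizeof(double), cudaMemcpyDeviceToHost);
        MPI_Irecv(h_recvbuf, count, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
        MPI_Isend(h_sendbuf, count, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        cudaMemcpy(d_recvbuf, h_recvbuf, count * sizeof(double), cudaMemcpyHostToDevice);
    #endif
        /* An on-device unpack kernel would then scatter d_recvbuf into the halo cells. */
    }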

Automatic Profiling

If compiled in, all kernels and transfers are timed behind the scenes automatically
An environment variable controls verbosity (see the sketch below)
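A minimal sketch of how compiled-in, behind-the-scenes timing with environment-controlled verbosity could work; the variable name METAMORPH_TIMER_LEVEL and the timed_call wrapper are assumptions for illustration, not the plugin's actual interface.

    /* timer_sketch.c -- hypothetical env-controlled automatic timing (POSIX). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static int verbosity(void) {
        const char *v = getenv("METAMORPH_TIMER_LEVEL");  /* assumed name */
        return v ? atoi(v) : 0;
    }

    /* Wrap any kernel or transfer: always measure, report only if asked. */
    static void timed_call(const char *name, void (*fn)(void)) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fn();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (verbosity() > 0) {
            double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) * 1e-3;
            fprintf(stderr, "[timers] %s: %.2f us\n", name, us);
        }
    }

    static void dummy_kernel(void) { /* stand-in for a real kernel or transfer */ }

    int main(void) {
        timed_call("dummy_kernel", dummy_kernel);
        return 0;
    }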

Fortran Compatibility

Verbose, pass-by-reference C API is usable directly from Fortran via ISO_C_BINDING
A Fortran 2003 wrapper to the verbose API provides a simplified, unified calling convention for supported real and integer types
The wrapper uses the C API internally, so all runtime control variables work equivalently

Introduction

Heterogeneity is becoming a fact of life in HPC, largely driven by demands for increased parallelism and power efficiency beyond what traditional CPUs can provide. However, extracting the full performance of heterogeneous systems is non-trivial and requires architecture expertise.

Retrofitting existing codes for heterogeneity is tedious and error-prone, architecture experts are in short supply, and accelerators are moving targets. Therefore, a single API for transparently executing optimized code on accelerators with minimal intervention is needed for scientific productivity.

Accelerated Backends

Purpose-built to provide a single API to accelerated kernels for current and future devices
Runtime control over backends, plugins, and accelerator device selection

The heavy-lifters of the library, selected at runtime by a "mode" environment variable from those included at compile time (see the sketch after this list)

Include implementations of all C API-supported kernels for a single accelerator model

Standalone libraries in their own right: they can be used and distributed separately from the top-level API, as long as they are API compliant, supporting community development of closed- or open-source alternatives

Currently support both CUDA and OpenCL, providing access to the most popular accelerators

Currently support simple operations on subsets of 3D dense matrices: reduction-sum, dot-product, 2D transpose, pack/unpack of subregions

More kernels from computational fluid dynamics are in the pipeline; extensions to other domains will follow, simply a matter of adding the necessary kernels
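A sketch of how a "mode" environment variable could pick one of the compiled-in backends at runtime; the variable name METAMORPH_MODE, the dispatch table, and the CPU fallback are illustrative assumptions, not MetaMorph's actual mechanism.

    /* mode_sketch.c -- hypothetical runtime backend selection. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        const char *name;
        double (*dot)(const double *x, const double *y, int n); /* one kernel slot */
    } backend_t;

    static double dot_cpu(const double *x, const double *y, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += x[i] * y[i];
        return s;
    }

    /* Only backends compiled in (cf. -D WITH_*) appear in the table. */
    static backend_t backends[] = {
    #ifdef WITH_CUDA
        { "CUDA",   dot_cpu /* would point at the CUDA implementation */ },
    #endif
    #ifdef WITH_OPENCL
        { "OpenCL", dot_cpu /* would point at the OpenCL implementation */ },
    #endif
        { "CPU",    dot_cpu },  /* fallback so the sketch always runs */
    };

    int main(void) {
        const char *mode = getenv("METAMORPH_MODE"); /* assumed variable name */
        backend_t *b = &backends[0];
        for (size_t i = 0; i < sizeof backends / sizeof backends[0]; ++i)
            if (mode && strcmp(mode, backends[i].name) == 0) b = &backends[i];
        double x[4] = {1, 2, 3, 4}, y[4] = {1, 1, 1, 1};
        printf("mode=%s dot=%g\n", b->name, b->dot(x, y, 4));
        return 0;
    }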

Performance

Figure of MPI exchange performance testing of packing/unpacking ghost regions, with GPUDirect, host-staged, and mixed-backend transfers

Figure of API plugin overheads (timer plugin and Fortran API vs. strictly C API)

All tests performed in-node on a single system containing: 2x Intel Xeon X5550 quad-core CPUs, 20 GB RAM, and 4x NVIDIA Tesla C2070 GPUs

Future Work

Continue expanding the API's provided set of kernels and backends with other primitive operations underlying fluid simulations, e.g., Krylov solvers, stencil computations, and various preconditioners

Generalize operations to work on non-3D data, and add primitives for computations on unstructured grids

Generate a third, automatically runtime-scheduled backend to transparently execute code across an entire node, a la CoreTSAR [7].

Related Efforts

Solver Frameworks

OpenFOAM [1]
Pros: Support for useful pre- and post-processing (mesh generation and visualization); many solvers for many domains
Cons: No internal accelerator support; framework-centric development; cumbersome API and "case" construction

PARALUTION [2]
Pros: Many matrix storage formats; many solvers; many preconditioners; support for OpenMP, CUDA, and OpenCL on CPUs/GPUs and MIC; plugins for Fortran and OpenFOAM
Cons: Framework-centric development; low interoperability with existing code; no MPI support (yet); asynchronous operations only on CUDA; lack of non-destructive copy to/from C arrays

Solver Libraries

MAGMA [3]
Pros: Full BLAS and LAPACK support for CUDA, OpenCL, and MIC; support for several factorizations and eigenvalue problems; smart scheduling of hybrid CPU/GPU algorithms with the QUARK directed acyclic graph scheduler; multi-GPU methods
Cons: CUDA, OpenCL, and MIC variants are separate implementations; no internal MPI support; MKL/ACML dependency poorly documented and cumbersome

Trilinos [4]
Pros: Massive set of capability areas beyond linear algebra, solvers, and meshes; built-in distributed-memory support; some preliminary CUDA/MIC work (e.g., the Kokkos, Phalanx, and Tpetra packages)
Cons: Redundancies of capability between packages; breadth of packages difficult for newcomers to navigate

References

[1] H. Jasak, A. Jemcov, and Z. Tukovic, "OpenFOAM: A C++ library for complex physics simulations."
[2] D. Lukarski, "PARALUTION project v0.7.0," 2012.
[3] J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, and I. Yamazaki, "Accelerating Numerical Dense Linear Algebra Calculations with GPUs," in Numerical Computations with GPUs. Springer, 2014, pp. 3–28.
[4] M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu, T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P. Pawlowski, E. T. Phipps, A. G. Salinger, H. K. Thornquist, R. S. Tuminaro, J. M. Willenbring, A. Williams, and K. S. Stanley, "An overview of the Trilinos project," ACM Trans. Math. Softw., vol. 31, no. 3, pp. 397–423, 2005.
[5] AMD. clMath (formerly APPML). Accessed 2014.10.17. [Online]. Available: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-math-libraries/
[6] D. K. Panda. MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE. Network-Based Computing Laboratory, The Ohio State University. Accessed 2014.10.17. [Online]. Available: http://mvapich.cse.ohio-state.edu/
[7] T. Scogland, W.-c. Feng, B. Rountree, and B. de Supinski, "CoreTSAR: Adaptive Worksharing for Heterogeneous Systems," in Supercomputing, ser. Lecture Notes in Computer Science, J. Kunkel, T. Ludwig, and H. Meuer, Eds. Springer International Publishing, 2014, vol. 8488, pp. 172–186.

This work was funded in part by the Air Force Office of Scientific Research (AFOSR) Basic Research Initiative from the Computational Mathematics Program via Grant No. FA9550-12-1-0442.

Figure of transpose performance vs. alternative GPU BLAS libraries: "MetaMorph Transpose vs. Alternatives," time per element (µs, log scale) vs. transpose size, comparing MKL (CPU), MetaMorph+OpenCL (GPU), MetaMorph+CUDA (GPU), clMAGMA (GPU), and cuMAGMA (GPU), each in float and double.

Figure of dot-product performance vs. alternative GPU BLAS libraries: "MetaMorph Dot Product vs. Alternatives," time per element (µs, log scale) vs. vector length (64 to 128M), comparing PLASMA/MKL (CPU), ATLAS (CPU), MetaMorph+OpenCL (GPU), MetaMorph+CUDA (GPU), clAmdBlas (GPU), and cuMAGMA (GPU), each in float and double.

Figure of plugin overheads: "MetaMorph Fortran & Timer Plugin Overhead," total time per call on 2^21 floats (µs) for H2D copy, D2D copy, D2H copy, and dot product, comparing the C and Fortran front ends, with and without the timer plugin, on the CUDA and OpenCL backends.

Figure of MPI exchange primitives: "MetaMorph Transparent Face Exchange Primitive," time per float element (µs) vs. 2D face size (2x2 through 64x64) for the pack, send, recv, and unpack phases, comparing GPU Direct, host-staged CUDA, OpenCL, and mixed CUDA-to-OpenCL transfers.

SC|14: ACM/IEEE International Conference on High-Performance Computing, Networking, Storage, and Analysis, New Orleans, LA, Nov. 2014.