Data Parallel Algorithmic Skeletons with Accelerator Support

Steffen Ernsting and Herbert Kuchen
July 2, 2015
Westfälische Wilhelms-Universität Münster (WWU Münster)
Agenda
- Hardware accelerators
  - GPU and Xeon Phi
- Skeleton implementation (C++ vs. Java)
  - Parallelization
  - Providing the user function
  - Adding additional arguments
- Performance comparison
  - 4 benchmark applications: matrix multiplication, N-body, shortest paths, ray tracing
Hardware Accelerators

Graphics Processing Units
- K20x: 2688 CUDA cores
- C++ + CUDA/OpenCL
- Offload compute-intensive kernels

Intel Xeon Phi
- ~60 x86 cores (240 threads)
- C++ + pragmas (+ SIMD intrinsics)
- Offload and native programming models
- Intel: "Recompile and run"
Accelerator Support in Java

- Various projects aim to add GPU support to Java
  - Bindings vs. JIT (byte)code translation
  - Bindings: e.g. JCUDA, JOCL, JavaCL
  - Code translation: e.g. Aparapi, Rootbeer, Java-GPU
  - Java 9 ...?
- Why choose accelerated Java over accelerated C++?
  - Write and compile once, run everywhere
  - Huge JDK: lots of little helpers
  - Automated memory management
Aparapi

- Created by Gary Frost (AMD), released as open source
- JIT compilation: Java bytecode → OpenCL
- Execution modes: GPU, JTP (Java Thread Pool), SEQ, PHI (experimental)
- Offers good programmability
- Restriction: primitive types only

Java code:

```java
int[] a = {1, 2, 3, 4};
int[] b = {5, 6, 7, 8};
int[] c = new int[a.length];

for (int i = 0; i < c.length; i++) {
    c[i] = a[i] + b[i];
}
```

Aparapi code:

```java
Kernel k = new Kernel() {
    public void run() {
        int i = getGlobalId();
        c[i] = a[i] + b[i];
    }
};
k.execute(Range.create(c.length));
```
The Muenster Skeleton Library

- C++ skeleton library
- Target architectures: multi-core clusters, possibly equipped with accelerators such as GPUs and Xeon Phis
- Parallelization: MPI, OpenMP, and CUDA
- Data parallel skeletons (with accelerator support)
  - map, zip, fold + variants
- Task parallel skeletons
  - pipe, farm, D&C, B&B
Parallelization

- Inter-node parallelization (MPI):
  - Distributed data structures (DArray<T>, DMatrix<T>)

[Figure: a DMatrix is partitioned block-wise among processes P0–P3; (a)–(b) show the distribution after M = new DMatrix(...), (c) shows the result of M.map(add1), where every element has been incremented by 1.]

- Intra-node parallelization:
  - C++: OpenMP, CUDA
  - Java: Aparapi

```java
T fold(FoldFunction f) {
    // OpenMP, CUDA, Aparapi
    T localResult = localFold(f, localPartition);

    // MPI
    T[] localResults = new T[numProcs];
    allgather(localResult, localResults);

    return localFold(f, localResults);
}
```
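The two-level fold above can be sketched as plain Java. This is a minimal single-JVM simulation, assuming the MPI allgather is replaced by collecting each "process's" partial result into an array; the names `FoldSketch`, `fold`, and `localFold` are illustrative, not the library's API.

```java
import java.util.function.IntBinaryOperator;

public class FoldSketch {
    // Two-level fold as in the skeleton: fold each local partition
    // (intra-node step), then fold the gathered partial results
    // (stand-in for the MPI allgather + final fold).
    static int fold(int[][] partitions, IntBinaryOperator f, int neutral) {
        int[] localResults = new int[partitions.length];
        for (int p = 0; p < partitions.length; p++) {
            localResults[p] = localFold(partitions[p], f, neutral);
        }
        return localFold(localResults, f, neutral);
    }

    static int localFold(int[] data, IntBinaryOperator f, int neutral) {
        int acc = neutral;
        for (int v : data) acc = f.applyAsInt(acc, v);
        return acc;
    }

    public static void main(String[] args) {
        int[][] partitions = {{1, 2}, {3, 4}, {5, 6}};
        System.out.println(fold(partitions, Integer::sum, 0)); // prints 21
    }
}
```

Note that this only works as written because the fold function is associative; the real skeleton makes the same assumption, since partial results are combined in an unspecified order.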
The User Function (C++)

- Pass function as skeleton argument
- But: (host) function pointers cannot be used in device code
- ⇒ Use functors instead:

```cpp
template <typename IN, typename OUT>
class MapFunctor : public FunctorBase
{
public:
    // To be implemented by the user.
    MSL_UFCT virtual OUT operator() (IN value) const = 0;

    virtual ~MapFunctor();
};
```
The User Function (Java)

- No function pointers in Java
- Analogous to C++: use functors

```java
public abstract class MapKernel extends Kernel {
    protected int[] in, out;

    // To be implemented by the user.
    public abstract int mapFunction(int value);

    public void run() {
        int gid = getGlobalId();
        out[gid] = mapFunction(in[gid]);
    }

    // Called by skeleton implementation.
    public void init(DIArray in, DIArray out) {
        this.in = in.getLocalPartition();
        this.out = out.getLocalPartition();
    }
}
```
Additional Arguments (C++)

- Arguments to the user function are determined by the skeleton
- ⇒ Add additional arguments as data members
- But: works for primitives and PODs, not for classes with pointer data members
  - Pointer data + multi-GPU: need 1 pointer for each GPU
- ⇒ Solution: Observer pattern

[Class diagram: MapFunctor inherits from FunctorBase; Argument implements ArgumentType; FunctorBase provides addArgument(...) and notify(), ArgumentType provides update(); the functor's operator()(...) uses the registered arguments.]
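The observer wiring in the class diagram can be sketched as follows. This is an illustrative Java sketch, not the library's C++ implementation: all names mirror the diagram except `notifyArguments` (renamed because `Object.notify` is final in Java), and the per-GPU pointer refresh is reduced to recording a device id.

```java
import java.util.ArrayList;
import java.util.List;

// Observed role: an additional argument that must be refreshed
// (e.g., re-uploaded) for whichever device the kernel runs on next.
interface ArgumentType {
    void update(int deviceId);
}

class Argument implements ArgumentType {
    int[] hostData;
    int currentDevice = -1; // which device currently holds the data

    Argument(int[] hostData) { this.hostData = hostData; }

    public void update(int deviceId) {
        // Stand-in for a per-device upload; here we only record the device.
        this.currentDevice = deviceId;
    }
}

// Observer role: the functor registers its arguments; the skeleton
// calls notifyArguments(...) before launching on a device, so every
// registered argument is valid when the kernel runs.
class FunctorBase {
    private final List<ArgumentType> arguments = new ArrayList<>();

    void addArgument(ArgumentType arg) { arguments.add(arg); }

    void notifyArguments(int deviceId) {
        for (ArgumentType arg : arguments) arg.update(deviceId);
    }
}

class ScaleFunctor extends FunctorBase {
    final Argument factors;

    ScaleFunctor(Argument factors) {
        this.factors = factors;
        addArgument(factors); // observer registration
    }

    int apply(int i, int value) { return value * factors.hostData[i]; }
}
```

The point of the pattern is that the skeleton, not the user, decides when arguments must be refreshed; with multiple GPUs, one `notifyArguments` call per device keeps each functor's pointers consistent with the device it is about to run on.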
Additional Arguments (Java)

- Add additional arguments as data members
- Aparapi handles memory allocation and data transfer to GPU memory

```java
public class AddN extends MapKernel {
    protected int n;

    public AddN(int n) {
        this.n = n;
    }

    public int mapFunction(int value) {
        return value + n;
    }
}

DIArray A = new DIArray(...);
A.map(new AddN(2));
```

- Limited to (arrays of) primitive types
Benchmarks

- 4 benchmark applications: matrix multiplication, N-body, shortest paths, ray tracing
- 2 test systems:
  1. GPU cluster: 2 Xeon E5-2450 (16 cores) + 2 K20x GPUs per node
  2. Xeon Phi system with 8 Xeon Phi 5110P coprocessors
- 6 configurations:
  - 2× CPU: C++ CPU, Java CPU
  - 3× GPU: C++ GPU, C++ multi-GPU, Java GPU
  - 1× Xeon Phi: C++ Xeon Phi
Case Study: Matrix Multiplication

- Cannon's algorithm for square matrix multiplication
- Checkerboard block decomposition (2D torus)

[Figure: a 2×2 process grid p(0,0)–p(1,1) holding blocks of A and B; (a) initial shifting of A and B, (b) submatrix multiplication C = A * B (block size mLocal) + stepwise shifting.]
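The shifting scheme can be sketched in a few lines of Java. This is a minimal single-process simulation, assuming 1×1 blocks (one matrix element per "processor"): row i of A is initially shifted left by i and column j of B up by j; each of the n steps then multiplies the aligned elements and shifts A left and B up by one. The class and method names are illustrative.

```java
public class Cannon {
    // Cannon's algorithm with 1x1 blocks: each (i, j) position plays
    // the role of one processor in the 2D torus.
    public static int[][] multiply(int[][] A, int[][] B) {
        int n = A.length;
        int[][] a = new int[n][n], b = new int[n][n], c = new int[n][n];

        // Initial shifting: row i of A left by i, column j of B up by j.
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                a[i][j] = A[i][(j + i) % n];
                b[i][j] = B[(i + j) % n][j];
            }

        for (int step = 0; step < n; step++) {
            // Local multiply-accumulate on the aligned elements.
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    c[i][j] += a[i][j] * b[i][j];

            // Stepwise shifting: A left by one, B up by one (torus wrap).
            int[][] a2 = new int[n][n], b2 = new int[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    a2[i][j] = a[i][(j + 1) % n];
                    b2[i][j] = b[(i + 1) % n][j];
                }
            a = a2;
            b = b2;
        }
        return c;
    }
}
```

In the real library the same skew/shift structure operates on mLocal×mLocal submatrices via rotateRows/rotateCols, and each local multiply is a skeleton call rather than a scalar product.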
Matrix Multiplication

Main algorithm:

```cpp
template <typename T>
DMatrix<T>& matmult(DMatrix<T>& A, DMatrix<T>& B, DMatrix<T>* C) {
    // Initial shifting.
    A.rotateRows(&negate);
    B.rotateCols(&negate);

    for (int i = 0; i < A.getBlocksInRow(); i++) {
        DotProduct<T> dp(A, B);
        // Submatrix multiplication.
        C->mapIndexInPlace(dp);

        // Stepwise shifting.
        A.rotateRows(-1);
        B.rotateCols(-1);
    }
    return *C;
}
```
Matrix Multiplication

Map functor (C++):

```cpp
template <typename T>
struct DotProduct : public MapIndexFunctor<T, T> {
    LMatrix<T> A, B;

    DotProduct(DMatrix<T>& A_, DMatrix<T>& B_)
        : A(A_), B(B_)
    {
    }

    MSL_UFCT T operator()(int row, int col, T Cij) const
    {
        T sum = Cij;
        for (int k = 0; k < this->mLocal; k++) {
            sum += A[row][k] * B[k][col];
        }
        return sum;
    }
};
```
Matrix Multiplication

Map functor (Java):

```java
class DotProduct extends MapIndexInPlaceKernel {
    protected float[] A, B;

    public DotProduct(DFMatrix A, DFMatrix B) {
        super();
        this.A = A.getLocalPartition();
        this.B = B.getLocalPartition();
    }

    public float mapIndexFunction(int row, int col, float Cij) {
        float sum = Cij;
        for (int k = 0; k < mLocal; k++) {
            sum += A[row * mLocal + k] * B[k * mLocal + col];
        }
        return sum;
    }
}
```
Results: Matrix Multiplication

[Figure: run time in seconds (log scale) over 1–16 nodes for C++ CPU, Java CPU, C++ GPU, C++ multi-GPU, Java GPU, and C++ Xeon Phi.]

- C++ (multi-)GPU performs best (30×–160× speedup vs. CPU)
- Java GPU and C++ Xeon Phi on a similar level
- C++ CPU and Java CPU on a similar level
  - Superlinear speedups due to cache effects
Results: N-Body

[Figure: run time in seconds (log scale) over 1–16 nodes for C++ CPU, Java CPU, C++ GPU, C++ multi-GPU, Java GPU, and C++ Xeon Phi.]

- C++ (multi-)GPU performs best (10×–13× speedup vs. CPU)
- C++ GPU delivers better scalability than Java GPU at higher node counts
- CPU versions on the same level
- C++ Xeon Phi performance between CPU and GPU performance
Results: Shortest Paths

[Figure: run time in seconds (log scale) over 1–16 nodes for C++ CPU, Java CPU, C++ GPU, C++ multi-GPU, Java GPU, and C++ Xeon Phi.]

- C++ (multi-)GPU performs best (20×–160× speedup vs. CPU)
- Java GPU and C++ Xeon Phi on a similar level
- C++ CPU and Java CPU on a similar level
Results: Ray Tracing

[Figure: run time in seconds (log scale) over 1–16 nodes for C++ CPU, C++ GPU, C++ multi-GPU, and C++ Xeon Phi.]

- C++ (multi-)GPU performs best (5×–10× speedup vs. CPU)
- Xeon Phi performance only close to CPU performance
  - No auto-vectorization
Conclusion & Future Work

- C++ and Java versions offer comparable performance
- Still: restrictions on the Java side
  - (Arrays of) primitive types
  - No multi-GPU support (yet)
  - Will be addressed in future work
- Xeon Phi can be considered in-between CPUs and GPUs
  - Performance-wise and in terms of programmability
Thank you for your attention!
Questions?