Transcript of GPGPU Programming in Haskell with Accelerate · 2019-09-23
GPGPU Programming in Haskell with Accelerate
Trevor L. McDonell, University of New South Wales
https://github.com/AccelerateHS
Friday, 17 May 13
What is GPGPU Programming?
• General Purpose Programming on Graphics Processing Units (GPUs)
• Using your graphics card for something other than playing games
• GPUs have many more cores than a CPU
- GeForce GTX Titan
- 2688 cores @ 837 MHz
- 6 GB memory @ 288 GB/s
What is GPGPU Programming?
• Main differences:
- Single program multiple data (SPMD / SIMD), or just data-parallelism
- All the cores run the same program, but on different data
• We can’t program these in the same way as a CPU
- Different instruction sets: can’t run a Haskell program directly
- More restrictive hardware designs, limited control structures
• GPUs have their own memory
- Data has to be explicitly moved back and forth
Dot product four ways
• Dot-product: pair-wise multiply two arrays and sum the result
• C (sequential):
float dotp(float *xs, float *ys, int size)
{
    int i;
    float sum = 0;
    for (i = 0; i < size; ++i) {
        sum = sum + xs[i] * ys[i];
    }
    return sum;
}
Dot product four ways
• Haskell (sequential):
- [Float] is a list of floating point numbers
- zipWith applies the function (*) element-wise to the two input lists
- foldl sums the result of zipWith
dotp :: [Float] -> [Float] -> Float
dotp xs ys = foldl (+) 0 ( zipWith (*) xs ys )
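The two-line definition above runs as an ordinary Haskell program. A minimal standalone sketch (the example input values are my own):

```haskell
-- Sequential dot product over plain Haskell lists, as on the slide
dotp :: [Float] -> [Float] -> Float
dotp xs ys = foldl (+) 0 (zipWith (*) xs ys)

main :: IO ()
main = print (dotp [1, 2, 3] [4, 5, 6])   -- 1*4 + 2*5 + 3*6 = 32.0
```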
Dot product four ways
• CUDA (parallel):
- Step 1: element-wise multiplication
- __global__ indicates this is a function that runs on the GPU
- spawn one thread for each element in the vector: the index i is unique to each thread
__global__ void zipWithMult(float *xs, float *ys, float *zs, int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size) { zs[i] = xs[i] * ys[i]; }
}
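The index computation blockDim.x * blockIdx.x + threadIdx.x can be checked with a tiny model in plain Haskell (the names globalIx and indices are mine): a grid of blocks covers every element exactly once.

```haskell
-- Model of the CUDA global-index computation: blockDim*blockIdx + threadIdx
globalIx :: Int -> Int -> Int -> Int
globalIx blockDim blockIdx threadIdx = blockDim * blockIdx + threadIdx

-- All indices produced by a grid of 4 blocks of 8 threads each
indices :: [Int]
indices = [ globalIx 8 b t | b <- [0 .. 3], t <- [0 .. 7] ]

main :: IO ()
main = print (indices == [0 .. 31])   -- each element covered exactly once
```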
Dot product four ways
• CUDA (parallel):
- Step 2: vector reduction … is somewhat complex

struct SharedMemory
{
    __device__ inline operator float *()
    {
        extern __shared__ int __smem[];
        return (float *) __smem;
    }
    __device__ inline operator const float *() const
    {
        extern __shared__ int __smem[];
        return (float *) __smem;
    }
};

template <unsigned int blockSize, bool nIsPow2>
__global__ void reduce_kernel(float *g_idata, float *g_odata, unsigned int n)
{
    float *sdata = SharedMemory();
    unsigned int tid      = threadIdx.x;
    unsigned int i        = blockIdx.x * blockSize * 2 + threadIdx.x;
    unsigned int gridSize = blockSize * 2 * gridDim.x;
    float sum = 0;

    while (i < n) {
        sum += g_idata[i];
        if (nIsPow2 || i + blockSize < n) sum += g_idata[i + blockSize];
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) { sdata[tid] = sum = sum + sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] = sum = sum + sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] = sum = sum + sdata[tid +  64]; } __syncthreads(); }

    if (tid < 32) {
        volatile float *smem = sdata;
        if (blockSize >= 64) { smem[tid] = sum = sum + smem[tid + 32]; }
        if (blockSize >= 32) { smem[tid] = sum = sum + smem[tid + 16]; }
        if (blockSize >= 16) { smem[tid] = sum = sum + smem[tid +  8]; }
        if (blockSize >=  8) { smem[tid] = sum = sum + smem[tid +  4]; }
        if (blockSize >=  4) { smem[tid] = sum = sum + smem[tid +  2]; }
        if (blockSize >=  2) { smem[tid] = sum = sum + smem[tid +  1]; }
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

void getNumBlocksAndThreads(int n, int maxBlocks, int maxThreads, int &blocks, int &threads)
{
    cudaDeviceProp prop;
    int device;
    checkCudaErrors(cudaGetDevice(&device));
    checkCudaErrors(cudaGetDeviceProperties(&prop, device));

    threads = (n < maxThreads*2) ? nextPow2((n + 1) / 2) : maxThreads;
    blocks  = (n + (threads * 2 - 1)) / (threads * 2);
    if (blocks > prop.maxGridSize[0]) {
        blocks  /= 2;
        threads *= 2;
    }
    blocks = min(maxBlocks, blocks);
}

float reduce(int n, float *d_idata, float *d_odata)
{
    int threads = 0, blocks = 0;
    int maxThreads = 256, maxBlocks = 64;
    int size = n;

    while (size > 1) {
        getNumBlocksAndThreads(size, maxBlocks, maxThreads, blocks, threads);
        int smemSize = (threads <= 32) ? 2 * threads * sizeof(float) : threads * sizeof(float);
        dim3 dimBlock(threads, 1, 1);
        dim3 dimGrid(blocks, 1, 1);

        if (isPow2(size)) {
            switch (threads) {
                case 512: reduce_kernel<512, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case 256: reduce_kernel<256, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case 128: reduce_kernel<128, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  64: reduce_kernel< 64, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  32: reduce_kernel< 32, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  16: reduce_kernel< 16, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   8: reduce_kernel<  8, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   4: reduce_kernel<  4, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   2: reduce_kernel<  2, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   1: reduce_kernel<  1, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            }
        } else {
            switch (threads) {
                case 512: reduce_kernel<512, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case 256: reduce_kernel<256, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case 128: reduce_kernel<128, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  64: reduce_kernel< 64, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  32: reduce_kernel< 32, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  16: reduce_kernel< 16, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   8: reduce_kernel<  8, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   4: reduce_kernel<  4, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   2: reduce_kernel<  2, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   1: reduce_kernel<  1, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            }
        }
        size = (size + (threads*2 - 1)) / (threads*2);
    }

    float sum;
    checkCudaErrors(cudaMemcpy(&sum, d_odata, sizeof(float), cudaMemcpyDeviceToHost));
    return sum;
}
o_O
Dot product four ways
• Accelerate (parallel):
• Recall the sequential Haskell version:
dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )

dotp :: [Float] -> [Float] -> Float
dotp xs ys = foldl (+) 0 ( zipWith (*) xs ys )

foldl: left-to-right traversal
fold: neither left nor right: happens in parallel (tree-like)
But… how does it perform?
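The tree-like reduction can be mimicked in plain Haskell; for an associative operator it gives the same answer as the left fold. A sketch (treeFold is a hypothetical name, not part of Accelerate):

```haskell
-- Pairwise (tree-shaped) reduction, the shape a parallel fold uses:
-- combine adjacent pairs, halving the list until one value remains.
treeFold :: (a -> a -> a) -> a -> [a] -> a
treeFold _ z []  = z
treeFold _ _ [x] = x
treeFold f z xs  = treeFold f z (pairwise xs)
  where
    pairwise (a:b:rest) = f a b : pairwise rest
    pairwise rest       = rest

main :: IO ()
main = print ( treeFold (+) 0 [1 .. 10 :: Int]
             , foldl    (+) 0 [1 .. 10 :: Int] )  -- both 55
```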
Dot product four ways
[Plot: dot product run time (ms, log scale) against number of elements (2-20 million), comparing sequential CPU code, Accelerate, and CUBLAS. Accelerate runs within 1.2x of CUBLAS, and roughly 30x faster than the sequential version.]

Tesla T10 (240 cores @ 1.3 GHz) vs. Xeon E5405 (2 GHz)
Mandelbrot fractal
n-body gravitational simulation
Canny edge detection
SmoothLife cellular automata
stable fluid flow
[Demo output: a list of MD5 digests with their recovered passwords, e.g. "d0e52829bf7962ee0aa90550ffdcccaa: laura123"]
Recovered 150/1000 (15.00 %) digests in 59.45 s, 185.03 MHash/sec
Password “recovery” (MD5 dictionary attack)
Accelerate
• Accelerate is a Domain-Specific Language for GPU programming
Haskell/Accelerate program
  → Transform Accelerate program into CUDA code
  → Compile with NVIDIA’s compiler & load onto the GPU
  → Copy result back to Haskell
Accelerate
• Accelerate is a Domain-Specific Language for GPU programming
- This process may happen several times during program execution
- Code and data fragments get cached and reused
• An Accelerate program is a Haskell program that generates a CUDA program
- However, in many respects this still looks like a Haskell program
- Shares various concepts with Repa, a Haskell array library for CPUs
Accelerate
• To execute an Accelerate computation (on the GPU):
- run comes from whichever backend we have chosen (CUDA)
- Arrays constrains the result to be an Array, or tuple thereof
• What is Acc?
- This is our DSL type
- A data structure (Abstract Syntax Tree) representing a computation that once executed will yield a result of type ‘a’
run :: Arrays a => Acc a -> a

Accelerate is a library of collective operations over arrays of type Acc a
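Putting run together with the earlier dot product gives a complete program. A sketch assuming the accelerate and accelerate-cuda packages (not runnable here without those libraries and a CUDA device; the example input values are my own):

```haskell
import Prelude hiding (zipWith)
import Data.Array.Accelerate      as A
import Data.Array.Accelerate.CUDA (run)

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

main :: IO ()
main = do
  let xs = A.fromList (Z :. 4) [1 .. 4] :: Vector Float
      ys = A.fromList (Z :. 4) [5 .. 8] :: Vector Float
  -- run builds the AST, generates CUDA, executes it, and copies back
  print (run (dotp (A.use xs) (A.use ys)))   -- a Scalar holding 70.0
```

use embeds an existing Haskell-side array into the Acc world, so the data is transferred to the GPU before the computation runs.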
Accelerate
• Accelerate computations take place on arrays
- Parallelism is introduced in the form of collective operations over arrays
• Arrays have two type parameters
- The shape of the array, or dimensionality
- The element type of the array: Int, Float, etc.
data Array sh e

Arrays in → Accelerate computation → Arrays out

run :: Arrays a => Acc a -> a
Arrays
• Supported array element types are members of the Elt class:
- ()
- Int, Int32, Int64, Word, Word32, Word64...
- Float, Double
- Char
- Bool
- Tuples up to 9-tuples of these, including nested tuples
• Note that Array itself is not an allowable element type. There are no nested arrays in Accelerate, regular arrays only!
data Array sh e
Accelerate by example: Mandelbrot fractal
Mandelbrot set generator
• Basics
- Pick a window onto the complex plane & divide it into pixels
- Each pixel has a value c given by its coordinates in the complex plane
- A point c is in the set if |z_n| does not diverge to infinity, where z_{n+1} = c + z_n^2
- Colour depends on the number of iterations n before divergence
• Each pixel is independent: lots of data parallelism
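Sequentially, the per-pixel computation is a small loop. A plain-Haskell sketch (the names mandel and limit are mine) of the escape-time count each pixel performs, iterating z' = c + z² from z = c:

```haskell
-- Count iterations of z_{n+1} = c + z_n^2 (starting from z_0 = c)
-- until |z| exceeds 2, or until the iteration limit is reached.
mandel :: (Float, Float) -> Int -> Int
mandel (cr, ci) limit = go 0 (cr, ci)
  where
    go n (zr, zi)
      | n >= limit         = n      -- never diverged: treat as in the set
      | zr*zr + zi*zi > 4  = n      -- |z| > 2: diverged after n steps
      | otherwise          = go (n + 1) (cr + zr*zr - zi*zi, ci + 2*zr*zi)

main :: IO ()
main = print (mandel (0, 0) 255, mandel (2, 0) 255)   -- (255,1)
```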
Mandelbrot set generator
• First, some types:
- A pair of floating point numbers for the real and imaginary parts
- DIM2 is a type synonym for a two dimensional Shape
type Complex      = (Float, Float)
type ComplexPlane = Array DIM2 Complex
data Array sh e
Shapes
• Shapes determine the dimensions of the array and the type of its index
- Z represents a rank-zero array (singleton array with one element)
- (:.) increases the rank by adding a new dimension on the right
• Examples:
- One-dimensional array (Vector) indexed by Int: (Z :. Int)
- Two-dimensional array, indexed by Int: (Z :. Int :. Int)
• This style is used at both the type and value level:
data Z = Z
data tail :. head = tail :. head

sh :: Z :. Int
sh = Z :. 10
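Because the slide gives complete definitions of Z and (:.), they can be reproduced standalone. A sketch (the Size class here is my own illustration, not Accelerate's API) showing how a shape value determines the number of elements:

```haskell
{-# LANGUAGE TypeOperators, FlexibleInstances #-}

-- The shape types exactly as defined on the slide
data Z = Z
data tail :. head = tail :. head
infixl 3 :.

-- Hypothetical helper: total number of elements in an index space
class Size sh where size :: sh -> Int
instance Size Z where size Z = 1
instance Size sh => Size (sh :. Int) where size (sh :. n) = size sh * n

vec :: Z :. Int
vec = Z :. 10        -- shape of a 10-element vector

mat :: Z :. Int :. Int
mat = Z :. 3 :. 5    -- shape of a 3x5 matrix

main :: IO ()
main = print (size vec, size mat)   -- (10,15)
```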
Mandelbrot set generator
• The initial complex plane:
• generate is a collective operation that yields a value at each index of the array
- Supported shape and element types: we will use DIM2 and Complex
- Size of the result array: the number of pixels in the image
- A function to apply at every index: compute the value of c at each pixel

generate :: (Shape sh, Elt a) => Exp sh -> (Exp sh -> Exp a) -> Acc (Array sh a)

z0 = c
If Acc is our DSL type, what is Exp?
A Stratified Language

• Accelerate is split into two worlds: Acc and Exp
- Acc represents collective operations over instances of Arrays
- Exp is a scalar computation on things of type Elt

• Collective operations in Acc comprise many scalar operations in Exp, executed in parallel over Arrays
- Scalar operations cannot contain collective operations

• This stratification excludes nested data parallelism
Collective Operations

• Collective operations comprise many scalar operations applied in parallel
- constant lifts a plain value into the Exp land of scalar expressions

constant :: Elt e => e -> Exp e

example :: Acc (Vector Int)
example = generate (constant (Z:.10)) (\ix -> f ix)

• Evaluating generate (constant (Z:.10)) f applies f, of type Exp DIM1 -> Exp Int, at every index in parallel, producing an Acc (Vector Int):

f 0   f 1   f 2   f 3   f 4   f 5   f 6   f 7   f 8   f 9
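To make the picture concrete, here is a plain-Haskell sketch of what generate computes, modelling the array as a list and indices as plain Ints (the name generateModel is illustrative, not part of the Accelerate API); the real version evaluates each application of f in parallel on the GPU:

```haskell
-- List-based model of generate: apply f at every index of the result.
-- Accelerate evaluates each f ix as an independent parallel thread.
generateModel :: Int -> (Int -> a) -> [a]
generateModel n f = map f [0 .. n-1]
```

For example, generateModel 10 (*2) models generate (constant (Z:.10)) (\ix -> 2 * unindex1 ix).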
Mandelbrot set generator

genPlane :: Exp (Float,Float,Float,Float) -> Acc ComplexPlane
genPlane view =
  generate (constant (Z:.600:.800))
           (\ix -> let Z :. y :. x = unlift ix
                   in  lift ( xmin + (fromIntegral x * viewx) / 800
                            , ymin + (fromIntegral y * viewy) / 600 ))
  where
    (xmin,ymin,xmax,ymax) = unlift view
    viewx = xmax - xmin
    viewy = ymax - ymin

- unlift has a funky type, but just unpacks “stuff”; lift does the reverse. At its two uses here it specialises to:

unlift :: Exp (Z :. Int :. Int) -> Z :. Exp Int :. Exp Int
unlift :: Exp (F,F,F,F) -> (Exp F, Exp F, Exp F, Exp F)

- Even though operations are in Exp, we can still use standard Haskell operations like (*) and (-)
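The per-pixel arithmetic can be checked in plain Haskell, outside the Exp world. This sketch (pixelCoord is an illustrative name; the 800×600 extents match the slide) computes the complex-plane coordinate of one pixel:

```haskell
-- Plain-Haskell model of the coordinate genPlane computes at pixel (y, x)
-- for a view rectangle (xmin, ymin, xmax, ymax).
pixelCoord :: (Float, Float, Float, Float) -> Int -> Int -> (Float, Float)
pixelCoord (xmin, ymin, xmax, ymax) y x =
  ( xmin + (fromIntegral x * (xmax - xmin)) / 800
  , ymin + (fromIntegral y * (ymax - ymin)) / 600 )
```

Pixel (0,0) maps to the corner (xmin, ymin), and the centre pixel maps to the centre of the view.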
Mandelbrot set generator

• Let’s define the function we will iterate:  zₙ₊₁ = c + zₙ²
- Use lift and unlift as before to pack and unpack the tuples
- Note that we had to add some type signatures to unlift

next :: Exp Complex -> Exp Complex -> Exp Complex
next c z = plus c (times z z)

plus :: Exp Complex -> Exp Complex -> Exp Complex
plus a b = lift (ax+bx, ay+by)
  where
    (ax,ay) = unlift a :: (Exp Float, Exp Float)
    (bx,by) = unlift b :: (Exp Float, Exp Float)

lift :: (Exp Float, Exp Float) -> Exp (Float, Float)
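The slide shows plus but not times; for reference, here is a plain-Haskell sketch of both, using ordinary Float pairs instead of Exp Complex (the names plusC, timesC, nextC are illustrative). times is the usual complex product:

```haskell
type C = (Float, Float)

plusC :: C -> C -> C
plusC (ax, ay) (bx, by) = (ax + bx, ay + by)

-- (a + bi)(c + di) = (ac - bd) + (ad + bc)i
timesC :: C -> C -> C
timesC (ax, ay) (bx, by) = (ax*bx - ay*by, ax*by + ay*bx)

-- One step of the iteration z_{n+1} = c + z_n^2
nextC :: C -> C -> C
nextC c z = plusC c (timesC z z)
```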
Mandelbrot set generator

• Complication: GPUs must do the same thing to lots of different data

zₙ₊₁ = c + zₙ²

- Keep a pair (z,i) for each pixel, where i is the iteration at which the point diverged
- fst and snd extract the individual tuple components
- dot is the squared magnitude of a complex number
- Conditionals can lead to SIMD divergence, so use them sparingly

(?) :: Elt t => Exp Bool -> (Exp t, Exp t) -> Exp t

step :: Exp Complex -> Exp (Complex, Int) -> Exp (Complex, Int)
step c zi = f (fst zi) (snd zi)
  where
    f z i = let z' = next c z
            in  dot z' >* 4 ? ( zi, lift (z', i+1) )
Mandelbrot set generator

• Accelerate is a meta programming language
- So, just use regular Haskell to unroll the loop a fixed number of times
- zipWith applies its function step pairwise to elements of the two arrays
- replicate makes 255 copies of the transition step from zₙ to zₙ₊₁
- foldl1 (.) composes them, so the steps apply in sequence, beginning with the initial data zs0: f (g x) ≣ (f . g) x

mandel :: Exp (Float,Float,Float,Float) -> Acc (Array DIM2 (Complex, Int))
mandel view = Prelude.foldl1 (.) (Prelude.replicate 255 go) zs0
  where
    cs    = genPlane view
    zs0   = zip cs (fill (shape cs) 0)
    go zs = zipWith step cs zs
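The foldl1 (.) trick can be demonstrated in plain Haskell on a single point rather than a whole array. This sketch (escapeTime and the helpers are illustrative names, not from the slides' code) composes 255 copies of the step function and reads off the iteration count:

```haskell
type C = (Float, Float)

dot :: C -> Float
dot (x, y) = x*x + y*y

next :: C -> C -> C
next (cx, cy) (zx, zy) = (cx + zx*zx - zy*zy, cy + 2*zx*zy)

-- Once the point has diverged (|z|^2 > 4) it stops changing,
-- exactly as in the slide's step function.
step :: C -> (C, Int) -> (C, Int)
step c (z, i) =
  let z' = next c z
  in  if dot z' > 4 then (z, i) else (z', i + 1)

-- foldl1 (.) over 255 replicated steps unrolls the loop, then we
-- apply the composed function to the initial state (c, 0).
escapeTime :: C -> Int
escapeTime c = snd (foldl1 (.) (replicate 255 (step c)) (c, 0))
```

A point inside the set (c = 0) survives all 255 steps, while c = 2 diverges immediately.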
In the workshop…

• More example code, computational kernels
- Common pitfalls
- Tips for good performance
- Figuring out what went wrong (or knowing who to blame: me)
Questions?

https://github.com/AccelerateHS/
http://xkcd.com/365/
Extra Slides…

Seriously?
Arrays

data Array sh e

• Create an array from a list:

fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e

- Generates a multidimensional array by consuming elements from the list and adding them to the array in row-major order

• Example:

> fromList (Z:.10) [1..10] :: Vector Float
Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

• Multidimensional arrays are similar:
- Elements are filled along the right-most dimension first

> fromList (Z:.3:.5) [1..] :: Array DIM2 Int
Array (Z :. 3 :. 5) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
Arrays

• Array indices start counting from zero

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> indexArray mat (Z:.2:.1)
12

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
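Row-major order means index (Z :. y :. x) into a 3×5 array selects linear element y*5 + x. A small list-based sketch (indexModel is an illustrative name) makes the arithmetic checkable:

```haskell
-- Model of row-major indexing into a matrix of the given width,
-- stored as a flat list: index (y, x) selects element y*width + x.
indexModel :: Int -> [a] -> (Int, Int) -> a
indexModel width xs (y, x) = xs !! (y * width + x)
```

So indexModel 5 [1..15] (2,1) reads linear position 2*5 + 1 = 11, i.e. the value 12, matching indexArray mat (Z:.2:.1) above.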
Arrays

• Similarly, an array of (nested) tuples:

> fromList (Z:.2:.3) $ P.zip [1..] ['a'..] :: Array DIM2 (Int,Char)
Array (Z :. 2 :. 3) [(1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e'),(6,'f')]

- This is just a trick: internally converted into a tuple of arrays

( 1 2 3   a b c )
( 4 5 6 , d e f )
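The conversion behind this trick is exactly what Prelude's unzip and zip compute, here sketched on lists (toSoA/toAoS are illustrative names for the array-of-structures to structure-of-arrays round trip):

```haskell
-- An "array" of pairs, stored internally as a pair of "arrays".
toSoA :: [(a, b)] -> ([a], [b])
toSoA = unzip

-- ...and back again.
toAoS :: ([a], [b]) -> [(a, b)]
toAoS = uncurry zip
```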
Data.Array.Accelerate

• Need to import both the base library as well as a backend

import Prelude as P
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA as CUDA

- There is also an interpreter available for testing
- Runs without using the GPU (much more slowly, of course)
Data.Array.Accelerate

• To get arrays into the Acc world:

use :: Arrays a => a -> Acc a

- This may involve copying data to the GPU

• use injects arrays into our DSL
• run executes the computation to get arrays out
• Using Accelerate focuses on everything in between
Collective Operations

• Example: add one to each element of an array

> let arr = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ A.map (+1) (use arr)
Array (Z :. 3 :. 5) [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]

• What is the type of map?

map :: (Shape sh, Elt a, Elt b)    -- supported shape & element types
    => (Exp a -> Exp b)            -- function to apply at every element. But what is Exp?
    -> Acc (Array sh a)            -- DSL array (input)
    -> Acc (Array sh b)            -- DSL array (result)
Scalar Expressions

• The type class overloading trick is used for standard Haskell classes

(+1) :: (Elt a, IsNum a) => Exp a -> Exp a

• Standard boolean operations are available with slightly different names
- The standard names can not be overloaded

(==*) :: (Elt t, IsScalar t) => Exp t -> Exp t -> Exp Bool
(/=*), (<*), (>*), min, max, (||*), (&&*)    -- and so on...

• Conditionals
- Use sparingly: leads to SIMD divergence

(?) :: Elt t => Exp Bool -> (Exp t, Exp t) -> Exp t
Scalar Expressions

• Bring a Haskell value into Exp land

constant :: Elt e => e -> Exp e

• Lift an expression into a singleton array

unit :: Exp e -> Acc (Scalar e)

• Extract the element from a singleton array

the :: Acc (Scalar e) -> Exp e
Reductions

• Folding (+) over a vector produces a sum
- The result is a one-element array (scalar). Why?

> let xs = fromList (Z:.10) [1..] :: Vector Int
> run $ A.fold (+) 0 (use xs)
Array (Z) [55]

• Fold has an interesting type:

fold :: (Shape sh, Elt a)
     => (Exp a -> Exp a -> Exp a)
     -> Exp a
     -> Acc (Array (sh:.Int) a)    -- input array
     -> Acc (Array sh a)           -- outer dimension removed
Reductions

• Fold occurs over the outer dimension of the array

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ A.fold (+) 0 (use mat)
Array (Z :. 3) [15,40,65]

 1  2  3  4  5   →  15
 6  7  8  9 10   →  40
11 12 13 14 15   →  65
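Modelling the matrix as a list of rows, the reduction above is one fold per row; this sketch (foldRows is an illustrative name) reproduces the result shown:

```haskell
-- fold on a DIM2 array yields one reduced value per row:
-- a vector of row sums when the operator is (+).
foldRows :: Num a => a -> [[a]] -> [a]
foldRows z = map (foldl (+) z)
```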
Reductions

• Is this a left-fold or a right-fold?
- Neither! The fold happens in parallel, tree-like
- Therefore the function must be associative: (Exp a -> Exp a -> Exp a)
- (We pretend that floating point operations are associative, though strictly speaking they are not)
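The tree shape can be sketched in plain Haskell: combine adjacent pairs repeatedly until one value remains (foldTree is an illustrative name, not how any backend actually schedules the reduction). For an associative operator this agrees with a sequential foldl:

```haskell
-- Tree-shaped reduction: each pass halves the list by combining
-- adjacent pairs, the way a parallel fold might evaluate.
foldTree :: (a -> a -> a) -> a -> [a] -> a
foldTree _ z []  = z
foldTree _ _ [x] = x
foldTree f z xs  = foldTree f z (pairwise xs)
  where
    pairwise (a:b:rest) = f a b : pairwise rest
    pairwise rest       = rest
```

With (+) this matches foldl (+) 0; with a non-associative operator like (-) the two disagree, which is why fold demands associativity.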
Stencils

• A stencil is a map with access to the neighbourhood around each element
- Useful in many scientific & image processing algorithms
- Boundary conditions specify how to handle out-of-bounds neighbours

laplace :: Stencil3x3 Int -> Exp Int
laplace ((_,t,_)
        ,(l,c,r)
        ,(_,b,_)) = t + b + l + r - 4*c

    t
  l c r
    b

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ stencil laplace (Constant 0) (use mat)
Array (Z :. 3 :. 5) [4,3,2,1,-6,-5,0,0,0,-11,-26,-17,-18,-19,-36]
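The output above can be checked with a list-based sketch of the same stencil (laplaceModel is an illustrative name), where the Constant 0 boundary simply makes out-of-bounds neighbours read as 0:

```haskell
-- List model of the 3x3 laplace stencil with a Constant 0 boundary:
-- at each position, t + b + l + r - 4*c over the four neighbours.
laplaceModel :: [[Int]] -> [[Int]]
laplaceModel mat =
  [ [ get (y-1) x + get (y+1) x + get y (x-1) + get y (x+1) - 4 * get y x
    | x <- [0 .. w-1] ]
  | y <- [0 .. h-1] ]
  where
    h = length mat
    w = length (head mat)
    get y x
      | y < 0 || y >= h || x < 0 || x >= w = 0   -- Constant 0 boundary
      | otherwise                          = mat !! y !! x
```

Running it on the 3×5 matrix [1..15] reproduces the array shown on the slide.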
Index Transforms

• Index transforms change the order of elements, not their values
- We usually want to push such operations into the consumer

• backpermute specifies which element to read from a source array

backpermute :: (Shape ix, Shape ix', Elt a)
            => Exp ix'                -- shape of result
            -> (Exp ix' -> Exp ix)    -- index mapping from destination array to source
            -> Acc (Array ix a)       -- source data
            -> Acc (Array ix' a)
Index Transforms

• Index transforms change the order of elements, not their values

transpose mat =
  let swap = lift1 $ \(Z:.j:.i) -> Z:.i:.j :: Z :. Exp Int :. Exp Int
  in  backpermute (swap $ shape mat) swap mat

 1  2  3  4  5        1  6 11
 6  7  8  9 10   →    2  7 12
11 12 13 14 15        3  8 13
                      4  9 14
                      5 10 15
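The destination-to-source reading pattern of backpermute can be sketched on lists (backpermuteModel and transposeModel are illustrative names): for each index of the result, apply the index mapping and read that element of the source.

```haskell
-- List model of backpermute: build the result of the given shape by
-- reading, for each destination index, the source element that the
-- index mapping points at.
backpermuteModel :: (Int, Int)                 -- shape of result (rows, cols)
                 -> ((Int, Int) -> (Int, Int)) -- destination index -> source index
                 -> [[a]]                      -- source data
                 -> [[a]]
backpermuteModel (h, w) p src =
  [ [ let (sy, sx) = p (y, x) in src !! sy !! sx | x <- [0 .. w-1] ]
  | y <- [0 .. h-1] ]

-- transpose: destination (y, x) reads source (x, y), and the result
-- shape is the swapped source shape.
transposeModel :: [[a]] -> [[a]]
transposeModel src =
  backpermuteModel (length (head src), length src) (\(y, x) -> (x, y)) src
```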