Transcript of GPGPU Programming in Haskell with Accelerate · 2019-09-23
GPGPU Programming in Haskell with Accelerate
Trevor L. McDonell, University of New South Wales
https://github.com/AccelerateHS
Friday, 17 May 13
What is GPGPU Programming?
• General Purpose Programming on Graphics Processing Units (GPUs)
• Using your graphics card for something other than playing games
• GPUs have many more cores than a CPU
- GeForce GTX Titan
- 2688 cores @ 837 MHz
- 6 GB memory @ 288 GB/s
What is GPGPU Programming?
• Main differences:
- Single program multiple data (SPMD / SIMD), or just data-parallelism
- All the cores run the same program, but on different data
• We can’t program these in the same way as a CPU
- Different instruction sets: can’t run a Haskell program directly
- More restrictive hardware designs, limited control structures
• GPUs have their own memory
- Data has to be explicitly moved back and forth
Dot product four ways
• Dot-product: pair-wise multiply two arrays and sum the result
• C (sequential):
float dotp(float *xs, float *ys, int size)
{
    int i;
    float sum = 0;
    for (i = 0; i < size; ++i) {
        sum = sum + xs[i] * ys[i];
    }
    return sum;
}
Dot product four ways
• Haskell (sequential):
- [Float] is a list of floating point numbers
- zipWith applies the function (*) element-wise to the two input lists
- foldl sums the result of zipWith
dotp :: [Float] -> [Float] -> Float
dotp xs ys = foldl (+) 0 ( zipWith (*) xs ys )
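The two-line definition above runs as an ordinary Haskell program. A minimal standalone sketch (the example input values are my own):

```haskell
-- Sequential dot product over plain Haskell lists, as on the slide
dotp :: [Float] -> [Float] -> Float
dotp xs ys = foldl (+) 0 (zipWith (*) xs ys)

main :: IO ()
main = print (dotp [1, 2, 3] [4, 5, 6])   -- 1*4 + 2*5 + 3*6 = 32.0
```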
Dot product four ways
• CUDA (parallel):
- Step 1: element-wise multiplication
- __global__ indicates this is a function that runs on the GPU
- spawn one thread for each element in the vector: the index i is unique to each thread
__global__ void zipWithMult(float *xs, float *ys, float *zs, int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size) { zs[i] = xs[i] * ys[i]; }
}
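The index computation blockDim.x * blockIdx.x + threadIdx.x can be checked with a tiny model in plain Haskell (the names globalIx and indices are mine): a grid of blocks covers every element exactly once.

```haskell
-- Model of the CUDA global-index computation: blockDim*blockIdx + threadIdx
globalIx :: Int -> Int -> Int -> Int
globalIx blockDim blockIdx threadIdx = blockDim * blockIdx + threadIdx

-- All indices produced by a grid of 4 blocks of 8 threads each
indices :: [Int]
indices = [ globalIx 8 b t | b <- [0 .. 3], t <- [0 .. 7] ]

main :: IO ()
main = print (indices == [0 .. 31])   -- each element covered exactly once
```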
Dot product four ways
• CUDA (parallel):
- Step 2: vector reduction … is somewhat complex

struct SharedMemory
{
    __device__ inline operator float *()
    {
        extern __shared__ int __smem[];
        return (float *) __smem;
    }
    __device__ inline operator const float *() const
    {
        extern __shared__ int __smem[];
        return (float *) __smem;
    }
};

template <unsigned int blockSize, bool nIsPow2>
__global__ void reduce_kernel(float *g_idata, float *g_odata, unsigned int n)
{
    float *sdata = SharedMemory();
    unsigned int tid      = threadIdx.x;
    unsigned int i        = blockIdx.x * blockSize * 2 + threadIdx.x;
    unsigned int gridSize = blockSize * 2 * gridDim.x;
    float sum = 0;

    while (i < n) {
        sum += g_idata[i];
        if (nIsPow2 || i + blockSize < n) sum += g_idata[i + blockSize];
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) { sdata[tid] = sum = sum + sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] = sum = sum + sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] = sum = sum + sdata[tid +  64]; } __syncthreads(); }

    if (tid < 32) {
        volatile float *smem = sdata;
        if (blockSize >= 64) { smem[tid] = sum = sum + smem[tid + 32]; }
        if (blockSize >= 32) { smem[tid] = sum = sum + smem[tid + 16]; }
        if (blockSize >= 16) { smem[tid] = sum = sum + smem[tid +  8]; }
        if (blockSize >=  8) { smem[tid] = sum = sum + smem[tid +  4]; }
        if (blockSize >=  4) { smem[tid] = sum = sum + smem[tid +  2]; }
        if (blockSize >=  2) { smem[tid] = sum = sum + smem[tid +  1]; }
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

void getNumBlocksAndThreads(int n, int maxBlocks, int maxThreads, int &blocks, int &threads)
{
    cudaDeviceProp prop;
    int device;
    checkCudaErrors(cudaGetDevice(&device));
    checkCudaErrors(cudaGetDeviceProperties(&prop, device));

    threads = (n < maxThreads*2) ? nextPow2((n + 1) / 2) : maxThreads;
    blocks  = (n + (threads * 2 - 1)) / (threads * 2);
    if (blocks > prop.maxGridSize[0]) {
        blocks  /= 2;
        threads *= 2;
    }
    blocks = min(maxBlocks, blocks);
}

float reduce(int n, float *d_idata, float *d_odata)
{
    int threads = 0, blocks = 0;
    int maxThreads = 256, maxBlocks = 64;
    int size = n;

    while (size > 1) {
        getNumBlocksAndThreads(size, maxBlocks, maxThreads, blocks, threads);
        int smemSize = (threads <= 32) ? 2 * threads * sizeof(float) : threads * sizeof(float);
        dim3 dimBlock(threads, 1, 1);
        dim3 dimGrid(blocks, 1, 1);

        if (isPow2(size)) {
            switch (threads) {
                case 512: reduce_kernel<512, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case 256: reduce_kernel<256, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case 128: reduce_kernel<128, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  64: reduce_kernel< 64, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  32: reduce_kernel< 32, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  16: reduce_kernel< 16, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   8: reduce_kernel<  8, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   4: reduce_kernel<  4, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   2: reduce_kernel<  2, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   1: reduce_kernel<  1, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            }
        } else {
            switch (threads) {
                case 512: reduce_kernel<512, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case 256: reduce_kernel<256, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case 128: reduce_kernel<128, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  64: reduce_kernel< 64, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  32: reduce_kernel< 32, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case  16: reduce_kernel< 16, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   8: reduce_kernel<  8, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   4: reduce_kernel<  4, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   2: reduce_kernel<  2, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
                case   1: reduce_kernel<  1, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            }
        }
        size = (size + (threads*2 - 1)) / (threads*2);
    }

    float sum;
    checkCudaErrors(cudaMemcpy(&sum, d_odata, sizeof(float), cudaMemcpyDeviceToHost));
    return sum;
}
o_O
Dot product four ways
• Accelerate (parallel):
• Recall the sequential Haskell version:
dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = fold (+) 0 ( zipWith (*) xs ys )

dotp :: [Float] -> [Float] -> Float
dotp xs ys = foldl (+) 0 ( zipWith (*) xs ys )

foldl: left-to-right traversal
fold: neither left nor right: happens in parallel (tree-like)
But… how does it perform?
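The tree-like reduction can be mimicked in plain Haskell; for an associative operator it gives the same answer as the left fold. A sketch (treeFold is a hypothetical name, not part of Accelerate):

```haskell
-- Pairwise (tree-shaped) reduction, the shape a parallel fold uses:
-- combine adjacent pairs, halving the list until one value remains.
treeFold :: (a -> a -> a) -> a -> [a] -> a
treeFold _ z []  = z
treeFold _ _ [x] = x
treeFold f z xs  = treeFold f z (pairwise xs)
  where
    pairwise (a:b:rest) = f a b : pairwise rest
    pairwise rest       = rest

main :: IO ()
main = print ( treeFold (+) 0 [1 .. 10 :: Int]
             , foldl    (+) 0 [1 .. 10 :: Int] )  -- both 55
```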
Dot product four ways
[Plot: dot product run time (ms, log scale) against number of elements (2-20 million), comparing sequential CPU code, Accelerate, and CUBLAS. Accelerate runs within 1.2x of CUBLAS, and roughly 30x faster than the sequential version.]

Tesla T10 (240 cores @ 1.3 GHz) vs. Xeon E5405 (2 GHz)
Mandelbrot fractal
n-body gravitational simulation
Canny edge detection
SmoothLife cellular automata
stable fluid flow
[Demo output: a list of MD5 digests with their recovered passwords, e.g. "d0e52829bf7962ee0aa90550ffdcccaa: laura123"]
Recovered 150/1000 (15.00 %) digests in 59.45 s, 185.03 MHash/sec
Password “recovery” (MD5 dictionary attack)
Accelerate
• Accelerate is a Domain-Specific Language for GPU programming
Haskell/Accelerate program
  → Transform Accelerate program into CUDA code
  → Compile with NVIDIA’s compiler & load onto the GPU
  → Copy result back to Haskell
Accelerate
• Accelerate is a Domain-Specific Language for GPU programming
- This process may happen several times during program execution
- Code and data fragments get cached and reused
• An Accelerate program is a Haskell program that generates a CUDA program
- However, in many respects this still looks like a Haskell program
- Shares various concepts with Repa, a Haskell array library for CPUs
Accelerate
• To execute an Accelerate computation (on the GPU):
- run comes from whichever backend we have chosen (CUDA)
- Arrays constrains the result to be an Array, or tuple thereof
• What is Acc?
- This is our DSL type
- A data structure (Abstract Syntax Tree) representing a computation that once executed will yield a result of type ‘a’
run :: Arrays a => Acc a -> a

Accelerate is a library of collective operations over arrays of type Acc a
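Putting run together with the earlier dot product gives a complete program. A sketch assuming the accelerate and accelerate-cuda packages (not runnable here without those libraries and a CUDA device; the example input values are my own):

```haskell
import Prelude hiding (zipWith)
import Data.Array.Accelerate      as A
import Data.Array.Accelerate.CUDA (run)

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

main :: IO ()
main = do
  let xs = A.fromList (Z :. 4) [1 .. 4] :: Vector Float
      ys = A.fromList (Z :. 4) [5 .. 8] :: Vector Float
  -- run builds the AST, generates CUDA, executes it, and copies back
  print (run (dotp (A.use xs) (A.use ys)))   -- a Scalar holding 70.0
```

use embeds an existing Haskell-side array into the Acc world, so the data is transferred to the GPU before the computation runs.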
Accelerate
• Accelerate computations take place on arrays
- Parallelism is introduced in the form of collective operations over arrays
• Arrays have two type parameters
- The shape of the array, or dimensionality
- The element type of the array: Int, Float, etc.
data Array sh e

Arrays in → Accelerate computation → Arrays out

run :: Arrays a => Acc a -> a
Arrays
• Supported array element types are members of the Elt class:
- ()
- Int, Int32, Int64, Word, Word32, Word64...
- Float, Double
- Char
- Bool
- Tuples up to 9-tuples of these, including nested tuples
• Note that Array itself is not an allowable element type. There are no nested arrays in Accelerate, regular arrays only!
data Array sh e
Accelerate by example: Mandelbrot fractal
Mandelbrot set generator
• Basics
- Pick a window onto the complex plane & divide it into pixels
- Each pixel has a value c given by its coordinates in the complex plane
- A point c is in the set if |z_n| does not diverge to infinity, where z_{n+1} = c + z_n^2
- Colour depends on the number of iterations n before divergence
• Each pixel is independent: lots of data parallelism
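Sequentially, the per-pixel computation is a small loop. A plain-Haskell sketch (the names mandel and limit are mine) of the escape-time count each pixel performs, iterating z' = c + z² from z = c:

```haskell
-- Count iterations of z_{n+1} = c + z_n^2 (starting from z_0 = c)
-- until |z| exceeds 2, or until the iteration limit is reached.
mandel :: (Float, Float) -> Int -> Int
mandel (cr, ci) limit = go 0 (cr, ci)
  where
    go n (zr, zi)
      | n >= limit         = n      -- never diverged: treat as in the set
      | zr*zr + zi*zi > 4  = n      -- |z| > 2: diverged after n steps
      | otherwise          = go (n + 1) (cr + zr*zr - zi*zi, ci + 2*zr*zi)

main :: IO ()
main = print (mandel (0, 0) 255, mandel (2, 0) 255)   -- (255,1)
```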
Mandelbrot set generator
• First, some types:
- A pair of floating point numbers for the real and imaginary parts
- DIM2 is a type synonym for a two dimensional Shape
type Complex      = (Float, Float)
type ComplexPlane = Array DIM2 Complex
data Array sh e
Shapes
• Shapes determine the dimensions of the array and the type of its index
- Z represents a rank-zero array (singleton array with one element)
- (:.) increases the rank by adding a new dimension on the right
• Examples:
- One-dimensional array (Vector) indexed by Int: (Z :. Int)
- Two-dimensional array, indexed by Int: (Z :. Int :. Int)
• This style is used at both the type and value level:
data Z = Z
data tail :. head = tail :. head

sh :: Z :. Int
sh = Z :. 10
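Because the slide gives complete definitions of Z and (:.), they can be reproduced standalone. A sketch (the Size class here is my own illustration, not Accelerate's API) showing how a shape value determines the number of elements:

```haskell
{-# LANGUAGE TypeOperators, FlexibleInstances #-}

-- The shape types exactly as defined on the slide
data Z = Z
data tail :. head = tail :. head
infixl 3 :.

-- Hypothetical helper: total number of elements in an index space
class Size sh where size :: sh -> Int
instance Size Z where size Z = 1
instance Size sh => Size (sh :. Int) where size (sh :. n) = size sh * n

vec :: Z :. Int
vec = Z :. 10        -- shape of a 10-element vector

mat :: Z :. Int :. Int
mat = Z :. 3 :. 5    -- shape of a 3x5 matrix

main :: IO ()
main = print (size vec, size mat)   -- (10,15)
```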
Mandelbrot set generator
• The initial complex plane:
• generate is a collective operation that yields a value at each index of the array
- Supported shape and element types: we will use DIM2 and Complex
- Size of the result array: the number of pixels in the image
- A function to apply at every index: compute the value of c at each pixel

generate :: (Shape sh, Elt a) => Exp sh -> (Exp sh -> Exp a) -> Acc (Array sh a)

z0 = c
If Acc is our DSL type, what is Exp?
A Stratified Language

• Accelerate is split into two worlds: Acc and Exp
- Acc represents collective operations over instances of Arrays
- Exp is a scalar computation on things of type Elt

• Collective operations in Acc comprise many scalar operations in Exp, executed in parallel over Arrays
- Scalar operations cannot contain collective operations

• This stratification excludes nested data parallelism
Collective Operations

• Collective operations comprise many scalar operations applied in parallel
- constant lifts a plain value into the Exp land of scalar expressions

constant :: Elt e => e -> Exp e

example :: Acc (Vector Int)
example = generate (constant (Z:.10)) (\ix -> f ix)

• Evaluating generate (constant (Z:.10)) f applies f, of type Exp DIM1 -> Exp Int, at every index in parallel, producing an Acc (Vector Int):

f 0   f 1   f 2   f 3   f 4   f 5   f 6   f 7   f 8   f 9
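To make the picture concrete, here is a plain-Haskell sketch of what generate computes, modelling the array as a list and indices as plain Ints (the name generateModel is illustrative, not part of the Accelerate API); the real version evaluates each application of f in parallel on the GPU:

```haskell
-- List-based model of generate: apply f at every index of the result.
-- Accelerate evaluates each f ix as an independent parallel thread.
generateModel :: Int -> (Int -> a) -> [a]
generateModel n f = map f [0 .. n-1]
```

For example, generateModel 10 (*2) models generate (constant (Z:.10)) (\ix -> 2 * unindex1 ix).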
Mandelbrot set generator

genPlane :: Exp (Float,Float,Float,Float) -> Acc ComplexPlane
genPlane view =
  generate (constant (Z:.600:.800))
           (\ix -> let Z :. y :. x = unlift ix
                   in  lift ( xmin + (fromIntegral x * viewx) / 800
                            , ymin + (fromIntegral y * viewy) / 600 ))
  where
    (xmin,ymin,xmax,ymax) = unlift view
    viewx = xmax - xmin
    viewy = ymax - ymin

- unlift has a funky type, but just unpacks “stuff”; lift does the reverse. At its two uses here it specialises to:

unlift :: Exp (Z :. Int :. Int) -> Z :. Exp Int :. Exp Int
unlift :: Exp (F,F,F,F) -> (Exp F, Exp F, Exp F, Exp F)

- Even though operations are in Exp, we can still use standard Haskell operations like (*) and (-)
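The per-pixel arithmetic can be checked in plain Haskell, outside the Exp world. This sketch (pixelCoord is an illustrative name; the 800×600 extents match the slide) computes the complex-plane coordinate of one pixel:

```haskell
-- Plain-Haskell model of the coordinate genPlane computes at pixel (y, x)
-- for a view rectangle (xmin, ymin, xmax, ymax).
pixelCoord :: (Float, Float, Float, Float) -> Int -> Int -> (Float, Float)
pixelCoord (xmin, ymin, xmax, ymax) y x =
  ( xmin + (fromIntegral x * (xmax - xmin)) / 800
  , ymin + (fromIntegral y * (ymax - ymin)) / 600 )
```

Pixel (0,0) maps to the corner (xmin, ymin), and the centre pixel maps to the centre of the view.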
Mandelbrot set generator

• Let’s define the function we will iterate:  zₙ₊₁ = c + zₙ²
- Use lift and unlift as before to pack and unpack the tuples
- Note that we had to add some type signatures to unlift

next :: Exp Complex -> Exp Complex -> Exp Complex
next c z = plus c (times z z)

plus :: Exp Complex -> Exp Complex -> Exp Complex
plus a b = lift (ax+bx, ay+by)
  where
    (ax,ay) = unlift a :: (Exp Float, Exp Float)
    (bx,by) = unlift b :: (Exp Float, Exp Float)

lift :: (Exp Float, Exp Float) -> Exp (Float, Float)
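The slide shows plus but not times; for reference, here is a plain-Haskell sketch of both, using ordinary Float pairs instead of Exp Complex (the names plusC, timesC, nextC are illustrative). times is the usual complex product:

```haskell
type C = (Float, Float)

plusC :: C -> C -> C
plusC (ax, ay) (bx, by) = (ax + bx, ay + by)

-- (a + bi)(c + di) = (ac - bd) + (ad + bc)i
timesC :: C -> C -> C
timesC (ax, ay) (bx, by) = (ax*bx - ay*by, ax*by + ay*bx)

-- One step of the iteration z_{n+1} = c + z_n^2
nextC :: C -> C -> C
nextC c z = plusC c (timesC z z)
```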
Mandelbrot set generator

• Complication: GPUs must do the same thing to lots of different data

zₙ₊₁ = c + zₙ²

- Keep a pair (z,i) for each pixel, where i is the iteration at which the point diverged
- fst and snd extract the individual tuple components
- dot is the squared magnitude of a complex number
- Conditionals can lead to SIMD divergence, so use them sparingly

(?) :: Elt t => Exp Bool -> (Exp t, Exp t) -> Exp t

step :: Exp Complex -> Exp (Complex, Int) -> Exp (Complex, Int)
step c zi = f (fst zi) (snd zi)
  where
    f z i = let z' = next c z
            in  dot z' >* 4 ? ( zi, lift (z', i+1) )
Mandelbrot set generator

• Accelerate is a meta programming language
- So, just use regular Haskell to unroll the loop a fixed number of times
- zipWith applies its function step pairwise to elements of the two arrays
- replicate makes 255 copies of the transition step from zₙ to zₙ₊₁
- foldl1 (.) composes them, so the steps apply in sequence, beginning with the initial data zs0: f (g x) ≣ (f . g) x

mandel :: Exp (Float,Float,Float,Float) -> Acc (Array DIM2 (Complex, Int))
mandel view = Prelude.foldl1 (.) (Prelude.replicate 255 go) zs0
  where
    cs    = genPlane view
    zs0   = zip cs (fill (shape cs) 0)
    go zs = zipWith step cs zs
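The foldl1 (.) trick can be demonstrated in plain Haskell on a single point rather than a whole array. This sketch (escapeTime and the helpers are illustrative names, not from the slides' code) composes 255 copies of the step function and reads off the iteration count:

```haskell
type C = (Float, Float)

dot :: C -> Float
dot (x, y) = x*x + y*y

next :: C -> C -> C
next (cx, cy) (zx, zy) = (cx + zx*zx - zy*zy, cy + 2*zx*zy)

-- Once the point has diverged (|z|^2 > 4) it stops changing,
-- exactly as in the slide's step function.
step :: C -> (C, Int) -> (C, Int)
step c (z, i) =
  let z' = next c z
  in  if dot z' > 4 then (z, i) else (z', i + 1)

-- foldl1 (.) over 255 replicated steps unrolls the loop, then we
-- apply the composed function to the initial state (c, 0).
escapeTime :: C -> Int
escapeTime c = snd (foldl1 (.) (replicate 255 (step c)) (c, 0))
```

A point inside the set (c = 0) survives all 255 steps, while c = 2 diverges immediately.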
In the workshop…

• More example code, computational kernels
- Common pitfalls
- Tips for good performance
- Figuring out what went wrong (or knowing who to blame: me)
Questions?

https://github.com/AccelerateHS/
http://xkcd.com/365/
Extra Slides…

Seriously?
Arrays

data Array sh e

• Create an array from a list:

fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e

- Generates a multidimensional array by consuming elements from the list and adding them to the array in row-major order

• Example:

> fromList (Z:.10) [1..10] :: Vector Float
Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

• Multidimensional arrays are similar:
- Elements are filled along the right-most dimension first

> fromList (Z:.3:.5) [1..] :: Array DIM2 Int
Array (Z :. 3 :. 5) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
Arrays

• Array indices start counting from zero

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> indexArray mat (Z:.2:.1)
12

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
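Row-major order means index (Z :. y :. x) into a 3×5 array selects linear element y*5 + x. A small list-based sketch (indexModel is an illustrative name) makes the arithmetic checkable:

```haskell
-- Model of row-major indexing into a matrix of the given width,
-- stored as a flat list: index (y, x) selects element y*width + x.
indexModel :: Int -> [a] -> (Int, Int) -> a
indexModel width xs (y, x) = xs !! (y * width + x)
```

So indexModel 5 [1..15] (2,1) reads linear position 2*5 + 1 = 11, i.e. the value 12, matching indexArray mat (Z:.2:.1) above.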
Arrays

• Similarly, an array of (nested) tuples:

> fromList (Z:.2:.3) $ P.zip [1..] ['a'..] :: Array DIM2 (Int,Char)
Array (Z :. 2 :. 3) [(1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e'),(6,'f')]

- This is just a trick: internally converted into a tuple of arrays

( 1 2 3   a b c )
( 4 5 6 , d e f )
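The conversion behind this trick is exactly what Prelude's unzip and zip compute, here sketched on lists (toSoA/toAoS are illustrative names for the array-of-structures to structure-of-arrays round trip):

```haskell
-- An "array" of pairs, stored internally as a pair of "arrays".
toSoA :: [(a, b)] -> ([a], [b])
toSoA = unzip

-- ...and back again.
toAoS :: ([a], [b]) -> [(a, b)]
toAoS = uncurry zip
```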
Data.Array.Accelerate

• Need to import both the base library as well as a backend

import Prelude as P
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA as CUDA

- There is also an interpreter available for testing
- Runs without using the GPU (much more slowly, of course)
Data.Array.Accelerate

• To get arrays into the Acc world:

use :: Arrays a => a -> Acc a

- This may involve copying data to the GPU

• use injects arrays into our DSL
• run executes the computation to get arrays out
• Using Accelerate focuses on everything in between
Collective Operations

• Example: add one to each element of an array

> let arr = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ A.map (+1) (use arr)
Array (Z :. 3 :. 5) [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]

• What is the type of map?

map :: (Shape sh, Elt a, Elt b)    -- supported shape & element types
    => (Exp a -> Exp b)            -- function to apply at every element. But what is Exp?
    -> Acc (Array sh a)            -- DSL array (input)
    -> Acc (Array sh b)            -- DSL array (result)
Scalar Expressions

• The type class overloading trick is used for standard Haskell classes

(+1) :: (Elt a, IsNum a) => Exp a -> Exp a

• Standard boolean operations are available with slightly different names
- The standard names can not be overloaded

(==*) :: (Elt t, IsScalar t) => Exp t -> Exp t -> Exp Bool
(/=*), (<*), (>*), min, max, (||*), (&&*)    -- and so on...

• Conditionals
- Use sparingly: leads to SIMD divergence

(?) :: Elt t => Exp Bool -> (Exp t, Exp t) -> Exp t
Scalar Expressions

• Bring a Haskell value into Exp land

constant :: Elt e => e -> Exp e

• Lift an expression into a singleton array

unit :: Exp e -> Acc (Scalar e)

• Extract the element from a singleton array

the :: Acc (Scalar e) -> Exp e
Reductions

• Folding (+) over a vector produces a sum
- The result is a one-element array (scalar). Why?

> let xs = fromList (Z:.10) [1..] :: Vector Int
> run $ A.fold (+) 0 (use xs)
Array (Z) [55]

• Fold has an interesting type:

fold :: (Shape sh, Elt a)
     => (Exp a -> Exp a -> Exp a)
     -> Exp a
     -> Acc (Array (sh:.Int) a)    -- input array
     -> Acc (Array sh a)           -- outer dimension removed
Reductions

• Fold occurs over the outer dimension of the array

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ A.fold (+) 0 (use mat)
Array (Z :. 3) [15,40,65]

 1  2  3  4  5   →  15
 6  7  8  9 10   →  40
11 12 13 14 15   →  65
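Modelling the matrix as a list of rows, the reduction above is one fold per row; this sketch (foldRows is an illustrative name) reproduces the result shown:

```haskell
-- fold on a DIM2 array yields one reduced value per row:
-- a vector of row sums when the operator is (+).
foldRows :: Num a => a -> [[a]] -> [a]
foldRows z = map (foldl (+) z)
```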
Reductions

• Is this a left-fold or a right-fold?
- Neither! The fold happens in parallel, tree-like
- Therefore the function must be associative: (Exp a -> Exp a -> Exp a)
- (We pretend that floating point operations are associative, though strictly speaking they are not)
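The tree shape can be sketched in plain Haskell: combine adjacent pairs repeatedly until one value remains (foldTree is an illustrative name, not how any backend actually schedules the reduction). For an associative operator this agrees with a sequential foldl:

```haskell
-- Tree-shaped reduction: each pass halves the list by combining
-- adjacent pairs, the way a parallel fold might evaluate.
foldTree :: (a -> a -> a) -> a -> [a] -> a
foldTree _ z []  = z
foldTree _ _ [x] = x
foldTree f z xs  = foldTree f z (pairwise xs)
  where
    pairwise (a:b:rest) = f a b : pairwise rest
    pairwise rest       = rest
```

With (+) this matches foldl (+) 0; with a non-associative operator like (-) the two disagree, which is why fold demands associativity.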
Stencils

• A stencil is a map with access to the neighbourhood around each element
- Useful in many scientific & image processing algorithms
- Boundary conditions specify how to handle out-of-bounds neighbours

laplace :: Stencil3x3 Int -> Exp Int
laplace ((_,t,_)
        ,(l,c,r)
        ,(_,b,_)) = t + b + l + r - 4*c

    t
  l c r
    b

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ stencil laplace (Constant 0) (use mat)
Array (Z :. 3 :. 5) [4,3,2,1,-6,-5,0,0,0,-11,-26,-17,-18,-19,-36]
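The output above can be checked with a list-based sketch of the same stencil (laplaceModel is an illustrative name), where the Constant 0 boundary simply makes out-of-bounds neighbours read as 0:

```haskell
-- List model of the 3x3 laplace stencil with a Constant 0 boundary:
-- at each position, t + b + l + r - 4*c over the four neighbours.
laplaceModel :: [[Int]] -> [[Int]]
laplaceModel mat =
  [ [ get (y-1) x + get (y+1) x + get y (x-1) + get y (x+1) - 4 * get y x
    | x <- [0 .. w-1] ]
  | y <- [0 .. h-1] ]
  where
    h = length mat
    w = length (head mat)
    get y x
      | y < 0 || y >= h || x < 0 || x >= w = 0   -- Constant 0 boundary
      | otherwise                          = mat !! y !! x
```

Running it on the 3×5 matrix [1..15] reproduces the array shown on the slide.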
Index Transforms

• Index transforms change the order of elements, not their values
- We usually want to push such operations into the consumer

• backpermute specifies which element to read from a source array

backpermute :: (Shape ix, Shape ix', Elt a)
            => Exp ix'                -- shape of result
            -> (Exp ix' -> Exp ix)    -- index mapping from destination array to source
            -> Acc (Array ix a)       -- source data
            -> Acc (Array ix' a)
Index Transforms

• Index transforms change the order of elements, not their values

transpose mat =
  let swap = lift1 $ \(Z:.j:.i) -> Z:.i:.j :: Z :. Exp Int :. Exp Int
  in  backpermute (swap $ shape mat) swap mat

 1  2  3  4  5        1  6 11
 6  7  8  9 10   →    2  7 12
11 12 13 14 15        3  8 13
                      4  9 14
                      5 10 15
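The destination-to-source reading pattern of backpermute can be sketched on lists (backpermuteModel and transposeModel are illustrative names): for each index of the result, apply the index mapping and read that element of the source.

```haskell
-- List model of backpermute: build the result of the given shape by
-- reading, for each destination index, the source element that the
-- index mapping points at.
backpermuteModel :: (Int, Int)                 -- shape of result (rows, cols)
                 -> ((Int, Int) -> (Int, Int)) -- destination index -> source index
                 -> [[a]]                      -- source data
                 -> [[a]]
backpermuteModel (h, w) p src =
  [ [ let (sy, sx) = p (y, x) in src !! sy !! sx | x <- [0 .. w-1] ]
  | y <- [0 .. h-1] ]

-- transpose: destination (y, x) reads source (x, y), and the result
-- shape is the swapped source shape.
transposeModel :: [[a]] -> [[a]]
transposeModel src =
  backpermuteModel (length (head src), length src) (\(y, x) -> (x, y)) src
```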