GPGPU Programming in Haskell with Accelerate

Trevor L. McDonell
University of New South Wales

@tlmcdonell
[email protected]

https://github.com/AccelerateHS

Friday, 17 May 13

What is GPGPU Programming?

• General Purpose Programming on Graphics Processing Units (GPUs)

• Using your graphics card for something other than playing games

• GPUs have many more cores than a CPU

- GeForce GTX Titan

- 2688 cores @ 837 MHz

- 6 GB memory @ 288 GB/s

What is GPGPU Programming?

• Main differences:

- Single program multiple data (SPMD / SIMD), or just data-parallelism

- All the cores run the same program, but on different data

• We can’t program these in the same way as a CPU

- Different instruction sets: can’t run a Haskell program directly

- More restrictive hardware designs, limited control structures

• GPUs have their own memory

- Data has to be explicitly moved back and forth
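The two points above (one program run on different data, and a separate memory that data must be copied into and out of) can be imitated on a CPU. The following is only a sequential C sketch under stated assumptions: `square_kernel` and `square_all` are hypothetical names, the "device" buffers are ordinary heap allocations standing in for GPU memory, and the loop over `tid` stands in for threads that a GPU would run in parallel.

```c
#include <stdlib.h>
#include <string.h>

/* The one program every "core" runs; only the index tid differs. */
static void square_kernel(int tid, const float *in, float *out, int n)
{
    if (tid < n) {
        out[tid] = in[tid] * in[tid];
    }
}

/* Hypothetical host-side driver: stage the data in separate buffers
 * (standing in for GPU memory), run the kernel once per index, and
 * copy the results back. Returns 0 on success, -1 on allocation failure. */
int square_all(const float *host_in, float *host_out, int n)
{
    size_t bytes   = (size_t)n * sizeof(float);
    float *dev_in  = malloc(bytes);
    float *dev_out = malloc(bytes);
    int    tid;

    if (dev_in == NULL || dev_out == NULL) {
        free(dev_in);
        free(dev_out);
        return -1;
    }

    memcpy(dev_in, host_in, bytes);      /* host -> "device" */

    for (tid = 0; tid < n; ++tid) {      /* a GPU would run these in parallel */
        square_kernel(tid, dev_in, dev_out, n);
    }

    memcpy(host_out, dev_out, bytes);    /* "device" -> host */

    free(dev_in);
    free(dev_out);
    return 0;
}
```

Calling `square_all` on the four values 1, 2, 3, 4 fills the output with 1, 4, 9, 16. The two `memcpy` calls mark where a real GPU program would pay for transfers between host and device memory.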

Dot product four ways

• Dot-product: pair-wise multiply two arrays and sum the result

• C (sequential):

float dotp(float *xs, float *ys, int size)
{
    int   i;
    float sum = 0;

    for (i = 0; i < size; ++i) {
        sum = sum + xs[i] * ys[i];
    }

    return sum;
}

Dot product four ways

• Haskell (sequential):

- [Float] is a list of floating point numbers

- zipWith applies the function (*) element-wise to the two input lists

- foldl sums the result of zipWith

dotp :: [Float] -> [Float] -> Float
dotp xs ys = foldl (+) 0 (zipWith (*) xs ys)
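The same two-stage structure, an element-wise multiply followed by a sum, can be written out in C. This is only an illustrative sketch: `zip_with_mul`, `fold_add` and `dotp_two_stage` are hypothetical names chosen to mirror `zipWith (*)` and `foldl (+) 0`, and, unlike the lazy Haskell list version, the intermediate products are stored in a temporary array.

```c
#include <stdlib.h>

/* zipWith (*): multiply the two inputs element by element. */
static void zip_with_mul(const float *xs, const float *ys, float *zs, int size)
{
    int i;
    for (i = 0; i < size; ++i) {
        zs[i] = xs[i] * ys[i];
    }
}

/* foldl (+) 0: sum the elements from left to right, starting from 0. */
static float fold_add(const float *zs, int size)
{
    float acc = 0;
    int   i;
    for (i = 0; i < size; ++i) {
        acc = acc + zs[i];
    }
    return acc;
}

/* Dot product as the composition of the two stages.
 * Returns 0 for an empty input or if the temporary cannot be allocated. */
float dotp_two_stage(const float *xs, const float *ys, int size)
{
    float *zs;
    float  result;

    if (size <= 0) {
        return 0;
    }

    zs = malloc((size_t)size * sizeof(float));
    if (zs == NULL) {
        return 0;
    }

    zip_with_mul(xs, ys, zs, size);   /* stage 1: products */
    result = fold_add(zs, size);      /* stage 2: sum */

    free(zs);
    return result;
}
```

For xs = {1, 2, 3} and ys = {4, 5, 6} the products are 4, 10 and 18, so the result is 32. Splitting the work this way is what the CUDA version has to do as well: one step for the multiplication and one for the reduction.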

Dot product four ways

• CUDA (parallel):

- Step 1: element-wise multiplication

- __global__ indicates this is a function that runs on the GPU

- Spawn one thread for each element of the vector; the index i is unique to each thread

__global__ void zipWithMult(float *xs, float *ys, float *zs, int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < size) {
        zs[i] = xs[i] * ys[i];
    }
}
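To see why the `i < size` guard matters, the launch can be emulated sequentially in plain C. This is a sketch under stated assumptions, not how CUDA schedules work: `zip_with_mult_thread` and `launch_zip_with_mult` are hypothetical names, and `block_dim`, `block_idx` and `thread_idx` stand in for CUDA's `blockDim.x`, `blockIdx.x` and `threadIdx.x`.

```c
/* Hypothetical CPU stand-in for one CUDA thread of zipWithMult. */
static void zip_with_mult_thread(int block_dim, int block_idx, int thread_idx,
                                 const float *xs, const float *ys,
                                 float *zs, int size)
{
    int i = block_dim * block_idx + thread_idx;   /* same index formula as the kernel */

    if (i < size) {                               /* threads past the end do nothing */
        zs[i] = xs[i] * ys[i];
    }
}

/* Emulates a launch with enough blocks to cover size elements, rounding up.
 * Returns the number of blocks "launched". */
int launch_zip_with_mult(const float *xs, const float *ys, float *zs,
                         int size, int block_dim)
{
    int num_blocks = (size + block_dim - 1) / block_dim;
    int b, t;

    for (b = 0; b < num_blocks; ++b) {
        for (t = 0; t < block_dim; ++t) {         /* a GPU runs these in parallel */
            zip_with_mult_thread(block_dim, b, t, xs, ys, zs, size);
        }
    }

    return num_blocks;
}
```

With 5 elements and 4 threads per block, two blocks are needed, so 8 threads run in total. The last three compute indices 5, 6 and 7, fail the guard, and leave memory beyond the array untouched.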

Dot product four ways

• CUDA (parallel):

- Step 2: vector reduction … is somewhat complex

// nextPow2, isPow2 and checkCudaErrors are helper routines not shown on the slide.

struct SharedMemory
{
    __device__ inline operator float *()
    {
        extern __shared__ int __smem[];
        return (float *)__smem;
    }

    __device__ inline operator const float *() const
    {
        extern __shared__ int __smem[];
        return (float *)__smem;
    }
};

template <unsigned int blockSize, bool nIsPow2>
__global__ void
reduce_kernel(float *g_idata, float *g_odata, unsigned int n)
{
    float *sdata = SharedMemory();

    unsigned int tid      = threadIdx.x;
    unsigned int i        = blockIdx.x*blockSize*2 + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;

    float sum = 0;

    // Each thread first sums several input elements, striding by the grid size.
    while (i < n) {
        sum += g_idata[i];

        if (nIsPow2 || i + blockSize < n)
            sum += g_idata[i+blockSize];

        i += gridSize;
    }

    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step.
    if (blockSize >= 512) {
        if (tid < 256) {
            sdata[tid] = sum = sum + sdata[tid + 256];
        }

        __syncthreads();
    }

    if (blockSize >= 256) {
        if (tid < 128) {
            sdata[tid] = sum = sum + sdata[tid + 128];
        }

        __syncthreads();
    }

    if (blockSize >= 128) {
        if (tid <  64) {
            sdata[tid] = sum = sum + sdata[tid +  64];
        }

        __syncthreads();
    }

    // The last 32 threads continue without __syncthreads(), using a volatile pointer.
    if (tid < 32)
    {
        volatile float *smem = sdata;

        if (blockSize >=  64) {
            smem[tid] = sum = sum + smem[tid + 32];
        }

        if (blockSize >=  32) {
            smem[tid] = sum = sum + smem[tid + 16];
        }

        if (blockSize >=  16) {
            smem[tid] = sum = sum + smem[tid +  8];
        }

        if (blockSize >=   8) {
            smem[tid] = sum = sum + smem[tid +  4];
        }

        if (blockSize >=   4) {
            smem[tid] = sum = sum + smem[tid +  2];
        }

        if (blockSize >=   2) {
            smem[tid] = sum = sum + smem[tid +  1];
        }
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

void getNumBlocksAndThreads(int n, int maxBlocks, int maxThreads, int &blocks, int &threads)
{
    cudaDeviceProp prop;
    int device;
    checkCudaErrors(cudaGetDevice(&device));
    checkCudaErrors(cudaGetDeviceProperties(&prop, device));

    threads = (n < maxThreads*2) ? nextPow2((n + 1)/ 2) : maxThreads;
    blocks  = (n + (threads * 2 - 1)) / (threads * 2);

    if (blocks > prop.maxGridSize[0])
    {
        blocks  /= 2;
        threads *= 2;
    }

    blocks = min(maxBlocks, blocks);
}

float
reduce(int n, float *d_idata, float *d_odata)
{
    int threads    = 0;
    int blocks     = 0;
    int maxThreads = 256;
    int maxBlocks  = 64;
    int size       = n;

    // As on the slide, every pass reads d_idata and writes d_odata; a complete
    // version would feed the partial sums back in as the input of the next pass.
    while (size > 1) {
        getNumBlocksAndThreads(size, maxBlocks, maxThreads, blocks, threads);

        int smemSize = (threads <= 32) ? 2 * threads * sizeof(float) : threads * sizeof(float);
        dim3 dimBlock(threads, 1, 1);
        dim3 dimGrid(blocks, 1, 1);

        if (isPow2(size))
        {
            switch (threads)
            {
                case 512:
                    reduce_kernel<512, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case 256:
                    reduce_kernel<256, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case 128:
                    reduce_kernel<128, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case 64:
                    reduce_kernel< 64, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case 32:
                    reduce_kernel< 32, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case 16:
                    reduce_kernel< 16, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case  8:
                    reduce_kernel<  8, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case  4:
                    reduce_kernel<  4, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case  2:
                    reduce_kernel<  2, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case  1:
                    reduce_kernel<  1, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
            }
        }
        else
        {
            switch (threads)
            {
                case 512:
                    reduce_kernel<512, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case 256:
                    reduce_kernel<256, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case 128:
                    reduce_kernel<128, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case 64:
                    reduce_kernel< 64, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case 32:
                    reduce_kernel< 32, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case 16:
                    reduce_kernel< 16, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case  8:
                    reduce_kernel<  8, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case  4:
                    reduce_kernel<  4, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case  2:
                    reduce_kernel<  2, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
                case  1:
                    reduce_kernel<  1, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
                    break;
            }
        }

        size = (size + (threads*2-1)) / (threads*2);
    }

    // Copy the final result back to the host and return it.
    float sum;
    checkCudaErrors(cudaMemcpy(&sum, d_odata, sizeof(float), cudaMemcpyDeviceToHost));

    return sum;
}

o_O


Dot product four ways

• Accelerate (parallel):

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = fold (+) 0
                  ( zipWith (*) xs ys )

- fold goes neither left nor right: it happens in parallel (tree-like)

• Recall the sequential Haskell version:

dotp :: [Float] -> [Float] -> Float
dotp xs ys = foldl (+) 0
                   ( zipWith (*) xs ys )

- foldl is a left-to-right traversal
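The difference between the two traversal orders can be sketched with plain Haskell lists. This is only a model of the idea, not how Accelerate implements fold; the names pairUp, treeFold and dotpTree are hypothetical. Because the tree regroups the operands, the combining function is assumed to be associative.

```haskell
-- Plain-Haskell model of a tree-like reduction (hypothetical names;
-- this is not Accelerate's implementation of fold).

-- Combine neighbouring elements pairwise, halving the list each round.
-- Every combination within a round is independent of the others, which
-- is what a parallel machine can exploit.
pairUp :: (a -> a -> a) -> [a] -> [a]
pairUp f (x:y:rest) = f x y : pairUp f rest
pairUp _ xs         = xs

-- Repeat the rounds until one value is left. The operator is assumed
-- to be associative, with z as its neutral element.
treeFold :: (a -> a -> a) -> a -> [a] -> a
treeFold _ z []  = z
treeFold _ _ [x] = x
treeFold f z xs  = treeFold f z (pairUp f xs)

-- Dot product written against the tree-shaped reduction.
dotpTree :: [Float] -> [Float] -> Float
dotpTree xs ys = treeFold (+) 0 (zipWith (*) xs ys)
```

With an associative operator both orders agree: treeFold (+) 0 [1..10] and foldl (+) 0 [1..10] both give 55 on Int. Floating-point addition is not exactly associative, so on Float the two orders may round differently.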

But… how does it perform?


Dot product four ways

[Figure: "Dot product" benchmark. Run time (ms, axis ticks 0.1, 1, 10, 100) against elements (millions, 2 to 20) for the sequential, Accelerate, and CUBLAS versions; the plot carries the annotations 30x and 1.2x.]

Tesla T10 (240 cores @ 1.3 GHz) vs. Xeon E5405 (2GHz)


• Mandelbrot fractal
• n-body gravitational simulation
• Canny edge detection
• SmoothLife cellular automata
• stable fluid flow
• Password “recovery” (MD5 dictionary attack)

...
d6b821d937a4170b3c4f8ad93495575d: saitek1
d0e52829bf7962ee0aa90550ffdcccaa: laura1230
494a8204b800c41b2da763f9bbbcc462: lina03
d8ff07c52a95b30800809758f84ce28c: Jenny10
e81bed02faa9892f8360c705241191ae: carmen89
46f7d75718029de99dd81fd907034bc9: mellon22
0dd3c176cf34486ec00b526b6920b782: helena04
9351c4bc8c8ba17b58d5a6a1f839f356: 85548554
9c36c5599f40d08f874559ac824d091a: 585123456
4b4dce6c91b429e8360aa65f97342e90: 5678go
3aa561d4c17d9d58443fc15d10cc86ae: momo55

Recovered 150/1000 (15.00 %) digests in 59.45 s, 185.03 MHash/sec


Accelerate

• Accelerate is a Domain-Specific Language for GPU programming

Haskell/Accelerate program
→ Transform Accelerate program into CUDA program
→ CUDA code
→ Compile with NVIDIA’s compiler & load onto the GPU
→ Copy result back to Haskell


- This process may happen several times during program execution

- Code and data fragments get cached and reused

• An Accelerate program is a Haskell program that generates a CUDA program

- However, in many respects this still looks like a Haskell program

- Shares various concepts with Repa, a Haskell array library for CPUs


Accelerate

• To execute an Accelerate computation (on the GPU):

run :: Arrays a => Acc a -> a

- run comes from whichever backend we have chosen (CUDA)
- Arrays constrains the result to be an Array, or tuple thereof

• What is Acc?

- This is our DSL type
- A data structure (Abstract Syntax Tree) representing a computation that, once executed, will yield a result of type ‘a’

Accelerate is a library of collective operations over arrays of type Acc a
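To make the "data structure representing a computation" idea concrete, here is a toy deep embedding in plain Haskell. It is a sketch under loud assumptions: the constructors Use, ZipWith and Fold are hypothetical, lists stand in for arrays, there is no Arrays constraint, and run is a simple interpreter rather than a GPU backend.

```haskell
{-# LANGUAGE GADTs #-}
-- A toy deep embedding (hypothetical; far simpler than Accelerate's
-- real AST). Building a value of type Acc a performs no computation;
-- it only records what should be computed.

data Acc a where
  -- Embed host data into the computation.
  Use     :: [Float] -> Acc [Float]
  -- Element-wise combination of two arrays.
  ZipWith :: (Float -> Float -> Float)
          -> Acc [Float] -> Acc [Float] -> Acc [Float]
  -- Reduction of an array to a single value.
  Fold    :: (Float -> Float -> Float) -> Float
          -> Acc [Float] -> Acc Float

-- A reference interpreter standing in for a backend. A real backend
-- would instead generate, compile, and launch GPU code from the tree.
run :: Acc a -> a
run (Use xs)        = xs
run (ZipWith f a b) = zipWith f (run a) (run b)
run (Fold f z a)    = foldr f z (run a)

-- The dot product is now a tree: Fold applied to ZipWith applied to inputs.
dotp :: Acc [Float] -> Acc [Float] -> Acc Float
dotp xs ys = Fold (+) 0 (ZipWith (*) xs ys)
```

Here dotp (Use [1,2,3]) (Use [4,5,6]) is just a value describing the work; nothing is computed until run is applied to it, which yields 32.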


Accelerate

• Accelerate computations take place on arrays

- Parallelism is introduced in the form of collective operations over arrays

Arrays in → Accelerate computation → Arrays out

run :: Arrays a => Acc a -> a

• Arrays have two type parameters

data Array sh e

- The shape of the array, or dimensionality
- The element type of the array: Int, Float, etc.


Arrays

• Supported array element types are members of the Elt class:

- ()

- Int, Int32, Int64, Word, Word32, Word64...

- Float, Double

- Char

- Bool

- Tuples up to 9-tuples of these, including nested tuples

• Note that Array itself is not an allowable element type. There are no nested arrays in Accelerate, regular arrays only!

data  Array  sh  e


Accelerate by example: Mandelbrot fractal


Mandelbrot set generator

• Basics

- Pick a window onto the complex plane & divide into pixels

- A point is in the set if the value of $|z|$ does not diverge to infinity under $z_{n+1} = c + z_n^2$

- Each pixel has a value $c$ given by its coordinates in the complex plane

- Colour depends on number of iterations $n$ before divergence

• Each pixel is independent: lots of data parallelism
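The per-pixel rule above can be sketched in plain Haskell before any GPU code is involved. The names next and escapeTime, the iteration limit, and the escape radius of 2 are assumptions made for this illustration; the iteration starts from $z_0 = c$.

```haskell
-- Escape-time iteration for one pixel, in plain Haskell (a sketch; the
-- names next and escapeTime and the iteration limit are assumptions).

type Complex = (Float, Float)   -- (real part, imaginary part)

-- One step of z_{n+1} = c + z_n^2, using (a + bi)^2 = (a^2 - b^2) + 2ab i.
next :: Complex -> Complex -> Complex
next (cr, ci) (zr, zi) = (cr + zr*zr - zi*zi, ci + 2*zr*zi)

-- Number of steps taken before |z| exceeds 2, capped at limit.
-- Once |z| > 2 the sequence is known to diverge, so we compare
-- |z|^2 against 4 and avoid a square root.
escapeTime :: Int -> Complex -> Int
escapeTime limit c = go 0 c
  where
    go n z@(zr, zi)
      | n >= limit        = n
      | zr*zr + zi*zi > 4 = n
      | otherwise         = go (n + 1) (next c z)
```

A point that reaches the limit is treated as inside the set; any smaller count can be mapped to a colour. Since escapeTime only looks at its own c, it can be applied to every pixel independently.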


Mandelbrot set generator

• First, some types:

type Complex      = (Float, Float)
type ComplexPlane = Array DIM2 Complex

- A pair of floating point numbers for the real and imaginary parts

- DIM2 is a type synonym for a two dimensional Shape

data Array sh e


Shapes

• Shapes determine the dimensions of the array and the type of the index

data Z            = Z
data tail :. head = tail :. head

- Z represents a rank-zero array (singleton array with one element)

- (:.) increases the rank by adding a new dimension on the right

• Examples:

- One-dimensional array (Vector) indexed by Int: (Z :. Int)

- Two-dimensional array, indexed by Int: (Z :. Int :. Int)

• This style is used at both the type and value level:

sh :: Z :. Int
sh =  Z :. 10
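A standalone model shows how the two declarations above are enough to compute with shapes. The Shape class below and its methods size and toIndex are written for this sketch and only assumed to resemble what a real library provides; the fixity declaration is likewise an assumption that makes Z :. 600 :. 800 group to the left.

```haskell
{-# LANGUAGE TypeOperators, FlexibleInstances #-}
-- A standalone model of snoc-list shapes (mirrors the slide's two
-- declarations; the class and its methods are hypothetical names).

data Z            = Z
data tail :. head = tail :. head
infixl 3 :.

class Shape sh where
  -- Number of elements in an array of this extent.
  size    :: sh -> Int
  -- Row-major offset of an index: extent first, then index.
  toIndex :: sh -> sh -> Int

instance Shape Z where
  size    Z   = 1               -- rank zero: exactly one element
  toIndex Z Z = 0

instance Shape sh => Shape (sh :. Int) where
  size    (sh :. n)           = size sh * n
  toIndex (sh :. n) (ix :. i) = toIndex sh ix * n + i
```

For an extent of type Z :. Int :. Int with value Z :. 600 :. 800, size gives 480000 and toIndex maps the index Z :. 1 :. 2 to offset 802. The rightmost dimension varies fastest, which is why new dimensions are added on the right.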


Mandelbrot set generator

• The initial complex plane: $z_0 = c$

• generate is a collective operation that yields a value for every index in the array

generate :: (Shape sh, Elt a)
         => Exp sh
         -> (Exp sh -> Exp a)
         -> Acc (Array sh a)

- Supported shape and element types: we will use DIM2 and Complex

- Size of the result array: number of pixels in the image

- A function to apply at every index: generate the values of $c$ at each pixel
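The behaviour described by that signature can be modelled with nested lists. This is a sketch, not the library function: generate2 is a hypothetical name, the extent is a plain (rows, columns) pair, and the result is computed directly instead of being returned as an Acc value.

```haskell
-- A list-based model of generate for two-dimensional arrays
-- (hypothetical; the real operation takes Exp-typed arguments and
-- returns an Acc array instead of computing anything directly).

-- Extents and indices as (row, column) pairs.
type DIM2 = (Int, Int)

-- Apply f at every index of the extent, in row-major order. Each
-- application depends only on its own index, so all of them could be
-- evaluated at the same time.
generate2 :: DIM2 -> (DIM2 -> a) -> [[a]]
generate2 (rows, cols) f =
  [ [ f (y, x) | x <- [0 .. cols - 1] ] | y <- [0 .. rows - 1] ]
```

For example, generate2 (2, 3) (\(y, x) -> 10*y + x) produces [[0,1,2],[10,11,12]]: the first argument fixes the size of the result and the second supplies each element.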

If Acc is our DSL type, what is Exp?


A Stratified Language

• Accelerate is split into two worlds: Acc and Exp

- Acc represents collective operations over instances of Arrays

- Exp is a scalar computation on things of type Elt

• Collective operations in Acc comprise many scalar operations in Exp, executed in parallel over Arrays

- Scalar operations cannot contain collective operations

• This stratification excludes nested data parallelism
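One way to see why the split rules out nesting is a toy two-level embedding. The types below are hypothetical and far smaller than the real Exp and Acc (a single Int element type, one collective operation), but they show the mechanism: the restriction is enforced by the shape of the data types.

```haskell
-- Toy two-level embedding (hypothetical types, much smaller than the
-- real ones). Exp describes scalar code; Acc describes array code.

data Exp
  = Const Int
  | Var                 -- the element currently being processed
  | Add Exp Exp
  | Mul Exp Exp

data Acc
  = Use [Int]           -- embed an array
  | Map Exp Acc         -- apply a scalar expression to every element

-- No Exp constructor has an Acc field, so a scalar expression can never
-- launch a collective operation: nesting is ruled out by construction.

-- Evaluate a scalar expression for one element x.
evalExp :: Int -> Exp -> Int
evalExp _ (Const n) = n
evalExp x Var       = x
evalExp x (Add a b) = evalExp x a + evalExp x b
evalExp x (Mul a b) = evalExp x a * evalExp x b

-- Evaluate a collective operation by running the scalar code per element.
runAcc :: Acc -> [Int]
runAcc (Use xs)   = xs
runAcc (Map e xs) = map (\x -> evalExp x e) (runAcc xs)
```

runAcc (Map (Add (Mul Var Var) (Const 1)) (Use [1,2,3])) evaluates to [2,5,10]. An expression such as "map inside the function passed to map" simply cannot be written with these constructors.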


Collective Operations

• Collective operations comprise many scalar operations applied in parallel

example :: Acc (Vector Int)
example = generate (constant (Z:.10))
                   (\ix -> f ix)

- constant lifts a plain value into Exp land of scalar expressions

constant :: Elt e => e -> Exp e

[Diagram: generate (constant (Z:.10)) f, with f :: Exp DIM1 -> Exp Int, evaluates

f 0   f 1   f 2   f 3   f 4   f 5   f 6   f 7   f 8   f 9

side by side; together these form the result of type Acc (Vector Int).]


Mandelbrot set generator

genPlane :: Exp (Float,Float,Float,Float) -> Acc ComplexPlane
genPlane view =
  generate (constant (Z:.600:.800))
           (\ix -> let Z :. y :. x = unlift ix
                   in  lift ( xmin + (fromIntegral x * viewx) / 800
                            , ymin + (fromIntegral y * viewy) / 600 ))
  where
    (xmin,ymin,xmax,ymax) = unlift view
    viewx                 = xmax - xmin
    viewy                 = ymax - ymin

- unlift has a funky type, but just unpacks “stuff”. lift does the reverse. The two uses above have these types (F abbreviates Float):

unlift :: Exp (Z :. Int :. Int) -> Z :. Exp Int :. Exp Int
unlift :: Exp (F,F,F,F)         -> (Exp F, Exp F, Exp F, Exp F)

- Even though operations are in Exp, we can still use standard Haskell operations like (*) and (-)

Friday, 17 May 13

Page 74: GPGPU Programming in Haskell with Accelerate · 2019-09-23 · What is GPGPU Programming? • Main differences: - Single program multiple data (SPMD / SIMD), or just data-parallelism

- unlift has a funky type, but just unpacks “stuff”. lift does the reverse.

- Even though operations are in Exp, can still use standard Haskell operations like (*) and (-­‐)

Mandelbrot set generator

genPlane  ::  Exp  (Float,Float,Float,Float)  -­‐>  Acc  ComplexPlanegenPlane  view  =    generate  (constant  (Z:.600:.800))                      (\ix  -­‐>  let  Z  :.  y  :.  x  =  unlift  ix                                      in  lift  (  xmin  +  (fromIntegral  x  *  viewx)  /  800                                                      ,  ymin  +  (fromIntegral  y  *  viewy)  /  600  ))    where        (xmin,ymin,xmax,ymax)  =  unlift  view        viewx                                  =  xmax  -­‐  xmin        viewy                                  =  ymax  -­‐  ymin
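The coordinate arithmetic inside genPlane can be checked on the CPU with a plain-Haskell sketch. This is not Accelerate code: `pixelToPlane` and the `View` synonym are hypothetical names introduced here, and only the 800x600 image size and the formula itself come from the slide.

```haskell
-- Plain-Haskell model of the per-pixel arithmetic in genPlane (no Accelerate).
-- View and pixelToPlane are illustrative names, not part of the talk.
type View = (Float, Float, Float, Float)   -- (xmin, ymin, xmax, ymax)

-- Map pixel (row y, column x) of an 800x600 image to a point in the plane.
pixelToPlane :: View -> (Int, Int) -> (Float, Float)
pixelToPlane (xmin, ymin, xmax, ymax) (y, x) =
  ( xmin + (fromIntegral x * viewx) / 800
  , ymin + (fromIntegral y * viewy) / 600 )
  where
    viewx = xmax - xmin
    viewy = ymax - ymin
```

With the view (-2,-1,1,1), pixel (0,0) maps to the corner (-2,-1) and pixel (300,400) maps to (-0.5,0).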

Mandelbrot set generator

• Let’s define the function we will iterate

$z_{n+1} = c + z_n^2$

- Use lift and unlift as before to unpack the tuples

- Note that we had to add some type signatures to unlift

next :: Exp Complex -> Exp Complex -> Exp Complex
next c z = plus c (times z z)

plus :: Exp Complex -> Exp Complex -> Exp Complex
plus a b = lift (ax+bx, ay+by)
  where
    (ax,ay) = unlift a :: (Exp Float, Exp Float)
    (bx,by) = unlift b :: (Exp Float, Exp Float)

lift :: (Exp Float, Exp Float) -> Exp (Float, Float)
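As a CPU-side reference for next, the same function can be written over ordinary Haskell pairs. This is a sketch, not the Accelerate version: the slide does not show times, so the definition below assumes ordinary complex multiplication on (real, imaginary) pairs.

```haskell
-- Plain-Haskell reference for the iterated function (no Exp, no lift/unlift).
type Complex = (Float, Float)   -- (real, imaginary)

plus :: Complex -> Complex -> Complex
plus (ax, ay) (bx, by) = (ax + bx, ay + by)

-- Assumed definition: standard complex multiplication.
times :: Complex -> Complex -> Complex
times (ax, ay) (bx, by) = (ax * bx - ay * by, ax * by + ay * bx)

-- One application of z -> c + z^2.
next :: Complex -> Complex -> Complex
next c z = plus c (times z z)
```

For example, `next (0,1) (0,1)` squares i to get -1 and adds i, giving (-1,1).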

Mandelbrot set generator

• Complication: GPUs must do the same thing to lots of different data

- Keep a pair (z,i) for each pixel, where i is the iteration at which the point diverged

- fst and snd extract the individual tuple components

- dot is the squared magnitude of a complex number

- Conditionals can lead to SIMD divergence, so use them sparingly

$z_{n+1} = c + z_n^2$

(?) :: Elt t => Exp Bool -> (Exp t, Exp t) -> Exp t

step :: Exp Complex -> Exp (Complex, Int) -> Exp (Complex, Int)
step c zi = f (fst zi) (snd zi)
  where
    f z i = let z' = next c z
            in  dot z' >* 4 ? ( zi, lift (z', i+1) )
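The behaviour of step (freeze the pair once the point has diverged, otherwise advance and count) can be modelled in plain Haskell. This is a sketch under stated assumptions: next and dot are restated over ordinary pairs so the block is self-contained, and `escape` is a hypothetical helper that is not in the talk.

```haskell
-- Plain-Haskell model of step: once |z'|^2 > 4 the pair is left unchanged.
type Complex = (Float, Float)

-- Squared magnitude of a complex number.
dot :: Complex -> Float
dot (x, y) = x * x + y * y

-- z -> c + z^2, written out on the components.
next :: Complex -> Complex -> Complex
next (cx, cy) (x, y) = (cx + x * x - y * y, cy + 2 * x * y)

step :: Complex -> (Complex, Int) -> (Complex, Int)
step c zi@(z, i) =
  let z' = next c z
  in  if dot z' > 4 then zi else (z', i + 1)

-- Hypothetical helper: apply step n times from z = 0, i = 0.
escape :: Int -> Complex -> (Complex, Int)
escape n c = iterate (step c) ((0, 0), 0) !! n
```

With c = (1,0) the orbit is 0, 1, 2, 5, so two steps succeed before the third would diverge and `escape 255 (1,0)` stays at ((2,0),2). Every pixel performs the same 255 applications regardless of when it diverges, which is the uniform behaviour the slide asks for.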

Mandelbrot set generator

• Accelerate is a meta programming language

- So, just use regular Haskell to unroll the loop a fixed number of times

- zipWith applies its function step pairwise to elements of the two arrays

- Replicate the transition step from $z_n$ to $z_{n+1}$

- Apply the steps in sequence, beginning with the initial data zs0

$z_{n+1} = c + z_n^2$

mandel :: Exp (Float,Float,Float,Float) -> Acc (Array DIM2 (Complex, Int))
mandel view =
  Prelude.foldl1 (.) (Prelude.replicate 255 go) zs0
  where
    cs    = genPlane view
    zs0   = zip cs (fill (shape cs) 0)
    go zs = zipWith step cs zs

f (g x) ≣ (f . g) x
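The unrolling idiom used by mandel, `foldl1 (.) (replicate n f)`, is ordinary Haskell and can be tried on its own. In this minimal sketch `unroll` is a hypothetical name; note that foldl1 fails on an empty list, so n must be at least 1.

```haskell
-- Compose n copies of f into a single function, as mandel does with go.
-- Requires n >= 1, because foldl1 is undefined on an empty list.
unroll :: Int -> (a -> a) -> (a -> a)
unroll n f = foldl1 (.) (replicate n f)
```

For example, `unroll 3 (*2) 1` is `((*2) . (*2) . (*2)) 1`, which evaluates to 8. In mandel the same composition builds one large Accelerate expression containing 255 zipWith stages.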

In the workshop…

• More example code, computational kernels

- Common pitfalls

- Tips for good performance

- Figuring out what went wrong (or knowing who to blame: me)

Questions?

https://github.com/AccelerateHS/

http://xkcd.com/365/

Extra Slides…

Seriously?

Arrays

data Array sh e

• Create an array from a list:

fromList :: (Shape sh, Elt e) => sh -> [e] -> Array sh e

- Generates a multidimensional array by consuming elements from the list and adding them to the array in row-major order

• Example:

> fromList (Z:.10) [1..10] :: Vector Float
Array (Z :. 10) [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

• Multidimensional arrays are similar:

- Elements are filled along the right-most dimension first

> fromList (Z:.3:.5) [1..] :: Array DIM2 Int
Array (Z :. 3 :. 5) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15

Arrays

• Array indices start counting from zero

data  Array  sh  e

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> indexArray mat (Z:.2:.1)
12

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
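Row-major order with zero-based indices means that element (r, c) of a matrix with a given number of columns sits at flat position r * cols + c. A minimal plain-Haskell sketch, using a list in place of an Accelerate array; `indexRowMajor` is a hypothetical name:

```haskell
-- Look up (row, column) in a flat, row-major list standing in for an
-- Array DIM2. Indices are zero-based, as in the slide.
indexRowMajor :: (Int, Int) -> [a] -> (Int, Int) -> a
indexRowMajor (_rows, cols) xs (r, c) = xs !! (r * cols + c)
```

`indexRowMajor (3,5) [1..15] (2,1)` reads flat position 2 * 5 + 1 = 11, which holds 12, matching the indexArray result above.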

Arrays

• Similarly, an array of (nested) tuples:

- This is just a trick: internally converted into a tuple of arrays

data  Array  sh  e

> fromList (Z:.2:.3) $ P.zip [1..] ['a'..] :: Array DIM2 (Int,Char)
Array (Z :. 2 :. 3) [(1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e'),(6,'f')]

( 1 2 3     a b c )
( 4 5 6  ,  d e f )

Data.Array.Accelerate

• Need to import both the base library and a backend

- There is also an interpreter available for testing

- It runs without using the GPU (much more slowly, of course)

import Prelude                    as P
import Data.Array.Accelerate      as A
import Data.Array.Accelerate.CUDA as CUDA

Data.Array.Accelerate

• To get arrays into the Acc world:

use :: Arrays a => a -> Acc a

- This may involve copying data to the GPU

• use injects arrays into our DSL

• run executes the computation to get arrays out

• Using Accelerate focuses on everything in between

Collective Operations

• Example: add one to each element of an array

> let arr = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ A.map (+1) (use arr)
Array (Z :. 3 :. 5) [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]

• What is the type of map?

map :: (Shape sh, Elt a, Elt b)     -- supported shape & element types
    => (Exp a -> Exp b)             -- function to apply at every element. But what is Exp?
    -> Acc (Array sh a)             -- DSL array
    -> Acc (Array sh b)

Scalar Expressions

• The type class overloading trick is used for standard Haskell classes

(+1) :: (Elt a, IsNum a) => Exp a -> Exp a

• Standard boolean operations are available with slightly different names

- The standard names cannot be overloaded

(==*) :: (Elt t, IsScalar t) => Exp t -> Exp t -> Exp Bool
(/=*), (<*), (>*), min, max, (||*), (&&*)      -- and so on...

• Conditionals

- Use sparingly: leads to SIMD divergence

(?) :: Elt t => Exp Bool -> (Exp t, Exp t) -> Exp t

Scalar Expressions

• Bring a Haskell value into Exp land

constant :: Elt e => e -> Exp e

• Lift an expression into a singleton array

unit :: Exp e -> Acc (Scalar e)

• Extract the element from a singleton array

the :: Acc (Scalar e) -> Exp e

Reductions

• Folding (+) over a vector produces a sum

> let xs = fromList (Z:.10) [1..] :: Vector Int
> run $ A.fold (+) 0 (use xs)
Array (Z) [55]

- The result is a one-element array (scalar). Why?

• Fold has an interesting type:

fold :: (Shape sh, Elt a)
     => (Exp a -> Exp a -> Exp a)
     -> Exp a
     -> Acc (Array (sh:.Int) a)     -- input array
     -> Acc (Array sh        a)     -- outer dimension removed

Reductions

• Fold occurs over the outer (right-most) dimension of the array, so each row is reduced to a single value

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ A.fold (+) 0 (use mat)
Array (Z :. 3) [15,40,65]

 1  2  3  4  5   ->  15
 6  7  8  9 10   ->  40
11 12 13 14 15   ->  65

Reductions

• Is this a left-fold or a right-fold?

- Neither! The fold happens in parallel, tree-like

- Therefore the function must be associative: (Exp a -> Exp a -> Exp a)

- (We pretend that floating point operations are associative, though strictly speaking they are not)
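Why associativity matters can be seen with a small plain-Haskell model of a tree-shaped reduction. This is only a sketch of the idea, not how Accelerate schedules its fold on the GPU; `treeFold` is a hypothetical name.

```haskell
-- Pairwise (tree-shaped) reduction of a non-empty list: combine neighbouring
-- elements, then recurse on the shorter list until one value remains.
treeFold :: (a -> a -> a) -> [a] -> a
treeFold _ []  = error "treeFold: empty list"
treeFold _ [x] = x
treeFold f xs  = treeFold f (pairs xs)
  where
    pairs (a:b:rest) = f a b : pairs rest
    pairs rest       = rest
```

For an associative operator the grouping does not matter: `treeFold (+) [1..10]` gives 55, the same as a left fold. For a non-associative operator it does: `treeFold (-) [1,2,3,4]` computes (1-2)-(3-4) = 0, whereas `foldl1 (-) [1,2,3,4]` gives -8.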

Stencils

• A stencil is a map with access to the neighbourhood around each element

- Useful in many scientific & image processing algorithms

- Boundary conditions specify how to handle out-of-bounds neighbours

laplace :: Stencil3x3 Int -> Exp Int
laplace ((_,t,_)
        ,(l,c,r)
        ,(_,b,_)) = t + b + l + r - 4*c

  t
l c r
  b

> let mat = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
> run $ stencil laplace (Constant 0) (use mat)
Array (Z :. 3 :. 5) [4,3,2,1,-6,-5,0,0,0,-11,-26,-17,-18,-19,-36]
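The effect of the laplace stencil with a constant-zero boundary can be reproduced with nested lists in plain Haskell. This sketch only models the semantics; `at` and `laplaceList` are hypothetical names and no Accelerate API is used.

```haskell
-- Constant-boundary lookup: out-of-bounds neighbours read as the default.
at :: Int -> [[Int]] -> (Int, Int) -> Int
at def m (r, c)
  | r < 0 || r >= length m        = def
  | c < 0 || c >= length (m !! r) = def
  | otherwise                     = m !! r !! c

-- Apply t + b + l + r - 4*c at every position, with boundary value 0.
laplaceList :: [[Int]] -> [[Int]]
laplaceList m =
  [ [ get (r - 1, c) + get (r + 1, c) + get (r, c - 1) + get (r, c + 1)
        - 4 * get (r, c)
    | c <- [0 .. cols - 1] ]
  | r <- [0 .. rows - 1] ]
  where
    rows = length m
    cols = case m of { [] -> 0; (row:_) -> length row }
    get  = at 0 m
```

Applied to the rows [1..5], [6..10] and [11..15], this gives [[4,3,2,1,-6],[-5,0,0,0,-11],[-26,-17,-18,-19,-36]], the same values as the stencil result on the slide.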

Friday, 17 May 13

Page 121: GPGPU Programming in Haskell with Accelerate · 2019-09-23 · What is GPGPU Programming? • Main differences: - Single program multiple data (SPMD / SIMD), or just data-parallelism

Stencils

• A stencil is a map with access to the neighbourhood around each element

- Useful in many scientific & image processing algorithms

Friday, 17 May 13

Page 122: GPGPU Programming in Haskell with Accelerate · 2019-09-23 · What is GPGPU Programming? • Main differences: - Single program multiple data (SPMD / SIMD), or just data-parallelism

Index Transforms

• Index transforms change the order of elements, not their values

- We usually want to push such operations into the consumer

• backpermute specifies which element to read from a source array

backpermute  ::  (Shape  ix,  Shape  ix',  Elt  a)                        =>  Exp  ix'                        -­‐>  (Exp  ix'  -­‐>  Exp  ix)                        -­‐>  Acc  (Array  ix    a)                        -­‐>  Acc  (Array  ix'  a)

Arguments of backpermute:

- Exp ix': shape of result
- (Exp ix' -> Exp ix): index mapping from destination array to source
- Acc (Array ix a): source data
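
One way to see why an index transform can be pushed into its consumer is to model an array as a shape plus an indexing function. The sketch below is a plain-Haskell illustration under that assumption. The Delayed type and the names backpermuteD, mapD, fromListD and toListD are hypothetical and are not Accelerate's internals.

```haskell
-- Reference model only: a "delayed" array is a shape plus a function from
-- index to value. These names are illustrative, not Accelerate's API.

data Delayed ix a = Delayed ix (ix -> a)

-- backpermute: takes the shape of the result and a mapping from each
-- destination index to the source index to read. No elements are copied;
-- the mapping is just composed with the indexing function.
backpermuteD :: ix' -> (ix' -> ix) -> Delayed ix a -> Delayed ix' a
backpermuteD sh' p (Delayed _ get) = Delayed sh' (get . p)

-- A consumer: map also composes, so map after backpermute is one pass.
mapD :: (a -> b) -> Delayed ix a -> Delayed ix b
mapD f (Delayed sh get) = Delayed sh (f . get)

-- Wrap a list as a one-dimensional delayed array (the shape is its length).
fromListD :: [a] -> Delayed Int a
fromListD xs = Delayed (length xs) (xs !!)

-- Force a one-dimensional delayed array back into a list.
toListD :: Delayed Int a -> [a]
toListD (Delayed n get) = map get [0 .. n - 1]

-- Reverse a vector with backpermute, then double every element.
reverseThenDouble :: [Int] -> [Int]
reverseThenDouble xs =
  let n   = length xs
      arr = fromListD xs
  in  toListD (mapD (* 2) (backpermuteD n (\i -> n - 1 - i) arr))
-- reverseThenDouble [1,2,3,4] == [8,6,4,2]
```

In this model the reversed vector is never materialised: the only traversal happens in toListD, which reads each source element once through the composed function.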

transpose mat =
  let swap = lift1 $ \(Z:.j:.i) -> Z:.i:.j :: Z :. Exp Int :. Exp Int
  in  backpermute (swap $ shape mat) swap mat

Input (3 x 5):

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15

Result (5 x 3):

 1  6 11
 2  7 12
 3  8 13
 4  9 14
 5 10 15
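
The matrix example can be traced by hand with a small plain-Haskell model: each destination index of the 5 x 3 result is swapped to find the element to read from the 3 x 5 source. This is only a sketch of the index arithmetic, assuming row-major storage. Shape2, swap2, toLinear and transposeRef are hypothetical names, not Accelerate functions.

```haskell
-- Illustrative reference model on plain lists; NOT Accelerate code.
-- A matrix is a (rows, cols) shape plus its elements in row-major order.

type Shape2 = (Int, Int)

-- Swap the two components of a 2D shape or index, like swap in transpose.
swap2 :: Shape2 -> Shape2
swap2 (j, i) = (i, j)

-- Row-major position of an index within an array of the given shape.
toLinear :: Shape2 -> Shape2 -> Int
toLinear (_, cols) (r, c) = r * cols + c

-- For every destination index, read the source element at the swapped index.
transposeRef :: Shape2 -> [a] -> (Shape2, [a])
transposeRef sh xs =
  let sh'@(rows', cols') = swap2 sh
      out = [ xs !! toLinear sh (swap2 (r, c))
            | r <- [0 .. rows' - 1], c <- [0 .. cols' - 1] ]
  in  (sh', out)
-- transposeRef (3,5) [1..15]
--   == ((5,3), [1,6,11,2,7,12,3,8,13,4,9,14,5,10,15])
```

Note that the same swap is used twice, once on the shape and once on each index, which mirrors how transpose passes swap to backpermute for both arguments.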
