GPUs and CUDA

Stewart Gleadow, September 2nd 2009

Transcript of "GPUs and CUDA"

Page 1: GPUs and CUDA

GPUs and CUDA

Page 2: CPU Architecture

CPU Architecture

-  good for serial programs

-  do many different things well

-  many transistors devoted to purposes other than ALUs (e.g. flow control and caching)

-  memory access is slow (around 1 GB/s)

-  switching threads is slow

Image from Alex Moore, “Introduction to Programming in CUDA”, http://astro.pas.rochester.edu/~aquillen/gpuworkshop.html

Page 3: GPU Architecture

GPU Architecture

-  many processors perform similar operations on a large data set in parallel (single-instruction, multiple-data parallelism)

-  recent GPUs have around 30 multiprocessors, each containing 8 stream processors

-  GPUs devote most (around 80%) of their transistors to ALUs

-  fast memory (around 80 GB/s bandwidth)

[Diagram: relative die area devoted to ALUs, control and cache on a CPU versus a GPU]

Page 4: Thread Hierarchy

Thread Hierarchy

-  a block of threads runs on a single multiprocessor (its threads execute on that multiprocessor's stream processors)

-  a grid of blocks makes up the entire computation

-  threads within a block can access the same shared memory

-  there are typically many more threads than processors

Image from Johan Seland, “CUDA Programming”, http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf
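The hierarchy above can be sketched in a minimal kernel (a hypothetical `scale` kernel, not from the slides): each thread combines its block and thread coordinates into a unique global index, and a guard handles the case of more threads than elements.

```cuda
#include <cuda_runtime.h>

// Hypothetical example: each thread derives a unique global index
// from its block index, block size and thread index.
__global__ void scale(float* data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                 // guard: grid may contain more threads than elements
        data[idx] *= factor;
}

// Launch with enough blocks to cover all n elements, e.g.:
//   int threadsPerBlock = 256;
//   int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
//   scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
```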

Page 5: Memory Hierarchy

Memory Hierarchy

Image from Johan Seland, “CUDA Programming”, http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf
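A sketch of how the memory spaces in the hierarchy are used in practice (a hypothetical per-block sum, assuming 256 threads per block): registers are private to a thread, `__shared__` memory is visible to a block, and global memory is visible to the whole grid.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel illustrating the memory spaces:
// registers (per thread), shared memory (per block), global memory (per grid).
__global__ void block_sum(const float* in, float* out, int n)
{
    __shared__ float partial[256];              // shared: visible to the whole block

    int tid = threadIdx.x;                      // register: private to this thread
    int idx = blockIdx.x * blockDim.x + tid;

    partial[tid] = (idx < n) ? in[idx] : 0.0f;  // load from global memory
    __syncthreads();                            // make writes visible block-wide

    // Tree reduction within the block, entirely in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = partial[0];           // write result back to global memory
}
```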

Page 6: CUDA

CUDA

-  a set of C extensions for running programs on a GPU

-  Windows, Linux, Mac… Nvidia cards only

-  single instruction, multiple data

-  gives you direct access to the memory architecture

-  no need to know graphics APIs

-  runtime API and libraries: cudaMalloc, cudaFree, cudaMemcpy, cuBLAS, cuFFT
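A minimal sketch of the typical host-side pattern using the runtime API calls named above: allocate device memory, copy input in, launch kernels, copy results out, free.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Sketch of the standard allocate / copy / compute / copy / free pattern.
int main()
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float* h_data = (float*)malloc(bytes);    // host (CPU) memory
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float* d_data = NULL;                     // device (GPU) memory
    cudaMalloc((void**)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // ... launch kernels operating on d_data here ...

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```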

Page 7: Elementwise Matrix Addition

Elementwise Matrix Addition

CPU program:

void add_matrix(float* a, float* b, float* c, int N)
{
    int index;
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            index = i + j * N;
            c[index] = a[index] + b[index];
        }
    }
}

int main()
{
    add_matrix(a, b, c, N);
}

Page 8: Elementwise Matrix Addition (CUDA)

Elementwise Matrix Addition (CUDA)

CUDA program:

__global__ void add_matrix(float* a, float* b, float* c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

int main()
{
    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(N / dimBlock.x, N / dimBlock.y);
    add_matrix<<<dimGrid, dimBlock>>>(a, b, c, N);
}
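The slide's main omits the device memory management from Page 6. A sketch of the complete host code under assumed values for N and blocksize (chosen here so N is divisible by blocksize, matching the grid computation on the slide):

```cuda
#include <cuda_runtime.h>

__global__ void add_matrix(float* a, float* b, float* c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

int main()
{
    const int N = 512, blocksize = 16;   // assumed values; N divisible by blocksize
    size_t bytes = N * N * sizeof(float);

    float *a, *b, *c;                    // device pointers
    cudaMalloc((void**)&a, bytes);
    cudaMalloc((void**)&b, bytes);
    cudaMalloc((void**)&c, bytes);
    // ... cudaMemcpy host input matrices into a and b here ...

    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(N / dimBlock.x, N / dimBlock.y);
    add_matrix<<<dimGrid, dimBlock>>>(a, b, c, N);

    // ... cudaMemcpy c back to the host, then cudaFree a, b and c ...
    return 0;
}
```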

Page 9: Results

Results

[Bar chart: GPU speedup (scale 0 to 70) by module, for RotateAndAccumulateVisibilities, MeasureIonosphericOffset, MeasureTileResponse, ReRotateVisibilities, PeelTileResponse, UnpeelTileResponse, Gridding *, and Imaging]

Image from Kevin Dale, “A Graphics Hardware-Accelerated Real-Time Processing Pipeline for Radio Astronomy”, Presented at AstroGPU, Nov 2007.

(for tasks relevant to the MWA Real Time System)