![Page 1: Gpu Cuda](https://reader035.fdocuments.in/reader035/viewer/2022081805/55388c504a79593a698b47e5/html5/thumbnails/1.jpg)
GPUs and CUDA
![Page 2: Gpu Cuda](https://reader035.fdocuments.in/reader035/viewer/2022081805/55388c504a79593a698b47e5/html5/thumbnails/2.jpg)
CPU Architecture
- good for serial programs
- does many different things well
- many transistors devoted to purposes other than ALUs (e.g. flow control and caching)
- memory access is slow (~1 GB/s)
- switching threads is slow
Image from Alex Moore, “Introduction to Programming in CUDA”, http://astro.pas.rochester.edu/~aquillen/gpuworkshop.html
![Page 3: Gpu Cuda](https://reader035.fdocuments.in/reader035/viewer/2022081805/55388c504a79593a698b47e5/html5/thumbnails/3.jpg)
GPU Architecture
- many processors perform similar operations on a large data set in parallel (single-instruction, multiple-data parallelism)
- recent GPUs have around 30 multiprocessors, each containing 8 stream processors
- GPUs devote most (~80%) of their transistors to ALUs
- fast memory (~80 GB/s)

(Diagram: transistor budget divided among ALUs, control logic, and cache)
![Page 4: Gpu Cuda](https://reader035.fdocuments.in/reader035/viewer/2022081805/55388c504a79593a698b47e5/html5/thumbnails/4.jpg)
Thread Hierarchy
- a block of threads runs on a single multiprocessor
- a grid of blocks makes up the entire set of threads
- each thread in a block can access the same shared memory
- many more threads than processors
Image from Johan Seland, “CUDA Programming”, http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf
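The index arithmetic implied by this hierarchy can be sketched as follows (a minimal illustrative kernel, not from the slides; `blockIdx`, `blockDim`, and `threadIdx` are CUDA built-ins):

```cuda
// Each thread derives a unique global index from its block and
// thread coordinates within the grid.
__global__ void fill(float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)            // having more threads than elements is fine
        out[idx] = (float)idx;
}

// Launch enough blocks of 256 threads to cover n elements:
// fill<<<(n + 255) / 256, 256>>>(d_out, n);
```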
![Page 5: Gpu Cuda](https://reader035.fdocuments.in/reader035/viewer/2022081805/55388c504a79593a698b47e5/html5/thumbnails/5.jpg)
Memory Hierarchy
Image from Johan Seland, “CUDA Programming”, http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf
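The block-level shared memory in this hierarchy can be sketched with a toy kernel (an assumed example, not from the slides) in which threads of one block stage data in fast on-chip `__shared__` memory and synchronize before reusing it:

```cuda
// Reverse up to 256 elements using one block.
__global__ void reverse_block(float* d, int n)
{
    __shared__ float s[256];      // visible to every thread in this block
    int t = threadIdx.x;
    if (t < n) s[t] = d[t];       // each thread loads one element
    __syncthreads();              // all threads reach this barrier together
    if (t < n) d[t] = s[n - 1 - t];  // read a neighbour's element back out
}
```

The barrier matters: without `__syncthreads()`, a thread could read `s[n - 1 - t]` before the thread responsible for that slot has written it.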
![Page 6: Gpu Cuda](https://reader035.fdocuments.in/reader035/viewer/2022081805/55388c504a79593a698b47e5/html5/thumbnails/6.jpg)
CUDA
- a set of C extensions for running programs on a GPU
- Windows, Linux, and Mac, but NVIDIA cards only
- single instruction, multiple data
- gives you direct access to the memory architecture
- no need to know graphics APIs
- built-in runtime calls and libraries: cudaMalloc, cudaFree, cudaMemcpy, cuBLAS, cuFFT
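A typical host-side workflow with the runtime calls listed above might look like this (a sketch, not from the slides; buffer names and sizes are illustrative):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main()
{
    const int N = 1024;
    size_t bytes = N * sizeof(float);

    float* h_a = (float*)malloc(bytes);   // host buffer
    float* d_a = NULL;                    // device buffer

    cudaMalloc((void**)&d_a, bytes);                      // allocate on the GPU
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // host -> device

    /* ... launch kernels that operate on d_a here ... */

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_a);
    free(h_a);
    return 0;
}
```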
![Page 7: Gpu Cuda](https://reader035.fdocuments.in/reader035/viewer/2022081805/55388c504a79593a698b47e5/html5/thumbnails/7.jpg)
Elementwise Matrix Addition

CPU program:

```c
/* a, b, c, and N are assumed declared elsewhere, as on the slide. */
void add_matrix(float* a, float* b, float* c, int N)
{
    int index;
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            index = i + j * N;
            c[index] = a[index] + b[index];
        }
    }
}

int main()
{
    add_matrix(a, b, c, N);
}
```
![Page 8: Gpu Cuda](https://reader035.fdocuments.in/reader035/viewer/2022081805/55388c504a79593a698b47e5/html5/thumbnails/8.jpg)
Elementwise Matrix Addition

CUDA program:

```cuda
/* a, b, c, N, and blocksize are assumed declared elsewhere, as on the slide.
   Note the kernel needs a void return type, missing on the original slide. */
__global__ void add_matrix(float* a, float* b, float* c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

int main()
{
    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(N / dimBlock.x, N / dimBlock.y);
    add_matrix<<<dimGrid, dimBlock>>>(a, b, c, N);
}
```
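One caveat worth noting: the integer division `N / dimBlock.x` in the launch configuration truncates, so if N is not a multiple of blocksize the last partial row and column of the matrix never get a thread. A common fix (a sketch, not from the slides) is ceiling division:

```cuda
// Round the grid up so every element is covered; the kernel's
// bounds check (i < N && j < N) discards the surplus threads.
dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
             (N + dimBlock.y - 1) / dimBlock.y);
```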
![Page 9: Gpu Cuda](https://reader035.fdocuments.in/reader035/viewer/2022081805/55388c504a79593a698b47e5/html5/thumbnails/9.jpg)
Results

(Bar chart: speedup factors, on a scale of roughly 0 to 70x, for the pipeline modules RotateAndAccumulateVisibilities, MeasureIonosphericOffset, MeasureTileResponse, ReRotateVisibilities, PeelTileResponse, UnpeelTileResponse, Gridding*, and Imaging, for tasks relevant to the MWA Real Time System)

Image from Kevin Dale, “A Graphics Hardware-Accelerated Real-Time Processing Pipeline for Radio Astronomy”, presented at AstroGPU, Nov 2007.