CUDA Programming – Basavaraj Talawar
Outline
● Basic CUDA program
CUDA C - Terminology
● Host – the CPU and its memory (host memory)
● Device – the GPU and its memory (device memory)
Compute Unified Device Architecture
● Hybrid CPU/GPU code
● Low-latency code runs on the CPU
  – Result is immediately available
● High-latency, high-throughput code runs on the GPU
  – Result comes back over the bus
  – The GPU has many more cores than the CPU
Hello, World!
● The standard C program runs on the host
● NVIDIA’s compiler (nvcc) will not complain about CUDA programs with no device code
● At its simplest, CUDA C is just C!

```c
#include <stdio.h>

int main( void ) {
    printf( "Hello, World!\n" );
    return 0;
}
```
Hello, World! with Device Code
```c
#include <stdio.h>

__global__ void kernel( void ) { }

int main( void ) {
    kernel<<<1,1>>>();
    printf( "Hello, World!\n" );
    return 0;
}
```

The function kernel() runs on the device and is called from the host.
● nvcc splits the source file into host and device components
  – NVIDIA’s compiler handles device functions, e.g. kernel()
  – The standard host compiler handles host functions like main()
● Triple angle brackets mark a call from host code to device code - a kernel launch
● The kernel executes on the GPU
CUDA/OpenCL – Execution Model
● Heterogeneous host+device application C program
  – Serial parts in host C code
  – Parallel parts in device SPMD kernel C code

[Figure: execution alternates between CPU serial code on the host and parallel kernels on the device, launched as KernelA<<< nBlk, nTid >>>(args);]
A von Neumann Processor
● A thread is a “virtualized” or “abstracted” von Neumann processor
CUDA Program Example
Vector Addition Example
```c
for (i = 0; i < N; i++) {
    C[i] = A[i] + B[i];
}
```
[Figure: element-wise addition – each pair A[i], B[i] is summed into C[i].]
● Each thread adds one element of A[] and one of B[] and updates one element of C[]
  – Data-level parallelism!
Parallel Threads
[Figure: 1024-element arrays A, B, and C; thread i reads A[i] and B[i] and writes C[i].]

```c
i = threadIdx.x;
C[i] = A[i] + B[i];
```
Arrays of Parallel Threads
[Figure: 1024-element arrays A, B, and C split into four blocks of 256 threads. Block 0 covers elements 0–255, Block 1 covers 256–511, Block 2 covers 512–767, Block 3 covers 768–1023; block dimension = 256 in the x direction.]
The kernel is launched with the number of blocks and the number of threads per block:

```c
Kernel_Call<<<No_of_Blocks, ThreadsPerBlock>>>(params);
// Here: Kernel_Call<<<4, 256>>>(params);
```
Each thread computes its global index from its block and thread indices:

```c
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
```
● A CUDA kernel is executed by a grid (array) of threads
  – All threads in a grid run the same kernel code (SPMD)
  – Each thread uses its indices to compute memory addresses and make control decisions
Thread Blocks: Scalable Cooperation
● The thread array is divided into multiple blocks
● Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
● Threads in different blocks do not interact
[Figure: the 1D thread array partitioned into Thread Block 0, Thread Block 1, …, Thread Block N−1, each with threads 0–255.]
blockIdx and threadIdx
[Figure: thread and block indexing. Threads in a 1D block are indexed by threadIdx.x; in a 2D block by (threadIdx.x, threadIdx.y); in a 3D block by (threadIdx.x, threadIdx.y, threadIdx.z). Blocks are indexed within the grid by (blockIdx.x, blockIdx.y) – Block (0,0), Block (1,0), Block (0,1), Block (1,1) – and the grid executes on the device.]
blockIdx and threadIdx
● Each thread uses indices to decide what data to work on
● blockIdx: 1D, 2D, or 3D (since CUDA 4.0)
● threadIdx: 1D, 2D, or 3D
● Multidimensional indices simplify memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
CUDA C – Vector Addition Kernel
Vector Addition – C Code
```c
// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int i;
    for (i = 0; i < n; i++)
        h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements
    ...
    vecAdd(h_A, h_B, h_C, N);
}
```
vecAdd - CUDA Version
[Figure: host memory and CPU on one side, device memory and GPU on the other; arrays A[], B[], and C[] appear in both memories.]

Step 1. Allocate space for the data in GPU memory
Step 2. Copy the data to GPU memory
Step 3. Launch the kernel
Step 4. Copy the result data back to host main memory
Step 5. Free device memory
vecAdd - CUDA Host Code
```c
#include <cuda.h>

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // 1. Allocate device memory for A, B, and C;
    //    copy A and B to device memory
    // 2. Kernel launch code – the device performs the
    //    actual vector addition
    // 3. Copy C from device memory; free device vectors
}
```
CUDA Memories – Quick Overview
● Device code can:
  – R/W per-thread registers
  – R/W all-shared global memory
● Host code can:
  – Transfer data to/from per-grid global memory
CUDA Device Memory Management API
cudaMalloc()
– Allocates an object in device global memory
– Two parameters:
  • Address of a pointer to the allocated object
  • Size of the allocated object in bytes

```c
cudaMalloc((void **) &d_A, size);
```
cudaFree()
– Frees an object from device global memory
– One parameter: pointer to the freed object

```c
cudaFree(d_A);
```
Host-Device Data Transfer API
cudaMemcpy()
– Memory data transfer
– Four parameters:
  • Pointer to destination, pointer to source, number of bytes copied, type/direction of transfer
– For most use cases cudaMemcpy() is synchronous with respect to the host; cudaMemcpyAsync() provides an asynchronous variant

```c
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
```
Vector Addition Host Code

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Allocate device memory and copy the inputs to the device
    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // Kernel invocation code – not shown here

    // Copy the result back to the host, then free device memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
Error Checking for API
cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n",
           cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
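Repeating this check by hand after every API call gets tedious; a common convenience pattern (a sketch, not from the slides) is to wrap the check in a macro so the file and line are reported automatically:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Wrap any CUDA runtime call; on failure, report the error string
// with the source location and abort.
#define CHECK_CUDA(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            printf("%s in %s at line %d\n",                         \
                   cudaGetErrorString(err_), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
//   CHECK_CUDA(cudaMalloc((void **) &d_A, size));
//   CHECK_CUDA(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
```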
Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pairwise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}
Host Code – Kernel Launch
int vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    // ..., cudaMalloc(), cudaMemcpy(), ...

    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
    // ...
}
Better Kernel Launch Code
int vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
    // ..., cudaMalloc(), cudaMemcpy(), ...

    // Run ceil(n/256.0) blocks of 256 threads each,
    // using integer arithmetic for the grid size
    dim3 DimGrid((n-1)/256 + 1, 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
    // ...
}
Kernel Execution
__host__
int vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
...
dim3 DimGrid((n-1)/256 + 1, 1, 1);
dim3 DimBlock(256, 1, 1);
vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
...
}
__global__
void vecAddKernel(float *A, float *B, float *C, int n)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if(i<n) C[i] = A[i]+B[i];
}
CUDA Function Declarations
● __global__ defines a kernel function
– A kernel function must return void
● __device__ and __host__ can be used together
● __host__ is optional if used alone
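A short sketch (illustrative, not from the slides) shows the qualifiers in use; a __host__ __device__ function is compiled for both sides, while the __global__ kernel returns void:

```cuda
// Compiled for both the CPU and the GPU.
__host__ __device__ float square(float x) { return x * x; }

// Kernel: callable from the host, runs on the device, returns void.
__global__ void squareKernel(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = square(a[i]);   // calls the __device__ side of square()
}
```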
Compiling a CUDA Program
● Integrated C programs with CUDA extensions are fed to the NVCC compiler
● NVCC separates the source: host code goes to the host C compiler/linker, device code to the device just-in-time compiler
● The resulting executable runs on a heterogeneous computing platform with CPUs, GPUs, etc.