CUDA Programming Model



Hadi Salimi and Nima Ghaemian, Distributed Systems Lab.
School of Computer Engineering, Iran University of Science and Technology, [email protected] and [email protected]
Data parallelism in matrix multiplication
Matrix multiplication of large dimensions can have a very large amount of data parallelism: the dot-product operations for computing different elements of the result matrix P can be performed simultaneously.
[Figure: M (WIDTH x WIDTH) multiplied by N (WIDTH x WIDTH) gives P (WIDTH x WIDTH)]
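As a sketch of where this parallelism comes from, consider a plain C version with the matrices stored as row-major Width x Width arrays (the variable names are illustrative, not taken from the slides):

// Sequential matrix multiplication: every (row, col) iteration is independent
// of all the others, so each element of P could be computed by its own thread.
for (int row = 0; row < Width; ++row) {
    for (int col = 0; col < Width; ++col) {
        float sum = 0.0f;
        for (int k = 0; k < Width; ++k)
            sum += M[row * Width + k] * N[k * Width + col];
        P[row * Width + col] = sum;   // one dot product per output element
    }
}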
CUDA program structure
A CUDA program is, in general, divided into two parts:
• The host part (CPU)
– This part has little or no data parallelism.
– It is compiled with the host's standard C compiler and executed on the CPU.
• The device part (GPU)
– This part has rich amounts of data parallelism.
– It is compiled by NVCC (the NVIDIA C Compiler) and executed on a GPU device.
CUDA program execution
Serial Code (host) → Parallel Kernel (device) → Serial Code (host) → Parallel Kernel (device) → ...
(execution alternates between serial code on the host and data-parallel kernels on the device)
Kernel functions
• The device part of the code is written in ANSI C extended with keywords for labeling data-parallel functions (kernel functions).
• The kernel functions generate a large number of threads to exploit data parallelism (see the sketch below).
• From now on, we refer to “kernel functions” simply as “kernels”.
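A minimal sketch of such a kernel (the function name AddKernel, the array names, and the block size of 256 are illustrative assumptions, not taken from the slides):

// A data-parallel function labeled as a kernel with the __global__ keyword
__global__ void AddKernel(float* a, float* b, float* c) {
    int i = threadIdx.x;        // each launched thread handles one array element
    c[i] = a[i] + b[i];
}

// Host code launches the kernel, creating 256 device threads that all run AddKernel
AddKernel<<<1, 256>>>(a_d, b_d, c_d);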
CUDA host code example
int main(void) {
    // 1. Allocate and initialize the arrays (on the host)
    // I/O to read the input arrays (on the host)
    ...
    ComputeOnDevice(arrays);
    // Free arrays (on the host)
}

CUDA device memory model (Cont.)
• The host code can:
– Read/Write per-grid global, constant, and texture memories (stored in DRAM)
• The device code can:
– Read/Write per-thread local memory
– Read/Write per-block shared memory
– Read only per-grid constant memory
– Read only per-grid texture memory
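A small sketch of how these memory spaces appear in code (the variable and kernel names are illustrative; __constant__ and __shared__ are the standard CUDA C qualifiers):

__constant__ float coeff[16];                    // per-grid constant memory: written by the host, read-only in kernels

__global__ void MemorySpacesKernel(float* gdata) // gdata points into per-grid global memory
{
    __shared__ float tile[256];                  // per-block shared memory
    float local = gdata[threadIdx.x];            // automatic variable: held in per-thread registers (or local memory)
    tile[threadIdx.x] = local * coeff[0];
    gdata[threadIdx.x] = tile[threadIdx.x];
}

// On the host, constant memory is filled with cudaMemcpyToSymbol, e.g.:
// cudaMemcpyToSymbol(coeff, h_coeff, 16 * sizeof(float));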
CUDA API functions
• cudaFree() (see also cudaMalloc() and cudaMemcpy() on the following slides)
[Figure: CUDA device memory model — each thread has its own registers, each block has its own shared memory, and all blocks access the global memory]
cudaMalloc() & cudaFree()
• cudaMalloc(void** p, size_t size): can be called from the host to allocate a piece of Global Memory for an object.
• The first parameter: the address of a pointer that needs to point to the allocated object.
• The second parameter: the size, in bytes, of the object to be allocated.
• cudaFree(void*): is called with the pointer to the allocated object as its parameter, to free the storage space in Global Memory.
Example
const int Dim = 100;                   // dimension of the 2D array
float *Od;                             // the "d" suffix means the pointer is on the device
int size = Dim * Dim * sizeof(float);  // size in bytes of a Dim x Dim array of floats
cudaMalloc((void**)&Od, size);
cudaFree(Od);
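The slides omit error handling, but both calls actually return a cudaError_t that can be checked; a minimal sketch:

cudaError_t err = cudaMalloc((void**)&Od, size);
if (err != cudaSuccess) {
    // Allocation failed (for example, not enough free device memory)
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
}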
cudaMemcpy()
• cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind) – memory data transfer
– parameters:
• First parameter: Pointer to destination
• Second parameter: Pointer to source
• Third parameter: Number of bytes copied
• Fourth parameter: Type of transfer:
– Host to Host
– Host to Device
– Device to Host
Example 1
cudaMemcpy(Od, O, size, cudaMemcpyHostToDevice);
// Copy size bytes from the O object on the host to the Od object on the device.
cudaMemcpy(O, Od, size, cudaMemcpyDeviceToHost);
// Copy size bytes from the Od object on the device to the O object on the host.
Example 2
{
    // 1. Load M and N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    // 2. Allocate Pd on the device for the result
    //    (the kernel launch that fills Pd is shown later, under "Calling a kernel")
    cudaMalloc((void**)&Pd, size);
    // 3. Copy the result back from the device to the host
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
SPMD
• SPMD:
All threads of a parallel phase execute the same code; thus, CUDA programming can be categorized as SPMD (Single-Program, Multiple-Data).
• “__global__” keyword:
• This keyword indicates that the function is executed on the device as a kernel function and is callable from the host only, to create a grid of threads that all execute the kernel function.
Built-in Variables
• gridDim
This variable is of type dim3 and contains the dimensions of the grid.
• blockIdx
This variable is of type uint3 and contains the block index within the grid.
• blockDim
This variable is of type dim3 and contains the dimensions of the block.
• threadIdx
This variable is of type uint3 and contains the thread index within the block (a typical use of these variables is sketched below).
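A common use of these variables is to give every thread in the grid a unique global index (a sketch assuming a one-dimensional grid of one-dimensional blocks; the kernel name and parameters are illustrative):

__global__ void ScaleKernel(float* data, float factor, int n) {
    // threadIdx.x : this thread's index inside its block
    // blockIdx.x  : this block's index inside the grid
    // blockDim.x  : number of threads per block
    // gridDim.x   : number of blocks in the grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // guard: the last block may contain surplus threads
        data[i] *= factor;
}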
Grid and block dimension
• Each grid has one or more thread blocks.
• All blocks in a grid have the same number of threads, organized in the same manner.
• Each block in a grid has a unique two-dimensional coordinate given by the CUDA-specific keywords blockIdx.x and blockIdx.y.
• The coordinates of threads in a block are uniquely defined by three thread indices: threadIdx.x, threadIdx.y, and threadIdx.z.
Calling a kernel
dim3 dimBlock(Dim1, Dim2, Dim3);
dim3 dimGrid(gridDim.x, gridDim.y);
/* The values of gridDim.x and gridDim.y can be anywhere between 1 and 65,535. */
// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);
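Putting the launch together with a kernel that uses the block and thread coordinates (a sketch of one possible MatrixMulKernel; the 16 x 16 block size, the Width parameter, and the boundary check are assumptions, not taken from the slides):

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {
    // 2D coordinates of the Pd element this thread computes
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < Width && col < Width) {
        float sum = 0.0f;
        for (int k = 0; k < Width; ++k)
            sum += Md[row * Width + k] * Nd[k * Width + col];
        Pd[row * Width + col] = sum;
    }
}

// Host side: 16 x 16 = 256 threads per block, and enough blocks to cover the matrix
dim3 dimBlock(16, 16);
dim3 dimGrid((Width + 15) / 16, (Width + 15) / 16);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);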
Programming Model (SPMD + SIMD)
• A kernel is executed as a grid of thread blocks
• A thread block is a batch of threads that can cooperate with each other by:
– Synchronizing their execution (for hazard-free shared memory accesses)
– Efficiently sharing data through a low-latency shared memory (see the sketch after this list)
• Two threads from two different blocks cannot cooperate
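A typical example of this cooperation is a per-block sum, in which the threads of one block exchange partial results through shared memory and synchronize between steps (a sketch assuming 256 threads per block; the names are illustrative):

__global__ void BlockSumKernel(float* in, float* blockSums) {
    __shared__ float partial[256];                 // one shared-memory slot per thread of the block
    unsigned int t = threadIdx.x;
    partial[t] = in[blockIdx.x * blockDim.x + t];  // each thread loads one element
    __syncthreads();                               // wait until the whole block has loaded its data

    // Tree reduction inside the block: halve the number of active threads each step
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            partial[t] += partial[t + stride];
        __syncthreads();                           // hazard-free: all writes finish before the next step
    }
    if (t == 0)
        blockSums[blockIdx.x] = partial[0];        // thread 0 writes this block's result
}

Because threads of different blocks cannot synchronize with each other, each block can only produce its own partial sum; combining the per-block results requires another kernel launch or host code.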
Thread Assignment
[Figure: thread blocks are assigned to streaming multiprocessors (SMs), each built from streaming processors (SPs)]
Hardware Constraints
• 8 thread blocks per SM
• 768 threads per SM → with 16 SMs, 768 × 16 = 12,288 threads in total!
• 16,384 bytes of shared memory per SM
• 8,192 total registers per SM (shared among all thread blocks assigned to that SM)
Example 1
Which of the following block definitions is invalid?
4. dim3 dimBlock(8, 9, 8);
The answer is: 4,
because the number of threads per block would be greater than 512: it defines 8 × 9 × 8 = 576 threads per block!
Example 2
Which is false? The threads assigned to each SM could be in the form of:
1. 4 blocks of 156 threads each
2. 3 blocks of 256 threads each
3. 8 blocks of 228 threads each
4. 12 blocks of 64 threads each
The answer is: 4
because the number of blocks per SM is greater than 8!
Web Resources
• CUDA ZONE: http://www.nvidia.com/object/cuda_education.html
• David Kirk and Wen-mei W. Hwu, Lecture Notes of Programming Massively Parallel Processors, University of Illinois, Urbana-Champaign, 2009
• Rob Farber, “CUDA, Supercomputing for the Masses”, Dr. Dobb’s Journal, available at: http://www.ddj.com/hpc-high-performance-computing/
• Special thanks to Mr. Ebrahim Khademi for his technical comments.
Any Questions?