Lecture 8: CUDA
Transcript of Lecture 8: CUDA
CUDA
A scalable parallel programming model for GPUs and multicore CPUs
Provides facilities for heterogeneous programming
Allows the GPU to be used as both a graphics processor and a computing processor.
Pollack’s Rule
Performance increase is roughly proportional to the square root of the increase in complexity:
performance ∝ √complexity
Power consumption increase is roughly linearly proportional to the increase in complexity:
power consumption ∝ complexity
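As a worked illustration of the two proportionalities above (the factor of four is an arbitrary example, not from the lecture):

```latex
\text{performance} \propto \sqrt{\text{complexity}}
  \;\Rightarrow\; 4\times \text{complexity gives } \sqrt{4} = 2\times \text{ performance}
\qquad
\text{power} \propto \text{complexity}
  \;\Rightarrow\; 4\times \text{complexity gives } 4\times \text{ power}
```

This is why many simpler cores (as on a GPU) can be more power-efficient than one very complex core.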
CUDA SPMD (Single Program Multiple Data) Programming Model
Programmer writes code for a single thread and GPU runs thread instances in parallel
Extends C and C++
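As a minimal sketch of this single-thread view, element-wise vector addition might be written as follows (the kernel name `vec_add` and its parameters are illustrative, not from the lecture):

```cuda
// Each thread instance computes exactly one output element;
// the GPU runs many instances of this single-thread program in parallel.
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n)                                      // guard: grid may overshoot n
        c[i] = a[i] + b[i];
}
```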
CUDA
Three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization
CUDA provides fine-grained data parallelism and thread parallelism nested within coarse-grained data parallelism and task parallelism
CUDA
Kernel: a function designed to be executed by many threads
Thread block: a set of concurrent threads that can cooperate among themselves through barrier synchronization and through shared-memory access
Grid: a set of thread blocks that execute the same kernel program
CUDA
__global__ void mykernel(int a, …)
{
    ...
}

int main()
{
    ...
    nblocks = N / 512;                    // max. 512 threads per block
    mykernel<<<nblocks, 512>>>(aa, …);
    ...
}
CUDA
Thread management is performed by hardware
Max. 512 threads per block
The number of blocks can exceed the number of processors
Blocks execute independently and in any order
Threads can communicate through shared memory
Atomic memory operations are available on global memory
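A common use of atomic operations on global memory is building a histogram, where many threads may update the same counter concurrently. A hedged sketch (the kernel name and arrays are illustrative):

```cuda
// Many threads may hit the same bin at once; atomicAdd serializes
// those updates in hardware so no increments are lost.
__global__ void histogram(const unsigned char *data, int n, unsigned int *hist)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&hist[data[i]], 1u);  // atomic read-modify-write on global memory
}
```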
CUDA
Memory Types
Local Memory: private to a thread
Shared Memory: shared by all threads of the block
__shared__
Device memory: shared by all threads of an application
__device__
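The three memory types and their qualifiers might appear together as in this sketch (names are illustrative, assuming a block size of 256):

```cuda
__device__ float scale = 2.0f;        // device memory: visible to all threads of the application

__global__ void demo(float *out)
{
    __shared__ float tile[256];       // shared memory: one copy per thread block
    float tmp = scale * threadIdx.x;  // local variable: private to this thread
    tile[threadIdx.x] = tmp;
    __syncthreads();                  // make the shared-memory writes visible block-wide
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}
```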
CUDA
__global__ void mykernel(float* a, …)
{
    ...
}

int main()
{
    ...
    int nbytes = N * sizeof(float);
    float* ha = (float*)malloc(nbytes);
    float* da = 0;
    cudaMalloc((void**)&da, nbytes);
    cudaMemcpy(da, ha, nbytes, cudaMemcpyHostToDevice);
    mykernel<<<N/blocksize, blocksize>>>(da, …);
    cudaMemcpy(ha, da, nbytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    ...
}
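In practice, each CUDA runtime call above returns a `cudaError_t` that should be checked; the lecture example omits this for brevity. A minimal sketch of such checking (the helper `check` is hypothetical):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper: abort with a readable message if a runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main()
{
    float *da = 0;
    check(cudaMalloc((void**)&da, 1024 * sizeof(float)), "cudaMalloc");
    check(cudaFree(da), "cudaFree");
    return 0;
}
```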
CUDA
Synchronization Barrier: Threads wait until all threads in the block arrive at the barrier
__syncthreads()
The thread increments the barrier count and the scheduler marks it as waiting. When all threads have arrived at the barrier, the scheduler releases all waiting threads.
CUDA
__global__ void shift_reduce(int *inp, int N, int *tot)
{
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int x[blocksize];
    x[tid] = (i < N) ? inp[i] : 0;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s = s / 2)
    {
        if (tid < s) x[tid] += x[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(tot, x[tid]);
}
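The reduction kernel above might be driven from the host roughly as follows (a sketch with illustrative names `h_inp` and `h_tot`; it assumes `blocksize` is the same compile-time constant used for the shared array inside the kernel):

```cuda
// Host-side sketch: sum N integers with shift_reduce.
const int blocksize = 256;
int h_tot = 0;
int *d_inp = 0, *d_tot = 0;
cudaMalloc((void**)&d_inp, N * sizeof(int));
cudaMalloc((void**)&d_tot, sizeof(int));
cudaMemset(d_tot, 0, sizeof(int));                      // running total starts at zero
cudaMemcpy(d_inp, h_inp, N * sizeof(int), cudaMemcpyHostToDevice);
int nblocks = (N + blocksize - 1) / blocksize;          // round up to cover all N elements
shift_reduce<<<nblocks, blocksize>>>(d_inp, N, d_tot);
cudaMemcpy(&h_tot, d_tot, sizeof(int), cudaMemcpyDeviceToHost);
```

Each block reduces its slice in shared memory, and the per-block partial sums are combined through the single `atomicAdd` on global memory.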
CUDA
SPMD (Single Program Multiple Data) programming model:
All threads execute the same program
Threads coordinate with barrier synchronization
Threads of a block express fine-grained data parallelism and thread parallelism
Independent blocks of a grid express coarse-grained data parallelism
Independent grids express coarse-grained task parallelism
CUDA
Scheduler
Hardware management and scheduling of threads and thread blocks
Scheduler has minimal runtime overhead
CUDA
Multithreading
Memory and texture fetch latency requires hundreds of processor clock cycles
While one thread is waiting for a load or texture fetch, the processor can execute another thread
Thousands of independent threads can keep many processors busy
CUDA
GPU Multiprocessor Architecture
Lightweight thread creation
Zero-overhead thread scheduling
Fast barrier synchronization
Each thread has its own private registers, private per-thread memory, PC, and thread execution state
Support very fine-grained parallelism
CUDA
GPU Multiprocessor Architecture
Each SP core
• contains scalar integer and floating-point units
• is hardware multithreaded and supports up to 64 threads
• is pipelined and executes one instruction per thread per clock
• has a large register file (RF) of 1024 32-bit registers, partitioned among the assigned threads
Programs declare their register demand; the compiler optimizes register allocation.
Ex: (a) 32 registers per thread => 256 threads per block, or
(b) fewer registers – more threads, or (c) more registers – fewer threads
CUDA
Single Instruction Multiple Thread (SIMT)
SIMT: a processor architecture that applies one instruction to multiple independent threads in parallel
Warp: the set of parallel threads that execute the same instruction together in a SIMT architecture
Warp size is 32 threads (4 threads per SP, executed in 4 clock cycles)
Threads in a warp start at the same program address, but they can branch and execute independently.
Individual threads may be inactive due to independent branching
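Such independent branching within a warp can be sketched as follows (an illustrative kernel, not from the lecture): even- and odd-numbered lanes take different paths, so the SIMT hardware runs the two sides of the branch one after the other, with the lanes not on the current path inactive.

```cuda
// Threads of one warp diverge on tid % 2, so the two branch
// sides execute serially while the inactive lanes idle.
__global__ void divergent(int *out)
{
    int tid = threadIdx.x;
    if (tid % 2 == 0)
        out[tid] = tid * 2;   // even lanes active on this side
    else
        out[tid] = tid + 1;   // odd lanes active on this side
}
```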
CUDA
SIMT Warp Execution
There are 4 thread lanes per SP
An issued warp instruction executes in 4 processor cycles
The instruction scheduler selects a warp every 4 clocks
The controller:
• Collects thread programs into warps
• Allocates a warp
• Allocates registers for the warp threads (it can start a warp only when it can allocate the requested register count)
• Starts warp execution
• When all threads exit, it frees the registers
CUDA
Streaming Processor (SP)
Has 1024 32-bit registers (RF)
Can perform 32-bit and 64-bit integer operations: arithmetic, comparison, conversion, logic operations
Can perform 32-bit floating-point operations: add, multiply, min, max, multiply-add, etc.
SFU (Special Function Unit)
Pipelined unit
Generates one 32-bit floating-point function per cycle: square root, sin, cos, 2^x, log2(x)
CUDA
Memory System
Global Memory – external DRAM
Shared Memory – on chip
Per-thread local memory – external DRAM
Constant memory – in external DRAM, cached in shared memory
Texture memory – on chip
Project
Performance Measurement, Evaluation and Prediction of Multicore and GPU Systems
Multicore systems:
CPU performance (instruction execution time, pipelining, etc.)
Cache performance
Performance using algorithmic structures
GPU systems (NVIDIA CUDA):
GPU core performance (instruction execution time, pipelining, etc.)
Global and shared memory performance
Performance using algorithmic structures
GPU performance in the MATLAB environment