Lecture 8: CUDA


Transcript of Lecture 8: CUDA

Page 1: Lecture 8: CUDA

Lecture 8: CUDA

Page 2: Lecture 8: CUDA

CUDA

A scalable parallel programming model for GPUs and multicore CPUs

Provides facilities for heterogeneous programming

Allows the GPU to be used as both a graphics processor and a computing processor.

Page 3: Lecture 8: CUDA

Pollack’s Rule

Performance increase is roughly proportional to the square root of the increase in complexity

performance ∝ √complexity

Power consumption increase is roughly linearly proportional to the increase in complexity

power consumption ∝ complexity
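As a rough worked example (illustrative numbers, not from the slides): quadrupling a core's complexity gives about √4 = 2× the performance but about 4× the power,

complexity × 4  ⇒  performance × √4 = 2×,  power × 4×

which is why many simpler cores, as in a GPU, can deliver more performance per watt than one very large core.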

Page 4: Lecture 8: CUDA

CUDA SPMD (Single Program Multiple Data) Programming Model

Programmer writes code for a single thread and GPU runs thread instances in parallel

Extends C and C++

Page 5: Lecture 8: CUDA

CUDA

Three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization

CUDA provides fine-grained data parallelism and thread parallelism nested within coarse-grained data parallelism and task parallelism

Page 6: Lecture 8: CUDA

CUDA

Kernel: a function designed to be executed by many threads

Thread block: a set of concurrent threads that can cooperate among themselves through barrier synchronization and through shared-memory access

Grid: a set of thread blocks that execute the same kernel program

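A minimal sketch of how these pieces fit together (the kernel name, block size, and array names are illustrative, not from the slides): each thread of the grid combines its block index and thread index into a global element index.

__global__ void scale (float* a, float s, int N)
{
    // blockIdx.x selects the thread block, threadIdx.x selects the thread within it,
    // blockDim.x is the number of threads per block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] = s * a[i];
}

// Launch a grid of N/256 blocks of 256 threads each (assumes N is a multiple of 256):
scale <<< N/256, 256 >>> (da, 2.0f, N);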

Page 7: Lecture 8: CUDA

CUDA

Three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization

CUDA provides fine-grained data parallelism and thread parallelism nested within coarse-grained data parallelism and task parallelism

Page 8: Lecture 8: CUDA

CUDA

__global__ void mykernel (int a, …)
{
    ...
}

int main()
{
    ...
    int nblocks = N/512;   // max. 512 threads per block (assumes N is a multiple of 512)
    mykernel <<< nblocks, 512 >>> (aa, …);
    ...
}

Page 9: Lecture 8: CUDA

CUDA

Page 10: Lecture 8: CUDA

CUDA

Thread management is performed by hardware

Max. 512 threads per block

The number of blocks can exceed the number of processors

Blocks execute independently and in any order

Threads can communicate through shared memory

Atomic memory operations are available on global memory

Page 11: Lecture 8: CUDA

CUDA

Memory Types

Local Memory: private to a thread

Shared Memory: shared by all threads of the block

__shared__

Device memory: shared by all threads of an application

__device__
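A minimal sketch of how the three memory types appear in code (array sizes and names are illustrative; it assumes thread blocks of at most 128 threads):

__device__ float table[256];        // __device__: device (global) memory, visible to all threads of the application

__global__ void mykernel (float* out, int N)
{
    __shared__ float buf[128];      // __shared__: one copy per thread block, visible to the block's threads
    int   i = blockIdx.x * blockDim.x + threadIdx.x;
    float t;                        // ordinary local variable: private to this thread

    buf[threadIdx.x] = (i < N) ? table[threadIdx.x] : 0.0f;
    __syncthreads();                        // barrier before reading other threads' shared entries
    t = buf[blockDim.x - 1 - threadIdx.x];  // read a value written by another thread of the block
    if (i < N) out[i] = t;
}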

Page 12: Lecture 8: CUDA

Pollack’s Rule

Performance increase is roughly proportional to the square root of the increase in complexity

performance ∝ √complexity

Power consumption increase is roughly linearly proportional to the increase in complexity

power consumption ∝ complexity

Page 13: Lecture 8: CUDA

CUDA

Three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization

CUDA provides fine-grained data parallelism and thread parallelism nested within coarse-grained data parallelism and task parallelism

Page 14: Lecture 8: CUDA

CUDA

__global__ void mykernel (float* a, …)
{
    ...
}

int main()
{
    ...
    int nbytes = N * sizeof(float);
    float* ha = (float*) malloc(nbytes);   // host array
    float* da = 0;                         // device array
    cudaMalloc((void**)&da, nbytes);
    cudaMemcpy(da, ha, nbytes, cudaMemcpyHostToDevice);
    mykernel <<< N/blocksize, blocksize >>> (da, …);
    cudaMemcpy(ha, da, nbytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    ...
}

Page 15: Lecture 8: CUDA

CUDA

Synchronization Barrier: Threads wait until all threads in the block arrive at the barrier

__syncthreads()

A thread that reaches the barrier increments the barrier count and is marked as waiting by the scheduler. When all threads of the block have arrived at the barrier, the scheduler releases all waiting threads.

Page 16: Lecture 8: CUDA

CUDA

__global__ void shift_reduce (int *inp, int N, int *tot)
{
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ int x[blocksize];   // blocksize: compile-time constant equal to the number of threads per block

    x[tid] = (i < N) ? inp[i] : 0;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s = s / 2)
    {
        if (tid < s) x[tid] += x[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(tot, x[tid]);
}
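A minimal host-side sketch of launching this kernel (the names hin, din, dtot and the value of blocksize are illustrative, not from the slides). It assumes blocksize is a power of two, as the halving loop requires, and that it is defined before the kernel so the shared array x[blocksize] can use it:

#include <stdlib.h>

const int blocksize = 256;               // power of two; also sizes the kernel's shared array

int main()
{
    int N = 1 << 20;
    int nbytes = N * sizeof(int);
    int *hin = (int*) malloc(nbytes);
    for (int k = 0; k < N; k++) hin[k] = 1;        // expected total: N

    int *din = 0, *dtot = 0;
    cudaMalloc((void**)&din, nbytes);
    cudaMalloc((void**)&dtot, sizeof(int));
    cudaMemcpy(din, hin, nbytes, cudaMemcpyHostToDevice);
    cudaMemset(dtot, 0, sizeof(int));              // each block atomically adds its partial sum to *dtot

    shift_reduce <<< (N + blocksize - 1) / blocksize, blocksize >>> (din, N, dtot);

    int tot = 0;
    cudaMemcpy(&tot, dtot, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(din); cudaFree(dtot); free(hin);
    return 0;
}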

Page 17: Lecture 8: CUDA

CUDA

SPMD (Single Program Multiple Data) programming model: all threads execute the same program and coordinate with barrier synchronization

Threads of a block express fine-grained data parallelism and thread parallelism

Independent blocks of a grid express coarse-grained data parallelism. Independent grids express coarse-grained task parallelism.

Page 18: Lecture 8: CUDA

CUDA

Page 19: Lecture 8: CUDA

CUDA

Scheduler

Hardware management and scheduling of threads and thread blocks

Scheduler has minimal runtime overhead

Page 20: Lecture 8: CUDA

CUDA

Multithreading

A memory or texture fetch can take hundreds of processor clock cycles

While one thread is waiting for a load or texture fetch, the processor can execute another thread

Thousands of independent threads can keep many processors busy

Page 21: Lecture 8: CUDA

CUDA

GPU Multiprocessor Architecture

Lightweight thread creation, zero-overhead thread scheduling, fast barrier synchronization

Each thread has its own private registers, private per-thread memory, program counter (PC), and thread execution state

Supports very fine-grained parallelism

Page 22: Lecture 8: CUDA

CUDA

Page 23: Lecture 8: CUDA

CUDA

GPU Multiprocessor Architecture

Each SP core contains scalar integer and floating-point units, is hardware multithreaded (supports up to 64 threads), is pipelined, and executes one instruction per thread per clock. It has a large register file (RF) of 1024 32-bit registers; the registers are partitioned among the assigned threads.

Programs declare their register demand; the compiler optimizes register allocation.

Example: (a) 32 registers per thread => 256 threads per block, or (b) fewer registers – more threads, or (c) more registers – fewer threads.
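A hedged reading of the example above (assuming the register files of the eight SP cores of a multiprocessor are pooled for the resident block, giving 8 × 1024 = 8192 registers): 8192 registers / 32 registers per thread = 256 threads per block; halving the demand to 16 registers per thread would allow 512 threads (the per-block maximum), while 64 registers per thread would allow only 128 threads.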

Page 24: Lecture 8: CUDA

CUDA

Single Instruction Multiple Thread (SIMT)

SIMT: a processor architecture that applies one instruction to multiple independent threads in parallel

Warp: the set of parallel threads that execute the same instruction together in a SIMT architecture

Warp size is 32 threads (4 threads per SP, executed in 4 clock cycles)

Threads in a warp start at the same program address, but they can branch and execute independently.

Individual threads may be inactive due to independent branching
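A minimal sketch (illustrative, not from the slides) of the kind of branch that makes threads of a warp diverge; when the condition differs within a warp, the two paths are executed one after the other, with the threads on the other path inactive:

__global__ void divergent (int *a, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        if (a[i] % 2 == 0)        // threads of the same warp can take different paths here
            a[i] = a[i] / 2;      // executed with the odd-element threads inactive
        else
            a[i] = 3 * a[i] + 1;  // executed with the even-element threads inactive
    }
}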

Page 25: Lecture 8: CUDA

CUDA

Page 26: Lecture 8: CUDA

CUDA

SIMT Warp Execution

There are 4 thread lanes per SP. An issued warp instruction executes in 4 processor cycles. The instruction scheduler selects a warp every 4 clocks. The controller:

• Collects thread programs into warps

• Allocates a warp

• Allocates registers for the warp threads (it can start a warp only when it can allocate the requested register count)

• Starts warp execution

• When all threads exit, it frees the registers

Page 27: Lecture 8: CUDA

CUDA

Streaming Processor (SP)

Has 1024 32-bit registers (RF)

Can perform 32-bit and 64-bit integer operations: arithmetic, comparison, conversion, logic operations

Can perform 32-bit floating-point operations: add, multiply, min, max, multiply-add, etc.

SFU (Special Function Unit)

Pipelined unit

Generates one 32-bit floating-point function result per cycle: square root, sin, cos, 2^x, log2(x)
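A minimal sketch (illustrative, not from the slides) of these special functions as they appear in device code; it assumes the fast-math intrinsics are the ones evaluated by the SFU:

__global__ void special (float* x, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        float v = x[i];
        // sqrtf, exp2f and the __-prefixed fast intrinsics cover the functions listed above
        x[i] = sqrtf(v) + __sinf(v) + __cosf(v) + exp2f(v) + __log2f(fabsf(v) + 1.0f);
    }
}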

Page 28: Lecture 8: CUDA

CUDA

Page 29: Lecture 8: CUDA

CUDA

Memory System

Global memory – external DRAM
Shared memory – on chip
Per-thread local memory – external DRAM
Constant memory – in external DRAM and cached in shared memory
Texture memory – on chip

Page 30: Lecture 8: CUDA

Project

Performance Measurement, Evaluation and Prediction of Multicore and GPU Systems

Multicore systems: CPU performance (instruction execution time, pipelining, etc.), cache performance, performance using algorithmic structures

GPU systems (NVIDIA-CUDA): GPU core performance (instruction execution time, pipelining, etc.), global and shared memory performance, performance using algorithmic structures

GPU performance in the MATLAB environment