CUDA programming - Aalto University Wiki

Page 2: CUDA programming - Aalto University Wiki

NVIDIA Research

- What is the world going to look like, what should our hardware look like in 5–10 years?
- And how do we get there?

- Engage and participate in the academic community

- ~30 researchers around the globe (4 in Helsinki)

- See http://research.nvidia.com

Page 3: CUDA programming - Aalto University Wiki

Today

- Brief history of GPU programming

- CUDA programming model
- Writing CUDA programs

- Designing parallel algorithms

Page 4: CUDA programming - Aalto University Wiki

Motivation for GPU Programming

- This thing packs a lot of oomph

- How to tap into that?

Page 5: CUDA programming - Aalto University Wiki

Early Days: GPGPU (bad!)

- General-Purpose GPU programming
- The craze around 2004–2006

- Trick the GPU into general-purpose computing by casting problem as graphics
  - Turn data into images (textures)
  - Turn algorithms into image synthesis (rendering passes)

- Many attempts to handle these automatically
  - Brook, Sh, PeakStream, MS Accelerator, …
  - Take a “program”, somehow convert to shaders

Page 6: CUDA programming - Aalto University Wiki

Problems with GPGPU

- Highly constrained memory access model
  - No scatter, no read/write access

- Split computation into highly constrained passes
  - Limited by what shaders can do

- Tough learning curve
  - To understand limitations, must understand graphics HW
  - Need crazy stunts to circumvent rigidity of hardware

- Overhead of graphics API

Page 7: CUDA programming - Aalto University Wiki

GPGPU: An Illustrated Guide

(Figure captions: “Using graphics API to express programs” / “Designing GPGPU algorithms”)

Page 8: CUDA programming - Aalto University Wiki

The Road to CUDA

- Okay, this GPGPU thing has potential
  - The only problem is that it sucks

- Let’s design the right tool for the job

- Need new hardware capabilities? Build it.
  - We are a hardware company, after all

- Need a better API for poking the GPU? Ok.

- Don’t invent a new language

Page 9: CUDA programming - Aalto University Wiki

CUDA Design Goals

- Heterogeneous CPU/GPU computing platform

- Easy to program
  - Also, easy to integrate GPU code into existing programs

- Close enough to hardware to get best performance
  - For those who know what they’re doing

Page 10: CUDA programming - Aalto University Wiki

Some Ingredients of CUDA

- SIMT execution model
  - Single Instruction, Multiple Thread
  - Lets you write scalar code instead of explicit SIMD

- Ways to exploit locality
  - Warp and block execution model
  - Shared memory

- Direct memory access

- C/C++ with minimal extensions

Page 11: CUDA programming - Aalto University Wiki

To Whet Your Appetite

146X - Interactive visualization of volumetric white matter connectivity
 36X - Ionic placement for molecular dynamics simulation on GPU
 19X - Transcoding HD video stream to H.264
 17X - Fluid mechanics in Matlab using .mex file CUDA function
100X - Astrophysics N-body simulation
149X - Financial simulation of LIBOR model with swaptions
 47X - GLAME@lab: an M-script API for GPU linear algebra
 20X - Ultrasound medical imaging for cancer diagnostics
 24X - Highly optimized object oriented molecular dynamics
 30X - Cmatch exact string matching to find similar proteins and gene sequences

Page 12: CUDA programming - Aalto University Wiki

CUDA Programming Model

Page 13: CUDA programming - Aalto University Wiki

Programmer’s View of Hardware

(Figure: the CPU and its DRAM connect over the PCIe bus to the GPU; the GPU contains multiple SMs, each with its own L1 cache, sharing an L2 cache and the GPU DRAM.)

Page 14: CUDA programming - Aalto University Wiki

Threads, Warps, Blocks

- A thread in CUDA executes scalar code
  - Very much like a usual CPU program

- Hardware packs threads into warps
  - Crucial for efficient execution
  - Programmer can ignore warps (but you shouldn’t)

- Threads are logically grouped into blocks
  - Threads in the same block can communicate and synchronize efficiently

Page 15: CUDA programming - Aalto University Wiki

Programmer’s View of SM

(Figure, built up over several slides titled “Programmer’s View of SM” and “Programmer’s View of SM: Execution”: an SM contains a row of cores and a set of resident warps, Warp 0 … Warp n; each warp is a group of threads sharing a single PC.)

- Each thread has otherwise independent state, but it shares its PC with the other threads of the warp

- Execution example: the SM picks a warp and executes one instruction, e.g. r1 = r2 * r3, for all of its threads at once
  - Each thread reads its own r2 and r3, multiplies, and writes its own r1

Page 24: CUDA programming - Aalto University Wiki

Programmer’s View of SM: Blocks

(Figure, two slides: within the SM, different groups of warps are highlighted as CUDA thread blocks, each block with its own piece of shared memory.)

- Note: Blocks are formed on the fly from the available warps (they don’t need to be consecutive)

Page 26: CUDA programming - Aalto University Wiki

Implications

- All threads in a warp always execute concurrently
  - Same PC, same instruction
  - You can exploit this if you’re careful!

- But warps in a block are scheduled irregularly
  - Hence, threads of a block are not implicitly synchronized
  - But they are always in the same SM, and can synchronize efficiently and communicate through shared memory

- Blocks are instantiated in SMs that have space
  - No way of knowing which blocks end up in which SMs
  - Key to good load balancing

Page 27: CUDA programming - Aalto University Wiki

Occupancy

- How many warps can fit in one SM depends on resource usage
  - Number of registers / thread
  - Amount of shared memory / block

- Block size matters too
  - Work is always launched in full blocks
  - Number of blocks / SM also limited

- Occupancy = percentage of thread slots used
  - Handy occupancy calculator spreadsheet available (a runtime query sketch also follows below)
  - Directly affects latency hiding capability
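The slides point to the occupancy calculator spreadsheet; as an aside, later CUDA toolkits also expose an occupancy query in the runtime API. A minimal sketch, assuming a placeholder kernel MyKernel and a planned block size of 256:

  // Minimal sketch (not from the slides): querying occupancy at runtime.
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void MyKernel(float* data) { /* placeholder */ }

  void ReportOccupancy()
  {
      int blockSize = 256;              // threads per block we plan to launch
      int maxActiveBlocks = 0;

      // How many blocks of MyKernel fit on one SM with this block size
      // and 0 bytes of dynamic shared memory?
      cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks, MyKernel,
                                                    blockSize, 0);

      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);

      float occupancy = (maxActiveBlocks * blockSize) /
                        (float)prop.maxThreadsPerMultiProcessor;
      printf("Occupancy: %.0f%% of thread slots used\n", occupancy * 100.0f);
  }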

Page 28: CUDA programming - Aalto University Wiki

Synchronization

- Threads can specify a synchronization point
  - __syncthreads() intrinsic
  - This prevents a warp from being scheduled until all warps in the same block have arrived at the sync point
  - Very lightweight mechanism

- Atomic operations can be used for avoiding race conditions globally
  - E.g., append to an array with atomicAdd() (see the sketch below)

- Implicit synchronization between launches
  - Unless asynchronous operation is explicitly allowed
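A minimal sketch of the atomicAdd() append pattern mentioned above; AppendPositive and its filtering condition are made-up examples, not from the slides:

  // Append selected elements to a global output array with atomicAdd().
  __global__ void AppendPositive(const float* in, int n, float* out, int* outCount)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n && in[i] > 0.0f)
      {
          // atomicAdd returns the old value, which is this thread's private slot.
          int slot = atomicAdd(outCount, 1);
          out[slot] = in[i];
      }
  }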

Page 29: CUDA programming - Aalto University Wiki

SIMT Execution Model

- How can threads of a warp diverge if they all have the same PC?

- Partial solution: Per-instruction execution predication

- Full solution: Hardware-supported execution mask, execution stack, and related instructions

Page 30: CUDA programming - Aalto University Wiki

Example: Instruction Predication

  if (a < 10) small++;
  else        big++;

compiles to the predicated SASS sequence:

        ISETP.GT.AND P0, pt, R6, 0x9, pt;   // set predicate register P0 if a > 9
  @!P0  IADD R5, R5, 0x1;                   // if P0 is cleared, R5 = R5 + 1
  @P0   IADD R4, R4, 0x1;                   // if P0 is set, R4 = R4 + 1

Page 34: CUDA programming - Aalto University Wiki

What About Complex Cases?

- Nested if-else blocks, loops, recursion …

- Need hardware execution mask and execution stack

Page 35: CUDA programming - Aalto University Wiki

Non-Predicated Example

  if (a < 10) foo();
  else        bar();

compiles to branching SASS:

  /*0048*/  ISETP.GT.AND P0, pt, R4, 0x9, pt;   // set P0 if a > 9
  /*0050*/  @P0 BRA 0x70;                       // threads with P0 set jump to the else branch
  /*0058*/  ...;                                // if branch: foo()
  /*0060*/  ...;
  /*0068*/  BRA 0x80;
  /*0070*/  ...;                                // else branch: bar()
  /*0078*/  ...;
  /*0080*/  // continue here after the if-block

- Case 1: All threads take the if branch
  - No thread wants to jump at 0x50; the warp runs the if branch and branches over the else branch

- Case 2: All threads take the else branch
  - All threads want to jump, so the warp branches to 0x70 and runs only the else branch

- Case 3: Some threads take the if branch, some take the else branch
  - Some threads want to jump: the hardware pushes state onto the execution stack, runs one branch with the other threads masked out, then pops and restores the active thread mask to run the remaining branch

Page 39: CUDA programming - Aalto University Wiki

Benefits of SIMT

- Supports all structured C++ constructs
  - If/else, switch/case, loops, function calls, exceptions
  - goto is a different beast – supported, but best to avoid

- Multi-level constructs handled efficiently
  - Break/continue from inside multiple levels of conditionals
  - Function return from inside loops and conditionals
  - Retreating to exception handler from anywhere

- You only need to care about SIMT when tuning for performance

Page 40: CUDA programming - Aalto University Wiki

Some Consequences of SIMT

- An if statement takes the same number of cycles for any number of threads > 0
  - If nobody participates it’s cheap
  - Also, masked-out threads don’t do memory accesses

- A loop is iterated until all active threads in the warp are done

- A warp stays alive until every thread in it has terminated
  - Terminated threads cause “empty slots” in warps
  - Thread utilization = percentage of active threads

Page 41: CUDA programming - Aalto University Wiki

Coherent Execution Is Great

- An if statement is perfectly efficient if either everyone takes it or nobody does
  - All threads stay active

- A loop is perfectly efficient if everyone does the same number of iterations

- Note: These are required for traditional SIMD

Page 42: CUDA programming - Aalto University Wiki

Incoherent Execution Is Okay

- Conditionals are efficient as long as threads usually agree

- Loops are efficient if threads usually take roughly the same number of iterations

- Much easier to program than explicit SIMD
  - SIMT: incoherence is supported, performance degrades gracefully if control diverges
  - SIMD: performance is fixed, incoherence not supported

Page 43: CUDA programming - Aalto University Wiki

Striving for Execution Coherence

- Learn to spot low-hanging fruit for improving execution coherence

- Process input in coherent order
  - E.g., process nearby pixels of an image together

- Fold branches together as much as possible
  - Only put the differing part in a conditional

- Simple low-level fixes (a small sketch follows this list)
  - Favor [f]min / [f]max over conditionals
  - Bitwise operators sometimes help
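A small sketch of the fmin/fmax trick from the list above; ClampBranchy and ClampCoherent are hypothetical helpers, and in practice the compiler may well predicate the branchy version anyway:

  // Clamping a value: the branchy form can diverge, the fminf/fmaxf form never does.
  __device__ float ClampBranchy(float x, float lo, float hi)
  {
      if (x < lo) return lo;       // threads may take different paths here
      if (x > hi) return hi;
      return x;
  }

  __device__ float ClampCoherent(float x, float lo, float hi)
  {
      return fminf(fmaxf(x, lo), hi);  // every thread executes the same instructions
  }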

Page 44: CUDA programming - Aalto University Wiki

Memory in CUDA, part 1

- Global memory
  - Accessible from everywhere, including CPU (memcpy)
  - Requests go through L1, L2, DRAM

- Shared memory
  - Either 16 or 48 KB per SM in Fermi
  - Pieces allocated to thread blocks when launched
  - Accessible from threads in the same block
  - Requests served directly, very fast

- Local memory
  - Actually a thread-local portion of global memory
  - Used for register spilling and indexed arrays

Page 45: CUDA programming - Aalto University Wiki

Memory in CUDA, part 2

- Textures
  - Data can also be fetched from DRAM through texture units
  - Separate texture caches
  - High latency, extreme pipelining capability
  - Read-only

- Surfaces
  - Read / write access with pixel format conversions
  - Useful for integrating with graphics

- Constants
  - Coherent and frequent access of same data

Page 46: CUDA programming - Aalto University Wiki

Simplified

- Global memory
  - Almost all data access goes here, you will need this

- Shared memory
  - Use to share data between threads

- Textures
  - Use to accelerate data fetching

- Local memory, constants, surfaces
  - Let’s ignore for now, details can be found in manuals

Page 47: CUDA programming - Aalto University Wiki

Memory Access Coherence

- GPU memory buses are wide
  - Both external and internal

- When a warp executes a memory instruction, the addresses matter a lot
  - Those that land on the same cache line are served together
  - Different cache lines are served sequentially

- This can have a huge impact on performance
  - Easy to accidentally burden the memory system
  - Incoherent access also easily overflows caches

Page 48: CUDA programming - Aalto University Wiki

Improving Memory Coherence

- Try to access nearby addresses from nearby threads

- If each thread processes just one element, choose wisely which one

- If each thread processes multiple elements, preferably use striding

Page 49: CUDA programming - Aalto University Wiki

Striding Example

- We want each thread to process 10 elements of an array
  - 64 threads per block

No striding (bad access pattern):
  Thread 0:    0   1   2   3   4   5   6   7   8   9
  Thread 1:   10  11  12  13  14  15  16  17  18  19
  ..
  Thread 63: 630 631 632 633 634 635 636 637 638 639

With stride of 64 (optimal access pattern):
  Thread 0:    0  64 128 192 256 320 384 448 512 576
  Thread 1:    1  65 129 193 257 321 385 449 513 577
  ..
  Thread 63:  63 127 191 255 319 383 447 511 575 639

(In the strided pattern, at each step in time the 64 threads of the block touch 64 consecutive elements. A kernel sketch of the strided pattern follows.)
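A kernel sketch of the strided pattern, here using a grid-wide stride as a generalization of the per-block stride of 64 shown above; ScaleStrided and ELEMS_PER_THREAD are made-up names, not from the slides:

  // Each thread processes ELEMS_PER_THREAD elements with a grid-wide stride,
  // so that at every step the threads touch consecutive addresses.
  #define ELEMS_PER_THREAD 10

  __global__ void ScaleStrided(float* data, int n, float k)
  {
      int tid    = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's first element
      int stride = gridDim.x * blockDim.x;                  // total number of threads

      for (int j = 0; j < ELEMS_PER_THREAD; ++j)
      {
          int i = tid + j * stride;                         // strided, coalesced access
          if (i < n)
              data[i] = data[i] * k;
      }
  }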

Page 50: CUDA programming - Aalto University Wiki

Launching Work in CUDA

- Kernel = function running on GPU
  - Written in CUDA C

- A kernel is launched for a grid of blocks
  - The blocks and the grid can be 1D, 2D or 3D
  - The extra dimensions are really just syntactic sugar, but convenient if the data lives in a 2D or 3D domain

- Every thread gets to know its
  - Thread location within the block (threadIdx)
  - Block location within the grid (blockIdx)
  - Block and grid dimensions (blockDim, gridDim)

Page 51: CUDA programming - Aalto University Wiki

Example

- Each block has 8×8 threads
  - So, 64 threads in a block = 2 warps

- We launch a grid of 10×5 blocks
  - So, 50 blocks in total

- For the example thread highlighted in the figure:
  threadIdx.x = 1   threadIdx.y = 1
  blockIdx.x  = 9   blockIdx.y  = 0
  blockDim.x  = 8   blockDim.y  = 8
  gridDim.x   = 10  gridDim.y   = 5

Page 52: CUDA programming - Aalto University Wiki

What’s with the Blocks?

- Why did we have blocks again, instead of just a flat gigantic grid of threads?
  - Because a block can be guaranteed to be localized
  - Launched at the same time, in the same SM

- Threads of a block have the same shared memory
  - Load common data together and work on it

- Threads of a block can synchronize efficiently
  - Synchronization points in code

- Individual blocks must be truly independent

Page 53: CUDA programming - Aalto University Wiki

Writing CUDA Programs

Page 54: CUDA programming - Aalto University Wiki

Two APIs

- CUDA can be used through two APIs

- Driver API
  - Low-level API
  - GPU code compiled separately into binaries
  - CPU code manually loads GPU code and invokes it

- Runtime API (the one used in this talk)
  - User-friendly high-level API
  - Language extensions for launching kernels
  - Compiler automatically splits code into GPU and CPU parts, compiles them separately, and links them together

Page 55: CUDA programming - Aalto University Wiki

Defining and Launching Kernels

  // Kernel definition
  __global__ void VecAdd(float* A, float* B, float* C)
  {
      int i = threadIdx.x;
      C[i] = A[i] + B[i];
  }

  int main()   // N-length vector add
  {
      ...
      // Kernel invocation with N threads
      VecAdd<<<1, N>>>(A, B, C);
  }

- In the launch VecAdd<<<1, N>>>, the first parameter is the number of blocks and the second is the number of threads per block; in general, both are of type dim3

Page 56: CUDA programming - Aalto University Wiki

Defining and Launching Kernels (2D)

  // Kernel definition
  __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
  {
      int i = threadIdx.x;
      int j = threadIdx.y;
      C[i][j] = A[i][j] + B[i][j];
  }

  int main()   // N*N matrix add
  {
      ...
      // Kernel invocation with one block of N * N * 1 threads
      int numBlocks = 1;
      dim3 threadsPerBlock(N, N);
      MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
  }

Page 57: CUDA programming - Aalto University Wiki

Extending for Multiple Blocks

  // Kernel definition
  __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int j = blockIdx.y * blockDim.y + threadIdx.y;
      if (i < N && j < N)
          C[i][j] = A[i][j] + B[i][j];
  }

  int main()   // N*N matrix add
  {
      ...
      // Kernel invocation
      dim3 threadsPerBlock(16, 16);
      dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
      MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
  }

Page 58: CUDA programming - Aalto University Wiki

Function Type Qualifiers

__global__
  - A kernel function
  - Executed on GPU
  - Callable from CPU only (using <<< >>> launch)

__device__
  - GPU-local function
  - Executed on GPU
  - Callable from GPU only (using standard function call)

__host__
  - Default if nothing else specified
  - CPU-only function
  - Can be combined with __device__ to compile for both (a small example follows)
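A small example of combining the qualifiers; Square and SquareAll are made-up names, not from the slides:

  // The same helper compiled for both CPU and GPU, called here from a kernel.
  __host__ __device__ float Square(float x)
  {
      return x * x;
  }

  __global__ void SquareAll(float* data, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          data[i] = Square(data[i]);   // device-side call
  }

  // On the host, Square(3.0f) can be called like any ordinary C++ function.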

Page 59: CUDA programming - Aalto University Wiki

Variable Type Qualifiers

__device__
  - “Global” variable residing in GPU memory
  - Accessible from all threads

__shared__
  - Resides in SM shared memory space
  - Accessible from threads of the same block

  #define BLOCK_SIZE 16

  __global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
  {
      __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
      __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
      ...
  }

Page 60: CUDA programming - Aalto University Wiki

Using Shared Memory

- Threads in a block may operate on largely the same data
  - Convolution-like operations, matrix multiply, …

- Load the data once into shared memory, then operate on it
  - Share the loading between threads in the block

- Synchronization is important
  - Call __syncthreads() after reading the data to ensure that it is valid before starting any computation on it

(A tiled-load sketch follows.)
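A tiled-load sketch, assuming a hypothetical 3-point averaging kernel (Blur3) and a block size equal to the tile size; not from the slides:

  // A block loads a tile of the input into shared memory once, synchronizes,
  // then every thread averages three neighbors from the fast shared copy.
  // Assumes blockDim.x == TILE.
  #define TILE 256

  __global__ void Blur3(const float* in, float* out, int n)
  {
      __shared__ float tile[TILE + 2];                 // tile plus one halo element per side

      int i = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
      int t = threadIdx.x + 1;                         // position inside the tile (skip halo)

      // Cooperative load: each thread loads its own element, edge threads load the halo.
      tile[t] = (i < n) ? in[i] : 0.0f;
      if (threadIdx.x == 0)
          tile[0] = (i > 0) ? in[i - 1] : 0.0f;
      if (threadIdx.x == blockDim.x - 1)
          tile[TILE + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;

      __syncthreads();                                 // make sure the whole tile is loaded

      if (i < n)
          out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
  }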

Page 61: CUDA programming - Aalto University Wiki

Using Global Memory

- GPU memory needs to be allocated
  - cudaMalloc() and cudaFree()

- Data transfers must be done manually
  - cudaMemcpy()

Page 62: CUDA programming - Aalto University Wiki

  // Device code
  __global__ void VecAdd(float* A, float* B, float* C, int N)
  {
      int i = blockDim.x * blockIdx.x + threadIdx.x;
      if (i < N)
          C[i] = A[i] + B[i];
  }

  // Host code: h_* are host (CPU) memory pointers, d_* are device (GPU) memory pointers
  int VecAddWrapper(float* h_A, float* h_B, float* h_C, int N)
  {
      size_t size = N * sizeof(float);

      // Allocate vectors in device memory
      float* d_A;
      float* d_B;
      float* d_C;
      cudaMalloc(&d_A, size);
      cudaMalloc(&d_B, size);
      cudaMalloc(&d_C, size);

      // Copy vectors from host memory to device memory
      cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
      cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

      // Invoke kernel
      ...
      VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

      // Copy result from device memory to host memory
      cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

      // Free device memory
      cudaFree(d_A);
      cudaFree(d_B);
      cudaFree(d_C);
  }

Page 63: CUDA programming - Aalto University Wiki

Smart Use of Memory

- Avoid moving data around unnecessarily
  - Keep intermediate buffers on GPU

- Only transfer what you need
  - Don’t copy unchanged data again
  - Don’t copy unnecessary data

- Concurrent data transfer possible on latest devices
  - Needs host memory to be page-locked
  - Needs kernel execution to be non-blocking
  - Needs something useful to be done at the same time
  - Non-trivial, so do only if you know you need it (a small overlap sketch follows this list)
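A sketch of overlapping a transfer with independent kernel work, using page-locked memory and two streams; the function names and sizes are placeholders, and whether the copy and the kernel actually overlap depends on the device:

  #include <cstring>
  #include <cuda_runtime.h>

  __global__ void MyKernel(float* data, int n) { /* placeholder for useful work */ }

  void OverlappedTransfer(const float* src, int n, float* d_other, int m)
  {
      float* h_pinned;
      float* d_buf;
      cudaMallocHost(&h_pinned, n * sizeof(float));   // page-locked host memory
      cudaMalloc(&d_buf, n * sizeof(float));
      memcpy(h_pinned, src, n * sizeof(float));

      cudaStream_t copyStream, computeStream;
      cudaStreamCreate(&copyStream);
      cudaStreamCreate(&computeStream);

      // Asynchronous host-to-device copy in one stream...
      cudaMemcpyAsync(d_buf, h_pinned, n * sizeof(float),
                      cudaMemcpyHostToDevice, copyStream);

      // ...while a kernel works on unrelated data in another stream.
      MyKernel<<<(m + 255) / 256, 256, 0, computeStream>>>(d_other, m);

      cudaStreamSynchronize(copyStream);
      cudaStreamSynchronize(computeStream);
      cudaStreamDestroy(copyStream);
      cudaStreamDestroy(computeStream);
      cudaFreeHost(h_pinned);
      cudaFree(d_buf);
  }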

Page 64: CUDA programming - Aalto University Wiki

Error Checking

- All functions return an error code
  - cudaSuccess if the call was successful

- Also possible to check for the last error
  - cudaGetLastError() and cudaPeekAtLastError()

- Error strings available through the API
  - cudaGetErrorString()

- Checking errors of asynchronous operations is a little more complex, refer to the manual

(A checking-macro sketch follows.)
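A common error-checking macro idiom (not from the slides); CUDA_CHECK is a made-up name:

  // Wrap runtime API calls so failures are reported with file and line.
  #include <cstdio>
  #include <cstdlib>
  #include <cuda_runtime.h>

  #define CUDA_CHECK(call)                                                   \
      do {                                                                   \
          cudaError_t err = (call);                                          \
          if (err != cudaSuccess) {                                          \
              fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                      cudaGetErrorString(err), __FILE__, __LINE__);          \
              exit(EXIT_FAILURE);                                            \
          }                                                                  \
      } while (0)

  // Usage:
  //   CUDA_CHECK(cudaMalloc(&d_A, size));
  //   MyKernel<<<blocks, threads>>>(d_A);
  //   CUDA_CHECK(cudaGetLastError());    // catches launch configuration errors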

Page 65: CUDA programming - Aalto University Wiki

Manuals and Resources

- CUDA C Programming Guide
  - Best starting point
  - Describes the CUDA programming model, language extensions, built-in functions and types, etc.

- CUDA Toolkit Reference Manual
  - Documentation of host-side CUDA functions

- CUDA C Best Practices Guide
  - Information on improving performance and ensuring compatibility

Page 66: CUDA programming - Aalto University Wiki

Peeking Under the Hood

- NVCC compilation chain: CUDA C → PTX → SASS
  - CPU code is compiled by MSVC

- PTX is device-independent intermediate assembly
  - To export PTX: nvcc -ptx foo.cu
  - PTX is not supposed to be optimized, so don’t expect it to be

- SASS is device-specific low-level assembly
  - Compile: nvcc -cubin -arch=sm_<nn> foo.cu
  - Dump: cuobjdump -sass foo.cubin
  - SASS instruction sets are listed in the cuobjdump manual

Page 67: CUDA programming - Aalto University Wiki

Designing Parallel Algorithms

Page 68: CUDA programming - Aalto University Wiki

A Full Can of Worms

- Designing parallel algorithms is not trivial
  - Especially so for thousands of threads
  - Cannot rely on fine-grained global synchronization

- Active area of research (partially thanks to GPUs)
  - E.g., sorting performance going up all the time

- A highly parallel algorithm may need to do more work than a sequential one
  - Almost always a higher number of primitive operations
  - Or worse complexity, e.g., O(n log n) instead of O(n)
  - The price we have to pay for better performance

Page 69: CUDA programming - Aalto University Wiki

Data-Parallel Is Easy

- A data-parallel program does some computation for a large number of elements
  - All computations must be independent
  - There must be enough input to utilize the GPU properly

- Natural way to parallelize: one thread per output element
  - Convolution, Mandelbrot, etc.: thread = pixel
  - Ray tracing: thread = ray

- Boost performance by sharing data if possible

Page 70: CUDA programming - Aalto University Wiki

Everything Else

- Many interesting tasks are not data-parallel
  - Sorting, compression, variable-length data, etc.
  - Even simple stuff like finding the maximum element

- Hierarchical processing is often a good idea

- Split input into a number of chunks
  - Need enough chunks to utilize the GPU

- Do something per chunk to reduce problem size
  - Then process the remains on CPU or continue on GPU

Page 71: CUDA programming - Aalto University Wiki

Example 1: Find Maximum

- Let’s say we have 100 million elements in an array

- Split into 100000 chunks with 1000 elements each
  - To find the best performance, experiment with the numbers

- Process one chunk of 1000 elements per thread
  - Find and output the maximum

- Now we have 100000 elements, repeat
  - Utilization is bad from here onward, but the heavy part was parallelized successfully
  - Or: download the 100K elements to the CPU and process them there

(A per-chunk kernel sketch follows.)
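A sketch of the per-chunk pass described above; ChunkMax and CHUNK are made-up names, and the contiguous per-thread chunks follow the slide's description rather than the coherent strided layout recommended earlier:

  // Each thread scans one chunk of CHUNK elements and writes that chunk's
  // maximum; the output array is then reduced again, or finished on the CPU.
  #define CHUNK 1000

  __global__ void ChunkMax(const float* in, int n, float* out, int numChunks)
  {
      int chunk = blockIdx.x * blockDim.x + threadIdx.x;
      if (chunk >= numChunks)
          return;

      int begin = chunk * CHUNK;
      int end   = min(begin + CHUNK, n);

      float best = in[begin];
      for (int i = begin + 1; i < end; ++i)   // same trip count for all full chunks
          best = fmaxf(best, in[i]);

      out[chunk] = best;
  }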

Page 72: CUDA programming - Aalto University Wiki

Example 2: Cumulative Sum

- Add to each element of an array the sum of its predecessors

- Can be done using n summations
  - Looks like an inherently serial algorithm

  Input:   1  4  0  3  4
  Output:  1  5  5  8 12

- Each summation requires that the previous one has completed. Impossible to parallelize?! Not at all.

Page 73: CUDA programming - Aalto University Wiki

Parallel Cumulative Sum

- Let’s suppose we have 2000000 elements and 1000 threads

(Figure:)
- Slice the input into 1000 segments in0 … in999, each 2000 elements long
- Calculate the cumulative sum of each segment in parallel, giving out0 … out999
- Take the last element of each output segment, giving “sums” (1000 elements)

Page 74: CUDA programming - Aalto University Wiki

Parallel Cumulative Sum

- Let’s suppose we have 2000000 elements and 1000 threads

(Figure, continued:)
- Take the last element of each output segment, giving “sums”
- Compute the cumulative sum over these, giving “Σsums”
- Add the result to the output segments … and we’re done!

Page 75: CUDA programming - Aalto University Wiki

Parallel Cumulative Sum

- 1st pass: process each input segment
  - Perfectly parallelized, perfectly coherent memory access

- 2nd pass: cumulative sum over the last elements
  - Parallelizes badly, but very small amount of work

- 3rd pass: add the bias to every output element
  - Perfectly parallelized, perfectly coherent memory access

- Need two additions per element, but still O(n)

(A three-pass sketch follows.)
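A sketch of the three passes, assuming the segment sizes from the example; the names are made up, and for simplicity the access pattern is the plain contiguous one rather than the coherent strided layout the slides recommend:

  #define SEGMENTS 1000
  #define SEG_LEN  2000

  // Pass 1: each thread computes an inclusive cumulative sum of its own segment
  // and records the segment total.
  __global__ void ScanSegments(const float* in, float* out, float* sums)
  {
      int seg = blockIdx.x * blockDim.x + threadIdx.x;
      if (seg >= SEGMENTS) return;

      int base  = seg * SEG_LEN;
      float acc = 0.0f;
      for (int i = 0; i < SEG_LEN; ++i)
      {
          acc += in[base + i];
          out[base + i] = acc;
      }
      sums[seg] = acc;
  }

  // Pass 2: cumulative sum over the 1000 segment totals (tiny, serial).
  __global__ void ScanSums(float* sums)
  {
      if (threadIdx.x == 0 && blockIdx.x == 0)
          for (int i = 1; i < SEGMENTS; ++i)
              sums[i] += sums[i - 1];
  }

  // Pass 3: add the sum of all preceding segments as a bias to every element.
  __global__ void AddBias(float* out, const float* sums)
  {
      int seg = blockIdx.x * blockDim.x + threadIdx.x;
      if (seg == 0 || seg >= SEGMENTS) return;   // segment 0 needs no bias

      float bias = sums[seg - 1];
      int base   = seg * SEG_LEN;
      for (int i = 0; i < SEG_LEN; ++i)
          out[base + i] += bias;
  }

  // Launch sketch:
  //   ScanSegments<<<(SEGMENTS + 255) / 256, 256>>>(d_in, d_out, d_sums);
  //   ScanSums<<<1, 1>>>(d_sums);
  //   AddBias<<<(SEGMENTS + 255) / 256, 256>>>(d_out, d_sums);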

Page 76: CUDA programming - Aalto University Wiki

Wrapping Up

Page 77: CUDA programming - Aalto University Wiki

Takeaways

- It’s very easy to get started
  - Just C++, some extra work needed for managing memory

- But there’s plenty of room for creativity when striving for performance
  - Low-level optimizations, data sharing, algorithmic improvements, concurrent processing, …
  - Profiling tools available

- Scalable code will be fast on future hardware as well
  - Basically just more blocks running concurrently

Page 78: CUDA programming - Aalto University Wiki

Thank You