CUDA programming - Aalto University Wiki
NVIDIA Research
! What is the world going to look like, what should our hardware look like in 5–10 years? ! And how do we get there?
! Engage and participate in the academic community
! ~30 researchers around the globe (4 in Helsinki)
! See http://research.nvidia.com
Today
! Brief history of GPU programming
! CUDA programming model ! Writing CUDA programs
! Designing parallel algorithms
Motivation for GPU Programming
! This thing packs a lot of oomph
! How to tap into that?
Early Days: GPGPU (bad!)
! General-Purpose GPU programming ! The craze around 2004 – 2006
! Trick the GPU into general-purpose computing by casting problem as graphics ! Turn data into images (textures) ! Turn algorithms into image synthesis (rendering passes)
! Many attempts to handle these automatically ! Brook, Sh, PeakStream, MS Accelerator, … ! Take a “program”, somehow convert to shaders
Problems with GPGPU
! Highly constrained memory access model ! No scatter, no read/write access
! Split computation into highly constrained passes ! Limited by what shaders can do
! Tough learning curve
! To understand limitations, must understand graphics HW ! Need crazy stunts to circumvent rigidity of hardware
! Overhead of graphics API
GPGPU: An Illustrated Guide
Using graphics API to express programs
Designing GPGPU algorithms
The Road to CUDA
! Okay, this GPGPU thing has potential ! The only problem is that it sucks
! Let’s design the right tool for the job
! Need new hardware capabilities? Build it. ! We are a hardware company, after all
! Need a better API for poking the GPU? Ok.
! Don’t invent a new language
CUDA Design Goals
! Heterogeneous CPU/GPU computing platform
! Easy to program ! Also, easy to integrate GPU code into existing programs
! Close enough to hardware to get best performance ! For those who know what they’re doing
Some Ingredients of CUDA
! SIMT execution model ! Single Instruction, Multiple Thread ! Lets you write scalar code instead of explicit SIMD
! Ways to exploit locality ! Warp and block execution model ! Shared memory
! Direct memory access
! C/C++ with minimal extensions
To Whet Your Appetite
146X: Interactive visualization of volumetric white matter connectivity
36X: Ionic placement for molecular dynamics simulation on GPU
19X: Transcoding HD video stream to H.264
17X: Fluid mechanics in Matlab using .mex file CUDA function
100X: Astrophysics N-body simulation
149X: Financial simulation of LIBOR model with swaptions
47X: GLAME@lab: an M-script API for GPU linear algebra
20X: Ultrasound medical imaging for cancer diagnostics
24X: Highly optimized object oriented molecular dynamics
30X: Cmatch exact string matching to find similar proteins and gene sequences
CUDA Programming Model
Programmer’s View of Hardware
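[Figure: the CPU and its CPU memory (DRAM) connect over the PCIe bus to the GPU. The GPU contains several SMs, each with its own L1 cache, a shared L2 cache, and GPU memory (DRAM).]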
Threads, Warps, Blocks
! A thread in CUDA executes scalar code ! Very much like a usual CPU program
! Hardware packs threads into warps ! Crucial for efficient execution ! Programmer can ignore warps (but you shouldn’t)
! Threads are logically grouped into blocks ! Threads in the same block can communicate and synchronize efficiently
Programmer’s View of SM
[Figure: one SM with a row of cores and a pool of resident warps, Warp 0 … Warp n. Each warp holds the state of its threads ("thr") plus a single PC.]
Each thread has otherwise independent state, but it shares its PC with the other threads of its warp
Programmer’s View of SM: Execution
[Figure, shown as a sequence of frames over the same SM diagram: the instruction r1 = r2 * r3 is executed for one warp. The cores read r2 and r3 from the thread states, then write the result to r1.]
Programmer’s View of SM: Blocks
[Figure, two frames over the same SM diagram: one frame highlights a CUDA thread block, a group of warps with its own shared memory; the other highlights another thread block with its own shared memory.]
Note: Blocks are formed on the fly from the available warps (they don’t need to be consecutive)
Implications
! All threads in a warp always execute concurrently ! Same PC, same instruction ! You can exploit this if you’re careful!
! But warps in a block are scheduled irregularly ! Hence, threads of a block are not implicitly synchronized ! But they are always in the same SM, and can synchronize efficiently and communicate through shared memory
! Blocks are instantiated in SMs that have space ! No way of knowing which blocks end up in which SMs ! Key to good load balancing
Occupancy
! How many warps can fit in one SM depends on resource usage ! Number of registers / thread ! Amount of shared memory / block
! Block size matters too ! Work is always launched in full blocks ! Number of blocks / SM also limited
! Occupancy = percentage of thread slots used ! Handy occupancy calculator spreadsheet available ! Directly affects latency hiding capability
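Besides the spreadsheet, occupancy can also be estimated from code. The following is a minimal sketch, assuming a CUDA toolkit recent enough to provide cudaOccupancyMaxActiveBlocksPerMultiprocessor() (newer than the Fermi-era toolkits these slides describe). The kernel scale and the helper occupancyFor are hypothetical names used only for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, present only so that there is something to query.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Fraction of one SM's thread slots that resident blocks of `scale` would
// fill at the given block size on the current device, or -1.0 on error.
// The figure is most meaningful when blockSize is a multiple of the warp size.
double occupancyFor(int blockSize)
{
    int device = 0;
    int blocksPerSM = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, device) != cudaSuccess ||
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, scale, blockSize, 0) != cudaSuccess)
        return -1.0;

    return double(blocksPerSM) * blockSize / prop.maxThreadsPerMultiProcessor;
}
```

Calling occupancyFor() for a few candidate block sizes (for example 64, 128, 256) gives a quick feel for how block size interacts with the register and shared memory limits listed above.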
Synchronization
! Threads can specify a synchronization point ! __syncthreads() intrinsic ! This prevents a warp from being scheduled until all warps in the same block have arrived at the sync point ! Very lightweight mechanism
! Atomic operations can be used for avoiding race conditions globally ! E.g., append to an array with atomicAdd()
! Implicit synchronization between launches ! Unless asynchronous operation is explicitly allowed
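The "append to an array with atomicAdd()" idea can be sketched as follows. This is an illustration under assumptions, not code from the talk: the kernel appendPositive and the host helper compactPositive are hypothetical names, n is assumed to be positive, and most error handling is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Each thread inspects one input element and, if it is positive, reserves
// a unique output slot by atomically incrementing a global counter.
__global__ void appendPositive(const int* in, int n, int* out, int* count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) {
        int slot = atomicAdd(count, 1);  // returns the counter value before the add
        out[slot] = in[i];
    }
}

// Hypothetical host helper. h_out must have room for n elements (n > 0).
// Returns the number of appended elements, or -1 if a CUDA call failed.
int compactPositive(const int* h_in, int n, int* h_out)
{
    int *d_in = nullptr, *d_out = nullptr, *d_count = nullptr;
    int h_count = -1;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&d_in, bytes) == cudaSuccess &&
        cudaMalloc(&d_out, bytes) == cudaSuccess &&
        cudaMalloc(&d_count, sizeof(int)) == cudaSuccess) {
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
        cudaMemset(d_count, 0, sizeof(int));
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        appendPositive<<<blocks, threads>>>(d_in, n, d_out, d_count);
        // The blocking copies below also wait for the kernel to finish.
        if (cudaMemcpy(&h_count, d_count, sizeof(int),
                       cudaMemcpyDeviceToHost) != cudaSuccess)
            h_count = -1;
        else
            cudaMemcpy(h_out, d_out, h_count * sizeof(int),
                       cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_count);
    return h_count;
}
```

The order of the appended elements depends on how the warps happen to be scheduled, so only the set of outputs is deterministic, not their positions.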
SIMT Execution Model
! How can threads of a warp diverge if they all have the same PC?
! Partial solution: Per-instruction execution predication
! Full solution: Hardware-supported execution mask, execution stack, and related instructions
Example: Instruction Predication
if (a < 10) small++;
else big++;
ISETP.GT.AND P0, pt, R6, 0x9, pt;   // Set predicate register P0 if a > 9
@!P0 IADD R5, R5, 0x1;              // If P0 is cleared, R5 = R5 + 1
@P0 IADD R4, R4, 0x1;               // If P0 is set, R4 = R4 + 1
What About Complex Cases?
! Nested if-else blocks, loops, recursion …
! Need hardware execution mask and execution stack
Non-Predicated Example
if (a < 10) foo();
else bar();
/*0048*/ ISETP.GT.AND P0, pt, R4, 0x9, pt;
/*0050*/ @P0 BRA 0x70;
/*0058*/ ...;          // if branch
/*0060*/ ...;
/*0068*/ BRA 0x80;
/*0070*/ ...;          // else branch
/*0078*/ ...;
/*0080*/ continue here after the if-block
Case 1: All threads take the if branch. No thread wants to jump at 0x50.
Case 2: All threads take the else branch. All threads want to jump at 0x50.
Case 3: Some threads take the if branch, some take the else branch. Some threads want to jump at 0x50: push. Later: pop, and restore the active thread mask.
Benefits of SIMT
! Supports all structured C++ constructs ! If/else, switch/case, loops, function calls, exceptions ! goto is a different beast – supported, but best to avoid
! Multi-level constructs handled efficiently ! Break/continue from inside multiple levels of conditionals ! Function return from inside loops and conditionals ! Retreating to exception handler from anywhere
! You only need to care about SIMT when tuning for performance
Some Consequences of SIMT
! An if statement takes the same number of cycles for any number of threads > 0 ! If nobody participates it’s cheap ! Also, masked-out threads don’t do memory accesses
! A loop is iterated until all active threads in the warp are done
! A warp stays alive until every thread in it has terminated ! Terminated threads cause “empty slots” in warps ! Thread utilization = percentage of active threads
Coherent Execution Is Great
! An if statement is perfectly efficient if either everyone takes it or nobody does ! All threads stay active
! A loop is perfectly efficient if everyone does the same number of iterations
! Note: These are required for traditional SIMD
Incoherent Execution Is Okay
! Conditionals are efficient as long as threads usually agree
! Loops are efficient if threads usually take roughly the same number of iterations
! Much easier to program than explicit SIMD ! SIMT: Incoherence is supported, performance degrades gracefully if control diverges ! SIMD: performance is fixed, incoherence not supported
Striving for Execution Coherence
! Learn to spot low-hanging fruit for improving execution coherence
! Process input in coherent order ! E.g., process nearby pixels of an image together
! Fold branches together as much as possible ! Only put the differing part in a conditional
! Simple low-level fixes ! Favor [f]min / [f]max over conditionals ! Bitwise operators sometimes help
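As a small illustration of favoring [f]min / [f]max over conditionals, here is a sketch of a clamp written both ways. The function and kernel names are hypothetical. For a branch this small the compiler may well generate predicated code for the branchy form too, so treat this as a pattern to recognize rather than a guaranteed speedup.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Branchy clamp: threads of a warp may disagree on which path to take.
__host__ __device__ inline float clampBranchy(float x, float lo, float hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

// Same result for non-NaN x and lo <= hi, written with fmaxf/fminf so
// that every thread executes the same instruction sequence.
__host__ __device__ inline float clampMinMax(float x, float lo, float hi)
{
    return fminf(fmaxf(x, lo), hi);
}

// Hypothetical kernel applying the branch-free form to every element.
__global__ void clampAll(float* data, int n, float lo, float hi)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = clampMinMax(data[i], lo, hi);
}
```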
Memory in CUDA, part 1
! Global memory ! Accessible from everywhere, including CPU (memcpy) ! Requests go through L1, L2, DRAM
! Shared memory ! Either 16 or 48 KB per SM in Fermi ! Pieces allocated to thread blocks when launched ! Accessible from threads in the same block ! Requests served directly, very fast
! Local memory ! Actually a thread-local portion of global memory ! Used for register spilling and indexed arrays
Memory in CUDA, part 2
! Textures ! Data can also be fetched from DRAM through texture units ! Separate texture caches ! High latency, extreme pipelining capability ! Read-only
! Surfaces ! Read / write access with pixel format conversions ! Useful for integrating with graphics
! Constants ! Coherent and frequent access of same data
Simplified
! Global memory ! Almost all data access goes here, you will need this
! Shared memory ! Use to share data between threads
! Textures ! Use to accelerate data fetching
! Local memory, constants, surfaces ! Let’s ignore for now, details can be found in manuals
Memory Access Coherence
! GPU memory buses are wide ! Both external and internal
! When a warp executes a memory instruction, the addresses matter a lot ! Those that land on the same cache line are served together ! Different cache lines are served sequentially
! This can have a huge impact on performance ! Easy to accidentally burden the memory system ! Incoherent access also easily overflows caches
Improving Memory Coherence
! Try to access nearby addresses from nearby threads
! If each thread processes just one element, choose wisely which one
! If each thread processes multiple elements, preferably use striding
Striding Example
! We want each thread to process 10 elements of an array ! 64 threads per block
No striding (bad access pattern), accesses listed left to right in time order
Thread 0: 0 1 2 3 4 5 6 7 8 9
Thread 1: 10 11 12 13 14 15 16 17 18 19
..
Thread 63: 630 631 632 633 634 635 636 637 638 639
With stride of 64 (optimal access pattern)
Thread 0: 0 64 128 192 256 320 384 448 512 576
Thread 1: 1 65 129 193 257 321 385 449 513 577
..
Thread 63: 63 127 191 255 319 383 447 511 575 639
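The strided pattern can be written as a kernel along these lines. This is a sketch under assumptions: the names doubleStrided and doubleOnGpu are hypothetical, the per-element work (doubling) is a placeholder, and n is assumed to be positive.

```cuda
#include <cuda_runtime.h>

#define ELEMS_PER_THREAD 10

// Each block covers blockDim.x * ELEMS_PER_THREAD consecutive elements.
// In step k, thread t touches element base + k * blockDim.x + t, so at any
// given step the threads of a warp access adjacent addresses.
__global__ void doubleStrided(float* data, int n)
{
    int base = blockIdx.x * blockDim.x * ELEMS_PER_THREAD;
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int i = base + k * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }
}

// Hypothetical host helper: doubles h_data in place using 64 threads per
// block, as in the example above. Returns false if a CUDA call failed.
bool doubleOnGpu(float* h_data, int n)
{
    float* d = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, h_data, bytes, cudaMemcpyHostToDevice);
    const int threads = 64;
    const int perBlock = threads * ELEMS_PER_THREAD;
    int blocks = (n + perBlock - 1) / perBlock;
    doubleStrided<<<blocks, threads>>>(d, n);
    cudaError_t err = cudaMemcpy(h_data, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}
```

With 64 threads per block, thread 0 of block 0 visits elements 0, 64, 128, …, 576, which matches the strided table.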
Launching Work in CUDA
! Kernel = function running on GPU ! Written in CUDA C
! A kernel is launched for a grid of blocks ! The blocks and the grid can be 1D, 2D or 3D ! The extra dimensions are really just syntactic sugar but convenient if the data lives in a 2D or 3D domain
! Every thread gets to know its ! Thread location within the block (threadIdx) ! Block location within the grid (blockIdx) ! Block and grid dimensions (blockDim, gridDim)
Example
! Each block has 8×8 threads ! So, 64 threads in a block = 2 warps
! We launch a grid of 10×5 blocks ! So, 50 blocks in total
Example thread: threadIdx.x = 1, threadIdx.y = 1; blockIdx.x = 9, blockIdx.y = 0; blockDim.x = 8, blockDim.y = 8; gridDim.x = 10, gridDim.y = 5
What’s with the Blocks?
! Why did we have blocks again, instead of just a flat gigantic grid of threads? ! Because a block can be guaranteed to be localized ! Launched at the same time, in the same SM
! Threads of a block have the same shared memory ! Load common data together and work on it
! Threads of a block can synchronize efficiently ! Synchronization points in code
! Individual blocks must be truly independent
Writing CUDA Programs
Two APIs
! CUDA can be used through two APIs
! Driver API ! Low-level API ! GPU code compiled separately into binaries ! CPU code manually loads GPU code and invokes it
! Runtime API (the one covered in this talk) ! User-friendly high-level API ! Language extensions for launching kernels ! Compiler automatically splits code into GPU and CPU parts, compiles them separately, and links together
Defining and Launching Kernels
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() // N-length vector add
{
    ...
    // Kernel invocation with N threads.
    // First launch parameter: number of blocks.
    // Second launch parameter: threads per block.
    // In general, these are of type dim3.
    VecAdd<<<1, N>>>(A, B, C);
}
Defining and Launching Kernels (2D)
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main() // N*N matrix add
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
Extending for Multiple Blocks
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main() // N*N matrix add
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
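The grid computation N / threadsPerBlock.x only covers the whole matrix when N is a multiple of the block size. A common variation, sketched here under assumptions, rounds the grid size up and relies on the bounds check in the kernel to mask the excess threads. The names matAddFlat and gridFor are hypothetical, and the matrices are assumed to be stored row-major behind flat pointers so that n can be a run-time value.

```cuda
#include <cuda_runtime.h>

// Matrix add on row-major n x n arrays stored behind flat pointers.
// threadIdx.x maps to the column so that neighboring threads touch
// neighboring addresses.
__global__ void matAddFlat(const float* A, const float* B, float* C, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n)
        C[row * n + col] = A[row * n + col] + B[row * n + col];
}

// Rounds the grid size up so that partial tiles at the matrix edges are
// covered too. Assumes n > 0.
dim3 gridFor(int n, dim3 block)
{
    return dim3((n + block.x - 1) / block.x,
                (n + block.y - 1) / block.y);
}
```

A launch would then look like matAddFlat<<<gridFor(n, block), block>>>(dA, dB, dC, n), where block is for example dim3(16, 16) and dA, dB, dC are device pointers allocated elsewhere.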
Function Type Qualifiers
__global__ ! A kernel function ! Executed on GPU ! Callable from CPU only (using <<< >>> launch)
__device__ ! GPU-local function ! Executed on GPU ! Callable from GPU only (using standard function call)
__host__ ! Default if nothing else specified ! CPU-only function ! Can be combined with __device__ to compile for both
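The three qualifiers can be seen side by side in a small sketch. All function names here are hypothetical, and error handling is reduced to a single return value.

```cuda
#include <cuda_runtime.h>

// Compiled for both CPU and GPU.
__host__ __device__ inline float square(float x) { return x * x; }

// GPU-only helper, callable from kernels and other __device__ functions.
__device__ float squarePlusOne(float x) { return square(x) + 1.0f; }

// Kernel: launched from the CPU with <<< >>>, executed on the GPU.
__global__ void apply(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = squarePlusOne(data[i]);
}

// Plain host function (implicitly __host__). Assumes n > 0.
// Returns false if a CUDA call failed.
bool applyOnGpu(float* h_data, int n)
{
    float* d = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, h_data, bytes, cudaMemcpyHostToDevice);
    apply<<<(n + 127) / 128, 128>>>(d, n);
    cudaError_t err = cudaMemcpy(h_data, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}
```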
Variable Type Qualifiers
__device__ ! “Global” variable residing in GPU memory ! Accessible from all threads
__shared__ ! Resides in SM shared memory space ! Accessible from threads of the same block
#define BLOCK_SIZE 16

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    ...
}
Using Shared Memory
! Threads in a block may operate on largely the same data ! Convolution-like operations, matrix multiply, …
! Load the data once into shared memory, then operate on it ! Share the loading between threads in the block
! Synchronization is important ! Call __syncthreads() after reading the data to ensure that it is valid before starting any computation on it
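The load, synchronize, then compute pattern can be shown with a deliberately simple task: reversing each block-sized chunk of an array. This is a sketch, not code from the talk. The names reverseChunks and reverseOnGpu are hypothetical, and n is assumed to be a positive multiple of BLOCK_SIZE.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 64

// Reverses each BLOCK_SIZE-element chunk of `data` in place.
// Must be launched with BLOCK_SIZE threads per block, one block per chunk.
__global__ void reverseChunks(float* data)
{
    __shared__ float tile[BLOCK_SIZE];
    int base = blockIdx.x * BLOCK_SIZE;
    int t = threadIdx.x;

    tile[t] = data[base + t];  // every thread loads one element
    __syncthreads();           // wait until the whole tile has been loaded
    // Now it is safe to read an element that another thread loaded.
    data[base + t] = tile[BLOCK_SIZE - 1 - t];
}

// Hypothetical host helper. n must be a positive multiple of BLOCK_SIZE.
// Returns false if a CUDA call failed.
bool reverseOnGpu(float* h_data, int n)
{
    float* d = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, h_data, bytes, cudaMemcpyHostToDevice);
    reverseChunks<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d);
    cudaError_t err = cudaMemcpy(h_data, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}
```

Without the __syncthreads() call, a thread could read a tile entry that a thread in another warp of the same block has not written yet.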
Using Global Memory
! GPU memory needs to be allocated ! cudaMalloc() and cudaFree()
! Data transfers must be done manually ! cudaMemcpy()
// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
// h_A, h_B, h_C: host (CPU) memory pointers
// d_A, d_B, d_C: device (GPU) memory pointers
int VecAddWrapper(float* h_A, float* h_B, float* h_C, int N)
{
    size_t size = N * sizeof(float);

    // Allocate vectors in device memory
    float* d_A;
    float* d_B;
    float* d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel
    ...
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device memory to host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    return 0; // the function is declared int, so it must return a value
}
Smart Use of Memory
! Avoid moving data around unnecessarily ! Keep intermediate buffers on GPU
! Only transfer what you need
! Don’t copy unchanged data again ! Don’t copy unnecessary data
! Concurrent data transfer possible on latest devices ! Needs host memory to be page-locked ! Needs kernel execution to be non-blocking ! Needs something useful to be done at the same time ! Non-trivial, so do only if you know you need it
Error Checking
! All functions return an error code ! cudaSuccess if the call was successful
! Also possible to check for last error ! cudaGetLastError() and cudaPeekAtLastError()
! Error strings available through the API ! cudaGetErrorString()
! Checking errors of asynchronous operations is a little more complex, refer to manual
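One common way to apply these calls is to wrap every runtime call in a small checking macro. The sketch below is an assumption about how this might be organized, not the talk's own code: CUDA_TRY, fill and fillOnGpu are hypothetical names, and on failure the helper simply returns without freeing the device buffer, which a real program would want to handle.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical convenience macro: evaluates a runtime call and, on
// failure, prints the error string and returns false from the enclosing
// function.
#define CUDA_TRY(call)                                               \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            return false;                                            \
        }                                                            \
    } while (0)

__global__ void fill(int* data, int value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Fills h_data with `value` via the GPU. Assumes n > 0.
bool fillOnGpu(int* h_data, int value, int n)
{
    int* d = nullptr;
    CUDA_TRY(cudaMalloc(&d, n * sizeof(int)));
    fill<<<(n + 255) / 256, 256>>>(d, value, n);
    CUDA_TRY(cudaGetLastError());       // errors detected at launch time
    CUDA_TRY(cudaDeviceSynchronize());  // errors raised while the kernel ran
    CUDA_TRY(cudaMemcpy(h_data, d, n * sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_TRY(cudaFree(d));
    return true;
}
```

A kernel launch returns nothing, which is why the launch is followed by cudaGetLastError(), and why a synchronizing call is needed to surface errors that occur during execution.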
Manuals and Resources
! CUDA C Programming Guide ! Best starting point ! Describes CUDA programming model, language extensions, builtin functions and types, etc.
! CUDA Toolkit Reference Manual ! Documentation of host-side CUDA functions
! CUDA C Best Practices Guide ! Information on improving performance and ensuring compatibility
Peeking Under the Hood
! NVCC compilation chain: CUDA C → PTX → SASS ! CPU code is compiled by MSVC
! PTX is device-independent intermediate assembly ! To export PTX: nvcc -ptx foo.cu ! PTX is not supposed to be optimized, so don’t expect it to be
! SASS is device-specific low-level assembly ! Compile: nvcc -cubin -arch=sm_<nn> foo.cu ! Dump: cuobjdump -sass foo.cubin ! SASS instruction sets in cuobjdump manual
Designing Parallel Algorithms
A Full Can of Worms
! Designing parallel algorithms is not trivial ! Especially so for thousands of threads ! Cannot rely on fine-grained global synchronization
! Active area of research (partially thanks to GPUs) ! E.g., sorting performance going up all the time
! A highly parallel algorithm may need to do more work than a sequential one ! Almost always a higher number of primitive operations ! Or worse complexity, e.g., O(n log n) instead of O(n) ! The price we have to pay for better performance
Data-Parallel Is Easy
! A data-parallel program does some computation for a large number of elements ! All computations must be independent ! There must be enough input to utilize GPU properly
! Natural way to parallelize: One thread per output element ! Convolution, Mandelbrot, etc.: thread = pixel ! Ray tracing: thread = ray
! Boost performance by sharing data if possible
Everything Else
! Many interesting tasks are not data-parallel ! Sorting, compression, variable-length data, etc. ! Even simple stuff like finding the maximum element
! Hierarchical processing is often a good idea
! Split input into a number of chunks ! Need enough chunks to utilize the GPU
! Do something per chunk to reduce problem size ! Then process the remains on CPU or continue on GPU
Example 1: Find Maximum
! Let’s say we have 100 million elements in an array
! Split into 100000 chunks with 1000 elements each ! To find best performance, experiment with the numbers
! Process one chunk of 1000 elements per thread ! Find and output the maximum
! Now we have 100000 elements, repeat ! Utilization bad from here forward, but the heavy part was parallelized successfully ! Or: download the 100K elements to CPU, process there
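The chunked maximum can be sketched as one GPU pass followed by a CPU pass over the per-chunk results (the "download to CPU" option). The names chunkMax and maxOnGpu are hypothetical, n and chunk are assumed to be positive, and error checking is omitted for brevity. The sketch follows the description literally, with each thread reading consecutive elements; a strided layout would give more coherent memory access.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// One thread scans one chunk of `chunk` consecutive elements and writes
// that chunk's maximum to out[c].
__global__ void chunkMax(const float* in, int n, int chunk,
                         float* out, int numChunks)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numChunks) return;
    int begin = c * chunk;
    int end = min(begin + chunk, n);
    float m = in[begin];
    for (int i = begin + 1; i < end; ++i) m = fmaxf(m, in[i]);
    out[c] = m;
}

// Hypothetical host helper: GPU pass over the chunks, then the remaining
// numChunks values are reduced on the CPU.
float maxOnGpu(const float* h_in, int n, int chunk)
{
    int numChunks = (n + chunk - 1) / chunk;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, numChunks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (numChunks + threads - 1) / threads;
    chunkMax<<<blocks, threads>>>(d_in, n, chunk, d_out, numChunks);

    std::vector<float> partial(numChunks);
    cudaMemcpy(partial.data(), d_out, numChunks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return *std::max_element(partial.begin(), partial.end());
}
```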
Example 2: Cumulative Sum
! Add to each element of an array the sum of its predecessors
! Can be done using n summations ! Looks like an inherently serial algorithm
Input: 1 4 0 3 4
Output: 1 5 5 8 12
Each summation requires that the previous one has completed. Impossible to parallelize?! Not at all.
Parallel Cumulative Sum
! Let’s suppose we have 2000000 elements and 1000 threads
! Slice input into 1000 segments (in0, in1, …, in999), each 2000 elements long
! Calculate cumulative sum for each segment in parallel, giving out0, out1, …, out999
! Take the last element of each output segment: sums (1000 elements)
! Compute cumulative sum over these: Σsums
! Add result to output segments out0, out1, …, out999
! … and we’re done!
Parallel Cumulative Sum
! 1st pass: process each input segment ! Perfectly parallelized, perfectly coherent memory access
! 2nd pass: cumulative sum over last elements ! Parallelizes badly, but very small amount of work
! 3rd pass: add bias to every output element ! Perfectly parallelized, perfectly coherent memory access
! Need two additions per element, but still O(n)
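The three passes can be sketched as two kernels with a small CPU step in between. This is an illustration under assumptions, not the talk's implementation: all names are hypothetical, n and segLen are assumed to be positive, error checking is omitted, and the second pass is done on the CPU as an exclusive sum so that each segment receives the total of the segments before it. The first pass assigns one contiguous segment per thread as in the description and does not try to make its memory access coherent.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Pass 1: each thread computes the cumulative sum of its own segment in
// place and records the segment total.
__global__ void scanSegments(int* data, int n, int segLen,
                             int* sums, int numSegs)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numSegs) return;
    int begin = s * segLen;
    int end = min(begin + segLen, n);
    int running = 0;
    for (int i = begin; i < end; ++i) {
        running += data[i];
        data[i] = running;
    }
    sums[s] = running;
}

// Pass 3: add to every element the total of all preceding segments.
__global__ void addOffsets(int* data, int n, int segLen, const int* offsets)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += offsets[i / segLen];
}

// Hypothetical host helper: in-place cumulative sum of h_data.
void cumulativeSumOnGpu(int* h_data, int n, int segLen)
{
    int numSegs = (n + segLen - 1) / segLen;
    int *d_data = nullptr, *d_offsets = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_offsets, numSegs * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    scanSegments<<<(numSegs + threads - 1) / threads, threads>>>(
        d_data, n, segLen, d_offsets, numSegs);

    // Pass 2 (on the CPU): exclusive cumulative sum over the segment totals.
    std::vector<int> sums(numSegs);
    cudaMemcpy(sums.data(), d_offsets, numSegs * sizeof(int),
               cudaMemcpyDeviceToHost);
    int running = 0;
    for (int s = 0; s < numSegs; ++s) {
        int total = sums[s];
        sums[s] = running;
        running += total;
    }
    cudaMemcpy(d_offsets, sums.data(), numSegs * sizeof(int),
               cudaMemcpyHostToDevice);

    addOffsets<<<(n + threads - 1) / threads, threads>>>(
        d_data, n, segLen, d_offsets);
    cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_offsets);
}
```

For the input 1 4 0 3 4 with a segment length of 2, pass 1 yields the segments (1 5), (0 3), (4) with totals 5, 3, 4; pass 2 turns the totals into offsets 0, 5, 8; pass 3 produces 1 5 5 8 12.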
Wrapping Up
Takeaways
! It’s very easy to get started ! Just C++, some extra work needed for managing memory
! But there’s plenty of room for creativity when striving for performance ! Low-level optimizations, data sharing, algorithmic improvements, concurrent processing, … ! Profiling tools available
! Scalable code will be fast on future hardware as well ! Basically just more blocks running concurrently
Thank You