Post on 29-Dec-2015
OV
ER
VIE
WLecture Overview
Introduction To GPGPU
CUDA Architecture
Programming with CUDA
Maximising Performance
Results
Memory Model
INTR
OD
UC
TIO
N T
O G
PG
PU
Introduction
What is a GPU?A GPU is essentially a massively parallel coprocessor, primarily
utilised for real-time rendering of complex graphical scenes.
What is CUDA?Compute Unified Device Architecture.A highly parallel architecture for General Processing on a GPU (GPGPU).
Benefits:• Commodity hardware.• Negligible CPU overhead.• Astoundingly fast, when done right.
What should I use it for?Highly parallelisable processing of very large datasets.
What shouldn’t I use it for?• Sequential processing tasks (obviously).• Context sensitive data-set processing.• Small data-sets.
INTR
OD
UC
TIO
N T
O G
PG
PU
Brief History of GPGPU
• Pre-1999 – Early graphics cards.
• 1999 – Geforce256, the first GPU.
• 2001 – nfiniteFX and custom programmable shaders.
• 2003 – Stanford Brook, the first GPGPU language.
• 2006 – DirectX10 and the Unified Shading Model.
• 2007 – NVIDIA CUDA and AMD (Fire)Stream released.
• 2009 – OpenCL & DirectCompute.
• 2011 – AMD Accelerated Parallel Processing (APP) SDK.
INTR
OD
UC
TIO
N T
O G
PG
PU
CPU vs. GPU (Performance)
Jun2008
May2007
Nov2006
Mar2006
Jun2005
Apr2004
Jan2003
Jun2003
3.0 GHzCore2 Duo
3.2 GHzHarpertown
GT200
G92G80Ultra
G80
G71
G70NV40
NV35NV30
G80Ultra
G80
G71
NV40
NV30HarpertownWoodcrest
Prescott EENorthwood
INTR
OD
UC
TIO
N T
O G
PG
PU
Currently….
Core i7 980 XE Geforce GTX 590 Radeon HD 69900
1000
2000
3000
4000
5000
6000
7000
107.6
2488.3
5952
GFLOP/s
GFLOP/s
INTR
OD
UC
TIO
N T
O G
PG
PU
CPU vs. GPU (Architecture)
Control
CACHE
ALU
ALU
ALU
ALU
CPU GPU
The GPU devotes more transistors to data processing
CUDA DEVICE
Block NBlock 1
CU
DA
AR
CH
ITEC
TU
RE
CUDA Programming Model (Abstract)
Thread 1 Thread M...Shared Memory
Local Memory 1
Local Memory M
Thread 1 Thread M...Shared Memory
Local Memory 1
Local Memory M
...
Device Memory
Global Memory Constant Memory
A CUDA Device executes a Grid containing N Blocks.Each Block has a Shared Memory store.And executes M threads.Each Thread has a local memory store (registers).Device memory is accessible to all threads in all blocks.Device memory is divided into Global and Constant Memory.
CU
DA
AR
CH
ITEC
TU
RE
How Thread Blocks Map to Multiprocessors
GTX 280 Multiprocessor (1 of 30 on GTX 280)
Maximum 1024 Threads
Thread Block
256 Threads
Thread Block
256 Threads
Thread Block
256 Threads
Thread Block
256 Threads
Thread Block
512 Threads
(Maximum Block Size)
Thread Block
512 Threads
(Maximum Block Size)
Thread Block
192 Threads
Thread Block
192 Threads
Thread Block
192 Threads
Thread Block
192 Threads
Thread Block
192 Threads
Unu
sed
Cap
acity
64 T
hrea
ds
FULL OCCUPANCYPARTIAL OCCUPANCY
Occupancy is also affected by:• Per Block Shared Memory Requirements• Per Block Register Requirements
Data-dependent conditional branching within a warp is handled by executing both if and else branches sequentially. • Threads which do not match the condition sleep. • If no threads match a condition, the associated branch is not executed at all.
Warp size is set in hardware, and is unlikely to change in successive GPU generations.
CU
DA
AR
CH
ITEC
TU
RE
Thread Warps
CUDA enabled GPUs implement a SIMT (Single Instruction, Multiple Thread) architecture, with zero thread scheduling overhead.
While essentially SPMD (Single Program, Multiple Data), threads are executed in SIMD batches called Warps.
Each Warp contains 32 threads.
All threads in a warp execute the same instructions in unison.
Thread divergence in different warps
Thread divergence in same warp
Thread 0
Thread 32
Thread 0
Thread 1
if
else
CU
DA
AR
CH
ITEC
TU
RE
Why Warps?
Each processing core can issue an instruction to 4 threads in the time it takes the instruction register to update.
This corresponds to threads sharing a single instruction. The instruction register on Fermi cards is twice as fast. GTX 400 series MP has 32 cores and 2 IRs. GTX 500 series MP has 48 cores and 3 IRs.
Warps are differentiated by block-level thread index: Threads 00-31 are in the first warp. Threads 32-63 are in the second warp. Threads 64-95 and in the third warp, etc.
GT
X 2
80
Mul
tipro
cess
or
Texture Cache
Shared MemoryRegister File
Processing Cores Instruction Register
5 61 2 7 83 4
CU
DA
AR
CH
ITEC
TU
RE
Thread Synchronisation
Thread Synchronisation provides a mechanism for merging divergent thread paths in a block, thus reducing divergence.
if (threadIdx.x > 7) //if thread index is greater than 7 {
...Important Stuff... //warp divergence } __syncthreads(); //no more warp divergence
All threads must wait at __syncthreads() until all other threads in the block reach the directive. Only then may threads recommence.
Must be used with caution – if some threads cant reach the directive, the kernel will never complete.
if (threadIdx.x > 7) //if thread index is greater than 7 {
...Important Stuff...__syncthreads(); //FAIL
}
MEM
ORY M
OD
EL
Type Access Resides In Size
Registers Thread local
On-chip Register File 16384 32-bit Registers
Shared Memory
Block local Multiprocessor 16KB
Constant Memory
Global Device DRAM 64KB (16KB cached)
Global Memory
Global Device DRAM 1024 MB
Texture Memory
Global Multiprocessor & Device DRAM
512MB + 64KB cache
MEM
ORY M
OD
EL
Global Memory
Global Memory is the largest memory region on the device, with 1024 MB of GDDR3 memory. While plentiful, it is at least two orders of magnitude slower than on-chip memory.
Access Speed: 200 – 1000+ clock cycles
Exceptions:
Global memory performance can be improved through Coalescing, or by leveraging the texture cache.
MEM
ORY M
OD
EL
Texture Memory
To avoid the headaches of coalescing, you can instead abuse the 64KB texture cache available on each multiprocessor, by binding regions of Global Memory to a Texture Reference.
Memory accessed using a Texture Reference reads roughly as fast as fully coalesced Global Memory.
Texture References Limitations:
1. READ ONLY.
2. Memory must be bound to texture references in host-side code, prior to kernel execution.
3. Only supports 32-bit and 64-bit ints, floats, doubles, and vectors of these types.
4. On GTX 280, texture memory is mysteriously limited to 512 MB...
MEM
ORY M
OD
EL
Registers
Each Multiprocessor contains 16 384 registers, which reside in an on-chip register file.
These registers are divided equally between each of the blocks running on the multiprocessor.
• Blocks of 512 threads get 8192 registers each.• Blocks of 256 threads get 4096 registers each.
Access Speed: 0 clock cycles
Exceptions:Read-After-Write Dependencies:
• 24 clock cycles if thread count is less than 192.• 0 otherwise.
Register Bank Conflicts:• Register allocation is handled internally, so cannot be explicitly avoided.• Can be minimised by ensuring a multiple of 64 threads per block.
Register Pressure:• Occurs when registers are over allocated, and are pushed into
global memory.• Extremely expensive (200+ clock cycles)
MEM
ORY M
OD
EL
Shared Memory
16KB of Shared memory resides on each multiprocessor, acting as an explicit cache. This memory is shared between all executing blocks.
• Blocks of 512 threads get 8KB each.• Blocks of 256 threads get 4KB each.
Values stored in shared memory are accessible by any other thread in the executing block. This allows threads in the same block to communicate with another, while also providing fast temporary storage.
Access Speed: 1 clock cycleExceptions:Memory Bank Conflicts:
• If N threads in a half-warp try to access the same memory bank, their requests must be serialised into N separate requests.
1 2 3 4 5 6 7 8
0
8
1
9
2
10
3
11
4
12
5
13
6
14
7
15
Memory Banks
Shared Memory Indexes
MEM
ORY M
OD
EL
Constant Memory
CUDA devices have 64KB of constant memory, with 16KB of cache. This region can be used to store program instructions, pointers to global memory, and kernel arguments (which would otherwise consume shared memory).
Access Speed on Cache Hit: Fast as a Register (0 – 24 clock cycles)
Access Speed on Cache Miss: Slow as Global Memory (200 – 1000+ clock cycles).
Exceptions:
On a cache hit, constant memory is as fast as a register ONLY if all threads within the half-warp access the same cached value. If threads in the half-warp request different cached values, then the requests are serialised.
PR
OG
RA
MM
ING
WIT
H C
UD
AUsing the CUDA Runtime API
Typical Program Structure (Host Code)
1. Includes: #include “cuda.h”, #include “cuda_runtime_api.h”
2. Allocate Memory for host and device• cudaMalloc, cudaMallocHost, cudaMallocPitch, etc.
3. Fill host memory
4. Copy host memory to device memory• cudaMemcpy, cudaMemcpyHostToDevice
5. Execute Kernel• myKernel<<<BLOCK_DIM, GRID_DIM [, SHARED_MEM]>>>(args);
6. Copy device memory to host• cudaMemcpy, cudaMemcpyDeviceToHost
7. Free allocated memory• cudaFree
Fill Memory
Execute Kernel
Collect Results
PR
OG
RA
MM
ING
WIT
H C
UD
ACopying Data to and from a Device
The CUDA Runtime API provides functions for allocating Device Memory, and copying data to and from the device.
int * data_h, * data_d;
cudaMallocHost((void **) &data_h, 256 * sizeof(int)); //== data_h = (int *) malloc(256 * sizeof(int));cudaMalloc((void **) &data_d, 256 * sizeof(int));
for (int k = 0; k < 256; ++k) data_h[k] = k; //fill host array
// (dest, source, size, type)cudaMemcpy(data_d, data_h, 256 * sizeof(int), cudaMemcpyHostToDevice);
/* - run the kernel (asynchronous) - do any host-side processing */
cudaThreadSynchronize(); //ensure all threads have completed
cudaMemcpy(data_h, data_d, 256 * sizeof(int), cudaMemcpyDeviceToHost);
cudaFreeHost(data_h); //same as free(data_h)cudaFree(data_d);
Declare, Allocate and Fill Arrays
Copy Data to and from Device Memory
Free Memory
PR
OG
RA
MM
ING
WIT
H C
UD
AExecuting a Kernel
Only CUDA Kernels may be called from the Host.
Kernels need to be informed of Grid and Block dimensions (and optionally Shared Memory size) when called.
These are passed to the kernel at runtime. This allows kernels to be optimised for a variety of datasets.
dim3 BlockDim(256); //y and z components set to 1dim3 GridDim(data_size / BlockDim.x); //assume data_size is multiple of 256int Shared = BlockDim.x * 4 * sizeof(int); //each thread uses 4 ints shared storage
arbitraryKernel<<<GridDim, BlockDim, Shared>>>(someData);
Note that 1D grid and block dimensions may be integers, but dim3 is required for higher dimensions.
blockblockblock
PR
OG
RA
MM
ING
WIT
H C
UD
AKernel Orienteering
thread
0
thread
1
thread
2
thread
3
thread
0
thread
1
Max Threads / Block: 512Max Shared Memory / Block: 16 384 bytesMax Blocks / Grid: 65 536 per dimension (> 281 trillion)
thread
2
thread
3
thread
0
thread
1
thread
2
thread
3
Shared Memory
0
Shared MemoryShared Memory
1 2
threadIdx.x
blockDim.x
blockIdx.x
gridDim.x
Global Thread ID = blockDim.x * blockIdx.x + threadIdx.x
6 = 4 * 1 + 2
NOTE: Block execution order is not guaranteed. Ensure block independence.
PR
OG
RA
MM
ING
WIT
H C
UD
AExample 1 – Increasing 1024 integers by 1.
__global__ void add_one(int* data_in, int* data_out)
{
//find the index of the thread
int thread = blockDim.x * blockIdx.x + threadIdx.x;
//read in data and increment
int tmp = data_in[thread] + 1;
//copy out data
data_out[thread] = tmp;
}
…
add_one<<< 4, 256 >>>(data_in_host, data_out_host); //on host
PR
OG
RA
MM
ING
WIT
H C
UD
ADeclaring Shared Memory
Shared Memory is typically statically allocated within a kernel, using the __shared__ qualifier.
When statically allocated, multiple shared variables and arrays may be declared.
__shared__ char temp1[512];__shared__ int temp2[SHARED_SIZE];
Shared Memory may also be dynamically allocated using the extern qualifier.
All dynamically allocated shared memory starts at the same memory offset, so layout must be explicitly managed in code.
extern __shared__ char array[];
short * array0 = (short *) array;float * array1 = (float *) &array0[128];int * array2 = (int *) &array1[64];
PR
OG
RA
MM
ING
WIT
H C
UD
AUsing Shared Memory
Once declared, shared memory can be treated as a normal array.
Because shared memory is shared by all threads in a block, it is important for threads to orientate themselves, such that they read and write to the correct elements of the array.
//each thread deals with 1 piece of datatemp1[threadIdx.x] = device_array[blockIdx.x * blockDim.x + threadIdx.x];
//each thread deals with 16 pieces of datafor (int k = 0; k < 16; ++k)
temp2[ 16 * threadIdx.x + k] = device_array[16 * (blockIdx.x * blockDim.x + threadIdx.x) + k];
if (temp3[threadIdx.x] == 5) /* do something */
When Reading and Writing to Shared Memory operated on by other threads in a block, use __syncthreads() and __threadfence_block() to protect data integrity.
Atomic functions are also helpful.
PR
OG
RA
MM
ING
WIT
H C
UD
ATools: Occupancy Calculator And Visual Profiler
Occupancy Calculator:
• Excel Spreadsheet for CUDA algorithm design.
• Given Threads, Registers and Shared Memory usage per block, calculates the associated performance implications.
• Allows you to maximise GPU utilization within your kernel.
Visual Profiler:
• Visualizes Kernels, to help determine bottlenecks, defects and areas of general low performance.
• Produces graphs and tables.
• Useful for diagnosing poor performance.
MA
XIM
ISIN
G
PER
FOR
MA
NC
ESpeed Bumps
Memory Access Latency
Thread Divergence
Operator Performance
Host-Device Transfer Overhead
PCI-E 16x can transfer at up to 16GB/s, which, in the grand scheme of things, is quite slow.
Due to the wide range of explicit criteria for optimal performance, poorly crafted Kernels can suffer significant penalties.
Because warp level thread-divergence is essentially serialised by the instruction register, decisional logic should be eliminated where possible.
Certain arithmetic operators perform relatively poorly, and should be avoided.
MA
XIM
ISIN
G
PER
FOR
MA
NC
EExample: Array Reversal
__global__ void reverse_array(int * array_in, int * array_out) //naive kernel{
int curr_index = blockDim.x * blockIdx.x + threadIdx.x;
array_out[curr_index] = array_in[gridDim.x * blockDim.x – 1 – curr_index];}
__global__ void reverse_array_ex(int * array_in, int * array_out) //less naive kernel{
__shared__ int tmp[BLOCK_DIM];
tmp[blockDim.x - 1 - threadIdx.x] = array_in[blockDim.x * blockIdx.x + threadIdx.x];
__syncthreads();
array_out[blockDim.x * (gridDim.x - 1 - blockIdx.x) + threadIdx.x] = tmp[threadIdx.x];}
0 1 2 3 4 5 6 7 8 9 10 11
0123 4567 891011
01234567891011
array_in
tmp
array_out
Better Kernel
MA
XIM
ISIN
G
PER
FOR
MA
NC
ESumming Elements in an Array (Conceptual)
15 6 32 11 2 27 19 9
0 1 2 3 4 5 6 7
21 0 43 0 29 0 28 0
0 1 2 3 4 5 6 7
64 0 0 0 57 0 0 0
0 1 2 3 4 5 6 7
121 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7
MA
XIM
ISIN
G
PER
FOR
MA
NC
EA Better Solution
15 6 32 11 2 27 19 9
0 1 2 3 4 5 6 7
21 43 29 28 0 0 0 0
0 1 2 3 4 5 6 7
64 57 0 0 0 0 0 0
0 1 2 3 4 5 6 7
121 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7
Minimises Divergence in Thread Warps
Performance difference
With 256 threads in a block, both solutions take 9 iterations to sum 512 elements. Total active warps differ significantly, however.
Iteration Reduction Solution 1Active warps
Solution 2Active warps
1 512 – 256 8 8
2 256 – 128 8 4
3 128 – 64 8 2
4 64 – 32 8 1
5 32 – 16 8 1
6 16 – 8 8 1
7 8 – 4 4 1
8 4 – 2 2 1
9 2 – 1 1 1
Total 55 20
PAR
ALLE
L CLA
SS
IFICATIO
NNested Loop Unroll-and-Jam
for (int j = 0; j < M; j++){
for (int k = 0; k < N; k++){
foo(j, k);}
}
for (int j = 0; j < M; j += 4){
for (int k = 0; k < N; k++){
foo(j, k);}for (int k = 0; k < N; k++){
foo(j + 1, k);}for (int k = 0; k < N; k++){
foo(j + 2, k);}for (int k = 0; k < N; k++){
foo(j + 3, k);}
}
for (int j = 0; j < M; j += 4){
for (int k = 0; k < N; k++){
foo(j, k);foo(j + 1, k);foo(j + 2, k);foo(j + 3, k);
}}
1. Initial Nested Loop 2. Partially Unrolled Outer Loop
3. Unroll-and-Jam
MA
XIM
ISIN
G
PER
FOR
MA
NC
EOptimising to Host-To-Device Transfer
CUDA supports two basic memory types:1. Pageable Memory (8 GBps)2. Pagelocked Memory (16 GBps)
Pagelocked memory transfers faster than pageable memory, and supports several optimisations. It is however a scarce resource, and thus overuse can degrade system performance.
Write-Combined Memory – Transfers faster over PCI-E, and frees up L1 and L2 cache resources for the rest of the program to use. It should not be read from by the host.
Mapped Memory – Map a region of host memory onto the GPU. Eliminates all transfer overhead in integrated devices, but is very slow on discreet devices (only 4GBps) .
Asynchronous Streams – Typically, data transfer and kernel execution happen synchronously (separately). Through streams, data transfer and kernel execution can occur asynchronously (concurrently), dramatically improving the overall speed of the program. Buffers are advised.
MA
XIM
ISIN
G
PER
FOR
MA
NC
EInteger Operator Performance
Most arithmetic integer operators perform well in CUDA (between 2 and 8 operations per clock cycle). However, integer division and modulo operations are extremely expensive, taking up to 10x longer to compute.
This expense can be avoided if the dividend is a multiple of 2 using bitwise operations. These are particularly fast for 32-bit integers.
a / b == a >> log2(b) a % b == a & (b – 1)
The performance of the operators and functions available to kernels is provided in the CUDA Programming Guide.