GPU-Computing
International Summer School 2015, Tomsk Polytechnic University
Parallel Processing with GPUs
Overview
• GPU Hardware
• CUDA Programming Fundamentals
• CUDA Programming Examples
• Summary
Graphics Processing Units are Complex
Figure: DirectX 10 pipeline
• Vertex shader: transforms 3D coordinates to 2D; computes point position and color
• Geometry shader (DirectX 10): triangulation; adds e.g. line segments to improve curve representation
• Pixel shader: modifies color and shade of pixels
Start of GPU-Computing
First Languages for Shading
• RenderMan Shading Language (Pixar 1988)
• Stanford Real-Time Shading Language (2001)
Standardized Shading Languages (SL)
• GLSL (OpenGL Shading Language)
• HLSL (High Level Shading Language, Microsoft)
• Cg (NVIDIA, GL and D3D)
General Purpose Graphics Processing Units
• GPU Computing or GPGPU means: usage of the GPU for normal computation (not for computation of images for display)
• GPU Computing: CPU and GPU both participate in the computation
  • CPU: control-flow intensive part
  • GPU: data-intensive part
GPU vs. CPU
• CPU
  • Small to medium number of strong general purpose cores
  • High performance for single to medium number of threads
• GPU (Graphics Processing Unit)
  • Large number of small, specialized cores
  • High performance for very large number of threads
• Now: Integration of GPU and CPU on same die
  • Examples: AMD Trinity, Intel HD2000, HD4000 for i{3, 5, 7}-3K, Nvidia Tegra
SIMD Processing
CPU instruction set extensions
• x86 SSE or AVX, 3DNow!, PowerPC AltiVec
• Knights Ferry/Corner/Landing (MIC, Larrabee)
• Typical SIMD width: 2–8
GPU
• Implicit vectorization through hardware
• NVIDIA GPUs: SIMD warps
• AMD GPUs: VLIW wavefronts
• Typical SIMD width: 16–64 (warp = 32)
Programming APIs:
• CUDA (NVIDIA, leading but proprietary)
• OpenCL (Open Computing Language, open standard)
• DirectCompute (Microsoft)
• OpenACC, pragma-based standard like OpenMP
• PGI (The Portland Group, Inc.) Accelerator Compiler, implicitly parallel
programming language
• Shader in OpenGL/DirectX (GLSL, HLSL, open standards)
• Brook ⇒ Brook+, RapidMind ⇒ CAL (Compute Abstraction Layer, AMD)
• FireStream (AMD), HMPP, GPUSs, StarPU, QUARK, OpenMPC
General Organization and Connection to Host (CPU)
General Organization of GPU
• Many simple cores, streaming processors (SP)
• Grouped (e.g. 32 per group) into multiprocessors
• Many registers
• Small, fast local shared memories
• Large, slow global memory (access takes 400 to 800 cycles)
• Optimized for data-parallel processing
  • Serialization upon control flow divergence
  • No branch prediction
NVIDIA GeForce 8 Architecture
• Floats and doubles available
• Doubles slow
• Missing exceptions (e.g. divide by 0)
SLI: Scalable Link Interface
Example: NVIDIA GeForce 8800 GT
Fermi Streaming Multiprocessor (SM)
- 2 warp schedulers
- Warp: 32 parallel threads
- 2 dispatch units
- 4*8 = 32 cores
- 32K 32-bit registers
- 4 special function units: SIN, COS, EXP, RCP, etc.
Nvidia Tesla Series
• T10 generation: based on the GT200 architecture (predecessor of Fermi)
• T10P:
  • 240 stream processors (SP) @ 1.33 GHz
  • 4 GB @ 800 MHz GDDR3 memory
  • 512-bit memory bus
  • 1.4 × 10^9 transistors
  • ~1 × 10^12 FLOPS (1 TFLOPS)
Tesla 10 Series
• Tesla C1060 Computing Processor
  • PCIe 2.0 card (x16)
  • 1 T10P processor with 240 SPs @ 1.33 GHz
  • 4 GB @ 800 MHz GDDR3 memory
  • 512-bit memory bus
  • 102 GB/s memory bandwidth (theoretical)
  • ~160 W power consumption
• Tesla S1070 1U System
  • 1U form factor
  • 4 T10P processors @ 1.5 GHz
  • I.e. 960 SPs, 16 GB RAM
  • 2048-bit memory bus (4 × 512 bit)
  • Host connection through 2 PCIe 2.0 cables
  • ~700 W power consumption
Compute Capabilities
Compute Capability                              1.0    1.1    1.2    1.3    2.0    2.1    3.0    3.5
Threads / Warp                                   32     32     32     32     32     32     32     32
Warps / Multiprocessor                           24     24     32     32     48     48     64     64
Threads / Multiprocessor                        768    768   1024   1024   1536   1536   2048   2048
Thread Blocks / Multiprocessor                    8      8      8      8      8      8     16     16
Max Shared Memory / Multiprocessor (Bytes)    16384  16384  16384  16384  49152  49152  49152  49152
Register File Size                             8192   8192  16384  16384  32768  32768  65536  65536
Register Allocation Unit Size                   256    256    512    512     64     64    256    256
Allocation Granularity                        block  block  block  block   warp   warp   warp   warp
Max. Registers / Thread                         124    124    124    124     63     63     63    255
Shared Memory Allocation Unit Size              512    512    512    512    128    128    256    256
Warp Allocation Granularity                       2      2      2      2      2      2      4      4
Max. Thread Block Size                          512    512    512    512   1024   1024   1024   1024
Shared Memory Size Configurations (Bytes)     16384  16384  16384  16384  49152  49152  49152  49152
Warp Register Allocation Granularities            -      -      -      -     64     64    256    256
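Several of the limits in the table can be read from the installed GPU at run time. The following is a minimal sketch using `cudaGetDeviceProperties` from the CUDA runtime API; the helper `warpsPerSM` is a hypothetical name introduced here for illustration and simply derives the "Warps / Multiprocessor" row from two reported fields.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: resident warps per multiprocessor, derived from
// the "Threads / Multiprocessor" and "Threads / Warp" values.
int warpsPerSM(const cudaDeviceProp &p) {
    return p.maxThreadsPerMultiProcessor / p.warpSize;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA device found\n");
        return 1;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("Device %d: %s\n", d, p.name);
        printf("  Compute capability:      %d.%d\n", p.major, p.minor);
        printf("  Threads per warp:        %d\n", p.warpSize);
        printf("  Multiprocessors:         %d\n", p.multiProcessorCount);
        printf("  Threads per SM:          %d\n", p.maxThreadsPerMultiProcessor);
        printf("  Warps per SM:            %d\n", warpsPerSM(p));
        printf("  Max. thread block size:  %d\n", p.maxThreadsPerBlock);
        printf("  Shared memory per block: %zu bytes\n", p.sharedMemPerBlock);
        printf("  Registers per block:     %d\n", p.regsPerBlock);
    }
    return 0;
}
```

The printed values depend on the installed card, so no fixed output is given here.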
Programming with CUDA
CUDA = Compute Unified Device Architecture
CUDA
Latest version: http://developer.nvidia.com/cuda/cuda-downloads
Current: CUDA 7 (May 2015)
Toolkit contains:
• NVIDIA Performance Primitives (NPP) library
• Support for Eclipse, Visual Studio
• LLVM-based compiler (nvcc) (LLVM = Low Level Virtual Machine)
• Visual Profiler
• cuda-gdb debugger (Linux & MacOS)
• GPU disassembler (cuobjdump)
• Examples with source code
Libraries (from CUDA Toolkit)
• cuFFT: Fast Fourier Transformation
• cuBLAS: Complete BLAS (Basic Linear Algebra Subprograms)
• cuSPARSE: Sparse matrix computations
• cuRAND: Random number generation (RNG)
• NPP: Performance primitives for image and video processing
• nvcuvid: Video decoding
• nvcuvenc: Video encoding
• Thrust: Templated parallel algorithms & data structures,
  e.g. parallel sorting, parallel summation, data structures for vectors
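As an illustration of the Thrust entry in the list, here is a small sketch of parallel sorting and parallel summation on a device vector. The function name `sort_and_sum` and the input values are made up for this example; the Thrust calls themselves (`thrust::sort`, `thrust::reduce`, `thrust::device_vector`) are part of the library shipped with the toolkit.

```cuda
#include <cstdio>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Sorts the values on the GPU (result is copied back into h)
// and returns their sum.
int sort_and_sum(thrust::host_vector<int> &h) {
    thrust::device_vector<int> d = h;                 // copy host -> device
    thrust::sort(d.begin(), d.end());                 // parallel sort
    int sum = thrust::reduce(d.begin(), d.end(), 0);  // parallel summation
    h = d;                                            // copy device -> host
    return sum;
}

int main() {
    const int init[8] = {5, 3, 8, 1, 9, 2, 7, 4};     // example data
    thrust::host_vector<int> h(8);
    for (int i = 0; i < 8; ++i) h[i] = init[i];

    int sum = sort_and_sum(h);
    int lo = h[0], hi = h[7];
    printf("sum = %d, min = %d, max = %d\n", sum, lo, hi);  // sum = 39, min = 1, max = 9
    return 0;
}
```

No explicit `cudaMalloc` or `cudaMemcpy` appears: the vector classes handle allocation and transfers.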
CUDA
• CUDA extends the C language
• Compiled with the nvcc compiler
• High portability between different CUDA architectures
Compilation
Figure: compilation flow
• CUDA C functions → compiler nvcc → PTX code → PTX for target architecture (object code)
• C program (without CUDA) → compiler, e.g. gcc → CPU object files
• Both paths end in one CPU/GPU executable
PTX: Parallel Thread eXecution architecture, virtual instruction set architecture
CUDA Notation
• Device
  • Graphics card with GPU and graphics memory
• Kernel
  • Program that runs on the device
  • Kernel can only access GPU memory
  • New CUDA versions can run several kernels simultaneously
• Host
  • CPU, which starts kernels on the device
CUDA Programming Model (1)
• Programming in C
• Some functions are executed on the GPU in kernels
• Program partitioned into host code (standard C) and device code (CUDA)
• Distinguish functions by qualifiers
  __host__      Functions on CPU (default)
  __device__    GPU functions
  __global__    Entry points into CUDA code, define a kernel
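The three qualifiers can be seen together in a small program. This is only a sketch: the function names `square`, `plus_one`, `transform` and `run_transform` are invented for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __device__: runs on the GPU and can only be called from GPU code.
__device__ float square(float x) { return x * x; }

// __host__ __device__: compiled for both CPU and GPU.
__host__ __device__ float plus_one(float x) { return x + 1.0f; }

// __global__: kernel, i.e. entry point that the host launches on the GPU.
__global__ void transform(float *data, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] = plus_one(square(data[i]));
}

// No qualifier: implicitly __host__, runs on the CPU.
void run_transform(float *h, int n) {
    float *d;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    transform<<<1, 32>>>(d, n);          // one block of 32 threads, n <= 32 assumed
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
}

int main() {
    float h[4] = {0.0f, 1.0f, 2.0f, 3.0f};
    run_transform(h, 4);
    printf("%.0f %.0f %.0f %.0f\n", h[0], h[1], h[2], h[3]);  // 1 2 5 10
    return 0;
}
```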
CUDA Further Notation
• Kernel: mapped onto a grid
• Grid: 2D mesh of thread blocks
Kernel start (CUDA):
kernel<<<dim3 gridS, dim3 blockS, size_t smem, cudaStream_t str>>>(…)
  dim3 gridS (max. 65535 × 65535)        Dimension of grid (2D)
  dim3 blockS (max. 512 × 512 × 64)      Dimension of block (3D); the total number of threads per block
                                         is limited to 512 (compute capability 1.x) or 1024 (2.0 and later)
  size_t smem (optional)                 Shared memory per block in bytes
  cudaStream_t str (optional)            Stream in which the kernel runs
Block: groups threads as 3D mesh
• Blocks must be independent
• Threads in a thread block can be synchronized
• Shared memory can be used to exchange data
Threads: have unique ID, 1-3D
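A sketch of a launch that uses all four launch parameters. The kernel `reverseInBlock` and the helper `runReverse` are hypothetical names; the kernel reverses the elements handled by each block, exchanging data through shared memory whose size is given by the third parameter.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void reverseInBlock(float *data) {
    extern __shared__ float buf[];     // size comes from the 3rd launch parameter
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    buf[t] = data[base + t];
    __syncthreads();                   // synchronize the threads of this block
    data[base + t] = buf[blockDim.x - 1 - t];
}

// Returns the number of wrong elements (0 if everything worked).
int runReverse() {
    const int threads = 64, blocks = 4, n = threads * blocks;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 gridS(blocks);                     // 1-dim grid (y and z default to 1)
    dim3 blockS(threads);                   // 1-dim block
    size_t smem = threads * sizeof(float);  // shared memory per block in bytes
    cudaStream_t str;
    cudaStreamCreate(&str);

    reverseInBlock<<<gridS, blockS, smem, str>>>(d);
    cudaStreamSynchronize(str);             // wait for the kernel in this stream

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaStreamDestroy(str);
    cudaFree(d);

    int errors = 0;
    for (int i = 0; i < n; ++i) {
        int base = (i / threads) * threads;
        int expected = base + threads - 1 - (i - base);
        if (h[i] != (float)expected) ++errors;
    }
    return errors;
}

int main() {
    printf("wrong elements: %d\n", runReverse());
    return 0;
}
```

Each block works only on its own 64 elements, so the blocks stay independent as required.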
Scheduling of Kernel onto GPU Hardware
• Kernel is a grid, which contains blocks
• Each block is assigned to a Streaming Multiprocessor (SM)
• SM partitions the block into warps (warp: 32 threads)
• All threads of one warp are executed simultaneously on the Streaming Processors (SPs) of the SM
• Assignment of threads to SPs is dynamic
• Warp is executed in SIMD style
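How a 1-dim block is split into warps can be made visible with a short kernel. The sketch below assumes consecutive thread IDs are grouped into warps starting at thread 0 (as documented in the CUDA programming guide) and a warp size of 32; `warpInfo` and `countWarpMismatches` are names chosen for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread records which warp of its block it belongs to and
// its position (lane) inside that warp.
__global__ void warpInfo(int *warpId, int *lane) {
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    warpId[g] = threadIdx.x / warpSize;             // warpSize: built-in variable
    lane[g]   = threadIdx.x % warpSize;
}

// Runs 2 blocks of 96 threads (3 warps each) and compares the result
// with the host-side expectation for a warp size of 32.
int countWarpMismatches() {
    const int threads = 96, blocks = 2, n = threads * blocks;
    int hWarp[n], hLane[n];
    int *dWarp, *dLane;
    cudaMalloc((void**)&dWarp, n * sizeof(int));
    cudaMalloc((void**)&dLane, n * sizeof(int));

    warpInfo<<<blocks, threads>>>(dWarp, dLane);

    cudaMemcpy(hWarp, dWarp, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(hLane, dLane, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dWarp);
    cudaFree(dLane);

    int bad = 0;
    for (int g = 0; g < n; ++g) {
        int t = g % threads;                        // thread index within block
        if (hWarp[g] != t / 32 || hLane[g] != t % 32) ++bad;
    }
    return bad;
}

int main() {
    printf("mismatches: %d\n", countWarpMismatches());
    return 0;
}
```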
Memory Hierarchy
Registers: per thread, small capacity (KB), low latency
Shared Memory: per block, medium capacity (KB), medium latency, can coordinate threads
Global Memory: per grid, large capacity (GB), high latency, necessary for I/O
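All three levels show up in a block-wise summation, sketched below. The kernel name `blockSum`, the helper `sumOnes` and the block size of 256 are assumptions made for this example; the block size must be a power of two for the halving loop to work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256   // threads per block, assumed to be a power of two

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float cache[BLOCK];            // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;         // v is held in a register (per thread)
    cache[threadIdx.x] = v;                   // global memory -> shared memory
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();                      // coordinate the threads of the block
    }
    if (threadIdx.x == 0) out[blockIdx.x] = cache[0];  // result -> global memory
}

// Sums n ones on the GPU; the per-block partial sums are added on the host.
float sumOnes(int n) {
    int blocks = (n + BLOCK - 1) / BLOCK;
    float *h = new float[n];
    float *partial = new float[blocks];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *dIn, *dOut;
    cudaMalloc((void**)&dIn, n * sizeof(float));
    cudaMalloc((void**)&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, h, n * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<blocks, BLOCK>>>(dIn, dOut, n);
    cudaMemcpy(partial, dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += partial[b];
    delete[] h;
    delete[] partial;
    return total;
}

int main() {
    printf("sum = %.0f\n", sumOnes(1000));    // sum = 1000
    return 0;
}
```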
CUDA Memory Model
Figure: CUDA memory model. The grid has texture memory, constant memory and global memory; each block (0,0), (0,1), …, (0,n-1) has its own shared memory.
Memory Model
• Memory specified through qualifiers
  __device__    in global memory on GPU
  __shared__    in shared memory on SMs
• Operations on global memory:
  • Allocation: cudaMalloc(void **ptr, size_t bytes)
  • Deallocation: cudaFree(void *ptr)
  • Set memory: cudaMemset(void *ptr, int value, size_t bytes)
  • Transfer system RAM ↔ GPU RAM: cudaMemcpy(*dst, *src, size_t bytes, …)
• In kernels: transfer global memory ↔ shared memory
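The four operations on global memory can be combined into a round trip, sketched below. The `CHECK` macro and the function `roundTrip` are hypothetical helpers added for this example; every runtime call returns a `cudaError_t`, which the macro inspects.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s (line %d)\n",          \
                    cudaGetErrorString(err), __LINE__);            \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Allocates, clears, fills and reads back n ints of GPU memory.
// Returns the number of mismatches found on the host.
int roundTrip(int n) {
    size_t bytes = n * sizeof(int);
    int *hIn  = (int*)malloc(bytes);
    int *hOut = (int*)malloc(bytes);
    for (int i = 0; i < n; ++i) hIn[i] = i;

    int *d;
    CHECK(cudaMalloc((void**)&d, bytes));          // allocation
    CHECK(cudaMemset(d, 0, bytes));                // set every byte to 0
    CHECK(cudaMemcpy(hOut, d, bytes, cudaMemcpyDeviceToHost));
    int errors = 0;
    for (int i = 0; i < n; ++i) if (hOut[i] != 0) ++errors;

    CHECK(cudaMemcpy(d, hIn, bytes, cudaMemcpyHostToDevice));   // system RAM -> GPU RAM
    CHECK(cudaMemcpy(hOut, d, bytes, cudaMemcpyDeviceToHost));  // GPU RAM -> system RAM
    for (int i = 0; i < n; ++i) if (hOut[i] != hIn[i]) ++errors;

    CHECK(cudaFree(d));                            // deallocation
    free(hIn);
    free(hOut);
    return errors;
}

int main() {
    printf("mismatches: %d\n", roundTrip(1024));
    return 0;
}
```

Note that `cudaMemset` sets bytes, not elements: a value other than 0 fills each byte of every `int` with that value.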
Caching
- Default: enabled
- Access first L1, then L2, then global memory
- Granularity: 128-byte cache line
Non-caching
- Activate with option -Xptxas -dlcm=cg in the NVIDIA compiler (nvcc)
- Access first L2, then global memory
- No access to L1; if present in L1: invalidate cache line
- Granularity: 32 bytes
Accessing Array Elements
Size of 2-dim block:
  blockDim.x    // size of 2-dim block (X coord)
  blockDim.y    // size of 2-dim block (Y coord)
Identifying thread within 2-dim block:
  threadIdx.x   // thread ID in block (0 to blockDim.x-1)
  threadIdx.y   // thread ID in block (0 to blockDim.y-1)
For block dimension (3,4):
  threadIdx.x = 0,1,2
  threadIdx.y = 0,1,2,3
Accessing Array Elements
Identifying block within 1-dim grid:
  blockIdx.x    // block ID in grid
Access to array element (1-dim grid, 2-dim blocks):
  blocksize = blockDim.x * blockDim.y;              // no. threads in block
  tid = threadIdx.y * blockDim.x + threadIdx.x;     // linear ID within block
  index = blockIdx.x * blocksize + tid;             // array element index
Access to array element (1-dim grid, 1-dim blocks):
  blocksize = blockDim.x;                           // no. threads in block
  tid = threadIdx.x;                                // linear ID within block
  index = blockIdx.x * blocksize + tid;             // array element index
or
  index = blockIdx.x * blockDim.x + threadIdx.x;
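The index computation for a 1-dim grid of 2-dim blocks can be checked with a kernel in which every thread marks its own array element. This is a sketch with invented names (`markIndex`, `countBadIndices`); it uses the row-wise linear thread ID `threadIdx.y * blockDim.x + threadIdx.x`, and if the formula is right, each element is marked exactly once.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread increments the array element that belongs to it.
__global__ void markIndex(int *hits, int n) {
    int blocksize = blockDim.x * blockDim.y;            // no. threads in block
    int tid = threadIdx.y * blockDim.x + threadIdx.x;   // linear ID within block
    int index = blockIdx.x * blocksize + tid;           // array element index
    if (index < n) atomicAdd(&hits[index], 1);
}

// Launches 5 blocks of dimension (3,4) on 60 elements and returns the
// number of elements that were not hit exactly once.
int countBadIndices() {
    const int blocks = 5;
    const int n = blocks * 3 * 4;
    dim3 blockS(3, 4);                                  // block dimension (3,4)
    int h[n];
    int *d;
    cudaMalloc((void**)&d, n * sizeof(int));
    cudaMemset(d, 0, n * sizeof(int));

    markIndex<<<blocks, blockS>>>(d, n);

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    int bad = 0;
    for (int i = 0; i < n; ++i) if (h[i] != 1) ++bad;
    return bad;
}

int main() {
    printf("elements not hit exactly once: %d\n", countBadIndices());
    return 0;
}
```

With a wrong formula (for example multiplying `threadIdx.y` by `blockIdx.x`), several threads would map to the same element and others would be left at 0.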
First parallel Code – GPU part
• Add two vectors of length N
• 1-dim grid of N/B blocks, each 1-dim block consisting of B threads
• blockDim.x: first dimension of block (i.e. B)
__global__ void vecAdd(float *a, float *b, float *c, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        c[i] = a[i] + b[i];
    }
}
First Parallel Code – CPU Part
#define N 65536                            // N is the dimension of the vectors
#include <stdio.h>
#include <cuda_runtime.h>
int main(void) {
    size_t size = N * sizeof(float);
    float *dA, *dB, *dC;
    static float hA[N], hB[N], hC[N];      // static: keeps 768 KB off the stack
    for (int i = 0; i < N; i++) { hA[i] = (float)i; hB[i] = (float)(N - i); }
    cudaMalloc((void**)&dA, size);         // allocate vectors in GPU memory
    cudaMalloc((void**)&dB, size);
    cudaMalloc((void**)&dC, size);
    cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, size, cudaMemcpyHostToDevice);
    int threadS = 256;                           // threads per block
    int blockS = (N + threadS - 1) / threadS;    // blocks per grid (rounded up)
    vecAdd<<<blockS, threadS>>>(dA, dB, dC, N);
    cudaMemcpy(hC, dC, size, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    for (int i = 0; i < N; i++)                  // i + (N - i) must equal N
        if (hC[i] != N) printf("Wrong result at %d\n", i);
    return 0;
}
Matrix Multiplication
// DIM and the kernel matMul are defined below; DIM must be a multiple of 16
int main() {
    dim3 threads(16, 16, 1);                            // block size = 256, really 2-dim
    dim3 grid(DIM / threads.x, DIM / threads.y, 1);     // 2-dim
    static float a[DIM*DIM], b[DIM*DIM], res[DIM*DIM];  // static: keeps them off the stack
    float *devA, *devB, *devRes;
    size_t matSize = DIM * DIM * sizeof(float);
    for (int i = 0; i < DIM*DIM; i++) { a[i] = 1.0f; b[i] = 2.0f; }  // example input
    cudaMalloc((void**)&devA, matSize);
    cudaMalloc((void**)&devB, matSize);
    cudaMalloc((void**)&devRes, matSize);
    cudaMemcpy(devA, a, matSize, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, b, matSize, cudaMemcpyHostToDevice);
    matMul<<<grid, threads>>>(devA, devB, devRes);
    cudaDeviceSynchronize();               // replaces the deprecated cudaThreadSynchronize()
    cudaMemcpy(res, devRes, matSize, cudaMemcpyDeviceToHost);
    for (int i = 0; i < DIM; i++) {        // i: row, j: column (row-major, as in the kernel)
        for (int j = 0; j < DIM; j++) printf("%f ", res[i*DIM + j]);
        printf("\n");
    }
    cudaFree(devA); cudaFree(devB); cudaFree(devRes);
    return 0;
}
Matrix Multiplication
Dimension of matrix: DIM (global variable)
In each of 2 dimensions: DIM threads
__global__ void matMul(float *a, float *b, float *c) {
    int2 id;
    id.x = blockDim.x * blockIdx.x + threadIdx.x;   // column
    id.y = blockDim.y * blockIdx.y + threadIdx.y;   // row
    float sum = 0;
    for (int z = 0; z < DIM; z++) {
        sum += a[id.y*DIM + z] * b[z*DIM + id.x];
    }
    c[id.y*DIM + id.x] = sum;
}
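The kernel above reads every operand directly from global memory. A common variation, not taken from the slides, stages square tiles of both matrices in shared memory so that each global element is loaded once per block instead of once per thread. The following sketch assumes DIM = 64 and a tile edge of 16; `matMulTiled` and `countMatMulErrors` are names invented here, and the result is compared with a plain CPU loop.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define DIM  64   // matrix dimension, assumed to be a multiple of TILE
#define TILE 16   // tile edge = block edge

__global__ void matMulTiled(const float *a, const float *b, float *c) {
    __shared__ float ta[TILE][TILE];   // tile of a, shared by the block
    __shared__ float tb[TILE][TILE];   // tile of b, shared by the block
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    float sum = 0.0f;
    for (int t = 0; t < DIM / TILE; t++) {
        // each thread loads one element of each tile from global memory
        ta[threadIdx.y][threadIdx.x] = a[row * DIM + t * TILE + threadIdx.x];
        tb[threadIdx.y][threadIdx.x] = b[(t * TILE + threadIdx.y) * DIM + col];
        __syncthreads();               // tiles are complete before they are used
        for (int z = 0; z < TILE; z++)
            sum += ta[threadIdx.y][z] * tb[z][threadIdx.x];
        __syncthreads();               // all reads done before the next load
    }
    c[row * DIM + col] = sum;
}

// Runs the kernel and returns the number of elements that differ
// from a CPU reference computation.
int countMatMulErrors() {
    static float a[DIM*DIM], b[DIM*DIM], res[DIM*DIM];
    for (int i = 0; i < DIM*DIM; i++) {        // small integers: sums are exact
        a[i] = (float)(i % 7);
        b[i] = (float)(i % 5);
    }
    float *devA, *devB, *devRes;
    size_t matSize = DIM * DIM * sizeof(float);
    cudaMalloc((void**)&devA, matSize);
    cudaMalloc((void**)&devB, matSize);
    cudaMalloc((void**)&devRes, matSize);
    cudaMemcpy(devA, a, matSize, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, b, matSize, cudaMemcpyHostToDevice);

    dim3 threads(TILE, TILE);
    dim3 grid(DIM / TILE, DIM / TILE);
    matMulTiled<<<grid, threads>>>(devA, devB, devRes);

    cudaMemcpy(res, devRes, matSize, cudaMemcpyDeviceToHost);
    cudaFree(devA); cudaFree(devB); cudaFree(devRes);

    int errors = 0;
    for (int row = 0; row < DIM; row++)
        for (int col = 0; col < DIM; col++) {
            float ref = 0.0f;
            for (int z = 0; z < DIM; z++) ref += a[row*DIM + z] * b[z*DIM + col];
            if (res[row*DIM + col] != ref) ++errors;
        }
    return errors;
}

int main() {
    printf("wrong elements: %d\n", countMatMulErrors());
    return 0;
}
```

Whether tiling pays off for a given matrix size and GPU generation has to be measured.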
Choice of Block Size
• Number of threads per block: multiple of warp size (32)
• SMs can execute up to 8 thread blocks in parallel (16 for compute capability 3.x)
• Small block size: prevents high utilization
• Large block size: less flexibility
• Typical: ~128 to 256 threads per block
• Depends on application → experiments necessary
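Such an experiment can be as simple as timing the same kernel for several block sizes. The sketch below uses CUDA events for timing; the kernel `scale`, the helper `timeScale` and the problem size are assumptions for illustration, and the measured times depend entirely on the GPU used.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Simple data-parallel kernel used as the timing subject.
__global__ void scale(float *x, float f, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] = f * x[i];
}

// Runs the kernel once with the given block size; returns the time in ms.
float timeScale(float *d, int n, int threads) {
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scale<<<blocks, threads>>>(d, 1.0f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                 // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 20;   // with 32 threads/block: 32768 blocks, below the grid limit
    float *d;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    timeScale(d, n, 256);                       // warm-up run, not reported
    for (int threads = 32; threads <= 512; threads *= 2)
        printf("%4d threads/block: %.3f ms\n", threads, timeScale(d, n, threads));

    cudaFree(d);
    return 0;
}
```

For stable numbers, each configuration would normally be run several times and averaged.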
Avoid Control Flow Divergence
• Typical code for CPU:
    if (idx & 1) a[idx] = b[idx]; else a[idx] = b[idx] + 1;
• On GPU: performance loss by a factor of 2!
    First: all odd threads in warp execute a[idx] = b[idx];
    Then: all even threads in warp execute a[idx] = b[idx] + 1;
• Better: a[idx] = b[idx] + 1 - (idx & 1);
• Not always so simple to detect and to cure!
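The two variants can be put side by side and checked for equal results. In this sketch the kernel names and `countDifferences` are invented; note that the branch-free form has to add 1 for even indices and 0 for odd ones, hence `1 - (idx & 1)`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Divergent version: odd and even threads of a warp take different paths.
__global__ void withBranch(int *a, const int *b, int n) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n) {
        if (idx & 1) a[idx] = b[idx];
        else         a[idx] = b[idx] + 1;
    }
}

// Branch-free version: all threads of a warp execute the same statement.
__global__ void branchFree(int *a, const int *b, int n) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n) a[idx] = b[idx] + 1 - (idx & 1);
}

// Runs both kernels on the same input; returns the number of differing elements.
int countDifferences(int n) {
    size_t bytes = n * sizeof(int);
    int *hB  = (int*)malloc(bytes);
    int *hA1 = (int*)malloc(bytes);
    int *hA2 = (int*)malloc(bytes);
    for (int i = 0; i < n; ++i) hB[i] = 3 * i;

    int *dA, *dB;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    int threads = 128, blocks = (n + threads - 1) / threads;
    withBranch<<<blocks, threads>>>(dA, dB, n);
    cudaMemcpy(hA1, dA, bytes, cudaMemcpyDeviceToHost);
    branchFree<<<blocks, threads>>>(dA, dB, n);
    cudaMemcpy(hA2, dA, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);

    int diff = 0;
    for (int i = 0; i < n; ++i) if (hA1[i] != hA2[i]) ++diff;
    free(hB); free(hA1); free(hA2);
    return diff;
}

int main() {
    printf("differing elements: %d\n", countDifferences(1000));
    return 0;
}
```

For a branch this small the compiler may already emit predicated instructions, so the measured difference can be smaller than the factor stated above.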
Comparison OpenCL and CUDA
• OpenCL
  • Compiles kernel at runtime for actual platform (~)
  • Supports cards of several manufacturers (++)
  • Supports also non-GPU devices (++)
  • Open standard (++)
  • Available for: Windows, Mac, Linux
  • Larger setup code (-)
  • Less programming comfort (-)
  • Command queues resemble CUDA streams (~)
  • API commands similar, partly different parameters
Comparison OpenCL and CUDA
• CUDA
  • Abstraction for general purpose computing on a GPU (GPGPU)
  • Good high-level API (+)
  • Uniform programming model (+)
  • Low-level and high-level thread synchronization (+)
  • Stream synchronization (+)
  • Native support for atomic operations (+)
  • Very good documentation (+)
  • Works only on NVIDIA hardware (-)
  • Manual memory management (-)
  • Available for: Windows, Mac, Linux
Summary
• Use GPUs as co-processors for massively parallel problems with regular structure (few control flow statements)
• CUDA: most popular programming environment
• Many more issues not mentioned in this introduction
  → see CUDA manual and textbooks
Thank you very much for your attention