GPU-Computing

International Summer School 2015, Tomsk Polytechnic University

Parallel Processing with GPUs

Overview

• GPU Hardware
• CUDA Programming Fundamentals
• CUDA Programming Examples
• Summary


Graphics Processing Units are Complex

Figure: DirectX 10 pipeline

• Vertex shader: transforms 3D coordinates to 2D; computes point position and color

• Geometry shader (DirectX 10): triangulation; adds primitives, e.g. line segments, to improve curve representation

• Pixel shader: modifies color and shade of pixels


Start of GPU-Computing

First Languages for Shading

• RenderMan Shading Language (Pixar, 1988)
• Stanford Real-Time Shading Language (2001)

Standardized Shading Languages (SL)

• GLSL (OpenGL Shading Language)
• HLSL (High Level Shading Language, Microsoft)
• Cg (NVIDIA, for OpenGL and Direct3D)


General Purpose Graphics Processing Units

• GPU computing, or GPGPU, means using the GPU for general computation (not for computing images for display)
• GPU computing: CPU and GPU both participate in the computation
  - CPU: control-flow-intensive part
  - GPU: data-intensive part

GPU vs. CPU

• CPU
  - Small to medium number of powerful general-purpose cores
  - High performance for one thread up to a moderate number of threads
• GPU (Graphics Processing Unit)
  - Large number of small, specialized cores
  - High performance for a very large number of threads
• Now: integration of GPU and CPU on the same die
  - Examples: AMD Trinity; Intel HD2000, HD4000 for i{3, 5, 7}-3K; NVIDIA Tegra


SIMD Processing

CPU instruction set extensions

• x86 SSE or AVX, 3DNow!, PowerPC AltiVec
• Knights Ferry/Corner/Landing (MIC, Larrabee)
• Typical SIMD width: 2–8

GPU

• Implicit vectorization through hardware
• NVIDIA GPUs: SIMD warps
• AMD GPUs: VLIW wavefronts
• Typical SIMD width: 16–64 (warp = 32)


Programming APIs:

• CUDA (NVIDIA, leading but proprietary)

• OpenCL (Open Computing Language, open standard)

• DirectCompute (Microsoft)

• OpenACC, pragma-based standard like OpenMP

• PGI (The Portland Group, Inc.) Accelerator Compiler, implicitly parallel programming language

• Shader in OpenGL/DirectX (GLSL, HLSL, open standards)

• Brook ⇒ Brook+, RapidMind ⇒ CAL (Compute Abstraction Layer, AMD)

• FireStream (AMD), HMPP, GPUSs, StarPU, QUARK, OpenMPC


General Organization and Connection to Host (CPU)


General Organization of GPU

• Many simple cores: streaming processors (SPs)
• Grouped (32 each) into multiprocessors
• Many registers
• Small, fast, local shared memories
• Large, slow global memory (an access takes 400 to 800 cycles)
• Optimized for data-parallel processing
  - Serialization upon control-flow divergence
  - No branch prediction


NVIDIA GeForce 8 Architecture

• Floats and doubles available
• Doubles slow
• Missing exceptions (e.g. divide by 0)

SLI: Scalable Link Interface


Example: NVIDIA GeForce 8800 GT


Fermi Streaming Multiprocessor (SM)

- 2 warp schedulers
- Warp: 32 parallel threads
- 2 dispatch units
- 4*8 = 32 cores
- 32K 32-bit registers
- 4 special function units (SIN, COS, EXP, RCP, etc.)

NVIDIA Tesla Series

• Based on the Fermi architecture (Tesla 20 series); the T10P below belongs to the preceding GT200 generation
• T10P:
  - 240 stream processors (SPs) @ 1.33 GHz
  - 4 GB GDDR3 memory @ 800 MHz
  - 512-bit memory bus
  - 1.4 × 10^9 transistors
  - ~1 × 10^12 FLOPS (1 TFLOPS)


Tesla 10 Series

• Tesla C1060 Computing Processor
  - PCIe 2.0 card (x16)
  - 1 T10P processor with 240 SPs @ 1.33 GHz
  - 4 GB GDDR3 memory @ 800 MHz
  - 512-bit memory bus
  - 102 GB/s memory bandwidth (theoretical)
  - ~160 W power consumption
• Tesla S1070 1U System
  - 1U form factor
  - 4 T10P processors @ 1.5 GHz
  - I.e. 960 SPs, 16 GB RAM
  - 2048-bit memory bus (4 × 512 bit)
  - Host connection through 2 PCIe 2.0 cables
  - ~700 W power consumption
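The theoretical figures for the T10P can be reproduced from bus width and clock rates. A minimal sketch, with two assumptions that are not on the slides: GDDR3 transfers data twice per memory clock (double data rate), and each SP issues up to 3 single-precision operations per cycle (a multiply-add plus a multiply). The function names are illustrative.

```c
/* Peak memory bandwidth in GB/s (1 GB = 1e9 bytes).
   bus_bits:            memory bus width in bits
   clock_mhz:           memory clock in MHz
   transfers_per_clock: 2 for double-data-rate memory such as GDDR3 (assumption) */
double peak_bandwidth_gbs(int bus_bits, double clock_mhz, int transfers_per_clock) {
    double bytes_per_transfer = bus_bits / 8.0;
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9;
}

/* Peak single-precision throughput in GFLOPS.
   flops_per_cycle: operations per SP per cycle (assumed to be 3 for the T10P) */
double peak_gflops(int num_sps, double clock_ghz, int flops_per_cycle) {
    return num_sps * clock_ghz * flops_per_cycle;
}
```

Under these assumptions, peak_bandwidth_gbs(512, 800.0, 2) yields 102.4, in line with the 102 GB/s listed for the C1060, and peak_gflops(240, 1.33, 3) yields about 958, i.e. roughly 1 TFLOPS.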


Compute Capabilities

Compute Capability                             1.0    1.1    1.2    1.3    2.0    2.1    3.0    3.5
Threads / Warp                                  32     32     32     32     32     32     32     32
Warps / Multiprocessor                          24     24     32     32     48     48     64     64
Threads / Multiprocessor                       768    768   1024   1024   1536   1536   2048   2048
Thread Blocks / Multiprocessor                   8      8      8      8      8      8     16     16
Max Shared Memory / Multiprocessor (Bytes)   16384  16384  16384  16384  49152  49152  49152  49152
Register File Size                            8192   8192  16384  16384  32768  32768  65536  65536
Register Allocation Unit Size                  256    256    512    512     64     64    256    256
Allocation Granularity                       block  block  block  block   warp   warp   warp   warp
Max. Registers / Thread                        124    124    124    124     63     63     63    255
Shared Memory Allocation Unit Size             512    512    512    512    128    128    256    256
Warp Allocation Granularity                      2      2      2      2      2      2      4      4
Max. Thread Block Size                         512    512    512    512   1024   1024   1024   1024
Shared Memory Size Configurations (Bytes)    16384  16384  16384  16384  49152  49152  49152  49152
Warp Register Allocation Granularities           -      -      -      -     64     64    256    256
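The thread and block limits in the table bound how many blocks can be resident on one multiprocessor. A simplified sketch of that calculation; it deliberately ignores the register and shared-memory limits, which can lower the result further. The function names are illustrative.

```c
/* Thread blocks that fit on one multiprocessor, considering only the
   limits "Threads / Multiprocessor" and "Thread Blocks / Multiprocessor".
   Registers and shared memory are ignored (simplifying assumption). */
int resident_blocks(int threads_per_block, int max_threads_per_sm, int max_blocks_per_sm) {
    int warps_per_block = (threads_per_block + 31) / 32;   /* blocks occupy whole warps */
    int max_warps_per_sm = max_threads_per_sm / 32;
    int by_warps = max_warps_per_sm / warps_per_block;
    return by_warps < max_blocks_per_sm ? by_warps : max_blocks_per_sm;
}

/* Occupancy in percent: resident warps relative to the maximum number of warps. */
int occupancy_percent(int threads_per_block, int max_threads_per_sm, int max_blocks_per_sm) {
    int warps_per_block = (threads_per_block + 31) / 32;
    int blocks = resident_blocks(threads_per_block, max_threads_per_sm, max_blocks_per_sm);
    return 100 * blocks * warps_per_block / (max_threads_per_sm / 32);
}
```

For compute capability 2.0 (1536 threads, 8 blocks), blocks of 256 threads reach 100 % in this model, while blocks of 64 threads are capped by the 8-block limit at 33 %.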


Programming with CUDA

CUDA = Compute Unified Device Architecture


CUDA

Latest version: http://developer.nvidia.com/cuda/cuda-downloads

Current: CUDA 7 (May 2015)

Toolkit contains:

• NVIDIA Performance Primitives (NPP) library
• Support for Eclipse and Visual Studio
• LLVM-based compiler (nvcc) (LLVM = Low Level Virtual Machine)
• Visual Profiler
• cuda-gdb debugger (Linux & Mac OS)
• GPU disassembler (cuobjdump)
• Examples with source code


Libraries (from CUDA Toolkit)

• cuFFT: Fast Fourier Transform
• cuBLAS: complete BLAS (Basic Linear Algebra Subprograms)
• cuSPARSE: sparse matrix computations
• cuRAND: random number generation (RNG)
• NPP: performance primitives for image and video processing
• nvcuvid: video decoding
• nvcuvenc: video encoding
• Thrust: templated parallel algorithms and data structures, e.g. parallel sorting, parallel summation, data structures for vectors


CUDA

• CUDA extends the C language
• Compiled with the nvcc compiler
• High portability between different CUDA architectures


Compilation

Figure: compilation flow

• CUDA C functions → compiler nvcc → PTX code → PTX for target architecture (object code)
• C program (without CUDA) → compiler, e.g. gcc → CPU object files
• Both paths end in one CPU/GPU executable

PTX: Parallel Thread eXecution, a virtual instruction set architecture


CUDA Notation

• Device
  - Graphics card with GPU and graphics memory
• Kernel
  - Program that runs on the device
  - A kernel can only access GPU memory
  - Newer CUDA versions can run several kernels simultaneously
• Host
  - CPU, which starts kernels on the device


CUDA Programming Model (1)

• Programming in C
• Some functions are executed on the GPU as kernels
• The program is partitioned into host code (standard C) and device code (CUDA)
• Functions are distinguished by qualifiers:

__host__      functions on the CPU (default)
__device__    GPU functions
__global__    entry points into CUDA code; define kernels


CUDA Further Notation

• Kernel: mapped onto a grid
• Grid: 2D mesh of thread blocks
• Block: groups threads as a 3D mesh
  - Blocks must be independent
  - Threads in a thread block can be synchronized
  - Shared memory can be used to exchange data
• Threads: have a unique ID, 1D to 3D

Kernel start (CUDA):

kernel<<<dim3 gridS, dim3 blockS, size_t smem, cudaStream_t str>>>(…)

dim3 gridS          (max. 65535 × 65535)     dimension of grid (2D)
dim3 blockS         (max. 512 × 512 × 64)    dimension of block (3D); the product of the three dimensions is additionally limited by Max. Thread Block Size
size_t smem         (optional)               shared memory per block, in bytes
cudaStream_t str    (optional)               stream in which the kernel is started


Scheduling of Kernel onto GPU Hardware

• A kernel is a grid, which contains blocks
• Each block is assigned to one streaming multiprocessor (SM)
• The SM partitions the block into warps (warp: 32 threads)
• All threads of one warp are executed simultaneously on the streaming processors (SPs) of the SM
• Assignment of threads to SPs is dynamic
• A warp is executed in SIMD style


Memory Hierarchy

• Registers: per thread, small capacity (KB), low latency
• Shared memory: per block, medium capacity (KB), medium latency, can be used to coordinate threads
• Global memory: per grid, large capacity (GB), high latency, necessary for I/O


CUDA Memory Model

Figure: a grid with blocks (0,0), (0,1), …, (0,n-1), each with its own shared memory; global memory, constant memory, and texture memory are shared by the whole grid.


Memory Model

• Memory is specified through qualifiers:
  __device__    in global memory on the GPU
  __shared__    in shared memory on the SMs
• Operations on global memory:
  - Allocation: cudaMalloc(void **ptr, size_t bytes)
  - Deallocation: cudaFree(void *ptr)
  - Set memory: cudaMemset(void *ptr, int value, size_t bytes)
  - Transfer system RAM ↔ GPU RAM: cudaMemcpy(void *dst, const void *src, size_t bytes, …)
• In kernels: transfer global memory ↔ shared memory


Caching
- Default: enabled
- Access first L1, then L2, then global memory
- Granularity: 128-byte cache line

Non-caching
- Activate with the nvcc option -Xptxas -dlcm=cg
- Access first L2, then global memory
- No access to L1; if the line is present in L1, it is invalidated
- Granularity: 32 bytes
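The effect of the two granularities can be estimated by counting the memory segments that one warp-wide load touches. A minimal sketch, assuming the 32 threads of a warp read consecutive 4-byte words and that every touched segment is fetched in full; cache hits are not modeled, and the function names are illustrative.

```c
/* Number of segments of `granularity` bytes touched by the contiguous
   byte range [addr, addr + len). */
unsigned long segments_touched(unsigned long addr, unsigned long len,
                               unsigned long granularity) {
    if (len == 0) return 0;
    unsigned long first = addr / granularity;
    unsigned long last = (addr + len - 1) / granularity;
    return last - first + 1;
}

/* Bytes moved for one warp-wide load of 32 consecutive 4-byte words
   starting at byte address addr (simplified model). */
unsigned long bytes_transferred(unsigned long addr, unsigned long granularity) {
    return segments_touched(addr, 32UL * 4UL, granularity) * granularity;
}
```

In this model an aligned access moves 128 bytes in both modes. An access shifted by one word moves 256 bytes with 128-byte lines but only 160 bytes with 32-byte segments, which is why non-caching loads can pay off for misaligned or scattered accesses.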


Accessing Array Elements

Size of a 2-dim block:

blockDim.x     // size of 2-dim block (X coord)
blockDim.y     // size of 2-dim block (Y coord)

Identifying a thread within a 2-dim block:

threadIdx.x    // thread ID in block (0 to blockDim.x-1)
threadIdx.y    // thread ID in block (0 to blockDim.y-1)

For block dimension (3,4):

threadIdx.x = 0,1,2
threadIdx.y = 0,1,2,3


Accessing Array Elements

Identifying a block within a 1-dim grid:

blockIdx.x     // block ID in grid

Access to array element (1-dim grid, 2-dim blocks):

blocksize = blockDim.x * blockDim.y;              // no. of threads in block
tid = threadIdx.y * blockDim.x + threadIdx.x;     // linear thread ID within block
index = blockIdx.x * blocksize + tid;             // array element index

Access to array element (1-dim grid, 1-dim blocks):

blocksize = blockDim.x;                           // no. of threads in block
tid = threadIdx.x;                                // linear thread ID within block
index = blockIdx.x * blocksize + tid;             // array element index

or

index = blockIdx.x * blockDim.x + threadIdx.x;
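The index arithmetic can be checked on the CPU by looping over all block and thread IDs. In the sketch below, plain integers stand in for CUDA's blockIdx, blockDim and threadIdx. Within a block the linear thread ID is threadIdx.y times the row length blockDim.x, plus threadIdx.x. The function names are illustrative.

```c
/* Array index for a 1-dim grid of 2-dim blocks; plain integers replace
   the CUDA built-ins blockIdx.x, blockDim.x/y and threadIdx.x/y. */
int element_index(int block_idx_x, int block_dim_x, int block_dim_y,
                  int thread_idx_x, int thread_idx_y) {
    int blocksize = block_dim_x * block_dim_y;             /* threads per block */
    int tid = thread_idx_y * block_dim_x + thread_idx_x;   /* linear ID in block */
    return block_idx_x * blocksize + tid;
}

/* Emulates a launch on the CPU. Returns 1 if every index 0..n-1 with
   n = grid_dim_x * block_dim_x * block_dim_y is produced exactly once,
   otherwise 0. hits must provide room for n ints. */
int covers_exactly_once(int grid_dim_x, int block_dim_x, int block_dim_y, int *hits) {
    int n = grid_dim_x * block_dim_x * block_dim_y;
    for (int i = 0; i < n; i++) hits[i] = 0;
    for (int b = 0; b < grid_dim_x; b++)
        for (int ty = 0; ty < block_dim_y; ty++)
            for (int tx = 0; tx < block_dim_x; tx++)
                hits[element_index(b, block_dim_x, block_dim_y, tx, ty)]++;
    for (int i = 0; i < n; i++)
        if (hits[i] != 1) return 0;
    return 1;
}
```

For the block dimension (3,4) used as example, the last thread (2,3) of block 0 maps to index 11 and the first thread of block 1 to index 12, so consecutive blocks cover consecutive array ranges.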


First Parallel Code – GPU Part

• Add two vectors of length N
• 1-dim grid of N/B blocks (rounded up), each 1-dim block consisting of B threads
• blockDim.x: first dimension of block (i.e. B)

__global__ void vecAdd(float *a, float *b, float *c, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;   // global element index of this thread
    if (i < N) {                                     // guard: the last block may be only partly used
        c[i] = a[i] + b[i];
    }
}


First Parallel Code – CPU Part

#include <stdio.h>
#include <cuda_runtime.h>

#define N 65536                                  // N is the dimension of the vectors

int main(void) {
    size_t size = N * sizeof(float);
    float *dA, *dB, *dC;                         // vectors in GPU memory
    static float hA[N], hB[N], hC[N];            // vectors in host memory (static: not on the stack)

    for (int i = 0; i < N; i++) { hA[i] = (float)i; hB[i] = (float)(N - i); }

    cudaMalloc((void**)&dA, size);               // allocate vectors in GPU memory
    cudaMalloc((void**)&dB, size);
    cudaMalloc((void**)&dC, size);

    cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, size, cudaMemcpyHostToDevice);

    int threadS = 256;                           // threads per block
    int blockS = (N + threadS - 1) / threadS;    // blocks per grid, rounded up
    vecAdd<<<blockS, threadS>>>(dA, dB, dC, N);

    cudaMemcpy(hC, dC, size, cudaMemcpyDeviceToHost);   // waits for the kernel, then copies the result
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    for (int i = 0; i < N; i++)                  // every element must be i + (N - i) = N
        if (hC[i] != N) printf("Wrong result at %d\n", i);
    return 0;
}
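The rounded-up block count and the guard i < N belong together: rounding up may start more threads than there are elements, and the guard keeps the surplus threads idle. A CPU-only sketch of this interplay; the nested loops stand in for the kernel launch, and the function names are illustrative.

```c
/* Number of blocks so that blocks * threads_per_block >= n. */
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

/* CPU emulation of the vecAdd launch: each (block, thread) pair computes
   its global index i and works only if i < n. Returns the number of
   threads kept idle by the guard. */
int vec_add_emulated(const float *a, const float *b, float *c,
                     int n, int threads_per_block) {
    int idle = 0;
    int blocks = blocks_for(n, threads_per_block);
    for (int block = 0; block < blocks; block++) {
        for (int thread = 0; thread < threads_per_block; thread++) {
            int i = threads_per_block * block + thread;
            if (i < n) c[i] = a[i] + b[i];   /* same guard as in the kernel */
            else idle++;                     /* surplus thread of the last block */
        }
    }
    return idle;
}
```

With N = 65536 and 256 threads per block the division is exact (256 blocks, no idle threads). With N = 1000, four blocks are started and 24 threads of the last block stay idle.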


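Matrix Multiplication

// DIM (matrix dimension) is a global compile-time constant. It must be a multiple of 16 here,
// because the grid has DIM/16 blocks per dimension and the kernel has no bounds check.
int main() {
    dim3 threads(16, 16, 1);                               // blocksize = 256, really 2-dim
    dim3 grid(DIM / threads.x, DIM / threads.y, 1);        // 2-dim
    static float a[DIM*DIM], b[DIM*DIM], res[DIM*DIM];     // host matrices; fill a and b with input data
    float *devA, *devB, *devRes;
    size_t matSize = DIM * DIM * sizeof(float);

    cudaMalloc((void**)&devA, matSize);
    cudaMalloc((void**)&devB, matSize);
    cudaMalloc((void**)&devRes, matSize);

    cudaMemcpy(devA, a, matSize, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, b, matSize, cudaMemcpyHostToDevice);

    matMul<<<grid, threads>>>(devA, devB, devRes);
    cudaDeviceSynchronize();                               // replaces the deprecated cudaThreadSynchronize()

    cudaMemcpy(res, devRes, matSize, cudaMemcpyDeviceToHost);
    cudaFree(devA); cudaFree(devB); cudaFree(devRes);

    for (int i = 0; i < DIM; i++) {                        // print row by row (row-major, as the kernel writes it)
        for (int j = 0; j < DIM; j++) printf("%f ", res[i*DIM + j]);
        printf("\n");
    }
    return 0;
}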


Matrix Multiplication

Dimension of matrix: DIM (global constant)

One thread per result element: DIM threads in each of the 2 dimensions

__global__ void matMul(float *a, float *b, float *c) {
    int2 id;
    id.x = blockDim.x * blockIdx.x + threadIdx.x;    // column of the result element
    id.y = blockDim.y * blockIdx.y + threadIdx.y;    // row of the result element
    float sum = 0;
    for (int z = 0; z < DIM; z++) {
        sum += a[id.y*DIM + z] * b[z*DIM + id.x];    // row id.y of a times column id.x of b
    }
    c[id.y*DIM + id.x] = sum;
}
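To check the GPU result, a straightforward CPU reference with the same row-major layout is useful. A minimal sketch; the function names and the tolerance-based comparison are illustrative additions, chosen because GPU and CPU may round floating-point sums slightly differently.

```c
/* CPU reference: c = a * b for n x n matrices in row-major layout,
   using the same indexing as the kernel (row y, column x). */
void mat_mul_ref(const float *a, const float *b, float *c, int n) {
    for (int y = 0; y < n; y++) {
        for (int x = 0; x < n; x++) {
            float sum = 0.0f;
            for (int z = 0; z < n; z++)
                sum += a[y * n + z] * b[z * n + x];
            c[y * n + x] = sum;
        }
    }
}

/* Index of the first element where got and want differ by more than
   tol (relative for |want| > 1, absolute otherwise); -1 if all match. */
int first_mismatch(const float *got, const float *want, int count, float tol) {
    for (int i = 0; i < count; i++) {
        float diff = got[i] - want[i];
        if (diff < 0.0f) diff = -diff;
        float scale = want[i] < 0.0f ? -want[i] : want[i];
        if (scale < 1.0f) scale = 1.0f;
        if (diff > tol * scale) return i;
    }
    return -1;
}
```

After copying the device result back to the host, first_mismatch(res, ref, DIM*DIM, 1e-4f) would report the first deviating element, where ref is a hypothetical host array filled by mat_mul_ref.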


Choice of Block Size

• Number of threads per block: multiple of the warp size (32)
• An SM can run up to 8 thread blocks concurrently (16 for compute capability 3.x)
• Small block size: prevents high utilization
• Large block size: less flexibility
• Typical: ~128 to 256 threads per block
• Depends on the application → experiments necessary
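Why the block size should be a multiple of 32 follows from simple arithmetic: a block always occupies whole warps, so the lanes of a partly filled last warp are wasted. A small sketch with illustrative function names:

```c
/* Warps occupied by one block: the hardware schedules whole warps of 32 threads. */
int warps_per_block(int threads_per_block) {
    return (threads_per_block + 31) / 32;
}

/* Smallest multiple of the warp size that is >= threads_per_block. */
int round_up_to_warp(int threads_per_block) {
    return warps_per_block(threads_per_block) * 32;
}

/* Percentage of warp lanes doing useful work (integer division, rounded down). */
int lane_utilization_percent(int threads_per_block) {
    return 100 * threads_per_block / round_up_to_warp(threads_per_block);
}
```

A block of 100 threads occupies 4 warps (128 lanes), so only 78 % of the lanes are used; with 128 or 256 threads the utilization is 100 %.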


Avoid Control Flow Divergence

• Typical code for CPU:

if (idx & 1) a[idx] = b[idx]; else a[idx] = b[idx] + 1;

• On GPU: performance loss by a factor of 2!

First: all odd threads in the warp execute a[idx] = b[idx];

Then: all even threads in the warp execute a[idx] = b[idx] + 1;

• Better: a[idx] = b[idx] + ((idx & 1) ^ 1);   // adds 1 only for even idx

• Not always so simple to detect and to cure!
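When a branch is replaced by arithmetic, the new expression must produce exactly the same values: here 1 has to be added for even indices and 0 for odd ones. The CPU sketch below puts both variants side by side so that the equivalence can be tested; the function names are illustrative, and the loops stand in for the threads of a kernel.

```c
/* Branching variant: odd and even indices take different paths. */
void update_branching(int *a, const int *b, int n) {
    for (int idx = 0; idx < n; idx++) {
        if (idx & 1) a[idx] = b[idx];       /* odd index */
        else         a[idx] = b[idx] + 1;   /* even index */
    }
}

/* Branch-free variant: (idx & 1) ^ 1 is 1 for even idx and 0 for odd idx,
   so all threads of a warp would execute the same instruction. */
void update_branch_free(int *a, const int *b, int n) {
    for (int idx = 0; idx < n; idx++)
        a[idx] = b[idx] + ((idx & 1) ^ 1);
}
```

Both functions yield identical arrays for any input, which makes this kind of rewrite easy to verify on the host before moving it into a kernel.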


Comparison OpenCL and CUDA

• OpenCL
  - Compiles kernels at runtime for the actual platform (~)
  - Supports cards of several manufacturers (++)
  - Also supports non-GPU devices (++)
  - Open standard (++)
  - Available for: Windows, Mac, Linux
  - Larger setup code (-)
  - Less programming comfort (-)
  - Command queues resemble CUDA streams (~)
  - API commands are similar, with partly different parameters


Comparison OpenCL and CUDA

• CUDA
  - Abstraction for general-purpose computing on a GPU (GPGPU)
  - Good high-level API (+)
  - Uniform programming model (+)
  - Low-level and high-level thread synchronization (+)
  - Stream synchronization (+)
  - Native support for atomic operations (+)
  - Very good documentation (+)
  - Works only on NVIDIA hardware (-)
  - Manual memory management (-)
  - Available for: Windows, Mac, Linux


Summary

• Use GPUs as co-processors for massively parallel problems with regular structure (few control-flow statements)
• CUDA: most popular programming environment
• Many more issues were not mentioned in this introduction
  → see the CUDA manual and textbooks


Thank you very much for your attention