GPU-Computing International Summer School 2015 Tomsk Polytechnic University

Parallel Processing with GPUs

Overview

• GPU Hardware
• CUDA Programming Fundamentals
• CUDA Programming Examples
• Summary

Graphics Processing Units are Complex

Figure: DirectX 10 pipeline

• Vertex shader: transforms 3D coordinates to 2D; computes point position and color
• Geometry shader (DirectX 10): triangulation; adds primitives, e.g. line segments, to improve curve representation
• Pixel shader: modifies color and shading of pixels

Start of GPU-Computing

First Languages for Shading
• RenderMan Shading Language (Pixar, 1988)
• Stanford Real-Time Shading Language (2001)

Standardized Shading Languages (SL)
• GLSL (OpenGL Shading Language)
• HLSL (High Level Shading Language, Microsoft)
• Cg (NVIDIA, for OpenGL and Direct3D)

General Purpose Graphics Processing Units

• GPU computing, or GPGPU, means using the GPU for general computation (not for computing images for display)
• In GPU computing, CPU and GPU both participate in the computation
• CPU: control-flow-intensive part; GPU: data-intensive part

GPU vs. CPU

• CPU
  - Small to medium number of strong general-purpose cores
  - High performance for a single thread up to a medium number of threads
• GPU (Graphics Processing Unit)
  - Large number of small, specialized cores
  - High performance for a very large number of threads
• Now: integration of GPU and CPU on the same die
  - Examples: AMD Trinity; Intel HD 2000 and HD 4000 for i{3, 5, 7}-3K; NVIDIA Tegra

SIMD Processing

CPU instruction set extensions
• x86 SSE or AVX, 3DNow!, PowerPC AltiVec
• Knights Ferry/Corner/Landing (MIC, Larrabee)
• Typical SIMD width: 2–8

GPU
• Implicit vectorization through hardware
• NVIDIA GPUs: SIMD warps
• AMD GPUs: VLIW wavefronts
• Typical SIMD width: 16–64 (warp = 32)

Programming APIs:

• CUDA (NVIDIA; leading but proprietary)
• OpenCL (Open Computing Language, open standard)
• DirectCompute (Microsoft)
• OpenACC (pragma-based standard, like OpenMP)
• PGI (The Portland Group, Inc.) Accelerator Compiler, implicitly parallel programming language
• Shaders in OpenGL/DirectX (GLSL, HLSL, open standards)
• Brook ⇒ Brook+, RapidMind ⇒ CAL (Compute Abstraction Layer, AMD)
• FireStream (AMD), HMPP, GPUSs, StarPU, QUARK, OpenMPC

General Organization and Connection to Host (CPU)

General Organization of GPU

• Many simple cores, streaming processors (SPs)
• Grouped (32 each) into multiprocessors
• Many registers
• Small, fast, local shared memories
• Large, slow global memory (access takes 400 to 800 cycles)
• Optimized for data-parallel processing
  - Serialization upon control flow divergence
  - No branch prediction

NVIDIA GeForce 8 Architecture

• Floats and doubles available
• Doubles slow
• Missing exceptions (e.g. divide by 0)

SLI: Scalable Link Interface

Example: NVIDIA GeForce 8800 GT

Fermi Streaming Multiprocessor (SM)

- 2 warp schedulers
- Warp: 32 parallel threads
- 2 dispatch units
- 4*8 = 32 cores
- 32K 32-bit registers
- 4 special function units (SIN, COS, EXP, RCP, etc.)

NVIDIA Tesla Series

• Based on Fermi architecture (Tesla 20 series; the T10P below belongs to the preceding GT200 generation)
• T10P:
  - 240 stream processors (SPs) @ 1.33 GHz
  - 4 GB GDDR3 memory @ 800 MHz
  - 512-bit memory bus
  - 1.4*10^9 transistors
  - ~1*10^12 FLOPS (1 TFLOPS)

Tesla 10 Series

• Tesla C1060 Computing Processor
  - PCIe 2.0 card (x16)
  - 1 T10P processor with 240 SPs @ 1.33 GHz
  - 4 GB GDDR3 memory @ 800 MHz
  - 512-bit memory bus
  - 102 GB/s memory bandwidth (theoretical)
  - ~160 W power consumption
• Tesla S1070 1U System
  - 1U form factor
  - 4 T10P processors @ 1.5 GHz
  - I.e. 960 SPs, 16 GB RAM
  - 2048-bit memory bus
  - Host connection through 2 PCIe 2.0 cables
  - ~700 W power consumption
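The theoretical figures for the C1060 can be reproduced with simple arithmetic. The following is a minimal host-side C sketch, not vendor code: the function names are hypothetical, GDDR3 is assumed to transfer data twice per memory clock, and the value of 3 floating-point operations per SP per cycle (one multiply-add plus one multiply) is an assumption used here to arrive at the quoted ~1 TFLOPS.

```c
#include <assert.h>

/* Theoretical peak memory bandwidth in GB/s (1 GB = 10^9 bytes).
 * bus_bits:            width of the memory bus in bits
 * mem_clock_mhz:       memory clock in MHz
 * transfers_per_clock: 2 for double-data-rate memory (assumed for GDDR3) */
double peak_bandwidth_gbs(int bus_bits, double mem_clock_mhz, int transfers_per_clock)
{
    assert(bus_bits > 0 && bus_bits % 8 == 0);
    double bytes_per_transfer = bus_bits / 8.0;
    return bytes_per_transfer * mem_clock_mhz * 1e6 * transfers_per_clock / 1e9;
}

/* Theoretical peak single-precision rate in GFLOPS.
 * flops_per_cycle is an assumption of this sketch (3 = multiply-add + multiply). */
double peak_gflops(int cores, double clock_ghz, int flops_per_cycle)
{
    assert(cores > 0 && flops_per_cycle > 0);
    return cores * clock_ghz * flops_per_cycle;
}
```

With the slide's numbers, peak_bandwidth_gbs(512, 800.0, 2) gives 102.4 GB/s, and peak_gflops(240, 1.33, 3) gives roughly 958 GFLOPS, i.e. about 1 TFLOPS.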

Compute Capabilities

Compute Capability                              1.0    1.1    1.2    1.3    2.0    2.1    3.0    3.5
Threads / Warp                                   32     32     32     32     32     32     32     32
Warps / Multiprocessor                           24     24     32     32     48     48     64     64
Threads / Multiprocessor                        768    768   1024   1024   1536   1536   2048   2048
Thread Blocks / Multiprocessor                    8      8      8      8      8      8     16     16
Max Shared Memory / Multiprocessor (Bytes)    16384  16384  16384  16384  49152  49152  49152  49152
Register File Size                             8192   8192  16384  16384  32768  32768  65536  65536
Register Allocation Unit Size                   256    256    512    512     64     64    256    256
Allocation Granularity                        block  block  block  block   warp   warp   warp   warp
Max. Registers / Thread                         124    124    124    124     63     63     63    255
Shared Memory Allocation Unit Size              512    512    512    512    128    128    256    256
Warp Allocation Granularity                       2      2      2      2      2      2      4      4
Max. Thread Block Size                          512    512    512    512   1024   1024   1024   1024
Shared Memory Size Configurations (Bytes)     16384  16384  16384  16384  49152  49152  49152  49152
Warp Register Allocation Granularities            -      -      -      -     64     64    256    256

Programming with CUDA

CUDA = Compute Unified Device Architecture

CUDA

Latest version: http://developer.nvidia.com/cuda/cuda-downloads

Current: CUDA 7 (May 2015)

Toolkit contains:
• NVIDIA Performance Primitives (NPP) library
• Support for Eclipse and Visual Studio
• LLVM-based compiler (nvcc); LLVM = Low Level Virtual Machine
• Visual Profiler
• cuda-gdb debugger (Linux & MacOS)
• GPU disassembler (cuobjdump)
• Examples with source code

Libraries (from CUDA Toolkit)

• cuFFT: Fast Fourier Transform
• cuBLAS: complete BLAS (Basic Linear Algebra Subprograms)
• cuSPARSE: sparse matrix computations
• cuRAND: random number generation (RNG)
• NPP: performance primitives for image and video processing
• nvcuvid: video decoding
• nvcuvenc: video encoding
• Thrust: templated parallel algorithms and data structures, e.g. parallel sorting, parallel summation, data structures for vectors

CUDA

• CUDA extends the C language
• Compiled with the nvcc compiler
• High portability between different CUDA architectures

Compilation

CUDA C functions → compiler nvcc → PTX code → PTX for target architecture (object code)
C program (without CUDA) → compiler, e.g. gcc → CPU object files
Both paths are combined into one CPU/GPU executable.

PTX: Parallel Thread eXecution, a virtual instruction set architecture

CUDA Notation

• Device
  - Graphics card with GPU and graphics memory
• Kernel
  - Program that runs on the device
  - A kernel can only access GPU memory
  - Newer CUDA versions can run several kernels simultaneously
• Host
  - CPU, which starts kernels on the device

CUDA Programming Model (1)

• Programming in C
• Some functions are executed on the GPU as kernels
• The program is partitioned into host code (standard C) and device code (CUDA)
• Functions are distinguished by qualifiers:
    __host__     functions on the CPU (default)
    __device__   GPU functions, callable from device code
    __global__   entry points into CUDA code; define kernels

CUDA Further Notation

• Kernel: mapped onto a grid
• Grid: 2D mesh of thread blocks

Kernel start (CUDA):

    kernel<<<dim3 gridS, dim3 blockS, size_t smem, cudaStream_t str>>>(…)

    dim3 gridS         (max. 65535 × 65535)    dimension of grid (2D)
    dim3 blockS        (max. 512 × 512 × 64)   dimension of block (3D); the product of the three is limited by the max. thread block size (512 or 1024, depending on compute capability)
    size_t smem        (optional)              shared memory per block, in bytes
    cudaStream_t str   (optional)              stream in which the kernel is started

• Block: groups threads as a 3D mesh
• Blocks must be independent
• Threads in a thread block can be synchronized
• Shared memory can be used to exchange data
• Threads: have a unique ID, 1D to 3D
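Whether a launch configuration respects these limits can be checked on the host before the kernel is started. The following is a minimal plain-C sketch: Dim3, LaunchLimits and the function names are hypothetical stand-ins (not part of the CUDA API), and the limits are passed in as parameters because they depend on the compute capability.

```c
#include <assert.h>
#include <stdbool.h>

/* Plain-C stand-in for CUDA's dim3 (hypothetical type, host side only). */
typedef struct { unsigned x, y, z; } Dim3;

/* Limits are parameters because they depend on the compute capability. */
typedef struct {
    unsigned max_grid_x, max_grid_y;                 /* 2D grid */
    unsigned max_block_x, max_block_y, max_block_z;  /* 3D block */
    unsigned max_threads_per_block;                  /* limit on x*y*z */
} LaunchLimits;

/* Number of threads in one block. */
unsigned long long threads_per_block(Dim3 block)
{
    return (unsigned long long)block.x * block.y * block.z;
}

/* Number of threads in the whole grid. */
unsigned long long total_threads(Dim3 grid, Dim3 block)
{
    return (unsigned long long)grid.x * grid.y * grid.z * threads_per_block(block);
}

/* True if every dimension and the block's thread count are within the limits. */
bool launch_config_ok(Dim3 grid, Dim3 block, LaunchLimits lim)
{
    assert(lim.max_threads_per_block > 0);
    if (grid.x < 1 || grid.y < 1 || grid.z != 1) return false;
    if (block.x < 1 || block.y < 1 || block.z < 1) return false;
    if (grid.x > lim.max_grid_x || grid.y > lim.max_grid_y) return false;
    if (block.x > lim.max_block_x || block.y > lim.max_block_y ||
        block.z > lim.max_block_z) return false;
    return threads_per_block(block) <= lim.max_threads_per_block;
}
```

For example, a 16 × 16 block (256 threads) passes with a limit of 512 threads per block, while a 32 × 32 block (1024 threads) passes only when the limit is 1024, although each single dimension is below 512.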

Scheduling of Kernel onto GPU Hardware

• A kernel is a grid, which contains blocks
• Each block is assigned to a Streaming Multiprocessor (SM)
• The SM partitions the block into warps (warp: 32 threads)
• All threads of one warp are executed simultaneously on the Streaming Processors (SPs) of the SM
• Assignment of threads to SPs is dynamic
• A warp is executed in SIMD style
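The partitioning of a block into warps is a ceiling division by the warp size. A small host-side C sketch (function names are hypothetical) also shows how many SIMD lanes of the last warp stay unused when the block size is not a multiple of 32:

```c
#include <assert.h>

enum { WARP_SIZE = 32 };

/* Number of warps an SM forms from one block (rounded up). */
int warps_per_block(int threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Lanes of the last warp that have no thread assigned. */
int idle_lanes(int threads_per_block)
{
    return warps_per_block(threads_per_block) * WARP_SIZE - threads_per_block;
}
```

A block of 256 threads forms 8 full warps; a block of 100 threads forms 4 warps, and 28 lanes of the last warp do no useful work.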

Memory Hierarchy

• Registers: per thread, small capacity (KB), small latency
• Shared memory: per block, medium capacity (KB), medium latency, can coordinate threads
• Global memory: per grid, large capacity (GB), large latency, necessary for I/O

CUDA Memory Model

Figure: a grid consists of blocks (0,0), (0,1), …, (0,n-1), each with its own shared memory; texture memory, constant memory and global memory belong to the grid as a whole.

Memory Model

• Memory is specified through qualifiers:
    __device__   in global memory on the GPU
    __shared__   in shared memory on the SMs
• Operations on global memory:
  - Allocation: cudaMalloc(void **ptr, size_t bytes)
  - Deallocation: cudaFree(void *ptr)
  - Set memory: cudaMemset(void *ptr, int value, size_t bytes)
  - Transfer system RAM ↔ GPU RAM: cudaMemcpy(void *dst, const void *src, size_t bytes, …)
• In kernels: transfer global memory ↔ shared memory

Caching
- Default: enabled
- Access first L1, then L2, then global memory
- Granularity: 128-byte cache line

Non-caching
- Activate with option -Xptxas -dlcm=cg of the NVIDIA compiler (nvcc)
- Access first L2, then global memory
- No access to L1; if present in L1: invalidate the cache line
- Granularity: 32 bytes
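The difference in granularity matters most for scattered accesses. The following host-side C sketch uses a deliberately simplified model (an assumption, not a description of the actual hardware): every aligned 128-byte line or 32-byte segment touched by a warp is moved completely, and the array starts at an aligned address. Function names are hypothetical.

```c
#include <assert.h>

enum { WARP_SIZE = 32 };

/* Number of distinct aligned segments of seg_bytes that are touched when
 * thread t of a warp reads element t * stride (elem_bytes each) of an
 * array that starts at a segment-aligned address. */
int segments_touched(int stride, int elem_bytes, int seg_bytes)
{
    assert(stride >= 1 && elem_bytes > 0);
    assert(seg_bytes >= elem_bytes && seg_bytes % elem_bytes == 0);
    int count = 0;
    long last = -1;
    for (int t = 0; t < WARP_SIZE; t++) {
        long seg = ((long)t * stride * elem_bytes) / seg_bytes;
        if (seg != last) {      /* addresses grow with t, so a change means a new segment */
            count++;
            last = seg;
        }
    }
    return count;
}

/* Bytes moved for one warp-wide read under the simplified model. */
int bytes_moved(int stride, int elem_bytes, int seg_bytes)
{
    return segments_touched(stride, elem_bytes, seg_bytes) * seg_bytes;
}
```

For 4-byte floats with stride 1, both modes move 128 bytes per warp. With a stride of 32 elements, the model gives 32 × 128 = 4096 bytes in caching mode but only 32 × 32 = 1024 bytes in non-caching mode, for the same 128 bytes of useful data.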

Accessing Array Elements

Size of a 2-dim block:

    blockDim.x    // size of 2-dim block (X coord)
    blockDim.y    // size of 2-dim block (Y coord)

Identifying a thread within a 2-dim block:

    threadIdx.x   // thread ID in block (0 to blockDim.x-1)
    threadIdx.y   // thread ID in block (0 to blockDim.y-1)

For block dimension (3,4):

    threadIdx.x = 0,1,2
    threadIdx.y = 0,1,2,3

Accessing Array Elements

Identifying a block within a 1-dim grid:

    blockIdx.x    // block ID in grid

Access to array element (1-dim grid, 2-dim blocks):

    blocksize = blockDim.x * blockDim.y;             // no. of threads in block
    tid = threadIdx.y * blockDim.x + threadIdx.x;    // linear thread ID within block
    index = blockIdx.x * blocksize + tid;            // array element index

Access to array element (1-dim grid, 1-dim blocks):

    blocksize = blockDim.x;                  // no. of threads in block
    tid = threadIdx.x;                       // linear thread ID within block
    index = blockIdx.x * blocksize + tid;    // array element index

or

    index = blockIdx.x * blockDim.x + threadIdx.x;
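A correct index formula must map every thread of the grid to a different array element. This can be checked without a GPU by enumerating all block and thread IDs on the host. The sketch below is plain C with hypothetical function names; the row stride inside a block is the block width (blockDim.x), so that the thread at (x, y) gets the linear ID y * blockDim.x + x.

```c
#include <assert.h>
#include <stdlib.h>

/* Array index of one thread in a 1-dim grid of 2-dim blocks. */
int linear_index(int block_idx_x, int block_dim_x, int block_dim_y,
                 int thread_idx_x, int thread_idx_y)
{
    int blocksize = block_dim_x * block_dim_y;            /* threads per block */
    int tid = thread_idx_y * block_dim_x + thread_idx_x;  /* linear ID in block */
    return block_idx_x * blocksize + tid;
}

/* Enumerates all threads of the grid on the host.
 * Returns 1 if every index in [0, grid_x * block_x * block_y)
 * is produced exactly once, otherwise 0. */
int indices_cover_once(int grid_x, int block_x, int block_y)
{
    assert(grid_x > 0 && block_x > 0 && block_y > 0);
    int n = grid_x * block_x * block_y;
    int *hits = calloc((size_t)n, sizeof *hits);
    if (hits == NULL) return 0;
    for (int bx = 0; bx < grid_x; bx++)
        for (int ty = 0; ty < block_y; ty++)
            for (int tx = 0; tx < block_x; tx++)
                hits[linear_index(bx, block_x, block_y, tx, ty)]++;
    int ok = 1;
    for (int i = 0; i < n; i++)
        if (hits[i] != 1) ok = 0;
    free(hits);
    return ok;
}
```

For block dimension (3,4), the last thread (2,3) of block 1 gets index 1 * 12 + 3 * 3 + 2 = 23, the last element of a 24-element array covered by two blocks.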

First Parallel Code – GPU Part

• Add two vectors of length N
• 1-dim grid of N/B blocks, each 1-dim block consisting of B threads
• blockDim.x: first dimension of block (i.e. B)

    __global__ void vecAdd(float *a, float *b, float *c, int N) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;   // global thread index
        if (i < N) {                                     // guard: the grid may contain more threads than N
            c[i] = a[i] + b[i];
        }
    }

First Parallel Code – CPU Part

    #define N 65536                     // N is the dimension of the vectors
    #include <stdio.h>
    #include <cuda_runtime.h>           // CUDA runtime API (cudaMalloc, cudaMemcpy, ...)

    // The vecAdd kernel from the GPU part must be defined before main.

    int main(void) {
        size_t size = N * sizeof(float);
        float *dA, *dB, *dC;                        // device pointers
        static float hA[N], hB[N], hC[N];           // host arrays; static keeps 768 KB off the stack
        for (int i = 0; i < N; i++) { hA[i] = (float)i; hB[i] = (float)(N - i); }
        cudaMalloc((void**)&dA, size);              // allocate vectors in device memory
        cudaMalloc((void**)&dB, size);
        cudaMalloc((void**)&dC, size);
        cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, size, cudaMemcpyHostToDevice);
        int threadS = 256;                          // threads per block
        int blockS = (N + threadS - 1) / threadS;   // blocks per grid, rounded up
        vecAdd<<<blockS, threadS>>>(dA, dB, dC, N);
        cudaMemcpy(hC, dC, size, cudaMemcpyDeviceToHost);   // blocking copy; waits for the kernel
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        for (int i = 0; i < N; i++)
            if (hC[i] != N) printf("Wrong result at %d\n", i);
        return 0;
    }
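The rounded-up number of blocks and the guard i < N in the kernel belong together: when N is not a multiple of the block size, the grid contains surplus threads that must not touch the arrays. The following plain-C sketch emulates the launch serially on the host (no GPU involved; function names are hypothetical) and counts the threads that the guard turns away.

```c
#include <assert.h>

/* Blocks needed so that blocks * threads_per_block >= n (ceiling division). */
int blocks_for(int n, int threads_per_block)
{
    assert(n > 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Serial host emulation of vecAdd<<<blocks, threads_per_block>>>(a, b, c, n).
 * Returns the number of surplus threads stopped by the guard. */
int emulate_vec_add(const float *a, const float *b, float *c,
                    int n, int threads_per_block)
{
    int blocks = blocks_for(n, threads_per_block);
    int surplus = 0;
    for (int block = 0; block < blocks; block++) {
        for (int thread = 0; thread < threads_per_block; thread++) {
            int i = threads_per_block * block + thread;  /* same index as in the kernel */
            if (i < n)
                c[i] = a[i] + b[i];
            else
                surplus++;                               /* would be out of bounds */
        }
    }
    return surplus;
}
```

For N = 65536 and 256 threads per block, blocks_for returns 256 and no thread is surplus. For N = 1000 it returns 4, so 1024 threads are started and 24 of them do nothing.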

Matrix Multiplication

    int main() {
        dim3 threads(16, 16, 1);                            // block size = 256 threads, 2-dim
        dim3 grid(DIM / threads.x, DIM / threads.y, 1);     // 2-dim; assumes DIM is a multiple of 16
        static float a[DIM*DIM], b[DIM*DIM], res[DIM*DIM];  // host matrices; a and b must be filled before use
        float *devA, *devB, *devRes;                        // device pointers
        size_t matSize = DIM * DIM * sizeof(float);
        cudaMalloc((void**)&devA, matSize);
        cudaMalloc((void**)&devB, matSize);
        cudaMalloc((void**)&devRes, matSize);
        cudaMemcpy(devA, a, matSize, cudaMemcpyHostToDevice);
        cudaMemcpy(devB, b, matSize, cudaMemcpyHostToDevice);
        matMul<<<grid, threads>>>(devA, devB, devRes);
        cudaDeviceSynchronize();                            // replaces the deprecated cudaThreadSynchronize()
        cudaMemcpy(res, devRes, matSize, cudaMemcpyDeviceToHost);
        for (int i = 0; i < DIM; i++) {
            for (int j = 0; j < DIM; j++) printf("%f ", res[i*DIM + j]);   // row-major, as written by the kernel
            printf("\n");
        }
        cudaFree(devA); cudaFree(devB); cudaFree(devRes);
        return 0;
    }

Matrix Multiplication

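Dimension of matrix: DIM (global constant, e.g. defined with #define)

In each of the 2 dimensions: DIM threads

    __global__ void matMul(float *a, float *b, float *c) {
        int2 id;
        id.x = blockDim.x * blockIdx.x + threadIdx.x;   // column of the result element
        id.y = blockDim.y * blockIdx.y + threadIdx.y;   // row of the result element
        float sum = 0;
        for (int z = 0; z < DIM; z++) {
            sum += a[id.y*DIM + z] * b[z*DIM + id.x];   // row of a times column of b
        }
        c[id.y*DIM + id.x] = sum;                       // row-major storage
    }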
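To check the result of such a kernel, it helps to have a serial reference on the host that uses the same row-major indexing. The following is a minimal plain-C sketch; the function names and the tolerance parameter are hypothetical and not part of the original slides.

```c
#include <assert.h>

/* Serial reference: c = a * b for dim x dim matrices in row-major order,
 * with the same indexing as the kernel. */
void mat_mul_ref(const float *a, const float *b, float *c, int dim)
{
    assert(dim > 0);
    for (int row = 0; row < dim; row++) {
        for (int col = 0; col < dim; col++) {
            float sum = 0.0f;
            for (int z = 0; z < dim; z++)
                sum += a[row*dim + z] * b[z*dim + col];
            c[row*dim + col] = sum;
        }
    }
}

/* Index of the first element where x and y differ by more than tol,
 * or -1 if all count elements agree. */
int first_mismatch(const float *x, const float *y, int count, float tol)
{
    for (int i = 0; i < count; i++) {
        float diff = x[i] - y[i];
        if (diff < 0.0f) diff = -diff;
        if (diff > tol) return i;
    }
    return -1;
}
```

After copying the GPU result back, first_mismatch(res, ref, DIM*DIM, tol) with a small tolerance reports the first wrong element; a nonzero tolerance is advisable because the GPU may round differently from the CPU.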

Choice of Block Size

• Number of threads per block: multiple of warp size (32)
• An SM can execute up to 8 thread blocks in parallel (16 for compute capability 3.x)
• Small block size: prevents high utilization
• Large block size: less flexibility
• Typical: ~128 to 256 threads per block
• Depends on the application ⇒ experiments necessary
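The effect of the block size on utilization can be estimated from the per-multiprocessor limits in the compute capability table. The following plain-C sketch is a simplified estimate, not NVIDIA's occupancy calculator: it ignores the register and shared memory allocation granularities, and the struct and function names are hypothetical.

```c
#include <assert.h>

/* Per-multiprocessor limits, to be filled from the compute capability table. */
typedef struct {
    int max_threads_per_sm;
    int max_blocks_per_sm;
    int max_warps_per_sm;
    int shared_mem_per_sm;   /* bytes */
    int registers_per_sm;
} SmLimits;

static int min_int(int a, int b) { return a < b ? a : b; }

/* Number of blocks that fit on one SM at the same time. */
int resident_blocks(SmLimits sm, int threads_per_block,
                    int smem_per_block, int regs_per_thread)
{
    assert(threads_per_block > 0 && smem_per_block >= 0 && regs_per_thread >= 0);
    int blocks = min_int(sm.max_blocks_per_sm,
                         sm.max_threads_per_sm / threads_per_block);
    if (smem_per_block > 0)
        blocks = min_int(blocks, sm.shared_mem_per_sm / smem_per_block);
    if (regs_per_thread > 0)
        blocks = min_int(blocks,
                         sm.registers_per_sm / (regs_per_thread * threads_per_block));
    return blocks;
}

/* Resident warps as a percentage of the maximum (rounded down). */
int occupancy_percent(SmLimits sm, int threads_per_block,
                      int smem_per_block, int regs_per_thread)
{
    int warps_per_block = (threads_per_block + 31) / 32;
    int warps = resident_blocks(sm, threads_per_block,
                                smem_per_block, regs_per_thread) * warps_per_block;
    return 100 * warps / sm.max_warps_per_sm;
}
```

With the limits of compute capability 2.0 (1536 threads, 8 blocks, 48 warps, 49152 bytes of shared memory, 32768 registers), blocks of 64 threads reach only 33%, because the limit of 8 blocks is hit first. Blocks of 256 threads reach 100%, and blocks of 1024 threads reach 66%, because only one block fits.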

Avoid Control Flow Divergence

• Typical code for CPU:

    if (idx & 1) a[idx] = b[idx]; else a[idx] = b[idx] + 1;

• On GPU: performance loss by a factor of 2!
  First: all odd threads in the warp execute a[idx] = b[idx];
  Then: all even threads in the warp execute a[idx] = b[idx] + 1;
• Better: a[idx] = b[idx] + ((idx & 1) ^ 1);   // adds 1 only for even idx, as in the branching version
• Not always so simple to detect and to cure!
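A branch-free rewrite is only valid if it computes the same values as the branching code, and divergence only costs time when threads of the same warp take different paths. Both points can be checked on the host. The sketch below is plain C with hypothetical function names; the pass count is a simplified model in which each path taken by at least one thread of a warp costs one pass.

```c
#include <assert.h>

enum { WARP_SIZE = 32 };

/* Branching version: odd idx copies b, even idx adds 1. */
int branching(int idx, int b)
{
    if (idx & 1) return b; else return b + 1;
}

/* Branch-free version: (idx & 1) ^ 1 is 1 for even idx and 0 for odd idx. */
int branch_free(int idx, int b)
{
    return b + ((idx & 1) ^ 1);
}

/* Returns 1 if both versions agree for idx = 0 .. n-1. */
int variants_agree(int n)
{
    assert(n >= 0);
    for (int idx = 0; idx < n; idx++)
        if (branching(idx, idx) != branch_free(idx, idx)) return 0;
    return 1;
}

typedef int (*Predicate)(int idx);

int is_odd(int idx)      { return idx & 1; }                 /* differs within a warp */
int in_odd_warp(int idx) { return (idx / WARP_SIZE) & 1; }   /* same for a whole warp */

/* Number of serialized passes for the warp starting at first_idx:
 * 1 if all 32 threads take the same path, 2 if both paths occur. */
int passes_for_warp(Predicate p, int first_idx)
{
    int taken = 0, not_taken = 0;
    for (int t = 0; t < WARP_SIZE; t++) {
        if (p(first_idx + t)) taken = 1; else not_taken = 1;
    }
    return taken + not_taken;
}
```

A condition on idx & 1 splits every warp and needs 2 passes, while a condition that is constant within each warp, such as (idx / 32) & 1, needs only 1 pass even though it is still a branch.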

Comparison OpenCL and CUDA

• OpenCL
  - Compiles kernels at runtime for the actual platform (~)
  - Supports cards of several manufacturers (++)
  - Also supports non-GPU devices (++)
  - Open standard (++)
  - Available for: Windows, Mac, Linux
  - Larger setup code (-)
  - Less programming comfort (-)
  - Command queues resemble CUDA streams (~)
  - API commands similar, partly different parameters

Comparison OpenCL and CUDA

• CUDA
  - Abstraction for general-purpose computing on a GPU (GPGPU)
  - Good high-level API (+)
  - Uniform programming model (+)
  - Low-level and high-level thread synchronization (+)
  - Stream synchronization (+)
  - Native support for atomic operations (+)
  - Very good documentation (+)
  - Works only on NVIDIA hardware (-)
  - Manual memory management (-)
  - Available for: Windows, Mac, Linux

Summary

• Use GPUs as co-processors for massively parallel problems with regular structure (few control flow statements)
• CUDA: most popular programming environment
• Many more issues are not mentioned in this introduction ⇒ see the CUDA manual and textbooks

Thank you very much for your attention