GPU Introduction
Transcript of GPU Introduction
The Architecture of the Graphic Processor Unit - GPU
P. Bakowski
Evolution of parallel architectures

We can distinguish 3 generations of massively parallel architectures (scientific calculation):

(1) The supercomputers with special processors for vector calculation (Single Instruction Multiple Data).
The Cray-1 (1976) contained 200,000 integrated circuits and could perform 100 million floating point operations per second (100 MFLOPS).
Price: $5 - $8.8 million
Number of units sold: 85
Evolution of parallel architectures

(2) The supercomputers with standard microprocessors adapted for massive multiprocessing, operating as Multiple Instruction Multiple Data (MIMD) computers.

IBM Roadrunner: PowerXCell 8i CPUs, 6480 dual-core AMD Opterons, Linux
Consumption: 2.35 MW
Surface: 296 racks, 560 m²
Memory: 103.6 TiB
Performance: 1.042 petaflops
Price: USD $125M
Evolution of GPU architectures

(3) General-Purpose computing on Graphics Processing Units (GPGPU): technology based on the circuits integrated into graphic cards.
GPU based processing

The GPUs (Graphic Processing Units) contain hundreds or thousands of arithmetic units. These capacities may be used to accelerate a wide range of computing applications.

CUDA cores: 48 per streaming processor (example: nVIDIA GT200, 300, 400, 500 series).
CPUs and SSE extensions

Modern CPUs integrate specific SIMD units for graphic processing. These units implement the SSE2, SSE3 and SSE4 instructions and contain 4 arithmetic units that may operate in parallel on 4 fixed- or floating-point data elements.
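As a hedged illustration of this 4-wide operation (function and variable names are mine, not from the slides), a single SSE instruction adding 4 floats at once looks like this in host-side C:

```
#include <xmmintrin.h>  // SSE intrinsics

// Adds two arrays of 4 floats with one SIMD instruction.
// Minimal sketch of 4-wide SSE processing.
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);    // load 4 floats
    __m128 vb = _mm_loadu_ps(b);    // load 4 floats
    __m128 vc = _mm_add_ps(va, vb); // one instruction, 4 additions
    _mm_storeu_ps(out, vc);         // store 4 results
}
```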
CPUs and GPUs

The GPUs are based on multiple processing units with multiple processing cores (8/16/32 cores per processing unit); they contain register files and shared memories.

A graphic card contains a global memory that can be used by all processors (including the CPU), a local memory for each processing unit, and special memories for constant values.
GPUs: streaming multi-processors

The streaming multiprocessors (SM) integrated in GPUs are SIMD blocks with several arithmetic cores: 8/16/32/48 cores per SM.
Each core contains one Floating Point unit and one INTeger unit.
CPUs and cache memories

CPUs use cache memories to reduce the access latency to main memory.
CPU caches take up more and more of the processor's surface area and use a lot of energy.
Cache memory: latency

[figure: memory access latencies]
GPUs and cache memories

GPUs use caches or shared memory to increase the memory bandwidth.

[figure: SMs with shared memory connected to the Global Memory]
GPU memory: transfer data rate

Each GPU multiprocessor has its own memory controller. For example, each memory controller of the nVIDIA GT200 chip provides 8 64-bit communication channels.

[figure: SMs, Shared Memory and Raster OutPut units linked by 8 * 64-bit channels]
GPU memory: transfer data rate

data_rate = (interface_width / 8) * memory_clock * 2

For the GTX275:
number of bytes on the bus: 448 bits / 8 = 56
data rate in bytes: 56 * 1224 MHz = 68,544 MB/s
68,544 MB/s * 2 = 137,088 MB/s = 137.1 GB/s
(the factor 2 reflects two reads/writes per clock cycle: double data rate memory)
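The same computation as a small C sketch (function and variable names are illustrative, not from the slides):

```
#include <stdio.h>

// Memory data rate in MB/s from the bus width (bits) and the
// memory clock (MHz); the factor 2 models double-data-rate
// transfers. Illustrative sketch, not vendor code.
double data_rate_mb(int interface_width_bits, double memory_clock_mhz)
{
    double bytes_per_transfer = interface_width_bits / 8.0; // 448/8 = 56
    return bytes_per_transfer * memory_clock_mhz * 2.0;     // DDR factor
}

int main(void)
{
    // GTX275: 448-bit bus, 1224 MHz memory clock
    printf("%.1f GB/s\n", data_rate_mb(448, 1224.0) / 1000.0); // ~137.1
    return 0;
}
```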
CPU/GPU: execution threads

The GPU avoids the memory latency by the simultaneous execution of thousands of threads; if one thread waits on a memory access, another one may be executed at the same time.

[figure: while one thread waits, other threads execute]
CPU/GPU: execution threads

A CPU may execute 1-2 threads per core; each GPU multiprocessor may maintain up to 1024 threads.
The cost of thread "context" switching for a CPU core is tens or hundreds of memory cycles; a GPU may switch several threads per clock cycle.
SIMD versus SIMT

The CPUs exploit the vector processing units for SIMD processing (a single instruction is executed on multiple data elements) - a single execution thread!

The GPUs use the SIMT operational mode: a single instruction is executed by multiple threads.
SIMT processing does not require the transformation of the data into vectors.
It allows for arbitrary branches in the threads, as the sketch below illustrates.

[figure: SIMD versus SIMT]
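A minimal CUDA sketch of SIMT-style per-thread branching (kernel and variable names are mine, not from the slides):

```
// Each thread executes the same instruction stream but may take
// its own branch, which SSE-style SIMD cannot do directly.
__global__ void simt_branch(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // one element per thread
    if (i < n) {
        if (in[i] > 0.0f)            // per-thread (arbitrary) branch
            out[i] = in[i] * 2.0f;
        else
            out[i] = 0.0f;
    }
}
```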
GPUs and high density computing

The GPUs give excellent results when the same sequence of operations is applied to a great number of data elements.
The best results are obtained when the number of arithmetic operations greatly exceeds the number of memory accesses.
A high density of calculation does not require the large cache memory that is necessary in CPUs.

[figure: calculations high, memory accesses low]
GPUs: performance

[figure: GPU performance comparison]
GPU based calculus

In several cases the performance of GPU based processing is 5-30 times greater than CPU based processing.
The biggest difference - a performance gain of up to 100 times! - relates to code that is not adapted to SSE instructions but suits the GPU functions well.
GPU based calculus

Some examples of synthetic code accelerated by the use of GPUs, compared to the same code vectorized for SSE:
processing for a fluorescence microscope: 12x
modeling of molecular dynamics: 8-16x
modeling of electrostatic fields: 40-120x and 7x
GPU based calculus: speed-up

The comparison of the speed-up relative to SSE.

[figure: speed-up chart]
From GeForce8 to Tesla

[figures: architecture diagrams from GeForce 8 to Tesla; 8-16 CUDA cores; "How many CUDA cores?"]

Tesla system - S1070

[figure: Tesla S1070 system]
NVIDIA and CUDA

CUDA technology is a software architecture based on nVIDIA hardware.
The CUDA "language" is an extension of the C programming language. It gives access to GPU instructions and to the video memory for parallel calculations.
CUDA allows one to implement algorithms that can be run on GeForce 8 cards and on all more recent GPU chips (GeForce 9, GeForce 200, GeForce 300, GeForce 400, GeForce 500), Quadro and Tesla.
NVIDIA and CUDA

The CUDA Toolkit contains:
compiler: nvcc
libraries: FFT and BLAS
profiler
debugger: gdb for the GPU
runtime driver for CUDA, included in the nVIDIA drivers
programming guide
SDK for CUDA developers
source codes (examples) and documentation
CUDA: compilation phases

The CUDA C code is compiled with nvcc, which is a script activating other programs: cudacc, g++, cl, etc.
CUDA: compilation phases

nvcc generates: the CPU code, written in pure C and compiled with the other parts of the application, and the PTX object code for the GPU.
CUDA: compilation phases

The executable files with CUDA code require the CUDA runtime library (cudart) and the base CUDA library.
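For example, a single-file program is typically built with the command nvcc -o app app.cu, which passes the host code to the system compiler, generates PTX for the GPU parts, and links against cudart (a typical invocation given as an assumption; it is not shown in the slides).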
CUDA: advantages

The main CUDA advantage for GPGPU computing results from the new GPU architecture, designed for the efficient implementation of non-graphic calculations, and from the use of the C programming language.
There is no need to convert the algorithms into the pipe-lined format required for graphic calculations.
The GPGPU does not use the graphic API and the corresponding drivers.
CUDA: advantages

CUDA provides:
access to 16 KB of memory per SM; this access is shared by the SM threads
an efficient transfer of data between the system and video memory (global GPU memory)
a memory with a linear addressing scheme and with random access to any memory location
hardware implemented operations for FP, integer and bit data
CUDA: limitations

Limitations:
no recursive functions (no stack)
processing blocks of minimum 32 threads (warp)
CUDA is a proprietary architecture of nVIDIA
CUDA: programming model

The CUDA programming model is based on groups of threads.
The blocks of threads, organized into grids of one or two dimensions, cooperate via shared memory and synchronization points.
A kernel program is executed in a grid of blocks of threads.
Only one grid of blocks of threads is executed at a time.
Each block may be built in one, two or three dimensions, and contain up to 512 threads; a launch-geometry sketch follows below.
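A minimal sketch of such a launch geometry in CUDA C (all names are illustrative, not from the slides):

```
#include <cuda_runtime.h>

// One element per thread; the grid is sized to cover n elements.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

void launch(float *d_data, int n)
{
    dim3 block(256);                        // threads per block (max 512 here)
    dim3 grid((n + block.x - 1) / block.x); // enough blocks to cover n
    scale<<<grid, block>>>(d_data, 2.0f, n);
}
```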
CUDA: programming model

The blocks of threads are executed by groups of 32 threads called warps.
A warp is the minimal volume of data that is processed by the streaming processors.
CUDA works with blocks of threads containing from 32 to 512 threads.
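Inside a kernel, a thread's warp and its position within the warp can be derived from its index; a small illustrative sketch (warpSize is a CUDA built-in variable; array names are mine):

```
// Assumes the output arrays are sized to the total thread count.
__global__ void warp_info(int *warp_of, int *lane_of)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    warp_of[i] = threadIdx.x / warpSize; // which warp within the block
    lane_of[i] = threadIdx.x % warpSize; // position within the warp
}
```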
CUDA: memory model

Local and Global Memory are not cached.
Local and Global Memory are implemented in separate circuits.
The access time to Local and Global Memory is much longer than the Register access time.
CUDA: memory model

There are 1024 register entries per SM.
The access to these registers is very rapid.
Each register may store one 32-bit integer or floating point number.
CUDA: memory model

Global Memory: from 256 MB to 2 GB (up to 4 GB in Tesla).
The data bandwidth may be over 100 GB/s, but the latency is high (several hundreds of clock cycles).
There is no cache memory for Global Memory.
Global Memory is used for global data and instructions.
CUDA: memory model

Shared Memory: 16 KB of shared memory for all cores in a block of threads.
Shared Memory is as rapid as the Registers.
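A minimal sketch of declaring and using Shared Memory (illustrative names; assumes 256-thread blocks):

```
// One tile per block, visible to all threads of that block.
__global__ void uses_shared(const float *in, float *out)
{
    __shared__ float tile[256];       // lives in the SM's 16 KB
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];        // each thread loads one element
    __syncthreads();                  // synchronization point
    out[i] = tile[blockDim.x - 1 - threadIdx.x]; // read a neighbour's value
}
```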
CUDA: memory model

Constant Memory: 64 KB, read-only for all SM units.
Constant Memory is a high latency memory, with an access time of several hundreds of clock cycles.
CUDA: memory model

That is why the Constant Memory data are cached, in blocks of 8 KB for each SM.
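A minimal sketch of Constant Memory usage (names are mine; cudaMemcpyToSymbol is the standard copy routine):

```
#include <cuda_runtime.h>

__constant__ float coeffs[16];   // read-only, cached per SM

__global__ void apply(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[i % 16];  // served by the constant cache
}

void setup(const float *host_coeffs)
{
    // copy from the host into the GPU's constant memory bank
    cudaMemcpyToSymbol(coeffs, host_coeffs, 16 * sizeof(float));
}
```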
CUDA: memory model

Texture Memory is accessible (read-only) to all SMs.
Texture data are used directly by the GPU; they may be interpolated linearly without additional operations.
CUDA: memory model

Texture Memory has long latency access and is cached.
CUDA: memory model

Typical use of CUDA memories (a sketch follows the list):
divide the task into several sub-tasks
decompose the input data into blocks that correspond to the shared memory size
each block of data will be processed by a block of threads
load the data blocks from the Global Memory to the Shared Memory
process the data in the Shared Memory
copy the results from the Shared Memory to the Global Memory
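A minimal CUDA sketch of this staging pattern (illustrative names; the per-element "processing" is just a squaring):

```
// Each block stages a tile in Shared Memory, processes it,
// and writes the result back to Global Memory.
__global__ void process_tiles(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[256];              // tile sized to shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = g_in[i];         // Global -> Shared
    __syncthreads();

    if (i < n)
        tile[threadIdx.x] *= tile[threadIdx.x]; // process in Shared
    __syncthreads();

    if (i < n)
        g_out[i] = tile[threadIdx.x];        // Shared -> Global
}
```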
CUDA: program example

main() - function at the CPU side

[the slides' code listing is not reproduced in the transcript]
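A hedged sketch of what such a host-side main() may have looked like, assuming the 10-element vector addition described on the next slide (the actual listing is not in the transcript):

```
#include <stdio.h>
#include <cuda_runtime.h>

#define N 10

__global__ void add(const int *a, const int *b, int *c); // defined below

int main(void)
{
    int a[N], b[N], c[N];
    int *d_a, *d_b, *d_c;

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 10 * i; }

    cudaMalloc(&d_a, N * sizeof(int));   // device (Global Memory) allocations
    cudaMalloc(&d_b, N * sizeof(int));
    cudaMalloc(&d_c, N * sizeof(int));

    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<1, N>>>(d_a, d_b, d_c);        // one block of 10 threads

    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("%d\n", c[i]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```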
CUDA: program example

kernel function: at the GPU side
no loop, but several threads
each thread has its own index - threadIdx.x
10 threads, each performing one addition
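A hedged sketch of the kernel these bullets describe (the slide's own listing is not in the transcript):

```
// No loop: each of the 10 threads uses threadIdx.x to add one element.
__global__ void add(const int *a, const int *b, int *c)
{
    int i = threadIdx.x;   // each thread gets its own index
    c[i] = a[i] + b[i];    // one "+" per thread
}
```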
CUDA and graphic APIs

CUDA programs may exploit the graphic functions provided by the graphic APIs (DirectX, OpenGL).
These functions provide the image processing operations necessary for rasterization and shading - the rendering of the images on the screen.
The proposed module does not deal with these primitives.
However, some OpenGL operations may be used in practical classes to display the images directly from GPU memory.
Summary

Evolution of multiprocessing
CPUs and GPUs
SIMD and SIMT processing modes
Performance of GPUs
NVIDIA and CUDA
CUDA processing model
CUDA memory model
a simple example