GPU Introduction
Transcript of GPU Introduction
The Architecture of the Graphic Processor Unit - GPU
P. Bakowski
Evolution of parallel architectures

We can distinguish 3 generations of massively parallel architectures (scientific calculation):

(1) The supercomputers with special processors for vector calculation (Single Instruction Multiple Data).
The Cray-1 (1976) contained 200,000 integrated circuits and could perform 100 million floating point operations per second (100 MFLOPS).
Price: $5 - $8.8 million
Number of units sold: 85
Evolution of parallel architectures

(2) The supercomputers with standard microprocessors adapted for massive multiprocessing, operating as Multiple Instruction Multiple Data (MIMD) computers.

IBM Roadrunner: PowerXCell 8i CPUs, 6480 dual-core AMD Opterons, Linux
Consumption: 2.35 MW
Surface: 296 racks, 560 m²
Memory: 103.6 TiB
Performance: 1.042 petaflops
Price: USD $125M
Evolution of GPU architectures

(3) General-Purpose computing on Graphics Processing Units (GPGPU): technology based on the circuits integrated into graphic cards.
GPU based processing

The GPUs (Graphic Processing Units) contain hundreds or thousands of arithmetic units. These capacities may be used to accelerate a wide range of computing applications.

CUDA cores: 48 per streaming processor (example: nVIDIA GT200, 300, 400, 500 series).
CPUs and SSE extensions

Modern CPUs integrate specific SIMD units for graphic processing. These units implement the SSE2, SSE3 and SSE4 instructions and contain 4 arithmetic units that may operate in parallel on 4 fixed- or floating-point data elements.
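As a hedged illustration of this 4-wide operation (function and variable names are mine, not from the slides), a single SSE instruction adding 4 floats at once looks like this in host-side C:

```
#include <xmmintrin.h>  // SSE intrinsics

// Adds two arrays of 4 floats with one SIMD instruction.
// Minimal sketch of 4-wide SSE processing.
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);    // load 4 floats
    __m128 vb = _mm_loadu_ps(b);    // load 4 floats
    __m128 vc = _mm_add_ps(va, vb); // one instruction, 4 additions
    _mm_storeu_ps(out, vc);         // store 4 results
}
```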
CPUs and GPUs

The GPUs are based on multiple processing units with multiple processing cores (8/16/32 cores per processing unit); they contain register files and shared memories.

A graphic card contains a global memory that can be used by all processors (including the CPU), a local memory for each processing unit, and special memories for constant values.
GPUs: streaming multi-processors

The streaming multiprocessors (SM) integrated in GPUs are SIMD blocks with several arithmetic cores: 8/16/32/48 cores per SM.
Each core contains one Floating Point unit and one INTeger unit.
CPUs and cache memories

CPUs use cache memories to reduce the access latency to main memory.
CPU caches take up more and more of the processor's surface area and use a lot of energy.
Cache memory: latency

[figure: memory access latencies]
GPUs and cache memories

GPUs use caches or shared memory to increase the memory bandwidth.

[figure: SMs with shared memory connected to the Global Memory]
GPU memory: transfer data rate

Each GPU multiprocessor has its own memory controller. For example, each memory controller of the nVIDIA GT200 chip provides 8 64-bit communication channels.

[figure: SMs, Shared Memory and Raster OutPut units linked by 8 * 64-bit channels]
GPU memory: transfer data rate

data_rate = (interface_width / 8) * memory_clock * 2

For the GTX275:
number of bytes on the bus: 448 bits / 8 = 56
data rate in bytes: 56 * 1224 MHz = 68,544 MB/s
68,544 MB/s * 2 = 137,088 MB/s = 137.1 GB/s
(the factor 2 reflects two reads/writes per clock cycle: double data rate memory)
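The same computation as a small C sketch (function and variable names are illustrative, not from the slides):

```
#include <stdio.h>

// Memory data rate in MB/s from the bus width (bits) and the
// memory clock (MHz); the factor 2 models double-data-rate
// transfers. Illustrative sketch, not vendor code.
double data_rate_mb(int interface_width_bits, double memory_clock_mhz)
{
    double bytes_per_transfer = interface_width_bits / 8.0; // 448/8 = 56
    return bytes_per_transfer * memory_clock_mhz * 2.0;     // DDR factor
}

int main(void)
{
    // GTX275: 448-bit bus, 1224 MHz memory clock
    printf("%.1f GB/s\n", data_rate_mb(448, 1224.0) / 1000.0); // ~137.1
    return 0;
}
```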
CPU/GPU: execution threads

The GPU avoids the memory latency by the simultaneous execution of thousands of threads; if one thread waits on a memory access, another one may be executed at the same time.

[figure: while one thread waits, other threads execute]
CPU/GPU: execution threads

A CPU may execute 1-2 threads per core; each GPU multiprocessor may maintain up to 1024 threads.
The cost of thread "context" switching for a CPU core is tens or hundreds of memory cycles; a GPU may switch several threads per clock cycle.
SIMD versus SIMT

The CPUs exploit the vector processing units for SIMD processing (a single instruction is executed on multiple data elements) - a single execution thread!

The GPUs use the SIMT operational mode: a single instruction is executed by multiple threads.
SIMT processing does not require the transformation of the data into vectors.
It allows for arbitrary branches in the threads, as the sketch below illustrates.

[figure: SIMD versus SIMT]
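A minimal CUDA sketch of SIMT-style per-thread branching (kernel and variable names are mine, not from the slides):

```
// Each thread executes the same instruction stream but may take
// its own branch, which SSE-style SIMD cannot do directly.
__global__ void simt_branch(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // one element per thread
    if (i < n) {
        if (in[i] > 0.0f)            // per-thread (arbitrary) branch
            out[i] = in[i] * 2.0f;
        else
            out[i] = 0.0f;
    }
}
```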
GPUs and high density computing

The GPUs give excellent results when the same sequence of operations is applied to a great number of data elements.
The best results are obtained when the number of arithmetic operations greatly exceeds the number of memory accesses.
A high density of calculation does not require the large cache memory that is necessary in CPUs.

[figure: calculations high, memory accesses low]
GPUs: performance

[figure: GPU performance comparison]
GPU based calculus

In several cases the performance of GPU based processing is 5-30 times greater than CPU based processing.
The biggest difference - a performance gain of up to 100 times! - relates to code that is not adapted to SSE instructions but suits the GPU functions well.
GPU based calculus

Some examples of synthetic code accelerated by the use of GPUs, compared to the same code vectorized for SSE:
processing for a fluorescence microscope: 12x
modeling of molecular dynamics: 8-16x
modeling of electrostatic fields: 40-120x and 7x
GPU based calculus: speed-up

The comparison of the speed-up relative to SSE.

[figure: speed-up chart]
From GeForce8 to Tesla

[figures: architecture diagrams from GeForce 8 to Tesla; 8-16 CUDA cores; "How many CUDA cores?"]

Tesla system - S1070

[figure: Tesla S1070 system]
NVIDIA and CUDA

CUDA technology is a software architecture based on nVIDIA hardware.
The CUDA "language" is an extension of the C programming language. It gives access to GPU instructions and to the video memory for parallel calculations.
CUDA allows one to implement algorithms that can be run on GeForce 8 cards and on all more recent GPU chips (GeForce 9, GeForce 200, GeForce 300, GeForce 400, GeForce 500), Quadro and Tesla.
NVIDIA and CUDA

The CUDA Toolkit contains:
compiler: nvcc
libraries: FFT and BLAS
profiler
debugger: gdb for the GPU
runtime driver for CUDA, included in the nVIDIA drivers
programming guide
SDK for CUDA developers
source codes (examples) and documentation
CUDA: compilation phases

The CUDA C code is compiled with nvcc, which is a script activating other programs: cudacc, g++, cl, etc.
CUDA: compilation phases

nvcc generates: the CPU code, written in pure C and compiled with the other parts of the application, and the PTX object code for the GPU.
CUDA: compilation phases

The executable files with CUDA code require the CUDA runtime library (cudart) and the base CUDA library.
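For example, a single-file program is typically built with the command nvcc -o app app.cu, which passes the host code to the system compiler, generates PTX for the GPU parts, and links against cudart (a typical invocation given as an assumption; it is not shown in the slides).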
CUDA: advantages

The main CUDA advantage for GPGPU computing results from the new GPU architecture, designed for the efficient implementation of non-graphic calculations, and from the use of the C programming language.
There is no need to convert the algorithms into the pipe-lined format required for graphic calculations.
The GPGPU does not use the graphic API and the corresponding drivers.
CUDA: advantages

CUDA provides:
access to 16 KB of memory per SM; this access is shared by the SM threads
an efficient transfer of data between the system and video memory (global GPU memory)
a memory with a linear addressing scheme and with random access to any memory location
hardware implemented operations for FP, integer and bit data
CUDA: limitations

Limitations:
no recursive functions (no stack)
processing blocks of minimum 32 threads (warp)
CUDA is a proprietary architecture of nVIDIA
CUDA: programming model

The CUDA programming model is based on groups of threads.
The blocks of threads, organized into grids of one or two dimensions, cooperate via shared memory and synchronization points.
A kernel program is executed in a grid of blocks of threads.
Only one grid of blocks of threads is executed at a time.
Each block may be built in one, two or three dimensions, and contain up to 512 threads; a launch-geometry sketch follows below.
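A minimal sketch of such a launch geometry in CUDA C (all names are illustrative, not from the slides):

```
#include <cuda_runtime.h>

// One element per thread; the grid is sized to cover n elements.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

void launch(float *d_data, int n)
{
    dim3 block(256);                        // threads per block (max 512 here)
    dim3 grid((n + block.x - 1) / block.x); // enough blocks to cover n
    scale<<<grid, block>>>(d_data, 2.0f, n);
}
```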
CUDA: programming model

The blocks of threads are executed by groups of 32 threads called warps.
A warp is the minimal volume of data that is processed by the streaming processors.
CUDA works with blocks of threads containing from 32 to 512 threads.
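Inside a kernel, a thread's warp and its position within the warp can be derived from its index; a small illustrative sketch (warpSize is a CUDA built-in variable; array names are mine):

```
// Assumes the output arrays are sized to the total thread count.
__global__ void warp_info(int *warp_of, int *lane_of)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    warp_of[i] = threadIdx.x / warpSize; // which warp within the block
    lane_of[i] = threadIdx.x % warpSize; // position within the warp
}
```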
CUDA: memory model

Local and Global Memory are not cached.
Local and Global Memory are implemented in separate circuits.
The access time to Local and Global Memory is much longer than the Register access time.
CUDA: memory model

There are 1024 register entries per SM.
The access to these registers is very rapid.
Each register may store one 32-bit integer or floating point number.
CUDA: memory model

Global Memory: from 256 MB to 2 GB (up to 4 GB in Tesla).
The data bandwidth may be over 100 GB/s, but the latency is high (several hundreds of clock cycles).
There is no cache memory for Global Memory.
Global Memory is used for global data and instructions.
CUDA: memory model

Shared Memory: 16 KB of shared memory for all cores in a block of threads.
Shared Memory is as rapid as the Registers.
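A minimal sketch of declaring and using Shared Memory (illustrative names; assumes 256-thread blocks):

```
// One tile per block, visible to all threads of that block.
__global__ void uses_shared(const float *in, float *out)
{
    __shared__ float tile[256];       // lives in the SM's 16 KB
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];        // each thread loads one element
    __syncthreads();                  // synchronization point
    out[i] = tile[blockDim.x - 1 - threadIdx.x]; // read a neighbour's value
}
```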
CUDA: memory model

Constant Memory: 64 KB, read-only for all SM units.
Constant Memory is a high latency memory, with an access time of several hundreds of clock cycles.
CUDA: memory model

That is why the Constant Memory data are cached, in blocks of 8 KB for each SM.
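A minimal sketch of Constant Memory usage (names are mine; cudaMemcpyToSymbol is the standard copy routine):

```
#include <cuda_runtime.h>

__constant__ float coeffs[16];   // read-only, cached per SM

__global__ void apply(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[i % 16];  // served by the constant cache
}

void setup(const float *host_coeffs)
{
    // copy from the host into the GPU's constant memory bank
    cudaMemcpyToSymbol(coeffs, host_coeffs, 16 * sizeof(float));
}
```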
CUDA: memory model

Texture Memory is accessible (read-only) to all SMs.
Texture data are used directly by the GPU; they may be interpolated linearly without additional operations.
CUDA: memory model

Texture Memory has long latency access and is cached.
CUDA: memory model

Typical use of CUDA memories (a sketch follows the list):
divide the task into several sub-tasks
decompose the input data into blocks that correspond to the shared memory size
each block of data will be processed by a block of threads
load the data blocks from the Global Memory to the Shared Memory
process the data in the Shared Memory
copy the results from the Shared Memory to the Global Memory
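A minimal CUDA sketch of this staging pattern (illustrative names; the per-element "processing" is just a squaring):

```
// Each block stages a tile in Shared Memory, processes it,
// and writes the result back to Global Memory.
__global__ void process_tiles(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[256];              // tile sized to shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = g_in[i];         // Global -> Shared
    __syncthreads();

    if (i < n)
        tile[threadIdx.x] *= tile[threadIdx.x]; // process in Shared
    __syncthreads();

    if (i < n)
        g_out[i] = tile[threadIdx.x];        // Shared -> Global
}
```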
CUDA: program example

main() - function at the CPU side

[the slides' code listing is not reproduced in the transcript]
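A hedged sketch of what such a host-side main() may have looked like, assuming the 10-element vector addition described on the next slide (the actual listing is not in the transcript):

```
#include <stdio.h>
#include <cuda_runtime.h>

#define N 10

__global__ void add(const int *a, const int *b, int *c); // defined below

int main(void)
{
    int a[N], b[N], c[N];
    int *d_a, *d_b, *d_c;

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 10 * i; }

    cudaMalloc(&d_a, N * sizeof(int));   // device (Global Memory) allocations
    cudaMalloc(&d_b, N * sizeof(int));
    cudaMalloc(&d_c, N * sizeof(int));

    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<1, N>>>(d_a, d_b, d_c);        // one block of 10 threads

    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("%d\n", c[i]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```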
CUDA: program example

kernel function: at the GPU side
no loop, but several threads
each thread has its own index - threadIdx.x
10 threads, each performing one addition
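A hedged sketch of the kernel these bullets describe (the slide's own listing is not in the transcript):

```
// No loop: each of the 10 threads uses threadIdx.x to add one element.
__global__ void add(const int *a, const int *b, int *c)
{
    int i = threadIdx.x;   // each thread gets its own index
    c[i] = a[i] + b[i];    // one "+" per thread
}
```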
CUDA and graphic APIs

CUDA programs may exploit the graphic functions provided by the graphic APIs (DirectX, OpenGL).
These functions provide the image processing operations necessary for rasterization and shading - the rendering of the images on the screen.
The proposed module does not deal with these primitives.
However, some OpenGL operations may be used in practical classes to display the images directly from GPU memory.
Summary

Evolution of multiprocessing
CPUs and GPUs
SIMD and SIMT processing modes
Performance of GPUs
NVIDIA and CUDA
CUDA processing model
CUDA memory model
a simple example