Post on 08-Apr-2018
8/7/2019 GPU - Graphical Processing Unit
http://slidepdf.com/reader/full/gpu-graphical-processing-unit 1/69
Université de Mons
Thanks GPU
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion
What is a GPU?
• The GPU is a processor specialized in 3D tasks
• It offloads several tasks from the CPU (central processing unit)
• Its highly parallel structure makes it more effective than a CPU for a range of complex algorithms
• Floating-point calculation
Central Processing Unit : CPU
• An essential component of a computer
• Interprets the instructions and processes the data of a program
• Sequential processing (little data but higher complexity)
• Needs to process more and more data for multimedia applications (games, CAD, …)
Evolution of the CPU
• Multimedia applications use dedicated algorithms
• Linear algorithms apply the same instructions to a large amount of data: this is called « vector processing »
• CPU architectures were adapted with multimedia extensions:
Intel Pentium MMX, AMD Opteron 3DNow!
Limitations of the CPU
• Each new generation of CPUs with higher performance means more features and functions for users
• Users want more and more functions, and they expect the technology to keep up
• But the technology is limited, because the internal clock frequency of CPUs is physically limited
Solutions to work around the problem
• Multi-core: combine several CPUs into one CPU
• Add a processor dedicated to multimedia applications: the GPU
BUT this requires parallel programming
Multi-core CPU
• Classic programming is not suited to multi-core architectures, because a sequential program uses one core and no more
• Classic programming + multi-core doesn't mean improvement!
• Parallel programming is needed: the problem is divided into elementary tasks, which are processed simultaneously by several CPU cores to decrease computation time
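The division into elementary tasks can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `split` and `parallel_sum_of_squares` are hypothetical, and the thread pool is used to show the structure (split, process, combine). In CPython, threads do not actually speed up CPU-bound work of this kind.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Elementary task: handled independently by one worker."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Divide the problem, process the chunks concurrently, combine the results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


data = list(range(1000))
total = parallel_sum_of_squares(data)
```

The combination step (the final `sum`) is still sequential, which is one reason why parallel speed-up is never perfectly proportional to the number of cores.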
Multi-core CPU
• Parallel programming means complex programming
• Parallel programming is already used by scientists on supercomputers
• A multi-core CPU is good, but not enough compared to a GPU
GPU vs CPU
• Comparison of FLOPS performance (floating-point operations per second)
Origin of GPU
• Need to display a 2D projection of a 3D model
in real time
CAD: to visualize a virtual object in 3D
Video games: to represent a virtual world
• 2 techniques: ray tracing and rasterization
The graphics card is often called the GPU
• The graphics card is an important part of the computer
• It is composed of memory, processors, registers and communication chipsets
• GPU = the graphics processors on this card
• Up to 240 parallel stream processors on a GPU, at 1500 MHz
• Single Instruction, Multiple Data [SIMD]
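As a rough illustration of the figures above, a theoretical peak throughput can be estimated as processors × clock frequency × operations per cycle. The sketch below assumes 2 floating-point operations per processor per cycle (one multiply-add); that factor and the function name `peak_gflops` are assumptions for illustration, not figures from the slides.

```python
def peak_gflops(processors, clock_mhz, flops_per_cycle):
    """Theoretical peak in GFLOPS: processors x clock (MHz) x operations per cycle."""
    return processors * clock_mhz * flops_per_cycle / 1000


# 240 processors at 1500 MHz come from the slide; 2 FLOPs per cycle is an assumption.
gpu_peak = peak_gflops(240, 1500, 2)
```

Real applications reach only a fraction of such a peak, since memory accesses and control flow also take time.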
The graphics card is often called the GPU
• GPU processors are organized in a pipeline
GPU Programming
Languages
Shading Language
GPGPU languages
CUDA
OpenCL
Accelerator
…
Programming Model
Array = texture
Kernel = fragment shader
Computation = graphics rendering
Feedback
GPGPU complexity
Memory access
Bandwidth
…
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion
Basic need?
Show, in real time, a 2D projection (on the screen) of a 3D model
Ray tracing
Rasterization
There is a specific vocabulary for the GPU
Vertex
Texture
Pixel & fragment
Shader
Pipeline
Vertex
A vertex (plural: vertices) is commonly used to define a corner of a surface in a 3D model; each such point is given as a vector.
A vertex is represented by its X, Y and Z coordinates.
This cube has 8 vertices
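As a minimal sketch of this representation, the 8 vertices of a unit cube can be listed as (X, Y, Z) tuples. The name `cube_vertices` and the choice of a unit cube with one corner at the origin are assumptions made for the example.

```python
from itertools import product

# Every combination of 0 or 1 for X, Y and Z gives one corner of the unit cube.
cube_vertices = [
    (float(x), float(y), float(z)) for x, y, z in product((0, 1), repeat=3)
]

for vertex in cube_vertices:
    print(vertex)
```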
Texture
A texture is a 2D image that is applied to a 3D object.
More generally, texture is the perceived surface quality of an artwork.
Pixel & fragment
• A pixel is the smallest item of information in an image seen by the viewer
• A fragment is the data necessary to generate a single pixel of a drawing primitive. It consists of:
Some X, Y, Z coordinates
A color
A visibility depth
• A fragment is NOT seen by the user
Shader
Shaders are simple programs that describe the traits of either a vertex or a pixel (via the fragments).
They give control over a subset of the GPU processors.
Many special shading functions are defined by the major graphics software libraries (OpenGL and Direct3D).
3 types of shaders:
Vertex shader
Runs for each vertex given to the processor
Transforms each vertex's 3D position into 2D screen coordinates
Geometry shader
Adds and removes vertices
New shader (not present on every GPU)
Pixel (or fragment) shader
Calculates the color of individual pixels: lighting and shadow effects
Pipeline
A pipeline is an ordered sequence of stages.
Each stage gets the data from the previous one, performs its own operation and sends the results to the next one.
A pipeline is « full » when every stage is working simultaneously: optimal use.
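The idea of chained stages can be sketched with Python generators. The three stages below (`scale`, `offset`, `clamp`) are hypothetical and chosen only for illustration; note that generators pass elements along one at a time, whereas a hardware pipeline keeps all its stages busy at the same moment on different elements.

```python
def scale(values, factor):
    """Stage 1: multiply each incoming value."""
    for v in values:
        yield v * factor


def offset(values, delta):
    """Stage 2: shift each value received from stage 1."""
    for v in values:
        yield v + delta


def clamp(values, low, high):
    """Stage 3: keep each value received from stage 2 inside [low, high]."""
    for v in values:
        yield max(low, min(high, v))


def run_pipeline(source):
    """Connect the stages: each one consumes the output of the previous one."""
    stage1 = scale(source, 2)
    stage2 = offset(stage1, -3)
    stage3 = clamp(stage2, 0, 10)
    return list(stage3)


result = run_pipeline([0, 1, 2, 5, 9])
```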
Current graphics pipeline
The graphics pipeline typically accepts some representation of a three-dimensional scene as an input and results in a 2D raster image (image made of pixels) as output.
OpenGL and Direct3D are two notable graphics pipeline models accepted as widespread industry standards.
The graphics pipeline contains 4 stages:
3 programmable stages, driven by the shaders
1 non-programmable stage: the rasterizer
Vertex stream from the CPU to the GPU
Pre-stage : Tessellation
Stage 1 : Vertex shader (programmable)
• Objects are transformed from 3D world-space coordinates into a 3D coordinate system based on the position and orientation of a virtual camera
• Used to add special effects to objects in a 3D environment
• Runs once for each vertex given to the GPU
• Can change a vertex's properties, such as position, color, texture coordinates, …
• One element in / one element out
• Cannot create new vertices
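The "one element in / one element out" rule can be sketched as a function applied independently to every vertex. The function `vertex_shader` below is hypothetical and deliberately simplified: it only expresses each vertex relative to the camera position and ignores the camera orientation, which a real vertex shader would also handle.

```python
def vertex_shader(vertex, camera_position):
    """Hypothetical per-vertex program: express the vertex relative to the camera."""
    x, y, z = vertex
    cx, cy, cz = camera_position
    return (x - cx, y - cy, z - cz)


vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
camera = (0.0, 0.0, 5.0)

# The shader runs once per vertex: the output has exactly as many vertices as the input.
transformed = [vertex_shader(v, camera) for v in vertices]
```

Because each call depends only on its own vertex, all the calls could run in parallel, which is what the GPU does.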
Stage 2 : Geometry shader (programmable)
• One element in / 0 to ~100 elements out
• Can add and remove vertices
• Can be used to add volumetric detail (too costly for the CPU) or to refine the mesh
• Ex: 20 triangles → 100 smaller triangles
• Displacement mapping
• Most recent type of shader (not always present in the pipeline)
Note: mesh size = size of the cells of the mesh
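Mesh refinement, where one element in produces several elements out, can be sketched by splitting a triangle into four smaller ones through the midpoints of its edges. This midpoint scheme and the names `midpoint` and `subdivide` are illustrative assumptions, not the method used by any particular GPU.

```python
def midpoint(a, b):
    """Point halfway between two vertices."""
    return tuple((p + q) / 2 for p, q in zip(a, b))


def subdivide(triangle):
    """One triangle in, four smaller triangles out."""
    a, b, c = triangle
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]


coarse = [((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0))]
fine = [small for triangle in coarse for small in subdivide(triangle)]
```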
Stage 3 : Rasterization (non-programmable) (1)
• Most popular technique for producing real-time 3D computer graphics (faster than ray tracing)
• Projection of the polygons of the 3D scene onto a 2D grid the size of the output image
• Output fragments carry the final image coordinates
2D vector to raster: vector image (vertices) → raster image (fragments)
Polygon = set of triangles
Triangle = 3 vertices in 3D space
Stage 3 : Rasterization (2)
The rasterization algorithm has at least 3 steps:
1. Calculation of the 2D coordinates (transformation)
2. Filtering of the vertices (clipping)
3. Rasterization itself (scan conversion)
4. Acceleration techniques (optional)
5. Further refinements
Stage 3 : Rasterization (3)
The rasterization algorithm has at least 3 steps:
1. Calculation of the 2D coordinates (transformation)
A set of mathematical transformations:
• Translation, scaling, rotation: to place the 3D figure at the desired location (example: the origin)
• Projection: from 3D to 2D (orthogonal projection, which removes the z-component, or perspective projection)
These operations are performed by multiplying the vertex's augmented 3D coordinate vector by the different transformation matrices
Ex: translation matrix
Ex: a man turning his head
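The transformation step can be sketched with a 4×4 translation matrix applied to a vertex in augmented (homogeneous) coordinates, followed by an orthogonal projection that drops the z-component. The function names are hypothetical and plain Python lists stand in for the matrix hardware.

```python
def translation_matrix(tx, ty, tz):
    """4x4 matrix that moves a point by (tx, ty, tz)."""
    return [
        [1, 0, 0, tx],
        [0, 1, 0, ty],
        [0, 0, 1, tz],
        [0, 0, 0, 1],
    ]


def transform(matrix, vertex):
    """Multiply the matrix by the augmented vertex (x, y, z, 1)."""
    x, y, z = vertex
    augmented = (x, y, z, 1)
    result = [sum(m * v for m, v in zip(row, augmented)) for row in matrix]
    return tuple(result[:3])


def orthogonal_projection(vertex):
    """Project from 3D to 2D by removing the z-component."""
    x, y, _ = vertex
    return (x, y)


# Move the vertex (2, 3, 4) to the origin.
moved = transform(translation_matrix(-2, -3, -4), (2, 3, 4))

# Translate along X, then project onto the screen plane.
projected = orthogonal_projection(transform(translation_matrix(1, 0, 0), (1, 2, 7)))
```

The fourth coordinate is what lets a translation, which is not a linear operation in 3D, be written as a matrix product like rotation and scaling.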
Stage 3 : Rasterization (4)
2. Filtering of the vertices (clipping)
• The 2D locations of the triangles' vertices have been calculated, BUT they may lie outside the window (the area of the screen where the pixels will be written)
• Clipping is the process of truncating triangles to fit them inside the viewing area.
3. Rasterization itself (scan conversion)
• Fills in, with pixels, the 2D triangles that are now in the image plane
• Example: processing of a line from coordinates (1,1) to (5,1), with a color gradient from blue to green. It fills pixels (1,1), (2,1), (3,1), (4,1) and (5,1).
For each pixel, one has to determine its characteristics with a good balance:
(1,1) is totally blue, (2,1) a little less blue, (3,1) blue-green, …
• This is much more complicated for shapes like triangles, but the principle remains the same
• Difficulty: pixel aliasing
• Use of a Z-buffer to determine which pixel is closest to the camera
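The line example above can be sketched as a linear interpolation of the color along the pixels of the line. The RGB values chosen for blue and green and the function names are assumptions for illustration; real rasterizers interpolate many attributes this way, not only color.

```python
BLUE = (0, 0, 255)
GREEN = (0, 255, 0)


def lerp_color(start, end, t):
    """Blend two RGB colors: t = 0 gives start, t = 1 gives end."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))


def rasterize_horizontal_line(x0, x1, y, start_color, end_color):
    """Fill every pixel from (x0, y) to (x1, y) with an interpolated color."""
    span = x1 - x0
    pixels = {}
    for x in range(x0, x1 + 1):
        t = (x - x0) / span if span else 0.0
        pixels[(x, y)] = lerp_color(start_color, end_color, t)
    return pixels


line = rasterize_horizontal_line(1, 5, 1, BLUE, GREEN)
for position in sorted(line):
    print(position, line[position])
```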
Stage 3 : Rasterization (5)
4. Acceleration techniques
I. Backface culling: determines whether a polygon of a graphical object is visible; if not (it shows its back to the camera), it is culled
II. Spatial data structures
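Backface culling can be sketched as a sign test between the triangle's normal and the direction towards the camera. The sketch assumes that front-facing triangles list their vertices counter-clockwise as seen from the camera; that convention and the helper names are assumptions for the example.

```python
def subtract(a, b):
    return tuple(p - q for p, q in zip(a, b))


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def is_back_facing(triangle, camera_position):
    """True when the triangle shows its back to the camera and can be culled."""
    a, b, c = triangle
    normal = cross(subtract(b, a), subtract(c, a))
    to_camera = subtract(camera_position, a)
    return dot(normal, to_camera) <= 0


camera = (0.0, 0.0, 5.0)
front = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
back = (front[0], front[2], front[1])  # same triangle, opposite winding

visible = [t for t in (front, back) if not is_back_facing(t, camera)]
```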
Stage 4 : Fragment shader (programmable)
• Gives each pixel its final color (as a function of lighting, reflection or refraction of light, …)
• Largest consumer of computational resources
• Performs complex per-pixel effects and refinement techniques such as:
I. Texture filtering: to create clean images at any distance
II. Environment mapping: a form of texture mapping in which the texture coordinates are view-dependent, to simulate reflections on a shiny object
III. Shadows: traditionally not processed in the rasterizer; handled by modern techniques
Fragment shader (OpenGL) = pixel shader (Direct3D)
Output of the pipeline
• The fragment stream can:
Either be written to a framebuffer and then displayed on the screen
Or, if it needs more processing, be written to a texture and then read back by the CPU
Summary
The unified architecture arrived with the 6th generation of GPUs
Before: 2 types of processors in the GPU
Vertex units
Fragment units
A bottleneck appeared whenever one type was overloaded: not optimal
Since the GeForce 8, processors are no longer specialized
Optimal use of the pipeline: unified architecture
GPU evolution through the different generations
Gen | Year       | Nvidia               | AMD/ATI                    | Particularities
1   | 96         | TNT2                 | Rage                       | DirectX 6 = standard; rasterization of triangles and textures; limitation: no vertex processing; other vendor: 3dfx (Voodoo)
2   | 99         | GeForce 256          | Radeon 7500                | OpenGL supported; vertex processing supported
3   | 01, 02     | GeForce 3, GeForce 4 | Radeon 8500                | Nvidia buys 3dfx; programmable vertex processing
4   | 02         | GeForce FX           | Radeon 9700                | Programmable fragment processing; first GPGPU operations
5   | 04, 05     | GeForce 6, GeForce 7 | Radeon X800, Radeon X1800  | Increased processing speed; GPGPU operations developed
6   | 06, 07, 08 | GeForce 8, GeForce 9 | Radeon HD200, Radeon HD300 | Geometry shaders appear; unified architecture; Nvidia creates the CUDA language
7   | 08         | GeForce 200          | Radeon HD400               | Not very widespread yet; technical improvements (frequency, memory, number of processors, bandwidth, …)
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion
Architecture of a GPU
Overview
Short reminder:
Architecture of a CPU
The CPU and its evolution
Drawbacks
Architecture of a GPU
Needs
SIMD/MIMD
Short talk about data management
Gathering/scattering and PRAM
Architecture of a CPU
Arithmetic logic unit (ALU), or computation unit:
• Performs all operations
Control unit:
• Manages all instructions
Cache:
• Fast memory access
• Expensive
• High volume
DRAM:
• Dynamic random-access memory
• Cheap but needs to be refreshed
Analogy: control = brain, ALU = hands, memory = tools
[Figure: CPU block diagram with a control unit, four ALUs, a cache and DRAM]
CPU processing
For a computer:
Program = several sequential instructions
Simple CPU: SISD (single instruction, single data)
[Figure: program code as a sequence Instruction 1, Instruction 2, Instruction 3]
• Instructions are processed one by one
• Each on a single piece of data at a time
CPU and its evolution
At first: SISD
In-order processors
Out-of-order processors (better performance)
Instructions are dispatched to an instruction queue
The results are queued
The process is still sequential
High volume of cache memory
Need for fast access to instructions and data
Lots of back-and-forth accesses to the data
Evolution and drawbacks
Evolution (Pentium 3)
SIMD (single instruction, multiple data)
Better vector-processing performance
Reasons
Only a few back-and-forth accesses to the data
The complexity of the algorithm is very low
A high volume of cache memory and out-of-order execution are superfluous for multimedia applications
The CPU is perfect for sequential programs but is weak for multimedia applications
Architecture of a GPU
A GPU is a SIMD processor
To be able to process a lot of data
Needs of the GPU
A high memory bandwidth
10 × the CPU bandwidth, to process lots of data in real time
Still to do
Discuss the new generation of GPUs
MIMD (multiple instruction, multiple data)
Compare MIMD and SIMD
Discuss data management
Gathering
Scattering
Discuss the PRAM model used in GPUs
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion
CUDA
• CUDA (Compute Unified Device Architecture) is a development library created by NVIDIA in 2007.
• It makes it possible to use the power of a compatible graphics card for general-purpose computing.
• Programmers can use C, C++ or Fortran to develop applications using CUDA.
• Interfaces (wrappers) make it possible to use high-level languages such as Java, .NET or Python.
Different components of CUDA
• CUDA consists of a set of software layers used to communicate with the GPU: a driver, a runtime and a few libraries.
CUDA libraries
• They include the code of all the functions to be executed on the GPU.
• Using those libraries, developers can only use a set of predefined functions.
• They do not have direct access to the GPU.
• Examples:
• CUBLAS, which has a set of building blocks for linear algebra calculations on the GPU
• CUFFT, which handles the calculation of Fourier transforms
High-level API: CUDA Runtime
• Also called « C for CUDA »
• The high-level API is implemented “above” the low-level API: each call to a function of the Runtime is broken down into more basic instructions managed by the Driver API.
• The term “high-level API” is relative. Even the Runtime API is still what a lot of people would consider very low-level; yet it still offers functions that are highly practical for initialization.
Low-level API: CUDA Driver
• The Driver API is more complex to manage; it requires more work to launch processing on the GPU.
• The upside is that it's more flexible, giving the programmer additional control.
• Note that the high-level and low-level APIs are mutually exclusive: the programmer must use one or the other, and it is not possible to mix function calls from both.
CUDA from the hardware point of view
• Nvidia's Shader Core is made up of several clusters, which Nvidia calls Texture Processor Clusters.
• Each cluster is made up of a texture unit and 2 streaming multiprocessors.
The streaming multiprocessor
• These processors consist of a front end that reads, decodes and launches instructions, and a back end made up of a group of eight calculating units and two SFUs (Special Function Units), where the instructions are executed in SIMD fashion.
• The same instruction is applied to all the threads in the warp. Nvidia calls this mode of execution SIMT (single instruction, multiple threads).
• The back end operates at double the frequency of the front end.
Streaming multiprocessors' operating mode
• At each cycle, a warp ready for execution is selected by the front end, which launches execution of an instruction.
• To apply the instruction to all 32 threads in the warp, the back end takes four cycles; but since it operates at double the frequency of the front end, only two cycles elapse from the front end's point of view.
• To avoid having the front end remain unused for one cycle, the ideal is to alternate types of instructions every cycle: a classic instruction for one cycle and an SFU instruction for the other.
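The cycle counts quoted above follow from simple arithmetic, sketched here. The constants come from the slides (32 threads per warp, eight calculating units, back end at double the front-end frequency); the function name `cycles_per_warp_instruction` is hypothetical.

```python
WARP_SIZE = 32           # threads per warp
CALCULATING_UNITS = 8    # calculating units per streaming multiprocessor
BACKEND_CLOCK_RATIO = 2  # back end runs at twice the front-end frequency


def cycles_per_warp_instruction(warp_size, units, clock_ratio):
    """Return (back-end cycles, front-end cycles) needed for one warp instruction."""
    backend_cycles = warp_size // units
    frontend_cycles = backend_cycles // clock_ratio
    return backend_cycles, frontend_cycles


backend, frontend = cycles_per_warp_instruction(
    WARP_SIZE, CALCULATING_UNITS, BACKEND_CLOCK_RATIO
)
print(f"{backend} back-end cycles = {frontend} front-end cycles")
```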
Shared memory
• Each multiprocessor has a small memory area called shared memory, with a size of 16 KB per multiprocessor.
• This memory area provides a way for threads in the same block to communicate. All the threads in a given block are executed by the same multiprocessor.
• The assignment of blocks to the different multiprocessors is completely undefined, meaning that two threads from different blocks can't communicate during their execution.
Cache memory and registers
• To limit too-frequent access to the slower video memory, Nvidia has also provided its multiprocessors with a cache (approximately 8 KB per multiprocessor) for access to constants and textures.
• The multiprocessors also have 8,192 registers that are shared among all the threads of all the blocks active on that multiprocessor. The number of active blocks per multiprocessor can't exceed eight, and the number of active warps is limited to 24 (768 threads).
Optimizing a CUDA program
• The goal is to find the optimum balance between the number of blocks and their size: more threads per block help mask the latency of memory operations, but at the same time the number of registers available per thread is reduced.
• Blocks of 512 threads would be particularly inefficient, since only one block could be active on a multiprocessor, potentially wasting 256 threads. So Nvidia advises using blocks of 128 to 256 threads, which offer the best compromise between masking latency and the number of registers needed for most kernels.
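The trade-off can be checked with a small calculation based on the limits quoted in these slides (768 active threads, 8 active blocks and 8,192 registers per multiprocessor). This is a simplified sketch: the function `occupancy` is hypothetical, and it ignores the registers and shared memory that a given kernel actually requires.

```python
MAX_THREADS_PER_SM = 768  # 24 warps of 32 threads
MAX_BLOCKS_PER_SM = 8
REGISTERS_PER_SM = 8192


def occupancy(block_size):
    """Estimate how one multiprocessor is filled for a given block size."""
    active_blocks = min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // block_size)
    active_threads = active_blocks * block_size
    return {
        "active_blocks": active_blocks,
        "active_threads": active_threads,
        "unused_thread_slots": MAX_THREADS_PER_SM - active_threads,
        "registers_per_thread": REGISTERS_PER_SM // active_threads,
    }


report = {size: occupancy(size) for size in (64, 128, 256, 512)}
for size, stats in report.items():
    print(size, stats)
```

Under these assumptions, blocks of 128 or 256 threads fill all 768 thread slots, while blocks of 512 threads (and also blocks of 64, because of the 8-block limit) leave 256 slots unused.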
Definitions
• Host: the CPU
• Device: the GPU
• Kernel: function executed on the GPU
• Thread: basic element of the data to be processed (very lightweight)
• Warp: group of 32 threads
• Block: set of 64 to 512 threads
• Grid: array of blocks
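How these units fit together can be sketched with integer arithmetic: how many blocks a grid needs to cover a data set, how many warps one block contains, and which data element a given thread handles. The function names are hypothetical, and the one-dimensional indexing is an assumption made to keep the example short.

```python
WARP_SIZE = 32


def launch_shape(num_elements, threads_per_block):
    """Return (blocks in the grid, warps per block) needed to cover the data."""
    blocks_in_grid = (num_elements + threads_per_block - 1) // threads_per_block
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    return blocks_in_grid, warps_per_block


def global_index(block_index, thread_index, threads_per_block):
    """Index of the data element handled by one thread of one block."""
    return block_index * threads_per_block + thread_index


# 1000 elements with blocks of 256 threads.
blocks, warps = launch_shape(1000, 256)
last_element = global_index(3, 231, 256)
```

The last block is only partly used here (4 × 256 = 1024 threads for 1000 elements), so a kernel normally checks that its index is smaller than the number of elements before touching the data.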
Definitions (2)
CUDA from a software point of view
CUDA = set of extensions to the C language
Type qualifiers for functions:
__global__ void function()
Function called by the CPU, executed on the GPU
__device__ void function()
Function called by and executed on the GPU
__host__ void function()
Standard function (executed on the CPU)
Software point of view (2)
Restrictions on __device__ and __global__ functions:
1. Cannot be recursive
2. Must have a fixed number of arguments
Type qualifier for variables:
__shared__ variable
This variable will be stored in the multiprocessor's shared memory
Compilation
1. CPU code is extracted and handed to the standard compiler
2. GPU code is converted into PTX code (assembly code) and scanned for inefficiencies
3. PTX is translated into GPU-specific commands that are encapsulated in the executable
A few application examples