William Stallings 10 Editionakyokus.com/spring2021/co/lectures/stallingd10th/CH19... · 2021. 6....
Transcript of William Stallings 10 Editionakyokus.com/spring2021/co/lectures/stallingd10th/CH19... · 2021. 6....
+
William Stallings
Computer Organization
and Architecture
10th Edition
© 2016 Pearson Education, Inc., Hoboken,
NJ. All rights reserved.
+ Chapter 19General-Purpose
Graphic Processing Units© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
+ Compute Unified Device
Architecture (CUDA)
A parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce
CUDA C is a C/C++ based language
Program can be divided into three general sections
Code to be run on the host (CPU)
Code to be run on the device (GPU)
The code related to the transfer of data between the host and the device
The data-parallel code to be run on the GPU is called a kernel
Typically will have few to no branching statements
Branching statements in the kernel result in serial execution of the threads in the GPU hardware
A thread is a single instance of the kernel function
The programmer defines the number of threads launched when the kernel function is called
The total number of threads defined is typically in the thousands to maximize the utilization of the GPU processor cores, as well as maximize the available speedup
The programmer specifies how these threads are to be bundled
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Figure 19.1 Relationship Among Threads, Blocks, and a Grid
Thread (0, 0)
Grid
Block (1,1)
Thread (1, 0) Thread (2, 0) Thread (3, 0)
Thread (0, 1) Thread (1, 1) Thread (2, 1) Thread (3, 1)
Thread (0, 2) Thread (1, 2) Thread (2, 2) Thread (3, 2)
Block(0, 0) Block(1, 0) Block(2, 0)
Block(0, 1) Block(1, 1) Block(2, 1)
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
CUDA Term Definition Equivalent GPU Hardware Component
Kernel Parallel code in the form of a function to
be run on GPU
Not applicable
Thread An instance of the kernel on the GPU GPU/CUDA processor core
Block A group of threads assigned to a
particular SM
CUDA multiprocessor (SM)
Grid The GPU GPU
Table 19.1
CUDA Terms to GPU’s Hardware Components
Equivalence Mapping
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Figure 19.2 CPU vs. GPU Silicon Area/Transistor Dedication
Control
Cache
DRAM
CPU
DRAM
GPU
ALU ALU
ALU ALU
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Figure 19.3 Floating-Point Operations per Second for CPU and GPU
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
5500
Sep-02
Theoretical
GFLOPS
Jan-04 May-05 Oct-06 Feb-08 Jul-09 Nov-10 Apr-12 Aug-13
NVIDIA GPU Single Precision
NVIDIA GPU Double Precision
Intel CPU Single Precision
Intel CPU Double Precision
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
GPU Architecture Overview
The historical evolution can be divided up into three major phases:
The first phase would cover early 1980s to late 1990s, where the GPU was composed of fixed, nonprogrammable, specialized processing stages
The second phase would cover the iterative modification of the resulting Phase I GPU architecture from a fixed, specialized, hardware pipeline to a fully programmable processor (early to mid-2000s)
The third phase covers how the GPU/GPGPU architecture makes an excellent and affordable highly parallelized SIMD coprocessor for accelerating the run times of some nongraphics-related programs, along with how a GPGPU language maps to this architecture
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Figure 19.4 NVIDIA Fermi Architecture
DR
AM
DR
AM
Gig
aT
hre
ad
Ho
st I
nte
rfa
ce
DR
AM
DR
AM
DR
AM
DR
AM
L2 Cache
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Figure 19.5 Single SM Architecture
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Register File (32k x 32-bit)
Instruction Cache
Dispatch Unit Dispatch Unit
Core
Core
Core
Core
Core
Core
Core
CoreLd/St
SFU
SFU
SFU
SFU
Ld/St
Ld/St
Ld/St
Ld/St
Ld/St
Ld/St
Ld/St
Ld/St
Ld/St
Ld/St
Ld/St
Ld/St
Ld/St
Ld/St
Ld/St
Core
Core
Core
Core
Core
Core
Core
Warp Scheduler Warp Scheduler
Interconnect Network
64-kB Shared Memory/L1 Cache
Uniform Cache
Operand Collector
Result Queue
Dispatch Port
CUDA Core
FPUnit
IntUnit
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Figure 19.6 Dual Warp Schedulers and
Instruction Dispatch Units Run Example
Tim
eWARP Scheduler
Instruction Dispatch Unit
Warp 8 instruction 11
Warp 2 instruction 42
Warp 14 instruction 95
Warp 8 instruction 12
Warp 14 instruction 96
Warp 2 instruction 43
WARP Scheduler
Instruction Dispatch Unit
Warp 9 instruction 11
Warp 3 instruction 33
Warp 15 instruction 95
Warp 9 instruction 12
Warp 3 instruction 34
Warp 15 instruction 96
+CUDA Cores
The NVIDIA GPU processor cores are also known as CUDA
cores
There are a total of 32 CUDA cores dedicated to each SM
in the Fermi architecture
Each CUDA core has two separate pipelines or data paths
An integer (INT) unit pipeline
Is capable of 32-bit, 64-bit, and extended precision for
integer and logic/bitwise operations
Floating-point (FP) unit pipeline
Can perform a single-precision FP operation, while a
double-precision FP operation requires two CUDA cores
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Memory Type
Relative Access Times
Access Type
Scope Data Lifetime
Registers Fastest. On-chip R/W Single thread Thread
Shared Fast. On-chip R/W All threads in a block
Block
Local 100´ to 150´ slower than
shared & register. Off-chip
R/W Single thread Thread
Global 100´ to 150´ slower than
shared & register. Off-chip.
R/W All threads & host Application
Constant 100´ to 150´ slower than
shared & register. Off-chip
R All threads & host Application
Texture 100´ to 150´ slower than
shared & register. Off-chip
R All threads & host Application
Table 19.2
GPU’s Memory Hierarchy Attributes
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Figure 19.8 CUDA Representation of a GPU’s Basic Architecture.
The example GPU shown has two SMs and two CUDA cores per SM.
Registers
Thread (0,0)
Shared Memory
Block (0,0)
(Device) Grid
Registers
Thread (1,0)
Registers
Thread (0,0)
Shared Memory
Block (1,0)
Registers
Thread (1,0)
Global
Memory
Constant
Memory
Host
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Figure 19.9 Intel Gen8 Execution Unit
EU: Execution Unit
Send
Branch
SIMD
FPU
SIMD
FPU
Intr
uct
ion
Fet
ch
Th
read
Arb
iter
Superscalar Pipeline
Superscalar Pipeline
Superscalar Pipeline
Superscalar Pipeline
Superscalar Pipeline
Superscalar Pipeline
Superscalar Pipeline
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Figure 19.10 Intel Gen8 Subslice
EU
EU
EU
EU
EU
EU
EU
EU
Instruction
cache
Sampler
Local thread
dispatcher
L1
L2
sampler
cache
Data port
Subslice: 8 EUs
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Figure 19.11 Intel Gen8 Slice
Slice: 24 EUs
EU
EU
EU
EU
EU
EU
EU
EU
Instruction
cache
Sampler
Local thread
dispatcher
L1
L2
sampler
cache
Data port
Subslice: 8 EUs
EU
EU
EU
EU
EU
EU
EU
EU
Instruction
cache
Sampler
Local thread
dispatcher
L1
L2
sampler
cache
Data port
Subslice: 8 EUs
EU
EU
EU
EU
EU
EU
EU
EU
Instruction
cache
Sampler
Local thread
dispatcher
L1
L2
sampler
cache
Data port
Subslice: 8 EUs
Fixed function units
L3 data cacheShared local
memoryFunction
logic
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.
Figure 19.12 Intel Core M Processor SoC
Intel Processor Graphics Gen8
Intel Core M Processor
Slice: 24 EUs
EU
EU
EU
EU
EU
EU
EU
EU
Instruction
cache
Sam pler
Local thr ead
dispatcher
L1
L2
sampler
cache
D ata port
Subslice: 8 EU s
EU
EU
EU
EU
EU
EU
EU
EU
Instruction
cache
Sam pler
Local thr ead
dispatcher
L1
L2
sampler
cache
D ata port
Subslice: 8 EU s
EU
EU
EU
EU
EU
EU
EU
EU
Instruction
cache
Sam pler
Local thr ead
dispatcher
L1
L2
sampler
cache
D ata port
Subslice: 8 EU s
Fixed function units
L3 data cacheShared local
m em oryAtom ics,Barriers
GT
I
CPU
core
CPU
core
LLC
Cache
slice
LLC
Cache
slice
Display
Controller
System Agent
SoC Ring Interconnect
Memory
Controller
PCIe
+ Summary
CUDA basics
GPU versus CPU
Basic differences between
CPU and GPU architectures
Performance and
performance per watt
comparison
Intel’s Gen8 GPU
GPU architecture
overview
Baseline GPU architecture
Full chip layout
Streaming multiprocessor
architecture details
Importance of knowing and
programming to your
memory types
Chapter 19
General-Purpose
Graphic Processing
Units
© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.