Transcript of: William Stallings, Computer Organization and Architecture, 10th Edition (akyokus.com/spring2021/co/lectures/stallingd10th/CH19...)

Page 1

William Stallings
Computer Organization and Architecture
10th Edition

© 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved.

Page 2

Chapter 19: General-Purpose Graphic Processing Units

Page 3

Compute Unified Device Architecture (CUDA)

- A parallel computing platform and programming model created by NVIDIA and implemented on the graphics processing units (GPUs) that it produces
- CUDA C is a C/C++-based language
- A CUDA C program can be divided into three general sections (see the first sketch after this list):
  - Code to be run on the host (CPU)
  - Code to be run on the device (GPU)
  - Code that transfers data between the host and the device
- The data-parallel code to be run on the GPU is called a kernel
  - A kernel typically has few to no branching statements, because branching in the kernel causes the affected threads to execute serially on the GPU hardware (see the second sketch after this list)
- A thread is a single instance of the kernel function
  - The programmer defines the number of threads launched when the kernel function is called
  - The total number of threads is typically in the thousands, both to maximize the utilization of the GPU's processor cores and to maximize the available speedup
  - The programmer also specifies how these threads are to be bundled
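To make the three sections concrete, here is a minimal, hypothetical vector-add program (not from the slides); the kernel name vecAdd and all sizes are illustrative:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Device code: the kernel. Each launched thread is one instance of this function.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n)                                      // bounds guard for the last block
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;                          // about a million elements
    size_t bytes = n * sizeof(float);

    // Host code: allocate and initialize arrays in CPU memory.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // Transfer code: move the inputs from the host to the device.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch: the programmer bundles thousands of threads into blocks.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Transfer code: copy the result back from the device to the host.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %f\n", h_c[100]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```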

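And a hypothetical sketch of the branching issue: when threads in the same warp take different sides of a branch, the hardware runs the two sides one after the other rather than in parallel:

```cuda
// Hypothetical sketch: a divergent branch. Even- and odd-indexed threads in
// the same warp take different paths, so the hardware executes the paths
// serially: the even lanes run while the odd lanes idle, then vice versa.
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)
            x[i] = x[i] * 2.0f;   // pass 1: even-indexed threads
        else
            x[i] = x[i] + 1.0f;   // pass 2: odd-indexed threads
    }
}
```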

Page 4

Figure 19.1 Relationship Among Threads, Blocks, and a Grid
[Diagram: a grid of six blocks, Block(0,0) through Block(2,1), arranged three wide by two high. Block (1,1) is expanded to show its threads, Thread (0,0) through Thread (3,2), arranged four wide by three high.]
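As a hedged illustration (not from the slides), the configuration drawn in Figure 19.1 could be launched as follows; the kernel whoAmI is hypothetical:

```cuda
#include <cstdio>

// Each thread identifies itself by (blockIdx, threadIdx), as in Figure 19.1.
__global__ void whoAmI(void) {
    printf("Block (%d,%d)  Thread (%d,%d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}

int main(void) {
    dim3 grid(3, 2);   // 3x2 grid of blocks, as drawn in the figure
    dim3 block(4, 3);  // 4x3 threads per block
    whoAmI<<<grid, block>>>();
    cudaDeviceSynchronize();  // wait so the device printf output is flushed
    return 0;
}
```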

Page 5

Table 19.1 CUDA Terms to GPU's Hardware Components Equivalence Mapping

CUDA Term | Definition | Equivalent GPU Hardware Component
Kernel | Parallel code, in the form of a function, to be run on the GPU | Not applicable
Thread | An instance of the kernel on the GPU | GPU/CUDA processor core
Block | A group of threads assigned to a particular SM | CUDA multiprocessor (SM)
Grid | The GPU | GPU

Page 6

Figure 19.2 CPU vs. GPU Silicon Area/Transistor Dedication
[Diagram: side-by-side chip layouts. The CPU devotes most of its silicon area to control logic and cache, with a few ALUs, backed by DRAM; the GPU devotes most of its area to many small ALUs, also backed by DRAM.]

Page 7

Figure 19.3 Floating-Point Operations per Second for CPU and GPU
[Plot: theoretical GFLOPS (500 to 5500) versus time (Sep-02 through Aug-13) for four series: NVIDIA GPU single precision, NVIDIA GPU double precision, Intel CPU single precision, and Intel CPU double precision.]

Page 8

GPU Architecture Overview

The historical evolution of the GPU can be divided into three major phases:

- The first phase, from the early 1980s to the late 1990s, when the GPU was composed of fixed, nonprogrammable, specialized processing stages
- The second phase, from the early to mid-2000s, covering the iterative modification of the Phase I architecture from a fixed, specialized hardware pipeline into a fully programmable processor
- The third phase, in which the GPU/GPGPU architecture becomes an excellent and affordable, highly parallel SIMD coprocessor for accelerating the run times of some nongraphics programs, and in which a GPGPU language is mapped onto this architecture

Page 9

Figure 19.4 NVIDIA Fermi Architecture
[Diagram: full-chip layout with the streaming multiprocessors (SMs) arranged around a shared L2 cache, six DRAM interfaces along the edges, and a GigaThread scheduler and host interface at the top.]

Page 10

Figure 19.5 Single SM Architecture
[Diagram: one streaming multiprocessor containing an instruction cache; two warp schedulers, each with its own dispatch unit; a register file (32K x 32-bit); 32 CUDA cores; 16 load/store (Ld/St) units; four special function units (SFUs); an interconnect network; 64-kB shared memory/L1 cache; and a uniform cache. An inset expands a single CUDA core into its dispatch port, operand collector, FP unit, INT unit, and result queue.]

Page 11

Figure 19.6 Dual Warp Schedulers and Instruction Dispatch Units Run Example
[Diagram: two warp scheduler/instruction dispatch unit pairs issuing in parallel over time.
Scheduler 1: Warp 8 instruction 11; Warp 2 instruction 42; Warp 14 instruction 95; Warp 8 instruction 12; Warp 14 instruction 96; Warp 2 instruction 43.
Scheduler 2: Warp 9 instruction 11; Warp 3 instruction 33; Warp 15 instruction 95; Warp 9 instruction 12; Warp 3 instruction 34; Warp 15 instruction 96.]

Page 12

CUDA Cores

- The NVIDIA GPU processor cores are also known as CUDA cores
- There are a total of 32 CUDA cores dedicated to each SM in the Fermi architecture
- Each CUDA core has two separate pipelines, or data paths:
  - An integer (INT) unit pipeline, capable of 32-bit, 64-bit, and extended-precision integer and logic/bitwise operations
  - A floating-point (FP) unit pipeline, which can perform a single-precision FP operation; a double-precision FP operation requires two CUDA cores (see the sketch below)

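A hypothetical sketch of the consequence (not from the slides): the same a*x + y computation in single and in double precision. Since a double-precision operation on Fermi occupies two CUDA cores, the float kernel has roughly twice the peak arithmetic rate of the double kernel:

```cuda
// Hypothetical sketch: single- vs. double-precision versions of a*x + y.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // one FP unit per operation
}

__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // a pair of CUDA cores per operation
}
```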

Page 13

Table 19.2 GPU's Memory Hierarchy Attributes

Memory Type | Relative Access Times | Access Type | Scope | Data Lifetime
Registers | Fastest; on-chip | R/W | Single thread | Thread
Shared | Fast; on-chip | R/W | All threads in a block | Block
Local | 100× to 150× slower than shared & registers; off-chip | R/W | Single thread | Thread
Global | 100× to 150× slower than shared & registers; off-chip | R/W | All threads & host | Application
Constant | 100× to 150× slower than shared & registers; off-chip | R | All threads & host | Application
Texture | 100× to 150× slower than shared & registers; off-chip | R | All threads & host | Application
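A minimal sketch (assumed example, not from the slides) of how several of these memory types appear in CUDA C source; the kernel scale and its names are hypothetical:

```cuda
// Hypothetical kernel illustrating the memory types of Table 19.2.
__constant__ float coeff;                 // constant memory: read-only to threads, set by host

__global__ void scale(const float *in, float *out, int n) {
    __shared__ float tile[256];           // shared memory: R/W by all threads in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // i is held in a per-thread register
    if (i < n) {
        tile[threadIdx.x] = in[i];        // read from global memory, stage in shared
        __syncthreads();                  // make the whole tile visible to the block
        out[i] = coeff * tile[threadIdx.x];   // write the result back to global memory
    }
}
```

The host would initialize coeff with cudaMemcpyToSymbol before the launch; the sketch assumes a block size of at most 256 threads, and local and texture memory are not shown.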

Page 14

Page 15

Figure 19.8 CUDA Representation of a GPU's Basic Architecture. The example GPU shown has two SMs and two CUDA cores per SM.
[Diagram: the host connects to the (device) grid. The grid contains Block (0,0) and Block (1,0); each block has its own shared memory and two threads, Thread (0,0) and Thread (1,0), each with its own registers. Both blocks sit above the device-wide global memory and constant memory.]

Page 16

Figure 19.9 Intel Gen8 Execution Unit
[Diagram: one execution unit (EU) with an instruction fetch stage and a thread arbiter feeding seven superscalar pipelines, which issue to a send unit, a branch unit, and two SIMD FPUs.]

Page 17

Figure 19.10 Intel Gen8 Subslice
[Diagram: a subslice of 8 EUs with an instruction cache, a local thread dispatcher, a sampler with L1 and L2 sampler caches, and a data port.]

Page 18

Figure 19.11 Intel Gen8 Slice
[Diagram: a slice of 24 EUs, composed of three subslices of 8 EUs each (as in Figure 19.10), together with fixed function units, an L3 data cache, shared local memory, and function logic.]

Page 19

Figure 19.12 Intel Core M Processor SoC
[Diagram: an Intel Core M processor containing Intel Processor Graphics Gen8, a slice of 24 EUs as in Figure 19.11 with fixed function units, L3 data cache, shared local memory, and atomics/barriers logic, attached through the GTI to the SoC ring interconnect. The ring also links two CPU cores, two LLC cache slices, a display controller, and a system agent with the memory controller and PCIe.]

Page 20

Summary
Chapter 19: General-Purpose Graphic Processing Units

- CUDA basics
- GPU versus CPU
  - Basic differences between CPU and GPU architectures
  - Performance and performance-per-watt comparison
- Intel's Gen8 GPU
- GPU architecture overview
  - Baseline GPU architecture
  - Full chip layout
  - Streaming multiprocessor architecture details
- Importance of knowing and programming to your memory types