Operation of the Basic SM...


Transcript of Operation of the Basic SM...


(1)

Operation of the Basic SM Pipeline

©Sudhakar Yalamanchili unless otherwise noted

(2)

Objectives

• Cycle-level examination of the operation of major pipeline stages in a stream multiprocessor

• A breadth-first look at a basic pipeline

• Understand the type of information necessary for each stage of operation

• Identification of performance bottlenecks
◦ Detailed implementations are addressed in subsequent modules


(3)

Objectives

[Figure: system-level GPU organization — the host CPU connects to the GPU over an interconnection bus. A Kernel Management Unit with HW work queues holds pending kernels and feeds the Kernel Distributor (each entry records PC, Dim, Param, ExeBL). The SMX Scheduler, using control registers, launches thread blocks onto the SMXs; each SMX contains cores, registers, L1 cache / shared memory, warp schedulers, and warp contexts. Memory controllers, the L2 cache, and DRAM back the SMXs. The following slides step inside an SMX.]

(4)

Reading

• Documentation for the GPGPU-Sim simulator
◦ Good source of information about the general organization and operation of a stream multiprocessor
◦ http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual

• Operation of a Scoreboard
◦ https://en.wikipedia.org/wiki/Scoreboarding

• General-Purpose Graphics Processor Architectures, T. Aamodt, W. Fung, and T. Rogers, Chapter 2.2


(5)

NVIDIA GK110 (Kepler)

[Figure: GK110 block diagram, highlighting the Thread Block Scheduler above the SMX array.]

Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/

Hierarchy of schedulers: kernel, TB, warp, memory transactions

(6)

SMX Organization: GK110

Multiple Warp Schedulers

192 cores – 6 clusters of 32 cores each

64K 32-bit registers

Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/

What are the main stages of a generic SMX pipeline?


(7)

A Generic SM Pipeline

[Figure: a generic SM pipeline. Front end: scalar fetch & decode (I-Fetch, I-Buffer, Decode) and instruction issue with the warp scheduler (pending warps, e.g., Warp 1, Warp 2, ..., Warp 6). Operands come from the predicate and general-purpose register files (PRF/RF). Scalar cores: a set of scalar pipelines. Back end: data memory access (D-Cache — all hit? / miss?) and writeback/commit.]

(8)

Single Warp Execution

[Figure: per-warp state — PC, AM (active mask), WID, State — for a warp drawn from a thread block of the grid.]

PTX (Assembly):

    setp.lt.s32   %p, %r5, %rd4;    // r5 = index, rd4 = N
    @p  bra L1;
        bra L2;
L1: ld.global.f32 %f1, [%r6];       // r6 = &a[index]
    ld.global.f32 %f2, [%r7];       // r7 = &b[index]
    add.f32       %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;       // r8 = &c[index]
L2: ret;
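For reference, a CUDA kernel along the following lines compiles to PTX very much like the snippet above (a minimal sketch; the kernel and parameter names are assumptions, since the slide shows only the PTX):

    // Hypothetical CUDA source for the PTX above: an element-wise vector add
    // guarded by a bounds check (the predicated branch around the loads/stores).
    __global__ void vecAdd(const float *a, const float *b, float *c, int N) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;   // %r5 in the PTX
        if (index < N) {                                      // setp.lt.s32 + @p bra
            c[index] = a[index] + b[index];                   // ld, ld, add, st
        }
    }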


(9)

Instruction Fetch & Decode

[Figure: fetch stage — a table of per-warp entries (PC, AM, WID, State, Instr); a next-warp selector chooses which warp's PC (Warp 0 PC, Warp 1 PC, ..., Warp n-1 PC) is sent to the I-Cache each cycle.]

From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores

May realize multiple fetch policies.
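As an illustration, one plausible fetch policy is a loose round robin over the warps that are active and have room in their I-buffer (a minimal C++ sketch; the structure and field names are assumptions, not the hardware's):

    #include <cstdint>
    #include <vector>

    // One fetch-stage entry per warp (hypothetical fields).
    struct WarpFetchState {
        uint32_t pc;            // address of the warp's next instruction
        bool     active;        // warp still has instructions to run
        bool     ibufferEmpty;  // fetch only if the warp's I-buffer has a free slot
    };

    // Round-robin selection of the next warp to fetch for, starting just after
    // the warp fetched last cycle. Returns -1 if no warp is eligible.
    int selectNextWarp(const std::vector<WarpFetchState>& warps, int lastFetched) {
        int n = static_cast<int>(warps.size());
        for (int i = 1; i <= n; ++i) {
            int w = (lastFetched + i) % n;
            if (warps[w].active && warps[w].ibufferEmpty)
                return w;   // this warp's PC is sent to the I-cache
        }
        return -1;
    }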


Examples from Harmonica2 GPU

(10)

Instruction Buffer


• Buffer a fixed number of instructions per warp

• Coordinated with instruction fetch
◦ Need an empty I-buffer entry for the warp

• V: valid instruction in the buffer
• R: instruction ready to be issued
◦ Set using the scoreboard logic

From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores

[Figure: I-buffer holding two decoded instructions per warp (Instr 1 W1, Instr 2 W1, ..., Instr 1 W2, ..., Instr 2 Wn), each entry flanked by a V bit and an R bit. Example: buffer 2 instructions/warp.]
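A minimal C++ sketch of one such I-buffer entry (field and type names are made up for illustration):

    #include <cstdint>

    // Stand-in for a decoded instruction (details omitted).
    struct DecodedInstr { uint32_t opcode; uint8_t dst, src1, src2; };

    // One I-buffer entry: V bit, decoded instruction, R bit.
    struct IBufferEntry {
        bool         valid = false;  // V: a fetched, decoded instruction is present
        DecodedInstr instr{};
        bool         ready = false;  // R: set by the scoreboard logic when issue is safe
    };

    // Sizing from the slide's example: 2 entries per warp.
    constexpr int kEntriesPerWarp = 2;
    constexpr int kMaxWarps       = 48;                // assumption; SM-generation dependent
    IBufferEntry ibuffer[kMaxWarps][kEntriesPerWarp];  // indexed by warp ID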


(11)

Instruction Buffer (2)


• Scoreboard enforces WAW and RAW hazards
◦ Indexed by warp ID
◦ Each entry hosts required registers
◦ Destination registers are reserved at issue
◦ Reserved registers released at writeback

• Enables multiple instructions to be in execution from a single warp

From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores


(12)

Instruction Buffer (3)


From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores


Generic scoreboard — one entry per functional unit:

    Name  Busy  Op    Fi   Fj   Fk   Qj   Qk   Rj   Rk
    Int   Yes   Load  F2   R3   -    -    -    No   -

    Fi: destination register; Fj, Fk: source registers (src1, src2)
    Qj, Qk: functional unit producing each source value
    Rj, Rk: does the source register already have its value?

• Next: Modified scoreboard design to address
◦ Having multiple instructions in transit
◦ Excessive demand for register file ports
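To make the hazard check concrete, here is a minimal C++ sketch of a per-warp register scoreboard of the kind described above (a simplified bit-vector variant, not the exact GPGPU-Sim or GK110 design; sizes and names are assumptions):

    #include <bitset>

    constexpr int kMaxWarps = 48;    // assumption; SM-generation dependent
    constexpr int kNumRegs  = 256;   // registers tracked per warp

    // One bit per (warp, register): set while an in-flight instruction has
    // reserved that register as its destination.
    std::bitset<kNumRegs> reserved[kMaxWarps];

    // RAW/WAW check: issue only if none of the instruction's source or
    // destination registers is reserved by an older in-flight instruction.
    bool noHazard(int wid, int dst, int src1, int src2) {
        return !reserved[wid][dst] && !reserved[wid][src1] && !reserved[wid][src2];
    }

    void reserveAtIssue(int wid, int dst)     { reserved[wid][dst] = true;  }  // reserve at issue
    void releaseAtWriteback(int wid, int dst) { reserved[wid][dst] = false; }  // release at writeback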


(13)

Instruction Issue


[Figure: the warp scheduler selects an instruction from a pool of ready warps (e.g., Warp 3, Warp 7, Warp 8) to issue each cycle.]

From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores

Manages implementation of barriers, register dependencies, and control divergence.

(14)

Instruction Issue (2)


From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores

• Barriers – warps wait here for barrier synchronization
◦ All threads in the thread block must reach the barrier

[Figure: warps of a thread block held at a barrier until all arrive.]
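For example, a block-wide barrier in CUDA looks like this (a minimal sketch; the kernel, its buffers, and the fixed 256-thread block size are assumptions):

    // Every warp that reaches __syncthreads() is held at issue until all
    // threads of the thread block have arrived.
    __global__ void stencilStep(const float *in, float *out, int n) {
        __shared__ float tile[256];                 // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];
        __syncthreads();                            // all warps of the block wait here
        if (i < n && threadIdx.x > 0)
            out[i] = 0.5f * (tile[threadIdx.x] + tile[threadIdx.x - 1]);
    }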


(15)

Instruction Issue (3)


From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores

• Register dependencies – tracked through the scoreboard


(16)

Instruction Issue (4)


From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores

• Control divergence – per-warp stack
◦ Keeps track of divergent threads at a branch
• Create an execution mask that is read with the operands

[Figure: SIMT stack (per warp) handling divergent warps.]
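A minimal C++ sketch of a per-warp SIMT stack, as it is commonly described (illustrative only; the entry fields and the reconvergence-at-post-dominator policy are assumptions rather than the slide's exact design):

    #include <cstdint>
    #include <vector>

    // One SIMT-stack entry: which threads are active, where they execute next,
    // and where the divergent paths reconverge.
    struct SimtEntry {
        uint32_t activeMask;   // execution mask read along with the operands
        uint32_t nextPC;       // PC for the threads in activeMask
        uint32_t reconvPC;     // pop this entry when nextPC reaches this address
    };

    using SimtStack = std::vector<SimtEntry>;   // one stack per warp

    // On a divergent branch: turn the current top into the reconvergence entry,
    // then push the not-taken and taken paths with their split masks.
    void onDivergentBranch(SimtStack& st, uint32_t takenMask, uint32_t takenPC,
                           uint32_t fallthroughPC, uint32_t reconvPC) {
        uint32_t curMask = st.back().activeMask;
        st.back().nextPC = reconvPC;                                    // reconvergence entry
        st.push_back({curMask & ~takenMask, fallthroughPC, reconvPC});  // not-taken path
        st.push_back({curMask &  takenMask, takenPC,       reconvPC});  // taken path (executes first)
    }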


(17)

Instruction Issue (5)

• Scheduler can issue multiple instructions from a warp
• Issue conditions (see the sketch after this list)
◦ Has valid instructions
◦ Not waiting at a barrier
◦ Passes the scoreboard check
◦ Pipeline is not stalled at the operand access stage (will get to it later)

• Reserve destination registers

• Instructions may issue to memory, SP or SFU pipelines

• Warp scheduling disciplines → more later in the course
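Putting those issue conditions together, a minimal C++ sketch of the issue test (the structures are simplified stand-ins for the I-buffer and scoreboard state sketched earlier; all names are illustrative):

    #include <cstdint>

    // Minimal stand-ins for the decoded instruction and its I-buffer entry.
    struct DecodedInstr { uint8_t dst, src1, src2; };
    struct IBufEntry    { bool valid; DecodedInstr instr; bool ready; };

    // An instruction from a warp may issue this cycle only if it is valid,
    // the warp is not parked at a barrier, the scoreboard reports no RAW/WAW
    // hazard, and the operand-access stage can accept it.
    bool mayIssue(const IBufEntry& e, bool warpAtBarrier,
                  bool scoreboardClear, bool operandStageStalled) {
        return e.valid && !warpAtBarrier && scoreboardClear && !operandStageStalled;
    }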

(18)

Register File Access


[Figure: operand collector organization — a single-ported register file split into banks (Banks 0-15, 1024 bits wide), with registers RF0 ... RFn-1 interleaved across the banks; an arbiter schedules bank reads onto a crossbar (Xbar) that fills the Operand Collectors (OC); Dispatch Units (DU) forward ready instructions to the ALUs, L/S, and SFU units; results return from the SPs.]
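A minimal sketch of how registers might be striped across the single-ported banks so that the operand collectors can gather sources with few conflicts (the interleaving below is a common textbook choice, assumed here rather than read off the slide):

    constexpr int kNumBanks = 16;   // Banks 0-15 in the figure

    // Interleave: logical register r of warp w lives in bank (w + r) mod 16,
    // so consecutive registers of one warp, and the same register of different
    // warps, fall in different banks.
    int regToBank(int warpId, int regId) {
        return (warpId + regId) % kNumBanks;
    }

    // Two operand reads conflict (and are serialized by the arbiter) only if
    // they target the same single-ported bank in the same cycle.
    bool bankConflict(int warpA, int regA, int warpB, int regB) {
        return regToBank(warpA, regA) == regToBank(warpB, regB);
    }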


(19)

Scalar Pipeline


[Figure: a single core — dispatch feeds the ALU, FPU, and LD/ST units, with completed results gathered in a result queue.]

• Functional units are pipelined

• Designs with multiple issue

(20)

Shared Memory Access

[Figure: shared memory bank access patterns — a conflict-free access vs. a 2-way conflict access.]

• Multiple bank organization

• Data is interleaved across banks

• Bank conflicts extend access times (see the sketch below)
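A CUDA illustration of the two access patterns (a sketch; it assumes the common 32-bank, 4-byte-wide shared memory organization and a single 32-thread warp per block):

    __global__ void bankConflictDemo(float *out) {
        __shared__ float sm[64];
        int tid = threadIdx.x;                 // assumes blockDim.x == 32
        sm[tid] = tid; sm[tid + 32] = tid;     // fill the tile
        __syncwarp();                          // make the writes visible within the warp

        // Conflict-free: consecutive threads read consecutive 4-byte words,
        // which land in 32 different banks, so the warp needs one pass.
        float a = sm[tid];

        // 2-way conflict: stride-2 indexing maps pairs of threads to the same
        // bank (threads 0 and 16 both hit bank 0), so the access takes two passes.
        float b = sm[2 * tid];

        out[tid] = a + b;
    }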


(21)

Memory Request Coalescing

[Figure: memory request coalescing — each thread's memory request (Tid, RQ size, base address, offset) enters the Pending Request Table; memory address coalescing produces a pending-request count and a list of (address, thread mask) pairs, one per memory transaction.]

• Pending Request Table (PRT) is filled whenever a memory request is issued

• Generate a set of address masks → one for each memory transaction (see the sketch below)

• Issue transactions

From J. Leng et al., “GPUWattch: Enabling Energy Optimizations in GPGPUs,” ISCA 2013
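A minimal C++ sketch of the coalescing step itself (illustrative only; it assumes 128-byte transactions and a 32-thread warp, and the structures are simplified stand-ins for the PRT above):

    #include <cstdint>
    #include <map>
    #include <vector>

    constexpr uint64_t kLineBytes = 128;   // assumed transaction size

    struct Transaction {
        uint64_t addr;         // 128-byte-aligned transaction address
        uint32_t threadMask;   // which of the warp's 32 threads it serves
    };

    // Group one warp's per-thread addresses into the minimal set of aligned
    // transactions, each carrying an address (thread) mask.
    std::vector<Transaction> coalesce(const uint64_t (&addr)[32], uint32_t activeMask) {
        std::map<uint64_t, uint32_t> lines;             // line address -> thread mask
        for (int t = 0; t < 32; ++t)
            if (activeMask & (1u << t))
                lines[addr[t] / kLineBytes * kLineBytes] |= (1u << t);

        std::vector<Transaction> out;
        for (const auto& [line, mask] : lines)
            out.push_back({line, mask});
        return out;    // one memory transaction per distinct line touched
    }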

(22)

Memory Hierarchy

• Configurable cache/shared memory configuration for L1

• Read-only cache for compiler or developer (intrinsics) use

• Shared L2 across all SMXs
• ECC coverage across the hierarchy
◦ Performance impact

From GK110: NVIDIA white paper

[Figure: SMX memory hierarchy — per-SMX L1 cache / shared memory and read-only cache, backed by a shared L2 cache and DRAM.]
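For instance, both the configurable L1/shared split and the read-only path are visible to CUDA programs (a sketch using standard CUDA calls; the kernel and sizes are assumptions, and error checking is omitted):

    #include <cuda_runtime.h>

    // Read-only data can be routed through the read-only cache either via the
    // const __restrict__ qualifiers or explicitly with the __ldg() intrinsic.
    __global__ void scale(const float* __restrict__ in, float *out, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = s * __ldg(&in[i]);   // load via the read-only cache
    }

    int main() {
        // Prefer a larger shared-memory share of the configurable L1/shared array
        // (GK110 allows 16/32/48 KB splits between L1 and shared memory).
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
        // ... allocate device buffers and launch scale<<<grid, block>>>(...) ...
        return 0;
    }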


(23)

Summary

• Synchronous progress of a warp through the SM pipelines

• Warp progress in a thread block can diverge for many reasons
◦ Barriers
◦ Control divergence
◦ Memory divergence

• How is the execution optimized? Next →