Dynamic Sched.1 10/14 Dynamic Hardware Scheduling Microarchitecture ILP.


Dynamic Hardware Scheduling

Microarchitecture ILP


From Pipelining Review

• Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls

– Ideal pipeline CPI: measure of the maximum performance attainable by the implementation

– Structural hazards: HW cannot support this combination of instructions

– Data hazards: Instruction depends on result of prior instruction still in the pipeline

– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
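As a quick worked example of the CPI decomposition above, the sketch below adds per-instruction stall contributions to an ideal CPI. The function name and the stall figures are illustrative assumptions, not measurements from these slides.

```python
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Pipeline CPI = ideal CPI + average stall cycles per instruction from each source."""
    return ideal_cpi + structural + data_hazard + control

# Hypothetical stall cycles per instruction (assumed values, for illustration only).
cpi = pipeline_cpi(ideal_cpi=1.0, structural=0.05, data_hazard=0.20, control=0.15)
ipc = 1.0 / cpi  # instructions per clock is the reciprocal of CPI
```

With these assumed numbers the stalls add 0.4 cycles per instruction to the ideal CPI of 1.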


Data Hazard Resolution: In-order issue, in-order completion

[Figure: pipeline timing diagram, time (clock cycles) across and instruction order down, for lw r1, 0(r2); sub r4,r1,r6; and r6,r2,r7; or r8,r2,r9. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg; bubbles are inserted where an instruction must wait.]

Extend to Multiple instruction issue?


Techniques to Reduce Stalls

Chapter 3: HW

Chapter 4: SW


Instruction-Level Parallelism (ILP) from same code thread

• ILP within a Basic Block (BB) is small

• BB: straight-line code sequence with no branches in or out, except at the entry and the exit

– average dynamic branch frequency 15% to 20% => 4 to 7 instructions between branches

– instructions in BB likely depend on each other

• To get performance enhancements, exploit ILP across multiple basic blocks

• Loop-level parallelism: exploit parallelism among loop iterations

– Vector is one way

– dynamic via branch prediction or static via loop unrolling by compiler


Multiple Instruction Issuing

• issue packet: group of instructions from fetch unit that could issue in 1 clock

– If instruction causes structural hazard or a data hazard either due to earlier instruction in execution or to earlier instruction in issue packet, then instruction does not issue

– 0 to N instruction issues per clock cycle, for N-issue

• Must check whether instructions can issue in 1 cycle => issue stage is split and pipelined

– 1st stage: decides how many instructions from the packet can issue

– 2nd stage: examines hazards among the selected instructions and those already issued

– => higher branch penalties => prediction accuracy important
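The issue-packet rule above can be sketched as a small in-order check. This is a simplified model under stated assumptions: the `Instr` record and `count_issuable` function are hypothetical names, structural hazards are ignored, and any dependence on an in-flight or same-packet instruction blocks issue.

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, destination register, source registers.
Instr = namedtuple("Instr", ["op", "dest", "srcs"])

def count_issuable(packet, busy_dests):
    """Return how many instructions at the head of the packet can issue this cycle.

    busy_dests: registers still being written by earlier, in-flight instructions.
    Issue is in order, so the first blocked instruction stops the rest of the packet.
    """
    pending = set(busy_dests)
    issued = 0
    for ins in packet:
        raw = any(s in pending for s in ins.srcs)  # needs a result not yet produced
        waw = ins.dest in pending                  # would overwrite a pending result
        if raw or waw:
            break
        pending.add(ins.dest)
        issued += 1
    return issued

packet = [
    Instr("lw",  "r1", ["r2"]),
    Instr("and", "r6", ["r2", "r7"]),
    Instr("sub", "r4", ["r1", "r6"]),  # depends on lw and and in the same packet
    Instr("or",  "r8", ["r2", "r9"]),
]
n = count_issuable(packet, busy_dests=[])
```

Here `n` is between 0 and N for an N-wide packet, matching the "0 to N instruction issues per clock cycle" bullet.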


Getting CPI < 1: Issuing Multiple Instructions/Cycle

• Vector Processing: explicit coding of independent loops as operations on variable-length vectors

» E.g., multimedia instructions

• Superscalar: (1 to 8) instructions/cycle, scheduled by compiler or HW

– All modern CPUs: IBM POWER, Sun UltraSPARC, Core i3, i5, i7, ...

• Very Long Instruction Words (VLIW): instructions (4-16) scheduled by compiler; put ops into wide templates

– Intel Itanium

– TI DSP

• Issuing multiple instructions per cycle leads to quoting Instructions Per Clock cycle (IPC) instead of CPI


Modern Superscalar: Dynamically Scheduled – How / What to Widen


Superscalar Timing Example – 4 way

Clock         0    1    2    3    4    5    6    7    8    9    10   11
 0 AND        Fet  DQ   DS   EX   C/WB
 1 OR         Fet  DQ   DS   EX   C/WB
 2 FADD       Fet  DQ   DS   EX   EX   EX   C/WB
 3 FSUB       Fet  DQ   DS   DS   EX   EX   EX   C/WB
 4 ADDC            Fet  DQ   DS   EX   C    C    C/WB
 5 SUBFC           Fet  DQ   DS   EX   C    C    C/WB
 6 FMADD           Fet  DQ   DS   EX   EX   EX   C/WB
 7 FMSUB           Fet  DQ   DS   DS   EX   EX   EX   C/WB
 8 XOR                  Fet  DQ   DS   DS   EX   C    C    C/WB
 9 NEG                  Fet  DQ   DS   DS   EX   C    C    C/WB
10 FADDS                Fet  DQ   DQ   DS   EX   EX   EX   C/WB
11 FSUBS                Fet  DQ   DQ   DS   DS   EX   EX   EX   C/WB
12 ADD                       Fet  DQ   DQ   DS   DS   EX   C    C    C/WB
13 SUB                       Fet  DQ   DQ   DS   DS   EX   C    C    C/WB

The end of the DS stage marks a successful dispatch.


ILP and Data Dependencies

• Program order: the order instructions would execute in if run sequentially, one at a time, as determined by the code

• HW/SW goal: exploit parallelism while preserving the appearance of program order

– modify order in a manner that cannot be observed by the program

– must not affect the outcome of the program

• Ex: instructions involved in a name dependence can execute simultaneously if the name is changed so that the instructions do not conflict

– Register renaming resolves name dependence for regs

– Either by compiler or by HW


Name Dependence Review: Anti-dependence, Output Dependence, Control

• Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions

• Anti-dependence: InstrJ writes operand before InstrI reads it => Write After Read (WAR) hazard

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

• Output dependence: InstrJ writes operand before InstrI writes it => Write After Write (WAW) hazard

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

• Name dependences are not present with a large (infinite) number of registers

• Instructions are control dependent on branches; these control dependences must be preserved to preserve program order
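To make the three register hazard types concrete, here is a minimal sketch that classifies the hazard a later instruction has on an earlier one. The tuple encoding `(dest, src1, src2)` and the function name `hazards` are assumptions made for this example.

```python
def hazards(first, second):
    """Classify hazards that `second` has on the earlier instruction `first`.

    Each instruction is encoded as (dest, src1, src2).
    """
    d1, *s1 = first
    d2, *s2 = second
    found = set()
    if d1 in s2:
        found.add("RAW")  # true dependence: second reads what first writes
    if d2 in s1:
        found.add("WAR")  # anti-dependence: second writes what first reads
    if d1 == d2:
        found.add("WAW")  # output dependence: both write the same register
    return found

I_war = ("r4", "r1", "r3")  # sub r4,r1,r3
I_waw = ("r1", "r4", "r3")  # sub r1,r4,r3
J     = ("r1", "r2", "r3")  # add r1,r2,r3
K     = ("r6", "r1", "r7")  # mul r6,r1,r7
```

Only the J-to-K pair carries a real flow of data; the other two pairs conflict on the name r1 alone, which is why renaming can remove them.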


Advantages of Dynamic Scheduling

• Handles cases when dependences are unknown at compile time (e.g., memory references)

• Simplifies compiler

• Code compiled for one pipeline runs efficiently on a different pipeline

• Enables hardware speculation, a technique with significant performance advantages that builds on dynamic scheduling

• Key idea: instructions behind a stall can proceed (here SUBD depends on neither DIVD nor ADDD, so it need not wait)

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14

• Out-of-order execution => out-of-order completion.


Instruction Parallelism by HW

• Enables out-of-order execution and allows out-of-order completion

• Distinguish when an instruction begins execution and when it completes execution; in between, the instruction is in execution

• Dynamically scheduled pipeline: all instructions pass through the issue stage in order (in-order issue)


Dynamic Scheduling by Scoreboard: bookkeeping technique - OLD

• Out-of-order execution divides ID stage:

1. Issue—decode instructions, check for structural hazards

2. Read operands—wait until no data hazards, then read operands

• Scoreboards date to CDC 6600 in 1963

• Instructions execute whenever not dependent on previous instructions and no hazards.

• CDC 6600: in-order issue, out-of-order execution (when there are no conflicts and the hardware is available), out-of-order commit (or completion)

– No forwarding


Scoreboard Architecture (CDC 6600)

[Figure: CDC 6600 scoreboard architecture. Registers connect to the functional units (two FP Mult, FP Divide, FP Add, Integer) and to Memory; the SCOREBOARD controls them.]


Scoreboard Implications (FYI)

• Out-of-order completion => WAR, WAW hazards?

• Solutions for WAR:

– Stall writeback until registers have been read

– Read registers only during Read Operands stage

• Solution for WAW:

– Detect hazard and stall issue of new instruction until other instruction completes

• No register renaming

• Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units

• Scoreboard keeps track of dependencies between instructions that have already issued.

• Scoreboard replaces ID, EX, WB with 4 stages


Four Stages of Scoreboard Control FYI

• Issue—decode instructions & check for structural hazards (ID1)

– Instructions issued in program order (for hazard checking)

– Don’t issue if structural hazard

– Don’t issue if instruction is output dependent on previously issued but uncompleted instruction (no WAW hazards)

• Read operands—wait until no data hazards, then read operands (ID2)

– All real dependencies (RAW hazards) resolved in this stage. Wait for instructions to write back data.

– No data forwarding


Four Stages of Scoreboard Control FYI

• Execution—operate on operands (EX)

– Functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that execution is complete

• Write result—finish execution (WB)

– Stall until no WAR hazards with previous instructions:

Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14

CDC 6600 scoreboard would stall SUBD's write of F8 until ADDD reads its operands
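The WAR check in the write-result stage can be sketched as follows. This is a simplified model, not the CDC 6600 logic itself: the dictionary layout, the unit names `Add1` and `Add2` (two adders are assumed so that ADDD and SUBD can be in flight together), and the function name are all hypothetical. A flag Rj or Rk that is True means the operand is ready but has not yet been read.

```python
def can_write_result(fu, units):
    """WAR check for the write-result stage of a scoreboard.

    units maps a functional-unit name to a dict with keys
    Fi (dest), Fj/Fk (sources) and Rj/Rk (True = operand ready but not yet read).
    """
    dest = units[fu]["Fi"]
    for name, u in units.items():
        if name == fu:
            continue
        if (u["Fj"] == dest and u["Rj"]) or (u["Fk"] == dest and u["Rk"]):
            return False  # another unit still has to read the old value
    return True

units = {
    "Divide": {"Fi": "F0",  "Fj": "F2", "Fk": "F4",  "Rj": False, "Rk": False},  # DIVD executing
    "Add1":   {"Fi": "F10", "Fj": "F0", "Fk": "F8",  "Rj": False, "Rk": True},   # ADDD waits on F0
    "Add2":   {"Fi": "F8",  "Fj": "F8", "Fk": "F14", "Rj": False, "Rk": False},  # SUBD has executed
}
blocked = not can_write_result("Add2", units)  # SUBD may not overwrite F8 yet
units["Add1"]["Rk"] = False                    # ADDD has now read its operands
released = can_write_result("Add2", units)
```

The stall exists only because the scoreboard reads operands from the register file; buffering the old value of F8 elsewhere would remove it.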


Three Parts of Scoreboard FYI

• Instruction status: which of 4 steps the instruction is in

• Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:

– Busy: indicates whether the unit is busy or not

– Op: operation to perform in the unit (e.g., + or –)

– Fi: destination register

– Fj, Fk: source-register numbers

– Qj, Qk: functional units producing source registers Fj, Fk

– Rj, Rk: flags indicating when Fj, Fk are ready

• Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register


Example: Scoreboard tables before MUL.D writes results

Instruction status

Instruction        Issue  Read operands  Execution complete  Write result
L.D   F6,34(R2)    X      X              X                   X
L.D   F2,45(R3)    X      X              X                   X
MUL.D F0,F2,F4     X      X              X
SUB.D F8,F6,F2     X      X              X                   X
DIV.D F10,F0,F6    X
ADD.D F6,F8,F2     X      X              X

Functional unit status

Name     Busy  Op    Fi   Fj   Fk   Qj     Qk   Rj   Rk
Integer  No
Mult1    Yes   Mult  F0   F2   F4               No   No
Mult2    No
Add      Yes   Add   F6   F8   F2               No   No
Divide   Yes   Div   F10  F0   F6   Mult1       No   Yes

Register result status

      F0     F2   F4   F6   F8   F10     F12  ...
Unit  Mult1            Add       Divide


Tomasulo

• For IBM 360/91 (before caches!)

• Goal: High Performance without special compilers

• Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations

– Tomasulo: how to get more effective registers — renaming in hardware!

• Same idea used today: HP PA-8000, MIPS R10000, Core xx, Power 4, 5, 6, 7, …


Tomasulo Organization

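[Figure: Tomasulo organization. The FP Op Queue and FP Registers feed the reservation stations: Add1, Add2, Add3 in front of the FP adders and Mult1, Mult2 in front of the FP multipliers. Load buffers Load1 to Load6 receive data from memory, store buffers send data to memory, and the Common Data Bus (CDB) connects them.]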


Tomasulo Algorithm

• Control & buffers distributed with Function Units (FU)

– FU buffers called “reservation stations”; they hold pending operands

• Registers in instructions replaced by values or pointers to reservation stations (RS)

– form of register renaming

– avoids WAR, WAW hazards

– more reservation stations than registers, so can do optimizations compilers can’t

• Results go to FUs from RSs, not through registers, over the Common Data Bus, which broadcasts results to all FUs

• Loads and stores are treated as FUs with reservation stations

• Instructions can go past branches


Three Stages of Tomasulo

1. Issue: get instruction from FP Op Queue

If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).

2. Execute: operate on operands (EX)

When both operands ready then execute; if not ready, watch Common Data Bus for result

3. Write result: finish execution (WB)

Write on Common Data Bus to all awaiting units; mark reservation station available

• Common data bus: data + source (“come from” bus)
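The issue and write-result stages above can be sketched in a few lines. This is a toy model under stated assumptions: the class, the `vj`/`vk`/`qj`/`qk` field names, the station names, and the register values are illustrative, and execution timing is not modelled at all.

```python
class ReservationStation:
    def __init__(self, name):
        self.name, self.busy, self.op = name, False, None
        self.vj = self.vk = None  # operand values, once known
        self.qj = self.qk = None  # stations that will produce the missing operands

    def ready(self):
        return self.busy and self.qj is None and self.qk is None

def issue(rs, op, dest, src1, src2, regs, reg_status):
    """Issue: copy ready operands, or record the producing station (renaming)."""
    rs.busy, rs.op = True, op
    for src, v, q in ((src1, "vj", "qj"), (src2, "vk", "qk")):
        producer = reg_status.get(src)
        if producer is None:
            setattr(rs, v, regs[src])
        else:
            setattr(rs, q, producer)
    reg_status[dest] = rs.name  # later readers of dest wait on this station

def broadcast(tag, value, stations, regs, reg_status):
    """Write result: put (tag, value) on the CDB; every waiting unit captures it."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            regs[reg] = value
            del reg_status[reg]

regs = {"F2": 6.0, "F4": 3.0, "F8": 1.0, "F14": 4.0}  # assumed initial values
reg_status = {}
mult1, add1, add2 = (ReservationStation(n) for n in ("Mult1", "Add1", "Add2"))
stations = [mult1, add1, add2]

issue(mult1, "DIVD", "F0",  "F2", "F4",  regs, reg_status)
issue(add1,  "ADDD", "F10", "F0", "F8",  regs, reg_status)  # waits on Mult1
issue(add2,  "SUBD", "F8",  "F8", "F14", regs, reg_status)  # no WAR stall on F8

ready_before = add1.ready()
broadcast("Mult1", regs["F2"] / regs["F4"], stations, regs, reg_status)
mult1.busy = False  # mark reservation station available
```

Because ADDD copied the old value of F8 into its station at issue, SUBD is free to produce a new F8 at any time.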


How Tomasulo overlaps loop iterations

• Register renaming

– Multiple iterations use different physical destinations for registers (dynamic loop unrolling).

• Reservation stations

– Instructions advance past integer control flow operations

– Buffer old values of registers, avoiding the WAR stall seen in the scoreboard.


Tomasulo’s scheme offers 2 major advantages

(1) the distribution of the hazard detection logic

– distributed reservation stations and the CDB

– If multiple instructions waiting on single result, & each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB

– If a centralized register file were used, the units would have to read their results from the registers when register buses are available.

(2) the elimination of stalls for WAW and WAR hazards


Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)

                     Tomasulo (IBM 360/91)           Scoreboard (CDC 6600)
Functional units     Pipelined                       Multiple
                     (6 load, 3 store, 3 +, 2 x/÷)   (1 load/store, 1 +, 2 x, 1 ÷)
Window size          ≤ 14 instructions               ≤ 5 instructions
Structural hazard    No issue                        same
WAR                  renaming avoids                 stall completion
WAW                  renaming avoids                 stall issue
Results              Broadcast from FU               Write/read registers
Control              reservation stations            central scoreboard


Modern Superscalar


PPC604e


Flow paths in Superscalar


PPC604 Pipeline


Reorder Buffer Keeps Program order

Instruction slot is candidate for execution when:

• It holds a valid instruction (“use” bit is set)

• It has not already started execution (“exec” bit is clear)

• Both operands are available (p1 and p2 are set)

[Figure: reorder buffer with entries t1 ... tn; each entry has the fields Ins#, use, exec, op, p1, src1, p2, src2. ptr1 points to the next available entry and ptr2 to the next entry to deallocate.]
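The three candidate conditions translate directly into a filter over the reorder buffer slots. The `Slot` record below is a hypothetical, trimmed version of the entry format (the op, src1, and src2 fields are left out because the test does not depend on them).

```python
from dataclasses import dataclass

@dataclass
class Slot:
    ins: int     # instruction number
    use: bool    # slot holds a valid instruction
    exec: bool   # instruction has already started execution
    p1: bool     # first operand available
    p2: bool     # second operand available

def candidates(rob):
    """Return the instruction numbers of slots that may start execution."""
    return [s.ins for s in rob if s.use and not s.exec and s.p1 and s.p2]

rob = [
    Slot(ins=1, use=True,  exec=True,  p1=True, p2=True),   # already executing
    Slot(ins=2, use=True,  exec=False, p1=True, p2=False),  # waiting on an operand
    Slot(ins=3, use=True,  exec=False, p1=True, p2=True),   # candidate
    Slot(ins=4, use=False, exec=False, p1=True, p2=True),   # empty slot
]
ready = candidates(rob)
```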


Explicit Register Renaming

• Virtual expansion of ISA registers;

• physical register file >> number of ISA registers

• Implementation by maintaining translation rename table:

– ISA register => physical register mapping

– When register is written, replace table entry with new register from freelist.

– Physical register free if not being used by any instructions in progress.

[Figure: Fetch, Decode/Rename, Execute pipeline with a Rename Table.]

Eliminates WAW & WAR:

MUL R2,R2,R3 ; R2 = R2 * R3
ADD R4,R2,1 ; R4 = R2 + 1
ADD R2,R3,1 ; R2 = R3 + 1
DIV R5,R2,R4 ; R5 = R2 / R4
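A minimal sketch of a rename table with a freelist, applied to the four-instruction sequence above. The register counts, the `P` naming of physical registers, and the function name are assumptions; the sketch never returns registers to the freelist, which a real design does once no in-progress instruction uses them.

```python
def rename(program, num_arch_regs=8, num_phys_regs=16):
    """Rename (op, dest, src1, src2) tuples; integer literals pass through unchanged."""
    table = {f"R{i}": f"P{i}" for i in range(num_arch_regs)}  # ISA -> physical mapping
    freelist = [f"P{i}" for i in range(num_arch_regs, num_phys_regs)]
    out = []
    for op, dest, *srcs in program:
        new_srcs = [table.get(s, s) for s in srcs]  # read current mappings first
        table[dest] = freelist.pop(0)               # fresh physical register per write
        out.append((op, table[dest], *new_srcs))
    return out

program = [
    ("MUL", "R2", "R2", "R3"),
    ("ADD", "R4", "R2", 1),
    ("ADD", "R2", "R3", 1),
    ("DIV", "R5", "R2", "R4"),
]
renamed = rename(program)
```

After renaming, the two writes to R2 target different physical registers, so the WAW between MUL and the second ADD and the WAR between the first and second ADD disappear, while the RAW chains are preserved.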


Register Renaming

• Decode does register renaming ; adds instructions to reorder buffer (ROB)

• instructions in ROB with no RAW hazards can be dispatched.

[Figure: out-of-order (dataflow) execution pipeline: IF and ID, then Issue to the ALU, Mem, Fadd, and Fmul units, then WB.]


Register Renaming Support :

• Rapid access to translation table

• Physical register file with more registers than ISA

• figure out which physical registers are free.

– No free registers => stall on issue

• Register renaming doesn’t require reservation stations. However, some architectures use both register renaming and reservation stations


Advantages of Explicit Renaming

• Decouples renaming from scheduling:

• Allows data to be fetched from single register file

• Supports precise interrupt points:

– To “undo” instructions back to a precise breakpoint, just undo the table mappings

– Provides an interesting mix between reorder buffer and future file

» Results are written immediately back to register file

» Registers names are “freed” in program order (by ROB)


Explicit register renaming:

[Figure: reorder buffer entries from Oldest to Newest, each with a Done? flag. Current Map Table: P0, P2, P4, F6, F8, P10, P12, P14, P16, P18, P20, P22, P24, P26, P28, P30. Freelist: P32, P34, P36, P38, P60, P62.]

• Physical register file larger than ISA register file

• On issue, each instruction that modifies a register is allocated new physical register from freelist

• Used everywhere: R10000, HP PA8000, Power, Intel


RAW Hazards in Memory: Memory Disambiguation

• Question: given a load that follows a store in program order, are the two related?

• I.e., is there a RAW hazard between the store and the load?

E.g.:
st 0(R2),R5
ld R6,0(R3)

• Can we go ahead and start the load early?

• Store address could be delayed for a long time by calculation that leads to R2

• We might want to issue/begin execution of both operations in same cycle.

• Answer: not allowed to start load until we know that address 0(R2) ≠ 0(R3)


Hardware Support for Memory Disambiguation

• Buffer tracks outstanding stores to memory, in program order.

– Keep track of address (when available) and value (when available)

– FIFO ordering: retire stores in program order

• When issuing a load, record current head of store queue (know which stores are ahead of you).

• When have address for load, check store queue:

– If any store prior to load is waiting for its address, stall load.

– If load address matches earlier store address (associative lookup), then we have a memory-induced RAW hazard:

» store value available => return value

» store value not available => return ROB number of source

– Otherwise, send out request to memory

• Actual stores commit in order
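The load-side check against the store queue can be sketched as below. The dictionary keys, the return tags, and the choice to forward from the youngest matching store are assumptions of this example rather than details given on the slide.

```python
def resolve_load(load_addr, earlier_stores):
    """Decide what a load should do given the stores ahead of it in program order.

    earlier_stores: list of dicts, oldest first, with keys
    addr (None if not yet computed), value (None if not yet available), rob (ROB number).
    """
    if any(st["addr"] is None for st in earlier_stores):
        return ("stall", None)                   # an earlier store address is unknown
    for st in reversed(earlier_stores):          # youngest matching store wins
        if st["addr"] == load_addr:
            if st["value"] is not None:
                return ("forward", st["value"])  # store value available
            return ("wait_on_rob", st["rob"])    # value will come from this ROB entry
    return ("memory", None)                      # no match: send request to memory

queue = [
    {"addr": 0x100, "value": 7,    "rob": 3},
    {"addr": 0x200, "value": None, "rob": 5},
]
hit = resolve_load(0x100, queue)
pending = resolve_load(0x200, queue)
miss = resolve_load(0x300, queue)
```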


What about Precise Interrupts? E.g., Page Fault?

• State as if no instruction beyond the faulting instruction has issued

• Tomasulo had: in-order issue, out-of-order execution, and out-of-order completion

• Need to “fix” the out-of-order completion

• Need to find a precise breakpoint in the instruction stream


Relationship between precise interrupts and speculation:

• Speculation: guess and check

• Important for branch prediction: take the “best shot” at predicting branch direction.

• If speculation is wrong, back up and restart execution from the point of the incorrect prediction

• Technique for both precise interrupts/exceptions and speculation: in-order completion or commit


HW support for precise interrupts

• Need HW buffer for results of uncommitted instructions: reorder buffer

– 3 fields: instr, destination, value

– Use reorder buffer number instead of reservation station when execution completes

– Supplies operands between execution complete & commit

– (Reorder buffer can be operand source => more registers like RS)

– Instructions commit in order; once an instruction commits, its result is put into the register

– As a result, easy to undo speculated instructions on mispredicted branches or exceptions

[Figure: FP Op Queue, reorder buffer, FP registers, reservation stations, and FP adders.]
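In-order commit from a reorder buffer can be sketched with a queue. The entry layout extends the three fields named above with a hypothetical `done` flag, and `flush` models the simple case where everything still in the buffer is discarded.

```python
from collections import deque

def commit(rob, regs):
    """Retire finished instructions from the head of the ROB, in program order."""
    committed = []
    while rob and rob[0]["done"]:
        entry = rob.popleft()
        regs[entry["dest"]] = entry["value"]  # architectural state changes only here
        committed.append(entry["instr"])
    return committed

def flush(rob):
    """Discard all uncommitted results, e.g. after a mispredicted branch or exception."""
    rob.clear()

regs = {}
rob = deque([
    {"instr": "DIVD", "dest": "F0",  "value": None, "done": False},
    {"instr": "ADDD", "dest": "F10", "value": None, "done": False},
    {"instr": "SUBD", "dest": "F12", "value": -3.0, "done": True},  # finished early
])
first = commit(rob, regs)            # nothing commits: DIVD at the head is not done
rob[0].update(value=2.0, done=True)  # DIVD finishes
second = commit(rob, regs)           # DIVD commits; ADDD now holds back SUBD
```

SUBD finished first but its result stays in the buffer, so an exception raised by DIVD or ADDD would leave the registers in a precise state.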


Hardware complexities with reorder buffer (ROB)

• How to find the latest version of a register?

– need associative compare

– register result status buffer tracks which reorder buffer has received the value

• Need as many ports on ROB as register file

[Figure: FP Op Queue, reservation stations, FP adders, and FP registers, with the reorder buffer attached through a compare network. Reorder table fields: Dest Reg, Result, Exceptions?, Valid, Program Counter.]


Superscalar v. VLIW

• Smaller code size

• Binary compatibility across generations of hardware

• HW better branch prediction

• SW speculation = Simplified Hardware for decoding, issuing instructions

• No Interlock Hardware (compiler checks?)

• More registers, but simplified Hardware for Register Ports (multiple independent register files?)


Summary

• DataFlow view:

– Data triggers execution rather than instructions triggering data

• Dynamic hardware schemes can unroll loops dynamically in hardware

• Explicit register renaming: more physical registers than needed by ISA.

– Rename table: tracks current association between architectural registers and physical registers

– Translation table performs mapping on the fly

• Precise interrupts:

– Must commit results in order

– Reorder buffer: temporarily holds results until commit possible

• Lasting contributions in today’s processors:

– Dynamic scheduling

– Register renaming

– Load/store disambiguation

• ILP limits: to make performance progress in future, need explicit parallelism from the programmer

• Multithreading, multi-core architectures: explicitly parallel algorithms
