Dynamic Sched.1 10/14 Dynamic Hardware Scheduling Microarchitecture ILP.
Dynamic Sched.110/14
Dynamic Hardware Scheduling
Microarchitecture ILP
Dynamic Sched.210/14
From Pipelining Review
• Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
– Ideal pipeline CPI: measure of the maximum performance attainable by the implementation
– Structural hazards: HW cannot support this combination of instructions
– Data hazards: Instruction depends on result of prior instruction still in the pipeline
– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
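The CPI breakdown above can be illustrated with a small worked example. This is only a sketch: the function name `pipeline_cpi` and the stall rates are hypothetical numbers chosen for illustration, not measurements from the slides.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI = ideal CPI + average stall cycles per instruction of each kind."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls


# Hypothetical stall cycles per instruction (assumed values, for illustration only).
cpi = pipeline_cpi(ideal_cpi=1.0,
                   structural_stalls=0.05,
                   data_hazard_stalls=0.20,
                   control_stalls=0.15)
ipc = 1.0 / cpi  # instructions per clock is the reciprocal of CPI

print(f"CPI = {cpi:.2f}, IPC = {ipc:.2f}")  # prints "CPI = 1.40, IPC = 0.71"
```

With these assumed rates, stalls add 0.4 cycles per instruction on top of the ideal CPI of 1, which is the gap the techniques in the following slides try to close.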
Dynamic Sched.310/14
Data Hazard Resolution: In-order issue, in-order completion
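[Figure: pipeline timing diagram, time in clock cycles (horizontal) vs. instruction order (vertical), for lw r1, 0(r2); sub r4,r1,r6; and r6,r2,r7; or r8,r2,r9. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg; the three instructions after the load each carry a Bubble (stall cycle).]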
Extend to Multiple instruction issue?
Dynamic Sched.410/14
Techniques to Reduce Stalls
• Chapter 3: HW
• Chapter 4: SW
Dynamic Sched.510/14
Instruction-Level Parallelism (ILP) from the same code thread
• Basic Block (BB): straight-line code sequence with no branches except at entry and exit
• ILP within a BB is small
– average dynamic branch frequency 15% to 20% => 4 to 7 instructions between branches
– instructions in BB likely depend on each other
• To get performance enhancements, exploit ILP across multiple basic blocks
• Loop-level parallelism: exploit parallelism across loop iterations
– Vector is one way
– dynamic via branch prediction or static via loop unrolling by compiler
Dynamic Sched.610/14
Multiple Instruction Issuing
• issue packet: group of instructions from fetch unit that could issue in 1 clock
– If instruction causes structural hazard or a data hazard either due to earlier instruction in execution or to earlier instruction in issue packet, then instruction does not issue
– 0 to N instruction issues per clock cycle, for N-issue
• Must check whether instructions can issue in 1 cycle => issue stage is split and pipelined
– 1st stage: decides how many instructions from the packet can issue
– 2nd stage: examines hazards among the selected instructions and those already issued
– => higher branch penalties => prediction accuracy important
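The issue-packet rule above can be sketched as a small check. This is a simplified model under stated assumptions, not the logic of any specific processor: the names `Instr` and `issue_count` are hypothetical, each functional unit accepts one instruction per cycle, and there is no forwarding inside a packet.

```python
from collections import namedtuple

# Hypothetical instruction record: opcode, destination register,
# source registers, and the functional unit it needs.
Instr = namedtuple("Instr", "op dest srcs unit")


def issue_count(packet, busy_units, pending_dests):
    """Return how many leading instructions of the packet can issue this cycle.

    busy_units:    units already occupied by earlier instructions
    pending_dests: registers whose results are still outstanding
    """
    used_units = set(busy_units)
    written = set(pending_dests)
    issued = 0
    for ins in packet:
        structural = ins.unit in used_units            # unit not available
        raw = any(src in written for src in ins.srcs)  # needs a pending result
        waw = ins.dest in written                      # would overwrite a pending result
        if structural or raw or waw:
            break  # in-order issue: everything behind this instruction waits too
        used_units.add(ins.unit)  # later packet members see this one as "earlier"
        written.add(ins.dest)
        issued += 1
    return issued


packet = [
    Instr("add", "r1", ("r2", "r3"), "alu0"),
    Instr("lw",  "r4", ("r5",),      "mem"),
    Instr("sub", "r6", ("r1", "r4"), "alu1"),  # depends on add and lw in the same packet
    Instr("or",  "r7", ("r2", "r5"), "alu1"),
]
print(issue_count(packet, busy_units=set(), pending_dests=set()))  # prints 2
```

Here `add` and `lw` issue, `sub` is blocked by a data hazard on an earlier instruction in the same packet, and `or` waits behind it, so between 0 and N instructions issue per cycle as the slide states.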
Dynamic Sched.710/14
Getting CPI < 1: Issuing Multiple Instructions/Cycle
• Vector Processing: explicit coding of independent loops as operations on vectors
» E.g., multimedia instructions
• Superscalar: (1 to 8) instructions/cycle, scheduled by compiler or HW
– All modern CPUs: IBM POWER, Sun UltraSparc, Core i3, i5, i7, ...
• Very Long Instruction Word (VLIW): instructions (4-16) scheduled by compiler; put ops into wide templates
– Intel Itanium
– TI DSP
• Issuing multiple instructions per cycle leads to measuring Instructions Per Clock cycle (IPC) instead of CPI
Dynamic Sched.810/14
Modern Superscalar, Dynamically Scheduled: How / What to Widen
Dynamic Sched.910/14
Clock 0 1 2 3 4 5 6 7 8 9 10 11
0 AND Fet DQ DS EX C/WB
1 OR Fet DQ DS EX C/WB
2 FADD Fet DQ DS EX EX EX C/WB
3 FSUB Fet DQ DS DS EX EX EX C/WB
4 ADDC Fet DQ DS EX C C C/WB
5 SUBFC Fet DQ DS EX C C C/WB
6 FMADD Fet DQ DS EX EX EX C/WB
7 FMSUB Fet DQ DS DS EX EX EX C/WB
8 XOR Fet DQ DS DS EX C C C/WB
9 NEG Fet DQ DS DS EX C C C/WB
10 FADDS Fet DQ DQ DS EX EX EX C/WB
11 FSUBS Fet DQ DQ DS DS EX EX EX C/WB
12 ADD Fet DQ DQ DS DS EX C C C/WB
13 SUB Fet DQ DQ DS DS EX C C C/WB
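Superscalar Timing Example: 4-way issue
Each row lists the stages an instruction occupies in successive cycles, starting from its fetch (Fet) cycle. The end of the DS stage marks a successful dispatch.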
Dynamic Sched.1010/14
ILP and Data Dependencies
• Program order: the order instructions would execute in if run sequentially, one at a time, as determined by the code
• HW/SW goal: exploit parallelism while preserving the appearance of program order
– modify the order only in a manner that cannot be observed by the program
– must not affect the outcome of the program
• Ex: instructions involved in a name dependence can execute simultaneously if the name is changed so that the instructions do not conflict
– Register renaming resolves name dependences for registers
– Done either by compiler or by HW
Dynamic Sched.1110/14
Name Dependence Review: Anti-dependence, Output Dependence, Control
• Name dependence: two instructions use the same register or memory location (called a name), but no data flows between the instructions (not present with a large, ideally infinite, number of registers)
• Anti-dependence: InstrJ writes an operand before InstrI reads it => Write After Read (WAR) hazard
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Output dependence: InstrJ writes an operand before InstrI writes it => Write After Write (WAW) hazard
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Control dependence: instructions are control dependent on branches; these control dependences must be preserved to preserve program order
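The hazard categories in this review can be checked mechanically from register names. The sketch below is illustrative only; the function name `classify` and the `(dest, src1, src2)` tuple encoding are assumptions made for this example.

```python
def classify(first, second):
    """Classify register dependences between two instructions.

    Each instruction is a (dest, src1, src2) tuple; `first` comes earlier
    in program order than `second`.
    """
    d1, *s1 = first
    d2, *s2 = second
    kinds = []
    if d1 in s2:
        kinds.append("RAW")  # true dependence: second reads what first writes
    if d2 in s1:
        kinds.append("WAR")  # anti-dependence: second writes what first reads
    if d1 == d2:
        kinds.append("WAW")  # output dependence: both write the same register
    return kinds


# The two I/J pairs from the slide, plus the J/K pair.
print(classify(("r4", "r1", "r3"), ("r1", "r2", "r3")))  # ['WAR']
print(classify(("r1", "r4", "r3"), ("r1", "r2", "r3")))  # ['WAW']
print(classify(("r1", "r2", "r3"), ("r6", "r1", "r7")))  # ['RAW']
```

Only the RAW case carries data between instructions; the WAR and WAW cases disappear if the second instruction is given a different destination name, which is what register renaming does.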
Dynamic Sched.1210/14
Advantages of Dynamic Scheduling
• Handles cases when dependences are unknown at compile time (e.g., memory references)
• Simplifies the compiler
• Code compiled for one pipeline runs efficiently on a different pipeline
• Enables hardware speculation, a technique with significant performance advantages that builds on dynamic scheduling
• Key idea: instructions behind a stall can proceed
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
– SUBD does not depend on DIVD or ADDD, so it need not wait behind the stalled ADDD
• Out-of-order execution => out-of-order completion
Dynamic Sched.1310/14
Instruction Parallelism by HW
• Enables out-of-order execution and allows out-of-order completion
• Distinguish when an instruction begins execution from when it completes execution; between the two, the instruction is in execution
• Dynamically scheduled pipeline: all instructions pass through the issue stage in order (in-order issue)
Dynamic Sched.1410/14
Dynamic Scheduling by Scoreboard: bookkeeping technique - OLD
• Out-of-order execution divides ID stage:
1. Issue—decode instructions, check for structural hazards
2. Read operands—wait until no data hazards, then read operands
• Scoreboards date to the CDC 6600 in 1963
• Instructions execute whenever they do not depend on previous instructions and there are no hazards
• CDC 6600: in-order issue, out-of-order execution (when there are no conflicts and the hardware is available), and out-of-order commit (or completion)
– No forwarding
Dynamic Sched.1510/14
Scoreboard Architecture (CDC 6600)
[Figure: Registers, Functional Units (FP Mult, FP Mult, FP Divide, FP Add, Integer), Memory, and the SCOREBOARD that controls them.]
Dynamic Sched.1610/14
Scoreboard Implications (FYI)
• Out-of-order completion => WAR, WAW hazards?
• Solutions for WAR:
– Stall writeback until registers have been read
– Read registers only during Read Operands stage
• Solution for WAW:
– Detect hazard and stall issue of new instruction until other instruction completes
• No register renaming
• Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units
• Scoreboard keeps track of dependencies between instructions that have already issued.
• Scoreboard replaces ID, EX, WB with 4 stages
Dynamic Sched.1710/14
Four Stages of Scoreboard Control FYI
• Issue—decode instructions & check for structural hazards (ID1)
– Instructions issued in program order (for hazard checking)
– Don’t issue if structural hazard
– Don’t issue if instruction is output dependent on previously issued but uncompleted instruction (no WAW hazards)
• Read operands—wait until no data hazards, then read operands (ID2)
– All real dependencies (RAW hazards) resolved in this stage. Wait for instructions to write back data.
– No data forwarding
Dynamic Sched.1810/14
Four Stages of Scoreboard Control FYI
• Execution—operate on operands (EX)
– Functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that execution is complete
• Write result—finish execution (WB)
– Stall until no WAR hazards with previous instructions
Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
CDC 6600 scoreboard would stall the write of SUBD until ADDD reads its operands
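The write-result condition can be sketched as a single predicate. This is a simplified illustration, not the CDC 6600 logic itself: the function name `can_write_result` and the dictionary fields `srcs` and `read_operands_done` are hypothetical.

```python
def can_write_result(dest, earlier_instrs):
    """True when writing `dest` cannot cause a WAR hazard.

    earlier_instrs: earlier, still-active instructions, each a dict with
    'srcs' (source registers) and 'read_operands_done' (bool).
    The write must wait while any earlier instruction still has to read `dest`.
    """
    return all(dest not in e["srcs"] or e["read_operands_done"]
               for e in earlier_instrs)


# DIVD F0,F2,F4 ; ADDD F10,F0,F8 ; SUBD F8,F8,F14
divd = {"srcs": ("F2", "F4"), "read_operands_done": True}
addd = {"srcs": ("F0", "F8"), "read_operands_done": False}  # waiting for F0 from DIVD

print(can_write_result("F8", [divd, addd]))  # False: ADDD has not yet read F8

addd["read_operands_done"] = True            # DIVD wrote F0, ADDD read F0 and F8
print(can_write_result("F8", [divd, addd]))  # True: SUBD may now write F8
```

The cost is visible here: SUBD finishes executing early but its result is held back until ADDD has read its operands, and ADDD in turn is waiting on the long divide.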
Dynamic Sched.1910/14
Three Parts of Scoreboard FYI
• Instruction status: which of the 4 steps the instruction is in
• Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
– Busy: indicates whether the unit is busy or not
– Op: operation to perform in the unit (e.g., + or –)
– Fi: destination register
– Fj, Fk: source-register numbers
– Qj, Qk: functional units producing source registers Fj, Fk
– Rj, Rk: flags indicating when Fj, Fk are ready
• Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register
Dynamic Sched.2010/14
Example: scoreboard tables before MUL.D writes its result

Instruction status
Instruction        Issue  Read Operands  Execution Complete  Write Result
L.D   F6,34(R2)    X      X              X                   X
L.D   F2,45(R3)    X      X              X                   X
MUL.D F0,F2,F4     X      X              X
SUB.D F8,F6,F2     X      X              X                   X
DIV.D F10,F0,F6    X
ADD.D F6,F8,F2     X      X              X

Functional unit status
Name     Busy  Op    Fi   Fj  Fk  Qj     Qk  Rj  Rk
Integer  No
Mult1    Yes   Mult  F0   F2  F4             No  No
Mult2    No
Add      Yes   Add   F6   F8  F2             No  No
Divide   Yes   Div   F10  F0  F6  Mult1      No  Yes

Register result status
      F0     F2       F4  F6   F8  F10     F12  …
Unit  Mult1  Integer      Add      Divide
Dynamic Sched.2110/14
Tomasulo
• For IBM 360/91 (before caches!)
• Goal: High Performance without special compilers
• Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
– Tomasulo: how to get more effective registers — renaming in hardware!
• Same idea used today: HP 8000, MIPS 10000, Core xx, Power 4, 5, 6, 7…
Dynamic Sched.2210/14
Tomasulo Organization
[Figure: Tomasulo organization. FP Op Queue and FP Registers feed the Reservation Stations (Add1, Add2, Add3 for the FP adders; Mult1, Mult2 for the FP multipliers). Load Buffers (Load1 to Load6) bring data from memory and Store Buffers send data to memory. Results travel on the Common Data Bus (CDB).]
Dynamic Sched.2310/14
Tomasulo Algorithm
• Control and buffers are distributed with the Function Units (FUs)
– FU buffers are called “reservation stations” (RS); they hold pending operands
• Registers in instructions are replaced by values or by pointers to reservation stations
– a form of register renaming
– avoids WAR and WAW hazards
– more reservation stations than registers, so can do optimizations compilers can’t
• Results go to FUs from RSs, not through registers, over the Common Data Bus, which broadcasts results to all FUs
• Loads and stores are treated as FUs with reservation stations
• Instructions can go past branches
Dynamic Sched.2410/14
Three Stages of Tomasulo
1. Issue: get instruction from FP Op Queue
If a reservation station is free (no structural hazard), control issues the instruction and sends operands (renames registers).
2. Execute: operate on operands (EX)
When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: finish execution (WB)
Write on the Common Data Bus to all awaiting units; mark the reservation station available.
• Common data bus: data + source (“come from” bus)
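The three stages can be sketched in a few lines of Python. This is a deliberately reduced model under stated assumptions, not the IBM 360/91 design: the class names `RS` and `Tomasulo` are hypothetical, any free station can hold any operation, execution latency is not modeled, and the caller supplies the computed value at write-result time.

```python
class RS:
    """One reservation station: operand values (vj, vk) or producer tags (qj, qk)."""
    def __init__(self, name):
        self.name = name
        self.busy = False
        self.op = None
        self.vj = self.vk = None
        self.qj = self.qk = None


class Tomasulo:
    def __init__(self, regs, station_names):
        self.regs = dict(regs)  # architectural register values
        self.reg_tag = {}       # register -> station that will produce its value
        self.stations = {n: RS(n) for n in station_names}

    def _read(self, reg):
        """Return (value, None) if the register is ready, else (None, producer tag)."""
        tag = self.reg_tag.get(reg)
        return (None, tag) if tag else (self.regs[reg], None)

    def issue(self, op, dest, src1, src2):
        rs = next((s for s in self.stations.values() if not s.busy), None)
        if rs is None:
            return None  # structural hazard: no free station, so stall
        rs.busy, rs.op = True, op
        rs.vj, rs.qj = self._read(src1)
        rs.vk, rs.qk = self._read(src2)
        self.reg_tag[dest] = rs.name  # rename: dest now refers to this station
        return rs.name

    def ready(self, name):
        rs = self.stations[name]
        return rs.busy and rs.qj is None and rs.qk is None

    def write_result(self, name, value):
        """Broadcast (tag, value) on the CDB to every waiting station and register."""
        for rs in self.stations.values():
            if rs.qj == name:
                rs.vj, rs.qj = value, None
            if rs.qk == name:
                rs.vk, rs.qk = value, None
        for reg, tag in list(self.reg_tag.items()):
            if tag == name:
                self.regs[reg] = value
                del self.reg_tag[reg]
        self.stations[name].busy = False


t = Tomasulo({"F0": 0.0, "F2": 6.0, "F4": 2.0, "F8": 1.0,
              "F10": 0.0, "F12": 0.0, "F14": 0.5}, ["RS1", "RS2", "RS3"])
div = t.issue("div", "F0", "F2", "F4")
add = t.issue("add", "F10", "F0", "F8")    # F0 is pending, so qj holds the tag of div
sub = t.issue("sub", "F12", "F8", "F14")
before = [t.ready(n) for n in (div, add, sub)]  # [True, False, True]
t.write_result(div, 6.0 / 2.0)                  # divider result goes out on the CDB
after = t.ready(add)                            # True
print(before, after)
```

Because operands are copied into the station at issue, a later write to the same register cannot disturb them, which is how the scheme sidesteps WAR and WAW stalls.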
Dynamic Sched.2510/14
How Tomasulo overlaps loop iterations
• Register renaming
– Multiple iterations use different physical destinations for registers (dynamic loop unrolling)
• Reservation stations
– Instructions advance past integer control flow operations
– Buffer old values of registers, avoiding the WAR stall seen in the scoreboard
Dynamic Sched.2610/14
Tomasulo’s scheme offers 2 major advantages
(1) the distribution of the hazard detection logic
– distributed reservation stations and the CDB
– If multiple instructions waiting on single result, & each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB
– If a centralized register file were used, the units would have to read their results from the registers when register buses are available.
(2) the elimination of stalls for WAW and WAR hazards
Dynamic Sched.2710/14
Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)
Tomasulo (IBM 360/91)            Scoreboard (CDC 6600)
Pipelined functional units       Multiple functional units
(6 load, 3 store, 3 +, 2 x/÷)    (1 load/store, 1 +, 2 x, 1 ÷)
Window size: ≤ 14 instructions   ≤ 5 instructions
No issue on structural hazard    Same
WAR: renaming avoids             Stall completion
WAW: renaming avoids             Stall issue
Broadcast results from FU        Write/read registers
Control: reservation stations    Central scoreboard
Dynamic Sched.2810/14
Modern Superscalar
Dynamic Sched.2910/14
PPC604e
Dynamic Sched.3010/14
Flow paths in Superscalar
Dynamic Sched.3110/14
PPC604 Pipeline
Dynamic Sched.3210/14
Reorder Buffer Keeps Program order
Instruction slot is a candidate for execution when:
• It holds a valid instruction (“use” bit is set)
• It has not already started execution (“exec” bit is clear)
• Both operands are available (p1 and p2 are set)
[Figure: reorder buffer with entries t1 ... tn. ptr1 points to the next available entry and ptr2 to the next entry to deallocate. Entry fields: Ins#, use, exec, op, p1, src1, p2, src2.]
Dynamic Sched.3310/14
Explicit Register Renaming
• Virtual expansion of ISA registers
• Physical register file >> number of ISA registers
• Implemented by maintaining a translation (rename) table:
– ISA register => physical register mapping
– When a register is written, replace its table entry with a new register from the freelist
– A physical register is free when it is not being used by any instruction in progress
[Figure: Fetch, Decode/Rename, Execute pipeline with the Rename Table]
• Eliminates WAW and WAR hazards:
MUL R2,R2,R3 ; R2 = R2 * R3
ADD R4,R2,1 ; R4 = R2 + 1
ADD R2,R3,1 ; R2 = R3 + 1
DIV R5,R2,R4 ; R5 = R2 / R4
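The rename table and freelist can be sketched as follows, applied to the four-instruction example from the slide. This is a minimal illustration: the class name `Renamer`, the `P<n>` register names, and the choice of 8 physical registers are assumptions, and returning registers to the freelist is left out.

```python
from collections import deque


class Renamer:
    def __init__(self, arch_regs, num_physical):
        # Initial mapping: each ISA register gets one physical register.
        self.table = {r: f"P{i}" for i, r in enumerate(arch_regs)}
        # Remaining physical registers start out on the freelist.
        self.free = deque(f"P{i}" for i in range(len(arch_regs), num_physical))

    def rename(self, dest, srcs):
        # Sources are translated first, using the current mapping;
        # anything not in the table (an immediate) passes through unchanged.
        phys_srcs = [self.table.get(s, s) for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: stall on issue")
        new = self.free.popleft()
        self.table[dest] = new  # later readers of dest see the new register
        return new, phys_srcs


r = Renamer(["R2", "R3", "R4", "R5"], num_physical=8)
program = [
    ("MUL", "R2", ("R2", "R3")),
    ("ADD", "R4", ("R2", 1)),
    ("ADD", "R2", ("R3", 1)),
    ("DIV", "R5", ("R2", "R4")),
]
renamed = [(op, *r.rename(dest, srcs)) for op, dest, srcs in program]
for entry in renamed:
    print(entry)
# ('MUL', 'P4', ['P0', 'P1'])
# ('ADD', 'P5', ['P4', 1])
# ('ADD', 'P6', ['P1', 1])
# ('DIV', 'P7', ['P6', 'P5'])
```

After renaming, the two writes to R2 target different physical registers (P4 and P6), so the WAW hazard is gone, and the first ADD still reads P4 regardless of when the second ADD writes P6, so the WAR hazard is gone too. Only the true RAW dependences remain.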
Dynamic Sched.3410/14
Register Renaming
• Decode does register renaming and adds instructions to the reorder buffer (ROB)
• Instructions in the ROB with no RAW hazards can be dispatched
– Out-of-order or dataflow execution
[Figure: pipeline with IF, ID, and Issue stages feeding ALU, Mem, Fadd, and Fmul units, followed by WB]
Dynamic Sched.3510/14
Register Renaming Support :
• Rapid access to translation table
• Physical register file with more registers than ISA
• Must figure out which physical registers are free
– No free registers => stall on issue
• Register renaming doesn’t require reservation stations. However, some architectures use both register renaming and reservation stations
Dynamic Sched.3610/14
Advantages of Explicit Renaming
• Decouples renaming from scheduling:
• Allows data to be fetched from single register file
• Supports precise interrupt points:
– All that needs to be “undone” for a precise break point is the table mappings
– Provides an interesting mix between reorder buffer and future file
» Results are written immediately back to the register file
» Register names are “freed” in program order (by ROB)
Dynamic Sched.3710/14
Explicit register renaming:
[Figure: Current Map Table (P0, P2, P4, F6, F8, P10, P12, P14, P16, P18, P20, P22, P24, P26, P28, P30), Freelist (P32, P34, P36, P38, P60, P62), and a Done? column for in-flight instructions ordered from Oldest to Newest]
• Physical register file larger than ISA register file
• On issue, each instruction that modifies a register is allocated new physical register from freelist
• Used everywhere: R10000, HP PA8000, Power, Intel
Dynamic Sched.3810/14
RAW Hazards in Memory: Memory Disambiguation
• Question: given a load that follows a store in program order, are the two related?
• I.e., is there a RAW hazard between the store and the load?
E.g.: st 0(R2),R5
ld R6,0(R3)
• Can we go ahead and start the load early?
• Store address could be delayed for a long time by the calculation that leads to R2
• We might want to issue/begin execution of both operations in the same cycle
• Answer: not allowed to start the load until we know that address 0(R2) ≠ 0(R3)
Dynamic Sched.3910/14
Hardware Support for Memory Disambiguation
• Buffer tracks outstanding stores to memory, in program order
• Keep track of address (when available) and value (when available)
• FIFO ordering: retire stores in program order
• When issuing a load, record current head of store queue (know which stores are ahead of you)
• When have address for load, check store queue:
– If any store prior to load is waiting for its address, stall load
– If load address matches earlier store address (associative lookup), then we have a memory-induced RAW hazard:
» store value available => return value
» store value not available => return ROB number of source
– Otherwise, send out request to memory
• Actual stores commit in order
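The store-queue check described above can be sketched as one function. This is an illustration only: the function name `resolve_load`, the result labels, and the dictionary fields `addr`, `value`, and `rob` are hypothetical, and a real design would compare addresses associatively in hardware rather than loop.

```python
def resolve_load(load_addr, earlier_stores):
    """Decide what a load does once its address is known.

    earlier_stores: stores ahead of the load, oldest first. Each is a dict with
    'addr' (None until computed), 'value' (None until available), and 'rob'
    (reorder buffer number of the store).
    """
    # Any earlier store with an unknown address might alias the load: stall.
    if any(s["addr"] is None for s in earlier_stores):
        return ("stall", None)
    # Search youngest to oldest so the most recent matching store wins.
    for s in reversed(earlier_stores):
        if s["addr"] == load_addr:
            if s["value"] is not None:
                return ("forward", s["value"])  # store value available
            return ("wait_on", s["rob"])        # value not ready: wait on its ROB number
    return ("memory", load_addr)                # no match: send request to memory


stores = [
    {"addr": 0x100, "value": 7,    "rob": 3},
    {"addr": 0x200, "value": None, "rob": 4},
]
print(resolve_load(0x100, stores))  # ('forward', 7)
print(resolve_load(0x200, stores))  # ('wait_on', 4)
print(resolve_load(0x300, stores))  # ('memory', 768)
```

The stall rule is the conservative answer from the previous slide: the load may not proceed until every earlier store address is known to differ or to match.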
Dynamic Sched.4010/14
What about Precise Interrupts? E.g., a page fault?
• State should look as if no instruction beyond the faulting instruction has issued
• Tomasulo had:
In-order issue, out-of-order execution, and out-of-order completion
• Need to “fix” the out-of-order completion
• Need to find a precise breakpoint in the instruction stream
Dynamic Sched.4110/14
Relationship between precise interrupts and speculation:
• Speculation: guess and check
• Important for branch prediction:
– take the “best shot” at predicting the branch direction
• If speculation is wrong, back up and restart execution at the point where the prediction was incorrect
• Technique for both precise interrupts/exceptions and speculation: in-order completion or commit
Dynamic Sched.4210/14
HW support for precise interrupts
• Need HW buffer for results of uncommitted instructions: reorder buffer
– 3 fields: instr, destination, value
– Use reorder buffer number instead of reservation station when execution completes
– Supplies operands between execution complete & commit
– (Reorder buffer can be operand source => more registers like RS)
– Instructions commit
– Once instruction commits, result is put into register
– As a result, easy to undo speculated instructions on mispredicted branches or exceptions
[Figure: FP Op Queue, FP Regs, Reorder Buffer, Res Stations, and FP Adders]
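The reorder buffer behavior described on this slide can be sketched as a FIFO that only releases results from its head. This is a minimal model under stated assumptions: the class name `ReorderBuffer` and its method names are hypothetical, entries carry the three fields named on the slide plus a done flag, and operand forwarding from the buffer is not modeled.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction at the left
        self.regs = {}          # committed (architectural) register state

    def allocate(self, instr, dest):
        """Reserve an entry at issue time, in program order."""
        entry = {"instr": instr, "dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Execution finished (possibly out of order): hold the result in the buffer."""
        entry["value"], entry["done"] = value, True

    def commit(self):
        """Retire finished instructions from the head only, so commit is in order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            self.regs[e["dest"]] = e["value"]  # result reaches the register at commit
            committed.append(e["instr"])
        return committed

    def flush(self):
        """Mispredicted branch or exception: drop all uncommitted results."""
        self.entries.clear()


rob = ReorderBuffer()
divd = rob.allocate("DIVD", "F0")
addd = rob.allocate("ADDD", "F10")
subd = rob.allocate("SUBD", "F12")

rob.complete(subd, 0.5)  # SUBD finishes first, out of order
first = rob.commit()     # [] because DIVD at the head is not done
rob.complete(divd, 3.0)
second = rob.commit()    # ['DIVD']
rob.complete(addd, 4.0)
third = rob.commit()     # ['ADDD', 'SUBD']
print(first, second, third)
```

Since registers change only at commit, `flush` is enough to undo speculated work: nothing younger than the faulting or mispredicted instruction has touched architectural state.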
Dynamic Sched.4310/14
Hardware complexities with reorder buffer (ROB)
• How to find the latest version of a register?
– need an associative compare network
– register result status buffer tracks which reorder buffer entry has received the value
• Need as many ports on ROB as register file
[Figure: FP Op Queue, FP Regs, Reorder Buffer with compare network, Res Stations, and FP Adders. Reorder Table fields: Dest Reg, Result, Exceptions?, Valid, Program Counter.]
Dynamic Sched.4410/14
Superscalar v. VLIW
Superscalar:
• Smaller code size
• Binary compatibility across generations of hardware
• HW: better branch prediction
VLIW:
• SW speculation => simplified hardware for decoding and issuing instructions
• No interlock hardware (compiler checks?)
• More registers, but simplified hardware for register ports (multiple independent register files?)
Dynamic Sched.4510/14
Summary
• Dataflow view:
– Data triggers execution rather than instructions triggering data
• Dynamic hardware schemes can unroll loops dynamically in hardware
• Explicit register renaming: more physical registers than needed by ISA
– Rename table: tracks current association between architectural registers and physical registers
– Translation table performs the mapping on the fly
• Precise interrupts:
– Must commit results in order
– Reorder buffer: temporarily holds results until commit is possible
• Lasting contributions in today’s processors:
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
• ILP limits: to make performance progress in the future, need explicit parallelism from the programmer
• Multithreading, multi-core architectures: explicitly parallel algorithms