Topic 3
Transcript of Topic 3
9/25/2006 eleg652-F06 1
Topic 3
Exploitation of Instruction Level Parallelism
"The secret to creativity is knowing how to hide your sources." - Albert Einstein
Reading List
• Slides: Topic3x
• Henn&Patt: Chapters 3 & 4
• Other assigned readings from homework and classes
Instruction Level Parallelism
• Parallelism that is found between instructions (or intra-instruction)
• Dynamic and Static Exploitation
  – Dynamic: hardware related
  – Static: software related (compiler and system software)
• VLIW and Superscalar
• Micro-Dataflow and Tomasulo's Algorithm
RISC Concepts: Revisit
• Reduced Instruction Set Architecture
  – "Internal computing architecture in which processor instructions are pared down so that most of them can be executed in one clock cycle, theoretically improving computing efficiency" (Black Box Pocket Glossary of Computer Terms)
• Characteristics:
  – Uniform instruction encoding
  – Homogeneous register banks
  – Simplified addressing modes
  – Simplified data structures
  – Branch delay slot
  – Cache
  – Pipeline
RISC Concepts: Revisited
• What prevents one instruction per cycle (CPI = 1)?
  – Hazards
  – Dependencies
  – Long-latency ops
• Cache thrashing
Pipeline: A Review
• Hazards
  – Any situation that prevents the smooth flow of instructions along the pipeline
  – Types
    • Structural: due to limited resources and contention among them
    • Control: instructions that change the PC (program counter)
    • Data: an instruction depends on values from a previous instruction
  – Stall
    • Hazards will "stall" the pipeline
    • Serious: a stall can hold up many instructions for many cycles
RISC Pipeline & Instruction Issue
• Instruction Issue
  – The process of letting an instruction move from ID to EXEC
  – Issue vs. execution
• In DLX
  – ID checks all data hazards; stall if any exist

Typical RISC pipeline: Instruction Fetch → Instruction Decode → Execute → Memory Op → Register Update
Hazards
• Structural Hazards
  – Non-pipelined functional units
  – One-port register bank and one-port memory bank
• Data Hazards
  – For some: forwarding
  – For others: pipeline interlock

LD  R1, A
ADD R4, R1, R7   ; needs a bubble / stall
Instruction        Clock cycle:  1   2   3   4   5   6   7   8   9
Load instruction                 IF  ID  EX  MEM WB
Instruction i+1                      IF  ID  EX  MEM WB
Instruction i+2                          IF  ID  EX  MEM WB
Instruction i+3                              IF  ID  EX  MEM WB
Instruction i+4                                  IF  ID  EX  MEM

Structural Hazard: a single memory bank for instructions and data (the load's MEM access conflicts with a later instruction's IF).
Data Hazards
Instruction   1   2   3   4   5   6
ADD           IF  ID  EX  MEM WB
SUB               IF  ID  EX  MEM WB

Data is written here (ADD's WB); data is read here (SUB's ID).

The ADD instruction writes a register that is a source operand for the SUB instruction, but the ADD doesn't finish writing the data into the register file until three clock cycles after SUB begins reading it! The SUB instruction may read an incorrect value; the result may be non-deterministic. Solved by forwarding.
Data Dependency: A Review
Flow Dependency (RAW conflicts):
  A ← B + C
  E ← A + D

Anti-Dependency (WAR conflicts):
  B ← A + C
  A ← E + D

Output Dependency (WAW conflicts):
  A ← B + C
  A ← E + D

RAR is not really a problem.
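The three dependence classes above can be detected mechanically. Below is a minimal sketch (not from the slides) that classifies the dependence between two instructions, each represented as a hypothetical `(destination, {sources})` pair, with the first instruction preceding the second in program order:

```python
# Sketch: classify the dependences between two instructions a and b,
# where a precedes b in program order. Each instruction is a made-up
# (dest, {sources}) pair, not a real ISA encoding.
def classify(a, b):
    dest_a, src_a = a
    dest_b, src_b = b
    deps = []
    if dest_a in src_b:
        deps.append("RAW")   # flow dependency: b reads what a writes
    if dest_b in src_a:
        deps.append("WAR")   # anti-dependency: b overwrites what a reads
    if dest_a == dest_b:
        deps.append("WAW")   # output dependency: both write the same register
    return deps

# A <- B + C followed by E <- A + D  ->  flow dependency
print(classify(("A", {"B", "C"}), ("E", {"A", "D"})))  # ['RAW']
```

The three example pairs on this slide map to `['RAW']`, `['WAR']`, and `['WAW']` respectively.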
ADD R1, R2, R3    IF ID EX MEM WB
SUB R4, R1, R5       IF ID EX MEM WB
AND R6, R1, R7          IF ID EX MEM WB
OR  R8, R1, R9             IF ID EX MEM WB
XOR R10, R1, R11              IF ID EX MEM WB

Forwarding Example
Bypassing Pitfalls
The Code:
LW  R1, 32(R6)
ADD R4, R1, R7
SUB R5, R1, R8
AND R6, R1, R7

The Pipeline:
LW:  IF ID EX    MEM WB
ADD:    IF ID    STALL EX  MEM WB
SUB:       IF    STALL ID  EX  MEM WB
AND:       STALL IF    ID  EX  MEM WB

The load delay slot cannot be eliminated by forwarding alone.

Pipeline interlock: a stall / bubble for hazards that cannot be solved by forwarding.
Pipelining
• Issue: pass the Instruction Decode stage
• DLX: only issue an instruction if there is no hazard
• Detecting interlocks early in the pipeline has the advantage that the pipeline never needs to suspend an instruction and undo state changes.
Instruction Level Parallelism
• Static Scheduling
  – Simple scheduling
  – Loop unrolling
  – Loop unrolling + scheduling
  – Software pipelining
• Dynamic Scheduling
  – Out-of-order execution
  – Data flow computers
• Speculation
Constraint Graph
• Directed edges: data dependence
• Undirected edges: resource constraints
• An edge (u, v) (directed or undirected) of length e represents an interlock between nodes u and v: they must be separated by e time units.

[Figure: example constraint graph with nodes S1-S6; the edge labels give the operation latencies.]
Code Scheduling For Single Pipeline
• Input: a constraint graph G = (V, E)
• Output: a sequence of operations in G (v1, v2, v3, ..., vn), plus a number of no-ops, such that:
  – If the no-ops are deleted, the sequence is a topological sort of G
  – Any two nodes x, y in the sequence are separated by a distance greater than or equal to d(x, y) in graph G
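The input/output specification above can be sketched as a greedy scheduler: walk the nodes in topological order and pad with NOPs until every placed predecessor's required separation is satisfied. The graph below is a made-up example (edge lengths are illustrative, not the slides' exact numbers):

```python
# Sketch of single-pipeline code scheduling: emit nodes of a constraint
# graph in topological order, inserting NOPs so every edge (u, v) of
# length e leaves u and v separated by at least e cycles.
def schedule(order, edges):
    """order: nodes in topological order; edges: {(u, v): required separation}."""
    placed = {}            # node -> cycle at which it was issued
    seq, cycle = [], 0
    for v in order:
        # earliest legal cycle given all already-placed predecessors of v
        earliest = max([placed[u] + e for (u, w), e in edges.items()
                        if w == v and u in placed] + [cycle])
        seq.extend(["NOP"] * (earliest - cycle))  # pad with no-ops
        seq.append(v)
        placed[v] = earliest
        cycle = earliest + 1
    return seq

edges = {("LD", "ADDD"): 2, ("ADDD", "SD"): 3}  # illustrative separations
print(schedule(["LD", "ADDD", "SD"], edges))
# ['LD', 'NOP', 'ADDD', 'NOP', 'NOP', 'SD']
```

Deleting the NOPs leaves the topological order intact, matching the output condition stated above.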
Advanced Pipelining
• Instruction reordering and scheduling within the loop body
• Loop unrolling
  – Code size suffers
• Superscalar
  – Compact code
  – Multiple issue of different instruction types
• VLIW
VLIW
• Very Long Instruction Word
• The compiler has all the responsibility for scheduling instructions
• Makes the hardware simpler
  – Moves complexity to software
• Concept developed by John Fisher at Yale University in the early 1980s
An Example: X[i] + a

Loop: LD   F0, 0(R1)    ; load the vector element
      ADDD F4, F0, F2   ; add the scalar in F2
      SD   0(R1), F4    ; store the vector element
      SUB  R1, R1, #8   ; decrement the pointer by 8 bytes (per DW)
      BNEZ R1, Loop     ; branch when it's not zero

Instruction producing result   Instruction using result   Latency
FP ALU op                      FP ALU op                   3
FP ALU op                      Store double                2
Load double                    FP ALU op                   1
Load double                    Store double                0

A load can bypass the store. Assume that the latency for integer ops is zero and the latency for an integer load is 1.
An Example: X[i] + a

Loop: LD   F0, 0(R1)     1
      STALL              2   <- load latency
      ADDD F4, F0, F2    3
      STALL              4   <- FP ALU latency
      STALL              5
      SD   0(R1), F4     6
      SUB  R1, R1, #8    7
      BNEZ R1, Loop      8
      STALL              9   <- load latency of the next iteration

This requires 9 cycles per iteration.

Constraint graph latencies: LD 1, ADDD 2, SD 0, SUB 0, BNEZ 1
An Example: X[i] + a (Scheduling)

Loop: LD   F0, 0(R1)     1
      STALL              2
      ADDD F4, F0, F2    3
      SUB  R1, R1, #8    4
      BNEZ R1, Loop      5
      SD   8(R1), F4     6   ; offset adjusted for the early SUB

This requires 6 cycles per iteration.

Constraint graph latencies: LD 1, ADDD 2, SD 0, SUB 0, BNEZ 1
An Example: X[i] + a (Unrolling)

Loop: LD   F0, 0(R1)      1
      NOP                 2
      ADDD F4, F0, F2     3
      NOP                 4
      NOP                 5
      SD   0(R1), F4      6
      LD   F6, -8(R1)     7
      NOP                 8
      ADDD F8, F6, F2     9
      NOP                 10
      NOP                 11
      SD   -8(R1), F8     12
      LD   F10, -16(R1)   13
      NOP                 14
      ADDD F12, F10, F2   15
      NOP                 16
      NOP                 17
      SD   -16(R1), F12   18
      LD   F14, -24(R1)   19
      NOP                 20
      ADDD F16, F14, F2   21
      NOP                 22
      NOP                 23
      SD   -24(R1), F16   24
      SUB  R1, R1, #32    25
      BNEZ R1, LOOP       26
      NOP                 27

This requires 6.8 cycles per iteration.
An Example: X[i] + a (Unrolling + Scheduling)

Loop: LD   F0, 0(R1)      1
      LD   F6, -8(R1)     2
      LD   F10, -16(R1)   3
      LD   F14, -24(R1)   4
      ADDD F4, F0, F2     5
      ADDD F8, F6, F2     6
      ADDD F12, F10, F2   7
      ADDD F16, F14, F2   8
      SD   0(R1), F4      9
      SD   -8(R1), F8     10
      SD   -16(R1), F12   11
      SUB  R1, R1, #32    12
      BNEZ R1, LOOP       13
      SD   8(R1), F16     14

This requires 3.5 cycles per iteration.
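The cycles-per-iteration figures quoted in this sequence of examples all come from the same arithmetic: schedule length divided by the number of original iterations it covers. A quick check:

```python
# Verifying the slides' cycles-per-iteration numbers: total schedule
# length divided by the number of original iterations covered.
def cycles_per_iter(total_cycles, iterations):
    return total_cycles / iterations

print(cycles_per_iter(9, 1))    # unscheduled loop: 9.0
print(cycles_per_iter(6, 1))    # scheduled loop: 6.0
print(cycles_per_iter(27, 4))   # unrolled 4x: 6.75 (the slide rounds to 6.8)
print(cycles_per_iter(14, 4))   # unrolled 4x + scheduled: 3.5
```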
Topic 3a
Multi Issue Architectures
Beyond Simple RISC
ILP
• ILP of a program
  – The average number of instructions that a superscalar processor might be able to execute at the same time
    • Data dependencies
    • Latencies and other processor difficulties
• ILP of a machine
  – The ability of a processor to take advantage of the ILP
    • The number of instructions that can be fetched and executed at the same time by such a processor
Multi Issue Architectures
• Superscalar
  – Machines that issue multiple independent instructions per clock cycle when they are properly scheduled by the compiler and the runtime scheduler
• Very Long Instruction Word
  – A machine where the compiler has complete responsibility for creating a package of instructions that can be simultaneously issued, and the hardware does not dynamically make any decisions about multiple issue

Patterson & Hennessy, pp. 317-318
Multiple Instruction Issue
• Multiple Issue + Static Scheduling → VLIW
• Dynamic Scheduling
  – Tomasulo
  – Scoreboarding
• Multiple Issue + Dynamic Scheduling → Superscalar
• Decoupled Architectures
  – Static scheduling of register-register instructions
  – Dynamic scheduling of memory ops
    • Buffers
Five Primary Approaches
Common Name               | Issue Structure | Hazard Detection | Scheduling               | Distinguishing Characteristics           | Examples
Superscalar (static)      | Dynamic         | Hardware         | Static                   | In-order execution                       | Sun UltraSPARC II and III
Superscalar (dynamic)     | Dynamic         | Hardware         | Dynamic                  | Some out-of-order execution              | IBM Power2
Superscalar (speculative) | Dynamic         | Hardware         | Dynamic with speculation | Speculative out-of-order execution       | Pentium III and 4
VLIW / LIW                | Static          | Software         | Static                   | No hazards between issue packets         | Trimedia, i860
EPIC                      | Mostly static   | Mostly software  | Mostly static            | Explicit dependences marked by compiler  | Itanium
Two-Issue Architecture, Unrolled and Scheduled Code

      Integer instruction   FP instruction       Clock cycle
Loop: LD  F0, 0(R1)                               1
      LD  F6, -8(R1)                              2
      LD  F10, -16(R1)      ADDD F4, F0, F2       3
      LD  F14, -24(R1)      ADDD F8, F6, F2       4
      LD  F18, -32(R1)      ADDD F12, F10, F2     5
      SD  0(R1), F4         ADDD F16, F14, F2     6
      SD  -8(R1), F8        ADDD F20, F18, F2     7
      SD  -16(R1), F12                            8
      SD  -24(R1), F16                            9
      SUB R1, R1, #40                             10
      BNEZ R1, LOOP                               11
      SD  8(R1), F20                              12

The unrolled and scheduled code takes 2.4 cycles per iteration (5 iterations in 12 cycles).
A VLIW Code Sequence (unrolled 7 times)

Memory reference 1  | Memory reference 2 | FP operation 1    | FP operation 2    | Integer op / branch
LD F0, 0(R1)        | LD F6, -8(R1)      |                   |                   |
LD F10, -16(R1)     | LD F14, -24(R1)    |                   |                   |
LD F18, -32(R1)     | LD F22, -40(R1)    | ADDD F4, F0, F2   | ADDD F8, F6, F2   |
LD F26, -48(R1)     |                    | ADDD F12, F10, F2 | ADDD F16, F14, F2 |
                    |                    | ADDD F20, F18, F2 | ADDD F24, F22, F2 |
SD 0(R1), F4        | SD -8(R1), F8      | ADDD F28, F26, F2 |                   |
SD -16(R1), F12     | SD -24(R1), F16    |                   |                   |
SD -32(R1), F20     | SD -40(R1), F24    |                   |                   | SUB R1, R1, #48
SD -0(R1), F28      |                    |                   |                   | BNEZ R1, LOOP

[Figure: dataflow view of the seven unrolled LD → (+a) → SD chains, using F0/F4, F6/F8, F10/F12, F14/F16, F18/F20, F22/F24, and F26/F28.]

7 iterations in 9 cycles → 1.28 cycles per iteration
Trace Scheduling
• First used for VLIW architectures
• Trace
  – A straight-line sequence of instructions executed for some input data, or a sequence of ops which constitute a possible path based on "predicted" branches
• Trace Scheduling
  – Identify a "most probable" sequence of instructions and then "compact" the instructions in that path
• Tools
  – For loops: unrolling
  – For branches: static branch prediction
An Example: Traces

A; B; C;
if (D) { E; F; }
else   { G; }
H; I;

Basic Block: an instruction sequence that has only one entry point and one exit point (no branch targets or branches in the middle).

[CFG: block ABC → br D → block EF (Trace 1) or block G (Trace 2) → block HI]
Code Motion & Compensation Code

[Figure: three control-flow graphs for the same example.
- Original code: block ABC → br D → block EF or block G → block HI.
- Code moved to the succeeding block: block AB → br D → block CE or block CG (C duplicated as compensation code) → block FHI.
- Code moved to the preceding block: block ABCE → br D → block FH on the likely path, or "undo E", G, H on the other path → block I.]
Trace Scheduling
• Similar to basic block scheduling
  – But the unit is traces, not basic blocks
• Reduce the execution time of likely traces
  – Using profiling
Software Pipeline
• Reorganizing loops such that each iteration is composed of instruction sequences chosen from different iterations
• Uses less code space
  – Compared to unrolling
• Some architectures have specific software support
  – Rotating register banks
  – Predicated instructions
Software Pipelining
• Overlap instructions without unrolling the loop
• Given the vector M in memory, and ignoring the start-up and finishing code, we have:

Loop: SD   0(R1), F4    ; stores into M[i]
      ADDD F4, F0, F2   ; adds to M[i+1]
      LD   F0, -8(R1)   ; loads M[i+2]
      BNEZ R1, LOOP
      SUB  R1, R1, #8   ; subtract, in the delay slot

This loop can be run at a rate of 5 cycles per result, ignoring the start-up and clean-up portions.
Software Pipelining
Time  Iter1  Iter2  Iter3  Iter4  Iter5  Iter6  Iter7
  1   LD
  2          LD
  3   ADDD          LD
  4          ADDD          LD
  5                 ADDD          LD
  6   SD                   ADDD          LD
  7   BNEZ   SD                   ADDD          LD
  8          BNEZ   SD                   ADDD
  9                 BNEZ   SD                   ADDD
 10                        BNEZ   SD
 11                               BNEZ   SD
Software Pipeline
Overhead for software pipelining: two times the cost, once for the prologue and once for the epilogue.

Overhead for an unrolled loop: M/N times the cost, for M loop executions and N-way unrolling.

[Figure: over time, software-pipelined code ramps up once (prologue), stays at the maximum number of overlapped instructions, and ramps down once (epilogue); the unrolled loop repeatedly ramps the overlap up and down.]
Loop Unrolling V.S. Software Pipelining
• When not running at maximum rate
  – Unrolling: pays the overhead m/n times, for m iterations and n-way unrolling
  – Software pipelining: pays it two times
    • Once at the prologue and once at the epilogue
• Moreover
  – Code compactness
  – Optimal runtime
  – Storage constraints
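The overhead comparison above reduces to simple arithmetic; the sketch below uses illustrative numbers (m = 1000 iterations, n = 4-way unrolling), not figures from the slides:

```python
# Overhead comparison: with m total iterations and n-way unrolling, the
# unrolled loop pays its start-up/wind-down overhead ceil(m/n) times,
# while software pipelining pays it exactly twice (prologue + epilogue).
from math import ceil

def unroll_overhead_count(m, n):
    return ceil(m / n)        # one overhead block per unrolled-loop execution

def swp_overhead_count():
    return 2                  # one prologue + one epilogue, independent of m

m, n = 1000, 4                # illustrative values
print(unroll_overhead_count(m, n))  # 250
print(swp_overhead_count())         # 2
```

For large m, the software-pipelined loop's fixed overhead is negligible, which is the point of the comparison.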
Comparison of Static Methods
Method                   Cycles per iteration
w/o scheduling           9
scheduling               6
unrolling                6.8
unrolling + scheduling   3.5
2-issue                  2.4
4-issue                  1.28
SP, 1-issue              5
SP, 5-issue              1
On a Final Note
Loop unrolling, trace scheduling, and software pipelining all aim at exposing fine-grain parallelism.

"The effectiveness of these techniques and their suitability for various architectural approaches are among the most open research areas in pipelined processor design" - Henn & Patt
Limitations of VLIW
• Limited parallelism in (statically scheduled) code
  – Basic blocks may be too small
  – Global code motion is difficult
• Limited hardware resources
• Code size
• Memory port limitations
• A stall is serious
• Cache is difficult to use (effectively)
  – I-cache misses can multiply the miss rate by a factor of n, where n is the issue width
  – The cache miss penalty is increased because of the length of the instruction word
An Open Question
“...Whether there are large classes of applications that are not suitable for vector machines, but still offer enough parallelism to justify the VLIW approach rather than a simpler one, such as a superscalar machine?”
Henn & Patt 1990
A VLIW Example: TMS320C62x/C67 Block Diagram

[Figure: block diagram]

Source: TMS320C600 Technical Brief. February 1999
A VLIW Example

TMS320C62x/C67 Data Paths

Source: TMS320C600 Technical Brief. February 1999

Assembly example
Introduction to SuperScalar
Topic 3b
Instruction Issue Policy
• It determines the processor's look-ahead policy
  – The ability to examine instructions beyond the current PC
• Look-ahead must ensure correctness at all costs
• Issue policy
  – The protocol used to issue instructions
• Note: issue, execution and completion are distinct events
Issues in Out of Order Execution & Completion
R3 := R3 op R5   (1)
R4 := R3 + 1     (2)
R3 := R5 + 1     (3)
R7 := R3 op R4   (4)

[Dependence graph over (1)-(4): flow, anti, and output dependencies]

(2), (3) cannot be completed out of order; otherwise the anti-dependence may be violated, i.e., R3 in (2) may be incorrectly overwritten by (3) (e.g., when (2) was stalled for some reason).
Issues in Out of Order Execution & Completion
R3 := R3 op R5   (1)
R4 := R3 + 1     (2)
R3 := R5 + 1     (3)
R7 := R3 op R4   (4)

[Dependence graph over (1)-(4): flow, anti, and output dependencies]

(1), (3) cannot be completed out of order! Output dependence has to be checked against all preceding instructions already in the exec pipes before an instruction is issued, to ensure results are written in the correct order. Otherwise R3 in (4) may get a wrong value.
Issues in Out of Order Execution & Completion
R := ...   (1)
... := R   (2)
R := ...   (3)

Note that the anti-dependence between (2) and (3) is handled correctly by stalling (3)'s issue if (1) has not completed.
Achieve High Performance in Multiple Issued Instruction Machines
• Detection and resolution of storage conflicts
  – Extra "shadow" registers
  – A special bit for reservation
• Organization and control of the buses between the various units in the PU
  – Special controllers to detect write-backs and reads
How to Detect Data Dependencies
X1 = X2 + X3
Y1 = Y2 + Y3

How many dependencies are there between these two instructions?

Five possible dependencies per pair: a total of 5 * O(n^2) checks for n instructions.
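The count of five follows directly from the operand positions of two 3-address instructions: two RAW checks, two WAR checks, and one WAW check. A sketch (register names are the slide's placeholders):

```python
# The five comparisons needed to detect dependences between two
# 3-address instructions d = s1 op s2 (first precedes second):
# two RAW checks, two WAR checks, one WAW check.
def pairwise_checks(first, second):
    d1, a1, b1 = first
    d2, a2, b2 = second
    return [
        ("RAW", d1 == a2), ("RAW", d1 == b2),   # does the second read the first's result?
        ("WAR", d2 == a1), ("WAR", d2 == b1),   # does the second overwrite a source?
        ("WAW", d2 == d1),                      # do both write the same register?
    ]

checks = pairwise_checks(("X1", "X2", "X3"), ("Y1", "Y2", "Y3"))
print(len(checks))                    # 5 comparisons per ordered pair
print(any(hit for _, hit in checks))  # False: these two are independent
```

Repeating this for every ordered pair gives the 5 * O(n^2) total quoted above.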
A Super Scalar Architecture
Pipeline: Inst Fetch → Inst Decode → Issue Window (Wake Up / Select) → Register File → Exec → Write Back

New: the Issue Window holds the instructions that are ready and the ones that are waiting for dependencies.
Data Dependencies & SuperScalar
• Hardware mechanisms (dynamic scheduling)
  – Scoreboarding
    • Limited out-of-order issue/completion
    • Centralized control
  – Renaming with a reorder buffer is another attractive approach (based on Tomasulo's algorithm)
    • Micro dataflow
• Advantage: exact runtime information
  – Load / cache miss
  – Resolves storage-location-related dependences
Scoreboarding

• Named after the CDC 6600's scoreboard
• Effective when there are enough resources and no data dependencies
• Out-of-order execution
• Issue: checks the scoreboard; a WAW hazard will cause a stall
• Read operands: checks the availability of operands and resolves RAW dynamically at this step
  – WAR will not cause a stall here
• EX
• Write result: WAR is checked and will cause a stall
[Figure: the basic structure of a DLX processor with a scoreboard. The scoreboard controls the registers, the integer unit, the FP adder, the FP divider, and two FP multipliers over data buses and control/status lines.]
Scoreboarding
[CDC6600, Thornton70], [Weiss&Smith84]
• A bit (the "scoreboard bit") is associated with each register; bit = 1 means the register is reserved by a pending write
• An instruction with a source operand whose bit = 1 will still be issued, but is put into an instruction window, with the register identifier denoting the "to-be-written" operand
• Copies of valid operands are also read along with the pending instruction (solving anti-dependence)
• When the missing operand is finally written, the register id in the pending instruction is matched and the value filled in, so the instruction can be issued
• An instruction whose result register R is already reserved will stall, so output dependence (WAW) is correctly handled by the stall!
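The WAW-by-stall behavior described above can be modeled with one bit per register. This is a toy sketch, not the CDC 6600's actual control logic; the register names and the class interface are made up:

```python
# Toy model of the scoreboard bit: each register carries one bit
# (1 = reserved by a pending write). A second writer stalls at issue.
class Scoreboard:
    def __init__(self, regs):
        self.bit = {r: 0 for r in regs}

    def try_issue(self, dest):
        if self.bit[dest]:       # WAW: the result register is already reserved
            return False         # -> stall at issue
        self.bit[dest] = 1       # reserve the destination register
        return True

    def write_result(self, dest):
        self.bit[dest] = 0       # the write completes, register released

sb = Scoreboard(["F0", "F8", "F10"])
print(sb.try_issue("F0"))   # True: first writer of F0 reserves it
print(sb.try_issue("F0"))   # False: a second writer of F0 stalls (WAW)
sb.write_result("F0")
print(sb.try_issue("F0"))   # True again after the first write completes
```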
Example
(1) DIVF F0, F2, F4
(2) ADDF F10, F0, F8
(3) SUBF F8, F8, F14

(3) is allowed to be issued and executed while (2) is waiting for some operand.

(3) cannot write its result to F8 (it stalls!) if (2) has not yet read F8: a stall, since there is no renaming of the registers.
Motivation of Dynamic Scheduling
Q1
– How to issue S2 without waiting for S1 to complete? (scoreboard)
– How to issue S3 without waiting for S2 to complete?
– How to issue S3 without waiting for S1 to complete?

S1: X = ...
S2: ... = X
S3: X = ...
Features of Scoreboarding

• It permits out-of-order "issue" of instructions that are unrelated to each other.
• It permits out-of-order "completion" of instructions that are unrelated to each other.
• It prevents "execution" of an instruction if a flow dependence would be violated.
• It prevents "issue" of an instruction if an output dependence would be violated.
• It prevents "completion" of an instruction if an anti-dependence would be violated.
Advantage
Scoreboarding

Advantages:
• A single bit per register: simple
• Only one pending write per register, so there is no need to identify which write is the latest
Micro Data Flow
• Fundamental Concepts
  – "Data Flow"
    • Instructions can only be fired when their operands are available
  – Single assignment and register renaming
• Implementation
  – Tomasulo's Algorithm
  – Reorder Buffer
Renaming/Single Assignment
Before renaming:           After renaming:
R0 = R2 / R4    (1)        R0 = R2 / R4    (1)
R6 = R0 + R8    (2)        S  = R0 + R8    (2)
R1[0] = R6      (3)        R1[0] = S       (3)
R8 = R10 - R14  (4)        T  = R10 - R14  (4)
R6 = R10 * R8   (5)        R6 = R10 * T    (5)

Renaming removes the anti-dependence of (4) on (2) and the output dependence of (5) on (2); only the flow dependences remain.
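A renaming pass like the one above can be sketched in a few lines. This variant assigns a fresh name to any write that would create a WAR or WAW hazard; exactly which writes get fresh names differs from the slide's hand-renamed version, but the anti- and output dependences disappear either way. The fresh-name pool and the tuple encoding are made up:

```python
# Sketch of a renaming pass over 3-address code given as (dest, [srcs]):
# reads use the newest name for each architectural register, and a write
# that would clash with an earlier read or write gets a fresh name.
def rename(code):
    latest = {}                         # architectural reg -> current name
    seen = set()                        # names already read or written
    fresh = iter(["S", "T", "U", "V"])  # made-up pool of hardware names
    out = []
    for dest, srcs in code:
        srcs = [latest.get(s, s) for s in srcs]       # reads use newest names
        seen.update(srcs)
        new = next(fresh) if dest in seen else dest   # avoid WAR/WAW clashes
        seen.add(new)
        latest[dest] = new
        out.append((new, srcs))
    return out

prog = [("R0", ["R2", "R4"]),    # (1) R0 = R2 / R4
        ("R6", ["R0", "R8"]),    # (2) R6 = R0 + R8
        ("R8", ["R10", "R14"]),  # (4) R8 = R10 - R14
        ("R6", ["R10", "R8"])]   # (5) R6 = R10 * R8
for inst in rename(prog):
    print(inst)
```

In the output, (4) no longer writes the R8 that (2) reads, and (5) no longer rewrites (2)'s destination, while (5) still reads (4)'s renamed result, preserving the flow dependence.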
Principles of Register Renaming
• Additional registers reestablish a one-to-one correspondence between values and registers
• Extra registers are scheduled by hardware and associated with values
• A new value gets a new (hardware) register
• Anti- and output dependencies are avoided
• Registers are reused according to program needs
Baseline Superscalar Model
Pipeline: Inst Fetch → Inst Decode → Renaming → Issue Window (Wake Up / Select) → Register File → Execute / Data Cache access (with bypass) → Register Write & Instruction Commit
Micro Data Flow: Conceptual Model

A → R1
R1 * B → R2
R2 / C → R1
R4 + R1 → R4

[Figure: dataflow graph of the Load, *, /, and + nodes, with the architectural registers R1-R4 mapped onto hardware operand registers OR1, OR3-OR6.]
Register Types
• Two kinds of registers
  – Forwarding registers
    • Program / instruction visible
    • Compiler and programmer scheduled
  – Physical operand registers
    • Not visible
    • Scheduled and assigned by the hardware
Reorder Buffer & Instruction Commit
• Instruction Commit
  – When an instruction is allowed to update memory and/or registers
  – A concept used a lot in speculation
• Instruction Commit vs. Instruction Execution
  – When speculation is used, instruction commit may not happen immediately after instruction execution.
• Reorder Buffer
  – A hardware buffer holding instructions that have completed but not yet committed
  – Execute out-of-order but commit in-order
  – Extends the register set with extra registers
ROB Stages
• Issue
  – Dispatch an instruction from the instruction queue
  – Reserve a ROB entry and a reservation station
• Execute
  – Stall for operands
  – RAW resolved
• Write Result
  – Write back to any reservation stations waiting for the result, and to the ROB
• Commit
  – Normal commit: update registers
  – Store commit: update memory
  – False branch: flush the ROB and restart execution
ROB High Level Overview
[Block diagram: Fetcher → Instruction Queue → Decoder, which feeds both the Reorder Buffer and the Instruction Window; the Instruction Window dispatches to several Functional Units, whose results update the Reorder Buffer and, at commit, the Register File.]
ROB Organization
• Content addressable
  – X = A + B: X is renamed to a ROB register (X') and all references to it are replaced
  – If X' is needed as an operand, then the ROB is searched, and
    • the value (if available) or a tag (if not) is returned if X' exists in the ROB, or
    • the "visible" register bank is read if X' is not found in the ROB
ROB Organization
• If there is more than one ROB[X], then the most recent entry is fetched from the ROB
• When a result is produced:
  – All reservation stations that hold a tag for that result are updated
• When an instruction commits:
  – Update register banks and memory
  – Flush the ROB in case of a false branch
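The "execute out-of-order but commit in-order" discipline can be sketched with a small queue. This is a toy model, not any real processor's ROB; the entry layout and method names are made up:

```python
# Toy reorder buffer: entries are appended in program order, marked done
# in any order, but retired (committed) only from the head.
from collections import deque

class ROB:
    def __init__(self):
        self.buf = deque()               # entries in program order

    def issue(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.buf.append(entry)
        return entry                     # the renamed destination (ROB entry)

    def complete(self, entry, value):
        entry["value"], entry["done"] = value, True   # out-of-order completion

    def commit(self, regfile):
        # retire only from the head, so architectural state updates in order
        while self.buf and self.buf[0]["done"]:
            e = self.buf.popleft()
            regfile[e["dest"]] = e["value"]

regs = {}
rob = ROB()
e1 = rob.issue("R1")
e2 = rob.issue("R2")
rob.complete(e2, 7)      # the second instruction finishes first...
rob.commit(regs)
print(regs)              # {} : cannot commit past the unfinished head
rob.complete(e1, 6)
rob.commit(regs)
print(regs)              # {'R1': 6, 'R2': 7}
```

A false branch would be handled here by simply clearing `buf` before the bad entries commit, which is why speculation pairs naturally with a ROB.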
An Example

Code (assume R0 = 1 initially):
R1 = R0 + 5   (1)
R2 = R1 + 6   (2)
R1 = R1 + 3   (3)
R4 = R1 + 9   (4)

Reservation-station entries hold (op, op1, op2, dest); R-buffer entries hold (R-name, value, full bit).

(1) and (2) issued: R1 is renamed to buffer entry B5 and R2 to B8 (both not yet full). Pending operations: "+ 1 5 → B5" (enabled) and "+ B5 6 → B8" (waiting on B5).

(3) issued in flight: R1 is renamed again, to B11. Pending: "+ 1 5 → B5"; "+ B5 3 → B11"; "+ B5 6 → B8".
An Example (continued)

R1 = R0 + 5   (1)
R2 = R1 + 6   (2)
R1 = R1 + 3   (3)
R4 = R1 + 9   (4)

(1) completes: B5 = 6 (full). The waiting operations latch the value: "+ 6 6 → B8" and "+ 6 3 → B11" are now both enabled; B8 (for R2) and B11 (for R1) are still empty.
An Example (continued)

R1 = R0 + 5   (1)
R2 = R1 + 6   (2)
R1 = R1 + 3   (3)
R4 = R1 + 9   (4)

(4) issued: R4 is renamed to B13, with pending operation "+ B11 9 → B13".

Note: (4)'s operand comes directly from B11 (not the value "6"), so the flow dependence on (3) is handled! Also note that these two instructions ((2) and (3)) can be completed out of order, but the "6" already latched is not affected, so the anti-dependence is resolved properly.
Questions
• Memory renaming
  – Not as attractive as register-register dataflow
  – Loads and stores are less frequent
  – Memory locations are less reused (in the register-allocation sense)
  – Memory ops have only one memory operand
• Store buffer
  – Gives loads priority to access the data cache
  – In-order stores
  – Ensures that all prior instructions are performed before a store has completed
Memory Dataflow
• More difficult
  – A memory address is longer
  – A memory address may not be available at the decode stage
• Note:
  – In-order cache state: all stores must be performed in program order
    • All previous operations should have completed
  – No cache reorder / check mechanism
Load / Store Policy
• Loads may bypass stores if there are NO true dependencies among them
  • A check must be performed to ensure correctness
  • A load cannot bypass a store with the same destination; if one is detected, the load is satisfied directly from the store buffer
• Loads are performed in program order at the data cache, with respect to other loads
  • Simplicity
  • Out-of-order doesn't help much anyway
• At the data cache interface
  • No anti: no store can bypass a load
  • No output: stores cannot bypass each other
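The first rule above (load bypassing with store-buffer forwarding) can be sketched as an address check. Addresses and the store-buffer representation here are illustrative, not a real datapath:

```python
# Sketch of the load/store policy: a load may bypass queued stores only
# when no pending store targets the same address; if one does, the load
# is satisfied from the youngest matching store-buffer entry.
def do_load(addr, store_buffer, memory):
    """store_buffer: list of (addr, value) pairs, oldest first."""
    for st_addr, st_val in reversed(store_buffer):   # youngest match wins
        if st_addr == addr:
            return st_val        # forwarded from the store buffer
    return memory[addr]          # safe to bypass all pending stores

mem = {100: 1, 104: 2}
pending = [(104, 99)]            # a store to address 104 not yet in memory
print(do_load(100, pending, mem))   # 1  (bypasses the unrelated store)
print(do_load(104, pending, mem))   # 99 (forwarded, not the stale memory value)
```

Scanning youngest-first matters when the buffer holds two stores to the same address, which is the in-order-stores case on this slide.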
Memory Dependencies & Event Ordering
• If a store target cannot be resolved
  – All subsequent loads are withheld until the address can be resolved
• If two loads are in the instruction window
  – Do they need to wait for each other to be resolved?
Out-Of-Order Architectures
[Block diagram: Fetch → Decode/Rename → INT, FP, and L/S issue queues plus ROB → Mem]

i0: R2 * R3
i1: load @[R1 + R4]
...
i2: load @[R5]

Independent loads can execute in parallel: if i1 and i2 are independent, they can execute at the same time. Loads do NOT need to wait for each other, even when addressed to the same memory location.
Summary
• Reorder Buffer
  – The most powerful scheme among the complex dynamic scheduling techniques
• Simplest: scoreboarding
• Hardware implementation is complex
  – Is it worth its returns?
Tomasulo’s Algorithm
• Tomasulo, R.M. "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of Research and Development 11:1 (Jan. 1967), pp. 25-33
• IBM 360/91 (three years after the CDC 6600 and just before caches)
• Features:
  • CDB: Common Data Bus
  • Reservation units: hardware features which allow the fetch, use and reuse of data as soon as it becomes available. This allows register renaming and is decentralized in nature (as opposed to scoreboarding)
Tomasulo’s Algorithm
• Control and buffers distributed with the functional units
• Hardware renaming of registers
• CDB broadcasting
• Load / store buffers act as functional units
• Reservation stations:
  – Hazard detection and instruction control
  – A 4-bit tag field specifies which station or buffer will produce the result
• Register renaming
  – Tag assigned at issue
  – Tag discarded after write-back
Comparison
• Scoreboarding
  – Centralized data structure and control
  – Register bit
    • Simple, low cost
  – Structural hazards resolved per FU
  – Solves RAW via the register bit
  – Solves WAR at write
  – Solves WAW by stalling at issue
• Tomasulo's Algorithm
  – Distributed control
  – Tagged registers + register renaming
  – Structural hazards stall at the reservation station
  – Solves RAW via the CDB
  – Solves WAR by copying operands to the reservation station
  – Solves WAW by renaming
  – Limitation: the CDB
    • Broadcast
    • 1 per cycle
Reservation Station Fields
• Op: the operation to perform in the unit
• Vj, Vk: values of the source operands
  – Store buffers have a V field holding the result to be stored
• Qj, Qk: the reservation stations producing the source registers
  – Zero means ready
• Busy
• A: memory address calculation
• Register file:
  – Qi: the number of the reservation station that will write this register
The Architecture
[Figure: the basic structure of the floating-point unit with Tomasulo's algorithm. Instructions flow from the instruction unit through the floating-point operation queue to reservation stations in front of the FP adders and FP multipliers; load buffers (from memory) and store buffers (to memory) connect to the same Common Data Bus (CDB), which broadcasts results to the FP registers and all reservation stations.]

- 3 adder reservation stations
- 2 multiplier reservation stations
- Load buffers (6)
- Store buffers (3)
- FP queue
- FP registers
- CDB: Common Data Bus
Tomasulo’s Algorithm’s Steps
• Issue
  – Issue if an empty reservation station is found; fetch operands if they are in registers, otherwise assign a tag
  – If no empty reservation station is found, stall and wait for one to become free
  – Renaming is performed here; WAW and WAR are resolved
• Execute
  – If operands are not ready, monitor the CDB for them
  – RAWs are resolved
  – When the operands are ready, execute the op in the FU
• Write Back
  – Send the result on the CDB and update the registers and the store buffers
  – Store buffers write to memory during this step
• Exception Behavior
  – During execute: no instruction is allowed to be issued until all branches before it have completed
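The issue and write-back steps above can be sketched as a tiny model: reservation stations hold either operand values (Vj/Vk) or producer tags (Qj/Qk), and one CDB broadcast fills every waiting operand. Station names, register values, and the two-instruction scenario are all made up for illustration:

```python
# Stripped-down sketch of Tomasulo issue + CDB write-back.
class RS:
    def __init__(self, tag):
        self.tag, self.busy = tag, False
        self.Vj = self.Vk = self.Qj = self.Qk = None

# reg -> (producer tag or None, value); illustrative contents
regs = {"F0": (None, None), "F2": (None, 10.0),
        "F4": (None, 4.0), "F6": (None, None)}
stations = {"Mult1": RS("Mult1"), "Mult2": RS("Mult2")}

def issue(rs, src1, src2, dest):
    rs.busy = True
    for field, src in (("j", src1), ("k", src2)):
        tag, val = regs[src]
        if tag is None:
            setattr(rs, "V" + field, val)   # operand available now
        else:
            setattr(rs, "Q" + field, tag)   # wait for the producer on the CDB
    regs[dest] = (rs.tag, None)             # rename: dest is now owned by this station

def cdb_broadcast(tag, value):
    for rs in stations.values():            # every waiting station latches the value
        if rs.Qj == tag:
            rs.Vj, rs.Qj = value, None
        if rs.Qk == tag:
            rs.Vk, rs.Qk = value, None
    for r, (t, _) in list(regs.items()):    # the register file latches it too
        if t == tag:
            regs[r] = (None, value)

issue(stations["Mult1"], "F2", "F4", "F0")  # first op writes F0
issue(stations["Mult2"], "F0", "F4", "F6")  # second op reads F0: gets tag Mult1
cdb_broadcast("Mult1", 40.0)                # first op completes: 10.0 * 4.0
```

After the broadcast, Mult2 holds the value 40.0 instead of the tag, and F0's register-file entry is valid: RAW resolved through the CDB, exactly as the Execute step above describes.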
Tomasulo’s Algorithm
• Note that:
  • Upon entering a reservation station, source operands are either filled with values or renamed
  • The new names are in 1-to-1 correspondence with FU names
• Questions:
  • How are output dependencies resolved?
    • Two pending writes to a register
  • How to determine that a read gets the most recent value if writes complete out of order?
Features of T. Alg.
• The value of an operand (for any instruction already issued into a reservation station) will be read from the CDB; it will not be read from the register field.
• Instructions can be issued even before their operands have been produced (knowing they will come over the CDB)
An Example
LD   F6, 34(R2)   (1)
LD   F2, 45(R3)   (2)
MULD F0, F2, F4   (3)
SUBD F8, F2, F6   (4)
DIVD F10, F0, F6  (5)
ADDD F6, F8, F2   (6)

[Dependence graph over (1)-(6)]
An Example

(1) LD F6, 34(R2) is issued: load buffer 1 becomes busy with address 34+[R2], and the register status entry for F6 is tagged L1. Still to issue: LD F2, 45(R3); MULD F0, F2, F4; SUBD F8, F2, F6; DIVD F10, F0, F6; ADDD F6, F8, F2.
An Example (continued)

(2) LD F2, 45(R3) is issued: load buffer 2 becomes busy with address 45+[R3]; F2 is tagged L2 (F6 is still tagged L1). Still to issue: MULD, SUBD, DIVD, ADDD.
An Example (continued)

(3) MULD F0, F2, F4 is issued to multiplier station 1: Vk = [F4] = 4, Qj = L2 (waiting on the load of F2); F0 is tagged M1. Still to issue: SUBD, DIVD, ADDD.
An Example (continued)

(4) SUBD F8, F2, F6 is issued to adder station 1: Qj = L2, Qk = L1 (waiting on both loads); F8 is tagged A1. Still to issue: DIVD, ADDD.
An Example (continued)

(5) DIVD F10, F0, F6 is issued to multiplier station 2: Qj = M1 (waiting on the MULD), Qk = L1 (waiting on the first load); F10 is tagged M2. Still to issue: ADDD.
An Example (continued)

(6) ADDD F6, F8, F2 is issued to adder station 2: Qj = A1 (waiting on the SUBD), Qk = L2 (waiting on the second load); F6 is retagged A2. All six instructions are now issued.
An Example (continued)

Some time later L1 returns 40 and commits: every station waiting on L1 latches the value (SUBD: Vk = 40; DIVD: Vk = 40), and load buffer 1 is freed.
An Example (continued)

Some time later L2 returns 32 and commits: MULD latches Vj = 32 (now ready, with Vk = 4), SUBD latches Vj = 32 (now ready), ADDD latches Vk = 32, and F2 = 32 in the register file.
An Example (continued)

A1 and M1 complete and commit: SUBD produces 32 - 40 = -8 (F8 = -8) and MULD produces 32 * 4 = 128 (F0 = 128). ADDD latches Vj = -8; DIVD latches Vj = 128 and is now ready (Vj = 128, Vk = 40).
An Example (continued)

A2 and M2 complete and commit: ADDD produces -8 + 32 = 24 (F6 = 24) and DIVD produces 128 / 40 = 3.2 (F10 = 3.2). All stations are now free.
ROB and Tomasulo’s Alg.
• Many elements of Tomasulo's algorithm are already included
• Major difference? How WAW is handled.

In Tomasulo, this is done by keeping a "tag" with each register X; the tag is updated each time an instruction writing X is issued (e.g., X-tag = "+3" means the 3rd adder unit is reserved).

On write-back via the CDB, the tag of the FU is compared with the tag of the register:

  if tag of FU = X-tag: overwrite the register (e.g., X)
  else: ignore the result
Tomasulo’s Algorithm
• Advantages
  • Distribution of the hazard detection logic
  • Register renaming and reservation stations take care of all data hazards
• Disadvantages
  – Hardware cost: high-speed associative memory for the tags plus complex control logic
  – A single CDB may be a bottleneck, while multiple CDBs may be too costly (all the associative memory must be duplicated)
Conclusions
- Good for pipelined architectures for which code is difficult to schedule and which are short on "visible" registers
- Future:
  - Hybrids between software and hardware techniques
  - Static scheduling of register-register instructions
  - Dynamic scheduling of loads and stores
ExampleDynamic Scheduling in Pentium 4
• Fetches up to 3 IA-32 instructions per cycle
• Decodes them into micro-ops and sends them to the out-of-order execution engine
• Commits up to 3 micro-ops per cycle
• The pipeline takes 20 cycles
• Register renaming files
  – Potentially 128 outstanding results
• Seven integer execution units
An Example of an OoO Engine
Intel Xeon Out of Order Engine Pipeline Picture Courtesy of Intel from “Hyper-Threading Technology Architecture and Microarchitecture”
VLIW vs Superscalar
• Superscalar
  • Advantages
    – Better code density
    – Code compatible
  • Difference
    – Dynamic scheduling
  • Disadvantages
    – More IF and ID
    – More delay slots are needed
    – Different FUs
• VLIW
  • Advantages
    – Fixed instruction format
    – Explicit parallelism exposed
      • Trace scheduling
  • Difference
    – Static scheduling
  • Disadvantages
    – Static scheduling
    – No dynamic decisions
    – Code explosion
    – Caches are difficult to use
Bibliography
• Texas Instruments, “TMS320C600 Technical Brief.” February 1999. www.ti.com
• Intel Pentium 4 Northwood. www.chip-architect.com. April 2003.
• “Hyper-Threading Technology Architecture and Microarchitecture.” Intel Technology Journal, Volume 6, Issue 1, February 2002, p4-15