Post on 11-Jun-2020
Pipeline Implementation
Don’t forget…
You need to register for the exam at least 14 days in advance: http://tenta.angstrom.uu.se/tenta/
The IT office cannot help you if you miss the deadline.
If you do miss the deadline:
  You are not guaranteed a spot at the exam.
  You won’t be anonymous (although I never look at the names on exams when I’m grading, so I’m not sure how this matters).
There have been so many bugs and problems with this anonymous system that your grades and exams will be delayed by a few days. This won’t actually matter, since I won’t be able to grade your exams until the second week in January anyway.
Today’s Menu
More pipelining
  Data hazards
  Structural hazards
  Control hazards
  Branch prediction (just a taste…)
Examples
This Data Hazard, Revisited
In this particular case, the R10 value has not been computed or returned to the register file when a later instruction wants to use it as an input.
Double pumping the register file doesn’t help here; the later instruction needs R10 two clock cycles before it has been computed and stored back. Oops…
[Figure: two overlapping instructions, each passing through Iget, Rget, ALU op, Mput, Rput; one writes R10 (10 W) and the other reads R10 (10 R).]
Coping with Data Hazards
What do you do? Sometimes the dumb-sounding answer is right.
Hypothesis: it is BAD when certain instructions “overlap” in time in certain patterns in our 5-stage MIPS pipeline.
Proposed solution: don’t let them overlap like this…? Right, that is one solution.
Mechanics:
  Don’t let the instruction flow through the pipe.
  In particular, don’t let it WRITE any bits anywhere in the pipe hardware that represent REAL CPU state (e.g., register file, memory).
Name for this operation: PIPELINE STALL
Coping with Data Hazards: Example
ADD R10, R11, R12
ADD R12, R10, R11
ADD R11, R10, R12
[Figure: pipeline timing diagram (IM, REG, ALU, DM, Reg) over clock cycles 1 to 8, program execution order down the page. The first ADD writes R10 (10 W); the next two ADDs read R10 (10 R).]
Solution 1: Stall
ADD R10, R11, R12
ADD R12, R10, R11
ADD R11, R10, R12
[Figure: pipeline timing diagram over clock cycles 1 to 8, with bubbles inserted ahead of the dependent instructions; R10 is written (10 W) by the first ADD and read (10 R) by the other two.]
Empty slots in the pipe are called bubbles; a bubble means no real instruction work is getting saved there.
Mechanically: How Do We Stall?
Add extra hardware to detect stall situations:
  Watches the instruction field bits.
  Looks for “read versus write” conflicts in particular pipe stages.
  Basically, a bunch of careful “case logic”.
Add extra hardware to push bubbles through the pipe:
  Actually, relatively easy.
  Can just let the instruction you want to stall GO FORWARD through the pipe…
  …but TURN OFF the bits that allow any results to get written into the machine state.
  So, the instruction “executes” (it does the work), but doesn’t “save”.
“If an instruction executes in the middle of a forest, but no registers are around to save the results… did it really execute?” (No.)
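The “case logic” for detecting a stall can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the actual MIPS control logic: there is no forwarding, the register file is double pumped (so a producer that has reached WB is no longer a problem), and the function name `must_stall` and its parameters are made up for illustration.

```python
def must_stall(id_sources, ex_dest, mem_dest):
    """Return True if the instruction in ID must wait (hypothetical helper).

    id_sources: register names read by the instruction in the ID stage.
    ex_dest / mem_dest: destination register of the instruction currently
    in EX / MEM, or None if that stage holds a bubble or writes nothing.
    """
    # Results still in EX or MEM have not reached the register file yet.
    # R0 is hardwired to zero on MIPS, so a "write" to it never matters.
    pending = {dest for dest in (ex_dest, mem_dest) if dest not in (None, "R0")}
    return any(src in pending for src in id_sources)


# ADD R12, R10, R11 sits in ID while ADD R10, R11, R12 is one stage ahead (EX).
print(must_stall(("R10", "R11"), ex_dest="R10", mem_dest=None))  # True
# One cycle later the producer is in MEM (a bubble follows it): still too early.
print(must_stall(("R10", "R11"), ex_dest=None, mem_dest="R10"))  # True
# Producer has reached WB: the double-pumped register file covers this case.
print(must_stall(("R10", "R11"), ex_dest=None, mem_dest=None))   # False
```

The two `True` results correspond to the two bubbles shown for this instruction pair.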
No Dependence Between #1 and #4
SUB R2, R1, R3
AND R12, R2, R5
OR R13, R6, R2
ADD R14, R2, R2
[Figure: pipeline timing diagram over clock cycles 1 to 8. SUB writes R2 (2 W) and ADD reads R2 (2 R).]
In this case, a double-pumped register file makes it OK…
How Else Could We Stall the Pipeline? Compiler can insert nops
ADD R10, R11, R12
nop
nop
ADD R12, R10, R11
[Figure: pipeline timing diagram over clock cycles 1 to 8, with the two nops occupying pipe slots between the two ADDs.]
On MIPS, R0 = R0 + R0 will do it: it saves no state.
Or, The Hardware Can Simulate NOPS
ADD R10, R11, R12
stall
stall
ADD R12, R10, R11
[Figure: pipeline timing diagram over clock cycles 1 to 8; each stall appears as a row of bubbles in the pipe.]
Hardware Trick Can Fix a Dependence
If the result you need does not exist AT ALL yet…
  …well, you are outta luck; sorry.
But what if the result exists, but is not stored back yet?
  Then maybe we can help.
  Instead of stalling until the result is stored back in its “natural” home…
  …grab the result “on the fly” from “inside” the pipe, and send it to the other instruction (another pipe stage) that wants to use it.
Generic name: forwarding
  Instead of waiting to store the result, we forward it immediately (more or less) to the instruction that wants it.
  Mechanically, we add busses to the datapath to move these values around, and these busses always “point backwards” in the datapath, from later stages to earlier stages.
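The selection logic that sits in front of an ALU input can be sketched as follows. This is a sketch, assuming the usual textbook arrangement in which each ALU operand has a multiplexer choosing between the register file value and the values held in the EX/MEM and MEM/WB pipeline registers; the function and parameter names are illustrative, not taken from the slides.

```python
def forward_select(src_reg, ex_mem_dest, ex_mem_writes, mem_wb_dest, mem_wb_writes):
    """Pick where one ALU operand should come from (hypothetical helper).

    Registers are numbers 0..31; register 0 is hardwired to zero on MIPS,
    so it is never forwarded.
    """
    # The most recent producer wins: check the closer stage first.
    if ex_mem_writes and ex_mem_dest != 0 and ex_mem_dest == src_reg:
        return "EX/MEM"
    if mem_wb_writes and mem_wb_dest != 0 and mem_wb_dest == src_reg:
        return "MEM/WB"
    return "REGFILE"


# ADD R12, R10, R11 in EX, right behind ADD R10, R11, R12 (now in MEM).
print(forward_select(10, ex_mem_dest=10, ex_mem_writes=True,
                     mem_wb_dest=0, mem_wb_writes=False))   # EX/MEM
# ADD R4, R5, R10 in EX, two behind the producer of R10 (now in WB).
print(forward_select(10, ex_mem_dest=12, ex_mem_writes=True,
                     mem_wb_dest=10, mem_wb_writes=True))   # MEM/WB
# R11 has no in-flight producer, so it comes from the register file.
print(forward_select(11, ex_mem_dest=10, ex_mem_writes=True,
                     mem_wb_dest=0, mem_wb_writes=False))   # REGFILE
```

Checking EX/MEM before MEM/WB matters when both stages hold a write to the same register: the younger result is the one the program expects to see.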
Reducing Data Hazards: Forwarding
Data may be already computed, just not yet in the Register File.
ADD R10, R11, R12
ADD R12, R10, R11
[Figure: pipeline timing diagram over clock cycles 1 to 8. R10 is passed from the first ADD to the second ahead of the register write (10 W).]
Moving this R10 value requires forwarding busses & logic.
Additions to the Datapath for Forwarding
[Figure: pipelined datapath (PC, instruction memory, register file, sign extend, ALU, data memory, and the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers) with a forwarding bus from MEM added.]
Forwarding Continued
ADD R10, R11, R12
ADD R12, R10, R11
ADD R4, R5, R10
[Figure: pipeline timing diagram over clock cycles 1 to 8, with R10 forwarded from the first ADD to both later ADDs.]
More Additions to the Datapath
[Figure: the same pipelined datapath with a second forwarding bus, from WB, added.]
Forwarding Doesn’t Always Work
LW R10, 0x00(R4)
ADD R12, R10, R11
[Figure: pipeline timing diagram over clock cycles 1 to 8, marking where R10 leaves data memory and where the ALU needs it.]
The ALU needs R10 at the beginning of the clock cycle, but the R10 value is not ready until the end of that cycle.
Loads and Stores Require a Load Delay Slot
LW R10, 0x00(R4)
nop
ADD R12, R10, R11
[Figure: pipeline timing diagram over clock cycles 1 to 8, with R10 forwarded from the load to the ADD across the nop.]
The nop gives us the 1 cycle of delay we need to get R10 from memory and then to the ALU.
Pipelines: Idiosyncrasies
Different architects deal with these “special cases” in different ways, sometimes yielding very different solutions.
  MIPS lets software compiler writers “see” this necessary delay slot.
  Phrasing in architecture: feature “exposed to the compiler”.
What this “exposing” means:
  The compiler knows the slot is required.
  The compiler has to deal with it; the hardware won’t deal with it.
  Your book says this was a sensible tradeoff at the time this MIPS architecture was built (i.e., when throwing 1M gates at a messy pipeline problem was not possible).
  MIPS = “Microprocessor without Interlocked Pipeline Stages”
  Today everyone knows this was a bad idea. Much easier to fix in hardware (transistors are free) than to force everyone to recompile everything.
Alternative:
  Hardware in the pipeline detects this hazard and stalls appropriately.
  This sort of hardware has a name: pipeline interlock hardware.
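What an interlock has to check for the load case is small. The sketch below assumes a pipeline with full forwarding, where the only remaining data hazard is a load immediately followed by a consumer of the loaded register; the function name and its parameters are hypothetical.

```python
def load_use_stall(ex_is_load, ex_dest, id_sources):
    """Return True if the instruction in ID must be held for one cycle.

    ex_is_load: whether the instruction currently in EX is a load.
    ex_dest: register number that instruction will write.
    id_sources: register numbers read by the instruction in ID.
    """
    # Register 0 is hardwired to zero on MIPS, so loading into it is harmless.
    return bool(ex_is_load and ex_dest != 0 and ex_dest in id_sources)


# LW R10, 0x00(R4) followed directly by ADD R12, R10, R11: one bubble needed.
print(load_use_stall(True, 10, (10, 11)))   # True
# ADD R10, R11, R12 followed by ADD R12, R10, R11: forwarding alone is enough.
print(load_use_stall(False, 10, (10, 11)))  # False
# LW R10, ... followed by an instruction that does not read R10.
print(load_use_stall(True, 10, (5, 6)))     # False
```

On a machine with this check in hardware, the compiler no longer has to place a nop after every load whose result is used immediately.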
Example of Forwarding and Load Delay
Rewrite the code assuming a machine without forwarding (by inserting nops or stalls).
ADD R4, R5, R2
LW R15, 0(R4)
SW R15, 4(R2)
Rewrite the code assuming forwarding.
Example of Forwarding and Load Delay
Why forwarding?
ADD R4, R5, R2
LW R15, 0(R4)
SW R15, 4(R2)
Why load delay?
ADD R4, R5, R2
LW R15, 0(R4)
SW R15, 4(R2)
Solution Template
ADD R4, R5, R2
LW R15, 0(R4)
SW R15, 4(R2)
[Figure: empty pipeline timing diagram (IM, REG, ALU, DM, REG), program execution order against time.]
Solution (w/out forwarding)
ADD R4, R5, R2
LW R15, 0(R4)
SW R15, 4(R2)
[Figure: pipeline timing diagram. LW is delayed by two bubbles waiting for R4; SW is delayed by two bubbles waiting for R15.]
Solution (w/forwarding)
ADD R4, R5, R2
LW R15, 0(R4)
SW R15, 4(R2)
[Figure: pipeline timing diagram. R4 is forwarded from ADD to LW; R15 is forwarded from LW to SW, which is delayed by one bubble.]
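The bubble counts in the two solutions can be reproduced with a small scheduling model. This is a sketch under assumptions that are mine, not the slides': a 5-stage pipe, a double-pumped register file, in-order issue, and (to match the diagram) a store that needs its data register at the same point as an ALU operand. The tuple layout `(dest, sources, is_load)` and the function name are hypothetical.

```python
def count_bubbles(program, forwarding):
    """Return the number of bubbles inserted in front of each instruction.

    program: list of (dest, sources, is_load) tuples in program order.
    """
    id_cycle = []      # cycle in which each instruction reads its registers (ID)
    bubbles = []
    next_free_id = 2   # first instruction: IF in cycle 1, ID in cycle 2
    for index, (dest, sources, is_load) in enumerate(program):
        earliest = next_free_id
        for producer in range(index):
            p_dest, _, p_is_load = program[producer]
            if p_dest is None or p_dest not in sources:
                continue
            if not forwarding:
                delay = 3   # wait until the producer is in WB
            elif p_is_load:
                delay = 2   # loaded value exists only after MEM
            else:
                delay = 1   # ALU result forwarded straight to the next EX
            earliest = max(earliest, id_cycle[producer] + delay)
        bubbles.append(earliest - next_free_id)
        id_cycle.append(earliest)
        next_free_id = earliest + 1
    return bubbles


program = [
    ("R4", ("R5", "R2"), False),    # ADD R4, R5, R2
    ("R15", ("R4",), True),         # LW  R15, 0(R4)
    (None, ("R15", "R2"), False),   # SW  R15, 4(R2)
]
print(count_bubbles(program, forwarding=False))  # [0, 2, 2]
print(count_bubbles(program, forwarding=True))   # [0, 0, 1]
```

Forwarding removes the ADD-to-LW stall entirely; the LW-to-SW dependence still costs one bubble because the loaded value does not exist until the end of the MEM stage.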
Taxonomy of Hazards
Data hazards are just one type of hazard that can occur in a machine. There are actually 3 basic types of hazards:
Data hazards
  Instruction depends on the result of a prior computation which is not ready yet.
Structural hazards
  HW cannot support a combination of instructions.
Control hazards
  Pipelining of branches and other instructions which change the PC.
Taxonomy of Hazards
Data hazards
  Instruction depends on the result of a prior computation which is not ready yet.
  OK, we did these. Stall, double pump, and forward, to fix.
Structural hazards
  HW cannot support a combination of instructions.
Control hazards
  Pipelining of branches and other instructions which change the PC.
Structural Hazards--Pipe Stage Contention
Structural hazards
  Occur when two or more instructions want to use the same hardware resource in the same cycle.
  Cause a bubble (stall) in pipelined machines.
  Overcome by replicating hardware resources.
Examples
  Multiple accesses to the register file
  Branch adder and ALU
  Multiple accesses to memory
Structural Hazard Example 1
Without adder #2, both the address computation and the arithmetic computation would require access to the ALU in the same cycle.
beq r1, r2, offset ; if r1 == r2, then PC <-- PC + offset
[Figure: pipelined datapath (PC, instruction memory, register file, sign extend, ALU, data memory, and the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers) with ADDER #1 and ADDER #2 labelled.]
Structural Hazard Example 2
LW R2, 0x10(R4)
SUB R5, R6, R7
ADD R10, R11, R12
ADD R12, R10, R11
[Figure: pipeline timing diagram over clock cycles 1 to 8.]
Two instructions need access to memory in Clock Cycle 4. This is a big reason for having separate I memory (for instructions) and D memory (for data values).
Structural Example 2 (con’t)
LW R2, 0x10(R4)
SUB R5, R6, R7
ADD R10, R11, R12
Stall
ADD R12, R10, R11
[Figure: pipeline timing diagram over clock cycles 1 to 8, with the stall shown as a row of bubbles.]
Two instructions need access to memory in Clock Cycle 4. We’d need to stall to fix this as it…
Taxonomy of Hazards
Data hazards
  Instruction depends on the result of a prior computation which is not ready yet.
  OK, we did these. Stall, double pump, and forward, to fix.
Structural hazards
  HW cannot support a combination of instructions.
  OK, maybe add extra hardware resources; may still have to stall.
Control hazards
  Pipelining of branches and other instructions which change the PC.
Control Hazards - Branches
Example code
Address  Instruction
36       NOP
40       ADD R30, R30, R30
44       BEQ R1, R3, 24     <- this branches to address 72
48       AND R12, R2, R5
52       OR R13, R6, R2
56       ADD R14, R2, R2
60       ...
...
72       LW R4, 50(R7)
76       ...
We execute all of these (48 onward) if R1 != R3.
We execute just these (72 onward) if R1 == R3.
Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...
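The target address 72 follows from simple arithmetic, sketched below. The sketch assumes, as the example code appears to, that the offset in the instruction is written in bytes and is added to the address of the instruction after the branch (PC + 4); in the real MIPS encoding the offset is stored in words, which is what the “<< 2” box in the datapath is for. The function names are illustrative.

```python
def branch_target(branch_address, byte_offset):
    """Target of a taken branch: address of the next instruction plus the offset."""
    return branch_address + 4 + byte_offset


def next_pc(branch_address, byte_offset, taken):
    """Address of the instruction that should follow the branch."""
    if taken:
        return branch_target(branch_address, byte_offset)
    return branch_address + 4


print(branch_target(44, 24))         # 72
print(next_pc(44, 24, taken=True))   # 72
print(next_pc(44, 24, taken=False))  # 48
```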
Recall: Basic Pipelined Datapath
[Figure: pipelined datapath (PC, PC + 4 adder, instruction memory, register file, sign extend, << 2 and branch adder, ALU with Zero output, data memory, muxes, and the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers).]
Branch Hazards
44 BEQ R1, R3, 24
48 AND R12, R2, R5
52 OR R13, R6, R2
56 ADD R14, R2, R2
60 or 72 (depending on branch)
[Figure: pipeline timing diagram over clock cycles 1 to 9.]
Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...
Always Stalling Hurts the No-branch Case
44 BEQ R1, R3, 24
stall
stall
stall
48 AND R12, R2, R5
[Figure: pipeline timing diagram over clock cycles 1 to 9; each of the three stalls appears as a row of bubbles.]
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...
Solution: Assume Branch Not Taken
44 BEQ R1, R3, 24
48 AND R12, R2, R5
52 OR R13, R6, R2
56 ADD R14, R2, R2
60 (we assume branch NOT taken)
[Figure: pipeline timing diagram over clock cycles 1 to 9.]
Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...
Uh Oh: What If Branch Was Taken…?
…i.e., what if we guessed wrong on the branch?
Address  Instruction
36       NOP
40       ADD R30, R30, R30
44       BEQ R1, R3, 24     <- this branches to address 72
48       AND R12, R2, R5
52       OR R13, R6, R2
56       ADD R14, R2, R2
60       ...
...
72       LW R4, 50(R7)
76       ...
We already started some of these (48 onward), since we assumed NO branch taken.
But a few clock cycles later, we figure out that these (72 onward) are the right instructions to go next.
Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...
What Happens When the Branch IS Taken
44 BEQ R1, R3, 24
48 AND R12, R2, R5
52 OR R13, R6, R2
56 ADD R14, R2, R2
72 LW R4, 50(R7)
[Figure: pipeline timing diagram over clock cycles 1 to 9.]
The 3 instructions at 48, 52, and 56 are incorrect to execute: kill them.
Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Common Side-Effect in Pipelines
Sometimes, you just have to guess what will execute.
  Often, we can do it right, and this saves cycles.
  But, occasionally, we are wrong.
Consequences
  We mistakenly start executing the wrong instructions.
  To repair this, must make sure that they DO NOT really execute.
  In particular, must ensure they do not incorrectly corrupt machine state.
Terminology is appealingly vivid
  We “kill” them -- bland but accurate.
  We “squash” them -- think of “bug-under-boot” images.
Common Side-Effect in Pipelines
About squashing instructions:
  Do it because you have to, to avoid getting the wrong answer.
  Do it because we insist on “sequential execution semantics”, which means “the program behaves like the instructions execute sequentially, in order”, no matter what weird goop happens in the pipe.
Aside: terminology
  We say the machine executes the instructions sequentially.
  Also say “instructions are RETIRED in sequential order”.
  Image is: an instruction is born (fetched), grows up (regfile access, ALU ops), then finally finishes and commits correct machine state. This “finally finishes” is “retiring” the instruction.
  In deep pipelines and complex machines, even knowing WHEN your instruction retires takes a lot of complex logic.
Consequence
  Think about restructuring the pipe to MINIMIZE the number of instructions squashed.
Move the Branch Computation Forward
[Figure: pipelined datapath with two annotations on where the branch is resolved: “Too late, adds extra cycle, 1 more inst to squash” and “Better if we can do it sooner, here”.]
Move the Branch Computation Further Forward
[Figure: pipelined datapath with the branch ADDER and a Compare unit added at an earlier stage; Compare controls the MUX select.]
Even better if we can do it sooner, here; need to change the hardware a little to do it.
Result: New & Improved MIPS Datapath
Need just 1 extra cycle after the BEQ branch to know the right address.
On MIPS, it’s called the branch delay slot.
44 BEQ R1, R3, 24
48 AND R12, R2, R5
72 LW R4, 50(R7)
[Figure: pipeline timing diagram over clock cycles 1 to 9.]
The Branch Problem
Branch is detected and handled in cycle 2.
  Allows the branch destination to start in cycle 3.
  But what about the instruction fetched in cycle 2? (ADD here…)
MIPS uses a “branch delay slot”.
Other architectures stall and lose performance.
0x00 BEQ 0x20
0x04 ADD
0x24 SUB
[Figure: pipeline timing diagram over clock cycles 1 to 8.]
Pipeline Idiosyncrasies Revisited
Good news
  Just 1 cycle to figure out what the right branch address is.
  So, not 2 or 3 cycles of potential NOP or stall.
Strange news
  OK, it’s always 1 cycle, and we always have to wait.
  SO--on MIPS, this instruction always executes, no matter what.
  This deviates from the “atomic instruction principle”.
Hence the name: branch delay slot
  The instruction cycle after the branch is used for address calc; 1 cycle of delay is necessary.
  SO… we regard this as a free instruction cycle, and we just DO IT.
Consequence
  You (or your compiler) will need to adjust your code to put some useful work in that “slot”, since just putting in a NOP is wasteful.
Rewriting the Code for a Branch Delay Slot
Without Branch Delay Slot         With Branch Delay Slot
Address  Instruction              Address  Instruction
36       NOP                      36       NOP
40       ADD R30, R30, R30        40       BEQ R1, R3, 28
44       BEQ R1, R3, 24           44       ADD R30, R30, R30
48       AND R12, R2, R5          48       AND R12, R2, R5
52       OR R13, R6, R2           52       OR R13, R6, R2
56       ADD R14, R2, R2          56       ADD R14, R2, R2
60       ...                      60       ...
64       ...                      64       ...
68       ...                      68       ...
72       LW R4, 50(R7)            72       LW R4, 50(R7)
76       ...                      76       ...
Flow of instructions if branch is taken: 36, 40, 44, 72, ...
Flow of instructions if branch is not taken: 36, 40, 44, 48, ...
Recall: Problems w/ Branch Delay Slots in Pipes
[Figure: pipelined datapath with the branch ADDER and Compare unit moved to an earlier stage; Compare controls the MUX select.]
If we left these branch target address calcs deep in the pipe, they created many bubbles, so they were moved earlier.
Datapath with Branch Logic
[Figure: pipelined datapath (PC, PC + 4 adder, instruction memory, register file, sign extend, << 2, ALU, data memory, and the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers) with the branch ADDER and Compare unit in place; Compare controls the MUX select.]
The Branch Delay Slot
In retrospect, probably a mistake.
This solution only works for some pipelines.
  Deeper pipelines mean higher performance.
  They also mean more cycles between when you fetch an instruction and when you know for sure the address of the next instruction.
  The number of branch delay slots would have to grow.
Breaks the atomic instruction principle.
Compilers don’t always find a way to fill the slot.
Forget about it for a minute… Let’s try to fix this problem in a better way.
What if we could predict the future?
Branch Prediction: A better solution?
Assume branch not taken; so just start the AND instruction.
44 BEQ R1, R3, 24
48 AND R1, R2, R5
60 or 72 (depending on outcome of branch)
[Figure: pipeline timing diagram over clock cycles 1 to 8.]
This is a form of Branch Prediction.
Predict Branch Not Taken
Instead of a branch delay slot or stalling, we just assume that the branch will not happen.
  If you’re right, great!
  If you’re wrong, cancel the instructions that should not have executed.
Example: assume “not taken” when the branch is not taken.
44 BEQ R1, R3, 24
48 AND R1, R2, R5
52 SUB R2, R3, R4
[Figure: pipeline timing diagram over clock cycles 1 to 7.]
Branch Misprediction
Example: assume “not taken” when the branch is taken.
Cancel instruction 48 (AND) because it should not have issued.
44 BEQ R1, R3, 24
48 AND R1, R2, R5
72 SUB R8, R9, R2
[Figure: pipeline timing diagram over clock cycles 1 to 7.]
How Can We Do Better? Branch Prediction
There are many different schemes:
  Assume taken
  Assume not taken
  1-bit Branch Prediction
  2-bit Branch Prediction
  N-bit Branch Prediction
  Table-based Branch Prediction
  …
Assuming taken or not taken is called static branch prediction.
Using hardware to predict dynamically is called dynamic branch prediction.
Static Prediction Problems
Some branches: 99% of the time they are taken.
  Example: “are we finished with the loop? If not, start at the top again…”
Some branches: 99% of the time they are not taken.
  Example: “is this an error? If so, branch to the error routine.”
Compilers have trouble rewriting code to make the static behavior the right behavior.
Dynamic Branch Prediction
Dynamic branch prediction uses the previous outcome of a branch to determine future outcomes. Does this make sense?
Yes, sometimes. Consider the following code fragment:
for (k = 0; k < 100000000; k++) { /* do something */ }
Most of the time, the “last” branch decision is the same as the next branch decision.
But what about:
if (a == b) { /* do something */ } else { /* do something else */ }
Hmmm… Not so clear here, eh?
1-bit Branch Prediction
Hardware has a table of single bits.
  Each entry in the table corresponds to a branch in the program.
  If the bit is set (1), the branch is predicted taken.
  If the bit is not set (0), the branch is predicted not taken.
How do the branch table bits get set?
  The hardware determines the real outcome of a branch and uses that outcome (history) to set (or unset) a bit.
How do the branch instructions get mapped to entries in the table?
  Magic… for now. (Lots of custom logic, basically…)
L1: ADD R1, R2, R3
    SUB R3, R4, R2
    BEQ R1, R3, L1
L2: LUI R2, 0x1234
    BNE R3, R4, L2
    J L1
Branch Prediction Table
Entry  0 1 2 3 4 5
Bit    0 1 1 0 0 0
1-bit Branch Prediction (cont.)
L1: ADD R1, R2, R3
    SUB R3, R4, R2
    BEQ R1, R3, L1
L2: LUI R2, 0x1234
    BNE R3, R4, L2
    J L1
Branch Prediction Table at start of program
Entry  0 1 2 3 4 5
Bit    0 1 1 0 0 0
Flow of instructions
ADD R1, R2, R3
SUB R3, R4, R2
BEQ R1, R3, L1   ; table predicts not taken
LUI R2, 0x1234   ; if the branch was taken, then squash LUI
                 ; let’s assume the branch was taken
                 ; then the hardware will update the table
                 ; next, we fetch the destination of the branch
ADD R1, R2, R3
SUB R3, R4, R2
BEQ R1, R3, L1   ; table predicts taken
ADD R1, R2, R3
Branch Prediction Table after first branch resolved
Entry  0 1 2 3 4 5
Bit    1 1 1 0 0 0
Problem with one-bit predictors
Consider this:
for (j = 0; j < 100000000; j++) {
  for (k = 0; k < 10; k++) {
    /* do something */
  }
}
How often do we mispredict the k-loop branch?
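One way to answer the question is to simulate a single prediction bit on the outcome stream of the inner-loop branch. The sketch below assumes the branch sits at the bottom of the k-loop (taken 9 times to repeat, then not taken once to exit), that the bit starts out as “not taken”, and uses 1000 outer iterations instead of 100000000 to keep it quick; the function names are made up for illustration.

```python
def inner_loop_outcomes(outer_iterations, inner_iterations=10):
    """Yield True (taken) or False (not taken) for the k-loop's closing branch."""
    for _ in range(outer_iterations):
        for k in range(inner_iterations):
            # The loop-closing branch is taken on every pass except the last.
            yield k < inner_iterations - 1


def mispredictions_one_bit(outcomes, initial_prediction=False):
    """Count mispredictions of a 1-bit predictor: predict whatever happened last."""
    prediction = initial_prediction
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken   # the single bit remembers only the latest outcome
    return misses


branches = 1000 * 10
misses = mispredictions_one_bit(inner_loop_outcomes(1000))
print(misses, "of", branches)   # 2000 of 10000
```

Under these assumptions the predictor misses twice per trip through the inner loop: once on the exit, and once more on re-entry because the exit flipped the bit.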
2-bit Branch Prediction
Table has 2 bits instead of one bit.
  Creates a history--more “memory” of behavior saved.
Use the entries to determine the outcome of a branch as follows:
11  Predict taken
10  Predict taken
01  Predict not taken
00  Predict not taken
[Figure: state diagram with Taken / Not Taken transitions between the four states.]
2-bit Branch Prediction: Mechanics
If you’re right--you’re right.
Don’t change your prediction if things are going OK with it…
[Figure: states 11 (predict taken) and 00 (predict not taken), each with a self-loop: Taken stays in 11, Not Taken stays in 00.]
2-bit Branch Prediction: Mechanics
Oops--first wrong, mispredict.
Remember that this was the first--it’s a new state.
BUT--don’t change your prediction about the branch direction.
[Figure: states 11, 10 (predict taken) and 00, 01 (predict not taken); the first wrong outcome (“Oops!”) moves 11 to 10 on Not Taken and 00 to 01 on Taken.]
2-bit Branch Prediction: Mechanics
If we’re lucky, that mispredict was a fluke…
The way we WERE predicting it before was OK; that one previous branch was just wrong. NEXT one, we’re right again.
[Figure: the same four states, with transitions back from 10 to 11 on Taken and from 01 to 00 on Not Taken.]
2-bit Branch Prediction: Mechanics
Nope--we’re really wrong.
Now, the branch really wants to go the other way, twice in a row.
So--alter our prediction.
[Figure: the same four states, with the transitions added for a second wrong outcome in a row (Not Taken from 10, Taken from 01), which change the prediction.]
2-bit Branch Prediction: Example
Consider a few branch predictions in sequence.
[Figure: state diagram with states 11 and 10 (predict taken), 01 and 00 (predict not taken), annotated with the steps below.]
1. We stay here as long as “not taken” is correct.
2. Oops… first mispredict.
3. Oops… 2nd mispredict; let’s change our predictions for the future.
4. Stay here as long as “taken” is now right…
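The four steps can be traced with a small model of one table entry. This is a sketch that assumes the saturating-counter form of the 2-bit scheme (move one state toward 11 on taken, one state toward 00 on not taken); the diagram on the slide may instead jump directly to the strong state after the second mispredict, which gives the same predictions for this sequence. The class and function names are illustrative.

```python
class TwoBitPredictor:
    """One 2-bit entry: states 0b00 and 0b01 predict not taken, 0b10 and 0b11 predict taken."""

    def __init__(self, state=0b00):
        self.state = state

    def predict(self):
        return self.state >= 0b10

    def update(self, taken):
        # Saturating counter: never goes above 0b11 or below 0b00.
        if taken:
            self.state = min(0b11, self.state + 1)
        else:
            self.state = max(0b00, self.state - 1)


def count_misses(outcomes, predictor):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Steps 1 to 4: two correct "not taken", two mispredicts, then "taken" is right.
print(count_misses([False, False, True, True, True, True], TwoBitPredictor()))  # 2

# Inner loop of 10 iterations repeated 1000 times: 9 taken, then 1 not taken.
nested = ([True] * 9 + [False]) * 1000
print(count_misses(nested, TwoBitPredictor()))  # 1002
```

On the nested-loop pattern the counter misses only the loop exit once it has warmed up, because a single “not taken” no longer flips the prediction.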
Generalization: 3-bit Prediction
[Figure: two state diagrams side by side, labelled “2-bit Prediction” and “3-bit Prediction”, with T (taken) and NT (not taken) transitions between adjacent states.]
Generalization: N-bit Prediction
Saturating N-bit counter
See whether a majority of the last 2N-1 branches were taken or not taken
[Figure: chain of counter states with T (taken) and NT (not taken) transitions between adjacent states.]
How well does this work?
Really good for loop behavior!
[Figure: bar chart of frequency of mispredictions (axis 0% to 20%) for the SPEC89 benchmarks nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li; predictor with 4096 entries, 2 bits per entry.]
When does this break?
Doesn’t deal with data-dependent branches (not much we can do here).
Doesn’t deal with correlated behavior:
L1: bne $s1, $0, L2     # B1
    ...
L2: bne $s2, $0, L3     # B2
    ...
L3: bne $s1, $s2, L4    # B3
    ...
L4: ...
Note that if B1 is not taken and B2 is not taken, then B3 is not taken.
There is a lot of correlated behavior like this in real programs.
Small Example:
L1: bne  $s1, $0, L2    # B1: if (d == 0)
    addi $s1, $0, 1     #       d = 1
L2: subi $s2, $s1, 1    #
    bne  $s2, $0, L3    # B2: if (d == 1)
    ...
L3:
If B1 is not taken, B2 will not be taken.
How does a standard 1-bit predictor work with this? Assume $s1 alternates between 2 and 0.
Small Example:
We ALWAYS mispredict!!!!
$s1 = ?  B1 predict  B1 action  New B1 predict  B2 predict  B2 action  New B2 predict
2        NT          T          T               NT          T          T
0        T           NT         NT              T           NT         NT
2        NT          T          T               NT          T          T
0        T           NT         NT              T           NT         NT
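The table can be reproduced by simulating the two branches with one prediction bit each. The sketch below models the assembly fragment directly in Python and assumes both bits start as “not taken”, as in the first row; the function name is illustrative.

```python
def run_one_bit(s1_values):
    """Count mispredictions of B1 and B2 with an independent 1-bit predictor each."""
    prediction = {"B1": False, "B2": False}   # False = predict not taken
    misses = {"B1": 0, "B2": 0}
    for s1 in s1_values:
        b1_taken = s1 != 0      # B1: bne $s1, $0, L2
        if not b1_taken:
            s1 = 1              # addi $s1, $0, 1
        s2 = s1 - 1             # subi $s2, $s1, 1
        b2_taken = s2 != 0      # B2: bne $s2, $0, L3
        for name, taken in (("B1", b1_taken), ("B2", b2_taken)):
            if prediction[name] != taken:
                misses[name] += 1
            prediction[name] = taken   # remember only the latest outcome
    return misses


print(run_one_bit([2, 0, 2, 0]))   # {'B1': 4, 'B2': 4}
```

Each branch flips direction every time it executes, so a predictor that repeats the last outcome is wrong every time.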
Correlating Branch Predictors
Idea: keep 2 (or more) predictors.
  One is used/updated if the last branch was taken (T).
  One is used/updated if the last branch was not taken (NT).
  Each predictor could be N bits (we’ll assume one bit).
Prediction bits  Prediction if last branch not taken  Prediction if last branch taken
NT/NT            Not taken                            Not taken
NT/T             Not taken                            Taken
T/NT             Taken                                Not taken
T/T              Taken                                Taken
Previous Example
Initialized to NT/NT
Only one misprediction of B2!
$s1 = ?  B1 predict  B1 action  New B1 predict  B2 predict  B2 action  New B2 predict
2        NT/NT       T          T/NT            NT/NT       T          NT/T
0        T/NT        NT         T/NT            NT/T        NT         NT/T
2        T/NT        T          T/NT            NT/T        T          NT/T
0        T/NT        NT         T/NT            NT/T        NT         NT/T
Use a different predictor for B2 based on whether B1 was taken or not.
Update the predictor based on the B1 action.
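The same sequence can be run through a correlating predictor to check the table. This sketch assumes one bit of global history (the outcome of the most recently executed branch, initially “not taken”) and, per branch, one prediction bit for each history value, all initialized to “not taken”; the function name and data layout are illustrative.

```python
def run_correlating(s1_values):
    """Count mispredictions of B1 and B2 with a 1-bit-history correlating predictor."""
    # table[branch][last_branch_taken] -> predicted outcome (False = not taken)
    table = {
        "B1": {False: False, True: False},
        "B2": {False: False, True: False},
    }
    misses = {"B1": 0, "B2": 0}
    last_taken = False          # assumed initial history: last branch not taken
    for s1 in s1_values:
        b1_taken = s1 != 0      # B1: bne $s1, $0, L2
        if not b1_taken:
            s1 = 1              # addi $s1, $0, 1
        b2_taken = (s1 - 1) != 0   # subi, then B2: bne $s2, $0, L3
        for name, taken in (("B1", b1_taken), ("B2", b2_taken)):
            if table[name][last_taken] != taken:
                misses[name] += 1
            table[name][last_taken] = taken   # update only the bit that was used
            last_taken = taken                # this branch is now the "last branch"
    return misses


print(run_correlating([2, 0, 2, 0]))   # {'B1': 1, 'B2': 1}
```

After the first pass, the bit selected by “B1 was taken” predicts B2 taken and the bit selected by “B1 was not taken” predicts B2 not taken, which is exactly the correlation in the code.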
Performance of Correlating Branch Predictors
[Figure: bar chart of frequency of mispredictions (axis 0% to 20%) for the SPEC89 benchmarks nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li; two series: 2-bit predictor with 4096 entries, and 2-bit 2-level correlating predictor with 1024 entries.]
How to keep the branch prediction data
Keep a table of addresses of branch instructions with the current state of the branch predictor for that branch.
  A valid field indicates that this address is a branch.
  Check in the BTB when you fetch an instruction.
  Update the bits when you know whether the branch is taken or not.
[Figure: table whose entries hold Tag bits, Valid, and Branch Prediction State; the Current PC is compared (== ?) against the tag bits of each entry.]
Branch Targets
So we now predict whether we are going to take the branch or not.
  Doesn’t help if we don’t know where a taken branch goes.
MIPS: the branch delay slot (BDS) solves both the prediction and the target address problem.
  At the end of the ID stage, we know whether we are taking a branch and where we are going.
  Without the BDS, or with a greater level of pipelining, this doesn’t work.
Fortunately, conditional branches, when taken, go to the same place every time!
  Use history.
  Keep a cache.
Branch Target Buffer (BTB): table of branch targets
Branch Target Buffers
A table for branches that are predicted as taken.
  Don’t have to compute branch targets for not-taken branches.
Easy to add to the structure that stores the state of the predictor.
Useful for jumps (we know a jump is always taken, but we don’t know where).
[Figure: table lookup in which the Current PC is compared (== ?) against each entry; a matching entry supplies the Next PC.]
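A branch target buffer can be modelled as a small tagged table indexed by the fetch address. The sketch below is one possible organization, not the one in the figure: it assumes a direct-mapped table, 4-byte instructions, and a single prediction bit per entry; the class, its methods, and the table size are all hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each slot holds (tag, target, predict_taken) or None."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries   # None plays the role of valid = 0

    def _index_and_tag(self, pc):
        word = pc >> 2                  # instructions are 4 bytes apart
        return word % self.entries, word // self.entries

    def next_pc(self, pc):
        """Predicted address of the instruction to fetch after the one at pc."""
        index, tag = self._index_and_tag(pc)
        slot = self.table[index]
        if slot is not None and slot[0] == tag and slot[2]:
            return slot[1]              # hit on a branch predicted taken
        return pc + 4                   # miss, or predicted not taken

    def update(self, pc, taken, target):
        """Record the real outcome once the branch at pc has been resolved."""
        index, tag = self._index_and_tag(pc)
        self.table[index] = (tag, target, taken)


btb = BranchTargetBuffer()
print(btb.next_pc(44))                    # 48: nothing known about address 44 yet
btb.update(44, taken=True, target=72)     # BEQ at 44 resolved as taken, to 72
print(btb.next_pc(44))                    # 72
print(btb.next_pc(108))                   # 112: same slot as 44, but the tag differs
```

The tag comparison is what keeps a different instruction that happens to map to the same slot from being redirected to 72.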
What Makes Pipelines Hard to Implement?
Detecting and resolving hazards
Instruction Set Architecture
  Very complex multicycle instructions are difficult to pipeline.
  Example: stringMov from 0x1234, to 0x4000, 0x1000 bytes
Exceptions and Interrupts
Exceptions and Interrupts
Exceptions are exceptional events that disrupt the normal flow of a program.
Terminology varies between different machines.
Examples of Interrupts
  User hitting the keyboard
  Disk drive asking for attention
  Arrival of a network packet
Examples of Exceptions
  Divide by zero
  Overflow
  Page fault
Exception Flow
When an exception (or interrupt) occurs, control is transferred to the OS.
[Figure: control flow between the User Process and the Operating System: event exception, exception processing by the exception handler, exception return (optional).]
Flow of Instructions During Exception
Example: Add instruction overflows in clock cycle 3
ADD (user program)
LW (user program)
SUB (user program)
SW (OS)
[Figure: pipeline timing diagram over clock cycles 1 to 9.]
Characterizing Exceptions and Interrupts
Synchronous vs. asynchronous events
  Synchronous events occur at the same place every time a program executes.
  Asynchronous events are caused by external devices such as a keyboard, disk drive, or mouse.
User requested vs. coerced
  If a user asks for it, it is user requested.
  Coerced events are hardware events not under user control.
User maskable vs. nonmaskable
  Can a user disable an exception from being detected?
Within vs. between instructions
  Does the event prevent the current instruction from completing?
Resume vs. terminate
  Can the event be handled (corrected), or must the program be terminated?
Types of Exceptions
Exception                 Syn/Asynch  User request?  User maskable?  Within?  Resume?
I/O device                asynch      coerced        nonmaskable     between  resume
invoke OS                 synch       user req.      nonmaskable     between  resume
tracing instr. execution  synch       user req.      user maskable   between  resume
breakpoint                synch       user req.      user maskable   between  resume
int overflow              synch       coerced        user maskable   within   resume
fp overflow               synch       coerced        user maskable   within   resume
page fault                synch       coerced        nonmaskable     within   resume
misaligned mem access     synch       coerced        user maskable   within   resume
mem-prot violation        synch       coerced        nonmaskable     within   term.
undef. instr              synch       coerced        nonmaskable     within   term.
hardware malf.            asynch      coerced        nonmaskable     within   term.
85
Stopping and Restarting Execution
Exception occurs while many instructions are in flight
  Ex: a page fault on a load instruction will occur in stage 4 of the MIPS pipe
  Pipeline must be safely shut down when the exception occurs and then restarted at the offending instruction
How to handle this?
  Force a trap instruction into the pipeline
  Until the trap is taken, turn off all writes for the faulting instruction and any instruction that issued after the faulting instruction
  This prevents instructions from changing the state of the machine
  When the trap is taken, invoking the OS, the OS saves the PC of the offending instruction
  The OS fixes the exception (if possible) and then restarts the machine
  Restarting usually means setting PC <-- offending instruction address
  Replays instruction(s)
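The "turn off all writes" step can be sketched as follows. This is a minimal illustration, not a hardware description: the `InFlight` record, the `take_exception` helper, and the addresses are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class InFlight:
    pc: int                      # address of the instruction (made-up values below)
    name: str                    # mnemonic, for display only
    writes_enabled: bool = True  # may this instruction still change machine state?

def take_exception(pipeline, faulting_index):
    """pipeline is ordered oldest first.

    Disable writes for the faulting instruction and everything issued
    after it, and return the PC the OS would save for the restart.
    """
    for instr in pipeline[faulting_index:]:
        instr.writes_enabled = False
    return pipeline[faulting_index].pc

# Three instructions in flight; the load (index 1) takes a page fault.
pipe = [InFlight(0x100, "add"), InFlight(0x104, "lw"), InFlight(0x108, "sub")]
restart_pc = take_exception(pipe, faulting_index=1)
print(hex(restart_pc), [i.writes_enabled for i in pipe])
```

The older `add` keeps its writes and completes; the `lw` and `sub` are squashed and replayed from the saved PC.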
86
Precise vs. Imprecise Exceptions
If the pipeline can be stopped so that the instructions issued before the faulting instruction complete, and the faulting instruction and those after it change no state, then the pipeline is said to implement precise exceptions
  Gives the illusion that the machine executes one instruction at a time
  Difficult to do when some instructions take multiple cycles to complete
Some instructions may complete before an exception is detected
Example
  Multiply r1, r2, r3   ; multiply takes 10 cycles
  Add r10, r11, r12     ; takes 5 cycles
  Add will complete before multiply is done. If multiply overflows, then an exception will be raised AFTER the add has updated the value in R10. This is an imprecise exception.
Some machines implement both modes: imprecise and precise exceptions
  Special software instructions to guarantee precise exceptions
  Machine runs slower when one needs precise exceptions
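The timing in the multiply/add example can be checked with a few lines of arithmetic. The issue cycles (multiply in cycle 1, add in cycle 2) are an assumption made for this sketch; the slide gives only the latencies.

```python
def completion_cycle(issue_cycle: int, latency: int) -> int:
    """Cycle in which an instruction finishes, counting the issue cycle as its first."""
    return issue_cycle + latency - 1

# Assumed issue cycles: multiply in cycle 1, add in cycle 2.
mul_done = completion_cycle(issue_cycle=1, latency=10)
add_done = completion_cycle(issue_cycle=2, latency=5)

# The younger add finishes first, so R10 is already updated
# if the multiply later raises an overflow exception.
add_finishes_first = add_done < mul_done
print(mul_done, add_done, add_finishes_first)
```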
87
Exceptions and the MIPS Architecture
Which stage can exceptions occur in?
Stage   Problem exceptions occurring
IF      page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      undefined or illegal opcode
EX      arithmetic exception
MEM     page fault on data fetch; misaligned memory access; memory-protection violation
WB      none
88
Multiple Exceptions
Multiple exceptions can happen in the same cycle
Example
  In Clock Cycle 4, LW can have a data page fault while the ADD has an arithmetic exception
  Handled by servicing the page fault and then restarting the LW instruction
  The ADD’s arithmetic exception will occur again because the ADD instruction is restarted after the exception is handled
[Pipeline diagram: LW followed by ADD across clock cycles 1 through 9]
89
Multiple Exceptions (cont.)
Multiple exceptions can be difficult to manage
  Can occur out-of-order
Example
  ADD causes an exception in the instruction fetch stage while LW causes an exception in the memory access stage
  If we implement precise exceptions, the LW exception must be handled first
  This is done by having hardware post exceptions by order of instruction
[Pipeline diagram: LW followed by ADD across clock cycles 1 through 9]
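Posting exceptions by instruction order can be sketched like this. The `Pending` record and `next_to_handle` helper are hypothetical, and the cycle numbers assume LW is fetched in cycle 1 and ADD in cycle 2 of a 5-stage pipeline.

```python
from typing import NamedTuple, Optional, Sequence

class Pending(NamedTuple):
    program_order: int  # 0 = oldest instruction
    instr: str
    stage: str          # pipeline stage that detected the exception
    cycle: int          # clock cycle in which it was detected (assumed values)

def next_to_handle(pending: Sequence[Pending]) -> Optional[Pending]:
    """Pick the exception of the oldest instruction, not the earliest in time."""
    return min(pending, key=lambda p: p.program_order, default=None)

events = [
    Pending(program_order=1, instr="ADD", stage="IF", cycle=2),
    Pending(program_order=0, instr="LW", stage="MEM", cycle=4),
]

first_in_time = min(events, key=lambda p: p.cycle)
first_handled = next_to_handle(events)
print(first_in_time.instr, first_handled.instr)
```

The ADD's fault is detected two cycles earlier, but the LW's fault is the one serviced first because LW comes first in the program.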
90
About Exceptions
One of the single messiest parts of designing a modern CPU
  It isn’t pretty, it’s easy to get wrong
  It’s often not too elegant
  It usually takes huge wads of special logic
  It causes architects to age prematurely
Further complicated by modern CPU mechanisms
  Deep pipes
  Superscalar: lots of instructions in flight in parallel
  Out-of-order execution: time order of exceptions != program order of the instructions on which the exceptions happened
  Maintaining illusion of “sequential instruction execution” gets really complicated.
91
Performance of Pipelined Systems
Stalls due to data and branch hazards make performance less than one instruction per cycle
Compiler is critical in determining overall performance
  Compiler generates code that avoids stalls
Example
  lw R15, 0x00(R2)
  add R14, R15, R15
  lw R16, 0x04(R2)
Might become:
  lw R15, 0x00(R2)
  lw R16, 0x04(R2)
  add R14, R15, R15
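The benefit of the reordering can be checked by counting load-use stalls in each version. This is a rough sketch under stated assumptions: a 5-stage pipeline with forwarding, where an instruction that reads a register loaded by the immediately preceding `lw` costs one stall cycle. The `parse` and `load_use_stalls` helpers are hypothetical and handle only the instruction forms used in these slides.

```python
import re

def parse(line):
    """Return (op, dest, sources) for the small MIPS subset used here."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"R\d+", rest)
    if op == "sw":
        return op, None, regs      # sw reads both of its registers
    return op, regs[0], regs[1:]   # first register is the destination

def load_use_stalls(program):
    """Count instructions that use the result of the lw directly before them."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        p_op, p_dest, _ = parse(prev)
        _, _, c_src = parse(cur)
        if p_op == "lw" and p_dest in c_src:
            stalls += 1
    return stalls

original = ["lw R15, 0x00(R2)", "add R14, R15, R15", "lw R16, 0x04(R2)"]
scheduled = ["lw R15, 0x00(R2)", "lw R16, 0x04(R2)", "add R14, R15, R15"]
print(load_use_stalls(original), load_use_stalls(scheduled))
```

In the original order the `add` sits directly behind the load it depends on; moving the second `lw` between them fills that slot with useful work.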
92
Data Dependencies
Identify all of the true data dependencies in the following code fragment. Don’t assume any implementation information (e.g., forwarding).
  add R2, R5, R4
  add R4, R2, R5
  lw R5, 100(R2)
  add R3, R5, R4
  sw R3, 101(R2)
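One way to work the exercise mechanically is to track the most recent writer of each register and record a true (read-after-write) dependence whenever a later instruction reads it. The helpers below are a hypothetical sketch that understands only the instruction forms in this fragment; instructions are numbered from 1.

```python
import re

def reads_writes(line):
    """Return (set of registers read, register written or None)."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"R\d+", rest)
    if op == "sw":
        return set(regs), None     # sw reads its data and base registers
    return set(regs[1:]), regs[0]  # first register is the destination

def true_dependencies(program):
    """Return (producer, consumer, register) triples for RAW dependences."""
    deps = []
    last_writer = {}
    for i, line in enumerate(program, start=1):
        reads, dest = reads_writes(line)
        for reg in sorted(reads):
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        if dest is not None:
            last_writer[dest] = i
    return deps

fragment = [
    "add R2, R5, R4",
    "add R4, R2, R5",
    "lw R5, 100(R2)",
    "add R3, R5, R4",
    "sw R3, 101(R2)",
]
for producer, consumer, reg in true_dependencies(fragment):
    print(f"instruction {consumer} depends on instruction {producer} through {reg}")
```

Only true dependences are reported; the anti-dependences in the fragment (for example, instruction 2 writing R4 after instruction 1 reads it) are deliberately ignored.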
93
94
Branch Delay Slot
Modify the following code to make use of a branch delay slot (assume a MIPS 5-stage pipeline w/bypass).
Loop: add R3, R3, R4
      lw R2, 100(R3)
      beq R3, R4, Loop
95
Branch Delay Slot
Modify the following code to make use of a branch delay slot (assume a MIPS 5-stage pipeline w/bypass).
Loop: add R3, R3, R4
      beq R3, R4, Loop
      lw R2, 100(R3)      ; moved into the branch delay slot
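Why the `lw` can move but the `add` cannot comes down to whether the instruction writes a register the branch compares. The check below is a hypothetical sketch of that one necessary condition only; a real scheduler must also check dependences against anything the instruction is moved past.

```python
import re

def regs(text):
    """All register names in an operand string, in order of appearance."""
    return re.findall(r"R\d+", text)

def can_move_into_delay_slot(candidate, branch):
    """True if the candidate does not write a register the branch reads."""
    c_op, c_rest = candidate.split(None, 1)
    written = None if c_op == "sw" else regs(c_rest)[0]
    branch_reads = set(regs(branch.split(None, 1)[1]))
    return written not in branch_reads

branch = "beq R3, R4, Loop"
print(can_move_into_delay_slot("lw R2, 100(R3)", branch))   # lw writes R2
print(can_move_into_delay_slot("add R3, R3, R4", branch))   # add writes R3
```

The `lw` writes only R2, which the `beq` never reads, so executing it after the branch leaves the comparison unchanged.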
96
Bypass Paths
Add the necessary bypass path for the following code fragment
  add R3, R2, R1
  sub R5, R2, R3
[Datapath diagram: PC, Instruction Memory (RAM), Register File (Read Reg 1, Read Reg 2, Write Reg, Write Data, Read Data 1, Read Data 2), Sign extend (16 to 32), ALU (Zero), Data Memory (RAM), MUXes, and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
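The control decision behind such a bypass can be sketched as a comparison between the destination register held in EX/MEM and the source registers held in ID/EX. The function and parameter names are hypothetical; the condition is the EX-hazard test as it is commonly presented for the 5-stage MIPS pipeline, with register 0 excluded because it is hardwired to zero.

```python
def forward_from_ex_mem(id_ex_rs, id_ex_rt, ex_mem_regwrite, ex_mem_rd):
    """Return (forward_a, forward_b).

    forward_a / forward_b are True when the first / second ALU input
    should take the ALU result held in EX/MEM instead of the value
    read from the register file.
    """
    hit = bool(ex_mem_regwrite) and ex_mem_rd != 0
    return (hit and ex_mem_rd == id_ex_rs, hit and ex_mem_rd == id_ex_rt)

# add R3, R2, R1 is in MEM (writes R3); sub R5, R2, R3 is in EX (reads R2, R3).
forward_a, forward_b = forward_from_ex_mem(
    id_ex_rs=2, id_ex_rt=3, ex_mem_regwrite=True, ex_mem_rd=3
)
print(forward_a, forward_b)
```

For this fragment only the second ALU operand needs the bypass, since R3 is the second source of the `sub`.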
97
98
Code Performance
How many cycles will the following code fragment take? Assume a 5-stage MIPS pipeline with forward (bypass) paths
  add R5, R5, R7
  lw R6, 100(R7)
  sub R7, R6, R8
99
Code Performance
How many cycles will the following code fragment take? Assume a 5-stage MIPS pipeline with forward (bypass) paths
  add R5, R5, R7
  lw R6, 100(R7)
  sub R7, R6, R8
[Pipeline diagram: ADD, LW, and SUB across clock cycles 1 through 8, with one STALL before SUB, which needs R6 from the LW]
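The cycle count follows from a simple formula: the first instruction needs one cycle per stage, each later instruction finishes one cycle after the one before it, and every stall adds a cycle. A small sketch, with `total_cycles` as a hypothetical helper name:

```python
def total_cycles(n_instructions: int, n_stages: int = 5, stall_cycles: int = 0) -> int:
    """Cycles to run a straight-line sequence through an in-order pipeline."""
    return n_stages + (n_instructions - 1) + stall_cycles

# Three instructions, 5 stages, one load-use stall between LW and SUB.
with_stall = total_cycles(3, n_stages=5, stall_cycles=1)
no_stall = total_cycles(3, n_stages=5, stall_cycles=0)
print(with_stall, no_stall)
```

Without the load-use dependence the same three instructions would finish one cycle sooner.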
100
Machine Performance
Given the following information, what is the clock cycle time (in nanoseconds) of the machine and how many nanoseconds does it take for an instruction to complete?
  # of pipeline stages   8
  Critical Path          15 ns
101
Machine Performance
Given the following information, what is the clock cycle time (in nanoseconds) of the machine and how many nanoseconds does it take for an instruction to complete?
  # of pipeline stages   8
  Critical Path          15 ns
Recall that the critical path is the longest path through a pipeline stage. For a pipelined machine, the critical path defines the cycle time. Therefore, the clock cycle time is 15 ns. One instruction takes 8 stages * 15 ns = 120 ns.
102
Machine Performance (2)
Using the machine specified from the previous problem
  What is the minimum (best) CPI (assume a MIPS 5-stage pipeline)?
  What is the machine’s CPI if 20% of all instructions are loads and 5% of the instructions following a load depend on the result of the load (assume all other instructions have no dependencies)?
103
Machine Performance (2)
Using the machine specified from the previous problem
  What is the minimum (best) CPI (assume a MIPS 5-stage pipeline)?
  What is the machine’s CPI if 20% of all instructions are loads and 5% of the instructions following a load depend on the result of the load (assume all other instructions have no dependencies)?
The best CPI is 1.0
20% * 5% = 1% <-- 1% of the instructions stall for one cycle
The CPI is: 99% * 1.0 cycles + 1% * 2.0 cycles = 1.01 cycles per instruction
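The same CPI calculation, written so the fractions can be varied. `effective_cpi` is a hypothetical helper; it assumes, as the solution does, that each dependent instruction after a load costs exactly one extra cycle and that nothing else stalls.

```python
def effective_cpi(base_cpi: float, load_fraction: float,
                  dependent_fraction: float, stall_cycles: int = 1) -> float:
    """Base CPI plus the average number of stall cycles per instruction."""
    stalled_fraction = load_fraction * dependent_fraction
    return base_cpi + stalled_fraction * stall_cycles

cpi = effective_cpi(base_cpi=1.0, load_fraction=0.20, dependent_fraction=0.05)
print(round(cpi, 4))
```

This is the same weighted average as 99% * 1.0 + 1% * 2.0, rearranged as the base CPI plus the stall contribution.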
104
Solution (w/out forwarding)
  ADD R4, R5, R2
  LW R15, 0(R4)
  SW R15, 4(R2)
[Pipeline diagram: LW waits two bubbles for R4 from the ADD; SW waits two bubbles for R15 from the LW]
105
Solution (w/forwarding)
  ADD R4, R5, R2
  LW R15, 0(R4)
  SW R15, 4(R2)
[Pipeline diagram: R4 is forwarded from the ADD to the LW with no bubble; SW waits one bubble, then receives R15 from the LW by forwarding]
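The bubble counts in the two solutions can be reproduced with a short sketch. The rules encoded here are assumptions read off the diagrams: without forwarding, an instruction that reads the result of the instruction directly before it waits two bubbles; with forwarding, only a dependence on a preceding LW costs one bubble. Only adjacent pairs are examined, and `parse` and `bubbles` are hypothetical helpers.

```python
import re

def parse(line):
    """Return (op, dest, set of sources) for ADD, LW and SW as written above."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"R\d+", rest)
    if op.upper() == "SW":
        return op.upper(), None, set(regs)  # SW writes no register
    return op.upper(), regs[0], set(regs[1:])

def bubbles(program, forwarding):
    """Total bubbles from dependences between adjacent instructions."""
    total = 0
    for prev, cur in zip(program, program[1:]):
        p_op, p_dest, _ = parse(prev)
        _, _, c_reads = parse(cur)
        if p_dest in c_reads:
            if not forwarding:
                total += 2   # wait for the producer's register write
            elif p_op == "LW":
                total += 1   # loaded data is not ready until after MEM
    return total

program = ["ADD R4, R5, R2", "LW R15, 0(R4)", "SW R15, 4(R2)"]
print(bubbles(program, forwarding=False), bubbles(program, forwarding=True))
```

Under these rules the fragment loses four cycles to bubbles without forwarding and one with it.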