Lecture 7 Pipeline Hazards
Hazards CS510 Computer Architectures Lecture 7 - 1
It's Not That Easy to Achieve the Promised Performance
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions
– Data hazards: Instruction depends on result of prior instruction still in the pipeline
– Control hazards: Pipelining of branches and other instructions that change the PC
• Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles”, i.e., idle clock cycles, in the pipeline
Structural Hazards / Memory
[Figure: pipeline diagram, instruction order (LOAD, Instr 1, Instr 2, Instr 3, Instr 4) versus time in clock cycles (CC1 to CC9); each instruction passes through Mem, Reg, ALU, Mem, Reg.]
Operation on memory by 2 different instructions in the same clock cycle
Structural Hazards with Single-Port Memory
[Figure: pipeline diagram, instruction order (LOAD, Instr 1, Instr 2, Instr 3) versus time in clock cycles (CC1 to CC9); stall cycles are inserted before Instr 3.]
3 cycles stall with 1-port memory
Avoiding Structural Hazard with Dual-Port Memory
[Figure: pipeline diagram, instruction order (LOAD, Instr 1 to Instr 5) versus time in clock cycles (CC1 to CC9); each instruction passes through IM, Reg, ALU, DM, Reg.]
No stall with 2-port memory
Data Hazard on Registers
ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R11,R1
[Figure: pipeline diagram (Mem, Reg, ALU, Mem, Reg) for the five instructions over clock cycles CC1 to CC9, highlighting register R1.]
Data Hazard on Registers
Registers can be made to write and read in the same cycle: data is stored in the first half of the clock cycle and can be read in the second half of the same clock cycle.
[Figure: one clock cycle of register Ri, with "Store into Ri" in the first half and "Read from Ri" in the second half.]
Data Hazard on Registers
ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R11,R1
[Figure: pipeline diagram (Mem, Reg, ALU, Mem, Reg) for the five instructions over clock cycles CC1 to CC9, highlighting register R1.]
Needs to stall 2 cycles
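As a rough check on the 2-cycle stall, the sketch below computes how long a dependent instruction would have to wait in the 5-stage pipeline without forwarding, assuming the register file writes in the first half of a cycle and reads in the second half. The function name and the stage numbering constants are illustrative assumptions, not part of the lecture, and each count treats the consumer in isolation (once SUB has stalled, the later instructions are already delayed enough).

```python
# Stage positions in the 5-stage pipeline (IF=1, ID=2, EX=3, MEM=4, WB=5).
ID_STAGE = 2  # operands are read from the register file here
WB_STAGE = 5  # results are written to the register file here


def raw_stall_cycles(distance: int) -> int:
    """Stall cycles for a consumer issued `distance` instructions after
    its producer, with no forwarding and a write-then-read register file."""
    # A producer issued in cycle c writes in cycle c + WB_STAGE - 1.
    # A consumer issued in cycle c + distance reads in cycle
    # c + distance + ID_STAGE - 1. The read may share the write's cycle,
    # so the gap that stalls must close is (WB_STAGE - ID_STAGE) - distance.
    return max(0, (WB_STAGE - ID_STAGE) - distance)


# Distances of the instructions that read R1 after ADD R1,R2,R3.
for name, distance in [("SUB", 1), ("AND", 2), ("OR", 3), ("XOR", 4)]:
    print(name, raw_stall_cycles(distance))
```

Under these assumptions SUB, the instruction right after ADD, is the one that needs the 2 stall cycles noted on the slide.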
Three Generic Data Hazards
Instr_i followed by Instr_j
• Read After Write (RAW): Instr_j tries to read an operand before Instr_i writes it
Instr_i: LW R1, 0(R2)
Instr_j: SUB R4, R1, R5
• Write After Read (WAR): Instr_j tries to write an operand before Instr_i reads it
Instr_i: ADD R1, R2, R3
Instr_j: LW R2, 0(R5)
Can't happen in the DLX 5-stage pipeline because:
– All instructions take 5 stages,
– Reads are always in stage 2, and
– Writes are always in stage 5
• Write After Write (WAW): Instr_j tries to write an operand before Instr_i writes it
– Leaves the wrong result (that of Instr_i, not Instr_j)
Instr_i: LW R1, 0(R2)
Instr_j: LW R1, 0(R3)
Can't happen in the DLX 5-stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
WAR and WAW will appear in later, more complicated pipelines
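The three definitions can be sketched as a small classifier over register names. This is only an illustration: the `Instr` record and the `classify` function are hypothetical names introduced here, and the result lists potential hazards (dependences); whether they actually cause trouble depends on the pipeline, as noted above for DLX.

```python
from typing import FrozenSet, NamedTuple, Optional, Set


class Instr(NamedTuple):
    text: str                # assembly text, for display only
    dest: Optional[str]      # register written, or None
    srcs: FrozenSet[str]     # registers read


def classify(first: Instr, second: Instr) -> Set[str]:
    """Potential data hazards when `first` is followed by `second`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")     # second reads what first writes
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")     # second writes what first reads
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")     # both write the same register
    return found


# The instruction pairs used on the slides.
raw = classify(Instr("LW R1, 0(R2)", "R1", frozenset({"R2"})),
               Instr("SUB R4, R1, R5", "R4", frozenset({"R1", "R5"})))
war = classify(Instr("ADD R1, R2, R3", "R1", frozenset({"R2", "R3"})),
               Instr("LW R2, 0(R5)", "R2", frozenset({"R5"})))
waw = classify(Instr("LW R1, 0(R2)", "R1", frozenset({"R2"})),
               Instr("LW R1, 0(R3)", "R1", frozenset({"R3"})))
print(raw, war, waw)
```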
Forwarding to Avoid Data Hazards
ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R11,R1
[Figure: pipeline diagram (Mem, Reg, ALU, Mem, Reg) for the five instructions over clock cycles CC1 to CC9.]
HW Change for Forwarding
[Figure: forwarding datapath showing two MUXes, the ALU, the Zero? test, Data Memory, and the D/A, A/M, and M/W buffers.]
Load Delay Due to Data Hazard
LOAD R1,0(R2)
SUB R4,R1,R6
AND R6,R1,R7
OR R8,R1,R9
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the four instructions over time in clock cycles.]
Load Delay = 2 cycles
Load Delay with Forwarding
LOAD R1,0(R2)
SUB R4,R1,R6
AND R6,R1,R7
OR R8,R1,R9
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the four instructions over time in clock cycles.]
We need to add HW, called Pipeline Interlock
Load Delay with Forwarding = 1 cycle
Software Scheduling to Avoid Load Hazards
Try to produce fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code (with forwarding):
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd
Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
[Figure annotations: RAW dependences and the resulting stalls are marked on the slow code.]
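To compare the two schedules, the sketch below counts load-use stalls under the assumption stated on the earlier slide: with forwarding, a load delays only an instruction that uses its result in the very next slot (load delay of 1 cycle). The tuple encoding of instructions and the function name `load_use_stalls` are illustrative assumptions.

```python
def load_use_stalls(program):
    """Count stalls where an instruction reads the register loaded by the
    immediately preceding LW (1-cycle load delay, forwarding assumed).

    `program` is a list of (opcode, dest, sources) tuples.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]
print("slow:", load_use_stalls(slow), "fast:", load_use_stalls(fast))
```

Under this model the slow order stalls at ADD (right after LW Rc) and at SUB (right after LW Rf), while the rescheduled order separates every load from its first use.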
Compiler Avoiding Load Stalls
% loads stalling pipeline
Benchmark   Scheduled   Unscheduled
tex         25%         65%
spice       14%         42%
gcc         31%         54%
Pipelined DLX Datapath
[Figure: pipelined DLX datapath with IF, ID, EX, Mem, and WB stages, showing PC, +4 adder, Instr. Memory, RegFile, Sign Ext (16 to 32 bits), MUXes, Add, ALU, Zero? test, Data Memory, SMD, LMD, and the F/D, D/A, A/M, and M/W buffers.]
• Branch Address Calculation
• Decide Condition
• Branch Decision for target address
Control Hazard on Branches: Three Stall Cycles
Program execution order in instructions:
40 BEQ R1,R3, 36
44 AND R12,R2, R5
48 OR R13,R6, R2
52 ADD R14,R2, R2
80 LD R4,R7, 100
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) over clock cycles CC1 to CC9, marking the point where the branch target becomes available.]
AND, OR, and ADD shouldn't be executed when the branch condition is true!
Branch Delay = 3 cycles
Control Hazard on Branches: Three Stall Cycles
Branch instruction IF ID EX MEM WB
Branch successor IF ID EX MEM
Branch successor + 1 IF ID EX
Branch successor + 2 IF ID
– We don't know yet that the instruction being executed is a branch: fetch the branch successor.
– Now we know the instruction being executed is a branch, but stall until the branch target address is known.
– Now the target address is available.
3 wasted clock cycles for the TAKEN branch
Branch Stall Impact
• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9
– About half of the ideal speed
• Two part solution:
– Determine whether the branch is TAKEN or NOT TAKEN sooner, AND
– Compute the TAKEN branch address (branch target) earlier
• DLX branch tests whether a register is = 0 or ≠ 0
DLX Solution: Get New PC earlier
- Move Zero test to ID stage
- Additional ADDER to calculate New PC (taken PC) in ID stage
- 1 clock cycle penalty for branch in contrast to 3 cycles
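The CPI figure can be worked out as follows. The symbol names are illustrative; the last line simply reuses the same 30% branch frequency with the 1-cycle penalty of the DLX solution and is not a number stated on the slide.

```latex
\[
\mathrm{CPI}_{\text{new}}
  = \mathrm{CPI}_{\text{ideal}} + f_{\text{branch}} \times \text{stall cycles}
  = 1 + 0.30 \times 3 = 1.9
\]
\[
\frac{\mathrm{CPI}_{\text{ideal}}}{\mathrm{CPI}_{\text{new}}}
  = \frac{1}{1.9} \approx 0.53
  \quad \text{(about half of the ideal speed)}
\]
\[
\text{With a 1-cycle penalty: } \mathrm{CPI} = 1 + 0.30 \times 1 = 1.3
\]
```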
Pipelined DLX Datapath
[Figure: revised pipelined DLX datapath with IF, ID, EX, Mem, and WB stages, showing PC, +4 adder, Instr. Memory, RegFile, Sign Ext (16 to 32 bits), MUXes, an additional Add, ALU, Zero? test, Data Memory, SMD, LMD, and the F/D, D/A, A/M, and M/W buffers.]
• To get the target address earlier
• To get the condition earlier; target address available after ID
• When a branch instruction is in the Execute stage, the Next Address is available here
Branch Behavior in Programs
• Conditional branch frequencies
– integer average: 14 to 16%
– floating point: 3 to 12%
• Forward and backward taken branches
– forward taken: 60%
– backward taken: 85%
– the average of all conditional branches: 67%
4 Branch Hazard Alternatives
• Stall until branch direction is clear
• Predict branch NOT TAKEN
• Predict branch TAKEN
• Delayed branch
4 Branch Hazard Alternatives: (1) STALL
Stall until branch direction is clear
3 cycle penalty:
Branch instruction      IF     ID     EX     MEM    WB
Branch successor               stall  stall  stall  IF     ID     EX     MEM
Branch successor + 1                                       IF     ID     EX
Branch successor + 2                                              IF     ID
Revised DLX pipeline (get the branch address at EX), 1 cycle penalty (Branch Delay Slot):
Branch instruction      IF     ID     EX     MEM    WB
Branch successor               stall  IF     ID     EX     MEM    WB
Branch successor + 1                         IF     ID     EX     MEM
Branch successor + 2                                IF     ID
4 Branch Hazard Alternatives: (2) Predict Branch “NOT TAKEN”
• Execute successor instructions in sequence
• PC+4 is already calculated, so use it to get the next instruction
• Flush instructions in the pipeline if the branch is actually TAKEN
• Advantage of late pipeline state update
• 47% of DLX branches are NOT TAKEN on average
NOT TAKEN branch (no penalty):
branch instruction i    IF     ID     EX     MEM    WB
instruction i+1                IF     ID     EX     MEM    WB
instruction i+2                       IF     ID     EX     MEM    WB
TAKEN branch (1 cycle penalty):
branch instruction i    IF     ID     EX     MEM    WB
instruction i+1                IF     ID     EX     MEM    WB     <- flush this instruction in progress
instruction T                         IF     ID     EX     MEM    WB
4 Branch Hazard Alternatives: (3) Predict Branch “TAKEN”
– 53% of DLX branches are TAKEN on average
– Branch target address available after ID in DLX
• DLX still incurs 1 cycle branch penalty for TAKEN branch
• Other machines: branch target known before outcome
NOT TAKEN branch, 2 cycle penalty in DLX (1 in other machines):
instruction i           IF     ID     EX     MEM    WB
Instruction T                  stall  IF
Instruction i+1                              IF     ID     EX     MEM    WB
TAKEN branch, 1 cycle penalty in DLX (0 in other machines):
branch instruction i    IF     ID     EX     MEM    WB
Instruction T                  stall  IF     ID     EX     MEM    WB
Instruction T+1                              IF     ID     EX     MEM    WB
– TAKEN address not available at this time (the stall cycle)
– TAKEN address available (when Instruction T is fetched)
4 Branch Hazard Alternatives: (4) Delayed Branch
Delayed Branch
– Delay the branch to take place AFTER a successor instruction
    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n        (Delayed Branch of length n)
    branch target if taken
– A 1-slot delayed branch allows the branch decision and the branch target address to be resolved in the 5-stage DLX pipeline with the control hazard improvement
Delayed Branch
• Where to get instructions to fill the branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch TAKEN
– From fall through: only valuable when branch NOT TAKEN
– Canceling branches allow more slots to be filled
• Compiler effectiveness for single delayed branch slot:
– Fills about 60% of delayed branch slots
– About 80% of instructions executed in delayed branch slots are useful in computation
– About 50% (60% x 80%) of slots usefully filled
4 Branch Hazard Alternatives: Delayed Branch
From target
  Before scheduling:
    SUB R4, R5, R6
    ADD R1, R2, R3
    if R1=0 then
      [Delay slot]
  After scheduling:
    ADD R1, R2, R3
    if R1=0 then
      SUB R4, R5, R6
  - Improves performance when TAKEN (loop)
  - Must be alright to execute the rescheduled instruction if NOT TAKEN
  - May need to duplicate the instruction if it is the target of another branch instruction
From fall through
  Before scheduling:
    ADD R1, R2, R3
    if R1=0 then
      [Delay slot]
    SUB R4, R5, R6
  After scheduling:
    ADD R1, R2, R3
    if R1=0 then
      SUB R4, R5, R6
  - Improves performance when NOT TAKEN
  - Must be alright to execute the instruction if TAKEN
From before
  Before scheduling:
    ADD R1, R2, R3
    if R2=0 then
      [Delay slot]
  After scheduling:
    if R2=0 then
      ADD R1, R2, R3
  - Always improves performance
  - Branch must not depend on the rescheduled instruction
Limitations on Delayed Branch
• Difficulty in finding useful instructions to fill the delayed branch slots
• Solution: Squashing
– Delayed branch associated with a branch prediction
– Instructions in the predicted path are executed in the delayed branch slot
– If the branch outcome is mispredicted, instructions in the delayed branch slot are squashed (discarded)
Canceling Branch
• Used when delayed branch scheduling, i.e., filling the delay slot, cannot be done due to
– Restrictions on scheduling instructions in the delay slots
– Limitations on the ability to predict at compile time whether the branch will be TAKEN or NOT TAKEN
• The instruction includes the direction in which the branch was predicted
– When the branch behaves as predicted, the instructions in the delay slot are executed
– When the branch is incorrectly predicted, the instructions in the delay slot are turned into No-OPs
• A canceling branch allows the delay slot to be filled even if the instruction placed there does not meet the requirements
Evaluating Branch Alternatives
Pipeline speedup = Pipeline depth / CPI = Pipeline depth / (1 + Branch frequency x Branch penalty)
Conditional and unconditional branches collectively have 14% frequency; 65% of branches are TAKEN
Scheduling scheme    Branch penalty   CPI                 Speedup vs unpipelined   Speedup vs stall
Stall pipeline       3                1+0.14x3=1.42       5/1.42=3.5               1.0
Predict Taken        1                1+0.14x1=1.14       5/1.14=4.4               1.26
Predict Not Taken    1                1+0.14x0.65=1.09    5/1.09=4.5               1.29
Delayed branch       0.5              1+0.14x0.5=1.07     5/1.07=4.6               1.31
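The table can be recomputed with a few lines of Python. This is a sketch under the slide's stated assumptions (pipeline depth 5, 14% branch frequency, 65% of branches TAKEN); the constant and function names are illustrative. The slide's figures appear to be rounded or truncated at each step, so the last digit of the recomputed speedups may differ slightly from the table.

```python
PIPELINE_DEPTH = 5
BRANCH_FREQUENCY = 0.14   # conditional and unconditional branches together
TAKEN_FRACTION = 0.65     # fraction of branches that are TAKEN

# Effective branch penalty (in cycles) for each scheme.
schemes = {
    "Stall pipeline": 3.0,
    "Predict Taken": 1.0,
    "Predict Not Taken": TAKEN_FRACTION * 1.0,  # 1 cycle only when TAKEN
    "Delayed branch": 0.5,
}


def cpi(penalty: float) -> float:
    """CPI = 1 + branch frequency x branch penalty."""
    return 1.0 + BRANCH_FREQUENCY * penalty


def speedup(penalty: float) -> float:
    """Pipeline speedup over the unpipelined machine = depth / CPI."""
    return PIPELINE_DEPTH / cpi(penalty)


stall_speedup = speedup(schemes["Stall pipeline"])
for name, penalty in schemes.items():
    print(f"{name:18s} penalty={penalty:4.2f} CPI={cpi(penalty):.2f} "
          f"speedup={speedup(penalty):.2f} "
          f"vs stall={speedup(penalty) / stall_speedup:.2f}")
```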
Static (Compiler) Prediction of Taken/Untaken Branches
Code Motion
    LW   R1, 0(R2)
    SUB  R1, R1, R3     <- depends on LW, need to stall
    BEQZ R1, L
    OR   R4, R5, R6
    ADD  R10,R4,R3
L:  ADD  R7, R8, R9
• If the branch is almost always NOT TAKEN, R4 is not needed on the taken path, and R5 and R6 are not modified in the following instruction(s), moving OR R4, R5, R6 up into the stall slot after LW can increase speed
• If the branch is almost always TAKEN, R7 is not needed on the fall-through path, and R8 and R9 are not modified on the fall-through path, moving ADD R7, R8, R9 up into the stall slot after LW can increase speed
Static (Compiler) Prediction of Taken/Untaken Branches
• Improves strategy for placing instructions in the delay slot
• Two strategies
– Direction-based prediction: TAKEN backward branch, NOT TAKEN forward branch
– Profile-based prediction: record branch behavior, predict branch based on the prior run(s)
[Figure: two bar charts over the benchmarks alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, and tomcatv. One plots Frequency of Misprediction (axis 0% to 70%), the other Misprediction Rate (axis 0% to 14%); legend entries: Always taken; Taken backwards, Not Taken forwards.]
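Direction-based prediction is simple enough to sketch in a few lines: a branch whose target lies at a lower address than the branch itself is predicted TAKEN, any other branch NOT TAKEN. The function names and the tiny trace below are hypothetical and made up purely for illustration; they are not measurements from the benchmarks in the charts.

```python
def predict_taken(branch_pc: int, target_pc: int) -> bool:
    """Direction-based static prediction: backward TAKEN, forward NOT TAKEN."""
    return target_pc < branch_pc


def misprediction_rate(trace) -> float:
    """Fraction of branches in `trace` that the static rule gets wrong.

    `trace` is a list of (branch_pc, target_pc, actually_taken) tuples.
    """
    wrong = sum(1 for pc, target, taken in trace
                if predict_taken(pc, target) != taken)
    return wrong / len(trace)


# Hypothetical trace: a backward loop branch and a forward branch.
trace = [
    (100, 60, True),    # loop branch, taken
    (100, 60, True),    # loop branch, taken
    (100, 60, False),   # loop exit, mispredicted
    (40, 80, False),    # forward branch, not taken
    (40, 80, True),     # forward branch, taken, mispredicted
]
print(misprediction_rate(trace))
```

A profile-based predictor would instead fix each branch's prediction from the outcomes recorded in a prior run.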
Evaluating Static Branch Prediction Strategies
• Misprediction rate ignores frequency of branch
• Instructions between mispredicted branches is a better metric
[Figure: bar chart of instructions per mispredicted branch (log axis, 1 to 100000) for alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, and tomcatv; series: Profile-based, Direction-based.]
End of Hazards and their Resolution Point
Thank you