CA226 — Advanced Computer Architecture
Types of Hazard
Structural hazards: resource conflicts; the hardware cannot support all instruction combinations simultaneously
Data hazards: one instruction depends upon the result (which is not yet available) of a previous instruction
Control hazards: the address of the next instruction cannot be determined immediately (branch and jump instructions; today's topic)
Control Hazards
Control hazards:
• arise from pipelining of branch (and jump) instructions
As described thus far, branching decisions:
• are made during the Mem stage of the pipeline
A naive approach:
• stall until branch decision is known
Terminology
Whenever we encounter a branch:
• it is either taken or not taken
• the cost may be different in each case
Control Hazards
Naive Branching
Branch not taken:
cycle           1      2      3      4      5      6      7
branch          IF     ID     Ex     Mem**  WB
branch+4               stall  stall  stall  **IF   ID     Ex
branch+8                      stall  stall  stall  IF     ID

Branch taken:
cycle           1      2      3      4      5      6      7
branch          IF     ID     Ex     Mem**  WB
target                 stall  stall  stall  **IF   ID     Ex
target+4                      stall  stall  stall  IF     ID
Unfortunately …
This will result in:
• the pipeline being stalled for three cycles every time a branch is encountered
• and branch instructions are common
What might help is:
• a prediction
Predict that a branch will either be:
• taken, or not taken
The easiest thing to do:
• predict branch not taken
• simply allow subsequent instructions to continue to flow into the pipeline
Predict Not Taken
Table 1. And branch is indeed not taken:
cycle           1      2      3      4      5      6      7
branch          IF     ID     Ex     Mem**  WB
branch+4               IF     ID     Ex     **Mem  WB
branch+8                      IF     ID     Ex     Mem    WB
branch+12                            IF     ID     Ex     Mem
Perfect!
• But what if the branch is in fact taken?
Predict Not Taken
Table 2. But branch is in fact taken:
cycle           1      2      3      4      5      6      7
branch          IF     ID     Ex     Mem**  WB
branch+4               IF     ID     Ex     **Mem  WB
branch+8                      IF     ID     **Ex   Mem    WB
branch+12                            IF     **ID   Ex     Mem
target                                      **IF
Predict Not Taken
Table 3. But branch is in fact taken:
cycle           1      2      3      4      5      6      7
branch          IF     ID     Ex     Mem**  WB
branch+4               IF     ID     Ex     **nop  nop
branch+8                      IF     ID     **nop  nop    nop
branch+12                            IF     **nop  nop    nop
target                                      **IF
Observe:
• none of the subsequent instructions has yet changed memory or any registers (that’s helpful!)
• so we can simply replace them with nop instructions
(Still a stall of three cycles when branch taken.)
Slightly Better
When a branch instruction is detected:
• route the Branch Taken condition:
• from Ex (instead of from Mem)
• to ID (instead of to IF)
MIPS Pipeline
Example
Table 4. Branch not taken:
cycle           1      2      3      4      5      6      7
branch          IF     ID     Ex**   Mem    WB
branch+4               IF     stall  **ID   Ex     Mem    WB
branch+8                             IF     ID     Ex     Mem
branch+12                                   IF     ID     Ex
Note
We save two stalls: one because we learn the decision one cycle sooner, and one because we allow the subsequent instruction into IF.
Example — Branch Taken
Table 5. Branch taken:
cycle           1      2      3      4      5      6      7
branch          IF     ID     Ex**   Mem    WB
branch+4               IF     nop    **ID   nop    nop    nop
target                               **IF   ID     Ex     Mem
target+4                                    IF     ID     Ex
Note
An effective stall of two cycles, but one better than before, because we learn whether the branch is taken one cycle sooner.
Where do we stand?
If a branch is not taken:
• we have a stall of one cycle
If a branch is taken:
• we have a stall of two cycles
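The two penalties combine into an average cost per branch once we know how often branches are taken. The following is a minimal sketch; the function name `average_branch_stall` and the taken fraction of 0.6 are illustrative assumptions, not measurements from the course.

```python
def average_branch_stall(p_taken, stall_not_taken, stall_taken):
    """Expected stall cycles per branch for a given probability of 'taken'."""
    if not 0.0 <= p_taken <= 1.0:
        raise ValueError("p_taken must be between 0 and 1")
    return (1.0 - p_taken) * stall_not_taken + p_taken * stall_taken


# Penalties as stated on the slides; p_taken = 0.6 is an assumed value.
naive = average_branch_stall(0.6, 3, 3)           # stall until Mem: always 3 cycles
resolved_in_ex = average_branch_stall(0.6, 1, 2)  # decision routed from Ex to ID
print(naive, resolved_in_ex)
```

Whatever the taken fraction, the naive scheme costs three cycles per branch, while the second scheme costs between one and two.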
In Practice
Unfortunately:
• branches are common
• and most branches are taken (which is indeed unfortunate)
In Practice
Add additional hardware in ID:
• detect branches
• decode the target address: target = IF/ID.nPC + (sign-extend(IF/ID.IR(0..15)) << 2) (so we need at least an adder)
• calculate whether the branch is taken: we need to test for equality and for zero (and perhaps a couple of other tests)
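The work done by this extra ID hardware can be sketched in software. This is a rough model only, assuming the branch offset is the 16-bit immediate held in the low 16 bits of the instruction word; the function names are hypothetical.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(npc, instruction):
    """target = nPC + (sign-extended 16-bit immediate << 2)."""
    offset = sign_extend_16(instruction & 0xFFFF)
    return npc + (offset << 2)


def branch_taken(op, a, b=0):
    """Evaluate the branch condition on register values a and b."""
    tests = {
        "beq": a == b,    # equality test
        "bne": a != b,
        "beqz": a == 0,   # test for zero
        "bnez": a != 0,
    }
    return tests[op]


# Only the low 16 bits of the word matter here: 0xFFFD is an offset of -3 words.
print(hex(branch_target(0x104, 0x1420FFFD)), branch_taken("bnez", 5))
```

The adder in `branch_target` and the comparisons in `branch_taken` correspond to the adder and the equality and zero tests that the slide says must be added to ID.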
So:
• branching is so common, and the cost of stalls so great,
• that it is worth the cost and complexity of additional hardware in the ID pipeline stage
So:
• we determine one stage earlier still whether a branch is taken or not (in ID, now, instead of in Ex)
So, we have:
• no stall if the branch is not taken, and
• a one-cycle stall if the branch is taken
Now…
Table 6. Branch not taken:
cycle           1      2      3      4      5      6      7
branch          IF     ID**   Ex     Mem    WB
branch+4               IF     **ID   Ex     Mem    WB
Now…
Table 7. Branch taken:
cycle           1      2      3      4      5      6      7
branch          IF     ID**   Ex     Mem    WB
branch+4               IF     **nop  nop    nop    nop
target                        **IF   ID     Ex     Mem    WB
target+4                             IF     ID     Ex     Mem
Try these in the simulator:
        bnez  r0,target    ; no stall
        daddi r1,r0,1

        beqz  r0,target    ; branch taken, stall of 1 cycle
        daddi r1,r1,1
Note to self:
• see branch.s
Predict Not Taken
In effect:
• we’re guessing, here, that the branch will not be taken
• so this strategy is known as predict not taken
So:
• no stall if the branch is not taken
• a stall of one cycle if the branch is taken
What might the average number of stall cycles for branch instructions be?
Unfortunately, …
The common case in practice is …
• that the branch is taken!
• so the average number of stalls per branch, in practice, approaches 1
Because …
        for (i=0; i<N; i+=1)
        {
            // do stuff
        }
Whenever we have such a loop:
• the branch is taken more often than not taken
Because …
        daddi r1,r0,0      ; i=0;
        beq   r1,r2,done   ; if (i==N) goto done;
loop:                      ; do stuff
        daddi r1,r1,1      ; i+=1;
        bne   r1,r2,loop   ; if (i!=N) goto loop;
done:
The bne instruction:
• is repeated about N times, so the branch is usually taken
• so the stalls-per-branch approaches 1
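To see why the figure approaches 1, we can count the outcomes of the loop-closing bne directly. This is a small sketch, assuming the loop body runs n >= 1 times and the predict-not-taken penalties of one cycle (taken) and zero cycles (not taken); the function names are hypothetical.

```python
def loop_branch_stats(n):
    """Count (taken, not_taken) outcomes of the loop-closing bne for n iterations."""
    if n < 1:
        raise ValueError("the loop body must run at least once")
    taken = not_taken = 0
    i = 0
    while True:
        i += 1              # daddi r1,r1,1
        if i != n:          # bne r1,r2,loop: taken, go round again
            taken += 1
        else:               # not taken: fall through to done
            not_taken += 1
            break
    return taken, not_taken


def stalls_per_branch(n, stall_taken=1, stall_not_taken=0):
    """Average stall cycles per execution of the loop-closing branch."""
    taken, not_taken = loop_branch_stats(n)
    return (taken * stall_taken + not_taken * stall_not_taken) / (taken + not_taken)


print(loop_branch_stats(100), stalls_per_branch(100))
```

The branch is taken n - 1 times and falls through once, so the average is (n - 1)/n, which tends to 1 as n grows.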
Might we do better?
A predict branch taken strategy:
• would be helpful
• unfortunately, this is not possible on MIPS:
• we only learn the target address after the ID stage
• so a cycle has already been wasted
Hmm:
• Wasted.
• Or is it?
How might we:
• make good use of that "wasted" cycle?
The "Branch Delay Slot"
A branch delay slot is:
• the instruction following any branch (or jump) instruction
Approach:
• the instruction in the delay slot is always executed, whether the branch is taken or not
The "Delay Slot"
Table 8. Branch not taken:
cycle           1      2      3      4      5      6      7
branch          IF     ID**   Ex     Mem    WB
branch+4 (BDS)         IF     **ID   Ex     Mem    WB
branch+8                      IF     ID     Ex     Mem    WB
The instruction after the branch:
• is always executed: good!
The "Delay Slot"
Table 9. Branch taken:
cycle           1      2      3      4      5      6      7
branch          IF     ID**   Ex     Mem    WB
branch+4 (BDS)         IF     ID     Ex     Mem    WB
target                        **IF   ID     Ex     Mem    WB
target+4                             IF     ID     Ex     Mem
The instruction after the branch:
• is always executed: "branch+4" is executed anyway, so there is no stall!
The "Delay Slot"
On such hardware, compilers:
• must insert a suitable instruction into the delay slot
• or, if that is not possible, then a nop (poor solution)
Some Cases — nop
This:
        dadd r1,r2,r3
        bnez r2,somewhere
Becomes:
        dadd r1,r2,r3
        bnez r2,somewhere
        nop                ; poor solution, effectively a stall
Note
Correct, but not great. The nop is in effect a stall.
Some Cases — Independent Instruction
This:
        dadd r1,r2,r3
        bnez r2,somewhere
Becomes:
        bnez r2,somewhere
        dadd r1,r2,r3      ; the branch does not depend on r1
Some Cases — Temporary Registers
This:
        dadd r1,r2,r3
        or   r20,r2,r3     ; r20 is a temporary register within this loop
        bnez r1,target
        ...
target: dsub r4,r5,r6
Becomes:
        dadd r1,r2,r3
        bnez r1,target
        or   r20,r2,r3     ; doesn't matter if executed
        ...                ; again, the delay cycle is effectively lost
target:                    ; but only if the branch is taken! (no nop)
        dsub r4,r5,r6
Loop — Far Better
This:
target: dsub  r4,r5,r6     ; assume r4 is a temporary register
        ...                ; do stuff
        daddi r1,r1,-1
        bnez  r1,target    ; branch depends on r1
        nop                ; BDS: we want to use this slot
Loop — Far Better
This:
target: dsub  r4,r5,r6     ; assume r4 is a temporary register
        ...                ; do stuff
        daddi r1,r1,-1
        bnez  r1,target
Becomes:
        dsub  r4,r5,r6     ; moved up
target: ...                ; do stuff
        daddi r1,r1,-1
        bnez  r1,target
        dsub  r4,r5,r6     ; repeated, from above
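One way to convince ourselves that this loop transformation is safe is to model it. The sketch below is a toy interpreter, not WinMIPS64: it supports only dsub, daddi, bnez and nop, always executes the instruction after a branch, omits the "do stuff" part of the loop, and uses assumed initial register values.

```python
def run(program, labels, regs):
    """Run a tiny MIPS-like program with one branch delay slot.

    program: list of (op, args) tuples; labels: label name -> instruction index.
    Updates regs in place and returns the number of instructions executed.
    """
    pc, executed, pending = 0, 0, None
    while pc < len(program):
        op, args = program[pc]
        next_pc = pc + 1
        if pending is not None:          # this instruction sits in the delay slot
            next_pc, pending = pending, None
        if op == "dsub":
            rd, rs, rt = args
            regs[rd] = regs[rs] - regs[rt]
        elif op == "daddi":
            rd, rs, imm = args
            regs[rd] = regs[rs] + imm
        elif op == "bnez":
            rs, label = args
            if regs[rs] != 0:
                pending = labels[label]  # takes effect after the delay slot
        elif op != "nop":
            raise ValueError(f"unknown op {op}")
        executed += 1
        pc = next_pc
    return executed


original = [
    ("dsub", ("r4", "r5", "r6")),    # target:
    ("daddi", ("r1", "r1", -1)),
    ("bnez", ("r1", "target")),
    ("nop", ()),                     # delay slot wasted
]
scheduled = [
    ("dsub", ("r4", "r5", "r6")),    # moved up
    ("daddi", ("r1", "r1", -1)),     # target:
    ("bnez", ("r1", "target")),
    ("dsub", ("r4", "r5", "r6")),    # delay slot: repeated from above
]


def initial_regs():
    # Assumed starting values, chosen only for illustration.
    return {"r1": 3, "r4": 0, "r5": 10, "r6": 4}


regs_a, regs_b = initial_regs(), initial_regs()
count_a = run(original, {"target": 0}, regs_a)
count_b = run(scheduled, {"target": 1}, regs_b)
print(count_a, count_b, regs_a == regs_b)
```

With three iterations, both versions finish with identical register contents, while the scheduled version executes 10 instructions instead of 12. The final delay-slot dsub runs once more than strictly needed, which is why r4 must be a temporary register.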
Try these in the simulator, again:
        bnez  r0,target    ; no stall
        daddi r1,r0,1

        beqz  r0,target    ; branch taken, no stall with branch delay slot
        daddi r1,r1,1
Note
This time with the branch delay slot enabled.
More Insurmountable Stalls
Example:
        dadd r1,r2,r3
        bnez r1,target     ; stall one cycle

        ld   r1,N(r0)
        bnez r1,target     ; stall two cycles
The branch:
• depends upon an immediately preceding arithmetic instruction (stall one cycle)
• depends upon an immediately preceding load (stall two cycles)
Another Insurmountable Stall
Table 10. If branch taken is resolved in Ex:
cycle           1      2      3      4      5      6      7
dadd r1,r2,r3   IF     ID     Ex**   Mem    WB
bnez r1,target         IF     ID     **Ex   Mem    WB
delay slot                    IF     ID     Ex     Mem    WB
No problem:
• r1 can be forwarded, as before
Another Insurmountable Stall
Table 11. If branch taken is resolved in ID:
cycle           1      2      3      4      5      6      7
dadd r1,r2,r3   IF     ID     Ex**   Mem    WB
bnez r1,target         IF     **ID   Ex     Mem    WB
delay slot                    IF     ID     Ex     Mem    WB
Oops:
• forwarding can’t help here: the branch needs r1 in ID during the same cycle in which dadd is still computing it in Ex
Another Insurmountable Stall
Table 12. If branch taken is resolved in ID:
cycle           1      2      3      4      5      6      7
dadd r1,r2,r3   IF     ID     Ex**   Mem    WB
bnez r1,target         IF     stall  **ID   Ex     Mem    WB
delay slot                           IF     ID     Ex     Mem
Such a RAW dependency:
• results in a stall of one cycle
(Try to find another instruction which can be inserted in between.)
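The effect of inserting an unrelated instruction in between can be estimated with a small model. This sketch assumes a branch resolved in ID, with results forwarded to ID in the cycle after they are produced (end of Ex for arithmetic, end of Mem for loads). It reproduces the one-cycle and two-cycle cases from the slides; the behaviour at larger distances is an inference from that assumption, and the function name is hypothetical.

```python
def branch_raw_stalls(producer_kind, distance):
    """Stall cycles for a branch resolved in ID that reads a register written
    by an instruction `distance` slots earlier (1 = immediately preceding)."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    # Instructions that must separate producer and branch, plus one (assumed):
    # an ALU result is usable in ID two cycles after the producer's ID,
    # a loaded value three cycles after.
    ready_after = {"alu": 2, "load": 3}
    return max(0, ready_after[producer_kind] - distance)


for kind in ("alu", "load"):
    print(kind, [branch_raw_stalls(kind, d) for d in (1, 2, 3)])
```

Under these assumptions, each independent instruction placed between the producer and the branch removes one stall cycle.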
Jumps
Jumps:
• are handled the same way: we learn the target address in ID, and the instruction in the delay slot is always executed
Jumps
Table 13. Jumps are always taken:
cycle           1      2      3      4      5      6      7
jump            IF     ID**   Ex     Mem    WB
delay slot             IF     ID     Ex     Mem    WB
target                        **IF   ID     Ex     Mem    WB
target+4                             IF     ID     Ex     Mem
Example
Note to self:
• take a look at ../winmips64/reverse-with-nops.s
Done