Computer Science Department
University of Central Florida
CDA 5106 Advanced Computer Architecture I
Pipelining
2
Designing a processor
• Design the ISA
• Classify instructions for the ISA (e.g., MIPS):
– Memory references
– Register-Register ALU Operations
– Register-Immediate ALU Operations
– Branches
• Work out the execution for each operation class
• Design appropriate hardware
• Look for opportunities to improve…
• …while maintaining correct execution
3
How to Execute an Instruction
• Instruction fetch ("IF")
  – IR = Mem[PC]
  – NPC = PC + 4
• Instruction decode/Register fetch ("ID")
  – A = Regs[IR25..21]
  – B = Regs[IR20..16]
  – Imm = sign-extend(IR15..0)
• Execute ("EX")
  – Memory reference: ALUOutput = A + Imm
  – Reg/Reg ALU Operation: ALUOutput = A op B
  – Reg/Immediate ALU Operation: ALUOutput = A op Imm
  – Branch: ALUOutput = NPC + Imm; Cond = (A op 0)
4
Executing an Instruction (cont.)
• Memory Access/Branch completion (“MEM“)
– Memory Reference:
• Load_Mem_Data = Mem[ALUOutput] /* Load */
• Mem[ALUOutput] = B /* Store */
– Branch
• If (cond) PC = ALUOutput, else PC = NPC
• Write back (“WB”)
– Reg-Reg ALU Operation:
• Regs[IR15..11] = ALUOutput
– Reg-Immediate ALU Operation:
• Regs[IR20..16] = ALUOutput
– Load instruction:
• Regs[IR20..16] = Load_Mem_Data
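The five steps above can be sketched as a tiny interpreter. This is a minimal sketch for intuition, not the lecture's datapath: the dictionary-based instruction format, the small op set, and the memory model are simplifying assumptions, not real MIPS encodings.

```python
# Minimal sketch of the 5-step execution of one MIPS-like instruction.
# Instruction format ({"op", "rs", "rt", "rd", "imm"}) is an illustrative assumption.

def execute_one(state, instr):
    # IF: fetch (instr stands in for Mem[PC]); NPC = PC + 4
    npc = state["pc"] + 4
    # ID: read source registers and sign-extended immediate
    a = state["regs"][instr["rs"]]
    b = state["regs"][instr["rt"]]
    imm = instr.get("imm", 0)
    # EX: one of the four operation classes
    if instr["op"] == "load":      # memory reference
        alu_out = a + imm
    elif instr["op"] == "add":     # reg-reg ALU op
        alu_out = a + b
    elif instr["op"] == "addi":    # reg-immediate ALU op
        alu_out = a + imm
    elif instr["op"] == "beqz":    # branch: target = NPC + Imm, condition on A
        alu_out = npc + imm
        cond = (a == 0)
    # MEM: memory access or branch completion
    if instr["op"] == "load":
        lmd = state["mem"].get(alu_out, 0)
    if instr["op"] == "beqz":
        state["pc"] = alu_out if cond else npc
    else:
        state["pc"] = npc
    # WB: write the destination register (rd for reg-reg, rt for reg-imm/load)
    if instr["op"] == "add":
        state["regs"][instr["rd"]] = alu_out
    elif instr["op"] == "addi":
        state["regs"][instr["rt"]] = alu_out
    elif instr["op"] == "load":
        state["regs"][instr["rt"]] = lmd
    return state

state = {"pc": 0, "regs": [0] * 32, "mem": {100: 7}}
state["regs"][2] = 96
execute_one(state, {"op": "load", "rs": 2, "rt": 1, "imm": 4})  # r1 = Mem[96+4] = 7
```

Note that the write-back destination differs by class, exactly as on the slide: IR15..11 (rd) for reg-reg operations, IR20..16 (rt) for reg-immediate operations and loads.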
5
How to Execute an Instruction
• Instruction fetch ("IF")
  – IR = Mem[PC]
  – NPC = PC + 4

[Datapath for IF: the PC addresses the instruction cache; an adder computes NPC = PC + 4; the fetched instruction is latched into IR (inst. reg.)]
6
How to Execute an Instruction (cont.)
• Instruction decode/Register fetch ("ID")
  – A = Regs[IR25..21]
  – B = Regs[IR20..16]
  – Imm = sign-extend(IR15..0)

[Datapath for ID: fields of IR address the register file (Regs) to read A and B; the immediate field is sign-extended into Imm]
7
How to Execute an Instruction (cont.)
• Execute ("EX")
– Memory reference:
• ALUOutput = A + Imm
– Reg/Reg ALU Operation:
• ALUOutput = A op B
– Reg/Immediate ALU Operation:
• ALUOutput = A op Imm
– Branch:
• ALUOutput = NPC + Imm; Cond = (A op 0)
[Datapath for EX: MUXes select A or NPC and B or Imm as the ALU inputs; a =0? test on A produces cond]
8
How to Execute an Instruction (cont.)
• Memory Access/Branch completion ("MEM")
  – Memory Reference:
    • Load_Mem_Data = Mem[ALUOutput] /* Load */
    • Mem[ALUOutput] = B /* Store */
  – Branch:
    • If (cond) PC = ALUOutput, else PC = NPC

[Datapath for MEM: ALUOutput addresses the data cache (loads latch the result into LMD, stores write B); a MUX controlled by cond selects ALUOutput or NPC as the next PC]
9
How to Execute an Instruction (cont.)
• Write back ("WB")
  – Reg-Reg ALU Operation: Regs[IR15..11] = ALUOutput
  – Reg-Immediate ALU Operation: Regs[IR20..16] = ALUOutput
  – Load instruction: Regs[IR20..16] = Load_Mem_Data

[Datapath for WB: a MUX selects between ALUOutput and LMD to drive the register-file write port (Regs)]
10
[Complete datapath with all five steps, labeled left to right: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (MEM), Writeback (WB)]
11
An Abstract View of Single-Cycle Implementation
12
Controller
13
Analysis
• Single-cycle Implementation -> Multi-cycle Implementation
• All instructions (except branch): IF, ID, EX, MEM, WB
• Branch (12%): IF, ID, EX, MEM
• CPI = 5*0.88 + 4*0.12 = 4.88 cycles
• Graphically (each instruction runs to completion before the next starts):

i:    IC Reg ALU DC Reg
i+1:                    IC Reg ALU DC Reg
i+2:                                      IC Reg ALU DC Reg
14
Unpipelined Execution
• Throughput
  – Depends on the full latency of each instruction

i:    IF ID EX MEM WB
i+1:                  IF ID EX MEM WB

[While one stage works, the rest idle: I$ idle; decoder and RF read ports idle; ALU idle; D$ idle; RF write port idle]
15
Pipelined Execution
• Isolate each of IF, ID, EX, MEM, WB with latches
• When instruction i is in WB, i+1 is in MEM, etc.
• Graphically:

time ->
i:    IC Reg ALU DC Reg
i+1:     IC Reg ALU DC Reg
i+2:        IC Reg ALU DC Reg
i+3:           IC Reg ALU DC Reg
i+4:              IC Reg ALU DC Reg
17
Pipeline speedup (no stalls)
• For a pipeline of n stages:
speedup = ave. exec. time unpipelined / ave. exec. time pipelined
        = T_unpipe / (T_unpipe / n + T_latch)
        = n * T_unpipe / (T_unpipe + n * T_latch)
        -> n   (ideal case, T_latch = 0)

[The n = 5 stages IF ID EX MEM WB are separated by pipeline latches, each adding T_latch to the cycle]
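The speedup formula above makes the diminishing-returns point concrete. A quick sketch, with illustrative numbers (10 ns of unpipelined logic, 0.2 ns per latch) that are assumptions, not figures from the lecture:

```python
# Pipeline speedup with latch overhead: speedup = T_unpipe / (T_unpipe/n + T_latch).
# 10 ns logic delay and 0.2 ns latch delay are illustrative assumptions.
def speedup(n, t_unpipe=10.0, t_latch=0.2):
    return t_unpipe / (t_unpipe / n + t_latch)

for n in (1, 5, 10, 20, 50):
    print(n, round(speedup(n), 2))
# With T_latch = 0 the speedup would be exactly n; with real latches it
# saturates below T_unpipe / T_latch, showing diminishing returns for deep pipelines.
```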
18
Pipeline limits
• Limitations of pipelining
– Tlatch
• Delay, setup, hold times
• Clock skew
• Latch takes up more of cycle as cycle shrinks: deeper pipelining gives diminishing returns
– Minimum logic between latches
19
Pipelining Idealisms
• Uniform subcomputations
– Can pipeline into stages with equal delay
– Balance pipeline stages
• Identical computations
– Can fill pipeline with identical work
– Unify instruction types
• Independent computations
– No relationships between work units
– Minimize pipeline stalls
• Are these practical?
– No, but can get close enough to get significant speedup
20
Pipeline hazards
• A hazard reduces the performance of the pipeline
  – Due to the program's characteristics
  – Potential violations of program dependences
• Hazard resolution
  – Static method: performed statically by the compiler
  – Dynamic method: performed dynamically by hardware at run time, e.g., stall, flush, forwarding
• Three kinds:
  – Structural hazards: not enough hardware resources for all combinations of instructions
  – Data hazards: dependences between instructions prevent their overlapped execution
  – Control hazards: branches change the PC, and the new PC is known late in the pipeline
21
Structural hazard
• Consider a pipeline with a unified data+instruction cache:
i (load):  IF ID EX MEM WB
i+1:          IF ID EX  MEM WB
i+2:             IF ID  EX  MEM WB
i+3:                stall IF ID EX MEM WB
i+4:                      IF ID EX MEM WB

[In the cycle where the load is in MEM, i+3's IF needs the same unified cache, so i+3 and everything behind it stall one cycle]
22
Modeling stalls
speedup = ave. exec. time unpipelined / ave. exec. time pipelined
        = (CPI_unpipe * CT_unpipe) / (CPI_pipe * CT_pipe)

CPI_pipe = CPI_nostall + (stall cycles per instruction)
         = 1 + (stall cycles per instruction)      [this assumes CPI_nostall = 1]

speedup = (CPI_unpipe / [1 + (stall cycles per instruction)]) * (CT_unpipe / CT_pipe)

Assuming the two CTs are equal (CT_unpipe = CT_pipe) and CPI_unpipe = n (an n-stage instruction takes n cycles unpipelined):

speedup = n / (1 + stall cycles per instruction)
23
Modeling stalls (example)
• Ex:
– n = 5 (e.g., MIPS pipeline)
– 20% of instructions are branches
– 60% of branches are taken
– Penalties:
• Taken branches: 3 stall cycles
• Not-taken branches: 0 stall cycles
• How many stall cycles per instr. on average?
– stall cycles/instr. = (0.8 x 0) + (0.2 x [ 0.6 x 3 + 0.4 x 0]) = 0.2 x 0.6 x 3 = 0.36
– Speedup = 5 / 1.36 = 3.68
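The same arithmetic, written out so each factor is visible:

```python
# Stall-cycle model from the example: 20% of instructions are branches,
# 60% of branches are taken, taken branches cost 3 stall cycles; n = 5 stages.
frac_branch, frac_taken, taken_penalty = 0.20, 0.60, 3
stalls = frac_branch * (frac_taken * taken_penalty)   # 0.2 * 0.6 * 3 = 0.36
cpi = 1 + stalls                                      # 1.36 cycles/instr
speedup = 5 / cpi                                     # ~3.68
print(round(stalls, 2), round(cpi, 2), round(speedup, 2))
```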
24
Data Hazards
• Read-after-write (RAW) hazard
add r1, r2, r3   IF ID EX MEM WB
add r4, r1, r5      IF stall stall ID EX MEM WB

– r1 is written by the first add (in WB) and read by the second (in ID)
– Perform the register write in the 1st half of the clock cycle (CT/2) and the read in the second half: the dependent ID can then overlap the producer's WB, so the cost is a 2-cycle stall
25
RAW Data Hazards
add r1, r2, r3   IF ID EX MEM WB
add r4, r1, r5      IF ID EX MEM WB

– The result (r2+r3) is available at the end of EX, but r1 is not written until WB
– r1 is read in the consumer's ID — yet the consumer only needs the result, not r1 itself
27
Data forwarding (bypasses)
[Forwarding datapath: MUXes at the ALU inputs choose among A, B, and IMM from the ID/EX latch, the EX/MEM latch output (bypass 1), and the MEM/WB latch output (bypass 2); results still flow to the RF and D$ as before]
28
Data forwarding (cont.)
add r1, r2, r3   IF ID EX MEM WB
add r4, r1, r5      IF ID EX MEM WB        <- r1 via bypass 1 (EX/MEM latch)
add r6, r1, r5         IF ID EX MEM WB     <- r1 via bypass 2 (MEM/WB latch)
add r7, r1, r5            IF ID EX MEM WB  <- r1 read from the RF in ID (write-then-read in the same cycle)
add r8, r1, r5               IF ID EX MEM WB
29
Stalls due to RAW data hazards
• Our simple pipeline
– Most RAW hazards => no stall
– Loads cause 1-cycle stall
load r1, r2, r3   IF ID EX MEM WB
add r4, r1, r5       IF ID stall EX MEM WB

– The loaded value is available only at the end of the load's MEM, but is needed at the start of the add's EX: one bubble, then bypass 2 delivers it
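With full forwarding, the hazard rule above reduces to a one-line check: only a load followed immediately by a consumer of its result stalls. A small sketch (the tuple-based instruction format is an illustrative assumption):

```python
# With full forwarding, the only RAW stall left in this simple pipeline is the
# load-use case: a load followed immediately by an instruction that reads its result.
# Instructions are (op, dest, src1, src2) tuples — a simplified assumption.
def load_use_stalls(program):
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "load" and prev[1] in cur[2:]:
            stalls += 1          # one bubble between the load and its user
    return stalls

prog = [("load", "r1", "r2", None),
        ("add",  "r4", "r1", "r5"),   # uses r1 right after the load -> 1-cycle stall
        ("add",  "r6", "r4", "r5")]   # ALU result is forwarded -> no stall
print(load_use_stalls(prog))
```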
30
Other data hazards
• WAR (write-after-read)
– A r1, r2, r3
– B r2, r4, r5
– Hazard if B writes R2 before A reads R2
– Doesn’t happen in simple DLX pipeline, but can in others
• E.g., occurs if pipeline allows late register reads
SW 0(R2), R1    IF ID EX MEM1 MEM2 MEM3 WB   <- reads R1 during MEM3
ADD R1, R3, R4     IF ID EX WB               <- writes R1 during WB, before the SW reads it
31
Other data hazards (cont.)
• WAW (write-after-write)
– A r1, r2, r3
– B r1, r4, r5
– Hazard if B writes R1 before A writes R1
– Result: later instructions see wrong value in the register
– Occurs if instructions can write register file out-of-order
– This also doesn’t happen in simple DLX pipeline, but can in others:
LW R1, 0(R2)    IF ID EX MEM1 MEM2 WB   <- writes the 1st version of R1
ADD R1, R2, R3     IF ID EX WB          <- writes the 2nd version of R1, earlier in time
32
Other data hazards (cont.)
• Handling WAR/WAW hazards
– Stall the later instruction (stalls in WB stage)
– Detect in decode and prevent from happening by stalling earlier (easier to implement)
– Compiler: don’t reuse register specifier
– Hardware: register renaming (see next major topic – ILP)
33
Types of dependencies
• True dependence (pure dependence, flow dependence)
  – ADD R1,R2,R3
  – SUB R4,R5,R1
  – May cause RAW hazards
• Anti-dependence
  – ADD R3,R2,R1
  – SUB R1,R4,R5
  – May cause WAR hazards
  – Due to reuse: removed by using another register
• Output dependence
  – ADD R1,R2,R3
  – SUB R1,R4,R5
  – May cause WAW hazards
  – Due to reuse: removed by using another register
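The three dependence types follow mechanically from which registers the two instructions read and write. A sketch, using the slide's three register examples (the tuple format (dest, src1, src2) is an assumption for illustration):

```python
# Classify the dependences between two instructions in program order,
# each given as (dest, src1, src2). Register names are illustrative.
def dependences(earlier, later):
    deps = []
    if earlier[0] in later[1:]:          # later reads what earlier writes
        deps.append("true (RAW)")
    if later[0] in earlier[1:]:          # later writes what earlier reads
        deps.append("anti (WAR)")
    if earlier[0] == later[0]:           # both write the same register
        deps.append("output (WAW)")
    return deps

print(dependences(("R1", "R2", "R3"), ("R4", "R5", "R1")))  # true dependence
print(dependences(("R3", "R2", "R1"), ("R1", "R4", "R5")))  # anti-dependence
print(dependences(("R1", "R2", "R3"), ("R1", "R4", "R5")))  # output dependence
```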
34
Control hazards
• Branches throw a wrench in the cogs
– Disrupts pipeline because we don’t know what to fetch next
– Problems
• Don’t know we have a branch until decode (ID)
• Don’t know taken target until execute (EX)
• Don’t know branch direction (taken/not taken) until execute (EX) or Memory stage (MEM)
35
Handling Control Hazards: method 1
• stall
NOT-TAKEN:
BNE r1, r2         IF ID EX MEM WB
not-taken target      (stall until resolved) IF ID EX MEM WB

TAKEN:
BNE r1, r2         IF ID EX MEM WB
taken target          (stall until resolved) IF ID EX MEM WB

[PC+4 is known during IF; "this is a branch" is known in ID; the direction and the taken target (PC+offset) are known in EX. With stall-on-branch, both directions pay the same stall cycles, since fetch cannot proceed until the branch resolves]
36
Handling Control Hazards: method 2
• predict not-taken
NOT-TAKEN:
BNE r1, r2         IF ID EX MEM WB
not-taken target      IF ID EX MEM WB   (fetched immediately: no penalty)

TAKEN:
BNE r1, r2         IF ID EX MEM WB
not-taken target      IF (squashed)
taken target             IF ID EX MEM WB

[The fall-through path is fetched speculatively; once the direction and PC+offset are known in EX, a taken branch squashes the speculatively fetched instructions and redirects fetch to the taken target]
37
Move branch target back
[Datapath change: an extra adder (ALU) and the =0? condition test are moved into the ID stage, so the taken target (PC+offset) and the direction are available at decode]

• Add an ALU (adder) in ID
  – May increase the time for ID
  – Eliminates 1 cycle, but still 1 additional stall
38
Reducing branch penalty via the compiler: Delay slots
• Change the meaning of a branch so that the next instruction after the branch holds something useful
A: BEQZ R1, X
B: ADD R4,R2,R3   <- the "delay slot": move a useful instruction here from above the branch
   ...
X: ...

A   IF ID EX MEM WB
B      IF ID EX MEM WB      (always executed)
X         IF ID EX MEM WB   (branch target)
39
Delay slots
• Add n slots to cover n holes
• ISA is changed to mean “n instructions after any branch are always executed”
• Problem:
– ISA feature that encodes pipeline structure
– Difficult to maintain across generations
– Typically can fill:
• 1 slot 75% of time
• 2 slots about 25% of time
• >2 slots almost never
40
Filling slot from above branch
• Advantage
– Delay slot instruction can always execute regardless of branch outcome (don’t ever need to squash it)
• Disadvantage
– Need a “safe” instruction from above the branch
– Safe means: moving the instruction to the delay slot doesn’t violate any data dependencies
GOOD SCENARIO:
  ADD R4, R2, R3          BEQZ R1, X
  BEQZ R1, X        ->    ADD R4, R2, R3   (moved into the slot)
  NOP (delay slot)

BAD SCENARIO:
  ADD R1, R2, R3          the ADD writes R1, which the branch reads;
  BEQZ R1, X              moving it below the branch would violate the
  NOP (delay slot)        dependence, so the slot must stay a NOP
41
Filling slot from target or fall-through
• If you can’t fill slot(s) from above the branch, use instructions from either:
– Target of branch (if frequently taken)
– Fall-through of branch (if frequently not-taken)
• Example: fill from target (branch is frequently taken)
Before:                   After:
   BEQZ R1, X                BEQZ R1, Y        (change target to Y)
   NOP (delay slot)          SUB R4, R2, R3    (copy of the instruction at X)
   ...                       ...
X: SUB R4, R2, R3         X: SUB R4, R2, R3
Y: …                      Y: …
• Disadvantages
– Only works if delay slot instruction is safe to execute when branch goes the opposite (infrequent) direction
42
Eliminating all stalls: Issues
• When:
– Detect an un-decoded instruction is a branch
• Where:
– Predict where the branch will go (if taken)
• Whether:
– (For conditional branches) Predict if it will be taken or not, before execution
• Optimal: try to determine all three in IF stage
– Won’t work perfectly (prediction), but we can try our best
43
[Complete datapath repeated, stages labeled: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (MEM), Writeback (WB)]
44
Issues with When and Where
• What does IF know?
  – Only the address of the instruction (PC)
  – Keep a buffer (cache) of the last known branch targets around
  – The buffer is written by the WB stage
• Traditional name for this is a Branch Target Buffer (BTB)

[BTB figure: the PC value indexes a table of last known branches; each BTB entry holds a tag and a pre-computed target. Hit = we know it is a branch ("WHEN") and the BTB returns the branch target ("WHERE"); miss = assume not a branch]
45
Predicting Where with returns
• Problem: A lot of jumps are returns from procedures
– Holding the last target address is a poor predictor
• Solution: Keep a hardware “stack” of return addresses
– Push return address when a “call” is executed
– Pop buffer on returns to get prediction
• Bottom of stack is filled with old value on a pop
– Need approx 4-8 entries for integer code
[Each entry in the BTB now contains: tag, pre-computed target, and a "return?" bit]
46
Return Address Stack
1. Initially the RAS is empty.
2. X: call A executes -> push X+4.            RAS: [X+4]
3. Y: call B executes (inside A) -> push Y+4. RAS: [X+4, Y+4]
4. ret (from B) -> pop. Prediction: Y+4.      RAS: [X+4]
5. ret (from A) -> pop. Prediction: X+4.      RAS: empty
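The push/pop discipline above fits in a few lines. A sketch (the fixed-size overflow policy of dropping the oldest entry is an assumption; real designs vary):

```python
# Return-address stack, as in the slides: push PC+4 on a call, pop on a return.
class RAS:
    def __init__(self, size=8):           # ~4-8 entries suffice for integer code
        self.stack, self.size = [], size
    def call(self, pc):
        if len(self.stack) == self.size:  # overflow: drop the oldest entry (an assumption)
            self.stack.pop(0)
        self.stack.append(pc + 4)
    def ret(self):
        return self.stack.pop() if self.stack else None  # empty: no prediction

ras = RAS()
ras.call(0x100)   # call A at X -> push X+4
ras.call(0x200)   # call B at Y -> push Y+4
print(hex(ras.ret()))   # prediction for the first return: Y+4 = 0x204
print(hex(ras.ret()))   # prediction for the second return: X+4 = 0x104
```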
47
Issues with Whether
• Predicting conditional branches
– And sometimes unconditional branches if needed before decode
• Two approaches:
– Hardware to supply prediction
– Software
• Heuristics
• Profiling
48
Hardware branch prediction
• 1-bit schemes (Pattern History Table):
– Add 1-bit prediction field to branch target buffer
– Set prediction field = 1 if branch was taken, 0 if branch was not taken
– At IF, check “branch prediction buffer”:
• if prediction field = 1 then predict taken
• else predict not-taken
– Problems:
• Some branches don’t do what they did last time!
• Think of a simple 10-iteration loop, starting with predict NT
  – What is the prediction accuracy? (The predictor mispredicts the first and last execution of each loop visit: 2 misses per 10 executions, so 80%)
  – Isn’t this high enough?!? (No: with a multi-cycle penalty per miss, mispredicting 20% of a frequent branch is costly)
• Need more sophisticated predictor
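The loop-branch claim is easy to check by simulation. A sketch of the 1-bit scheme on the 10-iteration loop (run 100 times here, an arbitrary choice):

```python
# 1-bit predictor on the branch of a 10-iteration loop executed many times:
# taken 9 times, then not-taken once, per visit to the loop.
outcomes = ([True] * 9 + [False]) * 100   # True = taken
pred, miss = False, 0                     # start predicting not-taken
for taken in outcomes:
    miss += (pred != taken)
    pred = taken                          # 1-bit: predict what happened last time
print(miss, len(outcomes), 1 - miss / len(outcomes))  # 2 misses per visit -> 80% accuracy
```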
49
Why accuracy matters so much
• Reduce stalls by:
  – Decreasing the branch penalty
    • Modify the pipeline (HW question)
  – Increasing accuracy
    • Fancy prediction schemes
  – Decreasing the fraction of branches
    • Compile-time code ordering

stall cycles per instr = (branch penalty) * (1 - accuracy) * (branch fraction)
speedup = n * efficiency, where efficiency = 1 / (1 + stall cycles per instr)

Solving for the accuracy needed to reach a target efficiency:

accuracy = 1 - (1 - efficiency) / (efficiency * branch penalty * branch fraction)

Required accuracy (the table assumes a branch fraction of 0.2):

branch penalty | eff = 0.9 | eff = 0.99
      1        |  44.44%   |  94.95%
      2        |  72.22%   |  97.47%
      3        |  81.48%   |  98.32%
      4        |  86.11%   |  98.74%
     10        |  94.44%   |  99.49%
     20        |  97.22%   |  99.75%
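The table entries follow from the rearranged formula. A sketch that reproduces them (the branch fraction of 0.2 is the assumption that makes the numbers match):

```python
# Required prediction accuracy for a target efficiency, from
#   efficiency = 1 / (1 + penalty * (1 - accuracy) * frac_branch)
# solved for accuracy. frac_branch = 0.2 is assumed, as in the table.
def required_accuracy(penalty, eff, frac_branch=0.2):
    return 1 - (1 - eff) / (eff * penalty * frac_branch)

for penalty in (1, 2, 3, 4, 10, 20):
    print(penalty,
          round(100 * required_accuracy(penalty, 0.90), 2),
          round(100 * required_accuracy(penalty, 0.99), 2))
```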
50
Smith n-bit counter predictor
• Replace the prediction bit with an n-bit saturating counter (2-bit shown):

[State diagram: states 11 and 10 predict taken; 01 and 00 predict not-taken. A taken branch (T) moves the counter up, saturating at 11; a not-taken branch (N) moves it down, saturating at 00. Initial state: 01 (using the NT heuristic)]

• Problems with n > 2 to 3: large counters saturate far from the decision boundary and are slow to change direction – Smith called it "inertia"
51
Example of Smith counter
outcome:    T  T  T  T  T  T  N  T  T  N  N  N  N  N  N  N  N  T  T
prev state: 01 10 11 11 11 11 11 10 11 11 10 01 00 00 00 00 00 00 01
new state:  10 11 11 11 11 11 10 11 11 10 01 00 00 00 00 00 00 01 10

6 mispredictions out of 19 branch executions (outcomes 1, 7, 10, 11, 18, 19)

outcome:    T  N  T  N  T  N  ...
prev state: 01 10 01 10 01 10 ...
new state:  10 01 10 01 10 01 ...

19 mispredictions out of 19: the infamous "toggle branch"
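Both traces can be replayed with a few lines of code. A sketch of the 2-bit counter (states 0-3 encode 00-11; the MSB is the prediction):

```python
# 2-bit saturating counter (Smith predictor), initial state 01:
# states 00/01 predict not-taken, 10/11 predict taken; saturate at 00 and 11.
def run(outcomes, state=1):
    miss = 0
    for taken in outcomes:
        miss += ((state >= 2) != taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return miss

seq = [c == "T" for c in "TTTTTTNTTNNNNNNNNTT"]
print(run(seq))                                  # 6 mispredictions out of 19
print(run([c == "T" for c in "TN" * 9 + "T"]))   # toggle branch: 19 out of 19
```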
52
Improving the Smith counter
• Options:
  – Capture correlations between branches ("global")
  – Associate predictions with branch histories, not branch addresses (use a different indexing scheme)
• Gselect (global history with index selection):

[Figure: an n-bit global branch history register (BHR) records the behavior of the last n branches (shift in the most recent outcome). The BHR is combined with an index formed from low-order bits of the branch address to select a table entry; each entry is a two-bit counter (or perhaps simpler)]
53
Gselect example (2-bit BHR; all counters start at 01)

A: BEQZ R1, D
   ...
D: BEQZ R1, F
   ...
F: NOT R1, R1
G: JUMP A
(initially R1 = 0)

R1 flips on every trip around the loop, so A and D are both taken on one trip and both not-taken on the next: each outcome is fully determined by the history in the BHR.

step | BHR | branch | outcome | prediction
  1  | 00  |   A    |    T    | N (miss)  -> new BHR = 01
  2  | 01  |   D    |    T    | N (miss)  -> new BHR = 11
  3  | 11  |   A    |    N    | N (hit)   -> new BHR = 10
  4  | 10  |   D    |    N    | N (hit)   -> new BHR = 00
  5  | 00  |   A    |    T    | T (hit)   -> new BHR = 01
  6  | 01  |   D    |    T    | T (hit)   -> new BHR = 11
 ...

After one warm-up pass, every (branch, BHR) combination has its own trained counter, so all later predictions are correct. (In the original slides, underlined matrix entries mark the counter updated by the last branch execution.)
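The warm-up behavior above can be reproduced with a small simulation. This is a sketch of the gselect idea, not the exact table layout: here each (branch, history) pair simply gets its own 2-bit counter, and the A/D outcome pattern follows the loop in the example:

```python
# Gselect-style sketch: a 2-bit global history register, and one 2-bit counter
# per (branch, history) combination, all initialized to 01 (weakly not-taken).
counters = {}          # (branch, history) -> counter in 0..3
bhr, miss = 0, 0
trace = []
for taken in [True, False] * 6:        # R1 flips each trip: A and D taken together
    trace += [("A", taken), ("D", taken)]
for branch, taken in trace:
    key = (branch, bhr)
    c = counters.setdefault(key, 1)
    miss += ((c >= 2) != taken)
    counters[key] = min(c + 1, 3) if taken else max(c - 1, 0)
    bhr = ((bhr << 1) | taken) & 0b11  # shift in the most recent outcome
print(miss, len(trace))                # only the 2 warm-up executions miss
```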
54
Gshare (global history with index sharing) Branch Predictor
[Figure: the BHR (behavior of the last n branches) is exclusive-ORed with low-order bits of the branch address to index the table; each entry is a two-bit counter (or perhaps simpler). The exclusive-or (hopefully) makes sure each index/BHR combination goes to a different entry in the table]
55
Yeh/Patt predictors
Yeh & Patt, 1993
• Use global/local branch history to build (other) branch predictors
• Naming: G/g = Global, P/p = Per-address
  – GAg: one global history register (GHR) indexes one global pattern history table (PHT); the PHT counters are updated with the branch outcome
  – GAp: a global GHR indexes per-address PHTs (the PC selects the PHT)
  – PAg: per-address history registers (a history table indexed by the PC) index one global PHT
  – PAp: per-address history registers index per-address PHTs

[Figure for a local predictor (pAg): the address of the branch selects a shift register in the history table; that history pattern (e.g., 1110111) indexes the pattern table of 2-bit counters (e.g., entry 01)]
60
Yeh/Patt Example (pAg): toggle branch
A: T N T N T N T N ...   (toggle branch)
B: T T T T T T T T ...   (always taken)

Each branch has its own 2-bit history register in the history table (HT), and both index one shared pattern table (PT) of 2-bit counters, initialized to 01.

• The first executions mispredict while the histories and counters warm up (A: T, pred N; B: T, pred N; ...), and A and B briefly interfere in the shared PT.
• In steady state, A's history alternates between patterns 01 and 10, while B's history sits at 11. PT entries 01 and 10 are "trained" for A (one says not-taken, the other taken) and entry 11 is "trained" for B, so both branches are predicted correctly — including the toggle branch that defeats a plain Smith counter.

In general: provides 96-98% accuracy for integer code
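A small simulation shows the steady state described above. This is a sketch of the local-history idea with the example's tiny sizes (2-bit histories, counters initialized to 01); the interleaving of A and B is an assumption for illustration:

```python
# Local-history sketch (pAg-style): each branch has its own 2-bit history
# register, all indexing one shared pattern table of 2-bit counters (init 01).
hist = {"A": 0, "B": 0}      # per-branch history registers
pt = {}                      # pattern table: history pattern -> counter in 0..3
miss = {"A": 0, "B": 0}
trace = []
for i in range(20):
    trace.append(("A", i % 2 == 0))   # A toggles: T N T N ...
    trace.append(("B", True))         # B is always taken
for branch, taken in trace:
    h = hist[branch]
    c = pt.setdefault(h, 1)
    miss[branch] += ((c >= 2) != taken)
    pt[h] = min(c + 1, 3) if taken else max(c - 1, 0)
    hist[branch] = ((h << 1) | taken) & 0b11
print(miss)   # a couple of warm-up misses each, then both predict perfectly
```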
61
Hybrid predictors
[Figure: two component predictors — e.g., #1 = gshare, #2 = bimodal — each produce a prediction; a chooser (an array of 2-bit counters indexed by the address of the branch) selects which one the pipeline uses]

• Both predictors supply a prediction — the pipeline uses only one
• Chooser updated based on which predictor was correct
  – Increment the chooser counter if #1 was correct, decrement if #2 was correct
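The chooser update rule fits in a few lines. A sketch (the convention that counter values >= 2 mean "use predictor #1" is an assumption; the polarity could equally be reversed):

```python
# Tournament chooser sketch: a 2-bit counter per branch picks between two
# component predictors and moves toward whichever one was correct.
def choose(chooser):            # >= 2 means "use predictor #1" (an assumed convention)
    return 1 if chooser >= 2 else 2

def update_chooser(chooser, p1_correct, p2_correct):
    if p1_correct and not p2_correct:
        return min(chooser + 1, 3)
    if p2_correct and not p1_correct:
        return max(chooser - 1, 0)
    return chooser              # both right or both wrong: no information, no change

c = 2                           # weakly favoring predictor #1
c = update_chooser(c, p1_correct=False, p2_correct=True)  # one step toward #2
print(choose(c))
```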
Tournament Predictor: Alpha 21264
• Local: the PC indexes 1,024 10-bit local histories, each indexing 1,024 3-bit counters
• Global: a 12-bit global history indexes 4,096 2-bit counters
• Chooser: the 12-bit global history also indexes 4,096 2-bit counters that pick local vs. global; all structures are updated with the branch outcome
• High accuracy! SPECfp95: 0.1% mispredict rate; SPECint95: 1.15% mispredict rate
63
Handling Control Hazards: method 3
[Figure: branch prediction – next-PC logic. The PC feeds the I$ and, in parallel, the BTB, the branch predictor, and the RAS. Control logic combines the BTB hit signal, the branch type (conditional branch, jump/call direct, jump/call indirect), and the taken prediction to drive the next-PC MUX, which selects among PC + 4, the BTB's taken target, the RAS prediction, PC+offset from ID, and the recovery PC from EX. The predictor structures are updated from IF/ID, ID/EX, and EX as branches resolve]
64
Control Hazards (revisited…)
• Suppose:
  – Always predict not-taken (e.g., no BTB and no dynamic branch predictor)
  – ID determines when, whether, and where
• Not-taken: no penalty
• Taken: 1-cycle penalty
• Is there a compiler solution?

A: BEQZ R1, X   IF ID EX MEM WB
B (fall-through)   IF -- -- -- --   (squashed if taken)
X (taken target)      IF ID EX MEM WB
65
Canceling branches
• Provide two types of delayed branches – Normal delayed branch
• Delay slot instruction is always executed
– Canceling delayed branch
• Slot filled from target (assumes taken branch): slot is squashed if branch is not-taken (branch likely inst)
• Slot filled from fall-through (assumes not-taken branch): slot is squashed if branch is taken
• Having both types greatly enhances compiler’s ability to fill most slots – Use normal branch when a safe instruction is available to fill
slot (from above, target, or fall-through)
– Use canceling branch when only unsafe candidates are available to fill slot (from target or fall-through)
66
Canceling branches (cont.)
• Change encoding to add a likely bit or to add new opcodes
• If compiler thinks branch is frequently taken
– Compiler sets likely bit = 1
– Compiler fills delay slot from target
– Hardware knows to squash delay slot(s) if branch not-taken
• If compiler thinks branch is frequently not-taken
– Compiler sets likely bit = 0
– Compiler fills delay slot from fall-through
– Hardware knows to squash delay slot(s) if branch taken
• If likely bit set capriciously, most delay slot instructions must be squashed
Instruction format with the likely bit:

| Opcode (6) | rs1 (5) | rd (5) | immediate (15) | likely bit (1) |
67
Methods for setting likely bit
• Likely bit is essentially a static branch prediction
– Compiler makes a prediction that is fixed for that branch
– Likely bit = 1 means predict taken
– Likely bit = 0 means predict not-taken
• Static branch prediction methods (compiler branch prediction)
– Heuristics
– Profiling
68
Heuristics
• Heuristic #1: Branches are not taken
– The majority of conditional branches are not taken
– True about 60% of the time
• Heuristic #2: backward branches are taken, forward branches are not taken (BTFNT)
– Theme: Most backward branches are loops
– Notes:
• Since branches are PC relative, sign bit of offset = the prediction
• Jim Smith reports 70% accuracy for this scheme for scientific workloads
• Heuristic #3: Ball/Larus style predictions
– Set of rules to predict branches in special situations
69
Examples of Ball/Larus predictions
• Detect loops and use BTFNT
• Since error values returned by library functions are negative:
– …and since errors are rare
– Predict BLTZ, BLEZ, etc. not taken
– Predict BGTZ, BGEZ, etc., taken
• If a call is in the body of an if…then, predict the “then” branch as not-taken
– Since most calls in if…thens guard special case code
• Problem:
– Works great for SPECint92 (from which it was designed)
– My code might not work like that!
70
Profiling
• Three steps:
– 1. Run the program [the “profiled run”]
– 2. Record the average preferred direction for each branch (taken or not-taken)
– 3. Recompile to set the likely bits
• What if the program takes inputs (e.g., sort)?
– Collect a representative set of inputs somehow
• Problems:
– One prediction for entire run
– The profiled run is slow!
• Use hardware to collect predictions
• Runs at normal speed: users don’t realize it’s profiling
71
Control hazards (slide 33)
• Branches throw a wrench in the cogs
– Disrupts pipeline because we don’t know what to fetch next
– Problems
• Don’t know we have a branch until decode (ID)
• Don’t know taken target until execute (EX or ID)
• Don’t know branch direction (taken/not taken) until execute (EX)
72
Handling Control Hazards: method 1 (slide 34)
• stall
NOT-TAKEN:
BNE r1, r2         IF ID EX MEM WB
not-taken target      (stall until resolved) IF ID EX MEM WB

TAKEN:
BNE r1, r2         IF ID EX MEM WB
taken target          (stall until resolved) IF ID EX MEM WB

[PC+4 is known in IF; "this is a branch" and PC+offset are now both known in ID; the direction is known in EX. Fetch still stalls until the needed information is available]
73
Handling Control Hazards: method 2 (slide 35)
• predict not-taken

NOT-TAKEN:
BNE r1, r2         IF ID EX MEM WB
not-taken target      IF ID EX MEM WB   (no penalty)

TAKEN:
BNE r1, r2         IF ID EX MEM WB
not-taken target      IF (squashed)
taken target             IF ID EX MEM WB

[Direction (nt) and PC+offset are known in EX; on a taken branch the wrongly fetched fall-through instructions are squashed]
74
Handling Control Hazards: method 3 (cont.)
• Branch prediction
• Case A:
– Correct branch prediction
– (BTB-hit) or (BTB-miss and not-taken)
BNE r1, r2       IF ID EX MEM WB
correct target      IF ID EX MEM WB   (no penalty)
75
Handling Control Hazards: method 3 (cont.)
• Case B:
– Correct branch prediction
– BTB-miss and taken
BNE r1, r2         IF ID EX MEM WB
not-taken target      IF (squashed)
taken target             IF ID EX MEM WB
76
Handling Control Hazards: method 3 (cont.)
• Case C: – Incorrect branch prediction
– BTB-hit
– Think about the following case: predict a taken branch
(incorrect prediction) but BTB miss.
BNE r1, r2          IF ID EX MEM WB
incorrect target       IF ID (squashed when the branch resolves)
correct target               IF ID EX MEM WB
77
Handling Control Hazards: method 4
• Compiler based approaches: delayed branch & canceling branch
– HW support
• Different opcodes for different types of branches
• Likely bit in canceling branches
– Compiler support
• Move the code
• Set the likely bit
Branch Performance
• Ex: the BTB has a 90% hit rate, prediction accuracy is 95%, and 60% of branches are taken
• Also, any BTB miss that stalls the pipe stalls it 1 extra cycle to update the BTB
• What is the misprediction penalty for this pipeline?
  – (.90)[(.05)(2)] + (.10)[(.4)(0) + (.6)(6+1)]
    = (.9)(.1) + (.1)(4.2) = .09 + .42 = .51 cycles/branch
• Note: delayed branches were about .5 cycles/branch for the simple pipe
• Improvement increases quickly with better prediction accuracy

[Deeper pipeline assumed: IF D1 D2 D3 R E ... — the prediction is made in IF, the target is known after decode, and the branch direction is known at execute]
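The penalty arithmetic can be checked directly. A sketch using the example's figures (the 2-cycle and 6+1-cycle penalties are taken from the example as given):

```python
# Misprediction-penalty arithmetic: BTB hit rate 90%, prediction accuracy 95%,
# 60% of branches taken; a BTB-hit mispredict costs 2 cycles, a BTB-miss taken
# branch costs 6 cycles plus 1 cycle to update the BTB.
hit, acc, taken = 0.90, 0.95, 0.60
penalty = hit * (1 - acc) * 2 + (1 - hit) * (taken * (6 + 1) + (1 - taken) * 0)
print(round(penalty, 2))   # 0.51 cycles per branch
```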
Exceptions: Harder Still
• Exceptions: interrupt instruction execution unexpectedly
• Harder to handle in pipelines since there is more overlap
– may have five instructions in flight when exception is raised
• Common exceptions:
– I/O device interrupt
– OS call
– Arithmetic overflow, FP anomaly
– Page fault
– Misaligned memory access
– Memory protection violation
– Illegal instruction
– Power failure
Restartable Exceptions
• The difficulty: implementing restartable exceptions
– exceptions that must restart interrupted instruction
• arithmetic overflow
• page fault
• some I/O
• Another program must be invoked to:
– save state
– correct exceptional condition
– restore state
• Invisible to original program
– restartable exceptions needed for virtual memory
Precise Exceptions and Pipelining
• CPU has precise exceptions if faulting instruction and after can be stopped and restarted, and:
– all previous instructions executed and committed
– no later instruction committed
• Required for implementing virtual memory/demand paging
• Save PC and state at excepting instruction
– later on restore state, restart on that PC
• Force exception vector into pipe for next IF
• Turn off all writes for faulting instruction and later
– earlier instructions finish as normal
• OS saves excepting PC+register file and handles exception
– Does this always work?
Implementing Exceptions
• What if EPC is in a branch delay slot?
– can we restart the instruction in the delay slot?
• Must save EPC for faulting instruction – but restart from branch!
– obvious: save 2 PCs
• OS issues rfe (return from exception) instruction
– restarts user program and re-enters user mode
Precise Exceptions
• Required for implementing virtual memory/demand paging
– all mps implement precise exceptions for integer pipes
– also required for IEEE 754 floating-point compliance
• FP execute out of order for performance
– hard to achieve precise exceptions there
• Some provide two modes: (a) performance mode, (b) precise mode
– precise mode restricts overlap – slow
– Alpha 21064/21164, MIPS R8000
• ~10× slower
Out-of-order Exceptions
lw IF ID EX MEM WB
add IF ID EX MEM WB
• Load can cause (among other things) page fault in MEM
• Add can cause (among other things) overflow in EX
• Two exceptions in the same cycle!
– handle page fault first
– in case of tie handle “earlier” instruction
• It gets worse…
Out-of-order Exceptions
lw IF ID EX MEM WB
add IF ID EX MEM WB
• Load can cause (among other things) page fault in MEM
• Add can cause (among other things) page fault in IF
– need precise integer exceptions – cannot handle this one yet
– need to wait until load completes (or causes an exception)
• Implement exception status vector, carried with each instruction
– turn off all writes on an exception
– prevent stores in MEM stage
– check vector in WB (instructions before are complete)
– handle earliest exception in program order
When Does State Change?
• An instruction is committed when it is guaranteed to complete
– easy to restart if state changed only when committed
• MIPS: state changes in WB
• VAX: auto-increment addressing modes update state in the middle of an instruction – needs HW support to back out ("roll back") state changes
• Some architectures have string-copy instructions that update memory as they go – cannot undo 100%
  – Instead, general-purpose registers hold all intermediate state
  – The instruction continues after the exception rather than restarting