Lec 04 Pipelined Processor
-
Pipelined Processor Design
M S Bhat
Dept. of E&C,
NITK Surathkal
-
VL722 2 8/16/2015
Presentation Outline
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
-
Laundry Example: Three Stages
1. Wash dirty load of clothes
2. Dry wet clothes
3. Fold and put clothes into drawers
Each stage takes 30 minutes to complete
Four loads of clothes to wash, dry, and fold
[Figure: four loads of laundry, labeled A, B, C, D]
Pipelining Example
-
Sequential laundry takes 6 hours for 4 loads
Intuitively, we can use pipelining to speed up laundry
Sequential Laundry
[Timing diagram: loads A, B, C, D washed sequentially from 6 PM to midnight; each 30-minute stage of a load completes before the next load starts]
-
Pipelined laundry takes 3 hours for 4 loads
Speedup factor is 2 for 4 loads
Time to wash, dry, and fold one load is still the
same (90 minutes)
Pipelined Laundry: Start Load ASAP
[Timing diagram: loads A, B, C, D overlapped from 6 PM to 9 PM; a new load starts every 30 minutes while earlier loads are still in later stages]
-
Serial Execution versus Pipelining
Consider a task that can be divided into k subtasks
The k subtasks are executed on k different stages
Each subtask requires one time unit
The total execution time of the task is k time units
The idea of pipelining is to overlap the execution of the subtasks
The k stages work in parallel on k different tasks
Tasks enter/leave pipeline at the rate of one task per time unit
[Space-time diagram: six tasks flowing through stages 1, 2, ..., k, first serially and then overlapped]
Without Pipelining
One completion every k time units
With Pipelining
One completion every 1 time unit
-
Synchronous Pipeline
Uses clocked registers between stages
Upon arrival of a clock edge
All registers hold the results of previous stages simultaneously
The pipeline stages are combinational logic circuits
It is desirable to have balanced stages
Approximately equal delay in all stages
Clock period is determined by the maximum stage delay
[Figure: pipeline stages S1, S2, ..., Sk separated by clocked registers; all registers are driven by a common clock; input enters S1 and output leaves Sk]
-
Let ti = time delay in stage Si
Clock cycle t = max(ti) is the maximum stage delay
Clock frequency f = 1/t = 1/max(ti)
A pipeline can process n tasks in k + n - 1 cycles
k cycles are needed to complete the first task
n - 1 cycles are needed to complete the remaining n - 1 tasks
Pipeline Performance
Ideal speedup of a k-stage pipeline over serial execution:
Sk = (serial execution cycles) / (pipelined execution cycles) = nk / (k + n - 1), which approaches k for large n
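The cycle-count and speedup formulas above can be checked with a short Python sketch (the function names are my own, not from the slides):

```python
# Sketch of the slide's performance model: an ideal k-stage pipeline,
# one time unit per stage, processing n tasks.

def pipelined_cycles(k, n):
    # k cycles for the first task, then one completion per cycle
    return k + n - 1

def speedup(k, n):
    # serial execution needs n * k cycles
    return (n * k) / pipelined_cycles(k, n)

print(pipelined_cycles(5, 100))   # 104 cycles
print(round(speedup(5, 100), 2))  # 4.81, approaching k = 5 for large n
```

For n = 1,000,000 tasks the speedup rounds to 5.0, matching the "Sk approaches k" claim.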
-
MIPS Processor Pipeline
Five stages, one cycle per stage
1. IF: Instruction Fetch from instruction memory
2. ID: Instruction Decode, register read, and J/Br address
3. EX: Execute operation or calculate load/store address
4. MEM: Memory access for load and store
5. WB: Write Back result to register
-
Performance Example
Assume the following operation times for components:
Instruction and data memories: 200 ps
ALU and adders: 180 ps
Decode and Register file access (read or write): 150 ps
Ignore the delays in PC, mux, extender, and wires
Which of the following would be faster and by how much?
Single-cycle implementation for all instructions
Multicycle implementation optimized for every class of instructions
Assume the following instruction mix:
40% ALU, 20% Loads, 10% stores, 20% branches, & 10% jumps
-
Single-Cycle vs Multicycle Implementation
Break instruction execution into five steps
Instruction fetch
Instruction decode and register read
Execution, memory address calculation, or branch completion
Memory access or ALU instruction completion
Load instruction completion
One step = One clock cycle (clock cycle is reduced)
First 2 steps are the same for all instructions
Instruction    # cycles
ALU & Store    4
Load           5
Branch         3
Jump           2
-
Solution

Instruction  Instruction  Register  ALU        Data    Register
Class        Memory       Read      Operation  Memory  Write     Total
ALU          200          150       180                150       680 ps
Load         200          150       180        200     150       880 ps
Store        200          150       180        200               730 ps
Branch       200          150       180                          530 ps
Jump         200          150 (decode and update PC)             350 ps

For the fixed single-cycle implementation:
Clock cycle = 880 ps, determined by the longest delay (load instruction)
For the multicycle implementation:
Clock cycle = max(200, 150, 180) = 200 ps (maximum delay at any step)
Average CPI = 0.4×4 + 0.2×5 + 0.1×4 + 0.2×3 + 0.1×2 = 3.8
Speedup = 880 ps / (3.8 × 200 ps) = 880 / 760 = 1.16
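As a cross-check, the average CPI and speedup can be recomputed in a few lines of Python (the cycle counts and instruction mix are the ones given on the slides):

```python
# Multicycle vs. single-cycle comparison, using the slide's numbers.
cycles = {"alu": 4, "load": 5, "store": 4, "branch": 3, "jump": 2}
mix    = {"alu": 0.40, "load": 0.20, "store": 0.10, "branch": 0.20, "jump": 0.10}

avg_cpi = sum(mix[c] * cycles[c] for c in cycles)   # 3.8
single_cycle = 880   # ps, set by the slowest instruction (load)
multi_cycle  = 200   # ps, set by the slowest step (memory access)

speedup = single_cycle / (avg_cpi * multi_cycle)
print(round(avg_cpi, 1), round(speedup, 2))   # 3.8 1.16
```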
-
Single-Cycle vs Pipelined Performance
Consider a 5-stage instruction execution in which
Instruction fetch = Data memory access = 200 ps
ALU operation = 180 ps
Register read = register write = 150 ps
What is the clock cycle of the single-cycle processor?
What is the clock cycle of the pipelined processor?
What is the speedup factor of pipelined execution?
Solution
Single-Cycle Clock = 200+150+180+200+150 = 880 ps
[Diagram: two instructions executing sequentially, each taking one full 880 ps cycle (IF, Reg, ALU, MEM, Reg)]
-
Single-Cycle versus Pipelined (cont'd)
Pipelined clock cycle = max(200, 180, 150) = 200 ps
CPI for pipelined execution = 1
One instruction is completed in each cycle (ignoring pipeline fill)
Speedup of pipelined execution = 880 ps / 200 ps = 4.4
Instruction count and CPI are equal in both cases
Speedup factor is less than 5 (the number of pipeline stages)
Because the pipeline stages are not balanced
Throughput = 1 / max(delay) = 1 / 200 ps = 5 × 10^9 instructions/sec
[Diagram: three overlapped instructions (IF, Reg, ALU, MEM, Reg), each stage taking 200 ps]
-
Pipeline Performance Summary
-
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
-
Single-Cycle Datapath
Shown below is the single-cycle datapath
How to pipeline this single-cycle datapath?
Answer: Introduce pipeline register at end of each stage
[Figure: single-cycle datapath annotated with the stage boundaries (IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back): PC, instruction memory, register file, ALU, data memory, and Next PC logic, with control signals PCSrc (J, Beq, Bne), RegDst, ExtOp, ALUSrc, ALUCtrl, MemRead, MemWrite, MemtoReg, RegWrite]
-
Pipelined Datapath Pipeline registers are shown in green, including the PC
Same clock edge updates all pipeline registers, register file, and data memory (for store instruction)
[Figure: pipelined datapath; the pipeline registers (shown in green, including the PC) are NPC, Instruction, A, B, Imm, NPC2, ALUout, D, and WBData, inserted at the IF, ID, EX, MEM, and WB stage boundaries]
-
Problem with Register Destination
Is there a problem with the register destination address?
Instruction in the ID stage different from the one in the WB stage
The instruction in the WB stage does not write to its own destination register but to the destination (Rd) of the different instruction currently in the ID stage
[Figure: the pipelined datapath, in which the register file's write address RW is driven by the Rd field of the instruction in ID rather than by the instruction in WB]
-
Pipelining the Destination Register
The destination register number should be pipelined
Destination register number is passed from ID to WB stage
The WB stage writes back data knowing the destination register
[Figure: pipelined datapath with the destination register number carried through pipeline registers Rd, Rd2, Rd3, Rd4 from the ID stage to the WB stage]
-
Graphically Representing Pipelines
Multiple instruction execution over multiple clock cycles
Instructions are listed in execution order from top to bottom
Clock cycles move from left to right
Figure shows the use of resources at each stage and each cycle
[Diagram: lw R14, 8(R21); add R17, R18, R19; ori R20, R11, 7; sub R13, R18, R11; sw R18, 10(R19) entering the pipeline in cycles CC1 through CC5, each occupying IF, Reg, ALU, DM, Reg in successive cycles]
-
Instruction-Time Diagram
The instruction-time diagram shows:
which instruction occupies which stage at each clock cycle
how the instruction flow is pipelined over the 5 stages
[Diagram: lw R15, 8(R19); lw R14, 8(R21); ori R12, R19, 7; sub R13, R18, R11; sw R18, 10(R19) flowing through IF, ID, EX, MEM, WB over cycles CC1 to CC9]
Up to five instructions can be in the pipeline during the same cycle: Instruction Level Parallelism (ILP)
ALU instructions skip the MEM stage
Store instructions skip the WB stage
-
Control Signals
The same control signals used in the single-cycle datapath
[Figure: pipelined datapath with the control signals ALUCtrl, RegWrite, RegDst, ALUSrc, MemWrite, MemtoReg, MemRead, ExtOp, and PCSrc (Bne, Beq, J) attached to the IF, ID, EX, MEM, and WB stages that use them]
-
Pipelined Control
Pass the control signals along the pipeline just like the data
[Figure: the Main & ALU Control unit generates all control signals in the ID stage; the EX, MEM, and WB signal groups travel with the instruction through extended pipeline registers]
-
Pipelined Control Cont'd
ID stage generates all the control signals
Pipeline the control signals as the instruction moves
Extend the pipeline registers to include the control signals
Each stage uses some of the control signals
Instruction Decode and Register Read
Control signals are generated
RegDst is used in this stage
Execution Stage => ExtOp, ALUSrc, and ALUCtrl
Next PC uses J, Beq, Bne, and zero signals for branch control
Memory Stage => MemRead, MemWrite, and MemtoReg
Write Back Stage => RegWrite is used in this stage
-
Control Signals Summary

Op      RegDst  ALUSrc  ExtOp   J  Beq  Bne  ALUCtrl  MemRd  MemWr  MemReg  RegWrite
R-Type  1=Rd    0=Reg   x       0  0    0    func     0      0      0       1
addi    0=Rt    1=Imm   1=sign  0  0    0    ADD      0      0      0       1
slti    0=Rt    1=Imm   1=sign  0  0    0    SLT      0      0      0       1
andi    0=Rt    1=Imm   0=zero  0  0    0    AND      0      0      0       1
ori     0=Rt    1=Imm   0=zero  0  0    0    OR       0      0      0       1
lw      0=Rt    1=Imm   1=sign  0  0    0    ADD      1      0      1       1
sw      x       1=Imm   1=sign  0  0    0    ADD      0      1      x       0
beq     x       0=Reg   x       0  1    0    SUB      0      0      x       0
bne     x       0=Reg   x       0  0    1    SUB      0      0      x       0
j       x       x       x       1  0    0    x        0      0      x       0

(RegDst is used in the Decode stage; ALUSrc, ExtOp, J, Beq, Bne, and ALUCtrl in the Execute stage; MemRd, MemWr, and MemReg in the Memory stage; RegWrite in Write Back.)
-
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
-
Hazards: situations that would cause incorrect execution if the next instruction were launched during its designated clock cycle
1. Structural hazards
Caused by resource contention
Using same resource by two instructions during the same cycle
2. Data hazards
An instruction may compute a result needed by next instruction
Hardware can detect dependencies between instructions
3. Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control
Hazards complicate pipeline control and limit performance
Pipeline Hazards
-
Structural Hazards Problem
Attempt to use the same hardware resource by two different
instructions during the same cycle
Example
Writing back ALU result in stage 4
Conflict with writing load data in stage 5
[Diagram: lw R14, 8(R21); ori R12, R19, 7; sub R13, R18, R19; sw R18, 10(R19); the ori (writing back in stage 4) and the lw (writing back in stage 5) reach the register file in the same cycle]
Structural Hazard: two instructions are attempting to write the register file during the same cycle
-
Resolving Structural Hazards
This is a serious hazard that cannot be ignored
Solution 1: Delay Access to Resource
Must have mechanism to delay instruction access to resource
Delay all write backs to the register file to stage 5
ALU instructions bypass stage 4 (memory) without doing anything
Solution 2: Add more hardware resources (more costly)
Add more hardware to eliminate the structural hazard
Redesign the register file to have two write ports
First write port can be used to write back ALU results in stage 4
Second write port can be used to write back load data in stage 5
-
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
-
Dependency between instructions causes a data hazard
The dependent instructions are close to each other
Pipelined execution might change the order of operand access
Read After Write (RAW) Hazard
Given two instructions I and J, where I comes before J
Instruction J should read an operand after it is written by I
Called a data dependence in compiler terminology
I: add R17, R18, R19 # R17 is written
J: sub R20, R17, R19 # R17 is read
Hazard occurs when J reads the operand before I writes it
Data Hazards
-
Example of a RAW Data Hazard
The result of sub is needed by the add, or, and, & sw instructions
Instructions add & or will read the old value of R18 from the register file
During CC5, R18 is written at the end of the cycle
[Diagram: sub R18, R9, R11 followed by add R20, R18, R13; or R22, R11, R18; and R23, R12, R18; sw R24, 10(R18); R18 still holds the old value (10) until the sub writes the new value (20) at the end of CC5]
-
Solution 1: Stalling the Pipeline
Three stall cycles during CC3 thru CC5 (wasting 3 cycles)
Stall cycles delay execution of add & fetching of or instruction
The add instruction cannot read R18 until beginning of CC6
The add instruction remains in the Instruction register until CC6
The PC register is not modified until beginning of CC6
[Diagram: sub R18, R9, R11 followed by add R20, R18, R13 and or R22, R11, R18; three stall cycles in CC3 to CC5 hold the add in decode until CC6, when R18 holds the new value 20]
-
Solution 2: Forwarding ALU Result
The ALU result is forwarded (fed back) to the ALU input
No bubbles are inserted into the pipeline and no cycles are wasted
The ALU result is forwarded from the ALU, MEM, and WB stages
[Diagram: sub R18, R9, R11; add R20, R18, R13; or R22, R11, R18; and R23, R22, R18; sw R24, 10(R18) execute without stalls; the sub result (20) is forwarded to each dependent instruction]
-
Implementing Forwarding
Two multiplexers are added at the inputs of the A & B registers
Data from the ALU stage, MEM stage, and WB stage is fed back
Two signals, ForwardA and ForwardB, control the forwarding
[Figure: forwarding datapath; 4-to-1 multiplexers controlled by ForwardA and ForwardB choose between BusA/BusB from the register file, the ALU result, the MEM-stage value, and the WB-stage value]
-
Forwarding Control Signals
Signal Explanation
ForwardA = 0 First ALU operand comes from register file = Value of (Rs)
ForwardA = 1 Forward result of previous instruction to A (from ALU stage)
ForwardA = 2 Forward result of 2nd previous instruction to A (from MEM stage)
ForwardA = 3 Forward result of 3rd previous instruction to A (from WB stage)
ForwardB = 0 Second ALU operand comes from register file = Value of (Rt)
ForwardB = 1 Forward result of previous instruction to B (from ALU stage)
ForwardB = 2 Forward result of 2nd previous instruction to B (from MEM stage)
ForwardB = 3 Forward result of 3rd previous instruction to B (from WB stage)
-
Forwarding Example
[Figure: the forwarding datapath with lw R12, 4(R8) in the MEM stage, ori R15, R9, 2 in the ALU stage, and sub R11, R12, R15 in the decode stage]
Instruction sequence:
lw R12, 4(R8)
ori R15, R9, 2
sub R11, R12, R15
When the sub instruction is in the decode stage:
ori will be in the ALU stage
lw will be in the MEM stage
ForwardA = 2 (forward from the MEM stage); ForwardB = 1 (forward from the ALU stage)
-
RAW Hazard Detection
The current instruction being decoded is in the Decode stage
Previous instruction is in the Execute stage
Second previous instruction is in the Memory stage
Third previous instruction in the Write Back stage
if ((Rs != 0) and (Rs == Rd2) and (EX.RegWrite)) ForwardA = 1
else if ((Rs != 0) and (Rs == Rd3) and (MEM.RegWrite)) ForwardA = 2
else if ((Rs != 0) and (Rs == Rd4) and (WB.RegWrite)) ForwardA = 3
else ForwardA = 0

if ((Rt != 0) and (Rt == Rd2) and (EX.RegWrite)) ForwardB = 1
else if ((Rt != 0) and (Rt == Rd3) and (MEM.RegWrite)) ForwardB = 2
else if ((Rt != 0) and (Rt == Rd4) and (WB.RegWrite)) ForwardB = 3
else ForwardB = 0
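The detection rules above can be sketched as a small Python function; this is a software model only, with signal names taken from the slide:

```python
# Sketch of the slide's forwarding logic. Rd2/Rd3/Rd4 are the destination
# registers of the instructions in EX, MEM, and WB; the *_regwrite flags
# are the pipelined RegWrite signals.

def forward_select(src_reg, rd2, rd3, rd4,
                   ex_regwrite, mem_regwrite, wb_regwrite):
    """Return the forwarding mux select for one ALU operand."""
    if src_reg != 0:                       # register 0 is never forwarded
        if src_reg == rd2 and ex_regwrite:
            return 1                       # forward from the ALU stage
        if src_reg == rd3 and mem_regwrite:
            return 2                       # forward from the MEM stage
        if src_reg == rd4 and wb_regwrite:
            return 3                       # forward from the WB stage
    return 0                               # no hazard: read the register file

# sub R11, R12, R15 with lw R12 in MEM and ori R15 in ALU (slide 38):
print(forward_select(12, 15, 12, 0, True, True, False))  # 2
print(forward_select(15, 15, 12, 0, True, True, False))  # 1
```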
-
Hazard Detect and Forward Logic
[Figure: the Hazard Detect and Forward unit compares Rs and Rt against Rd2, Rd3, and Rd4 together with the pipelined RegWrite signals in EX, MEM, and WB, and drives the ForwardA and ForwardB mux selects]
-
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Pipeline Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
-
Load Delay
Unfortunately, not all data hazards can be forwarded
A load has a delay that cannot be eliminated by forwarding
In the example shown below:
The lw instruction does not read its data until the end of CC4
It cannot forward the data to add at the end of CC3: NOT possible
However, the load can forward data to the 2nd next and later instructions
[Diagram: lw R18, 20(R17) followed by add R20, R18, R13; or R14, R11, R18; and R15, R18, R12; the add needs R18 in its ALU cycle (CC4) before the load has read it]
-
Detecting RAW Hazard after Load
Detecting a RAW hazard after a Load instruction:
The load instruction will be in the EX stage
Instruction that depends on the load data is in the decode stage
Condition for stalling the pipeline
if ((EX.MemRead == 1) // Detect Load in EX stage
and (ForwardA==1 or ForwardB==1)) Stall // RAW Hazard
Insert a bubble into the EX stage after a load instruction
Bubble is a no-op that wastes one clock cycle
Delays the dependent instruction after the load by one cycle
Because of RAW hazard
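The stall condition above can be written as a one-line Python check (a software sketch; the signal names follow the slide):

```python
# Sketch of the load-delay stall condition. EX.MemRead marks a load in the
# EX stage; ForwardA/ForwardB == 1 means the instruction in decode wants
# the EX-stage result, which a load cannot supply in time.

def must_stall(ex_memread, forward_a, forward_b):
    return ex_memread and (forward_a == 1 or forward_b == 1)

# lw R18, 20(R17) in EX and add R20, R18, R13 in decode: stall one cycle
print(must_stall(True, 1, 0))    # True
# the same dependence on an ALU result instead of a load: forwarding works
print(must_stall(False, 1, 0))   # False
```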
-
Stall the Pipeline for One Cycle
The add instruction depends on lw: stall at CC3
Allow the load instruction in the ALU stage to proceed
Freeze the PC and Instruction registers (NO instruction is fetched)
Introduce a bubble into the ALU stage (a bubble is a NO-OP)
The load can forward its data to the next instruction after delaying it
[Diagram: lw R18, 20(R17); add R20, R18, R13; or R14, R11, R18 over cycles CC1 to CC8, with a one-cycle bubble inserted ahead of the add's ALU stage]
-
Showing Stall Cycles
Stall cycles can be shown on the instruction-time diagram
The hazard is detected in the Decode stage
Stall indicates that an instruction is delayed
Instruction fetching is also delayed after a stall
Data forwarding is shown using blue arrows
Example:
[Diagram: lw R17, (R13); lw R18, 8(R17); add R2, R18, R11; sub R1, R18, R2 over cycles CC1 to CC10; the second lw and the add each stall for one cycle in decode]
-
Hazard Detect, Forward, and Stall
[Figure: the forwarding datapath extended for stalling; MemRead from the EX stage feeds the hazard unit, a Stall signal disables the PC, and a mux forces the control signals to 0 to insert a bubble]
-
Code Scheduling to Avoid Stalls
Compilers reorder code in a way that avoids load stalls
Consider the translation of the following statements:
A = B + C; D = E - F; // A thru F are in Memory
Slow code:
lw R8, 4(R16) # &B = 4(R16)
lw R9, 8(R16) # &C = 8(R16)
add R10, R8, R9 # stall cycle
sw R10, 0(R16) # &A = 0(R16)
lw R11, 16(R16) # &E = 16(R16)
lw R12, 20(R16) # &F = 20(R16)
sub R13, R11, R12 # stall cycle
sw R13, 12(R0) # &D = 12(R0)
Fast code: No Stalls
lw R8, 4(R16)
lw R9, 8(R16)
lw R11, 16(R16)
lw R12, 20(R16)
add R10, R8, R9
sw R10, 0(R16)
sub R13, R11, R12
sw R13, 12(R0)
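A rough way to see the effect of scheduling is to count load-use pairs. The sketch below models each instruction as (destination, sources, is_load), assuming full forwarding and a one-cycle load delay as on these slides; the encoding of the two sequences is mine:

```python
# Count load-use stall cycles in a straight-line sequence: a stall occurs
# when an instruction reads the destination of the immediately preceding
# load. Register names match the slide's example.

def load_stalls(seq):
    stalls = 0
    for prev, cur in zip(seq, seq[1:]):
        dest, _, is_load = prev
        if is_load and dest in cur[1]:   # next instruction uses the load
            stalls += 1
    return stalls

# slow code: add/sub immediately follow the loads they depend on
slow = [("R8",  ["R16"], True),  ("R9",  ["R16"], True),
        ("R10", ["R8", "R9"], False), (None, ["R10", "R16"], False),
        ("R11", ["R16"], True),  ("R12", ["R16"], True),
        ("R13", ["R11", "R12"], False), (None, ["R13", "R0"], False)]

# fast code: the four loads are hoisted, separating loads from their uses
fast = slow[:2] + slow[4:6] + slow[2:4] + slow[6:]

print(load_stalls(slow), load_stalls(fast))  # 2 0
```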
-
Instruction J writes its result after it is read by I
Called anti-dependence by compiler writers
I: sub R12, R9, R11 # R9 is read
J: add R9, R10, R11 # R9 is written
Results from reuse of the name R9
NOT a data hazard in the 5-stage pipeline because:
Reads are always in stage 2
Writes are always in stage 5, and
Instructions are processed in order
Anti-dependence can be eliminated by renaming
Use a different destination register for add (e.g., R13)
Name Dependence: Write After Read
-
Name Dependence: Write After Write
Same destination register is written by two instructions
Called output-dependence in compiler terminology
I: sub R9, R12, R11 # R9 is written
J: add R9, R10, R11 # R9 is written again
Not a data hazard in the 5-stage pipeline because:
All writes are ordered and always take place in stage 5
However, can be a hazard in more complex pipelines
If instructions are allowed to complete out of order, and
Instruction J completes and writes R9 before instruction I
Output dependence can be eliminated by renaming R9
Read After Read is NOT a name dependence
-
Hazards due to Loads and Stores
Consider the following statements:
sw R10, 0(R1)
lw R11, 6(R5)
Is there any Data Hazard possible in the above code sequence?
If 0(R1) == 6(R5), both refer to the same memory location!!
This is a data dependency through the data memory
But there is no data hazard, since the two instructions do not access the memory simultaneously: the store writes and the load reads in two consecutive clock cycles
In an out-of-order execution processor, however, this can lead to a hazard!
-
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
-
Control Hazards
Jump and Branch can cause great performance loss
Jump instruction needs only the jump target address
Branch instruction needs two things:
Branch Result Taken or Not Taken
Branch Target Address
PC + 4, if the branch is NOT taken
PC + 4 + 4 × immediate, if the branch is taken
Jump and Branch targets are computed in the ID stage
At which point a new instruction is already being fetched
Jump Instruction: 1-cycle delay
Branch: 2-cycle delay for branch result (taken or not taken)
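Assuming the usual MIPS-style encoding (byte-addressed PC, word-aligned instructions, 16-bit signed branch offset counted in instructions), the branch target computation above can be sketched as:

```python
# Sketch of the next-PC computation for a conditional branch. The 16-bit
# immediate is sign-extended and shifted left by 2 to become a byte offset.

def next_pc(pc, imm16, taken):
    if not taken:
        return pc + 4
    # sign-extend the 16-bit immediate, then scale it to a byte offset
    offset = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    return pc + 4 + (offset << 2)

print(hex(next_pc(0x1000, 0x0004, True)))   # 0x1014
print(hex(next_pc(0x1000, 0x0004, False)))  # 0x1004
```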
-
2-Cycle Branch Delay
Control logic detects a Branch instruction in the 2nd Stage
ALU computes the Branch outcome in the 3rd Stage
Next1 and Next2 instructions will be fetched anyway
Convert Next1 and Next2 into bubbles if branch is taken
[Diagram: beq R9, R10, L1 is fetched in CC1 and its outcome is computed by the ALU in CC3; Next1 and Next2 are fetched in CC2 and CC3 anyway and become bubbles if the branch is taken; the target instruction at L1 is fetched in CC4]
-
Implementing Jump and Branch
Branch delay = 2 cycles
The branch target & outcome are computed in the ALU stage
[Figure: pipelined datapath with the Next PC logic selecting the jump or branch target address; a mux clears the control signals (Bubble = 0) to flush the wrongly fetched instructions]
-
Predict Branch NOT Taken
Branches can be predicted to be NOT taken
If branch outcome is NOT taken then
Next1 and Next2 instructions can be executed
Do not convert Next1 & Next2 into bubbles
No wasted cycles
Else, convert Next1 and Next2 into bubbles: 2 wasted cycles
[Diagram: beq R9, R10, L1 with outcome NOT taken; Next1 and Next2 proceed normally through the pipeline, so no cycles are wasted]
-
Reducing the Delay of Branches
Branch delay can be reduced from 2 cycles to just 1 cycle
Branches can be determined earlier in the Decode stage
A comparator is used in the decode stage to determine branch decision, whether the branch is taken or not
Because forwarded data must arrive before the comparison, the delay of the decode stage grows, and this can also increase the clock cycle
Only one instruction that follows the branch is fetched
If the branch is taken then only one instruction is flushed
We should insert a bubble after jump or taken branch
This will convert the next instruction into a NOP
-
Reducing Branch Delay to 1 Cycle
[Figure: the branch comparator is moved into the decode stage; forwarded data is compared there (a longer cycle), and a Reset signal converts the single instruction fetched after a jump or taken branch into a bubble]
-
Based on SPEC benchmarks on DLX
Branches occur with a frequency of 14% to 16% in integer
programs and 3% to 12% in floating point programs.
About 75% of the branches are forward branches
60% of forward branches are taken
85% of backward branches are taken
67% of all branches are taken
Why are branches (especially backward branches) more
likely to be taken than not taken?
Branch Behavior in Programs
-
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
Execute successor instructions in sequence
Flush instructions in pipeline if branch actually taken
Advantage of late pipeline state update
33% DLX branches not taken on average
PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
67% DLX branches taken on average
#4: Define the branch to take place AFTER the n following instructions (Delayed branch)
Four Branch Hazard Alternatives
-
Next . . .
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
-
Branch Hazard Alternatives
Predict Branch Not Taken (previously discussed)
Successor instruction is already fetched
Do NOT Flush instruction after branch if branch is NOT taken
Flush only instructions appearing after Jump or taken branch
Delayed Branch
Define branch to take place AFTER the next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)
Dynamic Branch Prediction
Loop branches are taken most of time
Must reduce branch delay to 0, but how?
How to predict branch behavior at runtime?
-
Define branch to take place after the next instruction
Instruction in branch delay slot is always executed
Compiler (tries to) move a useful instruction into delay slot.
From before the Branch: Always helpful when possible
For a 1-cycle branch delay, we have one delay slot
branch instruction
branch delay slot (next instruction)
branch target (if branch taken)
Compiler fills the branch delay slot
By selecting an independent instruction
From before the branch
If no independent instruction is found
Compiler fills delay slot with a NO-OP
Delayed Branch
Example (the add is independent of the branch, so it can fill the delay slot):

Before scheduling:              After scheduling:
label: . . .                    label: . . .
    add R10, R11, R12               beq R17, R16, label
    beq R17, R16, label             add R10, R11, R12   (delay slot)
    <delay slot>
-
Delayed Branch: From the Target
Helps when the branch is taken. May duplicate instructions.

Before:                          After:
    ADD  R2, R1, R3                  ADD  R2, R1, R3
    BEQZ R2, L1                      BEQZ R2, L2
    <delay slot>                     SUB  R4, R5, R6
    . . .                            . . .
L1: SUB  R4, R5, R6              L1: SUB  R4, R5, R6
L2: . . .                        L2: . . .

Instructions between the BEQZ and the SUB (on the fall-through path) must not use R4.
Why is the instruction at L1 duplicated? (Because L1 can be reached by another path!!) What if R5 or R6 changed?
-
Delayed Branch: From Fall Through
Helps when the branch is not taken (the default case).

Before:                          After:
    ADD  R2, R1, R3                  ADD  R2, R1, R3
    BEQZ R2, L1                      BEQZ R2, L1
    <delay slot>                     SUB  R4, R5, R6
    SUB  R4, R5, R6                  . . .
    . . .
L1: . . .                        L1: . . .

Instructions at the target (L1 and after) must not use R4 until it is written again.
-
Compiler effectiveness for single branch delay slot:
Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots are useful in computation
About 50% (i.e., 60% x 80%) of slots usefully filled
Canceling branches or nullifying branches
Include a prediction of whether the branch is taken or not taken
If the prediction is correct, the instruction in the delay slot is executed
If the prediction is incorrect, the instruction in the delay slot is quashed.
Allow more slots to be filled from the target address or fall through
Filling delay slots
-
New meaning for branch instruction
Branching takes place after next instruction (Not immediately!)
Impacts software and compiler
Compiler is responsible to fill the branch delay slot
For a 1-cycle branch delay, there is one branch delay slot
However, modern processors are deeply pipelined
Branch penalty is multiple cycles in deeper pipelines
Multiple delay slots are difficult to fill with useful instructions
MIPS used delayed branching in earlier pipelines
However, delayed branching is not useful in recent processors
Drawback of Delayed Branching
-
Two strategies examined
Backward branch predict taken, forward branch not taken
Profile-based prediction: record branch behavior, predict branch based on prior run
[Chart: instructions per mispredicted branch (log scale, 1 to 100,000) for the SPEC benchmarks alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, and tomcatv, comparing profile-based and direction-based prediction]
Compiler Static Prediction of Taken/Untaken Branches
-
Zero-Delayed Branching
How to achieve zero delay for a jump or a taken branch?
Jump or branch target address is computed in the ID stage
Next instruction has already been fetched in the IF stage
Solution
Introduce a Branch Target Buffer (BTB) in the IF stage
Store the target address of recent branch and jump instructions
Use the lower bits of the PC to index the BTB
Each BTB entry stores Branch/Jump address & Target Address
Check the PC to see if the instruction being fetched is a branch
Update the PC using the target address stored in the BTB
-
Branch Target Buffer
The branch target buffer is implemented as a small cache
Stores the target address of recent branches and jumps
We must also have prediction bits
To predict whether branches are taken or not taken
The prediction bits are dynamically determined by the hardware
[Figure: the low-order PC bits index the Branch Target & Prediction Buffer, whose entries hold the addresses of recent branches, their target addresses, and prediction bits. A mux selects between the incremented PC and the stored target address when the entry hits and predict_taken is set.]
-
VL722 70 8/16/2015
Dynamic Branch Prediction
Prediction of branches at runtime using prediction bits
Prediction bits are associated with each entry in the BTB
Prediction bits reflect the recent history of a branch instruction
Typically only a few prediction bits (1 or 2) are used per entry
We don't know whether the prediction is correct until the branch resolves
If correct prediction
Continue normal execution → no wasted cycles
If incorrect prediction (misprediction)
Flush the instructions that were incorrectly fetched → wasted cycles
Update prediction bits and target address for future use
-
VL722 71 8/16/2015
Dynamic Branch Prediction Cont'd
[Flowchart:
IF: Use PC to address Instruction Memory and Branch Target Buffer
    Found BTB entry with predict taken?
      No  → increment PC
      Yes → PC = target address
ID: Jump or taken branch, but no predict-taken BTB hit? (mispredicted jump/branch)
      Enter jump/branch address, target address, and set prediction in BTB entry
      Flush fetched instructions
      Restart PC at target address
EX: Predicted taken, but branch not taken? (mispredicted branch)
      Update prediction bits
      Flush fetched instructions
      Restart PC after branch
Correct prediction → normal execution, no stall cycles]
-
VL722 72 8/16/2015
Prediction is just a hint that is assumed to be correct
If incorrect then fetched instructions are flushed
1-bit prediction scheme is simplest to implement
1 bit per branch instruction (associated with BTB entry)
Record last outcome of a branch instruction (Taken/Not taken)
Use last outcome to predict future behavior of a branch
1-bit Prediction Scheme
[State diagram with two states:
 Predict Not Taken: on Taken → Predict Taken; on Not Taken → stay
 Predict Taken: on Not Taken → Predict Not Taken; on Taken → stay]
-
VL722 73 8/16/2015
1-Bit Predictor: Shortcoming
Inner loop branch mispredicted twice!
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of inner loop next time around
outer: ...
inner: ...
       bne ..., ..., inner
       bne ..., ..., outer
-
VL722 74 8/16/2015
1-bit prediction scheme has a performance shortcoming
2-bit prediction scheme works better and is often used
4 states: strong and weak predict taken / predict not taken
Implemented as a saturating counter
Counter is incremented to max=3 when branch outcome is taken
Counter is decremented to min=0 when branch is not taken
2-bit Prediction Scheme
[State diagram with four states (saturating counter):
 Strong Predict Not Taken: Not Taken → stay; Taken → Weak Predict Not Taken
 Weak Predict Not Taken: Not Taken → Strong Predict Not Taken; Taken → Weak Predict Taken
 Weak Predict Taken: Taken → Strong Predict Taken; Not Taken → Weak Predict Not Taken
 Strong Predict Taken: Taken → stay; Not Taken → Weak Predict Taken]
-
VL722 75 8/16/2015
Alternative state machine
2-bit Prediction Scheme
-
1-Bit Branch History Table
Example
main()
{
    int i, j;
    while (1)
        for (i=1; i<=N; i++) {   /* inner-loop bound truncated in the source */
            ...
        }
}
-
2-Bit Counter Scheme
Prediction is made based on the last two branch outcomes
Each BHT entry consists of 2 bits, usually a 2-bit counter, which is associated with the state of the automaton (many different automata are possible)
[Figure: the PC indexes the BHT; the selected entry's automaton produces the prediction and is updated with the branch outcome.]
-
2-Bit BHT
A 2-bit scheme where the prediction is changed only after two consecutive mispredictions
The MSB of the state symbol represents the prediction:
1: TAKEN, 0: NOT TAKEN
[State diagram (states 11, 10, 01, 00):
 Predict TAKEN 11: T → stay; NT → 10
 Predict TAKEN 10: T → 11; NT → 01
 Predict NOT TAKEN 01: T → 10; NT → 00
 Predict NOT TAKEN 00: NT → stay; T → 01]
Prediction accuracy for this branch: 75%
(O = correct, X = mispredict; assume initial counter value: 11)

branch outcome      1   1   1   0   1   1   1   0   1   1   1   0  ...
counter value      11  11  11  11  10  11  11  11  10  11  11  11  ...
prediction          T   T   T   T   T   T   T   T   T   T   T   T  ...
new counter value  11  11  11  10  11  11  11  10  11  11  11  10  ...
correctness         O   O   O   X   O   O   O   X   O   O   O   X  ...

For the repeating outcome pattern T, T, T, N the counter mispredicts only the single not-taken outcome in each group of four: 9/12 = 75% accuracy.
-
79
Case for Correlating Predictors
Basic two-bit predictor schemes
use recent behavior of a branch to predict its future behavior
Improve the prediction accuracy
look also at recent behavior of other branches
if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) { }
    subi R3, R1, #2
    bnez R3, L1      ; b1 (aa != 2)
    add  R1, R0, R0  ; aa = 0
L1: subi R3, R2, #2
    bnez R3, L2      ; b2 (bb != 2)
    add  R2, R0, R0  ; bb = 0
L2: sub  R3, R1, R2
    beqz R3, L3      ; b3 (aa == bb)
b3 is correlated with b1 and b2;
If b1 and b2 are both untaken,
then b3 will be taken.
=>
Use correlating predictors or
two-level predictors.
-
80
Branch Correlation
Branch direction
Not independent
Correlated to the path taken
Example: the outcome of b3 is known beforehand if the path to b3 is 1-1
Track path using a 2-bit register
if (aa==2) // b1
aa = 0;
if (bb==2) // b2
bb = 0;
if (aa!=bb) { // b3
.
}
Code Snippet
[Decision tree over the two earlier branches, edges labeled 0 (NT) and 1 (T): b1 leads to b2 on either outcome, and each b2 leads to b3, giving four paths:
 A: 0-0
 B: 0-1
 C: 1-0
 D: 1-1
Along path D (1-1), aa = 0 and bb = 0, so the outcome of b3 is known beforehand; the other paths leave aa (aa ≠ 2) and/or bb (bb ≠ 2) unchanged.]
-
Correlating Branches
Example:
if (d==0)
    d=1;
if (d==1)
    .....

    BNEZ R1, b1      ; branch b1 (d != 0)
    ADDI R1, R1, #1  ; since d == 0, make d = 1
b1: SUBI R3, R1, #1
    BNEZ R3, b2      ; branch b2 (d != 1)
    .....
b2:
Initial value of d | d==0? | b1 | Value of d before b2 | d==1? | b2
0                  | Y     | NT | 1                    | Y     | NT
1                  | N     | T  | 1                    | Y     | NT
2                  | N     | T  | 2                    | N     | T
If b1 is NT, then b2 is NT
1-bit self-history predictor, with d taking the sequence 2, 0, 2, 0, 2, 0, ...:

d=? | b1 prediction | b1 action | new b1 prediction | b2 prediction | b2 action | new b2 prediction
2   | NT            | T         | T                 | NT            | T         | T
0   | T             | NT        | NT                | T             | NT        | NT
2   | NT            | T         | T                 | NT            | T         | T
0   | T             | NT        | NT                | T             | NT        | NT

All branches are mispredicted
-
Correlating Branches
Example:
if (d==0)
    d=1;
if (d==1)
    .....

    BNEZ R1, b1      ; branch b1 (d != 0)
    ADDI R1, R1, #1  ; since d == 0, make d = 1
b1: SUBI R3, R1, #1
    BNEZ R3, b2      ; branch b2 (d != 1)
    .....
b2:
Self-history prediction bits (XX): the first bit is the prediction used if the last branch was NT, the second if it was T

Prediction bits (XX) | Prediction if last branch was NT | Prediction if last branch was T
NT/NT                | NT                               | NT
NT/T                 | NT                               | T
T/NT                 | T                                | NT
T/T                  | T                                | T
Global (1,1) correlating predictor, with d taking the sequence 2, 0, 2, 0, ...:

d=? | b1 prediction | b1 action | new b1 prediction | b2 prediction | b2 action | new b2 prediction
2   | NT/NT         | T         | T/NT              | NT/NT         | T         | NT/T
0   | T/NT          | NT        | T/NT              | NT/T          | NT        | NT/T
2   | T/NT          | T         | T/NT              | NT/T          | T         | NT/T
0   | T/NT          | NT        | T/NT              | NT/T          | NT        | NT/T

Initial self-prediction bits are NT/NT and the initial last branch was NT; the prediction actually used is the bit selected by the last branch outcome.
Mispredictions occur only on the first execution of each branch
-
8/16/2015 83
Local/Global Predictors
Instead of maintaining a counter for each branch to capture the common case,
Maintain a counter for each branch and surrounding pattern
If the surrounding pattern belongs to the branch being predicted, the predictor is referred to as a local predictor
If the surrounding pattern includes neighboring branches, the predictor is referred to as a global predictor
-
8/16/2015 84
Correlated Branch Prediction
Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters
Thus, the old 2-bit BHT is a (0,2) predictor
Global Branch History: m-bit shift register keeping the T/NT status of the last m branches
Each entry in the table has 2^m n-bit predictors
-
8/16/2015 85
Correlating Branches
(2,2) predictor: the behavior of the two most recent branches selects between four predictions of the next branch, updating just that prediction
[Figure: the branch address indexes rows of 2-bit per-branch predictors; the 2-bit global branch history selects one of the 4 predictor columns to produce the prediction.]
-
86
Correlated Branch Predictor
(M,N) correlation scheme
M: shift register size (# bits)
N: N-bit counter
[Figure: in the plain 2-bit saturating counter scheme, the branch PC hashes to a single 2-bit counter. In the (2,2) correlation scheme, the branch PC hashes to a row of 2^2 = 4 two-bit counters, and a 2-bit shift register holding the global branch history (the directions of the two most recent branches) selects which counter provides the prediction.]
-
8/16/2015 87
Accuracy of Different Schemes
[Chart: frequency of mispredictions (0% to 18% scale) across SPEC89 benchmarks (nasa7, matrix300, doducd, spice, fpppp, gcc, espresso, eqntott, li, tomcatv) for three schemes: 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024-entry (2,2) predictor. Misprediction rates range from 0% to about 11%, and the (2,2) correlating predictor generally performs best despite its smaller size.]
-
88
Two-Level Branch Predictor
Generalized correlated branch predictor
1st level keeps branch history in Branch History Register (BHR)
2nd level segregates pattern history in Pattern History Table (PHT)
[Figure: the N-bit Branch History Register (BHR) holds the outcomes of the last N branches (Rc-k ... Rc-1) and shifts left on update. Its pattern (00..00 through 11..11) indexes the 2^N-entry Pattern History Table (PHT); the selected entry's FSM state gives the prediction, and update logic adjusts that state with the actual branch outcome Rc.]
-
89
Branch History Register
An N-bit shift register → 2^N patterns in the PHT
Shift in branch outcomes
1 = taken, 0 = not taken
First-in, first-out
BHR can be
Global
Per-set
Local (Per-address)
-
90
Pattern History Table
2^N entries addressed by the N-bit BHR
Each entry keeps a counter (2-bit or more) for prediction
Counter update: the same as 2-bit counter
Can be initialized in alternate patterns (01, 10, 01, 10, ..)
Alias (or interference) problem
-
91
Two-Level Branch Prediction
[Example: BHR = 0110 is combined with bits of PC = 0x4001000C to form the PHT index (e.g., 00110110). The selected PHT entry holds the 2-bit counter 10; MSB = 1 → predict taken.]
-
92
Predictor Update (Actually, Not Taken)
[Example continued: the branch actually resolves not taken, so after resolution the selected 2-bit counter is decremented (10 → 01) and the BHR shifts in the outcome (0110 → 1100), now selecting PHT entry 00111100. The predictor is updated only after the branch is resolved.]
-
8/16/2015 93
A local predictor might work well for some branches or programs, while a global predictor might work well for others
Provide one of each and maintain another predictor to identify which predictor is best for each branch
[Figure: Tournament predictor. The branch PC indexes a table of 2-bit saturating counters whose output drives a MUX selecting between the Local Predictor and the Global Predictor.]
Tournament Predictors
-
8/16/2015 94
Tournament Predictors: multilevel branch predictor
Selector for the Global and Local predictors of correlating branch prediction
Uses an n-bit saturating counter to choose between the predictors
The usual choice is between global and local predictors
-
95
Advantage of tournament predictor is the ability to select the right predictor for a particular branch
A typical tournament predictor selects global predictor 40% of the time for SPEC integer benchmarks
AMD Opteron and Phenom use a tournament-style predictor
Tournament Predictors
-
Accuracy v. Size (SPEC89)
[Chart: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for three schemes: local 2-bit counters, the correlating (2,2) scheme, and the tournament predictor. The tournament predictor achieves the lowest misprediction rate at every size, and all three curves level off as predictor size grows.]
-
97
Based on predictors used in the Core 2 Duo chip
Combines three different predictors:
Two-bit
Global history
Loop exit predictor
Uses a counter to predict the exact number of taken branches (number of loop iterations) for a branch that is detected as a loop branch
Tournament: tracks the accuracy of each predictor
Main problem of speculation: a mispredicted branch may lead to another branch being mispredicted!
Tournament Predictors (Intel Core i7)
-
98
Branch Prediction is More Important Today
Conditional branches still comprise about 20% of instructions
Correct predictions are more important today - why?
pipelines deeper
branch not resolved until more cycles from fetching - therefore the
misprediction penalty greater
cycle times smaller - more emphasis on throughput (performance)
more functionality between fetch & execute
multiple instruction issue (superscalars & VLIW)
branch occurs almost every cycle
flushing & refetching more instructions
object-oriented programming
more indirect branches - which are harder to predict
dual of Amdahl's Law
other forms of pipeline stalling are being addressed - so the portion
of CPI due to branch delays is relatively larger
All this means that the potential stalling due to branches is greater
-
99
Branch Prediction is More Important Today
On the other hand,
Chips are denser so we can consider sophisticated HW solutions
Hardware cost is small compared to the performance gain
-
100
Directions in Branch Prediction
1: Improve the prediction
correlated (2-level) predictor (Pentium III - 512 entries, 2-bit, Pentium Pro - 4 history bits)
hybrid local/global predictor (Alpha 21264)
confidence predictors
2: Determine the target earlier
branch target buffer (Pentium Pro, IA-64 Itanium)
next address in I-cache (Alpha 21264, UltraSPARC)
return address stack (Alpha 21264, IA-64 Itanium, MIPS R10000, Pentium Pro, UltraSPARC-3)
3: Reduce misprediction penalty
fetch both instruction streams (IBM mainframes, SuperSPARC)
4: Eliminate the branch
predicated execution (IA-64 Itanium, Alpha 21264)
-
VL722 101 8/16/2015
Exceptions: Events other than branches or jumps that change the normal flow of instruction execution. Some types of exceptions:
I/O Device request
Invoking an OS service from user program
Tracing instruction execution
Breakpoint (programmer requested interrupt)
Integer arithmetic overflow
FP arithmetic anomaly
Page fault (page not in main memory)
Misaligned memory accesses
Memory protection violation
Use of undefined instruction
Hardware malfunction
Power failure
Pipelining Complications
-
VL722 102 8/16/2015
Exceptions: Events other than branches or jumps that change the normal flow of instruction execution.
5 instructions executing in a 5-stage pipeline
How to stop the pipeline? Who caused the interrupt?
How to restart the pipeline?
Stage | Problems causing the interrupts
IF    | Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID    | Undefined or illegal opcode
EX    | Arithmetic interrupt
MEM   | Page fault on data fetch; misaligned memory access; memory-protection violation
Pipelining Complications
-
VL722 103 8/16/2015
Simultaneous exceptions in more than one pipeline stage, e.g.,
LOAD with data page fault in MEM stage
ADD with instruction page fault in IF stage
Solution #1
Interrupt status vector per instruction
Defer check until last stage, kill state update if exception
Solution #2
Interrupt ASAP
Restart everything that is incomplete
Pipelining Complications
-
VL722 104 8/16/2015
Our DLX pipeline only writes results at the end of the instruction's execution. Not all processors do this.
Address modes: Auto-increment causes register change during instruction execution
Interrupts need to restore the register state
Adds WAR and WAW hazards since writes happen not only in last stage
Memory-Memory Move Instructions
Must be able to handle multiple page faults
VAX and x86 store values temporarily in registers
Condition Codes
Need to detect the last instruction to change condition codes
Pipelining Complications
-
VL722 105 8/16/2015
Fallacies and Pitfalls
Pipelining is easy!
The basic idea is easy
The devil is in the details
Detecting data hazards and stalling pipeline
Poor ISA design can make pipelining harder
Complex instruction sets (Intel IA-32)
Significant overhead to make pipelining work
IA-32 micro-op approach
Complex addressing modes
Register update side effects, memory indirection
-
VL722 106 8/16/2015
Pipeline Hazards Summary
Three types of pipeline hazards
Structural hazards: conflicts using a resource during same cycle
Data hazards: due to data dependencies between instructions
Control hazards: due to branch and jump instructions
Hazards limit the performance and complicate the design
Structural hazards: eliminated by careful design or more hardware
Data hazards are eliminated by data forwarding
However, load delay cannot be completely eliminated
Delayed branching can be a solution for control hazards
BTB with branch prediction can reduce branch delay to zero
Branch misprediction should flush the wrongly fetched instructions