Computer Structure Advanced Branch Prediction
description
Transcript of Computer Structure Advanced Branch Prediction
Computer Structure 2013 – Branch Prediction1
Computer Structure
Advanced Branch Prediction
Lihu Rappoport and Adi Yoaz
Computer Structure 2013 – Branch Prediction2
Introduction Need to predict:
Conditional branch direction (taken or no taken) Actual direction is known only after execution Wrong direction prediction causes a full flush
All taken branch (conditional taken or unconditional) targets Target of direct branches known at decode Target of indirect branches known at execution
Branch type Conditional, uncond. direct, uncond. indirect, call, return
Target: minimize branch misprediction rate for a given predictor size
Computer Structure 2013 – Branch Prediction3
What/Who/When We Predict/FixFetch Decode Execute
Target Array Branch type conditional unconditional direct unconditional indirect call return Branch target
Indirect Target Array Predict indirect target override TA target
Return Stack Buffer Predict return target
Cond. Branch Predictor Predict conditional
T/NT
Fix TA miss Fix wrong direct
target
on TA misstry to fix
Fix TA miss
Fix Wrong prediction
Fix Wrong prediction
Fix Wrong prediction
Dec FlushExe Flush
Computer Structure 2013 – Branch Prediction4
Branches and Performance MPI : misprediction-per-instruction:
# of incorrectly predicted branches MPI = total # of instructions
MPI correlates well with performance. For example: MPI = 1% (1 out of 100 instructions @1 out of 20 branches) IPC=2 (IPC is the average number of instructions per cycle), flush penalty of 10 cycles
We get: MPI = 1% flush in every 100 instructions flush in every 50 cycles (since IPC=2), 10 cycles flush penalty every 50 cycles 20% in performance
Computer Structure 2013 – Branch Prediction5
Target Array TA is accessed using the branch address (branch IP) Implemented as an n-way set associative cache Tags usually partial
Save space Can get false hits Few branches aliased to
the same entry No correctness only performance
TA predicts the following Indication that instruction is a branch Predicted target Branch type
Unconditional: take target Conditional: predict direction
TA allocated / updated at execution
Branch IP
tag target
predicted target
hit / miss(indicates a branch)
type
predicted type
Computer Structure 2013 – Branch Prediction6
Conditional BranchDirection Prediction
Computer Structure 2013 – Branch Prediction7
One-Bit Predictor
Problem: 1-bit predictor has a double mistake in loopsBranch Outcome 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 Prediction ? 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
branch IP Prediction (at fetch): previous branch outcome
counter array / cache
Update (at execution)Update bit with branch outcome
Computer Structure 2013 – Branch Prediction8
Bimodal (2-bit) Predictor A 2-bit counter avoids the double mistake in glitches
Need “more evidence” to change prediction 2 bits encode one of 4 states
00 – strong NT, 01 – weakly NT, 10 – weakly taken, 11 – strong taken
Initial state: weakly-taken (most branches are taken) Update
Branch was actually taken: increment counter (saturate at 11) Branch was actually not-taken: decrement counter (saturate at 00)
Predict according to m.s.bit of counter (0 – NT, 1 – taken) Does not predict well branches with patterns like 010101…
00SNT
taken
not-taken
taken taken
taken
not-taken not-taken
not-taken
01WNT
10WT
11ST
Predict takenPredict not-taken
Computer Structure 2013 – Branch Prediction9
l.s. bits of branch IP
Bimodal Predictor (cont.)
Prediction = msb of counter
2-bit-sat counter array
Update counter with branch outcome
Computer Structure 2013 – Branch Prediction10
Bimodal Predictor - example Br1 prediction
Pattern: 1 0 1 0 1 0counter: 2 3 2 3 2 3Prediction: 1 1 1 1 1 1
Br2 predictionPattern: 0 1 0 1 0 1counter: 2 1 2 1 2 1Prediction: 1 0 1 0 1 0
Br3 predictionPattern: 1 1 1 1 1 0counter: 2 3 3 3 3 3Prediction: 1 1 1 1 1 1
Code:
int n = 6
Loop: ….
br1: if (n/2) { … }
br2: if ((n+1)/2) { … }
n--;
br3: JNZ n, Loop
Computer Structure 2013 – Branch Prediction11
2-Level Prediction: Local Predictor Save the history of each branch in a Branch History Register
(BHR): A shift-register updated by branch outcome Saves the last n outcomes of the branch Used as a pointer to an array of bits specifying
direction per history Example: assume n=6
Assume the pattern 000100010001 . . . At the steady-state, the following patterns
are repeated in the BHR: 000100010001 . . .
000100001000010001 100010
Following 000100, 010001, 100010 the jump is not taken Following 001000 the jump is taken
BHR
0
2n-1
n
Computer Structure 2013 – Branch Prediction12
Local Predictor (cont.) There could be glitches from the pattern
Use 2-bit saturating counters instead of 1 bit to record outcome:
Too long BHRs are not good: Past history may be no longer relevant Warm-Up is longer Counter array becomes too big
UpdateHistorywithbranchoutcome
prediction = msb of counter
2-bit-sat counter array
Update counter withbranch outcome
history
BHR
Computer Structure 2013 – Branch Prediction13
Local Predictor: private counter arrays
Branch IP tag history
prediction = msb of counter
2-bit-sat counter arrays
Update counter withbranch outcome
UpdateHistorywithbranchoutcome
history cache
Predictor size: #BHRs × (tag_size + history_size + 2 × 2 history_size)
Example: #BHRs = 1024; tag_size=8; history_size=6
size=1024 × (8 + 6 + 2×26) = 142Kbit
Holding BHRs and counter arrays for many branches:
Computer Structure 2013 – Branch Prediction14
Local Predictor: shared counter arrays Using a single counter array shared by all BHR’s
All BHR’s index the same array Branches with similar patterns interfere with each other
prediction = msb of counter
Branch IP
2-bit-sat counter array
tag history
history cache
Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size
Example: #BHRs = 1024; tag_size=8; history_size=6
size=1024 × (8 + 6) + 2×26 = 14.1Kbit
Computer Structure 2013 – Branch Prediction15
Local Predictor: lselect lselect reduces inter-branch-interference in the counter array
prediction = msb of counter
Branch IP
2-bit-sat counter array
tag history
history cache
h
h+m
m l.s.bits of IP
Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size + m
Computer Structure 2013 – Branch Prediction16
Local Predictor: lsharelshare reduces inter-branch-interference in the counter array:maps common patterns in different branches to different counters
h
h
h l.s.bits of IP
history cache
tag history
prediction = msb of counter
Branch IP
2-bit-sat counter array
Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size
Computer Structure 2013 – Branch Prediction17
The behavior of some branches is highly correlated with the behavior of other branches:
if (x < 1) . . . if (x > 1) . . . Using a Global History Register (GHR), the prediction of the
second if may be based on the direction of the first if For other branches the history interference might be
destructive
Global Predictor
Computer Structure 2013 – Branch Prediction18
Global Predictor (cont.)
history
UpdateHistorywithbranchoutcome
prediction = msb of counter
2-bit-sat counter array
Update counter withbranch outcome
GHR
The predictor size: history_size + 2*2 history_size
Example: history_size = 12 size = 8 K Bits
Computer Structure 2013 – Branch Prediction19
gshare combines the global history information with the branch IP
Global Predictor: Gshare
prediction = msb of counter
2-bit-sat counter array
Update counter withbranch outcome
Branch IP
historyGHR
UpdateHistorywithbranchoutcome
Computer Structure 2013 – Branch Prediction20
Chooser
The chooser may also be indexed by the GHR
+1 if Bimodal / Local correct and Global wrong -1 if Bimodal / Local wrong and Global correct
Bimodal /Local
Global
Branch IP
Prediction
Chooser array (an arrayof 2-bit sat. counters)
GHR
A chooser selects between 2 predictor that predict the same branch:Use the predictor that was more correct in the past
Computer Structure 2013 – Branch Prediction21
Speculative History Updates Deep pipeline many cycles between fetch and branch resolution
If history is updated only at resolution Local: future occurrences of the same branch may see stale history Global: future occurrences of all branches may see stale history
History is speculatively updated according to the prediction History must be corrected if the branch is mispredicted Speculative updates are done in a special field to enable recovery
Speculative History Update Speculative history updated assuming previous predictions are
correct Speculation bit set to indicate that speculative history is used Counter array is not updated speculatively
Prediction can change only on a misprediction (state 01→10 or 10→01)
On branch resolution Update the real history and reset speculative histories if mispredicted
Computer Structure 2013 – Branch Prediction22
Return Stack Buffer A return instruction is a special case of an indirect
branch:Each times it jumps to a different targetThe target is determined by the location of the
corresponding call instruction
The idea:Hold a small stack of targetsWhen the target array predicts a call
Push the address of the instruction which follows the call into the stack
When the target array predicts a return Pop a target from the stack and use it as the return address
Computer Structure 2013 – Branch Prediction23
Intel 486, Pentium®, Pentium® II 486 Statically predict Not Taken Pentium® 2-bit saturating counters Pentium® II
2-level, local histories, per-set counters 4-way set associative: 512 entries in 128 sets
IP Tag Hist
1001
Pred=msb of counter
9
0
15
Way 0 Way 1
Target
Way 2 Way 3
9 432
counters
128sets
PTV
21 1 32
LRR
2
Per-Set
Branch Type00- cond01- ret10- call11- uncond
Computer Structure 2013 – Branch Prediction24
Alpha 264 – LG Chooser
1024
3
Counters
4 ways
256
Histories
IPIn each entry:6 bit tag +10 bit History
4096
2
Counters
GHR12
4096
Counters
Global
Local
Chooser
2
New entry on the Local stage is allocated on a global stage miss-prediction
Chooser state-machines: 2 bit each: one bit saves last time
global correct/wrong, and the other bit saves
for the local correct/wrong
Chooses Local only if local was correct and global was wrong
Computer Structure 2013 – Branch Prediction25
Pentium® M Combines 3 predictors
Bimodal, Global and a Loop predictor
Loop predictor detects branches that have loop behavior Moving in one direction (taken or NT) a fixed number of times Ended with a single movement in the opposite direction
PredictionLimitCount
=
+1
0
Computer Structure 2013 – Branch Prediction26
Pentium® M – Indirect Branch Predictor Indirect branch targets: target given in a register
Can have many targets: e.g., a case statement Resolved at execution high misprediction penalty Used in object-oriented code (C++, Java)
Added dedicated indirect branch target predictor global history based
Indirect Target Array
Branch IP
historyGHR
Computer Structure 2013 – Branch Prediction27
Indirect Branch Target Prediction Initially allocate indirect branch only in the Target Array (TA)
Many indirect branches have only one target If the TA mispredicts for an indirect jump
Allocate an iTA entry with the current target Indexed by IP Global History
Prediction from the iTA is used if TA indicates an indirect branch and iTA hits
TargetArray
Indirect Target Predictor
Branch IP
Predicted Target
M
XU
hitindirect branch
hitTarget
HIT
Global history
Target