Computer Structure Advanced Branch Prediction

27
Computer Structure 2013 – Branch Prediction 1 Computer Structure Advanced Branch Prediction Lihu Rappoport and Adi Yoaz

description

Computer Structure Advanced Branch Prediction. Lihu Rappoport and Adi Yoaz. Introduction. Need to predict: Conditional branch direction (taken or no taken) Actual direction is known only after execution Wrong direction prediction causes a full flush - PowerPoint PPT Presentation

Transcript of Computer Structure Advanced Branch Prediction

Page 1: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction1

Computer Structure

Advanced Branch Prediction

Lihu Rappoport and Adi Yoaz

Page 2: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction2

Introduction Need to predict:

Conditional branch direction (taken or no taken) Actual direction is known only after execution Wrong direction prediction causes a full flush

All taken branch (conditional taken or unconditional) targets Target of direct branches known at decode Target of indirect branches known at execution

Branch type Conditional, uncond. direct, uncond. indirect, call, return

Target: minimize branch misprediction rate for a given predictor size

Page 3: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction3

What/Who/When We Predict/FixFetch Decode Execute

Target Array Branch type conditional unconditional direct unconditional indirect call return Branch target

Indirect Target Array Predict indirect target override TA target

Return Stack Buffer Predict return target

Cond. Branch Predictor Predict conditional

T/NT

Fix TA miss Fix wrong direct

target

on TA misstry to fix

Fix TA miss

Fix Wrong prediction

Fix Wrong prediction

Fix Wrong prediction

Dec FlushExe Flush

Page 4: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction4

Branches and Performance MPI : misprediction-per-instruction:

# of incorrectly predicted branches MPI = total # of instructions

MPI correlates well with performance. For example: MPI = 1% (1 out of 100 instructions @1 out of 20 branches) IPC=2 (IPC is the average number of instructions per cycle), flush penalty of 10 cycles

We get: MPI = 1% flush in every 100 instructions flush in every 50 cycles (since IPC=2), 10 cycles flush penalty every 50 cycles 20% in performance

Page 5: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction5

Target Array TA is accessed using the branch address (branch IP) Implemented as an n-way set associative cache Tags usually partial

Save space Can get false hits Few branches aliased to

the same entry No correctness only performance

TA predicts the following Indication that instruction is a branch Predicted target Branch type

Unconditional: take target Conditional: predict direction

TA allocated / updated at execution

Branch IP

tag target

predicted target

hit / miss(indicates a branch)

type

predicted type

Page 6: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction6

Conditional BranchDirection Prediction

Page 7: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction7

One-Bit Predictor

Problem: 1-bit predictor has a double mistake in loopsBranch Outcome 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 Prediction ? 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0

branch IP Prediction (at fetch): previous branch outcome

counter array / cache

Update (at execution)Update bit with branch outcome

Page 8: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction8

Bimodal (2-bit) Predictor A 2-bit counter avoids the double mistake in glitches

Need “more evidence” to change prediction 2 bits encode one of 4 states

00 – strong NT, 01 – weakly NT, 10 – weakly taken, 11 – strong taken

Initial state: weakly-taken (most branches are taken) Update

Branch was actually taken: increment counter (saturate at 11) Branch was actually not-taken: decrement counter (saturate at 00)

Predict according to m.s.bit of counter (0 – NT, 1 – taken) Does not predict well branches with patterns like 010101…

00SNT

taken

not-taken

taken taken

taken

not-taken not-taken

not-taken

01WNT

10WT

11ST

Predict takenPredict not-taken

Page 9: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction9

l.s. bits of branch IP

Bimodal Predictor (cont.)

Prediction = msb of counter

2-bit-sat counter array

Update counter with branch outcome

Page 10: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction10

Bimodal Predictor - example Br1 prediction

Pattern: 1 0 1 0 1 0counter: 2 3 2 3 2 3Prediction: 1 1 1 1 1 1

Br2 predictionPattern: 0 1 0 1 0 1counter: 2 1 2 1 2 1Prediction: 1 0 1 0 1 0

Br3 predictionPattern: 1 1 1 1 1 0counter: 2 3 3 3 3 3Prediction: 1 1 1 1 1 1

Code:

int n = 6

Loop: ….

br1: if (n/2) { … }

br2: if ((n+1)/2) { … }

n--;

br3: JNZ n, Loop

Page 11: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction11

2-Level Prediction: Local Predictor Save the history of each branch in a Branch History Register

(BHR): A shift-register updated by branch outcome Saves the last n outcomes of the branch Used as a pointer to an array of bits specifying

direction per history Example: assume n=6

Assume the pattern 000100010001 . . . At the steady-state, the following patterns

are repeated in the BHR: 000100010001 . . .

000100001000010001 100010

Following 000100, 010001, 100010 the jump is not taken Following 001000 the jump is taken

BHR

0

2n-1

n

Page 12: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction12

Local Predictor (cont.) There could be glitches from the pattern

Use 2-bit saturating counters instead of 1 bit to record outcome:

Too long BHRs are not good: Past history may be no longer relevant Warm-Up is longer Counter array becomes too big

UpdateHistorywithbranchoutcome

prediction = msb of counter

2-bit-sat counter array

Update counter withbranch outcome

history

BHR

Page 13: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction13

Local Predictor: private counter arrays

Branch IP tag history

prediction = msb of counter

2-bit-sat counter arrays

Update counter withbranch outcome

UpdateHistorywithbranchoutcome

history cache

Predictor size: #BHRs × (tag_size + history_size + 2 × 2 history_size)

Example: #BHRs = 1024; tag_size=8; history_size=6

size=1024 × (8 + 6 + 2×26) = 142Kbit

Holding BHRs and counter arrays for many branches:

Page 14: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction14

Local Predictor: shared counter arrays Using a single counter array shared by all BHR’s

All BHR’s index the same array Branches with similar patterns interfere with each other

prediction = msb of counter

Branch IP

2-bit-sat counter array

tag history

history cache

Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size

Example: #BHRs = 1024; tag_size=8; history_size=6

size=1024 × (8 + 6) + 2×26 = 14.1Kbit

Page 15: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction15

Local Predictor: lselect lselect reduces inter-branch-interference in the counter array

prediction = msb of counter

Branch IP

2-bit-sat counter array

tag history

history cache

h

h+m

m l.s.bits of IP

Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size + m

Page 16: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction16

Local Predictor: lsharelshare reduces inter-branch-interference in the counter array:maps common patterns in different branches to different counters

h

h

h l.s.bits of IP

history cache

tag history

prediction = msb of counter

Branch IP

2-bit-sat counter array

Predictor size: #BHRs × (tag_size + history_size) + 2 × 2 history_size

Page 17: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction17

The behavior of some branches is highly correlated with the behavior of other branches:

if (x < 1) . . . if (x > 1) . . . Using a Global History Register (GHR), the prediction of the

second if may be based on the direction of the first if For other branches the history interference might be

destructive

Global Predictor

Page 18: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction18

Global Predictor (cont.)

history

UpdateHistorywithbranchoutcome

prediction = msb of counter

2-bit-sat counter array

Update counter withbranch outcome

GHR

The predictor size: history_size + 2*2 history_size

Example: history_size = 12 size = 8 K Bits

Page 19: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction19

gshare combines the global history information with the branch IP

Global Predictor: Gshare

prediction = msb of counter

2-bit-sat counter array

Update counter withbranch outcome

Branch IP

historyGHR

UpdateHistorywithbranchoutcome

Page 20: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction20

Chooser

The chooser may also be indexed by the GHR

+1 if Bimodal / Local correct and Global wrong -1 if Bimodal / Local wrong and Global correct

Bimodal /Local

Global

Branch IP

Prediction

Chooser array (an arrayof 2-bit sat. counters)

GHR

A chooser selects between 2 predictor that predict the same branch:Use the predictor that was more correct in the past

Page 21: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction21

Speculative History Updates Deep pipeline many cycles between fetch and branch resolution

If history is updated only at resolution Local: future occurrences of the same branch may see stale history Global: future occurrences of all branches may see stale history

History is speculatively updated according to the prediction History must be corrected if the branch is mispredicted Speculative updates are done in a special field to enable recovery

Speculative History Update Speculative history updated assuming previous predictions are

correct Speculation bit set to indicate that speculative history is used Counter array is not updated speculatively

Prediction can change only on a misprediction (state 01→10 or 10→01)

On branch resolution Update the real history and reset speculative histories if mispredicted

Page 22: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction22

Return Stack Buffer A return instruction is a special case of an indirect

branch:Each times it jumps to a different targetThe target is determined by the location of the

corresponding call instruction

The idea:Hold a small stack of targetsWhen the target array predicts a call

Push the address of the instruction which follows the call into the stack

When the target array predicts a return Pop a target from the stack and use it as the return address

Page 23: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction23

Intel 486, Pentium®, Pentium® II 486 Statically predict Not Taken Pentium® 2-bit saturating counters Pentium® II

2-level, local histories, per-set counters 4-way set associative: 512 entries in 128 sets

IP Tag Hist

1001

Pred=msb of counter

9

0

15

Way 0 Way 1

Target

Way 2 Way 3

9 432

counters

128sets

PTV

21 1 32

LRR

2

Per-Set

Branch Type00- cond01- ret10- call11- uncond

Page 24: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction24

Alpha 264 – LG Chooser

1024

3

Counters

4 ways

256

Histories

IPIn each entry:6 bit tag +10 bit History

4096

2

Counters

GHR12

4096

Counters

Global

Local

Chooser

2

New entry on the Local stage is allocated on a global stage miss-prediction

Chooser state-machines: 2 bit each: one bit saves last time

global correct/wrong, and the other bit saves

for the local correct/wrong

Chooses Local only if local was correct and global was wrong

Page 25: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction25

Pentium® M Combines 3 predictors

Bimodal, Global and a Loop predictor

Loop predictor detects branches that have loop behavior Moving in one direction (taken or NT) a fixed number of times Ended with a single movement in the opposite direction

PredictionLimitCount

=

+1

0

Page 26: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction26

Pentium® M – Indirect Branch Predictor Indirect branch targets: target given in a register

Can have many targets: e.g., a case statement Resolved at execution high misprediction penalty Used in object-oriented code (C++, Java)

Added dedicated indirect branch target predictor global history based

Indirect Target Array

Branch IP

historyGHR

Page 27: Computer Structure Advanced Branch Prediction

Computer Structure 2013 – Branch Prediction27

Indirect Branch Target Prediction Initially allocate indirect branch only in the Target Array (TA)

Many indirect branches have only one target If the TA mispredicts for an indirect jump

Allocate an iTA entry with the current target Indexed by IP Global History

Prediction from the iTA is used if TA indicates an indirect branch and iTA hits

TargetArray

Indirect Target Predictor

Branch IP

Predicted Target

M

XU

hitindirect branch

hitTarget

HIT

Global history

Target