CSL718 : Pipelined Processors



Page 1: CSL718 : Pipelined Processors

Anshul Kumar, CSE IITD

CSL718 : Pipelined Processors

Improving Branch Performance – contd.

21st Jan, 2006

Page 2: CSL718 : Pipelined Processors

Improving Branch Performance

• Branch Elimination – replace branch with other instructions

• Branch Speed Up – reduce time for computing the condition code (CC) and the target instruction fetch (TIF)

• Branch Prediction – guess the outcome and proceed, undo if necessary

• Branch Target Capture – make use of history

Page 3: CSL718 : Pipelined Processors

Improving Branch Performance

• Branch Elimination – replace branch with other instructions

• Branch Speed Up – reduce time for computing CC and TIF

• Branch Prediction – guess the outcome and proceed, undo if necessary

• Branch Target Capture – make use of history

Page 4: CSL718 : Pipelined Processors

Branch Elimination

Use conditional/guarded instructions (predicated execution)

[Figure: flowchart of "C : S" - condition C with true (T) and false (F) outcomes guarding statement S]

With a conditional branch:
    OP1
    BC   CC = Z, +2
    ADD  R3, R2, R1
    OP2

With a guarded (predicated) instruction:
    OP1
    ADD  R3, R2, R1, NZ
    OP2

Examples: HP PA (all integer arithmetic/logical instructions), DEC Alpha, SPARC V9 (conditional move)
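
The idea can be pictured in C as a conditional select. A minimal sketch (my own illustration, not from the slides; function names are hypothetical, and whether the compiler actually emits a conditional move instead of a branch depends on the target):

    #include <stdio.h>

    /* Branching form: the "if" typically compiles to a conditional branch
       that skips the add when the condition fails. */
    int sum_if_nonzero_branch(int c, int r2, int r1, int r3)
    {
        if (c != 0)
            r3 = r2 + r1;
        return r3;
    }

    /* Predicated form: always compute, then select the result - a candidate
       for a conditional move (as on the slide's DEC Alpha / SPARC V9
       examples), with no control transfer at all. */
    int sum_if_nonzero_predicated(int c, int r2, int r1, int r3)
    {
        int t = r2 + r1;
        return (c != 0) ? t : r3;
    }

    int main(void)
    {
        printf("%d %d\n", sum_if_nonzero_branch(1, 2, 3, 0),
                          sum_if_nonzero_predicated(0, 2, 3, 7));
        return 0;
    }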

Page 5: CSL718 : Pipelined Processors

Branch Elimination - contd.

[Pipeline timing diagram comparing the two versions. With the branch: OP1 (IF IF IF D AG DF DF DF EX EX), BC (IF IF IF D AG TIF TIF TIF) and ADD/OP2 (IF IF IF D' D AG), with the CC becoming available only after OP1 executes. With the predicated instruction, the branch and its target fetch disappear: ADD(cond) flows straight through as IF IF IF D AG DF DF DF EX EX.]

Page 6: CSL718 : Pipelined Processors

Improving Branch Performance

• Branch Elimination – replace branch with other instructions

• Branch Speed Up – reduce time for computing CC and TIF

• Branch Prediction – guess the outcome and proceed, undo if necessary

• Branch Target Capture – make use of history

Page 7: CSL718 : Pipelined Processors

Branch Speed Up: early target address generation

• Assume each instruction is a branch
• Generate the target address while decoding
• If the target is in the same page, omit translation
• After decoding, discard the target address if the instruction is not a branch

[Pipeline timing for BC: IF IF IF D TIF TIF TIF AG - the target address is generated during decode, so the target instruction fetch can begin right after D]
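
A rough sketch of what this means in code (my own illustration; the PC-relative displacement field and the 4 KB page size are assumptions, not taken from the slide):

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_BITS 12u                  /* assume 4 KB pages */

    /* Hypothetical decode-stage helper: speculatively form a branch target
       for every instruction; it is simply discarded later if the instruction
       turns out not to be a branch. */
    typedef struct {
        uint64_t target;      /* PC-relative target address */
        bool     same_page;   /* true: the fetch address's translation can be
                                 reused, so no new TLB look-up is needed     */
    } early_target_t;

    early_target_t early_target(uint64_t pc, int32_t displacement)
    {
        early_target_t t;
        t.target    = pc + (int64_t)displacement;
        t.same_page = (t.target >> PAGE_BITS) == (pc >> PAGE_BITS);
        return t;
    }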

Page 8: CSL718 : Pipelined Processors

Branch Speed Up: increase CC - branch gap

Increase the gap between the instruction that sets the CC and the branch:

• Early CC setting
• Delayed branch

Page 9: CSL718 : Pipelined Processors

Summary - Branch Speed Up

Branch penalty (cycles) vs. n:

                        n=0  n=1  n=2  n=3  n=4  n=5
early CC    uncond       4    4    4    4    4    4
setting     cond (T)     6    5    4    4    4    4
            cond (I)     5    4    3    2    1    0
delayed     uncond       4    3    2    1    0    0
branch      cond (T)     6    5    4    3    2    1
            cond (I)     5    4    3    2    1    0

(T: branch goes to the target, I: inline / not taken. For early CC setting, n is the gap between the CC-setting instruction and the branch; for delayed branch, n is the number of delay slots.)

Page 10: CSL718 : Pipelined Processors

Delayed Branch with Nullification

(Also called annulment)
• Delay slot is used optionally
• Branch instruction specifies the option
• Option may be exercised based on correctness of branch prediction
• Helps in better utilization of delay slots

Page 11: CSL718 : Pipelined Processors

Improving Branch Performance

• Branch Elimination – replace branch with other instructions

• Branch Speed Up – reduce time for computing CC and TIF

• Branch Prediction – guess the outcome and proceed, undo if necessary

• Branch Target Capture – make use of history

Page 12: CSL718 : Pipelined Processors

Branch Prediction

• Treat conditional branches as unconditional branches / NOP
• Undo if necessary

Strategies:
– Fixed (always guess inline)
– Static (guess on the basis of instruction type / displacement)
– Dynamic (guess based on recent history)

Page 13: CSL718 : Pipelined Processors

Static Branch Prediction

Instr      % of instr   Guess     % taken   Correct
uncond     14.5         always    100%      14.5%
cond       58           never     54%       27%
loop       9.8          always    91%       9%
call/ret   17.7         always    100%      17.7%
Total                                       68.2%
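
As a check, each row's Correct entry is (fraction of instructions) times (probability that the fixed guess is right), and the rows sum to the total. Restating the table's arithmetic (no new data):

\[
0.145(1.00) + 0.58(1 - 0.54) + 0.098(0.91) + 0.177(1.00)
\approx 0.145 + 0.27 + 0.09 + 0.177 = 0.682
\]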

Page 14: CSL718 : Pipelined Processors

Threshold for Static Prediction

Delay (cycles) for each guess / actual outcome combination:

               actual T    actual I
    guess T       4           5
    guess I       6           0

Guess target if 4 p + 5 (1 - p) < 6 p + 0 (1 - p), i.e. 5 - p < 6 p, i.e. p > 5/7 ≈ .71
(p: probability that the branch goes to the target)

[Pipeline timing: I-1  IF IF D AG AG DF DF EX EX;  I  IF IF D AG AG TIF TIF;  the CC is set by I-1]

Page 15: CSL718 : Pipelined Processors

Dynamic Branch Prediction - basic idea

Predict based on the history of the branch's previous execution.

    loop: xxx
          xxx
          xxx
          xxx
          BC loop

2 mispredictions for every occurrence (execution) of the loop: once when the loop finally exits, and once again the next time it is entered.

Page 16: CSL718 : Pipelined Processors

Dynamic Branch Prediction - 2 bit prediction scheme

[State diagram: four states 0-3 connected by taken (T) / not-taken (N) transitions, saturating at the ends]

States 0/1: predict taken; states 3/2: predict not taken - so the prediction changes only after two consecutive mispredictions.
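
A minimal sketch of this scheme in C (my own illustration, following the slide's convention that states 0/1 predict taken; a taken outcome moves the state toward 0, a not-taken outcome toward 3):

    #include <stdio.h>
    #include <stdbool.h>

    typedef unsigned char pred2_t;   /* state 0..3 */

    bool predict_taken(pred2_t s) { return s <= 1; }

    pred2_t update(pred2_t s, bool taken)
    {
        if (taken)
            return (pred2_t)(s == 0 ? 0 : s - 1);   /* saturate at 0 (strong taken)     */
        else
            return (pred2_t)(s == 3 ? 3 : s + 1);   /* saturate at 3 (strong not taken) */
    }

    int main(void)
    {
        pred2_t s = 0;                               /* start strongly taken */
        bool outcomes[] = { true, true, false, false, false };
        for (int i = 0; i < 5; i++) {
            printf("predict %s, actual %s\n",
                   predict_taken(s) ? "taken" : "not taken",
                   outcomes[i]      ? "taken" : "not taken");
            s = update(s, outcomes[i]);
        }
        return 0;
    }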

Page 17: CSL718 : Pipelined Processors

Dynamic Branch Prediction - second scheme

Predict based on the history of previous n branches

e.g., if n = 3 then:
    3 branches taken -> predict taken
    2 branches taken -> predict taken
    1 branch taken   -> predict not taken
    0 branches taken -> predict not taken

Page 18: CSL718 : Pipelined Processors

Dynamic Branch Prediction - Bimodal predictor

Maintain saturating counters

[State diagram: a 2-bit saturating counter, states 0-3, moved by taken (T) / not-taken (N) outcomes]

One counter per branch, or one counter per cache line - merge results if the line contains multiple branches
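
A sketch of the bimodal organisation (illustrative size; indexing by low-order address bits is my assumption, and it approximates "one counter per branch" - branches that alias to the same entry effectively merge their results):

    #include <stdint.h>
    #include <stdbool.h>

    #define BIMODAL_ENTRIES 4096u            /* assumed table size */

    /* 2-bit counters, 0..3; as on the previous slide, 0/1 predict taken. */
    unsigned char counters[BIMODAL_ENTRIES];

    unsigned bimodal_index(uint64_t branch_addr)
    {
        return (unsigned)(branch_addr >> 2) % BIMODAL_ENTRIES;  /* drop byte offset */
    }

    bool bimodal_predict(uint64_t branch_addr)
    {
        return counters[bimodal_index(branch_addr)] <= 1;
    }

    void bimodal_update(uint64_t branch_addr, bool taken)
    {
        unsigned char *c = &counters[bimodal_index(branch_addr)];
        if (taken)  { if (*c > 0) (*c)--; }
        else        { if (*c < 3) (*c)++; }
    }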

Page 19: CSL718 : Pipelined Processors

Dynamic Branch Prediction - History of last n occurrences

Each entry records the outcome of the last three occurrences of this branch (0: not taken, 1: taken); the prediction is made by majority decision.

    current entry:   1 1 0   (majority 'taken' -> predict taken)
    actual outcome:  taken
    updated entry:   1 1 1   (the oldest outcome, 0, is shifted out and the new 'taken' shifted in)
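
A small sketch of one such n = 3 history entry with majority prediction, matching the update shown above (my own illustration; the leftmost bit is taken to be the newest outcome):

    #include <stdbool.h>

    typedef struct { unsigned char h[3]; } hist3_t;   /* h[0] newest ... h[2] oldest */

    bool predict_majority(hist3_t e)
    {
        return (e.h[0] + e.h[1] + e.h[2]) >= 2;        /* 2 or 3 of 3 taken -> taken */
    }

    hist3_t update_history(hist3_t e, bool taken)
    {
        /* shift in the new outcome, drop the oldest: {1,1,0} + taken -> {1,1,1} */
        hist3_t u = { { taken ? 1 : 0, e.h[0], e.h[1] } };
        return u;
    }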

Page 20: CSL718 : Pipelined Processors

Dynamic Branch Prediction - storing prediction counters

Store the counters in a separate buffer, or store them in the cache directory.

[Figure: cache with directory and storage; each cache line's directory entry holds a prediction counter]

Page 21: CSL718 : Pipelined Processors

Correct guesses vs. history length

Correct guesses (%) for history length n, by workload:

n    Compiler   Business   Scientific   Supervisor
0      64.1       64.4        70.4         54.0
1      91.9       95.2        86.6         79.7
2      93.3       96.5        90.8         83.4
3      93.7       96.6        91.0         83.5
4      94.5       96.8        91.8         83.7
5      94.7       97.0        92.0         83.9

Page 22: CSL718 : Pipelined Processors

Two-Level Prediction

• Uses two levels of information to make a direction prediction
  – Branch History Table (BHT) - last n occurrences
  – Pattern History Table (PHT) - saturating 2-bit counters

• Captures patterned behavior of branches
  – Groups of branches are correlated
  – Particular branches have particular behavior

Page 23: CSL718 : Pipelined Processors

Correlation between branches

B1: if (x) ...

B2: if (y) ...

    z = x && y
B3: if (z) ...

• B3 can be predicted with 100% accuracy based on the outcomes of B1 and B2

Page 24: CSL718 : Pipelined Processors

Some Two-level Predictors

[Figure - Global predictor: a global branch history register (GBHR, e.g. 1 0 1 1 0) indexes the PHT, which gives the taken / not-taken (T/NT) prediction. Local predictor: the PC selects a per-branch history entry in the BHT, and that history pattern indexes the PHT to give the T/NT prediction.]

bits from PC and BHT can be combined to index PHT
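
One common way of combining them is a gshare-style XOR index into the PHT. A sketch (my own illustration - the XOR combination and the sizes are assumptions, not taken from the slide; counters follow the earlier convention that 0/1 predict taken):

    #include <stdint.h>
    #include <stdbool.h>

    #define GHIST_BITS  12u
    #define PHT_ENTRIES (1u << GHIST_BITS)

    unsigned char pht[PHT_ENTRIES];   /* 2-bit counters: 0/1 taken, 2/3 not taken */
    uint32_t      gbhr;               /* global branch history register           */

    /* XOR low PC bits with the global history so the same branch can use
       different counters in different global contexts. */
    uint32_t pht_index(uint64_t pc)
    {
        return ((uint32_t)(pc >> 2) ^ gbhr) & (PHT_ENTRIES - 1u);
    }

    bool two_level_predict(uint64_t pc)
    {
        return pht[pht_index(pc)] <= 1;
    }

    void two_level_update(uint64_t pc, bool taken)
    {
        unsigned char *c = &pht[pht_index(pc)];
        if (taken) { if (*c > 0) (*c)--; }
        else       { if (*c < 3) (*c)++; }
        gbhr = ((gbhr << 1) | (taken ? 1u : 0u)) & (PHT_ENTRIES - 1u);
    }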

Page 25: CSL718 : Pipelined Processors

Two-level Predictor Classification

• Yeh and Patt 3-letter naming scheme
  – Type of history collected
    • G (global), P (per branch), S (per set)
  – PHT type
    • A (adaptive), S (static)
  – PHT organization
    • g (global), p (per branch), s (per set)

• Examples - GAs, PAp, etc.

Page 26: CSL718 : Pipelined Processors

Improving Branch Performance

• Branch Elimination – replace branch with other instructions

• Branch Speed Up – reduce time for computing CC and TIF

• Branch Prediction – guess the outcome and proceed, undo if necessary

• Branch Target Capture – make use of history

Page 27: CSL718 : Pipelined Processors

Branch Target Capture

• Branch Target Buffer (BTB)
• Target Instruction Buffer (TIB)

Entry format:  instr addr | pred stats | target
(the target field holds the target address in a BTB, the target instruction in a TIB)

Probability of the target changing < 5%
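
A minimal sketch of one such entry, matching the fields listed above (field widths are my assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    /* One Branch Target Buffer entry, looked up with the fetch address.
       A Target Instruction Buffer entry would store the target instruction
       itself instead of (or along with) the target address. */
    typedef struct {
        uint64_t branch_addr;   /* instr addr: identifies the branch (tag) */
        uint8_t  pred_state;    /* pred stats: e.g. a 2-bit counter        */
        uint64_t target_addr;   /* last known branch target address        */
        bool     valid;
    } btb_entry_t;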

Page 28: CSL718 : Pipelined Processors

BTB Performance

Decision: on a BTB hit, go to the target; on a BTB miss, go inline.

    BTB hit (.4):   actual target (.8) -> delay 0     actual inline (.2) -> delay 5
    BTB miss (.6):  actual target (.2) -> delay 4     actual inline (.8) -> delay 0

Average delay = .4*.8*0 + .4*.2*5 + .6*.2*4 + .6*.8*0 = 0.88 cycles

Page 29: CSL718 : Pipelined Processors

Dynamic information about branch

• Previous branch decisions
  – explicit prediction
  – stored in cache directory
  – Branch History Table (BHT)

• Previous target address / instruction
  – implicit prediction
  – stored in a separate buffer
  – Branch Target Buffer (BTB) / Branch Target Address Cache (BTAC); Target Instruction Buffer (TIB) / Branch Target Instruction Cache (BTIC)

These two can be combined.

Page 30: CSL718 : Pipelined Processors

Storing prediction info

In the cache: the directory entry of each cache line holds a prediction counter.

[Figure: cache directory + storage, with a counter per cache line]

In a separate buffer: entries of the form  instr addr | pred stats | target

Page 31: CSL718 : Pipelined Processors

Combined prediction mechanism

• Explicit: use history bits
• Implicit: use BTB hit/miss
  – hit: go to target; miss: go inline
• Combined: BTB hit/miss followed by explicit prediction using history bits. One of the following is commonly used (see the sketch after this list):
  – hit: go to target; miss: explicit prediction
  – miss: go inline; hit: explicit prediction
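
A sketch of the second option ("miss: go inline; hit: explicit prediction"). btb_lookup and history_predicts_taken are hypothetical placeholders standing in for the structures described on the surrounding slides:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { bool hit; uint64_t target; } btb_result_t;

    /* Placeholder stand-ins for the real BTB and history-bit lookups. */
    btb_result_t btb_lookup(uint64_t fetch_addr)
    {
        (void)fetch_addr;
        btb_result_t r = { false, 0 };
        return r;
    }

    bool history_predicts_taken(uint64_t fetch_addr)
    {
        (void)fetch_addr;
        return false;
    }

    uint64_t next_fetch_addr(uint64_t fetch_addr, uint64_t inline_addr)
    {
        btb_result_t r = btb_lookup(fetch_addr);
        if (!r.hit)
            return inline_addr;                       /* miss: go inline           */
        return history_predicts_taken(fetch_addr)     /* hit: consult history bits */
             ? r.target
             : inline_addr;
    }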

Page 32: CSL718 : Pipelined Processors

Combined prediction

[Decision trees for the two combined policies: each tree shows the BTB hit/miss outcome and, where applicable, the explicit prediction (expl predict), the resulting prediction, and the possible actual outcomes.
 Prediction - T: Target, I: Inline.  Actual outcome - T: Target, I: Inline.]

Page 33: CSL718 : Pipelined Processors

Structure of Tables

Instruction fetch path with
• BHT
• BTAC
• BTIC

Page 34: CSL718 : Pipelined Processors

Compute/fetch scheme

[Figure: the IFAR supplies the instruction fetch address to the I-cache; an adder (+) forms the next sequential address, while the branch target address (BTA) is computed from the fetched branch instruction. A group I, I+1, I+2, I+3 is fetched per cycle; on a taken branch the next group is BTI, BTI+1, BTI+2, BTI+3.]

(no dynamic branch prediction)
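
In code form, the compute/fetch scheme amounts to the following (a sketch; the PC-relative displacement field and the 4-instruction, 16-byte fetch group are assumptions):

    #include <stdint.h>

    /* The target address is computed from the fetched branch instruction
       itself, so redirection happens only after the branch is fetched and
       examined. */
    uint64_t compute_bta(uint64_t branch_pc, int32_t displacement)
    {
        return branch_pc + (int64_t)displacement;
    }

    uint64_t next_fetch_address(uint64_t ifar, int branch_taken, uint64_t bta)
    {
        return branch_taken ? bta : ifar + 16;   /* 16 = 4 instr * 4 bytes */
    }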

Page 35: CSL718 : Pipelined Processors

BHT (Branch History Table)

[Figure: instruction fetch path with a BHT.
 I-cache: 16 K, 4-way set associative, 128 x 4 lines, 8 instructions per line; 4 instructions are fetched per cycle (using the instruction fetch address) into the decode queue and issue queue (4 x 1 instr each).
 BHT: 128 x 4 entries, with 2 history bits per fetched instruction (2 2 2 2), feeding prediction logic that outputs taken / not taken and the BTA for a taken guess.]

Page 36: CSL718 : Pipelined Processors

BTAC scheme

[Figure: as in the compute/fetch scheme, the IFAR addresses the I-cache and an adder (+) forms the next sequential address. In parallel, the instruction fetch address also looks up the BTAC, whose entries hold a branch address (BA) and the corresponding branch target address (BTA); on a BTAC hit the stored BTA, rather than the sequential address, becomes the next fetch address, so the target group BTI, BTI+1, BTI+2, BTI+3 follows the group at I.]
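
The address selection on a BTAC hit can be sketched as follows (btac_lookup is a placeholder for the BA/BTA table shown above; the fixed sequential increment assumes a 16-byte fetch group):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { bool hit; uint64_t bta; } btac_result_t;

    btac_result_t btac_lookup(uint64_t fetch_addr)
    {
        (void)fetch_addr;                  /* placeholder lookup */
        btac_result_t r = { false, 0 };
        return r;
    }

    /* Sequential by default, redirected to the stored target on a BTAC hit
       (an implicit "predict taken" for addresses that hit). */
    uint64_t next_ifar(uint64_t ifar)
    {
        btac_result_t r = btac_lookup(ifar);
        return r.hit ? r.bta : ifar + 16;
    }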

Page 37: CSL718 : Pipelined Processors

BTIC scheme - 1

[Figure: BTIC entries hold a branch address (BA), the branch target instruction (BTI) and a continuation address (BTA+). On a hit, the stored target instruction is sent directly to the decoder while the I-cache continues fetching from the stored continuation address.]

Page 38: CSL718 : Pipelined Processors

BTIC scheme - 2

[Figure: variant in which each BTIC entry holds the branch address (BA) and the first two target instructions (BTI, BTI+1), which are sent to the decoder on a hit; the continuation address (BTA+) is computed rather than stored.]

Page 39: CSL718 : Pipelined Processors

Successor index in I-cache

[Figure: each I-cache line stores a successor index along with its instructions; the next fetch address is taken from this index, which can point either to the next sequential group (I+1, I+2, I+3, ...) or to the branch target group (BTI, BTI+1, ...), so no separate target computation or lookup is needed on the fetch path.]