CSL718 : Pipelined Processors



Page 1: CSL718 : Pipelined Processors

Anshul Kumar, CSE IITD

CSL718 : Pipelined Processors

Improving Branch Performance – contd.

21st Jan, 2006

Page 2: CSL718 : Pipelined Processors

Improving Branch Performance

• Branch Elimination – replace branch with other instructions

• Branch Speed Up – reduce time for computing the condition code (CC) and the target instruction fetch (TIF)

• Branch Prediction – guess the outcome and proceed, undo if necessary

• Branch Target Capture – make use of history

Page 3: CSL718 : Pipelined Processors

Improving Branch Performance

• Branch Elimination – replace branch with other instructions

• Branch Speed Up – reduce time for computing CC and TIF

• Branch Prediction – guess the outcome and proceed, undo if necessary

• Branch Target Capture – make use of history

Page 4: CSL718 : Pipelined Processors

Branch Elimination

Use conditional/guarded instructions (predicated execution)

[Figure: flowchart of "C : S" - condition C with true (T) and false (F) outcomes guarding statement S]

With a conditional branch:
    OP1
    BC   CC = Z, +2
    ADD  R3, R2, R1
    OP2

With a guarded (predicated) instruction:
    OP1
    ADD  R3, R2, R1, NZ
    OP2

Examples: HP PA (all integer arithmetic/logical instructions), DEC Alpha, SPARC V9 (conditional move)
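
The idea can be pictured in C as a conditional select. A minimal sketch (my own illustration, not from the slides; function names are hypothetical, and whether the compiler actually emits a conditional move instead of a branch depends on the target):

    #include <stdio.h>

    /* Branching form: the "if" typically compiles to a conditional branch
       that skips the add when the condition fails. */
    int sum_if_nonzero_branch(int c, int r2, int r1, int r3)
    {
        if (c != 0)
            r3 = r2 + r1;
        return r3;
    }

    /* Predicated form: always compute, then select the result - a candidate
       for a conditional move (as on the slide's DEC Alpha / SPARC V9
       examples), with no control transfer at all. */
    int sum_if_nonzero_predicated(int c, int r2, int r1, int r3)
    {
        int t = r2 + r1;
        return (c != 0) ? t : r3;
    }

    int main(void)
    {
        printf("%d %d\n", sum_if_nonzero_branch(1, 2, 3, 0),
                          sum_if_nonzero_predicated(0, 2, 3, 7));
        return 0;
    }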

Page 5: CSL718 : Pipelined Processors

Branch Elimination - contd.

[Pipeline timing diagram comparing the two versions. With the branch: OP1 (IF IF IF D AG DF DF DF EX EX), BC (IF IF IF D AG TIF TIF TIF) and ADD/OP2 (IF IF IF D' D AG), with the CC becoming available only after OP1 executes. With the predicated instruction, the branch and its target fetch disappear: ADD(cond) flows straight through as IF IF IF D AG DF DF DF EX EX.]

Page 6: CSL718 : Pipelined Processors

Improving Branch Performance

• Branch Elimination – replace branch with other instructions

• Branch Speed Up – reduce time for computing CC and TIF

• Branch Prediction – guess the outcome and proceed, undo if necessary

• Branch Target Capture – make use of history

Page 7: CSL718 : Pipelined Processors

Branch Speed Up: early target address generation

• Assume each instruction is a branch
• Generate the target address while decoding
• If the target is in the same page, omit translation
• After decoding, discard the target address if the instruction is not a branch

[Pipeline timing for BC: IF IF IF D TIF TIF TIF AG - the target address is generated during decode, so the target instruction fetch can begin right after D]
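
A rough sketch of what this means in code (my own illustration; the PC-relative displacement field and the 4 KB page size are assumptions, not taken from the slide):

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_BITS 12u                  /* assume 4 KB pages */

    /* Hypothetical decode-stage helper: speculatively form a branch target
       for every instruction; it is simply discarded later if the instruction
       turns out not to be a branch. */
    typedef struct {
        uint64_t target;      /* PC-relative target address */
        bool     same_page;   /* true: the fetch address's translation can be
                                 reused, so no new TLB look-up is needed     */
    } early_target_t;

    early_target_t early_target(uint64_t pc, int32_t displacement)
    {
        early_target_t t;
        t.target    = pc + (int64_t)displacement;
        t.same_page = (t.target >> PAGE_BITS) == (pc >> PAGE_BITS);
        return t;
    }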

Page 8: CSL718 : Pipelined Processors

Branch Speed Up: increase CC - branch gap

Increase the gap between the instruction that sets the CC and the branch:

• Early CC setting
• Delayed branch

Page 9: CSL718 : Pipelined Processors

Summary - Branch Speed Up

Branch penalty (cycles) vs. n:

                        n=0  n=1  n=2  n=3  n=4  n=5
early CC    uncond       4    4    4    4    4    4
setting     cond (T)     6    5    4    4    4    4
            cond (I)     5    4    3    2    1    0
delayed     uncond       4    3    2    1    0    0
branch      cond (T)     6    5    4    3    2    1
            cond (I)     5    4    3    2    1    0

(T: branch goes to the target, I: inline / not taken. For early CC setting, n is the gap between the CC-setting instruction and the branch; for delayed branch, n is the number of delay slots.)

Page 10: CSL718 : Pipelined Processors

Delayed Branch with Nullification

(Also called annulment)
• Delay slot is used optionally
• Branch instruction specifies the option
• Option may be exercised based on correctness of branch prediction
• Helps in better utilization of delay slots

Page 11: CSL718 : Pipelined Processors

Improving Branch Performance

• Branch Elimination – replace branch with other instructions

• Branch Speed Up – reduce time for computing CC and TIF

• Branch Prediction – guess the outcome and proceed, undo if necessary

• Branch Target Capture – make use of history

Page 12: CSL718 : Pipelined Processors

Branch Prediction

• Treat conditional branches as unconditional branches / NOP
• Undo if necessary

Strategies:
– Fixed (always guess inline)
– Static (guess on the basis of instruction type / displacement)
– Dynamic (guess based on recent history)

Page 13: CSL718 : Pipelined Processors

Static Branch Prediction

Instr      % of instr   Guess     % taken   Correct
uncond     14.5         always    100%      14.5%
cond       58           never     54%       27%
loop       9.8          always    91%       9%
call/ret   17.7         always    100%      17.7%
Total                                       68.2%
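
As a check, each row's Correct entry is (fraction of instructions) times (probability that the fixed guess is right), and the rows sum to the total. Restating the table's arithmetic (no new data):

\[
0.145(1.00) + 0.58(1 - 0.54) + 0.098(0.91) + 0.177(1.00)
\approx 0.145 + 0.27 + 0.09 + 0.177 = 0.682
\]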

Page 14: CSL718 : Pipelined Processors

Threshold for Static Prediction

Delay (cycles) for each guess / actual outcome combination:

               actual T    actual I
    guess T       4           5
    guess I       6           0

Guess target if 4 p + 5 (1 - p) < 6 p + 0 (1 - p), i.e. 5 - p < 6 p, i.e. p > 5/7 ≈ .71
(p: probability that the branch goes to the target)

[Pipeline timing: I-1  IF IF D AG AG DF DF EX EX;  I  IF IF D AG AG TIF TIF;  the CC is set by I-1]

Page 15: CSL718 : Pipelined Processors

Dynamic Branch Prediction - basic idea

Predict based on the history of the branch's previous execution.

    loop: xxx
          xxx
          xxx
          xxx
          BC loop

2 mispredictions for every occurrence (execution) of the loop: once when the loop finally exits, and once again the next time it is entered.

Page 16: CSL718 : Pipelined Processors

Dynamic Branch Prediction - 2 bit prediction scheme

[State diagram: four states 0-3 connected by taken (T) / not-taken (N) transitions, saturating at the ends]

States 0/1: predict taken; states 3/2: predict not taken - so the prediction changes only after two consecutive mispredictions.
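
A minimal sketch of this scheme in C (my own illustration, following the slide's convention that states 0/1 predict taken; a taken outcome moves the state toward 0, a not-taken outcome toward 3):

    #include <stdio.h>
    #include <stdbool.h>

    typedef unsigned char pred2_t;   /* state 0..3 */

    bool predict_taken(pred2_t s) { return s <= 1; }

    pred2_t update(pred2_t s, bool taken)
    {
        if (taken)
            return (pred2_t)(s == 0 ? 0 : s - 1);   /* saturate at 0 (strong taken)     */
        else
            return (pred2_t)(s == 3 ? 3 : s + 1);   /* saturate at 3 (strong not taken) */
    }

    int main(void)
    {
        pred2_t s = 0;                               /* start strongly taken */
        bool outcomes[] = { true, true, false, false, false };
        for (int i = 0; i < 5; i++) {
            printf("predict %s, actual %s\n",
                   predict_taken(s) ? "taken" : "not taken",
                   outcomes[i]      ? "taken" : "not taken");
            s = update(s, outcomes[i]);
        }
        return 0;
    }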

Page 17: CSL718 : Pipelined Processors

Dynamic Branch Prediction - second scheme

Predict based on the history of previous n branches

e.g., if n = 3 then:
    3 branches taken -> predict taken
    2 branches taken -> predict taken
    1 branch taken   -> predict not taken
    0 branches taken -> predict not taken

Page 18: CSL718 : Pipelined Processors

Dynamic Branch Prediction - Bimodal predictor

Maintain saturating counters

[State diagram: a 2-bit saturating counter, states 0-3, moved by taken (T) / not-taken (N) outcomes]

One counter per branch, or one counter per cache line - merge results if the line contains multiple branches
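
A sketch of the bimodal organisation (illustrative size; indexing by low-order address bits is my assumption, and it approximates "one counter per branch" - branches that alias to the same entry effectively merge their results):

    #include <stdint.h>
    #include <stdbool.h>

    #define BIMODAL_ENTRIES 4096u            /* assumed table size */

    /* 2-bit counters, 0..3; as on the previous slide, 0/1 predict taken. */
    unsigned char counters[BIMODAL_ENTRIES];

    unsigned bimodal_index(uint64_t branch_addr)
    {
        return (unsigned)(branch_addr >> 2) % BIMODAL_ENTRIES;  /* drop byte offset */
    }

    bool bimodal_predict(uint64_t branch_addr)
    {
        return counters[bimodal_index(branch_addr)] <= 1;
    }

    void bimodal_update(uint64_t branch_addr, bool taken)
    {
        unsigned char *c = &counters[bimodal_index(branch_addr)];
        if (taken)  { if (*c > 0) (*c)--; }
        else        { if (*c < 3) (*c)++; }
    }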

Page 19: CSL718 : Pipelined Processors

Dynamic Branch Prediction - History of last n occurrences

Each entry records the outcome of the last three occurrences of this branch (0: not taken, 1: taken); the prediction is made by majority decision.

    current entry:   1 1 0   (majority 'taken' -> predict taken)
    actual outcome:  taken
    updated entry:   1 1 1   (the oldest outcome, 0, is shifted out and the new 'taken' shifted in)
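
A small sketch of one such n = 3 history entry with majority prediction, matching the update shown above (my own illustration; the leftmost bit is taken to be the newest outcome):

    #include <stdbool.h>

    typedef struct { unsigned char h[3]; } hist3_t;   /* h[0] newest ... h[2] oldest */

    bool predict_majority(hist3_t e)
    {
        return (e.h[0] + e.h[1] + e.h[2]) >= 2;        /* 2 or 3 of 3 taken -> taken */
    }

    hist3_t update_history(hist3_t e, bool taken)
    {
        /* shift in the new outcome, drop the oldest: {1,1,0} + taken -> {1,1,1} */
        hist3_t u = { { taken ? 1 : 0, e.h[0], e.h[1] } };
        return u;
    }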

Page 20: CSL718 : Pipelined Processors

Dynamic Branch Prediction - storing prediction counters

Store the counters in a separate buffer, or store them in the cache directory.

[Figure: cache with directory and storage; each cache line's directory entry holds a prediction counter]

Page 21: CSL718 : Pipelined Processors

Correct guesses vs. history length

Correct guesses (%) for history length n, by workload:

n    Compiler   Business   Scientific   Supervisor
0      64.1       64.4        70.4         54.0
1      91.9       95.2        86.6         79.7
2      93.3       96.5        90.8         83.4
3      93.7       96.6        91.0         83.5
4      94.5       96.8        91.8         83.7
5      94.7       97.0        92.0         83.9

Page 22: CSL718 : Pipelined Processors

Two-Level Prediction

• Uses two levels of information to make a direction prediction
  – Branch History Table (BHT) - last n occurrences
  – Pattern History Table (PHT) - saturating 2-bit counters

• Captures patterned behavior of branches
  – Groups of branches are correlated
  – Particular branches have particular behavior

Page 23: CSL718 : Pipelined Processors

Correlation between branches

B1: if (x) ...

B2: if (y) ...

    z = x && y
B3: if (z) ...

• B3 can be predicted with 100% accuracy based on the outcomes of B1 and B2

Page 24: CSL718 : Pipelined Processors

Some Two-level Predictors

[Figure - Global predictor: a global branch history register (GBHR, e.g. 1 0 1 1 0) indexes the PHT, which gives the taken / not-taken (T/NT) prediction. Local predictor: the PC selects a per-branch history entry in the BHT, and that history pattern indexes the PHT to give the T/NT prediction.]

bits from PC and BHT can be combined to index PHT
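
One common way of combining them is a gshare-style XOR index into the PHT. A sketch (my own illustration - the XOR combination and the sizes are assumptions, not taken from the slide; counters follow the earlier convention that 0/1 predict taken):

    #include <stdint.h>
    #include <stdbool.h>

    #define GHIST_BITS  12u
    #define PHT_ENTRIES (1u << GHIST_BITS)

    unsigned char pht[PHT_ENTRIES];   /* 2-bit counters: 0/1 taken, 2/3 not taken */
    uint32_t      gbhr;               /* global branch history register           */

    /* XOR low PC bits with the global history so the same branch can use
       different counters in different global contexts. */
    uint32_t pht_index(uint64_t pc)
    {
        return ((uint32_t)(pc >> 2) ^ gbhr) & (PHT_ENTRIES - 1u);
    }

    bool two_level_predict(uint64_t pc)
    {
        return pht[pht_index(pc)] <= 1;
    }

    void two_level_update(uint64_t pc, bool taken)
    {
        unsigned char *c = &pht[pht_index(pc)];
        if (taken) { if (*c > 0) (*c)--; }
        else       { if (*c < 3) (*c)++; }
        gbhr = ((gbhr << 1) | (taken ? 1u : 0u)) & (PHT_ENTRIES - 1u);
    }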

Page 25: CSL718 : Pipelined Processors

Two-level Predictor Classification

• Yeh and Patt 3-letter naming scheme
  – Type of history collected
    • G (global), P (per branch), S (per set)
  – PHT type
    • A (adaptive), S (static)
  – PHT organization
    • g (global), p (per branch), s (per set)

• Examples - GAs, PAp, etc.

Page 26: CSL718 : Pipelined Processors

Improving Branch Performance

• Branch Elimination – replace branch with other instructions

• Branch Speed Up – reduce time for computing CC and TIF

• Branch Prediction – guess the outcome and proceed, undo if necessary

• Branch Target Capture – make use of history

Page 27: CSL718 : Pipelined Processors

Branch Target Capture

• Branch Target Buffer (BTB)
• Target Instruction Buffer (TIB)

Entry format:  instr addr | pred stats | target
(the target field holds the target address in a BTB, the target instruction in a TIB)

Probability of the target changing < 5%
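
A minimal sketch of one such entry, matching the fields listed above (field widths are my assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    /* One Branch Target Buffer entry, looked up with the fetch address.
       A Target Instruction Buffer entry would store the target instruction
       itself instead of (or along with) the target address. */
    typedef struct {
        uint64_t branch_addr;   /* instr addr: identifies the branch (tag) */
        uint8_t  pred_state;    /* pred stats: e.g. a 2-bit counter        */
        uint64_t target_addr;   /* last known branch target address        */
        bool     valid;
    } btb_entry_t;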

Page 28: CSL718 : Pipelined Processors

BTB Performance

Decision: on a BTB hit, go to the target; on a BTB miss, go inline.

    BTB hit (.4):   actual target (.8) -> delay 0     actual inline (.2) -> delay 5
    BTB miss (.6):  actual target (.2) -> delay 4     actual inline (.8) -> delay 0

Average delay = .4*.8*0 + .4*.2*5 + .6*.2*4 + .6*.8*0 = 0.88 cycles

Page 29: CSL718 : Pipelined Processors

Dynamic information about branch

• Previous branch decisions
  – explicit prediction
  – stored in cache directory
  – Branch History Table (BHT)

• Previous target address / instruction
  – implicit prediction
  – stored in a separate buffer
  – Branch Target Buffer (BTB) / Branch Target Address Cache (BTAC); Target Instruction Buffer (TIB) / Branch Target Instruction Cache (BTIC)

These two can be combined.

Page 30: CSL718 : Pipelined Processors

Storing prediction info

In the cache: the directory entry of each cache line holds a prediction counter.

[Figure: cache directory + storage, with a counter per cache line]

In a separate buffer: entries of the form  instr addr | pred stats | target

Page 31: CSL718 : Pipelined Processors

Combined prediction mechanism

• Explicit: use history bits
• Implicit: use BTB hit/miss
  – hit: go to target; miss: go inline
• Combined: BTB hit/miss followed by explicit prediction using history bits. One of the following is commonly used (see the sketch after this list):
  – hit: go to target; miss: explicit prediction
  – miss: go inline; hit: explicit prediction
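
A sketch of the second option ("miss: go inline; hit: explicit prediction"). btb_lookup and history_predicts_taken are hypothetical placeholders standing in for the structures described on the surrounding slides:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { bool hit; uint64_t target; } btb_result_t;

    /* Placeholder stand-ins for the real BTB and history-bit lookups. */
    btb_result_t btb_lookup(uint64_t fetch_addr)
    {
        (void)fetch_addr;
        btb_result_t r = { false, 0 };
        return r;
    }

    bool history_predicts_taken(uint64_t fetch_addr)
    {
        (void)fetch_addr;
        return false;
    }

    uint64_t next_fetch_addr(uint64_t fetch_addr, uint64_t inline_addr)
    {
        btb_result_t r = btb_lookup(fetch_addr);
        if (!r.hit)
            return inline_addr;                       /* miss: go inline           */
        return history_predicts_taken(fetch_addr)     /* hit: consult history bits */
             ? r.target
             : inline_addr;
    }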

Page 32: CSL718 : Pipelined Processors

Combined prediction

[Decision trees for the two combined policies: each tree shows the BTB hit/miss outcome and, where applicable, the explicit prediction (expl predict), the resulting prediction, and the possible actual outcomes.
 Prediction - T: Target, I: Inline.  Actual outcome - T: Target, I: Inline.]

Page 33: CSL718 : Pipelined Processors

Structure of Tables

Instruction fetch path with
• BHT
• BTAC
• BTIC

Page 34: CSL718 : Pipelined Processors

Compute/fetch scheme

[Figure: the IFAR supplies the instruction fetch address to the I-cache; an adder (+) forms the next sequential address, while the branch target address (BTA) is computed from the fetched branch instruction. A group I, I+1, I+2, I+3 is fetched per cycle; on a taken branch the next group is BTI, BTI+1, BTI+2, BTI+3.]

(no dynamic branch prediction)
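
In code form, the compute/fetch scheme amounts to the following (a sketch; the PC-relative displacement field and the 4-instruction, 16-byte fetch group are assumptions):

    #include <stdint.h>

    /* The target address is computed from the fetched branch instruction
       itself, so redirection happens only after the branch is fetched and
       examined. */
    uint64_t compute_bta(uint64_t branch_pc, int32_t displacement)
    {
        return branch_pc + (int64_t)displacement;
    }

    uint64_t next_fetch_address(uint64_t ifar, int branch_taken, uint64_t bta)
    {
        return branch_taken ? bta : ifar + 16;   /* 16 = 4 instr * 4 bytes */
    }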

Page 35: CSL718 : Pipelined Processors

BHT (Branch History Table)

[Figure: instruction fetch path with a BHT.
 I-cache: 16 K, 4-way set associative, 128 x 4 lines, 8 instructions per line; 4 instructions are fetched per cycle (using the instruction fetch address) into the decode queue and issue queue (4 x 1 instr each).
 BHT: 128 x 4 entries, with 2 history bits per fetched instruction (2 2 2 2), feeding prediction logic that outputs taken / not taken and the BTA for a taken guess.]

Page 36: CSL718 : Pipelined Processors

BTAC scheme

[Figure: as in the compute/fetch scheme, the IFAR addresses the I-cache and an adder (+) forms the next sequential address. In parallel, the instruction fetch address also looks up the BTAC, whose entries hold a branch address (BA) and the corresponding branch target address (BTA); on a BTAC hit the stored BTA, rather than the sequential address, becomes the next fetch address, so the target group BTI, BTI+1, BTI+2, BTI+3 follows the group at I.]
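
The address selection on a BTAC hit can be sketched as follows (btac_lookup is a placeholder for the BA/BTA table shown above; the fixed sequential increment assumes a 16-byte fetch group):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { bool hit; uint64_t bta; } btac_result_t;

    btac_result_t btac_lookup(uint64_t fetch_addr)
    {
        (void)fetch_addr;                  /* placeholder lookup */
        btac_result_t r = { false, 0 };
        return r;
    }

    /* Sequential by default, redirected to the stored target on a BTAC hit
       (an implicit "predict taken" for addresses that hit). */
    uint64_t next_ifar(uint64_t ifar)
    {
        btac_result_t r = btac_lookup(ifar);
        return r.hit ? r.bta : ifar + 16;
    }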

Page 37: CSL718 : Pipelined Processors

BTIC scheme - 1

[Figure: BTIC entries hold a branch address (BA), the branch target instruction (BTI) and a continuation address (BTA+). On a hit, the stored target instruction is sent directly to the decoder while the I-cache continues fetching from the stored continuation address.]

Page 38: CSL718 : Pipelined Processors

BTIC scheme - 2

[Figure: variant in which each BTIC entry holds the branch address (BA) and the first two target instructions (BTI, BTI+1), which are sent to the decoder on a hit; the continuation address (BTA+) is computed rather than stored.]

Page 39: CSL718 : Pipelined Processors

Successor index in I-cache

[Figure: each I-cache line stores a successor index along with its instructions; the next fetch address is taken from this index, which can point either to the next sequential group (I+1, I+2, I+3, ...) or to the branch target group (BTI, BTI+1, ...), so no separate target computation or lookup is needed on the fetch path.]