Advanced Computer Architecture

Instruction Level Parallelism

Myung Hoon, Sunwoo

School of Electrical and Computer Engineering
Ajou University

Outline
• ILP
• Compiler techniques to increase ILP
• Loop Unrolling
• Static Branch Prediction
• Dynamic Branch Prediction
• Overcoming Data Hazards with Dynamic Scheduling
• Tomasulo Algorithm
• Speculation

Ajou Univ. SOC Lab. Multimedia Communications

Recall from Pipelining Review

• Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls

– Ideal pipeline CPI: measure of the maximum performance attainable by the implementation

– Structural hazards: HW cannot support this combination of instructions

– Data hazards: Instruction depends on result of prior instruction still in the pipeline

– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
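The CPI equation above can be evaluated directly. The sketch below is only an illustration: the function name and the stall values are hypothetical, not measurements from the lecture.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI = ideal CPI plus the average stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls


# Hypothetical stall contributions (cycles per instruction), for illustration only.
cpi = pipeline_cpi(ideal_cpi=1.0, structural_stalls=0.05,
                   data_hazard_stalls=0.20, control_stalls=0.15)
print(f"Pipeline CPI = {cpi:.2f}")
```

Each technique in this lecture targets one term: scheduling and renaming reduce data hazard stalls, while unrolling and branch prediction reduce control stalls.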


Instruction Level Parallelism

• Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance

• 2 approaches to exploit ILP:
1) Rely on hardware to help discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power)
2) Rely on software technology to find parallelism statically at compile time (e.g., Itanium 2)


Instruction-Level Parallelism (ILP)

• Basic Block (BB) ILP is quite small
– BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
– Average dynamic branch frequency of 15% to 25% => 4 to 7 instructions execute between a pair of branches
– Plus, instructions in a BB are likely to depend on each other

• To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks

• Simplest: loop-level parallelism to exploit parallelism among iterations of a loop. E.g.,

    for (i=1; i<=1000; i=i+1)
        x[i] = x[i] + y[i];


Loop-Level Parallelism

• Exploit loop-level parallelism by “unrolling the loop” either
1. dynamically, via branch prediction, or
2. statically, via loop unrolling by the compiler

• Determining instruction dependence is critical to loop-level parallelism

• If 2 instructions are
– parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards)
– dependent, they are not parallel and must be executed in order, although they may often be partially overlapped


Data Dependence and Hazards

• InstrJ is data dependent (aka true dependence) on InstrI if:
1. InstrJ tries to read operand before InstrI writes it

    I: add r1,r2,r3
    J: sub r4,r1,r3

2. or InstrJ is data dependent on InstrK, which is dependent on InstrI

• If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped

• Data dependence in instruction sequence ⇒ data dependence in source code ⇒ effect of original data dependence must be preserved

• If data dependence caused a hazard in pipeline, called a Read After Write (RAW) hazard


ILP and Data Dependencies, Hazards

• HW/SW must preserve program order: the order instructions would execute in if executed sequentially as determined by the original source program
– Dependences are a property of programs

• Presence of dependence indicates potential for a hazard, but actual hazard and length of any stall is property of the pipeline

• Importance of the data dependencies
1) indicates the possibility of a hazard
2) determines order in which results must be calculated
3) sets an upper bound on how much parallelism can possibly be exploited

• HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program


Name Dependence #1: Anti-dependence

• Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; 2 versions of name dependence

• InstrJ writes operand before InstrI reads it

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”

• If anti-dependence caused a hazard in the pipeline, called a Write After Read (WAR) hazard


Name Dependence #2: Output dependence

• InstrJ writes operand before InstrI writes it.

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”

• If output dependence caused a hazard in the pipeline, called a Write After Write (WAW) hazard

• Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict
– Register renaming resolves name dependence for registers
– Either by compiler or by HW
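The three dependence types (true, anti, output) can be checked mechanically for a pair of instructions. The following is a minimal sketch: the `Instr` tuple and `classify_dependences` are hypothetical helper names, only register operands are considered, and any instructions between I and J are ignored.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def classify_dependences(i: Instr, j: Instr) -> Set[str]:
    """Hazard types that later instruction j could have with earlier instruction i."""
    found = set()
    if i.dest in j.srcs:
        found.add("RAW")  # true data dependence: j reads what i writes
    if j.dest in i.srcs:
        found.add("WAR")  # anti-dependence: j overwrites a name i still reads
    if j.dest == i.dest:
        found.add("WAW")  # output dependence: both write the same name
    return found


# The instruction pairs used on the dependence slides.
print(classify_dependences(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r3"))))
print(classify_dependences(Instr("sub", "r4", ("r1", "r3")), Instr("add", "r1", ("r2", "r3"))))
print(classify_dependences(Instr("sub", "r1", ("r4", "r3")), Instr("add", "r1", ("r2", "r3"))))
```

Only the RAW case reflects a real flow of data; the WAR and WAW cases disappear once the destination is given a different name.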


Control Dependencies

• Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order

    if p1 {
        S1;
    };
    if p2 {
        S2;
    }

• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.


Control Dependence Ignored

• Control dependence need not be preserved
– We are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting correctness of the program

• Instead, 2 properties critical to program correctness are
1) exception behavior and
2) data flow


Exception Behavior

• Preserving exception behavior ⇒ any changes in instruction execution order must not change how exceptions are raised in program (⇒ no new exceptions)

• Example:

        DADDU R2,R3,R4
        BEQZ  R2,L1
        LW    R1,0(R2)
    L1:

– (Assume branches not delayed)

• Problem with moving LW before BEQZ?


Data Flow

• Data flow: actual flow of data values among instructions that produce results and those that consume them
– Branches make flow dynamic, determine which instruction is supplier of data

• Example:

        DADDU R1,R2,R3
        BEQZ  R4,L
        DSUBU R1,R5,R6
    L:  …
        OR    R7,R1,R8

• OR depends on DADDU or DSUBU? Must preserve data flow on execution


Computers in the News

Who said this?
A. Jimmy Carter, 1979
B. Bill Clinton, 1996
C. Al Gore, 2000
D. George W. Bush, 2006

"Again, I'd repeat to you that if we can remain the most competitive nation in the world, it will benefit the worker here in America. People have got to understand, when we talk about spending your taxpayers' money on research and development, there is a correlating benefit, particularly to your children. See, it takes a while for some of the investments that are being made with government dollars to come to market. I don't know if people realize this, but the Internet began as the Defense Department project to improve military communications. In other words, we were trying to figure out how to better communicate, here was research money spent, and as a result of this sound investment, the Internet came to be.

The Internet has changed us. It's changed the whole world."


Outline
• ILP
• Compiler techniques to increase ILP
• Loop Unrolling
• Static Branch Prediction
• Dynamic Branch Prediction
• Overcoming Data Hazards with Dynamic Scheduling
• Tomasulo Algorithm
• Speculation


Software Techniques - Example

• This code adds a scalar to a vector:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

• Assume following latencies for all examples
– Ignore delayed branch in these examples

    Instruction         Instruction          Delay (distance)   Latency (stalls between)
    producing result    using result         in cycles          in cycles
    FP ALU op           Another FP ALU op    4                  3
    FP ALU op           Store double         3                  2
    Load double         FP ALU op            2                  1
    Load double         Store double         1                  0
    Integer op          Integer op           1                  0

FP Loop: Where are the Hazards?

• First translate into MIPS code:
– To simplify, assume 8 is lowest address

    Loop: L.D    F0,0(R1)    ;F0=vector element
          ADD.D  F4,F0,F2    ;add scalar from F2
          S.D    0(R1),F4    ;store result
          DADDUI R1,R1,-8    ;decrement pointer 8B (DW)
          BNEZ   R1,Loop     ;branch R1!=zero


FP Loop Showing Stalls

    1 Loop: L.D    F0,0(R1)    ;F0=vector element
    2       stall
    3       ADD.D  F4,F0,F2    ;add scalar in F2
    4       stall
    5       stall
    6       S.D    0(R1),F4    ;store result
    7       DADDUI R1,R1,-8    ;decrement pointer 8B (DW)
    8       stall              ;assumes can’t forward to branch
    9       BNEZ   R1,Loop     ;branch R1!=zero

    Instruction         Instruction          Latency in
    producing result    using result         clock cycles
    FP ALU op           Another FP ALU op    3
    FP ALU op           Store double         2
    Load double         FP ALU op            1

• 9 clock cycles: Rewrite code to minimize stalls?

Revised FP Loop Minimizing Stalls

    1 Loop: L.D    F0,0(R1)
    2       DADDUI R1,R1,-8
    3       ADD.D  F4,F0,F2
    4       stall
    5       stall
    6       S.D    8(R1),F4    ;altered offset when move DADDUI
    7       BNEZ   R1,Loop

Swap DADDUI and S.D by changing address of S.D

    Instruction         Instruction          Latency in
    producing result    using result         clock cycles
    FP ALU op           Another FP ALU op    3
    FP ALU op           Store double         2
    Load double         FP ALU op            1

7 clock cycles, but just 3 for execution (L.D, ADD.D, S.D), 4 for loop overhead; how to make it faster?
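The cycle counts of the original and revised loops follow from the latency table. The sketch below is a simplified model under stated assumptions: single-issue, in-order, the `LATENCY` table and `schedule` function are hypothetical names, and the 1-cycle integer-to-branch stall encodes the slides' "can't forward to branch" assumption.

```python
# Stall cycles required between a producer and a dependent consumer.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,  # assumption from the slides: no forwarding to the branch
}


def schedule(instrs):
    """Return the issue cycle of each (kind, dest, srcs) instruction, in program order."""
    producers = {}  # register -> (kind of producing instruction, its issue cycle)
    cycles = []
    cycle = 0
    for kind, dest, srcs in instrs:
        earliest = cycle + 1  # in-order, one instruction per cycle
        for reg in srcs:
            if reg in producers:
                pkind, pcycle = producers[reg]
                earliest = max(earliest, pcycle + LATENCY.get((pkind, kind), 0) + 1)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            producers[dest] = (kind, cycle)
    return cycles


original = [
    ("load", "F0", ("R1",)),          # L.D    F0,0(R1)
    ("fpalu", "F4", ("F0", "F2")),    # ADD.D  F4,F0,F2
    ("store", None, ("R1", "F4")),    # S.D    0(R1),F4
    ("int", "R1", ("R1",)),           # DADDUI R1,R1,-8
    ("branch", None, ("R1",)),        # BNEZ   R1,Loop
]
revised = [original[0], original[3], original[1], original[2], original[4]]

print(schedule(original))  # issue cycles of the unscheduled loop
print(schedule(revised))   # issue cycles after moving DADDUI up
```

The last issue cycle of each list is the loop's length in clock cycles: 9 for the original order and 7 for the revised one, matching the slides.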

Unroll Loop Four Times (straightforward way)

    1  Loop: L.D    F0,0(R1)
                                   ;1 cycle stall
    3        ADD.D  F4,F0,F2
                                   ;2 cycles stall
    6        S.D    0(R1),F4       ;drop DADDUI & BNEZ
    7        L.D    F6,-8(R1)
    9        ADD.D  F8,F6,F2
    12       S.D    -8(R1),F8      ;drop DADDUI & BNEZ
    13       L.D    F10,-16(R1)
    15       ADD.D  F12,F10,F2
    18       S.D    -16(R1),F12    ;drop DADDUI & BNEZ
    19       L.D    F14,-24(R1)
    21       ADD.D  F16,F14,F2
    24       S.D    -24(R1),F16
    25       DADDUI R1,R1,#-32     ;alter to 4*8
    27       BNEZ   R1,Loop        ;1 cycle stall after DADDUI

27 clock cycles, or 6.75 per iteration
(Assumes the number of loop iterations is a multiple of 4)

• Rewrite loop to minimize stalls?

Unrolled Loop Detail

• Do not usually know upper bound of loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops:
– 1st executes (n mod k) times and has a body that is the original loop
– 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
• For large values of n, most of the execution time will be spent in the unrolled loop
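The pair of consecutive loops can be sketched at source level. This is an illustration only, assuming k = 4 and a Python list standing in for the vector x; the function name is hypothetical.

```python
def add_scalar_unrolled_by_4(x, s):
    """Add s to every element of x using a cleanup loop plus a 4-way unrolled loop."""
    n = len(x)
    i = 0
    # 1st loop: the original body, executed (n mod k) times.
    for _ in range(n % 4):
        x[i] = x[i] + s
        i += 1
    # 2nd loop: k copies of the body, executed (n / k) times.
    for _ in range(n // 4):
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += 4
    return x


print(add_scalar_unrolled_by_4(list(range(10)), 5))
```

With n = 10 the first loop runs twice and the unrolled loop runs twice, so the loop test and branch execute 4 times instead of 10.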


Unrolled Loop Detail

n = 10, k = 3: (n mod k) = 1, n / k = 3

          L.D    F0,0(R1)        ;1st exec.: original body, (n mod k) = 1 time
          ADD.D  F4,F0,F2
          S.D    0(R1),F4
    Loop: L.D    F6,-8(R1)       ;2nd exec.: unrolled body, n / k = 3 times
          ADD.D  F8,F6,F2
          S.D    -8(R1),F8
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    -16(R1),F12
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    -24(R1),F16
          DADDUI R1,R1,#-32
          BNEZ   R1,Loop

n = 10, k = 5: (n mod k) = 0, n / k = 2

    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    0(R1),F4
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    -8(R1),F8
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    -16(R1),F12
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    -24(R1),F16
          L.D    F14,-32(R1)
          ADD.D  F16,F14,F2
          S.D    -32(R1),F16
          DADDUI R1,R1,#-40
          BNEZ   R1,Loop

Unrolled Loop That Minimizes Stalls

    1  Loop: L.D    F0,0(R1)
    2        L.D    F6,-8(R1)
    3        L.D    F10,-16(R1)
    4        L.D    F14,-24(R1)
    5        ADD.D  F4,F0,F2
    6        ADD.D  F8,F6,F2
    7        ADD.D  F12,F10,F2
    8        ADD.D  F16,F14,F2
    9        S.D    0(R1),F4
    10       S.D    -8(R1),F8
    11       S.D    -16(R1),F12
    12       DADDUI R1,R1,#-32
    13       S.D    8(R1),F16      ; 8-32 = -24
    14       BNEZ   R1,Loop

14 clock cycles, or 3.5 per iteration


5 Loop Unrolling Decisions

• Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:

1. Determine loop unrolling useful by finding that loop iterations were independent (except for maintenance code)
2. Use different registers to avoid unnecessary constraints forced by using same registers for different computations
3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
4. Determine that loads and stores in unrolled loop can be interchanged by observing that loads and stores from different iterations are independent
• Transformation requires analyzing memory addresses and finding that they do not refer to the same address
5. Schedule the code, preserving any dependences needed to yield the same result as the original code


3 Limits to Loop Unrolling

1. Decrease in amount of overhead amortized with each extra unrolling
• Amdahl’s Law
2. Growth in code size
• For larger loops, the concern is that it increases the instruction cache miss rate
3. Register pressure: potential shortfall in registers created by aggressive unrolling and scheduling
• If it is not possible to allocate all live values to registers, unrolling may lose some or all of its advantage

• Loop unrolling reduces impact of branches on pipeline; another way is branch prediction


Static Branch Prediction

• Lecture from last week showed scheduling code around delayed branch
• To reorder code around branches, need to predict branch statically at compile time
• Simplest scheme is to predict a branch as taken
– Average misprediction = untaken branch frequency = 34% SPEC

[Figure: misprediction rate (0% to 25%) per SPEC benchmark. Integer: compress, eqntott, espresso, gcc, li. Floating point: doduc, ear, hydro2d, mdljdp, su2cor.]

Dynamic Branch Prediction

• Why does prediction work?
– Underlying algorithm has regularities
– Data that is being operated on has regularities
– Instruction sequence has redundancies that are artifacts of way that humans/compilers think about problems

• Is dynamic branch prediction better than static branch prediction?
– Seems to be
– There are a small number of important branches in programs which have dynamic behavior


Dynamic Branch Prediction

• Performance = ƒ(accuracy, cost of misprediction)

• Branch History Table: Lower bits of PC address index table of 1-bit values
– Says whether or not branch taken last time
– No address check

• Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):
– End of loop case, when it exits instead of looping as before
– First time through loop on next time through code, when it predicts exit instead of looping
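The two mispredictions per loop execution can be reproduced with a few lines. This is a minimal sketch of a single 1-bit entry; the function name and the example trace (9 taken iterations followed by the exit) are assumptions for illustration.

```python
def mispredictions_1bit(outcomes, initial=False):
    """Count mispredictions of one 1-bit BHT entry; True means taken."""
    prediction = initial
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # the entry only remembers the last outcome
    return misses


# Loop branch: taken 9 times, then not taken at loop exit.
loop_run = [True] * 9 + [False]
print(mispredictions_1bit(loop_run * 3, initial=True))
```

Starting from "taken", the first execution mispredicts only the exit, and each later execution mispredicts twice (first iteration and exit), giving 5 mispredictions over three executions.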


Dynamic Branch Prediction

• Solution: 2-bit scheme where change prediction only if get misprediction twice

[Figure: state diagram with two "Predict Taken" states and two "Predict Not Taken" states, connected by T (taken) and NT (not taken) transitions]

• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to decision making process
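The effect of the hysteresis can be seen on a loop branch. The sketch below assumes a 2-bit saturating counter (values 2 and 3 predict taken), a common variant that may differ in detail from the state diagram on the slide; the function name and trace are hypothetical.

```python
def mispredictions_2bit(outcomes, counter=3):
    """Count mispredictions of one 2-bit saturating counter; True means taken."""
    misses = 0
    for taken in outcomes:
        predict_taken = counter >= 2
        if predict_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# Loop branch: taken 9 times, then not taken at loop exit, executed 3 times.
loop_run = [True] * 9 + [False]
print(mispredictions_2bit(loop_run * 3))
```

Only the loop exit is mispredicted in each execution (3 in total), because a single not-taken outcome is not enough to flip the prediction.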


BHT Accuracy

• Mispredict because either:
– Wrong guess for that branch
– Got branch history of wrong branch when index the table

• 4096 entry table:

[Figure: misprediction rate (0% to 20%) of a 4096-entry BHT for eqntott, espresso, gcc, li, spice, doduc, fpppp, matrix300, nasa7]


Correlated Branch Prediction (1)

• Definition: Branch prediction that uses the behavior of other branches to make a prediction
• Consider the following code sequence

    if (d==0) d=1;
    if (d==1) ...;

To MIPS code (assuming d is assigned to register R1):

        BNEZ   R1,L1       ; branch b1 (d!=0)
        DADDIU R1,R0,#1    ; d==0, so d=1
    L1: DADDIU R3,R1,#-1
        BNEZ   R3,L2       ; branch b2 (d!=1)
        …
    L2: …

[Possible execution sequences]

    Initial value of d   d == 0?   b1   Value of d before b2   d == 1?   b2
    0                    Yes       NT   1                      Yes       NT
    1                    No        T    1                      Yes       NT
    2                    No        T    2                      No        T

NT: Not Taken, T: Taken

Correlated Branch Prediction (2)

• In the case of a 1-bit predictor

[Figure: two-state diagram, "Predict Taken" and "Predict Not Taken", switching on T and NT outcomes]

[Behavior of a 1-bit predictor initialized to NT]

    d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
    2     NT (initial)    T           T                   NT (initial)    T           T
    0     T               NT          NT                  T               NT          NT
    2     NT              T           T                   NT              T           T
    0     T               NT          NT                  T               NT          NT

(b1 and b2 are mispredicted in every iteration)

• As the above table shows, all the branches are mispredicted.


Correlated Branch Prediction (3)

• Consider a predictor that uses 1 bit of correlation
– The easiest way to think of it is that every branch has two separate prediction bits
» The first bit: the prediction bit if the last branch is not taken.
» The second bit: the prediction bit if the last branch is taken.

    Prediction bits   Prediction if last branch is not taken   Prediction if last branch is taken
    NT/NT             NT                                       NT
    NT/T              NT                                       T
    T/NT              T                                        NT
    T/T               T                                        T

[Behavior of the 1-bit predictor with 1 bit of correlation]

    d=?   b1 prediction     b1 action   New b1 prediction   b2 prediction     b2 action   New b2 prediction
    2     NT/NT (initial)   T           T/NT                NT/NT (initial)   T           NT/T
    0     T/NT              NT          T/NT                NT/T              NT          NT/T
    2     T/NT              T           T/NT                NT/T              T           NT/T
    0     T/NT              NT          T/NT                NT/T              NT          NT/T

(b1 and b2 are mispredicted only in the first iteration; all later predictions are correct)

• All branches except those in the first iteration are correctly predicted.
• Even if we had chosen different values for d, the predictor for b2 would correctly predict the case when b1 is not taken, since b2 is correlated with the outcome of b1.

• 1-bit predictor with 1 bit of correlation is called a (1,1) predictor.
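The two behavior tables (plain 1-bit predictor versus (1,1) predictor) can be reproduced by simulation. The sketch below is an illustration under stated assumptions: function names are hypothetical, all prediction bits start at NT, the global history starts at NT, and d alternates 2, 0, 2, 0 as on the slides.

```python
def run_branches(d_values):
    """Return the (branch, taken) trace of b1 and b2 for each initial value of d."""
    trace = []
    for d in d_values:
        b1_taken = d != 0      # BNEZ R1,L1
        if not b1_taken:
            d = 1              # if (d==0) d=1;
        b2_taken = d != 1      # BNEZ R3,L2 with R3 = d-1
        trace.append(("b1", b1_taken))
        trace.append(("b2", b2_taken))
    return trace


def misses_one_bit(trace):
    """Mispredictions of a plain 1-bit predictor, one bit per branch, initialized to NT."""
    table = {"b1": False, "b2": False}
    misses = 0
    for name, taken in trace:
        misses += table[name] != taken
        table[name] = taken
    return misses


def misses_1_1(trace):
    """Mispredictions of a (1,1) predictor: two bits per branch, chosen by the last outcome."""
    table = {"b1": [False, False], "b2": [False, False]}  # [if last NT, if last T]
    last = 0  # 1-bit global branch history, initially NT
    misses = 0
    for name, taken in trace:
        misses += table[name][last] != taken
        table[name][last] = taken  # update only the selected prediction bit
        last = int(taken)
    return misses


trace = run_branches([2, 0, 2, 0])
print(misses_one_bit(trace), misses_1_1(trace))
```

For this trace the plain 1-bit predictor mispredicts all 8 branches, while the (1,1) predictor mispredicts only the 2 branches of the first iteration.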

Correlated Branch Prediction (4)

• In general, an (m, n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters
– A (1,1) predictor uses the behavior of the last branch to choose from 2 branch predictors (i.e., 2^1 history tables), each of which is a 1-bit (n=1) predictor
• (2,2) predictor
– Behavior of recent branches selects between four predictions of next branch, updating just that prediction

[Figure: (2,2) predictor. The low-order 4 bits of the branch instruction address select a row of 2-bit per-branch predictors; the 2-bit global branch history selects which of the four predictors in that row supplies the prediction.]

– If two branches have the same low-order address bits, this doesn’t matter: the prediction is a hint that is assumed to be correct. If the hint turns out to be wrong, the prediction bit is inverted and stored back.

• Global Branch History: m-bit shift register keeping T/NT status of last m branches.

Example-(1,1) predictor

• (1,1) predictor example (NT = 0, T = 1)

    d=?   b1 prediction     b1 action   New b1 prediction   b2 prediction     b2 action   New b2 prediction
    2     NT/NT (initial)   T           T/NT                NT/NT (initial)   T           NT/T
    0     T/NT              NT          T/NT                NT/T              NT          NT/T
    2     T/NT              T           T/NT                NT/T              T           NT/T

[Figure: 1-bit predictor table indexed by the 1-bit global branch history (GBH) and the 2-bit branch address; entries 000 to 011 are used when GBH = NT, entries 100 to 111 when GBH = T. b1 has branch address bits 01 and all entries are initially NT (0).]

1. Predict NT for b1: the GBH is initially NT (0), and entry 001 holds 0 (NT)
2. b1 action is Taken, so the new prediction is Taken
• Update new prediction value into the content of address 001
• Update 1 (taken) into GBH

Example-(1,1) predictor

    d=?   b1 prediction     b1 action   New b1 prediction   b2 prediction     b2 action   New b2 prediction
    2     NT/NT (initial)   T           T/NT                NT/NT (initial)   T           NT/T

[Figure: predictor table and GBH for steps 3 to 5; branch address (2 bits) is 11 for b2 and 01 for b1]

3. b2 prediction is NT because the last branch b1 is T (entry 111 holds 0)
4. b2 action is T, so new b2 prediction is T
• Update new prediction value into the content of address 111
• Update 1 (taken) into GBH
5. b1 prediction is NT because the last branch b2 is T (entry 101 holds 0)


Example-(1,1) predictor

    d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
    0     T/NT            NT          T/NT                NT/T            NT          NT/T

[Figure: predictor table and GBH for steps 6 to 8; branch address is 01 for b1 and 11 for b2]

6. b1 action is NT, so prediction is correct (prediction bit not updated)
7. b2 prediction is NT because the last branch b1 is NT
8. b2 action is NT, so prediction is correct (prediction bit not updated)

Example-(1,1) predictor

    d=?   b1 prediction   b1 action   New b1 prediction   b2 prediction   b2 action   New b2 prediction
    2     T/NT            T           T/NT                NT/T            T           NT/T

[Figure: predictor table and GBH for steps 9 to 11; branch address is 01 for b1 and 11 for b2]

9. b1 action is T, so prediction is correct (prediction bit not updated)
10. b2 prediction is T because the last branch b1 is T
11. b2 action is T, so prediction is correct (prediction bit not updated)


Accuracy of Different Schemes

• 2-bit predictor can access all 4096 entries or a part of 4096 entries.

[Figure: frequency of mispredictions (0% to 20%) for nasa7, matrix300, doduc, spice, fpppp, gcc, espresso, eqntott, li, tomcatv under three schemes: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2)]

Tournament Predictors (1)

• The main idea behind tournament predictors is to implement several predictors and then dynamically select one of the predictors.

• Global Predictor
– If the surrounding pattern includes neighboring branches, the predictor is referred to as a global predictor
– Makes use of the fact that the behavior of many branches is strongly correlated with the history of other recently taken branches.
– Keeps a single shift register updated with the recent history of every branch executed, and uses this value to index into a table.

[Figure: a single 12-bit shift register that keeps track of recent history for all branches (e.g., 00…110101) indexes a pattern history table of 4K entries]

Tournament Predictors (2)

• Local Predictor
– If the surrounding pattern belongs to the branch being predicted, the predictor is referred to as a local predictor
– Local branch history table
» Indexed by the low-order bits of each branch instruction’s address from the Program Counter (PC)
» Each history table entry records the direction taken by the most recent n branches.
(ex) If the branch was recently taken 10 times, the entry will be all 1s.

    i=0;
    do {
        i=i+1;
    } while(i<=10);

The single loop branch is taken 10 times in the do loop, until i exceeds 10.

[Figure: 10 bits of the branch PC index into the local branch history table (1K entries, each a 10-bit history for a single branch, e.g., 0111011001); the selected 10-bit history indexes into the next-level pattern history table]

Tournament Predictors (3)

• Tournament Predictor: P1/P2 = global predictor / local predictor

[Figure: four-state selector diagram with states S0 = 00, S1 = 01, S2 = 10, S3 = 11]

    P1c   P2c   P1c-P2c   Action
    0     0     0         No change
    0     1     -1        Decrement counter
    1     0     1         Increment counter
    1     1     0         No change

P1c and P2c denote whether predictors P1 and P2 are correct, respectively.

• The tournament predictor uses a 2-bit up/down saturating counter.
Ex) If the current tournament predictor state is S3 and only P2 is correct, then P1c is 0 and P2c is 1. Hence, the counter is decremented by 1 and the next tournament predictor state is S2.
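The selector update in the table above amounts to adding P1c - P2c to a saturating counter. The sketch below uses hypothetical function names, and it assumes that states S2 and S3 select P1 while S0 and S1 select P2, which the slide does not state explicitly.

```python
def update_selector(state, p1_correct, p2_correct):
    """Next state (0..3 for S0..S3) of the 2-bit up/down saturating selector."""
    delta = int(p1_correct) - int(p2_correct)  # P1c - P2c
    return max(0, min(3, state + delta))


def selected_predictor(state):
    """Assumed mapping: the upper two states choose P1, the lower two choose P2."""
    return "P1" if state >= 2 else "P2"


# Slide example: state S3, only P2 correct, so the counter is decremented to S2.
print(update_selector(3, p1_correct=False, p2_correct=True))
```

When both predictors agree in correctness the state is unchanged, so the selector only learns from branches on which the two predictors disagree.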

Tournament Predictors (4)

• Alpha 21264 branch predictor [2001]
– Tournament predictor using 4K 2-bit counters indexed by local branch address. Chooses between:
– Global predictor
» 4K entries indexed by history of last 12 branches (2^12 = 4K)
» Each entry is a standard 2-bit predictor
– Local predictor
» Local history table: 1024 10-bit entries recording last 10 branches, indexed by branch address
» The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters

[Figure: Alpha 21264 branch predictor logic. The branch PC feeds the local predictor, the global predictor, and the tournament predictor (a table of 2-bit saturating counters), which drives a MUX selecting between the local and global predictions.]
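The table sizes listed for the Alpha 21264 predictor add up to roughly 29K bits. The sketch below only does this arithmetic; the function name and dictionary keys are hypothetical labels for the structures named on the slide.

```python
def alpha_21264_predictor_bits():
    """Storage in bits for each predictor table described on the slide."""
    sizes = {
        "global predictor": 4096 * 2,       # 4K entries of 2-bit counters
        "selector": 4096 * 2,               # 4K 2-bit tournament counters
        "local history table": 1024 * 10,   # 1024 entries of 10-bit history
        "local counters": 1024 * 3,         # 1K entries of 3-bit counters
    }
    sizes["total"] = sum(sizes.values())
    return sizes


bits = alpha_21264_predictor_bits()
print(bits["total"], bits["total"] // 1024)
```

The total of 29,696 bits (29K) is the same order as the roughly 30K bits quoted for tournament predictors in the summary slide.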

Comparing Predictors (Fig. 2.8)

• Advantage of tournament predictor is ability to select the right predictor for a particular branch

– Particularly crucial for integer benchmarks.
– A typical tournament predictor will select the global predictor almost 40% of the time for the SPEC integer benchmarks and less than 15% of the time for the SPEC FP benchmarks


Pentium 4 Misprediction Rate (per 1000 instructions, not per branch)

• ≈6% misprediction rate per branch SPECint (19% of INT instructions are branch)
• ≈2% misprediction rate per branch SPECfp (5% of FP instructions are branch)

[Figure: branch mispredictions per 1000 instructions (0 to 14). SPECint2000: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty. SPECfp2000: 168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa.]

Branch Target Buffers (BTB)

• Branch target calculation is costly and stalls the instruction fetch.

• BTB stores PCs the same way as caches

• The PC of a branch is sent to the BTB

• When a match is found the corresponding Predicted PC is returned

• If the branch was predicted taken, instruction fetch continues at the returned predicted PC
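The lookup described above behaves like a small cache keyed by the branch PC. The following is a minimal sketch under several assumptions not stated on the slide: a direct-mapped table, 4-byte instructions, and a policy of storing only taken branches; the class and function names are hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each slot holds (branch_pc, predicted_pc) or None."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop the byte offset, keep the low-order bits

    def lookup(self, pc):
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:  # full PC comparison, as in a cache tag check
            return slot[1]
        return None

    def update(self, pc, target, taken):
        index = self._index(pc)
        if taken:
            self.table[index] = (pc, target)
        elif self.table[index] is not None and self.table[index][0] == pc:
            self.table[index] = None  # assumed policy: drop branches that were not taken


def next_fetch_pc(btb, pc):
    """Fetch from the predicted PC on a BTB hit, otherwise fall through to pc + 4."""
    predicted = btb.lookup(pc)
    return predicted if predicted is not None else pc + 4


btb = BranchTargetBuffer()
btb.update(0x40, 0x10, taken=True)
print(hex(next_fetch_pc(btb, 0x40)), hex(next_fetch_pc(btb, 0x44)))
```

Because the stored PC is compared in full, a different branch that maps to the same slot does not receive another branch's target.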


Branch Target Buffers


Dynamic Branch Prediction Summary

• Prediction becoming important part of execution

• Branch History Table: 2 bits for loop accuracy

• Correlation: Recently executed branches correlated with next branch
– Either different branches (GA)
– Or different executions of same branches (PA)

• Tournament predictors take insight to next level, by using multiple predictors
– usually one based on global information and one based on local information, and combining them with a selector
– In 2006, tournament predictors using ≈ 30K bits are in processors like the Power5 and Pentium 4

• Branch Target Buffer: include branch address & prediction


Advantages of Dynamic Scheduling

• Dynamic scheduling - hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior

• It handles cases when dependences are unknown at compile time
– It allows the processor to tolerate unpredictable delays such as cache misses, by executing other code while waiting for the miss to resolve

• It allows code that was compiled for one pipeline to run efficiently on a different pipeline

• It simplifies the compiler

• Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling


HW Schemes: Instruction Parallelism

• Key idea: Allow instructions behind stall to proceed

    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F12,F8,F14

• Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
– In a dynamically scheduled pipeline, all instructions still pass through issue stage in order (in-order issue)

• Will distinguish when an instruction begins execution and when it completes execution; between 2 times, the instruction is in execution

• Note: Dynamic execution creates WAR and WAW hazards and makes exceptions harder


Dynamic Scheduling Step 1

• Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue

• Split the ID pipe stage of simple 5-stage pipeline into 2 stages:

• Issue—Decode instructions, check for structural hazards

• Read operands—Wait until no data hazards, then read operands


A Dynamic Algorithm: Tomasulo’s

• For IBM 360/91 (before caches!)
– ⇒ Long memory latency

• Goal: High Performance without special compilers

• Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
– This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!

• Why Study 1966 Computer?

• The descendants of this have flourished!
– Alpha 21264, Pentium 4, AMD Opteron, Power 5, …


Tomasulo Algorithm

• Control & buffers distributed with Function Units (FU)
– FU buffers called “reservation stations”; have pending operands

• Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
– Renaming avoids WAR, WAW hazards
– More reservation stations than registers, so can do optimizations compilers can’t

• Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs

– Avoids RAW hazards by executing an instruction only when its operands are available

• Load and Stores treated as FUs with RSs as well

• Integer instructions can go past branches (predict taken), allowing FP ops beyond basic block in FP queue
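How renaming removes WAR and WAW hazards can be shown on a short instruction sequence. The sketch below is a simplification: the fresh names T0, T1, … are hypothetical stand-ins for reservation station tags, and the function name and tuple layout are assumptions for illustration.

```python
def rename_registers(instrs):
    """Give every destination a fresh name and redirect later reads to it."""
    mapping = {}   # architectural register -> latest fresh name
    renamed = []
    for n, (op, dest, src1, src2) in enumerate(instrs):
        # Sources read the most recent producer, which preserves true (RAW) dependences.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        new_dest = f"T{n}"          # a fresh name removes WAR and WAW conflicts
        mapping[dest] = new_dest
        renamed.append((op, new_dest, s1, s2))
    return renamed


# Anti-dependence example: sub reads r1, then add overwrites r1.
code = [("sub", "r4", "r1", "r3"),
        ("add", "r1", "r2", "r3"),
        ("mul", "r6", "r1", "r7")]
for instr in rename_registers(code):
    print(instr)
```

After renaming, add writes T1 instead of r1, so it no longer conflicts with the earlier read by sub, while mul still reads the value produced by add.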


Tomasulo Organization

[Figure: Tomasulo organization. The FP Op Queue and FP Registers feed the reservation stations (Add1 to Add3 for the FP adders, Mult1 and Mult2 for the FP multipliers); Load Buffers (Load1 to Load6) receive data from memory and Store Buffers send data to memory; results are broadcast on the Common Data Bus (CDB).]

Reservation Station Components

Op: Operation to perform in the unit (e.g., + or -)
Vj, Vk: Value of source operands
– Store buffers have V field, result to be stored
Qj, Qk: Reservation stations producing source registers (value to be written)
– Note: Qj,Qk=0 => ready
– Store buffers only have Qi for RS producing result
Busy: Indicates reservation station or FU is busy

Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.


Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).

2. Execute: operate on operands (EX)
When both operands ready then execute; if not ready, watch Common Data Bus for result

3. Write result: finish execution (WB)
Write on Common Data Bus to all awaiting units; mark reservation station available

• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source address
– Write if matches expected Functional Unit (produces result)
– Does the broadcast

• Example speed: 3 clocks for Fl. pt. +, -; 10 for *; 40 clks for /
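A minimal Python sketch of the “come from” bus, assuming a hypothetical `ReservationStation` class with the Vj/Vk/Qj/Qk fields described earlier. The operand values are placeholders; the station state mirrors clock 4 of the example that follows, where Add1 and Mult1 both wait on Load2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    """Hypothetical model of one reservation station (illustration only)."""
    name: str
    op: str = ""
    vj: Optional[float] = None   # value of first operand, once known
    vk: Optional[float] = None   # value of second operand, once known
    qj: Optional[str] = None     # tag of the station producing the first operand
    qk: Optional[str] = None     # tag of the station producing the second operand
    busy: bool = False

    def ready(self) -> bool:
        # An instruction may execute only when no operand is still pending.
        return self.busy and self.qj is None and self.qk is None

    def snoop(self, tag: str, value: float) -> None:
        # Capture a broadcast if it comes from a station this one waits on.
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None


def broadcast(stations, tag, value):
    """Put (source tag, data) on the CDB; every busy station compares tags."""
    for rs in stations:
        if rs.busy:
            rs.snoop(tag, value)


add1 = ReservationStation("Add1", "SUBD", vj=1.5, qk="Load2", busy=True)
mult1 = ReservationStation("Mult1", "MULTD", vk=4.0, qj="Load2", busy=True)
broadcast([add1, mult1], "Load2", 2.5)
print(add1.ready(), mult1.ready())   # both stations captured the same broadcast
```

One broadcast releases every instruction waiting on that source, which is why the bus carries the producer's tag rather than a destination.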


Tomasulo Example

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2
LD    F2      45+   R3
MULTD F0      F2    F4
SUBD  F8      F6    F2
DIVD  F10     F0    F6
ADDD  F6      F8    F2

Load buffers (3), listed as Busy, Address: Load1 No; Load2 No; Load3 No

Reservation stations (3 FP Adder R.S., 2 FP Mult R.S.):
Time   Name    Busy   Op   Vj (S1)   Vk (S2)   Qj (RS)   Qk (RS)
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No

Register result status:
Clock   F0   F2   F4   F6   F8   F10   F12 ... F30
0  FU

Annotations: the instruction status table holds the instruction stream; Time is the FU countdown; Clock is the clock cycle counter.

Tomasulo Example Cycle 1

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1
LD    F2      45+   R3
MULTD F0      F2    F4
SUBD  F8      F6    F2
DIVD  F10     F0    F6
ADDD  F6      F8    F2

Load buffers (Busy, Address): Load1 Yes, 34+R2; Load2 No; Load3 No

Reservation stations:
Time   Name    Busy   Op   Vj   Vk   Qj   Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No

Register result status:
Clock   F0   F2   F4   F6      F8   F10   F12 ... F30
1  FU                  Load1



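Tomasulo Example Cycle 2

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1
LD    F2      45+   R3    2
MULTD F0      F2    F4
SUBD  F8      F6    F2
DIVD  F10     F0    F6
ADDD  F6      F8    F2

Load buffers (Busy, Address): Load1 Yes, 34+R2; Load2 Yes, 45+R3; Load3 No

Reservation stations:
Time   Name    Busy   Op   Vj   Vk   Qj   Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No

Register result status:
Clock   F0   F2      F4   F6      F8   F10   F12 ... F30
2  FU        Load2        Load1

Note: Can have multiple loads outstanding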


Tomasulo Example Cycle 3

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1       3
LD    F2      45+   R3    2
MULTD F0      F2    F4    3
SUBD  F8      F6    F2
DIVD  F10     F0    F6
ADDD  F6      F8    F2

Load buffers (Busy, Address): Load1 Yes, 34+R2; Load2 Yes, 45+R3; Load3 No

Reservation stations:
Time   Name    Busy   Op      Vj   Vk      Qj      Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   Yes    MULTD        R(F4)   Load2
       Mult2   No

Register result status:
Clock   F0      F2      F4   F6      F8   F10   F12 ... F30
3  FU   Mult1   Load2        Load1

• Note: register names are removed (“renamed”) in Reservation Stations; MULT issued
• Load1 completing; what is waiting for Load1?


Tomasulo Example Cycle 4

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1       3           4
LD    F2      45+   R3    2       4
MULTD F0      F2    F4    3
SUBD  F8      F6    F2    4
DIVD  F10     F0    F6
ADDD  F6      F8    F2

Load buffers (Busy, Address): Load1 No; Load2 Yes, 45+R3; Load3 No

Reservation stations:
Time   Name    Busy   Op      Vj      Vk      Qj      Qk
       Add1    Yes    SUBD    M(A1)                   Load2
       Add2    No
       Add3    No
       Mult1   Yes    MULTD           R(F4)   Load2
       Mult2   No

Register result status:
Clock   F0      F2      F4   F6      F8     F10   F12 ... F30
4  FU   Mult1   Load2        M(A1)   Add1

• Load2 completing; what is waiting for Load2?


Tomasulo Example Cycle 5

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1       3           4
LD    F2      45+   R3    2       4           5
MULTD F0      F2    F4    3
SUBD  F8      F6    F2    4
DIVD  F10     F0    F6    5
ADDD  F6      F8    F2

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 No

Reservation stations:
Time   Name    Busy   Op      Vj      Vk      Qj      Qk
2      Add1    Yes    SUBD    M(A1)   M(A2)
       Add2    No
       Add3    No
10     Mult1   Yes    MULTD   M(A2)   R(F4)
       Mult2   Yes    DIVD            M(A1)   Mult1

Register result status:
Clock   F0      F2      F4   F6      F8     F10     F12 ... F30
5  FU   Mult1   M(A2)        M(A1)   Add1   Mult2

• Timer starts down for Add1, Mult1


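Tomasulo Example Cycle 6

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1       3           4
LD    F2      45+   R3    2       4           5
MULTD F0      F2    F4    3
SUBD  F8      F6    F2    4
DIVD  F10     F0    F6    5
ADDD  F6      F8    F2    6

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 No

Reservation stations (adder rows):
Time   Name    Busy   Op      Vj      Vk      Qj     Qk
1      Add1    Yes    SUBD    M(A1)   M(A2)
       Add2    Yes    ADDD            M(A2)   Add1
       Add3    No

Register result status:
Clock   F0      F2      F4   F6     F8     F10     F12 ... F30
6  FU   Mult1   M(A2)        Add2   Add1   Mult2

• Issue ADDD here despite name dependency on F6?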


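Tomasulo Example Cycle 7

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1       3           4
LD    F2      45+   R3    2       4           5
MULTD F0      F2    F4    3
SUBD  F8      F6    F2    4       7
DIVD  F10     F0    F6    5
ADDD  F6      F8    F2    6

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 No

Reservation stations (adder rows):
Time   Name    Busy   Op      Vj      Vk      Qj     Qk
0      Add1    Yes    SUBD    M(A1)   M(A2)
       Add2    Yes    ADDD            M(A2)   Add1
       Add3    No

Register result status:
Clock   F0      F2      F4   F6     F8     F10     F12 ... F30
7  FU   Mult1   M(A2)        Add2   Add1   Mult2

• Add1 (SUBD) completing; what is waiting for it?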


Tomasulo Example Cycle 8

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1       3           4
LD    F2      45+   R3    2       4           5
MULTD F0      F2    F4    3
SUBD  F8      F6    F2    4       7           8
DIVD  F10     F0    F6    5
ADDD  F6      F8    F2    6

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 No

Reservation stations:
Time   Name    Busy   Op      Vj      Vk      Qj      Qk
       Add1    No
2      Add2    Yes    ADDD    (M-M)   M(A2)
       Add3    No
7      Mult1   Yes    MULTD   M(A2)   R(F4)
       Mult2   Yes    DIVD            M(A1)   Mult1

Register result status:
Clock   F0      F2      F4   F6     F8      F10     F12 ... F30
8  FU   Mult1   M(A2)        Add2   (M-M)   Mult2



Tomasulo Example Cycle 9

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1       3           4
LD    F2      45+   R3    2       4           5
MULTD F0      F2    F4    3
SUBD  F8      F6    F2    4       7           8
DIVD  F10     F0    F6    5
ADDD  F6      F8    F2    6

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 No









Tomasulo Example Cycle 10

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1       3           4
LD    F2      45+   R3    2       4           5
MULTD F0      F2    F4    3
SUBD  F8      F6    F2    4       7           8
DIVD  F10     F0    F6    5
ADDD  F6      F8    F2    6       10

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 No

• Add2 (ADDD) completing; what is waiting for it?


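Tomasulo Example Cycle 11

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1       3           4
LD    F2      45+   R3    2       4           5
MULTD F0      F2    F4    3
SUBD  F8      F6    F2    4       7           8
DIVD  F10     F0    F6    5
ADDD  F6      F8    F2    6       10          11

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 No

Reservation stations (adder rows):
Time   Name    Busy   Op   Vj   Vk   Qj   Qk
       Add1    No
       Add2    No
       Add3    No

Register result status:
Clock    F0      F2      F4   F6        F8      F10     F12 ... F30
11  FU   Mult1   M(A2)        (M-M+M)   (M-M)   Mult2

• Write result of ADDD here?
• All quick instructions complete in this cycle!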


Tomasulo Example Cycle 15

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1       3           4
LD    F2      45+   R3    2       4           5
MULTD F0      F2    F4    3       15
SUBD  F8      F6    F2    4       7           8
DIVD  F10     F0    F6    5
ADDD  F6      F8    F2    6       10          11

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 No

• Mult1 (MULTD) completing; what is waiting for it?


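Tomasulo Example Cycle 16

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1       3           4
LD    F2      45+   R3    2       4           5
MULTD F0      F2    F4    3       15          16
SUBD  F8      F6    F2    4       7           8
DIVD  F10     F0    F6    5
ADDD  F6      F8    F2    6       10          11

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 No

Reservation stations:
Time   Name    Busy   Op     Vj     Vk      Qj   Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
40     Mult2   Yes    DIVD   M*F4   M(A1)

Register result status:
Clock    F0     F2      F4   F6        F8      F10     F12 ... F30
16  FU   M*F4   M(A2)        (M-M+M)   (M-M)   Mult2

• Just waiting for Mult2 (DIVD) to complete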

Faster than light computation (skip a couple of cycles)

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1       3           4
LD    F2      45+   R3    2       4           5
MULTD F0      F2    F4    3       15          16
SUBD  F8      F6    F2    4       7           8
DIVD  F10     F0    F6    5
ADDD  F6      F8    F2    6       10          11

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 No




Tomasulo Example Cycle 56

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1       3           4
LD    F2      45+   R3    2       4           5
MULTD F0      F2    F4    3       15          16
SUBD  F8      F6    F2    4       7           8
DIVD  F10     F0    F6    5       56
ADDD  F6      F8    F2    6       10          11

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 No

• Mult2 (DIVD) is completing; what is waiting for it?


Tomasulo Example Cycle 57

Instruction status:
Instruction   j     k     Issue   Exec Comp   Write Result
LD    F6      34+   R2    1       3           4
LD    F2      45+   R3    2       4           5
MULTD F0      F2    F4    3       15          16
SUBD  F8      F6    F2    4       7           8
DIVD  F10     F0    F6    5       56          57
ADDD  F6      F8    F2    6       10          11

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 No

Reservation stations:
Time   Name    Busy   Op     Vj     Vk      Qj   Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   Yes    DIVD   M*F4   M(A1)

Register result status:
Clock    F0     F2      F4   F6        F8      F10      F12 ... F30
56  FU   M*F4   M(A2)        (M-M+M)   (M-M)   Result

• Once again: In-order issue, out-of-order execution and out-of-order completion.
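The Issue / Exec Comp / Write Result columns of the example can be reproduced with a small timing model. This is a sketch under assumptions, not the original simulator: the latencies (2 clocks for loads and FP add/subtract, 10 for multiply, 40 for divide) are read off the countdown values in the worked example, one instruction issues per clock, a reservation station is always free, and `tomasulo_timing` is a hypothetical name.

```python
from dataclasses import dataclass

# Latencies inferred from the countdown timers in the worked example (assumption).
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}


@dataclass
class Timing:
    issue: int
    exec_comp: int
    write: int


def tomasulo_timing(program):
    """Simplified model: in-order issue, one per clock, no structural stalls."""
    producer = {}    # register -> index of the latest instruction writing it
    timings = []
    for i, (op, dest, srcs) in enumerate(program):
        issue = i + 1
        ready = issue
        for src in srcs:                     # sources are renamed at issue time
            if src in producer:
                ready = max(ready, timings[producer[src]].write)
        start = ready + 1                    # execute the clock after the last operand arrives
        comp = start + LATENCY[op] - 1
        timings.append(Timing(issue, comp, comp + 1))
        producer[dest] = i                   # later readers wait on this instruction
    return timings


PROGRAM = [
    ("LD", "F6", []),                        # integer base registers assumed ready
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]

for (op, dest, _), t in zip(PROGRAM, tomasulo_timing(PROGRAM)):
    print(f"{op:6s}{dest:4s} issue={t.issue:2d} comp={t.exec_comp:2d} write={t.write:2d}")
```

The ADDD line finishes at clock 11, long before DIVD at clock 57, even though both name F6: DIVD resolved its F6 source at issue, so the later writer does not delay it.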

Why can Tomasulo overlap iterations of loops?

• Register renaming
– Multiple iterations use different physical destinations for registers (dynamic loop unrolling).

• Reservation stations
– Permit instruction issue to advance past integer control flow operations
– Also buffer old values of registers, totally avoiding the WAR stall

• Other perspective: Tomasulo building data flow dependency graph on the fly


Tomasulo’s scheme offers 2 major advantages

1. Distribution of the hazard detection logic
– distributed reservation stations and the CDB
– If multiple instructions waiting on single result, & each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB
– If a centralized register file were used, the units would have to read their results from the registers when register buses are available

2. Elimination of stalls for WAW and WAR hazards


Tomasulo Drawbacks

• Complexity
– delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!

• Many associative stores (CDB) at high speed

• Performance limited by Common Data Bus
– Each CDB must go to multiple functional units ⇒ high capacitance, high wiring density
– Number of functional units that can complete per cycle limited to one!
» Multiple CDBs ⇒ more FU logic for parallel assoc stores

• Non-precise interrupts!
– We will address this later


Speculation to greater ILP

• Greater ILP: Overcome control dependence by hardware speculating on outcome of branches and executing program as if guesses were correct
– Speculation ⇒ fetch, issue, and execute instructions as if branch predictions were always correct
– Dynamic scheduling ⇒ only fetches and issues instructions

• Essentially a data flow execution model: Operations execute as soon as their operands are available


Speculation to greater ILP

3 components of HW-based speculation:

1. Dynamic branch prediction to choose which instructions to execute

2. Speculation to allow execution of instructions before control dependences are resolved + ability to undo effects of incorrectly speculated sequence

3. Dynamic scheduling to deal with scheduling of different combinations of basic blocks


Adding Speculation to Tomasulo

• Must separate execution from allowing instruction to finish or “commit”

• This additional step called instruction commit

• When an instruction is no longer speculative, allow it to update the register file or memory

• Requires additional set of buffers to hold results of instructions that have finished execution but have not committed

• This reorder buffer (ROB) is also used to pass results among instructions that may be speculated


Reorder Buffer (ROB)

• In non-speculative Tomasulo’s algorithm, once an instruction writes its result, any subsequently issued instructions will find result in the register file

• With speculation, the register file is not updated until the instruction commits
– (we know definitively that the instruction should execute)

• Thus, the ROB supplies operands in interval between completion of instruction execution and instruction commit
– ROB is a source of operands for instructions, just as reservation stations (RS) provide operands in Tomasulo’s algorithm
– ROB extends architectural registers like RS


Reorder Buffer Entry Fields

Each entry in the ROB contains four fields:

1. Instruction type
• a branch (has no destination result), a store (has a memory address destination), or a register operation (ALU operation or load, which has register destinations)

2. Destination
• Register number (for loads and ALU operations) or memory address (for stores) where the instruction result should be written

3. Value
• Value of instruction result until the instruction commits

4. Ready
• Indicates that instruction has completed execution, and the value is ready


Reorder Buffer operation

• Holds instructions in FIFO order, exactly as issued
• When instructions complete, results placed into ROB
– Supplies operands to other instruction between execution complete & commit ⇒ more registers like RS
– Tag results with ROB buffer number instead of reservation station
• Instructions commit ⇒ values at head of ROB placed in registers (or memory locations)
• As a result, easy to undo speculated instructions on mispredicted branches or on exceptions

[Figure: Reorder Buffer organization, showing the FP Op Queue, Reorder Buffer, FP Regs, two FP adders with their reservation stations, and the commit path.]

Recall: 4 Steps of Speculative Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)

2. Execution: operate on operands (EX)
When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)

3. Write result: finish execution (WB)
Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.

4. Commit: update register with reorder result
When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”)
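The commit step can be sketched as a FIFO in Python. This is an illustration under assumptions: `ROBEntry` and `commit` are hypothetical names, the fields follow the four entry fields listed earlier (plus an assumed `mispredicted` flag for branches), and a misprediction simply empties the buffer.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any


@dataclass
class ROBEntry:
    """Hypothetical reorder buffer entry (illustration only)."""
    kind: str                    # "branch", "store", or "reg"
    dest: Any                    # register name or memory address; None for a branch
    value: Any = None
    ready: bool = False
    mispredicted: bool = False   # assumed extra flag, meaningful only for branches


def commit(rob, regs, memory):
    """Retire ready entries from the head, in order; return how many committed."""
    committed = 0
    while rob and rob[0].ready:
        entry = rob.popleft()
        committed += 1
        if entry.kind == "reg":
            regs[entry.dest] = entry.value
        elif entry.kind == "store":
            memory[entry.dest] = entry.value
        elif entry.kind == "branch" and entry.mispredicted:
            rob.clear()          # discard everything issued after the branch
            break
    return committed


regs, memory = {}, {}
rob = deque([
    ROBEntry("reg", "F0", value=1.0, ready=True),
    ROBEntry("reg", "F8", value=2.0, ready=False),   # still executing
    ROBEntry("reg", "F6", value=3.0, ready=True),    # finished early, must wait
])
print(commit(rob, regs, memory), regs)   # only the head entry retires
```

The third entry has its result but cannot update F6 until the entry ahead of it commits, which is what keeps the register file in program order.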


Tomasulo With Reorder Buffer - Cycle 1

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1
LD    F2 45+R3
MULTD F0 F2 F4
SUBD  F8 F6 F2
DIVD  F10 F0 F6
ADDD  F6 F8 F2

Reservation stations:
Time   Name    Busy   Op   Vj   Vk   Qj   Qk   Dest
0      Add1    No
0      Add2    No
0      Add3    No
0      Mult1   No
0      Mult2   No

Load buffers (Busy, Address): Load1 Yes, 34+Regs[R2]; Load2 (blank); Load3 (blank)

Reorder buffer (entries 2 to 10 empty):
Entry   Busy   Instruction      State   Destination   Value
1       Yes    LD F6, 34(R2)    Issue   F6

Register status:
            F0   F2   F4   F6    F8   F10   F12 ... F30
Reorder #                  #1
Busy        no   no   no   Yes   no   no    no      no


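Tomasulo With Reorder Buffer - Cycle 2

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2
LD    F2 45+R3      2
MULTD F0 F2 F4
SUBD  F8 F6 F2
DIVD  F10 F0 F6
ADDD  F6 F8 F2

Reservation stations (surviving rows):
Time   Name    Busy   Op   Vj   Vk   Qj   Qk   Dest
0      Add3    No
0      Mult1   No
0      Mult2   No

Load buffers (Busy, Address): Load1 Yes, 34+Regs[R2]; Load2 Yes, 45+Regs[R3]; Load3 (blank)

Reorder buffer (entries 3 to 10 empty):
Entry   Busy   Instruction      State   Destination   Value
1       Yes    LD F6, 34(R2)    Ex1     F6
2       Yes    LD F2, 45(R3)    Issue   F2

Register status:
            F0   F2    F4   F6    F8   F10   F12 ... F30
Reorder #        #2         #1
Busy        no   Yes   no   Yes   no   no    no      no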


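Tomasulo With Reorder Buffer - Cycle 3

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2      3
LD    F2 45+R3      2       3
MULTD F0 F2 F4      3
SUBD  F8 F6 F2
DIVD  F10 F0 F6
ADDD  F6 F8 F2

Reservation stations (surviving rows):
Time   Name    Busy   Op     Vj   Vk         Qj   Qk   Dest
0      Add3    No
0      Mult1   Yes    Mult        Regs[F4]   #2        #3
0      Mult2   No

Load buffers (Busy, Address): Load1 No; Load2 Yes, 45+Regs[R3]; Load3 (blank)

Reorder buffer (entries 4 to 10 empty):
Entry   Busy   Instruction        State   Destination   Value
1       Yes    LD F6, 34(R2)      write   F6            Mem[load1]
2       Yes    LD F2, 45(R3)      Ex1     F2
3       Yes    MULT F0, F2, F4    Issue   F0

Register status:
            F0    F2    F4   F6    F8   F10   F12 ... F30
Reorder #   #3    #2         #1
Busy        Yes   Yes   no   Yes   no   no    no      no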


Tomasulo With Reorder Buffer - Cycle 4

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2      3       4
LD    F2 45+R3      2       3      4
MULTD F0 F2 F4      3       4
SUBD  F8 F6 F2      4
DIVD  F10 F0 F6
ADDD  F6 F8 F2

Reservation stations:
Time   Name    Busy   Op     Vj                  Vk                  Qj   Qk   Dest
2      Add1    Yes    SUB    Regs[F6]            Mem[45+Regs[R3]]              #4
0      Add2    No
0      Add3    No
10     Mult1   Yes    Mult   Mem[45+Regs[R3]]    Regs[F4]                      #3
0      Mult2   No

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 (blank)

Reorder buffer (entries 5 to 10 empty):
Entry   Busy   Instruction        State    Destination   Value
1       No     LD F6, 34(R2)      commit   F6            Mem[load1]
2       Yes    LD F2, 45(R3)      write    F2            Mem[load2]
3       Yes    MULT F0, F2, F4    EX1      F0
4       Yes    SUBD F8, F6, F2    Issue    F8

Register status:
            F0    F2    F4   F6   F8    F10   F12 ... F30
Reorder #   #3    #2              #4
Busy        Yes   Yes   no   no   Yes   no    no      no




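Tomasulo With Reorder Buffer - Cycle 5

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2      3       4
LD    F2 45+R3      2       3      4       5
MULTD F0 F2 F4      3       4
SUBD  F8 F6 F2      4       5
DIVD  F10 F0 F6     5
ADDD  F6 F8 F2

Reservation stations:
Time   Name    Busy   Op     Vj                  Vk                  Qj   Qk   Dest
1      Add1    Yes    SUB    Regs[F6]            Mem[45+Regs[R3]]              #4
0      Add2    No
0      Add3    No
9      Mult1   Yes    Mult   Mem[45+Regs[R3]]    Regs[F4]                      #3
0      Mult2   Yes    DIV                        Regs[F6]            #3        #5

Load buffers (Busy, Address): Load2 No; Load3 (blank)

Reorder buffer (entries 6 to 10 empty):
Entry   Busy   Instruction         State    Destination   Value
1       No     LD F6, 34(R2)       commit   F6            Mem[load1]
2       No     LD F2, 45(R3)       commit   F2            Mem[load2]
3       Yes    MULT F0, F2, F4     Ex2      F0
4       Yes    SUBD F8, F6, F2     Ex1      F8
5       Yes    DIVD F10, F0, F6    Issue    F10

Register status:
            F0    F2   F4   F6   F8    F10   F12 ... F30
Reorder #   #3                   #4    #5
Busy        Yes   no   no   no   Yes   Yes   no      no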




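Tomasulo With Reorder Buffer - Cycle 6

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2      3       4
LD    F2 45+R3      2       3      4       5
MULTD F0 F2 F4      3       4
SUBD  F8 F6 F2      4       5
DIVD  F10 F0 F6     5
ADDD  F6 F8 F2      6

Reservation stations (surviving rows):
Time   Name    Busy   Op    Vj         Vk                  Qj   Qk   Dest
0      Add1    Yes    SUB   Regs[F6]   Mem[45+Regs[R3]]              #4
0      Add2    Yes    Add              Regs[F2]            #4        #6

Load buffers (Busy, Address): Load2 No; Load3 (blank)

Reorder buffer (entries 7 to 10 empty):
Entry   Busy   Instruction         State    Destination   Value
1       No     LD F6, 34(R2)       commit   F6            Mem[load1]
2       No     LD F2, 45(R3)       commit   F2            Mem[load2]
3       Yes    MULT F0, F2, F4     Ex3      F0
4       Yes    SUBD F8, F6, F2     Ex2      F8
5       Yes    DIVD F10, F0, F6    Issue    F10
6       Yes    ADDD F6, F8, F2     Issue    F6

Register status:
            F0    F2   F4   F6    F8    F10   F12 ... F30
Reorder #   #3              #6    #4    #5
Busy        Yes   no   no   Yes   Yes   Yes   no      no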




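Tomasulo With Reorder Buffer - Cycle 7

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2      3       4
LD    F2 45+R3      2       3      4       5
MULTD F0 F2 F4      3       4
SUBD  F8 F6 F2      4       5      7
DIVD  F10 F0 F6     5
ADDD  F6 F8 F2      6       7

Reservation stations (surviving rows):
Time   Name   Busy   Op    Vj   Vk         Qj   Qk   Dest
0      Add1   No
2      Add2   Yes    Add   #4   Regs[F2]             #6

Load buffers (Busy, Address): Load2 No; Load3 (blank)

Reorder buffer (entries 7 to 10 empty):
Entry   Busy   Instruction         State    Destination   Value
1       No     LD F6, 34(R2)       commit   F6            Mem[load1]
2       No     LD F2, 45(R3)       commit   F2            Mem[load2]
3       Yes    MULT F0, F2, F4     Ex4      F0
4       Yes    SUBD F8, F6, F2     write    F8            F6 - #2
5       Yes    DIVD F10, F0, F6    Issue    F10
6       Yes    ADDD F6, F8, F2     EX1      F6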




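Tomasulo With Reorder Buffer - Cycle 8

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2      3       4
LD    F2 45+R3      2       3      4       5
MULTD F0 F2 F4      3       4
SUBD  F8 F6 F2      4       5      7
DIVD  F10 F0 F6     5
ADDD  F6 F8 F2      6       7

Reservation stations (surviving rows):
Time   Name   Busy   Op    Vj   Vk         Qj   Qk   Dest
0      Add1   No
1      Add2   Yes    Add   #4   Regs[F2]             #6

Load buffers (Busy, Address): Load2 No; Load3 (blank)

Reorder buffer (entries 7 to 10 empty):
Entry   Busy   Instruction         State    Destination   Value
1       No     LD F6, 34(R2)       commit   F6            Mem[load1]
2       No     LD F2, 45(R3)       commit   F2            Mem[load2]
3       Yes    MULT F0, F2, F4     Ex5      F0
4       Yes    SUBD F8, F6, F2     write    F8            F6 - #2
5       Yes    DIVD F10, F0, F6    Issue    F10
6       Yes    ADDD F6, F8, F2     Ex2      F6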




Tomasulo With Reorder Buffer - Cycle 9

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2      3       4
LD    F2 45+R3      2       3      4       5
MULTD F0 F2 F4      3       4
SUBD  F8 F6 F2      4       5      7
DIVD  F10 F0 F6     5
ADDD  F6 F8 F2      6       7      9

Reservation stations (surviving rows):
Time   Name   Busy   Op    Vj   Vk         Qj   Qk   Dest
0      Add1   No
0      Add2   Yes    Add   #4   Regs[F2]             #6
0      Add3   No

Load buffers (Busy, Address): Load2 No; Load3 (blank)

Reorder buffer (entries 7 to 10 empty):
Entry   Busy   Instruction         State    Destination   Value
1       No     LD F6, 34(R2)       commit   F6            Mem[load1]
2       No     LD F2, 45(R3)       commit   F2            Mem[load2]
3       Yes    MULT F0, F2, F4     Ex6      F0
4       Yes    SUBD F8, F6, F2     write    F8            F6 - #2
5       Yes    DIVD F10, F0, F6    Issue    F10
6       Yes    ADDD F6, F8, F2     write    F6            #4 + F2




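Tomasulo With Reorder Buffer - Cycle 10

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2      3       4
LD    F2 45+R3      2       3      4       5
MULTD F0 F2 F4      3       4
SUBD  F8 F6 F2      4       5      7
DIVD  F10 F0 F6     5
ADDD  F6 F8 F2      6       7      9

Load buffers (Busy, Address): Load2 No; Load3 (blank)

Reorder buffer (entries 7 to 10 empty):
Entry   Busy   Instruction         State    Destination   Value
1       No     LD F6, 34(R2)       commit   F6            Mem[load1]
2       No     LD F2, 45(R3)       commit   F2            Mem[load2]
3       Yes    MULT F0, F2, F4     Ex7      F0
4       Yes    SUBD F8, F6, F2     write    F8            F6 - #2
5       Yes    DIVD F10, F0, F6    Issue    F10
6       Yes    ADDD F6, F8, F2     write    F6            #4 + F2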




Tomasulo With Reorder Buffer - Cycle 11

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2      3       4
LD    F2 45+R3      2       3      4       5
MULTD F0 F2 F4      3       4
SUBD  F8 F6 F2      4       5      7
DIVD  F10 F0 F6     5
ADDD  F6 F8 F2      6       7      9

Reservation stations (surviving rows):
Time   Name   Busy   Op   Vj   Vk   Qj   Qk   Dest
0      Add1   No
0      Add2   No
0      Add3   No

Load buffers (Busy, Address): Load2 No; Load3 (blank)

Reorder buffer (entries 7 to 10 empty):
Entry   Busy   Instruction         State    Destination   Value
1       No     LD F6, 34(R2)       commit   F6            Mem[load1]
2       No     LD F2, 45(R3)       commit   F2            Mem[load2]
3       Yes    MULT F0, F2, F4     Ex8      F0
4       Yes    SUBD F8, F6, F2     write    F8            F6 - #2
5       Yes    DIVD F10, F0, F6    Issue    F10
6       Yes    ADDD F6, F8, F2     write    F6            #4 + F2




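Tomasulo With Reorder Buffer - Cycle 12

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2      3       4
LD    F2 45+R3      2       3      4       5
MULTD F0 F2 F4      3       4
SUBD  F8 F6 F2      4       5      7
DIVD  F10 F0 F6     5
ADDD  F6 F8 F2      6       7      9

Load buffers (Busy, Address): Load2 No; Load3 (blank)

Reorder buffer (entries 7 to 10 empty):
Entry   Busy   Instruction         State    Destination   Value
1       No     LD F6, 34(R2)       commit   F6            Mem[load1]
2       No     LD F2, 45(R3)       commit   F2            Mem[load2]
3       Yes    MULT F0, F2, F4     Ex9      F0
4       Yes    SUBD F8, F6, F2     write    F8            F6 - #2
5       Yes    DIVD F10, F0, F6    Issue    F10
6       Yes    ADDD F6, F8, F2     write    F6            #4 + F2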




Tomasulo With Reorder Buffer - Cycle 13

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2      3       4
LD    F2 45+R3      2       3      4       5
MULTD F0 F2 F4      3       4      13
SUBD  F8 F6 F2      4       5      7
DIVD  F10 F0 F6     5       13
ADDD  F6 F8 F2      6       7      9

Reservation stations (surviving rows):
Time   Name    Busy   Op    Vj             Vk         Qj   Qk   Dest
0      Add3    No
0      Mult1   No
0      Mult2   Yes    DIV   #2xRegs[F4]    Regs[F6]             #5

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 (blank)

Reorder buffer (entries 7 to 10 empty):
Entry   Busy   Instruction         State    Destination   Value
1       No     LD F6, 34(R2)       commit   F6            Mem[load1]
2       No     LD F2, 45(R3)       commit   F2            Mem[load2]
3       Yes    MULT F0, F2, F4     write    F0            #2 x Regs[F4]
4       Yes    SUBD F8, F6, F2     write    F8            F6 - #2
5       Yes    DIVD F10, F0, F6    Ex1      F10
6       Yes    ADDD F6, F8, F2     write    F6            #4 + F2




Tomasulo With Reorder Buffer - Cycle 14

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2      3       4
LD    F2 45+R3      2       3      4       5
MULTD F0 F2 F4      3       4      13      14
SUBD  F8 F6 F2      4       5      7
DIVD  F10 F0 F6     5       13
ADDD  F6 F8 F2      6       7      9

Load buffers (Busy, Address): Load1 No; Load2 No; Load3 (blank)

Reorder buffer (entries 7 to 10 empty):
Entry   Busy   Instruction         State    Destination   Value
1       No     LD F6, 34(R2)       commit   F6            Mem[load1]
2       No     LD F2, 45(R3)       commit   F2            Mem[load2]
3       No     MULT F0, F2, F4     commit   F0            #2 x Regs[F4]
4       Yes    SUBD F8, F6, F2     write    F8            F6 - #2
5       Yes    DIVD F10, F0, F6    Ex2      F10
6       Yes    ADDD F6, F8, F2     write    F6            #4 + F2

Register status:
            F0   F2   F4   F6    F8    F10   F12 ... F30
Reorder #                  #6    #4    #5
Busy        No   no   no   Yes   Yes   Yes   no      no




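Tomasulo With Reorder Buffer - Cycle 15

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2      3       4
LD    F2 45+R3      2       3      4       5
MULTD F0 F2 F4      3       4      13      14
SUBD  F8 F6 F2      4       5      7       15
DIVD  F10 F0 F6     5       13
ADDD  F6 F8 F2      6       7      9

Load buffers (Busy, Address): Load2 No; Load3 (blank)

Reorder buffer (entries 7 to 10 empty):
Entry   Busy   Instruction         State    Destination   Value
1       No     LD F6, 34(R2)       commit   F6            Mem[load1]
2       No     LD F2, 45(R3)       commit   F2            Mem[load2]
3       No     MULT F0, F2, F4     commit   F0            #2 x Regs[F4]
4       No     SUBD F8, F6, F2     commit   F8            F6 - #2
5       Yes    DIVD F10, F0, F6    Ex3      F10
6       Yes    ADDD F6, F8, F2     write    F6            #4 + F2

Register status:
            F0   F2   F4   F6    F8   F10   F12 ... F30
Reorder #                  #6         #5
Busy        no   no   no   Yes   no   Yes   no      no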




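Tomasulo With Reorder Buffer - Cycle 16

Instruction         Issue   Exec   Write   Commit
LD    F6 34+R2      1       2      3       4
LD    F2 45+R3      2       3      4       5
MULTD F0 F2 F4      3       4      13      14
SUBD  F8 F6 F2      4       5      7       15
DIVD  F10 F0 F6     5       13
ADDD  F6 F8 F2      6       7      9

Reservation stations (surviving rows):
Time   Name   Busy   Op   Vj   Vk   Qj   Qk   Dest
0      Add1   No
0      Add2   No

Load buffers (Busy, Address): Load2 No; Load3 (blank)

Reorder buffer (entries 7 to 10 empty):
Entry   Busy   Instruction         State    Destination   Value
1       No     LD F6, 34(R2)       commit   F6            Mem[load1]
2       No     LD F2, 45(R3)       commit   F2            Mem[load2]
3       No     MULT F0, F2, F4     commit   F0            #2 x Regs[F4]
4       No     SUBD F8, F6, F2     commit   F8            F6 - #2
5       Yes    DIVD F10, F0, F6    Ex4      F10
6       Yes    ADDD F6, F8, F2     write    F6            #4 + F2

Need 36 more EX cycles for DIV to finish…

Register status:
            F0   F2   F4   F6    F8   F10   F12 ... F30
Reorder #                  #6         #5
Busy        no   no   no   Yes   no   Yes   no      no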

Avoiding Memory Hazards

• WAW and WAR hazards through memory are eliminated with speculation because actual updating of memory occurs in order, when a store is at head of the ROB, and hence, no earlier loads or stores can still be pending

• RAW hazards through memory are maintained by two restrictions:
1. not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load, and
2. maintaining the program order for the computation of an effective address of a load with respect to all earlier stores.

• These restrictions ensure that any load that accesses a memory location written to by an earlier store cannot perform the memory access until the store has written the data
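The two restrictions can be sketched as a single check made before a load starts its memory access. This is an illustration under assumptions: the ROB is modeled as a list of dictionaries in program order, `load_may_access_memory` is a hypothetical name, and a store whose address is not yet known is treated conservatively as a conflict.

```python
def load_may_access_memory(rob, load_index, load_addr):
    """Return True if no earlier active store in the ROB targets load_addr.

    `rob` is a list of dicts in program order. A store's "dest" is its
    effective address once computed, or None while it is still unknown.
    """
    for entry in rob[:load_index]:           # only entries older than the load
        if entry["kind"] != "store":
            continue
        if entry["dest"] is None:            # address not computed yet: assume a conflict
            return False
        if entry["dest"] == load_addr:       # Destination field matches the load's A field
            return False
    return True


rob = [
    {"kind": "store", "dest": 100},          # earlier store to address 100
    {"kind": "reg", "dest": "F2"},
    {"kind": "load", "dest": "F4"},          # the load being checked (index 2)
]
print(load_may_access_memory(rob, 2, 100))   # the earlier store writes the same address
print(load_may_access_memory(rob, 2, 108))   # no earlier store to this address
```

Once the store at the head of the ROB commits and leaves the buffer, the same check passes and the load can read the value the store wrote.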


Exceptions and Interrupts

• IBM 360/91 invented “imprecise interrupts”
– Computer stopped at this PC; it's likely close to this address
– Not so popular with programmers
– Also, what about Virtual Memory? (Not in IBM 360)

• Technique for both precise interrupts/exceptions and speculation: in-order completion and in-order commit
– If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly
– This is exactly same as need to do with precise exceptions

• Exceptions are handled by not recognizing the exception until instruction that caused it is ready to commit in ROB
– If a speculated instruction raises an exception, the exception is recorded in the ROB
– This is why reorder buffers are in all new processors


Getting CPI below 1

• CPI ≥ 1 if issue only 1 instruction every clock cycle

• Multiple-issue processors come in 3 flavors:
1. statically-scheduled superscalar processors,
2. dynamically-scheduled superscalar processors, and
3. VLIW (very long instruction word) processors

• 2 types of superscalar processors issue varying numbers of instructions per clock
– use in-order execution if they are statically scheduled, or
– out-of-order execution if they are dynamically scheduled

• VLIW processors, in contrast, issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (Intel/HP Itanium)


VLIW: Very Long Instruction Word

• Each “instruction” has explicit coding for multiple operations
– In IA-64, grouping called a “packet”
– In Transmeta, grouping called a “molecule” (with “atoms” as ops)

• Tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
» 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need compiling technique that schedules across several branches


Recall: Unrolled Loop that Minimizes Stalls for Scalar

1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)
3        L.D    F10,-16(R1)
4        L.D    F14,-24(R1)
5        ADD.D  F4,F0,F2
6        ADD.D  F8,F6,F2
7        ADD.D  F12,F10,F2
8        ADD.D  F16,F14,F2
9        S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,#32
13       BNEZ   R1,LOOP
14       S.D    8(R1),F16 ; 8-32 = -24

L.D to ADD.D: 1 Cycle
ADD.D to S.D: 2 Cycles

14 clock cycles, or 3.5 per iteration


Loop Unrolling in VLIW

Memory reference 1   Memory reference 2   FP operation 1     FP op. 2           Int. op/branch     Clock
L.D F0,0(R1)         L.D F6,-8(R1)                                                                 1
L.D F10,-16(R1)      L.D F14,-24(R1)                                                               2
L.D F18,-32(R1)      L.D F22,-40(R1)      ADD.D F4,F0,F2     ADD.D F8,F6,F2                        3
L.D F26,-48(R1)                           ADD.D F12,F10,F2   ADD.D F16,F14,F2                      4
                                          ADD.D F20,F18,F2   ADD.D F24,F22,F2                      5
S.D 0(R1),F4         S.D -8(R1),F8        ADD.D F28,F26,F2                                         6
S.D -16(R1),F12      S.D -24(R1),F16                                                               7
S.D -32(R1),F20      S.D -40(R1),F24                                            DSUBUI R1,R1,#48   8
S.D -0(R1),F28                                                                  BNEZ R1,LOOP       9

Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency

Note: Need more registers in VLIW (15 vs. 6 in SS)
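The summary figures for the VLIW schedule can be checked with a few lines of arithmetic. The operation counts below are read off the schedule (7 loads, 7 FP adds, 7 stores, plus DSUBUI and BNEZ); the variable names are illustrative.

```python
# Operation counts read off the VLIW schedule.
loads, fp_adds, stores, int_ops = 7, 7, 7, 2   # int_ops: DSUBUI and BNEZ
clocks, slots_per_clock, iterations = 9, 5, 7

ops = loads + fp_adds + stores + int_ops
clocks_per_iteration = clocks / iterations
ops_per_clock = ops / clocks
efficiency = ops / (clocks * slots_per_clock)   # fraction of issue slots holding an operation

print(f"{ops} ops, {clocks_per_iteration:.2f} clocks/iteration, "
      f"{ops_per_clock:.2f} ops/clock, {efficiency:.0%} of slots used")
```

This gives 23 operations in 45 slots, which the slide rounds to 1.3 clocks per iteration, 2.5 ops per clock, and 50% efficiency.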


Problems with 1st Generation VLIW

• Increase in code size
– generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
– whenever VLIW instructions are not full, unused functional units translate to wasted bits in instruction encoding

• Operated in lock-step; no hazard detection HW
– a stall in any functional unit pipeline caused entire processor to stall, since all functional units must be kept synchronized
– Compiler might predict function units, but caches hard to predict

• Binary code compatibility
– Pure VLIW => different numbers of functional units and unit latencies require different versions of the code


Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)”

• IA-64: instruction set architecture
• 128 64-bit integer regs + 128 82-bit floating point regs
– Not separate register files per functional unit as in old VLIW
• Hardware checks dependencies (interlocks => binary compatibility over time)
• Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?
• Itanium™ was first implementation (2001)
– Highly parallel and deeply pipelined hardware at 800 MHz
– 6-wide, 10-stage pipeline at 800 MHz on 0.18 µ process
• Itanium 2™ is name of 2nd implementation (2005)
– 6-wide, 8-stage pipeline at 1666 MHz on 0.13 µ process
– Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3


Increasing Instruction Fetch Bandwidth

Branch Target Buffer (BTB)

• Predicts next instruction address, sends it out before decoding instruction
• PC of branch sent to BTB
• When match is found, Predicted PC is returned
• If branch predicted taken, instruction fetch continues at Predicted PC
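A toy branch target buffer in Python, as a sketch under assumptions: the class name, the direct-mapped organization, the 16-entry size, and the 4-byte instruction width are all illustrative choices, not details from the slide.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB: maps a branch PC to its predicted target PC."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries        # each slot holds (branch_pc, predicted_pc)

    def _index(self, pc):
        return (pc >> 2) % self.entries      # assumes 4-byte instructions

    def predict(self, pc):
        """Return the next fetch PC for the instruction at `pc`."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]                   # match found: fetch from the predicted PC
        return pc + 4                        # no match: fall through

    def update(self, pc, taken, target):
        """Record the outcome once the branch is resolved."""
        idx = self._index(pc)
        if taken:
            self.table[idx] = (pc, target)
        elif self.table[idx] is not None and self.table[idx][0] == pc:
            self.table[idx] = None           # not taken: drop the stale entry


btb = BranchTargetBuffer()
print(hex(btb.predict(0x40)))                # first encounter: falls through
btb.update(0x40, taken=True, target=0x10)
print(hex(btb.predict(0x40)))                # next fetch is redirected to the target
```

The lookup uses only the fetch PC, so the prediction is available before the instruction has been decoded.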


IF BW: Return Address Predictor

• Small buffer of return addresses acts as a stack
• Caches most recent return addresses
• Call ⇒ Push a return address on stack
• Return ⇒ Pop an address off stack & predict as new PC

[Figure: misprediction frequency (0% to 70%) versus number of return address buffer entries (0, 1, 2, 4, 8, 16) for go, m88ksim, cc1, compress, xlisp, ijpeg, perl, and vortex.]
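A minimal return address predictor in Python, assuming a hypothetical `ReturnAddressStack` class. The overflow policy (dropping the oldest entry) is an assumption made for the sketch; the slide does not specify one.

```python
class ReturnAddressStack:
    """Small bounded stack of return addresses (illustrative sketch)."""

    def __init__(self, entries=8):
        self.entries = entries
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.entries:
            self.stack.pop(0)                # overflow: lose the oldest return address
        self.stack.append(return_pc)

    def on_return(self):
        """Predict the return target; None means no prediction is available."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(entries=2)
for return_pc in (0x104, 0x204, 0x304):      # three nested calls, only two slots
    ras.on_call(return_pc)
print([ras.on_return() for _ in range(3)])   # the deepest returns are predicted, the oldest is lost
```

With call depth greater than the buffer size the outermost return goes unpredicted, which is consistent with the figure's trend of fewer mispredictions as entries are added.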

More Instruction Fetch Bandwidth

• Integrated branch prediction: branch predictor is part of instruction fetch unit and is constantly predicting branches

• Instruction prefetch: Instruction fetch units prefetch to deliver multiple instruct. per clock, integrating it with branch prediction

• Instruction memory access and buffering: Fetching multiple instructions per cycle:
– May require accessing multiple cache blocks (prefetch to hide cost of crossing cache blocks)
– Provides buffering, acting as on-demand unit to provide instructions to issue stage as needed and in quantity needed


Speculation: Register Renaming vs. ROB

• Alternative to ROB is a larger physical set of registers combined with register renaming
– Extended registers replace function of both ROB and reservation stations

• Instruction issue maps names of architectural registers to physical register numbers in extended register set
– On issue, allocates a new unused register for the destination (which avoids WAW and WAR hazards)
– Speculation recovery easy because a physical register holding an instruction destination does not become the architectural register until the instruction commits

• Most Out-of-Order processors today use extended registers with renaming
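A sketch of renaming onto an extended physical register set, assuming a hypothetical `RenameMap` class with a free list. Freeing the previous mapping when the renaming instruction commits is a common scheme, stated here as an assumption rather than taken from the slide.

```python
class RenameMap:
    """Maps architectural register names to physical register numbers (sketch)."""

    def __init__(self, arch_regs, num_physical):
        # Initially architectural register i lives in physical register i.
        self.map = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, srcs):
        """Return (new physical dest, physical sources, previous mapping of dest)."""
        phys_srcs = [self.map[s] for s in srcs]   # read sources before remapping dest
        old = self.map[dest]
        new = self.free.pop(0)                    # raises IndexError if no register is free
        self.map[dest] = new
        return new, phys_srcs, old

    def commit(self, old):
        """Assumed policy: the previous mapping is reusable once the renamer commits."""
        self.free.append(old)


rm = RenameMap(["F0", "F2", "F6", "F8", "F10"], num_physical=8)
divd = rm.rename("F10", ["F0", "F6"])     # DIVD F10,F0,F6 reads the old F6
addd = rm.rename("F6", ["F8", "F2"])      # ADDD F6,F8,F2 gets a fresh register for F6
print(divd, addd)
```

DIVD keeps reading the physical register that held the old F6 while ADDD writes a different one, so neither the WAR nor a later WAW on F6 forces a stall.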


Value Prediction

• Attempts to predict value produced by instruction
– E.g., Loads a value that changes infrequently

• Value prediction is useful only if it significantly increases ILP
– Focus of research has been on loads; so-so results, no processor uses value prediction

• Related topic is address aliasing prediction
– RAW for load and store or WAW for 2 stores

• Address alias prediction is both more stable and simpler since need not actually predict the address values, only whether such values conflict
– Has been used by a few processors


(Mis) Speculation on Pentium 4

• % of micro-ops not used

[Figure: bar chart of the percentage of micro-ops not used, grouped into Integer and Floating Point benchmarks, on an axis running from 0% to 45%.]



Perspective

• Interest in multiple-issue because wanted to improve performance without affecting uniprocessor programming model

• Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice

• Conservative in ideas, just faster clock and bigger

• Processors of last 5 years (Pentium 4, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple-issue processors announced in 1995
– Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load-store units ⇒ performance 8 to 16X

• Peak v. delivered performance gap increasing

