Capp 04


Page 1: Capp 04

Instruction Level Parallelism using Dynamic Scheduling

Topic 4

1

Page 2: Capp 04

Dynamic Scheduling

• Dynamic scheduling is done by hardware

• Handles some cases where dependencies are unknown at compile time (e.g. dependencies involving memory references)

• Allows the processor to tolerate unpredictable delays

• Allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline

• Allows out-of-order execution and out-of-order completion

DIVD F0, F2, F4

ADDD F10, F0, F8 (stall)

SUBD F12, F8, F14 (have to wait)

In-order: If an instruction is stalled, no later instructions can proceed.
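
The stall above can be illustrated with a small timing sketch. It is only a toy model under stated assumptions: one instruction is considered per cycle, the latencies (DIVD = 40, ADDD = 2, SUBD = 2) are borrowed from the later examples, WAR/WAW hazards and structural hazards are ignored, and the names `LATENCY`, `PROGRAM` and `start_cycles` are hypothetical.

```python
# Assumed latencies in cycles (not stated on this slide).
LATENCY = {"DIVD": 40, "ADDD": 2, "SUBD": 2}

# (opcode, destination, sources) for the three instructions on the slide.
PROGRAM = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "F10", ("F0", "F8")),
    ("SUBD", "F12", ("F8", "F14")),
]


def start_cycles(program, in_order):
    """Return the cycle in which each instruction begins execution."""
    ready_at = {}  # register -> cycle at which its pending result is available
    starts = []
    for i, (op, dest, srcs) in enumerate(program):
        # Instruction i cannot start before cycle i + 1, nor before its
        # source operands have been produced (RAW dependence).
        start = max([i + 1] + [ready_at.get(s, 0) for s in srcs])
        if in_order and starts:
            # In-order: never start before the previous instruction has started.
            start = max(start, starts[-1] + 1)
        starts.append(start)
        ready_at[dest] = start + LATENCY[op]
    return starts


in_order_starts = start_cycles(PROGRAM, in_order=True)
out_of_order_starts = start_cycles(PROGRAM, in_order=False)
print(in_order_starts, out_of_order_starts)
```

In this model SUBD waits behind the stalled ADDD when execution is in order, but starts right away when instructions may bypass each other.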

Page 3: Capp 04

Dynamic Scheduling

• In the classical pipeline, instruction issue and execution are both in order

• Both structural and data hazards are checked during the ID stage. If there are no hazards, the instruction is issued from ID for execution

• To allow out-of-order execution, we split the ID stage into 2 stages:
– Issue: decode the instruction, check for structural hazards (SH)
– Read Operands: wait until there are no data hazards (DH), then read operands

Page 4: Capp 04

Dynamic Scheduling

• Here, in-order issue but instructions may bypass each other in read operands stage, thus enter EXE stage out-of-order

• Out of order execution introduces possibility of WAW & WAR hazards

4

Page 5: Capp 04

Dynamic Scheduling

• Two dynamic scheduling approaches– Scoreboarding– The Tomasulo approach

5

Page 6: Capp 04

HW Schemes: Instruction Parallelism

• Out-of-order execution divides the ID stage:

1. Issue: decode instructions, check for structural hazards. Issue in order if the functional unit is free and there is no WAW hazard.

2. Read operands (RO): wait until no data hazards, then read operands. ADDD would stall at RO, and SUBD could proceed with no stalls.

• Scoreboards allow an instruction to execute whenever 1 & 2 hold, not waiting for prior instructions.

[Figure: IF and ISSUE feed parallel functional-unit pipelines (RO, EX1 … EXm; RO, EX1 … EXn; RO, EX1 … EXp), each ending in WB. Labels on the figure: "(WAR?)" and "WB?". Focusing on FP operations, assuming no MEM stages.]

Page 7: Capp 04

Four Stages of Instructions with Scoreboard

1. Issue (ID1):
– Decode instructions, check for structural hazards
– Instructions issued in program order (for hazard checking)
– Don’t issue if structural hazard
– Don’t issue if instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards)

2. Read operands (ID2):
– Wait until no data hazards (no earlier active instructions will write source operands), then read operands (no RAW hazards)
– No forwarding supported

3. Execution (EX):
– The functional unit starts execution upon receiving operands. When the results are ready it notifies the scoreboard

4. Write result (WB):
– Stall until no WAR hazards with previous instructions

Page 8: Capp 04

Three Parts of the Scoreboard

1. Instruction status: which of the 4 steps (Issue, RO, EX, WB) the instruction is in.

2. Functional unit status: indicates the state of the functional unit (FU). Nine fields for each functional unit:
– Busy: indicates whether the unit is busy or not (values Yes & No)
– Op: operation to perform in the unit (e.g., ADDD or SUBD)
– Fi: destination register (e.g. R2, F2)
– Fj, Fk: source-register numbers (e.g. R1, R2, F1)
– Qj, Qk: functional units producing source registers Fj, Fk (e.g. Integer, Mult, Div)
– Rj, Rk: flags indicating when Fj, Fk are ready and not yet read. Set to No after operands are read.

3. Register result status: indicates which functional unit will write to each register (Result). Blank when no pending instructions will write that register.

Page 9: Capp 04

Detailed Scoreboard Pipeline Control

Columns: Instruction status | Wait until | Bookkeeping

Issue
Wait until: not Busy(FU) and not Result(D) (WAW check)
Bookkeeping: Busy(FU) ← Yes; Op(FU) ← op; Fi(FU) ← ‘D’; Fj(FU) ← ‘S1’; Fk(FU) ← ‘S2’; Qj ← Result(‘S1’); Qk ← Result(‘S2’); Rj ← not Qj; Rk ← not Qk; Result(‘D’) ← FU

Read operands
Wait until: Rj and Rk
Bookkeeping: Rj ← No; Rk ← No; Qj ← 0; Qk ← 0

Execution complete
Wait until: functional unit done
Bookkeeping: none

Write result
Wait until: ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No)) (WAR check)
Bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
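
The wait conditions and bookkeeping of the scoreboard can be sketched in a few lines of Python. This is a minimal sketch, not the CDC 6600 design: the class and method names (`FU`, `Scoreboard`, `can_issue`, and so on) are hypothetical, an empty string stands for "no functional unit" (the 0 in the table), and execution latency is not modelled.

```python
from dataclasses import dataclass


@dataclass
class FU:
    """Functional unit status: the nine fields described earlier."""
    busy: bool = False
    op: str = ""
    fi: str = ""       # destination register
    fj: str = ""       # source registers
    fk: str = ""
    qj: str = ""       # units producing Fj / Fk ("" if none)
    qk: str = ""
    rj: bool = False   # Fj / Fk ready and not yet read
    rk: bool = False


class Scoreboard:
    def __init__(self, unit_names):
        self.fu = {name: FU() for name in unit_names}
        self.result = {}  # register result status: register -> writing unit

    def can_issue(self, unit, dest):
        # Structural hazard check and WAW check.
        return not self.fu[unit].busy and dest not in self.result

    def issue(self, unit, op, dest, s1, s2):
        qj, qk = self.result.get(s1, ""), self.result.get(s2, "")
        self.fu[unit] = FU(True, op, dest, s1, s2, qj, qk, not qj, not qk)
        self.result[dest] = unit

    def can_read(self, unit):
        # RAW check: both operands must be ready.
        return self.fu[unit].rj and self.fu[unit].rk

    def read_operands(self, unit):
        u = self.fu[unit]
        u.rj = u.rk = False
        u.qj = u.qk = ""

    def can_write(self, unit):
        # WAR check: no unit may still need to read the old value of Fi.
        fi = self.fu[unit].fi
        return all((f.fj != fi or not f.rj) and (f.fk != fi or not f.rk)
                   for f in self.fu.values())

    def write_result(self, unit):
        for f in self.fu.values():
            if f.qj == unit:
                f.rj = True
            if f.qk == unit:
                f.rk = True
        del self.result[self.fu[unit].fi]
        self.fu[unit] = FU()


sb = Scoreboard(["Integer", "Mult1", "Mult2", "Add", "Divide"])
sb.issue("Integer", "Load", "F2", "", "R3")
sb.issue("Mult1", "Mult", "F0", "F2", "F4")
raw_stall = not sb.can_read("Mult1")         # F2 not yet written
waw_stall = not sb.can_issue("Mult2", "F0")  # F0 already has a pending writer
sb.read_operands("Integer")
sb.write_result("Integer")
raw_cleared = sb.can_read("Mult1")
sb.issue("Divide", "Div", "F10", "F0", "F6")
sb.issue("Add", "Add", "F6", "F8", "F2")
sb.read_operands("Add")
war_stall = not sb.can_write("Add")          # Divide has not read F6 yet
```

The last check mirrors the situation in the worked example where ADD.D cannot write F6 while DIV.D is still waiting to read it.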

Page 10: Capp 04

Scoreboard Example (Cycle 0)

10

Instruction status Read Execution Write

Instruction j k Issue operands complete Result

L.D F6 34+ R2L.D F2 45+ R3MUL.D F0 F2 F4SUB.D F8 F6 F2DIV.D F10 F0 F6ADD.D F6 F8 F2

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

IntegerMult1 NoMult2 NoAdd NoDivide No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F301 FU

FP Latency: LD = 1 Cycle (compute address + data cache access) Add = 2 cycles, Multiply = 10, Divide = 40

No

Page 11: Capp 04

Scoreboard Example (Cycle 1)

Instruction status

Instruction  j    k    Issue  Read operands  Execution complete  Write result
L.D    F6    34+  R2   1
L.D    F2    45+  R3
MUL.D  F0    F2   F4
SUB.D  F8    F6   F2
DIV.D  F10   F0   F6
ADD.D  F6    F8   F2

Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj?, Rk = Fk?)

Time  Name     Busy  Op    Fi  Fj  Fk  Qj  Qk  Rj  Rk
      Integer  Yes   Load  F6      R2              Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status

Clock       F0  F2  F4  F6       F8  F10  F12  ...  F30
1      FU               Integer

FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Page 12: Capp 04

Scoreboard Example (Cycle 2)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

L.D F6 34+ R2 1L.D F2 45+ R3MUL.D F0 F2 F4SUB.D F8 F6 F2DIV.D F10 F0 F6ADD.D F6 F8 F2

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer Yes Load F6 R2 YesMult1 NoMult2 NoAdd NoDivide No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F302 FU Integer

2

structural hazard, Not issued

?

12

Page 13: Capp 04

Scoreboard Example (Cycle 3)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

L.D F6 34+ R2 1L.D F2 45+ R3MUL.D F0 F2 F4SUB.D F8 F6 F2DIV.D F10 F0 F6ADD.D F6 F8 F2

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer Yes Load F6 R2 YesMult1 NoMult2 NoAdd NoDivide No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F303 FU Integer

2

“issue” is in-order3

?

13

Page 14: Capp 04

Scoreboard Example (Cycle 4)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

L.D F6 34+ R2 1 2 3 4L.D F2 45+ R3MUL.D F0 F2 F4SUB.D F8 F6 F2DIV.D F10 F0 F6ADD.D F6 F8 F2

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer Yes Load F6 R2 YesMult1 NoMult2 NoAdd NoDivide No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F304 FU Integer

14

Page 15: Capp 04

Scoreboard Example (Cycle 5)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

F6 34+ R2 1 2 3 4F2 45+ R3F0 F2 F4F8 F6 F2F10 F0 F6

L.DL.DMUL.DSUB.DDIV.DADD.DF6 F8 F2

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer Yes Load F2 R3 YesMult1 NoMult2 NoAdd NoDivide No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F305 FU Integer

5

15

Page 16: Capp 04

Scoreboard Example (Cycle 6)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operandscompleteResult

F6 34+ R2 1 2 3 4F2 45+ R3F0 F2 F4F8 F6 F2F10 F0 F6F6 F8 F2

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer Yes Load F2 R3 YesMult1Mult2 NoAdd NoDivide No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F306 FU Integer

Yes Mult F0 F2 F4 Integer No Yes

5 6 6

Mult1

L.DL.DMUL.DSUB.DDIV.DADD.D

16

Page 17: Capp 04

Scoreboard Example (Cycle 7)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

F6 34+ R2 1 2 3 4F2 45+ R3F0 F2 F4F8 F6 F2F10 F0 F6F6 F8 F2

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer Yes Load F2 R3 YesMult1Mult2 NoAddDivide No

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F307 FU Integer

5 6 76

Yes Mult F0 F2 F4 Integer No Yes

Yes Sub F8 F6 F2 Integer Yes No

Mult1 Add

7

L.DL.DMUL.DSUB.DDIV.DADD.D

?Still waiting F2 to be

written back

17

Page 18: Capp 04

Scoreboard Example (Cycle 8a)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

No WAW hazardsNo structural hazards

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

F6 34+ R2 1 2 3 4F2 45+ R3F0 F2 F4F8 F6 F2F10 F0 F6F6 F8 F2

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer Yes Load F2 R3 YesMult1Mult2 NoAddDivide

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F308 FU Integer

5 6 76

Yes Mult F0 F2 F4 Integer No Yes

Yes Sub F8 F6 F2 Integer Yes No

Mult1 Add Divide

78

Yes Div F10 F0 F6 Mult1 No Yes

L.DL.DMUL.DSUB.DDIV.DADD.D

18

Page 19: Capp 04

Scoreboard Example (Cycle 8b)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

F6 34+ R2 1 2 3 4F2 45+ R3F0 F2 F4F8 F6 F2F10 F0 F6F6 F8 F2

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1Mult2 NoAddDivide

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F308 FU

5 6 7 86

Yes Mult F0 F2 F4 Yes Yes

Yes Sub F8 F6 F2 Yes Yes

Mult1 Add Divide

78

Yes Div F10 F0 F6 Mult1 No Yes

L.DL.DMUL.DSUB.DDIV.DADD.D

19

Page 20: Capp 04

Scoreboard Example (Cycle 9)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

F6 34+ R2 1 2 3 4F2 45+ R3F0 F2 F4F8 F6 F2F10 F0 F6F6 F8 F2

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No10 Mult1

Mult2 No2 Add

DivideRegister result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F309 FU

5 6 7 86 9

Yes Mult F0 F2 F4 Yes Yes

Yes Sub F8 F6 F2 Yes Yes

Mult1 Add Divide

7 98

Yes Div F10 F0 F6 Mult1 No Yes

?

L.DL.DMUL.DSUB.DDIV.DADD.D

structural hazards

20

Page 21: Capp 04

Scoreboard Example (Cycle 11)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

F6 34+ R2 1 2 3 4F2 45+ R3F0 F2 F4F8 F6 F2F10 F0 F6F6 F8 F2

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No8 Mult1

Mult2 No0 Add

DivideRegister result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3011 FU

5 6 7 86 9

Yes Mult F0 F2 F4 Yes Yes

Yes Sub F8 F6 F2 Yes Yes

Mult1 Add Divide

7 9 118

Yes Div F10 F0 F6 Mult1 No Yes

L.DL.DMUL.DSUB.DDIV.DADD.D

21

Page 22: Capp 04

Scoreboard Example (Cycle 12)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

F6 34+ R2 1 2 3 4F2 45+ R3F0 F2 F4F8 F6 F2F10 F0 F6F6 F8 F2

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No7 Mult1

Mult2 NoAddDivide

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3012 FU

5 6 7 86 9

Yes Mult F0 F2 F4 Yes Yes

No

Mult1 Divide

7 9 11 128

Yes Div F10 F0 F6 Mult1 No Yes

L.DL.DMUL.DSUB.DDIV.DADD.D

22

Page 23: Capp 04

Scoreboard Example (Cycle 13)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

F6 34+ R2 1 2 3 4F2 45+ R3F0 F2 F4F8 F6 F2F10 F0 F6F6 F8 F2

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No6 Mult1

Mult2 NoAddDivide

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3013 FU

5 6 7 86 9

Yes Mult F0 F2 F4 Yes Yes

Mult1 Add Divide

7 9 11 128

Yes Div F10 F0 F6 Mult1 No YesYes Add F6 F8 F2 Yes Yes

13

L.DL.DMUL.DSUB.DDIV.DADD.D

23

Page 24: Capp 04

Scoreboard Example (Cycle 17)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

F6 34+ R2 1 2 3 4F2 45+ R3 5 6 7 8F0 F2 F4 6 9F8 F6 F2 7 9 11 12F10 F0 F6 8F6 F8 F2 13 14 16

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer No2 Mult1 Yes Mult F0 F2 F4 Yes Yes

Mult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3017 FU Mult1 Add Divide

L.DL.DMUL.DSUB.DDIV.DADD.D ?

24

Page 25: Capp 04

Scoreboard Example (Cycle 20)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

F6 34+ R2 1 2 3 4F2 45+ R3 5 6 7 8F0 F2 F4 6 9 19 20F8 F6 F2 7 9 11 12F10 F0 F6 8F6 F8 F2 13 14 16

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1Mult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Yes Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3020 FU Add Divide

No

L.DL.DMUL.DSUB.DDIV.DADD.D

25

Page 26: Capp 04

Scoreboard Example (Cycle 21)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

F6 34+ R2 1 2 3 4F2 45+ R3 5 6 7 8F0 F2 F4 6 9 19 20F8 F6 F2 7 9 11 12F10 F0 F6 8 21F6 F8 F2 13 14 16

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1Mult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Yes Yes

Register result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3021 FU Add Divide

No

L.DL.DMUL.DSUB.DDIV.DADD.D

26

Page 27: Capp 04

Scoreboard Example (Cycle 22)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read ExecutionWrite

Instruction j k Issue operands complete Result

F6 34+ R2 1 2 3 4F2 45+ R3 5 6 7 8F0 F2 F4 6 9 19 20F8 F6 F2 7 9 11 12F10 F0 F6 8 21 F6 F8 F2 13 14 16 22

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1Mult2 NoAdd No

40 Divide Yes Div F10 F0 F6 Yes YesRegister result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3022 FU Divide

No

L.DL.DMUL.DSUB.DDIV.DADD.D

27

Page 28: Capp 04

Scoreboard Example (Cycle 61)FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status Read Execution Write

Instruction j k Issue operands complete Result

F6 34+ R2 1 2 3 4F2 45+ R3 5 6 7 8F0 F2 F4 6 9 19 20F8 F6 F2 7 9 11 12F10 F0 F6 8 21 61 F6 F8 F2 13 14 16 22

Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?

Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1Mult2 NoAdd No

0 Divide Yes Div F10 F0 F6 Yes YesRegister result status

Clock F0 F2 F4 F6 F8 F10 F12 ... F3061 FU Divide

No

L.DL.DMUL.DSUB.DDIV.DADD.D

28

Page 29: Capp 04

Scoreboard Example (Cycle 62)

Instruction status

Instruction  j    k    Issue  Read operands  Execution complete  Write result
L.D    F6    34+  R2   1      2              3                   4
L.D    F2    45+  R3   5      6              7                   8
MUL.D  F0    F2   F4   6      9              19                  20
SUB.D  F8    F6   F2   7      9              11                  12
DIV.D  F10   F0   F6   8      21             61                  62
ADD.D  F6    F8   F2   13     14             16                  22

Functional unit status (Fi = dest, Fj = S1, Fk = S2, Qj = FU for j, Qk = FU for k, Rj = Fj?, Rk = Fk?)

Time  Name     Busy  Op  Fi  Fj  Fk  Qj  Qk  Rj  Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   No

Register result status

Clock       F0  F2  F4  F6  F8  F10  F12  ...  F30
62     FU

• In-order issue

• Out-of-order execute and commit

FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Page 30: Capp 04

Review: Scoreboard

• Limitations of CDC 6600 scoreboard
– No forwarding
– Limited to instructions in basic block (small window)
– Limited number of functional units (structural hazards)
– Stall on WAR hazards
– Stall on WAW hazards

DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D F6, 0(R1)
SUB.D F8, F10, F14
MUL.D F6, F10, F8

WAR (antidependence): ADD.D reads F8, which SUB.D later writes.
WAW (output dependence): ADD.D and MUL.D both write F6.
Both are name dependences.
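
The name dependences in the five-instruction sequence above can be found mechanically. The following is a small sketch, assuming each instruction is encoded as a hypothetical `(opcode, destination, sources)` tuple; it compares every pair of instructions and does not account for intervening writes.

```python
def name_dependences(program):
    """Return (kind, register, earlier_index, later_index) for WAR/WAW pairs."""
    found = []
    for i, (_, dest_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            dest_j = program[j][1]
            if dest_j is None:          # stores write no register
                continue
            if dest_j in srcs_i:        # later write to a register read earlier
                found.append(("WAR", dest_j, i, j))
            if dest_j == dest_i:        # two writes to the same register
                found.append(("WAW", dest_j, i, j))
    return found


PROGRAM = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D", None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

deps = name_dependences(PROGRAM)
for kind, reg, i, j in deps:
    print(f"{kind} on {reg}: {PROGRAM[i][0]} -> {PROGRAM[j][0]}")
```

Besides the two dependences marked on the slide, the pairwise scan also reports a WAR on F6 between S.D (which reads F6) and MUL.D (which writes it).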

Page 31: Capp 04

Tomasulo Algorithm

• Designed for the IBM 360/91, about 3 years after the CDC 6600, by Robert Tomasulo

• Goal: high performance without special compilers

• Designed to overcome long memory access and floating point delays

• RAW hazards are avoided by executing an instruction only when its operands are available

Page 32: Capp 04

Tomasulo Algorithm

• WAR and WAW hazards, which arise from name dependencies, are eliminated by register renaming.

• Registers in instructions are replaced by values or pointers to reservation stations.

• The Common Data Bus (CDB) is used to bypass the registers and pass the results from the reservation stations directly to the functional units.

Page 33: Capp 04

Tomasulo Algorithm

• Differences between Tomasulo Algorithm & Scoreboard
– Control & buffers distributed with functional units vs. centralized in scoreboard; the buffers are called “reservation stations”
– Registers in instructions replaced by pointers to reservation station buffer
– HW renaming of registers to avoid WAW hazards
– Buffer operand values to avoid WAR hazards
– Common Data Bus broadcasts results to all FUs
– Loads and stores treated as FUs as well

Page 34: Capp 04

Tomasulo’s Organization 34

FP unit and load-store unit using Tomasulo’s alg.

Page 35: Capp 04

Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
Stall if there is a structural hazard, i.e. no free reservation station (rs). If an rs is free, the issue logic issues the instruction to the rs and reads operands into the rs if they are ready (register renaming => solves WAR). Mark the destination register as waiting for this latest instruction, even if the previous instruction writing to this register hasn’t completed => solves WAW hazards.

2. Execution: operate on operands (EX)
When both operands are ready, execute; if not ready, watch the CDB for the result => solves RAW.

3. Write result: finish execution (WB)
Write on the Common Data Bus to all awaiting units; mark the reservation station available. Write the result into the destination register only if the register’s status still names this reservation station r => solves WAW.

• Normal data bus: data + destination (“go to” bus)

• CDB: data + source (“come from” bus)
– 64 bits of data + 4 bits of functional unit source address
– Write if matches expected functional unit (produces result)
– Does broadcast
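
The renaming effect of the issue and write-result stages can be sketched as follows. This is a simplified illustration, not the IBM 360/91 hardware: the class `Tomasulo`, its methods, and the register values are hypothetical, execution timing is left out, and the caller supplies each result when it "broadcasts" on the CDB.

```python
class Tomasulo:
    def __init__(self, regs):
        self.regs = dict(regs)  # architectural register values
        self.qi = {}            # register status: register -> producing RS
        self.rs = {}            # RS name -> {"op", "vj", "vk", "qj", "qk"}

    def issue(self, name, op, dest, s1, s2):
        entry = {"op": op, "vj": None, "vk": None, "qj": None, "qk": None}
        for v, q, src in (("vj", "qj", s1), ("vk", "qk", s2)):
            if src in self.qi:
                entry[q] = self.qi[src]    # pointer to the producing RS
            else:
                entry[v] = self.regs[src]  # copy the value now (avoids WAR)
        self.rs[name] = entry
        self.qi[dest] = name               # latest writer wins (avoids WAW)

    def ready(self, name):
        e = self.rs[name]
        return e["qj"] is None and e["qk"] is None

    def broadcast(self, name, value):
        """CDB write: deliver `value`, tagged with its source RS `name`."""
        for e in self.rs.values():
            if e["qj"] == name:
                e["vj"], e["qj"] = value, None
            if e["qk"] == name:
                e["vk"], e["qk"] = value, None
        for reg, producer in list(self.qi.items()):
            if producer == name:           # write only if still the latest writer
                self.regs[reg] = value
                del self.qi[reg]
        del self.rs[name]


# Hypothetical register contents for illustration only.
t = Tomasulo({"F0": 0.0, "F2": 8.0, "F4": 2.0, "F6": 0.0,
              "F8": 1.0, "F10": 5.0, "F14": 3.0})
t.issue("Mult1", "DIV.D", "F0", "F2", "F4")
t.issue("Add1", "ADD.D", "F6", "F0", "F8")    # copies old F8 = 1.0
t.issue("Add2", "SUB.D", "F8", "F10", "F14")
t.issue("Mult2", "MUL.D", "F6", "F10", "F8")  # waits on Add2, renames F6
t.broadcast("Add2", 5.0 - 3.0)                # SUB.D finishes first
add1_vk_after_sub = t.rs["Add1"]["vk"]
mult2_ready = t.ready("Mult2")
t.broadcast("Mult2", 5.0 * 2.0)
t.broadcast("Mult1", 8.0 / 2.0)
t.broadcast("Add1", 4.0 + 1.0)                # stale writer of F6 is ignored
```

ADD.D keeps the old value of F8 that it copied at issue, and its late result does not overwrite the F6 value produced by MUL.D.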

Page 36: Capp 04

Reservation Station Components

Op: operation to perform in the unit (e.g., + or –)
Vj, Vk: values of the source operands.
Qj, Qk: names of the RSs that will provide the source operands. A value of zero means the source operand is already available in Vj or Vk, or is not necessary.
Busy: indicates the reservation station or FU is busy

Register File Status Qi:
Qi: indicates which functional unit will write each register, if one exists. Blank (0) when no pending instruction will write that register, meaning that the value is already available.

Page 37: Capp 04

Tomasulo Example (Cycle 0)

Instruction status:

Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2
LD     F2    45+  R3
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Load buffers:

Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj = S1, Vk = S2, Qj = RS for j, Qk = RS for k):

Time  Name   Busy  Op  Vj  Vk  Qj  Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:

Clock       F0  F2  F4  F6  F8  F10  F12  ...  F30
0      FU

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40
Load takes 2 cycles in execution stage

Page 38: Capp 04

Tomasulo Example (Cycle 1)

38

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F301 FU Load1

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40Load takes 2 cycles in execution stage

Page 39: Capp 04

Tomasulo Example (Cycle 2)

39

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F302 FU Load2 Load1

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40Load takes 2 cycles in execution stage

Page 40: Capp 04

Tomasulo Example (Cycle 3)

40

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F303 FU Mult1 Load2 Load1

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40Load takes 2 cycles in execution stage

Page 41: Capp 04

Tomasulo Example (Cycle 4)

41

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 Yes SUBD M(A1) Load2Add2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F304 FU Mult1 Load2 M(A1) Add1

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40Load takes 2 cycles in execution stage

Page 42: Capp 04

Tomasulo Example (Cycle 5)

42

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

2 Add1 Yes SUBD M(A1) M(A2)Add2 NoAdd3 No

10 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F305 FU Mult1 M(A2) M(A1) Add1 Mult2

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40Load takes 2 cycles in execution stage

Page 43: Capp 04

Tomasulo Example (Cycle 6)

43

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

1 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No

9 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F306 FU Mult1 M(A2) Add2 Add1 Mult2

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40Load takes 2 cycles in execution stage

Page 44: Capp 04

Tomasulo Example (Cycle 7)

44

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

0 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No

8 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F307 FU Mult1 M(A2) Add2 Add1 Mult2

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40Load takes 2 cycles in execution stage

Page 45: Capp 04

Tomasulo Example (Cycle 8)

45

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 No2 Add2 Yes ADDD (M-M) M(A2)

Add3 No7 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F308 FU Mult1 M(A2) Add2 (M-M) Mult2

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40Load takes 2 cycles in execution stage

Page 46: Capp 04

Tomasulo Example (Cycle 10)

46

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 No0 Add2 Yes ADDD (M-M) M(A2)

Add3 No5 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3010 FU Mult1 M(A2) Add2 (M-M) Mult2

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40Load takes 2 cycles in execution stage

Page 47: Capp 04

Tomasulo Example (Cycle 11)

47

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

4 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3011 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40Load takes 2 cycles in execution stage

Page 48: Capp 04

Tomasulo Example (Cycle 15)

48

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 No

0 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3015 FU Mult1 M(A2) (M-M+M)(M-M) Mult2

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40Load takes 2 cycles in execution stage

Page 49: Capp 04

Tomasulo Example (Cycle 16)

49

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 No

40 Mult2 Yes DIVD M*F4 M(A1)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3016 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40Load takes 2 cycles in execution stage

Page 50: Capp 04

Tomasulo Example (Cycle 55)

50

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 No

1 Mult2 Yes DIVD M*F4 M(A1)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3055 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40Load takes 2 cycles in execution stage

Page 51: Capp 04

Tomasulo Example (Cycle 56)

51

Instruction status: Exec WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56ADDD F6 F8 F2 6 10 11

Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 No

0 Mult2 Yes DIVD M*F4 M(A1)

Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F3056 FU M*F4 M(A2) (M-M+M)(M-M) Mult2

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40Load takes 2 cycles in execution stage

Page 52: Capp 04

Tomasulo Example (Cycle 57)

Instruction status:

Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3      15         16
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5      56         57
ADDD   F6    F8   F2   6      10         11

Load buffers:

Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj = S1, Vk = S2, Qj = RS for j, Qk = RS for k):

Time  Name   Busy  Op  Vj  Vk  Qj  Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  No

Register result status:

Clock       F0    F2     F4  F6       F8     F10    F12  ...  F30
57     FU   M*F4  M(A2)      (M-M+M)  (M-M)  Mult2

• In-order issue

• Out-of-order execute and commit

FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40
Load takes 2 cycles in execution stage

Page 53: Capp 04

Lec. 7 53

Branch Prediction

Page 54: Capp 04

Branch Prediction

• Easiest (static prediction)
– Always taken, always not taken
– Opcode based
– Displacement based (forward not taken, backward taken)
– Compiler directed (branch likely, branch not likely)

• Next easiest
– 1-bit predictor: remember last taken/not taken per branch
  • Use a branch-prediction buffer or branch-history table
  • Use part of the PC (low-order bits) to index buffer/table
    – Multiple branches may share the same bit
  • Invert the bit if the prediction is wrong
  • Backward branches for loops will be mispredicted twice

Page 55: Capp 04

Example

Q: Assume a loop branch is taken nine times in a row, then not taken once. What is the prediction accuracy using a 1-bit predictor?

A: On the first iteration, the predictor says not taken, because the last time execution left the loop it set a “0” in the predictor. So it is a misprediction, and the bit is now set to “1”. Prediction then works fine until the last iteration, which is predicted as taken but is not taken. So there are 2 mispredictions in 10 loop executions => 80% accuracy.

How about a 2-bit predictor? Let the prediction be changed only after it misses twice in a row.
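
The arithmetic in this example can be checked with a short simulation. The sketch below assumes the loop is executed many times in a row (so the 1-bit predictor starts in the "not taken" state left by the previous exit) and models the 2-bit predictor as a saturating counter, which is one common variant and may differ in detail from the state machine intended on the slide. Function names are hypothetical.

```python
def accuracy_1bit(outcomes, state=False):
    """1-bit predictor: predict whatever the branch did last time."""
    hits = 0
    for taken in outcomes:
        hits += (state == taken)
        state = taken
    return hits / len(outcomes)


def accuracy_2bit(outcomes, counter=3):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    hits = 0
    for taken in outcomes:
        hits += ((counter >= 2) == taken)
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return hits / len(outcomes)


loop = [True] * 9 + [False]   # taken nine times, then not taken once
trace = loop * 100            # the loop is entered 100 times

acc1 = accuracy_1bit(trace)
acc2 = accuracy_2bit(trace)
print(acc1, acc2)
```

Under these assumptions the 1-bit predictor misses twice per loop execution and the 2-bit counter only once, at the loop exit.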

Page 56: Capp 04

2-bit Branch Prediction

• Has 4 states instead of 2, allowing for more information about tendencies

• A prediction must miss twice before it is changed

• Good for backward branches of loops

Page 57: Capp 04

Nov. 2, 2004 Lec. 7 57

Branch History Table

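[Figure: the low-order bits of the branch PC index the BHT; the selected entry holds 2 prediction bits (e.g. 01).]

• Has limited size

• 2 bits by N (e.g. 4K)

• A 4K-entry table performs about the same as an infinite one, see Fig. 3.9

• Uses low-order bits of branch PC to choose entry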
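
Indexing a branch history table by the low-order PC bits can be sketched as below. The assumptions are loud ones: instructions are 4 bytes (so the two byte-offset bits are dropped), entries are 2-bit saturating counters initialised to 1, and the class name `BranchHistoryTable` is hypothetical.

```python
class BranchHistoryTable:
    def __init__(self, entries=4096):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.mask = entries - 1
        self.counters = [1] * entries  # assumed initial state: weakly not taken

    def index(self, pc):
        # Drop the byte offset (assumes 4-byte instructions), keep low bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)


bht = BranchHistoryTable(4096)
pc_a = 0x1000
pc_b = 0x1000 + 4096 * 4      # a different branch that maps to the same entry
before = bht.predict(pc_a)
bht.update(pc_a, True)
bht.update(pc_a, True)
after = bht.predict(pc_a)
aliased = bht.predict(pc_b)   # shares the entry trained by pc_a
```

Because there are no tags, two branches whose PCs agree in the index bits share one entry, which is the limited-size effect mentioned above.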

Page 58: Capp 04

Can we do better?

• Correlating branch predictors also look at other branches for clues

if (aa==2)        T
    aa = 0
if (bb==2)        T
    bb = 0
if (aa!=bb) { …   NT

[Figure: each entry holds two predictions, one used if the last branch was NT and one used if the last branch was T.]

(1,1) predictor: uses the history of 1 branch and a 1-bit predictor

Page 59: Capp 04

Correlating Branch Predictor

• If we use the last 2 branches as history, then there are 4 possibilities (T-T, T-NT, NT-T, NT-NT).

• For each possibility, we need to use a predictor (1-bit, 2-bit).

• And this repeats for every branch.

(2,2) branch prediction

Page 60: Capp 04


Performance of Correlating Branch Prediction

• With same number of state bits, (2,2) performs better than noncorrelating 2-bit predictor.

• Outperforms a 2-bit predictor with infinite number of entries

Page 61: Capp 04

General (m,n) Branch Predictors

• The global history register (GHR) is an m-bit shift register that records the outcomes of the last m branches encountered by the processor

• Usually use both the PC address and the GHR (2-level)

[Figure: the PC and the m-bit GHR pass through a combining function that selects one of the n-bit predictors.]
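
An (m,n) predictor along these lines can be sketched in Python. The combining function is not specified on the slide; the sketch assumes simple concatenation of low-order PC bits with the GHR, 4-byte instructions, and counters initialised to 0. The names `CorrelatingPredictor` and `run` are hypothetical.

```python
class CorrelatingPredictor:
    def __init__(self, m=2, n=2, pc_bits=10):
        self.m = m
        self.max_count = (1 << n) - 1
        self.pc_mask = (1 << pc_bits) - 1
        self.ghr = 0                              # m-bit global history
        self.table = [0] * (1 << (pc_bits + m))   # n-bit saturating counters

    def _index(self, pc):
        # Assumed combining function: PC bits concatenated with the GHR.
        return (((pc >> 2) & self.pc_mask) << self.m) | self.ghr

    def predict(self, pc):
        return self.table[self._index(pc)] > self.max_count // 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.table[i]
        self.table[i] = min(c + 1, self.max_count) if taken else max(c - 1, 0)
        # Shift the outcome into the history register, keeping m bits.
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.m) - 1)


def run(predictor, outcomes, pc=0x400):
    """Return a list of booleans: was each prediction correct?"""
    hits = []
    for taken in outcomes:
        hits.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return hits


alternating = [k % 2 == 0 for k in range(200)]   # T, NT, T, NT, ...
corr_hits = sum(run(CorrelatingPredictor(m=2, n=2), alternating)[100:])
plain_hits = sum(run(CorrelatingPredictor(m=0, n=2), alternating)[100:])
print(corr_hits, plain_hits)
```

With m = 0 the sketch degenerates to a plain 2-bit predictor, which makes it easy to compare the two on a branch whose outcome depends on recent history.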

Page 62: Capp 04

Is Branch Predictor Enough?

• When is using branch prediction beneficial?
– When the outcome is known later than the target
– For example, in our standard MIPS pipeline, we compute the target in the ID stage, but testing the branch condition incurs a structural hazard in the register file.

• If we predict the branch is taken and suppose it is correct, what is the target address?
– Need a mechanism to provide the target address as well

• Can we eliminate the one-cycle delay for the 5-stage pipeline?
– Need to fetch from the branch target immediately after the branch

Page 63: Capp 04

Branch Target Buffer (BTB)

Is the current instruction a branch?

• The BTB provides the answer before the current instruction is decoded, and therefore enables fetching to begin after the IF stage.

What is the branch target?

• The BTB provides the branch target if the prediction is a taken direct branch (for not-taken branches the target is simply PC+4).
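
A BTB lookup can be sketched as a small tagged table consulted with the fetch PC. This is an illustrative sketch only: it assumes a direct-mapped table, 4-byte instructions, the full PC as tag, and that only branches predicted taken are kept in the table; `BranchTargetBuffer` and its methods are hypothetical names.

```python
class BranchTargetBuffer:
    def __init__(self, entries=1024):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.mask = entries - 1
        self.entries = [None] * entries   # each entry: (tag_pc, target) or None

    def _index(self, pc):
        return (pc >> 2) & self.mask      # assumes 4-byte instructions

    def next_fetch_pc(self, pc):
        """Called during IF, before the instruction is decoded."""
        e = self.entries[self._index(pc)]
        if e is not None and e[0] == pc:  # hit: a branch predicted taken
            return e[1]
        return pc + 4                     # miss: fall through

    def record_taken(self, pc, target):
        self.entries[self._index(pc)] = (pc, target)

    def record_not_taken(self, pc):
        i = self._index(pc)
        if self.entries[i] is not None and self.entries[i][0] == pc:
            self.entries[i] = None


btb = BranchTargetBuffer()
cold = btb.next_fetch_pc(0x400)             # unknown instruction
btb.record_taken(0x400, 0x380)              # loop branch resolved as taken
warm = btb.next_fetch_pc(0x400)             # target available during IF
other = btb.next_fetch_pc(0x400 + 1024 * 4) # same index, different tag
btb.record_not_taken(0x400)
evicted = btb.next_fetch_pc(0x400)
```

The tag comparison is what lets the table answer "is this a branch?" for an instruction that has not been decoded yet.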