MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

35
MS108 Computer System I Lecture 7 Tomasulo’s Algorithm Prof. Xiaoyao Liang 2014/3/24 1

Transcript of MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

Page 1: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

MS108 Computer System I

Lecture 7 Tomasulo’s Algorithm

Prof. Xiaoyao Liang 2014/3/24 1

Page 2: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

2

The Tomasulo’s Algorithm

• From IBM 360/91• Goal: High Performance using a limited number

of registers without a special compiler– 4 double-precision FP registers on 360– Uses register renaming

• Why Study a 1966 Computer? – The descendants of this include: Alpha 21264, HP

8000, MIPS 10000, Pentium III, PowerPC 604, …

Page 3: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

3

Tomasulo Algorithm• Control & buffers are distributed with Function Units

(FU)– FU buffers called “reservation stations (RS)”– Contain information about instructions, including operands– More reservation stations than registers, so can do

optimizations compilers can’t• Registers in instructions replaced by values or pointers

to reservation stations – form of register renaming– avoids WAR, WAW hazards

• Results to FU from RS, not through registers (equivalent of forwarding). A Common Data Bus (CDB) broadcasts results to all FUs (their RSes)

• Loads and Stores treated as FUs with RSes as well

Page 4: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

4

Tomasulo Organization

FP addersFP adders

Add1Add2Add3

FP multipliersFP multipliers

Mult1Mult2

From Mem FP Registers

Reservation Stations

Common Data Bus (CDB)

To Mem

FP OpQueue

Load Buffers

Store Buffers

Load1Load2Load3Load4Load5Load6

Page 5: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

5

Reservation Station Components

• Busy: Indicates reservation station or FU is busy• Op: Operation to perform in the unit (e.g., + or –)• Vj, Vk: Value of Source operands• Qj, Qk: Reservation stations producing source

registers (value to be written)– Note: Qj,Qk=0 => ready

• A: effective address

Tomasulo Organization

Page 6: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

6

• Register result status— Qi– Indicates which functional unit will write each register,

if one exists. Blank when no pending instructions that will write that register

• Common data bus– Normal data bus: data + destination (“go to” bus)– CDB: data + source (“come from” bus)

• 64 bits of data + 4 bits of Functional Unit source address• Write if matches expected Functional Unit (produces result)• Does the broadcast

Tomasulo Organization

Page 7: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

7

Three Stages of Tomasulo Algorithm

• 1. Issue—get instruction from FP Op Queue– If reservation station free (no structural

hazard), control issues the instruction & sends operands (renames registers).

• 2. Execute—operate on operands (EX)– When both operands ready then execute;

if not ready, watch Common Data Bus for result

• 3. Write result—finish execution (WB)– Write on Common Data Bus to all awaiting

units; mark reservation station available

Page 8: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

8

Tomasulo Loop ExampleLoop: LD F0, 0(R1) MULTD F4, F0, F2

SD F4, 0(R1) SUBI R1, R1, #8 BNEZ R1, Loop

• This time assume multiply takes 4 clock cycles in the execution stage

• Assume 1st load takes 8 clock cycles (L1 cache miss) in the execution stage, 2nd load takes 1 extra cycle (hit)

• Assume store takes 3 cycles in the execution stage• To be clear, will show clocks for SUBI, BNEZ• Show about 2 iterations

Page 9: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

9

Loop Example using simplified presentation for load/store components

Added Store Buffers

Value of Register used for address, iteration control

Instruction Loop

Iter-ationCount

Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Qk

1 LD F0 0 R1 Load1 No1 MULTD F4 F0 F2 Load2 No1 SD F4 0 R1 Load3 No2 LD F0 0 R1 Store1 No2 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

0 80 Qi

Page 10: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

10

Loop Example Cycle 1Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 Load1 Yes 80

Load2 NoLoad3 NoStore1 NoStore2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F301 80 Qi Load1

Page 11: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

11

Loop Example Cycle 2Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No

Load3 NoStore1 NoStore2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F302 80 Qi Load1 Mult1

Page 12: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

12

Loop Example Cycle 3Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No

Store1 Yes 80 Mult1Store2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F303 80 Qj Load1 Mult1

Page 13: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

13

Loop Example Cycle 4Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No

Store1 Yes 80 Mult1Store2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F304 80 Qi Load1 Mult1

• Dispatching SUBI Instruction (not in FP queue)

Page 14: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

14

Loop Example Cycle 5Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 No

Store1 Yes 80 Mult1Store2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F305 72 Qi Load1 Mult1

• And, BNEZ instruction (not in FP queue)

Page 15: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

15

Loop Example Cycle 6Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult1

Store2 NoStore3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F306 72 Qi Load2 Mult1

• Notice that F0 never sees Load from location 80

Page 16: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

16

Loop Example Cycle 7Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 No

Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F307 72 Qi Load2 Mult2

• Register file completely detached from computation• First and Second iteration completely overlapped

Page 17: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

17

Loop Example Cycle 8Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F308 72 Qi Load2 Mult2

Page 18: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

18

Loop Example Cycle 9Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 9 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F309 72 Qi Load2 Mult2

• Load1 completing: who is waiting?• Note: Dispatching SUBI

Page 19: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

19

Loop Example Cycle 10

• Load2 completing: who is waiting?• Note: Dispatching BNEZ

Instruction status: Exec WriteITER Instruction j k Issue CompResult Busy Addr Qk

1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 Yes 721 SD F4 0 R1 3 Load3 No2 LD F0 0 R1 6 10 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

4 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop

Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30

10 64 Qi Load2 Mult2

Page 20: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

20

Tomasulo Organization

FP addersFP adders

Add1Add2Add3

FP multipliersFP multipliers

Mult1Mult2

From Mem FP Registers

Reservation Stations

Common Data Bus (CDB)

To Mem

FP OpQueue

Load Buffers

Store Buffers

Load1Load2Load3Load4Load5Load6

Page 21: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

21

Loop Example Cycle 11Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

3 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #84 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3011 64 Qi Load3 Mult2

• Next load in sequence

Page 22: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

22

Loop Example Cycle 12Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

2 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #83 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3012 64 Qi Load3 Mult2

• Why not issue third multiply?

Page 23: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

23

Loop Example Cycle 13Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

1 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #82 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3013 64 Qi Load3 Mult2

• Why not issue third store?

Page 24: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

24

Loop Example Cycle 14Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

0 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #81 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3014 64 Qi Load3 Mult2

• Mult1 completing. Who is waiting?

Page 25: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

25

Loop Example Cycle 15Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 #8

0 Mult2 Yes Multd M[72] R(F2) BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3015 64 Qi Load3 Mult2

• Mult2 completing. Who is waiting?

Page 26: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

26

Loop Example Cycle 16Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 No

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1

4 Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3016 64 Qi Load3 Mult1

Page 27: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

27

Loop Example Cycle 17Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 Yes 64 Mult1

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3017 64 Qi Load3 Mult1

Page 28: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

28

Loop Example Cycle 18Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 [80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 Store3 Yes 64 Mult1

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3018 64 Qi Load3 Mult1

Page 29: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

29

Loop Example Cycle 19Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 19 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 No2 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 19 Store3 Yes 64 Mult1

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3019 56 Qi Load3 Mult1

Page 30: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

30

Loop Example Cycle 20Instruction status: Exec Write

ITER Instruction j k Issue CompResult Busy Addr Qk1 LD F0 0 R1 1 9 10 Load1 Yes 561 MULTD F4 F0 F2 2 14 15 Load2 No1 SD F4 0 R1 3 18 19 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 No2 MULTD F4 F0 F2 7 15 16 Store2 No2 SD F4 0 R1 8 19 20 Store3 Yes 64 Mult1

Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code:

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop

Register result status

Clock R1 F0 F2 F4 F6 F8 F10 F12 ... F3020 56 Qi Load1 Mult1

• Once again: In-order issue, out-of-order execution and out-of-order completion.

Page 31: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

31

Why can Tomasulo overlap iterations of loops?

• Register renaming– Multiple iterations use different physical

destinations for registers (dynamic loop unrolling).

• Reservation stations – Buffer old values of registers - avoiding the WAR

stall that we saw in the scoreboard.

• Other perspective: Tomasulo builds data flow dependency graph on the fly.

Page 32: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.

32

Tomasulo’s scheme offers 2 major advantages

(1)the distribution of the hazard detection logic– Distributed reservation stations and the CDB– If multiple instructions waiting on single result,

the instructions can be released simultaneously by broadcast on CDB

– If a centralized register file were used, the units would have to read their results from the registers when register buses are available.

(2) the elimination of stalls for WAW and WAR hazards

Page 33: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.
Page 34: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.
Page 35: MS108 Computer System I Lecture 7 Tomasulos Algorithm Prof. Xiaoyao Liang 2014/3/24 1.