Oct. 18, 2000 Machine Organization 1

Machine Organization (CS 570)

Lecture 7: Dynamic Scheduling and Branch Prediction*

Jeremy R. Johnson

Wed. Nov. 8, 2000

*This lecture was derived from material in the text (Chap. 4). All figures from Computer Architecture: A Quantitative Approach, Second Edition, by John Hennessy and David Patterson, are copyrighted material (COPYRIGHT 1996 MORGAN KAUFMANN PUBLISHERS, INC. ALL RIGHTS RESERVED).


Introduction

• Objective: To understand how pipeline scheduling, loop unrolling, and branch prediction can be carried out in hardware.
  – We contrast the dynamic approaches with the compiler techniques discussed previously. Since many of these techniques become more important when multiple instructions are issued per cycle, we give a brief overview of techniques for multiple instruction issue.
  – We will also review static techniques for branch prediction.

• Topics
  – Review of static branch prediction
    • tcov profiling tool
  – Dynamic scheduling
    • scoreboard
    • Tomasulo's algorithm
  – Branch prediction
  – Multiple issue
    • superscalar
    • VLIW


Dynamic Scheduling

• Hardware rearranges instructions at runtime to reduce stalls
  – simplifies the compiler
  – catches some cases where dependencies are not known at compile time
  – Scoreboard (RAW)
  – Register renaming (WAW, WAR)

• A major limitation of the previous pipelining techniques is that they use in-order instruction issue. If an instruction is stalled, no later instructions can proceed.

DIVD F0, F2, F4    ; long-running instruction
ADDD F10, F0, F8   ; depends on DIVD (F0), so it must wait
SUBD F12, F8, F14  ; no dependence on DIVD; its stall is eliminated by
                   ; no longer requiring in-order execution


Split ID Stage

• Issue - Decode instruction, check for structural hazards
• Read operands - Wait until no data hazards, then read operands

• This allows multiple instructions to be in execution at the same time.

• Out-of-order execution is possible (WAR, WAW hazards may occur)
  – DIVD F0, F2, F4
  – ADDD F10, F0, F8
  – SUBD F8, F8, F14 ; WAR with ADDD on F8; if dest were F10 instead, WAW
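The three hazard types in this example can be checked mechanically. The following is a minimal sketch, not part of the lecture: the `Instr` tuple and the `hazards` function are hypothetical names introduced only for illustration, and each instruction is assumed to have one destination and two source registers.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str


def hazards(earlier: Instr, later: Instr) -> set:
    """Return the data hazards `later` has with respect to `earlier`."""
    found = set()
    if earlier.dest in (later.src1, later.src2):
        found.add("RAW")  # later reads a register that earlier writes
    if later.dest in (earlier.src1, earlier.src2):
        found.add("WAR")  # later overwrites a register that earlier reads
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found


divd = Instr("DIVD", "F0", "F2", "F4")
addd = Instr("ADDD", "F10", "F0", "F8")
subd = Instr("SUBD", "F8", "F8", "F14")

print(hazards(divd, addd))                       # ADDD reads F0 from DIVD
print(hazards(addd, subd))                       # SUBD writes F8 read by ADDD
print(hazards(addd, subd._replace(dest="F10")))  # SUBD destination set to F10
```

With in-order execution only the RAW case matters; the WAR and WAW cases become real hazards once SUBD is allowed to run ahead of ADDD.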


Scoreboard


Pipeline Steps with Scoreboard

• Issue
  – if the functional unit is free and no other active instruction has the same destination register, then issue the instruction and update the scoreboard
  – may have a queue for instruction fetches (stall when full)
  – removes WAW hazards

• Read operands
  – scoreboard monitors availability of source operands
  – when they are available, it tells the functional unit to read operands
  – resolves RAW hazards (instructions may be sent to execution out of order)

• Execution
  – functional unit begins execution upon receiving operands
  – notifies the scoreboard when the result is ready

• Write result
  – when the result is available, check for WAR hazards, stall if necessary, and then write to the register


Example (Fig. 4.4 - 4.6)


Checks and Bookkeeping

• Issue
  – Wait until not Busy[FU] and not Result[D]
  – Busy[FU] = yes; Op[FU] = op; Fi[FU] = D;
  – Fj[FU] = S1; Fk[FU] = S2; Qj = Result[S1]; Qk = Result[S2];
  – Rj = not Qj; Rk = not Qk; Result[D] = FU;

• Read operands
  – Wait until Rj and Rk
  – Rj = no; Rk = no; Qj = 0; Qk = 0;

• Execution complete
  – Wait until functional unit done

• Write result
  – Wait until ∀f ((Fj[f] ≠ Fi[FU] or Rj[f] = No) & (Fk[f] ≠ Fi[FU] or Rk[f] = No))
  – ∀f (if Qj[f] = FU then Rj[f] = yes);
  – ∀f (if Qk[f] = FU then Rk[f] = yes);
  – Result[Fi[FU]] = 0; Busy[FU] = no;
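The checks above can be exercised on the DIVD/ADDD/SUBD sequence from the earlier slide. This is a minimal sketch under stated assumptions, not the hardware described in the text: the `Scoreboard` class and the unit names `Div`, `Add1`, `Add2` are hypothetical, `None` stands in for the value 0 in the Q fields, and execution time is not modeled (the caller decides when a unit has finished).

```python
class Scoreboard:
    """Toy scoreboard; field names (Fi, Fj, Qj, Rj, ...) follow the slide."""

    def __init__(self, units):
        # Functional unit status, one record per unit.
        self.fu = {u: dict(busy=False, op=None, Fi=None, Fj=None, Fk=None,
                           Qj=None, Qk=None, Rj=False, Rk=False)
                   for u in units}
        # Register result status: register -> unit that will write it.
        self.result = {}

    def can_issue(self, unit, dest):
        # Structural hazard check and WAW check.
        return not self.fu[unit]["busy"] and dest not in self.result

    def issue(self, unit, op, dest, s1, s2):
        assert self.can_issue(unit, dest)
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[unit].update(busy=True, op=op, Fi=dest, Fj=s1, Fk=s2,
                             Qj=qj, Qk=qk, Rj=qj is None, Rk=qk is None)
        self.result[dest] = unit

    def can_read(self, unit):
        # RAW check: wait until both source operands are ready.
        f = self.fu[unit]
        return f["busy"] and f["Rj"] and f["Rk"]

    def read_operands(self, unit):
        assert self.can_read(unit)
        self.fu[unit].update(Rj=False, Rk=False, Qj=None, Qk=None)

    def can_write(self, unit):
        # WAR check: no unit may still be waiting to read our destination.
        fi = self.fu[unit]["Fi"]
        return all((f["Fj"] != fi or not f["Rj"]) and
                   (f["Fk"] != fi or not f["Rk"])
                   for f in self.fu.values())

    def write_result(self, unit):
        assert self.can_write(unit)
        for f in self.fu.values():
            if f["Qj"] == unit:
                f["Rj"] = True
            if f["Qk"] == unit:
                f["Rk"] = True
        del self.result[self.fu[unit]["Fi"]]
        self.fu[unit]["busy"] = False


sb = Scoreboard(["Div", "Add1", "Add2"])      # unit names are hypothetical
sb.issue("Div", "DIVD", "F0", "F2", "F4")
sb.issue("Add1", "ADDD", "F10", "F0", "F8")
sb.issue("Add2", "SUBD", "F8", "F8", "F14")
sb.read_operands("Div")
sb.read_operands("Add2")
raw_stall = not sb.can_read("Add1")    # ADDD waits for F0 from DIVD
war_stall = not sb.can_write("Add2")   # SUBD may not overwrite F8 yet
sb.write_result("Div")                 # DIVD finishes; F0 becomes available
sb.read_operands("Add1")               # ADDD now reads F0 and F8
war_cleared = sb.can_write("Add2")     # SUBD may now write F8
print(raw_stall, war_stall, war_cleared)
```

Note that ADDD does not read F8 early even though F8 is available: operands are read together once both are ready, which is exactly why SUBD must stall at write result.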


Register Renaming (Tomasulo)

• Uses reservation stations to buffer instructions waiting to issue
  – fetch operands as soon as possible
  – eliminates need to get operand from a register
  – pending instructions designate the reservation station that will provide results
  – with successive writes to a register, only the last one actually updates the register
  – register specifiers are renamed to reservation stations
  – eliminates WAW and WAR hazards

• Uses distributed control (common data bus)
• Results go directly to functional units from reservation stations rather than through the register file


Hardware for Register Renaming (Fig. 4.8)


Pipeline Steps with Renaming

• Issue
  – Get instruction from floating point queue
  – If an FP reservation station is free, send the instruction with its operands if they are in registers
  – If ld/st, issue if there is an available buffer
  – renaming done here

• Execution
  – If one or more operands are not available, monitor the CDB while waiting for them to be computed
  – When operands are available, execute the operation
  – Check for RAW hazards

• Write result
  – When the result is available, write it on the CDB and from there into registers, reservation stations, and waiting store buffers


Example (Fig. 4.9 - 4.10)


Checks and Bookkeeping

• Issue
  – Wait until station or buffer empty
  – if (Register[S1].Qi ≠ 0) { RS[r].Qj = Register[S1].Qi }
  – else { RS[r].Vj = S1; RS[r].Qj = 0; }
  – if (Register[S2].Qi ≠ 0) { RS[r].Qk = Register[S2].Qi }
  – else { RS[r].Vk = S2; RS[r].Qk = 0; }
  – RS[r].Busy = yes; Register[D].Qi = r;

• Execution complete
  – Wait until (RS[r].Qj = 0) and (RS[r].Qk = 0)
  – no bookkeeping - operands are in Vj and Vk

• Write result
  – Wait until execution completed at r and CDB available
  – ∀x (if (Register[x].Qi = r) then {Fx = result; Register[x].Qi = 0})
  – ∀x (if (RS[x].Qj = r) then {RS[x].Vj = result; RS[x].Qj = 0})
  – ∀x (if (RS[x].Qk = r) then {RS[x].Vk = result; RS[x].Qk = 0})
  – ∀x (if (Store[x].Qi = r) then {Store[x].V = result; Store[x].Qi = 0})
  – RS[r].Busy = no;
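The bookkeeping above can be tried on the same DIVD/ADDD/SUBD sequence to see why the WAR hazard on F8 disappears. This is a minimal sketch with several assumptions: the `Tomasulo` class, the station names, and the register values are hypothetical, `None` stands in for the tag value 0, store buffers are omitted, and execution and the CDB write are folded into a single step.

```python
import operator

OPS = {"ADDD": operator.add, "SUBD": operator.sub,
       "MULTD": operator.mul, "DIVD": operator.truediv}


class Tomasulo:
    def __init__(self, registers, stations):
        self.value = dict(registers)                   # register contents
        self.qi = {name: None for name in registers}   # Register[x].Qi
        self.rs = {s: dict(busy=False, op=None, Vj=None, Vk=None,
                           Qj=None, Qk=None) for s in stations}

    def issue(self, r, op, dest, s1, s2):
        st = self.rs[r]
        assert not st["busy"], "wait until the station is empty"
        for src, v, q in ((s1, "Vj", "Qj"), (s2, "Vk", "Qk")):
            if self.qi[src] is not None:   # operand is still being computed
                st[v], st[q] = None, self.qi[src]
            else:                          # operand available: copy its value
                st[v], st[q] = self.value[src], None
        st.update(busy=True, op=op)
        self.qi[dest] = r                  # rename dest to station r

    def ready(self, r):
        st = self.rs[r]
        return st["busy"] and st["Qj"] is None and st["Qk"] is None

    def execute_and_write(self, r):
        assert self.ready(r)
        st = self.rs[r]
        result = OPS[st["op"]](st["Vj"], st["Vk"])
        # Broadcast on the CDB to registers and stations waiting on r.
        for x in self.qi:
            if self.qi[x] == r:
                self.value[x], self.qi[x] = result, None
        for other in self.rs.values():
            if other["Qj"] == r:
                other["Vj"], other["Qj"] = result, None
            if other["Qk"] == r:
                other["Vk"], other["Qk"] = result, None
        st["busy"] = False


t = Tomasulo({"F0": 0.0, "F2": 8.0, "F4": 2.0,
              "F8": 3.0, "F10": 0.0, "F14": 1.0},
             ["Div1", "Add1", "Add2"])     # station names are hypothetical
t.issue("Div1", "DIVD", "F0", "F2", "F4")
t.issue("Add1", "ADDD", "F10", "F0", "F8")   # captures the old F8 at issue
t.issue("Add2", "SUBD", "F8", "F8", "F14")
addd_waiting = not t.ready("Add1")           # RAW on F0 still pending
t.execute_and_write("Add2")                  # SUBD finishes first, writes F8
t.execute_and_write("Div1")                  # DIVD result forwarded to Add1
t.execute_and_write("Add1")                  # ADDD uses the old F8 value
print(t.value["F8"], t.value["F10"])
```

SUBD overwrites F8 before ADDD executes, yet ADDD still computes with the value of F8 it copied into Vk at issue, so no WAR stall is needed.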


Dynamic Unrolling (Fig. 4.12)

Loop: LD    F0, 0(R1)
      MULTD F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop

• If we predict that the branch is taken, using reservation stations will allow multiple executions of this loop to proceed at once (dynamic unrolling)
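The effect of renaming on successive iterations can be illustrated with a small sketch. It is a simplification under stated assumptions: the `rename` function and the tags `T1`, `T2`, ... are hypothetical stand-ins for reservation station and load buffer names, integer and FP registers are renamed uniformly, and the branch is left out because it is assumed to be predicted taken.

```python
from itertools import count


def rename(instrs):
    """Replace each source register by the tag of its latest producer."""
    latest = {}        # register -> tag of the most recent writer
    tags = count(1)
    renamed = []
    for op, dest, srcs in instrs:
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        tag = None
        if dest is not None:
            tag = f"T{next(tags)}"
            latest[dest] = tag
        renamed.append((op, tag, new_srcs))
    return renamed


# One loop body as (op, destination, sources); SD has no register result.
body = [
    ("LD", "F0", ("R1",)),
    ("MULTD", "F4", ("F0", "F2")),
    ("SD", None, ("R1", "F4")),
    ("SUBI", "R1", ("R1",)),
]

two_iters = rename(body + body)
for entry in two_iters:
    print(entry)
```

The two MULTD instructions end up reading different tags (`T1` and `T4`) and producing different tags (`T2` and `T5`), so the reuse of F0 and F4 in the second iteration no longer forces it to wait for the first.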


Dynamic Branch Prediction

• Provide hardware to dynamically predict whether a branch is taken or not

• In order to be effective when we predict that a branch will be taken, it is necessary to be able to compute the address before we would normally determine whether to take the branch

– branch target buffer (BTB) provides a cache of branch target addresses

• Simplest approach uses 1 bit to remember whether the branch was taken the last time or not

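• In a simple loop this leads to two mispredictions each time the loop is executed: one on the final iteration and one on the first iteration of the next execution
  – A two-bit scheme will improve this situation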
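The difference between the one-bit and two-bit schemes can be seen by simulating a single loop branch. This is a minimal sketch: the `mispredictions` function is a hypothetical helper, the two-bit predictor is modeled as a saturating counter (one common variant, which may differ in detail from the state diagram in the text), and the counter is assumed to start in the strongest "taken" state.

```python
def mispredictions(outcomes, bits):
    """Count mispredictions of one saturating counter with `bits` bits."""
    top = (1 << bits) - 1
    counter = top                  # assumed start: strongest "taken" state
    wrong = 0
    for taken in outcomes:
        predicted_taken = counter > top // 2
        if predicted_taken != taken:
            wrong += 1
        # Move toward "taken" or "not taken", saturating at the ends.
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return wrong


# A loop branch that is taken 9 times and then falls through,
# with the whole loop executed 3 times.
loop = ([True] * 9 + [False]) * 3

print(mispredictions(loop, bits=1))
print(mispredictions(loop, bits=2))
```

After the first execution, the one-bit predictor is wrong twice per execution of the loop (at exit and at re-entry), while the two-bit counter is wrong only at the exit.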


Two-Bit Prediction Scheme


Branch Target Buffer
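A branch target buffer can be pictured as a small cache indexed by the address of the fetched instruction. The sketch below is illustrative only: the `BranchTargetBuffer` class is hypothetical, and it assumes a direct-mapped table, 4-byte instructions, and a policy that stores only taken branches and deletes an entry when its branch turns out not to be taken.

```python
class BranchTargetBuffer:
    def __init__(self, entries):
        self.entries = entries
        self.table = [None] * entries     # each slot: (branch_pc, target_pc)

    def _index(self, pc):
        return (pc // 4) % self.entries   # assumes 4-byte instructions

    def predict_next_pc(self, pc):
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]                # hit: predict taken, use the target
        return pc + 4                     # miss: fetch the next instruction

    def update(self, pc, taken, target):
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)  # remember the taken branch
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None          # predicted taken but was not taken


btb = BranchTargetBuffer(16)
first = btb.predict_next_pc(0x40)         # unknown branch: falls through
btb.update(0x40, taken=True, target=0x10)
second = btb.predict_next_pc(0x40)        # now predicts the stored target
btb.update(0x40, taken=False, target=0x10)
third = btb.predict_next_pc(0x40)         # entry removed again
print(hex(first), hex(second), hex(third))
```

Because the lookup uses only the fetch address, the predicted target is available before the instruction has even been decoded as a branch.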


Frequency of Mispredictions (4.14)

4096-entry 2-bit prediction buffer


Handling Instructions with BTB (4.23)


Multiple Issue

• Previous techniques were used to eliminate data and control stalls. They allow us to approach the ideal CPI of 1

• To improve performance further, we would like to decrease the CPI to less than 1. This cannot happen if we can issue only one instruction per cycle

• Multiple issue processors
  – superscalar
  – VLIW


Code Example

Loop: LD   F0, 0(R1)    ; F0 = array element
      ADDD F4, F0, F2   ; add scalar in F2
      SD   0(R1), F4    ; store result
      SUBI R1, R1, #8   ; decrement pointer
                        ; 8 bytes per double
      BNEZ R1, Loop     ; branch R1 != zero

• Latencies (in clock cycles)

Inst. producing result   Inst. using result   Latency
FP ALU op                FP ALU op            3
FP ALU op                Store                2
Load                     FP ALU op            1
Load                     Store                0
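The stalls that these latencies impose on the unscheduled loop can be worked out with a short sketch. It rests on assumptions beyond the slide: the `schedule` function is a hypothetical helper, one instruction issues per cycle in program order, and only the latencies in the table are modeled (integer-to-branch latency and the branch delay are not in the table and are ignored).

```python
LATENCY = {
    ("FP ALU op", "FP ALU op"): 3,
    ("FP ALU op", "Store"): 2,
    ("Load", "FP ALU op"): 1,
    ("Load", "Store"): 0,
}

# (mnemonic, class, destination, sources); class is None for integer ops.
loop = [
    ("LD", "Load", "F0", ["R1"]),
    ("ADDD", "FP ALU op", "F4", ["F0", "F2"]),
    ("SD", "Store", None, ["R1", "F4"]),
    ("SUBI", None, "R1", ["R1"]),
    ("BNEZ", None, None, ["R1"]),
]


def schedule(instrs):
    """Return the issue cycle of each instruction, in program order."""
    producer = {}      # register -> (class, issue cycle) of its last writer
    cycles = []
    cycle = 0
    for _name, kind, dest, srcs in instrs:
        cycle += 1
        for s in srcs:
            if s in producer:
                pkind, pcycle = producer[s]
                lat = LATENCY.get((pkind, kind), 0)   # unlisted pairs: 0
                cycle = max(cycle, pcycle + lat + 1)  # stall if needed
        cycles.append(cycle)
        if dest is not None:
            producer[dest] = (kind, cycle)
    return cycles


issue_cycles = schedule(loop)
print(issue_cycles)
```

Under these assumptions the five instructions occupy eight cycles: one stall after the load and two stalls between ADDD and SD.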


Superscalar DLX

• Can issue two instructions per cycle
  – integer including ld/st/br
  – FP
  – To make this worthwhile, we need either multiple FP units or pipelined FP units
  – This restriction simplifies the implementation (e.g. use opcode to detect issue restriction)
  – Extra difficulty with simultaneous ld/st and FP operation (contention for register datapath)

• Need to fetch and decode 64 bits of instructions
  – assume that they are aligned on 64-bit boundaries
  – integer instruction comes first


Superscalar Scheduling

      Integer instruction   FP instruction      Clock cycle
Loop: LD   F0, 0(R1)                            1
      LD   F6, -8(R1)                           2
      LD   F10, -16(R1)     ADDD F4, F0, F2     3
      LD   F14, -24(R1)     ADDD F8, F6, F2     4
      LD   F18, -32(R1)     ADDD F12, F10, F2   5
      SD   0(R1), F4        ADDD F16, F14, F2   6
      SD   -8(R1), F8       ADDD F20, F18, F2   7
      SD   -16(R1), F12                         8
      SUBI R1, R1, #40                          9
      SD   16(R1), F16                          10
      BNEZ R1, Loop                             11
      SD   8(R1), F20                           12
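The payoff of this schedule can be quantified directly from the table. The sketch below simply counts slots; the list `schedule` is a transcription of the table (opcode names only), and the variable names are introduced here for illustration.

```python
# One (integer slot, FP slot) pair per clock cycle, as in the table above.
schedule = [
    ("LD", None), ("LD", None),
    ("LD", "ADDD"), ("LD", "ADDD"), ("LD", "ADDD"),
    ("SD", "ADDD"), ("SD", "ADDD"),
    ("SD", None), ("SUBI", None), ("SD", None),
    ("BNEZ", None), ("SD", None),
]

cycles = len(schedule)
instructions = sum(slot is not None for pair in schedule for slot in pair)
iterations = sum(pair[0] == "LD" for pair in schedule)  # one load each

cpi = cycles / instructions
cycles_per_iteration = cycles / iterations

print(cycles, instructions, iterations)
print(round(cpi, 2), cycles_per_iteration)
```

The unrolled loop covers five iterations in twelve cycles, and the CPI drops below 1, which a single-issue pipeline cannot achieve.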


Dynamic Scheduling

It  Inst             Issue  Ex  Mem  WB
1   LD F0,0(R1)      1      2   3    3
1   ADDD F4,F0,F2    1      4   -    6
1   SD 0(R1),F4      2      3   7    -
1   SUBI R1,R1,#8    3      4   -    5
1   BNEZ R1,Loop     4      5   -    -
2   LD F0,0(R1)      5      6   8    8
2   ADDD F4,F0,F2    5      9   -    11
2   SD 0(R1),F4      6      7   12   -
2   SUBI R1,R1,#8    7      8   -    9
2   BNEZ R1,Loop     8      9   -    -


VLIW Scheduling

• Each instruction contains
  – two memory references
  – two FP operations
  – one integer or branch operation

• In the following example
  – 7 loop iterations in 9 cycles
  – 23 operations (2.5 ops/cycle)
  – 60% efficiency
  – needs extra registers


Mem ref 1         Mem ref 2         FP op 1           FP op 2           Int op / branch
LD F0,0(R1)       LD F6,-8(R1)
LD F10,-16(R1)    LD F14,-24(R1)
LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2     ADDD F8,F6,F2
LD F26,-48(R1)                      ADDD F12,F10,F2   ADDD F16,F14,F2
                                    ADDD F20,F18,F2   ADDD F24,F22,F2
SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2
SD -16(R1),F12    SD -24(R1),F16
SD -32(R1),F20    SD -40(R1),F24                                        SUBI R1,R1,#56
SD 8(R1),F28                                                            BNEZ R1,Loop