1 Stalling The easiest solution is to stall the pipeline We could delay the AND instruction by...

24
1 Stalling The easiest solution is to stall the pipeline We could delay the AND instruction by introducing a one- cycle delay into the pipeline, sometimes called a bubble Notice that we’re still using forwarding in cycle 5, to get data from the MEM/WB pipeline register to the ALU DM Reg Reg IM DM Reg Reg IM lw $2, 20($3) and $12, $2, $5 Clock cycle 1 2 3 4 5 6 7
  • date post

    20-Dec-2015
  • Category

    Documents

  • view

    212
  • download

    0

Transcript of 1 Stalling The easiest solution is to stall the pipeline We could delay the AND instruction by...

Page 1: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

1

Stalling

The easiest solution is to stall the pipeline

We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes called a bubble

Notice that we’re still using forwarding in cycle 5, to get data from the MEM/WB pipeline register to the ALU

DM Reg RegIM

DM Reg RegIM

lw $2, 20($3)

and $12, $2, $5

Clock cycle1 2 3 4 5 6 7

Page 2: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

2

Stalling and forwarding

Without forwarding, we’d have to stall for two cycles to wait for the LW instruction’s writeback stage

In general, you can always stall to avoid hazards—but dependencies are very common in real code, and stalling often can reduce performance by a significant amount

DM Reg RegIM

DM Reg RegIM

lw $2, 20($3)

and $12, $2, $5

Clock cycle1 2 3 4 5 6 7 8

Page 3: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Load-Use Hazard Detection

• Check when using instruction is decoded in ID stage• ALU operand register numbers in ID stage are given by

• IF/ID.RegisterRs, IF/ID.RegisterRt• Load-use hazard when

• ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))

• If detected, stall and insert bubble

Page 4: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

How to Stall the Pipeline

• Force control values in ID/EX registerto 0• EX, MEM and WB do nop (no-operation)

• Prevent update of PC and IF/ID register• Using instruction is decoded again• Following instruction is fetched again• 1-cycle stall allows MEM to read data for lw

•Can subsequently forward to EX stage

Page 5: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

5

Stalling delays the entire pipeline If we delay the second instruction, we’ll have to delay the third

one too—This is necessary to make forwarding work between AND and

OR—It also prevents problems such as two instructions trying to

write to the same register in the same cycle

DM Reg RegIM

DM Reg RegIM

DMReg RegIM

lw $2, 20($3)

and $12, $2, $5

or $13, $12, $2

Clock cycle1 2 3 4 5 6 7 8

Page 6: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

6

But what about the ALU during cycle 4, the data memory in cycle 5, and the register file write in cycle 6?

Those units aren’t used in those cycles because of the stall, so we can set the EX, MEM and WB control signals to all 0s.

Reg

What about EX, MEM, WB

DM Reg RegIM

RegIM

IM

lw $2, 20($3)

and $12, $2, $5

or $13, $12, $2 DMReg RegIM

DM Reg

Clock cycle1 2 3 4 5 6 7 8

Page 7: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

7

Detecting Stalls, cont. When should stalls be detected?

EX stage (of the instruction causing the stall)

Reg

DM Reg RegIM

RegIM

lw $2, 20($3)

and $12, $2, $5 DM Reg

id/e

x

if/id

ex/m

em

mem

\wb

id/e

x

if/id

ex/m

em

mem

\wb

if/id

What is the stall condition?

if (ID/EX.MemRead = 1 and (ID/EX.rt = IF/ID.rs or ID/EX.rt = IF/ID.rt))

then stall

Page 8: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

8

Adding hazard detection to the CPU

IF/I

D W

rite

Rs

0 1

Addr

Instructionmemory

Instr

Address

Write data

Datamemory

Readdata

1

0

PC

Extend

ALUSrcResult

ZeroALU

Instr [15 - 0]RegDst

Readregister 1

Readregister 2

Writeregister

Writedata

Readdata 2

Readdata 1

Registers

Rd

Rt0

1

IF/ID

ID/EX

EX/MEM

MEM/WB

EX

M

WB

Control M

WB

WB

Rs

0

1

2

0

1

2

ForwardingUnit

EX/MEM.RegisterRd

MEM/WB.RegisterRd

HazardUnit

0

1

0

ID/EX.MemRead

PC

Writ

e

Rt

ID/EX.RegisterRt

Page 9: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Stalls and Performance

Stalls reduce performance—But are required to get correct results

Compiler can arrange code to avoid hazards and stalls—Requires knowledge of the pipeline structure

Page 10: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Code Scheduling to Avoid Stalls

Reorder code to avoid use of load result in the next instruction

Ex: c code for A = B + E; C = B + F;

lw $t1, 0($t0)lw $t2, 4($t0)add $t3, $t1, $t2sw $t3, 12($t0)lw $t4, 8($t0)add $t5, $t1, $t4sw $t5, 16($t0)

stall

stall

lw $t1, 0($t0)lw $t2, 4($t0)lw $t4, 8($t0)add $t3, $t1, $t2sw $t3, 12($t0)add $t5, $t1, $t4sw $t5, 16($t0)

11 cycles13 cycles

Page 11: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

11

Branches in the original pipelined datapath

Readaddress

Instructionmemory

Instruction[31-0]

Address

Writedata

Data memory

Readdata

MemWrite

MemRead

1

0

MemToReg

4

Shiftleft 2

PC

Add

1

0

PCSrc

Signextend

ALUSrc

Result

ZeroALU

ALUOp

Instr [15 - 0]RegDst

Readregister 1

Readregister 2

Writeregister

Writedata

Readdata 2

Readdata 1

Registers

RegWrite

Add

Instr [15 - 11]

Instr [20 - 16]0

1

0

1

IF/ID

ID/EX

EX/MEM

MEM/WB

EX

M

WB

Control

M

WB

WB

When are they resolved?

Page 12: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Branch Hazards

If branch outcome determined in MEM:

PC

Flush theseinstructions(Set controlvalues to 0)

Page 13: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Reducing Branch Delay

Move hardware to determine outcome to ID stage—Target address adder—Register comparator

Example: branch taken36: sub $10, $4, $840: beq $1, $3, 744: and $12, $2, $548: or $13, $2, $652: add $14, $4, $256: slt $15, $6, $7 ...72: lw $4, 50($7)

Page 14: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Example: Branch Taken

Page 15: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Example: Branch Taken

Page 16: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Data Hazards for Branches

If a comparison register is a destination of 2nd or 3rd preceding ALU instruction

IF ID EX MEM WB

IF ID EX MEM WB

IF ID EX MEM WB

IF ID EX MEM WB

add $4, $5, $6

add $1, $2, $3

beq $1, $4, target

Can resolve using forwarding

Page 17: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Data Hazards for Branches

If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction

Need 1 stall cycle

beq stalled

IF ID EX MEM WB

IF ID EX MEM WB

IF ID

ID EX MEM WB

add $4, $5, $6

lw $1, addr

beq $1, $4, target

Page 18: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Data Hazards for Branches

If a comparison register is a destination of immediately preceding load instruction—Need 2 stall cycles

beq stalled

IF ID EX MEM WB

IF ID

ID

ID EX MEM WB

beq stalled

lw $1, addr

beq $1, $0, target

Page 19: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Branch Prediction

• Longer pipelines can’t readily determine branch outcome early• Stall penalty becomes unacceptable

• Predict (i.e., guess) outcome of branch• Only stall if prediction is wrong

• Simplest prediction strategy• predict branches not taken • Works well for loops if the loop tests are done

at the start.• Fetch instruction after branch, with no delay

Page 20: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Dynamic Branch Prediction

In deeper and superscalar pipelines, branch penalty is more significant

Use dynamic prediction Branch prediction buffer (aka branch history table) Indexed by recent branch instruction addresses Stores outcome (taken/not taken) To execute a branch

Check table, expect the same outcome Start fetching from fall-through or target If wrong, flush pipeline and flip prediction

Page 21: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

1-Bit Predictor: Shortcoming

Inner loop branches mispredicted twice!

outer: … …inner: … … beq …, …, inner … beq …, …, outer

Mispredict as taken on last iteration of inner loop Then mispredict as not taken on first iteration of inner

loop next time around

Page 22: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

2-Bit Predictor

Only change prediction on two successive mispredictions

Page 23: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Calculating the Branch Target

Even with predictor, still need to calculate the target address 1-cycle penalty for a taken branch

Branch target buffer Cache of target addresses Indexed by PC when instruction fetched

If hit and instruction is branch predicted taken, can fetch target immediately

Page 24: 1 Stalling  The easiest solution is to stall the pipeline  We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes.

Concluding Remarks

ISA influences design of datapath and control Datapath and control influence design of ISA Pipelining improves instruction throughput

using parallelism More instructions completed per second Latency for each instruction not reduced

Hazards: structural, data, control Main additions in hardware:

forwarding unit hazard detection and stalling branch predictor branch target table