Lecture 7 Pipeline Hazards


Hazards CS510 Computer Architectures Lecture 7 - 1

It's Not That Easy to Achieve the Promised Performance

• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions
– Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: pipelining of branches and other instructions that change the PC

• The common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles”, i.e., idle clock cycles, in the pipeline

Structural Hazards / Memory

[Figure: pipeline diagram of LOAD and Instr 1 to Instr 4 in instruction order against time (clock cycles CC1 to CC9). Two different instructions operate on memory in the same clock cycle.]

Structural Hazards with Single-Port Memory

[Figure: pipeline diagram of LOAD and Instr 1 to Instr 3 in instruction order against time (clock cycles CC1 to CC9). Instr 3 stalls for 3 cycles with 1-port memory.]
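The 3-cycle stall can be reproduced with a small simulation. This is only a sketch under assumptions that are not stated on the slide: every instruction claims the memory port in its IF stage and again in its MEM stage (as the diagram draws it), older instructions have priority, and fetch is in order. The function name `fetch_cycles` is hypothetical.

```python
def fetch_cycles(n_instr, single_port=True):
    """Return the clock cycle (1-based) in which each instruction is fetched.

    Assumed model: 5-stage pipeline, memory used in IF and in MEM
    (three cycles after IF), one access per cycle on a single port.
    """
    busy = set()   # cycles in which the memory port is already claimed
    fetched = []
    next_if = 1    # earliest cycle for the next in-order fetch
    for _ in range(n_instr):
        cycle = next_if
        while single_port and cycle in busy:
            cycle += 1          # stall: an older instruction owns the port
        busy.add(cycle)         # IF access
        busy.add(cycle + 3)     # MEM access of the same instruction
        fetched.append(cycle)
        next_if = cycle + 1
    return fetched


# LOAD, Instr 1, Instr 2, Instr 3
print(fetch_cycles(4))                      # single-port memory
print(fetch_cycles(4, single_port=False))   # separate instruction/data ports
```

Under this model Instr 3 is fetched in CC7 instead of CC4, which is the 3-cycle stall on the slide; with two ports every fetch happens in its own cycle.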

Avoiding Structural Hazard with Dual-Port Memory

[Figure: pipeline diagram of LOAD and Instr 1 to Instr 5 in instruction order against time (clock cycles CC1 to CC9), with separate instruction memory (IM) and data memory (DM) accesses. No stall with 2-port memory.]

Data Hazard on Registers

ADD R1,R2,R3

SUB R4,R1,R3

AND R6,R1,R7

OR R8,R1,R9

XOR R10,R11,R1

[Figure: pipeline diagram of the five instructions against time (clock cycles CC1 to CC9), highlighting register R1, which ADD writes and the following instructions read.]

Data Hazard on Registers

The register file can be made to store and read in the same cycle: data is stored in the first half of the clock cycle, and the same data can be read in the second half of that clock cycle.

[Figure: timing of one clock cycle for register Ri: store into Ri in the first half, read from Ri in the second half.]

Data Hazard on Registers

ADD R1,R2,R3

SUB R4,R1,R3

AND R6,R1,R7

OR R8,R1,R9

XOR R10,R11,R1

[Figure: pipeline diagram of the five instructions against time (clock cycles CC1 to CC9), highlighting register R1. Needs to stall 2 cycles.]
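The 2-cycle figure follows from simple stage arithmetic. The sketch below assumes a 5-stage pipeline without forwarding, where the producer writes its register in WB and the consumer reads registers in ID; `raw_stalls` and its parameters are hypothetical names used only for illustration.

```python
def raw_stalls(distance, split_cycle_regfile=True):
    """Stall cycles for a consumer issued `distance` instructions after the
    producer, assuming NO forwarding.

    Stage offsets relative to the producer's IF: ID=1, EX=2, MEM=3, WB=4.
    """
    producer_wb = 4
    consumer_id = distance + 1
    # With a write-first-half / read-second-half register file the consumer
    # may read in the same cycle as the producer's WB; otherwise one later.
    earliest_read = producer_wb if split_cycle_regfile else producer_wb + 1
    return max(0, earliest_read - consumer_id)


# ADD R1,R2,R3 followed immediately by SUB R4,R1,R3
print(raw_stalls(1))                             # split-cycle register file
print(raw_stalls(1, split_cycle_regfile=False))  # plain register file
```

With the split-cycle register file the instruction right after ADD needs 2 stall cycles, and an instruction three slots later needs none.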

Three Generic Data Hazards

Instr i followed by Instr j

• Read After Write (RAW): Instr j tries to read an operand before Instr i writes it

Instr i: LW R1, 0(R2)
Instr j: SUB R4, R1, R5

Three Generic Data Hazards

Instr i followed by Instr j

• Write After Read (WAR): Instr j tries to write an operand before Instr i reads it

Instr i: ADD R1, R2, R3
Instr j: LW R2, 0(R5)

Can’t happen in the DLX 5-stage pipeline because:

– All instructions take 5 stages,

– Reads are always in stage 2, and

– Writes are always in stage 5

Three Generic Data Hazards

Instr i followed by Instr j

• Write After Write (WAW): Instr j tries to write an operand before Instr i writes it

– Leaves the wrong result (that of Instr i, not Instr j)

Instr i: LW R1, 0(R2)
Instr j: LW R1, 0(R3)

Can’t happen in the DLX 5-stage pipeline because:

– All instructions take 5 stages, and

– Writes are always in stage 5

We will see WAR and WAW in later, more complicated pipelines.
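The three definitions reduce to comparing destination and source registers of the two instructions. The sketch below only reports which hazards are possible for a pair; whether one actually occurs depends on the pipeline, as noted above for the DLX. The `Instr` record and `classify` function are hypothetical helpers, not part of any DLX tool.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read


def classify(i: Instr, j: Instr) -> set:
    """Potential data hazards when instruction i is followed by j."""
    kinds = set()
    if i.dest is not None and i.dest in j.srcs:
        kinds.add("RAW")      # j reads what i writes
    if j.dest is not None and j.dest in i.srcs:
        kinds.add("WAR")      # j writes what i reads
    if i.dest is not None and i.dest == j.dest:
        kinds.add("WAW")      # both write the same register
    return kinds


# The three example pairs from the slides
raw = classify(Instr("LW R1,0(R2)", "R1", ("R2",)),
               Instr("SUB R4,R1,R5", "R4", ("R1", "R5")))
war = classify(Instr("ADD R1,R2,R3", "R1", ("R2", "R3")),
               Instr("LW R2,0(R5)", "R2", ("R5",)))
waw = classify(Instr("LW R1,0(R2)", "R1", ("R2",)),
               Instr("LW R1,0(R3)", "R1", ("R3",)))
print(raw, war, waw)
```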

Forwarding to Avoid Data Hazards

ADD R1,R2,R3

SUB R4,R1,R3

AND R6,R1,R7

OR R8,R1,R9

XOR R10,R11,R1

[Figure: pipeline diagram of the five instructions against time (clock cycles CC1 to CC9), showing the forwarding of R1 to the instructions that read it.]

HW Change for Forwarding

[Figure: datapath fragment with the D/A, A/M, and M/W pipeline buffers, multiplexers (MUX), the ALU, the Zero? test, and Data Memory.]

Load Delay Due to Data Hazard

LOAD R1,0(R2)

SUB R4,R1,R6

AND R6,R1,R7

OR R8,R1,R9

[Figure: pipeline diagram of the four instructions against time (clock cycles). Load delay = 2 cycles.]

Load Delay with Forwarding

LOAD R1,0(R2)

SUB R4,R1,R6

AND R6,R1,R7

OR R8,R1,R9

[Figure: pipeline diagram of the four instructions against time (clock cycles). Load delay with forwarding = 1 cycle.]

We need to add HW, called a pipeline interlock.
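Why forwarding removes ALU stalls but still leaves one cycle for a load can be seen from when a value is produced versus when it is needed. The following is a sketch under assumed conventions: stages are numbered IF=0, ID=1, EX=2, MEM=3, WB=4, an ALU result exists at the end of EX, a loaded value at the end of MEM, and the consumer needs it at the start of its own EX. The names are hypothetical.

```python
# Stage (relative to the producer's IF) at whose END the value exists.
PRODUCED_IN = {"alu": 2, "load": 3}


def stalls_with_forwarding(producer_kind, distance):
    """Stall cycles for a consumer `distance` instructions after the producer,
    assuming full forwarding to the ALU inputs."""
    produced = PRODUCED_IN[producer_kind]
    needed = distance + 2          # consumer's EX stage, same time base
    # The value must be complete strictly before the consumer's EX begins.
    return max(0, produced + 1 - needed)


print(stalls_with_forwarding("alu", 1))    # ADD then SUB
print(stalls_with_forwarding("load", 1))   # LOAD then SUB
print(stalls_with_forwarding("load", 2))   # LOAD, one other instruction, AND
```

A pipeline interlock is the hardware that detects the remaining load-use case and inserts the one stall cycle.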

Software Scheduling to Avoid Load Hazards

Try to produce fast code for

a = b + c;

d = e - f;

assuming a, b, c, d, e, and f are in memory.

Fast code:

LW Rb,b

LW Rc,c

LW Re,e

ADD Ra,Rb,Rc

LW Rf,f

SW a,Ra

SUB Rd,Re,Rf

SW d,Rd

Slow code(with forwarding):

LW Rb,b

LW Rc,c

ADD Ra,Rb,Rc

SW a,Ra

LW Re,e

LW Rf,f

SUB Rd,Re,Rf

SW d,Rd

[Slide annotations: RAW dependences and the resulting stalls are marked on the code.]
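The difference between the two schedules can be checked by counting load-use pairs. The sketch below assumes forwarding with a 1-cycle load delay, so a stall occurs only when an instruction reads the register loaded by the LW immediately before it; the tuple encoding `(op, dest, srcs)` and the function name are hypothetical.

```python
def load_use_stalls(program):
    """Count stalls, assuming one stall whenever an instruction reads the
    destination of the LW directly in front of it (forwarding elsewhere)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]

print(load_use_stalls(SLOW), load_use_stalls(FAST))
```

Under this model the slow code stalls at ADD and at SUB, while the fast code keeps at least one instruction between every load and its first use.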

Compiler Avoiding Load Stalls

% loads stalling pipeline

Benchmark   scheduled   unscheduled
tex         25%         65%
spice       14%         42%
gcc         31%         54%

Pipelined DLX Datapath

[Figure: pipelined DLX datapath with the IF, ID, EX, Mem, and WB stages separated by the F/D, D/A, A/M, and M/W buffers. Components shown: PC, +4 adder, instruction memory, register file, sign extension (16 to 32 bits), multiplexers (MUX), ALU, Zero? test, data memory, SMD, and LMD.]

• Branch address calculation
• Decide condition
• Branch decision for target address

Control Hazard on Branches: Three Stall Cycles

Program execution order in instructions:

40 BEQ R1,R3, 36

44 AND R12,R2, R5

48 OR R13,R6, R2

52 ADD R14,R2, R2

80 LD R4,R7, 100

[Figure: pipeline diagram of the instructions against time (clock cycles CC1 to CC9). The instructions following the branch shouldn't be executed when the branch condition is true. Branch delay = 3 cycles, until the branch target is available.]

Control Hazard on Branches: Three Stall Cycles

Branch instruction     IF ID EX MEM WB
Branch successor       IF ID EX MEM
Branch successor + 1   IF ID EX
Branch successor + 2   IF ID

Annotations, in time order:

• We don’t know yet that the instruction being executed is a branch. Fetch the branch successor.
• Now we know the instruction being executed is a branch, but stall until the branch target address is known.
• Now the target address is available.

3 wasted clock cycles for the TAKEN branch

Branch Stall Impact

• If CPI = 1, 30% branch, stall 3 cycles => new CPI = 1.9

– About half of the ideal speed

• Two-part solution:

– Determine whether the branch is TAKEN or NOT TAKEN sooner, AND

– Compute the TAKEN branch address (branch target) earlier

• DLX branch tests if register = 0 or ≠ 0

• DLX solution: get the new PC earlier

– Move the Zero test to the ID stage

– Additional ADDER to calculate the new PC (taken PC) in the ID stage

– 1 clock cycle penalty for a branch, in contrast to 3 cycles
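The CPI figures above come from adding the average stall cycles per instruction to the ideal CPI. A minimal sketch of that arithmetic, with a hypothetical function name and the slide's numbers as inputs:

```python
def cpi_with_branch_stalls(base_cpi, branch_freq, penalty_cycles):
    """Effective CPI when every branch costs `penalty_cycles` stall cycles."""
    return base_cpi + branch_freq * penalty_cycles


three_cycle = cpi_with_branch_stalls(1.0, 0.30, 3)   # original pipeline
one_cycle = cpi_with_branch_stalls(1.0, 0.30, 1)     # zero test and adder in ID

print(round(three_cycle, 2), round(one_cycle, 2))
print(round(1.0 / three_cycle, 2))   # fraction of the ideal speed
```

With the same 30% branch frequency, cutting the penalty from 3 cycles to 1 lowers the CPI from 1.9 to 1.3.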

Pipelined DLX Datapath

[Figure: revised pipelined DLX datapath (IF, ID, EX, Mem, and WB stages; F/D, D/A, A/M, and M/W buffers) with an additional Add unit to get the target address earlier and the Zero? test moved to get the condition earlier. The target address is available after ID. When a branch instruction is in the Execute stage, the next address is available at the point marked in the figure.]

Branch Behavior in Programs

• Conditional branch frequencies
– integer average: 14 to 16%
– floating point: 3 to 12%

• Forward and backward taken branches
– forward taken: 60%
– backward taken: 85%
– the average of all conditional branches: 67%

4 Branch Hazard Alternatives

• Stall until branch direction is clear
• Predict branch NOT TAKEN
• Predict branch TAKEN
• Delayed branch

4 Branch Hazard Alternatives: (1) STALL

Stall until branch direction is clear

Original pipeline: 3 cycle penalty

Branch instruction     IF ID EX MEM WB
Branch successor       stall stall stall IF ID EX MEM
Branch successor + 1   IF ID EX
Branch successor + 2   IF ID

Revised DLX pipeline (get the branch address at EX): 1 cycle penalty (branch delay slot)

Branch instruction     IF ID EX MEM WB
Branch successor       stall IF ID EX MEM WB
Branch successor + 1   IF ID EX MEM
Branch successor + 2   IF ID

4 Branch Hazard Alternatives: (2) Predict Branch “NOT TAKEN”

• Execute successor instructions in sequence
• PC+4 is already calculated, so use it to get the next instruction
• Flush instructions in the pipeline if the branch is actually TAKEN
• Advantage of late pipeline state update
• 47% of DLX branches are NOT TAKEN on average

NOT TAKEN branch: no penalty

instruction i     IF ID EX MEM WB
instruction i+1   IF ID EX MEM WB
instruction i+2   IF ID EX MEM WB

TAKEN branch: 1 cycle penalty

instruction i     IF ID EX MEM WB
instruction i+1   IF ID EX MEM WB   (flush this instruction in progress)
instruction T     IF ID EX MEM WB

4 Branch Hazard Alternatives: (3) Predict Branch “TAKEN”

– 53% of DLX branches are TAKEN on average

– Branch target address is available after ID in DLX

• DLX still incurs a 1 cycle branch penalty for a TAKEN branch

• Other machines: branch target known before outcome

NOT TAKEN branch: 2 cycle penalty in DLX (1 in other machines)

instruction i     IF ID EX MEM WB
instruction T     stall IF
instruction i+1   IF ID EX MEM WB

TAKEN branch: 1 cycle penalty in DLX (0 in other machines)

instruction i     IF ID EX MEM WB
instruction T     stall IF ID EX MEM WB
instruction T+1   IF ID EX MEM WB

[Figure annotations: "TAKEN address not available at this time" and "TAKEN address available".]

4 Branch Hazard Alternatives: (4) Delayed Branch

Delayed branch

– Delay the branch to take place AFTER a successor instruction

branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n
branch target if taken

(sequential successors 1 to n: delayed branch of length n)

– A 1-slot delayed branch allows the proper decision and branch target address in the 5-stage DLX pipeline with the control hazard improvement

Delayed Branch

• Where to get instructions to fill the branch delay slot?

– From before the branch instruction

– From the target address: only valuable when branch TAKEN

– From fall through: only valuable when branch NOT TAKEN

– Canceling branches allow more slots to be filled

• Compiler effectiveness for a single delayed branch slot:

– Fills about 60% of delayed branch slots

– About 80% of instructions executed in delayed branch slots are useful in computation

– About 50% (60% x 80%) of slots usefully filled

4 Branch Hazard Alternatives: Delayed Branch

From before

Before scheduling:
ADD R1, R2, R3
if R2=0 then
[delay slot]

After scheduling:
if R2=0 then
ADD R1, R2, R3   (in the delay slot)

- Always improves performance
- Branch must not depend on the rescheduled instructions

From target

Before scheduling:
SUB R4, R5, R6   (branch target)
ADD R1, R2, R3
if R1=0 then
[delay slot]

After scheduling:
ADD R1, R2, R3
if R1=0 then
SUB R4, R5, R6   (in the delay slot)

- Improves performance when TAKEN (loop)
- Must be alright to execute the rescheduled instructions if NOT TAKEN
- May need to duplicate the instruction if it is the target of another branch instruction

From fall through

Before scheduling:
ADD R1, R2, R3
if R1=0 then
[delay slot]
SUB R4, R5, R6

After scheduling:
ADD R1, R2, R3
if R1=0 then
SUB R4, R5, R6   (in the delay slot)

- Improves performance when NOT TAKEN
- Must be alright to execute the instructions if TAKEN

Limitations on Delayed Branch

• Difficulty in finding useful instructions to fill the delayed branch slots

• Solution: squashing

– Delayed branch associated with a branch prediction

– Instructions in the predicted path are executed in the delayed branch slot

– If the branch outcome is mispredicted, instructions in the delayed branch slot are squashed (discarded)

Canceling Branch

• Used when delayed branch scheduling, i.e., filling the delay slot, cannot be done due to
– Restrictions on scheduling instructions into the delay slots
– Limitations on the ability to predict at compile time whether the branch will be TAKEN or NOT TAKEN

• The instruction includes the direction in which the branch was predicted
– When the branch behaves as predicted, the instructions in the delay slot are executed
– When the branch is incorrectly predicted, the instructions in the delay slot are turned into no-ops

• A canceling branch allows the delay slot to be filled even if the instruction placed there does not meet the requirements
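The rule for the delay slot of a canceling branch can be written as a single comparison. This is an illustrative sketch only; the function name and the string results are hypothetical and do not model any particular instruction encoding.

```python
def delay_slot_action(predicted_taken, actually_taken):
    """What happens to the delay-slot instruction of a canceling branch."""
    if predicted_taken == actually_taken:
        return "execute"   # branch behaved as predicted
    return "no-op"         # mispredicted: the slot instruction is canceled


# Slot filled from the target path, so the compiler predicted TAKEN.
print(delay_slot_action(predicted_taken=True, actually_taken=True))
print(delay_slot_action(predicted_taken=True, actually_taken=False))
```

Because a mispredicted slot instruction never takes effect, the compiler may place an instruction there that would be unsafe to execute on the other path.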

Evaluating Branch Alternatives

Pipeline speedup = Pipeline depth / CPI = Pipeline depth / (1 + Branch frequency x Branch penalty)

Conditional and unconditional branches collectively have 14% frequency; 65% of branches are TAKEN

Scheduling scheme   Branch penalty   CPI                Speedup vs. unpipelined   Speedup vs. stall
Stall pipeline      3                1+0.14x3=1.42      5/1.42=3.5                1.0
Predict Taken       1                1+0.14x1=1.14      5/1.14=4.4                1.26
Predict Not Taken   1                1+0.14x0.65=1.09   5/1.09=4.5                1.29
Delayed branch      0.5              1+0.14x0.5=1.07    5/1.07=4.6                1.31
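The CPI column can be recomputed directly from the speedup formula. The sketch below uses the slide's inputs (pipeline depth 5, 14% branch frequency) and, as an interpretation, treats the Predict Not Taken penalty as 1 cycle x 65% taken = 0.65 cycles on average, which is how the CPI entry is written. Names are hypothetical, and the computed speedups may differ from the slide in the last digit depending on rounding.

```python
PIPELINE_DEPTH = 5
BRANCH_FREQ = 0.14

# Average penalty in cycles per branch for each scheme.
AVG_PENALTY = {
    "Stall pipeline": 3.0,
    "Predict Taken": 1.0,
    "Predict Not Taken": 1.0 * 0.65,   # pays 1 cycle only on TAKEN branches
    "Delayed branch": 0.5,
}


def cpi(penalty):
    return 1.0 + BRANCH_FREQ * penalty


def speedup_vs_unpipelined(penalty):
    return PIPELINE_DEPTH / cpi(penalty)


stall_cpi = cpi(AVG_PENALTY["Stall pipeline"])
for scheme, penalty in AVG_PENALTY.items():
    print(f"{scheme:18s} CPI={cpi(penalty):.2f} "
          f"speedup={speedup_vs_unpipelined(penalty):.2f} "
          f"vs stall={stall_cpi / cpi(penalty):.2f}")
```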

Static (Compiler) Prediction of Taken/Untaken Branches

Code motion

LW R1, 0(R2)

SUB R1, R1, R3   (depends on LW, needs to stall)

BEQZ R1, L

OR R4, R5, R6

ADD R10,R4,R3

L: ADD R7, R8, R9

• If the branch is almost always NOT TAKEN, R4 is not needed on the taken path, and R5 and R6 are not modified in the following instruction(s), moving OR R4, R5, R6 up to follow the LW can increase speed

• If the branch is almost always TAKEN, R7 is not needed on the fall-through path, and R8 and R9 are not modified on the fall-through path, moving ADD R7, R8, R9 up to follow the LW can increase speed

Static (Compiler) Prediction of Taken/Untaken Branches

• Improves strategy for placing instructions in delay slot

• Two strategies

– Direction-based Prediction: TAKEN backward branch, NOT TAKEN forward branch

– Profile-based prediction: Record branch behaviors, predict branch based on the prior run(s)
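Direction-based prediction needs only the branch address and its target. The sketch below applies the rule to a short trace and measures the misprediction rate; the trace values and function names are made up for illustration and are not measurements from any benchmark.

```python
def predict_taken(branch_pc, target_pc):
    """Direction-based rule: backward branches TAKEN, forward NOT TAKEN."""
    return target_pc < branch_pc


def misprediction_rate(trace):
    """trace: list of (branch_pc, target_pc, actually_taken) tuples."""
    wrong = sum(1 for pc, target, taken in trace
                if predict_taken(pc, target) != taken)
    return wrong / len(trace)


# Hypothetical trace: a loop-closing backward branch and a forward branch.
TRACE = [
    (100, 80, True), (100, 80, True), (100, 80, False),   # loop exits once
    (120, 160, False), (120, 160, True),
]
print(misprediction_rate(TRACE))
```

A profile-based predictor would instead record each branch's behavior in prior runs and pick the more frequent direction per branch.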

[Figure: two bar charts over the benchmarks alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, and tomcatv. One plots frequency of misprediction (0% to 70%), the other misprediction rate (0% to 14%). Legend entries: "Always taken" and "Taken backwards, Not Taken forwards".]

Evaluating Static Branch Prediction Strategies

• Misprediction rate ignores frequency of branch

• Instructions between mispredicted branches is a better metric
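The two metrics can disagree, which is the point of the bullets above. In this sketch two hypothetical programs have the same misprediction rate but different branch frequencies; all counts are invented for illustration.

```python
def misprediction_rate(branches, mispredicted):
    return mispredicted / branches


def instructions_per_misprediction(instructions, mispredicted):
    return instructions / mispredicted


# Hypothetical counts: (instructions, branches, mispredicted branches)
branchy = (1_000_000, 200_000, 20_000)
sparse = (1_000_000, 50_000, 5_000)

for name, (instrs, branches, missed) in (("branchy", branchy), ("sparse", sparse)):
    print(name,
          misprediction_rate(branches, missed),
          instructions_per_misprediction(instrs, missed))
```

Both programs mispredict 10% of their branches, yet the one with fewer branches runs four times as many instructions between mispredictions.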

[Figure: bar chart of instructions per mispredicted branch (log scale, 1 to 100000) for the benchmarks alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, and tomcatv, comparing profile-based and direction-based prediction.]

End of Hazards and their Resolution Point

Thank you