Pipelining Enhancing Performance with C D Pipeliningwatis/courses/178322/lecture8.pdf ·...

Chapter 6Enhancing Performance with

Pipelining

Pipelining• Think of using machines in laundry services

Time76 PM 8 9 10 11 12 1 2 AM

A

B

C

D

Time76 PM 8 9 10 11 12 1 2 AM

A

B

C

D

Task order

Task order

Assume 30 min. each task – wash, dry, fold, store – and that separate tasks use separate hardware and so can be overlapped

Pipelined

Not pipelined

Pipelined vs. Single-Cycle Instruction Execution: the Plan

Instructionfetch

Reg ALU Dataaccess

Reg

8 nsInstruction

fetch Reg ALU Dataaccess Reg

8 nsInstruction

fetch

8 ns

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

...

Programexecutionorder(in instructions)

Instructionfetch

Reg ALUData

accessReg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

fetchReg ALU

Dataaccess

Reg

2 nsInstruction

fetch Reg ALUData

access Reg

2 ns 2 ns 2 ns 2 ns 2 ns


Single-cycle

Pipelined

Assume 2 ns for memory access, ALU operation; 1 ns for register access:therefore, single cycle clock 8 ns; pipelined clock cycle 2 ns.

Pipelining: Keep in Mind

• Pipelining does not reduce latency of a single task, it increases throughput of entire workload

• Pipeline rate limited by longest stage– potential speedup = number pipe stages

– unbalanced lengths of pipe stages reduces speedup

• Time to fill pipeline and time to drain it – when there is slack in the pipeline – reduces speedup

Pipelining MIPS

• What makes it hard?– structural hazards: different instructions, at different stages, in

the pipeline want to use the same hardware resource

– control hazards: succeeding instruction, to put into pipeline, depends on the outcome of a previous branch instruction, already in pipeline

– data hazards: an instruction in the pipeline requires data to be computed by a previous instruction still in the pipeline

• Before actually building the pipelined datapath and control we first briefly examine these potential hazards individually…

• Structural hazard: inadequate hardware to simultaneously support all instructions in the pipeline in the same clock cycle

• E.g., suppose single – not separate – instruction and data memory in pipeline below with one read port– then a structural hazard between first and fourth lw instructions

• MIPS was designed to be pipelined: structural hazards are easy to avoid!

Structural Hazards

2 4 6 8 10 12 14

Instructionfetch

Reg ALUData

accessReg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

fetch Reg ALUData

access Reg

2 nsInstruction

fetchReg ALU Data

accessReg

2 ns 2 ns 2 ns 2 ns 2 ns


Pipelined

Instructionfetch Reg ALU

Dataaccess Reg

2 nslw $4, 400($0)

Hazard if single memory

Control Hazards

• Control hazard: need to make a decision based on the result of a previous instruction still executing in pipeline

• Solution 1 Stall the pipeline

Instructionfetch

Reg ALU Dataaccess

Reg

Time

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)

4 ns

Instructionfetch

Reg ALU Dataaccess

Reg2ns

Instructionfetch

Reg ALU Dataaccess

Reg

2ns

2 4 6 8 10 12 14 16Programexecutionorder(in instructions)

Pipeline stall

bubble

Note that branch outcome iscomputed in ID stage withadded hardware (later…)

Control Hazards• Solution 2 Predict branch outcome

– e.g., predict branch-not-taken :

Instruction fetch

Reg ALUData

accessReg

Time

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)

Instruction fetch

Reg ALUData

accessReg

2 ns

Instruction fetch

Reg ALUData

accessReg

2 ns

Program execution order (in instructions)

Instruction fetch

Reg ALUData

accessReg

Time

beq $1, $2, 40

add $4, $5 ,$6

or $7, $8, $9

Instruction fetch

Reg ALUData

accessReg

2 4 6 8 10 12 14

2 4 6 8 10 12 14

Instruction fetch

Reg ALUData

accessReg

2 ns

4 ns

bubble bubble bubble bubble bubble


Prediction success

Prediction failure: undo (=flush) lw

Control Hazards

• Solution 3 Delayed branch: always execute the sequentially next statement with the branch executing after one instruction delay – compiler’s job to find a statement that can be put in the slot that is independent of branch outcome– MIPS does this – but it is an option in SPIM (Simulator -> Settings)


Dataaccess Reg

Time

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)


Dataaccess Reg

2 ns


Dataaccess Reg

2 ns

2 4 6 8 10 12 14

2 ns

(d elayed branch slot)


Delayed branch beq is followed by add that isindependent of branch outcome

Data Hazards

• Data hazard: instruction needs data from the result of a previous instruction still executing in pipeline

• Solution Forward data if possible…

Time2 4 6 8 10

add $s0, $t0, $t1

IF ID WBEX MEM

add $s0, $t0, $t1

sub $t2, $s0, $t3


IF ID WBEX

IF ID MEMEX

Time2 4 6 8 10

MEM

WBMEM

Instruction pipeline diagram:shade indicates use –left=write, right=read

Without forwarding – blue line –data has to go back in time;with forwarding – red line –data is available in time

Data Hazards

• Forwarding may not be enough– e.g., if an R-type instruction following a load uses the result of the

load – called load-use data hazardTime

2 4 6 8 10 12 14

lw $s0, 20($t1)

sub $t2, $s0, $t3


IF ID WBMEMEX

IF ID WBMEMEX

Time2 4 6 8 10 12 14

lw $s0, 20($t1)

sub $t2, $s0, $t3


IF ID WBMEMEX

IF ID WBMEMEX

bubble bubble bubble bubble bubble

With a one-stage stall, forwardingcan get the data to the sub instruction in time

Without a stall it is impossibleto provide input to the subinstruction in time

Reordering Code to Avoid Pipeline Stall (Software Solution)

• Example:lw $t0, 0($t1)

lw $t2, 4($t1)

sw $t2, 0($t1)

sw $t0, 4($t1)

• Reordered code:lw $t0, 0($t1)

lw $t2, 4($t1)

sw $t0, 4($t1)

sw $t2, 0($t1)

Data hazard

Interchanged

Pipelined Datapath

• We now move to actually building a pipelined datapath• First recall the 5 steps in instruction execution

1. Instruction Fetch & PC Increment (IF)

2. Instruction Decode and Register Read (ID)

3. Execution or calculate address (EX)4. Memory access (MEM)

5. Write result into register (WB)

• Review: single-cycle processor– all 5 steps done in a single clock cycle

– dedicated hardware required for each step

• What happens if we break the execution into multiple cycles, butkeep the extra hardware?

Review - Single-Cycle Datapath “Steps”

5 516

RD1

RD2

RN1 RN2 WN

WD

Register File ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

5

Instruction I

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

32

IFInstruction Fetch

IDInstruction Decode

EXExecute/ Address Calc.

MEMMemory Access

WBWrite Back

Zero

Pipelined Datapath – Key Idea

• What happens if we break the execution into multiple cycles, but keep the extra hardware?– Answer: We may be able to start executing a new instruction at each clock

cycle - pipelining

• …but we shall need extra registers to hold data between cycles –pipeline registers

Pipelined Datapath

IF/ID

Pipeline registers

5 516

RD1

RD2

RN1 RN2 WN

WD

Register File ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

5

Instruction I

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

32

ID/EX EX/MEM MEM/WB

Zero

64 bits

97 bits 64 bits

128 bits

wide enough to hold data coming in

Pipelined Datapath

IF/ID

Pipeline registers

5 516

RD1

RD2

RN1 RN2 WN

WD

Register File ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

5

Instruction I

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

32

ID/EX EX/MEM MEM/WB

Zero

64 bits

97 bits 64 bits

128 bits

wide enough to hold data coming in

Only data flowing right to left may cause hazard…, why?

Bug in the Datapath

5 516

RD1

RD2

RN1 RN2 WN

WD

Register File ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

5

Instruction I

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

32

Write register number comes from another later instruction!

IF/ID ID/EX EX/MEM MEM/WB

Corrected Datapath

5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5

EX/MEM MEM/WB

Zero

ID/EXIF/ID

64 bits 133 bits102 bits 69 bits

Destination register number is also passed through ID/EX, EX/MEMand MEM/WB registers, which are now wider by 5 bits

Pipelined Example

• Consider the following instruction sequence:

lw $t0, 10($t1)

sw $t3, 20($t4)

add $t5, $t6, $t7

sub $t8, $t9, $t10

Single-Clock-Cycle Diagram:Clock Cycle 1

LW

5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero

5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero

Single-Clock-Cycle Diagram: Clock Cycle 2

LWSW

5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero


LWSWADD

5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero


LWSWADDSUB


5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero

LWSWADDSUB


5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero

SWADDSUB


5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero

ADDSUB


5

RD1

RD2

RN1

RN2

WN

WD

RegisterFile

ALU

EXTND

16 32

RD

WD

DataMemory

ADDR

32

MUX

<<2

RD

InstructionMemory

ADDR

PC

4

ADD

ADD

MUX

5

5

5


Zero

SUB

Alternative View –Multiple-Clock-Cycle Diagram

IM REG ALU DM REGlw $t0, 10($t1)

sw $t3, 20($t4)

add $t5, $t6, $t7

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7

IM REG ALU DM REG

IM REG ALU DM REG

sub $t8, $t9, $t10 IM REG ALU DM REG

CC 8

Time axis

Notes

• One significant difference in the execution of an R-type instruction between multicycle and pipelined implementations:– register write-back for the R-type instruction is the 5th (the last

write-back) pipeline stage vs. the 4th stage for the multicycleimplementation. Why?

– think of structural hazards when writing to the register file…• Worth repeating: the essential difference between the pipeline

and multicycle implementations is the insertion of pipeline registers to decouple the 5 stages

• The CPI of an ideal pipeline (no stalls) is 1. Why?• The RaVi Architecture Visualization Project of Dortmund U.

has pipeline simulations – see link in our Additional Resources page

• As we develop control for the pipeline keep in mind that the text does not consider jump – should not be too hard to implement!

Recall Single-Cycle Control –the Datapath

PC

Instruction memory

Read address

Instruction [31– 0]

Instruction [20 16]

Instruction [25 21]

Add

Instruction [5 0]

MemtoReg

ALUOp

MemWrite

RegWrite

MemRead

BranchRegDst

ALUSrc

Instruction [31 26]

4

16 32Instruction [15 0]

0

0M u x

0

1

Control

Add ALU result

M u x

0

1

RegistersWrite register

Write data

Read data 1

Read data 2

Read register 1

Read register 2

Sign extend

M u x

1

ALU result

Zero

PCSrc

Data memory

Write data

Read data

M u x

1

Instruction [15 11]

ALU control

Shift left 2

ALUAddress

Instruction AluOp Instruction Funct Field Desired ALU controlopcode operation ALU action inputLW 00 load word xxxxxx add 0 10

SW 00 store word xxxxxx add 0 10

Branch eq 01 branch eq xxxxxx subtract 110

R-type 10 add 100000 add 010

R-type 10 subtract 100010 subtract 110

R-type 10 AND 100100 and 000

R-type 10 OR 100101 or 001

R-type 10 set on less 101010 set on l ess 111

Recall Single-Cycle – ALU Control

Truth table for ALU control bits

ALUOp Funct field OperationALUOp1 ALUOp0 F5 F4 F3 F2 F1 F0

0 0 X X X X X X 0100 1 X X X X X X 1101 X X X 0 0 0 0 0101 X X X 0 0 1 0 1101 X X X 0 1 0 0 0001 X X X 0 1 0 1 0011 X X X 1 0 1 0 111

Recall Single-Cycle – Control Signals

Signal Name Effect when deasserted Effect when asserted

RegDst The register destination number for the The register destination number for theWrite register comes from the rt field (bits 20-16) Write register comes from the rd field (bits 15-11)

RegWrite None The register on the Write register input is written with the value on the Write data input

AlLUSrc The second ALU operand comes from the The second ALU operand is the sign-extended, second register file output (Read data 2) lower 16 bits of the instruction

PCSrc The PC is replaced by the output of the adder The PC is replaced by the output of the adderthat computes the value of PC + 4 that computes the branch target

MemRead None Data memory contents designated by the addressinput are put on the first Read data output

MemWrite None Data memory contents designated by the addressinput are replaced by the value of the Write data input

MemtoReg The value fed to the register Write data input The value fed to the register Write data inputcomes from the ALU comes from the data memory

Instruction RegDst ALUSrcMemto-

RegReg

WriteMem Read

Mem Write Branch ALUOp1 ALUp0

R-format 1 0 0 1 0 0 0 1 0lw 0 1 1 1 1 0 0 0 0sw X 1 X 0 0 1 0 0 0beq X 0 X 0 0 0 1 0 1

Effect of control bits

Deter-miningcontrolbits

Pipeline Control

• Initial design – motivated by single-cycle datapath control – use the same control signals

• Observe:– No separate write signal for the PC as it is written every cycle

– No separate write signals for the pipeline registers as they are written every cycle

– No separate read signal for instruction memory as it is read every clock cycle

– No separate read signal for register file as it is read every clock cycle

• Need to set control signals during each pipeline stage• Since control signals are associated with components active

during a single pipeline stage, can group control lines into five groups according to pipeline stage

Pipelined Datapath with Control I

PC

Instruction memory

Address

Inst

ruct

ion


MemtoReg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction [15– 0]

0

0Registers

Write register

Write data

Read data 1

Read data 2

Read register 1

Read register 2

Sign extend

M u x

1Write

data

Read

data M u x

1

ALU control

RegWrite

MemRead


6


MemWrite

Address

Data memory

PCSrc

Zero

AddAdd

result

Shift left 2

ALU result

ALUZero

Add

0

1

M u x

0

1

M u x

Same controlsignals as thesingle-cycledatapath

• There are five stages in the pipeline– instruction fetch / PC increment– instruction decode / register fetch– execution / address calculation– memory access– write back

Pipeline Control Signals

Execution/Address Calculation stage control lines

Memory access stage control lines

Write-back stage control

lines

InstructionReg Dst

ALU Op1

ALU Op0

ALU Src Branch

Mem Read

Mem Write

Reg write

Mem to Reg

R-format 1 1 0 0 0 0 0 1 0lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 Xbeq X 0 1 0 1 0 0 0 X

Nothing to control as instruction memoryread and PC write are always enabled

• Pass control signals along just like the data – extend each pipeline register to hold needed control bits for succeeding stages

• Note: The 6-bit funct field of the instruction required in the EX stage to generate ALU control can be retrieved as the 6 least significant bits of the immediate field which is sign-extended and passed from the IF/ID register to the ID/EX register

Pipeline Control Implementation

Control

EX

M

WB

M

WB

WB


Instruction

Pipelined Datapath with Control II

PC

Instruction memory

Inst

ruct

ion

Add


Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4


0

0

M u x

0

1

Add Add result


Write data

Read data 1

Read data 2

Read register 1

Read register 2

Sign extend

M u x

1

ALU result

Zero

Write data

Read data

M u x

1

ALU control

Shift left 2

Reg

Writ

e

MemRead

Control

ALU


6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

M u x

0

1

Mem

Writ

e

AddressData

memory

Address

Control signalsemanate from the controlportions of the pipeline registers

Instruction memory


Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4


0

M u x

0

1

Add Add result


Write data

Read data 1

Read data 2

Read register 1

Read register 2

Sign extend

M u x

1

ALU result

Zero

ALU control

Shift left 2

Re

gWrit

e

MemRead

Control

ALU


EX

M

WB

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: before<1> EX: before<2> MEM: before<3> WB: before<4>

MEM/WB

IF: lw $10, 20($1)

000

00

0000

000

00

000

0

00

00

0

0

0

M u x

0

1

Add

PC

0

Data memory

Address

Write data

Read data

M u x

1

WB

EX

M

Instruction memory

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

M u x

0

1

Add Add result

Write register

Write data

M u x

1

ALU result

Zero

ALU control

Shift left 2

Re

gWrit

e

ALU

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>

MEM/WB

IF: sub $11, $2, $3

010

11

0001

000

00

000

0

00

00

0

0

0

M u x

0

1

Add

PC

0Write data

Read data

M u x

1

lwControl

Registers

Read data 1

Read data 2

Read register 1

Read register 2

X

10

20

X

1


Instruction [15– 0] Sign

extend


20

$X

$1

10

X

Me

mW

rite

MemRead

Me

mW

rite

Data memory

Address

Address

Address

Clock 2

Clock 1

PipelinedExecutionandControl

• Instructionsequence:

lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9

Label “before<i>” meansi th instruction beforelw

Clock cycle 1

Clock cycle 2

Instruction memory

Address


Mem

toR

eg

Branch

ALUSrc

4


0

1

Add Add result


Write data

Read data 1

Read data 2

Read register 1

Read register 2

ALU result

Shift left 2

Re

gWrit

e

MemRead

Control

ALU


EX

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>

MEM/WB

IF: and $12, $4, $5

000

10

1100

010

11

000

1

00

00

0

0

0

M u x

0

1

Add

PC

0Write data

Read data

M u x

1

WB

EX

M

Instruction memory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Add result

Write register

Write data 1

ALU result

ALU control

Shift left 2

Re

gWrit

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>

MEM/WB

IF: or $13, $6, $7

000

10

1100

000

10

101

0

11

10

0

0

0

M u x

0

1

Add

PC

0Write data

M u x

1

andControl

Registers

Read data 1

Read data 2

Read register 1

Read register 2

12

X

X

5

4




X

$5

$4

X

12

Me

mW

rite

MemRead

Me

mW

rite

sub

11

X

X

3

2

X

$3

$2

X

11

$1

20

10

M u x

0

M u x

1

ALUOp

RegDst

ALU control

M

WB

$3

$2

11

M u x

M u x

ALUAddress Read

dataData

memory

10

WB

Zero

Zero

Sign extend

Sign extend

Data memory

Address

Clock 3

Clock 4



lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9

Clock cycle 3

Clock cycle 4

Instruction memory

Address


Branch

ALUSrc

4


0

1

Add Add result


Write data

Read data 1

Read data 2

Read register 1

Read register 2

ALU result

Shift left 2

Re

gWrit

e

MemRead

Control

ALU


EX

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: or $13, $6, $7 EX: and $12, . . . MEM: sub $11, . . . WB: lw $10, . . .

MEM/WB

IF: add $14, $8, $9

000

10

1100

000

10

101

0

10

00

0

M u x

0

1

Add

PC

0Write data

Read data

M u x

1

WB

EX

M

Instruction memory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Add result

1

ALU result

ALU control

Shift left 2

Re

gWrit

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: add $14, $8, $9 EX: or $13, . . . MEM: and $12, . . . WB: sub $11, . . .

MEM/WB

IF: after<1>

000

10

1100

000

10

101

0

10

00

0

1

0

M u x

0

1

Add

PC

0Write data

M u x

1

addControl

Registers

Read data 1

Read data 2

Read register 1

Read register 2

14

X

X

9

8




X

$9

$8

X

14

Me

mW

rite

MemRead

Me

mW

rite

or

13

X

X

7

6

X

$7

$6

X

13

$4

M u x

0

M u x

1

ALUOp

RegDst

ALU control

M

WB

$7

$6

13

M u x

M u x

ALURead data

12

WB

11 10

10$5

12

WB

Mem

toR

eg

1

1

11

11

Write register

Write data

Zero

Zero

Data memory

Address

Data memory

Address

Sign extend

Sign extend

Clock 5

Clock 6



lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9

Clock cycle 5

Clock cycle 6

Label “after<i>” meansi th instruction after add

Instruction memory

Address


Branch

ALUSrc

4


0

1

Add Add result


Write data

ALU result

Shift left 2

Re

gWrit

e

MemRead

Control

ALU


Sign extend

EX

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<1> EX: add $14, . . . MEM: or $13, . . . WB: and $12, . . .

MEM/WB

IF: after<2>

000

00

0000

000

10

101

0

10

00

0

M u x

0

1

Add

PC

0Write data

Read data

M u x

1

WB

EX

M

Instruction memory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Add result

1

ALU result

Zero

ALU control

Shift left 2

Re

gWrit

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<2> EX: after<1> MEM: add $14, . . . WB: or $13, . . .

MEM/WB

IF: after<3>

000

00

0000

000

00

000

0

10

00

0

1

0

M u x

0

1

Add

PC

0Write data

M u x

1

Control

Registers

Read data 1

Read data 2

Read register 1

Read register 2



extend


Me

mW

rite

MemRead

Me

mW

rite

$8

M u x

0

M u x

1

ALUOp

RegDst

ALU control

M

WB

M u x

M u x

ALURead data

14

WB

13 12

12$9

14

WB

Mem

toR

eg

1

0

13

13

Write register

Write data

Read data 1

Read data 2

Read register 1

Read register 2 Zero

Data memory

Address

Data memory

Address

Clock 7

Clock 8



lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9

Clock cycle 8

Clock cycle 7

Pipelined Execution and Control


WB

EX

M

Instruction memory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Add result

1

ALU result

Zero

ALU control

Shift left 2

Reg

Writ

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<3> EX: after<2> MEM: after<1> WB: add $14, . . .

MEM/WB

IF: after<4>

000

00

0000

000

00

000

0

00

00

0

1

0

M u x

0

1

Add

PC

0Write data

M u x

1

Control

Registers

Read data 1

Read data 2

Read register 1

Read register 2



extend


MemRead

Mem

Wri

te

M u x

M u x

ALURead data

WB

14

14

Write register

Write data

Data memory

Address

Clock 9

lw $10, 20($1)sub $11, $2, $3and $12, $4, $7or $13, $6, $7add $14, $8, $9

Clock cycle 9

Revisiting Hazards

• So far our datapath and control have ignored hazards• We shall revisit data hazards and control hazards and

enhance our datapath and control to handle them in hardware…

• Problem with starting an instruction before previous are finished:– data dependencies that go backward in time – called data hazards

Data Hazards and Forwarding

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3


and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2:

DM Reg

Reg

Reg

Reg

DM

sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)

$2 = 10 before sub ;$2 = -20 after sub

• Have compiler guarantee never any data hazards!– by rearranging instructions to insert independent instructions between

instructions that would otherwise have a data hazard between them,– or, if such rearrangement is not possible, insert nops

• Such compiler solutions may not always be possible, and nopsslow the machine down

Software Solution

sub $2, $1, $3

lw $10, 40($3)slt $5, $6, $7and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)

sub $2, $1, $3

nopnopand $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)

or

MIPS: nop = “no operation” = 00…0 (32bits) = sll $0, $0, 0

Hardware Solution: Forwarding

• Idea: use intermediate data, do not wait for result to be finally written to the destination register. Two steps:

1. Detect data hazard

2. Forward intermediate data to resolve hazard

Pipelined Datapath with Control II (as before)

PC

Instruction memory

Inst

ruct

ion

Add


Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4


0

0

M u x

0

1

Add Add result


Write data

Read data 1

Read data 2

Read register 1

Read register 2

Sign extend

M u x

1

ALU result

Zero

Write data

Read data

M u x

1

ALU control

Shift left 2

Reg

Writ

e

MemRead

Control

ALU


6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

M u x

0

1

Mem

Writ

e

AddressData

memory

Address


Hazard Detection

• Hazard conditions:1a. EX/MEM.RegisterRd = ID/EX.RegisterRs

1b. EX/MEM.RegisterRd = ID/EX.RegisterRt2a. MEM/WB.RegisterRd = ID/EX.RegisterRs

2b. MEM/WB.RegisterRd = ID/EX.RegisterRt

– Eg., in the earlier example, first hazard between sub $2, $1, $3 and and $12, $2, $5 is detected when the and is in EX stage and the

sub is in MEM stage because• EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 (1a)

• Whether to forward also depends on:– if the later instruction is going to write a register – if not, no need to

forward, even if there is register number match as in conditions above

– if the destination register of the later instruction is $0 – in which case there is no need to forward value ($0 is always 0 and never overwritten)

Data Forwarding• Plan:

– allow inputs to the ALU not just from ID/EX, but also later pipeline registers, and

– use multiplexors and control signals to choose appropriate inputs to ALU

IM Reg

IM Reg



sub $2, $1, $3


and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2 :

DM Reg

Reg

Reg

Reg

X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :

DM

sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)

Dependencies between pipelines move forward in time

ForwardingHardware

Registers

M u x M

u x

ALU

ID/EX MEM/WB

Data memory

M u x

Forwarding unit

EX/MEM

b. With forwarding

ForwardB

RdEX/MEM.RegisterRd

MEM/WB.RegisterRd

RtRtRs

ForwardA

M u x

ALU

ID/EX MEM/WB

Data memory

EX/MEM

a. No forwarding

Registers

M u x

Datapath before adding forwarding hardware

Datapath after adding forwarding hardware

Forwarding Hardware with Control

PCInstruction

memory

Registers

M u x

M u x

Control

ALU

EX

M

WB

M

WB

WB

ID/EX

EX/MEM

MEM/WB

Data memory

M u x

Forwarding unit

IF/ID

Inst

ruct

ion

M u x

RdEX/MEM.RegisterRd

MEM/WB.RegisterRd

Rt

Rt

Rs

IF/ID.RegisterRd

IF/ID.RegisterRt

IF/ID.RegisterRt

IF/ID.RegisterRs

Datapath with forwarding hardware and control wires – certain details,e.g., branching hardware, are omitted to simplify the drawing

Note: so far we have only handled forwarding to R-type instructions…!

Called forwarding unit, not hazard detection unit, because once data is forwarded there is no hazard!

PC Instruction memory

Registers

M u x

M u x

M u x

EX

M

WB

WB

Data memory

M u x

Forwarding unit

Inst

ruct

ion

IF/ID

and $4, $2, $5 sub $2, $1, $3

ID/EX

before<1>

EX/MEM

before<2>

MEM/WB

or $4, $4, $2

Clock 3

2

5

10 10

$2

$5

5

2

4

$1

$3

3

1

2

Control

ALU


Registers

M u x

M u x

M u x

EX

M

WB

M

WB

Data memory

M u x

Forwarding unit

Inst

ruct

ion

IF/ID

or $4, $4, $2 and $4, $2, $5

ID/EX

sub $2, . . .

EX/MEM

before<1>

MEM/WB

add $9, $4, $2

Clock 4

4

6

10 10

$4

$2

6

2

4

$2

$5

5

2

4

Control

ALU

10

2

WB

M

WBForwarding

sub $2, $1, $3and $4, $2, $5or $4, $4, $2add $9, $4, $2

• Execution example:

Clock cycle 3

Clock cycle 4

PCInstruction

memory

Registers

M u x

M u x

M u x

EX

M

WB

M

WB

Data memory

M u x

Forwarding unit

Inst

ruct

ion

IF/ID

add $9, $4, $2 or $4, $4, $2

ID/EX

and $4, . . .

EX/MEM

sub $2, . . .

MEM/WB

after<1>

Clock 5

4

2

10 10

$4

$2

2

4

9

$4

$2

4

2

24

Control

ALU

10

WB

2

1


M u x

M u x

M u x

EX

M

WB

M

WB

Data memory

M u x

Forwarding unit

after<1>after<2> add $9, $4, $2 or $4, . . .

EX/MEM

and $4, . . .

MEM/WB

Clock 6

10

$4

$2

2

4

9

ALU

10

4

4

WB

4

1

Registers

Inst

ruct

ion

IF/ID

ID/EX

4

Control

Forwarding

• Executionexample(cont.):

sub $2, $1, $3and $4, $2, $5or $4, $4, $2add $9, $4, $2

Clock cycle 5

Clock cycle 6

• Load word can still cause a hazard:– an instruction tries to read a register following a load instruction that

writes to the same register

– therefore, we need a hazard detection unit to stall the pipeline after the load instruction

Data Hazards and Stalls

Reg

IM

Reg

Reg

IM



lw $2, 20($1)


and $4, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

DM Reg

Reg

Reg

DM

lw $2, 20($1)and $4, $2, $5or $8, $2, $6add $9, $4, $2Slt $1, $6, $7

As even a pipelinedependency goesbackward in timeforwarding will notsolve the hazard

Pipelined Datapath with Control II (as before)

PC

Instruction memory

Inst

ruct

ion

Add


Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4


0

0

M u x

0

1

Add Add result


Write data

Read data 1

Read data 2

Read register 1

Read register 2

Sign extend

M u x

1

ALU result

Zero

Write data

Read data

M u x

1

ALU control

Shift left 2

Reg

Writ

e

MemRead

Control

ALU


6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

M u x

0

1

Mem

Writ

e

AddressData

memory

Address


Hazard Detection Logic to Stall

• Hazard detection unit implements the following check if to stall

if ( ID/EX.MemRead // if the instruction in the EX stage is a load…and ( ( ID/EX.RegisterRt = IF/ID.RegisterRs ) // and the destination register

or ( ID/EX.RegisterRt = IF/ID.RegisterRt ) ) ) // matches either source register

// of the instruction in the ID stage, then…stall the pipeline

Mechanics of Stalling

• If the check to stall verifies, then the pipeline needs to stall only 1 clock cycle after the load as after that the forwarding unit can resolve the dependency

• What the hardware does to stall the pipeline 1 cycle:– does not let the IF/ID register change (disable write!) – this will

cause the instruction in the ID stage to repeat, i.e., stall– therefore, the instruction, just behind, in the IF stage must be

stalled as well – so hardware does not let the PC change(disable write!) – this will cause the instruction in the IF stage to repeat, i.e., stall

– changes all the EX, MEM and WB control fields in the ID/EX pipeline register to 0, so effectively the instruction just behind the load becomes a nop – a bubble is said to have been inserted into the pipeline

• note that we cannot turn that instruction into an nop by 0ing all the bits in the instruction itself – recall nop = 00…0 (32 bits) – because it has already been decoded and control signals generated

Hazard Detection Unit

PCInstruction

memory

Registers

M u x

M u x

M u x

Control

ALU

EX

M

WB

M

WB

WB

ID/EX

EX/MEM

MEM/WB

Data memory

M u x

Hazard detection

unit

Forwarding unit

0

M u x

IF/ID

Inst

ruct

ion

ID/EX.MemRead

IF/ID

Writ

e

PC

Writ

e

ID/EX.RegisterRt

IF/ID.RegisterRd

IF/ID.RegisterRt

IF/ID.RegisterRt

IF/ID.RegisterRs

RtRs

Rd

Rt EX/MEM.RegisterRd

MEM/WB.RegisterRd

Datapath with forwarding hardware, the hazard detection unit and controls wires – certain details, e.g., branching hardware are omitted to simplify the drawing

Stalling Resolves a Hazard

• Same instruction sequence as before for which forwarding by itself could not resolve the hazard:

lw $2, 20($1)


and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

Reg

IM

Reg

Reg

IM DM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)

IM Reg DM RegIM

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9 CC 10

DM Reg

RegReg

Reg

bubble

lw $2, 20($1)and $4, $2, $5or $8, $2, $6add $9, $4, $2Slt $1, $6, $7

Hazard detection unit inserts a 1-cycle bubble in the pipeline, afterwhich all pipeline register dependencies go forward so then the forwarding unit can handle them and there are no more hazards

Hazard detection

unit

0

M u xIF

/IDW

rite

PC

Writ

e

ID/EX.RegisterRt

lw $2, 20($1)


Registers

M u x

M u x

M u x

EX

M

WB

WB

Data memory

M u x

Forwarding unit

Inst

ruct

ion

IF/ID

and $4, $2, $5

ID/EX

before<1>

EX/MEM

before<2>

MEM/WB

or $4, $4, $2

Clock 3

2

5

2

500 11

$2

$5

5

2

4

$1

$X

X

1

2

Control

ALU

M

WB

Hazard detection

unit

0

M u xIF

/IDW

rite

PC

Writ

e

ID/EX.RegisterRt

ID/EX.MemRead

ID/EX.MemRead

M

WB

$1

$X

X

1

2

before<3>


Registers

M u x

M u x

M u x

EX WB

Data memory

M u x

Forwarding unit

Inst

ruct

ion

IF/ID

ID/EX

EX/MEM

MEM/WB

and $4, $2, $5 lw $2, 20($1) before<1> before<2>

Clock 2

1

1

X

X11

Control

ALU

M

WBStalling

• Executionexample:

lw $2, 20($1)and $4, $2, $5or $4, $4, $2add $9, $4, $2

Clock cycle 2

Clock cycle 3

Hazard detection

unit

0

M u xIF

/IDW

rite

PC

Writ

e

ID/EX.RegisterRt

$2

$5

5

2

2

2

4

WB

Hazard detection

unit

0

M u xIF

/IDW

rite

PC

Writ

e

ID/EX.RegisterRt

PCInstruction

memory

Registers

M u x

M u x

M u x

EX

M

WB

Data memory

M u x

Inst

ruct

ion

IF/ID

and $4, $2, $5 bubble

ID/EX

lw $2, . . .

EX/MEM

before<1>

MEM/WB

Clock 4

2

2

5

510

11

00

$2

$5

5

2

4

Control

ALU

M

WB

bubble lw $2, . . .

PCInstruction

memory

Registers

M u x

M u x

M u x

EX

M

WB

M

WB

Data memory

M u x

Forwarding unit

Forwarding unit

Inst

ruct

ion

IF/ID

and $4, $2, $5

ID/EX

EX/MEM

MEM/WB

add $9, $4, $2

Clock 5

2

210 10

11

$4

$2

2

4

4

4

2

4

$2

$5

5

2

4

Control

ALU

0

WB

ID/EX.MemRead

ID/EX.MemRead

or $4, $4, $2

or $4, $4, $2

Stalling

• Execution example(cont.):

lw $2, 20($1)and $4, $2, $5or $4, $4, $2add $9, $4, $2

Clock cycle 4

Clock cycle 5

Registers

Inst

ruct

ion

ID/EX

4

Control



Hazard detection

unit

0

M u xIF/ID

Writ

e

PC

Writ

e

IF/ID

Writ

e

PC

Writ

e

ID/EX.RegisterRt

bubble

Registers

M u x

M u x

EX

M

WB

M

WB

Data memory

M u x

Inst

ruct

ion

IF/ID

add $9, $4, $2

ID/EX

and $4, . . .

EX/MEM

MEM/WB

Clock 6

4

4

2

210 10

$4

$2

2

4

49

$2

2

Control

ALU

10

WB0

add $9, $4, $2 or $4, . . . and $4, . . .after<2> after<1>

after<1>

Clock 7

M u x

M u x

M u x

EX

M

WB

M

WB

Data memory

M u x

Forwarding unit

Forwarding unit

EX/MEM

MEM/WB

10 10

$4

$4

$2

2

4

4

9

ALU

10

WB

44

4

1

Hazard detection

unit

0

M u x

ID/EX.RegisterRt

or $4, $4, $2

ID/EX.MemRead

ID/EX.MemRead

M u x

IF/ID

Stalling

• Execution example(cont.):

lw $2, 20($1)and $4, $2, $5or $4, $4, $2add $9, $4, $2

Clock cycle 6

Clock cycle 7

• Problem with branches in the pipeline we have so far is that thebranch decision is not made till the MEM stage – so what instructions, if at all, should we insert into the pipeline following the branch instructions?

• Possible solution: stall the pipeline till branch decision is known– not efficient, slow the pipeline significantly!

• Another solution: predict the branch outcome– e.g., always predict branch-not-taken – continue with next

sequential instructions– if the prediction is wrong have to flush the pipeline behind the

branch – discard instructions already fetched or decoded – and continue execution at the branch target

Control (or Branch) Hazards

Predicting Branch-not-taken:Misprediction delay

Reg

Reg

CC 1


40 beq $1, $3, 7


IM Reg

IM DM

IM DM

IM DM

DM

DM Reg

Reg Reg

Reg

Reg

RegIM

44 and $12, $2, $5

48 or $13, $6, $2

52 add $14, $2, $2

72 lw $4, 50($7)

CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9

Reg

The outcome of branch taken (prediction wrong) is decided only when beq is in the MEM stage, so the following three sequential instructionsalready in the pipeline have to be flushed and execution resumes at lw

Optimizing the Pipeline to Reduce Branch Delay

• Move the branch decision from the MEM stage (as in our current pipeline) earlier to the ID stage– calculating the branch target address involves moving the

branch adder from the MEM stage to the ID stage – inputs to this adder, the PC value and the immediate fields are already available in the IF/ID pipeline register

– calculating the branch decision is efficiently done, e.g., for equality test, by XORing respective bits and then ORing all the results and inverting, rather than using the ALU to subtract andthen test for zero (when there is a carry delay)

• with the more efficient equality test we can put it in the ID stage without significantly lengthening this stage – remember an objective of pipeline design is to keep pipeline stages balanced

– we must correspondingly make additions to the forwarding and hazard detection units to forward to or stall the branch at the ID stage in case the branch decision depends on an earlier result

Flushing on Misprediction

• Same strategy as for stalling on load-use data hazard…• Zero out all the control values (or the instruction itself) in pipeline

registers for the instructions following the branch that are already in the pipeline – effectively turning them into nops – so they are flushed– in the optimized pipeline, with branch decision made in the ID stage,

we have to flush only one instruction in the IF stage – the branch delay penalty is then only one clock cycle


4

Registers

M u x

M u x

M u x

ALU

EX

M

WB

M

WB

WB

ID/EX

0

EX/MEM

MEM/WB

Data memory

M u x

Hazard detection

unit

Forwarding unit

IF.Flush

IF/ID

Sign extend

Control

M u x

=

Shift left 2

M u x

Optimized Datapath for Branch

Branch decision is moved from the MEM stage to the ID stage – simplified drawing not showing enhancements to the forwarding and hazard detection units

IF.Flush control zeros out the instruction in the IF/ID pipeline register (which follows the branch)

PCInstruction

memory

4

Registers

Sign extend

M u x

M u x

Control

EX

M

WB

M

WB

WB

M u x

Hazard detection

unit

Forwarding unit

M u x

IF.Flush

IF/ID

and $12, $2, $5 beq $1, $3, 7 sub $10, $4, $8

MEM/WB

EX/MEM

ID/EX

Clock 3

72 44

48 44

28

7

$1

$3

10

48

72

72

0

M u x

0

$4

$8

ALUData

memory

bubble (nop)lw $4, 50($7)

Clock 4

M u x

Shift left 2

before<1>

beq $1, $3, 7 sub $10, . . . before<1>

before<2>

=


4

Registers

Sign extend

M u x

M u x

Control

EX

M

WB

M

WB

WB

M u x

Hazard detection

unit

Forwarding unit

IF.Flush

IF/ID

MEM/WB

EX/MEM

ID/EX

76 72

76 72

$1

$3

10

76

ALUData

memory

M u x

Shift left 2

=

PipelinedBranch

• Executionexample:

36 sub $10, $4, $840 beq $1, $3, 744 and $12 $2, $548 or $13 $2, $652 add $14, $4, $256 slt $15, $6, $7

…72 lw $4, 50($7)

Clock cycle 4

Clock cycle 3

Optimized pipeline withonly one bubble as a resultof the taken branch

Superscalar Architecture

• A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor.

• Each functional unit is not a separate CPU core but an execution resource within a single CPU

Typical 5-stage pipeline Superscalar Pipeline

Pentium4 Pipeline

20-stage pipeline

Pipelining Enhancing Performance with C D Pipeliningwatis/courses/178322/lecture8.pdf ·...

Documents

Transcript of Pipelining Enhancing Performance with C D Pipeliningwatis/courses/178322/lecture8.pdf ·...