CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José...

CMPUT 429 - Computer Systems and Architecture

1

CMPUT429/CMPE382 Winter 2001

Topic 3 - Pipelining

José Nelson Amaral

(Adapted from David A. Patterson’s CS252 lecture slides at Berkeley)


2

What is Pipelining?

Pipelining is a key implementation technique used to build fast processors. It allows the execution of multiple instructions to overlap in time.

A pipeline within a processor is similar to a car assembly line. Each assembly station is called a pipe stage or a pipe segment.

The throughput of an instruction pipeline is the measure of how often an instruction exits the pipeline.


3

Pipelining: It's Natural!

Laundry Example

Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.

Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes

Loads: A, B, C, D


4

Sequential Laundry

Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?

[Figure: task order (loads A, B, C, D) versus time, from 6 PM to midnight. Each load takes 30 + 40 + 20 minutes, and the loads run one after another.]


5

Pipelined Laundry: Start work ASAP

Pipelined laundry takes 3.5 hours for 4 loads.

[Figure: task order (loads A, B, C, D) versus time. The intervals along the time axis are 30, 40, 40, 40, 40, and 20 minutes.]

What is preventing them from doing it faster?

How could we eliminate this limiting factor?
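The two laundry totals can be checked with a short simulation. This is a minimal sketch: the stage times come from the slides, but the scheduling rule (a stage starts a load as soon as the load has left the previous stage and the stage itself is free) and all function names are assumptions made for illustration.

```python
# Stage times from the laundry example; the scheduling model is an assumption.
STAGE_MINUTES = [30, 40, 20]   # washer, dryer, "folder"
NUM_LOADS = 4                  # Ann, Brian, Cathy, Dave


def sequential_time(stage_minutes, num_loads):
    """Each load finishes every stage before the next load starts."""
    return num_loads * sum(stage_minutes)


def pipelined_time(stage_minutes, num_loads):
    """A stage starts a load once the load has left the previous stage
    and the stage itself is free."""
    stage_free_at = [0] * len(stage_minutes)   # time at which each stage is free
    finish = 0
    for _ in range(num_loads):
        ready = 0                              # time this load can enter its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])
            ready = start + minutes
            stage_free_at[s] = ready
        finish = ready
    return finish


print(sequential_time(STAGE_MINUTES, NUM_LOADS) / 60, "hours sequential")
print(pipelined_time(STAGE_MINUTES, NUM_LOADS) / 60, "hours pipelined")
```

Under this model the 40-minute dryer is the stage every load after the first has to wait for, which is why the pipelined total is 30 + 4 x 40 + 20 minutes.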


6

Pipelined Laundry: Start work ASAP

[Figure: task order (loads A, B, C, D) versus time. The intervals along the time axis are 30, 40, 40, 40, 40, and 20 minutes.]

Pipeline throughput: How many loads are completed per unit of time?

Pipeline latency: Once a load starts, how long does it take to complete it?

[Figure: multicycle MIPS datapath, showing the PC, memory (read address, write address, write data, MemData), instruction register (fields Instruction[31-26], Instruction[25-0], I[25-21], I[20-16], I[15-11], I[15-0]), register file, sign-extend and shift-left-2 units, ALU (Zero and ALU result outputs), Target register, and multiplexers.]


8

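5 Steps of MIPS Datapath

Figure 3.1, Page 130, CA:AQA 2e

[Figure: MIPS datapath divided into five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Components: instruction memory, adder (+4), register file, sign extend, ALU, Zero? test, data memory, LMD, and multiplexers. Labeled signals: Next PC, Next SEQ PC, Address, Inst, RS1, RS2, RD, Imm, WB Data.]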


9

5 Steps of MIPS Datapath

Figure 3.4, Page 134, CA:AQA 2e

[Figure: the same five-step datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the steps. Next SEQ PC and RD are carried forward through the pipeline registers. Labeled signals: Next PC, Address, RS1, RS2, Imm, WB Data.]

• Data stationary control
  – local decode for each instruction phase / pipeline stage


10

Steps to Execute Each Instruction Type

Fetch (all instruction types):
  IR ← Memory[PC]
  PC ← PC + 4

Decode (all instruction types):
  A ← Registers[IR[25-21]]
  B ← Registers[IR[20-16]]
  Target ← PC + (sign-extend(IR[15-0]) << 2)

Execute:
  R-type:       ALUout ← A op B
  load, store:  ALUout ← A + sign-extend(IR[15-0])
  branch:       if (A == B) then PC ← Target
  jump:         PC ← concat(PC[31-28], IR[25-0] << 2)

Memory:
  R-type:  Reg(IR[15-11]) ← ALUout
  load:    Memdata ← Mem[ALUout]
  store:   Mem[ALUout] ← B

Write-back:
  load:  Reg(IR[20-16]) ← Memdata
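The two address computations in the Decode and Execute steps can be sketched in a few lines. This is an illustration only: the function names are hypothetical, instructions are plain 32-bit integers, and the PC passed in is assumed to have already been incremented by 4 in the Fetch step.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc_plus_4, instruction):
    """Target <- PC + (sign-extend(IR[15-0]) << 2), with PC already incremented."""
    offset = sign_extend16(instruction) << 2
    return (pc_plus_4 + offset) & 0xFFFFFFFF   # keep the result to 32 bits


def jump_target(pc_plus_4, instruction):
    """PC <- upper 4 bits of PC concatenated with IR[25-0] shifted left by 2."""
    return (pc_plus_4 & 0xF0000000) | ((instruction & 0x03FFFFFF) << 2)


# A branch whose 16-bit offset field is 3 words, seen with PC + 4 = 0x00400004.
print(hex(branch_target(0x00400004, 0x10000003)))
# A jump whose 26-bit field is 0x0100000.
print(hex(jump_target(0x00400004, 0x08100000)))
```

The shift by 2 converts a word offset into a byte offset, since every instruction is 4 bytes long.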


11

Pipeline Stages

We can divide the execution of an instruction into the following stages:

IF: Instruction Fetch
ID: Instruction Decode
EX: Execution
MEM: Memory Access
WB: Write Back


12

Pipeline Throughput and Latency

Stage:  IF    ID    EX    MEM    WB
Delay:  5 ns  4 ns  5 ns  10 ns  4 ns

Consider the pipeline above with the indicated delays. We want to know the pipeline throughput and the pipeline latency.

Pipeline throughput: instructions completed per second.
Pipeline latency: how long it takes to execute a single instruction in the pipeline.


13



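Pipeline throughput: how often an instruction is completed.

$$T = \frac{1\ \text{instr}}{\max(\mathrm{lat}(IF), \mathrm{lat}(ID), \mathrm{lat}(EX), \mathrm{lat}(MEM), \mathrm{lat}(WB))} = \frac{1\ \text{instr}}{\max(5\,\text{ns}, 4\,\text{ns}, 5\,\text{ns}, 10\,\text{ns}, 4\,\text{ns})} = \frac{1\ \text{instr}}{10\ \text{ns}}$$

Pipeline latency: how long does it take to execute an instruction in the pipeline.

$$L = \mathrm{lat}(IF) + \mathrm{lat}(ID) + \mathrm{lat}(EX) + \mathrm{lat}(MEM) + \mathrm{lat}(WB) = 5\,\text{ns} + 4\,\text{ns} + 5\,\text{ns} + 10\,\text{ns} + 4\,\text{ns} = 28\ \text{ns}$$

Is this right?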


14



Simply adding the stage latencies to compute the pipeline latency would only work for an isolated instruction.

[Figure: timing diagram of instructions I1 to I4 flowing through IF, ID, EX, MEM, WB in the unbalanced pipeline. Each instruction after I1 has to wait for the 10 ns MEM stage.]

L(I1) = 28 ns
L(I2) = 33 ns
L(I3) = 38 ns
L(I4) = 43 ns

We are in trouble! The latency is not constant. This happens because this is an unbalanced pipeline. The solution is to make every stage the same length as the longest one.
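The growing latencies above (28, 33, 38, 43 ns) can be reproduced with a small simulation. This is a sketch under stated assumptions: a new instruction enters IF as soon as IF is free, a stage handles one instruction at a time, and an instruction may wait between stages until the next stage is free. The function name is hypothetical.

```python
def instruction_latencies(stage_delays_ns, num_instructions):
    """Return the latency (finish time minus issue time) of each instruction."""
    stage_free_at = [0] * len(stage_delays_ns)   # time at which each stage is free
    latencies = []
    for _ in range(num_instructions):
        issue = stage_free_at[0]                 # enter IF as soon as IF is free
        ready = issue
        for s, delay in enumerate(stage_delays_ns):
            start = max(ready, stage_free_at[s]) # wait for the stage if it is busy
            ready = start + delay
            stage_free_at[s] = ready
        latencies.append(ready - issue)
    return latencies


# Unbalanced pipeline: IF, ID, EX, MEM, WB delays in ns.
print(instruction_latencies([5, 4, 5, 10, 4], 4))
# Balanced pipeline: every stage stretched to the longest delay.
print(instruction_latencies([10, 10, 10, 10, 10], 4))
```

In the unbalanced case each new instruction waits 5 ns longer in front of MEM than its predecessor, because instructions arrive every 5 ns but MEM accepts one only every 10 ns. With every stage at 10 ns the latency is a constant 50 ns.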


15



The slowest pipeline stage also limits the latency!!

[Figure: timing diagram of instructions I1 to I4 flowing through IF, ID, EX, MEM, WB with every stage taking 10 ns; time axis from 0 to 60 ns.]

L(I1) = L(I2) = L(I3) = L(I4) = 50 ns


16



How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, and hazards.)

$$ExecTime_{pipe} = 20000 \times 10\ \text{ns} = 200000\ \text{ns} = 200\ \mu\text{s}$$

How long would it take using the same modules without pipelining?

$$ExecTime_{non\text{-}pipe} = 20000 \times 28\ \text{ns} = 560000\ \text{ns} = 560\ \mu\text{s}$$


17



Thus the speedup that we got from the pipeline is:

$$Speedup_{pipe} = \frac{ExecTime_{non\text{-}pipe}}{ExecTime_{pipe}} = \frac{560\ \mu\text{s}}{200\ \mu\text{s}} = 2.8$$

How can we improve this pipeline design?

We need to reduce the imbalance to increase the clock speed.


18


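Stage:  IF    ID    EX    MEM1  MEM2  WB
Delay:  5 ns  4 ns  5 ns  5 ns  5 ns  4 ns

Now we have one more pipeline stage, but the maximum latency of a single stage is cut in half.

$$T = \frac{1\ \text{instr}}{\max(\mathrm{lat}(IF), \mathrm{lat}(ID), \mathrm{lat}(EX), \mathrm{lat}(MEM1), \mathrm{lat}(MEM2), \mathrm{lat}(WB))} = \frac{1\ \text{instr}}{\max(5\,\text{ns}, 4\,\text{ns}, 5\,\text{ns}, 5\,\text{ns}, 5\,\text{ns}, 4\,\text{ns})} = \frac{1\ \text{instr}}{5\ \text{ns}}$$

The new latency for a single instruction is:

$$L = 6 \times 5\ \text{ns} = 30\ \text{ns}$$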


19



MEM2: 5 ns

[Figure: timing diagram of instructions I1, I2, ..., I7 flowing through the six stages IF, ID, EX, MEM1, MEM2, WB.]


20



MEM2: 5 ns

How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, etc., for now.)

$$ExecTime_{pipe} = 20000 \times 5\ \text{ns} = 100000\ \text{ns} = 100\ \mu\text{s}$$

Thus the speedup that we get from the pipeline is:

$$Speedup_{pipe} = \frac{ExecTime_{non\text{-}pipe}}{ExecTime_{pipe}} = \frac{560\ \mu\text{s}}{100\ \mu\text{s}} = 5.6$$


21



MEM2: 5 ns

What have we learned from this example?

1. It is important to balance the delays in the stages of the pipeline.
2. The throughput of a pipeline is 1/max(delay).
3. The latency is N × max(delay), where N is the number of stages in the pipeline.
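Lessons 2 and 3 can be applied to both designs from the example. The helper below is a hypothetical sketch, not part of the original slides. It also compares the speedup per instruction (sum of delays divided by the cycle time) with the number of stages, to show the cost of imbalance.

```python
def pipeline_metrics(stage_delays_ns):
    """Apply throughput = 1/max(delay) and latency = N * max(delay)."""
    cycle_ns = max(stage_delays_ns)
    num_stages = len(stage_delays_ns)
    return {
        "throughput_instr_per_ns": 1 / cycle_ns,
        "latency_ns": num_stages * cycle_ns,
        # Speedup over running the same modules back to back, unpipelined.
        "speedup": sum(stage_delays_ns) / cycle_ns,
        "num_stages": num_stages,
    }


FIVE_STAGE = [5, 4, 5, 10, 4]       # IF, ID, EX, MEM, WB
SIX_STAGE = [5, 4, 5, 5, 5, 4]      # IF, ID, EX, MEM1, MEM2, WB

for name, delays in (("five-stage", FIVE_STAGE), ("six-stage", SIX_STAGE)):
    print(name, pipeline_metrics(delays))
```

The six-stage design reaches a speedup of 5.6 with 6 stages, while the five-stage design reaches only 2.8 with 5 stages, because its 10 ns MEM stage forces every other stage to sit idle for part of each cycle.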


22

Pipelining Lessons

[Figure: pipelined laundry, task order (loads A, B, C, D) versus time from 6 PM to 9 PM; the intervals along the time axis are 30, 40, 40, 40, 40, and 20 minutes.]

• Pipelining doesn’t help latency of single task, it helps throughput of entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = Number pipe stages
• Unbalanced lengths of pipe stages reduces speedup
• Time to “fill” and “drain” pipeline reduces speedup


23

Computer Pipelines

Execute billions of instructions, so throughput is what matters.

Desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores.

Visualizing Pipelining

Figure 3.3, Page 133, CA:AQA 2e

[Figure: instruction order versus time (clock cycles 1 to 7). Four instructions each pass through Ifetch, Reg, ALU, DMem, Reg, with each instruction starting one cycle after the previous one.]


25

It's Not That Easy for Computers

Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.

Structural hazards: hardware cannot support this combination of instructions (single person to fold and put clothes away).

Data hazards: instruction depends on the result of a prior instruction still in the pipeline (missing sock).

Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

One Memory Port/Structural Hazards

Figure 3.6, Page 142, CA:AQA 2e

[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) versus time (clock cycles). Each instruction passes through Ifetch, Reg, ALU, DMem, Reg. With a single memory port, the DMem access of the Load and the Ifetch of Instr 3 need the memory in the same cycle.]

One Memory Port/Structural Hazards

Figure 3.7, Page 143, CA:AQA 2e

[Figure: instruction order (Load, Instr 1, Instr 2, Stall, Instr 3) versus time (clock cycles). The stall appears as a row of bubbles, and Instr 3 is fetched one cycle later.]

Data Hazard on R1

Figure 3.9, page 147, CA:AQA 2e

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

[Figure: instruction order versus time (clock cycles), with stages IF, ID/RF, EX, MEM, WB. The instructions after the add read r1 before or while the add writes it.]


29

Three Generic Data Hazards

Read After Write (RAW): InstrJ tries to read operand before InstrI writes it.

I: add r1,r2,r3
J: sub r4,r1,r3

Caused by a “dependence” (in compiler nomenclature). This hazard results from an actual need for communication.


30

Write After Read (WAR): InstrJ writes operand before InstrI reads it.

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.



31


Write After Write (WAW): InstrJ writes operand before InstrI writes it.

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.


32


WAW Can’t happen in MIPS 5 stage pipeline because:

All instructions take 5 stages, and

Writes are always in stage 5
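The three hazard definitions can be turned into a small classifier for a pair of instructions. This is an illustrative sketch only: it handles just the three-register format used in the examples (destination first, then two sources), and the function names are hypothetical. It reports which hazards could arise if the later instruction were allowed to overtake the earlier one. Whether a hazard can actually occur depends on the pipeline.

```python
def parse(instr):
    """Split 'add r1,r2,r3' into (destination, [sources])."""
    _, operands = instr.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return regs[0], regs[1:]


def hazards(instr_i, instr_j):
    """Name the potential hazards between instr_i (earlier) and instr_j (later)."""
    dest_i, srcs_i = parse(instr_i)
    dest_j, srcs_j = parse(instr_j)
    found = []
    if dest_i in srcs_j:
        found.append("RAW")    # J reads what I writes
    if dest_j in srcs_i:
        found.append("WAR")    # J writes what I reads
    if dest_i == dest_j:
        found.append("WAW")    # J writes what I writes
    return found


print(hazards("add r1,r2,r3", "sub r4,r1,r3"))
print(hazards("sub r4,r1,r3", "add r1,r2,r3"))
print(hazards("sub r1,r4,r3", "add r1,r2,r3"))
```

The three calls use the I and J instruction pairs from the RAW, WAR, and WAW examples.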


33

Forwarding to Avoid Data Hazard

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

[Figure: instruction order versus time (clock cycles) for the sequence above, with forwarding.]


34

HW Change for Forwarding

Figure 3.20, Page 161


35

Data Hazard Even with Forwarding

Figure 3.12, Page 153

lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9

[Figure: instruction order versus time (clock cycles) for the sequence above.]


36

Data Hazard Even with Forwarding

Figure 3.13, Page 154

lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9

[Figure: instruction order versus time (clock cycles) for the sequence above.]

Software Scheduling to Avoid Load Hazards

Try producing fast code for

    a = b + c;
    d = e - f;

assuming a, b, c, d, e, and f in memory.

Slow code:

    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd

Fast code:

    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd


38

Why is the fast code faster?

In the slow code, ADD and SUB each immediately follow a load that produces one of their operands, so each one stalls for a cycle. The fast code places an independent instruction after each such load.

Slow code:

Cycle           0     1     2     3     4     5     6     7     8     9     10    11    12    13    14
LW Rb, b        IF    ID    EX    MEM   WB
LW Rc, c              IF    ID    EX    MEM   WB
ADD Ra, Rb, Rc              IF    ID    stall EX    MEM   WB
SW a, Ra                          IF    stall ID    EX    MEM   WB
LW Re, e                                stall IF    ID    EX    MEM   WB
LW Rf, f                                            IF    ID    EX    MEM   WB
SUB Rd, Re, Rf                                            IF    ID    stall EX    MEM   WB
SW d, Rd                                                        IF    stall ID    EX    MEM   WB

Fast code:

Cycle           0     1     2     3     4     5     6     7     8     9     10    11    12    13    14
LW Rb, b        IF    ID    EX    MEM   WB
LW Rc, c              IF    ID    EX    MEM   WB
LW Re, e                    IF    ID    EX    MEM   WB
ADD Ra, Rb, Rc                    IF    ID    EX    MEM   WB
LW Rf, f                                IF    ID    EX    MEM   WB
SW a, Ra                                      IF    ID    EX    MEM   WB
SUB Rd, Re, Rf                                      IF    ID    EX    MEM   WB
SW d, Rd                                                  IF    ID    EX    MEM   WB
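The cycle counts for the two schedules can be reproduced by counting load-use stalls. This is a sketch under a simplifying assumption: with forwarding, the only stall is one cycle when an instruction immediately follows the LW that produces one of its source registers. The tuple encoding (opcode, destination, sources) and the function names are hypothetical.

```python
# Each instruction: (opcode, destination register or None, source registers).
SLOW = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
    ("SW", None, ["Ra"]), ("LW", "Re", []), ("LW", "Rf", []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]
FAST = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", None, ["Ra"]),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]


def load_use_stalls(program):
    """Count instructions that directly follow a load whose result they use."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "LW" and prev_dest in curr_srcs:
            stalls += 1
    return stalls


def total_cycles(program, stages=5):
    """One cycle per instruction, plus pipeline fill, plus stall cycles."""
    return len(program) + (stages - 1) + load_use_stalls(program)


print(load_use_stalls(SLOW), total_cycles(SLOW))
print(load_use_stalls(FAST), total_cycles(FAST))
```

Both schedules execute the same eight instructions. The fast one only reorders them so that no instruction sits directly behind the load it depends on.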


39

Control Hazard on Branches: Three Stage Stall

10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11

[Figure: instruction order versus time (clock cycles) for the sequence above; each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]


40

Example: Branch Stall Impact

If 30% of instructions are branches, a stall of 3 cycles is significant.

Two part solution:
  Determine if branch is taken or not sooner, AND
  Compute taken branch address earlier

MIPS branch tests if register = 0 or ≠ 0

MIPS Solution:
  Move Zero test to ID/RF stage
  Adder to calculate new PC in ID/RF stage
  1 clock cycle penalty for branch versus 3
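The impact of the branch stall can be estimated with the usual CPI formula. This is a sketch, assuming an ideal CPI of 1 when nothing stalls and that every branch pays the full penalty. Both assumptions, and the function name, are not stated in the slides.

```python
def cpi_with_branch_stalls(branch_fraction, stall_cycles, base_cpi=1.0):
    """CPI = base CPI + (fraction of branches) * (stall cycles per branch)."""
    return base_cpi + branch_fraction * stall_cycles


cpi_three_cycle = cpi_with_branch_stalls(0.30, 3)   # branch resolved late
cpi_one_cycle = cpi_with_branch_stalls(0.30, 1)     # zero test and adder in ID/RF

print(round(cpi_three_cycle, 2))
print(round(cpi_one_cycle, 2))
print(round(cpi_three_cycle / cpi_one_cycle, 2), "times faster with the 1-cycle penalty")
```

Under these assumptions the 3-cycle stall nearly doubles the CPI, which is why moving the branch decision into the ID/RF stage is worth the extra adder.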

Pipelined MIPS Datapath

Figure 3.22, page 163, CA:AQA 2/e

[Figure: pipelined five-stage datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. The Zero? test and an adder for the branch target are placed in the decode stage. Labeled signals: Next PC, Next SEQ PC, Address, RS1, RS2, RD, Imm, WB Data.]

• Data stationary control
  – local decode for each instruction phase / pipeline stage


42

Four Branch Hazard Alternatives

#1: Stall until branch direction is clear

#2: Predict Branch Not Taken
  Execute successor instructions in sequence
  “Squash” instructions in pipeline if branch actually taken
  Advantage of late pipeline state update
  47% MIPS branches not taken on average
  PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken
  53% MIPS branches taken on average
  But haven’t calculated branch target address in MIPS
    MIPS still incurs 1 cycle branch penalty
    Other machines: branch target known before outcome


43

Four Branch Hazard Alternatives

#4: Delayed Branch
  Define branch to take place AFTER a following instruction

    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n
    branch target if taken

  (Successors 1 to n form a branch delay of length n.)

  1 slot delay allows proper decision and branch target address in 5 stage pipeline
  MIPS uses this


44

Delayed Branch

Where to get instructions to fill branch delay slot?
  Before branch instruction
  From the target address: only valuable when branch taken
  From fall through: only valuable when branch not taken
  Canceling branches allow more slots to be filled

A canceling branch only executes instructions in the delay slot if the direction predicted is correct.


45

Fig. 3.28


46

Delayed Branch

Compiler effectiveness for single branch delay slot:
  Fills about 60% of branch delay slots
  About 80% of instructions executed in branch delay slots useful in computation
  About 50% (60% x 80%) of slots usefully filled

Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
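The "about 50%" figure is simply the product of the two percentages. The sketch below recomputes it and, as an assumption that goes beyond the slides, estimates the average penalty per branch by charging one wasted cycle for every slot that is not usefully filled. Variable names are hypothetical.

```python
FILLED_FRACTION = 0.60     # delay slots the compiler manages to fill
USEFUL_FRACTION = 0.80     # filled slots whose instruction does useful work
DELAY_SLOTS = 1            # single branch delay slot

usefully_filled = FILLED_FRACTION * USEFUL_FRACTION

# Assumption: a slot that is empty or not useful costs one wasted cycle.
wasted_cycles_per_branch = DELAY_SLOTS * (1 - usefully_filled)

print(round(usefully_filled, 2), "of slots usefully filled")
print(round(wasted_cycles_per_branch, 2), "wasted cycles per branch")
```

Under the same assumption, a longer branch delay leaves more slots per branch for the compiler to fill, which is one way to read the downside noted for 7-8 stage pipelines.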
