CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José...
-
Upload
william-bates -
Category
Documents
-
view
223 -
download
0
description
Transcript of CMPUT 429 - Computer Systems and Architecture1 CMPUT429/CMPE382 Winter 2001 Topic3-Pipelining José...
CMPUT 429 - Computer Systems and Architecture
1
CMPUT429/CMPE382 Winter 2001
Topic3-PipeliningJosé Nelson Amaral
(Adapted from David A. Patterson’s CS252lecture slides at Berkeley)
CMPUT 429 - Computer Systems and Architecture
2
What is Pipelining?
Pipelining is a key implementation technique usedto build fast processors. It allows the execution ofmultiple instructions to overlap in time.A pipeline within a processor is similar to a car assembly line. Each assembly station is called apipe stage or a pipe segment.
The throughput of an instruction pipeline isthe measure of how often an instruction exits thepipeline.
CMPUT 429 - Computer Systems and Architecture
3
Pipelining: Its Natural!Laundry ExampleAnn, Brian, Cathy, Dave
each have one load of clothes to wash, dry, and fold
Washer takes 30 minutesDryer takes 40 minutes“Folder” takes 20 minutes
A B C D
CMPUT 429 - Computer Systems and Architecture
4
Sequential Laundry
Sequential laundry takes 6 hours for 4 loadsIf they learned pipelining, how long would laundry take?
A
B
C
D
30 40 20 30 40 20 30 40 20 30 40 20
6 PM 7 8 9 10 11 Midnight
Task
Order
Time
CMPUT 429 - Computer Systems and Architecture
5
Pipelined LaundryStart work ASAP
Pipelined laundry takes 3.5 hours for 4 loads
A
B
C
D
6 PM 7 8 9 10 11 Midnight
Task
Order
Time
30 40 40 40 40 20What is preventingthem from doing it
faster?
How could we eliminatethis limiting factor?
CMPUT 429 - Computer Systems and Architecture
6
Pipelined LaundryStart work ASAP
A
B
C
D
6 PM 7 8 9 10 11 Midnight
Task
Order
Time
30 40 40 40 40 20 Pipeline throughput:How many loads arecompleted per unit
of time?
Pipeline Latency:Once a load starts, how
long does it takes tocomplete it?
Mux
0
1
Mux
0
1
Mux
0123
012
Mux
PC
Signext.
Shiftleft 2
Conc/Shiftleft 2
Readaddress
Writeaddress
Writedata
MemData
Instruction[31-26]
Instruction[25-0]
Instructionregister
Memory
Readregister 1Readregister 2WriteregisterWritedata
Readdata 1
Readdata 2
Registers 4
32
Mux
0
10
1Mux
ALUresult
Zero
ALU
Target
16
426
32
I[25-21]I[20-16]
I[15-0]
[15-11]
CMPUT 429 - Computer Systems and Architecture
8
5 Steps of MIPS DatapathFigure 3.1, Page 130, CA:AQA 2e
MemoryAccess
Write
Back
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
LMD
ALU
MUX
Mem
ory
Reg File
MUX
MUX
DataM
emory
MUX
SignExtend
4
Adder Zero?
Next SEQ PC
Address
Next PC
WB Data
Inst
RD
RS1
RS2
Imm
CMPUT 429 - Computer Systems and Architecture
9
5 Steps of MIPS DatapathFigure 3.4, Page 134 , CA:AQA 2e
MemoryAccess
Write
Back
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
Reg File
MUX
MUX
DataM
emory
MUX
SignExtend
Zero?
IF/ID
ID/EX
MEM
/WB
EX/MEM
4
Adder
Next SEQ PC Next SEQ PC
RD RD RD WB
Data
• Data stationary control– local decode for each instruction phase / pipeline stage
Next PC
Address
RS1
RS2
Imm
MUX
CMPUT 429 - Computer Systems and Architecture
10
Steps to Execute EachInstruction Type
Instruction Type Step R-type load store branch jump
Fetch IR Memory[PC] PC PC + 4
Decode A Registers[IR[25-21]] B Registers[IR[20-16]]
Target PC + (sign-extend(IR[15-0]) << 2) Execute
ALUopt A op B
ALUout A + sign-extend(IR[15-0])
If(A==B) then PC Target
PC concat(PC[31-28], IR[25-0]) << 2
Memory Reg(R[15-11]) ALUout
Memdata Mem[ALUout]
Mem[ALUout] B
Write-back
Reg(IR[20-16])
memdata
CMPUT 429 - Computer Systems and Architecture
11
Pipeline Stages
We can divide the execution of an instructioninto the following stages:
IF: Instruction FetchID: Instruction DecodeEX: ExecutionMEM: Memory AccessWB: Write Back
CMPUT 429 - Computer Systems and Architecture
12
Pipeline Throughput and Latency
IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns
Consider the pipeline above with the indicateddelays. We want to know what is the pipelinethroughput and the pipeline latency.
Pipeline throughput: instructions completed per second.Pipeline latency: how long does it take to execute a single instruction in the pipeline.
CMPUT 429 - Computer Systems and Architecture
13
Pipeline Throughput and Latency
IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns
Pipeline throughput: how often an instruction is completed.
nsinstrnsnsnsnsnsinstr
WBlatMEMlatEXlatIDlatIFlatinstrT
10/14,10,5,4,5max/1
)(),(),(),(),(max/1
Pipeline latency: how long does it take to execute an instruction in the pipeline.
nsnsnsnsnsnsWBlatMEMlatEXlatIDlatIFlatL
28410545)()()()()(
Is this right?
CMPUT 429 - Computer Systems and Architecture
14
Pipeline Throughput and Latency
IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns
Simply adding the latencies to compute the pipelinelatency, only would work for an isolated instruction
IF MEMIDI1 L(I1) = 28nsEX WBMEMIDIFI2 L(I2) = 33nsEX WB
MEMIDIFI3 L(I3) = 38nsEX WBMEMIDIFI4
L(I5) = 43nsEX WB
We are in trouble! The latency is not constant.This happens because this is an unbalancedpipeline. The solution is to make every state
the same length as the longest one.
CMPUT 429 - Computer Systems and Architecture
15
Pipeline Throughput and Latency
IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns
The slowest pipeline state also limits the latency!!
IF MEMIDI1
L(I1) = L(I2) = L(I3) = L(I4) = 50ns
EX WBIF MEMIDI2 L(I2) = 50nsEX WB
IF MEMID EX WBIF MEMID EX
0 10 20 30 40 50 60
I3I4
CMPUT 429 - Computer Systems and Architecture
16
Pipeline Throughput and Latency
IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns
How long does it take to execute 20000 instructionsin this pipeline? (disregard bubbles caused bybranches, cache misses, and hazards)
How long would it take using the same moduleswithout pipelining?
snsnsExecTime pipe 2002000001020000
snsnsExecTime pipenon 5605600002820000
CMPUT 429 - Computer Systems and Architecture
17
Pipeline Throughput and Latency
IF ID EX MEM WB5 ns 4 ns 5 ns 10 ns 4 ns
Thus the speedup that we got from the pipeline is:
8.2 200 560
ss
ExecTimeExecTime
Speeduppipe
pipenonpipe
How can we improve this pipeline design?
We need to reduce the unbalance to increasethe clock speed.
CMPUT 429 - Computer Systems and Architecture
18
Pipeline Throughput and Latency
IF ID EX MEM1 WB5 ns 4 ns 5 ns 5 ns 4 ns
Now we have one more pipeline stage, but themaximum latency of a single stage is reduced in half.
MEM25 ns
nsinstrnsnsnsnsnsnsinstr
WBlatMEMlatMEMlatEXlatIDlatIFlatinstrT
5/1)4,5,5,5,4,5max(/1
)(),2(),1(),(),(),(max(/1
nsnsL 3056
The new latency for a single instruction is:
CMPUT 429 - Computer Systems and Architecture
19
Pipeline Throughput and Latency
IF ID EX MEM1 WB5 ns 4 ns 5 ns 5 ns 4 ns
MEM25 ns
IF MEM1IDI1 EX WBMEM1IF MEM1IDI2 EX WBMEM1
IF MEM1IDI3 EX WBMEM1IF MEM1IDI4 EX WBMEM1
IF MEM1IDI5 EX WBMEM1IF MEM1IDI6 EX WBMEM1
IF MEM1IDI7 EX WBMEM1
CMPUT 429 - Computer Systems and Architecture
20
Pipeline Throughput and Latency
IF ID EX MEM1 WB5 ns 4 ns 5 ns 5 ns 4 ns
MEM25 ns
snsnsExecTime pipe 100100000520000
How long does it take to execute 20000 instructionsin this pipeline? (disregard bubles caused bybranches, cache misses, etc, for now)
Thus the speedup that we get from the pipeline is:
6.5 100 560
ss
ExecTimeExecTime
Speeduppipe
pipenonpipe
CMPUT 429 - Computer Systems and Architecture
21
Pipeline Throughput and Latency
IF ID EX MEM1 WB5 ns 4 ns 5 ns 5 ns 4 ns
MEM25 ns
What have we learned from this example?1. It is important to balance the delays in the stages of the pipeline
2. The throughput of a pipeline is 1/max(delay).3. The latency is Nmax(delay), where N is the number of stages in the pipeline.
CMPUT 429 - Computer Systems and Architecture
22
Pipelining Lessons
A
B
C
D
6 PM 7 8 9
Task
Order
Time
30 40 40 40 40 20
•Pipelining doesn’t help latency of single task, it helps throughput of entire workload•Pipeline rate limited by slowest pipeline stage•Multiple tasks operating simultaneously•Potential speedup = Number pipe stages•Unbalanced lengths of pipe stages reduces speedup•Time to “fill” and “drain” pipeline reduces speedup
CMPUT 429 - Computer Systems and Architecture
23
Computer Pipelines
Execute billions of instructions, so throughput is what matters
Desirable features: all instructions same length, registers located in same place in
instruction format, memory operands only in loads or
stores
Visualizing PipeliningFigure 3.3, Page 133 , CA:AQA 2e
Instr.
Order
Time (clock cycles)
Reg ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5
CMPUT 429 - Computer Systems and Architecture
25
Its Not That Easy for Computers
Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycleStructural hazards: Hardware cannot
support this combination of instructions (single person to fold and put clothes away)
Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
One Memory Port/Structural Hazards
Figure 3.6, Page 142 , CA:AQA 2e
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
Instr 3
Instr 4
Reg ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5
Reg ALU
DMemIfetch Reg
One Memory Port/Structural Hazards
Figure 3.7, Page 143 , CA:AQA 2e
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
Stall
Instr 3
Reg ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5
Reg ALU
DMemIfetch Reg
Bubble Bubble Bubble BubbleBubble
Instr.
Order
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Data Hazard on R1Figure 3.9, page 147 , CA:AQA 2e
Time (clock cycles)
IF ID/RF EX MEM WB
CMPUT 429 - Computer Systems and Architecture
29
Read After Write (RAW) InstrJ tries to read operand before InstrI writes it
Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
Three Generic Data Hazards
I: add r1,r2,r3J: sub r4,r1,r3
CMPUT 429 - Computer Systems and Architecture
30
Write After Read (WAR) InstrJ writes operand before InstrI reads it
Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”.
I: sub r4,r1,r3 J: add r1,r2,r3K: mul r6,r1,r7
Three Generic Data Hazards
CMPUT 429 - Computer Systems and Architecture
31
Three Generic Data Hazards
Write After Write (WAW) InstrJ writes operand before InstrI writes it.
Called an “output dependence” by compiler writersThis also results from the reuse of name “r1”.
I: sub r1,r4,r3 J: add r1,r2,r3K: mul r6,r1,r7
CMPUT 429 - Computer Systems and Architecture
32
Three Generic Data Hazards
WAW Can’t happen in MIPS 5 stage pipeline because:
All instructions take 5 stages, and
Writes are always in stage 5
CMPUT 429 - Computer Systems and Architecture
33
Forwarding to Avoid Data Hazard
Instr.
Order
Time (clock cycles)
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
CMPUT 429 - Computer Systems and Architecture
34
HW Change for ForwardingFigure 3.20, Page 161
CMPUT 429 - Computer Systems and Architecture
35
Instr.
Order
Time (clock cycles)
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
Data Hazard Even with Forwarding
Figure 3.12, Page 153
CMPUT 429 - Computer Systems and Architecture
36
Data Hazard Even with Forwarding
Figure 3.13, Page 154Instr.
Order
Time (clock cycles)
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
Try producing fast code fora = b + c;d = e – f;
assuming a, b, c, d ,e, and f in memory. Slow code:
LW Rb,bLW Rc,cADD Ra,Rb,RcSW a,Ra LW Re,e LW Rf,fSUB Rd,Re,RfSW d,Rd
Software Scheduling to Avoid Load Hazards
Fast code:LW Rb,bLW Rc,cLW Re,e ADD Ra,Rb,RcLW Rf,fSW a,Ra SUB Rd,Re,RfSW d,Rd
CMPUT 429 - Computer Systems and Architecture
38
Why the fast code is faster? Cycle 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 LW Rb, b IF ID EX MEM WB LW Rc, c IF ID EX MEM WB ADD Ra, Rb, Rc IF ID stall EX MEM WB SW a, Ra IF stall ID EX MEM WB LW Re, e stall IF ID EX MEM WB LW Rf, f IF ID EX MEM WB SUB Rd, Re, Rf IF ID stall EX MEM WB SW d, Rd IF stall ID EX MEM WB
Cycle 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 LW Rb, b IF ID EX MEM WB LW Rc, c IF ID EX MEM WB LW Re, e IF ID EX MEM WB ADD Ra, Rb, Rc IF ID EX MEM WB LW Rf, f IF ID EX MEM WB SW a, Ra IF ID EX MEM WB SUB Rd, Re, Rf IF ID EX MEM WB SW d, Rd IF ID EX MEM WB
CMPUT 429 - Computer Systems and Architecture
39
Control Hazard on BranchesThree Stage Stall
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
Reg ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
CMPUT 429 - Computer Systems and Architecture
40
Example: Branch Stall ImpactIf 30% branch, Stall of 3 cycles is significantTwo part solution:
Determine if branch is taken or not sooner, AND Compute taken branch address earlier
MIPS branch tests if register = 0 or 0MIPS Solution:
Move Zero test to ID/RF stage Adder to calculate new PC in ID/RF stage 1 clock cycle penalty for branch versus 3
Adder
IF/ID
Pipelined MIPS DatapathFigure 3.22, page 163, CA:AQA 2/e
MemoryAccess
Write
Back
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
Reg File
MUX
DataM
emory
MUX
SignExtend
Zero?
MEM
/WB
EX/MEM
4
Adder
Next SEQ PC
RD RD RD WB
Data
• Data stationary control– local decode for each instruction phase / pipeline stage
Next PC
Address
RS1
RS2
ImmM
UX
ID/EX
CMPUT 429 - Computer Systems and Architecture
42
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear#2: Predict Branch Not Taken
Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken 53% MIPS branches taken on average But haven’t calculated branch target address in MIPS
MIPS still incurs 1 cycle branch penalty Other machines: branch target known before outcome
CMPUT 429 - Computer Systems and Architecture
43
Four Branch Hazard Alternatives
#4: Delayed Branch Define branch to take place AFTER a following instruction
branch instructionsequential successor1
sequential successor2
........sequential successorn
branch target if taken
1 slot delay allows proper decision and branch target address in 5 stage pipeline
MIPS uses this
Branch delay of length n
CMPUT 429 - Computer Systems and Architecture
44
Delayed BranchWhere to get instructions to fill branch
delay slot? Before branch instruction From the target address: only valuable when
branch taken From fall through: only valuable when branch
not taken Canceling branches allow more slots to be filled
A canceling branch only executes instructions in the delay slot if the direction predicted is correct.
CMPUT 429 - Computer Systems and Architecture
45
Fig. 3.28
CMPUT 429 - Computer Systems and Architecture
46
Delayed BranchCompiler effectiveness for single branch
delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch delay
slots useful in computation About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)