Pipelining (Chapter 8)
TU-Delft TI1400/11-PDS
http://www.pds.ewi.tudelft.nl/~iosup/Courses/2011_ti1400_10.ppt
Basic idea (1)
[Figure: sequential execution of I1 to I4 along the time axis (F1 E1, F2 E2, F3 E3, F4 E4). Hardware: instruction fetch unit, buffer B1, execution unit.]
Basic idea (2): Overlap
[Figure: pipelined execution of I1 to I4 over clock cycles 1 to 5; the fetch of each instruction (F2, F3, F4) overlaps the execution of its predecessor (E1, E2, E3).]
Instruction phases
• F: Fetch instruction
• D: Decode instruction and fetch operands
• O: Perform operation
• W: Write result
Four-stage pipeline
[Figure: pipelined execution of I1 to I4 against the clock cycle axis; each instruction passes through F, D, O and W, starting one clock cycle after its predecessor.]
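The overlap in a four-stage pipeline can be reproduced with a short sketch. The names `pipeline_schedule` and `total_cycles` are illustrative and not from the slides. The model assumes an ideal pipeline: no stalls, one clock cycle per stage, one new instruction per cycle.

```python
STAGES = ["F", "D", "O", "W"]  # stage names from the slides


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return {instruction number: {stage: clock cycle}} for an ideal pipeline.

    Assumption: no stalls, one cycle per stage, one new instruction per cycle.
    """
    schedule = {}
    for i in range(1, num_instructions + 1):
        # Instruction i enters the first stage in cycle i.
        schedule[i] = {stage: i + s for s, stage in enumerate(stages)}
    return schedule


def total_cycles(num_instructions, num_stages):
    """Cycles until the last instruction leaves the last stage."""
    return num_instructions + num_stages - 1


if __name__ == "__main__":
    for i, row in pipeline_schedule(4).items():
        print(f"I{i}: " + " ".join(f"{st}@{c}" for st, c in row.items()))
    print("pipelined cycles: ", total_cycles(4, len(STAGES)))
    print("sequential cycles:", 4 * len(STAGES))
```

Under these assumptions four instructions need 4 + 4 - 1 = 7 cycles instead of 16; with the two-stage fetch/execute pipeline of the basic idea the same formula gives 5 cycles.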
Hardware organization (1)
[Figure: four units in sequence, separated by buffers: Fetch unit, B1, Decode and fetch operands, B2, Operation unit, B3, Write unit.]
Hardware organization (2)
During cycle 4, the buffers contain:
• B1:
- instruction I3
• B2:
- the source operands of I2
- the specification of the operation
- the specification of the destination operand
• B3:
- the result of the operation of I1
- the specification of the destination operand
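The buffer contents for any cycle follow the same pattern as the cycle-4 snapshot. A minimal sketch, assuming an ideal four-stage pipeline in which instruction Ii is fetched in cycle i; the function name `buffer_contents` is hypothetical.

```python
def buffer_contents(cycle):
    """Which instruction each inter-stage buffer holds during `cycle`.

    Assumption: ideal four-stage pipeline, instruction Ii is fetched in cycle i.
    None means the buffer does not hold a valid instruction yet.
    """
    def instr(n):
        return f"I{n}" if n >= 1 else None

    return {
        "B1": instr(cycle - 1),  # fetched in the previous cycle, being decoded now
        "B2": instr(cycle - 2),  # decoded in the previous cycle, operation in progress
        "B3": instr(cycle - 3),  # operation finished in the previous cycle, being written
    }


if __name__ == "__main__":
    print(buffer_contents(4))
```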
Hardware organization (3)
[Figure: the same organization during cycle 4: B1 holds I3, B2 holds the operands and operation of I2, B3 holds the result of I1.]
Pipeline stall (1)
• Pipeline stall: delay in a stage of the pipeline due to an instruction
• Reasons for pipeline stall:
- Cache miss
- Long operation (for example, division)
- Dependency between successive instructions
- Branching
Pipeline stall (2): Cache miss
[Figure: four-stage execution of I1 to I3 over clock cycles 1 to 8, with a cache miss in I2.]
Pipeline stall (3): Cache miss
[Figure: Effect of cache miss in F2. Per-stage view (F, D, O, W) over clock cycles 1 to 8: the fetch stage stays occupied by F2 for several cycles before F3 starts, and the D, O and W stages are each idle for three cycles.]
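How a slow stage delays the instructions behind it can be sketched with a small timing model. This is a simplified model, not the one used in the lecture: a stage starts as soon as the instruction has left the previous stage and the previous instruction has left this stage, and buffer capacity between stages is not modelled. The 4-cycle fetch of I2 is a hypothetical miss penalty.

```python
def simulate(latencies):
    """latencies[i][s] = cycles instruction i spends in stage s.

    Returns finish[i][s], the clock cycle in which instruction i leaves stage s.
    Assumption: a stage starts once (a) the same instruction has left the
    previous stage and (b) the previous instruction has left this stage.
    """
    finish = []
    for i, lat in enumerate(latencies):
        row = []
        for s, cycles in enumerate(lat):
            ready_self = row[s - 1] if s > 0 else 0
            ready_stage = finish[i - 1][s] if i > 0 else 0
            row.append(max(ready_self, ready_stage) + cycles)
        finish.append(row)
    return finish


if __name__ == "__main__":
    # I2's fetch takes 4 cycles instead of 1 (hypothetical cache miss).
    program = [[1, 1, 1, 1], [4, 1, 1, 1], [1, 1, 1, 1]]
    for i, row in enumerate(simulate(program), start=1):
        print(f"I{i} leaves F, D, O, W in cycles {row}")
```

In this model I2 and I3 each complete three cycles later than they would without the miss, which matches the three idle cycles per stage in the diagram. The same function covers a long operation by raising the latency of the O stage instead.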
Pipeline stall (4): Long operation
[Figure: four-stage execution of I1 to I4 over clock cycles 1 to 8, where a long operation in the O stage delays the instructions behind it.]
Pipeline stall (5): Dependencies
• Instructions:
ADD R1, 3(R1)
ADD R4, 4(R1)
cannot be done in parallel
• Instructions:
ADD R2, 3(R1)
ADD R4, 4(R3)
can be done in parallel
Pipeline stall (6): Branch
[Figure: Ii (branch) Fi Ei, followed by the branch target Ik Fk Ek, along the time axis.]
Pipeline stall due to branch: only start fetching instructions after the branch has been executed
Data dependency (1): example
MUL R2,R3,R4 /* R4 destination */
ADD R5,R4,R6 /* R6 destination */
New value of R4 must be available before ADD instruction uses it
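The dependency in this example is a read-after-write on R4, and checking for it is mechanical. A minimal sketch; the `Instr` tuple and the function `raw_hazard` are illustrative names, and the operand order (sources first, destination last) follows the comments in the example above.

```python
from collections import namedtuple

# Three-operand form used in the example: sources first, destination last.
Instr = namedtuple("Instr", ["op", "sources", "dest"])


def raw_hazard(first, second):
    """True if `second` reads the register that `first` writes."""
    return first.dest in second.sources


mul = Instr("MUL", ("R2", "R3"), "R4")
add = Instr("ADD", ("R5", "R4"), "R6")

if __name__ == "__main__":
    print("ADD depends on MUL:", raw_hazard(mul, add))
```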
Data dependency (2): example
[Figure: I1 (MUL) F1 D1 O1 W1; I2 (ADD) F2 D2 O2 W2; I3 F3 D3 O3 W3; I4 F4 D4 O4 W4, along the time axis.]
Pipeline stall due to data dependence between W1 and D2
Branching: Instruction queue
[Figure: Fetch unit filling an instruction queue, followed by the Dispatch, Operation and Write stages.]
Idling at branch
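[Figure: Ij (branch) Fj Ej; Ij+1 is fetched (Fj+1) but not executed; the execution unit is idle for one cycle; then the branch target Ik (Fk Ek) and Ik+1 (Fk+1 Ek+1) proceed.]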
Branch with instruction queue
[Figure: I1 F1 E1; I2 F2 E2; I3 F3 E3; I4 F4 (discarded because of the branch); then Ij Fj Ej; Ij+1 Fj+1 Ej+1; Ij+2 Fj+2 Ej+2; Ij+3 Fj+3 Ej+3, along the time axis.]
Branch folding: execute a later branch instruction simultaneously (i.e., compute target)
Delayed branch (1): reordering
Original (always lose a cycle):
LOOP Shift_left R1
     Decrement R2
     Branch_if>0 LOOP
NEXT Add R1,R3

Reordered (Shift_left is always executed):
LOOP Decrement R2
     Branch_if>0 LOOP
     Shift_left R1
NEXT Add R1,R3
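That the reordered loop computes the same result can be checked with a toy interpreter. This is a sketch under stated assumptions: the tuple encoding of instructions is hypothetical, the branch tests the register directly instead of a condition code, `Add R1,R3` is taken to mean R3 := R3 + R1, and a branch in a delay slot is not handled.

```python
def run(program, regs, delay_slot):
    """Execute a tiny program and return the final registers.

    program: list of (op, arg) tuples in a hypothetical encoding:
      ("shl", reg), ("dec", reg), ("bgt0", (reg, target_index)), ("add", (src, dst))
    delay_slot: if True, the instruction after a branch is always executed
    before control transfers (delayed branch).
    """
    regs = dict(regs)
    pc = 0
    pending_target = None  # taken-branch target that applies after the delay slot
    while pc < len(program):
        op, arg = program[pc]
        next_pc = pc + 1
        if pending_target is not None:
            # This instruction sits in the delay slot; branch after it.
            next_pc, pending_target = pending_target, None
        if op == "shl":
            regs[arg] <<= 1
        elif op == "dec":
            regs[arg] -= 1
        elif op == "add":
            src, dst = arg
            regs[dst] += regs[src]
        elif op == "bgt0":
            reg, target = arg
            if regs[reg] > 0:
                if delay_slot:
                    pending_target = target
                else:
                    next_pc = target
        pc = next_pc
    return regs


original = [("shl", "R1"), ("dec", "R2"), ("bgt0", ("R2", 0)), ("add", ("R1", "R3"))]
reordered = [("dec", "R2"), ("bgt0", ("R2", 0)), ("shl", "R1"), ("add", ("R1", "R3"))]
start = {"R1": 1, "R2": 3, "R3": 0}

if __name__ == "__main__":
    print(run(original, start, delay_slot=False))
    print(run(reordered, start, delay_slot=True))
```

Both runs shift R1 three times; the reordered version does the shift in the slot after the branch, where the original pipeline would have done no useful work.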
Delayed branch (2): execution timing
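[Figure: fetch (F) and execute (E) timing, each instruction one cycle after the previous one, for the sequence Decrement, Branch, Shift, Decrement, Branch, Shift, Add.]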
Branch prediction (1)
[Figure: Effect of incorrect branch prediction. I1 (Compare) F1 D1 E1 W1; I2 (Branch-if>) F2 E2; I3 F3 D3 E3 X; I4 F4 D4 X; Ik Fk Dk. I3 and I4 are cancelled (X) and the branch target Ik is fetched.]
Branch prediction (2)
Possible implementation:
- use a single bit
- bit records previous choice of branch
- bit tells from which location to fetch next instructions
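The single-bit scheme is easy to simulate. A minimal sketch for one branch; the function name `mispredictions` and the initial prediction of "not taken" are assumptions, not taken from the slides.

```python
def mispredictions(outcomes, initial_prediction=False):
    """Count mispredictions of a one-bit predictor for a single branch.

    outcomes: iterable of booleans, True = branch taken.
    The bit stores the previous outcome and serves as the next prediction.
    """
    bit = initial_prediction
    misses = 0
    for taken in outcomes:
        if bit != taken:
            misses += 1
        bit = taken  # remember the most recent choice
    return misses


if __name__ == "__main__":
    # A loop branch that is taken three times and then falls through, run twice.
    loop = [True, True, True, False] * 2
    print(mispredictions(loop), "mispredictions out of", len(loop))
```

For a loop branch the bit is wrong twice per loop execution: once at loop exit and once at the first iteration of the next execution.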
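Data paths of CPU (1)
[Figure: register file with Source 1 and Source 2 outputs feeding registers SRC1 and SRC2, the ALU, the result register RSLT, and the Destination path back to the register file. Label: Operand forwarding.]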
Data paths of CPU (2)
[Figure: register file, SRC1, SRC2, ALU and RSLT across the Operation and Write stages, with a forwarding data path.]
Pipelined operation
I1: Add R1, R2, R3
I2: Shift_left R3
[Figure: I1 (Add): F, D, R1 + R2, write R3; I2 (Shift): F, D, shift R3, write R3; I3 and I4: F D O W.]
The result of Add has to be available
Short pipeline
[Figure: I1: F, D, R1 + R2, write R3; I2: F, D, forward (fwd) and shift R3, write; I3: F D O W.]
Long pipeline
[Figure: pipeline with three operation cycles. I1: F D O1 O2 O3 W; I2: F D O1 O2 O3 W, using the forwarded (fwd) result of I1; I3: F D O1 O2 O3 W.]
Compiler solution
I1: Add R1, R2, R3
I2: Shift_left R3

Insert no-operations to wait for the result:

I1: Add R1, R2, R3
    NOP
    NOP
I2: Shift_left R3
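A compiler pass of this kind can be sketched in a few lines. The function `insert_nops` and the tuple layout are hypothetical; the gap of 2 instruction slots between producer and consumer matches the example above, but the right value depends on the pipeline and on whether forwarding is available.

```python
def insert_nops(program, gap=2):
    """Insert NOPs so a register is read at least `gap` slots after it is written.

    program: list of (text, sources, dest) tuples; dest may be None.
    gap: number of instruction slots required between producer and consumer
         (2 matches the example; the real value is pipeline-specific).
    """
    out = []  # emitted (text, sources, dest) entries, including NOPs
    for text, sources, dest in program:
        needed = 0
        # Look back over the last `gap` emitted slots for a producer we depend on.
        for distance, (_, _, prev_dest) in enumerate(reversed(out[-gap:]), start=1):
            if prev_dest is not None and prev_dest in sources:
                needed = max(needed, gap - distance + 1)
        out.extend([("NOP", (), None)] * needed)
        out.append((text, sources, dest))
    return [text for text, _, _ in out]


if __name__ == "__main__":
    program = [
        ("Add R1, R2, R3", ("R1", "R2"), "R3"),
        ("Shift_left R3", ("R3",), "R3"),
    ]
    print("\n".join(insert_nops(program)))
```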
Side effects
I2: ADD D1, D2
I3: ADDX D3, D4
[Figure label: carry copy, from I2 to I3]
Other form of (implicit) data dependency: instructions can have side effects that are used by the next instruction
Complex addressing mode
Load (X(R1)), R2
[Figure: Load: F, D, X+[R1], [X+[R1]], [[X+[R1]]], write R2; next instruction: F, D, then forward (fwd), O and W once the loaded value is available. X is given in the instruction.]
Causes a pipeline stall
Simple addressing modes
Add #X,R1,R2
Load (R2),R2
Load (R2),R2
[Figure: Add: F, D, X+[R1], write R2; Load: F, D, [X+[R1]], write R2; Load: F, D, [[X+[R1]]], write R2; next instruction: F, D, forward (fwd), O, W.]
Built up from simple instructions: same amount of time
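The equivalence of the complex load and the three simple instructions can be checked on a toy memory. This is an illustrative sketch: memory is modelled as a dictionary, and the addresses and values are made up.

```python
def load_indexed_indirect(mem, regs, x):
    """Load (X(R1)), R2: one instruction, two dependent memory reads."""
    regs = dict(regs)
    regs["R2"] = mem[mem[x + regs["R1"]]]
    return regs


def load_simple_sequence(mem, regs, x):
    """Add #X,R1,R2 followed by Load (R2),R2 twice."""
    regs = dict(regs)
    regs["R2"] = x + regs["R1"]   # Add #X,R1,R2
    regs["R2"] = mem[regs["R2"]]  # Load (R2),R2
    regs["R2"] = mem[regs["R2"]]  # Load (R2),R2
    return regs


memory = {108: 500, 500: 42}        # hypothetical memory contents
registers = {"R1": 100, "R2": 0}    # hypothetical register values

if __name__ == "__main__":
    print(load_indexed_indirect(memory, registers, 8))
    print(load_simple_sequence(memory, registers, 8))
```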
Addressing modes
• Requirements for addressing modes with pipelining:
- operand access takes not more than one memory access
- only load and store instructions access memory
- addressing modes do not have side effects
• Possible addressing modes:
- register
- register indirect
- index
Condition codes (1)
• Problems in RISC with condition codes (CCs):
- do instructions after reordering have access to the right CC values?
- are CCs already available at the next instruction?
• Solutions:
- compiler detection
- no automatic use of CCs, only when explicitly given in instruction
Explicit specification of CCs
Increment R5                  ADDI R5, R5, 1
Add R2, R4                    ADDC R4, R2, R4
Add-with-increment R1, R3     ADDE R3, R1, R3
(the Add and Add-with-increment pair forms a double precision addition)
PowerPC instructions (C: change carry flag, E: use carry flag)
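The carry hand-off between the two additions can be modelled on 32-bit words. This is a simplified sketch, not PowerPC semantics in full: `addc` and `adde` are hypothetical helper names that only model "change carry" and "use carry" as described above.

```python
MASK32 = 0xFFFFFFFF


def addc(a, b):
    """Add two 32-bit words; return (sum, carry_out). Models 'change carry'."""
    total = a + b
    return total & MASK32, total >> 32


def adde(a, b, carry_in):
    """Add two 32-bit words plus carry_in. Models 'use carry'."""
    total = a + b + carry_in
    return total & MASK32, total >> 32


def add64(a_hi, a_lo, b_hi, b_lo):
    """64-bit addition: low words first, then high words with the carry."""
    lo, carry = addc(a_lo, b_lo)
    hi, _ = adde(a_hi, b_hi, carry)
    return hi, lo


if __name__ == "__main__":
    # 0x00000000_FFFFFFFF + 1 carries into the high word.
    print(add64(0, 0xFFFFFFFF, 0, 1))
```

Because the carry is named explicitly in the instructions, an unrelated instruction such as the increment of R5 can be scheduled around them without disturbing it.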
Two execution units
[Figure: Fetch unit, instruction queue, Dispatch unit feeding an FP unit and an Integer unit in parallel, then Write.]
Instruction flow (superscalar)
[Figure: I1 (Fadd): F1 D1 O1 O1 O1 W1; I2 (Add): F2 D2 O2 W2; I3 (Fsub): F3 D3 O3 O3 O3 W3; I4 (Sub): F4 D4 O4 W4.]
Simultaneous execution of floating point and integer operations
Completion in program order
[Figure: the same four instructions (I1 Fadd, I2 Add, I3 Fsub, I4 Sub), with the write stages delayed so that results are written in program order.]
Wait until previous instruction has completed
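The effect of forcing writes into program order can be sketched as follows. The function name is hypothetical, and the model assumes at most one write per cycle, which the slides do not state. The ready cycles in the example are derived from the superscalar diagram under the assumption of one fetch per cycle and three operation cycles for floating point.

```python
def in_order_write_cycles(result_ready):
    """result_ready[i]: earliest cycle in which instruction i could write.

    Returns the actual write cycles when writes must follow program order.
    Assumption: at most one instruction writes per cycle.
    """
    writes = []
    for ready in result_ready:
        earliest = writes[-1] + 1 if writes else ready
        writes.append(max(ready, earliest))
    return writes


if __name__ == "__main__":
    # Fadd, Add, Fsub, Sub: earliest possible write cycles 6, 5, 8, 7.
    print(in_order_write_cycles([6, 5, 8, 7]))
```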
Consequences completion order
When an exception occurs:
• writes not necessarily in order of instructions: imprecise exceptions
• writes in order: precise exceptions
PowerPC pipeline
[Figure: PowerPC pipeline block diagram: instruction cache, instruction fetch, branch unit, instruction queue, dispatcher, execution units (LSU, IU, FPU), completion queue, store queue, data cache.]
Performance Effects (1)
• Execution time of a program: T
• Dynamic instruction count: N
• Number of cycles per instruction: S
• Clock rate: R
• Without pipelining: T = (N x S) / R
• With an n-stage pipeline: T' = T / n ???
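The formula T = (N x S) / R is simple to evaluate. In this sketch the instruction count and cycles per instruction are made-up numbers, and T / n is only the ideal bound that the question marks call into doubt.

```python
def execution_time(n_instructions, cycles_per_instruction, clock_rate_hz):
    """T = (N x S) / R, in seconds."""
    return n_instructions * cycles_per_instruction / clock_rate_hz


if __name__ == "__main__":
    N = 1_000_000   # hypothetical dynamic instruction count
    S = 4           # hypothetical cycles per instruction without pipelining
    R = 500e6       # 500 MHz clock
    stages = 4

    t = execution_time(N, S, R)
    print(f"without pipelining: {t * 1e3:.1f} ms")
    print(f"ideal {stages}-stage bound: {t / stages * 1e3:.1f} ms (ignores all stalls)")
```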
Performance Effects (2)
• Cycle time: 2 ns (R is 500 MHz)
• Cache hit (miss) ratio instructions: 0.95 (0.05)
• Cache hit (miss) ratio data: 0.90 (0.10)
• Fraction of instructions that need data from memory: 0.30
• Cache miss penalty: 17 cycles
• Average extra delay per instruction: (0.05 + 0.3 x 0.1) x 17 = 1.36 cycles, so slow down by a factor of more than 2!!
Performance Effects (3)
• On average, the fetch stage takes, due to instruction cache misses: 1 + (0.05 x 17) = 1.85 cycles
• On average, the decode stage takes, due to operand cache misses: 1 + (0.3 x 0.1 x 17) = 1.51 cycles
• For a total additional cost of 1.36 cycles
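The per-stage figures can be recomputed directly from the cache parameters. The constant names below are illustrative; the values are the ones given in the example.

```python
MISS_PENALTY = 17            # cycles
INSTR_MISS_RATIO = 0.05
DATA_MISS_RATIO = 0.10
DATA_ACCESS_FRACTION = 0.30  # fraction of instructions that need data from memory

# Extra cycles per instruction caused by each kind of miss.
fetch_extra = INSTR_MISS_RATIO * MISS_PENALTY
decode_extra = DATA_ACCESS_FRACTION * DATA_MISS_RATIO * MISS_PENALTY

print(f"fetch stage:  {1 + fetch_extra:.2f} cycles")
print(f"decode stage: {1 + decode_extra:.2f} cycles")
print(f"extra delay:  {fetch_extra + decode_extra:.2f} cycles per instruction")
```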
Performance Effects (4)
• If only one stage takes longer, the additional time should be counted relative to one stage, not relative to the complete instruction
• In other words: here, the pipeline is as slow as the slowest stage
[Figure: two F D O W timing rows comparing the stage lengths.]
Performance Effects (5)
• Delay of 1 cycle every 4 instructions in only one stage: average penalty: 0.25
• Average inter-completion time: (3x1 + 1x2)/4 = 1.25
[Figure: F D O W timing rows for I1 to I5.]
Performance Effects (6)
• Delays in two stages:
- k % of the instructions in one stage, penalty s cycles
- l % of the instructions in another stage, penalty t cycles
• Average inter-completion time: ((100-k-l) x 1 + k(1+s) + l(1+t))/100 = (100 + ks + lt)/100
• In example (k=5, l=3, s=t=17): 2.36
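The inter-completion formula covers both the one-stage and the two-stage case. A minimal sketch; the function name `avg_inter_completion` is hypothetical.

```python
def avg_inter_completion(k, s, l=0, t=0):
    """Average cycles between instruction completions.

    k, l: percentages of instructions delayed in two different stages.
    s, t: the corresponding penalties in cycles.
    """
    return (100 + k * s + l * t) / 100


if __name__ == "__main__":
    # One instruction in four (25 %) delayed by 1 cycle in a single stage.
    print(avg_inter_completion(k=25, s=1))
    # Two stages: k=5, l=3, s=t=17.
    print(avg_inter_completion(k=5, s=17, l=3, t=17))
```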
Performance Effects (7)
• Large number of pipeline stages seems advantageous, but:
- more instructions simultaneously being processed, so more opportunity for conflicts
- branch penalty becomes larger
- ALU is usually bottleneck, no use having smaller time steps