Design of Digital Circuits
Lecture 15: Pipelining
Prof. Onur Mutlu ETH Zurich Spring 2017 13 April 2017
Agenda for Today & Next Few Lectures
- Single-cycle Microarchitectures
- Multi-cycle and Microprogrammed Microarchitectures
- Pipelining
- Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, …
- Out-of-Order Execution
- Issues in OoO Execution: Load-Store Handling, …
Readings for This Week
- H&H, Chapter 7.5 (keep reading)
Wrap Up Microprogramming
Remember: An Exercise in Microprogramming
Handouts
- 7 pages of Microprogrammed LC-3b design
- https://www.ethz.ch/content/dam/ethz/special-interest/infk/inst-infsec/system-security-group-dam/education/Digitaltechnik_17/lecture/lc3b-figures.pdf
A Simple LC-3b Control and Datapath
[Figure C.2: A state machine for the LC-3b]
NOTES
- B+off6 : Base + SEXT[offset6]
- PC+off9 : PC + SEXT[offset9]
- *OP2 may be SR2 or SEXT[imm5]
- ** [15:8] or [7:0] depending on MAR[0]
[Figure C.3: The LC-3b data path]
… provide you with the additional flexibility of more states, so we have selected a control store consisting of 2^6 locations.
A Simple Datapath Can Become Very Powerful
[Figure C.5: The microsequencer of the LC-3b base machine]
… unused opcodes, the microarchitecture would execute a sequence of microinstructions, starting at state 10 or state 11, depending on which illegal opcode was being decoded. In both cases, the sequence of microinstructions would respond to the fact that an instruction with an illegal opcode had been fetched.
Several signals necessary to control the data path and the microsequencer are not among those listed in Tables C.1 and C.2. They are DR, SR1, BEN, and R. Figure C.6 shows the additional logic needed to generate DR, SR1, and BEN.
State Machine for LDW: State 18 (010010) → State 33 (100001) → State 35 (100011) → State 32 (100000) → State 6 (000110) → State 25 (011001) → State 27 (011011)
Microsequencer
[Figure C.6: Additional logic required to provide control signals: (a) DRMUX selecting DR from IR[11:9] or 111, (b) SR1MUX selecting SR1 from IR[11:9] or IR[8:6], (c) logic generating BEN from IR[11:9] and N, Z, P]
The remaining signal, R, is a signal generated by the memory in order to allow the LC-3b to operate correctly with a memory that takes multiple clock cycles to read or store a value.
Suppose it takes memory five cycles to read a value. That is, once MAR contains the address to be read and the microinstruction asserts READ, it will take five cycles before the contents of the specified location in memory are available to be loaded into MDR. (Note that the microinstruction asserts READ by means of three control signals: MIO.EN/YES, R.W/RD, and DATA.SIZE/WORD; see Figure C.3.)
Recall our discussion in Section C.2 of the function of state 33, which accesses an instruction from memory during the fetch phase of each instruction cycle. For the LC-3b to operate correctly, state 33 must execute five times before moving on to state 35. That is, until MDR contains valid data from the memory location specified by the contents of MAR, we want state 33 to continue to re-execute. After five clock cycles, the memory has completed the “read,” resulting in valid data in MDR, so the processor can move on to state 35. What if the microarchitecture did not wait for the memory to complete the read operation before moving on to state 35? Since the contents of MDR would still be garbage, the microarchitecture would put garbage into IR in state 35.
The ready signal (R) enables the memory read to execute correctly. Since the memory knows it needs five clock cycles to complete the read, it asserts a ready signal (R) throughout the fifth clock cycle. Figure C.2 shows that the next state is 33 (i.e., 100001) if the memory read will not complete in the current clock cycle and state 35 (i.e., 100011) if it will. As we have seen, it is the job of the microsequencer (Figure C.5) to produce the next state address.
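The wait in state 33 can be sketched in Python. This is a minimal sketch under stated assumptions, not the appendix's implementation: the names `next_state` and `fetch_wait` are hypothetical, and OR-ing the ready condition into bit 1 of J is inferred from the two state numbers quoted above (100001 versus 100011).

```python
def next_state(j: int, cond_ready: bool, r: bool) -> int:
    """Return the 6-bit address of the next state.

    j          -- J field of the current microinstruction (6 bits)
    cond_ready -- True if the microinstruction branches on "memory ready" (assumed flag)
    r          -- ready signal asserted by the memory
    """
    # Assumption: the ready condition is OR-ed into bit 1 of J,
    # which turns 100001 (state 33) into 100011 (state 35).
    ready_bit = 1 if (cond_ready and r) else 0
    return (j | (ready_bit << 1)) & 0b111111


def fetch_wait(memory_latency: int) -> list[int]:
    """Trace the states visited from state 33 until the memory read completes."""
    trace = []
    state = 33
    cycle = 1
    while state == 33:
        trace.append(state)
        r = cycle == memory_latency  # memory asserts R during its final cycle
        state = next_state(0b100001, cond_ready=True, r=r)
        cycle += 1
    trace.append(state)
    return trace


print(fetch_wait(5))
```

With a five-cycle memory, the trace contains state 33 five times before state 35 appears, matching the behavior described in the text.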
[Figure C.4: The control structure of a microprogrammed implementation, overall block diagram (microsequencer with inputs R, BEN, and IR[15:11]; control store; microinstruction with its J, COND, and IRD fields fed back to the microsequencer)]
… on the LC-3b instruction being executed during the current instruction cycle. This state carries out the DECODE phase of the instruction cycle. If the IRD control signal in the microinstruction corresponding to state 32 is 1, the output MUX of the microsequencer (Figure C.5) will take its source from the six bits formed by 00 concatenated with the four opcode bits IR[15:12]. Since IR[15:12] specifies the opcode of the current LC-3b instruction being processed, the next address of the control store will be one of 16 addresses, corresponding to the 14 opcodes plus the two unused opcodes, IR[15:12] = 1010 and 1011. That is, each of the 16 next states is the first state to be carried out after the instruction has been decoded in state 32. For example, if the instruction being processed is ADD, the address of the next state is state 1, whose microinstruction is stored at location 000001. Recall that IR[15:12] for ADD is 0001.
If, somehow, the instruction inadvertently contained IR[15:12] = 1010 or 1011, the …
Simple Design of the Control Structure
[Figure C.5: The microsequencer of the LC-3b base machine]
[Figure C.7: Specification of the control store. Rows: control store locations 000000 (State 0) through 111111 (State 63). Columns: IRD, Cond, J, LD.MAR, LD.MDR, LD.IR, LD.BEN, LD.REG, LD.CC, LD.PC, GatePC, GateMDR, GateALU, GateMARMUX, GateSHF, PCMUX, DRMUX, SR1MUX, ADDR1MUX, ADDR2MUX, MARMUX, ALUK, MIO.EN, R.W, DATA.SIZE, LSHF1]
End of the Exercise in Microprogramming
Variable-Latency Memory
- The ready signal (R) enables memory read/write to execute correctly
  - Example: transition from state 33 to state 35 is controlled by the R bit asserted by memory when memory data is available
- Could we have done this in a single-cycle microarchitecture?
- What did we assume about memory and registers in a single-cycle microarchitecture?
The Microsequencer: Advanced Questions
- What happens if the machine is interrupted?
- What if an instruction generates an exception?
- How can you implement a complex instruction using this control structure?
  - Think REP MOVS instruction in x86
The Power of Abstraction
- The concept of a control store of microinstructions provides the hardware designer with a new abstraction: microprogramming
- The designer can translate any desired operation to a sequence of microinstructions
- All the designer needs to provide is
  - The sequence of microinstructions needed to implement the desired operation
  - The ability for the control logic to correctly sequence through the microinstructions
  - Any additional datapath elements and control signals needed (no need if the operation can be “translated” into existing control signals)
Let’s Do Some More Microprogramming
- Implement REP MOVS in the LC-3b microarchitecture
- What changes, if any, do you make to the
  - state machine?
  - datapath?
  - control store?
  - microsequencer?
- Show all changes and microinstructions
- Extra Credit Assignment
x86 REP MOVS (String Copy) Instruction
REP MOVS (DEST SRC)
How many instructions does this take in MIPS ISA?
How many microinstructions does this take to add to the LC-3b microarchitecture?
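To make the two questions above concrete, the behavior that has to be implemented can be sketched in Python. This is a minimal sketch assuming the byte-sized variant of MOVS with ascending addresses; `rep_movsb` and its parameters are hypothetical stand-ins for the x86 count, source, and destination registers.

```python
def rep_movsb(mem: bytearray, dest: int, src: int, count: int) -> tuple[int, int, int]:
    """Copy `count` bytes from `src` to `dest`, one byte per iteration.

    Returns the final (dest, src, count), mirroring how the instruction
    leaves its registers updated when it finishes.
    """
    while count != 0:
        mem[dest] = mem[src]  # move one element
        src += 1              # advance the source pointer
        dest += 1             # advance the destination pointer
        count -= 1            # one fewer repetition left
    return dest, src, count


memory = bytearray(b"hello" + bytes(5))
final = rep_movsb(memory, dest=5, src=0, count=5)
print(final, bytes(memory))
```

Each loop iteration is a load, a store, two pointer updates, a counter update, and a branch, which is the work a simple ISA would spell out as separate instructions and a microprogrammed design would spell out as a loop of microinstructions.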
Aside: Alignment Correction in Memory
- Unaligned accesses
- LC-3b has byte load and byte store instructions that move data not aligned at the word-address boundary
  - Convenience to the programmer/compiler
- How does the hardware ensure this works correctly?
  - Take a look at state 29 for LDB
  - States 24 and 17 for STB
  - Additional logic to handle unaligned accesses
- P&P, Revised Appendix C.5
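The byte-load case can be sketched in Python. This is a minimal sketch, not the LC-3b logic itself: `sext8` and `ldb` are hypothetical names, memory is modeled as a dictionary of 16-bit words keyed by even address, and the choice that an odd address selects bits [15:8] is an assumption about byte ordering.

```python
def sext8(byte: int) -> int:
    """Sign-extend an 8-bit value to 16 bits."""
    return (byte | 0xFF00) if (byte & 0x80) else byte


def ldb(mem_words: dict[int, int], mar: int) -> int:
    """Load one byte from a word-organized memory and sign-extend it."""
    # State 29: read the aligned word, i.e. M[MAR[15:1]'0]
    mdr = mem_words[mar & 0xFFFE]
    # Following state: select [15:8] or [7:0] depending on MAR[0] (assumed ordering)
    byte = ((mdr >> 8) & 0xFF) if (mar & 1) else (mdr & 0xFF)
    return sext8(byte)


words = {0x3000: 0x807F}
print(hex(ldb(words, 0x3000)), hex(ldb(words, 0x3001)))
```

The memory itself only ever sees word-aligned addresses; the byte selection and sign extension are the additional logic around it.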
Aside: Memory Mapped I/O
- Address control logic determines whether the address specified by an LDW or STW refers to memory or to an I/O device
- Correspondingly enables memory or I/O devices and sets up muxes
- An instance where the final control signals of some datapath elements (e.g., MEM.EN or INMUX/2) cannot be stored in the control store
  - These signals are dependent on memory address
- P&P, Revised Appendix C.6
Advantages of Microprogrammed Control
- Allows a very simple design to do powerful computation by controlling the datapath (using a sequencer)
  - High-level ISA translated into microcode (sequence of u-instructions)
  - Microcode (u-code) enables a minimal datapath to emulate an ISA
  - Microinstructions can be thought of as a user-invisible ISA (u-ISA)
- Enables easy extensibility of the ISA
  - Can support a new instruction by changing the microcode
  - Can support complex instructions as a sequence of simple microinstructions (e.g., REP MOVS, INC [MEM])
- Enables update of machine behavior
  - A buggy implementation of an instruction can be fixed by changing the microcode in the field
    - Easier if datapath provides ability to do the same thing in different ways
Update of Machine Behavior
- The ability to update/patch microcode in the field (after a processor is shipped) enables
  - Ability to add new instructions without changing the processor!
  - Ability to “fix” buggy hardware implementations
- Examples
  - IBM 370 Model 145: microcode stored in main memory, can be updated after a reboot
  - IBM System z: Similar to 370/145.
    - Heller and Farrell, “Millicode in an IBM zSeries processor,” IBM JR&D, May/Jul 2004.
  - B1700 microcode can be updated while the processor is running
    - User-microprogrammable machine!
    - Wilner, “Microprogramming environment on the Burroughs B1700”, CompCon 1972.
Multi-Cycle vs. Single-Cycle uArch
- Advantages
- Disadvantages
- For you to fill in
Can We Do Better?
Can We Do Better?
- What limitations do you see with the multi-cycle design?
- Limited concurrency
  - Some hardware resources are idle during different phases of instruction processing cycle
  - “Fetch” logic is idle when an instruction is being “decoded” or “executed”
  - Most of the datapath is idle when a memory access is happening
Can We Use the Idle Hardware to Improve Concurrency?
- Goal: More concurrency → Higher instruction throughput (i.e., more “work” completed in one cycle)
- Idea: When an instruction is using some resources in its processing phase, process other instructions on idle resources not needed by that instruction
  - E.g., when an instruction is being decoded, fetch the next instruction
  - E.g., when an instruction is being executed, decode another instruction
  - E.g., when an instruction is accessing data memory (ld/st), execute the next instruction
  - E.g., when an instruction is writing its result into the register file, access data memory for the next instruction
Pipelining
Pipelining: Basic Idea
- More systematically:
  - Pipeline the execution of multiple instructions
  - Analogy: “Assembly line processing” of instructions
- Idea:
  - Divide the instruction processing cycle into distinct “stages” of processing
  - Ensure there are enough hardware resources to process one instruction in each stage
  - Process a different instruction in each stage
    - Instructions consecutive in program order are processed in consecutive stages
- Benefit: Increases instruction processing throughput (1/CPI)
- Downside: Start thinking about this…
Example: Execution of Four Independent ADDs
- Multi-cycle: 4 cycles per instruction
- Pipelined: 4 cycles per 4 instructions (steady state)
[Figure: two timing diagrams (time on the horizontal axis) showing the F, D, E, W stages of the four ADDs, multi-cycle versus pipelined]
Is life always this beautiful?
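The cycle counts behind this example can be checked with a short Python sketch. It assumes independent instructions and no stalls; the function names are hypothetical.

```python
def multicycle_cycles(n_instr: int, n_stages: int) -> int:
    """Each instruction occupies the whole machine for n_stages cycles."""
    return n_instr * n_stages


def pipelined_cycles(n_instr: int, n_stages: int) -> int:
    """The first instruction takes n_stages cycles to drain through the pipeline;
    every later instruction finishes one cycle after its predecessor."""
    return n_stages + (n_instr - 1)


print(multicycle_cycles(4, 4), pipelined_cycles(4, 4))
```

The pipelined count includes the fill time of the first instruction. Once the pipeline is full, one instruction completes every cycle, which is the steady-state rate of 4 instructions per 4 cycles quoted above.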
The Laundry Analogy
- “place one dirty load of clothes in the washer”
- “when the washer is finished, place the wet load in the dryer”
- “when the dryer is finished, take out the dry load and fold”
- “when folding is finished, ask your roommate (??) to put the clothes away”
- steps to do a load are sequentially dependent
- no dependence between different loads
- different steps do not share resources
[Figure: laundry timeline for loads A, B, C, D in task order; time axis from 6 PM to 2 AM]
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Pipelining Multiple Loads of Laundry
[Figure: laundry timelines for loads A, B, C, D in task order, sequential versus pipelined; time axis from 6 PM to 2 AM]
- latency per load is the same
- throughput increased by 4
- 4 loads of laundry in parallel
- no additional resources
Pipelining Multiple Loads of Laundry: In Practice
[Figure: pipelined laundry timelines for loads A, B, C, D in task order; time axis from 6 PM to 2 AM]
the slowest step decides throughput
Pipelining Multiple Loads of Laundry: In Practice
[Figure: pipelined laundry timelines for loads A, B, C, D in task order, with two dryers (A and B) used alternately; time axis from 6 PM to 2 AM]
throughput restored (2 loads per hour) using 2 dryers
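The effect of the slowest step, and of duplicating it, can be sketched in Python. The step durations below are hypothetical values chosen so that the result matches the "2 loads per hour" figure on the slide; the function name `loads_per_hour` is also an assumption.

```python
from typing import Optional


def loads_per_hour(step_minutes: dict[str, float],
                   units: Optional[dict[str, int]] = None) -> float:
    """Steady-state throughput of a laundry pipeline.

    step_minutes -- time each step needs for one load
    units        -- how many copies of each resource exist (default: one each)
    """
    units = units or {}
    # A step with n copies can accept a new load every (time / n) minutes,
    # and the slowest such interval sets the pace for the whole pipeline.
    slowest = max(minutes / units.get(step, 1)
                  for step, minutes in step_minutes.items())
    return 60.0 / slowest


steps = {"wash": 30, "dry": 60, "fold": 30, "put away": 30}  # hypothetical minutes
print(loads_per_hour(steps), loads_per_hour(steps, {"dry": 2}))
```

Adding a second dryer does not shorten the drying step for any single load; it only lets a new load enter drying every 30 minutes instead of every 60.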
An Ideal Pipeline
- Goal: Increase throughput with little increase in cost (hardware cost, in case of instruction processing)
- Repetition of identical operations
  - The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
- Repetition of independent operations
  - No dependencies between repeated operations
- Uniformly partitionable suboperations
  - Processing can be evenly divided into uniform-latency suboperations (that do not share resources)
- Fitting examples: automobile assembly line, doing laundry
  - What about the instruction processing “cycle”?
Ideal Pipelining
- Combinational logic (F,D,E,M,W), T psec: BW = ~(1/T)
- Two stages, T/2 ps (F,D,E) and T/2 ps (M,W): BW = ~(2/T)
- Three stages, T/3 ps (F,D), T/3 ps (E,M), and T/3 ps (M,W): BW = ~(3/T)
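More Realistic Pipeline: Throughput
- Nonpipelined version with delay T
  - BW = 1/(T + S) where S = latch delay
- k-stage pipelined version
  - BW_{k-stage} = 1/(T/k + S)
  - BW_{max} = 1/(1 gate delay + S)
[Figure: T ps of combinational logic followed by a latch, versus k stages of T/k ps each, separated by latches]
Latch delay reduces throughput (switching overhead b/w stages)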
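More Realistic Pipeline: Cost
- Nonpipelined version with combinational cost G
  - Cost = G + L where L = latch cost
- k-stage pipelined version
  - Cost_{k-stage} = G + Lk
[Figure: G gates followed by a latch, versus k stages of G/k gates each, separated by latches]
Latches increase hardware cost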
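The throughput and cost expressions of the more realistic model can be evaluated together in a short Python sketch. The numeric values for T, S, G, and L are hypothetical and only illustrate the trend; the function names are assumptions.

```python
def bandwidth(t: float, s: float, k: int = 1) -> float:
    """Throughput of a k-stage pipeline: 1 / (T/k + S). k = 1 is the nonpipelined case."""
    return 1.0 / (t / k + s)


def cost(g: float, latch: float, k: int = 1) -> float:
    """Hardware cost of a k-stage pipeline: G + L*k."""
    return g + latch * k


T, S = 800.0, 50.0      # hypothetical logic delay and latch delay, in ps
G, L = 10_000.0, 500.0  # hypothetical combinational cost and latch cost

speedup = bandwidth(T, S, k=5) / bandwidth(T, S)  # (T + S) / (T/5 + S)
cost_ratio = cost(G, L, k=5) / cost(G, L)         # (G + 5L) / (G + L)
print(round(speedup, 2), round(cost_ratio, 2))
```

With these numbers the five-stage version is faster by less than a factor of 5 and costs more than the nonpipelined version, because the latch term S does not shrink with k while the latch cost L grows with it.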
Pipelining Instruction Processing
Remember: The Instruction Processing Cycle
- Fetch
- Decode
- Evaluate Address
- Fetch Operands
- Execute
- Store Result
1. Instruction fetch (IF)
2. Instruction decode and register operand fetch (ID/RF)
3. Execute/Evaluate memory address (EX/AG)
4. Memory operand fetch (MEM)
5. Store/writeback result (WB)
Remember the Single-Cycle Uarch
[Figure: single-cycle MIPS datapath with control signals RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite; PCSrc1 = Jump, PCSrc2 = Br Taken; total delay T, BW = ~(1/T)]
Dividing Into Stages
[Figure: the datapath divided into five stages; stage latencies from left to right: 200ps, 100ps, 200ps, 200ps, 100ps; annotations: "RF write", "ignore for now"]
- IF: Instruction fetch
- ID: Instruction decode/register file read
- EX: Execute/address calculation
- MEM: Memory access
- WB: Write back
Is this the correct partitioning? Why not 4 or 6 stages? Why not different boundaries?
Instruction Pipeline Throughput
[Figure: execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) in program order, each passing through instruction fetch, Reg, ALU, data access, Reg. Nonpipelined: a new instruction starts every 800ps. Pipelined: a new instruction starts every 200ps, with every stage allotted 200ps.]
5-stage speedup is 4, not 5 as predicted by the ideal model. Why?
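One contributing factor can be checked numerically with a small Python sketch. It assumes the stage latencies 200ps, 100ps, 200ps, 200ps, 100ps map onto IF, ID, EX, MEM, WB in that order and ignores latch overhead; the names below are hypothetical.

```python
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}  # assumed mapping


def single_cycle_period(stages: dict[str, int]) -> int:
    """Without pipelining, one instruction needs the sum of all stage latencies."""
    return sum(stages.values())


def pipelined_period(stages: dict[str, int]) -> int:
    """With pipelining, the clock must accommodate the slowest stage."""
    return max(stages.values())


speedup = single_cycle_period(STAGE_PS) / pipelined_period(STAGE_PS)
print(single_cycle_period(STAGE_PS), pipelined_period(STAGE_PS), speedup)
```

The stages are not uniform: two of them need only 100ps but are still given a 200ps clock period, so the pipelined period is 200ps rather than the 160ps an even five-way split of 800ps would allow.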
Enabling Pipelined Processing: Pipeline Registers
[Figure: the five-stage datapath (IF: Instruction fetch, ID: Instruction decode/register file read, EX: Execute/address calculation, MEM: Memory access, WB: Write back) before and after adding the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Values held in the pipeline registers: PC_F, IR_D, PC_D+4, PC_E+4, A_E, B_E, Imm_E, nPC_M, Aout_M, B_M, MDR_W, Aout_W. The total delay T is split into stages of T/k ps.]
No resource is used by more than 1 stage!
Pipelined Operation Example
[Figure: an lw instruction moving through the pipelined datapath, one diagram per stage: instruction fetch, instruction decode, execution, memory, write back]
All instruction classes must follow the same path and timing through the pipeline stages.
Any performance impact?
Pipelined Operation Example
[Figure: lw $10, 20($1) followed by sub $11, $2, $3 in the pipelined datapath, one diagram per clock cycle]
- Clock 1: lw in instruction fetch
- Clock 2: lw in instruction decode, sub in instruction fetch
- Clock 3: lw in execution, sub in instruction decode
- Clock 4: lw in memory, sub in execution
- Clock 5: lw in write back, sub in memory
- Clock 6: sub in write back
Is life always this beautiful?
Illustrating Pipeline Operation: Operation View
        t0   t1   t2   t3   t4   t5
Inst0   IF   ID   EX   MEM  WB
Inst1        IF   ID   EX   MEM  WB
Inst2             IF   ID   EX   MEM
Inst3                  IF   ID   EX
Inst4                       IF   ID
steady state (full pipeline)
Illustrating Pipeline Operation: Resource View
      t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
IF    I0   I1   I2   I3   I4   I5   I6   I7   I8   I9   I10
ID         I0   I1   I2   I3   I4   I5   I6   I7   I8   I9
EX              I0   I1   I2   I3   I4   I5   I6   I7   I8
MEM                  I0   I1   I2   I3   I4   I5   I6   I7
WB                        I0   I1   I2   I3   I4   I5   I6
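The resource view can be generated mechanically, which makes the pattern explicit: in cycle t, the stage at depth d holds instruction t - d. The Python sketch below assumes no stalls; `resource_view` is a hypothetical name.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def resource_view(n_instr: int, n_cycles: int) -> dict[str, list[str]]:
    """For each stage, list which instruction occupies it in every cycle.

    An empty string marks a cycle in which the stage is idle.
    """
    view = {}
    for depth, stage in enumerate(STAGES):
        row = []
        for t in range(n_cycles):
            i = t - depth  # instruction I_i entered IF in cycle i
            row.append(f"I{i}" if 0 <= i < n_instr else "")
        view[stage] = row
    return view


for stage, row in resource_view(n_instr=11, n_cycles=11).items():
    print(f"{stage:4}", " ".join(f"{cell:4}" for cell in row))
```

From cycle t4 onward every stage is occupied, which is the steady state (full pipeline) condition.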
Control Points in a Pipeline
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB and the control points PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg]
Identical set of control points as the single-cycle datapath!!
Control Signals in a Pipeline
- For a given instruction
  - same control signals as single-cycle, but
  - control signals required at different cycles, depending on stage
  ⇒ Option 1: decode once using the same logic as single-cycle and buffer signals until consumed
  ⇒ Option 2: carry relevant “instruction word/field” down the pipeline and decode locally within each stage or in a previous stage
Which one is better?
[Figure: control decoded from the instruction held in IF/ID; the EX, M, and WB signal groups are carried in ID/EX, the M and WB groups in EX/MEM, and the WB group in MEM/WB]
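Option 1 can be sketched in Python as signal groups that are decoded once and then handed from one pipeline register to the next. This is a minimal sketch, not the lecture's design: the `decode` table and `carry` helper are hypothetical, and the signal values follow the conventional P&H-style encoding for lw and an R-type instruction, which the slide itself does not list.

```python
def decode(opcode: str) -> dict[str, dict[str, int]]:
    """Hypothetical decoder: produces all three signal groups once, in the ID stage."""
    table = {
        "lw": {
            "EX": {"ALUSrc": 1, "RegDst": 0},
            "M": {"MemRead": 1, "MemWrite": 0},
            "WB": {"RegWrite": 1, "MemtoReg": 1},
        },
        "sub": {
            "EX": {"ALUSrc": 0, "RegDst": 1},
            "M": {"MemRead": 0, "MemWrite": 0},
            "WB": {"RegWrite": 1, "MemtoReg": 0},
        },
    }
    return table[opcode]


def carry(signals: dict[str, dict[str, int]], consumed: str) -> dict[str, dict[str, int]]:
    """Drop the group used by the current stage; pass the rest to the next register."""
    return {group: values for group, values in signals.items() if group != consumed}


id_ex = decode("lw")         # ID/EX holds the EX, M, and WB groups
ex_mem = carry(id_ex, "EX")  # EX/MEM holds the M and WB groups
mem_wb = carry(ex_mem, "M")  # MEM/WB holds only the WB group
print(list(id_ex), list(ex_mem), list(mem_wb))
```

Each pipeline register gets narrower as groups are consumed, which is the buffering cost of Option 1; Option 2 would instead carry instruction fields and pay for decode logic in later stages.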
Pipelined Control Signals
[Figure: pipelined datapath with a control unit in the decode stage; the EX, M, and WB control groups travel through ID/EX, EX/MEM, and MEM/WB to drive RegDst, ALUOp, ALUSrc, Branch, MemRead, MemWrite, PCSrc, MemtoReg, and RegWrite in the stage where each is used]
Another Example: Single-Cycle and Pipelined
[Figure: a single-cycle datapath and its pipelined version with stages Fetch, Decode, Execute, Memory, Writeback; signals in the pipelined version carry a stage suffix (e.g., PCF, InstrD, SrcAE, SrcBE, WriteDataE, ALUOutM, WriteDataM, ReadDataW, ALUOutW, ResultW)]
Carnegie Mellon
55
Another Example: Correct Pipelined Datapath
- WriteReg must arrive at the same time as Result
[Figure: corrected pipelined datapath with Fetch, Decode, Execute, Memory, and Writeback stages. WriteRegE is carried through the pipeline registers as WriteRegM and WriteRegW, so the register file write address (A3) and the write data (ResultW) belong to the same instruction.]
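Why WriteReg must travel with Result can be seen in a toy model. The sketch below assumes a five-stage pipeline with no stalls, so the instruction in Writeback is two instructions older than the one in Execute. The function name `writeback` and the sample values are hypothetical.

```python
def writeback(program, correct):
    """Return the register file after all instructions retire.

    program: list of (dest_register, result) in program order.
    correct=True  : write address is WriteRegW (travels with the instruction).
    correct=False : write address is WriteRegE (taken from the Execute stage).
    """
    regs = {}
    n = len(program)
    for cycle in range(n + 4):
        w = cycle - 4  # index of the instruction in Writeback this cycle
        e = cycle - 2  # index of the instruction in Execute this cycle
        if not 0 <= w < n:
            continue
        result_w = program[w][1]
        if correct:
            addr = program[w][0]
        elif 0 <= e < n:
            addr = program[e][0]  # destination of a younger instruction
        else:
            continue  # assumption: with nothing in Execute, the write is lost
        regs[addr] = result_w
    return regs
```

With `program = [("$s0", 10), ("$t0", 20), ("$t1", 30), ("$t2", 40)]`, the correct version fills all four registers as intended, while the uncorrected one writes 10 into `$t1` and 20 into `$t2`.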
Another Example: Pipelined Control
[Figure: pipelined datapath with control. The control unit decodes Op (31:26) and Funct (5:0) in the Decode stage and produces RegWriteD, MemtoRegD, MemWriteD, BranchD, ALUControlD, ALUSrcD, and RegDstD. Each signal is carried through the pipeline registers only as far as the stage that uses it: ALUControlE, ALUSrcE, and RegDstE in Execute; MemWriteM and BranchM (with ZeroM, forming PCSrcM) in Memory; RegWriteW and MemtoRegW in Writeback.]
- Same control unit as single-cycle processor
- Control delayed to proper pipeline stage
Remember: An Ideal Pipeline
- Goal: Increase throughput with little increase in cost (hardware cost, in case of instruction processing)
- Repetition of identical operations
  - The same operation is repeated on a large number of different inputs (e.g., all laundry loads go through the same steps)
- Repetition of independent operations
  - No dependencies between repeated operations
- Uniformly partitionable suboperations
  - Processing can be evenly divided into uniform-latency suboperations (that do not share resources)
- Fitting examples: automobile assembly line, doing laundry
  - What about the instruction processing "cycle"?
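The throughput gain of an ideal pipeline can be worked out with a few lines of arithmetic. This sketch assumes exactly the three conditions above (identical, independent operations split into equal-latency stages) and ignores pipeline register overhead. The function names and the numbers in the usage note are illustrative.

```python
def unpipelined_time(n_ops, op_latency):
    """Total time when each operation finishes before the next one starts."""
    return n_ops * op_latency


def ideal_pipelined_time(n_ops, op_latency, n_stages):
    """Total time for an ideal pipeline with equal-latency stages.

    The first operation needs n_stages steps to come out; every later
    operation completes one step after its predecessor.
    """
    stage_latency = op_latency / n_stages
    return (n_stages + n_ops - 1) * stage_latency


def speedup(n_ops, op_latency, n_stages):
    return unpipelined_time(n_ops, op_latency) / ideal_pipelined_time(
        n_ops, op_latency, n_stages)
```

For 1000 operations of latency 4 split into 4 stages, the unpipelined time is 4000 and the pipelined time is 1003, a speedup just under 4. As the number of operations grows, the speedup approaches the number of stages.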
Instruction Pipeline: Not An Ideal Pipeline
- Identical operations ... NOT!
  ⇒ different instructions → not all need the same stages
  Forcing different instructions to go through the same pipe stages
  → external fragmentation (some pipe stages idle for some instructions)
- Uniform suboperations ... NOT!
  ⇒ different pipeline stages → not the same latency
  Need to force each stage to be controlled by the same clock
  → internal fragmentation (some pipe stages are too fast but all take the same clock cycle time)
- Independent operations ... NOT!
  ⇒ instructions are not independent of each other
  Need to detect and resolve inter-instruction dependencies to ensure the pipeline provides correct results
  → pipeline stalls (pipeline is not always moving)
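Internal fragmentation can be made concrete with a small calculation. The stage latencies below are hypothetical numbers chosen for illustration. The only rule taken from the slide is that every stage is driven by the same clock, so the slowest stage sets the clock period.

```python
# Hypothetical stage latencies in picoseconds (not measured values).
STAGE_LATENCY_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def clock_period(latencies):
    """All stages share one clock, so the slowest stage sets the period."""
    return max(latencies.values())


def internal_fragmentation(latencies):
    """Idle time per stage in every cycle: period minus the stage's own latency."""
    period = clock_period(latencies)
    return {stage: period - t for stage, t in latencies.items()}


def steady_state_speedup(latencies):
    """Speedup over a single-cycle design whose cycle is the sum of all stages."""
    return sum(latencies.values()) / clock_period(latencies)
```

With these numbers the clock period is 200 ps, the ID and WB stages each sit idle for 100 ps per cycle, and the steady-state speedup over single-cycle is 4 rather than the 5 that five perfectly balanced stages would give.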
Issues in Pipeline Design
- Balancing work in pipeline stages
  - How many stages and what is done in each stage
- Keeping the pipeline correct, moving, and full in the presence of events that disrupt pipeline flow
  - Handling dependences
    - Data
    - Control
  - Handling resource contention
  - Handling long-latency (multi-cycle) operations
- Handling exceptions, interrupts
- Advanced: Improving pipeline throughput
  - Minimizing stalls
Causes of Pipeline Stalls
- Stall: A condition when the pipeline stops moving
- Resource contention
- Dependences (between instructions)
  - Data
  - Control
- Long-latency (multi-cycle) operations
Dependences and Their Types
- Also called "dependency" or less desirably "hazard"
- Dependences dictate ordering requirements between instructions
- Two types
  - Data dependence
  - Control dependence
- Resource contention is sometimes called resource dependence
  - However, this is not fundamental to (dictated by) program semantics, so we will treat it separately
Handling Resource Contention
- Happens when instructions in two pipeline stages need the same resource
- Solution 1: Eliminate the cause of contention
  - Duplicate the resource or increase its throughput
    - E.g., use separate instruction and data memories (caches)
    - E.g., use multiple ports for memory structures
- Solution 2: Detect the resource contention and stall one of the contending stages
  - Which stage do you stall?
  - Example: What if you had a single read and write port for the register file?
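The two solutions can be compared with a toy model of instruction fetch and data access competing for memory. The sketch assumes a five-stage pipeline in which an instruction reaches the Memory stage three cycles after it is fetched, and it assumes the older instruction wins the port, so the Fetch stage is the one that stalls. That policy is one possible answer to "which stage do you stall?", not the only one. The name `fetch_cycles` is hypothetical.

```python
def fetch_cycles(uses_memory, shared_port):
    """Return the cycle in which each instruction is fetched.

    uses_memory: list of booleans, True if the instruction accesses data memory.
    shared_port: True models one single-ported memory for instructions and data
                 (Solution 2: detect and stall); False models separate
                 instruction and data memories (Solution 1: no contention).
    """
    busy = set()  # cycles in which the Memory stage owns the single port
    cycles = []
    cycle = 0
    for is_mem in uses_memory:
        while shared_port and cycle in busy:
            cycle += 1  # stall Fetch: the older instruction keeps the port
        cycles.append(cycle)
        if is_mem:
            busy.add(cycle + 3)  # Fetch, Decode, Execute, then Memory
        cycle += 1
    return cycles
```

For a load followed by four ALU instructions, `fetch_cycles([True, False, False, False, False], shared_port=True)` returns `[0, 1, 2, 4, 5]`: the fourth fetch is delayed by one cycle because the load is using the memory. With `shared_port=False` the result is `[0, 1, 2, 3, 4]`.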
Example Resource Dependence: RegFile
- The register file can be read and written in the same cycle:
  - write takes place during the 1st half of the cycle
  - read takes place during the 2nd half of the cycle => no problem!!!
  - However, operations that involve the register file have only half a clock cycle to complete the operation!!
[Figure: pipeline timing diagram. IM = instruction memory, RF = register file, DM = data memory; the ALU operation is shown by its symbol.]
Time (cycles)        1   2   3   4   5   6   7   8
add $s0, $s2, $s3    IM  RF  +   DM  RF
and $t0, $s0, $s1        IM  RF  &   DM  RF
or  $t1, $s4, $s0            IM  RF  |   DM  RF
sub $t2, $s0, $s5                IM  RF  -   DM  RF
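The half-cycle convention can be sketched with a tiny register file model. In the diagram, `add` writes `$s0` in cycle 5 and `sub` reads `$s0` in the same cycle. The class below is an illustrative assumption, with a `write_first` flag added only to contrast the two orderings. The value 7 is made up.

```python
class RegisterFile:
    """Toy register file with one write and any number of reads per cycle."""

    def __init__(self):
        self.regs = {}

    def cycle(self, write=None, reads=(), write_first=True):
        """Simulate one clock cycle and return the values read.

        write: optional (register, value) pair.
        write_first=True : write in the 1st half, read in the 2nd half.
        write_first=False: read before the write lands (stale value).
        Unwritten registers read as 0 in this model.
        """
        if write_first and write is not None:
            self.regs[write[0]] = write[1]
        values = [self.regs.get(r, 0) for r in reads]
        if not write_first and write is not None:
            self.regs[write[0]] = write[1]
        return values
```

Calling `RegisterFile().cycle(write=("$s0", 7), reads=("$s0", "$s5"))` returns `[7, 0]`, so `sub` sees the value that `add` produced. With `write_first=False` the same call returns `[0, 0]`, the stale value that the half-cycle scheme avoids.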