Multicycle Datapath Implementation - UCSBstrukov/ece154Winter2012/... · 2012-02-03 · andi1: x...
Multicycle Datapath Implementation
[Adapted from instructor’s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK]
and Computer Architecture: From Microprocessors to Supercomputers, B. Parhami, Oxford University Press, 2005
Review: A Multicycle Data Path
[Figure: Fig. 14.2 Abstract view of a multicycle instruction execution unit for MicroMIPS. For naming of instruction fields, see Fig. 13.1. Blocks shown: PC, Cache, Inst Reg, Data Reg, Reg file, x Reg, y Reg, ALU, z Reg, and Control (inputs op, fn). Instruction fields shown: jta, imm, rs, rt, rd.]
Feb. 2011 Computer Architecture, Data Path and Control Slide 2
Review: Microprogramming

[Figure: control state machine. States 0 and 1 occupy Cycles 1 and 2; States 2, 5, and 7 occupy Cycle 3; States 3, 6, and 8 occupy Cycle 4; State 4 occupies Cycle 5.]

State 0 (Start): InstData = 0, MemRead = 1, IRWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘+’, PCSrc = 3, PCWrite = 1
State 1: ALUSrcX = 0, ALUSrcY = 3, ALUFunc = ‘+’
State 2 (lw/sw): ALUSrcX = 1, ALUSrcY = 2, ALUFunc = ‘+’
State 3 (lw): InstData = 1, MemRead = 1
State 4: RegDst = 0, RegInSrc = 0, RegWrite = 1
State 5 (Jump/Branch): ALUSrcX = 1, ALUSrcY = 1, ALUFunc = ‘−’, JumpAddr = %, PCSrc = @, PCWrite = #
State 6 (sw): InstData = 1, MemWrite = 1
State 7 (ALU-type): ALUSrcX = 1, ALUSrcY = 1 or 2, ALUFunc = Varies
State 8: RegDst = 0 or 1, RegInSrc = 1, RegWrite = 1

Transitions: State 0 to State 1; State 1 to State 2 (lw/sw), State 5 (Jump/Branch), or State 7 (ALU-type); State 2 to State 3 (lw) or State 6 (sw); State 3 to State 4; State 7 to State 8.

Notes for State 5:
% 0 for j or jal, 1 for syscall, don’t-care for other instr’s
@ 0 for j, jal, and syscall, 1 for jr, 2 for branches
# 1 for j, jr, jal, and syscall, ALUZero (its complement) for beq (bne), bit 31 of ALUout for bltz
For jal, RegDst = 2, RegInSrc = 1, RegWrite = 1
Note for State 7: ALUFunc is determined based on the op and fn fields

[Figure: microinstruction fields. PC control: JumpAddr, PCSrc, PCWrite. Cache control: InstData, MemRead, MemWrite, IRWrite. Register control: RegInSrc, RegDst, RegWrite. ALU inputs: ALUSrcX, ALUSrcY. ALU function: FnType, LogicFn, AddSub. Sequence control.]

Microprogram:
fetch:    PCnext, CacheFetch          # State 0 (start)
          PC + 4imm, μPCdisp1         # State 1
lui1:     lui(imm)                    # State 7lui
          rt ← z, μPCfetch            # State 8lui
add1:     x + y                       # State 7add
          rd ← z, μPCfetch            # State 8add
sub1:     x - y                       # State 7sub
          rd ← z, μPCfetch            # State 8sub
slt1:     x - y                       # State 7slt
          rd ← z, μPCfetch            # State 8slt
addi1:    x + imm                     # State 7addi
          rt ← z, μPCfetch            # State 8addi
slti1:    x - imm                     # State 7slti
          rt ← z, μPCfetch            # State 8slti
and1:     x ∧ y                       # State 7and
          rd ← z, μPCfetch            # State 8and
or1:      x ∨ y                       # State 7or
          rd ← z, μPCfetch            # State 8or
xor1:     x ⊕ y                       # State 7xor
          rd ← z, μPCfetch            # State 8xor
nor1:     ¬(x ∨ y)                    # State 7nor
          rd ← z, μPCfetch            # State 8nor
andi1:    x ∧ imm                     # State 7andi
          rt ← z, μPCfetch            # State 8andi
ori1:     x ∨ imm                     # State 7ori
          rt ← z, μPCfetch            # State 8ori
xori1:    x ⊕ imm                     # State 7xori
          rt ← z, μPCfetch            # State 8xori
lwsw1:    x + imm, μPCdisp2           # State 2
lw2:      CacheLoad                   # State 3
          rt ← Data, μPCfetch         # State 4
sw2:      CacheStore, μPCfetch        # State 6
j1:       PCjump, μPCfetch            # State 5j
jr1:      PCjreg, μPCfetch            # State 5jr
branch1:  PCbranch, μPCfetch          # State 5branch
jal1:     PCjump, $31 ← PC, μPCfetch  # State 5jal
syscall1: PCsyscall, μPCfetch         # State 5syscall

[Figure: microprogrammed control unit. The MicroPC addresses the microprogram memory or PLA, whose output is held in the microinstruction register and supplies the control signals to the data path plus the sequence control. A four-way multiplexer (inputs 0 to 3) selects the next MicroPC value from 0, the incremented address (Incr), Dispatch table 1, or Dispatch table 2; the dispatch tables are indexed by op (from the instruction register).]
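The state transitions reviewed on this slide can be sketched as a small next-state table to count how many cycles each instruction class takes. This is a minimal illustration, not the course's implementation: the class labels (`lw`, `sw`, `alu`, `jump_branch`) and the function name `state_sequence` are assumptions made for this sketch, while the state numbers follow the slide.

```python
# Next-state table for the multicycle control state machine.
# Keys are state numbers from the slide; the instruction-class labels
# ("lw", "sw", "alu", "jump_branch") are hypothetical names for this sketch.
NEXT_STATE = {
    0: {"lw": 1, "sw": 1, "alu": 1, "jump_branch": 1},  # fetch, then decode
    1: {"lw": 2, "sw": 2, "alu": 7, "jump_branch": 5},  # dispatch on class
    2: {"lw": 3, "sw": 6},                              # address computed
    3: {"lw": 4},                                       # cache read done
    7: {"alu": 8},                                      # ALU result ready
}

# States after which control returns to State 0 for the next instruction.
FINAL_STATES = {4, 5, 6, 8}


def state_sequence(instr_class):
    """Return the list of control states visited by one instruction."""
    state = 0
    visited = [state]
    while state not in FINAL_STATES:
        state = NEXT_STATE[state][instr_class]
        visited.append(state)
    return visited


if __name__ == "__main__":
    for cls in ("lw", "sw", "alu", "jump_branch"):
        seq = state_sequence(cls)
        print(cls, seq, len(seq), "cycles")
```

Each visited state costs one clock cycle, so the length of the returned list is the instruction's cycle count: 5 for lw, 4 for sw and ALU-type instructions, and 3 for jumps and branches.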
Review: Exception Control
[Figure: Fig. 14.10 Exception states 9 and 10 added to the control state machine. States 0 to 8 keep the signal settings given in the microprogramming review. Two new transitions, labeled “Illegal operation” and “Overflow”, lead to the exception states.]

State 9: IntCause = 1, CauseWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘−’, EPCWrite = 1, JumpAddr = 1, PCSrc = 0, PCWrite = 1
State 10: IntCause = 0, CauseWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘−’, EPCWrite = 1, JumpAddr = 1, PCSrc = 0, PCWrite = 1
MIPS Pipelined Datapath and Control
Single‐Cycle vs. Multicycle vs. Pipelined
[Figure: timing of four instructions (Instr 1 to Instr 4) under single-cycle and multicycle control. Single-cycle: one long clock period per instruction, with labels for time needed and time allotted. Multicycle: the instructions take 3, 3, 4, and 5 shorter cycles, with a label for time saved.]
[Figure: pipelined execution of instructions 1 to 7 over cycles 1 to 11, shown as (a) a task-time diagram and (b) a space-time diagram, with a start-up region and a drainage region. Pipeline stages: f = Fetch, r = Reg read, a = ALU op, d = Data access, w = Writeback.]
Pipelining Analogy (§4.5 An Overview of Pipelining)
• Pipelined laundry: overlapping execution
– Parallelism improves performance
• Four loads: Speedup = 8/3.5 = 2.3
• Non-stop: Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
Chapter 4 — The Processor —7
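The two speedup figures on the laundry slide can be checked with a few lines of Python. This is a sketch under the assumption, read off the formula 2n/(0.5n + 1.5), that there are four stages of half an hour each; the function name `laundry_speedup` is made up for illustration.

```python
def laundry_speedup(n_loads, n_stages=4, stage_hours=0.5):
    """Speedup of pipelined over sequential laundry for n_loads loads.

    Assumes every stage takes the same time (stage_hours), which is what
    the slide's formula 2n / (0.5n + 1.5) implies for four stages.
    """
    sequential = n_loads * n_stages * stage_hours        # 2n hours
    pipelined = (n_stages + n_loads - 1) * stage_hours   # 0.5n + 1.5 hours
    return sequential / pipelined


if __name__ == "__main__":
    print(round(laundry_speedup(4), 1))       # four loads: 8 / 3.5
    print(round(laundry_speedup(10**6), 3))   # non-stop: approaches 4
```

For four loads the ratio is 8/3.5, which rounds to 2.3; as the number of loads grows, the 1.5-hour fill time becomes negligible and the speedup approaches the number of stages.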
MIPS Pipeline
Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
[Figure: a lw instruction passing through IFetch, Dec, Exec, Mem, and WB in Cycles 1 to 5.]
Pipeline Performance
• Assume time for stages is
– 100ps for register read or write
– 200ps for other stages
• Compare pipelined datapath with single-cycle datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps
Pipeline Performance
[Figure: instruction timing for the single-cycle datapath (Tc = 800ps) compared with the pipelined datapath (Tc = 200ps).]
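The clock periods quoted above follow from the stage times in the table: the single-cycle clock must fit the slowest instruction, while the pipelined clock must fit the slowest stage. The sketch below recomputes them; the stage abbreviations (IF, ID, EX, MEM, WB) are mapped onto the table columns as an assumption of this example, and the function names are illustrative only.

```python
# Stage latencies in picoseconds, from the slide's table.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class actually uses (blank table cells omitted).
STAGES_USED = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def instruction_time(instr):
    """Total time of one instruction: the sum of the stages it uses."""
    return sum(STAGE_PS[stage] for stage in STAGES_USED[instr])


# Single-cycle clock: long enough for the slowest instruction (lw).
SINGLE_CYCLE_TC = max(instruction_time(i) for i in STAGES_USED)
# Pipelined clock: long enough for the slowest stage.
PIPELINED_TC = max(STAGE_PS.values())


def single_cycle_total(n_instr):
    """Time to run n_instr instructions on the single-cycle datapath."""
    return n_instr * SINGLE_CYCLE_TC


def pipelined_total(n_instr):
    """Time to run n_instr instructions on the pipeline, assuming no stalls."""
    return (len(STAGE_PS) + n_instr - 1) * PIPELINED_TC


if __name__ == "__main__":
    print(SINGLE_CYCLE_TC, PIPELINED_TC)
    print(single_cycle_total(3), pipelined_total(3))
```

For three instructions this gives 2400ps against 1400ps; the ratio only approaches 800/200 = 4 for long instruction streams, and it stays below the stage count of 5 because the stages are not balanced.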
Pipeline Speedup
• If all stages are balanced
– i.e., all take the same time
– Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
• If not balanced, speedup is less
• Speedup due to increased throughput
– Latency (time for each instruction) does not decrease
Pipelining and ISA Design
• MIPS ISA designed for pipelining
– All instructions are 32 bits
• Easier to fetch and decode in one cycle
• c.f. x86: 1- to 17-byte instructions
– Few and regular instruction formats
• Can decode and read registers in one step
– Load/store addressing
• Can calculate address in 3rd stage, access memory in 4th stage
– Alignment of memory operands
• Memory access takes only one cycle
Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg.]
• Can help with answering questions like:
– How many cycles does it take to execute this code?
– What is the ALU doing during cycle 4?
– Is there a hazard, why does it occur, and how can it be fixed?
Why Pipeline? For Performance!
[Figure: Inst 0 to Inst 4 in instruction order, each drawn as IM, Reg, ALU, DM, Reg and offset by one clock cycle along the time axis; the first cycles are marked as the time to fill the pipeline.]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
Hazards
• Situations that prevent starting the next instruction in the next cycle
• Structure hazards
– A required resource is busy
• Data hazard
– Need to wait for previous instruction to complete its data read/write
• Control hazard
– Deciding on control action depends on previous instruction
Structure Hazards
• Conflict for use of a resource
• In MIPS pipeline with a single memory
– Load/store requires data access
– Instruction fetch would have to stall for that cycle
• Would cause a pipeline “bubble”
• Hence, pipelined datapaths require separate instruction/data memories
– Or separate instruction/data caches
A Single Memory Would Be a Structural Hazard
[Figure: lw followed by Inst 1 to Inst 4 in instruction order, each drawn as Mem, Reg, ALU, Mem, Reg against time in clock cycles. In the same cycle, lw is reading data from memory while Inst 3 is reading its instruction from memory.]
Fix with separate instr and data memories (I$ and D$)
Data Hazards
• An instruction depends on completion of data access by a previous instruction
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
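The dependence between the add and the sub above (sub reads $s0, which add writes) is the kind of relation a compiler or hazard-detection unit looks for. The following is a minimal sketch that finds such read-after-write dependences in a straight-line instruction list; the tuple encoding `(op, dest, sources)` and the name `raw_dependences` are assumptions made for this example.

```python
def raw_dependences(program):
    """Find read-after-write dependences in a straight-line program.

    program: list of (op, dest, sources) tuples, where dest is a register
    name or None and sources is a tuple of register names.
    Returns (producer_index, consumer_index, register) triples, pairing each
    source register with the most recent earlier instruction that wrote it.
    """
    deps = []
    last_writer = {}  # register name -> index of most recent writer
    for index, (_op, dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], index, reg))
        if dest is not None:
            last_writer[dest] = index
    return deps


if __name__ == "__main__":
    program = [
        ("add", "$s0", ("$t0", "$t1")),
        ("sub", "$t2", ("$s0", "$t3")),
    ]
    print(raw_dependences(program))
```

Whether a dependence actually becomes a hazard depends on the distance between producer and consumer and on what the pipeline provides (stalling, split-cycle register file access, forwarding), which the following slides cover.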
Register Usage Can Cause Data Hazards
• Dependencies backward in time cause hazards
[Figure: five instructions in order, each drawn as IM, Reg, ALU, DM, Reg:]
    add $1,
    sub $4,$1,$5
    and $6,$1,$7
    or  $8,$1,$9
    xor $4,$1,$5
Read before write data hazard
Loads Can Cause Data Hazards
• Dependencies backward in time cause hazards
[Figure: five instructions in instruction order, each drawn as IM, Reg, ALU, DM, Reg:]
    lw  $1,4($2)
    sub $4,$1,$5
    and $6,$1,$7
    or  $8,$1,$9
    xor $4,$1,$5
Load-use data hazard
How About Register File Access?
[Figure: four instructions in instruction order against time in clock cycles, each drawn as IM, Reg, ALU, DM, Reg:]
    add $1,
    Inst 1
    Inst 2
    add $2,$1,
Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half
[Figure labels: clock edge that controls register writing; clock edge that controls loading of pipeline state registers.]
One Way to “Fix” a Data Hazard
[Figure: instructions in order, each drawn as IM, Reg, ALU, DM, Reg:]
    add $1,
    stall
    stall
    sub $4,$1,$5
    and $6,$1,$7
Can fix data hazard by waiting (stall), but this impacts CPI
Forwarding (aka Bypassing)
• Use result when it is computed
– Don’t wait for it to be stored in a register
– Requires extra connections in the datapath
Another Way to “Fix” a Data Hazard
[Figure: instructions in order, each drawn as IM, Reg, ALU, DM, Reg:]
    add $1,
    sub $4,$1,$5
    and $6,$1,$7
    or  $8,$1,$9
    xor $4,$1,$5
Fix data hazards by forwarding results as soon as they are available to where they are needed
Forwarding Illustration
[Figure: instructions in order, each drawn as IM, Reg, ALU, DM, Reg, with forwarding paths labeled EX forwarding and MEM forwarding:]
    add $1,
    sub $4,$1,$5
    and $6,$7,$1
Yet Another Complication!
• Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction: which should be forwarded?
[Figure: instructions in order, each drawn as IM, Reg, ALU, DM, Reg:]
    add $1,$1,$2
    add $1,$1,$3
    add $1,$1,$4
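The question posed on this slide is usually resolved by giving priority to the more recent result, i.e. the one produced by the instruction currently in the MEM stage, since it reflects the latest write to the register in program order. The sketch below models that priority in the style of the forwarding conditions in Patterson & Hennessy; the dictionary fields (`reg_write`, `rd`) and the function name `forward_select` are assumptions made for this example, not signals defined in the slides.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose where an ALU operand in the EX stage should come from.

    src_reg: register number read by the instruction in EX.
    ex_mem:  pipeline register after EX (instruction now in MEM).
    mem_wb:  pipeline register after MEM (instruction now in WB).
    Each pipeline register is a dict with 'reg_write' (bool) and 'rd' (int).
    Register 0 is never forwarded because it is hardwired to zero in MIPS.
    """
    # Check the MEM-stage instruction first: it is the most recent writer.
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    # Otherwise fall back to the older result from the WB-stage instruction.
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    return "REGFILE"


if __name__ == "__main__":
    # Third add ($1 = $1 + $4) is in EX; second add is in MEM, first in WB.
    second_add = {"reg_write": True, "rd": 1}
    first_add = {"reg_write": True, "rd": 1}
    print(forward_select(1, second_add, first_add))  # operand $1
    print(forward_select(4, second_add, first_add))  # operand $4
```

Checking the EX/MEM register before the MEM/WB register is what encodes the priority: for the sequence of three adds, operand $1 of the third add comes from the second add rather than the first.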
Load-Use Data Hazard
• Can’t always avoid stalls by forwarding
– If value not computed when needed
– Can’t forward backward in time!
Code Scheduling to Avoid Stalls
• Reorder code to avoid use of load result in the next instruction
• C code for A = B + E; C = B + F;

Original order (13 cycles):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    (stall)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    (stall)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)

Reordered (11 cycles):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)
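The 13-cycle and 11-cycle totals can be reproduced with a small counting sketch. It assumes a five-stage pipeline with full forwarding, so that the only stall is one cycle when an instruction uses the result of the immediately preceding lw; the tuple encoding `(op, dest, sources)` and the name `count_cycles` are choices made for this example.

```python
def count_cycles(program, n_stages=5):
    """Cycles to run a straight-line program on an n_stages pipeline.

    Assumes full forwarding: the only stalls are one-cycle load-use stalls,
    where an instruction reads the register loaded by the lw just before it.
    program: list of (op, dest, sources) tuples.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "lw" and prev_dest in curr[2]:
            stalls += 1
    # The first instruction takes n_stages cycles; each later one adds a cycle.
    return n_stages + len(program) - 1 + stalls


ORIGINAL = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),   # uses $t2 right after its load
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),   # uses $t4 right after its load
    ("sw", None, ("$t5", "$t0")),
]

REORDERED = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("lw", "$t4", ("$t0",)),          # moved up to separate loads from uses
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]

if __name__ == "__main__":
    print(count_cycles(ORIGINAL), count_cycles(REORDERED))
```

Seven instructions need 5 + 6 = 11 cycles without stalls; the original order adds two load-use stalls for a total of 13, and moving the third lw ahead of the first add removes both.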