31 January, 2010
Chapter 4 — The Processor 1
§4.3 Building a Datapath
Building a Datapath
Datapath: elements that process data and addresses in the CPU
  Registers, ALUs, muxes, memories, …
We will build a MIPS datapath incrementally
  Refining the overview design
Chapter 4 — The Processor — 14
Instruction Fetch
32-bit register (the PC) supplies the instruction address
Increment by 4 for next instruction
R-Format Instructions
Read two register operands
Perform arithmetic/logical operation
Write register result
Load/Store Instructions
Read register operands
Calculate address using 16-bit offset
  Use ALU, but sign-extend offset
Load: read memory and update register
Store: write register value to memory
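The address calculation above can be sketched in Python; the helper names are illustrative, not part of any MIPS toolchain:

```python
def sign_extend_16(imm16):
    """Sign-extend a 16-bit offset to 32 bits (two's complement)."""
    return imm16 | 0xFFFF0000 if imm16 & 0x8000 else imm16

def effective_address(base, imm16):
    """Load/store address: base register + sign-extended offset, 32-bit wrap."""
    return (base + sign_extend_16(imm16)) & 0xFFFFFFFF
```

For example, `lw $t0, -4($t1)` encodes the offset -4 as 0xFFFC, which sign-extends to 0xFFFFFFFC before the ALU adds it to the base register.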
Branch Instructions
Read register operands
Compare operands
  Use ALU: subtract and check Zero output
Calculate target address
  Sign-extend displacement
  Shift left 2 places (word displacement)
  Add to PC + 4
    Already calculated by instruction fetch
Branch Instructions
Just re-routes wires
Sign-bit wire replicated
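The target-address arithmetic from the previous slide can be checked with a small sketch (hypothetical helper name, assuming 32-bit wraparound):

```python
def branch_target(pc, imm16):
    """(PC + 4) + (sign-extended 16-bit word displacement << 2)."""
    offset = imm16 - 0x10000 if imm16 & 0x8000 else imm16  # sign-extend
    return (pc + 4 + (offset << 2)) & 0xFFFFFFFF
```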
Composing the Elements
First-cut datapath does an instruction in one clock cycle
  Each datapath element can only do one function at a time
  Hence, we need separate instruction and data memories
Use multiplexers where alternate data sources are used for different instructions
R-Type/Load/Store Datapath
Full Datapath
§4.4 A Simple Implementation Scheme
ALU Control
ALU used for
  Load/Store: F = add
  Branch: F = subtract
  R-type: F depends on funct field

  ALU control  Function
  0000         AND
  0001         OR
  0010         add
  0110         subtract
  0111         set-on-less-than
  1100         NOR
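A behavioural model of this ALU table (a sketch only; the real ALU is combinational hardware, and the masks keep results to 32 bits):

```python
def to_signed(x):
    """Interpret a 32-bit value as signed, for set-on-less-than."""
    return x - 0x100000000 if x & 0x80000000 else x

def alu(control, a, b):
    ops = {
        0b0000: a & b,                             # AND
        0b0001: a | b,                             # OR
        0b0010: (a + b) & 0xFFFFFFFF,              # add
        0b0110: (a - b) & 0xFFFFFFFF,              # subtract
        0b0111: int(to_signed(a) < to_signed(b)),  # set-on-less-than
        0b1100: ~(a | b) & 0xFFFFFFFF,             # NOR
    }
    result = ops[control]
    return result, result == 0                     # (result, Zero output)
```

The Zero output is exactly what beq needs: subtract the operands and branch if the result is zero.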
ALU Control
Assume 2-bit ALUOp derived from opcode
Combinational logic derives ALU control

  opcode   ALUOp  Operation         funct    ALU function      ALU control
  lw       00     load word         XXXXXX   add               0010
  sw       00     store word        XXXXXX   add               0010
  beq      01     branch equal      XXXXXX   subtract          0110
  R-type   10     add               100000   add               0010
                  subtract          100010   subtract          0110
                  AND               100100   AND               0000
                  OR                100101   OR                0001
                  set-on-less-than  101010   set-on-less-than  0111
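The two-level decode in the table can be written out directly (a sketch; all encodings are taken from the table above):

```python
def alu_control(aluop, funct):
    """Derive the 4-bit ALU control from 2-bit ALUOp and the funct field."""
    if aluop == 0b00:            # lw / sw: compute the address
        return 0b0010            # add
    if aluop == 0b01:            # beq: compare operands
        return 0b0110            # subtract
    # R-type (ALUOp = 10): operation comes from the funct field
    return {0b100000: 0b0010,    # add
            0b100010: 0b0110,    # subtract
            0b100100: 0b0000,    # AND
            0b100101: 0b0001,    # OR
            0b101010: 0b0111,    # set-on-less-than
            }[funct]
```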
The Main Control Unit
Control signals derived from instruction

  R-type      0         rs     rt     rd     shamt  funct
              31:26     25:21  20:16  15:11  10:6   5:0
  Load/Store  35 or 43  rs     rt     address
              31:26     25:21  20:16  15:0
  Branch      4         rs     rt     address
              31:26     25:21  20:16  15:0

  opcode: always read
  source registers: read, except for load
  destination register: write for R-type and load
  address: sign-extend and add
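Extracting these fields from a 32-bit instruction word is simple shifting and masking (illustrative helper, not a real assembler API):

```python
def fields(instr):
    """Split a 32-bit MIPS word into the fields the control unit reads."""
    return {"opcode": (instr >> 26) & 0x3F,  # 31:26
            "rs":     (instr >> 21) & 0x1F,  # 25:21
            "rt":     (instr >> 16) & 0x1F,  # 20:16
            "rd":     (instr >> 11) & 0x1F,  # 15:11
            "shamt":  (instr >> 6)  & 0x1F,  # 10:6
            "funct":  instr & 0x3F,          # 5:0
            "imm16":  instr & 0xFFFF}        # 15:0 (I-format)
```

In hardware all fields are wired out in parallel; only the control unit decides which ones matter for a given opcode.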
Datapath With Control
R-Type Instruction
Load Instruction
Branch-on-Equal Instruction
Implementing Jumps

  Jump  2      address
        31:26  25:0

Jump uses word address
Update PC with concatenation of
  Top 4 bits of old PC
  26-bit jump address
  00
Target address = PC[31:28] : (address × 4)
Need an extra control signal decoded from opcode
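The concatenation can be sketched as follows (hypothetical helper; PC + 4 supplies the top four bits):

```python
def jump_target(pc, addr26):
    """{ (PC+4)[31:28], 26-bit address, 00 } -> 32-bit target."""
    return ((pc + 4) & 0xF0000000) | ((addr26 << 2) & 0x0FFFFFFF)
```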
Datapath With Jumps Added
Performance Issues
Longest delay determines clock period
  Critical path: load instruction
  Instruction memory → register file → ALU → data memory → register file
Not feasible to vary period for different instructions
  Violates design principle: making the common case fast
We will improve performance by pipelining
§4.5 An Overview of Pipelining
Pipelining Analogy
Pipelined laundry: overlapping execution
  Parallelism improves performance
Four loads:
  Speedup = 8/3.5 = 2.3
Non-stop:
  Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
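Both speedup figures follow from counting stage slots (assuming the analogy's numbers: 4 stages of 0.5 h each, so one load takes 2 h done sequentially):

```python
def laundry_speedup(n, stages=4, stage_time=0.5):
    """Sequential time over pipelined time for n loads of laundry."""
    sequential = n * stages * stage_time                     # 2n hours
    pipelined = stage_time * (n - 1) + stages * stage_time   # 0.5n + 1.5 hours
    return sequential / pipelined
```

For n = 4 this gives 8/3.5 ≈ 2.3, and as n grows the ratio approaches the number of stages.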
MIPS Pipeline
Five stages, one step per stage
  1. IF: Instruction fetch from memory
  2. ID: Instruction decode & register read
  3. EX: Execute operation or calculate address
  4. MEM: Access memory operand
  5. WB: Write result back to register
Pipeline Performance
Assume time for stages is
  100 ps for register read or write
  200 ps for other stages
Compare pipelined datapath with single-cycle datapath

  Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
  lw        200 ps       100 ps         200 ps  200 ps         100 ps          800 ps
  sw        200 ps       100 ps         200 ps  200 ps                         700 ps
  R-format  200 ps       100 ps         200 ps                 100 ps          600 ps
  beq       200 ps       100 ps         200 ps                                 500 ps
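The single-cycle totals in the table are just sums over the stages each instruction actually uses (a sketch using the slide's latencies):

```python
STAGE_PS = {"IF": 200, "read": 100, "ALU": 200, "MEM": 200, "WB": 100}

USES = {"lw":       ["IF", "read", "ALU", "MEM", "WB"],
        "sw":       ["IF", "read", "ALU", "MEM"],
        "R-format": ["IF", "read", "ALU", "WB"],
        "beq":      ["IF", "read", "ALU"]}

def total_ps(instr):
    """Single-cycle latency of one instruction, in picoseconds."""
    return sum(STAGE_PS[s] for s in USES[instr])
```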
Pipeline Performance
Single-cycle (Tc = 800 ps)
Pipelined (Tc = 200 ps)
Pipeline Speedup
If all stages are balanced (i.e., all take the same time)
  Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
If not balanced, speedup is less
Speedup is due to increased throughput
  Latency (time for each instruction) does not decrease
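Both cases can be checked numerically (a sketch of throughput only, ignoring pipeline fill and drain):

```python
def cycle_speedup(stage_times_ps):
    """Single-cycle period (sum of stages) over pipelined period (slowest stage)."""
    return sum(stage_times_ps) / max(stage_times_ps)
```

Balanced stages of 200 ps each give a speedup of 5.0, the number of stages; the slide's unbalanced stages [200, 100, 200, 200, 100] give only 800/200 = 4.0.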
Pipelining and ISA Design
MIPS ISA designed for pipelining
All instructions are 32 bits
  Easier to fetch and decode in one cycle
  c.f. x86: 1- to 17-byte instructions
Few and regular instruction formats
  Can decode and read registers in one step
Load/store addressing
  Can calculate address in 3rd stage, access memory in 4th stage
Alignment of memory operands
  Memory access takes only one cycle