Post on 20-Dec-2015
2004 Morgan Kaufmann Publishers
Chapter Five
The Processor: Datapath and Control
Outline
• 5.1 Introduction
• 5.2 Logic Design Conventions
• 5.3 Building a Datapath
• 5.4 A Simple Implementation Scheme
• 5.5 A Multicycle Implementation
• 5.6 Exceptions
• 5.9 Real Stuff: The Organization of Recent Pentium Implementations
• 5.10 Fallacies and Pitfalls
• 5.11 Concluding Remarks
• 5.12 Historical Perspective and Further Reading
5.1 Introduction
• We're ready to look at an implementation of the MIPS
• Simplified to contain only:
– memory-reference instructions: lw, sw
– arithmetic-logical instructions: add, sub, and, or, slt
– control flow instructions: beq, j
• Generic Implementation:
– use the program counter (PC) to supply instruction address
– get the instruction from memory
– read registers
– use the instruction to decide exactly what to do
• All instructions use the ALU after reading the registers
Why? memory-reference? arithmetic? control flow?
The Processor: Datapath & Control
• Abstract / Simplified View:
• Two types of functional units:
– elements that operate on data values (combinational)
– elements that contain state (sequential)
More Implementation Details
FIGURE 3.14 MIPS architecture revealed thus far: MIPS assembly language
Category | Instruction | Example | Meaning | Comments
Arithmetic | add | add $s1, $s2, $s3 | $s1 = $s2 + $s3 | Three operands; overflow detected
Arithmetic | subtract | sub $s1, $s2, $s3 | $s1 = $s2 - $s3 | Three operands; overflow detected
Arithmetic | add immediate | addi $s1, $s2, 100 | $s1 = $s2 + 100 | + constant; overflow detected
Arithmetic | add unsigned | addu $s1, $s2, $s3 | $s1 = $s2 + $s3 | Three operands; overflow undetected
Arithmetic | subtract unsigned | subu $s1, $s2, $s3 | $s1 = $s2 - $s3 | Three operands; overflow undetected
Arithmetic | add immediate unsigned | addiu $s1, $s2, 100 | $s1 = $s2 + 100 | + constant; overflow undetected
Arithmetic | move from coprocessor register | mfc0 $s1, $epc | $s1 = $epc | Copy Exception PC + special regs
Arithmetic | multiply | mult $s2, $s3 | Hi, Lo = $s2 x $s3 | 64-bit signed product in Hi, Lo
Arithmetic | multiply unsigned | multu $s2, $s3 | Hi, Lo = $s2 x $s3 | 64-bit unsigned product in Hi, Lo
Arithmetic | divide | div $s2, $s3 | Lo = $s2 / $s3, Hi = $s2 mod $s3 | Lo = quotient, Hi = remainder
Arithmetic | divide unsigned | divu $s2, $s3 | Lo = $s2 / $s3, Hi = $s2 mod $s3 | Unsigned quotient and remainder
Arithmetic | move from Hi | mfhi $s1 | $s1 = Hi | Used to get copy of Hi
Arithmetic | move from Lo | mflo $s1 | $s1 = Lo | Used to get copy of Lo
Data transfer | load word | lw $s1, 100($s2) | $s1 = Memory[$s2 + 100] | Word from memory to register
Data transfer | store word | sw $s1, 100($s2) | Memory[$s2 + 100] = $s1 | Word from register to memory
Data transfer | load half unsigned | lhu $s1, 100($s2) | $s1 = Memory[$s2 + 100] | Halfword memory to register
Data transfer | store half | sh $s1, 100($s2) | Memory[$s2 + 100] = $s1 | Halfword register to memory
Data transfer | load byte unsigned | lbu $s1, 100($s2) | $s1 = Memory[$s2 + 100] | Byte from memory to register
Data transfer | store byte | sb $s1, 100($s2) | Memory[$s2 + 100] = $s1 | Byte from register to memory
Data transfer | load upper immed. | lui $s1, 100 | $s1 = 100 * 2^16 | Loads constant in upper 16 bits
FIGURE 3.14 (continued)
Category | Instruction | Example | Meaning | Comments
Logical | and | and $s1, $s2, $s3 | $s1 = $s2 & $s3 | Three reg. operands; bit-by-bit AND
Logical | or | or $s1, $s2, $s3 | $s1 = $s2 | $s3 | Three reg. operands; bit-by-bit OR
Logical | nor | nor $s1, $s2, $s3 | $s1 = ~($s2 | $s3) | Three reg. operands; bit-by-bit NOR
Logical | and immediate | andi $s1, $s2, 100 | $s1 = $s2 & 100 | Bit-by-bit AND reg with constant
Logical | or immediate | ori $s1, $s2, 100 | $s1 = $s2 | 100 | Bit-by-bit OR reg with constant
Logical | shift left logical | sll $s1, $s2, 10 | $s1 = $s2 << 10 | Shift left by constant
Logical | shift right logical | srl $s1, $s2, 10 | $s1 = $s2 >> 10 | Shift right by constant
Conditional branch | branch on equal | beq $s1, $s2, 25 | if ($s1 == $s2) go to PC + 4 + 100 | Equal test; PC-relative branch
Conditional branch | branch on not equal | bne $s1, $s2, 25 | if ($s1 != $s2) go to PC + 4 + 100 | Not equal test; PC-relative
Conditional branch | set on less than | slt $s1, $s2, $s3 | if ($s2 < $s3) $s1 = 1; else $s1 = 0 | Compare less than; two’s complement
Conditional branch | set on less than immediate | slti $s1, $s2, 100 | if ($s2 < 100) $s1 = 1; else $s1 = 0 | Compare < constant; two’s complement
Conditional branch | set less than unsigned | sltu $s1, $s2, $s3 | if ($s2 < $s3) $s1 = 1; else $s1 = 0 | Compare less than; natural numbers
Conditional branch | set less than immediate unsigned | sltiu $s1, $s2, 100 | if ($s2 < 100) $s1 = 1; else $s1 = 0 | Compare < constant; natural numbers
Unconditional jump | jump | j 2500 | go to 10000 | Jump to target address
Unconditional jump | jump register | jr $ra | go to $ra | For switch, procedure return
Unconditional jump | jump and link | jal 2500 | $ra = PC + 4; go to 10000 | For procedure call
MIPS floating-point machine language
Name | Format | Example (field values) | Comments
add.s | R | 17 16 6 4 2 0 | add.s $f2, $f4, $f6
sub.s | R | 17 16 6 4 2 1 | sub.s $f2, $f4, $f6
mul.s | R | 17 16 6 4 2 2 | mul.s $f2, $f4, $f6
div.s | R | 17 16 6 4 2 3 | div.s $f2, $f4, $f6
add.d | R | 17 17 6 4 2 0 | add.d $f2, $f4, $f6
sub.d | R | 17 17 6 4 2 1 | sub.d $f2, $f4, $f6
mul.d | R | 17 17 6 4 2 2 | mul.d $f2, $f4, $f6
div.d | R | 17 17 6 4 2 3 | div.d $f2, $f4, $f6
lwc1 | I | 49 20 2 100 | lwc1 $f2, 100($s4)
swc1 | I | 57 20 2 100 | swc1 $f2, 100($s4)
bc1t | I | 17 8 1 25 | bc1t 25
bc1f | I | 17 8 0 25 | bc1f 25
c.lt.s | R | 17 16 4 2 0 60 | c.lt.s $f2, $f4
c.lt.d | R | 17 17 4 2 0 60 | c.lt.d $f2, $f4
Field size | | 6 bits, 5 bits, 5 bits, 5 bits, 5 bits, 6 bits | All MIPS instructions are 32 bits
Figure 5.2 The basic implementation of the MIPS subset including the necessary multiplexers and control lines.
5.2 Logic Design Conventions
Keywords
• Clocking methodology: The approach used to determine when data is valid and stable relative to the clock.
• Edge-triggered clocking: A clocking scheme in which all state changes occur on a clock edge.
• Control signal: A signal used for multiplexer selection or for directing the operation of a functional unit; contrasts with a data signal, which contains information that is operated on by a functional unit.
• Unclocked vs. Clocked
• Clocks used in synchronous logic
– when should an element that contains state be updated?
State Elements
[Figure: clock waveform marking the clock period (cycle time), the rising edge, and the falling edge]
• The set-reset latch
– output depends on present inputs and also on past inputs
An unclocked state element
[Figure: set-reset latch with inputs R and S and outputs Q and its complement]
• Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)
• Change of state (value) is based on the clock
• Latches: state changes whenever the inputs change and the clock is asserted
• Flip-flop: state changes only on a clock edge (edge-triggered methodology)
"Asserted" means "logically true", which could mean electrically low
A clocking methodology defines when signals can be read and written; we wouldn't want to read a signal at the same time it was being written
Latches and Flip-flops
• Two inputs:
– the data value to be stored (D)
– the clock signal (C) indicating when to read & store D
• Two outputs:
– the value of the internal state (Q) and its complement
D-latch
[Figure: D latch with data input D, clock input C, and outputs Q and its complement]
D flip-flop
• Output changes only on the clock edge
[Figure: D flip-flop built from two D latches in series, with inputs D and C and outputs Q and its complement]
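The difference between a latch and a flip-flop can be sketched behaviorally in Python. This is a minimal model under stated assumptions, not the circuit itself: the class names, the `update` method, and the `trace` helper are hypothetical, and the flip-flop is assumed to be two D latches in series with the second one clocked by the inverted clock, so that Q changes on the falling edge.

```python
class DLatch:
    """Level-sensitive element: Q follows D while the clock C is asserted."""

    def __init__(self):
        self.q = 0

    def update(self, d, c):
        if c:                 # latch is open (transparent) while C = 1
            self.q = d
        return self.q         # otherwise the stored value is held


class DFlipFlop:
    """Edge-triggered element modeled as two latches (assumed master/slave)."""

    def __init__(self):
        self.master = DLatch()
        self.slave = DLatch()

    def update(self, d, c):
        m = self.master.update(d, c)         # master open while C = 1
        return self.slave.update(m, not c)   # slave open while C = 0


def trace(element, stimulus):
    """Apply a list of (D, C) pairs and record Q after each one."""
    return [element.update(d, c) for d, c in stimulus]


if __name__ == "__main__":
    stimulus = [(1, 1), (0, 1), (1, 1), (1, 0), (0, 0), (0, 1), (0, 0)]
    print("latch    :", trace(DLatch(), stimulus))
    print("flip-flop:", trace(DFlipFlop(), stimulus))
```

In the trace, the latch output follows every change of D while C is high, whereas the flip-flop output only changes at the moment C falls.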
Our Implementation
• An edge-triggered methodology
• Typical execution:
– read contents of some state elements,
– send values through some combinational logic
– write results to one or more state elements
[Figure: state element 1 feeds combinational logic, which feeds state element 2, all within one clock cycle]
Figure 5.4 An edge-triggered methodology allows a state element to be read and written in the same clock cycle without creating a race that could lead to indeterminate data values.
5.3 Building a Datapath
Keywords
• Datapath element: A functional unit used to operate on or hold data within a processor. In the MIPS implementation the datapath elements include the instruction and data memories, the register file, the arithmetic logic unit (ALU), and adders.
• Program counter (PC): The register containing the address of the instruction in the program being executed.
• Register file: A state element that consists of a set of registers that can be read and written by supplying a register number to be accessed.
• Sign-extend: To increase the size of a data item by replicating the high-order sign bit of the original data item in the high-order bits of the larger, destination data item.
Keywords
• Branch target address: The address specified in a branch, which becomes the new program counter (PC) if the branch is taken. In the MIPS architecture the branch target is given by the sum of the offset field of the instruction and the address of the instruction following the branch.
• Branch taken: A branch where the branch condition is satisfied and the program counter (PC) becomes the branch target. All unconditional branches are taken branches.
• Branch not taken: A branch where the branch condition is false and the program counter (PC) becomes the address of the instruction that sequentially follows the branch.
• Delayed branch: A type of branch where the instruction immediately following the branch is always executed, independent of whether the branch condition is true or false.
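The sign-extend and branch target definitions can be illustrated with a short Python sketch. The function names are hypothetical; the sketch assumes a 16-bit offset field counted in words (so it is shifted left by 2 before being added) and 32-bit addresses that wrap around.

```python
def sign_extend16(value):
    """Replicate bit 15 of a 16-bit value into the upper 16 bits of a 32-bit value."""
    value &= 0xFFFF
    return (value | 0xFFFF0000) if value & 0x8000 else value


def branch_target(pc, offset16):
    """Address of the instruction after the branch plus the word offset."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


if __name__ == "__main__":
    # An offset field of 25 corresponds to PC + 4 + 100.
    print(hex(branch_target(0x1000, 25)))
    # An offset field of -1 (0xFFFF) branches back to the branch itself.
    print(hex(branch_target(0x1000, 0xFFFF)))
```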
• Built using D flip-flops
Register File
[Figure: register file with inputs Read register number 1, Read register number 2, Write register, Write data, and Write, and outputs Read data 1 and Read data 2. Each read port is a Mux that uses the read register number to select among Register 0 through Register n – 1.]
Do you understand? What is the “Mux” above?
Abstraction
• Make sure you understand the abstractions!
• Sometimes it is easy to think you do, when you don’t
[Figure: a 32-bit-wide Mux with inputs A and B, output C, and a Select line, shown as 32 one-bit multiplexors (A31/B31/C31 down to A0/B0/C0) that share the same Select]
Register File
• Note: we still use the real clock to determine when to write
[Figure: register file write port. An n-to-2^n decoder on the Register number, combined with the Write signal, drives the C input of Register 0 through Register n – 1; Register data drives the D input of every register.]
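The read and write ports described on these slides can be summarized in a small behavioral model. This is only a sketch: the class and method names are hypothetical, reads are treated as combinational (always available), and the write is assumed to happen once per call to `clock_edge` when Write is asserted.

```python
class RegisterFile:
    """Two read ports (the multiplexors) and one write port (the decoder)."""

    def __init__(self, n=32):
        self.regs = [0] * n

    def read(self, read_reg1, read_reg2):
        # Each read port selects one register, like the Mux in the figure.
        return self.regs[read_reg1], self.regs[read_reg2]

    def clock_edge(self, write, write_reg, write_data):
        # Only the register picked by the decoder is updated, and only if
        # the Write signal is asserted when the clock edge arrives.
        if write:
            self.regs[write_reg] = write_data & 0xFFFFFFFF


if __name__ == "__main__":
    rf = RegisterFile()
    rf.clock_edge(True, 8, 42)
    print(rf.read(8, 0))
```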
Simple Implementation
• Include the functional units we need for each instruction
[Figure: a. Instruction memory (Instruction address in, Instruction out); b. Program counter (PC); c. Adder (Add, producing Sum)]
[Figure: a. Data memory unit (inputs Address, Write data, MemRead, MemWrite; output Read data); b. Sign-extension unit (16-bit input, 32-bit output)]
[Figure: a. Registers (5-bit Read register 1, Read register 2, and Write register numbers; Write data and RegWrite inputs; Read data 1 and Read data 2 outputs); b. ALU (two data inputs, 4-bit ALU operation control, Zero and ALU result outputs)]
Why do we need this stuff?
Figure 5.10 The datapath for the memory instructions and the R-type instructions.
Building the Datapath
• Use multiplexors to stitch them together
[Figure: datapath combining the PC, instruction memory, the PC + 4 adder, the register file, the sign-extend unit, the shift-left-2 unit, the branch adder, the ALU, and the data memory, joined by three multiplexors. Control lines: RegWrite, ALUSrc, 4-bit ALU operation, MemRead, MemWrite, MemtoReg, PCSrc.]
5.4 A Simple Implementation Scheme
Keywords
• Don’t-care term: An element of a logic function in which the output does not depend on the values of all the inputs. Don’t-care terms may be specified in different ways.
• Opcode: The field that denotes the operation and format of an instruction.
• Single-cycle implementation: Also called single clock cycle implementation. An implementation in which an instruction is executed in one clock cycle.
Control
• Selecting the operations to perform (ALU, read/write, etc.)
• Controlling the flow of data (multiplexor inputs)
• Information comes from the 32 bits of the instruction
• Example:
add $8, $17, $18 Instruction Format:
000000 10001 10010 01000 00000 100000
op rs rt rd shamt funct
• ALU's operation based on instruction type and function code
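As a rough illustration of how the control information is pulled out of the 32 instruction bits, the Python sketch below splits the example word into its six R-format fields (6, 5, 5, 5, 5, and 6 bits). The function name `decode_r_type` is hypothetical.

```python
def decode_r_type(word):
    """Split a 32-bit R-format instruction into its fields."""
    return {
        "op":    (word >> 26) & 0x3F,   # bits 31:26
        "rs":    (word >> 21) & 0x1F,   # bits 25:21
        "rt":    (word >> 16) & 0x1F,   # bits 20:16
        "rd":    (word >> 11) & 0x1F,   # bits 15:11
        "shamt": (word >> 6) & 0x1F,    # bits 10:6
        "funct": word & 0x3F,           # bits 5:0
    }


if __name__ == "__main__":
    # add $8, $17, $18 written field by field, as on the slide
    word = int("000000" "10001" "10010" "01000" "00000" "100000", 2)
    print(decode_r_type(word))
```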
• e.g., what should the ALU do with this instruction?
• Example: lw $1, 100($2)
35 2 1 100
op rs rt 16 bit offset
• ALU control input
0000 AND
0001 OR
0010 add
0110 subtract
0111 set-on-less-than
1100 NOR
• Why is the code for subtract 0110 and not 0011?
Control
Figure 5.12 How the ALU control bits are set depends on the ALUOp control bits and the different function codes for the R-type instruction.
Instruction opcode | ALUOp | Instruction operation | Funct field | Desired ALU action | ALU control input
LW | 00 | load word | XXXXXX | add | 0010
SW | 00 | store word | XXXXXX | add | 0010
Branch equal | 01 | branch equal | XXXXXX | subtract | 0110
R-type | 10 | add | 100000 | add | 0010
R-type | 10 | subtract | 100010 | subtract | 0110
R-type | 10 | AND | 100100 | and | 0000
R-type | 10 | OR | 100101 | or | 0001
R-type | 10 | set on less than | 101010 | set on less than | 0111
• Must describe hardware to compute 4-bit ALU control input
– given instruction type
00 = lw, sw
01 = beq
10 = arithmetic
– function code for arithmetic
• Describe it using a truth table (can turn into gates):
ALUOp computed from instruction type
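The ALU control truth table can be written down directly as a lookup. The Python sketch below is a behavioral stand-in for the gates, using the ALUOp and funct encodings from Figure 5.12; the names `alu_control` and `FUNCT_TO_ALU` are hypothetical.

```python
# funct field -> 4-bit ALU control input, for R-type instructions (ALUOp = 10)
FUNCT_TO_ALU = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set on less than
}


def alu_control(alu_op, funct=0):
    """Return the 4-bit ALU control input for a 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:        # lw, sw: add to form the address (funct is a don't-care)
        return 0b0010
    if alu_op == 0b01:        # beq: subtract to compare (funct is a don't-care)
        return 0b0110
    if alu_op == 0b10:        # R-type: the funct field decides
        return FUNCT_TO_ALU[funct]
    raise ValueError("ALUOp value not used in this subset")


if __name__ == "__main__":
    print(format(alu_control(0b10, 0b101010), "04b"))
```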
Control
Figure B.5.9 A 1-bit ALU that performs AND, OR, and addition on a and b or on a and the complement of b.
FIGURE B.5.10 (Top) A 1-bit ALU that performs AND, OR, and addition on a and b or the complement of b.
FIGURE B.5.10 (bottom) a 1-bit ALU for the most significant bit.
FIGURE B.5.11 A 32-bit ALU constructed from the 31 copies of the 1-bit ALU in the top of Figure B.5.10 and one 1-bit ALU in the bottom of that figure.
FIGURE B.5.12 The final 32-bit ALU. This adds a Zero detector to Figure B.5.11.
FIGURE B.5.13 The values of the three ALU control lines Bnegate and Operation and the corresponding ALU operations.
ALU control lines Function
0000 AND
0001 OR
0010 add
0110 subtract
0111 set-on-less-than
1100 NOR
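To make the function codes concrete, here is a behavioral Python sketch of a 32-bit ALU driven by the 4-bit control values in the table. It is an assumption-laden model rather than the gate-level design: the names `alu` and `to_signed` are hypothetical, overflow detection is ignored, and set-on-less-than is modeled with a signed comparison instead of the subtract-and-take-sign-bit hardware.

```python
MASK = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control, a, b):
    """Return (result, zero) for a 4-bit ALU control value and two 32-bit inputs."""
    a &= MASK
    b &= MASK
    if control == 0b0000:
        result = a & b                                    # AND
    elif control == 0b0001:
        result = a | b                                    # OR
    elif control == 0b0010:
        result = (a + b) & MASK                           # add
    elif control == 0b0110:
        result = (a - b) & MASK                           # subtract
    elif control == 0b0111:
        result = 1 if to_signed(a) < to_signed(b) else 0  # set-on-less-than
    elif control == 0b1100:
        result = ~(a | b) & MASK                          # NOR
    else:
        raise ValueError("unsupported ALU control value")
    return result, result == 0                            # Zero output


if __name__ == "__main__":
    print(alu(0b0110, 7, 7))   # equal operands drive the Zero output
```

The Zero output after a subtract is what the branch-equal instruction relies on.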
FIGURE B.5.14 The symbol commonly used to represent an ALU, as shown in Figure B.5.12.
Figure 5.14 The three instruction classes (R-type, load and store, and branch) use two different instruction formats.
a. R-type instruction
Field | 0 | rs | rt | rd | shamt | funct
Bit positions | 31:26 | 25:21 | 20:16 | 15:11 | 10:6 | 5:0
b. Load or store instruction
Field | 35 or 43 | rs | rt | address
Bit positions | 31:26 | 25:21 | 20:16 | 15:0
c. Branch instruction
Field | 4 | rs | rt | address
Bit positions | 31:26 | 25:21 | 20:16 | 15:0
Figure 5.15 The datapath of Figure 5.12 with all necessary multiplexors and all control lines identified
Control
• Simple combinational logic (truth tables)
[Figure: ALU control block, with inputs ALUOp (ALUOp1, ALUOp0) and the funct field F(5–0), of which F3 through F0 are used, and outputs Operation2, Operation1, Operation0. Main control logic, with inputs Op5 through Op0 decoded into R-format, lw, sw, and beq, and outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0.]
• All of the logic is combinational
• We wait for everything to settle down, and the right thing to be done
– ALU might not produce “right answer” right away
– we use write signals along with clock to determine when to write
• Cycle time determined by length of the longest path
Our Simple Control Structure
We are ignoring some details like setup and hold times
[Figure: state element 1 feeds combinational logic, which feeds state element 2, all within one clock cycle]
Single Cycle Implementation
• Calculate cycle time assuming negligible delays except:
– memory (200ps), ALU and adders (100ps), register file access (50ps)
[Figure: single-cycle datapath with the PC, instruction memory, registers, ALU, data memory, two adders, sign-extend unit, shift-left-2 unit, and three multiplexors. Control lines: RegWrite, ALUSrc, 4-bit ALU operation, MemRead, MemWrite, MemtoReg, PCSrc.]
Figure 5.16 The effect of each of the seven control signals.
Signal name | Effect when deasserted | Effect when asserted
RegDst | The register destination number for the Write register comes from the rt field (bits 20:16). | The register destination number for the Write register comes from the rd field (bits 15:11).
RegWrite | None. | The register on the Write register input is written with the value on the Write data input.
ALUSrc | The second ALU operand comes from the second register file output (Read data 2). | The second ALU operand is the sign-extended, lower 16 bits of the instruction.
PCSrc | The PC is replaced by the output of the adder that computes the value of PC + 4. | The PC is replaced by the output of the adder that computes the branch target.
MemRead | None. | Data memory contents designated by the address input are put on the Read data output.
MemWrite | None. | Data memory contents designated by the address input are replaced by the value on the Write data input.
MemtoReg | The value fed to the register Write data input comes from the ALU. | The value fed to the register Write data input comes from the data memory.
Figure 5.17 The simple datapath with the control unit.
Figure 5.18 The setting of the control lines is completely determined by the opcode fields of the instruction.
Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0
lw | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
sw | X | 1 | X | 0 | 0 | 1 | 0 | 0 | 0
beq | X | 0 | X | 0 | 0 | 0 | 1 | 0 | 1
Figure 5.19 The datapath in operation for an R-type instruction such as add $t1, $t2, $t3.
Figure 5.20 The datapath in operation for a load instruction.
Figure 5.21 The datapath in operation for a branch equal instruction.
Figure 5.22 The control function for the simple single-cycle implementation is completely specified by this truth table.
Input or output Signal name R-format lw sw beq
Inputs Op5 0 1 1 0
Op4 0 0 0 0
Op3 0 0 1 0
Op2 0 0 0 1
Op1 0 1 1 0
Op0 0 1 1 0
Outputs RegDst 1 0 X X
ALUSrc 0 1 1 0
MemtoReg 0 1 X X
RegWrite 1 1 0 0
MemRead 0 1 0 0
MemWrite 0 0 1 0
Branch 0 0 0 1
ALUOp1 1 0 0 0
ALUOp0 0 0 0 1
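The truth table in Figure 5.22 maps a 6-bit opcode to nine output bits, so it can be sketched as a dictionary lookup. The Python below is a behavioral stand-in for the combinational control unit; the names `CONTROL` and `main_control` are hypothetical, don't-care outputs (X) are represented as `None`, and ALUOp1/ALUOp0 are combined into one 2-bit `ALUOp` value.

```python
# opcode (Op5..Op0) -> control outputs; None marks a don't-care (X)
CONTROL = {
    0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),        # R-format
    0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),        # lw
    0b101011: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),        # sw
    0b000100: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),        # beq
}


def main_control(opcode):
    """Return the control signal settings for one of the four supported opcodes."""
    return CONTROL[opcode]


if __name__ == "__main__":
    for opcode in CONTROL:
        print(format(opcode, "06b"), main_control(opcode))
```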
Figure 5.23 Instruction format for the jump instruction (opcode = 2).
Field | 000010 | address
Bit positions | 31:26 | 25:0
Figure 5.24 The simple control and datapath are extended to handle the jump instruction.
Problem: Performance of Single-Cycle Machines (p.315)
Assume that the operation times for the major functional units in this implementation are the following:
Memory units: 200 picoseconds (ps)
ALU and adders: 100 ps
Register file (read or write): 50 ps
Assuming that the multiplexors, control unit, PC accesses, sign extension unit, and wires have no delay, which of the following implementations would be faster and by how much?
1. An implementation in which every instruction operates in 1 clock cycle of a fixed length.
2. An implementation where every instruction executes in 1 clock cycle using a variable-length clock, which for each instruction is only as long as it needs to be.
To compare the performance, assume the following instruction mix: 25% loads, 10% stores, 45% ALU instructions, 15% branches, and 5% jumps.
• Let’s start by comparing the CPU execution times.
$$\text{CPU execution time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}$$
Since CPI must be 1, we can simplify this to
$$\text{CPU execution time} = \text{Instruction count} \times \text{Clock cycle time}$$
• The critical path for the different instruction classes is as follows:
Instruction class | Functional units used by the instruction class
R-type | Instruction fetch, Register access, ALU, Register access
Load word | Instruction fetch, Register access, ALU, Memory access, Register access
Store word | Instruction fetch, Register access, ALU, Memory access
Branch | Instruction fetch, Register access, ALU
Jump | Instruction fetch
• Using these critical paths, we can compute the required length for each instruction class:
Instruction class | Instruction memory | Register read | ALU operation | Data memory | Register write | Total
R-type | 200 | 50 | 100 | 0 | 50 | 400 ps
Load word | 200 | 50 | 100 | 200 | 50 | 600 ps
Store word | 200 | 50 | 100 | 200 | | 550 ps
Branch | 200 | 50 | 100 | 0 | | 350 ps
Jump | 200 | | | | | 200 ps
• Thus, the average time per instruction with a variable clock is
$$\text{CPU clock cycle} = 600 \times 25\% + 550 \times 10\% + 400 \times 45\% + 350 \times 15\% + 200 \times 5\% = 447.5\ \text{ps}$$
• Since the variable clock implementation has a shorter average clock cycle, it is clearly faster. Let’s find the performance ratio:
$$\frac{\text{CPU performance}_{\text{variable clock}}}{\text{CPU performance}_{\text{single clock}}} = \frac{\text{CPU execution time}_{\text{single clock}}}{\text{CPU execution time}_{\text{variable clock}}} = \frac{IC \times \text{CPU clock cycle}_{\text{single clock}}}{IC \times \text{CPU clock cycle}_{\text{variable clock}}} = \frac{\text{CPU clock cycle}_{\text{single clock}}}{\text{CPU clock cycle}_{\text{variable clock}}} = \frac{600}{447.5} = 1.34$$
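The arithmetic in this problem can be checked with a few lines of Python. The sketch below uses the unit delays and instruction mix stated in the problem; the variable and dictionary names are hypothetical, and the mix is kept in whole percent so that the sums stay exact.

```python
# Functional unit delays in picoseconds, from the problem statement
MEM, ALU, REG = 200, 100, 50

# Critical path per instruction class (sum of the units it uses)
latency_ps = {
    "R-type": MEM + REG + ALU + REG,        # fetch, reg read, ALU, reg write
    "load":   MEM + REG + ALU + MEM + REG,  # adds a data memory access
    "store":  MEM + REG + ALU + MEM,
    "branch": MEM + REG + ALU,
    "jump":   MEM,
}

# Instruction mix in percent
mix_percent = {"load": 25, "store": 10, "R-type": 45, "branch": 15, "jump": 5}

# A fixed clock must fit the slowest instruction; a variable clock averages.
single_clock_ps = max(latency_ps.values())
variable_clock_ps = sum(latency_ps[k] * mix_percent[k] for k in mix_percent) / 100
speedup = single_clock_ps / variable_clock_ps

if __name__ == "__main__":
    print(single_clock_ps, variable_clock_ps, round(speedup, 2))
```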
5.5 A Multicycle Implementation
Keywords
• Multicycle implementation: Also called multiple clock cycle implementation. An implementation in which an instruction is executed in multiple clock cycles.
• Microprogramming: A symbolic representation of control in the form of instructions, called microinstructions, that are executed on a simple micromachine.
• Finite state machine: A sequential logic function consisting of a set of inputs and outputs, a next-state function that maps the current state and the inputs to a new state, and an output function that maps the current state and possibly the input to a set of asserted outputs.
• Next-state function: A combinational function that, given the inputs and the current state, determines the next state of a finite state machine.
Where we are headed
• Single Cycle Problems:
– what if we had a more complicated instruction like floating point?
– wasteful of area
• One Solution:
– use a “smaller” cycle time
– have different instructions take different numbers of cycles
– a “multicycle” datapath:
• We will be reusing functional units
– ALU used to compute address and to increment PC
– Memory used for instruction and data
• Our control signals will not be determined directly by instruction
– e.g., what should the ALU do for a “subtract” instruction?
• We’ll use a finite state machine for control
Multicycle Approach
• Break up the instructions into steps, each step takes a cycle
– balance the amount of work to be done
– restrict each cycle to use only one major functional unit
• At the end of a cycle
– store values for use in later cycles (easiest thing to do)
– introduce additional “internal” registers
Multicycle Approach
Figure 5.27 The multicycle datapath from Figure 5.26 with the control lines shown.
Figure 5.28 The complete datapath for the multicycle implementation together with the necessary control lines.
Figure 5.29 The action caused by the setting of each control signal in Figure 5.28 on page 323.
Actions of the 1-bit control signals
Signal name | Effect when deasserted | Effect when asserted
RegDst | The register file destination number for the Write register comes from the rt field. | The register file destination number for the Write register comes from the rd field.
RegWrite | None. | The general-purpose register selected by the Write register number is written with the value of the Write data input.
ALUSrcA | The first ALU operand is the PC. | The first ALU operand comes from the A register.
MemRead | None. | Content of memory at the location specified by the address input is put on the Memory data output.
MemWrite | None. | Memory contents at the location specified by the address input are replaced by the value on the Write data input.
MemtoReg | The value fed to the register file Write data input comes from ALUOut. | The value fed to the register file Write data input comes from the MDR.
IorD | The PC is used to supply the address to the memory unit. | ALUOut is used to supply the address to the memory unit.
IRWrite | None. | The output of the memory is written into the IR.
PCWrite | None. | The PC is written; the source is controlled by PCSource.
PCWriteCond | None. | The PC is written if the Zero output from the ALU is also active.
Figure 5.29 (continued)
Actions of the 2-bit control signals
Signal name | Value (binary) | Effect
ALUOp | 00 | The ALU performs an add operation.
ALUOp | 01 | The ALU performs a subtract operation.
ALUOp | 10 | The funct field of the instruction determines the ALU operation.
ALUSrcB | 00 | The second input to the ALU comes from the B register.
ALUSrcB | 01 | The second input to the ALU is the constant 4.
ALUSrcB | 10 | The second input to the ALU is the sign-extended, lower 16 bits of the IR.
ALUSrcB | 11 | The second input to the ALU is the sign-extended, lower 16 bits of the IR shifted left 2 bits.
PCSource | 00 | Output of the ALU (PC + 4) is sent to the PC for writing.
PCSource | 01 | The contents of ALUOut (the branch target address) are sent to the PC for writing.
PCSource | 10 | The jump target address (IR[25:0] shifted left 2 bits and concatenated with PC + 4[31:28]) is sent to the PC for writing.
Instructions from ISA perspective
• Consider each instruction from perspective of ISA.
• Example:
– The add instruction changes a register.
– Register specified by bits 15:11 of instruction.
– Instruction specified by the PC.
– New value is the sum (“op”) of two registers.
– Registers specified by bits 25:21 and 20:16 of the instruction.
Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]
– In order to accomplish this we must break up the instruction (kind of like introducing variables when programming).
Breaking down an instruction
• ISA definition of arithmetic:
Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]
• Could break down to:
– IR <= Memory[PC]
– A <= Reg[IR[25:21]]
– B <= Reg[IR[20:16]]
– ALUOut <= A op B
– Reg[IR[15:11]] <= ALUOut
• We forgot an important part of the definition of arithmetic!
– PC <= PC + 4
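The breakdown can be traced in Python by treating IR, A, B, and ALUOut as ordinary variables. This is only a sketch for an `add` instruction: the function name `run_add` is hypothetical, memory is modeled as a dictionary from byte address to instruction word, and the result is written to the register named by bits 15:11, as in the ISA definition of arithmetic.

```python
def run_add(memory, reg, pc):
    """Execute one R-type add as a sequence of register transfers."""
    ir = memory[pc]                    # IR <= Memory[PC]
    pc = pc + 4                        # PC <= PC + 4
    a = reg[(ir >> 21) & 0x1F]         # A <= Reg[IR[25:21]]
    b = reg[(ir >> 16) & 0x1F]         # B <= Reg[IR[20:16]]
    alu_out = (a + b) & 0xFFFFFFFF     # ALUOut <= A op B (op is add here)
    reg[(ir >> 11) & 0x1F] = alu_out   # Reg[IR[15:11]] <= ALUOut
    return pc


if __name__ == "__main__":
    memory = {0: 0x02324020}           # add $8, $17, $18
    reg = [0] * 32
    reg[17], reg[18] = 5, 7
    new_pc = run_add(memory, reg, 0)
    print(new_pc, reg[8])
```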
Idea behind multicycle approach
• We define each instruction from the ISA perspective (do this!)
• Break it down into steps following our rule that data flows through at most one major functional unit (e.g., balance work across steps)
• Introduce new registers as needed (e.g., A, B, ALUOut, MDR, etc.)
• Finally, try to pack as much work as possible into each step (avoids unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify the solution)
• Result: Our book’s multicycle Implementation!
• Instruction Fetch
• Instruction Decode and Register Fetch
• Execution, Memory Address Computation, or Branch Completion
• Memory Access or R-type instruction completion
• Write-back step
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
Five Execution Steps
• Use PC to get instruction and put it in the Instruction Register.
• Increment the PC by 4 and put the result back in the PC.
• Can be described succinctly using RTL "Register-Transfer Language"
IR <= Memory[PC];
PC <= PC + 4;
Can we figure out the values of the control signals?
What is the advantage of updating the PC now?
Step 1: Instruction Fetch
• Read registers rs and rt in case we need them
• Compute the branch address in case the instruction is a branch
• RTL:
A <= Reg[IR[25:21]];
B <= Reg[IR[20:16]];
ALUOut <= PC + (sign-extend(IR[15:0]) << 2);
• We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)
Step 2: Instruction Decode and Register Fetch
• ALU is performing one of three functions, based on instruction type
• Memory Reference:
ALUOut <= A + sign-extend(IR[15:0]);
• R-type:
ALUOut <= A op B;
• Branch:
if (A==B) PC <= ALUOut;
Step 3 (instruction dependent)
• Loads and stores access memory
MDR <= Memory[ALUOut];
or
Memory[ALUOut] <= B;
• R-type instructions finish
Reg[IR[15:11]] <= ALUOut;
The write actually takes place at the end of the cycle on the edge
Step 4 (R-type or memory-access)
• Reg[IR[20:16]] <= MDR;
Which instruction needs this?
Write-back step
Summary:
• How many cycles will it take to execute this code?
lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label # assume not taken
add $t5, $t2, $t3
sw $t5, 8($t3)
Label: ...
• What is going on during the 8th cycle of execution?
• In what cycle does the actual addition of $t2 and $t3 take place?
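One way to work through these questions is to lay the five instructions out cycle by cycle. The Python sketch below assumes the step counts of the five-step scheme (loads 5, stores 4, R-type 4, branches 3) and that each instruction starts only after the previous one finishes; the names `CYCLES` and `schedule` are hypothetical.

```python
# Clock cycles per instruction in the multicycle implementation
CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3}

program = ["lw", "lw", "beq", "add", "sw"]


def schedule(program):
    """Return one (instruction index, opcode, step) entry per clock cycle."""
    timeline = []
    for index, op in enumerate(program):
        for step in range(1, CYCLES[op] + 1):
            timeline.append((index, op, step))
    return timeline


timeline = schedule(program)

if __name__ == "__main__":
    print("total cycles:", len(timeline))
    print("cycle 8:", timeline[8 - 1])
    # Step 3 of the add is the cycle in which the ALU adds the two registers.
    print("add executes in cycle:", timeline.index((3, "add", 3)) + 1)
```

Under these assumptions the eighth cycle falls in step 3 of the second lw (memory address computation), and the ALU adds $t2 and $t3 in step 3 of the add.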
Simple Questions
Problem: CPI in a multicycle CPU
• Using the SPECINT2000 instruction mix shown in Figure 3.26, what is the CPI, assuming that each state in the multicycle CPU requires 1 clock cycle?
Answer: The mix is 25% loads (1% load byte+24% load word), 10% stores (1% store byte+9% store word), 11% branches (6% beq, 5% bne), 2% jumps (1% jal+1% jr), and 52% ALU (all the rest of the mix, which we assume to be ALU instructions). From Figure 5.30 on page 329, the number of clock cycles for each instruction class is the following:
Loads: 5 ; Store: 4; ALU instructions: 4; Branches: 3; Jumps: 3;
The CPI is given by the following:
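$$\text{CPI} = \frac{\text{CPU clock cycles}}{\text{Instruction count}} = \frac{\sum_i \text{Instruction count}_i \times \text{CPI}_i}{\text{Instruction count}} = \sum_i \frac{\text{Instruction count}_i}{\text{Instruction count}} \times \text{CPI}_i$$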
• The ratio
$$\frac{\text{Instruction count}_i}{\text{Instruction count}}$$
is simply the instruction frequency for the instruction class i. We can therefore substitute to obtain
$$\text{CPI} = 0.25 \times 5 + 0.10 \times 4 + 0.52 \times 4 + 0.11 \times 3 + 0.02 \times 3 = 4.12$$
This CPI is better than the worst-case CPI of 5.0 when all the instructions take the same number of clock cycles.
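The weighted sum is easy to verify in Python. This sketch uses the instruction mix and per-class cycle counts given in the answer; the dictionary names are hypothetical, and frequencies are kept in whole percent so the sum is exact.

```python
# Instruction frequency in percent and clock cycles per instruction class
mix_percent = {"loads": 25, "stores": 10, "ALU": 52, "branches": 11, "jumps": 2}
cycles = {"loads": 5, "stores": 4, "ALU": 4, "branches": 3, "jumps": 3}

# CPI = sum over classes of (frequency of class i) x (CPI of class i)
cpi = sum(mix_percent[k] * cycles[k] for k in mix_percent) / 100

if __name__ == "__main__":
    print(cpi)
```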
• Finite state machines:
– a set of states and
– next state function (determined by current state and the input)
– output function (determined by current state and possibly input)
– We’ll use a Moore machine (output based only on current state)
Review: finite state machines
[Figure: finite state machine. The Inputs and the Current state feed the Next-state function, which produces the Next state; the Clock updates the Current state; the Output function produces the Outputs.]
Review: finite state machines
• Example:
B.37 A friend would like you to build an “electronic eye” for use as a fake security device. The device consists of three lights lined up in a row, controlled by the outputs Left, Middle, and Right, which, if asserted, indicate that a light should be on. Only one light is on at a time, and the light “moves” from left to right and then from right to left, thus scaring away thieves who believe that the device is monitoring their activity. Draw the graphical representation for the finite state machine used to specify the electronic eye. Note that the rate of the eye’s movement will be controlled by the clock speed (which should not be too great) and that there are essentially no inputs.
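One possible way to specify the electronic eye is as a four-state Moore machine, sketched below in Python. The state names, the choice of Left as the starting state, and the helper `run` are assumptions made for illustration; the Middle light needs two states because the machine must remember which direction the light is travelling.

```python
# Next-state function: no inputs, so it depends only on the current state
NEXT_STATE = {
    "LEFT": "MIDDLE_GOING_RIGHT",
    "MIDDLE_GOING_RIGHT": "RIGHT",
    "RIGHT": "MIDDLE_GOING_LEFT",
    "MIDDLE_GOING_LEFT": "LEFT",
}

# Output function (Moore machine): (Left, Middle, Right) for each state
OUTPUT = {
    "LEFT": (1, 0, 0),
    "MIDDLE_GOING_RIGHT": (0, 1, 0),
    "RIGHT": (0, 0, 1),
    "MIDDLE_GOING_LEFT": (0, 1, 0),
}


def run(cycles, state="LEFT"):
    """Return the (Left, Middle, Right) outputs for a number of clock cycles."""
    outputs = []
    for _ in range(cycles):
        outputs.append(OUTPUT[state])
        state = NEXT_STATE[state]   # state changes on each clock edge
    return outputs


if __name__ == "__main__":
    for lights in run(6):
        print(lights)
```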
• Value of control signals is dependent upon:
– what instruction is being executed
– which step is being performed
• Use the information we’ve accumulated to specify a finite state machine
– specify the finite state machine graphically, or
– use microprogramming
• Implementation can be derived from specification
Implementing the Control
Figure 5.31 The high-level view of the finite state machine control.
Figure 5.32 The instruction fetch and decode portion of every instruction is identical.
Figure 5.33 The finite state machine for controlling memory-reference instructions has four states.
902004 Morgan Kaufmann Publishers
Figure 5.34 R-type instructions can be implemented with a simple two-state finite state machine.
912004 Morgan Kaufmann Publishers
Figure 5.35 The branch instruction requires a single state.
922004 Morgan Kaufmann Publishers
Figure 5.36 The jump instruction requires a single state that asserts two control signals to write the PC with the lower 26 bits of the instruction register shifted left 2 bits and concatenated to the upper 4 bits of the PC of this instruction.
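The address formation described in the caption can be sketched as bit manipulation. This is a minimal illustration, not the datapath itself: the function name `jump_target` and the example values are hypothetical, and `pc` stands for whatever PC value the datapath holds when the jump state is reached.

```python
def jump_target(pc, instruction):
    """Form the 32-bit jump target from the PC and the instruction word."""
    low26 = instruction & 0x03FFFFFF          # lower 26 bits of the instruction
    return (pc & 0xF0000000) | (low26 << 2)   # upper 4 PC bits + field shifted left 2

# Hypothetical example: a j instruction (opcode 000010) with address field 0x100.
instr = (0b000010 << 26) | 0x100
print(hex(jump_target(0x40000008, instr)))  # 0x40000400
```

Shifting the 26-bit field left by 2 yields 28 bits, so together with the upper 4 bits of the PC the result is a full 32-bit word-aligned address.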
• Implementation:
Finite State Machine for Control
[Figure: control logic block. Inputs: opcode bits Op5 to Op0 from the instruction register opcode field and current-state bits S3 to S0 from the state register. Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and next-state bits NS3 to NS0, which feed the state register.]
• Note:
– don’t care if not mentioned
– asserted if name only
– otherwise exact value
• How many state bits will we need?
Graphical Specification of FSM
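The notation and the state-bit question can be made concrete with a small table. The sketch below is an assumption-laden illustration: the per-state signal values follow the usual ten-state multicycle MIPS control (states 0 to 9) and should be checked against the state diagram, and `state_bits` is a hypothetical helper.

```python
# Output function of a Moore control FSM, using the slide's convention:
# a signal set to 1 is "asserted", other values are exact multi-bit settings,
# and a signal that is not listed is not mentioned in that state.
# (In practice, write enables that are not listed must be deasserted.)
# The per-state contents are assumed, not taken from the slide text.
STATE_OUTPUTS = {
    0: {"MemRead": 1, "IRWrite": 1, "PCWrite": 1, "IorD": 0,
        "ALUSrcA": 0, "ALUSrcB": 0b01, "ALUOp": 0b00, "PCSource": 0b00},
    1: {"ALUSrcA": 0, "ALUSrcB": 0b11, "ALUOp": 0b00},
    2: {"ALUSrcA": 1, "ALUSrcB": 0b10, "ALUOp": 0b00},
    3: {"MemRead": 1, "IorD": 1},
    4: {"RegWrite": 1, "RegDst": 0, "MemtoReg": 1},
    5: {"MemWrite": 1, "IorD": 1},
    6: {"ALUSrcA": 1, "ALUSrcB": 0b00, "ALUOp": 0b10},
    7: {"RegWrite": 1, "RegDst": 1, "MemtoReg": 0},
    8: {"ALUSrcA": 1, "ALUSrcB": 0b00, "ALUOp": 0b01,
        "PCWriteCond": 1, "PCSource": 0b01},
    9: {"PCWrite": 1, "PCSource": 0b10},
}

def state_bits(num_states):
    """Smallest number of state bits that can encode `num_states` states."""
    return max(1, (num_states - 1).bit_length())

print(state_bits(len(STATE_OUTPUTS)))  # 4
```

Ten states need four state bits, which leaves six of the sixteen encodings unused.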
PLA Implementation (Section C.3 & Appendix B)
• If I picked a horizontal or vertical line could you explain it?
[Figure: PLA implementing the control logic. Inputs: Op5 to Op0 and S3 to S0. Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst, NS3, NS2, NS1, NS0.]
• ROM = "Read Only Memory"
– values of memory locations are fixed ahead of time
• A ROM can be used to implement a truth table
– if the address is m bits, we can address 2^m entries in the ROM
– our outputs are the bits of data that the address points to
m is the "height", and n is the "width"
ROM Implementation (Section C.3 & Appendix B)

| Address (m = 3 bits) | Data (n = 4 bits) |
|---|---|
| 0 0 0 | 0 0 1 1 |
| 0 0 1 | 1 1 0 0 |
| 0 1 0 | 1 1 0 0 |
| 0 1 1 | 1 0 0 0 |
| 1 0 0 | 0 0 0 0 |
| 1 0 1 | 0 0 0 1 |
| 1 1 0 | 0 1 1 0 |
| 1 1 1 | 0 1 1 1 |
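A ROM used as a truth table is just an indexed lookup. The sketch below loads the eight 4-bit words from the table into a list; the names `ROM`, `M_BITS`, `N_BITS`, and `rom_read` are illustrative.

```python
# ROM contents from the table: index = 3-bit address, value = 4-bit data word.
ROM = [0b0011, 0b1100, 0b1100, 0b1000, 0b0000, 0b0001, 0b0110, 0b0111]

M_BITS = 3   # address width
N_BITS = 4   # data width

def rom_read(address):
    """Return the word stored at `address` as a tuple of N_BITS output bits."""
    word = ROM[address]
    return tuple((word >> i) & 1 for i in reversed(range(N_BITS)))

assert len(ROM) == 2 ** M_BITS   # an m-bit address selects one of 2^m entries
print(rom_read(0b101))           # (0, 0, 0, 1)
```

Each output column of the table is one Boolean function of the address bits, so a single ROM implements n functions at once.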
• How many inputs are there?
6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)
• How many outputs are there?
16 datapath-control outputs, 4 state bits = 20 outputs
• ROM is 2^10 x 20 = 20K bits (and a rather unusual size)
• Rather wasteful, since for lots of the entries, the outputs are the same
– i.e., opcode is often ignored
ROM Implementation
• Break up the table into two parts
– 4 state bits tell you the 16 outputs, 2^4 x 16 bits of ROM
– 10 bits tell you the 4 next state bits, 2^10 x 4 bits of ROM
– Total: 4.3K bits of ROM
• PLA is much smaller
– can share product terms
– only need entries that produce an active output
– can take into account don't cares
• Size is (#inputs x #product-terms) + (#outputs x #product-terms)
For this example = (10x17)+(20x17) = 510 PLA cells
• PLA cells usually about the size of a ROM cell (slightly bigger)
ROM vs PLA
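The three size estimates can be reproduced with a few lines of arithmetic. This is a sketch using only the counts quoted on the slides (6 opcode bits, 4 state bits, 16 datapath outputs, 17 product terms); the variable names are illustrative.

```python
OPCODE_BITS, STATE_BITS = 6, 4
DATAPATH_OUTPUTS, NEXT_STATE_BITS = 16, 4
PRODUCT_TERMS = 17

inputs = OPCODE_BITS + STATE_BITS              # 10 address lines
outputs = DATAPATH_OUTPUTS + NEXT_STATE_BITS   # 20 output bits

# One ROM addressed by opcode and state together.
single_rom = 2 ** inputs * outputs
# Two ROMs: datapath outputs depend on state only; next state needs the opcode too.
split_rom = 2 ** STATE_BITS * DATAPATH_OUTPUTS + 2 ** inputs * NEXT_STATE_BITS
# PLA: one cell per (input, product term) and per (output, product term).
pla_cells = inputs * PRODUCT_TERMS + outputs * PRODUCT_TERMS

print(single_rom, split_rom, pla_cells)  # 20480 4352 510
```

Here 20480 bits is the 20K figure and 4352 bits is 4.25K, which the slide rounds to 4.3K.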
• Complex instructions: the "next state" is often current state + 1
Another Implementation Style
[Figure: control unit built from a PLA or ROM and a State register. The PLA or ROM outputs are PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, BWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and AddrCtl. Address select logic chooses the next state using AddrCtl, an adder that adds 1 to the current state, and the instruction register opcode field Op[5–0].]
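Details

Dispatch ROM 1

| Op | Opcode name | Value |
|---|---|---|
| 000000 | R-format | 0110 |
| 000010 | jmp | 1001 |
| 000100 | beq | 1000 |
| 100011 | lw | 0010 |
| 101011 | sw | 0010 |

Dispatch ROM 2

| Op | Opcode name | Value |
|---|---|---|
| 100011 | lw | 0011 |
| 101011 | sw | 0101 |

| State number | Address-control action | Value of AddrCtl |
|---|---|---|
| 0 | Use incremented state | 3 |
| 1 | Use dispatch ROM 1 | 1 |
| 2 | Use dispatch ROM 2 | 2 |
| 3 | Use incremented state | 3 |
| 4 | Replace state number by 0 | 0 |
| 5 | Replace state number by 0 | 0 |
| 6 | Use incremented state | 3 |
| 7 | Replace state number by 0 | 0 |
| 8 | Replace state number by 0 | 0 |
| 9 | Replace state number by 0 | 0 |

[Figure: address select logic. A four-input mux controlled by AddrCtl selects the next state: input 0 is the constant 0, input 1 is dispatch ROM 1, input 2 is dispatch ROM 2, and input 3 is the adder output (State + 1). Both dispatch ROMs are indexed by the instruction register opcode field, and the State register addresses the PLA or ROM.]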
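The address select logic can be simulated directly from the dispatch ROM and AddrCtl tables. The sketch below is illustrative rather than a hardware description: the dictionaries hold the table values as integers, and `next_state` and `trace` are hypothetical helper names.

```python
# Dispatch ROMs: opcode -> next state (values from the tables, as integers).
DISPATCH_ROM_1 = {0b000000: 6, 0b000010: 9, 0b000100: 8,
                  0b100011: 2, 0b101011: 2}
DISPATCH_ROM_2 = {0b100011: 3, 0b101011: 5}

# AddrCtl value produced in each state.
ADDR_CTL = {0: 3, 1: 1, 2: 2, 3: 3, 4: 0, 5: 0, 6: 3, 7: 0, 8: 0, 9: 0}

def next_state(state, opcode):
    """Model the 4-way mux that AddrCtl controls."""
    ctl = ADDR_CTL[state]
    if ctl == 0:
        return 0                        # back to fetch
    if ctl == 1:
        return DISPATCH_ROM_1[opcode]
    if ctl == 2:
        return DISPATCH_ROM_2[opcode]
    return state + 1                    # ctl == 3: incremented state

def trace(opcode):
    """List the states visited by one instruction, starting at state 0."""
    states = [0]
    while (nxt := next_state(states[-1], opcode)) != 0:
        states.append(nxt)
    return states

print(trace(0b100011))  # lw: [0, 1, 2, 3, 4]
print(trace(0b000000))  # R-format: [0, 1, 6, 7]
```

The state numbering is chosen so that most transitions are "state + 1", which is why only two small dispatch ROMs are needed.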
5.6 Exceptions
Keywords
• Exception Also called interrupt. An unscheduled event that disrupts program execution; used to detect overflow.
• Interrupt An exception that comes from outside of the processor. (Some architectures use the term interrupt for all exceptions.)

| Type of event | From where? | MIPS terminology |
|---|---|---|
| I/O device request | External | Interrupt |
| Invoke the operating system from user program | Internal | Exception |
| Arithmetic overflow | Internal | Exception |
| Using an undefined instruction | Internal | Exception |
| Hardware malfunctions | Either | Exception or interrupt |
Keywords
• Vectored interrupt An interrupt for which the address to which control is transferred is determined by the cause of the exception.

| Exception type | Exception vector address (in hex) |
|---|---|
| Undefined instruction | C000 0000 hex |
| Arithmetic overflow | C000 0020 hex |
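With vectored interrupts, choosing the handler is a table lookup on the cause. The sketch below assumes the two vector addresses listed for this example (C000 0000 and C000 0020 hex); the names `EXCEPTION_VECTORS` and `exception_target` are hypothetical.

```python
# Cause of the exception -> address that control transfers to.
EXCEPTION_VECTORS = {
    "undefined instruction": 0xC0000000,
    "arithmetic overflow": 0xC0000020,
}

def exception_target(cause):
    """Return the new PC value for the given exception cause."""
    return EXCEPTION_VECTORS[cause]

print(hex(exception_target("arithmetic overflow")))  # 0xc0000020
```

The two vectors are 0x20 bytes apart, which leaves room for eight 4-byte instructions at each entry point.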
How Control Checks for Exceptions
• Undefined instruction
• Arithmetic overflow
Figure 5.39 The multicycle datapath with the addition needed to implement exceptions.
Figure 5.40 This shows the finite state machine with the additions to handle exception detection.
5.9 Real Stuff: The Organization of Recent Pentium
Keywords
• Microprogrammed control A method of specifying control that uses microcode rather than a finite state representation.
• Hardwired control An implementation of finite state machine control typically using programmable logic arrays (PLAs) or collections of PLAs and random logic.
• Microcode The set of microinstructions that control a processor.
• Superscalar An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle.
• Microinstruction A representation of control using low-level instructions, each of which asserts a set of control signals that are active on a given clock cycle and specifies which microinstruction to execute next.
Keywords
• Micro-operations The RISC-like instructions directly executed by the hardware in recent Pentium implementations.
• Trace cache An instruction cache that holds a sequence of instructions with a given starting address; in recent Pentium implementations the trace cache holds micro-operations rather than IA-32 instructions.
• Dispatch An operation in a microprogrammed control unit in which the next microinstruction is selected on the basis of one or more fields of a macroinstruction, usually by creating a table containing the addresses of the target microinstructions and indexing the table using a field of the macroinstruction. The dispatch tables are typically implemented in ROM or programmable logic array (PLA). The term dispatch is also used in dynamically scheduled processors to refer to the process of sending an instruction to a queue.
Microprogramming
• What are the “microinstructions” ?
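[Figure: microprogrammed control unit. The microcode memory produces the outputs to the datapath (PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, BWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst) together with AddrCtl. The microprogram counter supplies the input to the microcode memory; address select logic picks its next value using AddrCtl, an adder that adds 1, and the instruction register opcode field.]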
• A specification methodology
– appropriate if hundreds of opcodes, modes, cycles, etc.
– signals specified symbolically using microinstructions
• Will two implementations of the same architecture have the same microcode?
• What would a microassembler do?
Microprogramming

| Label | ALU control | SRC1 | SRC2 | Register control | Memory | PCWrite control | Sequencing |
|---|---|---|---|---|---|---|---|
| Fetch | Add | PC | 4 | | Read PC | ALU | Seq |
| | Add | PC | Extshft | Read | | | Dispatch 1 |
| Mem1 | Add | A | Extend | | | | Dispatch 2 |
| LW2 | | | | | Read ALU | | Seq |
| | | | | Write MDR | | | Fetch |
| SW2 | | | | | Write ALU | | Fetch |
| Rformat1 | Func code | A | B | | | | Seq |
| | | | | Write ALU | | | Fetch |
| BEQ1 | Subt | A | B | | | ALUOut-cond | Fetch |
| JUMP1 | | | | | | Jump address | Fetch |
Microinstruction format

| Field name | Value | Signals active | Comment |
|---|---|---|---|
| ALU control | Add | ALUOp = 00 | Cause the ALU to add. |
| | Subt | ALUOp = 01 | Cause the ALU to subtract; this implements the compare for branches. |
| | Func code | ALUOp = 10 | Use the instruction's function code to determine ALU control. |
| SRC1 | PC | ALUSrcA = 0 | Use the PC as the first ALU input. |
| | A | ALUSrcA = 1 | Register A is the first ALU input. |
| SRC2 | B | ALUSrcB = 00 | Register B is the second ALU input. |
| | 4 | ALUSrcB = 01 | Use 4 as the second ALU input. |
| | Extend | ALUSrcB = 10 | Use output of the sign extension unit as the second ALU input. |
| | Extshft | ALUSrcB = 11 | Use the output of the shift-by-two unit as the second ALU input. |
| Register control | Read | | Read two registers using the rs and rt fields of the IR as the register numbers and putting the data into registers A and B. |
| | Write ALU | RegWrite, RegDst = 1, MemtoReg = 0 | Write a register using the rd field of the IR as the register number and the contents of the ALUOut as the data. |
| | Write MDR | RegWrite, RegDst = 0, MemtoReg = 1 | Write a register using the rt field of the IR as the register number and the contents of the MDR as the data. |
| Memory | Read PC | MemRead, IorD = 0 | Read memory using the PC as address; write result into IR (and the MDR). |
| | Read ALU | MemRead, IorD = 1 | Read memory using the ALUOut as address; write result into MDR. |
| | Write ALU | MemWrite, IorD = 1 | Write memory using the ALUOut as address, contents of B as the data. |
| PC write control | ALU | PCSource = 00, PCWrite | Write the output of the ALU into the PC. |
| | ALUOut-cond | PCSource = 01, PCWriteCond | If the Zero output of the ALU is active, write the PC with the contents of the register ALUOut. |
| | Jump address | PCSource = 10, PCWrite | Write the PC with the jump address from the instruction. |
| Sequencing | Seq | AddrCtl = 11 | Choose the next microinstruction sequentially. |
| | Fetch | AddrCtl = 00 | Go to the first microinstruction to begin a new instruction. |
| | Dispatch 1 | AddrCtl = 01 | Dispatch using the ROM 1. |
| | Dispatch 2 | AddrCtl = 10 | Dispatch using the ROM 2. |
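The format table is in effect the translation table a microassembler would use. The sketch below is a hedged illustration of that idea, not an actual tool: the field keys (`alu`, `src1`, and so on) and the function `assemble` are hypothetical, and the signal values are copied from the table.

```python
# Symbolic field value -> control signals, following the format table.
FIELDS = {
    "alu": {"Add": {"ALUOp": 0b00}, "Subt": {"ALUOp": 0b01},
            "Func code": {"ALUOp": 0b10}},
    "src1": {"PC": {"ALUSrcA": 0}, "A": {"ALUSrcA": 1}},
    "src2": {"B": {"ALUSrcB": 0b00}, "4": {"ALUSrcB": 0b01},
             "Extend": {"ALUSrcB": 0b10}, "Extshft": {"ALUSrcB": 0b11}},
    "reg": {"Read": {},
            "Write ALU": {"RegWrite": 1, "RegDst": 1, "MemtoReg": 0},
            "Write MDR": {"RegWrite": 1, "RegDst": 0, "MemtoReg": 1}},
    "mem": {"Read PC": {"MemRead": 1, "IorD": 0},
            "Read ALU": {"MemRead": 1, "IorD": 1},
            "Write ALU": {"MemWrite": 1, "IorD": 1}},
    "pc": {"ALU": {"PCSource": 0b00, "PCWrite": 1},
           "ALUOut-cond": {"PCSource": 0b01, "PCWriteCond": 1},
           "Jump address": {"PCSource": 0b10, "PCWrite": 1}},
    "seq": {"Seq": {"AddrCtl": 0b11}, "Fetch": {"AddrCtl": 0b00},
            "Dispatch 1": {"AddrCtl": 0b01}, "Dispatch 2": {"AddrCtl": 0b10}},
}

def assemble(microinstruction):
    """Translate one symbolic microinstruction (field -> value) into signals."""
    signals = {}
    for field, value in microinstruction.items():
        signals.update(FIELDS[field][value])
    return signals

# First microinstruction of the Fetch sequence; blank fields are simply omitted.
fetch = {"alu": "Add", "src1": "PC", "src2": "4",
         "mem": "Read PC", "pc": "ALU", "seq": "Seq"}
print(assemble(fetch))
```

Keeping the microprogram symbolic and translating it mechanically is what lets the specification stay readable while the stored bit patterns remain an implementation detail.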
• No encoding:
– 1 bit for each datapath operation
– faster, requires more memory (logic)
– used for the VAX 780: an astonishing 400K of memory!
• Lots of encoding:
– send the microinstructions through logic to get control signals
– uses less memory, slower
• Historical context of CISC:
– Too much logic to put on a single chip with everything else
– Use a ROM (or even RAM) to hold the microcode
– It’s easy to add new instructions
Maximally vs. Minimally Encoded
Microcode: Trade-offs
• Distinction between specification and implementation is sometimes blurred
• Specification Advantages:
– Easy to design and write
– Design architecture and microcode in parallel
• Implementation (off-chip ROM) Advantages
– Easy to change since values are in memory
– Can emulate other architectures
– Can make use of internal registers
• Implementation Disadvantages, SLOWER now that:
– Control is implemented on same chip as processor
– ROM is no longer faster than RAM
– No need to go back and make changes
5.10 Fallacies and Pitfalls
• Pitfall: Adding a complex instruction implemented with microprogramming may not be faster than a sequence using simpler instructions.
• Fallacy: If there is space in the control store, new instructions are free of cost.
5.11 Concluding Remarks
Figure 5.41 Alternative methods for specifying and implementing control.
5.12 Historical Perspective and Further Reading
Historical Perspective
• In the ‘60s and ‘70s microprogramming was very important for implementing machines
• This led to more sophisticated ISAs and the VAX
• In the ‘80s RISC processors based on pipelining became popular
• Pipelining the microinstructions is also possible!
• Implementations of IA-32 architecture processors since 486 use:
– “hardwired control” for simpler instructions (few cycles, FSM control implemented using PLA or random logic)
– “microcoded control” for more complex instructions (large numbers of cycles, central control store)
• The IA-64 architecture uses a RISC-style ISA and can be implemented without a large central control store
Pentium 4
• Pipelining is important (last IA-32 without it was 80386 in 1985)
• Pipelining is used for the simple instructions favored by compilers
“Simply put, a high performance implementation needs to ensure that the simple instructions execute quickly, and that the burden of the complexities of the instruction set penalize the complex, less frequently used, instructions”
[Figure: Pentium 4 chip organization. Labeled blocks: Control (four regions), Enhanced floating point and multimedia, I/O interface, Instruction cache, Integer datapath, Data cache, Secondary cache and memory interface, and Advanced pipelining hyperthreading support. Callouts refer to Chapter 6 and Chapter 7.]
Pentium 4
• Somewhere in all that “control” we must handle complex instructions
• Processor executes simple microinstructions, 70 bits wide (hardwired)
• 120 control lines for integer datapath (400 for floating point)
• If an instruction requires more than 4 microinstructions to implement, control comes from the microcode ROM (8000 microinstructions)
• It’s complicated!
[Figure: Pentium 4 chip organization, repeated from the previous slide.]
Chapter 5 Summary
• If we understand the instructions…
We can build a simple processor!
• If instructions take different amounts of time, multi-cycle is better
• Datapath implemented using:
– Combinational logic for arithmetic
– State holding elements to remember bits
• Control implemented using:
– Combinational logic for single-cycle implementation
– Finite state machine for multi-cycle implementation