2004 Morgan Kaufmann Publishers Chapter Five The Processor: Datapath and Control.


Chapter Five

The Processor: Datapath and Control

Outline

• 5.1 Introduction

• 5.2 Logic Design Conventions

• 5.3 Building a Datapath

• 5.4 A Simple Implementation Scheme

• 5.5 A Multicycle Implementation

• 5.6 Exceptions

• 5.9 Real Stuff: The Organization of Recent Pentium

• 5.10 Fallacies and Pitfalls

• 5.11 Concluding Remarks

• 5.12 Historical Perspective and Further Reading

5.1 Introduction

• We're ready to look at an implementation of the MIPS

• Simplified to contain only:

– memory-reference instructions: lw, sw

– arithmetic-logical instructions: add, sub, and, or, slt

– control flow instructions: beq, j

• Generic Implementation:

– use the program counter (PC) to supply instruction address

– get the instruction from memory

– read registers

– use the instruction to decide exactly what to do

• All instructions except jump use the ALU after reading the registers

Why? memory-reference? arithmetic? control flow?

The Processor: Datapath & Control

• Abstract / Simplified View:

• Two types of functional units:

– elements that operate on data values (combinational)

– elements that contain state (sequential)

More Implementation Details

FIGURE 3.14 MIPS architecture revealed thus far: MIPS assembly language

Category Instruction Example Meaning Comments

Arithmetic

add add $s1, $s2, $s3 $s1 = $s2 + $s3 Three operands; Overflow detected

subtract sub $s1, $s2, $s3 $s1 = $s2 - $s3 Three operands; Overflow detected

add immediate addi $s1, $s2, 100 $s1 = $s2+ 100 + constants; overflow detected

add unsigned addu $s1, $s2, $s3 $s1 = $s2 + $s3 Three operands; overflow undetected

subtract unsigned subu $s1, $s2, $s3 $s1 = $s2 - $s3 Three operands; overflow undetected

add immediate unsigned addiu $s1, $s2, 100 $s1 = $s2 + 100 + constants; overflow undetected

move from coprocessor register mfc0 $s1, $epc $s1 = $epc Copy Exception PC + special regs

multiply mult $s2, $s3 Hi, Lo = $s2 x $s3 64-bit signed product in Hi, Lo

multiply unsigned multu $s2, $s3 Hi, Lo = $s2 x $s3 64-bit unsigned product in Hi, Lo

divide div $s2, $s3 Lo = $s2 / $s3, Hi = $s2 mod $s3 Lo = quotient, Hi = remainder

divide unsigned divu $s2, $s3 Lo = $s2 / $s3, Hi = $s2 mod $s3 Unsigned quotient and remainder

move from Hi mfhi $s1 $s1 = Hi Used to get copy of Hi

move from Lo mflo $s1 $s1 = Lo Used to get copy of Lo

Data transfer

load word lw $s1, 100($s2) $s1 = Memory[$s2 + 100] Word from memory to register

store word sw $s1, 100($s2) Memory[$s2 + 100] = $s1 Word from register to memory

load half unsigned lhu $s1, 100($s2) $s1 = Memory[$s2 + 100] Halfword memory to register

store half sh $s1, 100($s2) Memory[$s2 + 100] = $s1 Halfword register to memory

load byte unsigned lbu $s1, 100($s2) $s1 = Memory[$s2 + 100] Byte from memory to register

store byte sb $s1, 100($s2) Memory[$s2 + 100] = $s1 Byte from register to memory

load upper immed. lui $s1, 100 $s1 = 100 * 2^16 Loads constant in upper 16 bits


Logical

and and $s1, $s2, $s3 $s1 = $s2 & $s3 Three reg. operands; bit-by-bit AND

or or $s1, $s2, $s3 $s1 = $s2 | $s3 Three reg. operands; bit-by-bit OR

nor nor $s1, $s2, $s3 $s1 = ~($s2 | $s3) Three reg. operands; bit-by-bit NOR

and immediate andi $s1, $s2, 100 $s1 = $s2 & 100 Bit-by-bit AND reg with constant

or immediate ori $s1, $s2, 100 $s1 = $s2 | 100 Bit-by-bit OR reg with constant

shift left logical sll $s1, $s2, 10 $s1 = $s2 << 10 Shift left by constant

shift right logical srl $s1, $s2, 10 $s1 = $s2 >> 10 Shift right by constant

Conditional branch

branch on equal beq $s1, $s2, 25 if ($s1 == $s2) go to PC+4+100 Equal test; PC-relative branch

branch on not equal bne $s1, $s2, 25 if ($s1 != $s2) go to PC+4+100 Not equal test; PC-relative

set on less than slt $s1, $s2, $s3 if ($s2 < $s3) $s1 = 1; else $s1 = 0 Compare less than; two’s complement

set on less than immediate slti $s1, $s2, 100 if ($s2 < 100) $s1 = 1; else $s1 = 0 Compare < constant; two’s complement

set less than unsigned sltu $s1, $s2, $s3 if ($s2 < $s3) $s1 = 1; else $s1 = 0 Compare less than; natural numbers

set less than immediate unsigned sltiu $s1, $s2, 100 if ($s2 < 100) $s1 = 1; else $s1 = 0 Compare < constant; natural numbers

Unconditional jump

jump j 2500 go to 10000 Jump to target address

jump register jr $ra go to $ra For switch, procedure return

jump and link jal 2500 $ra = PC + 4; go to 10000 For procedure call

Name Format Example Comments

add.s R 17 16 6 4 2 0 add.s $f2, $f4, $f6

sub.s R 17 16 6 4 2 1 sub.s $f2, $f4, $f6

mul.s R 17 16 6 4 2 2 mul.s $f2, $f4, $f6

div.s R 17 16 6 4 2 3 div.s $f2, $f4, $f6

add.d R 17 17 6 4 2 0 add.d $f2, $f4, $f6

sub.d R 17 17 6 4 2 1 sub.d $f2, $f4, $f6

mul.d R 17 17 6 4 2 2 mul.d $f2, $f4, $f6

div.d R 17 17 6 4 2 3 div.d $f2, $f4, $f6

lwc1 I 49 20 2 100 lwc1 $f2, 100($s4)

swc1 I 57 20 2 100 swc1 $f2, 100($s4)

bc1t I 17 8 1 25 bc1t 25

bc1f I 17 8 0 25 bc1f 25

c.lt.s R 17 16 4 2 0 60 c.lt.s $f2, $f4

c.lt.d R 17 17 4 2 0 60 c.lt.d $f2, $f4

Field size 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits ALL MIPS instructions 32 bits

MIPS floating-point machine language

Figure 5.2 The basic implementation of the MIPS subset including the necessary multiplexers and control lines.

5.2 Logic Design Conventions

Keywords

• Clocking methodology The approach used to determine when data is valid and stable relative to the clock.

• Edge-triggered clocking A clocking scheme in which all state changes occur on a clock edge.

• Control signal A signal used for multiplexer selection or for directing the operation of a function unit; contrasts with a data signal, which contains information that is operated on by a functional unit.

• Unclocked vs. Clocked

• Clocks used in synchronous logic

– when should an element that contains state be updated?

State Elements

Figure: clock waveform showing the clock period (cycle time), the rising edge, and the falling edge.

• The set-reset latch

– output depends on present inputs and also on past inputs

An unclocked state element

• Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)

• Change of state (value) is based on the clock

• Latches: state changes whenever the inputs change and the clock is asserted

• Flip-flop: state changes only on a clock edge (edge-triggered methodology)

Note: "asserted" means logically true, which could mean electrically low

A clocking methodology defines when signals can be read and written; we wouldn't want to read a signal at the same time it was being written

Latches and Flip-flops

• Two inputs:

– the data value to be stored (D)

– the clock signal (C) indicating when to read & store D

• Two outputs:

– the value of the internal state (Q) and its complement

D-latch

D flip-flop

• Output changes only on the clock edge


Our Implementation

• An edge triggered methodology

• Typical execution:

– read contents of some state elements,

– send values through some combinational logic

– write results to one or more state elements


Figure 5.4 An edge-triggered methodology allows a state element to be read and written in the same clock cycle without creating a race that could lead to indeterminate data values.

5.3 Building a Datapath

Keywords

• Datapath element A functional unit used to operate on or hold data within a processor. In the MIPS implementation the datapath elements include the instruction and data memories, the register file, the arithmetic logic unit (ALU), and adders.

• Program counter (PC) The register containing the address of the instruction in the program being executed.

• Register file A state element that consists of a set of registers that can be read and written by supplying a register number to be accessed.

• Sign-extend To increase the size of a data item by replicating the high-order sign bit of the original data item in the high-order bits of the larger, destination data item.

Keywords

• Branch target address The address specified in a branch, which becomes the new program counter (PC) if the branch is taken. In the MIPS architecture the branch target is given by the sum of the offset field of the instruction and the address of the instruction following the branch.

• Branch taken A branch where the branch condition is satisfied and the program counter (PC) becomes the branch target. All unconditional branches are taken branches.

• Branch not taken A branch where the branch condition is false and the program counter (PC) becomes the address of the instruction that sequentially follows the branch.

• Delayed branch A type of branch where the instruction immediately following the branch is always executed, independent of whether the branch condition is true or false.

• Built using D flip-flops

Register File

Figure: register file. Inputs: Read register number 1, Read register number 2, Write register, Write data, and Write. Outputs: Read data 1 and Read data 2.

Figure: register file read ports. Each read register number drives a Mux that selects one of Register 0 through Register n – 1 as Read data 1 or Read data 2.

Do you understand? What is the “Mux” above?

Abstraction

• Make sure you understand the abstractions!

• Sometimes it is easy to think you do, when you don’t


Register File

• Note: we still use the real clock to determine when to write

Figure: register file write port. An n-to-2^n decoder on the register number selects which of Register 0 through Register n – 1 receives the register data.

Simple Implementation

• Include the functional units we need for each instruction

Figure: a. Instruction memory (input: Instruction address; output: Instruction) b. Program counter c. Adder (output: Sum)

Figure: a. Data memory unit (inputs: Address, Write data, MemRead, MemWrite; output: Read data) b. Sign-extension unit

Figure: a. Registers (inputs: Read register 1, Read register 2, Write register, Write data, RegWrite; outputs: Read data 1, Read data 2) b. ALU (input: 4-bit ALU operation; output: ALU result)

Why do we need this stuff?

Figure 5.10 The datapath for the memory instructions and the R-type instructions.

Building the Datapath

• Use multiplexors to stitch them together

Figure: combined datapath. The instruction memory (Read address, Instruction), register file, sign-extend unit, shift-left-2 unit, ALU, and data memory are connected through multiplexors; the control lines shown are RegWrite, ALUSrc, ALU operation (4 bits), MemRead, MemWrite, and MemtoReg.

5.4 A Simple Implementation Scheme

Keywords

• Don’t-care term An element of a logic function in which the output does not depend on the values of all the inputs. Don’t-care terms may be specified in different ways.

• Opcode The field that denotes the operation and format of an instruction.

• Single-cycle implementation Also called single clock cycle implementation. An implementation in which an instruction is executed in one clock cycle.

Control

• Selecting the operations to perform (ALU, read/write, etc.)

• Controlling the flow of data (multiplexor inputs)

• Information comes from the 32 bits of the instruction

• Example: add $8, $17, $18

Instruction format:

000000 10001 10010 01000 00000 100000

op rs rt rd shamt funct

• ALU's operation based on instruction type and function code

• e.g., what should the ALU do with this instruction?

• Example: lw $1, 100($2)

35 2 1 100

op rs rt 16 bit offset
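The two field layouts above can be checked with a few shifts and masks. This is a minimal sketch, assuming a hypothetical helper named `decode` (not part of the slides) that splits a 32-bit instruction word into the fields named in the examples.

```python
def decode(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into its fields.

    Hypothetical helper for illustration; field boundaries follow the
    R-type and I-type layouts shown on the slide.
    """
    return {
        "op": (word >> 26) & 0x3F,     # bits 31:26
        "rs": (word >> 21) & 0x1F,     # bits 25:21
        "rt": (word >> 16) & 0x1F,     # bits 20:16
        "rd": (word >> 11) & 0x1F,     # bits 15:11 (R-type only)
        "shamt": (word >> 6) & 0x1F,   # bits 10:6  (R-type only)
        "funct": word & 0x3F,          # bits 5:0   (R-type only)
        "imm": word & 0xFFFF,          # bits 15:0  (I-type only)
    }


# add $8, $17, $18, written field by field as on the slide
add_word = int("000000" "10001" "10010" "01000" "00000" "100000", 2)

# lw $1, 100($2): op = 35, rs = 2, rt = 1, 16-bit offset = 100
lw_word = (35 << 26) | (2 << 21) | (1 << 16) | 100

add_fields = decode(add_word)
lw_fields = decode(lw_word)
```

For the add, `rs` and `rt` name the source registers and `rd` the destination; for the lw, `rt` is the destination and `imm` is the offset.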

• ALU control input

0000 AND

0001 OR

0010 add

0110 subtract

0111 set-on-less-than

1100 NOR

• Why is the code for subtract 0110 and not 0011?

Control

Figure 5.12 How the ALU control bits are set depends on the ALUOp control bits and the different function codes for the R-type instruction.

Columns: Instruction opcode, ALUOp, Instruction operation, Funct field, Desired ALU action, ALU control input

LW 00 Load word XXXXXX Add 0010

SW 00 Store word XXXXXX Add 0010

Branch equal 01 Branch equal XXXXXX Subtract 0110

R-type 10 Add 100000 Add 0010

R-type 10 subtract 100010 Subtract 0110

R-type 10 AND 100100 And 0000

R-type 10 OR 100101 Or 0001

R-type 10 Set on less than 101010 Set on less than 0111

• Must describe hardware to compute 4-bit ALU control input

– given instruction type: 00 = lw, sw; 01 = beq; 10 = arithmetic

– function code for arithmetic

• Describe it using a truth table (can turn into gates):

ALUOp computed from instruction type
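The truth table in Figure 5.12 can be written directly as a lookup. The sketch below is one way to express it in software, assuming hypothetical names `FUNCT_TO_ALU` and `alu_control`; real hardware would implement the same mapping with gates.

```python
# Funct field -> 4-bit ALU control input, for R-type instructions (ALUOp = 10).
FUNCT_TO_ALU = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set on less than
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Return the 4-bit ALU control input for a 2-bit ALUOp and a funct field."""
    if alu_op == 0b00:   # lw, sw: add to form the address (funct is a don't-care)
        return 0b0010
    if alu_op == 0b01:   # beq: subtract to compare (funct is a don't-care)
        return 0b0110
    if alu_op == 0b10:   # R-type: the funct field decides
        return FUNCT_TO_ALU[funct]
    raise ValueError("ALUOp 11 is not used in this table")
```

The funct argument is ignored for ALUOp 00 and 01, which mirrors the XXXXXX entries in the table.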

Control

Figure B.5.9 A 1-bit ALU that performs AND, OR, and addition on a and b or NOT a and NOT b.

FIGURE B.5.10 (Top) A 1-bit ALU that performs AND, OR, and addition on a and b or NOT b.

FIGURE B.5.10 (bottom) a 1-bit ALU for the most significant bit.

FIGURE B.5.11 A 32-bit ALU constructed from the 31 copies of the 1-bit ALU in the top of Figure B.5.10 and one 1-bit ALU in the bottom of that figure.

FIGURE B.5.12 The final 32-bit ALU. This adds a Zero detector to Figure B.5.11.

FIGURE B.5.13 The values of the three ALU control lines Bnegate and Operation and the corresponding ALU operations.

ALU control lines Function

0000 AND

0001 OR

0010 add

0110 subtract

0111 set-on-less-than

1100 NOR

FIGURE B.5.14 The symbol commonly used to represent an ALU, as shown in FigureB.5.12.

Figure 5.14 The three instruction classes (R-type, load and store, and branch) use two different instruction formats.

a. R-type instruction

Field 0 rs rt rd shamt funct

Bit positions 31:26 25:21 20:16 15:11 10:6 5:0

b. Load or store instruction

Field 35 or 43 rs rt address

Bit positions 31:26 25:21 20:16 15:0

c. Branch instruction

Field 4 rs rt address

Bit positions 31:26 25:21 20:16 15:0

Figure 5.15 The datapath of Figure 5.12 with all necessary multiplexors and all control lines identified

Control

• Simple combinational logic (truth tables)

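Figure: ALU control block. Inputs: ALUOp1, ALUOp0, and the funct field F(5–0). Outputs: Operation2, Operation1, Operation0.

Figure: main control logic. Inputs: the opcode, decoded as R-format, lw, sw, or beq. Outputs: RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0.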

• All of the logic is combinational

• We wait for everything to settle down, and the right thing to be done

– ALU might not produce “right answer” right away

– we use write signals along with clock to determine when to write

• Cycle time determined by length of the longest path

Our Simple Control Structure

We are ignoring some details like setup and hold times


Single Cycle Implementation

• Calculate cycle time assuming negligible delays except:

– memory (200ps), ALU and adders (100ps), register file access (50ps)


Figure 5.16 The effect of each of the seven control signals.

Signal name: effect when deasserted; effect when asserted

RegDst. Deasserted: the register destination number for the Write register comes from the rt field (bits 20:16). Asserted: the register destination number for the Write register comes from the rd field (bits 15:11).

RegWrite. Deasserted: none. Asserted: the register on the Write register input is written with the value on the Write data input.

ALUSrc. Deasserted: the second ALU operand comes from the second register file output (Read data 2). Asserted: the second ALU operand is the sign-extended, lower 16 bits of the instruction.

PCSrc. Deasserted: the PC is replaced by the output of the adder that computes the value of PC+4. Asserted: the PC is replaced by the output of the adder that computes the branch target.

MemRead. Deasserted: none. Asserted: data memory contents designated by the address input are put on the Read data output.

MemWrite. Deasserted: none. Asserted: data memory contents designated by the address input are replaced by the value on the Write data input.

MemtoReg. Deasserted: the value fed to the register Write data input comes from the ALU. Asserted: the value fed to the register Write data input comes from the data memory.

Figure 5.17 The simple datapath with the control unit.

Figure 5.18 The setting of the control lines is completely determined by the opcode fields of the instruction.

Instruction RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOp0

R-format 1 0 0 1 0 0 0 1 0

lw 0 1 1 1 1 0 0 0 0

sw X 1 X 0 0 1 0 0 0

beq X 0 X 0 0 0 1 0 1

Figure 5.19 The datapath in operation for an R-type instruction such as add $t1, $t2, $t3.

Figure 5.20 The datapath in operation for a load instruction.

Figure 5.21 The datapath in operation for a branch equal instruction.

Figure 5.22 The control function for the simple single-cycle implementation is completely specified by this truth table.

Input or output Signal name R-format lw sw beq

Inputs Op5 0 1 1 0

Op4 0 0 0 0

Op3 0 0 1 0

Op2 0 0 0 1

Op1 0 1 1 0

Op0 0 1 1 0

Outputs RegDst 1 0 X X

ALUSrc 0 1 1 0

MemtoReg 0 1 X X

RegWrite 1 1 0 0

MemRead 0 1 0 0

MemWrite 0 0 1 0

Branch 0 0 0 1

ALUOp1 1 0 0 0

ALUOp0 0 0 0 1
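The truth table in Figure 5.22 maps each opcode to a fixed set of control values, so it can be sketched as a dictionary lookup. The names `CONTROL` and `main_control` are assumptions made for this illustration, and `None` stands in for the don't-care (X) entries.

```python
# Opcode (Op5..Op0) -> control signals, transcribed from the truth table.
# None marks a don't-care (X) output.
CONTROL = {
    0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),   # R-format
    0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),   # lw
    0b101011: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),   # sw
    0b000100: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),   # beq
}


def main_control(opcode: int) -> dict:
    """Return the control signal settings for a 6-bit opcode."""
    return CONTROL[opcode]


# Opcodes that change architectural state other than the PC.
writers = sorted(op for op, sig in CONTROL.items()
                 if sig["RegWrite"] or sig["MemWrite"])
```

Because the outputs depend only on the opcode, this is pure combinational logic: no state is kept between lookups.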

Figure 5.23 Instruction format for the jump instruction (opcode = 2).

Field 000010 address

Bit positions 31:26 25:0

Figure 5.24 The simple control and datapath are extended to handle the jump instruction.

Problem: Performance of Single-Cycle Machines (p.315)

Assume that the operation times for the major functional units in this implementation are the following:

Memory units: 200 picoseconds (ps)

ALU and adders: 100 ps

Register file (read or write): 50 ps

Assuming that the multiplexors, control unit, PC accesses, sign extension unit, and wires have no delay, which of the following implementations would be faster and by how much?

1. An implementation in which every instruction operates in 1 clock cycle of a fixed length.

2. An implementation where every instruction executes in 1 clock cycle using a variable-length clock, which for each instruction is only as long as it needs to be.

To compare the performance, assume the following instruction mix: 25% loads, 10% stores, 45% ALU instructions, 15% branches, and 5% jumps.

• Let’s start by comparing the CPU execution times.

$$\text{CPU execution time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}$$

Since CPI must be 1, we can simplify this to

$$\text{CPU execution time} = \text{Instruction count} \times \text{Clock cycle time}$$

• The critical path for the different instruction classes is as follows:

Instruction class Functional units used by the instruction class

R-type Instruction fetch Register access ALU Register access

Load word Instruction fetch Register access ALU Memory access Register access

Store word Instruction fetch Register access ALU Memory access

Branch Instruction fetch Register access ALU

Jump Instruction fetch

• Using these critical paths, we can compute the required length for each instruction class:

Columns (times in ps): Instruction class, Instruction memory, Register read, ALU operation, Data memory, Register write, Total

R-type 200 50 100 0 50 400 ps

Load word 200 50 100 200 50 600 ps

Store word 200 50 100 200 0 550 ps

Branch 200 50 100 0 0 350 ps

Jump 200 0 0 0 0 200 ps

• Thus, the average time per instruction with a variable clock is

$$\text{CPU clock cycle} = 600 \times 25\% + 550 \times 10\% + 400 \times 45\% + 350 \times 15\% + 200 \times 5\% = 447.5\ \text{ps}$$

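• Since the variable clock implementation has a shorter average clock cycle, it is clearly faster. Let’s find the performance ratio:

$$\frac{\text{CPU performance}_{\text{variable clock}}}{\text{CPU performance}_{\text{single clock}}} = \frac{\text{CPU execution time}_{\text{single clock}}}{\text{CPU execution time}_{\text{variable clock}}} = \frac{\text{IC} \times \text{CPU clock cycle}_{\text{single clock}}}{\text{IC} \times \text{CPU clock cycle}_{\text{variable clock}}} = \frac{\text{CPU clock cycle}_{\text{single clock}}}{\text{CPU clock cycle}_{\text{variable clock}}} = \frac{600}{447.5} = 1.34$$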
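The arithmetic in this problem can be reproduced in a few lines. This is a sketch using the unit delays, critical paths, and instruction mix stated in the problem; the dictionary and variable names are chosen here for illustration.

```python
# Functional unit delays in picoseconds, as given in the problem.
UNIT_PS = {"imem": 200, "reg_read": 50, "alu": 100, "dmem": 200, "reg_write": 50}

# Functional units on the critical path of each instruction class.
PATHS = {
    "R-type": ["imem", "reg_read", "alu", "reg_write"],
    "load":   ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "store":  ["imem", "reg_read", "alu", "dmem"],
    "branch": ["imem", "reg_read", "alu"],
    "jump":   ["imem"],
}

# Instruction mix from the problem statement.
MIX = {"load": 0.25, "store": 0.10, "R-type": 0.45, "branch": 0.15, "jump": 0.05}

# Time each class needs: the sum of the delays along its path.
latency = {cls: sum(UNIT_PS[u] for u in units) for cls, units in PATHS.items()}

# A fixed clock must fit the slowest class; a variable clock averages over the mix.
fixed_clock = max(latency.values())
variable_clock = sum(MIX[cls] * latency[cls] for cls in MIX)
speedup = fixed_clock / variable_clock

print(f"fixed: {fixed_clock} ps, variable: {variable_clock:.1f} ps, "
      f"ratio: {speedup:.2f}")
```

The fixed clock is set by the load instruction, the only class that uses all five units.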

5.5 A Multicycle Implementation

Keywords

• Multicycle implementation Also called multiple clock cycle implementation. An implementation in which an instruction is executed in multiple clock cycles.

• Microprogramming A symbolic representation of control in the form of instructions, called microinstructions, that are executed on a simple micromachine.

• Finite state machine A sequential logic function consisting of a set of inputs and outputs, a next-state function that maps the current state and the inputs to a new state, and an output function that maps the current state and possibly the input to a set of asserted outputs.

• Next-state function A combinational function that, given the inputs and the current state, determines the next state of a finite state machine.

Where we are headed

• Single Cycle Problems:

– what if we had a more complicated instruction like floating point?

– wasteful of area

• One Solution:

– use a “smaller” cycle time

– have different instructions take different numbers of cycles

– a “multicycle” datapath:

• We will be reusing functional units

– ALU used to compute address and to increment PC

– Memory used for instruction and data

• Our control signals will not be determined directly by instruction

– e.g., what should the ALU do for a “subtract” instruction?

• We’ll use a finite state machine for control

Multicycle Approach

• Break up the instructions into steps, each step takes a cycle

– balance the amount of work to be done

– restrict each cycle to use only one major functional unit

• At the end of a cycle

– store values for use in later cycles (easiest thing to do)

– introduce additional “internal” registers

Multicycle Approach

Figure 5.27 The multicycle datapath from Figure 5.26 with the control lines shown.

Figure 5.28 The complete datapath for the multicycle implementation together with the necessary control lines.

Figure 5.29 The action caused by the setting of each control signal in Figure 5.28 on page 323.

Actions of the 1-bit control signals

Signal name: effect when deasserted; effect when asserted

RegDst. Deasserted: the register file destination number for the Write register comes from the rt field. Asserted: the register file destination number for the Write register comes from the rd field.

RegWrite. Deasserted: none. Asserted: the general-purpose register selected by the Write register number is written with the value of the Write data input.

ALUSrcA. Deasserted: the first ALU operand is the PC. Asserted: the first ALU operand comes from the A register.

MemRead. Deasserted: none. Asserted: content of memory at the location specified by the address input is put on the Memory data output.

MemWrite. Deasserted: none. Asserted: memory contents at the location specified by the address input are replaced by the value on the Write data input.

MemtoReg. Deasserted: the value fed to the register file Write data input comes from ALUOut. Asserted: the value fed to the register file Write data input comes from the MDR.

IorD. Deasserted: the PC is used to supply the address to the memory unit. Asserted: ALUOut is used to supply the address to the memory unit.

IRWrite. Deasserted: none. Asserted: the output of the memory is written into the IR.

PCWrite. Deasserted: none. Asserted: the PC is written; the source is controlled by PCSource.

PCWriteCond. Deasserted: none. Asserted: the PC is written if the Zero output from the ALU is also active.

Actions of the 2-bit control signals

Signal name, value (binary), effect

ALUOp 00 The ALU performs an add operation.

ALUOp 01 The ALU performs a subtract operation.

ALUOp 10 The funct field of the instruction determines the ALU operation.

ALUSrcB 00 The second input to the ALU comes from the B register.

ALUSrcB 01 The second input to the ALU is the constant 4.

ALUSrcB 10 The second input to the ALU is the sign-extended, lower 16 bits of the IR.

ALUSrcB 11 The second input to the ALU is the sign-extended, lower 16 bits of the IR shifted left 2 bits.

PCSource 00 Output of the ALU (PC+4) is sent to the PC for writing.

PCSource 01 The contents of ALUOut (the branch target address) are sent to the PC for writing.

PCSource 10 The jump target address (IR[25:0] shifted left 2 bits and concatenated with PC+4[31:28]) is sent to the PC for writing.
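The jump target described for PCSource = 10 is a bit-level concatenation. The following sketch shows one way to compute it, assuming a hypothetical function `jump_target` that takes the already-incremented PC and the instruction register.

```python
def jump_target(pc_plus_4: int, ir: int) -> int:
    """Jump target = PC+4[31:28] concatenated with IR[25:0] shifted left 2 bits."""
    upper = pc_plus_4 & 0xF0000000      # keep the top 4 bits of PC+4
    lower = (ir & 0x03FFFFFF) << 2      # 26-bit field becomes a 28-bit byte address
    return upper | lower


# "j 2500" from the assembly table: opcode 2 in bits 31:26, address field 2500.
j_2500 = (2 << 26) | 2500
target = jump_target(0x00000004, j_2500)
```

With the upper PC bits equal to zero, the word address 2500 becomes the byte address 10000, which matches the "go to 10000" entry in the assembly language table.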

Instructions from ISA perspective

• Consider each instruction from perspective of ISA.

• Example:

– The add instruction changes a register.

– Register specified by bits 15:11 of instruction.

– Instruction specified by the PC.

– New value is the sum (“op”) of two registers.

– Registers specified by bits 25:21 and 20:16 of the instruction

Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]

– In order to accomplish this we must break up the instruction (kind of like introducing variables when programming).

Breaking down an instruction

• ISA definition of arithmetic:

Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]

• Could break down to:

– IR <= Memory[PC]

– A <= Reg[IR[25:21]]

– B <= Reg[IR[20:16]]

– ALUOut <= A op B

– Reg[IR[15:11]] <= ALUOut

• We forgot an important part of the definition of arithmetic!

– PC <= PC + 4

Idea behind multicycle approach

• We define each instruction from the ISA perspective (do this!)

• Break it down into steps following our rule that data flows through at most one major functional unit (e.g., balance work across steps)

• Introduce new registers as needed (e.g, A, B, ALUOut, MDR, etc.)

• Finally try and pack as much work into each step (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify solution)

• Result: Our book’s multicycle Implementation!

• Instruction Fetch

• Instruction Decode and Register Fetch

• Execution, Memory Address Computation, or Branch Completion

• Memory Access or R-type instruction completion

• Write-back step

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

Five Execution Steps

• Use PC to get instruction and put it in the Instruction Register.

• Increment the PC by 4 and put the result back in the PC.

• Can be described succinctly using RTL "Register-Transfer Language"

IR <= Memory[PC];

PC <= PC + 4;

Can we figure out the values of the control signals?

What is the advantage of updating the PC now?

Step 1: Instruction Fetch

• Read registers rs and rt in case we need them

• Compute the branch address in case the instruction is a branch

• RTL:

A <= Reg[IR[25:21]];

B <= Reg[IR[20:16]];

ALUOut <= PC + (sign-extend(IR[15:0]) << 2);

• We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)

Step 2: Instruction Decode and Register Fetch

• ALU is performing one of three functions, based on instruction type

• Memory Reference:

ALUOut <= A + sign-extend(IR[15:0]);

• R-type:

ALUOut <= A op B;

• Branch:

if (A==B) PC <= ALUOut;

Step 3 (instruction dependent)

• Loads and stores access memory

MDR <= Memory[ALUOut];

or

Memory[ALUOut] <= B;

• R-type instructions finish

Reg[IR[15:11]] <= ALUOut;

The write actually takes place at the end of the cycle on the edge

Step 4 (R-type or memory-access)

• Reg[IR[20:16]] <= MDR;

Which instruction needs this?

Write-back step

Summary:

• How many cycles will it take to execute this code?

lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label # assume not taken
add $t5, $t2, $t3
sw $t5, 8($t3)

Label: ...

• What is going on during the 8th cycle of execution?

• In what cycle does the actual addition of $t2 and $t3 take place?
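One way to check answers to these questions is to lay the instructions out cycle by cycle. The sketch below assumes the per-class cycle counts used in this chapter (loads 5, stores 4, ALU instructions 4, branches 3, jumps 3) and that instructions execute strictly one after another; the step names and the `schedule` helper are labels chosen for this illustration.

```python
# Cycles per instruction in the multicycle design.
CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3, "j": 3}

# Short names for the five execution steps, in order.
STEPS = [
    "instruction fetch",
    "decode/register fetch",
    "execute/address/branch",
    "memory access or R-type completion",
    "write-back",
]

program = ["lw", "lw", "beq", "add", "sw"]


def schedule(instrs):
    """Return (cycle, instruction index, mnemonic, step name) for every cycle."""
    timeline = []
    cycle = 0
    for index, mnemonic in enumerate(instrs):
        for step in range(CYCLES[mnemonic]):
            cycle += 1
            timeline.append((cycle, index, mnemonic, STEPS[step]))
    return timeline


timeline = schedule(program)
total_cycles = len(timeline)
eighth_cycle = timeline[7]
add_execute_cycles = [c for c, _, m, s in timeline
                      if m == "add" and s == STEPS[2]]
```

Under these assumptions, `total_cycles` is the sum 5 + 5 + 3 + 4 + 4, `eighth_cycle` falls in the third step of the second lw, and `add_execute_cycles` holds the single cycle in which the ALU adds $t2 and $t3.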

Simple Questions

Problem: CPI in a multicycle CPU

• Using the SPECINT2000 instruction mix shown in Figure 3.26, what is the CPI, assuming that each state in the multicycle CPU requires 1 clock cycle?

Answer: The mix is 25% loads (1% load byte+24% load word), 10% stores (1% store byte+9% store word), 11% branches (6% beq, 5% bne), 2% jumps (1% jal+1% jr), and 52% ALU (all the rest of the mix, which we assume to be ALU instructions). From Figure 5.30 on page 329, the number of clock cycles for each instruction class is the following:

Loads: 5 ; Store: 4; ALU instructions: 4; Branches: 3; Jumps: 3;

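The CPI is given by the following:

$$\text{CPI} = \frac{\text{CPU clock cycles}}{\text{Instruction count}} = \frac{\sum_i \text{Instruction count}_i \times \text{CPI}_i}{\text{Instruction count}} = \sum_i \frac{\text{Instruction count}_i}{\text{Instruction count}} \times \text{CPI}_i$$

• The ratio

$$\frac{\text{Instruction count}_i}{\text{Instruction count}}$$

is simply the instruction frequency for the instruction class i. We can therefore substitute to obtain

$$\text{CPI} = 0.25 \times 5 + 0.10 \times 4 + 0.52 \times 4 + 0.11 \times 3 + 0.02 \times 3 = 4.12$$

This CPI is better than the worst-case CPI of 5.0 when all the instructions take the same number of clock cycles.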
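The weighted sum is easy to verify numerically. This sketch uses the instruction frequencies and cycle counts quoted in the answer above; the dictionary names are illustrative.

```python
# Instruction frequencies from the mix quoted in the answer.
MIX = {"loads": 0.25, "stores": 0.10, "alu": 0.52, "branches": 0.11, "jumps": 0.02}

# Clock cycles per instruction class in the multicycle design.
CYCLES = {"loads": 5, "stores": 4, "alu": 4, "branches": 3, "jumps": 3}

# CPI is the frequency-weighted average of the per-class cycle counts.
cpi = sum(MIX[cls] * CYCLES[cls] for cls in MIX)

print(f"CPI = {cpi:.2f}")
```

The frequencies sum to 1, so the result is a true weighted average and must lie between the smallest and largest per-class cycle counts.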

• Finite state machines:

– a set of states and

– next state function (determined by current state and the input)

– output function (determined by current state and possibly input)

– We’ll use a Moore machine (output based only on current state)

Review: finite state machines

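Figure: finite state machine structure. The inputs and the current state feed the next-state function, which produces the next state; the output function produces the outputs.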

Review: finite state machines

• Example:

B.37 A friend would like you to build an “electronic eye” for use as a fake security device. The device consists of three lights lined up in a row, controlled by the outputs Left, Middle, and Right, which, if asserted, indicate that a light should be on. Only one light is on at a time, and the light “moves” from left to right and then from right to left, thus scaring away thieves who believe that the device is monitoring their activity. Draw the graphical representation for the finite state machine used to specify the electronic eye. Note that the rate of the eye’s movement will be controlled by the clock speed (which should not be too great) and that there are essentially no inputs.
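One possible reading of this exercise is a four-state Moore machine: the middle light needs two states, because the next light depends on the direction of travel. The sketch below is an assumption about the intended solution, not the book's answer, and the state names are invented here.

```python
# Next-state function: there are no inputs, so it depends on the state alone.
NEXT_STATE = {
    "LEFT": "MID_TO_RIGHT",
    "MID_TO_RIGHT": "RIGHT",
    "RIGHT": "MID_TO_LEFT",
    "MID_TO_LEFT": "LEFT",
}

# Output function (Moore machine): (Left, Middle, Right) per state.
OUTPUT = {
    "LEFT": (1, 0, 0),
    "MID_TO_RIGHT": (0, 1, 0),
    "RIGHT": (0, 0, 1),
    "MID_TO_LEFT": (0, 1, 0),
}


def run(cycles: int, state: str = "LEFT") -> list:
    """Return the light pattern shown on each clock cycle."""
    lights = []
    for _ in range(cycles):
        lights.append(OUTPUT[state])   # outputs depend only on the current state
        state = NEXT_STATE[state]      # advance on the clock edge
    return lights


pattern = run(5)
```

Four states need two state bits. A three-state machine would not work, since it could not tell which way the light was moving when the middle light is on.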

• Value of control signals is dependent upon:

– what instruction is being executed

– which step is being performed

• Use the information we’ve accumulated to specify a finite state machine

– specify the finite state machine graphically, or

– use microprogramming

• Implementation can be derived from specification

Implementing the Control

Figure 5.31 The high-level view of the finite state machine control.

Figure 5.32 The instruction fetch and decode portion of every instruction is identical.

Figure 5.33 The finite state machine for controlling memory-reference instructions has four states.

Figure 5.34 R-type instructions can be implemented with a simple two-state finite state machine.

Figure 5.35 The branch instruction requires a single state.

Figure 5.36 The jump instruction requires a single state that asserts two control signals to write the PC with the lower 26 bits of the instruction register shifted left 2 bits and concatenated to the upper 4 bits of the PC of this instruction.

• Implementation:

Finite State Machine for Control

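Figure: finite state machine control. The control logic takes as inputs the instruction register opcode field and the current state from the state register. Its outputs are PCWrite, PCWriteCond, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUSrcB, ALUSrcA, RegWrite, RegDst, and the next-state bits NS3, NS2, NS1, NS0.

• Note:

– don’t care if not mentioned

– asserted if name only

– otherwise exact value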

• How many state bits will we need?

Graphical Specification of FSM

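PLA Implementation (Section C.3 & Appendix B)

• If I picked a horizontal or vertical line could you explain it?

Figure: PLA implementation of the control logic. Inputs include the opcode bits (Op5 and below); outputs are PCWrite, PCWriteCond, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst, NS3, NS2, NS1, NS0.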

• ROM = "Read Only Memory"

– values of memory locations are fixed ahead of time

• A ROM can be used to implement a truth table

– if the address is m bits, we can address 2^m entries in the ROM.

– our outputs are the bits of data that the address points to.

m is the "height", and n is the "width"

ROM Implementation (Section C.3 & Appendix B)

0 0 0 0 0 1 1

0 0 1 1 1 0 0

0 1 0 1 1 0 0

0 1 1 1 0 0 0

1 0 0 0 0 0 0

1 0 1 0 0 0 1

1 1 0 0 1 1 0

1 1 1 0 1 1 1
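The table above can be read as a ROM with m = 3 address bits and n = 4 data bits: the first three columns of each row are the address and the last four are the stored word. That split is an assumption inferred from the row width. A minimal sketch of the lookup, with illustrative names:

```python
M_ADDRESS_BITS = 3
N_DATA_BITS = 4

# Contents indexed by address 000 through 111, transcribed from the table.
ROM = [0b0011, 0b1100, 0b1100, 0b1000, 0b0000, 0b0001, 0b0110, 0b0111]


def rom_read(address: int) -> int:
    """Return the n-bit word stored at an m-bit address."""
    return ROM[address]


# Print the truth table the ROM implements.
for address in range(2 ** M_ADDRESS_BITS):
    print(f"{address:03b} -> {rom_read(address):04b}")
```

Every input combination has its own entry, so a ROM always stores the full truth table, even when several addresses hold the same word.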

• How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)

• How many outputs are there? 16 datapath-control outputs, 4 state bits = 20 outputs

• ROM is 2^10 x 20 = 20K bits (and a rather unusual size)

• Rather wasteful, since for lots of the entries, the outputs are the same

– i.e., opcode is often ignored

ROM Implementation

• Break up the table into two parts

– 4 state bits tell you the 16 outputs, 2^4 x 16 bits of ROM

– 10 bits tell you the 4 next state bits, 2^10 x 4 bits of ROM

– Total: 4.3K bits of ROM

• PLA is much smaller

– can share product terms

– only need entries that produce an active output

– can take into account don't cares

• Size is (#inputs x #product-terms) + (#outputs x #product-terms)

For this example = (10x17)+(20x17) = 510 PLA cells

• PLA cells usually about the size of a ROM cell (slightly bigger)
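The three size estimates can be recomputed from the counts given on these slides (6 opcode bits, 4 state bits, 16 datapath-control outputs, 17 product terms). The variable names below are chosen for illustration.

```python
opcode_bits = 6
state_bits = 4
control_outputs = 16
product_terms = 17

address_bits = opcode_bits + state_bits          # 10 address lines
outputs = control_outputs + state_bits           # 20 outputs

# One ROM holding everything: 2^10 entries of 20 bits each.
single_rom_bits = 2 ** address_bits * outputs

# Two ROMs: outputs depend on the state only, next state on state and opcode.
output_rom_bits = 2 ** state_bits * control_outputs
next_state_rom_bits = 2 ** address_bits * state_bits
split_rom_bits = output_rom_bits + next_state_rom_bits

# PLA: (#inputs x #product-terms) + (#outputs x #product-terms).
pla_cells = address_bits * product_terms + outputs * product_terms

print(single_rom_bits, split_rom_bits, pla_cells)
```

Splitting the table works because the datapath outputs of a Moore machine ignore the opcode, so only the next-state ROM needs all 10 address bits.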

ROM vs PLA

• Complex instructions: the "next state" is often current state + 1

Another Implementation Style

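Figure: control unit built from a PLA or ROM. Its outputs are the datapath control signals (PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, BWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst) plus AddrCtl, which drives the address select logic.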

Details

Dispatch ROM 1

Op Opcode name Value

000000 R-format 0110

000010 jmp 1001

000100 beq 1000

100011 lw 0010

101011 sw 0010

Dispatch ROM 2

Op Opcode name Value

100011 lw 0011

101011 sw 0101

State number Address-control action Value of AddrCtl

0 Use incremented state 3

1 Use dispatch ROM 1 1

2 Use dispatch ROM 2 2

3 Use incremented state 3

4 Replace state number by 0 0

5 Replace state number by 0 0

6 Use incremented state 3

7 Replace state number by 0 0

8 Replace state number by 0 0

9 Replace state number by 0 0

PLA or ROM

Mux3 2 1 0

Dispatch ROM 1Dispatch ROM 2

AddrCtl
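The dispatch ROMs and the AddrCtl table above fully determine the next state, so the sequencing can be sketched in a few lines of Python. The table contents are taken from the slide; the names `next_state` and `trace` are illustrative assumptions, and the sketch models only state sequencing, not the datapath.

```python
# Dispatch tables copied from the slide: opcode -> next state.
DISPATCH_ROM_1 = {
    0b000000: 0b0110,  # R-format
    0b000010: 0b1001,  # jmp
    0b000100: 0b1000,  # beq
    0b100011: 0b0010,  # lw
    0b101011: 0b0010,  # sw
}
DISPATCH_ROM_2 = {
    0b100011: 0b0011,  # lw
    0b101011: 0b0101,  # sw
}

# AddrCtl value for each state number 0..9, copied from the slide.
ADDR_CTL = [3, 1, 2, 3, 0, 0, 3, 0, 0, 0]


def next_state(state: int, opcode: int) -> int:
    """Model the 4-way mux that AddrCtl drives."""
    ctl = ADDR_CTL[state]
    if ctl == 0:
        return 0                       # replace state number by 0
    if ctl == 1:
        return DISPATCH_ROM_1[opcode]  # dispatch on opcode, table 1
    if ctl == 2:
        return DISPATCH_ROM_2[opcode]  # dispatch on opcode, table 2
    return state + 1                   # use incremented state


def trace(opcode: int) -> list:
    """List the states visited by one instruction, starting at state 0."""
    states = [0]
    while True:
        nxt = next_state(states[-1], opcode)
        if nxt == 0:
            return states
        states.append(nxt)


print(trace(0b100011))  # lw
print(trace(0b101011))  # sw
print(trace(0b000000))  # R-format
```

Because most states simply increment or return to 0, only states 1 and 2 need to look at the opcode at all, which is the saving this implementation style is after.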

5.6 Exceptions

Keywords

• Exception Also called interrupt. An unscheduled event that disrupts program execution; used to detect overflow.

• Interrupt An exception that comes from outside of the processor. (Some architectures use the term interrupt for all exceptions.)

Type of event                                   From where?   MIPS terminology
I/O device request                              External      Interrupt
Invoke the operating system from user program   Internal      Exception
Arithmetic overflow                             Internal      Exception
Using an undefined instruction                  Internal      Exception
Hardware malfunctions                           External      Exception or interrupt

Keywords

• Vectored interrupt An interrupt for which the address to which control is transferred is determined by the cause of the exception.

Exception type          Exception vector address (in hex)
Undefined instruction   C 0000 000
Arithmetic overflow     C 0020 000
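A vectored scheme amounts to a table lookup keyed by the cause of the exception. The sketch below illustrates that idea only; the two addresses in `VECTOR_TABLE` are assumed values for illustration, and `take_exception`, `EPC`, and the sample faulting address are hypothetical names and data rather than anything specified on the slide.

```python
# Hypothetical vector table: the addresses are assumed values chosen for
# illustration, one handler entry point per exception cause.
VECTOR_TABLE = {
    "undefined_instruction": 0xC0000000,
    "arithmetic_overflow": 0xC0000020,
}


def take_exception(faulting_pc: int, kind: str) -> dict:
    """Return the register updates performed when an exception is taken."""
    if kind not in VECTOR_TABLE:
        raise KeyError("no vector for exception type " + repr(kind))
    return {
        "EPC": faulting_pc,        # save the offending instruction's address
        "PC": VECTOR_TABLE[kind],  # transfer control to the handler
    }


state = take_exception(0x00400024, "arithmetic_overflow")
print(hex(state["EPC"]), hex(state["PC"]))
```

The cause selects the handler address directly, so the handler does not need to inspect a separate status register to find out what happened.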

How Does Control Check for Exceptions?

• Undefined instruction

• Arithmetic overflow

Figure 5.39 The multicycle datapath with the addition needed to implement exceptions.

Figure 5.40 This shows the finite state machine with the additions to handle exception detection.

5.9 Real Stuff: The Organization of Recent Pentium

Keywords

• Microprogrammed control A method of specifying control that uses microcode rather than a finite state representation.

• Hardwired control An implementation of finite state machine control typically using programmable logic arrays (PLAs) or collections of PLAs and random logic.

• Microcode The set of microinstructions that control a processor.

• Superscalar An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle.

• Microinstruction A representation of control using low-level instructions, each of which asserts a set of control signals that are active on a given clock cycle and specifies which microinstruction to execute next.

Keywords

• Micro-operations The RISC-like instructions directly executed by the hardware in recent Pentium implementations.

• Trace cache An instruction cache that holds a sequence of instructions with a given starting address; in recent Pentium implementations the trace cache holds micro-operations rather than IA-32 instructions.

• Dispatch An operation in a microprogrammed control unit in which the next microinstruction is selected on the basis of one or more fields of a macroinstruction, usually by creating a table containing the addresses of the target microinstructions and indexing the table using a field of the macroinstruction. The dispatch tables are typically implemented in ROM or programmable logic array (PLA). The term dispatch is also used in dynamically scheduled processors to refer to the process of sending an instruction to a queue.

Microprogramming

• What are the “microinstructions”?

Figure: microprogrammed control unit. Visible labels: Control unit; Microcode memory; Microprogram counter; Outputs; Datapath; PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, BWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, AddrCtl.

• A specification methodology

– appropriate if hundreds of opcodes, modes, cycles, etc.

– signals specified symbolically using microinstructions

• Will two implementations of the same architecture have the same microcode?

• What would a microassembler do?

Microprogramming

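Label      ALU control   SRC1   SRC2      Register control   Memory      PCWrite control   Sequencing
Fetch      Add           PC     4         -                  Read PC     ALU               Seq
-          Add           PC     Extshft   Read               -           -                 Dispatch 1
Mem1       Add           A      Extend    -                  -           -                 Dispatch 2
LW2        -             -      -         -                  Read ALU    -                 Seq
-          -             -      -         Write MDR          -           -                 Fetch
SW2        -             -      -         -                  Write ALU   -                 Fetch
Rformat1   Func code     A      B         -                  -           -                 Seq
-          -             -      -         Write ALU          -           -                 Fetch
BEQ1       Subt          A      B         -                  -           ALUOut-cond       Fetch
JUMP1      -             -      -         -                  -           Jump address      Fetch

(A dash marks an empty field; a row without a label continues the sequence started by the labeled row above it.)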

Microinstruction format

Field name         Value          Signals active                        Comment
ALU control        Add            ALUOp = 00                            Cause the ALU to add.
ALU control        Subt           ALUOp = 01                            Cause the ALU to subtract; this implements the compare for branches.
ALU control        Func code      ALUOp = 10                            Use the instruction's function code to determine ALU control.
SRC1               PC             ALUSrcA = 0                           Use the PC as the first ALU input.
SRC1               A              ALUSrcA = 1                           Register A is the first ALU input.
SRC2               B              ALUSrcB = 00                          Register B is the second ALU input.
SRC2               4              ALUSrcB = 01                          Use 4 as the second ALU input.
SRC2               Extend         ALUSrcB = 10                          Use output of the sign extension unit as the second ALU input.
SRC2               Extshft        ALUSrcB = 11                          Use the output of the shift-by-two unit as the second ALU input.
Register control   Read           -                                     Read two registers using the rs and rt fields of the IR as the register numbers and putting the data into registers A and B.
Register control   Write ALU      RegWrite, RegDst = 1, MemtoReg = 0    Write a register using the rd field of the IR as the register number and the contents of the ALUOut as the data.
Register control   Write MDR      RegWrite, RegDst = 0, MemtoReg = 1    Write a register using the rt field of the IR as the register number and the contents of the MDR as the data.
Memory             Read PC        MemRead, IorD = 0                     Read memory using the PC as address; write result into IR (and the MDR).
Memory             Read ALU       MemRead, IorD = 1                     Read memory using the ALUOut as address; write result into MDR.
Memory             Write ALU      MemWrite, IorD = 1                    Write memory using the ALUOut as address, contents of B as the data.
PC write control   ALU            PCSource = 00, PCWrite                Write the output of the ALU into the PC.
PC write control   ALUOut-cond    PCSource = 01, PCWriteCond            If the Zero output of the ALU is active, write the PC with the contents of the register ALUOut.
PC write control   Jump address   PCSource = 10, PCWrite                Write the PC with the jump address from the instruction.
Sequencing         Seq            AddrCtl = 11                          Choose the next microinstruction sequentially.
Sequencing         Fetch          AddrCtl = 00                          Go to the first microinstruction to begin a new instruction.
Sequencing         Dispatch 1     AddrCtl = 01                          Dispatch using the ROM 1.
Sequencing         Dispatch 2     AddrCtl = 10                          Dispatch using the ROM 2.
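One way to picture what a microassembler does is as a translation from the symbolic field values in the format table into the control signals they assert. The sketch below assumes that reading; the signal settings are copied from the table, while the short field keys (`alu`, `src1`, `src2`, `reg`, `mem`, `pc`, `seq`) and the function `assemble` are hypothetical names introduced for illustration.

```python
# Symbolic field value -> control signals, copied from the format table.
FIELD_SIGNALS = {
    "alu": {
        "Add": {"ALUOp": 0b00},
        "Subt": {"ALUOp": 0b01},
        "Func code": {"ALUOp": 0b10},
    },
    "src1": {"PC": {"ALUSrcA": 0}, "A": {"ALUSrcA": 1}},
    "src2": {
        "B": {"ALUSrcB": 0b00},
        "4": {"ALUSrcB": 0b01},
        "Extend": {"ALUSrcB": 0b10},
        "Extshft": {"ALUSrcB": 0b11},
    },
    "reg": {
        "Read": {},  # the table lists no control signal for this value
        "Write ALU": {"RegWrite": 1, "RegDst": 1, "MemtoReg": 0},
        "Write MDR": {"RegWrite": 1, "RegDst": 0, "MemtoReg": 1},
    },
    "mem": {
        "Read PC": {"MemRead": 1, "IorD": 0},
        "Read ALU": {"MemRead": 1, "IorD": 1},
        "Write ALU": {"MemWrite": 1, "IorD": 1},
    },
    "pc": {
        "ALU": {"PCSource": 0b00, "PCWrite": 1},
        "ALUOut-cond": {"PCSource": 0b01, "PCWriteCond": 1},
        "Jump address": {"PCSource": 0b10, "PCWrite": 1},
    },
    "seq": {
        "Seq": {"AddrCtl": 0b11},
        "Fetch": {"AddrCtl": 0b00},
        "Dispatch 1": {"AddrCtl": 0b01},
        "Dispatch 2": {"AddrCtl": 0b10},
    },
}


def assemble(microinstruction: dict) -> dict:
    """Translate one symbolic microinstruction into its control signals.

    Fields left out of the microinstruction assert nothing.
    """
    signals = {}
    for field, value in microinstruction.items():
        signals.update(FIELD_SIGNALS[field][value])
    return signals


# First microinstruction of the microprogram (label Fetch).
fetch = assemble({
    "alu": "Add", "src1": "PC", "src2": "4",
    "mem": "Read PC", "pc": "ALU", "seq": "Seq",
})
print(fetch)

# Microinstruction labeled BEQ1.
beq1 = assemble({
    "alu": "Subt", "src1": "A", "src2": "B",
    "pc": "ALUOut-cond", "seq": "Fetch",
})
print(beq1)
```

Keeping the symbolic names separate from the bit patterns is what lets the same microprogram be retargeted if the signal encodings change.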

• No encoding:

– 1 bit for each datapath operation

– faster, requires more memory (logic)

– used for the VAX 780: an astonishing 400K of memory!

• Lots of encoding:

– send the microinstructions through logic to get control signals

– uses less memory, slower

• Historical context of CISC:

– Too much logic to put on a single chip with everything else

– Use a ROM (or even RAM) to hold the microcode

– It’s easy to add new instructions

Maximally vs. Minimally Encoded

Microcode: Trade-offs

• Distinction between specification and implementation is sometimes blurred

• Specification Advantages:

– Easy to design and write

– Design architecture and microcode in parallel

• Implementation (off-chip ROM) Advantages

– Easy to change since values are in memory

– Can emulate other architectures

– Can make use of internal registers

• Implementation Disadvantages, SLOWER now that:

– Control is implemented on same chip as processor

– ROM is no longer faster than RAM

– No need to go back and make changes

5.10 Fallacies and Pitfalls

• Pitfall: Adding a complex instruction implemented with microprogramming may not be faster than a sequence using simpler instructions.

• Fallacy: If there is space in the control store, new instructions are free of cost.

5.11 Concluding Remarks

Figure 5.41 Alternative methods for specifying and implementing control.

5.12 Historical Perspective and Further Reading

Historical Perspective

• In the ‘60s and ‘70s microprogramming was very important for implementing machines

• This led to more sophisticated ISAs and the VAX

• In the ‘80s RISC processors based on pipelining became popular

• Pipelining the microinstructions is also possible!

• Implementations of IA-32 architecture processors since 486 use:

– “hardwired control” for simpler instructions (few cycles, FSM control implemented using PLA or random logic)

– “microcoded control” for more complex instructions (large numbers of cycles, central control store)

• The IA-64 architecture uses a RISC-style ISA and can be implemented without a large central control store

Pentium 4

• Pipelining is important (last IA-32 without it was 80386 in 1985)

• Pipelining is used for the simple instructions favored by compilers

“Simply put, a high performance implementation needs to ensure that the simple instructions execute quickly, and that the burden of the complexities of the instruction set penalize the complex, less frequently used, instructions”

Figure: Pentium 4 block layout. Labeled regions: Control; Enhanced floating point and multimedia; I/O interface; Instruction cache; Integer datapath; Data cache; Secondary cache and memory interface; Advanced pipelining, hyperthreading support. Callouts: Chapter 6, Chapter 7.

Pentium 4

• Somewhere in all that “control” we must handle complex instructions

• Processor executes simple microinstructions, 70 bits wide (hardwired)

• 120 control lines for integer datapath (400 for floating point)

• If an instruction requires more than 4 microinstructions to implement, control comes from the microcode ROM (8000 microinstructions)

• It’s complicated!


Chapter 5 Summary

• If we understand the instructions…

We can build a simple processor!

• If instructions take different amounts of time, multi-cycle is better

• Datapath implemented using:

– Combinational logic for arithmetic

– State holding elements to remember bits

• Control implemented using:

– Combinational logic for single-cycle implementation

– Finite state machine for multi-cycle implementation
