Computer Architecture and Organization
CS 2214

Version 0: Unpipelined EMY CPU

Haldun Hadimioglu
Computer Science & Engineering
Spring 2014

Outline
  Introduction
  Version 0 EMY CPU: Unpipelined EMY CPU
    It executes only integer instructions
    How a memory hierarchy can be attached to the unpipelined CPU is also studied
  Handout to use: EMY CPU


Introduction
  On the microarchitecture layer, a computer is a collection of at least three interconnected digital systems
    A central processing unit (CPU)
    A (main) memory
    An I/O controller to control an I/O device, such as the disk
      There can be several I/O controllers to control different I/O devices

[Figure: the CPU, the Memory and an I/O Controller (with its Disk) connected through the Interconnection System]


Digital Systems
  A digital system performs microoperations
  It consists of a datapath (data unit) and a control unit
    The datapath actually performs the microoperations
    The control unit determines which microoperation happens when

[Figure: the Datapath (registers, ALUs, buses) and the Control Unit (sequencer), linked by status signals and control signals]


Digital Systems
  The datapath (data unit) has registers, ALUs and buses to perform the microoperations
    Registers keep information temporarily
    ALUs perform arithmetic/logic operations
    Buses interconnect the registers and ALUs
    Other components used include
      Multiplexers (MUXes), decoders, encoders, comparators, counters, etc.


Digital Systems
  The control unit has a sequencer circuit that determines the sequence of microoperations
    The sequencer needs status signals from the data unit to know what is happening there
    Then, it determines which microoperations are to be performed and indicates this to the datapath by means of control signals


Designing Digital Systems
  Datapath design is simpler than control unit design since the datapath has highly regular (duplicated) circuits
    A 32-bit adder is composed of two identical 16-bit adders
    A 32-bit comparator consists of four identical 8-bit comparators, etc.
  Control unit design is more difficult due to
    Large amounts of random logic
    A lot of effort is needed to make sure there are no timing problems
      Microoperations must start at the right time and end at the right time!


Designing Digital Systems
  We will use the finite-state machine (FSM) technique to design the EMY CPU, where the FSM state diagram will have states with microoperations
    The state diagram shows precisely which state follows which state
    Each state indicates which microoperations to perform
    The state diagram shows which states are needed, and when, for each machine language instruction


Designing Digital Systems
  We will design the EMY CPU by using the finite-state machine (FSM) technique
  More specifically, we will obtain the following for the complete EMY CPU design
    A high-level state diagram to show which microoperation happens when
    The datapath from the high-level state diagram
    The low-level state diagram from the high-level state diagram and the datapath
    The control unit from the low-level state diagram
      It can be implemented by hardwiring and/or microprogramming


Designing the microarchitecture level of a computer
  There are two tasks in this design
    Develop the CPU and memory digital systems so that instructions can be run
    Develop the memory and I/O controller digital systems so that I/O can happen
  We will concentrate on the CPU and memory digital systems


Designing the CPU and memory digital systems
  First we focus on the CPU digital system while we quickly make a few design decisions on the memory
    We will design the CPU as a slow CPU running only integer instructions: no pipelining
      This is Version 0
      We will assume the memory is fast, which is not realistic today
      Then, we will see how a memory hierarchy with cache memories, etc. can be incorporated
    Then, we will improve the CPU speed by using pipelining, but still running integer instructions
      This is Version 1
      We will assume the memory is fast, which is not realistic today
      Then, we will see how a memory hierarchy with cache memories, etc. can be incorporated
      This CPU coverage will be in another PowerPoint presentation
  For both versions the memory will be a black box with a few details


Designing the CPU as a Digital System
  The EMY CPU digital system
    We will concentrate on designing the EMY CPU for nine integer instructions in the beginning
      High-level state diagram of the EMY CPU
      Datapath of the CPU
      Low-level state diagram of the CPU
      Control unit of the CPU


Designing the CPU Digital System
  To design the EMY CPU, we will start with the EMY architecture
  What is the connection between the architecture and the CPU?
    A computer processes digital information by running machine language instructions
      A program is a list of instructions, each of which specifies operations on data (arguments)
      An instruction specifies architectural operations
      Each architectural operation is implemented by microoperations


Designing the CPU Digital System
  In order to perform an architectural operation, the CPU performs a series of microoperations in a number of clock periods
    That is, an architectural operation is broken down into smaller operations called microoperations
  That is, to run a machine language instruction, the CPU performs microoperations
    The CPU performs some microoperations alone and some in cooperation with the memory and the I/O controllers


Designing the CPU Digital System
  Architectural operations
    An architectural operation is what we describe as the semantics of the instruction, such as
      The architectural operation specified by the ADD instruction
        Rd ← Rs + Rt
      The architectural operation specified by the SUB instruction
        Rd ← Rs - Rt
      The architectural operation specified by the SLT instruction
        If Rs < Rt then Rd ← 1 else Rd ← 0
      The architectural operation specified by the J instruction
        PC[27-0] ← (Address * 4)
    It is the CPU that contributes the most to the execution of an instruction since it performs most of the microoperations needed for an architectural operation


Designing the CPU Digital System
  Typical CPU digital system microoperations
    Add, subtract, multiply
      In the past, a 32-bit addition was completed in 1 clock period. Today, a 32-bit addition is completed in several clock periods
    AND, OR, XOR
    Shift right, shift left
    Read data from memory, write data to memory
      In the past, a memory access was completed in 1 clock period. Today, it is completed in several clock periods
    Read instructions from memory (fetch)
    Increment the program counter
    Transfer a register to another register
    …


Designing the CPU as a Digital System
  Other machines, especially CISC machines, require other microoperations such as
    Reading indirect address(es) from the memory
    Effective address calculation for
      Indexing
      Autoincrement
      Autodecrement
    Alignment for
      Instructions
      Data
      Addresses


Designing the CPU Digital System
  Architecture’s effect on microoperations
    The decisions made on architecture determine the microoperations needed for the execution of the instructions
      General microoperations found on most CPUs
        The ones mentioned on previous slides
      Specific microoperations for certain CPUs
        Specific microoperations for Memory Management Units (MMUs), caches, I/O controllers
    The architecture also determines the characteristics of each microoperation
      If the 26-bit PC-direct addressing mode is used, the rightmost 26 bits of IR are catenated with the leftmost 4 bits of PC and the resulting 30 bits are shifted to the left by 2
    Thus, each machine language instruction requires a certain number of microoperations taking a certain time: the CPIi
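
The 26-bit PC-direct address formation described above can be sketched in a few lines of Python. This is only an illustration, not the handout's datapath: the function name `jump_target` and the example values are assumptions, and the sketch takes "leftmost 4 bits of PC" to mean bits 31-28 of whatever value PC holds at that moment.

```python
MASK32 = 0xFFFFFFFF

def jump_target(pc: int, ir: int) -> int:
    """Form the 32-bit jump target from PC and IR (both 32-bit values)."""
    upper4 = (pc >> 28) & 0xF            # leftmost 4 bits of PC
    address26 = ir & 0x03FFFFFF          # rightmost 26 bits of IR
    thirty_bits = (upper4 << 26) | address26
    return (thirty_bits << 2) & MASK32   # shift left by 2, i.e. multiply by 4

# Hypothetical values: PC = 0x40000008 and a 26-bit address field of 0x10.
# The upper 6 bits of the IR value are arbitrary and are masked away.
target = jump_target(0x40000008, 0x08000010)
print(hex(target))
```

The two low-order zeros come from the shift, which is why the address field can name a word rather than a byte.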


Designing the CPU Digital System
  Microoperations
    The CPU can perform one or more microoperations per clock period, depending on the complexity of the microoperation and the availability of the hardware resources
    Most often a microoperation can be completed in one clock period unless it is a complex microoperation
      If a complex microoperation has to run in one clock period, the clock period needs to be longer
    The more numerous and complex the microoperations are, the longer it takes to run the machine language instruction
      CISC instructions take a longer time to execute (larger CPIi)


Designing the CPU Digital System
  Calculating CPIi
    The time it takes to run an instruction, CPIi, is then determined by
      The number of microoperations needed for it
      The complexity of the microoperations
    The number of clock periods for an instruction, CPIi, becomes a matter of figuring out the microoperations and distributing them to individual clock periods
      One can come up with 5-10 simple microoperations to be performed one after another, resulting in a CPIi of 5-10
        But, since the microoperations are simple, the clock period is short
      Alternatively, one can come up with 2-4 complex microoperations, resulting in a CPIi of 2-4
        But, the clock period is longer
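
The trade-off above can be made concrete with a small calculation: the time for one instruction is its CPIi multiplied by the clock period. The numbers below are hypothetical, chosen only to fall inside the 5-10 and 2-4 ranges mentioned on the slide; the function name `instruction_time_ns` is also an assumption.

```python
def instruction_time_ns(cpi: int, clock_period_ns: float) -> float:
    """Time to run one instruction = CPIi x clock period."""
    return cpi * clock_period_ns

# Hypothetical design A: 8 simple microoperations, one per short clock period.
many_short = instruction_time_ns(cpi=8, clock_period_ns=0.5)

# Hypothetical design B: 3 complex microoperations, each needing a longer period.
few_long = instruction_time_ns(cpi=3, clock_period_ns=1.5)

print(many_short, few_long)
```

With these made-up figures the two designs take a similar time per instruction, so CPIi alone does not decide which is faster; the clock period matters just as much.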


Designing the CPU Digital System
  Calculating CPIi
    What can we do? A few long clock periods vs. many but shorter clock periods?
      Since increasing the clock frequency is important for marketing purposes, the second option would weigh in substantially
      It turns out that if pipelining is implemented, having many shorter clock periods would be beneficial, as we will see
        CPIi figures will be large but CPIave will be close to 1 (one)!
      Today’s microprocessors have instruction CPIi values in the range of 10-30, but CPIave figures for their targeted applications are even less than 1 (one)!
        Because they employ advanced pipelining techniques, such as superscalar execution, hyperthreading, etc.


Designing the CPU Digital System
  Determining microoperations for a machine language instruction
    Some microoperations are performed for all the instructions
      Usually at the same point in time during the execution of every instruction
      Fetching the instruction is always the first microoperation to perform for all CPUs
      Updating PC (PC ← PC + 4) so that it points at the next instruction is also universal
    The other microoperations depend on the instruction, the addressing mode, where the arguments are, the length of the arguments, etc.


Designing the CPU Digital System
  Determining microoperations for a machine language instruction
    We would list all the microoperations for each instruction, making sure that we are consistent in terms of
      Bus usage
        We often decide on an approximate number of buses we need for our datapath
        Today’s CPUs have at least three internal buses to complete an integer arithmetic microoperation in one clock period
          Two buses carry the numbers from two registers and the third bus carries the result to a register
      ALU usage
        An ALU is expensive and so we try to limit the number of them


Designing the CPU Digital System
  Determining microoperations for a machine language instruction
    We would list all the microoperations for each instruction, making sure that we are consistent in terms of
      Register usage
        Additional registers not visible to the architecture level are used to keep temporary values: microarchitecture registers
        Typically, the more registers are used, the more clock periods we spend for an instruction, since temporary values will be passed from one register in one clock period to another register to be used in the following clock period
        But, sometimes we have to use microarchitecture registers, such as the instruction register that keeps the current instruction
      Control unit usage


Designing the CPU Digital System
  Determine how each EMY architectural operation is implemented by microoperations
    Most microoperations must be simple enough to be completed in less than one clock period
    A few microoperations may not be completed in a clock period
      For example, a memory read may take several clock periods since the memory is slower
      These long microoperations should be accommodated in the high-level state diagram, the datapath, the low-level state diagram and the control unit
    We will assume in the beginning that every microoperation is completed in one clock period


Designing the CPU Digital System
  The EMY microoperations implied by the EMY machine language instructions include
    Instruction fetch, performed always
    Update PC for next instruction, performed always
    Effective address calculation for Displacement and relative addressing modes
    Sign extension or catenation of 0s for data/addresses
    Reading data from the memory
    Writing data to the memory
    Performing an arithmetic/logic operation
    Register transfer
    Testing a condition


Unpipelined EMY CPU: Version 0
  By using the EMY CPU Handout
    The most interesting component of a computer is the CPU
    We know that the CPU has registers, buses, ALUs and a sequencer, among other components
    Note that whether hardwiring or microprogramming is used, the datapath stays the same, at least theoretically
      The datapath performs microoperations on data
        It uses registers, buses and the ALU for that purpose
      The microoperations are in turn controlled by the control unit


Overview
  We are now ready for the organizational design of the EMY
  We know the architecture of the EMY
  We will design
    The EMY CPU that will have
      A control unit with a sequencer
      A datapath containing registers, buses and the ALU
  The datapath performs the microoperations and the control unit determines the timing and sequence of these microoperations


Overview
  The way the EMY computer is covered indicates that the authors organized the computer similarly to the commercial EMY systems, where
    There is an integer EMY CPU
    A system control coprocessor (CP0) responsible for memory management and cache control
    An FP coprocessor (CP1)
  The integer EMY CPU registers are either architectural or microarchitectural (temporary registers)
  There are two other coprocessors, CP2 and CP3, that are reserved for future use


Overview
  Designing the EMY CPU for all of its instructions is prohibitive
  First, we will design an EMY CPU to execute only integer instructions that include
    LW, SW
    ADD, SUB, SLT, AND, OR
    BEQ, J
  These integer instructions use three formats: R, I and J


Overview
  The EMY CPU will have all the architectural registers needed by these nine integer instructions
    32 32-bit GPRs
    32-bit PC


New Microarchitectural registers
  These (temporary) registers are not a part of the state (hence architecture)
    32-bit instruction register, IR, to keep the current instruction
      IR contains the instruction until it is completely executed
    32-bit A and B registers
      They keep the content of the Rs and Rt registers of the current instruction
    32-bit register ALUout
      It contains a memory address or an A/L operation result
    32-bit Memory Data Register, MDR
      It keeps the data read from the memory for Load instructions


New Microarchitectural registers
  32-bit A and B registers

  I format:  Opcode (6) | Rs (5) | Rt (5) | Displacement/Offset/Immediate (16)
  R format:  Opcode (6) | Rs (5) | Rt (5) | Rd (5) | Shamt (5) | Function (6)

  In both formats, the register selected by the Rs field goes to register A and the register selected by the Rt field goes to register B
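
The field layout above can be illustrated by extracting the fields of a 32-bit instruction word with shifts and masks. This is a sketch under an assumption the slide does not state explicitly: the fields are packed in the listed order with the opcode in the most significant bits. The function name `decode_fields` and the example word are hypothetical.

```python
def decode_fields(ir: int) -> dict:
    """Split a 32-bit instruction word into its I-format and R-format fields."""
    return {
        "opcode":   (ir >> 26) & 0x3F,   # 6 bits
        "rs":       (ir >> 21) & 0x1F,   # 5 bits, selects the GPR copied to A
        "rt":       (ir >> 16) & 0x1F,   # 5 bits, selects the GPR copied to B
        "rd":       (ir >> 11) & 0x1F,   # 5 bits (R format only)
        "shamt":    (ir >> 6) & 0x1F,    # 5 bits (R format only)
        "function": ir & 0x3F,           # 6 bits (R format only)
        "doimm":    ir & 0xFFFF,         # 16 bits (I format only)
    }

# Hypothetical R-format word with Rs = 1, Rt = 2, Rd = 3 and function bits 0x20.
example_ir = (1 << 21) | (2 << 16) | (3 << 11) | 0x20
fields = decode_fields(example_ir)
print(fields)
```

Because Rs and Rt sit at the same bit positions in both formats, the hardware can route them to the register file before it knows which format the instruction uses.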


New Microarchitectural registers
  Even if an instruction does not have Rs and Rt fields, such as a J-format instruction, the bits in the Rs and Rt field positions are used to move Rs and Rt content to A and B, respectively
    The values of the A and B registers will not be used!
    The reason for moving to A and B is to make the common case fast, where we think most instructions are R-format or I-format and require this move!

  J format:  Opcode (6) | Offset26 (26)
    The 5 + 5 bits of Offset26 that lie in the Rs and Rt positions still go to register A and register B


New Microarchitectural registers
  Note that the Displacement used for loads and stores is signed
  The offset of BEQ is also signed
  We have to sign extend the 16-bit Displacement, Offset and Immediate (DOImm) value for some of the integer instructions
    These include LW, SW, BEQ
    We will use DOImm+ to indicate a sign-extended value from now on
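
Sign extension of the 16-bit DOImm field to 32 bits (DOImm+) can be sketched as follows. The function name `sign_extend16` is an assumption made for this illustration; the operation itself is the standard two's complement rule of copying bit 15 into bits 31-16.

```python
def sign_extend16(value: int) -> int:
    """Return the 32-bit two's complement pattern of a 16-bit signed field."""
    value &= 0xFFFF                  # keep only the 16-bit field
    if value & 0x8000:               # bit 15 set: the field is negative
        return value | 0xFFFF0000    # fill the upper 16 bits with 1s
    return value                     # otherwise the upper 16 bits stay 0

positive = sign_extend16(0x0004)     # +4 stays +4
negative = sign_extend16(0xFFFC)     # -4 as a 16-bit field
print(hex(positive), hex(negative))
```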


The EMY CPU state diagram
  The design of a CPU is very complex
    We have to consider the space (hardware) and time (speed)
    The design, analysis, description, testing, modification, optimization, servicing and maintenance can be more efficient if there are efficient tools around
      These include HDLs and CAD tools
    The textbook uses a typical register transfer language (RTL) notation in Appendix A to describe the execution of instructions
      We will use the same RTL notation, which is also used in the handout
    To quickly see the execution steps of the integer machine language instructions, a high-level state diagram, a CPU datapath and a low-level state diagram are developed in the handout
    Additionally, timing diagrams and tables need to be studied to understand the CPU design


The EMY CPU state diagram
  An instruction goes through several phases when executed
  We give a name to each phase of an instruction execution
    A phase is also called a major cycle
    Each major cycle will take one or more minor cycles (clock periods)
      Each minor cycle is a state
      Each minor cycle typically takes one clock period
    Each major cycle often has at least one microoperation
      Often the name of a major cycle is derived from the major microoperation of the cycle


The EMY CPU state diagram
  The number of major cycles and their complexity are small for RISC systems and larger for CISC systems
    Often for RISC systems, the CPIi for the most frequently used instructions is between 4 and 6
      However, this number has to be larger to have deep pipelining and high clock frequencies
    In simple systems like RISC systems, sharing of hardware among different major cycles is not necessary
      A hardware resource is often needed in one major cycle only
      The hardware for each major cycle can then be easily identified and is often called a stage
    So, the execution of an instruction is the movement of the instruction through some or all of the stages of the CPU!


The EMY CPU state diagram
  The EMY integer instructions go through at most five major cycles during the execution
    However, even for this RISC machine, it is difficult to name the five cycles because not all instructions do similar things in a major cycle
    Some microoperations will be performed in advance in anticipation of a frequent operation
      The early operations will not alter the state and will not cause longer clock periods, but will slightly increase the hardware


The EMY CPU state diagram
  The EMY CPU major cycles for integer instructions
    Instruction fetch cycle
      Abbreviated as IF, standing for instruction fetch
      Same for all EMY instructions
    Instruction decode/Register fetch cycle
      Abbreviated as ID, standing for instruction decode
      Same for all EMY instructions
    Execution/effective address cycle
      Abbreviated as EX, standing for execution
    Memory access cycle
      Abbreviated as MEM, standing for memory
    Write-back cycle
      Abbreviated as WB, standing for write-back


The EMY CPU state diagram
  Emphasizing again that designing a CPU is determining which microoperation happens when for each architectural operation (the semantics of the instruction)
    For the EMY, like many other CPUs, the IF and ID stages are identical for all instructions
      The same microoperations are performed for all instructions
      These microoperations implement portions of the architectural operation
    For the EMY, the remaining portions of the architectural operation are performed in the EX, MEM and WB cycles


The EMY CPU state diagram
  Architectural operations of I-format instructions among the integer instructions
    Load/Store instructions
      LW Rt, Disp(Rs)    Rt ← M[Rs + Disp+]
      SW Rt, Disp(Rs)    M[Rs + Disp+] ← Rt
    Architectural operations of Load/Store instructions ≡ semantics
    The + after Disp (a superscript on the slide) indicates sign extension

  I format:  Opcode (6) | Rs (5) | Rt (5) | Displacement (16)


The EMY CPU state diagram
  Architectural operations of I-format instructions among the integer instructions
    Branch instruction
      BEQ Rs, Rt, Offset    If Rs = Rt, then PC ← PC + (Offset+ x 4)

  I format:  Opcode (6) | Rs (5) | Rt (5) | Offset (16)
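
The BEQ semantics above can be sketched as a function that returns the next PC value. The names `beq_next_pc` and `sign_extend16` are assumptions for this illustration, and `pc` stands for whatever value PC holds when the branch is resolved; the slide does not say here whether PC has already been incremented by 4.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value: int) -> int:
    """32-bit two's complement pattern of a 16-bit signed field (Offset+)."""
    value &= 0xFFFF
    return value | 0xFFFF0000 if value & 0x8000 else value

def beq_next_pc(pc: int, rs_value: int, rt_value: int, offset16: int) -> int:
    """If Rs = Rt, then PC <- PC + (Offset+ x 4); otherwise PC is unchanged."""
    if rs_value != rt_value:
        return pc
    byte_offset = (sign_extend16(offset16) << 2) & MASK32   # Offset+ x 4
    return (pc + byte_offset) & MASK32                      # wrap to 32 bits

# Hypothetical values: equal registers and an offset of -1 word.
taken = beq_next_pc(0x100, 5, 5, 0xFFFF)
not_taken = beq_next_pc(0x100, 5, 6, 0xFFFF)
print(hex(taken), hex(not_taken))
```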


The EMY CPU state diagram
  Architectural operations of R-format instructions among the integer instructions
    Arithmetic/Logic instructions
      ADD Rd, Rs, Rt    Rd ← Rs + Rt
      SUB Rd, Rs, Rt    Rd ← Rs - Rt
      AND Rd, Rs, Rt    Rd ← Rs & Rt
      OR  Rd, Rs, Rt    Rd ← Rs | Rt
      SLT Rd, Rs, Rt    If Rs < Rt then Rd ← 1 else Rd ← 0

  R format:  Opcode (6) | Rs (5) | Rt (5) | Rd (5) | Shamt (5) | Function (6)
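
The five R-format architectural operations can be summarized by one small function over 32-bit values. This is a sketch only: the function names `alu` and `to_signed` are assumptions, and the comparison in SLT is assumed to be a signed two's complement comparison, which the slide does not state.

```python
MASK32 = 0xFFFFFFFF

def to_signed(value: int) -> int:
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value

def alu(operation: str, rs_value: int, rt_value: int) -> int:
    """Return the 32-bit value written to Rd for one R-format instruction."""
    if operation == "ADD":
        return (rs_value + rt_value) & MASK32
    if operation == "SUB":
        return (rs_value - rt_value) & MASK32
    if operation == "AND":
        return rs_value & rt_value
    if operation == "OR":
        return rs_value | rt_value
    if operation == "SLT":
        # Assumption: signed comparison.
        return 1 if to_signed(rs_value) < to_signed(rt_value) else 0
    raise ValueError(f"unsupported operation: {operation}")

print(hex(alu("SUB", 3, 5)), alu("SLT", 0xFFFFFFFF, 1))
```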


The EMY CPU state diagram
  Architectural operations of J-format instructions among the integer instructions
    Jump instruction
      PC[27-0] ← (Address x 4)

  J format:  Opcode (6) | Offset26 (26)
    The 5 + 5 bits in the Rs and Rt positions lie inside Offset26


The EMY CPU state diagram
  The major cycles of the EMY CPU are shown by the high-level state diagram given in the EMY CPU handout
    Registers A and B are used to prepare operands for an ALU operation
    Each state takes 1 clock period
      Later, we will change it to one or more clock periods
        Memory accesses and complex arithmetic operations can take more than one clock period to perform
        The state that has a memory access or a complex arithmetic operation will take more than one clock period
    All microoperations mentioned in a state are performed in parallel, so their order does not matter
      If a state takes more than one clock period, one has to be careful about the parallel operations
  We now obtain the state diagram and the datapath hardware of the EMY CPU


The EMY major cycles and states
  The instruction fetch cycle
    It is performed for all the instructions
    There are two microoperations performed
      In general, all CPUs, regardless of their architecture, do these two microoperations
      Read the machine language instruction pointed to by the program counter (PC) into the instruction register (IR)
      Update the program counter so that it points at the instruction that follows the instruction being read from the memory



Read the machine language instruction pointed to by the program counter (PC) into the instruction register (IR)
  IR ← M[PC]
  Note the RTL notation: we use an equal sign (=) if the destination is a wire or a bus and an arrow sign (←) if the destination is a register, such as IR



Read the machine language instruction pointed to by the program counter (PC) into the instruction register (IR)
  IR ← M[PC]
  Then, the read of the instruction in terms of buses is as follows:
    MABUS = PC ; MemRead = 1 ; IR ← MRBUS
  Note again that the three microoperations implement the instruction read; they happen at the same time and their order does not matter
  Note the RTL notation: we use an equal sign (=) if the destination is a wire or a bus, such as MABUS, and an arrow sign (←) if the destination is a register, such as IR



Update the program counter so that it points at the next instruction
  PC ← PC + 4
  Since an instruction is four bytes long, we need to add 4 to PC
  We will use the general ALU to do the addition, at the expense of increasing the complexity of the ALU input logic


The instruction fetch cycle
  The two microoperations of the IF cycle can be shown in state 0 as follows
    State 0 (IF):  IR ← M[PC] ; PC ← PC + 4 ;
  The two microoperations are simply shown without using buses to save space
  The instruction read and PC update microoperations happen simultaneously and complete before the end of the clock period
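
State 0 can be mimicked in software by computing both new register values from the old state before committing either, which mirrors the fact that the two microoperations happen in the same clock period. This is a sketch, not the handout's design: the dictionary-based `state` and `memory` and the function name `instruction_fetch` are assumptions.

```python
MASK32 = 0xFFFFFFFF

def instruction_fetch(state: dict, memory: dict) -> dict:
    """State 0 (IF): IR <- M[PC] ; PC <- PC + 4, both using the old PC."""
    old_pc = state["PC"]
    new_ir = memory[old_pc]              # IR <- M[PC]
    new_pc = (old_pc + 4) & MASK32       # PC <- PC + 4
    # Both registers are updated together, as at the end of the clock period.
    return {**state, "IR": new_ir, "PC": new_pc}

# Hypothetical memory content and starting state.
memory = {0x100: 0xDEADBEEF}
before = {"PC": 0x100, "IR": 0}
after = instruction_fetch(before, memory)
print(hex(after["IR"]), hex(after["PC"]))
```

Reading the old PC once and writing both results at the end is what makes the order of the two microoperations irrelevant.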


The EMY major cycles and states
  The instruction decode cycle
    The most important goal in this cycle is to decode the instruction
      Decoding the instruction means the CPU determines what the current instruction is
    It is performed for all the instructions regardless of their architecture
    Decoding is done by the control unit, which checks the opcode and function bits of IR
      They are input as status signals to the control unit
      During this time the datapath does not do anything


The instruction decode cycle
  During the Decode cycle the control unit determines what the next state will be based on the type of the instruction
    If it is a memory reference instruction (LW, SW), the next state is state 2 in the EX cycle
    If it is an R-format A/L instruction, the next state is state 6 in the EX cycle
    If it is a BEQ instruction, the next state is state 8 in the EX cycle
    If it is a J instruction, the next state is state 9 in the EX cycle
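
The next-state decision made in the Decode cycle amounts to a lookup from instruction type to state number. The sketch below uses mnemonics instead of opcode and function bit patterns, because the actual encodings are not given here; the names `NEXT_STATE_AFTER_DECODE` and `next_state_after_decode` are assumptions.

```python
# State numbers are taken from the slide; the mnemonic keys stand in for the
# opcode/function bits that the real control unit would examine.
NEXT_STATE_AFTER_DECODE = {
    "LW": 2, "SW": 2,                                   # memory reference
    "ADD": 6, "SUB": 6, "SLT": 6, "AND": 6, "OR": 6,    # R-format A/L
    "BEQ": 8,
    "J": 9,
}

def next_state_after_decode(mnemonic: str) -> int:
    """Return the EX-cycle state that follows state 1 (ID)."""
    try:
        return NEXT_STATE_AFTER_DECODE[mnemonic]
    except KeyError:
        raise ValueError(f"not one of the nine integer instructions: {mnemonic}")

print(next_state_after_decode("LW"), next_state_after_decode("J"))
```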



We realize that we can perform a number of microoperations needed for some instructions in the datapath since it is not used
  Performing these microoperations in advance can help run those instructions faster
  Which microoperations to perform in order to be prepared?
    We can transfer the GPR register Rs pointed to by I-format and R-format instructions to register A
    We can transfer the GPR register Rt pointed to by I-format and R-format instructions to register B
  We realize that in order to save hardware we can transfer Rs and Rt to the A and B registers in every clock period
    This will not cause any problem and will simplify the Control Unit, since it would otherwise have to generate Store signals for the A and B registers
  We realize that we can transfer the output of the ALU to a microarchitectural register, ALUout, in every clock period
    We will later see that it will simplify the Control Unit



When we discuss BEQ, we will add one more microoperation to perform in the Decode cycle
  By doing these in advance, we save time
  But, not all instructions need them: J-format instructions do not need them and some I-format instructions do not need the transfer to register B
    This is fine since the A, B and ALUout registers are not architectural registers and so changing them will not result in program errors
  These microoperations transferring to A, B and ALUout are performed for all the instructions in every clock period
    In general, RISC CPUs do these microoperations



A ← GPR[Rs] ; B ← GPR[Rt]
  The GPR register file is designed so that two GPRs can be read simultaneously, by using the Rs and Rt fields of IR
    This means the GPR register file has two read ports controlled by Rs and Rt
  Note that the order of these microoperations does not matter as they happen simultaneously
  There is also a write port to the GPR register file controlled by the Rt and Rd fields: 10 bits are connected to the GPR file to determine the destination register
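
A register file with two read ports and one write port can be modeled very simply. The class below is a sketch under assumptions: the class and method names are hypothetical, and the choice between Rt and Rd as the destination is left to the caller, since the selection logic is not described at this point.

```python
class RegisterFile:
    """32 general-purpose 32-bit registers with two read ports and one write port."""

    def __init__(self, count: int = 32):
        self._regs = [0] * count

    def read_ports(self, rs: int, rt: int) -> tuple:
        """Both ports are read together: returns (GPR[Rs], GPR[Rt])."""
        return self._regs[rs], self._regs[rt]

    def write(self, dest: int, value: int) -> None:
        """Write port: dest is the register number taken from Rt or Rd."""
        self._regs[dest] = value & 0xFFFFFFFF

# Hypothetical usage: load two registers, then perform A <- GPR[Rs] ; B <- GPR[Rt].
gpr = RegisterFile()
gpr.write(3, 7)
gpr.write(4, 9)
a, b = gpr.read_ports(rs=3, rt=4)
print(a, b)
```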


The instruction decode cycle
  These three microoperations happen in every clock period
    A ← GPR[Rs] ; B ← GPR[Rt] ; ALUout ← OBUS
  The GPR read ports are directly connected to registers A and B and so no buses are used
  The three microoperations happen simultaneously in every clock period and complete before the end of the clock period
  Since these microoperations happen every clock period, we will not show them in our states


The instruction decode cycle
  For the time being the decode cycle state will be as follows
    State 1 (ID):  no microoperation of its own
    In every clock period:  A ← GPR[Rs] ; B ← GPR[Rt] ; ALUout ← OBUS
  No microoperation happens other than the three microoperations that happen every clock period
  However, we will change state 1 and place a microoperation there when we discuss BEQ instructions


Completing the execution of LW and SW
  The LW instruction
    LW Rt, Disp(Rs)    Rt ← M[Rs + Disp+]
    We see that to execute the LW we need to
      1) Calculate the effective address, the address of the memory location we want to load from: Rs + Disp+
      2) Read the cache memory location pointed to by the effective address
      3) Transfer the value to GPR register Rt
  The SW instruction
    SW Rt, Disp(Rs)    M[Rs + Disp+] ← Rt
    We see that to execute the SW we need to
      1) Calculate the effective address, the address of the memory location we want to store to: Rs + Disp+
      2) Write to the cache memory location pointed to by the effective address, transferring the value from GPR register Rt to that location


Completing the execution of LW and SW
  LW and SW both have a microoperation in common: calculating the effective address
    Then their microoperations differ
  In order to calculate the effective address, we need to sign extend the DOImm field and add it to GPR Rs
    Register GPR Rs has been transferred to A
  We also realize that GPR register Rt has been transferred to register B
    Register B will then be written to the memory for the SW instruction
  LW requires one more microoperation than SW, as we will soon see


Completing the execution of LW and SW
  We decide to have the effective address calculation of LW and SW in the Execution/Effective address cycle
    The effective address is stored in a microarchitectural register called ALUout
  Then, we separate LW and SW execution in the Memory Access/Branch completion cycle: both access the memory
    LW reads the memory location pointed to by the effective address into a microarchitectural register called MDR
    SW writes microarchitectural register B to the memory location pointed to by the effective address and completes its execution
  LW completes its execution by transferring the data in MDR to GPR register Rt


Completing the execution of LW and SW
  The effective address calculation
    Rs + Disp+
      Rs is now in register A
      Sign extend DOImm and then add
    As we will see shortly, A/L instructions will have their arithmetic/logic operation performed in this cycle as well
      They need the ALU in this cycle
    Therefore, we decide to use the adder of the ALU to do the addition for the effective address
      ALUout ← A + DOImm+


Completing the execution of LW and SW
  Reading from the memory
    MABUS = ALUout ; MemRead = 1 ; MDR ← MRBUS
    Note that the microoperations are performed in parallel and the order does not matter
    This microoperation can be stated without giving the bus detail
      MDR ← M[ALUout]
    Note that the memory access can take more than one clock period and so we may stay in this state more than one clock period


Completing the execution of LW and SW

The LW instruction completes by transferring MDR to GPR register Rt:

  GPR[Rt] ← MDR

The Rt field of IR is used by the GPR register file to select the register into which the value from MDR is written.
The data read from the memory is first held in the microarchitectural register MDR.
  We could instead store the data directly in GPR register Rt.
  We decide to store to MDR and transfer from MDR to the GPR write port.
  This decision will help pipelining, as we will see later!
We then go back to state 0, the IF cycle, to start executing the next instruction.


Completing the execution of LW and SW

Storing to the data memory:

  MABUS = ALUout ; MemWrite = 1 ; MWBUS = B

Note that the microoperations are performed in parallel and the order does not matter.
This microoperation can be stated without giving the bus detail:

  M[ALUout] ← B

Note that the memory access can take more than one clock period, so we may stay in this state for more than one clock period.
SW completes its execution! We then go back to state 0, the IF cycle, to start executing the next instruction.


Completing the execution of LW and SW

The portion of the state diagram for LW and SW (entered from the ID cycle for LW and SW):

  State 2 (EX):       ALUout ← A + SignExt(DOImm)    next: state 3 if LW, state 5 if SW
  State 3 (MEM, LW):  MDR ← M[ALUout]                next: state 4
  State 4 (WB, LW):   GPR[Rt] ← MDR                  next: state 0
  State 5 (MEM, SW):  M[ALUout] ← B                  next: state 0
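The LW/SW states can be traced in a few lines of Python. This is only a minimal sketch under stated assumptions: the dictionary-based register file and memory and the names `sign_extend16` and `run_lw_sw` are hypothetical, chosen for illustration, and are not part of the EMY handout.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def run_lw_sw(opcode, rs, rt, doimm, gpr, mem):
    """Trace states 2..5 for LW or SW and return the list of states visited.

    gpr and mem are plain dicts (an assumption of this sketch).
    """
    a, b = gpr[rs], gpr[rt]            # already done in the ID cycle (state 1)
    states = [2]
    aluout = (a + sign_extend16(doimm)) & 0xFFFFFFFF   # state 2 (EX)
    if opcode == "LW":
        states.append(3)
        mdr = mem.get(aluout, 0)       # state 3 (MEM): MDR <- M[ALUout]
        states.append(4)
        gpr[rt] = mdr                  # state 4 (WB): GPR[Rt] <- MDR
    elif opcode == "SW":
        states.append(5)
        mem[aluout] = b                # state 5 (MEM): M[ALUout] <- B
    else:
        raise ValueError("only LW and SW are modeled here")
    return states
```

Counting the two earlier states (0 and 1), LW visits five states and SW visits four, which is why LW needs one more microoperation than SW.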


Completing the execution of R-format A/L instructions

The R-format A/L instructions:

  ADD Rd, Rs, Rt    Rd ← Rs + Rt
  SUB Rd, Rs, Rt    Rd ← Rs - Rt
  AND Rd, Rs, Rt    Rd ← Rs & Rt
  OR  Rd, Rs, Rt    Rd ← Rs | Rt
  SLT Rd, Rs, Rt    If Rs < Rt then Rd ← 1 else Rd ← 0

We see that to execute these instructions we need to perform an operation specified by the Opcode and Function fields.
Then, we transfer the result to GPR register Rd.


Completing the execution of R-format A/L instructions

We see that we can perform all the required operations for R-format instructions in one state.
  Which one to perform is determined by the Opcode and Function fields.
The inputs are Rs and Rt.
  Rs has already been transferred to register A and Rt has already been transferred to register B.
  We see that we save time by moving them in the ID stage!


Completing the execution of R-format A/L instructions

We see that we can perform all the required operations for R-format instructions in one state.
The result is stored in the microarchitectural register ALUout.
  We could instead store the result of the operation directly in GPR register Rd.
  This would require a separate bus from the output of the ALU to the write port of the GPR file.
  We decide to store to ALUout and transfer from ALUout to the GPR write port.
  This decision will help pipelining, as we will see later!


Completing the execution of R-format A/L instructions

The microoperation for the current R-format A/L operation:

  ALUout ← A op B

Here “op” means that the type of the operation is indicated by the Opcode and Function fields of IR.
  The control unit uses the Opcode and Function fields to generate a set of control signals.
  These control signals are connected to the ALU, telling it which operation to perform.


Completing the execution of R-format A/L instructions

The ALU output for the five A/L instructions is straightforward to understand, except when the SLT instruction is executed:

  SLT Rd, Rs, Rt    If Rs < Rt then Rd ← 1 else Rd ← 0

The ALU has to output
  1 (31 zeros and a one) if Rs < Rt
  0 (32 zeros) otherwise
The ALU will have a functional unit called SLT that produces this output.
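The five R-format operations, including the SLT case, can be sketched as a small Python function. This is an illustration under stated assumptions, not the handout's ALU: the function names are hypothetical, the operation is selected by a string instead of Opcode/Function bits, and the SLT comparison is assumed to be a signed (2's complement) comparison.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit pattern as a signed two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu(op, a, b):
    """Return the 32-bit value that would be stored in ALUout for A op B."""
    if op == "ADD":
        return (a + b) & MASK32
    if op == "SUB":
        return (a - b) & MASK32
    if op == "AND":
        return a & b & MASK32
    if op == "OR":
        return (a | b) & MASK32
    if op == "SLT":
        # Assumed signed comparison: 31 zeros and a one, or 32 zeros.
        return 1 if to_signed32(a) < to_signed32(b) else 0
    raise ValueError("unknown operation: " + op)
```

For instance, `alu("SLT", 0xFFFFFFFF, 1)` treats the first operand as -1 under the signed assumption and therefore yields 1.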


Completing the execution of R-format A/L instructions

The result that is in ALUout is moved to GPR register Rd:

  GPR[Rd] ← ALUout

The Rd field of IR is used by the GPR register file to select the register into which the value from ALUout is written.
We then go back to state 0, the IF cycle, to start executing the next instruction.


Completing the execution of R-format A/L instructions

The portion of the state diagram for R-format A/L instructions (entered from the ID cycle for R-format A/L instructions):

  State 6 (EX):   ALUout ← A op B      next: state 7
  State 7 (MEM):  GPR[Rd] ← ALUout     next: state 0


Completing the execution of Control instructions

The BEQ instruction:
  BEQ Rs, Rt, Offset    If Rs = Rt, then PC ← PC + (SignExt(Offset) × 4)
The J instruction:
  J Address             PC[27-0] ← (Address × 4)


Completing the execution of Control instructions

We see that to execute these instructions we need to
1) Calculate the effective address, that is, the address to branch/jump to.
   For BEQ: add PC to the result of multiplying the sign-extended Offset by 4.
   For J: move to the rightmost 28 bits of PC the result of multiplying the Address by 4.
     This completes the J instruction execution.
2) Test if Rs is equal to Rt and, if yes, transfer the effective address to PC.
     This completes the BEQ instruction execution.


Completing the execution of BEQ instruction

In order to calculate the effective address, we need to
  Sign extend the DOImm field
  Then multiply it by 4
  Then add it to PC
That is, we need to do the following:

  PC + (SignExt(Offset) × 4)

We realize that the datapath is free to do this microoperation in the ID cycle while it is doing the other microoperations, as it does in every clock period:

  A ← GPR[Rs] ; B ← GPR[Rt] ; ALUout ← OBUS


Completing the execution of BEQ instruction

We then decide to calculate the effective address of the BEQ instruction in the ID cycle:

  ALUout ← PC + (SignExt(Offset) × 4)

Note that we are calculating the BEQ effective address while we are determining which instruction we have.
  If we do not have a BEQ instruction, the result is not used.
  Otherwise, we save time since we perform this operation in advance.
Note that executing BEQ fast, by performing its microoperation in advance, is important since this will help pipelining.
  As we shall see later, control instructions (all Branch and Jump instructions) slow down the pipelined CPU.
  It is therefore critical to complete their execution as quickly as possible to reduce the negative effect of these instructions on the pipeline.


Completing the execution of BEQ instruction

The BEQ effective address calculation:

  ALUout ← PC + (SignExt(Offset) × 4)

Sign extending DOImm requires a simple combinational circuit.
Multiplying by 4 requires no logic at all.
  We just have to catenate two zeros to the right of DOImm.
  We know that shifting a number to the left by two bit positions is multiplying it by four.
We decide to use the adder of the ALU to do the addition since it is free to use:

  ALUout ← PC + (SignExt(Offset) << 2)
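The branch address calculation can be checked numerically with a short Python sketch. The function names are hypothetical, and `pc` is assumed to be the value PC holds in the ID cycle, that is, after state 0 has already performed PC ← PC + 4.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def beq_target(pc, offset16):
    """ALUout <- PC + (sign-extended Offset << 2), kept to 32 bits.

    pc is assumed to be the already-incremented PC (address of BEQ plus 4).
    """
    return (pc + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF
```

With a BEQ at address 400000 (hex) + 10, the incremented PC is 400014, so an Offset of 3 gives `beq_target(0x400014, 3) == 0x400020`.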


Completing the execution of BEQ instruction

Since the BEQ effective address calculation is done in the decode cycle, state 1 has been modified.
We now perform one more microoperation in the decode cycle, besides the three microoperations we perform every clock period:

  State 1 (ID):  ALUout ← PC + (SignExt(Offset) << 2)
                 A ← GPR[Rs] ; B ← GPR[Rt] ; ALUout ← OBUS


Completing the execution of BEQ instruction

Testing if Rs is equal to Rt and storing to PC based on the result:
We know that GPR register Rs has been transferred to register A.
We know that GPR register Rt has been transferred to register B.
We will compare register A and register B then!
We decide to have a Zero circuit in the ALU to compare registers A and B.
  The ALU will have a new 1-bit output named Zero showing the result of the compare, so that Zero is
    1 if the two registers are equal
    0 if the two registers are not equal


Completing the execution of BEQ instruction

Changing PC if Zero is 1:

  If (A == B) then PC ← ALUout

This means we branch to a memory location.
  That is, we take the branch.


Completing the execution of BEQ instruction

How do we handle conditional and unconditional stores on PC?
If PC has to be stored unconditionally, such as in state 0, we use the control signal PCWrite to store on PC.
The control unit generates another control signal, PCWriteCond, to conditionally store on PC.
  PCWriteCond is ANDed with Zero to conditionally store on PC.
  If Zero = 1, the condition is true and we branch, which means we store the effective address on PC.
  If Zero = 0, the condition is not true and we do not branch, which means we do not store the effective address on PC.

[Figure: the Zero, PCWriteCond and PCWrite signals drive the store input of PC]
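The store condition for PC can be written as a one-line Boolean expression. This is a sketch, assuming (as is usual for this kind of design, but not stated in the text) that the ANDed signal is ORed with PCWrite to form the store input of PC; the function name is hypothetical.

```python
def pc_store_enable(pc_write, pc_write_cond, zero):
    """Return 1 when PC should be stored in this clock period.

    Assumption of this sketch: enable = PCWrite OR (PCWriteCond AND Zero).
    All arguments are single bits (0 or 1).
    """
    return pc_write | (pc_write_cond & zero)
```

Under this assumption a BEQ (PCWriteCond = 1) stores on PC only when Zero = 1, while an unconditional store (PCWrite = 1) ignores Zero.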


Completing the execution of BEQ instruction

The portion of the state diagram for the BEQ instruction (entered from the ID cycle for BEQ):

  State 8 (EX):  If A == B then PC ← ALUout     next: state 0

Note again that we complete the execution of BEQ fast since this will help pipelining.
  As mentioned earlier, control instructions slow down the pipelined CPU.
  It is therefore critical to complete their execution as quickly as possible to reduce the negative effect of these instructions on the pipeline.


Completing the execution of J instruction

We see that all we need to do is store the effective address on PC unconditionally.
  Move to the rightmost 28 bits of PC the result of the multiplication of the Address by 4:

  PC[27-0] ← (Address × 4)

This is equivalent to

  PC ← PC[31-28], (Address × 4)

The leftmost 4 bits of PC are not changed!
The above microoperation has the same effect as the following one:

  PC ← (PC[31-28], Address) × 4


Completing the execution of J instruction

In order to calculate the effective address, we need to
  Multiply the Address field by 4
Multiplying by 4 requires no logic at all.
  We just have to catenate two zeros to the right of the Address field.
  We know that shifting a number to the left by two bit positions is multiplying it by four.

  PC ← (PC[31-28], Address) << 2
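The jump address formation can be sketched in Python as a concatenation followed by a shift. The function name is hypothetical, and the Address field is assumed to be 26 bits wide, which is what makes the shifted value fill the rightmost 28 bits of PC.

```python
def jump_target(pc, address26):
    """PC <- (PC[31-28], Address) << 2.

    address26 is assumed to be the 26-bit Address field of the J instruction.
    """
    upper4 = (pc >> 28) & 0xF                     # PC[31-28], left unchanged
    concatenated = (upper4 << 26) | (address26 & 0x3FFFFFF)
    return (concatenated << 2) & 0xFFFFFFFF
```

The result equals `(pc & 0xF0000000) | (address26 << 2)`, which is the "PC[27-0] ← (Address × 4)" form of the same microoperation.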


Completing the execution of J instruction

The portion of the state diagram for the J instruction (entered from the ID cycle for J):

  State 9 (EX):  PC ← (PC[31-28], Address) << 2     next: state 0

Note again that we complete the execution of J fast since this will help pipelining.
  As mentioned earlier, control instructions slow down the pipelined CPU.
  It is therefore critical to complete their execution as quickly as possible to reduce the negative effect of these instructions on the pipeline.


Completing the execution of J instruction

How do we handle conditional and unconditional stores on PC?
If PC has to be stored unconditionally, such as in state 0 and state 9, we use the control signal PCWrite to store on PC.
  The J instruction uses PCWrite.
The control unit generates another control signal, PCWriteCond, to conditionally store on PC.
  The BEQ instruction uses PCWriteCond.

[Figure: the Zero, PCWriteCond and PCWrite signals drive the store input of PC]


The complete state diagram

The high-level state diagram for integer instructions and the datapath are given in the EMY CPU handout.
They will be modified to implement a pipelined EMY CPU.
  But the overall CPU structure will be similar.



The state diagram for integer instructions has a branch out from state 1.
If the CPU receives an instruction that is not one of the nine (LW, SW, ADD, SUB, AND, OR, SLT, BEQ, J), it will generate an internal interrupt (an exception) since it does not know what to do.
  In order to generate the internal interrupt, it will go to state 10 to prepare the CPU for the interrupt.
  In state 10, it will perform a number of microoperations, including moving the internal interrupt handler address 80000180 to PC.

  State 1 (ID):  ALUout ← PC + (SignExt(Offset) << 2)    next: state 10 on an invalid instruction
  State 10:      Invalid instruction exception (invalid opcode, not one of the nine instructions)



The state diagram for integer instructions has a branch out from state 6.
If the CPU performs a 2’s complement addition or subtraction and there is an overflow, it will generate an internal interrupt (an exception) since the result is not correct.
  In order to generate the internal interrupt, it will go to state 11 to prepare the CPU for the interrupt.
  In state 11, it will perform a number of microoperations, including moving the internal interrupt handler address 80000180 to PC.

  State 6 (EX):  ALUout ← A op B    next: state 11 on a signed overflow
  State 11:      Arithmetic exception (signed overflow)
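Signed overflow for 32-bit addition and subtraction can be detected from the sign bits alone. The following Python sketch shows one common way to do it; the function names are hypothetical and nothing here is taken from the EMY handout's overflow circuit.

```python
MASK32 = 0xFFFFFFFF
SIGN = 0x80000000


def add_overflows(a, b):
    """True if the 32-bit two's-complement sum a + b overflows.

    Overflow occurs when both operands have the same sign and the
    sum has the opposite sign. a and b are 32-bit patterns.
    """
    s = (a + b) & MASK32
    return bool(~(a ^ b) & (a ^ s) & SIGN)


def sub_overflows(a, b):
    """True if the 32-bit two's-complement difference a - b overflows.

    Overflow occurs when the operands have different signs and the
    difference has a sign different from a.
    """
    d = (a - b) & MASK32
    return bool((a ^ b) & (a ^ d) & SIGN)
```

When either function returns True for an ADD or SUB, the CPU described above would leave state 6 for state 11 instead of state 7.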


CPIi of Integer Instructions

With this implementation, the CPIi of the instructions can be calculated as
  CPI_LW = 5 because we trace states 0, 1, 2, 3, 4
  CPI_SW = 4 because we trace states 0, 1, 2, 5
  CPI_A/L R-format = 4 because we trace states 0, 1, 6, 7
  CPI_BEQ = 3 because we trace states 0, 1, 8
  CPI_J = 3 because we trace states 0, 1, 9
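Since each state takes one clock period here, the CPIi of an instruction is simply the number of states it traces. A minimal Python sketch of that bookkeeping follows; the table name `STATE_TRACES` and the function `cpi` are hypothetical names used only for illustration.

```python
# States traced by each instruction class, as listed on the slide.
STATE_TRACES = {
    "LW": [0, 1, 2, 3, 4],
    "SW": [0, 1, 2, 5],
    "A/L": [0, 1, 6, 7],
    "BEQ": [0, 1, 8],
    "J": [0, 1, 9],
}


def cpi(instruction):
    """CPI of one instruction class when every state takes one clock period."""
    return len(STATE_TRACES[instruction])
```

For example, `cpi("LW")` counts the five states 0, 1, 2, 3, 4.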


Control Signals

The semantics of each state is that the microoperation to perform is determined by the control unit turning on and off a few control signals: MUX select, register store, ALU control and enable signals.
Control signals are connected to MUXes, registers and ALUs.
  They are shown as angled signals in the handout.
The EMY low-level state diagram describes which control signal is 1 when.

[Figure: register IR with its store control signal IRWrite]


The Clock Signal

The clock period duration is determined by the slowest but important microoperation in the CPU.
  All the signal delays in the datapath and control unit are added up to calculate the time for this important operation.
It is usually the integer add microoperation.
  Though it could be the memory access time if it was a little longer than the integer addition time.
  Usually, the memory is much slower than the CPU in commercial systems, but we will not consider it when we calculate the clock period duration since we will deal with slow memory when we cover the memory hierarchy topic.


The Clock Signal

Thus, we will assume the integer addition and the memory access take one clock period each!
  For today’s microprocessors this is not the case though!
If a microoperation takes more than one clock period, we draw a loop-back arrow to indicate so.
  For example, if the memory takes more than one clock period, there will be loop-back lines drawn for states 0, 3 and 5 to indicate that the CPU spends more than one clock period in each of them:

  State 0 (IF):       IR ← M[PC] ; PC ← PC + 4
  State 3 (MEM, LW):  MDR ← M[ALUout]
  State 5 (MEM, SW):  M[ALUout] ← B


Clock Signal

The clock period duration is determined by adding all the delays in the control unit and the delays in the datapath for the integer add microoperation.
The delays in the control unit include the delays to generate the MUX select, register clock input, ALU control and enable control signals.
  Gate networks generate these select and clock control signals if hardwiring is used.
  The micromemory and additional circuits generate these select and clock control signals if microprogramming is used.
The delays in the datapath include
  Delay of data travel from registers to the ALU inputs
  Delay of the adder in the ALU
  Delay of the data travel from the ALU to the destination register in the datapath: ALUout


Architecture-Microarchitecture Interaction

An example of how architectural decisions can affect the microarchitecture design is the following.
  The Rs and Rt fields of R-format and I-format instructions are in the same position.
  Therefore, we do not need to use separate read ports from the register file.
    We have one read port for Rs and one read port for Rt.

[Figure: instruction format field widths in bits. I-format: 6, 5, 5, 16. R-format: 6, 5, 5, 5, 5, 6.]


Using the state diagram

Consider the following piece of program in the EMY memory:

  ---
  400000    LW  R8, 0(R9)       ; R8 ← M[R9 + SignExt(0)] ; M[R9] has C
  400004    ADD R10, R8, R11    ; R10 ← R8 + R11
  400008    ADD R12, R13, R14   ; R12 ← R13 + R14
  40000C    SW  R12, 0(R15)     ; M[R15 + SignExt(0)] ← R12 ; M[R15] ← R12
  400010    BEQ R12, R0, 3      ; If R12 is equal to R0, branch to 400020
  ---
  10000150  C                   ; The content of this location is C
  ---
  1000A200  ?

Assume that R9 has 10000150 and R15 has 1000A200 initially.


Using the state diagram

This piece of program takes 20 clock periods, as the table below shows the execution of the program with respect to time.
  See the EMY CPU handout for timing.

                               IF   ID   EX   MEM  WB
  400000  LW  R8, 0(R9)         1    2    3    4    5
  400004  ADD R10, R8, R11      6    7    8    9
  400008  ADD R12, R13, R14    10   11   12   13
  40000C  SW  R12, 0(R15)      14   15   16   17
  400010  BEQ R12, R0, 3       18   19   20


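Using the state diagram

If the clock frequency is 1 GHz:

\[ \mathrm{CPI_{ave}} = \frac{\text{Number of clock cycles for the program}}{\text{Number of instructions run}} = \frac{20}{5} = 4 \]

\[ \text{Clock period} = \frac{1}{\text{Clock frequency}} = \frac{1}{10^{9}}\ \text{second} = 10^{-9}\ \text{second} = 1\ \text{ns} \]

\[ \text{CPUtime} = \text{Number of clock periods for program} \times \text{Clock period} = 20 \times 1\ \text{ns} = 20\ \text{ns} \]

\[ \mathrm{MIPS_{ave}} = \frac{\text{Number of instructions run}}{\text{CPUtime} \times 10^{6}} = \frac{5}{20 \times 10^{-9} \times 10^{6}} = \frac{5}{20 \times 10^{-3}} = 250 \]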
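The same performance figures can be reproduced with a few lines of Python. This is a sketch; the function name `performance` and its argument names are hypothetical.

```python
def performance(clock_periods, instructions, clock_hz):
    """Return (CPIave, CPU time in seconds, MIPSave) for one program run."""
    cpi_ave = clock_periods / instructions
    clock_period_s = 1 / clock_hz
    cpu_time_s = clock_periods * clock_period_s
    mips_ave = instructions / (cpu_time_s * 1e6)
    return cpi_ave, cpu_time_s, mips_ave


# The five-instruction program: 20 clock periods at 1 GHz.
cpi_ave, cpu_time_s, mips_ave = performance(20, 5, 1e9)
```

Up to floating-point rounding, this gives a CPIave of 4, a CPU time of 20 ns and a MIPSave of 250.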

We have covered the unpipelined CPU.
The remaining slides will be used when we cover the memory hierarchy topic.
So far we have assumed that
  The memory takes one clock period to access
  There is one solid memory
What if it took two clock periods or more?
What if there were instruction and data cache memories?
  Then we need to take a look at the execution timing in detail again.

If we assume that the instruction and data cache memories take one clock period each and there is no cache miss, the execution timing will be as before.
What if the two cache memories took two clock periods each?
  The execution timing will be identical to the clock doubling case studied in class.
  LW would take 7 clock periods since we trace states 0, 0, 1, 2, 3, 3, 4
    States 0 and 3 are repeated twice since the cache memories take two clock periods each
  SW would take 6 clock periods since we trace states 0, 0, 1, 2, 5, 5
    States 0 and 5 are repeated twice since the cache memories take two clock periods each
  ADD would take 5 clock periods since we trace states 0, 0, 1, 6, 7
    State 0 is repeated twice since the cache memory takes two clock periods
  BEQ would take 4 clock periods since we trace states 0, 0, 1, 8
    State 0 is repeated twice since the cache memory takes two clock periods

Using the state diagram

If the cache memories are slow (they take two clock periods per access) and there is no cache miss, then this piece of program will take 27 clock periods, as the table below shows the execution of the program with respect to time.
  See the EMY CPU handout for timing.

                               IF      ID   EX   MEM     WB
  400000  LW  R8, 0(R9)        1-2      3    4   5-6      7
  400004  ADD R10, R8, R11     8-9     10   11   12
  400008  ADD R12, R13, R14    13-14   15   16   17
  40000C  SW  R12, 0(R15)      18-19   20   21   22-23
  400010  BEQ R12, R0, 3       24-25   26   27
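Both program totals (one-clock and two-clock memory accesses, no misses) follow from the state traces if the memory-access states 0, 3 and 5 are charged the memory access time. A minimal Python sketch, with hypothetical names throughout:

```python
STATE_TRACES = {
    "LW": [0, 1, 2, 3, 4],
    "SW": [0, 1, 2, 5],
    "A/L": [0, 1, 6, 7],
    "BEQ": [0, 1, 8],
    "J": [0, 1, 9],
}
MEMORY_STATES = {0, 3, 5}   # states that access the memory (or a cache)


def clock_periods(instruction, mem_periods=1):
    """Clock periods for one instruction when a memory access takes mem_periods."""
    return sum(mem_periods if s in MEMORY_STATES else 1
               for s in STATE_TRACES[instruction])


def program_clock_periods(program, mem_periods=1):
    """Total clock periods for a list of instruction classes, with no cache misses."""
    return sum(clock_periods(i, mem_periods) for i in program)


PROGRAM = ["LW", "A/L", "A/L", "SW", "BEQ"]
```

`program_clock_periods(PROGRAM, 1)` reproduces the 20 clock periods of the original timing, and `program_clock_periods(PROGRAM, 2)` the 27 clock periods of the slow-cache case.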

We have so far assumed that the cache memories do not have misses!
What if both instruction and data cache memories result in cache misses?
  That is, there is a cold start!
What is the new execution time?
To calculate the new execution time we have to study the structure of the cache memories:
  The size of the physical (main) memory, the size of the cache memories, the size of cache blocks, the type of mapping (direct, associative, block-set associative), the block replacement strategy, etc.

What if both instruction and data cache memories result in cache misses?

For this semester
  We will concentrate on Level 1 cache memories, i.e. instruction and data cache memories
  We will assume that there is no Level 2 cache memory miss!
  We will indicate all physical addresses used
For this presentation assume that
  The physical (main) memory has 256 MBytes
  The physical memory has 4 Bytes per location
  The bus width between the physical memory and the lowest level cache is 4 Bytes
  The instruction cache is 8 KBytes
  The data cache is 16 KBytes
  Both cache block sizes are 32 Bytes
  Both cache memories use direct mapping
  Both caches use write-back with write-allocate
  Both cache memories access the needed item first
  The physical memory latency is 4 clock periods and transferring each 4-Byte content takes one clock period

Instruction and data cache misses ?

The physical memory has 256 MBytes or $2^{28}$ Bytes
  The physical address is 28 bits long
The physical memory has $2^{28}/32 = 2^{28}/2^{5} = 2^{23}$ blocks
The instruction cache has 8 KB/32 = $2^{13}/2^{5} = 2^{8}$ = 256 blocks
The data cache has 16 KB/32 = $2^{14}/2^{5} = 2^{9}$ = 512 blocks

Instruction and data cache misses ?

The physical address is used by the physical memory and instruction cache as follows:

  Address tag: 15 bits | Instruction cache block #: 8 bits | Byte offset: 5 bits
  Main memory block number: 23 bits (the address tag and the cache block # together)

The physical address is used by the physical memory and data cache as follows:

  Address tag: 14 bits | Data cache block #: 9 bits | Byte offset: 5 bits
  Main memory block number: 23 bits (the address tag and the cache block # together)
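The block counts and address field widths follow mechanically from the memory, cache and block sizes. The Python sketch below recomputes them; the function name `cache_fields` is hypothetical, and it assumes direct mapping and power-of-two sizes, as in this presentation.

```python
def cache_fields(address_bits, cache_bytes, block_bytes):
    """Return (blocks, tag_bits, block_number_bits, offset_bits).

    Assumes a direct-mapped cache and power-of-two sizes.
    """
    offset_bits = block_bytes.bit_length() - 1        # log2(block size)
    blocks = cache_bytes // block_bytes
    block_number_bits = blocks.bit_length() - 1       # log2(number of blocks)
    tag_bits = address_bits - block_number_bits - offset_bits
    return blocks, tag_bits, block_number_bits, offset_bits


instruction_cache = cache_fields(28, 8 * 1024, 32)    # (256, 15, 8, 5)
data_cache = cache_fields(28, 16 * 1024, 32)          # (512, 14, 9, 5)
main_memory_blocks = 2**28 // 32                      # 2**23 blocks
```

In both caches the tag and block number fields add up to the 23-bit main memory block number.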

Instruction and data cache misses ?

The instruction cache has 32-Byte blocks.
  Each block contains 8 instructions since each instruction is 4 Bytes long.
  Instructions in physical memory locations 100 through 110 are in one instruction cache block.

  Address    Content (4 bytes)
  00000100   LW  R8, 0(R9)
  00000104   ADD R10, R8, R11
  00000108   ADD R12, R13, R14
  0000010C   SW  R12, 0(R15)
  00000110   BEQ R12, R0, 3
  00000114   ?
  00000118   ?
  0000011C   ?

Instructions in 100, 104, 108, 10C, 110, 114, 118 and 11C are in one instruction cache block!
Which instruction cache block is this?

Instruction and data cache misses ?

The instruction cache has 32-Byte blocks.
  Instructions in physical memory locations 100 through 110 are in main memory block number 8 and in instruction cache memory block number 8.

  0000100  LW R8, 0(R9)

  Hex:     0    0    0    0    1    0    0
  Binary:  0000 0000 0000 0000 0001 0000 0000

  Address tag (15 bits):               000000000000000
  Instruction cache block # (8 bits):  00001000, which is 8 in decimal
  Byte offset (5 bits):                00000. The LW instruction has 0 offset from the beginning of the block, i.e. it is the first instruction of the block
  Main memory block number:            8

Instructions in 100, 104, 108, 10C, 110 are in instruction cache block 8!
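Splitting a physical address into its tag, cache block number and byte offset is a matter of shifting and masking. A minimal Python sketch, with a hypothetical function name and the field widths passed in as parameters:

```python
def split_address(address, block_number_bits, offset_bits=5):
    """Return (tag, cache block number, byte offset) for a direct-mapped cache.

    block_number_bits is 8 for the instruction cache and 9 for the data
    cache in this presentation; offset_bits is 5 for 32-Byte blocks.
    """
    offset = address & ((1 << offset_bits) - 1)
    block_number = (address >> offset_bits) & ((1 << block_number_bits) - 1)
    tag = address >> (offset_bits + block_number_bits)
    return tag, block_number, offset
```

For the LW instruction at physical address 100 (hex), `split_address(0x100, 8)` gives tag 0, instruction cache block 8 and byte offset 0.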

Instruction and data cache misses ?

How long does it take to access individual instructions?
  Both cache memories access the needed item first.
  The physical memory latency is 4 clock periods and transferring each 4-Byte content takes one clock period.

After the access starts, the latency elapses and then one 4-Byte content is transferred per clock period. M[100] is the needed item and is accessed and transferred first!

  Address    Content              Available after
  00000100   LW  R8, 0(R9)        Five clock periods
  00000104   ADD R10, R8, R11     Six clock periods
  00000108   ADD R12, R13, R14    Seven clock periods
  0000010C   SW  R12, 0(R15)      Eight clock periods
  00000110   BEQ R12, R0, 3       Nine clock periods
  00000114   ?                    Ten clock periods
  00000118   ?                    Eleven clock periods
  0000011C   ?                    Twelve clock periods

Block fill time = 12 clock periods
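The arrival time of each 4-Byte content during a block fill can be computed directly. The Python sketch below assumes, in line with the stated assumptions, that the needed item is transferred first and that the remaining items follow in address order, wrapping around to the start of the block; the function name `fill_schedule` is hypothetical.

```python
def fill_schedule(needed_address, latency=4, block_bytes=32, word_bytes=4):
    """Map each word address of the block to the clock period it arrives in.

    The needed word arrives first, after the memory latency; the others
    follow one per clock period, wrapping around within the block
    (the wrap-around order is an assumption of this sketch).
    """
    base = needed_address - needed_address % block_bytes
    words = block_bytes // word_bytes
    first = (needed_address - base) // word_bytes
    return {
        base + ((first + i) % words) * word_bytes: latency + 1 + i
        for i in range(words)
    }
```

For a miss on address 100 (hex), the schedule runs from five clock periods for M[100] to twelve clock periods for M[11C], matching the block fill time of 12 clock periods.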

Instruction and data cache misses ?

The data cache has 32-Byte blocks.
  Each block contains 8 data elements since each data element is 4 Bytes long.
  The data element in physical memory location 1150 is in one data cache block.

  Address    Content (4 bytes)
  00001140
  00001144
  00001148
  0000114C
  00001150   C
  00001154
  00001158
  0000115C

Data elements in 1140, 1144, 1148, 114C, 1150, 1154, 1158 and 115C are in one data cache block!
Which data cache block is this?


The data element in physical memory location 1150 is in main memory block number 138 and in data cache block number 138.

  0001150  C

  Hex:     0    0    0    1    1    5    0
  Binary:  0000 0000 0000 0001 0001 0101 0000

  Address tag (14 bits):        00000000000000
  Data cache block # (9 bits):  010001010, which is 138 in decimal
  Byte offset (5 bits):         10000. The data element has a 16-Byte offset from the beginning of the block, i.e. it is the fifth data element of the block

The data element in 1150 is in data cache block 138!

Instruction and data cache misses ?

How long does it take to access individual data elements?

After the access starts, the latency elapses and then one 4-Byte content is transferred per clock period. M[1150] is the needed item and is accessed and transferred first! The transfer then continues with M[1154], M[1158], M[115C], M[1140], M[1144], M[1148] and M[114C].

  Address    Content   Available after
  00001140             Nine clock periods
  00001144             Ten clock periods
  00001148             Eleven clock periods
  0000114C             Twelve clock periods
  00001150   C         Five clock periods
  00001154             Six clock periods
  00001158             Seven clock periods
  0000115C             Eight clock periods


Each block contains 8 data elements since each data element is 4 Bytes long.
The data element in physical memory location 2200 is in one data cache block.

  Address    Content (4 bytes)
  00002200   ?
  00002204
  00002208
  0000220C
  00002210
  00002214
  00002218
  0000221C

Data elements in 2200, 2204, 2208, 220C, 2210, 2214, 2218 and 221C are in one data cache block!
Which data cache block is this?


The data element in physical memory location 2200 is in main memory block number 272 and in data cache block number 272.

  0002200  ?

  Hex:     0    0    0    2    2    0    0
  Binary:  0000 0000 0000 0010 0010 0000 0000

  Address tag (14 bits):        00000000000000
  Data cache block # (9 bits):  100010000, which is 272 in decimal
  Byte offset (5 bits):         00000. The data element has 0 offset from the beginning of the block, i.e. it is the first data element of the block

The data element in 2200 is in data cache block 272!

Instruction and data cache misses ?

How long does it take to access individual data elements?

After the access starts, the latency elapses and then one 4-Byte content is transferred per clock period. M[2200] is the needed item and is accessed and transferred first!

  Address    Content   Available after
  00002200   ?         Five clock periods
  00002204             Six clock periods
  00002208             Seven clock periods
  0000220C             Eight clock periods
  00002210             Nine clock periods
  00002214             Ten clock periods
  00002218             Eleven clock periods
  0000221C             Twelve clock periods

Instruction and data cache misses ?

How long does it take to run the program with a cold start?
This piece of program will take 32 clock periods, as the table below shows the execution of the program with respect to time.
  See the EMY CPU handout for timing.

                               IF     ID   EX   MEM     WB
  400000  LW  R8, 0(R9)        1-5     6    7   8-12    13
  400004  ADD R10, R8, R11     14     15   16   17
  400008  ADD R12, R13, R14    18     19   20   21
  40000C  SW  R12, 0(R15)      22     23   24   25-29
  400010  BEQ R12, R0, 3       30     31   32
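The cold-start total can be reproduced with a small simulation. The Python sketch below makes several simplifying assumptions that are not spelled out in the slides: a miss costs 5 clock periods (latency 4 plus one transfer, since the needed item arrives first), a hit costs 1 clock period, the rest of the block fill never delays a later access, and each cache is modeled as a set of resident main memory block numbers (which ignores direct-mapped conflicts, none of which occur in this program). All names are hypothetical.

```python
STATE_TRACES = {
    "LW": [0, 1, 2, 3, 4],
    "SW": [0, 1, 2, 5],
    "A/L": [0, 1, 6, 7],
    "BEQ": [0, 1, 8],
}
BLOCK_BYTES = 32


def cold_start_clock_periods(program, miss_periods=5, hit_periods=1):
    """Total clock periods when both caches start empty.

    program is a list of (kind, instruction address, data address or None),
    using physical addresses.
    """
    icache_blocks, dcache_blocks = set(), set()
    total = 0
    for kind, instr_addr, data_addr in program:
        for state in STATE_TRACES[kind]:
            if state == 0:                      # instruction fetch
                cache, addr = icache_blocks, instr_addr
            elif state in (3, 5):               # data read (LW) or write (SW)
                cache, addr = dcache_blocks, data_addr
            else:                               # no memory access in this state
                total += 1
                continue
            block = addr // BLOCK_BYTES
            total += hit_periods if block in cache else miss_periods
            cache.add(block)                    # write-allocate on SW misses too
    return total


PROGRAM = [
    ("LW", 0x100, 0x1150),
    ("A/L", 0x104, None),
    ("A/L", 0x108, None),
    ("SW", 0x10C, 0x2200),
    ("BEQ", 0x110, None),
]
```

Under these assumptions `cold_start_clock_periods(PROGRAM)` gives 32: the three misses (the first instruction fetch, the LW data read and the SW data write) each add four clock periods to the original 20.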

Unpipelined EMY CPU is complete

We studied how the architecture affects the organization.
We designed the EMY CPU for nine integer instructions.
We considered how cache memories affect the unpipelined EMY CPU execution.
Another PowerPoint presentation will cover
  The pipelined EMY CPU
  How cache memories affect the pipelined EMY CPU execution
