Computer Architecture and Organization CS 2214 Version 0 Unpipelined EMY CPU Haldun Hadimioglu...
Computer Architecture and Organization
CS 2214
Version 0
Unpipelined EMY CPU
Haldun Hadimioglu
Computer Science & Engineering
Spring 2014
Haldun Hadimioglu, CSE, Spring 2014 | EMY CPU Version 0 | CS 2214
Outline
Introduction
Version 0 EMY CPU: Unpipelined EMY CPU
  It executes only integer instructions
  How a memory hierarchy can be attached to the unpipelined CPU is also studied
Handout to use: EMY CPU
Introduction
On the microarchitecture layer, a computer is a collection of at least three interconnected digital systems
  A central processing unit (CPU)
  A (main) memory
  An I/O controller to control an I/O device, such as the disk
    There can be several I/O controllers to control different I/O devices
[Figure: the CPU, the memory and an I/O controller (with its disk) attached to an interconnection system]
Digital Systems
A digital system performs microoperations
It consists of a datapath (data unit) and a control unit
  The datapath actually performs the microoperations
  The control unit determines which microoperation happens when
[Figure: a datapath (registers, ALUs, buses) and a control unit (sequencer), connected by status signals and control signals]
Digital Systems
The datapath (data unit) has registers, ALUs and buses to perform the microoperations
  Registers keep information temporarily
  ALUs perform arithmetic/logic operations
  Buses interconnect the registers and ALUs
  Other components used include multiplexers (MUXes), decoders, encoders, comparators, counters, etc.
Digital Systems
The control unit has a sequencer circuit that determines the sequence of microoperations
  The sequencer needs status signals from the data unit to know what is happening there
  Then, it determines which microoperations are to be performed and indicates them to the datapath by means of control signals
Designing Digital Systems
Datapath design is simpler than control unit design since the datapath has highly regular (duplicated) circuits
  A 32-bit adder is composed of two identical 16-bit adders
  A 32-bit comparator consists of four identical 8-bit comparators, etc.
Control unit design is more difficult due to
  Large amounts of random logic
  A lot of effort is needed to make sure there are no timing problems
    Microoperations must start at the right time and end at the right time!
Designing Digital Systems
We will use the finite-state machine (FSM) technique to design the EMY CPU, where the FSM state diagram will have states with microoperations
  The state diagram shows precisely which state follows which state
  Each state indicates which microoperations to perform
  The state diagram shows which states are needed, and when, for each machine language instruction
Designing Digital Systems
We will design the EMY CPU by using the finite-state machine (FSM) technique
More specifically, we will obtain the following for the complete EMY CPU design
  A high-level state diagram to show which microoperation happens when
  The datapath from the high-level state diagram
  The low-level state diagram from the high-level state diagram and the datapath
  The control unit from the low-level state diagram
    It can be implemented by hardwiring and/or microprogramming
Designing the microarchitecture level of a computer
There are two tasks in this design
  Develop the CPU and memory digital systems so that instructions can be run
  Develop the memory and I/O controller digital systems so that I/O can happen
We will concentrate on the CPU and memory digital systems
Designing the CPU and memory digital systems
First we focus on the CPU digital system while we quickly make a few design decisions on the memory
  We will design the CPU as a slow CPU running only integer instructions: no pipelining
    This is Version 0
    We will assume the memory is fast, which is not realistic today
    Then, we will see how a memory hierarchy with cache memories, etc. can be incorporated
  Then, we will improve the CPU speed by using pipelining, but still running integer instructions
    This is Version 1
    We will assume the memory is fast, which is not realistic today
    Then, we will see how a memory hierarchy with cache memories, etc. can be incorporated
    This CPU coverage will be in another PowerPoint presentation
  For both versions the memory will be a black box with a few details
Designing the CPU as a Digital System
The EMY CPU digital system
  We will concentrate on designing the EMY CPU for nine integer instructions in the beginning
    High-level state diagram of the EMY CPU
    Datapath of the CPU
    Low-level state diagram of the CPU
    Control unit of the CPU
Designing the CPU Digital System
To design the EMY CPU, we will start with the EMY architecture
What is the connection between the architecture and the CPU?
  A computer processes digital information by running machine language instructions
  A program is a list of instructions, each of which specifies operations on data (arguments)
  An instruction specifies architectural operations
  Each architectural operation is implemented by microoperations
Designing the CPU Digital System
In order to perform an architectural operation, the CPU performs a series of microoperations in a number of clock periods
  That is, an architectural operation is broken down into smaller operations called microoperations
  That is, to run a machine language instruction, the CPU performs microoperations
  The CPU performs some microoperations alone and some in cooperation with the memory and the I/O controllers
Designing the CPU Digital System
Architectural operations
  An architectural operation is what we describe as the semantics of the instruction, such as
    The architectural operation specified by the ADD instruction: Rd ← Rs + Rt
    The architectural operation specified by the SUB instruction: Rd ← Rs - Rt
    The architectural operation specified by the SLT instruction: If Rs < Rt then Rd ← 1 else Rd ← 0
    The architectural operation specified by the J instruction: PC[27-0] ← (Address * 4)
  It is the CPU that contributes the most to the execution of an instruction since it performs most of the microoperations needed for an architectural operation
Designing the CPU Digital System
Typical CPU digital system microoperations
  Add, subtract, multiply
    In the past, a 32-bit addition was completed in 1 clock period. Today, a 32-bit addition is completed in several clock periods
  AND, OR, XOR
  Shift right, shift left
  Read data from memory, write data to memory
    In the past, a memory access was completed in 1 clock period. Today, it is completed in several clock periods
  Read instructions from memory (fetch)
  Increment the program counter
  Transfer a register to another register
  …
Designing the CPU as a Digital System
Other machines, especially CISC machines, require other microoperations such as
  Reading indirect address(es) from the memory
  Effective address calculation for
    Indexing
    Autoincrement
    Autodecrement
  Alignment for
    Instructions
    Data
    Addresses
Designing the CPU Digital System
Architecture’s effect on microoperations
  The decisions made on architecture determine the microoperations needed for the execution of the instructions
    General microoperations found on most CPUs
      The ones mentioned on previous slides
    Specific microoperations for certain CPUs
      Specific microoperations for Memory Management Units (MMUs), caches, I/O controllers
  The architecture also determines the characteristics of each microoperation
    If the 26-bit PC-direct addressing mode is used, the rightmost 26 bits of IR are catenated to the leftmost 4 bits of PC and the resulting 30 bits are shifted to the left by 2
  Thus, each machine language instruction requires a certain number of microoperations taking a certain time: the CPIi
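The 26-bit PC-direct address formation described above can be sketched in a few lines of Python. This is only an illustration: the function name `jump_target` is hypothetical, and the sketch assumes that "leftmost" and "rightmost" refer to the most and least significant bits of 32-bit PC and IR values.

```python
def jump_target(pc: int, ir: int) -> int:
    """Form a 32-bit jump target from PC and the J-format instruction in IR."""
    upper4 = (pc >> 28) & 0xF        # leftmost 4 bits of PC
    offset26 = ir & 0x03FF_FFFF      # rightmost 26 bits of IR
    # Catenate to a 30-bit value, then shift left by 2 to get a byte address.
    return ((upper4 << 26) | offset26) << 2


print(hex(jump_target(0x4000_0008, 0x0800_0010)))  # prints 0x40000040
```

The leftmost 4 bits of the result always equal the leftmost 4 bits of PC, which matches the architectural description that only PC[27-0] is replaced.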
Designing the CPU Digital System
Microoperations
  The CPU can perform one or more microoperations per clock period, depending on the complexity of the microoperation and the availability of the hardware resources
  Most often a microoperation can be completed in one clock period unless it is a complex microoperation
    If a complex microoperation is to be run in one clock period, the clock period needs to be longer
  The more numerous and complex the microoperations are, the longer it takes to run the machine language instruction
    CISC instructions take a longer time to execute (larger CPIi)
Designing the CPU Digital System
Calculating CPIi
  The time it takes to run an instruction, measured in clock periods (CPIi), is then determined by
    The number of microoperations needed for it
    The complexity of the microoperations
  The number of clock periods for an instruction, CPIi, becomes a matter of figuring out the microoperations and distributing them to individual clock periods
    One can come up with 5-10 simple microoperations to be performed one after another, resulting in a CPIi of 5-10
      But, since microoperations are simple, the clock period is short
    Alternatively, one can come up with 2-4 complex microoperations, resulting in a CPIi of 2-4
      But, the clock period is longer
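The trade-off can be made concrete with a small calculation: the execution time of one instruction is its CPIi multiplied by the clock period. The numbers below are made up purely for illustration (they are not measurements of the EMY CPU or any real machine), and the function name is hypothetical.

```python
def instruction_time_ns(cpi: int, clock_period_ns: float) -> float:
    """Time to run one instruction = number of clock periods x clock period."""
    return cpi * clock_period_ns


# Hypothetical design A: many simple microoperations, short clock period.
many_simple = instruction_time_ns(cpi=8, clock_period_ns=0.5)
# Hypothetical design B: few complex microoperations, long clock period.
few_complex = instruction_time_ns(cpi=3, clock_period_ns=1.5)

print(many_simple, few_complex)  # prints 4.0 4.5
```

With these assumed values the two designs take a similar time per instruction even though their CPIi values differ considerably, which is why CPIi alone does not decide the question.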
Designing the CPU Digital System
Calculating CPIi
  What can we do? Few long clock periods vs. many but shorter clock periods?
    Since increasing the clock frequency is important for marketing purposes, the second option would weigh in substantially
    It turns out that if pipelining is implemented, having many shorter clock periods is beneficial, as we will see
      CPIi figures will be large but CPIave will be close to 1 (one)!
    Today’s microprocessors have instruction CPIi values in the range of 10-30, but CPIave figures for their targeted applications are even less than 1 (one)!
      Because they employ advanced pipelining techniques, such as superscalar execution, hyperthreading, etc.
Designing the CPU Digital System
Determining microoperations for a machine language instruction
  Some microoperations are performed for all the instructions
    Usually at the same point in time during the execution of every instruction
    Fetching the instruction is always the first microoperation to perform for all CPUs
    Updating PC (PC ← PC + 4) so that it points at the next instruction is also universal
  The other microoperations depend on the instruction, the addressing mode, where the arguments are, the length of the arguments, etc.
Designing the CPU Digital System
Determining microoperations for a machine language instruction
  We would list all the microoperations for each instruction, making sure that we are consistent in terms of
    Bus usage
      We often decide on an approximate number of buses we need for our datapath
      Today’s CPUs have at least three internal buses to complete an integer arithmetic microoperation in one clock period
        Two buses carry the numbers from two registers and the third bus carries the result to a register
    ALU usage
      An ALU is expensive and so we try to limit the number of them
Designing the CPU Digital System
Determining microoperations for a machine language instruction
  We would list all the microoperations for each instruction, making sure that we are consistent in terms of
    Register usage
      Additional registers not visible to the architecture level are used to keep temporary values: microarchitecture registers
      Typically, the more registers are used, the more clock periods we spend for an instruction, since temporary values will be passed from one register in one clock period to another register to be used in the following clock period
      But, sometimes we have to use microarchitecture registers, such as the instruction register that keeps the current instruction
    Control unit usage
Designing the CPU Digital System
Determine how each EMY architectural operation is implemented by microoperations
  Most microoperations must be simple enough to be completed within one clock period
  A few microoperations may not be completed in a clock period
    For example, a memory read may take several clock periods since the memory is slower
    These long microoperations should be accommodated in the high-level state diagram, the datapath, the low-level state diagram and the control unit
  We will assume in the beginning that every microoperation is completed in one clock period
Designing the CPU Digital System
The EMY microoperations implied by the EMY machine language instructions include
  Instruction fetch, performed always
  Update PC for next instruction, performed always
  Effective address calculation for displacement and relative addressing modes
  Sign extension or catenation of 0s for data/addresses
  Reading data from the memory
  Writing data to the memory
  Performing an arithmetic/logic operation
  Register transfer
  Testing a condition
Unpipelined EMY CPU: Version 0
By using the EMY CPU Handout
  The most interesting component of a computer is the CPU
  We know that the CPU has registers, buses, ALUs and a sequencer, among others
  Note that whether hardwiring or microprogramming is used, the datapath stays the same, at least theoretically
    The datapath performs microoperations on data
      It uses registers, buses and the ALU for that purpose
    The microoperations are in turn controlled by the control unit.
Overview
We are now ready for the organizational design of the EMY
  We know the architecture of EMY
We will design the EMY CPU that will have
  A control unit with a sequencer
  A datapath containing registers, buses and the ALU
The datapath performs the microoperations and the control unit determines the timing and sequence of these microoperations
Overview
The way the EMY computer is covered indicates that the authors organized the computer similarly to the commercial EMY systems, where
  There is an integer EMY CPU
  A system control coprocessor (CP0) responsible for memory management and cache control
  An FP coprocessor (CP1)
The integer EMY CPU registers are either architectural or microarchitectural (temporary registers)
There are two other coprocessors, CP2 and CP3, that are reserved for future use
Overview
Designing the EMY CPU for all of the instructions is prohibitive
First, we will design an EMY CPU to execute only integer instructions that include
  LW, SW
  ADD, SUB, SLT, AND, OR
  BEQ, J
These integer instructions use the three formats: R, I and J
Overview
The EMY CPU will have all the architectural registers needed by these nine integer instructions
  32 32-bit GPRs
  32-bit PC
New Microarchitectural registers
These (temporary) registers are not a part of the state (hence architecture)
  32-bit instruction register, IR, to keep the current instruction
    IR contains the instruction until it is completely executed
  32-bit A and B registers
    They keep the content of Rs and Rt registers of the current instruction
  32-bit register ALUout
    It contains a memory address or A/L operation result
  32-bit Memory Data Register, MDR
    It keeps the data read from the memory for Load instructions
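New Microarchitectural registers
32-bit A and B registers
  I format: Opcode (6) | Rs (5) | Rt (5) | Displacement/Offset/Immediate (16)
  R format: Opcode (6) | Rs (5) | Rt (5) | Rd (5) | Shamt (5) | Function (6)
  In both formats, the Rs field selects the GPR sent to register A and the Rt field selects the GPR sent to register B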
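The field layout can be illustrated with a small field extractor. This is a sketch under an assumption that the slides do not state explicitly: the fields are packed from the most significant bit downward in the order listed (Opcode in bits 31-26, Rs in 25-21, Rt in 20-16, and so on). The function name `decode_fields` is hypothetical.

```python
def decode_fields(ir: int) -> dict:
    """Split a 32-bit instruction word into its I-, R- and J-format fields."""
    return {
        "opcode":   (ir >> 26) & 0x3F,   # 6 bits, all formats
        "rs":       (ir >> 21) & 0x1F,   # 5 bits, selects the GPR sent to A
        "rt":       (ir >> 16) & 0x1F,   # 5 bits, selects the GPR sent to B
        "rd":       (ir >> 11) & 0x1F,   # 5 bits, R format only
        "shamt":    (ir >> 6) & 0x1F,    # 5 bits, R format only
        "function": ir & 0x3F,           # 6 bits, R format only
        "doimm":    ir & 0xFFFF,         # 16 bits, I format only
        "offset26": ir & 0x03FF_FFFF,    # 26 bits, J format only
    }


# Build a word with Rs = 1, Rt = 2, Rd = 3 and read the fields back.
word = (1 << 21) | (2 << 16) | (3 << 11)
fields = decode_fields(word)
print(fields["rs"], fields["rt"], fields["rd"])  # prints 1 2 3
```

All fields are extracted unconditionally, in the same spirit as the hardware: which ones are meaningful depends on the instruction format.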
New Microarchitectural registers
Even if an instruction does not have Rs and Rt fields, such as a J-format instruction, the bits in the Rs and Rt field positions are used to move GPR content to A and B, respectively
  The values of A and B registers will not be used!
  The reason for moving to A and B is to make the common case fast, where we think most instructions are R-format or I-format and require this move!
  J format: Opcode (6) | Offset26 (26)
  For a Jump, the two 5-bit groups of Offset26 that fall in the Rs and Rt positions still drive register A and register B
New Microarchitectural registers
Note that the Displacement used for loads and stores is signed
The offset of BEQ is also signed
We have to sign extend the 16-bit Displacement, Offset and Immediate (DOImm) value for some of the integer instructions
  These include LW, SW, BEQ
  We will use DOImm+ to indicate a sign-extended value from now on
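Sign extension of the 16-bit DOImm field to 32 bits can be sketched as follows. The function name `sign_extend16` is hypothetical; the result is returned as an unsigned 32-bit bit pattern, which is one reasonable way to model a 32-bit register in Python.

```python
def sign_extend16(doimm: int) -> int:
    """Return the 32-bit pattern DOImm+ for a 16-bit field value."""
    doimm &= 0xFFFF                 # keep only the 16 field bits
    if doimm & 0x8000:              # sign bit set: the value is negative
        doimm |= 0xFFFF_0000        # copy the sign bit into the upper 16 bits
    return doimm


print(hex(sign_extend16(0x0004)))  # prints 0x4
print(hex(sign_extend16(0xFFFC)))  # prints 0xfffffffc
```

The second example is the 16-bit encoding of -4, which stays -4 when interpreted as a 32-bit two's complement number.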
The EMY CPU state diagram
The design of a CPU is very complex
  We have to consider the space (hardware) and time (speed)
  The design, analysis, description, testing, modification, optimization, servicing and maintenance can be more efficient if there are efficient tools around
    These include HDLs and CAD tools
    The textbook uses a typical register transfer language (RTL) notation in Appendix A to describe the execution of instructions
    We will use the same RTL notation, which is also used in the handout
  To quickly see the execution steps of the integer machine language instructions, a high-level state diagram, a CPU datapath and a low-level state diagram are developed in the handout
    Additionally, timing diagrams and tables need to be studied to understand the CPU design
The EMY CPU state diagram
An instruction goes through several phases when executed
We give a name to each phase of an instruction execution
  A phase is also called a major cycle
    Each major cycle will take one or more minor cycles (clock periods)
      Each minor cycle is a state
      Each minor cycle typically takes one clock period
    Each major cycle often has at least one microoperation
      Often the name of a major cycle is derived from the major microoperation of the cycle
The EMY CPU state diagram
The number of major cycles and their complexity are small for RISC systems and larger for CISC systems
  Often for RISC systems, the CPIi for the most frequently used instructions is between 4 and 6
    However, this number has to be larger to have deep pipelining and high clock frequencies
In simple systems like RISC systems, sharing of hardware among different major cycles is not necessary
  A hardware resource is often needed in one major cycle only
  The hardware for each major cycle can then be easily identified and is often named a stage
  So, the execution of an instruction is the movement of the instruction through some or all of the stages of the CPU!
The EMY CPU state diagram
The EMY integer instructions go through at most five major cycles during the execution
  However, even for this RISC machine, it is difficult to come up with five cycle names because not all instructions do similar things in a major cycle
  Some microoperations will be performed in advance in anticipation of a frequent operation
    The early operations will not alter the state and will not cause longer clock periods, but will slightly increase the hardware
The EMY CPU state diagram
The EMY CPU major cycles for integer instructions
  Instruction fetch cycle
    Abbreviated as IF, standing for instruction fetch
    Same for all EMY instructions.
  Instruction decode/Register fetch cycle
    Abbreviated as ID, standing for instruction decode
    Same for all EMY instructions.
  Execution/effective address cycle
    Abbreviated as EX, standing for execution
  Memory access cycle
    Abbreviated as MEM, standing for memory
  Write-back cycle
    Abbreviated as WB, standing for write-back
The EMY CPU state diagram
Emphasizing again that designing a CPU is determining which microoperation happens when for each architectural operation (the semantics of the instruction)
  For the EMY, like many other CPUs, the IF and ID stages are identical for all instructions
    The same microoperations are performed for all instructions
    These microoperations implement portions of the architectural operation
  For the EMY, the remaining portions of the architectural operation are performed in the EX, MEM and WB cycles
The EMY CPU state diagram
Architectural operations (≡ semantics) of I-format instructions among the integer instructions
  Load/Store instructions
    LW Rt, Disp(Rs): Rt ← M[Rs + Disp+]
    SW Rt, Disp(Rs): M[Rs + Disp+] ← Rt
    The + suffix (a superscript on the slide) indicates sign extension
  I format: Opcode (6) | Rs (5) | Rt (5) | Displacement/Offset/Immediate (16)
The EMY CPU state diagram
Architectural operations of I-format instructions among the integer instructions
  Branch instruction
    BEQ Rs, Rt, Offset: If Rs = Rt, then PC ← PC + (Offset+ x 4)
  I format: Opcode (6) | Rs (5) | Rt (5) | Displacement/Offset/Immediate (16)
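The BEQ semantics can be sketched directly from the architectural operation. The function name `beq_next_pc` is hypothetical, and the sketch takes the PC value on the right-hand side as given: whether that is the address of the BEQ itself or the already incremented PC is not stated in this part of the slides.

```python
def beq_next_pc(pc: int, rs_value: int, rt_value: int, offset16: int) -> int:
    """Return the PC after BEQ: PC + (Offset+ x 4) if Rs = Rt, else PC unchanged."""
    offset = offset16 & 0xFFFF
    if offset & 0x8000:               # sign extend the 16-bit Offset field
        offset -= 0x1_0000
    if rs_value == rt_value:          # test the branch condition
        return (pc + offset * 4) & 0xFFFF_FFFF
    return pc


print(hex(beq_next_pc(0x100, 7, 7, 3)))       # taken, forward: prints 0x10c
print(hex(beq_next_pc(0x100, 5, 5, 0xFFFF)))  # taken, backward: prints 0xfc
print(hex(beq_next_pc(0x100, 5, 6, 3)))       # not taken: prints 0x100
```

Multiplying by 4 converts the offset from a count of instructions to a count of bytes, since each instruction is four bytes long.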
The EMY CPU state diagram
Architectural operations of R-format instructions among the integer instructions
  Arithmetic/Logic instructions
    ADD Rd, Rs, Rt: Rd ← Rs + Rt
    SUB Rd, Rs, Rt: Rd ← Rs - Rt
    AND Rd, Rs, Rt: Rd ← Rs & Rt
    OR Rd, Rs, Rt: Rd ← Rs | Rt
    SLT Rd, Rs, Rt: If Rs < Rt then Rd ← 1 else Rd ← 0
  R format: Opcode (6) | Rs (5) | Rt (5) | Rd (5) | Shamt (5) | Function (6)
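The five R-format architectural operations can be modeled on 32-bit values as below. The names `execute_r_format` and `to_signed` are hypothetical. Two assumptions are made that the slides do not spell out here: results wrap around to 32 bits, and SLT compares its operands as signed two's complement numbers.

```python
MASK32 = 0xFFFF_FFFF


def to_signed(value: int) -> int:
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x8000_0000 else value


def execute_r_format(mnemonic: str, rs_value: int, rt_value: int) -> int:
    """Return the 32-bit value written to Rd for one R-format instruction."""
    if mnemonic == "ADD":
        result = rs_value + rt_value
    elif mnemonic == "SUB":
        result = rs_value - rt_value
    elif mnemonic == "AND":
        result = rs_value & rt_value
    elif mnemonic == "OR":
        result = rs_value | rt_value
    elif mnemonic == "SLT":
        # Assumed signed comparison.
        result = 1 if to_signed(rs_value) < to_signed(rt_value) else 0
    else:
        raise ValueError(f"not one of the five R-format instructions: {mnemonic}")
    return result & MASK32            # keep the result to 32 bits


print(hex(execute_r_format("SUB", 3, 5)))        # prints 0xfffffffe
print(execute_r_format("SLT", 0xFFFF_FFFF, 1))   # prints 1, since -1 < 1
```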
The EMY CPU state diagram
Architectural operations of J-format instructions among the integer instructions
  Jump instruction
    PC[27-0] ← (Address x 4)
  J format: Opcode (6) | Offset26 (26), where the Rs and Rt field positions (5 bits each) fall inside Offset26
The EMY CPU state diagram
The major cycles of the EMY CPU are shown by the high-level state diagram given in the EMY CPU handout
  Registers A and B are used to prepare operands for an ALU operation
  Each state takes 1 clock period
    Later, we will change it to one or more clock periods
      Memory accesses and complex arithmetic operations can take more than one clock period to perform
      The state that has a memory access or a complex arithmetic operation will take more than one clock period
  All microoperations mentioned in a state are performed in parallel, so their order does not matter
    If a state takes more than one clock period, one has to be careful about the parallel operations
We now obtain the state diagram and the datapath hardware of the EMY CPU
The EMY major cycles and states
The instruction fetch cycle
  It is performed for all the instructions
  There are two microoperations performed
  In general, all CPUs, regardless of their architecture, do these two microoperations
    Read the machine language instruction pointed to by the program counter (PC) into the instruction register (IR)
    Update the program counter so that it points at the instruction that follows the instruction being read from the memory
The EMY major cycles and states
The instruction fetch cycle
  Read the machine language instruction pointed to by the program counter (PC) into the instruction register (IR)
    IR ← M[PC]
    Note the RTL notation: we use an equal sign (=) if the destination is a wire or a bus and an arrow sign (←) if the destination is a register, such as IR
The EMY major cycles and states
The instruction fetch cycle
  Read the machine language instruction pointed to by the program counter (PC) into the instruction register (IR)
    IR ← M[PC]
    Then, the read of the instruction in terms of buses is as follows:
      MABUS = PC ; MemRead = 1 ; IR ← MRBUS
    Note again that the three microoperations implement the instruction read; they happen at the same time and their order does not matter
    Note the RTL notation: we use an equal sign (=) if the destination is a wire or a bus, such as MABUS, and an arrow sign (←) if the destination is a register, such as IR
The EMY major cycles and states
The instruction fetch cycle
  Update the program counter so that it points at the next instruction
    PC ← PC + 4
    Since an instruction is four bytes long, we need to add 4 to PC
    We will use the general ALU to do the addition, at the expense of increasing the complexity of the ALU input logic
The instruction fetch cycle
The two microoperations of the IF cycle can be shown in state 0 as follows
  State 0 (IF): IR ← M[PC] ; PC ← PC + 4 ;
  The two microoperations are simply shown without using buses to save space
  The instruction read and PC update microoperations happen simultaneously and complete before the end of the clock period
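The simultaneous nature of the two state 0 microoperations can be sketched in Python by computing both new register values from the PC value held at the start of the clock period. The function name `state0_if` is hypothetical, and the memory is modeled, as an assumption for illustration only, as a dictionary from byte address to 32-bit instruction word.

```python
def state0_if(pc: int, memory: dict) -> tuple:
    """Model state 0 (IF): IR <- M[PC] ; PC <- PC + 4, both in one clock period."""
    # Both right-hand sides use the PC value from the start of the clock period,
    # so the order in which the two results are computed does not matter.
    new_ir = memory[pc]
    new_pc = (pc + 4) & 0xFFFF_FFFF
    return new_ir, new_pc


memory = {8: 0x1234_5678}             # one instruction word at byte address 8
ir, pc = state0_if(8, memory)
print(hex(ir), pc)                    # prints 0x12345678 12
```

Returning both values together, rather than updating PC before reading memory, is what keeps the sketch faithful to the parallel register-transfer semantics.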
The EMY major cycles and states
The instruction decode cycle
  The most important goal in this cycle is to decode the instruction
    Decoding the instruction means the CPU determines what the current instruction is
    It is performed for all the instructions, by all CPUs regardless of their architecture
    Decoding is done by the control unit, which checks the opcode and function bits of IR
      They are input as status signals to the control unit
  During this time the datapath does not do anything
The instruction decode cycle
During the Decode cycle the control unit determines what the next state will be, based on the type of the instruction
  If it is a memory reference instruction (LW, SW), the next state is state 2 in the EX cycle
  If it is an R-format A/L instruction, the next state is state 6 in the EX cycle
  If it is a BEQ instruction, the next state is state 8 in the EX cycle
  If it is a J instruction, the next state is state 9 in the EX cycle
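The next-state decision made in the Decode state can be written as a simple lookup. In the real control unit the decision is based on the opcode and function bits of IR; since the bit encodings are not given in this part of the slides, the sketch below uses instruction mnemonics as a stand-in. The table and function names are hypothetical.

```python
# Next state after state 1 (ID), taken from the four cases listed above.
NEXT_STATE_AFTER_DECODE = {
    "LW": 2, "SW": 2,                                   # memory reference
    "ADD": 6, "SUB": 6, "SLT": 6, "AND": 6, "OR": 6,    # R-format A/L
    "BEQ": 8,                                           # branch
    "J": 9,                                             # jump
}


def next_state_after_decode(mnemonic: str) -> int:
    """Return the EX-cycle state that follows the Decode state."""
    return NEXT_STATE_AFTER_DECODE[mnemonic]


print(next_state_after_decode("SW"))   # prints 2
print(next_state_after_decode("SLT"))  # prints 6
```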
The EMY major cycles and states
The instruction decode cycle
  We realize that we can perform in the datapath a number of microoperations needed for some instructions, since the datapath is otherwise not used
    Performing these microoperations in advance can help run those instructions faster
    Which microoperations to perform in order to be prepared?
      We can transfer the GPR register Rs pointed to by I-format and R-format instructions to register A
      We can transfer the GPR register Rt pointed to by I-format and R-format instructions to register B
  We realize that in order to save hardware we can transfer Rs and Rt to A and B registers in every clock period
    This will not cause any problem and will simplify the Control Unit, since it will not have to generate Store signals for A and B registers
  We realize that we can transfer the output of the ALU to a microarchitectural register, ALUout, in every clock period
    We will later see that it will simplify the Control Unit
The EMY major cycles and states
The instruction decode cycle
  When we discuss BEQ, we will add one more microoperation to perform in the Decode cycle
  By doing these in advance, we save time
    But, not all instructions need them: J-format instructions do not need them and some of the I-format instructions do not need the transfer to register B
    This is fine since A, B and ALUout registers are not architectural registers and so changing them will not result in program errors
  These microoperations transferring to A, B and ALUout are performed for all the instructions in every clock period
    In general, RISC CPUs do these microoperations
The EMY major cycles and states
The instruction decode cycle
  A ← GPR[Rs] ; B ← GPR[Rt]
  The GPR register file is designed so that two GPRs can be read simultaneously, by using the Rs and Rt fields of IR
    This means the GPR register file has two read ports controlled by Rs and Rt
    Note that the order of these microoperations does not matter as they happen simultaneously
  There is also a write port to the GPR register file controlled by the Rt and Rd fields: 10 bits are connected to the GPR file to determine the destination register
The instruction decode cycle
These three microoperations happen in every clock period
  A ← GPR[Rs] ; B ← GPR[Rt] ; ALUout ← OBUS
  The GPR read ports are directly connected to registers A and B and so no buses are used
  The three microoperations happen simultaneously in every clock period and complete before the end of the clock period
  Since these microoperations happen every clock period, we will not show them in our states
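The three every-clock-period transfers can be sketched as below. This is only a software model under stated assumptions: the class `RegisterFile` and function `clock_tick` are hypothetical names, the Rs and Rt fields are assumed to sit in bits 25-21 and 20-16 of IR, and the value on OBUS is simply passed in as an argument.

```python
class RegisterFile:
    """32 general purpose registers with two read ports."""

    def __init__(self) -> None:
        self.gpr = [0] * 32

    def read_ports(self, ir: int) -> tuple:
        rs = (ir >> 21) & 0x1F        # read port 1 is controlled by the Rs field
        rt = (ir >> 16) & 0x1F        # read port 2 is controlled by the Rt field
        return self.gpr[rs], self.gpr[rt]


def clock_tick(regfile: RegisterFile, ir: int, obus: int) -> dict:
    """Model A <- GPR[Rs] ; B <- GPR[Rt] ; ALUout <- OBUS in one clock period."""
    a, b = regfile.read_ports(ir)     # both reads happen in the same period
    return {"A": a, "B": b, "ALUout": obus}


rf = RegisterFile()
rf.gpr[1], rf.gpr[2] = 10, 20
print(clock_tick(rf, (1 << 21) | (2 << 16), obus=99))
# prints {'A': 10, 'B': 20, 'ALUout': 99}
```

Because the transfers need no condition, the model has no enable inputs, which mirrors the point that the control unit does not have to generate store signals for A, B and ALUout.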
The instruction decode cycle
For the time being the decode cycle state will be as follows
  State 1 (ID): no microoperation of its own
  Every clock period: A ← GPR[Rs] ; B ← GPR[Rt] ; ALUout ← OBUS
  No microoperation happens other than the three microoperations that happen every clock period
  However, we will change state 1 and place a microoperation there when we discuss BEQ instructions
Un
pip
elin
ed E
MY
CP
U D
esig
n :
Ver
sion
0
Haldun Hadimioglu CSE – Spring
2014
EMY CPU Version 0 59CS 2214
Completing the execution of LW and SW
The LW instruction
LW Rt, Disp(Rs)    Rt ← M[Rs + Disp+]
We see that to execute the LW we need to
1) Calculate the effective address, the address of the memory location we want to load from : Rs + Disp+
Read the cache memory location pointed to by the effective address
2) Transfer the value to GPR register Rt
The SW instruction
SW Rt, Disp(Rs)    M[Rs + Disp+] ← Rt
We see that to execute the SW we need to
1) Calculate the effective address, the address of the memory location we want to store to : Rs + Disp+
Write to the cache memory location pointed to by the effective address
2) Transfer the value from GPR register Rt to the memory location pointed to by the effective address
Completing the execution of LW and SW
LW and SW both have a microoperation in common : calculating the effective address
Then their microoperations differ
In order to calculate the effective address, we need to sign extend the DOImm field and add it to GPR Rs
Register GPR Rs has been transferred to A
We also realize that GPR register Rt has been transferred to register B
Register B will then be written to the memory for the SW instruction
LW requires one more microoperation than SW, as we will soon see
Completing the execution of LW and SW
We decide to have the effective address calculation of LW and SW in the Execution/Effective address cycle
The effective address is stored in a microarchitectural register called ALUout
Then, we separate LW and SW execution in the Memory Access/Branch completion cycle : Both access the memory
LW reads the memory location pointed to by the effective address into a microarchitectural register called MDR
SW writes microarchitectural register B to the memory location pointed to by the effective address and completes its execution
LW completes its execution by transferring the data in MDR to GPR register Rt
Completing the execution of LW and SW
The effective address calculation
Rs + Disp+
Rs is now in register A
Sign extend DOImm and then add
As we will see shortly, A/L instructions will have their arithmetic/logic operation performed in this cycle as well
They need the ALU in this cycle
Therefore, we decide to use the adder of the ALU to do the addition for the effective address
ALUout ← A + DOImm+
Completing the execution of LW and SW
Reading from the memory
MABUS = ALUout ; MemRead = 1 ; MDR ← MRBUS
Note that the microoperations are performed in parallel and the order does not matter
This microoperation can be stated without giving the bus detail
MDR ← M[ALUout]
Note that the memory access can take more than one clock period and so we may stay in this state more than one clock period
Completing the execution of LW and SW
The LW instruction completes by transferring MDR to GPR register Rt
GPR[Rt] ← MDR
The Rt field of IR is used by the GPR register file to select the register to be written with the value from MDR
The value read from the memory is first stored in the microarchitectural register MDR
Though, we could store the value read from the memory directly in GPR register Rt
We decide to store to MDR and transfer from MDR to the GPR write port
This decision will help pipelining as we will see later !
We then go back to state 0, the IF cycle, to start executing the next instruction
Completing the execution of LW and SW
Storing to the data memory
MABUS = ALUout ; MemWrite = 1 ; MWBUS = B
Note that the microoperations are performed in parallel and the order does not matter
This microoperation can be stated without giving the bus detail
M[ALUout] ← B
Note that the memory access can take more than one clock period and so we may stay in this state more than one clock period
SW completes its execution !
We then go back to state 0, the IF cycle, to start executing the next instruction
Completing the execution of LW and SW
The portion of the state diagram for LW and SW
From the ID cycle (LW, SW) → state 2
State 2 (EX) : ALUout ← A + DOImm+
State 3 (MEM), LW : MDR ← M[ALUout]
State 4 (WB), LW : GPR[Rt] ← MDR, then go to state 0
State 5 (MEM), SW : M[ALUout] ← B, then go to state 0
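The LW and SW states can be traced in a few lines of Python. This is only a sketch under stated assumptions: the memory is modeled as a dictionary keyed by byte address, and the names `run_lw_sw` and `sign_extend16` are hypothetical, not part of the EMY handout.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(field):
    """Sign extend the 16-bit DOImm field (written DOImm+ in the slides)."""
    field &= 0xFFFF
    return field - 0x10000 if field & 0x8000 else field

def run_lw_sw(op, gpr, memory, rs, rt, doimm):
    """Perform states 2 to 5 for LW or SW and return the states visited."""
    # Performed in every clock period: A <- GPR[Rs] ; B <- GPR[Rt]
    a, b = gpr[rs], gpr[rt]
    states = [2]
    aluout = (a + sign_extend16(doimm)) & MASK32  # state 2 (EX)
    if op == "LW":
        states.append(3)
        mdr = memory[aluout]                      # state 3 (MEM)
        states.append(4)
        gpr[rt] = mdr                             # state 4 (WB)
    elif op == "SW":
        states.append(5)
        memory[aluout] = b                        # state 5 (MEM)
    else:
        raise ValueError("only LW and SW are modeled here")
    return states
```

Both instructions share state 2, and LW visits one more state than SW because the loaded value passes through MDR before it reaches the GPR file.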
Completing the execution of R-format A/L instructions
The R-format A/L instructions
ADD Rd, Rs, Rt    Rd ← Rs + Rt
SUB Rd, Rs, Rt    Rd ← Rs - Rt
AND Rd, Rs, Rt    Rd ← Rs & Rt
OR Rd, Rs, Rt     Rd ← Rs | Rt
SLT Rd, Rs, Rt    If Rs < Rt then Rd ← 1 else Rd ← 0
We see that to execute these instructions we need to perform an operation specified by the Opcode and Function fields
Then, we transfer the result to GPR register Rd
Completing the execution of R-format A/L instructions
We see that we can perform all the required operations for R-format instructions in one state
Which one to perform would be determined by the Opcode and Function fields
The inputs are Rs and Rt
Rs is already transferred to register A and Rt is already transferred to register B
We see we save time by moving them in the ID stage !
Completing the execution of R-format A/L instructions
We see that we can perform all the required operations for R-format instructions in one state
The result would be stored in the microarchitectural register ALUout
Though, we could store the result of the operation directly in GPR register Rd
This would require a separate bus from the output of the ALU to the write port of the GPR file
We decide to store to ALUout and transfer from ALUout to the GPR write port
This decision will help pipelining as we will see later !
Completing the execution of R-format A/L instructions
The microoperation for the current R-format A/L operation
ALUout ← A op B
The meaning of “op” is that the type of the operation is indicated by the Opcode and Function fields of IR
What happens is that the control unit uses the Opcode and Function fields to generate a set of control signals
These control signals are connected to the ALU, telling it which operation to perform
Completing the execution of R-format A/L instructions
The ALU output for the five A/L instructions is straightforward to understand, except when the SLT instruction is executed
SLT Rd, Rs, Rt    If Rs < Rt then Rd ← 1 else Rd ← 0
The ALU has to output
1 (31 zeros and a one) if Rs < Rt
0 (32 zeros) otherwise
The ALU will have a functional unit called SLT that will output as such
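The five R-format operations, including the SLT output, can be sketched in Python as follows. The function name `alu` and the string operation codes are hypothetical. It is assumed here that SLT compares A and B as 2's complement values, and overflow detection is not modeled.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(value):
    """Interpret a 32-bit pattern as a 2's complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def alu(op, a, b):
    """Return the 32-bit value stored in ALUout for ALUout <- A op B."""
    if op == "ADD":
        result = a + b
    elif op == "SUB":
        result = a - b
    elif op == "AND":
        result = a & b
    elif op == "OR":
        result = a | b
    elif op == "SLT":
        # Assumption: signed comparison; output is 31 zeros and a one, or 32 zeros
        result = 1 if to_signed32(a) < to_signed32(b) else 0
    else:
        raise ValueError("unknown operation: " + op)
    return result & MASK32
```

In the hardware the control unit selects the operation with control signals derived from the Opcode and Function fields; the string argument plays that role in this sketch.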
Completing the execution of R-format A/L instructions
The result that is in ALUout is moved to GPR register Rd
GPR[Rd] ← ALUout
The Rd field of IR is used by the GPR register file to select the register to be written with the value from ALUout
We then go back to state 0, the IF cycle, to start executing the next instruction
Completing the execution of R-format A/L instructions
The portion of the state diagram for R-format A/L instructions
From the ID cycle (R-format A/L instructions) → state 6
State 6 (EX) : ALUout ← A op B
State 7 (MEM) : GPR[Rd] ← ALUout, then go to state 0
Completing the execution of Control instructions
The BEQ instruction
BEQ Rs, Rt, Offset    If Rs = Rt, then PC ← PC + (Offset+ x 4)
The J instruction
J Address    PC[27-0] ← (Address x 4)
Completing the execution of Control instructions
We see that to execute these instructions we need to
1) Calculate the effective address, the address to branch/jump to
BEQ : Add PC to the result of the multiplication of the sign extended Offset by 4
J : Move to the rightmost 28 bits of PC the result of the multiplication of the Address by 4
This completes the J instruction execution
2) Test if Rs is equal to Rt and if yes, transfer the effective address to PC
This completes the BEQ instruction execution
Completing the execution of BEQ instruction
In order to calculate the effective address, we need to
Sign extend the DOImm field
Then multiply it by 4
Then add it to PC
That is we need to do the following
PC + (Offset+ x 4)
We realize that the Datapath is free to do this microoperation in the ID cycle while it is doing the other microoperations as it does in every clock period
A ← GPR[Rs] ; B ← GPR[Rt] ; ALUout ← OBUS
Completing the execution of BEQ instruction
We then decide to calculate the effective address of the BEQ instruction in the ID cycle
ALUout ← PC + (Offset+ x 4)
Note that we are calculating the BEQ effective address while we are determining which instruction we have
If we do not have a BEQ instruction, the result is not used
Otherwise, we save time since we perform this operation in advance
Note that executing BEQ fast, by performing its microoperation in advance, is important since this will help pipelining
As we shall see later, control instructions, all Branch and Jump instructions, slow down the pipelined CPU
It is therefore critical to complete their execution as quickly as possible to reduce the negative effect of these instructions on the pipeline
Completing the execution of BEQ instruction
The BEQ effective address calculation
ALUout ← PC + (Offset+ x 4)
Sign extending DOImm requires a simple combinational circuit
Multiplying by 4 requires no logic at all
We just have to catenate two zeros to the right of DOImm
We know that shifting a number to the left by two bit positions is multiplying it by four
We decide to use the adder of the ALU to do the addition since it is free to use
ALUout ← PC + ([Offset+] << 2)
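The BEQ effective address calculation can be sketched in Python. The helper names are hypothetical, and the sketch assumes PC has already been incremented by 4 in the IF cycle before the ID cycle uses it.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(field):
    """Sign extend the 16-bit Offset field (written Offset+ in the slides)."""
    field &= 0xFFFF
    return field - 0x10000 if field & 0x8000 else field

def beq_target(pc_after_fetch, offset_field):
    """ALUout <- PC + ([Offset+] << 2), kept to 32 bits."""
    return (pc_after_fetch + (sign_extend16(offset_field) << 2)) & MASK32
```

For instance, a BEQ stored at 400010 with Offset 3 sees PC = 400014 in the ID cycle, so `beq_target(0x400014, 3)` gives 0x400020. A negative Offset moves the target backward.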
Completing the execution of BEQ instruction
Since the BEQ effective address calculation is done in the decode cycle, state 1 has been modified
We now perform a microoperation in the decode cycle besides the three microoperations we perform every clock period
State 1 (ID) : ALUout ← PC + ([Offset+] << 2)
A ← GPR[Rs] ; B ← GPR[Rt] ; ALUout ← OBUS
Completing the execution of BEQ instruction
Testing if Rs is equal to Rt and storing to PC based on the result
We know that GPR register Rs has been transferred to register A
We know that GPR register Rt has been transferred to register B
We will compare register A and register B then !
We decide to have a Zero circuit in the ALU to compare registers A and B
The ALU will have a new 1-bit output named Zero showing the result of the compare so that Zero is
1 if the two registers are equal
0 if the two registers are not equal
Completing the execution of BEQ instruction
Changing PC if Zero is 1
If (A == B) then PC ← ALUout
This means we branch to a memory location
That is we take the branch
We then go back to state 0, the IF cycle, to start executing the next instruction
Completing the execution of BEQ instruction
How do we handle conditional and unconditional stores on PC ?
If PC has to be stored unconditionally, such as in state 0, we use the control signal, PCWrite, to store on PC
The control unit generates another control signal, PCWriteCond, to conditionally store on PC
PCWriteCond is ANDed with Zero to conditionally store on PC
If Zero = 1 it means the condition is true, we will branch which means we store the effective address on PC
If Zero = 0 it means the condition is not true, we will not branch which means we will not store the effective address on PC
[Figure: the Zero, PCWriteCond and PCWrite signals controlling the store on PC]
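The PC store logic can be written as a small Boolean function. This is a sketch: the slides state that PCWriteCond is ANDed with Zero, while combining that result with PCWrite through an OR is an assumption made here, and the function names are hypothetical.

```python
def pc_store_enable(pc_write, pc_write_cond, zero):
    """Return 1 if PC is stored in this clock period, 0 otherwise.

    Assumption: the unconditional path (PCWrite) and the conditional
    path (PCWriteCond AND Zero) are ORed together.
    """
    return pc_write | (pc_write_cond & zero)

def next_pc(pc, new_value, pc_write, pc_write_cond, zero):
    """Return the PC content after the clock period."""
    if pc_store_enable(pc_write, pc_write_cond, zero):
        return new_value
    return pc
```

With PCWriteCond = 1 and Zero = 1 the branch is taken and PC receives the effective address; with Zero = 0 PC keeps its value.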
Completing the execution of BEQ instruction
The portion of the state diagram for BEQ instruction
From the ID cycle (BEQ) → state 8
State 8 (EX) : If A == B then PC ← ALUout, then go to state 0
Note again that we complete the execution of BEQ fast since this will help pipelining
As mentioned earlier, control instructions slow down the pipelined CPU
It is therefore critical to complete their execution as quickly as possible to reduce the negative effect of these instructions on the pipeline
Completing the execution of J instruction
We see that all we need to do is store the effective address on PC unconditionally
Move to the rightmost 28 bits of PC the result of the multiplication of the Address by 4
PC[27-0] ← (Address x 4)
This is equivalent to
PC ← PC[31-28], (Address x 4)
The leftmost 4 bits of PC are not changed !
The above microoperation has the same effect as the following one
PC ← (PC[31-28], Address) x 4
Completing the execution of J instruction
In order to calculate the effective address, we need to
Multiply the Address field by 4
Multiplying by 4 requires no logic at all
We just have to catenate two zeros to the right of the Address field
We know that shifting a number to the left by two bit positions is multiplying it by four
PC ← (PC[31-28], Address) << 2
We then go back to state 0, the IF cycle, to start executing the next instruction
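The J effective address can be sketched in Python. The function name is hypothetical, and the Address field is assumed to be 26 bits wide, which follows from Address x 4 filling the rightmost 28 bits of PC.

```python
def jump_target(pc, address_field):
    """PC <- (PC[31-28], Address) << 2 for a 26-bit Address field."""
    upper4 = pc & 0xF0000000                     # PC[31-28] is kept
    lower28 = (address_field & 0x03FFFFFF) << 2  # Address x 4 goes to PC[27-0]
    return upper4 | lower28
```

For example, `jump_target(0x00400014, 0x100008)` gives 0x00400020: the two zeros catenated on the right do the multiplication by 4, and the upper four bits of PC pass through unchanged.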
Completing the execution of J instruction
The portion of the state diagram for J instruction
From the ID cycle (J) → state 9
State 9 (EX) : PC ← (PC[31-28], Address) << 2, then go to state 0
Note again that we complete the execution of J fast since this will help pipelining
As mentioned earlier, control instructions slow down the pipelined CPU
It is therefore critical to complete their execution as quickly as possible to reduce the negative effect of these instructions on the pipeline
Completing the execution of J instruction
How do we handle conditional and unconditional stores on PC ?
If PC has to be stored unconditionally, such as in state 0 and state 9, we use the control signal, PCWrite, to store on PC
The J instruction uses PCWrite
The control unit generates another control signal, PCWriteCond, to conditionally store on PC
The BEQ instruction uses PCWriteCond
[Figure: the Zero, PCWriteCond and PCWrite signals controlling the store on PC]
The complete state diagram
The high-level state diagram for integer instructions and the datapath are given in the EMY CPU handout
They will be modified to implement a pipelined EMY CPU
But, the overall CPU structure will be similar
The complete state diagram
The high-level state diagram for integer instructions has a branch out from state 1
If the CPU receives an instruction that is not one of the nine (LW, SW, ADD, SUB, AND, OR, SLT, BEQ, J), it will generate an internal interrupt (an exception) since it does not know what to do
In order to generate the internal interrupt, it will go to state 10 to prepare the CPU for the interrupt
In state 10, it will perform a number of microoperations, including moving the internal interrupt handler address 80000180 to PC
State 1 (ID) : ALUout ← PC + ([Offset+] << 2)
Invalid instruction exception (invalid opcode, not one of nine instructions) → State 10
The complete state diagram
The high-level state diagram for integer instructions has a branch out from state 6
If the CPU performs a 2’s Complement addition or a subtraction and there is an overflow, it will generate an internal interrupt (an exception) since the result is not correct
In order to generate the internal interrupt, it will go to state 11 to prepare the CPU for the interrupt
In state 11, it will perform a number of microoperations, including moving the internal interrupt handler address 80000180 to PC
State 6 (EX) : ALUout ← A op B
Arithmetic exception (signed overflow) → State 11
CPIi of Integer Instructions
With this implementation, the CPIi of the instructions can be calculated as
CPI_LW = 5 because we trace states 0, 1, 2, 3, 4
CPI_SW = 4 because we trace states 0, 1, 2, 5
CPI_A/L R Format = 4 because we trace states 0, 1, 6, 7
CPI_BEQ = 3 because we trace states 0, 1, 8
CPI_J = 3 because we trace states 0, 1, 9
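The CPI values can be derived by walking the high-level state diagram. The sketch below encodes the transitions described in the slides; the function names are hypothetical, the overflow branch to state 11 is left out, and every state is assumed to take one clock period.

```python
A_L_INSTRUCTIONS = ("ADD", "SUB", "AND", "OR", "SLT")

def next_state(state, instr):
    """Return the state that follows `state` while executing `instr`."""
    if state == 0:
        return 1
    if state == 1:
        if instr in ("LW", "SW"):
            return 2
        if instr in A_L_INSTRUCTIONS:
            return 6
        if instr == "BEQ":
            return 8
        if instr == "J":
            return 9
        return 10  # invalid instruction exception
    if state == 2:
        return 3 if instr == "LW" else 5
    if state == 3:
        return 4
    if state == 6:
        return 7
    return 0  # states 4, 5, 7, 8, 9 and 10 go back to the IF cycle

def trace(instr):
    """List the states visited from state 0 until the CPU returns to state 0."""
    states = [0]
    while True:
        following = next_state(states[-1], instr)
        if following == 0:
            return states
        states.append(following)

def cpi(instr):
    """Clock periods per instruction when each state takes one clock period."""
    return len(trace(instr))
```

Calling `trace("LW")` returns the states 0, 1, 2, 3, 4, so `cpi("LW")` is 5, while `cpi("BEQ")` and `cpi("J")` are both 3.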
Control Signals
The semantics of each state is that which microoperation to perform is determined by the control unit, turning on and off a few MUX select, register store, ALU control inputs and enable control signals
Control signals are connected to MUXes, registers and ALUs
They are shown as angled signals in the handout
[Figure: the IRWrite control signal connected to register IR]
The EMY low-level state diagram describes which control signal is 1 when
The Clock Signal
The clock period duration is determined by the slowest but important microoperation in the CPU
All the signal delays in the datapath and control unit are added up to calculate the time for this important operation
It is usually the integer add microoperation
Though it could be the memory access time if it was a little longer than the integer addition time
Usually, the memory is much slower than the CPU in commercial systems but we will not consider it when we calculate the clock period duration since we will deal with slow memory when we cover the memory hierarchy topic
The Clock Signal
Thus, we will assume the integer addition and memory access both take one clock period each !
For today’s microprocessors this is not the case though !
If a microoperation takes more than one clock period, we draw a loop-back arrow to indicate so
For example, if the memory takes more than one clock period, there will be loop back lines drawn for states 0, 3 and 5 to indicate that the CPU spends more than one clock period
State 0 (IF) : IR ← M[PC] ; PC ← PC + 4
State 3 (MEM), LW : MDR ← M[ALUout]
State 5 (MEM), SW : M[ALUout] ← B
Clock Signal
The clock period duration is determined by the addition of all the delays in the control unit and the delays in the datapath for the integer add microoperation
The delays in the control unit include the delays to generate the MUX select, register clock input, ALU control and enable control signals
Gate networks generate these select and clock control signals if hardwiring is used
The micromemory and additional circuits generate these select and clock control signals if microprogramming is used
The delays in the datapath include
Delay of data travel from registers to the ALU inputs
Delay of the adder in the ALU
Delay of the data travel from the ALU to the destination register in the datapath : ALUout
Architecture-Microarchitecture Interaction
An example of how architectural decisions can affect the microarchitecture design is the following
The Rs and Rt fields of R-format and I-format instructions are in the same position
Therefore, we do not need to use separate read ports from the register file for each format
We have one read port for Rs and one read port for Rt
I-format : Opcode (6) | Rs (5) | Rt (5) | Displacement/Offset/Immediate (16)
R-format : Opcode (6) | Rs (5) | Rt (5) | Rd (5) | Shamt (5) | Function (6)
Using the state diagram
Consider the following piece of program in the EMY memory
---
400000   LW R8, 0(R9)         ; R8 ← M[R9 + 0+] ; M[R9] has C
400004   ADD R10, R8, R11     ; R10 ← R8 + R11
400008   ADD R12, R13, R14    ; R12 ← R13 + R14
40000C   SW R12, 0(R15)       ; M[R15 + 0+] ← R12 ; M[R15] ← R12
400010   BEQ R12, R0, 3       ; If R12 is equal to R0, branch to 400020
---
10000150   C                  ; The content of this location is C
---
1000A200   ?
Assume that R9 has 10000150 and R15 has 1000A200 initially
Using the state diagram
This piece of program takes 20 clock periods, as the table below shows the execution of the program with respect to time
See the EMY CPU handout for timing
                              IF  ID  EX  MEM  WB
400000 LW R8, 0(R9)           1   2   3   4    5
400004 ADD R10, R8, R11       6   7   8   9
400008 ADD R12, R13, R14      10  11  12  13
40000C SW R12, 0(R15)         14  15  16  17
400010 BEQ R12, R0, 3         18  19  20
Using the state diagram
If the clock frequency is 1GHz

$$\mathrm{CPI}_{ave} = \frac{\text{Number of clock cycles for the program}}{\text{Number of instructions run}} = \frac{20}{5} = 4$$

$$\text{Clock period} = \frac{1}{\text{Clock frequency}} = \frac{1}{10^{9}}\ \text{second} = 10^{-9}\ \text{second} = 1\ \text{ns}$$

$$\text{CPUtime} = \text{Number of clock periods for program} \times \text{Clock period} = 20 \times 1\ \text{ns} = 20\ \text{ns}$$

$$\mathrm{MIPS}_{ave} = \frac{\text{Number of instructions run}}{\text{CPUtime} \times 10^{6}} = \frac{5}{20 \times 10^{-9} \times 10^{6}} = \frac{5}{20 \times 10^{-3}} = 250$$
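The same three figures can be checked numerically. This is a minimal sketch with a hypothetical function name; it takes the clock periods spent by each executed instruction and the clock frequency in Hz.

```python
def performance(clocks_per_instruction, clock_hz):
    """Return (CPIave, CPU time in seconds, MIPSave) for one program run."""
    total_clocks = sum(clocks_per_instruction)
    instructions_run = len(clocks_per_instruction)
    cpi_ave = total_clocks / instructions_run
    clock_period_s = 1.0 / clock_hz
    cpu_time_s = total_clocks * clock_period_s
    mips_ave = instructions_run / (cpu_time_s * 1e6)
    return cpi_ave, cpu_time_s, mips_ave

# The five-instruction program: LW, ADD, ADD, SW, BEQ at 1 GHz
cpi_ave, cpu_time_s, mips_ave = performance([5, 4, 4, 4, 3], 1e9)
```

For this program the values come out as a CPIave of 4, a CPU time of about 20 ns and about 250 MIPS, up to floating point rounding.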
We have covered unpipelined CPU
The remaining slides will be used when we cover the memory hierarchy topic
So far we have assumed that
The memory takes one clock period to access
There is one solid memory
What if it took two clock periods or more ?
What if there were instruction and data cache memories ?
Then we need to take a look at the execution timing in detail again
If we assume that
Instruction and data cache memories take one clock period each and there is no cache miss, the execution timing will be as before
What if the two cache memories took two clock periods each ?
The execution timing will be identical to the clock doubling case studied in class
LW would take 7 clock periods since we trace states 0, 0, 1, 2, 3, 3, 4
States 0 and 3 are repeated twice since the cache memories take two clock periods each
SW would take 6 clock periods since we trace states 0, 0, 1, 2, 5, 5
States 0 and 5 are repeated twice since the cache memories take two clock periods each
ADD would take 5 clock periods since we trace states 0, 0, 1, 6, 7
State 0 is repeated twice since the cache memory takes two clock periods
BEQ would take 4 clock periods since we trace states 0, 0, 1, 8
State 0 is repeated twice since the cache memory takes two clock periods
Using the state diagram
If the cache memories are slow (they take two clock periods per access) and there is no cache miss, then this piece of program will take 27 clock periods, as the table below shows the execution of the program with respect to time
See the EMY CPU handout for timing
                              IF     ID  EX  MEM    WB
400000 LW R8, 0(R9)           1-2    3   4   5-6    7
400004 ADD R10, R8, R11       8-9    10  11  12
400008 ADD R12, R13, R14      13-14  15  16  17
40000C SW R12, 0(R15)         18-19  20  21  22-23
400010 BEQ R12, R0, 3         24-25  26  27
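The effect of slow cache memories on the clock period counts can be reproduced with a short Python sketch. The names are hypothetical. States 0, 3 and 5 are treated as the memory access states, and no cache miss is modeled.

```python
STATE_TRACES = {
    "LW": [0, 1, 2, 3, 4],
    "SW": [0, 1, 2, 5],
    "ADD": [0, 1, 6, 7],
    "BEQ": [0, 1, 8],
    "J": [0, 1, 9],
}
MEMORY_STATES = {0, 3, 5}  # IF, LW memory read, SW memory write

def clock_periods(instr, memory_clocks=1):
    """Clock periods for one instruction when each memory access takes memory_clocks."""
    return sum(memory_clocks if state in MEMORY_STATES else 1
               for state in STATE_TRACES[instr])

def program_clock_periods(program, memory_clocks=1):
    """Total clock periods for a list of instruction names run back to back."""
    return sum(clock_periods(instr, memory_clocks) for instr in program)

PROGRAM = ["LW", "ADD", "ADD", "SW", "BEQ"]
```

`program_clock_periods(PROGRAM, 1)` gives 20 and `program_clock_periods(PROGRAM, 2)` gives 27, matching the two timing tables.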
We have so far assumed that the cache memories do not have misses !
What if both instruction and data cache memories result in cache misses ?
That is, there is a cold start !
What is the new execution time ?
To calculate the new execution time we have to study the structure of the cache memories
The size of the physical (main) memory, the size of the cache memories, the size of cache blocks, the type of mapping (direct, associative, block-set associative), the block replacement strategy, etc.
What if both instruction and data cache memories result in cache misses ?
For this semester
We will concentrate on Level 1 cache memories, i.e. instruction and data cache memories
We will assume that there is no Level 2 cache memory miss !
We will indicate all physical addresses used
For this presentation assume that
The physical (main) memory has 256 Mbytes
The physical memory has 4 Bytes per location
The bus width between the physical and lowest level cache is 4 Bytes
The instruction cache is 8KBytes
The data cache is 16KBytes
Both cache block sizes are 32 bytes
Both cache memories use direct mapping
Both caches use write-back with write-allocate
Both cache memories access the needed item first
The physical memory latency is 4 clock periods and transferring a 4-Byte content is one clock period each
Instruction and data cache misses ?
The physical memory has 256MBytes or $2^{28}$ Bytes
The physical address is 28 bits long
The physical memory has $2^{28}/32 = 2^{28}/2^{5} = 2^{23}$ blocks
The instruction cache has 8KB/32 = $2^{13}/2^{5} = 2^{8}$ = 256 blocks
The data cache has 16KB/32 = $2^{14}/2^{5} = 2^{9}$ = 512 blocks
Instruction and data cache misses ?
The physical address is used by the physical memory and instruction cache as follows
Main memory block number (23 bits) | Byte offset (5 bits)
Address tag (15 bits) | Instruction cache block # (8 bits) | Byte offset (5 bits)
The physical address is used by the physical memory and data cache as follows
Main memory block number (23 bits) | Byte offset (5 bits)
Address tag (14 bits) | Data cache block # (9 bits) | Byte offset (5 bits)
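The block counts and address field widths follow from the three sizes. The Python sketch below recomputes them; the function name and dictionary keys are hypothetical, and all sizes are assumed to be powers of two with direct mapping.

```python
def cache_geometry(memory_bytes, cache_bytes, block_bytes):
    """Return block counts and address field widths for a direct-mapped cache."""
    address_bits = memory_bytes.bit_length() - 1   # log2 of a power of two
    offset_bits = block_bytes.bit_length() - 1
    cache_blocks = cache_bytes // block_bytes
    index_bits = cache_blocks.bit_length() - 1
    return {
        "address_bits": address_bits,
        "memory_blocks": memory_bytes // block_bytes,
        "cache_blocks": cache_blocks,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": address_bits - index_bits - offset_bits,
    }

MB, KB = 2 ** 20, 2 ** 10
instruction_cache = cache_geometry(256 * MB, 8 * KB, 32)
data_cache = cache_geometry(256 * MB, 16 * KB, 32)
```

The instruction cache comes out with 256 blocks, an 8-bit block number and a 15-bit tag; the data cache with 512 blocks, a 9-bit block number and a 14-bit tag.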
Instruction and data cache misses ?
The instruction cache has 32-Byte blocks
Each block contains 8 instructions since each instruction is 4 Bytes long
Instructions in physical memory locations 100 through 110 are in one instruction cache block
Address    Content (4 bytes)
00000100   LW R8, 0(R9)
00000104   ADD R10, R8, R11
00000108   ADD R12, R13, R14
0000010C   SW R12, 0(R15)
00000110   BEQ R12, R0, 3
00000114   ?
00000118   ?
0000011C   ?
Instruction cache blocks have 32 bytes and so each block holds 8 instructions !
Instructions in 100, 104, 108, 10C, 110, 114, 118 and 11C are in one instruction cache block !
Which instruction cache block is this ?
Instruction and data cache misses ?
The instruction cache has 32-Byte blocks
Instructions in physical memory locations 100 through 110 are in main memory block number 8 and in instruction cache memory block number 8
0000100    LW R8, 0(R9)
Hex :    0    0    0    0    1    0    0
Binary : 0000 0000 0000 0000 0001 0000 0000
Address tag (15 bits) : 000000000000000
Instruction cache block # (8 bits) : 8 since 00001000 is 8 in decimal
Byte offset (5 bits) : 00000. The LW instruction has 0 offset from the beginning of the block, i.e. the first instruction of the block
Main memory block number : 8
Instructions in 100, 104, 108, 10C, 110 are in instruction cache block 8 !
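Splitting a physical address into its fields is a matter of shifts and masks. The sketch below uses a hypothetical function name; `index_bits` is 8 for the instruction cache and 9 for the data cache, and the byte offset is 5 bits because the blocks are 32 bytes.

```python
def split_address(address, index_bits, offset_bits=5):
    """Return (tag, cache block #, byte offset, main memory block number)."""
    byte_offset = address & ((1 << offset_bits) - 1)
    memory_block = address >> offset_bits
    cache_block = memory_block & ((1 << index_bits) - 1)
    tag = memory_block >> index_bits
    return tag, cache_block, byte_offset, memory_block

# The LW instruction at physical address 100 (hex), instruction cache
tag, cache_block, byte_offset, memory_block = split_address(0x100, index_bits=8)
```

For address 100 (hex) this gives tag 0, instruction cache block 8, byte offset 0 and main memory block number 8. The BEQ at 110 (hex) lands in the same block with byte offset 16.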
Instruction and data cache misses ?
How long does it take to access individual instructions ?
Both cache memories access the needed item first
The physical memory latency is 4 clock periods and transferring a 4-Byte content is one clock period each
M[100] is the needed item and accessed & transferred first!
Block fill time = 12 clock periods
Address    Content              Available after
00000100   LW R8, 0(R9)         Five clock periods
00000104   ADD R10, R8, R11     Six clock periods
00000108   ADD R12, R13, R14    Seven clock periods
0000010C   SW R12, 0(R15)       Eight clock periods
00000110   BEQ R12, R0, 3       Nine clock periods
00000114   ?                    Ten clock periods
00000118   ?                    Eleven clock periods
0000011C   ?                    Twelve clock periods
Instruction and data cache misses ?
The data cache has 32-Byte blocks
Each block contains 8 data elements since each data element is 4 Bytes long
The data element in physical memory location 1150 is in one data cache block
Address    Content (4 bytes)
00001140
00001144
00001148
0000114C
00001150   C
00001154
00001158
0000115C
Data cache blocks have 32 bytes and so each holds 8 data elements !
Data elements in 1140, 1144, 1148, 114C, 1150, 1154, 1158 and 115C are in one data cache block !
Which data cache block is this ?
Instruction and data cache misses ?
The data cache has 32-Byte blocks
The data element in physical memory location 1150 is in main memory block number 138 and in data cache block number 138
0001150    C
Hex :    0    0    0    1    1    5    0
Binary : 0000 0000 0000 0001 0001 0101 0000
Address tag (14 bits) : 00000000000000
Data cache block # (9 bits) : 138 since 010001010 is 138 in decimal
Byte offset (5 bits) : 10000. The data element has 16-Byte offset from the beginning of the block, i.e. the fifth data element of the block
Main memory block number : 138
Data element in 1150 is in data cache block 138 !
Instruction and data cache misses ?
How long does it take to access individual data elements ?
Both cache memories access the needed item first
The physical memory latency is 4 clock periods and transferring a 4-Byte content is one clock period each
M[1150] is the needed item and accessed & transferred first!
Block fill time = 12 clock periods
Address    Content    Available after
00001150   C          Five clock periods
00001154              Six clock periods
00001158              Seven clock periods
0000115C              Eight clock periods
00001140              Nine clock periods
00001144              Ten clock periods
00001148              Eleven clock periods
0000114C              Twelve clock periods
Instruction and data cache misses ?
The data cache has 32-Byte blocks
Each block contains 8 data elements since the block has 32 Bytes and each data element is 4 Bytes long
The data element in physical memory location 2200 is in one data cache block
Data elements in 2200, 2204, 2208, 220C, 2210, 2214, 2218 and 221C are in one data cache block !
[Figure: the block drawn as eight 4-Byte locations, 00002200 to 0000221C, with location 00002200 marked ?]
Which data cache block is this ?
Instruction and data cache misses ?
The data cache has 32-Byte blocks
The data element in physical memory location 2200 is in main memory block number 272 and in data cache block number 272
Address 00002200 (marked ?) in binary : 0000 0000 0000 0000 0010 0010 0000 0000
5 bits ! The byte offset is the rightmost 5 bits (00000). The data element has 0 offset from the beginning of the block, i.e. it is the first data element of the block
Data cache block # 272 since 100010000 is 272 in decimal
The bits to the left of the data cache block number are the address tag
Data element in 2200 is in data cache block 272 !
Main memory block number : 272
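As a check on the block number, the arithmetic can be written out by hand. This worked example only restates the slide's numbers, taking the address 2200 as hexadecimal and the block size as 32 Bytes:

```latex
\begin{align*}
2200_{16} &= 2\cdot 16^{3} + 2\cdot 16^{2} = 8192 + 512 = 8704_{10} \\
\left\lfloor 8704 / 32 \right\rfloor &= 272 \quad \text{(main memory block number)} \\
8704 \bmod 32 &= 0 \quad \text{(byte offset, so the first data element)} \\
272_{10} &= 256 + 16 = 100010000_{2}
\end{align*}
```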
Instruction and data cache misses ?
How long does it take to access an individual data element ?
Both cache memories access the needed item first
The physical memory latency is 4 clock periods and transferring each 4-Byte content takes one clock period
M[2200] is the needed item and is accessed & transferred first !
Timeline of the block fill, counted from the start of the access :
Latency : clock periods 1 to 4
Transfer M[2200] : five clock periods
Transfer M[2204] : six clock periods
Transfer M[2208] : seven clock periods
Transfer M[220C] : eight clock periods
Transfer M[2210] : nine clock periods
Transfer M[2214] : ten clock periods
Transfer M[2218] : eleven clock periods
Transfer M[221C] : twelve clock periods
Block fill time = 12 clock periods
Instruction and data cache misses ?
How long does it take to run the program with a cold start ?
This piece of program takes 32 clock periods
The table below shows the execution of the program with respect to time; an entry a/b means the stage occupies clock periods a through b, and - means the instruction does not use that stage
See the EMY CPU handout for timing

Address  Instruction          IF    ID   EX   MEM    WB
400000   LW  R8, 0(R9)        1/5   6    7    8/12   13
400004   ADD R10, R8, R11     14    15   16   -      17
400008   ADD R12, R13, R14    18    19   20   -      21
40000C   SW  R12, 0(R15)      22    23   24   25/29  -
400010   BEQ R12, R0, 3       30    31   32   -      -
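The 32-clock-period total can be recomputed by adding up the stages of each instruction. The Python sketch below is a bookkeeping aid, not the EMY CPU handout's method. It assumes that a cache hit takes one clock period, that a miss takes five (4 of latency plus 1 transfer for the needed item), and that only the first instruction fetch and the two data accesses miss, as the timing entries 1/5, 8/12 and 25/29 suggest. The stage lists per instruction are read from the timing entries above.

```python
MISS_CYCLES = 5   # 4 clock periods of latency + 1 transfer for the needed item
HIT_CYCLES = 1    # assumption: a cache hit takes one clock period

# (instruction, stages used, stages that miss in this cold-start run)
program = [
    ("LW R8, 0(R9)",      ["IF", "ID", "EX", "MEM", "WB"], {"IF", "MEM"}),
    ("ADD R10, R8, R11",  ["IF", "ID", "EX", "WB"],        set()),
    ("ADD R12, R13, R14", ["IF", "ID", "EX", "WB"],        set()),
    ("SW R12, 0(R15)",    ["IF", "ID", "EX", "MEM"],       {"MEM"}),
    ("BEQ R12, R0, 3",    ["IF", "ID", "EX"],              set()),
]

def run(program):
    """Return total clock periods and, per instruction, the (start, end) period of each stage."""
    clock = 0
    timeline = []
    for name, stages, misses in program:
        row = {}
        for stage in stages:
            cycles = MISS_CYCLES if stage in misses else HIT_CYCLES
            row[stage] = (clock + 1, clock + cycles)
            clock += cycles
        timeline.append((name, row))
    return clock, timeline

total, timeline = run(program)
for name, row in timeline:
    print(name, row)
print("Total clock periods:", total)   # 32
```

Instructions at 400004 to 400010 hit in the instruction cache because they lie in the same 32-Byte block as 400000, whose fill completes by clock period 12.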
Unpipelined EMY CPU is complete
We studied how the architecture affects the organization
We designed the EMY CPU for nine integer instructions
We considered how cache memories affect the unpipelined EMY CPU execution
Another PowerPoint presentation will cover
The pipelined EMY CPU
How cache memories affect the pipelined EMY CPU execution