Computer Organization
Department of CSE, SSE Mukka
The Memory System
www.bookspar.com | Website for students | VTU NOTES
Chapter Objectives
- Basic memory circuits
- Organization of the main memory
- Cache memory concept – shortens the effective memory access time
- Virtual memory mechanism – increases the apparent size of the main memory
- Secondary storage: magnetic disks, optical disks, magnetic tapes
Basic Memory Concepts
The maximum size of the main memory (MM) that can be used in any computer is determined by its addressing scheme. For example, a 16-bit computer that generates 16-bit addresses is capable of addressing up to 2^16 = 64K memory locations; a 32-bit computer with 32-bit addresses can address 2^32 = 4G memory locations; a 40-bit computer can address 2^40 = 1T memory locations.
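The relationship between address width and addressable locations can be sketched as a one-line helper (illustrative, not from the slides):

```python
def addressable_locations(address_bits: int) -> int:
    """Number of distinct locations a k-bit address can reach: 2^k."""
    return 2 ** address_bits

# 16-bit, 32-bit, and 40-bit address widths from the examples above
print(addressable_locations(16))  # 65536 (64K)
print(addressable_locations(32))  # 4294967296 (4G)
print(addressable_locations(40))  # 1099511627776 (1T)
```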
Word addressability and byte addressability
If the smallest addressable unit of information is a memory word, the machine is called word-addressable. If individual memory bytes are assigned distinct addresses, the computer is called byte-addressable. Most commercial machines are byte-addressable. For example, in a byte-addressable 32-bit computer, each memory word contains 4 bytes. A possible word-address assignment would be:
Word address | Byte addresses
0            | 0  1  2  3
4            | 4  5  6  7
8            | 8  9 10 11
Basic Memory Concepts
The word length of a computer is the number of bits actually stored or retrieved in one memory access. For example, in a byte-addressable 32-bit computer whose instructions generate 32-bit addresses, the high-order 30 bits determine which word in memory is accessed, and the low-order 2 bits determine which byte within that word. Suppose we want to access only one byte of a word: in a Read operation, the other bytes are discarded by the processor; in a Write operation, care must be taken not to overwrite the other bytes.
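The word/byte split described above can be sketched with two bitwise operations (function name is illustrative):

```python
def split_byte_address(addr: int) -> tuple[int, int]:
    """Split a 32-bit byte address into (word number, byte within word)."""
    word = addr >> 2     # high-order 30 bits select the word
    byte = addr & 0b11   # low-order 2 bits select the byte within the word
    return word, byte

# Byte address 11 lies in word 2 (bytes 8-11), at byte offset 3
print(split_byte_address(11))  # (2, 3)
```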
Basic Memory Concepts
Data transfer between the memory and the processor takes place through two processor registers:
- MAR – Memory Address Register
- MDR – Memory Data Register
If MAR is k bits long and MDR is n bits long, the memory unit may contain up to 2^k addressable locations, and during a memory cycle n bits of data are transferred between the memory and the processor. The processor therefore needs k address lines and n data lines. There are additional control lines: read/write (R/W), MFC (Memory Function Completed), the number of bytes to be transferred, etc.
[Figure 5.1. Connection of the memory to the processor: MAR drives the k-bit address bus and MDR the n-bit data bus; control lines (R/W, MFC, etc.) connect the processor to a memory of up to 2^k addressable locations with word length n bits.]
How does the processor read data from the memory?
- It loads the address of the required memory location into MAR
- It sets the R/W line to 1
- The memory responds by placing the requested data on the data lines
- The memory confirms this action by asserting the MFC signal
- Upon receipt of the MFC signal, the processor loads the data on the data lines into the MDR register
How does the processor write data into the memory?
- It loads the address of the location into MAR
- It loads the data into MDR
- It indicates a Write operation by setting the R/W line to 0
Some concepts
Memory access time: a useful measure of the speed of the memory unit. It is the time that elapses between the initiation of an operation and its completion (for example, the time between a READ request and the MFC signal).
Memory cycle time: an important measure of the memory system. It is the minimum time delay required between the initiations of two successive memory operations (for example, the time between two successive READ operations). The cycle time is usually slightly longer than the access time.
Random-Access Memory (RAM)
A memory unit is called a random-access memory if any location can be accessed for a READ or WRITE operation in some fixed amount of time that is independent of the location's address. Main memory units are of this type. This distinguishes them from serial or partly serial access storage devices such as magnetic tapes and disks, which are used as secondary storage.
Cache Memory
The CPU processes instructions and data faster than they can be fetched from a compatibly priced main memory unit, so the memory cycle time becomes the bottleneck in the system. One way to reduce the effective memory access time is to use a cache memory: a small, fast memory inserted between the larger, slower main memory and the CPU. It holds the currently active segments of a program and its data. Because of the locality of address references, the CPU finds the relevant information mostly in the cache memory itself (cache hit) and only infrequently needs to access the main memory (cache miss). With a suitable cache size, hit rates of over 90% are possible.
Memory Interleaving
This technique divides the memory system into a number of memory modules and arranges the addressing so that successive words in the address space are placed in different modules. When requests for memory access involve consecutive addresses, the accesses go to different modules. Since parallel access to these modules is possible, the average rate of fetching words from the main memory can be increased.
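As an illustrative sketch of the addressing arrangement just described (low-order interleaving, assuming the number of modules is a power of two; names are made up for illustration):

```python
def interleaved(addr: int, num_modules: int) -> tuple[int, int]:
    """Map a word address to (module number, address within module)
    under low-order interleaving."""
    return addr % num_modules, addr // num_modules

# With 4 modules, consecutive addresses 0..7 cycle through modules 0,1,2,3
print([interleaved(a, 4)[0] for a in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Because consecutive addresses land in different modules, a burst of sequential accesses can proceed in parallel across the modules.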
Virtual Memory
In a virtual memory system, the addresses generated by a program may be different from the actual physical addresses; the address generated by the CPU is referred to as a virtual or logical address. The required mapping between the physical memory and the logical address space is implemented by a special memory control unit called the memory management unit (MMU). The mapping function may be changed during program execution according to system requirements. The logical (virtual) address space can be as large as the addressing capability of the CPU, while the physical address space – the actual physical memory – can be much smaller.
Virtual memory
Only the active portion of the virtual address space is mapped onto the physical memory; the rest of the virtual address space is mapped onto a bulk storage device such as a magnetic disk (hard disk). If the addressed information is in the main memory (MM), it is accessed and execution proceeds. Otherwise, an exception is generated, in response to which the memory management unit transfers a contiguous block of words containing the desired word from the bulk storage unit to the MM, displacing some block that is currently inactive.
[Figure 5.2. Organization of bit cells in a memory chip: a 16 x 8 array of memory cells; an address decoder on inputs A0–A3 drives word lines W0–W15, Sense/Write circuits on bit lines b7 … b1, b0 connect to the data input/output lines, and R/W and CS are the control inputs.]
An example of memory organization
Consider a memory chip consisting of 16 words of 8 bits each, usually referred to as a 16 x 8 organization. The data input and data output of each Sense/Write circuit are connected to a single bidirectional data line in order to reduce the number of pins required. One control line, the R/W (Read/Write) input, is used to specify the required operation, and another control line, the CS (Chip Select) input, is used to select a given chip in a multichip memory system. This circuit requires 14 external connections (4 address, 8 data, and 2 control lines), and allowing 2 pins for the power supply and ground connections, it can be manufactured in the form of a 16-pin chip. It can store 16 x 8 = 128 bits.
[Figure 5.3. Organization of a 1K x 1 memory chip: the 10-bit address is split into a 5-bit row address, decoded to word lines W0–W31 of the memory cell array, and a 5-bit column address driving a 32-to-1 output multiplexer and input demultiplexer; Sense/Write circuitry and the R/W and CS inputs complete the chip, with a single data input/output line.]
1K x 1 memory chip
The 10-bit address is divided into two groups of 5 bits each to form the row and column addresses for the cell array. A row address selects a row of 32 cells, all of which are accessed in parallel. One of these, selected by the column address, is connected to the external data line by the input and output multiplexers. This structure can store 1024 bits and can be implemented in a 16-pin chip.
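The 5-bit/5-bit split of the 10-bit address can be sketched as follows (function name is illustrative):

```python
def row_col_1k(addr10: int) -> tuple[int, int]:
    """Split a 10-bit address of a 1K x 1 chip into (row, column)."""
    row = (addr10 >> 5) & 0b11111   # high-order 5 bits select one of 32 rows
    col = addr10 & 0b11111          # low-order 5 bits select one of 32 columns
    return row, col

print(row_col_1k(675))  # row 21, column 3
```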
Static memories
Memories that consist of circuits capable of retaining their state as long as power is applied are called static memories. Static RAMs can be accessed very quickly – within a few nanoseconds.
[Figure 5.4. A static RAM cell: a latch formed by two cross-coupled inverters, connected to bit lines b and b′ through transistors T1 and T2, which are controlled by the word line.]
The cell consists of two inverters and two transistors, T1 and T2. When the word line is at ground level, the transistors are turned off and the latch retains its state.
Read and Write operations in an SRAM
Read:
- The word line is activated to close switches T1 and T2
- If the cell is in state 1, the signal on bit line b is high and the signal on bit line b′ is low; the opposite holds if the cell is in state 0
- Sense/Write circuits at the ends of the bit lines monitor the states of b and b′ and set the output accordingly
Write:
- The state of the cell is set by placing the appropriate value on bit lines b and b′ and then activating the word line
- This forces the cell into the corresponding state
- The required signals on the bit lines are generated by the Sense/Write circuit
[Figure 5.5. An example of a CMOS memory cell: transistors T1–T6 form the cell, with bit lines b and b′, a word line, and Vsupply.]
Dynamic RAMs
Static RAMs are fast, but they come at a higher cost because their cells require several transistors. Less expensive RAMs can be implemented using fewer transistors, but their cells cannot retain their state indefinitely; these are called dynamic RAMs (DRAMs). Information is stored in the form of a charge on a capacitor. This charge can be maintained only for tens of milliseconds, so the contents must be refreshed periodically.
[Figure 5.6. A single-transistor dynamic memory cell: a transistor T and a capacitor C connected between the word line and the bit line.]
A dynamic RAM cell needs to be refreshed periodically to hold its data.
Consider a 16-Mbit DRAM chip configured as 2M x 8. The cells are organized as a 4K x 4K array. The 4096 cells in each row are divided into 512 groups of 8, so a row can store 512 bytes of data. Twelve address bits are required to select a row, and 9 bits are needed to specify a group of 8 bits within the selected row.
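The address-bit counts above follow directly from the array dimensions; a quick sketch:

```python
import math

rows = 4096               # 4K rows in the 4K x 4K cell array
groups_per_row = 512      # 4096 cells per row / 8 bits per group

row_bits = int(math.log2(rows))            # bits needed to select a row
col_bits = int(math.log2(groups_per_row))  # bits needed to select a byte group

print(row_bits, col_bits)              # 12 9
print(2 ** (row_bits + col_bits))      # 2097152 addressable bytes = 2M
```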
[Figure 5.7. Internal organization of a 2M x 8 dynamic memory chip: a 4096 x (512 x 8) cell array with a row address latch and row decoder (controlled by RAS), a column address latch and column decoder (controlled by CAS), Sense/Write circuits, CS and R/W inputs, multiplexed address lines A20–9 / A8–0, and data lines D7–D0.]
Timing is controlled asynchronously: a specialized memory controller circuit provides the necessary control signals, RAS and CAS, that govern the timing. Hence this is an asynchronous DRAM.
Fast page mode
In the chip above, all bits of a row are sensed, but only 8 bits are placed on the data lines; this byte is selected by the column address bits. A simple modification makes it possible to access other bytes of the same row without having to reselect the row: add a latch to the output of the sense amplifier in each column. The application of a row address then loads the latches corresponding to all bits in the selected row, and only different column addresses are needed to place the different bytes on the data lines. The most useful arrangement is to transfer bytes in sequential order, by applying a consecutive sequence of column addresses under the control of successive CAS signals. This scheme allows a block of data to be transferred at a much faster rate than can be achieved for transfers involving random addresses. This block transfer capability is called fast page mode.
Synchronous DRAMs
DRAMs whose operation is directly synchronized with a clock signal are called SDRAMs. The cell array is the same as in asynchronous DRAMs. The address and data connections are buffered by means of registers. The output of each sense amplifier is connected to a latch. A read operation causes the contents of all cells in the selected row to be loaded into these latches; if an access is made for refreshing purposes only, it does not change the contents of these latches. Data held in the latches that correspond to the selected column(s) are transferred into the data output register.
[Figure 5.8. Synchronous DRAM: a cell array with a row address latch and row decoder, a column address counter and column decoder, Read/Write circuits and latches, data input and output registers, a refresh counter, and a mode register and timing control block; inputs include R/W, RAS, CAS, CS, the clock, and the multiplexed row/column address.]
Synchronous DRAMs
SDRAMs have several different modes of operation, selected by writing control information into a mode register; for example, burst operations of different lengths can be specified. In SDRAMs it is not necessary to provide externally generated pulses on the CAS line to select successive columns: the necessary signals are provided internally using a column counter and the clock signal, so new data can be placed on the data lines in each clock cycle. All actions are triggered by the rising edge of the clock.
[Figure 5.9. Burst read of length 4 in an SDRAM: the row and column addresses are latched under control of the RAS and CAS signals, and data words D0–D3 appear on the data lines in successive clock cycles.]
Burst read of length 4 in an SDRAM
The row address is latched under control of the RAS signal. The memory takes about 2–3 cycles to activate the selected row. The column address is then latched under control of the CAS signal. After a delay of one cycle, the first set of data bits is placed on the data lines. The SDRAM automatically increments the column address to access the next three sets of bits in the selected row, which are placed on the data lines in successive clock cycles. SDRAMs have built-in refresh circuitry, which provides the addresses of the rows that are selected for refreshing; each row must be refreshed at least every 64 ms.
Latency and Bandwidth
These are the parameters that indicate the performance of the memory. Memory latency is the amount of time it takes to transfer a word of data to or from the memory. For block transfers, latency denotes the time it takes to transfer the first word of data, which is longer than the time needed to transfer each subsequent word of the block. In the previous diagram, the access cycle begins with the assertion of RAS, and the first word is transferred five cycles later; hence the latency is 5 clock cycles.
Bandwidth
Bandwidth is usually the number of bits or bytes that can be transferred in one second. It depends on the speed of memory access, the transfer capability of the links (the speed of the bus), and the number of bits that can be accessed in parallel. Bandwidth is the product of the rate at which data are transferred (and accessed) and the width of the data bus.
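As a worked sketch of that product (the 100 MHz / 32-bit values are illustrative, not from the slides):

```python
def bandwidth_bytes_per_sec(transfer_rate_hz: float, bus_width_bits: int) -> float:
    """Bandwidth = transfer rate x data-bus width, expressed in bytes/second."""
    return transfer_rate_hz * bus_width_bits / 8

# e.g. a bus performing 100 million transfers per second, 32 bits wide
print(bandwidth_bytes_per_sec(100e6, 32))  # 400000000.0 bytes/s = 400 MB/s
```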
Double-Data-Rate SDRAMs (DDR SDRAMs)
A standard SDRAM performs all actions on the rising edge of the clock signal. DDR SDRAMs access the cell array in the same way but transfer data on both edges of the clock. The latency is the same as for standard SDRAMs, but since data are transferred on both clock edges, the bandwidth is essentially doubled for long burst transfers. To make this possible, the cell array is organized into two banks. Each bank can be accessed separately, and consecutive words of a given block are stored in different banks. DDR SDRAMs are used efficiently in applications where block transfers are prevalent, e.g., transfers between the main memory and the processor caches.
Questions for assignment
1. Explain how the processor reads and writes data from and to memory.
2. Explain the organization of a 1K x 1 memory chip.
3. Explain a single SRAM cell with a diagram. How are read and write operations carried out?
4. Explain a DRAM cell with a diagram. How are read and write operations carried out?
5. Explain the 2M x 8 DRAM chip. How can you modify it for fast page mode?
6. Explain SDRAMs with the help of a diagram.
7. Explain the terms latency and bandwidth.
8. Explain a burst read of length 4 in an SDRAM with a timing diagram.
9. Explain DDR SDRAMs.
Structure of larger memories
Memory chips are connected together to form larger memories. There are two types of memory systems: static memory systems and dynamic memory systems.
Static memory systems
The following diagram shows the implementation of a 2M x 32 memory using sixteen 512K x 8 static memory chips. There are 4 columns, each containing 4 chips, to implement one byte position each. Only the selected chips (determined by the chip-select input) place data on the output lines. 21 address bits are needed to select a 32-bit word in this memory: the high-order 2 bits determine which of the 4 chip-select signals should be activated, and the remaining 19 bits access specific byte locations inside each chip of the selected row. The R/W inputs of all chips are tied together to form a single R/W signal. Dynamic memory systems are organized in much the same manner as static ones, but their physical implementation is more conveniently done in the form of memory modules.
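The 2-bit/19-bit address decoding described above can be sketched as follows (function name is illustrative):

```python
def decode_module_address(addr21: int) -> tuple[int, int]:
    """Split a 21-bit word address into (chip-select row, 19-bit internal address)."""
    row_select = (addr21 >> 19) & 0b11    # high-order 2 bits pick one of 4 chip rows
    internal = addr21 & ((1 << 19) - 1)   # low-order 19 bits go to every chip
    return row_select, internal

# Word 5 of the third 512K region selects chip row 2, internal address 5
print(decode_module_address((2 << 19) | 5))  # (2, 5)
```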
[Figure 5.10. Organization of a 2M x 32 memory module using 512K x 8 static memory chips: a 2-bit decoder on the high-order address bits generates the chip-select signals; the 19-bit internal chip address goes to every 512K x 8 chip; the four columns of chips drive the 8-bit data groups D31–24, D23–16, D15–8, and D7–0 of the 21-bit-addressed, 32-bit-wide memory.]
Memory System Considerations
The choice of a RAM for a given system depends on several factors: cost, speed, power dissipation, and chip size. Static RAMs are used when very fast operation is the primary requirement, mostly in cache memories. Dynamic RAMs are the predominant choice for computer main memories; the high densities achievable make large memories economically feasible.
Memory Controller
To reduce the number of pins, dynamic memory chips use multiplexed address inputs. The address is divided into two parts: the high-order address bits, which select a row in the cell array, are provided first and latched into the memory under control of the RAS signal; the low-order address bits, which select a column, are then provided on the same address pins and latched under control of the CAS signal. The processor, however, issues all bits of an address at the same time. The required multiplexing of the address bits is performed by a memory controller circuit.
[Figure 5.11. Use of a memory controller: the processor sends the full row/column address, R/W, Request, and clock signals to the memory controller, which forwards the multiplexed address together with RAS, CAS, R/W, CS, and clock to the memory; the data lines connect the processor and the memory directly.]
Memory controller functions
The controller is interposed between the processor and the memory. The processor sends a Request signal, and the controller accepts the complete address and the R/W signal from the processor. The controller forwards the row and column portions of the address to the memory and generates the RAS and CAS signals; it also sends the R/W and CS signals to the memory. The data lines are connected directly between the processor and the memory. When used with DRAM chips, the memory controller provides all the information needed to control the refreshing process; it contains a refresh counter to refresh all rows within the time limit specified for the device.
RAMBUS Memory
To increase the system bandwidth we need to increase the system bus width or the system bus speed. A wide bus is expensive and requires a lot of space on the motherboard. Rambus uses a narrow but much faster bus. Its key feature is the fast signaling method used to transfer information between chips: instead of signal levels of either 0 volts or Vsupply (5 volts), it uses the concept of differential signaling, with 0.3-volt swings above or below a reference voltage called Vref.
Read-Only Memories (ROMs)
Both SRAMs and DRAMs are volatile: they lose their data when power is turned off. Many applications need to retain data even when power is off. For example, a hard disk is used to store information, including the operating system (OS). When the system is turned on, the OS must be loaded from the hard disk into memory, which requires executing a program that boots the OS. Since the boot program is fairly large, it is stored on the disk; the processor must first execute some instructions that load the boot program into memory. So we need a small amount of nonvolatile memory that holds the instructions needed to load the boot program into RAM. Nonvolatile memories require a special type of writing process to place information in them; such a memory is called a ROM – Read-Only Memory.
[Figure 5.12. A ROM cell: a transistor T connects the bit line to ground at point P when the word line is activated; the transistor is connected to store a 0 and not connected to store a 1.]
ROM
If the transistor is connected to ground at point P, a 0 is stored; otherwise a 1 is stored. The bit line is connected to the power supply through a resistor. To read the cell, the word line is activated: if the voltage on the bit line drops, the cell contains a 0; if the voltage remains high, the cell contains a 1.
PROM
A PROM allows data to be loaded by the user. This is achieved by inserting a fuse at point P in the previous figure. Before it is programmed, the memory contains all 0s. The user can insert 1s at the required locations using high-current pulses. The process is irreversible.
EPROM
An EPROM allows the stored data to be erased and new data to be loaded; it is an erasable, reprogrammable ROM. It can be used while a system is being developed, so that changes can be accommodated. The cell structure is similar to that of a ROM, but the connection to ground is always made at point P, and a special transistor is used that can function either as a normal transistor or as a disabled transistor that is always turned off; it can be programmed to behave as a permanently open switch. The chip is erased by exposing it to ultraviolet light, which dissipates the charges trapped in the transistors of the memory cells.
EEPROM
The disadvantages of EPROMs are that the chip must be physically removed from the circuit for reprogramming and that its entire contents are erased by the UV light. An EEPROM is another version of erasable PROM that can be both programmed and erased electrically. It need not be removed for erasure, and cell contents can be erased selectively. Its disadvantage is that different voltages are needed for erasing, writing, and reading the stored data.
Flash Memory
Flash memory uses an approach similar to EEPROM: a flash cell is based on a single transistor controlled by trapped charge. In an EEPROM it is possible to read and write a single cell; in a flash memory it is possible to read the contents of a single cell, but writes are performed only on a block of cells. Flash devices have greater density, and hence higher capacity and lower cost per bit. They require a single power supply voltage and consume less power in operation, so they are used in battery-driven portable equipment such as handheld computers, cell phones, digital cameras, and MP3 players.
Flash Cards and Flash Drives
Single flash chips do not provide sufficient storage capacity, so larger memory modules are required: flash cards and flash drives.
Flash cards mount flash chips on a small card; the card is simply plugged into a conveniently accessible slot, and cards come in a variety of sizes.
Flash drives are larger modules intended to replace hard disk drives. They are designed to fully emulate hard disks, though this is not yet entirely possible: their storage capacity is significantly lower. On the other hand, they have shorter seek and access times (hence faster response), lower power consumption, and are insensitive to vibration.
Speed, Size and Cost
An ideal memory would be fast, large, and inexpensive. A very fast memory can be built with SRAM chips, but these chips are expensive, so it is impractical to build a large memory using SRAMs alone. DRAM chips are cheaper but slower. The solution for capacity is to provide large secondary storage devices: very large disks are available at reasonable prices. For the main memory, DRAMs are used, while SRAMs are used in smaller memories such as caches.
[Figure 5.13. Memory hierarchy: processor registers, primary (L1) cache, secondary (L2) cache, main memory, and magnetic-disk secondary memory; moving down the hierarchy, size increases while speed and cost per bit decrease.]
Cache Memories
The speed of the main memory is slower than that of modern processors, and the processor cannot afford to waste time accessing instructions and data in the main memory. The solution is a cache memory, which is much faster and makes the main memory appear faster to the processor than it really is. The effectiveness of a cache is based on locality of reference: many instructions in localized areas of the program are executed repeatedly during some time period, and the remainder of the program is accessed relatively infrequently. Locality takes two forms:
- Temporal – a recently executed instruction is likely to be executed again very soon
- Spatial – instructions in close proximity to a recently executed instruction (with respect to the instruction's address) are likely to be executed soon
Operation of a cache
If the active segments of the program can be placed in a fast cache memory, the total execution time can be reduced significantly. The memory control circuitry is designed to take advantage of locality of reference:
- Temporal – whenever an item (instruction or data) is first needed, it is brought into the cache, where it remains until it is needed again
- Spatial – instead of fetching just one item from the main memory into the cache, several items that reside at adjacent addresses are fetched; such a set of contiguous locations is referred to as a block or cache line
A replacement algorithm decides which block of data is moved back from the cache to the main memory so that a new block can be accommodated.
[Figure 5.14. Use of a cache memory: the cache is placed between the processor and the main memory.]
Operation of a cache
On a read request from the processor, the contents of a block of memory words containing the specified location are transferred into the cache, one word at a time. When the program later references any of the locations in this block, the desired contents are read directly from the cache. The cache can store a reasonable number of words, but it is small compared to the main memory. The correspondence between main memory blocks and those in the cache is specified by a mapping function. When the cache is full and a memory word that is not in the cache is referenced, the cache control hardware must decide which block should be removed to create space for the newly arrived block; the collection of rules for making this decision is called the replacement algorithm.
Cache operation
The processor does not need to know explicitly about the existence of the cache; it simply issues read and write requests using memory addresses. The cache control circuitry determines whether the requested word is currently in the cache. If it is, the read or write operation is performed on the appropriate cache location; in this case, a read hit or write hit is said to have occurred. For a read operation, the main memory is not involved. For a write operation, there are two options:
- Write-through protocol – both the cache and the main memory are updated simultaneously
- Write-back (or copy-back) protocol – only the cache is updated during the write operation, and the block is marked with a dirty (or modified) bit; the main memory is updated later, when the block is moved back from the cache to the main memory
Limitations of the write-through and write-back protocols
The write-through protocol is simpler, but it results in unnecessary write operations in the main memory when a given cache word is updated several times during its cache residency. The write-back protocol may also result in unnecessary write operations, because when a cache block is written back to the memory, all words of the block are written back, even if only a single word was changed while the block was in the cache.
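A toy sketch contrasting the two write policies (purely illustrative — a one-block "cache" that only counts main-memory writes):

```python
class ToyCache:
    """One-block cache illustrating write-through vs write-back traffic."""
    def __init__(self, write_through: bool):
        self.write_through = write_through
        self.data = None
        self.dirty = False
        self.mem_writes = 0   # number of main-memory write operations

    def write(self, value):
        self.data = value
        if self.write_through:
            self.mem_writes += 1   # memory is updated on every write
        else:
            self.dirty = True      # memory is updated only on eviction

    def evict(self):
        if self.dirty:
            self.mem_writes += 1
            self.dirty = False

wt, wb = ToyCache(True), ToyCache(False)
for v in range(5):            # five writes to the same cached word
    wt.write(v)
    wb.write(v)
wb.evict()
print(wt.mem_writes, wb.mem_writes)  # 5 1
```

Five updates to one resident word cost five memory writes under write-through but only one (at eviction) under write-back, which is exactly the trade-off described above.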
Read miss
A read miss occurs when the addressed word is not present in the cache. The block of words that contains the requested word is copied from the main memory into the cache; after that, the requested word is sent to the processor. Alternatively, this word may be sent to the processor as soon as it is read from the main memory; this approach, called load-through or early restart, reduces the processor's waiting period but requires more complex circuitry.
Write miss
A write miss occurs when the addressed word is not in the cache. If the write-through protocol is used, the information is written directly into the main memory. If the write-back protocol is used, the block containing the addressed word is first brought into the cache, and then the desired word in the cache is overwritten with the new information.
Mapping functions
A mapping function specifies the correspondence between main memory blocks and those in the cache. There are three techniques: direct mapping, associative mapping, and set-associative mapping. As a running example, consider a cache of 128 blocks of 16 words each, for a total of 2K (2048) words, and a main memory with a 16-bit address, i.e., 64K words organized as 4K blocks of 16 words each. Consecutive addresses refer to consecutive memory locations.
Direct Mapping
The simplest way to determine the cache location in which to store a memory block is the direct-mapping technique: block j of the main memory maps onto block j modulo 128 of the cache (refer to the following figure). Whenever one of the main memory blocks 0, 128, 256, … is loaded into the cache, it is stored in cache block 0; blocks 1, 129, 257, … are stored in cache block 1; and so on. Since more than one memory block is mapped onto a given cache block position, contention may arise even when the cache is not full. For example, the instructions of a program may start in block 1 and continue in block 129 (possibly after a branch). The contention is resolved by allowing the new block to overwrite the currently resident block.
[Figure 5.15. Direct-mapped cache: main memory blocks 0, 128, 256, … map to cache block 0; blocks 1, 129, 257, … to cache block 1; and so on up to block 4095. Each cache block carries a tag. The 16-bit main memory address is divided into a 5-bit tag, a 7-bit block field, and a 4-bit word field.]
Direct mapping contd.
The placement of a block in the cache is determined from the memory address, which is divided into three fields:
- the low-order 4 bits select 1 of the 16 words in a block
- the 7-bit cache block field determines in which cache block the new block is stored
- the high-order 5 bits are the tag bits associated with the cache location; they identify which of the 32 memory blocks that map onto this cache position is currently resident in the cache
If the tag bits of the address match the stored tag, the desired word is in the cache. If there is no match, the block containing the required word must first be read from the main memory and loaded into the cache. Direct mapping is easy to implement but not flexible.
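For the 16-bit address and 128-block cache of the running example, the tag/block/word split can be sketched as:

```python
def direct_map_fields(addr16: int) -> tuple[int, int, int]:
    """Split a 16-bit address into (5-bit tag, 7-bit cache block, 4-bit word)."""
    word = addr16 & 0xF            # low-order 4 bits: word within the block
    block = (addr16 >> 4) & 0x7F   # next 7 bits: cache block = memory block mod 128
    tag = (addr16 >> 11) & 0x1F    # high-order 5 bits: tag
    return tag, block, word

# Word 5 of memory block 129: maps to cache block 129 % 128 = 1, tag 129 // 128 = 1
addr = 129 * 16 + 5
print(direct_map_fields(addr))  # (1, 1, 5)
```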
Associative mapping
In this technique, a main memory block can be placed into any cache block position, and 12 tag bits identify a memory block when it is resident in the cache. The tag bits of an address received from the processor are compared to the tag bits of each block of the cache to see whether the desired block is present; this is called the associative-mapping technique. It gives complete freedom in choosing the cache location in which to place a memory block: a new block has to replace an existing block only if the cache is full, and a replacement algorithm is needed to choose which block to replace. The cost of this mapping technique is higher than that of direct mapping, because all 128 tag patterns must be searched – an associative search.
Figure 5.16. Associative-mapped cache. (Main memory address fields: 12-bit tag, 4-bit word. Any of the 4096 main memory blocks can be placed in any of the 128 cache blocks.)
Set-associative mapping A combination of direct mapping and associative mapping. The blocks of the cache are grouped into sets, and the mapping allows a block of the main memory to reside in any block of a specific set. So we have a few choices of where to place a block, which eases the contention problem of the direct method, while the hardware cost is reduced by decreasing the size of the associative search. The following figure is an example with 2 blocks per set: memory blocks 0, 64, 128, …, 4032 map into cache set 0, and each can occupy either of the two block positions within this set. There are 64 sets in total, so we need 6 bits to choose a set; the tag field of the address is then compared with the tags of the cache blocks in the selected set to check whether the desired block is present.
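The set selection above can be sketched as follows (a hypothetical illustration matching the 6/6/4-bit address split of Figure 5.17):

```python
# Hypothetical sketch of the 6/6/4-bit address split used by the
# 2-way set-associative cache of Figure 5.17.

def split_set_assoc(address):
    word = address & 0xF            # low-order 4 bits: word within block
    set_ = (address >> 4) & 0x3F    # 6 bits: one of 64 sets
    tag  = (address >> 10) & 0x3F   # high-order 6 bits: tag
    return tag, set_, word

# Memory blocks 0, 64, 128 all map into set 0, with different tags:
for blk in (0, 64, 128):
    tag, s, _ = split_set_assoc(blk * 16)   # 16 words per block
    print("block", blk, "-> set", s, "tag", tag)
```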
Figure 5.17. Set-associative-mapped cache with two blocks per set. (Main memory address fields: 6-bit tag, 6-bit set, 4-bit word. The cache has 64 sets, Set 0 to Set 63, of two blocks each.)
Set associative mapping contd… The number of blocks per set is a parameter that can be selected to suit the requirements of the computer. Four blocks per set can be accommodated by a 5-bit set field, and eight blocks per set by a 4-bit set field. 128 blocks per set requires no set bits and gives the fully associative technique, with 12 tag bits. The other extreme, one block per set, is the direct-mapping method. A cache with k blocks per set is called a k-way set-associative cache.
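The trade-off above can be checked with a few lines of arithmetic, assuming (as in the figures) a 128-block cache with 12 bits shared between the tag and set fields:

```python
# Small consistency check (assumes the 128-block cache above, with
# 12 bits shared between the tag and set fields).
import math

for k in (1, 2, 4, 8, 128):
    sets = 128 // k                  # k blocks per set -> 128/k sets
    set_bits = int(math.log2(sets))  # bits needed to select a set
    tag_bits = 12 - set_bits         # remaining bits form the tag
    print(f"{k}-way: {set_bits} set bits, {tag_bits} tag bits")
```

This reproduces the numbers in the text: 4-way needs a 5-bit set field, 8-way a 4-bit set field, 128-way no set bits at all.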
Valid bit and cache coherence problem A control bit called the valid bit is provided for each block. It indicates whether the block contains valid data. It is different from the dirty or modified bit; the dirty bit is required only in systems that do not use the write-through method. The valid bit is initially 0 when power is applied to the system or when main memory is loaded with new programs and data from the disk. Transfers from the disk to the main memory are carried out by a DMA mechanism. Normally DMA transfers bypass the cache (for both cost and performance reasons).
Valid bit and cache coherence problem The valid bit of a block is set to 1 the first time the block is loaded from the main memory. Whenever a main memory block is updated by a source that bypasses the cache, a check is made to determine whether the block being updated is currently in the cache. If so, its valid bit is cleared to 0; this ensures that stale data do not exist in the cache. Whenever a DMA transfer is made from the main memory to the disk and the cache uses the write-back protocol, the data in memory might not reflect the changes made in the cached copy. Solution: flush the cache by forcing the dirty data to be written back to memory before the DMA transfer takes place. The need to ensure that two different entities (the processor and DMA in this case) use the same copies of data is referred to as the cache coherence problem.
Replacement algorithms In the direct mapping method, the position of each block is predetermined, so no replacement strategy exists. In the associative and set-associative strategies there is some flexibility: if the cache is full when a new block arrives, the cache controller must decide which of the old blocks to overwrite. This decision is very important and affects system performance; the aim is to keep in the cache those blocks that are likely to be referenced in the near future. Some algorithms replace the least recently used (LRU) block, the oldest block, or a random block.
Least Recently Used (LRU) replacement algorithm Uses the property of locality of reference: there is a high probability that blocks referenced recently will be referenced again soon. So when a block needs to be overwritten, overwrite the one that has gone the longest time without being referenced. This block is called the least recently used (LRU) block. The cache controller must track references to all the blocks; it uses a 2-bit counter per block for a set of 4 blocks. When a hit occurs, the referenced block's counter is set to 0, counters with lower values are incremented by 1, and counters with higher values are unchanged. When a miss occurs and the set is not full, the new block is loaded and its counter is set to 0. When a miss occurs and the set is full, the block with counter value 3 is removed, the new block is put in its place with counter value 0, and the other 3 blocks' counters are incremented by 1.
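The counter rules above can be sketched as follows; this is a simple model of one 4-block set, not actual cache hardware:

```python
# Sketch of the 2-bit LRU counters described above, for one 4-block
# set (an assumption-level model, not cache hardware).

def lru_hit(counters, i):
    """Reference to resident block i: its counter goes to 0,
    counters with lower values are incremented."""
    old = counters[i]
    for j in range(len(counters)):
        if counters[j] < old:
            counters[j] += 1
    counters[i] = 0

def lru_miss_full(counters):
    """Set full: evict the block with counter 3, load the new block
    there with counter 0, increment the other three counters."""
    victim = counters.index(3)
    for j in range(len(counters)):
        counters[j] = 0 if j == victim else counters[j] + 1
    return victim

c = [0, 1, 2, 3]          # block 0 most recent, block 3 least recent
lru_hit(c, 3)             # block 3 becomes most recently used
print(c)                  # [1, 2, 3, 0]
print(lru_miss_full(c))   # evicts block 2 (its counter reached 3)
print(c)                  # [2, 3, 0, 1]
```

Note that the counters of a set always hold a permutation of 0–3, so the block with value 3 is unambiguously the LRU block.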
Reading assignment: Go through the examples of mapping techniques in the textbook.
Performance considerations Two key factors in the success of a computer are cost and performance. The objective is to achieve the best possible performance at the lowest possible cost; the challenge for the designer is to improve performance without increasing cost. The measure of success is the price/performance ratio. Performance depends on how fast instructions can be brought into the processor for execution and how fast they can be executed; in this unit, we focus on the first aspect. For memory we need shorter access time and larger capacity. If we have a slow unit and a faster unit, it is beneficial to transfer data at the rate of the faster unit; to achieve this we use parallel access, with a technique called interleaving.
Interleaving The main memory of a computer is structured as a collection of physically separate modules, each with its own address buffer register (ABR) and data buffer register (DBR). Memory access operations may proceed in more than one module at the same time. There are two ways of distributing addresses across the modules:
1. The high-order k bits name one of the n modules and the low-order m bits name a particular word in that module. When consecutive locations are accessed, only one module is involved, but devices with DMA capability can access information in other memory modules in parallel.
2. The low-order k bits select a module and the high-order m bits name a location within that module. Consecutive addresses are then located in successive modules, giving faster access and higher average utilization. This is called memory interleaving and is the more effective way.
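The two addressing schemes can be sketched as follows; the module count and address width are assumptions chosen for illustration:

```python
# Illustration with assumed parameters: 4 modules (k = 2 bits) and
# 16-bit addresses, contrasting the two address assignments.

K, M = 2, 14   # k module-select bits, m address-in-module bits

def high_order_select(addr):   # scheme (a): consecutive words in a module
    return addr >> M, addr & ((1 << M) - 1)    # (module, word)

def low_order_select(addr):    # scheme (b): interleaving
    return addr & ((1 << K) - 1), addr >> K    # (module, word)

# Four consecutive addresses hit one module vs. four different modules:
print([high_order_select(a)[0] for a in range(4)])  # [0, 0, 0, 0]
print([low_order_select(a)[0] for a in range(4)])   # [0, 1, 2, 3]
```

The second print shows why interleaving helps: a sequential stream of addresses keeps all four modules busy at once.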
Figure 5.25. Addressing multiple-module memory systems: (a) consecutive words in a module; (b) consecutive words in consecutive modules. (The address is split into k module-select bits and m address-in-module bits; each module has its own ABR and DBR.)
Go through the example in the text for better understanding.
Problem Consider a cache with 8-word blocks. On a read miss, the block that contains the desired word must be copied from the main memory into the cache. Assume it takes one clock cycle to send an address to the main memory. The memory is built using DRAM chips: the first word access takes 8 clock cycles, and subsequent words in the same block can be accessed in 4 clock cycles per word. One clock cycle is needed to send one word to the cache. Using a single memory module, the time needed to load the desired block into the cache is 1 + 8 + (7 × 4) + 1 = 38 clock cycles. Using memory interleaving, 4 words are accessed in 8 clock cycles and transferred to the cache in the next 4 clock cycles word by word, during which the remaining 4 words are read and stored in the DBRs; these 4 words are then transferred one word at a time. So the time required to transfer a block is 1 + 8 + 4 + 4 = 17 clock cycles.
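The two cycle counts can be re-derived directly:

```python
# Re-deriving the cycle counts above: 8-word block, 1 cycle to send
# the address, DRAM gives the first word in 8 cycles and each later
# word in 4 cycles, and 1 cycle moves a word into the cache.

single_module = 1 + 8 + 7 * 4 + 1
print(single_module)      # 38

# Four-way interleaved: the 4 modules deliver their first words after
# 8 cycles; while those are sent to the cache (4 cycles) the second
# word of each module is read, then sent in 4 more cycles.
interleaved = 1 + 8 + 4 + 4
print(interleaved)        # 17
```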
Hit rate and miss penalty The number of hits stated as a fraction of all attempted accesses is called the hit rate. The number of misses stated as a fraction of all attempted accesses is called the miss rate. Hit rates well over 0.9 are essential for high-performance computers. Performance is adversely affected by the actions that must be taken after a miss: the extra time needed to bring the desired information into the cache is called the miss penalty. More generally, the miss penalty is the time needed to bring a block of data from a slower unit in the memory hierarchy to a faster unit. Interleaving can reduce the miss penalty substantially.
Problem Let h be the hit rate, M the miss penalty (the time to access information in the main memory), and C the time to access information in the cache. The average access time is tave = hC + (1 − h)M. Consider the same parameters as in the previous problem. If the computer has no cache, then using a fast processor and a typical DRAM main memory, each memory read access takes 10 clock cycles. Suppose the computer has a cache that holds 8-word blocks and an interleaved main memory; then it requires 17 cycles (as discussed before) to load one block into the cache. Suppose 30 percent of the instructions in a program perform a read or write operation, giving 130 memory accesses for every 100 instructions. Assume the hit rates are 0.95 for instructions and 0.9 for data, and that the miss penalty is the same for both read and write accesses. An estimate of the improvement in performance is
time without cache / time with cache = (130 × 10) / (100(0.95 × 1 + 0.05 × 17) + 30(0.9 × 1 + 0.1 × 17)) = 5.04
So the computer with a cache performs 5 times better (assuming the processor clock and the system bus have the same speed).
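The arithmetic above can be checked directly:

```python
# Checking the estimate above with t_ave = h*C + (1 - h)*M.

C, M = 1, 17              # cache access time; miss penalty in cycles
h_inst, h_data = 0.95, 0.90

time_without_cache = 130 * 10    # 130 accesses per 100 instructions
time_with_cache = (100 * (h_inst * C + (1 - h_inst) * M)
                   + 30 * (h_data * C + (1 - h_data) * M))
print(round(time_without_cache / time_with_cache, 2))  # 5.04
```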
Caches on the processor chip From the speed point of view, the optimal place for a cache is on the processor chip. Since space on the processor chip is required for many other functions, this limits the size of the cache that can be accommodated. A designer can use either a combined cache for instructions and data (which offers greater flexibility in mapping) or separate caches for instructions and data (which allow parallel access to instructions and data, at the cost of more complex circuitry). Normally 2 levels of caches are used, L1 and L2. L1 is designed to allow very fast access by the processor; its access time has a large effect on the clock rate of the processor. L2 can be slower, but it should be much larger to ensure a high hit rate. A workstation computer may include an L1 cache with a capacity of tens of kilobytes and an L2 cache with a capacity of several megabytes. Including an L2 cache further reduces the impact of the main memory speed on the performance of the computer.
Cache on processor chip The average access time experienced by the processor with 2 levels of caches is
tave = h1C1 + (1 − h1)h2C2 + (1 − h1)(1 − h2)M
where h1 is the hit rate in L1, h2 the hit rate in L2, C1 the time to access information in the L1 cache, C2 the time to access information in the L2 cache, and M the time to access information in the main memory. The number of misses in the L2 cache must be very low.
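For illustration, the formula can be evaluated with assumed parameter values (these numbers are not from the text):

```python
# Worked example of the two-level formula; the parameter values here
# are illustrative assumptions, not figures from the text.

def tave(h1, h2, C1, C2, M):
    return h1 * C1 + (1 - h1) * h2 * C2 + (1 - h1) * (1 - h2) * M

# 95% L1 hits, 90% of L1 misses hit in L2, 1 / 10 / 100-cycle access times:
print(round(tave(0.95, 0.90, 1, 10, 100), 2))  # 1.9
```

Even though main memory costs 100 cycles here, the two caches keep the average access time under 2 cycles, which is why a low L2 miss rate matters so much.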
Write buffers A write buffer is a temporary storage area for write requests. It is used when the write-through protocol is in effect: each write operation then results in writing a new value to the memory. If the processor waits for each memory write to be completed, it is slowed down, yet the processor does not immediately require the result of a write operation. So instead of waiting for the write operation to complete, the processor places the write request into the buffer and continues executing the next instruction. The write requests are sent to memory whenever the memory is not servicing read requests, because read requests must be serviced immediately; the processor cannot proceed without the data to be read.
Write buffers The write buffer holds a number of write requests. A read request may refer to data that are still in the write buffer, so the addresses of data to be read from memory are compared with the addresses of the data in the write buffer; in case of a match, the data in the write buffer are used. A write buffer is also useful when the write-back protocol is used. There, write operations are simply performed on the corresponding word in the cache. When a new block of data comes into the cache as a result of a read miss, it may replace an existing block that contains dirty (modified) data, which must be written back into the main memory. If the write-back operation is performed first, the processor has to wait longer for the new block to be read into the cache. So, to perform the read first, a fast write buffer is provided for temporary storage of the dirty block that is ejected; afterwards the contents of the write buffer are written into memory.
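The behaviour described above can be sketched as follows; the class and its interface are assumptions for illustration, not a controller from the text:

```python
# Minimal sketch (assumed interface, not a real controller) of a
# write buffer: writes are queued instead of stalling the processor,
# and a read checks the buffer for a matching address first.
from collections import OrderedDict

class WriteBuffer:
    def __init__(self):
        self.pending = OrderedDict()    # address -> data awaiting memory

    def write(self, addr, data):
        self.pending[addr] = data       # processor continues at once

    def read(self, addr, memory):
        if addr in self.pending:        # address match: use buffered data
            return self.pending[addr]
        return memory[addr]             # otherwise read from memory

    def drain(self, memory):            # memory idle: send writes out
        while self.pending:
            addr, data = self.pending.popitem(last=False)
            memory[addr] = data

memory = {0x10: 5}
buf = WriteBuffer()
buf.write(0x10, 99)
print(buf.read(0x10, memory))   # 99: buffered value, not stale memory
buf.drain(memory)
print(memory[0x10])             # 99
```

The address comparison in `read` is the software analogue of the match check described above; without it a read could return stale data from memory.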
Prefetching New data are brought into the cache when they are first needed, and the processor has to pause until the new data arrive. To avoid this stalling of the processor, it is possible to prefetch the data into the cache before they are needed. Prefetching can be done either in software or in hardware. In software, a separate prefetch instruction is included, which loads the data into the cache by the time they are required in the program; this allows overlapping of accesses to the main memory with computation in the processor. Prefetch instructions can be inserted either by the compiler or by the programmer; compiler insertion is better. In hardware, circuitry is added that attempts to discover a pattern in memory references and prefetches data according to this pattern.
Lock-up free cache Software prefetching does not work well if it interferes with the normal execution of instructions, that is, if the act of prefetching stops other accesses to the cache until the prefetch is completed. A cache of this type is said to be locked while it services a miss. The solution is to allow the processor to access the cache while a miss is being serviced. A cache that allows multiple outstanding misses is called a lockup-free cache. Since only one miss can be serviced at a time, the cache must have circuitry to keep track of all outstanding misses, for example special registers that hold the pertinent information.
VIRTUAL Memories Refer to slides given separately
SECONDARY Storage Magnetic hard disks
Organization and accessing of data on a disk Access time Typical disks Data buffer/cache Disk controller Floppy disks RAID Disk arrays Commodity disk considerations
Optical disks CD technology CD-ROM CD-Recordable CD-Rewritable DVD Technology DVD-RAM
Magnetic tape systems