3. Embedded Memories


Contents:
1. Scratchpad memories
2. Flash memories
3. Embedded DRAM
4. Cache memories
5. MESI protocol
6. Cache coherence
7. Directory-based coherence

1. Scratchpad memories:

Scratchpad memory (SPM), also known as scratchpad, scratchpad RAM or local store in computer terminology, is a high-speed internal memory used for temporary storage of calculations, data, and other work in progress. In reference to a microprocessor ("CPU"), scratchpad refers to a special high-speed memory circuit used to hold small items of data for rapid retrieval.

It can be considered similar to the L1 cache in that it is the next closest memory to the ALU after the internal registers, with explicit instructions to move data to and from main memory, often using DMA-based data transfer. In contrast with a system that uses caches, a system with scratchpads has non-uniform memory access latencies, because the access latencies to the different scratchpads and to main memory vary. Another difference from a system that employs caches is that a scratchpad commonly does not contain a copy of data that is also stored in main memory.


Scratchpads are employed to simplify caching logic and to guarantee that a unit can work without main-memory contention in a system employing multiple processors, especially in multiprocessor systems-on-chip for embedded systems. They are mostly suited to storing temporary results (such as would be found in the CPU stack) that typically would not need to be committed to main memory; however, when fed by DMA, they can also be used in place of a cache to mirror the state of slower main memory. The same issues of locality of reference apply to efficiency of use, although some systems allow strided DMA to access rectangular data sets. Another difference is that scratchpads are explicitly manipulated by applications.

Scratchpads are not used in mainstream desktop processors, where generality is required so that legacy software can run from generation to generation even as the available on-chip memory size changes. They are better suited to embedded systems, special-purpose processors and game consoles, where chips are often manufactured as MPSoCs and where software is often tuned to one hardware configuration.
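As a rough illustration of this explicit management, the sketch below stages a tile of data into a scratchpad buffer, processes it there, and writes the result back. The dma_copy helper and the ".scratchpad" linker-section placement are hypothetical stand-ins for whatever DMA engine API and memory-placement mechanism a given SoC provides; they are not from the text.

```c
#include <stdint.h>
#include <stddef.h>

#define SPM_WORDS 256

/* Hypothetical: place this buffer in the on-chip scratchpad via a linker section. */
static int32_t spm_buf[SPM_WORDS] __attribute__((section(".scratchpad")));

/* Hypothetical DMA helper: copies 'n' words between main memory and the scratchpad. */
extern void dma_copy(int32_t *dst, const int32_t *src, size_t n);

/* Scale a large array in main memory, one scratchpad-sized tile at a time.
 * The application, not a cache controller, decides what lives in the SPM. */
void scale_array(int32_t *data, size_t len, int32_t factor)
{
    for (size_t base = 0; base < len; base += SPM_WORDS) {
        size_t n = (len - base < SPM_WORDS) ? (len - base) : SPM_WORDS;

        dma_copy(spm_buf, &data[base], n);      /* stage tile into scratchpad */
        for (size_t i = 0; i < n; i++)          /* compute out of fast local memory */
            spm_buf[i] *= factor;
        dma_copy(&data[base], spm_buf, n);      /* write results back to main memory */
    }
}
```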

2. Flash Memories:

Flash memory, or flash RAM, is a type of non-volatile semiconductor memory device in which stored data persists even when the device is not electrically powered. It is an improved version of electrically erasable programmable read-only memory (EEPROM). The difference between flash memory and EEPROM is that EEPROM erases and rewrites its content one byte at a time, in other words at byte level, whereas flash memory erases or writes its data in entire blocks, which makes it much faster than EEPROM. Flash memory cannot replace DRAM and SRAM, because flash cannot match either the speed at which DRAM/SRAM access data or their ability to address individual bytes.

Flash stores data by adding electrons to or removing them from its floating gate. The charge on the floating gate affects the threshold voltage of the memory element. When electrons are present on the floating gate, no current flows through the transistor, indicating a logic 0. When electrons are removed from the floating gate, the transistor starts conducting, indicating a logic 1. This is achieved by applying voltages between the control gate and the source or drain.

Fowler-Nordheim (F-N) tunneling and hot-electron injection are the processes by which these operations are carried out in the flash cell. Tunneling is a process in which electrons are transported through a barrier; here the barrier is the SiO2 insulator layer surrounding the floating gate. The tunneling process in oxide was first reported by Fowler and Nordheim, hence the name.


Figure: a NOR flash cell array with two bit lines, three word lines, and a common source line.

1. Erase operation: The raw (default) state of flash memory cells (for a single-level NOR flash cell) is all 1s, because the floating gates carry no negative charge. Erasing a flash memory cell (resetting it to a logical 1) is achieved by applying a voltage across the source and the control gate (word line). The control-gate voltage can be in the range of -9 V to -12 V, while around 6 V is applied to the source. The electrons in the floating gate are pulled off and transferred to the source by quantum tunneling (a tunnel current). In other words, electrons tunnel from the floating gate to the source and substrate.

2. Write (program) operation: A NOR flash cell can be programmed, or set to a binary 0, by the following procedure. During a write, a high voltage of around 12 V is applied to the control gate (word line). If a high voltage of around 7 V is also applied to the bit line (drain terminal), a 0 is stored in the cell. The channel is now turned on, so electrons can flow from the source to the drain. The source-drain current is sufficiently high to cause some high-energy electrons to jump through the thin oxide layer onto the floating gate, via a process called hot-electron injection. Because of the applied voltage, the excited electrons are forced through and trapped on the other side of the thin oxide layer, giving the floating gate a negative charge. These negatively charged electrons act as a barrier between the control gate and the floating gate.

If a low voltage is applied to the drain via the bit line, the amount of charge on the floating gate remains the same and the logic state does not change, storing a 1. Since the floating gate is insulated by oxide, the charge accumulated on it will not leak away, even if the power is turned off. A circuit called a cell sensor watches the level of charge passing through the floating gate. If the flow through the gate is above the 50-percent threshold, the cell has a value of 1; when it declines below the 50-percent threshold, the value changes to 0. Because of the very good insulation properties of SiO2, the charge on the floating gate leaks away only very slowly.

3. Read operation: Apply a voltage of around 5 V to the control gate and around 1 V to the drain. The state of the memory cell is distinguished by the current flowing between the drain and the source. To read the data, a voltage is applied to the control gate, and the MOSFET channel will either conduct or remain insulating, depending on the threshold voltage of the cell, which is in turn controlled by the charge on the floating gate. The current flowing through the MOSFET channel is sensed and forms a binary code, reproducing the stored data.
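A minimal sketch of the read behaviour just described, assuming a simple threshold model: a programmed cell (charged floating gate) has a raised threshold and does not conduct at the read voltage, so it is sensed as 0, while an erased cell conducts and is sensed as 1. The threshold constants are illustrative assumptions, not values from the text.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative threshold model of a single NOR flash cell. */
typedef struct {
    double vth;            /* threshold voltage of the cell transistor (V) */
} flash_cell;

#define VTH_ERASED      2.0   /* assumed threshold when the floating gate is uncharged */
#define VTH_PROGRAMMED  7.0   /* assumed threshold when the floating gate is charged */
#define V_READ_GATE     5.0   /* read voltage on the control gate, as in the text */

/* Erase: remove charge from the floating gate -> low threshold -> reads as 1. */
static void erase(flash_cell *c)   { c->vth = VTH_ERASED; }

/* Program: trap charge on the floating gate -> high threshold -> reads as 0. */
static void program(flash_cell *c) { c->vth = VTH_PROGRAMMED; }

/* Read: the cell conducts only if the gate voltage exceeds its threshold. */
static bool read_bit(const flash_cell *c)
{
    bool conducts = V_READ_GATE > c->vth;
    return conducts;       /* a conducting channel is sensed as logic 1 */
}

int main(void)
{
    flash_cell c;
    erase(&c);   printf("after erase:   %d\n", read_bit(&c));  /* prints 1 */
    program(&c); printf("after program: %d\n", read_bit(&c));  /* prints 0 */
    return 0;
}
```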


3. Embedded DRAM

Figure: a single-transistor DRAM cell on bit line Bi, selected by a word line, with column select, R/W, chip select and data signals connected through the write/sense amplifier.

A typical cell in a DRAM array appears as illustrated in the figure. The read and write operations use a single bit line.

READ Operation

A value is read from the cell by first precharging Bi to a voltage that is halfway between a 0 and a 1. Asserting the word line couples the stored charge onto Bi, and charge sharing shifts the voltage on Bi slightly in a direction that depends on the stored value. The change is sensed and amplified by the write/sense amplifier. The read operation causes the capacitor to discharge, so the sensed and amplified value is placed back onto the bit line; this is called the restore or rewrite operation.

WRITE Operation

A value is written into the cell by applying a logical 0 or logical 1 to Bi through the write/sense amplifier. Asserting the word line charges the capacitor if a logical 1 is to be stored and discharges it if a logical 0 is to be stored.

REFRESH Operation

Dynamic memories store data only for a short period of time on the parasitic capacitor associated with a MOS transistor. If the charge stored on the capacitor is not replaced periodically it will leak away, thereby losing the data stored in the memory. Replacement is implemented by executing a read operation followed by a rewrite of the data back into the cell. Such replacement is referred to as a refresh, or a refresh cycle. The time between two refresh operations is called the refresh interval; it is determined by the vendor's specification and the system in which the memory is operating.
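The sketch below shows one plausible shape for distributed refresh: a counter walks through the rows, and every row is read and restored once per refresh interval. The row count, the 64 ms interval and the dram_read_restore_row hook are assumptions for illustration, not details from the text.

```c
#include <stdint.h>

#define NUM_ROWS            8192       /* assumed number of rows in the array */
#define REFRESH_INTERVAL_US 64000u     /* assumed: every row refreshed within 64 ms */

/* Hypothetical low-level hook: activating a row reads it and, because the sense
 * amplifiers restore the bit lines, rewrites (refreshes) every cell in that row. */
extern void dram_read_restore_row(uint32_t row);

/* Distributed refresh: called from a periodic timer tick, it refreshes one row
 * per call, spaced so that the whole array is covered once per refresh interval. */
void dram_refresh_tick(void)
{
    static uint32_t next_row = 0;

    dram_read_restore_row(next_row);           /* read + restore = one refresh cycle */
    next_row = (next_row + 1) % NUM_ROWS;      /* move on to the next row */
}

/* Required tick period so that all NUM_ROWS rows fit in one refresh interval
 * (integer microseconds: 64000 / 8192 = 7, i.e. roughly one row every 7.8 us). */
uint32_t refresh_tick_period_us(void)
{
    return REFRESH_INTERVAL_US / NUM_ROWS;
}
```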

4. Cache memories:

Hierarchical memory

There are three memory design goals:

• The size of the memory must be large and no constraints should be imposed on program or data size.

• The speed of the memory must be as fast as the fastest memory technology available.

• The cost of the memory must approach the cost of the cheapest memory technology available.

These goals cannot all be achieved simultaneously, since they conflict with one another. However, there is a practical compromise that approaches the ideal. The implementation is based on a hierarchy of memory levels. The memory closest to the processor is the fastest and the most expensive; this level is called the cache. The lowest level of the hierarchy is the slowest and uses the cheapest type of memory available, which usually consists of magnetic disks. Disk memory provides a huge amount of cheap storage space but has a large access time.


The levels in the hierarchy are ordered by their speed, cost and size. When a memory access is issued, the processor looks first into the level with the fastest memory; in the figure above, this is the cache. If the required data is present in the cache, the data is transferred to the processor and the memory operation ends very quickly. However, if the data is not present in the cache, another request is issued to the next level of the memory hierarchy, which in the figure is the main memory. The process continues in a similar way until the level that contains the requested data is reached. So even if the lower levels perform poorly, the system performance approaches the performance of the top level (in our example the cache memory), which is very fast compared with the lower levels. This is due to the hierarchical structure of the memory and the locality principle stated in the introduction.

Basic Cache Structure

Processors are generally able to perform operations on operands faster than the access time of large-capacity main memory. Though semiconductor memory which can operate at speeds comparable with the processor exists, it is not economical to build all of main memory from very high speed devices. The problem can be alleviated by introducing a small block of high-speed memory, called a cache, between the main memory and the processor.

The idea of cache memories is similar to virtual memory in that some active portion of a low-speed memory is stored in duplicate in a higher-speed cache memory. When a memory request is generated, the request is first presented to the cache memory, and if the cache cannot respond, the request is then presented to main memory.

The difference between cache and virtual memory is a matter of implementation; the two notions are conceptually the same because they both rely on the correlation properties observed in sequences of address references. Cache implementations are totally different from virtual memory implementation because of the speed requirements of cache.

We define a cache miss to be a reference to an item that is not resident in the cache but is resident in main memory. The corresponding concept for virtual memory is the page fault, which is defined to be a reference to a page in virtual memory that is not resident in main memory. For cache misses, the fast memory is the cache and the slow memory is main memory. For page faults, the fast memory is main memory and the slow memory is auxiliary memory.

Figure: a cache-memory reference. The tag 0117X matches address 01173, so the cache returns the item in position X = 3 of the matched block.

The figure shows the structure of a typical cache memory. Each reference to a cell in memory is presented to the cache. The cache searches its directory of address tags, shown in the figure, to see if the item is in the cache. If the item is not in the cache, a miss occurs.

For READ operations that cause a cache miss, the item is retrieved from main memory and copied into the cache. During the short period available before the main-memory operation is complete, some other item in the cache is removed from the cache to make room for the new item.


The cache-replacement decision is critical; a good replacement algorithm can yield somewhat higher performance than a bad one. The effective cycle time of a cache memory (t_eff) is the average of the cache-memory cycle time (t_cache) and the main-memory cycle time (t_main), where the probabilities in the averaging process are the probabilities of hits and misses. If we consider only READ operations, then a formula for the average cycle time is:

t_eff = t_cache + (1 - h) * t_main

where h is the probability of a cache hit (sometimes called the hit rate); the quantity (1 - h), which is the probability of a miss, is known as the miss rate. The figure shows an item in the cache surrounded by nearby items, all of which are moved into and out of the cache together. We call such a group of data a block of the cache.
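As a quick illustration of the formula above, the following sketch computes t_eff for a few hit rates; the cycle times used are made-up example values, not figures from the text.

```c
#include <stdio.h>

/* Effective cycle time of a cache + main memory pair, as in the text:
 * t_eff = t_cache + (1 - h) * t_main, where h is the hit rate. */
static double t_eff(double t_cache, double t_main, double hit_rate)
{
    return t_cache + (1.0 - hit_rate) * t_main;
}

int main(void)
{
    const double t_cache = 1.0;    /* assumed cache cycle time, ns */
    const double t_main  = 60.0;   /* assumed main-memory cycle time, ns */

    /* As h approaches 1, the effective time approaches the cache cycle time. */
    for (double h = 0.80; h <= 1.0001; h += 0.05)
        printf("h = %.2f  ->  t_eff = %.1f ns\n", h, t_eff(t_cache, t_main, h));

    return 0;
}
```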

Cache Organization

Within the cache, there are three basic types of organization:

1. Direct Mapped
2. Fully Associative
3. Set Associative

In fully associative mapping, when a request is made to the cache, the requested address is compared in a directory against all entries in the directory. If the requested address is found (a directory hit), the corresponding location in the cache is fetched and returned to the processor; otherwise, a miss occurs. 


Fully Associative Cache

In a direct mapped cache, lower order line address bits are used to access the directory. Since multiple line addresses map into the same location in the cache directory, the upper line address bits (tag bits) must be compared with the directory address to ensure a hit. If a comparison is not valid, the result is a cache miss, or simply a miss. The address given to the cache by the processor actually is subdivided into several pieces, each of which has a different role in accessing data.

Direct Mapped Cache
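The sketch below shows how the address might be split into tag, index and offset fields for a direct-mapped cache, and how the directory (tag) comparison decides hit or miss. The geometry (64 lines of 16 bytes) is chosen to match worked example 2 later in this section; the structure itself is a generic illustration, not a specific hardware design from the text.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define LINE_BYTES   16                    /* block size: 16 bytes -> 4 offset bits */
#define NUM_LINES    64                    /* 64 lines -> 6 index bits */
#define OFFSET_BITS  4
#define INDEX_BITS   6

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
} cache_line;

static cache_line cache[NUM_LINES];

/* Split a 32-bit byte address into the fields used to access the cache. */
static uint32_t addr_offset(uint32_t a) { return a & (LINE_BYTES - 1); }
static uint32_t addr_index (uint32_t a) { return (a >> OFFSET_BITS) & (NUM_LINES - 1); }
static uint32_t addr_tag   (uint32_t a) { return a >> (OFFSET_BITS + INDEX_BITS); }

/* Direct-mapped lookup: the index selects exactly one line; the stored tag
 * must match the upper address bits (and the line must be valid) for a hit. */
static bool cache_lookup(uint32_t addr, uint8_t *out)
{
    cache_line *line = &cache[addr_index(addr)];
    if (line->valid && line->tag == addr_tag(addr)) {
        *out = line->data[addr_offset(addr)];
        return true;                       /* hit */
    }
    return false;                          /* miss: fetch the block from main memory */
}

int main(void)
{
    uint8_t v;
    printf("address 1200 -> index %u, tag %u, hit=%d\n",
           (unsigned)addr_index(1200), (unsigned)addr_tag(1200),
           cache_lookup(1200, &v));
    return 0;
}
```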

The set associative cache operates in a fashion somewhat similar to the direct-mapped cache. Bits from the line address are used to address a cache directory. However, now there are multiple choices: two, four, or more complete line addresses may be present in the directory. Each of these line addresses corresponds to a location in a sub-cache. The collection of these sub-caches forms the total cache array. In a set associative cache, as in the direct-mapped cache, all of these sub-arrays can be accessed simultaneously, together with the cache directory. If any of the entries in the cache directory match the reference address, there is a hit, and the particular sub-cache array is selected and gated out to the processor.

Set Associative Cache

Cache Performance

The performance of a cache can be quantified in terms of the hit and miss rates, the cost of a hit, and the miss penalty, where a cache hit is a memory access that finds data in the cache and a cache miss is one that does not.

When reading, the cost of a cache hit is roughly the time to access an entry in the cache; the miss penalty is the additional cost of replacing a cache line with one containing the desired data.

(Access time) = (hit cost) + (miss rate) * (miss penalty)
              = (fast memory access time) + (miss rate) * (slow memory access time)

Note that the approximation is an underestimate - control costs have been left out. Also note that only one word is being loaded from the faster memory while a whole cache block's worth of data is being loaded from the slower memory.


Since the speeds of the actual memories used will improve "independently", most effort in cache design is spent on fast control and on decreasing the miss rate. We can classify misses into three categories: compulsory misses, capacity misses and conflict misses. Compulsory misses occur when data is loaded into the cache for the first time (e.g. at program startup) and are unavoidable. Capacity misses occur when data is reloaded because the cache is not large enough to hold all the data, no matter how we organize it (i.e. even if we changed the hash function and made it omniscient). All other misses are conflict misses: there is theoretically enough space in the cache to avoid the miss, but our fast hash function caused a miss anyway.

Cache read and write policy

When the processor makes a read or write access to the cache, a policy must be provided and implemented in hardware. The policy is usually implemented in the cache controller, which is responsible for managing all the transfers between the caches and main memory.

READ HIT: The referenced memory address is in the first level of cache. The content of the memory address is just transferred from L1 cache to the processor.

READ MISS:

1. If the referenced memory location is not present in the L1 cache then we have a cache miss and further actions have to be taken to bring the desired memory location into the L1 cache.

2. If there is no free block slot in the cache that can receive the desired block, then a block must be evicted from the cache. Different replacement policies can be implemented: the evicted block can be the least recently used (LRU) or can be chosen at random. After an old block is evicted, the cache controller reissues the read, which now finds a read miss with a vacant block slot, and the process continues as in step 1.

WRITE HIT:

There are two policies in case of a write hit.

1. The write-back policy: in this case the write is done only in the cache. When a write is made, the block's dirty bit is set, indicating that the block has been modified and no longer contains the same data as the block in main memory. Because the cache block and the corresponding block in main memory differ, when the block is evicted on a subsequent read miss it must first be written back to memory, so that memory contains the current value.

2. The write-through policy: every time a write is done to a block contained in the cache, the data is also written to the corresponding location in main memory. A dirty bit is not required with this policy because the cache and the main memory are always coherent. When there is a subsequent read miss, the evicted block does not need to be written back, because there is another up-to-date copy in main memory; the new block can simply overwrite it.
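A minimal sketch of the two write-hit policies, assuming a simplified line structure and a hypothetical memory_write_block back end; it is meant only to show where the dirty bit matters, not to model a full cache controller.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 16

typedef struct {
    bool     valid;
    bool     dirty;                 /* only meaningful for the write-back policy */
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
} cache_line;

/* Hypothetical back end that writes a whole line to main memory. */
extern void memory_write_block(uint32_t tag, uint32_t index, const uint8_t *data);

/* Write-back: update only the cache and remember that the line is modified. */
void write_hit_writeback(cache_line *line, uint32_t offset, uint8_t value)
{
    line->data[offset] = value;
    line->dirty = true;             /* main memory is now stale until eviction */
}

/* Write-through: update the cache and main memory together; no dirty bit needed. */
void write_hit_writethrough(cache_line *line, uint32_t index,
                            uint32_t offset, uint8_t value)
{
    line->data[offset] = value;
    memory_write_block(line->tag, index, line->data);
}

/* Eviction under write-back: a dirty line must be flushed before it is reused. */
void evict(cache_line *line, uint32_t index)
{
    if (line->valid && line->dirty)
        memory_write_block(line->tag, index, line->data);
    line->valid = false;
    line->dirty = false;
}
```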

5. MESI Protocol

The MESI protocol makes it possible to maintain coherence in cached systems. It is based on the four states that a block in a cache memory can have; the initial letters of these states, modified, exclusive, shared and invalid, give the protocol its name. The states are explained below:

• Invalid: the block is not valid. The data being looked for is not in the cache, or the local copy is not correct because another processor has updated the corresponding memory location.

• Shared: the block has not been modified, and another processor may also hold the data in its cache; both copies are current.

• Exclusive: the block has not been modified, and this cache is the only one that holds the correct value of the block. The data matches the copy in main memory.

• Modified: effectively an exclusive-modified state. It means that this cache holds the only correct copy in the whole system; the data in main memory is stale.
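The sketch below encodes the four states and the transitions described in this section as a small state machine, covering local read/write events and the invalidations seen from other processors. It is a simplified illustration of the protocol's idea, not a complete MESI implementation.

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

typedef enum {
    LOCAL_READ,      /* this CPU reads the block */
    LOCAL_WRITE,     /* this CPU writes the block */
    REMOTE_READ,     /* another CPU reads the block */
    REMOTE_WRITE     /* another CPU writes the block */
} mesi_event;

/* One cache's next state for a block, given the current state and an event.
 * 'others_have_copy' tells a read miss whether other caches already hold it. */
static mesi_state mesi_next(mesi_state s, mesi_event e, int others_have_copy)
{
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)                       /* read miss: load the block */
            return others_have_copy ? SHARED : EXCLUSIVE;
        return s;                               /* read hit: state unchanged */
    case LOCAL_WRITE:
        return MODIFIED;                        /* writing always ends in Modified */
    case REMOTE_READ:
        if (s == MODIFIED || s == EXCLUSIVE)    /* someone else now has a copy too */
            return SHARED;                      /* (Modified also writes data back) */
        return s;
    case REMOTE_WRITE:
        return INVALID;                         /* our copy is now stale */
    }
    return s;
}

int main(void)
{
    mesi_state s = INVALID;
    s = mesi_next(s, LOCAL_READ, 0);   /* -> EXCLUSIVE: only copy, clean          */
    s = mesi_next(s, LOCAL_WRITE, 0);  /* -> MODIFIED:  only copy, memory stale   */
    s = mesi_next(s, REMOTE_READ, 0);  /* -> SHARED:    written back and shared   */
    s = mesi_next(s, REMOTE_WRITE, 0); /* -> INVALID:   another CPU owns it now   */
    printf("final state: %d (INVALID)\n", s);
    return 0;
}
```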

The state of each cache memory block can change depending on the actions taken by the CPU [3]. Figure 1 presents these transitions clearly.


Although Figure 1 is largely self-explanatory, here is a brief explanation. At the beginning, when the cache is empty and a block of memory is loaded into the cache by the processor, the block takes the exclusive state, because there are no copies of that block in any other cache. Then, if this block is written, it changes to the modified state, because the block is present in only one cache but has been modified, so the copy in main memory now differs from it.

On the other hand, if a block is held in the exclusive state by one cache and another CPU tries to read it but does not find it in its own cache, it fetches the block from main memory and loads it into its cache. The block is then present in two different caches, so its state becomes shared. If a CPU wants to write into a block that is held in the modified state by another cache, that block first has to be cleared from the cache where it was and written back to main memory, because it was the most current copy of the block in the system. The writing CPU then performs its write and holds the block in its cache in the modified state, because it now has the most current version. Similarly, if a CPU wants to read a block that it does not have in its cache and a more recent copy exists in another cache, the system clears the block from that cache and writes it back to main memory; from there the block is read, and the new state is shared, because there are now two current copies in the system. Finally, if a CPU writes into a shared block, the other copies are invalidated and the block changes its state to modified.


It should be taken into account that the state of a cache memory block can also change because of the actions of another CPU, an input/output interrupt, or DMA. These transitions are shown in the figure. Hence, the processor always uses valid data in its operations: we do not have to worry if one processor has changed data from main memory and holds the most current value in its cache, because with the MESI protocol a processor obtains the most current value every time it is required.

6. Cache Coherence

The presence of caches in current-generation distributed shared-memory multiprocessors improves performance by reducing the processor's memory access time and by decreasing the bandwidth requirements of both the local memory module and the global interconnect. Unfortunately, the local caching of data introduces the cache coherence problem. Early distributed shared-memory machines left it to the programmer to deal with the cache coherence problem, and consequently these machines were considered difficult to program. Today's multiprocessors solve the cache coherence problem in hardware by implementing a cache coherence protocol.

A designer has to choose a protocol to implement, and this should be done carefully. Protocol choice can lead to differences in cache miss latencies and differences in the number of messages sent through the interconnection network, both of which can lead to differences in overall application performance. Moreover, some protocols have high-level properties like automatic data distribution or distributed queueing that can help application performance. Let us examine the cache coherence problem in distributed shared-memory machines in detail.

The Cache Coherence Problem

The figure below depicts an example of the cache coherence problem. Memory initially contains the value 0 for location x, and processors 0 and 1 both read location x into their caches. If processor 0 writes location x in its cache with the value 1, then processor 1's cache now contains the stale value 0 for location x. Subsequent reads of location x by processor 1 will continue to return the stale, cached value of 0. This is likely not what the programmer expected when she wrote the program. The expected behavior is for a read by any processor to return the most up-to-date copy of the datum. This is exactly what a cache coherence protocol does: it ensures that requests for a certain datum always return the most recent value.


The cache coherence problem: initially, processors 0 and 1 both read location x, which contains the value 0, into their caches. When processor 0 writes the value 1 to location x, the stale value 0 for location x is still in processor 1's cache.

The coherence protocol achieves this goal by taking action whenever a location is written. More precisely, since the granularity of a cache coherence protocol is a cache line, the protocol takes action whenever any cache line is written. Protocols can take two kinds of actions when a cache line L is written—they may either invalidate all copies of L from the other caches in the machine, or they may update those lines with the new value being written. Continuing the earlier example, in an invalidation-based protocol when processor 0 writes x = 1, the line containing x is invalidated from processor 1’s cache. The next time processor 1 reads location x it suffers a cache miss, and goes to memory to retrieve the latest copy of the cache line. In systems with write-through caches, memory can supply the data because it was updated when processor 0 wrote x. In the more common case of systems with writeback caches, the cache coherence protocol has to ensure that processor 1 asks processor 0 for the latest copy of the cache line. Processor 0 then supplies the line from its cache and processor 1 places that line into its cache, completing its cache miss. In update-based protocols when processor 0 writes x = 1, it sends the new copy of the datum directly to processor 1 and updates the line in processor 1’s cache with the new value. In either case, subsequent reads by processor 1 now “see” the correct value of 1 for location x, and the system is said to be cache coherent.

Most modern cache-coherent multiprocessors use the invalidation technique rather than the update technique, since it is easier to implement in hardware. As cache line sizes continue to increase, invalidation-based protocols remain popular because of the increased number of updates required when writing a cache line sequentially under an update-based coherence protocol. There are times, however, when using an update-based protocol is superior, such as when accessing heavily contended lines and some types of synchronization variables. Typically, designers choose an invalidation-based protocol and add some special features to handle heavily contended synchronization variables. All the protocols presented here are invalidation-based cache coherence protocols, and a later section is devoted to the discussion of synchronization primitives.


7. Directory Scheme

The previous section describes the cache coherence problem and introduces the cache coherence protocol as the agent that solves it. But the question remains: how do cache coherence protocols work?

There are two main classes of cache coherence protocols: snoopy protocols and directory-based protocols. Snoopy protocols require the use of a broadcast medium in the machine and hence apply only to small-scale bus-based multiprocessors. In these broadcast systems each cache "snoops" on the bus and watches for transactions which affect it. Any time a cache sees a write on the bus, it invalidates that line from its cache if it is present. Any time a cache sees a read request on the bus, it checks to see if it has the most recent copy of the data and, if so, responds to the bus request. These snoopy bus-based systems are easy to build, but unfortunately, as the number of processors on the bus increases, the single shared bus becomes a bandwidth bottleneck and the snoopy protocol's reliance on a broadcast mechanism becomes a severe scalability limitation.

To address these problems, architects have adopted the distributed shared memory (DSM) architecture. In a DSM multiprocessor each node contains the processor and its caches, a portion of the machine's physically distributed main memory, and a node controller which manages communication within and between nodes (see Figure 2.2). Rather than being connected by a single shared bus, the nodes are connected by a scalable interconnection network. The DSM architecture allows multiprocessors to scale to thousands of nodes, but the lack of a broadcast medium creates a problem for the cache coherence protocol. Snoopy protocols are no longer appropriate, so designers must instead use a directory-based cache coherence protocol.

The directory is simply an auxiliary data structure that tracks the caching state of each cache line in the system. For each cache line in the system, the directory needs to track which caches, if any, have read-only copies of the line, or which cache has the latest copy of the line if the line is held exclusively. A directory-based cache-coherent machine works by consulting the directory on each cache miss and taking the appropriate action based on the type of request and the current state of the directory.


Figure 2.3 shows a directory-based DSM machine. Just as main memory is physically distributed throughout the machine to improve aggregate memory bandwidth, so the directory is distributed to eliminate the bottleneck that would be caused by a single monolithic directory. If each node's main memory is divided into cache-line-sized blocks, then the directory can be thought of as extra bits of state for each block of main memory. Any time a processor wants to read cache line L, it must send a request to the node that holds the directory for line L. This node is called the home node for L. The home node receives the request, consults the directory, and takes the appropriate action. On a cache read miss, for example, if the directory shows that the line is currently uncached or is cached read-only (the line is said to be clean), then the home node marks the requesting node as a sharer in the directory and replies to the requester with the copy of line L in main memory. If, however, the directory shows that a third node has the data modified in its cache (the line is dirty), the home node forwards the request to the remote third node, and that node is responsible for retrieving the line from its cache and responding with the data. The remote node must also send a message back to the home node indicating the success of the transaction.

Even these simplified examples will give the savvy reader an inkling of the complexities of implementing a full cache coherence protocol in a machine with distributed memories and distributed directories. Because the only serialization point is the directory itself, races and transient cases can happen at other points in the system, and the cache coherence protocol is left to deal with the complexity. For instance, in the "3-hop" example above (also shown in Figure 2.6), where the home node forwards the request to the dirty remote third node, the desired cache line may no longer be present in the remote cache when the forwarded request arrives: the dirty node may have written the desired cache line back to the home node on its own. The cache coherence protocol has to "do the right thing" in these cases; there is no such thing as being "almost coherent".
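The sketch below gives one possible shape for a directory entry and the home node's handling of a read miss, following the clean/dirty cases just described. The bit-vector sharer representation, the node count, and the messaging helpers (send_data_reply, forward_to_owner) are illustrative assumptions, not details of any particular machine in the text.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_NODES 64

/* Directory entry kept at the home node for one cache-line-sized block. */
typedef struct {
    bool     dirty;                   /* true: exactly one cache holds a modified copy */
    uint8_t  owner;                   /* valid only when dirty */
    uint64_t sharers;                 /* bit-vector of nodes holding read-only copies */
} dir_entry;

/* Hypothetical messaging helpers provided by the node controller. */
extern void send_data_reply(uint8_t requester, uint64_t block_addr);
extern void forward_to_owner(uint8_t owner, uint8_t requester, uint64_t block_addr);

/* Home-node handling of a read miss for 'block_addr' from node 'requester'. */
void home_handle_read_miss(dir_entry *d, uint8_t requester, uint64_t block_addr)
{
    if (!d->dirty) {
        /* Clean: main memory has the latest copy. Record the new sharer and reply. */
        d->sharers |= (1ull << requester);
        send_data_reply(requester, block_addr);
    } else {
        /* Dirty: a third node owns the latest copy. Forward the request; the
         * owner supplies the data and later notifies the home node, which then
         * marks the line clean and records both nodes as sharers. */
        forward_to_owner(d->owner, requester, block_addr);
    }
}
```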

There are two major components to every directory-based cache coherence protocol:

• The directory organization
• The set of message types and message actions


The directory organization refers to the data structures used to store the directory information and directly affects the number of bits used to store the sharing information for each cache line. The memory required for the directory is a concern because it is “extra” memory that is not required by non-directory-based machines. The ratio of the directory memory to the total amount of memory is called the directory memory overhead.

The designer would like to keep the directory memory overhead as low as possible and would like it to scale very slowly with machine size. The directory organization also has ramifications for the performance of directory accesses since some directory data structures may require more hardware to implement than others, have more state bits to check, or require traversal of linked lists rather than more static data structures. The directory organization holds the state of the cache coherence protocol, but the protocol must also send messages back and forth between nodes to communicate protocol state changes, data requests, and data replies. Each protocol message sent over the network has a type or opcode associated with it, and each node takes a specific action based on the type of message it receives and the current state of the system.

The set of message actions includes reading and updating the directory state as necessary; handling all possible race conditions, transient states, and "corner cases" in the protocol; composing any necessary response messages; and correctly managing the central resources of the machine, such as virtual lanes in the network, in a deadlock-free manner. Because the actions of the protocol are intimately related to the machine's deadlock avoidance strategy, it is very easy to design a protocol that will livelock or deadlock; it is much more complicated to design and implement a high-performance protocol that is deadlock-free. Variants of three major cache coherence protocols have been implemented in commercial DSM machines, and other protocols have been proposed in the research community. Each protocol varies in terms of directory organization (and therefore directory memory overhead), the number and types of messages exchanged between nodes, the direct protocol processing overhead, and inherent scalability features.

Worked Examples:


1. How many total bits are required for a direct-mapped cache with 64 KB of data and one-word blocks, assuming a 32-bit address?

Soln: 64 KB is 16K words, which is 2^14 words, and with a block size of one word, 2^14 blocks. Each block has 32 bits of data plus a tag, which is 32 - 14 - 2 = 16 bits, plus a valid bit. Thus the total cache size is

2^14 * (32 + (32 - 14 - 2) + 1) = 2^14 * 49 = 784 * 2^10 = 784 Kbits,

or 98 KB for a 64 KB cache. For this cache, the total number of bits is over 1.5 times as many as needed just for the storage of the data.

2. Consider a cache with 64 blocks and a block size of 16 bytes. What block number does byte address 1200 map to?

Soln: The cache block is given by (block address) modulo (number of cache blocks), where the block address is (byte address) / (bytes per block). Notice that this block address is the block containing all addresses between

(byte address / bytes per block) * (bytes per block)

and

(byte address / bytes per block) * (bytes per block) + (bytes per block - 1).

Thus, with 16 bytes per block, byte address 1200 is block address 1200 / 16 = 75, which maps to cache block number 75 modulo 64 = 11.
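The mapping in this example can also be checked with the address-splitting sketch shown earlier; the few lines below simply redo the arithmetic directly.

```c
#include <stdio.h>

int main(void)
{
    const unsigned byte_addr   = 1200;
    const unsigned block_bytes = 16;
    const unsigned num_blocks  = 64;

    unsigned block_addr  = byte_addr / block_bytes;   /* 1200 / 16 = 75 */
    unsigned cache_block = block_addr % num_blocks;   /* 75 mod 64 = 11 */

    printf("byte address %u -> block address %u -> cache block %u\n",
           byte_addr, block_addr, cache_block);
    return 0;
}
```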

3. Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If the machine has a CPI of 2 without any memory stalls and the miss penalty is 40 cycles for all misses, determine how much faster the machine would run with a perfect cache that never missed.

Soln: The number of memory miss cycles for the instructions in terms of the instruction count (I) is

Instruction miss cycles = I * 2% * 40 = 0.80 * I

The frequency of all loads and stores in gcc is 36%. Therefore we can find the number of memory miss cycles for data references:

Data miss cycles = I * 36% * 4% * 40 = 0.56 * I

The total number of memory-stall cycles is 0.80I + 0.56I = 1.36I. This is more than one memory-stall cycle per instruction. Accordingly, the CPI with memory stalls is 2 + 1.36 = 3.36. Since there is no change in instruction count or clock rate, the ratio of CPU execution times is

CPU time with stalls / CPU time with perfect cache = (I * CPI_stall * clock cycle) / (I * CPI_perfect * clock cycle) = CPI_stall / CPI_perfect

= 3.36 / 2 = 1.68, so the machine with a perfect cache is faster by a factor of 1.68.

What happens if the processor is made faster but the memory system stays the same? The amount of time spent on memory stalls will take up an increasing fraction of the execution time (Amdahl's law). Suppose we speed up the machine in the previous example by reducing its CPI from 2 to 1 without changing the clock rate, which might be done with an improved pipeline. The system with cache misses would then have a CPI of 1 + 1.36 = 2.36, and the system with a perfect cache would be 2.36 / 1 = 2.36 times faster.


The amount of execution time spent on memory stalls would have risen from 1.36 / 3.36 = 41% to 1.36 / 2.36 = 58%.
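The arithmetic in this example can be redone with the short sketch below, using the same miss rates, load/store frequency, penalty and base CPIs; small differences from the figures above come from the text rounding the data-miss term to 0.56 per instruction.

```c
#include <stdio.h>

/* Memory-stall cycles per instruction:
 * instruction misses + (load/store fraction) * data misses, each times the penalty. */
static double stalls_per_instr(double i_miss, double d_miss,
                               double ls_frac, double penalty)
{
    return i_miss * penalty + ls_frac * d_miss * penalty;
}

int main(void)
{
    const double i_miss = 0.02, d_miss = 0.04, ls_frac = 0.36, penalty = 40.0;

    /* 0.8 + 0.576 = 1.376 stall cycles per instruction (the text rounds to 1.36). */
    double stalls = stalls_per_instr(i_miss, d_miss, ls_frac, penalty);

    for (double base_cpi = 2.0; base_cpi >= 1.0; base_cpi -= 1.0) {
        double cpi = base_cpi + stalls;
        printf("base CPI %.1f: CPI with stalls %.2f, speedup of perfect cache %.2f, "
               "stall fraction %.0f%%\n",
               base_cpi, cpi, cpi / base_cpi, 100.0 * stalls / cpi);
    }
    return 0;
}
```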

4. Suppose we increase the performance of the machine in the previous example by doubling its clock rate. Since the main memory speed is unlikely to change, assume that the absolute time to handle a cache miss does not change. How much faster will the machine be with the faster clock, assuming the same miss rates as in the previous example?

Soln: Measured in the faster clock cycles, the new miss penalty will be twice as long, or 80 clock cycles. Hence

Total miss cycles per instruction = (2% * 80) + 36% * (4% * 80) = 2.75

Thus the faster machine with cache misses will have a CPI of 2 + 2.75 = 4.75, compared with a CPI of 3.36 for the slower machine with cache misses. Using the formula for CPU time from the previous example, we can compute the relative performance as

Performance with fast clock / Performance with slow clock = Execution time with slow clock / Execution time with fast clock

= (IC * CPI_slow * clock cycle) / (IC * CPI_fast * (clock cycle / 2)) = 3.36 / (4.75 * 0.5) = 1.41

5. Increasing the associativity requires more comparators, as well as more tag bits per cache block. Assuming a cache of 4K blocks and a 32-bit address, find the total number of sets and the total number of tag bits for caches that are direct mapped, two-way set associative, four-way set associative, and fully associative.

Soln: The direct mapped cache has the same number of sets as blocks, and hence 12 bits of index, since log2(4K) = 12. Hence the total number of tag bits is (32 - 12) * 4K = 80 Kbits.

Each doubling of the associativity decreases the number of sets by a factor of 2, and thus decreases the number of bits used to index the cache by 1 and increases the number of bits in the tag by 1. Thus, for a two-way set associative cache there are 2K sets, and the total number of tag bits is (32 - 11) * 2 * 2K = 84 Kbits. For a four-way set associative cache the total number of sets is 1K, and the total number of tag bits is (32 - 10) * 4 * 1K = 88 Kbits. For a fully associative cache there is only one set with 4K blocks, and the tag is 32 bits, leading to a total of 32 * 4K * 1 = 128 Kbits of tag.

6. Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 500 MHz. Assume a main memory access time of 200 ns, including all the miss handling. Suppose the miss rate per instruction at the primary cache is 5%. How much faster will the machine be if we add a secondary cache that has a 20 ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 2%?

Soln: The miss penalty to main memory is 200 ns / (2 ns/clock cycle) = 100 clock cycles.

The effective CPI with one level of cache is given by

Total CPI = base CPI + memory-stall cycles per instruction

For the machine with one level of cache,

Total CPI = 1.0 + 5% * 100 = 6.0

With two levels of cache, a miss in the primary (or first-level) cache can be satisfied either in the secondary cache or in main memory. The miss penalty for an access to the second-level cache is

20 ns / (2 ns/clock cycle) = 10 clock cycles

If the miss is satisfied in the secondary cache, then this is the entire miss penalty. If the miss needs to go to main memory, then the total miss penalty is the sum of the secondary cache access time and the main memory access time. Thus, for a two-level cache, the total CPI is the sum of the stall cycles from both levels of cache and the base CPI:

Total CPI = 1 + primary stalls per instruction + secondary stalls per instruction = 1 + 5% * 10 + 2% * 100 = 3.5

Thus the machine with the secondary cache is faster by 6.0 / 3.5 = 1.7. Alternatively, we could have computed the stall cycles by summing the stall cycles of those references that hit in the secondary cache, (5% - 2%) * 10 = 0.3, and those references that go to main memory, which must include the cost of accessing the secondary cache as well as the main memory access time, 2% * (10 + 100) = 2.2.
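The two-level calculation above follows the same pattern as the earlier CPI example; the sketch below redoes it with the numbers from this example (500 MHz clock, 200 ns and 20 ns access times, 5% and 2% miss rates).

```c
#include <stdio.h>

int main(void)
{
    const double clock_ns  = 2.0;    /* 500 MHz -> 2 ns per clock cycle          */
    const double t_main_ns = 200.0;  /* main-memory access time                  */
    const double t_l2_ns   = 20.0;   /* secondary-cache access time              */
    const double base_cpi  = 1.0;
    const double l1_miss   = 0.05;   /* primary-cache misses per instruction     */
    const double l2_miss   = 0.02;   /* misses per instruction reaching memory   */

    double main_penalty = t_main_ns / clock_ns;   /* 100 cycles */
    double l2_penalty   = t_l2_ns / clock_ns;     /* 10 cycles  */

    double cpi_one_level = base_cpi + l1_miss * main_penalty;            /* 6.0 */
    double cpi_two_level = base_cpi + l1_miss * l2_penalty
                                    + l2_miss * main_penalty;            /* 3.5 */

    printf("one-level CPI = %.1f, two-level CPI = %.1f, speedup = %.2f\n",
           cpi_one_level, cpi_two_level, cpi_one_level / cpi_two_level);
    return 0;
}
```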
