Post on 08-Sep-2018
2
Syllabus
• UP Caches
• Cache Design Parameters
• Effective Time teff
• Cache Performance Parameters
• Replacement Policies
• Trace Cache
• Conclusion
• Bibliography
4
Syllabus UP Caches
• Intro: Purpose, Design Parameters, Architecture
• Effective Time teff
• Single-Line Degenerate Cache
• Multi-Line, Single-Set Cache
• Single-Line, Multi-Set Cache, Blocked Mapping
• Single-Line, Multi-Set Cache, Cyclic Mapping
• Multi-Line per Set (Associative), Multi-Set Cache
• Replacement Policies
• LRU Sample
• Compute Cache Size
• Trace Cache
• Characteristic Cache Curve
• Bibliography
5
Intro: Purpose of Cache
• Cache is logically part of the memory subsystem, yet physically part of the microprocessor: sometimes located on the same silicon die
• Purpose: transform slow memory into fast memory
• Possible at minimal cost despite the high cost per cache bit, since total cache size is just a few % of total physical main store, often < 1 % of memory size!
• Works well if locality is good; else performance equals slow memory access, or is even worse, depending on the architecture
• With poor locality, i.e. with a random distribution of memory accesses, the cache can actually slow the system down if:
  teff = tcache + (1 - h) * tmem   and not:   teff = max( tcache, (1 - h) * tmem )
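The two formulas above can be checked with a small sketch; the timings are hypothetical (a 1-cycle cache, a 10-cycle memory):

```python
# Sketch of the two access-time models from this slide.
# Hypothetical timings: t_cache = 1 cycle, t_mem = 10 cycles.

def t_eff_serial(t_cache, t_mem, h):
    """Memory access starts only after the cache lookup has missed."""
    return t_cache + (1 - h) * t_mem

def t_eff_overlap(t_cache, t_mem, h):
    """Memory access starts in parallel and is aborted on a hit."""
    return max(t_cache, (1 - h) * t_mem)

# With random accesses (hit rate near 0) the serialized design is
# slower than memory alone: 1 + 10 = 11 cycles vs. 10 cycles.
assert t_eff_serial(1, 10, 0.0) == 11
assert t_eff_overlap(1, 10, 0.0) == 10
```

With good locality (h near 1) both designs approach tcache; they differ only when misses dominate.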
6
Intro: Purpose of Cache
• With good locality, the cache delivers available data in close to unit time, 1 cycle: awesome! ☺
• In MP systems, caches must cooperate with other processors' caches, memory, and some peripherals
• Even on a UP system there are multiple agents that need to access memory and thus impact caches
• The cache must cooperate with the VMM of the memory subsystem to jointly render a physically small, slow memory into a virtually large, fast memory, at the small cost of added HW (silicon) and system SW
• L1 cache access time is ideally contained within a single machine cycle; realized on some CPUs
9
From Definitions in Appendix: Line
• Storage area in cache able to hold a copy of a contiguous block of memory cells: a paragraph
• The portion of memory stored in that line is aligned on a memory address modulo the line size!
• For example, if a line holds 64 bytes on a byte-addressable architecture, the address of the first byte of a line has 6 trailing zeros: i.e. it is evenly divisible by 64, or we say it is 64-byte aligned
• Such known zeros don't need to be stored in the tag; they are implied, known a priori!
• This shortens the tag, rendering the cache a bit simpler and cheaper to manufacture: fewer HW bits!
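The implied trailing zeros can be sketched in a few lines, using the 64-byte line of the example above:

```python
# Sketch: with 64-byte lines, the 6 trailing address bits are implied.
LINE_SIZE = 64
OFFSET_BITS = LINE_SIZE.bit_length() - 1     # log2(64) = 6

def line_base(addr):
    """First byte of the line containing addr: trailing bits cleared."""
    return addr & ~(LINE_SIZE - 1)

tag = 0x12345 >> OFFSET_BITS   # only these upper bits go into the tag

assert OFFSET_BITS == 6
assert line_base(0x12345) == 0x12340       # 64-byte aligned
assert line_base(0x12345) % LINE_SIZE == 0
```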
10
From Definitions in Appendix: Set
• A logically connected region of memory, mapped to a specific area of cache (lines), is a set; memory is partitioned into N sets
• Elements of a set don't need to be physically contiguous in memory; if contiguous, the leftmost log2(N) bits are known a priori and don't need to be stored; with cyclic distribution, the rightmost log2(N) bits (for N sets) are known a priori
• The number of sets is conventionally labeled N
• A degenerate case maps all of memory onto the whole cache, so only a single set exists: N = 1; i.e. one set; not meaningful!
• The notion of a set is meaningful only if there are multiple sets. Again: a memory region belonging to one set can be a physically contiguous block, or a cyclically distributed part of memory
• The former case is called blocked, the latter cyclic. The cache area into which such a portion of memory is mapped is also called a set
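The blocked vs. cyclic distinction can be sketched as two set-index functions; the parameters here are hypothetical (N = 4 sets, 16-byte paragraphs, 64 KiB of memory for the blocked case):

```python
# Sketch of the two mapping schemes; parameters are hypothetical.
N, L, MEM = 4, 16, 1 << 16   # sets, paragraph bytes, memory bytes

def set_blocked(addr):
    """Blocked: memory is split into N contiguous regions, one per set."""
    return addr // (MEM // N)

def set_cyclic(addr):
    """Cyclic: consecutive paragraphs are dealt round-robin over the sets."""
    return (addr // L) % N

assert set_blocked(0) == set_blocked(100) == 0   # same contiguous block
assert [set_cyclic(a) for a in (0, 16, 32, 48, 64)] == [0, 1, 2, 3, 0]
```

Note how cyclic mapping spreads neighboring paragraphs over all sets, while blocked mapping keeps a whole memory region in one set.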
11
Intro: Cache Design Parameters
• Number of lines in a set: K
• Number of units –bytes– in a line is named L, AKA the Length of a line
• Number of sets in memory, and hence in the cache: N
• Policy upon store miss: cache write policy
• Policy upon load miss: cache read policy
• What to do when an empty line is needed for the next paragraph to be streamed in, but none is available? That action is the replacement policy
12
Intro: Cache Design Parameters
• Size here is the total size of a cache; the unit being discussed can be bits, bytes, or words
• Size = K * ( L + bits for tag and control bits ) * N
• The ratio of cache size to physical memory is generally on the order of a very small percentage, e.g. << 1 %
• Cache access time, typically close to 1 cycle for an L1 cache, or a very small number of cycles!
• Number of processors with cache: 1 in a UP, M in an MP architecture
• Levels of caches: L1, L2, L3 … the last one is referred to as the LLC, for Last Level Cache
13
Intro: Cache Architecture
• Cache-related definitions used throughout are common, though not all manufacturers apply the same nomenclature
• Initially we discuss cache designs for single-processor architectures
• In the MP cache lecture we progress to more complex cache designs, covering the MESI protocol for a two-processor system with L1 and L2 caches
• Focus for now: purely the L1 data cache on a UP computer architecture
14
Effective Time teff
• Starting with teff = tcache + ( 1 - h ) * tmem we observe:
• No matter how many hits (H) we experience during repeated memory accesses, the effective cycle time teff is never less than tcache
• No matter how many misses (M) we experience, the effective cycle time to access a datum is never more than tcache + tmem
• Desirable to have teff = tmem in case of a cache miss
• Another way to compute the effective access time is to add up all memory-access times and divide by the total number of accesses, thus computing the effective, or average, time teff
15
Effective Time teff
Average access time per memory access:
  teff = ( hits * tcache + misses * ( tcache + tmem ) ) / total_accesses
  teff = h * tcache + m * ( tcache + tmem )
If memory is accessed immediately, aborted on hits:
  teff = ( h + m ) * tcache + m * tmem = tcache + m * tmem
• Assume an access time of 1 cycle to reference data in the cache; best case, at times feasible
• Assume an access time of 10 cycles for data in memory; yes, unrealistically fast!!
• Assume that a memory access is initiated only after a cache miss; then:
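Under these assumptions (1-cycle cache, 10-cycle memory, memory started only after a miss), teff can be tabulated with a short sketch:

```python
# Tabulate t_eff = t_cache + m * t_mem under the stated assumptions:
# t_cache = 1 cycle, t_mem = 10 cycles.
t_cache, t_mem = 1, 10

def t_eff(m):
    """m is the miss rate, 0.0 .. 1.0."""
    return t_cache + m * t_mem

assert t_eff(0.0) == 1     # all hits: pure cache speed
assert t_eff(1.0) == 11    # all misses: slower than memory alone!
assert t_eff(0.1) == 2.0   # 90 % hit rate: 2 cycles on average
```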
17
Effective Time teff

Symb.    Name           Explanation
H        Hits           Number of successful cache accesses
M        Misses         Number of failed cache accesses
A        All            All accesses: A = H + M
T        Total time     Time for A memory accesses
tcache   Cache time     Time to access data once via the data cache
tmem     Mem time       Time to access data via memory once
teff     Effective tm.  Average time over all memory accesses
h        Hit rate       H / A = h = 1 – m
m        Miss rate      M / A = m = 1 – h
h + m    Total rate     = 1; either hit or miss, total probability is 1
19
Effective Time teff
• Compare teff, the effective memory access time in the L1 data cache, at a 99% hit rate vs. a 97% hit rate
• Time for a hit thit = 1 cycle, time for a miss tmiss = 100 cycles; then compare 99 and 97 percent hit rates:
• Given a 99% hit rate:
  • 1 miss costs 100 cycles
  • 99 hits cost 99 cycles total
  • teff = ( 100 + 99 ) / 100 = 1.99 ≈ 2 cycles per average access
• Given a 97% hit rate: students compute teff here!
20
Effective Time teff
• Compare teff, the effective memory access time in the L1 data cache, at a 99% hit rate vs. a 97% hit rate
• Time for a hit thit = 1 cycle, time for a miss tmiss = 100 cycles; then compare 99% and 97% hit rates:
• Given a 99% hit rate:
  • 1 miss costs 100 cycles
  • 99 hits cost 99 cycles total
  • teff = ( 100 + 99 ) / 100 = 1.99 ≈ 2 cycles per average access
• Given a 97% hit rate:
  • 3 misses cost 300 cycles
  • 97 hits cost 97 cycles total
  • teff = ( 300 + 97 ) / 100 = 397 / 100 = 3.97 ≈ 4 cycles per average access
• Or 100% additional cycles for a loss of 2% hit accuracy!
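The worked example above can be verified mechanically:

```python
# Verify the worked example: t_hit = 1 cycle, t_miss = 100 cycles,
# averaged over 100 accesses.
def t_eff(hit_rate, t_hit=1, t_miss=100, accesses=100):
    hits = round(hit_rate * accesses)
    misses = accesses - hits
    return (hits * t_hit + misses * t_miss) / accesses

assert t_eff(0.99) == 1.99   # ≈ 2 cycles per average access
assert t_eff(0.97) == 3.97   # ≈ 4 cycles: ~100 % slower for 2 % less accuracy
```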
21
Actual Cache Data
Intel Core i7 with 3 levels of cache: L1 access > 1 cycle, L3 access costing dozens of cycles, still far faster than a memory access!
25
Single-Line Degenerate Cache
• Quick test: what is the minimum size (in number of bits) of the tag for this degenerate cache? (assume a 32-bit architecture and 64-byte lines)
• The single-line cache, shown here, stores multiple words
• Can improve memory access if extremely good locality exists within a very narrow address range
• Upon a miss the cache initiates a stream-in operation
• Is a direct-mapped cache: all memory locations know a priori where they'll reside in the cache; there is but one line, one option for them
• Is a single-set cache
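A possible answer to the quick test, as a sketch: with one line and one set, only the line-offset bits are implied, so the full remainder of the address must be stored.

```python
# Quick test sketch: 32-bit addresses, one 64-byte line, a single set.
ADDR_BITS, LINE_BYTES = 32, 64

offset_bits = LINE_BYTES.bit_length() - 1   # log2(64) = 6
set_bits = 0                                # one set: nothing to index
tag_bits = ADDR_BITS - set_bits - offset_bits

assert tag_bits == 26                       # minimum tag size in bits
```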
26
Single-Line Degenerate Cache
• As a data cache: exploits only locality of nearby addresses in the same paragraph
• As an instruction cache: exploits locality of tight loops that completely fit inside the address range of a single cache line
• However, there will be a cache miss as soon as an address references outside the line's range
• For example, a tight loop with a function call will cause a cache miss
• Stream-in time is the time to load a line of data from memory
• Total overhead: tag bits + valid bit + dirty bit (if write-back)
• Not advisable to build this type of cache subsystem ☺
28
Dual-Line, Single-Set Cache
• The next cache has 1 set, multiple lines; here 2 lines are shown
Quick sanity check: minimum size of the tag on a byte-addressable, 32-bit architecture with 2 lines, 1 set, and a line size of 16 bytes?
• Each line holds multiple contiguous addressing units; 4 words, 16 bytes shown
• Thus 2 disparate areas of memory can be cached at the same time
• Is an associative cache; all lines (i.e. 2 lines) in the single set must be searched to determine whether a memory element is present in the cache
• Is a single-set associative cache, since all of memory (a singleton set) is mapped onto the same cache lines
29
Dual-Line, Single-Set Cache
• Some tight loops with a function call can be completely cached in an I-cache, assuming the loop body fits into one line and the callee fits into the other
• Would also allow one larger loop to be cached, whose total body does not fit into a single line but would fit into 2 lines
• Applies to more realistic programs
• But if the number of lines K >> 1, the time to search all tags (in the set) can grow beyond unit cycle time
• Again, not advisable to build this kind of cache subsystem ☺
31
Single-Line, Dual-Set Cache
• This cache architecture has multiple sets, 2 shown: 2 distinct areas of memory, each mapped onto a separate cache line: N = 2, K = 1
Quick test: minimum size of the tag on a 32-bit architecture with 4-byte words and 16-byte lines?
• Each set has a single line, in this case 4 memory words; AKA a paragraph in memory
• Thus 2 disparate areas of memory can be cached at the same time
• But these areas must reside in separate memory sets, each contiguous, each having only 1 option
• Is direct mapped; all memory locations know a priori where they'll reside in the cache
• Is a multi-set cache, since parts of memory have their own portion of the cache
32
Single-Line, Dual-Set Cache
• Allows one larger loop to be cached, whose total body does not fit into a single line of an I-cache but would fit into two lines
• But only if, by some great coincidence, both parts of that loop reside in different memory sets
• If used as an instruction cache, all programs consuming half of memory or less never use the line of the second set. Hence this cache architecture is a bad idea!
• If used as a data cache, all data areas that fit into the first block will never populate the second set
• The problem is specific to blocked mapping; so let's try cyclic instead
• Also not advisable to build this type of cache ☺
34
Dual-Set, Single-Line, Cyclic
• The cache architecture below also has 2 sets, N = 2
• Each set has a single line, each holding 4 contiguous memory units: 4 words, 16 bytes, K = 1
• So at least 2 disparate areas of memory can be cached at the same time
Quick test: tag size on a byte-addressable, 32-bit architecture with 4-byte words?
• Disparate areas (of line size, equal to paragraph size) are scattered cyclically throughout memory
• The cyclically distributed memory areas are associated with each respective set
• Is direct mapped; all memory locations know a priori where they'll reside in the cache, as each set has a single line
• Is a multi-set cache: different locations of memory are mapped onto different cache lines, the sets
35
Dual-Set, Single-Line, Cyclic
• Also allows one larger loop to be cached, whose total body does not fit into a single line but would fit into two lines
• Even if parts of the loop belong to different sets
• If used as an instruction cache, a small code section can use the total cache
• If used as a data cache, small data areas can utilize the complete cache
• Cyclic mapping of memory areas to sets is generally superior to blocked mapping
• Still not advisable to build this type of cache ☺
37
Multi-Line, Multi-Set Cache
• Reminder: the tag is the minimal number of address bits to store in order to identify the line's location in memory
• Use: 32-bit architecture, byte addressable, 2 sets cyclic, line length 16 bytes, 2-way set associative
• Two sets; memory will be mapped cyclically, AKA in round-robin fashion
• Each set has two lines, each line holding 16 bytes; i.e. the paragraph length of memory is 16 bytes in this example!
• Note: direct-mapped caches, i.e. caches with one line per set, are also being built; known as non-associative caches
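With these parameters an address splits into tag, set index, and line offset; the sketch below (helper names are ours) follows the slide's cyclic mapping:

```python
# Sketch: how a 32-bit address splits under this example's parameters
# (2 sets, cyclic mapping, 16-byte lines, 2-way set associative).
LINE_BYTES, N_SETS = 16, 2
OFF_BITS = LINE_BYTES.bit_length() - 1    # 4 offset bits
SET_BITS = N_SETS.bit_length() - 1        # 1 set-index bit (cyclic)

def split_address(addr):
    offset = addr & (LINE_BYTES - 1)
    set_ix = (addr >> OFF_BITS) & (N_SETS - 1)
    tag = addr >> (OFF_BITS + SET_BITS)
    return tag, set_ix, offset

assert 32 - OFF_BITS - SET_BITS == 27     # minimum tag size in bits
assert split_address(0xFF) == (7, 1, 15)  # neighboring paragraphs alternate sets
```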
38
Multi-Line, Multi-Set Cache
• Associative cache: once the set is known, search all tags for the memory address in all lines of that set; this will be a very small number, e.g. 2 or 4; 4 for a 4-way set-associative cache
• Such a search can be accomplished in a fraction of a cycle, i.e. 2 or 4 parallel searches, only 1 of which can hit
• In the earlier example on p. 36, line 2 of set 2 is unused, AKA invalid in MESI terminology; invalid lines are an attractive cache feature ☺
• By now you know: sets, lines, associative, non-associative, direct mapped!
39
Replacement Policy
• The replacement policy is the rule that determines, when all lines are valid (i.e. already busy with other, good data) and a new line must be streamed in:
• Which of the valid lines in (a set of) the cache is to be replaced, AKA removed?
• Removal can be low cost if the modified bit (AKA dirty bit) is clear = 0; this means the data in memory and in the cache line are identical! In that case there is no need to stream out, back to memory!
• Otherwise removal is costly: if the dirty bit is set = 1, the data have to be written back to memory, costing a memory access!
• We call this copying to memory: stream-out
40
Replacement Policy

#  Name     Summary
1  LRU      Replaces the Least Recently Used cache line; requires keeping track of the relative "ages" of lines. Retire the line that has remained unused for the longest time of all candidate lines. Speculate that that line will remain unused for the longest time in the future.
2  LFU      Replaces the Least Frequently Used cache line; requires keeping track of the number m of times a line was used over the last n >= m uses. Depending on how long we track the usage, this may require many bits.
3  FIFO     First In, First Out: the first of the lines in the set that was streamed in is the first to be retired when it comes time to find a candidate. Has the advantage that no further update is needed while all lines are in use.
4  Random   Pick a random line from the candidate set for retirement; not as bad as this irrational algorithm might suggest. Reason: the other methods are not too good either ☺
5  Optimal  If a cache were omniscient, it could predict which line will remain unused for the longest time in the future. Of course, that is not computable. However, to create a perfect reference point, we can apply it to past memory access patterns and use the optimal result for comparison: how well does our chosen policy rate vs. the optimal strategy?
41
LRU Sample 1
Assume the following cache architecture:
• N = 16 sets, cyclic distribution
• K = 4 lines per set
• 32-bit architecture, byte-addressable
• write-back (dirty bit)
• valid line indicator (valid bit)
• L = 64 bytes per line; AKA line length
• LRU replacement; uses 2 bits (4 lines per set) to store relative ages
• This results in a tag size of ???? bits
42
LRU Sample 1
Assume the following cache architecture:
• This results in a tag size of 22 bits
• What is the total overhead size per cache line, measured in bits?
43
LRU Sample 1
Assume the following cache architecture:
• Tag size = 22 bits
• 2 LRU bits (4 lines per set) to store the relative ages of the 4 lines in each set
• Dirty bit needed, AKA Modified bit = 1
• Valid bit needed = 1
• Overhead per line: 22 + 2 + 1 + 1 = 26 bits
44
LRU Sample 2
• Sample 2 focuses on one particular set:
• Let the 4 lines be numbered 0..3
• The set is accessed in this order: 1.) line 0 miss, 2.) line 1 miss, 3.) line 0 hit, 4.) line 2 miss, 5.) line 0 hit again, 6.) line 3 miss, 7.) line 0 hit again, and 8.) another miss
• Now the cache is full; we need to find an available line by eviction, to have a line for the new access!
• Assume an initially cold cache: all lines in the cache are free before these accesses
• Problem: once all lines are filled (the valid bit is 1 for all 4 lines), some line must be retired/evicted to make room for the new access that missed. But which one?
• The answer now is based on the LRU policy (Least Recently Used line), which in this sample is line 1
45
LRU Sample 2
• The access order, assuming all memory accesses are reads, no writes, i.e. the dirty bit is always clear:
• Read miss, all lines invalid, stream paragraph into line 0
• Read miss (implies a new address), stream paragraph into line 1
• Read hit on line 0
• Read miss to a new address, stream paragraph into line 2
• Read hit, access line 0
• Read miss, stream paragraph into line 3
• Read hit, access line 0
• Now another read miss, all lines valid; find a line to retire, AKA to evict
• Note that LRU age 00₂ is the youngest, for cache line 0, and 11₂ is the oldest (AKA the least recently used), for cache line 1, of the 4 relative ages of the 4 total lines
47
LRU Sample 2

1. Initially, in a partly cold cache, if we experience a miss there will be an empty line; the paragraph is streamed into the empty line, its relative age is set to 0, and all other ages are incremented by 1
2. In a warm cache (all lines are in use), when a line of age X experiences a hit, its new age becomes 0. The ages of all other lines whose age < X are incremented by 1
3. Of course the "older ones" remain "older"
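The aging rules above can be simulated for Sample 2's access pattern; this is our sketch of the bookkeeping, not a hardware description:

```python
# Simulate the relative-age rules for Sample 2: lines filled/hit in
# order 0, 1, 0, 2, 0, 3, 0, in a set with K = 4 lines.
def lru_victim(accesses):
    ages = {}                        # line -> relative age, 0 = youngest
    for line in accesses:
        if line in ages:             # hit: only younger lines age by 1
            x = ages[line]
            for ln in ages:
                if ages[ln] < x:
                    ages[ln] += 1
        else:                        # miss into an empty line: all others age
            for ln in ages:
                ages[ln] += 1
        ages[line] = 0               # accessed line becomes youngest
    return max(ages, key=ages.get)   # oldest relative age = LRU victim

assert lru_victim([0, 1, 0, 2, 0, 3, 0]) == 1   # line 1 is evicted
```

After the seven accesses, line 0 carries age 00₂ and line 1 age 11₂, matching the slide: line 1 is retired on the eighth access.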
48
Compute Cache Size
Typical Cache Design Parameters:
1. Number of lines in every set: K
2. Number of bytes in a line, i.e. the length of a line: L
3. Number of sets in memory, and hence in the cache: N
4. Policy upon memory write (cache write policy)
5. Policy upon read miss (cache read policy)
6. Replacement policy (e.g. LRU, random, FIFO, etc.)
7. Total cache size in bits = K * ( 8 * L + tag + control bits ) * N –with 8-bit bytes
49
Compute Cache Size
Compute the minimum number of bits for an 8-way, set-associative cache on a 32-bit architecture with 64 sets, using cyclic allocation of sets, line length L = 32 bytes, using LRU and write-back. Memory is byte addressable, with 32-bit addresses:
Tag = . . .
50
Compute Cache Size
Compute the minimum number of bits for an 8-way, set-associative cache on a 32-bit architecture with 64 sets, using cyclic allocation of sets, line length L = 32 bytes, using LRU and write-back. Memory is byte addressable, with 32-bit addresses:
Tag = 32 - 5 - 6 = 21 bits
LRU, 8 ways = 3 bits
Dirty bit = 1 bit
Valid bit = 1 bit
Overhead per line = 21 + 3 + 1 + 1 = 26 bits
# of lines = K * N = 8 * 64 = 2⁹ lines
Data bits per cache line = 32 * 8 = 2⁸ bits
Total cache size = 2⁹ * ( 26 + 2⁸ ) = 144,384 bits
Size in bytes ≈ 17.6 KB
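The same computation, spelled out as a sketch:

```python
# Worked cache-size computation: 8-way, 64 sets, 32-byte lines,
# 32-bit byte addresses, cyclic mapping, LRU, write-back.
addr_bits, n_sets, k, line_bytes = 32, 64, 8, 32

offset_bits = line_bytes.bit_length() - 1        # log2(32) = 5
set_bits = n_sets.bit_length() - 1               # log2(64) = 6
tag_bits = addr_bits - offset_bits - set_bits    # 21
lru_bits = k.bit_length() - 1                    # log2(8) = 3
overhead = tag_bits + lru_bits + 1 + 1           # + dirty + valid = 26

total_bits = n_sets * k * (overhead + 8 * line_bytes)

assert tag_bits == 21 and overhead == 26
assert total_bits == 144_384                     # = 512 * 282 bits
assert total_bits // 8 == 18_048                 # bytes, i.e. ≈ 17.6 KB
```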
51
Trace Cache
• A Trace Cache is a special-purpose cache that does not hold (raw) instruction bits, but instead stores pre-decoded operations, AKA micro-ops
• The old AMD K5 uses a Trace Cache; see [1]
• Intel's Pentium® 4 uses a 12 k micro-op Trace Cache
• Advantage: faster access to executable bits for every cached instruction
• Disadvantage: less dense storage, i.e. wasted cache bits, when compared to a regular I-cache
• Note that cache bits are far more costly than memory bits; several decimal orders of magnitude!
• Trace caches have been falling out of favor since the early 2000s
53
Characteristic Cache Curve
• In the graph below we use the relative number of cache misses [RM] to avoid an unbounded vertical axis
• RM = 0 is the ideal case: no misses, all hits
• RM = 1 is the worst case: all memory accesses are cache misses
• If a program exhibits good locality, a relative cache size of 1 results in good performance; we use this as the reference point:
• Very coarsely, in some ranges, doubling the cache's size results in 30% fewer cache misses
• In other ranges of the characteristic curve, doubling the cache yields just a few % fewer misses: beyond the sweet spot!
56
UP Cache Summary
• A cache is special HW storage, allowing fast access to small areas of memory, copied into cache lines
• Built with expensive technology, hence the size of a cache relative to memory size is small; the cache holds only a small subset of memory, typically < 1 %
• Frequently used data (or instructions in an I-cache) are copied to the cache, with the hope that the data present in the cache are accessed frequently
• Miraculously ☺ that is generally true, so caches in general do speed up execution despite slow memories: exploiting what is known as locality
• Caches are organized into sets, with each set having 1 or more lines; multiple lines require searching
• Defined portions of memory get mapped into any one of these sets
57
Bibliography
1. Shen, John Paul, and Mikko H. Lipasti: Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, ©2005
2. http://forums.amd.com/forum/messageview.cfm?catid=11&threadid=29382&enterthread=y
3. Lam, M., E. E. Rothberg, and M. E. Wolf [1991]. "The Cache Performance and Optimizations of Blocked Algorithms," ACM 0-89791-380-9/91, pp. 63-74.
4. http://www.ece.umd.edu/~blj/papers/hpca2006.pdf
5. MESI: http://en.wikipedia.org/wiki/Cache_coherence
6. Kilburn, T., et al.: "One-level storage systems," IRE Transactions, EC-11, 2, 1962, pp. 223-235
58
Bibliography
7. Anderson, Don, and Tom Shanley, MindShare [1995]. Pentium™ Processor System Architecture, Addison-Wesley Publishing Company, Reading, MA, PC System Architecture Series. ISBN 0-201-40992-5
8. Pentium Pro Developer's Manual, Volume 1: Specifications, Intel document, 1996, one of a set of 3 volumes
9. Pentium Pro Developer's Manual, Volume 2: Programmer's Reference Manual, Intel document, 1996, one of a set of 3 volumes
10. Pentium Pro Developer's Manual, Volume 3: Operating System Writer's Manual, Intel document, 1996, one of a set of 3 volumes
11. Y. Sheffer: http://webee.technion.ac.il/courses/044800/lectures/MESI.pdf
12. MOESI protocol: http://en.wikipedia.org/wiki/MOESI_protocol
13. MESIF protocol: http://en.wikipedia.org/wiki/MESIF_protocol
60
Definitions: Aging
• A cache line's age is tracked only in an associative cache; it doesn't apply to a direct-mapped cache
• Aging tracks when a cache line was accessed, relative to the other lines in its set
• This implies that ages are compared
• Generally the relative ages are of interest, as in "am I older than you?", rather than the absolute age, e.g. "I was accessed at cycle such and such"
• Think about the minimum number of bits needed to store the relative ages of, say, 8 cache lines!
• A memory access addresses only one line, hence all lines in a set have distinct (relative) ages
61
Definitions: Alignment
• Alignment is a spacing requirement, i.e. the restriction that an address adhere to a specific placement condition
• For example, even alignment means that an address is even, i.e. divisible by 2
• E.g. address 3 is not even-aligned, but address 1000 is; the rightmost bit of an even-aligned address is 0
• In VMM, page addresses are aligned on page boundaries. If a page frame has size 4k, then page addresses that adhere to page alignment are evenly divisible by 4k
• As a result, the low-order (rightmost) 12 bits are 0. Knowledge of alignment can be exploited to avoid storing address bits in VMM, caching, etc.
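The checks in this definition can be sketched in a few lines:

```python
# Sketch of the alignment checks from this definition.
PAGE = 4 * 1024                  # 4 KiB page frame

def page_aligned(addr):
    """True iff the low-order 12 bits are zero."""
    return addr % PAGE == 0

assert (PAGE - 1).bit_length() == 12    # 12 implied zero bits
assert page_aligned(0x3000) and not page_aligned(0x3001)
assert 1000 % 2 == 0 and 3 % 2 != 0     # 1000 is even-aligned, 3 is not
```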
62
Definitions: Allocate-on-Write
• If a store instruction experiences a cache miss, and as a result a cache line is filled, the allocate-on-write cache policy is in use
• I.e. if a write miss causes the paragraph from memory to be streamed into a data cache line, we say the cache uses allocate-on-write
• Pentium processors, for example, do not use allocate-on-write
• Antonym: write-by
63
Definitions: Associativity
• If a cache has multiple lines per set, we call it k-way associative; k stands for the number of lines in a set
• A cache with multiple lines (i.e. k > 1) does require searching, or address comparing; the search checks whether a referenced object is in fact present
• Another way of saying this: in an associative cache, any memory object has more than one cache line where it might live
• Antonym: direct mapped; if only a single line (per set) exists, the search is reduced to a simple, single tag comparison
64
Definitions: Back-Off
• If processor P1 issues a store to a data address shared with another processor P2, and P2 has cached and modified the same data, a chance for data inconsistency arises
• To avoid this, P2, holding the modified cache line, must snoop for other processors' accesses, to guarantee delivery of the newest data
• Once the snoop detects the access request from P1, P1 must be prevented from gaining ownership of the data; accomplished by temporarily denying P1 bus access
• This bus denial for the sake of preserving data integrity is called back-off
65
Definitions: Blocking Cache
• Let a cache miss result in streaming in a line
• If during that stream-in no further accesses can be made to this cache until the data transfer is complete, the cache is called blocking
• Antonym: non-blocking
• Generally, a blocking cache yields lower performance than a non-blocking one
66
Definitions: Bus Master
• Only one of the devices connected to a system bus has the right to send signals across the bus; this ownership is called being the bus master
• Initially the Memory & IO Controller (MIOC) is bus master; the chipset may include a special-purpose bus arbiter
• Over time, all processors –or their caches– may request to become bus master for some number of bus cycles
• The MIOC can grant this right; yet each of the processors pi (more specifically: its cache) can request a back-off for pj, even if pj would otherwise be bus master
67
Definitions: Critical Chunk First
• The number of bytes in a line is generally larger than the number of bytes that can be brought into the cache across the bus in 1 step, requiring multiple bus transfers to fill a line completely
• It would be efficient if the actually needed bytes resided in the first chunk brought across the bus
• The deliberate policy that accomplishes just that is the Critical Chunk First policy
• This allows the cache to be unblocked after the first transfer, though the line is not yet completely loaded
• Other parts of the line may be used later, but the critical byte can thus be accessed right away
68
Definitions: Direct Mapped
• If each memory address has just one possible location (i.e. one single line, with K = 1) in the cache where it could possibly reside, that cache is called direct mapped
• Antonym: associative, or fully associative
• Synonym: non-associative
69
Definitions: Directory
• The collection of all tags is referred to as the cache directory
• In addition to the directory and the actual data, there may be further overhead bits in a data cache

Dirty Bit
• The dirty bit is a data structure associated with a cache line. This bit expresses whether a write hit has occurred, on a system applying write-back
• Synonym: Modified bit
• There may be further overhead bits in a data cache
70
Definitions: Effective Cycle Time teff
• Let the cache hit rate h be the number of hits divided by the number of all memory accesses, the ideal hit rate being 1; m is the miss rate = 1 - h; thus: teff = tcache + (1-h) * tmem = tcache + m * tmem
• Alternatively, the effective cycle time might be teff = max( tcache, m * tmem )
• The latter holds if a memory access to retrieve the data is initiated simultaneously with the cache access
• tcache = time to access a datum in the cache, ideally 1 cycle, while tmem is the time to access a data item in memory; generally not a constant value
• The hit rate h varies from 0.0 to 1.0
71
Definitions: Exclusive
• A state in the MESI protocol. The E state indicates that the current cache is not aware of any other cache sharing the same information, and that the line is unmodified
• E allows that in the future another cache line may contain a copy of the same information, in which case E must transition to another state
• It is possible that a higher-level cache (an L1, for example, viewed from an L2) may actually have a shared copy of a line in Exclusive state; however, that level of sharing is transparent to other potentially sharing agents outside the current processor
72
Definitions: Fully Associative Cache
• It is possible not to partition a cache into sets
• In that case, all lines need to be searched for a cache hit or miss
• We call this a fully associative cache
• Generally works for small caches, since the search may become costly in time or HW if the cache were large
73
Definitions: Hit Rate h
• The hit rate h is the number of memory accesses (reads/writes, AKA loads/stores) that hit the cache, over the total number of memory accesses
• By contrast, H is the total number of hits alone
• A hit rate h = 1 means all accesses are served from the cache, while h = 0 means all are from memory, i.e. none hit the cache
• Conventional notations are hr and hw for read and write hits
• See also: miss rate
74
Definitions: Invalid
• A state in the MESI protocol
• State I indicates that its cache line is invalid and consequently holds no valid data; it is ready for use with new data
• It is desirable to have I lines: they allow the stream-in of a paragraph without evicting another cache line
• The Invalid (I) state is set for all cache lines after system reset
75
Definitions: Line
• Storage area in cache able to hold a copy of a contiguous block of memory cells, i.e. a paragraph
• The portion of memory stored in that line is aligned on an address modulo the line size
• For example, if a line holds 64 bytes on a byte-addressable architecture, the address of the first byte has 6 trailing zeros: evenly divisible by 64, it is 64-byte aligned
• Such known zeros don't need to be stored in the tag, the address bits stored in the cache; they are implied
• This shortens the tag, rendering the cache cheaper to manufacture: fewer HW bits!
76
Definitions: LLC
• The Last Level Cache is the largest cache in the memory hierarchy, the one closest to physical memory, or furthest from the processor
• Typical on multi-core architectures
• Typical cache sizes: 4 MB to 32 MB
• It is common to have one LLC shared among all cores of an MCP (Multi-Core Processor), but with the option of separating it (by fusing) to create dedicated LLC caches of identical total size
77
Definitions: LRU
• Acronym for Least Recently Used
• A cache replacement policy (also a page replacement policy, discussed under VMM) that requires aging information for the lines in a set
• Each time a cache line is accessed, that line becomes the youngest one touched
• Other lines of the same set age by one unit, i.e. get older by 1 event; an event is a memory access
• Relative ages are sufficient for LRU tracking; no need to track exact ages!
• Antonym: most recently used (MRU)!
78
Definitions: Locality of Data
• A surprising, beneficial attribute of memory access patterns: when an address is referenced, there is a good chance that in the near future another access will happen at or near that same address
• I.e. memory accesses tend to cluster, also observable in hashing functions and memory page accesses
• Antonym: randomly distributed, or normally distributed
79
Definitions: MESI
• Acronym for Modified, Exclusive, Shared, and Invalid
• An ancient protocol to ensure cache coherence on the family of Pentium processors. A protocol is necessary if multiple processors hold copies of common data with the right to modify
• Through the MESI protocol, data coherence is ensured no matter which of the processors performs writes
• AKA the Illinois protocol, due to its origin at the University of Illinois at Urbana-Champaign
80
Definitions: Miss Rate
• The miss rate, denoted m, is the number of memory (read/write) accesses that miss the cache, over the total number of accesses
• Clearly the miss rate, like the hit rate, varies between 0.0 and 1.0
• The miss rate m = 1 - h
• Antonym: hit rate h
81
Definitions: Modified
• A state in the MESI protocol
• The M state implies that the cache line found by a write hit was exclusive, and that the current processor has modified the data
• The Modified state expresses: currently not shared, exclusively owned data have been modified
• In a UP system, this is generally expressed by the dirty bit
82
DefinitionsParagraphl Conceptual, aligned, fixed-size area of the logical
address space that can be streamed into cachel Area in the cache of paragraph-size is called a linel In addition to the actual data, a line in cache has
further information, including the dirty and valid bits (in UP systems), the tag, LRU information, and in MP systems the MESI bits
l The MESI M state corresponds to the dirty bit in a UP system
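The per-line bookkeeping listed above can be pictured as a small record. This is purely illustrative; the field names are assumptions, not any real cache's layout.

```python
from dataclasses import dataclass

@dataclass
class CacheLine:
    """Illustrative per-line bookkeeping; field names are assumptions."""
    tag: int = 0           # identifies which paragraph the line holds
    valid: bool = False    # invalid lines are free
    dirty: bool = False    # UP systems: modified since stream-in
    mesi: str = "I"        # MP systems: M/E/S/I subsumes valid + dirty
    data: bytes = b""      # the cached copy of the paragraph itself
```

In a UP system the MESI M state collapses to the dirty bit, and I corresponds to valid = 0, as the slide notes.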
83
Definitions
Replacement Policyl A replacement policy is a convention that determines
which line is to be retired when a new line must be loaded but none is free in the set, so one has to be evicted
l Ideally, the line that would remain unused for the longest time in the future should be replaced and its contents overwritten with new data
l Generally we do not know which line will stay unreferenced for the longest time in the future
l In a direct-mapped cache, the replacement policy is trivial: it is moot, as there is just one line per set
84
DefinitionsSetl A logically connected region of memory, to be mapped onto a
specific area of cache (line), is a set; there are N sets in memoryl Elements of a set need not be physically contiguous in
memory; with blocked (contiguous) mapping, the leftmost log2(N) address bits select the set; with cyclic distribution, the log2(N) bits just above the alignment (offset) bits select the set
l The number of sets is conventionally labeled Nl A degenerate case is to map all memory onto the whole cache,
in which case only a single set exists: N = 1l The notion of a set is meaningful only if there are multiple sets. A
memory region belonging to one set can be physically contiguous or distributed cyclically
l In the former case the distribution is called blocked, in the latter cyclic. The cache area into which a portion of memory is mapped is also called a set
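The two distributions differ only in which address bits select the set. A sketch, assuming N sets, paragraph size, and memory size are all powers of two (function and parameter names are assumptions):

```python
def set_index(addr, n_sets, para_size, mem_size, mapping="cyclic"):
    """Which of the N sets a byte address falls into.

    cyclic : the log2(N) bits just above the in-paragraph offset select
             the set, so consecutive paragraphs hit consecutive sets.
    blocked: memory is split into N contiguous regions; the leftmost
             log2(N) address bits select the set."""
    if mapping == "cyclic":
        return (addr // para_size) % n_sets
    # blocked: each set owns one contiguous slice of memory
    return addr // (mem_size // n_sets)
```

E.g. with 4 sets, 64-byte paragraphs, and a 1 KiB memory: cyclic maps address 64 to set 1, while blocked keeps the whole first 256-byte region in set 0.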
85
Definitions
Set-Associativel A cached system in which each set has multiple
cache lines is called set-associativel For example, 4-way set associative means that
there are multiple sets (could be 4 sets, 256 sets, 1024 sets, or any other number of sets) and each of those sets has 4 lines
l Integral powers of 2 are good ☺ to use
l That’s what the 4 refers to in a 4-way cachel Antonym: non-associative, AKA direct-mapped
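With cyclic mapping, a set-associative lookup splits the address into tag, set index, and in-paragraph offset. A sketch, assuming power-of-two sizes (function name is an assumption); note the number of ways does not appear in the split at all, it only multiplies capacity: size = ways x N x paragraph size.

```python
def split_address(addr, para_size, n_sets):
    """Split a byte address into (tag, set, offset) for a
    set-associative cache with cyclic mapping.
    Sizes are assumed to be integral powers of 2."""
    offset = addr % para_size                 # byte within the paragraph
    set_ix = (addr // para_size) % n_sets     # which set to probe
    tag    = addr // (para_size * n_sets)     # compared against each way
    return tag, set_ix, offset
```

On a lookup, all lines (ways) of the selected set compare their stored tag against this tag in parallel; a match on a valid line is a hit.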
86
DefinitionsSharedl State in the MESI protocoll S state expresses that the hit line is present in
more than one cache. Moreover, the current cache (with the shared state) has not modified the line after stream-in
l Another cache of the same processor may be such a sharing agent. For example, in a two-level cache with an inclusive policy, the L2 cache holds all data present in the L1 cache
l Similarly, another processor’s L2 cache may share data with the current processor’s L2 cache
87
DefinitionsStale Memoryl A valid cache line may be overwritten with new datal The write-back policy records such overwritingl At the moment of a cache write with write-back,
cache and memory are out of sync; we say memory is stale
l Poses no danger, since the dirty bit (or modified bit) reflects that memory eventually must be updated
l But until this happens, memory is stalel Note that if two processors’ caches share memory
and one cache renders memory stale, the other processor should no longer have access to that portion of shared memory
88
DefinitionsStream-Outl Streaming out a line refers to the movement of one
line of modified data, out of the cache and back into a memory paragraph
Stream-Inl The movement of one paragraph of data from
memory into a cache line. Since line length generally exceeds the bus width (i.e. exceeds the number of bytes that can be moved in a single bus transaction), a stream-in process requires multiple bus transactions in a row
l Possible that the byte actually needed will arrive last in a cache line during a sequence of bus transactions; can be avoided with the critical chunk first (AKA critical word first) policy
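The critical-chunk-first idea can be sketched by computing the order of bus transactions for one stream-in: the chunk holding the missed byte goes first, the rest wrap around. Names are assumptions; this shows only the ordering, not the bus itself.

```python
def stream_in_order(miss_offset, line_size, bus_width, critical_first=True):
    """Order of bus transactions (chunk indices) for one stream-in.
    With critical-chunk-first, the chunk containing the missed byte
    is fetched first, then the remaining chunks wrap around."""
    n = line_size // bus_width          # transactions per stream-in
    if not critical_first:
        return list(range(n))           # plain sequential fill
    crit = miss_offset // bus_width     # chunk holding the needed byte
    return [(crit + i) % n for i in range(n)]
```

E.g. a 64-byte line on an 8-byte bus with a miss at offset 35 fetches chunk 4 first: [4, 5, 6, 7, 0, 1, 2, 3], so the processor can resume after the first transaction instead of the fifth.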
89
DefinitionsSnoopingl After a line write hit in a cache using write-back, the data in
cache and memory are no longer identical. In accordance with the write-back policy, memory will be written eventually, but until then memory is stale
l The modifier (the cache that wrote) must pay attention to other bus masters trying to access the same line. If this is detected, action must be taken to ensure data integrity
l This paying attention is called snooping. The right action may be forcing a back-off, or snarfing, or yet something else that ensures data coherence
l Snooping starts with the lowest-order cache, here the L2 cache. If appropriate, L2 lets L1 snoop for the same address, because L1 may have further modified the line
90
DefinitionsSquashingl Starting with a read-miss:l In a non-blocking cache, a subsequent memory access may
be issued after an earlier read-miss, even if that previous miss results in a stream-in that is currently still under way
l That subsequent memory access will be a miss again, which is being queued. Whenever an access references an address for which a request is already outstanding, the duplicate request to stream-in can be skipped
l Not entering this in the queue is called squashingl The second and any further outstanding memory access can
be resolved, once the first stream-in results in the line being present in the cache
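The queue behavior described above can be sketched as follows: a non-blocking cache tracks outstanding stream-ins, and a second miss to the same line is squashed rather than queued. Class and method names are assumptions.

```python
class MissQueue:
    """Sketch of outstanding-miss tracking in a non-blocking cache.
    A miss to a line whose stream-in is already under way is squashed:
    no duplicate stream-in request enters the queue."""
    def __init__(self):
        self.outstanding = set()   # line addresses being streamed in
        self.squashed = 0          # duplicate requests skipped

    def miss(self, line_addr):
        """Returns True if a new stream-in was issued, False if squashed."""
        if line_addr in self.outstanding:
            self.squashed += 1     # request already pending: squash
            return False
        self.outstanding.add(line_addr)
        return True

    def stream_in_done(self, line_addr):
        """Line arrived: all accesses waiting on it can now be resolved."""
        self.outstanding.discard(line_addr)
```

Once `stream_in_done` fires, both the original miss and every squashed duplicate are satisfied by the single line now present in the cache.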
91
Definitions
Strong Write Orderl A policy ensuring that memory writes occur in the
same order as the list of store operations in the executing object code
l Antonym: Weak orderl The advantage of weak ordering can be speed
gain, allowing a compiler or cache policy to schedule instructions out of order; this requires further care to ensure data integrity
92
Definitions
Stream-Inl The movement of a paragraph from memory into a
cache linel Since line length generally exceeds the bus width
(i.e. exceeds the number of bytes that can be moved in a single bus transaction), a stream-in process requires multiple bus transactions
l It is possible that the byte actually needed arrives last (or first) in a cache line during a sequence of bus transactions
l Antonym: Stream-out
93
Definitions
Stream-Outl The movement of one line of modified data from
cache into a memory paragraphl Antonym: Stream-in l Note that unmodified data don’t need to be
streamed-out from cache to memory; they are already present in memory
94
Definitions
Trace Cachel Special-purpose cache that holds pre-
decoded instructions, AKA micro-opsl Advantage: Repeated decoding for
instructions is not neededl Trace caches fell out of favor during the
2000s
95
DefinitionsValid Bitl Single-bit data structure per cache line, indicating
whether or not the line is free; free means invalidl If a line is not valid (i.e. if valid bit is 0), it can be
filled with a new paragraph upon a cache missl Else, (valid bit 1), the line holds valid informationl After a system reset, all valid bits of the whole
cache are set to 0l The I bit in the MESI protocol takes on that role on
an MP cache subsysteml To be discussed in MP-cache coherence topic
96
Definitions
Weak Write Orderl A memory-write policy allowing memory writes (by a
compiler or cache) to occur in a different order than their originating store operations
l Antonym: Strong Write Orderl The advantage of weak ordering is potential speed
gain
97
DefinitionsWrite-Backl Cache write policy that keeps a line of data (a
paragraph) in the cache even after a write, i.e. after a modification
l The changed state must be remembered via the dirty bit, AKA Modified state, or modified bit
l Memory is temporarily stale in such a casel Upon retirement, any dirty line must be copied
back into memory; called write-backl Advantage: only one stream-out, no matter how
many write hits occurred to that same line
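The stated advantage, one stream-out regardless of the number of write hits, can be made concrete by counting the memory write transactions each policy causes for one line. A sketch under that simple model (function name is an assumption):

```python
def count_stream_outs(write_hits, policy="write-back"):
    """Memory write transactions caused by n write hits to one line.
    write-through: every hit is propagated to memory immediately.
    write-back   : only the eventual eviction of the dirty line
                   streams it out, once, however many hits occurred."""
    if policy == "write-through":
        return write_hits
    return 1 if write_hits > 0 else 0   # clean lines need no stream-out
```

So ten write hits to one line cost ten bus writes under write-through but a single stream-out under write-back, which is exactly the traffic saving the slide describes.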
98
DefinitionsWrite-Byl Cache write policy, in which the cache is not
accessed on a write miss, even if there are cache lines in I state
l A cache using write-by “hopes” that soon there may be a load, which will result in a miss and then stream-in the appropriate line; if not, it was not necessary to stream-in the line in the first place
l Antonym: allocate-on-write
99
DefinitionsWrite-Oncel Cache write policy that starts out as write-through
and changes to write-back after the first write hit to a line
l Typical policy imposed onto a higher-level L1 cache by the L2 cache
l Advantage: The L1 cache places no unnecessary traffic onto the system bus upon a cache-write hit
l Lower level L2 cache can remember that a write has occurred by setting the MESI state to modified
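The write-once transition is a small state machine per line: the first write hit still goes through to L2 (which can set its MESI state to M), every later hit stays local, write-back style. A sketch with assumed names:

```python
class WriteOnceLine:
    """Sketch of the write-once policy for one L1 line: starts out
    write-through, switches to write-back after the first write hit."""
    def __init__(self):
        self.mode = "write-through"

    def write_hit(self):
        """Returns True if this write is propagated to the L2 cache."""
        if self.mode == "write-through":
            self.mode = "write-back"   # first hit: L2 learns of the write
            return True
        return False                   # later hits stay in L1, no bus traffic
```

The single propagated write is what lets the lower-level L2 cache remember that the line was modified, while all subsequent hits stay off the system bus.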