Post on 08-Sep-2018
2
Syllabus
• UP Caches
• Cache Design Parameters
• Effective Time teff
• Cache Performance Parameters
• Replacement Policies
• Trace Cache
• Conclusion
• Bibliography
4
Syllabus UP Caches
• Intro: Purpose, Design Parameters, Architecture
• Effective Time teff
• Single-Line Degenerate Cache
• Multi-Line, Single-Set Cache
• Single-Line, Multi-Set Cache, Blocked Mapping
• Single-Line, Multi-Set Cache, Cyclic Mapping
• Multi-Line per Set (Associative), Multi-Set Cache
• Replacement Policies
• LRU Sample
• Compute Cache Size
• Trace Cache
• Characteristic Cache Curve
• Bibliography
5
Intro: Purpose of Cache
• Cache is logically part of the memory subsystem, yet physically part of the microprocessor: sometimes located on the same silicon die
• Purpose: transform slow memory into fast memory
• Possible at minimal cost despite the high cost per cache bit, since total cache size is just a few % of total physical main store, often < 1 % of memory size!
• Works well if locality is good; else performance equals slow memory access, or is even worse, depending on the architecture
• With poor locality, i.e. with a random distribution of memory accesses, the cache can actually slow the system down if:
  teff = tcache + (1 - h) * tmem   and not:   teff = max( tcache, (1 - h) * tmem )
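The two formulas above can be checked with a small sketch; the timings are hypothetical (a 1-cycle cache, a 10-cycle memory):

```python
# Sketch of the two access-time models from this slide.
# Hypothetical timings: t_cache = 1 cycle, t_mem = 10 cycles.

def t_eff_serial(t_cache, t_mem, h):
    """Memory access starts only after the cache lookup has missed."""
    return t_cache + (1 - h) * t_mem

def t_eff_overlap(t_cache, t_mem, h):
    """Memory access starts in parallel and is aborted on a hit."""
    return max(t_cache, (1 - h) * t_mem)

# With random accesses (hit rate near 0) the serialized design is
# slower than memory alone: 1 + 10 = 11 cycles vs. 10 cycles.
assert t_eff_serial(1, 10, 0.0) == 11
assert t_eff_overlap(1, 10, 0.0) == 10
```

With good locality (h near 1) both designs approach tcache; they differ only when misses dominate.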
6
Intro: Purpose of Cache
• With good locality, the cache delivers available data in close to unit time, 1 cycle: awesome! ☺
• In MP systems, caches must cooperate with other processors' caches, memory, and some peripherals
• Even on a UP system there are multiple agents that need to access memory and thus impact caches
• The cache must cooperate with the VMM of the memory subsystem to jointly render a physically small, slow memory into a virtually large, fast memory, at the small cost of added HW (silicon) and system SW
• L1 cache access time is ideally contained within a single machine cycle; realized on some CPUs
9
From Definitions in Appendix: Line
• Storage area in cache able to hold a copy of a contiguous block of memory cells: a paragraph
• The portion of memory stored in that line is aligned on a memory address modulo the line size!
• For example, if a line holds 64 bytes on a byte-addressable architecture, the address of the first byte of a line has 6 trailing zeros: i.e. it is evenly divisible by 64, or we say it is 64-byte aligned
• Such known zeros don't need to be stored in the tag; they are implied, known a priori!
• This shortens the tag, rendering the cache a bit simpler and cheaper to manufacture: fewer HW bits!
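The implied trailing zeros can be sketched in a few lines, using the 64-byte line of the example above:

```python
# Sketch: with 64-byte lines, the 6 trailing address bits are implied.
LINE_SIZE = 64
OFFSET_BITS = LINE_SIZE.bit_length() - 1     # log2(64) = 6

def line_base(addr):
    """First byte of the line containing addr: trailing bits cleared."""
    return addr & ~(LINE_SIZE - 1)

tag = 0x12345 >> OFFSET_BITS   # only these upper bits go into the tag

assert OFFSET_BITS == 6
assert line_base(0x12345) == 0x12340       # 64-byte aligned
assert line_base(0x12345) % LINE_SIZE == 0
```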
10
From Definitions in Appendix: Set
• A logically connected region of memory, mapped to a specific area of cache (lines), is a set; memory is partitioned into N sets
• Elements of a set don't need to be physically contiguous in memory; if contiguous, the leftmost log2(N) bits are known a priori and don't need to be stored; with cyclic distribution, the rightmost log2(N) bits (for N sets) are known a priori
• The number of sets is conventionally labeled N
• A degenerate case maps all of memory onto the whole cache, so only a single set exists: N = 1; i.e. one set; not meaningful!
• The notion of a set is meaningful only if there are multiple sets. Again: a memory region belonging to one set can be a physically contiguous block, or a cyclically distributed part of memory
• The former case is called blocked, the latter cyclic. The cache area into which such a portion of memory is mapped is also called a set
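The blocked vs. cyclic distinction can be sketched as two set-index functions; the parameters here are hypothetical (N = 4 sets, 16-byte paragraphs, 64 KiB of memory for the blocked case):

```python
# Sketch of the two mapping schemes; parameters are hypothetical.
N, L, MEM = 4, 16, 1 << 16   # sets, paragraph bytes, memory bytes

def set_blocked(addr):
    """Blocked: memory is split into N contiguous regions, one per set."""
    return addr // (MEM // N)

def set_cyclic(addr):
    """Cyclic: consecutive paragraphs are dealt round-robin over the sets."""
    return (addr // L) % N

assert set_blocked(0) == set_blocked(100) == 0   # same contiguous block
assert [set_cyclic(a) for a in (0, 16, 32, 48, 64)] == [0, 1, 2, 3, 0]
```

Note how cyclic mapping spreads neighboring paragraphs over all sets, while blocked mapping keeps a whole memory region in one set.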
11
Intro: Cache Design Parameters
• Number of lines in a set: K
• Number of units –bytes– in a line is named L, AKA the Length of a line
• Number of sets in memory, and hence in the cache: N
• Policy upon store miss: cache write policy
• Policy upon load miss: cache read policy
• What to do when an empty line is needed for the next paragraph to be streamed in, but none is available? That action is the replacement policy
12
Intro: Cache Design Parameters
• Size here is the total size of a cache; the unit being discussed can be bits, bytes, or words
• Size = K * ( L + bits for tag and control bits ) * N
• The ratio of cache size to physical memory is generally on the order of a very small percentage, e.g. << 1 %
• Cache access time, typically close to 1 cycle for an L1 cache, or a very small number of cycles!
• Number of processors with cache: 1 in a UP, M in an MP architecture
• Levels of caches: L1, L2, L3 … the last one is referred to as the LLC, for Last Level Cache
13
Intro: Cache Architecture
• Cache-related definitions used throughout are common, though not all manufacturers apply the same nomenclature
• Initially we discuss cache designs for single-processor architectures
• In the MP cache lecture we progress to more complex cache designs, covering the MESI protocol for a two-processor system with L1 and L2 caches
• Focus for now: purely the L1 data cache on a UP computer architecture
14
Effective Time teff
• Starting with teff = tcache + ( 1 - h ) * tmem we observe:
• No matter how many hits (H) we experience during repeated memory accesses, the effective cycle time teff is never less than tcache
• No matter how many misses (M) we experience, the effective cycle time to access a datum is never more than tcache + tmem
• Desirable to have teff = tmem in case of a cache miss
• Another way to compute the effective access time is to add up all memory-access times and divide by the total number of accesses, thus computing the effective, or average, time teff
15
Effective Time teff
Average access time per memory access:
  teff = ( hits * tcache + misses * ( tcache + tmem ) ) / total_accesses
  teff = h * tcache + m * ( tcache + tmem )
If memory is accessed immediately, aborted on hits:
  teff = ( h + m ) * tcache + m * tmem = tcache + m * tmem
• Assume an access time of 1 cycle to reference data in the cache; best case, at times feasible
• Assume an access time of 10 cycles for data in memory; yes, unrealistically fast!!
• Assume that a memory access is initiated only after a cache miss; then:
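Under these assumptions (1-cycle cache, 10-cycle memory, memory started only after a miss), teff can be tabulated with a short sketch:

```python
# Tabulate t_eff = t_cache + m * t_mem under the stated assumptions:
# t_cache = 1 cycle, t_mem = 10 cycles.
t_cache, t_mem = 1, 10

def t_eff(m):
    """m is the miss rate, 0.0 .. 1.0."""
    return t_cache + m * t_mem

assert t_eff(0.0) == 1     # all hits: pure cache speed
assert t_eff(1.0) == 11    # all misses: slower than memory alone!
assert t_eff(0.1) == 2.0   # 90 % hit rate: 2 cycles on average
```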
17
Effective Time teff

Symb.    Name           Explanation
H        Hits           Number of successful cache accesses
M        Misses         Number of failed cache accesses
A        All            All accesses: A = H + M
T        Total time     Time for A memory accesses
tcache   Cache time     Time to access data once via the data cache
tmem     Mem time       Time to access data via memory once
teff     Effective tm.  Average time over all memory accesses
h        Hit rate       H / A = h = 1 – m
m        Miss rate      M / A = m = 1 – h
h + m    Total rate     = 1; either hit or miss, total probability is 1
19
Effective Time teff
• Compare teff, the effective memory access time in the L1 data cache, at a 99% hit rate vs. a 97% hit rate
• Time for a hit thit = 1 cycle, time for a miss tmiss = 100 cycles; then compare 99 and 97 percent hit rates:
• Given a 99% hit rate:
  • 1 miss costs 100 cycles
  • 99 hits cost 99 cycles total
  • teff = ( 100 + 99 ) / 100 = 1.99 ≈ 2 cycles per average access
• Given a 97% hit rate: students compute teff here!
20
Effective Time teff
• Compare teff, the effective memory access time in the L1 data cache, at a 99% hit rate vs. a 97% hit rate
• Time for a hit thit = 1 cycle, time for a miss tmiss = 100 cycles; then compare 99% and 97% hit rates:
• Given a 99% hit rate:
  • 1 miss costs 100 cycles
  • 99 hits cost 99 cycles total
  • teff = ( 100 + 99 ) / 100 = 1.99 ≈ 2 cycles per average access
• Given a 97% hit rate:
  • 3 misses cost 300 cycles
  • 97 hits cost 97 cycles total
  • teff = ( 300 + 97 ) / 100 = 397 / 100 = 3.97 ≈ 4 cycles per average access
• Or 100% additional cycles for a loss of 2% hit accuracy!
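The worked example above can be verified mechanically:

```python
# Verify the worked example: t_hit = 1 cycle, t_miss = 100 cycles,
# averaged over 100 accesses.
def t_eff(hit_rate, t_hit=1, t_miss=100, accesses=100):
    hits = round(hit_rate * accesses)
    misses = accesses - hits
    return (hits * t_hit + misses * t_miss) / accesses

assert t_eff(0.99) == 1.99   # ≈ 2 cycles per average access
assert t_eff(0.97) == 3.97   # ≈ 4 cycles: ~100 % slower for 2 % less accuracy
```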
21
Actual Cache Data
Intel Core i7 with 3 levels of cache: L1 access > 1 cycle, L3 access costing dozens of cycles, still far faster than a memory access!
25
Single-Line Degenerate Cache
• Quick test: what is the minimum size (in number of bits) of the tag for this degenerate cache? (assume a 32-bit architecture and 64-byte lines)
• The single-line cache, shown here, stores multiple words
• Can improve memory access if extremely good locality exists within a very narrow address range
• Upon a miss the cache initiates a stream-in operation
• Is a direct-mapped cache: all memory locations know a priori where they'll reside in the cache; there is but one line, one option for them
• Is a single-set cache
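A possible answer to the quick test, as a sketch: with one line and one set, only the line-offset bits are implied, so the full remainder of the address must be stored.

```python
# Quick test sketch: 32-bit addresses, one 64-byte line, a single set.
ADDR_BITS, LINE_BYTES = 32, 64

offset_bits = LINE_BYTES.bit_length() - 1   # log2(64) = 6
set_bits = 0                                # one set: nothing to index
tag_bits = ADDR_BITS - set_bits - offset_bits

assert tag_bits == 26                       # minimum tag size in bits
```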
26
Single-Line Degenerate Cache
• As a data cache: exploits only locality of nearby addresses in the same paragraph
• As an instruction cache: exploits locality of tight loops that completely fit inside the address range of a single cache line
• However, there will be a cache miss as soon as an address references outside the line's range
• For example, a tight loop with a function call will cause a cache miss
• Stream-in time is the time to load a line of data from memory
• Total overhead: tag bits + valid bit + dirty bit (if write-back)
• Not advisable to build this type of cache subsystem ☺
28
Dual-Line, Single-Set Cache
• The next cache has 1 set, multiple lines; here 2 lines are shown
Quick sanity check: minimum size of the tag on a byte-addressable, 32-bit architecture with 2 lines, 1 set, and a line size of 16 bytes?
• Each line holds multiple contiguous addressing units; 4 words, 16 bytes shown
• Thus 2 disparate areas of memory can be cached at the same time
• Is an associative cache; all lines (i.e. 2 lines) in the single set must be searched to determine whether a memory element is present in the cache
• Is a single-set associative cache, since all of memory (a singleton set) is mapped onto the same cache lines
29
Dual-Line, Single-Set Cache
• Some tight loops with a function call can be completely cached in an I-cache, assuming the loop body fits into one line and the callee fits into the other
• Would also allow one larger loop to be cached, whose total body does not fit into a single line but would fit into 2 lines
• Applies to more realistic programs
• But if the number of lines K >> 1, the time to search all tags (in the set) can grow beyond unit cycle time
• Again, not advisable to build this kind of cache subsystem ☺
31
Single-Line, Dual-Set Cache
• This cache architecture has multiple sets, 2 shown: 2 distinct areas of memory, each mapped onto a separate cache line: N = 2, K = 1
Quick test: minimum size of the tag on a 32-bit architecture with 4-byte words and 16-byte lines?
• Each set has a single line, in this case 4 memory words; AKA a paragraph in memory
• Thus 2 disparate areas of memory can be cached at the same time
• But these areas must reside in separate memory sets, each contiguous, each having only 1 option
• Is direct mapped; all memory locations know a priori where they'll reside in the cache
• Is a multi-set cache, since parts of memory have their own portion of the cache
32
Single-Line, Dual-Set Cache
• Allows one larger loop to be cached, whose total body does not fit into a single line of an I-cache but would fit into two lines
• But only if, by some great coincidence, both parts of that loop reside in different memory sets
• If used as an instruction cache, all programs consuming half of memory or less never use the line of the second set. Hence this cache architecture is a bad idea!
• If used as a data cache, all data areas that fit into the first block will never populate the second set
• The problem is specific to blocked mapping; so let's try cyclic instead
• Also not advisable to build this type of cache ☺
34
Dual-Set, Single-Line, Cyclic
• The cache architecture below also has 2 sets, N = 2
• Each set has a single line, each holding 4 contiguous memory units: 4 words, 16 bytes, K = 1
• So at least 2 disparate areas of memory can be cached at the same time
Quick test: tag size on a byte-addressable, 32-bit architecture with 4-byte words?
• Disparate areas (of line size, equal to paragraph size) are scattered cyclically throughout memory
• The cyclically distributed memory areas are associated with each respective set
• Is direct mapped; all memory locations know a priori where they'll reside in the cache, as each set has a single line
• Is a multi-set cache: different locations of memory are mapped onto different cache lines, the sets
35
Dual-Set, Single-Line, Cyclic
• Also allows one larger loop to be cached, whose total body does not fit into a single line but would fit into two lines
• Even if parts of the loop belong to different sets
• If used as an instruction cache, a small code section can use the total cache
• If used as a data cache, small data areas can utilize the complete cache
• Cyclic mapping of memory areas to sets is generally superior to blocked mapping
• Still not advisable to build this type of cache ☺
37
Multi-Line, Multi-Set Cache
• Reminder: the tag is the minimal number of address bits to store in order to identify the line's location in memory
• Use: 32-bit architecture, byte addressable, 2 sets cyclic, line length 16 bytes, 2-way set associative
• Two sets; memory will be mapped cyclically, AKA in round-robin fashion
• Each set has two lines, each line holding 16 bytes; i.e. the paragraph length of memory is 16 bytes in this example!
• Note: direct-mapped caches, i.e. caches with one line per set, are also being built; known as non-associative caches
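With these parameters an address splits into tag, set index, and line offset; the sketch below (helper names are ours) follows the slide's cyclic mapping:

```python
# Sketch: how a 32-bit address splits under this example's parameters
# (2 sets, cyclic mapping, 16-byte lines, 2-way set associative).
LINE_BYTES, N_SETS = 16, 2
OFF_BITS = LINE_BYTES.bit_length() - 1    # 4 offset bits
SET_BITS = N_SETS.bit_length() - 1        # 1 set-index bit (cyclic)

def split_address(addr):
    offset = addr & (LINE_BYTES - 1)
    set_ix = (addr >> OFF_BITS) & (N_SETS - 1)
    tag = addr >> (OFF_BITS + SET_BITS)
    return tag, set_ix, offset

assert 32 - OFF_BITS - SET_BITS == 27     # minimum tag size in bits
assert split_address(0xFF) == (7, 1, 15)  # neighboring paragraphs alternate sets
```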
38
Multi-Line, Multi-Set Cache
• Associative cache: once the set is known, search all tags for the memory address in all lines of that set; this will be a very small number, e.g. 2 or 4; 4 for a 4-way set-associative cache
• Such a search can be accomplished in a fraction of a cycle, i.e. 2 or 4 parallel searches, only 1 of which can hit
• In the earlier example on p. 36, line 2 of set 2 is unused, AKA invalid in MESI terminology; invalid lines are an attractive cache feature ☺
• By now you know: sets, lines, associative, non-associative, direct mapped!
39
Replacement Policy
• The replacement policy is the rule that determines, when all lines are valid (i.e. already busy with other, good data) and a new line must be streamed in:
• Which of the valid lines in (a set of) the cache is to be replaced, AKA removed?
• Removal can be low cost if the modified bit (AKA dirty bit) is clear = 0; this means the data in memory and in the cache line are identical! In that case there is no need to stream out, back to memory!
• Otherwise removal is costly: if the dirty bit is set = 1, the data have to be written back to memory, costing a memory access!
• We call this copying to memory: stream-out
40
Replacement Policy

#  Name     Summary
1  LRU      Replaces the Least Recently Used cache line; requires keeping track of the relative "ages" of lines. Retire the line that has remained unused for the longest time of all candidate lines. Speculate that that line will remain unused for the longest time in the future.
2  LFU      Replaces the Least Frequently Used cache line; requires keeping track of the number m of times a line was used over the last n >= m uses. Depending on how long we track the usage, this may require many bits.
3  FIFO     First In, First Out: the first of the lines in the set that was streamed in is the first to be retired when it comes time to find a candidate. Has the advantage that no further update is needed while all lines are in use.
4  Random   Pick a random line from the candidate set for retirement; not as bad as this irrational algorithm might suggest. Reason: the other methods are not too good either ☺
5  Optimal  If a cache were omniscient, it could predict which line will remain unused for the longest time in the future. Of course, that is not computable. However, to create a perfect reference point, we can apply it to past memory access patterns and use the optimal result for comparison: how well does our chosen policy rate vs. the optimal strategy?
41
LRU Sample 1
Assume the following cache architecture:
• N = 16 sets, cyclic distribution
• K = 4 lines per set
• 32-bit architecture, byte-addressable
• write-back (dirty bit)
• valid line indicator (valid bit)
• L = 64 bytes per line; AKA line length
• LRU replacement; uses 2 bits (4 lines per set) to store relative ages
• This results in a tag size of ???? bits
42
LRU Sample 1
Assume the following cache architecture:
• This results in a tag size of 22 bits
• What is the total overhead size per cache line, measured in bits?
43
LRU Sample 1
Assume the following cache architecture:
• Tag size = 22 bits
• 2 LRU bits (4 lines per set) to store the relative ages of the 4 lines in each set
• Dirty bit needed, AKA Modified bit = 1
• Valid bit needed = 1
• Overhead per line: 22 + 2 + 1 + 1 = 26 bits
44
LRU Sample 2
• Sample 2 focuses on one particular set:
• Let the 4 lines be numbered 0..3
• The set is accessed in this order: 1.) line 0 miss, 2.) line 1 miss, 3.) line 0 hit, 4.) line 2 miss, 5.) line 0 hit again, 6.) line 3 miss, 7.) line 0 hit again, and 8.) another miss
• Now the cache is full; we need to find an available line by eviction, to have a line for the new access!
• Assume an initially cold cache: all lines in the cache are free before these accesses
• Problem: once all lines are filled (the valid bit is 1 for all 4 lines), some line must be retired/evicted to make room for the new access that missed. But which one?
• The answer now is based on the LRU policy (Least Recently Used line), which in this sample is line 1
45
LRU Sample 2
• The access order, assuming all memory accesses are reads, no writes, i.e. the dirty bit is always clear:
• Read miss, all lines invalid, stream paragraph into line 0
• Read miss (implies a new address), stream paragraph into line 1
• Read hit on line 0
• Read miss to a new address, stream paragraph into line 2
• Read hit, access line 0
• Read miss, stream paragraph into line 3
• Read hit, access line 0
• Now another read miss, all lines valid; find a line to retire, AKA to evict
• Note that LRU age 00₂ is the youngest, for cache line 0, and 11₂ is the oldest (AKA the least recently used), for cache line 1, of the 4 relative ages of the 4 total lines
47
LRU Sample 2

1. Initially, in a partly cold cache, if we experience a miss there will be an empty line; the paragraph is streamed into the empty line, its relative age is set to 0, and all other ages are incremented by 1
2. In a warm cache (all lines are in use), when a line of age X experiences a hit, its new age becomes 0. The ages of all other lines whose age < X are incremented by 1
3. Of course the "older ones" remain "older"
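The aging rules above can be simulated for Sample 2's access pattern; this is our sketch of the bookkeeping, not a hardware description:

```python
# Simulate the relative-age rules for Sample 2: lines filled/hit in
# order 0, 1, 0, 2, 0, 3, 0, in a set with K = 4 lines.
def lru_victim(accesses):
    ages = {}                        # line -> relative age, 0 = youngest
    for line in accesses:
        if line in ages:             # hit: only younger lines age by 1
            x = ages[line]
            for ln in ages:
                if ages[ln] < x:
                    ages[ln] += 1
        else:                        # miss into an empty line: all others age
            for ln in ages:
                ages[ln] += 1
        ages[line] = 0               # accessed line becomes youngest
    return max(ages, key=ages.get)   # oldest relative age = LRU victim

assert lru_victim([0, 1, 0, 2, 0, 3, 0]) == 1   # line 1 is evicted
```

After the seven accesses, line 0 carries age 00₂ and line 1 age 11₂, matching the slide: line 1 is retired on the eighth access.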
48
Compute Cache Size
Typical Cache Design Parameters:
1. Number of lines in every set: K
2. Number of bytes in a line, i.e. the length of a line: L
3. Number of sets in memory, and hence in the cache: N
4. Policy upon memory write (cache write policy)
5. Policy upon read miss (cache read policy)
6. Replacement policy (e.g. LRU, random, FIFO, etc.)
7. Total cache size in bits = K * ( 8 * L + tag + control bits ) * N –with 8-bit bytes
49
Compute Cache Size
Compute the minimum number of bits for an 8-way, set-associative cache on a 32-bit architecture with 64 sets, using cyclic allocation of sets, line length L = 32 bytes, using LRU and write-back. Memory is byte addressable, with 32-bit addresses:
Tag = . . .
50
Compute Cache Size
Compute the minimum number of bits for an 8-way, set-associative cache on a 32-bit architecture with 64 sets, using cyclic allocation of sets, line length L = 32 bytes, using LRU and write-back. Memory is byte addressable, with 32-bit addresses:
Tag = 32 - 5 - 6 = 21 bits
LRU, 8 ways = 3 bits
Dirty bit = 1 bit
Valid bit = 1 bit
Overhead per line = 21 + 3 + 1 + 1 = 26 bits
# of lines = K * N = 8 * 64 = 2⁹ lines
Data bits per cache line = 32 * 8 = 2⁸ bits
Total cache size = 2⁹ * ( 26 + 2⁸ ) = 144,384 bits
Size in bytes ≈ 17.6 KB
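The same computation, spelled out as a sketch:

```python
# Worked cache-size computation: 8-way, 64 sets, 32-byte lines,
# 32-bit byte addresses, cyclic mapping, LRU, write-back.
addr_bits, n_sets, k, line_bytes = 32, 64, 8, 32

offset_bits = line_bytes.bit_length() - 1        # log2(32) = 5
set_bits = n_sets.bit_length() - 1               # log2(64) = 6
tag_bits = addr_bits - offset_bits - set_bits    # 21
lru_bits = k.bit_length() - 1                    # log2(8) = 3
overhead = tag_bits + lru_bits + 1 + 1           # + dirty + valid = 26

total_bits = n_sets * k * (overhead + 8 * line_bytes)

assert tag_bits == 21 and overhead == 26
assert total_bits == 144_384                     # = 512 * 282 bits
assert total_bits // 8 == 18_048                 # bytes, i.e. ≈ 17.6 KB
```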
51
Trace Cache
• A Trace Cache is a special-purpose cache that does not hold (raw) instruction bits, but instead stores pre-decoded operations, AKA micro-ops
• The old AMD K5 uses a Trace Cache; see [1]
• Intel's Pentium® 4 uses a 12 k micro-op Trace Cache
• Advantage: faster access to executable bits for every cached instruction
• Disadvantage: less dense storage, i.e. wasted cache bits, when compared to a regular I-cache
• Note that cache bits are far more costly than memory bits; several decimal orders of magnitude!
• Trace caches have been falling out of favor since the early 2000s
53
Characteristic Cache Curve
• In the graph below we use the relative number of cache misses [RM] to avoid an unbounded vertical axis
• RM = 0 is the ideal case: no misses, all hits
• RM = 1 is the worst case: all memory accesses are cache misses
• If a program exhibits good locality, a relative cache size of 1 results in good performance; we use this as the reference point:
• Very coarsely, in some ranges, doubling the cache's size results in 30% fewer cache misses
• In other ranges of the characteristic curve, doubling the cache yields just a few % fewer misses: beyond the sweet spot!
56
UP Cache Summary
• A cache is special HW storage, allowing fast access to small areas of memory, copied into cache lines
• Built with expensive technology, hence the size of a cache relative to memory size is small; the cache holds only a small subset of memory, typically < 1 %
• Frequently used data (or instructions in an I-cache) are copied to the cache, with the hope that the data present in the cache are accessed frequently
• Miraculously ☺ that is generally true, so caches in general do speed up execution despite slow memories: exploiting what is known as locality
• Caches are organized into sets, with each set having 1 or more lines; multiple lines require searching
• Defined portions of memory get mapped into any one of these sets
57
Bibliography
1. Shen, John Paul, and Mikko H. Lipasti: Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, ©2005
2. http://forums.amd.com/forum/messageview.cfm?catid=11&threadid=29382&enterthread=y
3. Lam, M., E. E. Rothberg, and M. E. Wolf [1991]. "The Cache Performance and Optimizations of Blocked Algorithms," ACM 0-89791-380-9/91, pp. 63-74.
4. http://www.ece.umd.edu/~blj/papers/hpca2006.pdf
5. MESI: http://en.wikipedia.org/wiki/Cache_coherence
6. Kilburn, T., et al.: "One-level storage systems," IRE Transactions, EC-11, 2, 1962, pp. 223-235
58
Bibliography
7. Anderson, Don, and Tom Shanley, MindShare [1995]. Pentium™ Processor System Architecture, Addison-Wesley Publishing Company, Reading, MA, PC System Architecture Series. ISBN 0-201-40992-5
8. Pentium Pro Developer's Manual, Volume 1: Specifications, Intel document, 1996, one of a set of 3 volumes
9. Pentium Pro Developer's Manual, Volume 2: Programmer's Reference Manual, Intel document, 1996, one of a set of 3 volumes
10. Pentium Pro Developer's Manual, Volume 3: Operating System Writer's Manual, Intel document, 1996, one of a set of 3 volumes
11. Y. Sheffer: http://webee.technion.ac.il/courses/044800/lectures/MESI.pdf
12. MOESI protocol: http://en.wikipedia.org/wiki/MOESI_protocol
13. MESIF protocol: http://en.wikipedia.org/wiki/MESIF_protocol
60
Definitions: Aging
• A cache line's age is tracked only in an associative cache; it doesn't apply to a direct-mapped cache
• Aging tracks when a cache line was accessed, relative to the other lines in its set
• This implies that ages are compared
• Generally the relative ages are of interest, as in "am I older than you?", rather than the absolute age, e.g. "I was accessed at cycle such and such"
• Think about the minimum number of bits needed to store the relative ages of, say, 8 cache lines!
• A memory access addresses only one line, hence all lines in a set have distinct (relative) ages
61
Definitions: Alignment
• Alignment is a spacing requirement, i.e. the restriction that an address adhere to a specific placement condition
• For example, even alignment means that an address is even, i.e. divisible by 2
• E.g. address 3 is not even-aligned, but address 1000 is; the rightmost bit of an even-aligned address is 0
• In VMM, page addresses are aligned on page boundaries. If a page frame has size 4k, then page addresses that adhere to page alignment are evenly divisible by 4k
• As a result, the low-order (rightmost) 12 bits are 0. Knowledge of alignment can be exploited to avoid storing address bits in VMM, caching, etc.
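The checks in this definition can be sketched in a few lines:

```python
# Sketch of the alignment checks from this definition.
PAGE = 4 * 1024                  # 4 KiB page frame

def page_aligned(addr):
    """True iff the low-order 12 bits are zero."""
    return addr % PAGE == 0

assert (PAGE - 1).bit_length() == 12    # 12 implied zero bits
assert page_aligned(0x3000) and not page_aligned(0x3001)
assert 1000 % 2 == 0 and 3 % 2 != 0     # 1000 is even-aligned, 3 is not
```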
62
Definitions: Allocate-on-Write
• If a store instruction experiences a cache miss, and as a result a cache line is filled, the allocate-on-write cache policy is in use
• I.e. if a write miss causes the paragraph from memory to be streamed into a data cache line, we say the cache uses allocate-on-write
• Pentium processors, for example, do not use allocate-on-write
• Antonym: write-by
63
Definitions: Associativity
• If a cache has multiple lines per set, we call it k-way associative; k stands for the number of lines in a set
• A cache with multiple lines (i.e. k > 1) does require searching, or address comparing; the search checks whether a referenced object is in fact present
• Another way of saying this: in an associative cache, any memory object has more than one cache line where it might live
• Antonym: direct mapped; if only a single line (per set) exists, the search is reduced to a simple, single tag comparison
64
Definitions: Back-Off
• If processor P1 issues a store to a data address shared with another processor P2, and P2 has cached and modified the same data, a chance for data inconsistency arises
• To avoid this, P2, holding the modified cache line, must snoop for other processors' accesses, to guarantee delivery of the newest data
• Once the snoop detects the access request from P1, P1 must be prevented from gaining ownership of the data; accomplished by temporarily denying P1 bus access
• This bus denial for the sake of preserving data integrity is called back-off
65
Definitions: Blocking Cache
• Let a cache miss result in streaming in a line
• If during that stream-in no further accesses can be made to this cache until the data transfer is complete, the cache is called blocking
• Antonym: non-blocking
• Generally, a blocking cache yields lower performance than a non-blocking one
66
Definitions: Bus Master
• Only one of the devices connected to a system bus has the right to send signals across the bus; this ownership is called being the bus master
• Initially the Memory & IO Controller (MIOC) is bus master; the chipset may include a special-purpose bus arbiter
• Over time, all processors –or their caches– may request to become bus master for some number of bus cycles
• The MIOC can grant this right; yet each of the processors pi (more specifically: its cache) can request a back-off for pj, even if pj would otherwise be bus master
67
Definitions: Critical Chunk First
• The number of bytes in a line is generally larger than the number of bytes that can be brought into the cache across the bus in 1 step, requiring multiple bus transfers to fill a line completely
• It would be efficient if the actually needed bytes resided in the first chunk brought across the bus
• The deliberate policy that accomplishes just that is the Critical Chunk First policy
• This allows the cache to be unblocked after the first transfer, though the line is not yet completely loaded
• Other parts of the line may be used later, but the critical byte can thus be accessed right away
68
Definitions: Direct Mapped
• If each memory address has just one possible location (i.e. one single line, with K = 1) in the cache where it could possibly reside, that cache is called direct mapped
• Antonym: associative, or fully associative
• Synonym: non-associative
69
Definitions: Directory
• The collection of all tags is referred to as the cache directory
• In addition to the directory and the actual data, there may be further overhead bits in a data cache

Dirty Bit
• The dirty bit is a data structure associated with a cache line. This bit expresses whether a write hit has occurred, on a system applying write-back
• Synonym: Modified bit
• There may be further overhead bits in a data cache
70
Definitions: Effective Cycle Time teff
• Let the cache hit rate h be the number of hits divided by the number of all memory accesses, the ideal hit rate being 1; m is the miss rate = 1 - h; thus: teff = tcache + (1-h) * tmem = tcache + m * tmem
• Alternatively, the effective cycle time might be teff = max( tcache, m * tmem )
• The latter holds if a memory access to retrieve the data is initiated simultaneously with the cache access
• tcache = time to access a datum in the cache, ideally 1 cycle, while tmem is the time to access a data item in memory; generally not a constant value
• The hit rate h varies from 0.0 to 1.0
71
Definitions: Exclusive
• A state in the MESI protocol. The E state indicates that the current cache is not aware of any other cache sharing the same information, and that the line is unmodified
• E allows that in the future another cache line may contain a copy of the same information, in which case E must transition to another state
• It is possible that a higher-level cache (an L1, for example, viewed from an L2) may actually have a shared copy of a line in Exclusive state; however, that level of sharing is transparent to other potentially sharing agents outside the current processor
72
Definitions: Fully Associative Cache
• It is possible not to partition a cache into sets
• In that case, all lines need to be searched for a cache hit or miss
• We call this a fully associative cache
• Generally works for small caches, since the search may become costly in time or HW if the cache were large
73
Definitions: Hit Rate h
• The hit rate h is the number of memory accesses (reads/writes, AKA loads/stores) that hit the cache, over the total number of memory accesses
• By contrast, H is the total number of hits alone
• A hit rate h = 1 means all accesses are served from the cache, while h = 0 means all are from memory, i.e. none hit the cache
• Conventional notations are hr and hw for read and write hits
• See also: miss rate
74
Definitions: Invalid
• A state in the MESI protocol
• State I indicates that its cache line is invalid and consequently holds no valid data; it is ready for use with new data
• It is desirable to have I lines: they allow the stream-in of a paragraph without evicting another cache line
• The Invalid (I) state is set for all cache lines after system reset
75
Definitions: Line
• Storage area in cache able to hold a copy of a contiguous block of memory cells, i.e. a paragraph
• The portion of memory stored in that line is aligned on an address modulo the line size
• For example, if a line holds 64 bytes on a byte-addressable architecture, the address of the first byte has 6 trailing zeros: evenly divisible by 64, it is 64-byte aligned
• Such known zeros don't need to be stored in the tag, the address bits stored in the cache; they are implied
• This shortens the tag, rendering the cache cheaper to manufacture: fewer HW bits!
76
Definitions: LLC
• The Last Level Cache is the largest cache in the memory hierarchy, the one closest to physical memory, or furthest from the processor
• Typical on multi-core architectures
• Typical cache sizes: 4 MB to 32 MB
• It is common to have one LLC shared among all cores of an MCP (Multi-Core Processor), but with the option of separating it (by fusing) to create dedicated LLC caches of identical total size
77
Definitions: LRU
• Acronym for Least Recently Used
• A cache replacement policy (also a page replacement policy, discussed under VMM) that requires aging information for the lines in a set
• Each time a cache line is accessed, that line becomes the youngest one touched
• Other lines of the same set age by one unit, i.e. get older by 1 event; an event is a memory access
• Relative ages are sufficient for LRU tracking; no need to track exact ages!
• Antonym: most recently used (MRU)!
78
Definitions: Locality of Data
• A surprising, beneficial attribute of memory access patterns: when an address is referenced, there is a good chance that in the near future another access will happen at or near that same address
• I.e. memory accesses tend to cluster, also observable in hashing functions and memory page accesses
• Antonym: randomly distributed, or normally distributed
79
Definitions: MESI
• Acronym for Modified, Exclusive, Shared, and Invalid
• An ancient protocol to ensure cache coherence on the family of Pentium processors. A protocol is necessary if multiple processors hold copies of common data with the right to modify
• Through the MESI protocol, data coherence is ensured no matter which of the processors performs writes
• AKA the Illinois protocol, due to its origin at the University of Illinois at Urbana-Champaign
80
Definitions: Miss Rate
• The miss rate, denoted m, is the number of memory (read/write) accesses that miss the cache, over the total number of accesses
• Clearly the miss rate, like the hit rate, varies between 0.0 and 1.0
• The miss rate m = 1 - h
• Antonym: hit rate h
81
Definitions: Modified
• A state in the MESI protocol
• The M state implies that the cache line found by a write hit was exclusive, and that the current processor has modified the data
• The Modified state expresses: currently not shared, exclusively owned data have been modified
• In a UP system, this is generally expressed by the dirty bit
82
DefinitionsParagraphl Conceptual, aligned, fixed-size area of the logical
address space that can be streamed into cachel Area in the cache of paragraph-size is called a linel In addition to the actual data, a line in cache has
further information, including the dirty and valid bits (in UP systems), the tag, LRU information, and in MP systems the MESI bits
l The MESI M state corresponds to the dirty bit in a UP system
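The per-line bookkeeping listed above can be pictured as a small record. This is purely illustrative; the field names are assumptions, not any real cache's layout.

```python
from dataclasses import dataclass

@dataclass
class CacheLine:
    """Illustrative per-line bookkeeping; field names are assumptions."""
    tag: int = 0           # identifies which paragraph the line holds
    valid: bool = False    # invalid lines are free
    dirty: bool = False    # UP systems: modified since stream-in
    mesi: str = "I"        # MP systems: M/E/S/I subsumes valid + dirty
    data: bytes = b""      # the cached copy of the paragraph itself
```

In a UP system the MESI M state collapses to the dirty bit, and I corresponds to valid = 0, as the slide notes.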
83
Definitions
Replacement Policyl A replacement policy is a convention that determines
which line is to be retired when a new line must be loaded but none is free in the set, so one has to be evicted
l Ideally, the line that would remain unused for the longest time in the future should be replaced and its contents overwritten with new data
l Generally we do not know which line will stay unreferenced for the longest time in the future
l In a direct-mapped cache, the replacement policy is trivial: it is moot, as there is just one line per set
84
DefinitionsSetl A logically connected region of memory, to be mapped onto a
specific area of cache (line), is a set; there are N sets in memoryl Elements of a set need not be physically contiguous in
memory; with blocked (contiguous) mapping, the leftmost log2(N) address bits select the set; with cyclic distribution, the log2(N) bits just above the alignment (offset) bits select the set
l The number of sets is conventionally labeled Nl A degenerate case is to map all memory onto the whole cache,
in which case only a single set exists: N = 1l The notion of a set is meaningful only if there are multiple sets. A
memory region belonging to one set can be physically contiguous or distributed cyclically
l In the former case the distribution is called blocked, in the latter cyclic. The cache area into which a portion of memory is mapped is also called a set
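The two distributions differ only in which address bits select the set. A sketch, assuming N sets, paragraph size, and memory size are all powers of two (function and parameter names are assumptions):

```python
def set_index(addr, n_sets, para_size, mem_size, mapping="cyclic"):
    """Which of the N sets a byte address falls into.

    cyclic : the log2(N) bits just above the in-paragraph offset select
             the set, so consecutive paragraphs hit consecutive sets.
    blocked: memory is split into N contiguous regions; the leftmost
             log2(N) address bits select the set."""
    if mapping == "cyclic":
        return (addr // para_size) % n_sets
    # blocked: each set owns one contiguous slice of memory
    return addr // (mem_size // n_sets)
```

E.g. with 4 sets, 64-byte paragraphs, and a 1 KiB memory: cyclic maps address 64 to set 1, while blocked keeps the whole first 256-byte region in set 0.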
85
Definitions
Set-Associativel A cached system in which each set has multiple
cache lines is called set-associativel For example, 4-way set associative means that
there are multiple sets (could be 4 sets, 256 sets, 1024 sets, or any other number of sets) and each of those sets has 4 lines
l Integral powers of 2 are good ☺ to use
l That’s what the 4 refers to in a 4-way cachel Antonym: non-associative, AKA direct-mapped
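With cyclic mapping, a set-associative lookup splits the address into tag, set index, and in-paragraph offset. A sketch, assuming power-of-two sizes (function name is an assumption); note the number of ways does not appear in the split at all, it only multiplies capacity: size = ways x N x paragraph size.

```python
def split_address(addr, para_size, n_sets):
    """Split a byte address into (tag, set, offset) for a
    set-associative cache with cyclic mapping.
    Sizes are assumed to be integral powers of 2."""
    offset = addr % para_size                 # byte within the paragraph
    set_ix = (addr // para_size) % n_sets     # which set to probe
    tag    = addr // (para_size * n_sets)     # compared against each way
    return tag, set_ix, offset
```

On a lookup, all lines (ways) of the selected set compare their stored tag against this tag in parallel; a match on a valid line is a hit.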
86
DefinitionsSharedl State in the MESI protocoll S state expresses that the hit line is present in
more than one cache. Moreover, the current cache (with the shared state) has not modified the line after stream-in
l Another cache of the same processor may be such a sharing agent. For example, in a two-level cache with an inclusive policy, the L2 cache holds all data present in the L1 cache
l Similarly, another processor’s L2 cache may share data with the current processor’s L2 cache
87
DefinitionsStale Memoryl A valid cache line may be overwritten with new datal The write-back policy records such overwritingl At the moment of a cache write with write-back,
cache and memory are out of sync; we say memory is stale
l Poses no danger, since the dirty bit (or modified bit) reflects that memory eventually must be updated
l But until this happens, memory is stalel Note that if two processors’ caches share memory
and one cache renders memory stale, the other processor should no longer have access to that portion of shared memory
88
DefinitionsStream-Outl Streaming out a line refers to the movement of one
line of modified data, out of the cache and back into a memory paragraph
Stream-Inl The movement of one paragraph of data from
memory into a cache line. Since line length generally exceeds the bus width (i.e. exceeds the number of bytes that can be moved in a single bus transaction), a stream-in process requires multiple bus transactions in a row
l Possible that the byte actually needed will arrive last in a cache line during a sequence of bus transactions; can be avoided with the critical chunk first (AKA critical word first) policy
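The critical-chunk-first idea can be sketched by computing the order of bus transactions for one stream-in: the chunk holding the missed byte goes first, the rest wrap around. Names are assumptions; this shows only the ordering, not the bus itself.

```python
def stream_in_order(miss_offset, line_size, bus_width, critical_first=True):
    """Order of bus transactions (chunk indices) for one stream-in.
    With critical-chunk-first, the chunk containing the missed byte
    is fetched first, then the remaining chunks wrap around."""
    n = line_size // bus_width          # transactions per stream-in
    if not critical_first:
        return list(range(n))           # plain sequential fill
    crit = miss_offset // bus_width     # chunk holding the needed byte
    return [(crit + i) % n for i in range(n)]
```

E.g. a 64-byte line on an 8-byte bus with a miss at offset 35 fetches chunk 4 first: [4, 5, 6, 7, 0, 1, 2, 3], so the processor can resume after the first transaction instead of the fifth.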
89
DefinitionsSnoopingl After a line write hit in a cache using write-back, the data in
cache and memory are no longer identical. In accordance with the write-back policy, memory will be written eventually, but until then memory is stale
l The modifier (the cache that wrote) must pay attention to other bus masters trying to access the same line. If this is detected, action must be taken to ensure data integrity
l This paying attention is called snooping. The right action may be forcing a back-off, or snarfing, or yet something else that ensures data coherence
l Snooping starts with the lowest-order cache, here the L2 cache. If appropriate, L2 lets L1 snoop for the same address, because L1 may have further modified the line
90
DefinitionsSquashingl Starting with a read-miss:l In a non-blocking cache, a subsequent memory access may
be issued after an earlier read-miss, even if that previous miss results in a stream-in that is currently still under way
l That subsequent memory access will be a miss again, which is being queued. Whenever an access references an address for which a request is already outstanding, the duplicate request to stream-in can be skipped
l Not entering this in the queue is called squashingl The second and any further outstanding memory access can
be resolved, once the first stream-in results in the line being present in the cache
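The queue behavior described above can be sketched as follows: a non-blocking cache tracks outstanding stream-ins, and a second miss to the same line is squashed rather than queued. Class and method names are assumptions.

```python
class MissQueue:
    """Sketch of outstanding-miss tracking in a non-blocking cache.
    A miss to a line whose stream-in is already under way is squashed:
    no duplicate stream-in request enters the queue."""
    def __init__(self):
        self.outstanding = set()   # line addresses being streamed in
        self.squashed = 0          # duplicate requests skipped

    def miss(self, line_addr):
        """Returns True if a new stream-in was issued, False if squashed."""
        if line_addr in self.outstanding:
            self.squashed += 1     # request already pending: squash
            return False
        self.outstanding.add(line_addr)
        return True

    def stream_in_done(self, line_addr):
        """Line arrived: all accesses waiting on it can now be resolved."""
        self.outstanding.discard(line_addr)
```

Once `stream_in_done` fires, both the original miss and every squashed duplicate are satisfied by the single line now present in the cache.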
91
Definitions
Strong Write Orderl A policy ensuring that memory writes occur in the
same order as the list of store operations in the executing object code
l Antonym: Weak orderl The advantage of weak ordering can be speed
gain, allowing a compiler or cache policy to schedule instructions out of order; this requires further care to ensure data integrity
92
Definitions
Stream-Inl The movement of a paragraph from memory into a
cache linel Since line length generally exceeds the bus width
(i.e. exceeds the number of bytes that can be moved in a single bus transaction), a stream-in process requires multiple bus transactions
l It is possible that the byte actually needed arrives last (or first) in a cache line during a sequence of bus transactions
l Antonym: Stream-out
93
Definitions
Stream-Outl The movement of one line of modified data from
cache into a memory paragraphl Antonym: Stream-in l Note that unmodified data don’t need to be
streamed-out from cache to memory; they are already present in memory
94
Definitions
Trace Cachel Special-purpose cache that holds pre-
decoded instructions, AKA micro-opsl Advantage: Repeated decoding for
instructions is not neededl Trace caches fell out of favor during the
2000s
95
DefinitionsValid Bitl Single-bit data structure per cache line, indicating
whether or not the line is free; free means invalidl If a line is not valid (i.e. if valid bit is 0), it can be
filled with a new paragraph upon a cache missl Else, (valid bit 1), the line holds valid informationl After a system reset, all valid bits of the whole
cache are set to 0l The I bit in the MESI protocol takes on that role on
an MP cache subsysteml To be discussed in MP-cache coherence topic
96
Definitions
Weak Write Orderl A memory-write policy allowing memory writes (by a
compiler or cache) to occur in a different order than their originating store operations
l Antonym: Strong Write Orderl The advantage of weak ordering is potential speed
gain
97
DefinitionsWrite-Backl Cache write policy that keeps a line of data (a
paragraph) in the cache even after a write, i.e. after a modification
l The changed state must be remembered via the dirty bit, AKA Modified state, or modified bit
l Memory is temporarily stale in such a casel Upon retirement, any dirty line must be copied
back into memory; called write-backl Advantage: only one stream-out, no matter how
many write hits occurred to that same line
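The stated advantage, one stream-out regardless of the number of write hits, can be made concrete by counting the memory write transactions each policy causes for one line. A sketch under that simple model (function name is an assumption):

```python
def count_stream_outs(write_hits, policy="write-back"):
    """Memory write transactions caused by n write hits to one line.
    write-through: every hit is propagated to memory immediately.
    write-back   : only the eventual eviction of the dirty line
                   streams it out, once, however many hits occurred."""
    if policy == "write-through":
        return write_hits
    return 1 if write_hits > 0 else 0   # clean lines need no stream-out
```

So ten write hits to one line cost ten bus writes under write-through but a single stream-out under write-back, which is exactly the traffic saving the slide describes.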
98
DefinitionsWrite-Byl Cache write policy, in which the cache is not
accessed on a write miss, even if there are cache lines in I state
l A cache using write-by “hopes” that soon there may be a load, which will result in a miss and then stream-in the appropriate line; if not, it was not necessary to stream-in the line in the first place
l Antonym: allocate-on-write
99
DefinitionsWrite-Oncel Cache write policy that starts out as write-through
and changes to write-back after the first write hit to a line
l Typical policy imposed onto a higher-level L1 cache by the L2 cache
l Advantage: The L1 cache places no unnecessary traffic onto the system bus upon a cache-write hit
l Lower level L2 cache can remember that a write has occurred by setting the MESI state to modified
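The write-once transition is a small state machine per line: the first write hit still goes through to L2 (which can set its MESI state to M), every later hit stays local, write-back style. A sketch with assumed names:

```python
class WriteOnceLine:
    """Sketch of the write-once policy for one L1 line: starts out
    write-through, switches to write-back after the first write hit."""
    def __init__(self):
        self.mode = "write-through"

    def write_hit(self):
        """Returns True if this write is propagated to the L2 cache."""
        if self.mode == "write-through":
            self.mode = "write-back"   # first hit: L2 learns of the write
            return True
        return False                   # later hits stay in L1, no bus traffic
```

The single propagated write is what lets the lower-level L2 cache remember that the line was modified, while all subsequent hits stay off the system bus.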