Lecture 15: Large Cache Design III
-
Upload
quon-mclaughlin -
Category
Documents
-
view
18 -
download
0
description
Transcript of Lecture 15: Large Cache Design III
![Page 1: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/1.jpg)
1
Lecture 15: Large Cache Design III
• Topics: Replacement policies, prefetch, dead blocks, associativity, cache networks
![Page 2: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/2.jpg)
2
LIN Qureshi et al., ISCA’06
• Memory level parallelism (MLP): number of misses that simultaneously access memory; high MLP miss is less expensive
• Replacement decision is a linear combination of recency and MLP experienced when fetching that block
• MLP is estimated by tracking the number of outstanding requests in the MSHR while waiting in the MSHR
• Can also use set dueling to decide between LRU and LIN
![Page 3: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/3.jpg)
3
Pseudo-LIFO Chaudhuri, MICRO’09
• A fill stack is a FIFO that tracks the order in which blocks entered the set
• Most hits are serviced while a block is near the top of the stack; there is usually a knee-point beyond which blocks stop yielding hits
• Evict the highest block above the knee that has not yet serviced a hit in its current fill stack position
• Allows some blocks to probabilistically escape past the knee and get retained for distant reuse
![Page 4: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/4.jpg)
4
Scavenger Basu et al., MICRO’07
• Half the cache is used as a victim cache to retain blocks that will likely be used in the distant future
• Counting bloom filters to track a block’s potential for reuse and make replacement decisions in the victim cache
• Complex indexing and search in the victim cache
• Another recent paper (NuCache, HPCA’11) places blocks in a large FIFO victim file if they were fetched by delinquent PCs and the block has a short re-use distance
![Page 5: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/5.jpg)
5
V-Way Cache Qureshi et al., ISCA’05
• Meant to reduce load imbalance among sets and compute a better global replacement decision
• Tag store: every set has twice as many ways
• Data store: no correspondence with tag store; need forward and reverse pointers
• In most cases, can replace any block; every block has a 2b saturating counter that is incremented on every access; scan blocks (and decrement) until a zero counter is found; continue scan on next replacement
![Page 6: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/6.jpg)
6
ZCache Sanchez and Kozyrakis, MICRO’10
• Skewed associative cache: each way has a different indexing function (in essence, W direct-mapped caches)
• When block A is brought in, it could replace one of four (say) blocks B, C, D, E; but B could be made to reside in one of three other locations (currently occupied by F, G, H); and F could be moved to one of three other locations
• We thus get a tree of replacement options and we can pick LRU among these options
• Every replacement requires multiple tag look-ups and data block copies; worthwhile if you’re reducing off-chip accesses
![Page 7: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/7.jpg)
7
Temporal Memory Streaming Wenisch et al., ISCA’05
• When a thread incurs a series of misses to blocks (PQRS, not necessarily contiguous), other threads are likely to incur a similar series of misses
• Each thread maintains its miss log in a circular buffer in memory; the directory for P also keeps track of a pointer to multiple log entries of P
• When a thread has a miss on P, it contacts the directory and the directory provides the log pointers; the thread receives multiple streams and starts prefetching
• Log access and prefetches are off the critical path
![Page 8: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/8.jpg)
8
Spatial Memory Streaming Somogyi et al., ISCA’06
• Threads often enter a new region (page) and touch a few arbitrary blocks in that region
• A predictor is indexed with the PC of the first access to that region and the offset of the first access; the predictor returns a bit vector indicating the blocks accessed within that region
• Can even prefetch for regions that have not been touched before!
![Page 9: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/9.jpg)
9
Feedback Directed Prefetching Srinath et al., HPCA’07
• A stream prefetcher has two parameters: P: prefetch distance: how far ahead of the start do we prefetch N: prefetch degree: how much do we advance the start when there is a hit in the stream
• Can vary these two parameters based on pref effectiveness
• Accuracy: a bit tracks if a prefetched block was touched
• Timeliness: was the block touched while in the MSHR?
• Pollution: track recent evictions (Bloom filter) and see if they are re-touched; also guides insertion policy
![Page 10: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/10.jpg)
10
Dead Block Prediction
• Can keep track of the number of accesses to a line during its previous residence; the block is deemed to be dead after that many accesses Kharbutli, Solihin, IEEE TOC’08
• To reduce noise, an access can be considered as a block’s move to the MRU position Liu et al., MICRO 2008
• Earlier DBPs used a trace of PCs to capture when a block has completed its use
• DBP is used for energy savings, replacement policies, and cache bypassing
![Page 11: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/11.jpg)
11
Distill Cache Qureshi, HPCA 2007
• Half the ways are traditional (LOC); when a block is evicted from the LOC, only the touched words are stored in a word-organized cache that has many narrow ways
• Incurs a fair bit of complexity (more tags for the WOC, collection of word touches in L1s, blocks with holes, etc.)
• Does not need a predictor; actions are based on the block’s behavior during current residence
• Useless word identification is orthogonal to cache compression
![Page 12: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/12.jpg)
12
Traditional Networks Huh et al. ICS’05, Beckmann MICRO’04
Example designs for contiguous L2 cache regions
![Page 13: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/13.jpg)
13
Explorations for Optimality Muralimanohar et al., ISCA’07
![Page 14: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/14.jpg)
14
3D Designs, Li et al., ISCA’06
• D-NUCA: first search in cylinder, then multicast search everywhere• Data is migrated close to requester, but need not jump across layers
![Page 15: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/15.jpg)
15
Halo Network Jin et al., HPCA’07
• D-NUCA: Sets are distributed across columns; Ways are distributed across rows
![Page 16: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/16.jpg)
16
Halo Network
![Page 17: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/17.jpg)
17
Nahalal Guz et al., CAL’07
![Page 18: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/18.jpg)
18
Nahalal
• Block is initially placed in core’s private bank and then swapped into the shared bank if frequently accessed by other cores
• Parallel search across all banks
![Page 19: Lecture 15: Large Cache Design III](https://reader036.fdocuments.in/reader036/viewer/2022070400/568134a8550346895d9bb35a/html5/thumbnails/19.jpg)
19
Title
• Bullet