Project Summary
Fair and High Throughput Cache Partitioning Scheme for CMPs
Shibdas Bandyopadhyay, Dept of CISE
University of Florida
Project Proposal
• The architecture of the machine consists of a private level of cache (say L2) and a shared next level of cache (say L3)
• The aim is to further partition the private level of cache (L2) depending on the characteristics of the applications running on the different cores
• For example, if the applications running on different cores share blocks among themselves, some blocks will be marked exclusively for shared usage
• This helps reduce the miss rate for applications that share data heavily; if the applications do not share data, the cache performs as before
Motivation

[Figure: breakdown of memory accesses to shared vs. private blocks for commercial workloads]
Motivation

• As can be seen from the previous figure, for the commercial workloads shared blocks account for the majority of memory accesses
• As these workloads are all web-server applications, they share a large amount of data among the threads they spawn on multiple cores
• In the case of virtualized server consolidation, there will be a great amount of sharing among the cores participating in a virtual server
• So, in the spirit of Amdahl's law, if we reduce the miss rate for the shared blocks in these situations we should be able to improve the total hit rate
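To make the argument concrete (an illustration, not taken from the slides): if a fraction f of all accesses go to shared blocks, the overall miss rate decomposes as

```latex
m_{\text{total}} = f \cdot m_{\text{shared}} + (1 - f) \cdot m_{\text{private}}
```

so when f is large, as in the commercial workloads above, lowering m_shared lowers m_total almost proportionally.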
Proposed Strategy

• Each cache set has a bit vector (call it the Replacement Priority Vector [RPV]) whose length equals the associativity of the cache (i.e. the number of blocks in the set)
• A value of 1 at position x in that vector indicates that block x of that set is reserved exclusively for shared blocks; the other blocks can hold both private and shared blocks
• During replacement, one of two strategies is followed depending on the state of the incoming block (see the sketch after this list)
• If the incoming block will be in the shared state, all blocks in the set are considered and the LRU block is replaced
• If the incoming block will be in the private state, all blocks except the ones reserved exclusively for shared blocks are considered for LRU replacement
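A minimal sketch of this RPV-aware victim selection; the function and field names (choose_victim, lru_age) are illustrative assumptions, not the project's code:

```python
# Illustrative sketch of RPV-aware LRU victim selection.
def choose_victim(ways, rpv, incoming_is_shared):
    """ways: per-way objects with .lru_age (larger = older);
    rpv[i] == 1 means way i is reserved exclusively for shared blocks."""
    if incoming_is_shared:
        candidates = list(range(len(ways)))                 # shared fills may evict any way
    else:
        candidates = [i for i in range(len(ways)) if rpv[i] == 0]  # private fills skip reserved ways
    if not candidates:                                      # degenerate case: every way reserved
        candidates = list(range(len(ways)))
    return max(candidates, key=lambda i: ways[i].lru_age)   # evict the LRU candidate
```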
Proposed Strategy

• The RPV for each cache set is set up by each core as directed by the cache directory controller (we assume a directory-based cache coherence protocol)
• The directory tracks the number of misses on shared blocks in a time interval for all processors in a buffer called the Processor Activity Buffer [PAB]
• Each PAB entry consists of three fields: a core id, the number of misses on shared blocks for that processor in the present time interval, and the same count for the previous time interval
• If the difference for a particular core is greater than a threshold, the directory sends a message to the core to increase the number of reserved shared blocks, and vice versa if it is below the threshold (see the sketch below)
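A minimal sketch of the interval check the directory could run over the PAB; THRESHOLD, the field names, and send_to_core are illustrative assumptions:

```python
# Illustrative PAB bookkeeping; names and threshold value are assumptions.
THRESHOLD = 16

class PABEntry:
    def __init__(self, core_id):
        self.core_id = core_id
        self.curr_misses = 0    # misses on shared blocks in the present interval
        self.prev_misses = 0    # misses on shared blocks in the previous interval

def end_of_interval(pab, send_to_core):
    """Run by the directory once per time interval (every T requests)."""
    for e in pab:
        delta = e.curr_misses - e.prev_misses
        if delta > THRESHOLD:
            send_to_core(e.core_id, "increase shared blocks")
        elif delta < THRESHOLD:
            send_to_core(e.core_id, "decrease shared blocks")
        e.prev_misses, e.curr_misses = e.curr_misses, 0   # roll the interval over
```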
Proposed Strategy

• The RPV for each set of each core is initially set to zero
• Upon receiving an "increase shared blocks" message from the directory, the core looks at the current number of shared blocks in each cache set (a counter is associated with each set, incremented when a shared block comes into that set and decremented when a shared block is replaced)
• It decides for which sets the number of reserved shared blocks will be increased
• It then modifies the RPV for those sets by turning on a bit, depending on the current RPV
• On receiving a "decrease shared blocks" message from the directory, it finds the sets with the lowest number of shared blocks and modifies their RPVs accordingly (a sketch follows)
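A sketch of how a core might react to these messages; the helper names and the number of sets touched are assumptions (the simulation slides later fix the set count at 10):

```python
import random

# Illustrative per-core handler for directory messages; all names are assumptions.
SETS_TO_ADJUST = 10

def on_directory_message(msg, cache_sets):
    """cache_sets: objects with .rpv (list of 0/1 bits) and .shared_count."""
    increase = (msg == "increase shared blocks")
    # Most-shared sets first when increasing, least-shared first when decreasing
    ranked = sorted(cache_sets, key=lambda s: s.shared_count, reverse=increase)
    for s in ranked[:SETS_TO_ADJUST]:
        if increase:
            free = [i for i, bit in enumerate(s.rpv) if bit == 0]
            if free:
                s.rpv[random.choice(free)] = 1    # reserve one more way for shared blocks
        else:
            used = [i for i, bit in enumerate(s.rpv) if bit == 1]
            if used:
                s.rpv[random.choice(used)] = 0    # release one reserved way
```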
Cache Coherence Protocol

• A simple directory-based coherence protocol, as described in Hennessy & Patterson
[State diagram of the per-cache-block controller, with states Invalid, Shared (read only), and Exclusive (read/write). Transition labels include: CPU read: send Read Miss message; CPU write: send Write Miss message to home directory; Invalidate; Fetch / Fetch-Invalidate: send Data Write Back message to home directory; CPU read miss (in Exclusive): send Data Write Back message and read miss to home directory; CPU read hit and CPU write hit self-loops.]
Cache Coherence Protocol
[State diagram of the directory controller, with states Uncached, Shared (read only), and Exclusive (read/write). Transition labels include: Read miss: Sharers = {P}, send Data Value Reply; Write miss: Sharers = {P}, send Data Value Reply message; Read miss (in Exclusive): Sharers += {P}, send Fetch, send Data Value Reply message to remote cache (write back block); Data Write Back: Sharers = {}, write back block.]
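A minimal sketch of the directory-side transitions of this textbook protocol; the function and message names are illustrative assumptions, not the project's code:

```python
# Illustrative directory-side handler for the Hennessy & Patterson style protocol above.
UNCACHED, SHARED, EXCLUSIVE = "Uncached", "Shared", "Exclusive"

class DirEntry:
    def __init__(self):
        self.state = UNCACHED
        self.sharers = set()

def on_read_miss(entry, p, send):
    if entry.state == EXCLUSIVE:
        owner = next(iter(entry.sharers))
        send(owner, "Fetch")                 # current owner writes the block back
    entry.sharers.add(p)
    send(p, "Data Value Reply")
    entry.state = SHARED

def on_write_miss(entry, p, send):
    if entry.state == SHARED:
        for q in entry.sharers - {p}:
            send(q, "Invalidate")            # invalidate all other sharers
    elif entry.state == EXCLUSIVE:
        owner = next(iter(entry.sharers))
        send(owner, "Fetch/Invalidate")      # reclaim the block from the old owner
    entry.sharers = {p}
    send(p, "Data Value Reply")
    entry.state = EXCLUSIVE

def on_data_write_back(entry):
    entry.sharers = set()                    # block written back to memory
    entry.state = UNCACHED
```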
Simulation Strategy
• Each core is represented by a process; it reads from the trace file generated for that core from the MP trace file
• Each core process connects to the directory process using sockets and sends the current address to the directory if it is not a hit in the local cache
• The directory process updates the PAB if needed and sends an update to the core processes after every T requests to the directory
• As this is a simplified coherence protocol, core processes wait for acknowledgement and data from the directory before proceeding to the next address (a sketch of the core loop follows)
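A minimal sketch of one core process, assuming a line-oriented text protocol between core and directory (one hex address per line); the host/port, message format, and cache interface are illustrative assumptions, not the project's actual code:

```python
import socket

# Illustrative core process; all protocol details are assumptions.
DIRECTORY_ADDR = ("localhost", 9000)

def run_core(core_id, trace_path, cache):
    with socket.create_connection(DIRECTORY_ADDR) as sock, open(trace_path) as trace:
        reader = sock.makefile("r")
        for line in trace:
            addr = int(line.strip(), 16)
            if cache.lookup(addr):                    # local hit: no directory traffic
                continue
            # Miss: tell the directory and block until it replies (simplified protocol)
            sock.sendall(f"{core_id:d} {addr:x}\n".encode())
            reply = reader.readline().strip()         # ack/data, possibly with an RPV update
            cache.fill(addr, shared=reply.startswith("SHARED"))
```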
Simulation Strategy
• MP trace file given by Zhou (Thanks Zhou!!!)
• T is chosen to be 1000
• Each time an "increase" request comes from the directory, each core looks at the 10 cache sets with the most shared blocks and updates the RPV by setting a randomly chosen zero bit of the RPV to 1
• For different L2 cache associativities and sizes, the miss rate for each core is plotted alongside the case where simple LRU is used
Results

Cache size = 256 KB/core, associativity = 8

[Chart: per-core miss rate for cores 1-64 under simple LRU and under our policy.]
Results
Cache size = 512 KB/core, associativity = 16

[Chart: per-core miss rate for cores 1-64 under simple LRU and under our policy.]
Inference
• With higher associativity, the effect of the new policy is clearer, as reserving some blocks for shared data does not affect private data
• Plain LRU is not really used nowadays; we should compare against newer policies such as cooperative caching for better insight
• As confirmed by Zhou, the MP workload is indeed from an application that shares most of its data apart from the code section, hence the performance improvement is more prominent
Tunable Parameters
• This is merely a study with a workload that happened to have good sharing characteristics
• Many parameters can be tuned, such as which cores the directory should update and how many blocks per set should have their RPV modified
• Analysis of the RPV against the trace, to infer whether the RPV reflects the kind of sharing actually present
• Impact of false sharing and how to eliminate it