Project Summary
Fair and High Throughput Cache Partitioning Scheme for CMPs
Shibdas Bandyopadhyay, Dept of CISE
University of Florida
Project Proposal
• The architecture of the machine consists of a private level of cache (say L2) and a shared next level of cache (say L3)
• The aim is to further partition the private level of cache (L2) depending on the characteristics of the applications running on the different cores
• For example, if the applications running on different cores share blocks among themselves, some blocks will be marked exclusively for shared usage
• This helps reduce the miss rate for applications that share data heavily; if the applications do not share data, the cache performs as before
Motivation

[Figure: breakdown of memory accesses to shared vs. private blocks for commercial workloads]
Motivation

• As can be seen from the previous figure, for the commercial workloads shared blocks account for the majority of memory accesses
• As these workloads are all web-server applications, they share a large amount of data among the threads they spawn on multiple cores
• In the case of virtualized server consolidation, there will be a great amount of sharing among the cores participating in a virtual server
• So, in the spirit of Amdahl's law, if we reduce the miss rate for the shared blocks in these situations we should be able to improve the total hit rate
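To make the argument concrete (an illustration, not taken from the slides): if a fraction f of all accesses go to shared blocks, the overall miss rate decomposes as

```latex
m_{\text{total}} = f \cdot m_{\text{shared}} + (1 - f) \cdot m_{\text{private}}
```

so when f is large, as in the commercial workloads above, lowering m_shared lowers m_total almost proportionally.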
Proposed Strategy

• Each cache set has a bit vector (call it the Replacement Priority Vector [RPV]) whose length equals the associativity of the cache (i.e. the number of blocks in the set)
• A value of 1 at position x in that vector indicates that block x of that set is reserved exclusively for shared blocks; the other blocks can hold both private and shared blocks
• During replacement, one of two strategies is followed depending on the state of the incoming block (see the sketch after this list)
• If the incoming block will be in the shared state, all blocks in the set are considered and the LRU block is replaced
• If the incoming block will be in the private state, all blocks except the ones reserved exclusively for shared blocks are considered for LRU replacement
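A minimal sketch of this RPV-aware victim selection; the function and field names (choose_victim, lru_age) are illustrative assumptions, not the project's code:

```python
# Illustrative sketch of RPV-aware LRU victim selection.
def choose_victim(ways, rpv, incoming_is_shared):
    """ways: per-way objects with .lru_age (larger = older);
    rpv[i] == 1 means way i is reserved exclusively for shared blocks."""
    if incoming_is_shared:
        candidates = list(range(len(ways)))                 # shared fills may evict any way
    else:
        candidates = [i for i in range(len(ways)) if rpv[i] == 0]  # private fills skip reserved ways
    if not candidates:                                      # degenerate case: every way reserved
        candidates = list(range(len(ways)))
    return max(candidates, key=lambda i: ways[i].lru_age)   # evict the LRU candidate
```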
Proposed Strategy

• The RPV for each cache set is set up by each core as directed by the cache directory controller (we assume a directory-based cache coherence protocol)
• The directory tracks the number of misses on shared blocks in a time interval for all processors in a buffer called the Processor Activity Buffer [PAB]
• Each PAB entry consists of three fields: a core id, the number of misses on shared blocks for that processor in the present time interval, and the same count for the previous time interval
• If the difference for a particular core is greater than a threshold, the directory sends a message to the core to increase the number of reserved shared blocks, and vice versa if it is below the threshold (see the sketch below)
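A minimal sketch of the interval check the directory could run over the PAB; THRESHOLD, the field names, and send_to_core are illustrative assumptions:

```python
# Illustrative PAB bookkeeping; names and threshold value are assumptions.
THRESHOLD = 16

class PABEntry:
    def __init__(self, core_id):
        self.core_id = core_id
        self.curr_misses = 0    # misses on shared blocks in the present interval
        self.prev_misses = 0    # misses on shared blocks in the previous interval

def end_of_interval(pab, send_to_core):
    """Run by the directory once per time interval (every T requests)."""
    for e in pab:
        delta = e.curr_misses - e.prev_misses
        if delta > THRESHOLD:
            send_to_core(e.core_id, "increase shared blocks")
        elif delta < THRESHOLD:
            send_to_core(e.core_id, "decrease shared blocks")
        e.prev_misses, e.curr_misses = e.curr_misses, 0   # roll the interval over
```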
Proposed Strategy

• The RPV for each set of each core is initially set to zero
• Upon receiving an "increase shared blocks" message from the directory, the core looks at the current number of shared blocks in each cache set (a counter is associated with each set, incremented when a shared block comes into that set and decremented when a shared block is replaced)
• It decides for which sets the number of reserved shared blocks will be increased
• It then modifies the RPV for those sets by turning on a bit, depending on the current RPV
• On receiving a "decrease shared blocks" message from the directory, it finds the sets with the lowest number of shared blocks and modifies their RPVs accordingly (a sketch follows)
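A sketch of how a core might react to these messages; the helper names and the number of sets touched are assumptions (the simulation slides later fix the set count at 10):

```python
import random

# Illustrative per-core handler for directory messages; all names are assumptions.
SETS_TO_ADJUST = 10

def on_directory_message(msg, cache_sets):
    """cache_sets: objects with .rpv (list of 0/1 bits) and .shared_count."""
    increase = (msg == "increase shared blocks")
    # Most-shared sets first when increasing, least-shared first when decreasing
    ranked = sorted(cache_sets, key=lambda s: s.shared_count, reverse=increase)
    for s in ranked[:SETS_TO_ADJUST]:
        if increase:
            free = [i for i, bit in enumerate(s.rpv) if bit == 0]
            if free:
                s.rpv[random.choice(free)] = 1    # reserve one more way for shared blocks
        else:
            used = [i for i, bit in enumerate(s.rpv) if bit == 1]
            if used:
                s.rpv[random.choice(used)] = 0    # release one reserved way
```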
Cache Coherence Protocol

• A simple directory-based coherence protocol, as described in Hennessy & Patterson
[State diagram of the per-cache-block controller, with states Invalid, Shared (read only), and Exclusive (read/write). Transition labels include: CPU read: send Read Miss message; CPU write: send Write Miss message to home directory; Invalidate; Fetch / Fetch-Invalidate: send Data Write Back message to home directory; CPU read miss (in Exclusive): send Data Write Back message and read miss to home directory; CPU read hit and CPU write hit self-loops.]
Cache Coherence Protocol
[State diagram of the directory controller, with states Uncached, Shared (read only), and Exclusive (read/write). Transition labels include: Read miss: Sharers = {P}, send Data Value Reply; Write miss: Sharers = {P}, send Data Value Reply message; Read miss (in Exclusive): Sharers += {P}, send Fetch, send Data Value Reply message to remote cache (write back block); Data Write Back: Sharers = {}, write back block.]
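A minimal sketch of the directory-side transitions of this textbook protocol; the function and message names are illustrative assumptions, not the project's code:

```python
# Illustrative directory-side handler for the Hennessy & Patterson style protocol above.
UNCACHED, SHARED, EXCLUSIVE = "Uncached", "Shared", "Exclusive"

class DirEntry:
    def __init__(self):
        self.state = UNCACHED
        self.sharers = set()

def on_read_miss(entry, p, send):
    if entry.state == EXCLUSIVE:
        owner = next(iter(entry.sharers))
        send(owner, "Fetch")                 # current owner writes the block back
    entry.sharers.add(p)
    send(p, "Data Value Reply")
    entry.state = SHARED

def on_write_miss(entry, p, send):
    if entry.state == SHARED:
        for q in entry.sharers - {p}:
            send(q, "Invalidate")            # invalidate all other sharers
    elif entry.state == EXCLUSIVE:
        owner = next(iter(entry.sharers))
        send(owner, "Fetch/Invalidate")      # reclaim the block from the old owner
    entry.sharers = {p}
    send(p, "Data Value Reply")
    entry.state = EXCLUSIVE

def on_data_write_back(entry):
    entry.sharers = set()                    # block written back to memory
    entry.state = UNCACHED
```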
Simulation Strategy
• Each core is represented by a process; it reads from the trace file generated for that core from the MP trace file
• Each core process connects to the directory process using sockets and sends the current address to the directory if it is not a hit in the local cache
• The directory process updates the PAB if needed and sends an update to the core processes after every T requests to the directory
• As this is a simplified coherence protocol, core processes wait for acknowledgement and data from the directory before proceeding to the next address (a sketch of the core loop follows)
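A minimal sketch of one core process, assuming a line-oriented text protocol between core and directory (one hex address per line); the host/port, message format, and cache interface are illustrative assumptions, not the project's actual code:

```python
import socket

# Illustrative core process; all protocol details are assumptions.
DIRECTORY_ADDR = ("localhost", 9000)

def run_core(core_id, trace_path, cache):
    with socket.create_connection(DIRECTORY_ADDR) as sock, open(trace_path) as trace:
        reader = sock.makefile("r")
        for line in trace:
            addr = int(line.strip(), 16)
            if cache.lookup(addr):                    # local hit: no directory traffic
                continue
            # Miss: tell the directory and block until it replies (simplified protocol)
            sock.sendall(f"{core_id:d} {addr:x}\n".encode())
            reply = reader.readline().strip()         # ack/data, possibly with an RPV update
            cache.fill(addr, shared=reply.startswith("SHARED"))
```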
Simulation Strategy
• MP trace file given by Zhou (Thanks Zhou!!!)
• T is chosen to be 1000
• Each time an "increase" request comes from the directory, each core looks at the 10 cache sets with the most shared blocks and updates the RPV by setting a randomly chosen zero bit of the RPV to 1
• For different L2 cache associativities and sizes, the miss rate for each core is plotted alongside the case where simple LRU is used
Results

Cache size = 256 KB/core, associativity = 8

[Chart: per-core miss rate for cores 1-64 under simple LRU and under our policy.]
Results
Cache size = 512 KB/core, associativity = 16

[Chart: per-core miss rate for cores 1-64 under simple LRU and under our policy.]
Inference
• With higher associativity, the effect of the new policy is clearer, as reserving some blocks for shared data does not affect private data
• Plain LRU is not really used nowadays; we should compare against newer policies such as cooperative caching for better insight
• As confirmed by Zhou, the MP workload is indeed from an application that shares most of its data apart from the code section, hence the performance improvement is more prominent
Tunable Parameters
• This is merely a study with a workload that happened to have good sharing characteristics
• Many parameters can be tuned, such as which cores the directory should update and how many blocks per set should have their RPV modified
• Analysis of the RPV against the trace, to infer whether the RPV reflects the kind of sharing actually present
• Impact of false sharing and how to eliminate it