A Reusability-Aware Cache Memory Sharing Technique for High Performance CMPs with Private Caches



CMP-MSI.07  CARES/SNU

A Reusability-Aware Cache Memory Sharing Technique

for High Performance CMPs with Private Caches

Sungjune Youn, Hyunhee Kim and Jihong Kim

School of Computer Science & Engineering
Seoul National University

Computer Architecture and Embedded Systems (CARES) Laboratory

Workshop on Chip Multiprocessor Memory Systems and Interconnects 2007 (CMP-MSI)

2007.2.11


Outline

• Introduction
• Motivation
• Reusability-Aware Cache Sharing Technique (RACS)
  • Overview of the RACS technique
  • Two Major Steps
    – Step 1: Block Reusability Prediction
    – Step 2: Memory Demand Prediction
• Evaluation
• Conclusions


Introduction

• Chip Multiprocessors (CMPs) are emerging as a dominant architectural alternative
• Most current CMPs support two levels of on-chip cache hierarchy
  • L1 cache organization is almost the same
    – A small private L1 cache
  • L2 cache organization could be quite different
    – Private L2 cache vs. Shared L2 cache
• Efficient L2 cache management is necessary
  • On-chip cache memory space is limited in CMPs
  • Off-chip memory accesses have a much longer latency than on-chip communication


L2 Cache Organization in CMPs

• Private L2 cache vs. Shared L2 cache

  • Private L2 cache: short access latency, but inefficient in utilizing the L2 cache space
  • Shared L2 cache: utilizes capacity efficiently, but longer access latency and more on-chip network traffic

[Figure: block diagrams of a four-processor CMP (P0-P3, each with L1 I$ and D$) connected by a shared bus to off-chip memory, one with a single shared L2 cache and one with a private L2 cache per processor]

How to Combine Strengths of Private & Shared Caches?


Cooperative Caching (CMP-CC)

• “Cooperative Caching for Chip Multiprocessors”, ISCA 2006
• Based on the private cache organization
• Writing back L2 victims from the local cache to a peer cache

[Figure: four processors (P0-P3), each with a private L2 cache; an L2 victim is written back randomly with a given probability, 0% ~ 100%]
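The probabilistic write-back described on this slide can be sketched as follows. This is a minimal illustration and not the ISCA 2006 implementation: the function name `cmp_cc_spill_target`, its parameters, and the uniformly random choice of the receiving peer are assumptions made for the example.

```python
import random


def cmp_cc_spill_target(local_id, num_caches, spill_probability, rng=random):
    """Pick the peer cache that receives an L2 victim, or None if it is dropped.

    Assumption: the victim is spilled with probability `spill_probability`
    (0.0 to 1.0) and the receiving peer is chosen uniformly at random.
    """
    if rng.random() >= spill_probability:
        return None  # victim is not written back to any peer cache
    peers = [c for c in range(num_caches) if c != local_id]
    return rng.choice(peers)
```

With `spill_probability=0.0` the scheme behaves like plain private caches, and with `1.0` every victim is written back, which corresponds to the CMP-CC 100% configuration used in the evaluation.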


Problem of the Reusability-Oblivious Write Back

• If a block is written back to another cache but is not reused
  -> System performance could be degraded

[Figure: the number of unused and reused blocks in CMP-CC 100%; the number of unused blocks should be reduced]

Reusability-Aware Adaptive Write Backs are Necessary


Adaptive Write Back

• Adaptive Write Back requires
  • Reusability of each block
  • Memory demand of each processor

[Figure: four processors (P0-P3), each with a private L2 cache. For an L2 victim of P0: low reusability or high reusability? Which peer cache has a block with low reusability? Which peer cache has a memory demand smaller than P0?]


Reusability Prediction Technique

• Goal: Do not write back blocks with low reusability

• The reusability of a block is based on Access Time Interval and Frequency (ATIF)

• Reusability after the eviction
  – The block with long time interval has high reusability
  – The block with short time interval has low reusability

[Figure: two timelines of accesses to a block up to its eviction, one with long and one with short access time intervals]


Access Time Interval and Frequency Pattern

• Classify blocks into 16 patterns
  • Two counters per block
    – Number of accesses with long time interval
    – Number of accesses with short time interval
  • 2 bits of the short time interval counter + 2 bits of the long time interval counter
• Monitor how many blocks are reused for each pattern
• If blocks in a certain pattern are highly reused
  -> Blocks in this pattern have high reusability
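The pattern-based prediction above can be sketched as follows. The slides do not specify which 2 bits of each counter form the pattern, how the per-pattern counters are trained, or where the cut-off for "highly reused" lies, so the clamping in `pattern`, the increment/decrement policy in `train`, and the `threshold` parameter are all assumptions for illustration.

```python
class AtifPredictor:
    """Per-cache reusability predictor keyed by a 4-bit ATIF pattern (assumed layout)."""

    def __init__(self, threshold=2):
        self.pattern_counters = [0] * 16  # sixteen 2-bit saturating counters
        self.threshold = threshold        # hypothetical cut-off for high reusability

    @staticmethod
    def pattern(short_count, long_count):
        # Assumption: each per-block counter is clamped to 2 bits (0..3),
        # giving 4 x 4 = 16 patterns; short count in the upper two bits.
        return (min(short_count, 3) << 2) | min(long_count, 3)

    def train(self, pattern, reused):
        # Assumption: count up when a block of this pattern was reused after
        # being written back, count down when it was not.
        c = self.pattern_counters[pattern]
        self.pattern_counters[pattern] = min(c + 1, 3) if reused else max(c - 1, 0)

    def is_highly_reusable(self, pattern):
        return self.pattern_counters[pattern] >= self.threshold
```

A victim whose pattern is not predicted as highly reusable would simply not be written back to a peer cache.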


Fraction of Unused and Reused Blocks

[Figure: fraction of unused and reused blocks per pattern, by number of short time interval and long time interval accesses]

The larger the number of long time interval accesses
-> The larger the number of reused blocks


Memory Demand Prediction Technique

• Goal: Do not pollute the L2 cache of processors with a high memory demand
• Heuristic: The more replacements occur, the more memory the processor requires
• Replacement time interval history (Repl_interval_history)
  • Prediction value of the memory demand
  • Updated every time a replacement occurs
  • Smaller Repl_interval_history means the processor requires more memory
• We write back the block to the peer cache with smaller memory demand

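$$\mathrm{Repl}_{\mathrm{interval\_history}}(\mathrm{new}) = \frac{3 \times \mathrm{Repl}_{\mathrm{interval\_history}}(\mathrm{prev}) + \mathrm{Repl}_{\mathrm{interval}}}{4}$$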
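The history update can be sketched in code as a weighted average that gives the previous history three quarters of the weight. The class and method names, the time unit of the counter, the 8-bit saturation, the integer division, and the initial history value are assumptions for illustration; only the 3:1 weighting and the two 8-bit fields come from the slides.

```python
class MemoryDemandPredictor:
    """Tracks the replacement interval history of one private L2 cache (sketch)."""

    MAX_8BIT = 255

    def __init__(self):
        self.since_last_replacement = 0         # 8-bit counter: time from the last replacement
        self.interval_history = self.MAX_8BIT   # assumed start: no memory demand observed yet

    def tick(self, units=1):
        # Advance time; the unit (cycles, accesses, ...) is an assumption.
        self.since_last_replacement = min(
            self.since_last_replacement + units, self.MAX_8BIT
        )

    def on_replacement(self):
        # new history = (3 * previous history + latest interval) / 4
        self.interval_history = (
            3 * self.interval_history + self.since_last_replacement
        ) // 4
        self.since_last_replacement = 0
        return self.interval_history
```

A cache that replaces blocks frequently sees short intervals, so its history value shrinks, which the technique reads as a higher memory demand.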


Experiment Setting

• Based on a CATS Shared-Memory Multiprocessor Simulator
• Parameters
  • L1 I/D Cache
    – 16 KB, 1-way
    – 1 cycle
  • L2 Private Cache
    – 256 KB, 4-way
    – 6/40 cycles
  • L2 Shared Cache
    – 1 MB, 16-way
    – 38 cycles
  • Off-chip memory latency
    – 500 cycles
• Splash2 benchmark programs used: Cholesky, FMM, LU, Radix


The Number of Unused and Reused Blocks

[Figure: number of blocks (thousands), split into unused blocks and reused blocks, for CMP-CC, RACS, and Oracle- on FMM and LU; annotated 61% and 62%]

The number of reused blocks is the same as with CMP-CC 100%

The number of unused blocks is reduced by 62%


Normalized Memory Access Latency

[Figure: normalized average memory access latency for Cholesky, FMM, LU, and Radix under Shared, Private, CMP-CC (30%), CMP-CC (70%), CMP-CC (100%), RACS, and Oracle-]

The RACS scheme reduces the average memory access latency by 14% and 4% on average over the private L2 scheme and CMP-CC 100%, respectively

Even where memory access latency increases as the CMP-CC write-back probability increases, RACS reduces it.


Normalized Average IPC

RACS improves average IPC by 3% and 1% on average over the private L2 scheme and CMP-CC 100%, respectively

[Figure: normalized average IPC for Cholesky, FMM, LU, and Radix under Shared, Private, CMP-CC (100%), RACS, and Oracle-]


Normalized Energy Consumption

RACS consumes 10% less energy than the private cache

RACS consumes 2% less energy than CMP-CC 100%

[Figure: normalized energy consumption for Cholesky, FMM, LU, and Radix under Shared, Private, CMP-CC (100%), RACS, and Oracle-]


Conclusions

• Proposed Reusability-Aware Cache memory Sharing technique (RACS)

  • Based on private L2 cache
  • Taking advantage of both private L2 cache and shared L2 cache
  • Adaptively writing back L2 victims to peer L2 cache
  • Using reusability of the block and memory demand of the processor

• RACS reduces the number of unused blocks by 60% over CMP-CC.

• RACS reduces the average memory access latency by 14% and 4% over the private L2 cache and CMP-CC, respectively.


Thank You


Overhead

• Hardware overhead
  • Peer-to-peer communication lines between caches
    – For 4 CPUs, 6 lines of 21 bits
  • Additional counters for the two prediction techniques
    – Reusability prediction
      • 4-bit counter: the number of accesses with long time interval per block
      • 2-bit counter: the number of accesses with short time interval per block
      • 2 bits: indicating which processor wrote back this block
      • 1 bit: indicating that the block is reused
      • 16 2-bit pattern counters per private cache
    – Memory demand prediction
      • 8-bit counter: time from the last replacement
      • 8 bits: replacement interval history
    – Total: 9 bits per block and 48 bits per cache
      => Area overhead is less than 1% of the private cache
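The totals quoted above can be checked with a few lines of arithmetic. The dictionary keys are hypothetical labels for the fields listed on the slide, and reading "6 lines" as one line per pair of the four private caches is an interpretation, not something the slide states.

```python
from math import comb

# Per-block storage for reusability prediction (bits), as listed on the slide.
per_block_bits = {
    "long_interval_counter": 4,
    "short_interval_counter": 2,
    "writer_processor_id": 2,
    "reused_flag": 1,
}

# Per-cache storage (bits): 16 two-bit pattern counters plus the two
# 8-bit fields used for memory demand prediction.
per_cache_bits = {
    "pattern_counters": 16 * 2,
    "time_since_last_replacement": 8,
    "replacement_interval_history": 8,
}

total_per_block = sum(per_block_bits.values())
total_per_cache = sum(per_cache_bits.values())

# Assumed interpretation: one peer-to-peer line per pair of caches.
peer_lines = comb(4, 2)
```

Under these readings the sums reproduce the 9 bits per block, 48 bits per cache, and 6 lines given on the slide.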


• Time overhead
  • Write back decision is made after
    – A block is evicted from the cache and placed in the write back queue
  -> Write back decision is not on the critical path


• How to distinguish accesses with short time interval from accesses with long time interval?
  • If there is any intervening access to another block that belongs to the same set -> long time interval
  • If not -> short time interval
  • Using 2 bits (for a 4-way cache) to record the most recently accessed block
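The classification rule above can be sketched with one small register per cache set that remembers the most recently accessed way. The class name `SetAccessTracker`, the string labels, and the treatment of the very first access to a set as a long interval are assumptions for the example.

```python
class SetAccessTracker:
    """Per-set record of the most recently accessed way (2 bits for a 4-way set)."""

    def __init__(self, ways=4):
        self.ways = ways
        self.mru_way = None  # no access seen yet

    def access(self, way):
        """Classify an access to `way` as 'short' or 'long' and update the record."""
        if not 0 <= way < self.ways:
            raise ValueError("way index out of range")
        # Same block as the previous access to this set: nothing intervened.
        kind = "short" if way == self.mru_way else "long"
        self.mru_way = way
        return kind
```

The returned label would then select which of the two per-block counters (short or long time interval) to increment.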


Write back decision when block a is evicted from private L2 cache A

• Do not write back block a to a peer L2 cache if any of the following holds
  – Block a is in the shared state
  – Block a was written back from another L2 cache and not used
  – Block a has low reusability
• Otherwise, if any peer L2 cache has a block with low reusability at the bottom of its LRU stack
  -> Write back block a to the cache which has the low reusability block
• Otherwise, if any L2 cache has a memory demand ω times smaller than cache A’s
  -> Write back block a to the cache which has the ω times smaller memory demand
• Otherwise, do not write back block a to a peer L2 cache
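The decision flow for an evicted block can be sketched as a single function. All names here (`Victim`, `Peer`, `choose_peer`) are hypothetical, memory demand is modelled as a plain number where larger means more demand, and taking the first qualifying peer is an arbitrary tie-break; the slides do not define any of these details.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Victim:
    shared: bool                 # block is in the shared state
    spilled_in_and_unused: bool  # written back from another L2 cache and not used
    low_reusability: bool        # predicted low reusability


@dataclass
class Peer:
    cache_id: int
    lru_block_low_reusability: bool  # low reusability block at the bottom of LRU
    memory_demand: float             # assumed scale: larger means more demand


def choose_peer(victim: Victim, own_demand: float, omega: float,
                peers: List[Peer]) -> Optional[int]:
    """Return the peer cache id to write the victim back to, or None."""
    if victim.shared or victim.spilled_in_and_unused or victim.low_reusability:
        return None
    # First preference: a peer that can give up a low reusability block.
    for peer in peers:
        if peer.lru_block_low_reusability:
            return peer.cache_id
    # Second preference: a peer whose demand is omega times smaller than ours.
    for peer in peers:
        if peer.memory_demand * omega <= own_demand:
            return peer.cache_id
    return None
```

Because the function only reads its arguments, it matches the point made on the overhead slide that the decision can run off the critical path once the victim sits in the write back queue.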


• If there is no peer L2 cache with ω times smaller memory demand, we do not write back the block to a peer L2 cache even though it has high reusability
• ω value?
  • Each private L2 cache has its own ω value
  • Decreased by 1 when a block is reused
  • Increased by 1 when three blocks are written back to another cache
  • If many of the blocks are reused, we can write back a block to a peer cache
    – Even though the difference of the memory demand is not large
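The ω adjustment rule can be sketched as a small per-cache controller. The class name, the initial value, and the lower and upper bounds are assumptions, since the slides give only the decrement and increment conditions.

```python
class OmegaController:
    """Per-cache omega value, adapted from write back outcomes (sketch)."""

    def __init__(self, initial=2, minimum=1, maximum=8):
        # initial, minimum and maximum are hypothetical; the slides give no values.
        self.omega = initial
        self.minimum = minimum
        self.maximum = maximum
        self.write_backs = 0  # write backs since the last increase

    def on_block_reused(self):
        # A written back block was reused: relax the demand requirement.
        self.omega = max(self.omega - 1, self.minimum)

    def on_write_back(self):
        # Every third write back to another cache tightens the requirement.
        self.write_backs += 1
        if self.write_backs == 3:
            self.write_backs = 0
            self.omega = min(self.omega + 1, self.maximum)
```

When written back blocks keep being reused, ω drifts down and write backs are allowed even to peers whose memory demand is only slightly smaller.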


[Figure: four processors, each with a private L2 cache. Step 1: an L2 cache sends the set number and its Repl_interval_history to the other L2 caches. Step 2: each receiving L2 cache checks "Does a block with low reusability exist?" and "Received Repl_interval_history > own Repl_interval_history"]