Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors
Chinnakrishnan S. Ballapuram, Ahmad Sharif, and Hsien-Hsin S. Lee
2Ballapuram, Sharif, and Lee
Concurrent Execution in CMP
[Figure: In a single-threaded program, Thread 0 has its code, data, registers, and local stack. In a multi-threaded program, Threads 0, 1, and 2 share code and data, but each keeps private registers and a local stack. The cores share the last-level cache.]
Self-Modifying Code (SMC) Snoop
[Figure: Cores 0 to 3, each with an L1 instruction cache (IL1) and an L1 data cache (DL1); each core performs SMC snoops.]
Snoop for Core 0 DL1 Miss
[Figure: Four-core CMP. Cores 0 to 3 each have an IL1, a DL1, and SMC snoop logic. They connect through the CMP core interconnect to a shared L2 cache, which has an L2 queue (FIFO), a snoop queue (FIFO), and other logic and buffers leading to the external interconnect.]
External Snoop Request
[Figure: Four-core CMP. Cores 0 to 3 each have an IL1, a DL1, and SMC snoop logic. They connect through the CMP core interconnect to a shared L2 cache, which has an L2 queue (FIFO), a snoop queue (FIFO), and other logic and buffers leading to the external interconnect.]
Modified L2 Eviction, External Request, etc
[Figure: Four-core CMP. Cores 0 to 3 each have an IL1, a DL1, and SMC snoop logic. They connect through the CMP core interconnect to a shared L2 cache, which has an L2 queue (FIFO), a snoop queue (FIFO), and other logic and buffers leading to the external interconnect.]
Modified L2 Eviction, External Request, etc
[Figure: Four-core CMP. Cores 0 to 3 each have an IL1, a DL1, and SMC snoop logic. They connect through the CMP core interconnect to a shared L2 cache, which has an L2 queue (FIFO), a snoop queue (FIFO), and other logic and buffers leading to the external interconnect.]
As the number of cores increases, snoop traffic grows and affects both power and performance.
Number of Snoop Probes
• SMC Snoops to I-Cache > Snoops to D-Cache > Snoops to LSB.
[Figure: Number of snoop probes in millions, split into to_lsb (load/store buffer), to_dcache, and to_icache. Bars cover SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded apps in the 2C, 4C, 2 x 4C, and 8C configurations. A value of 16.4M is annotated.]
Snoop Probe and Snoop Rate
• The percentage increase in data snoops exceeds the percentage increase in instruction cache (SMC) snoops.
[Figure: Two series for the processor configurations 2C, 4C, 2Px4C, 8C, 8C-MT, and 2Px4C-MT. The first is the number of snoops in millions (to_lsb, to_dcache, to_icache, and total snoops). The second is the % of snoop increase (data, SMC, and total). Increases of ~22x and ~12x are annotated.]
We propose two techniques to reduce the power consumed by snoop probes:
1. Selective Snoop Probe (SSP)
2. Essential Snoop Probe (ESP)
Selective Snoop Probe (SSP)
- SSP for SMC
- SSP for Non-Stack Accesses
- SSP for Stack Accesses
Selective Snoop Probe (SSP)
- SSP for SMC
Normal Operation: To Support SMC
[Figure: Core 0. Every store dispatched from the RS or LSB to the L1 D-cache and MSHR also sends an SMC snoop probe to the L1 I-cache.]
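SSP (SMC) – No SMC Snoop if BF1 Miss
[Figure: Core 0. All store addresses dispatched from the RS or LSB are hashed into BF1, a counting Bloom filter (cntr) placed between the L1 D-cache/MSHR and the L1 I-cache. BF1 filters SMC/XMC snoops: on a BF1 miss, no SMC snoop probe is sent to the L1 I-cache. r1 = read Bloom filter, u1 = update Bloom filter, cntr = counting Bloom filter.]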
SSP (SMC) – SMC Snoop if BF1 Hit
[Figure: Core 0, same datapath. All store addresses are hashed into BF1; on a BF1 hit, the SMC snoop probe is sent to the L1 I-cache. r1 = read Bloom filter, u1 = update Bloom filter, cntr = counting Bloom filter.]
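The BF1 check on the two SSP (SMC) slides can be sketched in Python as a counting Bloom filter. This is a minimal sketch under assumptions, not the authors' implementation. It assumes BF1 counts the lines resident in the L1 I-cache, incremented on a fill and decremented on an eviction. It also assumes BF1 is read with every dispatched store address. The filter size, the two hash functions, and the names `CountingBloomFilter` and `needs_smc_snoop` are hypothetical.

```python
class CountingBloomFilter:
    """Counting Bloom filter over cache-line addresses (hypothetical sizing)."""

    def __init__(self, num_counters=1024, line_bytes=64):
        self.counters = [0] * num_counters
        self.line_bytes = line_bytes

    def _indices(self, addr):
        # Two illustrative hash functions over the line address (assumption).
        line = addr // self.line_bytes
        n = len(self.counters)
        return {line % n, ((line * 0x9E3779B1) >> 16) % n}

    def insert(self, addr):
        # u1 (assumed): a line is filled into the L1 I-cache.
        for i in self._indices(addr):
            self.counters[i] += 1

    def remove(self, addr):
        # u1 (assumed): a line is evicted from the L1 I-cache.
        for i in self._indices(addr):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def may_contain(self, addr):
        # r1: lookup performed when a store address is dispatched.
        return all(self.counters[i] > 0 for i in self._indices(addr))


def needs_smc_snoop(bf1, store_addr):
    """True if the store must send an SMC snoop probe to the L1 I-cache."""
    # BF1 miss: the line is definitely not in the I-cache, so no snoop.
    # BF1 hit: the line may be in the I-cache, so the snoop is sent.
    return bf1.may_contain(store_addr)


bf1 = CountingBloomFilter()
bf1.insert(0x4000)                    # a code line is cached in the I-cache
print(needs_smc_snoop(bf1, 0x4010))   # store to the same 64-byte line: True
print(needs_smc_snoop(bf1, 0x9000))   # store to an unrelated line: False
```

The sketch uses a counting filter rather than a plain bit-vector Bloom filter because I-cache evictions must be able to remove lines. A BF1 miss is always safe to act on. A hit may be a false positive, which only costs one unnecessary snoop.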
Selective Snoop Probe (SSP)
- SSP for Stack Accesses
Normal Operation: Always Snoop for All Accesses
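[Figure: Core 0. Accesses dispatched from the RS or LSB go to the L1 D-cache and MSHR. On a dL1 miss, the request passes through the L2 queue to the last-level cache, and the snoop controller sends snoop probes through the snoop queue for every access.]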
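SSP – Stack Accesses
[Figure: Core 0. All addresses dispatched from the RS or LSB carry an S-bit annotation, set by the front end, that marks stack accesses. On a dL1 miss, the annotated request passes through the L2 queue to the last-level cache. The snoop controller (bit vector 0100 shown) uses the annotation when sending probes through the snoop queue.]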
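The stack-access filter can be sketched as a single check in the snoop controller. The sketch relies on the earlier observation that each thread's stack is local, so a dL1 miss whose address carries the S-bit needs no probes to other cores. It assumes a four-core CMP. `MissRequest` and `snoop_targets` are hypothetical names, not part of the described hardware.

```python
from dataclasses import dataclass

NUM_CORES = 4  # assumption: the four-core CMP drawn on the slides


@dataclass
class MissRequest:
    core: int     # core whose dL1 missed
    addr: int     # physical address of the missing line
    s_bit: bool   # set by the front end when the access targets the stack


def snoop_targets(req: MissRequest) -> list[int]:
    """Cores that receive a snoop probe for this dL1 miss."""
    if req.s_bit:
        # Stack data is thread-local, so no other core should hold the line.
        return []
    # Baseline behavior: probe every other core.
    return [c for c in range(NUM_CORES) if c != req.core]


print(snoop_targets(MissRequest(core=0, addr=0x7FFF_F000, s_bit=True)))   # []
print(snoop_targets(MissRequest(core=0, addr=0x0060_0000, s_bit=False)))  # [1, 2, 3]
```

This sketch ignores thread migration. If the OS moves a thread to another core, its previous core may still cache stack lines, and a real design would have to handle that case.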
Selective Snoop Probe (SSP)
- SSP for Non-Stack Accesses
SSP – Non-stack Accesses Update BF2
[Figure: Core 0. All non-stack addresses dispatched from the RS or LSB to the L1 D-cache and MSHR update (u2) BF2, a counting Bloom filter (cntr) indexed through a hash, which filters snoops to the non-stack region. Cache lines are labeled with MESI states (ME, SI). The L2 queue, last-level cache, snoop controller (bit vector 1000 shown), and snoop queue are also shown. r2 = read Bloom filter, u2 = update Bloom filter, cntr = counting Bloom filter.]
SSP – Non-stack Accesses Read BF2
[Figure: Core 0. Non-stack addresses from the RS or LSB dispatch update BF2 (u2) through the hash. On a dL1 miss, the request, whose address carries the S-bit annotation, reads BF2 (r2) before reaching the L2 queue, last-level cache, snoop controller (bit vector 1000 shown), and snoop queue. r2 = read Bloom filter, u2 = update Bloom filter, cntr = counting Bloom filter.]
SSP - Selectively Send Snoop Probes
[Figure: Core 0. Non-stack addresses update BF2 (u2), and all addresses carry the S-bit annotation. On a dL1 miss, the snoop controller (bit vector 1000 shown) uses BF2 to selectively send snoop probes through the snoop queue, filtering snoops to the non-stack region. r2 = read Bloom filter, u2 = update Bloom filter, cntr = counting Bloom filter.]
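Selective snoop probing for non-stack data can be sketched with one counting Bloom filter (BF2) per core. Each core updates its BF2 with the non-stack lines it caches. The snoop controller reads the filters on a dL1 miss, so probes go only to cores whose filter might hold the line. This is a sketch under assumptions: the per-core placement, the filter size, and the hash functions are hypothetical, and so are all names. The M/E versus S split shown on the slides is omitted.

```python
NUM_CORES = 4          # assumption: four-core CMP
LINE_BYTES = 64        # cache line size from the simulation setup
NUM_COUNTERS = 1024    # hypothetical filter size


class CountingBloomFilter:
    """Per-core BF2 tracking the non-stack lines cached by that core."""

    def __init__(self):
        self.counters = [0] * NUM_COUNTERS

    def _indices(self, addr):
        # Two illustrative hash functions over the line address (assumption).
        line = addr // LINE_BYTES
        return {line % NUM_COUNTERS, ((line * 0x9E3779B1) >> 16) % NUM_COUNTERS}

    def insert(self, addr):
        # u2: a non-stack line is filled into this core's dL1.
        for i in self._indices(addr):
            self.counters[i] += 1

    def remove(self, addr):
        # u2: the line is evicted or invalidated.
        for i in self._indices(addr):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def may_contain(self, addr):
        # r2: lookup made for a dL1 miss from another core.
        return all(self.counters[i] > 0 for i in self._indices(addr))


bf2 = [CountingBloomFilter() for _ in range(NUM_CORES)]


def selective_snoop_targets(requester, addr, s_bit):
    """Cores that receive a snoop probe for a dL1 miss (hypothetical policy)."""
    if s_bit:
        # Stack access: thread-private, so no core is probed.
        return []
    # Non-stack access: probe only cores whose BF2 may hold the line.
    return [c for c in range(NUM_CORES)
            if c != requester and bf2[c].may_contain(addr)]


bf2[3].insert(0x0060_0040)                                   # core 3 caches a non-stack line
print(selective_snoop_targets(0, 0x0060_0040, s_bit=False))  # [3]
print(selective_snoop_targets(0, 0x0060_0040, s_bit=True))   # []
```

A false positive in BF2 only adds an unneeded probe, while a filter miss safely skips one. For that reason the counters must never be decremented for a line the core still holds.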
Essential Snoop Probe (ESP)
- ESP for SMC
- ESP for all variables
Essential Snoop Probe (ESP)
- ESP for SMC
SMC – Normal Operation
[Figure: Core 0. Every store dispatched from the RS or LSB to the L1 D-cache also snoops the L1 I-cache, alongside the other pipe stages.]
ESP: Essential Snoop Probe
• The OS sets a control register bit (SMC-CR)
• SMC-CR=1: non-self-modifying code
• SMC-CR=0: self-modifying code
[Figure: Core 0 with SMC-CR=1. Stores dispatched from the RS or LSB go to the L1 D-cache and the other pipe stages without an SMC snoop of the L1 I-cache.]
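Under ESP, the SMC decision needs no filter lookup. It is a single check of the OS-set bit at store dispatch. A minimal sketch, assuming the SMC-CR value is visible to each core's dispatch logic. The function names are hypothetical.

```python
def smc_snoop_on_store(smc_cr: int) -> bool:
    """Whether a dispatched store must snoop the L1 I-cache."""
    if smc_cr not in (0, 1):
        raise ValueError("SMC-CR is a single bit")
    # SMC-CR = 1: the OS declares the code non-self-modifying, so the snoop is non-essential.
    # SMC-CR = 0: the code may modify itself, so every store snoops the I-cache.
    return smc_cr == 0


def count_smc_snoops(num_stores: int, smc_cr: int) -> int:
    """I-cache snoops generated by num_stores dispatched stores."""
    return num_stores if smc_snoop_on_store(smc_cr) else 0


print(count_smc_snoops(1_000_000, smc_cr=0))  # 1000000
print(count_smc_snoops(1_000_000, smc_cr=1))  # 0
```

Unlike a Bloom-filter check, this gate has no false positives. It does rely on the OS setting SMC-CR correctly: a program that modifies its own code while SMC-CR=1 would miss the I-cache invalidations.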
Essential Snoop Probe (ESP)
- ESP for all variables
Normal Operation – Snoop for All Variables
[Figure: Core 0. On a dL1 miss, the request passes through the L2 queue into the CMP interconnect domain, where the snoop controller sends snoop probes through the snoop queue and the last-level cache is accessed. Snoops are sent for all variables.]
Essential Snoop Probe (ESP) – SMN bit 1
[Figure: Core 0. The dL1 miss carries an SMN (Snoop-Me-Not) bit annotation, here 1, through the L2 queue into the CMP interconnect domain. The snoop controller (bit vector 1100 shown), the snoop queue, and the last-level cache are in that domain.]
Essential Snoop Probe (ESP) – SMN bit 0
[Figure: Core 0, same datapath, with the SMN (Snoop-Me-Not) bit of the dL1 miss set to 0. ESP logic is marked at the RS/LSB dispatch and at the L1 I-cache and D-cache, and the snoop controller shows bit vector 0100.]
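A sketch of how a snoop controller might honor the SMN bit follows. It assumes, from the name Snoop-Me-Not, that an SMN bit of 1 marks a dL1 miss whose data needs no snoop probes, while an SMN bit of 0 keeps the normal behavior of probing all other cores. The bit-vector encodings drawn on the slides are not modeled. The four-core count, the function names, and the miss trace are hypothetical, and the trace is not data from the evaluation.

```python
NUM_CORES = 4  # assumption: four-core CMP


def esp_snoop_targets(requester: int, smn_bit: int) -> list[int]:
    """Cores probed for a dL1 miss carrying an SMN annotation (hypothetical)."""
    if smn_bit == 1:
        # Snoop-Me-Not set: the access is treated as needing no snoops (assumption).
        return []
    # SMN bit 0: normal operation, probe every other core.
    return [c for c in range(NUM_CORES) if c != requester]


def snoops_saved(misses):
    """Return (probes eliminated, baseline probes) for (core, smn_bit) misses."""
    baseline = sum(len(esp_snoop_targets(core, 0)) for core, _ in misses)
    with_esp = sum(len(esp_snoop_targets(core, smn)) for core, smn in misses)
    return baseline - with_esp, baseline


trace = [(0, 1), (1, 0), (2, 1), (3, 1)]   # made-up miss trace
print(snoops_saved(trace))                  # (9, 12)
```
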
Energy Savings in D-Cache Using SSP
• SSP achieves 5% to 10% data cache energy savings in the 2C configuration and 30% to 65% in the 8C configuration.
• Data cache energy savings grow with the number of cores on the die, because the number of snoops sent to each core grows.
[Figure: % of data cache energy savings per core (0% to 70%) for the 2C, 4C, 2Px4C, and 8C processor configurations, for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded applications.]
Energy Savings in I-Cache Using SSP
• SSP achieves 50% to 70% instruction cache tag energy savings across all processor configurations.
[Figure: % of I-cache tag energy savings per core (0% to 100%) for the 2C, 4C, 2Px4C, and 8C processor configurations, for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded applications.]
Performance Impact with SSP
• On average, SSP achieves a 1% to 2% performance improvement across the benchmark categories and processor configurations.
[Figure: Performance with SSP (0% to 120%) for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded applications in the 2C, 4C, 2Px4C, and 8C configurations. The chart shows the harmonic mean across benchmarks and the minimum and maximum performance observed.]
Energy Savings with ESP
• Between 5% and a maximum of 82% of data cache energy is spent on non-essential snoop probes, which the ESP technique can eliminate.
• ESP can also eliminate 85% of the instruction cache tag energy spent on snoops.
[Figure: % of D-cache and I-cache energy per core spent on non-essential snoops (0% to 100%) for the 2C, 4C, 2Px4C, and 8C configurations. Categories are SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded applications, plus the harmonic mean across benchmarks.]
Conclusion
• Access semantics and program behavior are useful indicators
• They can be exploited to reduce the power consumed by snoops
• We proposed:
  – Selective Snoop Probe (SSP)
  – Essential Snoop Probe (ESP)
• Energy reduction results:
  – 5% to 65% in the D-cache per core
  – 50% to 70% in I-cache tag energy per core
• 1% to 2% performance improvement
• Extensible to optimize integrated platforms with a graphics processor
Georgia Tech, Electrical and Computer Engineering, MARS Labs, http://arch.ece.gatech.edu
Thank You!
BACKUP
Simulation Infrastructure
Execution engine: 4-wide, out-of-order
Load buf / Store buf / RS / ROB: 96 / 64 / 128 / 256 entries
L1 / L2 latency: 4 / 8 cycles
L1 I, L1 D cache size: 32KB, 8-way, 64B lines
L2 cache: 4MB, 16-way, 64B lines
L1 TLB entries: 128, 4-way
Memory: 2GB, DDR2 timings
Power model: CACTI 4.2, 70nm

Benchmark class: Example applications
Server: specJBB, TPCC
SPEC FP 2006: wrf, namd, lbm, soplex
SPEC INT 2006: hmmer, gobmk, omnetpp, gcc
Games and multi-media: shooters, real-time strategy, raytracer
Multi-threaded applications: ray tracer, cinebench
Number of Modified Lines
• Number of modified lines that need to be evicted to the last-level cache.
[Figure: Number of modified lines at completion for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded applications, with the average across benchmarks, for the 2C, 4C, 2Px4C, and 8C configurations.]
Cache Access vs. Snoop Access
• Cache access: reads one sub-bank (8 bytes)
• Snoop access: must read all sub-banks (all 64 bytes, the cache line size) to ship the data to other cores, or to another processor in an MP system
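The cost gap between the two access types follows from the slide's numbers. With 8-byte sub-banks and a 64-byte line, a snoop reads every sub-bank of the line. Assume, as a simplification that is not a result from the slides, that data-array read energy scales with the number of sub-banks activated. Then one snoop costs roughly as much as eight normal reads:

$$
N_{\text{sub-banks}} = \frac{64\ \text{B}}{8\ \text{B}} = 8,
\qquad
E_{\text{snoop}} \approx N_{\text{sub-banks}}\,E_{\text{sub-bank}} = 8\,E_{\text{access}}
$$

Here $E_{\text{access}} = E_{\text{sub-bank}}$, because a normal cache access reads a single sub-bank.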
Hash functions
[Figure: Hash functions for the counting Bloom filters. The cache line's 48-bit physical address supplies tag + index bits [6-32], split into fields A, B, and C; the remaining bits are unused. HASH3 indexes a counter array (cntr), and separate counter arrays are used for lines in M/E state and for lines in S state.]
• If bit 10 is 0: HASH3 = A ^ B ^ C
• If bit 10 is 1: HASH3 = (A ^ 0x22) ^ B ^ C
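HASH3 can be written directly from its two formulas. The field boundaries are not legible in the transcript, so this sketch assumes A, B, and C are three 9-bit fields covering tag + index bits [6-32]. Under that assumption, A is bits 6-14, C is bits 15-23, and B is bits 24-32, following the left-to-right order B, C, A in the figure. HASH3 then yields a 9-bit counter index. The field widths and the function names are hypothetical.

```python
FIELD_BITS = 9                      # assumption: bits [6-32] split into three 9-bit fields
FIELD_MASK = (1 << FIELD_BITS) - 1


def field(addr: int, lo: int) -> int:
    """Extract a FIELD_BITS-wide field starting at bit position lo."""
    return (addr >> lo) & FIELD_MASK


def hash3(paddr: int) -> int:
    """HASH3 over a 48-bit physical line address (field layout assumed)."""
    a = field(paddr, 6)             # bits 6-14 (assumed)
    c = field(paddr, 15)            # bits 15-23 (assumed)
    b = field(paddr, 24)            # bits 24-32 (assumed)
    if (paddr >> 10) & 1:           # bit 10 set: perturb A with 0x22
        a ^= 0x22
    return a ^ b ^ c                # index into a 512-entry counter array


print(hash3(1 << 6))    # 1
print(hash3(1 << 10))   # 50, since bit 10 is set and A = 0x10 ^ 0x22
```

Because XOR is commutative, swapping which field is called B and which is called C does not change the result. Only the placement of A matters, since A is the field perturbed by 0x22.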
Incoming Events to LLC
Incoming events to the last-level cache: RFO, Data Read, Code fetch, Shared L2 evict
Incoming Events to LLC and Sources of Snoop Triggers
Incoming events to the last-level cache and the unit of this core that triggers each:
• RFO: triggered by the dL1 of this core
• Data Read: triggered by the dL1 of this core
• Code fetch: triggered by the iL1 of this core
• Shared L2 evict
Snooped Units in the Triggered Core
Incoming events to the last-level cache, and the units of the triggering core that are snooped (LSB = load/store buffer; WBB = write-back buffer):
• RFO: triggered by dL1; no other units of this core snooped
• Data Read: triggered by dL1; no other units of this core snooped
• Code fetch: triggered by iL1; dL1, SMC snoop; LSB, snoop store buffer only (updated writes); MSHR/WBB, snoop (update writes)
• Shared L2 evict: dL1, snoop; MSHR/WBB, snoop
Snoop Probes for Incoming Data Read
Incoming events to the last-level cache, with the actions in this core, in the other 3 cores, and in the shared L2 queue (LSB = load/store buffer; WBB = write-back buffer):
• RFO
  – This core: dL1 is the event trigger
  – Other 3 cores: iL1, XMC snoop to invalidate line; dL1, snoop; LSB, snoop load buffer only to invalidate; MSHR/WBB, snoop to invalidate pending requests
  – Shared L2 queue: snoop to invalidate
• Data Read
  – This core: dL1 is the event trigger
  – Other 3 cores: iL1, XMC snoop to invalidate line; dL1, snoop; MSHR/WBB, snoop
  – Shared L2 queue: snoop
• Code fetch
  – This core: iL1 is the event trigger; dL1, SMC snoop; LSB, snoop store buffer only (updated writes); MSHR/WBB, snoop (update writes)
  – Other 3 cores: dL1, XMC snoop; LSB, snoop store buffer only (update writes); MSHR/WBB, snoop
  – Shared L2 queue: SMC snoop
• Shared L2 evict
  – This core: dL1 and MSHR/WBB snooped
  – Other 3 cores: dL1 and MSHR/WBB snooped
  – Shared L2 queue: snoop
Snoop Triggers and Snoop Units
Same events and snoop actions as the previous slide, with one more row:
• Store address dispatch
  – This core: SMC snoop to iL1 on all store address dispatches from dL1
  – Other 3 cores: SMC snoop to iL1 on all store address dispatches
  – No snoops to the LSB, MSHR/WBB, or shared L2 queue