Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors
Chinnakrishnan S. Ballapuram
Ahmad Sharif
Hsien-Hsin S. Lee
2Ballapuram, Sharif, and Lee
Concurrent Execution in CMP
[Figure: a single-threaded program (Thread 0) has its own code, data, registers and local stack. A multi-threaded program shares code and data among Thread 0, Thread 1 and Thread 2, while each thread keeps private registers and a local stack. All threads sit above a shared last level cache.]
Self-Modifying Code (SMC) Snoop
[Figure: Core 0 to Core 3, each with an IL1 and a DL1. In every core an SMC snoop goes to the IL1.]
Snoop for Core 0 DL1 Miss
[Figure: Core 0 to Core 3, each with an IL1, a DL1 and an SMC snoop path, attached to the CMP core interconnect. The shared side holds the L2 queue (FIFO), the L2 cache, the snoop queue (FIFO), other logic and buffers, and the external interconnect.]
External Snoop Request
[Figure: the L2 queue (FIFO), L2 cache, snoop queue (FIFO), other logic and buffers, and the external interconnect, connected through the CMP core interconnect to Core 0 to Core 3. Each core has an IL1, a DL1 and an SMC snoop path.]
Modified L2 Eviction, External Request, etc
[Figure: the L2 queue (FIFO), L2 cache, snoop queue (FIFO), other logic and buffers, and the external interconnect, connected through the CMP core interconnect to Core 0 to Core 3. Each core has an IL1, a DL1 and an SMC snoop path.]
Modified L2 Eviction, External Request, etc
[Figure: the L2 queue (FIFO), L2 cache, snoop queue (FIFO), other logic and buffers, and the external interconnect, connected through the CMP core interconnect to Core 0 to Core 3, each with an IL1, a DL1 and an SMC snoop path. Annotation: "As # of cores increases", with Power and Performance labels.]
Number of Snoop Probes
• SMC Snoops to I-Cache > Snoops to D-Cache > Snoops to LSB.
[Figure: number of snoop probes in millions (y-axis, 0 to 12) sent to_lsb, to_dcache and to_icache for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded apps. Series: 2C, 4C, 2 x 4C, 8C. One off-scale bar is labeled 16.4M.]
Snoop Probe and Snoop Rate
• % of data snoop > % of instruction cache snoop
[Figure: number of snoops in millions (left axis, 0 to 30) and % of snoop increase (right axis, 0% to 2400%) per processor configuration: 2C, 4C, 2Px4C, 8C, 8C-MT, 2Px4C-MT. Snoop counts: to_lsb, to_dcache, to_icache, total snoops. Increase curves: % of data snoop increase, % of SMC snoop increase, % of total snoop increase. Callouts: ~22x increase and ~12x increase.]
We propose two techniques to reduce the power consumed by snoop probes:
1. Selective Snoop Probe (SSP)
2. Essential Snoop Probe (ESP)
Selective Snoop Probe (SSP)
- SSP for SMC
- SSP for Non-Stack Accesses
- SSP for Stack Accesses
Normal Operation: To Support SMC
[Figure: Core 0. An access dispatched from the RS or LSB goes to the L1 D-cache (with its MSHR) and also sends an SMC snoop probe to the L1 I-cache.]
SSP (SMC) – No SMC Snoop if BF1 miss
[Figure: Core 0. All store addresses dispatched from the RS or LSB are hashed (HASH) into BF1, a counting Bloom filter (cntr), which sits in front of the SMC snoop probe to the L1 I-cache. The L1 D-cache and MSHR are on the normal access path. BF1 is used to filter SMC/XMC snoops.]
Legend: r1 – read Bloom filter; u1 – update Bloom filter; cntr – counting Bloom filter
SSP (SMC) – SMC Snoop Sent if BF1 Hit
[Figure: Core 0. All store addresses dispatched from the RS or LSB are hashed (HASH) into BF1, a counting Bloom filter (cntr). On a BF1 hit the SMC snoop probe goes to the L1 I-cache. The L1 D-cache and MSHR are on the normal access path.]
Legend: r1 – read Bloom filter; u1 – update Bloom filter; cntr – counting Bloom filter
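The BF1 filtering idea can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design from the slides: the filter size, the hash in `_index`, and the assumption that BF1 is updated (u1) on I-cache fills and evictions are all hypothetical. It relies only on the standard property that a counting Bloom filter never reports a false miss.

```python
class CountingBloomFilter:
    """Counting Bloom filter: an array of small counters indexed by a hash."""

    def __init__(self, num_counters=512, line_bytes=64):
        self.counters = [0] * num_counters
        self.line_shift = line_bytes.bit_length() - 1  # 64 B line -> shift by 6

    def _index(self, addr):
        # Hypothetical hash: fold the cache-line address into the counter array.
        line = addr >> self.line_shift
        return (line ^ (line >> 9) ^ (line >> 18)) % len(self.counters)

    def insert(self, addr):
        # u1 (assumed): a line is filled into the L1 I-cache.
        self.counters[self._index(addr)] += 1

    def remove(self, addr):
        # u1 (assumed): a line is evicted or invalidated from the L1 I-cache.
        i = self._index(addr)
        if self.counters[i] > 0:
            self.counters[i] -= 1

    def may_contain(self, addr):
        # r1: lookup with a store address.
        return self.counters[self._index(addr)] > 0


def store_needs_smc_snoop(bf1, store_addr):
    """BF1 miss: the line cannot be in the I-cache, so the probe is skipped.
    BF1 hit: the line may be in the I-cache, so the probe is sent."""
    return bf1.may_contain(store_addr)


bf1 = CountingBloomFilter()
bf1.insert(0x401000)                            # code line now in the I-cache
print(store_needs_smc_snoop(bf1, 0x401008))     # same 64 B line -> True
print(store_needs_smc_snoop(bf1, 0x7FFF0000))   # maps to an empty counter -> False
```

A hit can be a false positive, which only costs an unnecessary probe; a miss is always safe to act on.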
Normal Operation: Always Snoop for All Accesses
[Figure: Core 0. An access dispatched from the RS or LSB goes to the L1 D-cache and MSHR. A dL1 miss enters the L2 queue and the last level cache. The snoop controller and snoop queue send snoop probes for every access.]
SSP – Stack Accesses
[Figure: Core 0. Accesses dispatched from the RS or LSB go to the L1 D-cache and MSHR. All addresses carry an S-bit annotation set by the front-end. A dL1 miss enters the L2 queue and the last level cache. The snoop queue at the snoop controller shows example S-bits 0, 1, 0, 0.]
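The S-bit check at the snoop controller can be sketched as a simple filter over queue entries. This is an illustration under assumptions: the slides do not state the bit polarity, so the code assumes S-bit = 1 marks a stack access that needs no probe, and `QueueEntry` and `probes_to_send` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class QueueEntry:
    addr: int
    s_bit: int  # assumed: 1 = stack access (set by the front-end), 0 = non-stack


def probes_to_send(snoop_queue):
    """Return the addresses that would still be probed in other cores."""
    return [entry.addr for entry in snoop_queue if entry.s_bit == 0]


# Same bit pattern as the example queue on the slide: 0, 1, 0, 0.
queue = [
    QueueEntry(0x1000, 0),
    QueueEntry(0x7FFE0040, 1),  # stack access: thread-local, probe skipped
    QueueEntry(0x2000, 0),
    QueueEntry(0x3000, 0),
]
print([hex(a) for a in probes_to_send(queue)])  # ['0x1000', '0x2000', '0x3000']
```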
SSP – Non-stack Accesses Update BF2
[Figure: Core 0. Accesses dispatched from the RS or LSB go to the L1 D-cache (lines tagged with MESI states) and MSHR. All non-stack addresses are hashed (HASH) and update (u2) BF2, a counting Bloom filter (cntr) that filters snoops to the non-stack region. The L2 queue, last level cache, snoop controller and snoop queue (example bits 1, 0, 0, 0) are on the shared side.]
Legend: r2 – read Bloom filter; u2 – update Bloom filter; cntr – counting Bloom filter
SSP – Non-stack Accesses Read BF2
[Figure: Core 0. Accesses dispatched from the RS or LSB go to the L1 D-cache (lines tagged with MESI states) and MSHR. All non-stack addresses update (u2) BF2 (HASH, cntr), which filters snoops to the non-stack region. On a dL1 miss, all addresses carry the S-bit annotation into the L2 queue, and BF2 is read (r2) on the snoop path. The last level cache, snoop controller and snoop queue (example bits 1, 0, 0, 0) are on the shared side.]
Legend: r2 – read Bloom filter; u2 – update Bloom filter; cntr – counting Bloom filter
SSP - Selectively Send Snoop Probes
[Figure: Core 0. Accesses dispatched from the RS or LSB go to the L1 D-cache (lines tagged with MESI states) and MSHR. All non-stack addresses update (u2) BF2 (HASH, cntr), which filters snoops to the non-stack region. On a dL1 miss, all addresses carry the S-bit annotation into the L2 queue. The snoop controller, with its snoop queue (example bits 1, 0, 0, 0) next to the last level cache, selectively sends snoops.]
Legend: r2 – read Bloom filter; u2 – update Bloom filter; cntr – counting Bloom filter
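Putting the S-bit and BF2 together, the selective decision can be sketched as follows. This is a hedged illustration, not the authors' implementation: the filter size, the hash in `bf_index`, the `Core` class, and the assumption that each core keeps one BF2 tracking the non-stack lines in its own L1 D-cache are all hypothetical.

```python
from collections import Counter

LINE_SHIFT = 6        # 64 B cache lines
NUM_COUNTERS = 1024   # hypothetical BF2 size


def bf_index(addr):
    # Hypothetical hash over the cache-line address.
    line = addr >> LINE_SHIFT
    return (line ^ (line >> 10)) % NUM_COUNTERS


class Core:
    def __init__(self):
        # Counting Bloom filter over non-stack lines held in this core's dL1.
        self.bf2 = Counter()

    def fill(self, addr, is_stack):
        # u2: a non-stack line enters the L1 D-cache.
        if not is_stack:
            self.bf2[bf_index(addr)] += 1

    def evict(self, addr, is_stack):
        # u2: a non-stack line leaves the L1 D-cache.
        idx = bf_index(addr)
        if not is_stack and self.bf2[idx] > 0:
            self.bf2[idx] -= 1


def cores_to_probe(cores, requester, addr, s_bit):
    """Pick the cores that receive a snoop probe for a dL1 miss."""
    if s_bit:  # assumed polarity: 1 = stack access, never probed
        return []
    # r2: probe only cores whose BF2 says the line may be present.
    return [i for i, core in enumerate(cores)
            if i != requester and core.bf2[bf_index(addr)] > 0]


cores = [Core() for _ in range(4)]
cores[2].fill(0x5000, is_stack=False)
print(cores_to_probe(cores, requester=0, addr=0x5000, s_bit=0))  # [2]
print(cores_to_probe(cores, requester=0, addr=0x5000, s_bit=1))  # []
```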
SMC – Normal Operation
[Figure: Core 0. Accesses dispatched from the RS or LSB go to the L1 D-$. Every store also snoops the L1 I-$. Other pipe stages are shown alongside.]
ESP Essential Snoop Probe
• The OS sets a control register bit (SMC-CR)
• SMC-CR=1: non-self-modifying code
• SMC-CR=0: self-modifying code
[Figure: Core 0 with SMC-CR=1. Accesses dispatched from the RS or LSB go to the L1 D-$; the L1 I-$ and other pipe stages are shown alongside.]
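The effect of the SMC-CR bit on the store path can be sketched as a one-line check. This is an illustration only; the function names are hypothetical, and the bit values follow the bullets above (1 = non-self-modifying code, 0 = self-modifying code).

```python
def store_snoops_icache(smc_cr):
    """With SMC-CR = 1 the code is not self-modifying, so a store
    does not need to probe the L1 I-cache."""
    return smc_cr == 0


def icache_probes(num_stores, smc_cr):
    # In normal operation every store probes the I-cache once.
    return num_stores if store_snoops_icache(smc_cr) else 0


print(icache_probes(1000, smc_cr=0))  # 1000: self-modifying code, all stores snoop
print(icache_probes(1000, smc_cr=1))  # 0: probes suppressed
```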
Normal Operation – Snoop for All Variables
[Figure: Core 0. Accesses dispatched from the RS or LSB go to the L1 D-$; the L1 I-$ and other pipe stages are shown alongside. A dL1 miss enters the L2 queue and the last level cache. In the CMP interconnect domain, the snoop controller and snoop queue send snoop probes for all variables.]
Essential Snoop Probe (ESP) – SMN bit 1
[Figure: Core 0. Accesses dispatched from the RS or LSB go to the L1 D-$; the L1 I-$ and other pipe stages are shown alongside. A dL1 miss carries an SMN bit annotation into the L2 queue and the last level cache. In the CMP interconnect domain, the snoop queue at the snoop controller shows example SMN bits 1, 1, 0, 0.]
SMN bit – Snoop-Me-Not bit is 0/1
Essential Snoop Probe (ESP) – SMN bit 0
[Figure: Core 0. Accesses dispatched from the RS or LSB go to the L1 D-$; the L1 I-$ and other pipe stages are shown alongside. A dL1 miss carries an SMN bit annotation into the L2 queue and the last level cache. In the CMP interconnect domain, the snoop queue at the snoop controller shows example SMN bits 0, 1, 0, 0, and the probe path is labeled ESP.]
SMN bit – Snoop-Me-Not bit is 0/1
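A rough count of the probes that the SMN bit removes can be sketched as below. The assumptions are labeled in the code: SMN = 1 is taken to mean "do not snoop", and each snooped miss is assumed to probe every other core once. Neither detail is spelled out on the slides, and `probes_sent` is a hypothetical helper.

```python
def probes_sent(smn_bits, num_cores):
    """Count probes for a list of dL1 misses tagged with SMN bits.

    Assumptions: SMN = 1 suppresses the snoop; a snooped miss probes
    each of the other (num_cores - 1) cores once.
    """
    per_miss = num_cores - 1
    return sum(per_miss for smn in smn_bits if smn == 0)


baseline = probes_sent([0, 0, 0, 0], num_cores=4)  # every miss snoops
with_esp = probes_sent([1, 1, 0, 0], num_cores=4)  # example bits from the slide
print(baseline, with_esp)  # 12 6
```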
Energy Savings in D-Cache Using SSP
• SSP saves 5% - 10% of data cache energy in the 2C configuration and 30% - 65% in the 8C configuration.
• The data cache energy savings grow with the number of cores on the die, because the number of snoops sent to all the cores grows with it.
[Figure: % of data cache energy savings per core (y-axis, 0% to 70%) by processor configuration (2C, 4C, 2Px4C, 8C) for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded application.]
Energy Savings in I-Cache Using SSP
• SSP saves 50% - 70% of instruction cache tag energy across all processor configurations.
[Figure: % of icache tag energy savings per core (y-axis, 0% to 100%) by processor configuration (2C, 4C, 2Px4C, 8C) for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded application.]
Performance Impact with SSP
• On average, performance improves by 1% - 2% across the benchmark categories and processor configurations.
[Figure: performance (y-axis, 0% to 120%) for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded application, plus the harmonic mean across benchmarks, with the min and max performance observed. Series: 2C, 4C, 2Px4C, 8C.]
Energy Savings with ESP
• Between 5% and a maximum of 82% of data cache energy is spent on non-essential snoop probes, which the ESP technique can eliminate.
• Also, 85% of the instruction cache tag energy spent on snoops can be eliminated using ESP.
[Figure: % of cache energy spent on non-essential snoops per core (y-axis, 0% to 100%), with dcache and icache bars for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded application, plus the harmonic mean across benchmarks. Series: 2C, 4C, 2Px4C, 8C.]
Conclusion
• Semantics and program behavior are useful indicators
• They are exploited to reduce power due to snoops
• We proposed
  – Selective Snoop Probe (SSP)
  – Essential Snoop Probe (ESP)
• Energy Reduction Results
  – 5% to 65% in D-cache per core
  – 50% to 70% in I-cache per core
• 1% - 2% performance improvement
• Extensible to optimize integrated platforms with graphics processor
Simulation Infrastructure
Execution Engine                   4-wide, Out-of-Order
Load buf / Store buf / RS / ROB    96 / 64 / 128 / 256 entries
L1 / L2 latency                    4 / 8 cycles
L1 I, L1 D cache size              32KB, 8 way, 64B
L2 Cache                           4MB, 16 way, 64B
L1 TLB entries                     128, 4 way
Memory                             2GB, DDR 2 timings
CACTI 4.2                          70nm power model

Benchmark class                    Example applications
Server                             specJBB, TPCC
SPEC FP 2006                       wrf, namd, lbm, soplex
SPEC INT 2006                      hmmer, gobmk, omnetpp, gcc
Games and multi-media              shooters, realtime strategy, raytracer
Multi-threaded applications        ray tracer, cinebench
Number of Modified Lines
• Shows the number of modified lines that need to be evicted to the last level cache.
[Figure: number of modified lines at completion (y-axis, 0 to 220) for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded application, plus the average across benchmarks. Series: 2C, 4C, 2Px4C, 8C.]
Cache access Vs Snoop access
• Cache access – reads one sub-bank (8 bytes).
• Snoop access – must read all sub-banks (all 64 bytes, the cache line size) to ship the data to other cores or to another processor in an MP system.
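The difference between the two access types comes down to simple arithmetic, sketched here. The 8-byte sub-bank and 64-byte line come from the slide; the assumption that every sub-bank read costs the same energy is ours, so the resulting ratio is only a rough data-array estimate.

```python
LINE_BYTES = 64      # cache line size (from the slide)
SUB_BANK_BYTES = 8   # bytes read from one sub-bank (from the slide)


def sub_banks_read(is_snoop):
    """A normal cache access reads one sub-bank; a snoop that ships
    the whole line reads every sub-bank."""
    return LINE_BYTES // SUB_BANK_BYTES if is_snoop else 1


# Assumption: equal energy per sub-bank read.
ratio = sub_banks_read(True) / sub_banks_read(False)
print(sub_banks_read(False), sub_banks_read(True), ratio)  # 1 8 8.0
```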
Hash functions
[Figure: a cache line holds a MESI state, Tag + Index bits and Data. The 48-bit physical address is split into unused bits and the Tag + Index bits [6-32], which form three fields labeled B, C and A. The fields feed HASH 3, which indexes counter arrays (cntr): one used if the line is in M/E state, one if it is in S state. Bit-position labels (run together in the source): 6153347.]
If bit-10 is 0, HASH3 = A ^ B ^ C
If bit-10 is 1, HASH3 = (A ^ 0x22) ^ B ^ C
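The HASH3 rule can be written out as a short function. The two XOR formulas come from the slide; the field layout does not, so treat it as an assumption: here the 27 Tag + Index bits [6-32] are split into three 9-bit fields, with A = bits 6-14, C = bits 15-23 and B = bits 24-32.

```python
FIELD_BITS = 9                      # assumed: 27 bits split into three fields
FIELD_MASK = (1 << FIELD_BITS) - 1


def hash3(paddr):
    """HASH3 over a physical address, under the assumed field layout."""
    a = (paddr >> 6) & FIELD_MASK    # assumed field A: bits 6-14
    c = (paddr >> 15) & FIELD_MASK   # assumed field C: bits 15-23
    b = (paddr >> 24) & FIELD_MASK   # assumed field B: bits 24-32
    if (paddr >> 10) & 1:            # bit-10 selects the second formula
        a ^= 0x22
    return a ^ b ^ c


print(hash3(0x40))     # 1: A = 1, bit-10 clear
print(hash3(1 << 10))  # 50: A = 16, bit-10 set, 16 ^ 0x22 = 50
```

Under this layout the result always fits in 9 bits, so it can index a 512-entry counter array.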
Incoming Events to LLC
Incoming events to the last level cache:
- RFO
- Data Read
- Code fetch
- Shared L2 evict
Incoming Events to LLC and Sources of Snoop Triggers
Incoming events to the last level cache, with the unit of this core that triggers each one:
RFO              iL1: - | dL1: Event trigger
Data Read        iL1: - | dL1: Event trigger
Code fetch       iL1: Event trigger
Shared L2 evict  (no entries)
Snooped Units in the Triggered Core
Incoming events to the last level cache, with the units of this core that are involved:
RFO              iL1: - | dL1: Event trigger | LSB: - | MSHR, WBB: -
Data Read        iL1: - | dL1: Event trigger | LSB: - | MSHR, WBB: -
Code fetch       iL1: Event trigger | dL1: SMC snoop | LSB: Snoop store buffer only (updated writes) | MSHR, WBB: Snoop (update writes)
Shared L2 evict  iL1: - | dL1: Snoop | LSB: - | MSHR, WBB: Snoop
Snoop Probes for Incoming Data Read
Incoming events to the last level cache, with the units snooped in this core, in the other 3 cores, and in the shared L2 queue:
RFO
  This core:       iL1: - | dL1: Event trigger | LSB: - | MSHR, WBB: -
  Other 3 cores:   iL1: XMC snoop to invalidate line | dL1: Snoop | LSB: snoop load buffer only to invalidate | MSHR, WBB: Snoop to invalidate pending requests
  Shared L2 queue: Snoop to invalidate
Data Read
  This core:       iL1: - | dL1: Event trigger | LSB: - | MSHR, WBB: -
  Other 3 cores:   iL1: XMC snoop to invalidate line | dL1: Snoop | LSB: - | MSHR, WBB: Snoop
  Shared L2 queue: Snoop
Code fetch
  This core:       iL1: Event trigger | dL1: SMC snoop | LSB: Snoop store buffer only (updated writes) | MSHR, WBB: Snoop (update writes)
  Other 3 cores:   iL1: - | dL1: XMC snoop | LSB: Snoop store buffer only (update writes) | MSHR, WBB: Snoop
  Shared L2 queue: SMC Snoop
Shared L2 evict
  This core:       iL1: - | dL1: Snoop | LSB: - | MSHR, WBB: Snoop
  Other 3 cores:   iL1: - | dL1: Snoop | LSB: - | MSHR, WBB: Snoop
  Shared L2 queue: Snoop
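Snoop Triggers and Snoop Units
Incoming events to the last level cache, with the units snooped in this core, in the other 3 cores, and in the shared L2 queue:
RFO
  This core:       iL1: - | dL1: Event trigger | LSB: - | MSHR, WBB: -
  Other 3 cores:   iL1: XMC snoop to invalidate line | dL1: Snoop | LSB: snoop load buffer only to invalidate | MSHR, WBB: Snoop to invalidate pending requests
  Shared L2 queue: Snoop to invalidate
Data Read
  This core:       iL1: - | dL1: Event trigger | LSB: - | MSHR, WBB: -
  Other 3 cores:   iL1: XMC snoop to invalidate line | dL1: Snoop | LSB: - | MSHR, WBB: Snoop
  Shared L2 queue: Snoop
Code fetch
  This core:       iL1: Event trigger | dL1: SMC snoop | LSB: Snoop store buffer only (updated writes) | MSHR, WBB: Snoop (update writes)
  Other 3 cores:   iL1: - | dL1: XMC snoop | LSB: Snoop store buffer only (update writes) | MSHR, WBB: Snoop
  Shared L2 queue: SMC Snoop
Shared L2 evict
  This core:       iL1: - | dL1: Snoop | LSB: - | MSHR, WBB: Snoop
  Other 3 cores:   iL1: - | dL1: Snoop | LSB: - | MSHR, WBB: Snoop
  Shared L2 queue: Snoop
(no event label)
  This core:       iL1: SMC snoop to iL1 | dL1: On all store addr disp | LSB: - | MSHR, WBB: -
  Other 3 cores:   iL1: SMC snoop to iL1 | dL1: On all store addr disp | LSB: - | MSHR, WBB: -
  Shared L2 queue: -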