Post on 14-Dec-2015
University of Utah 1
Interconnect-Aware Coherence Protocols for Chip Multiprocessors
Liqun Cheng, Naveen Muralimanohar
Karthik Ramani, Rajeev Balasubramonian
John Carter
Motivation: Coherence Traffic
CMPs are ubiquitous
  Requires coherence among multiple cores
  Coherence operations entail frequent communication
  Messages have different latency and bandwidth needs
Heterogeneous wires
  11% better performance
  22.5% lower wire power
[Figure: cores C1, C2 and C3, each with an L1 cache, sharing an L2. Messages related to read miss: Read Req, Fwd to owner, Data. Messages related to write miss: Ex Req, Inval, Inv Ack.]
Exclusive request for a shared copy
1. Rd-Ex request from processor 1
2. Directory sends clean copy to processor 1
3. Directory sends invalidate message to processor 2
4. Cache 2 sends acknowledgement back to processor 1
[Figure: Processor 1 with Cache 1 and Processor 2 with Cache 2, both connected to L2 & Directory; arrows 1 to 4 are marked Critical or Non-Critical.]
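The four steps can be read as two paths back to processor 1: a one-hop reply from the directory, and a two-hop invalidate plus acknowledgement through cache 2. The sketch below is a minimal illustration of that imbalance. It assumes every hop costs the same and that processor 1 must wait for both the data and the acknowledgement; neither assumption comes from the slides, and all message and function names are hypothetical.

```python
# Hypothetical message model: (name, source, destination, parent message).
# Messages must be listed parents-first. Assumption: each hop costs one unit.
MESSAGES = [
    ("rd_ex_request", "P1", "DIR", None),
    ("clean_copy", "DIR", "P1", "rd_ex_request"),
    ("invalidate", "DIR", "P2", "rd_ex_request"),
    ("inval_ack", "P2", "P1", "invalidate"),
]


def arrival_times(messages, hop_latency=1):
    """Return the time at which each message reaches its destination."""
    arrival = {}
    for name, _src, _dst, parent in messages:
        start = arrival[parent] if parent else 0
        arrival[name] = start + hop_latency
    return arrival


def slack(messages, final_messages, hop_latency=1):
    """Slack of each final message relative to the last one to arrive."""
    arrival = arrival_times(messages, hop_latency)
    finish = max(arrival[m] for m in final_messages)
    return {m: finish - arrival[m] for m in final_messages}


result = slack(MESSAGES, ["clean_copy", "inval_ack"])
print(result)
```

Under these assumptions the message with non-zero slack can tolerate a slower wire without delaying the requester, which is the property the hop-imbalance optimization relies on.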
Wire Characteristics
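Wire resistance and capacitance per unit length
$$C_{wire} = \epsilon_0\left(2K\epsilon_{horiz}\frac{thickness}{spacing} + 2\epsilon_{vert}\frac{width}{layerspacing}\right) + fringe(\epsilon_{horiz}, \epsilon_{vert})$$
$$R_{wire} = \frac{\rho}{(thickness - barrier)\,(width - 2\,barrier)}$$
[Table: how wire Width and Spacing affect Resistance, Capacitance and Bandwidth.]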
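To make the width and spacing trade-off concrete, the sketch below evaluates per-unit-length resistance and capacitance in the commonly used form R = rho / ((thickness - barrier) * (width - 2 * barrier)) and C = eps0 * (2 * K * eps_horiz * thickness / spacing + 2 * eps_vert * width / layerspacing) + fringe. All numeric values (resistivity, geometry, dielectric constants, K) are illustrative assumptions, not figures from the slides, and the fringe term is passed in as a plain constant.

```python
EPSILON_0 = 8.854e-12  # vacuum permittivity in F/m


def wire_resistance(rho, width, thickness, barrier):
    """Resistance per unit length of a wire with a thin barrier layer."""
    return rho / ((thickness - barrier) * (width - 2 * barrier))


def wire_capacitance(width, spacing, thickness, layer_spacing,
                     eps_horiz, eps_vert, k=1.5, fringe=0.0):
    """Capacitance per unit length: side-wall term, vertical term, fringe."""
    side_wall = 2 * k * eps_horiz * thickness / spacing
    vertical = 2 * eps_vert * width / layer_spacing
    return EPSILON_0 * (side_wall + vertical) + fringe


# Hypothetical geometry in metres; rho is roughly that of copper.
base = dict(width=200e-9, spacing=200e-9, thickness=400e-9, layer_spacing=400e-9)
rho, barrier, eps = 2.2e-8, 10e-9, 2.7

r_base = wire_resistance(rho, base["width"], base["thickness"], barrier)
r_wide = wire_resistance(rho, 2 * base["width"], base["thickness"], barrier)
c_base = wire_capacitance(eps_horiz=eps, eps_vert=eps, **base)
c_spaced = wire_capacitance(eps_horiz=eps, eps_vert=eps,
                            **{**base, "spacing": 2 * base["spacing"]})

print(f"R base {r_base:.3e} ohm/m, R with doubled width {r_wide:.3e} ohm/m")
print(f"C base {c_base:.3e} F/m, C with doubled spacing {c_spaced:.3e} F/m")
```

With these inputs, widening the wire lowers resistance and increasing the spacing lowers capacitance, at the cost of fewer wires per unit of metal area.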
Design Space Exploration
Tuning wire width and spacing
Base case: B wires
Fast but low bandwidth: L wires
[Figure: effect of changing width & spacing on delay and bandwidth.]
Design Space Exploration
Tuning repeater size and spacing
Traditional wires: large repeaters, optimum spacing
Power-optimal wires: smaller repeaters, increased spacing
[Figure: delay versus power.]
Design Space Exploration
Wire type                        Plane   Latency   Power   Area
B wires (base case)              8x      1x        1x      1x
W wires (base case)              4x      1.6x      0.9x    0.5x
PW wires (power optimized)       4x      3.2x      0.3x    0.5x
L wires (fast, low bandwidth)    8x      0.5x      0.5x    4x
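One way to read these four design points is as a menu: given how much latency a message can tolerate and how much metal area is available, pick the wire class that costs the least power. The sketch below uses the relative latency, power and area figures from this slide; the selection rule itself is a hypothetical illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WireClass:
    name: str
    latency: float  # relative to B wires on the 8x plane
    power: float    # relative to B wires on the 8x plane
    area: float     # relative to B wires on the 8x plane


# Relative figures as listed on the slide.
WIRES = [
    WireClass("B", 1.0, 1.0, 1.0),
    WireClass("W", 1.6, 0.9, 0.5),
    WireClass("PW", 3.2, 0.3, 0.5),
    WireClass("L", 0.5, 0.5, 4.0),
]


def cheapest_wire(latency_budget, area_budget=1.0):
    """Lowest-power wire class that fits both budgets, or None if none fits."""
    candidates = [w for w in WIRES
                  if w.latency <= latency_budget and w.area <= area_budget]
    if not candidates:
        return None
    return min(candidates, key=lambda w: w.power)


for budget in (1.0, 2.0, 4.0):
    print(budget, cheapest_wire(budget).name)
```

Relaxing the latency budget moves the choice from B to W to PW wires, while L wires only become eligible when four times the base area can be spent.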
Outline
Overview
Wire Design Space Exploration
Protocol-dependent Techniques
Protocol-independent Techniques
Results
Conclusions
Directory Based Protocol (Write-Invalidate)
Map critical/small messages on L wires and non-critical messages on PW wires
Exploit hop imbalance
  Read exclusive request for block in shared state
  Read request for block in exclusive state
Negative Ack (NACK) messages
Read to an Exclusive Block
[Figure: message flow among Proc 1 L1, Proc 2 L1 and L2 & Directory. Messages shown: Read Req, Spec Reply, Req, ACK, Fwd Dirty Copy, WB Data. One message is labeled critical and two are labeled non-critical.]
NACK Messages
NACK: Negative Acknowledgement, generated when directory state is busy
Can employ MSHR id of the request instead of full address
Directory load is low
  Requests can be served at next try
  Sending NACK on L-Wires can improve performance
Directory load is high
  Frequent back off and retry cycles
  Sending NACK on PW-Wires can reduce power consumption
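The load-dependent choice above can be sketched as a small policy object. This is a minimal illustration only: the sliding window, the 0.5 threshold and the class and method names are all assumptions, since the slides do not say how directory load is measured.

```python
from collections import deque


class NackMapper:
    """Hypothetical policy: pick wires for NACKs from recent directory load."""

    def __init__(self, window=8, busy_threshold=0.5):
        # Keep only the most recent `window` observations.
        self.samples = deque(maxlen=window)
        self.busy_threshold = busy_threshold

    def record(self, directory_was_busy):
        """Note whether the directory was busy for the latest request."""
        self.samples.append(1 if directory_was_busy else 0)

    def load(self):
        """Fraction of recent requests that found the directory busy."""
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def wires_for_nack(self):
        # High load: a fast NACK only triggers another failed retry, so
        # save power on PW wires. Low load: the retry is likely to
        # succeed, so deliver the NACK quickly on L wires.
        return "PW" if self.load() > self.busy_threshold else "L"


mapper = NackMapper(window=4)
for busy in (True, True, True, False):
    mapper.record(busy)
print(mapper.load(), mapper.wires_for_nack())
```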
Snoop Bus Based Protocol
Similar to bus-based SMP system
Signal wires and voting wires
Signal wires
To find the state of the block
Voting wires
To vote for owner of the shared data
Protocol-Independent Techniques
Narrow bit-width operands for synchronization variables
  Lock and barrier use small integers
Writeback data to PW-wires
  Writeback messages are rarely on the critical path
Narrow messages to L-wires
  Only contain src, dst, operand and MSHR_id
  For example: reply for upgrade message
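Deciding whether a message is narrow comes down to checking how many bits its operand needs. The sketch below shows that check for non-negative integers such as lock and barrier values. The `payload_bits` budget of 8 is a hypothetical number standing in for whatever space remains on the L-wires after src, dst and MSHR_id.

```python
def operand_bits(value):
    """Bits needed to encode a non-negative integer (at least one bit)."""
    if value < 0:
        raise ValueError("this sketch handles non-negative operands only")
    return max(1, value.bit_length())


def fits_narrow_channel(value, payload_bits=8):
    """True if the operand fits in the assumed narrow-message payload."""
    return operand_bits(value) <= payload_bits


# Small synchronization values fit; a wide data word does not.
for value in (0, 1, 255, 256, 2**40):
    print(value, operand_bits(value), fits_narrow_channel(value))
```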
Implementation Complexity
Heterogeneous interconnect incurs additional complexity
Cache coherence protocols
  Robust enough to handle message re-ordering
Decision process
Interconnect implementation
Complexity in the Decision Process
In the directory based system
Optimizations that exploit hop imbalance
  Check directory state
Dynamic mapping of NACK messages
  Track directory load
Narrow messages
  Compute the width of an operand
Overhead in Interconnect Implementation
Additional multiplexing/de-multiplexing at sender and receiver side
Additional latches required for power optimized wires
  Power savings in PW-Wires go down by 5%
Wire area overhead
  Zero: equal metal area for base and heterogeneous case
Router Complexity
Base Model
[Figure: base router. Physical Channel 1 with virtual channels VC 1 and VC 2, a crossbar, and outputs Out 1 and Out 2.]
Router Complexity
Each physical channel is split into 3 channels (L, PW & B)
[Figure: heterogeneous router. B, PW and L inputs, a crossbar, and outputs Out 1 and Out 2, each carrying L, PW and B. Channel widths shown: 64 bytes, 32 bytes, 24 bits.]
Evaluation Platform & Simulation Methodology
Virtutech Simics Simulator
Sixteen-Core CMP
Ruby Timing model (GEMS)
NUCA cache architecture
MOESI Directory protocol
Benchmarks
SPLASH2
Opal Timing model (GEMS)
Out-of-Order Processor
Multiple outstanding requests
[Figure: diagram with Processor and L2 labels.]
Wire Model
[Figure: wire RC model; legible labels include Cside-wall and Cadj.]
Wire Type     Relative Latency   Relative Area   Dynamic Power   Static Power
B-Wire 8x     1x                 1x              2.65            1x
B-Wire 4x     1.6x               0.5x            2.9             1.13x
L-Wire 8x     0.5x               4x              1.46            0.55x
PW-Wire 4x    3.2x               0.5x            0.87            0.3x
Ref: Banerjee et al.
65nm process, 10 metal layers: 4 in the 1X plane and 2 in each of the 2X, 4X and 8X planes
Heterogeneous Interconnects
B-Wires
  Requests carrying an address
  Responses that are on the critical path
L-Wires (latency optimized)
  Narrow messages
  Unblock & write-control messages
  NACK
PW-Wires (power optimized)
  Writeback data
  Response to read request for an exclusive block
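The assignment on this slide amounts to a lookup from message type to wire class. A minimal sketch follows, with hypothetical message-type names chosen for illustration. Falling back to B-Wires for unlisted types is an assumption, not something the slides specify.

```python
from enum import Enum


class Wire(Enum):
    B = "B"    # baseline wires
    L = "L"    # latency optimized
    PW = "PW"  # power optimized


# Message-type names are illustrative; the grouping follows the slide.
MESSAGE_TO_WIRE = {
    "request_with_address": Wire.B,
    "critical_response": Wire.B,
    "narrow_message": Wire.L,
    "unblock": Wire.L,
    "write_control": Wire.L,
    "nack": Wire.L,
    "writeback_data": Wire.PW,
    "read_response_exclusive_block": Wire.PW,
}


def select_wire(message_type):
    """Return the wire class for a message type (assumed default: B)."""
    return MESSAGE_TO_WIRE.get(message_type, Wire.B)


for kind in ("nack", "writeback_data", "request_with_address"):
    print(kind, select_wire(kind).value)
```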
Performance Improvements
[Figure: bar chart of speedup (y-axis 0 to 1.8), Base Model versus Heter-Interconnect, for Barnes, Cholesky, FFT, FMM, LU-Cont, LU-Noncont, Ocean-Cont, Ocean-Noncont, Radix, Raytrace, Volrend, Water-Nsq and Water-Spa.]
Average improvement 11%
Percentage of Critical/Noncritical Messages
[Figure: bar chart (y-axis 0% to 100%) of L Wire Traffic and PW Wire Traffic for Barnes, Cholesky, FFT, FMM, LU-Cont, LU-Noncont, Ocean-Cont, Ocean-Noncont, Radix, Raytrace, Volrend, Water-Nsq and Water-Spa. Callouts: 13% and 40%.]
Performance 11%
Power Saving in wire 22.5%
Power Savings in Wires
[Figure: bar chart of wire power savings (y-axis 0% to 30%) for Barnes, Cholesky, FFT, FMM, LU-Cont, LU-Noncont, Ocean-Cont, Ocean-Noncont, Radix, Raytrace, Volrend, Water-Nsq, Water-Spa and Average.]
L-Message Distribution
[Figure: stacked bar chart (y-axis 0% to 100%) splitting L-wire messages into Hop Imbalance, Unblock & Ctrl, and Narrow Msgs for Barnes, Cholesky, FFT, FMM, LU-Cont, LU-Nonc, Ocean-Cont, Ocean-Nonc, Radix, Raytrace, Volrend, Water-Nsq and Water-Spa.]
Sensitivity Analysis
Impact of out-of-order core
  Average speedup 9.3%
  Partial simulation (only 100M instructions)
  OOO core is more tolerant to long latency operations
Link Bandwidth & Routing Algorithm
  Benchmarks with high link utilization are very sensitive to bandwidth change
  Deterministic routing incurs 3% performance loss compared to adaptive routing
Conclusions
Coherence messages have diverse needs
Intelligent mapping of messages to heterogeneous wires can improve performance and power
Low bandwidth, high speed links improve performance by 11% for SPLASH benchmarks
Non-critical traffic on power optimized network decreases wire power by 22.5%