CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros...
![Page 1: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/1.jpg)
CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips
Ciprian Seiculescu
Stavros Volos
Naser Khosro Pour
Babak Falsafi
Giovanni De Micheli
LSI, Integrated Systems Laboratory
![Page 2: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/2.jpg)
NoCs Major Power Consumer
Move towards manycore
• Tiled architectures
Network-on-Chip (NoC)
• Significant power consumer
• 40% MIT RAW
• 30% Intel Tera-scale
Cache-coherent CMP
• Server workloads
[Figure: 16 tiles, each labeled C$; inset shows two cores, two caches ($), and a crossbar]
![Page 3: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/3.jpg)
Proposals to Reduce NoC Power
Multiple networks
• Better area and power [Balfour & Dally ICS 2006]
Commercial server workloads
• Traffic patterns are different
Run on cache-coherent CMPs
• Strong relation between coherence protocol and NoC
Not optimized for commercial server workload traffic
![Page 4: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/4.jpg)
Contributions
Commercial server workloads
• Optimized for reuse in L1, little sharing
• Full-blown coherence protocol in CMPs
• Only some transitions are frequent
Duality in Request/Response message size
CCNoC
• Takes full advantage of heterogeneity
• Same number of buffers
• 16% less power at the same performance as Mesh
![Page 5: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/5.jpg)
Outline
Overview
Why CCNoC?
Dual-router design
Evaluation
Conclusions
![Page 6: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/6.jpg)
Dual Router is More Efficient
Dual router
• Two crossbars per routing node
Wires are less expensive on-chip
• Use more wires for better performance
Area and power grow faster than connectivity
• Balfour & Dally ICS 2006
• Dual router: better performance, power and area
[Figure labels: N bit wide; N/2 bit wide; N/2 bit wide]
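The scaling argument on this slide can be illustrated with a rough sketch. It assumes, purely for illustration, that crossbar cost grows with the square of ports times datapath width (a common first-order model, not a figure taken from these slides). The function names and the port count of 5 are hypothetical.

```python
def crossbar_cost(ports: int, width_bits: int) -> int:
    """First-order crossbar cost in arbitrary units.

    Assumption for illustration only: the crossbar side length grows
    linearly with ports * width, so its area grows quadratically.
    """
    return (ports * width_bits) ** 2


def dual_to_single_ratio(ports: int, width_bits: int) -> float:
    """Cost of two half-width crossbars divided by one full-width crossbar."""
    single = crossbar_cost(ports, width_bits)
    dual = 2 * crossbar_cost(ports, width_bits // 2)
    return dual / single


if __name__ == "__main__":
    # Hypothetical 5-port router with a 128-bit datapath.
    print(f"dual / single crossbar cost: {dual_to_single_ratio(5, 128):.2f}")
```

Under this assumed model the two N/2-bit crossbars together cost half as much as one N-bit crossbar; real designs also pay for duplicated control logic, which the sketch ignores.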
![Page 7: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/7.jpg)
Right Dual Router Design
Avoid protocol-level deadlock
• Separate
- Requests
- Responses
• Use Virtual Channels
CCNoC
• Sub-networks
- Request / Response
• No VCs needed
• Same number of buffers
Buffers are power hungry
[Figure: MIT RAW power breakdown, Buffers vs. Crossbar + Links; H.S. Wang & L.S. Peh, MICRO 2003]
![Page 8: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/8.jpg)
Protocol Activity
CMPs implement a full-blown coherence protocol
• Some transitions are frequent [Hardavellas ISCA 2009]
- Read clean block
- Evict clean block
- Write to unshared block
• Other transitions needed for correctness (infrequent)
- Read dirty block
- Evict dirty block
- Write to shared block
![Page 9: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/9.jpg)
Frequent Read Protocol Activity
[Diagram: messages exchanged among Reader, Directory, and Writer]
Messages: Read Req, Read Resp, Evict Clean Req
Legend: Short Req, Short Resp, Long Resp
![Page 10: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/10.jpg)
Frequent Write Protocol Activity
[Diagram: messages exchanged between Writer and Directory]
Messages: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp
Legend: Short Req, Short Resp, Long Resp
![Page 11: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/11.jpg)
Infrequent Read Protocol Activity
[Diagram: messages exchanged among Reader, Directory, and Writer]
Messages: Read Req, Read Resp, Downgrade Req, Downgrade Resp
Legend: Short Req, Short Resp, Long Resp
![Page 12: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/12.jpg)
Infrequent Write Protocol Activity
[Diagram: messages exchanged among Writer, Directory, Reader 1, and Reader 2]
Messages: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp, Inv Req (two), Inv Resp (two), Evict Dirty Req
Legend: Short Req, Short Resp, Long Resp
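The message types named on the four protocol slides can be grouped by network class and length, as in the minimal sketch below. The message names come from the slides; the `carries_data` flags are assumptions (a message is treated as long when it would plausibly carry a cache block), and the `classify` helper is hypothetical.

```python
from enum import Enum
from typing import Dict, Tuple


class Network(Enum):
    REQUEST = "request"
    RESPONSE = "response"


# (network class, carries_data). The carries_data flags are assumptions,
# not values stated on the slides.
MESSAGES: Dict[str, Tuple[Network, bool]] = {
    "Read Req": (Network.REQUEST, False),
    "Fetch/Upgrade Req": (Network.REQUEST, False),
    "Evict Clean Req": (Network.REQUEST, False),
    "Evict Dirty Req": (Network.REQUEST, True),
    "Inv Req": (Network.REQUEST, False),
    "Downgrade Req": (Network.REQUEST, False),
    "Read Resp": (Network.RESPONSE, True),
    "Fetch Resp": (Network.RESPONSE, True),
    "Upgrade Resp": (Network.RESPONSE, False),
    "Inv Resp": (Network.RESPONSE, False),
    "Downgrade Resp": (Network.RESPONSE, True),
}


def classify(name: str) -> Tuple[Network, str]:
    """Return the network class and the length class of a message type."""
    network, carries_data = MESSAGES[name]
    return network, "long" if carries_data else "short"


if __name__ == "__main__":
    for name in MESSAGES:
        network, length = classify(name)
        print(f"{name:18s} {network.value:9s} {length}")
```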
![Page 13: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/13.jpg)
Traffic Analysis
[Figure: Traffic Distribution (0% to 100%) per workload, split into Short Req, Long Req, Short Resp, Long Resp. Workload groups: OLTP, DSS, WEB, SCI, MIX. Bar labels: DB2, ORACLE, DB2 MIX, APACHE, ZEUS, EM3D, SPEC2K]
Request: 93% short
Response: 86% long
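A breakdown of this kind could be tallied from a message trace roughly as follows. This is a sketch under stated assumptions: the trace format (pairs of network class and length class) and the function name `traffic_distribution` are hypothetical, and the tiny trace in the usage example is made up for illustration rather than taken from the measured workloads.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def traffic_distribution(
    trace: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """For each network class, return the fraction of short and long messages.

    Each trace entry is a (network, length) pair, for example
    ("request", "short"). This format is an assumption of the sketch.
    """
    counts = Counter(trace)
    result: Dict[str, Dict[str, float]] = {}
    for network in ("request", "response"):
        total = sum(n for (net, _), n in counts.items() if net == network)
        result[network] = {
            length: (counts[(network, length)] / total if total else 0.0)
            for length in ("short", "long")
        }
    return result


if __name__ == "__main__":
    # Made-up trace, for illustration only.
    trace = (
        [("request", "short")] * 3
        + [("request", "long")]
        + [("response", "long")] * 2
    )
    for network, shares in traffic_distribution(trace).items():
        print(network, shares)
```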
![Page 14: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/14.jpg)
CCNoC Router
Request network narrow: optimized for short messages
Response network wide: optimized for long messages
[Figure: Router containing a Request Switch and a Response Switch, attached to the NI]
![Page 15: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/15.jpg)
Previous Work
Balfour et al. ICS 2006
• Better than single large router
• Read/Write traffic
• Same number of reads and writes
Yoon et al. DAC 2010
• Physical channels better than virtual channels
Not optimized for cache-coherent CMPs
• Running commercial server workloads
![Page 16: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/16.jpg)
Outline
Overview
Why CCNoC?
Dual-router design
Evaluation
Conclusions
![Page 17: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/17.jpg)
Evaluation Methodology
FLEXUS
• Full-system simulation
• 16 or 8 UltraSPARC III ISA cores
• Split I/D, 64KB L1
• 1 or 2 MB L2
ORION 2.0
• Power estimation
• Area estimation
Workloads
• OLTP: TPC-C
- IBM DB2 and Oracle
• DSS: TPC-H
- IBM DB2
- Q1, Q6, Q13, Q16
• Web: SPECweb99
- Apache and Zeus
• Scientific: EM3D
• Multiprogrammed:
- SPEC2K
- 2x: gcc, twolf, art, mcf
![Page 18: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/18.jpg)
Evaluation NoCs
Mesh-128 - baseline
• 128-bit flit width
Torus - reference
• 128-bit flit width
Mesh-176 - high performance
• 176-bit flit width
CCNoC
• Request: 48-bit flit width
• Response: 128-bit flit width
Switches
• Wormhole flow control
• Input queued
• Transmission protocol
- On/Off
• Input buffers
- 2 entry
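The effect of these flit widths can be worked through with a small sketch. The flit widths are the ones listed on this slide; the message sizes are assumptions chosen for illustration (a 48-bit control message, and a 48-bit header plus a 64-byte cache block for a data message), and the helper `flits_needed` is hypothetical. Note that the two CCNoC widths add up to 48 + 128 = 176 bits.

```python
import math
from typing import Dict

# Flit widths in bits, as listed on the slide.
FLIT_WIDTH_BITS: Dict[str, Dict[str, int]] = {
    "Mesh-128": {"request": 128, "response": 128},
    "Torus": {"request": 128, "response": 128},
    "Mesh-176": {"request": 176, "response": 176},
    "CCNoC": {"request": 48, "response": 128},
}

# Assumed message sizes in bits (not given on the slides).
SHORT_MSG_BITS = 48            # control-only message
LONG_MSG_BITS = 48 + 64 * 8    # header plus a 64-byte cache block


def flits_needed(message_bits: int, flit_width_bits: int) -> int:
    """Number of flits needed to carry a message of the given size."""
    return math.ceil(message_bits / flit_width_bits)


if __name__ == "__main__":
    for noc, widths in FLIT_WIDTH_BITS.items():
        short_req = flits_needed(SHORT_MSG_BITS, widths["request"])
        long_resp = flits_needed(LONG_MSG_BITS, widths["response"])
        print(f"{noc:9s} short request: {short_req} flit(s), "
              f"long response: {long_resp} flit(s)")
```

Under these assumed sizes a short request still fits in one flit on the narrow CCNoC request network, so narrowing that network does not add flits to the common request case.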
![Page 19: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/19.jpg)
Performance
[Figure: Normalized IPC (to Torus), 0 to 1.2, for Mesh-128, Mesh-176, and CCNoC. Workload groups: OLTP, DSS, WEB, SCI, MIX. Bar labels: DB2, ORACLE, DB2 MIX, APACHE, ZEUS, EM3D, SPEC2K]
Performance loss: 2% against Torus, 8% against Mesh-176
![Page 20: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/20.jpg)
Power Savings
Power savings: 16% against Mesh-128, 22% against Torus, 38% against Mesh-176
[Figure: Normalized Total Power, 0 to 1.4, for Torus, Mesh-128, Mesh-176, and CCNoC. Workload groups: OLTP, DSS, WEB, SCI, MIX. Bar labels: DB2, ORACLE, DB2 MIX, APACHE, ZEUS, EM3D, SPEC2K]
![Page 21: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/21.jpg)
Conclusions
Duality in Request/Response traffic
• Request: dominated by short messages
• Response: dominated by long messages
Proposed CCNoC
• Narrow request network
• Wide response network
Showed significant power savings
• 22% against Torus
• 38% against Mesh-176
![Page 22: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.](https://reader030.fdocuments.in/reader030/viewer/2022032800/56649d215503460f949f6e01/html5/thumbnails/22.jpg)
Thank you!
Q&A