Scalable Nanophotonic Interconnect for Cache Coherent Multicores
Randy W. Morris, Jr. and Avinash K. Kodi
Department of Electrical Engineering and Computer Science
Ohio University, Athens, OH 45701 E-mail: [email protected], [email protected]
Website: http://oucsace.cs.ohiou.edu/~avinashk/
WINDS 2010: Workshop on the Interaction between Nanophotonic Devices and Systems
Atlanta, GA
December 5, 2010
Talk Outline
• Section I: Motivation & Background
• Section II: Dual Sub-Network for Snoopy Cache Coherent Nanophotonic Architecture
• Section IV: Performance Analysis
• Section V: Future Work
Why Nanophotonics?
Tile Power: Intel Tera-Flops (65 nm) [2]
- Clock Distribution: 11%
- Dual FPMACs: 36%
- Router & Links: 28%
- 10-port RF: 4%
- IMEM + DMEM: 21%
2. Y. Hoskote, “A 5-GHz Mesh Interconnect for A Teraflops Processor,” IEEE Computer Society, 2007, pp. 51-61
• Power consumption of Networks-on-Chip (NoCs) using metallic interconnects is projected to exceed expectations by a factor of 10 [1]
1. J. D. Owens, W. J. Dally, R. Ho, D. N. Jayasimha, S. W. Keckler, and L. S. Peh, “Research Challenges for On-Chip Interconnection Networks,” IEEE Micro, vol. 27, no. 5, pp. 96-108, September-October 2007.
Nanophotonic Technology
- Low Power
- Small Footprint (10 – 15 µm)
- High Bandwidth (10 – 20 Gbps)
- CMOS Compatibility
Resonant wavelength ($\lambda_0$): $\lambda_0 \times m = n_{\text{eff}} \times 2\pi R$
$m$: an integer
$n_{\text{eff}}$: effective refractive index
$R$: radius of the ring resonator
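As a worked example of the resonance condition λ0 × m = n_eff × 2πR, the sketch below lists the resonant wavelengths of a single ring that fall inside a given band. The values n_eff = 2.4 and R = 5 µm, the 1500 to 1600 nm band, and the function name are illustrative assumptions, not figures from the talk. n_eff is also treated as independent of wavelength for simplicity.

```python
import math

def resonant_wavelengths(n_eff, radius_um, band_nm=(1500.0, 1600.0)):
    """Return (m, wavelength_nm) pairs satisfying lambda0 * m = n_eff * 2*pi*R."""
    optical_path_nm = n_eff * 2.0 * math.pi * radius_um * 1000.0  # um -> nm
    lo, hi = band_nm
    m_min = math.ceil(optical_path_nm / hi)   # smallest order inside the band
    m_max = math.floor(optical_path_nm / lo)  # largest order inside the band
    return [(m, optical_path_nm / m) for m in range(m_min, m_max + 1)]

# Hypothetical ring: n_eff = 2.4, R = 5 um (assumed values, not from the slides)
for m, lam in resonant_wavelengths(2.4, 5.0):
    print(f"m = {m}: lambda0 = {lam:.1f} nm")
```

Each integer m gives one resonance, so a single ring filters a comb of wavelengths rather than just one.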
[Figure: micro-ring resonator (n+/p+/n+ doping) between Input Port 0, Output Port 0, and Output Port 1, shown for ring voltage VR = VOFF and VR = VON]
Micro-ring Resonators
1. M. Lipson, Compact Electro-Optic Modulators on a Silicon Chip, IEEE J. Sel. Top. Quant., Vol. 12, No. 6, Nov.-Dec. 2006, p. 1520-6.
2. M. Lipson, Guiding, Modulating and Emitting Light on Silicon - Challenges and Opportunities, IEEE Journal of Lightwave Technology, Vol. 23, No. 12, December 2005 (invited).
Cache Coherence
- Write propagation (write by any processor should become visible to all other processors)
- Write serialization (all writes from same or different processors are seen in the same order by all processors)
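Snoopy Protocols
[Figure: processors P1 to P4, each with a cache ($), connected to memory over a broadcast medium]
- Easy to program
- Not easily scalable
Directory Protocols
[Figure: processors P1 to P4, each with a cache ($), memory (M), and directory (D), connected point-to-point through an interconnection network]
- Scalable
- High miss latency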
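The two coherence requirements listed on this slide can be illustrated with a toy snoopy bus. The sketch below assumes an atomic, write-through, write-invalidate bus. The class and method names are hypothetical and it is not a model of CC-NPA itself. Because the bus handles one write at a time and every cache snoops it, writes are both propagated and seen in a single global order.

```python
class SnoopyBus:
    """Toy atomic bus: every write is broadcast to all caches in one global order."""

    def __init__(self, n_caches):
        self.memory = {}
        self.caches = [dict() for _ in range(n_caches)]
        self.logs = [[] for _ in range(n_caches)]  # writes as observed by each cache

    def write(self, cpu, addr, value):
        # The bus carries one write at a time, which serializes all writes.
        self.memory[addr] = value                  # write-through (assumption)
        for i, cache in enumerate(self.caches):
            self.logs[i].append((cpu, addr, value))
            if i == cpu:
                cache[addr] = value                # writer keeps the new value
            else:
                cache.pop(addr, None)              # snoopers invalidate stale copies

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:                      # miss: fetch from memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

bus = SnoopyBus(4)
bus.read(1, 0x40)          # P2 caches the block
bus.write(0, 0x40, 7)      # P1 writes; P2's copy is invalidated
print(bus.read(1, 0x40))   # P2 misses and sees the new value
```

The cost of this simplicity is that every cache sees every write, which is the scalability limit the following slides address.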
Problems with Snoopy Networks
Two major problems with snoopy cache coherent networks
(1) Interconnect bandwidth for broadcasting of memory requests
- Bus Networks: Limited to one request per cycle
- Multiple Buses: Increases cache controllers
- Point-to-Point Networks: Selective multicasting & ordering
(2) Cache Access Rate
- Cache tag lookup (latency)
- Increased power consumption
Related Work (to name a few)
Electrical
– Split Transactional Bus
– Sun Fireplane (SC 2001)
– Timestamp Snooping (ASPLOS 2000), Multicast Snooping (ISCA 2001)
– Jetty (HPCA 2001), Region Scout (ISCA 2005), Intel QPI
– Broadcasting on Ordered Networks (HPCA 2009, MICRO 2009)
Optical/Nanophotonic
– SYMNET (Trans on Parallel & Dist Systems 2004)
– Shared Bus (MICRO 2006), Wavelength Routed Oblivious Network (ASPLOS 2010)
– Spectra (ISLPED 2009), ATAC (PACT 2010)
CC-NPA Architecture
• Advantages of the proposed architecture
  – Dual sub-networks for memory requests
    • Broadcast & Multicast networks
  – Broadcast network used by all tiles to fetch the missed block
    • Network access implemented using tokens
    • Determines the sharing pattern
  – Multicast network shared between nodes to send selective requests
    • Reduces the broadcast requirement
    • Simultaneous transient requests in progress to different memory locations
  – Reduces the external laser power through unique power guiding techniques
Proposed Broadcast Sub-Network Architecture: CC-NPA
[Figure: 4 × 4 grid of tiles (Tile 0 to Tile 15) with a control center. Tile detail: Core 0 to Core 3, four L1 caches, a shared L2, a transmitter, and a receiver.]
Power Guiding
As only one core can transmit at a time, optical power is routed to a single column of cores.
- Reduction in optical power (~75%)
[Figure: optical power guided to Column 0, Column 1, Column 2, or Column 3; 2 dB optical loss]
The active column is determined by the circulating optical tokens
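One way to read the ~75% figure: with four columns, steering the laser power to the single active column means supplying roughly one quarter of the power needed to feed all columns at once. The sketch below works through that arithmetic. Treating the saving as 1 − 1/columns is an assumption made here, not a derivation given in the talk. The optional loss parameter only shows what would happen if the 2 dB loss on the slide were charged against the saving, which the slide does not state.

```python
def guided_power_saving(n_columns, guide_loss_db=0.0):
    """Fraction of laser power saved by feeding one column instead of all n_columns."""
    # Extra power factor needed to overcome any loss added by the guiding path.
    loss_factor = 10 ** (guide_loss_db / 10.0)
    return 1.0 - loss_factor / n_columns

print(f"{guided_power_saving(4):.0%}")        # ideal case, four columns
print(f"{guided_power_saving(4, 2.0):.0%}")   # hypothetical: 2 dB charged to guiding
```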
Optical Token System (1/3)
[Figure: 4 × 4 tile grid (Tile 0 to Tile 15) with power, inject, and return waveguides and the control center. Annotations: requests a token, inject token, received token.]
Optical Token System (2/3)
[Figure: 4 × 4 tile grid (Tile 0 to Tile 15) with power, inject, and return waveguides. Annotations: requests a token, inject token, token re-injected.]
Optical Token System (3/3)
[Figure: 4 × 4 tile grid (Tile 0 to Tile 15) with power, inject, and return waveguides. Annotations: requests a token, inject token, token returns, to next column.]
Fairness can be ensured with additional techniques (Fair slot, Two pass)
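The three token slides describe a request, grant, and return cycle. The sketch below is a minimal software model of that cycle, assuming a single token that is offered to the columns in round-robin order and held for one transmission before it moves on. The class and method names, the one-transmission-per-grant rule, and the round-robin order are illustrative assumptions. The fairness schemes named on the slide (Fair slot, Two pass) are not modeled.

```python
from collections import deque

class TokenArbiter:
    """Toy model: one token circulates over the columns; only the holder may broadcast."""

    def __init__(self, n_columns=4):
        self.n_columns = n_columns
        self.pending = [deque() for _ in range(n_columns)]  # queued tiles per column
        self.next_column = 0                                # where the token goes next

    def request(self, column, tile):
        """A tile in the given column asks for the token."""
        self.pending[column].append(tile)

    def step(self):
        """Inject the token. The first column with a waiting tile captures it,
        transmits once, and returns it so that it moves on to the next column."""
        for offset in range(self.n_columns):
            column = (self.next_column + offset) % self.n_columns
            if self.pending[column]:
                tile = self.pending[column].popleft()
                self.next_column = (column + 1) % self.n_columns
                return column, tile
        return None  # no requests: the token returns unused

arb = TokenArbiter()
arb.request(0, 0)   # Tile 0 and Tile 4 both sit in column 0
arb.request(0, 4)
arb.request(2, 6)   # Tile 6 sits in column 2
grants = [arb.step() for _ in range(4)]
print(grants)
```

Passing the token to the next column after each grant keeps one busy column from holding the network indefinitely, which is the simplest form of the fairness concern raised on the slide.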
Proposed Multicast Sub-Network
For larger networks, snoopy-based cache coherence reduces performance
- Broadcasting data to all shared tiles, consuming more address bandwidth
- Consumes more latency and power at the caches
• Wavelength routed second multicast sub-network
• Filter and route cache requests to nodes that hold the cache data
• Reduction in required bandwidth and power dissipation
• Potential for simultaneous multiple requests (could lead to race conditions)
[Figure: Percentage of requests with multiple sharers (single sharer vs. multiple sharers) for FFT, LU, Radix, and Ocean; vertical axis from 0 to 1.2]
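To illustrate why filtering helps, the sketch below compares how many tiles must perform a tag lookup when a request is broadcast and when it is multicast only to the tiles that hold the block. The sharer table, its contents, and the function name are hypothetical. The slides do not describe how CC-NPA stores the sharing pattern.

```python
def snoop_targets(requester, addr, sharers, n_tiles=16):
    """Return the set of tiles that must snoop a request for addr.

    sharers maps a block address to the set of tiles known to hold it.
    Blocks with no known sharing pattern fall back to a full broadcast.
    """
    everyone = set(range(n_tiles)) - {requester}
    known = sharers.get(addr)
    if known is None:
        return everyone                 # sharing pattern unknown: broadcast
    return set(known) - {requester}     # multicast to the holders only

sharers = {0x1000: {3}, 0x2000: {1, 5, 9}}   # hypothetical sharing pattern
for addr in (0x1000, 0x2000, 0x3000):
    targets = snoop_targets(0, addr, sharers)
    print(hex(addr), sorted(targets), f"{len(targets)} tag lookups")
```

When most requests have a single sharer, the multicast path touches one cache instead of fifteen, which is the bandwidth and power reduction this slide points to.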
Initial Performance Analysis
• Performance Comparison
– Simics with GEMS memory module
– FFT, LU, Radiosity, Ocean, Radix, & Water
• Area & Power Analysis
Simics Parameters

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| L1/L2 coherence | [illegible in source] | Core frequency | 5 GHz |
| L2 cache size/assoc | 256 KB/16-way | Threads (core) | 2 |
| L1 cache/assoc | 64 KB/4-way | Issue policy | In-order |
| Cache line size | 64 B | Memory size (GB) | 4 |
| Memory controllers | 16 | Address bandwidth (opt) | [illegible in source] |
| Address bandwidth (elec) | 320 GB/s | | |
Splash-2 Speed up (16-cores)
[Figure: speed-up of Electrical vs. CC-NPA for FFT, LU, Radiosity, Radix, and Raytrace; vertical axis from 0 to 1.4]
- CC-NPA increases performance by about 25%
Splash-2 Speed up (64-cores)
[Figure: speed-up of Electrical vs. CC-NPA for FFT, LU, Radix, and Ocean; vertical axis from 0 to 4]
- CC-NPA increases performance by up to 2x
Power Analysis

| Device | Loss (dB) | Device | Loss (dB) |
|---|---|---|---|
| Coupler ($L_C$) | 1 | Filter drop ($L_F$) | 1 |
| Non-linearity ($L_N$) | 1 | Bending ($L_B$) | 1 |
| Photo-detector ($L_P$) | 1 | Waveguide crossing ($L_{WC}$) | 0.05 |
| Modulator insertion ($L_I$) | 1 | Receiver sensitivity ($L_{RS}$) | -20 dBm |
| Waveguide, per cm ($L_W$) | 1.3 | Splitter ($L_S$) | 3 |
| Laser efficiency | 30% | Ring modulation | 150 [unit illegible in source] |
| Ring heating | 100 [unit illegible in source] | TIA/voltage amp. | 1.1 and 100 [units illegible in source] |

[Figure: optical path annotated with the losses $L_B$, $L_C$, $L_S$, $L_I$, $L_F$, $L_{WC}$, $L_W$, $L_P$, and $L_{RS}$]
Path loss: $5 L_S + 7 L_W + L_C + L_N + 3 L_I + L_F + 8 L_B + 100 L_{WC} = 43.1$ dB, i.e. -43.1 dB (per wavelength)
Total Power (opt) = 5.44 W (8 wavelengths)
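The loss budget on this slide can be checked numerically. The sketch below sums the per-device losses with the multipliers from the slide's formula, using the loss values as read from the device table (1.3 dB/cm for the waveguide, 0.05 dB per crossing, 3 dB per splitter, 1 dB for the rest). The step from path loss to total power is not spelled out in the talk. The chain used here (receiver sensitivity plus path loss, times eight wavelengths, divided by the 30% laser efficiency) is an assumption that reproduces the quoted 5.44 W.

```python
# Device losses in dB, as read from the loss table on this slide.
LOSS_DB = {"LS": 3.0, "LW": 1.3, "LC": 1.0, "LN": 1.0,
           "LI": 1.0, "LF": 1.0, "LB": 1.0, "LWC": 0.05}
# How many times each loss appears along the path, from the slide's formula.
COUNT = {"LS": 5, "LW": 7, "LC": 1, "LN": 1, "LI": 3, "LF": 1, "LB": 8, "LWC": 100}

def path_loss_db():
    """Worst-case optical path loss per wavelength, in dB."""
    return sum(COUNT[k] * LOSS_DB[k] for k in COUNT)

def laser_power_w(sensitivity_dbm=-20.0, efficiency=0.30, wavelengths=8):
    """Electrical laser power (assumed chain, see lead-in) in watts."""
    per_wavelength_dbm = sensitivity_dbm + path_loss_db()   # power needed at the source
    per_wavelength_mw = 10 ** (per_wavelength_dbm / 10.0)   # dBm -> mW
    return wavelengths * per_wavelength_mw / efficiency / 1000.0

print(f"path loss = {path_loss_db():.1f} dB")
print(f"laser power = {laser_power_w():.2f} W")
```

The 100 waveguide crossings and the 5 splitters together account for 20 dB of the 43.1 dB, so they are the terms most worth reducing.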
Area Analysis

| Device | Area |
|---|---|
| Waveguide (pitch) | 5.5 µm |
| Micro-ring resonator | 100 µm² |
| Photo-detector | 100 µm² |
| TIA/Limiting Amplifier | 0.02625 mm² |

[Figure: optical link spanning an optical layer and an electronics layer. Labeled components: off-chip laser, on-chip modulator, transmission medium, photodetector, TIA, buffer chain, limiting amplifier, driver for electronics.]
Broadcast Sub-Network: 24 mm² (optical), 51 mm² (electrical)
Conclusion & Future Work
• CC-NPA is both a low power & high bandwidth network for future cache coherent many-core processors
• CC-NPA combines the benefits of snoopy cache coherent protocols and nanophotonics
• CC-NPA provides scalable bandwidth using two sub-networks (broadcast and multicast)
• Future work will involve designing and optimizing the multicast sub-network