Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks
![Page 1: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/1.jpg)
Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks
Aniruddha N. Udipi
with Naveen Muralimanohar*, Rajeev Balasubramonian
University of Utah and *HP Labs
![Page 2: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/2.jpg)
University of Utah 2
Motivation - I
• Future CMPs are likely to be power-limited
  – On-chip networks consume 20-36% of total chip power
  – Network power dominated by routers
• Chip design and verification costs are tremendous
  – Directory-based protocols are complicated and have the inherent problem of indirection
  – Snooping-based protocols are well understood and simple to design
• Metal and wiring are cheap and plentiful
• We are no longer pin limited for the interconnection network
![Page 3: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/3.jpg)
Motivation - II
• Future of multi-core computing likely to diverge into two separate tracks
  – Mid-range multicore machines for home/office
    • 16-64 cores
  – Many-core machines for scientific/server applications
    • 1000s of cores
• Even machines with large core counts are likely to be virtualized, with communication localized to small chunks of approx. 64 cores
• Design energy-efficient networks for moderate core-counts
[Figure label: VM]
![Page 4: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/4.jpg)
Executive Summary
• Elimination of routers leads us back to bus-based networks
• Dramatic reduction in energy consumption, little or no loss in performance, reduction in design complexity
• Enhancing the life of buses for moderately sized CMPs
  – Filtered segmented bus, low-swing wiring, address interleaved buses, page coloring
![Page 5: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/5.jpg)
Outline
• Overview
• Proposal I - Filtered Segmented Bus
• Proposal II - Low-swing Wiring
• Proposal III - Address Interleaved Buses
• Proposal IV - Page Coloring
• Evaluation
• Conclusion
![Page 6: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/6.jpg)
Baseline Chip and Interconnect Organization
• Simple mesh used for illustration here, other options discussed in the paper
• Static-NUCA shared L2, each line has a “home” slice based on its address
[Figure labels: Core, L1, L2, Router]
![Page 7: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/7.jpg)
Where does energy go in the network?
• Router: 1.39e-10 J/access
• Link: 1.56e-11 J/access
• Gap marked on the chart: 8X
• Energy estimates based on CACTI 6.0 and Orion 2.0
![Page 8: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/8.jpg)
![Page 9: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/9.jpg)
What is the solution?
• We are left with.. a bus!
• Could we really just use a bus?
• Not really
  – Too many links activated on every transaction
  – Energy gained by eliminating routers lost by activating more links
  – Poor performance due to increased arbitration times and network contention
![Page 10: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/10.jpg)
We can do better..
Useless snoop: Particular cache line not present in any other core
![Page 11: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/11.jpg)
Filtered Bus
• Segment and filter snoop transactions at intermediate points
• Two types of filters
  – Out-filter
  – In-filter
• Reduces number of links activated
• Allows for safe parallelism (serialization happens at the central bus if required)
[Figure legend: Bus link, Filter]
![Page 12: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/12.jpg)
Filters
• Each “filter” depicted in the figure is a combination of an “Out-filter” and an “In-filter”
• Each of these is a Counting Bloom Filter
  – 2 arrays of 10-bit entries
  – Subsets of the address bits hashed into each of these arrays, incremented to add entries, decremented to remove entries
  – To test for membership, simply check if entries in both arrays are non-zero
  – Compact representation, false positives possible
[Figure legend: Bus link, In + Out Filter]
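The counting Bloom filter described on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the array size (`index_bits`), the choice of which address bits feed each array, and the method names are hypothetical, and counter width is not modeled. The slides only say that two arrays are indexed by subsets of the address bits.

```python
class CountingBloomFilter:
    """Two counter arrays, each indexed by a different subset of address bits."""

    def __init__(self, index_bits=10):
        # index_bits is an assumed parameter, not a value from the slides.
        self.index_bits = index_bits
        self.size = 1 << index_bits
        self.arrays = [[0] * self.size, [0] * self.size]

    def _indices(self, line_addr):
        mask = self.size - 1
        # Hypothetical hash: low bits index array 0, the next bits index array 1.
        return (line_addr & mask, (line_addr >> self.index_bits) & mask)

    def add(self, line_addr):
        # Increment one counter in each array to add an entry.
        for array, i in zip(self.arrays, self._indices(line_addr)):
            array[i] += 1

    def remove(self, line_addr):
        # Decrement the same counters to remove an entry.
        for array, i in zip(self.arrays, self._indices(line_addr)):
            if array[i] > 0:
                array[i] -= 1

    def may_contain(self, line_addr):
        # Member only if the entries in both arrays are non-zero.
        return all(array[i] > 0
                   for array, i in zip(self.arrays, self._indices(line_addr)))


f = CountingBloomFilter()
a = (3 << 10) | 5   # indexes (5, 3)
b = (9 << 10) | 7   # indexes (7, 9)
c = (9 << 10) | 5   # indexes (5, 9): never added, but collides with a and b
f.add(a)
f.add(b)
```

After the two `add` calls, `f.may_contain(c)` is true even though `c` was never inserted. That is the false positive the slide mentions. A line that was added is never reported absent, which is what makes the filter safe for snoop filtering.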
![Page 13: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/13.jpg)
Out-filter - Case 1
• Bloom filter in every segment keeps track of a superset of lines that call that segment “home” and have been sent “out” of that segment
• If a line has never left a segment, none of its transactions need to be seen outside
Energy Saved
• Completely localized transaction
• Only home segment activated
[Figure legend: Bus link, Activated bus, In-Filter, Out-Filter, Activated filter, Home Segment; R = Requested Address]
![Page 14: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/14.jpg)
Out-filter - Case 2
• If the line is being requested from outside its home segment, transaction has to go out on the central bus
• The out-filter of the home segment is updated appropriately
• The in-filter then takes over
[Figure legend: Bus link, Activated bus, In-Filter, Out-Filter, Activated filter, Home Segment, Update; R = Requested Address]
![Page 15: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/15.jpg)
In-filter
• Bloom filters keep track of a superset of lines currently present in the segment
• Only broadcast within the local segment if required
Energy Saved
[Figure legend: Bus link, Activated bus, In-Filter, Out-Filter, Activated filter; R = Requested Address]
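The out-filter and in-filter cases can be read together as one routing decision. The sketch below is a rough model, not the authors' implementation. It uses exact Python sets in place of Bloom filters, so it has no false positives. It assumes the home segment always sees a request that arrives from outside, and it assumes the requester's in-filter is updated once the line is fetched. The function name and the `"central"` label are hypothetical.

```python
def route_request(addr, requester, home, out_filters, in_filters):
    """Return the set of bus segments (plus "central") that see this request."""
    activated = {requester}
    if requester == home and addr not in out_filters[home]:
        # Out-filter case 1: the line never left its home segment.
        in_filters[requester].add(addr)
        return activated
    activated.add("central")
    if requester != home:
        out_filters[home].add(addr)  # Out-filter case 2: line leaves home
        activated.add(home)          # assumption: home always sees the request
    for seg, present in in_filters.items():
        if seg != requester and addr in present:
            activated.add(seg)       # in-filter admits the broadcast
    in_filters[requester].add(addr)  # assumption: requester now holds the line
    return activated


out_filters = {s: set() for s in range(4)}
in_filters = {s: set() for s in range(4)}
r1 = route_request(0x40, 0, 0, out_filters, in_filters)  # local to home
r2 = route_request(0x40, 1, 0, out_filters, in_filters)  # first remote request
r3 = route_request(0x40, 2, 0, out_filters, in_filters)  # second remote request
```

In this run, `r1` activates only segment 0, `r2` adds the central bus and the home segment, and `r3` also wakes segment 1 because its in-filter now records the line. Segment 3 is never activated.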
![Page 16: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/16.jpg)
Arbitration
• Global arbitration delay is non-trivial for a single bus connecting even 16 cores
• Multi-step arbitration, as required
• On every request
  – arbitrate for local bus and broadcast
  – if filter indicates that the transaction is complete, “validate” broadcast via wired-OR
  – if not, arbitrate for central bus and hold broadcast in a single-entry buffer until the central bus is available
  – at the remote sub-buses, priority is given to requests originating from the central bus
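The multi-step arbitration above can be traced with a small sketch. It models only the ordering of steps and the priority rule at remote sub-buses. Timing, the wired-OR signal, and the buffer itself are not simulated, and the function names are hypothetical.

```python
from collections import deque


def local_request_steps(filter_says_complete):
    """Ordered steps for one request issued on a local sub-bus."""
    steps = ["arbitrate for local bus", "broadcast on local bus"]
    if filter_says_complete:
        steps.append("validate broadcast via wired-OR")
    else:
        steps.append("arbitrate for central bus")
        steps.append("hold broadcast in single-entry buffer")
        steps.append("broadcast on central bus")
    return steps


def next_on_remote_sub_bus(from_central, from_local):
    """Pick the next request on a remote sub-bus: central-bus traffic wins."""
    if from_central:
        return from_central.popleft()
    if from_local:
        return from_local.popleft()
    return None


central_queue = deque(["C1"])
local_queue = deque(["L1", "L2"])
order = [next_on_remote_sub_bus(central_queue, local_queue) for _ in range(3)]
```

With one central-bus request and two local requests pending, the central request is served first and the local ones follow in order.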
![Page 17: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/17.jpg)
![Page 18: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/18.jpg)
Low-swing Wiring
• Differential low-swing wiring up to 10X more energy efficient than regular wiring
• These have less impact on packet-switched networks since routers are the bottleneck anyway
  – Amdahl’s law!
• Slightly increased latency, more metal requirement
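The Amdahl's law point can be checked with the per-access energies from the earlier energy slide. The sketch makes several assumptions: the larger figure (1.39e-10 J) is taken as the router energy and the smaller (1.56e-11 J) as the link energy, each hop in a packet-switched network costs one router plus one link access, the "up to 10X" gain is taken at its upper bound, and filter and arbitration energy on the bus are ignored.

```python
ROUTER_J = 1.39e-10   # per access, assumed to be the router figure
LINK_J = 1.56e-11     # per access, assumed to be the link figure
LOW_SWING_GAIN = 10   # "up to 10X", taken at its upper bound


def hop_energy(router_j, link_j):
    """Energy of one hop: one router traversal plus one link traversal."""
    return router_j + link_j


# Packet-switched hop: only the link part shrinks with low-swing wiring.
packet_before = hop_energy(ROUTER_J, LINK_J)
packet_after = hop_energy(ROUTER_J, LINK_J / LOW_SWING_GAIN)
packet_saving = 1 - packet_after / packet_before

# Bus link with no router: the whole per-link energy shrinks.
bus_saving = 1 - (LINK_J / LOW_SWING_GAIN) / LINK_J

print(f"packet-switched hop saving: {packet_saving:.1%}")
print(f"bus link saving: {bus_saving:.1%}")
```

Under these assumptions a packet-switched hop saves only about 9% of its energy, while a router-less bus link saves 90%.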
![Page 19: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/19.jpg)
![Page 20: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/20.jpg)
Address Interleaved Buses
• As core counts increase, increased pressure on the bus due to contention
• At 64 cores, even though bus-based networks continue to be highly energy efficient, performance begins to dip
• To shore up performance, increase the number of buses
  – different buses handle mutually exclusive addresses
  – increased metal requirement
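One simple way to give each bus a mutually exclusive set of addresses is to interleave on the cache-line address. The sketch below assumes 64-byte lines and a modulo mapping. Neither detail is stated on the slides, and `bus_for_address` is a hypothetical name.

```python
LINE_BYTES = 64  # assumed cache-line size; not stated on the slides


def bus_for_address(addr, num_buses):
    """Map a byte address to one bus so each line belongs to exactly one bus."""
    line = addr // LINE_BYTES
    return line % num_buses


# With two buses, consecutive cache lines alternate between bus 0 and bus 1.
mapping = [bus_for_address(a, 2) for a in (0x00, 0x40, 0x80, 0xC0)]
```

Every byte of a line maps to the same bus, so all snoops for that line are serialized on one bus while unrelated lines proceed in parallel on the other.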
![Page 21: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/21.jpg)
![Page 22: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/22.jpg)
Page Coloring
• OS-assisted page-coloring for L2 cache
• We use a simple first-touch approach
• Improved locality helps any network, but is especially well-suited for our network because
  – More flexibility in page placement
  – Less negative impact by sub-optimal page placement
  – Improves filter behavior
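A first-touch policy can be sketched as a table from page number to home segment. This is an illustration under assumptions: pages are colored at the granularity of a bus segment rather than a single L2 slice, cores are numbered so that consecutive cores share a segment, and the class and method names are hypothetical. The 4KB page size comes from the methodology slide.

```python
PAGE_BYTES = 4096  # 4KB pages, as in the methodology slide


class FirstTouchColoring:
    """Assign each page to the segment of the first core that touches it."""

    def __init__(self, cores_per_segment):
        self.cores_per_segment = cores_per_segment
        self.page_color = {}  # page number -> home segment

    def home_segment(self, vaddr, core):
        page = vaddr // PAGE_BYTES
        if page not in self.page_color:
            # First touch: color the page with the toucher's own segment.
            self.page_color[page] = core // self.cores_per_segment
        return self.page_color[page]


coloring = FirstTouchColoring(cores_per_segment=4)
first = coloring.home_segment(0x1000, core=5)    # core 5 sits in segment 1
later = coloring.home_segment(0x1FFF, core=12)   # same page, color unchanged
other = coloring.home_segment(0x2000, core=12)   # new page, segment 3
```

Pages touched by only one segment keep all their traffic local, which is the case where the out-filter saves the most energy.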
![Page 23: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/23.jpg)
![Page 24: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/24.jpg)
Methodology
• Virtutech SIMICS full-system simulator
  – “g-cache” significantly modified to add network models
• CACTI 6.0 and Orion 2.0 for router/link energy computation
• 16 cores for most experiments, sensitivity analysis for 32- and 64-core systems
• 32nm process, 3GHz clock
• 32K D-L1, 16K I-L1, 2MB/slice shared L2
• 200 cycle main memory latency
• 4KB page size
• PARSEC, NAS, SPLASH-2 benchmark suites, run for entire Region-Of-Interest/parallel section
• Baseline routers - 4 VCs, 8 buffers/VC
![Page 25: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/25.jpg)
Energy Consumption – Address Network
[Figure annotations]
• Ring: 20x
• Grid: 27x
• Fbfly: 31x
![Page 26: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/26.jpg)
Energy Consumption – Data Network
[Figure annotations]
• Ring: 2x
• Grid: 2.5x
• Fbfly: 3x
![Page 27: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/27.jpg)
How does energy consumption reduce?
• Router : Link energy ratio is high enough to significantly impact energy characteristics
• Efficient bloom filters, at 16KB/filter
– Out-filters are 85% accurate (note that there are only false positives, no false negatives)
– In-filters are 90% accurate
![Page 28: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/28.jpg)
Effect of Page Coloring
• More locality
• Better filtering
  – Out-filter accuracy increases from 85% to 97%
![Page 29: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/29.jpg)
System Performance
[Figure annotations]
• Ring: 7%
• Grid: 3%
• Fbfly: 1%
![Page 30: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/30.jpg)
How does performance improve?
• Two basic reasons
  – Inherent indirection in directory-based protocols
  – Deep pipelines in routers increasing the no-load latency
• Avg. latency in bus-based network is 16.4 cycles
  – Arbitration (3.7 cyc) + Contention (1 cyc) + Bloom filter (1.2 cyc) + Link latency (10.5 cyc)
• Even in the most connected FBFLY, average of 1.5 hops per message, bare minimum two messages per transaction: 3 hops, 15 cycles without contention
  – Link (6 cyc) + Router (9 cyc)
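The latency figures on this slide add up as follows. The sketch only restates the slide's numbers. The dictionary keys are labels chosen here, and the per-hop split at the end is a derived average, not a value from the slides.

```python
bus_cycles = {
    "arbitration": 3.7,
    "contention": 1.0,
    "bloom_filter": 1.2,
    "link": 10.5,
}
fbfly_cycles = {"link": 6.0, "router": 9.0}

bus_total = sum(bus_cycles.values())      # average, includes contention
fbfly_total = sum(fbfly_cycles.values())  # bare minimum, no contention

hops = 1.5 * 2                            # 1.5 hops/message, 2 messages
fbfly_per_hop = fbfly_total / hops        # derived average per hop

print(f"bus: {bus_total:.1f} cycles, fbfly minimum: {fbfly_total:.1f} cycles")
```

The bus figure is an average that already includes contention, while the flattened-butterfly figure is a contention-free lower bound for a two-message transaction.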
![Page 31: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/31.jpg)
Scaling – 32 Cores – Energy
Average energy reduction of 19X in address network, 3X in data network
![Page 32: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/32.jpg)
32 Cores – Performance
Average 5% drop in performance
![Page 33: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/33.jpg)
Scaling - 64 Cores – Energy
Average reduction of 13X in address network, 2.5X in data network
![Page 34: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/34.jpg)
64 Cores - Performance
Average 39% increase in execution time compared to fbfly, only 12% increase with just two interleaved buses
![Page 35: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/35.jpg)
Router Optimizations
• For packet-switched networks to be as energy efficient as bus-based networks, Router : Link energy ratio should be less than
  – 3.5X at 16 cores
  – 4.5X at 32 cores
  – 7X at 64 cores
• Current energy ratio is approx. 70X
![Page 36: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/36.jpg)
![Page 37: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/37.jpg)
Related Work
• Packet Switched Networks
  – Dally/Towles (DAC ’01), Kim et al. (MICRO ’07), Grot et al. (HPCA ’09), TRIPS, TILERA
• Hierarchical Networks
  – Muralimanohar et al. (ISCA ’07), Das et al. (HPCA ’09)
• Snoop Filtering
  – Moshovos et al. (HPCA ’01), Strauss et al. (ISCA ’06), Salapura et al. (HPCA ’08)
• Bus applications in CMPs
  – Manevich et al. (NOCS ’09)
![Page 38: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/38.jpg)
Key Contributions
• For moderate core counts, buses just work!
  – Dramatic energy reduction
  – Little or no loss in performance
  – Simple snooping protocols, reduction in design complexity
• Low-swing wiring
• Multiple Address Interleaved buses
• OS-assisted page coloring
• Potential for router optimization
![Page 39: Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks](https://reader036.fdocuments.in/reader036/viewer/2022062222/56816865550346895ddec15b/html5/thumbnails/39.jpg)
Thank you..
• Questions?