A Low-Area Interconnect Architecture for Chip Multiprocessors Zhiyi Yu and Bevan Baas VLSI...
-
Upload
diana-nash -
Category
Documents
-
view
223 -
download
0
description
Transcript of A Low-Area Interconnect Architecture for Chip Multiprocessors Zhiyi Yu and Bevan Baas VLSI...
A Low-Area Interconnect Architecture for Chip Multiprocessors
Zhiyi Yu and Bevan Baas
VLSI Computation LabECE Department, UC Davis
Outline
• Motivation– Why chip multiprocessors– The difference between chip-multiprocessor
interconnect and multiple-chip interconnect• Low-area asymmetric interconnect architecture• High performance multiple-link architecture
Why Chip Multiprocessors
• Chip Multiprocessor challenges– Traditional high performance techniques are less
practical (e.g., increased clock frequency)– Power dissipation is a key constraint– Global wires scale poorly in advanced technologies
• Solution: chip multiprocessors– Parallel computation for high performance– Reduce the clock freq. and voltage when full rate
computation is not needed for high energy efficiency– Constrain wires no longer than the size of one core
Interconnect in Chip Multiprocessors vs. Interconnect in Multiple Chips
• Chip multiprocessors have relatively limited area resources for interconnect circuitry– Try to reduce the area of buffer with little reduction of
the interconnect capability• Chip multiprocessors have relatively abundant
wire resources for inter-processor connection – Try to use more wire resources to increase the
interconnect capability
Outline
• Motivation• Low-area asymmetric interconnect
architecture– Traditional dynamic Network-On-Chip– Statically configured Network-On-Chip– Proposed asymmetric architecture
• High performance multiple-link architecture
Background: Traditional Network on Chip (NoC) using Dynamic Routing
• Advantage: flexible• Disadvantage: high area and power cost
buffer
buffe
r
buffe
r
buffer
north in
westin
south in
eastin
core
north out
south out
westout
eastout
Route& Switch
North
West
South
East
North
West
South
CoreEast
Background: Static Nearest Neighbor Interconnect Architecture
Route& Switch
East
North
West
South
Core
buffe
r
north in
westin
south in
eastin
core
north out
south out
westout
eastout
• Low area cost– One buffer per processor, not four
• High latency for long distance communication– Data passes through each intermediate processor
Asymmetric Data Traffic in Inter-processor Communication
• Data traffic of routers in a 9-processor JPEG encoder
• 80% of traffic goes to processor core, 20% passes by core
Network data words of input ports of router East North West South
Relative 9% 26% 22% 43%
Network data words of output ports of router Core East North West South
Relative 80% 8% 4% 8% 0%
Asymmetrically-Buffered Interconnect Architecture
Route& Switch
East
North
West
South
East
North
West
South
Core
Flip-flop
Flip
-flop
Flip
-flop
Flip-flop
north in
westin
south in
eastin
Core
north out
south out
westout
eastout
Buf
fer
Route& Switch
East
North
West
South
Core
Route& Switch
East
North
West
South
East
North
West
South
Core
• Large buffer only for the processing core• Support flexible direct switch/route interconnect
Asymmetric
Dynamic packet
Static
Outline
• Motivation• Low-area asymmetric interconnect architecture• High performance multiple-link
architecture
Design Option Exploration
• Choose static routing architecture – Much smaller area and
power required• Assign two buffers
(ports) for each processing core– Natural fit with 2-input,
1-output instructions (e.g., C=A+B)
Core
Buf
fer
Num. oflinks?
Dynamicor static ?
Num. ofbuffers?
Single Link vs. Multiple Links• Why consider multi-link
architectures?– Single link might be
inefficient in some cases– There are many link/wire
resources on chip• Fully connected multi-
link architectures have huge numbers of long wires– Alternative simplified
architectures are needed
in_W
out_E
to core
to switch
in_N
in_W
in_S
core
in2out2
to core
out1 in1
to switch
to switch
expensiveglobal wires
Type1: one link per edge
Fully connected, two links per edge
Architectures with Multiple Links
to core
neighborlink
to switch
in2_W
out2_E
to core
out1_E
in1_W
to switch
to switch
in1_Nin2_N
in2_N
type2
type3
type4
type5
type6
type7
neighborlink
to switch
in2_N
to core
to switch
in3_N
to switch
in2_N
to core
to switch
in3_N
to core
to switch
in4_N
neighborlink
to switch
in2_N
to core
to switch
in3_N in1_N
to switch
neighborlink
Two links per edge Three links per edge Four links per edge
One
link
ded
icat
edto
nei
ghbo
r lin
kA
ll lin
ks th
e sa
me
Area and Speed of Seven Proposed Network Architectures
• All seven architectures implemented in hardware
• Four-link architectures (types 6 and 7) require almost 25% more area
• Clock rates for all are within approx. 2%
Comparing Routing Latency using Communication Models
• All architectures have the same routing latency for the one→one, one→all, and all→all communication
• Different types of architectures behave differently for the all-one communication
Routing latencies for all→one communication
with n x n processors
Type1
Type2/3
Type4/5
Example all→one linear communication for six
processors
Summary• Asymmetric buffer allocation topology uses 1 or 2
large buffers instead of 4 large buffers to support long distance communication– Provides 2 to 4 times area savings with little
performance reduction• Using 2 or 3 links per edge achieves good
area/performance tradeoffs for chip multiprocessors containing simple single-issue processors– Provides 2 to 3 times higher all→one communication
capacity with 5%-10% increased area
Acknowledgements• Funding
– Intel Corporation– UC Micro – NSF Grant No. 0430090 – CAREER award 0546907 – SRC GRC Grant 1598 – IntellaSys Corporation– S Machines
• Special thanks – Members of the VCL, R. Krishnamurthy, M. Anders,
and S. Mathew– ST Microelectronics and Artisan