The Routability of Multiprocessor Network Topologies in FPGAs
Manuel Saldaña, Lesley Shannon and Paul Chow
Dept. of Electrical and Computer Engineering
University of Toronto
[Figure: Mapping Topologies. Five network topologies (16-node hypercube, 8-node fully-connected, 8-node ring, 32-node mesh, 8-node star), each node labelled N, mapped onto a generic FPGA fabric of logic cells and routing channels.]
[Figure: Block diagram of a node. Labels: µP, RAM, RAM IF ctrl (two instances), Peripheral Bus, UART, Network Interface, RX_FIFO, TX_FIFO, Link.]
Conclusions
Motivation
Paper to appear at System Level Interconnect Prediction [SLIP2006] in March 2006.
Cost Metric
We derived a cost metric (CM) to compare and understand the implementation costs of bigger systems. Its terms are:
D: diameter of the topology
Channel_width: number of bits per channel (64 bits)
Bisection_width: number of links that need to be cut to have two equal sets of nodes
ftarget: required frequency
freq: actual frequency achieved
K: performance factor (1.05)
[Figure: Cost (0 to 25) versus number of nodes (8, 16, 32) for the ring, star, mesh, hypercube and fully-connected topologies.]
The difference in resource utilization between the ring, star, mesh, and hypercube topologies is not significant up to 32 nodes (11% of the total number of nets in the system).
Systems up to at least 16 nodes do not require a sophisticated Network-on-Chip. Just use point-to-point links.
A fully-connected topology can be implemented with at least 16 nodes, but 32 nodes exceed the routing resources on the FPGA.
FPGAs have a fixed set of resources that exist whether they are used or not. Do not limit the connectivity if there are resources available.
As expected, the hypercube and mesh should continue to scale to larger systems. The ring starts to look worse because of a low bisection bandwidth and increase in latency.
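The scaling argument above can be illustrated with the textbook closed forms for diameter and bisection width. This is a minimal sketch, assuming bidirectional links and the standard definitions; the function names are hypothetical and the numbers are not measurements from the poster.

```python
def ring_metrics(nodes):
    """Diameter and bisection width of a bidirectional ring."""
    # The farthest node is halfway around; any balanced cut severs 2 links.
    return {"diameter": nodes // 2, "bisection_width": 2}


def hypercube_metrics(nodes):
    """Diameter and bisection width of a hypercube (power-of-two size)."""
    dim = nodes.bit_length() - 1
    if nodes < 2 or (1 << dim) != nodes:
        raise ValueError("hypercube size must be a power of two")
    # One hop per dimension; cutting one dimension severs nodes/2 links.
    return {"diameter": dim, "bisection_width": nodes // 2}


for n in (8, 16, 32, 64):
    print(n, "ring", ring_metrics(n), "hypercube", hypercube_metrics(n))
```

The ring's diameter grows linearly with the number of nodes while its bisection width stays at 2, whereas the hypercube's diameter grows only logarithmically and its bisection width grows with the system.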
Area and Placement
We simulate resource depletion by restricting the available space to place and route a design.
The MicroBlaze is a Relationally Placed Macro (RPM). We found that RPMs limit the placement on the FPGA, especially if the FPGA fabric is narrow, as in the Virtex 4.
The placement of MicroBlazes does not change considerably across topologies with the same number of nodes.
XC4VLX25 and XC2V2000 both have the same number of slices, but the aspect ratio is different, which affects the packing efficiency of the RPMs.
[Figure: Scaled images of different FPGA sizes (XC4VLX25 and XC4VLX40, Virtex 4; XC2V2000, Virtex 2), showing the restricted area. Labels: Only the Master node, Link 1, Link 2, Link 3.]
Routing Resources
With respect to the ring topology, the fully-connected topology has 15% more routing overhead with 8 nodes and 49% with 32 nodes.
Network_Nets = Link Nets + Network Interface Nets
For the ring, star, mesh and hypercube topologies, the maximum difference in routing overhead is 3.7% with 8 nodes, and 11% with 32 nodes.
For the 32-node fully-connected system, the Network_Nets account for 54% of the total nets in the system.
The graph shows the Routing Overhead, which is the percentage of Network_Nets with respect to the total number of nets in the entire design.
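The Routing Overhead definition can be written out as a short calculation. This is a sketch only: the function names are hypothetical, and the net counts in the example are invented placeholders, not values from the poster.

```python
def network_nets(link_nets, network_interface_nets):
    """Network_Nets = Link Nets + Network Interface Nets."""
    return link_nets + network_interface_nets


def routing_overhead_percent(link_nets, network_interface_nets, total_nets):
    """Percentage of Network_Nets relative to all nets in the design."""
    if total_nets <= 0:
        raise ValueError("total_nets must be positive")
    return 100.0 * network_nets(link_nets, network_interface_nets) / total_nets


# Placeholder counts chosen only to show the arithmetic.
print(routing_overhead_percent(1200, 800, 20000))
```

In practice the three net counts would come from the implementation tool reports for each placed and routed system.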
[Figure: Routing Overhead (% of nets in the system, 10 to 100) versus number of nodes (8, 16, 32) for the ring, star, mesh, hypercube and fully-connected topologies.]
To understand how different well-known topologies map to a generic FPGA structure with a fixed set of routing and logic resources.
FPGA fabrics have evolved over time to suit the requirements of large, localized digital circuits. Given a network of such circuits, what might limit the implementation of Networks-on-Chip (NoC) on FPGAs?
Maximum Frequency
The target frequency is 180 MHz, which is the maximum frequency of the MicroBlaze soft-processor.
We used the Xilinx Xplorer utility to improve the performance of the designs.
Techniques used to improve performance:
- Global optimization
- Register duplication
- Timing-driven packing and placement
- Different cost tables
The drop in frequency for the fully-connected topology is a sign of congestion, but it still routes.
Logic Resources
The increase in LUTs is roughly linear as the number of nodes increases, except for the fully-connected topology, whose LUT count grows much faster because its number of links grows quadratically, as N(N−1)/2 for N nodes.
We measure the total number of LUTs per system, including soft-processors, memory interfaces, FIFOs, and Network Interfaces.
The difference between the ring, star, mesh and hypercube topologies is 11% with 32 nodes and 5% with 8 nodes.
[Figure: Total LUTs (0 to 90,000) versus number of nodes (8, 16, 32) for the ring, star, mesh, hypercube and fully-connected topologies.]
Manually interconnecting all the processing elements is error-prone, so we need an automated flow.
Design Flow
[Figure: Design flow]
Input parameters (number of links, number of nodes, topology)
  -> Topology Generator (Perl script)
  -> Graph Description File
  -> System Generator (C program)
  -> Design files (.mhs, .mss, .xmp, .ucf)
  -> Xilinx Implementation Flow (Synthesis, MAP, PAR)
  -> Synthesized, placed and routed multiprocessor systems (.ncd files)
We can experiment with different topologies by simply changing the number of nodes, the number of links, and the topology parameters.
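To make the parameter-driven idea concrete, here is a minimal sketch of a topology generator. It is written in Python for illustration and is not the authors' Perl script. The function names, the text format produced by `describe`, and the choice of a 2-D mesh without wrap-around links are all assumptions.

```python
from itertools import combinations
from math import isqrt


def ring(n):
    """Each node links to its successor, wrapping around at the end."""
    return {tuple(sorted((i, (i + 1) % n))) for i in range(n)}


def star(n):
    """Node 0 is the hub; every other node links only to it."""
    return {(0, i) for i in range(1, n)}


def mesh(n):
    """2-D mesh without wrap-around, using the most square rows x cols grid."""
    rows = max(r for r in range(1, isqrt(n) + 1) if n % r == 0)
    cols = n // rows
    edges = set()
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            if c + 1 < cols:
                edges.add((node, node + 1))      # link to right neighbour
            if r + 1 < rows:
                edges.add((node, node + cols))   # link to neighbour below
    return edges


def hypercube(n):
    """Nodes whose binary labels differ in exactly one bit are linked."""
    dim = n.bit_length() - 1
    if n < 1 or (1 << dim) != n:
        raise ValueError("hypercube needs a power-of-two node count")
    return {(i, i | (1 << b)) for i in range(n) for b in range(dim)
            if not i & (1 << b)}


def fully_connected(n):
    """Every pair of nodes is linked."""
    return set(combinations(range(n), 2))


GENERATORS = {"ring": ring, "star": star, "mesh": mesh,
              "hypercube": hypercube, "fully-connected": fully_connected}


def describe(topology, nodes):
    """Emit a hypothetical plain-text graph description, one link per line."""
    edges = sorted(GENERATORS[topology](nodes))
    lines = [f"topology {topology}", f"nodes {nodes}", f"links {len(edges)}"]
    lines += [f"link {u} {v}" for u, v in edges]
    return "\n".join(lines)


for name, generate in GENERATORS.items():
    print(name, [len(generate(n)) for n in (8, 16, 32)])
```

Printing the link counts for 8, 16 and 32 nodes shows how quickly the fully-connected topology outgrows the others as the system scales.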
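\[ \text{Routing/Node} = \frac{\text{Routing Utiliz.} \times \text{Routing Ovrhd (\%)}}{\text{Nodes}} \]

\[ \text{LUTs/Node} = \frac{\text{Logic Utiliz.} \times \text{Logic Ovrhd (\%)}}{\text{Nodes}} \]

\[ CM = \frac{D \times \text{Routing/Node} \times K^{(f_{target} - freq)} \times \text{LUTs/Node}}{\text{Bisection\_width} \times \text{Channel\_width} \times freq} \]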
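The cost metric can be evaluated with a few lines of code. The sketch below assumes that K is raised to the power (ftarget − freq), that frequencies are in MHz, and that the overhead terms are percentages; the function and parameter names are hypothetical, and the poster does not state the units of the utilization terms.

```python
def per_node(utilization, overhead_percent, nodes):
    """Routing/Node or LUTs/Node: utilization x overhead (%) / nodes."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return utilization * overhead_percent / nodes


def cost_metric(diameter, routing_per_node, luts_per_node, bisection_width,
                freq_mhz, f_target_mhz=180.0, k=1.05, channel_width=64):
    """CM under the assumed reading K ** (ftarget - freq)."""
    # The penalty is 1.0 when the target frequency is met and grows
    # as the achieved frequency falls short of it.
    penalty = k ** (f_target_mhz - freq_mhz)
    numerator = diameter * routing_per_node * penalty * luts_per_node
    return numerator / (bisection_width * channel_width * freq_mhz)


# Placeholder inputs chosen only to show the effect of a missed target.
at_target = cost_metric(4, 2.0, 3.0, 8, freq_mhz=180.0)
below_target = cost_metric(4, 2.0, 3.0, 8, freq_mhz=126.0)
print(at_target, below_target)
```

With every other term held fixed, a design that misses the target frequency is penalized twice: through the K factor in the numerator and through the lower freq in the denominator.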
Topology    Nodes  Max Freq. (MHz)  Speed Grade  Best Run  Total Runs  Map Options, PAR Options
fully con.  8      170              12           2         6           -timing -ol high -xe n -ol high
ring        16     180              12           1         1           -retiming on -ol high -timing -xe n -global opt on -ol high
star        16     180              12           4         4           -timing -ol high -xe n register duplication -t 9 -ol high
mesh        16     180              12           2         2           -timing -ol high -xe n -ol high
hypercube   16     180              12           2         2           -timing -ol high -xe n -ol high
fully con.  16     126              12           5         6           -ol high -ol high -t 9 -xe n
ring        32     123              11           5         6           -ol high -ol high -t 9 -xe n