The Routability of Multiprocessor Network Topologies in FPGAs

Manuel Saldaña, Lesley Shannon and Paul Chow

Dept. of Electrical and Computer Engineering, University of Toronto

[Figure: Mapping Topologies. Nodes (N) of an 8-node ring, an 8-node star, a 32-node mesh, a 16-node hypercube and an 8-node fully-connected topology are mapped onto a generic FPGA fabric made of logic cells and routing channels.]

[Figure: Node block diagram. Labels: µP, RAM, RAM IF ctrl (twice), Peripheral Bus, UART, Network Interface with RX_FIFO and TX_FIFO, Link.]

Paper to appear in System Level Interconnection Prediction [SLIP2006] in March 2006.

Cost Metric

We derived a cost metric (CM) to compare and understand the implementation costs of bigger systems. The symbols used in CM are:

D: diameter of the topology
Channel_width: number of bits per channel (64 bits)
Bisection_width: number of links that must be cut to split the nodes into two equal sets
ftarget: required frequency
freq: actual frequency achieved
K: performance factor (1.05)

[Figure: Cost (axis from 0.00 to 25.00) versus number of nodes (8, 16, 32) for the ring, star, mesh, hypercube and fully-connected topologies.]

Conclusions

The difference in resource utilization between the ring, star, mesh and hypercube topologies is not significant up to 32 nodes (at most 11% of the total number of nets in the system).

Systems of up to at least 16 nodes do not require a sophisticated Network-on-Chip; point-to-point links are enough.

A fully-connected topology can be implemented with at least 16 nodes, but with 32 nodes it exceeds the routing resources of the FPGA.

FPGAs have a fixed set of resources that exist whether they are used or not. Do not limit the connectivity if there are resources available.

As expected, the hypercube and mesh should continue to scale to larger systems. The ring starts to look worse because of its low bisection bandwidth and its increase in latency.
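The scaling argument about the ring and the hypercube can be made concrete with the textbook closed forms for diameter and bisection width. The sketch below is not taken from the poster's tooling: the function names are illustrative, and it assumes an even number of nodes (a power of two for the hypercube).

```python
def ring_metrics(n):
    """Diameter and bisection width of an n-node bidirectional ring (n even)."""
    # The farthest node is halfway around; any split into halves cuts 2 links.
    return {"diameter": n // 2, "bisection_width": 2}


def hypercube_metrics(n):
    """Diameter and bisection width of an n-node hypercube (n a power of two)."""
    dims = n.bit_length() - 1
    if n < 2 or (1 << dims) != n:
        raise ValueError("hypercube size must be a power of two")
    # One hop per differing address bit; cutting one dimension severs n/2 links.
    return {"diameter": dims, "bisection_width": n // 2}


def fully_connected_metrics(n):
    """Every node is one hop away; halving the nodes cuts (n/2)^2 links."""
    return {"diameter": 1, "bisection_width": (n // 2) ** 2}


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, ring_metrics(n), hypercube_metrics(n), fully_connected_metrics(n))
```

At 32 nodes the ring has a diameter of 16 and a bisection width of 2, while the hypercube has a diameter of 5 and a bisection width of 16, which is the trend the conclusion describes.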

Area and Placement

We simulate resource depletion by restricting the available space to place and route a design.

The MicroBlaze is a Relationally Placed Macro (RPM). We found that RPMs limit the placement on the FPGA, especially if the FPGA fabric is narrow, as in the Virtex-4.

The placement of the MicroBlazes does not change considerably across topologies with the same number of nodes.

The XC4VLX25 and the XC2V2000 have the same number of slices but different aspect ratios, which affects the packing efficiency of the RPMs.

[Figure: Scaled images of different FPGA sizes (XC4VLX25 and XC4VLX40, Virtex-4; XC2V2000, Virtex-II), showing the restricted area. Further labels: Only the Master node; Link 1, Link 2, Link 3.]

Routing Resources

The graph shows the Routing Overhead, which is the percentage of Network_Nets with respect to the total number of nets in the entire design, where

Network_Nets = Link Nets + Network Interface Nets

For the ring, star, mesh and hypercube topologies, the maximum difference in routing overhead is 3.7% with 8 nodes and 11% with 32 nodes.

With respect to the ring topology, the fully-connected topology has 15% more routing overhead with 8 nodes and 49% more with 32 nodes.

For the 32-node fully-connected system, the Network_Nets account for 54% of the total nets in the system.
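The Routing Overhead definition reduces to one ratio. The following is a minimal sketch of that arithmetic; the function name and the net counts in the example are made up for illustration and are not measurements from the poster.

```python
def routing_overhead(link_nets, network_interface_nets, total_nets):
    """Return Network_Nets as a percentage of all nets in the design."""
    if total_nets <= 0:
        raise ValueError("total_nets must be positive")
    # Network_Nets = Link Nets + Network Interface Nets
    network_nets = link_nets + network_interface_nets
    return 100.0 * network_nets / total_nets


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formula.
    print(routing_overhead(link_nets=3000, network_interface_nets=2400, total_nets=10000))
```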

[Figure: Routing overhead (% of nets in the system, axis from 10 to 100) versus number of nodes (8, 16, 32) for the ring, star, mesh, hypercube and fully-connected topologies.]

Motivation

The goal is to understand how different well-known topologies map to a generic FPGA structure with a fixed set of routing and logic resources.

FPGA fabrics have evolved over time to suit the requirements of large, localized digital circuits. Given a network of such circuits, what might limit the implementation of Networks-on-Chip (NoCs) on FPGAs?

Maximum Frequency

The target frequency is 180 MHz, which is the maximum frequency of the MicroBlaze soft processor.

We used the Xilinx Xplorer utility to improve the performance of the designs. Techniques used to improve performance:

- Global optimization
- Register duplication
- Timing-driven packing and placement
- Different cost tables

The drop in frequency for the fully-connected topology is a sign of congestion, but the design still routes.

Logic Resources

The number of LUTs increases roughly linearly with the number of nodes, except for the fully-connected topology, where it grows much faster because the number of links grows quadratically with the number of nodes.

We measure the total number of LUTs per system, including soft processors, memory interfaces, FIFOs and network interfaces.

The difference between the ring, star, mesh and hypercube topologies is 11% with 32 nodes and 5% with 8 nodes.
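As a rough check on these growth rates, the number of point-to-point links in each topology can be written in closed form. These are standard graph counts rather than figures from the measurements; each bidirectional link is counted once, and r and c (mesh rows and columns) are symbols introduced here for illustration.

```latex
\begin{aligned}
L_{\text{ring}}(n) &= n, \\
L_{\text{star}}(n) &= n - 1, \\
L_{\text{mesh}}(r, c) &= r\,(c - 1) + c\,(r - 1), \qquad n = r\,c, \\
L_{\text{hypercube}}(n) &= \tfrac{n}{2}\,\log_2 n, \\
L_{\text{fully-connected}}(n) &= \tfrac{n\,(n - 1)}{2}.
\end{aligned}
```

For n = 32 this gives 32 links for the ring, 31 for the star, 80 for the hypercube and 496 for the fully-connected topology, so only the fully-connected case grows quadratically with the number of nodes.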

[Figure: Total number of LUTs (axis from 0 to 90000) versus number of nodes (8, 16, 32) for the ring, star, mesh, hypercube and fully-connected topologies.]

Design Flow

Manually interconnecting all the processing elements is error-prone, so we need an automated flow.

[Figure: Design flow. Input parameters (number of nodes, number of links, topology) feed the Topology Generator (Perl script), which produces a Graph Description File. The System Generator (C program) turns it into design files (.mhs, .mss, .xmp, .ucf). The Xilinx Implementation Flow (Synthesis, MAP, PAR) then produces the synthesized, placed and routed multiprocessor systems (.ncd files).]

We can experiment with different topologies by simply changing the number of nodes, the number of links and the topology parameters.
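To illustrate what a topology generator of this kind might do, here is a minimal Python stand-in. The poster's generator is a Perl script whose graph description format is not shown, so the function names and the text format emitted by `describe` are hypothetical.

```python
from itertools import combinations


def ring(n):
    """Each node links to its successor; requires n >= 3."""
    return [(i, (i + 1) % n) for i in range(n)]


def star(n):
    """Node 0 is the hub; every other node links to it."""
    return [(0, i) for i in range(1, n)]


def mesh(rows, cols):
    """2D mesh without wrap-around links."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            if c + 1 < cols:
                edges.append((node, node + 1))      # right neighbour
            if r + 1 < rows:
                edges.append((node, node + cols))   # neighbour below
    return edges


def hypercube(n):
    """Nodes whose addresses differ in exactly one bit are linked."""
    dims = n.bit_length() - 1
    if n < 2 or (1 << dims) != n:
        raise ValueError("hypercube size must be a power of two")
    return [(i, i ^ (1 << d)) for i in range(n) for d in range(dims)
            if i < (i ^ (1 << d))]


def fully_connected(n):
    """Every pair of nodes is linked."""
    return list(combinations(range(n), 2))


def describe(name, n_nodes, edges):
    """Render a hypothetical graph description: a header, then one link per line."""
    lines = [f"topology {name}", f"nodes {n_nodes}", f"links {len(edges)}"]
    lines += [f"{a} {b}" for a, b in edges]
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe("ring", 8, ring(8)))
```

Switching to another topology only changes the builder that is called and its size arguments, which mirrors the parameter-driven experimentation described above.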

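The cost metric is built from two per-node terms:

$$\text{Routing/Node} = \frac{\text{Routing Utiliz.} \times \text{Routing Ovrhd (\%)}}{\text{Nodes}}$$

$$\text{LUTs/Node} = \frac{\text{Logic Utiliz.} \times \text{Logic Ovrhd (\%)}}{\text{Nodes}}$$

$$CM = \frac{D \times \text{Routing/Node} \times K^{(f_{target} - freq)} \times \text{LUTs/Node}}{\text{Bisection\_width} \times \text{Channel\_width} \times freq}$$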
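The cost metric can be evaluated with a few lines of code. This is a minimal sketch, not the authors' implementation: it assumes that the performance factor K is raised to the power (ftarget - freq), which is one reading of the extracted formula. The function names, argument names and example values are illustrative only.

```python
def per_node(utilization, overhead_pct, nodes):
    """Routing/Node or LUTs/Node: utilization times overhead (%), per node."""
    return utilization * overhead_pct / nodes


def cost_metric(diameter, routing_per_node, luts_per_node,
                f_target, freq, bisection_width,
                channel_width=64, k=1.05):
    """CM under the assumption that K is raised to (f_target - freq)."""
    # Penalty is 1 when the target frequency is met and grows as freq falls short.
    penalty = k ** (f_target - freq)
    numerator = diameter * routing_per_node * penalty * luts_per_node
    return numerator / (bisection_width * channel_width * freq)


if __name__ == "__main__":
    # Hypothetical inputs: same design at the target frequency and 10 MHz below it.
    on_target = cost_metric(4, 2.0, 3.0, f_target=180, freq=180, bisection_width=2)
    below_target = cost_metric(4, 2.0, 3.0, f_target=180, freq=170, bisection_width=2)
    print(on_target, below_target)
```

With these inputs, missing the target frequency raises the cost twice: through the K penalty in the numerator and through the lower freq in the denominator.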

| Topology | Nodes | Max Freq. (MHz) | Speed Grade | Best Run | Total Runs | Map and PAR Options |
|---|---|---|---|---|---|---|
| fully con. | 8 | 170 | 12 | 2 | 6 | -timing -ol high -xe n -ol high |
| ring | 16 | 180 | 12 | 1 | 1 | -retiming on -ol high -timing -xe n -global_opt on -ol high |
| star | 16 | 180 | 12 | 4 | 4 | -timing -ol high -xe n register duplication -t 9 -ol high |
| mesh | 16 | 180 | 12 | 2 | 2 | -timing -ol high -xe n -ol high |
| hypercube | 16 | 180 | 12 | 2 | 2 | -timing -ol high -xe n -ol high |
| fully con. | 16 | 126 | 12 | 5 | 6 | -ol high -ol high -t 9 -xe n |
| ring | 32 | 123 | 11 | 5 | 6 | -ol high -ol high -t 9 -xe n |