The Routability of Multiprocessor Network Topologies in FPGAs
Manuel Saldaña, Lesley Shannon and Paul Chow

Dept. of Electrical and Computer Engineering
University of Toronto

Mapping Topologies

[Figure: a generic FPGA fabric (logic cells and routing channels) with example topologies mapped onto it: 16-node hypercube, 8-node fully-connected, 8-node ring, 32-node mesh, and 8-node star. Each node is labeled N.]

[Figure: block diagram with the labels µP, RAM, IF ctrl (two instances), Peripheral Bus, UART, Network Interface, RX_FIFO, TX_FIFO, and Link.]

Conclusions

Motivation

Paper to appear in System Level Interconnect Prediction [SLIP 2006] in March 2006.

Cost Metric

We derived a cost metric (CM) to compare and understand the implementation costs of bigger systems.

The terms of the cost metric are:

D: diameter of the topology
Channel_width: number of bits per channel (64 bits)
Bisection_width: number of links that need to be cut to have two equal sets of nodes
ftarget: required frequency
freq: actual frequency achieved
K: performance factor (1.05)

[Figure: Cost (axis from 0.00 to 25.00) versus number of nodes (8, 16, 32) for the ring, star, mesh, hypercube, and fully-connected topologies.]

The difference in resource utilization between the ring, star, mesh, and hypercube topologies is not significant up to 32 nodes (11% of the total number of nets in the system).

Systems up to at least 16 nodes do not require a sophisticated Network-on-Chip. Just use point-to-point links.

A fully-connected topology can be implemented with at least 16 nodes, but 32 nodes exceed the routing resources on the FPGA.

FPGAs have a fixed set of resources that exist whether they are used or not. Do not limit the connectivity if there are resources available.

As expected, the hypercube and mesh should continue to scale to larger systems. The ring starts to look worse because of a low bisection bandwidth and increase in latency.

Area and Placement

We simulate resource depletion by restricting the available space to place and route a design.

The MicroBlaze is a Relationally Placed Macro (RPM). We found that RPMs limit the placement on the FPGA, especially if the FPGA fabric is narrow, as in the Virtex 4.

The placement of MicroBlazes does not change considerably across topologies with the same number of nodes.

The XC4VLX25 and XC2V2000 both have the same number of slices, but their aspect ratios differ, which affects the packing efficiency of the RPMs.

[Figure: scaled images of different FPGA sizes (XC4VLX25, XC2V2000, XC4VLX40; Virtex 4 and Virtex 2) with the restricted area marked. Other labels: Only the Master node; Link 1, Link 2, Link 3.]

Routing Resources

The graph shows the Routing Overhead, which is the percentage of Network_Nets with respect to the total number of nets in the entire design.

Network_Nets = Link Nets + Network Interface Nets

For the ring, star, mesh and hypercube topologies, the maximum difference in routing overhead is 3.7% with 8 nodes, and 11% with 32 nodes.

With respect to the ring topology, the fully-connected topology has 15% more routing overhead with 8 nodes and 49% with 32 nodes.

For the 32-node fully-connected system, the Network_Nets account for 54% of the total nets in the system.
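The Routing Overhead definition can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and the net counts in the usage line are hypothetical and are not measurements from this work.

```python
def routing_overhead(link_nets, network_interface_nets, total_nets):
    """Return Network_Nets as a percentage of all nets in the design.

    Network_Nets = Link Nets + Network Interface Nets, as defined on the poster.
    """
    if total_nets <= 0:
        raise ValueError("total_nets must be positive")
    network_nets = link_nets + network_interface_nets
    if network_nets > total_nets:
        raise ValueError("network nets cannot exceed the total number of nets")
    return 100.0 * network_nets / total_nets


# Hypothetical counts, for illustration only (not data from the poster).
print(routing_overhead(link_nets=150, network_interface_nets=100, total_nets=1000))
```

In practice the net counts would have to be extracted from the placed and routed design; how that extraction is done is not described here.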

[Figure: % of Nets in the System (axis from 10 to 100) versus number of nodes (8, 16, 32) for the ring, star, mesh, hypercube, and fully-connected topologies.]

To understand how different well-known topologies map to a generic FPGA structure with a fixed set of routing and logic resources.

FPGA fabrics have evolved over time to suit the requirements of large, localized digital circuits. Given a network of such circuits, what might limit the implementation of Networks-on-Chip (NoC) on FPGAs?

Maximum Frequency

The target frequency is 180 MHz, which is the maximum frequency of the MicroBlaze soft-processor.

We used the Xilinx Xplorer utility to improve the performance of the designs.

Techniques used to improve performance:

- Global optimization
- Register duplication
- Timing-driven packing and placement
- Different cost tables

The drop in frequency for the fully-connected topology is a sign of congestion, but it still routes.

Logic Resources

The increase in LUTs is roughly linear as the number of nodes increases, except for the fully-connected topology, which grows much faster because its number of links grows quadratically with the number of nodes.

We measure the total number of LUTs per system, including soft-processors, memory interfaces, FIFOs, and Network Interfaces.

The difference between the ring, star, mesh and hypercube topologies is 11% with 32 nodes and 5% with 8 nodes.

[Figure: LUTs (axis from 0 to 90000) versus number of nodes (8, 16, 32) for the ring, star, mesh, hypercube, and fully-connected topologies.]

Manually interconnecting all the processing elements is error-prone, so we need an automated flow.

Design Flow

[Figure: design flow. Input parameters (number of links, number of nodes, topology) → Topology Generator (Perl script) → Graph Description File → System Generator (C program) → design files (.mhs, .mss, .xmp, .ucf) → Xilinx Implementation Flow (Synthesis, MAP, PAR) → synthesized, placed and routed multiprocessor systems (.ncd files).]

We can experiment with different topologies by simply changing the number of nodes, the number of links, and the topology parameters.
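To make the idea of a parameterized topology generator concrete, here is a minimal Python sketch that builds link lists for the five topologies and measures their diameter by breadth-first search. It is not the authors' Perl script: the function names, the node numbering, and the use of plain edge sets as a "graph description" are assumptions made for illustration, and the 4 × 8 shape of the 32-node mesh is a guess.

```python
from collections import deque
from itertools import combinations


def ring(n):
    """Each node i links to node (i + 1) mod n."""
    return {tuple(sorted((i, (i + 1) % n))) for i in range(n)}


def star(n):
    """Node 0 is the hub; every other node links only to it."""
    return {(0, i) for i in range(1, n)}


def mesh(rows, cols):
    """2D mesh without wrap-around links; node id is r * cols + c."""
    edges = set()
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            if c + 1 < cols:
                edges.add((node, node + 1))      # link to the right neighbor
            if r + 1 < rows:
                edges.add((node, node + cols))   # link to the neighbor below
    return edges


def hypercube(n):
    """Nodes whose ids differ in exactly one bit are linked; n must be 2**k."""
    if n < 1 or n & (n - 1):
        raise ValueError("hypercube size must be a power of two")
    dims = n.bit_length() - 1
    return {(i, i ^ (1 << b)) for i in range(n) for b in range(dims)
            if i < (i ^ (1 << b))}


def fully_connected(n):
    """Every pair of nodes is linked: n * (n - 1) / 2 links."""
    return set(combinations(range(n), 2))


def diameter(n, edges):
    """Longest shortest path between any two nodes, found by BFS from each node."""
    adj = {i: [] for i in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    worst = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != n:
            raise ValueError("topology is not connected")
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    for name, n, edges in [("ring", 8, ring(8)), ("star", 8, star(8)),
                           ("mesh 4x8", 32, mesh(4, 8)),
                           ("hypercube", 16, hypercube(16)),
                           ("fully-connected", 8, fully_connected(8))]:
        print(name, "links:", len(edges), "diameter:", diameter(n, edges))
```

Representing each topology as a set of node pairs keeps the generator independent of the downstream tools; a separate step would translate the pairs into whatever format the system generator expects.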

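Cost metric (CM) and its per-node terms:

\[
\text{Routing/Node} = \frac{\text{Routing Utiliz.} \times \text{Routing Ovrhd (\%)}}{\text{Nodes}}
\]

\[
\text{LUTs/Node} = \frac{\text{Logic Utiliz.} \times \text{Logic Ovrhd (\%)}}{\text{Nodes}}
\]

\[
\text{CM} = \frac{D \times \text{Routing/Node} \times K^{(f_{\text{target}} - \text{freq})} \times \text{LUTs/Node}}{\text{Bisection\_width} \times \text{Channel\_width} \times \text{freq}}
\]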
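As a rough sketch, the cost metric can be evaluated as follows. The code assumes that K is raised to the power (ftarget − freq), which is how the extracted layout reads but is not confirmed elsewhere on the poster. The function and parameter names are hypothetical, and the values in the usage lines are made up purely to show the behaviour.

```python
def per_node(utilization, overhead_pct, nodes):
    """Routing/Node or LUTs/Node: utilization times overhead (%), divided by nodes."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return utilization * overhead_pct / nodes


def cost_metric(diameter, routing_per_node, luts_per_node, bisection_width,
                f_target_mhz, freq_mhz, channel_width=64, k=1.05):
    """Cost metric CM; lower is better.

    Assumption: the performance factor enters as k ** (f_target - freq), so a
    design that meets the target frequency is not penalized (factor of 1).
    """
    penalty = k ** (f_target_mhz - freq_mhz)
    numerator = diameter * routing_per_node * penalty * luts_per_node
    denominator = bisection_width * channel_width * freq_mhz
    return numerator / denominator


# Illustrative inputs only (not measurements from the poster).
on_target = cost_metric(2, 1.0, 1.0, 2, f_target_mhz=180, freq_mhz=180)
slower = cost_metric(2, 1.0, 1.0, 2, f_target_mhz=180, freq_mhz=126)
print(on_target, slower, slower > on_target)
```

With this reading, missing the target frequency raises the cost twice: through the K penalty in the numerator and through the lower freq in the denominator.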

Topology    Nodes  Max Freq. (MHz)  Speed Grade  Best Run  Total Runs  Map / PAR Options
fully con.  8      170              12           2         6           -timing -ol high -xe n -ol high
ring        16     180              12           1         1           -retiming on -ol high -timing -xe n -global opt on -ol high
star        16     180              12           4         4           -timing -ol high -xe n register duplication -t 9 -ol high
mesh        16     180              12           2         2           -timing -ol high -xe n -ol high
hypercube   16     180              12           2         2           -timing -ol high -xe n -ol high
fully con.  16     126              12           5         6           -ol high -ol high -t 9 -xe n
ring        32     123              11           5         6           -ol high -ol high -t 9 -xe n