Design and FPGA Implementation of a QoS Router for Networks-on-Chip

Yahia Salah1 and Rached Tourki2
EµE Laboratory, Faculty of Sciences of Monastir

Monastir, 5019, Tunisia 1 [email protected], 2 [email protected]

Abstract—Network-on-Chip (NoC) is believed to be a solution to the existing and future interconnection problems in highly complex chips. Several alternatives have proposed circuit-switched NoCs to guarantee performance and Quality-of-Service (QoS) parameters for Systems-on-Chip (SoC). However, implementing scheduling mechanisms with different service classes and exploiting the advantages of wormhole routing and virtual channels is an important way to provide QoS guarantees in terms of transmission delays and bandwidth. This paper presents a packet-switched NoC router with QoS support. It uses a priority-based scheduler to solve conflicts between multiple connections with heterogeneous traffic flows and to minimize network latency. The hardware design of the router has been implemented at the RTL level; its functionality is evaluated and the QoS requirements for each service class are derived. We show the trade-off between implementing optimal scheduling strategies and the performance of the system.

Keywords-Network-on-Chip; router; packet scheduling; quality-of-service guarantee; FPGA implementation; performance.

I. INTRODUCTION

Given the number of transistors that will be available on future Systems-on-Chip (SoCs), the Network-on-Chip (NoC) will naturally become a good alternative to traditional on-chip communication architectures such as ad-hoc point-to-point interconnections or shared-bus structures [1, 2]. NoC provides a multi-core architecture for managing SoC design complexity by incorporating concurrency and synchronization. Furthermore, it provides a flexible and scalable design approach and satisfies the need for communication and data transfers. The NoC architecture, as exemplified in Figure 1, consists of components such as routers, physical channels, and network interfaces. The major component in the communication infrastructure of a NoC is the router, with a set of bidirectional ports linked to neighbouring routers and to an associated Intellectual Property (IP) core; it represents the core communication medium. In Figure 1, routers are interconnected by pairs of parallel, opposite channels that compose each link, forming a given topology such as a mesh or torus network. Their role is to forward data packets through the network from the source to the destination IP. The network interface is needed to decouple computation from communication, enabling IP cores and the interconnect to be designed separately and integrated more easily.

[Figure 1 shows a 4x4 mesh of routers R00–R33; each router Rij is attached to an IP core IPij through a network interface (NI) and linked to its neighbouring routers by physical channels.]

Figure 1. Typical architecture of a NoC

A NoC can be described by its topology and by the strategies used for buffering, switching, flow control, routing and arbitration. These mechanisms are important issues in NoC design. However, providing quality-of-service (QoS) is considered a critical issue for applications that have hard real-time requirements [3]. In such architectures, the on-chip network needs to provide a satisfactory service bounded by deterministic transmission delays and communication throughput. Most existing NoC works use a circuit-switching scheme to achieve adaptability and predictability in multi-core system design [4-7]. However, the vast majority of NoC authors predict that packet-switched on-chip interconnection networks will be essential to address the complexity of future SoC designs with various QoS requirements [2, 8-10]. It is expected that diverse applications such as multimedia, control and mobile components will be supported in a NoC environment. Under such a scenario, the NoC should be able to provide various levels of support for these applications. Thus, the router needs to efficiently utilize the limited bandwidth of the links to deliver communication data between IP cores and satisfy the different requirements of each service level. The specified QoS targets can be achieved by assigning different priorities to various data packets and exploiting the advantages of wormhole routing and virtual channels.

Our main contribution in this paper is the design of a scalable packet-based router that transfers data and dynamically manages several communications in parallel.


As a second contribution, we employ scheduling strategies at the router level to provide time- and bandwidth-related guarantees in an on-chip switched network. The rest of the paper is organized as follows: Section II presents the proposed NoC router architecture with QoS support in detail, Section III reports the performance results, and Section IV concludes.

II. NOC ROUTER ARCHITECTURE

Our NoC architecture provides two types of communication services to IP cores: best-effort (BE) and guaranteed services (GS). Each QoS connection is assigned to an individual virtual channel (VC). There are two types of virtual channels per router input port: one for GS packets and one for BE packets. Once a data unit is received at a router's input port, it is allocated to a virtual channel, granted access to an output port, passes through the crossbar switch, and arrives at the neighbouring router.

We identify five important issues in the design of the router network architecture: the routing method, network flow control, the switching policy, the arbitration strategy, and the buffering scheme. The routing method determines how a message chooses a path through the network topology, while flow control deals with the allocation of channels and buffers to a message as it traverses this path. Switching is the mechanism that removes data from an input channel and places it on an output channel of the router, while arbitration is responsible for scheduling the use of channels and buffers by the packets. Finally, buffering defines the approach used to store packets while the arbitration cannot yet schedule them.

In this work, we use wormhole routing [1, 11] as the switching technique, an attractive method for reducing message latency by pipelining transmission over the channels along a message's route. It splits the packets of a message into multiple flits (flow control units). However, the wormhole scheme may produce head-of-line blocking and deadlock when packets block each other in a circular fashion under traffic congestion. The first way to address the deadlock problem is to apply virtual-channel flow control in the router design. The second is to use deterministic routing such as the symmetric XY discipline [5, 8]. This scheme is an appropriate solution to the question of flexibility and can be easily implemented. Therefore, in the proposed router architecture, we combine XY wormhole routing with virtual channels to increase the NoC performance, and we consider the effect of packet scheduling on transmission latency and throughput.
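As an illustration of the XY discipline, the following VHDL sketch computes the output-port request from the current and destination router coordinates; the 4-bit coordinates and the port encoding are our assumptions, not taken from the paper.

-- Minimal sketch of dimension-ordered (XY) routing, assuming 4-bit mesh
-- coordinates and the port encoding given in the comments; neither detail
-- is taken from the paper.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity xy_route is
  port (
    cur_x, cur_y : in  unsigned(3 downto 0);  -- this router's coordinates
    dst_x, dst_y : in  unsigned(3 downto 0);  -- destination from the header flit
    out_port     : out std_logic_vector(2 downto 0)  -- "000"=local, "001"=E, "010"=W, "011"=N, "100"=S
  );
end entity;

architecture rtl of xy_route is
begin
  process (cur_x, cur_y, dst_x, dst_y)
  begin
    -- Route along X first, then along Y, then eject to the local IP core.
    if dst_x > cur_x then
      out_port <= "001";            -- go East
    elsif dst_x < cur_x then
      out_port <= "010";            -- go West
    elsif dst_y > cur_y then
      out_port <= "011";            -- go North
    elsif dst_y < cur_y then
      out_port <= "100";            -- go South
    else
      out_port <= "000";            -- deliver to the attached IP core
    end if;
  end process;
end architecture;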

A. The packet format and protocol

End-to-end communication in the NoC architecture is accomplished by an exchange of messages between source and destination IP cores. A message is partitioned into packets for the allocation of control state, and into flits for the allocation of channel bandwidth and buffer capacity. A packet is the basic unit of routing and sequencing in packet switching. It consists of three kinds of flits: the header flit, the data flits and the tail flit. The header flit contains the routing information and the required quality-of-service of the packet: it specifies the number of data flits in the packet, the service level (BE or GS), and the source and destination addresses in the network topology. The tail flit contains the error-checking pattern and acts as a flag signalling the end of the packet.

In order to improve the flexibility of the network, a variable packet length organization is proposed in this work. This way, it can be decided dynamically how to split a message into packets in order to achieve the best performance.
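To make the header contents concrete, the following sketch lists one possible set of header-flit fields matching the description above; the field names and widths are our assumptions (only the 32-bit flit width is fixed by the paper, in Section III).

-- Hypothetical header-flit fields for a 32-bit flit; widths are assumptions,
-- only the list of fields follows the paper's description.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package packet_pkg is
  type flit_kind_t is (HEADER_FLIT, DATA_FLIT, TAIL_FLIT);

  type header_t is record
    service_level  : std_logic;             -- '1' = GS, '0' = BE
    src_x, src_y   : unsigned(3 downto 0);  -- source router coordinates
    dst_x, dst_y   : unsigned(3 downto 0);  -- destination router coordinates
    num_data_flits : unsigned(7 downto 0);  -- variable packet length
    deadline       : unsigned(5 downto 0);  -- relative deadline (GS packets only)
  end record;
end package;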

B. The router structure

The router presented here has a set of bidirectional network ports linked to neighbouring routers and a bidirectional service port that enables processing cores of the system to inject and eject packets to/from the network. A top-level schematic of the implementation is shown in Figure 2. The router consists of four main blocks: the input port controller (IPC), the output port controller (OPC), a crossbar switch (CS) and a centralized control logic module (CL) performing flit scheduling and reconfiguration. An explanation of these components is given below; for clarity, the figure shows only one input port controller and one output port controller. The number of router ports is configurable, which makes the network topology flexible. The user can also parameterize the flit size, the number of data flits per packet, the number of virtual channels per input port, and the number and depth of the buffers.

[Figure 2 shows one input port with its GS-VC and BE-VC input buffers and input process module (IPM), the crossbar switch connecting to the output port controller (OPC), and the scheduler and reconfiguration logic exchanging request/grant and Credit-In/Credit-Out signals.]

Figure 2. Top level schematic of the on-chip router
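The configurable parameters listed above could be exposed as VHDL generics on the router entity, as in the illustrative declaration below; the generic names, defaults and port list are assumptions of ours, not the authors' RTL.

-- Illustrative top-level entity showing how the router parameters described
-- in the text could be exposed as generics; names and defaults are assumptions.
library ieee;
use ieee.std_logic_1164.all;

entity noc_router is
  generic (
    N_PORTS         : positive := 5;   -- 3 to 5 depending on the position in the mesh
    FLIT_WIDTH      : positive := 32;  -- physical channel width W_PC
    N_VC_PER_PORT   : positive := 2;   -- one GS and one BE virtual channel
    VC_BUFFER_DEPTH : positive := 2    -- flits per input VC buffer
  );
  port (
    clk, rst   : in  std_logic;
    flit_in    : in  std_logic_vector(N_PORTS*FLIT_WIDTH-1 downto 0);
    flit_out   : out std_logic_vector(N_PORTS*FLIT_WIDTH-1 downto 0);
    credit_in  : in  std_logic_vector(N_PORTS-1 downto 0);
    credit_out : out std_logic_vector(N_PORTS-1 downto 0)
  );
end entity;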

Each IPC block has input virtual channels that make it possible to separate different traffic flows despite the use of a common physical medium. Virtual channels provide the ability to deliver guaranteed communication throughput because they allow a packet to have a greater priority (GS level) than other packets (BE level). Each set of input-port virtual channels possesses its own input process module (IPM), as shown in Figure 2. On reception of an incoming flit, the IPM performs the XY routing algorithm to determine the appropriate output port on which the flit will be sent, inserts the flit into the GS-VC or BE-VC buffer, then sends a request to the scheduler (located in the CL block) to schedule this flit based on its type (BE or GS) and on its deadline. Each input-port virtual channel has a counter (pckval), and each output port has a free-buffer-space counter (Credit-In) and an internal signal (poutstate) indicating whether the port is in use. If it is, then for each incoming flit of a packet the pckval counter is incremented until the output port becomes free; it returns to zero when the tail flit of the packet has been received, processed and delivered.


The CS block provides the data path between input and output ports. Access to the crossbar is controlled by the scheduler and the reconfiguration logic modules.

The OPC block maintains the status of the output virtual channel (free/busy) and keeps track of the credits available in the downstream router's VC buffer using credit-based flow control. Each OPC module contains m lanes (m is fixed to one lane of one flit in our implementation).

To avoid overflow in the input queues, particularly for the BE service level, link-level flow control is implemented by the flow control strategy and the reconfiguration process. These two mechanisms manage flit exchanges between neighbouring routers. The virtual channel implementation employs a credit-based flow control strategy, owing to its advantages over handshaking. Here, each router is initialized with the amount of free buffer space in the connected routers. Every time a flit is sent to a next router, the free-buffer-space counter (Credit-In) corresponding to that destination port is decremented. When a router schedules a flit for the next cycle, it signals its predecessor that the free-buffer-space counter can be incremented (Credit-Out). The number of available flit slots per service level in the buffers of the next input ports is stored in a next-buffer-state table within the reconfiguration logic. When space is available, the scheduler module schedules flits that are buffered at the input ports and waiting for transmission to their appropriate output ports. In the following subsection, we explain the proposed scheduling method that composes the scheduler module.
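The credit bookkeeping described above can be captured by one small counter per output virtual channel, as in the following hedged sketch; the counter width, signal names and initial credit value are our assumptions.

-- Minimal sketch of credit-based flow control for one output virtual channel:
-- the counter is decremented when a flit is forwarded downstream and
-- incremented when the downstream router returns a credit. Widths, names
-- and the initial credit value are assumptions.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity credit_counter is
  generic (INIT_CREDITS : natural := 2);  -- downstream VC buffer depth (flits)
  port (
    clk, rst  : in  std_logic;
    flit_sent : in  std_logic;   -- a flit left on this output port this cycle
    credit_in : in  std_logic;   -- credit returned by the downstream router
    can_send  : out std_logic    -- at least one free slot downstream
  );
end entity;

architecture rtl of credit_counter is
  signal credits : unsigned(3 downto 0) := to_unsigned(INIT_CREDITS, 4);
begin
  can_send <= '1' when credits > 0 else '0';

  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        credits <= to_unsigned(INIT_CREDITS, 4);
      elsif flit_sent = '1' and credit_in = '0' then
        credits <= credits - 1;
      elsif credit_in = '1' and flit_sent = '0' then
        credits <= credits + 1;
      -- if a flit is sent and a credit returns in the same cycle, the count is unchanged
      end if;
    end if;
  end process;
end architecture;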

C. The proposed scheduling solution

Our scheduling approach for GS and BE flits is based on well-known strategies used by real-time operating systems (RTOS): round-robin (RR), High-Priority First (HPF), and Earliest Deadline First (EDF) [12]. In this work, a task corresponds to the transfer of one packet (a set of flits) through the NoC.

• The RR technique uses a list of pre-ordered tasks, and the scheduler sequentially passes control from one task to another; non-ready tasks are skipped. When tasks are independent of one another, RR prevents deadlocks, reduces waiting time, and gives the scheduler the flexibility required to handle these tasks. However, this technique is not suitable for hard real-time tasks.

• The HPF technique assigns static priorities to tasks, but schedules them at run-time. Priorities are assigned according to the pre-fixed execution rate of each task.

• The EDF technique uses the deadline of a task as its priority. The task with the earliest deadline has the highest priority, while the task with the latest deadline has the lowest priority. The execution queue is sorted by flit deadlines. When a task becomes ready, it must be inserted in the execution queue, or it must preempt the current task if its deadline is earlier than that of the currently running task. EDF does not make any assumption about the periodicity of the tasks; therefore it can be used for scheduling periodic as well as aperiodic tasks. The EDF technique is optimal for scheduling independent and preemptable tasks on single-processor systems, leading to higher resource utilization [12].

The main goal of using the aforementioned schemes is to find a feasible schedule for GS and BE flits whenever one exists, namely a schedule meeting the deadlines of all the GS flits while guaranteeing a minimum bandwidth for the BE packets. Given the above-mentioned advantages, the EDF strategy can handle predictable applications with hard real-time guarantees, with or without streams of data packets. However, when applied to the scheduling of packets in a NoC, this technique suffers from a major limitation: the additional cost of classifying packets within the routers according to their deadline. The queue length may increase drastically and thus reduce the NoC performance. To deal with this limitation, we reduce the cost of implementing the EDF queue by limiting its use to four header flits of real-time packets; the remaining data flits of these packets are scheduled via the ratio T^R_kl / D^R_kl, where T^R_kl denotes the number of flits remaining to be routed from input port I_k to output port O_l, and D^R_kl their deadline. A necessary condition for satisfying the deadline constraint is T_ij / D_ij ≤ 1 for all i, j, where T_ij is the number of GS flits to be sent from network interface NI_i to NI_j, and D_ij is their arrival deadline counted in cycles.

The scheduler may receive up to I requests, where I is the number of input ports in the router. Each request specifies the type of the flit (GS or BE), the deadline constraint if it is GS, and the destination output port as computed by the XY routing algorithm. For each shared output port, the scheduler behaves as follows. GS requests are handled first, and the EDF policy is used to grant one of these requests based on their relative deadlines. If more than one GS request has the same deadline, the scheduler assigns them equal priority and grants one at random. If all requests are of type BE, the RR policy is used, which guarantees a minimum bandwidth for BE traffic. For each granted request, the corresponding flit is extracted from its VC and forwarded to the corresponding output port. Ungranted requests are handled at the next cycle. The relative deadline of each GS request is computed at the first router visited and then updated at each neighbouring hop router.
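The per-output-port grant decision described above can be sketched as follows; the port count, deadline width and deterministic tie-breaking (the paper breaks ties randomly) are simplifications of ours.

-- Behavioural sketch of one output-port arbiter: GS requests are served
-- first by earliest deadline (EDF), otherwise BE requests are served
-- round-robin. Generic values and signal names are assumptions.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity output_port_arbiter is
  generic (N_IN : positive := 5;    -- competing input ports
           DL_W : positive := 6);   -- deadline width in bits
  port (
    clk, rst  : in  std_logic;
    gs_req    : in  std_logic_vector(N_IN-1 downto 0);   -- GS requests
    be_req    : in  std_logic_vector(N_IN-1 downto 0);   -- BE requests
    deadlines : in  unsigned(N_IN*DL_W-1 downto 0);      -- packed relative deadlines
    grant     : out std_logic_vector(N_IN-1 downto 0)    -- one-hot grant
  );
end entity;

architecture rtl of output_port_arbiter is
  constant NO_REQ : std_logic_vector(N_IN-1 downto 0) := (others => '0');
  signal rr_ptr : integer range 0 to N_IN-1 := 0;        -- last BE port granted
begin
  process (clk)
    variable g       : std_logic_vector(N_IN-1 downto 0);
    variable best_i  : integer range 0 to N_IN-1;
    variable best_dl : unsigned(DL_W-1 downto 0);
    variable dl_i    : unsigned(DL_W-1 downto 0);
    variable idx     : integer range 0 to N_IN-1;
    variable found   : boolean;
    variable granted : boolean;
  begin
    if rising_edge(clk) then
      g := (others => '0');
      if rst = '1' then
        rr_ptr <= 0;
      elsif gs_req /= NO_REQ then
        -- EDF among GS requests: grant the smallest relative deadline
        -- (ties broken by lowest index in this sketch).
        found  := false;
        best_i := 0;
        for i in 0 to N_IN-1 loop
          dl_i := deadlines((i+1)*DL_W-1 downto i*DL_W);
          if gs_req(i) = '1' and ((not found) or (dl_i < best_dl)) then
            best_i  := i;
            best_dl := dl_i;
            found   := true;
          end if;
        end loop;
        g(best_i) := '1';
      elsif be_req /= NO_REQ then
        -- Round-robin among BE requests, starting after the last granted port.
        granted := false;
        for k in 1 to N_IN loop
          idx := (rr_ptr + k) mod N_IN;
          if be_req(idx) = '1' and not granted then
            g(idx)  := '1';
            rr_ptr  <= idx;
            granted := true;
          end if;
        end loop;
      end if;
      grant <= g;
    end if;
  end process;
end architecture;

In the full router, one such arbiter instance per output port would receive the requests issued by the IPMs and drive the crossbar configuration.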

III. EXPERIMENTAL RESULTS

This section is divided into two parts. The first part gives an overview of the different router implementations and their synthesis results. The second part presents the performance analysis and evaluation of the router based on the proposed scheduling strategies.

A. Synthesis Results

The proposed router is designed in VHDL at the register transfer level (RTL) and implemented on a Xilinx FPGA device using the ISE 9.2i tool. The target device is a Virtex-II XC2V1000 FPGA.

To investigate the feasibility of the concept presented here, we have implemented an N-input/N-output router. The network channel width (WPC) is 32 bits. We use one clock cycle for a flit to cross a link; thus, the flit size is 32 bits.


Here, the physical channel is multiplexed by two virtual channels. Each supported traffic class (BE or GS) is assigned to a virtual channel (VC) for flow control purposes, and the depth of each input VC buffer is fixed to 2 flits. The output virtual channel contains one lane of one flit.

The router operating frequency (F) is obtained from the ISE software tool, allowing a theoretical peak performance (BTh) of:

F (MHz) * N (ports) * W_PC (bits) = B_Th (Gbits/s)    (1)
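As a numerical check of (1), the 5-port configuration reported in Table I below gives:

% Worked example for Eq. (1) using the 5-port figures from Table I;
% the 10^{-3} factor converting Mbits/s to Gbits/s is left implicit in (1).
\[
B_{Th} = 164\,\mathrm{MHz} \times 5\,\mathrm{ports} \times 32\,\mathrm{bits}
       = 26\,240\,\mathrm{Mbits/s} \approx 26.24\,\mathrm{Gbits/s}.
\]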

Table 1 gives an overview of the different router implementations. For each number of effective input and output ports per router, the table shows the operating frequency (MHz), the bandwidth (Gbits/s) and the area (in slices).

TABLE I. RESULTS OF THE ROUTER IMPLEMENTATIONS ON FPGA

Router type   F (MHz)   BTh (Gbits/s)   Area (slices)
3-port        183.5     17.616          102
4-port        176       22.528          185
5-port        164       26.240          291

The number of input-output ports varies from 3 to 5 depending on the network topology and the router's location in the network. Regular structures, such as the 2D mesh or 2D torus, are attractive because of their predictability and ease of design. Both have a maximum of five connections per router: four to neighbouring routers and one to the related IP core.

Figure 3 compares the maximum frequency of the 5-port router as a function of the number of virtual channels per input port. Note that when there is more than one virtual channel, only one is used for GS traffic and the remaining channel(s) are used for BE traffic. In this experiment, each input virtual channel of the intermediate router is provided with a 2-flit buffer, and the data-flit width is 32 bits. The maximum router frequency approaches 180 MHz when the number of virtual channels per input port equals one, and it decreases as the number of virtual channels increases. This is due to the growth of the scheduling phase between virtual channels and to the growth of the switch complexity (number of flits per input port, and bus width).

[Figure 3 is a bar chart of the maximum router frequency (MHz) versus the number of virtual channels per input port (1 to 4); the reported values lie between 137 and 172 MHz.]

Figure 3. Maximum router frequency as a function of the number of virtual channels

The chosen device contains 5,120 slices. Figure 4 shows that the percentage of FPGA slices used per router depends on the number of virtual channels per port. This parameter profoundly impacts the router area and bandwidth performance. For four virtual channels, we obtain an area of 469 slices (9.16% of the FPGA resources) for a bandwidth of 21.92 Gbits/s. As shown in Figures 3 and 4, maximum frequency and area do not increase linearly, because additional logic is needed to manage the interaction between virtual channels.

[Figure 4 is a bar chart of the percentage of FPGA slices consumed by the router versus the number of virtual channels per input port (1 to 4); the reported values are 4.35%, 5.68%, 7.34% and 9.16%.]

Figure 4. Percentage of FPGA slices consumed by the router as a function of the number of virtual channels

Comparing Figures 3 and 4, we can conclude that the router with two virtual channels is attractive: it provides 164 MHz on a relatively cheap device (an area of 291 slices, i.e. 5.68% of the available FPGA resources). Table 2 compares our design with others. The simplicity of our router architecture not only reduces the hardware cost, it also increases the clock rate, resulting in a high bandwidth of 26.24 Gbits/s.

TABLE II. COMPARISON BETWEEN OUR DESIGN AND OTHERS

Design               [7]                       [9]                        [13]           Ours
Link width (bits)    32                        16                         32             32
Number of ports      4                         5                          5              5
Frequency (MHz)      138                       159.5                      100            164
Bandwidth (Gbits/s)  0.124                     -                          16             26.240
Area (slices)        366                       308                        1031           291
FPGA device          VirtexII-Pro xcv2p30-7    VirtexII-Pro 2VP70ff1704   VirtexII-Pro   VirtexII xc2v1000

B. Performance Analysis

To evaluate the traffic performance of the NoC architecture, we used the router according to the network model and also implemented a random pattern generator model to emulate the original IPs communicating via their corresponding network interfaces. This pattern generator injects packets of random length and constraints. A recorder/monitor was developed to dynamically and instantaneously collect the number of packets sent and received, together with the timing information necessary to calculate the acceptance rate, the average network latency and the throughput.


The acceptance rate is the number of packets received at the destination per cycle per node. The network latency denotes the difference between the time at which the first flit of a packet is offered to the network and the time at which the last flit is delivered to the destination node. The throughput is measured at the ports of the network in flits per cycle. At a particular injection rate, the number of GS and BE packets to be generated is specified as a ratio r = GS/BE.

The NoC router model assumes that a header flit takes two clock cycles of internal delay (worst case) and that forwarding a data flit to its output port takes one cycle. Wormhole routing is used because it requires less buffering to achieve minimal routing latency. Our design supports the proposed scheduling strategies for best-effort and hard real-time traffic flows, which impact the average GS/BE buffer budget and the network latency.
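Under these timing assumptions, a rough zero-load estimate of ours (not given in the paper) for a packet whose header is followed by L flits and which traverses H routers, with one cycle per link and no contention, is:

% Back-of-the-envelope zero-load latency under the stated per-router delays
% (2 cycles per header flit, 1 cycle per data flit, 1 cycle per link).
% This estimate is ours, not taken from the paper.
\[
T_{\text{zero-load}} \approx 3H + L \ \text{cycles},
\]

i.e. the header spends two cycles in each router plus one cycle on each link, after which the remaining L flits stream out one per cycle.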

Figures 5 and 6 present the average network latency of the BE and GS traffic under different injection rates, varied from 0.05 to 0.7, with the ratio r varied from 0.25 to 1. As can be observed from the plots, for all values of r, the network latency of the BE traffic increases as the injection rate increases. The average BE network latency also increases with increasing r, since more priority is given to GS traffic over BE traffic. The average network latency for GS traffic, on the other hand, remains almost constant. A larger improvement in average network latency reduction is obtained compared to [14].

[Figure 5 plots latency (cycles) versus injection rate (packets/cycle/node) for ratios r = 0.25, 0.5 and 1.]

Figure 5. Average network latency for BE traffic under different injection rates

[Figure 6 plots latency (cycles) versus injection rate (packets/cycle/node) for ratios r = 0.25, 0.5 and 1.]

Figure 6. Average network latency for GS traffic under different injection rates

We evaluate the proposed scheduling strategies at the router level to minimize contention between multiple connections with heterogeneous traffic flows. Figure 7 depicts the influence of these policies on the accepted data rate using 2 VCs per router input port.

[Figure 7 plots the acceptance rate (packets/cycle/node) versus the packet insertion rate (packets/cycle/node) for r = 1, with and without the scheduling strategies.]

Figure 7. PIR vs. accepted PIR

For small injection rates, the accepted traffic follows the inserted traffic. Between packet insertion rates of 0.3 and 0.7, the accepted traffic becomes practically constant, indicating the throughput saturation point. The accepted data rate with the scheduling strategies is significantly better than without them, saturating at just over 0.32 for a ratio r equal to 1. We compare these results with the work proposed in [10], where 1, 2 and 4 VCs are used per physical channel. Using the aforementioned scheduling strategies, we can therefore push back network saturation, resulting in increased channel utilization and throughput. This makes it possible to improve the quality-of-service offered by the proposed NoC.

IV. CONCLUSION

In this paper, we presented a scalable router design for packet-switched on-chip networks in a SoC environment. This router provides both guaranteed and best-effort communication services. It uses a priority-based scheduler that supports three scheduling schemes. Synthesis results obtained for different implementations show that the router achieves a high bandwidth (≈26 Gbits/s) and operating frequency (164 MHz) at a minimal area cost (291 slices). Experimental results show that, compared with other routers, ours achieves optimal packet scheduling, increased channel utilization and throughput, reduced network latency, and avoidance of resource conflicts between GS and BE connections. In future work, we plan to explore efficient adaptive routing algorithms that can be implemented together with packet scheduling schemes, in order to evaluate our router architecture on different network topologies.

REFERENCES

[1] A. Jantsch and H. Tenhunen, eds., Networks on Chip, Kluwer Academic Publishers, NewYork, 2003.

[2] M. Ali, M. Welzl, and M. Zwicknagl. Networks on Chips: Scalable Interconnects for Future Systems on Chips. In Proceedings of the 3rd IEEE International Conference on Circuits and Systems for Communications (ICCSC'06), Bucharest, Romania, 6-7 July 2006.

[3] R. Dick. Embedded System Synthesis Benchmarks (E3S). Available at http://www.ece.northwestern. edu/~dickrp/e3s, 2007.

[4] K. Goossens et al. The Æthereal network on chip: Concepts, architectures, and implementations. IEEE Design and Test of Computers, vol. 22 (2005), no. 5, pp. 414-421.

[5] M. Millberg, E. Nilsson, R. Thid, and A. Jantsch. Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip. In Proceedings of DATE ’04, Paris, France, February 2004, vol. 2, pp. 890–895

[6] T. Bjerregaard. The MANGO Clockless Network-on-Chip: Concepts and Implementation. [PhD thesis], Technical University of Denmark, DTU, 2005.

[7] C. Hilton and B. Nelson. PNoC: a flexible circuit-switched NoC for FPGA-based systems. IEE Proc. Comput. Digit. Tech., Vol. 153, No 3, pp. 181-188, May 2006.

[8] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny. QNoC: QoS architecture and design process for network on chip. Journal of Systems Architecture, Volume 50 (2004), Issue 2-3 (Special Issue on Network on Chip), pp. 105-128.

[9] S.A. Asghari, H. Pedram, M. Khademi, and P. Yaghini. Designing and Implementation of a Network on Chip Router Based on Handshaking Communication Mechanism. World Applied Sciences Journal 6 (1): 88-93, 2009.

[10] A. Mello, L. Tedesco, N. Calazans, and F. Moraes. Evaluation of Current QoS Mechanisms in Networks on Chip. In Proc. of the Intl. Symp. on System-on-Chip, Tampere, Finland, November 2006, pp. 1-4.

[11] L. Benini and G. de Micheli. Networks on chips: a new SoC paradigm. Computer, vol. 35 (2002), no. 1, pp: 70–78.

[12] C.L. Liu, and J.W. Layland. Scheduling algorithm for multiprogramming in a hard real time environment. Journal of the Association for Computing Machinery, vol. 20 (1973), pp. 46-61.

[13] S. Carrillo, J. Harkin, L. McDaid, and F. Morgan. Advances in Scalable, Adaptive Interconnect for Reconfigurable Bio-Inspired Computational Platforms. DATE 2011 Workshop on Design Methods and Tools for FPGA-based Acceleration of Scientific Computing, Grenoble, France, March 2011.

[14] P. Vellanki, N. Banerjee, and K. S. Chatha. Quality-of-Service and Error Control Techniques for Mesh based Network-on-Chip Architectures. Integration, the VLSI Journal, Elsevier Publications, Vol. 38, Issue 3, pp 353-382, 2005.
