Adaptive Routing in Network-on-Chips Using a Dynamic-Programming Network

16
IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 58, NO. 8, AUGUST 2011 3701 Adaptive Routing in Network-on-Chips Using a Dynamic-Programming Network Terrence Mak, Member, IEEE, Peter Y. K. Cheung, Senior Member, IEEE, Kai-Pui Lam, and Wayne Luk, Member, IEEE Abstract—Dynamic routing is desirable because of its substan- tial improvement in communication bandwidth and intelligent adaptation to faulty links and congested traffic. However, im- plementation of adaptive routing in a network-on-chip system is not trivial and is further complicated by the requirements of deadlock-free and real-time optimal decision making. In this pa- per, we present a deadlock-free routing architecture which em- ploys a dynamic programming (DP) network to provide on-the-fly optimal path planning and network monitoring for packet switch- ing. Also, a new routing strategy called k-step look ahead is introduced. This new strategy can substantially reduce the size of routing table and maintain a high quality of adaptation which leads to a scalable dynamic-routing solution with minimal hard- ware overhead. Our results, based on a cycle-accurate simulator, demonstrate the effectiveness of the DP network, which outper- forms both the deterministic and adaptive-routing algorithms in average delay on various traffic scenarios by 22.3%. Moreover, the hardware overhead for DP network is insignificant, based on the results obtained from the hardware implementations. Index Terms—Adaptive routing, Bellman equation, dynamic programming (DP), DP network, network-on-chip (NoC). I. I NTRODUCTION I NTERCONNECT performance is rapidly deteriorating with the continuous scaling in technology processes. As pre- dicted by the International Technology Roadmap for Semicon- ductors (ITRS) in Fig. 1, there is a significant performance gap between interconnection RC delay and the gate delay, and this gap will be increasing exponentially (9:1 with the 65-nm technology, according to ITRS 2005 report [1]). The gap will continue to grow even with the help of new interconnect mate- rials and aggressive interconnect optimization [2], [3]. Further- more, because of the tightly packed wires, capacitances that are attributed to interconnect parasitic also increase drastically. As a result, multilevel interconnect networks have become the pri- Manuscript received December 31, 2009; revised April 28, 2010 and June 7, 2010; accepted September 6, 2010. Date of publication September 30, 2010; date of current version July 13, 2011. T. Mak is with the School of Electrical, Electronic and Computer Engi- neering, Newcastle University, NE1 7RU Newcastle upon Tyne, U.K. (e-mail: [email protected]). P. Y. K. Cheung is with the Department of Electrical and Electronic Engineering, Imperial College London, SW7 2AZ London, U.K. (e-mail: [email protected]). K.-P. Lam is with the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong (e-mail: [email protected]). W. Luk is with the Department of Computing, Imperial College London, SW7 2AZ London, U.K. (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIE.2010.2081953 Fig. 1. Projected relative delay for local and global wires and for logic gates in technologies of the near future. [1]. mary limit on the productivity, performance, energy dissipation, and signal integrity of gigascale integration [4]. Recently, network-on-chip (NoC) has been proposed as a promising solution to the increasingly complicated on-chip communication challenges [5]–[7]. Such architectures consist of a network of regular tiles where each tile can be an implementation of general-purpose processors, DSP blocks, memory blocks, and embedded reconfiguration modules, etc. Communications among these tile-based modules are follow- ing a packet-switch or circuit-switch scheme where messages are transmitted among the processing elements. The NoC architecture would be an ideal solution to provide effective integration for multiple modular blocks [8] and can potentially mitigate the gigascale integration challenge [9], [10]. In such an NoC environment, the routing of flits (or packets) becomes a critical issue, which determines the interprocessor communication performance. Routing provides a protocol for moving data through the NoC infrastructure and also deter- mines the path of data transport. The selection of commu- nication pathway would greatly affect the latency of packets transmitted from the source to the destination and, therefore, can have significant impact on the overall traffic flow in the network. An intelligent routing mechanism is required to uti- lize the communication bandwidth and minimize transportation latency. Dynamic routing (or adaptive routing) has been widely used in computer and data network design. Utilizing the online communication patterns and real-time information, dynamic 0278-0046/$26.00 © 2010 IEEE

Transcript of Adaptive Routing in Network-on-Chips Using a Dynamic-Programming Network

IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 58, NO. 8, AUGUST 2011 3701

Adaptive Routing in Network-on-Chips Usinga Dynamic-Programming NetworkTerrence Mak, Member, IEEE, Peter Y. K. Cheung, Senior Member, IEEE,

Kai-Pui Lam, and Wayne Luk, Member, IEEE

Abstract—Dynamic routing is desirable because of its substan-tial improvement in communication bandwidth and intelligentadaptation to faulty links and congested traffic. However, im-plementation of adaptive routing in a network-on-chip systemis not trivial and is further complicated by the requirements ofdeadlock-free and real-time optimal decision making. In this pa-per, we present a deadlock-free routing architecture which em-ploys a dynamic programming (DP) network to provide on-the-flyoptimal path planning and network monitoring for packet switch-ing. Also, a new routing strategy called k-step look ahead isintroduced. This new strategy can substantially reduce the sizeof routing table and maintain a high quality of adaptation whichleads to a scalable dynamic-routing solution with minimal hard-ware overhead. Our results, based on a cycle-accurate simulator,demonstrate the effectiveness of the DP network, which outper-forms both the deterministic and adaptive-routing algorithms inaverage delay on various traffic scenarios by 22.3%. Moreover, thehardware overhead for DP network is insignificant, based on theresults obtained from the hardware implementations.

Index Terms—Adaptive routing, Bellman equation, dynamicprogramming (DP), DP network, network-on-chip (NoC).

I. INTRODUCTION

INTERCONNECT performance is rapidly deteriorating withthe continuous scaling in technology processes. As pre-

dicted by the International Technology Roadmap for Semicon-ductors (ITRS) in Fig. 1, there is a significant performancegap between interconnection RC delay and the gate delay, andthis gap will be increasing exponentially (9:1 with the 65-nmtechnology, according to ITRS 2005 report [1]). The gap willcontinue to grow even with the help of new interconnect mate-rials and aggressive interconnect optimization [2], [3]. Further-more, because of the tightly packed wires, capacitances that areattributed to interconnect parasitic also increase drastically. Asa result, multilevel interconnect networks have become the pri-

Manuscript received December 31, 2009; revised April 28, 2010 andJune 7, 2010; accepted September 6, 2010. Date of publication September 30,2010; date of current version July 13, 2011.

T. Mak is with the School of Electrical, Electronic and Computer Engi-neering, Newcastle University, NE1 7RU Newcastle upon Tyne, U.K. (e-mail:[email protected]).

P. Y. K. Cheung is with the Department of Electrical and ElectronicEngineering, Imperial College London, SW7 2AZ London, U.K. (e-mail:[email protected]).

K.-P. Lam is with the Department of Systems Engineering and EngineeringManagement, The Chinese University of Hong Kong, Shatin, Hong Kong(e-mail: [email protected]).

W. Luk is with the Department of Computing, Imperial College London,SW7 2AZ London, U.K. (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIE.2010.2081953

Fig. 1. Projected relative delay for local and global wires and for logic gatesin technologies of the near future. [1].

mary limit on the productivity, performance, energy dissipation,and signal integrity of gigascale integration [4].

Recently, network-on-chip (NoC) has been proposed as apromising solution to the increasingly complicated on-chipcommunication challenges [5]–[7]. Such architectures consistof a network of regular tiles where each tile can be animplementation of general-purpose processors, DSP blocks,memory blocks, and embedded reconfiguration modules, etc.Communications among these tile-based modules are follow-ing a packet-switch or circuit-switch scheme where messagesare transmitted among the processing elements. The NoCarchitecture would be an ideal solution to provide effectiveintegration for multiple modular blocks [8] and can potentiallymitigate the gigascale integration challenge [9], [10].

In such an NoC environment, the routing of flits (or packets)becomes a critical issue, which determines the interprocessorcommunication performance. Routing provides a protocol formoving data through the NoC infrastructure and also deter-mines the path of data transport. The selection of commu-nication pathway would greatly affect the latency of packetstransmitted from the source to the destination and, therefore,can have significant impact on the overall traffic flow in thenetwork. An intelligent routing mechanism is required to uti-lize the communication bandwidth and minimize transportationlatency.

Dynamic routing (or adaptive routing) has been widely usedin computer and data network design. Utilizing the onlinecommunication patterns and real-time information, dynamic

0278-0046/$26.00 © 2010 IEEE

3702 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 58, NO. 8, AUGUST 2011

routing can effectively avoid hot spots or faulty componentsand can reduce the possibility of packets being continuouslyblocked. Several partially adaptive-routing algorithms withinthe context of NoC were proposed, and the evaluations of theirperformances were reported. For example, implementation ofwormhole-adaptive odd–even routing was described in [8],[11], and [12]. In [13], a minimal routing mechanism withpartially adaptive protocols was proposed. However, implemen-tation of adaptive routing in an NoC system is not trivial andis further complicated by the requirements of deadlock-freeand real-time optimal decision making. Also, the previouslyproposed adaptive approaches only exploit local traffic whichlead to a moderate improvement in packet latency and trafficload balancing. Optimal path planning and routing adaptations,which were considered as hardware expensive as their counter-parts in computer networks, are rarely studied.

In this paper, we introduce a novel methodology to enabledynamic routing in an NoC. A massive parallel and high-throughput network architecture, namely, dynamic program-ming (DP) network, that provides real-time computation forshortest path problems is presented. This network coupleswith the NoC to enable optimal traffic control based on theonline network status and, thus, provides optimal path planningand dynamic routing with novel routing mechanics. The DPnetwork presents a simple, reliable, and efficient methodologyto enable adaptive routing in NoCs. The major contributions ofthis paper are as follows.

1) A novel DP network for high speed and parallel shortestpath computation is presented. The characteristics ofthe DP network, such as discrete and continuous-timeformulations, network dynamics, and convergence, arediscussed, and two numerical examples are presentedto exemplify the high-gain and versatility properties.(Section III)

2) Integration of DP network and NoC architecture as a dualnetwork is introduced. Routing mechanics and routing-table updating strategies, such as fully optimal and sub-optimal k-step look ahead (KLSA), are presented. Thedual network enables a tradeoff between the routing op-timality and memory consumption. Network scalabilityand deadlock issues are also discussed. (Section IV)

3) Performances and merits of the DP network are investi-gated thoroughly through experimental studies based onSystemC cycle accurate simulator. The new method iscompared with other popular routing schemes, such asXY and odd–even, in different traffic benchmarks andlarge-scale NoC architectures. (Section V). The proposedDP-network architecture is realized using Xilinx field-programmable gate array (FPGA) device, and hardwareoverhead and performances are evaluated. (Section VI)

II. PRELIMINARIES

A. Routing in NoC

NoC is an architecture inspired by data-communication net-works, such as Internet, communication [14], and wirelessnetworks [15], with interprocessor communication supported

by a packet-switched and circuit-switched networks [5], [6].The basic idea of NoC is to communicate across the chip ina way similar to that of messages transmitted over the Internetas the methods and architectures from the computer networkcould be borrowed and adopted to the on-chip communicationand can potentially resolve the interconnect scaling challenges.It has been reported that the NoC architecture can effectivelyovercome the long-wire disadvantages from bus architecturesas on-chip switches are connected in a regular topology withpoint-to-point basis, and long wires can be eliminated fromthe architecture [10]. Also, the architecture is decoupled intodifferent layers, such as transaction and physical layers. Thus,the layered architecture enables independent optimization anddesign for each independent abstract layer.

Given an NoC architecture, routing becomes the most impor-tant design strategy to consider, which determines the overallsystem performance. Routing strategies can be categorized intodeterministic and adaptive schemes. In a deterministic routingstrategy, source and destination determine the traversal path.Popular deterministic routing schemes for NoC are sourcerouting and XY routing, which are also referred to as 2-Ddimension-order routing [16]. In source routing, the source corespecifies the route to the destination. In XY routing, the packetfollows the rows first then moves along the columns towardthe destination, or vice versa. XY routing can be implementedusing algorithmic routing logic but is limited to regular networktopologies.

In an adaptive-routing strategy, the traversal path is decidedon a per-hop basis. Adaptive schemes involve dynamic arbi-tration and next-hop selection mechanisms, i.e., based on locallink congestions. There are several adaptive-routing algorithmsthat have been proposed within the context of NoC [17]. Forexample, a methodology that focuses on deadlock-free adaptiverouting has been proposed in [18], which provides a frameworkto design routing tables that can outperform the turn-model-based deadlock-free routing algorithm. Other schemes, such asthe adaptive odd–even [8], [12] and adaptive selection node-on-path (NoP) [13], also provide routing adaptability but onlyexploit local traffic or conditions of neighbors. There is a greatpotential to improve communication efficiency by consider-ing the global traffic at runtime using adaptive routing, suchas global traffic monitoring [19] and adaptive global routing[20]. However, these approaches employ either a rule-basedapproach or heuristics for traffic adaptation. Utilizing an on-demand shortest path computation could improve the routingoptimality and adaptability effectively.

Minimal-cost (or shortest path) computation is fundamentalamong different dynamic-routing strategies. The basic idea isthat the routing algorithm always chooses the least congestedpath toward the destination through optimal path planning. Theleast congested route can be found based on the shortest pathcomputation where the path cost is obtained at runtime. Sincethe network status, such as traffic intensity and conditions, ischanging at runtime, the dynamic-routing algorithm should beable to discover the congestions and perform shortest path com-putation at the same time. A novel DP-network architecture thatprovides real-time shortest path computation and optimal pathplanning is proposed in this paper. The background of shortest

MAK et al.: ADAPTIVE ROUTING IN NETWORK-ON-CHIPS USING DYNAMIC-PROGRAMMING NETWORK 3703

TABLE INOTATIONS USED IN THIS PAPER

path computation and the parallel computation architecture aredescribed in the following.

B. Shortest Path Computation

DP is a powerful mathematical technique for making asequence of interrelated decisions. Bellman formalized the termDP and used it to describe the process of solving problemswhere one needs to find the best decision one after another[21], [22]. It provides a systematic procedure for determiningthe optimal combination of decisions which takes much lesstime than naïve methods [23]. In contrast to other optimizationtechniques, such as linear programming (LP), DP does notprovide a standard mathematical formulation of the algorithm.Rather, DP is a general type of approach to problem solving,and it restates an optimization problem in recursive form, whichis known as Bellman equation [21], [22]. The Bellman equationfor optimal-value function V (·) is unique and can be defined asthe solution to the recursive equation [22], [24].

The shortest path problem can be described as follows: Givena directed graph G = (V,A) with n = |V| nodes, m = |A|edges, and a cost associated with each edge u → v ∈ A, whichis denoted as Cu,v , the edge cost can be defined subject todifferent applications, and the cost is defined as the numberof flits or packets in a buffer in this paper. The total costof a path p = 〈n0, n1, . . . , nk〉 is the sum of the costs of itsconstituent edges: Cost(p) =

∑ki=1 Ci−1,i. The shortest path of

G from ni to nj is then defined as any path p with cost that ismin

∑ki=1 Ci−1,i for all constituent edges ni. The notations are

summarized in Table I.The shortest path problem as a linear optimization problem

can be formally stated. Suppose that node nw is the destinationnode and it aims to compute the shortest path cost d(v, w) ∀v ∈V . To express this as a linear program, the constraint becomesd(v, w) ≤ d(u,w) + Cu,v to denote that the cost of the shortestpath from any node nv to destination nw is less than or equalto the shortest path from node nu plus the cost of a directpath from node nu to node nv . The destination node nw vertex

initially receive a value d(w,w) = 0. Thus, the following LPformulation can be obtained:

minimize∑∀v∈V

d(v, w)

subject to d(v, w) ≤ d(u,w) + Cv,u ∀v, u ∈ Vd(w,w) = 0.

The previous formulation yields the shortest path from anynodes in V to destination nw, which is known as multiple-source–single-destination shortest path problem. Solution of anLP problem can be resolved readily using any standard LPsolver [25].

Alternatively, the shortest path problem can be stated in theform of Bellman equation, which defines a recursive procedurein step k and can lead to a simple parallel architecture tospeed up the computation. To find the cost of the shortest pathfrom nv to nw, it requires the notion of DP value or, namely,cost-to-go function, which is the expected cost from nv tonw. This expected cost is being updated recursively based onthe previous estimates until it reaches its optimality criteria.This algorithm is known as DP. We denote the DP value fornv to nw at the kth iteration as V (k)(v, w), and V ∗(v, w) isthe optimal DP value, which is equal to the resolved variabled(v, w) from the aforementioned LP formulation. The Bellmanequation becomes

V (k)(v, w) = min∀u∈V

{V (k−1)(u,w) + Cv,u

}(1)

where V (w,w) = 0. If the recursion is expanded from n0 tonk, the DP value can be expressed as the total cost of the pathfrom node n0 to node nk

V ∗k (n0, nk) = min

{n0,n1,...,nk}∈P kn0,nk

{k∑

i=1

Ci−1,i

}(2)

where destination node nw = nk and P ki,j are the set of paths

from ni to nj , all of which have k edges. In addition, theoptimal decisions at each node ni that lead to the shortest pathcan be readily obtained from the argument of the minimumoperator at the Bellman equation as follows:

nv = arg min∀u∈V

{V ∗(u,w) + Cv,u} (3)

where the optimal decision becomes μ(v, w). Both the LP andDP can yield the optimal solution for shortest path problems.However, the DP approach presents an opportunity for solvingthe problem using a parallel architecture and can greatly im-prove the computational speed.

III. SHORTEST PATH COMPUTATION USING DP NETWORK

A. General Architecture

Mapping Bellman recursive DP to a parallel computationplatform can be realized with the introduction of a DP-networkarchitecture. The network has a parallel architecture and can be

3704 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 58, NO. 8, AUGUST 2011

Fig. 2. Unit interconnection in a general DP-network architecture, where 1 ≤i, j, k ≤ n; k �= i, j. Unit (i, j) outputs the cost-to-go value V (i, j), whichwill be the input of other units according to the problem network structure.At each unit, there are |N (i)| sites, which correspond to the total number ofneighboring nodes of i, to carry out the inference operations as defined in thesite function.

used to derive DP solution through the simultaneous propaga-tion of successive inferences. Originally, it provides an efficientplatform for checking data inconsistency due to results fromdifferent inference paths [26]. In [26], with close resemblanceto the deterministic-type DP formulation on closed semiring,Lam and Tong introduced a continuous-time ordinary differ-ential equation (ODE) network to solve a set of graph opti-mization problems with an asynchronous and continuous-timecomputational framework. This new class of inference networkis inherently stable in all cases, and it has been shown to berobust and with arbitrarily fast convergence rate [26]. A similarparallel computational network for DP has also been proposedin [24]. The network was proven to converge to optimal solutioneven under an asynchronous network.

A DP network is formed by the interconnection of self-contained computational units. Fig. 2 shows the structure of aunit and the connections in a general inference network. Eachunit is to represent a binary relation 〈i, j〉 between two objectsi and j and is denoted by U(i, j). At each unit, there are |N (i)|sites, which correspond to the total number of neighboringnodes of i, to carry out the inference operations, as defined inthe site function. The value of the corresponding relationshipbetween i and j is then determined by resolving the conflictamong all of the site outputs. In essence, if Sk(i, j) representsthe site output at the kth site and g(i, j) stands for the unitoutput of unit (i, j), then

Sk(i, j) = g(i, k) ⊗ g(k, j) (4)g(i, j) = ⊕∀k∈N (i) Sk(i, j) (5)

where ⊗ is the inference operator for the site function (which isusually the same at all of the sites) and is the conflict-resolutionoperator for the unit function. Also, the computational unit ⊕denotes the unit which resolves the binary relation (i, j).

The shortest path problem can be mapped to the DP network.For the original problem graph, each node refers to a processorunit. However, in the DP network, each computational unitU(i, j) represents the binary relation, i.e., the expected distancebetween node i and j. When the network has converged,the solution of the problem would be found at the output ofeach computational unit. In general, if there are m nodes inthe original graph, then the DP network (based on the Bell-

man equation) will have m − 1 functional units with U(i, j),where i = 1, 2, . . . , j − 1, j + 1, . . . , m.. By supposing that theinterconnection network has a fixed topology, the multiple-source–multiple-destination solutions can be obtained by ap-plying the DP network m times for computing the shortest pathsfor m different destinations.

Let g(i, k) = Ci,k and g(k, j) = V (k, j). The architecture ofthe DP network can then be defined as follows. A DP networkfor the shortest path problem can be stated in terms of networkstructure as ⊗ is substituted by “+” and ⊕ is substituted by“min” as

Sk(i, j) = g(i, k) + g(k, j) (6)g(i, j) = min

∀k∈N (i)Sk(i, j). (7)

The computational units are interconnected and resemble theshortest path problem structure. Each unit represents a node,and an interconnection represents an edge. With the realization,the network converges, and the optimal solution can be read-ily implemented using a distributed network. Note that whenthe network resolves, the optimal cost-to-go function can beobtained as V ∗(i, j) = g(i, j). Also, this network architectureencompasses the advantage of simplicity and parallelization,which presents a great opportunity to be applied for on-chiprouting and optimization.

B. Discrete and Continuous-Time Formulations

The recursive formulation of the Bellman equation onlyspecifies the mechanism to update value V (u, v), as can befound from the classical Value Iteration algorithm [23], [27].Therefore, the priority and order of the updating process arenot relevant, and the value V (u, v) can be computed asyn-chronously. This allows an opportunity to design distributedcomputation system to realize the DP network with distributedcomputational units without synchronous control. Furthermore,the asynchronous property can be further exploited to considera continuous-time framework of the DP network, as opposedto the discrete-time DP network. The continuous-time formu-lation provides an analytical framework to study the networkproperties, such as network convergence. In the following, boththe discrete and continuous-time formulations are discussed.

1) Continuous-Time Formulation: Consider a DP networkthat is constructed based on the original shortest path problem.Computational unit i is interconnected with adjacent node j,∀j ∈ N (i), where Ci,j is finite. Assume that the min and +operators require an infinitesimal time δt; the output of theoperator at time t + δt can be expressed as [26]

gt+δt(i, j) = min∀k∈N (i)

{g(k, j) + Ci,k} (8)

Assuming that the transition costs between the current node andthe nonadjacent nodes are infinite, minimizing only over the setof neighboring nodes in (8) is equivalent to minimizing over allnodes. Also, minimizing only over the adjacent nodes leads toa hardware realization with smaller cost. Suppose that the costfunction Ci,j is a constant and the min and + operators requirean infinitesimal time, each computational unit U(i, j) could

MAK et al.: ADAPTIVE ROUTING IN NETWORK-ON-CHIPS USING DYNAMIC-PROGRAMMING NETWORK 3705

then behave dynamically as a first-order system. The wholenetwork can be described by a set of differential equations

dg(i, j)dt

= −λig(i, j) + λi min∀k∈N (i)

{g(k, j) + Ci,k},∀i (9)

where λi is the system pole for unit U(i, j), which controls therate of how g(i, j) may change. If λi = 0, then |dg(i, j)/dt| =0, and g(i, j) becomes a constant, and the unit is said tobe fully constrained and has a fixed memory. Whereas, fora memoryless unit with λi = ∞, it has an infinite power tochange because |dg(i, j)/dt| can be made arbitrarily large.Also, the units are interconnected based on N (i), which definesthe set of adjacent nodes of unit U(i, j). Therefore, g(k, j) isthe output of unit U(k, j), which is an adjacent unit of U(i, j)in N (i).

2) Discrete-Time Formulation: The equivalent discrete for-mulation can be obtained based on (9). Let δt = 1. The systemof differential equations (9) then becomes

gt+1(i, j) = λi min∀k∈N (i)

{gt(k, j) + Ci,k}∀i (10)

where λi defines the converging time constant, which controlsthe convergence speed of the system, as will be shown in thenext section.

C. Convergence of the Network

There are two important considerations in using a DP net-work. First, will the network always converge to the desiredsolution? Second, what are the parameters or conditions thataffect the convergence rate of the network? The answer tothe first question is a “yes” because it follows directly fromthe principle of the Bellman optimality equation which statesthat the constituent optimal expected value of all states areoptimal. The local minimization based on the Bellman equationperformed at each distinct unit, in fact, is driving the network toa global optimal state, which is the desired solution. To measurethe “distance” of the network from this global minimum and inline with Hopfield’s energy modeling in [28], the computationalenergy E(t) can be defined as the root-mean-square (rms) errorif the system deviates from the optimal solution. From (9), theenergy function for the continues-time ODE can be stated as

E(t) =∑∀i

(−λig(i, j) + λi min

∀k∈N (i){g(k, j) + Ci,k}

)2

(11)

where E(t) = 0 when the network has converged. To determinethe convergence rate of the network, an explicit expression fordE(t)/dt has to be evaluated. By differentiating the energyfunction in (11), the following expression is obtained:

dE(t)dt

=dE(t)dg(i, j)

· dg(i, j)dt

=∑∀i

[d

dg(i, j)

(− λig(i, j) + λi min

∀k∈N (i){g(k, j)

+ Ci,k})2

· dg(i, j)dt

]. (12)

By evaluating the first term in (12), the following expressionis obtained:

dE(t)dt

=∑∀i

[−2λi

(−λig(i, j)+λi min

∀k∈N (i){g(k, j)+Ci,k}

)

·dg(i, j)dt

](13)

=∑∀i

[−2λi

(dg(i, j)

dt

)2]

. (14)

Note that in order to establish the aforesaid expression, itis assumed that all outputs of units g(i, j) do not provide afeedback to the unit itself. Thus, in the set of neighboring nodes,∀k ∈ N (i), k �= i. Hence, all the factors that make up the sumof the right-hand side of (14) are nonnegative. In other words,the energy function E(t) defined in (11) is a monotonicallydecreasing function of time as

dE(t)dt

≤ 0. (15)

From the definition of (11), note that the function E(t) isbounded. The time evolution of the continuous DP-networkmodel described by the system of first-order differential equa-tions in (9) represents a trajectory in the station space, whichseeks out the minima of the energy function E(t) and comesto a stop at such fixed point. From (14), note that the derivativedE(t)/dt vanishes only at the point that satisfies the Bellmanoptimal criterion

dg(i, j)dt

= 0 ∀i. (16)

D. Numerical Examples

Example 1: Computing the Expected Costs in a Ten-NodeArray: A ten-state random-walk problem can be solved bya ten-unit continuous-time DP network. The ten states areindexed by Si, i = 1, 2, . . . , 10. The outputs of the ten unitsof the network, signifying the expected costs to the destina-tion, are described by a vector V (Si, S10), i = 1, 2, . . . , 10,which has a semantic meaning of the expected reward ofV (Si, S10), i = 1, 2, . . . , 10. Also, the transition cost is definedas Ci,i+1 = 1 and Ci+1,i = 1 for all i, j = 1, 2, . . . , 9, andCi,j = ∞ for all j �= i + 1 and j �= i − 1. The continuous-timeDP network can be modeled by a set of differential equationson the ten nodes Si. The expected rewards V (Si, S10) evolveas first-order lag controlled by λ, which is the reciprocal ofthe network-convergence-time constant. In particular, it relatesto the computational delay of each computational unit in anetwork implementation, and the latency of information prop-agates throughout the network. The discount factor γ is aproblem-related parameter, which defines the discount factorfor multistage cost. The value is independent of the network

3706 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 58, NO. 8, AUGUST 2011

Fig. 3. Convergence of a DP network for the ten-node array random-walkproblem where the time constant 1/λ = 1 ns. Each curve corresponds to theoutput of each unit and represents the cost-to-go value from that node to thedestination node S10.

implementation and is subject to the requirements of the objec-tive function

dV (S1, S10)dt

= − λV (S1, S10) + λγV (S2, S10) (17)

dV (Si, S10)dt

= − λV (Si, S10)

+ λ min{Ci,i−1 + γV (Si−1, S10), Ci,i+1

+ γVS(i+1, S10)} ∀i = 2, 3, . . . , 9

(18)

V (S10, S10) = 0. (19)

Equation (17) describes the boundary node S1 which has asingle “right” action. For nodes Si, i = 2, 3, . . . , 9, they haveboth left and right actions and can be readily shown to followthe equations as typified in (18). A destination-node valueV (S10, S10) is defined to be zero, as in (19).

Given arbitrary positive initial values of V (Si, S10) ∀i,the converged values of the respective differential equations[(17)–(19)] can be verified to be identical with the optimalvalues governed by the Bellman equations. Fig. 3 shows theconvergence results obtained by using Matlab ODE solver1

for the differential equations. The converged values are foundto be [6.10, 5.67, 5.20, 4.68, 4.10, 3.44, 2.71, 1.90, 1.00, 0].The results are verified correctly against the results computedusing the well-known Bellman–Ford algorithm for shortest pathproblems. Also, note that node S9 is the quickest to converge,whereas S1 is the slowest. This is because there is a dependence

1The differential equation solver is based on ode45, which is provided inthe Matlab. The ode45 is based on an explicit Runge–Kutta formula, theDormand–Prince pair.

Fig. 4. Convergence of the cost-to-go values of all the nodes from a 10 × 10mesh network. (a) t = 1 ns. (b) t = 5 ns. (c) t = 10 ns. (d) t = 20 ns.

on the expected cost, and it takes the longest time for informa-tion to propagate to S1 from S10.

Example 2: Computing the Expected Costs in a 10 × 10Mesh: Consider a 100-node network with 10 by 10 meshinterconnection. Each node only connects to a maximum of fouradjacent nodes, while each node at the edges connects to three,and each node at the corners connect to two. The nodes areoriented as a perfect square. All transitions would result in acost of one, and the destination node at the center would havean expected cost of zero.

Similar to Example 1, the continuous-time DP network canbe modeled by 100 differential equations on the 100 nodesSij ∀i, j = 1, 2, . . . , 10. The expected cost V (Sij , Sij) ∀i, j =1, 2, . . . , 10 evolves as first-order lag controlled by λ.

Let 1/λ = 2 ns and the destination node to be S5,5. Thevalues of the expected cost are shown in Fig. 4. At timet = 1 ns, the expected costs are randomly initialized, andV (S5,5, S5,5) = 0 as S5,5 is the destination node. The networkbegins to converge to the optimal solution at time t = 20 ns,and the intermediate results are also shown in the figure. Theconvergence of the DP network in the 2-D mesh depends on λ,which, in this example is equal to 0.5. The network settles to thedesired solution at t = 20 ns. By increasing λ, the time neededfor the network to settle decreases. Also, even if λ is a largevalue (e.g., λ = 0.9), the network still converges to the optimalsolution.

Fig. 5 shows the convergence of the network with different λvalues. The results are rms errors between the VS output fromthe network and the values obtained using the Bellman–Fordalgorithm, averaged over the 10 × 10 mesh example. Clearly,λ is the reciprocal of the network time constant, which governsthe time required to obtain the optimal solutions.

E. Summary

In this section, the characteristics of the DP network havebeen discussed. The DP network can be formulated in discreteand continuous-time forms. The monotonic property of the

MAK et al.: ADAPTIVE ROUTING IN NETWORK-ON-CHIPS USING DYNAMIC-PROGRAMMING NETWORK 3707

Fig. 5. RMS error of the DP network for computing shortest paths in a 10 ×10 mesh network with different λ values, where λ is the reciprocal of thenetwork time constant.

continuous-time network has been shown, and the networkconvergence has been discussed. The convergence rate of thenetwork depends on λ, which is the time constant that variesbased on the different implementation platforms. In the fol-lowing, the embedding of the DP network in NoC, to provideshortest path computation on-the-fly, and the dynamic routing,to enhance the network utilization, are discussed.

IV. NoC ROUTING WITH DP NETWORK

A. Routing Architecture

An interesting feature of an on-chip communication networkor NoC is that the communication network itself defines thegraph of the shortest path problem. This provides an opportu-nity to compute the optimal path by embedding a DP unit ateach node. Unlike the general computer network, the shortestpath routing computation is solely attributed to the processors ateach node. The NoC environment demands tighter timing andperformance constraints as well as more flexible implementa-tion methodologies, which can be achieved by implementing aDP-network architecture.

The DP network shown in Fig. 6 consists of distributedcomputational units and links between the units. The topologyof the network resembles the defined graph topology, which isthe communication structure of an NoC. At each node, there isa computational unit, which implements the DP unit equationsin (10). The numerical solution of the unit will be propagated tothe neighboring units via the neighborhood interconnects. TheDP network is tightly coupled with the NoC, and each compu-tational unit locally exchanges control and system parameterswith the tile or core. The DP network quickly resolves theoptimal solution, as will be shown later in this paper, and willpass the control decisions to the router or other controllers in thetile, while the real-time information, such as average queuingtime, will be inputted to the computational unit.

The DP network presents several distinguishing features toan on-chip communication system. First, the distributed archi-tecture enables a scalable real-time monitoring functionality forthe NoC. Each computational unit acquires local information,and, through communication with neighboring units, a globaloptimization can be achieved. Second, because of the simplicityof the computational unit, the dedicated DP network provides a

Fig. 6. Example of a 3 by 3 mesh network coupled with a DP network.

real-time response and will not consume any data-flow networkbandwidth. Third, because of the convergence property, asdiscussed in Section III-C, the DP network provides an effectivesolution to optimal path planning and dynamic routing.

1) DP Routing Mechanics: Consider a node–table-routingarchitecture in which the routing table is stored at each router.The destination of the header flit will be checked, and it willdecide the routing direction based on the routing-table entries.In contrast to the table-based routing in which a routing algo-rithm computes the route or next hop of a packet at runtime,algorithmic routing is more restrictive to simple routing algo-rithms and can only be applied on regular topologies, such asa mesh topology. The routing-table approach enables the useof per-hop network state information, such as queue lengths,to select among several possible next-hop at each stage of theroute.

Algorithm 1 presents an algorithm for updating the routingtable with a DP network. At each node unit, there are k inputsfrom the k neighbor nodes for the expected costs. The outputof the unit at node ni is the updated expected cost V (i, j) andis sent to all adjacent nodes. The main algorithm is outlinedin lines 4–10. For each destination j and direction k, theexpected cost will be computed, and the minimum cost will beselected, as stated in line 8. The optimal direction for routing isselected and used to update the routing table, as stated in line 9.Although the algorithm consists of two for loops, this can berealized in a hardware with a parallel architecture, and thecomputational-delay complexity can be reduced to linear.

Algorithm 1 Update routing table for destination nj

1: Inputs: V (i, j), i ∈ N (i), where N (i) returns allneighbor nodes of ni, and i = 1, 2, . . . , N

2: Outputs: V ∗(i, j)3: Definitions:

ni is the current node;Ci,k is input queue-length node i from direction k

4: for all i such that ni ∈ V do

3708 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 58, NO. 8, AUGUST 2011

5: for all k directions such that k ∈ N (i), where N (i)return all neighbor nodes of ni do

6: V ′(i, k, j) = V (k, j) + Ci,k

7: end for8: V ∗(i, j) = min∀k V ′(i, k, j)9: μ(i, j) = arg min∀k V ′(i, k, j) {Update routing table}

10: end for

Many routers use routing tables either at the source (sourcerouting) or at each hop (node–table routing) along the routeto implement the routing algorithm. In adaptive routing, therouting table is updated dynamically or periodically, such thatthe communication traffic can be altered subject to the choiceof switching mechanisms. The DP network does not interactor interfere with the packet-switching mechanisms but alter therouting table at runtime. Also, a mesh-network topology will beused throughout this paper for illustrating the idea. However,the proposed methodology is not limited to the mesh topology,and simple modifications can be made for tackling networkof different forms, such as torus, butterfly fat tree, and othercustom-designed topology, based on the flexible routing-table-based design. Also, the numerical accuracy of the cost estimatesmight affect the network performance. Due to the nature of theDP-based decision making, the absolute cost is not crucial tothe decisions but the difference between the costs. A reasonablebit width is adopted, e.g., 8 b, to be allocated to realize the DPcomputation throughout this paper.

Deadlock can effectively be avoided by adopting one ofthe deadlock-free turn model. In this paper, the west-first [29]turn model is used. It prohibits all turns to the south–westand north–west direction. The dynamic-routing scheme will beswitched to XY routing whenever the destination node is withinthese directions. In this case, the north–west and south–westturns are removed, and thus, the routing dependences will neverform a cycle in the network. Alternatively, other turn models,such north-last, can also be applied in the DP network toavoid deadlock, with a similar performance, at the designer’sdisposal.

2) DP Network Computational Complexity: The delay ofthe DP network converges to an optimal routing solution de-pending on the network topology, which determines the delayinformation propagates within the network and the delay ofeach computational unit. It can be seen that each unit involvesO(|A|) additions and comparisons, where |A| is the numberof edges. Note that the number of additions corresponds to thenumber of adjacent nodes, and |A| is an upper bound, whichcorresponds to the configuration of a fully connected network.Hence, the worst case solution time is O(k|A|), where k isthe number of iterations evaluated by each unit. In softwarecomputation, k is equal to the number of nodes in the network;thus, k = |V|, which guarantees that all nodes have been up-dated [23]. However, in hardware implementation with parallelexecution, k is determined by the network structure, and Aadditions can be executed in parallel. Each computational unitcan simultaneously compute the new expected cost for allneighboring nodes. Therefore, the solution time becomes thetime for the updated value to be distributed to every other node,

TABLE IICONVERGENCE ANALYSIS OF A DP NETWORK

FOR DIFFERENT NETWORK TOPOLOGIES

Fig. 7. Comparing the two routing strategies, where the shaded area repre-sents the nodes covered in the routing table and ns is the source and nt isthe destination. (a) Optimal decision can be made at ns. (b) Since V (s, u) ≤V (s, w) and the Manhattan distances for nu to nt and nw to nt are the same,nu is selected as transition node in the suboptimal path to nt.

and the computational complexity becomes O(1). Also, it isassumed that the comparator delay is transient in time andis independent of the network size. For a more conservativeestimation of computational delay, we can assume a binary-tree comparator to be implemented, and the computationalcomplexity becomes O(log2(|A|)). Consider a mesh networkwith N nodes with

√N rows and

√N columns. The longest

path in this network is 2√

N − 1, which is the minimum timerequired for updating the expected costs at all nodes. Therefore,the network convergence time is proportional to the networkdiameter, which is the longest path in the network. The DP-network convergence time for some of the network topologiesare summarized in Table II.

B. Optimality and Memory Tradeoff

One concern for the table-based routing mechanics is therouting-table size, which requires allocation of memory orregisters. Even though the adaptive routing brings in substan-tial advantage in routing delay and throughput, the memoryrequirement could sometimes become a hindrance for the sys-tem to scale up [16]. In this Section, a new method, namely,KSLA, is introduced. This method yields a suboptimal solutionin dynamic routing but can substantially reduce the memoryrequirement.

Instead of storing routing decisions for all destinations in arouting table, storing a table that provides optimal decision tolocal premises can enable a suboptimal path to the destinationswith a substantial reduction on the storage requirement. Theidea is that each router computes the routing decisions for nodesthat are k steps away from the current node. A k-step regionis shown in the shaded area in Fig. 7(b). If the destination iswithin the k-step region, an optimal decision is readily availablein the routing table. Otherwise, a transition node nu is selected

MAK et al.: ADAPTIVE ROUTING IN NETWORK-ON-CHIPS USING DYNAMIC-PROGRAMMING NETWORK 3709

such that the sum of the DP value to the transition node andthe Manhattan distance from that node to the destination is thesmallest. These procedures repeat at each hopping step, andeventually, the packet arrives at the destination in a suboptimalroute. Fig. 7 shows the two strategies graphically.

Algorithm 2 KSLA Routing Algorithm1: Inputs: Destination node nt

2: Outputs: Routing direction μ(s, t)3: Definitions:

ns is the current node;D(s, t) returns the number of steps from s to t;μ(s, t) returns the routing direction of destination t atnode s;k(s) returns a set of nodes that are k steps away from s;M(i, j) returns the Manhattan distance from ni to nj .

4: if D(s, t) ≤ k then5: return μ(s, t)6: else7: for all nodes i such that i ∈ k(s) do8: V ′(s, i, t) = V (s, i) + M(i, t)9: end for10: μ(s, t) = arg min∀i∈k(s) V ′(s, i, t)11: end if12: return μ(s, t)

The KSLA algorithm is presented in Algorithm 2. The inputsare the destination nodes, which are the same as the routerdesigned for the global optimal path planning. For every flit orpacket, the algorithm checks whether this destination is withinthe k-step region. This can be achieved differently for differenttopologies. For a mesh, this can be checked by analyzing thecoordinates and comparing the Manhattan distances. Extensionof KSLA to irregular and other topologies requires implemen-tation of other heuristics, which will be studied in our futurework. This step is line 8 in Algorithm 2. If the destination iswithin the k-step region, the optimal routing decision can bereadily retrieved from the routing table. If the destination isoutside the region, which is not covered in the routing table,the algorithm finds a node within the region that is closest tothe destination and with minimal cost. In line 10, the conditionensures that the node chosen is the closest to the destination.Lines 7–10 are aiming to find a node that is leading to thedestination node with the minimal expected cost. Finally, inline 11, this node within the region will be output as the next-hop direction.

With the optimal routing scheme, the total cost to go fromnode ns to nt is

V ∗(n0, t) = min∀ni∈P m

n0,nm

⎧⎨⎩

m∑j=1

Ci−1,i

⎫⎬⎭ (20)

where i = 0, 1, . . . ,m and nm = nt. In other words, eachrouter is able to look ahead for all possible paths Pm

n0,nmto

the destination and choose the one with minimal delay. For theKSLA approach, the routers can only look ahead for k steps

Fig. 8. Theoretical estimates for the approximation error of the KSLA ap-proach with respect to optimal DP values and routing-table size in terms of theaddress space for the corresponding k values.

at each round. Therefore, the total expected cost W k(n0, t)becomes the sum of the �m/k� rounds of k-step propagationsplus the expected cost of the last round, which requires stepsthat are less than or equal to k

W kt (n0) =

�m/k�∑l=0

min∀ni∈P k

nlk,n(l+1)k

⎧⎨⎩

(l+1)k∑j=lk+1

Ci−1,i

⎫⎬⎭

+ min∀ni∈P m

n�m/k�k,nm

⎧⎨⎩

m−�m/k�k∑j=�m/k�k+1

Ci−1,i

⎫⎬⎭ (21)

where m ≥ k, i = 0, 1, . . . ,m and nm = nt. Suppose that theintermediate nodes in the KSLA are the same as those inthe optimal path Pm

n0,nm, the path produced by KSLA is the

optimal. In this case, the lower error bound for KSLA iszero, with W k(n0, t) = V ∗(n0, t). Furthermore, the expectedcost between the optimal and KSLA cases have an interestingproven2 relationship, which can be expressed by the following:

W k−1(n0, t) ≥ W k(n0, t) ≥ V ∗(n0, t) (22)

where m ≥ k > 1. This expression implies that the KSLAapproximation error decreases monotonically when k increases.Note that there is no theoretical upper bound for the expectedcost for the KSLA approach. If the packet is trapped at a nodewith a single path to the destination and this path is faulty, thepacket will not reach the destination. Similar to other routingalgorithms, such as XY and odd–even, backtracking or specialrescue routines are required to help the packet to escape fromthe trapped node. Nonetheless, this situation is rare, and theKSLA can approximate the optimal path in most cases, asshown in the Monte Carlo simulation.

A Monte Carlo simulation has been performed to verify thetheoretical results. The relative error of KSLA with respectto the optimal DP values and with different parameter k is

2This can be derived using the inequality min∀P mn0,n2

{C0,1 + C1,2} ≤min∀P k

n0,n1C0,1 + min∀P m−k

n1,n2C1,2, where Ci,j ≥ 0, ∀i → j ∈ A

3710 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 58, NO. 8, AUGUST 2011

shown in Fig. 8. For each k, the optimal path cost and thecost using KSLA are obtained and computed. The relative erroris equal to the differences between the path costs using thetwo approaches. The figure presents an average relative errorof 1000 networks with randomly generated path costs. Theresult consistently shows the monotonicity of the parameterk in KSLA. Also, it is interesting to observe that the errordecreases drastically between k = 1 and k = 4. For the case ofk = 4, the error can be reduced to 10%. Consider the substantialrequirement in memory, a relatively small k in KSLA canalready provide a good-quality suboptimal routing solution.

For any n-node network, the memory addresses required canbe reduced to k(k + 1), where k ≤ √

n and a 4-ary networktopology is assumed. In general, there are 2k(k + 1) nodeswithin the k-step region. Using the west-first turn model toavoid deadlock, only k(k + 1) destination cost are required tobe evaluated and stored. Selecting an appropriate k enables atradeoff between memory consumption and routing optimalityat the designer’s disposal. Fig. 8 shows the size of the routing-table requirement for each k, as well as the relative error for theKSLA routing. For the case of k = 4, the number of memoryaddresses required is only 20 for the KSLA approach versus 63for a full routing table.

V. RESULTS AND DISCUSSION

A. Simulation Environment

In order to perform a complete evaluation of the proposedrouting algorithm, the open Noxim [30], which is an open-source SystemC simulator for NoC of different structures,is employed. The Noxim simulator provides a virtual cycle-accurate NoC architectural model where various performancemetrics, including throughput and delay of the on-chip commu-nication methodologies, can be evaluated. In order to evaluatethe performance of the proposed DP network, additional portsfor communicating the DP values are added to the NoximNoC router architecture. Routing tables and the table-updatingscheme, as described in the previous section, are also intro-duced to the simulator. A new DP routing function is imple-mented for realizing both the global path planning and KSLA.Although a mesh topology is considered in our experiments, theNoxim-based NoC architecture can be easily extended to othertopological structures by modifying the interconnection of portsof the routers. The traffic-pattern benchmarks embedded inNoxim are used for the routing performance evaluation. Thesetraffic patterns, such as hot-spot random traffic and transpose,provide a comprehensive evaluation for the routing capability,as shown in other related works [13].

By varying the packet injection rate, different routing al-gorithms produce different average packet-delivery delay andsaturation point. The average packet-delivery delay is used asa metric to evaluate the routing algorithm. The DP networkprovides the shortest path planning, by minimizing the packet-delivery delay at every node. For a mesh topology, the conver-gence time of the network is 2

√n − 1 cycles. The sampling

frequency of the DP network has to be aligned with this con-vergence time. Therefore, the cost and routing-table updatingperiods are also the same as the network convergence time,

Fig. 9. Average packet delay in random traffic with four hot-spot nodes at thecenter of an 8 × 8 mesh network.

which is 2√

n − 1 cycles. Also, the maximum packet-deliverydelay is used to evaluate the routing performance, which isimportant for NoC real-time applications. The experimentscarried out refer to an 8 × 8 size NoC. Traffic sources generate8-flit packets with an exponential distribution, the parametersof which depend on the packet injection rate. The first in, firstout (FIFO) buffers have a capacity of 16 flits. Each simulationwas initially run for 1000 cycles to allow transient effects tostabilize and, afterward, executed for 20 000 cycles. Since it is amesh topology, the convergence time of the network is 2

√n − 1

cycles, and thus, it is 15 cycles in this experiment. The updatingperiod for individual routing table is then set to be 15 cycles.

B. Results for Average Packet Delay

In order to evaluate the DP-network performance, the aver-age packet delay between the DP and four other well-knownrouting algorithms, namely, XY [16], DyAD [8], odd–even[12] and odd–even routing with an NoP selection scheme [13],are compared. Each packet is generated randomly from theprocessors following a traffic pattern and comprises from two toten flits. A fully optimal DP-network dynamic routing is appliedfor the experiments in this section. The results for using KSLAwill be presented in the next section.

Fig. 9 shows the results of a random traffic with hot spots.This type of traffic pattern is considered to be more realisticthan random traffic with uniform distribution. In most of theapplications, certain processors or tiles are more frequentlyaccessed than others, such as memory nodes and input/outputnodes. In this scenario, there are four hot spots located in thecenter of the network with 20% hot-spot traffic. When traffic isdirected to the center of the network, the central region willbe substantially congested. Deterministic routing algorithms,however, would still divert traffic to these regions. Routing al-gorithms, such as NoP and DP, can slightly outperform other al-gorithms with deterministic routings. The DyAD routing adoptsa scheme that switches between XY and odd–even dynamicallyand, thus, presents a result in between the two algorithms.

MAK et al.: ADAPTIVE ROUTING IN NETWORK-ON-CHIPS USING DYNAMIC-PROGRAMMING NETWORK 3711

Fig. 10. Average packet delay in random traffic with hot-spot nodes at thefour corners of an 8 × 8 mesh network.

Fig. 11. Average packet delay in matrix-transpose traffic in an 8 × 8 meshnetwork.

The results are consistent with literature [8]. Fig. 10 shows theresults of another hot-spot traffic where the hot spots are locatedin the corners of the network. In this case, there will be nocongested traffic at the center of the network. The dynamic-routing algorithm has a larger degree of freedom to divertthe packets to the destination via a potentially smaller delaypath. The results demonstrate the performance advantage ofadaptive algorithms, such as DP and NoP, with respect to staticalgorithms, such as XY. These adaptive algorithms providea larger bandwidth when the network is less congested. Theperformance advantage from using dynamic routing is moresubstantial in this case. In particular, DP outperforms the otherrouting algorithms by 24.7%.

Figs. 11 and 12 show the results for a transpose and butterflytraffic, respectively. The transpose traffic emulates an interest-ing communication pattern that frequently appears in system-on-chip design, such as traffic in the fast Fourier transformarchitectures, which is very similar to a matrix transpose [16].It can be observed that the performances of XY routing and

Fig. 12. Average packet delay in butterfly traffic in an 8 × 8 mesh network.

TABLE IIICOMPARISONS FOR PACKET INJECTION RATES BETWEEN THE DP

AND FOUR OTHER ROUTING ALGORITHMS

DyAD are poor due to the congested routes along the horizontalhopping, which coincide with results reported in literature[8], [11], [13]. The DP routing can delay the saturation pointsignificantly because of the optimal path planning, which isable to utilize the throughput of the network effectively. Itis interesting to observe that NoP also provides an efficientrouting scheme which adapts to the congestions by delayingthe saturated packet injection rate to 0.02 in transpose traffic.DP outperforms the other routing schemes by 28.4% and 28.9%for the transpose and butterfly traffic, respectively.

We also compared the maximum packet injection rate fora fixed average delay with different routing algorithms. Theresults are summarized in Table III. In this scenario, a largerinjection rate implies a better utilization of network throughput.The results show that DP outperforms the other routing algo-rithm by 22.3% with the utilization of real-time traffic informa-tion. The other dynamic-routing scheme, odd-even routing withNoP selection, also outperforms the other deterministic routingalgorithms, such as XY.

C. Results for KSLA

The recently proposed NoP approach in [13] is a special caseof the KSLA. In NoP, each router chooses the routing directionbased on the queue information that is two steps away from thecurrent node. A hill-climbing heuristic is implemented for therouting. However, the NoP approach does not compute the DPvalues for the destination nodes, whereas a score value, whichresembles the DP expected delay, is computed on demand. Forthe DP network, the DP value is computed by the DP network

3712 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 58, NO. 8, AUGUST 2011

Fig. 13. Comparison of average packet delay between KSLA, odd–even usingan NoP selection, XY, and DP routing approaches. An 8 × 8 mesh network withpacket injection rate of 0.02 packet per cycle per node and look-ahead steps fork = 0, 1, . . . , 8 are considered.

and distributed to all routers. This provides a fast decision timeas only a simple lookup table is required when the header flitarrives. In the following, the experimental results of comparingthe NoP and KSLA algorithms are discussed.

A special transpose-traffic scenario is considered with apacket injection rate of 0.02 packet per cycle per node. Theperformances of KSLA with different k, XY, and NoP routingsare shown in Fig. 13. When k = 0, KSLA has the same perfor-mance as XY. This is because the routing table is initializedfollowing the XY routing scheme, and the routing table isnever updated. For the case of k = 2, KSLA provides a similarperformance as NoP (the average delay is equal to 124 for NoPand 108 for DP). This suggests that NoP resembles a specialcase of KSLA routing, specifically, when k = 2. By increasingthe k value, the average routing delay is further reduced until itconverges to 42 packet delay per cycle per node, where KSLAresembles DP. These results confirm the tradeoff in routingoptimality with different k steps, as shown in the earlier MonteCarlo simulation in Fig. 8.

D. Summary

This section has presented a novel DP network for adaptiverouting in NoC. The DP network provides on-the-fly shortestpath computation using distributed DP and enables dynamicrouting based on the real-time traffic conditions and conges-tions. Also, a KSLA routing strategy has been presented. Itcan provide tradeoff between routing optimality and mem-ory consumption. Experimental results demonstrate the perfor-mance and merits of optimal routing over other deterministicand adaptive-routing approaches, which are based on partialor local traffic information. The optimal DP-network-basedrouting outperforms XY routing by 28.9% and also improvesother adaptive-routing strategies, such as adaptive odd–even, by18.4%. It is interesting to observe that the new KSLA approach

Fig. 14. Schematic design of a standard NoC router except that a DP computa-tion unit is integrated to enable dynamic routing. The “queue-length prediction”block allows realization of different cost-function estimators that provide thecost value for the DP network. The DP computational unit interconnects withother DP units located at adjacent tiles. The DP unit also updates the routingdirection in the routing table.

is a generalization of other adaptive-routing algorithm, whichapplies hill-climbing heuristics for route planning.

Also, the DP network provides shortest path computationconditioned on constant inputs of cost function. Given that thehop costs, which are the queue depths, can change faster thanthe convergence time of the DP algorithm, then the convergenceof the network cannot be guaranteed. Additional circuits arerequired to smooth out the input costs, such that the fluctuationof the cost function does not affect the convergence of thenetwork.

VI. HARDWARE IMPLEMENTATION

There is a number of different implementation strategies thatcan be investigated for the proposed DP network. For example,DP network can be realized using analog circuit which couldenable high-performance and low-power on-chip adaptation[31], [32]. Alternatively, digital synchronous and asynchronousdesigns would result to different hardware and timing charac-teristics. Investigations on the implementation strategies are outof the scope of this paper. In this section, we aim to study thehardware overhead of a DP implementation based on a simplesynchronous network.

In this section, an implementation of the DP network andthe dynamic-routing-enabled NoC architecture are presented.Comparisons on the utilization of hardware resources and clockfrequencies are discussed.

A. Router Architecture

Fig. 14 shows the architecture of a router, which enablesdynamic routing. The router design is similar to that used inNoC [8]. An additional block implements the dynamic-routingalgorithm. The queue-length prediction unit captures the queue

MAK et al.: ADAPTIVE ROUTING IN NETWORK-ON-CHIPS USING DYNAMIC-PROGRAMMING NETWORK 3713

Fig. 15. Realization of the DP computational unit using standard logic. Thecircuit implements the discrete form of Bellman equation and outputs theupdated cost-to-go value. The decision variables can also be obtained viathe multiplexer. Since a mesh network is considered here, there are four routingdirections that can be encoded using two bits D0 and D1, where D1 is themost significant bit.

length from the input FIFO and evaluates the communicationcost for that particular direction. The routing table stores therouting directions, which are constantly updated by the DPnetwork. Successive updating of all entries in the table relieson a synchronous controller, and units in the network aresynchronized using counters. The counter provides a referenceto indicate which node is regarded as the destination and alsoprovides an address reference to the routing table. The DP unitoutputs zero if the current node is the destination; otherwise,it outputs the result of the DP computation. The shortest pathcomputation and optimal routing mechanics are implementedusing the DP computational unit, which is shown in Fig. 15.Computation units from different routers are interconnected soas to form a DP network. This figure signifies that the compu-tational network is simultaneously computing the shortest pathwhile the router keeps feeding the new cost estimates into thenetwork.

The shortest path computation requires a minimum operationto evaluate and compare the cost of all actions at each node.Also, adders are required to sum up the costs at the currentnode and the expected cost associated with the action, as shownin (10). Also, a multiplexer is needed to output the associatedaction for the minimum expected cost. Therefore, the basiccircuit in a DP computational unit comprises four adders,three comparators, and three multiplexers. This circuit can befurther extended to provide multiple inputs by increasing thenumber of adders, minimizers, and operators. The continuous-time formulation of the DP network provides a mathematicalframework and convention for convergence analysis that can beapplied to study the convergence of the system. The actual im-plementation can be either analog or digital, which correspondsto the discrete- and continuous-time versions of the networkformulation, respectively. The digital network also convergesbut with a different time constant when compared with theanalog realization.

Fig. 16 shows the interconnections between the DP compu-tational units and its neighboring nodes. The interconnectionsprovide a means to deliver the expected values from the neigh-boring nodes to the DP unit and update the optimal routingdirection. The data-flow diagram for the KSLA algorithm is

Fig. 16. Interconnecting the computational units with its adjacent nodes.

shown in Fig. 17. When the destination information is obtainedfrom the packet, Manhattan distance to the destination fromthe current node is calculated. If the distance is smaller thanor equal to k, the routing direction to the destination can bedirectly obtained from the routing table. Otherwise, the nodeswithin the k-step region are obtained. The nodes in the k-stepregion are temporary destinations that are k steps away fromthe current node. For a typical mesh topology, it is relativelytrivial to obtain the temporary nodes, which can be done byusing the Manhattan distance and lookup tables. One node isselected based on an arbitrary selection scheme. Other selectionschemes can be used, such as using the expected costs or traffic.For simplicity, a node is selected randomly in this experiment.The address of this node will be inputted to the routing table toobtain the routing direction.

B. Results of FPGA Implementations

To further evaluate the effectiveness and the hardware cost ofthe proposed methodology, a DP network is implemented usinga Xilinx Virtex-4 XC4LX80 FPGA device. A mesh NoC isimplemented using System Generator [33] and synthesis usingthe Xilinx ISE synthesis tools. The design has been placed androuted to obtain the hardware area-consumption results.

The experiment is designed to evaluate the hardware over-head of the two different routing methods, which are the DPnetwork and KSLA routing. The DP-network routing employsa full routing table, which provides optimal routing directionsfor all destinations in the network. The KSLA provides routingdirections for destinations that are k steps away. The XYrouting is also implemented as a reference. Algorithmic routingis employed for computing the routing directions for the XYrouting. Similar to other NoC architecture, a wormhole-routingmechanism is implemented.

1) Convergence of DP Network in an FPGA: The perfor-mance and network convergence of the DP network in an FPGArealization is studied. DP networks with topologies of 3 × 3,

3714 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 58, NO. 8, AUGUST 2011

Fig. 17. Data-flow routine for the KSLA algorithm.

Fig. 18. Convergence of the DP network in an FPGA implementation. They-axis is the rms error of the optimal values obtained from the value outputs ateach computational unit. The x-axis is the clock cycle. The period of the clockcycle is not specified here and varies with different FPGA devices. (a) 3 × 3network. (b) 4 × 4 network. (c) 5 × 5 network. (d) 6 × 6 network.

TABLE IVHARDWARE AREA RESULTS FOR THE DP NETWORKS. RESULTS ARE

OBTAINED BASED ON A XILINX VIRTEX-4 XC4VLX40 FPGA

4 × 4, 5 × 5, and 6 × 6 are considered. The network-convergence rate for evaluating the shortest path problems canbe observed from the outputs of the computational units. Theseoutputs are captured at each time stamp and compared withthe optimal values in order for the rms errors to be computed.Fig. 18 shows the errors of the DP network in different clockcycles and in different network topologies. The hardware areaand clock-frequency results are summarized in Table IV. Thecomputational units are carefully placed in an FPGA, suchthat a significant physical separation between the units isintroduced. In this case, the operating frequency and powerdissipations reasonably indicate the contributions of delay fromwires between the units. It can be observed that the networkconverges to the optimal solution from 5 to 11 clock cycles,depending on the network configurations. The convergencetime of the mesh network is bounded by 2

√n − 1, where n is

TABLE VHARDWARE AREA RESULTS FOR THE XY, DP, AND KSLA ROUTERS.

RESULTS ARE OBTAINED BASED ON A XILINX

VIRTEX-4 XC4VLX40 FPGA

the total number of nodes in a mesh. Suppose that the networkis operating at 200 MHz; the convergence time is boundedby 5(2

√n − 1) ns. The DP network can rapidly evaluate the

shortest path and provide optimal path planning for dynamicrouting.

2) Hardware Results: The hardware-area consumption forrouters with five input and output ports are summarized inTable V. The resource consumption for the XY router in thiswork is similar to that of the implementation reported in [34].The overhead of a DP network router is small. The overallarea is slightly larger than the XY router. The DP router uses20.6% more slices than the XY router. For the KSLA router, thearea overhead is 40.3%. The KSLA employs more hardware re-sources for the procedures in evaluating the intermediate nodesfor suboptimal routing. In order to verify the memory reductionby using KSLA, we synthesize the design to distributed regis-ters, which are located at the reconfigurable tiles. By measuringthe logic utilization, the reduction in memory consumptioncan be demonstrated. Table V compares the logic consumptionbetween DP and KSLA. The approximation scheme can reducememory consumption up to 6% for the case when the buffer sizeis equal to 16. Although additional logics are required to realizethe KSLA, the reduction in memory consumption outweighsthe extra hardware logic. The router area is still dominated bythe input FIFO buffers; the area overhead for the DP networkcan be negligible. As seen in Table V, the DP overhead isonly 23% for a typical buffer size. The DP network withthe continuous-time formulation can be implemented using ananalog circuit as proposed in [31], in which the hardware areaand power consumption could be significantly reduced.

VII. CONCLUSION

This paper has presented a novel DP network for fullyoptimal routing in NoC. The DP network provides on-the-flyshortest path computation by using distributed DP and updatingthe routing table for optimal path planning based on the real-time network status. The mathematical formulations and

MAK et al.: ADAPTIVE ROUTING IN NETWORK-ON-CHIPS USING DYNAMIC-PROGRAMMING NETWORK 3715

convergence analysis of the network are presented. Twoexamples are presented to exemplify the robustness of thenetwork and the rapid resolution of shortest path problemsin different network structures. The routing mechanicsand the KSLA routing strategy are presented which canprovide tradeoffs between routing optimality and memoryconsumption. Experimental results confirm the performanceand merits of optimal routing over other deterministic andadaptive-routing approaches, which are based on partial andlocal traffic information. The optimal DP-network-basedrouting outperforms the XY routing by 28.9% and is also betterthan the other adaptive-routing strategies, such as the odd–even,by 18.4%. It has been observed that the new KSLA approachis a generalization of other adaptive-routing algorithm, whichapplies hill-climbing heuristics for latency minimization.Moreover, the hardware overhead for a DP network has beenexamined. It was found that a DP network consumes lessthan 20.6% of extra hardware area when compared with thedeterministic routing algorithms for a standard router design.The results suggest that a DP network offers a new and effectivesolution for dynamic minimal routing in NoC and can greatlyenhance the performance of on-chip communication. The DP-network approach can be further enhanced to enable fault toler-ance and dynamic power management in NoCs to reduce powerdissipation, which will be investigated in our future work.

REFERENCES

[1] D. Edenfeld, A. B. Kahng, M. Rodgers, and Y. Zorian, “Technologyroadmap for semiconductors,” Computer, vol. 37, no. 1, pp. 42–53,Jan. 2002.

[2] J. Cong, “An interconnect-centric design flow for nanometer technolo-gies,” Proc. IEEE, vol. 89, no. 4, pp. 505–528, Apr. 2001.

[3] J. Davis, R. Venkatesan, A. Kaloyeros, M. Beylansky, S. Souri,K. Banerjee, K. Saraswat, A. Rahman, R. Reif, and J. Meindl, “Intercon-nect limits on gigascale integration (GSI) in the 21st century,” Proc. IEEE,vol. 89, no. 3, pp. 305–324, Mar. 2001.

[4] J. D. Meindl, “Interconnect opportunities for gigascale integration,” IEEEMicro, vol. 23, no. 3, pp. 28–35, May/Jun. 2003.

[5] W. Dally and B. Towles, “Route packets, not wires: On-chip interconnec-tion networks,” in Proc. DAC, 2001, pp. 684–689.

[6] L. Benini and D. Bertozzi, “Network-on-chip architectures and designmethods,” Proc. Inst. Elect. Eng.—Comput. Digit. Tech., vol. 152, no. 2,pp. 261–272, Mar. 2005.

[7] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Berg,K. Tiensyrj, and A. Hemani, “A network-on-chip architecture and designmethodology,” in Proc. Int. Symp. VLSI, 2002, pp. 105–112.

[8] J. Hu, “Design methodologies for application specific networks-on-chip,”Ph.D. dissertation, Carnegie Mellon Univ., Pittsburgh, PA, 2005.

[9] R. Marculescu and P. Bogdan, “The chip is the network: Toward a sci-ence of network-on-chip design,” in Foundations and Trends in Elec-tronic Design Automation. College Park, MD: Now Publishers, 2009,pp. 371–461.

[10] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, “Perfor-mance evaluation and design tradeoffs for network-on-chip interconnectarchitectures,” IEEE Trans. Comput., vol. 54, no. 8, pp. 1025–1040,Aug. 2005.

[11] G. Chiu, “The odd–even turn model for adaptive routing,” IEEE Trans.Parallel Distrib. Syst., vol. 11, no. 7, pp. 729–738, Jul. 1992.

[12] T. Schonwald, J. Zimmermann, O. Bringmann, and W. Rosenstiel, “Fullyadaptive fault-tolerant routing algorithm for network-on-chip architec-tures,” in Proc. Euromicro Conf. Digit. Syst. Des. Archit. Methods Tools,2007, pp. 527–534.

[13] G. Ascia, V. Catania, M. Palesi, and D. Patti, “Implementation and analy-sis of a new selection strategy for adaptive routing in networks-on-chip,”IEEE Trans. Comput., vol. 57, no. 6, pp. 809–820, Jun. 2008.

[14] K. Kobayashi, M. Kameyama, and T. Higuchi, “Communication networkprotocol for real-time distributed control and its LSI implementation,”IEEE Trans. Ind. Electron., vol. 44, no. 3, pp. 418–426, Jun. 1997.

[15] Y. Ishii, “Exploiting backbone routing redundancy in industrial wirelesssystems,” IEEE Trans. Ind. Electron., vol. 56, no. 10, pp. 4288–4295,Oct. 2009.

[16] W. Dally and B. Towles, Principles and Practices of InterconnectionNetworks. San Mateo, CA: Morgan Kaufmann, 2004.

[17] D. Wu, B. Al-Hashimi, and M. Schmitz, “Improving routing efficiencyfor network-on-chip through contention-aware input selection,” in Proc.ASP-DAC, 2006, pp. 36–41.

[18] M. Palesi, R. Holsmark, and S. Kumar, “A methodology for design ofapplication specific deadlock-free routing algorithms for NoC systems,”in Proc. Int. CODES, 2006, pp. 142–147.

[19] V. Rantala, T. Lehtonen, P. Liljeberg, and J. Plosila, “Hybrid NoC withtraffic monitoring and adaptive routing for future 3D integrated chips,” inProc. DAC, 2008, pp. 1–4.

[20] S. Bourduas and Z. Zilic, “Latency reduction of global traf-fic in wormhole-routed meshes using hierarchical rings for global rout-ing,” in Proc. IEEE Int. Conf. Appl.-Specific Syst., Archit. Process., 2007,pp. 302–307.

[21] R. Bellman, Dynamic Programming. Princeton, NJ: Princeton Univ.Press, 1957.

[22] R. Bellman, “On a routing problem,” Quart. Appl. Math., vol. 16, no. 1,pp. 87–90, 1958.

[23] T. Cormen, C. Leiserson, and R. Rivest, Introduction to Algorithms.Cambridge, MA: MIT Press, 2001.

[24] D. Bertsekas and J. Tsitsiklis, Parallel and Distributed Computation:Numerical Methods. Princeton, NJ: Prentice-Hall, 1989.

[25] F. Hillier and G. Lieberman, Introduction to Operations Research. NewYork: McGraw-Hill, 1995.

[26] K. Lam and C. Tong, “Closed semiring connectionist network for theBellman–Ford computation,” Proc. Inst. Elect. Eng.—Comput. Digit.Tech., vol. 143, no. 3, pp. 189–195, May 1996.

[27] D. Bertsekas, Dynamic Programming and Optimal Control. Belmont,MA: Athena Scientific, 2007.

[28] J. J. Hopfield, “Neurons with graded response have collective computa-tional properties like those of two-state neurons,” Proc. Nat. Acad. Sci.,vol. 81, no. 10, pp. 3088–3092, May 1984.

[29] C. Glass and L. Ni, “The turn model for adaptive routing,” ACMSIGARCH Comput. Archit. News, vol. 20, no. 2, pp. 278–287, 1992.

[30] Noxim, Network-on-Chip Simulator, 2008. [Online]. Available:http://sourceforge.net/projects/noxim

[31] T. Mak, P. Sedcole, P. Cheung, W. Luk, and K. Lam, “A hybridanalog–digital routing network for NoC dynamic routing,” in Proc. IEEEInt. Symp. NoC, 2007, pp. 173–182.

[32] T. Mak, K.-P. Lam, H. S. Ng, G. Rachmuth, and C.-S. Poon, “A current-mode analog circuit for reinforcement learning problems,” in Proc. IEEEISCAS, 2007, pp. 1301–1304.

[33] Xilinx System Generator for DSP Version 8.2.02: User Guide, Xilinx,San Jose, CA, 2006.

[34] U. Ogras and R. Marculescu, “’It’s a small world after all’: NoC per-formance optimization via long-range link insertion,” IEEE Trans. VeryLarge Scale Integr. (VLSI) Syst., vol. 14, no. 7, pp. 693–706, Jul. 2006.

Terrence Mak (S’05–M’09) received the B.Eng.and M.Phil. degrees in systems engineering fromThe Chinese University of Hong Kong, Shatin,Hong Kong, in 2003 and 2005, respectively, andthe Ph.D. degree from Imperial College London,London, U.K., in 2009.

During his Ph.D., he was as a Research EngineerIntern with the Very Large Scale Integration (VLSI)Group, Sun Microsystems Laboratories, Menlo Park,CA. He was also a Visiting Research Scientist in thePoon’s Neuroengineering Laboratory, Massachusetts

Institute of Technology, Cambridge. He has been with the School of Electrical,Electronic and Computer Engineering, Newcastle University, Newcastle uponTyne, U.K., as a Lecturer, since 2010. His research interests include field-programmable gate array architecture design, network-on-chip, reconfigurablecomputing, and VLSI design for biomedical applications.

Dr. Mak was the recipient of both the Croucher Foundation Scholarship andthe U.S. Naval Research Excellence in Neuroengineering in 2005. In 2008, heserved as the Cochair of the U.K. Asynchronous Forum, and in March 2008,he was the Local Arrangement Chair of the Fourth International Workshop onApplied Reconfigurable Computing.

3716 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 58, NO. 8, AUGUST 2011

Peter Y. K. Cheung (M’85–SM’04) received theB.S. degree with first class honors from the ImperialCollege of Science and Technology, University ofLondon, London, U.K., in 1973.

He was with Hewlett Packard, Scotland. Since1980, he has been with the Department of Electri-cal Electronic Engineering, Imperial College, wherehe is currently a Professor of digital systems andHead of the department. He runs an active researchgroup in digital design, attracting support from manyindustrial partners. His research interests include

very large scale integration architectures for signal processing, asynchronoussystems, reconfigurable computing using field-programmable gate arrays, andarchitectural synthesis.

Prof. Cheung was elected as one of the first Imperial College TeachingFellows in 1994 in recognition of his innovation in teaching.

Kai-Pui Lam received the B.Sc. (Eng) degree inelectrical engineering from the University of HongKong, Shatin, Hong Kong, in 1975, the M.Phil.degree in electronics from The Chinese Universityof Hong Kong (CUHK), Shatin, in 1977, and theD.Phil. degree in engineering science from OxfordUniversity, Oxford, U.K., in 1980.

He is a Professor with the Department of SystemsEngineering and Engineering Management, CUHK.His research is focused on using field-programmablegate array in bioinformatics and neuronal dynamics

and on intradaily information in financial volatility forecasting.

Wayne Luk (S’85–M’89) received the M.A., M.Sc.,and D.Phil. degrees in engineering and computerscience from the University of Oxford, Oxford, U.K.

He is Professor of computer engineering withthe Department of Computing, Imperial CollegeLondon, London, U.K., and a Visiting Professorwith Stanford University, Stanford, CA, and Queen’sUniversity Belfast, Belfast, U.K. Much of his currentwork involves high-level compilation techniques andtools for parallel computers and embedded systems,particularly those containing reconfigurable devices

such as field-programmable gate arrays. His research interests include the-ory and practice of customizing hardware and software for specific appli-cation domains, such as graphics and image processing, multimedia, andcommunications.