[IEEE 2010 IEEE 23rd Canadian Conference on Electrical and Computer Engineering - CCECE - Calgary,...

A LOW-AREA AND LOW-LATENCY NETWORK ON CHIP

Xiaofang Wang and Leeladhar Bandi

Dept. of Electrical and Computer Engineering, Villanova University

800 Lancaster Avenue, Villanova, PA 19085

Email: [email protected]

ABSTRACT

With the rapid increase of processing elements (PEs) on a single chip, the communication network poses a major limiting factor for both performance and power consumption in future SoCs. This paper presents a low-area and low-latency wormhole-switching network on chip (NoC). By introducing a new PE-router organization, our design not only reduces the total number of routers for a given number of PEs, but also offers much more routing flexibility compared to existing mesh-based solutions. In our network, each router is shared by four PEs and each general PE has access to four directly connected routers in addition to the NEWS (North, East, West, South) connections between neighboring PEs. By sharing routers among PEs, the network reduces the average hop count for packets, thereby reducing the latency and improving the throughput of the network. Experimental results show that the proposed network reduces the network latency by up to 50.3% for an SoC with 64 PEs. The network saturation point is extended by up to approximately 100%.

1. INTRODUCTION

The International Technology Roadmap for Semiconductors (ITRS) foresees that the communication network will be a limiting factor for both performance and power consumption in future SoCs [1]. Driven by the technology challenges and increasing bandwidth demands presented in deep submicron designs, packet-switching networks on chip (NoCs) [2] are emerging quickly to replace shared buses and dedicated wires as the interconnection fabric in SoCs. A packet-switching on-chip network uses dedicated routers interconnected by some form of network topology to interchange messages among PEs. The routing algorithm in the routers determines the path that messages travel through the communication network. The most popular network topology has been the 2-D mesh due to its modularity and scalability.

The advantages of NoC-based architectures include high scalability and versatility, low latency, high bandwidth, distributed routing decisions, reusability, and low power consumption for communication over a flexible modular medium [3, 4]. Some types of NoCs can also offer guaranteed services, i.e., guarantees on the bandwidth and latency of links between components, so that the system has predictable timing behavior. This is an important goal in multimedia and embedded systems, as there are often strict deadlines on the production of data (e.g., a video frame must be displayed on the screen every 1/50 second).

NoCs have attracted significant research interest in the last few years [5]. Many studies assume communication-intensive scenarios where simple PEs are used and applications are often diverse and heterogeneous. The system is expected to have a lot of ad-hoc communication activity, so the goals for such NoCs have been to reduce the network and router latencies and the power consumption. The quest for high-performance networks has led to very area-expensive and complicated routers. Very little progress, however, has been made in reducing the significant area requirements of the communication fabric.

This work proposes a high-performance and area-efficient wormhole-switching network on chip. By introducing a new PE-router organization, our design not only reduces the total number of routers for a given number of PEs, but also offers much more routing flexibility compared to existing mesh-based solutions. The major motivation behind our design is that more resources should be used for computing instead of communication. In our network, each router is shared by four PEs and each general PE has access to four directly connected routers in addition to the NEWS (North, East, West, South) connections between neighboring PEs. By sharing routers among PEs, the network reduces the average hop count for a packet, thereby reducing the latency and improving the throughput of the network.

2. NETWORK TOPOLOGY

Fig. 1. PE-router organization of the proposed network. [Figure: a 4x4 array of PEs, PE(0,0) through PE(3,3), and a 3x3 array of routers, R00 through R22.]

Fig. 1 shows our network for a 4x4 PE array. Rxy (x, y ∈ {0, 1, 2}) represents a router. The network employs the wormhole technique [2] for packet switching. Each PE has a unique ID and is identified as PE(i, j), where i and j represent the row number and column number, respectively, of the PE. Similarly, each router has a unique ID and is identified as Rxy, where x and y represent the row number and column number, respectively, of the router. In an N x N PE array, the PEs identified as PE(0, 0), PE(0, N − 1), PE(N − 1, 0), and PE(N − 1, N − 1) are known as corner nodes, while PEs along the edges of the array other than corner nodes are known as border nodes; the remaining nodes are referred to as general nodes. Each PE is connected to one router (if it is a corner node), two routers (if it is a border node), or four routers (if it is a general node), and each router is shared by four PEs.
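The attachment rule can be made concrete with a short sketch. It assumes, consistent with the R11 example in Section 4, that router Rxy sits between PE(x, y), PE(x, y+1), PE(x+1, y), and PE(x+1, y+1); the function names routers_of_pe and node_kind are illustrative and not part of the original design.

```python
def routers_of_pe(i, j, n):
    """Routers attached to PE(i, j) in an n x n PE array (n >= 2).

    Assumption: router R(x, y), with 0 <= x, y <= n - 2, is shared by
    PE(x, y), PE(x, y+1), PE(x+1, y) and PE(x+1, y+1).
    """
    return [(x, y)
            for x in (i - 1, i)
            for y in (j - 1, j)
            if 0 <= x <= n - 2 and 0 <= y <= n - 2]


def node_kind(i, j, n):
    """Classify a PE by the number of routers it is attached to."""
    count = len(routers_of_pe(i, j, n))
    return {1: "corner", 2: "border", 4: "general"}[count]


if __name__ == "__main__":
    n = 4  # the 4x4 array of Fig. 1
    for pe in [(0, 0), (0, 2), (1, 1)]:
        print(pe, node_kind(*pe, n), routers_of_pe(*pe, n))
```

For the 4x4 array, PE(1, 1) is attached to R00, R01, R10, and R11, while the corner PE(0, 0) is attached to R00 only.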

In conventional mesh-based network topologies, each router is exclusively owned by a single PE, whereas in our network each router can serve the communication needs of four PEs. This not only reduces the number of routers for a given number of PEs, but also offers more flexibility for PEs to adapt to network conditions. Each general PE in our network has four output channels to routers for sending packets and four input channels from routers for sinking packets. In conventional mesh-based topologies, one router is required for each PE; hence, the router/PE ratio is one. In our network, only (N − 1)^2 routers are required to connect N^2 PEs, where N is the number of PEs in each row or column of the SoC, thus reducing the area required for the communication network and leaving more space for the computing fabric.
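As a quick arithmetic check of the router counts, the following sketch compares the two organizations for a given N. It counts routers only; it does not model per-router area, which differs between the two designs. The function name router_counts is illustrative.

```python
def router_counts(n):
    """Return (conventional routers, proposed routers, proposed router/PE ratio)
    for an n x n PE array."""
    pes = n * n
    conventional = pes          # one router per PE
    proposed = (n - 1) ** 2     # one router per 2x2 group of adjacent PEs
    return conventional, proposed, proposed / pes


if __name__ == "__main__":
    for n in (4, 8):
        print(n, router_counts(n))
```

For the 8x8 array used in Section 6, this gives 49 routers instead of 64, a router/PE ratio of about 0.77.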

In addition to the wormhole-switching network, PEs are also connected to their immediate neighbors by bidirectional NEWS (North, East, West, and South) links. Packets between immediate neighbors are exchanged over the NEWS network without involving any router. The NEWS network reduces the burden on the routing network, especially for applications that require mostly local communication among immediate neighbors.
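A minimal sketch of the NEWS test follows. It assumes row numbers increase towards the south and column numbers towards the east, which matches the direction convention of the routing pseudocode in Fig. 3; the function name news_direction is hypothetical.

```python
def news_direction(src, dst):
    """Return the NEWS link from PE src to PE dst, or None if the two PEs
    are not immediate neighbors and the packet must use a router.

    src and dst are (row, column) tuples.
    """
    step = (dst[0] - src[0], dst[1] - src[1])
    return {(-1, 0): "North",
            (0, 1): "East",
            (0, -1): "West",
            (1, 0): "South"}.get(step)


if __name__ == "__main__":
    print(news_direction((1, 1), (1, 2)))  # immediate neighbor
    print(news_direction((1, 1), (2, 2)))  # diagonal, goes through a router
```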

3. PROCESSING ELEMENTS

A PE can be a processor, a memory controller, or a peripheral device. In our work, we consider a PE to be a source that generates packets as well as a sink that receives packets. In addition to its application function module, a PE also includes the following components for communication.

1. Input and output ports: Each general PE has eight bidirectional ports (corner and border PEs have fewer). Of the eight ports, four are connected to the bidirectional ports of neighboring PEs and the other four are connected to the ports of the neighboring routers.

2. Packet generator: Each PE contains a packet generator, which is responsible for generating packets that wrap the information the PE wants to send to another PE. Packets are generated according to the traffic pattern in use or the application running in the PE, and a destination PE address is assigned to each generated packet. Each packet is divided into a number of flits, each of which contains flow control information. Packets are received at the destination PE via one of its input ports connected to a router or a PE. A PE can act as a source and a destination simultaneously for different packets.

3. Controller: The controller unit in a PE is responsible for routing the packets generated by the packet generator to one of its neighboring PEs or routers. Since general PEs are connected to more than one router, the controller unit includes a selection function to choose one of the output ports of the PE. The controller unit keeps track of the resources available in the neighboring routers and sends each packet to the router determined by the selection algorithm. The details of the selection function are described in Section 5.

4. ROUTER DESIGN

Each router in our network has eight ports, which would require an 8x8 crossbar if the standard virtual-channel router architecture were used. An n×n crossbar switch directly connects n inputs to n outputs with no intermediate stages. In effect, such a switch consists of n n:1 multiplexers, one for each output. Each of the n input lines is connected to one input of each of the n multiplexers, and the outputs of the multiplexers drive the n output ports. An 8x8 crossbar thus consists of eight 8:1 multiplexers, one for each output. Fortunately, the flexibility available in our network lets us replace these eight 8:1 multiplexers with four 8:1 multiplexers and four 3:1 multiplexers, as shown in Fig. 2.

Fig. 2. Router architecture. [Figure labels: input channels N, E, W, S, PE1 to PE4 with VCID; virtual channel allocator; switch allocator; controller; crossbar [8:4]; channels N, E, W, S, PE1 to PE4.]

The reduction of the crossbar size is based on the following observation. Consider the router R11. It is connected to the PEs PE(1, 1), PE(1, 2), PE(2, 2), and PE(2, 1) and to the routers R10, R01, R12, and R21. Packets destined for PE(1, 1) never arrive at R11 from PE(1, 2) or PE(2, 1), because these two PEs are directly connected to PE(1, 1) and all packets between them are transferred over the NEWS channels without traversing a router. Likewise, packets arriving from the routers R10 and R01 are never destined for PE(1, 1), because these routers are themselves connected to PE(1, 1) and deliver such packets to it directly. Finally, packets transmitted from PE(1, 1) never have PE(1, 1) as their destination. Thus, a packet destined for PE(1, 1) can reach R11 only from the routers R12 or R21 or from PE(2, 2). Hence the output to PE(1, 1) in R11 can be driven by only three input channels rather than eight, and a 3:1 multiplexer suffices in place of an 8:1 multiplexer. The same argument applies to every router output channel that is connected to a PE, so four 8:1 multiplexers plus four 3:1 multiplexers are sufficient instead of eight 8:1 multiplexers.
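The three exclusion rules in this argument can be checked mechanically. The sketch below enumerates, for one PE-facing output of an interior router, the inputs that can legally feed it. It assumes the router placement used in the R11 example and ignores array boundaries; all function names are illustrative.

```python
def pes_of_router(x, y):
    """PEs attached to router R(x, y): NW, NE, SE, SW corners."""
    return [(x, y), (x, y + 1), (x + 1, y + 1), (x + 1, y)]


def neighbor_routers(x, y):
    """Routers adjacent to R(x, y): North, East, West, South (interior router)."""
    return [(x - 1, y), (x, y + 1), (x, y - 1), (x + 1, y)]


def legal_inputs_for_pe_output(x, y, pe):
    """Inputs of R(x, y) that may carry a packet destined for the attached PE."""
    inputs = []
    for src in pes_of_router(x, y):
        distance = abs(src[0] - pe[0]) + abs(src[1] - pe[1])
        # distance 0 is the PE itself, distance 1 is a NEWS neighbor:
        # neither sends packets for `pe` through this router.
        if distance >= 2:
            inputs.append(("PE", src))
    for router in neighbor_routers(x, y):
        # A neighboring router attached to `pe` delivers to it directly.
        if pe not in pes_of_router(*router):
            inputs.append(("R", router))
    return inputs


if __name__ == "__main__":
    print(legal_inputs_for_pe_output(1, 1, (1, 1)))
    for pe in pes_of_router(1, 1):
        print(pe, len(legal_inputs_for_pe_output(1, 1, pe)))
```

For R11 and PE(1, 1) the result is PE(2, 2), R12, and R21, matching the text, and each of the four PE-facing outputs has exactly three legal inputs. Counting multiplexer inputs as a rough proxy for crossbar size, eight 8:1 multiplexers have 64 inputs, whereas four 8:1 plus four 3:1 multiplexers have 4*8 + 4*3 = 44.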

5. PACKET ROUTING

Our network not only reduces the average hop count for packets, but also offers much more flexibility for routing. In this section, we discuss how packets are handled in our network.

A PE is responsible for generating packets according to the flit formats of the network to wrap the data it wants to send to another PE. Once a packet is generated in a PE, this PE acts as the source PE for that packet. In conventional networks, each PE is directly connected to only one router, and hence each PE has to send its packets to that router. In our network, by contrast, a general PE is connected to four routers and has the flexibility to choose one among them. A critical question that arises naturally is which of the multiple routers the PE should choose to pass the packet to. The chosen router is known as the source router for the packet. A similar question exists for the destination PE, where one of its adjacent routers (known as the destination router) needs to be selected to deliver the packet to the destination PE. The selection of the source router and the destination router is very important in our network and has a large impact on the network latency.

5.1. How the source router is chosen

In conventional topologies, the destination router and the destination PE have the same address, as each PE is connected to a single dedicated router. Whether deterministic or adaptive, all existing routing algorithms are developed based on the source router (PE) address and the destination router (PE) address. Our network assigns different addresses to the destination PE and to the routers connected to that PE. Routing algorithms use both the destination router address and the destination PE address. The destination router address is determined from the destination PE address and sent along with it. Routing algorithms use the address of the destination router to determine the path followed by a packet. Extra bits in the flits carry the destination router address along with the destination PE address.

Each PE in our network is provided with a selection function to choose its own source router when sending out a packet. As explained in Section 2, corner, border, and general PEs are connected to one, two, and four routers, respectively. For a corner PE, the source router is the single router to which it is connected, while border PEs and general PEs have to choose one router among their two or four connected routers as the source router. The selection of the source and destination routers can be done in several ways, e.g., random selection, selecting the router with the least load, or selecting the router nearest to the destination PE. In this paper, we consider only the router nearest to the destination.

In this type of selection, the router nearest to the destination is chosen. Of the routers connected to the source PE, the one that is the fewest hops away from the destination PE is defined as the source router nearest to the destination. The algorithm for determining this router depends on the topology. Corner PEs in our network are connected to only one router, which always acts as the source router. For border PEs and general PEs, the following two cases are considered.

Case 1: The number of hops between the source and destination PEs is non-zero in both the horizontal and the vertical directions.

Case 2: The number of hops between the source and destination PEs in the horizontal direction or the vertical direction is zero.
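The text does not spell out the hop metric or how ties are broken, so the following is only a sketch under stated assumptions: the distance from a router to a PE is taken as the smallest Manhattan distance from that router to any router attached to the PE, and ties (which arise in Case 2) are broken by taking the first candidate. The function names are hypothetical.

```python
def routers_of_pe(i, j, n):
    """Routers attached to PE(i, j); R(x, y) is shared by PE(x, y),
    PE(x, y+1), PE(x+1, y) and PE(x+1, y+1) (assumed placement)."""
    return [(x, y)
            for x in (i - 1, i)
            for y in (j - 1, j)
            if 0 <= x <= n - 2 and 0 <= y <= n - 2]


def router_to_pe_hops(router, pe, n):
    """Assumed metric: router-to-router hops to the closest router of `pe`."""
    return min(abs(router[0] - r[0]) + abs(router[1] - r[1])
               for r in routers_of_pe(pe[0], pe[1], n))


def select_source_router(src_pe, dst_pe, n):
    """Pick the router of src_pe nearest to dst_pe (first candidate on ties)."""
    candidates = routers_of_pe(src_pe[0], src_pe[1], n)
    return min(candidates, key=lambda r: router_to_pe_hops(r, dst_pe, n))


if __name__ == "__main__":
    n = 4
    print(select_source_router((1, 1), (3, 3), n))  # Case 1: both offsets non-zero
    print(select_source_router((1, 1), (1, 3), n))  # Case 2: same row, tie
    print(select_source_router((0, 0), (3, 3), n))  # corner PE: single router
```

In Case 1 a single router is strictly nearest; in Case 2 two routers are equally near, which is presumably why the two cases are distinguished.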

5.2. How the destination router is chosen

In the header flit of the packet, some extra bits are reserved for the destination router’s address. An extra valid bit accompanies the router address to indicate whether it is valid. When the packet is sent from the source PE to the source router, the destination router address has not yet been determined. Hence, the destination router field in the header flit contains a garbage value and the valid bit is set to ‘0’. The address of the destination router, which is the router through which a packet reaches its destination PE, is determined by a selection function in the source router. After the destination router address is calculated, it is written into the header flit of the packet and the valid bit is set to ‘1’. Now that the source PE and router addresses and the destination PE and router addresses are all known, the routing algorithm determines the next hop of the packet, and the packet is transferred to a virtual channel (VC) of the chosen router. After the packet reaches the destination router, it is sent to the destination PE.
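The valid-bit handshake can be illustrated with a small model. The field names and the HeaderFlit class are hypothetical; the paper does not give the flit layout or field widths, so this shows only the order of events, not a bit-level format.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Coord = Tuple[int, int]


@dataclass
class HeaderFlit:
    dst_pe: Coord                 # set by the source PE
    dst_router: Coord = (0, 0)    # meaningless until `valid` is True
    valid: bool = False           # valid bit for dst_router


def fill_destination_router(flit: HeaderFlit,
                            select: Callable[[Coord], Coord]) -> HeaderFlit:
    """Model of the source router's step: compute the destination router
    once, write it into the header, and set the valid bit."""
    if not flit.valid:
        flit.dst_router = select(flit.dst_pe)
        flit.valid = True
    return flit


if __name__ == "__main__":
    flit = HeaderFlit(dst_pe=(3, 3))
    print(flit)  # as sent by the source PE: valid bit cleared
    # Stand-in selection function for illustration only.
    fill_destination_router(flit, lambda pe: (pe[0] - 1, pe[1] - 1))
    print(flit)  # after the source router: valid bit set
```

Routers further along the path see the valid bit set and leave the field untouched.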

Similar to the selection function in each PE that determines the source router, a selection function is provided in the routers to determine the destination router once the source router is known. If the destination PE is at a corner, the destination router is the only router to which that corner PE is connected, while for border and general destination PEs, a destination router has to be chosen among the two or four connected routers. As with the source router, the destination router can be selected in several ways. In this paper, we choose the router nearest to the chosen source router.

In this type of selection, the router nearest to the source router is chosen. Of the routers connected to the destination PE, the one that is the fewest hops away from the source router is defined as the destination router nearest to the source. The algorithm for determining this router depends on the topology. The single router connected to a corner destination PE always acts as that PE’s destination router, while for border and general destination PEs, the same two cases as in the selection of the source router are handled separately.
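A minimal sketch of this selection follows, assuming Manhattan distance between router coordinates as the hop count and the same router placement as in the R11 example of Section 4. Tie-breaking is not specified in the text; the sketch simply takes the first candidate. The function names are illustrative.

```python
def routers_of_pe(i, j, n):
    """Routers attached to PE(i, j) in an n x n PE array (assumed placement:
    R(x, y) is shared by PE(x, y), PE(x, y+1), PE(x+1, y), PE(x+1, y+1))."""
    return [(x, y)
            for x in (i - 1, i)
            for y in (j - 1, j)
            if 0 <= x <= n - 2 and 0 <= y <= n - 2]


def select_destination_router(src_router, dst_pe, n):
    """Pick the router of dst_pe with the fewest hops from src_router."""
    candidates = routers_of_pe(dst_pe[0], dst_pe[1], n)
    return min(candidates,
               key=lambda r: abs(r[0] - src_router[0]) + abs(r[1] - src_router[1]))


if __name__ == "__main__":
    n = 4
    print(select_destination_router((0, 0), (2, 2), n))  # general destination PE
    print(select_destination_router((1, 1), (3, 3), n))  # corner destination PE
```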

5.3. Packet Routing

Once the source and destination routers are determined, packets can be transferred using adapted mesh-based routing algorithms such as XY, West-First, etc. Here we use the XY algorithm as an example to show how we adapt conventional algorithms to our network. The XY algorithm for our network differs from the generally used algorithm in two ways. First, the destination router’s coordinates are determined from the destination PE coordinates using the selection function for the destination router, as explained above. No such extra calculation is required in conventional topologies. Second, the algorithm is modified to cover all four PEs connected to a router instead of the single PE of the conventional algorithm. Fig. 3 shows the modified XY algorithm used in our network.

Adapted XY Algorithm for the Proposed Network

Input:  Coordinates of current router (xco, yco)
        Coordinates of destination router (dst_xco, dst_yco)
        Coordinates of destination PE (pe_dst_xco, pe_dst_yco)
Output: Selected output channel

Procedure:
    Xoffset := dst_xco - xco;
    Yoffset := dst_yco - yco;
    Xoffset_pe := pe_dst_xco - dst_xco;
    Yoffset_pe := pe_dst_yco - dst_yco;

    if (Yoffset > 0) then
        Channel := East;
    else if (Yoffset < 0) then
        Channel := West;
    else if (Yoffset = 0) then {
        if (Xoffset < 0) then
            Channel := North;
        else if (Xoffset > 0) then
            Channel := South;
        else {
            if (|Xoffset_pe| = 0 and |Yoffset_pe| = 0) then
                Channel := North-West;
            else if (|Xoffset_pe| = 0 and |Yoffset_pe| = 1) then
                Channel := North-East;
            else if (|Xoffset_pe| = 1 and |Yoffset_pe| = 1) then
                Channel := South-East;
            else if (|Xoffset_pe| = 1 and |Yoffset_pe| = 0) then
                Channel := South-West;
        }
    }

Fig. 3. XY algorithm for the proposed network.
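The pseudocode of Fig. 3 translates almost line for line into the Python sketch below. The helper route, which walks a packet from router to router and records the channels taken, is an illustrative addition and not part of the original algorithm; coordinates are (row, column) tuples.

```python
def adapted_xy(cur, dst_router, dst_pe):
    """Output channel chosen at router `cur`, following Fig. 3."""
    x_off = dst_router[0] - cur[0]   # row offset
    y_off = dst_router[1] - cur[1]   # column offset
    if y_off > 0:
        return "East"
    if y_off < 0:
        return "West"
    if x_off < 0:
        return "North"
    if x_off > 0:
        return "South"
    # At the destination router: pick the port of the attached PE.
    pe_off = (dst_pe[0] - dst_router[0], dst_pe[1] - dst_router[1])
    ports = {(0, 0): "North-West", (0, 1): "North-East",
             (1, 1): "South-East", (1, 0): "South-West"}
    return ports[pe_off]  # KeyError if dst_pe is not attached to dst_router


def route(cur, dst_router, dst_pe):
    """Illustrative helper: list of channels from `cur` to the destination PE."""
    step = {"East": (0, 1), "West": (0, -1), "North": (-1, 0), "South": (1, 0)}
    channels = []
    while True:
        channel = adapted_xy(cur, dst_router, dst_pe)
        channels.append(channel)
        if channel not in step:      # a PE-facing port ends the route
            return channels
        cur = (cur[0] + step[channel][0], cur[1] + step[channel][1])


if __name__ == "__main__":
    print(route((0, 0), (2, 2), (3, 3)))
```

Starting at R00 with destination router R22 and destination PE(3, 3), the packet first corrects its column offset, then its row offset, and finally leaves through the South-East port.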

6. RESULTS AND DISCUSSION

A SystemC-based simulator was developed to test performance and explore architectural and routing choices. Both the conventional network and our network have been implemented. The simulation results show that our network outperforms the conventional architecture using various routing algorithms, including XY, west-first, north-last, negative-first, and fully-adaptive algorithms, under uniform random, transpose, shuffle, bit-reversal, bit-complement, and fixed-hop traffic patterns. Due to the limited space here, we present the average network latency on an 8x8 mesh using the negative-first and fully-adaptive routing algorithms under uniform random, shuffle, bit-reversal, and transpose traffic patterns in Fig. 4, Fig. 5, Fig. 6, and Fig. 7, respectively. Both the number of virtual channels per link and the number of buffers in a virtual channel are set to 4. A packet contains 16 flits.
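The paper does not define the synthetic traffic patterns. The sketch below uses the common textbook bit-permutation definitions on a b-bit node index, assuming a PE index of row * 8 + column for the 8x8 array (so b = 6); the simulator used here may define them differently.

```python
def bit_complement(src, bits):
    """Destination is the bitwise complement of the source index."""
    return ~src & ((1 << bits) - 1)


def bit_reversal(src, bits):
    """Destination is the source index with its bit order reversed."""
    return int(format(src, f"0{bits}b")[::-1], 2)


def shuffle(src, bits):
    """Destination is the source index rotated left by one bit."""
    mask = (1 << bits) - 1
    return ((src << 1) | (src >> (bits - 1))) & mask


def transpose(src, bits):
    """Destination swaps the upper and lower halves of the index,
    i.e. PE(i, j) sends to PE(j, i) when the index is row * n + column."""
    half = bits // 2
    mask = (1 << bits) - 1
    return ((src << half) | (src >> half)) & mask


if __name__ == "__main__":
    bits = 6                      # 64 PEs
    src = 1 * 8 + 2               # PE(1, 2)
    for pattern in (bit_complement, bit_reversal, shuffle, transpose):
        dst = pattern(src, bits)
        print(pattern.__name__, divmod(dst, 8))
```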

In all the experiments shown, our network reduces the network latency by between 18.3% and 50.3% compared to the conventional architecture. More importantly, the network stable range (the range between zero traffic load and the saturation point of the network capacity) is extended by more than 30% with the negative-first routing algorithm under all traffic patterns, and by up to 100% with the fully-adaptive algorithm under bit-reversal and transpose traffic patterns. In general, adaptive algorithms benefit more from the flexibility of our architecture and perform better than deterministic routing algorithms on our network.

Fig. 4. Average network latency under uniform random traffic: (a) Negative-First, (b) Fully-Adaptive. [Plots of average network latency (cycles) versus traffic load (fraction of capacity) for the conventional network and our network.]

Fig. 5. Average network latency under shuffle traffic: (a) Negative-First, (b) Fully-Adaptive. [Plots of average network latency (cycles).]

7. CONCLUSIONS

The rapidly increasing number of PEs on a chip poses serious challenges to the design of communication architectures for future SoCs. Packet-switching NoCs have attracted significant research effort in recent years and have shown promising advantages in terms of scalability, reusability, and predictable latency. Existing NoC solutions assume communication-centric scenarios and require significant chip area, leaving little room for the computing fabric. In this paper, we introduced a novel NoC based on a new PE-router organization, which not only significantly reduces the area required by the communication network, but also provides flexibility to routing algorithms. By applying our selection functions, existing high-performance routing algorithms can be adapted to work on our network. The simulation results from our SystemC-based simulator show that our network reduces the network latency by up to 50.3% for an 8x8 PE array. Moreover, the network saturation point can be extended by up to approximately 100%.

Fig. 6. Average network latency under bit-reversal traffic: (a) Negative-First, (b) Fully-Adaptive. [Plots of average network latency (cycles).]

Fig. 7. Average network latency under transpose traffic: (a) Negative-First, (b) Fully-Adaptive. [Plots of average network latency (cycles).]

8. REFERENCES

[1] J. D. Owens, W. J. Dally, R. Ho, D. N. Jayasimha, S. W. Keckler, and L.-S. Peh, “Research challenges for on-chip interconnection networks,” IEEE Micro, vol. 27, no. 5, pp. 96–108, Sept. 2007.

[2] W. J. Dally and C. L. Seitz, “Deadlock-free message routing in multiprocessor interconnection networks,” IEEE Trans. Comput., vol. 36, no. 5, pp. 547–553, May 1987.

[3] A. Adriahantenaina, H. Charlery, A. Greiner, L. Mortiez, and C. A. Zeferino, “SPIN: A scalable, packet switched, on-chip micro-network,” Design, Automation and Test in Europe Conference and Exhibition, vol. 2, pp. 70–73, March 2003.

[4] L. Benini and G. D. Micheli, “Networks on chips: A new SoC paradigm,” in Proceedings of Conference on Design, Automation and Test in Europe, vol. 35, no. 1, pp. 418–419, March 2002.

[5] T. Bjerregaard and S. Mahadevan, “A survey of research and practices of network-on-chip,” ACM Comput. Surv., vol. 38, no. 1, March 2006.
