On-Chip Networks (NoCs)
Seminar on Embedded System Architecture (Prof. A. Strey)

Simone Pellegrini, [email protected]

University of Innsbruck – December 10, 2009

1 Introduction

Advancements in chip manufacturing technology nowadays allow the integration of several hardware components into a single integrated circuit, reducing both manufacturing costs and system dimensions. A System-on-a-Chip (SoC) integrates – on a single chip – several building blocks, or IPs (Intellectual Property), such as general-purpose programmable processors, DSPs, accelerators, memory blocks, I/O blocks, etc. Recently, the exponential growth of the number of IPs employed in SoC designs has raised the need for a more efficient and scalable on-chip core-to-core interconnection.

Traditionally, on-chip interconnects come in three forms [1]:

Bus : A bus is shared between several IPs; arbitration logic is needed to guarantee serialization of access requests.

Crossbar : A switch connecting multiple inputs to multiple outputs in a matrix manner.

Point-to-Point : Each component has a dedicated interconnection towards other dependent IPs.

An example of bus interconnection is depicted in Fig. 1.a. Buses are cheap to realize; however, in a highly interconnected multicore system a bus can easily become a communication bottleneck. Furthermore, the capacitive load increases with the number of cores attached to the bus, resulting in higher energy consumption. The crossbar overcomes some of the limitations of the bus and provides lower latency, but it is not scalable. Dedicated point-to-point links are optimal in terms of latency, power usage and bandwidth availability, as they are designed according to the specific core-to-core bandwidth requirements. However, the number of dedicated point-to-point interconnections grows quadratically with the number of IPs, resulting in a larger realization area. Crossbar and point-to-point interconnections usually scale efficiently up to about 20 cores; for larger numbers of IPs integrated in a single chip a more scalable and flexible solution is needed. That solution is an on-chip data-routing network consisting of communication links and routing nodes, generally known as a Network-on-Chip (NoC) architecture (an example is shown in Fig. 2).


Figure 1: Examples of communication structures in Systems-on-Chip: a) traditional bus-based communication, b) crossbar and c) point-to-point.


2 Basic NoC Building Blocks


Figure 2: Example of On-Chip Network.

A NoC consists of routing nodes spread out across the chip, connected via communication links. In Fig. 2 the main components of a NoC are depicted:

Network Adapter implements the interface by which cores (IP blocks) connect to the NoC. Its function is to decouple computation (the cores) from communication (the network).

Routing node routes the data according to the chosen protocols. It implements the routing strategy.

Link connects the nodes, providing the raw bandwidth. It may consist of one or more logical or physical channels.

Several NoC designs have been proposed in the literature; while some of them aim at low latency, others are oriented towards maximizing bandwidth. Like computer networks, a NoC is usually designed to meet application requirements, and at each level (network adapter, routing and link) several design trade-offs must be considered.

2.1 Network Adapter (NA)

The NA handles the end-to-end flow control, encapsulating the messages or transactions generated by the cores for the routing strategy of the network. At this level messages are broken into packets and, depending on the underlying network, additional routing information (such as the destination core) is encoded into a packet header. Packets are further decomposed into flits (flow control units), which in turn are divided into phits (physical units), the minimum-size datagram that can be transmitted in a single link transaction. A flit is usually composed of 1 to 2 phits. The NA implements a Core Interface (CI) at the core side and a Network Interface (NI) at the network side. The level of decoupling introduced by the NA may vary. A high degree of decoupling allows for easy reuse of cores. On the other hand, a lower level of decoupling (a more network-aware core) has the potential to make more optimal use of the network resources. Standard sockets can be used to implement the CI; the Open Core Protocol (OCP) and the Virtual Component Interface (VCI) are two examples widely used in SoCs.
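
The message → packet → flit → phit chain can be made concrete with a small sketch. The following C++ fragment is a minimal illustration under assumed sizes (one phit per flit, four payload flits per packet); it is not Xpipes, OCP or VCI code, and all names are chosen only for readability.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed for illustration: a flit is one 32-bit phit wide,
// and a packet carries at most 4 payload flits plus a head flit.
constexpr std::size_t FLITS_PER_PKT = 4;

struct Flit {
    bool     head;     // first flit of a packet: carries routing info (destination core)
    bool     tail;     // last flit of a packet
    uint32_t payload;  // one phit wide in this sketch
};

// The network adapter breaks a core's message into packets and flits.
std::vector<Flit> packetize(uint16_t dest_core, const std::vector<uint32_t>& message) {
    std::vector<Flit> flits;
    for (std::size_t i = 0; i < message.size(); ++i) {
        if (i % FLITS_PER_PKT == 0) {
            // Head flit: the header encodes the destination core.
            flits.push_back({true, false, static_cast<uint32_t>(dest_core)});
        }
        bool end_of_packet = ((i + 1) % FLITS_PER_PKT == 0) || (i + 1 == message.size());
        flits.push_back({false, end_of_packet, message[i]});
    }
    return flits;
}
```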

2.2 Network Level (routing)

The main responsibility of the network is to deliver messages from a sender IP to a receiver core. A NoC is defined by its topology and the protocol implemented by it.

2.2.1 Topology

Topology concerns the layout and connectivity of the nodes and links on the chip. Two forms of interconnection are explored in modern NoCs: regular and irregular topologies. Regular topologies are divided into three classes, k-ary n-cube, k-ary tree and k-ary n-dimensional fat tree, where k is the degree of each dimension and n is the number of dimensions. Most NoCs implement regular forms of network topology that can be laid out on a chip surface, for example the k-ary 2-cube, commonly known as grid-based topology. Generally, a grid topology makes better use of links (utilization), while tree-based topologies are useful for exploiting locality of traffic and better optimize the bandwidth. Irregular forms of topology are based on the concept of clustering (frequently used in GALS chips) and are derived by mixing different forms in a hierarchical, hybrid, or asymmetric way.
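
As a concrete illustration of a grid layout, the sketch below enumerates the neighbours of a node in a k × k mesh. The coordinate convention is an assumption made for this example; a k-ary 2-cube in the strict sense would additionally have wrap-around (torus) links.

```cpp
#include <utility>
#include <vector>

// Neighbours of node (x, y) in a k x k mesh (grid) without wrap-around links.
std::vector<std::pair<int, int>> mesh_neighbours(int x, int y, int k) {
    std::vector<std::pair<int, int>> n;
    if (x > 0)     n.push_back({x - 1, y});  // west
    if (x < k - 1) n.push_back({x + 1, y});  // east
    if (y > 0)     n.push_back({x, y - 1});  // south
    if (y < k - 1) n.push_back({x, y + 1});  // north
    return n;
}
```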


2.2.2 Protocol

The protocol concerns the strategy of moving data through the network and is implemented at the router level. A generic scheme of a router is shown in Fig. 2. The switch is the physical component which deals with data transport by connecting input and output buffers. Switch connections are dictated by the routing algorithm, which defines the path a message has to follow to reach its destination.

The wide majority of NoCs are based on packet-switching networks. Unlike circuit-switching networks, where a circuit from source to destination is set up and reserved until the transfer of data is completed, in packet-switching networks packets are forwarded on a per-hop basis (each packet contains routing information and data). The most common forwarding strategies are the following:

Store-and-forward : The node stores the complete packet and forwards it based on the information within its header. The packet may stall if the router does not have sufficient buffer space.

Wormhole : The node looks at the header of the packet (stored in the first flit) to determine its next hop and immediately forwards it. The subsequent flits are forwarded, as they arrive, to the same next-hop node. As no packet buffering is done, wormhole routing attains a minimal packet latency. The main drawback is that a stalling packet can occupy all the links the worm spans.

Virtual cut-through : Works like wormhole routing, but before forwarding a packet the node waits for a guarantee that the next node in the path will accept the entire packet.

The main forwarding technique used in NoCs is wormhole, because of its low latency and small realization area, as no packet buffering is required. Most often, connection-less routing is employed for best-effort (BE) traffic, while connection-oriented routing is preferable for guaranteed throughput (GT), needed when applications have QoS requirements.
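
The latency advantage of wormhole over store-and-forward can be quantified with a first-order, contention-free model, sketched below. The flit and hop counts in the example are illustrative assumptions, not figures from the paper.

```cpp
#include <iostream>

// First-order, contention-free latency model in link cycles.
// L = packet length in flits, H = number of hops, one flit per cycle per link.
long store_and_forward_latency(long L, long H) { return L * H; }      // whole packet buffered per hop
long wormhole_latency(long L, long H)          { return H + L - 1; }  // header pipelines through

int main() {
    long L = 8, H = 5;  // assumed: 8-flit packet crossing 5 hops
    std::cout << "store-and-forward: " << store_and_forward_latency(L, H) << " cycles\n"
              << "wormhole:          " << wormhole_latency(L, H)          << " cycles\n";
}
```

With these assumed values the packet needs 40 cycles with store-and-forward but only 12 cycles with wormhole forwarding.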

Once the destination node is known, static or dynamic techniques can be used to determine to which of the switch's output ports the message should be forwarded.

Deterministic routing : The path is determined by the packet source and destination. Popular deterministic routing schemes for NoCs are source routing and X-Y routing (2D dimension-order routing). In X-Y routing, for example, the packet follows the rows first, then moves along the columns towards the destination, or vice versa. In source routing, the routing information is directly encoded into the packet header by the sending IP.

Adaptive routing : The routing path is decided on a per-hop basis. Adaptive schemes involve dynamic arbitration mechanisms, for example based on local link congestion. This results in a more complex node implementation but offers benefits like dynamic load balancing.

Deterministic routing is often preferred as it is easier to implement; adaptive routing, on the other hand, tends to concentrate traffic at the centre of the network, resulting in increased congestion there [1].
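
A minimal sketch of X-Y (dimension-order) routing on a 2D grid is given below; the port names and the coordinate convention are assumptions for illustration only.

```cpp
// X-Y (dimension-order) routing: route along the X dimension first, then along Y.
// Deterministic: the chosen output port depends only on current and destination coordinates.
enum class Port { Local, East, West, North, South };

Port xy_route(int cur_x, int cur_y, int dst_x, int dst_y) {
    if (cur_x < dst_x) return Port::East;
    if (cur_x > dst_x) return Port::West;
    if (cur_y < dst_y) return Port::North;
    if (cur_y > dst_y) return Port::South;
    return Port::Local;  // arrived: deliver to the attached core
}
```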

Besides the delivery of messages from the source to the destination core, the network level is also responsible for ensuring correct operation, also known as flow control. Deadlock and livelock should be detected and resolved. Deadlock occurs when network resources (e.g., link bandwidth or buffer space) are suspended waiting for each other to be released, i.e. one path is blocked, leading to others being blocked in a cyclic way. Livelock occurs when resources constantly change state waiting for each other to finish. Methods to avoid deadlock and livelock can be applied either locally at the nodes or globally, by ensuring logical separation of data streams through end-to-end control mechanisms. Deadlocks can be avoided by using Virtual Channels (VCs): VCs allow multiple logical channels to share a single physical channel. When one logical channel stalls, the physical channel can continue serving the other logical channels (VCs require additional buffering, as several independent queues must be provided).
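
The virtual-channel idea, one physical link multiplexed between several independently buffered logical channels, can be sketched as follows. The data structure and the round-robin arbitration are illustrative assumptions, not a specific NoC's implementation.

```cpp
#include <array>
#include <cstdint>
#include <deque>
#include <optional>

// One physical channel shared by NUM_VC virtual channels, each with its own
// flit queue and a flow-control flag indicating whether the downstream buffer can accept.
constexpr int NUM_VC = 4;

struct PhysicalChannel {
    std::array<std::deque<uint32_t>, NUM_VC> queue;   // per-VC flit buffers
    std::array<bool, NUM_VC> downstream_ready{};      // per-VC flow-control state
    int rr = 0;                                        // round-robin pointer

    // Pick one flit to send this cycle; a stalled VC is simply skipped,
    // so it does not block the physical link for the others.
    std::optional<uint32_t> arbitrate() {
        for (int i = 0; i < NUM_VC; ++i) {
            int vc = (rr + i) % NUM_VC;
            if (!queue[vc].empty() && downstream_ready[vc]) {
                uint32_t flit = queue[vc].front();
                queue[vc].pop_front();
                rr = (vc + 1) % NUM_VC;
                return flit;
            }
        }
        return std::nullopt;  // nothing eligible this cycle
    }
};
```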

2.3 Link Level

Link-level research regards the node-to-node links; the goal is to design point-to-point interconnections which are fast, low power and reliable. As chip manufacturing technology scales, the effects of wires on link delay and power consumption increase. Due to physical limitations, it is not possible to produce long and fast wires, so segmentation is done by inserting repeater buffers at a regular distance to keep the delay linearly dependent on the length of the wire. Partitioning long wires into pipeline stages is an alternative segmentation technique and an effective way of increasing throughput. This comes at the expense of latency, as pipeline stages are more complex than standard repeaters (buffering is required). For future chip interconnects, optical wires are attracting more and more interest. The idea consists in integrating the technology used in fiber-optic links on the chip surface. As shown in Fig. 3, for links longer than 3-4 mm optical interconnects can dramatically reduce signal delay and power consumption [1].
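
The effect of repeaters can be pictured with a toy RC model: the delay of an unrepeated wire grows roughly quadratically with its length (both resistance and capacitance grow with length), while splitting it into repeated segments keeps the per-segment delay small and makes the total roughly linear. The constants below are arbitrary assumptions, not technology data from the paper.

```cpp
// Toy first-order delay model (arbitrary units).
// Unrepeated wire: delay ~ R*C, both proportional to length L, hence ~L^2.
double unrepeated_delay(double L) { return L * L; }

// Wire split into n segments with repeaters, each adding a fixed delay t_rep:
// n * ((L/n)^2 + t_rep), which grows linearly in L for a fixed segment length.
double repeated_delay(double L, int n, double t_rep) {
    double seg = L / n;
    return n * (seg * seg + t_rep);
}
```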

3 NoC Analysis

Figure 3: Delay comparison of optical and electrical interconnect (with and without repeaters) in a projected 50 nm technology.

Metrics usually employed to describe a NoC can be divided into two categories: performance and cost. Latency, bandwidth, throughput and jitter are the most common performance factors, while cost usually refers to power consumption and area usage. Designing a NoC usually requires finding the optimal trade-off between application needs (usually expressed in terms of traffic requirements), area usage and power consumption. For example, choosing an adaptive routing scheme over a static one can benefit the average latency as the network load increases. The routing algorithm can also optimize the aggregate bandwidth as well as link utilization. Jitter, finally, refers to the delay variation of received packets in a flow, caused by network congestion or improper queuing. Knowledge of the application traffic behaviour can help in compensating the jitter via proper buffer dimensioning, which absorbs the effect of traffic bursts.
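
As a simple illustration of buffer dimensioning against traffic bursts, the sketch below computes the backlog a receive buffer must absorb when a burst arrives faster than it is drained. The rates and burst length are assumed values, not derived from the paper.

```cpp
#include <algorithm>
#include <cmath>

// Flits of buffering needed to absorb a burst without loss.
// A burst of burst_len flits arrives at arrival_rate flits/cycle and is drained
// at drain_rate flits/cycle; the backlog peaks at the end of the burst.
long buffer_flits_needed(long burst_len, double arrival_rate, double drain_rate) {
    double backlog = burst_len * std::max(0.0, arrival_rate - drain_rate) / arrival_rate;
    return static_cast<long>(std::ceil(backlog));
}
```

For instance, with an assumed burst of 32 flits arriving at 1 flit per cycle and drained at 0.5 flits per cycle, 16 flits of buffering are needed to absorb the burst.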

From the point of view of costs, reducing power consumption is one of the main objectives. There are two main terms used to measure the power consumption of a NoC system: (i) power per communicated bit and (ii) idle power. Idle power is often reduced by using asynchronous circuits to implement some of the NoC components, e.g. the NA.

4 Available NoC Solutions

Designing a NoC is made possible by libraries which give a high-level description of NoC components that can be assembled together by system designers to suit the application needs. Available NoC libraries can be classified according to the two following metrics, as depicted in Fig. 4:

Parametrizability at system-level : the ease with which a system-level NoC characteristic can be changed at instantiation time. The NoC description may encompass a wide range of parameters, such as: number of slots in the switch, pipeline stages in the links, number of ports of the network, and others.

Granularity of NoC : describes the abstraction level at which the NoC component is described. At the fine level the network can be assembled by putting together basic building blocks, while at the coarse level the NoC can be described as a single system core.

In the next sections two concrete systems which employ a NoC will be analyzed: Xpipes and the RAW processor. Xpipes is a library whose purpose is to provide system designers with the building blocks to instantiate custom NoCs for SoC systems. The RAW processor is a general-purpose multi-core CPU architecture which uses a NoC-based interconnect between processing cores.

4.1 Xpipes

Figure 4: Classification of available NoC solutions.

Xpipes is a library of highly parameterised soft macros (network interface, switch and switch-to-switch link) that can be turned into instance-specific network components at instantiation time. Components can be assembled together, allowing users to explore several NoC designs (e.g. different topologies) to better fit the specific application needs.

The high degree of parameterisation of the Xpipes network building blocks regards both global network-specific parameters (such as flit size, maximum number of hops between any two nodes, etc.) and block-specific parameters (content of routing tables for source-based routing, switch I/O ports, number of virtual channels, etc.). The Xpipes NoC backbone relies on a wormhole switching technique and makes use of a static routing algorithm called street-sign routing. Particular attention is given to reliability, as distributed error detection techniques are implemented at link level. To achieve high clock frequencies, Xpipes links are pipelined in order to optimize throughput. The size of the segments can be tailored to the desired clock frequency. Overall, the delay for a flit to traverse one link and node is 2N + M cycles, where N is the number of link pipeline stages and M the number of switch stages.
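
For instance, with N = 2 link pipeline stages and M = 1 switch stage (values chosen purely for illustration), a flit would need 2 · 2 + 1 = 5 cycles to cross one link and node.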

One of the main advantages of Xpipes over other NoC libraries is the provided tool set. The XpipesCompiler is a tool designed to automatically instantiate an application-specific custom communication infrastructure using Xpipes components, starting from the system specification. The output of the XpipesCompiler is a SystemC description that can be fed to a back-end RTL synthesis tool for silicon implementation.

4.2 RAW Processor

The RAW microprocessor is a general-purpose multicore CPU first designed at MIT [3]. Its main focus is to exploit ILP across several CPU cores (called tiles) whose functional units are connected through a NoC. The first prototype, realized in 1997, had 16 individual tiles and a clock speed of 225 MHz.

The RAW processor design is optimized for low-latency communication and for efficient execution of parallel codes. Tiles are interconnected using four 32-bit full-duplex networks-on-chip; a 2-dimensional grid topology is used, thus each tile connects to its four neighbours. The limited length of the wires, which is no greater than the width of a tile, allows high clock frequencies. Two of the networks are static and managed by a single static router (which is optimized for low latency), while the remaining two are dynamic. An interesting aspect of the RAW processor is that the networks are integrated directly into the processors' pipeline, as depicted in Fig. 5. This architecture enables an ALU-to-network latency of 4 clock cycles (for an 8-stage pipeline tile).

The static router is a 5-stage pipeline that controls two routing crossbars and thus two physical networks. Routing is performed by programming the static routers on a per-clock basis. These instructions are generated by the compiler; since the traffic pattern is extracted from the application at compile time, router preparation can be pipelined, allowing data words to be forwarded towards the correct port upon arrival.
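
Conceptually, the compile-time nature of the static network can be pictured as a per-cycle routing schedule driving the crossbar, as in the sketch below. The table layout and port names are illustrative assumptions, not RAW's actual static-router instruction format.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Crossbar ports of a tile's static router in this sketch.
enum class Port { North, South, East, West, Processor };

// One crossbar configuration: for each of the 5 output ports, which input port feeds it.
using CrossbarConfig = std::array<Port, 5>;

// The compiler emits one configuration per clock cycle; at run time the static
// router simply replays the (assumed non-empty) schedule, so data words are
// forwarded to the correct port as soon as they arrive.
struct StaticRouter {
    std::vector<CrossbarConfig> schedule;  // generated at compile time
    std::size_t pc = 0;                    // index of the currently active configuration

    CrossbarConfig step() {
        CrossbarConfig cfg = schedule[pc % schedule.size()];
        ++pc;
        return cfg;
    }
};
```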

Figure 5: Raw compute processor pipeline.

The dynamic network is based on packet switching, and the wormhole routing protocol is used. The packet header contains the destination tile, a user field and the length of the message. Dynamic routing introduces higher latency, as additional clock cycles are required to process the message header. Two dynamic networks are implemented to handle deadlocks. The memory network has a restricted usage model that uses deadlock avoidance. Usage of the general network is instead unrestricted, and when deadlocks happen the memory network is used to restore correct functionality.

The success of the RAW architecture is demonstrated by the Tilera company, founded by former MIT members, which distributes CPUs based on the RAW design. Its current top microprocessor is the Tile-Gx, which packs 100 tiles into a single chip at a clock speed of 1.5 GHz [4]. Intel also recently presented the Single-chip Cloud Computer (SCC), which packs 48 tiles connected through a NoC [5].

5 Summary

On-chip networks are gaining more and more interest for both embedded SoC systems and general-purpose CPUs, as the number of integrated cores is constantly increasing and traditional interconnects do not scale. As chip manufacturing technology scales down, computation is getting cheaper and communication is becoming the main source of power consumption. NoCs try to solve the problem by reducing the complexity and the number of chip interconnections and by limiting the wire length to allow higher clock speeds. Several NoC libraries and NoC-based general-purpose CPUs have been introduced, making effective programmability of these chips one of the next big research challenges.

References

[1] Bjerregaard, Tobias and Mahadevan, Shankar: A Survey of Research and Practices of Network-on-Chip, ACM Computing Surveys, Volume 38, Number 1, 2006.

[2] Davide Bertozzi and Luca Benini: Xpipes: A Network-on-Chip Architecture for Gigascale Systems-on-Chip, IEEE Circuits and Systems Magazine, Volume 4, Issue 2, pages 18-31, 2004.

[3] Michael B. Taylor et al.: The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs, IEEE Micro, Volume 22, Issue 2, pages 25-35, 2002.

[4] Tilera: Tile-GX Processors Family, http://www.tilera.com/products/TILE-Gx.php.

[5] Intel Research: Single-chip Cloud Computer, http://techresearch.intel.com/articles/Tera-Scale/1826.htm
