[IEEE The 4th International Symposium on Parallel and Distributed Computing (ISPDC'05) - lille,...


Page 1: [IEEE The 4th International Symposium on Parallel and Distributed Computing (ISPDC'05) - lille, France (04-06 July 2005)] The 4th International Symposium on Parallel and Distributed

A Memory-Effective Fault-Tolerant Routing Strategy for Direct Interconnection Networks ∗

M.E. Gómez, P. López and J. Duato

Dept. of Computer Engineering

Universidad Politécnica de Valencia

E-mail: [email protected]

Abstract

High-performance interconnection networks are crucial in massively parallel computers. Routing is one of the most important design issues of interconnection networks. Moreover, the huge amount of hardware in these machines makes fault-tolerance another important design issue. In this paper, we propose a mechanism that combines scalable routing and fault-tolerance for commercial switches to build direct regular topologies, which are the topologies used in large machines. The hardware required is not complex. Furthermore, it allows a high degree of fault-tolerance while inflicting a minimal decrease in performance.

Keywords: Memory-effective routing, fault-tolerance, regular topologies, distributed routing, adaptive routing.

1 Introduction

High-performance interconnection networks are crucial to achieve the maximum performance in large parallel computers. Routing is one of the most important design issues of interconnection networks. Another important issue is fault-tolerance, since the huge number of nodes significantly increases the probability of failure.

Routing is deterministic if only one path is provided, or adaptive if several paths are available between a source-destination pair. Routing strategies can also be classified according to the place where routing decisions are taken. In source routing, the source node computes the path and stores it in the packet header. Since the header itself must be transmitted, it consumes network bandwidth. In distributed routing, each switch computes the next link that will be used by the packet. The packet header only contains the destination node. In source routing, routers are very simple. On the other hand, distributed routing offers more efficiency.

Distributed routing can be implemented in two different ways. In the first case, hardware at the switches implements a combinational logic circuit that computes the output port to be used as a function of the current and destination nodes and the status of the output ports. The implementation is very efficient in terms of both area and speed, but the algorithm is specific to the topology and to the routing strategy used on that topology. This is the approach used for the fixed regular topologies used in large multicomputers. With the introduction of clusters of workstations, routers based on forwarding tables were proposed. In this case, there is a table at each node that stores, for each destination node, the output ports that must be used. The main advantage of table-based routing is that any topology and any routing algorithm can be used. However, forwarding tables suffer from a lack of scalability, as their size grows linearly with network size and, most importantly, the time required to access the table also depends on its size.

∗ This work was supported by the Spanish MCYT under Grant TIC2003-08154-C06-01.

Large cluster-based machines use any of the commercial high-performance switch-based point-to-point interconnects. The network topology is either a direct network or an indirect multistage network. The large size of these machines makes source routing inefficient due to the header size, and distributed routing based on forwarding tables is also inefficient due to the table size. On the other hand, specifically designed hardwired routing algorithms are not feasible, as general purpose switches are used.

In [4], we developed the Flexible Interval Routing (FIR), a routing strategy for switch-based networks which is optimized for the most widely used regular network topologies. By optimized, we mean that it is scalable from several points of view: (i) the required memory space does not depend linearly on network size, (ii) the hardware required to apply the routing algorithm is not complex and does not depend on network size, and (iii) the time required to perform a routing operation does not depend on network size.

As well as performance, fault-tolerance has become another important issue. The huge number of nodes of large machines significantly increases the probability of failure. Failures in the interconnection network may isolate a large fraction of the machine. In order to deal with permanent faults in a system, two fault models can be used: static or dynamic. In a static fault model, all the faults are known in advance when the machine is (re)booted. It needs to be combined with checkpointing techniques in order to be effective. In a dynamic fault model, once a new fault is found, actions are taken in order to handle it appropriately.

There exist several approaches to tolerate faults in the interconnection network. However, most of the solutions proposed in the literature are based on designing fault-tolerant routing algorithms able to find an alternative path when a packet meets a fault. In [3], we proposed a fault-tolerant routing mechanism that inflicts a minimal decrease of performance in the presence of faults, and tolerates a reasonably large number of faults, without disabling any healthy node and without requiring too many extra hardware resources.

Proceedings of the 4th International Symposium on Parallel and Distributed Computing (ISPDC’05) 0-7695-2434-6/05 $20.00 © 2005 IEEE

The goal of this paper is to make an integration proposal, combining the FIR [4] and the aforementioned fault-tolerant mechanism [3] to obtain a fault-tolerant and memory-efficient routing strategy for direct networks (meshes and tori). As a consequence, we will obtain a distributed, fault-tolerant and scalable routing strategy that can be used in general purpose switches. The rest of the paper is organized as follows. To make the paper self-contained, Section 2 describes the Flexible Interval Routing scheme, and Section 3 briefly presents the fault-tolerant methodology proposed in [3]. Section 4 presents the extension of FIR in order to support fault-tolerance. Section 5 evaluates the Fault-Tolerant FIR (FT-FIR). Finally, some conclusions are drawn.

2 Flexible Interval Routing

Interval routing was introduced in [5]. The idea behind an interval routing scheme is to group the destination nodes reachable from the same output port into an interval. The intervals associated with the different output ports of a switch are non-overlapping. Each packet is forwarded through the output port whose interval contains the destination identifier of the packet. To implement interval routing, it is sufficient to store the bounds of each interval. Moreover, the required hardware is simple (a pair of comparators for each output port) and therefore very fast, as the routing option can be determined after a single comparison delay if all comparators work concurrently.

In [4], we proposed an extension of the interval routing scheme, the Flexible Interval Routing (FIR). It can implement deterministic and adaptive routing on k-ary n-cube networks. Each output port also has an associated interval, stored in two registers, First Interval (FI) and Last Interval (LI). But, in order to add flexibility, we associate additional registers to the output ports. As regular topologies usually identify each node by its coordinates in each network dimension, to check if a given dimension is suitable for routing, the coordinates of the current and destination nodes in this dimension are compared. Therefore, each output port also has a Mask Register (MR). This register has n bits (n = log2(N), N being the network size) and indicates which bits of the packet destination address are considered in the comparison. Figure 1 shows how the destination address is compared with FI and LI after being masked (Dm). If cyclic (or modulo N) intervals are allowed (because wraparound links are present in the topology, which is the case of torus networks) and LI < FI, then the condition for a destination address D to be inside the interval is Dm ≥ FI or Dm ≤ LI. The result of the comparison for all the output ports can be stored in the Allowed Register (AR), unique for the switch, with a size of d bits (d being the switch degree).
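The masked comparison described above can be sketched in software as follows. This is only an illustrative model of the per-port comparators, not the hardware of Figure 1; the function names `allowed_bit` and `allowed_register` and the representation of registers as Python integers are assumptions made for the example.

```python
def allowed_bit(dest: int, mr: int, fi: int, li: int) -> bool:
    """Return the Allowed bit of one output port for destination `dest`.

    mr, fi and li model the Mask Register, First Interval and Last Interval.
    """
    dm = dest & mr  # masked destination (Dm)
    if fi <= li:
        # Ordinary interval: FI <= Dm <= LI
        return fi <= dm <= li
    # Cyclic interval (LI < FI): Dm >= FI or Dm <= LI
    return dm >= fi or dm <= li


def allowed_register(dest: int, ports: list[tuple[int, int, int]]) -> list[bool]:
    """Evaluate every output port; `ports` holds one (MR, FI, LI) triple per port."""
    return [allowed_bit(dest, mr, fi, li) for mr, fi, li in ports]
```

In hardware all ports are evaluated concurrently; the list comprehension only stands in for that parallel evaluation.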

FIR allows the intervals associated to the different output ports of a given switch to overlap; therefore, more than one output port can be allowed. This can be used to provide adaptive routing. However, in order to guarantee deadlock freedom, some routing restrictions must usually be applied. In regular topologies, deadlock freedom is usually ensured by traversing network dimensions following some order. These routing restrictions are taken into account in FIR by means of an additional register associated to each output port, the Routing Restrictions Register (RRR). It defines, for each output port, which other output ports of the switch should be selected prior to this one. This register has one bit per output port. For a given output port i, bit j in the RRR indicates whether output port j has higher preference (bit set to 1) or not (0) than output port i. Thus, the final routing decision for a given output port i is obtained by taking into account the Allowed bits of the other output ports and the bits in its RRR. Figure 2 shows the implementation for output port i. This is done in all the output ports of the switch, and all the results can be stored in a Routing Register (RR), unique per switch, with a number of bits equal to the number of output ports. The RR contains the result of the routing function [2]. In case of adaptive routing, it may contain more than one 1. The selection function [2] selects the output port to be used.
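One possible reading of how the RRR and the Allowed bits combine is sketched below: port i is selected when it is allowed and no allowed port has preference over it. The function name `routing_register` and the list-of-lists layout of the RRRs are assumptions for illustration; the exact gate-level circuit is the one given in Figure 2.

```python
def routing_register(allowed: list[bool], rrr: list[list[bool]]) -> list[bool]:
    """Compute the Routing Register from the Allowed Register and the RRRs.

    rrr[i][j] is True when output port j has preference over output port i.
    """
    d = len(allowed)
    routing = []
    for i in range(d):
        # Port i is blocked if any preferred port j is itself allowed.
        blocked = any(rrr[i][j] and allowed[j] for j in range(d) if j != i)
        routing.append(allowed[i] and not blocked)
    return routing
```

With all RRR bits cleared, every allowed port appears in the result, which corresponds to the adaptive case where the RR may contain more than one 1.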

With virtual channel multiplexing, the proposed registers will be associated to virtual channels. Hence, each virtual channel will be provided with an LI, FI, MR and RRR register. Moreover, the RRR will have one bit per virtual channel and it will define which virtual channels have higher preference. This is also the case for the registers associated to the whole switch (the Allowed and Routing Registers), which will represent virtual channels instead of physical ones. Therefore, the FIR approach requires, at each switch, d × v FI, LI, MR and RRR configuration registers, and one RR and one AR register of size d × v, v being the number of virtual channels per output port.
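As a rough illustration of this register count, the following sketch adds up the bits listed above for a switch. It assumes FI, LI and MR each take as many bits as a node address and that each RRR takes d × v bits; the function name `fir_register_bits` is hypothetical and the Bubble Register introduced later is not included.

```python
from math import ceil, log2


def fir_register_bits(num_nodes: int, d: int, v: int) -> int:
    """Bits of FIR register storage at one switch (FI, LI, MR, RRR, AR, RR)."""
    n = ceil(log2(num_nodes))      # address width in bits
    vcs = d * v                    # virtual channels at the switch
    per_vc = 3 * n + vcs           # FI, LI, MR (n bits each) + RRR (d*v bits)
    return vcs * per_vc + 2 * vcs  # plus the switch-wide AR and RR


# The total grows with the address width (log of the network size),
# not with the number of nodes itself.
small = fir_register_bits(16, d=4, v=1)
large = fir_register_bits(1024, d=4, v=1)
```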

By appropriately setting the proposed configuration registers, the routing algorithm can be defined. In [4] we show how to configure them for the most popular routing algorithms used in mesh and torus networks. Next, we present some illustrative examples for this paper.

2.1 Examples

Figure 1. Generation of the allowed bit at each output port. [Diagram: the n-bit packet destination address is masked with the Mask Register of port i to give Dm; the Bounds Comparison block checks Dm against FI and LI (FI > LI?, LI ≥ Dm?, Dm ≥ FI?) and produces the Allowed bit of port i in the d-bit Allowed Register.]

Figure 2. Routing bit for port i is obtained from the AR and the associated RRR. [Diagram: the d-bit Allowed Register and the d-bit Routing Restrictions Register of port i are combined with Allowed i to produce routing bit i of the d-bit Routing Register.]

First, we present an example showing the register configuration for a 16-node 2-D mesh and deterministic routing. The destination addresses require 4 bits, so the LI, FI and MR will also contain 4 bits (assuming no virtual channel multiplexing). The RRR requires one bit per output port. For a 2-D mesh, 4 bits are needed. Figure 3(a) shows the configuration of the registers associated to the output ports for XY routing. Figure 3(b) shows the prototyped configuration for a central switch ji, i being the two least significant bits of the switch address and j the two most significant bits. The MR selects the two most significant bits of the destination address for the Y links, and the two least significant bits for the X links. LI and FI mark the bounds of the interval associated to each port. Finally, the RRRs enforce the XY routing. In the Y links, the bits in the RRRs corresponding to the X links are set to 1, since packets must be first routed in the X dimension. In the X links, the bits in the RRRs corresponding to the Y ports are set to 0.
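The XY configuration for the 4 × 4 mesh can be sketched as follows. The register values follow the prototype for a switch ji as far as it can be read from Figure 3; the bit positions assigned to the ports (3 = Y+, 2 = X+, 1 = Y−, 0 = X−), the function names `xy_config` and `route`, and the choice to omit ports that do not exist at the mesh border are assumptions made for this example.

```python
K = 4  # nodes per dimension in the 4 x 4 mesh; addresses are 4 bits: jj ii

# Assumed bit positions of the ports in the RRR, AR and RR.
PORTS = {"X-": 0, "Y-": 1, "X+": 2, "Y+": 3}
X_PORTS_MASK = 0b0101  # RRR of the Y ports: X+ and X- have preference


def xy_config(j: int, i: int) -> dict[str, tuple[int, int, int, int]]:
    """Return {port: (MR, FI, LI, RRR)} for switch ji; border ports are omitted."""
    cfg = {}
    if i < K - 1:
        cfg["X+"] = (0b0011, i + 1, K - 1, 0)
    if i > 0:
        cfg["X-"] = (0b0011, 0, i - 1, 0)
    if j < K - 1:
        cfg["Y+"] = (0b1100, (j + 1) << 2, (K - 1) << 2, X_PORTS_MASK)
    if j > 0:
        cfg["Y-"] = (0b1100, 0, (j - 1) << 2, X_PORTS_MASK)
    return cfg


def route(j: int, i: int, dest: int) -> list[str]:
    """Return the output ports selected at switch ji for destination `dest`."""
    cfg = xy_config(j, i)
    ar = 0  # Allowed Register as a bit vector
    for port, (mr, fi, li, _) in cfg.items():
        if fi <= (dest & mr) <= li:
            ar |= 1 << PORTS[port]
    # A port is selected if it is allowed and no preferred port is allowed.
    return [port for port, (_, _, _, rrr) in cfg.items()
            if (ar >> PORTS[port]) & 1 and not (ar & rrr)]
```

For instance, at switch 5 (j = 1, i = 1) a packet for node 15 is allowed on both X+ and Y+, and the RRR of Y+ leaves only X+ selected, which is the XY order.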

In a torus, some action must be performed in order to prevent deadlocks in the rings of each dimension. A possible approach is the use of two virtual channels. See [4] for a detailed FIR register configuration. Another approach that does not require the use of virtual channels is the bubble flow control mechanism [1]. With this mechanism, packets that are injected into the network or cross a network dimension require two free buffers to guarantee deadlock freedom. In order to support the bubble mechanism in our proposal, we require one additional register associated to each output port: the Bubble Register (BR). It has one bit per input port (including the injection ports). The input ports that only need one free buffer to send through the current output port have their associated register bit set to 1. If two buffers are required in the output port because the packet is injected into the network or crosses a network dimension, then the bit must be set to 0. Figure 4 shows how the bubble condition is enforced in the routing decision. The Routing bit associated to output port i will be obtained considering the AR, the RRR, the bit associated to the corresponding input port in the BR and the space available at the output port.
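A minimal sketch of the bubble condition, under the assumption that the space available at the output port is known as a count of free buffers, could look like this. The function name `routing_bit_with_bubble` and its parameters are hypothetical; the actual circuit is the one shown in Figure 4.

```python
def routing_bit_with_bubble(allowed_i: bool, blocked_by_rrr: bool,
                            br_i: list[bool], input_port: int,
                            free_buffers: int) -> bool:
    """Routing bit of output port i when the bubble condition is enforced.

    br_i models the Bubble Register of port i: br_i[p] is True when packets
    arriving from input port p need only one free buffer at port i.
    """
    needed = 1 if br_i[input_port] else 2  # injection or dimension change: 2
    return allowed_i and not blocked_by_rrr and free_buffers >= needed
```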

We consider fully adaptive routing based on Duato’s protocol [2]. In this routing algorithm, each physical link is split into several virtual channels, the adaptive and the escape ones. Escape channels provide a deadlock-free routing subfunction. With the bubble flow control applied to the escape channels, at least two virtual channels are required: one of them is the escape channel (D), which is used according to deterministic routing, and the rest are the adaptive (A) ones. Figure 5 shows the prototyped configuration for a 4 × 4 torus assuming dimension order routing in the escape channels. As the deterministic virtual channels must meet the bubble condition, the BR must be configured. As expected, the RRRs associated to the adaptive channels do not establish any preference among VCs. This means that the routing function returns the adaptive channels that bring the packet closer to its destination and the deterministic channel provided by the deterministic routing subfunction.

3 Fault-Tolerant Methodology

The methodology provides fault-tolerance in both n-dimensional mesh and torus networks, assuming a static fault model; thus, it knows in advance where the failures are located. The proposed methodology focuses only on the computation of the new routing info¹. The methodology assumes that the initial (i.e., without faults) routing algorithm routes packets by using Duato’s fully adaptive routing with at least two virtual channels (at least one adaptive and one escape) per physical channel. The adaptive channel(s) enable routing through any minimal path, whereas the escape channel guarantees deadlock freedom based on the bubble flow control. The methodology computes a fault-free path for each source-destination pair. It avoids faults by using intermediate nodes for routing. Packets are first forwarded from the source node (S) to an intermediate node (I), and later, from this node to the destination node (D). Minimal adaptive routing is used in both subpaths. Notice that packets are not ejected from the network at the I node. When possible, the I node is selected inside the minimal adaptive cube defined by S and D. Thus, both subcubes defined by S and I, and by I and D, are inside this cube, but they are smaller and avoid the failure. Packets can be adaptively routed inside each stretch (S-I and I-D). At the I node, some special actions must be performed in order to avoid deadlocks. We propose the use of two different escape channels. One of them will be used as escape channel for the S-I stretch and the second one for the I-D stretch.

¹ Detection of faults, checkpointing, and distribution of routing info are out of the scope of the methodology.

Figure 3. (a) Configuration of the FIR registers in a 4 × 4 mesh with XY routing. (b) Prototyped configuration of the FIR registers. [Diagram: in (a), every output port of the 16 switches (numbered 0 to 15) is labeled with FI..LI and MR,RRR; the RRR, AR and RR are 4 bits wide. The prototype in (b) for a switch ji reads:]

Port   FI..LI           MR,RRR
X+     00(i+1)..0011    0011,0X00
X−     0000..00(i−1)    0011,000X
Y+     (j+1)00..1100    1100,X101
Y−     0000..(j−1)00    1100,01X1

By using an I node, all the 1-fault combinations are tolerated. However, in a scenario with more than one failure it may be necessary to use additional mechanisms. Misrouting and switching off adaptive routing are used. Misrouting forces routing packets several hops along different directions (up to three directions in our methodology). Once misrouting is completed, normal routing is applied to the packet. In order to be deadlock-free, the directions to misroute a packet must be used according to the order established by the deterministic routing. In particular, the methodology uses the X+Y+Z+X−Y−Z− order, which is deadlock-free and adds routing flexibility (it allows routing packets in both directions of the same dimension).

A packet routed through I nodes requires two subheaders (see Figure 6). The first one is used for routing the packet towards the I node, and the second one for routing the packet towards the final destination. At the I node, the first subheader is removed. Packet subheaders also include control fields about misrouting (direction and hops; up to three misroutings are allowed at each routing phase) and switching off adaptive routing (one bit). The methodology also requires storing the routing info at each source node, but not at the switches. For every destination, this info includes the possible I node (if required) and info about misrouting and switching off adaptive routing.
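The header organization described above can be modeled with a small data structure. This is a sketch only: the class and field names (`SubHeader`, `Packet`, `misroutes`, `reach_intermediate`) are hypothetical, field widths are not modeled, and deriving the I bit from the number of subheaders is an assumption consistent with the text.

```python
from dataclasses import dataclass, field


@dataclass
class SubHeader:
    dst: int                      # Dst1 (intermediate) or Dst2 (destination)
    adaptive: bool = True         # A bit: True adaptive, False deterministic
    # Up to three (direction, hops) misrouting pairs for this routing phase.
    misroutes: list[tuple[str, int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if len(self.misroutes) > 3:
            raise ValueError("at most three misroutings per routing phase")


@dataclass
class Packet:
    subheaders: list[SubHeader]
    payload: bytes = b""

    @property
    def i_bit(self) -> bool:
        """True while the packet still carries an intermediate-node subheader."""
        return len(self.subheaders) == 2

    def reach_intermediate(self) -> None:
        """At the I node, remove the first subheader."""
        if self.i_bit:
            self.subheaders.pop(0)


# A packet sent to node 15 through intermediate node 3, with the second
# phase routed deterministically.
pkt = Packet([SubHeader(dst=3), SubHeader(dst=15, adaptive=False)])
```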

4 Fault-Tolerant FIR

As commented before, the fault-tolerant methodology applied to k-ary n-cubes requires at least three virtual channels per physical channel. Two of them are used as deterministic channels with the bubble flow control. In addition, in order to allow adaptive routing, at least one additional adaptive channel is required. In this section, we show how FIR can provide fault-tolerance using the above described fault-tolerant methodology at the expense of a small amount of extra memory. Only two additional bits per virtual channel (H and Adapt) are required.

Figure 4. Implementation of the bubble condition at output port i. [Diagram: the logic of Figure 2 extended with the Bubble Register of port i, indexed by input port j, and a check for two free buffers at port i.]

Figure 5. FIR register configuration for a 4 × 4 torus with adaptive routing and the bubble flow control mechanism. [Diagram: prototyped switch ji with an adaptive (A) and a deterministic (D) virtual channel on each of the X+, X−, Y+ and Y− ports; each virtual channel is labeled with FI..LI and MR,RRR,BR, and the RRR, AR, RR and BR are 8 bits wide. The intervals read: X+: 00((i+1) mod K)..00((i+K/2) mod K); X−: 00((i+(K+1)/2) mod K)..00((i+K−1) mod K); Y+: ((j+1) mod K)00..((j+K/2) mod K)00; Y−: ((j+(K+1)/2) mod K)00..((j+K−1) mod K)00.]

In Figure 7 we present a possible numbering for the virtual channels associated to the fault-tolerant methodology in a switch of a 2-D torus network. Each output port is split into three virtual channels; two of them are the escape channels, E1 and E0. E0 is used by the packets when traveling from the source to the intermediate node, and E1 is used by the packets when traveling from the intermediate node to the final destination. There is at least one adaptive channel. Figure 7 also shows how to configure the FIR registers. RRRs establish the deterministic routing ordering in the escape channels of the different directions. In this case, deterministic X+ Y+ X− Y− routing is used in the escape channels in order to be able to misroute. Most preference is given to the X+ direction by having the bits corresponding to the X+ escape channels set to 1 in the RRRs corresponding to the other directions. The Y− direction has the least preference, by having the bits corresponding to the escape channels of the other directions set to 1 in its RRR. The escape channel selection will be done by the new switch hardware taking into account the packet header information as shown below. Only the adequate escape channel will be allowed. As expected, the RRRs associated to the adaptive channels do not establish any preference among the VCs. So, the routing function returns all the adaptive channels that bring the packet closer to its destination and the corresponding deterministic channel following the deterministic routing subfunction. The bubble condition in the escape channels is implemented by the BR².

² The bits corresponding to the injection channels are not shown. They are set to 0. Two buffers are required to inject a new packet.

Once the FIR registers have been configured, we next present how to modify the associated switch hardware to take into account the packet header information related to fault-tolerance. As stated above, packet headers include information about intermediate nodes, disabling adaptivity and misrouting. After decoding the packet header, in addition to the destination identifier, four control bits are extracted or generated (see Figure 8) to be used by the hardware associated to fault-tolerance. The I and A bits correspond to the same bits of the packet header. The Mis bit is set to one if any of the hops fields is not 0 in the subheader corresponding to the current routing phase. The Dir field has a number of bits equal to the switch degree and contains a “1” in the position that corresponds to the current direction (port) that must be used to misroute the packet. Mis can be obtained by using three comparators, one per hops field in the packet subheader. The Dir bit associated to each output port can be generated with a decoder. Its input is the first direction to misroute, which is obtained by using a multiplexer (implemented by the tristate gates and their control logic in the figure). Figure 8 shows how the first subheader is processed. Once the intermediate node is reached, this subheader is removed and then the second subheader is processed in the same way.
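The generation of the Mis bit and the Dir field can be sketched as below. The sketch assumes that "the first direction to misroute" is the first (direction, hops) pair whose hops field is not 0, and that directions are given as output port indices; the function name `control_bits` is hypothetical.

```python
def control_bits(misroutes: list[tuple[int, int]],
                 degree: int) -> tuple[bool, list[bool]]:
    """Return (Mis, Dir) for the subheader of the current routing phase.

    misroutes holds the (direction port index, hops) pairs of the subheader.
    """
    # Mis: one comparator per hops field, checking for a non-zero value.
    mis = any(hops != 0 for _, hops in misroutes)
    # Dir: one-hot decoding of the first direction that still has hops left.
    dir_bits = [False] * degree
    if mis:
        first = next(d for d, hops in misroutes if hops != 0)
        dir_bits[first] = True
    return mis, dir_bits
```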

The proper escape channel must be selected depending on whether the packet is traveling to the intermediate node or to the final destination (this information is given by the I bit). In order to do that, we propose to extend Figure 1 as shown in Figure 9(a). The figure will provide the Allowed bit for the escape channels, E0 and E1, taking into account the I bit associated to the packet header that indicates the routing phase. E0 is allowed only if I is 1, that is, the packet is traveling to the intermediate node. E1 is allowed only if I is 0, that is, the packet is traveling to its final destination. H is a new configuration bit that indicates which escape channel corresponds to the current virtual channel. H is set to 0 in E0 escape channels and to 1 in E1 escape channels. The E0 Allowed bit will be set to 1 if H is 0 and I is 1. The E1 Allowed bit will be set to 1 if H is 1 and I is 0. An XOR gate serves this purpose. On the other hand, the adaptive channels are not affected by the I bit, as they can be used in both routing phases.

Concerning disabling adaptivity, when the A bit is set to 0 in any of the packet subheaders, the packet must be routed deterministically in that phase, and therefore it must use the deterministic virtual channels corresponding to the proper escape channel. So, if the A bit is set to 0, the adaptive channels are disabled. This is shown in Figure 9.(b). Adapt is another new configuration bit that indicates whether the current virtual channel is adaptive. Adapt will be set to 1 in the adaptive virtual channels, and to 0 in the escape channels. The hardware shown in Figure 9.(a) works for the escape channels, whereas the one shown in Figure 9.(b) works for the adaptive ones. As a given virtual channel should be configured either as escape or adaptive, the hardware required to generate the Allowed bit is obtained by properly merging both figures, as shown in Figure 10.

Figure 6. Packet header for the fault-tolerant methodology. Dst1: intermediate node; Dst2: destination node; di: direction to misroute; hi: number of hops to misroute; A: 1 adaptive routing, 0 deterministic routing; I: 1 with intermediate node (2 headers), 0 without intermediate node (1 header).

Figure 7. Numbering of the virtual channels for the fault-tolerant methodology in a switch of a 2-D mesh or torus network. FI, LI, MR and RRR configuration for a 4 × 4 torus.

Figure 10. Combination of Figure 9.(a) and Figure 9.(b).

So, up to this point, two new switch configuration bits (H and Adapt) are required per virtual channel. Therefore, we propose adding two registers to the switch, each with a number of bits equal to the number of virtual channels of the switch, that is, d × v bits. The Adaptive Register consists of the Adapt bits of all virtual channels; the bits corresponding to the adaptive channels are set to 1. The Escape Register consists of the H bits of all virtual channels; the bits corresponding to E1 escape channels are set to 1. In addition, some extra gates are required, as shown in Figure 10.
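As a rough illustration of these two registers, the following Python sketch packs the Adapt and H bits into integers used as bit masks and derives the Allowed bit of a virtual channel from them. The bit ordering (E0, E1, A repeated for each port) and the helper names are assumptions made for this example; the actual channel numbering is the one given in Figure 7.

```python
from typing import List, Tuple


def build_registers(channel_kinds: List[str]) -> Tuple[int, int]:
    """Return (adaptive_register, escape_register) as integer bit masks.

    channel_kinds[k] is "A", "E0" or "E1" for virtual channel k
    (hypothetical ordering, one entry per virtual channel: d * v in total).
    """
    adaptive_reg = 0
    escape_reg = 0
    for k, kind in enumerate(channel_kinds):
        if kind == "A":
            adaptive_reg |= 1 << k   # Adapt bit = 1 for adaptive channels
        elif kind == "E1":
            escape_reg |= 1 << k     # H bit = 1 for E1 escape channels
    return adaptive_reg, escape_reg


def allowed_bit(k: int, in_bounds: bool, a_bit: int, i_bit: int,
                adaptive_reg: int, escape_reg: int) -> bool:
    """Merged Allowed logic for virtual channel k (misrouting not considered)."""
    adapt = (adaptive_reg >> k) & 1
    h = (escape_reg >> k) & 1
    if adapt:
        # Adaptive channels are disabled when the packet A bit is 0.
        return bool(in_bounds) and a_bit == 1
    # Escape channels are selected by the routing phase (H XOR I).
    return bool(in_bounds) and bool(h ^ i_bit)


# Example: d = 4 ports, v = 3 virtual channels per port -> 12-bit registers.
kinds = ["E0", "E1", "A"] * 4
adaptive_reg, escape_reg = build_registers(kinds)
print(f"Adaptive Register: {adaptive_reg:012b}")
print(f"Escape Register:   {escape_reg:012b}")
```

Both registers together take 2 × d × v bits per switch, which is the extra memory accounted for later in the cost analysis.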

Next, we present how to implement the third mechanism of the fault-tolerant methodology: misrouting. As explained above, misrouting forces packets to be routed several hops along several directions, at most three directions in each routing phase, as indicated in the packet subheader (see Figure 6). Once the misrouting hops are consumed, regular routing (deterministic or adaptive, depending on the A bit) is applied to the packet. Notice that a packet cannot be routed adaptively while it is being misrouted, so the adaptive channels are disabled when a packet is marked to be misrouted, and the corresponding escape channel is used instead. Figure 11.(b) implements the Allowed bit for the adaptive channels when misrouting is considered. Remember that the Mis bit is set to 1 when any of the hops fields in the packet subheader corresponding to the current routing phase is larger than 0. In this case, the adaptive channels are not allowed.
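The rule for adaptive channels under misrouting can be summarized with a small Python sketch. It only restates the conditions given in the text; the function names are hypothetical, and the way the three conditions are combined (a simple AND) is our reading of the description rather than a transcription of Figure 11.(b).

```python
from typing import Sequence


def mis_bit(hops_fields: Sequence[int]) -> int:
    """Mis bit: 1 if any hops field of the current subheader is larger than 0."""
    return int(any(h > 0 for h in hops_fields))


def adaptive_allowed(in_bounds: bool, a_bit: int, mis: int) -> bool:
    """Allowed bit for an adaptive channel when misrouting is considered.

    The channel is usable only if the destination is inside the interval
    bounds, adaptivity is enabled (A = 1) and no misrouting is pending.
    """
    return bool(in_bounds) and a_bit == 1 and mis == 0


# A packet with pending misrouting hops cannot use adaptive channels.
print(adaptive_allowed(True, 1, mis_bit([0, 2, 0])))
# Once all hops fields are 0, adaptive routing is available again.
print(adaptive_allowed(True, 1, mis_bit([0, 0, 0])))
```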

When misrouting a packet, the proper escape channel corresponding to the first direction to misroute must be used, even if the packet destination is not inside the corresponding interval bounds, since misrouting can provide non-minimal paths. Again, E0 will be used if the packet is traveling to the intermediate node, or E1 if the packet is on its way to the final destination. The direction to misroute is the first one with a hops field larger than 0 in the packet subheader corresponding to the current routing phase. As the packet travels along its path, the number of hops associated with that direction is decreased in the packet subheader, and when the number of hops reaches 0, the next direction to misroute indicated in the packet subheader is followed. Figure 11.(a) shows how escape channels are selected. Remember that the Dir bit associated with output port i is obtained from the packet subheader and indicates (when set to 1) that the first direction with a hops field different from 0 in the packet subheader corresponds to the direction associated with output port i (i.e., the port that the virtual channel belongs to). Once the hops of all the directions to misroute in the current routing phase are consumed, the Mis bit is set to 0, and therefore packets are routed in an adaptive/deterministic way following a minimal path (taking into account the FI..LI registers). Figure 12 combines Figure 11.(a) and Figure 11.(b).

Figure 8. Fault-tolerance control bits (I bit, A bit, Mis bit and the d Dir bits) are obtained from the packet header.

Figure 9. (a) Logic to select the proper escape channel. (b) Logic to disable adaptive routing.

Figure 12. Combination of Figures 11.(a) and 11.(b).
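The handling of the misrouting fields can be illustrated with the following Python sketch. It models a subheader as a list of up to three (direction, hops) pairs, derives the current direction to misroute, decrements the hops as the packet advances, and computes the Allowed bit of an escape channel. All names are hypothetical, directions are represented as strings such as "X+" for readability, and the combination of conditions is our interpretation of the text, not the exact circuit of Figure 11.(a).

```python
from typing import List, Optional, Tuple

Misroute = List[Tuple[str, int]]  # up to three (direction, hops) pairs


def current_direction(misroute: Misroute) -> Optional[str]:
    """First direction whose hops field is larger than 0, or None."""
    for direction, hops in misroute:
        if hops > 0:
            return direction
    return None


def consume_hop(misroute: Misroute) -> None:
    """Decrease the hops of the direction currently being followed."""
    for idx, (direction, hops) in enumerate(misroute):
        if hops > 0:
            misroute[idx] = (direction, hops - 1)
            return


def escape_allowed(in_bounds: bool, h: int, i_bit: int,
                   port_direction: str, misroute: Misroute) -> bool:
    """Allowed bit for an escape channel of the port facing port_direction."""
    phase_ok = bool(h ^ i_bit)          # E0 if I = 1, E1 if I = 0
    target = current_direction(misroute)
    if target is not None:              # Mis bit = 1
        # Dir bit: the port must match the direction to misroute;
        # the interval bounds are ignored (non-minimal path).
        return phase_ok and port_direction == target
    return phase_ok and bool(in_bounds)  # regular minimal routing


# Packet in its final phase (I = 0) that must misroute 2 hops along X+.
fields: Misroute = [("X+", 2), ("Y-", 1), ("X+", 0)]
print(escape_allowed(False, 1, 0, "X+", fields))  # E1 of port X+
consume_hop(fields)
consume_hop(fields)
print(current_direction(fields))                  # next direction to follow
```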

Summing up, the FIR strategy permits implementing the fault-tolerant methodology with an extra amount of memory of just two registers (Adaptive and Escape Registers) per switch, each with a number of bits equal to the number of virtual channels in the switch. The switch hardware is only slightly more complex: some logic gates must be added to Figure 1 to obtain the Allowed bits (see Figure 12). Our goal was to show how the FIR scheme can easily be enhanced to support fault-tolerance with a few changes in the switch hardware and very little additional memory.

Indeed, the Fault-Tolerant FIR scheme still supports a conventional (i.e., non-fault-tolerant) routing algorithm. In that case, the Adapt bit should be properly configured and the H bit of escape channels will be set to 0 in order to select only the E0 (the only one) escape channel. To be fully compatible, the packet header still has to include the fields that correspond to disabling adaptivity and misrouting, as well as the I bit. Of course, in this case, no intermediate nodes are used, adaptive routing is always enabled and misrouting is not permitted.

5 Evaluation of the Fault-Tolerant FIR

In this section, we evaluate the proposed strategy. We are interested in analyzing its fault-tolerance and performance. We will also give some comments about the routing delay and the amount of memory required.

5.1 Fault-Tolerance and Performance

As the proposed FT-FIR scheme is a hardware implementation of the fault-tolerance mechanism proposed in [3], the fault-tolerance degree and performance of the proposal are the same as those obtained by the fault-tolerant methodology. As shown in [3], the proposed mechanism is 7-fault tolerant. Moreover, the percentage of tolerated fault combinations is greater than 99.9% for up to 14 failures. Interconnection network performance is also the same, since the routing algorithm is the same. In the presence of failures, the performance degradation is the one inflicted by the fault-tolerant methodology, which is small. As an example, network throughput degrades by less than 10% when injecting 14 random failures in a 512-node torus.


Figure 11. Logic to apply misrouting. (a) The first direction indicated in the packet subheader corresponds to the direction of the current output channel, thus selecting an escape channel. (b) Adaptive channels are disabled when misrouting.

5.2 Routing Delay

The routing delay of the original FIR scheme is given by the time required to check whether the destination address is inside the interval (all the ports make these comparisons concurrently), thus generating the Allowed bits, plus the time to merge this information with the routing restrictions (RRR). In FT-FIR, this delay is slightly increased in order to properly consider the additional constraints of the fault-tolerant methodology. However, as part of this hardware may work in parallel with the interval comparison, we expect the increase in routing delay with respect to FIR to be small. Notice, though, that some hardware would also be required to support the fault-tolerance mechanism in a conventional routing scheme. This additional hardware would also increase the routing delay of conventional routing.

5.3 Memory Required by the Fault-Tolerant FIR Strategy

In this section we compare the amount of memory required at the switches by the FT-FIR strategy and by an implementation based on forwarding tables. Notice that the fault-tolerant information is stored at the source nodes, not at the switches.

Assume that we have a network composed of N nodes, built with switches with d ports. The links attached to each port are split into up to v virtual channels. Finally, the routing algorithm offers a maximum of r routing options. The FT-FIR approach needs to associate configuration registers with each virtual channel: three of them (FI, LI and MR) of size log(N) bits and two (RRR and BR) of size d × v bits. This holds regardless of whether the routing is deterministic or adaptive, and of the number of routing options provided in the latter case. In addition, two configuration registers, the Adaptive and Escape Registers, of size d × v bits each, are associated with the whole switch in order to provide fault-tolerance. Therefore, the total number of bits required to implement the FT-FIR strategy is $C_{FTFIR} = d \times v \times (3 \times \log(N) + 2 \times d \times v) + 2 \times d \times v$ bits. So, its cost remains O(log(N)), as in FIR. On the other hand, routing based on forwarding tables requires a table with as many entries as the number of nodes, and each entry must contain the port(s) returned by the routing function. Hence, the cost of this alternative is $C_{FT} = N \times \log(d \times v) \times r$ bits in each switch. This cost is O(N), which is not scalable with the network size. In addition, it also requires the proper logic to manage the fault-tolerant mechanism.
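To get a feeling for how the two costs grow, the following Python sketch evaluates both expressions for a few network sizes. The values d = 6, v = 3 and r = 2 are illustrative assumptions, not figures taken from the evaluation, and we assume base-2 logarithms rounded up to whole bits.

```python
import math


def ft_fir_bits(n_nodes: int, d: int, v: int) -> int:
    """Per-switch memory of FT-FIR: d*v*(3*log(N) + 2*d*v) + 2*d*v bits."""
    log_n = math.ceil(math.log2(n_nodes))   # assumption: log base 2, rounded up
    per_channel = 3 * log_n + 2 * d * v     # FI, LI, MR plus RRR, BR
    return d * v * per_channel + 2 * d * v  # plus Adaptive and Escape Registers


def forwarding_table_bits(n_nodes: int, d: int, v: int, r: int) -> int:
    """Per-switch memory of a forwarding table: N * log(d*v) * r bits."""
    return n_nodes * math.ceil(math.log2(d * v)) * r


# Illustrative parameters (assumed): 6 ports, 3 virtual channels, 2 options.
for n in (64, 512, 4096):
    print(n, ft_fir_bits(n, 6, 3), forwarding_table_bits(n, 6, 3, 2))
```

With these assumed parameters, the FT-FIR cost grows only through the log(N) term, whereas the forwarding-table cost grows linearly with N, so the table becomes the larger of the two as the network grows.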

6 Conclusions

In this paper, we have proposed a fault-tolerant distributed routing strategy for commercial switches, Fault-Tolerant Flexible Interval Routing (FT-FIR), that does not require forwarding tables at switches. The strategy provides both scalable routing and fault-tolerance, assuming commercial switches for regular direct topologies, while requiring a relatively small amount of hardware. The proposed scheme is scalable: only 7 registers (two of them of just one bit) must be associated with each virtual channel of the output ports, with total memory requirements of O(log(N)). The strategy is flexible. Moreover, it provides a high degree of fault-tolerance while inflicting a minimal decrease of performance in the presence of faults by implementing the efficient fault-tolerant mechanism proposed in [3], based on the use of intermediate nodes, misrouting and disabling adaptive routing.

References

[1] C. Carrion, R. Beivide, J.A. Gregorio, and F. Vallejo. A Flow Control Mechanism to Avoid Message Deadlock in K-ary N-Cube Networks. Fourth Int. Conference on High Performance Computing, pp. 332-329, December 1997.

[2] J. Duato, S. Yalamanchili and L. Ni. Interconnection Networks. An Engineering Approach. Morgan Kaufmann, 2004.

[3] M.E. Gomez, J. Duato, J. Flich, P. Lopez, A. Robles, N.A. Nordbotten, T. Skeie, and O. Lysne. A New Adaptive Fault-Tolerant Routing Methodology for Direct Networks, in Proc. Int. Conference on High Performance Computing, 2004.

[4] M.E. Gomez, P. Lopez, J. Duato. A Memory-Effective Routing Strategy for regular Interconnection Networks, to appear in Proc. International Parallel and Distributed Processing Symposium, 2005. Best Paper Award in the Architecture Track.

[5] N. Santoro and R. Khatib. Routing without routing tables. Tech. report SCS-TR-6, School of Computer Science, Carleton University, 1982. Also as: Labelling and Implicit Routing in Networks, Computer Journal 28(1), 1985, pp. 5-8.
