
[IEEE Comput. Soc. Press 11th International Parallel Processing Symposium - Geneva, Switzerland (1-5...

Deadlock- and Livelock-Free Routing Protocols for Wave Switching *

José Duato, Pedro López
Facultad de Informática

Universidad Politécnica de Valencia

Sudhakar Yalamanchili Computer Systems Research Laboratory

School of Electrical and Computer Engineering P.O.B. 22012

46071 - Valencia, SPAIN. E-mail: jduato@gap.upv.es

Abstract

Wave switching is a hybrid switching technique for high performance routers. It combines wormhole switching and circuit switching in the same router architecture. Wave switching achieves very high performance by exploiting communication locality. When two nodes are going to communicate frequently, a physical circuit is established between them. By combining circuit switching, pre-established physical circuits and wave pipelining across channels and switches, it is possible to increase network bandwidth considerably, also reducing latency for communications that use pre-established physical circuits.

In this paper, we propose two protocols for routers implementing wave switching. The first protocol handles the network as a cache of circuits, automatically establishing a circuit when two nodes are going to communicate. Subsequent communications use the previously established circuit. When a new circuit requests channels belonging to another circuit, a replacement algorithm selects the circuit to be torn down. The second protocol relies on the programmer and/or the compiler to decide when a circuit should be established or torn down for a set of messages. Also, we show that the proposed protocols are always able to deliver messages, and are deadlock- and livelock-free.

1. Introduction

Distributed memory multiprocessors rely on an interconnection network to exchange information between nodes. In multicomputers [2], each processor has a local address space and the interconnection network is used for message passing between processors. In distributed shared-memory multiprocessors (DSMs), the interconnection network is used either to access remote memory locations [16] or to support a cache coherence protocol [17]. Nowadays, these architectures use similar hardware support to implement the interconnection network [14, 16, 17]. State-of-the-art interconnection networks use low dimensional topologies and wormhole switching [5].

* Supported by the Spanish CICYT under Grant TIC96-0510-C02-01

1063-7133/97 $10.00 © 1997 IEEE

Georgia Institute of Technology, Atlanta, Georgia 30332-0250
E-mail: sudha@ee.gatech.edu

Multicomputers usually send messages by calling a system function. This system call has a considerable overhead due to buffer allocation at source and destination nodes, message copying between user and kernel space, packetization, in-order delivery and end-to-end flow control. Even for a very efficient messaging layer based on active messages [20], software overhead accounts for 50-70% of the total cost [15]. Therefore, reducing the network hardware latency has a minimal impact on performance. On the other hand, messages are directly sent by the hardware in DSMs, as a consequence of remote memory accesses or coherence commands. Reducing the network hardware latency and increasing network throughput is crucial to improve the performance of DSMs.

Satisfying the requirements of multicomputers and DSMs is not a trivial task. Wormhole routers are simple and fast. The main limitation of wormhole switching is the contention produced by blocked messages. Those messages remain in the network, preventing the use of the channels they occupy and wasting channel bandwidth. Virtual channels can increase throughput considerably by dynamically sharing the physical bandwidth among several messages [7]. Another approach to reduce contention and improve channel utilization consists of using adaptive routing [11]. Adaptive routing algorithms must be carefully designed to avoid deadlocks [8, 9]. Virtual channels can be combined with adaptive routing to maximize throughput. Unfortunately, virtual channels and adaptive routing make the router more complex, increasing node delay [4]. As a consequence, latency may increase.

Latency can also be reduced by using an appropriate mapping of processes to processors, exploiting spatial locality in communications. In many cases, this locality is not only spatial but also temporal. If the router architecture supports circuit switching, the compiler could generate instructions directing the router to set up a path or circuit that will be heavily used during a certain period of time. Also, once a circuit has been established, there is no contention for the messages using that circuit.

It is possible to design a router architecture such that a circuit is set up and left open for future transmission of data items. As far as we know, this technique was first proposed in [3] for systolic communication. It has also been proposed in [13] for message passing.

The underlying idea behind pre-established circuits is similar to the use of cache memory in a processor. A common limitation of both caches and networks is that they are limited resources. If a new circuit requires the use of some channels belonging to previously established circuits, those circuits must be torn down. Another common aspect of caches and networks is that they may require compiler support to work more efficiently. Prefetching is an efficient technique to hide memory latency in case of cache misses [18]. Similarly, when two nodes are going to exchange several messages, the compiler could generate instructions directing the router to set up a circuit between those nodes before that circuit is needed. When data are available, the circuit has already been established. Thus, message transfer will be faster because header routing time and contention have been eliminated. There is an important difference between caches and circuits. Caches are much faster than core memory. However, circuits offer the same bandwidth as regular channels. If circuits were able to use faster channels, network performance would increase considerably because bandwidth would be allocated to messages that really need that bandwidth.

In a previous paper, we proposed a new hybrid switching technique called wave switching as well as a router architecture supporting it [10]. Wave switching implements wormhole switching and circuit switching concurrently. By combining circuit switching, pre-established physical circuits¹ and wave pipelining across channels and switches, it is possible to increase network bandwidth considerably, also reducing latency for communications that use pre-established physical circuits. As shown in [10], wave switching is able to reduce latency and increase throughput by a factor higher than three if messages are long enough (≥ 128 flits), even if circuits are not reused. For short messages, wave switching can only improve performance if circuits are reused.

In this paper, we propose two routing protocols to establish and tear down circuits. The first protocol handles the network as a cache of circuits. The second protocol relies on the programmer and/or the compiler to decide when a circuit should be established or torn down for a set of messages. Also, we show that the proposed protocols are deadlock- and livelock-free. The router architecture for wave switching proposed in [10] is described in section 2. Section 3 proposes two routing protocols for wave switching. Those protocols are shown to be deadlock-free and livelock-free in section 4. Finally, the paper presents some concluding remarks and directions for future work.

¹ A physical circuit is a circuit made of physical channels. A virtual circuit is a circuit made of virtual channels. Where it is irrelevant to the discussion, we will refer to them simply as circuits.

Figure 1. Typical architecture of a wormhole router (input queues and output queues organized as virtual channels, connected from/to the local processor)

2. Router Architecture for Wave Switching

When the network topology has more dimensions than the number of physical dimensions used to implement it, some channels require long wires. In this case, channel delay has a major impact on clock frequency. Some researchers have proposed the use of pipelined channels to reduce the impact of wire length on clock frequency [19]. Pipelined channels use wave pipelining: new data are injected into a channel before previously injected data reach the other end of the channel. Propagation speed is only limited by wire capacitance. At a given time, several data are propagating along a channel. As wires have some capacitance, the wave front is not sharp, limiting the maximum frequency at which data can be pipelined.
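The limits just described can be captured in a small back-of-the-envelope model: the number of data items ("waves") concurrently in flight on a wire is the wire delay divided by the clock period, and the clock period itself cannot shrink below the sum of skew, setup time and the rise time of the non-sharp wave front. The function names and numeric values below are our own illustrative assumptions, not figures from the paper.

```python
# Illustrative model of wave pipelining on a long wire (all names and
# numbers are assumptions made for this sketch).

def waves_in_flight(wire_delay_ns: float, clock_period_ns: float) -> int:
    """Data items propagating concurrently along one wire."""
    return int(wire_delay_ns // clock_period_ns)

def min_clock_period_ns(skew_ns: float, setup_ns: float, rise_ns: float) -> float:
    """The clock period is bounded below by signal skew, latch setup
    time and the rise time of the (non-sharp) wave front."""
    return skew_ns + setup_ns + rise_ns

# A 4 ns wire clocked every 1 ns carries four waves at once.
assert waves_in_flight(4.0, 1.0) == 4
# Reducing skew allows a shorter clock period, i.e. a higher frequency.
assert min_clock_period_ns(0.1, 0.2, 0.3) < min_clock_period_ns(0.4, 0.2, 0.3)
```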

The use of pipelined channels allows the designer to compute clock frequency independently of wire delay. Although long wires do not affect the speed of the router, clock frequency is still selected by considering routing delay and switch delay. In this section, we present a router architecture that allows higher clock frequencies, increasing bandwidth considerably. It is based on the use of wave pipelining across both switches and channels.

Assume that the router supports circuit switching and that a circuit has been established. Let us analyze the requirements to support wave pipelining. If we consider the router architecture shown in Figure 1, we can see that it is possible to pipeline flits across the switch as fast as flit buffers are able to deliver flits. Similarly, flits can be pipelined into physical channels at the same speed. However, using a higher clock frequency implies that the round-trip delay for sending flits across channels and receiving acknowledgments requires more clock cycles. As a consequence, a windowing protocol with a longer window should be used. A longer window also requires deeper buffers. Finally, deeper buffers require a longer delay to reach the first empty buffer in the worst case, therefore increasing latency and limiting the increase in clock frequency.

Therefore, a different approach is required if we want to take full advantage of wave pipelining across switches and channels. Taking into account that the circuit has been previously established, flits will not find any busy channel on their way towards the destination node. If there is no contention with other messages, then there is no need for flow control, unless virtual channels are used.

Assume that physical channels are not split into virtual channels. In this case, circuits use physical channels and flow control can be removed. As a consequence, flit buffers are no longer required and every switch transmits information directly to the switch in the next node. However, a careful design is required to minimize the skew between wires in a parallel data path. Synchronizers are required at each delivery channel. Synchronizers may also be required at each switch input to reduce the skew. This switching technique is known as circuit switching. With circuit switching, it is possible to use wave pipelining across switches and physical channels, clocking at a very high frequency. Basically, clock frequency is limited by memory bandwidth, by signal skew and by latch setup time. With a proper router and memory design, this frequency can be much higher than the one used in current routers, increasing channel bandwidth and network throughput accordingly. As shown in [10], circuit simulations using Spice indicated that clock frequency could be up to four times higher than in a wormhole router using the same technology. Note that latency for sending information on pre-established physical circuits is also considerably reduced because flit buffers have been removed.

However, pre-established physical circuits require the use of dedicated physical channels to allow multiple circuits between routers. Also, some mechanism is required to set up and tear down circuits. This can be solved by using a hybrid router architecture as shown in Figure 2. This architecture implements wormhole switching and circuit switching concurrently. Circuits are built using physical channels. However, physical circuits are set up and torn down by using a set of dedicated virtual channels, which will be referred to as control channels. Wormhole switching uses another set of dedicated virtual channels. This router architecture has several switches S0, S1, ..., Sk and two routing control units. One of them, together with switch S0, implements wormhole switching. The second routing control unit implements pipelined circuit switching (PCS) [12] as follows: the remaining switches S1, ..., Sk implement circuit switching on pre-established physical circuits using wave pipelining. As circuit switching does not provide flow control at the link level, physical channels are split into narrower physical channels. Although there is no flow control at the link level, end-to-end flow control is required between the injection buffer at the source node and the delivery buffer at the destination node. Taking into account the round-trip delay for control signals (i.e., acknowledgments), a windowing protocol is implemented. This protocol requires deep delivery buffers to prevent buffer overflow while acknowledgments are transmitted towards the source node of the message.

Figure 2. New router architecture
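The windowing requirement can be made concrete: at a given clock period, the round trip for a flit and its acknowledgment spans some number of cycles, and the delivery buffer must hold at least that many flits to avoid overflow. This is the standard sliding-window sizing rule under assumed parameters, not the paper's exact hardware design; all names are ours.

```python
import math

# Generic sliding-window sizing for the end-to-end flow control sketched
# above: buffers must cover the flits in flight during one round trip.

def round_trip_cycles(wire_delay_ns: float, clock_period_ns: float) -> int:
    """Cycles from injecting a flit to receiving its acknowledgment
    (forward trip plus acknowledgment trip, ignoring router delay)."""
    return math.ceil(2 * wire_delay_ns / clock_period_ns)

def min_window(wire_delay_ns: float, clock_period_ns: float) -> int:
    """Flits that may be outstanding; the delivery buffer needs at
    least this many slots."""
    return round_trip_cycles(wire_delay_ns, clock_period_ns)

# Quadrupling the clock frequency quadruples the required window depth,
# which is why faster clocks demand deeper delivery buffers.
assert min_window(4.0, 0.25) == 4 * min_window(4.0, 1.0)
```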

Each physical channel in switch S0 is split into k + w virtual channels. Among them, k channels are the control channels associated with the corresponding physical channels in switches S1, ..., Sk. Control channels are only used to set up and tear down physical circuits. These channels have capacity for a single flit because they only transmit control flits. Control channels are handled by the PCS routing control unit. The remaining w virtual channels are used to transmit messages using wormhole switching and require deeper buffers. They are handled by the wormhole routing control unit, as mentioned above. The hybrid switching technique implemented by this router architecture will be referred to as wave switching.
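The k + w split of each S0 physical channel can be modeled directly. The single-flit depth of control channels comes from the text above; the wormhole buffer depth and the class and field names are our own assumptions.

```python
from dataclasses import dataclass

# Minimal data model of the hybrid port organization: k single-flit
# control channels (one per wave-pipelined switch S1..Sk) plus w deeper
# wormhole virtual channels on each physical channel of switch S0.

@dataclass
class VirtualChannel:
    purpose: str       # 'control' (PCS) or 'wormhole'
    buffer_depth: int  # in flits

def s0_channels(k: int, w: int, wormhole_depth: int = 4):
    """Virtual channels multiplexed on one S0 physical channel.
    wormhole_depth is an assumed value; the paper only says 'deeper'."""
    control = [VirtualChannel('control', 1) for _ in range(k)]
    wormhole = [VirtualChannel('wormhole', wormhole_depth) for _ in range(w)]
    return control + wormhole

chans = s0_channels(k=2, w=3)
assert len(chans) == 5                              # k + w in total
assert all(c.buffer_depth == 1 for c in chans[:2])  # control: one flit each
```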

The proposed router architecture allows sending messages using wormhole switching. Messages are routed using either a deterministic or an adaptive routing algorithm, blocking if necessary on busy channels. Thus, the routing algorithm must be deadlock-free. The router also allows establishing physical circuits on switches S1, ..., Sk. These circuits are established by sending a probe through the control channels. In order to maximize the probability of establishing a circuit, a misrouting backtracking protocol with a maximum of m misroutes is used (MB-m) [12]. Once the physical circuit has been reserved, an acknowledgment is returned. The only difference with respect to the PCS proposed in [12] is that the path being reserved is formed by a different set of channels, i.e., those using switches S1, ..., Sk. Therefore, the probe only reserves a fragment of a physical circuit if the corresponding channels (a bidirectional control channel and the associated physical channel in switch Si, i ∈ [1..k]) are free. Both of them are reserved at the same time. Therefore, status registers are only needed for control channels in the PCS routing control unit. The control channel is needed to backtrack if the need arises, to return the acknowledgment, and to release the physical circuit. Once a circuit is no longer required, it is torn down by sending a control flit through the same path using the control channels.
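The probe's behavior can be sketched in software as a depth-first search with a misroute budget: it reserves channels as it advances, counts a hop as a misroute when it does not get closer to the destination, and releases its reservations when it backtracks. The graph model, the distance-based profitability test, and the use of a visited set in place of the hardware History Store are simplifications of our own.

```python
# Software sketch of probe-based path setup in the spirit of MB-m.

def setup_circuit(adj, free, src, dst, m, dist):
    """adj: node -> neighbors; free: set of free directed channels;
    dist(a, b): hop distance. Returns the reserved path or None."""
    def search(node, misroutes, path, visited):
        if node == dst:
            return path
        for nxt in adj[node]:
            ch = (node, nxt)
            if ch not in free or nxt in visited:
                continue
            # A hop that does not reduce the distance is a misroute.
            cost = 0 if dist(nxt, dst) < dist(node, dst) else 1
            if misroutes + cost > m:
                continue
            free.discard(ch)                       # reserve the channel
            found = search(nxt, misroutes + cost, path + [ch], visited | {nxt})
            if found is not None:
                return found
            free.add(ch)                           # backtrack: release it
        return None
    return search(src, 0, [], {src})

# 2x2 mesh with the direct channel (0,0)->(1,0) already taken by a circuit.
adj = {(0, 0): [(1, 0), (0, 1)], (1, 0): [(0, 0), (1, 1)],
       (0, 1): [(0, 0), (1, 1)], (1, 1): [(1, 0), (0, 1)]}
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
free = {(u, v) for u in adj for v in adj[u]} - {((0, 0), (1, 0))}

assert setup_circuit(adj, set(free), (0, 0), (1, 0), 1, manhattan) is not None
assert setup_circuit(adj, set(free), (0, 0), (1, 0), 0, manhattan) is None
```

With m = 1 the probe detours through (0, 1) and (1, 1); with m = 0 it must give up, mirroring how a larger misroute budget raises the probability of establishing a circuit.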

Figure 3 shows the status registers associated with the PCS routing control unit. The Channel Status registers indicate whether the corresponding channel is free or busy. This can be easily extended to handle faulty channels. These registers are associated with each output control channel. The direct and reverse mappings between input and output channels are stored in the Direct Channel Mappings and Reverse Channel Mappings registers, respectively. As mentioned above, the reverse path is required to return acknowledgments. The History Store keeps track of the output links that have already been searched by the probe. Together, the History Store registers of the whole network keep track of the paths already searched by each probe, therefore avoiding the repeated search of the same path. By storing information about searched links in the History Store register, the probe is kept small. Finally, a control bit associated with each output control channel (Ack Returned) indicates whether the acknowledgment for path setup has been returned through that channel.

Figure 3. Status registers of the PCS routing control unit

Figure 4. Format of a routing probe (fields: Header, Backtrack, Misroute, Force, X1-offset, ..., Xn-offset)

The format of a routing probe is shown in Figure 4. The Header bit identifies the flit as a probe. The Backtrack bit indicates whether the probe is progressing or backtracking. The Misroute field indicates the number of misrouting operations performed by the probe. The Force bit is used by one of the routing protocols to force channel release. The remaining fields are offsets from the destination node.
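One plausible bit-level packing of these fields, purely for illustration; the field widths and ordering are our guesses, since the paper does not specify the hardware encoding.

```python
# Hypothetical encoding of the probe fields from Figure 4 (Header,
# Backtrack, Misroute, Force, per-dimension offsets). Widths assumed.

def encode_probe(backtrack: int, misroute: int, force: int, offsets):
    flit = 1                                  # Header bit: this flit is a probe
    flit = (flit << 1) | (backtrack & 1)      # progressing (0) / backtracking (1)
    flit = (flit << 4) | (misroute & 0xF)     # misroutes performed (4 bits assumed)
    flit = (flit << 1) | (force & 1)          # Force bit: may tear circuits down
    for off in offsets:                       # signed offsets, 8 bits each (assumed)
        flit = (flit << 8) | (off & 0xFF)
    return flit

flit = encode_probe(backtrack=0, misroute=2, force=1, offsets=[3, -1])
assert flit & 0xFF == (-1 & 0xFF)             # last offset in the low byte
assert (flit >> 16) & 1 == 1                  # Force bit is set
assert (flit >> 17) & 0xF == 2                # Misroute field reads back as 2
```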

The circuits starting at each node are recorded in a special set of registers denoted as the Circuit Cache. Those registers are located in the network interface of every node. Figure 5 shows the structure of those registers. When a circuit uses a switch Si, i ∈ [1..k] at the source node, it uses the same switch Si at every intermediate node. The Switch field indicates the switch being searched by the probe, or the switch used by the circuit once the path has been set up. The Channel field indicates the output channel used by the circuit at the source node. The Dest field indicates the destination node of the circuit. If a probe does not succeed in establishing a circuit across some switch, it may try other switches depending on the routing protocol. The Initial Switch field records the first switch tried, to avoid repeating the search. The Ack Returned field indicates that the acknowledgment of path setup has been returned and the circuit is ready to be used. The remaining fields are used for replacement purposes. The In-use bit is set when there is a message in transit, thus preventing the circuit from being released until message transmission has finished. This bit is reset when the source node receives the acknowledgment for the last fragment of the message. The Replace field stores accounting information regarding the use of the circuit. The meaning of this field depends on the replacement algorithm.
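A Circuit Cache entry and a replacement pass might look as follows in software. The field list follows Figure 5; interpreting Replace as a last-use timestamp (LRU) is an assumption of this sketch, since the paper leaves the meaning of that field to the replacement algorithm.

```python
from dataclasses import dataclass

@dataclass
class CircuitCacheEntry:
    initial_switch: int
    switch: int
    channel: int
    dest: int
    in_use: bool = False        # message in transit; must not be torn down
    ack_returned: bool = False  # circuit fully set up and ready
    replace: int = 0            # accounting info (assumed: last-use timestamp)

def select_victim(entries):
    """Pick the circuit to tear down: established, idle, least recently used."""
    candidates = [e for e in entries if e.ack_returned and not e.in_use]
    return min(candidates, key=lambda e: e.replace) if candidates else None

entries = [
    CircuitCacheEntry(1, 1, 0, dest=5, in_use=True, ack_returned=True, replace=1),
    CircuitCacheEntry(1, 2, 1, dest=6, ack_returned=False, replace=0),
    CircuitCacheEntry(2, 2, 0, dest=7, ack_returned=True, replace=3),
]
# Only the third entry is both established and idle, so it is the victim.
assert select_victim(entries).dest == 7
```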

Depending on the requirements of the applications running on the machine, the interconnection network may be designed with a single or several wave-pipelined switches. Note that if a single wave-pipelined switch per node is used, each pair of adjacent routers can only have one link capable of wave pipelining between them. On the other hand, splitting physical channels into narrower physical channels shares bandwidth in a very inflexible way. Additionally, a few control signals must be provided for each individual physical channel. Thus, it is not recommended to split each channel into many narrow physical channels.

Figure 5. Structure of the Circuit Cache registers (fields: Initial Switch, Switch, Channel, Dest., In-use, Replace, Ack Returned)

In addition to a higher network bandwidth and lower latency for messages using pre-established physical circuits, the proposed router architecture has some interesting advantages. Circuits are established by sending a probe. The probe uses the MB-m protocol, being allowed to backtrack if it cannot proceed forward. This protocol is very resilient to static faults in the network, as indicated in [12]. Also, once a circuit has been established between two nodes, in-order delivery is guaranteed for all the messages transmitted between those nodes. Additionally, for message passing, software overhead associated with message transmission can be considerably reduced if message buffers are allocated at both ends when the circuit is established. Those buffers will be reused by all the messages using the circuit. If the circuit is explicitly established by the programmer and/or the compiler for a set of messages, buffer size is determined by the longest message of the set. On the other hand, if the circuit is automatically established the first time a node sends a message to another node, the size of the longest message using that circuit is not known at that time. A reasonably large buffer can be allocated. In this case, buffers may have to be re-allocated for longer messages.

Also, note that channel width in low dimensional topologies is limited by pin count. More wires are allowed across the network bisection [1, 6]. It can be seen in Figure 2 that only a few control signals connect the PCS routing control unit with each switch Si, i ∈ [1..k]. For very high performance, several switches per node can be used, each one being implemented in its own chip. In this case, channel bandwidth does not decrease when the number of switches increases, assuming that this number is small. As a consequence, scalability is excellent because the number of switches (chips) per node can increase as network size increases, thus compensating for the higher average distance traveled by messages. Such an architecture follows a multi-chip implementation approach similar to the Cray T3D routers, wherein each dimensional crossbar is a separate chip. This effectively removes from consideration the pin-out limitations between adjacent nodes, since the pin requirements between these chips are quite small. The interesting design question then becomes how best to use the bisection bandwidth resource that is determined by the packaging technology.

Finally, the proposed router architecture is very flexible. It can be tailored to different requirements. Several parameters can be adjusted, including the number of fast switches, the number of virtual channels for wormhole switching, and the routing protocols for wormhole switching and PCS. The simplest version of the wave router is obtained by setting k = 1 and w = 0. In this case, all the messages use PCS.

3. Routing Protocols

In this section, we propose two routing protocols for wave switching. The first routing protocol automatically establishes a circuit when a node sends a message to another node and no circuit existed between them. When a circuit is being established and all the requested channels have been previously reserved by other circuits, a replacement algorithm selects a circuit. This circuit is torn down, releasing the channels forming it and allowing the establishment of the new circuit. This protocol works like cache protocols: a cache line is brought from main memory every time a miss occurs, and when a line is required and the cache is full, a replacement algorithm selects a line to be removed from the cache. This protocol will be referred to as the Cache-Like Routing Protocol (CLRP).

The second protocol relies on the programmer and/or the compiler to determine when a circuit should be established or torn down. Circuits are only established when there is enough temporal communication locality so that it is worth establishing a circuit. When communication between two nodes is not frequent enough, messages are sent using wormhole switching. This protocol will be referred to as the Compiler Aided Routing Protocol (CARP). A circuit should only be requested if there are enough free physical channels to build it. However, the routing protocol should consider the case where all the requested channels are busy. This protocol is similar to prefetching for caches: when a circuit is going to be heavily used, it is established in advance. The main difference between caches and networks regarding this protocol is that a circuit should be explicitly torn down when it is no longer needed.

We believe that the CARP protocol is able to achieve higher performance because a circuit is only established when there is enough temporal communication locality. By doing so, every message or set of messages uses the best suited switching technique. In particular, the CARP protocol does not establish circuits for individual short messages. Moreover, when buffers are allocated for transmitting a set of messages, the buffer size is large enough to store the longest message, therefore avoiding buffer re-allocation. Additionally, channels are a scarce resource. The CARP protocol allows the use of global optimization algorithms. However, developing suitable compiler support for the CARP protocol may take several years. On the other hand, the CLRP protocol does not need any compiler or programmer support. Therefore, it can be used for the first generation of multiprocessors implementing wave switching in their interconnection network.

3.1. Cache-Like Routing Protocol

The CLRP protocol for message transmission uses the Force bit. When the Force bit is not set, the probe backtracks if it does not find a free valid channel. If the Force bit is set, the probe does not backtrack, tearing circuits down to obtain the required channels. The most general form of this protocol is as follows: when a node sends a message to another node, the source node reads the Circuit Cache register to see if a circuit exists for the requested destination. If it exists, the circuit is used. Otherwise, a circuit is established in several phases, possibly tearing down other circuits, and the corresponding entry is stored in the Circuit Cache. In the first phase, a switch Si, i ∈ [1..k] with a free output channel is selected at the source node, and a probe with the Force bit reset is sent to establish a physical circuit. The selected switch is recorded in the Initial Switch field of the corresponding entry in the Circuit Cache register. It is convenient that neighboring nodes try to use different initial switches. For example, in a 2D-mesh, node (x, y) can first try switch 1 + (x + y) mod k. At each intermediate node, a free output channel from switch Si is selected. As mentioned above, the probe uses the MB-m protocol, being allowed to misroute up to m times, and to backtrack if it cannot proceed forward. If the circuit is successfully reserved, an acknowledgment is returned. If it were not possible to establish a circuit, and the probe backtracks all the way to the source node, the next switch modulo k is tried. The current switch is recorded in the Switch field. The Initial Switch field prevents the probe from searching the same circuit twice.
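The paper's own example for spreading initial switches over neighboring nodes, together with the "next switch modulo k" retry step, can be written out directly (the function names are ours):

```python
def initial_switch(x: int, y: int, k: int) -> int:
    """First switch (1..k) tried by node (x, y) in a 2D mesh."""
    return 1 + (x + y) % k

def next_switch(s: int, k: int) -> int:
    """Next switch modulo k, tried after the probe backtracks to the source."""
    return 1 + s % k

# Horizontally or vertically adjacent nodes start on different switches (k > 1):
assert initial_switch(2, 3, 4) != initial_switch(3, 3, 4)
# The retry order cycles through all k switches, e.g. 2 -> 3 -> 4 -> 1:
assert next_switch(4, 4) == 1
```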

If it were not possible to establish a circuit across any switch, the Force bit is set in the probe, sending it again across switch Initial Switch. This is the second phase of the protocol. If the probe cannot proceed forward at some node (including the source node), it selects a circuit from the Circuit Cache such that it uses one of the requested channels and has the Ack Returned bit set. This circuit starts at the current node. Therefore, it can be torn down without interrupting any message in transit. Once the message currently using that circuit (if any) has been sent, the In-use bit is reset and the circuit is torn down, thus releasing the requested channel. It may happen that all the requested channels belong to circuits being set up or circuits that cross the current node but start at different nodes. In that case, a circuit crossing the node is selected among those that returned the acknowledgment flit (those that have the Ack Returned bit set in the PCS routing control unit). A control flit is sent towards the source node of that circuit, requesting its source node to release it. Once the message currently using that circuit (if any) has been sent, the circuit is torn down, thus releasing the requested channel. Therefore, the probe can proceed, reserving the circuit. In the very unlikely case that all the outgoing channels of a node belong to circuits currently being established, the probe backtracks even if the Force bit is set. If it were not possible to establish a circuit, and the probe backtracks all the way to the source node, the next switch modulo k is tried. If it were not possible to establish a circuit across any switch, the protocol enters the third phase. In this phase, the message is transmitted using wormhole switching through S0 switches.
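The phase structure just described reduces to a small control loop. The probe itself is abstracted as a callback here (an assumption of this sketch); only the cascade mirrors the text: every switch without Force, then every switch with Force, then wormhole switching through S0.

```python
def clrp_establish(try_probe, k: int, initial: int):
    """try_probe(switch, force) -> True if a circuit was reserved.
    Returns ('circuit', switch) on success, ('wormhole', None) otherwise."""
    for force in (False, True):               # phase 1, then phase 2
        for i in range(k):                    # "next switch modulo k"
            switch = 1 + (initial - 1 + i) % k
            if try_probe(switch, force):
                return ('circuit', switch)
    return ('wormhole', None)                 # phase 3: fall back to S0

# A network where only switch 3 can be reserved, and only by force:
assert clrp_establish(lambda s, f: f and s == 3, k=4, initial=1) == ('circuit', 3)
# If no switch can ever be reserved, the message uses wormhole switching:
assert clrp_establish(lambda s, f: False, k=4, initial=1) == ('wormhole', None)
```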

The CLRP protocol can be simplified in several ways. First, when a circuit cannot be established by using Initial Switch, the Force bit can be set without trying the remaining switches. Similarly, the second phase may try a single switch. Second, the Force bit can be set when the probe is first sent to establish the circuit, therefore skipping phase one. The optimal protocol depends on the number of physical switches per node, and on the applications. It can only be tuned by using traces from real applications. This falls outside the scope of this paper.

3.2. Compiler Aided Routing Protocol

The CARP protocol is much simpler. However, it requires that the compiler and/or the programmer generate instructions to set up and tear down a circuit. It works as follows: The compiler and/or the programmer decide whether a physical circuit should be established for a set of messages. For those messages not requiring pre-established circuits, wormhole switching is used across S0 switches. When a physical circuit is requested, a switch Si, i ∈ [1..k], is selected, and a probe is sent to establish it. It is convenient that neighboring nodes try to use different initial switches. At each intermediate node, a free output channel from switch Si is selected. As mentioned above, the probe uses the MB-m protocol, being allowed to misroute up to m times, and to backtrack if it cannot proceed forward. If the circuit is successfully reserved, it will be used for the corresponding messages. When the circuit is no longer required, it is explicitly torn down by generating the appropriate instructions. If it were not possible to establish a circuit, the partially reserved path is torn down and the next switch modulo k is tried. If it were not possible to establish a circuit across any switch, messages requesting that circuit will have to use wormhole switching through S0 switches.
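A hedged sketch of the CARP send path follows. The hooks try_establish, send_circuit and send_wormhole are hypothetical; the paper only states that the compiler and/or programmer emit explicit set-up and tear-down instructions.

```python
# Illustrative sketch of CARP, assuming hypothetical hooks for the
# MB-m probe and the two switching modes.

def carp_send(messages, dest, k, try_establish, send_wormhole, send_circuit):
    """Probe switches 1..k in turn; on success, send the whole set of
    messages over the reserved circuit (which is then torn down
    explicitly). If no switch succeeds, fall back to deadlock-free
    wormhole switching through the S0 switch."""
    for sw in range(1, k + 1):             # next switch modulo k
        if try_establish(dest, sw):        # MB-m probe
            for m in messages:
                send_circuit(m, dest, sw)
            return ("circuit", sw)         # tear-down instruction follows
    for m in messages:
        send_wormhole(m, dest)
    return ("wormhole", 0)
```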

4. Deadlock and Livelock Avoidance

For the proof of deadlock freedom, we rely on the properties of routing protocols for PCS and wormhole switching. In [12], it was shown that the MB-m protocol is deadlock-free. The basic idea is that a probe can always backtrack up to the source, therefore releasing all the resources it previously reserved, and allowing other probes to advance. Also, in [5, 8, 9], it was shown how to design deadlock-free routing algorithms for wormhole switching. Basically, for deterministic routing algorithms there should be no cyclic dependencies between channels. For adaptive routing, cyclic dependencies are allowed provided that there exists a subset of channels without cyclic dependencies between them. The following theorems prove that the proposed protocols are deadlock-free.
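The condition cited above for deterministic routing, i.e. no cycles in the channel dependency graph [5], can be checked mechanically. The graph encoding below is ours; a minimal sketch:

```python
# Illustrative check of the deadlock-freedom condition for
# deterministic routing: the channel dependency graph must be acyclic.

def has_cyclic_dependency(deps):
    """deps maps each channel to the channels it may wait for.
    Returns True iff the channel dependency graph contains a cycle
    (three-color depth-first search)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}

    def visit(c):
        color[c] = GRAY                    # channel on the current path
        for d in deps.get(c, []):
            if color.get(d, WHITE) == GRAY:
                return True                # back edge: cycle found
            if color.get(d, WHITE) == WHITE and visit(d):
                return True
        color[c] = BLACK                   # fully explored, no cycle here
        return False

    return any(color[c] == WHITE and visit(c) for c in deps)
```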

Theorem 1 The CLRP protocol is deadlock-free.

Proof.- The proof proceeds by analyzing all the possible cases. The use of previously reserved circuits does not produce deadlock because messages do not request more resources. No deadlock can arise while establishing a circuit with the Force bit reset in phase one because the misrouting backtracking protocol MB-m is deadlock-free. When the Force bit is set (phase two), the probe may block at a node waiting for a previously established circuit to be torn down. If that circuit starts at the current node, it is immediately torn down unless there is a message in transit. If there is a message in transit, it will take a finite amount of time to end the transmission because messages have a finite length, and the destination node will accept the message because the circuit was previously established.

If the circuit does not start at the current node, an already established circuit crossing that node is selected. The selec- tion is done in finite time. After selecting a circuit, a control flit is sent towards the source node of the circuit using con- trol channels. Those control channels are free because once the acknowledgment for path setup is returned, no other traf- fic crosses those channels towards the source node. When the control flit reaches the source node, the circuit will be released in finite time, just after ending the transmission of the current message (if any). It may happen that the circuit is being released while the control flit advances towards the source node of the circuit. In this case, the control flit is discarded at some intermediate node and the circuit is re- leased in finite time. Also, it may happen that two differ- ent nodes send control flits requesting the same circuit to be released. The first control flit will initiate circuit releasing. The second control flit will be discarded, as indicated above. Both probes requesting channels from the circuit at different nodes will be able to reserve those channels in finite time.

If all the requested channels at a given node belong to circuits being established, the probe does not block, avoiding deadlock by backtracking to the previous node. If the probe is not able to establish any circuit, it uses wormhole switching, blocking if necessary on busy channels. This does not produce deadlock because the routing algorithm used for wormhole switching is deadlock-free. Additionally, PCS and wormhole switching do not interact. Each switching technique uses its own set of resources (routing control unit, switches and channels). Therefore, the CLRP protocol is deadlock-free. □

It could be thought that a probe may block at a node if all the requested channels belong to circuits being established. However, waiting on busy channels while keeping previ- ously reserved channels produces channel dependencies. As probes use the MB-m protocol, there would be cyclic depen- dencies between channels. Therefore, each probe could be waiting for a channel that will never be released, producing a deadlock. Thus, deadlock is avoided by backtracking.

Theorem 2 The CARP protocol is deadlock-free.

Proof.- Establishing a circuit cannot produce deadlock because the misrouting backtracking protocol MB-m is deadlock-free. If the probe is not able to establish any circuit, the message uses wormhole switching, blocking if necessary on busy channels. This does not produce deadlock because the routing algorithm used for wormhole switching is deadlock-free. Additionally, PCS and wormhole switching do not interact. Each switching technique uses its own set of resources. Therefore, the CARP protocol is deadlock-free. □

Guaranteeing the absence of deadlock is not enough because the proposed routing protocols use misrouting and backtracking. Therefore, a probe could be trying to establish a circuit forever, never blocking but never reaching its destination. As above, we rely on the properties of routing protocols for PCS and wormhole switching. The misrouting backtracking protocol MB-m limits misrouting to m misroutes. Also, when a probe backtracks, it does not search the same path again, because the History Store in the PCS routing control unit keeps track of the paths already searched. As the number of paths in a network is finite, MB-m is livelock-free. Also, minimal routing algorithms for wormhole switching are livelock-free. The following theorems prove that the proposed protocols are livelock-free.
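The termination argument above, i.e. a finite search space, a History Store that forbids re-trying an already-searched channel, and a misroute budget m, can be sketched as a bounded depth-first search. All names are illustrative; the real History Store lives in the PCS routing control unit.

```python
# Minimal sketch of why the MB-m probe search terminates.

def probe_search(src, dst, neighbors, m):
    """neighbors(node) returns (next_node, is_misroute) pairs.
    Returns a reserved path, or None once the finite search space
    is exhausted (the probe then falls back to wormhole routing)."""
    history = set()                        # History Store: edges tried

    def dfs(node, misroutes, path):
        if node == dst:
            return path
        for nxt, is_misroute in neighbors(node):
            if (node, nxt) in history:
                continue                   # already searched: never retried
            if is_misroute and misroutes >= m:
                continue                   # misroute budget exhausted
            history.add((node, nxt))
            found = dfs(nxt, misroutes + (1 if is_misroute else 0),
                        path + [nxt])
            if found:
                return found               # circuit reserved
        return None                        # backtrack, releasing channels

    return dfs(src, 0, [src])
```

Since `history` only grows and the network has finitely many channels, the search visits each (node, channel) pair at most once and therefore ends in finite time.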

Theorem 3 The CLRP protocol is livelock-free.

Proof.- The proof proceeds by analyzing all the possible cases. As MB-m is livelock-free, a probe with the Force bit reset will either succeed reserving a path or will return to the source node after exhausting the search of paths using switch Si. The number of switches is finite and the Initial Switch field prevents a probe from using the same switch twice. Therefore, a probe cannot be trying to establish circuits forever in phase one. When the Force bit is set and there exists a previously established circuit starting at the current node or crossing it, that circuit will be released, and the probe will be able to make progress toward its destination. If all the output channels at a node belong to circuits currently being established, the probe backtracks. Livelock is avoided by using the History Store, thereby preventing the probe from visiting a previously visited node. If the probe backtracks up to the source node with the Force bit set after exhausting the search of all the paths, the protocol enters the third phase and minimal routing is used through S0 switches. As minimal routing is livelock-free, the CLRP protocol is livelock-free. □

Theorem 4 The CARP protocol is livelock-free.

Proof.- As MB-m is livelock-free, a probe will either succeed reserving a path or will return to the source node after exhausting the search of paths using switch Si. The number of switches is finite and the Initial Switch field prevents a probe from using the same switch twice. Therefore, a probe cannot be trying to establish circuits forever. If the probe backtracks up to the source node after exhausting the search of all the paths, minimal routing is used through S0 switches. As minimal routing is livelock-free, the CARP protocol is livelock-free. □


5. Conclusions

Wave switching is a new hybrid switching technique that exploits communication locality by combining circuit switching and wormhole switching. A wave router has two or more switches. One of the switches uses standard wormhole switching. The remaining switches implement circuit switching. These switches achieve a higher bandwidth by using wave pipelining across switches and channel wires. As shown in [10], wave switching is able to reduce latency and increase throughput by a factor higher than three if messages are long enough (≥ 128 flits), even if circuits are not reused. The wormhole switch can be used to transmit individual messages for which circuit switching is not efficient. The new switching technique is aimed at reducing latency and increasing throughput by taking advantage of spatial and temporal communication locality. Instead of optimizing the transmission of individual messages, the new switching technique optimizes the overall communication between pairs of processors by allowing the construction of high-bandwidth physical circuits.

The new switching technique also reduces the overhead of the software messaging layer in multicomputers by offering better hardware support. In particular, message buffers can be allocated at both ends when the physical circuit is established. Those buffers will be reused by all the messages using the physical circuit. Additionally, in-order delivery and tolerance to static faults in the network are guaranteed for all the messages using physical circuits.

In this paper, we have proposed two routing protocols for wave switching. The first routing protocol (Cache-Like Routing Protocol, CLRP) automatically establishes a circuit when a node sends a message to another node and no circuit exists between them. When a circuit is being established and all the requested channels have been previously reserved by other circuits, a replacement algorithm selects a circuit. This circuit is torn down, releasing the channels forming it, and allowing the establishment of the new circuit. The second protocol (Compiler Aided Routing Protocol, CARP) relies on the programmer and/or the compiler to determine when a circuit should be established or torn down. Circuits are only established when there is enough temporal communication locality to make a circuit worthwhile. When communication between two nodes is not frequent enough, messages are sent using wormhole switching. Additionally, we have shown that the proposed protocols are deadlock-free and livelock-free, therefore guaranteeing that every message will reach its destination in finite time.

References

[1] A. Agarwal, "Limits on interconnection network performance," IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 4, pp. 398-412, October 1991.

[2] W.C. Athas and C.L. Seitz, "Multicomputers: Message-passing concurrent computers," IEEE Computer, vol. 21, no. 8, pp. 9-24, August 1988.

[3] S. Borkar et al., "iWarp: An integrated solution to high-speed parallel computing," in Proc. Supercomputing '88, November 1988.

[4] A.A. Chien, "A cost and speed model for k-ary n-cube wormhole routers," in Proc. Hot Interconnects '93, August 1993.

[5] W.J. Dally and C.L. Seitz, "Deadlock-free message routing in multiprocessor interconnection networks," IEEE Trans. Computers, vol. C-36, no. 5, pp. 547-553, May 1987.

[6] W.J. Dally, "Express cubes: Improving the performance of k-ary n-cube interconnection networks," IEEE Trans. Computers, vol. C-40, no. 9, pp. 1016-1023, September 1991.

[7] W.J. Dally, "Virtual-channel flow control," IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 2, pp. 194-205, March 1992.

[8] J. Duato, "A new theory of deadlock-free adaptive routing in wormhole networks," IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 12, pp. 1320-1331, December 1993.

[9] J. Duato, "A necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks," IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 10, pp. 1055-1067, October 1995.

[10] J. Duato, P. López, F. Silla and S. Yalamanchili, "A high performance router architecture for interconnection networks," in Proc. 1996 Int. Conf. Parallel Processing, August 1996.

[11] P.T. Gaughan and S. Yalamanchili, "Adaptive routing protocols for hypercube interconnection networks," IEEE Computer, vol. 26, no. 5, pp. 12-23, May 1993.

[12] P.T. Gaughan and S. Yalamanchili, "A family of fault-tolerant routing protocols for direct multiprocessor networks," IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 5, pp. 482-497, May 1995.

[13] J.-M. Hsu and P. Banerjee, "Hardware support for message routing in a distributed memory multicomputer," in Proc. 1990 Int. Conf. Parallel Processing, August 1990.

[14] Intel Scalable Systems Division, "Intel Paragon Systems Manual," Intel Corporation.

[15] V. Karamcheti and A.A. Chien, "Do faster routers imply faster communication?," in Parallel Computer Routing and Communication, K. Bolding and L. Snyder (eds.), Springer-Verlag, pp. 1-15, 1994.

[16] R.E. Kessler and J.L. Schwarzmeier, "CRAY T3D: A new dimension for Cray Research," in Proc. Compcon, pp. 176-182, Spring 1993.

[17] D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz and M. Lam, "The Stanford DASH multiprocessor," IEEE Computer, vol. 25, no. 3, pp. 63-79, March 1992.

[18] T. Mowry, M. Lam and A. Gupta, "Design and evaluation of a compiler algorithm for prefetching," in Proc. 5th Int. Conf. Architectural Support for Programming Languages and Operating Systems, October 1992.

[19] S.L. Scott and J.R. Goodman, "The impact of pipelined channels on k-ary n-cube networks," IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 1, pp. 2-16, January 1994.

[20] T. von Eicken, D.E. Culler, S.C. Goldstein and K.E. Schauser, "Active messages: A mechanism for integrated communication and computation," in Proc. 19th Int. Symp. Computer Architecture, June 1992.
