
Design and performance analysis of input-output buffering delta-based ATM switch with backpressure mechanism

R.Y. Awdeh H.T. Mouftah

Indexing terms: Asynchronous transfer mode, Delta networks, Packet switching, Switch architecture


Abstract: A new delta-based ATM switch architecture is described that is based on expanding a delta network so that blocking is eased while preserving self routing and path uniqueness. The switch employs a combined external input-output buffering strategy, but operates in such a way that output buffer overflows never happen. An analytical model, that takes into account this backpressure mechanism, is developed for arbitrary switch parameters, and computer simulations are used to assess the accuracy of the analysis. It is shown that a maximum throughput of above 0.90 can be achieved for a large-size switch using an expansion factor of 16. It is also shown that the backpressure mechanism can reduce the overall memory size needed to achieve a given cell loss performance, compared with the case where it is not used. The switch is shown to compare very well to the well-known knockout switch in terms of both performance and complexity. Finally, a distinctive feature of the proposed architecture is that internal node buffering can be used without disturbing cell sequencing.

1 Introduction

ATM (asynchronous transfer mode) has emerged as the key technology for future BISDNs (broadband integrated services digital networks) [1-3]. ATM is a high-speed packet switching technique where information is organised into short fixed-length packets called cells. The header of each cell contains the necessary routing information, so processing overhead at intermediate switching nodes is minimised. Over the past few years, many architectures have been proposed for ATM switching [4, 5], and most of these proposals are based on delta networks [6]. An ($N \times N$) delta network can be constructed using binary switching elements (SEs), organised in $n$ ($= \log_2 N$) stages, with each stage having $N/2$ SEs. Delta networks include many known topologies, such as omega [7] and banyan [8] networks, which were shown to be equivalent [9]. Delta networks have many desirable features, such as low complexity, self routing, modularity,

© IEE, 1994. Paper 12411 (E7), first received 2nd July 1993 and in revised form 10th March 1994. The authors are with the Department of Electrical and Computer Engineering, Queen’s University at Kingston, Ontario, Canada K7L 3N6

IEE Proc.-Commun., Vol. 141, No. 4, August 1994

and suitability for VLSI implementation. Unfortunately, delta networks are internally blocking: cells contend for the same internal links even if they are destined to different output ports. Also, as with any packet switch, output blocking arises when multiple cells request the same output port at the same time. Blocking severely degrades the throughput performance of delta networks.

In this paper, we propose fabric expansion for delta networks to ease both internal and output blocking. Previously, fabric expansion was used to improve the performance of nonblocking switches [10]. The second contribution of this paper is the development of a discrete-time analytical model for the performance evaluation of the proposed switch assuming arbitrary switch parameters, and a backpressure mechanism from the output buffers to the input buffers. Only a few analytical models exist which take into account backpressure from the output buffers, and they are built around nonblocking fabrics of infinite size [11, 12]. It is shown that a high-performance ATM switch can be constructed by expanding a delta network, with a complexity that is much less than that of the well-known knockout switch [13]. It is also shown that the backpressure mechanism can result in memory savings. An important feature of the proposed architecture is that internal node buffering can be used without disturbing cell sequencing.

2 Expanded delta fast packet switch (EDFPS)

2.1 Architecture

The expanded delta fast packet switch (EDFPS) is motivated by the desire to preserve the self-routing and path-uniqueness features of delta networks (which ensure distributed control and preserve cell sequencing, respectively), with the effects of blocking reduced, but without the square growth in complexity. An ($N \times N$) EDFPS is constructed by interleaving EF ($N \times N$) delta networks, where EF is a positive integer that refers to the expansion factor. Letting $M = N \times EF$, an ($N \times N$) EDFPS can be regarded as an ($M \times M$) interconnection network (IN) with $n$ stages only, where successive stages are interconnected as in the originating delta network. Notice that we do not say an


This work was partially supported by Queen’s University and the Canadian Institute for Telecommunications Research (CITR) under the Networks of Centres of Excellence programme of the Federal Government of Canada.


($M \times M$) delta network simply because $M$ need not be a power of two. In this IN, only $N$ inputs are used, and each logical output port consists of EF outputs. An omega-based (8 x 8) EDFPS is shown in Figs. 1 and 2 for EF = 3 and 4, respectively. The routing scheme in the EDFPS is the same as in a delta network, namely, bit-controlled self routing: if the stages of the switch are numbered from 1 to $n$ starting from the input side, a cell

Fig. 1 (8 x 8) expanded delta fast packet switch with EF = 3

Fig. 2 (8 x 8) expanded delta fast packet switch with EF = 4

at any inlet of any SE in stage $s$ requests the upper outlet of that SE if $d_s = 0$, or the lower outlet if $d_s = 1$. (It is assumed that each cell carries its output port address $d_1 d_2 \ldots d_n$, where $d_1$ is the most significant bit, in addition to an activity bit to indicate its presence.)
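The bit-controlled rule above can be illustrated with a minimal Python sketch. The function name `outlet_sequence` and the integer encoding of the port address are assumptions made for this illustration only; they do not appear in the paper.

```python
def outlet_sequence(dest: int, n: int) -> list[str]:
    """Outlets requested at stages 1..n by a cell addressed to output port `dest`.

    Hypothetical helper: `dest` is the output port address d_1 d_2 ... d_n read
    as an n-bit integer with d_1 as the most significant bit.
    """
    if not 0 <= dest < 2 ** n:
        raise ValueError("dest must fit in n bits")
    outlets = []
    for s in range(1, n + 1):
        d_s = (dest >> (n - s)) & 1          # bit examined at stage s
        outlets.append("lower" if d_s else "upper")
    return outlets


# Output port 5 = 101 (binary) in an 8-port switch (n = 3)
print(outlet_sequence(5, 3))
```

Since the sequence depends only on the destination address, every SE can make its decision locally, which is the distributed-control property the architecture aims to preserve.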

External input buffering is required to reduce cell loss, since for reasonable values of EF (i.e. $EF \ll N$ for large $N$), it might not be possible to simultaneously establish all the required paths within the IN. Also, external output buffering is necessary when EF > 1 to cope with the possibility of multiple cells simultaneously arriving at the same output port. The IN provides connectivity between the input port modules (IPMs) and the output port modules (OPMs). An IPM performs cell alignment and synchronisation, buffering, and table translation. An OPM performs similar functions (plus concentration if necessary). In particular, an OPM buffers cells emerging from the corresponding EF outputs of the IN in a first-in first-out (FIFO) shared-memory buffer. Both the IPMs and OPMs participate in the two-phase algorithm described in Section 2.2.

Two points should be noted with regard to the IN (i.e. the expanded delta network). First, this is a single-path architecture in which there exists only one path for each input-output pair. This means that internal node buffering can be used in the EDFPS without disturbing cell sequencing. Secondly, as EF increases (which is the case if an acceptable performance level is to be achieved), more and more SEs and links of the first stages of an EDFPS become idle (shown in dotted form in Figs. 1 and 2), and can be removed without affecting the proper functioning of the switch. This favourably distinguishes the expanded delta network from the replicated delta network [14, 15]. Furthermore, we have shown that, given a target maximum throughput, the expanded delta network requires less crosspoint complexity than other known delta-based networks, including the replicated delta network.*

2.2 Operation

For reasonable values of EF, cells may be lost within the IN as a result of internal conflicts, as well as at the input and output ports as a result of buffer overflows. Different packet switches use different approaches to deal with internal blocking, such as dropping blocked cells [9], internally buffering them [16, 17], or using a request-acknowledgement mechanism [18, 19]. In the EDFPS, we have chosen the third approach, and adopt an algorithm to set up conflict-free paths in the IN and, at the same time, to prevent output buffer overflows. This means that cell loss can occur only at the input buffers. The algorithm consists of two phases: path setup and cell transmission. We assume that the switch operates in a time-slotted fashion, and that input and output lines operate at the same speed.

2.2.1 Path setup phase: Phase 1 starts at the beginning of each time slot. Each IPM sends a request packet composed of a copy of both the activity bit and the output port address of the head-of-line (HOL) cell in the corresponding input buffer. Request packets then self route through the IN. On contention between any two request packets, the winner is chosen at random, and the other packet is discarded (by resetting its activity bit). The SEs hold their settings after request packets pass through

* 'Approach for comparing delta-based networks with the crossbar switch', submitted to Electronics Letters


them. A request packet that reaches a last-stage output with an activity bit of 1 establishes a conflict-free path from its originating input port to the desired output port. Up to EF request packets can reach a particular OPM, which compares the number of arriving requests (say $R$) with the free buffer space currently available in that port (say $A$ cells). If $R \le A$, the OPM issues ACKs to all requesting input ports and sends them backward through the paths that have just been set up. Otherwise, it selects $A$ out of the $R$ contending requests at random (or using any appropriate selection policy, such as round robin; in this case the starting point should be incremented or randomised in each time slot to ensure fairness), and backpropagates ACKs to the winning input ports through the reserved paths. An ACK is simply the activity bit of a winning request packet. This ends phase 1.
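The acknowledgement decision made by an OPM at the end of phase 1 can be sketched as follows. This is only an illustration of the random selection policy described above; the function name `select_acks` and the representation of requests as a list of input port numbers are assumptions, not part of the paper.

```python
import random


def select_acks(requests: list[int], free_space: int, rng: random.Random) -> list[int]:
    """Return the input ports that receive an ACK from one OPM in one time slot.

    `requests` holds the input ports whose request packets reached this OPM
    (R = len(requests), at most EF of them); `free_space` is A, the number of
    free cell positions in the output buffer.
    """
    if free_space < 0:
        raise ValueError("free_space must be non-negative")
    if len(requests) <= free_space:
        return list(requests)                 # R <= A: every request is acknowledged
    return rng.sample(requests, free_space)   # otherwise pick A winners at random


rng = random.Random(1)
print(select_acks([3, 17, 42], free_space=5, rng=rng))  # all three acknowledged
print(select_acks([3, 17, 42], free_space=2, rng=rng))  # two winners chosen at random
```

Because at most $A$ cells are ever admitted, the output buffer cannot overflow; losers keep their HOL cell and retry in the next slot.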

2.2.2 Cell transmission phase: During this phase, an IPM that receives an ACK sends its HOL cell to the desired output port through the reserved path (after removing the destination address field, since it is not going to be used in this phase), in a conflict-free (refers to the interconnection network) and a loss-free (refers to the output buffers) manner. Notice that the state of the SEs is set in the first phase and prior to the actual transmission of cells.

Both phases must be performed within a time slot, which necessitates a speedup of the internal switching fabric compared to the external lines to compensate for processing overhead. Each request packet is composed of $n + 1$ bits which have to be processed by $n$ stages. Assume that a request packet suffers a one-bit time delay at each stage. Each one-bit ACK travels in the backward direction with a total delay of one bit time corresponding to its transmission time, since it does not require any processing in the SEs it passes through. In phase 2, $T$ bits (= 424 for an ATM cell) flow freely (i.e. without any processing at intermediate SEs) through the path that has been set up in phase 1. Thus, the required speed-up factor $S$ is roughly estimated by $S = ((n + 1) + n + 1 + T)/T = 1 + 2(n + 1)/T$. For $N = 64$ and an ATM cell, $S = 1.033$, while $S = 1.052$ for $N = 1024$. Thus, the required speed-up factor is very small, even for large switch sizes.
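The speed-up estimate can be checked numerically with a few lines of Python. The function name `speedup` is an assumption for this sketch; the formula is the one given in the text.

```python
def speedup(N: int, T: int = 424) -> float:
    """Rough speed-up factor S = 1 + 2(n + 1)/T for an (N x N) switch.

    N must be a power of two; T is the cell length in bits (424 for an ATM cell).
    """
    if N < 2 or N & (N - 1):
        raise ValueError("N must be a power of two")
    n = N.bit_length() - 1                 # n = log2 N stages
    request_time = (n + 1) + n             # n + 1 request bits, one bit time per stage
    ack_time = 1                           # one-bit ACK, no processing on the way back
    return (request_time + ack_time + T) / T


for N in (64, 1024):
    print(N, round(speedup(N), 3))
```

For $N = 64$ and $N = 1024$ this reproduces the values 1.033 and 1.052 quoted above.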

BISDN is expected to support diverse applications with diverse performance requirements, and therefore it is very desirable that an ATM switch can handle prioritised traffic. The EDFPS can accommodate prioritised traffic as follows. Each cell is assumed to carry a priority field of $\lceil \log_2 P \rceil$ bits, when $P$ traffic classes exist. In phase 1, each request packet also contains the priority field, and contentions among request packets in both the IN and the OPMs are resolved according to their priorities. Furthermore, a group of cells simultaneously arriving at a particular output port (in phase 2) can be placed in the FIFO buffer in an order representing their priorities.

3 Performance analysis

In this Section, we analyse the switch for its throughput (TP), delay ($D$), and cell-loss probability ($P_{loss}$), assuming the uniform traffic model. Here, cells arrive at the input buffers of the switch according to i.i.d. Bernoulli processes, each with parameter $p$. An arriving cell chooses its destination port uniformly among all $N$ output ports. In the analysis, we assume that HOL cells in the input


buffers are statistically independent of each other with respect to their destination choices. This assumption has two consequences: the input buffers are independent of each other, and the traffic as seen by the IN is always uniform. We also assume that EF is a power of 2; i.e. $EF = 2^k$ where $k$ is a positive integer. This assumption also has two consequences. First, the traffic will be uniformly distributed across the SEs of each stage, which ensures fairness and simplifies the analysis. The fairness point can be explained by observing that in the network of Fig. 1, which uses EF = 3, the upper four SEs in the second stage are subject to more traffic than other SEs in the same stage. To solve the fairness problem when EF is not a power of 2, one can use a randomisation/distribution network [17] at the front end of the switch. However, in this case cell sequencing may be disturbed if internal node buffering is used. A second consequence of the assumption is that conflict-free paths are guaranteed up to and including stage $k$. Finally, the service discipline in all input and output buffers is assumed to be FIFO.

We start by analysing the IN, assuming that the effects of the input buffers are known. Then, we examine the input and output buffers separately. Finally, performance measures of interest are obtained using an iterative algorithm that couples all the analyses in a way that accounts for the backpressure mechanism.

3.1 Analysis of interconnection network

Define $TP_{IN}$ as the throughput of the IN: the average number of cells delivered by the IN per time slot per input port. Also, define $P_s$ as the success rate of the IN: the probability of success of a cell passing through the IN in a given slot. Letting $\rho_{in}$ be the probability that a given input buffer is not empty, it is clear that both $TP_{IN}$ and $P_s$ are functions of $\rho_{in}$. It can be easily seen that the load as seen by the input links of stage $k + 1$ is uniform with rate $a_k = \rho_{in}/EF$, where $a_i$ ($k \le i \le n$) is the cell arrival rate on each output link in stage $i$. Starting with $a_k = \rho_{in}/EF$, Patel’s [6] recursion can be used as follows to find both $TP_{IN}$ and $P_s$.

$$a_{i+1} = 1 - (1 - a_i/2)^2 \quad \text{for } k \le i < n \qquad (1)$$

$P_s$ is obtained by

$$P_s = \frac{EF \times a_n}{\rho_{in}} \qquad (2)$$

and $TP_{IN}$ by

$$TP_{IN} = \rho_{in} \times P_s = EF \times a_n \qquad (3)$$

The maximum throughput of the IN can be obtained by setting $\rho_{in} = 1$.
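As a minimal numerical sketch of this section, the recursion of eqn. 1 can be iterated from stage $k$ to stage $n$ and combined with eqn. 3. The function name `in_throughput` and the argument name `rho_in` (standing for the probability that an input buffer is not empty) are assumptions of this illustration, and both N and EF are taken to be powers of two with EF not exceeding N.

```python
def in_throughput(N: int, EF: int, rho_in: float = 1.0) -> tuple[float, float]:
    """Return (TP_IN, P_s) for an (N x N) EDFPS with expansion factor EF = 2**k."""
    for value in (N, EF):
        if value < 1 or value & (value - 1):
            raise ValueError("N and EF must be powers of two")
    if not 0.0 < rho_in <= 1.0:
        raise ValueError("rho_in must lie in (0, 1]")
    n = N.bit_length() - 1               # number of stages
    k = EF.bit_length() - 1              # stages 1..k are conflict free
    if k > n:
        raise ValueError("EF must not exceed N")
    a = rho_in / EF                      # a_k: load on each input link of stage k + 1
    for _ in range(n - k):               # eqn. 1, applied for i = k, ..., n - 1
        a = 1.0 - (1.0 - a / 2.0) ** 2
    tp_in = EF * a                       # eqn. 3
    return tp_in, tp_in / rho_in         # P_s follows from TP_IN = rho_in * P_s


tp_max, _ = in_throughput(1024, 16)      # rho_in = 1 gives the maximum throughput
print(round(tp_max, 3))
```

With $N = 1024$ and EF = 16 the sketch gives a maximum IN throughput slightly above 0.91, consistent with the figure of above 0.90 quoted in the abstract.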

3.2 Analysis of input buffers

Given the symmetry and the independence assumption of HOL cells in the input buffers, we focus on one input buffer, which we model by a Geom/Geom/1/$B_{in}$ queueing system; i.e. a single-server, $B_{in}$-capacity queue with geometrically distributed interarrival and service times. ($B_{in}$ includes the space allocated for the HOL cell that is currently being served.) As a consequence, we have the following equation

$$\Pr[\text{cell interarrival time} = t \text{ time slots}] = p(1 - p)^{t-1} \quad \text{for } t = 1, 2, \ldots \qquad (4)$$

Also, since an IPM resubmits an unsuccessful path setup request packet in succeeding time slots till it is successful, we have

$$\Pr[\text{cell service time} = t \text{ time slots}] = P_s(1 - P_s)^{t-1} \quad \text{for } t = 1, 2, \ldots \qquad (5)$$

Let the random variable $Q^i$ denote the number of cells in a given input buffer at the end of the $i$th time slot, including the cell under service if any, and let $q_j$ ($0 \le j \le B_{in}$) be the steady-state probability of having $j$ cells in that input buffer. For a finite $B_{in}$, the buffer occupancy can be modelled by a ($B_{in} + 1$)-state discrete-time Markov chain. The state transition diagram of this chain is shown in Fig. 3. The state transition probabilities

Fig. 3 State transition diagram of input buffer

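$T_{ij} \triangleq \Pr[Q^m = j \mid Q^{m-1} = i]$ are given by the following equation

$$T_{ij} = \begin{cases}
1 - p & i = 0,\ j = 0 \\
p & i = 0,\ j = 1 \\
p P_s + (1 - p)(1 - P_s) & 1 \le i \le B_{in} - 1,\ j = i \\
P_s (1 - p) & 1 \le i \le B_{in},\ j = i - 1 \\
p (1 - P_s) & 1 \le i \le B_{in} - 1,\ j = i + 1 \\
1 - P_s (1 - p) & i = B_{in},\ j = B_{in} \\
0 & \text{otherwise}
\end{cases} \qquad (6)$$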

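$q_j$ can be easily obtained from the balance equations as

$$q_j = q_0 \, \frac{p}{P_s(1 - p)} \left[ \frac{p(1 - P_s)}{P_s(1 - p)} \right]^{j-1} \quad \text{for } 1 \le j \le B_{in}$$

$q_0$, the steady-state probability of having an empty input buffer, can be obtained by summing over all possible states, and equating the result to 1, which gives

$$q_0 = \left[ 1 + \frac{p}{P_s(1 - p)} \sum_{j=1}^{B_{in}} \left( \frac{p(1 - P_s)}{P_s(1 - p)} \right)^{j-1} \right]^{-1}$$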

The average queue length is obtained by

$$\bar{Q} = \sum_{i=1}^{B_{in}} i \, q_i \qquad (10)$$

Using Little's result [20], the average delay of a cell (in time slots) in an input buffer is given by

$$D_{in} = \frac{\bar{Q}}{p(1 - P_{loss})} \qquad (11)$$

Eqn. 11 includes the one time slot required for cell transmission through the IN (because of the way we defined the random variable $Q$). Finally, the special case of $B_{in} = \infty$ is analysed in Section 7.1.
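The input buffer model can be evaluated numerically as sketched below. The sketch solves the balance equations of the chain in eqn. 6 directly (the chain is a birth-death chain, so the solution is geometric), estimates the cell-loss probability as (offered load - carried load)/offered load with carried load $P_s(1 - q_0)$, and applies Little's result with the carried load as the arrival rate. The function name `input_buffer` and the restriction to $0 < p < 1$ are assumptions of this illustration.

```python
def input_buffer(p: float, p_s: float, b_in: int):
    """Geom/Geom/1/B_in sketch: returns (q, P_loss, mean queue length, D_in).

    p    : cell arrival probability per slot (0 < p < 1)
    p_s  : probability that the HOL cell is served in a slot (0 < p_s <= 1)
    b_in : buffer capacity, including the HOL position (b_in >= 1)
    """
    if not (0.0 < p < 1.0 and 0.0 < p_s <= 1.0 and b_in >= 1):
        raise ValueError("need 0 < p < 1, 0 < p_s <= 1, b_in >= 1")
    c = p / (p_s * (1.0 - p))                   # q_1 / q_0 from the balance at state 0
    x = p * (1.0 - p_s) / (p_s * (1.0 - p))     # q_{j+1} / q_j for j >= 1
    weights = [1.0] + [c * x ** (j - 1) for j in range(1, b_in + 1)]
    total = sum(weights)
    q = [w / total for w in weights]            # normalised steady-state probabilities
    carried = p_s * (1.0 - q[0])                # cells leaving the buffer per slot
    p_loss = 1.0 - carried / p
    q_mean = sum(j * qj for j, qj in enumerate(q))
    d_in = q_mean / carried                     # Little's result
    return q, p_loss, q_mean, d_in


q, p_loss, q_mean, d_in = input_buffer(p=0.6, p_s=0.7, b_in=10)
print(p_loss, d_in)
```

Evaluating the same function over a range of `b_in` values gives a quick view of how the loss probability falls as the input buffer grows for a fixed success rate.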

A cell that arrives at a full input buffer is lost. We estimate the cell-loss probability by $P_{loss}$ = (offered load - carried load)/offered load. The offered load is $p$ and the carried load is $P_s(1 - q_0)$; thus,

$$P_{loss} = 1 - \frac{P_s(1 - q_0)}{p}$$

3.3 Analysis of output buffers

An output buffer can receive up to EF cells in each time slot. Recall that $a_n$ is the arrival rate on each output link of the last stage, and that EF such links enter an output buffer. We model each output buffer by a Geom(EF)/D/1/$B_{out}$ queueing system; i.e. a single-server, deterministic service time (with a service time of 1 time slot per cell), $B_{out}$-capacity queue, with EF i.i.d. Bernoulli arrivals each with parameter $a_n$. (In contrast to $B_{in}$, $B_{out}$ does not include the space allocated for the HOL cell currently under service.) Fixing attention on a particular (i.e. tagged) queue, we define the random variable $R^i$ as the number of cell arrivals at the tagged queue during time slot $i$. Then, $r_j \triangleq \Pr[R = j]$ is obtained as

$$r_j = \binom{EF}{j} a_n^{\,j} (1 - a_n)^{EF - j} \quad \text{for } 0 \le j \le EF \qquad (12)$$

Also, let the random variable $K^i$ denote the number of cells in the tagged queue at the end of the $i$th time slot, excluding the cell under service, if any. Then

$$K^i = \min \{ B_{out}, \max (0, K^{i-1} + R^i - 1) \} \qquad (13)$$

Fig. 4 State transition diagram of output buffer

For finite EF and $B_{out}$, the buffer occupancy $K^i$ can be modelled by a ($B_{out} + 1$)-state discrete-time Markov chain. The corresponding transition diagram is shown in Fig. 4 for $B_{out} > EF$; however, the analysis is valid for all cases. The transition probabilities $P_{ij} \triangleq \Pr[K^m = j \mid K^{m-1} = i]$ are given by

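$$P_{ij} = \begin{cases}
r_0 + r_1 & i = 0,\ j = 0 \\
r_0 & 1 \le i \le B_{out},\ j = i - 1 \\
r_{j-i+1} & 1 \le j \le B_{out} - 1,\ \max(0, j - EF + 1) \le i \le j \\
\sum_{n = B_{out} - i + 1}^{EF} r_n & j = B_{out},\ \max(0, B_{out} - EF + 1) \le i \le B_{out} \\
0 & \text{otherwise}
\end{cases} \qquad (14)$$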


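where $r_j$ is given by eqn. 12. From the balance equations,

$$\Pr[K = m] = \frac{1 - r_1}{r_0} k_{m-1} - \frac{1}{r_0} \sum_{i=2}^{\min(EF, m)} r_i \, k_{m-i} \quad \text{for } 2 \le m \le B_{out} \qquad (15)$$

where

$$k_0 \triangleq \Pr[K = 0] = \frac{1}{1 + \sum_{m=1}^{B_{out}} z_m}$$

To solve for $k_i$, use the following recursion:

$$z_0 = 1, \qquad z_1 = \frac{1 - r_0 - r_1}{r_0}, \qquad z_m = \frac{1 - r_1}{r_0} z_{m-1} - \frac{1}{r_0} \sum_{i=2}^{\min(EF, m)} r_i \, z_{m-i} \quad \text{for } 2 \le m \le B_{out} \qquad (16)$$

Then, $k_i$ can be obtained by

$$k_i = \frac{z_i}{1 + \sum_{m=1}^{B_{out}} z_m} \quad \text{for } 0 \le i \le B_{out} \qquad (17)$$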

Letting $p_{carried}$ be the probability that an output line carries a cell, it can be easily seen that

$$p_{carried} = 1 - k_0 r_0 \qquad (18)$$

The average queue length is

$$\bar{K} = \sum_{i=1}^{B_{out}} i \, k_i \qquad (19)$$

Using Little's result, the average delay of a cell (in time slots) in an output buffer is obtained by

$$D_{out} = \frac{\bar{K}}{p_{carried}} + 1 \qquad (20)$$

The second term corresponds to the one time slot needed to transmit the cell over an output line. Finally, the special case of $B_{out} = \infty$ is analysed in Section 7.2.
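The output buffer model can be sketched in Python along the same lines: binomial arrivals (EF i.i.d. Bernoulli sources), the recursion for the unnormalised probabilities $z_m$, normalisation, and then the carried load, mean queue length and delay. The function name `output_buffer` is an assumption of this illustration, and the sketch requires $0 < a_n < 1$ so that $r_0 > 0$; for large buffers the recursion involves subtractions and may lose numerical precision.

```python
from math import comb


def output_buffer(a_n: float, ef: int, b_out: int):
    """Geom(EF)/D/1/B_out sketch: returns (r, k, p_carried, D_out)."""
    if not (0.0 < a_n < 1.0 and ef >= 1 and b_out >= 1):
        raise ValueError("need 0 < a_n < 1, ef >= 1, b_out >= 1")
    # r[j] = probability of j arrivals in a slot
    r = [comb(ef, j) * a_n ** j * (1.0 - a_n) ** (ef - j) for j in range(ef + 1)]
    # Unnormalised occupancy probabilities z_0 .. z_{B_out}
    z = [1.0, (1.0 - r[0] - r[1]) / r[0]]
    for m in range(2, b_out + 1):
        tail = sum(r[i] * z[m - i] for i in range(2, min(ef, m) + 1))
        z.append(((1.0 - r[1]) * z[m - 1] - tail) / r[0])
    total = sum(z)
    k = [v / total for v in z]                   # k[i] = Pr[K = i]
    p_carried = 1.0 - k[0] * r[0]                # line idle only if queue empty and no arrival
    k_mean = sum(i * ki for i, ki in enumerate(k))
    d_out = k_mean / p_carried + 1.0             # waiting time plus one slot of transmission
    return r, k, p_carried, d_out


r, k, p_carried, d_out = output_buffer(a_n=0.2, ef=4, b_out=8)
print(p_carried, d_out)
```

A simple consistency check is to build the transition matrix implied by the occupancy update $K^i = \min\{B_{out}, \max(0, K^{i-1} + R^i - 1)\}$ and confirm that the returned vector `k` is stationary under it.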

3.4 Coupling all analyses: iterative approach

At steady state, the following two equations hold

$$TP = p(1 - P_{loss}) \qquad (21)$$

$$D = D_{in} + D_{out} \qquad (22)$$

In the following, we describe an iterative algorithm to find performance measures of interest. The basic idea is that at steady state the input load to an input buffer, which is $p(1 - P_{loss})$, must equal the output load from an output buffer, which is $p_{carried}$. Given $N$, EF, $B_{in}$, $B_{out}$, and $p$, and starting with the initial condition $P_{loss} = 0$, the following steps are repeated till convergence.

(i) $a_n = p(1 - P_{loss})/EF$
(ii) Use the Geom(EF)/D/1/$B_{out}$ model to find $a_n$ such that $|p_{carried} - p(1 - P_{loss})| <$ error tolerance†
(iii) Find $a_k$ that corresponds to $a_n$ of step (ii)
(iv) $P_s = a_n/a_k$
(v) Use the Geom/Geom/1/$B_{in}$ model to find $q_0$
(vi) $P_{loss} = 1 - P_s(1 - q_0)/p$

† Recall that $p_{carried}$ is a function of $a_n$.

After the algorithm converges, TP and $D$ can be obtained by eqns. 21 and 22. The maximum throughput ($TP_{max}$) is equal to TP when $p = 1.0$. When $B_{in} = \infty$, $TP_{max}$ equals the smallest value of $p$ at which $D$ grows without limit. When $B_{in} = \infty$ and $B_{out} = \infty$, $TP_{max}$ also equals the maximum throughput of the IN.
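Steps (iii) and (iv) require running Patel's recursion backwards, from the last-stage rate to the rate at stage $k$. The paper does not say how this is done; one possibility, shown here purely as an assumption, is to invert the recursion in closed form: from $a_{i+1} = 1 - (1 - a_i/2)^2$ it follows that $a_i = 2(1 - \sqrt{1 - a_{i+1}})$, which is a valid link rate only while $a_{i+1} \le 0.75$. The function names below are hypothetical.

```python
import math


def patel_forward(a: float) -> float:
    """One step of Patel's recursion: output-link rate of a 2 x 2 SE."""
    return 1.0 - (1.0 - a / 2.0) ** 2


def patel_inverse(a_next: float) -> float:
    """Input-link rate that produces output-link rate a_next (assumed inversion)."""
    if not 0.0 <= a_next <= 0.75:
        raise ValueError("a_next must lie in [0, 0.75] for a valid input rate")
    return 2.0 * (1.0 - math.sqrt(1.0 - a_next))


def success_rate(a_n: float, stages: int) -> tuple[float, float]:
    """Steps (iii)-(iv): recover a_k from a_n over `stages` = n - k stages.

    Returns (P_s, a_k) with P_s = a_n / a_k; a_n must be positive.
    """
    if a_n <= 0.0:
        raise ValueError("a_n must be positive")
    a = a_n
    for _ in range(stages):
        a = patel_inverse(a)
    return a_n / a, a


# Round trip for N = 1024 (n = 10) and EF = 16 (k = 4), starting from a_k = 1/16
a = 1.0 / 16
for _ in range(6):
    a = patel_forward(a)
p_s, a_k = success_rate(a, stages=6)
print(round(p_s, 3), a_k)
```

The round trip recovers $a_k = 1/16$, and the resulting $P_s$ coincides with the maximum IN throughput for these parameters, as expected from $TP_{IN} = \rho_{in} \times P_s$ with a saturated input.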

4 Results

4.1 Analysis validation

To validate our analytical model, we have constructed two simulators: Sim I and Sim II. In Sim I, we model the switch exactly, while in Sim II, new uniformly distributed destination addresses are generated at the beginning of every time slot for HOL cells in all input buffers. (Each simulation point has been obtained by averaging over five independent simulation runs; the 95% confidence intervals are very tight and thus will not be shown in the Figures.) In all cases, we have noticed that the analytic results are in excellent agreement with those obtained by Sim II. This should not be surprising, since Sim II adopts the independence assumption that has been made in the analysis. On the other hand, the analytic results are optimistic when compared to Sim I results, with the accuracy of the analysis improving as EF increases, given large enough $B_{out}$. The discrepancy between analysis and simulation is due to the correlation among input buffer HOL destinations, which has been ignored in the analysis. Increasing EF has two joint effects: reducing internal blocking, and allowing more cells to be simultaneously switched to the same output port. The net result is that the capacity of the IN increases and the average service time for input buffer HOL cells decreases. If $B_{out}$ is large enough to cope with this increased capacity, buffering starts shifting to the output buffers and the analysis becomes more accurate.

Fig. 5 shows $TP_{max}$ as a function of $\log_2 N$ for $B_{in} = \infty$, $B_{out} = \infty$, and different values of EF. It is interesting to see that with EF = 16, the analytic results are very close to those of Sim I. Also, notice that a switch of size 1024 achieves well above 90% maximum throughput

Fig. 5 Maximum throughput against $\log_2 N$ as function of expansion factor. Parameter values $B_{in} = \infty$, $B_{out} = \infty$ (analysis, Sim I, Sim II)

with an EF as small as 16. Although this case ($B_{in} = \infty$ and $B_{out} = \infty$) may seem to be trivial, it establishes the upper limit on $TP_{max}$. One should recall here that many switch architectures suffer from poor throughput performance even with infinite size buffers, such as input-buffered switches [21]. Using $N = 64$, EF = 4, $B_{in} = 10$, and $B_{out} = 8$, Figs. 6 and 7 show $D$ and $P_{loss}$, respectively,

Fig. 6 Delay against cell arrival rate. Parameter values $N = 64$, EF = 4, $B_{in} = 10$, $B_{out} = 8$ (analysis, Sim I, Sim II)

Fig. 7 Cell loss probability against cell arrival rate. Parameter values $N = 64$, EF = 4, $B_{in} = 10$, $B_{out} = 8$ (analysis, Sim I, Sim II)

against $p$. It can be seen that the accuracy of the analysis is quite acceptable even with EF = 4. With $N = 64$, EF = 16, $B_{in} = 10$, and $B_{out} = 64$, the analysis becomes very accurate, as can be seen from Figs. 8 and 9.

4.2 Analytic results

Assuming $N = 256$ and $B_{in} = \infty$, Figs. 10-12 show $D$ as a function of $p$ for different values of EF. In each figure, different values of $B_{out}$ are examined. From the Figures, observe the following.

(i) In all cases, the delay remains very close to its minimum value up to loads just below $TP_{max}$. This is very desirable since it enables operation at relatively high loads without risking excessive delays.

(ii) For a given EF, increasing $B_{out}$ results in increasing $TP_{max}$. This is achieved by reducing the number of cells which are denied access to the output buffers, although they have successfully established conflict-free paths through the IN.

(iii) The larger EF is, the larger is the $B_{out}$ needed to achieve the maximum possible throughput (achieved with $B_{out} = \infty$). This is a result of the increased capacity of the IN.

Fig. 8 Delay against cell arrival rate. Parameter values $N = 64$, EF = 16, $B_{in} = 10$, $B_{out} = 64$ (analysis, Sim I, Sim II)

Fig. 9 Cell loss probability against cell arrival rate. Parameter values $N = 64$, EF = 16, $B_{in} = 10$, $B_{out} = 64$ (analysis, Sim I, Sim II)

Fig. 10 Delay against cell arrival rate as function of output buffer size. Parameter values $N = 256$, EF = 8, $B_{in} = \infty$ (analysis)

(iv) When EF = 32, while $TP_{max}$ increases with increasing $B_{out}$, the delay is less for smaller values of $B_{out}$ at loads below $TP_{max}$. For example, $D = 8.88$ (10.18) when $B_{out} = 16$ ($\infty$) at $p = 0.94$. This may be explained as follows. Increasing $B_{out}$ has two joint effects: $D_{in}$ decreases because of the backpressure mechanism, and at the same time $D_{out}$ increases. When operating at high loads very close to the maximum capacity of an output buffer (which is 1.0), the increase in $D_{out}$ becomes more dominant than the reduction in $D_{in}$. Thus, $D$ increases. For the above example, $D_{in} = 2.17$ (1.59) and $D_{out} = 6.71$ (8.59),

Fig. 11 Delay against cell arrival rate as a function of output buffer size. Parameter values $N = 256$, EF = 16, $B_{in} = \infty$ (analysis)

Fig. 12 Delay against cell arrival rate as function of output buffer size. Parameter values $N = 256$, EF = 32, $B_{in} = \infty$ (analysis)

when $B_{out} = 16$ ($\infty$). Given the small difference between $TP_{max}$ with $B_{out} = 16$ and that with $B_{out} = \infty$, the analysis may suggest that using $B_{out} = 16$ (or 32) is better from the delay performance point of view.

4.3 Effect of backpressure

Although we believe that the extra complexity needed for implementing the backpressure (BP) mechanism is small, we still need to justify the importance of adopting such a mechanism. To do so, we compare the BP mechanism with the queue loss (QL) mechanism [12]. With QL, a cell that successfully establishes a conflict-free path through the IN, but cannot find a space in the buffer of the desired output port, is simply discarded. The iterative algorithm previously described for the BP mechanism can be modified for the QL case (see Section 7.3). Assume operation at a fixed total buffer budget; in other words, $B_{tot} = B_{in} + B_{out}$ is fixed. For $N = 256$, EF = 8, and $p = 0.70$, Fig. 13 shows $P_{loss}$ against $B_{in}$ for different values of $B_{tot}$, with both BP and QL. From the figure, we observe the following.

(i) For the same $B_{tot}$, BP outperforms QL. This may be explained as follows. With BP, some sort of buffer sharing exists among input and output buffers, causing $P_{loss}$ to be smaller compared to that with QL, assuming the same switch parameters and load value. The larger $B_{tot}$ is, the more significant is the performance difference between BP and QL.

Fig. 14 Effect of backpressure on cell loss probability against input buffer size. Parameter values $N = 256$, EF = 16, $p = 0.90$ (analysis); curves for BP and QL with $B_{tot}$ = 15, 25 and 35

it can be seen that the three observations (i)-(iii) hold here also. Since the above results have been obtained by approximate analysis, it is reasonable to question the validity of our observations. Using simulation (Sim I, of course), Fig. 15 shows $P_{loss}$ against $B_{in}$ for $N = 64$,

Fig. 15 Effect of backpressure on cell loss probability against input buffer size. Parameter values $N = 64$, EF = 4, $B_{tot} = 25$, $p = 0.65$ (simulation); curves for BP and QL

EF = 4, $p = 0.65$, and $B_{tot} = 25$. It is clear that the foregoing observations are valid here also. All the above implies that BP can result in overall memory savings for a given cell loss performance. This is in agreement with the results reported in Reference 12 for input-output buffering nonblocking switches. More numerical results are needed to determine the range of loads for which BP outperforms QL, given $N$, EF, and $B_{tot}$.

4.4 EDFPS against knockout switch

It is well established that nonblocking output buffering packet switches have the best throughput-delay performance when infinite size buffers are used [21]. The best known example of such switches is the knockout switch (KS) [13]. In the following, we present a brief comparison between the EDFPS and the KS: for the same $N$ and $p$, we compare the total number of crosspoints NC (only as a rough estimate of the complexity) and the total buffer size NB required by each switch to achieve target values of $P_{loss}$. Both NC and NB are normalised with respect to $N$. Section 7.4 briefly evaluates the KS in terms of both complexity and performance.

In an ($N \times N$) EDFPS with $EF = 2^k$ ($k = 1, 2, \ldots$), the first $k - 1$ stages have $N(2^{k-1} - 1)$ functioning SEs only, each of size (1 x 2). The $k$th stage consists of $EF \times N/2$ (1 x 2) SEs, while the remaining $n - k$ stages have $EF \times N/2$ (2 x 2) SEs each. Since each OPM is attached to EF output links only (compared to $N$ links in the KS) and EF is relatively small, concentration can be avoided in the OPMs. For fair comparison with the KS, we assume the same shared buffer structure, which necessitates the use of an ($EF \times EF$) shifter within each OPM. Letting each (2 x 2) SE be represented by four crosspoints, and each (1 x 2) SE by two crosspoints, we have

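$$NC_{EDFPS} = \frac{1}{N} \left[ 2 \times N(2^{k-1} - 1) + 2 \times EF \frac{N}{2} + 4 \times (n - k) EF \frac{N}{2} + 4 \times N k \frac{EF}{2} \right] = 2EF(1 + n) - 2 \qquad (23)$$

Also, $NB_{EDFPS} = B_{in} + B_{out}$. $B_{out}$ is assumed to be a multiple of EF, consistent with the KS where the buffer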


size ($B$) is a multiple of $L$, for a concentration factor of $N : L$.
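The crosspoint count of eqn. 23 can be checked term by term with the short sketch below. The function names are assumptions of this illustration, and so is the last term: each ($EF \times EF$) shifter is taken to consist of $k = \log_2 EF$ stages of $EF/2$ (2 x 2) SEs, which is the shifter size that makes the sum agree with the closed form $2EF(1 + n) - 2$.

```python
def nc_edfps(N: int, EF: int) -> int:
    """Crosspoints per port of an (N x N) EDFPS with EF = 2**k (k >= 1)."""
    for value in (N, EF):
        if value < 2 or value & (value - 1):
            raise ValueError("N and EF must be powers of two, at least 2")
    n = N.bit_length() - 1
    k = EF.bit_length() - 1
    if k > n:
        raise ValueError("EF must not exceed N")
    first_stages = 2 * N * (2 ** (k - 1) - 1)     # (1 x 2) SEs in stages 1..k-1
    stage_k = 2 * (EF * N // 2)                   # (1 x 2) SEs in stage k
    remaining = 4 * (n - k) * (EF * N // 2)       # (2 x 2) SEs in stages k+1..n
    shifters = 4 * N * k * (EF // 2)              # assumed shifter structure, one per OPM
    total = first_stages + stage_k + remaining + shifters
    return total // N                             # normalised with respect to N


def nb_edfps(b_in: int, b_out: int) -> int:
    """Buffer positions per port: input plus output buffer."""
    return b_in + b_out


print(nc_edfps(1024, 4), nc_edfps(1024, 32))
print(nb_edfps(11, 4), nb_edfps(19, 32))
```

For $N = 1024$ this gives NC = 86 for EF = 4 and NC = 702 for EF = 32, the values listed for the EDFPS in Table 1.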

Assuming a switch of size 1024, we have determined (using analysis) the required values of both NC and NB to achieve two target values of $P_{loss}$, each under two load values. Table 1 summarises the results. From this Table,

Table 1: EDFPS against knockout switch for N = 1024

p      P_loss    Switch and parameters                     NB     NC
0.50   10^-5     KS (L = 6, B = 12)                        12     25648
                 EDFPS (EF = 4, B_in = 11, B_out = 4)      15     86
       10^-10    KS (L = 10, B = 20)                       20     42112
                 EDFPS (EF = 4, B_in = 22, B_out = 4)      26     86
0.90   10^-5     KS (L = 7, B = 49)                        49     29744
                 EDFPS (EF = 32, B_in = 9, B_out = 32)     41     702
       10^-10    KS (L = 12, B = 108)                      108    50304
                 EDFPS (EF = 32, B_in = 19, B_out = 32)    51     702

notice that although the KS requires slightly smaller buffer sizes at $p = 0.50$, the savings in the crosspoint count are quite significant in all cases. For example, to achieve $P_{loss} = 10^{-10}$ under 90% load, our switch requires an NC of about 700 compared to about 50000 required by the KS. The large number of crosspoints in the KS is mainly because the complexity of the output bus interfaces grows directly with $N$. The EDFPS solves this problem by distributing the concentration function over the IN. At $p = 0.90$, the EDFPS also requires a smaller number of buffers for the same $P_{loss}$. One possible explanation, mentioned earlier, is that with the BP mechanism some sort of sharing exists among input and output buffers, and this buffer sharing is emphasised at high loads. A similar observation was reported in Reference 16 in the context of internally buffered delta networks, where it was concluded that delta networks built with SEs containing a certain combination of finite input and output buffers (with BP) outperform those built with SEs containing finite output buffers only, given a fixed total buffer budget.

5 Concluding remarks

An input-output buffering ATM switch architecture with backpressure mechanism has been described and its performance analysed. The accuracy of the analysis has been assessed using simulation. The proposed switch is built around a new multistage interconnection network structure called the expanded delta network. In the expanded delta network, we try to overcome both internal and output blocking, while preserving the path-uniqueness and self-routing features of a delta network. It has been shown that very high throughputs can be achieved with relatively small expansion factors, for large-size switches. The backpressure mechanism has been shown to reduce the overall cell memory size needed to achieve a given cell loss probability, at least for the cases examined. The switch has been shown to compare very well to the knockout switch, in terms of both performance and complexity. A distinctive feature of the proposed architecture is that internal node buffering can be used without disturbing cell sequencing. The analysis of a buffered version of the switch has been carried out and will be the subject of another paper.

6 References

1 MINZER, S.: ‘Broadband ISDN and asynchronous transfer mode (ATM)’, IEEE Commun. Mag., 1989, 27, (9), pp. 17-24

I E E Proc.-Commun., Vol. 141, No. 4, August 1994

2 BAE, J., and SUDA, T.: ‘Survey of traffic control schemes and protocols in ATM networks’, Proc. IEEE, 1991, 79, (2), pp. 170-189

3 BOUDEC, J.-Y.: ‘The asynchronous transfer mode: a tutorial’, Comput. Netw. ISDN Syst., 1992, 24, pp. 279-309

4 AHMADI, H., and DENZEL, W.: ‘A survey of modern high-performance switching techniques’, IEEE J. Sel. Areas Commun., 1989, 7, (7), pp. 1091-1103

5 TOBAGI, F.: ‘Fast packet switch architectures for broadband integrated services digital networks’, Proc. IEEE, 1990, 78, (1), pp. 133-166

6 PATEL, J.: ‘Performance of processor-memory interconnections for multiprocessors’, IEEE Trans. Comput., 1981, 30, (10), pp. 771-780

7 LAWRIE, D.: ‘Access and alignment of data in an array processor’, IEEE Trans. Comput., 1975, 24, (12), pp. 1145-1155

8 GOKE, L., and LIPOVSKI, G.: ‘Banyan networks for partitioning processor systems’. Proceedings of the first annual symposium on Computer architecture, 1973, pp. 21-28

9 WU, C.-L., and FENG, T.-Y.: ‘On a class of multistage interconnection networks’, IEEE Trans. Comput., 1980, 29, (8), pp. 694-702

10 LIEW, S., and LU, K.: ‘Comparison of buffering strategies for asymmetric packet switch modules’, IEEE J. Sel. Areas Commun., 1991, 9, (3), pp. 428-437

11 JUNG, Y., and UN, C.: ‘Analysis of backpressuring-type packet switches with input and output buffering’, IEE Proc. I, 1993, 140, (4), pp. 277-284

12 PATTAVINA, A., and BRUZZI, G.: ‘Analysis of input and output queueing for nonblocking ATM switches’, IEEE/ACM Trans. Netw., 1993, 1, (3), pp. 314-328

13 YEH, Y., HLUCHYJ, M., and ACAMPORA, A.: ‘The knockout switch: a simple, modular, architecture for high-performance packet switching’, IEEE J. Sel. Areas Commun., 1987, 5, (8), pp. 1274-1283

14 KRUSKAL, C., and SNIR, M.: ‘The performance of multistage interconnection networks for multiprocessors’, IEEE Trans. Comput., 1983, 32, (12), pp. 1091-1098

15 KUMAR, M., and JUMP, J.: ‘Generalized delta networks’. Proceedings of international conference on Parallel processing, 1983, pp. 10-18

16 SZYMANSKI, T., and SHAIKH, S.: ‘Markov chain analysis of packet-switched banyans with arbitrary switch sizes, queue sizes, link multiplicities and speedup’. Proceedings of IEEE INFOCOM'89, 1989, pp. 960-971

17 TURNER, J.: ‘Design of a broadcast packet switching network’, IEEE Trans., 1988, COM-36, (6), pp. 734-743

18 PATTAVINA, A.: ‘A broadband packet switch with input and output queueing’. Proceedings of the 13th international symposium on Switching, 1990, pp. 11-16

19 TA, Q., and MEDITCH, J.: ‘A high-speed integrated services switch based on 4 x 4 switching elements’. Proceedings of IEEE INFOCOM'90, 1990, pp. 1164-1171

20 KLEINROCK, L.: ‘Queueing systems, vol. 1: Theory’ (Wiley, New York, 1975)

21 KAROL, M., HLUCHYJ, M., and MORGAN, S.: ‘Input versus output queueing on a space-division packet switch’, IEEE Trans., 1987, COM-35, (12), pp. 1347-1356

22 MEISLING, T.: ‘Discrete-time queueing theory’, Oper. Res., 1958, 6, pp. 96-105

7 Appendix

7.1 Analysis of input buffer when $B_{in} = \infty$

An input buffer can be modelled by a Geom/Geom/1 queueing system. A well-known result on the Geom/G/1 queue with FIFO service [22] is

$$D_{in} = E[\Theta] + \frac{p\,(E[\Theta^2] - E[\Theta])}{2\,(1 - p\,E[\Theta])} \qquad (24)$$

where $\Theta$ is a random variable that represents the service time. Our model is a special case of this, and can be obtained by substituting $E[\Theta] = 1/P_s$ and $E[\Theta^2] = (2 - P_s)/P_s^2$. Thus,

$$D_{in} = \frac{1 - p}{P_s - p} \qquad (25)$$

Notice that eqn. 25 could also have been obtained directly from eqn. 10.
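The substitution step can be checked numerically. The sketch below is illustrative only: it assumes the Geom/G/1 mean delay takes the standard discrete-time form $E[\Theta] + p(E[\Theta^2] - E[\Theta])/(2(1 - pE[\Theta]))$ and that eqn. 25 reads $(1 - p)/(P_s - p)$; the function names are not from the paper.

```python
from fractions import Fraction


def geometric_service_moments(p_s):
    """First and second moments of a geometric service time with success
    probability p_s per slot: E[T] = 1/p_s, E[T^2] = (2 - p_s)/p_s^2."""
    return 1 / p_s, (2 - p_s) / p_s**2


def geom_g1_delay(p, m1, m2):
    """Mean delay (waiting plus service) of a Geom/G/1 FIFO queue.

    Assumed form (not quoted from the paper): service time plus the
    discrete-time mean wait p*(E[T^2] - E[T]) / (2*(1 - p*E[T])).
    """
    return m1 + p * (m2 - m1) / (2 * (1 - p * m1))


def geom_geom1_delay(p, p_s):
    """Closed form assumed for eqn. 25: (1 - p) / (P_s - p)."""
    return (1 - p) / (p_s - p)


if __name__ == "__main__":
    # Exact rational arithmetic avoids any floating-point doubt.
    p, p_s = Fraction(1, 2), Fraction(3, 4)
    m1, m2 = geometric_service_moments(p_s)
    print(geom_g1_delay(p, m1, m2), geom_geom1_delay(p, p_s))
```

With exact fractions the two expressions agree for any stable pair p < P_s, which is the sense in which the Geom/Geom/1 model is a special case of the general result.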


7.2 Analysis of output buffer when $B_{out} = \infty$

Let the random variable $M$ denote the steady-state number of cells in the whole buffer system, including the HOL cell currently under service, and let $I^i$ be the following indicator function

$$I^i = \begin{cases} 1 & M^i > 0 \\ 0 & M^i = 0 \end{cases} \qquad (26)$$

Then, the following equation describes the imbedded Markov chain of the tagged output buffer

$$M^i = M^{i-1} + R^i - I^{i-1} \qquad (27)$$

Assuming steady-state behaviour, and in a similar manner to Reference 20, pp. 180-184, one obtains

$$E[M] = \frac{\mathrm{Var}[R] + E[R] - (E[R])^2}{2\,(1 - E[R])} \qquad (28)$$

Substituting $E[R] = a_n \times EF$ and $\mathrm{Var}[R] = a_n \times (1 - a_n) \times EF$ in eqn. 28, and dividing by $E[R]$ (Little's result), gives

$$D_{out} = \frac{(EF - 1)\,a_n}{2\,(1 - EF \times a_n)} + 1 \qquad (29)$$

The same result was obtained in Reference 21 using the z-transform approach.
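The algebra behind the output-buffer delay can be sketched as follows. This is a minimal check under stated assumptions, not the authors' derivation: the mean occupancy is taken as $(\mathrm{Var}[R] + E[R] - E[R]^2)/(2(1 - E[R]))$, which follows from squaring the recurrence $M^i = M^{i-1} + R^i - I^{i-1}$ at steady state, the delay is obtained from Little's result, and the per-link load is written `a`. All function names are illustrative.

```python
from fractions import Fraction


def mean_occupancy(mean_r, var_r):
    """Steady-state mean number of cells in the output buffer system,
    derived from M^i = M^(i-1) + R^i - I^(i-1) with R independent of M."""
    return (var_r + mean_r - mean_r**2) / (2 * (1 - mean_r))


def d_out_from_moments(a, ef):
    """Mean delay via Little's result, with R ~ Binomial(EF, a)."""
    mean_r = a * ef
    var_r = a * (1 - a) * ef
    return mean_occupancy(mean_r, var_r) / mean_r


def d_out_closed_form(a, ef):
    """Closed form assumed for the delay: (EF - 1)a / (2(1 - EF*a)) + 1."""
    return (ef - 1) * a / (2 * (1 - ef * a)) + 1


if __name__ == "__main__":
    a, ef = Fraction(1, 20), 16          # offered load EF*a = 0.8
    print(d_out_from_moments(a, ef), d_out_closed_form(a, ef))
```

Both routes give the same value whenever EF x a < 1, and for EF = 1 the delay reduces to a single slot, as expected when at most one cell can arrive per slot.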

7.3 Analysis of EDFPS with queue loss

The EDFPS is analysed with the queue loss (QL) mechanism. The analyses given in Section 3 are all still valid here. We need only to modify the iterative algorithm that couples these analyses. Let $P_{loss,in}$ be the probability of cell loss resulting from input buffer overflows, and $P_{loss,out}$ be the probability of cell loss resulting from output buffer overflows. Starting with the initial condition $P_{loss,in} = 0$, the following steps are repeated until convergence:

(i) $a_n = p(1 - P_{loss,in})/EF$
(ii) Find $a_k$ that corresponds to $a_n$ of step (i)
(iii) $P_s = a_n/a_k$
(iv) Use the Geom/Geom/1/$B_{in}$ model to find $q_0$
(v) $P_{loss,in} = 1 - P_s(1 - q_0)/p$
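The steps above amount to a fixed-point iteration on the input loss probability, which can be sketched as below. The two link-load symbols are read here as `a_n` and `a_k`, which is an assumption. The two callables are hypothetical placeholders for the Section 3 analyses: `find_ak` stands for the network model that maps the load of step (i) to the matching load of step (ii), and `q0_model` stands for the Geom/Geom/1/$B_{in}$ model that returns the empty-buffer probability.

```python
from typing import Callable, Tuple


def solve_queue_loss(
    p: float,
    ef: int,
    find_ak: Callable[[float], float],
    q0_model: Callable[[float, float], float],
    tol: float = 1e-12,
    max_iter: int = 10_000,
) -> Tuple[float, float]:
    """Iterate steps (i)-(v) until the input loss probability settles.

    Returns (P_loss_in, a_n). find_ak and q0_model are hypothetical
    stand-ins for the paper's network and input-buffer models.
    """
    loss_in = 0.0                                   # initial condition
    for _ in range(max_iter):
        a_n = p * (1.0 - loss_in) / ef              # step (i)
        a_k = find_ak(a_n)                          # step (ii)
        p_s = a_n / a_k                             # step (iii)
        q0 = q0_model(p, p_s)                       # step (iv)
        new_loss = 1.0 - p_s * (1.0 - q0) / p       # step (v)
        if abs(new_loss - loss_in) < tol:
            return new_loss, a_n
        loss_in = new_loss
    raise RuntimeError("queue-loss iteration did not converge")


if __name__ == "__main__":
    # Toy stand-ins only: a fixed service probability of 0.9 and an
    # infinite input buffer, for which q0 = 1 - p/P_s and no cell is lost.
    loss, a_n = solve_queue_loss(
        p=0.5,
        ef=4,
        find_ak=lambda a: a / 0.9,
        q0_model=lambda p, p_s: 1.0 - p / p_s,
    )
    print(loss, a_n)
```

Plain successive substitution is used here for simplicity; whether the original algorithm applies damping or a different stopping rule is not stated.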

Both $P_{loss,in}$ and $a_n$ result from the convergence of the algorithm. Then, using the output buffer Geom(EF)/D/1/$B_{out}$ model, one obtains $p_{carried}$. $P_{loss,out}$ is then obtained by

$$P_{loss,out} = 1 - \frac{p_{carried}}{a_n \times EF}$$

At steady state, the throughput of the switch is $TP = p_{carried}$, and the delay (in time slots) is obtained by


7.4 Complexity and performance of knockout switch

7.4.1 Complexity: The KS [13] is a fully interconnected architecture with each input having a direct non-overlapping path to each output, with a total of $N^2$ physical paths and N output bus interfaces. Each interface consists of N address filters, an (N x L) knockout concentrator and an (L x L) shifter, in addition to B cell buffers. An (N x L) knockout concentrator has L sections of competition, and is composed of (2 x 2) contention elements and single-input/single-output one-bit delay elements. For $N \gg L$, each section of the concentrator contains approximately N contention elements and N/2 delay elements. An (L x L) shifter can be constructed using an (M x M) delta network, where $M = 2^m$ and $m = \lceil \log_2 L \rceil$. This results in $m2^{m-1}$ (2 x 2) SEs. Assuming that each (2 x 2) element can be represented by four crosspoints and each address filter by one crosspoint, and ignoring the one-bit delay elements, we get the following

$$NC_{KS} = N + 4NL + 2m2^m \qquad (33)$$

Also, $NB_{KS} = B$, where B is assumed to be a multiple of L.
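As a quick cross-check of the complexity figures, the sketch below evaluates eqn. 33 alongside the EDFPS count of eqn. 23 for the Table 1 configurations. The function names are illustrative and not from the paper, and the formulas are coded as read from the text: NC for the EDFPS is 2EF(1 + n) - 2 with n = log2 N, and NC for the KS is N + 4NL + 2m2^m with m = ceil(log2 L).

```python
def ceil_log2(x: int) -> int:
    """Smallest integer m such that 2**m >= x, for integer x >= 1."""
    return (x - 1).bit_length()


def nc_edfps(n_ports: int, ef: int) -> int:
    """Crosspoint count of the EDFPS as read from eqn. 23."""
    n = ceil_log2(n_ports)           # number of stages, n = log2(N)
    return 2 * ef * (1 + n) - 2


def nc_ks(n_ports: int, l_conc: int) -> int:
    """Crosspoint count of the knockout switch as read from eqn. 33:
    N address filters, 4NL for the concentrator, 2m2^m for the shifter."""
    m = ceil_log2(l_conc)
    return n_ports + 4 * n_ports * l_conc + 2 * m * 2**m


if __name__ == "__main__":
    for l_conc in (6, 10, 7, 12):
        print("KS    L =", l_conc, "NC =", nc_ks(1024, l_conc))
    for ef in (4, 32):
        print("EDFPS EF =", ef, "NC =", nc_edfps(1024, ef))
```

For N = 1024 this gives 25648 crosspoints for the KS with L = 6 and 86 for the EDFPS with EF = 4, matching the NC column of Table 1.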

7.4.2 Performance: It is possible to divide the analysis of the KS into the analysis of the concentrator and the analysis of the output buffer. This was done in Reference 13, where an exact analysis of the former and an approximate analysis of the latter were given. In the following, we give an alternative but more accurate analysis of the switch. With the assumption of uniform traffic with parameter p, the probability of i cells arriving simultaneously at a particular output buffer becomes

L + l < i < N

This equation can be used together with eqns. 14-20 to obtain both the delay of a cell (D) and the load carried by an output buffer ($p_{carried}$). The throughput of the switch is simply $TP = p_{carried}$. The overall cell loss probability can be obtained by

(35)

This analysis gives results that are very accurate compared to those obtained by simulation. As an example, for N = 64, L = 5, and B = 15, analysis (simulation) gives D = 1.74 (1.74) and Plosr = 5.78 x (5.44 x lo-') at p = 0.60, and D = 4.82 (4.83) and PIoss = 3.98 x lo-' (4.05 x at p = 0.90.
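The arrival distribution that feeds this analysis can be sketched as follows. The sketch assumes the usual knockout behaviour rather than quoting the paper's equation: each of the N inputs sends a cell to the tagged output with probability p/N, the concentrator passes at most L of them to the buffer, and the excess is dropped, so no more than L cells ever arrive together. The function names are illustrative.

```python
from math import comb


def binomial_pmf(n: int, q: float) -> list:
    """P(k of n inputs address the tagged output), k = 0..n."""
    return [comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(n + 1)]


def buffer_arrivals(n: int, l_conc: int, p: float) -> list:
    """P(i cells reach the output buffer in a slot), i = 0..n.

    Assumption: the binomial tail at and above L is folded into i = L,
    and the probability is zero for L + 1 <= i <= N.
    """
    pmf = binomial_pmf(n, p / n)
    return pmf[:l_conc] + [sum(pmf[l_conc:])] + [0.0] * (n - l_conc)


def concentrator_loss(n: int, l_conc: int, p: float) -> float:
    """Fraction of offered cells dropped by the concentrator alone."""
    pmf = binomial_pmf(n, p / n)
    lost = sum((k - l_conc) * pmf[k] for k in range(l_conc + 1, n + 1))
    return lost / p


if __name__ == "__main__":
    probs = buffer_arrivals(64, 5, 0.60)
    print(sum(probs), concentrator_loss(64, 5, 0.60))
```

The concentrator loss computed this way is only one component of the overall cell loss; buffer overflow adds to it and becomes the larger term as the load grows.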
