OJOMU, SUNDAY ABAYOMI
PG/M.ENG/06/41381
PERFORMANCE ANALYSIS OF VOQ PACKET-SWITCH
ELECTRONIC ENGINEERING
A THESIS SUBMITTED TO THE DEPARTMENT OF ELECTRONIC
ENGINEERING, UNIVERSITY OF NIGERIA, NSUKKA
JUNE, 2009
TITLE PAGE
PERFORMANCE ANALYSIS OF VOQ PACKET-SWITCH
BY
OJOMU, SUNDAY ABAYOMI
PG/M.ENG/06/41381
BEING A THESIS SUBMITTED TO THE
DEPARTMENT OF ELECTRONIC ENGINEERING
UNIVERSITY OF NIGERIA, NSUKKA
IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF MASTER OF ENGINEERING
(COMMUNICATIONS) DEGREE
IN ELECTRONIC ENGINEERING
SUPERVISOR: DR C. I. ANI
JUNE 2009
APPROVAL PAGE
This thesis was approved for the Department of Electronic Engineering, University of Nigeria, Nsukka by
-------------------------- --------------------------------------
Dr. C. I. Ani Ven. Prof. Engr. T. C. Madueme
Supervisor Head of Department
------------------------------
External Examiner
CERTIFICATION
OJOMU, Sunday Abayomi, a postgraduate student in the Department of Electronic Engineering with Registration Number PG/M.ENG/06/41381
has satisfactorily completed the requirements for the course and dissertation for the degree of Master of Engineering (Communications). The
work contained in this dissertation is original and has not been submitted in part or in whole for any other degree of this or any other university or
institution.
--------------------------- ---------------------------
Dr. C. I. Ani Date
Supervisor
DEDICATION
To the glory of God the Almighty whose infinite love for us was made manifest through Jesus Christ by the help of the Holy Spirit.
ACKNOWLEDGEMENTS
It took me a long time to complete this work, and without the help, support and prayers of several people it would not have been satisfactorily concluded. The following persons form a cross-section of the people of goodwill in my postgraduate work:
Dr. C. I. Ani served as an indefatigable coach for me along the way. He created a stimulating environment during his tenure as Head of the Department of Electronic Engineering, UNN, and after it. He guided me with the interest of a mother eagle training her eaglet in flight towards a higher altitude.
I owe a debt of gratitude to my wife and children for their prayers and patience during my frequent absences in the course of this postgraduate schooling. For urgent financial support, I will remember my cousin, Mrs Bosede Adeniji, and my undergraduate friends Mr Yinka Obiwole and Mr Dada Ademokoya. My childhood friend, Mr Martins Akinwumi (a.k.a. Marteco), will not be forgotten for his moral support. I will not forget to thank my UNN postgraduate classmates, including Chief Alexis Ironbar, Mr Alex Ndi, Ericson, Ezea Ugo, Emeka, Ifeanyi, Charles, Joseph Mom, Mrs Toyin, Abgo, Ifenayi, Ken and others.
I will never forget Professor I. Okoro of the Electrical Engineering Department, UNN, whose book on MATLAB gave me a strong appetite for MATLAB as a necessary tool for the simulations. I will always remember Elder Eko James of the Department of Electrical and Electronic Engineering, CRUTECH, Calabar. Miss Uju Okoyeuzu of the Department of Computer Science, UNN, will always be remembered for her faith in achieving what seemed too hard to achieve.
Finally, my heartfelt gratitude goes to Mr Christian, one of the assistants in the PG laboratory, to all the staff of the Electronic Engineering Department, UNN, and to all my roommates in Nkrumah Hall. May the Lord God Almighty bless all of you for your labour of love to me, in Jesus' name.
ABSTRACT
This dissertation presents an approach that directly correlates the analytical model with the simulation model, using minimal assumptions, for analysing the performance of a VOQ packet switch in order to determine the comparative advantage of its queueing scheme over other queueing configurations. The approach models packet arrivals as an On-Off arrival process, with geometric service times in each virtual queue and first-in-first-out (FIFO) buffer sections. Heuristic expressions for throughput and delay were derived and validated by computer simulations. The simulation results show that the VOQ configuration allows dynamic adjustment of the buffer allocation between 5 and 20 packets with 100% throughput under all traffic conditions. Any buffer size above 20 under this architecture results in excessive delay.
TABLE OF CONTENTS
Certification
Dedication
Acknowledgements
Abstract
1. INTRODUCTION
1.0 Background of the Project
1.1 Buffering/Queueing
1.2 Sequence Preserving
1.3 Major Performance Parameters
1.4 Objectives
1.5 Scope
1.6 Methodology
1.7 Outline of Thesis
2. AN OVERVIEW OF PACKET-SWITCH ARCHITECTURES
2.0 Introduction
2.1 Basics of Digital Switches and Switching Techniques
2.1.1 Structure of Circuit-Switched Elements
2.1.2 Combination of Switch Elements
2.1.3 Structure of Packet-Mode Switches
2.1.4 Packet-Switched Network
2.1.5 Categories of Packet-Switched Network
2.2 Classification of Packet-Switch Architectures
2.2.1 Single-stage vs. Multi-stage Switches
2.2.2 Input Queueing
2.3 Output Queueing
2.4 Shared Queueing
2.4.1 Shared Input Queueing
2.4.2 Shared Output Queueing
2.5 Shared-Memory Implementations
2.5.1 Sliding-Window (SW) Packet Switch
2.6 Combined Input and Output Queueing (CIOQ) Architecture
2.6.1 CIOQ Architectures with FIFO Input Queues
2.6.2 CIOQ with Finite Output Buffers and Back-Pressure
2.7 Virtual Output Queued (VOQ)
2.7.1 Analysis Model
3. MODELS AND SIMULATIONS OF VOQ PACKET-SWITCH
3.0 Introduction
3.1 VOQ Packet-Switch Architecture
3.2 The Model of the VOQs
3.2.1 Traffic Model
3.3 The VOQ Simulation Model
3.3.1 The Traffic Source
3.3.2 VOQ Process
3.3.3 The Single Server Process
3.3.4 Output Switch
3.3.5 The Destinations
3.4 The QoS Metrics
4. SIMULATIONS AND RESULTS
4.0 Introduction
4.1 The Traffic Pattern
4.2 Simulation Results
5. CONCLUSIONS
REFERENCES
APPENDIX A1
APPENDIX A2
APPENDIX A3
CHAPTER 1
INTRODUCTION
1.0 Background of the Project
Global bandwidth demand is growing at an exponential rate, fueled mainly by the proliferation of the Internet as a worldwide communication
medium for a wide spectrum of applications. The growth rates have been sustained by optical transmission technologies such as wavelength-
division multiplexing (WDM), which provides very large transmission bandwidth. So far, electronic routing and switching systems have been
able to keep up with developments in optical transmission technologies by exploiting the advances in silicon technology. The demand for
transmission bandwidth, however, outstrips the available routing and switching capacities. Although all-optical switches are very promising for achieving very high throughput, they have two distinct disadvantages. First, their switching granularities are very coarse, which prevents per-packet switching. Secondly, optical buffering is cumbersome and impractical. Therefore, electronic packet switches are expected to continue to play a large role for a long time to come [1]. Electronic switches have their own disadvantages in terms of scalability and fabric speed, and architectures that can scale to the desired throughputs in the multi-terabit range are being developed. It is therefore pertinent to take a look at
the trend of this development, by providing comparative performance analysis of various classes of packet switch architectures and in particular
that of Virtual Output-Queued (VOQ) packet switches.
In digital switching, user messages or data are divided into segments by specialized equipment called packet assembler/disassembler (PAD).
These segmented data have addressing, sequencing and error control information fields. The resulting unit of data (group of binary digits),
including data and call control signals, and possibly error control information are arranged in a specified format. This unit of data which is
switched as a composite whole is called a packet. The electronic switching system that supports the packets is called ‘packet switch’ [2][3][4].
Packet Switching may be defined as the transmission of data by means of addressed packets whereby a transmission channel is occupied for the
duration of transmission of the packet only; the channel is then available for use by packets being transferred between different data terminal
equipment [3].
A packet switch has four components: input port, output port, the routing processor and the switching fabric, as shown in figure 1.1.
Input and Output ports
With reference to the OSI standard [5] [6], the input port performs the physical and data link functions for the packet switch. The bits are
constructed from the received signal. The packet is decapsulated from the frame. Errors are detected and corrected. The packet is then ready to
be routed by the network layer. In addition to a physical layer processor and a data link processor, the input port has buffers to hold the packet
before it is directed to the switching fabric.
The switch fabric
The switch fabric is used for the input and output ports interconnection. The switching fabric may use a specific scheduling algorithm to decide
which packet to transfer in the next packet time. This is particularly important for VOQ input-buffered switches.
Fig.1.1. Functional blocks of a packet switch. [2]
Routing Processor/signaling.
When interconnections are made, the packets are directed to their respective destinations in an orderly manner prescribed by the algorithm in the
processing center and signaling protocol.
1.1 Buffering/queueing
Due to the statistical nature of the traffic, buffering of packets in the packet switch is unavoidable. Two or more packets may be addressed to the same port at the same time. In such a situation, each output line can serve only one packet per time slot; the other packets must be stored in buffers. This process of storing packets, and the strategies for doing so, is called buffering or queueing. Buffers may be placed at the input ports, at the output ports and within the switch fabric. Buffering is the traditional way of resolving contention in a packet switch [5,6]. Some basic methods of
buffering/queueing are: input queueing (IQ), output queueing (OQ), combined input output queueing (CIOQ) and virtual output queuing (VOQ).
VOQ is an algorithm in which the input buffers are organized according to outputs. The platform is still the input-queued, crossbar-based
architecture.
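As a minimal illustration of this organisation (a sketch, not code from the thesis), the Python snippet below keeps one FIFO per output at each input port, so a blocked head-of-line packet for one output never delays packets bound for other outputs. All class and variable names are invented for the example.

```python
from collections import deque

class VOQInput:
    """One input port of an N x N switch with virtual output queueing:
    a separate FIFO is kept for every output port."""

    def __init__(self, num_outputs):
        self.voq = [deque() for _ in range(num_outputs)]

    def enqueue(self, packet, output):
        # Packets are sorted by destination on arrival, which removes
        # head-of-line blocking between different outputs.
        self.voq[output].append(packet)

    def head_of_line(self, output):
        # A scheduler inspects per-output queue heads rather than a single FIFO head.
        return self.voq[output][0] if self.voq[output] else None

    def dequeue(self, output):
        return self.voq[output].popleft()

# Usage: a 3 x 3 switch, as in the thesis' simulation model.
inputs = [VOQInput(3) for _ in range(3)]
inputs[0].enqueue("pkt-A", output=2)
inputs[0].enqueue("pkt-B", output=1)
print(inputs[0].head_of_line(1))  # "pkt-B" is not blocked behind "pkt-A"
```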
1.2 Sequence preserving/scheduling.
Depending on the switch-fabric application, there may be a strict requirement regarding the order in which packets leave the switch. In particular,
certain applications, such as ATM, impose the strict requirement that packets belonging to the same flow must be delivered in the order they
arrived. For a packet switch, this means that the sequence of packets belonging to a given combination of input port, output port, and virtual lane must be maintained. The scheduling algorithm for the switch fabric takes care of this [2].
Ideally, communication between one pair of users should never interfere with communication between any other pair. One way to guarantee this is to give every pair of users its own link; while this is certainly highly desirable, the price to be paid is very high, owing to the number of interconnection links required in such a network with N users. This may be feasible for a network having just a few users, but for networks such as the public telephone system or the Internet, with tens of millions of users, such a solution is clearly both impractical and prohibitively expensive.
However, in general the sum of the requested user bandwidth is much smaller than the total bandwidth available to the group, which offers the potential for substantial cost savings. To take advantage of this, traffic from a group of end users is multiplexed onto a single link of bandwidth T, which is shared by all of the users in the group. This principle is called statistical multiplexing. Packet switching in this environment has proven to be a major technological breakthrough in providing cost-effective data communications among information-processing systems.
The switches under study are built on statistical multiplexing technology [4]. The communication network is segmented. This is done by first
assigning groups of users that will share a link, and then interconnecting the groups by means of packet switches.
In order to avoid collisions of packets in the switch, and to be fair to each packet, the sharing of the switch-fabric connections should be optimal. The coordination mechanism that achieves this is called packet scheduling. The best packet switch delivers the best QoS, such as a prescribed bound on delay and good throughput, under any feasible scheduling method.
1.3 Major Performance Parameters
The following parameters [4] are used to describe performance characteristics of a packet switch.
• Average throughput: the average number of packets that exit the switch during one packet cycle. This parameter depends largely on the number of switch output ports. Equivalently, throughput may be expressed as the fraction of time the output lines are busy (output utilisation).
• Average packet delay: the average delay encountered by a packet while traversing the switch (usually expressed in packet cycles).
• Packet-loss rate: the percentage of arriving packets that are dropped by the switch owing to lack of buffer space.
• Packet-delay variation: also referred to as jitter; the distribution of packet delays around the mean.
A sketch of how these metrics can be computed from simulation records is given below.
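The snippet is a hedged illustration only: it assumes per-packet records of (arrival slot, departure slot or None if dropped), a format and set of function names invented for the example rather than taken from the thesis' MATLAB implementation.

```python
import statistics

def qos_metrics(records, num_slots, num_outputs):
    """Compute the four QoS metrics from per-packet records.
    Each record is (arrival_slot, departure_slot), with departure None if dropped."""
    delivered = [(a, d) for a, d in records if d is not None]
    delays = [d - a for a, d in delivered]
    # Throughput as output utilisation: busy output-slots / total output-slots.
    throughput = len(delivered) / (num_slots * num_outputs)
    return {
        "avg_throughput": throughput,
        "avg_delay": statistics.mean(delays) if delays else 0.0,
        "loss_rate": 1 - len(delivered) / len(records) if records else 0.0,
        "jitter": statistics.pstdev(delays) if delays else 0.0,  # spread around the mean
    }

print(qos_metrics([(0, 2), (1, 2), (3, None)], num_slots=4, num_outputs=1))
```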
1.4 OBJECTIVES
The objectives of this dissertation are:
(a) To experiment with a computer simulation model of the VOQ switching system, since it could prove too costly, and in many cases far too risky, to work with the real switch.
(b) To carry out a performance analysis of the VOQ packet switch in order to determine its comparative advantage over other queueing configurations.
1.5 SCOPE
This dissertation has a limited scope. No physical implementation of the VOQ switch architecture is required for the analysis, which is based on a computer simulation model of the VOQ. The Virtual-Output-Queued (VOQ) structure is adopted for the analysis; essentially, the logical structure of a 3 x 3 IQ switch forms the basis for this VOQ structure [7]. The inter-arrival times considered follow a Markov-modulated arrival process, with geometrically distributed service times. The bursty traffic model is based on a two-state Markov chain consisting of an ON and an OFF state. Each input in the on-off traffic model alternates between active and idle periods of geometrically distributed durations.
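To make the traffic model concrete, here is a minimal sketch of such a two-state on-off source. It assumes one packet per slot in the ON state and fixed per-slot exit probabilities, which is exactly what makes the state residence times geometrically distributed; the parameter names are illustrative.

```python
import random

def on_off_source(p_on_end, p_off_end, num_slots, seed=1):
    """Two-state Markov (on-off) source: one packet is emitted in every ON
    slot; each slot the current state is left with a fixed probability, so
    ON and OFF durations are geometric."""
    rng = random.Random(seed)
    state_on, arrivals = False, []
    for _ in range(num_slots):
        if state_on:
            arrivals.append(1)                    # one packet per active slot
            state_on = rng.random() >= p_on_end   # stay ON with prob 1 - p_on_end
        else:
            arrivals.append(0)
            state_on = rng.random() < p_off_end   # leave OFF with prob p_off_end
    return arrivals

# Mean ON burst = 1/p_on_end slots, mean OFF gap = 1/p_off_end slots,
# so the offered load is p_off_end / (p_on_end + p_off_end).
print(on_off_source(p_on_end=0.2, p_off_end=0.1, num_slots=20))
```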
1.6 METHODOLOGY
A typical communication switch architecture consists of input ports, output ports, switch fabric, some buffer memory and control system. Past
and recent switch designs were motivated by issues of QoS provisioning. As a result, varying architectures were proposed on output buffering,
input buffering, combined input and output buffering. The evolution of these architectures for the past several years was presented.
Throughput is the fundamental metric for the performance evaluation of a packet switch and the basic metric for estimating packet delay [8].
Towards this direction, an overview of performance evaluation methods carried out on packet switches was conducted from which the research
direction was defined.
The model was built and simulated with MATLAB/simulink. The simulation time of MATLAB was used as the run time and analysis was
carried out from the results.
1.7 OUTLINE OF THESIS
The thesis is organized as follows: Chapter 1 presents the general introduction, containing the background of the VOQ packet-switch performance analysis, the objectives of the project, the methodology and this outline. Chapter 2 presents an overview of existing packet switches: three main classes of switching architectures are presented based on queueing discipline and their modes of operation, and performance analyses of VOQ packet-switch architectures are reviewed. Chapter 3 presents the model and simulation of the 3x3 VOQ switch, together with the performance analysis under uniformly distributed On-Off traffic and a Geo/Geo/1 service discipline. Chapter 4 contains the performance characteristics of the VOQ architecture based on simulations, with the results compared against the numerical analysis of Chapter 3. Chapter 5 summarizes the conclusions of the project and suggests possible directions for future work.
CHAPTER 2
AN OVERVIEW OF PACKET-SWITCH ARCHITECTURES
2.0 Introduction
In this chapter, the concept of packet switching is presented. An overview of the existing packet-switched architectures is also given, with the main focus on providing insight into the most recent developments in each class. A few categories of Virtual Output Queued (VOQ) switches, and some performance-analysis techniques for them, are discussed; these categories are based on queueing discipline. The chapter concludes with a discussion of the merits and drawbacks of the most relevant architectures. The chapter does not intend to provide a comprehensive overview of all packet-switch architectures.
2.1. Basics of digital Switches and Switching techniques
This section presents the concepts of the design of a virtual-packet-switch. Electronic switching systems may be approached as a system having
two major parts. These are the switching fabric (or switch matrix), and the call control part. The switching matrix is the part that gets the traffic
from one user port to another, and the call control is the part that manages what connections the switching fabric should make.
Circuits typically arrive at switches as E-1/T-1 digital circuits [3]. Sometimes higher-rate interfaces are used, and some circuits exist as ISDN and analogue connections to codecs built into the switch's interface cards, but E-1/T-1 connections predominate [3]. The E-1/T-1 connections are used here to illustrate the issues of digital time-domain switching, which is the core technology for all modern switches [3].
2.1.1 Structure of Circuit-Switched Elements
Digital switching does not only connote the switching of signals in digital form over PCM links. Interconnection of subscribers to these links requires timing, which is achieved by means of time slots for the inlets and outlets. The following switching elements are used for this timing in circuit-switched systems.
Space-division switching elements
Space division means taking data from one of many TDM channels and sending it to one of many other TDM channels, usually also changing channels very rapidly. In a space-division switch, the paths in the circuit are separated from one another in space. A typical example is the crossbar switch, which connects n inputs to m outputs in a grid, using electronic microswitches at each crosspoint, as illustrated in figure 2.9. In this example, n = 4 and m = 5.
Fig. 2.9. Crossbar switch with 4 inputs and 5 outputs.
The major limitation of this design is the number of crosspoints required: connecting n inputs to m outputs using a crossbar switch requires n x m crosspoints. In practice, few crosspoints are in use at any given time, making the switch inefficient [3]. The solution to the limitations of the crossbar switch is the multistage switch, which combines crossbar switches in several stages, as shown in figure 2.10.
In a single crossbar switch, only one row or column (one path) is active for any connection, so N x N crosspoints are needed. By creating multiple paths inside the switch, the number of crosspoints is decreased: each crosspoint in the middle stage can be accessed by multiple crosspoints in the first or third stage.
The whole idea of multistage switching is to share the crosspoints in the middle-stage crossbars. However, sharing can cause a lack of availability if the resources are limited and several users want a connection at the same time. If an input cannot be connected to an output because no path is available between them, that is, all the possible intermediate switches are occupied, the switch is said to be blocking for that input. This is a drawback of this switch arrangement; blocking occurs during periods of heavy traffic.
In a single-stage switch, blocking does not occur, because every combination of input and output has its own crosspoint and there is always a path. The output may be busy at a given time, but there is no blocking in the path.
Fig. 2.10 Crossbar multistage switch: stage 1 has N/n crossbars of size n x k, stage 2 has k crossbars of size N/n x N/n, and stage 3 has N/n crossbars of size k x n.
In large systems, the number of stages can be increased to cut down on the number of crosspoints required. As the number of stages increases, however, possible blocking increases as well. Blocking is common on public telephone systems in the wake of a natural disaster, when calls being made to check on or reassure relatives far outnumber the regular load of the system.
Time-division switch
A time-division switch uses time-division multiplexing (TDM) inside the switch. Inside the switch is a combination of a TDM multiplexer, a TDM demultiplexer, and a TSI, as shown in figure 2.11. The TSI consists of random-access memory (RAM) with several memory locations. The size of each location is the size of a single time slot, and the number of locations is the same as the number of inputs and outputs. The RAM fills up with incoming data from time slots in the order received; slots are then sent out in an order based on the decisions of a control unit.
The channels on the link are structured into frames and multiframe arrangements so that any attempt to interconnect or link calling and the called
subscribers will require a time slot adjustment [4].
Time switching operation principle.
Assume the time-switching plane shown in figure 2.12, with five subscribers on the inlet side of a TDM switch and five subscribers on the outlet side. If subscriber D on inlet time slot 4 wants to be connected to subscriber U on outlet time slot 5, the inlet sample packet must be delayed in the switch for one time slot before being sent out on time slot 5 for the connection. The type of switch designed to provide the appropriate delay is called a time switch or Time Slot Interchange (TSI) [4]. This is an interface to the outlet channel in the link. The sample from the inlet channel is stored in a buffer and then read out when the appropriate output time slot arrives.
Each input timeslot word is stored in a buffer. A control store holds information on the time at which each sample has to be read out. At the
appropriate time, this control store connects the data buffer to the output lines.
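In software terms, a time-slot interchange reduces to buffering one frame and reading it back in a permuted order. The sketch below illustrates this with a hypothetical 4-slot frame and control-store mapping (the values are invented for the example, not taken from the figure).

```python
def time_slot_interchange(frame, control_store):
    """Write the incoming time-slot words into RAM in arrival order, then
    read them out in the order dictated by the control store."""
    ram = list(frame)                        # data buffer: one word per slot
    return [ram[src] for src in control_store]

# Usage: a hypothetical 4-slot frame; output slot i carries input slot
# control_store[i] (0-indexed), giving the required delays.
print(time_slot_interchange(["A", "B", "C", "D"], [2, 3, 0, 1]))  # ['C','D','A','B']
```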
Fig.2.11 Time-Slot Interchange (TSI)
2.1.2 Combination of Switch Elements.
The very simple architectures shown in figures 2.11 and 2.12 are not suitable for handling traffic volumes experienced in the thousands of circuits
handled by real public network exchanges. In practice, more complex combinations of time and space switching elements are used, to achieve a
balance between capacity, blocking and economy.
Space-Time-Space switch.
Figures 2.13 and 2.14 show a very simple STS switch, which switches between N (N ≤ 30) incoming circuits and N outgoing circuits, for each of 30 time slots per circuit. In real life the call paths will generally be bi-directional, so the outgoing and incoming circuits will not generally be distinguishable, but it helps to separate them for showing how the switching works.
The figure shows a switch connecting time slot 3 of incoming circuit number 2, to time slot 6 of outgoing circuit number 1. The space switches
are controlled by blocks of memory which are written to by the call control software.
Fig. 2.12 Time-division switching
Time-Space-Time Switching
To achieve the time shift, a 3-slot delay unit is more commonly used. A space switch at the incoming side switches circuit 2 through to the delay
unit, just for the duration of time slot 3. A space switch at the outgoing side switches the output of that delay unit through to outgoing circuit 1,
for the duration of time slot 6.
Fig. 2.13 Space-Time-Space (STS) architecture: N incoming circuits, each carrying frames of (say) 30 time slots, enter an incoming N x 30 space switch, pass through 30-channel fixed-delay time switches, and leave through an outgoing N x 30 space switch (shown connecting circuit 2, slot 3 to circuit 1, slot 6).
At the heart of the Time-Space-Time switch is an N x N space switch, which cycles through all possible N² combinations of space connections once per frame time. The incoming circuit (2) is permanently connected to a variable delay element, which is programmed to switch slot 3 into the time slot for which the space switch connects incoming circuit 2 to outgoing circuit 1. Then a second delay element switches that time slot on circuit 1 into slot 6, as required.
The TSI is the switch element that provides the variable Δt time delay in the STS or TST combination.
Fig. 2.14 Space-time-space switch block diagram
Fig. 2.15 Time-Space-Time (TST) switch architecture [4]: arrays of N time-switch elements (variable Δt) on both the incoming and outgoing sides of an N x N space switch; N incoming circuits, each carrying frames of (say) 30 slots, and N outgoing circuits, each of 30 time slots.
From figure 2.16, the inputs that arrive in frames of size n are mapped into frames of size k by the input TSI switches. The introduction of TSI
switches at the input and output stages and the introduction of a single time-shared crossbar switch result in a much more compact design than
space switches.
2.1.3 Structure of packet-mode switches.
A switch used in a packet-switched network has a different structure from a switch used in a circuit-switched network. A packet switch has four
components: input ports, output ports, the routing processor and the switching fabric, as shown in figure 2.17.
Fig. 2.16 Time-space-time switch block diagram [7]: N inputs in TDM frames of n slots pass through n x k input TSI stages, a space stage operating on frames of k slots, and k x n output TSI stages to N outputs with frames of n slots.
Input ports
The schematic diagram of an input port is shown in figure 2.18. The input port performs the physical and data link functions of the packet switch.
The bits are constructed from the received signal. The packet is decapsulated from the frame. Errors are detected and corrected. The packet is
then ready to be routed by the network layer. In addition to a physical layer processor and a data link processor, the input port has buffers to hold
the packet before it is directed to the switching fabric.
Fig. 2.17. The components of a packet switch [2]
Source: Anurag Kumar (2005)
Figure 2.19 shows the schematic diagram of an output port. The output port performs the same function as the input port, but in the reverse order.
First the outgoing packets are queued, then the packet is encapsulated in a frame, and finally the physical layer functions are applied to the frame
to create the signal to be sent on the line.
Routing Processor.
The routing processor performs the functions of the network layer. It does the table lookup by searching the routing table. The destination
address is used to find the address of the next hop and, at the same time, the output port number from which the packet is sent out.
Switching fabrics.
Fig. 2.18. Schematic diagram of an input port: a physical-layer processor and a data-link-layer processor, followed by a queue.
Fig. 2.19. Schematic diagram of an output port: a queue, followed by a data-link-layer processor and a physical-layer processor.
Crossbar switches and banyan switches are used in the structure of the fabric. Although the crossbar switch is the simplest type, the banyan switch is more practical because of delay and reliability issues. A banyan switch is shown in figure 2.20. It is a multistage switch with microswitches at each stage that route the packets based on the output port, represented as a binary string. For n inputs and n outputs, there are log2(n) stages with n/2 microswitches at each stage. The first stage routes the packet based on the high-order bit of the binary string, the second stage routes it based on the second high-order bit, and so on until the whole string has been consumed [1]. Figure 2.20 shows a banyan switch with eight inputs and eight outputs [2]; the number of stages is log2(8) = 3.
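Since self-routing by destination bits is the essence of the banyan fabric, a small sketch can make it concrete. The port naming ("upper"/"lower" for bit 0/1) is an assumed convention for illustration.

```python
def banyan_route(dest, num_stages):
    """Self-routing in a banyan fabric: the 2x2 element at stage s examines
    bit s of the destination (most significant bit first) and forwards the
    packet to its upper port on 0 or its lower port on 1."""
    bits = format(dest, "0{}b".format(num_stages))
    return ["upper" if b == "0" else "lower" for b in bits]

# Usage: an 8x8 fabric has log2(8) = 3 stages; destination 5 is binary 101.
print(banyan_route(5, 3))  # ['lower', 'upper', 'lower']
```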
2.1.4 Packet-Switched Network.
In multiplexing, frequency bands (FDM) or time slots (TDM) are used to establish occupancy slots to which individual data sources are assigned. Thereafter, information in the form of voice or data uses the reserved slot for the duration of the voice call or data-transmission session. In packet switching, specialized equipment called a packet assembler/disassembler (PAD) divides data into defined segments to which addressing, sequencing and error-control information are added. The resulting unit of data is called a packet and may represent a user message or a very small portion of a user message [2].
Fig. 2.20. A banyan switch architecture
The flow of packets between nodes in a packet network is intermixed with respect to the origins and destinations of the packets. That is, traffic in the form of packets from many users can share large portions of the transmission facilities used to form a packet network. Thus, use of a packet network is normally more economical than transmission over the public switched telephone network for long-distance transmission. Figure 2.21 illustrates the architecture of a packet-switched network.
Fig. 2.21 Packet-switched network architecture [2]
DCE – Data Circuit-terminating Equipment
DSE – Data Switching Exchange.
The packet-mode switching network is constructed through the use of PADs and equipment that routes and transmits packets. Some types of DTEs can create their own packets, while other types of DTEs require the conversion of their protocol into packets through the use of a packet assembler/disassembler (PAD). Equipment that routes packets through the network is called a packet switch. Packet switches examine the destination of packets as they flow through the network and transfer the packets onto trunks interconnecting switches, based upon the packet destination and network activity.
There are two basic approaches to transferring information over a packet-switched network. The first approach, called connection-oriented, involves setting up a connection across the network before information can be transferred. The setup procedure typically involves the exchange of signaling messages and the allocation of resources along the path from input to output for the duration of the connection. The second approach is connectionless and does not involve prior allocation of resources; instead, the packet is routed independently from switch to switch until it arrives at its destination.
2.1.5 Categories of Packet-Switched Network
Packet-switched networks are divided into two categories: datagram and virtual-circuit networks. Figure 2.23 shows datagram packet switching
concepts.
Datagram Network
In datagram packet switching, there is no prior resource allocation for a packet. This means that there is no reserved bandwidth on the links and no scheduled processing time for each packet. Resources are allocated on demand, on a first-come, first-served basis. When a switch receives a packet, no matter what its source or destination, the packet must wait if other packets are being processed. This lack of reservation may create delay.
In a datagram network, each packet is treated independently of all the others. Even if a packet is part of a multipacket transmission, the network treats it as though it existed alone.
Datagram switching is normally done at the network layer. Figure 2.23 shows how the datagram approach is used to deliver four packets from station A to station X. The switches in a datagram network are traditionally referred to as routers [1].
In figure 2.23 all four packets (or datagrams) belong to the same message, but they may travel different paths to reach their destination. This is because the links may be involved in carrying packets from other sources and may not have the necessary bandwidth available to carry all the packets from A to X. This approach can cause the datagrams of a transmission to arrive at their destination out of order, with different delays
Figure 2.23. A block diagram of datagram network [1].
between the packets. Packets may also be lost or dropped because of a lack of resources. In most protocols, it is the responsibility of an upper-layer protocol to reorder the datagrams or ask for lost datagrams before passing them on to the application. Datagram networks are referred to as connectionless networks because the packet switch does not keep information about the connection state. There are no setup or teardown phases; each packet is treated the same by a switch regardless of its source or destination.
However, each switch has a routing table that is based on the destination address. The routing tables are dynamic and are updated periodically. The destination addresses and their corresponding forwarding output ports are recorded in the routing tables. This is different from the table of a circuit-switched network, in which each entry is created when the setup phase is completed and deleted when the teardown phase is over. Figure 2.24 shows the routing table for a packet switch in a datagram network.
Each packet in a datagram network carries a header that contains, among other data, the destination address of the packet. When the switch receives the packet, this destination address is examined and the routing table is consulted to find the corresponding port through which the packet should be forwarded. This address remains the same during the entire journey of the packet.
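As a minimal sketch, datagram forwarding is a single table lookup keyed by the destination address. The table values below are those of figure 2.24; the function name is illustrative.

```python
# Destination address -> output port, as in the figure 2.24 example.
routing_table = {1232: 1, 4160: 2, 8130: 3}

def forward_datagram(dest_addr):
    """Datagram forwarding: look up the header's destination address in the
    routing table; the address itself is never rewritten en route."""
    return routing_table[dest_addr]

print(forward_datagram(4160))  # port 2
```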
Virtual-Circuit network.
In a virtual-circuit network, the following characteristics are common:
1. There are setup and teardown phases in addition to data transfer.
2. Resources can be allocated during the setup phase, or on demand as in a datagram network.
3. Each packet carries an address in the header. However, the address in the packet header has local jurisdiction: it defines the next switch and the channel on which the packet is carried, not the end-to-end route.
4. All packets follow the same path established during the connection.
5. A virtual circuit is normally implemented in the data link layer, while a circuit-switched network is implemented in the physical layer and a datagram network is implemented in the network layer.
Fig. 2.25. A switch topology in a virtual-circuit network [1]
Figure 2.25 is an illustration of a virtual-circuit network for four end systems for the switches. The network has switches that allow traffic
from sources to destinations. A source or destination can be a computer, packet switch, or any device that connects other networks.
Fig. 2.24. Routing table in a datagram network [1]
Destination address   Output port
1232                  1
4160                  2
...                   ...
8130                  3
Figure 2.26 illustrates the data flow in a virtual-circuit packet network. Although a route will be established based upon network activity, once
established the route remains fixed for the duration of the call. Thus, packets will flow in sequence through each switch which reduces both the
amount of processing required to be performed at each switch and delays associated with waiting for out of sequence packets to arrive at a
destination node prior to being able to pass an ordered sequence of packets to their destination.
Fig. 2.26 Data flow in virtual-circuit packet network
Addressing in a virtual-circuit network.
In a virtual-circuit network, two types of addressing are involved: global and local. The global address is used only to create a virtual-circuit identifier (VCI). This identifier, used for data transfer, is a small number that has only switch scope: it is used by a frame between two switches. When a frame arrives at a switch it has one VCI; when it leaves, it has a different one [1]. Figure 2.27 shows how the VCI in a data frame changes from one switch to another. Each switch uses its own unique set of VCIs.
Phases of switching in virtual-circuit network.
For data transfer to take place in a virtual-circuit network, the source and destination need to go through three phases: setup, data transfer, and teardown. In the setup phase, the source and destination use their global addresses to help the switches make table entries for the connection. In the teardown phase, the source and destination inform the switches to delete the corresponding entries. Data transfer occurs between these two phases.
(1) Setup phase.
In the setup phase, a switch creates an entry for a virtual circuit. The source sends a setup request frame, which is followed by an acknowledgement if the destination is prepared to accept a link for data transfer. Figure 2.28 shows the process of a setup request from source A to destination B.
Fig. 2.27 Virtual-circuit identifier.
a. Source A sends a setup request frame to switch 1.
b. Switch 1 receives the setup request frame. It knows that a frame going from A to B goes out through port 3. (In the setup phase, the switch acts as a packet switch; it has a routing table, which is different from the switching table. For the moment, assume that it knows the output port.) The switch creates an entry in its table for this virtual circuit, but it is only able to fill three of the four columns: it assigns the incoming port (1), chooses an available incoming VCI (14), and records the outgoing port (3). It does not yet know the outgoing VCI, which will be found during the acknowledgment step. The switch then forwards the frame through port 3 to switch 2.
c. Switch 2 receives the setup request frame. The same events happen here as at switch 1; three columns of the table are completed: in this case, incoming port (1), incoming VCI (66), and outgoing port (2).
d. Switch 3 receives the setup request frame. Again, three columns are completed: incoming port (2), incoming VCI (22), and outgoing port (3).
e. Destination B receives the setup frame and, if it is ready to receive frames from A, assigns a VCI to the incoming frames that come from A, in this case 77. This VCI lets the destination know that the frames come from A and not from other sources.
(2) Acknowledgments.
A special frame, called the acknowledgment frame, completes the entries in the switching table. Figure 2.29 shows the process.
a. The destination sends an acknowledgment to switch 3. The acknowledgment carries the global source and
destination addresses so the switch knows which entry in the table is to be completed. The frame also carries VCI 77, chosen by the
destination as the incoming VCI for frames from A. Switch 3 uses this VCI to complete the outgoing VCI
column for this entry. Note that 77 is the incoming VCI for destination B, but the outgoing VCI for switch 3.
b. Switch 3 sends an acknowledgment to switch 2 that contains its incoming VCI in the table, chosen in the previous step.
Switch 2 uses this as the outgoing VCI in the table.
c. Switch 2 sends an acknowledgment to switch 1 that contains its incoming VCI in the table, chosen in the previous step.
Switch 1 uses this as the outgoing VCI in the table.
Fig. 2.28. Setup request in a virtual-circuit network
d. Finally, switch 1 sends an acknowledgment to source A that contains its incoming VCI in the table, chosen in the previous step.
e. The source uses this as the outgoing VCI for the data frames to be sent to destination B.
(3) Data transfer phase.
To transfer a frame from a source to its destination, all switches need to have a table entry for this virtual circuit. Each table, in its simplest form, has four columns. The entries in the tables were made during the setup and acknowledgment phases. Figure 2.30 shows a frame arriving at port 1 with a VCI of 14. When the frame arrives, the switch looks in its table for port 1 and a VCI of 14. When the entry is found, the switch knows to change the VCI to 22 and send the frame out from
port 3.
Fig. 2.29 Setup acknowledgment in a virtual-circuit network
Figure 2.30 shows how a frame from source A reaches destination B and how its VCI changes during the trip. Each switch changes the
VCI and routes the frame.
The data transfer phase is active until the source sends all its frames to the destination. The procedure at the switch is the same for each frame of
a message. The process creates a virtual circuit between the source and destination.
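Per-switch forwarding in the data-transfer phase thus amounts to a lookup on (incoming port, incoming VCI) followed by a label rewrite. The sketch below illustrates this with switch 1's entry from the example in the text (port 1, VCI 14 mapped to port 3, VCI 22); the function and table names are invented.

```python
def forward(frame_vci, in_port, table):
    """Label swapping at one switch in the data-transfer phase: look up
    (incoming port, incoming VCI), then rewrite the VCI and pick the
    output port from the switching-table entry."""
    out_port, out_vci = table[(in_port, frame_vci)]
    return out_port, out_vci

# Switch 1's entry from the text: port 1, VCI 14 -> port 3, VCI 22.
table1 = {(1, 14): (3, 22)}
print(forward(14, 1, table1))  # (3, 22)
```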
Fig. 2.30 Data transfer phase switching and tables in a virtual-circuit network [1]
(4) Teardown phase
In this phase, source A, after sending all frames to B, sends a special frame called a teardown request. Destination B responds with a teardown
confirmation frame. All switches delete the corresponding entry from their table.
Fig. 2.31 Source-to-destination data transfer [1]
The ATM network is a typical cell-switched network using the virtual-circuit approach [9]. ATM is more efficient than the plesiochronous digital hierarchy (PDH) and the synchronous digital hierarchy (SDH) because it dynamically and optimally allocates available network resources via cell-relay switching. It is the transfer mode, or protocol, adopted for the broadband integrated services digital network (B-ISDN), which supports all types of
interactive point-to-point and distributive point-to-multipoint communication services [10].
ATM breaks the information bit stream, whatever its origin (voice, text, data, etc.), into small packets of fixed length. A header is attached to each data packet to enable delivery to the correct destination. The fixed-length combination of service data and header is known as an ATM cell, which is 53 bytes long, with a 48-byte payload that carries the service data and a 5-byte header that carries identification, control, and routing information. The maximum transmission efficiency of ATM is therefore [11]
η_ATM = (48/53) x 100% ≈ 90.57%.
The user access devices, called the endpoints, are connected through a user-to-network interface (UNI) to the switches inside the network. The
switches are connected through network-to-network interfaces (NNI). The switch is able to cross-connect VPs and also to sort and switch their
VC contents as shown in figure 2.32.
Connection between two endpoints is accomplished through transmission paths (TPs), virtual paths (VPs), and virtual circuits (VCs). A
transmission path is the physical connection between an endpoint and a switch or between two switches. A transmission path is divided into
several virtual paths. A virtual path provides a connection or a set of connections between two switches. ATM networks are based on virtual
circuits (VCs). All cells belonging to a single frame follow the same virtual circuit and remain in their original order until they reach their
destination.
Fig. 2.32 Generic architecture of an ATM network [6]
2.2 Classification of Packet-Switch Architectures
There have been quite a number of switch-architecture overview papers, such as [12], [13], [14], [15], [16], [17], and [18]. A more recent overview, focusing especially on space-division architectures, was presented in [19]. These papers used many different ways to classify switch architectures into categories. Some of the criteria are blocking vs. non-blocking, buffering strategy (input-buffered, output-buffered or combined), lossy vs. lossless, single-stage vs. multi-stage, buffer implementation (partitioned, grouped, shared), and time- or space-divided (or
combined TST/STS). Many of these categorizations focus too strongly on implementation details. In the following overview of packet-switch architectures, the main focus is on the queueing discipline. From a high-level point of view, regarding the switch as a black box that moves data from A to B, the choice of queueing discipline is itself an implementation issue, so a few implementations are presented for emphasis.
For the purpose of this dissertation, the correct abstraction level is that of the internal switch architecture, and thus the queueing discipline, because this is the determining factor of switch performance [19].
2.2.1 Single-stage vs. multi-stage switches
Fig. 2.1 depicts a kN x kN Benes multistage switch fabric [20], consisting of three stages of k N x N switches each. For large networks, the multistage approach is much cheaper in terms of the required number of switch elements than single-stage port expansion. In general, for an M x M fabric with M = kN, kS switch elements are required for an S-stage fabric, compared with k² for a single-stage fabric.
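A quick sketch of this element-count argument, under the stated assumption M = kN with S stages (the function names are illustrative):

```python
def multistage_elements(k, stages=3):
    """Number of N x N switch elements in an S-stage fabric of size kN x kN."""
    return k * stages

def single_stage_elements(k):
    """Single-stage port expansion to kN x kN needs a k x k array of N x N elements."""
    return k * k

for k in (4, 16, 64):
    print(k, multistage_elements(k), single_stage_elements(k))
# For k = 64, a 3-stage fabric needs 192 elements versus 4096 single-stage.
```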
Single-stage architectures have a strong performance advantage over multi-stage architectures, but inherently are not as scalable. Therefore, as
bandwidth demands continue to rise, there will be a growing need for multi-stage fabrics because these can scale to many times more ports than
any single-stage architecture can [21].
Although a multi-stage fabric is a cascade of single-stage switches, the extension of a given single-stage architecture to multi-stage is far from trivial. To convey the complexity of making multi-stage fabrics work, some issues are highlighted:
(a) Network topology: which topology (e.g. Benes, Banyan, BMIN, Hypercube, Torus) is the right one for a given application?
(b) Performance: both in terms of delay, because more stages mean more latency, and in terms of throughput, because multi-stage fabrics are often internally blocking owing to a combination of the fabric's interconnection topology, static routing, and higher-order HoL blocking.
(c) Fabric-internal routing: static routing is easy to implement but leads to poor performance under unfavorable traffic patterns because it cannot adapt to congestion, whereas adaptive routing can improve performance but is expensive to implement and may lead to out-of-order delivery. Either source routing or per-hop look-up is required.
(d) Flow control: to maintain performance, prevent higher-order HoL blocking, and guarantee fabric-internal losslessness, proper flow control is required. To obtain maximum performance, global end-to-end flow control may be necessary, which presents significant scaling problems.
(e) Multicast support: multicast traffic requires both proper duplication strategies and destination-set encoding with corresponding lookup tables in every switch to maintain scalability (carrying full bitmaps is clearly not feasible for fabrics with hundreds of ports).
Figure 2.1: A kN x kN multistage network consisting of 3k N x N switch elements.
(f) Practical implementation and cost: these also form a significant issue, motivating the need for a scalable architecture.
Given an N x N switch element without buffers, and assuming persistent traffic sources with uniformly distributed packet destinations, the switch throughput was derived in [22].
The probability that m packets arrive simultaneously for the same output is
\[ P(X = m) = \binom{N}{m}\left(\frac{1}{N}\right)^{m}\left(1-\frac{1}{N}\right)^{N-m} \tag{2.1} \]
The throughput T_N equals the sum over all i of the probability that at least one cell is destined to output i, divided by N, i.e. T_N = 1 - (1 - 1/N)^N. For large N the throughput is
\[ T_{\infty} = \lim_{N\to\infty}\left[1-\left(1-\frac{1}{N}\right)^{N}\right] = 1-e^{-1} \approx 0.6320 \tag{2.2} \]
As there are no buffers, the cell-loss ratio equals
\[ 1-T_{\infty} = e^{-1} \approx 0.3680 \tag{2.3} \]
Outcome:
The outcome is that, under the assumption of i.i.d. random sources, the maximum throughput reaches 63%, while 37% of all packets are discarded [23]. Obviously, this packet-loss rate is unacceptable for any practical purpose, so the next consideration is switch architectures that incorporate some form of queuing [24].
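A small Monte Carlo sketch can confirm equations (2.1)-(2.3). It assumes persistent sources with uniformly distributed destinations, as in the derivation; all names are illustrative.

```python
import random

def bufferless_throughput(n, slots, seed=0):
    """Each of n inputs sends one packet per slot to a uniformly chosen
    output; without buffers, each output delivers at most one packet per
    slot and the remaining packets are lost."""
    rng = random.Random(seed)
    delivered = 0
    for _ in range(slots):
        destinations = [rng.randrange(n) for _ in range(n)]
        delivered += len(set(destinations))  # one winner per contested output
    return delivered / (n * slots)

print(bufferless_throughput(32, 10000))  # ~0.64; approaches 1 - 1/e ~ 0.632 as n grows
```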
2.2.2 Input Queuing
A switch belongs to the class of input-queuing switches if the buffering function is performed before the routing function [24]. Fig. 2.2 shows the architecture of an input-queued (IQ) packet switch. The bandwidth requirement on each buffer is proportional only to the port speed, and not to the number of switch ports [22].
Only one read and one write operation are required per packet cycle on each buffer, so that with a port speed of B, the total bandwidth is equal to
2B per buffer and 2NB aggregate for a switch of size N x N.
It is obvious from the diagram that the packet on input 1 destined to output x suffers HoL blocking owing to contention for output y. Output contention must nevertheless be resolved; this task is executed by the arbiter unit depicted in Fig. 2.2.
In each packet cycle, a selection is made from the set of head-of-queue packets to be forwarded to the outputs during this cycle. This selection
must satisfy the condition that at most one packet is forwarded to any single output.
Additionally, the selection policy must be designed such that fairness among inputs and outputs is guaranteed. Once the selection has been made,
the selected packets are removed from their buffers and forwarded through the routing fabric, which can be a simple crossbar. The three-phase
algorithm first presented in [24] is an example of such a selection algorithm. In an input-queued switch with FIFO queues, an input has at most
one packet that is eligible for transmission, namely the HoL packet. When output contention occurs (multiple inputs have a packet for the same
output), one and only one of these inputs is granted permission to transmit according to a certain rule. Various proposals to arbitrate amongst HoL packets destined to the same output have been made and investigated.
Figure 2.2: Input queuing with FIFO queues.
Outcome
It has been shown that the input-queuing scheme limits switch throughput to a theoretical maximum of merely 2 − √2 ≈ 58.6% of the maximum bandwidth under the assumption of independent, uniform Bernoulli traffic sources [24]. Owing to the increased correlation in the traffic arrivals caused by the buffering, adding buffers will actually degrade throughput compared to the no-buffer case. Additionally, it has been shown that the HoL selection policy does not affect throughput, although it may affect average delay to a small degree [20, 25].
Performance analysis model
In [26], the switch is modeled as N independent single-server queues (one for each input port), where each queue has an effective service-time distribution that accounts for the HoL contention. More specifically, it was assumed for each input port that there is a probability q of winning switch arbitration, where q is a function of λ, the traffic carried per port.
The average delay D was derived as a function of the average input load p assuming an infinite number of switch ports (N → ∞), infinite
buffering and that a packet arrived at each port with probability λ per slot, destined with equal probability of (1 / N ) to any one of the output
ports. Each input port had a buffer for incoming packets, which were served on an FCFS basis at the beginning of each slot.
The analysis was then partitioned into two parts: q (determined as a function of λ) and the individual input queues.
The throughput per port was given as the steady-state expected value of T:
\[ T = \frac{1}{N}\,E\!\left[\sum_{j=1}^{N}\varepsilon_{j}(N)\right] = E\!\left[\varepsilon_{j}(N)\right] \tag{2.4} \]
In the steady state, T = λ
where the number of successful packet deliveries during a slot equals [26]
\[ \sum_{j=1}^{N}\varepsilon_{j}(N) \tag{2.5} \]
The outcome:
\[ T(\lambda) = \lambda, \qquad \rho = \frac{\lambda(2-\lambda)}{2(1-\lambda)} \tag{2.6} \]
where ρ is the probability of occupancy for each queue. From this was obtained
\[ \lambda^{2} - 2(1+\rho)\lambda + 2\rho = 0 \tag{2.7} \]
Setting ρ = 1 (saturation) gives λ² − 4λ + 2 = 0, so the maximum λ is 2 − √2 ≈ 0.586.
The value of q as a function of λ was computed as
\[ q(\lambda) = \frac{1}{1+\dfrac{\lambda}{2(1-\lambda)}} = \frac{2(1-\lambda)}{2-\lambda} \tag{2.8} \]
The average delay is
\[ D = \frac{(2-p)(1-p)}{\left(2-\sqrt{2}-p\right)\left(2+\sqrt{2}-p\right)} \tag{2.9} \]
with 0 ≤ p < 2 − √2, where p ≡ λ.
And the average queue length is [26]
\[ \bar{k} = E\!\left[N_{j}(\lambda)\right] = \lambda + \frac{\lambda^{2}}{2(1-\lambda)} \tag{2.10} \]
Performance disadvantages:
The maximum throughput of 0.58 is low. It was concluded in [27] that the essential cause of this low maximum throughput was that a blocked
packet at the head of an input queue prevented other packets behind it destined to idle outputs from being forwarded. In [27] it was shown that
throughput even decreased with increasingly correlated traffic (burstiness), to as little as 50% for strongly correlated traffic. In [28] the analysis
concluded that this type of input-queued switch was not work conserving – there is a packet waiting for output x, but it cannot be served because
it is blocked by a packet destined to output y.
Improvements:
Spurred by this discovery of the inherent performance disadvantage of FIFO-based input buffering compared to output buffering, various approaches were proposed to overcome the limitation. The general approach is to make more than one packet from each input queue eligible for transmission. Early approaches were mainly based on scheduling packets into the future, or on the use of random-access buffers instead of FIFOs. The following are a few of the early approaches:
(a) Port grouping
The grouping of multiple physical outputs into one logical output was proposed in [30].
Such channel grouping resulted in a switch with fewer, but faster ports. The arbitration will redirect a request for a logical port to the first
corresponding idle physical port, so that HoL blocking is reduced. This principle was applied in the GIGA switch architecture [29].
(b) Scheduling packets by advance time slot.
In [31] a distributed contention-resolution mechanism was presented. Transmission times were dynamically allocated to packets stored in the
input queues. The algorithm consists of request, arbitration, and transmission phases. In the request phase, each input sends a request for its HoL
packet to the requested output. Each output j counts the requests directed to it, Rj. Output j keeps track of the next available transmission slot Tj It
assigns transmission slots Tj through Tj + Rj – 1 to the requesting inputs, and updates Tj to Tj + Rj. Upon reception of a transmission slot, an
input checks its transmission table; if the assigned slot is not yet reserved, the packet is removed from the queue and will be transmitted at the
assigned timeslot. Otherwise, the packet remains at HoL and has to retry in the next slot. This scheme reduces the HoL blocking, because packets
may be scheduled for future timeslots, thus bringing forward new HoL packets sooner than the three-phase algorithm. However, as timeslots
assigned by an output may have already been reserved at the requesting input, slots may be wasted, thus limiting maximum throughput.
Approximate analysis and simulation [31] show that maximum throughput under uniform i.i.d. traffic was about 76% (for a 16 x 16
system), up from 58% for conventional FIFO buffers.
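The slot-assignment step of this three-phase algorithm can be sketched as follows; this is a minimal Python illustration of the description above (the identifiers and the single-slot granularity are our simplifications, not the code of [31]):

T = [0, 0, 0, 0]                         # next available transmission slot, per output
tables = [set() for _ in range(4)]       # slots already reserved, per input

def schedule_slot(requests, now):
    # requests: dict input -> requested output (destination of its HoL packet)
    by_output = {}
    for inp, out in requests.items():
        by_output.setdefault(out, []).append(inp)
    granted = {}
    for out, inps in by_output.items():
        base = max(T[out], now)
        for offset, inp in enumerate(inps):
            slot = base + offset
            if slot not in tables[inp]:  # input accepts only if slot still free;
                tables[inp].add(slot)    # otherwise the slot is wasted
                granted[inp] = (out, slot)
        T[out] = base + len(inps)        # output advances T_j by R_j regardless
    return granted

print(schedule_slot({0: 2, 1: 2, 3: 0}, now=0))

Note how an output advances its counter even when the requesting input rejects the assigned slot; this is exactly the wasted-slot effect that limits the maximum throughput.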
A time-slot reservation scheme was proposed in [30,31,32]. In this scheme, contention resolution is performed by a number of reservation tables
RTi, 1 ≤ i ≤ N, one per input. RTi indicates the reservation state of the outputs at t + i – 1, where t is the current timeslot. Each RT contains one
entry per output, indicating whether the output is reserved at that time. Each input i attempts to match a packet in its queue to a non-reserved slot
in the corresponding table RTi , and reserves the corresponding entry when successful. The packet selected is removed from the queue and will be
transmitted at t + i. Each timeslot, RTi is copied into RTi-1, RT1 is discarded, whereas RTN is set to all non-reserved. The approach requires
FIRO (First In, Random Out) buffers, because the algorithm performs a search through each input buffer (up to a certain depth d) to find a
packet that can occupy a non-reserved slot in the reservation table. For a search-depth d=16 and a 16 x 16 system, the maximum throughput
under uniform i.i.d. traffic is about 90%. Special care must be taken to prevent unfairness among inputs because input 1 has a much smaller
chance of reserving a slot than input N, owing to the way the tables are shifted from N to 1 as time progresses.
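A minimal sketch of one timeslot of this reservation-table scheme, assuming the shifting behaviour described above (identifiers are illustrative):

from collections import deque

def reserve_and_shift(queues, tables, depth):
    # One timeslot: input i searches its buffer (up to 'depth' packets) for
    # one that fits a non-reserved entry in its table RT_i; a matched packet
    # leaves the queue and will be transmitted at t + i.
    N = len(queues)
    departures = []
    for i in range(N):
        rt = tables[i]
        for pos in range(min(depth, len(queues[i]))):
            out = queues[i][pos]
            if not rt[out]:
                rt[out] = True
                queues[i].pop(pos)
                departures.append((i, out, i + 1))   # (input, output, t + i)
                break
    tables.popleft()                  # RT_1 expires ...
    tables.append([False] * N)        # ... and a fresh all-free RT_N enters
    return departures

N = 4
queues = [[1, 2], [1], [3, 0], [2]]
tables = deque([[False] * N for _ in range(N)])
print(reserve_and_shift(queues, tables, depth=2))

The table shift in the last two steps is also where the unfairness noted above originates: input 1 always works on the table about to expire, while input N always receives a fresh, empty one.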
In [31] an improvement to the algorithm of [33] was presented. To reduce the probability of assigned timeslots being wasted, the inputs of the
switch can be grouped into groups of size k. From each group of inputs, k packets may depart in one timeslot. Thus, up to k conflicting timeslot
assignments can be managed by an input group, which drastically reduces the probability of a timeslot being wasted. Simulations have shown that
maximum throughput increases to over 90% for a group size of 4 for a 64 x 64 switch under uniform i.i.d. traffic. A further improvement,
proposed in [34], is the grouping of outputs as well as inputs, thus creating a combined input- and output-queued switching fabric.
In [35] the problem of what is termed “turn-around time” (TAT) between input controllers and a centralized contention-resolution unit was
addressed in particular. It was shown that the TAT equals the amount of time it takes to complete the scheduling of one packet (request-
acknowledge). By pipelining the requests to the contention-resolution unit, throughput degradation due to long TAT can be prevented.
(c) Window-based approach
The window-based approach was proposed in [36]. The window-based schemes essentially search the input queue for packets destined to idle
outputs. The strict FIFO queuing discipline of the input queues is relaxed, allowing other packets than the ones at HoL to contend for idle outputs.
Each input queue first contends with its HoL packet. The HoL packets that “win” the contention and their respective outputs are removed from
the contention process; the remaining inputs then contend with the next packet in their queues, and this repeats up to the window depth ω, where ω = 1 corresponds to strict FIFO queuing. For
uniform i.i.d. traffic this approach improved throughput considerably, even for small ω. For a 16 x 16 switch, throughput increases to 77% for a
window size ω = 3, and to 88% for ω = 8, as reported in [36]. However, as traffic exhibits a more bursty nature, the number of packets destined to
different outputs within the window ω decreases rapidly, so that a much larger window is required to achieve the same throughput, which
requires more processing power in the scheduling unit. Typically, the maximum input queue size Q is much larger than the number of outputs N.
Therefore, it makes sense to sort the input queue based on the destination of the packets; this way, only N packets need to be considered, instead
of up to Q. In [22] it was reported that this arrangement was proposed as early as 1984 in a paper by McMillen R. J., titled “Packet Switched
Multiple Queue Switch Node and Processing Method,” US Patent no. 4,623,996, filed Oct. 18, 1984, published Nov. 18, 1986; and has received
a significant amount of attention in recent years.
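The windowed contention itself can be sketched as a repeated matching over queue positions; the following Python fragment is an illustrative approximation of the scheme in [36], not the original algorithm (ω = 1 reduces to plain HoL contention):

import random

def window_select(queues, n_outputs, w):
    # One matching round with window depth w: inputs repeatedly contend for
    # idle outputs with packets up to position w - 1 in their queue.
    idle = set(range(n_outputs))
    granted = {}                        # input -> queue position of its winner
    for pos in range(w):                # pos 0 is the HoL packet
        contenders = {}
        for i, q in enumerate(queues):
            if i in granted or pos >= len(q):
                continue
            if q[pos] in idle:
                contenders.setdefault(q[pos], []).append(i)
        for out, inps in contenders.items():
            winner = random.choice(inps)    # random tie-break among inputs
            granted[winner] = pos
            idle.discard(out)
    return granted

print(window_select([[0, 1], [0, 2], [0]], n_outputs=3, w=2))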
Other variations:
Other solutions in [37, 19] involve running the switch fabric at higher speeds than the input and output lines, or using multiple switch fabrics in
parallel. These include the following:
1. Output-port expansion: Output-port expansion with a factor F requires an N x NF switch fabric, where each group of F physical output ports
corresponds to one logical output. This way, up to F packets can be delivered to an output in each cycle. This approach is conceptually
similar to output-channel grouping.
2. Input-port expansion: Input-port expansion with a factor F requires an N x NF switch fabric, where each group of F physical input ports
corresponds to one logical input. This allows F packets from each input to contend in each cycle. This differs from the windowing scheme in that
more than one packet can be selected from the same input.
3. Switch speed-up: If the switch operates at a speed F times faster than the links, F packets can be transported from a single input, and also F
packets can be delivered to an output in one cycle. (More conventional speed-up schemes allow multiple packets to be delivered to one output,
but allow only one packet per input to be transmitted [38].)
As multiple packets are being delivered to an output in one cycle, some degree of output queuing is also required, however. This was the
immediate motivation for the next architecture: output queueing.
2.3 Output Queueing
The alternative classic solution to input queueing is output queueing [22, 27], as in Fig. 2.3. Here, the buffering function is performed after the
routing function. The buffers are placed at the switch fabric outputs. Theoretically [22], output queuing offers maximum, ideal performance,
because
• HoL blocking does not exist, thus lifting the throughput limitation from which input queuing suffers,
• and contention is reduced to a minimum: only output contention occurs, which is unavoidable because of the statistical nature of
packet-switched traffic, but there is no input contention. This leads to lower average delays than input queuing, even at low
utilization.
It can be shown theoretically [22] that a non-blocking output-queuing switch having queues of infinite size offers the best performance in terms
of both throughput and delay. This type of switch is the only type that is truly work-conserving (in the sense of our definition of work-
conserving) under any traffic pattern.
Performance analysis model
In one packet cycle, up to N packets may arrive destined for the same output, so that N writes (N arriving packets) and one read (one
departing packet) are required. In [14] a tagged output queue was considered for analysis.
The number of packet arrivals, A, is represented as

a_k ≜ Pr[A = k] = (N choose k)·(p/N)^k·(1 − p/N)^(N−k),   k = 0, 1, 2, ..., N   (2.11)

Fig. 2.3: Output Queuing [22]
which for N = ∞ becomes

a_k ≜ Pr[A = k] = (p^k · e^(−p)) / k!,   k = 0, 1, 2, ...   (2.12)
With Q_m denoting the number of packets in the tagged queue at the end of the m-th time slot, A_m the number of packet arrivals during
the m-th time slot, and b the output buffer size,

Q_m = min{ max(0, Q_{m−1} + A_m − 1), b }

in which, when Q_{m−1} = 0 and A_m > 0, one of the arriving packets is immediately transmitted during the m-th time slot; that is, a packet flows through the
switch without suffering any delay.
For N → ∞ and b → ∞, the queue size Q_m was modeled as an M/D/1 queue.
For finite N and b, Q was modeled by a finite-state, discrete-time Markov chain with state transition probabilities P_ij ≜ Pr[Q_m = j | Q_{m−1} = i]
given by

P_ij = a_0 + a_1,                 for i = 0, j = 0
P_ij = a_{j+1},                   for i = 0, 1 ≤ j ≤ b − 1
P_ij = Σ_{m=b+1}^{N} a_m,         for i = 0, j = b
P_ij = a_{j−i+1},                 for 1 ≤ i ≤ b, i − 1 ≤ j ≤ b − 1
P_ij = Σ_{m=b−i+1}^{N} a_m,       for 1 ≤ i ≤ b, j = b
P_ij = 0,                         otherwise   (2.13)

where a_k is given by (2.11) and (2.12) for N < ∞ and N = ∞, respectively.
The steady-state queue size was obtained, and from it the normalized throughput and the loss rate at arrival rate p:

ρ_0 = 1 − q_0·a_0   (2.14)

where q_0 is the steady-state probability that the tagged queue is empty, and

Pr[packet loss] = 1 − ρ_0 / p   (2.15)
The mean waiting time, W, for a packet making it into an output FIFO, which equals the average queuing delay D of this output-queued switch
fabric with infinite-sized queues under Bernoulli traffic as N → ∞, is [22]:

D = W = p / (2(1 − p))   (2.16)

where p (0 ≤ p < 1) is the average input load.
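As a numerical check of this model, the following Python sketch (our own illustration) builds the arrival distribution of (2.11), the transition matrix of (2.13), and derives the steady-state throughput of (2.14) and the loss rate of (2.15) for one tagged output queue:

import numpy as np
from math import comb

def tagged_output_queue(N, p, b):
    # a_k of (2.11): binomial arrivals to the tagged output queue
    a = [comb(N, k) * (p / N) ** k * (1 - p / N) ** (N - k) for k in range(N + 1)]
    A = lambda k: a[k] if 0 <= k <= N else 0.0
    # transition matrix P_ij of (2.13)
    P = np.zeros((b + 1, b + 1))
    P[0, 0] = A(0) + A(1)                    # empty queue: 0 or 1 arrival
    for j in range(1, b):
        P[0, j] = A(j + 1)                   # one arrival cuts through
    P[0, b] = sum(A(m) for m in range(b + 1, N + 1))
    for i in range(1, b + 1):
        for j in range(i - 1, b):
            P[i, j] = A(j - i + 1)           # j = i + A_m - 1
        P[i, b] = sum(A(m) for m in range(b - i + 1, N + 1))
    # steady-state distribution: left eigenvector of P for eigenvalue 1
    w, v = np.linalg.eig(P.T)
    q = np.real(v[:, np.argmin(np.abs(w - 1))])
    q /= q.sum()
    rho = 1 - q[0] * A(0)                    # normalized throughput (2.14)
    return rho, 1 - rho / p                  # and Pr[packet loss] (2.15)

print(tagged_output_queue(N=16, p=0.9, b=8))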
Major Limitations of Output-queued architecture
(a) The major drawback was that both the speed of the internal interconnect and the memory bandwidth had to be N times higher than the
external line speed in order to be able to transport N packets to a single output queue in one cycle, whereas an input-queued switch suffered
no such limitation. This requirement made output queuing in general an expensive solution because here buffers with an access rate N times
faster than in an input-queued architecture are required, making the output-queued solution less suitable for very high-speed packet switches.
Additionally, it did not scale well to larger switches, as the speed-up factor increased with switch size [39].
(b) The bandwidth requirement on a single buffer was then proportional to both port speed and number of ports, because in one packet cycle N
packets may arrive destined for the same output, so that N writes (N arriving packets) and one read (one departing packet) would be required.
Thus, the bandwidth requirement equals (N+1)C per output queue, with C being the port speed and N the number of input ports, and the
aggregate bandwidth equals N(N+1)C, which is quadratic in the number of ports; therefore output queuing was inherently less scalable
to more and/or faster ports than input queuing [40].
Examples of output-queued switch architectures are the Buffered Crossbar, the Knockout, the Gauss, and the ATOM switch.
2.4 Shared Queuing
Both the input- and the output-queued architectures treated so far have assumed that each input or output queue has a fixed amount of buffer
space that is dedicated to this queue. This strategy may lead to suboptimum use of resources: for instance, in case some ports are heavily loaded
while others are mostly idle, the former may obviously benefit from temporarily getting more resources, while the latter may get by with less.
From this realization the concept of shared queuing was born [22].
This concept implies that buffer resources in a switch fabric are pooled together. All logical input or output queues or groups thereof can use
resources from a buffer pool until the pool is exhausted. The aim is to achieve better resource efficiency.
Although the sharing concept has proven effective especially in reducing packet-loss rates, it can lead to unfairness, as packets from one
particular input or destined to one particular output can monopolize the shared resource, thus actually causing performance degradation, as was
also pointed out in [36] and [37].
The following two sections provide some examples of both shared input and shared output queuing [20]. Practical implementations incorporating
the shared-queuing approach include the Starlite switch, the Vulcan switch, the ATLAS switch, the PRIZMA switch, the Swedish ATM switch
core [20, 41, 42, 43, 44, 45] and the Hitachi ATM switch [20, 46].
2.4.1 Shared Input queueing
In [47] an input-buffered scheme was introduced that groups sets of k inputs together, providing a shared buffer per group; see Fig. 2.4. The
operation principle is that in each cycle, up to k packets are selected from each group; the queuing discipline is no longer strictly FIFO, so that
HoL blocking is alleviated. The packets to be forwarded are selected using the following algorithm:
1. Assign the “idle” flag to all output ports;
2. Randomly choose a grouped queue;
3. From this grouped queue, choose up to k packets having differing, idle destinations in a queuing-age sequence. Mark the selected
outputs “occupied”.
4. Repeat step 3 for the next grouped queue, until either all grouped queues have been processed or all output ports are marked “occupied”.
This algorithm was very similar in operation to the Reservation with Preemption and Acknowledgment (RPA) algorithm described in [49] when k
= 1, except that RPA was explicitly based on virtual output queuing, whereas this is implicit in this approach: although the packets in the shared
buffer are not sorted by destination, packets to any destination can depart from any grouped queue at any time. Clearly this algorithm was not
very efficient: in the worst case, an entire grouped queue must be searched in each of a maximum of N iterations [50].
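The four selection steps can be sketched in a few lines of Python; this is an illustrative rendering of the algorithm above, with a random permutation of the groups standing in for steps 2 and 4:

import random

def grouped_select(groups, n_outputs, k):
    # groups: list of shared queues; each is a list of destinations in
    # queuing-age order (oldest first)
    idle = set(range(n_outputs))                              # step 1
    picks = []
    for g in random.sample(range(len(groups)), len(groups)):  # steps 2 and 4
        chosen, seen = [], set()
        for pos, dest in enumerate(groups[g]):                # step 3
            if dest in idle and dest not in seen:
                chosen.append((g, pos, dest))
                seen.add(dest)
                if len(chosen) == k:
                    break
        for _, _, dest in chosen:
            idle.discard(dest)                                # mark "occupied"
        picks.extend(chosen)
        if not idle:
            break
    return picks

print(grouped_select([[0, 1, 0], [2, 2, 3]], n_outputs=4, k=2))

The inner scan over an entire grouped queue is the inefficiency noted above: in the worst case every position is inspected in each of up to N iterations.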
Fig. 2.4: Grouped Input Queuing, from [22].
Simulation results for grouping factor k = 2, 4, and 8 showed that using this scheme, throughput drastically improved compared to FIFO input
queuing. The grouping effect, as also shown in [47], leads to much lower loss rates, in particular when considering bursty traffic.
The Starlite switch, developed at AT&T Bell Laboratories, is an example of implemented shared input queuing [22]. There is no dedicated buffer per input, but rather one buffer that stores all packets that could not be delivered
owing to output contention. Excess packets are dropped. The buffered packets are fed back into the switch (hence the name ‘shared recirculating
queue’) through M additional input ports, dedicated to serving packets being recirculated. This reduces the effective size of the switch because an
(N+M) x (N+M) size switch is required to realize an N x N switch, and may cause packets to be delivered out of sequence. The size of the
recirculation buffer and the number of input ports M dedicated to it can be tuned to meet specific packet loss requirements.
2.4.2 Shared Output queueing.
The shared output queueing structure is shown in Fig. 2.5. By sharing the available memory space among all outputs, better memory efficiency was
achieved, as demonstrated in [37]. The aggregate bandwidth through the shared memory was derived as 2NB, which is in fact
equal to the aggregate bandwidth requirement of the input-queued architecture, instead of N(N+1)B in the case of dedicated output buffers.
Fig. 2.5: Shared output queuing, logical structure [22].
2.5 Shared-memory implementations.
Although a shared-memory implementation reduced the aggregate memory bandwidth to 2NB compared with dedicated output queuing, the
shared memory itself was the greatest implementation challenge. The aggregate bandwidth requirement equals
that of an input-queued switch with identical throughput, but in the input-queued switch the memory was distributed over N inputs, which
typically were also separated physically (on separate chips and boards), whereas in the shared-memory switch the entire bandwidth is funneled
through a single memory. This was the main drawback of the shared-memory architecture. To obtain the desired bandwidth, several approaches
were possible, for example employing wide memories, parallel (interleaved) memory banks [50], pipelined memory access, or
combinations thereof, as illustrated by the following examples of such switches.
2.5.1 Sliding-Window (SW) Packet switch
In [50] it was shown that when optimal-performance, small shared-memory switches were interconnected to form a larger switch using the
traditional approach of multistage interconnection networks (MINs), the throughput of the switch suffered a sharp degradation.
A comparative performance evaluation of the sliding-window (SW) switching architecture under bursty traffic was presented in [50]; this architecture
overcame the restriction on scaling shared-memory packet switches to large sizes. Fig. 2.6 shows the memory module, while figure 2.7
shows the overall architecture of the SW switching system with decentralized pipeline control. The architecture consists of the following
independent stages: (B) the self-routing parameter assignment circuit; (C) the input interconnection network; (E) shared parallel memory modules
used for write and read of data packets; (F) the output interconnection network.
Fig. 2.6. Memory module with memory controller for the SW switch, from [50]
Input lines are denoted by I1, I2, ..., IN and the output lines are denoted by O1, O2, ..., ON. Input lines carry the incoming data packets and the
output lines carry the outgoing data packets after being switched to their output destinations by the SW switching system.
In the SW switch, memory modules (see figure 2.6) are independent and use their local memory controllers to perform WRITE and READ
operations for data packets based only on the information available locally. The SW switch provided a way to reduce the performance bottleneck
created by centralized controllers in typical shared-memory switch architecture.
The main objectives of the SW switch architecture were to:
Fig. 2.7. Schematic diagram of the SW-based switch architecture, from [50]
(a) Facilitate global sharing of physically separate memory modules among all the input and output ports of the switch to reduce packet
loss under bursty traffic conditions.
(b) Alleviate the need for a centralized memory controller.
(c) Partition the overall switching function into multiple independent stages.
(d) Allow multiple independent stages to make switching decisions based only on the information available locally.
(e) Operate multiple independent stages in a pipeline fashion in order to enhance packet switching speed.
(f) Provide maximum output utilization even when backlog occurs due to burstiness.
(g) Provide various memory sharing schemes for finite memory space deployed in the switch.
The PRIZMA switch, as a practical implementation of shared output queueing, was presented in [22]. This concept is represented logically in
Fig. 2.8, and is explained as follows: all packets are stored in a central, shared memory having a size of M packets, and the output queues only
contain addresses that point to the corresponding packets in the shared memory.
The data section is controlled by a control section to achieve the desired switching function. It consists of N output queues, one free address
queue, and demultiplexers to route incoming addresses to the correct output queues.
The free queue initially contains all shared-memory addresses. It provides every input with a memory address at which to store the next incoming packet.
When a packet arrives, the corresponding input stores it at the designated memory address and forwards the address to the control
section along with the packet’s header, which contains destination and priority information.
The control section routes the address to one or more output queues, according to its destination, where it is appended at the tail of the queue(s).
The input then requests a new address from the free queue, and will receive one, if present. As there are N parallel input routers, this process can
be executed in parallel for all inputs.
Each output queue removes the address at its head and uses this address to configure the corresponding output router, which will fetch the data
from the memory at the requested address.
Figure 2.8: Shared output queuing, as implemented in the PRIZMA switch, [22].
When completed, the address is returned to the free queue, to be used again for newly arriving packets. Because of the parallel output routers all
output queues can transmit simultaneously.
By means of output-queue thresholds, queues can be prevented from using up an unreasonable portion of the shared memory, which could lead to
performance degradation on other outputs.
In [22], the corresponding bandwidth analysis was presented as follows: given that the size of a pointer is log2 M bits, the aggregate bandwidth requirement
on the dedicated output queues that store the addresses equals N(N+1)B·(log2 M)/L, with L being the size (in bits) of one packet.
This separation of control and data paths relaxes the overall bandwidth requirement: for a typical packet
switch with N = 32, M = 1024 and L = 64 bytes = 512 bits, it means a bandwidth reduction by a factor of more than 50.
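The arithmetic behind this factor can be checked directly with the stated parameters:

from math import log2

N, M, L = 32, 1024, 64 * 8          # ports, shared-memory size, packet size in bits
addr_bw = N * (N + 1) * log2(M)     # bandwidth on the address (control) queues
data_bw = N * (N + 1) * L           # what dedicated packet queues would need
print(data_bw / addr_bw)            # -> 51.2, i.e. a factor of more than 50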
Additionally, multicast could easily be supported with this architecture [22]. An incoming multicast packet is stored only once in memory,
whereas its address is duplicated to each destination output queue. An occupancy counter associated with each memory address keeps track of
how many copies still have to be transmitted; this counter is initialized to the number of destinations (the multicast fan out). Once it reaches zero,
the address is freed and returned to the free queue.
Limitations: Despite the dedication of buffer space to each queue in the input-queued and output-queued architectures, there was no optimum use
of resources such as queue space [28].
2.6 Combined Input and Output Queuing (CIOQ) architecture.
The solutions to the limitations of pure input queuing and pure output queuing were proposed by several researchers, who arrived at hybrid solutions
from two opposite directions: either by realizing that adding output queues with a modest speed-up factor improved the performance of an input-
queued switch, or by realizing that adding input queues to an output-queued switch improved its packet-loss characteristics. The two options
gave rise to CIOQ.
A switch belongs to the class of combined input and output queuing switches if the queuing function is performed both before and after the
routing function [22]. An architecture of
CIOQ packet switch is illustrated in Fig. 2.9.
As Fig. 2.9 shows, a CIOQ switch consists of three main components: the input queues, the interconnecting routing fabric, and the output queues.
The Combined input- and output-queuing switch of Fig. 2.9 shows input speed-up Si and output speed-up So.
The aggregate input and output bandwidth of the routing fabric equals N(Si + So) times the link rate [22].
Accordingly, CIOQ switch architectures were classified based on the following criteria:
• Input queue organization: FIFO or non-FIFO.
• Internal speed-up factor S: full speed-up, S = N, or partial speed-up, 1 < S < N.
• Output buffer size: infinite or finite. The former is clearly not a practical configuration. In the latter case the output buffers
can be dedicated or shared.
Fig. 2.9. An architecture of a CIOQ switch [22].
• Furthermore, the fabric may be internally lossy or lossless (using some form of fabric-internal flow control, e.g. back-pressure).
The remainder of this section reviews the various CIOQ architectures proposed in [22].
2.6.1 CIOQ architectures with FIFO input queues
The case of FIFO input buffers and finite dedicated output buffers with full speed-up, using back-pressure to prevent packet loss at the output queues, was analyzed in
[25]; it was concluded that in such a configuration the best strategy was to share all available memory space among all
output queues, because this reduced the input-queue HoL blocking probability. Also in [52], the case of finite dedicated output buffers with
limited speed-up (S < N) was studied. It was concluded that for a speed-up larger than three, the performance-limiting factor was the HoL
blocking at the input queues. A CIOQ system with infinite output queues and a speed-up factor S under correlated traffic was analyzed in [27].
Upper and lower bounds on the maximum throughput were derived for the cases of uncorrelated and strongly correlated traffic, respectively. The
conclusions were that
(a) traffic correlation influenced maximum throughput negatively, but not to a large extent, and
(b) the upper and lower bounds converged quickly to 1 for S >1.
A distinction was made in [18] between queue loss (QL) and back-pressure (BP) modes of operation for CIOQ, where the former may lead to
packet loss at the output queues, whereas the latter prevents this by not allowing more packets to enter an output queue than the available space.
Therefore, in the QL case, excess packets are dropped at the output queue, whereas in BP mode they have to wait in the input queues. The CIOQ
was considered under limited input and output queues and a speed-up factor S. Based on the analytical model and corresponding simulations,
the following conclusions were drawn:
• For both modes, a larger speed-up S led to a higher maximum throughput. However, S > 4 brought little additional improvement.
• For a given output queue size, QL offered a higher maximum throughput than BP because less HoL blocking occurred in the input
queues.
• For the BP mode, maximum throughput improved when the output buffer size was increased.
• Naturally, this also led to a decrease in average delay.
• For the QL mode, on the other hand, increasing the output buffer did not improve maximum throughput, but led to slightly higher
average delay.
• A tradeoff between input and output buffer size existed when a total buffer budget was given to achieve the optimum packet-loss rate.
For QL, more buffer space must be allocated to the output, whereas for BP more input buffer space was required. For utilization ≤0.8,
BP required the least amount of total buffer space.
A proposal also exists in [53] for a CIOQ switch with limited speed-up S based on a Batcher-Banyan network with r parallel distributing
modules. The result of their analysis was that with speed-up S = 4 a performance close to output queuing could be achieved, assuming uniform,
uncorrelated arrivals.
An arbitrary speed-up S was analyzed in [54] for a CIOQ system with limited input and output queues, while adopting the queue-loss scheme. The
main focus was to study the system under non-uniform traffic, which was known to cause performance degradation [55]. In [56], packet-loss
figures under uniform traffic were derived, indicating that for a given output queue size, an optimum speed-up S exists that yields the lowest loss
rate. Increasing S further is not only more expensive implementation-wise, but will also increase the loss rate.
As in [56], an optimum distribution of a given buffer budget over input and output queues was demonstrated. Regarding non-uniform traffic, it was
concluded that the dominant non-uniformity was that of the output destination distribution. Based on these findings, an output buffer sharing
strategy was proposed that improved loss performance by allocating more buffer space to queues experiencing higher load.
All these results had been obtained assuming Si = 1 and So= S, i.e., the speed-up was implemented only at the output side.
2.6.2 CIOQ with finite output buffers and back-pressure.
Combined input-output-queued switch with finite output buffers and back pressure was analyzed in [59]. The architecture is as shown in figure
2.10.
The system consists of
• N output queues of size b each
• N input queues and
• an interconnection structure between the input queues and the output queues to route the packets, with speed-up S = min(b, N).
Fig. 2.10. An N × N CIOQ switch with finite output buffers [59].
When an output queue becomes full, this is instantaneously flagged to the input queues (back-pressure mode, as previously described) so that no
packets are lost owing to output-queue overflow. All queues, input and output, were always served in FIFO order. It was found that maximum
throughput did not depend on the selection policy, but only on the output buffer size and the packet-size distribution. The FCFS policy was shown to
give a lower bound on delay, whereas the LCFS policy led to an upper bound. All results in this section were derived in [59] under the
assumption of uniform, independent arrivals (Bernoulli traffic).
Results:
Table 2.1 lists numerical values for a range of b as derived in [59]. These results hold for large values of N. It was noted that the maximum
throughput is monotonically increasing in b; that is,

λ*_b < λ*_{b+1}   for all b ≥ 0   (2.17)

For b = 0, that is, without any output queuing, the throughput equals 0.5857, while the limit for b → ∞ equals 1, as had been established earlier in
[24].
The case where the available buffer space Nα was fully shared among all output queues was analyzed in [48]. Eq. (2.18) expresses the maximum
throughput λ*_α as a function of α, the number of buffers per output (λ*_b above being the maximum throughput for the dedicated,
non-shared, output-buffer case with b packets per output):

λ*_α = α + 2 − √(α² + 2α + 2)   (2.18)
It was concluded that the sharing of the available buffer space also improved throughput, because a packet will never be kept at the input FIFO
HoL as long as there is at least one buffer available in the shared memory.
The limitation: For bursty or asymmetric arrivals, the result of (2.18) will no longer hold [24].
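Assuming the reconstruction of (2.18) above, a quick tabulation shows how sharing lifts the maximum throughput from 0.586 at α = 0 towards 1 as α grows:

from math import sqrt

def max_throughput_shared(alpha):
    # (2.18), as reconstructed above: maximum throughput with N*alpha
    # fully shared buffers (alpha buffers per output on average)
    return alpha + 2 - sqrt(alpha * alpha + 2 * alpha + 2)

for alpha in (0, 1, 2, 4, 8):
    print(alpha, round(max_throughput_shared(alpha), 4))
# alpha = 0 recovers 2 - sqrt(2) = 0.5858; the value tends to 1 as alpha grows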
2.7. Virtual Output Queued (VOQ) architecture
In the preceding sections, various algorithms and implementations of switch architectures that have attempted to improve on the drawbacks of pure input
queueing (IQ) and pure output queueing (OQ) were presented. These approaches all suffered from either HoL blocking in FIFO input queues or
unfairness in the sharing of memory in the case of the output-queued architecture. Their performance was improved upon by
organizing the input buffers according to outputs [60-67]. This arrangement led to the virtual output queueing (VOQ) architecture shown in
Fig. 2.11.
In the VOQ architecture, separate queues are maintained at the input side for each output line. These separate queues are called VOQs. The
routing fabric is still the simple crossbar. In each cycle, the scheduling unit collects the requests, executes some algorithm to decide which VOQs
may forward a packet, returns the grants and configures the crossbar. The granted VOQs remove their HoL packets and forward them to the
outputs.
Both conventional input-queued switches and VOQ switches based on crossbars can only transmit one packet from each input and receive one
packet at each output in one cycle (Si = So = 1). The key difference between the conventional IQ and VOQ is that, whereas the former has only
one packet per input eligible for transmission, an input with VOQ can dispatch as many as N requests (one per output) [64,68-78]. The scheduler
has to solve the problem of matching the inputs with the outputs in an optimal and fair manner.
2.7.1 The Analysis of VOQ architecture
Single Queue Model of each VOQ:
Fig. 2.11. An architecture of VOQ packet-switch [66]
In [66, 69] each VOQ is modeled as a buffer that receives packet arrivals characterized by ON and OFF states. A discrete-time, two-state
Markov chain generating arrivals, modeled as an ON-OFF source, was used, as depicted in Fig. 2.12. The parameters p and q denote the
probabilities that, in a given slot, the Markov chain remains in states ON and OFF, respectively.
In the OFF state, no packet is generated (λ_0 = 0), whereas in the ON state one packet is generated at the rate of one per cycle (λ_1 = 1). While in
the ON state, the packet stream is divided into consecutive bursts; all packets in one burst have the same destination.
Pr[ON lasts r slots] = (1 − p)·p^(r−1),   r ≥ 1   (2.19)

Pr[OFF lasts j slots] = (1 − q)·q^j,   j ≥ 0   (2.20)

Generally, the inter-arrival time distribution f_n is shown [66] as

f_n = p,                          n = 1
f_n = (1 − p)(1 − q)·q^(n−2),     n > 1   (2.21)
That is, the probability of two consecutive arrivals, f_1 = p, is identical to the probability that following an arrival the Markov chain will remain in
state ON. Similarly, f_2 is the probability that following an arrival, the chain transitions to the OFF state and then returns to the ON state. For n > 2,
it is apparent that following a transition from the ON state to the OFF state, there are n − 2 time slots [67,78] during which the chain remains in the
OFF state before returning to state ON.

Fig. 2.12. One-slot Markov chain for arrival process to a VOQ [66]
It was shown in [22][66] that, given (2.19), (2.20), (2.21) and the state probabilities P_0 and P_1 of the OFF and ON states respectively, the mean arrival rate is

λ_a = P_0·λ_0 + P_1·λ_1   (2.22)

    = p_01 / (p_01 + p_10)   (2.23)

    = (1 − q) / (2 − q − p)   (2.24)

where p_01 = 1 − q and p_10 = 1 − p are the OFF-to-ON and ON-to-OFF transition probabilities. And the mean burst length, B, is

B = 1 / (1 − p)   (2.25)
And the offered load for every VOQ is [66,68]

ρ = λ_a / µ = (1 − q) / (µ(2 − q − p))   (2.26)
It was shown in [66] that the service rate, µ, to each queue at the input is

µ = (µ_o² / (N·λ_a)) · [1 − (1 − λ_a/µ_o)^N]   (2.27)

where µ_o is the service rate to the entire VOQ block.
If there are N outputs, each input line comprises N separate VOQs; each VOQ block is seen as a separate queue for each of the N outputs.
At stability the arrival rate, λ, is equal to the service rate to the entire VOQ; thus it is shown in [66,68,72] that

λ = µ[1 − π_o^N(1 − λ)]   (2.28)

and the mean arrival rate, λ, is

λ = (1 − q) / (2 − q − p)   (2.29)

where π_o is the probability of having an empty queue and N is the number of ON states.
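The arrival model of (2.19)-(2.25) is easy to verify empirically; the following Python sketch (our illustration) generates an ON-OFF trace per Fig. 2.12 and compares the measured rate with (2.24):

import random

def on_off_source(p, q, n_slots, seed=1):
    # Discrete-time ON-OFF source of Fig. 2.12: remain ON w.p. p, remain OFF
    # w.p. q; one packet is emitted in every ON slot.
    rng = random.Random(seed)
    state, trace = "OFF", []
    for _ in range(n_slots):
        if state == "ON":
            trace.append(1)
            state = "ON" if rng.random() < p else "OFF"
        else:
            trace.append(0)
            state = "OFF" if rng.random() < q else "ON"
    return trace

trace = on_off_source(p=0.6, q=0.7, n_slots=100000)
print(sum(trace) / len(trace))   # ~ (1-q)/(2-q-p) = 0.3/0.7 = 0.4286 per (2.24)
# mean burst length should be about 1/(1-p) = 2.5 per (2.25)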
Mean Queue size
In [68] the buffer occupancy was calculated as

ρ_u = λ_a(1 − λ_a) / (µ − λ_a)   (2.30)

Therefore, it was shown in [66] that the mean queue size is

Q̄ = λ_a / (µ(1 − ρ))   (2.31)
Throughput per input port
The throughput per input port is the total number of packets in the N queues sent out with request generation rate σ [68] plus a successful service
rate µ. From [66] the throughput is

T = N·σ·µ   (2.32)

with

σ = λ_a + ρ + λ_a·ρ   (2.33)
Mean Queue delay
Using Little’s theorem, and with B denoting the queue capacity, the mean VOQ queue delay from [66, 68, 77, 78] is

D = [1 − (B + 1)ρ^B + B·ρ^(B+1)] / [µ(1 − ρ)(1 − ρ^B)]   (2.34)

which reduces to 1/(µ(1 − ρ)) as B → ∞. This closed-form expression for the VOQ queue delay agrees with a synchronous Geo/Geo/1/B queue with service rate given by a PIM scheduler
[64,71].
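Assuming the closed form of (2.34) as reconstructed above, its behaviour for the buffer sizes used later in this work can be tabulated directly:

def voq_delay(rho, mu, B):
    # (2.34): mean delay of the Geo/Geo/1/B model of one VOQ
    num = 1 - (B + 1) * rho ** B + B * rho ** (B + 1)
    den = mu * (1 - rho) * (1 - rho ** B)
    return num / den

for B in (5, 10, 20):
    print(B, round(voq_delay(rho=0.8, mu=1.0, B=B), 3))
# the delay grows with B towards the infinite-buffer limit 1/(mu*(1-rho)) = 5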
CHAPTER 3
MODELS AND SIMULATION OF VOQ PACKET-SWITCH
3.0 Introduction:
This chapter presents the architecture and model of VOQ-switch upon which simulation and the consequent analysis are based. The key
components of the MATLAB Simulink VOQ model are also presented.
3.1 VOQ Architecture:
Fig. 3.1 shows the architecture of the VOQ. At each input, a separate queue is maintained for each output. The architecture assumes the use of
parallel iterative matching (PIM) scheduling algorithm under i.i.d Bernoulli traffic [68, 69, 70, 71-80].
The objective here is to model a switch that,
• Connects multiple independent data sources to three different and independent destinations.
• Holds arriving packets in a buffer (a queue) for each of the data sources.
• Randomly resolves contention if two or more simultaneous packets at the head of their respective queues share the same intended
destination, with no bias to any particular source of packets.
3.2. The model of the VOQs.
The model of the VOQ is shown in Fig. 3.2. This model is based on the 3x3 input-queued, crossbar-based fabric packet switch architecture
shown in Fig. 3.1.
Fig. 3.1. The architecture of the 3 x 3 VOQ packet-switch
3.2.1 Traffic model
In practical terms, it is very difficult to characterize theoretically the exact nature of packet arrivals, because of the presence of diverse traffic
classes in the high-speed networks in which packet switches operate. Thus most research on packet switching considers processes that make
the analysis tractable while remaining a reasonable approximation to real-life scenarios. Geometrically distributed durations of occurrence are one such choice
[64, 66, 67, 73]. A link slot is different from a switch slot [66]: a packet is wholly
available to the switch only when the last bit of that packet has arrived at the input port, so it takes one whole link slot, or r switch slots,
for all the bits of a packet to arrive. This agrees with burstiness in the traffic [22]. Therefore, there must be (r − 1) idle switch slots for one arrival
in a link slot. An idle time interval is OFF, while an active time interval is ON; such a switch-slot arrival process leads to the Markov-chain VOQ
arrival process of [66, 68]. Each input in the on-off traffic model alternates between active and idle periods of geometrically distributed
durations, as shown in figure 3.3.

Fig. 3.2. Model of the 3 x 3 VOQ packet-switch
During an active period τon, packets destined for the same output arrive continuously in consecutive switch slots [68].
3.3 The VOQ simulation model.
Fig. 3.4 shows the performance simulation model of the VOQ using MATLAB Simulink event-based and time-based blocks.
Fig. 3.3 On-Off model of packet-input at the VOQ input
Solid arrows from left to right represent data flow; arrows from right to left depict flow control. The key components depicted in
figure 3.4 are, from left to right, the following:
3.3.1 Traffic source.
The Time-Based Entity Generator, G, generates the packets at geometrically distributed time intervals. An Event-Based
Random Number generator block, connected to the input port of the Time-Based Entity Generator, determines the intergeneration time
of each entity (packet).
The arrival rate:
The arrival rate, λo, is inversely proportional to the intergeneration time of a packet; it was simulated here as the inverse of the intergeneration time. The
parameter of the Random Number generator is set to ‘geometric’ with a variable probability between 0 and 1. Intergeneration time is
measured at the ‘w’ port of the Time-Based Entity Generator. The number of packets generated by G is measured at the ‘#d’ port of the block.
Entity Scope and Signal Scope blocks were used to monitor the arrival patterns of packets.
3.3.2 VOQ destination distribution process.
Fig. 3.4 Performance simulation model of 3x3 VOQ packet-switch.
For each packet generated by the time-based generator, the Set Attribute block stamps, at random, the destination of the packet: a number
chosen among 1, 2 and 3, signifying the output port selected in the first stage of the switching. The sorting instance
is also time-stamped by the Timer block. The FIFO Queue block stores the packets that the switch cannot route immediately. The queue
size of each VOQ is varied by the ‘Capacity’ parameter on the queue block.
The group of VOQs is sent to the Entity Path Combiner for onward transfer to the next stage of the switching, called here the ‘switch’.
3.3.3 The Single Server process
The Single Server block processes, one at a time, the packets accepted in each time slot of the switch from the input queues or
transmitted from the output queue to the SimEvents Sink (destination). While a packet is in this block, other packets wait in the FIFO
Queue block.
The Start Timer and Read Timer blocks worked together to compute the time that each packet (entity) spent in the queue and server. The total
time was the transition time of the packet through the switch from input to the output. The result of the computation was read from the Display
block. From this computation the average delay was measured.
3.3.4 Output switch.
The Output Switch block of the Simulink model represents the output ports of the switch fabric. The destination ‘address’ set on each packet by the ‘Set Attribute’ block
is read by the ‘Get Attribute’ block, which reassembles all packets for their respective destinations according
to the destination stamps 1, 2, or 3. Note that the Infinite Server block at the output models the latency of the switch system by delaying each
data-carrying entity according to the Service time setting of the output link. The service time was set to 1, and the output queue was set to
infinity.
3.3.5 The destinations.
The Entity Sink block absorbs packets that have completed their processing in the switch fabric. It serves as the output link, or destinations.
3.4. The QoS metrics.
The QoS metrics considered are delay, throughput and packet loss rate with respect to the load from G.
Throughput: This is measured as the ratio of the number of packets received at the output ports of the last stage of switching to the number input
from the sources.
Delay: The average packet delay is measured by the Timer at the timer-tag parameter point ‘w’. The timers are connected to the servers at the
arrival point and prior to the destinations.
Loss rate: The difference between the number of packets sent and the number received is measured by the ‘Math Operations’ block. The ratio
of this difference to the total number of packets sent is the loss rate, averaged over each simulation session.
The load was represented as λo. The measurements were taken using Math Operations blocks. All events occur at discrete time-slot
intervals [66, 68, 80], in which at most a single arrival and a single service event may occur.
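Although the measurements in this work were taken in Simulink, the same slot-level logic can be sketched in ordinary Python for comparison; the following is a greatly simplified illustration (Bernoulli arrivals with uniform destinations and a single-iteration random matching), not the Simulink model itself:

import random

def simulate_voq(n=3, capacity=5, load=0.4, slots=20000, seed=7):
    # voq[i][j] holds the arrival times of packets at input i bound for output j
    rng = random.Random(seed)
    voq = [[[] for _ in range(n)] for _ in range(n)]
    arrived = sent = lost = 0
    total_delay = 0.0
    for t in range(slots):
        for i in range(n):                       # arrival phase
            if rng.random() < load:
                arrived += 1
                j = rng.randrange(n)
                if len(voq[i][j]) < capacity:
                    voq[i][j].append(t)
                else:
                    lost += 1                    # VOQ overflow
        used = set()                             # inputs already matched
        for j in rng.sample(range(n), n):        # service phase: one matching,
            heads = [i for i in range(n) if voq[i][j] and i not in used]
            if heads:                            # random contention resolution
                i = rng.choice(heads)
                used.add(i)
                total_delay += t - voq[i][j].pop(0) + 1
                sent += 1
    return (sent / max(arrived, 1), lost / max(arrived, 1),
            total_delay / max(sent, 1))

print(simulate_voq())   # -> (throughput, loss rate, mean delay in slots)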
CHAPTER 4
SIMULATIONS AND RESULTS ANALYSIS
4.0 Introduction
In this chapter, the conditions under which the QoS parameters were determined for the analysis of the performance are presented. Sessions
of simulation in MATLAB Simulink were carried out to determine the responses from which the results were obtained.
4.1 The traffic pattern
Appendix A1 shows the block diagram of the simulation environment. Figure 4.1 shows the arrival pattern at the queue. This result shows
instantaneous jumps at simulation times between 0 and 1000, suggesting the initial stage, when the three independent sources
try to gain access to the switch at once. The interarrival times within this period could not fit the exponential distribution. From a simulation time
of 1000 and above, the switching has stabilized and the arrival rate and service rate are almost equal; the average queue length has become
steady, neither increasing nor decreasing.

Fig. 4.1. Queue length vs simulation time per VOQ
Fig. 4.2. Stabilized Arrival patterns of packets
Figure 4.2 shows the packet arrival pattern by destination, for 10000 simulation runs. The figure shows
evidence of a continuous flow of packets into the system, because there is no gap of zero packets in the graph. The maximum rate was 8
packets per unit time, while the minimum was 1 packet per unit time. This minimum is consistent with the fact that the service rate was 1 packet per unit
time, which explains the concentration of the patterns at the level of 1.
4.2 Simulation results
For different VOQ sizes of 5, 10 and 20 and the same arrival rates, λo, results are shown with respect to average delay, loss, average queue length
and throughput. The service rate was fixed at one packet per unit time at the output and for the server of each VOQ.

Queue length: The pattern of queue length per VOQ is shown in Fig. 4.1. The queue grew until it saturated at a maximum average of 4.153 at
λ = 0.3526, at the same total service rate of one packet per simulation time.

Table 4.1 Performance at VOQ capacity = 5

λo (/sec) | Input (packets) | Output (packets) | Packet loss | Av. Queue Length | Throughput T | Delay (sec.)
0.2253 | 6759  | 6759  | 0   | 0.0150 | 0.99 | 1.766
0.3319 | 9957  | 9957  | 0   | 0.0460 | 0.99 | 2.060
0.3892 | 11658 | 11643 | 15  | 0.1050 | 0.99 | 2.953
0.4347 | 13029 | 13024 | 05  | 0.1349 | 0.99 | 3.091
0.5910 | 17718 | 17714 | 04  | 0.5716 | 0.99 | 5.439
0.9513 | 28524 | 28436 | 12  | 2.9730 | 0.99 | 12.100

Table 4.2 Performance at VOQ capacity = 10

λo (/sec) | Input (packets) | Output (packets) | Packet loss | Av. Queue Length | Throughput T | Delay (sec.)
0.2253 | 6759  | 6759  | 0   | 0.0150 | 0.99 | 1.766
0.3319 | 9957  | 9957  | 0   | 0.0482 | 0.99 | 2.030
0.3892 | 9987  | 9958  | 29  | 0.0977 | 0.99 | 2.652
0.4347 | 13050 | 13035 | 15  | 0.0950 | 0.99 | 2.397
0.5910 | 17961 | 17958 | 13  | 0.5989 | 0.99 | 3.833
0.9513 | 29457 | 29225 | 202 | 6.6870 | 0.99 | 23.300

Table 4.3 Performance at VOQ capacity = 20

λo (/sec) | Input (packets) | Output (packets) | Packet loss | Av. Queue Length | Throughput T | Delay (sec.)
0.2253 | 6759  | 6759  | 0   | 0.0150 | 0.99 | 1.766
0.3319 | 9957  | 9957  | 0   | 0.0482 | 0.99 | 2.062
0.3892 | 11667 | 11667 | 29  | 0.0871 | 0.99 | 2.346
0.4347 | 17982 | 17969 | 13  | 0.5469 | 0.99 | 4.601
0.5910 | 29871 | 29405 | 466 | 15     | 0.98 | 8.34
0.9513 | 30145 | 29645 | 500 | 18     | 0.98 | 28.37
Throughput
Figure 4.3 is the plot of throughput as a function of load for the three buffer sizes, Q = 5, 10 and 20. Throughput is the ratio of output packets to
input packets. It is to be noted that all the buffer schemes gave their best performance at small loads, between 0.2
and 0.45. Throughput reduced slightly for Q = 10 and 20 at loads between 0.45 and 0.6.
The best throughput performance was exhibited when the buffer size was 5: from the smallest load to the heaviest, throughput remained
constant at 0.99. However, a load of 1 packet/sec and above could easily lead to oversubscription, because the arrival rate would then definitely be
higher than the service rate.
The worst buffer-size scheme was Q = 10 at high loads above 0.5.

Fig 4.3 Plot of throughput vs load for Q=5, 10, 20

Delay analysis
The performance in terms of delay is plotted in figure 4.4. The delays were all the same at small loads for the different buffer sizes.
Common to all the buffer schemes is the fact that as the load increased, the delay also increased. However, the delay variation was highest for Q = 20. This
behavior at Q = 20 was expected, since early-arriving packets find space to queue while waiting for service, and the longer the line, the
longer the waiting time, the service being FIFO.
Fig. 4.4 Delay vs load for buffer size Q=5, 10, 20
Average queue length
Figure 4.5 shows the plot of the queue length for buffer sizes Q = 5, 10 and 20.
Average queue length was the same for Q = 5 and Q = 10 at small loads between 0.2 and 0.6.
It can be observed that in the load range of 40%-60% the queue length at Q = 20 was too large to be acceptable for a configuration, because of the
rate at which the maximum queue length was attained. For example, at Q = 5 the maximum average queue length was close to 5, and at Q = 10 the maximum
queue length approached the capacity of 10. It can therefore be inferred that as the load became higher, under the same round-robin
(RR) service discipline at the output, packets built up faster. This caused the abrupt jump in the curve.
Fig. 4.5 Plot of average queue length vs load for Q=5, 10, 20
CHAPTER 5
CONCLUSION AND RECOMMENDATIONS
5.0 Summary
From the simulation results in chapter 4, the throughput of 0.99 confirms the efficiency of VOQ, which gives better performance than
the pure input-queued (IQ) or pure output-queued (OQ) architectures under different loads.
Furthermore, these results suggest that a fixed buffer allocation above 20 cannot be optimum under all traffic conditions. Therefore, an
opportunity exists to dynamically adjust the buffer allocation between 5 and 20 according to current traffic patterns, for minimum delay, low packet
loss and good throughput.
Having identified the pros and cons of the various existing packet-switch architectures, the fundamental problem at hand is
understood: achieving the best performance with minimum tradeoffs among the QoS metrics.
5.1 Conclusion
The VOQ systems differ fundamentally from the established packet-switch architectures: conventional IQ, OQ or CIOQ systems employing
FIFO input queues. This analysis showed that a throughput of close to 100% is achievable under all the traffic conditions considered.
Since the input queue is the primary consideration in this work, it is clear from the results that it is not possible under VOQ to achieve a lower
delay than other architectures beyond a buffer space of 20 at each VOQ buffer.
This work was limited to the architectures of the packet switch, so the speed of the switch fabric was not considered.
The selection algorithm for each queue was also not the focus of this work.
5.2 Recommendations
The results of this analysis could be a useful reference for making decisions towards achieving the best practicable performance with the VOQ
packet-switch.
As the demand for switch capacity continues to grow, future switches invariably will require more ports, higher data rates per port and more
sophisticated QoS support. Thus a fixed buffer allocation as suggested by the simulation results, in this dissertation, cannot be optimum under all
conditions. Using VOQ architecture as the building, continued effort is required to sustain the prospects of VOQ packet-switch. Further work on
this analysis could be done in case of impact of selection algorithm on VOQ architecture especially when connected in tandem in communication
network.
REFERENCES
[1] Forouzan B.A, Data Communication and Networking, Tata McGraw-Hill Publishing Company Limited, New Delhi, (2007), pp 775.
[2] Held G., Data Communications Networking Devices: Operation, utilization and LAN and WAN Internetworking, John Wiley & Sons,
Ltd. (2000), pp. 185-845.
[3] L. Kleinrock, “On Resource Sharing in a Distributed Communication Environment,” IEEE Commun. Mag., vol. 17, no. 1, Jan. 1979, pp. 27-34.
[4] Leon-Garcia, A. and I. Widjaja, Communication Networks: Fundamental Concepts and Key Architectures, McGraw-Hill (2000), USA.
[5] Folts H, OSI Workbook. Vienna, VA: OMNICOM, Inc., (1983).
[6] Pattavina, A., “Nonblocking Architectures for ATM Switching,” IEEE Commun. Mag., Feb. 1993, pp. 38-48.
[7] Dunlop J. and Smith D. G., Telecommunications Engineering, Stanley Thornes (Publishers Ltd), United Kingdom,(1998), pp 360-370.
[8] K. Yoshigoe and K.J. Christensen, “An Evolution to Crossbar Switches with Virtual Output Queueing and buffered Cross points,” IEEE
network, vol. 17, No. 5, Sept./Oct. 2003.
[9] William Stallings Data and Computer Communications Macmillan Publishing Company (1989), London.
[10] Kumar et al, Communication Networking, Analytical Approach, Morgan Kaufmann Publishers (Elsevier), (2005).
[11] Freeman, R., Telecommunication System Engineering, Wiley, New York, (1980).
[12] AT & T Bell Laboratories. Engineering and Operations in the Bell System, Second Edition, 1983.
[13] Kennedy G. and Davis B., Electronic Communication Systems, Macmillan/McGraw-Hill, (1992), Lake Forest.
[14] Hluchyj, M.G. and M.J. Karol, “Queueing in High-Performance Packet Switching,” IEEE J. Sel. Areas Commun., vol. 6, no. 9, Dec.
1988, pp. 1587-1597.
[15] Rathgeb, E.P., T.H. Theimer, M.N. Huber, “Buffering Concepts for ATM Switching Networks,” in Proc. GLOBECOM ’88, Hollywood,
FL, Nov. 1988, pp. 1277-1281.
[16] Ahmadi, H. and W.E. Denzel, “A Survey of Modern High-Performance Switching Techniques,” IEEE J. Sel. Areas Commun., vol. 7, no.
7, Sep. 1989, pp. 1091-1103.
[17] Tobagi, F., “Fast Packet Switch Architectures for Broadband Integrated Services Digital Networks,” in Proc. IEEE, vol. 78, no. 1, Jan.
1990, pp. 133-167.
[18] Ifiok O., Communication Engineering Principles, Palgrave (2001), New York.
[19] Awdeh, R.Y. and H.T. Mouftah, “Survey of ATM Switch Architectures,” Computer Networks and ISDN Systems, vol. 27 (1995), pp.
1567-1613.
[20] Minkenberg, C., T. Engbersen and M. Colmant, “A Robust Switch Architecture for Bursty Traffic,” in Proc. Int. Zurich Seminar on
Broadband Commun. IZS 2000, Zurich, Switzerland, Feb. 2000, pp. 207-214.
[21] Patel, J.K., “Performance of Processor-Memory Interconnections for Multiprocessors,” IEEE Trans. Computing, vol. 30, no. 10, Oct.
1981, pp. 771-780.
[22] Minkenberg, C., “On Packet Switch Design,” PhD thesis, Eindhoven University of Technology, 2001.
[23] Hui, J.Y. and E. Arthurs, “A Broadband Packet Switch for Integrated Transport,”IEEE J. Sel. Areas Commun., vol. 5, no. 8, Oct. 1987,
pp. 1264-1273.
[24] Karol, M.J., M.G. Hluchyj and S.P. Morgan, “Input vs Output Queueing on a Space-division Packet Switch,” IEEE Trans. Commun.,
vol. 35, no. 12, 1987, pp. 1347-1356.
[25] Iliadis, I. and W.E. Denzel, “Performance of Packet Switches with Input and Output Queueing,” in Proc. ICC ’90, Apr. 1990, pp. 747-
753.
[26] Hui, J.Y. and E. Arthurs, “A Broadband Packet Switch for Integrated Transport,” IEEE J. Sel. Areas Commun., vol. 5, no. 8, Oct. 1987,
pp. 1264-1273.
[27] Li, S.-Q., “Performance of a Nonblocking Space-Division Packet Switch with Correlated Input Traffic,” IEEE Trans. Commun., vol. 40,
no. 1, Jan. 1992, pp. 97-108.
[28] Cao, X.-R., “The Maximum Throughput of a Nonblocking Space-Division-Packet Switch with Correlated Destinations,” IEEE Trans.
Commun., vol. 43, no. 5, May 1995, pp. 1898-1901.
[29] Souza, R.J., P.G. Krishnakumar, C.M. Özveren, R.J. Simcoe, B.A. Spinney, R.E. Thomas and R.J. Walsh, “GIGASwitch System: A
High-performance Packet-switching Platform,” Digital Technical J., vol. 6, no. 1, 1994, pp. 9-22.
[30] Pattavina, A., “Multichannel Bandwidth Allocation in a Broadband Packet Switch,” IEEE J. Sel. Areas Commun., vol. 6, no. 9, Dec.
1988, pp. 1489-1499.
[31] Obara, H. and T. Yasushi, “An efficient contention resolution algorithm for input queueing ATM cross-connect switches,” Int. J. Digital
and Analog Cabled Syst., vol. 2, Dec. 1989, pp. 261-272.
[32] Matsunaga, M. and H. Uematsa, “A 1.5 Gb/s 8x8 Cross-Connect Switch Using a Time Reservation Algorithm,” IEEE J. Sel. Areas
Commun., vol. 9, no. 8, Oct. 1991, pp. 1308-1317.
[33] Obara, H., “Optimum architecture for input queueing ATM switches,” IEE Electron. Lett., 28th Mar. 1991, pp. 555-557.
[34] Obara, H., S. Okamoto and Y. Hamazumi, “Input and Output Queueing ATM Switch Architecture with Spatial and Temporal Slot
Reservation Control,” IEE Electron. Lett., Jan. 1992, pp. 22-24.
[35] Obara, H. and Y. Hamazumi, “Parallel Contention Resolution Control for Input Queueing ATM Switches,” IEE Electron. Lett., vol. 28,
no. 9, Apr. 1992, pp. 838-839.
[36] Hluchyj, M.G. and M.J. Karol, “Queueing in High-Performance Packet Switching,” IEEE J. Sel. Areas Commun., vol. 6, no. 9, Dec.
1988, pp. 1587-1597.
[37] Liew, S.C., “Performance of Various Input-buffered and Output-buffered ATM Switch Design Principles under Bursty Traffic:
Simulation Study,” IEEE Trans. Commun., vol. 42 no. 2/3/4, Feb/Mar/Apr 1994, pp. 1371-1379.
[38] Oie, Y., M. Murata, K. Kubota and H. Miyahara, “Effect of Speedup in Nonblocking Packet Switch,” in Proc. ICC ’89, Jun. 1989, pp.
410-414.
[39] Gupta, A.K., L.O. Barbosa and N.D. Georganas, “A 16 x 16 Limited Intermediate Buffer Switch Module for ATM Networks,” in Proc.
IEEE GLOBECOM ’91, Phoenix, AZ, Dec. 1991, pp. 939-943.
[40] Del Re, E. and R. Fantacci, “Performance Evaluation of Input and Output Queueing Techniques in ATM Switching Systems,” IEEE
Trans. Commun., vol. 41, no. 10, Oct. 1993.
[41] Yeh, Y., M.G. Hluchyj and A.S. Acampora, “The Knockout Switch: A Simple, Modular Architecture for High-Performance Packet
Switching,” IEEE J. Sel. Areas Commun., vol. 5, no. 8, Oct. 1987.
[42] Yoon, H., M.T. Liu, K.Y. Lee and Y.M. Kim, “The Knockout Switch Under Nonuniform Traffic,” IEEE Trans. Commun., vol. 43, no. 6,
Jun. 1995, pp. 2149-2156.
[43] Eng, K.Y., “A Photonic Knockout Switch for High-Speed Packet Networks,” IEEE J. Sel. Areas Commun., vol. 6, no. 7, Aug. 1988, pp.
1107-1116.
[44] Suzuki, H., H. Nagano and T. Suzuki, “Output-buffer Switch Architecture for Asynchronous Transfer Mode,” in Proc. ICC ’89, Boston,
MA, Jun. 1989, pp. 99-103.
[45] Andersson, P. and C. Svensson, “A VLSI Architecture for an 80 Gb/s ATM Switch Core,” in Proc. 8th Annual IEEE Int’l Conf.
Innovative Systems in Silicon, Austin, TX, Oct. 9-11, 1996.
[46] Kozaki, T., N. Endo, Y. Sakurai, O. Matsubara, M. Mizukami and K. Asano, “32x32 Shared Buffer Type ATM Switch VLSI’s for B-
ISDN’s,” IEEE J. Sel. Areas Commun., vol. 9, no. 8, Oct. 1991, pp. 1239-1247.
[47] Tao, Z. and S. Cheng, “A New Way to Share Buffer - Grouped Input Queueing in ATM Switching,” in Proc. IEEE GLOBECOM ’94,
vol. 1, pp. 475-479.
[48] Iliadis, I., “Performance of a Packet Switch with Shared Buffer and Input Queueing,” in Proc. Teletraffic and Datatraffic in a Period of
Change, ITC-13, 1991, pp. 911-916.
[49] Ajmone Marsan, M.G., A. Bianco and E. Leonardi, “RPA: A Simple, Efficient, and Flexible Policy for Input Buffered ATM Switches,”
IEEE Commun. Letters, vol. 1, no. 3, May 1997, pp. 83-86.
[50] Kumar, S., “The Sliding-Window Packet Switch: A New Class of Packet Switch with Plural Memory Modules and Decentralized Control,”
IEEE J. Sel. Areas Commun., vol. 21, no. 4, May 2003, pp. 656-673.
[51] Denzel, W.E., A.P.J. Engbersen and I. Iliadis, “A Flexible Shared-Buffer Switch for ATM at Gb/s Rates,” Computer Networks and ISDN
Systems, vol. 27, no. 4, Jan. 1995, pp. 611-624.
[52] Gupta, A.K. and N.D. Georganas, “Analysis of a Packet Switch with Input and Output Buffers and Speed Constraints,” in Proc. IEEE
INFOCOM ’91, Bal Harbour, FL, Apr. 1991, pp. 694-700.
[53] Chang, C.-Y., A.J. Paulraj and T. Kailath, “A Broadband Packet Switch Architecture with Input and Output Queueing,” in Proc. IEEE
GLOBECOM ’94, pp. 448-452.
[54] Lee, M.J. and D.S. Ahn, “Cell Loss Analysis and Design Trade-Offs of Nonblocking ATM Switches with Non-uniform Traffic,”
IEEE/ACM Trans. Netw., vol. 3, no. 2, Apr. 1995, pp. 199-210.
[55] Li, S.Q., “Non-uniform Traffic Analysis on a Nonblocking Space-Division Packet Switch,” IEEE Trans. Commun., vol. 38, Jul. 1990,
pp. 21-31.
[56] Chen, J.S.-C. and R. Guerin, “Performance Study of an Input Queueing Packet Switch with Two Priority Classes,” IEEE Trans.
Commun., vol. 39, no. 1, Jan. 1991, pp. 117-126.
[57] Giacopelli, J., J. Hickey, W. Marcus, D. Sincoskie and M. Littlewood, “Sunshine: A High-performance Self-routing Broadband Packet
Switch Architecture,” IEEE J. Sel. Areas Commun., vol. 9, no. 8, Oct. 1991, pp. 1289-1298.
[58] Huang, A. and S. Knauer, “Starlite: A Wideband Digital Switch,” in Proc. GLOBECOM ’84, Atlanta, GA, Dec. 1984, pp. 121-125.
[59] Iliadis, I. and W.E. Denzel, “Analysis of Packet Switches with Input and Output Queueing,” IEEE Trans. Commun., vol. 41, no. 5, May 1993, pp. 731-740.
[60] Rojas-Cessa, R. and C.-B. Lin, “Captured-Frame Eligibility and Round-Robin Matching for Input-Queued Packet Switches,” IEEE Commun. Letters, vol. 8, no. 9, Sep. 2004, pp. 585-587.
[61] Banovic, D. and I. Radusinovic, “VOQ Simulator v2.0 - Tool for Performance Analysis of VOQ Switches,” submitted to IEEE HPSR 2006. http://klamath.stanford.edu/tools.
[62] Wang, F. and M. Hamdi, “Scalable Router Memory Architecture Based on Interleaved DRAM: Analysis and Numerical Studies,” in Proc. IEEE ICC, 2007.
[63] McKeown, N., et al., “Achieving 100% Throughput in an Input-Queued Switch,” in Proc. IEEE INFOCOM ’96, pp. 296-302.
[64] Lee, H.-I. and S.-W. Seo, “Analysis Model of Multiple Input-Queued Switches with PIM Scheduling Algorithm,” IEEE Commun. Letters, vol. 5, no. 7, Jul. 2001, pp. 316-318.
[65] Krishna, P., et al., “On the Speedup Required for Work-Conserving Crossbar Switches,” IEEE J. Sel. Areas Commun., vol. 17, no. 6, 1999, pp. 1057-1066.
[66] Elhanany, I., et al., “On Uniformly Distributed ON/OFF Arrivals in Virtual Output Queued Switches with Geometric Service Times,” in Proc. IEEE ICC, vol. 1, May 2003, pp. 173-177.
[67] Elhanany, I. and D. Sadot, “DISA: A Robust Scheduling Algorithm for Scalable Crosspoint-Based Switch Fabrics,” IEEE J. Sel. Areas Commun., vol. 21, no. 4, May 2003, pp. 535-545.
[68] Tsaur, D.-J., et al., “Scheduling Algorithm and Evaluating Performance of a Novel 3D-VOQ Switch,” IJCSNS Int. J. of Computer Science and Network Security, vol. 6, no. 3A, Mar. 2006, pp. 25-34.
[69] Tamir, Y. and H.-C. Chi, “Symmetric Crossbar Arbiters for VLSI Communication Switches,” IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 1, Jan. 1993, pp. 13-27.
[70] Yener, B., E. Leonardi and F. Neri, “Algorithms for Virtual Output Queued Switching,” in Proc. GLOBECOM ’99, Rio de Janeiro, Brazil, Dec. 1999, vol. 2, pp. 1203-1210.
[71] McKeown, N., “The iSLIP Scheduling Algorithm for Input-Queued Switches,” IEEE/ACM Trans. Networking, vol. 7, no. 2, Apr. 1999, pp. 188-201.
[72] Whittle, P., Probability, John Wiley & Sons Ltd., Great Britain, 1976.
[73] Mekkittikul, A. and N. McKeown, “A Starvation-free Algorithm for Achieving 100% Throughput in an Input-Queued Switch,” in Proc. IEEE INFOCOM ’98, San Francisco, CA, Apr. 1998, pp. 792-799.
[74] Dai, J.G. and B. Prabhakar, “The Throughput of Data Switches with and without Speedup,” in Proc. INFOCOM 2000, Tel Aviv, Israel, Mar. 2000, vol. 2, pp. 556-564.
[75] Serpanos, D. and P. Antoniadis, “FIRM: A Class of Distributed Scheduling Algorithms for High-Speed ATM Switches with Multiple Input Queues,” in Proc. IEEE INFOCOM 2000, Tel Aviv, Israel, Mar. 2000, pp. 548-555.
[76] Smiljanic, A., et al., “RRGS - Round-Robin Greedy Scheduling for Electronic/Optical Terabit Switches,” in Proc. IEEE GLOBECOM ’99, Rio de Janeiro, Brazil, Dec. 1999, pp. 1244-1250.
[77] Zhang, X. and L.N. Bhuyan, “An Efficient Scheduling Algorithm for Combined Input-Crosspoint-Queued (CICQ) Switches,” in Proc. IEEE GLOBECOM, Nov. 2004.
[78] Rojas-Cessa, R. and E. Oki, “Round-Robin Selection with Adaptable-Size Frame in a Combined Input-Crosspoint Buffered Switch,” IEEE Commun. Letters, vol. 7, no. 11, Nov. 2003, pp. 555-557.
[79] Guo, Z. and R. Rojas-Cessa, “Framed Round-Robin Arbitration with Explicit Feedback Control for Combined Input-Crosspoint Buffered Packet Switches,” in Proc. IEEE ICC, Jun. 2006.
[80] Kumar, N., et al., “Fair Scheduling in Input-Queued Switches under Inadmissible Traffic,” in Proc. IEEE GLOBECOM, Nov. 2004.
APPENDIX A1
COMPLETE VOQ PACKET-SWITCH SIMULATION ENVIRONMENT
[Simulink block diagram of the complete VOQ packet-switch simulation model, built from Start Timer, Read Timer, N-Server, Get Attribute, FIFO Queue, Entity Sink and Display blocks; the Display blocks report the throughput and average delay readings.]
APPENDIX A2
FLOW CHART FOR THE SIMULATION OF VOQ PACKET SWITCH
Start
1. Generate packets i.i.d.
2. Attach a random destination to each packet.
3. Set the arrival time of each packet.
4. Place each packet on its respective VOQ.
5. Output packets in round-robin order at the destinations.
6. Get the time out (departure time) of each packet.
7. Calculate the service time for a number of packets at each destination.
8. Count the number of packet losses.
9. Count the number of packets at each destination.
Stop
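The flow chart above can be rendered as a short slot-based script. The following is a minimal MATLAB sketch, not the Simulink model used in this thesis; the switch size N = 4, load p = 0.6, slot count and VOQ capacity Qmax = 10 are illustrative assumptions, and, as in the flow chart, each destination serves its VOQs in round-robin order independently of the other outputs (input contention is not modelled).

% Minimal slot-based sketch of the flow chart (illustrative, not the
% thesis's Simulink model). Assumes Bernoulli i.i.d. arrivals, uniform
% random destinations, finite VOQs and one departure per output per slot
% chosen round-robin among the VOQs holding packets for that output.
N     = 4;         % switch size (assumed)
p     = 0.6;       % per-input arrival probability per slot (assumed)
slots = 10000;     % simulated time slots
Qmax  = 10;        % VOQ capacity in packets (assumed)

voq = cell(N, N);  % voq{i,j}: arrival slots queued at input i for output j
rr  = ones(1, N);  % round-robin pointer of each output
delays = []; losses = 0; arrivals = 0; departures = 0;

for t = 1:slots
    % Generate packets i.i.d. and attach a uniform random destination
    for i = 1:N
        if rand < p
            arrivals = arrivals + 1;
            j = ceil(N*rand);                   % random destination
            if numel(voq{i,j}) < Qmax
                voq{i,j}(end+1) = t;            % record arrival time
            else
                losses = losses + 1;            % VOQ full: packet lost
            end
        end
    end
    % Each output serves one head-of-line packet per slot, round-robin
    for j = 1:N
        for k = 0:N-1
            i = mod(rr(j) - 1 + k, N) + 1;      % next input after the pointer
            if ~isempty(voq{i,j})
                delays(end+1) = t - voq{i,j}(1); % waiting time in slots
                voq{i,j}(1) = [];               % packet departs
                departures  = departures + 1;
                rr(j) = mod(i, N) + 1;          % advance pointer past served input
                break
            end
        end
    end
end

fprintf('average delay %.3f slots, carried load %.3f, loss ratio %.4f\n', ...
        mean(delays), departures/(N*slots), losses/max(arrivals,1));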
APPENDIX A3
MATLAB CODES FOR THE PLOTS OF SIMULATION RESULTS
%===== TRANSFER OF DATA FROM SIMULINK TO WORKSPACE ==========
% que_length is a "Structure With Time" saved by a To Workspace block:
% the first column is simulation time, the second the logged signal value.
out = [que_length.time, que_length.signals.values];
x = out(:,1);
y = out(:,2);
plot(x, y)
xlabel('simulation time');
ylabel('average queue length')   % signal logged in que_length
grid
%========================== VOQ1 ==========================
%======================= throughput =======================
% out_2 logs the running throughput of VOQ1 in the same format
out2 = [out_2.time, out_2.signals.values];
x = out2(:,1);
y = out2(:,2);
plot(x, y)
xlabel('simulation time');
ylabel('throughput')
grid
% PLOT OF AVERAGE DELAY VERSUS LOAD
a  = [0.2253 0.3319 0.3892 0.4347 0.5910 0.9513];   % offered load
Q1 = [1.733 2.060 2.953 3.091 5.439 12.01];         % average delay, Q = 5
Q2 = [1.766 2.030 2.652 2.397 3.833 23.0];          % average delay, Q = 10
Q3 = [1.766 2.062 2.346 4.601 8.34 28.27];          % average delay, Q = 20
plot(a, Q1, '-*r', a, Q2, ':+b', a, Q3, '-xk')
grid on
xlabel('load')
ylabel('average delay (sec.)')
legend('Q=5', 'Q=10', 'Q=20')
% PLOT OF AVERAGE QUEUE LENGTH VERSUS LOAD
p    = [0.2253 0.3319 0.3892 0.4347 0.5910 0.9513];  % offered load
q_l1 = [0.0150 0.0460 0.1050 0.1349 0.5716 2.973];   % average queue length, Q = 5
q_l2 = [0.0150 0.0482 0.0977 0.0950 0.5989 6.687];   % average queue length, Q = 10
q_l3 = [0.0150 0.0482 0.0871 0.5469 15 18];          % average queue length, Q = 20
plot(p, q_l1, '-*r', p, q_l2, ':+b', p, q_l3, '-xk')
grid on
xlabel('load')
ylabel('average queue length')
legend('Q=5', 'Q=10', 'Q=20')
% PLOT OF THROUGHPUT VERSUS LOAD
b  = [0.2253 0.3319 0.3892 0.4347 0.5910 0.9513];    % offered load
T1 = [0.99 0.99 0.99 0.99 0.99 0.99];                % throughput, Q = 5
T2 = [0.99 0.99 0.99 0.99 0.989 0.988];              % throughput, Q = 10
T3 = [0.99 0.99 0.99 0.99 0.989 0.984];              % throughput, Q = 20
plot(b, T1, '-*r', b, T2, ':+b', b, T3, '-xk')
grid on
xlabel('load')
ylabel('throughput')
legend('Q=5', 'Q=10', 'Q=20')
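For readers relating the three plotted quantities to one another, the short sketch below shows how average delay, throughput and average queue length can be derived from per-packet arrival and departure logs at one destination, using Little's law for the queue length. The packet times here are hypothetical illustrative values, not thesis data.

% Deriving the plotted statistics from per-packet logs (illustrative).
t_in  = [1 3 4 7 9];                 % example arrival times (hypothetical)
t_out = [2 5 7 9 12];                % example departure times (hypothetical)
W = mean(t_out - t_in);              % average delay per packet
T = numel(t_out) / max(t_out);       % throughput: departures per unit time
lambda = numel(t_in) / max(t_out);   % arrival rate in packets per unit time
L = lambda * W;                      % average queue length via Little's law, L = lambda*W
fprintf('average delay %.2f, throughput %.2f, average queue length %.2f\n', W, T, L);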