OJOMU, SUNDAY ABAYOMI
PG/M.ENG/06/41381
PERFORMANCE ANALYSIS OF VOQ PACKET-SWITCH
ELECTRONIC ENGINEERING
A THESIS SUBMITTED TO THE DEPARTMENT OF ELECTRONIC
ENGINEERING, UNIVERSITY OF NIGERIA, NSUKKA
JUNE, 2009
TITLE PAGE
PERFORMANCE ANALYSIS OF VOQ PACKET-SWITCH
BY
OJOMU, SUNDAY ABAYOMI
PG/M.ENG/06/41381
BEING A THESIS SUBMITTED TO THE
DEPARTMENT OF ELECTRONIC ENGINEERING
UNIVERSITY OF NIGERIA, NSUKKA
IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF MASTER OF ENGINEERING
(COMMUNICATIONS) DEGREE
IN ELECTRONIC ENGINEERING
SUPERVISOR: DR C. I. ANI
JUNE 2009
APPROVAL PAGE
This thesis was approved for the Department of Electronic Engineering, University of Nigeria, Nsukka by
-------------------------- --------------------------------------
Dr. C. I. Ani Ven. Prof. Engr. T. C. Madueme
Supervisor Head of Department
------------------------------
External Examiner
CERTIFICATION
OJOMU, Sunday Abayomi, a postgraduate student in the Department of Electronic Engineering with Registration Number PG/M.ENG/06/41381
has satisfactorily completed the requirements for the course and dissertation for the degree of Master of Engineering (Communications). The
work contained in this dissertation is original and has not been submitted in part or in whole for any other degree of this or any other university or
institution.
--------------------------- ---------------------------
Dr. C. I. Ani Date
Supervisor
DEDICATION
To the glory of God the Almighty whose infinite love for us was made manifest through Jesus Christ by the help of the Holy Spirit.
ACKNOWLEDGEMENTS
It took me a long time to complete this work, and without the help, support and prayers of several people it would not have been satisfactorily concluded. The following persons form a cross-section of the people of goodwill in my postgraduate work:
Dr. C. I. Ani served as an indefatigable coach for me along the way. He created a stimulating environment during his tenure as Head of the Department of Electronic Engineering, UNN, and after it. He guided me with the interest of a mother eagle training her eaglet in flight towards a higher altitude.
I owe a debt of gratitude to my wife and children for their prayers and patience during my frequent absences in the course of this postgraduate schooling. For urgent financial support, I will remember my cousin, Mrs Bosede Adeniji, and my undergraduate friends Mr Yinka Obiwole and Mr Dada Ademokoya. My childhood friend, Mr Martins Akinwumi (a.k.a. Marteco), will not be forgotten for his moral support. I will not forget to thank my UNN postgraduate classmates, including Chief Alexis Ironbar, Mr Alex Ndi, Ericson, Ezea Ugo, Emeka, Ifeanyi, Charles, Joseph Mom, Mrs Toyin, Abgo, Ifenayi, Ken and others.
I will never forget Professor I. Okoro of the Electrical Engineering Department, UNN, whose book on MATLAB gave me a strong appetite for MATLAB as a necessary tool for the simulations. I will always remember Elder Eko James of the Department of Electrical and Electronic Engineering, CRUTECH, Calabar. Miss Uju Okoyeuzu of the Department of Computer Science, UNN, will always be remembered for her faith in achieving what seemed too hard to achieve.
Finally, my heartfelt gratitude goes to Mr Christian, one of the assistants in the PG laboratory, to all the staff of the Electronic Engineering Department, UNN, and to all my roommates in Nkrumah Hall. May the Lord God Almighty bless all of you for your labour of love to me, in Jesus' name.
ABSTRACT
This dissertation presents an approach that directly correlates the analytical model with the simulation model, using minimal assumptions, for analysing the performance of a VOQ packet switch in order to determine the comparative advantage of its queueing scheme over other queueing configurations. The approach models packet arrivals as an On-Off arrival process, with geometric service times in each virtual queue and first-in-first-out (FIFO) buffer sections. Heuristic expressions for throughput and delay were derived and validated by computer simulations. The simulation results show that the VOQ configuration allows dynamic adjustment of the buffer allocation between 5 and 20 packets with 100% throughput under all traffic conditions. Any buffer size above 20 under this architecture results in excessive delay.
TABLE OF CONTENTS
Certification
Dedication
Acknowledgements
Abstract
1. INTRODUCTION
1.0 Background of the Project
1.1 Buffering/Queueing
1.2 Sequence Preserving
1.3 Major Performance Parameters
1.4 Objectives
1.5 Scope
1.6 Methodology
1.7 Outline of Thesis
2. AN OVERVIEW OF PACKET-SWITCH ARCHITECTURES
2.0 Introduction
2.1 Basics of Digital Switches and Switching Techniques
2.1.1 Structure of Circuit-Switched Elements
2.1.2 Combination of Switch Elements
2.1.3 Structure of Packet-Mode Switches
2.1.4 Packet-Switched Network
2.1.5 Categories of Packet-Switched Network
2.2 Classification of Packet-Switch Architectures
2.2.1 Single-stage vs. Multi-stage Switches
2.2.2 Input Queueing
2.3 Output Queueing
2.4 Shared Queueing
2.4.1 Shared Input Queueing
2.4.2 Shared Output Queueing
2.5 Shared-Memory Implementations
2.5.1 Sliding-Window (SW) Packet Switch
2.6 Combined Input and Output Queueing (CIOQ) Architecture
2.6.1 CIOQ Architectures with FIFO Input Queues
2.6.2 CIOQ with Finite Output Buffers and Back-Pressure
2.7 Virtual Output Queued (VOQ)
2.7.1 Analysis Model
3. MODELS AND SIMULATIONS OF VOQ PACKET-SWITCH
3.0 Introduction
3.1 VOQ Packet-Switch Architecture
3.2 The Model of the VOQs
3.2.1 Traffic Model
3.3 The VOQ Simulation Model
3.3.1 The Traffic Source
3.3.2 VOQ Process
3.3.3 The Single Server Process
3.3.4 Output Switch
3.3.5 The Destinations
3.4 The QoS Metrics
4. SIMULATIONS AND RESULTS
4.0 Introduction
4.1 The Traffic Pattern
4.2 Simulation Results
5. CONCLUSIONS
REFERENCES
APPENDIX A1
APPENDIX A2
APPENDIX A3
CHAPTER 1
INTRODUCTION
1.0 Background of the Project
Global bandwidth demand is growing at an exponential rate, fueled mainly by the proliferation of the Internet as a worldwide communication
medium for a wide spectrum of applications. The growth rates have been sustained by optical transmission technologies such as wavelength-
division multiplexing (WDM), which provides very large transmission bandwidth. So far, electronic routing and switching systems have been
able to keep up with developments in optical transmission technologies by exploiting the advances in silicon technology. The demand for
transmission bandwidth, however, outstrips the available routing and switching capacities. Although all-optical switches are very promising for achieving very high throughput, they have two distinct disadvantages. First, their switching granularities are very coarse, which prevents per-packet switching. Secondly, optical buffering is cumbersome and impractical. Therefore, electronic packet switches are expected to continue to play a large role for a long time to come [1]. Electronic switches have their own disadvantages in terms of scalability and fabric speed, and architectures that can scale to the desired throughputs in the multi-terabit range are being developed. It is therefore pertinent to take a look at
the trend of this development, by providing comparative performance analysis of various classes of packet switch architectures and in particular
that of Virtual Output-Queued (VOQ) packet switches.
In digital switching, user messages or data are divided into segments by specialized equipment called packet assembler/disassembler (PAD).
These segmented data have addressing, sequencing and error control information fields. The resulting unit of data (group of binary digits),
including data and call control signals, and possibly error control information are arranged in a specified format. This unit of data which is
switched as a composite whole is called a packet. The electronic switching system that supports the packets is called ‘packet switch’ [2][3][4].
Packet Switching may be defined as the transmission of data by means of addressed packets whereby a transmission channel is occupied for the
duration of transmission of the packet only; the channel is then available for use by packets being transferred between different data terminal
equipment [3].
A packet switch has four components: input port, output port, the routing processor and the switching fabric, as shown in figure 1.1.
Input and Output ports
With reference to the OSI standard [5] [6], the input port performs the physical and data link functions for the packet switch. The bits are
constructed from the received signal. The packet is decapsulated from the frame. Errors are detected and corrected. The packet is then ready to
be routed by the network layer. In addition to a physical layer processor and a data link processor, the input port has buffers to hold the packet
before it is directed to the switching fabric.
The switch fabric
The switch fabric is used for the input and output ports interconnection. The switching fabric may use a specific scheduling algorithm to decide
which packet to transfer in the next packet time. This is particularly important for VOQ input-buffered switches.
Fig.1.1. Functional blocks of a packet switch. [2]
Routing Processor/signaling.
When interconnections are made, the packets are directed to their respective destinations in an orderly manner prescribed by the algorithm in the
processing center and signaling protocol.
1.1 Buffering/queueing
Due to the statistical nature of the traffic, buffering of packets in the packet switch is unavoidable. Two or more packets may be addressed to the same port at the same time. In such a situation, each output line can serve only one packet per time slot; the other packets must be stored in buffers. This process of storing packets, and the strategies for doing so, is called buffering or queueing. Buffers may be placed at the input ports, at the output ports and within the switch fabric. Buffering is the traditional way of resolving contention in a packet switch [5,6]. Some basic methods of
buffering/queueing are: input queueing (IQ), output queueing (OQ), combined input output queueing (CIOQ) and virtual output queuing (VOQ).
VOQ is an algorithm in which the input buffers are organized according to outputs. The platform is still the input-queued, crossbar-based
architecture.
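As a minimal illustration of this organisation (a sketch, not code from the thesis), the Python snippet below keeps one FIFO per output at each input port, so a blocked head-of-line packet for one output never delays packets bound for other outputs. All class and variable names are invented for the example.

```python
from collections import deque

class VOQInput:
    """One input port of an N x N switch with virtual output queueing:
    a separate FIFO is kept for every output port."""

    def __init__(self, num_outputs):
        self.voq = [deque() for _ in range(num_outputs)]

    def enqueue(self, packet, output):
        # Packets are sorted by destination on arrival, which removes
        # head-of-line blocking between different outputs.
        self.voq[output].append(packet)

    def head_of_line(self, output):
        # A scheduler inspects per-output queue heads rather than a single FIFO head.
        return self.voq[output][0] if self.voq[output] else None

    def dequeue(self, output):
        return self.voq[output].popleft()

# Usage: a 3 x 3 switch, as in the thesis' simulation model.
inputs = [VOQInput(3) for _ in range(3)]
inputs[0].enqueue("pkt-A", output=2)
inputs[0].enqueue("pkt-B", output=1)
print(inputs[0].head_of_line(1))  # "pkt-B" is not blocked behind "pkt-A"
```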
1.2 Sequence preserving/scheduling.
Depending on the switch-fabric application, there may be a strict requirement regarding the order in which packets leave the switch. In particular,
certain applications, such as ATM, impose the strict requirement that packets belonging to the same flow must be delivered in the order they
arrived. For a packet switch, this means that the sequence of packets belonging to a given combination of input port, output port, and virtual lane must be maintained. The scheduling algorithm for the switch fabric takes care of this [2].
Ideally, communication between one pair of users should never interfere with communication between any other pair. One way to guarantee this is to give every pair of users its own link; while this is certainly highly desirable, the price to be paid is very high, owing to the number of interconnection links required in such a network with N users. This may be feasible for a network having just a few users, but for networks such as the public telephone system or the Internet, with tens of millions of users, such a solution is clearly both impractical and prohibitively expensive.
However, in general the sum of the requested user bandwidth is much smaller than the total bandwidth available to the group, which offers the potential for substantial cost savings. To take advantage of this, traffic from a group of end users is multiplexed onto a single link of bandwidth T, which is shared by all of the users in the group. This principle is called statistical multiplexing. Packet switching in this environment has proven to be a major technological breakthrough in providing cost-effective data communications among information-processing systems.
The switches under study are built on statistical multiplexing technology [4]. The communication network is segmented. This is done by first
assigning groups of users that will share a link, and then interconnecting the groups by means of packet switches.
In order to avoid collisions of packets in the switch, and to be fair to each packet, the sharing of the switch-fabric connections should be optimal. The coordination mechanism that achieves this is called packet scheduling. The best packet switch delivers the best QoS, such as a prescribed bound on delay and good throughput, under any feasible scheduling method.
1.3 Major Performance Parameters
The following parameters [4] are used to describe performance characteristics of a packet switch.
• Average throughput: the average number of packets that exit the switch during one packet cycle. This parameter depends largely on the number of switch output ports. Equivalently, throughput may be expressed as the fraction of time the output lines are busy (output utilisation).
• Average packet delay: the average delay encountered by a packet while traversing the switch (usually expressed in packet cycles).
• Packet-loss rate: the percentage of arriving packets that are dropped by the switch owing to lack of buffer space.
• Packet-delay variation: also referred to as jitter; the distribution of packet delays around the mean.
A sketch of how these metrics can be computed from simulation records is given below.
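The snippet is a hedged illustration only: it assumes per-packet records of (arrival slot, departure slot or None if dropped), a format and set of function names invented for the example rather than taken from the thesis' MATLAB implementation.

```python
import statistics

def qos_metrics(records, num_slots, num_outputs):
    """Compute the four QoS metrics from per-packet records.
    Each record is (arrival_slot, departure_slot), with departure None if dropped."""
    delivered = [(a, d) for a, d in records if d is not None]
    delays = [d - a for a, d in delivered]
    # Throughput as output utilisation: busy output-slots / total output-slots.
    throughput = len(delivered) / (num_slots * num_outputs)
    return {
        "avg_throughput": throughput,
        "avg_delay": statistics.mean(delays) if delays else 0.0,
        "loss_rate": 1 - len(delivered) / len(records) if records else 0.0,
        "jitter": statistics.pstdev(delays) if delays else 0.0,  # spread around the mean
    }

print(qos_metrics([(0, 2), (1, 2), (3, None)], num_slots=4, num_outputs=1))
```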
1.4 OBJECTIVES
The objectives of this dissertation are:
(a) To experiment with a computer simulation model of the VOQ switching system, since it could prove too costly, and in many cases far too risky, to work with the real switch.
(b) To carry out a performance analysis of the VOQ packet switch in order to determine its comparative advantage over other queueing configurations.
1.5 SCOPE
This dissertation has a limited scope. No physical implementation of the VOQ switch architecture is required for the analysis, which is based on a computer simulation model of the VOQ. The Virtual-Output-Queued (VOQ) structure is adopted for the analysis; essentially, the logical structure of a 3 x 3 IQ switch forms the basis for this VOQ structure [7]. The inter-arrival times considered follow a Markov-modulated arrival process, with geometrically distributed service times. The bursty traffic model is based on a two-state Markov chain consisting of an ON and an OFF state. Each input in the on-off traffic model alternates between active and idle periods of geometrically distributed durations.
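To make the traffic model concrete, here is a minimal sketch of such a two-state on-off source. It assumes one packet per slot in the ON state and fixed per-slot exit probabilities, which is exactly what makes the state residence times geometrically distributed; the parameter names are illustrative.

```python
import random

def on_off_source(p_on_end, p_off_end, num_slots, seed=1):
    """Two-state Markov (on-off) source: one packet is emitted in every ON
    slot; each slot the current state is left with a fixed probability, so
    ON and OFF durations are geometric."""
    rng = random.Random(seed)
    state_on, arrivals = False, []
    for _ in range(num_slots):
        if state_on:
            arrivals.append(1)                    # one packet per active slot
            state_on = rng.random() >= p_on_end   # stay ON with prob 1 - p_on_end
        else:
            arrivals.append(0)
            state_on = rng.random() < p_off_end   # leave OFF with prob p_off_end
    return arrivals

# Mean ON burst = 1/p_on_end slots, mean OFF gap = 1/p_off_end slots,
# so the offered load is p_off_end / (p_on_end + p_off_end).
print(on_off_source(p_on_end=0.2, p_off_end=0.1, num_slots=20))
```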
1.6 METHODOLOGY
A typical communication switch architecture consists of input ports, output ports, switch fabric, some buffer memory and control system. Past
and recent switch designs were motivated by issues of QoS provisioning. As a result, varying architectures were proposed on output buffering,
input buffering, combined input and output buffering. The evolution of these architectures for the past several years was presented.
Throughput is the fundamental metric for the performance evaluation of a packet switch and the basic metric for estimating packet delay [8].
Towards this direction, an overview of performance evaluation methods carried out on packet switches was conducted from which the research
direction was defined.
The model was built and simulated with MATLAB/simulink. The simulation time of MATLAB was used as the run time and analysis was
carried out from the results.
1.7 OUTLINE OF THESIS
The thesis is organized as follows: Chapter 1 presents the general introduction, containing the background of the VOQ packet-switch performance analysis, the objectives of the project, the methodology and this outline. Chapter 2 presents an overview of existing packet switches: three main classes of switching architectures are presented based on queueing discipline and their modes of operation, and performance analyses of VOQ packet-switch architectures are reviewed. Chapter 3 presents the model and simulation of the 3x3 VOQ switch, together with the performance analysis under uniformly distributed On-Off traffic and a Geo/Geo/1 service discipline. Chapter 4 contains the performance characteristics of the VOQ architecture based on simulations, with the results compared against the numerical analysis of Chapter 3. Chapter 5 summarizes the conclusions of the project and suggests possible directions for future work.
CHAPTER 2
AN OVERVIEW OF PACKET-SWITCH ARCHITECTURES
2.0 Introduction
In this chapter, the concept of packet switching is presented. An overview of the existing packet-switched architectures is also given, with the main focus on providing insight into the most recent developments in each class. A few categories of Virtual Output Queued (VOQ) switches, and some performance-analysis techniques for them, are discussed; these categories are based on queueing discipline. The chapter concludes with a discussion of the merits and drawbacks of the most relevant architectures. The chapter does not intend to provide a comprehensive overview of all packet-switch architectures.
2.1. Basics of digital Switches and Switching techniques
This section presents the concepts of the design of a virtual-packet-switch. Electronic switching systems may be approached as a system having
two major parts. These are the switching fabric (or switch matrix), and the call control part. The switching matrix is the part that gets the traffic
from one user port to another, and the call control is the part that manages what connections the switching fabric should make.
Circuits typically arrive at switches as E-1/T-1 digital circuits [3]. Sometimes higher-rate interfaces are used, and some circuits exist as ISDN and analogue connections to codecs built into the switch's interface cards, but E-1/T-1 connections predominate [3]. The E-1/T-1 connections are used here to illustrate the issues of digital time-domain switching, which is the core technology for all modern switches [3].
2.1.1 Structure of Circuit-Switched Elements
Digital switching does not only connote the switching of signals in digital form over PCM links. Interconnection of subscribers to these links requires timing, which is achieved by means of time slots for the inlets and outlets. The following switching elements are used for this timing in circuit-switched systems.
Space-division switching elements
Space division means taking data from one of many TDM channels and sending it to one of many other TDM channels, usually also changing channels very rapidly. In a space-division switch, the paths in the circuit are separated from one another in space. A typical example is the crossbar switch, which connects n inputs to m outputs in a grid, using electronic microswitches at each crosspoint, as illustrated in figure 2.9. In this example, n = 4 and m = 5.
Fig. 2.9. Crossbar switch with 4 inputs and 5 outputs.
The major limitation of this design is the number of crosspoints required: connecting n inputs to m outputs using a crossbar switch requires n x m crosspoints. In practice, few crosspoints are in use at any given time, making the switch inefficient [3]. The solution to the limitations of the crossbar switch is the multistage switch, which combines crossbar switches in several stages, as shown in figure 2.10.
In a single crossbar switch, only one row or column (one path) is active for any connection, so N x N crosspoints are needed. By creating multiple paths inside the switch, the number of crosspoints is decreased: each crosspoint in the middle stage can be accessed by multiple crosspoints in the first or third stage.
The whole idea of multistage switching is to share the crosspoints in the middle-stage crossbars. However, sharing can cause a lack of availability if the resources are limited and several users want a connection at the same time. If an input cannot be connected to an output because no path is available between them, that is, all the possible intermediate switches are occupied, the switch is said to be blocking for that input. This is a drawback of this switch arrangement; blocking occurs during periods of heavy traffic.
In a single-stage switch, blocking does not occur, because every combination of input and output has its own crosspoint and there is always a path. The output may be busy at a given time, but there is no blocking in the path.
Fig. 2.10 Crossbar multistage switch: stage 1 has N/n crossbars of size n x k, stage 2 has k crossbars of size N/n x N/n, and stage 3 has N/n crossbars of size k x n.
In large systems, the number of stages can be increased to cut down on the number of crosspoints required. As the number of stages increases, however, possible blocking increases as well. Blocking is common on public telephone systems in the wake of a natural disaster, when calls being made to check on or reassure relatives far outnumber the regular load of the system.
Time-division switch
A time-division switch uses time-division multiplexing (TDM) inside the switch. Inside the switch is a combination of a TDM multiplexer, a TDM demultiplexer, and a TSI, as shown in figure 2.11. The TSI consists of random-access memory (RAM) with several memory locations. The size of each location is the size of a single time slot, and the number of locations is the same as the number of inputs and outputs. The RAM fills up with incoming data from time slots in the order received; slots are then sent out in an order based on the decisions of a control unit.
The channels on the link are structured into frames and multiframe arrangements so that any attempt to interconnect or link calling and the called
subscribers will require a time slot adjustment [4].
Time switching operation principle.
Assume the time-switching plane shown in figure 2.12, with five subscribers on the inlet side of a TDM switch and five subscribers on the outlet side. If subscriber D on inlet time slot 4 wants to be connected to subscriber U on outlet time slot 5, the inlet sample packet must be delayed in the switch for one time slot before being sent out on time slot 5 for the connection. The type of switch designed to provide the appropriate delay is called a time switch or Time Slot Interchange (TSI) [4]. This is an interface to the outlet channel in the link. The sample from the inlet channel is stored in a buffer and then read out when the appropriate output time slot arrives.
Each input timeslot word is stored in a buffer. A control store holds information on the time at which each sample has to be read out. At the
appropriate time, this control store connects the data buffer to the output lines.
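In software terms, a time-slot interchange reduces to buffering one frame and reading it back in a permuted order. The sketch below illustrates this with a hypothetical 4-slot frame and control-store mapping (the values are invented for the example, not taken from the figure).

```python
def time_slot_interchange(frame, control_store):
    """Write the incoming time-slot words into RAM in arrival order, then
    read them out in the order dictated by the control store."""
    ram = list(frame)                        # data buffer: one word per slot
    return [ram[src] for src in control_store]

# Usage: a hypothetical 4-slot frame; output slot i carries input slot
# control_store[i] (0-indexed), giving the required delays.
print(time_slot_interchange(["A", "B", "C", "D"], [2, 3, 0, 1]))  # ['C','D','A','B']
```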
Fig.2.11 Time-Slot Interchange (TSI)
2.1.2 Combination of Switch Elements.
The very simple architectures shown in figures 2.11 and 2.12 are not suitable for handling traffic volumes experienced in the thousands of circuits
handled by real public network exchanges. In practice, more complex combinations of time and space switching elements are used, to achieve a
balance between capacity, blocking and economy.
Space-Time-Space switch.
Figures 2.13 and 2.14 show a very simple STS switch, which switches between N (N ≤ 30) incoming circuits and N outgoing circuits, for each of 30 time slots per circuit. In real life the call paths will generally be bi-directional, so the outgoing and incoming circuits will not generally be distinguishable, but it helps to separate them for showing how the switching works.
The figure shows a switch connecting time slot 3 of incoming circuit number 2, to time slot 6 of outgoing circuit number 1. The space switches
are controlled by blocks of memory which are written to by the call control software.
Fig. 2.12 Time-division switching
Time-Space-Time Switching
To achieve the time shift, a 3-slot delay unit is more commonly used. A space switch at the incoming side switches circuit 2 through to the delay
unit, just for the duration of time slot 3. A space switch at the outgoing side switches the output of that delay unit through to outgoing circuit 1,
for the duration of time slot 6.
Fig. 2.13 Space-Time-Space (STS) architecture: N incoming circuits, each carrying frames of (say) 30 time slots, enter an incoming N x 30 space switch, pass through 30-channel fixed-delay time switches, and leave through an outgoing N x 30 space switch (shown connecting circuit 2, slot 3 to circuit 1, slot 6).
At the heart of the Time-Space-Time switch is an N x N space switch, which cycles through all possible N² combinations of space connections once per frame time. The incoming circuit (2) is permanently connected to a variable delay element, which is programmed to switch slot 3 into the time slot for which the space switch connects incoming circuit 2 to outgoing circuit 1. Then a second delay element switches that time slot on circuit 1 into slot 6, as required.
The TSI is the switch element that provides the variable Δt time delay in the STS or TST combination.
Fig. 2.14 Space-time-space switch block diagram
Fig. 2.15 Time-Space-Time (TST) switch architecture [4]: arrays of N time-switch elements (variable Δt) on both the incoming and outgoing sides of an N x N space switch; N incoming circuits, each carrying frames of (say) 30 slots, and N outgoing circuits, each of 30 time slots.
From figure 2.16, the inputs that arrive in frames of size n are mapped into frames of size k by the input TSI switches. The introduction of TSI
switches at the input and output stages and the introduction of a single time-shared crossbar switch result in a much more compact design than
space switches.
2.1.3 Structure of packet-mode switches.
A switch used in a packet-switched network has a different structure from a switch used in a circuit-switched network. A packet switch has four
components: input ports, output ports, the routing processor and the switching fabric, as shown in figure 2.17.
Fig. 2.16 Time-space-time switch block diagram [7]: N inputs in TDM frames of n slots pass through n x k input TSI stages, a space stage operating on frames of k slots, and k x n output TSI stages to N outputs with frames of n slots.
Input ports
The schematic diagram of an input port is shown in figure 2.18. The input port performs the physical and data link functions of the packet switch.
The bits are constructed from the received signal. The packet is decapsulated from the frame. Errors are detected and corrected. The packet is
then ready to be routed by the network layer. In addition to a physical layer processor and a data link processor, the input port has buffers to hold
the packet before it is directed to the switching fabric.
Fig. 2.17. The components of a packet switch [2]
Source: Anurag Kumar (2005)
Figure 2.19 shows the schematic diagram of an output port. The output port performs the same function as the input port, but in the reverse order.
First the outgoing packets are queued, then the packet is encapsulated in a frame, and finally the physical layer functions are applied to the frame
to create the signal to be sent on the line.
Routing Processor.
The routing processor performs the functions of the network layer. It does the table lookup by searching the routing table. The destination
address is used to find the address of the next hop and, at the same time, the output port number from which the packet is sent out.
Switching fabrics.
Fig. 2.18. Schematic diagram of an input port: a physical-layer processor and a data-link-layer processor, followed by a queue.
Fig. 2.19. Schematic diagram of an output port: a queue, followed by a data-link-layer processor and a physical-layer processor.
Crossbar switches and banyan switches are used in the structure of the fabric. Although the crossbar switch is the simplest type, the banyan switch is more practical because of delay and reliability issues. A banyan switch is shown in figure 2.20. It is a multistage switch with microswitches at each stage that route the packets based on the output port, represented as a binary string. For n inputs and n outputs, there are log2(n) stages with n/2 microswitches at each stage. The first stage routes the packet based on the high-order bit of the binary string, the second stage routes it based on the second high-order bit, and so on until the whole string has been consumed [1]. Figure 2.20 shows a banyan switch with eight inputs and eight outputs [2]; the number of stages is log2(8) = 3.
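Since self-routing by destination bits is the essence of the banyan fabric, a small sketch can make it concrete. The port naming ("upper"/"lower" for bit 0/1) is an assumed convention for illustration.

```python
def banyan_route(dest, num_stages):
    """Self-routing in a banyan fabric: the 2x2 element at stage s examines
    bit s of the destination (most significant bit first) and forwards the
    packet to its upper port on 0 or its lower port on 1."""
    bits = format(dest, "0{}b".format(num_stages))
    return ["upper" if b == "0" else "lower" for b in bits]

# Usage: an 8x8 fabric has log2(8) = 3 stages; destination 5 is binary 101.
print(banyan_route(5, 3))  # ['lower', 'upper', 'lower']
```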
2.1.4 Packet-Switched Network.
In multiplexing, frequency bands (FDM) or time slots (TDM) are used to establish occupancy slots to which individual data sources are assigned. Thereafter, information in the form of voice or data uses the reserved slot for the duration of the voice call or data-transmission session. In packet switching, specialized equipment called a packet assembler/disassembler (PAD) divides data into defined segments to which addressing, sequencing and error-control information are added. The resulting unit of data is called a packet and may represent a user message or a very small portion of a user message [2].
Fig. 2.20. A banyan switch architecture
The flow of packets between nodes in a packet network is intermixed with respect to the origins and destinations of the packets. That is, traffic in the form of packets from many users can share large portions of the transmission facilities used to form a packet network. Thus, use of a packet network is normally more economical than transmission over the public switched telephone network for long-distance transmission. Figure 2.21 illustrates the architecture of a packet-switched network.
Fig. 2.21 Packet-switched network architecture [2]
DCE – Data Circuit-terminating Equipment
DSE – Data Switching Exchange.
The packet-mode switching network is constructed through the use of PADs and equipment that routes and transmits packets. Some types of DTEs can create their own packets, while other types of DTEs require the conversion of their protocol into packets through the use of a packet assembler/disassembler (PAD). Equipment that routes packets through the network is called a packet switch. Packet switches examine the destination of packets as they flow through the network and transfer the packets onto trunks interconnecting switches, based upon the packet destination and network activity.
There are two basic approaches to transferring information over a packet-switched network. The first approach, called connection-oriented, involves setting up a connection across the network before information can be transferred. The setup procedure typically involves the exchange of signaling messages and the allocation of resources along the path from input to output for the duration of the connection. The second approach is connectionless and does not involve prior allocation of resources; instead, the packet is routed independently from switch to switch until it arrives at its destination.
2.1.5 Categories of Packet-Switched Network
Packet-switched networks are divided into two categories: datagram and virtual-circuit networks. Figure 2.23 shows datagram packet switching
concepts.
Datagram Network
In datagram packet switching, there is no prior resource allocation for a packet. This means that there is no reserved bandwidth on the links and no scheduled processing time for each packet. Resources are allocated on demand, on a first-come, first-served basis. When a switch receives a packet, no matter what its source or destination, the packet must wait if other packets are being processed. This lack of reservation may create delay.
In a datagram network, each packet is treated independently of all the others. Even if a packet is part of a multipacket transmission, the network treats it as though it existed alone.
Datagram switching is normally done at the network layer. Figure 2.23 shows how the datagram approach is used to deliver four packets from station A to station X. The switches in a datagram network are traditionally referred to as routers [1].
In figure 2.23 all four packets (or datagrams) belong to the same message, but they may travel different paths to reach their destination. This is because the links may be involved in carrying packets from other sources and may not have the necessary bandwidth available to carry all the packets from A to X. This approach can cause the datagrams of a transmission to arrive at their destination out of order, with different delays
Figure 2.23. A block diagram of datagram network [1].
between the packets. Packets may also be lost or dropped because of a lack of resources. In most protocols, it is the responsibility of an upper-layer protocol to reorder the datagrams or ask for lost datagrams before passing them on to the application. Datagram networks are referred to as connectionless networks because the packet switch does not keep information about the connection state. There are no setup or teardown phases; each packet is treated the same by a switch regardless of its source or destination.
However, each switch has a routing table that is based on the destination address. The routing tables are dynamic and are updated periodically. The destination addresses and their corresponding forwarding output ports are recorded in the routing tables. This is different from the table of a circuit-switched network, in which each entry is created when the setup phase is completed and deleted when the teardown phase is over. Figure 2.24 shows the routing table for a packet switch in a datagram network.
Each packet in a datagram network carries a header that contains, among other data, the destination address of the packet. When the switch receives the packet, this destination address is examined and the routing table is consulted to find the corresponding port through which the packet should be forwarded. This address remains the same during the entire journey of the packet.
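As a minimal sketch, datagram forwarding is a single table lookup keyed by the destination address. The table values below are those of figure 2.24; the function name is illustrative.

```python
# Destination address -> output port, as in the figure 2.24 example.
routing_table = {1232: 1, 4160: 2, 8130: 3}

def forward_datagram(dest_addr):
    """Datagram forwarding: look up the header's destination address in the
    routing table; the address itself is never rewritten en route."""
    return routing_table[dest_addr]

print(forward_datagram(4160))  # port 2
```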
Virtual-Circuit network.
In a virtual-circuit network, the following characteristics are common:
1. There are setup and teardown phases in addition to data transfer.
2. Resources can be allocated during the setup phase, or on demand as in a datagram network.
3. Each packet carries an address in the header. However, the address in the packet header has local jurisdiction: it defines the next switch and the channel on which the packet is carried, not the end-to-end route.
4. All packets follow the same path established during the connection.
5. A virtual circuit is normally implemented in the data link layer, while a circuit-switched network is implemented in the physical layer and a datagram network is implemented in the network layer.
Fig. 2.25. A switch topology in a virtual-circuit network [1]
Figure 2.25 is an illustration of a virtual-circuit network for four end systems for the switches. The network has switches that allow traffic
from sources to destinations. A source or destination can be a computer, packet switch, or any device that connects other networks.
Fig. 2.24. Routing table in a datagram network [1]
Destination address   Output port
1232                  1
4160                  2
...                   ...
8130                  3
Figure 2.26 illustrates the data flow in a virtual-circuit packet network. Although a route will be established based upon network activity, once
established the route remains fixed for the duration of the call. Thus, packets will flow in sequence through each switch which reduces both the
amount of processing required to be performed at each switch and delays associated with waiting for out of sequence packets to arrive at a
destination node prior to being able to pass an ordered sequence of packets to their destination.
Fig. 2.26 Data flow in virtual-circuit packet network
Addressing in a virtual-circuit network.
In a virtual-circuit network, two types of addressing are involved: global and local. The global address is used only to create a virtual-circuit identifier (VCI). This identifier, used for data transfer, is a small number that has only switch scope: it is used by a frame between two switches. When a frame arrives at a switch it has one VCI; when it leaves, it has a different one [1]. Figure 2.27 shows how the VCI in a data frame changes from one switch to another. Each switch uses its own unique set of VCIs.
Phases of switching in virtual-circuit network.
For data transfer to take place in a virtual-circuit network, the source and destination need to go through three phases: setup, data transfer, and teardown. In the setup phase, the source and destination use their global addresses to help the switches make table entries for the connection. In the teardown phase, the source and destination inform the switches to delete the corresponding entries. Data transfer occurs between these two phases.
(1) Setup phase.
In the setup phase, a switch creates an entry for a virtual circuit. The source sends a setup request frame, which is followed by an acknowledgement if the destination is prepared to accept a link for data transfer. Figure 2.28 shows the process of a setup request from source A to destination B.
Fig. 2.27 Virtual-circuit identifier.
a. Source A sends a setup request frame to switch 1.
b. Switch 1 receives the setup request frame. It knows that a frame going from A to B goes out through port 3. (In the setup phase, the switch acts as a packet switch; it has a routing table, which is different from the switching table. For the moment, assume that it knows the output port.) The switch creates an entry in its table for this virtual circuit, but it is only able to fill three of the four columns: it assigns the incoming port (1), chooses an available incoming VCI (14), and records the outgoing port (3). It does not yet know the outgoing VCI, which will be found during the acknowledgment step. The switch then forwards the frame through port 3 to switch 2.
c. Switch 2 receives the setup request frame. The same events happen here as at switch 1; three columns of the table are completed: in this case, incoming port (1), incoming VCI (66), and outgoing port (2).
d. Switch 3 receives the setup request frame. Again, three columns are completed: incoming port (2), incoming VCI (22), and outgoing port (3).
e. Destination B receives the setup frame and, if it is ready to receive frames from A, assigns a VCI to the incoming frames that come from A, in this case 77. This VCI lets the destination know that the frames come from A and not from other sources.
(2) Acknowledgments.
A special frame, called the acknowledgment frame, completes the entries in the switching table. Figure 2.29 shows the process.
a. The destination sends an acknowledgment to switch 3. The acknowledgment carries the global source and
destination addresses so the switch knows which entry in the table is to be completed. The frame also carries VCI 77, chosen by the
destination as the incoming VCI for frames from A. Switch 3 uses this VCI to complete the outgoing VCI
column for this entry. Note that 77 is the incoming VCI for destination B, but the outgoing VCI for switch 3.
b. Switch 3 sends an acknowledgment to switch 2 that contains its incoming VCI in the table, chosen in the previous step.
Switch 2 uses this as the outgoing VCI in the table.
c. Switch 2 sends an acknowledgment to switch 1 that contains its incoming VCI in the table, chosen in the previous step.
Switch 1 uses this as the outgoing VCI in the table.
Fig. 2.28. Setup request in a virtual-circuit network
d. Finally, switch 1 sends an acknowledgment to source A that contains its incoming VCI in the table, chosen in the previous step.
e. The source uses this as the outgoing VCI for the data frames to be sent to destination B.
(3) Data transfer phase.
To transfer a frame from a source to its destination, all switches need to have a table entry for this virtual circuit. Each table, in its simplest form, has four columns. The entries in the tables were made during the setup and acknowledgment phases. Figure 2.30 shows a frame arriving at port 1 with a VCI of 14. When the frame arrives, the switch looks in its table for port 1 and a VCI of 14. When the entry is found, the switch knows to change the VCI to 22 and send the frame out from
port 3.
Fig. 2.29 Setup acknowledgment in a virtual-circuit network
Figure 2.30 shows how a frame from source A reaches destination B and how its VCI changes during the trip. Each switch changes the
VCI and routes the frame.
The data transfer phase is active until the source sends all its frames to the destination. The procedure at the switch is the same for each frame of
a message. The process creates a virtual circuit between the source and destination.
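Per-switch forwarding in the data-transfer phase thus amounts to a lookup on (incoming port, incoming VCI) followed by a label rewrite. The sketch below illustrates this with switch 1's entry from the example in the text (port 1, VCI 14 mapped to port 3, VCI 22); the function and table names are invented.

```python
def forward(frame_vci, in_port, table):
    """Label swapping at one switch in the data-transfer phase: look up
    (incoming port, incoming VCI), then rewrite the VCI and pick the
    output port from the switching-table entry."""
    out_port, out_vci = table[(in_port, frame_vci)]
    return out_port, out_vci

# Switch 1's entry from the text: port 1, VCI 14 -> port 3, VCI 22.
table1 = {(1, 14): (3, 22)}
print(forward(14, 1, table1))  # (3, 22)
```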
Fig. 2.30 Data transfer phase switching and tables in a virtual-circuit network [1]
(4) Teardown phase
In this phase, source A, after sending all frames to B, sends a special frame called a teardown request. Destination B responds with a teardown
confirmation frame. All switches delete the corresponding entry from their table.
Fig. 2.31 Source-to-destination data transfer [1]
The ATM network is a typical cell-switched network using the virtual-circuit approach [9]. ATM is more efficient than the plesiochronous digital hierarchy (PDH) and the synchronous digital hierarchy (SDH) because it dynamically and optimally allocates available network resources via cell-relay switching. It is the transfer mode, or protocol, adopted for the broadband integrated services digital network (B-ISDN), which supports all types of
interactive point-to-point and distributive point-to-multipoint communication services [10].
ATM breaks the information bit stream, whatever its origin (voice, text, data, etc.), into small packets of fixed length. A header is attached to each data packet to enable delivery to the correct destination. The fixed-length combination of service data and header is known as an ATM cell, which is 53 bytes long, with a 48-byte payload that carries the service data and a 5-byte header that carries identification, control, and routing information. The maximum transmission efficiency of ATM is therefore [11]
η_ATM = (48/53) x 100% ≈ 90.57%.
The user access devices, called the endpoints, are connected through a user-to-network interface (UNI) to the switches inside the network. The
switches are connected through network-to-network interfaces (NNI). The switch is able to cross-connect VPs and also to sort and switch their
VC contents as shown in figure 2.32.
Connection between two endpoints is accomplished through transmission paths (TPs), virtual paths (VPs), and virtual circuits (VCs). A
transmission path is the physical connection between an endpoint and a switch or between two switches. A transmission path is divided into
several virtual paths. A virtual path provides a connection or a set of connections between two switches. ATM networks are based on virtual
circuits (VCs). All cells belonging to a single frame follow the same virtual circuit and remain in their original order until they reach their
destination.
Fig. 2.32 Generic architecture of an ATM network [6]
2.2 Classification of Packet-Switch Architectures
There have been quite a number of switch-architecture overview papers, such as [12], [13], [14], [15], [16], [17], and [18]. A more recent overview, focusing especially on space-division architectures, was presented in [19]. These papers used many different ways to classify switch architectures into categories. Some of the criteria are blocking vs. non-blocking, buffering strategy (input-buffered, output-buffered or combined), lossy vs. lossless, single-stage vs. multi-stage, buffer implementation (partitioned, grouped, shared), and time- or space-divided (or
combined TST/STS). Many of these categorizations focus too strongly on implementation details. In the following overview of packet-switch architectures, the main focus is on the queueing discipline. From a high-level point of view, regarding the switch as a black box that moves data from A to B, the choice of queueing discipline is itself an implementation issue, so a few implementations are presented for emphasis.
For the purpose of this dissertation, the correct abstraction level is that of the internal switch architecture, and thus the queueing discipline, because this is the determining factor of switch performance [19].
2.2.1 Single-stage vs. multi-stage switches
Fig. 2.1 depicts a kN x kN Benes multistage switch fabric [20], consisting of three stages of k N x N switches each. For large networks, the multistage approach is much cheaper in terms of the required number of switch elements than single-stage port expansion. In general, for an M x M fabric with M = kN, kS switch elements are required for an S-stage fabric, compared with k² for a single-stage fabric.
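A quick sketch of this element-count argument, under the stated assumption M = kN with S stages (the function names are illustrative):

```python
def multistage_elements(k, stages=3):
    """Number of N x N switch elements in an S-stage fabric of size kN x kN."""
    return k * stages

def single_stage_elements(k):
    """Single-stage port expansion to kN x kN needs a k x k array of N x N elements."""
    return k * k

for k in (4, 16, 64):
    print(k, multistage_elements(k), single_stage_elements(k))
# For k = 64, a 3-stage fabric needs 192 elements versus 4096 single-stage.
```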
Single-stage architectures have a strong performance advantage over multi-stage architectures, but inherently are not as scalable. Therefore, as
bandwidth demands continue to rise, there will be a growing need for multi-stage fabrics because these can scale to many times more ports than
any single-stage architecture can [21].
Although a multi-stage fabric is a cascade of single-stage switches, the extension of a given single-stage architecture to multi-stage is far from trivial. To convey the complexity of making multi-stage fabrics work, some issues are highlighted:
(a) Network topology: which topology (e.g. Benes, Banyan, BMIN, Hypercube, Torus) is the right one for a given application?
(b) Performance: both in terms of delay, because more stages mean more latency, and in terms of throughput, because multi-stage fabrics are often internally blocking owing to a combination of the fabric's interconnection topology, static routing, and higher-order HoL blocking.
(c) Fabric-internal routing: static routing is easy to implement but leads to poor performance under unfavorable traffic patterns because it cannot adapt to congestion, whereas adaptive routing can improve performance but is expensive to implement and may lead to out-of-order delivery. Either source routing or per-hop look-up is required.
(d) Flow control: to maintain performance, prevent higher-order HoL blocking, and guarantee fabric-internal losslessness, proper flow control is required. To obtain maximum performance, global end-to-end flow control may be necessary, which presents significant scaling problems.
(e) Multicast support: multicast traffic requires both proper duplication strategies and destination-set encoding with corresponding lookup tables in every switch to maintain scalability (carrying full bitmaps is clearly not feasible for fabrics with hundreds of ports).
Figure 2.1: A kN x kN multistage network consisting of 3k N x N switch elements.
(f) Practical implementation and cost: these also form a significant issue, motivating the need for a scalable architecture.
Given an N x N switch element without buffers, and assuming persistent traffic sources with uniformly distributed packet destinations, the switch throughput was derived in [22].
The probability that m packets arrive simultaneously for the same output is
\[ P(X = m) = \binom{N}{m}\left(\frac{1}{N}\right)^{m}\left(1-\frac{1}{N}\right)^{N-m} \tag{2.1} \]
The throughput T_N equals the sum over all i of the probability that at least one cell is destined to output i, divided by N, i.e. T_N = 1 - (1 - 1/N)^N. For large N the throughput is
\[ T_{\infty} = \lim_{N\to\infty}\left[1-\left(1-\frac{1}{N}\right)^{N}\right] = 1-e^{-1} \approx 0.6320 \tag{2.2} \]
As there are no buffers, the cell-loss ratio equals
\[ 1-T_{\infty} = e^{-1} \approx 0.3680 \tag{2.3} \]
Outcome:
The outcome is that, under the assumption of i.i.d. random sources, the maximum throughput reaches 63%, while 37% of all packets are discarded [23]. Obviously, this packet-loss rate is unacceptable for any practical purpose, so the next consideration is switch architectures that incorporate some form of queuing [24].
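A small Monte Carlo sketch can confirm equations (2.1)-(2.3). It assumes persistent sources with uniformly distributed destinations, as in the derivation; all names are illustrative.

```python
import random

def bufferless_throughput(n, slots, seed=0):
    """Each of n inputs sends one packet per slot to a uniformly chosen
    output; without buffers, each output delivers at most one packet per
    slot and the remaining packets are lost."""
    rng = random.Random(seed)
    delivered = 0
    for _ in range(slots):
        destinations = [rng.randrange(n) for _ in range(n)]
        delivered += len(set(destinations))  # one winner per contested output
    return delivered / (n * slots)

print(bufferless_throughput(32, 10000))  # ~0.64; approaches 1 - 1/e ~ 0.632 as n grows
```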
2.2.2 Input Queuing
A switch belongs to the class of input-queuing switches if the buffering function is performed before the routing function [24]. Fig. 2.2 shows the architecture of an input-queued (IQ) packet switch. The bandwidth requirement on each buffer is proportional only to the port speed, and not to the number of switch ports [22].
Only one read and one write operation are required per packet cycle on each buffer, so that with a port speed of B, the total bandwidth is equal to
2B per buffer and 2NB aggregate for a switch of size N x N.
It is obvious from the diagram that the packet on input 1 destined to output x suffers HoL blocking owing to contention for output y. Output contention must nevertheless be resolved; this task is executed by the arbiter unit depicted in Fig. 2.2.
In each packet cycle, a selection is made from the set of head-of-queue packets to be forwarded to the outputs during this cycle. This selection
must satisfy the condition that at most one packet is forwarded to any single output.
Additionally, the selection policy must be designed such that fairness among inputs and outputs is guaranteed. Once the selection has been made,
the selected packets are removed from their buffers and forwarded through the routing fabric, which can be a simple crossbar. The three-phase
algorithm first presented in [24] is an example of such a selection algorithm. In an input-queued switch with FIFO queues, an input has at most
one packet that is eligible for transmission, namely the HoL packet. When output contention occurs (multiple inputs have a packet for the same
output), one and only one of these inputs is granted permission to transmit according to a certain rule. Various proposals to arbitrate amongst HoL packets destined to the same output have been made and investigated.
Figure 2.2: Input queuing with FIFO queues.
Outcome
It has been shown that the input-queuing scheme limits switch throughput to a theoretical maximum of merely 2 − √2 ≈ 58.6% of the maximum bandwidth under the assumption of independent, uniform Bernoulli traffic sources [24]. Owing to the increased correlation in the traffic arrivals caused by the buffering, adding buffers will actually degrade throughput compared to the no-buffer case. Additionally, it has been shown that the HoL selection policy does not affect throughput, although it may affect average delay to a small degree [20, 25].
Performance analysis model
In [26], the switch is modeled as N independent single-server queues (one for each input port), where each queue has an effective service-time distribution that accounts for the HoL contention. More specifically, it was assumed for each input port that there is a probability q of winning switch arbitration, where q is a function of λ, the traffic carried per port.
The average delay D was derived as a function of the average input load p assuming an infinite number of switch ports (N → ∞), infinite
buffering and that a packet arrived at each port with probability λ per slot, destined with equal probability of (1 / N ) to any one of the output
ports. Each input port had a buffer for incoming packets, which were served on an FCFS basis at the beginning of each slot.
The analysis was then partitioned into two parts: q (determined as a function of λ) and the individual input queues.
The throughput per port was given as the steady-state expected value of T:
\[ T = \frac{1}{N}\,E\!\left[\sum_{j=1}^{N}\varepsilon_{j}(N)\right] = E\!\left[\varepsilon_{j}(N)\right] \tag{2.4} \]
In the steady state, T = λ
where the number of successful packet deliveries during a slot equals [26]
\[ \sum_{j=1}^{N}\varepsilon_{j}(N) \tag{2.5} \]
The outcome:
\[ T(\lambda) = \lambda, \qquad \rho = \frac{\lambda(2-\lambda)}{2(1-\lambda)} \tag{2.6} \]
where ρ is the probability of occupancy for each queue. From this was obtained
\[ \lambda^{2} - 2(1+\rho)\lambda + 2\rho = 0 \tag{2.7} \]
Setting ρ = 1 (saturation) gives λ² − 4λ + 2 = 0, so the maximum λ is 2 − √2 ≈ 0.586.
The value of q as a function of λ was computed as
\[ q(\lambda) = \frac{1}{1+\dfrac{\lambda}{2(1-\lambda)}} = \frac{2(1-\lambda)}{2-\lambda} \tag{2.8} \]
The average delay is
\[ D = \frac{(2-p)(1-p)}{\left(2-\sqrt{2}-p\right)\left(2+\sqrt{2}-p\right)} \tag{2.9} \]
with 0 ≤ p < 2 − √2, where p ≡ λ.
And the average queue length is [26]
\[ \bar{k} = E\!\left[N_{j}(\lambda)\right] = \lambda + \frac{\lambda^{2}}{2(1-\lambda)} \tag{2.10} \]
Performance disadvantages:
The maximum throughput of 0.58 is low. It was concluded in [27] that the essential cause of this low maximum throughput was that a blocked
packet at the head of an input queue prevented other packets behind it destined to idle outputs from being forwarded. In [27] it was shown that
throughput even decreased with increasingly correlated traffic (burstiness), to as little as 50% for strongly correlated traffic. In [28] the analysis
concluded that this type of input-queued switch was not work conserving – there is a packet waiting for output x, but it cannot be served because
it is blocked by a packet destined to output y.
Improvements:
Spurred by this discovery of the inherent performance disadvantage of FIFO-based input buffering compared to output buffering, various approaches were proposed to overcome the limitation. The general approach is to make more than one packet from each input queue eligible for transmission. Early approaches were mainly based on scheduling packets into the future, or on the use of random-access buffers instead of FIFOs. The following are a few of the early approaches:
(a) Port grouping
The grouping of multiple physical outputs into one logical output was proposed in [30].
Such channel grouping resulted in a switch with fewer, but faster ports. The arbitration will redirect a request for a logical port to the first
corresponding idle physical port, so that HoL blocking is reduced. This principle was applied in the GIGA switch architecture [29].
(b) Scheduling packets by advance time slot.
In [31] a distributed contention-resolution mechanism was presented. Transmission times were dynamically allocated to packets stored in the
input queues. The algorithm consists of request, arbitration, and transmission phases. In the request phase, each input sends a request for its HoL
packet to the requested output. Each output j counts the requests directed to it, Rj. Output j keeps track of the next available transmission slot Tj It
assigns transmission slots Tj through Tj + Rj – 1 to the requesting inputs, and updates Tj to Tj + Rj. Upon reception of a transmission slot, an
input checks its transmission table; if the assigned slot is not yet reserved, the packet is removed from the queue and will be transmitted at the
assigned timeslot. Otherwise, the packet remains at HoL and has to retry in the next slot. This scheme reduces the HoL blocking, because packets
may be scheduled for future timeslots, thus bringing forward new HoL packets sooner than the three-phase algorithm. However, as timeslots
assigned by an output may have already been reserved at the requesting input, slots may be wasted, thus limiting maximum throughput.
Approximate analysis and simulation [31] show that maximum throughput under uniform i.i.d. traffic was about 76% (for a 16 x 16
system), up from 58% for conventional FIFO buffers.
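The slot-assignment step of this three-phase algorithm can be sketched as follows; this is a minimal Python illustration of the description above (the identifiers and the single-slot granularity are our simplifications, not the code of [31]):

T = [0, 0, 0, 0]                         # next available transmission slot, per output
tables = [set() for _ in range(4)]       # slots already reserved, per input

def schedule_slot(requests, now):
    # requests: dict input -> requested output (destination of its HoL packet)
    by_output = {}
    for inp, out in requests.items():
        by_output.setdefault(out, []).append(inp)
    granted = {}
    for out, inps in by_output.items():
        base = max(T[out], now)
        for offset, inp in enumerate(inps):
            slot = base + offset
            if slot not in tables[inp]:  # input accepts only if slot still free;
                tables[inp].add(slot)    # otherwise the slot is wasted
                granted[inp] = (out, slot)
        T[out] = base + len(inps)        # output advances T_j by R_j regardless
    return granted

print(schedule_slot({0: 2, 1: 2, 3: 0}, now=0))

Note how an output advances its counter even when the requesting input rejects the assigned slot; this is exactly the wasted-slot effect that limits the maximum throughput.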
A time-slot reservation scheme was proposed in [30,31,32]. In this scheme, contention resolution is performed by a number of reservation tables
RTi, 1 ≤ i ≤ N, one per input. RTi indicates the reservation state of the outputs at t + i – 1, where t is the current timeslot. Each RT contains one
entry per output, indicating whether the output is reserved at that time. Each input i attempts to match a packet in its queue to a non-reserved slot
in the corresponding table RTi , and reserves the corresponding entry when successful. The packet selected is removed from the queue and will be
transmitted at t + i. Each timeslot, RTi is copied into RTi-1, RT1 is discarded, whereas RTN is set to all non-reserved. The approach requires
FIRO (First In, Random Out) buffers, because the algorithm performs a search through each input buffer (up to a certain depth d) to find a
packet that can occupy a non-reserved slot in the reservation table. For a search-depth d=16 and a 16 x 16 system, the maximum throughput
under uniform i.i.d. traffic is about 90%. Special care must be taken to prevent unfairness among inputs because input 1 has a much smaller
chance of reserving a slot than input N, owing to the way the tables are shifted from N to 1 as time progresses.
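A minimal sketch of one timeslot of this reservation-table scheme, assuming the shifting behaviour described above (identifiers are illustrative):

from collections import deque

def reserve_and_shift(queues, tables, depth):
    # One timeslot: input i searches its buffer (up to 'depth' packets) for
    # one that fits a non-reserved entry in its table RT_i; a matched packet
    # leaves the queue and will be transmitted at t + i.
    N = len(queues)
    departures = []
    for i in range(N):
        rt = tables[i]
        for pos in range(min(depth, len(queues[i]))):
            out = queues[i][pos]
            if not rt[out]:
                rt[out] = True
                queues[i].pop(pos)
                departures.append((i, out, i + 1))   # (input, output, t + i)
                break
    tables.popleft()                  # RT_1 expires ...
    tables.append([False] * N)        # ... and a fresh all-free RT_N enters
    return departures

N = 4
queues = [[1, 2], [1], [3, 0], [2]]
tables = deque([[False] * N for _ in range(N)])
print(reserve_and_shift(queues, tables, depth=2))

The table shift in the last two steps is also where the unfairness noted above originates: input 1 always works on the table about to expire, while input N always receives a fresh, empty one.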
In [31] an improvement to the algorithm of [33] was presented. To reduce the probability of assigned timeslots being wasted, the inputs of the
switch can be grouped into groups of size k. From each group of inputs, k packets may depart in one timeslot. Thus, up to k conflicting timeslot
assignments can be managed by an input group, which drastically reduces the probability of a timeslot being wasted. Simulations have shown that
maximum throughput increases to over 90% for a group size of 4 for a 64 x 64 switch under uniform i.i.d. traffic. A further improvement,
proposed in [34], is the grouping of outputs as well as inputs, thus creating a combined input- and output-queued switching fabric.
In [35] the problem of what is termed “turn-around time” (TAT) between input controllers and a centralized contention-resolution unit was
addressed in particular. It was shown that the TAT equals the amount of time it takes to complete the scheduling of one packet (request-
acknowledge). By pipelining the requests to the contention-resolution unit, throughput degradation due to long TAT can be prevented.
(c) Window-based approach
The window-based approach was proposed in [36]. The window-based schemes essentially search the input queue for packets destined to idle
outputs. The strict FIFO queuing discipline of the input queues is relaxed, allowing other packets than the ones at HoL to contend for idle outputs.
Each input queue first contends with its HoL packet. The HoL packets that “win” the contention and their respective outputs are removed from
the contention process; the remaining inputs then contend with the next packet in their queues, and this repeats up to the window depth ω, where ω = 1 corresponds to strict FIFO queuing. For
uniform i.i.d. traffic this approach improved throughput considerably, even for small ω. For a 16 x 16 switch, throughput increases to 77% for a
window size ω = 3, and to 88% for ω = 8, as reported in [36]. However, as traffic exhibits a more bursty nature, the number of packets destined to
different outputs within the window ω decreases rapidly, so that a much larger window is required to achieve the same throughput, which
requires more processing power in the scheduling unit. Typically, the maximum input queue size Q is much larger than the number of outputs N.
Therefore, it makes sense to sort the input queue based on the destination of the packets; this way, only N packets need to be considered, instead
of up to Q. In [22] it was reported that this arrangement was proposed as early as 1984 in a paper by McMillen R. J., titled “Packet Switched
Multiple Queue Switch Node and Processing Method,” US Patent no. 4,623,996, filed Oct. 18, 1984, published Nov. 18, 1986; and has received
a significant amount of attention in recent years.
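The windowed contention itself can be sketched as a repeated matching over queue positions; the following Python fragment is an illustrative approximation of the scheme in [36], not the original algorithm (ω = 1 reduces to plain HoL contention):

import random

def window_select(queues, n_outputs, w):
    # One matching round with window depth w: inputs repeatedly contend for
    # idle outputs with packets up to position w - 1 in their queue.
    idle = set(range(n_outputs))
    granted = {}                        # input -> queue position of its winner
    for pos in range(w):                # pos 0 is the HoL packet
        contenders = {}
        for i, q in enumerate(queues):
            if i in granted or pos >= len(q):
                continue
            if q[pos] in idle:
                contenders.setdefault(q[pos], []).append(i)
        for out, inps in contenders.items():
            winner = random.choice(inps)    # random tie-break among inputs
            granted[winner] = pos
            idle.discard(out)
    return granted

print(window_select([[0, 1], [0, 2], [0]], n_outputs=3, w=2))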
Other variations:
Other solutions in [37, 19] involve running the switch fabric at higher speeds than the input and output lines, or using multiple switch fabrics in
parallel. These include the following:
1. Output-port expansion: Output-port expansion with a factor F requires an N x NF switch fabric, where each group of F physical output ports
corresponds to one logical output. This way, up to F packets can be delivered to an output in each cycle. This approach is conceptually
similar to output-channel grouping.
2. Input-port expansion: Input-port expansion with a factor F requires an N x NF switch fabric, where each group of F physical input ports
corresponds to one logical input. This allows F packets from each input to contend in each cycle. This differs from the windowing scheme in that
more than one packet can be selected from the same input.
3. Switch speed-up: If the switch operates at a speed F times faster than the links, F packets can be transported from a single input, and also F
packets can be delivered to an output in one cycle. (More conventional speed-up schemes allow multiple packets to be delivered to one output,
but allow only one packet per input to be transmitted [38].)
As multiple packets are being delivered to an output in one cycle, some degree of output queuing is also required, however. This was the
immediate motivation for the next architecture: output queueing.
2.3 Output Queueing
The alternative classic solution to input queueing is output queueing [22, 27], as in Fig. 2.3. Here, the buffering function is performed after the
routing function. The buffers are placed at the switch fabric outputs. Theoretically [22], output queuing offers maximum, ideal performance,
because
• HoL blocking does not exist, thus lifting the throughput limitation from which input queuing suffers,
• and contention is reduced to a minimum: only output contention occurs, which is unavoidable because of the statistical nature of
packet-switched traffic, but there is no input contention. This leads to lower average delays than input queuing, even at low
utilization.
It can be shown theoretically [22] that a non-blocking output-queuing switch having queues of infinite size offers the best performance in terms
of both throughput and delay. This type of switch is the only type that is truly work-conserving (in the sense of our definition of work-
conserving) under any traffic pattern.
Performance analysis model
In one packet cycle, up to N packets may arrive destined for the same output, so that N writes (N arriving packets) and one read (one
departing packet) are required. In [14] a tagged output queue was considered for analysis.
The number of packet arrivals, A, is represented as

a_k ≜ Pr[A = k] = (N choose k)·(p/N)^k·(1 − p/N)^(N−k),   k = 0, 1, 2, ..., N   (2.11)

Fig. 2.3: Output Queuing [22]
which for N = ∞ becomes

a_k ≜ Pr[A = k] = (p^k · e^(−p)) / k!,   k = 0, 1, 2, ...   (2.12)
With Q_m denoting the number of packets in the tagged queue at the end of the m-th time slot, A_m the number of packet arrivals during
the m-th time slot, and b the output buffer size,

Q_m = min{ max(0, Q_{m−1} + A_m − 1), b }

in which, when Q_{m−1} = 0 and A_m > 0, one of the arriving packets is immediately transmitted during the m-th time slot; that is, a packet flows through the
switch without suffering any delay.
For N → ∞ and b → ∞, the queue size Q_m was modeled as an M/D/1 queue.
For finite N and b, Q was modeled by a finite-state, discrete-time Markov chain with state transition probabilities P_ij ≜ Pr[Q_m = j | Q_{m−1} = i]
given by

P_ij = a_0 + a_1,                 for i = 0, j = 0
P_ij = a_{j+1},                   for i = 0, 1 ≤ j ≤ b − 1
P_ij = Σ_{m=b+1}^{N} a_m,         for i = 0, j = b
P_ij = a_{j−i+1},                 for 1 ≤ i ≤ b, i − 1 ≤ j ≤ b − 1
P_ij = Σ_{m=b−i+1}^{N} a_m,       for 1 ≤ i ≤ b, j = b
P_ij = 0,                         otherwise   (2.13)

where a_k is given by (2.11) and (2.12) for N < ∞ and N = ∞, respectively.
The steady-state queue size was obtained, and from it the normalized throughput and the loss rate at arrival rate p:

ρ_0 = 1 − q_0·a_0   (2.14)

where q_0 is the steady-state probability that the tagged queue is empty, and

Pr[packet loss] = 1 − ρ_0 / p   (2.15)
The mean waiting time, W, for a packet making it into an output FIFO, which equals the average queuing delay D of this output-queued switch
fabric with infinite-sized queues under Bernoulli traffic as N → ∞, is [22]:

D = W = p / (2(1 − p))   (2.16)

where p (0 ≤ p < 1) is the average input load.
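As a numerical check of this model, the following Python sketch (our own illustration) builds the arrival distribution of (2.11), the transition matrix of (2.13), and derives the steady-state throughput of (2.14) and the loss rate of (2.15) for one tagged output queue:

import numpy as np
from math import comb

def tagged_output_queue(N, p, b):
    # a_k of (2.11): binomial arrivals to the tagged output queue
    a = [comb(N, k) * (p / N) ** k * (1 - p / N) ** (N - k) for k in range(N + 1)]
    A = lambda k: a[k] if 0 <= k <= N else 0.0
    # transition matrix P_ij of (2.13)
    P = np.zeros((b + 1, b + 1))
    P[0, 0] = A(0) + A(1)                    # empty queue: 0 or 1 arrival
    for j in range(1, b):
        P[0, j] = A(j + 1)                   # one arrival cuts through
    P[0, b] = sum(A(m) for m in range(b + 1, N + 1))
    for i in range(1, b + 1):
        for j in range(i - 1, b):
            P[i, j] = A(j - i + 1)           # j = i + A_m - 1
        P[i, b] = sum(A(m) for m in range(b - i + 1, N + 1))
    # steady-state distribution: left eigenvector of P for eigenvalue 1
    w, v = np.linalg.eig(P.T)
    q = np.real(v[:, np.argmin(np.abs(w - 1))])
    q /= q.sum()
    rho = 1 - q[0] * A(0)                    # normalized throughput (2.14)
    return rho, 1 - rho / p                  # and Pr[packet loss] (2.15)

print(tagged_output_queue(N=16, p=0.9, b=8))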
Major Limitations of Output-queued architecture
(a) The major drawback was that both the speed of the internal interconnect and the memory bandwidth had to be N times higher than the
external line speed in order to be able to transport N packets to a single output queue in one cycle, whereas an input-queued switch suffered
no such limitation. This requirement made output queuing in general an expensive solution because here buffers with an access rate N times
faster than in an input-queued architecture are required, making the output-queued solution less suitable for very high-speed packet switches.
Additionally, it did not scale well to larger switches, as the speed-up factor increased with switch size [39].
(b) The bandwidth requirement on a single buffer was then proportional to both port speed and number of ports, because in one packet cycle N
packets may arrive destined for the same output, so that N writes (N arriving packets) and one read (one departing packet) would be required.
Thus, the bandwidth requirement equals (N+1)C per output queue, with C being the port speed and N the number of input ports, and the
aggregate bandwidth equals N(N+1)C, which is quadratic in the number of ports; therefore output queuing was inherently less scalable
to more and/or faster ports than input queuing [40].
Examples of output-queued switch architectures are the Buffered Crossbar, the Knockout, the Gauss, and the ATOM switch.
2.4 Shared Queuing
Both the input- and the output-queued architectures treated so far have assumed that each input or output queue has a fixed amount of buffer
space that is dedicated to this queue. This strategy may lead to suboptimum use of resources: for instance, in case some ports are heavily loaded
while others are mostly idle, the former may obviously benefit from temporarily getting more resources, while the latter may get by with less.
From this realization the concept of shared queuing was born [22].
This concept implies that buffer resources in a switch fabric are pooled together. All logical input or output queues or groups thereof can use
resources from a buffer pool until the pool is exhausted. The aim is to achieve better resource efficiency.
Although the sharing concept has proven effective especially in reducing packet-loss rates, it can lead to unfairness, as packets from one
particular input or destined to one particular output can monopolize the shared resource, thus actually causing performance degradation, as was
also pointed out in [36] and [37].
The following two sections provide some examples of both shared input and shared output queuing [20]. Practical implementations incorporating
the shared-queuing approach include the Starlite switch, the Vulcan switch, the ATLAS switch, the PRIZMA switch, the Swedish ATM switch
core [20, 41, 42, 43, 44, 45] and the Hitachi ATM switch [20, 46].
2.4.1 Shared Input queueing
In [47] an input-buffered scheme was introduced that groups sets of k inputs together, providing a shared buffer per group; see Fig. 2.4. The
operation principle is that in each cycle, up to k packets are selected from each group; the queuing discipline is no longer strictly FIFO, so that
HoL blocking is alleviated. The packets to be forwarded are selected using the following algorithm:
1. Assign the “idle” flag to all output ports;
2. Randomly choose a grouped queue;
3. From this grouped queue, choose up to k packets having differing, idle destinations in a queuing-age sequence. Mark the selected
outputs “occupied”.
4. Repeat step 3 for the next grouped queue, until either all grouped queues have been processed or all output ports are marked “occupied”.
This algorithm was very similar in operation to the Reservation with Preemption and Acknowledgment (RPA) algorithm described in [49] when k
= 1, except that RPA was explicitly based on virtual output queuing, whereas this is implicit in this approach: although the packets in the shared
buffer are not sorted by destination, packets to any destination can depart from any grouped queue at any time. Clearly this algorithm was not
very efficient: in the worst case, an entire grouped queue must be searched in each of a maximum of N iterations [50].
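The four selection steps can be sketched in a few lines of Python; this is an illustrative rendering of the algorithm above, with a random permutation of the groups standing in for steps 2 and 4:

import random

def grouped_select(groups, n_outputs, k):
    # groups: list of shared queues; each is a list of destinations in
    # queuing-age order (oldest first)
    idle = set(range(n_outputs))                              # step 1
    picks = []
    for g in random.sample(range(len(groups)), len(groups)):  # steps 2 and 4
        chosen, seen = [], set()
        for pos, dest in enumerate(groups[g]):                # step 3
            if dest in idle and dest not in seen:
                chosen.append((g, pos, dest))
                seen.add(dest)
                if len(chosen) == k:
                    break
        for _, _, dest in chosen:
            idle.discard(dest)                                # mark "occupied"
        picks.extend(chosen)
        if not idle:
            break
    return picks

print(grouped_select([[0, 1, 0], [2, 2, 3]], n_outputs=4, k=2))

The inner scan over an entire grouped queue is the inefficiency noted above: in the worst case every position is inspected in each of up to N iterations.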
Fig. 2.4: Grouped Input Queuing, from [22].
Simulation results for grouping factor k = 2, 4, and 8 showed that using this scheme, throughput drastically improved compared to FIFO input
queuing. The grouping effect, as also shown in [47], leads to much lower loss rates, in particular when considering bursty traffic.
The Starlite switch, developed at AT&T Bell Laboratories, is an example of implemented shared input queuing [22]. There is no dedicated buffer per input, but rather one buffer that stores all packets that could not be delivered
owing to output contention. Excess packets are dropped. The buffered packets are fed back into the switch (hence the name ‘shared recirculating
queue’) through M additional input ports, dedicated to serving packets being recirculated. This reduces the effective size of the switch because an
(N+M) x (N+M) size switch is required to realize an N x N switch, and may cause packets to be delivered out of sequence. The size of the
recirculation buffer and the number of input ports M dedicated to it can be tuned to meet specific packet loss requirements.
2.4.2 Shared Output queueing.
The shared output queueing structure is shown in Fig. 2.5. By sharing the available memory space among all outputs, better memory efficiency was
achieved, as demonstrated in [37]. The aggregate bandwidth through the shared memory was derived as 2NB, which is in fact
equal to the aggregate bandwidth requirement of the input-queued architecture, instead of N(N+1)B in the case of dedicated output buffers.
Fig. 2.5: Shared output queuing, logical structure [22].
2.5 Shared-memory implementations.
Although a shared-memory implementation reduced the aggregate memory bandwidth to 2NB compared with dedicated output queuing, the
shared memory itself was the greatest implementation challenge. The aggregate bandwidth requirement equals
that of an input-queued switch with identical throughput, but in the input-queued switch the memory was distributed over N inputs, which
typically were also separated physically (on separate chips and boards), whereas in the shared-memory switch the entire bandwidth is funneled
through a single memory. This was the main drawback of the shared-memory architecture. To obtain the desired bandwidth, several approaches
were possible, for example employing wide memories, parallel (interleaved) memory banks [50], pipelined memory access, or
combinations thereof, as illustrated by the following examples of such switches.
2.5.1 Sliding-Window (SW) Packet switch
In [50] it was shown that when optimal-performance, small shared-memory switches were interconnected to form a larger switch using the
traditional approach of multistage interconnection networks (MINs), the throughput of the switch suffered a sharp degradation.
A comparative performance evaluation of the sliding-window (SW) switching architecture under bursty traffic was presented in [50]; this architecture
overcame the restriction on scaling shared-memory packet switches to large sizes. Fig. 2.6 shows the memory module, while figure 2.7
shows the overall architecture of the SW switching system with decentralized pipeline control. The architecture consists of the following
independent stages: (B) the self-routing parameter assignment circuit; (C) the input interconnection network; (E) shared parallel memory modules
used for write and read of data packets; (F) the output interconnection network.
Fig. 2.6. Memory module with memory controller for the SW switch, from [50]
Input lines are denoted by I1, I2, ..., IN and the output lines are denoted by O1, O2, ..., ON. Input lines carry the incoming data packets and the
output lines carry the outgoing data packets after being switched to their output destinations by the SW switching system.
In the SW switch, memory modules (see figure 2.6) are independent and use their local memory controllers to perform WRITE and READ
operations for data packets based only on the information available locally. The SW switch provided a way to reduce the performance bottleneck
created by centralized controllers in typical shared-memory switch architecture.
The main objectives of the SW switch architecture were to:
Fig. 2.7. Schematic diagram of the SW-based switch architecture, from [50]
(a) Facilitate global sharing of physically separate memory modules among all the input and output ports of the switch to reduce packet
loss under bursty traffic conditions.
(b) Alleviate the need for a centralized memory controller.
(c) Partition the overall switching function into multiple independent stages.
(d) Allow multiple independent stages to make switching decisions based only on the information available locally.
(e) Operate multiple independent stages in a pipeline fashion in order to enhance packet switching speed.
(f) Provide maximum output utilization even when backlog occurs due to burstiness.
(g) Provide various memory sharing schemes for finite memory space deployed in the switch.
The PRIZMA switch, as a practical implementation of shared output queueing, was presented in [22]. This concept is represented logically in
Fig. 2.8, and is explained as follows: all packets are stored in a central, shared memory having a size of M packets, and the output queues only
contain addresses that point to the corresponding packets in the shared memory.
The data section is controlled by a control section to achieve the desired switching function. It consists of N output queues, one free address
queue, and demultiplexers to route incoming addresses to the correct output queues.
The free queue initially contains all shared-memory addresses. It provides every input with a memory address at which to store the next incoming packet.
When a packet arrives, the corresponding input stores it at the designated memory address and forwards the address to the control
section along with the packet’s header, which contains destination and priority information.
The control section routes the address to one or more output queues, according to its destination, where it is appended at the tail of the queue(s).
The input then requests a new address from the free queue, and will receive one, if present. As there are N parallel input routers, this process can
be executed in parallel for all inputs.
Each output queue removes the address at its head and uses this address to configure the corresponding output router, which will fetch the data
from the memory at the requested address.
Figure 2.8: Shared output queuing, as implemented in the PRIZMA switch, [22].
When completed, the address is returned to the free queue, to be used again for newly arriving packets. Because of the parallel output routers all
output queues can transmit simultaneously.
By means of output-queue thresholds, queues can be prevented from using up an unreasonable portion of the shared memory, which could lead to
performance degradation on other outputs.
In [22], the corresponding bandwidth analysis was presented as follows: given that the size of a pointer is log2 M bits, the aggregate bandwidth requirement
on the dedicated output queues that store the addresses equals N(N+1)B·(log2 M)/L, with L being the size (in bits) of one packet.
This separation of control and data paths relaxes the overall bandwidth requirement: for a typical packet
switch with N = 32, M = 1024 and L = 64 bytes = 512 bits, it means a bandwidth reduction by a factor of more than 50.
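The arithmetic behind this factor can be checked directly with the stated parameters:

from math import log2

N, M, L = 32, 1024, 64 * 8          # ports, shared-memory size, packet size in bits
addr_bw = N * (N + 1) * log2(M)     # bandwidth on the address (control) queues
data_bw = N * (N + 1) * L           # what dedicated packet queues would need
print(data_bw / addr_bw)            # -> 51.2, i.e. a factor of more than 50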
Additionally, multicast could easily be supported with this architecture [22]. An incoming multicast packet is stored only once in memory,
whereas its address is duplicated to each destination output queue. An occupancy counter associated with each memory address keeps track of
how many copies still have to be transmitted; this counter is initialized to the number of destinations (the multicast fan out). Once it reaches zero,
the address is freed and returned to the free queue.
Limitations: Despite the dedication of buffer space to each queue in the input-queued and output-queued architectures, there was no optimum use
of resources such as queue space [28].
2.6 Combined Input and Output Queuing (CIOQ) architecture.
The solutions to the limitations of pure input queuing and pure output queuing were proposed by several researchers, who arrived at hybrid solutions
from two opposite directions: either by realizing that adding output queues with a modest speed-up factor improved the performance of an input-
queued switch, or by realizing that adding input queues to an output-queued switch improved its packet-loss characteristics. The two options
gave rise to CIOQ.
A switch belongs to the class of combined input and output queuing switches if the queuing function is performed both before and after the
routing function [22]. An architecture of
CIOQ packet switch is illustrated in Fig. 2.9.
As Fig. 2.9 shows, a CIOQ switch consists of three main components: the input queues, the interconnecting routing fabric, and the output queues.
The Combined input- and output-queuing switch of Fig. 2.9 shows input speed-up Si and output speed-up So.
The aggregate input and output bandwidth of the routing fabric equals N(Si + So) times the link rate [22].
Accordingly, CIOQ switch architectures were classified based on the following criteria:
• Input queue organization: FIFO or non-FIFO.
• Internal speed-up factor S: full speed-up, S = N, or partial speed-up, 1 < S < N.
• Output buffer size: infinite or finite. The former is clearly not a practical configuration. In the latter case the output buffers
can be dedicated or shared.
Fig. 2.9. An architecture of a CIOQ switch [22].
• Furthermore, the fabric may be internally lossy or lossless (using some form of fabric-internal flow control, e.g. back-pressure).
The remainder of this section reviews the various CIOQ architectures proposed in [22].
2.6.1 CIOQ architectures with FIFO input queues
The case of FIFO input buffers and finite dedicated output buffers with full speed-up, using back-pressure to prevent packet loss at the output queues, was analyzed in
[25]; it was concluded that in such a configuration the best strategy was to share all available memory space among all
output queues, because this reduced the input-queue HoL blocking probability. Also in [52], the case of finite dedicated output buffers with
limited speed-up (S < N) was studied. It was concluded that for a speed-up larger than three, the performance-limiting factor was the HoL
blocking at the input queues. A CIOQ system with infinite output queues and a speed-up factor S under correlated traffic was analyzed in [27].
Upper and lower bounds on the maximum throughput were derived for the cases of uncorrelated and strongly correlated traffic, respectively. The
conclusions were that
(a) traffic correlation influenced maximum throughput negatively, but not to a large extent, and
(b) the upper and lower bounds converged quickly to 1 for S >1.
A distinction was made in [18] between queue loss (QL) and back-pressure (BP) modes of operation for CIOQ, where the former may lead to
packet loss at the output queues, whereas the latter prevents this by not allowing more packets to enter an output queue than the available space.
Therefore, in the QL case, excess packets are dropped at the output queue, whereas in BP mode they have to wait in the input queues. The CIOQ
was considered under limited input and output queues and a speed-up factor S. Based on the analytical model and corresponding simulations,
the following conclusions were drawn:
• For both modes, a larger speed-up S led to a higher maximum throughput. However, S > 4 brought little additional improvement.
• For a given output queue size, QL offered a higher maximum throughput than BP because less HoL blocking occurred in the input
queues.
• For the BP mode, maximum throughput improved when the output buffer size was increased.
• Naturally, this also led to a decrease in average delay.
• For the QL mode, on the other hand, increasing the output buffer did not improve maximum throughput, but led to slightly higher
average delay.
• A tradeoff between input and output buffer size existed when a total buffer budget was given to achieve the optimum packet-loss rate.
For QL, more buffer space must be allocated to the output, whereas for BP more input buffer space was required. For utilization ≤0.8,
BP required the least amount of total buffer space.
A proposal also exists in [53] for a CIOQ switch with limited speed-up S based on a Batcher-Banyan network with r parallel distributing
modules. The result of their analysis was that with speed-up S = 4 a performance close to output queuing could be achieved, assuming uniform,
uncorrelated arrivals.
An arbitrary speed-up S was analyzed in [54] for a CIOQ system with limited input and output queues, while adopting the queue-loss scheme. The
main focus was to study the system under non-uniform traffic, which was known to cause performance degradation [55]. In [56], packet-loss
figures under uniform traffic were derived, indicating that for a given output queue size, an optimum speed-up S exists that yields the lowest loss
rate. Increasing S further is not only more expensive implementation-wise, but will also increase the loss rate.
As in [56], an optimum distribution of a given buffer budget over input and output queues was demonstrated. Regarding non-uniform traffic, it was
concluded that the dominant non-uniformity was that of the output destination distribution. Based on these findings, an output buffer sharing
strategy was proposed that improved loss performance by allocating more buffer space to queues experiencing higher load.
All these results had been obtained assuming Si = 1 and So= S, i.e., the speed-up was implemented only at the output side.
2.6.2 CIOQ with finite output buffers and back-pressure.
Combined input-output-queued switch with finite output buffers and back pressure was analyzed in [59]. The architecture is as shown in figure
2.10.
The system consists of
• N output queues of size b each
• N input queues and
• an interconnection structure between the input queues and the output queues to route the packets, with speed-up S = min(b, N).
Fig. 2.10. An N × N CIOQ switch with finite output buffers [59].
When an output queue becomes full, this is instantaneously flagged to the input queues (back-pressure mode, as previously described) so that no
packets are lost owing to output-queue overflow. All queues, input and output, were always served in FIFO order. It was found that maximum
throughput did not depend on the selection policy, but only on the output buffer size and the packet-size distribution. The FCFS policy was shown to
give a lower bound on delay, whereas the LCFS policy led to an upper bound. All results in this section were derived in [59] under the
assumption of uniform, independent arrivals (Bernoulli traffic).
Results:
Table 2.1 lists numerical values for a range of b as derived in [59]. These results hold for large values of N. It was noted that the maximum
throughput is monotonically increasing in b; that is,

λ*_b < λ*_{b+1}   for all b ≥ 0   (2.17)

For b = 0, that is, without any output queuing, the throughput equals 0.5857, while the limit for b → ∞ equals 1, as had been established earlier in
[24].
The case where the available buffer space Nα was fully shared among all output queues was analyzed in [48]. Eq. (2.18) expresses the maximum
throughput λ*_α as a function of α, the number of buffers per output (λ*_b above being the maximum throughput for the dedicated,
non-shared, output-buffer case with b packets per output):

λ*_α = α + 2 − √(α² + 2α + 2)   (2.18)
It was concluded that the sharing of the available buffer space also improved throughput, because a packet will never be kept at the input FIFO
HoL as long as there is at least one buffer available in the shared memory.
The limitation: For bursty or asymmetric arrivals, the result of (2.18) will no longer hold [24].
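Assuming the reconstruction of (2.18) above, a quick tabulation shows how sharing lifts the maximum throughput from 0.586 at α = 0 towards 1 as α grows:

from math import sqrt

def max_throughput_shared(alpha):
    # (2.18), as reconstructed above: maximum throughput with N*alpha
    # fully shared buffers (alpha buffers per output on average)
    return alpha + 2 - sqrt(alpha * alpha + 2 * alpha + 2)

for alpha in (0, 1, 2, 4, 8):
    print(alpha, round(max_throughput_shared(alpha), 4))
# alpha = 0 recovers 2 - sqrt(2) = 0.5858; the value tends to 1 as alpha grows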
2.7. Virtual Output Queued (VOQ) architecture
In the preceding sections, various algorithms and implementations of switch architectures that have attempted to improve on the drawbacks of pure input
queueing (IQ) and pure output queueing (OQ) were presented. These approaches all suffered from either HoL blocking in FIFO input queues or
unfairness in the sharing of memory in the case of the output-queued architecture. Their performance was improved upon by
organizing the input buffers according to outputs [60-67]. This arrangement led to the virtual output queueing (VOQ) architecture shown in
Fig. 2.11.
In the VOQ architecture, separate queues are maintained at the input side for each output line. These separate queues are called VOQs. The
routing fabric is still the simple crossbar. In each cycle, the scheduling unit collects the requests, executes some algorithm to decide which VOQs
may forward a packet, returns the grants and configures the crossbar. The granted VOQs remove their HoL packets and forward them to the
outputs.
Both conventional input-queued switches and VOQ switches based on crossbars can only transmit one packet from each input and receive one
packet at each output in one cycle (Si = So = 1). The key difference between the conventional IQ and VOQ is that, whereas the former has only
one packet per input eligible for transmission, an input with VOQ can dispatch as many as N requests (one per output) [64,68-78]. The scheduler
has to solve the problem of matching the inputs with the outputs in an optimal and fair manner.
2.7.1 The Analysis of VOQ architecture
Single Queue Model of each VOQ:
Fig. 2.11. An architecture of VOQ packet-switch [66]
In [66, 69] each VOQ is modeled as a buffer that receives packet arrivals characterized by ON and OFF states. A discrete-time, two-state
Markov chain generating arrivals, modeled as an ON-OFF source, was used, as depicted in Fig. 2.12. The parameters p and q denote the
probabilities that, in a given slot, the Markov chain remains in states ON and OFF, respectively.
In the OFF state, no packet is generated (λ_0 = 0), whereas in the ON state one packet is generated at the rate of one per cycle (λ_1 = 1). While in
the ON state, the packet stream is divided into consecutive bursts; all packets in one burst have the same destination.
Pr[ON lasts r slots] = (1 − p)·p^(r−1),   r ≥ 1   (2.19)

Pr[OFF lasts j slots] = (1 − q)·q^j,   j ≥ 0   (2.20)

Generally, the inter-arrival time distribution f_n is shown [66] as

f_n = p,                          n = 1
f_n = (1 − p)(1 − q)·q^(n−2),     n > 1   (2.21)
That is, the probability of two consecutive arrivals, f_1 = p, is identical to the probability that following an arrival the Markov chain will remain in
state ON. Similarly, f_2 is the probability that following an arrival, the chain transitions to the OFF state and then returns to the ON state. For n > 2,
it is apparent that following a transition from the ON state to the OFF state, there are n − 2 time slots [67,78] during which the chain remains in the
OFF state before returning to state ON.

Fig. 2.12. One-slot Markov chain for arrival process to a VOQ [66]
It was shown in [22][66] that, given (2.19), (2.20), (2.21) and the state probabilities P_0 and P_1 of the OFF and ON states respectively, the mean arrival rate is

λ_a = P_0·λ_0 + P_1·λ_1   (2.22)

    = p_01 / (p_01 + p_10)   (2.23)

    = (1 − q) / (2 − q − p)   (2.24)

where p_01 = 1 − q and p_10 = 1 − p are the OFF-to-ON and ON-to-OFF transition probabilities. And the mean burst length, B, is

B = 1 / (1 − p)   (2.25)
And the offered load for every VOQ is [66,68]

ρ = λ_a / µ = (1 − q) / (µ(2 − q − p))   (2.26)
It was shown in [66] that the service rate, µ, to each queue at the input is

µ = (µ_o² / (N·λ_a)) · [1 − (1 − λ_a/µ_o)^N]   (2.27)

where µ_o is the service rate to the entire VOQ block.
If there are N outputs, each input line comprises N separate VOQs; each VOQ block is seen as a separate queue for each of the N outputs.
At stability the arrival rate, λ, is equal to the service rate to the entire VOQ; thus it is shown in [66,68,72] that

λ = µ[1 − π_o^N(1 − λ)]   (2.28)

and the mean arrival rate, λ, is

λ = (1 − q) / (2 − q − p)   (2.29)

where π_o is the probability of having an empty queue and N is the number of ON states.
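The arrival model of (2.19)-(2.25) is easy to verify empirically; the following Python sketch (our illustration) generates an ON-OFF trace per Fig. 2.12 and compares the measured rate with (2.24):

import random

def on_off_source(p, q, n_slots, seed=1):
    # Discrete-time ON-OFF source of Fig. 2.12: remain ON w.p. p, remain OFF
    # w.p. q; one packet is emitted in every ON slot.
    rng = random.Random(seed)
    state, trace = "OFF", []
    for _ in range(n_slots):
        if state == "ON":
            trace.append(1)
            state = "ON" if rng.random() < p else "OFF"
        else:
            trace.append(0)
            state = "OFF" if rng.random() < q else "ON"
    return trace

trace = on_off_source(p=0.6, q=0.7, n_slots=100000)
print(sum(trace) / len(trace))   # ~ (1-q)/(2-q-p) = 0.3/0.7 = 0.4286 per (2.24)
# mean burst length should be about 1/(1-p) = 2.5 per (2.25)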
Mean Queue size
In [68] the buffer occupancy was calculated as

ρ_u = λ_a(1 − λ_a) / (µ − λ_a)   (2.30)

Therefore, it was shown in [66] that the mean queue size is

Q̄ = λ_a / (µ(1 − ρ))   (2.31)
Throughput per input port
The throughput per input port is the total number of packets in the N queues sent out with request generation rate σ [68] plus a successful service
rate µ. From [66] the throughput is

T = N·σ·µ   (2.32)

with

σ = λ_a + ρ + λ_a·ρ   (2.33)
Mean Queue delay
Using Little’s theorem, and with B denoting the queue capacity, the mean VOQ queue delay from [66, 68, 77, 78] is

D = [1 − (B + 1)ρ^B + B·ρ^(B+1)] / [µ(1 − ρ)(1 − ρ^B)]   (2.34)

which reduces to 1/(µ(1 − ρ)) as B → ∞. This closed-form expression for the VOQ queue delay agrees with a synchronous Geo/Geo/1/B queue with service rate given by a PIM scheduler
[64,71].
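Assuming the closed form of (2.34) as reconstructed above, its behaviour for the buffer sizes used later in this work can be tabulated directly:

def voq_delay(rho, mu, B):
    # (2.34): mean delay of the Geo/Geo/1/B model of one VOQ
    num = 1 - (B + 1) * rho ** B + B * rho ** (B + 1)
    den = mu * (1 - rho) * (1 - rho ** B)
    return num / den

for B in (5, 10, 20):
    print(B, round(voq_delay(rho=0.8, mu=1.0, B=B), 3))
# the delay grows with B towards the infinite-buffer limit 1/(mu*(1-rho)) = 5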
CHAPTER 3
MODELS AND SIMULATION OF VOQ PACKET-SWITCH
3.0 Introduction:
This chapter presents the architecture and model of VOQ-switch upon which simulation and the consequent analysis are based. The key
components of the MATLAB Simulink VOQ model are also presented.
3.1 VOQ Architecture:
Fig. 3.1 shows the architecture of the VOQ. At each input, a separate queue is maintained for each output. The architecture assumes the use of
parallel iterative matching (PIM) scheduling algorithm under i.i.d Bernoulli traffic [68, 69, 70, 71-80].
The objective here is to model a switch that,
• Connects multiple independent data sources to three different and independent destinations.
• Holds arriving packets in a buffer (a queue) for each of the data sources.
• Randomly resolves contention if two or more simultaneous packets at the head of their respective queues share the same intended
destination, with no bias to any particular source of packets.
3.2. The model of the VOQs.
The model of the VOQ is shown in Fig. 3.2. This model is based on the 3x3 input-queued, crossbar-based fabric packet switch architecture
shown in Fig. 3.1.
Fig. 3.1. The architecture of the 3 x 3 VOQ packet-switch
3.2.1 Traffic model
In practical terms, it is very difficult to characterize theoretically the exact nature of packet arrivals, because of the presence of diverse traffic
classes in the high-speed networks in which packet switches operate. Thus most research on packet switching considers processes that make
the analysis tractable while remaining a reasonable approximation to real-life scenarios. Geometrically distributed durations of occurrence are one such choice
[64, 66, 67, 73]. A link slot is different from a switch slot [66]: a packet is wholly
available to the switch only when the last bit of that packet has arrived at the input port, so it takes one whole link slot, or r switch slots,
for all the bits of a packet to arrive. This agrees with burstiness in the traffic [22]. Therefore, there must be (r − 1) idle switch slots for one arrival
in a link slot. An idle time interval is OFF, while an active time interval is ON; such a switch-slot arrival process leads to the Markov-chain VOQ
arrival process of [66, 68]. Each input in the on-off traffic model alternates between active and idle periods of geometrically distributed
durations, as shown in figure 3.3.

Fig. 3.2. Model of the 3 x 3 VOQ packet-switch
During an active period τon, packets destined for the same output arrive continuously in consecutive switch slots [68].
3.3 The VOQ simulation model.
Fig. 3.4 shows the performance simulation model of the VOQ using MATLAB Simulink event-based and time-based blocks.
Fig. 3.3 On-Off model of packet-input at the VOQ input
Solid arrows from left to right represent data flow; arrows from right to left depict flow control. The key components depicted in
figure 3.4 are, from left to right, the following:
3.3.1 Traffic source.
The Time-Based Entity Generator, G, generates the packets at geometrically distributed time intervals. An Event-Based
Random Number generator block, connected to the input port of the Time-Based Entity Generator, determines the intergeneration time
of each entity (packet).
The arrival rate:
The arrival rate, λo, is inversely proportional to the intergeneration time of a packet; it was simulated here as the inverse of the intergeneration time. The
parameter of the Random Number generator is set to ‘geometric’ with a variable probability between 0 and 1. Intergeneration time is
measured at the ‘w’ port of the Time-Based Entity Generator. The number of packets generated by G is measured at the ‘#d’ port of the block.
Entity Scope and Signal Scope blocks were used to monitor the arrival patterns of packets.
3.3.2 VOQ destination distribution process.
Fig. 3.4 Performance simulation model of 3x3 VOQ packet-switch.
For each packet generated by the time-based generator, the Set Attribute block stamps, at random, the destination of the packet: a number
chosen among 1, 2 and 3, signifying the output port selected in the first stage of the switching. The sorting instance
is also time-stamped by the Timer block. The FIFO Queue block stores the packets that the switch cannot route immediately. The queue
size of each VOQ is varied by the ‘Capacity’ parameter on the queue block.
The group of VOQs is sent to the Entity Path Combiner for onward transfer to the next stage of the switching, called here the ‘switch’.
3.3.3 The Single Server process
The Single Server block processes, one at a time, the packets accepted in each time slot of the switch from the input queues or
transmitted from the output queue to the SimEvents Sink (destination). While a packet is in this block, other packets wait in the FIFO
Queue block.
The Start Timer and Read Timer blocks worked together to compute the time that each packet (entity) spent in the queue and server. The total
time was the transition time of the packet through the switch from input to the output. The result of the computation was read from the Display
block. From this computation the average delay was measured.
3.3.4 Output switch.
The Output Switch block of the Simulink model represents the output ports of the switch fabric. The destination ‘address’ set on each packet by the ‘Set Attribute’ block
is read by the ‘Get Attribute’ block, which reassembles all packets for their respective destinations according
to the destination stamps 1, 2, or 3. Note that the Infinite Server block at the output models the latency of the switch system by delaying each
data-carrying entity according to the Service time setting of the output link. The service time was set to 1, and the output queue was set to
infinity.
3.3.5 The destinations.
The Entity Sink block absorbs packets that have completed their processing in the switch fabric. It serves as the output link, or destinations.
3.4. The QoS metrics.
The QoS metrics considered are delay, throughput and packet loss rate with respect to the load from G.
Throughput: This is measured as the ratio of the number of packets received at the output ports of the last stage of switching to the number input
from the sources.
Delay: The average packet delay is measured by the Timer at the timer-tag parameter point ‘w’. The timers are connected to the servers at the
arrival point and prior to the destinations.
Loss rate: The difference between the number of packets sent and the number received is measured by the ‘Math Operations’ block. The ratio
of this difference to the total number of packets sent is the loss rate, averaged over each simulation session.
The load was represented as λo. The measurements were taken using Math Operations blocks. All events occur at discrete time-slot
intervals [66, 68, 80], in which at most a single arrival and a single service event may occur.
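Although the measurements in this work were taken in Simulink, the same slot-level logic can be sketched in ordinary Python for comparison; the following is a greatly simplified illustration (Bernoulli arrivals with uniform destinations and a single-iteration random matching), not the Simulink model itself:

import random

def simulate_voq(n=3, capacity=5, load=0.4, slots=20000, seed=7):
    # voq[i][j] holds the arrival times of packets at input i bound for output j
    rng = random.Random(seed)
    voq = [[[] for _ in range(n)] for _ in range(n)]
    arrived = sent = lost = 0
    total_delay = 0.0
    for t in range(slots):
        for i in range(n):                       # arrival phase
            if rng.random() < load:
                arrived += 1
                j = rng.randrange(n)
                if len(voq[i][j]) < capacity:
                    voq[i][j].append(t)
                else:
                    lost += 1                    # VOQ overflow
        used = set()                             # inputs already matched
        for j in rng.sample(range(n), n):        # service phase: one matching,
            heads = [i for i in range(n) if voq[i][j] and i not in used]
            if heads:                            # random contention resolution
                i = rng.choice(heads)
                used.add(i)
                total_delay += t - voq[i][j].pop(0) + 1
                sent += 1
    return (sent / max(arrived, 1), lost / max(arrived, 1),
            total_delay / max(sent, 1))

print(simulate_voq())   # -> (throughput, loss rate, mean delay in slots)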
CHAPTER 4
SIMULATIONS AND RESULTS ANALYSIS
4.0 Introduction
In this chapter, the conditions under which the QoS parameters were determined for the analysis of the performance are presented. Sessions
of simulation in MATLAB Simulink were carried out to determine the responses from which the results were obtained.
4.1 The traffic pattern
Appendix A1 shows the block diagram of the simulation environment. Figure 4.1 shows the arrival pattern at the queue. This result shows
instantaneous jumps at simulation times between 0 and 1000, suggesting the initial stage, when the three independent sources
try to gain access to the switch at once. The interarrival times within this period could not fit the exponential distribution. From a simulation time
of 1000 and above, the switching has stabilized and the arrival rate and service rate are almost equal; the average queue length has become
steady, neither increasing nor decreasing.

Fig. 4.1. Queue length vs simulation time per VOQ
Fig. 4.2. Stabilized Arrival patterns of packets
Figure 4.2 shows the packet arrival pattern by destination, for 10000 simulation runs. The figure shows
evidence of a continuous flow of packets into the system, because there is no gap of zero packets in the graph. The maximum rate was 8
packets per unit time, while the minimum was 1 packet per unit time. This minimum is consistent with the fact that the service rate was 1 packet per unit
time, which explains the concentration of the patterns at the level of 1.
4.2 Simulation results
For different VOQ sizes of 5, 10 and 20 and the same arrival rates, λo, results are shown with respect to average delay, loss, average queue length
and throughput. The service rate was fixed at one packet per unit time at the output and for the server of each VOQ.

Queue length: The pattern of queue length per VOQ is shown in Fig. 4.1. The queue grew until it saturated at a maximum average of 4.153 at
λ = 0.3526, at the same total service rate of one packet per simulation time.

Table 4.1 Performance at VOQ capacity = 5

λo (/sec) | Input (packets) | Output (packets) | Packet loss | Av. Queue Length | Throughput T | Delay (sec.)
0.2253 | 6759  | 6759  | 0   | 0.0150 | 0.99 | 1.766
0.3319 | 9957  | 9957  | 0   | 0.0460 | 0.99 | 2.060
0.3892 | 11658 | 11643 | 15  | 0.1050 | 0.99 | 2.953
0.4347 | 13029 | 13024 | 05  | 0.1349 | 0.99 | 3.091
0.5910 | 17718 | 17714 | 04  | 0.5716 | 0.99 | 5.439
0.9513 | 28524 | 28436 | 12  | 2.9730 | 0.99 | 12.100

Table 4.2 Performance at VOQ capacity = 10

λo (/sec) | Input (packets) | Output (packets) | Packet loss | Av. Queue Length | Throughput T | Delay (sec.)
0.2253 | 6759  | 6759  | 0   | 0.0150 | 0.99 | 1.766
0.3319 | 9957  | 9957  | 0   | 0.0482 | 0.99 | 2.030
0.3892 | 9987  | 9958  | 29  | 0.0977 | 0.99 | 2.652
0.4347 | 13050 | 13035 | 15  | 0.0950 | 0.99 | 2.397
0.5910 | 17961 | 17958 | 13  | 0.5989 | 0.99 | 3.833
0.9513 | 29457 | 29225 | 202 | 6.6870 | 0.99 | 23.300

Table 4.3 Performance at VOQ capacity = 20

λo (/sec) | Input (packets) | Output (packets) | Packet loss | Av. Queue Length | Throughput T | Delay (sec.)
0.2253 | 6759  | 6759  | 0   | 0.0150 | 0.99 | 1.766
0.3319 | 9957  | 9957  | 0   | 0.0482 | 0.99 | 2.062
0.3892 | 11667 | 11667 | 29  | 0.0871 | 0.99 | 2.346
0.4347 | 17982 | 17969 | 13  | 0.5469 | 0.99 | 4.601
0.5910 | 29871 | 29405 | 466 | 15     | 0.98 | 8.34
0.9513 | 30145 | 29645 | 500 | 18     | 0.98 | 28.37
Throughput
Figure 4.3 is the plot of throughput as a function of load for the three buffer sizes, Q = 5, 10 and 20. Throughput is the ratio of output packets to
input packets. It is to be noted that all the buffer schemes gave their best performance at small loads, between 0.2
and 0.45. Throughput reduced slightly for Q = 10 and 20 at loads between 0.45 and 0.6.
The best throughput performance was exhibited when the buffer size was 5: from the smallest load to the heaviest, throughput remained
constant at 0.99. However, a load of 1 packet/sec and above could easily lead to oversubscription, because the arrival rate would then definitely be
higher than the service rate.
The worst buffer-size scheme was Q = 10 at high loads above 0.5.

Fig 4.3 Plot of throughput vs load for Q=5, 10, 20

Delay analysis
The performance in terms of delay is plotted in figure 4.4. The delays were all the same at small loads for the different buffer sizes.
Common to all the buffer schemes is the fact that as the load increased, the delay also increased. However, the delay variation was highest for Q = 20. This
behavior at Q = 20 was expected, since early-arriving packets find space to queue while waiting for service, and the longer the line, the
longer the waiting time, the service being FIFO.
Fig. 4.4 Delay vs load for buffer size Q=5, 10, 20
Average queue length
Figure 4.5 shows the plot of the queue length for buffer sizes Q = 5, 10 and 20.
Average queue length was the same for Q = 5 and Q = 10 at small loads between 0.2 and 0.6.
It can be observed that in the load range of 40%-60% the queue length at Q = 20 was too large to be acceptable for a configuration, because of the
rate at which the maximum queue length was attained. For example, at Q = 5 the maximum average queue length was close to 5, and at Q = 10 the maximum
queue length approached the capacity of 10. It can therefore be inferred that as the load became higher, under the same round-robin
(RR) service discipline at the output, packets built up faster. This caused the abrupt jump in the curve.
Fig. 4.5 Plot of average queue length vs load for Q=5, 10, 20
CHAPTER 5
CONCLUSION AND RECOMMENDATIONS
5.0 Summary
From the simulation results in chapter 4, the throughput of 0.99 confirms the efficiency of VOQ, which gives better performance than
the pure input-queued (IQ) or pure output-queued (OQ) architectures under different loads.
Furthermore, these results suggest that a fixed buffer allocation above 20 cannot be optimum under all traffic conditions. Therefore, an
opportunity exists to dynamically adjust the buffer allocation between 5 and 20 according to current traffic patterns, for minimum delay, low packet
loss and good throughput.
Having identified the pros and cons of the various existing packet-switch architectures, the fundamental problem at hand is
understood: achieving the best performance with minimum tradeoffs among the QoS metrics.
5.1 Conclusion
The VOQ systems differ fundamentally from the established packet-switch architectures: conventional IQ, OQ or CIOQ systems employing
FIFO input queues. This analysis showed that a throughput of close to 100% is achievable under all the traffic conditions considered.
Since the input queue is the primary consideration in this work, it is clear from the results that it is not possible under VOQ to achieve a lower
delay than other architectures beyond a buffer space of 20 at each VOQ buffer.
This work was limited to the architectures of the packet switch, so the speed of the switch fabric was not considered.
The selection algorithm for each queue was also not the focus of this work.
5.2 Recommendations
The results of this analysis could be a useful reference for making decisions towards achieving the best practicable performance with the VOQ
packet-switch.
As the demand for switch capacity continues to grow, future switches invariably will require more ports, higher data rates per port and more
sophisticated QoS support. Thus a fixed buffer allocation as suggested by the simulation results, in this dissertation, cannot be optimum under all
conditions. Using VOQ architecture as the building, continued effort is required to sustain the prospects of VOQ packet-switch. Further work on
this analysis could be done in case of impact of selection algorithm on VOQ architecture especially when connected in tandem in communication
network.
REFERENCES
[1] Forouzan B.A, Data Communication and Networking, Tata McGraw-Hill Publishing Company Limited, New Delhi, (2007), pp 775.
[2] Held G., Data Communications Networking Devices: Operation, utilization and LAN and WAN Internetworking, John Wiley & Sons,
Ltd. (2000), pp. 185-845.
[3] L. Kleinrock, “On Resource Sharing in a Distributed Communication Environment,” IEEE Commun. Mag., vol. 17, no. 1, Jan. 1979, pp. 27-34.
[4] Leon-Garcia, A. and I. Widjaja, Communication Networks: Fundamental Concepts and Key Architectures, McGraw-Hill (2000), USA.
[5] Folts H, OSI Workbook. Vienna, VA: OMNICOM, Inc., (1983).
[6] Pattavina, A., “Nonblocking Architectures for ATM Switching,” IEEE Commun. Mag., Feb. 1993, pp. 38-48.
[7] Dunlop J. and Smith D. G., Telecommunications Engineering, Stanley Thornes (Publishers Ltd), United Kingdom,(1998), pp 360-370.
[8] K. Yoshigoe and K.J. Christensen, “An Evolution to Crossbar Switches with Virtual Output Queueing and buffered Cross points,” IEEE
network, vol. 17, No. 5, Sept./Oct. 2003.
[9] William Stallings Data and Computer Communications Macmillan Publishing Company (1989), London.
[10] Kumar et al, Communication Networking, Analytical Approach, Morgan Kaufmann Publishers (Elsevier), (2005).
[11] Freeman, R., Telecommunication System Engineering, Wiley, New York, (1980).
[12] AT & T Bell Laboratories. Engineering and Operations in the Bell System, Second Edition, 1983.
[13] Kennedy G. and Davis B., Electronic Communication Systems, Macmillan/McGraw-Hill, (1992), Lake Forest.
[14] Hluchyj, M.G. and M.J. Karol, “Queueing in High-Performance Packet Switching,” IEEE J. Sel. Areas Commun., vol. 6, no. 9, Dec.
1988, pp. 1587-1597.
[15] Rathgeb, E.P., T.H. Theimer, M.N. Huber, “Buffering Concepts for ATM Switching Networks,” in Proc. GLOBECOM ’88, Hollywood,
FL, Nov. 1988, pp. 1277-1281.
[16] Ahmadi, H. and W.E. Denzel, “A Survey of Modern High-Performance Switching Techniques,” IEEE J. Sel. Areas Commun., vol. 7, no.
7, Sep. 1989, pp. 1091-1103.
[17] Tobagi, F., “Fast Packet Switch Architectures for Broadband Integrated Services Digital Networks,” in Proc. IEEE, vol. 78, no. 1, Jan.
1990, pp. 133-167.
[18] Ifiok O., Communication Engineering Principles, Palgrave (2001), New York.
[19] Awdeh, R.Y. and H.T. Mouftah, “Survey of ATM Switch Architectures,” Computer Networks and ISDN Systems, vol. 27 (1995), pp.
1567-1613.
[20] Minkenberg, C., T. Engbersen and M. Colmant, “A Robust Switch Architecture for Bursty Traffic,” in Proc. Int. Zurich Seminar on
Broadband Commun. IZS 2000, Zurich, Switzerland, Feb. 2000, pp. 207-214.
[21] Patel, J.K., “Performance of Processor-Memory Interconnections for Multiprocessors,” IEEE Trans. Computing, vol. 30, no. 10, Oct.
1981, pp. 771-780.
[22] Minkenberg, C., “On Packet Switch Design,” PhD thesis, Eindhoven University of Technology, 2001.
[23] Hui, J.Y. and E. Arthurs, “A Broadband Packet Switch for Integrated Transport,”IEEE J. Sel. Areas Commun., vol. 5, no. 8, Oct. 1987,
pp. 1264-1273.
[24] Karol, M.J., M.G. Hluchyj and S.P. Morgan, “Input vs Output Queueing on a Space-division Packet Switch,” IEEE Trans. Commun.,
vol. 35, no. 12, 1987, pp. 1347-1356.
[25] Iliadis, I. and W.E. Denzel, “Performance of Packet Switches with Input and Output Queueing,” in Proc. ICC ’90, Apr. 1990, pp. 747-
753.
[26] Hui, J.Y. and E. Arthurs, “A Broadband Packet Switch for Integrated Transport,” IEEE J. Sel. Areas Commun., vol. 5, no. 8, Oct. 1987,
pp. 1264-1273.
[27] Li, S.-Q., “Performance of a Nonblocking Space-Division Packet Switch with Correlated Input Traffic,” IEEE Trans. Commun., vol. 40,
no. 1, Jan. 1992, pp. 97-108.
[28] Cao, X.-R., “The Maximum Throughput of a Nonblocking Space-Division-Packet Switch with Correlated Destinations,” IEEE Trans.
Commun., vol. 43, no. 5, May 1995, pp. 1898-1901.
[29] Souza, R.J., P.G. Krishnakumar, C.M. Özveren, R.J. Simcoe, B.A. Spinney, R.E. Thomas and R.J. Walsh, “GIGASwitch System: A
High-performance Packet-switching Platform,” Digital Technical J., vol. 6, no. 1, 1994, pp. 9-22.
[30] Pattavina, A., “Multichannel Bandwidth Allocation in a Broadband Packet Switch,” IEEE J. Sel. Areas Commun., vol. 6, no. 9, Dec.
1988, pp. 1489-1499.
[31] Obara, H. and T. Yasushi, “An efficient contention resolution algorithm for input queueing ATM cross-connect switches,” Int. J. Digital
and Analog Cabled Syst., vol. 2, Dec. 1989, pp. 261-272.
[32] Matsunaga, M. and H. Uematsa, “A 1.5 Gb/s 8x8 Cross-Connect Switch Using a Time Reservation Algorithm,” IEEE J. Sel. Areas
Commun., vol. 9, no. 8, Oct. 1991, pp. 1308-1317.
[33] Obara, H., “Optimum architecture for input queueing ATM switches,” IEE Electron. Lett., 28th Mar. 1991, pp. 555-557.
[34] Obara, H., S. Okamoto and Y. Hamazumi, “Input and Output Queueing ATM Switch Architecture with Spatial and Temporal Slot
Reservation Control,” IEE Electron. Lett., Jan. 1992, pp. 22-24.
[35] Obara, H. and Y. Hamazumi, “Parallel Contention Resolution Control for Input Queueing ATM Switches,” IEE Electron. Lett., vol. 28,
no. 9, Apr. 1992, pp. 838-839.
[36] Hluchyj, M.G. and M.J. Karol, “Queueing in High-Performance Packet Switching,” IEEE J. Sel. Areas Commun., vol. 6, no. 9, Dec.
1988, pp. 1587-1597.
[37] Liew, S.C., “Performance of Various Input-buffered and Output-buffered ATM Switch Design Principles under Bursty Traffic:
Simulation Study,” IEEE Trans. Commun., vol. 42 no. 2/3/4, Feb/Mar/Apr 1994, pp. 1371-1379.
[38] Oie, Y., M. Murata, K. Kubota and H. Miyahara, “Effect of Speedup in Nonblocking Packet Switch,” in Proc. ICC ’89, Jun. 1989, pp.
410-414.
[39] Gupta, A.K., L.O. Barbosa and N.D. Georganas, “A 16 x 16 Limited Intermediate Buffer Switch Module for ATM Networks,” in Proc.
IEEE GLOBECOM ’91, Phoenix, AZ, Dec. 1991, pp. 939-943.
[40] Del Re, E. and R. Fantacci, “Performance Evaluation of Input and Output Queueing Techniques in ATM Switching Systems,” IEEE
Trans. Commun., vol. 41, no. 10, Oct. 1993.
[41] Yeh, Y., M.G. Hluchyj and A.S. Acampora, “The Knockout Switch: A Simple, Modular Architecture for High-Performance Packet
Switching,” IEEE J. Sel. Areas Commun., vol. 5, no. 8, Oct. 1987.
[42] Yoon, H., M.T. Liu, K.Y. Lee and Y.M. Kim, “The Knockout Switch Under Nonuniform Traffic,” IEEE Trans. Commun., vol. 43, no. 6,
Jun. 1995, pp. 2149-2156.
[43] Eng, K.Y., “A Photonic Knockout Switch for High-Speed Packet Networks,” IEEE J. Sel. Areas Commun., vol. 6, no. 7, Aug. 1988, pp.
1107-1116.
[44] Suzuki, H., H. Nagano and T. Suzuki, “Output-buffer Switch Architecture for Asynchronous Transfer Mode,” in Proc. ICC ’89, Boston,
MA, Jun. 1989, pp. 99-103.
[45] Andersson, P. and C. Svensson, “A VLSI Architecture for an 80 Gb/s ATM Switch Core,” in Proc. 8th Annual IEEE Int’l Conf.
Innovative Systems in Silicon, Austin, TX, Oct. 9-11, 1996.
[46] Kozaki, T., N. Endo, Y. Sakurai, O. Matsubara, M. Mizukami and K. Asano, “32x32 Shared Buffer Type ATM Switch VLSI’s for B-
ISDN’s,” IEEE J. Sel. Areas Commun., vol. 9, no. 8, Oct. 1991, pp. 1239-1247.
[47] Tao, Z. and S. Cheng, “A New Way to Share Buffer - Grouped Input Queueing in ATM Switching,” in Proc. IEEE GLOBECOM ’94,
vol. 1, pp. 475-479.
[48] Iliadis, I., “Performance of a Packet Switch with Shared Buffer and Input Queueing,” in Proc. Teletraffic and Datatraffic in a Period of
Change, ITC-13, 1991, pp. 911-916.
[49] Ajmone Marsan, M.G., A. Bianco and E. Leonardi, “RPA: A Simple, Efficient, and Flexible Policy for Input Buffered ATM Switches,”
IEEE Commun. Letters, vol. 1, no. 3, May 1997, pp. 83-86.
[50] Kumar, S., “The Sliding-Window Packet Switch: A New Class of Packet Switch with Plural Memory Modules and Decentralized Control,”
IEEE J. Sel. Areas Commun., vol. 21, no. 4, May 2003, pp. 656-673.
[51] Denzel, W.E., A.P.J. Engbersen and I. Iliadis, “A Flexible Shared-Buffer Switch for ATM at Gb/s Rates,” Computer Networks and ISDN
Systems, vol. 27, no. 4, Jan. 1995, pp. 611-624.
[52] Gupta, A.K. and N.D. Georganas, “Analysis of a Packet Switch with Input and Output Buffers and Speed Constraints,” in Proc. IEEE
INFOCOM ’91, Bal Harbour, FL, Apr. 1991, pp. 694-700.
[53] Chang, C.-Y., A.J. Paulraj and T. Kailath, “A Broadband Packet Switch Architecture with Input and Output Queueing,” in Proc. IEEE
GLOBECOM ’94, pp. 448-452.
[54] Lee, M.J. and D.S. Ahn, “Cell Loss Analysis and Design Trade-Offs of Nonblocking ATM Switches with Non-uniform Traffic,”
IEEE/ACM Trans. Netw., vol. 3, no. 2, Apr. 1995, pp. 199-210.
[55] Li, S.Q., “Non-uniform Traffic Analysis on a Nonblocking Space-Division Packet Switch,” IEEE Trans. Commun., vol. 38, Jul. 1990,
pp. 21-31.
[56] Chen, J.S.-C. and R. Guerin, “Performance Study of an Input Queueing Packet Switch with Two Priority Classes,” IEEE Trans.
Commun., vol. 39, no. 1, Jan. 1991, pp. 117-126.
[57] Giacopelli, J., J. Hickey, W. Marcus, D. Sincoskie and M. Littlewood, “Sunshine: A High-performance Self-routing Broadband Packet
Switch Architecture,” IEEE J. Sel. Areas Commun., vol. 9, no. 8, Oct. 1991, pp. 1289-1298.
[58] Huang, A. and S. Knauer, “Starlite: A Wideband Digital Switch,” in Proc. GLOBECOM ’84, Atlanta, GA, Dec. 1984, pp. 121-125.
[59] Iliadis, I. and W.E. Denzel, “Analysis of Packet Switches with Input and Output Queueing,” IEEE Trans. Commun., vol. 41, no. 5, May 1993, pp. 731-740.
[60] Rojas-Cessa, R. and C.-B. Lin, “Captured-Frame Eligibility and Round-Robin Matching for Input-Queued Packet Switches,” IEEE Commun. Letters, vol. 8, no. 9, Sep. 2004, pp. 585-587.
[61] Banovic, D. and I. Radusinovic, “VOQ Simulator v2.0 - Tool for Performance Analysis of VOQ Switches,” submitted to IEEE HPSR 2006. http://klamath.stanford.edu/tools.
[62] Wang, F. and M. Hamdi, “Scalable Router Memory Architecture Based on Interleaved DRAM: Analysis and Numerical Studies,” in Proc. IEEE ICC, 2007.
[63] McKeown, N., et al., “Achieving 100% Throughput in an Input-Queued Switch,” in Proc. IEEE INFOCOM ’96, pp. 296-302.
[64] Lee, H.-I. and S.-W. Seo, “Analysis Model of Multiple Input-Queued Switches with PIM Scheduling Algorithm,” IEEE Commun. Letters, vol. 5, no. 7, Jul. 2001, pp. 316-318.
[65] Krishna, P., et al., “On the Speedup Required for Work-Conserving Crossbar Switches,” IEEE J. Sel. Areas Commun., vol. 17, no. 6, 1999, pp. 1057-1066.
[66] Elhanany, I., et al., “On Uniformly Distributed ON/OFF Arrivals in Virtual Output Queued Switches with Geometric Service Times,” in Proc. IEEE ICC, vol. 1, May 2003, pp. 173-177.
[67] Elhanany, I. and D. Sadot, “DISA: A Robust Scheduling Algorithm for Scalable Crosspoint-Based Switch Fabrics,” IEEE J. Sel. Areas Commun., vol. 21, no. 4, May 2003, pp. 535-545.
[68] Tsaur, D.-J., et al., “Scheduling Algorithm and Evaluating Performance of a Novel 3D-VOQ Switch,” IJCSNS Int. J. of Computer Science and Network Security, vol. 6, no. 3A, Mar. 2006, pp. 25-34.
[69] Tamir, Y. and H.-C. Chi, “Symmetric Crossbar Arbiters for VLSI Communication Switches,” IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 1, Jan. 1993, pp. 13-27.
[70] Yener, B., E. Leonardi and F. Neri, “Algorithms for Virtual Output Queued Switching,” in Proc. GLOBECOM ’99, Rio de Janeiro, Brazil, Dec. 1999, vol. 2, pp. 1203-1210.
[71] McKeown, N., “The iSLIP Scheduling Algorithm for Input-Queued Switches,” IEEE/ACM Trans. Networking, vol. 7, no. 2, Apr. 1999, pp. 188-201.
[72] Whittle, P., Probability, John Wiley & Sons Ltd., Great Britain, 1976.
[73] Mekkittikul, A. and N. McKeown, “A Starvation-free Algorithm for Achieving 100% Throughput in an Input-Queued Switch,” in Proc. IEEE INFOCOM ’98, San Francisco, CA, Apr. 1998, pp. 792-799.
[74] Dai, J.G. and B. Prabhakar, “The Throughput of Data Switches with and without Speedup,” in Proc. INFOCOM 2000, Tel Aviv, Israel, Mar. 2000, vol. 2, pp. 556-564.
[75] Serpanos, D. and P. Antoniadis, “FIRM: A Class of Distributed Scheduling Algorithms for High-Speed ATM Switches with Multiple Input Queues,” in Proc. IEEE INFOCOM 2000, Tel Aviv, Israel, Mar. 2000, pp. 548-555.
[76] Smiljanic, A., et al., “RRGS - Round-Robin Greedy Scheduling for Electronic/Optical Terabit Switches,” in Proc. IEEE GLOBECOM ’99, Rio de Janeiro, Brazil, Dec. 1999, pp. 1244-1250.
[77] Zhang, X. and L.N. Bhuyan, “An Efficient Scheduling Algorithm for Combined Input-Crosspoint-Queued (CICQ) Switches,” in Proc. IEEE GLOBECOM, Nov. 2004.
[78] Rojas-Cessa, R. and E. Oki, “Round-Robin Selection with Adaptable-Size Frame in a Combined Input-Crosspoint Buffered Switch,” IEEE Commun. Letters, vol. 7, no. 11, Nov. 2003, pp. 555-557.
[79] Guo, Z. and R. Rojas-Cessa, “Framed Round-Robin Arbitration with Explicit Feedback Control for Combined Input-Crosspoint Buffered Packet Switches,” in Proc. IEEE ICC, Jun. 2006.
[80] Kumar, N., et al., “Fair Scheduling in Input-Queued Switches under Inadmissible Traffic,” in Proc. IEEE GLOBECOM, Nov. 2004.
APPENDIX A1
COMPLETE VOQ PACKET-SWITCH SIMULATION ENVIRONMENT
[Simulink block diagram of the complete VOQ packet-switch simulation model, built from Start Timer, Read Timer, N-Server, Get Attribute, FIFO Queue, Entity Sink and Display blocks; the Display blocks report the throughput and average delay readings.]
APPENDIX A2
FLOW CHART FOR THE SIMULATION OF VOQ PACKET SWITCH
Start
1. Generate packets i.i.d.
2. Attach a random destination to each packet.
3. Set the arrival time of each packet.
4. Place each packet on its respective VOQ.
5. Output packets in round-robin order at the destinations.
6. Get the time out (departure time) of each packet.
7. Calculate the service time for a number of packets at each destination.
8. Count the number of packet losses.
9. Count the number of packets at each destination.
Stop
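The flow chart above can be rendered as a short slot-based script. The following is a minimal MATLAB sketch, not the Simulink model used in this thesis; the switch size N = 4, load p = 0.6, slot count and VOQ capacity Qmax = 10 are illustrative assumptions, and, as in the flow chart, each destination serves its VOQs in round-robin order independently of the other outputs (input contention is not modelled).

% Minimal slot-based sketch of the flow chart (illustrative, not the
% thesis's Simulink model). Assumes Bernoulli i.i.d. arrivals, uniform
% random destinations, finite VOQs and one departure per output per slot
% chosen round-robin among the VOQs holding packets for that output.
N     = 4;         % switch size (assumed)
p     = 0.6;       % per-input arrival probability per slot (assumed)
slots = 10000;     % simulated time slots
Qmax  = 10;        % VOQ capacity in packets (assumed)

voq = cell(N, N);  % voq{i,j}: arrival slots queued at input i for output j
rr  = ones(1, N);  % round-robin pointer of each output
delays = []; losses = 0; arrivals = 0; departures = 0;

for t = 1:slots
    % Generate packets i.i.d. and attach a uniform random destination
    for i = 1:N
        if rand < p
            arrivals = arrivals + 1;
            j = ceil(N*rand);                   % random destination
            if numel(voq{i,j}) < Qmax
                voq{i,j}(end+1) = t;            % record arrival time
            else
                losses = losses + 1;            % VOQ full: packet lost
            end
        end
    end
    % Each output serves one head-of-line packet per slot, round-robin
    for j = 1:N
        for k = 0:N-1
            i = mod(rr(j) - 1 + k, N) + 1;      % next input after the pointer
            if ~isempty(voq{i,j})
                delays(end+1) = t - voq{i,j}(1); % waiting time in slots
                voq{i,j}(1) = [];               % packet departs
                departures  = departures + 1;
                rr(j) = mod(i, N) + 1;          % advance pointer past served input
                break
            end
        end
    end
end

fprintf('average delay %.3f slots, carried load %.3f, loss ratio %.4f\n', ...
        mean(delays), departures/(N*slots), losses/max(arrivals,1));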
APPENDIX A3
MATLAB CODES FOR THE PLOTS OF SIMULATION RESULTS
%===== TRANSFER OF DATA FROM SIMULINK TO WORKSPACE ==========
% que_length is a "Structure With Time" saved by a To Workspace block:
% the first column is simulation time, the second the logged signal value.
out = [que_length.time, que_length.signals.values];
x = out(:,1);
y = out(:,2);
plot(x, y)
xlabel('simulation time');
ylabel('average queue length')   % signal logged in que_length
grid
%========================== VOQ1 ==========================
%======================= throughput =======================
% out_2 logs the running throughput of VOQ1 in the same format
out2 = [out_2.time, out_2.signals.values];
x = out2(:,1);
y = out2(:,2);
plot(x, y)
xlabel('simulation time');
ylabel('throughput')
grid
% PLOT OF AVERAGE DELAY VERSUS LOAD
a  = [0.2253 0.3319 0.3892 0.4347 0.5910 0.9513];   % offered load
Q1 = [1.733 2.060 2.953 3.091 5.439 12.01];         % average delay, Q = 5
Q2 = [1.766 2.030 2.652 2.397 3.833 23.0];          % average delay, Q = 10
Q3 = [1.766 2.062 2.346 4.601 8.34 28.27];          % average delay, Q = 20
plot(a, Q1, '-*r', a, Q2, ':+b', a, Q3, '-xk')
grid on
xlabel('load')
ylabel('average delay (sec.)')
legend('Q=5', 'Q=10', 'Q=20')
% PLOT OF AVERAGE QUEUE LENGTH VERSUS LOAD
p    = [0.2253 0.3319 0.3892 0.4347 0.5910 0.9513];  % offered load
q_l1 = [0.0150 0.0460 0.1050 0.1349 0.5716 2.973];   % average queue length, Q = 5
q_l2 = [0.0150 0.0482 0.0977 0.0950 0.5989 6.687];   % average queue length, Q = 10
q_l3 = [0.0150 0.0482 0.0871 0.5469 15 18];          % average queue length, Q = 20
plot(p, q_l1, '-*r', p, q_l2, ':+b', p, q_l3, '-xk')
grid on
xlabel('load')
ylabel('average queue length')
legend('Q=5', 'Q=10', 'Q=20')
% PLOT OF THROUGHPUT VERSUS LOAD
b  = [0.2253 0.3319 0.3892 0.4347 0.5910 0.9513];    % offered load
T1 = [0.99 0.99 0.99 0.99 0.99 0.99];                % throughput, Q = 5
T2 = [0.99 0.99 0.99 0.99 0.989 0.988];              % throughput, Q = 10
T3 = [0.99 0.99 0.99 0.99 0.989 0.984];              % throughput, Q = 20
plot(b, T1, '-*r', b, T2, ':+b', b, T3, '-xk')
grid on
xlabel('load')
ylabel('throughput')
legend('Q=5', 'Q=10', 'Q=20')
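For readers relating the three plotted quantities to one another, the short sketch below shows how average delay, throughput and average queue length can be derived from per-packet arrival and departure logs at one destination, using Little's law for the queue length. The packet times here are hypothetical illustrative values, not thesis data.

% Deriving the plotted statistics from per-packet logs (illustrative).
t_in  = [1 3 4 7 9];                 % example arrival times (hypothetical)
t_out = [2 5 7 9 12];                % example departure times (hypothetical)
W = mean(t_out - t_in);              % average delay per packet
T = numel(t_out) / max(t_out);       % throughput: departures per unit time
lambda = numel(t_in) / max(t_out);   % arrival rate in packets per unit time
L = lambda * W;                      % average queue length via Little's law, L = lambda*W
fprintf('average delay %.2f, throughput %.2f, average queue length %.2f\n', W, T, L);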