Competitive Evaluation of
Switch Architectures
David Hay
Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007
Research Thesis
Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of
Philosophy
David Hay
Submitted to the Senate of the
Technion - Israel Institute of Technology
Haifa, Iyyar 5767 (April 2007)
The research thesis was done under the supervision of Prof. Hagit Attiya in the Department of
Computer Science.
Hagit, there are no words to express how grateful I am for your help and patient guidance
all these years. I feel privileged to have worked with you and learned from your experience, as
a researcher, a teacher, but first and foremost as a human being. I felt that you were always
available for me, for any question, any thought, or even just for a chat. Among the countless things
I learned from you, I especially appreciate how you perfectly balanced guiding me with
maintaining my independence as a researcher. No doubt you are the kind of advisor any student
dreams of (and much, much more).
I would like to thank my collaborators, Dr. Isaac Kesslasy, Gabriel Scalosub and Prof. Jennifer
L. Welch, for many helpful and fruitful discussions. The periods we worked together were among
the most enjoyable of all my studies. I also thank Isaac Kesslasy for organizing my internship
at Cisco Systems during the summer of 2006; this internship contributed significantly to my
research.
I am thankful to my committee members: Prof. Israel Cidon, Dr. Isaac Kesslasy, Prof. Yishai
Mansour, Prof. Seffi Naor, Prof. Danny Raz and Dr. Adi Rosen. I benefited tremendously from
your insights and comments.
I would also like to thank all the other people from the Computer Science department, with
whom I worked and who helped me all these years since my undergraduate studies.
Special thanks to Moshe Saikevich for his consistent moral support (and graphic advice and
help). Moshe, without you none of this would have happened.
Last but not least, I thank my parents Yael and Yigal Hay, my grandparents Zvi and Zvia Gracy
and Lia and Jacob Hay, my brothers Eyal, Roee and Assaf, and the rest of my family, for always
being there for me and supporting my decisions and choices.
The generous financial help of the Blankstein and Wolf foundations is gratefully acknowledged.
Contents
Abstract
Abbreviations and Notations
1 Introduction
1.1 Classification of Switch Architectures
1.2 Evaluating the Performance of Switch Architecture
1.2.1 Performance Metrics
1.3 The Packet Scheduling Process Bottleneck
1.4 Overview of the Thesis
1.4.1 Relative Queuing Delay in Parallel Packet Switches
1.4.2 Packet-mode Scheduling in Combined Input-Output Queued Switches
1.4.3 Jitter Regulation for Multiple Streams
2 Background
2.1 CIOQ Switches
2.2 Output-Queued Switch Emulation
2.3 Parallel Packet Switches
2.4 Delay Jitter Regulation
3 Model Definitions
4 Relative Queuing Delay in PPS
4.1 Summary of Our Results
4.2 A Model for Parallel Packet Switches
4.3 Lower Bounds on the Relative Queuing Delay
4.3.1 General Techniques and Observations
4.3.2 Lower Bounds for Fully-Distributed Demultiplexing Algorithms
4.3.3 Lower Bounds for u-RT Demultiplexing Algorithms
4.4 Upper Bounds on the Relative Queuing Delay
4.5 Demultiplexing Algorithms with Optimal RQD
4.5.1 Optimal Fully-Distributed Demultiplexing Algorithms
4.5.2 Optimal 1-RT Demultiplexing Algorithm
4.5.3 Optimal u-RT Demultiplexing Algorithm
4.6 Extensions of the PPS model
4.6.1 The Relative Queuing Delay of an Input-Buffered PPS
4.6.2 Recursive Composition of PPS
5 Packet-Mode Scheduling in CIOQ Switches
5.1 Our Results
5.2 A model for packet-mode CIOQ switches
5.3 Simple Upper and Lower Bounds on the Relative Queuing Delay
5.4 Tradeoffs between the speedup and the relative queuing delay
5.4.1 Matrix Decomposition
5.4.2 Mimicking an Ideal Shadow Switch with Speedup S ≈ 4
5.4.3 Mimicking an Ideal Shadow Switch with Speedup S ≈ 2
5.5 Mimicking an Ideal Shadow Switch with Bounded Buffers
5.6 Simulation Results
6 Jitter Regulation for Multiple Streams
6.1 Our Results
6.2 Model Description, Notation, and Terminology
6.3 Online Multi-Stream Max-Jitter Regulation
6.4 An Efficient Offline Algorithm
7 Conclusions
Bibliography
List of Figures
1 High-level model of a switch and its bottlenecks.
2 Combined Input-Output Queued Switch with Virtual Output-Queuing.
3 A 5 × 5 PPS with 2 planes in its center stage, without buffers in the input-ports.
4 Illustration of different delay metrics.
5 Time-points associated with a cell c ∈ T.
6 Illustration of traffic T in the proof of Theorems 1 and 4.
7 Illustration of traffic T in the proof of Theorem 5.
8 Illustration of traffic T̃ = T|I in the proof of Theorems 6 and 8.
9 The number of cells arriving until time-slot t − 1, and still queued in plane k by time-slot τ.
10 Illustration for the different cases in the proof of Theorem 10.
11 A (2, 〈2, 2〉)-RPPS with 5 input-ports and 5 output-ports.
12 Summary of the results described in Chapter 5.
13 Illustration of the proof of Theorem 21.
14 Simulation results for a switch, operating under uniform traffic.
15 Simulation results for a switch, operating under spotted traffic.
16 Simulation results for a switch, operating under diagonal traffic.
17 Trace-driven simulation results.
18 Simulation results of the Store&Forward Greedy Algorithm.
19 The multi-stream jitter regulation model.
20 Geometric view of delay jitter.
21 Geometric view of the right margin of the release band.
22 Illustration of Lemma 18.
List of Tables
4.1 The relative queuing delay (in time-slots) of a bufferless OC-192 PPS with 1024 ports and speedup 2.
4.2 Illustration of Example 1.
Abstract
To support the growing need for Internet bandwidth, contemporary backbone routers and switches
operate with an external rate of 40 Gb/s and hundreds of ports. At the same time, applications
with stringent quality of service (QoS) requirements call for powerful control mechanisms, such as
packet scheduling and queue management algorithms.
The primary goal of our research is to provide analytic methodologies for designing and eval-
uating high-capacity high-speed switches. A unique feature of our approach is a worst-case com-
parison of switch performance relative to an ideal switch, with no limitations. This competitive
approach is natural due to the central role of incomplete knowledge, and it can reveal the strengths
and weaknesses of the studied mechanisms and indicate important design choices.
We first consider the parallel packet switch (PPS) architecture in which cells are switched
in parallel through intermediate slower switches. We study the effects of this parallelism on the
overall performance and present tight bounds on the average queuing delay introduced by the
switch relative to an ideal output-queued switch. Our lower bounds hold even if the algorithm in
charge of balancing the load among middle-stage switches is randomized.
We also study how variable-size packets can be scheduled contiguously without segmentation
and reassembly in a combined input-output queued (CIOQ) switch. This mode of scheduling has
recently become very attractive, since most common network protocols (e.g., IP) work with
variable-size packets. We present frame-based schedulers that allow a packet-mode CIOQ switch with small
speedup to mimic an ideal output-queued switch with bounded relative queuing delay.
A slightly different line of research involves studying how different QoS measures can be
guaranteed in a stand-alone environment, where traffic arrives at a regulator that should shape it
to meet the demand. We focus on jitter regulators, which should shape the incoming traffic to
be perfectly periodic, and we show upper and lower bounds for multiple-stream jitter regulation: In
the offline setting, jitter regulation can be solved in polynomial time, while in the online setting
a buffer augmentation is needed in order to compete with the optimal algorithm; the amount of
buffer augmentation depends linearly on the number of streams.
Abbreviations and Notations
N       The number of the switch's ports
R       The external rate of the switch
S       The speedup of the switch
K       The number of planes in a parallel packet switch
r       The internal rate of a parallel packet switch
Lmax    The maximum packet size
orig    The input-port at which a cell arrives at the switch
dest    The output-port for which a cell is destined
packet  The packet corresponding to a specific cell
first   The first cell of a packet
last    The last cell of a packet
T       A traffic: the collection of cells arriving at the switch
ta      The time-slot at which a cell arrives at the switch
shift   A cell obtained by shifting another cell by a predetermined number of time-slots
ESW     An execution of the switch SW
σ       A sequence of coin tosses
tlSW    The time-slot at which a cell leaves the switch
ES      The execution of the shadow switch
tlS     The time-slot at which a cell leaves the shadow switch
delay   The queuing delay of a cell
R       The relative queuing delay of a cell
Rmax    The maximum relative queuing delay
Ravg    The average relative queuing delay
RAmax   The maximum relative queuing delay against adversary A
RAavg   The average relative queuing delay against adversary A
JSW     The delay jitter of switch SW
JS      The delay jitter of the shadow switch
J       The relative delay jitter
S       The state space of a demultiplexor
plane   The plane through which a cell is sent (in a PPS)
tp      The time-slot at which a cell leaves the plane
succ    The immediate successor of a cell
C       The reachable configuration space of a demultiplexor
A       The number of cells arriving at the switch
∆       The imbalance of a plane with respect to an output
Q       The length of a queue
L       The number of cells leaving a plane
B       The set of reachable buffer states
to      The time-slot at which a cell leaves the demultiplexor
tCCF    The time-slot at which CCF forwards a cell
L       The set of eligible packet sizes
X       The inter-release time of cells
M       The total number of streams
MJ      The max-jitter of a multi-stream traffic
Chapter 1
Introduction
The rapid increase in the demand for Internet bandwidth and the boost in line rates of contemporary
data networks establish the basic nodes at the network core—namely, the switches and routers—as
one of the network’s primary performance bottlenecks.
Today, switches and routers are built to operate with link rates of up to 40 Gb/s and hundreds
of ports. At the same time, contemporary data networks are required to integrate different types of
services (for example, IP traffic with voice and video traffic), implying that the switch (or router)
must meet stringent quality of service (QoS) requirements and provide service differentiation be-
tween applications. In order to cope with these challenging tasks, all routers and switches are
equipped with powerful control mechanisms, such as packet scheduling and queue management
algorithms. As switches become larger and faster, robust parallel and distributed architectures are
often used; these architectures require additional mechanisms for coordination and load balancing.
Varghese [142, Page 302] identifies three bottlenecks in the design of high-speed switches: the
address lookup process, which determines which output link a packet will switch to, the switching
process, which is responsible for forwarding packets from the input-port to the output-port, and
the packet scheduling process, which is done at the output-port and decides how packets leave the
outbound links of the switch. Figure 1 depicts the general structure of a packet switch and indicates
the locations of the above-mentioned bottlenecks.
Figure 1: High-level model of a switch and its bottlenecks. The switch has N input-ports and N output-ports, operating at external rate R.

This thesis focuses primarily on problems arising from the switching process bottleneck (except Chapter 6, which deals with the packet scheduling bottleneck) and aims to provide analytical methodologies for designing and evaluating the related switch control mechanisms. In addition, our results allow a comparison between different switch architectures.
Generally, given an existing switch architecture in which the line rates, buffering locations,
and control lines are specified, we offer switching algorithms and evaluate their performance.
In addition, we prove inherent limitations of the architecture and point out important design
trade-offs. Note that the algorithms and their analysis strongly depend on the investigated switch
architecture; therefore, this research involves a large variety of algorithmic problems.
Most of our results are relativistic, in the sense that they are measured in comparison to an
optimal switch that is not limited by its architecture. As with online algorithms, which operate
with no information about future events, the competitive approach taken in this research is
natural due to the central role of incomplete knowledge.
A primary candidate for an ideal switch is an output-queued switch (see Section 1.1), which is
considered optimal with respect to its ability to guarantee different QoS demands. For that reason,
this comparison is often referred to as the ability of a switch to mimic or emulate an output-queued
switch [120].
Because such analysis is not burdened by probabilistic assumptions on the incoming traffic that
can be misleading, it reveals the strengths and weaknesses of the studied mechanisms and architec-
tures. In addition, analytic evaluation, and especially worst-case evaluation, is important because
it allows QoS demands to be guaranteed (unlike empirical evaluation based on simulations).
In the rest of this chapter, we first discuss in Section 1.1 how to classify different switch archi-
tectures. In Section 1.2, we overview the methods and metrics for evaluating switch architectures
performance. Section 1.3 discusses in more depth the problems arising from the packet scheduling
process bottleneck. Section 1.4 overviews the main results of this thesis.
1.1 Classification of Switch Architectures
Karol et al. [78] considered switches without any buffers. In these switches, when m cells destined
for the same output-port arrive in the same time-slot, m − 1 cells are dropped and one cell is
transmitted over the switch fabric.
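The drop behavior of such a bufferless switch can be sketched in a few lines of Python (the function name and traffic representation are ours, chosen for illustration, not taken from the thesis):

```python
import random

def bufferless_switch_slot(cells):
    """Simulate one time-slot of a bufferless N x N switch.

    `cells` maps each input-port to the output-port its cell is destined
    for.  When m cells contend for the same output-port, one cell is
    chosen (here, at random) to cross the fabric and m - 1 are dropped.
    """
    contenders_by_output = {}
    for inp, out in cells.items():
        contenders_by_output.setdefault(out, []).append(inp)
    delivered, dropped = {}, []
    for out, contenders in contenders_by_output.items():
        winner = random.choice(contenders)   # exactly one cell gets through
        delivered[out] = winner
        dropped.extend(i for i in contenders if i != winner)
    return delivered, dropped

# Inputs 0, 1, 2 all send to output 0; two of the three cells are dropped.
delivered, dropped = bufferless_switch_slot({0: 0, 1: 0, 2: 0, 3: 1})
```

Even this toy model shows why buffering is needed: any time-slot with output contention loses cells, regardless of how well-behaved the long-term rates are.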
Even in the case of uniform, well-behaved traffic, such a bufferless switch suffers from a large loss
ratio and low throughput. Hence, buffering within the switch is needed to handle conflicts among
different flows. The location of the buffers, their size, and their management, depend on the specific
architecture of the switch, and play a major role in its performance. Therefore, switches are often
classified according to their buffering strategy [16]. In the rest of this section we employ such a
classification and present common architectures by the location of their buffers.
Output-Queued Switches: In output-queued (OQ) switches, a cell arriving at the switch is
immediately transferred to the output-port it is destined for. At each time-slot, at most one cell leaves
each output-port; conflicting cells are queued in a buffer at the output-port.
Output-queued switches provide the highest throughput and lowest average cell-delay, since
cells are queued only when the output-port is transmitting a cell. Furthermore, traffic destined
for one output-port does not affect other output-ports, implying that misbehaving flows are easily
isolated.
However, since it is possible that in a specific time-slot all input-ports send a cell to the
same destination, output-ports are required to operate at the aggregate rate of the input-ports.
This implies that the output-queued switch architecture does not scale with the number of
external ports, and it is therefore impractical for high-speed switches with a large number of ports.
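To make the aggregate-rate requirement concrete, here is a minimal output-queued switch simulation (our own sketch, not code from the thesis): within a single time-slot every input may deliver a cell to the same output queue, while the queue drains only one cell per slot.

```python
from collections import deque

def oq_switch(traffic, N, num_slots):
    """Minimal output-queued switch model.

    `traffic[t]` is a list of (input_port, output_port) arrivals at slot t.
    Arriving cells are transferred immediately to their output queue;
    each output-port transmits at most one cell per time-slot.
    Returns the departures and the peak number of cells written to any
    single output queue in one slot (the aggregate-rate requirement).
    """
    queues = [deque() for _ in range(N)]
    departures, peak_writes = [], 0
    for t in range(num_slots):
        writes = [0] * N
        for inp, out in traffic.get(t, []):
            queues[out].append((t, inp))     # immediate transfer to output
            writes[out] += 1
        peak_writes = max(peak_writes, max(writes))
        for out in range(N):
            if queues[out]:                  # at most one departure per slot
                departures.append((t, out, queues[out].popleft()))
    return departures, peak_writes

# All 4 inputs hit output 0 in slot 0: the output buffer must absorb 4
# writes in one slot, then drains one cell per slot.
deps, peak = oq_switch({0: [(i, 0) for i in range(4)]}, N=4, num_slots=4)
```

The peak write count is exactly what forces the output memory to run at the aggregate rate of all inputs.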
Shared-Memory Switches: Shared-memory switches are a variant of output-queued switches
in which buffers are not dedicated to a specific output-port. Naturally, shared memory is more
flexible than dedicated memory, and shared-memory switches require a significantly smaller buffer
than output-queued switches, sometimes only two or three times larger than a single output-buffer
of an output-queued switch [139].
Since at each time-slot all the input-ports can write to the shared memory and all the output-ports
can read from it, the shared memory must operate at the aggregate rate of both the input-ports
and the output-ports. Hence, like output-queued switches, these switches are not practical if
the switch is large or operates at high speed.
Input-Queued and Combined-Input-Output-Queued Switches: Input Queued (IQ) switches,
with buffering at the input-ports, were suggested to reduce the rate at which memory units are
required to operate. Cells arriving at the switch are queued in FIFO input-buffers and are then
forwarded to the appropriate output-port, as dictated by a centralized scheduler. The switch fabric
that is used in IQ switches is a bufferless crossbar, which puts the following constraint on the
scheduler: At each time-slot at most one cell is forwarded from each input-port and to each output-
port.
The most notorious problem in input-queued switches is the head-of-line (HOL) blocking problem,
where a cell destined for an occupied output-port blocks other cells from being forwarded [78].
To eliminate such problems, different buffering policies were suggested. The most common one
is virtual output-queuing (VOQ) at the input-ports [133], where each input-buffer is divided into
N different FIFO queues according to the cell destination. In this case, the scheduler makes its
scheduling decisions based on the cells located at the head of each queue (i.e., N² cells).

Figure 2: Combined Input-Output Queued Switch with Virtual Output-Queuing. The switch fabric operates at rate S · R, where S is the speedup of the switch.

Since scheduling decisions are typically made at least once in every time-slot, the scheduler may become
the bottleneck for implementing a high-speed, large switch.
In addition to the virtual output queues, some input-queued switches have speedup; namely, the
switch fabric runs S ≥ 1 times faster than the external line rate (where S is the switch speedup),
enabling the switch to make S scheduling decisions every time-slot. When S > 1, a certain amount
of buffering must be done on the output side of the switch, and therefore such switches are usually
referred to as Combined Input-Output Queued (CIOQ) switches (Figure 2).
Buffered Crossbar Switches: Recently, CIOQ switches with additional (small) buffers in the
crosspoints were also considered. These buffered crossbar or combined input-crosspoint queued
(CICQ) switches circumvent the major constraint imposed by the bufferless crossbar fabric and
introduce orthogonality between the operations of input-ports and output-ports. This strong
property greatly simplifies the design of switching algorithms [38, 138], at the expense of having N²
additional buffers that must be allocated and managed.
Figure 3: A 5 × 5 PPS with 2 planes in its center stage, without buffers in the input-ports.
Parallel Packet Switches (PPS): Switching cells in parallel is a natural approach to building
switches with a very high external line rate and a large number of ports. A parallel packet switch
(PPS) [74] is a three-stage Clos network [41], with K < N switches in its center stage, also
called planes. Each plane is an N × N switch operating at rate r < R; each plane is connected
to all input-ports on one side, and to all output-ports on the other side (Figure 3). This model
is based on an architecture used in inverse multiplexing systems [53, 56], and especially on the
inverse multiplexing for ATM (IMA) technology [12, 36].
Iyer and McKeown [70] also consider a variant of the PPS architecture, called input-buffered
PPS, having finite buffers in its input-ports in addition to buffers in its output-ports.
Two additional architectures have a topology similar to the PPS. The Parallel Switching
Architecture (PSA) [109] has several combined input-output queued (CIOQ) switches operating in parallel
with no speedup, whereas the Switch-Memory-Switch (SMS) model [121, 122, 127] has M > N
parallel memories that reside between the input and output ports.
1.2 Evaluating the Performance of Switch Architecture
Switch architectures are evaluated by their ability to provide different QoS guarantees. Some of the
important performance figures are the maximum or average delay of cells, the switch throughput,
and cell loss probability. Contemporary network applications necessitate even more sophisticated
performance metrics (e.g., delay jitter). In Section 1.2.1, we discuss these metrics in detail.
These performance figures can be evaluated under different assumptions on the incoming traf-
fic.
A traditional approach is to model the arrival of cells as a stochastic process. The performance
figures are derived from the switch behavior in response to the arrival pattern, which is calculated
either using analytical probabilistic methods (e.g., traditional queuing theory) or using simula-
tions. The most common assumption is a uniform traffic, where cells arrival is an independent and
identically-distributed (i.i.d) Bernoulli process with parameter p, 0 < p ≤ 1, and cell destination is
chosen uniformly among all the output-ports. The simplicity of uniform traffic makes it attractive
in analytical evaluation; however, it usually leads to unrealistic overly optimistic results, compared
to real-life traffic.
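For instance, uniform Bernoulli traffic of this kind can be generated as follows (a small sketch; the function name and cell representation are ours):

```python
import random

def uniform_traffic(N, p, num_slots, seed=None):
    """Generate uniform i.i.d. Bernoulli traffic for an N x N switch.

    In each time-slot, each input-port receives a cell with probability p,
    destined for an output-port chosen uniformly at random.
    Returns a list of (time_slot, input_port, output_port) arrivals.
    """
    rng = random.Random(seed)
    cells = []
    for t in range(num_slots):
        for i in range(N):
            if rng.random() < p:                      # Bernoulli arrival
                cells.append((t, i, rng.randrange(N)))  # uniform destination
    return cells

cells = uniform_traffic(N=8, p=0.5, num_slots=1000, seed=1)
```

Over many slots, the per-input arrival rate concentrates around p, which is what makes this model so convenient for closed-form analysis.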
More sophisticated traffic patterns (e.g., on/off traffic [2, 64] or hot-spot traffic pattern [118])
were suggested to model real traffic more accurately. Unfortunately, such models generally tend
to be either unrealistically simple or too complex for closed-form analysis.
A contemporary approach is to use restrictive models that only bound the incoming traffic
rather than exactly characterize it. Our research focuses on such models, which capture the nature
of most known traffic patterns, and yet can be handled analytically.
These models are particularly appealing because a switch can be used as part of a network
(e.g., the Internet, LAN networks or WAN networks) whose traffic characterization can be very
different, and may even change over time. Therefore, a restrictive model that captures all traffic
patterns at once may yield more meaningful results than stochastic processes, which try to exactly
characterize the arrival pattern.
A prime example for a restrictive traffic model is the (R,B)-leaky-bucket model [140]. In this
model, it is only required that the combined rate of flows sharing the same input-port or the same
output-port does not exceed the external rate R of that port by more than a fixed bound B, which is
independent of time [34]. Other examples of restrictive models are the (R, T)-smooth model [60],
which was later used by Borodin et al. [28] for adversarial queuing theory and is often referred to
as the AQT(R, T) model. Traffic patterns that obey the strong law of large numbers [47] have also
been widely used recently, since they enable the use of fluid models [46] for evaluating the switch
behavior, although the arrival processes remain discrete [47].
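As an illustration of the leaky-bucket constraint, the following sketch (our own helper, with the port rate normalized to cells per time-slot) checks that over every time interval the arrivals at a port exceed the rate bound by at most B, using the standard running-backlog formulation:

```python
def leaky_bucket_admissible(arrivals, rate, burst):
    """Check the (R, B)-leaky-bucket constraint for one port.

    `arrivals[t]` is the number of cells seen at the port in time-slot t,
    `rate` is the port rate in cells per time-slot, and `burst` is B.
    The traffic is admissible iff, over every interval, the number of
    arrivals exceeds rate * length by at most `burst`; tracking the
    worst-case running excess (clamped at zero) checks all intervals.
    """
    backlog = 0.0
    for a in arrivals:
        backlog = max(0.0, backlog + a - rate)  # excess over the rate so far
        if backlog > burst:
            return False
    return True

assert leaky_bucket_admissible([1, 1, 1, 1], rate=1.0, burst=0)    # exactly at rate
assert leaky_bucket_admissible([3, 0, 0, 0], rate=1.0, burst=2)    # burst of B absorbed
assert not leaky_bucket_admissible([3, 3, 3], rate=1.0, burst=2)   # sustained excess
```

The clamp at zero is what makes a single pass equivalent to checking every interval: the backlog after slot t equals the largest excess of any interval ending at t.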
Our research takes a competitive approach for evaluating the behavior of switch architectures:
We compare the performance of a switch against an ideal shadow switch receiving the same in-
coming traffic [120], which may be unrestricted or obey one of the restrictive models described
earlier. As mentioned before, since output-queued switches are considered optimal with respect
to their delay and throughput performance, this comparison is referred to as output queued switch
emulation.
The measure of how closely a switch mimics the ideal switch depends on the relevant QoS de-
mand. For example, in [14, 70, 74, 86] the performance figure discussed is the queuing delay, and
therefore the competitive analysis yields the relative queuing delay; namely, the difference
in queuing delay between the evaluated switch and the shadow switch. When switches are allowed
to drop cells, and the figure of interest is the number of cells successfully delivered, competitive
analysis results in a competitive ratio (or equivalently the switch miss-fraction [113]).
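Given the recorded leave times of each cell in the evaluated switch and in the shadow switch, the relative queuing delay is simply their per-cell difference; a small sketch (the representation and names are ours):

```python
def relative_queuing_delay(leave_sw, leave_shadow):
    """Per-cell relative queuing delay against an ideal shadow switch.

    `leave_sw[c]` / `leave_shadow[c]` give the time-slot at which cell c
    leaves the evaluated switch / the shadow switch, for the same traffic.
    Returns the per-cell delays, their maximum, and their average.
    """
    delays = {c: leave_sw[c] - leave_shadow[c] for c in leave_sw}
    r_max = max(delays.values())
    r_avg = sum(delays.values()) / len(delays)
    return delays, r_max, r_avg

# Cells a, b, c leave an ideal OQ switch at slots 3, 4, 7 and the
# evaluated switch at slots 5, 4, 9: the relative queuing delay is
# at most 2 time-slots.
delays, r_max, r_avg = relative_queuing_delay({"a": 5, "b": 4, "c": 9},
                                              {"a": 3, "b": 4, "c": 7})
```

A bound on `r_max` over all traffics is exactly the kind of worst-case guarantee the competitive approach seeks.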
1.2.1 Performance Metrics
In this section, we survey the most common performance metrics used in current research to eval-
uate a single switch. Note that end-to-end evaluations (for example, over IP networks) are outside
the scope of this thesis.
Throughput and Stability: The throughput of a switch can be defined in several ways. One
common definition is the average number of cells which are successfully transmitted by the switch
per time-slot per input-port [16]. In other cases [30], throughput is defined as the maximum rate
at which none of the offered frames (in our case, cells) are dropped by the device (in our case, the
switch). In this case the throughput is usually normalized to the maximum theoretical rate (namely,
R). For example, a 100% throughput means that even in maximum load conditions, no cells are
dropped by the switch. Note that the relation between the cell loss rate and this notion of throughput is
not immediate: for example, in the extreme condition in which the switch drops a small number
of cells for any incoming traffic (even at the lowest rate), the switch throughput is 0%, while the
loss-rate is very small.
When discussing throughput, a definition of stability comes in handy [104]. A switch is stable (in
the strong sense) if the expected queue length does not grow without bound: that is, if for every
input-port i and output-port j, lim_{t→∞} E(Qi,j(t)) is finite, where Qi,j(t) is the number of cells from
flow (i, j) that are still queued in one of the switch buffers. A switch achieves 100% throughput
if and only if it is stable under all admissible traffic patterns; therefore, with finite buffers, no cells are
dropped [105].
A stronger stability measure is the ability of the switch to be work-conserving (greedy) [37,
88, 91]. A work-conserving switch guarantees that if a cell is pending for output port j at time-
slot t, then some cell leaves the switch from output-port j at time-slot t. This property prevents
an output-port from being idle unnecessarily, thereby ensuring the switch's stability, maximizing
its throughput and minimizing its average cell delay. Note that work conservation is a strictly
stronger property than stability since there are switches (e.g., the parallel packet switch, described
in Section 1.1) which are stable but not work-conserving.
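As a concrete illustration of this property, the following sketch (the code and all names are ours, not from the cited literature) checks work conservation on a recorded execution, given per-slot indications of which output-ports have pending cells and which transmitted a cell:

```python
# Sketch: checking the work-conservation property on a recorded execution.
# pending[t][j] is True if some cell destined for output-port j is queued at
# time-slot t; departed[t][j] is True if a cell left output-port j at slot t.

def is_work_conserving(pending, departed):
    """A switch is work-conserving if an output-port never idles while a
    cell destined to it is pending."""
    for t, (pend_t, dep_t) in enumerate(zip(pending, departed)):
        for j, has_pending in enumerate(pend_t):
            if has_pending and not dep_t[j]:
                return False  # output j idles at slot t despite a pending cell
    return True

# Toy execution with two output-ports over three time-slots.
pending  = [[True, False], [True, True], [False, True]]
departed = [[True, False], [True, True], [False, True]]
assert is_work_conserving(pending, departed)

departed_bad = [[True, False], [False, True], [False, True]]
assert not is_work_conserving(pending, departed_bad)
```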
Queue Length and Cell Loss Ratio: A more fine-grained measure than stability is bounding the
queue lengths (also referred to as backlogs), bounding the expected queue length or approximating
the distribution of the lengths (over time). Since buffer sizes play a major role in both the design
and the pricing of the switch, queue length bounds are very important performance figures. More-
over, the figures obtained have great practical importance and usually can be easily translated to
other important bounds (e.g., on cell delays and cell loss ratio).
As a simple example, for a work-conserving output-queued switch, the maximum buffer size
needed for any (R,B) leaky-bucket traffic is B cells [45]. In this case, no cells are dropped and
the maximum queue length is B; if the output queues operate under first-come-first-serve (FCFS)
policy, the maximum latency is B time-slots.
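This bound can be illustrated by a small simulation (an illustrative sketch under our own naming, with the line rate normalized to one cell per time-slot): a work-conserving FIFO output queue fed by conforming leaky-bucket traffic never holds more than B cells at the end of a time-slot.

```python
# Illustrative sketch (names ours): a work-conserving FIFO output queue that
# drains one cell per time-slot, fed by leaky-bucket traffic delivering at
# most t + B cells in any window of t slots, never backlogs more than B cells.

def conforms(arrivals, B):
    """Check the normalized (1, B) leaky-bucket constraint on arrivals."""
    n = len(arrivals)
    return all(sum(arrivals[s:s + l]) <= l + B
               for l in range(1, n + 1) for s in range(n - l + 1))

def max_backlog(arrivals):
    """Peak end-of-slot queue length of a work-conserving FIFO queue."""
    queue, peak = 0, 0
    for a in arrivals:
        queue += a          # cells arriving during this slot
        if queue > 0:
            queue -= 1      # the line drains one cell per slot
        peak = max(peak, queue)
    return peak

B = 3
arrivals = [4, 0, 0, 0, 1, 1, 1, 4, 0, 0, 0]   # bursty but conforming
assert conforms(arrivals, B)
assert max_backlog(arrivals) == B              # the bound B is attained
```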
If buffer sizes are bounded, cells that cannot be stored in the buffers are dropped. The number
of cells dropped, compared to the number of cells arrived at the switch, is captured by the cell loss
ratio. Clearly, characterizing the queue lengths is often a first step towards evaluating the cell loss
ratio.
[Figure: distribution of the number of cells as a function of delay, marking the propagation delay, average delay, maximum delay and delay jitter.]
Figure 4: Illustration of different delay metrics [79, Page 219].
Cell Delay and Queuing Delay: For a wide class of interactive or time-critical applications
(such as voice conversation and other diverse telecommunication services), cell delays are more
important than throughput [94]. In such applications excessive latency can inhibit usability of the
cells, and therefore should be avoided.
Naturally, the maximum cell delay may occur only in extreme situations, implying that this
metric can be overly pessimistic. In such cases, we are interested in the average cell delay.
The dominant causes of delay are fixed propagation delays (e.g., those arising from speed-
of-light restrictions) and queuing delays in the switch [49]. Since propagation delays are a fixed
property of the topology, cell delay is minimized when queuing delays are minimized. Further-
more, propagation delays are strongly dependent on technology; therefore, a switch architecture is
best evaluated through queuing delay.
Delay Jitter: Another important QoS parameter is the delay jitter, which is sometimes referred
to as the cell delay variation. Delay jitter is the difference between the maximal and minimal cell
transfer delays.
Guaranteeing certain delay jitter is especially important in interactive communication (such as
video or audio streaming). In such applications bounding the delay jitter is translated to bounds on
the buffer size at the destination.
Mansour and Patt-Shamir [100] define the delay jitter to measure how far the difference
between delivery times of different packets is from the ideal difference in a perfectly periodic sequence. In
the natural case, where the abstract source of the incoming traffic operates in a perfectly periodic
manner, both definitions are equivalent.
Other Measures: The ATM forum defines other quality-of-service parameters, which are not
used widely: Cell Misinsertion Rate (CMR), Cell Error Rate (CER) and Severely Errored Cell
Block Ratio (SECBR). Our research does not address these measures; we assume that all cells
are transmitted over the switch without errors or misinsertions.
1.3 The Packet Scheduling Process Bottleneck
After the switching process bottleneck is resolved (recall Figure 1), incoming packets are stored
at their destinations (that is, the buffers of the respective output-ports), waiting to be scheduled
out of the switch.
A packet-scheduler, which manages the buffers of a single output-port, is responsible for
deciding which packet leaves on the output-port's outgoing link in each time-slot. Depending on the
demands from the switch, the packet-schedulers are geared to ensure the relevant performance
metrics among those described in Section 1.2.1.
It is important to notice that typically flows from different sources traverse the switch at the
same time, and therefore compete for the same switch resources (for example, the switch buffers
or the switch internal transmission lines). One of the key roles of the packet scheduling process is
to protect well-behaved flows from misbehaved ones [34]. This is often called flow isolation, and
the ability to provide such isolation is one of the most important evaluation criteria for switching
architectures. The demand for flow isolation is sometimes formalized by the slightly stronger con-
cept of fairness: When several flows are equally important (i.e., demand the same QoS guarantees)
they should be treated fairly by the switch, and obtain an equal fraction of the switch resources.
Well-known approaches to solve flow isolation and fairness are by allocating per-flow buffers [111]
or by using appropriate queuing disciplines (e.g., GPS [115] or WFQ [50, 147]).
The problem of packet scheduling becomes even more difficult when the buffers of the output-
port have only bounded size. In such cases, the packet schedulers cannot handle every traffic
pattern and some packets must be dropped. The most common and simple drop mechanism is
tail-drop, in which incoming packets are dropped if the buffer is full. However, modern switches
and routers often implement more sophisticated drop mechanisms such as Random Early Detection
(RED) [55], which aims to optimize the performance of TCP traffic.
In this research, we take a competitive analysis point of view also when evaluating the packet
scheduling process. We compare the performance of these schedulers to an optimal scheduler
that uses the same buffer size but has complete knowledge of future packet arrivals (that is, an
offline algorithm). In addition, in order to investigate the trade-off between the buffer size and the
scheduler performance, we also investigate resource augmentation scenarios, in which the packet-
scheduler compensates for its lack of knowledge by using additional buffers.
Note that the switching problem and the packet scheduling problem are not orthogonal: One
can devise switching algorithms that aim at optimizing certain QoS guarantees; such switching
algorithms are especially important in IQ switches (recall Section 1.1), in which there are no buffers
in the output-ports. However, the problems often become independent if the switching algorithm
provides output-queued emulation.
1.4 Overview of the Thesis
1.4.1 Relative Queuing Delay in Parallel Packet Switches
We provide analysis of the relative queuing delay of cells in a PPS compared to an ideal switch,
capturing the influence of parallelism on PPS performance [13, 14]. Our lower and upper bounds
on the relative queuing delay depend on the amount and type of information used for balancing the
load among the lower-speed switches and indicate significant differences in the PPS performance:
sharing even out-dated information among input-ports can significantly improve the switch performance.
An attractive paradigm for balancing load on the average is with randomization [17, 108]; even
a very simple strategy ensures, with high probability, maximum load close to the optimal distri-
bution [61]. Given these successful applications of randomization in traditional load balancing
settings and in other high-bandwidth switches [58, 136], it is tempting to employ randomization
in parallel packet switches in order to improve their performance. Nevertheless, we show that
randomization does not help to decrease the average relative queuing delay. This surprising result
holds because the common practice is that switches should not mis-sequence cells [81]. This prop-
erty allows an adversary to exploit a transient increase of the relative queuing delay and perpetuate
it long enough to increase the average relative queuing delay.
On the positive side, we introduce a generic methodology for analyzing the maximal relative
queuing delay by measuring the imbalance between the lower-speed switches. The methodology
is used to devise new optimal algorithms that rely on slightly out-dated global information on the
switch status. It is also used to provide a complete proof of the maximum relative queuing delay
achieved by the fractional traffic dispatch algorithm [72, 86].
These results are discussed in Chapter 4.
1.4.2 Packet-mode Scheduling in Combined Input-Output Queued Switches
The need for packet-mode schedulers arises from the fact that in most common network protocols,
traffic is comprised of variable size packets (e.g., IP datagrams), while real-life switches store and
transmit packets as fixed-sized cells, with fragmentation and reassembly done outside the switch.
Packet-mode schedulers consider the linkage between cells that correspond to the same packet and
are constrained so that such cells should be delivered from the switch contiguously [57]. Packet-
aware scheduling schemes avoid the overhead induced by packet segmentation and reassembly that
can become very significant at high speeds.
We devise coarse-grained schedulers that allow a packet-mode combined input output queued
(CIOQ) switch with small speedup to mimic an ideal output-queued switch with bounded relative
queuing delay [15]. The schedulers are coarse-grained, making a scheduling decision every certain
number of time-slots and work in a pipelined manner based on matrix decomposition.
Our schedulers demonstrate a trade-off between the switch speedup and the relative queuing
delay incurred while mimicking an output-queued switch. When the switch is allowed to incur
high relative queuing delay, a speedup arbitrarily close to 2 suffices to mimic an ideal output-
queued switch. This implies that packet-mode scheduling does not require higher speedup than a
cell-based scheduler. The relative queuing delay can be considerably reduced with just a doubling
of the speedup. We also show that it is impossible to achieve zero relative queuing delay (that is, a
perfect emulation), regardless of the switch speedup.
We further evaluate the performance of our scheduler through extensive simulations, both un-
der real-life traffic traces and under various stochastic traffic models. These simulations clearly
indicate that in practice this scheduler performs significantly better than its theoretical bound.
These results are presented in Chapter 5.
1.4.3 Jitter Regulation for Multiple Streams
We also investigate the packet scheduling process. Specifically, we refer to each output-port as a
stand-alone environment with a bounded-sized buffer, and study how different QoS measures can
be guaranteed for a traffic traversing such an environment. A prime example is a jitter regulator,
which should shape the incoming traffic to be perfectly periodic, using a bounded-sized buffer.
While previous work on this topic [100] handles only a single stream, we show upper and
lower bounds for multiple stream jitter regulation in offline and online settings [63]: In the offline
setting, jitter regulation can be solved in polynomial time, while in the online setting a buffer
augmentation is needed in order to compete with the optimal algorithm; the amount of buffer
augmentation depends linearly on the number of streams.
Chapter 6 presents our results on jitter regulation.
Chapter 2
Background
In this chapter we survey the most relevant background to our research. Section 2.1 deals with
related work on CIOQ switches. Section 2.2 describes the research done on OQ emulation. In
Section 2.3, we present the known results on Parallel Packet Switches and related architectures.
We conclude in Section 2.4, by describing the prior work on jitter regulation.
2.1 Prior Work on CIOQ Switches
In a combined input-output queued (CIOQ) switch, described in Section 1.1, arriving cells are
first stored in the input side of the switch and then forwarded over a crossbar switch fabric to the
output-side as dictated by a scheduling algorithm.
The switch fabric operates S times faster than the external rate, where S is the speedup of the
switch, and imposes the following major constraint on the scheduling algorithm: At each time-slot
at most S cells can be forwarded from any input-port and at most S cells can be sent to any output-
port. An alternative approach to express this constraint is by defining a scheduling opportunity (or
scheduling decision), in which at most one cell is forwarded from each input-port and to each
output-port. The speedup S implies that the switch has S scheduling opportunities every time-slot.
A common approach to solve the scheduling problem in CIOQ switches is to refer to the switch
as a bipartite graph G(t) = 〈V1, V2, E〉, where V1 is the set of input-ports, V2 is the set of output-
ports and an edge (v1, v2) ∈ E exists if and only if there is a cell waiting for scheduling from
input-port v1 to output-port v2 at time-slot t. Note that after each scheduling action, a new graph
should be obtained.
A solution to the classic maximum size matching problem achieves 100% throughput under
uniform i.i.d. traffic, but can lead to instability and unfairness with other traffic patterns [104].
These results can be improved by assigning weights on the edges and solving the maximum weight
matching problem. It is shown [104] that if the weights are assigned according to the lengths of
the queues (LQF) or the waiting time of the cell at the head of the queue (OCF), 100% throughput
can be achieved even for non-uniform traffic. However, these algorithms are infeasible in large
and high-speed switches due to the high complexity of maximum weight matching solutions.
To overcome this problem, several algorithms that are based on a solution of maximal match-
ing were proposed. In general, these algorithms operate in iterations, such that in each iteration
an unmatched input-port picks an unmatched output-port and adds the edge to the matching, until
the matching converges to a maximal matching (usually after N iterations). The difference be-
tween the maximal-matching based algorithms is in the way conflicting requests are resolved. In
Parallel Iterative Matching (PIM) [5] these requests are resolved randomly (implying that with
high probability O(log N) iterations suffice for the algorithm to converge), while in iSLIP [103]
these requests are resolved using round-robin pointers. Other maximal matching based algorithms
include Wave Front Arbiters (WFA) [132], iLQF [105] and iOCF [105].
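The iterative structure common to these algorithms can be sketched as follows, here with PIM-style random conflict resolution (an illustrative simplification of ours, not the exact published algorithm):

```python
# Sketch of iterative maximal matching with random conflict resolution, in
# the spirit of PIM: free inputs request free outputs, each output grants one
# request at random, each input accepts one grant at random; repeat until no
# edge can be added, yielding a maximal matching.
import random

def pim_match(requests, rng=random.Random(0)):
    """requests maps each input-port to the list of output-ports it has
    cells for; returns a list of matched (input, output) pairs."""
    matched_in, matched_out, matching = set(), set(), []
    while True:
        # Grant phase: collect, per free output, all requesting free inputs.
        grants = {}
        for i, outs in requests.items():
            if i in matched_in:
                continue
            for j in outs:
                if j not in matched_out:
                    grants.setdefault(j, []).append(i)
        if not grants:
            return matching  # no augmenting edge left: matching is maximal
        # Each output grants to one random input; each input accepts one grant.
        offers = {}
        for j, inputs in grants.items():
            offers.setdefault(rng.choice(inputs), []).append(j)
        for i, outs in offers.items():
            j = rng.choice(outs)
            matching.append((i, j))
            matched_in.add(i)
            matched_out.add(j)

requests = {0: [0, 1], 1: [0], 2: [1, 2]}
m = pim_match(requests)
assert all(j in requests[i] for i, j in m)   # only requested edges are used
```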
Nevertheless, the scheduling algorithm complexity is typically the main performance limitation
of CIOQ switches [24], since scheduling decisions are done every time-slot, meaning that the
scheduling algorithm speed should be at least the same as the external line rate.
One approach to overcome this scheduling complexity is by using randomization, which was
proven extremely successful in simplifying the implementation of algorithms. A prime example
of a linear-time randomized algorithm for CIOQ switches was presented by Tassiulas [136], who
proposed to compare, at each scheduling decision, the weight of the current matching to the weight
of a randomly chosen matching. Giaccone et al. [58] later improved this algorithm and showed
how such randomized algorithms can achieve good delay performance.
Another approach to solve the scheduling problem is by matrix decomposition [31]. Such
solutions assume that the arrival rate of each flow is known and decompose the arrival traffic
rate matrix Λ = [λi,j] (λi,j is the rate of flow (i, j)) into permutation matrices, which are used
periodically as scheduling decisions.
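For intuition, the following sketch decomposes a nonnegative integer rate matrix whose row and column sums are all equal into weighted permutations, in the style of the Birkhoff-von Neumann decomposition (an illustrative implementation of ours; the perfect matching is found with standard augmenting paths):

```python
def find_perfect_matching(support):
    """Perfect matching on the bipartite support graph, via Kuhn-style
    augmenting paths.  support[i] is the set of columns with a positive
    entry in row i; a perfect matching exists whenever all row and column
    sums of the matrix are equal and positive."""
    n = len(support)
    match_col = {}          # column -> row

    def try_assign(i, seen):
        for j in support[i]:
            if j in seen:
                continue
            seen.add(j)
            if j not in match_col or try_assign(match_col[j], seen):
                match_col[j] = i
                return True
        return False

    for i in range(n):
        assert try_assign(i, set())
    return {i: j for j, i in match_col.items()}   # row -> column

def birkhoff_decompose(M):
    """Decompose M (nonnegative integers, equal row/column sums) into
    (weight, permutation) pairs; the permutations can then be used as
    periodic scheduling decisions."""
    M = [row[:] for row in M]
    n = len(M)
    result = []
    while any(any(row) for row in M):
        perm = find_perfect_matching(
            [{j for j in range(n) if M[i][j] > 0} for i in range(n)])
        w = min(M[i][perm[i]] for i in range(n))
        for i in range(n):
            M[i][perm[i]] -= w   # subtracting w keeps row/col sums equal
        result.append((w, perm))
    return result

Lam = [[2, 1, 1],
       [1, 2, 1],
       [1, 1, 2]]                       # every row and column sums to 4
parts = birkhoff_decompose(Lam)
assert sum(w for w, _ in parts) == 4    # the weights cover a frame of 4 slots
```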
A recent approach is to use a coarse-grained scheduler, which makes a scheduling decision
every predetermined number of time-slots [24, 114]. In such algorithms, a frame is defined as τ
consecutive time-slots, and scheduling decisions are made at the boundaries of these frames. The
scheduling decision should encompass all necessary information for the input-port to schedule
cells for τ time-slots. Notice that matrix decomposition techniques are a promising approach in
devising such frame-based schedulers, as was previously proposed for optical switching and in
satellite-switched time-division multiple access (SS/TDMA) schedulers [83, 95, 137, 145].
Aggarwal et al. [3] combine these three approaches and devise a randomized, coarse-grained
algorithm for matrix decomposition. Basically, they view the matrix decomposition problem
as coloring a bipartite multi-graph, and propose a randomized edge-coloring algorithm that colors
the graph with as few colors as possible. Their algorithm achieves nearly optimal results with very
low implementation complexity.
All the above-mentioned schedulers deal only with fixed-size cells. However, some schedulers
that handle variable size packets directly were also proposed. Previous work [57, 101] considers
packet-mode scheduling in an input-queued (IQ) switch with crossbar fabric and no speedup. It
proves analytically that packet-mode IQ switches can achieve 100% throughput, provided that
the input-traffic is well-behaved; this matches the earlier results on cell-based scheduling [104].
Marsan et al. [101] also show that under low load and small packet size variance, packet-mode
schedulers may achieve better average packet delay than cell-based schedulers. A different line
of research used competitive analysis to evaluate packet-mode scheduling, when each packet has
a weight representing its importance and a deadline until which it should be delivered from the
switch [62].
2.2 Prior Work on Output-Queued Switch Emulation
The question whether a feasible switch architecture can emulate an output-queued switch was
first raised by Prabhakar and McKeown in the context of combined input-output queued switches [120].
They answered this question in the affirmative and presented the first output-queued emulation al-
gorithm, called most urgent cell first (MUCFA), which requires a speedup of 4. Following
this seminal paper, other works investigated the speedup required for a CIOQ switch to emulate
an output-queued switch [35, 91, 130, 131]. A prime example is the critical cells first (CCF)
algorithm [37], which allows a CIOQ switch with speedup of at least 2 to emulate (exactly) an
output-queued switch. In addition, a matching lower bound was also proven [37]: A CIOQ switch
needs speedup S ≥ 2 − 1/N in order to emulate an output-queued switch. Notice that none of
these algorithms makes any assumptions on the incoming traffic.
The demand for exact emulation is sometimes relaxed to allow the investigated switch archi-
tecture to lag behind the OQ switch by a fixed and predetermined relative queuing delay [72].
We refer to such relaxed emulation as OQ switch mimicking. In this context, the ability of a CIOQ
switch with small speedup (that is, S < 2) to mimic an OQ switch was investigated in [59]: when
the traffic is well-behaved (that is, obeys one of the restrictive models described in Section 1.2) the
demand of speedup S ≥ 2 − 1/N can be relaxed at the expense of a bounded relative queuing
delay. In Chapter 5, we show that a packet-mode CIOQ switch can provide OQ mimicking.
The ability to emulate or mimic an OQ switch was also investigated under other switch archi-
tectures.
A recent line of research deals with buffered crossbar (described in Section 1.1). Magill et
al. [99] showed that a buffered crossbar with S = 2 can emulate a first-come-first-serve (FCFS)
output-queued switch with any arrival pattern. Furthermore, they also showed that if the buffers at
the crosspoints are of size at least k, more general queuing disciplines can be emulated, namely a
FCFS with k strict priorities. These results were further improved in [38], showing that an OQ switch
with any weighted round-robin scheduler can be emulated using a fully-distributed algorithm (that
is, each input-port and each output-port make independent decisions). Turner [138] investigated
packet-mode schedulers in buffered crossbars and showed that a buffered crossbar switch with
speedup 2 and crosspoint buffers of size 5Lmax, where Lmax is the maximum packet size, can
mimic an output-queued switch with relative queuing delay of (7/2)Lmax time-slots.
2.3 Prior Work on Parallel Packet Switches
The parallel packet switch architecture was first considered by Iyer et al. [70, 72, 74], who evalu-
ated its ability to mimic output-queued switches. Iyer et al. [74] introduced the Centralized PPS
Algorithm (CPA) that allows a PPS with speedup S ≥ 2 to mimic a FCFS output-queued switch
with zero relative queuing delay; here, the speedup S of the switch is the ratio of the aggregate
capacity of the internal traffic lines, connected to an input- or output-port, to the capacity of its
external line (namely, S = Kr/R).
Unfortunately, these algorithms are impractical for real switches, because they gather infor-
mation from all the input-ports in every scheduling decision. To overcome this problem, Iyer and
McKeown [72] suggest a fully-distributed algorithm that works with speedup S = 2 and mimics
a FCFS output-queued switch with relative queuing delay of ⌈NR/r⌉ time-slots. Another family
of fully-distributed algorithms, called fractional traffic dispatch (FTD) [86], works with switch
speedup S ≥ K/⌈K/2⌉, and their relative queuing delay is at least 2NR/r time-slots. However, these
papers did not provide complete and precise proofs for the correctness of the proposed algorithms
and their performance (see Remarks 1 and 2 in Chapter 4 for further details).
The requirement for additional speedup is relaxed by adding buffers in the demultiplexors.
For such an input-buffered PPS, Iyer and McKeown [72] suggest a fully-distributed algorithm that
allows a PPS with speedup S = 1 to mimic a FCFS output-queued switch with relative queuing
delay of ⌈2NR/r⌉ time-slots.
2.4 Prior Work on Delay Jitter Regulation
The problem of jitter control has received much attention in recent years, along with the increasing
importance of providing QoS guarantees. A prime example is the Differentiated Services (Diff-
Serv) architecture, in which there is a specific requirement to maintain low jitter for Expedited
Forwarding (EF) traffic [49].
Jitter regulators, which capture jitter control mechanisms, use an internal buffer to shape the
traffic [79, 100, 149]. These regulators typically use scheduling algorithms that are not work-
conserving, i.e., they might delay releasing a cell even if there are cells in the buffer and the
outgoing links are not fully utilized.
Several algorithms have been proposed with the aim of providing traffic jitter control: A jitter
control algorithm which reconstructs the entire sequence at the destination using a predetermined
maximum delay bound was proposed in [116]. The Jitter-Earliest-Due-Date algorithm proposed
in [143] uses a predetermined maximum delay bound in order to calculate a deadline for every
cell, such that it is released precisely upon its deadline. The Stop-and-Go algorithm proposed
in [60] uses time frames of predetermined lengths in order to regulate traffic, such that cells arriv-
ing in the middle of a frame are only made available for sending in the following time frame. The
Hierarchical-Round-Robin algorithm proposed in [76] uses a framing strategy similar to the one
used in the Stop-and-Go algorithm, but releases are governed by a round robin policy that some-
times allocates non-utilized release time-slots to other streams. Other jitter control algorithms are
surveyed thoroughly in [147].
A slightly different line of research investigated jitter regulation in the Combined Input-Output
Queue switch architecture, forcing the jitter regulator to obey additional constraints posed by the
switching architecture [83].
The problem of jitter control in an adversarial setting was studied by Mansour and Patt-Shamir [100]
in a simplified single-stream model, with only a single abstract source. They present an efficient
offline algorithm, which computes an optimal release schedule in these settings. They further de-
vise an online algorithm, which uses a buffer of size 2B, and produces a release schedule with
the optimal jitter attainable with buffer of size B, and then show a matching lower bound on the
amount of resource augmentation needed, proving that their online algorithm is optimal in this
sense.
For the same model, Koga [89] presents an optimal offline algorithm and a nearly optimal
online algorithm for the case where a cell can be stored in the buffer at most a predetermined
amount of time.
The burstiness of the traffic is also captured by its rate jitter, which was first defined as the
short-term average rate of a traffic [76]. Mansour and Patt-Shamir [100] introduced another defini-
tion for the rate jitter, which bounds the difference in cell delivery rates at various times. Since the
difference in delivery time between two successive cells is the reciprocal of the instantaneous delivery
rate, the rate jitter is defined as the difference between the maximal and minimal inter-departure
time.
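Under this definition, computing the rate jitter of a release schedule is immediate (a sketch of ours):

```python
# Sketch: the rate jitter of a release schedule, following the definition
# cited above: the difference between the maximal and minimal inter-departure
# times of successive cells.

def rate_jitter(departures):
    gaps = [b - a for a, b in zip(departures, departures[1:])]
    return max(gaps) - min(gaps)

assert rate_jitter([0, 5, 10, 15]) == 0   # perfectly periodic: zero jitter
assert rate_jitter([0, 4, 10, 15]) == 2   # gaps 4, 6, 5 -> 6 - 4 = 2
```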
Note that delay jitter is more restrictive than rate jitter [100]. Therefore, unlike delay jitter
regulators which completely reconstruct the incoming traffic, rate jitter regulators typically just
partially reconstruct the traffic, and therefore are easier to implement [147].
Chapter 3
Model Definitions
An N × N switch handles either fixed-size cells or variable-size packets arriving at N input-ports
at rate R and destined for N output-ports working at the same rate R. Packets (cells) arrive at the
input-ports and leave the output-ports in a time-slotted manner (that is, all the switch external lines
are synchronized). For variable-size packets, we refer to each part of the packet that is transmitted
during a single slot as a single fixed-size cell, and measure the packet size in cell-units. Unless
otherwise specified, we assume the switch does not drop cells.
For every cell c, ta(c) denotes the time-slot in which cell c arrived at the switch. In addition,
we denote by orig(c) and dest(c) the input-port at which c arrives and the output-port for which c
is destined. packet(c) denotes the packet that corresponds to cell c; first(p), last(p) are the first
and last cells of packet p.
The definitions in the rest of this chapter assume that only fixed-size cells arrive at the switch.
However, they can be easily extended to hold also for variable-size packets.
A traffic T is a finite collection of cells, such that no two cells arrive at the same input-port at
the same time-slot. A flow (i, j) is the collection of cells sent from input-port i to output-port j.1
The projection of a traffic T on a set of input-ports I, denoted by T|I, is {c ∈ T | orig(c) ∈ I}. Since
for any input-port i and traffic T , there are no two cells c1, c2 ∈ T |i such that ta(c1) = ta(c2), the
arrival times of cells in T |i induce a total order on them.
1It is important to notice that a flow at the switch level may correspond to several flows at the network level, all sharing the same input-port and the same output-port of the switch.
For any cell c, shift(c, t) is a cell with the same origin and destination such that ta(shift(c, t)) =
ta(c) + t. The shift operation is used for concatenating two finite traffics, T1 and T2, so that T2
starts after the last cell of traffic T1. Formally, T1 · T2 is the traffic T1 ∪ {shift(c, t) | c ∈ T2}, where t = 1 + max{ta(c) | c ∈ T1}.
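The cell and traffic operations above can be sketched in code as follows (an illustrative rendering of ours, mirroring the notation of this chapter):

```python
# Sketch of the traffic model: a cell records its arrival slot, origin and
# destination; shift delays a cell, and concatenation appends traffic T2
# after the last cell of T1.
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    ta: int      # arrival time-slot
    orig: int    # input-port
    dest: int    # output-port

def shift(c, t):
    return Cell(c.ta + t, c.orig, c.dest)

def concat(T1, T2):
    t = 1 + max(c.ta for c in T1)
    return T1 | {shift(c, t) for c in T2}

T1 = {Cell(0, 0, 1), Cell(2, 1, 1)}
T2 = {Cell(0, 0, 0)}
T = concat(T1, T2)
assert max(c.ta for c in T) == 3   # T2's cell now arrives at slot 0 + 3
```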
ESW (ALG, T ) is the execution of the switch, using scheduling algorithm ALG in response to
incoming traffic T . If ALG is a randomized algorithm, we denote by ESW (ALG, σ, T ) the execution
ESW (ALG, T ) taking into account the coin-tosses sequence σ obtained by the algorithm. The exact
definition of the execution is determined by the switch architecture that is investigated. Yet, given
the execution ESW (ALG, σ, T ) one can determine uniquely the time-slot in which cell c leaves
the switch for every cell c ∈ T . This time-slot is denoted by tlSW (c, T ) (or tlSW (σ, c, T ) if the
algorithm is randomized).
The switch is compared to a work-conserving shadow switch that receives the same traffic T ,
and obeys the per-flow FCFS discipline; that is, cells with the same origin and the same destination
should leave the switch in their arrival order. We denote the execution of the shadow switch in
response to traffic T by ES(T ), and the time a cell c ∈ T leaves the shadow switch by tlS(c, T ).
Note that tlS(c, T ) ≥ ta(c) + 1.
The relative queuing delay of a cell c ∈ T under a scheduling algorithm ALG and a coin-tosses
sequence σ is R(ALG, σ, c, T) = tlSW(σ, c, T) − tlS(c, T).
Definition 1 For traffic T, scheduling algorithm ALG and coin-tosses sequence σ, the maximum
relative queuing delay Rmax(ALG, σ, T) is max_{c∈T} R(ALG, σ, c, T), and the average relative
queuing delay Ravg(ALG, σ, T) is (1/|T|) Σ_{c∈T} R(ALG, σ, c, T).
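Given recorded departure times, Definition 1 translates directly into code (a sketch of ours; tl_sw and tl_s stand for tlSW and tlS):

```python
# Sketch computing the two metrics of Definition 1 from recorded departure
# times: tl_sw[c] and tl_s[c] are the slots at which cell c left the
# evaluated switch and the shadow switch, respectively.

def relative_delays(tl_sw, tl_s):
    return {c: tl_sw[c] - tl_s[c] for c in tl_sw}

def r_max(tl_sw, tl_s):
    return max(relative_delays(tl_sw, tl_s).values())

def r_avg(tl_sw, tl_s):
    d = relative_delays(tl_sw, tl_s)
    return sum(d.values()) / len(d)

tl_sw = {'c1': 4, 'c2': 7, 'c3': 9}
tl_s  = {'c1': 3, 'c2': 7, 'c3': 6}
assert r_max(tl_sw, tl_s) == 3        # cell c3: 9 - 6
assert r_avg(tl_sw, tl_s) == 4 / 3    # (1 + 0 + 3) / 3
```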
The maximum relative queuing delay of an algorithm ALG against an adversary A is denoted
R^A_max(ALG). Specifically, R^A_max(ALG) ≥ R with probability 1 − δ if adversary A can construct a
traffic T such that Pr_σ[Rmax(ALG, σ, T) ≥ R] ≥ 1 − δ.
The average relative queuing delay of an algorithm ALG against an adversary A is denoted
RAavg(ALG). Specifically, RA
avg(ALG) ≥ R with probability 1 − δ if adversary A can construct a
traffic T such that Prσ [Ravg(ALG, σ, T ) ≥ R] ≥ 1− δ.
If a switch architecture has a scheduling algorithm ALG such that Rmax(ALG) = 0, we say that
the switch architecture emulates an ideal switch. If the switch architecture has a scheduling
algorithm ALG for which Rmax(ALG) is bounded, we say that the switch architecture mimics an
ideal switch.
The per-flow delay jitter of a traffic T under a scheduling algorithm ALG and a coin-tosses
sequence σ is the maximal difference in queuing delay between cells originating at the same input-port
and destined for the same output-port. Specifically:
Definition 2 For traffic T , scheduling algorithm ALG and coin-tosses sequence σ, the delay of a
cell c in T is delay(ALG, σ, c, T ) = tlSW (σ, c, T )− ta(c).
The delay jitter of a flow (i, j) in T is
JSW(ALG, σ, T, i, j) = max_{c∈Ti,j} delay(ALG, σ, c, Ti,j) − min_{c∈Ti,j} delay(ALG, σ, c, Ti,j),
where Ti,j = {c ∈ T | orig(c) = i and dest(c) = j}. The per-flow delay jitter of the traffic T is
JSW(ALG, σ, T ) = max_{i,j} JSW(ALG, σ, T, i, j).
Similarly, let JS(T ) be the per-flow delay jitter of traffic T under the shadow switch. The
relative delay jitter is formally defined as follows:
Definition 3 For traffic T , scheduling algorithm ALG and coin-tosses sequence σ, the relative
delay jitter, denoted by J , is the difference between the per-flow delay jitter of the switch and
the per-flow delay jitter of an optimal shadow work-conserving switch, that is J (ALG, σ, T ) =
JSW (ALG, σ, T )− JS(T ).
Note that if Rmax(ALG, σ, T ) = 0 then J (ALG, σ, T ) = 0 as well.
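Given the departure times in both switches, the quantities of Definitions 1 and 2 reduce to simple bookkeeping. A minimal sketch (the cell encoding and the departure tables are our own placeholders, not part of the model):

```python
def relative_delays(tl_sw, tl_s):
    """Per-cell relative queuing delay R(c) = tlSW(c) - tlS(c).

    tl_sw, tl_s: dicts mapping each cell to its departure time-slot in the
    switch under test and in the shadow switch, respectively.
    Returns (R_max, R_avg) over the traffic.
    """
    r = [tl_sw[c] - tl_s[c] for c in tl_sw]
    return max(r), sum(r) / len(r)

def per_flow_jitter(departure, cells):
    """Per-flow delay jitter: the spread of delay(c) = tl(c) - ta(c) within
    each flow (i, j), maximized over all flows.

    cells: iterable of (orig, dest, ta) tuples; departure: cell -> tl(c).
    """
    flows = {}
    for c in cells:
        orig, dest, ta = c
        flows.setdefault((orig, dest), []).append(departure[c] - ta)
    return max(max(ds) - min(ds) for ds in flows.values())
```

Subtracting the two per-flow jitters then gives the relative delay jitter of Definition 3.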
Chapter 4
Relative Queuing Delay in PPS
One of the key issues in the design of a PPS (recall Figure 3) is balancing the load of switching
operations among the middle-stage switches, thereby utilizing the parallel capabilities of the
switch. Load balancing is performed by a demultiplexing algorithm, whose goal is to prevent a
disproportionate number of cells from concentrating in a small number of middle-stage switches.
Demultiplexing algorithms can be classified according to the amount and type of information
they use. The strongest demultiplexing algorithms are centralized ones, which make demultiplexing
decisions based on global information about the status of the switch. Unfortunately, these
algorithms must operate at a speed proportional to the aggregate incoming traffic rate, and are
therefore impractical. At the other extreme, fully-distributed demultiplexing algorithms rely
only on the local information in the input-port.1 Due to their relative simplicity, they are common
in contemporary switches. A realistic middle ground is what we call u real-time distributed (u-RT)
demultiplexing algorithms, in which a demultiplexing decision is based on local information
and on global information that is more than u time-slots old. Obviously, every fully-distributed
algorithm is also a u real-time distributed algorithm.
The relative queuing delay of the PPS (Definition 1) captures the influence of the parallelism of
the PPS on the performance of the switch, depending on the different demultiplexing algorithms,
and ignores the specific PPS hardware implementation. As we shall prove, the relative queuing
1These are also called independent demultiplexing algorithms [68].
delay is determined solely by the balancing of cells among the planes.
Randomization is successfully applied in traditional load-balancing settings [17, 61, 108] and
in other high-bandwidth switches [58, 136]: even a very simple strategy ensures (with high prob-
ability) a maximum load close to the optimal distribution. Therefore, it is tempting to employ ran-
domization to reduce the average imbalance between planes and thereby reduce the average relative
queuing delay.
Our main contributions are lower and upper bounds on the relative queuing delay of the PPS.
Our lower bounds hold even when the PPS has to deal only with well-behaved traffics that obey
the leaky-bucket model [140], which makes our results stronger. In addition, we show that ran-
domization does not help to decrease the average relative queuing delay. This somewhat surprising
result holds due to the requirement that switches should not mis-sequence cells [81]. This property
allows an adversary to exploit a transient increase of the relative queuing delay and perpetuate it
sufficiently long to increase the average relative queuing delay. On the other hand, we devise a
general methodology for analyzing the maximum relative queuing delay from above; clearly, this
also bounds (from above) the average relative queuing delay.
4.1 Summary of Our Results
Deterministic Lower Bounds
A bufferless PPS (i.e., without buffers at the input-ports) with a fully-distributed demultiplexing
algorithm incurs the highest relative queuing delay and relative delay jitter. If some plane is utilized
by all the demultiplexors, we prove a lower bound of (R/r − 1)N time-slots on the relative queuing
delay and relative delay jitter, where R/r is the ratio between the PPS external and internal rates.
Even in the unrealistic and failure-prone case where the planes are statically partitioned among the
demultiplexors, the relative queuing delay and relative delay jitter are at least (R/r − 1)(N/S)
time-slots. Both lower bounds employ leaky-bucket flows with no bursts.
A bufferless PPS with a u-RT demultiplexing algorithm (for any u) has relative queuing delay
and relative delay jitter of at least (1 − û·r/R)·û·N/S time-slots, where û = min{u, (1/2)·(R/r)}.
In contrast,
Demultiplexor Type    Planes          OC-3     OC-12    OC-48
Fully-distributed     unpartitioned   64,512   15,360   3,072
                      partitioned     32,256   7,680    1,536
1-RT                                  504      480      384
Centralized                           0        0        0

Table 4.1: The relative queuing delay (in time-slots) of a bufferless OC-192 PPS with 1024 ports and speedup 2.
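The table entries follow directly from the lower-bound formulas of this section. A short computation reproduces them, assuming the standard SONET rate ratios for an OC-192 switch with OC-3/OC-12/OC-48 planes (R/r = 64, 16 and 4, respectively), N = 1024 and S = 2; the helper name is ours:

```python
N, S = 1024, 2                                   # external ports, speedup
ratios = {"OC-3": 64, "OC-12": 16, "OC-48": 4}   # R/r for an OC-192 PPS

def table_row(rr, N=N, S=S, u=1):
    """Evaluate the three lower bounds for one value of r' = R/r."""
    unpart = (rr - 1) * N                 # fully-distributed, unpartitioned
    part = (rr - 1) * N // S              # fully-distributed, partitioned
    u_hat = min(u, rr // 2)               # u-hat = min{u, R/2r}
    u_rt = int((1 - u_hat / rr) * u_hat * N // S)  # u-RT bound
    return unpart, part, u_rt

for name, rr in ratios.items():
    print(name, table_row(rr))
```

With u = 1 this yields exactly the 1-RT row of Table 4.1.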
Iyer et al. [72] present a centralized demultiplexing algorithm for a bufferless PPS with speedup
S ≥ 2, which achieves zero relative queuing delay.
Our lower bound results show that the PPS architecture does not scale with an increasing number
of external ports (see Table 4.1 for specific instances). This is significant since great effort is
currently invested in building switches with a large number of ports. Note that large relative queuing
delays usually imply that the buffer sizes at the middle-stage switches and at the external ports
should be large as well, so that the cells can be queued.
For a bufferless PPS, it is important to notice that using a u-RT demultiplexing algorithm
significantly reduces the lower bound on the relative queuing delay compared with a fully-distributed
demultiplexing algorithm. u-RT demultiplexing algorithms correspond to commercially used policies
like arbitrated crossbar switches [132], in which a request is made by the input-port, and
the cell is sent once a grant is received back from the arbiter. The separation between the lower
bounds implies that employing u-RT demultiplexing algorithms in a PPS (even with a considerably
large value of u) may decrease the relative queuing delay dramatically, and still be feasible for
high-speed switches.
Randomized Lower Bounds
We show that an adversary can devise traffic that exhibits with high probability a large average
relative queuing delay. The exact bounds depend on the type of the adversary, the exact restriction
on the order of cells the switch should respect and, as in the deterministic case, on the locality of
information used for cell demultiplexing.
When the PPS respects the arrival order of cells with the same input-port and the same output-
port (that is, per-flow FCFS discipline) and the adversary is adaptive [110], the bounds are equal
(with high probability) to the deterministic lower bounds for maximum relative queuing delay for
all classes of demultiplexing algorithms. The randomized lower bound holds also with an oblivious
adversary, if a PPS obeys a global FCFS policy (that is, all cells to the same destination should
leave the switch according to their arrival order) and a fully-distributed demultiplexing algorithm
is used.
Matching Upper Bounds
To prove that the lower bounds are tight, we devise a methodology for evaluating the relative
queuing delay under global FCFS policies. We show a general upper bound that depends on the
difference between the number of cells with the same destination that are sent through a specific
plane, and the total number of cells with this destination.
Our methodology is employed to prove that the maximal relative queuing delay of the fractional
traffic dispatch (FTD) algorithm [70] is O(N · R/r) time-slots. This matches the lower bound on the
average relative queuing delay introduced by fully-distributed demultiplexing algorithms (even
when randomization is used). This is the first formal and complete correctness proof for this
algorithm.
Remark 1 Iyer and McKeown [70, 72] outline an approach for bounding the relative queuing
delay of FTD, but leave a number of details missing [69]; a previous attempt [86] to complete the
formal proof and precisely bound the relative queuing delay of FTD turned out to be flawed [85]
(see Remark 2 for further details on the mistake in [86]).
By precisely capturing the crucial factors affecting the relative queuing delay, our methodology
leads to new algorithms that use global information that is u time-slots old. Their maximum relative
queuing delay is O(N) time-slots, asymptotically matching the lower bound on the average relative
queuing delay for this class of demultiplexing algorithms (even when randomization can be used).
PPS Model Extensions
One extension to the PPS model is input-buffered PPS, in which there are small buffers also in the
input-ports, can support more elaborate demultiplexing algorithms, since an arriving cell can either
be transmitted to one of the middle stage switches, or be kept in the input-buffer. We show that un-
der a u-RT demultiplexing algorithm, a switch with speedup S ≥ 2 and input-buffers larger than u
can employ a centralized algorithm (e.g., [72]). In contrast, a deterministic fully-distributed demul-
tiplexing algorithm introduces relative queuing delay and relative delay jitter of at least(1− r
R
)NS
time-slots, for any buffer size under leaky-bucket flows with no bursts.
A second extension of the PPS model implements the planes themselves recursively as
parallel packet switches operating at a lower rate. We prove lower bounds on the relative queuing
delay for the homogeneous recursive PPS, in which all the demultiplexors at all recursion levels
are of the same type (e.g., fully-distributed demultiplexors), and for the monotone recursive PPS,
in which demultiplexors are allowed to share more information as their rate decreases. These lower
bounds generalize the lower bound for the non-recursive PPS model.
4.2 A Model for Parallel Packet Switches
An N × N PPS is a three-stage Clos network [41], with K < N planes. Each plane is an N × N
switch operating at rate r < R, and is connected to all input-ports on one side, and to all output-ports
on the other side (recall Figure 3). The speedup S = Kr/R captures the switch over-capacity.
A bufferless PPS has no buffers at its input-ports but can store pending cells in its planes and
in its output-ports. Each cell arriving at input-port i is immediately sent to one of the planes; the
plane through which the cell is sent is determined by a randomized state machine with state set Si,
following some algorithm.
Definition 4 The demultiplexing algorithm of a bufferless input-port i is a function
ALGi : {1, . . . , N} × Si × COINSPACE → {1, . . . , K} × Si
which gives a plane number and the next state, according to the incoming cell destination, the
current state, and the result of a coin-toss that is taken out of a finite and uniform coin-space
COINSPACE. (For a deterministic algorithm, |COINSPACE| = 1.)
It is important to notice that demultiplexing algorithm ALGi accesses the random coin-tosses
one by one. More precisely, the demultiplexing decision of ALGi at time-slot t depends only
on random coins that were tossed up until time-slot t; the coin-tosses up until time-slot t − 1
are incorporated into the state of ALGi at time-slot t, while the coin-toss of time-slot t appears
explicitly in the definition of ALGi.
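As an illustration of the interface of Definition 4 — not an algorithm analyzed in this chapter — a deterministic round-robin demultiplexor can be written as follows (names and the factory shape are ours):

```python
def make_rr_demux(K):
    """A fully-distributed round-robin demultiplexor in the shape of
    Definition 4: ALG_i(dest, state, coin) -> (plane, next_state).

    The state is the last plane used; the coin is ignored, so the
    algorithm is deterministic (|COINSPACE| = 1).  If K >= r' + 1, any
    two cells from this input-port arriving within r' time-slots of each
    other land on distinct planes, so the input constraint defined below
    is respected.
    """
    def alg(dest, state, coin=0):
        plane = (state + 1) % K      # ignore dest and coin; cycle planes
        return plane, plane          # new state = plane just used
    return alg
```

Feeding cells one per time-slot through this demultiplexor cycles the planes 0, 1, . . . , K − 1, 0, . . . ; the state update is exactly the second component of the returned pair.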
We next extend the switch model defined in Chapter 3 to capture the PPS architecture.
ESW(ALG, σ, T ) is the execution of a PPS using demultiplexing algorithm ALG in response to
incoming traffic T and coin-tosses sequence σ; for all cells in T , the execution indicates the planes
the cells are sent through: {〈c, plane(c, σ, T )〉 | c ∈ T }. For clarity, we denote this execution by
EPPS(ALG, σ, T ).
A state s ∈ Si is reachable if there are a sequence of coin-tosses σ and a traffic T such that
the state-machine reaches state s in the execution EPPS(ALG, σ, T ). A switch configuration consists
of the states of all state-machines and the contents of all the buffers in the switch. A configuration
is reachable if it is reached in an execution of the switch. Since the switch does not have a pre-
determined initial configuration, we assume that for every pair of reachable configurations C1, C2,
there is a finite incoming traffic that causes the switch to transit from C1 to C2.
The internal lines of the switch operate at rate r < R. For simplicity, we assume that
r′ ≜ R/r = ⌈R/r⌉; that is, R/r is an integer. This lower rate r imposes an input constraint on the
demultiplexing algorithm [74]:
For any two cells c1, c2 in traffic T , if orig(c1) = orig(c2) and |ta(c1) − ta(c2)| ≤ r′ then
plane(c1, σ, T ) ≠ plane(c2, σ, T ).
Since a PPS has no buffers in its input-ports, cells are immediately sent to one of the planes;
that is, a cell c traverses the internal link between orig(c) and plane(c, σ, T ) at time ta(c) (see
Figure 5).
We assume that both the planes and the output buffers are FCFS and work-conserving. Let
tp(c, σ, T ) be the time-slot in which a cell c ∈ T leaves plane(c, σ, T ), and denote tlPPS(c, σ, T )
Figure 5: Time-points associated with a cell c ∈ T . (The figure depicts the arrival time ta(c) at orig(c), the time tp(c, σ, T ) at which c leaves plane(c, σ, T ), and the departure times tlPPS(c, σ, T ) from the PPS and tlS(c, T ) from the shadow switch.)
the time-slot it leaves the PPS (that is, tlPPS(c, σ, T ) = tlSW(c, σ, T ) as defined in Chapter 3).
The lower rate of the internal links between the planes and the output-ports imposes an output
constraint [74]:
For every two cells c1, c2 in traffic T , if dest(c1) = dest(c2) and plane(c1, σ, T ) = plane(c2, σ, T )
then |tp(c1, σ, T ) − tp(c2, σ, T )| > r′. To neglect delays caused by the additional stage of the
PPS, a cell can leave the PPS at the same time-slot it arrives at the output-port, provided that
no other cell leaves at this time-slot; i.e., tlPPS(c, σ, T ) ≥ tp(c, σ, T ). Note, however, that
tp(c, σ, T ) ≥ ta(c) + 1.
When it is clear from the context, we omit the traffic T and the coin-tosses sequence σ from
the notations plane(c, σ, T ), tp(c, σ, T ), tlPPS(c, σ, T ), tlS(c, T ) and R(ALG, σ, c, T ).
4.3 Lower Bounds on the Relative Queuing Delay
The relative queuing delay of a PPS heavily depends on the information available to the demulti-
plexing algorithm. Practical demultiplexing algorithms must operate with local, or out-dated, in-
formation about the status of the switch: flows waiting at other input-ports, contents of the planes’
buffers, etc. As we shall see, such algorithms incur non-negligible queuing delay.
Specifically, in this section we prove lower bounds on the maximal and average relative queuing
delay even when randomization is used. We show lower bounds for deterministic demultiplexing
algorithms. Based on these results, we present lower bounds for randomized demultiplexing algo-
rithms; these bounds use an adaptive adversary that sends cells to the switch at each time-slot based on the
algorithm actions at previous slots. We further show that under reasonable assumptions the lower
bounds can be extended to hold with an oblivious adversary, which chooses the entire traffic in
advance, knowing only the demultiplexing algorithm. Moreover, we show that these lower bounds
on the relative queuing delay yield similar lower bounds on the relative delay jitter.
We prove even stronger results, showing that the lower bounds hold even when the traffic is
restricted by the (R,B) leaky-bucket model [34, 45]. This model keeps the traffic from flooding
the switch by requiring that the combined rate of flows sharing the same input-port or the same
output-port does not exceed the external rate R of that port by more than a fixed bound B, which
is independent of time. Specifically, a traffic T is (R,B) leaky-bucket if, for any two time-slots
t1 ≤ t2 and any output-port j, |{c ∈ T | t1 ≤ ta(c) ≤ t2 and dest(c) = j}| ≤ (t2 − t1) + B.
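This constraint can be checked mechanically. The sketch below (ours) uses the half-open-interval convention of Definition 5 — an interval of s time-slots may carry at most s + B cells to one output-port — with traffics as lists of hypothetical (orig, dest, ta) tuples:

```python
from collections import defaultdict

def is_leaky_bucket(traffic, B):
    """Check the output-port leaky-bucket constraint: for every output-port
    j and half-open interval of s slots, at most s + B cells destined for j
    arrive in the interval.

    It suffices to check intervals whose endpoints are arrival times.
    """
    by_dest = defaultdict(list)
    for orig, dest, ta in traffic:
        by_dest[dest].append(ta)
    for arrivals in by_dest.values():
        arrivals.sort()
        for a in range(len(arrivals)):
            for b in range(a, len(arrivals)):
                # cells a..b fit in the half-open interval
                # [arrivals[a], arrivals[b] + 1) of length arrivals[b]+1-arrivals[a]
                if (b - a + 1) > (arrivals[b] - arrivals[a] + 1) + B:
                    return False
    return True
```

Under this convention, a traffic with one cell per time-slot to a single output-port is leaky-bucket even with B = 0, matching the "no bursts" traffics used in the lower bounds below.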
4.3.1 General Techniques and Observations
High relative queuing delay is exhibited when cells that are supposed to leave the shadow switch
one after the other are concentrated in a single plane. We first describe this scenario given a
specific coin-tosses sequence σ, implying that the results hold also for a deterministic demultiplexing
algorithm with |COINSPACE| = 1.
Definition 5 An execution EPPS(ALG, σ, T ) is (f, s) weakly-concentrating for output-port j and
plane k if there is a time-slot t such that:
1. Output-port j’s buffer of the shadow switch is empty at time-slot t; and
2. At least f cells destined for output-port j arrive at the switch during time-interval [t, t+ s), and
f out of these cells are sent through the plane k.
We call an execution an (f, s) weakly-concentrating execution, when the plane k and the
output-port j are clear from the context.
The following lemma bounds the relative queuing delay exhibited in (f, s) weakly-concentrating
executions:
Lemma 1 For any (R,B) leaky-bucket traffic T , coin-tosses sequence σ, and (f, s) weakly-concentrating
execution EPPS(ALG, σ, T ) for output-port j and plane k, the last cell c that is sent from
plane k to output-port j in EPPS(ALG, σ, T ) attains R(ALG, σ, c, T ) ≥ f · r′ − (s + B).
Proof. We compare the queuing delay, in the PPS and in the shadow switch, of the cells arriving
in the time interval [t, t + s). Since the shadow switch is work-conserving, all f cells leave it
within f time-slots after the first cell is dispatched. The PPS, on the other hand, needs at least fr′
time-slots to deliver them, because all f cells are sent to the same plane, and only one
cell can be sent from this plane to the output-port every r′ time-slots. Let c be the last of these
cells sent from the plane to the output-port. Hence, the relative queuing delay c attains is at least
fr′ − f time-slots. Since the incoming traffic is (R,B) leaky-bucket, f ≤ s + B, and therefore
R(ALG, σ, c, T ) ≥ fr′ − f ≥ fr′ − (s + B) time-slots.
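The bound itself is plain arithmetic; a small helper (ours, for illustration) evaluates it and, when instantiated with d = N/S = 512 cells and r′ = 64, recovers the partitioned OC-3 entry of Table 4.1:

```python
def lemma1_bound(f, s, B, r_prime):
    """Lower bound of Lemma 1 on the relative queuing delay of the last
    concentrated cell: the PPS needs at least f*r' slots to drain f cells
    from one plane (one cell per r' slots), while the work-conserving
    shadow switch needs f, and leaky-bucket traffic forces f <= s + B."""
    assert f <= s + B, "leaky-bucket traffic forces f <= s + B"
    return f * r_prime - (s + B)
```

For instance, lemma1_bound(512, 512, 0, 64) is 512·64 − 512 = 32,256 time-slots, and the Theorem 1 instance f = s = d, B = 0 gives d(r′ − 1).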
Lemma 1 implies the following lower bounds on the maximum relative queuing delay and the
maximum relative delay jitter:
Lemma 2 For any (R,B) leaky-bucket traffic T , coin-tosses sequence σ, and (f, s) weakly-concentrating
execution EPPS(ALG, σ, T ) for output-port j and plane k:
(1) The maximum relative queuing delay Rmax(ALG, σ, T ) ≥ f · r′ − (s + B) time-slots.
(2) There is a traffic T ′ such that the relative delay jitter J (ALG, σ, T ′) ≥ f · r′ − (s + B) time-slots.
Proof. The proof of (1) immediately follows from Lemma 1. Let t be the time-slot establishing
EPPS(ALG, σ, T ) as an (f, s) weakly-concentrating execution for output-port j and plane k, and
let c be the last cell, arriving in the interval [t, t + s), that is sent from plane k to output-port j in
EPPS(ALG, σ, T ).
By the definition of cell c, ta(c) ≤ t + s − 1, while by Lemma 1, tlPPS(c) ≥ t + fr′. This
implies that delay(c, T ) ≥ f · r′ − s + 1.
Let t′ > tlPPS(c, T ) be the first time-slot after tlPPS(c, T ) in which all the buffers of the PPS
are empty. Such a time-slot exists since T is finite and both the planes and the multiplexors of the
PPS are work-conserving.
Let T ′ = T ∪ {c′}, where c′ is a cell with orig(c′) = orig(c), dest(c′) = dest(c), and
ta(c′) = t′. Clearly, cell c′ leaves the PPS exactly one time-slot after its arrival. In addition, all
other cells in traffic T ′ leave the PPS exactly as in the execution EPPS(ALG, σ, T ). Because cells c
and c′ share the same origin and destination, the maximum delay jitter introduced by the PPS is at
least JPPS(ALG, σ, T ′) ≥ (f · r′ − s + 1) − 1 = f · r′ − s time-slots.
Recall that the maximum buffer size needed for any work-conserving switch to work under
(R,B) leaky-bucket traffic is B. Therefore, a work-conserving switch that serves the incoming
cells in an FCFS manner (e.g., an FCFS output-queued switch) introduces queuing delay, and therefore
also delay jitter, of at most B time-slots. Thus, the relative delay jitter between the PPS and the
shadow switch is at least (f · r′ − s) − B = f · r′ − (s + B) time-slots, proving (2).
Another key observation is that if the last cell of a traffic attains relative queuing delay R,
then this traffic can be continued so that every added cell attains relative queuing delay at least R,
regardless of the random choices made by the demultiplexing algorithm (if any).
We first define how a traffic is continued. A cell c2 ∈ T is the immediate successor of cell
c1 ∈ T under demultiplexing algorithm ALG, denoted c2 = succ(c1, T ), if tlS(c2, T ) = tlS(c1, T ) + 1, and
for every coin-tosses sequence σ, tlPPS(c2, T ) > tlPPS(c1, T ) in the execution EPPS(ALG, σ, T ).
Namely, a PPS cannot change the order in which c1 and c2 are delivered; this happens, for example,
when a PPS follows a per-flow FCFS policy and c1, c2 share the same input-port and the same
output-port. Generally, the existence of an immediate successor depends on the priority scheme
supported by the PPS.
Let c be the last cell in a traffic T , i.e., tlS(c, T ) = max_{c′∈T} tlS(c′, T ). A traffic T ′ =
c0, . . . , cn is a proper continuation of T if, in the execution of the shadow switch in response to
traffic T T ′, all the cells of T ′ are delivered one time-slot after the other without any stalls, and
the delivery times of the cells of T remain unchanged. Formally, T ′ is a proper continuation of T
if in the execution ES(T T ′), c0 = succ(c, T T ′), ci = succ(ci−1, T T ′) for every i, and for
every c′ ∈ T , tlS(c′, T ) = tlS(c′, T T ′) and tlPPS(c′, T ) = tlPPS(c′, T T ′).
We first examine proper continuations by a single cell:
Lemma 3 For any demultiplexing algorithm ALG, coin-tosses sequence σ, and finite traffic T , if
c1 is the last cell of T , and T ′ = c2 is a proper continuation of T , then R(ALG, σθ, c2, T T ′) ≥ R(ALG, σ, c1, T ) for any coin-toss θ.
Proof. Since T ′ is a proper continuation of T , cell c2 leaves the shadow switch exactly at time-slot
tlS(c1, T T ′) + 1, and in addition tlPPS(c2, T T ′) ≥ tlPPS(c1, T T ′) + 1. Hence,
R(ALG, σθ, c2, T T ′) ≥ tlPPS(c2, T T ′)− tlS(c2, T T ′)
≥ (tlPPS(c1, T T ′) + 1)− (tlS(c1, T T ′) + 1)
= tlPPS(c1, T )− tlS(c1, T ) = R(ALG, σ, c1, T )
It is important to notice that Lemma 3 holds for any coin-toss θ, and therefore it trivially holds
if the demultiplexing algorithm ALG is deterministic.
If the adversary can construct, for every traffic, a proper continuation that is arbitrarily long,
then it can construct a traffic that exhibits an average relative queuing delay that matches the
maximum relative queuing delay. Intuitively, the adversary waits for a cell c that attains Rmax and
then sends many cells, which form a proper continuation (whose length depends on the number of
cells that arrived before c).
Lemma 4 Fix an adversary A, a demultiplexing algorithm ALG, a coin-tosses sequence σ, and a
finite traffic T whose last cell c has R(ALG, σ, c, T ) = x. If the adversary A can construct a proper
continuation of traffic T whose size is at least ⌈|T |(x − ε)/ε⌉ (ε is an arbitrarily small constant),
then RAavg(ALG) ≥ x − ε.
Proof. Let ℓ be the number of cells in traffic T , and let T ′ be a proper continuation of T such
that |T ′| = ⌈ℓ(x − ε)/ε⌉. Applying Lemma 3 |T ′| times implies that for every cell b in T ′ and any
coin-tosses sequence σb, R(ALG, σσb, b) ≥ R(ALG, σ, c) ≥ x. Hence,
RAavg(ALG) ≥ (1 / (ℓ + ⌈ℓ(x − ε)/ε⌉)) · ⌈ℓ(x − ε)/ε⌉ · x ≥ x − ε
To allow constructing proper continuations of a traffic T with high relative queuing delay,
we extend Definition 5 so that traffic T ends when the concentration occurs:
Definition 6 An execution EPPS(ALG, σ, T ) is (f, s) strongly-concentrating for output-port j and
plane k if it is (f, s) weakly-concentrating and in addition traffic T ends at time-slot t + s.
For brevity, we call such executions (f, s)-concentrating executions.
4.3.2 Lower Bounds for Fully-Distributed Demultiplexing Algorithms
A fully-distributed demultiplexing algorithm demultiplexes a cell, arriving at time-slot t, according
to the input-port’s local information in time interval [0, t]. Since no information is shared between
input ports, we assume that the state si ∈ Si of demultiplexor i does not change, unless a cell
arrives at input-port i. Note that demultiplexing algorithms that change their state even without
receiving a cell are not considered fully distributed, because a common clock-tick is shared among
all input ports. (Such algorithms are covered in Section 4.3.3.)
Lower bound for deterministic fully-distributed algorithm
The relative queuing delay of a PPS with a fully-distributed demultiplexing algorithm strongly
depends on the number of input-ports that can send a cell, destined for the same output-port, through
the same plane. The following definition captures this switch characteristic under a deterministic
algorithm (Definition 8 extends this definition to randomized algorithms):
Definition 7 A deterministic demultiplexing algorithm is d-partitioned if there are a plane k and an
output-port j such that at least d input-ports send a cell destined for output-port j through plane
k in one of their reachable configurations.
We next show that a static partition of the planes among the demultiplexors helps to reduce
the relative queuing delay. However, since such partitioning is failure-prone, most existing fully-
distributed algorithms are N -partitioned, meaning that each demultiplexor may use each plane
in order to send cells to each output-port. All our results hold for this class of algorithms by
substituting d = N .
Theorem 1 Any deterministic d-partitioned fully distributed demultiplexing algorithm ALG has
Rmax(ALG) ≥ d(r′ − 1) and J (ALG) ≥ d(r′ − 1) time-slots.
Proof. By the definition of a d-partitioned demultiplexing algorithm, there is an output-port j and
a plane k, such that at least d demultiplexors send a cell destined for j through k in some reachable
configuration. Let I = {i1, i2, . . . , id} be the set of these demultiplexors, and let si ∈ Si be the state
of demultiplexor i ∈ I in configuration Ci, just before a cell is sent to plane k.
Consider a traffic T ′i from an arbitrary reachable configuration C that leads the switch to
configuration Ci; such a traffic exists since C and Ci are reachable, and there is a traffic that causes
the switch to transit between any two reachable configurations. Let Ti = (T ′i)|i; that is, a traffic
in which cells arrive only at input-port i, exactly in the same time-slots as in traffic T ′i. Since the
demultiplexing algorithm is fully-distributed, demultiplexor i transits into si. Note that in Ti at
most one cell arrives at the switch in every time-slot; therefore this traffic has no bursts.
Now consider traffic T , which starts with Ti1 . . . Tid , a sequential composition of the traffics
Ti, where i ∈ I . T begins from configuration C, and sequentially for every i ∈ I , the same cells
arrive at the switch in the same time-slots as in traffic Ti, until demultiplexor i reaches state si.
Then, no cells arrive at the switch until all the buffers in all the planes are eventually empty. Finally,
d cells destined for output-port j arrive, one after the other, at different input-ports i ∈ I (one cell in
each time-slot). Since the demultiplexing algorithm is fully-distributed, each demultiplexor i ∈ I
remains in state si, and all the cells are sent through the same plane k (see Figure 6; the last d cells
are denoted ci1 , . . . , cid).
T has no bursts, and cells ci1 , . . . , cid arrive during d consecutive time-slots. These cells arrive
at the switch after the buffer in output-port j is empty.
Thus, by Lemma 2 with f = d, s = d and B = 0, we obtain the stated lower bounds.
If the PPS obeys a per-flow FCFS policy, we get the following lower bound on the average
relative queuing delay:
Figure 6: Illustration of traffic T in the proof of Theorems 1 and 4. (The last d cells, ci1 , . . . , cid , arrive one per time-slot at input-ports i1, i2, . . . , id.)
Theorem 2 Any deterministic d-partitioned fully distributed demultiplexing algorithm ALG has
Ravg(ALG) ≥ d(r′ − 1)− ε time-slots, where ε > 0 can be made arbitrarily small.
Proof. Let T be the traffic that caused the maximum relative queuing delay as described in the
proof of Theorem 1.
We continue traffic T with a traffic T ′, which consists of ⌈|T |(f · r′ − (s + B) − ε)/ε⌉ cells from
orig(c) to dest(c), one cell at each time-slot. T ′ is a proper continuation of traffic T , because both the PPS
dest(c), one cell at each time-slot. T ′ is a proper continuation of traffic T , because both the PPS
and the shadow switch obey a per-flow FCFS policy and all cells in T ′ share the same input-port
and the same output-port.
Hence, Lemma 4 implies that Ravg(ALG) ≥ f · r′ − (s + B) − ε.
Note that the PPS input constraint implies that each demultiplexor must send incoming cells
through at least r′ planes. This implies that, even under static partitioning, each plane is used by
r′N/K demultiplexors on average. Hence, there is a plane k that is used by at least r′N/K = N/S
demultiplexors in order to dispatch cells destined for a certain output-port j. By substituting
d = N/S in Theorems 1 and 2 we get:
Theorem 3 A bufferless PPS, with a fully-distributed deterministic demultiplexing algorithm, has
maximum relative queuing delay and relative delay jitter of (N/S)(r′ − 1) time-slots under a leaky-
bucket traffic without bursts. Its average relative queuing delay is (N/S)(r′ − 1) − ε, for arbitrarily
small ε > 0.
Lower bound for randomized fully-distributed algorithms with an adaptive adversary
We concentrate now on adaptive adversary, denoted adp, which sends cells to the switch based on
the algorithm actions.
For every traffic T we examine the probability Prσ[EPPS(ALG, σ, T ) is (f, s)-concentrating],
taken over all coin-tosses sequences σ, that the execution of ALG given T and σ is (f, s)-concentrating.
A key observation is that if there is a traffic T whose execution is (f, s)-concentrating
with small but non-negligible probability, then an adaptive adversary can construct another execution
that is almost always (f, s)-concentrating:
Lemma 5 If from every configuration C there is an (R,B) leaky-bucket traffic T such that
Prσ[EPPS(ALG, σ, T ) is (f, s)-concentrating] ≥ p > 0, then an adaptive adversary can construct
an (R,B) leaky-bucket traffic T ′ from C such that Prσ[EPPS(ALG, σ, T ′) is (f, s)-concentrating] ≥ 1 − δ, where δ can be made arbitrarily small.
Proof. Fix a configuration C; the adaptive adversary constructs the executions from C iteratively.
Denote C0 ≜ C. Let Ci be the configuration just before iteration i ≥ 0, and denote by T^i the
traffic such that from configuration Ci, Prσ[EPPS(ALG, σ, T^i) is (f, s)-concentrating] ≥ p. The
adversary stops if the last execution is indeed (f, s)-concentrating. Otherwise, it concatenates an
empty traffic of B time-slots (denoted Te) and continues to the next iteration.
Since in each iteration the adversary stops with probability at least p, independently of previous
iterations, it stops with an (f, s)-concentrating execution at some iteration ℓ ≤ ⌈log_{1−p} δ⌉ with
probability 1 − δ. Since there are B empty time-slots between the arrival of the last cell of traffic T^i and
the arrival of the first cell in T^{i+1}, T′ = T^0 Te . . . Te T^ℓ has burstiness factor B, and its
corresponding execution starting from C is (f, s)-concentrating with probability 1 − δ.
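The adversary's iterative restart argument above can be sketched as a generic amplification loop. In this sketch (our names, not the thesis's), `trial` stands for playing one candidate traffic T^i and checking whether the resulting execution is (f, s)-concentrating:

```python
import math

def amplify(trial, p, delta):
    """Repeat an experiment that succeeds with probability >= p until it
    succeeds; after ceil(log_{1-p} delta) independent attempts, the overall
    failure probability is at most delta."""
    max_iters = math.ceil(math.log(delta) / math.log(1 - p))
    for i in range(max_iters):
        if trial():          # one (f, s)-concentration attempt
            return i         # iteration at which concentration occurred
    return None              # happens with probability <= delta
```

For example, with p = 0.3 and δ = 0.01 the loop needs at most ⌈log₀.₇ 0.01⌉ = 13 attempts.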
If both the shadow switch and the PPS are per-flow FCFS, an adaptive adversary can always
construct an arbitrarily long proper continuation of some traffic T . Therefore, we have:
Lemma 6 If from every configuration C there is an (R,B) leaky-bucket traffic T such that
Prσ[EPPS(ALG, σ, T) is (f, s)-concentrating] > 0, then with probability 1 − δ, R^adp_avg(ALG) ≥
f · r′ − (s + B) − ε, where ε > 0 and δ > 0 can be made arbitrarily small.
Proof. By Lemma 5, an adaptive adversary can construct a traffic T ′ from configuration C, such
that Prσ [EPPS(ALG, σ, T ′) is (f, s)-concentrating] ≥ 1− δ.
Let c be the last cell of T ′. Lemma 1 implies that with probability 1 − δ, the relative queuing
delay of c is at least f · r′ − (s + B).
When the adaptive adversary observes such a concentration event, it continues with traffic T″,
which consists of ⌈|T′| · (f · r′ − (s + B) − ε)/ε⌉ cells from orig(c) to dest(c), one cell at each time-slot. T″
is a proper continuation of traffic T ′, because both the PPS and the shadow switch obey a per-flow
FCFS policy and all cells in T ′′ share the same input-port and the same output-port.
Hence, Lemma 4 implies that R^adp_avg(ALG) ≥ f · r′ − (s + B) − ε, with probability 1 − δ.
We now extend Definition 7 to capture randomized demultiplexing algorithms:
Definition 8 A randomized demultiplexing algorithm is d-partitioned if there is a plane k, an
output-port j, and a set of input-ports I , such that |I| ≥ d and the following property holds:
For every input-port i ∈ I and state si ∈ Si, if at least ni cells destined for output-port j arrive at
input-port i after it is in state si, then with probability pi > 0, i sends at least one cell destined for
output-port j through plane k.
We next prove a lower bound for d-partitioned fully-distributed demultiplexing algorithms by
showing that it is possible to construct a traffic with no bursts that causes, with non-negligible
probability, the algorithm to concentrate d cells in a single plane during a time-interval of d time-slots:
Theorem 4 Any randomized d-partitioned fully-distributed demultiplexing algorithm ALG has,
with probability 1 − δ, R^adp_max(ALG) ≥ d(r′ − 1) time-slots and R^adp_avg(ALG) ≥ d(r′ − 1) − ε
time-slots, where ε > 0 and δ > 0 can be made arbitrarily small.
Proof. Given ALG, the adversary pre-computes the set I = {i1, . . . , id} of d input-ports, the
output-port j and the plane k, and for each input-port i ∈ I the values ni and pi > 0 for which the
conditions presented in Definition 8 hold.
We now construct a traffic similar to the one used in the proof of Theorem 1.
Fix a configuration C, and for every i ∈ I , let T ′i be a traffic consisting of ni cells destined for
output-port j that arrive one after the other to input-port i. By the definition of ni, with probability
at least pi, there is at least one cell in T ′i that is sent through plane k. Let ci be the first such cell;
it follows that Prσ[cell ci is sent through plane k in EPPS(ALG, σ, T′i)] ≥ pi. Let Ti be the prefix
of T′i that ends with cell ci; that is, Ti = {c ∈ T′i | ta(c) ≤ ta(ci)}. Since the probability to send ci
through plane k in execution EPPS(ALG, σ, T′i) depends only on cells that arrive at the switch before
cell ci, it follows that for prefix Ti, Prσ[cell ci is sent through plane k in EPPS(ALG, σ, Ti)] ≥ pi as well.
Traffic T is defined as follows: T = (Ti1 \ {ci1}) . . . (Tid \ {cid}) ci1 . . . cid (recall
Figure 6). We next show that with non-negligible probability, taken over all coin-tosses sequences
σ, all cells ci1, . . . , cid are sent through plane k in the execution of ALG on traffic T.
In traffic T, for each input-port i ∈ I, no cells arrive at input-port i between Ti \ {ci} and ci.
Thus, for each input-port i ∈ I and coin-tosses sequence σ, plane(ci, T) = plane(ci, Ti1 . . . Tid).
Since the demultiplexors are independent, the probability, taken over all coin-tosses sequences σ,
that the last d cells are sent through plane k in execution EPPS(ALG, σ, T) is at least ∏_{a=1}^{d} p_{ia} > 0.
This implies that execution EPPS(ALG, σ, T) is (d, d)-concentrating with non-negligible probability.
Since T has no bursts, the claim follows immediately from Lemma 6.
Lower bound for randomized fully-distributed algorithms with an oblivious adversary
We now consider oblivious adversaries, denoted obl, that choose the entire traffic in advance, knowing
only the demultiplexing algorithm ALG. R^obl_max(ALG) and R^obl_avg(ALG) denote the maximum and
average relative queuing delay of algorithm ALG against such an adversary. We assume that the PPS and
the shadow switch obey a global FCFS policy, i.e., cells that share the same output-port should
leave the switch in the order of their arrival (with ties broken arbitrarily). Unlike per-flow FCFS
policy, global FCFS policy requires cells to leave in order even if they do not share the same origin.
We next extend Theorem 4 to hold with an oblivious adversary, under a global FCFS discipline.
Theorem 5 Any randomized d-partitioned fully-distributed demultiplexing algorithm ALG has
R^obl_max(ALG) ≥ d(r′ − 1) time-slots and R^obl_avg(ALG) ≥ d(r′ − 1) − ε time-slots, with
probability 1 − δ, where ε > 0 and δ > 0 can be made arbitrarily small.
Proof. Given ALG, the adversary pre-computes the set I = {i1, . . . , id} of d input-ports, the
output-port j and the plane k, and for each input-port i ∈ I the values ni and pi for which the
conditions of Definition 8 hold. Let p ≜ ∏_{a=1}^{d} p_{ia}/n_{ia} > 0.
For any input-port i ∈ I, let xi be a value chosen uniformly at random from {1, . . . , ni}. Let Ti
be a traffic consisting of xi cells from input-port i to output-port j, and let ci be the last cell of Ti.
Traffic T′ is defined as follows: T′ = (Ti1 \ {ci1}) . . . (Tid \ {cid}) ci1 . . . cid. Note that
traffic T ′ is similar to traffic T in the proof of Theorems 1 and 4 (illustrated in Figure 6).
Using traffic T ′, the adversary constructs a traffic T (illustrated in Figure 7) whose average
relative queuing delay is at least d(r′ − 1) − ε time-slots with probability 1 − δ (the constants
δ, ε > 0 can be made arbitrarily small). The construction has two steps:
Step 1 Concatenate ⌈log_{1−p} δ⌉ instances of traffic T′. For each instance, choose independently
and uniformly at random the values for x_{im}, 1 ≤ m ≤ d, from {1, . . . , n_{im}}. Let ℓ be the total
size of these instances.

Step 2 Concatenate a traffic of ℓ · ⌈(d(r′ − 1) − ε)/ε⌉ cells, such that each cell is sent from an arbitrary
input-port i to output-port j.
We first prove that with non-negligible probability, taken over all coin-tosses sequences σ, the
execution of ALG on any instance of traffic T′ = (Ti1 \ {ci1}) . . . (Tid \ {cid}) ci1 . . . cid is (d, d)-concentrating, regardless of the initial configuration.
Claim 1 Prσ[EPPS(ALG, σ, T′) sends the last d cells through plane k] ≥ ∏_{a=1}^{d} p_{ia}/n_{ia} ≜ p > 0.
Figure 7: Illustration of traffic T in the proof of Theorem 5.
Proof of claim. For any input-port i, denote by T̄i the traffic consisting of ni cells from input-port
i to output-port j. By the definition of ni, with probability pi at least one cell in T̄i is sent through
plane k. Since xi is chosen uniformly at random from the values {1, . . . , ni}, this cell is the xi-th
cell (that is, the cell ci) with probability at least 1/ni. Note that traffic Ti is a prefix of traffic T̄i;
since the demultiplexor is bufferless, the decision through which plane to send the cell ci is based
only on cells arriving at the switch prior to ci, which implies that cell ci is sent through k with
probability of at least (1/ni) · pi.
In traffic T′, for each input-port i ∈ I, no cells arrive at input-port i between Ti \ {ci} and ci.
Thus, for each input-port i ∈ I and coin-tosses sequence σ, plane(ci, T′) = plane(ci, Ti1 . . . Tid).
Since the demultiplexors are independent, the probability, taken over all coin-tosses sequences σ,
that execution EPPS(ALG, σ, T′) sends the last d cells through plane k is at least ∏_{a=1}^{d} p_{ia}/n_{ia} ≜ p > 0.
In Step 1, the random choices of the T′ instances are independent. Therefore, (d, d)-concentration
occurs at least once in Step 1 with probability at least 1 − δ. Let c′ be the last cell of the first instance
in which (d, d)-concentration occurs, and let T1 be the traffic {c ∈ T | ta(c) ≤ ta(c′)}. Since
EPPS(ALG, σ, T1) is (d, d)-concentrating, Lemma 1 implies that R(ALG, σ, c′, T1) ≥ d(r′ − 1).
Let T2 = T \ T1. We next show that T2 is a proper continuation of T1. Intuitively, this is because
the switches are work-conserving with a FCFS policy and, during each interval of size τ,
exactly τ cells arrive at the switch destined for the same output-port j (i.e., there are no stalls
between cells in traffic T = T1 T2).
Formally, consider two cells c1, c2 ∈ T such that ta(c2) = ta(c1)+1. The FCFS policy implies
that tlS(c2, T) > tlS(c1, T) and tlPPS(c2, T) > tlPPS(c1, T). In addition, by the construction of
traffic T, there is no cell c3 ∈ T such that ta(c1) < ta(c3) < ta(c2). Therefore, the FCFS policy
and the work-conservation of the shadow switch imply that tlS(c2, T) = tlS(c1, T) + 1. Hence,
for every two cells c1, c2 ∈ T, if ta(c2) = ta(c1) + 1 then c2 = succ(c1, T); in particular, the first
cell of T2 is the successor of cell c′. Moreover, since the switches follow a FCFS policy, cells of
traffic T2 do not prohibit cells of traffic T1 from being delivered on time; namely, for any cell c ∈ T1,
tlS(c, T1) = tlS(c, T1 T2) and tlPPS(c, T1) = tlPPS(c, T1 T2).
Since R(ALG, σ, c′, T1) ≥ d(r′ − 1) and |T2| ≥ ⌈|T1| · (d(r′ − 1) − ε)/ε⌉, Lemma 4 implies that
R^obl_avg(ALG) ≥ d(r′ − 1) − ε and R^obl_max(ALG) ≥ d(r′ − 1), with probability 1 − δ.
4.3.3 Lower Bounds for u-RT Demultiplexing Algorithms
While fully-distributed demultiplexing algorithms do not use any global information, in practice
demultiplexors may be able to gather some information about the switch status (e.g., through ded-
icated control lines). Therefore, it is important to consider a broader class of demultiplexing algo-
rithms in which out-dated global information is also used:
Definition 9 A u real-time distributed (u-RT) demultiplexing algorithm demultiplexes a cell, ar-
riving at time t, according to the input-port’s local information in time interval [0, t], and to the
switch’s global information in time interval [0, t− u].
The state transition function of the ith bufferless demultiplexor operating under a u-RT demultiplexing
algorithm is Si(t) : Si × C^{t−u+1} × {1, . . . , N} × COINSPACE → Si, where t is the
time-slot in which Si is applied, C is the set of all reachable switch configurations, and C^{t−u+1}
is the cross-product of t − u + 1 such sets, one for each time-slot in the interval [0, t − u]. Note
that a demultiplexor state transition may depend on other demultiplexors' state transitions, and on
incoming flows to other input-ports, as long as these events occurred u time-slots before the state
transition. The state of a demultiplexor can change even if no cell arrives at the input-port.
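As an illustration only (the class and method names below are ours, not part of the thesis's model), the information available to a u-RT demultiplexor can be sketched as follows: local events are visible immediately, while global configurations become visible u time-slots late:

```python
class URTDemux:
    """Sketch of the u-RT information model: a dispatch at slot t may use
    local information from [0, t] but global information only from [0, t - u]."""

    def __init__(self, u, n_planes):
        self.u = u
        self.n_planes = n_planes
        self.local = []    # (slot, cell) pairs observed at this input-port
        self.stale = []    # global configurations, revealed with a lag of u slots

    def tick(self, t, global_configs):
        # at slot t, only configurations of slots <= t - u are revealed
        self.stale = [c for (slot, c) in global_configs if slot <= t - self.u]

    def dispatch(self, t, cell):
        self.local.append((t, cell))
        # any deterministic rule over (fresh local, stale global) is allowed;
        # here: a placeholder that just cycles over the planes
        return (len(self.local) + len(self.stale)) % self.n_planes
```

The sketch makes the definition concrete: the decision at slot t never inspects configurations newer than t − u, which is exactly the gap the lower bounds below exploit.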
The additional global information makes it possible to reduce the relative queuing delay. For example,
when a 1-RT demultiplexing algorithm receives (R, 0) leaky-bucket traffic, it has full information
about the switch status, and therefore it can emulate a centralized algorithm. Yet, lack of information
about recent events yields non-negligible relative queuing delay, caused by leaky-bucket
traffic with a non-zero burstiness factor, as we shall prove next. A prominent example of 1-RT
demultiplexing algorithms (that is, with u = 1) is the class of demultiplexing algorithms that share only a
common clock-tick among input-ports. Therefore, a demultiplexor with a 1-RT algorithm may
change its state even if no cell arrives at its input-port.
Lower bound for deterministic u-RT algorithms
Let ū = min{u, r′/2}; that is, ū is the minimum between the lag in gathering global information and half
the external rate relative to the rate of the planes. We first show a lower bound on the performance
of deterministic u-RT algorithms:
Theorem 6 Any deterministic u-RT demultiplexing algorithm ALG has Rmax(ALG) ≥ (ūN/S)(1 − ū/r′)
and J(ALG) ≥ (ūN/S)(1 − ū/r′) time-slots. If the PPS obeys a per-flow FCFS policy then Ravg(ALG) ≥
(ūN/S)(1 − ū/r′) − ε time-slots, where ε > 0 can be made arbitrarily small.
Proof. Consider an arbitrary configuration C. Denote by t0 the time-slot in which the PPS is in
configuration C, by x0 the number of cells that arrived at the PPS until time-slot t0, and by n0 the
number of cells stored in one of the PPS’ buffers at time-slot t0.
Consider now the empty traffic Te, in which no cells arrive at the switch at all. We first argue
that if Te is long enough, all the buffers of the switch become empty. Specifically, denote by C1 the
switch configuration at time-slot t1 = t0 + n0 + ūNx0/S + 1. If there are still cells stored in one of the
buffers at time-slot t1, then these cells have relative queuing delay of at least ūNx0/S + 1 time-slots;
therefore the average relative queuing delay is more than ūN/S time-slots, and the theorem follows.
Assume now that all the buffers are empty in configuration C1. Fix an output-port j, and
consider the traffic T in which cells destined for j arrive simultaneously at all input-ports at each
time-slot in the interval [t1, t1 + ū). Note that T is an (R, ūN − ū) leaky-bucket traffic, since for
any τ ≥ 1 and time-interval [t, t + τ), the total number of cells arriving at the switch is bounded
by τ + (ūN − ū).
Figure 8: Illustration of traffic Te T̄|I in the proof of Theorems 6 and 8.
Since ū ≤ (1/2)(R/r) < R/r, the input constraint implies that two cells arriving at the same input-port
are not sent through the same plane. Hence, there is a plane k used by a set I of at least ūN/K
input-ports in the execution EPPS(ALG, T); note that since the PPS speedup is at least 1, ūN/K < (R/r)(N/K) ≤ N.
For every input-port i ∈ I, let ci ∈ T|i be a cell such that plane(ci, T|i) = k. Consider the
traffic T̄|i = {c | c ∈ T|i and ta(c) ≤ ta(ci)}; that is, T̄|i consists of the cells in T|i that arrive at
the switch no later than cell ci.
Now consider traffic T̄|I = ⋃_{i∈I} T̄|i (see Figure 8). Note that both T̄|I and Te T̄|I are
(R, ū²N/K − ū) leaky-bucket traffics.
For every input-port i ∈ I, ta(ci) < t1 + ū ≤ t1 + u, which implies that input-port i
does not have global information on the switch status after time-slot t1. Hence, the executions
EPPS(ALG, T) and EPPS(ALG, T̄|I) are equivalent. Therefore, all the input-ports i ∈ I send their
last cell to plane k in EPPS(ALG, Te T̄|I) starting at configuration C. Hence, by Lemma 2, the
maximum relative queuing delay and the relative delay jitter are at least R̄ = (ūN/S)(1 − ū/r′) time-slots.
Assume now that the PPS obeys a per-flow FCFS policy. Let c be the last cell of traffic Te T̄|I
that attains the maximum relative queuing delay. Consider traffic T′ that consists of
⌈|T̄|I| · (R̄ − ε)/ε⌉ cells from orig(c) to dest(c), one cell at each time-slot. T′ is a proper continuation of traffic T̄|I;
thus, by Lemma 4, Ravg(ALG) ≥ (ūN/S)(1 − ū/r′) − ε as required.
By substituting the minimal value ū = 1, we get the following general result:

Corollary 7 Any deterministic u-RT demultiplexing algorithm, u ≥ 1, has relative queuing delay
and relative delay jitter of at least (N/S)(1 − 1/r′) time-slots, under leaky-bucket traffic with burstiness
factor N/K − 1.
Lower bound for randomized u-RT algorithms with an adaptive adversary
We next give a lower bound on the average relative queuing delay of randomized u-RT demultiplexing
algorithms. The proof is based on Theorem 6 and Lemma 6:
Theorem 8 Any randomized u-RT demultiplexing algorithm ALG has, with probability 1 − δ,
R^adp_max(ALG) ≥ (ūN/S)(1 − ū/r′) time-slots and R^adp_avg(ALG) ≥ (ūN/S)(1 − ū/r′) − ε time-slots,
where ε > 0 and δ > 0 can be made arbitrarily small.
Proof. Consider an arbitrary configuration C and the traffics Te and T , whose constructions are
described in the proof of Theorem 6.
It is important to notice that the input constraint implies that for every coin-tosses sequence σ
there is a plane k used by a set I of at least ūN/K input-ports in the execution EPPS(ALG, σ, T).
For every input-port i ∈ I, let ci ∈ T|i be a cell such that plane(ci, T|i) = k. Consider the
traffics T̄|i = {c | c ∈ T|i and ta(c) ≤ ta(ci)} and T̄|I = ⋃_{i∈I} T̄|i (recall Figure 8).
For every input-port i ∈ I, ta(ci) < t1 + ū ≤ t1 + u, which implies that input-port i
does not have global information on the switch status after time-slot t1. Hence, the executions
EPPS(ALG, σ, T) and EPPS(ALG, σ, T̄|I) are equivalent. Therefore, with probability of at least
∏_{i∈I} (1/|COINSPACE|^{|T̄|i|}) ≥ (1/|COINSPACE|^ū)^{|I|} ≥ (1/|COINSPACE|^ū)^{ūN/K} > 0, taken over the coin-tosses
sequences σ, all the input-ports i ∈ I send their last cell to plane k in EPPS(ALG, σ, Te T̄|I)
starting at configuration C. Hence, configuration C satisfies the conditions of Lemma 6 and the
claim follows.
The question whether the lower bound for u-RT demultiplexing algorithms (described in Theorem 8)
can be extended to hold with an oblivious adversary is left open. The proof technique
described in this section will most likely fail to provide such an extension, since the worst-case
traffics that are used in order to prove lower bounds for u-RT algorithms have bursts. Unfortunately,
the burstiness accumulates when concatenating bursty traffics, unless there is a gap of a
certain number of time-slots in which no cells arrive at the over-loaded output-port. Large bursts
may justify a high queuing delay of cells, and hence result in a low relative queuing delay. On the
other hand, a gap in which no cells arrive at the over-loaded output-port reduces the relative queuing
delay of cells that arrive immediately after it. This implies that the adversary should identify the
concentration and then choose to continue the traffic without a gap (as in Lemma 4).
4.4 Upper Bounds on the Relative Queuing Delay
This section presents a methodology for bounding Rmax(ALG, σ, T ) for an arbitrary traffic T and
coin-tosses sequence σ. We fix some traffic T and omit the notations ALG, σ and T . For simplicity
assume T begins after time-slot 0, and that at time-slot 0 (i.e., at “the beginning of time”), no cells
arrive at the switch, and therefore all the queues are empty. Our analysis depends on the realistic
assumption that the PPS obeys the global FCFS policy.
Cells are queued in a bufferless PPS either within the planes or within the multiplexors residing
at the output-ports. A simple situation in which queuing in a multiplexor happens is when the
output-port is flooded, but in this case, cells also suffer from high queuing delay in the shadow
switch, and the relative queuing delay is small. A more complicated situation is when a cell arrives
at the multiplexor out of order, and should wait for previous cells to arrive from their planes. In
this case, the relative queuing delay is a by-product of queuing within the other planes (of some
preceding cell)—the relative queuing delay of the waiting cell is at most the relative queuing delay
of some preceding cell that was queued only in the planes. This observation is captured in the next
lemma:
Lemma 7 There is a cell c such that tlPPS(c) = tp(c) and R(c) = Rmax.
Proof. Let c be the first cell to leave the PPS such that R(c) = Rmax. Assume that tlPPS(c) >
tp(c); since the multiplexor buffer is work-conserving, in time-slot tlPPS(c) − 1 another cell c′
leaves the PPS from output-port dest(c). Hence tlPPS(c′) = tlPPS(c) − 1, and therefore R(c′) =
tlPPS(c′) − tlS(c′) = tlPPS(c) − 1 − tlS(c′). Since c′ leaves the PPS before c and the shadow switch
is FCFS, tlS(c′) ≤ tlS(c) − 1. Hence the relative queuing delay of c′ is R(c′) ≥ tlPPS(c) − tlS(c) =
R(c) = Rmax, contradicting the minimality of c.
Consider a single cell c, and focus on the queuing within plane(c), caused by the lower rate
on the link from plane(c) to dest(c). Since both the PPS and the shadow switch are FCFS, cells
arriving at the switch after cell c cannot prohibit c from being transmitted on time. We present an
upper bound that depends only on the disproportion of the number of cells sent through plane(c)
to dest(c). Relating this quantity and the queue lengths at time-slot ta(c) is not immediate, since
it is possible that the shadow switch is busy when the plane is idle, and vice versa.
Let Aj(t1, t2) be the number of cells destined for output-port j that arrive at the switch during
time interval [t1, t2], and A^k_j(t1, t2) be the number of these cells that are sent through plane k. The
following definition captures the imbalance between planes:
Definition 10 For a plane k, output-port j and time-slots 0 ≤ t1 ≤ t2:

1. The imbalance of time interval [t1, t2] is ∆^k_j(t1, t2) = A^k_j(t1, t2) − (1/r′)Aj(t1, t2).

2. The imbalance by time-slot t2 is ∆^k_j(t2) = max_{t1≤t2} ∆^k_j(t1, t2).

3. The maximum imbalance is ∆^k_j = max_{t2} ∆^k_j(t2).
Clearly, ∆^k_j ≥ ∆^k_j(t2) ≥ ∆^k_j(t1, t2) for every output-port j, plane k and time-slots t1 ≤ t2. In
addition, the imbalance is superadditive:

Property 1 For every output-port j, plane k and time-slots t1 ≤ t2,

∆^k_j(t2) ≥ ∆^k_j(t1 − 1) + ∆^k_j(t1, t2)
Proof. By Definition 10, there is a time-slot t1′ ≤ t1 such that ∆^k_j(t1 − 1) = ∆^k_j(t1′, t1 − 1) =
A^k_j(t1′, t1 − 1) − (1/r′)Aj(t1′, t1 − 1). Since ∆^k_j(t1, t2) = A^k_j(t1, t2) − (1/r′)Aj(t1, t2), we have:

∆^k_j(t1, t2) + ∆^k_j(t1 − 1) = A^k_j(t1′, t1 − 1) + A^k_j(t1, t2) − (1/r′)(Aj(t1′, t1 − 1) + Aj(t1, t2)) = ∆^k_j(t1′, t2) ≤ ∆^k_j(t2)
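Definition 10 and Property 1 can be checked numerically. In the sketch below (our notation: `ak[t]` and `a[t]` are the per-slot arrival counts behind A^k_j and Aj), the three imbalance quantities are computed directly from arrival logs:

```python
def imb(ak, a, t1, t2, r):
    """Delta^k_j(t1, t2) = A^k_j(t1, t2) - (1/r') * A_j(t1, t2)."""
    return sum(ak[t1:t2 + 1]) - sum(a[t1:t2 + 1]) / r

def imb_by(ak, a, t2, r):
    """Delta^k_j(t2): maximize over the start of the interval."""
    return max(imb(ak, a, t1, t2, r) for t1 in range(t2 + 1))

def max_imb(ak, a, r):
    """Delta^k_j: maximize over the end of the interval as well."""
    return max(imb_by(ak, a, t2, r) for t2 in range(len(a)))
```

Property 1 then reads imb_by(ak, a, t2, r) ≥ imb_by(ak, a, t1 − 1, r) + imb(ak, a, t1, t2, r) for every 1 ≤ t1 ≤ t2, which holds for any arrival log.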
Let Qj(t) be the length of the jth queue in the shadow switch after time-slot t; similarly,
Q^k_j(t) is the length of the jth queue of plane k of the PPS after time-slot t. Let L^k_j(t1, t2) be the
number of cells destined for output-port j that leave plane k during time interval [t1, t2]. Note that
Q^k_j(t) = A^k_j(0, t) − L^k_j(0, t).
Time-slot t1 is the beginning of a (k, j) busy period for time-slot t2 ≥ t1 if it is the last time-slot
before t2 such that Q^k_j(t1 − 1) = 0. Note that this expression is well defined because in time-slot
0 all the queues are empty. Since Q^k_j(t1) > Q^k_j(t1 − 1), a cell c arrives at the switch at time-slot
t1, and therefore exactly one cell destined for j leaves plane k in time-interval (t1 + 1 − r′, t1 + 1].
This is either cell c itself, or another cell that prohibits c from using the link, and therefore is sent
at most r′ time-slots before time-slot t1 + 1. Since the queue is never empty until time-slot t2, one
cell is sent to j exactly every r′ time-slots after the first cell. This implies that the number of cells
sent from k satisfies L^k_j(t1, t2) ≥ ⌊((t2 − t1) + 1)/r′⌋.
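The busy-period bound can be sanity-checked with a toy model (our sketch) of plane k as a FIFO server that emits at most one cell every r′ time-slots:

```python
def departures(arrival_slots, r):
    """Departure slot of each cell from a work-conserving FCFS server that
    sends at most one cell every r' slots: an isolated cell leaves one slot
    after it arrives, a queued cell r' slots after its predecessor."""
    deps, last = [], float("-inf")
    for t in arrival_slots:          # arrival_slots must be sorted
        last = max(last + r, t + 1)
        deps.append(last)
    return deps
```

During a busy period starting at t1, the number of departures in [t1, t2] produced by this model is at least ⌊((t2 − t1) + 1)/r′⌋, matching the bound above.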
Remark 2 Khotimsky and Krishnan [86] defined busy periods only with respect to an output-port
j. This points to a flaw in their proof, which ignores situations when the optimal shadow switch is
busy sending cells to output-port j, while a specific plane in the PPS is idle part of the time [85].
These situations are the main source of complication in our proof.
The following lemma bounds how badly a plane can perform relative to the shadow switch, by
comparing their busy periods:
Lemma 8 If Qj(t − 1) = 0 then for every plane k and for every δ ∈ {0, . . . , ∆^k_j(t − 1)r′},

L^k_j(0, (t − 1) + δ) ≥ A^k_j(0, t − 1) − ⌈(∆^k_j(t − 1)r′ − δ)/r′⌉
Proof. If there is a time-slot t1 ∈ [t − 1, t − 1 + δ] such that Q^k_j(t1) = 0, then by time-slot t1, no
cells destined for j are waiting in plane k. That is, L^k_j(0, (t − 1) + δ) ≥ L^k_j(0, t1) = A^k_j(0, t1) ≥
A^k_j(0, t − 1), and the lemma follows.

Otherwise, let t2 be the beginning of a (k, j) busy period for time-slot (t − 1) + δ. During time
interval [t2, (t − 1) + δ] plane k sends a cell to output j every r′ time-slots, therefore:

L^k_j(t2, (t − 1) + δ) ≥ ⌊(t + δ − t2)/r′⌋    (4.1)
Figure 9: The number of cells arriving until time-slot t − 1 and still queued in plane k by time-slot τ.
On the other hand, Qj(t − 1) = 0 implies that for every time-slot t3 ≤ t − 1, Aj(t3, t − 1) ≤ t − t3
(otherwise the jth buffer of the shadow switch is not empty after time-slot t − 1). In particular:

Aj(t2, t − 1) ≤ t − t2    (4.2)
Using these inequalities we bound L^k_j(0, (t − 1) + δ):

L^k_j(0, (t − 1) + δ) = L^k_j(0, t2 − 1) + L^k_j(t2, (t − 1) + δ)
  ≥ L^k_j(0, t2 − 1) + ⌊(t + δ − t2)/r′⌋                 by (4.1)
  ≥ A^k_j(0, t2 − 1) + ⌊(t + δ − t2)/r′⌋                 since Q^k_j(t2 − 1) = 0
  ≥ A^k_j(0, t2 − 1) + ⌊δ/r′ + Aj(t2, t − 1)/r′⌋         by (4.2)
  = A^k_j(0, t2 − 1) + A^k_j(t2, t − 1) + ⌊δ/r′ − ∆^k_j(t2, t − 1)⌋   by Definition 10
  ≥ A^k_j(0, t − 1) − ⌈(∆^k_j(t − 1)r′ − δ)/r′⌉
By substituting δ = 0 in Lemma 8, we get the following corollary, demonstrating the relation
between the imbalance and the queue size at the beginning of a busy period:

Corollary 9 If Qj(t − 1) = 0 then for every plane k, Q^k_j(t − 1) ≤ max{0, ⌈∆^k_j(t − 1)⌉}.
We complete the proof by bounding the lag between the time a cell leaves the plane it is sent
through and the time it should leave the shadow switch:
Theorem 10 The maximum relative queuing delay of cells destined for output-port j and sent
through plane k is bounded by max{0, r′(∆^k_j + 1) + Bj}, where Bj is the maximum number of
cells destined for output-port j that arrive at the switch in the same time-slot.
Proof. By Lemma 7, it suffices to bound tp(c) − tlS(c) for every cell c. Since tp(c) − tlS(c) =
(tp(c) − ta(c)) − (tlS(c) − ta(c)), it suffices to bound only the difference between the time a cell
spends in the plane, tp(c) − ta(c), and the time it spends in the shadow switch, tlS(c) − ta(c).
Since both switches operate under a FCFS policy, these values solely depend on the corresponding
queues' lengths when cell c arrives.
Let t1 be the earliest time-slot, such that the buffer of output-port j in the shadow switch is
never empty during time-interval [t1, ta(c)]; if no such time-slot exists let t1 = ta(c).
First, we bound tlS(c) − ta(c) from below. The buffer in the shadow switch is empty at time-slot
t1 − 1, and then the switch is continuously busy during time-interval [t1, ta(c) − 1], transmitting
exactly one cell at each time-slot to output-port j. This implies that Qj(ta(c) − 1) = Aj(t1, ta(c) − 1) − (ta(c) − t1). All the cells in the queue should leave the switch after time-slot ta(c) and before
tlS(c), therefore:

tlS(c) − ta(c) > Aj(t1, ta(c) − 1) − (ta(c) − t1)

Since Aj(ta(c), ta(c)) ≤ Bj, and tlS(c) − ta(c) is an integer, it follows that:

tlS(c) − ta(c) ≥ Aj(t1, ta(c)) − Bj + t1 − ta(c) + 1    (4.3)
Recall that by Corollary 9, Q^k_j(t1 − 1) ≤ max{0, ⌈∆^k_j(t1 − 1)⌉}. There are two cases to consider,
depending on whether all the cells that were queued in plane k at time-slot t1 left the plane before
the arrival of cell c (see Figure 10):
Figure 10: Illustration of the different cases in the proof of Theorem 10

Case 1: ta(c) ≤ t1 + ∆^k_j(t1 − 1)r′. Since plane k is FCFS and work-conserving, it transfers
every cell in its queue in exactly r′ time-slots, except cell c, which is considered as transferred in the
first time-slot of its transmission:
tp(c) − ta(c) ≤ r′Q^k_j(ta(c)) + 1
  = r′(A^k_j(0, ta(c)) − L^k_j(0, ta(c))) + 1     by the definition of Q^k_j(ta(c))
  ≤ r′(A^k_j(0, ta(c)) − A^k_j(0, t1 − 1) + ⌈(r′∆^k_j(t1 − 1) + t1 − ta(c))/r′⌉) + 1
        by Lemma 8, since ta(c) ∈ [t1, t1 + ∆^k_j(t1 − 1)r′] and L^k_j(0, ta(c)) ≥ L^k_j(0, ta(c) − 1)
  ≤ r′(A^k_j(0, ta(c)) − A^k_j(0, t1 − 1) + (r′∆^k_j(t1 − 1) + t1 − ta(c))/r′ + 1) + 1
  ≤ r′A^k_j(t1, ta(c)) + r′∆^k_j(t1 − 1) + t1 − ta(c) + r′ + 1
  = Aj(t1, ta(c)) + r′∆^k_j(t1, ta(c)) + r′∆^k_j(t1 − 1) − ta(c) + t1 + r′ + 1     by Definition 10
  ≤ Aj(t1, ta(c)) + r′(∆^k_j(ta(c)) + 1) − ta(c) + t1 + 1     by Property 1    (4.4)

By (4.4) and (4.3), tp(c) − tlS(c) ≤ r′(∆^k_j(ta(c)) + 1) + Bj.
Case 2: ta(c) > t1 + ∆^k_j(t1 − 1)r′. If Q^k_j(ta(c)) = 0 then cell c is immediately delivered to the
output-port, i.e., tp(c) = ta(c) + 1 ≤ tlS(c), and the claim holds since tp(c) − tlS(c) ≤ 0.

If Q^k_j(ta(c)) > 0, let t2 be the beginning of a (k, j) busy period for ta(c). Note that by the
choice of t2, L^k_j(t2, ta(c)) ≥ ⌊(ta(c) − t2 + 1)/r′⌋. Hence, we have:
tp(c) − ta(c) ≤ r′Q^k_j(ta(c)) + 1
  = r′(A^k_j(t2, ta(c)) − L^k_j(t2, ta(c))) + 1     since Q^k_j(t2 − 1) = 0
  ≤ r′(A^k_j(t2, ta(c)) − ⌊(ta(c) − (t2 − 1))/r′⌋) + 1     since plane k is continuously busy
  ≤ Aj(t2, ta(c)) + r′∆^k_j(t2, ta(c)) − r′⌊(ta(c) − (t2 − 1))/r′⌋ + 1
  ≤ Aj(t1, ta(c)) + r′(∆^k_j(t2, ta(c)) + 1) + t1 − ta(c) + (t2 − t1) − Aj(t1, t2 − 1) + 1    (4.5)
By the choice of t1, the output-buffer of the shadow switch is empty at time-slot t1 − 1, and not
empty during time-interval [t1, t2 − 1]. This implies that (t2 − t1) ≤ Aj(t1, t2 − 1), and therefore
(4.5) implies:
tp(c) − ta(c) ≤ Aj(t1, ta(c)) + r′(∆^k_j(ta(c)) + 1) + t1 − ta(c) + 1    (4.6)

By (4.6) and (4.3), tp(c) − tlS(c) ≤ r′(∆^k_j(ta(c)) + 1) + Bj.
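Theorem 10 can be exercised on a toy model (our simplification: a single output-port j, a shadow switch draining one cell per slot, plane k delivering one cell every r′ slots, and the k-cells served first within each slot):

```python
def worst_lag(arrivals, r):
    """arrivals[t] = (n, nk): cells for output j arriving at slot t, of which
    the first nk go through plane k.  Returns max over k-cells of tp - tlS."""
    tl, tp, worst = 0, float("-inf"), 0
    for t, (n, nk) in enumerate(arrivals):
        for c in range(n):
            tl = max(tl + 1, t + 1)           # shadow switch: FCFS, rate 1
            if c < nk:                        # cell dispatched to plane k
                tp = max(tp + r, t + 1)       # plane k: one cell per r' slots
                worst = max(worst, tp - tl)
    return worst

def thm10_bound(arrivals, r):
    """max{0, r'(Delta^k_j + 1) + B_j}, computed from the same arrival log."""
    a, ak = [n for n, _ in arrivals], [nk for _, nk in arrivals]
    T = len(arrivals)
    delta = max(sum(ak[s:e + 1]) - sum(a[s:e + 1]) / r
                for s in range(T) for e in range(s, T))
    return max(0, r * (delta + 1) + max(a))
```

On any arrival pattern of this model, worst_lag(arrivals, r) stays below thm10_bound(arrivals, r); for instance, a single-slot burst of three k-cells with r′ = 3 gives a lag of 4 against a bound of 12.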
4.5 Demultiplexing Algorithms with Optimal RQD
This section presents several demultiplexing algorithms and uses the methodology described in
Section 4.4 in order to bound their relative queuing delay.
First, we revisit the fractional traffic dispatch algorithm (FTD) [70] and prove that its relative
queuing delay is at most (N + 1)r′ time-slots. Then, for a PPS with speedup S > 2, we introduce
a variant of the FTD algorithm that is 2N/S-partitioned; its relative queuing delay is at most
(2N/S + 1)r′ + N(1 − 2/S) time-slots, matching the lower bound for fully-distributed demultiplexing
algorithms (Theorem 4).
Finally, we present novel 1-RT and u-RT demultiplexing algorithms with relative queuing delay
3N + r′ time-slots (Sections 4.5.2 and 4.5.3). Both algorithms have optimal relative queuing delay
for a PPS with constant speedup.
4.5.1 Optimal Fully-Distributed Demultiplexing Algorithms
Iyer and McKeown [70] presented the best-known example of a fully-distributed demultiplexing algorithm. In this algorithm, a window of size r′ time-slots slides over the sequence of cells in each flow (i, j). The algorithm maintains a window constraint, which ensures that no two cells in the same window are sent through the same plane. An equivalent variant of the algorithm, called the fractional traffic dispatch (FTD) algorithm, statically divides each flow into blocks of size r′ [70, 86].

The demultiplexing algorithm chooses the plane through which a cell is sent arbitrarily from the set of planes that violate neither the window constraint nor the input-constraint described in Section 4.2. A speedup of S ≥ 2 suffices for the algorithm to work correctly [70].
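The interplay of the two constraints can be sketched in code. The following is our illustrative rendering of the FTD dispatch rule at a single input-port, not the thesis pseudocode; all names are ours.

```python
from collections import deque

class FTDDemux:
    """Sketch of FTD dispatch at one input-port: per flow, no two cells of
    the current block of r' cells share a plane (window constraint), and
    the input reuses no plane within r' consecutive dispatches (input
    constraint)."""

    def __init__(self, num_planes, r_prime):
        self.num_planes = num_planes          # K >= 2*r' planes (S >= 2)
        self.r_prime = r_prime
        self.window = {}                      # flow j -> planes used in its current block
        self.recent = deque(maxlen=r_prime)   # planes recently used by this input

    def dispatch(self, j):
        used = self.window.setdefault(j, set())
        if len(used) == self.r_prime:         # block complete: open a new window
            used.clear()
        forbidden = used | set(self.recent)
        # |forbidden| <= (r'-1) + r' < 2*r' <= K, so an eligible plane always exists.
        k = next(p for p in range(self.num_planes) if p not in forbidden)
        used.add(k)
        self.recent.append(k)
        return k
```

With K = 2r′ planes the choice on each dispatch is forced up to the arbitrary tie-break; any larger K leaves slack, which the partitioned variant below exploits.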
A simple application of Theorem 10 and the fact that Bj ≤ N shows:
Theorem 11 Ravg(FTD) ≤ Rmax(FTD) ≤ (N + 1)r′.
Proof. Let A_{i→j}(t_1, t_2) be the number of cells in flow (i, j) that arrive at the switch during time-interval [t_1, t_2], and A^k_{i→j}(t_1, t_2) be the number of these cells that are sent through plane k.

$$\begin{aligned}
\Delta^k_j(t_1,t_2) &= A^k_j(t_1,t_2) - \frac{A_j(t_1,t_2)}{r'} &&\text{by Definition 10}\\
&= \sum_{i=1}^{N} A^k_{i\to j}(t_1,t_2) - \frac{A_j(t_1,t_2)}{r'}\\
&\le \sum_{i=1}^{N} \left\lceil \frac{A_{i\to j}(t_1,t_2)}{r'} \right\rceil - \frac{A_j(t_1,t_2)}{r'} &&\text{due to the window constraint}\\
&\le \sum_{i=1}^{N} \left( \frac{A_{i\to j}(t_1,t_2)}{r'} + \frac{r'-1}{r'} \right) - \frac{A_j(t_1,t_2)}{r'} &&\text{since } A_{i\to j}, r' \text{ are integers}\\
&= N\,\frac{r'-1}{r'}
\end{aligned}$$

By Theorem 10, Rmax(FTD) ≤ (N + 1)r′, since B_j ≤ N.
For a PPS with speedup S > 2, a 2N/S-partitioned variant of FTD can yield better relative queuing delay, matching the lower bounds described in Theorems 4 and 5. In this algorithm, denoted PART-FTD (see pseudo-code in Algorithm 1), demultiplexor i uses only planes 2r′⌊i/(2N/S)⌋, . . . , 2r′(⌊i/(2N/S)⌋ + 1) − 1. This implies that each demultiplexor uses exactly 2r′ planes, as required for the correctness of FTD, but each plane is used by at most 2N/S demultiplexors.

Algorithm 1 Partitioned Fractional Traffic Dispatch (PART-FTD) Algorithm

Local variables at demultiplexor i:
    M[N][r′]: matrix of values in {1, . . . , 2r′, ⊥}, initially all ⊥
    R[r′]: vector of values in {1, . . . , 2r′, ⊥}, initially all ⊥
    S[N + 1]: vector of values in {0, . . . , r′ − 1}, initially all 0

1:  int procedure DISPATCH(cell c) at demultiplexor i
2:      j ← dest(c)
3:      D ← {k ∈ {1, . . . , 2r′} | ∃a ∈ {0, . . . , r′ − 1}, M[j − 1][a] = k}   ▷ planes that violate the window-constraint
4:      E ← {k ∈ {1, . . . , 2r′} | ∃a ∈ {0, . . . , r′ − 1}, R[a] = k}   ▷ planes that violate the input-constraint
5:      choose k ∈ {1, . . . , 2r′} \ (D ∪ E)
6:      M[j − 1][S[j − 1]] ← k   ▷ update for future window-constraint calculations
7:      R[S[N]] ← k   ▷ update for future input-constraint calculations
8:      S[j − 1] ← (S[j − 1] + 1) mod r′   ▷ update pointer for cyclic use of the vector
9:      S[N] ← (S[N] + 1) mod r′   ▷ update pointer for cyclic use of the vector
10:     return k + 2r′⌊i/(2N/S)⌋ − 1
11: end procedure
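The static plane partition can be sketched as follows; the helper name is ours, and for simplicity the sketch assumes 2N/S divides the port indices evenly.

```python
def part_ftd_planes(i, n_ports, speedup, r_prime):
    """Planes available to demultiplexor i under PART-FTD (sketch):
    every group of 2N/S consecutive input-ports shares one disjoint
    group of 2r' planes, so each demultiplexor still sees the 2r'
    planes FTD needs while each plane serves at most 2N/S inputs."""
    group = i // (2 * n_ports // speedup)
    return list(range(2 * r_prime * group, 2 * r_prime * (group + 1)))
```

For N = 8, S = 4 and r′ = 2, inputs 0–3 share planes 0–3 and inputs 4–7 share planes 4–7, covering all K = Sr′ = 8 planes.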
Theorem 12 Ravg(PART-FTD) ≤ Rmax(PART-FTD) ≤ (2N/S + 1) r′ + N(1 − 2/S).
Proof. We use the same calculations as in the proof of Theorem 11. The only difference is that

$$\Delta^k_j(t_1, t_2) = A^k_j(t_1, t_2) - \frac{A_j(t_1, t_2)}{r'} \le \sum_{i=1}^{N}\left\lceil\frac{A_{i\to j}(t_1, t_2)}{r'}\right\rceil - \frac{A_j(t_1, t_2)}{r'} \le \frac{2N}{S}\cdot\frac{r'-1}{r'}$$

since at most 2N/S demultiplexors can send cells through plane k. Therefore, by Theorem 10, Rmax(PART-FTD) ≤ (2N/S + 1) r′ + N(1 − 2/S).
4.5.2 Optimal 1-RT Demultiplexing Algorithm
We describe a 1-RT demultiplexing algorithm that matches the lower bound presented in Theorem 8. Informally, Algorithm 2 divides the set of planes into two equal-size sets, V0 and V1, and divides its operation with respect to cells destined for a specific output-port into phases. In each phase, the algorithm sends cells destined for a specific output-port through a different set of planes (i.e., V0 or V1). After every time-slot, each input-port collects global information about the switch, and uses it to calculate the imbalance of each plane k with respect to each output-port j. In the next time-slot, each input-port sends a cell to output-port j only through planes with low (or zero) imbalance. Intuitively, phase i ends when there are no balanced planes left to use in Vi. Then, in the next phase, the demultiplexors use the planes of the set V1−i.

To avoid situations in which all the input-ports send cells through the same plane, we divide the input-ports into N/r′ sets of size r′, and ensure that, under no circumstances, two input-ports in the same set send cells destined for the same output-port through the same plane. This is done by calculating the actions of the other input-ports in the same set as if they indeed receive a cell destined for the same output-port.

With respect to each output-port j, planes are divided into three levels according to their imbalance (see Definition 10): balanced planes with imbalance Δ^k_j(t) ≤ 0, slightly imbalanced planes whose imbalance satisfies 0 < Δ^k_j(t) < N/r′, and extremely imbalanced planes with imbalance Δ^k_j(t) ≥ N/r′. At the beginning of each time-slot, a set of eligible planes, denoted F[j], is calculated for every destination j: a plane is eligible for output-port j if it is balanced with respect to output-port j, or if it has never been extremely imbalanced with respect to output-port j since the last phase change. Phase i changes to phase 1 − i when all planes k ∈ V1−i become balanced (the set Q[j] maintains the planes of V1−i that are still imbalanced; the phase changes when Q[j] = ∅).
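The three-level classification can be sketched as a small helper; the function name and string labels are ours, only the thresholds come from the text above.

```python
def plane_level(imbalance, n_ports, r_prime):
    """Imbalance level of a plane with respect to one output-port,
    as used by the 1-RT algorithm (sketch)."""
    if imbalance <= 0:
        return "balanced"
    if imbalance < n_ports / r_prime:    # 0 < Delta < N/r'
        return "slightly imbalanced"
    return "extremely imbalanced"        # Delta >= N/r'
```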
Example 1 Suppose that at time-slot t = 0, phase[0] changed from 1 to 0, Δ^9_0(0) = 6.5, and all other planes in V1 have imbalance at most 6.5. In addition, we assume that planes 1 and 2 did not receive any cells before time-slot 0.

The demultiplexors are divided into 4 sets: {0, 1}, {2, 3}, {4, 5}, {6, 7}. Upon receiving a cell, each demultiplexor calculates the behavior of all demultiplexors in its set that have a smaller index, and ensures that it will not send the cell through the same plane as them. Table 4.2 shows the plane number through which each demultiplexor would have sent a cell destined for output-port 0, had such a cell arrived at the switch. The actual arrivals are marked in framed boxes, and are taken into account in the following time-slots.

At time-slot 1, demultiplexor 0 will send a cell through the first plane in V0 (that is, plane 1). On the other hand, demultiplexor 1 must avoid sending its cell through plane 1, and therefore it will use plane 2. Similarly, demultiplexors 2, 4 and 6 will use plane 1, and demultiplexors 3, 5 and 7 will use plane 2.
Algorithm 2 1-RT Algorithm

Constants:
    V0 = {1, . . . , K/2}; V1 = {K/2 + 1, . . . , K}
Shared:
    F[N]: N sets of planes, initially all V0   ▷ cells for j can be sent only through F[j]
    R[N][r′]: matrix of values in {1, . . . , K, ⊥}, initially all ⊥   ▷ holding input-constraints
    t: value in {0, . . . , r′ − 1}, initially 0   ▷ cyclic pointer to matrix R
    Q[N], L[N]: N sets of planes, initially all ∅
    M[N]: N sets of planes, initially all {1, . . . , K}
    phase[N]: vector of values in {0, 1}, initially all 0

1: void procedure ADVANCE-CLOCK( )   ▷ invoked at the beginning of each time-slot
2:     for every j ∈ {1, . . . , N}: CALCULATE(j)
3:     for every j ∈ {1, . . . , N}: F[j] ← UPDATE(j)
4:     update the matrix R[N][r′] according to global information
5:     t ← (t + 1) mod r′
6: end procedure

1:  int procedure DISPATCH(cell c) at demultiplexor i
2:      j ← dest(c)
3:      p ← ⌊i/r′⌋
4:      set B ← ∅
5:      for x ← r′p to i do
6:          E ← {k ∈ {1, . . . , K} | ∃a ∈ {0, . . . , r′ − 1}, R[x][a] = k}
7:          k ← min (F[j] \ (B ∪ E))
8:          B ← B ∪ {k}
9:      end for
10:     R[i][t] ← k   ▷ can be read by other input-ports only in the next time-slot
11:     return k
12: end procedure

1:  set procedure UPDATE(int j)
2:      set S ← F[j]
3:      Q[j] ← Q[j] \ M[j]
4:      if Q[j] = ∅ then   ▷ change phase
5:          Q[j] ← {1, . . . , K} \ M[j]
6:          phase[j] ← 1 − phase[j]
7:          S ← V_phase[j]
8:      else
9:          S ← S \ L[j]
10:     end if
11:     return S
12: end procedure

1: void procedure CALCULATE(int j)
2:     set A ← {k | Δ^k_j(t) > N/r′}   ▷ using global information
3:     M[j] ← {k | Δ^k_j(t) ≤ 0}   ▷ using global information
4:     L[j] ← (L[j] ∪ A) \ M[j]
5: end procedure
Time slot        |  0  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8
-----------------|-----|-----|-----|-----|-----|-----|-----|-----|-----
Demultiplexor 0  |     |  1  |  2  |  1  |  2  |  3  |  9  | 10  |  1
Demultiplexor 1  |     |  2  |  1  |  2  |  3  |  2  | 10  |  9  |  2
Demultiplexor 2  |     |  1  |  2  |  1  |  2  |  2  |  9  | 10  |  1
Demultiplexor 3  |     |  2  |  1  |  2  |  3  |  3  | 10  |  9  |  2
Demultiplexor 4  |     |  1  |  2  |  1  |  2  |  2  |  9  |  9  |  1
Demultiplexor 5  |     |  2  |  1  |  2  |  3  |  3  | 10  | 10  |  2
Demultiplexor 6  |     |  1  |  1  |  2  |  2  |  2  |  9  |  9  |  1
Demultiplexor 7  |     |  2  |  2  |  1  |  3  |  3  | 10  | 10  |  2
Δ^1_0(t)         |     | 1.5 | 3.5 |  4  |  3  | 2.5 | 0.5 |  0  |  0
Δ^9_0(t)         | 6.5 |  5  |  3  | 1.5 | 0.5 |  0  |  0  | 0.5 | 0.5

Table 4.2: Illustration of Example 1. The plane number through which each demultiplexor would have sent a cell destined for output-port 0, had such a cell arrived at the switch. Actual arrivals are marked in framed boxes. No cells destined for other output-ports arrive in this time interval.
At time-slot 2, demultiplexor 0 cannot use plane 1 due to the input-constraint. Therefore, it will use plane 2, and demultiplexor 1 will use plane 1. Plane 1 becomes extremely imbalanced after time-slot 2, and therefore it is not eligible to receive cells for output-port 0 in the following time-slots. Although plane 1 becomes only slightly imbalanced after time-slot 3, Algorithm 2 dictates that it is still not eligible for output-port 0, since the phase has not changed yet.

The phase changes after time-slot 5, because for every plane k ∈ V1, Δ^k_0(5) ≤ 0. This implies that planes from the set V1 are used for sending cells destined for output-port 0 in the following time-slots. At this time, Q[0] = {1, 2}, since Δ^1_0(5) = 2.5 and Δ^2_0(5) = 0.5. The phase changes again after time-slot 7, since both Δ^1_0(7) and Δ^2_0(7) are not positive.
To prove the correctness of Algorithm 2, we start with two lemmas.
The first lemma shows that the imbalance between each plane and each output-port is bounded.
Lemma 9 In Algorithm 2, for every plane k ∈ V0 ∪ V1 and output-port j, Δ^k_j < 2N/r′.
Proof. Clearly, if Δ^k_j(t_3) > N/r′ then k ∈ L[j] at the beginning of time-slot t_3 + 1 (procedure CALCULATE, Line 4). Therefore, k ∉ F[j] at the beginning of time-slot t_3 + 1 (procedure UPDATE, Line 9), and cells are not sent through plane k until a time-slot t_3′ > t_3 + 1 in which Δ^k_j(t_3′ − 1) ≤ 0. This observation holds also if the phase changes at the beginning of time-slot t_3 + 1, since Q[j] = ∅ at Line 4 implies that V_phase[j] ⊆ M[j] at Line 7.
For every two input-ports i_1 and i_2, if ⌊i_1/r′⌋ = ⌊i_2/r′⌋, then i_1 and i_2 do not send cells destined for the same output-port through the same plane in the same time-slot (procedure DISPATCH). This implies that the maximum number of cells destined for the same output-port and sent through the same plane in a single time-slot is N/r′.

By Definition 10, if a plane does not receive cells destined for output-port j in time-slot t_1, then Δ^k_j(t_1) ≤ Δ^k_j(t_1 − 1). This implies that there is a time-slot t_1 in which plane k receives cells destined for j, and Δ^k_j(t_1) = Δ^k_j. In the worst case, Δ^k_j(t_1 − 1) = N/r′ and k receives N/r′ cells destined for j.
Assume, towards a contradiction, that Δ^k_j(t_1) ≥ 2N/r′. Then there is a time-slot t_2 such that Δ^k_j(t_2, t_1) ≥ 2N/r′. Note that Δ^k_j(t_1, t_1) < N/r′, since A^k_j(t_1, t_1) ≤ N/r′ and A_j(t_1, t_1) ≥ A^k_j(t_1, t_1). This implies that t_2 < t_1, and therefore by Definition 10:

$$\begin{aligned}
\Delta^k_j(t_2, t_1) &= A^k_j(t_2, t_1) - \frac{1}{r'}\,A_j(t_2, t_1)\\
&= A^k_j(t_2, t_1 - 1) + A^k_j(t_1, t_1) - \frac{1}{r'}\,A_j(t_2, t_1 - 1) - \frac{1}{r'}\,A_j(t_1, t_1)\\
&= \Delta^k_j(t_2, t_1 - 1) + \Delta^k_j(t_1, t_1) < \frac{2N}{r'}
\end{aligned}$$

This contradicts the choice of t_2, and the claim follows.
The second property is a simple conclusion from Lemma 9:

Lemma 10 If 2N cells destined for output-port j arrive at a PPS operating under Algorithm 2 during time-interval [t_1, t_2], and none of them is sent through plane k, then Δ^k_j(t_2) ≤ 0.
Proof. By Definition 10, there is a time-slot t_3 such that Δ^k_j(t_2) = Δ^k_j(t_3, t_2). If t_3 ≥ t_1, then Δ^k_j(t_3, t_2) ≤ 0, since A^k_j(t_3, t_2) = 0. Otherwise,

$$\begin{aligned}
\Delta^k_j(t_3, t_2) &= \Delta^k_j(t_3, t_1 - 1) + \Delta^k_j(t_1, t_2)\\
&= \Delta^k_j(t_3, t_1 - 1) + A^k_j(t_1, t_2) - \frac{1}{r'}\,A_j(t_1, t_2)
\end{aligned}$$

By Lemma 9 and Definition 10, Δ^k_j(t_3, t_1 − 1) ≤ Δ^k_j ≤ 2N/r′. Since A^k_j(t_1, t_2) = 0 and A_j(t_1, t_2) ≥ 2N, it follows that Δ^k_j(t_3, t_2) ≤ 0 also in this case.
The final theorem shows that a speedup of 8 suffices for this demultiplexing algorithm to achieve optimal relative queuing delay. Note that such a high speedup is considered impractical for real switches; yet, Algorithm 2 demonstrates that the lower bound presented in Theorem 8 is tight for u = 1.

Theorem 13 Speedup S = 8 suffices for Algorithm 2 to work correctly with maximum relative queuing delay of 3N + r′ time-slots.
Proof. It suffices to show that every time Line 7 of procedure DISPATCH is executed, F[j] \ (B ∪ E) ≠ ∅, and a plane can be chosen. Clearly, at each step |B| ≤ r′ and |E| < r′; therefore the claim follows if |F[j]| > 2r′. Since F[j] is changed only by procedure UPDATE(j), it suffices to show that |F[j]| > 2r′ after any execution of UPDATE(j).

Assume, without loss of generality, that phase[j] = 0 after an execution of procedure UPDATE(j) at time-slot t_1. Assume, by way of contradiction, that |F[j]| ≤ 2r′ at time-slot t_1. Clearly, from Line 7 and the fact that |V0| = |V1| = K/2 = Sr′/2 = 4r′ > 2r′, it follows that phase[j] = 0 already after time-slot t_1 − 1. This implies that |V0 ∩ L[j]| ≥ 2r′.

Denote by t_2 the last time-slot in which phase[j] was changed from 1 to 0 (t_2 = 0 if no such time-slot exists). At time-slot t_2, when executing Line 4, Q[j] is empty, and therefore all planes k ∈ V0 are in M[j] at time-slot t_2. This implies that for every k ∈ V0, Δ^k_j(t_2) ≤ 0.

Let k be a plane in V0 ∩ L[j]. By the definition of L[j], there is a time-slot t_3 ∈ [t_2, t_1] such that Δ^k_j(t_3) > N/r′. Let t_4 be the last time-slot such that Δ^k_j(t_3) = Δ^k_j(t_4, t_3). If t_4 < t_2 then

$$\begin{aligned}
\Delta^k_j(t_4, t_3) &= \Delta^k_j(t_4, t_2) + \Delta^k_j(t_2 + 1, t_3)\\
&\le \Delta^k_j(t_2) + \Delta^k_j(t_2 + 1, t_3) \le \Delta^k_j(t_2 + 1, t_3)
\end{aligned}$$

and therefore t_4 is not maximal. Hence t_4 ≥ t_2 and [t_4, t_3] ⊆ [t_2, t_1]. Since Δ^k_j(t_4, t_3) = A^k_j(t_4, t_3) − (1/r′) A_j(t_4, t_3) > N/r′, and A_j(t_4, t_3) ≥ A^k_j(t_4, t_3), it follows that A^k_j(t_4, t_3) > N/(r′ − 1).
Because |V0 ∩ L[j]| ≥ 2r′, the number of cells arriving at the switch and destined for j during time-interval [t_2, t_1] is at least 2r′ · N/(r′ − 1) > 2N. Since during time-interval [t_2, t_1] no cells are sent to any plane in V1, Lemma 10 implies that every plane k ∈ V1 has Δ^k_j(t_1) ≤ 0, and in particular all planes in Q[j]. This yields that Q[j] becomes empty, and the phase changes at least once during time-interval [t_2, t_1], which contradicts the choice of t_1 and t_2.

By Lemma 9 and Theorem 10, the relative queuing delay of the algorithm is at most 3N + r′.
4.5.3 Optimal u-RT Demultiplexing Algorithm
Algorithm 2 can be used as a building block for u-RT algorithms with u > 1. Algorithm 3 runs u
instances of Algorithm 2 in a round-robin manner, such that in each time-slot only one instance is
active (that is, the ith instance is active on time-slots i, i + u, i + 2u, i + 3u etc.). Since there are u
time-slots between two consecutive times in which the same instance is active, global information
on the previous time the instance was active can be shared among the demultiplexors. In addition,
each instance of Algorithm 2 has its own set of 8r′ planes, hence Algorithm 3 needs speedup
S = 8u.
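The round-robin wrapper can be sketched as follows. This is our illustrative rendering, not the thesis pseudocode; `instances` stands in for the u sub-demultiplexors, each assumed to expose a `dispatch(j)` method returning a local plane index.

```python
class URTDemux:
    """Sketch of the Algorithm 3 wrapper: u sub-demultiplexors (each a
    1-RT instance owning a disjoint set of planes) activated in
    round-robin, so each instance is driven by global information that
    is u time-slots old."""

    def __init__(self, instances, planes_per_instance):
        self.instances = instances
        self.per = planes_per_instance   # e.g. 8*r' planes per instance
        self.x = len(instances) - 1      # cyclic pointer, as in Algorithm 3

    def advance_clock(self):             # invoked once per time-slot
        self.x = (self.x + 1) % len(self.instances)

    def dispatch(self, j):
        # offset the local plane id into instance x's disjoint plane range
        return self.x * self.per + self.instances[self.x].dispatch(j)
```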
We next bound the imbalance under Algorithm 3:

Lemma 11 In Algorithm 3, for every plane k and output-port j, Δ^k_j < 2N/r′.
Proof. Assume towards a contradiction that there is a traffic T, a plane k and an output-port j such that Δ^k_j ≥ 2N/r′. Let t be the first time-slot in which Δ^k_j(t) ≥ 2N/r′, and let x = t mod u. The choice of t and Algorithm 3 imply that a cell is sent through plane k at time-slot t by instance x.

Let T′ be the traffic consisting of the cells of traffic T handled by instance x, that is, T′ = {c | c ∈ T, (t_a(c) − x) mod u = 0}. Let round(c) = (t_a(c) − x)/u be the number of times instance x was active until cell c arrived at the switch.

Consider the traffic T̄ in which each cell c of traffic T′ arrives at the switch at time-slot round(c), that is, T̄ = {shift(c, round(c) − t_a(c)) | c ∈ T′}. Let Āj(t_1, t_2) be the number of cells in traffic
Algorithm 3 u-RT Algorithm

Shared:
    ALG[u]: u instances of Algorithm 2   ▷ each instance with its own planes and shared variables
    x: value in {0, . . . , u − 1}, initially u − 1   ▷ cyclic pointer to array ALG

1: void procedure ADVANCE-CLOCK( )   ▷ invoked at the beginning of each time-slot
2:     x ← (x + 1) mod u
3:     ALG[x].ADVANCE-CLOCK( )   ▷ invoke procedure ADVANCE-CLOCK on the xth instance
4: end procedure

1: int procedure DISPATCH(cell c) at demultiplexor i
2:     return ALG[x].DISPATCH(c)   ▷ invoke procedure DISPATCH on the xth instance
3: end procedure
T̄ destined for output-port j that arrive at the switch during time-interval [t_1, t_2], and Ā^k_j(t_1, t_2) be the number of these cells that are sent through plane k by Algorithm 2. Similarly, following Definition 10, Δ̄^k_j(t_1, t_2) = Ā^k_j(t_1, t_2) − (1/r′) Āj(t_1, t_2), and Δ̄^k_j(t_2) = max_{t_1 ≤ t_2} Δ̄^k_j(t_1, t_2).

Since only instance x sends cells to plane k, and the dispatching decisions of instance x in response to traffic T are the same as the decisions of Algorithm 2 in response to traffic T̄, it follows that for every time t′ < t, A^k_j(t′, t) = Ā^k_j(⌈(t′ − x)/u⌉, (t − x)/u). On the other hand, T̄ contains a subset of T's cells destined for output-port j, and therefore A_j(t′, t) ≥ Āj(⌈(t′ − x)/u⌉, (t − x)/u). This implies that Δ̄^k_j((t − x)/u) ≥ Δ^k_j(t) ≥ 2N/r′, contradicting Lemma 9.
Lemma 11, Theorem 13 and Theorem 10 immediately imply:
Theorem 14 For any u ≥ 1 and a PPS with speedup S = 8u, there is a u-RT demultiplexing algorithm ALG such that Rmax(ALG) ≤ 3N + r′.
Note that speedup S = 8u is not feasible in real-life switches. Therefore, Algorithm 3 has
only theoretical importance. We leave for further research the question whether there is an optimal
u-RT demultiplexing algorithm which requires a speedup that does not depend on u.
4.6 Extensions of the PPS model
4.6.1 The Relative Queuing Delay of an Input-Buffered PPS
We extend our definitions for the bufferless PPS model to the case in which there are buffers in the input-ports. In an input-buffered PPS, when a cell arrives, the demultiplexor either sends the cell to one of the planes or keeps it in its buffer. In every time-slot, the demultiplexor may send any number of buffered cells to the planes, provided that the rate constraints on the lines between the input-port and the planes are preserved. In this section, we consider only deterministic demultiplexing algorithms (a discussion on extending the bounds to randomized algorithms appears at the end of the section).
We refer to the buffer residing at input-port i, with finite size s, as a vector b_i ∈ {1, . . . , N, ⊥}^s. An element of this vector contains the destination of the cell stored at the corresponding place in the buffer. Empty places in the buffer are indicated by ⊥ in the vector. The size of the buffer at input-port i is denoted |b_i|.
The demultiplexor state machine is changed to include the state of the input-port buffer. Bi
denotes the set of the reachable states of the buffer residing in input-port i. We refer to the set of
states of the ith demultiplexor as Si × Bi. A switch configuration includes also the input-buffers
content.
Definition 11 The demultiplexing algorithm of the demultiplexor residing at input-port i with an input-buffer is a function

ALG_i : {1, . . . , N, ⊥} × S_i × B_i → S_i × {1, . . . , K, ⊥}^{|b_i|+1}

which gives the next state and a vector of size |b_i| + 1, according to the incoming cell's destination (⊥ if no cell arrives), the current state, and the content of the buffer.

The vector of size |b_i| + 1 returned by the function ALG_i states through which plane to send the cell in the corresponding place in the buffer; the last element of the vector refers to the incoming cell; ⊥ indicates that the corresponding cell remains in the buffer. We denote by t_o(c, T)
the time-slot in which cell c ∈ T is sent from the input-port to one of the planes; since cells can be queued in the input-port, t_o(c, T) can be larger than t_a(c), unlike in the bufferless PPS model.
When measuring the relative queuing delay in an input-buffered PPS, the queuing of cells both
in the input-buffers and the planes’ buffers of the PPS should be compared to the queuing of cells
in the output-buffers of the shadow switch. Generally, input buffers increase the flexibility of the
demultiplexing algorithms, which leads to weaker lower bounds.
We prove these lower bounds by constructing (f, s) weakly-concentrating executions (recall
Definition 5):
Theorem 15 Any deterministic fully-distributed demultiplexing algorithm ALG with any input-buffers' sizes has Rmax(ALG) ≥ (N/S)(1 − 1/r′) and J(ALG) ≥ (N/S)(1 − 1/r′) time-slots.
Proof. Let C be a switch configuration in which all the buffers in the switch are empty, and denote by t_0 the time-slot in which the PPS is in configuration C. Denote the state of demultiplexor i in configuration C by (s^0_i, b^0_i). Clearly, b^0_i = ⊥^{|b_i|}.

For every input-port i, consider the traffic T|i = {c_i}, which consists of a single cell c_i with orig(c_i) = i, dest(c_i) = j and t_a(c_i) = t_0. Note that t_o(c_i) − t_a(c_i) ≤ N/S, since otherwise T|i causes a relative queuing delay greater than N/S time-slots. Let (s^f_i, b^f_i) be the demultiplexor state just before this cell is sent. Clearly, T|i has no bursts, since only one cell arrives at the switch.

Since K < N, there exist a plane k and a set of N/K demultiplexors I = {i_1, . . . , i_{N/K}} such that plane(c_i) = k for every i ∈ I.

Now consider another traffic T = T|i_1 ◦ T|i_2 ◦ · · · ◦ T|i_{N/K}. That is, traffic T begins in configuration C, and for every time-slot t ∈ {t_0, . . . , t_0 + N/K − 1} one cell, which is destined for output-port j, arrives at input-port i ∈ I. Note that in every time-slot at most one cell arrives at the switch; therefore this traffic has no bursts.

Since ALG is a fully-distributed demultiplexing algorithm, and all the buffers are empty in configuration C, a demultiplexor i ∈ I does not change its state until the first cell arrives. Before T and T|i begin, demultiplexor i is in state (s^0_i, b^0_i), and its individual flow under T is exactly the same as under T|i (only one cell destined for output-port j arrives). Therefore, demultiplexor i ∈ I changes its state to (s^f_i, b^f_i), and sends its cell to plane k, implying that E_PPS(ALG, T) is an (N/K, N/K) weakly-concentrating execution for output-port j and plane k.

Applying Lemma 2 with f = N/K, s = N/K and B = 0 yields lower bounds of (N/K)r′ − N/K = (N/S)(1 − 1/r′) on the maximum relative queuing delay and the relative delay jitter.
Unlike fully-distributed demultiplexing algorithms, the size of the input-buffers affects the relative queuing delay in an input-buffered PPS under u-RT demultiplexing algorithms. A PPS that can store u cells in each input-port is able to support a u-RT demultiplexing algorithm that guarantees a relative queuing delay of at most u time-slots, by simulating the CPA algorithm [74]. Note that CPA assumes the PPS is a global FCFS switch, i.e., cells leave an output-port in FCFS order, regardless of the input-port from which they originated.

Theorem 16 There is a u-RT demultiplexing algorithm for a global FCFS input-buffered PPS, with buffer size at least u and speedup S ≥ 2, with a relative queuing delay of at most u time-slots.

This algorithm may be impractical; yet, it demonstrates that a lower bound of Ω(N) time-slots does not hold when the input-buffers are sufficiently large. When buffers are smaller than u, we show that a global FCFS deterministic input-buffered PPS has a relative queuing delay of (N/S)(1 − 1/r′) time-slots, under leaky-bucket traffic with burstiness factor u(N/K − 1):
Theorem 17 Any deterministic u-RT demultiplexing algorithm ALG with input-buffers' sizes smaller than u has Rmax(ALG) ≥ (N/S)(1 − 1/r′) and J(ALG) ≥ (N/S)(1 − 1/r′) time-slots.
Proof. Let C be the switch configuration at time t_0, and assume that at this time all the buffers in the switch are empty. Let T|i be a traffic comprised of cells c with orig(c) = i and dest(c) = j, such that one cell arrives at input-port i in each time-slot, until the first cell destined for output-port j is sent to one of the planes. This takes less than u time-slots, because otherwise input-port i queues more cells in its input-buffer than its capacity allows. Note that T|i is a leaky-bucket traffic with no bursts.

Since the PPS is FCFS, and only cells of traffic T|i arrive at the switch, the first cell to leave input-port i's buffer is the first cell of traffic T|i. We denote this cell by c_i.
Since K < N, there exist a plane k and a set of demultiplexors I ⊆ {1, . . . , N} of size N/K, such that plane(c_i) = k for every i ∈ I. Let t′ = max{t_o(c_i, T|i) | i ∈ I}.

Now compose all traffics T|i for i ∈ I and append the time-interval (t′, t′ + u(N/K − 1)] in which no cells arrive at the switch. T = ∪_{i∈I} T|i denotes the composite traffic, starting from configuration C at time t_0, which is an (R, u(N/K − 1)) leaky-bucket traffic.

Every demultiplexor i ∈ I goes through the same state transitions in response to T|i and T, since composing the traffics does not change the switch configurations in time-interval [0, t_0], t_o(c_i) − u < t_0, and its local information is identical under T|i and T. Hence, demultiplexor i sends the cell c_i to plane k in time-slot t_o(c_i, T) = t_o(c_i, T|i) < t_0 + u.

Notice that under traffic T, in time-interval [t_0, t_0] (that is, the first time-slot), N/K cells destined for the same output-port arrived at the switch and were sent through the same plane. Furthermore, the burst of traffic T during this interval is N/K − 1. Thus, Lemma 2 with f = N/K, s = 1 and B = N/K − 1 implies that Rmax(ALG) = J(ALG) = (N/K)r′ − (1 + N/K − 1) = (N/S)(1 − 1/r′), as required.
We leave for future research the question whether these lower bounds apply also to the average relative queuing delay and to randomized algorithms. The major difference between these proofs and our other lower-bound proofs (described in Section 4.3) is that they employ executions in which a concentration occurs at the beginning of the traffic rather than at its very end (that is, weakly-concentrating executions). Therefore, it is unclear how a proper continuation of this traffic can be devised. Another interesting future research direction is to present a methodology and algorithms that match the lower bounds for the input-buffered PPS.
4.6.2 Recursive Composition of PPS
Another extension of the PPS model is implementing the planes themselves as PPSs (operating at a lower rate). A (q, K)-recursive-PPS ((q, K)-RPPS) is defined recursively as follows:

Base case: A (1, 〈k_1〉)-RPPS is a PPS with k_1 planes, operating at external rate R and internal rate r_1, as described in Section 4.2.

Recursion step: A (q + 1, K〈k_{q+1}〉)-RPPS is a (q, K)-RPPS whose planes are replaced with PPS switches, each with k_{q+1} planes, that operate at external rate r_q and internal rate r_{q+1}. Note that r_q > r_{q+1}. K〈k_{q+1}〉 denotes the concatenation of the vector 〈k_{q+1}〉 after the vector K.

This composition is depicted in Figure 11.

Figure 11: A (2, 〈2, 2〉)-RPPS with 5 input-ports and 5 output-ports.
When a cell arrives at a K-RPPS, it is demultiplexed through a chain of q demultiplexors (where q is the length of the vector K) until it is sent to an output-queued switch. It is important to notice that each demultiplexor in this chain handles traffic originating only from a single input-port. The collection of demultiplexors that handle all flows originating from input-port i, denoted G_i, forms a tree of height q with ∏_{i=1}^q K[i] leaves; the level of each demultiplexor is its distance from the root of the corresponding tree G_i.
In the homogeneous case, where all the demultiplexors in G_i are of the same type, G_i can be considered as a single (yet complex) demultiplexor of this type. Therefore, all lower-bound results described in Section 4.3 hold after substituting K with ∏_{i=1}^q K[i] and r with r_q.
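This substitution can be sketched as a one-line helper; the function and parameter names are ours.

```python
from math import prod

def rpps_effective_params(K, rates):
    """For a homogeneous (q, K)-RPPS (sketch): the tree of demultiplexors
    serving one input-port has prod(K) leaves, so the Section 4.3 lower
    bounds apply with prod(K) planes and the innermost rate r_q.
    K is the vector of fan-outs; rates = [r_1, ..., r_q]."""
    return prod(K), rates[-1]
```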
For simplicity, we present the results only for N-partitioned fully-distributed demultiplexing algorithms:
Corollary 18 Any homogeneous RPPS that uses (randomized) N-partitioned fully-distributed demultiplexing algorithms has, with probability 1 − δ, an average relative queuing delay of at least N(R/r_q − 1) − ε time-slots, against adaptive and oblivious adversaries, where ε > 0 and δ > 0 can be made arbitrarily small (δ = 0 for deterministic algorithms).
As in Theorem 5, the lower bound against an oblivious adversary holds only if the RPPS obeys a global FCFS policy.

Corollary 19 Any homogeneous RPPS that uses (randomized) u-RT demultiplexing algorithms has, with probability 1 − δ, an average relative queuing delay of at least (uN/S′)(1 − u r_q/R) − ε time-slots, against an adaptive adversary, where S′ = (r_q/R) ∏_{i=1}^q K[i], u = R/(2r_q), and ε, δ > 0 can be made arbitrarily small (δ = 0 for deterministic algorithms).
These results imply that building a PPS recursively and homogeneously does not improve its relative queuing delay. Note that a similar approach may be applied in order to analyze an input-buffered RPPS; hence, as in Theorems 16 and 17, the lower bound on the relative queuing delay sometimes depends on the relation between the buffer size and the type of information used.

Since sharing information becomes more feasible as the external rate of the switch decreases, it is also interesting to investigate a monotone K-RPPS in which the switches at levels 1, . . . , v operate under fully-distributed algorithms, the switches at levels v + 1, . . . , w operate under u-RT demultiplexing algorithms, and the switches at levels w + 1, . . . , q are centralized. All demultiplexing algorithms can be either deterministic or randomized. For brevity, we assume all u-RT demultiplexors operate with the same parameter u, and identify such a recursive PPS by the tuple 〈K, v, w, u〉.
If all demultiplexors are bufferless, Corollaries 18 and 19 imply the following lower bound:

Corollary 20 A monotone (randomized) 〈K, v, w, u〉-RPPS has, with probability 1 − δ, an average relative queuing delay of at least

$$\max\left\{\, N\left(\frac{R}{r_v} - 1\right),\; \frac{uN}{S'}\left(1 - \frac{u\,r_w}{r_v}\right) \right\} - \varepsilon$$

against an adaptive adversary, where S′ = (r_w/r_v) ∏_{i=v+1}^w K[i], u = r_v/(2r_w), and ε, δ > 0 can be made arbitrarily small (δ = 0 for deterministic algorithms).
Proof. Consider a single input-port i and the collection of demultiplexors that handles all flows originating from input-port i.

On one hand, the demultiplexors at levels 1, . . . , v form a homogeneous fully-distributed demultiplexor. Therefore, Corollary 18 implies that it attains, with probability 1 − δ, an average relative queuing delay of at least N(R/r_v − 1) − ε time-slots.

On the other hand, the demultiplexors at levels v + 1, . . . , w form a collection of homogeneous u-RT distributed demultiplexors. Therefore, Corollary 19 implies that each of these demultiplexors attains, with probability 1 − δ, an average relative queuing delay of at least (uN/S′)(1 − u r_w/r_v) − ε time-slots.

Therefore, the overall average relative queuing delay is as claimed.
We leave for further research the construction of algorithms for recursive PPSs and the analysis of other combinations of demultiplexing algorithms (e.g., when some of the demultiplexors are bufferless and some have input-buffers).
Chapter 5
Packet-Mode Scheduling in CIOQ Switches
In many network protocols, from very large Wide Area Networks (WANs) to small Networks on Chips (NoCs), traffic is composed of variable-size packets. A prime example is provided by IP datagrams, whose sizes typically vary from 40 to 1500 bytes [112]. Real-life switches, however,
operate with fixed-size cells, which are easier to buffer and schedule synchronously in an electronic
domain.
Transmitting packets over cell-based switches requires the use of packet segmentation and reassembly modules, resulting in a significant computation and communication overhead [77]. Cell-based scheduling is expected to become an even more critical problem as the use of optics becomes widespread, since future switches could deal with packets in the optical domain and might be unable to afford their segmentation and reassembly. Cell-based schedulers, unaware of packet boundaries, may cause performance degradation; indeed, packet-aware switches typically have lower drop rates, since they may reduce the number of retransmissions by ensuring that only complete packets are sent over the switch fabric (cf. [141, Page 44]).
Packet mode schedulers [57, 101] bridge this gap by delivering packets contiguously over the
switch fabric, implying that until a packet is fully transmitted, neither its originating port nor its
destination port can handle different packets.
It is imperative to explore whether packet-mode schedulers can provide performance guarantees similar to those of cell-based schedulers. We address this question by focusing on CIOQ switches
and investigating whether a packet-mode CIOQ switch can mimic an ideal shadow switch with
bounded relative queuing delay.
5.1 Our Results
In this chapter, we present packet-mode schedulers for CIOQ switches that mimic an ideal switch
with bounded relative queuing delay. Since such mimicking requires CIOQ switches with a certain
speedup, we further investigate the trade-off between the speedup of the switch and its relative
queuing delay.
We devise pipelined frame-based schedulers, in which scheduling decisions are made at frame boundaries. Our schedulers and their analysis rely on matrix decomposition techniques. At
each frame, a demand matrix, representing the total size of packets between each input-output pair,
is decomposed into permutations that dictate the scheduling decisions in the next frame. The major
challenge in these decompositions is ensuring contiguous packet delivery while decomposing the
demand matrix to as few permutations as possible.
We show that, in contrast to a cell-based CIOQ switch, a packet-mode CIOQ switch cannot exactly emulate an ideal shadow switch (e.g., an output-queued (OQ) switch), whatever the speedup.
However, once we allow for a bounded relative queuing delay, we find that a packet-mode CIOQ
switch does not require a fundamentally higher speedup than a cell-based CIOQ switch.
Specifically, we show (Theorem 26) that a speedup of 2 + O(1/Rmax) suffices to ensure that a packet-mode CIOQ switch mimics an ideal switch with maximum relative queuing delay Rmax = O(N · lcm(Lmax)) time-slots, where Lmax is the maximum packet size, and lcm(Lmax) is the least common multiple of 1, . . . , Lmax. This result also holds in the common case where only a few packet sizes are legal, and the resulting relative queuing delay is O(N · lcm(L)) time-slots, where L is the restricted set of legal packet sizes. It is important to note that if L = {1, . . . , Lmax}, then lcm(L) is exponential in Lmax, since it is bounded from below by the primorial of Lmax, Lmax#, and from above by the factorial of Lmax, Lmax!; both Lmax# and Lmax! are exponential in Lmax.
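As a quick numeric sanity check (our own illustration, not part of the thesis; the helper name `primorial` is ours), the following Python snippet verifies the primorial/factorial sandwich and shows how fast lcm(1, . . . , Lmax) grows:

```python
from math import lcm, prod, factorial

def primorial(n):
    """Product of all primes <= n (computed with a simple sieve)."""
    sieve = [True] * (n + 1)
    primes = []
    for p in range(2, n + 1):
        if sieve[p]:
            primes.append(p)
            for q in range(p * p, n + 1, p):
                sieve[q] = False
    return prod(primes)

for L_max in (5, 10, 20, 30):
    l = lcm(*range(1, L_max + 1))          # lcm of 1, ..., Lmax
    # Lmax#  <=  lcm(1, ..., Lmax)  <=  Lmax!
    assert primorial(L_max) <= l <= factorial(L_max)
    print(L_max, l)
```

Already for Lmax = 30 the value of lcm(1, . . . , Lmax) exceeds 10^12, which is why the speedup-4 result below, whose delay avoids the lcm term, is of practical interest.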
The relative queuing delay can be significantly reduced with just a doubling of the speedup. We show (Theorem 25) that a speedup of 4 + O(1/Rmax) suffices to ensure that a packet-mode
Figure 12: Summary of our results, plotting the relative queuing delay (logarithmic scale) against the speedup (with marked points for Theorems 22, 25, 26 and Corollaries 27, 28). The solid line represents emulation of an ideal switch with unbounded buffer size, while the dashed line represents emulation of an ideal switch with buffer size B per output-port.
CIOQ switch mimics an ideal shadow switch with a more reasonable relative queuing delay of
Rmax = O(NLmax) time-slots.
The relative queuing delay can be further reduced to only Lmax − 1 time-slots if the speedup is increased to 2Lmax (Theorem 22). In addition, we show (Theorem 21) that it is impossible to
achieve relative queuing delay of less than Lmax/2−3, regardless of the speedup used. In particular,
packet-mode schedulers cannot exactly emulate OQ switches (with no relative queuing delay).
Finally, we consider mimicking an ideal switch with a bounded buffer size B at each output-port. Extending the matrix decomposition techniques, we show (Corollary 28) that with a smaller speedup of 1 + O(1/Rmax) and relative queuing delay Rmax = O(B + N · lcm(Lmax)), a packet-mode CIOQ switch mimics an ideal shadow switch with buffer size B.
Figure 12 summarizes our results and demonstrates the trade-off between the speedup required
for switch mimicking and the resulting relative queuing delay.
5.2 A model for packet-mode CIOQ switches
This section extends the model defined in Chapter 3 to capture packet-mode switches in general,
and specifically packet-mode CIOQ switches.
In a packet-mode switch, packets of variable size traverse the switch contiguously. The packet
size is measured in cell-units, where the minimal packet size is one cell and the maximum packet
size is Lmax cells. All cells of the same packet arrive at the switch contiguously at the same input-port and are destined for the same output-port. Therefore, we refer to a packet simply as a sequence
of cells and assume that its size is known upon arrival of its first cell (e.g., the total size is written
in the header).
Packet-mode switches are required to ensure that cells of the same packet leave the switch contiguously; that is, cells of the same packet should leave the switch one after the other with no interleaving of cells from other packets.
A packet-mode switch should further provide a relaxed notion of the first-come-first-served (FCFS) discipline. If the last cell of packet p arrives at the switch before the first cell of packet p′ and both
packets share the same output-port, then all cells of packet p should leave the switch before the
cells of packet p′. We denote this partial order of packets by p ≺ p′ (i.e., packet p should be handled
before packet p′).
An ideal packet-mode shadow switch (e.g., a packet-mode OQ switch) should also be work-conserving: namely, if a cell is pending for output-port j at time-slot t, then some cell leaves the switch from output-port j at time-slot t [37, 88, 91]. We denote by tlS(c) the time-slot at which cell c is delivered by the shadow switch. The contiguous packet delivery implies that for any packet p = (c1, . . . , cℓ), tlS(ci) = tlS(cj) + (i − j) for 1 ≤ j ≤ i ≤ ℓ.
Recall that in a CIOQ switch with speedup S, packets arriving at rate R are first buffered in the
input side and then forwarded over the switch fabric to the output-side as dictated by a scheduling
algorithm (see Figure 2). Packets that arrive at input-port i and are destined for output-port j are
stored in the input side of the switch in a separate buffer, which is called virtual output-queue and
denoted by V OQij . The switch fabric operates at rate S ·R, where S is the speedup of the switch,
Figure 13: Illustration of the proof of Theorem 21, showing packets p1, p2 and p3 on inputs 1 and 2 along a time axis marked at time-slots Rmax + 2, Rmax + 3 and Lmax + Rmax + 2; white packets are destined for output-port 1, while the gray packet is destined for output-port 2.
implying that the switch has S scheduling opportunities (or scheduling decisions) every time-slot.¹
A packet-mode CIOQ switch ensures that if a packet p from input-port i to output-port j con-
sists of the cells (c1, . . . , c`) then after cell c1 is transmitted across the switch fabric, no cells of
packets other than p are transmitted from input port i or to output port j until cell c` is transmitted.
Naturally, cells of the same packet are transmitted in order.
It is possible that some input-port i starts transmitting cells of a packet p before all the cells of packet p have arrived at the switch. Since the speedup of the switch is typically greater than 1, this may
cause the switch to under-utilize its speedup. For example, suppose that the first cell c1 of a packet
p = (c1, c2, . . . , c`) arrives at input-port i at time-slot ta(c1) and is immediately sent to output-port
j in the first scheduling opportunity of time-slot ta(c1). Since cell c2 arrives at the switch only at
time-slot ta(c2) = ta(c1) + 1, no cells can be sent from input-port i or to output-port j for the
next S − 1 scheduling opportunities (even if there are cells of other packets in one of the relevant
buffers).
5.3 Simple Upper and Lower Bounds on the Relative Queuing
Delay
We show that a packet-mode CIOQ switch cannot mimic an ideal shadow switch with a small relative queuing delay, regardless of the CIOQ switch speedup. In particular, this result implies that a packet-mode CIOQ switch cannot exactly emulate an OQ switch, whatever the speedup used. This runs
¹For non-integral speedup values, the speedup S is the average number of such scheduling decisions per time-slot, where at each time-slot the switch makes between ⌊S⌋ and ⌈S⌉ scheduling decisions [59].
against the conventional wisdom that “speedup N solves every problem”.
Theorem 21 A packet-mode CIOQ switch cannot mimic an ideal switch with a relative queuing delay Rmax < Lmax/2 − 3 time-slots.
Proof. Assume towards a contradiction that the CIOQ switch mimics an ideal shadow switch with relative queuing delay Rmax < Lmax/2 − 3, and consider the following traffic, comprising only three packets (see Figure 13): At time-slot 1, a packet p1 of size Lmax arrives at input-port 1, destined for output-port 1. At time-slot Rmax + 2, another packet, denoted p2, of size 1 arrives at input-port 2, destined for output-port 1. At time-slot Rmax + 3, a packet p3 of size Lmax arrives at input-port 2, destined for output-port 2.
At time-slot 1, packet p1 is the only packet destined for output-port 1; since the shadow switch
is work-conserving, the first cell of p1 is delivered by the shadow switch at time-slot 1, implying
it must be delivered by the CIOQ switch by time-slot Rmax + 1. Packet-mode scheduling restricts the switch from delivering cells of other packets to output-port 1 until the last cell of packet p1 is delivered. Since the last cell of packet p1 arrives at the switch only at time-slot Lmax, output-port 1 is busy handling p1 at least until time-slot Lmax.
Using the same arguments, the first cell of packet p3 must be delivered to output-port 2 by time-slot 2Rmax + 3, and input-port 2 is busy handling p3 at least until time-slot Lmax + Rmax + 2.
Since Lmax > 2Rmax + 3, packet p2 cannot be delivered to output-port 1 until time-slot Lmax + Rmax + 2. But packet p2 is delivered by the shadow switch at time-slot Lmax + 1, implying that its relative queuing delay is at least Rmax + 1, contradicting the assumption.
Note that this result holds since the CIOQ switch waits for the cells of the different packets to
arrive, and therefore under the situation described in the proof of Theorem 21, the switch in fact
degrades to work at the external line rate (i.e., with S = 1), as an IQ switch. The result is therefore
consistent with the known result that IQ switches, with speedup 1, cannot emulate output-queued
switches [37].
We now show that a CIOQ switch can mimic a shadow switch with relative queuing delay of
Lmax − 1 time-slots provided it has a sufficiently large speedup of 2Lmax. The algorithm closely
follows the CCF algorithm, which emulates (precisely) a cell-based OQ switch with speedup S =
2 [37].
Intuitively, multiplying the speedup by the maximum packet size Lmax reduces the problem
of packet-mode switching to cell-based switching: Each cell-based scheduling decision can be
mapped to Lmax contiguous packet-mode scheduling decisions, implying that a packet can be
transmitted contiguously. In addition, a relative queuing delay of Lmax − 1 time-slots allows the
scheduler to wait until a packet fully arrives at the switch before it is scheduled. The following
theorem captures this simple result:
Theorem 22 A packet-mode CIOQ switch with speedup S = 2Lmax can mimic an ideal shadow switch with relative queuing delay of Lmax − 1 time-slots.
Proof. For each time-slot t, let traffic T(t) be the collection of cells that arrive at the switch by time-slot t, and let T′(t) ⊆ T(t) be the traffic comprising only the cells in T(t) that are the first cells of their corresponding packets. Denote by t′CCF(c) the time-slot in which the CCF algorithm with speedup S = 2 schedules a cell c of traffic T′(t) over the switch fabric, and let tlS′(c) be the time-slot in which c leaves a cell-based OQ switch that handles traffic T′.
The packet-mode CCF algorithm (PM-CCF) simulates the behavior of a cell-based CCF: For each packet p of traffic T(t), PM-CCF forwards the entire packet p contiguously over the switch fabric in time-slot tPM-CCF(p) = t′CCF(first(p)) + Lmax − 1.
Since the cell-based CCF works with speedup S = 2, for each time-slot t there are at most two cells that share the same input-port or output-port and are forwarded over the switch fabric by the cell-based CCF in time-slot t. PM-CCF works correctly since it has 2Lmax scheduling opportunities at each time-slot and therefore can schedule the packets corresponding to these two cells entirely in the same time-slot t. In addition, the contiguous arrival of packets at the input-ports ensures that packet p has fully arrived at the switch by time-slot t′CCF(first(p)) + Lmax − 1.
For each cell c of traffic T = ∪_t T(t), tlS(c) denotes the time-slot in which c leaves the packet-mode shadow switch. Note that tlS(c) ≥ tlS(first(packet(c))) ≥ tlS′(first(packet(c))), because cells corresponding to the same packet are delivered in order and traffic T′ = ∪_t T′(t) is a subset of traffic T. Since the cell-based CCF emulates a cell-based OQ switch, it follows that for
each cell c of traffic T :
tlS(c) ≥ tlS′(first(packet(c)))
       ≥ t′CCF(first(packet(c)))
       = tPM-CCF(c) − (Lmax − 1)
This implies that every cell c can be delivered from a CIOQ switch with packet-mode CCF at
time-slot tlS(c) + Lmax − 1, and the claim follows.
This result only demonstrates the possibility of mimicking an ideal shadow switch with bounded delay, since a speedup of S = 2Lmax is unreasonable in practical switches.
Furthermore, this result also shows that although cut-through CIOQ switches (that is, switches that do not wait for a packet to fully arrive at the switch before starting to schedule it) may provide smaller delay in cell-mode scheduling, in packet-mode scheduling it is more profitable to use store-and-forward CIOQ switches, which wait for packets to fully arrive at the switch before scheduling them.
In the rest of this chapter, we show how to achieve a similar result with smaller speedup, by
presenting a tradeoff between the speedup and the relative queuing delay: As the speedup of the
switch increases, the needed relative queuing delay for mimicking a shadow switch decreases.
5.4 Tradeoffs between the speedup and the relative queuing delay
Our scheduling algorithms operate in a frame-based pipelined manner, with scheduling decisions
done only at the frame boundaries. At each frame boundary, the algorithms first construct several
demand matrices, and then decompose these matrices into permutations (or sub-permutations).
The algorithms satisfy the demands by scheduling the cells in the next frame according to the
resulting permutations.
The algorithms and their analysis rely on some results of matrix theory, which are presented
next.
5.4.1 Matrix Decomposition
Definition 12 A permutation P is a 0-1 matrix such that the sum of each row and the sum of each
column is exactly 1. A sub-permutation P is a 0-1 matrix such that the sum of each row and the
sum of each column is at most 1.
In the rest of this chapter, for simplicity, we refer to sub-permutations as permutations. The
following definition captures the fact that the number of cells that should be scheduled from a
single input-port or to a single output-port is bounded:
Definition 13 A matrix A ∈ ℕ^{N×N} is C-bounded if the sum of each row and each column in A is at most C.
A classical result says that any C-bounded matrix A can be decomposed into C permutations,
whose sum dominates A:
Theorem 23 (BIRKHOFF-VON NEUMANN DECOMPOSITION [25, 52, 144]) If a matrix A ∈ ℕ^{N×N} is C-bounded for an integer C, then there are C permutations P1, . . . , PC such that A ≤ ∑_{i=1}^{C} Pi.
Note that since all values in the matrix A are integers, the same result can be obtained using König's Theorem, which bounds the chromatic index of a bipartite graph by its maximum vertex degree [90].
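To make Theorem 23 concrete, here is a Python sketch of the decomposition (our own illustration; all function names are ours): the C-bounded matrix is first padded so that every row and column sums to exactly C, after which a perfect matching on its support, found with Kuhn's augmenting-path algorithm, can be peeled off C times.

```python
def pad_to_regular(A, C):
    """Add slack (northwest-corner rule) so every row and column sums exactly to C."""
    n = len(A)
    M = [row[:] for row in A]
    for i in range(n):
        for j in range(n):
            r = C - sum(M[i])                          # remaining row deficit
            c = C - sum(M[k][j] for k in range(n))     # remaining column deficit
            M[i][j] += min(r, c)
    return M

def perfect_matching(support):
    """Kuhn's augmenting-path algorithm on the bipartite support graph.
    Returns match_col, where match_col[j] is the row matched to column j."""
    n = len(support)
    match_col = [-1] * n

    def try_row(i, seen):
        for j in range(n):
            if support[i][j] and not seen[j]:
                seen[j] = True
                if match_col[j] == -1 or try_row(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(n):
        assert try_row(i, [False] * n)  # exists since the matrix is C-regular
    return match_col

def birkhoff_decompose(A, C):
    """Return C permutations (as column -> row matchings) whose sum dominates A."""
    n = len(A)
    M = pad_to_regular(A, C)
    perms = []
    for _ in range(C):
        mc = perfect_matching([[M[i][j] > 0 for j in range(n)] for i in range(n)])
        perms.append(mc)
        for j, i in enumerate(mc):
            M[i][j] -= 1                # subtract the permutation; M stays regular
    return perms

# The 6-bounded example matrix used later in this chapter.
A = [[3, 1, 2, 0], [0, 2, 2, 2], [1, 2, 2, 1], [2, 1, 0, 3]]
print(birkhoff_decompose(A, 6))
```

Since the example matrix happens to be exactly 6-regular, the padding step adds nothing and the six extracted permutations sum to A precisely.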
The Birkhoff-von Neumann decomposition implies that every C-bounded demand matrix can be scheduled, cell by cell, in C scheduling opportunities (or, equivalently, in ⌈C/S⌉ time-slots) when permutation Pi dictates the scheduling in opportunity i. However, such a scheduling may
violate the packet-mode restrictions, since there is no relation between adjacent permutations in
the sequence. For reasons that will become clear shortly, we are interested in the following class
of permutations:
Definition 14 A maximal matching for a matrix A = [aij] is a permutation matrix P = [pij] ≤ A such that if pij = 0 and aij > 0, then there exists i′ such that pi′j = 1 or there exists j′ such that pij′ = 1.
Intuitively, a permutation P ≤ A is a maximal matching for a matrix A if no element can be
added to P , resulting in a matrix that is still a permutation and is dominated by A.
The next theorem shows that if a matrix is decomposed by any sequence of maximal matchings
then the number of permutations needed is at most twice the number needed in Theorem 23. The
decomposition of a C-bounded matrix A works iteratively: In each iteration m, a maximal match-
ing P (m) for the matrix A(m − 1) is found and then subtracted from A(m − 1) to form A(m)
(negative values are treated as zeros). The procedure stops when A(m) = 0.
We next show that this happens after at most 2C − 1 iterations, regardless of the choice of the
maximal matching in each iteration, implying that the matrix A is decomposed into less than 2C
permutations.
Theorem 24 ([145, THEOREM 2.2]) For every C-bounded matrix A ∈ ℕ^{N×N}, the decomposition procedure described above stops after at most 2C − 1 iterations.
Proof. Denote A(0) = A and let P(m) = [p(m)ij] be the maximal matching found in iteration m. Let A(m) = [a(m)ij] be the matrix resulting from subtracting the permutation P(m) from the matrix A(m − 1).
If A(2C − 1) ≠ 0, then there exist i, j such that a(2C − 1)ij > 0. Let a(2C − 1)ij = k and a(0)ij = ℓ. This implies that p(m)ij = 1 in exactly ℓ − k of the permutations P(m) (1 ≤ m ≤ 2C − 1), and therefore p(m)ij = 0 in (2C − 1) − ℓ + k such permutations.
Note that for every m ≤ 2C − 1, a(m)ij > 0. Therefore, Definition 14 yields that if p(m)ij = 0, then there is either i′ such that p(m)i′j = 1 or j′ such that p(m)ij′ = 1; each such iteration decreases by one the sum of the entries of row i and column j other than entry (i, j). However, this sum is at most 2(C − ℓ), since the sum of row i excluding a(0)ij is at most C − ℓ, and similarly for column j. This implies that 2(C − ℓ) ≥ (2C − 1) − ℓ + k, which is a contradiction since ℓ, k ≥ 1.
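The 2C − 1 bound can also be checked experimentally. The following Python sketch (our own illustration; the function names are ours) runs the decomposition procedure with arbitrarily chosen maximal matchings:

```python
import random

def maximal_matching(B):
    """Greedily build a maximal matching on the support of B (any order works)."""
    n = len(B)
    P = [[0] * n for _ in range(n)]
    used_row, used_col = set(), set()
    cells = [(i, j) for i in range(n) for j in range(n) if B[i][j] > 0]
    random.shuffle(cells)          # Theorem 24 holds for ANY maximal matching
    for i, j in cells:
        if i not in used_row and j not in used_col:
            P[i][j] = 1
            used_row.add(i)
            used_col.add(j)
    return P

def decompose(B):
    """Subtract maximal matchings until B = 0; return the iteration count."""
    n = len(B)
    B = [row[:] for row in B]
    iterations = 0
    while any(B[i][j] for i in range(n) for j in range(n)):
        P = maximal_matching(B)
        for i in range(n):
            for j in range(n):
                B[i][j] = max(B[i][j] - P[i][j], 0)
        iterations += 1
    return iterations

C = 6
A = [[3, 1, 2, 0], [0, 2, 2, 2], [1, 2, 2, 1], [2, 1, 0, 3]]   # C-bounded
assert decompose(A) <= 2 * C - 1
```

No matter how the shuffled order picks the maximal matchings, the procedure finishes within 2C − 1 iterations, matching the theorem.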
5.4.2 Mimicking an Ideal Shadow Switch with Speedup S ≈ 4
Our schedulers operate by constructing a demand matrix at each frame boundary, and then using
the result of decomposing this matrix for scheduling decisions in the next frame. The relative
queuing delay of the schedulers corresponds to the size of the frame, while the speedup of the
switch is determined by the ratio between the frame size and the number of permutations obtained
in the decomposition.
A key insight is that packet-mode shadow switches can be implemented by a push-in-first-
out (PIFO) cell-based OQ switch. In such OQ switches, arriving cells are placed in an arbitrary
location in their destination’s buffer, and the switch always outputs the cells at the head of its
buffers [37]. The PIFO policy is an extension of the first-in-first-out (FIFO) policy that can also
implement QoS-aware (Quality-of-Service-aware) algorithms, such as WFQ and strict priority.
In our case, it allows us to implement packet-mode shadow switches as follows: The first cell
of a packet p arriving at the switch is placed at the end of the relevant OQ switch buffer. Each
consecutive cell ci of packet p is placed immediately after cell ci−1; in each time-slot, the cell at
the head of the buffer departs from the switch. Since cells of the same packet are placed one after
the other in the buffer, they leave the OQ switch contiguously. In addition, if p ≺ p′ then the last
cell of packet p is placed in the buffer before the first cell of packet p′, implying that packet p is
served before packet p′.
Notice that, using the CCF algorithm, a cell-based CIOQ switch with speedup S = 2 can emulate a cell-based OQ switch with any PIFO discipline [37], and in particular the above-mentioned discipline. However, the CCF algorithm ensures only that packets depart contiguously from the switch and does not deliver the packets contiguously over the switch fabric (that is, from the input-ports to the output-ports). Yet, our next algorithms use this underlying CCF algorithm in order to construct the demand matrix of each frame.
Let tCCF (c) be the time-slot in which a cell c is forwarded over the switch fabric by this CCF
algorithm. Clearly, tCCF (c) ≤ tlS(c). We have the next lemma:
Lemma 12 If a scheduling algorithm ALG schedules the cell last(p) of every packet p by time-
slot tCCF (last(p)) + δ then the maximum relative queuing delay of ALG is at most δ + Lmax − 1,
where Lmax is the maximum packet size.
Proof. Consider a cell c, let k be its position in packet(c), and let ℓ be the size of packet(c). The contiguous packet delivery in the shadow switch dictates that tlS(c) = tlS(last(packet(c))) − (ℓ − k). Let tALG(c) be the time-slot in which ALG forwards cell c over the switch fabric. Since both ALG and CCF forward the cells of packet(c) in their order within the packet,
tALG(c) ≤ tALG(last(packet(c)))
        ≤ tCCF(last(packet(c))) + δ
        ≤ tlS(last(packet(c))) + δ
        = tlS(c) + δ + ℓ − k
        ≤ tlS(c) + δ + ℓ − 1
        ≤ tlS(c) + δ + Lmax − 1
This implies that every cell c is in the output-side of the switch by time-slot tlS(c)+δ+Lmax−1,
and therefore ALG can output cell c from the CIOQ switch at time-slot tlS(c) + δ + Lmax − 1.
Notice that ALG does not transmit two cells c, c′ at the same time-slot from the same output-port,
since tlS(c) + δ + Lmax − 1 = tlS(c′) + δ + Lmax − 1 implies that tlS(c) = tlS(c′), contradicting
the definition of the shadow switch.
We now explore the trade-off between the speedup S in which the CIOQ switch operates and
its relative queuing delay. We devise a frame-based scheduler in which the demand matrix in each
frame is built according to the times in which the underlying CCF algorithm forwards cells over
the switch fabric. In addition, packets that were not fully forwarded by the CCF algorithm until
the frame boundary are queued in the input-side of the switch until the next frame. Thus, the CCF
algorithm determines which packets should be delivered by a packet-mode CIOQ in each frame,
as captured by the next definition:
Definition 15 For every input-port i, output-port j, frame size τ and frame number k > 0, the set of eligible cells of frame k, denoted aij(τ, k), includes all cells c ∉ ∪_{k′<k} aij(τ, k′) such that all cells c′ ∈ packet(c) have tCCF(c′) ≤ kτ. By convention, aij(τ, 0) = ∅.
Notice that by definition, all the cells of a packet p are in the same set of eligible cells.
The next lemma bounds the number of cells, sharing an input-port or an output-port, that should
be scheduled within the same frame:
Lemma 13 For every input-port i, output-port j, frame size τ and frame number k > 0,

∑_{j′=1}^{N} |aij′(τ, k)| ≤ 2τ + N(Lmax − 1)   and   ∑_{i′=1}^{N} |ai′j(τ, k)| ≤ 2τ + N(Lmax − 1)
Proof. Note that the CCF algorithm works on a CIOQ switch with speedup 2. Thus, the number of cells c that share the same input-port (output-port) and have been forwarded by the CCF within frame k (namely, (k − 1)τ < tCCF(c) ≤ kτ) is at most 2τ.
Since in each virtual output-queue V OQi,j , all cells of the same packet p are stored one after
the other, there is no cell of a different packet that is forwarded by CCF between cells of packet
p. Therefore, only cells of one packet are in aij(τ, k) and were forwarded by CCF before time-slot
(k − 1)τ ; we next bound the number of such cells: Since the maximum packet size is Lmax and
the last cell of each packet was forwarded by the CCF after time-slot (k − 1)τ , at most Lmax − 1
such cells share the same input-port and the same output-port. Thus, the number of such cells that
share an input-port (output-port) is at most N(Lmax − 1). This implies that both ∑_{j′=1}^{N} |aij′(τ, k)| and ∑_{i′=1}^{N} |ai′j(τ, k)| are bounded by 2τ + N(Lmax − 1).
Lemma 13 and Theorem 23 imply that the eligible cells of each frame can be scheduled within
2τ + N(Lmax − 1) scheduling opportunities. Unfortunately, the decomposition described in The-
orem 23 does not ensure that the packet-mode scheduling constraints are satisfied and therefore
cannot be used directly. For example, consider the matrix
A = [aij] =
    [ 3 1 2 0 ]
    [ 0 2 2 2 ]
    [ 1 2 2 1 ]
    [ 2 1 0 3 ],
in which, for example, element a1,1 represents a single packet of size 3, and elements a2,2, a2,3, a2,4 represent packets of size 2. The following decomposition of A into six permutations
A =
    [ 1 0 0 0 ]   [ 1 0 0 0 ]   [ 1 0 0 0 ]
    [ 0 1 0 0 ] + [ 0 0 1 0 ] + [ 0 0 0 1 ]
    [ 0 0 1 0 ]   [ 0 1 0 0 ]   [ 0 0 1 0 ]
    [ 0 0 0 1 ]   [ 0 0 0 1 ]   [ 0 1 0 0 ]

      [ 0 0 1 0 ]   [ 0 1 0 0 ]   [ 0 0 1 0 ]
    + [ 0 1 0 0 ] + [ 0 0 1 0 ] + [ 0 0 0 1 ]
      [ 1 0 0 0 ]   [ 0 0 0 1 ]   [ 0 1 0 0 ]
      [ 0 0 0 1 ]   [ 1 0 0 0 ]   [ 1 0 0 0 ]
violates the packet-mode constraints: Contiguous transmission of packet a1,1 requires that the first
three permutations are scheduled contiguously. On the other hand, each permutation i ∈ {1, 2, 3} must also be adjacent to permutation i + 3 in order to ensure contiguous transmission of packet a2,i+1. These requirements cannot be satisfied simultaneously, since they imply that at least one permutation must be adjacent to three permutations.
To circumvent this problem, we use Theorem 24 and introduce a different decomposition algorithm, which guarantees contiguous packet delivery but requires twice as many scheduling opportunities: At each frame boundary, the algorithm counts the number of cells in each set aij(τ, k)
and constructs a matrix B(k) = [bij] accordingly (namely, bij = |aij(τ, k)|). Then, the algorithm
repeatedly builds maximal matchings for matrix B(k) and keeps contiguous packet delivery in the
following manner: If a cell from input-port i to output-port j is forwarded in some iteration of the
algorithm, and there are more cells from i to j that were not forwarded yet, then the algorithm
keeps the matching between i and j for the next iteration. (This procedure is sometimes called
exhaustive service matching [96].)
Since the algorithm uses only maximal matchings, Theorem 24 yields that the algorithm needs twice as many iterations as the Birkhoff-von Neumann decomposition in order to decompose matrix B(k). In particular, for every frame size τ, the algorithm needs at most 4τ + 2N(Lmax − 1)
iterations to complete. This implies that it can mimic an ideal switch with a speedup arbitrarily
close to 4, while attaining a relative queuing delay of O(NLmax).
Theorem 25 A packet-mode CIOQ switch with speedup S = 4 + (2N(Lmax − 1) − 1)/τ can mimic an OQ switch with a relative queuing delay of 2τ + Lmax − 2 time-slots.
Proof. Fix a frame size τ and let B(k) = [bij] be the N × N matrix such that bij = |aij(τ, k)|. Lemma 13 implies that the sum of each row and each column of B(k) is at most 2τ + N(Lmax − 1).
Algorithm 4 works by repeatedly constructing maximal matchings P for matrix B(k). If a
cell in the set aij(τ, k) is forwarded in some iteration of the algorithm, and there are more cells
in aij(τ, k) to be forwarded, the algorithm keeps the matching between input-port i and output-
port j for the next iteration. Therefore, cells of a specific set are forwarded contiguously. Hence,
Definition 15 implies that Algorithm 4 forwards all the cells corresponding to a specific packet
contiguously: this clearly satisfies the packet-mode scheduling constraints.
All matchings used by Algorithm 4 are maximal and the sum of each column and each row in
B(k) is at most 2τ + N(Lmax − 1). Theorem 24 implies that Algorithm 4 needs at most
2 · (2τ + N(Lmax − 1))− 1 = 4τ + 2N(Lmax − 1)− 1
iterations to complete. Thus, with speedup 4 + (2N(Lmax − 1) − 1)/τ, the algorithm schedules all cells corresponding to B(k) within the next frame, that is, by time-slot (k + 1)τ.
Consider the last cell last(p) of some packet p. Definition 15 implies that if last(p) ∈ aij(τ, k)
then tCCF (last(p)) > (k − 1)τ . Since Algorithm 4 schedules last(p) by time-slot (k + 1)τ , it
follows that the relative queuing delay of last(p) is at most 2τ − 1. By Lemma 12, the relative
queuing delay is at most 2τ + Lmax − 2.
Notice that for switch speedup S > 4, the relative queuing delay induced by this algorithm is 2(2N(Lmax − 1) − 1)/(S − 4) + Lmax − 2 time-slots.
Algorithm 4 Coarse-Grained Maximal Matchings
Local Variables:
    B: matrix of values in ℕ, initially B = B(k)
    P: matrix of values in {0, 1}, initially 0

procedure SCHEDULE(matrix B)
    while B ≠ 0 do
        for all P[i][j] do
            if P[i][j] = 1 and B[i][j] = 0 then
                P[i][j] := 0
            end if
        end for
        P := MAX-MATCH(B, P)        ▷ returns a maximal matching of B that dominates P
        for all P[i][j] do
            if P[i][j] = 1 and B[i][j] > 0 then
                forward a cell from input i to output j
            end if
        end for
        B := B − P
        for all B[i][j] do          ▷ avoid negative values in B
            B[i][j] := max{B[i][j], 0}
        end for
    end while
end procedure

matrix procedure MAX-MATCH(matrix B, matrix P)
    while there are i, j such that B[i][j] ≥ 1 and ∑_{j′=1}^{N} P[i][j′] = 0 and ∑_{i′=1}^{N} P[i′][j] = 0 do
        P[i][j] := 1
    end while
    return P
end procedure
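A runnable Python rendering of Algorithm 4 may clarify the exhaustive-service behavior (this is our own sketch; the data structures and names are ours): a matched input-output pair stays matched across iterations as long as it has pending demand, so all cells of one matrix entry, and hence of one packet, are forwarded contiguously.

```python
def schedule(B):
    """Sketch of Algorithm 4 (exhaustive-service maximal matchings).
    B[i][j] = number of eligible cells from input i to output j.
    Returns, for each iteration, the (i, j) pairs that forward one cell."""
    n = len(B)
    B = [row[:] for row in B]
    P = [[0] * n for _ in range(n)]            # sticky matching, initially empty
    rounds = []
    while any(B[i][j] for i in range(n) for j in range(n)):
        # Drop matched pairs whose demand is exhausted.
        for i in range(n):
            for j in range(n):
                if P[i][j] == 1 and B[i][j] == 0:
                    P[i][j] = 0
        # MAX-MATCH: extend P to a maximal matching of B's support.
        used_row = {i for i in range(n) if any(P[i])}
        used_col = {j for j in range(n) if any(P[i][j] for i in range(n))}
        for i in range(n):
            for j in range(n):
                if B[i][j] >= 1 and i not in used_row and j not in used_col:
                    P[i][j] = 1
                    used_row.add(i)
                    used_col.add(j)
        # Forward one cell along every matched pair with pending demand.
        forwarded = [(i, j) for i in range(n) for j in range(n)
                     if P[i][j] == 1 and B[i][j] > 0]
        for i, j in forwarded:
            B[i][j] -= 1
        rounds.append(forwarded)
    return rounds

rounds = schedule([[2, 0], [0, 3]])
print(rounds)   # -> [[(0, 0), (1, 1)], [(0, 0), (1, 1)], [(1, 1)]]
```

In this toy run, pair (1, 1) is served in three consecutive iterations with no interruption, illustrating the contiguity argument in the proof of Theorem 25; Theorem 24 guarantees the loop finishes within 2C − 1 iterations for a C-bounded demand matrix.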
5.4.3 Mimicking an Ideal Shadow Switch with Speedup S ≈ 2
Notice that for each frame, the scheduler described in Theorem 25 schedules all eligible cells with
the same origin and the same destination contiguously, implying that in fact it considers them as a
single packet. Using a more fine-grained scheduler and the Birkhoff-von Neumann decomposition, we now show that a smaller speedup, arbitrarily close to 2, suffices, albeit with a larger relative queuing delay.
This is done in the context of the common situation where packet sizes are restricted to be from
the set L (cf. [112, 128]). Notice that this case generalizes the unrestricted packet size case, where
L = {1, . . . , Lmax}. Let lcm(L) be the least common multiple of all elements in L.
Theorem 26 A packet-mode CIOQ switch with speedup S = 2 + N(Lmax · lcm(L) − 1)/τ can mimic an
ideal shadow switch with a relative queuing delay of 2τ + Lmax − 2 time-slots.
Proof. Fix a frame size τ. For every packet size ℓ ∈ L, let aij(τ, ℓ, k) ⊆ aij(τ, k) be the set of
eligible cells (recall Definition 15) that correspond to packets of size ℓ. Let B(ℓ, k) = [b(ℓ, k)ij]
be the matrix with values b(ℓ, k)ij = |aij(τ, ℓ, k)|/ℓ, that is, the number of eligible packets of size ℓ in
frame k.
For every packet size ℓ, the algorithm first tries to concatenate lcm(L)/ℓ packets one after the
other in order to get one mega-packet of size lcm(L); each such mega-packet consists of packets
of the same size. The matrix B(lcm(L), k) = [b(lcm(L), k)ij] counts the number of such mega-packets:

b(lcm(L), k)ij = Σℓ∈L ⌊ℓ · b(ℓ, k)ij / lcm(L)⌋
We first bound the sum of each row and each column of the matrix B(lcm(L), k). Consider
some row i of the matrix (the proof for a column j follows analogously):

Σj=1..N b(lcm(L), k)ij = Σj=1..N Σℓ∈L ⌊ℓ · b(ℓ, k)ij / lcm(L)⌋
  ≤ Σj=1..N Σℓ∈L ℓ · b(ℓ, k)ij / lcm(L)
  = (1/lcm(L)) Σj=1..N Σℓ∈L |aij(τ, ℓ, k)|
  = (1/lcm(L)) Σj=1..N |aij(τ, k)|
  ≤ (1/lcm(L)) · (2τ + N(Lmax − 1))   (by Lemma 13)
By Theorem 23, the matrix B(lcm(L), k) can be decomposed into (2τ + N(Lmax − 1))/lcm(L) permutations.
Denote by P(lcm(L), k) the set of these permutations.
We now turn to deal with leftover packets. Let the matrices B′(ℓ, k) = [b′(ℓ, k)ij] count the
number of packets of size ℓ that are not concatenated into mega-packets. Note that b′(ℓ, k)ij ≤
(lcm(L) − 1)/ℓ, since it is the remainder of dividing b(ℓ, k)ij by lcm(L)/ℓ. This implies that
the sum of each row and each column of matrix B′(ℓ, k) is bounded by N(lcm(L) − 1)/ℓ. By
Theorem 23, the matrix B′(ℓ, k) can be decomposed into N · (lcm(L) − 1)/ℓ permutations. Let P(ℓ, k) be the
set of the permutations used to decompose the matrix B′(ℓ, k).
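The mega-packet and leftover counting used in this proof can be sketched for a single input-output pair; the function `mega_and_leftover_counts` and its dictionary encoding of b(ℓ, k)ij are our own illustration, not notation from the thesis:

```python
from math import gcd
from functools import reduce

def lcm_all(sizes):
    """Least common multiple of all packet sizes in L."""
    return reduce(lambda a, b: a * b // gcd(a, b), sizes)

def mega_and_leftover_counts(eligible, sizes):
    """Split per-size packet counts into mega-packets and leftovers.

    eligible[l] is the number of eligible packets of size l for one (i, j)
    pair, i.e. b(l, k)_ij.  Returns the number of mega-packets of size
    lcm(L) and the leftover count per size, mirroring the matrices
    B(lcm(L), k) and B'(l, k)."""
    m = lcm_all(sizes)
    # b(lcm(L), k)_ij = sum over l of floor(l * b(l, k)_ij / lcm(L))
    mega = sum((l * eligible.get(l, 0)) // m for l in sizes)
    # b'(l, k)_ij = b(l, k)_ij mod (lcm(L) / l)
    leftover = {l: eligible.get(l, 0) % (m // l) for l in sizes}
    return mega, leftover
```

For instance, with L = {2, 3} (so lcm(L) = 6), seven size-2 packets and five size-3 packets yield ⌊14/6⌋ + ⌊15/6⌋ = 4 mega-packets, with one leftover packet of each size.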
After obtaining these sets of permutations, the algorithm forwards contiguously all the mega-
packets by holding each permutation P ∈ P(lcm(L), k) for lcm(L) consecutive iterations. Then
for every ℓ ∈ L, the algorithm holds each permutation P ∈ P(ℓ, k) for ℓ consecutive iterations.
Clearly, all the cells of a specific packet are forwarded contiguously, and the algorithm satisfies the
packet-mode scheduling constraints.
The number of iterations needed for the algorithm to complete is bounded by:

lcm(L) · (2τ + N(Lmax − 1))/lcm(L) + Σℓ∈L ℓ · N · (lcm(L) − 1)/ℓ
  ≤ 2τ + N(Lmax − 1) + N · Lmax · (lcm(L) − 1)
Thus, with a speedup of 2 + N · (Lmax · lcm(L) − 1)/τ, the algorithm schedules all cells corresponding to frame
k within the next frame. This implies that for each packet p the maximum relative queuing delay
of cell last(p) is less than two frame sizes, namely at most 2τ − 1 time-slots. Hence, Lemma 12
implies that the maximum relative queuing delay is at most 2τ + Lmax − 2.
Note that for switch speedup S > 2, the relative queuing delay induced by this algorithm is
N · (Lmax · lcm(L) − 1)/(S − 2) + Lmax − 2 time-slots.
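For concrete parameter values, the speedup bounds of Theorems 25 and 26 are easy to tabulate; the helper functions below are our own, added for illustration only:

```python
from math import gcd
from functools import reduce

def speedup_coarse(N, Lmax, tau):
    """Speedup bound of Theorem 25 (Algorithm 4): 4 + (2N(Lmax-1) - 1)/tau."""
    return 4 + (2 * N * (Lmax - 1) - 1) / tau

def speedup_fine(N, Lmax, L, tau):
    """Speedup bound of Theorem 26: 2 + N(Lmax * lcm(L) - 1)/tau."""
    lcm_L = reduce(lambda a, b: a * b // gcd(a, b), L)
    return 2 + N * (Lmax * lcm_L - 1) / tau
```

The comparison makes the trade-off visible: `speedup_coarse` approaches 4 once τ exceeds roughly 2N·Lmax, whereas `speedup_fine` approaches 2 only when τ is much larger than N·Lmax·lcm(L), i.e., at the cost of a much larger relative queuing delay.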
Furthermore, it is important to notice that even though the algorithm described in Theorem 26
employs a more sophisticated decomposition than Algorithm 4, both algorithms have the same overall
time-complexity: it is determined solely by the complexity of the underlying CCF algorithm,
which is invoked twice every time-slot (while the decomposition is done only once
every τ time-slots).
5.5 Mimicking an Ideal Shadow Switch with Bounded Buffers
In many practical applications, CIOQ switches are required to emulate shadow switches with
bounded buffer size. We show that a smaller speedup suffices for mimicking an ideal shadow
switch with output buffer B. Intuitively, the reason for this better performance is that an ideal
shadow switch with bounded buffers cannot handle all incoming traffic types without dropping
cells. Therefore, by using the extra information about the legal incoming traffic types, the CIOQ
switch can optimize its scheduling decisions, resulting in a simpler and more efficient scheduling
algorithm.
Unlike the previous algorithms, algorithms for bounded mimicking do not rely on the CCF
algorithm, and use the following definition and lemma, which are adapted from Definition 15 and
Lemma 13:
Definition 16 For every input-port i, output-port j, frame size τ and frame number k > 0, the set
of eligible cells of frame k, denoted aij(τ, k), is the set of cells c that are delivered successfully
by the ideal switch, c ∉ ⋃k′<k aij(τ, k′), and all cells c′ ∈ packet(c) arrive at the switch before
time-slot kτ . By convention, aij(τ, 0) = ∅.
As in Definition 15, all the cells of each packet p are in the same set of eligible cells. The next
lemma bounds the size of these sets.
Lemma 14 For every input-port i, output-port j, frame size τ and frame number k > 0,

Σj′=1..N |aij′(τ, k)| ≤ τ + B + N(Lmax − 1)  and  Σi′=1..N |ai′j(τ, k)| ≤ τ + B + N(Lmax − 1).
Proof. Clearly, at most τ cells arrive at each input-port between time-slot (k − 1)τ and kτ . We
next show that at most τ + B cells arrive between time-slot (k − 1)τ and kτ , destined for a single
output-port j, and are successfully delivered by the shadow switch.
Assume, by way of contradiction, that ℓ1 > τ + B cells destined for output-port j arrive at
the switch within frame k and are not dropped by the shadow switch. Let ℓ2 ≥ 0 be the number
of cells stored in the buffer of output-port j in time-slot (k − 1)τ. By the definition of a switch,
at most τ cells are delivered from output-port j between time-slots (k − 1)τ and kτ, hence the
number of cells that are stored in the buffer by the end of frame k is at least ℓ1 + ℓ2 − τ > B cells,
contradicting the fact that the buffer size is B.
Since all cells of the same packet p arrive at the switch contiguously, only cells of one packet
are in aij(τ, k) and arrived at the switch before time-slot (k− 1)τ . Since the maximum packet size
is Lmax and the last cell of each packet arrives after time-slot (k − 1)τ , the number of such cells
that share the same input-port and the same output-port is bounded by Lmax− 1. Thus, the number
of such cells that share the same input-port (output-port) is bounded by N(Lmax− 1), and the sum
is therefore bounded by τ + B + N(Lmax − 1).
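The counting argument of Lemma 14 can be sanity-checked by simulating a single output buffer; the simulation below is our own sketch (a drop-tail buffer of size B with one departure per time-slot, arrivals supplied by a caller-provided function) and is not part of the proof:

```python
import random

def max_accepted_per_frame(arrivals, B, tau, nframes):
    """Simulate a drop-tail output buffer of size B that serves one cell per
    time-slot, and return the maximum number of cells that arrive within one
    frame of tau slots and are NOT dropped.  Lemma 14's argument bounds this
    by tau + B."""
    buf = 0
    accepted = [0] * nframes
    for t in range(nframes * tau):
        if buf > 0:
            buf -= 1                  # one cell departs per time-slot
        for _ in range(arrivals(t)):
            if buf < B:               # drop-tail: accept only if room
                buf += 1
                accepted[t // tau] += 1
    return max(accepted)

# Random bursty arrivals: the per-frame accepted count never exceeds tau + B.
random.seed(0)
worst = max_accepted_per_frame(lambda t: random.randint(0, 3), B=8, tau=50,
                               nframes=40)
assert worst <= 50 + 8
```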
In order to mimic an ideal switch, the CIOQ switch drops all cells that are dropped by the
shadow switch. By employing Lemma 14 in the proofs of Theorems 25 and 26 respectively, we
get the following results:
Corollary 27 A packet-mode CIOQ switch with speedup S = 2 + (2B + 2N(Lmax − 1) − 1)/τ can mimic an
ideal shadow switch with buffer size B with a relative queuing delay of 2τ + Lmax − 2 time-slots.
Corollary 28 A packet-mode CIOQ switch with speedup S = 1 + (B + N · (Lmax · lcm(L) − 1))/τ can mimic
an ideal shadow switch with buffer size B with a relative queuing delay of 2τ + Lmax − 2 time-slots.
5.6 Simulation Results
Analytically, the algorithm described in Section 5.4.3 has a prohibitive relative queuing delay of
O(N · Lmax · lcm(L)) time-slots and therefore has only theoretical importance.2 Conversely, Algorithm 4
requires a speedup S ≈ 4 in order to mimic an ideal switch with a reasonable relative
queuing delay of O(N · Lmax) time-slots.
2Even when lcm(L) = Lmax, the relative queuing delay is O(N · Lmax²).
In this section, we show that in practice, Algorithm 4 out-performs its analytical worst-case
bounds (from Theorem 25), implying that even with a modest speedup, it achieves small relative
queuing delay. The results are obtained by conducting extensive simulation experiments under
various synthetic and trace-driven traffic patterns.
Basically, in order to demonstrate the tradeoff between the speedup and the relative queuing
delay, we conduct the following simulation: Given the incoming traffic and a fixed frame size,
we measure the loss ratio of packets under various speedup values. Then, we present the speedup
required to achieve less than 0.1% packet drop. Note that, unlike our theoretical upper bounds,
we allow a small amount of cell drops; this clearly represents real-life situations in which switches
are allowed to drop cells in extreme situations. Furthermore, this metric is especially important
because the delay occurring in ideal switches (e.g., in OQ switches) is very well-studied, and from the
relative queuing delay one can easily derive absolute bounds on the cell or packet delay of a packet-mode
CIOQ switch.
We first consider several stochastic traffic patterns, which are generally modeled as ON-OFF
processes: The ON period length is chosen according to a specific packet size distribution (that
is, each ON period models an arrival of a single packet), while the OFF period is distributed
geometrically with some probability p; the parameter p is chosen so that a certain load is achieved.
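A single source of such an ON-OFF process can be sketched as follows; the generator and its parameter choices are our own illustration of the model, not code from the thesis:

```python
import random

def on_off_source(packet_size, p, nslots, rng=random):
    """One input-port modeled as an ON-OFF process: each ON period is a
    single packet of packet_size() back-to-back cells, each OFF period is
    geometrically distributed with parameter p.  Returns one 0/1 arrival
    indicator per time-slot."""
    slots = []
    while len(slots) < nslots:
        slots.extend([1] * packet_size())              # ON: one whole packet
        while len(slots) < nslots and rng.random() > p:
            slots.append(0)                            # OFF: geometric gap
    return slots[:nslots]

# Uniform traffic of this section: packet sizes uniform in [1, 192].
random.seed(1)
trace = on_off_source(lambda: random.randint(1, 192), p=0.5, nslots=10000)
```

The load is E[ON] / (E[ON] + E[OFF]); in the experiments p is tuned so that this ratio matches the target load.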
Specifically, we study the following three stochastic traffic patterns. These patterns were also
used by Marsan et al. [101] in order to investigate the performance of a packet-mode Input-Queued
switch (with no speedup). It is important to notice that our results are even stronger than they would
be under typical real-life traffic, since some of the traffic patterns are chosen specifically to stress
starvation and unfairness due to the contiguous forwarding of large packets [101]:
1. Uniform traffic: In this traffic pattern, packet sizes are chosen uniformly at random in the
range [1, 192]. For each packet, its destination is chosen uniformly at random among all
output-ports. This uniform traffic setting is considered due to its frequent use in simulations
and stochastic analysis of switch performance (see Chapter 1.2). Note that the maximum
packet size is the Maximum Transmission Unit (MTU) of IP over ATM, measured in ATM
cells [10, 101].
2. Spotted traffic: Packet sizes are 100 cells with probability 0.5 and 3 cells with probability
0.5; packet destination is chosen according to the following 8 × 8 matrix; each input-port i
chooses a destination uniformly at random among all destinations with entry 1 in row i.
1 1 1 0 1 0 1 0
0 1 0 1 1 1 0 1
1 0 1 0 1 1 1 0
1 1 0 1 0 1 0 1
1 0 1 0 1 0 1 1
0 1 0 1 0 1 1 1
1 0 1 1 1 0 1 0
0 1 1 1 0 1 0 1
Since the sum of each row and each column of the matrix is five, each
input-port sends packets to 5 output-ports, and each output-port receives packets
from 5 input-ports. Notice that this specific traffic matrix aims to highlight starvation and
loss of throughput due to the contiguous forwarding of large packets [101].
3. Diagonal traffic: Packet destinations are chosen uniformly at random. For every cell c, if
orig(c) = dest(c) then the packet size is 100; otherwise, the packet size is 1. In this traffic
pattern, the flows on the diagonal of the switching matrix consist only of long packets,
while the flows that are not on the diagonal of the switching matrix consist only of short
packets. Like the spotted traffic setting, this traffic pattern stresses the effects of contiguously
delivering packets of variable sizes [101].
The length of all the simulations is 100,000 time-slots, and they were performed on a 16 × 16
switch (except the spotted traffic simulations, which were performed on an 8 × 8 switch, as in the
setting described in [101]).
For each traffic pattern (stochastic or trace-driven), we fix a certain speedup S and a frame
size τ . For each frame k, Algorithm 4 constructs a demand matrix B(k) and then decomposes the
demand matrix B(k− 1) of the previous frame into a sequence of scheduling decisions. Under the
fixed speedup S, Algorithm 4 schedules at most S · τ of these scheduling decisions, and drops all
packets with cells in the remaining scheduling decisions. We measure the loss ratio (in terms of
Figure 14: Simulation results for a 16 × 16 switch, operating under uniform traffic pattern and different input loads (0.25, 0.5, 0.75 and 1.0). The results show the frame size needed to achieve less than 0.1% packet drop ratio under a specific speedup.
packets) under all values of S and τ .
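The measurement loop just described can be sketched as follows; this is our own simplified illustration (it counts drops at cell rather than packet granularity, and `decompose` stands in for the frame decomposition performed by Algorithm 4):

```python
def frame_loss_ratio(frames, S, tau, decompose):
    """Frame-based loss measurement: for each demand matrix B(k), run its
    decomposition, execute at most S*tau scheduling decisions, and count the
    cells in the remaining decisions as dropped."""
    budget = int(S * tau)
    sent = dropped = 0
    for B in frames:
        for n, step in enumerate(decompose(B)):
            if n < budget:
                sent += len(step)     # decisions within the speedup budget
            else:
                dropped += len(step)  # decisions beyond S*tau are dropped
    total = sent + dropped
    return dropped / total if total else 0.0
```

Sweeping S and τ over this function and reporting the smallest S with loss below 0.1% reproduces the curves of Figures 14-16.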
Figures 14, 15 and 16 present the speedup required to achieve less than 0.1% packet drop ratio
under different loads and different stochastic traffic patterns. As expected, the results demonstrate
the speedup-delay tradeoff: a smaller speedup suffices for Algorithm 4 when a larger relative queuing
delay (i.e., a larger frame size) is allowed. Moreover, the results show that as the load of the traffic
increases, the speedup required by Algorithm 4 also increases.
Interestingly, these results show that, even in extreme situations, a speedup of less than 2 suf-
fices to achieve ideal switch mimicking with frame size of only 8Lmax time-slots. This can be ex-
plained by carefully investigating the reasons behind the upper bound of Theorem 25: A speedup
S ≥ 4 is required due to frames at which the underlying CCF algorithm forwards 2τ cells from the
Figure 15: Simulation results for an 8 × 8 switch, operating under spotted traffic pattern and different input loads (0.25, 0.5 and 0.75). The results show the frame size needed to achieve less than 0.1% packet drop ratio under a specific speedup.
same input-port or to the same output-port; moreover, the additional factor of 2 is caused by a poor
selection of maximal matchings resulting in an inefficient contiguous decomposition as captured
by Theorem 24. Under non-adversarial traffics, these two situations rarely occur in practice, especially
not simultaneously. A relative queuing delay of (2N(Lmax − 1) − 1)/(S − 4) + Lmax − 2 time-slots occurs
in an even more extreme situation: when there is a frame k and an input-port i (output-port j) such
that for every flow (i, j) there is a packet p whose first cell is sent by the underlying CCF algorithm
before time-slot (k − 1)τ and whose last cell is sent by the CCF between time-slots (k − 1)τ and kτ.
Clearly, this situation hardly ever happens.
We also conducted trace-driven simulation using trace data of TCP traffic over OC-48 links;
this trace data was taken from CAIDA [43]. We investigate the performance of Algorithm 4 under
Figure 16: Simulation results for a 16 × 16 switch, operating under diagonal traffic pattern and different input loads (0.25, 0.5, 0.75 and 1.0). The results show the frame size needed to achieve less than 0.1% packet drop ratio under a specific speedup.
this real traffic and show that, also in this non-synthetic case, it performs better than its theoretical
upper bounds. To the best of our knowledge, these are the first trace-driven simulations of packet-mode
CIOQ switches.
Figure 17 presents the performance of Algorithm 4 in the trace-driven experiments. We con-
ducted these experiments in granularity of 30 bytes (that is, the cell unit size is 30 bytes) yielding
a maximum packet size, Lmax, of 50 cells (i.e., 1500 bytes). Furthermore, we compressed the
traffic so that each input-port is fully utilized (that is, 100% load). Compressing the traces to 100% load
intuitively represents the worst-case traffic that should be handled by the switch; this intuition is
further confirmed by our previous experiments which show that as the traffic load increases the
required speedup also increases. As in the previous synthetic traffic patterns, these trace-driven
Figure 17: Trace-driven simulation results for a 16 × 16 switch, operating under 1.0 input load. The results show the frame size needed to achieve less than 0.1% packet drop ratio under a specific speedup.
simulations also show that Algorithm 4 performs better than its theoretical bounds.
Finally, we compare the performance of Algorithm 4 to two simple greedy algorithms.
The cut-through greedy algorithm gets a certain relative queuing delay Rmax as a parameter,
and ensures that each packet either attains relative queuing delay less than Rmax or is dropped.
Specifically, the algorithm chooses randomly a maximal matching over all packets arriving at the
input side of the switch (even if the packet has not fully-arrived at the switch) and, similarly to
Algorithm 4, keeps an input-output pair matched until the corresponding packet is fully transmitted.
Before a packet is selected for transmission, its relative queuing delay is compared to Rmax,
and the packet is dropped if it is above the threshold. Our simulations show that the cut-through
greedy algorithm never achieves a 0.1% packet drop ratio, no matter what the speedup of the switch
Figure 18: Simulation results of the Store&Forward Greedy Algorithm for a 16 × 16 switch, operating under 1.0 input load and trace-driven traffic. The results show the speedup needed to achieve less than 0.1% packet drop ratio under a specific relative queuing delay threshold.
is and which relative queuing delay threshold Rmax is chosen. These results coincide with the
lower bound described in Theorem 21, implying that a cut-through algorithm cannot mimic an
ideal switch.
The store&forward greedy algorithm operates exactly as the cut-through greedy algorithm but
schedules only fully-arrived packets. Although this algorithm potentially introduces additional
relative queuing delay of Lmax, our trace-driven simulations, described in Figure 18, show that
this algorithm converges very fast to achieve less than 0.1% packet drop ratio, and in fact it out-
performs Algorithm 4.
It is important to notice that the actions of the store&forward greedy algorithm and Algo-
rithm 4 are very similar; the main difference is that the store&forward greedy algorithm does not
operate in a frame-based manner, and therefore it introduces a smaller relative queuing delay. Yet,
the store&forward greedy algorithm has no known analytical worst-case upper bounds.
Chapter 6
Jitter Regulation for Multiple Streams
The notion of delay jitter (or Cell Delay Variation [11]), defined as the difference between the
maximal and minimal end-to-end delays of different cells, captures the smoothness of a traffic. The
need for efficient mechanisms to provide such smooth and continuous traffic is mostly motivated by
the increasing popularity of interactive communication and in particular video/audio streaming.1
Controlling traffic distortions within the network, and in particular jitter control, has the effect
of moderating the traffic throughout the network [147]. This is important when a service provider
in a QoS network must meet service level agreements (SLAs) with its customers. In such cases,
moderating high congestion states in switches along the network results in the provider’s ability to
satisfy the guarantees to more customers [134].
Jitter control mechanisms have been extensively studied in recent years (see Section 2.4). These
are usually modelled as jitter regulators that use internal buffers in order to shape the traffic, so
that cells leave the regulator in the most periodic manner possible. Upon arrival, cells are stored
in the buffer until their planned release time, or until a buffer overflow occurs. This indicates a
tradeoff between the buffer size and the best attainable jitter, i.e., as buffer space increases, one can
expect to obtain a lower jitter.
In this chapter, we investigate the problem of finding an optimal jitter release schedule, given
1For example, 6.98 billion video streams were initiated by U.S. users during August 2006, while the U.S. streaming audience increased by 4 percent from July 2006 to reach 110.3 million streamers in August 2006, representing about 64 percent of the total U.S. Internet audience [42].
a predetermined buffer size. This problem was first raised by Mansour and Patt-Shamir [100],
who considered only a single-stream setting. In practice, however, jitter regulators handle multiple
streams simultaneously and must provide low jitter for each stream separately and independently.
In the multi-stream model, the traffic arriving at the regulator is an interleaving of M streams
originating from M independent abstract sources (see Figure 19). Each abstract source i sends a
stream of fixed-size cells in a fully periodic manner, with rate Ri, which arrive at a jitter regulator
after traversing the network. Variable end-to-end delays caused by transient congestion throughout
the network may result in such a stream arriving at the regulator in a non-periodic fashion. The
regulator knows the value of Ri, and strives to release consecutive cells 1/Ri time units apart, thus
re-shaping the traffic into its original form. Moreover, the order in which cells are released by
each abstract source is assumed to be respected throughout the network. This implies that the cells
from the same stream arrive at the regulator in order (but not necessarily equally spaced), and the
regulator should also maintain this order. We refer to this property as the FIFO constraint.
Note that the FIFO constraint should be respected in each stream independently, but not neces-
sarily on all incoming traffic. This implies that in the multi-stream model, the order in which cells
are released is not known a priori. This lack of knowledge is an inherent difference from the case
where there is only one abstract source, and it poses a major difficulty in devising algorithms for
multi-stream jitter regulation (as we describe in detail in Section 6.4).
6.1 Our Results
We present algorithms and tight lower bounds for jitter regulation in this multiple streams environ-
ment, both in offline and online settings. This answers a primary open question posed in [100].
We evaluate the performance of a regulator in the multi-stream model by considering the max-
imum jitter obtained on any stream. We show that, somewhat surprisingly, the offline problem can
be solved in polynomial time. This is done by characterizing a collection of optimal schedules,
and showing that their properties can be used to devise an offline algorithm that efficiently finds a
release schedule that attains the optimal jitter.
We use a competitive analysis approach in order to examine the online problem. In this setting,
Figure 19: The multi-stream jitter regulation model.
by sizing up the buffer to a size of 2MB and statically partitioning the buffer equally among the M
streams, applying the algorithm described in [100, Algorithm B] on each stream separately yields
an algorithm that obtains the optimal max-jitter possible with a buffer of size B. We show that
such a resource augmentation cannot be avoided, by proving that any online algorithm needs a
buffer of size at least MB in order to obtain a jitter within a bounded factor from the optimal jitter
possible with a buffer of size B. We further show that these results also apply when the objective
is to minimize the average jitter attained by the M streams. These results indicate that online jitter
regulation does not scale well as the number of streams increases unless the buffer is sized up
proportionally.
6.2 Model Description, Notation, and Terminology
We adapt the following definitions from [100]:
Definition 17 Given a traffic T = {ci | 0 ≤ i ≤ n} such that cell ci arrives at time ta(ci), we
define the following:
1. A release schedule s for traffic T defines the release time of cells in T . Specifically, for each
cell ci, tls(ci, T ) denotes the time at which cell ci ∈ T is released from the regulator under
schedule s. Note that for every ci ∈ T, ta(ci) ≤ tls(ci, T ).
2. A release schedule s for T is B-feasible if at any time t,

|{ci ∈ T | ta(ci) ≤ t < tls(ci, T)}| ≤ B.
That is, there are never more than B cells in the buffer simultaneously.
3. The delay jitter of T under a release schedule s is

J(s, T) = max0≤i,k≤n (tls(ci, T) − tls(ck, T) − (i − k)X)
where X = 1/R is the inter-release time of T (i.e., X is the difference between the release
times of any two consecutive cells from the abstract source).
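Definition 17(3) translates directly into code. Since the maximum over all pairs is attained by maximizing tls(ci, T) − iX and minimizing tls(ck, T) − kX, the jitter equals the spread of the per-cell offsets; the following sketch (function name ours) computes it in linear time:

```python
def delay_jitter(release_times, X):
    """Delay jitter (Definition 17): the maximum over all cell pairs (i, k)
    of t_i - t_k - (i - k) * X, where t_i is the release time of cell c_i.

    Equivalently, the spread of the offsets t_i - i*X, since a perfectly
    periodic schedule keeps this offset constant."""
    offsets = [t - i * X for i, t in enumerate(release_times)]
    return max(offsets) - min(offsets)

# A perfectly periodic schedule has zero jitter; one cell released 2 time
# units late raises the jitter to 2.
assert delay_jitter([0, 10, 20, 30], 10) == 0
assert delay_jitter([0, 12, 20], 10) == 2
```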
It is important to notice that since the abstract source generates perfectly periodic traffic, this
definition of delay jitter coincides with the notion of Cell Delay Variation. This definition also
coincides with Definition 2, if JS(T ) = 0 and each flow (i, j) corresponds to a different stream
Ti,j .
Note also that unlike previous chapters, we do not assume that time is slotted, and therefore the
time in which cells arrive and leave the regulator can be any non-negative real number in R+.
We first extend Definition 17 to a traffic T that is an interleaving of M traffics T1, . . . , TM .
We call each traffic Ti ⊆ T a stream and denote by XTi the inter-release time of stream Ti. We
assume for simplicity that all streams have the same inter-release time X; all our results extend
immediately to the case where this does not hold.
Let cj denote the j’th cell (in order of arrival) of the interleaving of the streams T , and let cij
denote the j’th cell of the single stream Ti. A release schedule should obey a per-stream FIFO
discipline, in which cells of the same stream are released in the order of their arrival.
Let J (s, Ti) be the jitter of a single stream Ti obtained by a release schedule s. We use the
following metric to evaluate multi-stream release schedules:
Definition 18 The max-jitter of a multi-stream traffic T = ⋃i∈{1,...,M} Ti obtained by a release
schedule s is the maximal jitter obtained by any of the streams composing the traffic; that is,

MJ(s, T) = max1≤k≤M J(s, Tk).
In what follows, given any algorithm A, we denote by J (A, Ti) (MJ (A, T )) the jitter (max-
jitter) corresponding to the schedule produced by A given stream Ti (traffic T ).
One can take a geometric view of delay jitter by considering a two dimensional plane where
the x-axis denotes time and the y-axis denotes the cell number (see Figure 20). We first consider
the case of a single stream T . Given a release schedule s, a point at coordinates 〈t, j〉 is marked if
tls(cj, T ) = t (that is, if the j’th cell is released at time t). The release band is the band with slope
R = 1/X that encloses all the marked points and has minimal width, where the width of the band
is the maximal difference in the x-axis coordinates between its margins. The jitter obtained by s is
the width of its release band, and therefore our objective is to find a schedule with the narrowest
release band.
Under the multi-stream model, we associate every stream Ti with a different color i. A point
at coordinates 〈t, j〉 is colored with color i if tls(cij, T ) = t. Any schedule s induces a separate
release band for each stream Ti ⊆ T that encloses all points with color i. Schedule s is therefore
characterized by M release bands.
6.3 Online Multi-Stream Max-Jitter Regulation
As mentioned previously, an online algorithm with buffer size 2MB which statically partitions the
buffer equally among the M streams and applies the algorithm described in [100, Algorithm B]
on each stream separately, obtains the optimal max-jitter possible with a buffer of size B. In this
section we show that this result is tight up to a factor of 2, by showing that in order to obtain a
max-jitter within a bounded factor from the optimal max-jitter possible with a buffer of size B,
any online algorithm needs a buffer of size at least MB cells. Hence, in order to maintain any
reasonable jitter performance, it is necessary to increase the buffer size in a linear proportion to the
Figure 20: Geometric view of delay jitter. Arrivals are marked with dotted circles and releases are marked with full circles. The jitter of stream Ti is the width of the band with slope R = 1/X enclosing all releases.
number of streams.
Theorem 29 For every online algorithm ALG with an internal buffer of size smaller than MB,
and for any x > 0, there exists a traffic consisting of M streams, forcing ALG to have max-jitter at
least x, while the optimal jitter possible with a buffer of size B is zero.
Proof. Let ALG be an online algorithm with a buffer of size at most MB − 1. Consider the
following traffic T : For every 0 ≤ k ≤ B− 1, M cells arrive at the regulator at time k ·X , one for
every stream.
Since the buffer size is at most MB − 1 and MB cells arrived by time t1 = (B − 1)X , it
follows that ALG releases a cell by time t1, say of stream Ti. Consider the following continuation
for T : Given some x > 0, in time t′ ≥ t1 + BX + x, a single cell of stream Ti arrives at the
regulator.
Note that ALG releases the first cell of stream Ti by time t1, the last cell of stream Ti cannot
be sent prior to time t′, and Ti consists of B + 1 cells. Let s be the schedule produced by ALG. It
follows that

J(ALG, Ti) ≥ tls(ciB, T) − tls(ci0, T) − (B − 0)X
  ≥ t′ − t1 − BX
  ≥ x + t1 + BX − t1 − BX = x,
which can be arbitrarily large. It follows also that MJ (ALG, T ) ≥ x. On the other hand, note
that for any choice of x, the optimal max-jitter possible with a buffer of size B is zero: Every cell
of a stream other than Ti is released immediately upon its arrival, and for every 0 ≤ j ≤ B, cell
cij ∈ Ti is released at time t′ − (B − j)X. Since every stream other than Ti does not consume any
buffer space, it is easy to verify that there are at most B cells in the buffer at any time. Clearly,
every stream attains zero jitter by this release schedule.
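The adversarial construction in this proof, together with the zero-jitter B-buffer schedule used to lower-bound the optimum, can be checked concretely. The following is an illustrative sketch; the encoding of cells as (stream, index, arrival-time) triples and the function names are ours:

```python
def adversary_traffic(M, B, X, x, i):
    """Traffic of Theorem 29: for 0 <= k <= B-1, one cell per stream arrives
    at time k*X; then a single late cell of stream i arrives at
    t' = t1 + B*X + x, where t1 = (B-1)*X."""
    cells = [(s, j, j * X) for j in range(B) for s in range(M)]
    t_prime = (B - 1) * X + B * X + x
    cells.append((i, B, t_prime))
    return cells, t_prime

def optimal_schedule(cells, t_prime, B, X, i):
    """The zero-jitter schedule feasible with buffer B: streams other than i
    release each cell on arrival; cell c^i_j is released at t' - (B - j)*X."""
    return {(s, j): (t_prime - (B - j) * X if s == i else ta)
            for (s, j, ta) in cells}
```

Checking this schedule for, say, M = 4, B = 3, X = 10, x = 7 confirms that no cell is released before it arrives, every stream attains zero jitter, and at most B cells are ever buffered simultaneously.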
Theorem 29 implies that if the buffer size is smaller than MB then there are scenarios in which
an optimal schedule attains zero jitter for all streams, while any online algorithm can be forced
to produce a schedule with arbitrarily large max-jitter. This fact immediately implies that even if
the objective is to minimize the average jitter obtained by the different streams, the same lower
bound holds. Since the online algorithm, which statically partitions the buffer, minimizes the jitter
of each stream independently, it clearly minimizes the overall average jitter as well, thus providing
a matching upper bound (up to a factor of 2).
Moreover, we are able to prove a more general lower bound:
Theorem 30 For every online algorithm ALG with an internal buffer size smaller than
max{MB, M(B − 1) + B + 1},
there exists a traffic consisting of M streams, such that ALG attains max-jitter strictly greater than
the optimal jitter possible with a buffer of size B.
Proof. Let ALG be an online algorithm with a buffer of size at most M(B − 1) + B. Consider the following traffic T: for every 0 ≤ k ≤ B − 1, M cells arrive at the regulator at time k · X, one for every stream. The traffic stops if ALG releases a cell before time t′ = BX.
If ALG releases a cell before time t′, the claim follows from the proof of Theorem 29.
Therefore, assume now that ALG does not release any cell before time t′, implying that at time t′ there are MB cells in the buffer. Note that this implies that ALG has a buffer of size at least MB. Consider the following continuation of T: at time t′, B + 1 cells of stream T1 arrive at the regulator.
Since ALG has a buffer of size at most M(B − 1) + B = MB + B − M, and before time t′ ALG has not released any of the MB + B + 1 cells in T, it must release at least M + 1 cells at time t′. By the pigeonhole principle, at least two of the released cells correspond to the same stream, say Ti. Therefore, the release schedule produced by ALG, denoted by s, has max-jitter at least the jitter attained by stream Ti:

tl_s(c^i_0, T) − tl_s(c^i_1, T) − (0 − 1)X = t′ − t′ − (0 − 1)X = X.
Hence, MJ(ALG, T) is strictly greater than zero. On the other hand, the optimal max-jitter possible with a buffer of size B is zero. To see this, consider the following release schedule: every cell of a stream other than T1 is released immediately upon its arrival, and for every 0 ≤ j ≤ 2B, cell c^1_j ∈ T1 is released at time t′ − (B − j)X. As in the previous case, every stream attains zero jitter under this release schedule, and no more than B cells are stored simultaneously in the buffer, since every cell except the last B cells of stream T1 is sent immediately upon arrival, and no two cells of the same stream are sent simultaneously.
Note that Theorem 30 requires greater resource augmentation than Theorem 29; on the other hand, Theorem 29 implies unbounded competitiveness, whereas Theorem 30 only implies that no online algorithm can obtain the optimal jitter. Furthermore, the lower bound described in Theorem 30 exactly coincides with the result for the single-stream model (i.e., M = 1) [100].
6.4 An Efficient Offline Algorithm
This section presents an efficient offline algorithm that generates a release schedule with optimal
max-jitter.
Given a traffic T that is an interleaving of M streams, consider a total order π = (c′_0, . . . , c′_n) on the release schedule of the cells in T that respects the FIFO order within each stream separately. The release schedule that attains the optimal max-jitter and respects π can be found using arguments similar to those in [100, Algorithm A]: cell c′_j can be stored in the buffer only until cell c′_{j+B} arrives, imposing strict bounds on the release time of each cell. In particular, it follows that for every traffic T, there exists an optimal release schedule. Unfortunately, it is computationally intractable to enumerate over all possible total orders; hence a more sophisticated approach is needed.
We first discuss properties of schedules that achieve optimal max-jitter. Then, we show that these properties allow us to find an optimal schedule in polynomial time.
For every cell c_j ∈ T, one can intuitively consider t = ta(c_j) − jX as the time at which c_0 ∈ T should be sent, so that c_j ∈ T is sent immediately upon its arrival in a perfectly periodic release schedule. For any stream T, denote τ(T) = max_j { ta(c_j) − jX | c_j ∈ T }. From a geometric point of view, τ(T) is a lower bound on the intersection between the time axis and the right margin of any release band (see Figure 21(a)), since otherwise the cell defining τ(Ti) would have to be released prior to its arrival.
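As a concrete illustration, τ(T) can be computed with a single pass over the arrival times; the list-based stream representation below is an assumption for illustration only:

```python
def tau(arrival_times, X):
    """tau(T) = max_j { ta(c_j) - j*X }: the rightmost admissible phase.
    Releasing cell j before tau(T) + j*X is impossible for the cell
    attaining this maximum, since it would precede its own arrival."""
    return max(t - j * X for j, t in enumerate(arrival_times))
```

With X = 10 and arrivals 0, 12, 20, the offsets ta(c_j) − jX are 0, 2, 0, so tau is 2: the second cell, arriving 2 time units late, fixes the right margin.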
Given a release schedule s for a traffic T, a stream Ti ⊆ T is said to be aligned in s if there is no cell c^i_k ∈ Ti such that tl_s(c^i_k, T) > τ(Ti) + kX. Clearly, if Ti is aligned in s, then the cell c^i_j ∈ Ti that defines τ(Ti) satisfies tl_s(c^i_j, T) = ta(c^i_j). Geometrically, the right margin of a release band corresponding to an aligned stream Ti intersects the time axis at the point ⟨τ(Ti), 0⟩ (see Figure 21(b)).
A release schedule s for traffic T is said to be aligned, if every stream is aligned in s. The
following lemma shows that one can iteratively align the streams of an optimal schedule without
increasing the overall jitter:
Lemma 15 For every traffic T , there exists an optimal aligned schedule s.
Proof. Given an optimal schedule s′ for traffic T with ` < M aligned streams, we prove that s′
can be changed into an aligned schedule (i.e. with M aligned streams), maintaining its optimality.
We first show that s′ can be altered into an optimal schedule with ` + 1 aligned streams. Let Ti
Figure 21: Geometric view of the right margin of the release band: (a) a non-aligned schedule; (b) an aligned schedule. Arrivals are marked by dotted circles and releases by full circles.
be one of the non-aligned streams in s′, and consider the following schedule s:

    tl_s(c^k_j, T) = min{ tl_{s′}(c^k_j, T), τ(Tk) + jX }   if k = i,
    tl_s(c^k_j, T) = tl_{s′}(c^k_j, T)                      if k ≠ i.
Clearly, for every stream other than Ti the schedule remains unchanged; therefore, it suffices to consider only stream Ti. Since tl_{s′}(c^i_j, T) ≥ ta(c^i_j) and τ(Ti) + jX ≥ ta(c^i_j), s is a release schedule, and it is easily verified that s satisfies the FIFO constraint. Schedule s is B-feasible, since s′ is B-feasible and for any cell c^i_j ∈ Ti, tl_s(c^i_j, T) ≤ tl_{s′}(c^i_j, T). Stream Ti is aligned in s, since every cell c^i_j ∈ Ti satisfies tl_s(c^i_j, T) ≤ τ(Ti) + jX. Hence, s has ℓ + 1 aligned streams.
In order to prove that s is optimal, it suffices to show that tl_s(c^i_j, T) − tl_s(c^i_m, T) − (j − m)X ≤ J(s′, Ti) for every two cells c^i_j, c^i_m ∈ Ti.
Assume without loss of generality that tl_s(c^i_j, T) − jX ≥ tl_s(c^i_m, T) − mX. If this does not hold, the term is negative, and we can simply exchange the roles of c^i_j and c^i_m. We distinguish between four possible cases. In the first case, where tl_s(c^i_j, T) = tl_{s′}(c^i_j, T) and tl_s(c^i_m, T) = tl_{s′}(c^i_m, T), the result follows immediately from the definition of J(s′, Ti). In the case where tl_s(c^i_j, T) = τ(Ti) + jX and tl_s(c^i_m, T) = τ(Ti) + mX, the term is zero, which is at most J(s′, Ti) by definition. The third case to consider is when tl_s(c^i_j, T) = tl_{s′}(c^i_j, T) and tl_s(c^i_m, T) = τ(Ti) + mX. This implies that tl_{s′}(c^i_j, T) < τ(Ti) + jX, thus tl_{s′}(c^i_j, T) − jX < τ(Ti). Therefore tl_s(c^i_j, T) − jX = tl_{s′}(c^i_j, T) − jX < τ(Ti) = tl_s(c^i_m, T) − mX, contradicting the assumption on c^i_j and c^i_m. The last case to consider is when tl_s(c^i_j, T) = τ(Ti) + jX and tl_s(c^i_m, T) = tl_{s′}(c^i_m, T): similarly to the previous case, this implies that τ(Ti) < tl_{s′}(c^i_j, T) − jX, and therefore tl_s(c^i_j, T) − tl_s(c^i_m, T) − (j − m)X = τ(Ti) − (tl_{s′}(c^i_m, T) − mX) < tl_{s′}(c^i_j, T) − jX − (tl_{s′}(c^i_m, T) − mX) ≤ J(s′, Ti), as required.
Applying the same arguments repeatedly alters schedule s′ into an aligned schedule and preserves its optimality.
Next we show that the optimality of a schedule s is maintained even if cells that are stored in the buffer are released earlier, as long as their new release times satisfy the FIFO order and remain within a release band of width MJ(s, T):
Lemma 16 Let s be an optimal schedule for traffic T. Then, for every stream Ti ⊆ T and for every J ∈ [J(s, Ti), MJ(s, T)], the new schedule s′ defined by

    tl_{s′}(c^k_j, T) = max{ ta(c^k_j), τ(Tk) − J + jX }   if k = i,
    tl_{s′}(c^k_j, T) = tl_s(c^k_j, T)                     if k ≠ i,

is B-feasible and MJ(s′, T) = MJ(s, T). Furthermore, if s is aligned then so is s′.
Proof. Since s′ changes only the release schedule of stream Ti, it clearly preserves the FIFO order and jitter of every stream other than Ti.
We first show that s′ respects the FIFO order of the cells in Ti. Let c^i_j be any cell in Ti. If tl_{s′}(c^i_j, T) = ta(c^i_j), then its release time is at most ta(c^i_{j+1}) ≤ tl_{s′}(c^i_{j+1}, T). Otherwise, tl_{s′}(c^i_j, T) = τ(Ti) − J + jX ≤ τ(Ti) − J + (j + 1)X ≤ tl_{s′}(c^i_{j+1}, T).
In order to bound the max-jitter of s′, it suffices to show that J(s′, Ti) ≤ MJ(s, T). Consider any pair of cells c^i_a, c^i_b ∈ Ti. By the definition of s′, tl_{s′}(c^i_a, T) ≥ τ(Ti) − J + aX. On the other hand, tl_{s′}(c^i_b, T) = max{ ta(c^i_b), τ(Ti) − J + bX } ≤ τ(Ti) + bX, since ta(c^i_b) ≤ τ(Ti) + bX by the definition of τ(Ti). Hence, tl_{s′}(c^i_b, T) − tl_{s′}(c^i_a, T) ≤ J + (b − a)X, which implies that J(s′, Ti) = max_{a,b} { tl_{s′}(c^i_b, T) − tl_{s′}(c^i_a, T) − (b − a)X } ≤ J ≤ MJ(s, T).
Assume by way of contradiction that s′ is not B-feasible, and let t be any time at which a set P of more than B cells is stored in the buffer. Since the release schedule of any stream Tk other than Ti is identical under both s and s′, every cell c^k_j ∈ P with k ≠ i is also stored in the buffer at time t under schedule s. Note first that no cell in P is released upon its arrival. Hence, for every cell c^i_j ∈ P,

tl_{s′}(c^i_j, T) = τ(Ti) − J + jX   (by the definition of s′)
  ≤ τ(Ti) − J(s, Ti) + jX   (since J ∈ [J(s, Ti), MJ(s, T)])
  = ta(c^i_k) − kX − J(s, Ti) + jX   (for the cell c^i_k ∈ Ti defining τ(Ti))
  ≤ tl_s(c^i_k, T) − (k − j)X − J(s, Ti)   (since ta(c^i_k) ≤ tl_s(c^i_k, T))
  ≤ tl_s(c^i_k, T) − (k − j)X − (tl_s(c^i_k, T) − tl_s(c^i_j, T) − (k − j)X)   (by the definition of J(s, Ti))
  = tl_s(c^i_j, T).

Therefore, all cells c^i_j ∈ P are stored in the buffer at time t under schedule s as well, contradicting the B-feasibility of s.
We conclude the proof by showing that if s is aligned then s′ is also aligned. Assume s is aligned. For any stream Tk ≠ Ti, schedules s and s′ are identical on Tk, and therefore Tk is aligned in s′. Assume by way of contradiction that Ti is not aligned; then there is a cell c^i_j ∈ Ti such that tl_{s′}(c^i_j, T) > τ(Ti) + jX. Since tl_{s′}(c^i_j, T) = max{ ta(c^i_j), τ(Ti) − J + jX }, it follows that ta(c^i_j) > τ(Ti) + jX, contradicting the maximality of τ(Ti).
By iteratively applying Lemma 16 with J = MJ(s, T) to all streams, we get:

Corollary 31 Given an optimal aligned schedule s for traffic T, the schedule s′ defined by

    tl_{s′}(c^k_j, T) = max{ ta(c^k_j), τ(Tk) − MJ(s, T) + jX }

is an optimal aligned schedule.
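Corollary 31 yields a closed-form candidate schedule once a target jitter bound J is fixed. A sketch of this construction follows, where the dict-of-arrival-lists traffic representation is an assumption for illustration:

```python
def compute_schedule(streams, X, J):
    """For each stream T_i, release cell j at
    max(ta(c_j), tau(T_i) - J + j*X), as in Corollary 31 with
    J = MJ(s, T); for J >= 0 the per-stream jitter of the result is
    at most J (cf. Lemma 16), since every release-time offset
    tl(c_j) - j*X then lies in [tau(T_i) - J, tau(T_i)]."""
    schedule = {}
    for name, arrivals in streams.items():
        t = max(ta - j * X for j, ta in enumerate(arrivals))  # tau(T_i)
        schedule[name] = [max(ta, t - J + j * X)
                          for j, ta in enumerate(arrivals)]
    return schedule
```

With X = 10, arrivals [0, 12, 20], and J = 0, this yields releases [2, 12, 22]: a perfectly periodic schedule, at the cost of buffering; with J = 2 every cell is released on arrival.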
The following lemma bounds from below the release time of cells in a schedule. Intuitively,
this lemma defines the left margin of the release band.
Lemma 17 For any schedule s for traffic T, every stream Ti ⊆ T, and every cell c^i_j ∈ Ti, tl_s(c^i_j, T) ≥ τ(Ti) − J(s, Ti) + jX.

Proof. Assume by way of contradiction that there exist a stream Ti and a cell c^i_j ∈ Ti such that tl_s(c^i_j, T) < τ(Ti) − J(s, Ti) + jX. Let c^i_k ∈ Ti be the cell defining τ(Ti). Since tl_s(c^i_k, T) ≥ ta(c^i_k),

J(s, Ti) ≥ tl_s(c^i_k, T) − tl_s(c^i_j, T) − (k − j)X
  > ta(c^i_k) − (τ(Ti) − J(s, Ti) + jX) − (k − j)X
  = ta(c^i_k) − (ta(c^i_k) − kX) + J(s, Ti) − jX − kX + jX = J(s, Ti),

which is a contradiction.
Lemma 17 indicates an important property of aligned optimal schedules: in such schedules, the jitter of any stream is characterized by the release time of a single cell, as stated in the following corollary:

Corollary 32 For any aligned schedule s for traffic T and every stream Ti ⊆ T,

    J(s, Ti) = max_j { τ(Ti) − tl_s(c^i_j, T) + jX }.
The following lemma shows that at least one of the widest release bands, corresponding to some stream Ti attaining the max-jitter, has its left margin determined by the following event: the arrival of a cell (possibly of another stream) causing a buffer overflow, which forces some cell of Ti to be released earlier than desired.
Lemma 18 Let s be an aligned optimal schedule for traffic T. There exist a stream Ti ⊆ T that attains the max-jitter and a cell c^i_j ∈ Ti such that tl_s(c^i_j, T) = τ(Ti) − MJ(s, T) + jX and tl_s(c^i_j, T) = ta(c_ℓ) for some cell c_ℓ ∈ T.
Proof. We show by contradiction that if the claim does not hold for an optimal aligned schedule, then the schedule can be altered into a new schedule with max-jitter strictly less than the original one. Formally, consider an aligned optimal schedule s for T. Let T′ = { Ti | J(s, Ti) = MJ(s, T) }, and for every Ti ∈ T′, let Pi = { c^i_j ∈ Ti | tl_s(c^i_j, T) = τ(Ti) − MJ(s, T) + jX }. From a geometric point of view, Pi consists of all the cells in Ti whose release time lies on the left margin of Ti's release band. Finally, let P = ⋃_{Ti ∈ T′} Pi. Assume by way of contradiction that for every c_j ∈ P, there is no cell c_ℓ ∈ T such that tl_s(c_j, T) = ta(c_ℓ).
Note first that in such a case, MJ(s, T) > 0. Otherwise, since s is aligned, for each stream Ti the cell c^i_k defining τ(Ti) satisfies both tl_s(c^i_k, T) = ta(c^i_k) and tl_s(c^i_k, T) = τ(Ti) − 0 + kX, so its release time lies on the left margin and coincides with an arrival time, contradicting the assumption.
The altered schedule s′ is obtained by postponing the release of all the cells in P for some
positive amount of time. As we shall prove, schedule s′ is B-feasible, and has a max-jitter strictly
less thanMJ (s, T ), contradicting the optimality of s.
For each cell c^i_k ∈ P which is the j'th cell of T (i.e., c^i_k = c_j), the exact amount of postponement is determined by the following constraints:

1. Avoiding buffer overflow: do not postpone beyond the first arrival of a cell after tl_s(c_j, T). This constraint is captured by

    δ(c_j) = min_{c_ℓ : ta(c_ℓ) > tl_s(c_j, T)} { ta(c_ℓ) − tl_s(c_j, T) }   if c_j is not the last cell in T,
    δ(c_j) = ∞   otherwise.

2. Maintaining FIFO order: recalling that c_j = c^i_k, do not postpone beyond tl_s(c^i_{k+1}, T). This constraint is captured by

    ε(c_j) = tl_s(c^i_{k+1}, T) − tl_s(c^i_k, T)   if c_j is not the last cell in Ti,
    ε(c_j) = ∞   otherwise.
Let δ = min_{c_j ∈ P} δ(c_j) and ε = min_{c_j ∈ P} ε(c_j), capturing the amounts of time that satisfy these constraints for all cells in P.
It is important to notice that δ and ε are strictly greater than zero: δ > 0 by its definition, and ε > 0 since, if there were a cell c^i_k ∈ Pi such that tl_s(c^i_k, T) = tl_s(c^i_{k+1}, T), then tl_s(c^i_{k+1}, T) = τ(Ti) − MJ(s, T) + kX < τ(Ti) − MJ(s, T) + (k + 1)X, contradicting Lemma 17.
For the purpose of analysis, define for every stream Ti ∈ T′,

    ρ(Ti) = min_{c^i_k ∈ Ti \ Pi} { tl_s(c^i_k, T) − (τ(Ti) − MJ(s, T) + kX) }.

ρ(Ti) captures how far the rest of the stream is from the left margin. Since J(s, Ti) > 0 for any Ti ∈ T′, the set Ti \ Pi is not empty, and ρ(Ti) is positive and finite. Let ρ = min_{Ti ∈ T′} ρ(Ti). It follows that ρ is positive and finite.
Let ∆ = min{δ, ε, ρ}, and consider the following schedule which, as we shall prove, attains max-jitter strictly smaller than MJ(s, T):

    tl_{s′}(c_j, T) = tl_s(c_j, T) + ∆/2   if c_j ∈ P,
    tl_{s′}(c_j, T) = tl_s(c_j, T)   otherwise.

Note that this schedule is well-defined since ∆ is positive and finite.
We first prove that s′ is B-feasible and maintains the FIFO order. Assume by way of contradiction that s′ is not B-feasible, and let t be the first time at which the number of cells in the buffer exceeds B. By the minimality of t, there exists a cell that arrives at time t. For every cell c_j ∈ P, no cell arrives at the buffer in the interval [tl_s(c_j, T), tl_s(c_j, T) + ∆/2], because ∆ ≤ δ(c_j); this implies that t is not in any such interval. But then the definition of s′ yields that the content of the buffer at time t is the same under schedules s and s′, contradicting the B-feasibility of s. The FIFO order of s′ is maintained since ∆ ≤ ε(c_j) for every c_j ∈ P.
We conclude the proof by showing that MJ(s′, T) < MJ(s, T). Consider any Ti ∈ T′ and any c^i_k ∈ Ti. If c^i_k ∈ P, then by the definition of s′ and Lemma 17, tl_{s′}(c^i_k, T) = tl_s(c^i_k, T) + ∆/2 ≥ τ(Ti) − MJ(s, T) + kX + ∆/2. The same holds for c^i_k ∉ P: since ρ(Ti) ≥ ∆ > ∆/2, it follows that tl_{s′}(c^i_k, T) = tl_s(c^i_k, T) ≥ τ(Ti) − MJ(s, T) + kX + ρ(Ti) > τ(Ti) − MJ(s, T) + kX + ∆/2. Hence, for every c^i_k,

    τ(Ti) − tl_{s′}(c^i_k, T) + kX ≤ τ(Ti) − (τ(Ti) − MJ(s, T) + kX + ∆/2) + kX = MJ(s, T) − ∆/2 < J(s, Ti).
Figure 22: Outline of arrivals (dotted circles) and releases (full circles) for cells of the stream Ti that attains the max-jitter, in an aligned release schedule, as discussed in Corollary 31 and in Lemma 18. The square represents an arrival of some cell in T causing buffer overflow.
By Corollary 32, J(s′, Ti) < J(s, Ti) for any stream Ti ∈ T′. The jitter of any other stream remains unchanged; therefore MJ(s′, T) < MJ(s, T), contradicting the optimality of s.
Finally, we conclude this section by showing that there exists a polynomial-time algorithm
that finds an optimal schedule for the multi-stream max-jitter problem. Algorithm 5 depicts the
pseudo-code of this algorithm.
Theorem 33 Algorithm 5 finds an optimal schedule for the multi-stream max-jitter problem. Its time complexity is O(n³), where n = Σ_{i=1}^{M} |Ti|.
Proof. Assume a feasible schedule exists. Lemma 18 implies that there is an optimal schedule s and a stream Ti such that MJ(s, T) = τ(Ti) − ta(c_ℓ) + kX, for some cells c^i_k ∈ Ti and c_ℓ ∈ T. Note that for any stream Ti, the value of τ(Ti) can be computed in linear time using only the arrival times of the cells of stream Ti (see Algorithm 5, lines 18–19).
It follows that by enumerating over all possible choices of pairs (c^i_k, c_ℓ), one can find the collection of possible values of the optimal jitter (see Algorithm 5, function OFFLINE).
For every such value J, computing an aligned release schedule attaining jitter J and verifying that it is B-feasible can be done in linear time, by checking the feasibility of the schedule defined in Corollary 31 under the assumption MJ(s, T) = J (see Algorithm 5, functions COMPUTESCHEDULE and ISFEASIBLE).
Algorithm 5 Algorithm OFFLINE
 1: boolean function ISFEASIBLE(traffic T, schedule s, buffer size B)
 2:   BufferOccupancy ← 0
 3:   T′ ← MERGE(T, s)   ▷ for identical values, those of s appear before those of T
 4:   for every e ∈ T′ in increasing order do
 5:     if e ∈ T then
 6:       BufferOccupancy ← BufferOccupancy + 1
 7:     else   ▷ i.e., e ∈ s
 8:       BufferOccupancy ← BufferOccupancy − 1
 9:     end if
10:     if BufferOccupancy > B then
11:       return FALSE
12:     end if
13:   end for
14:   return TRUE
15: end function
16: schedule function COMPUTESCHEDULE(traffic T, jitter J)
17:   s ← NULL
18:   for every stream Ti ∈ T do
19:     τ(Ti) ← max_j { ta(c^i_j) − jX }   ▷ defines the right margin
20:   end for
21:   for every cell c^i_j ∈ T do
22:     tl_s(c^i_j, T) ← max{ ta(c^i_j), τ(Ti) − J + jX }
23:   end for
24:   return s
25: end function
26: schedule function OFFLINE(traffic T, buffer size B)
27:   Array of jitter values MinJitter[M], initially all ∞
28:   Array of schedules MinSchedule[M], initially all NULL
29:   for every stream Ti ∈ T do
30:     τ(Ti) ← max_j { ta(c^i_j) − jX }
31:     for every c^i_j ∈ Ti do
32:       for every c_k ∈ T do
33:         if ta(c_k) ≥ ta(c^i_j) and ta(c_k) ≤ τ(Ti) + jX then
34:           λ(Ti) ← ta(c_k) − jX
35:           s ← COMPUTESCHEDULE(T, τ(Ti) − λ(Ti))
36:           if ISFEASIBLE(T, s, B) and MinJitter[i] > τ(Ti) − λ(Ti) then
37:             MinJitter[i] ← τ(Ti) − λ(Ti)
38:             MinSchedule[i] ← s
39:           end if
40:         end if
41:       end for
42:     end for
43:   end for
44:   ℓ ← arg min MinJitter   ▷ the value of (any) argument i for which MinJitter[i] is minimal
45:   return MinSchedule[ℓ]
46: end function
Chapter 7
Conclusions
At the heart of the Internet, and crucially shaping its performance, are routers and switches. The communication needs of current applications necessitate that switches (routers) have high speed and a high port-count. This dissertation has taken a competitive approach, evaluating the performance of a switch by comparison with the performance of an ideal switch.
Chapter 4 focused on the Parallel Packet Switch architecture and presented tight bounds on the average relative queuing delay, which hold with high probability even if randomization is used. This implies that, unlike in other load-balancing problems, randomization does not reduce the relative queuing delay. Our lower bounds rely on the fact that switches are FCFS, but they can be generalized to other priority schemes.
We believe that the techniques presented in this dissertation can be applied to other switch architectures to show that randomization does not decrease the relative queuing delay in these architectures either. In contrast, it is important to notice that randomized algorithms may decrease the complexity of the packet scheduling process, which is one of the primary performance bottlenecks of contemporary switches.
We also introduced a novel class of demultiplexing algorithms for the PPS (called u-RT algorithms) that use local information and complete global information older than u time-slots. It is important to design u-RT demultiplexing algorithms that exchange small, practical amounts of information, e.g., credits, between demultiplexors. Moreover, it is important to study the performance of synchronized demultiplexing algorithms that share only a global clock (these algorithms are currently classified as 1-RT demultiplexing algorithms).
In addition, we presented lower bounds for two extensions of the PPS model: input-buffered PPS and recursive PPS. An interesting direction for future research is to devise efficient demultiplexing algorithms for these architectures that outperform the lower bounds presented for the bufferless PPS model.
Chapter 5 investigated packet-mode scheduling in CIOQ switches. We showed that even though packet-mode scheduling imposes very confining restrictions on scheduling algorithms, a speedup arbitrarily close to 2 suffices to mimic an ideal shadow switch, if relative queuing delay can be tolerated. This result matches the lower bound for cell-based scheduling, implying that, somewhat surprisingly, no additional speedup is required in order to keep packets contiguous over the switch fabric.
We studied the trade-off between the relative queuing delay and the speedup of the switch by presenting upper bounds on the speedup required to achieve a given relative queuing delay, leaving the question of their optimality for future research. Note that by Theorem 21, Lmax/2 − 3 is a lower bound on the relative queuing delay, regardless of the switch speedup.
It is also interesting to explore packet-mode scheduling in the PPS architecture. Such a packet-mode PPS ensures that all cells of the same packet are delivered contiguously to the output-port, eliminating the need for reassembly buffers in the output-port. However, since the middle-stage switches operate at a lower rate, it is impossible to deliver consecutive cells through the same plane; therefore, unlike in CIOQ switches, contiguous packet delivery over the switch fabric is not required.
Iyer et al. [74, Theorem 5] showed that a speedup S ≥ 3 suffices for a PPS to emulate a PIFO OQ switch using a centralized algorithm. Notice that a packet-mode PPS is not required to maintain contiguous packet delivery over its fabric. Thus, as discussed in Chapter 5 (Page 86), this result implies that a packet-mode PPS can emulate an ideal shadow switch (with no relative queuing delay).¹ The question whether fully-distributed or u-RT demultiplexing algorithms can provide such emulation, and what their relative queuing delay would be, is left for future research.

¹The algorithm described in [74, Theorem 5] requires that the departure time of a cell c is known upon its arrival. This can be easily satisfied, since we assumed that the size of a packet is known upon the arrival of its first cell.
Chapter 6 examined the problem of jitter regulation and, specifically, the tradeoff between the buffer size available at the regulator and the optimal jitter attainable using such a buffer. We dealt with the realistic case where the regulator must handle many streams concurrently, with the objective of minimizing the maximum jitter attained by any of these streams.
Since real-life networks clearly have finite-capacity links, it is also interesting to investigate the behavior of a jitter regulator that handles multiple streams simultaneously and whose outgoing links have bounded bandwidth (either shared among all streams or dedicated to each stream separately). In addition, since regulators might be allowed to drop cells, it is of interest to examine the correlations between buffer size, optimal jitter, and drop ratio.
Furthermore, it is important to incorporate jitter regulation within existing switching architectures, so that traffic leaves the switch already shaped. This is straightforward when cells are immediately available in the output buffer upon their arrival (e.g., in OQ switches). However, in other switch architectures (e.g., in CIOQ switches), jitter regulation and scheduling must be integrated. Such mechanisms are of special interest since they cope with both the switching bottleneck and the packet scheduling bottleneck simultaneously (recall Figure 1).
In a broader context, it is appealing to apply the techniques and methodologies presented in this dissertation to other (existing and future) switch architectures. A prominent example is the buffered crossbar architecture (see Section 1.1, Page 9), which has drawn a lot of attention recently. Even though some competitive evaluations of this architecture already exist in the literature (e.g., [38, 99, 138]), many questions are left unresolved. Among these, perhaps the most appealing is to fully understand the relation between the speedup of the switch, the size of the buffers in the crosspoints, and the relative queuing delay.
This dissertation considers only unicast traffic, in which each cell is destined for a single output-port. Given the tremendous growth in video relay usage in recent years, multicast traffic is becoming crucial. In multicast traffic, each cell has a set of f destinations (called its fanout set) to which it should be delivered. A trivial way to support such traffic is to replicate each cell f times and treat the copies as unicast cells. However, as the size of the fanout set increases, this approach becomes infeasible. On the positive side, common switch architectures have built-in
multicast capabilities; a prominent example is the CIOQ switch, in which a multicast cell can be delivered to several output-ports in a single time-slot (regardless of the switch speedup). Therefore, a promising future research direction is to investigate the ability of a switch with built-in multicast capabilities to handle multicast traffic and provide good QoS guarantees, such as small delay and small jitter.
There are also several ways to extend the competitive model described in this dissertation and potentially gain further insight into the behavior of feasible switch architectures.
In order to provide an exact emulation, our model requires that at each time-slot exactly the same cells are transmitted by the investigated switch and the shadow switch. We further relaxed the model and allowed a bounded relative queuing delay between the switches, yet we still required
that the order of cells (determined by a priority scheme) is preserved. It is of theoretical interest to investigate whether further relaxing these requirements, allowing cells to be mis-sequenced, can reduce the relative queuing delay or the speedup required for such emulation. Note that under such a setting the switch is still required to achieve the same throughput and overall delay guarantees as the shadow switch (for example, by providing work-conservation).
Another extension of our model involves incorporating cell drops. In practice, buffer sizes are bounded, and cells that cannot be stored in the buffers are dropped. When using a competitive approach to evaluate switch architectures or scheduling policies with bounded buffer sizes, a multiplicative competitive ratio is used to evaluate cell losses. Namely, the competitive ratio C is the ratio between the number of cells transmitted by an optimal algorithm (often referred to as an adversary) and the number of cells transmitted by the evaluated algorithm. However, these evaluations usually assume that the adversary uses an offline algorithm, which has complete knowledge of future arrivals, while operating under the same architecture. Since offline algorithms are not realistic, these results tend to be overly pessimistic.
It is also worthwhile to consider a model in which the adversary operates under an ideal switch architecture with a given (bounded) buffer size and without any knowledge of future arrivals. Under such a model, one can compare a given switch with its adversary by evaluating the relative queuing delay, the relative loss ratio, and the tradeoffs between these two objectives. We believe that such evaluations may indicate important and realistic design choices. As a first step towards this goal, we showed how a packet-mode CIOQ switch can mimic an ideal switch with bounded buffers (see Section 5.5).
Bibliography
[1] Inverse Multiplexing over ATM (IMA): A Breakthrough WAN Technology for Corporate
Networks. 3Com Corporation, 1997.
[2] A. Adas. Traffic models in broadband networks. IEEE Communications Magazine, 35(7):
82–89, July 1997.
[3] G. Aggarwal, R. Motwani, D. Shah, and A. Zhu. Switch scheduling via randomized edge
coloring. In 44th Symposium on Foundations of Computer Science (FOCS), pages 502–513,
2003.
[4] E. Altman, Z. Liu, and R. Righter. Scheduling of an input-queued switch to achieve maximal
throughput. Probability in the Engineering and Informational Sciences, 14:327–334, 2000.
[5] T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker. High-speed switch scheduling
for local-area networks. ACM Transactions on Computer Systems, 11(4):319–352, 1993.
[6] M. Andrews, B. Awerbuch, A. Fernandez, J. Kleinberg, T. Leighton, and Z. Liu. Universal
stability results for greedy Contention-Resolution protocols. Journal of the ACM, 48(1):
39–69, 2001.
[7] M. Arpaci and J. A. Copeland. Buffer Management for Shared-Memory ATM Switches.
IEEE Communications Surveys and Tutorials, 3(1), 2000.
[8] A. Aslam and K. J. Christensen. A parallel packet switch with multiplexors containing
virtual input queues. Computer Communications, 27(13):1248–1263, 2004.
[9] J. Aspnes. Randomized protocols for asynchronous consensus. Distributed Computing, 16
(2-3):165–175, 2003.
[10] R. Atkinson. RFC 1626: Default IP MTU for use over ATM AAL5, May 1994.
[11] Traffic Management Specification. The ATM Forum, March 1999. Version 4.1, AF-TM-
0121.000.
[12] Inverse Multiplexing for ATM (IMA) specification. The ATM Forum, March 1999. Version
1.1, AF-PHY-0086.001.
[13] H. Attiya and D. Hay. Randomization does not reduce the average delay in parallel packet
switches. In the 17th ACM Symposium on Parallelism in Algorithms and Architectures
(SPAA), pages 11–20, 2005.
[14] H. Attiya and D. Hay. The inherent queuing delay of parallel packet switches. IEEE Trans-
actions on Parallel and Distributed Systems, 17(9):1048–1056, 2006.
[15] H. Attiya, D. Hay, and I. Keslassy. Packet-mode emulation of output-queued switches. In
the 18th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages
138–147, 2006.
[16] R.Y. Awdeh and H.T. Mouftah. Survey of ATM switch architectures. Computer Networks
and ISDN Systems, 27(12):1567–1613, 1995.
[17] Y. Azar, A. Broder, A. Karlin, and E. Upfal. Balanced allocations. SIAM Journal on Com-
puting, 29(1):180–200, 1999.
[18] A. Bar-Noy, R. Bhatia, J. Naor, and B. Schieber. Minimizing service and operation costs of
periodic scheduling. Mathematics of Operations Research, 27(3):518–544, 2002.
[19] Y. Bartal, A. Fiat, and Y. Rabani. Competitive algorithms for distributed data management.
Journal of Computer and System Sciences, 51(3):341–358, 1995.
[20] S. Baruah, G. Buttazzo, S. Gorinsky, and G. Lipari. Scheduling periodic task systems to
minimize output jitter. In 6th International Conference on Real-Time Computing Systems
and Applications, pages 62–68, 1999.
[21] S. Ben-David, A. Borodin, R. Karp, G. Tardos, and A. Wigderson. On the power of ran-
domization in online algorithms. Algorithmica, 11(1):2–14, 1994.
128
Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007
[22] J. C.R. Bennett and H. Zhang. WF 2Q: Worst-case fair weighted fair queueing. In IEEE
INFOCOM, March 1996.
[23] J. C.R. Bennett and H. Zhang. Hierarchical packet fair queueing algorithms. IEEE/ACM
Transactions on Networking, 5(5):675–689, October 1997.
[24] A. Bianco, P. Giaccone, E. Leonardi, and F. Neri. A framework for differential frame-based
matching algorithms in input-queued switches. In IEEE INFOCOM, 2004.
[25] G. Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucuman Rev. Ser. A, 5:
147–151, 1946.
[26] Blackwell, T., Chang, K., Kung, H.T. and Lin. D. Credit-based flow control for ATM
networks. In Proc. of the First Annual Conference on Telecommunications R&D in Mas-
sachusetts, 1994.
[27] A. Borodin and R. El-Yaniv. Online Computation and Competitive Analysis. Cambridge
University Press, 1998.
[28] A. Borodin, J. Kleinberg, P. Raghavan, M. Sudan, and D. P. Williamson. Adversarial Queue-
ing Theory. Journal of the ACM, 48(1):13–38, 2001.
[29] J. Y. Le Boudec. Network calculus made easy. Technical Report EPFL-DI 96/218, cole
Polytechnique Federale, Lausanne (EPFL), 1996.
[30] S. Bradner. Internet RFC 1242: Benchmarking terminology for network interconnection
devices, July 1991.
[31] C. S. Chang, W. J. Chen, and H. Y. Huang. Birkhoff-von Neumann input buffered crossbar
switches. In IEEE INFOCOM, pages 1614–1623, 2000.
[32] C.S. Chang, D.S. Lee, and Y.S. Jou. Load balanced Birkhoff-von Neumann switches, part
I: one-stage buffering. Computer Communications, 25:611–622, 2002.
[33] K. Chang and H.T. Kung. Reciever-oriented adaptive buffer allocation in credit-based flow
control for atm networks. In IEEE INFOCOM, pages 239–252, 1995.
129
Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007
[34] A. Charny. Providing QoS guarantees in input buffered crossbar switches with speedup.
PhD thesis, Massachusetts Institute Of Technology, September 1998.
[35] A. Charny, P. Krishna, N.S. Patel, and R.J. Simcoe. Algorithms for providing bandwidth and
delay guarantees in input-buffered crossbars with speedup. In Sixth IEEE/IFIP International
Workshop on Quality of Service, pages 235–244, 1998.
[36] F. M. Chiussi, D. A. Khotimsky, and S. Krishnan. Generalized inverse multiplexing of
switched ATM connections. In IEEE Globecom, pages 3134–3140, 1998.
[37] S. Chuang, A. Goel, N. McKeown, and B. Prabhakar. Matching output queueing with a
combined input output queued switch. In IEEE INFOCOM, pages 1169–1178, 1999.
[38] S. Chuang, S. Iyer, and N. McKeown. Practical algorithms for performance guarantees in
buffered crossbars. In IEEE INFOCOM, 2005.
[39] ATM Switch Router Software Configuration Guide. Cisco Systems, Inc, 2001. 12.1(6)EY.
[40] Cisco 12000 Series Gigabit Switch Routers. Cisco Systems, Inc.
Available online at: http://www.cisco.com/warp/public/cc/pd/rt/12000/prodlit/gsr ov.pdf .
[41] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, pages
406–424, 1953.
[42] comScore Networks. comScore releases August U.S. Video Metrix rankings, October 2006.
Available online at: http://www.comscore.com/press/release.asp?press=1035.
[43] Cooperative Association for Internet Data Analysis (CAIDA). http://www.caida.org/.
[44] R. L. Cruz. A calculus for network delay, Part II: Network analysis. IEEE Transactions on
Information Theory, pages 132– 141, 1991.
[45] R.L. Cruz. A calculus for network delay, part I: Network elements in isolation. IEEE
Transactions on Information Theory, pages 114–31, 1991.
[46] J. Dai. A fluid-limit model criterion for instability of multiclass queueing networks. Annals
of Applied Probability, 6:751–757, 1996.
130
Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007
[47] J. G. Dai and B. Prabhakar. The throughput of data switches with and without speedup. In
IEEE INFOCOM, pages 556–564, 2000.
[48] W.J. Dally and B. Towles. Principles and practices of interconnection networks, chapter 7,
pages 145–158. Elsevier, 2004.
[49] B. Davie, A. Charny, J.C.R. Bennett, K. Benson, J.Y. Le Boudec, W. Courtney, S. Davari,
V. Firoiu, and D. Stiliadis. Internet RFC 3246: An expedited forwarding PHB (per-hop
behavior), March 2002.
[50] A. Demers, S. Keshav, and S. Shenker. Analysis and simulation of a fair queueing algorithm.
Journal of Internetworking Research and Experience, pages 3–26, October 1990.
[51] E.A. Dinic. Algorithm for solution of a problem of maximum ow in a network with power
estimation. Soviet Mathematics Doklady, 11(5):1277–1280, 1970.
[52] L. Dulmage and I. Halperin. On a theorem of Frobenius-Konig and J. von Neumann’s game
of hide and seek. Trans. Roy. Soc. Canada III, 49:23–29, 1955.
[53] J. Duncanson. Inverse multiplexing. IEEE Communications Magazine, 32(4):34–41, April
1994.
[54] V. Firoiu, J. Le Boudec, D. Towsley, and Z. Zhang. Theories and models for internet quality
of service. Proceedings of the IEEE, 90(9):1565–1591, September 2002.
[55] S. Floyd and V. Jacobson. Random early detection gateways for congestion avoidance.
IEEE/ACM Transactions on Networking, 1(4):397–413, 1993.
[56] P. Fredette. The past, present, and future of inverse multiplexing. IEEE Communications
Magazine, 32(4):42–46, April 1994.
[57] Y. Ganjali, A. Keshavarzian, and D. Shah. Input queued switches: cell switching vs. packet
switching. In IEEE INFOCOM, volume 3, pages 1651–1658, March 2003.
[58] P. Giaccone, B. Prabhakar, and D. Shah. Randomized scheduling algorithms for high-
aggregate bandwidth switches. IEEE Journal on Selected Areas in Communications, 21
(4):546–559, May 2003.
131
Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007
[59] P. Giaccone, E. Leonardi, B. Prabhakar, and D. Shah. Delay bounds for combined input-
output switches with low speedup. Performance Evaluation, 55(1-2):113–128, 2004.
[60] S.J. Golestani. A stop-and-go queueing framework for congestion management. In ACM
SIGCOMM, pages 8–18, 1990.
[61] G. Gonnet. Expected length of the longest probe sequence in hash coding searching. Journal
of the ACM, 28(2):289–304, April 1981.
[62] D. Guez, A. Kesselman, and A. Rosen. Packet-mode policies for input-queued switches.
In the 16th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages
93–102, 2004.
[63] D. Hay and G. Scalosub. Jitter regulation for multiple streams. In the 13th Annual European
Symposium on Algorithms (ESA), pages 496–507, 2005.
[64] H. Heffes and D. M. Lucantoni. Markov-modulated characterization of packetized voice
and data traffic and related statistical multiplexer performance. IEEE Journal on Selected
Areas in Communications, 4(6):856–868, 1986.
[65] M. Hluchyj and M. Karol. Queueing in high-performance packet switching. IEEE Journal
on Selected Areas in Communications, 6(12):1587–1597, December 1988.
[66] J.E. Hopcroft and R.M. Karp. An n5/2 algorithm for maximum matchings in bipartite
graphs. SIAM Journal on Computing, 2(4):225–231, 1973.
[67] A. Hung, G. Kesidis, and N. McKeown. ATM input-buffered switches with guaranteed-rate
property. In IEEE ISCC, pages 331–335, 1998.
[68] S. Iyer. Analysis of a packet switch with memories running slower than the line rate. Mas-
ter’s thesis, Stanford University, May 2000.
[69] S. Iyer. Personal Communication, October 2006.
[70] S. Iyer and N. McKeown. Making parallel packet switches practical. In IEEE INFOCOM,
pages 1680–1687, 2001.
132
Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007
[71] S. Iyer and N. McKeown. On the speedup required for a multicast parallel packet switch.
IEEE Communications letteres, 5(6):269–271, June 2001.
[72] S. Iyer and N. McKeown. Analysis of the parallel packet switch architecture. IEEE/ACM
Transactions on Networking, pages 314–324, 2003.
[73] S. Iyer and Nick McKeown. Maximum size matchings and input queued switches. In 40th
Annual Allerton Conference on Communication, Control, and Computing, October 2002.
[74] S. Iyer, A.A. Awadallah, and N. McKeown. Analysis of a packet switch with memories
running slower than the line rate. In IEEE INFOCOM, pages 529–537, 2000.
[75] P. Jayanti, K. Tan, and S. Toueg. Time and space lower bounds for nonblocking implemen-
tations. SIAM Journal on Computing, 30(2):438–456, 2000.
[76] C.R. Kalmanek, H. Kanakia, and S. Keshav. Rate-controlled servers for very high-speed
networks. In IEEE Globecom, pages 12–20, December 1990.
[77] K. Kar, T. Lakshman, D. Stiliadis, and L. Tassiulas. Reduced complexity input buffered
switches. In HOT Interconnects, pages 13– 20, 2000.
[78] M. Karol, M. Hluchyj, and S. Morgan. Input versus output queueing on a space-division
packet switch. IEEE Transactions on Communications, 35(12):1347–1356, December 1987.
[79] S. Keshav. An Engineering Approach to Computer Networking. Addison-Wesley Publishing
Co., 1997.
[80] I. Keslassy. Load Balanced Router. PhD thesis, Stanford University, June 2004.
[81] I. Keslassy and N. McKeown. Maintaining packet order in two-stage switches. In IEEE
INFOCOM, 2002.
[82] I. Keslassy, C.S. Chang, N. McKeown, and D.S. Lee. Optimal load-balancing. In IEEE
INFOCOM, 2005.
[83] I. Keslassy, M. Kodialam, T.V. Lakshman, and D. Stiliadis. On guaranteed smooth schedul-
ing for input-queued switches. IEEE/ACM Transactions on Networking, 13(6):1364–1375,
December 2005.
133
Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007
[84] A. Kesselman and Y. Mansour. Harmonic buffer management policy for shared memory
switches. In IEEE INFOCOM, 2002.
[85] D. Khotimsky. Personal Communication, January 2004.
[86] D. Khotimsky and S. Krishnan. Stability analysis of a parallel packet switch with bufferless
input demultiplexors. In IEEE International Conference on Communications (ICC), pages
100–106, 2001.
[87] D. Khotimsky and S. Krishnan. Evaluation of open-loop sequence control schemes for
multi-path switches. In IEEE International Conference on Communications (ICC), vol-
ume 4, pages 2116– 2120, 2002.
[88] L. Kleinrock. Queuing Systems, Volume II. John Wiley&Sons, 1975.
[89] H. Koga. Jitter regulation in an internet router with delay constraint. Journal of Scheduling,
4(6):355–377, 2001.
[90] D. Konig. Grafok es alkalmazasuk a determinansok es a halmazok elmeletere. Matematikai
es Termeszettudomanyi Ertesıto, 34:104–119, 1916.
[91] P. Krishna, N.S. Patel, A. Charny, and R.J. Simcoe. On the speedup required for work-
conserving crossbar switches. IEEE Journal on Selected Areas in Communications, 17(6):
1057–1066, June 1999.
[92] H. T. Kung and A. Chapman. The FCVC (Flow-Controlled Virtual Channels) Proposal for
ATM Networks. In International Conference on Network Protocols, pages 116–127, 1993.
[93] H. T. Kung, T. Blackwell, and A. Chapman. Credit-based flow control for ATM networks:
credit update protocol, adaptive credit allocation and statistical multiplexing. In ACM SIG-
COMM, pages 101–114, 1994.
[94] E. Leonardi, M. Mellia, F. Neri, and M. A. Marsan. Bounds on average delays and queue size
averages and variances in input queued cell-based switches. In IEEE INFOCOM, volume 3,
pages 1095–1103, 2001.
134
Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007
[95] X. Li and M. Hamdi. On scheduling optical packet switches with reconfiguration delay.
IEEE Journal on Selected Areas in Communications, 21(7):1156–1164, 2003.
[96] Y. Li, S. Panwar, and H.J. Chao. Exhaustive service matching algorithms for input queued
switches. In IEEE Workshop on High Performance Switching and Routing, pages 253–258,
2004.
[97] M. Liu, S.H. Kuo, and W.C Su. Performance analysis for a packet-mode combined input-
output queued switch. In IEEE International Conference on Networking, Sensing and Con-
trol, pages 117–121, March 2004.
[98] Inverse Multiplexing for ATM, Expanding the Revenue Opportunities for Converged Ser-
vices over ATM. Lucent Technologies, 2001.
[99] R. B. Magill, C. E. Rohrs, and R. L. Stevenson. Output-queued switch emulation by fabrics
with limited memory. IEEE Journal on Selected Areas in Communications, 21(4):606–615,
2003.
[100] Y. Mansour and B. Patt-Shamir. Jitter control in QoS networks. IEEE/ACM Transactions
on Networking, 9(4):492–502, August 2001.
[101] M. A. Marsan, A. Bianco, P. Giaccone, E. Leonardi, and F. Neri. Packet-mode scheduling in
input-queued cell-based switches. IEEE/ACM Transactions on Networking, 10(5):666–678,
October 2002.
[102] M. A. Marsan, A. Bianco, P. Giaccone, E. Leonardi, and F. Neri. Multicast traffic in input-
queued switches: optimal scheduling and maximum throughput. IEEE/ACM Transactions
on Networking, 11(3):465–477, 2003.
[103] N. McKeown. The islip scheduling algorithm for input-queued switches. IEEE/ACM Trans-
actions on Networking, 7(2):188–201, 1999.
[104] N. McKeown, A. Mekkittikul, V. Anantharam, and J. Walrand. Achieving 100% through-
put in an input-queued switch. IEEE Transactions on Communications, 47(8):1260–1267,
August 1999.
135
Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007
[105] A. Mekkittikul. Scheduling Non-uniform Traffic in High Speed Packet Switches and Routers.
PhD thesis, Stanford University, November 1998.
[106] A. Mekkittikul and N. McKeown. A practical scheduling algorithm to achieve 100%
throughput in input-queued switches. In IEEE INFOCOM, volume 2, pages 792–799, April
1998.
[107] M.D. Mitzenmacher. The power of two choices in randomized load balancing. PhD thesis,
University of California at Berkley, Fall 1996.
[108] M.D. Mitzenmacher. On the analysis of randomized load balancing schemes. Theory of
Computer Systems, 32:361–386, 1999.
[109] S. Mneimneh, V. Sharma, and K.Y. Siu. Switching using parallel input-output queued
switches with no speedup. IEEE/ACM Transactions on Networking, 10(5):653–665, 2002.
[110] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[111] J. Nagle. RFC 970: On packet switches with infinite storage, December 1985.
[112] WAN packet size distribution. The National Laboratory for Applied Network Research.
Available online at: http://www.nlanr.net/NA/Learn/packetsizes.html.
[113] P. Pappu, J. Parwatikar, J. Turner, and K. Wong. Distributed queueing in scalable high
performance routers. In IEEE INFOCOM, 2003.
[114] P. Pappu, J. Turner, and K. Wong. Work-conserving distributed schedulers for terabit routers.
In ACM SIGCOMM, pages 257–268, 2004.
[115] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in
integrated service networks: the single node case. IEEE/ACM Transactions on Networking,
1:344–357, 1993.
[116] C. Partridge. Internet RFC 1257: Isochronous applications do not require jitter-controlled
networks, September 1991.
136
Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007
[117] M.R. Pearlman, Z.J. Haas, P. Sholander, and S.S. Tabrizi. On the impact of alternate path
routing for load balancing in mobile ad hoc networks. In MobiHoc, pages 3–10, 2000.
[118] G. F. Pfister and V. A. Norton. Hot spot contention and combining in multistage intercon-
nection networks. IEEE Transactions on Computers, 34(10):943–948, 1985.
[119] Inverse multiplexing over ATM works today. PMC-Sierra, Inc., 2002.
http://www. electronicstalk.com/news/pmc/pmc121.html.
[120] B. Prabhakar and N. McKeown. On the speedup required for combined input and output
queued switching. Automatica, 35(12):1909–1920, December 1999.
[121] A. Prakash, A. Aziz, and V. Ramachandran. A near optimal schedule for switch-memory-
switch routers. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA),
pages 343–352, 2003.
[122] A. Prakash, A. Aziz, and V. Ramachandran. Randomized parallel schedulers for switch-
memory-switch routers: Analysis and numerical studies. In IEEE INFOCOM, 2004.
[123] F. Rendl. On the complexity of decomposing matrices arising in satellite communication.
Operations Research Letters, 4:5–8, 1985.
[124] A. Rosen. A note on models for non-probabilistic analysis of packet switching network.
Information Processing Letters, 84(5):237–240, December 2002.
[125] J. Sgall. On-line scheduling - a survey. On-line Algorithms: The State of the Art, pages
196–231, 1998.
[126] D. Shah and M. Kopikare. Delay bounds for approximate maximum weight matching algo-
rithms for input queued switches. In IEEE INFOCOM, 2002.
[127] S. Sharif, A. Aziz, and A. Prakash. An O(log2 N) parallel algorithm for output queuing. In
IEEE INFOCOM, 2002.
[128] R. Sinha, C. Papadopoulos, and J. Heidemann. Internet Packet Size Distributions: Some
Observations, 2005. Available online at:
http://netweb.usc.edu/˜rsinha/pkt-sizes/.
137
Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007
[129] D.D. Sleator and R.E. Tarjan. Amortized efficiency of list update and paging rules. Com-
munications of the ACM, 28(2):202–208, 1985.
[130] D.C. Stephens and H. Zhang. Implementing distributed packet fair queueing in a scalable
architecture. In IEEE INFOCOM, pages 282–290, 1998.
[131] I. Stoica and H. Zhang. Exact emulation of an output queueing switch by a combined input
output queueing switch. In Sixth IEEE/IFIP International Workshop on Quality of Service,
pages 218–224, 1998.
[132] Y. Tamir and H.C. Chi. Symmetric crossbar arbiters for VLSI communication switches.
IEEE Transactions on Parallel and Distributed Systems, 4(1):13–27, January 1993.
[133] Y. Tamir and G. L. Frazier. High-performance multi-queue buffers for VLSI communica-
tions switches. In Proceedings of the 15th Annual International Symposium on Computer
architecture, 1988.
[134] A.S. Tanenbaum. Computer Networks. Prentice Hall, fourth edition, 2003.
[135] R.E. Tarjan. Data structures and network algorithms. Society for Industrial and Applied
Mathematics, 1983.
[136] L. Tassiulas. Linear complexity algorithms for maximum throughput in radio networks and
input queued switches. In IEEE INFOCOM, pages 533–539, 1998.
[137] B. Towles and W.J. Dally. Guaranteed scheduling for switches with configuration overhead.
IEEE/ACM Transactions on Networking, 11(5):835–847, 2003.
[138] J. Turner. Strong performance guarantees for asynchronous crossbar schedulers. In IEEE
INFOCOM, 2006. To appear.
[139] J. Turner and N. Yamanaka. Architectural choices in large scale atm switches. IEICE
Transactions, E81-B(2):120–137, February 1998.
[140] J.S. Turner. New directions in communications (or which way to the information age?).
IEEE Communications Magazine, 24(10):8–15, October 1986.
138
Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007
[141] ERX-700/1400, Edge Routing Switch. Unisphere Solutions, Inc.
http://www.juniper.net/techpubs/software/erx/erx130/product overview.pdf .
[142] G. Varghese. Network Algorithmics: An Interdisciplinary Approach to Designing Fast Net-
worked Devices. Morgan Kaufmann Publishers, Inc., 2005.
[143] D.C. Verma, H. Zhang, and D. Ferrari. Guaranteeing delay jitter bounds in packet switching
networks. In Proceedings of TriComm, pages 35–46, 1991.
[144] J. von Neumann. A certain zero-sum two-person game equivalent to the optimal assignment
problem. Contributions to the Theory of Games, 2:5–12, 1953.
[145] T. Weller and B. Hajek. Scheduling nonuniform traffic in a packet-switching system with
small propagation delay. IEEE/ACM Transactions on Networking, 5(6):813–823, 1997.
[146] C-S. Wu, J-C. Jiau, and K-J. Chen. Characterizing traffic behavior and providing end-to-end
service guarantees within atm networks. In IEEE INFOCOM, volume 1, pages 336–344,
1997.
[147] H. Zhang. Service disciplines for guaranteed performance service in packet switching net-
works. Proceedings of the IEEE, 83(10):1374–1396, October 1995.
[148] H. Zhang. Providing end-to-end performance guarantees using non-work-conserving disci-
plines. Computer Communications: Special Issue on System Support for Multimedia Com-
puting, 18(10), October 1995.
[149] H. Zhang and D. Ferrari. Rate-controlled service disciplines. Journal of High-Speed Net-
works, 3(4):389–412, 1994.
139
Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007
Abstract

The rapid growth both in the bandwidth demands of Internet users and in the rates at which modern communication networks operate turns the basic network elements, the switches and routers, into one of the main bottlenecks for the performance of the entire network.

Today, existing switches and routers operate at rates of up to 40 gigabits per second with hundreds of input and output ports. Moreover, contemporary communication networks are required to integrate several types of traffic (for example, IP traffic together with voice and video traffic), and therefore the switch (or router) must meet strict quality-of-service requirements that vary from application to application. To cope with these tasks, modern routers and switches are equipped with sophisticated control mechanisms, such as packet-scheduling and queue-management algorithms. As switches become larger and faster, the use of parallel and distributed architectures expands; these architectures require additional mechanisms for coordinating and balancing the load among their various components.

In a modern switch, three characteristic bottlenecks that affect performance can be identified: the address-lookup process, which determines for each packet arriving at the switch the output port to which it is destined; the switching process, which is responsible for transferring packets from the inputs to the outputs; and the packet-scheduling process, which decides which packet leaves a given output port at any given time.

This thesis focuses mainly on problems arising from the bottleneck created by the switching process. It contributes to the development of analytic research methods for evaluating the performance of the control mechanisms in various switches; in addition, the results of the research make it possible to compare different switch architectures.

In general, given an existing switch architecture, for which the packet arrival rates, the location of the buffers, and the location of the control lines are known, we propose switching algorithms and analyze their performance. In addition, we prove inherent limitations of the architecture and point out important choices in switch design. It is important to emphasize that the proposed switching algorithms depend closely on the specific architecture for which they are intended; this research therefore involves a wide variety of different algorithmic problems.

Most of the results presented in this research are relative: they are measured in comparison with an optimal switch that is not constrained by its architecture. As in the analysis of online algorithms, which have no information about future events, the competitive approach of this research is natural, since the amount of information that is unavailable to the switching algorithm is one of the most significant factors in its performance.

It is important to note that such a competitive analysis does not depend on probabilistic assumptions about the incoming traffic (assumptions that can be misleading in some cases). This makes it possible to expose advantages and weaknesses of the mechanisms under consideration. Furthermore, analytic evaluation, and especially worst-case analysis, is of particular importance, since it makes it possible to guarantee a certain quality of service (unlike experimental evaluation, which is usually based on simulations and does not cover all possible situations).

We now survey the main results of this research.

First, we studied the parallel packet switch (PPS) architecture. Under this architecture, packets arriving at the switch are sent to their destination in parallel through intermediate switches that operate at a slower rate. We analytically characterized the relative queuing delay of cells (a cell is a packet of fixed, uniform size) in the PPS architecture compared with an optimal switch. This measure captures the effect of the switch's parallelism on its overall performance. The upper and lower bounds depend on the amount and kind of information available to the algorithm responsible for balancing the load among the intermediate switches, and they expose significant differences between the performance of different algorithms: sharing information among the different inputs, even if it is slightly outdated, significantly improves the performance of the switch.

A common paradigm for balancing load evenly is the use of randomization: even extremely simple randomized strategies guarantee, with high probability, a maximum load close to the optimal allocation. In light of the success of randomization in traditional load-balancing problems and in other switches, it is tempting to use randomization in the PPS architecture as well, in order to improve its performance. However, we show that randomization cannot reduce the average queuing delay. This surprising result stems from the practical requirement that switches do not reorder the cells leaving them. This property allows an adversary to exploit a transient increase in the relative delay and to perpetuate it long enough that the average delay grows accordingly.

On the other hand, we present a general methodology for analyzing the maximal queuing delay by measuring the imbalance between the slow intermediate switches. This methodology serves to design new optimal algorithms that use slightly outdated information about the global state of the switch. Moreover, it provides, for the first time, a complete analysis of known distributed algorithms.

An additional research topic is the performance analysis of the combined input-output queued (CIOQ) switch architecture, and in particular of switching algorithms that handle variable-length packets.

The direct need for switching variable-length packets stems from the fact that under most known communication protocols the traffic consists of packets of variable size (for example, IP packets). Nevertheless, contemporary switches treat such packets as cells of uniform size, segmenting each packet into cells and reassembling it outside the switch. Switches that handle variable-size packets directly are constrained to transfer each packet contiguously from the inputs to the outputs, without breaking it. In return, such switches avoid the overhead incurred by packet segmentation and reassembly, an overhead that can be very significant in fast switches (for example, in optical switches).

In this research, we devised switching algorithms that allow contiguous packet transfer in CIOQ switches. We showed that a small speedup of the switch fabric suffices to achieve an emulation (with bounded relative queuing delay) of an optimal switch. The algorithms we presented operate at low granularity; that is, they make switching decisions only once every several time slots (unlike common algorithms for this architecture, which operate several times in every time slot). The algorithms are based on matrix-decomposition techniques.

The algorithms we presented expose trade-offs between different characteristics of the switch: its internal speedup versus the relative queuing delay it achieves. When the relative queuing delay is large, the speedup required of the switch is twice the packet arrival rate. Since a similar requirement is known also for cell switching, this result implies that the switch does not need a higher speedup in order to keep packet transfers contiguous. The relative queuing delay can be shortened significantly by increasing the speedup. Nevertheless, we proved that it is impossible to achieve an emulation of an optimal switch with no relative queuing delay at all, regardless of the speedup of the switch.

In addition, we evaluated the performance of our algorithms by simulations. The simulation results show that in practice the presented algorithms perform better than their theoretical bounds.

A third research topic is providing quality of service in an isolated environment, in which the incoming traffic should be reshaped so that it meets the quality-of-service requirements. In this research we concentrated on the delay-jitter regulator, which must shape the incoming traffic so that it is as periodic (or smooth) as possible. The demand for smooth traffic arises mainly in interactive communication (for example, voice or video traffic). In such applications, bounds on the delay jitter translate directly into bounds on the buffer size at the destination. Following the sharp rise in video traffic in recent years, delay jitter has become an important measure in modern switches.

Delay-jitter regulators use internal buffers in order to reshape the traffic, and it is therefore important to study the interrelation between the buffer size and the regulator's success in its task. Moreover, in realistic situations such regulators must handle several streams simultaneously and achieve a small delay jitter on each of them separately.

In this research, we presented an offline algorithm for a multi-stream regulator that achieves the optimal delay jitter for a given buffer of size B; the algorithm runs in polynomial time. For online algorithms, in contrast, we presented an upper bound and a lower bound showing that the buffer size required to achieve the optimal delay jitter depends linearly on the number of streams.
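To make the quantities in the jitter-regulation discussion concrete, the following sketch implements a naive single-stream shaper. It is only an illustrative toy under our own naming (the thesis treats the much harder multi-stream and online settings): releases are forced onto a fixed-period grid, the delay jitter is measured as the spread of per-packet delays, and the buffer requirement is the peak number of packets held.

```python
def delay_jitter(arrivals, releases):
    """Delay jitter of a schedule: the spread of the per-packet delays."""
    delays = [r - a for a, r in zip(arrivals, releases)]
    return max(delays) - min(delays)

def periodic_shaper(arrivals, period):
    """Release packet i at offset + i*period, with the smallest offset that
    never releases a packet before it has arrived.  The output is perfectly
    periodic, so the remaining delay jitter reflects only how far the
    arrivals deviate from the chosen period."""
    offset = max(a - i * period for i, a in enumerate(arrivals))
    return [offset + i * period for i in range(len(arrivals))]

def peak_buffer(arrivals, releases):
    """Largest number of packets held by the regulator at the same time."""
    return max(sum(1 for j in range(i + 1) if releases[j] > arrivals[i])
               for i in range(len(arrivals)))

if __name__ == "__main__":
    arrivals = [0, 3, 5, 9]                  # bursty arrivals, nominal period 2
    releases = periodic_shaper(arrivals, 2)
    print(releases)                          # [3, 5, 7, 9]
    print(delay_jitter(arrivals, releases))  # 3
    print(peak_buffer(arrivals, releases))   # 1
```

The example shows the trade-off discussed above in miniature: forcing a smoother output costs buffer space, and with a buffer capped at some size B the achievable jitter can only get worse.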
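The randomization paradigm mentioned earlier in the abstract (even naive random placement is nearly balanced with high probability) can be demonstrated with the classical balls-into-bins experiment. The function below is a generic textbook sketch with hypothetical naming, not an algorithm from the thesis:

```python
import random

def max_load(n, d, seed=0):
    """Throw n balls into n bins.  Each ball samples d bins uniformly at
    random and joins the least loaded of them ("the power of d choices")."""
    rng = random.Random(seed)
    load = [0] * n
    for _ in range(n):
        best = min((rng.randrange(n) for _ in range(d)),
                   key=lambda b: load[b])
        load[best] += 1
    return max(load)

if __name__ == "__main__":
    n = 100_000
    # With d=1 the maximum load is Theta(log n / log log n) w.h.p.;
    # already with d=2 it drops to Theta(log log n), much closer to the
    # perfectly balanced load of 1.
    print("one choice :", max_load(n, 1))
    print("two choices:", max_load(n, 2))
```

This kind of near-optimal balance is exactly what makes randomization attractive for spreading traffic over the intermediate switches of a PPS, which is why the negative result for the average queuing delay is surprising.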
The research was done under the supervision of Prof. Hagit Attiya in the Faculty of Computer Science.

Hagit, there are no words to express how grateful I am for your help and patient guidance throughout all these years. I feel fortunate to have had the opportunity to work with you and learn from you, as a researcher, as a teacher, and first and foremost as a human being. I always felt that your door was open to me, for any question, any thought, or simply a chat. Among the countless things I learned from you, I especially appreciate how you perfectly balanced guiding me with preserving my freedom and independence as a researcher. You are, without doubt, the kind of advisor every student dreams of (and much, much more).

I would like to thank the people with whom I collaborated, Dr. Isaac Keslassy, Gabriel Scalosub, and Prof. Jennifer L. Welch. The periods in which we worked together were among the most enjoyable of all my studies. I also thank Dr. Isaac Keslassy for organizing my visit to Cisco during the summer of 2006; that visit contributed greatly to my research.

I thank the members of my examining committee: Prof. Yishai Mansour, Prof. Seffi Naor, Prof. Israel Cidon, Dr. Isaac Keslassy, Dr. Adi Rosen, and Prof. Danny Raz. I benefited greatly from your comments and insights.

I would also like to thank all the other people in the Faculty of Computer Science with whom I worked and who helped me throughout my years in the faculty, beginning with my undergraduate studies.

Special thanks to Moshe Saikevich for the constant moral support (and for the advice and the graphical help). Moshe, without your support none of this would have happened.

Last (though not in order of importance), I thank my parents, Yael and Yigal Hay, my grandparents, Zvi and Zvia Hay and Julia and Yaakov Gersi, my brothers Eyal, Roee, and Asaf, and the rest of my family, for always being there by my side and supporting me in every decision and choice I made.

I thank the Wolf and Blankstein funds for their generous financial support of my studies.
Competitive Evaluation of
Switch Architectures

Research Thesis

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of
Philosophy

David Hay

Submitted to the Senate of the
Technion - Israel Institute of Technology

Iyyar 5767 Haifa April 2007