Devavrat Shah Stanford University
description
Transcript of Devavrat Shah Stanford University
Devavrat Shah
Stanford University
Randomization and Heavy Traffic Theory: New approaches to the design and analysis
of switch algorithms
3
Network algorithms
• Algorithms implemented in networks, e.g. in
– switches/routers scheduling algorithms routing lookup packet classification
– memory/buffer managers maintaining statistics active queue management bandwidth partitioning
– load balancers randomized load balancing
– web caches eviction schemes placement of caches in a network
4
Network algorithms: challenges
• Time constraint: Need to make complicated decisions very quickly– e.g. line speeds in the Internet core 10Gbps (soon to be 40Gbps) packets arrive roughly every 50ns (roughly 10ns)
• Limited computational resources – due to rigid space and heat dissipation constraints
• Algorithms need to be very simple so as to be implementable– but simple algorithms may perform poorly, if not well-designed
5
An illustration
• Consider the following simple system
• Algorithms– serve a randomly chosen queue: very easy
– longest queue first (LQF): complicated to implement
• Performance– serve at random: does not give maximum throughput
– LQF: max throughput and minimizes the maximum backlog
Capacity =1
6
Input queued switch
• Multiple queues and servers
7
Input queued switch
• Crossbar constraints– each input can connect to at most one output
– each output can connect to at most one input
Crossbar fabric
1
2
1 2 3
3
Not allowed
8
Switch scheduling
• Crossbar constraints– each input can connect to at most one output
– each output can connect to at most one input
Crossbar fabric
1
2
1 2 3
3
9
Switch scheduling
• Crossbar constraints– each input can connect to at most one output
– each output can connect to at most one input
Crossbar fabric
1
2
1 2 3
3
10
Switch scheduling
• Crossbar constraints– each input can connect to at most one output
– each output can connect to at most one input
Crossbar fabric
1
2
1 2 3
3
11
Notation and definitions
1
2
1 2 3
3
12
Useful performance metric
• Throughput
– an algorithm is stable (or delivers 100% throughput) if for any admissible arrival, the average backlog is bounded; i.e.
• Average delay or average backlog (queue-size)
13
Schedule ´ Bipartite graph matching
19
3421
18
7
1
Schedule or Matching
14
Scheduling algorithms
Not stable
19
3421
18
7
1
PracticalMaximal Matchings
– iSLIP: Cisco’s GSR 12000 Series Nick McKeown
– Wave Front Arbiter Tamir and Chi
– Parallel Iterative Matching Anderson, Owicki, Saxe, Thacker
15
Scheduling algorithms
Not stable
Stable (Tassiulas-Ephremides 92,
McKeown et. al. 96, Dai-Prabhakar 00)
Not stable (McKeown-Ananthram-Walrand 96)
19
3421
18
7
1
PracticalMaximal Matchings
Max Wt Matching
19
18
Max Size Matching
19
1
7
16
The Maximum Weight Matching Algorithm
• MWM: performance– throughput: stable (Tassiulas-Ephremides 92; McKeown et al 96; Dai-Prabhakar 00)
– backlogs: very low on average (Leonardi et al 01; Shah-Kopikare 02)
• MWM: implementation – has cubic worst-case complexity (approx. 27,000 iterations for a 30-port switch) – MWM algorithms involve backtracking: i.e. edges laid down in one iteration may be removed in a subsequent
iteration algorithm not amenable to pipelining
17
Switch algorithms
Stable and low backlogsNot stable
Better performance
Easier to implement
Maximal matching Max Wt Matching
19
18
Max Size Matching
19
1
7
Not stable
18
Summarizing…
• Need algorithms that are simple and perform well– can process packets economically at line speeds– deliver a high throughput– and low latencies/backlogs
• Need to develop methods for analyzing the performance of these algorithms– traditional methods aren’t good enough
19
Rest of the talk
• A simple, high performance scheduling algorithm
– Randomized approximation of Max Wt Matching(joint work with P. Giaccone and B. Prabhakar)
– A method for analyzing backlogs based on heavy traffic theory (joint work with D. Wischik)
• A new approach to switch scheduling
– Randomized edge coloring (joint work with G. Aggrawal, R. Motwani and A. Zhu)
20
Algorithm I
Randomized approximationto the Max Wt Matching algorithm
21
Randomized approximation to MWM
• Consider the following randomized approximation:At every time - sample d matchings independently and uniformly - use the heaviest of these d matchings to schedule
packets
• Ideally we would like to use a small value of d. However,…
Theorem. Even with d = N, this algorithm is not stable.
In fact, when d=N, the throughput · 1 – e-1 ¼ 0.63. (Giaccone-Prabhakar-Shah 02)
22
Tassiulas’ algorithm
Next time MAX
Previous schedule
S(t-1)
Current schedule
S(t)
Random Matching
R(t)
23
Tassiulas’ algorithm
10
50
10107
060
S(t-1)W(S(t-1))=160
40 3
010
20R(t)
W(R(t))=150
MAX
S(t)
24
Performance of Tassiulas’ algorithm
Theorem (Tassiulas 98): The above scheme is stable under any admissible Bernoulli IID inputs.
25
Backlogs under Tassiulas’ algorithm
0.01
0.1
1
10
100
1000
10000
0 0.2 0.4 0.6 0.8 1
Normalized Load
Mean I
Q L
eng
th
Tassiulas
MWM
26
10
10
10
70
60
S(t-1)
W(S(t-1))=160
50
40
30
10
20
R(t)
W(R(t))=150
Reducing backlogs: the Merge operation
30 v/s 120
130 v/s 30
Merge
27
10
10
10
70
60
S(t-1)
W(S(t-1))=160
50
40
30
10
20
R(t)
W(R(t))=150
Reducing backlogs: the Merge operation
Merge
W(S(t)) = 250
28
Performance of Merge algorithm
Theorem (GPS): The Merge scheme is stable under any admissible Bernoulli IID inputs.
29
Merge v/s Max
0.01
0.1
1
10
100
1000
10000
0 0.2 0.4 0.6 0.8 1
Normalized Load
Mea
n I
Q L
eng
th
TassiulasMergeMWM
30
893
5
23
47
1131
97
S(t-1)
W(S(t-1))=209
Use arrival information: Serena
2
7
The arrival graph
31
893
5
23
47
1131
97
S(t-1)
W(S(t-1))=209
Use arrival information: Serena
2
The arrival graph
32
893
6
23
47
1131
97
S(t-1)
23
W(S(t-1))=209
0
W=121
Use arrival information: Serena
Merge
W(S(t))=243
S(t)
89
3
23
31
97
33
Performance of Serena algorithm
Theorem (GPS): The Serena algorithm is stable under any admissible Bernoulli IID inputs.
34
Backlogs under Serena
0.01
0.1
1
10
100
1000
10000
0 0.2 0.4 0.6 0.8 1
Normalized Load
Mea
n I
Q L
ength
TassiulasMergeSerenaMWM
35
Summary of main ideas and results
• Randomization alone is not enough
• Memory (using past sample) is very powerful
• Exploiting problem structure: the Merge operation
• Using arrival information
We obtained an algorithm– whose performance is close to that of MWM– and is very simple to implement
36
A new method for analyzing backlogsvia Heavy Traffic Theory
37
Some context…
• We have seen the design of a simple, high throughput switch scheduling algorithm
• The backlog/delay induced by an algorithm is another very important measure of its performance
0.01
0.1
1
10
100
1000
10000
0 0.2 0.4 0.6 0.8 1
Normalized Load
Mean I
Q L
ength
TassiulasMergeSerenaMWM
• For example, recall Tassiulas’ scheme
38
Delay or backlog analysis
• Delay analysis is harder than throughput analysis because
– throughput measures the long-term efficiency of an algorithm– whereas, the delay (or backlog) induced by an algorithm is a
measure of its instantaneous efficiency
It is also harder to design low delay schedulers
39
Current delay analysis methods are not suitable
• Queuing theory– requires knowledge of service time distribution – but this depends on the scheduling algorithm service distribution is too complicated to determine
• Large deviations: very successful for output-queued switch– N-port OQ switch can be “decomposed” into N independent 1-port switches but an N-port input-queued switch is not decomposable
• Stochastic coupling arguments– effective for establishing: Delay(Alg A) < Delay(Alg B) (without saying how much better) very hard to make for input-queued switches
40
Our approach
• Use the machinery of heavy traffic theory (developed by J. M. Harrison, M. Bramson, R. Williams and many others)
• Focus on the following intriguing observation
– consider the MWM- algorithm: edge(i,j) has weight Qij
– Keslassy-McKeown 01 have observed that the average delay increases with
30
32
34
36
38
40
42
44
46
0 1 2 3 4
ABCD
Ave
rag
e D
ela
y
41
Why this is worth focusing on
• The MWM class of algorithms are the most analyzed– huge body of literature on throughput analysis– hardly anything known about their delay properties– in particular, no explanation of the Keslassy-McKeown observation
• Note the singularity at =0 in the K-M observation– “continuity” implies that delay should be the smallest when =0– however, MWM-0 is the Maximum Size Matching algorithm
– MSM is known to be unstable; i.e. delays are infinite when =0 ! (McKeown-Ananthram-Walrand 96)– Question: if MSM is not the optimum, who is?
Heavy traffic theory helps answer these questions
42
Heavy traffic theory
• Consider a network with N queues– as some parameter r ! 1– let the load on the system approach its capacity (hence the name HT)– speed up time by a factor r2, and scale down space by a factor r; i.e. quantity of interest: Qi(r2t)/r = Qi
r(t), say
as r ! 1, we’re looking at the zoomed out system both in time and space
this type of scaling occurs in the Central Limit Theorem
r2 1/r
Original system
Scaledsystem
time
43
Heavy traffic theory
• Consider a network with N queues– as some parameter r ! 1– let the load on the system approach its capacity (hence the name HT)– speed up time by a factor r2, and scale down space by a factor r; i.e. quantity of interest: Qi(r2t)/r = Qi
r(t), say
as r ! 1, we’re looking at the zoomed out system both in time and space
this type of scaling occurs in the Central Limit Theorem
• Main result: As r ! 1– (Q1
r(t),…,QNr(t)) ! (Q1
1(t),…,QN1(t)) = Q1(t),
which is a multidimensional Reflected Brownian Motion (RBM) (such limiting results are independent of details of distribution)
w
44
State space collapse: dimension reduction
• Q1(t) can roam in lower dimensional space, say of dim K < N
45
State space collapse: dimension reduction
1r
Nr
Q1r(t)
QNr(t)
• The HT limit procedure for LQF system
– load: r=1r++N
r = 1-r-1 ! 1
– state: Qr(t)=(Q1r(t),,QN
r(t)) ! Q1(t) = (Q11(t),, QN
1(t))
• State space collapse
– because of the LQF policy, Q11(t) = = QN
1(t)
– or, the dimension of the state space is just 1, not N
Capacity =1
• Q1(t) can roam in lower dimensional space, say of dim K < N
46
State space collapse: performance
• More importantly, the state space induced by an IQ switch scheduling algorithm gives an indication of its performance; i.e.
State space(Alg A) ½ State space(Alg B) ) Alg B is better than Alg A
• So , to resolve the Keslassy-McKeown observation, we need to1. determine the state space of MWM-, and2. show
State space(MWM-1) ½ State space(MWM-2) if 1 > 2
47
19 17
3
6
4 MWM=28
Switch: state space collapse
4
?
17
• The dimension of the state space is N2, at most– what is the dimension due to state space collapse under MWM ?
• Given that MWM tries to equalize the weights of all matchings,– can we determine the minimum number of queue-sizes we need to know in order to work out the
size of all the N2 queues
• Lemma (Shah-Wischik): Knowing the sizes of any 2N-2 queues is not sufficient. But there exist 2N-1 queues whose sizes are sufficient to know.
• Consider an example of 3-port switch
48
19 17
3
6
4 MWM=28
Switch: state space collapse
• The dimension of the state space is N2, at most– what is the dimension due to state space collapse under MWM ?
• Given that MWM tries to equalize the weights of all matchings,– can we determine the minimum number of queue-sizes we need to know in order to work out the
size of all the N2 queues
• Lemma (Shah-Wischik): Knowing the sizes of any 2N-2 queues is not sufficient. But there exist 2N-1 queues whose sizes are sufficient to know.
• Consider an example of 3-port switch
7 ?
4
19
7
17
49
19 17
3
6
4 MWM=28
Switch: state space collapse
7 5
4
• The dimension of the state space is N2, at most– what is the dimension due to state space collapse under MWM ?
• Given that MWM tries to equalize the weights of all matchings,– can we determine the minimum number of queue-sizes we need to know in order to work out the
size of all the N2 queues
• Lemma (Shah-Wischik): Knowing the sizes of any 2N-2 queues is not sufficient. But there exist 2N-1 queues whose sizes are sufficient to know.
• Consider an example of 3-port switch
50
19 17
3
6
4 MWM=28
Switch: state space collapse
7 5
• The dimension of the state space is N2, at most– what is the dimension due to state space collapse under MWM ?
• Given that MWM tries to equalize the weights of all matchings,– can we determine the minimum number of queue-sizes we need to know in order to work out the
size of all the N2 queues
• Lemma (Shah-Wischik): Knowing the sizes of any 2N-2 queues is not sufficient. But there exist 2N-1 queues whose sizes are sufficient to know.
• Consider an example of 3-port switch
?
51
19 17
3
6
4 MWM=28
Switch: state space collapse
7 5
• The dimension of the state space is N2, at most– what is the dimension due to state space collapse under MWM ?
• Given that MWM tries to equalize the weights of all matchings,– can we determine the minimum number of queue-sizes we need to know in order to work out the
size of all the N2 queues
• Lemma (Shah-Wischik): Knowing the sizes of any 2N-2 queues is not sufficient. But there exist 2N-1 queues whose sizes are sufficient to know.
• Consider an example of 3-port switch
18
52
19 17
3
6
4 MWM=28
18
Switch: state space collapse
7 5
4
7
3
18
5
?
• The dimension of the state space is N2, at most– what is the dimension due to state space collapse under MWM ?
• Given that MWM tries to equalize the weights of all matchings,– can we determine the minimum number of queue-sizes we need to know in order to work out the
size of all the N2 queues
• Lemma (Shah-Wischik): Knowing the sizes of any 2N-2 queues is not sufficient. But there exist 2N-1 queues whose sizes are sufficient to know.
• Consider an example of 3-port switch
53
19 17
3
6
4 MWM=28
18
Switch: state space collapse
7 5
4
7
3
18
5
5
• The dimension of the state space is N2, at most– what is the dimension due to state space collapse under MWM ?
• Given that MWM tries to equalize the weights of all matchings,– can we determine the minimum number of queue-sizes we need to know in order to work out the
size of all the N2 queues
• Lemma (Shah-Wischik): Knowing the sizes of any 2N-2 queues is not sufficient. But there exist 2N-1 queues whose sizes are sufficient to know.
• Consider an example of 3-port switch
54
N-port switch
12
19 17
3
6
4
5
• For larger N: obtain entries inductively
55
N-port switch
12
19 17
3
6
4
9
• For larger N: obtain entries inductively
7
18
5
5
6 12
9?
19
3
56
N-port switch
12
19 17
3
6
4
9
• For larger N: obtain entries inductively
7
18
5
5
6 12
93
19
3 10
24
24
57
N-port switch
12
19 17
3
6
4
9
6 12
9
19
3
35 27 31 148
30
22
78
• We use the following 2N-1 entries: W = (R1,…,RN-1; C1,…,CN-1;WNN)
58
MWM-: state space
Theorem (Shah-Wischik)
Under MWM-, given W = (R1,…RN-1,C1,…CN-1,WNN), the corresponding switch state [Qij] is the unique solution of
minimize ij Qij
kQik ¸ Ri 8 i, (row-sum)
kQkj ¸ Cj 8 j, (column-sum)
k,l Qkl ¸ WNN (overall-sum)
subject to:
Qij = 0 if ij = 0
59
MWM-0.5: state space
Cross-section
R1
C1
W
C1
R1
Q11=0
22
• Consider a 2-port switch: dimension of the collapsed space is 3
60
MWM-1: state space
R1
C1
W
Cross-section
R1
C1
Q11=0
22
61
22
MWM-2: state space
W
R1
C1
Cross-section
R1
C1 Q11=0
62
Comparison
MWM-0.5 MWM-1 MWM-2
• As increases, the size of the state space decreases
• Therefore, the delay increases
theoretically explains the Keslassy-McKeown observation
63
What happens at = 0?
• As ! 0, MWM- does not become the MSM
• Rather, it is the Maximum-weighted Maximum Size Matching algorithm – find all possible Max Size Matchings– if more than one MSM exist, pick the one with maximum weight
• Intuitively,– Maximum Size Matching is for good delay – Max Weight is for good throughput
• Mekkitikkul and McKeown 98 showed how to obtain the M-W MSM using Max Weight algorithm, and called their algorithm Longest Port First
Conjecture: The LPF algorithm has the largest possible state space. Hence it has the optimal delay performance.
(Shah-Wischik)
64
Summary of main ideas and results
• There is a large body of literature on throughput analysis, but hardly anything about the delay of switch algorithms
• We developed a new method to analyze delay– based on Heavy Traffic Theory– it is simple and general
• We applied the method – to shed light on an intriguing observation of Keslassy-McKeown– and to characterize a delay optimal algorithm
65
Algorithm II
Randomized edge coloring
66
Edge coloring
• Let G=(V,E) be a graph,– be the maximum vertex degree.
• Edge coloring: – assign colors to the edges in E so that no two edges sharing a vertex have the same color.
67
Edge coloring and switch scheduling
11
2
1
1
Weighted Graph
2
1
1
Multi Graph
1
2
1 2 3
3
68
Contd.
Multi-graph with =4
Schedules
Edge color
T=1 T=2 T=3
Edge-colored multi-graph
T=4
69
Edge-coloring algorithms
• Bipartite multi-graph can be colored using exactly colors– but, algorithm involves backtracking difficult to implement
• Next, consider a simple Greedy algorithm:– pick edges one at a time and in any order– assign the smallest available color
• Greedy algorithm:– simple and distributed– pipelineable easy to implement
70
Greedy algorithm
Edge colored multi-graph
Greedyedge color
5 time slots are required even though the graph is 4-colorable! 80% efficiency or throughput !
Colors:
In the worst case, this algorithm can take 2-1 colors switch gives 50% throughput !
Multi-graph with =4
71
A stability theorem
Theorem. A switch scheduling algorithm gives 100% throughput iff the corresponding edge coloring algorithm uses + o() colors.
(Aggrawal-Motwani-Shah-Zhu 03)
• We want to obtain an edge coloring algorithm which - uses + o() colors- is simple to implement (like Greedy)
• Previous work on such algorithms (Dubhashi, Grable, Panconesi 96)
- uses (1+c) , for c > 0 ) throughput is 1/(1+c)- is a lot more complicated than Greedy
72
Our algorithm: Randomized edge-coloring
• Let G be a bipartite multi-graph on N input and output vertices with max degree
• Rand-Color: Let 1,…, be initial pool of colors.
– Step 1: (Randomized version of Greedy with colors)
a) sample an uncolored edge uniformly at randomb) assign a color uniformly at random from the available colorsc) when no more edges can be colored, go to Step 2
– Step 2: color the remaining edges using the Greedy algorithm
73
Performance
Theorem (AMSZ). Under the following two conditions, the algorithm Rand-Color uses + o() colors w.h.p. (hence it gives 100% throughput).
Condition I:
Condition II:
74
Summary of main ideas and results
• Edge coloring algorithms as a new approach to switch scheduling– not based on MWM– low complexity– new methods to establish stability
• Could be an interesting alternative to MWM-type algorithms– may be useful in general (non-switch) networks
75
Conclusions
• Network algorithms
– design involves a trade off between simplicity and performance– analysis requires new methods
• In the context of switch scheduling, I presented
– design of a simple, high-performance switch algorithm– a new analysis method based on heavy traffic theory– finally, I presented a new approach, based on edge coloring, for
designing scheduling algorithms