Devavrat Shah Stanford University

Devavrat Shah

Stanford University

Randomization and Heavy Traffic Theory: New approaches to the design and analysis

of switch algorithms

3

Network algorithms

• Algorithms implemented in networks, e.g. in

– switches/routers scheduling algorithms routing lookup packet classification

– memory/buffer managers maintaining statistics active queue management bandwidth partitioning

– load balancers randomized load balancing

– web caches eviction schemes placement of caches in a network

4

Network algorithms: challenges

• Time constraint: Need to make complicated decisions very quickly– e.g. line speeds in the Internet core 10Gbps (soon to be 40Gbps) packets arrive roughly every 50ns (roughly 10ns)

• Limited computational resources – due to rigid space and heat dissipation constraints

• Algorithms need to be very simple so as to be implementable– but simple algorithms may perform poorly, if not well-designed

5

An illustration

• Consider the following simple system

• Algorithms– serve a randomly chosen queue: very easy

– longest queue first (LQF): complicated to implement

• Performance– serve at random: does not give maximum throughput

– LQF: max throughput and minimizes the maximum backlog

Capacity =1

6

Input queued switch

• Multiple queues and servers

7

Input queued switch

• Crossbar constraints– each input can connect to at most one output

– each output can connect to at most one input

Crossbar fabric

1

2

1 2 3

3

Not allowed

8

Switch scheduling



Crossbar fabric

1

2

1 2 3

3

9

Switch scheduling



Crossbar fabric

1

2

1 2 3

3

10

Switch scheduling



Crossbar fabric

1

2

1 2 3

3

11

Notation and definitions

1

2

1 2 3

3

12

Useful performance metric

• Throughput

– an algorithm is stable (or delivers 100% throughput) if for any admissible arrival, the average backlog is bounded; i.e.

• Average delay or average backlog (queue-size)

13

Schedule ´ Bipartite graph matching

19

3421

18

7

1

Schedule or Matching

14

Scheduling algorithms

Not stable

19

3421

18

7

1

PracticalMaximal Matchings

– iSLIP: Cisco’s GSR 12000 Series Nick McKeown

– Wave Front Arbiter Tamir and Chi

– Parallel Iterative Matching Anderson, Owicki, Saxe, Thacker

15

Scheduling algorithms

Not stable

Stable (Tassiulas-Ephremides 92,

McKeown et. al. 96, Dai-Prabhakar 00)

Not stable (McKeown-Ananthram-Walrand 96)

19

3421

18

7

1

PracticalMaximal Matchings

Max Wt Matching

19

18

Max Size Matching

19

1

7

16

The Maximum Weight Matching Algorithm

• MWM: performance– throughput: stable (Tassiulas-Ephremides 92; McKeown et al 96; Dai-Prabhakar 00)

– backlogs: very low on average (Leonardi et al 01; Shah-Kopikare 02)

• MWM: implementation – has cubic worst-case complexity (approx. 27,000 iterations for a 30-port switch) – MWM algorithms involve backtracking: i.e. edges laid down in one iteration may be removed in a subsequent

iteration algorithm not amenable to pipelining

17

Switch algorithms

Stable and low backlogsNot stable

Better performance

Easier to implement

Maximal matching Max Wt Matching

19

18

Max Size Matching

19

1

7

Not stable

18

Summarizing…

• Need algorithms that are simple and perform well– can process packets economically at line speeds– deliver a high throughput– and low latencies/backlogs

• Need to develop methods for analyzing the performance of these algorithms– traditional methods aren’t good enough

19

Rest of the talk

• A simple, high performance scheduling algorithm

– Randomized approximation of Max Wt Matching(joint work with P. Giaccone and B. Prabhakar)

– A method for analyzing backlogs based on heavy traffic theory (joint work with D. Wischik)

• A new approach to switch scheduling

– Randomized edge coloring (joint work with G. Aggrawal, R. Motwani and A. Zhu)

20

Algorithm I

Randomized approximationto the Max Wt Matching algorithm

21

Randomized approximation to MWM

• Consider the following randomized approximation:At every time - sample d matchings independently and uniformly - use the heaviest of these d matchings to schedule

packets

• Ideally we would like to use a small value of d. However,…

Theorem. Even with d = N, this algorithm is not stable.

In fact, when d=N, the throughput · 1 – e-1 ¼ 0.63. (Giaccone-Prabhakar-Shah 02)

22

Tassiulas’ algorithm

Next time MAX

Previous schedule

S(t-1)

Current schedule

S(t)

Random Matching

R(t)

23

Tassiulas’ algorithm

10

50

10107

060

S(t-1)W(S(t-1))=160

40 3

010

20R(t)

W(R(t))=150

MAX

S(t)

24

Performance of Tassiulas’ algorithm

Theorem (Tassiulas 98): The above scheme is stable under any admissible Bernoulli IID inputs.

25

Backlogs under Tassiulas’ algorithm

0.01

0.1

1

10

100

1000

10000

0 0.2 0.4 0.6 0.8 1

Normalized Load

Mean I

Q L

eng

th

Tassiulas

MWM

26

10

10

10

70

60

S(t-1)

W(S(t-1))=160

50

40

30

10

20

R(t)

W(R(t))=150

Reducing backlogs: the Merge operation

30 v/s 120

130 v/s 30

Merge

27

10

10

10

70

60

S(t-1)

W(S(t-1))=160

50

40

30

10

20

R(t)

W(R(t))=150

Reducing backlogs: the Merge operation

Merge

W(S(t)) = 250

28

Performance of Merge algorithm

Theorem (GPS): The Merge scheme is stable under any admissible Bernoulli IID inputs.

29

Merge v/s Max

0.01

0.1

1

10

100

1000

10000

0 0.2 0.4 0.6 0.8 1

Normalized Load

Mea

n I

Q L

eng

th

TassiulasMergeMWM

30

893

5

23

47

1131

97

S(t-1)

W(S(t-1))=209

Use arrival information: Serena

2

7

The arrival graph

31

893

5

23

47

1131

97

S(t-1)

W(S(t-1))=209


2

The arrival graph

32

893

6

23

47

1131

97

S(t-1)

23

W(S(t-1))=209

0

W=121


Merge

W(S(t))=243

S(t)

89

3

23

31

97

33

Performance of Serena algorithm

Theorem (GPS): The Serena algorithm is stable under any admissible Bernoulli IID inputs.

34

Backlogs under Serena

0.01

0.1

1

10

100

1000

10000

0 0.2 0.4 0.6 0.8 1

Normalized Load

Mea

n I

Q L

ength

TassiulasMergeSerenaMWM

35

Summary of main ideas and results

• Randomization alone is not enough

• Memory (using past sample) is very powerful

• Exploiting problem structure: the Merge operation

• Using arrival information

We obtained an algorithm– whose performance is close to that of MWM– and is very simple to implement

36

A new method for analyzing backlogsvia Heavy Traffic Theory

37

Some context…

• We have seen the design of a simple, high throughput switch scheduling algorithm

• The backlog/delay induced by an algorithm is another very important measure of its performance

0.01

0.1

1

10

100

1000

10000

0 0.2 0.4 0.6 0.8 1

Normalized Load

Mean I

Q L

ength

TassiulasMergeSerenaMWM

• For example, recall Tassiulas’ scheme

38

Delay or backlog analysis

• Delay analysis is harder than throughput analysis because

– throughput measures the long-term efficiency of an algorithm– whereas, the delay (or backlog) induced by an algorithm is a

measure of its instantaneous efficiency

It is also harder to design low delay schedulers

39

Current delay analysis methods are not suitable

• Queuing theory– requires knowledge of service time distribution – but this depends on the scheduling algorithm service distribution is too complicated to determine

• Large deviations: very successful for output-queued switch– N-port OQ switch can be “decomposed” into N independent 1-port switches but an N-port input-queued switch is not decomposable

• Stochastic coupling arguments– effective for establishing: Delay(Alg A) < Delay(Alg B) (without saying how much better) very hard to make for input-queued switches

40

Our approach

• Use the machinery of heavy traffic theory (developed by J. M. Harrison, M. Bramson, R. Williams and many others)

• Focus on the following intriguing observation

– consider the MWM- algorithm: edge(i,j) has weight Qij

– Keslassy-McKeown 01 have observed that the average delay increases with

30

32

34

36

38

40

42

44

46

0 1 2 3 4

ABCD

Ave

rag

e D

ela

y

41

Why this is worth focusing on

• The MWM class of algorithms are the most analyzed– huge body of literature on throughput analysis– hardly anything known about their delay properties– in particular, no explanation of the Keslassy-McKeown observation

• Note the singularity at =0 in the K-M observation– “continuity” implies that delay should be the smallest when =0– however, MWM-0 is the Maximum Size Matching algorithm

– MSM is known to be unstable; i.e. delays are infinite when =0 ! (McKeown-Ananthram-Walrand 96)– Question: if MSM is not the optimum, who is?

Heavy traffic theory helps answer these questions

42

Heavy traffic theory

• Consider a network with N queues– as some parameter r ! 1– let the load on the system approach its capacity (hence the name HT)– speed up time by a factor r2, and scale down space by a factor r; i.e. quantity of interest: Qi(r2t)/r = Qi

r(t), say

as r ! 1, we’re looking at the zoomed out system both in time and space

this type of scaling occurs in the Central Limit Theorem

r2 1/r

Original system

Scaledsystem

time

43

Heavy traffic theory

• Consider a network with N queues– as some parameter r ! 1– let the load on the system approach its capacity (hence the name HT)– speed up time by a factor r2, and scale down space by a factor r; i.e. quantity of interest: Qi(r2t)/r = Qi

r(t), say

as r ! 1, we’re looking at the zoomed out system both in time and space

this type of scaling occurs in the Central Limit Theorem

• Main result: As r ! 1– (Q1

r(t),…,QNr(t)) ! (Q1

1(t),…,QN1(t)) = Q1(t),

which is a multidimensional Reflected Brownian Motion (RBM) (such limiting results are independent of details of distribution)

w

44

State space collapse: dimension reduction

• Q1(t) can roam in lower dimensional space, say of dim K < N

45

State space collapse: dimension reduction

1r

Nr

Q1r(t)

QNr(t)

• The HT limit procedure for LQF system

– load: r=1r++N

r = 1-r-1 ! 1

– state: Qr(t)=(Q1r(t),,QN

r(t)) ! Q1(t) = (Q11(t),, QN

1(t))

• State space collapse

– because of the LQF policy, Q11(t) = = QN

1(t)

– or, the dimension of the state space is just 1, not N

Capacity =1

• Q1(t) can roam in lower dimensional space, say of dim K < N

46

State space collapse: performance

• More importantly, the state space induced by an IQ switch scheduling algorithm gives an indication of its performance; i.e.

State space(Alg A) ½ State space(Alg B) ) Alg B is better than Alg A

• So , to resolve the Keslassy-McKeown observation, we need to1. determine the state space of MWM-, and2. show

State space(MWM-1) ½ State space(MWM-2) if 1 > 2

47

19 17

3

6

4 MWM=28

Switch: state space collapse

4

?

17

• The dimension of the state space is N2, at most– what is the dimension due to state space collapse under MWM ?

• Given that MWM tries to equalize the weights of all matchings,– can we determine the minimum number of queue-sizes we need to know in order to work out the

size of all the N2 queues

• Lemma (Shah-Wischik): Knowing the sizes of any 2N-2 queues is not sufficient. But there exist 2N-1 queues whose sizes are sufficient to know.

• Consider an example of 3-port switch

48

19 17

3

6

4 MWM=28







7 ?

4

19

7

17

49

19 17

3

6

4 MWM=28


7 5

4






50

19 17

3

6

4 MWM=28


7 5






?

51

19 17

3

6

4 MWM=28


7 5






18

52

19 17

3

6

4 MWM=28

18


7 5

4

7

3

18

5

?






53

19 17

3

6

4 MWM=28

18


7 5

4

7

3

18

5

5






54

N-port switch

12

19 17

3

6

4

5

• For larger N: obtain entries inductively

55

N-port switch

12

19 17

3

6

4

9


7

18

5

5

6 12

9?

19

3

56

N-port switch

12

19 17

3

6

4

9


7

18

5

5

6 12

93

19

3 10

24

24

57

N-port switch

12

19 17

3

6

4

9

6 12

9

19

3

35 27 31 148

30

22

78

• We use the following 2N-1 entries: W = (R1,…,RN-1; C1,…,CN-1;WNN)

58

MWM-: state space

Theorem (Shah-Wischik)

Under MWM-, given W = (R1,…RN-1,C1,…CN-1,WNN), the corresponding switch state [Qij] is the unique solution of

minimize ij Qij

kQik ¸ Ri 8 i, (row-sum)

kQkj ¸ Cj 8 j, (column-sum)

k,l Qkl ¸ WNN (overall-sum)

subject to:

Qij = 0 if ij = 0

59

MWM-0.5: state space

Cross-section

R1

C1

W

C1

R1

Q11=0

22

• Consider a 2-port switch: dimension of the collapsed space is 3

60

MWM-1: state space

R1

C1

W

Cross-section

R1

C1

Q11=0

22

61

22

MWM-2: state space

W

R1

C1

Cross-section

R1

C1 Q11=0

62

Comparison

MWM-0.5 MWM-1 MWM-2

• As increases, the size of the state space decreases

• Therefore, the delay increases

theoretically explains the Keslassy-McKeown observation

63

What happens at = 0?

• As ! 0, MWM- does not become the MSM

• Rather, it is the Maximum-weighted Maximum Size Matching algorithm – find all possible Max Size Matchings– if more than one MSM exist, pick the one with maximum weight

• Intuitively,– Maximum Size Matching is for good delay – Max Weight is for good throughput

• Mekkitikkul and McKeown 98 showed how to obtain the M-W MSM using Max Weight algorithm, and called their algorithm Longest Port First

Conjecture: The LPF algorithm has the largest possible state space. Hence it has the optimal delay performance.

(Shah-Wischik)

64


• There is a large body of literature on throughput analysis, but hardly anything about the delay of switch algorithms

• We developed a new method to analyze delay– based on Heavy Traffic Theory– it is simple and general

• We applied the method – to shed light on an intriguing observation of Keslassy-McKeown– and to characterize a delay optimal algorithm

65

Algorithm II

Randomized edge coloring

66

Edge coloring

• Let G=(V,E) be a graph,– be the maximum vertex degree.

• Edge coloring: – assign colors to the edges in E so that no two edges sharing a vertex have the same color.

67

Edge coloring and switch scheduling

11

2

1

1

Weighted Graph

2

1

1

Multi Graph

1

2

1 2 3

3

68

Contd.

Multi-graph with =4

Schedules

Edge color

T=1 T=2 T=3

Edge-colored multi-graph

T=4

69

Edge-coloring algorithms

• Bipartite multi-graph can be colored using exactly colors– but, algorithm involves backtracking difficult to implement

• Next, consider a simple Greedy algorithm:– pick edges one at a time and in any order– assign the smallest available color

• Greedy algorithm:– simple and distributed– pipelineable easy to implement

70

Greedy algorithm

Edge colored multi-graph

Greedyedge color

5 time slots are required even though the graph is 4-colorable! 80% efficiency or throughput !

Colors:

In the worst case, this algorithm can take 2-1 colors switch gives 50% throughput !

Multi-graph with =4

71

A stability theorem

Theorem. A switch scheduling algorithm gives 100% throughput iff the corresponding edge coloring algorithm uses + o() colors.

(Aggrawal-Motwani-Shah-Zhu 03)

• We want to obtain an edge coloring algorithm which - uses + o() colors- is simple to implement (like Greedy)

• Previous work on such algorithms (Dubhashi, Grable, Panconesi 96)

- uses (1+c) , for c > 0 ) throughput is 1/(1+c)- is a lot more complicated than Greedy

72

Our algorithm: Randomized edge-coloring

• Let G be a bipartite multi-graph on N input and output vertices with max degree

• Rand-Color: Let 1,…, be initial pool of colors.

– Step 1: (Randomized version of Greedy with colors)

a) sample an uncolored edge uniformly at randomb) assign a color uniformly at random from the available colorsc) when no more edges can be colored, go to Step 2

– Step 2: color the remaining edges using the Greedy algorithm

73

Performance

Theorem (AMSZ). Under the following two conditions, the algorithm Rand-Color uses + o() colors w.h.p. (hence it gives 100% throughput).

Condition I:

Condition II:

74


• Edge coloring algorithms as a new approach to switch scheduling– not based on MWM– low complexity– new methods to establish stability

• Could be an interesting alternative to MWM-type algorithms– may be useful in general (non-switch) networks

75

Conclusions

• Network algorithms

– design involves a trade off between simplicity and performance– analysis requires new methods

• In the context of switch scheduling, I presented

– design of a simple, high-performance switch algorithm– a new analysis method based on heavy traffic theory– finally, I presented a new approach, based on edge coloring, for

designing scheduling algorithms

Devavrat Shah Stanford University

Documents

Transcript of Devavrat Shah Stanford University