Devavrat Shah Stanford University

Page 1: Devavrat Shah Stanford University

Devavrat Shah

Stanford University

Randomization and Heavy Traffic Theory: New approaches to the design and analysis of switch algorithms

Page 2: Devavrat Shah Stanford University

Network algorithms

• Algorithms implemented in networks, e.g. in

– switches/routers: scheduling algorithms, routing lookup, packet classification

– memory/buffer managers: maintaining statistics, active queue management, bandwidth partitioning

– load balancers: randomized load balancing

– web caches: eviction schemes, placement of caches in a network

Page 3: Devavrat Shah Stanford University

Network algorithms: challenges

• Time constraint: need to make complicated decisions very quickly

– e.g. line speeds in the Internet core are 10 Gbps (soon to be 40 Gbps), so packets arrive roughly every 50 ns (soon roughly every 10 ns)

• Limited computational resources

– due to rigid space and heat dissipation constraints

• Algorithms need to be very simple so as to be implementable

– but simple algorithms may perform poorly if not well designed
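
The time budget per packet can be checked with a line of arithmetic. The sketch below assumes 64-byte packets, which is an assumption made here for illustration (the slide does not state a packet size); with that choice the numbers land close to the quoted 50 ns and 10 ns.

```python
def packet_time_ns(packet_bytes: int, line_rate_gbps: float) -> float:
    """Time to receive one packet, in nanoseconds, at the given line rate."""
    bits = packet_bytes * 8
    # bits divided by (Gbit/s) gives nanoseconds
    return bits / line_rate_gbps


# Assumed packet size: 64 bytes (not stated on the slide)
t_10g = packet_time_ns(64, 10.0)
t_40g = packet_time_ns(64, 40.0)
print(f"10 Gbps: {t_10g:.1f} ns per packet, 40 Gbps: {t_40g:.1f} ns per packet")
```

A scheduler therefore has a few tens of nanoseconds, at most, to make each decision.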

Page 4: Devavrat Shah Stanford University

An illustration

• Consider the following simple system

[Figure: queues served by a single server; Capacity = 1]

• Algorithms

– serve a randomly chosen queue: very easy

– longest queue first (LQF): complicated to implement

• Performance

– serve at random: does not give maximum throughput

– LQF: max throughput and minimizes the maximum backlog
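
The contrast between the two policies can be seen in a small simulation. This is a sketch under stated assumptions, not the experiment behind the slide: time is slotted, arrivals are Bernoulli, the server removes one packet per slot, and "serve at random" picks uniformly among all queues, including empty ones. The arrival rates (0.6, 0.3) are hypothetical.

```python
import random


def simulate(policy, rates, slots=20000, seed=0):
    """Return the total backlog after `slots` time slots under the given policy."""
    rng = random.Random(seed)
    queues = [0] * len(rates)
    for _ in range(slots):
        # Bernoulli arrivals: queue i receives a packet with probability rates[i]
        for i, lam in enumerate(rates):
            if rng.random() < lam:
                queues[i] += 1
        if policy == "random":
            # Assumed variant: choose among ALL queues, possibly an empty one
            i = rng.randrange(len(queues))
        else:
            # Longest queue first
            i = max(range(len(queues)), key=queues.__getitem__)
        if queues[i] > 0:
            queues[i] -= 1  # capacity = 1 packet per slot
    return sum(queues)


rates = [0.6, 0.3]  # hypothetical rates; total load 0.9 < capacity 1
backlog_random = simulate("random", rates)
backlog_lqf = simulate("lqf", rates)
print("random:", backlog_random, "LQF:", backlog_lqf)
```

Under these assumptions the random policy gives each queue only half the capacity, so the queue with rate 0.6 grows without bound, while LQF keeps the backlog small.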

Page 5: Devavrat Shah Stanford University

Input queued switch

• Multiple queues and servers

Page 6: Devavrat Shah Stanford University

Input queued switch

• Crossbar constraints

– each input can connect to at most one output

– each output can connect to at most one input

[Figure: crossbar fabric connecting inputs 1, 2, 3 to outputs 1, 2, 3, with one configuration labeled "Not allowed"]

Page 7: Devavrat Shah Stanford University

Switch scheduling

• Crossbar constraints

– each input can connect to at most one output

– each output can connect to at most one input

[Figure: crossbar fabric connecting inputs 1, 2, 3 to outputs 1, 2, 3]
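
The two crossbar constraints amount to saying that a schedule is a matching between inputs and outputs. A minimal sketch of that check follows; the function name and the representation of a schedule as a list of (input, output) pairs are choices made here for illustration.

```python
def is_valid_schedule(edges):
    """edges: list of (input, output) pairs connected in one time slot."""
    inputs = [i for i, _ in edges]
    outputs = [j for _, j in edges]
    # No input and no output may appear more than once
    return len(set(inputs)) == len(inputs) and len(set(outputs)) == len(outputs)


print(is_valid_schedule([(1, 2), (2, 1), (3, 3)]))  # each port used once
print(is_valid_schedule([(1, 1), (1, 2)]))          # input 1 used twice
```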

Page 10: Devavrat Shah Stanford University

Notation and definitions

[Figure: input-queued switch with inputs 1, 2, 3 and outputs 1, 2, 3]

Page 11: Devavrat Shah Stanford University

Useful performance metrics

• Throughput

– an algorithm is stable (or delivers 100% throughput) if, for any admissible arrival process, the average backlog is bounded

• Average delay or average backlog (queue size)
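
The slide does not spell out "admissible". The sketch below assumes the usual definition for an input-queued switch with unit-capacity ports: no input and no output is offered more than one packet per slot on average, i.e. every row sum and column sum of the arrival-rate matrix is below 1. The rate matrices are hypothetical.

```python
def is_admissible(rates):
    """rates[i][j]: average arrivals per slot at input i destined to output j."""
    n = len(rates)
    rows_ok = all(sum(rates[i]) < 1.0 for i in range(n))
    cols_ok = all(sum(rates[i][j] for i in range(n)) < 1.0 for j in range(n))
    return rows_ok and cols_ok


# Hypothetical 3-port examples
uniform = [[0.3, 0.3, 0.3]] * 3          # every row and column sums to 0.9
overloaded = [[0.6, 0.1, 0.1],
              [0.5, 0.1, 0.1],
              [0.0, 0.1, 0.1]]           # output 0 is offered 1.1 packets per slot
print(is_admissible(uniform), is_admissible(overloaded))
```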

Page 12: Devavrat Shah Stanford University

Schedule ≡ Bipartite graph matching

[Figure: weighted bipartite graph between inputs and outputs; a schedule corresponds to a matching]

Page 13: Devavrat Shah Stanford University

Scheduling algorithms

• Practical: maximal matchings (not stable)

– iSLIP (Cisco’s GSR 12000 Series): Nick McKeown

– Wave Front Arbiter: Tamir and Chi

– Parallel Iterative Matching: Anderson, Owicki, Saxe, Thacker

[Figure: weighted bipartite graph between inputs and outputs]

Page 14: Devavrat Shah Stanford University

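Scheduling algorithms

• Practical: maximal matchings: not stable

• Max Wt Matching: stable (Tassiulas-Ephremides 92, McKeown et al. 96, Dai-Prabhakar 00)

• Max Size Matching: not stable (McKeown-Anantharam-Walrand 96)

[Figure: weighted bipartite graph; the Max Wt Matching shown uses the edges of weight 19 and 18, the Max Size Matching shown uses the edges of weight 19, 1 and 7]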

Page 15: Devavrat Shah Stanford University

The Maximum Weight Matching Algorithm

• MWM: performance

– throughput: stable (Tassiulas-Ephremides 92; McKeown et al. 96; Dai-Prabhakar 00)

– backlogs: very low on average (Leonardi et al. 01; Shah-Kopikare 02)

• MWM: implementation

– has cubic worst-case complexity (approx. 27,000 iterations for a 30-port switch)

– MWM algorithms involve backtracking, i.e. edges laid down in one iteration may be removed in a subsequent iteration, so the algorithm is not amenable to pipelining
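
To make the object concrete, here is a brute-force sketch that finds a maximum weight matching by trying every permutation. It is for illustration on tiny switches only: it takes N! steps, whereas the cubic algorithms referred to above need on the order of 30³ = 27,000 steps for 30 ports. The queue-size matrix is hypothetical.

```python
from itertools import permutations


def max_weight_matching(q):
    """q[i][j]: queue size at input i for output j. Returns (matching, weight),
    where matching[i] is the output connected to input i."""
    n = len(q)

    def weight(p):
        return sum(q[i][p[i]] for i in range(n))

    best = max(permutations(range(n)), key=weight)
    return list(best), weight(best)


# Hypothetical 3-port queue sizes
q = [[19, 3, 0],
     [0, 18, 7],
     [1, 0, 4]]
matching, w = max_weight_matching(q)
print(matching, w)
print("cubic step count for N = 30:", 30 ** 3)
```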

Page 16: Devavrat Shah Stanford University

Switch algorithms

[Figure: trade-off between "Easier to implement" and "Better performance". Maximal matching: not stable. Max Size Matching: not stable. Max Wt Matching: stable and low backlogs.]

Page 17: Devavrat Shah Stanford University

Summarizing…

• Need algorithms that are simple and perform well

– can process packets economically at line speeds

– deliver a high throughput

– and low latencies/backlogs

• Need to develop methods for analyzing the performance of these algorithms

– traditional methods aren’t good enough

Page 18: Devavrat Shah Stanford University

Rest of the talk

• A simple, high performance scheduling algorithm

– Randomized approximation of Max Wt Matching (joint work with P. Giaccone and B. Prabhakar)

– A method for analyzing backlogs based on heavy traffic theory (joint work with D. Wischik)

• A new approach to switch scheduling

– Randomized edge coloring (joint work with G. Aggarwal, R. Motwani and A. Zhu)

Page 19: Devavrat Shah Stanford University

Algorithm I

Randomized approximation to the Max Wt Matching algorithm

Page 20: Devavrat Shah Stanford University

Randomized approximation to MWM

• Consider the following randomized approximation: at every time

– sample d matchings independently and uniformly at random

– use the heaviest of these d matchings to schedule packets

• Ideally we would like to use a small value of d. However, …

Theorem. Even with d = N, this algorithm is not stable.

In fact, when d = N, the throughput is at most $1 - e^{-1} \approx 0.63$. (Giaccone-Prabhakar-Shah 02)
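
One scheduling decision of this scheme can be sketched as follows. The representation (a matching as a permutation list, a uniform matching drawn by shuffling) and the queue-size matrix are assumptions made for illustration.

```python
import random


def weight(matching, q):
    """Total queue size served by a matching (matching[i] = output of input i)."""
    return sum(q[i][j] for i, j in enumerate(matching))


def sample_and_pick(q, d, rng):
    """Sample d matchings uniformly at random and return the heaviest."""
    n = len(q)
    samples = []
    for _ in range(d):
        m = list(range(n))
        rng.shuffle(m)  # a uniformly random permutation is a uniformly random matching
        samples.append(m)
    return max(samples, key=lambda m: weight(m, q))


# Hypothetical 3-port queue sizes; d = N = 3
q = [[19, 3, 0],
     [0, 18, 7],
     [1, 0, 4]]
chosen = sample_and_pick(q, d=3, rng=random.Random(1))
print(chosen, weight(chosen, q))
```

Nothing is remembered between time slots here, which is the weakness the next slides address.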

Page 21: Devavrat Shah Stanford University

Tassiulas’ algorithm

[Diagram: the current schedule S(t) is the MAX (heavier) of the previous schedule S(t-1) and a random matching R(t); S(t) then serves as the previous schedule at the next time]

Page 22: Devavrat Shah Stanford University

Tassiulas’ algorithm

[Example: the previous schedule has weight W(S(t-1)) = 160 and the random matching has weight W(R(t)) = 150; MAX keeps the heavier of the two as S(t)]

Page 23: Devavrat Shah Stanford University

Performance of Tassiulas’ algorithm

Theorem (Tassiulas 98): The above scheme is stable under any admissible Bernoulli IID inputs.
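
A minimal sketch of one step of the scheme: draw one random matching and keep whichever of it and the previous schedule is heavier. Ties are broken in favor of the previous schedule, which is a choice made here. For illustration the queue sizes are held fixed (in a real switch they change every slot), so the weight of the schedule can only go up.

```python
import random


def weight(matching, q):
    return sum(q[i][j] for i, j in enumerate(matching))


def tassiulas_step(prev, q, rng):
    """S(t) = heavier of S(t-1) and a uniformly random matching R(t)."""
    r = list(range(len(q)))
    rng.shuffle(r)
    return prev if weight(prev, q) >= weight(r, q) else r


# Hypothetical 3-port queue sizes, held fixed for the illustration
q = [[19, 3, 0],
     [0, 18, 7],
     [1, 0, 4]]
rng = random.Random(0)
s = [1, 2, 0]  # arbitrary initial schedule
weights = []
for _ in range(20):
    s = tassiulas_step(s, q, rng)
    weights.append(weight(s, q))
print(weights)
```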

Page 24: Devavrat Shah Stanford University

Backlogs under Tassiulas’ algorithm

[Figure: mean IQ length (log scale, 0.01 to 10000) versus normalized load (0 to 1), for Tassiulas and MWM]

Page 25: Devavrat Shah Stanford University

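Reducing backlogs: the Merge operation

[Figure: S(t-1) with edge weights 10, 10, 10, 70, 60, so W(S(t-1)) = 160; R(t) with edge weights 50, 40, 30, 10, 20, so W(R(t)) = 150. Merge compares the two matchings piece by piece: 30 v/s 120 and 130 v/s 30]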

Page 26: Devavrat Shah Stanford University

Reducing backlogs: the Merge operation

[Figure: S(t-1) with edge weights 10, 10, 10, 70, 60, so W(S(t-1)) = 160; R(t) with edge weights 50, 40, 30, 10, 20, so W(R(t)) = 150. Merge produces S(t) with W(S(t)) = 250]
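
A sketch of a merge of two full matchings: their union splits into alternating cycles, and within each cycle the heavier side is kept. The placement of the edges below is hypothetical; it is chosen only so that the weights reproduce the numbers on the slides (160, 150, 30 v/s 120, 130 v/s 30, and 250).

```python
def merge(s, r, q):
    """s, r: full matchings (input i -> output s[i] or r[i]); q: weight matrix.
    Returns a matching that takes, cycle by cycle, the heavier of s and r."""
    n = len(s)
    r_inv = [0] * n
    for i, j in enumerate(r):
        r_inv[j] = i
    merged = [None] * n
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        # Follow the S edge out of an input, then the R edge back to an input
        cycle = []
        i = start
        while not seen[i]:
            seen[i] = True
            cycle.append(i)
            i = r_inv[s[i]]
        w_s = sum(q[i][s[i]] for i in cycle)
        w_r = sum(q[i][r[i]] for i in cycle)
        chosen = s if w_s >= w_r else r
        for i in cycle:
            merged[i] = chosen[i]
    return merged


# Hypothetical 5-port layout reproducing the slide's weights
s = [0, 1, 2, 3, 4]
r = [1, 2, 0, 4, 3]
q = [[0] * 5 for _ in range(5)]
for i, w in enumerate([10, 10, 10, 70, 60]):
    q[i][s[i]] = w
for i, w in enumerate([50, 40, 30, 10, 20]):
    q[i][r[i]] = w

m = merge(s, r, q)
total = sum(q[i][m[i]] for i in range(5))
print(m, total)
```

The merged matching is at least as heavy as either input, and here it is strictly heavier than both.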

Page 27: Devavrat Shah Stanford University

Performance of Merge algorithm

Theorem (GPS): The Merge scheme is stable under any admissible Bernoulli IID inputs.

Page 28: Devavrat Shah Stanford University

Merge v/s Max

[Figure: mean IQ length (log scale, 0.01 to 10000) versus normalized load (0 to 1), for Tassiulas, Merge and MWM]

Page 29: Devavrat Shah Stanford University

Use arrival information: Serena

[Figure: the previous schedule S(t-1), with W(S(t-1)) = 209, shown alongside the arrival graph]

Page 31: Devavrat Shah Stanford University

Use arrival information: Serena

[Figure: the previous schedule S(t-1), with W(S(t-1)) = 209, is merged with a matching of weight W = 121 obtained from the arrival graph; the result is S(t) with W(S(t)) = 243]

Page 32: Devavrat Shah Stanford University

Performance of Serena algorithm

Theorem (GPS): The Serena algorithm is stable under any admissible Bernoulli IID inputs.

Page 33: Devavrat Shah Stanford University

Backlogs under Serena

[Figure: mean IQ length (log scale, 0.01 to 10000) versus normalized load (0 to 1), for Tassiulas, Merge, Serena and MWM]

Page 34: Devavrat Shah Stanford University

Summary of main ideas and results

• Randomization alone is not enough

• Memory (using past sample) is very powerful

• Exploiting problem structure: the Merge operation

• Using arrival information

We obtained an algorithm

– whose performance is close to that of MWM

– and is very simple to implement

Page 35: Devavrat Shah Stanford University

A new method for analyzing backlogs via Heavy Traffic Theory

Page 36: Devavrat Shah Stanford University

Some context…

• We have seen the design of a simple, high throughput switch scheduling algorithm

• The backlog/delay induced by an algorithm is another very important measure of its performance

• For example, recall Tassiulas’ scheme

[Figure: mean IQ length (log scale, 0.01 to 10000) versus normalized load (0 to 1), for Tassiulas, Merge, Serena and MWM]

Page 37: Devavrat Shah Stanford University

Delay or backlog analysis

• Delay analysis is harder than throughput analysis because

– throughput measures the long-term efficiency of an algorithm

– whereas the delay (or backlog) induced by an algorithm is a measure of its instantaneous efficiency

It is also harder to design low delay schedulers

Page 38: Devavrat Shah Stanford University

Current delay analysis methods are not suitable

• Queuing theory

– requires knowledge of the service time distribution

– but this depends on the scheduling algorithm, so the service distribution is too complicated to determine

• Large deviations: very successful for the output-queued switch

– an N-port OQ switch can be “decomposed” into N independent 1-port switches, but an N-port input-queued switch is not decomposable

• Stochastic coupling arguments

– effective for establishing Delay(Alg A) < Delay(Alg B) (without saying how much better)

– very hard to make for input-queued switches

Page 39: Devavrat Shah Stanford University

Our approach

• Use the machinery of heavy traffic theory (developed by J. M. Harrison, M. Bramson, R. Williams and many others)

• Focus on the following intriguing observation

– consider the MWM-$\alpha$ algorithm: edge $(i,j)$ has weight $Q_{ij}^{\alpha}$

– Keslassy-McKeown 01 have observed that the average delay increases with $\alpha$

[Figure: plot of average delay (vertical axis, 30 to 46) against a horizontal axis running from 0 to 4]

Page 40: Devavrat Shah Stanford University

Why this is worth focusing on

• The MWM class of algorithms is the most analyzed

– huge body of literature on throughput analysis

– hardly anything known about their delay properties

– in particular, no explanation of the Keslassy-McKeown observation

• Note the singularity at $\alpha = 0$ in the K-M observation

– “continuity” implies that delay should be the smallest when $\alpha = 0$

– however, MWM-0 is the Maximum Size Matching (MSM) algorithm

– MSM is known to be unstable, i.e. delays are infinite when $\alpha = 0$! (McKeown-Anantharam-Walrand 96)

– Question: if MSM is not the optimum, what is?

Heavy traffic theory helps answer these questions

Page 42: Devavrat Shah Stanford University

Heavy traffic theory

• Consider a network with N queues

– as some parameter $r \to \infty$

– let the load on the system approach its capacity (hence the name HT)

– speed up time by a factor $r^2$, and scale down space by a factor $r$; i.e. the quantity of interest is $Q_i(r^2 t)/r = Q_i^r(t)$, say

– as $r \to \infty$, we are looking at the system zoomed out in both time and space

– this type of scaling occurs in the Central Limit Theorem

• Main result: as $r \to \infty$

– $(Q_1^r(t), \ldots, Q_N^r(t)) \xrightarrow{w} (Q_1^\infty(t), \ldots, Q_N^\infty(t)) = Q^\infty(t)$, which is a multidimensional Reflected Brownian Motion (RBM) (such limiting results are independent of the details of the distributions)
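
The scaling can be tried out on a single queue. The sketch below is an illustration under assumptions made here, not part of the talk: one slotted queue, two Bernoulli arrival streams, one departure per slot, and load $1 - 1/r$. It simulates $r^2$ slots per unit of scaled time and reads off $Q^r(t) = Q(r^2 t)/r$.

```python
import random


def simulate_queue(r, horizon, rng):
    """Queue lengths Q(0), ..., Q(r^2 * horizon) at load 1 - 1/r."""
    load = 1.0 - 1.0 / r
    q, path = 0, [0]
    for _ in range(int(r * r * horizon)):
        # Two Bernoulli(load / 2) arrival streams (assumed arrival model)
        q += (rng.random() < load / 2) + (rng.random() < load / 2)
        q = max(q - 1, 0)  # one departure per slot if the queue is nonempty
        path.append(q)
    return path


def scaled(path, r, t):
    """Diffusion-scaled queue: Q^r(t) = Q(r^2 t) / r."""
    return path[round(r * r * t)] / r


rng = random.Random(0)
r = 20
path = simulate_queue(r, horizon=1, rng=rng)
samples = [scaled(path, r, k / 10) for k in range(11)]
print(samples)
```

Increasing r gives a longer, busier sample path that is viewed from further away, which is the "zoomed out" picture described above.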

Page 43: Devavrat Shah Stanford University

State space collapse: dimension reduction

• $Q^\infty(t)$ can roam in a lower dimensional space, say of dimension $K < N$

Page 44: Devavrat Shah Stanford University

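State space collapse: dimension reduction

• $Q^\infty(t)$ can roam in a lower dimensional space, say of dimension $K < N$

[Figure: N queues $Q_1^r(t), \ldots, Q_N^r(t)$ with arrival rates $\lambda_1^r, \ldots, \lambda_N^r$, served by a single server; Capacity = 1]

• The HT limit procedure for the LQF system

– load: $\rho^r = \lambda_1^r + \cdots + \lambda_N^r = 1 - r^{-1} \to 1$

– state: $Q^r(t) = (Q_1^r(t), \ldots, Q_N^r(t)) \to Q^\infty(t) = (Q_1^\infty(t), \ldots, Q_N^\infty(t))$

• State space collapse

– because of the LQF policy, $Q_1^\infty(t) = \cdots = Q_N^\infty(t)$

– or, the dimension of the state space is just 1, not N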

Page 45: Devavrat Shah Stanford University

State space collapse: performance

• More importantly, the state space induced by an IQ switch scheduling algorithm gives an indication of its performance; i.e.

State space(Alg A) $\subset$ State space(Alg B) $\Rightarrow$ Alg B is better than Alg A

• So, to resolve the Keslassy-McKeown observation, we need to

1. determine the state space of MWM-$\alpha$, and

2. show State space(MWM-$\alpha_1$) $\subset$ State space(MWM-$\alpha_2$) if $\alpha_1 > \alpha_2$

Page 52: Devavrat Shah Stanford University

Switch: state space collapse

• The dimension of the state space is at most $N^2$

– what is the dimension due to state space collapse under MWM?

• Given that MWM tries to equalize the weights of all matchings,

– can we determine the minimum number of queue sizes we need to know in order to work out the sizes of all the $N^2$ queues?

• Lemma (Shah-Wischik): Knowing the sizes of any $2N-2$ queues is not sufficient. But there exist $2N-1$ queues whose sizes are sufficient to know.

• Consider an example of a 3-port switch

[Figure: 3-port switch with MWM = 28; the known queue sizes 19, 17, 3, 6, 4 determine the remaining four, 7, 5, 18 and 5]
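
The 3-port example can be reproduced numerically. If all matchings have the same weight, then swapping two assignments cannot change the weight, so $Q_{ij} + Q_{kl} = Q_{il} + Q_{kj}$ for any two rows and two columns. The sketch below uses that identity to fill in a matrix from 2N-1 = 5 known entries. The positions of the known entries are an assumption made here; only the values 19, 17, 3, 6, 4 and the weight 28 come from the slide.

```python
from itertools import permutations


def complete(known, n):
    """known: {(i, j): size}. Fills in the missing entries using
    Q[i][j] = Q[i][l] + Q[k][j] - Q[k][l], valid when all matchings weigh the same."""
    q = dict(known)
    changed = True
    while changed and len(q) < n * n:
        changed = False
        for i in range(n):
            for j in range(n):
                if (i, j) in q:
                    continue
                for k in range(n):
                    for l in range(n):
                        needed = [(i, l), (k, j), (k, l)]
                        if k != i and l != j and all(c in q for c in needed):
                            q[(i, j)] = q[(i, l)] + q[(k, j)] - q[(k, l)]
                            changed = True
                            break
                    if (i, j) in q:
                        break
    return [[q.get((i, j)) for j in range(n)] for i in range(n)]


# Assumed positions for the five known sizes from the slide
known = {(0, 0): 19, (0, 1): 17, (1, 1): 3, (1, 2): 4, (2, 2): 6}
full = complete(known, 3)
weights = {sum(full[i][p[i]] for i in range(3)) for p in permutations(range(3))}
print(full, weights)
```

With this layout the four recovered entries are 18, 5, 7 and 5, and every one of the six matchings has weight 28.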

Page 53: Devavrat Shah Stanford University

N-port switch

• For larger N: obtain entries inductively

[Figure: queue-size matrix of a larger switch, built on the 3-port example]

Page 56: Devavrat Shah Stanford University

N-port switch

• We use the following $2N-1$ entries: $W = (R_1, \ldots, R_{N-1}; C_1, \ldots, C_{N-1}; W_{NN})$

[Figure: queue-size matrix annotated with its row and column sums]

Page 57: Devavrat Shah Stanford University

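MWM-$\alpha$: state space

Theorem (Shah-Wischik)

Under MWM-$\alpha$, given $W = (R_1, \ldots, R_{N-1}, C_1, \ldots, C_{N-1}, W_{NN})$, the corresponding switch state $[Q_{ij}]$ is the unique solution of

minimize $\sum_{ij} Q_{ij}^{1+\alpha}$

subject to:

$\sum_k Q_{ik} \ge R_i \;\; \forall i$ (row-sum)

$\sum_k Q_{kj} \ge C_j \;\; \forall j$ (column-sum)

$\sum_{k,l} Q_{kl} \ge W_{NN}$ (overall-sum)

$Q_{ij} = 0$ if $\lambda_{ij} = 0$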

Page 58: Devavrat Shah Stanford University

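MWM-0.5: state space

• Consider a 2-port switch: dimension of the collapsed space is 3

[Figure: state space in the coordinates $(R_1, C_1, W)$, with a cross-section in the $(R_1, C_1)$ plane; one boundary is labeled $Q_{11} = 0$]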

Page 59: Devavrat Shah Stanford University

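MWM-1: state space

[Figure: state space in the coordinates $(R_1, C_1, W)$, with a cross-section in the $(R_1, C_1)$ plane; one boundary is labeled $Q_{11} = 0$]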

Page 60: Devavrat Shah Stanford University

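MWM-2: state space

[Figure: state space in the coordinates $(R_1, C_1, W)$, with a cross-section in the $(R_1, C_1)$ plane; one boundary is labeled $Q_{11} = 0$]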

Page 61: Devavrat Shah Stanford University

Comparison

[Figure: state spaces of MWM-0.5, MWM-1 and MWM-2 side by side]

• As $\alpha$ increases, the size of the state space decreases

• Therefore, the delay increases

– this theoretically explains the Keslassy-McKeown observation

Page 62: Devavrat Shah Stanford University

What happens at $\alpha = 0$?

• As $\alpha \to 0$, MWM-$\alpha$ does not become the MSM

• Rather, it becomes the Maximum-weighted Maximum Size Matching (M-W MSM) algorithm

– find all possible Max Size Matchings

– if more than one MSM exists, pick the one with maximum weight

• Intuitively,

– Maximum Size Matching is for good delay

– Max Weight is for good throughput

• Mekkittikul and McKeown 98 showed how to obtain the M-W MSM using the Max Weight algorithm, and called their algorithm Longest Port First (LPF)

Conjecture: The LPF algorithm has the largest possible state space. Hence it has the optimal delay performance.

(Shah-Wischik)

Page 63: Devavrat Shah Stanford University

Summary of main ideas and results

• There is a large body of literature on throughput analysis, but hardly anything about the delay of switch algorithms

• We developed a new method to analyze delay

– based on Heavy Traffic Theory

– it is simple and general

• We applied the method

– to shed light on an intriguing observation of Keslassy-McKeown

– and to characterize a delay optimal algorithm

Page 64: Devavrat Shah Stanford University

Algorithm II

Randomized edge coloring

Page 65: Devavrat Shah Stanford University

Edge coloring

• Let G = (V, E) be a graph

– let $\Delta$ be the maximum vertex degree

• Edge coloring:

– assign colors to the edges in E so that no two edges sharing a vertex have the same color

Page 66: Devavrat Shah Stanford University

Edge coloring and switch scheduling

[Figure: a 3-port switch with inputs 1, 2, 3 and outputs 1, 2, 3, drawn as a weighted graph and as the equivalent multi-graph]

Page 67: Devavrat Shah Stanford University

Contd.

[Figure: a multi-graph with $\Delta = 4$ and its edge-colored version; each edge color gives the schedule of one time slot, T=1, T=2, T=3, T=4]

Page 68: Devavrat Shah Stanford University

Edge-coloring algorithms

• A bipartite multi-graph can be colored using exactly $\Delta$ colors

– but the algorithm involves backtracking, so it is difficult to implement

• Next, consider a simple Greedy algorithm:

– pick edges one at a time and in any order

– assign the smallest available color

• Greedy algorithm:

– simple and distributed

– pipelineable, so easy to implement

Page 69: Devavrat Shah Stanford University

Greedy algorithm

[Figure: a multi-graph with $\Delta = 4$ and the edge coloring produced by Greedy]

• 5 time slots are required even though the graph is 4-colorable: 80% efficiency or throughput!

• In the worst case, this algorithm can take $2\Delta - 1$ colors, so the switch gives only about 50% throughput!
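
A sketch of the Greedy rule (take edges in the given order, assign the smallest color free at both endpoints). The small example graph is hypothetical, not the one in the figure: it has maximum degree 2, so 2 colors would suffice, yet in this edge order Greedy needs 3 = 2·2 − 1.

```python
def greedy_edge_coloring(edges):
    """edges: list of (input, output) pairs of a bipartite multi-graph,
    in the order they are processed. Returns one color (0, 1, ...) per edge."""
    used_in, used_out = {}, {}
    colors = []
    for u, v in edges:
        taken = used_in.setdefault(u, set()) | used_out.setdefault(v, set())
        c = 0
        while c in taken:  # smallest color not used at either endpoint
            c += 1
        colors.append(c)
        used_in[u].add(c)
        used_out[v].add(c)
    return colors


# Hypothetical example with maximum degree 2
edges = [(0, 0), (1, 1), (1, 2), (0, 2)]
colors = greedy_edge_coloring(edges)
print(colors, "colors used:", len(set(colors)))
```

Each color class is a matching, so the number of colors is the number of time slots needed to serve all the packets.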

Page 70: Devavrat Shah Stanford University

A stability theorem

Theorem. A switch scheduling algorithm gives 100% throughput iff the corresponding edge coloring algorithm uses $\Delta + o(\Delta)$ colors.

(Aggarwal-Motwani-Shah-Zhu 03)

• We want to obtain an edge coloring algorithm which

– uses $\Delta + o(\Delta)$ colors

– is simple to implement (like Greedy)

• Previous work on such algorithms (Dubhashi, Grable, Panconesi 96)

– uses $(1+c)\Delta$ colors, for $c > 0$ $\Rightarrow$ throughput is $1/(1+c)$

– is a lot more complicated than Greedy

Page 71: Devavrat Shah Stanford University

Our algorithm: Randomized edge-coloring

• Let G be a bipartite multi-graph on N input and output vertices with max degree $\Delta$

• Rand-Color: Let $1, \ldots, \Delta$ be the initial pool of colors.

– Step 1: (Randomized version of Greedy with $\Delta$ colors)

a) sample an uncolored edge uniformly at random

b) assign a color uniformly at random from the available colors

c) when no more edges can be colored, go to Step 2

– Step 2: color the remaining edges using the Greedy algorithm
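
The two steps can be sketched as below. This is one reading of the slide rather than the authors' implementation: Step 1 visits the edges in a uniformly random order (an edge with no free color in the pool is set aside, since it can never become colorable later), and Step 2 runs Greedy on what is left, opening new colors as needed. Colors are numbered from 0 and the example graph is hypothetical.

```python
import random


def rand_color(edges, rng):
    """edges: non-empty list of (input, output) pairs of a bipartite multi-graph.
    Returns (colors, delta) with one 0-based color per edge."""
    deg_in, deg_out = {}, {}
    for u, v in edges:
        deg_in[u] = deg_in.get(u, 0) + 1
        deg_out[v] = deg_out.get(v, 0) + 1
    delta = max(list(deg_in.values()) + list(deg_out.values()))

    used_in, used_out = {}, {}
    colors = [None] * len(edges)

    def free(u, v, c):
        return c not in used_in.get(u, set()) and c not in used_out.get(v, set())

    def assign(e, c):
        u, v = edges[e]
        colors[e] = c
        used_in.setdefault(u, set()).add(c)
        used_out.setdefault(v, set()).add(c)

    # Step 1: random edge order, random color from the pool 0 .. delta-1
    order = list(range(len(edges)))
    rng.shuffle(order)
    leftover = []
    for e in order:
        u, v = edges[e]
        available = [c for c in range(delta) if free(u, v, c)]
        if available:
            assign(e, rng.choice(available))
        else:
            leftover.append(e)

    # Step 2: Greedy on the remaining edges (smallest free color, pool or new)
    for e in leftover:
        u, v = edges[e]
        c = 0
        while not free(u, v, c):
            c += 1
        assign(e, c)
    return colors, delta


# Hypothetical multi-graph: two parallel edges between input 0 and output 0
edges = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0)]
colors, delta = rand_color(edges, random.Random(0))
print(colors, "delta =", delta)
```

Whatever the random choices, the result is a proper coloring with at most $2\Delta - 1$ colors; the theorem on the next slide concerns when it uses only $\Delta + o(\Delta)$.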

Page 72: Devavrat Shah Stanford University

Performance

Theorem (AMSZ). Under the following two conditions, the algorithm Rand-Color uses $\Delta + o(\Delta)$ colors w.h.p. (hence it gives 100% throughput).

Condition I:

Condition II:

Page 73: Devavrat Shah Stanford University

Summary of main ideas and results

• Edge coloring algorithms as a new approach to switch scheduling

– not based on MWM

– low complexity

– new methods to establish stability

• Could be an interesting alternative to MWM-type algorithms

– may be useful in general (non-switch) networks

Page 74: Devavrat Shah Stanford University

Conclusions

• Network algorithms

– design involves a trade-off between simplicity and performance

– analysis requires new methods

• In the context of switch scheduling, I presented

– the design of a simple, high-performance switch algorithm

– a new analysis method based on heavy traffic theory

– finally, a new approach, based on edge coloring, for designing scheduling algorithms