Routing Lookups and Packet Classification: Theory and Practice


Transcript of Routing Lookups and Packet Classification: Theory and Practice

Routing Lookups and Packet Classification: Theory and Practice

Pankaj Gupta
Department of Computer Science
Stanford University
pankaj@stanford.edu

http://www.stanford.edu/~pankaj

August 18, 2000

Hot Interconnects 8: High Performance Switching and Routing

2

Tutorial Outline

• Introduction
– What this tutorial is about

• Routing lookups
– Background, lookup schemes

• Packet classification
– Background, classification schemes

• Implementation choices for given design requirements

3

Request to you

• Please ask lots of questions!
– But I may not be able to answer all of them right now

• I am here to learn, so please share your experiences, thoughts and opinions freely

4

What is this tutorial about?

5

Internet: Mesh of Routers

The Internet Core

Edge Router

Campus Area Network

6

RFC 1812: Requirements for IPv4 Routers

• Must perform an IP datagram forwarding decision (called forwarding)

• Must send the datagram out the appropriate interface (called switching)

Optionally: a router MAY choose to perform special processing on incoming packets

7

Examples of special processing

• Filtering packets for security reasons
• Delivering packets according to a pre-agreed delay guarantee
• Treating high priority packets preferentially
• Maintaining statistics on the number of packets sent by various routers

8

Special Processing Requires Identification of Flows

• All packets of a flow obey a pre-defined rule and are processed similarly by the router

• E.g. a flow = (src-IP-address, dst-IP-address), or a flow = (dst-IP-prefix, protocol) etc.

• Router needs to identify the flow of every incoming packet and then perform appropriate special processing

9

Flow-aware vs Flow-unaware Routers

• Flow-aware router: keeps track of flows and performs similar processing on packets in a flow

• Flow-unaware router (packet-by-packet router): treats each incoming packet individually

10

What this tutorial is about:

• Algorithms and techniques that an IP router uses to decide where to forward the packets next (routing lookup)

• Algorithms and techniques that a flow-aware router uses to classify packets into flows (packet classification)

11

Routing Lookups

12

Routing Lookups: Outline

• Background and problem definition

• Lookup schemes
• Comparative evaluation

13

Lookup in an IP Router

Unicast destination address based lookup

[Diagram: the incoming packet's header carries the destination address; the forwarding engine performs a next-hop computation against a forwarding table of (destination-prefix, next-hop) entries and returns the next hop.]

14

Packet-by-packet Router

[Diagram: several linecards connected by an interconnect; each linecard makes its own forwarding decision against a local copy of the forwarding table, while a routing processor maintains the table.]

15

Packet-by-packet Router: Basic Architectural Components

[Diagram: the control plane runs routing protocols; the per-packet datapath performs routing lookup, switching, and scheduling.]

16

ATM and MPLS Switches: Direct Lookup

[Diagram: the incoming (port, VCI/label) is used directly as a memory address; the memory word read out is the outgoing (port, VCI/label).]

17

IPv4 Addresses

• 32-bit addresses
• Dotted quad notation: e.g. 12.33.32.1
• Can be represented as integers on the IP number line [0, 2^32 - 1]: a.b.c.d denotes the integer a*2^24 + b*2^16 + c*2^8 + d

[Number line: from 0.0.0.0 to 255.255.255.255]
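The dotted-quad-to-integer mapping above is easy to check in code; a minimal sketch (the function name is illustrative):

```python
def ip_to_int(addr: str) -> int:
    # a.b.c.d -> a*2^24 + b*2^16 + c*2^8 + d
    a, b, c, d = (int(x) for x in addr.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

print(ip_to_int("12.33.32.1"))  # 203497473
```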

18

Class-based Addressing

[Number line: class A spans from 0.0.0.0 up to 128.0.0.0; class B begins at 128.0.0.0, class C at 192.0.0.0, then classes D and E.]

Class          Range                        MS bits  netid      hostid
A              0.0.0.0 - 127.255.255.255    0        bits 1-7   bits 8-31
B              128.0.0.0 - 191.255.255.255  10       bits 2-15  bits 16-31
C              192.0.0.0 - 223.255.255.255  110      bits 3-23  bits 24-31
D (multicast)  224.0.0.0 - 239.255.255.255  1110     -          -
E (reserved)   240.0.0.0 - 255.255.255.255  11110    -          -

19

Lookups with Class-based Addresses

[Diagram: the class is inferred from the address's leading bits; the netid is extracted (e.g. 186.21 for a class B address, 192.33.32 for a class C address) and an exact match in a (netid, port#) table returns the output port, e.g. 192.33.32.1 -> netid 192.33.32 -> Port 3.]

20

Problems with Class-based Addressing

• Fixed netid-hostid boundaries too inflexible: rapid depletion of address space

• Exponential growth in size of routing tables

21

Exponential Growth in Routing Table Sizes

[Graph: number of BGP routes advertised over time, showing exponential growth.]

22

Classless Addressing (and CIDR)

• Eliminated class boundaries
• Introduced the notion of a variable-length prefix, between 0 and 32 bits long
• Prefixes represented by P/l: e.g., 122/8, 212.128/13, 34.43.32/22, 10.32.32.2/32, etc.
• An l-bit prefix represents an aggregation of 2^(32-l) IP addresses

23

CIDR:Hierarchical Route Aggregation

[Diagram: sites S (192.2.1/24) and T (192.2.2/24) connect through ISP P (192.2.0/22), while ISP Q holds 200.11.0/22; the backbone routing table needs only the aggregates 192.2.0/22 -> R2 and 200.11.0/22, shown on the IP number line.]

24

Size of the Routing Table

Source: http://www.telstra.net/ops/bgptable.html

[Graph: number of active BGP prefixes vs. date.]

25

Classless Addressing

[Diagram: class-based addressing divides the number line [0.0.0.0, 255.255.255.255] at fixed class boundaries A, B, C; classless addressing allows arbitrary, possibly nested prefixes such as 23/8, 191/8, 191.23/16, 191.128.192/18, and 191.23.14/23.]

26

Non-aggregatable Prefixes: (1) Multi-homed Networks

[Diagram: a network with prefix 192.2.2/24 inside ISP P's 192.2.0/22 is multi-homed to routers R2 and R3; the backbone routing table must carry both 192.2.0/22 -> R2 and the more-specific 192.2.2/24 -> R3.]

27

Non-aggregatable Prefixes: (2) Change of Provider

[Diagram: site T (192.2.2/24) changes provider from ISP P (192.2.0/22) to ISP Q (200.11.0/22) but keeps its prefix; the backbone routing table must now carry 192.2.0/22 -> R2 and 192.2.2/24 -> R3 separately, shown on the IP number line.]

28

Routing Lookups with CIDR

[Diagram: the table holds 192.2.0/22 -> R2, 192.2.2/24 -> R3, and 200.11.0/22 -> R4; address 192.2.2.100 matches both 192.2.0/22 and 192.2.2/24, while 192.2.0.1 matches only 192.2.0/22 and 200.11.0.33 matches only 200.11.0/22.]

Find the most specific route, or the longest matching prefix among all the prefixes matching the destination address of an incoming packet

29

Longest Prefix Match is Harder than Exact Match

• The destination address of an arriving packet does not carry with it the information to determine the length of the longest matching prefix

• Hence, one needs to search among the space of all prefix lengths; as well as the space of all prefixes of a given length

30

Metrics for Lookup Algorithms

• Speed
• Storage requirements
• Low update time
• Ability to handle large routing tables
• Flexibility in implementation
• Low preprocessing time

31

Maximum Bandwidth per Installed Fiber

[Graph: single-fiber capacity (Gb/s) vs. year, 1980-2005, on a log scale from 0.01 to 100,000; capacity doubles every year. Source: Lucent.]

32

Maximum Bandwidth per Router Port, and Lookup Performance Required

Year     Line   Linerate (Gbps)  40B (Mpps)  84B (Mpps)  354B (Mpps)
1997-98  OC3    0.155            0.48        0.23        0.054
1998-99  OC12   0.622            1.94        0.92        0.22
1999-00  OC48   2.5              7.81        3.72        0.88
2000-01  OC192  10.0             31.25       14.88       3.53
2002-03  OC768  40.0             125         59.52       14.12
         1GE    1.0              3.13        1.49        0.35
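The Mpps columns follow directly from the linerate divided by the packet size in bits; a one-line sketch:

```python
def mpps(linerate_gbps: float, pkt_bytes: int) -> float:
    # packets per second = bits per second / bits per packet
    return linerate_gbps * 1e9 / (8 * pkt_bytes) / 1e6

print(round(mpps(0.155, 40), 2))  # OC3 with 40-byte packets -> 0.48
```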

33

Size of Routing Table?

• Currently, 85K entries
• At 25K per year, 230-256K prefixes for the next 5 years
• Decreasing costs of transmission may increase the rate of routing table growth
• At 50K per year, need 350-400K prefixes for the next 5 years

34

Routing Update Rate?

• Currently a peak of a few hundred BGP updates per second
• Hence, 1K per second is a must
• 5-10K updates/second seems to be safe
• BGP limitations may be a bottleneck first
• Updates should be atomic, and should interfere little with normal lookups

35

Routing Lookups: Outline

• Background and problem definition

• Lookup schemes
• Comparative evaluation

36

Example Forwarding Table (5-bit Prefixes)

Prefix Next-hop

P1 111* H1

P2 10* H2

P3 1010* H3

P4 10101 H4

37

Linear Search

• Keep prefixes in a linked list
• O(N) storage, O(N) lookup time, O(1) update complexity
• Improve average time by keeping the linked list sorted in order of prefix lengths
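A minimal sketch of linear-search longest prefix match over the example 5-bit table, with the list sorted by decreasing prefix length so that the first match is the longest:

```python
# Example 5-bit forwarding table, longest prefixes first.
TABLE = [("10101", "H4"), ("1010", "H3"), ("111", "H1"), ("10", "H2")]

def lookup(addr):
    # O(N) scan; the first matching prefix is the longest match.
    for prefix, nexthop in TABLE:
        if addr.startswith(prefix):
            return nexthop
    return None

print(lookup("10111"))  # matches P2 = 10* -> H2
```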

38

Caching Addresses

[Diagram: a CPU with buffer memory on the slow path; linecards with DMA, MAC, and local buffer memory on the fast path. Cached destination addresses are resolved on the fast path; cache misses go to the CPU.]

39

Caching Addresses

Advantages
• Increased average lookup performance

Disadvantages
• Decreased locality in backbone traffic
• Cache size
• Cache management overhead
• Hardware implementation difficult

40

Radix Trie

P1 = 111* (H1), P2 = 10* (H2), P3 = 1010* (H3), P4 = 10101 (H4)

[Diagram: a binary trie over these prefixes; each trie node holds a next-hop pointer (if a prefix ends there) plus left and right child pointers. Lookup of 10111 walks bits 1-0-1-1-1, remembering the last prefix passed (P2). Adding P5 = 1110* creates a single new node off the P1 path.]

41

Radix Trie

• W-bit prefixes: O(W) lookup, O(NW) storage, and O(W) update complexity

Advantages
• Simplicity
• Extensible to wider fields

Disadvantages
• Worst-case lookup slow
• Wastage of storage space in long non-branching chains
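The binary trie above can be sketched in a few lines; this is a toy illustration of the structure, not an optimized implementation:

```python
class Node:
    __slots__ = ("left", "right", "nexthop")
    def __init__(self):
        self.left = self.right = self.nexthop = None

def insert(root, prefix, nexthop):
    # Walk/create one node per prefix bit; mark the final node.
    node = root
    for bit in prefix:
        attr = "left" if bit == "0" else "right"
        if getattr(node, attr) is None:
            setattr(node, attr, Node())
        node = getattr(node, attr)
    node.nexthop = nexthop

def lookup(root, addr):
    # Walk the trie, remembering the longest prefix seen so far.
    best, node = None, root
    for bit in addr:
        node = node.left if bit == "0" else node.right
        if node is None:
            break
        if node.nexthop is not None:
            best = node.nexthop
    return best

root = Node()
for prefix, hop in [("111", "H1"), ("10", "H2"), ("1010", "H3"), ("10101", "H4")]:
    insert(root, prefix, hop)
print(lookup(root, "10111"))  # -> H2
```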

42

Leaf-pushed Binary Trie

[Diagram: leaf-pushed binary trie for the same table (P1 111* H1, P2 10* H2, P3 1010* H3, P4 10101 H4); next hops are pushed down to the leaves, so each node field is either a child pointer or a next hop, never both.]

43

PATRICIA

[Diagram: Patricia tree for the same table (P1 111* H1, P2 10* H2, P3 1010* H3, P4 10101 H4); each internal node stores the bit position to test next (2, 3, 5), and leaves store the prefixes. Skipped bits must be verified at the end; a mismatch forces backtracking, as in the lookup of 10111.]

44

PATRICIA

• W-bit prefixes: O(W^2) lookup, O(N) storage, and O(W) update complexity

Advantages
• Decreased storage
• Extensible to wider fields

Disadvantages
• Worst-case lookup slow
• Backtracking makes implementation complex

45

Path-compressed Tree

[Diagram: path-compressed tree for the same table (P1 111* H1, P2 10* H2, P3 1010* H3, P4 10101 H4); each node stores a bit position, a variable-length bitstring for the skipped path, a next hop (if a prefix is present), and left/right pointers, e.g. (10, P2, 4) and (1010, P3, 5). Lookup of 10111 shown.]

46

Path-compressed Tree

• W-bit prefixes: O(W) lookup, O(N) storage, and O(W) update complexity

Advantages
• Decreased storage

Disadvantages
• Worst-case lookup slow

47

Early Lookup Schemes

• BSD Unix [sklower91]: Patricia, expected lookup time = 1.44 log N

• Dynamic prefix trie [doeringer96]: Patricia variant, complex insertion/deletion; 40K entries consumed 2 MB at 0.3-0.5 Mpps

48

Multi-bit Tries

Binary trie: depth = W, degree = 2, stride = 1 bit

Multi-ary trie: depth = W/k, degree = 2^k, stride = k bits

49

Prefix Expansion with Multi-bit Tries

If the stride is k bits, prefix lengths that are not a multiple of k need to be expanded.

E.g., k = 2:

Prefix  Expanded prefixes
0*      00*, 01*
11*     11*

Maximum number of expanded prefixes corresponding to one non-expanded prefix = 2^(k-1)
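Prefix expansion is mechanical: pad the prefix length up to the next multiple of k by enumerating all completions of the missing bits. A sketch (function name illustrative):

```python
def expand(prefix: str, k: int) -> list:
    # Number of bits needed to reach the next multiple of k.
    pad = (-len(prefix)) % k
    if pad == 0:
        return [prefix]
    # Enumerate all 2^pad completions of the missing bits.
    return [prefix + format(i, f"0{pad}b") for i in range(2 ** pad)]

print(expand("0", 2))   # ['00', '01']
print(expand("11", 2))  # ['11']
```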

50

Four-ary Trie (k=2)

[Diagram: four-ary trie (stride k = 2) for the same table after expansion; each node holds four child pointers (ptr00, ptr01, ptr10, ptr11) and a next-hop pointer per entry. P1 expands to P11 and P12; P4 expands to P41 and P42. Lookup of 10111 consumes two bits per step.]

51

Compressed Trie (k=8)

[Diagram: 8-8-8-8 split of the 32-bit address across four trie levels L8, L16, L24, L32. Only 4 memory accesses!]

52

Prefix Expansion Increases Storage Consumption

• Replication of the next-hop pointer
• Greater number of unused (null) pointers in a node

Time ~ W/k
Storage ~ (NW/k) * 2^(k-1)

53

Generalization: Different Strides at Each Trie Level

• 16-8-8 split• 4-10-10-8 split• 24-8 split• 21-3-8 split

54

Choice of Strides: Controlled Prefix Expansion [Sri98]

Given a forwarding table and a desired worst-case number of memory accesses (i.e., maximum tree depth D), a dynamic programming algorithm computes the optimal sequence of strides that minimizes the storage requirements; it runs in O(W^2 D) time.

Advantages
• Optimal storage under these constraints

Disadvantages
• Updates lead to sub-optimality anyway
• Hardware implementation difficult

55

Further Generalization: Different Stride at Each

Node [Sri98]

Given a forwarding table and a desired number of memory accesses in the worst case (i.e., maximum tree depth, D)

A dynamic programming algorithm to compute the optimal stride at each node that minimizes the storage requirements: runs in O(N W^2 D) time

56

Stride Optimization : Implementation Results

                Two levels       Three levels
Fixed-stride    49 MB, 1 ms      1.8 MB, 1 ms
Varying-stride  1.6 MB, 130 ms   0.57 MB, 871 ms

(38816 prefixes, 300 MHz P-II)

57

Lulea Algorithm [lulea98]

[Diagram: the trie is split 16-8-8 into levels L16, L24, and L32.]

58

Lulea Algorithm

[Diagram: one level of the 16-8-8 split represented as the bitmap 1000101110001111, one bit per trie entry.]

59

Lulea Algorithm

[Diagram: the bitmap is divided into 8-bit chunks (10001010, 11100010, 10000010, 10110100, 11000000) stored in a codeword array with offsets (R1,0; R2,3; R3,7; R4,9; R5,0); a base index array (0, 13) and a pointer array (P1-P4) then locate the next-hop pointers without storing nulls.]

60

Lulea Algorithm

33K entries: 160 KB, average 2 Mpps

Advantages
• Extremely small data structure: can fit in L1/L2 cache

Disadvantages
• Scalability to larger tables?
• Incremental updates not supported

61

Binary Search on Trie Levels [wald98]

[Diagram: one hash table per prefix length (8, 12, 16, 22), searched by binary search over the lengths, with markers left by longer prefixes to guide the search.

Example prefixes: 10/8, 10.1/16, 10.1.10/22, 10.1.32/22, 10.2.64/22.
Hash table contents: 10 at length 8; 10.1, 10.2 at length 16; 10.1.10, 10.1.32, 10.2.64 at length 22.
Example addresses: 10.1.10.4, 10.2.3.9.]

63

Binary Search on Trie Levels

33K entries: 1.4 MB, 1.2-2.2 Mpps

Advantages
• Scales nicely to IPv6

Disadvantages
• Multiple hashed memory accesses
• Incremental updates complex
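A toy sketch of binary search on prefix lengths: one hash table per length, probed by binary search. In the full scheme [wald98], precomputed markers ensure a miss/hit correctly steers the search; this sketch omits markers, which is safe only for tiny tables like this one:

```python
# One hash table per prefix length; keys are the first l address bits.
TABLES = {
    8:  {"00001010": "10/8"},
    16: {"0000101000000001": "10.1/16"},
}

def addr_bits(addr):
    return "".join(format(int(x), "08b") for x in addr.split("."))

def lookup(addr):
    bits, best = addr_bits(addr), None
    lengths = sorted(TABLES)
    lo, hi = 0, len(lengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = TABLES[lengths[mid]].get(bits[:lengths[mid]])
        if hit is not None:
            best = hit
            lo = mid + 1   # a hit means: try longer prefix lengths
        else:
            hi = mid - 1   # a miss means: only shorter lengths can match
    return best

print(lookup("10.1.10.4"))  # -> '10.1/16'
```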

64

Binary Search on Prefix Intervals [lampson98]

Prefix  Interval
P1 *    0000-1111
P2 00*  0000-0011
P3 1*   1000-1111
P4 1101 1101-1101
P5 001* 0010-0011

[Number line from 0000 to 1111: the prefix endpoints partition the line into intervals I1-I6, each associated with its deepest covering prefix.]

65

Alphabetic Tree

[Diagram: a binary search tree over the interval endpoints (0001, 0011, 0111, 1100, 1101); comparing the address against the endpoints locates its interval I1-I6 on the same number line.]

66

Multiway Search on Intervals

38K entries: 0.95 MB, 2.1 Mpps

Advantages
• Space is O(N)

Disadvantages
• Incremental updates complex
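Once the intervals and their deepest covering prefixes are precomputed, a lookup is one binary search. A sketch over the example 4-bit table, assuming the interval-to-prefix mapping derived from it (0000-0001 -> P2, 0010-0011 -> P5, 0100-0111 -> P1, 1000-1100 -> P3, 1101 -> P4, 1110-1111 -> P3):

```python
from bisect import bisect_right

# Interval start points (inclusive) and the prefix owning each interval.
STARTS = [0b0000, 0b0010, 0b0100, 0b1000, 0b1101, 0b1110]
OWNER  = ["P2",   "P5",   "P1",   "P3",   "P4",   "P3"]

def lookup(addr: int) -> str:
    # Rightmost interval start <= addr gives the containing interval.
    return OWNER[bisect_right(STARTS, addr) - 1]

print(lookup(0b0101))  # inside 0100-0111 -> 'P1'
print(lookup(0b1101))  # -> 'P4'
```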

67

Depth-constrained Near-optimal Alphabetic Tree

• Redraw the binary search tree based on the probability of access of routing table entries:
– Minimize average lookup time
– But keep worst-case lookup time bounded

40% improvement in lookup time with a small relaxation in worst-case lookup time.

68

Routing Lookups in Hardware [gupta98]

[Histogram: number of prefixes vs. prefix length, MAE-EAST routing table, April 11, 2000 (source: www.merit.edu); the vast majority of prefixes are 24 bits or shorter.]

69

Routing Lookups in Hardware

[Diagram: for prefixes up to 24 bits, the top 24 bits of the address (e.g. 142.19.6 from 142.19.6.14) directly index a 2^24 = 16M-entry table; each entry holds a flag bit and a next hop.]

Routing Lookups in Hardware

[Diagram: if the first table's flag bit is 1, the entry holds the next hop directly. If it is 0, the entry holds a pointer (base) into a second table for prefixes longer than 24 bits; the remaining 8 bits of the address (e.g. 14 from 128.3.72.14) are the offset into a 2^8-entry block of next hops.]
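A sketch of this two-table (24 + 8) direct-lookup scheme, assuming routes are inserted in order of increasing prefix length so longer prefixes overwrite shorter ones; Python dicts and lists stand in for the DRAM arrays, and the names (tbl24, add_route) are illustrative:

```python
tbl24 = {}  # top-24-bit index -> ("hop", next_hop) or ("ptr", block)

def ip(s):
    a, b, c, d = map(int, s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def add_route(prefix_bits, next_hop):
    if len(prefix_bits) <= 24:
        pad = 24 - len(prefix_bits)
        base = int(prefix_bits, 2) << pad
        for i in range(1 << pad):          # fill every covered /24 slot
            tbl24[base + i] = ("hop", next_hop)
    else:
        top = int(prefix_bits[:24], 2)
        entry = tbl24.get(top)
        if entry is None or entry[0] == "hop":
            # Allocate a 256-entry second-level block, seeded with the
            # shorter covering prefix's next hop (if any).
            tbl24[top] = ("ptr", [entry[1] if entry else None] * 256)
        block = tbl24[top][1]
        pad = 32 - len(prefix_bits)
        lo = int(prefix_bits[24:], 2) << pad
        for i in range(1 << pad):
            block[lo + i] = next_hop

def lookup(addr):
    kind, val = tbl24.get(addr >> 8, ("hop", None))
    return val if kind == "hop" else val[addr & 0xFF]

add_route(format(ip("142.19.6.0") >> 8, "024b"), "H1")    # 142.19.6/24
add_route(format(ip("142.19.6.128") >> 7, "025b"), "H2")  # 142.19.6.128/25
print(lookup(ip("142.19.6.14")), lookup(ip("142.19.6.200")))  # H1 H2
```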

71

Routing Lookups in Hardware

[Diagram: general n + m split: a 2^n-entry first table covers prefixes up to n bits; entries i, j for ranges with longer prefixes point into 2^m-entry second-level blocks covering prefixes up to n + m bits, whose entries hold the next hop.]

Routing Lookups in Hardware

Various compression schemes can be employed to decrease the storage requirements: e.g. employ carefully chosen variable length strides, bitmap compression etc.

Advantages
• 20 Mpps with 50 ns DRAM, or 66 Mpps with e-DRAM
• Easy to implement in hardware

Disadvantages
• Large memory required (9-33 MB)
• Depends on prefix-length distribution

73

Content-addressable Memory (CAM)

• Fully associative memory
• Exact match operation in a single clock cycle: parallel compare

74

Lookups with Ternary-CAM

[Diagram: a TCAM memory array holds prefixes P32 (longest) down to P8 in decreasing-length order; the destination address is compared against all entries in parallel, a priority encoder selects the first (longest) match, and the matching index reads the next hop out of an associated RAM.]

75

Lookups with TCAM

Advantages
• Fast: 15-20 ns

Disadvantages
• Expensive (and low density): 0.25 MB at 50 MHz costs $30-$75
• High power: 5-8 W
• Updates slow
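A software model of what the TCAM does in one cycle: each entry is a (value, mask) pair that matches when the masked key equals the value, and the priority encoder is simulated by returning the first match in decreasing-prefix-length order. This uses the example 5-bit table:

```python
# (value, mask, next_hop), stored longest-prefix-first.
ENTRIES = [
    (0b10101, 0b11111, "H4"),  # P4 = 10101
    (0b10100, 0b11110, "H3"),  # P3 = 1010*
    (0b11100, 0b11100, "H1"),  # P1 = 111*
    (0b10000, 0b11000, "H2"),  # P2 = 10*
]

def tcam_lookup(key: int):
    # All entries are compared "in parallel" in hardware; the priority
    # encoder picks the first match, simulated here by scan order.
    for value, mask, next_hop in ENTRIES:
        if key & mask == value:
            return next_hop
    return None

print(tcam_lookup(0b10111))  # -> 'H2'
```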

76

Updates with TCAM

[Diagram: TCAM entries sorted by decreasing prefix length (P32 ... P8) with empty space interspersed; inserting a prefix must preserve the length ordering.]

Issue: how to manage the free space [Hoti'00]

77

Routing Lookups: Outline

• Background and problem definition

• Lookup schemes
• Comparative evaluation

78

Performance Comparison: Complexity

Algorithm                     Lookup    Storage  Update
Binary trie                   W         NW       W
Patricia                      W^2       N        W
Path-compressed trie          W         N        W
Multi-ary trie                W/k       N*2^k    -
LC trie                       W         N        -
Lulea                         -         -        -
Binary search on trie levels  logW      NlogW    -
Binary search on intervals    log(2N)   N        -
TCAM                          1         N        W

79

Performance Comparison

Algorithm                                       Lookup (ns)  Storage (KB)
Patricia (BSD)                                  2500         3262
Multi-way fixed-stride optimal trie (3 levels)  298          1930
Multi-way fixed-stride optimal trie (5 levels)  428          660
LC trie                                         -            700
Lulea                                           409          160
Binary search on trie levels                    650          1600
6-way search on intervals                       490          950
Lookups with direct access                      15-60        9000-33000
TCAM                                            15-20        512

80

Routing Lookups: References

• [lulea98] A. Brodnik, S. Carlsson, M. Degermark, S. Pink. “Small Forwarding Tables for Fast Routing Lookups”, Sigcomm 1997, pp 3-14.

• [gupta98] P. Gupta, S. Lin, N. McKeown. “Routing lookups in hardware at memory access speeds”, Infocom 1998, pp 1241-1248, vol. 3.

• P. Gupta, B. Prabhakar, S. Boyd. “Near-optimal routing lookups with bounded worst case performance,” Proc. Infocom, March 2000

• [lampson98] B. Lampson, V. Srinivasan, G. Varghese. “ IP lookups using multiway and multicolumn search”, Infocom 1998, pp 1248-56, vol. 3.

81

Routing lookups : References (contd)

• [wald98] M. Waldvogel, G. Varghese, J. Turner, B. Plattner. “Scalable high speed IP routing lookups”, Sigcomm 1997, pp 25-36.

• [LC-trie] S. Nilsson, G. Karlsson. “Fast address lookup for Internet routers”, IFIP Intl Conf on Broadband Communications, Stuttgart, Germany, April 1-3, 1998.

• [sri98] V. Srinivasan, G.Varghese. “Fast IP lookups using controlled prefix expansion”, Sigmetrics, June 1998

• TCAM vendors: netlogicmicro.com, laratech.com, mosaid.com, sibercore.com

82

Packet Classification

83

Packet Classification: Outline

• Background and problem definition

• Classification schemes
• Comparative evaluation

84

Flow-aware vs Flow-unaware Routers (recap)

• Flow-aware router: keeps track of flows and performs similar processing on packets in a flow

• Flow-unaware router (packet-by-packet router): treats each incoming packet individually

85

Why Flow-aware Router?

ISPs want to provide differentiated services: the capability to distinguish and isolate traffic belonging to different flows, based on negotiated service agreements (rules or policies). Identifying which flow a packet belongs to is classification.

Routers then require additional mechanisms: admission control, resource reservation, per-flow queueing, fair scheduling, etc.

86

Need for Differentiated Services

[Diagram: ISP1, ISP2, and ISP3 meet at a NAP; enterprise networks E1 and E2 attach to ISP1; interfaces X, Y, Z marked.]

Service: Example
• Traffic shaping: ensure that ISP3 does not inject more than 50 Mbps of total traffic on interface X, of which no more than 10 Mbps is email traffic
• Packet filtering: deny all traffic from ISP2 (on interface X) destined to E2
• Policy routing: send all voice-over-IP traffic arriving from E1 (on interface Y) and destined to E2 via a separate ATM network

87

More Value added Services

• Differentiated services
– Regard traffic from Autonomous System #33 as `platinum grade'

• Accounting and billing
– Treat all video traffic as highest priority and perform accounting for this type of traffic

• Committed Access Rate (rate limiting)
– Rate-limit WWW traffic from sub-interface #739 to 10 Mbps

88

Multi-field Packet Classification

Given a classifier with N rules, find the action associated with the highest-priority rule matching an incoming packet.

Example: packet (5.168.3.32, 152.133.171.71, ..., TCP)

        Field 1     Field 2       ...  Field k  Action
Rule 1  5.3.90/21   2.13.8.11/32  ...  UDP      A1
Rule 2  5.168.3/24  152.133/16    ...  TCP      A2
...     ...         ...           ...  ...      ...
Rule N  5.168/16    152/8         ...  ANY      AN

(This packet matches both Rule 2 and Rule N; Rule 2, the higher-priority match, wins.)

89

Packet Header Fields for Classification

[Packet layout, in order of transmission: MAC header (L2-DA, L2-SA), network-layer header (L3-SA, L3-DA, L3-PROT), transport-layer header (L4-SP, L4-DP, L4-PROT), then payload.]

DA = destination address, SA = source address, PROT = protocol, SP = source port, DP = destination port
L2 = layer 2 (e.g., Ethernet), L3 = layer 3 (e.g., IP), L4 = layer 4 (e.g., TCP)

90

Flow-aware Router: Basic Architectural Components

[Diagram: the control plane runs routing, resource reservation, admission control, and SLAs; the per-packet datapath performs routing lookup, packet classification, special processing, switching, and scheduling.]

91

Packet Classification

[Diagram: the incoming packet's header is matched by the forwarding engine against a classifier (policy database) of (predicate, action) rules; the matching rule's action is applied.]

92

Packet Classification: Problem Definition

Given a classifier C with N rules Rj, 1 <= j <= N, where Rj consists of three entities:

1) A regular expression Rj[i], 1 <= i <= d, on each of the d header fields,

2) A number pri(Rj), indicating the priority of the rule in the classifier, and

3) An action, referred to as action(Rj).

For an incoming packet P with the header considered as a d-tuple of points (P1, P2, ..., Pd), the d-dimensional packet classification problem is to find the rule Rm with the highest priority among all the rules Rj matching the d-tuple; i.e., pri(Rm) > pri(Rj) for all j != m, 1 <= j <= N, such that Pi matches Rj[i], 1 <= i <= d. We call rule Rm the best matching rule for packet P.
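The definition above can be rendered directly as a (slow, O(Nd)) reference implementation. Here each field predicate is a (prefix, length, width) triple on integers, a toy stand-in for the general "regular expressions" of the definition; all names are illustrative:

```python
def field_matches(value, pred):
    prefix, plen, width = pred
    # A zero-length prefix matches everything (wildcard).
    return (value >> (width - plen)) == (prefix >> (width - plen)) if plen else True

def classify(packet, rules):
    # rules: list of (priority, [pred per field], action)
    best = None
    for pri, preds, action in rules:
        if all(field_matches(v, p) for v, p in zip(packet, preds)):
            if best is None or pri > best[0]:
                best = (pri, action)
    return best[1] if best else None

rules = [
    (2, [(0b10100000, 3, 8), (0b11000000, 2, 8)], "A2"),  # (101*, 11*)
    (1, [(0b10000000, 1, 8), (0b00000000, 0, 8)], "A1"),  # (1*, *)
]
print(classify((0b10110000, 0b11010101), rules))  # both match; A2 wins
```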

93

Example 4D classifier

Rule  L3-DA                               L3-SA                                L4-DP        L4-PROT  Action
R1    152.163.190.69/255.255.255.255      152.163.80.11/255.255.255.255        *            *        Deny
R2    152.168.3/255.255.255               152.163.200.157/255.255.255.255      eq www       udp      Deny
R3    152.168.3/255.255.255               152.163.200.157/255.255.255.255      range 20-21  udp      Permit
R4    152.168.3/255.255.255               152.163.200.157/255.255.255.255      eq www       tcp      Deny
R5    *                                   *                                    *            *        Deny

94

Example Classification Results

Pkt  L3-DA           L3-SA            L4-DP  L4-PROT  Rule, Action
P1   152.163.190.69  152.163.80.11    www    tcp      R1, Deny
P2   152.168.3.21    152.163.200.157  www    udp      R2, Deny

95

Classification is a Generalization of Lookup

• Classifier = routing table
• One dimension (destination address)
• Rule = routing table entry
• Regular expression = prefix
• Action = (next-hop-address, port)
• Priority = prefix-length

96

Metrics for Classification Algorithms

• Speed
• Storage requirements
• Low update time
• Ability to handle large classifiers
• Flexibility in implementation
• Low preprocessing time
• Scalability in the number of header fields
• Flexibility in rule specification

97

Size of Classifier?

• Microflow recognition: 128K-1M flows in a metro/edge router
• Firewall applications: 8-16K rules
• Wildcarded filters: 16-128K rules
• Depends heavily on where your box will be deployed

98

Packet Classification: Outline

• Background and problem definition

• Classification schemes
• Comparative evaluation

99

Example Classifier

Rule  Destination Address  Source Address
R1    0*                   10*
R2    0*                   01*
R3    0*                   1*
R4    00*                  1*
R5    00*                  11*
R6    10*                  1*
R7    *                    00*

100

Set-pruning Tries [Tsuchiya, Sri98]

[Diagram: a trie on the destination-address (DA) dimension; each DA prefix node stores its own source-address (SA) trie containing every rule whose DA prefix matches that node, so rules are replicated across SA tries (e.g. R3's and R7's entries are copied into 00*'s trie).]

O(N^2) memory

101

Grid-of-Tries [Sri98]

[Diagram: each DA-trie node stores only its own rules' SA trie, with no replication; a lookup must descend the SA trie of every matching DA prefix.]

O(NW) memory, O(W^2) lookup

102

Grid-of-Tries [Sri98]

[Diagram: the same structure augmented with switch pointers that jump from one SA trie into the next-longer DA prefix's SA trie without restarting from its root.]

O(NW) memory, O(2W) lookup

103

Grid-of-Tries

Advantages
• Good solution for two dimensions

Disadvantages
• Static solution
• Not easily extensible to more than two dimensions

20K entries: 2 MB, 9 memory accesses (with expansion)

104

Geometric Interpretation in 2D

[Diagram: each rule is a rectangle in the (dimension 1, dimension 2) plane, e.g. (128.16.46.23, *) is a line and (144.24/16, 64/24) is a rectangle; packets P1 and P2 are points, and classification is point location among the overlapping rectangles R1-R7.]

105

Bitmap-intersection [Lak98]

[Diagram: in each dimension, the rule projections partition the axis into intervals; each interval stores an N-bit bitmap of the rules that overlap it (e.g. 1100, 1011 for rules R1-R4). A lookup finds the packet's interval in each dimension and ANDs the bitmaps; the first set bit is the highest-priority matching rule.]

106

Bitmap-intersection

Advantages
• Good solution for multiple dimensions, for small classifiers

Disadvantages
• Static solution
• Large memory bandwidth (scales linearly in N)
• Large amount of memory (scales quadratically in N)
• Hardware-optimized

512 rules: 1 Mpps with a single FPGA (33 MHz) and five 1 Mb SRAM chips

107

2D classification [Lak98]

[Diagram: rules R1-R7 drawn as rectangles with prefixes on one axis and ranges on the other; rules with the same prefix length (e.g. lengths 3 and 4) are grouped, and P1 is a query point.]

2D Classification [Lak98]: Preprocessing

• Store the prefixes in a trie
• With each prefix, store the set of intervals that form a rectangle with that prefix as the other side
• Store the intervals as a set of non-overlapping disjoint intervals

109

2D Classification [Lak98]: Lookup

• For each prefix length:
– Find the prefix matching the incoming point, and the set of non-overlapping intervals associated with that prefix
– Search for the non-overlapping interval that contains the point

• Repeat for all prefix lengths

110

2D Classification [Lak98]: Complexity

• Lookups: O(W log N) with N two-dimensional rules
– O(W + log N) using fractional cascading

• Space: O(N)
• Static data structure

111

Crossproducting [Sri98]

[Diagram: the rule projections cut dimension 1 into intervals 1-9 and dimension 2 into intervals 1-6; each crossproduct cell, e.g. (1,3) for P1 or (8,4) for P2, is precomputed to store its best matching rule.]

112

Crossproducting

Advantages
• Fast accesses
• Suitable for multiple fields

Disadvantages
• Large amount of memory
• Need caching for bigger classifiers (> 50 rules)

50 rules: 1.5 MB; need caching (on-demand crossproducting) for bigger classifiers

Need: d 1-D lookups + 1 memory access; O(N^d) space
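A small sketch of crossproducting for two range fields: cut each dimension at every rule boundary, precompute the best matching rule for every (interval, interval) cell, then classify with d one-dimensional searches plus one table access. The rule format and names here are illustrative:

```python
from bisect import bisect_right
from itertools import product

# Each rule: ((lo1, hi1), (lo2, hi2), priority); higher priority wins.
RULES = [((0, 7), (4, 15), 1), ((4, 15), (0, 7), 2)]

def build(rules):
    # Cut points per dimension: every range start, and one past every end.
    cuts = [sorted({0} | {v for r in rules for v in (r[d][0], r[d][1] + 1)})
            for d in range(2)]
    table = {}
    for cell in product(range(len(cuts[0])), range(len(cuts[1]))):
        point = (cuts[0][cell[0]], cuts[1][cell[1]])  # cell's lower corner
        matches = [r for r in rules
                   if all(r[d][0] <= point[d] <= r[d][1] for d in range(2))]
        if matches:
            table[cell] = max(matches, key=lambda r: r[2])
    return cuts, table

def classify(cuts, table, p):
    # One binary search per dimension, then one table access.
    cell = tuple(bisect_right(cuts[d], p[d]) - 1 for d in range(2))
    return table.get(cell)

cuts, table = build(RULES)
print(classify(cuts, table, (5, 5))[2])  # both rules match; priority 2 wins
```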

113

Space-time Tradeoff

Point location among N non-overlapping regions in d dimensions requires:
• either O(log N) time with O(N^d) space, or
• O(log^(d-1) N) time with O(N) space

Need help: exploit the structure in real-life classifiers.

114

Recursive Flow Classification [Gupta99]

Observations:

• Difficult to achieve both high classification rate and reasonable storage in the worst case
• Real classifiers exhibit structure and redundancy
• A practical scheme could exploit this structure and redundancy

115

RFC: Classifier Dataset

• 793 classifiers from 101 ISP and enterprise networks with a total of 41505 rules.

• 40 classifiers: more than 100 rules. Biggest classifier had 1733 rules.

• Maximum of 4 fields per rule: source IP address, destination IP address, protocol and destination port number.

116

Structure of the Classifiers

[Diagram: rules R1, R2, R3 drawn as non-overlapping rectangles: 4 distinct regions.]

117

Structure of the Classifiers

[Diagram: the same rules R1, R2, R3 drawn overlapping, creating regions {R1,R2}, {R2,R3}, and {R1,R2,R3}: 7 distinct regions.]

Dataset: a 1733-rule classifier produced only 4316 distinct regions (the worst case is ~10^13!)

118

Recursive Flow Classification

One-step: map the 2^S = 2^128 possible headers directly to the 2^T = 2^12 possible actions, which is infeasible.

Multi-step: reduce in phases, 2^128 -> 2^64 -> 2^32 -> 2^12.

119

Chunking of a Packet

[Diagram: the packet header fields (source L3 address, destination L3 address, L4 protocol and flags, source L4 port, destination L4 port, type of service) are split into 16-bit chunks, Chunk #0 through Chunk #7.]

120

Packet Flow

[Diagram: Phase 0 indexes a table with each 16-bit header chunk; later phases combine the previous phase's results through further index tables, reducing 128 bits to 64, 32, then 16 over Phases 0-3, until the final 14-bit result selects the action.]

121

Choice of Reduction Tree

[Diagram: two reduction trees over chunks 0-5. A tree with P = 3 phases needs 10 memory accesses; a tree with P = 4 phases needs 11 memory accesses.]

122

RFC: Storage Requirements

[Graph: RFC memory consumption (MB) vs. number of rules.]

123

RFC: Classification Time

• Pipelined hardware: 30 Mpps (worst case OC192) using two 4 Mb SRAMs and two 64 Mb SDRAMs at 125 MHz.

• Software (3 phases): 1 Mpps in the worst case and 1.4-1.7 Mpps in the average case (average case OC48). [Performance measured using the Intel VTune simulator on a Windows NT platform.]

124

RFC: Pros and Cons

Advantages
• Exploits structure of real-life classifiers
• Suitable for multiple fields
• Supports non-contiguous masks
• Fast accesses

Disadvantages
• Depends on structure of classifiers
• Large pre-processing time
• Incremental updates slow
• Large worst-case storage requirements

125

Hierarchical Intelligent Cuttings (HiCuts)

[Gupta99]

Observations:

• No single good solution for all cases, but real classifiers have structure
• Perhaps an algorithm can exploit this structure: a heuristic hybrid scheme

126

HiCuts: Basic Idea

[Diagram: a decision tree recursively cuts the rule set {R1, R2, R3, ..., Rn} along geometric partitions until each leaf holds at most binth rules, e.g. leaves {R1, R3, R4}, {R1, R2, R5}, {R8, Rn}.]

binth (bin threshold) = maximum subset size = 3

127

Heuristics to Exploit Classifier Structure

• Picking a suitable dimension to cut across:
– Minimize the maximum number of rules in any one partition, or
– Maximize the entropy of the distribution of rules across the partitions, or
– Maximize the number of different specifications in one dimension

• Picking the suitable number of partitions (cuts) to be made:
– Affects the space consumed and the classification time; tuned by a parameter, spfac

128

HiCuts:Number of Memory Accesses

[Graph: HiCuts worst-case number of memory accesses vs. number of rules (log scale), binth = 8, spfac = 4, compared against crossproducting.]

129

HiCuts: Storage Requirements

[Graph: HiCuts storage in KB (log scale) vs. number of rules (log scale), binth = 8, spfac = 4.]

130

Incremental Update Time

[Graph: HiCuts incremental update time in seconds (log scale) vs. number of rules (log scale); binth = 8, spfac = 4, 333 MHz P-II running Linux.]

131

HiCuts: Pros and Cons

Advantages
• Exploits structure of real-life classifiers
• Adapts the data structure
• Suitable for multiple fields
• Supports incremental updates

Disadvantages
• Depends on structure of classifiers
• Large pre-processing time
• Large worst-case storage requirements

132

Tuple Space Search [Suri99]

Decompose the classification problem into a number of exact-match problems, then use hashing.

Rule  Prefixes      Tuple
R1    (01*, 111*)   [2,3]
R2    (11*, 010*)   [2,3]
R3    (1*, *)       [1,0]

Use one hash table for each tuple; search all hash tables sequentially.
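A sketch of the basic scheme for the three rules above: rules with the same combination of prefix lengths (a tuple) share one exact-match hash table, keyed by the concatenated prefix bits, and a lookup probes every table:

```python
# Rules as (field1 prefix, field2 prefix, name); '' is a wildcard.
RULES = [("01", "111", "R1"), ("11", "010", "R2"), ("1", "", "R3")]

# One hash table per tuple (len1, len2).
tables = {}
for f1, f2, name in RULES:
    tables.setdefault((len(f1), len(f2)), {})[(f1, f2)] = name

def classify(a1: str, a2: str):
    # Probe every tuple's table with the address bits masked to the
    # tuple's lengths; exact match in each table.
    hits = []
    for (l1, l2), tbl in tables.items():
        rule = tbl.get((a1[:l1], a2[:l2]))
        if rule:
            hits.append(rule)
    return hits

print(classify("0110", "1110"))  # -> ['R1']
```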

133

Improved TSS via Precomputation

• Extension of "binary search on trie levels"
• If [2,3,3] succeeds, no need to search, e.g., [4,5,6]
• If [2,3,3] fails, no need to search, e.g., [1,2,1]
• Search the tuple space intelligently (a decision tree on the tuple space)

134

TSS: Pros and Cons

Advantages
• Suitable for multiple fields
• Supports incremental updates
• Fast classification and updates on average

Disadvantages
• Large pre-processing time
• Multiple hashed-memory accesses

135

Area-based Quad Tree [Buddhikot99]

[Diagram: the 2D space is recursively divided into quadrants 00, 01, 10, 11; each quad-tree node stores its crossing filter set, the rules that completely span the node in one dimension (e.g. {R1,R2}, {R5}, {R3,R4}); P1 is a query point.]

Lookup: two 1-D longest prefix match operations at every node on the path from the root to a leaf.

O(N) space; O(W log N) lookup time; O(W + log N) using fractional cascading.

136

AQT: Efficient Updates

[Diagram: partition prefixes into groups and do the pre-computation per group instead of per interval, so an update touches one group rather than every interval.]

O(aW) search and O(a N^(1/a)) updates, for a tunable parameter a.

137

2-D Classification Using FIS Tree [Feldmann00]

[Diagram: rules R1-R5 and query point P1; an FIS tree built over the x-axis projections of the rules.]

l levels; O(l * n^(1+1/l)) space; (l+1) 1-D lookups.

138

FIS Tree: Experimental Study

Number of rules  Levels in FIS tree  Storage space  Number of memory accesses
4-60K            2                   < 5 MB         < 15
~10^6            3                   < 100 MB       < 18

Rulesets constructed using netflow data from AT&T Worldnet. Experiments done using static 2-D FIS trees.

139

Ternary CAMs

Advantages
• Suitable for multiple fields
• Fast: 16-20 ns (50-66 Mpps)
• Simple to understand

Disadvantages
• Inflexible: range-to-prefix blowup
• Density: largest available in 2000 is 32K x 128 (but can be cascaded)
• Management software and on-chip logic: non-trivial complexity
• Power: 5-8 W
• Incremental updates: slow
• DRAM-based CAMs: higher density, but soft errors are a problem
• Cost: $30-$160 for 1 Mb

140

Range-to-prefix Blowup

Rule  Range   Maximal prefixes
R1    [3,11]  0011, 01**, 10**
R2    [2,7]   001*, 01**
R3    [4,11]  01**, 10**
R4    [4,7]   01**
R5    [1,14]  0001, 001*, 01**, 10**, 110*, 1110

Maximum memory blowup = factor of (2W-2)^d
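The maximal-prefix decomposition of a range can be computed greedily: repeatedly peel off the largest aligned power-of-two block that starts at the low end and stays inside the range. A sketch (function name illustrative; assumes 0 < lo <= hi):

```python
def range_to_prefixes(lo: int, hi: int, width: int) -> list:
    out = []
    while lo <= hi:
        # Largest power-of-two block aligned at lo...
        size = lo & -lo
        # ...shrunk until it fits inside [lo, hi].
        while lo + size - 1 > hi:
            size >>= 1
        plen = width - size.bit_length() + 1
        out.append(format(lo >> (width - plen), f"0{plen}b")
                   + "*" * (width - plen))
        lo += size
    return out

print(range_to_prefixes(1, 14, 4))
# ['0001', '001*', '01**', '10**', '110*', '1110']
```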

141

Packet Classification: References

• [Lak98] T.V. Lakshman. D. Stiliadis. “High speed policy based packet forwarding using efficient multi-dimensional range matching”, Sigcomm 1998, pp 191-202

• [Sri98] V. Srinivasan, S. Suri, G. Varghese and M. Waldvogel. “Fast and scalable layer 4 switching”, Sigcomm 1998, pp 203-214

• [Suri99] V. Srinivasan, G. Varghese, S. Suri. “Fast packet classification using tuple space search”, Sigcomm 1999, pp 135-146

• [Gupta99] P. Gupta, N. McKeown, “Packet classification using hierarchical intelligent cuttings,” Hot Interconnects VII, 1999

142

Packet Classification: References (contd.)

• [Gupta99] P. Gupta, N. McKeown, “Packet classification on multiple fields,” Sigcomm 1999, pp 147-160

• [Buddhikot99] M. M. Buddhikot, S. Suri, and M. Waldvogel, “Space decomposition techniques for fast layer-4 switching,” Protocols for High Speed Networks, vol. 66, no. 6, pp 277-283, 1999

• [Feldmann00] A. Feldmann and S. Muthukrishnan, “Tradeoffs for packet classification,” Infocom 2000

• T. Woo, “A modular approach to packet classification: algorithms and results, “ Infocom 2000

143

Special Instances of Classification

• Multicast
– PIM-SM:
   • Longest prefix matching on the source and group address
   • Try (S,G), followed by (*,G), followed by (*,*,RP)
   • Check incoming interface
– DVMRP:
   • Incoming interface check followed by (S,G) lookup

• IPv6
– 128-bit destination address field

144

Implementation Choices Given Design Requirements

Disclaimer: These are my opinions

145

Design Requirement LU1

Requirements: 2.5 Gbps, 100K routes

Choices:
a) 2-4 TCAMs
b) On-chip logic with one external SDRAM chip (using multibit tries)
c) On-chip e-DRAM

146

Design Requirement LU2

Requirements: 10 Gbps, 256K routes

Choices:
a) 4-8 TCAMs
b) On-chip logic with 2-4 external SDRAM chips (using multibit tries)
c) On-chip e-DRAM

147

Design Requirement PC1

Requirements: 10 Gbps classification up to L4; 16-64K comparatively static 128-bit entries

Choices:
a) 1-4 TCAMs
b) On-chip logic with 2 external SDRAM and 2 SRAM chips (using RFC)
c) Off-chip SRAMs (using HiCuts)

148

Your Design Here

Requirements:

Choices:

149

Lookup/Classification Chip Vendors

• Switch-on
• Fastchip
• Agere
• Solidum
• Siliconaccess
• TCAM vendors: Netlogic, Lara, Sibercore, Mosaid, Klsi, etc.

150

Summary

• Both problems are well studied by now but increasing linerates and database sizes continue to present interesting opportunities

• Still need a high-speed (~OC192) dynamic, generic, multi-field classification algorithm for large number of (up to a million) rules

151

Thanks! I will appreciate direct feedback at pankaj@stanford.edu