Basic Data Structures for IP lookups and Packet Classification
Transcript of Basic Data Structures for IP lookups and Packet Classification
1
Basic Data Structures for IP lookups and Packet Classification
2
Routing table examples
IPv4: 4.0.0.0/8, 6.0.0.0/8, 9.2.0.0/16, 9.20.0.0/17, 12.0.0.0/8, 13.0.0.0/8, 15.0.0.0/8, 16.0.0.0/8, 17.0.0.0/8, 18.0.0.0/8, 20.0.0.0/8, 24.0.0.0/18, 24.0.0.0/14, 24.1.0.0/17, 24.4.0.0/17, 24.48.0.0/18
IPv6: 2001:0200:0136::/48, 2001:0200:0900::/40, 2001:0200:0905::/48, 2001:0200:0c00::/40, 2001:0200::/32, 2001:0200:c000::/35, 2001:0200:e000::/35, 2001:0208::/32, 2001:0218::/32, 2001:0218:6002::/48, 2001:0220::/35, 2001:0238::/32, 2001:0240::/32, 2001:0250:0204::/48, 2001:0250::/32, 2001:0250:e000::/36
oix-route-views - Route Views Archive http://archive.routeviews.org/oix-route-views/
* 3.0.0.0          203.181.248.233  0 7660 1 7018 80 i
* 4.0.0.0          203.194.0.5      0 9942 1 i
* 6.1.0.0/16       203.194.0.5      0 9942 1 7170 1455 i
* 6.2.0.0/22       203.194.0.5      0 9942 1 7170 1455 i
* 6.3.0.0/18       203.194.0.5      0 9942 1 7170 1455 i
* 6.4.0.0/16       203.194.0.5      0 9942 1 7170 1455 i
* 6.5.0.0/19       203.194.0.5      0 9942 1 7170 1455 i
* 6.8.0.0/20       203.194.0.5      0 9942 1 7170 1455 i
* 6.9.0.0/20       203.194.0.5      0 9942 1 7170 1455 i
* 6.10.0.0/15      203.194.0.5      0 9942 1 7170 1455 i
* 6.14.0.0/15      203.194.0.5      0 9942 1 7170 1455 i
* 9.2.0.0/16       203.194.0.5      0 9942 1239 701 i
* 9.184.112.0/20   203.194.0.5      0 9942 3786 i
* 9.186.144.0/20   203.194.0.5      0 9942 3786 i
* 12.0.0.0         203.194.0.5      0 9942 1239 7018 i
* 12.0.48.0/20     203.194.0.5      0 9942 16631 16631 16631 1742 i
3
Routing table format (1/3)
Destination: the IP address of the packet's final destination.
Next hop: the IP address to which the packet is forwarded.
Interface: the outgoing network interface the device should use when forwarding the packet to the next hop or final destination.
Metric: assigns a cost to each available route so that the most cost-effective path can be chosen.
4
Routing table format (2/3)
Routes: includes (1) directly attached subnets, (2) indirect subnets that are not attached to the device but can be accessed through one or more hops, and (3) default routes to use for certain types of traffic or when information is lacking.
5
Routing table format (3/3)
Routing tables can be maintained manually or dynamically.
Tables for static network devices do not change unless an administrator manually changes them.
In dynamic routing, devices build and maintain their routing tables automatically by using routing protocols to exchange information about the surrounding network topology.
Dynamic routing tables allow devices to "listen" to the network and respond to occurrences like device failures and network congestion.
6
7
Prefix
Prefix Length Distribution
[Figure: prefix length distribution (oix routing table 2002/12/01 [12]). x-axis: prefix length (0 to 32); y-axis: number of entries (log scale, 1 to 100000).]
8
Prefix
Length format: b_{n-1}…b_0/l (l is the prefix length). In IPv4, d3.d2.d1.d0/l can also be used.
Mask format: b_{n-1}…b_0/m_{n-1}…m_0 (prefix length is l), where m_j = 1 for all n – 1 ≥ j ≥ n – l + 1, and m_j = 0 otherwise. In IPv4, d3.d2.d1.d0/m3.m2.m1.m0 can be used.
Ternary format: b_{n-1}…b_{n-l+1}*…* (prefix length is l), where b_j = 0 or 1 for n – 1 ≥ j ≥ n – l + 1. Writing the ternary digits as t_{n-1}…t_0, if t_k is *, then t_j must also be * for all j < k.
A single don't care bit can be used to denote a series of don't care bits, e.g., 1* denotes 1**** in the 5-bit address space.
9
Prefix
(n+1)-bit format: b_{n-1}…b_{n-l+1}10…0 (l is the prefix length). For the prefix b_{n-1}…b_{n-l+1}* of length l in ternary format, there is one trailing '1' followed by n – l 0's.
Alternative (n+1)-bit format: b_{n-1}…b_{n-l+1}01…1. For the prefix b_{n-1}…b_{n-l+1}* of length l in ternary format, there is one trailing '0' followed by n – l 1's.
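The two (n+1)-bit encodings above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `encode_prefix` and the `trailing_one` flag are hypothetical and do not come from the slides.

```python
def encode_prefix(ternary: str, trailing_one: bool = True) -> str:
    """Map an n-character ternary prefix such as '1110*' to an (n+1)-bit code.

    trailing_one=True  -> known bits, then '1', then 0's (first format)
    trailing_one=False -> known bits, then '0', then 1's (second format)
    """
    bits = ternary.rstrip("*")          # the known (non-wildcard) bits
    if any(c not in "01" for c in bits):
        raise ValueError("wildcards may only appear at the end of a prefix")
    pad = len(ternary) - len(bits)      # number of don't care bits
    if trailing_one:
        return bits + "1" + "0" * pad
    return bits + "0" + "1" * pad


# 5-bit prefixes mapped into the 6-bit address space
print(encode_prefix("1110*"))                      # 111010
print(encode_prefix("11***", trailing_one=False))  # 110111
```

Because every prefix of every length maps to a distinct (n+1)-bit number, prefixes can be compared as ordinary integers, which is what makes this format convenient for sorting.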
10
5-bit Prefixes: b_{n-1}…b_{n-l+1}10…0
[Figure: 5-bit prefixes mapped into the 6-bit binary address space; 000000 is not used. The figure also shows the prefixes ***** and 0****.]
Prefix   6-bit code
11111    111111
11110    111101
11101    111011
11100    111001
1111*    111110
1110*    111010
111**    111100
11***    111000
00011    000111
00010    000101
00001    000011
00000    000001
0001*    000110
0000*    000010
000**    000100
00***    001000
11
5-bit Prefixes: b_{n-1}…b_{n-l+1}01…1
[Figure: 5-bit prefixes mapped into the 6-bit binary address space; 111111 is not used. The figure also shows the prefixes ***** and 0****.]
Prefix   6-bit code
11111    111110
11110    111100
11101    111010
11100    111000
1111*    111101
1110*    111001
111**    111011
11***    110111
00011    000110
00010    000100
00001    000010
00000    000000
0001*    000101
0000*    000001
000**    000011
00***    000111
12
Prefix properties
Disjoint prefixes: two prefixes are said to be disjoint if they do not share any address.
Prefix enclosure: let A = b_{n-1}…b_j…b_i* and B = b_{n-1}…b_j* with j > i. Prefix A is enclosed by B (B ⊃ A) since the IP address space covered by A is a subset of that covered by B, where ⊃ is the enclosure operator. Enclosure is a special case of overlapping.
Prefix comparison: the inequality 0 < * < 1 is used to compare two prefixes in the ternary representation of prefixes.
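A minimal sketch of these properties, assuming prefixes are written as ternary strings whose wildcards appear only at the end. The helper names `relation` and `sort_key` are hypothetical, chosen here for illustration.

```python
ORDER = {"0": 0, "*": 1, "1": 2}   # encodes the inequality 0 < * < 1


def relation(a: str, b: str) -> str:
    """Classify two ternary prefixes as equal, enclosing, or disjoint."""
    pa, pb = a.rstrip("*"), b.rstrip("*")
    if pa == pb:
        return "equal"
    if pa.startswith(pb):
        return "b encloses a"      # b is shorter and is a leading part of a
    if pb.startswith(pa):
        return "a encloses b"
    return "disjoint"


def sort_key(prefix: str, n: int) -> list:
    """Comparison key for an n-bit address space using 0 < * < 1."""
    padded = prefix.rstrip("*").ljust(n, "*")
    return [ORDER[c] for c in padded]


prefixes = ["111*", "10*", "1010*", "10101"]
print(relation("1010*", "10*"))                             # b encloses a
print(sorted(prefixes, key=lambda p: sort_key(p, 5)))
```

Note that two prefixes are always either disjoint or in an enclosure relationship; they can never partially overlap.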
13
Prefix properties
The most specific prefixes (MSP): the prefixes that do not cover any others. They are disjoint, so they can be put in an array for binary search.
Grouping prefixes in layers based on MSP: six layers at most for IPv4 tables.
[Figure: prefixes grouped into layers numbered 1 to 5 based on MSP.]
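The layer grouping can be sketched as repeated peeling of most specific prefixes. This sketch assumes that layer 1 is the set of MSPs and that each further layer is the set of MSPs among the prefixes that remain; the function name `msp_layers` is hypothetical, and the quadratic scan is chosen for clarity rather than speed.

```python
def msp_layers(prefixes):
    """Group ternary prefixes into layers by repeatedly removing the MSPs."""
    remaining = {p.rstrip("*") for p in prefixes}   # keep only the known bits
    layers = []
    while remaining:
        # A prefix is most specific if it covers no other remaining prefix.
        layer = {
            p for p in remaining
            if not any(q != p and q.startswith(p) for q in remaining)
        }
        layers.append(sorted(layer))
        remaining -= layer
    return layers


print(msp_layers(["111*", "10*", "1010*", "10101"]))
# [['10101', '111'], ['1010'], ['10']]
```

Each layer is a set of pairwise disjoint prefixes, so each layer on its own can be stored in a sorted array and searched with a plain binary search.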
14
Prefix properties
Database (year-month)   AS6447 (2000-4)   AS6447 (2002-4)    AS6447 (2005-4)
Number of prefixes      79,530            124,798            163,535
Level-1 prefixes        73,891 (92.9%)    114,745 (91.9%)    150,245 (91.9%)
Level-2 prefixes        4,874 (6.1%)      8,496 (6.8%)       11,135 (6.8%)
Level-3 prefixes        642 (0.8%)        1,290 (1%)         1,775 (1.1%)
Level-4 prefixes        104 (0.1%)        235 (0.2%)         329 (0.2%)
Level-5 prefixes        17                29                 45
Level-6 prefixes        2                 3                  6
15
Prefix properties
[Figure: plot with axes "Prefix length" and "Number".]
16
Prefix
Forwarding table example:
Prefix        Next-hop
P1  111*      H1
P2  10*       H2
P3  1010*     H3
P4  10101     H4
P1 is disjoint from the other three prefixes.
P2 ⊃ P3 ⊃ P4.
Lookups use longest prefix match (LPM), not exact match.
Enclosure makes (1) sorting prefixes and (2) binary searching prefixes difficult.
17
Example Forwarding Table
Prefix        Next-hop
P1  111*      H1
P2  10*       H2
P3  1010*     H3
P4  10101     H4
Longest prefix match (LPM), not exact match.
Prefix enclosure makes (1) sorting prefixes and (2) binary searching prefixes difficult, so trie-based schemes emerge naturally.
18
Binary Trie (Radix Trie)
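P1  111*      H1
P2  10*       H2
P3  1010*     H3
P4  10101     H4
[Figure: binary trie with nodes A to H storing P1 to P4; edges are labeled 0 and 1. Annotations: "Lookup 10111" and "Add P5 = 1110*", which adds node I on a 0 edge to hold P5.]
Trie node: next-hop-ptr (if prefix), left-ptr, right-ptr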
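A minimal sketch of the binary trie, using the node layout from the slide (a next-hop pointer plus left and right child pointers). The class and method names are hypothetical, and addresses are given as bit strings for readability.

```python
class TrieNode:
    __slots__ = ("next_hop", "left", "right")

    def __init__(self):
        self.next_hop = None   # set only if this node represents a prefix
        self.left = None       # child reached by bit 0
        self.right = None      # child reached by bit 1


class BinaryTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix: str, next_hop: str) -> None:
        node = self.root
        for bit in prefix.rstrip("*"):
            attr = "right" if bit == "1" else "left"
            child = getattr(node, attr)
            if child is None:
                child = TrieNode()
                setattr(node, attr, child)
            node = child
        node.next_hop = next_hop

    def lookup(self, address: str):
        """Return the next hop of the longest matching prefix, or None."""
        node = self.root
        best = node.next_hop
        for bit in address:
            node = node.right if bit == "1" else node.left
            if node is None:
                break
            if node.next_hop is not None:
                best = node.next_hop   # remember the longest match so far
        return best


trie = BinaryTrie()
for prefix, hop in [("111*", "H1"), ("10*", "H2"), ("1010*", "H3"), ("10101", "H4")]:
    trie.insert(prefix, hop)
print(trie.lookup("10111"))   # H2: the walk stops below 101, P2 is the best match
trie.insert("1110*", "H5")    # the "Add P5 = 1110*" step, with a made-up next hop H5
print(trie.lookup("11100"))   # H5
```

The lookup visits at most one node per address bit, so the worst case is 32 node visits for IPv4 and 128 for IPv6.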
National Cheng Kung University, Department of Computer Science and Information Engineering, CIAL Lab 19
Binary prefix search
Definition 1 (Prefix comparison): the inequality 0 < * < 1 is used to compare two prefixes in the ternary format.
20
Binary prefix search
Directly performing a binary search on the list of sorted prefixes may encounter a failure, e.g., for Dst = 01011000.
[Figure: binary search steps 1 to 4 over the sorted prefixes, marking the correct match and the failed match.]
21
Binary prefix search
The enclosure relationship between prefixes causes the search failure.
Remedy: generate auxiliary prefixes that inherit the routing information of the original LPM (e.g., F) and put them where the binary search operations can find them, e.g., the auxiliary prefix 01011000.
Therefore, it is feasible to split prefix F into two parts such that both sides of prefix O are covered.
22
Binary prefix search
The full tree expansion: the full tree expansion splits the enclosure prefixes into many longer prefixes (leaf pushing).
Auxiliary prefix merges: many auxiliary prefixes may inherit the same routing information of a common enclosure prefix. These prefixes can be merged into one. The merge operation is defined as follows.
Prefix merge: the prefix obtained by merging a set of consecutive prefixes is the longest common ancestor (LCA) of these consecutive prefixes in the binary trie.
23
Binary prefix search
The full tree expansion
[Figure: the full tree expansion; label F3 = 01011000.]
24
Binary prefix search
The full tree after the merge operations
[Figure: the full tree after the merge operations; label F3 = 01011000.]
25
Binomial spanning tree
[Figure: A 4-cube and its corresponding binomial spanning tree. The tree is rooted at 0000; labeled nodes include 1000, 1100, 1110, and 1111; dimensions are numbered 0 to 3.]
26
Perfect code: Hamming code (7, 4)
7-cube example:
0000000
1000000 0100000 0010000 0001000 0000100 0000010 0000001
2^4 (16) one-level binomial spanning trees = 7-cube
27
Perfect code: Hamming code (7, 4)
r = received code
Syndrome s = (s2 s1 s0) = r · H7^T (· is the inner product, T is the transpose)
Corrected code = r + ErrorPattern[s]

H7 =
1 1 0 1 1 0 0
1 0 1 1 0 1 0
0 1 1 1 0 0 1

G7 =
1 0 0 0 1 1 0
0 1 0 0 1 0 1
0 0 1 0 0 1 1
0 0 0 1 1 1 1
(a) Parity-check and generator matrices of Hamming code (7, 4).

Syndrome   ErrorPattern
000        0000-000
001        0000-001
010        0000-010
011        0010-000
100        0000-100
101        0100-000
110        1000-000
111        0001-000
(c) Decoding table
28
Perfect code: Hamming code (7, 4)
Generate the 16 codewords as u · G7.
u      Codeword
0000   0000-000
0001   0001-111
0010   0010-011
0011   0011-100
0100   0100-101
0101   0101-010
0110   0110-110
0111   0111-001
1000   1000-110
1001   1001-001
1010   1010-101
1011   1011-010
1100   1100-011
1101   1101-100
1110   1110-000
1111   1111-111
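The encoding and syndrome decoding steps can be sketched directly from the G7 and H7 matrices of the slides. The function names below are hypothetical; the decoding table is derived by computing the syndrome of each single-bit error rather than being typed in by hand.

```python
G7 = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
H7 = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]


def encode(u):
    """Codeword = u . G7 over GF(2), for a 4-bit message u."""
    return [sum(u[i] * G7[i][j] for i in range(4)) % 2 for j in range(7)]


def syndrome(r):
    """s = (s2, s1, s0) = r . H7^T over GF(2)."""
    return tuple(sum(r[j] * row[j] for j in range(7)) % 2 for row in H7)


# Decoding table: syndrome -> single-bit error pattern.
ERROR_PATTERN = {(0, 0, 0): [0] * 7}
for pos in range(7):
    e = [0] * 7
    e[pos] = 1
    ERROR_PATTERN[syndrome(e)] = e


def correct(r):
    """Corrected code = r + ErrorPattern[s] over GF(2)."""
    e = ERROR_PATTERN[syndrome(r)]
    return [(a + b) % 2 for a, b in zip(r, e)]


codeword = encode([1, 0, 1, 1])
print(codeword)                 # [1, 0, 1, 1, 0, 1, 0]
received = codeword.copy()
received[2] ^= 1                # flip one bit in transit
print(correct(received) == codeword)   # True
```

Since the seven single-bit errors and the error-free case use all 8 possible syndromes, every 7-bit word lies within distance 1 of exactly one codeword, which is what makes the code perfect.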
29
Perfect code: Golay code (23, 12)
2^12 3-level binomial spanning trees
C(23,0) + C(23,1) + C(23,2) + C(23,3) = 1 + 23 + 23*22/2 + 23*22*21/(3*2) = 24 + 23*11 + 23*11*7 = 24 + 253*8 = 24 + 2024 = 2048 = 2^11
30
Ranges
Why ranges?
Prefixes can also be represented by ranges.
The source/destination port fields of rule tables for packet classification are ranges.
Prefixes are special cases of ranges: prefix b_{n-1}…b_{n-l+1}* of length l is the range of addresses from b_{n-1}…b_{n-l+1}0…0 to b_{n-1}…b_{n-l+1}1…1, denoted as [b_{n-1}…b_{n-l+1}0…0, b_{n-1}…b_{n-l+1}1…1].
Overlapping: two ranges are overlapping if they are not disjoint.
Partially overlapping: two ranges are partially overlapping if they are neither disjoint nor enclosing.
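As a sketch of the definitions above, the following Python converts a ternary prefix to its address range and classifies the relationship between two ranges. The function names `prefix_to_range` and `range_relation` are hypothetical.

```python
def prefix_to_range(prefix: str, n: int):
    """Return (start, finish) of a ternary prefix in an n-bit address space."""
    bits = prefix.rstrip("*")
    pad = n - len(bits)
    start = int(bits + "0" * pad, 2)    # all don't care bits set to 0
    finish = int(bits + "1" * pad, 2)   # all don't care bits set to 1
    return start, finish


def range_relation(r1, r2) -> str:
    """Classify two closed ranges [e1, f1] and [e2, f2]."""
    (e1, f1), (e2, f2) = r1, r2
    if f1 < e2 or f2 < e1:
        return "disjoint"
    if (e1 <= e2 and f2 <= f1) or (e2 <= e1 and f1 <= f2):
        return "enclosing"
    return "partially overlapping"


print(prefix_to_range("01011*", 6))          # (22, 23)
print(range_relation((0, 15), (4, 7)))       # enclosing
print(range_relation((0, 10), (5, 20)))      # partially overlapping
```

Ranges obtained from prefixes are always either disjoint or enclosing; partial overlap only arises with arbitrary ranges such as port ranges.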
31
Elementary Intervals for Ranges
Definition: Let the set of k elementary intervals constructed from a set R of ranges in the address space 0 … N – 1 be X = {X_i | X_i = [e_i, f_i], for i = 1 to k}. X must satisfy the following:
1) e_1 = 0 and f_k = N – 1,
2) f_i = e_{i+1} – 1 for i = 1 to k – 1,
3) all addresses in X_i are covered by the same subset of R (called the range matching set of X_i), denoted by EI_i, and
4) EI_i ≠ EI_{i+1}, for i = 1 to k – 1.
32
Elementary Intervals for Ranges
                             Minus-1          Traditional
ID   Prefix     Range      start  finish    start  finish
P1   000000/2   [0, 15]    -      15        0      15
P2   010000/2   [16, 31]   15     31        16     31
P3   000100/4   [4, 7]     3      7         4      7
P4   100000/1   [32, 63]   31     -         32     63
P5   010110/5   [22, 23]   21     23        22     23
P6   110000/2   [48, 63]   47     -         48     63
P7   110000/4   [48, 51]   47     51        48     51
P8   110111/6   [55, 55]   54     55        55     55
P9   100000/3   [32, 39]   31     39        32     39
33
Elementary Intervals for Ranges
Graphical view
Interval   Addresses   Range matching set
X1         [0, 3]      EI1 = {P1}
X2         [4, 7]      EI2 = {P1, P3}
X3         [8, 15]     EI3 = {P1}
X4         [16, 21]    EI4 = {P2}
X5         [22, 23]    EI5 = {P2, P5}
X6         [24, 31]    EI6 = {P2}
X7         [32, 39]    EI7 = {P4, P9}
X8         [40, 47]    EI8 = {P4}
X9         [48, 51]    EI9 = {P4, P6, P7}
X10        [52, 54]    EI10 = {P4, P6}
X11        [55, 55]    EI11 = {P4, P6, P8}
X12        [56, 63]    EI12 = {P4, P6}
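The elementary intervals for the nine example ranges can be computed with a short sweep over the range endpoints. This is a sketch under the definition given earlier; the function name `elementary_intervals` and the tuple layout of the result are hypothetical.

```python
def elementary_intervals(ranges, n_addresses):
    """Return a list of (start, finish, matching_set) elementary intervals.

    ranges maps a range name to its closed interval (start, finish).
    """
    # Every range start, and every address right after a range finish,
    # opens a new candidate interval.
    starts = {0}
    for lo, hi in ranges.values():
        starts.add(lo)
        if hi + 1 < n_addresses:
            starts.add(hi + 1)
    starts = sorted(starts)

    result = []
    for i, e in enumerate(starts):
        f = starts[i + 1] - 1 if i + 1 < len(starts) else n_addresses - 1
        matching = frozenset(
            name for name, (lo, hi) in ranges.items() if lo <= e and f <= hi
        )
        if result and result[-1][2] == matching:
            # Enforce condition 4: adjacent intervals must differ in their sets.
            result[-1] = (result[-1][0], f, matching)
        else:
            result.append((e, f, matching))
    return result


RANGES = {
    "P1": (0, 15), "P2": (16, 31), "P3": (4, 7), "P4": (32, 63), "P5": (22, 23),
    "P6": (48, 63), "P7": (48, 51), "P8": (55, 55), "P9": (32, 39),
}
for start, finish, names in elementary_intervals(RANGES, 64):
    print(f"[{start}, {finish}]", sorted(names))
```

For this input the sweep yields the 12 intervals X1 to X12 listed in the graphical view.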
34
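Segment Tree
[Figure: segment tree built over the elementary intervals. The leaf nodes are X1 [0,3], X2 [4,7], X3 [8,15], X4 [16,21], X5 [22,23], X6 [24,31], X7 [32,39], X8 [40,47], X9 [48,51], X10 [52,54], X11 [55,55], and X12 [56,63]. Internal nodes are labeled y, w, z, u, v, h, q, g, r, s, t and carry split keys such as 3, 7, 15, 21, 23, 31, 39, 47, 51, 54, and 55. The labels P1 to P9 mark the nodes at which each range is stored.]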
35
Interval Tree
Each node in an interval tree is associated with a key, which must be covered by at least one range.
Interval trees are fat or thin depending on whether a node can store more than one range or exactly one.
Fat interval tree: each node is allowed to store more than one range. The number of nodes in the interval tree is O(N).
To insert a range R = [e, f]: if R covers the root's key, R is stored in the root. Otherwise, R is inserted in the left (right) subtree of the root when f is smaller (e is larger) than the key of the root.
When R does not cover the key of any node that is traversed, a new node with a key selected from addresses e to f is created and inserted as the left or right child of the node that was last visited.
Search takes O(log N + k) time, where k is the number of prefixes that match the given address.
Prefix insertion and deletion are very expensive because ranges in some nodes may need relocation after tree rotations.
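The insertion rule for the fat interval tree can be sketched as follows. This is a simplified, unbalanced version: it performs no rotations, picks the midpoint of the range as the key of a new node (an assumption; the slides only say the key is selected from e to f), and scans each node's ranges linearly, so it does not achieve the O(log N + k) bound. All names are hypothetical.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.ranges = []     # (e, f, name) tuples; every one covers self.key
        self.left = None
        self.right = None


def insert(node, e, f, name):
    """Insert range [e, f] and return the (possibly new) subtree root."""
    if node is None:
        node = Node((e + f) // 2)            # assumed key choice: the midpoint
        node.ranges.append((e, f, name))
    elif e <= node.key <= f:
        node.ranges.append((e, f, name))     # the range covers this node's key
    elif f < node.key:
        node.left = insert(node.left, e, f, name)
    else:
        node.right = insert(node.right, e, f, name)
    return node


def stab(node, addr):
    """Return the names of all ranges that contain addr."""
    matches = []
    while node is not None:
        matches += [name for e, f, name in node.ranges if e <= addr <= f]
        if addr == node.key:
            break    # ranges in either subtree cannot contain this key
        node = node.left if addr < node.key else node.right
    return matches


root = None
for name, (e, f) in {
    "P1": (0, 15), "P2": (16, 31), "P3": (4, 7), "P4": (32, 63), "P5": (22, 23),
    "P6": (48, 63), "P7": (48, 51), "P8": (55, 55), "P9": (32, 39),
}.items():
    root = insert(root, e, f, name)
print(sorted(stab(root, 55)))   # ['P4', 'P6', 'P8']
```

A production version would keep the tree balanced and keep each node's ranges sorted by endpoint, which is where the relocation cost mentioned above comes from.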
36
Interval Tree
Thin interval tree: each node of the interval tree stores exactly one range.
Since ranges may overlap, two comparison rules are used to decide whether a range is smaller or larger than another range. For two ranges R1 = [e1, f1] and R2 = [e2, f2], R1 < R2 if e1 < e2. If there is a tie, the second rule applies: R1 < R2 if R2 is a subrange of R1 (i.e., e1 = e2 and f2 < f1).
Each node also stores a max value: the maximum of the finish endpoints of all ranges stored in the subtree rooted at that node.
In contrast with the fat interval tree, prefix insertion and deletion take O(log N) time. However, O(min{N, k log N}) time is needed to find the longest matching prefix as well as the highest-priority matching prefix, where k is the number of matched prefixes for a given address.
37
Hash Table
Narrowing down the search space.
Index = Hash_function(key) % m, where key may be the first k bits of IP addresses and m is the size of the hash table.
Perfect hash: no collision.
Minimal perfect hash: a perfect hash where the size of the hash table is k for k different hashing keys.
38
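Hash Table
[Figure: keys k1 and k2 are mapped by H(k1) % m and H(k2) % m into an array of m elements; the two keys land in the same element, causing a collision.]
Difficulties: prefixes and ranges cannot be used as the keys of the hash functions directly.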
39
Hash Table: 8-bit Segmentation table
An 8-bit segmentation table is usually used for IPv4 forwarding tables because there is no prefix of length shorter than 8.
H(prefix) % 256 (the 8 MSB bits of the prefix)
[Figure: array of 256 elements indexed 0, 1, …, 255. Each element points to the prefixes with the same first 8 MSB bits, e.g., element 0 holds prefixes of the form 0.x.y.z; the set may be empty.]
40
Hash Table: 16-bit Segmentation table
Prefixes of length <= 16 must be stored properly.
For example, duplicate 0.0.b.c/15 into buckets 0 and 1, or store the port of 0.0.b.c/15 into elements 0 and 1.
Alternatively, put them into another set (good for update, but two sets need to be searched in the worst case).
H(prefix) % 2^16 (the 16 MSB bits of the prefix)
[Figure: array of 2^16 elements indexed 0, 1, …, 2^16 – 1. Each element points to the prefixes with the same first 16 MSB bits, e.g., element 0 holds prefixes of the form 0.0.y.z; the set may be empty. Prefixes of length 16 are marked separately.]
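A minimal sketch of a 16-bit segmentation table that follows the duplication strategy described above: a prefix shorter than 16 bits is copied into every bucket it covers. The names `SEG_BITS`, `build`, and `lookup` are hypothetical, and each bucket is searched linearly for simplicity.

```python
import ipaddress

SEG_BITS = 16


def build(routes):
    """routes: iterable of (cidr_string, next_hop) pairs for IPv4."""
    table = [[] for _ in range(1 << SEG_BITS)]
    for cidr, hop in routes:
        net = ipaddress.ip_network(cidr)
        value, length = int(net.network_address), net.prefixlen
        first = value >> (32 - SEG_BITS)
        # A prefix of length < 16 covers 2^(16 - length) consecutive buckets.
        span = 1 << max(0, SEG_BITS - length)
        for idx in range(first, first + span):
            table[idx].append((value, length, hop))
    return table


def lookup(table, addr):
    """Longest prefix match inside the bucket selected by the 16 MSB bits."""
    a = int(ipaddress.ip_address(addr))
    best = None
    for value, length, hop in table[a >> (32 - SEG_BITS)]:
        mask = ((1 << length) - 1) << (32 - length) if length else 0
        if a & mask == value and (best is None or length > best[0]):
            best = (length, hop)
    return best[1] if best else None


table = build([("0.0.0.0/15", "A"), ("0.1.2.0/24", "B")])
print(lookup(table, "0.1.2.3"))   # B
print(lookup(table, "0.1.9.9"))   # A (the /15 was duplicated into bucket 1)
```

Duplication keeps lookups to a single bucket at the cost of more expensive updates; the alternative from the slide, a separate set for short prefixes, makes the opposite trade.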
41
Hash Table: Compression
Since there are many empty elements in the segmentation table, we can use a bitmap to compress the segmentation table.
[Figure: a 2^16-bit bitmap containing M 1's (e.g., 1100...011001) selects entries in an array of M elements indexed 0, 1, …, M – 1. Each element points to the prefixes with the same first 16 MSB bits, e.g., 0.0.y.z and 0.1.y.z, and must be non-empty.]
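The bitmap compression can be sketched with a rank (population count) operation: the position of segment i in the compressed array equals the number of 1 bits before bit i in the bitmap. The class name is hypothetical, and a Python integer stands in for the 2^16-bit bitmap; a hardware or C implementation would typically precompute per-word counts instead of counting bits on every access.

```python
class CompressedSegmentTable:
    def __init__(self, buckets):
        """buckets: dict mapping a segment index to its non-empty bucket."""
        self.bitmap = 0
        self.elements = []               # only the M non-empty buckets
        for idx in sorted(buckets):
            self.bitmap |= 1 << idx      # mark segment idx as non-empty
            self.elements.append(buckets[idx])

    def get(self, idx):
        """Return the bucket of segment idx, or None if it is empty."""
        if not (self.bitmap >> idx) & 1:
            return None
        # Rank: count the 1 bits strictly below position idx.
        rank = bin(self.bitmap & ((1 << idx) - 1)).count("1")
        return self.elements[rank]


table = CompressedSegmentTable({0: ["0.0.y.z"], 1: ["0.1.y.z"], 300: ["1.44.y.z"]})
print(table.get(1))      # ['0.1.y.z']
print(table.get(2))      # None
print(table.get(300))    # ['1.44.y.z']
```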
42
Bloom filter
H1(key) = P1
H2(key) = P2
H3(key) = P3
H4(key) = P4
…
Hk(key) = Pk
Hi() is a hash function, e.g., MD5.
[Figure: the k hash functions map a key to positions P1 … Pk in a bit vector of m bits; the bits at those positions are set to 1.]
43
Bloom filter
After inserting n keys (kn bits), the probability that a particular bit is still 0 is (1 - 1/m)^{kn}.
So, the probability of a false positive is
p = (1 - (1 - 1/m)^{kn})^k ≈ (1 - e^{-kn/m})^k
The right-hand side is minimized when k = (ln 2)(m/n).
m/n = 6, k = 4: p = 0.0561
m/n = 8, k = 6: p = 0.0215
m/n = 12, k = 8: p = 0.00314
m/n = 16, k = 11: p = 0.000458
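The approximation p ≈ (1 - e^{-kn/m})^k is easy to evaluate numerically, and a basic Bloom filter takes only a few lines. In this sketch the k hash functions are derived by salting MD5 with the function index, which is an assumption made here for illustration; the class and function names are hypothetical.

```python
import hashlib
import math


def false_positive_rate(bits_per_key: float, k: int) -> float:
    """Approximate false positive probability for m/n = bits_per_key."""
    return (1 - math.exp(-k / bits_per_key)) ** k


class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # one byte per bit, for clarity

    def _positions(self, key: str):
        for i in range(self.k):
            # Assumed construction: salt MD5 with i to get the i-th hash.
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p] = 1

    def __contains__(self, key: str) -> bool:
        return all(self.bits[p] for p in self._positions(key))


for ratio, k in [(6, 4), (8, 6), (12, 8), (16, 11)]:
    print(ratio, k, false_positive_rate(ratio, k))

bf = BloomFilter(m=1024, k=4)
bf.add("10.0.0.0/8")
print("10.0.0.0/8" in bf)   # True: a Bloom filter has no false negatives
```

The printed rates agree with the values listed above to within rounding.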
44
Bloom filter
Update: update the whole SC.
Threshold: when the digests differ beyond a threshold, say, 5% or 10%.
Regular time intervals: say, every 5 minutes.
45
Counting Bloom filter
Deletion operation for local digest: for each bit in the m-bit vector, use an l-bit counter to record the number of times that bit is turned on by different URLs.
l = 4 by experience.
If deletion is not supported, the cache summary must be rebuilt from scratch on a periodic basis to erase stale bits and prevent bit pollution.
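A counting Bloom filter with 4-bit counters can be sketched as follows. The counters saturate at 15 and a saturated counter is never decremented, which is one common convention and an assumption here; as in the previous sketch, the hash functions are derived by salting MD5, and the class name is hypothetical.

```python
import hashlib


class CountingBloomFilter:
    MAX_COUNT = 15                        # largest value of a 4-bit counter

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, key: str):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, key: str) -> None:
        for p in self._positions(key):
            if self.counters[p] < self.MAX_COUNT:
                self.counters[p] += 1

    def remove(self, key: str) -> None:
        """Remove a key that was previously added."""
        for p in self._positions(key):
            # A saturated counter has lost its exact count, so leave it alone.
            if 0 < self.counters[p] < self.MAX_COUNT:
                self.counters[p] -= 1

    def __contains__(self, key: str) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(key))


cbf = CountingBloomFilter(m=1024, k=4)
cbf.add("http://example.com/a")
print("http://example.com/a" in cbf)   # True
cbf.remove("http://example.com/a")
print("http://example.com/a" in cbf)   # False
```

The m-bit vector that is actually exchanged can be derived from the counters by setting bit i whenever counter i is non-zero.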