Towards a Billion Routing Lookups per Second in Software

Transcript of Towards a Billion Routing Lookups per Second in Software

Page 1: Towards a Billion Routing Lookups per Second in Software


Towards a Billion Routing Lookups per Second in Software

Authors: Marko Zec, Luigi Rizzo, Miljenko Mikuc

Publisher: SIGCOMM Computer Communication Review, 2012

Presenter: Yuen-Shuo Li

Date: 2012/12/12

Page 2: Towards a Billion Routing Lookups per Second in Software


Outline

Idea
Introduction
Search Data Structure
Lookup Algorithm
Updating
Performance

Page 3: Towards a Billion Routing Lookups per Second in Software


Idea

Can a software routing implementation compete in a field generally reserved for specialized lookup hardware?

Page 4: Towards a Billion Routing Lookups per Second in Software


Introduction(1/3)

Software routing lookups have fallen out of the research community's focus over the past decade, as the performance of general-purpose CPUs could not keep pace with quickly increasing packet rates.

Software vs Hardware?

Page 5: Towards a Billion Routing Lookups per Second in Software


Introduction(2/3)

The landscape has changed, however.

1. The performance potential of general-purpose CPUs has increased significantly over the past decade.

• instruction-level parallelism, shorter execution pipelines, etc.

2. The increasing interest in virtualized systems and software defined networks calls for a communication infrastructure that is more flexible and less tied to the hardware.

Page 6: Towards a Billion Routing Lookups per Second in Software


Introduction(3/3)

DXR is an efficient IPv4 lookup scheme for software running on general-purpose CPUs.

The primary goal was not to implement a universal routing lookup scheme: the data structures and algorithms are optimized exclusively for the IPv4 protocol.

DXR distills a real-world BGP snapshot with 417,000 prefixes into a structure consuming only 782 KBytes, less than 2 bytes per prefix, and achieves 490 MLps.

MLps: million lookups per second

Page 7: Towards a Billion Routing Lookups per Second in Software


Search Data Structure(1/11)

Building the search data structure begins by expanding all prefixes from the database into address ranges.
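As a minimal sketch of this step in C (names are illustrative, not taken from the paper), a prefix given as an (address, length) pair expands into the inclusive address range it covers:

    #include <stdint.h>

    /* Expand an IPv4 prefix (addr, plen) into the inclusive
     * address range [*start, *end] that it covers. */
    static void
    prefix_to_range(uint32_t addr, int plen, uint32_t *start, uint32_t *end)
    {
        uint32_t mask = (plen == 0) ? 0 : 0xffffffffU << (32 - plen);

        *start = addr & mask;   /* first address covered by the prefix */
        *end = *start | ~mask;  /* last address covered by the prefix */
    }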

Page 8: Towards a Billion Routing Lookups per Second in Software


Search Data Structure(2/11)

Neighboring address ranges that resolve to the same next hop are then merged.
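A sketch of that merge pass, assuming the ranges are already sorted by start address (the struct and field names are assumptions, not the paper's):

    struct range {
        uint32_t start;     /* first address of the range */
        uint32_t nexthop;   /* next-hop index */
    };

    /* Collapse neighboring ranges that resolve to the same
     * next hop; returns the number of entries that remain. */
    static int
    merge_ranges(struct range *r, int n)
    {
        int i, j = 0;

        for (i = 1; i < n; i++)
            if (r[i].nexthop != r[j].nexthop)
                r[++j] = r[i];
        return (n ? j + 1 : 0);
    }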

Page 9: Towards a Billion Routing Lookups per Second in Software


Search Data Structure(3/11)

Each entry only needs the start address and the next hop.

This is the search data we need!

Page 10: Towards a Billion Routing Lookups per Second in Software


Search Data Structure(4/11)

Page 11: Towards a Billion Routing Lookups per Second in Software


Search Data Structure(5/11)

Lookup table

The search is shortened by splitting the entire address range into chunks, using the initial k bits of the address to reach the right chunk through a lookup table.

With k = 16, there are 2^16 chunks.

Page 12: Towards a Billion Routing Lookups per Second in Software


Search Data Structure(6/11)

Each entry in the lookup table must contain the position and size of the corresponding chunk in the range table.

A special value for size indicates that the 19 bits are an offset into the next-hop table.

Lookup table entry (32 bits):

    format   size      position
    1 bit    12 bits   19 bits

With k = 16, there are 2^16 lookup table entries, one per chunk.
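A sketch of how such a 32-bit entry could be unpacked in C. Only the 1 + 12 + 19 bit split comes from the slide; the bit order and the sentinel value are assumptions:

    #include <stdint.h>

    /* 32-bit lookup table entry: format flag (1 bit),
     * chunk size (12 bits), position (19 bits). */
    #define LUT_FORMAT(e)   ((e) >> 31)             /* long or short chunk format */
    #define LUT_SIZE(e)     (((e) >> 19) & 0xfff)   /* range entries in the chunk */
    #define LUT_POS(e)      ((e) & 0x7ffff)         /* offset into the range table */

    /* Assumed sentinel: when SIZE holds this value, POS is an
     * offset into the next-hop table and the lookup is complete. */
    #define LUT_DIRECT      0xfff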

Page 13: Towards a Billion Routing Lookups per Second in Software


Search Data Structure(7/11)

Page 14: Towards a Billion Routing Lookups per Second in Software


Search Data Structure(8/11)

Range table

With k bits already consumed to index the lookup table, the range table entries only need to store the remaining 32-k address bits, and the next hop index.

Page 15: Towards a Billion Routing Lookups per Second in Software


Search Data Structure(9/11)

Range table

A further optimization, especially effective for large BGP views where the vast majority of prefixes are /24 or less specific, is to store such chunks in a short format instead of the long one.

Page 16: Towards a Billion Routing Lookups per Second in Software


Search Data Structure(10/11)

Range table

One bit in the lookup table entry is used to indicate whether a chunk is stored in long or short format.

Long format (32 bits per entry):

    lookup index   next hop
    16 bits        16 bits

Short format (16 bits per entry):

    lookup index   next hop
    8 bits         8 bits
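In C, the two entry layouts might be sketched as follows (field names are illustrative):

    /* Long format: 32 bits per entry, storing all 32 - k = 16
     * remaining address bits (for k = 16). */
    struct range_long {
        uint16_t start;     /* remaining 16 address bits */
        uint16_t nexthop;   /* next-hop index */
    };

    /* Short format: 16 bits per entry; sufficient when the low 8
     * of the remaining bits are zero, i.e. for prefixes /24 or
     * less specific. */
    struct range_short {
        uint8_t start;      /* upper 8 of the remaining 16 bits */
        uint8_t nexthop;    /* next-hop index */
    };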

Page 17: Towards a Billion Routing Lookups per Second in Software

Search Data Structure(11/11)

Page 18: Towards a Billion Routing Lookups per Second in Software

Lookup Algorithm(1/4)

The k leftmost bits of the key are used as an index into the lookup table.

Example: key 1.2.4.0 = 00000001.00000010.00000100.00000000; with k = 16, the first 16 bits (00000001.00000010) form the index.
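In code, extracting the index is a single shift (a sketch, with k = 16):

    uint32_t idx = key >> (32 - 16);    /* 1.2.4.0 -> index 0x0102 */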

Page 19: Towards a Billion Routing Lookups per Second in Software

Lookup Algorithm(2/4)

If the entry points to the next hop, the lookup is complete.

A special value for size indicates that the 19 bits are an offset into the next-hop table.

Page 20: Towards a Billion Routing Lookups per Second in Software

Lookup Algorithm(3/4)

Otherwise, the (position, size) information in the entry selects the portion of the range table on which to perform a binary search on the (remaining bits of the) key.
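Combining the steps, a minimal sketch of the whole lookup in C, assuming k = 16, the entry macros sketched earlier, long-format chunks only, and illustrative global tables (lookup_tbl, range_tbl are not the paper's names):

    /* Assumed globals:
     *   uint32_t lookup_tbl[1 << 16];
     *   struct range_long range_tbl[];  (built as described above) */

    /* Longest-prefix match for one IPv4 key; returns a next-hop index. */
    static uint32_t
    dxr_lookup(uint32_t key)
    {
        uint32_t e = lookup_tbl[key >> 16];     /* initial k = 16 bits */
        struct range_long *chunk;
        uint16_t rest;
        int lo, hi;

        if (LUT_SIZE(e) == LUT_DIRECT)          /* direct hit: done */
            return (LUT_POS(e));

        /* Binary search for the last range whose start address is
         * at or below the remaining 16 bits of the key. */
        chunk = &range_tbl[LUT_POS(e)];
        rest = key & 0xffff;
        lo = 0;
        hi = (int)LUT_SIZE(e) - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;

            if (chunk[mid].start <= rest)
                lo = mid;
            else
                hi = mid - 1;
        }
        return (chunk[lo].nexthop);
    }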

Page 21: Towards a Billion Routing Lookups per Second in Software

Lookup Algorithm(4/4)

Page 22: Towards a Billion Routing Lookups per Second in Software


Updating(1/3)

Lookup structures store only the information necessary for resolving next-hop lookups, so a separate database, which stores detailed information on all the prefixes, is required for rebuilding the lookup structures.

Page 23: Towards a Billion Routing Lookups per Second in Software


Updating(2/3)

Updates are handled as follows:

Updates covering multiple chunks (prefix length < k) are expanded.

Each chunk is then processed independently and rebuilt from scratch.

The rebuild finds the best matching route for the first IPv4 address belonging to the chunk and translates it into a range table entry.

If the range table heap contains only a single element when the process ends, the next hop can be directly encoded in the lookup table.
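A sketch of the per-update entry point under these rules. K and rebuild_chunk are hypothetical names, and addr is assumed to be aligned to its prefix:

    #define K 16                        /* lookup-table stride */

    static void rebuild_chunk(uint32_t chunk);  /* hypothetical */

    /* Rebuild every chunk touched by an update of prefix (addr, plen).
     * Prefixes shorter than K bits span 2^(K - plen) chunks. */
    static void
    route_change(uint32_t addr, int plen)
    {
        uint32_t first = addr >> (32 - K);
        uint32_t last = (plen < K) ?
            (first | ((1U << (K - plen)) - 1)) : first;
        uint32_t c;

        for (c = first; c <= last; c++)
            rebuild_chunk(c);
    }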

Page 24: Towards a Billion Routing Lookups per Second in Software


Updating(3/3)

The rebuild time can be reduced:

by coalescing multiple updates into a single rebuild; we have implemented this by delaying the reconstruction for several milliseconds after the first update is received

by processing chunks in parallel, if multiple cores are available
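A minimal sketch of the coalescing idea; the timer API and the delay constant are assumptions (the paper only says "several milliseconds"):

    #define REBUILD_DELAY_MS 50         /* assumed; "several milliseconds" */

    void timer_arm(int ms, void (*cb)(void));   /* hypothetical timer API */
    static void rebuild_all_dirty(void);

    static int updates_pending;

    /* Called for every received route update. */
    void
    on_route_update(void)
    {
        if (updates_pending++ == 0)     /* first update of a burst */
            timer_arm(REBUILD_DELAY_MS, rebuild_all_dirty);
    }

    /* Timer callback: one rebuild covers the whole burst; dirty
     * chunks could also be handed to multiple cores in parallel. */
    static void
    rebuild_all_dirty(void)
    {
        updates_pending = 0;
        /* ... rebuild each dirty chunk ... */
    }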

Page 25: Towards a Billion Routing Lookups per Second in Software


Performance(1/11)

Our primary test vectors were three IPv4 snapshots from routeviews.org BGP routers.

Page 26: Towards a Billion Routing Lookups per Second in Software


Performance(2/11)

Here we present the results obtained using the LINX snapshot, selected because it is the most challenging for our scheme:

it contains the highest number of next hops

and an unusually high number of prefixes more specific than /24

Page 27: Towards a Billion Routing Lookups per Second in Software


Performance(3/11)

Our primary test machine had a 3.6 GHz AMD FX-8150 8-core CPU and 4 GB of RAM. Each pair of cores shares a private 2 MB L2 cache block, for a total of 8 MB of L2 cache. All 8 cores share 8 MB of L3 cache.

We also ran some benchmarks on a 2.8 GHz Intel Xeon W3530 4-core CPU with smaller but faster L2 caches (256 KB per core) and a shared 8 MB L3 cache.

• All tests ran on the same platform/compiler (FreeBSD 8.3, amd64, gcc 4.2.2)

• Each test ran for 10 seconds

Page 28: Towards a Billion Routing Lookups per Second in Software


Performance(4/11)

We used three different request patterns:

REP (repeated): each key is looked up 16 times before moving to the next one.

RND (random): each key is looked up only once. The test loop has no data dependencies between iterations.

SEQ (sequential): same as RND, but the timing loop has an artificial data dependency between iterations, so that requests become effectively serialized.
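One way such a dependency can be created is to fold a bit of the previous result into the next key; the exact construction used in the paper may differ, and dxr_lookup is the illustrative entry point sketched earlier:

    /* SEQ timing loop: each key depends on the previous result,
     * so lookups cannot overlap and latency is fully exposed.
     * Assumed: uint32_t keys[nkeys] filled with random addresses. */
    uint32_t nh = 0;
    int i;

    for (i = 0; i < nkeys; i++) {
        uint32_t key = keys[i] ^ (nh & 1);  /* artificial dependency */

        nh = dxr_lookup(key);
    }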

Page 29: Towards a Billion Routing Lookups per Second in Software

Performance(5/11)


Page 30: Towards a Billion Routing Lookups per Second in Software


Performance(6/11)

Average lookup throughput with D18R is consistently higher than with D16R on our test systems (D16R and D18R denote DXR variants with k = 16 and k = 18, respectively).

Page 31: Towards a Billion Routing Lookups per Second in Software


Performance(7/11)

The distribution of binary search steps for uniformly random queries, observed using the LINX database: n = 0 means a direct match in the lookup table.

Page 32: Towards a Billion Routing Lookups per Second in Software


Performance(8/11)

Interference between multiple cores causes a modest slowdown of the individual threads.

Page 33: Towards a Billion Routing Lookups per Second in Software


Performance(9/11)

Comparison with DIR-24-8-BASIC.

Page 34: Towards a Billion Routing Lookups per Second in Software


Performance(10/11)

Page 35: Towards a Billion Routing Lookups per Second in Software


Performance(11/11)