
Router internals — 1

Router and switch architecture

Martin Heusse


Router internals — 2

Contents

• Router architecture

• Routing table data structure


Router internals — 3

What is routing?

• Packet reception

✓ Interface FIFO (ring buffer?) holds groups of bits as they arrive (ring-buffer sketch below)

✓ Packet queued until treated by the central CPU or the interface card's CPU (an interrupt is raised)

✓ Check CRC, is there space in memory…

✓ Packet classification (dropped? accepted? switching method?)

✓ Moved to input hold queue

[Figure: reception path — Interface → interface FIFO / ring buffer (sometimes) → classification → input queue → packet routing.]
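Not in the original slides: a minimal C sketch of an interface receive FIFO organised as a ring buffer of fixed-size frame slots. Names and sizes are invented; a real interface would use a DMA descriptor ring rather than copying frames.

```c
/* Hypothetical sketch of an interface FIFO as a ring buffer of
 * fixed-size frame slots (real NICs use DMA descriptor rings). */
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 256          /* power of two: indices wrap cleanly */
#define SLOT_BYTES 1536         /* >= MTU + Ethernet header           */

struct rx_ring {
    uint8_t  data[RING_SLOTS][SLOT_BYTES];
    uint16_t len[RING_SLOTS];
    uint32_t head;              /* next slot written by the interface */
    uint32_t tail;              /* next slot read by the CPU          */
};

/* Receive side: returns 0 if the ring is full (the frame is lost,
 * exactly as on a real interface under overrun). */
static int rx_ring_put(struct rx_ring *r, const uint8_t *frame, uint16_t n)
{
    if (r->head - r->tail == RING_SLOTS || n > SLOT_BYTES)
        return 0;                           /* overrun: drop */
    uint32_t i = r->head % RING_SLOTS;
    memcpy(r->data[i], frame, n);
    r->len[i] = n;
    r->head++;                              /* publish the slot */
    return 1;
}

/* CPU side, called from the interrupt or polling handler. */
static int rx_ring_get(struct rx_ring *r, uint8_t *buf, uint16_t *n)
{
    if (r->tail == r->head)
        return 0;                           /* ring empty */
    uint32_t i = r->tail % RING_SLOTS;
    memcpy(buf, r->data[i], r->len[i]);
    *n = r->len[i];
    r->tail++;
    return 1;
}
```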

• Packet forwarding

✓ Look up the routing table

✓ Rewrite the header (Ethernet, NAT?, TTL, checksum…; sketch below)

✓ Packet moved to the output hold queue
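Also not in the slides: a sketch of the TTL and checksum part of the header rewrite, using the incremental checksum update of RFC 1624 rather than recomputing the whole header checksum. The struct is a simplified IPv4 header and the function names are made up.

```c
#include <stdint.h>
#include <arpa/inet.h>          /* ntohs / htons */

/* Simplified IPv4 header (no options). */
struct ipv4_hdr {
    uint8_t  ver_ihl, tos;
    uint16_t tot_len, id, frag_off;
    uint8_t  ttl, proto;
    uint16_t check;
    uint32_t saddr, daddr;
};

/* RFC 1624 incremental checksum update: HC' = ~(~HC + ~m + m'),
 * with m the old 16-bit header word and m' the new one. */
static uint16_t csum_replace(uint16_t check, uint16_t old_w, uint16_t new_w)
{
    uint32_t sum = (uint16_t)~check;
    sum += (uint16_t)~old_w;
    sum += new_w;
    sum = (sum & 0xffff) + (sum >> 16);     /* fold the carries */
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Header rewrite at forwarding time: returns 0 if the packet must be
 * discarded (TTL expired) instead of being forwarded. */
static int ipv4_decrement_ttl(struct ipv4_hdr *h)
{
    if (h->ttl <= 1)
        return 0;                                 /* send ICMP time exceeded */
    uint16_t old_w = ((uint16_t)h->ttl << 8) | h->proto;   /* TTL|protocol word */
    h->ttl--;
    uint16_t new_w = ((uint16_t)h->ttl << 8) | h->proto;
    h->check = htons(csum_replace(ntohs(h->check), old_w, new_w));
    return 1;
}
```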


Router internals — 4

Input and Output queues

• Input queues absorb transient saturation of the forwarding subsystem (depth is configurable)

• Output queues hold bursts of packets directed to one interface

Switching

• Generally, queues hold a given number of packets (not bytes). How would you implement a queue? A ring? A linked list? What is the storage unit (MTU-size bin, packet, particle…)? See the sketch below.

✓ There can be several queues in parallel (various priorities)
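A hypothetical sketch of such a hold queue: a ring of packet pointers with a configurable depth, counting packets rather than bytes, and tail drop when full. Several instances in parallel would give the priority queues mentioned above.

```c
#include <stddef.h>
#include <stdlib.h>

struct packet;                     /* opaque: the buffers live elsewhere */

/* A hold queue counts packets, not bytes: a ring of packet pointers
 * with a configurable depth and tail drop when full (hypothetical). */
struct hold_queue {
    struct packet **slot;          /* 'depth' pointer slots              */
    unsigned depth;                /* configured limit, e.g. 75 packets  */
    unsigned head, tail, count;
    unsigned long drops;
};

static int hq_init(struct hold_queue *q, unsigned depth)
{
    q->slot  = calloc(depth, sizeof *q->slot);
    q->depth = depth;
    q->head = q->tail = q->count = 0;
    q->drops = 0;
    return q->slot != NULL;
}

static int hq_enqueue(struct hold_queue *q, struct packet *p)
{
    if (q->count == q->depth) {    /* queue full: tail drop */
        q->drops++;
        return 0;
    }
    q->slot[q->tail] = p;
    q->tail = (q->tail + 1) % q->depth;
    q->count++;
    return 1;
}

static struct packet *hq_dequeue(struct hold_queue *q)
{
    if (q->count == 0)
        return NULL;
    struct packet *p = q->slot[q->head];
    q->head = (q->head + 1) % q->depth;
    q->count--;
    return p;
}
```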


Router internals — 5

Shared memory — first generation

• Ex.: conventional PC, Cisco 2800, HP ProCurve 7xxx

• Everything stored in same memory space

[Figure: all interfaces, the CPU and the shared packet memory sit on one bus; the routing table lives in CPU memory.]

• Limiting factor: memory access
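Not in the slides: a back-of-envelope sketch of why memory access is the limit. Assuming each forwarded packet crosses the shared memory roughly twice (written on reception, read on transmission), the memory bandwidth caps the aggregate forwarding rate; the numbers below are purely illustrative.

```c
/* Back-of-envelope sketch (not from the slides): in a shared-memory
 * router each packet is written into memory once and read once, so
 * memory bandwidth bounds the aggregate forwarding rate. */
#include <stdio.h>

int main(void)
{
    double mem_bw   = 800e6;   /* illustrative memory bandwidth, bytes/s */
    double pkt_size = 64.0;    /* worst case: minimum-size packets       */

    double max_bps = mem_bw / 2.0;          /* one write + one read      */
    double max_pps = max_bps / pkt_size;

    printf("aggregate forwarding limit: %.0f Mb/s, %.1f kpps at %g-byte packets\n",
           max_bps * 8 / 1e6, max_pps / 1e3, pkt_size);
    return 0;
}
```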


Router internals — 6

Shared memory — first generation (cont.)

• Cisco 25xx (1993)

[Figure: Cisco 160x and Cisco 25xx series block diagrams (Cisco "Router Architecture" presentation, 1999) — Motorola 68360 / 68030 CPU and system-controller ASIC on a shared CPU bus; Boot ROM, Flash, NVRAM, on-board DRAM plus DRAM SIMM, dual UART console; WIC slots, hub ports and WAN/async interfaces on the I/O buses.]


Router internals — 7

Shared memory — first generation (cont.)

• Cisco 7200

[Figures: Cisco 720x series — NPE-100 / NPE-200 (R4700 or R5000 CPU, GT64010 system controller, SRAM or DRAM packet memory) linked through PCI bridges and the midplane to the I/O controller and port adapters PA1–PA6; Cisco 70x0 RP and SP/SSP — M68040 route processor and bit-slice switch processor connected over the CxBus to the interface processors.]


Router internals — 8

Shared memory — first generation (cont.)

• Juniper M40

• Decoupling of control plane and forwarding plane—forwarding by a dedicated ASIC

• 1998 — 40Gb/s

• JunOS based on FreeBSD


Router internals — 9

Shared memory — first generation (cont.)

[Figure: Juniper M40 architecture. PIC: Physical Interface Card.]


Router internals — 10

Intelligent line cards — 2nd generation

• Ex.: Cisco 7500

• Line cards have some intelligence and write into each other's memory

[Figure: each line card has its own CPU; the cards and the central CPU (holding the routing table in its memory) share one bus.]

• Limiting factor: 1 shared bus (needs to be N times faster than each of the N interfaces)

• Central processor dedicated only to the control plane, distinct from the forwarding plane


Router internals — 11

Intelligent line cards — 2nd generation (cont.)

• Cisco 7500

[Figures: Cisco 70x0 with RSP7000 and Cisco 75xx series — the RSP (R4600/R4700/R5000 CPU, system-controller ASICs, QA ASIC, MEMD SRAM packet memory) arbitrates the CyBus(es) toward the IP/VIP interface processors; Boot ROM, Boot Flash, NVRAM, PCMCIA and dual UART sit on the I/O bus.]


Router internals — 12

Intelligent line cards — 2nd generation (cont.)

• Versatile Interface Processors (one per interface)

[Figure: Cisco 75xx VIP — an on-board R4600/R4700/R5000 CPU with its own DRAM and SRAM, two PCI buses toward the port adapters (PA), and CYA ASICs connecting the card to the CyBus packet memory.]

High End Router Comparison (from the same Cisco presentation):

Platform  Processor  CPU           Type  CPU bus  Interfaces       L2 cache  Clock
7500      RSP1       R4600         RISC  64 bit   IP, VIP1, VIP2   -         100 MHz
7500      RSP2       R4600/R4700   RISC  64 bit   IP, VIP1, VIP2   -         100 MHz
7500      RSP4       R5000         RISC  64 bit   IP, VIP1, VIP2   512 KB    200 MHz
7500      VIP2-15    R4700         RISC  64 bit   PA               512 KB    100 MHz
7500      VIP2-40    R4700         RISC  64 bit   PA               512 KB    100 MHz
7500      VIP2-50    R4700         RISC  64 bit   PA               512 KB    200 MHz
7200      NPE100     R4700         RISC  64 bit   PA, IO-FE        512 KB    150 MHz
7200      NPE150     R4700         RISC  64 bit   PA, IO-FE        512 KB    150 MHz
7200      NPE200     R5000         RISC  64 bit   PA, IO-FE        512 KB    200 MHz
7000      RP         M68040        RISC  32 bit   IP, VIP1         -         40 MHz
7000      RSP7K      R4600         RISC  64 bit   IP, VIP1, VIP2   -         100 MHz


Router internals — 13

Intelligent line cards + crossbar switch — 3rd generation

• Ex.: Cisco 7600, Juniper T-series, HP ProCurve Switch 4200vl…

• Crossbar switch:

[Figure: crossbar switch — each line card has its own CPU; the crossbar connects any input card to any output card.]

• Routing of N simultaneous packets (or cells)


Router internals — 14

Head of line blocking

• The crossbar needs to be N times faster than each line, or each input needs one buffer per output (i.e. one buffer per crosspoint); see the simulation sketch after the list below

• What goes through the crossbar?

✓ ATM cells

✓ Particles? (→ packet reassembly)

✓ Packets
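Not in the slides: a toy discrete-time simulation of head-of-line blocking in an N×N crossbar with a single FIFO per input, under saturated traffic with uniformly random destinations. With random tie-breaking, the per-port throughput tends to 2 - sqrt(2) ≈ 0.586 for large N (the classical result of Karol, Hluchyj and Morgan); the port and slot counts here are arbitrary.

```c
#include <stdio.h>
#include <stdlib.h>

#define N     32            /* ports                */
#define SLOTS 100000        /* simulated time slots */

int main(void)
{
    int  hol[N];             /* destination of each input's head-of-line packet */
    long served = 0;

    srand(1);
    for (int i = 0; i < N; i++)
        hol[i] = rand() % N;

    for (long t = 0; t < SLOTS; t++) {
        int granted[N], contenders[N];
        for (int o = 0; o < N; o++) { granted[o] = -1; contenders[o] = 0; }

        /* Each output picks uniformly among the HOL packets addressed
         * to it (reservoir sampling); the other contenders are blocked
         * and keep their destination for the next slot. */
        for (int i = 0; i < N; i++) {
            int o = hol[i];
            contenders[o]++;
            if (rand() % contenders[o] == 0)
                granted[o] = i;
        }
        for (int o = 0; o < N; o++)
            if (granted[o] != -1) {
                served++;
                hol[granted[o]] = rand() % N;  /* a fresh packet reaches the head */
            }
    }
    printf("throughput per input port: %.3f\n",
           (double)served / ((double)SLOTS * N));
    return 0;
}
```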


Router internals — 15

Cisco router performance (packets/s)

Router          Process switching   Fast switching
2500            800                 4,400
2801            3,000               90,000
7200-NPE-G1     79,000              1,018,000
7600-dCEF720    48,000,000 per slot (dCEF)


Router internals — 16

A step further

• Check Cisco CEF (Cisco Express Forwarding)

• Banyan switch

• MPLS: packets carry a label that identifies how they should be processed
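Not in the slides: a hypothetical sketch of what carrying such an identifier buys. The label is a direct index into a forwarding table (LFIB-style), so the per-packet lookup is a single memory access instead of a longest-prefix match. Field names and sizes are illustrative, and a real LFIB would use a sparser structure and encode push/pop/swap operations.

```c
#include <stdint.h>

/* Hypothetical label forwarding table indexed directly by the
 * 20-bit MPLS label (illustrative; real LFIBs are sparser). */
struct lfib_entry {
    uint8_t  valid;
    uint8_t  out_if;        /* outgoing interface        */
    uint32_t out_label;     /* label to swap in          */
};

#define LFIB_SIZE (1u << 20)

static struct lfib_entry lfib[LFIB_SIZE];

/* Returns the outgoing interface, or -1 if the label is unknown. */
static int mpls_forward(uint32_t in_label, uint32_t *out_label)
{
    if (in_label >= LFIB_SIZE || !lfib[in_label].valid)
        return -1;
    *out_label = lfib[in_label].out_label;   /* label swap */
    return lfib[in_label].out_if;
}
```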


Router internals — 17

Routing table

• Static entries, routing protocols, ARP

• Can be large!

• Entries in use are cached (on interface cards, if applicable) → the cache holds a small subset of known destinations
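A hypothetical sketch of such a cache: recently used destinations hash to an entry holding the already-resolved next hop, and only a miss falls back to the full longest-prefix-match lookup (the idea behind fast-switching caches). Sizes, the hash and the fib_lookup() stub are invented.

```c
#include <stdint.h>

#define CACHE_SLOTS 4096

struct cache_entry {
    uint32_t dest;               /* destination address */
    uint32_t next_hop;
    uint8_t  out_if;
    uint8_t  valid;
};

static struct cache_entry route_cache[CACHE_SLOTS];

static unsigned hash_dest(uint32_t d)
{
    return (d * 2654435761u) >> 20;          /* 12-bit multiplicative hash */
}

/* Stand-in for the full longest-prefix-match lookup (slow path). */
static uint32_t fib_lookup(uint32_t dest, uint8_t *out_if)
{
    (void)dest; *out_if = 0; return 0;       /* pretend: default route */
}

static uint32_t route(uint32_t dest, uint8_t *out_if)
{
    struct cache_entry *e = &route_cache[hash_dest(dest)];
    if (e->valid && e->dest == dest) {       /* fast path: cache hit  */
        *out_if = e->out_if;
        return e->next_hop;
    }
    uint32_t nh = fib_lookup(dest, out_if);  /* miss: full table walk */
    e->dest = dest; e->next_hop = nh; e->out_if = *out_if; e->valid = 1;
    return nh;
}
```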


Router internals — 18

Longest match lookup — Routing table storage

• Source: M. A. Ruiz-Sanchez, E. W. Biersack, W. Dabbous, "Survey and taxonomy of IP address lookup algorithms," IEEE Network, vol. 15, no. 2, pp. 8–23, Mar/Apr 2001

2 Classical Solution

2.1 Binary Trie

A natural way to represent prefixes is using a trie. A trie is a tree-based data structure allowing the organization of prefixes on a digital basis by using the bits of prefixes to direct the branching. Figure 7 shows a binary trie (each node has at most two children) representing a set of prefixes of a forwarding table.

Prefixes: a = 0*, b = 01000*, c = 011*, d = 1*, e = 100*, f = 1100*, g = 1101*, h = 1110*, i = 1111*

[Figure 7: Binary trie for a set of prefixes.]

In a trie, a node on level l represents the set of all addresses that begin with the sequence of l bits consisting of the string of bits labeling the path from the root to that node. For example, node c in figure 7 is at level 3 and represents all addresses beginning with the sequence 011. The nodes that correspond to prefixes are shown in dark color, and these nodes will contain the forwarding information or a pointer to it. Note also that prefixes are not only located at leaves but also at some internal nodes. This situation arises because of exceptions in the aggregation process. For example, in figure 7 the prefixes b and c represent exceptions to prefix a. Figure 8 illustrates this situation better. The trie shows the total address space, assuming 5-bit long addresses. Each leaf represents one possible address. We can see that the address spaces covered by prefixes b and c overlap with the address space covered by prefix a. Thus, prefixes b and c represent exceptions to prefix a and refer to specific subintervals of the address interval covered by prefix a. In the trie in figure 7, this is reflected by the fact that prefixes b and c are descendants of prefix a, or in other words, prefix a is itself a prefix of b and c. As a result, some addresses will match several prefixes. For example, addresses beginning with 011 will match both prefix c and prefix a. Nevertheless, prefix c must be preferred because it is more specific (longest match rule).

Tries allow in a straightforward way to find the longest prefix that matches a given destination address. The search in a trie is guided by the bits of the destination address. At each node, the search proceeds to the left or to the right according to the sequential inspection of the address bits. While traversing the trie, every time we visit a node marked as prefix (i.e., a dark node) we remember this prefix as the longest match found so far. The search ends when there is no more branch to take, and the longest or best matching prefix will be the last prefix remembered. For instance, if we search the best matching prefix (BMP) for an address beginning with the bit pattern 10110, we start at the root in figure 7. Since the first bit of the address is 1 we move to the right, to the node marked with prefix d, and we remember d as the BMP found so far. Then we move to the left since the second address bit is 0; this time the node is not marked as prefix, so d is still the BMP found so far. Next, the third address bit is 1, but at this point there is no branch labeled 1, so the search ends and the last remembered BMP (prefix d) is the answer.
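To make the trie search concrete, here is a minimal C sketch (not from the survey) of a binary trie with insertion and longest-prefix-match lookup, using the prefix set of figure 7 placed in the top bits of 32-bit words. Looking up an address starting with 10110 returns d, as in the example above.

```c
/* Minimal binary trie for longest-prefix matching (a sketch, not a
 * production FIB). */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

struct node {
    struct node *child[2];
    int          has_info;      /* node marked as a prefix?        */
    int          next_hop;      /* forwarding info for that prefix */
};

static struct node *new_node(void)
{
    return calloc(1, sizeof(struct node));
}

/* Insert a prefix given in the top 'len' bits of a 32-bit word. */
static void trie_insert(struct node *root, uint32_t prefix, int len, int next_hop)
{
    struct node *n = root;
    for (int i = 0; i < len; i++) {
        int b = (prefix >> (31 - i)) & 1;       /* i-th bit from the left */
        if (!n->child[b])
            n->child[b] = new_node();
        n = n->child[b];
    }
    n->has_info = 1;
    n->next_hop = next_hop;
}

/* Longest-prefix match: walk along the address bits, remembering the
 * last prefix node seen (the BMP). Returns -1 if nothing matches. */
static int trie_lookup(const struct node *root, uint32_t addr)
{
    const struct node *n = root;
    int best = -1;
    for (int i = 0; i < 32 && n; i++) {
        if (n->has_info)
            best = n->next_hop;
        n = n->child[(addr >> (31 - i)) & 1];
    }
    if (n && n->has_info)                       /* full 32-bit match */
        best = n->next_hop;
    return best;
}

int main(void)
{
    struct node *root = new_node();
    /* Prefix set of figure 7 (5-bit example), left-aligned in 32 bits. */
    struct { uint32_t p; int len; const char *name; } pfx[] = {
        { 0x00000000, 1, "a 0*"    }, { 0x40000000, 5, "b 01000*" },
        { 0x60000000, 3, "c 011*"  }, { 0x80000000, 1, "d 1*"     },
        { 0x80000000, 3, "e 100*"  }, { 0xC0000000, 4, "f 1100*"  },
        { 0xD0000000, 4, "g 1101*" }, { 0xE0000000, 4, "h 1110*"  },
        { 0xF0000000, 4, "i 1111*" },
    };
    for (int k = 0; k < 9; k++)
        trie_insert(root, pfx[k].p, pfx[k].len, k);

    /* Address beginning with 10110...: the BMP is d (1*), as in the text. */
    int r = trie_lookup(root, 0xB0000000);
    printf("best match: %s\n", r >= 0 ? pfx[r].name : "none");
    return 0;
}
```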


Router internals — 19

Path-compressed trie

[In a path-compressed trie, chains of one-child nodes are collapsed and each node stores the number of the next bit to inspect; the search is directed] by the bit-number field in the nodes traversed. When a node marked as prefix is encountered, a comparison with the actual prefix value is performed. This is necessary since during the descent in the trie we may skip some bits. If a match is found, we proceed traversing the trie and keep the prefix as the BMP so far. Search ends when a leaf is encountered or a mismatch is found. As usual, the BMP will be the last matching prefix encountered. For instance, if we look for the BMP of an address beginning with the bit pattern 010110 in the path-compressed trie shown in figure 9, we proceed as follows: we start at the root node and, since its bit number is 1, we inspect the first bit of the address. The first bit is 0, so we go to the left. Since the node is marked as prefix, we compare the prefix a with the corresponding part of the address (0). Since they match, we proceed and keep a as the BMP so far. Since the node's bit number is 3, we skip the second bit of the address and inspect the third one. This bit is 0, so we go to the left. Again we check whether the prefix b matches the corresponding part of the address (01011). Since they do not match, search stops and the last remembered BMP (prefix a) is the correct BMP.

Path-compression was first proposed in a scheme called PATRICIA [10], but this scheme does not support longest prefix matching. Sklower proposed a scheme with modifications for longest prefix matching in [13]. In fact, this variant was originally designed not only to support prefixes but more general non-contiguous masks. Since this feature was really never used, current implementations differ somewhat from Sklower's original scheme. For example, the BSD version of the path-compressed trie (referred to as BSD trie) is essentially the same as we have just described. The basic difference is that in the BSD scheme, the trie is first traversed without checking the prefixes at internal nodes. Once at a leaf, the traversed path is backtracked in search of the longest matching prefix. At each node with a prefix, or a list of prefixes, a comparison is performed to check for a match. Search ends when a match is found. Comparison operations are not made on the downward path, in the hope that not many exception prefixes exist. Note that with this scheme, in the worst case, the path is completely traversed two times. In the case of Sklower's original scheme, the backtrack phase also needs to do recursive descents of the trie because non-contiguous masks are allowed.

[Figure 9: A path-compressed trie for the same set of prefixes; each node carries the number of the bit to inspect.]

Until recently, the longest matching prefix problem has been addressed by using data structures based on path-compressed tries, like the BSD trie. Path-compression makes much sense when the binary trie is sparsely populated. But when the number of prefixes increases and the trie gets denser, using path compression has little benefit. Moreover, the principal disadvantage of path-compressed tries, as well as binary tries in general, is that a search needs to do many memory accesses, in the worst case 32 for IPv4 addresses. For example, for a typical backbone router [18] with 47113 prefixes, the BSD version of a path-compressed trie creates 93304 nodes. The maximal height is 26, while the average height is almost 20. For the same prefixes, a simple binary trie (with one-child nodes) has a maximal height of 30 and an average height of almost 22. As we can see, the heights of both tries are very similar, and the BSD trie may perform additional comparison operations when backtracking is needed.

• Useful for a sparsely populated prefix space, but many prefixes are in use in IPv4

• Backtracking necessary: after reaching e and finding out that it does not match, we need to go back to d (for 101…, for example)
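A sketch of the path-compressed search loop described above; the node layout and bit numbering are assumptions, and only the a/b branch of figure 9 is hard-coded, just enough to reproduce the 010110 example.

```c
#include <stdio.h>
#include <stdint.h>

/* Path-compressed trie node: one-child chains are removed, so each
 * node stores which bit to inspect next (1 = leftmost address bit). */
struct pcnode {
    int            bit;         /* bit number to inspect (0 at leaves) */
    int            has_prefix;
    uint32_t       prefix;      /* prefix value, left-aligned          */
    int            plen;        /* prefix length in bits               */
    const char    *name;
    struct pcnode *child[2];
};

static int bit_at(uint32_t v, int pos)          /* 1-based, from the left */
{
    return (v >> (32 - pos)) & 1;
}

static int prefix_matches(uint32_t addr, uint32_t prefix, int plen)
{
    uint32_t mask = plen ? ~0u << (32 - plen) : 0;
    return (addr & mask) == (prefix & mask);
}

/* Skipped bits force a prefix comparison at every marked node;
 * the search stops on a mismatch or at a leaf. */
static const char *pc_lookup(const struct pcnode *n, uint32_t addr)
{
    const char *best = "no match";
    while (n) {
        if (n->has_prefix) {
            if (!prefix_matches(addr, n->prefix, n->plen))
                break;                         /* mismatch: stop here */
            best = n->name;                    /* BMP so far */
        }
        if (!n->child[0] && !n->child[1])
            break;                             /* reached a leaf */
        n = n->child[bit_at(addr, n->bit)];
    }
    return best;
}

int main(void)
{
    /* Just the a/b branch of figure 9: the root inspects bit 1; its
     * left child carries prefix a = 0* and inspects bit 3; below it,
     * the leaf carries prefix b = 01000*. */
    struct pcnode leaf_b = { 0, 1, 0x40000000, 5, "b 01000*", { 0, 0 } };
    struct pcnode node_a = { 3, 1, 0x00000000, 1, "a 0*",     { &leaf_b, 0 } };
    struct pcnode root   = { 1, 0, 0, 0, 0,                   { &node_a, 0 } };

    /* Address beginning with 010110...: bit 1 = 0 -> node a (match,
     * BMP = a); bit 3 = 0 -> leaf b, which does not match 01011,
     * so the answer is a, as in the text. */
    printf("BMP: %s\n", pc_lookup(&root, 0x58000000));
    return 0;
}
```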


Router internals — 20

Disjoint prefix trie

We have seen that prefixes can overlap (see figure 4). In a trie, when two prefixes overlap, one of them is itself a prefix of the other, see figures 7 and 8. Since prefixes represent intervals of contiguous addresses, when two prefixes overlap this means that one interval of addresses contains another interval of addresses, see figures 4 and 8. In fact, that is why an address can be matched to several prefixes. If several prefixes match, the longest prefix match rule is used in order to find the most specific forwarding information. One way to avoid the use of the longest prefix match rule and to still find the most specific forwarding information is to transform a given set of prefixes into a set of disjoint prefixes. Disjoint prefixes do not overlap, and thus no address prefix is itself a prefix of another one. A trie representing a set of disjoint prefixes will have prefixes at the leaves but not at internal nodes. To obtain a disjoint-prefix binary trie, we simply add leaves to nodes that have only one child. These new leaves are new prefixes that inherit the forwarding information of the closest ancestor marked as a prefix. Finally, internal nodes marked as prefixes are unmarked. For example, figure 10 shows the disjoint-prefix binary trie that corresponds to the trie in figure 7. Prefixes a1, a2, a3 have inherited the forwarding information of the original prefix a, which now has been suppressed. Prefix d1 has been obtained in a similar way. Since prefixes at internal nodes are expanded or pushed down to the leaves of the trie, this technique has been called leaf pushing by Srinivasan et al. [14]. Figure 11 shows the disjoint intervals of addresses that correspond to the disjoint-prefix binary trie of figure 10.

[Figure 10: Disjoint-prefix binary trie — the added leaves a1, a2, a3 and d1 inherit the forwarding information of a and d, which are unmarked.]

[Figure 11: Expanded disjoint-prefix binary trie — the disjoint intervals of addresses corresponding to the trie of figure 10.]

Compression techniques: Data compression tries to remove redundancy from the encoding. The idea to use compression comes from the fact that expanding the prefixes increases information redundancy. […]

• Disjoint prefixes do not overlap
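A sketch of the leaf-pushing transformation described above, assuming the same node layout as the binary-trie sketch of slide 18; -1 stands for "no enclosing prefix". Applied to the trie of figure 7, it creates the leaves a1, a2, a3 and d1 and unmarks a and d.

```c
#include <stdlib.h>

/* Same node layout as the binary-trie sketch above. */
struct node {
    struct node *child[2];
    int          has_info;
    int          next_hop;
};

/* Leaf pushing (disjoint-prefix transformation): 'inherited' is the
 * forwarding information of the closest ancestor marked as a prefix,
 * or -1 if there is none. */
static void leaf_push(struct node *n, int inherited)
{
    if (!n->child[0] && !n->child[1])           /* existing leaf: keep it  */
        return;

    if (n->has_info) {                          /* internal prefix:        */
        inherited  = n->next_hop;               /* pass its info down      */
        n->has_info = 0;                        /* ...and unmark it        */
    }
    for (int b = 0; b < 2; b++) {
        if (!n->child[b]) {                     /* one-child node: add a   */
            n->child[b] = calloc(1, sizeof *n); /* leaf inheriting info    */
            if (inherited >= 0) {
                n->child[b]->has_info = 1;
                n->child[b]->next_hop = inherited;
            }
        } else {
            leaf_push(n->child[b], inherited);
        }
    }
}

/* Usage on a trie built as in the earlier sketch:
 *     leaf_push(root, -1);    // -1: no default route / enclosing prefix
 */
```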


Router internals — 21

There are other techniques!


Router internals — 22

Sources

• S. Keshav, "An Engineering Approach to Computer Networking"

• Cisco Router Architecture, www.cisco.com/networkers/nw99_pres/601.pdf

• Ross & Kurose “Computer Networking”

• …