
ENTS689L: Packet Processing and Switching

Commercial Network Processor Architectures

Content Addressable Memories

Vahid Tabatabaee

Fall 2007


References

Panos C. Lekkas, Network Processors: Architectures, Protocols, and Platforms, McGraw-Hill.

Kostas Pagiamtzis, Ali Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE Journal of Solid-State Circuits, vol. 41, no. 3, March 2006.

NetLogic MicroSystems Application Note, “Intradevice Configuration of Network Search Engines”.

NetLogic MicroSystems Application Note, “High Performance Layer 3 Forwarding”.

IDT White Paper, “Taking Packet Processing to the Next Level”.


Classification and Search Engines

A classification engine receives streams of packets as its input.

It continuously applies a set of application-specific sorting rules and policies to the packets.

It ends up compiling a series of new parallel packet streams in queues of packets.

For classification the NP should consult a memory bank, a lookup table, or even a database where the rules are stored.

Search engines are used to consult a lookup table or a database, based on rules and policies, for the correct classification.

Search engines are mostly based on associative memory, which is also known as CAM.


What is CAM?

Content Addressable Memory is a special kind of memory.

Read operation in traditional memory:
    Input is the address of the content we are interested in.
    Output is the content stored at that address.

In CAM it is the reverse:
    Input is associated with something stored in the memory.
    Output is the location where the associated content is stored.

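Figure: Content Addressable Memory versus Traditional Memory. Both hold the words 1 0 1 X X, 0 1 1 0 X, 0 1 1 X X, and 1 0 0 1 1 at addresses 00, 01, 10, and 11. The CAM takes the word 0 1 1 0 1 as input and outputs address 01; the traditional memory takes address 01 as input and outputs the word 0 1 1 0 X.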
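The contrast between the two read operations can be sketched in a few lines of Python. This is only an illustrative software model using the four words from this slide; the names TABLE, ternary_match, ram_read, and cam_search are made up for the sketch and do not come from any CAM vendor API.

```python
# Hypothetical model of the slide's example: four ternary words at addresses 0..3.
TABLE = ["101XX", "0110X", "011XX", "10011"]


def ternary_match(stored: str, key: str) -> bool:
    """True if every stored bit equals the key bit or is a don't-care (X)."""
    return len(stored) == len(key) and all(
        s in ("X", k) for s, k in zip(stored, key)
    )


def ram_read(address: int) -> str:
    """Traditional memory: address in, content out."""
    return TABLE[address]


def cam_search(key: str):
    """CAM: content in, address of the first matching word out (None on a miss)."""
    for address, stored in enumerate(TABLE):
        if ternary_match(stored, key):
            return address
    return None


print(ram_read(0b01))       # content stored at address 01
print(cam_search("01101"))  # address of the first word matching 01101
```

Note that the key 01101 also matches the word at address 10 (0 1 1 X X); the sketch returns the lower address, which is the role a priority encoder plays in hardware.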


CAM for Routing Table Implementation

CAM can be used as a search engine.

We want to find matching contents in a database or table.

Example: routing table.

Source: http://pagiamtzis.com/cam/camintro.html


Simplified CAM Block Diagram

The input to the system is the search word.

The search word is broadcast on the search lines.

Each match line indicates whether there was a match between the search word and the stored word.

The encoder specifies the match location.

If there are multiple matches, a priority encoder selects the first match.

A hit signal flags the case in which there is no match.

The search word is long, ranging from 36 to 144 bits.

Table size ranges from a few hundred entries to 32K entries.

Address space: 7 to 15 bits.

Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-state circuits. March 2006
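The 7 to 15 bit address range follows from the table sizes: the encoder needs enough output bits to name every location. A minimal sketch, assuming power-of-two table sizes (128 entries stands in for "a few hundred"); address_bits and priority_encode are hypothetical helper names.

```python
def address_bits(entries: int) -> int:
    """Encoder output bits needed to name one of `entries` locations."""
    return (entries - 1).bit_length()


def priority_encode(match_lines):
    """Return the lowest asserted match-line index, or None when nothing matched
    (the case the hit signal flags)."""
    for address, matched in enumerate(match_lines):
        if matched:
            return address
    return None


for entries in (128, 1024, 32 * 1024):
    print(entries, "entries ->", address_bits(entries), "address bits")

print(priority_encode([0, 1, 1, 0]))  # two matches: the first one wins
print(priority_encode([0, 0, 0, 0]))  # no match
```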


CAM Memory Size

The largest CAM available is around 18 Mbit (single chip).

Rule of thumb: the largest CAM chip is about half the size of the largest available SRAM chip.
    A typical CAM cell consists of two SRAM cells.

CAM size has been growing exponentially.

Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-state circuits. March 2006


CAM Basics

The search-data word is loaded into the search-data register.

All match-lines are pre-charged to high (temporary match state).

Search line drivers broadcast the search word onto the differential search lines.

Each CAM core cell compares its stored bit against the bit on the corresponding search lines.

Match lines of words that have at least one mismatching bit discharge to ground; the match lines that stay high mark the matching words.

Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-state circuits. March 2006
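The search cycle above can be mimicked in software. The sketch below is a behavioural model of a binary CAM only (no circuit timing, no don't-care bits); search_cycle and the sample words are assumptions made for illustration.

```python
def search_cycle(stored_words, search_word):
    """Return the match-line states (1 = still high) after one search."""
    # Step 1: precharge every match line high (temporary match state).
    match_lines = [1] * len(stored_words)
    # Step 2: broadcast the search word; each cell compares one bit.
    for row, word in enumerate(stored_words):
        for stored_bit, search_bit in zip(word, search_word):
            if stored_bit != search_bit:
                match_lines[row] = 0  # one mismatch discharges the whole line
                break
    return match_lines


print(search_cycle(["1010", "0110", "0111"], "0110"))
```

Only the second word keeps its match line high; the first differs in its leading bit and the third in its last bit.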


Type of CAMs

Binary CAM (BCAM) stores only 0s and 1s.
    Applications: MAC table consultation, layer 2 security-related VPN segregation.

Ternary CAM (TCAM) stores 0s, 1s, and don’t cares.
    Applications: cases that need wildcards, such as layer 3 and 4 classification for QoS and CoS purposes, and IP routing (longest prefix matching).

Available sizes: 1Mb, 2Mb, 4.7Mb, 9.4Mb, and 18.8Mb.

CAM entries are structured as multiples of 36 bits rather than 32 bits.
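Longest prefix matching maps naturally onto a TCAM: each route becomes a value plus a mask whose host bits are don't-cares, and longer prefixes are stored at lower addresses so that the first match is the longest one. The sketch below uses Python's standard ipaddress module; the routes, next-hop labels, and function names are invented for the example, and care_mask uses 1 for "compare this bit".

```python
import ipaddress


def build_tcam(routes):
    """Turn {prefix: next_hop} into (value, care_mask, next_hop) rows,
    longest prefix first so the first match is the longest match."""
    rows = []
    for prefix, next_hop in routes.items():
        net = ipaddress.ip_network(prefix)
        rows.append((net.prefixlen, int(net.network_address), int(net.netmask), next_hop))
    rows.sort(key=lambda row: row[0], reverse=True)
    return [(value, care_mask, hop) for _, value, care_mask, hop in rows]


def lpm_lookup(tcam, address):
    key = int(ipaddress.ip_address(address))
    for value, care_mask, next_hop in tcam:
        if (key & care_mask) == value:  # bits outside care_mask are don't-cares
            return next_hop
    return None


tcam = build_tcam({"10.0.0.0/8": "A", "10.1.0.0/16": "B", "0.0.0.0/0": "default"})
print(lpm_lookup(tcam, "10.1.2.3"))   # matches /16, /8 and /0; the /16 wins
print(lpm_lookup(tcam, "10.9.9.9"))   # matches /8 and /0
print(lpm_lookup(tcam, "192.0.2.1"))  # only the default route matches
```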


CAM Advantages

They associate the input (comparand) with their memory contents in one clock cycle.

They are configurable in multiple formats of width and depth of search data, which allows searches to be conducted in parallel.

CAMs can be cascaded to increase the size of the lookup tables they can store.

New entries can be added to their tables, so they can learn what they did not know before.

They are one of the appropriate solutions for higher speeds.


CAM Disadvantages

They cost several hundred dollars per CAM, even in large quantities.

They occupy a relatively large footprint on a card.

They consume excessive power.

Generic system engineering problems:
    Interface with the network processor.
    Simultaneous table update and lookup requests.


CAM structure

The comparand bus is 72 bytes wide and bidirectional.

The result bus is an output.

The command bus enables instructions to be loaded into the CAM.

It has 8 configurable banks of memory.

The NPU issues a command to the CAM.

The CAM then performs an exact match or uses wildcard characters to extract relevant information.

There are two sets of mask registers inside the CAM.

Figure: CAM block diagram. Blocks: CAM control; global mask registers; 72 bits x 131072 CAM array (72 bits x 16K x 8 structures), mixable with 72 bits x 16384, 144 bits x 8192, 288 bits x 4096, and 576 bits x 2048; empty bit; priority encoder; flag control; output port control; control & status registers; I/O port control; decoder; pipeline execution control (command bus).
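The four width and depth options listed for the array all describe the same amount of storage per bank, which a quick calculation confirms. The sketch assumes that each of the 8 banks holds 72 bits x 16384 in its narrowest configuration, as the block diagram suggests; bank_depth is a hypothetical helper.

```python
BANKS = 8
BANK_BITS = 72 * 16384  # one bank in its narrowest (72-bit) configuration


def bank_depth(width_bits: int) -> int:
    """Entries per bank when the bank is configured for the given key width."""
    return BANK_BITS // width_bits


for width in (72, 144, 288, 576):
    print(f"{width:>3} bits x {bank_depth(width):>5} entries per bank")

total_entries_72 = BANKS * bank_depth(72)
total_bits = BANKS * BANK_BITS
print(total_entries_72, "entries of 72 bits in total")
print(total_bits, "bits =", total_bits / 2**20, "Mbit")
```

Doubling the key width halves the depth, so a bank can be traded between wide keys and many entries without changing the chip.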


CAM structure

There are global mask registers, which can remove specific bits from the comparison, and a mask register that is present in each location of memory.

The search result can be either a single output (the highest-priority match) or a burst of successive results.

The output port is 24 bytes wide.

Flag and control signals specify status of the banks of the memory.

They also enable us to cascade multiple chips.


CAM Features

CAM cascading:
    We can cascade up to 8 pieces without incurring a performance penalty in search time (72 bits x 512K).
    We can cascade up to 32 pieces with performance degradation (72 bits x 2M).

Terminology:
    Initializing the CAM: writing the table into the memory.
    Learning: updating specific table entries.
    Writing a search key to the CAM: the search operation.

Handling wider keys: most CAMs support 72-bit keys, and they can support wider keys in native hardware.

Shorter keys can be handled more efficiently at the system level.


CAM Latency

Clock rate is between 66 and 133 MHz.

The clock speed determines the maximum search capacity.

Factors affecting the search performance:
    Key size
    Table size

For the system designer, the total latency to retrieve data from the SRAM connected to the CAM is what matters.

By using pipelining and multithreading techniques for resource allocation, we can ease the CAM speed requirements.

Source: IDT
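To see how the clock rate bounds search capacity, one can compare the search rate of the CAM with the packet rate of a link. The following is a rough sketch under stated assumptions, none of which come from the slide: one search per clock cycle, an OC-192 line rate of about 9.953 Gb/s, minimum-size 40-byte packets, and no framing overhead.

```python
OC192_BPS = 9.953e9       # approximate OC-192 line rate (assumption)
MIN_PACKET_BYTES = 40     # assumed worst-case packet size


def packets_per_second(line_rate_bps: float, packet_bytes: int) -> float:
    """Packet rate of a fully loaded link carrying fixed-size packets."""
    return line_rate_bps / (packet_bytes * 8)


def searches_per_packet(clock_hz: float, pps: float) -> float:
    """Search budget per packet, assuming one search per CAM clock cycle."""
    return clock_hz / pps


pps = packets_per_second(OC192_BPS, MIN_PACKET_BYTES)
print(round(pps / 1e6, 1), "Mpps")
for clock_mhz in (66, 133):
    budget = searches_per_packet(clock_mhz * 1e6, pps)
    print(clock_mhz, "MHz CAM ->", round(budget, 2), "searches per packet")
```

Under these assumptions even the fastest clock leaves only a handful of searches per packet, which is why key size, table size, and pipelining matter to the system designer.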


Packet Search Speed Requirements

Source: IDT article in CommsDesign: http://www.commsdesign.com/showArticle.jhtml?articleID=16501972

Source: IDT


Management of Tables Inside a CAM

It is important to squeeze as much information as we can into a CAM.

Example from the NetLogic application notes:
    We want to store 4 tables of 32-bit-wide IP destination addresses.
    The CAM is 128 bits wide.
    If we store one address directly in every slot, 96 bits of each slot are wasted.

We can arrange the 32-bit-wide tables next to each other.
    Every 128-bit slot is partitioned into four 32-bit slots.
    These are the 3rd, 2nd, 1st, and 0th tables, going from left to right.
    We use the global mask register to access only one of the tables.

MASK 3: 00000000 FFFFFFFF FFFFFFFF FFFFFFFF
MASK 2: FFFFFFFF 00000000 FFFFFFFF FFFFFFFF
MASK 1: FFFFFFFF FFFFFFFF 00000000 FFFFFFFF
MASK 0: FFFFFFFF FFFFFFFF FFFFFFFF 00000000
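The four masks can be generated and applied with plain integer arithmetic. This is a sketch, assuming (as the mask values suggest) that a 0 bit in the global mask means "compare" and a 1 bit means "ignore"; pack, table_select_mask, search_table, and the sample addresses are hypothetical.

```python
ALL_ONES_128 = (1 << 128) - 1
FIELD_32 = 0xFFFFFFFF


def table_select_mask(table: int) -> int:
    """128-bit global mask with zeros over the 32-bit field of `table`."""
    return ALL_ONES_128 & ~(FIELD_32 << (32 * table))


def pack(t3: int, t2: int, t1: int, t0: int) -> int:
    """Place four 32-bit entries side by side, table 3 leftmost."""
    return (t3 << 96) | (t2 << 64) | (t1 << 32) | t0


def search_table(slots, table: int, key32: int):
    """Index of the first slot whose field for `table` equals key32."""
    compared_bits = ALL_ONES_128 & ~table_select_mask(table)
    comparand = key32 << (32 * table)
    for index, word in enumerate(slots):
        if ((word ^ comparand) & compared_bits) == 0:
            return index
    return None


slots = [
    pack(0x0A000001, 0x0A000002, 0x0A000003, 0x0A000004),
    pack(0x0A000002, 0x0A000003, 0x0A000004, 0x0A000001),
]
print(f"{table_select_mask(3):032X}")       # MASK 3 as 32 hex digits
print(search_table(slots, 3, 0x0A000002))   # found in slot 1 of table 3
print(search_table(slots, 2, 0x0A000002))   # found in slot 0 of table 2
```

The same 32-bit key lands in different slots depending on which table the mask exposes, so four tables share one array without interfering.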


Example Continued

We can still use the mask register of each entry (not the global mask register) to do longest prefix matching.

Figure: the 128-bit CAM words (bit positions 127 down to 0) with their per-entry mask bits, searched with the comparand register 1 0 1 1 1 0 .... 0 1 1 1 0 under the global mask register 0 0 0 0 0 1 .... 1 1 1 1 1; the matching word is marked MATCH FOUND.


Table Aggregation

We can use tag bits to aggregate multiple tables in a single CAM.

Example:
    We want to use a single CAM (NL85721) for IPv4 packet classification and forwarding.
    We want to filter packets based on other parameters such as VPN.
    We can get an undesired match when we do a classification: CAM word 0 does not match, but the destination address matches CAM word 1.

Source: http://www.netlogicmicro.com/pdf/ncs12_rev_0_8.pdf


Tag bits to avoid undesired matches

Tag bits can be used to differentiate between tables.

Tag bits should not be masked.

For packet classification the tag bit is 0, and for packet forwarding it is 1.

Source: http://www.netlogicmicro.com/pdf/ncs12_rev_0_8.pdf
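The effect of the tag bit can be shown with two entries, one from each table. This is a simplified sketch: the entry layout, the addresses, and the result strings are invented, care_mask uses 1 for "compare this bit", and the use_tag switch exists only to demonstrate what goes wrong without tags.

```python
CLASSIFY, FORWARD = 0, 1  # tag values as given on the slide

# (tag, value, care_mask, result)
ENTRIES = [
    (CLASSIFY, 0xC0A80100, 0xFFFFFF00, "classification rule"),  # 192.168.1.0/24
    (FORWARD, 0xC0A80000, 0xFFFF0000, "forwarding entry"),      # 192.168.0.0/16
]


def tagged_search(tag: int, key: int, use_tag: bool = True):
    """Return (index, result) of the first match, or None."""
    for index, (entry_tag, value, care_mask, result) in enumerate(ENTRIES):
        if use_tag and entry_tag != tag:  # the tag bit is never masked
            continue
        if (key & care_mask) == value:
            return index, result
    return None


key = 0xC0A80201  # 192.168.2.1, looked up for classification
print(tagged_search(CLASSIFY, key, use_tag=False))  # undesired hit on word 1
print(tagged_search(CLASSIFY, key))                 # with tags: no match
```

Without the tag, a classification lookup falls through word 0 and hits the forwarding entry in word 1, which is the undesired match described earlier.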


Vertically Oriented Table Aggregation

We can use validity bits to support multiple tables with different numbers of entries.

We need one validity bit for each table.

When the validity bit in a slot is 1, the corresponding table has a valid entry in that slot.

In the comparand register, only the validity bit of the table that is under search should be 1.

Source: http://www.netlogicmicro.com/pdf/ncs12_rev_0_8.pdf
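A small model shows why the validity bit is needed when side-by-side tables have different depths. It is a sketch under an assumption not stated on the slide: the global mask hides the fields and validity bits of the tables that are not being searched. The slot layout and values are made up.

```python
# Each slot holds one (valid_bit, value) pair per table; table 1 is shorter.
SLOTS = [
    [(1, 0x0A000001), (1, 0x0B000001)],
    [(1, 0x0A000002), (0, 0x00000000)],  # table 1 has no entry in this slot
]


def valid_search(table: int, key: int):
    """Search one table; the comparand carries validity bit 1 for that table."""
    for index, slot in enumerate(SLOTS):
        valid, value = slot[table]  # other tables are assumed masked out
        if valid == 1 and value == key:
            return index
    return None


print(valid_search(0, 0x0A000002))  # valid entry in slot 1
print(valid_search(1, 0x00000000))  # empty field cannot produce a match
```

The second lookup would hit the unused field in slot 1 if only the 32-bit value were compared; requiring the validity bit to be 1 prevents that.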


System Design Issues (multiple searches)

For deep packet inspection, several searches must occur simultaneously.

For example: MAC table, IP table, rules table, flow-management table.

Question: do we use 4 CAMs or just 1 CAM with 4 partitions?

If we use only 1 CAM:
    Some tables are very large and some are small.
    This approach wastes expensive partitions.

If we use 4 CAMs:
    The smaller tables do not justify separate CAMs.
    The overall cost also increases, since each CAM needs its own SRAM too.

Figure: a packet processing environment (network processor or custom-designed ASIC) connected to four CAMs, each with its own SRAM.


System Design Issues (shorter and longer search keys)

We showed how we can implement 36-bit search tables in a 72-bit-wide CAM.

This approach cuts the speed in half, since we need to search twice for each key.

Some CAMs are hardwired to support both 36- and 72-bit-wide search keys, but they are more expensive.

For longer search keys there are two choices:
    We can use a double data rate (DDR) bus and load meaningful bits at both the rising and falling edges of the clock.
    We can double the clock frequency of the bus that loads the comparands.


System Design Issues (simultaneous update and search)

A CAM cannot be updated and searched at the same time.

While an update is in progress, packets cannot be forwarded and are backlogged.

We can use a backup CAM for updates while searches run on the other CAM.

Some designs offer a third port for table maintenance without inhibiting search operations (SiberCore is an example).
    This increases the pin count, the board real estate, and the number of signals to be routed on the board.


System Design Issues (CIDR table update)

Recall that CIDR works based on the longest prefix match (LPM).

CAM segments are created based on the prefix length.

Some empty slots are left in each segment to accommodate new entries.

If a segment is suddenly filled up, the table must be taken offline to reshuffle the entries.

A read and write operation is needed for each entry that must be relocated. We may need a read and write for the mask word too.

Source: http://www.netlogicmicro.com/pdf/cidr_white_paper.pdf
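The cost of reshuffling can be estimated with a simplified model: segments are laid out in order of prefix length, and an empty slot is walked toward the full segment by moving one boundary entry per segment it crosses. The function below is an illustrative sketch of that model, not the procedure from the white paper; moves_needed and its arguments are hypothetical.

```python
def moves_needed(free_slots, target: int) -> int:
    """Entry moves needed to open a slot in segment `target`.

    free_slots[i] is the number of empty slots in segment i, with segments
    ordered by prefix length. Each move shifts one boundary entry into the
    hole, bringing the hole one segment closer to the target.
    """
    distances = [abs(i - target) for i, free in enumerate(free_slots) if free > 0]
    if not distances:
        raise ValueError("table is full")
    return min(distances)


# 32 segments, only the last one has room, new entry goes in the first one.
worst_case = [0] * 31 + [1]
print(moves_needed(worst_case, 0))

# Room in the target segment itself: nothing has to move.
print(moves_needed([2, 0, 0], 0))
```

With 32 segments and the only free slot at the far end, the model gives 31 moves, and each move costs a read and a write (plus the mask word where applicable).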


CIDR table update: worst case analysis

What is the worst-case scenario?
    All segments but one are full.
    A new entry may need up to 31 move operations.
    Each move requires 4 clock cycles, for a total of 4 x 31 = 124 clock cycles.

We have 3000 routing updates per second:
    3000 x 124 = 372000 clock cycles per second.

If the NP clock rate is 100 MHz, the cycle time is 10 nsec.

How much time do the updates consume?
    372000 cycles x 10 nsec per cycle = 3.72 msec in every second.

At OC-192 rate, we have around 20 to 30 MPPS.

Therefore, 74,400 to 111,600 packets per second will not be classified and must be discarded.
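The worst-case arithmetic can be checked with a short script. All inputs are the figures quoted on this slide; the variable names and the lost_packets helper are only for the sketch, and integer arithmetic is used to keep the results exact.

```python
MOVES_WORST_CASE = 31
CYCLES_PER_MOVE = 4
UPDATES_PER_SECOND = 3000
CLOCK_HZ = 100_000_000  # 100 MHz, i.e. a 10 nsec cycle

cycles_per_update = MOVES_WORST_CASE * CYCLES_PER_MOVE
cycles_per_second = cycles_per_update * UPDATES_PER_SECOND
busy_microseconds = cycles_per_second * 1_000_000 // CLOCK_HZ


def lost_packets(mpps: int) -> int:
    """Packets arriving during the update time in each second."""
    return cycles_per_second * mpps * 1_000_000 // CLOCK_HZ


print(cycles_per_update, "cycles per update")
print(cycles_per_second, "cycles per second")
print(busy_microseconds / 1000, "msec of update time per second")
for mpps in (20, 30):
    print(mpps, "MPPS ->", lost_packets(mpps), "packets per second not classified")
```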


Reproaches against CAM based search engines (POWER)

There is a misconception that the power consumption of CAMs keeps increasing.
    It does not make sense to compare the power consumption of a 2Mb CAM clocked at 66 MHz and capable of 66 Msps with that of a 9Mb CAM clocked at 150 MHz and capable of 125 Msps.

Power consumption is the result of multiple factors, such as:
    Semiconductor manufacturing process.
    Number of searches per second.
    Storage density.

The smaller the process, the larger the capacity; a smaller process can also bring a lower supply voltage and a higher clock rate.
    A 0.18μ process uses 50% less power than 0.25μ, with a further 30% improvement at 0.15μ.

The absolute power consumption is nevertheless increasing, because of:
    Larger tables.
    Wider search keys for deep packet classification.
    Increased wire speed.

Make sure to consider worst-case scenarios, not the data sheet values.


Reproaches against CAM based search engines

Table maintenance and management is a software-related problem.
    The third port (Synchronous Maintenance Interface [SMI]) of SiberCore CAMs is an interesting way of doing table maintenance without affecting the ongoing search processes.
    Sort-free CAMs do not need partitioning.

Density and footprint (not a real issue). Example:
    The three members of the family, the CYNSE10512, 10256, and 10128, provide address tables of 512k, 256k, and 128k entries (18 Mbits, 9 Mbits, and 4.5 Mbits), respectively.
    All three devices are housed in 388-contact BGA packages.
    Price: $75, $135, $275.
    A 1,000,000-entry IPv4 table can be handled in two 18 Mbit CAMs.
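The entry counts and the two-chip claim are consistent with 36-bit entries, which can be verified directly. The 36-bit entry width is inferred here from 18 Mbit / 512k entries rather than stated on the slide, and entries_for is a hypothetical helper.

```python
MBIT = 2**20
ENTRY_BITS = 36  # inferred from 18 Mbit / 512k entries (assumption)


def entries_for(capacity_mbit: float, entry_bits: int = ENTRY_BITS) -> int:
    """Number of entries of the given width that fit in the capacity."""
    return int(capacity_mbit * MBIT) // entry_bits


for capacity in (18, 9, 4.5):
    print(capacity, "Mbit ->", entries_for(capacity) // 1024, "k entries")

two_chips = 2 * entries_for(18)
print(two_chips, "entries in two 18 Mbit CAMs")
```

Two 18 Mbit devices hold 1,048,576 such entries, which covers a 1,000,000-entry IPv4 table.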


Reproaches against CAM based search engines

Inflexibility with table configurations:
    This is a real issue.
    Some applications need flexible table sizes and widths.
    More research and development is needed.

Price:
    In absolute terms they are expensive.
    They are sophisticated, complex products that are indispensable in most designs, so they should be expensive!