A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture...

20
James Markevitch, Srinivasa Malladi Cisco Systems A 400Gbps Multi-Core Network Processor August 22, 2017

Transcript of A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture...

Page 1: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

James Markevitch, Srinivasa Malladi Cisco Systems

A 400Gbps Multi-Core Network Processor

August 22, 2017

Page 2: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

2 © 2017 Cisco and/or its affiliates. All rights reserved.

THE INFORMATION HEREIN IS PROVIDED ON AN “AS IS” BASIS, WITHOUT ANY WARRANTIES OR REPRESENTATIONS, EXPRESS, IMPLIED OR STATUTORY, INCLUDING WITHOUT LIMITATION, WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Legal

Page 3: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

3 © 2017 Cisco and/or its affiliates. All rights reserved.

•  The architecture, design, and implementation described in this presentation was done by a diverse team •  Chip and processor architects, designers, DV engineers, physical

designers, software engineers •  7 geographic locations

•  Credit and thanks to the entire team

Acknowledgements

Page 4: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

4 © 2017 Cisco and/or its affiliates. All rights reserved.

•  Line: interfaces (such as 10 Gigabit Ethernet) used for customer connections

•  Fabric: interconnect internal to the router to transfer data between different components of the router, such as network processors

Background: Distributed Router Architecture

Network Processor

Network Processor

Line Fabric

Line

Router (simplified)

Page 5: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

5 © 2017 Cisco and/or its affiliates. All rights reserved.

•  800Gbps (400Gbps full-duplex) network packet processor

•  672 general purpose processors

•  > 6.5Tbps serdes I/O bandwidth

•  External DRAM for large data structures and packet buffering

•  External TCAM for large data structures

•  Integrated Ethernet MACs from 10GE to 100GE

•  Integrated traffic manager

•  Most logic in 1GHz and 760MHz domains

Overview

Processor

DRAM …

400Gbps Line

400Gbps Fabric

PCIe Gen3

TCAM DRAM

serdes serdes

serdes serdes

Page 6: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

6 © 2017 Cisco and/or its affiliates. All rights reserved.

Overall Chip Architecture

MA

Cs

Packet Storage

Processor Array

Fabric Interface

Hardware Accelerators

Traffic Manager

Entire packets stored on-chip

during processing

Networking specific hardware accelerators

Examples: packet ordering, prefix lookup, global ALU operations,

counters

Handles queuing and implements features such

as quality of service External DRAM

Processor array can access and modify entire

packet content

400Gbps Line

400Gbps Fabric

Page 7: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

7 © 2017 Cisco and/or its affiliates. All rights reserved.

Processing Flow – 400Gbps Line to Fabric

MA

Cs

Packet Storage

Processor Array

Fabric Interface

Hardware Accelerators

Traffic Manager

1. Packets received from

line-side serdes

4. Packet is processed by one of the general purpose

threads

4a. External DRAM accessed for large data

structures External DRAM

3. Packet descriptors sent to processor thread

for processing

400Gbps Line

400Gbps Fabric

2. Entire packet stored in on-chip

memory

4b. Hardware accelerators assist

certain features

5. After processing, packets are sent to

fabric in proper order

Page 8: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

8 © 2017 Cisco and/or its affiliates. All rights reserved.

Processing Flow – 400Gbps Fabric to Line

MA

Cs

Packet Storage

Processor Array

Fabric Interface

Hardware Accelerators

Traffic Manager

1. Packets received from fabric serdes

4. Packet is processed by one of the general purpose

threads 4a. External DRAM

accessed for large data structures

External DRAM

3. Packet descriptors sent to processor thread

for processing

400Gbps Line

400Gbps Fabric

2. Entire packet stored in on-chip

memory

4b. Hardware accelerators assist

certain features

5. After processing, packets are sent to

traffic manager in proper order

5a. Packets stored in external DRAM

6. Packets sent to line from traffic manager

using configured policies

Page 9: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

9 © 2017 Cisco and/or its affiliates. All rights reserved.

•  672 processor cores (2688 total threads) •  Run-to-completion model

•  A single thread “owns” a single packet throughout its processing life •  Different packets may require different features and therefore have

different processing times •  Contrast with: feature-pipelined architecture, systolic arrays, others

•  Programmable in C and assembly language •  Support for traditional stack

Processing Model

Page 10: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

10 © 2017 Cisco and/or its affiliates. All rights reserved.

MMU L0D$ Reg File

L0I$

Processor Core Architecture

Pipeline

L0D$

L0I$

MMU Request Generator

Reg File

Separate per-thread L0D$ 2-way set associative

Round robin thread switching

on a cycle by cycle basis

Per-thread MMU entries

Software managed

MMU

Request generator for L0D$ miss, explicit

requests

Separate per-thread L0I$

Network-specific instructions

added to ISA L1I$

TLU

8-stage non-stalling pipeline

Page 11: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

11 © 2017 Cisco and/or its affiliates. All rights reserved.

Core

Processor Cluster Architecture

Core

L1I$

L2I$

L1D$ TLU

Interconnect

16 Processor cores

Table Lookup Unit accelerates data structure

accesses L1D$

4-way set associative

Messages sent to processor interconnect

4-way set associative L1I$

Page 12: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

12 © 2017 Cisco and/or its affiliates. All rights reserved.

Cluster Cluster

Processor Array Architecture

Cluster L2I$

Interconnect

Packet storage

On-chip memories

DRAM controllers

Cluster 0 … Cluster 41

Accelerators

4-way set associative L2I$ shared by multiple clusters

Bi-partite In-order

> 9Tbps bandwidth

Cluster Cluster Cluster L2I$

… …

Message-based interconnect between processor cores/caches

and accelerators/memories

•  1GHz processor frequency

•  Entire processor array “hand”-placed using custom in-house tools

Page 13: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

13 © 2017 Cisco and/or its affiliates. All rights reserved.

•  Prefix look-up •  Looks up addresses, such as IPv4 and

IPv6 (e.g. 192.168.0.123) •  These address spaces in the Internet

are largely unstructured and not hierarchical

•  TCAM, hashing, range compression •  Access control lists, quality of service,

and other features map to a variety of data structures

•  Statistics counters, rate monitors •  Some applications require a large

number of counters for network management and customer Service Level Agreements are met

•  Packet ordering •  Special hardware to ensure that

packets within a flow do not leave the router out of order

Hardware Accelerators (not a complete list)

Page 14: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

14 © 2017 Cisco and/or its affiliates. All rights reserved.

•  Hierarchical queuing structure •  Flexible levels of queueing

hierarchy •  Minimum rate guarantees •  Maximum rate limits •  Weighted sharing

•  256k queues

•  Packet data stored in off-chip memory for large buffering

Traffic Manager

Web

Voice

… Physical port

Logical interface

End customer

Type of service

Example of a traffic scheduling hierarchy

Page 15: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

15 © 2017 Cisco and/or its affiliates. All rights reserved.

•  672 processors (2688 threads)

•  9.2 billion transistors

•  343 megabits of SRAM

•  276 serdes

•  643 mm^2 die

•  22nm process

•  Hybrid COT design flow

Processor Die

Serdes Accelerators

Traffic Manager

Accelerators

MACs, interface controllers, and buffering

Processor Array

Page 16: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

16 © 2017 Cisco and/or its affiliates. All rights reserved.

•  Portions of the network require more buffering than can be accommodated on a CMOS logic die

•  Portions of the network require larger table sizes than can be accommodated on a CMOS logic die •  Rough rule of thumb is that each feature requires one or more memory

accesses •  Examples of features: access control, quality of service, link aggregation,

statistics for service level agreements •  Proprietary architecture details are typically used to help reduce access

rate (one example could be caching)

External Memory System Motivation

Page 17: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

17 © 2017 Cisco and/or its affiliates. All rights reserved.

•  12.5Gbps serdes (up to 28 links)

•  Proprietary serial protocol optimized for networking

•  > 1 billion random accesses per second

•  > 300Gbps data transfer rate •  After removing overhead for

commands, addresses, etc.

Serial-Attached memory

Protocol Controller

Protocol Controller

DRAM (16 Banks)

DRAM (16 Banks)

Cro

ssba

r

......

16 RX, 28 TXserial links

@ 12.5 Gbps

Mem

ory

inte

rface

PLL

BIST

Mem

ory

inte

rface

Reference clock

Page 18: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

18 © 2017 Cisco and/or its affiliates. All rights reserved.

•  Multi-chip module •  28nm logic die (serdes, protocol

controller, BIST) •  30nm DRAM die

•  Parallel I/O interface between logic die and DRAM die •  0.85V @ 1250Mbps

Serial-Attached Memory

Page 19: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster

19 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Confidential

Thank you! Questions?

Page 20: A 400Gbps Multi-Core Network Processor - Hot Chips · Background: Distributed Router Architecture Network Processor Network ... Overview Processor DRAM ... Core Processor Cluster