An Efficient Programmable 10 Gigabit Ethernet Network Interface Card



Paul Willmann, Hyong-youb Kim, Scott Rixner, and Vijay S. Pai

2

Designing a 10 Gigabit NIC

Programmability for performance: computation offloading improves performance

NICs have power and area concerns: architectural solutions should be efficient

Above all, the NIC must support 10 Gb/s links. What are the computation and memory requirements? What architecture efficiently meets them? What firmware organization should be used?

3

Mechanisms for an Efficient Programmable 10 Gb/s NIC

A partitioned memory system: low-latency access to control structures; high-bandwidth, high-capacity access to frame data

A distributed task-queue firmware: utilizes frame-level parallelism to scale across many simple, low-frequency processors

New RMW instructions: reduce firmware frame-ordering overheads by 50% and reduce the clock frequency requirement by 17%

4

Outline

Motivation

How Programmable NICs work

Architecture Requirements, Design

Frame-parallel Firmware

Evaluation

5

How Programmable NICs Work

[Block diagram: the host bus connects to the NIC's PCI interface; on-board processor(s) and memory sit between the PCI interface and the Ethernet interface, which attaches to the Ethernet link.]

6

Per-frame Requirements

           Instructions   Data Accesses
TX Frame       281             101
RX Frame       253              85

Processing and control data requirements per frame, as determined by dynamic traces of relevant NIC functions

7

Aggregate Requirements: 10 Gb/s, Maximum-Sized Frames

           Instruction Throughput   Control Data Bandwidth   Frame Data Bandwidth
TX Frame         229 MIPS                  2.6 Gb/s                19.75 Gb/s
RX Frame         206 MIPS                  2.2 Gb/s                19.75 Gb/s
Total            435 MIPS                  4.8 Gb/s                39.5 Gb/s

1514-byte frames at 10 Gb/s = 812,744 frames/s
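For readers who want to reproduce these aggregate figures, the short C program below derives them from the per-frame counts on the previous slide; the 24 bytes of per-frame wire overhead (FCS, preamble, inter-frame gap) and the 32-bit width of control-data accesses are assumptions, not stated on the slides. The results match the table above to within rounding.

```c
#include <stdio.h>

int main(void) {
    /* Assumed Ethernet overhead per maximum-sized frame: 1514-byte frame
     * plus 4-byte FCS, 8-byte preamble, and 12-byte inter-frame gap. */
    const double wire_bits    = (1514.0 + 4 + 8 + 12) * 8;
    const double frames_per_s = 10e9 / wire_bits;        /* ~812,744 frames/s */

    /* Per-frame instruction and (assumed 32-bit) data-access counts
     * from the per-frame requirements slide. */
    const double tx_insns = 281, rx_insns = 253;
    const double tx_accs  = 101, rx_accs  = 85;

    printf("frames/s     : %.0f\n", frames_per_s);
    printf("TX MIPS      : %.0f\n", tx_insns * frames_per_s / 1e6);
    printf("RX MIPS      : %.0f\n", rx_insns * frames_per_s / 1e6);
    printf("TX ctrl Gb/s : %.1f\n", tx_accs * 32 * frames_per_s / 1e9);
    printf("RX ctrl Gb/s : %.1f\n", rx_accs * 32 * frames_per_s / 1e9);
    return 0;
}
```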

8

Meeting 10 Gb/s Requirements with Hardware

Processor architecture: at least 435 MIPS within an embedded device. Does NIC firmware have ILP?

Memory architecture: low-latency control data; high-bandwidth, high-capacity frame data. How can both be provided?

9

ILP Processors for NIC Firmware?

ILP is limited by data and control dependences; analysis of dynamic traces reveals these dependences.

IPC by issue mechanism, width, and branch prediction:

                  Perfect BP    1BP    No BP
In-order 1           0.87       0.87    0.87
In-order 2           1.19       1.19    1.13
In-order 4           1.34       1.33    1.17
Out-of-order 1       1.00       1.00    0.88
Out-of-order 2       1.96       1.74    1.21
Out-of-order 4       2.65       2.00    1.29

10

Processors: 1-Wide, In-order

A 2x performance gain is costly: branch prediction, a reorder buffer, renaming logic, and wakeup logic. These overheads translate to more than 2x core power and area costs - acceptable for a general-purpose processor, but not for an embedded device.

Other opportunities for parallelism? Yes: a frame requires many processing steps - run them simultaneously; many frames need processing - process them simultaneously.

Use parallel single-issue cores

                   1BP    No BP
In-order 1         0.87    0.87
Out-of-order 2     1.74    1.21

11

Memory Architecture

Competing demands:
Frame data needs high bandwidth and high capacity for many offload mechanisms
Control data needs low latency and coherence among the processors, the PCI interface, and the Ethernet interface

The traditional solution: caches. Advantages: low latency, transparent to the programmer. Disadvantages: hardware costs (tag arrays, coherence). In many applications, the advantages outweigh the costs.

12

Are Caches Effective?

SMPCache trace analysis of a 6-processor NIC architecture

[Plot: hit ratio (percent, 0-60) versus cache size, from 16 B to 32 KB, for the 6-processor configuration.]

13

Choosing a Better Organization

[Diagrams: a cache hierarchy versus a partitioned organization.]

14

Putting it All Together

[Block diagram: P CPUs with private I-caches backed by an instruction memory; a (P+4)x(S) 32-bit crossbar connects to S scratchpads; also shown are the PCI interface (to the PCI bus), the Ethernet interface, and an external memory interface to off-chip DRAM.]
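As a rough illustration of what the partitioned organization means for firmware, the sketch below places control structures in an on-chip scratchpad and frame contents in off-chip DRAM. All addresses, sizes, and structure fields are hypothetical; the slides do not give the real memory map.

```c
#include <stdint.h>

/* Hypothetical address map: the NIC's actual layout is not given on the slides. */
#define SCRATCHPAD_BASE  0x00100000u   /* on-chip scratchpad: low-latency control data */
#define FRAME_DRAM_BASE  0x80000000u   /* off-chip DRAM: high-capacity frame data      */

/* Control structures are small, latency-sensitive, and shared by the CPUs,
 * the PCI interface, and the Ethernet interface, so they live in scratchpad. */
struct tx_descriptor {
    uint32_t frame_addr;   /* points into frame DRAM */
    uint16_t frame_len;
    uint16_t flags;
};
static struct tx_descriptor *const tx_ring =
    (struct tx_descriptor *)SCRATCHPAD_BASE;

/* Frame contents are large and streamed: DMAed in from the PCI bus and out to
 * the Ethernet MAC, so they live in external DRAM and are never cached. */
static uint8_t *const frame_pool = (uint8_t *)FRAME_DRAM_BASE;
```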

15

Parallel Firmware

NIC processing steps are already well-defined

Previous Gigabit NIC firmware divides the steps between 2 processors

… but does this mechanism scale?

16

Task Assignment with an Event Register

[Diagram: an event register with a PCI read bit, a software event bit, and other bits. The PCI interface finishes its work and sets the PCI read bit; the processors inspect the completed transactions, raise the software event bit when they need to enqueue TX data, and pass the data to the Ethernet interface.]

17

Task-level Parallel Firmware

[Timeline: with task-level parallelism, the PCI read bit of the hardware status register triggers work; Proc 0 transfers DMAs 0-4 and then DMAs 5-9, while Proc 1 processes DMAs 0-4 and then DMAs 5-9, with each processor idle while it waits on the other's step.]

18

Frame-level Parallel Firmware

[Timeline: with frame-level parallelism, Proc 0 transfers DMAs 0-4, processes them, and builds the event, while Proc 1 does the same for DMAs 5-9; each processor carries its own frames through every step.]
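The difference from the task-level scheme can be sketched as a single worker loop that every processor runs, carrying one batch of frames through every step. Queue, structure, and helper names below are hypothetical.

```c
struct frame_batch;                        /* one event's worth of frames, e.g. DMAs 0-4 */

struct frame_batch *task_queue_pop(void);  /* shared, synchronized task queue (assumed)  */
void transfer_dmas(struct frame_batch *);  /* move frame data across the PCI bus         */
void process_dmas(struct frame_batch *);   /* per-frame bookkeeping and protocol work    */
void build_event(struct frame_batch *);    /* notify the next hardware stage             */

/* Every processor runs this same loop: frame-level parallelism comes from
 * different processors holding different batches, not from splitting steps. */
void worker_loop(void)
{
    for (;;) {
        struct frame_batch *b = task_queue_pop();
        if (b == NULL)
            continue;                      /* nothing pending: idle */
        transfer_dmas(b);
        process_dmas(b);
        build_event(b);
    }
}
```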

19

Evaluation Methodology

Spinach: a library of cycle-accurate LSE simulator modules for network interfaces. Memory latency, bandwidth, and contention are modeled precisely; processors are modeled in detail; NIC I/O (the PCI and Ethernet interfaces) is modeled in detail. Verified by modeling the Tigon 2 Gigabit NIC (LCTES 2004).

Idea: model everything inside the NIC and gather performance and trace data.

20

Scaling in Two Dimensions

[Plot: throughput (Gb/s) versus core frequency (100-300 MHz) for 1, 2, 4, 6, and 8 processors, with the Ethernet limit marked.]

21

Processor Performance

Processor behavior (IPC breakdown):

                             IPC
Execution                    0.72
Miss Stalls                  0.01
Load Stalls                  0.12
Scratchpad Conflict Stalls   0.05
Pipeline Stalls              0.10
Total                        1.00

Achieves 83% of theoretical peak IPC

Small I-caches work; performance is sensitive to memory stalls

Half of loads are part of a load-to-use sequence

Conflict stalls could be reduced with more ports or more banks

22

Reducing Frame Ordering Overheads

Firmware frame ordering is costly: 30% of execution

Synchronization and bitwise check/update operations occupy the processors and the memory system

Solution: atomic bitwise operations that also update a pointer according to the last set location

23

Maintaining Frame Ordering

[Diagram: a frame status array with one bit per frame index (Index 0, Index 1, …). CPUs A and B prepare frames and set the corresponding bits; CPU C detects completed frames by acquiring a lock, iterating over the set bits, notifying the Ethernet interface hardware, and releasing the lock.]
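To make the ordering overhead concrete, here is a sketch of the software-only path beside a version using a set-and-update read-modify-write operation. The helper functions and the atomic_set_bit_update intrinsic are hypothetical stand-ins for the instructions the slides describe.

```c
#include <stdint.h>

extern volatile uint32_t frame_status[];  /* one bit per frame slot               */
extern unsigned oldest;                   /* index of the oldest unnotified frame */

void lock(void);                          /* lock protecting the status array (assumed) */
void unlock(void);
int  test_bit(volatile uint32_t *, unsigned);
void set_bit(volatile uint32_t *, unsigned);
void notify_ethernet_hw(unsigned upto);   /* tell the MAC frames below 'upto' are ready */

/* Software-only ordering, roughly as drawn above: set this frame's bit, then
 * scan forward from the oldest frame under a lock before notifying hardware. */
void complete_frame_sw(unsigned idx)
{
    lock();
    set_bit(frame_status, idx);
    while (test_bit(frame_status, oldest))   /* iterate over the completed run */
        oldest++;
    notify_ethernet_hw(oldest);
    unlock();
}

/* Hypothetical RMW instruction: atomically set the bit and advance the pointer
 * according to the last set location, collapsing the lock-and-scan sequence. */
unsigned atomic_set_bit_update(volatile uint32_t *status, unsigned idx, unsigned *ptr);

void complete_frame_rmw(unsigned idx)
{
    unsigned ready = atomic_set_bit_update(frame_status, idx, &oldest);
    notify_ethernet_hw(ready);
}
```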

24

RMW Instructions Reduce Clock Frequency

Performance: 6 cores at 166 MHz with RMW instructions match 6 cores at 200 MHz without them. Performance is equivalent at all frame sizes, a 17% reduction in the clock frequency requirement.

Dynamically tasked firmware balances the benefit: send cycles are reduced by 28.4% and receive cycles by 4.7%.

25

Conclusions: A Programmable 10 Gb/s NIC

This NIC architecture relies on:
Data memory system - a partitioned organization, not coherent caches
Processor architecture - parallel scalar processors
Firmware - a frame-level parallel organization
RMW instructions - reduced frame-ordering overheads

A programmable NIC: A substrate for offload services

26

Comparing Frame Ordering Methods

[Plot: full-duplex throughput (Gb/s) versus UDP datagram size (0-1400 bytes), comparing the duplex Ethernet limit, 6x200 MHz software-only firmware, and 6x166 MHz RMW-enhanced firmware.]