An Efficient Programmable 10 Gigabit Ethernet Network Interface Card


Page 1

An Efficient Programmable 10 Gigabit Ethernet Network Interface Card

Paul Willmann, Hyong-youb Kim, Scott Rixner, and Vijay S. Pai

Page 2

Designing a 10 Gigabit NIC

Programmability for performance
  Computation offloading improves performance

NICs have power, area concerns
  Architecture solutions should be efficient

Above all, must support 10 Gb/s links
  What are the computation, memory requirements?
  What architecture efficiently meets them?
  What firmware organization should be used?

Page 3

Mechanisms for an Efficient Programmable 10 Gb/s NIC

A partitioned memory system
  Low-latency access to control structures
  High-bandwidth, high-capacity access to frame data

A distributed task-queue firmware
  Utilizes frame-level parallelism to scale across many simple, low-frequency processors

New RMW instructions
  Reduce firmware frame-ordering overheads by 50% and reduce clock frequency requirement by 17%

Page 4

Outline

Motivation

How Programmable NICs work

Architecture Requirements, Design

Frame-parallel Firmware

Evaluation

Page 5

How Programmable NICs Work

[Figure: block diagram of a programmable NIC showing a PCI Interface (to the bus), an Ethernet Interface (to Ethernet), Memory, and Processor(s).]

Page 6

Per-frame Requirements

            Instructions   Data Accesses
TX Frame    281            101
RX Frame    253            85

Processing and control data requirements per frame, as determined by dynamic traces of relevant NIC functions

Page 7

Aggregate Requirements: 10 Gb/s, Max-Sized Frames

            Instruction Throughput   Control Data Bandwidth   Frame Data Bandwidth
TX Frame    229 MIPS                 2.6 Gb/s                 19.75 Gb/s
RX Frame    206 MIPS                 2.2 Gb/s                 19.75 Gb/s
Total       435 MIPS                 4.8 Gb/s                 39.5 Gb/s

1514-byte frames at 10 Gb/s: 812,744 frames/s
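The aggregate figures can be approximately reproduced from the per-frame counts on the previous slide. The sketch below is not from the talk: it assumes standard Ethernet wire overhead per frame (4-byte CRC, 8-byte preamble, 12-byte inter-frame gap) and assumes every control-data access moves one 32-bit word. Under those assumptions the frame rate and the instruction and control-data columns come out close to the slide's values, which appear to round the MIPS figures up.

```python
import math

LINK_RATE_BPS = 10e9      # 10 Gb/s Ethernet link
FRAME_BYTES = 1514        # maximum-sized frame, as counted on the slide
# Assumption: 4-byte CRC + 8-byte preamble + 12-byte inter-frame gap per frame.
WIRE_OVERHEAD_BYTES = 4 + 8 + 12
# Assumption: each control-data access transfers one 32-bit word.
ACCESS_BITS = 32

# (instructions, data accesses) per frame, from the per-frame requirements table.
PER_FRAME = {"TX": (281, 101), "RX": (253, 85)}


def frames_per_second():
    """Frames per second in one direction at full line rate."""
    return LINK_RATE_BPS / ((FRAME_BYTES + WIRE_OVERHEAD_BYTES) * 8)


def aggregate(direction):
    """Return (MIPS, control-data Gb/s) needed for one direction."""
    instructions, accesses = PER_FRAME[direction]
    fps = frames_per_second()
    mips = instructions * fps / 1e6
    control_gbps = accesses * ACCESS_BITS * fps / 1e9
    return mips, control_gbps


if __name__ == "__main__":
    print(f"frames/s: {frames_per_second():,.0f}")
    for direction in ("TX", "RX"):
        mips, gbps = aggregate(direction)
        print(f"{direction}: {math.ceil(mips)} MIPS, {gbps:.1f} Gb/s control data")
```

The frame-data column is left out because the slide does not say how many times each frame crosses the memory system.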

Page 8

Meeting 10 Gb/s Requirements with Hardware

Processor Architecture
  At least 435 MIPS within embedded device
  Does NIC firmware have ILP?

Memory Architecture
  Low latency control data
  High bandwidth, high capacity frame data
  … both, how?

Page 9

ILP Processors for NIC Firmware?
  ILP limited by data, control dependences
  Analysis of a dynamic trace reveals the dependences

               Perfect BP   Perfect 1BP   No BP
In-order 1     0.87         0.87          0.87
In-order 2     1.19         1.19          1.13
In-order 4     1.34         1.33          1.17
Out-order 1    1.00         1.00          0.88
Out-order 2    1.96         1.74          1.21
Out-order 4    2.65         2.00          1.29

Page 10

Processors: 1-Wide, In-order

2x performance costly
  Branch prediction, reorder buffer, renaming logic, wakeup logic
  Overheads translate to greater than 2x core power, area costs
  Great for a GP processor; not for an embedded device

Other opportunities for parallelism? YES!
  Many steps to process a frame - run them simultaneously
  Many frames need processing - process simultaneously

Use parallel single-issue cores

               Perfect 1BP   No BP
In-order 1     0.87          0.87
Out-order 2    1.74          1.21

Page 11

Memory Architecture

Competing demands
  Frame data: High bandwidth, high capacity for many offload mechanisms
  Control data: Low latency; coherence among processors, PCI Interface, and Ethernet Interface

The traditional solution: Caches
  Advantages: low latency, transparent to the programmer
  Disadvantages: Hardware costs (tag arrays, coherence)
  In many applications, advantages outweigh costs

Page 12

Are Caches Effective?

SMPCache trace analysis of a 6-processor NIC architecture

[Figure: hit ratio (percent, axis 0 to 60) versus cache size (16 B to 32 KB) for the 6-processor configuration.]

Page 13

Choosing a Better Organization

[Figure panels: Cache Hierarchy; A Partitioned Organization]

Page 14

Putting it All Together

[Figure: block diagram of the proposed NIC. Components shown: Instruction Memory; I-Cache 0 through I-Cache P-1; CPU 0 through CPU P-1; a (P+4)x(S) 32-bit crossbar; Scratchpad 0 through Scratchpad S-1; PCI Interface (PCI Bus); Ethernet Interface; External Memory Interface to off-chip DRAM.]

Page 15

Parallel Firmware

NIC processing steps already well-defined

Previous Gigabit NIC firmware divides steps between 2 processors

… but does this mechanism scale?

Page 16

Task Assignment with an Event Register

[Figure: an event register containing a PCI Read Bit, a SW Event Bit, and other bits. Annotated sequence: PCI Interface finishes work; processor(s) inspect transactions; processor(s) need to enqueue TX data; processor(s) pass data to Ethernet Interface.]
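A minimal sketch of event-register task assignment, assuming a register in which each bit signals one kind of pending work and a processor runs the matching handler for every set bit. The bit positions, handler names, and `dispatch` function are hypothetical; the slide names the bits but gives neither their layout nor the firmware interface.

```python
# Hypothetical bit positions: the slide names the bits but not their layout.
PCI_READ_BIT = 1 << 0
SW_EVENT_BIT = 1 << 1


def make_handlers(log):
    """Map each event bit to a (name, handler) pair; handlers only record calls."""
    return {
        PCI_READ_BIT: ("pci_read", lambda: log.append("inspect completed PCI reads")),
        SW_EVENT_BIT: ("sw_event", lambda: log.append("handle software event")),
    }


def dispatch(event_register, handlers):
    """Run the handler for every set bit, lowest bit first; return the names run."""
    ran = []
    for bit in sorted(handlers):
        name, handler = handlers[bit]
        if event_register & bit:
            handler()
            ran.append(name)
    return ran


if __name__ == "__main__":
    log = []
    print(dispatch(PCI_READ_BIT | SW_EVENT_BIT, make_handlers(log)))
    print(log)
```

In this scheme one event bit corresponds to one type of work, so only one processor at a time services a given event type.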

Page 17

Task-level Parallel Firmware

[Figure: timeline for two processors under task-level parallel firmware. Rows: PCI Read Bit, PCI Read HW Status, Function Running (Proc 0), Function Running (Proc 1). Labeled intervals: Transfer DMAs 0-4, Transfer DMAs 5-9, Process DMAs 0-4, Process DMAs 5-9, Idle.]

Page 18

Frame-level Parallel Firmware

[Figure: timeline for two processors under frame-level parallel firmware. Rows: PCI RD HW Status, Function Running (Proc 0), Function Running (Proc 1). Labeled intervals: Transfer DMAs 0-4, Transfer DMAs 5-9, Build Event, Process DMAs 0-4, Process DMAs 5-9, Idle.]
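One way to picture the frame-level organization is as a shared queue of event structures, each covering a bundle of completed DMAs, from which any idle processor can take work. The sketch below uses Python threads as stand-ins for NIC processors; the function names, the bundle size of five, and the queue itself are illustrative assumptions, not the authors' firmware.

```python
import queue
import threading


def build_events(num_dmas, bundle):
    """Split completed DMAs into event structures, one per bundle."""
    return [range(start, min(start + bundle, num_dmas))
            for start in range(0, num_dmas, bundle)]


def run_workers(events, num_procs):
    """Let num_procs workers drain the event queue; return (proc, dma) pairs."""
    work = queue.Queue()
    for event in events:
        work.put(event)

    processed = []
    lock = threading.Lock()  # protects the shared result list only

    def worker(proc_id):
        while True:
            try:
                event = work.get_nowait()
            except queue.Empty:
                return  # no more events: this processor goes idle
            for dma in event:
                with lock:
                    processed.append((proc_id, dma))

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


if __name__ == "__main__":
    done = run_workers(build_events(10, 5), num_procs=2)
    print(sorted(dma for _, dma in done))
```

Because any processor may pick up any event, the same kind of work (here, processing DMAs) can run on several processors at once. Which processor handles which bundle is not deterministic in this sketch.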

Page 19

Evaluation Methodology

Spinach: A library of cycle-accurate LSE simulator modules for network interfaces
  Memory latency, bandwidth, contention modeled precisely
  Processors modeled in detail
  NIC I/O (PCI, Ethernet Interfaces) modeled in detail
  Verified when modeling the Tigon 2 Gigabit NIC (LCTES 2004)

Idea: Model everything inside the NIC
  Gather performance, trace data

Page 20

Scaling in Two Dimensions

[Figure: throughput (Gb/s, axis 0 to 20) versus core frequency (100 to 300 MHz) for 1, 2, 4, 6, and 8 processors, with the Ethernet limit marked.]

Page 21

Processor Performance

Processor Behavior            IPC Component
Execution                     0.72
Miss Stalls                   0.01
Load Stalls                   0.12
Scratchpad Conflict Stalls    0.05
Pipeline Stalls               0.10
Total                         1.00

Achieves 83% of theoretical peak IPC

Small I-Caches work

Sensitive to memory stalls
  Half of loads are part of a load-to-use sequence
  Conflict stalls could be reduced with more ports, more banks
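The 83% figure is consistent with dividing the execution component (0.72) by the 0.87 IPC listed for the 1-wide in-order core in the earlier ILP table. Treating that 0.87 as the "theoretical peak" is an inference from the numbers, not something the slide states; the short check below makes the arithmetic explicit.

```python
# IPC breakdown as listed in the table above.
IPC_COMPONENTS = {
    "Execution": 0.72,
    "Miss Stalls": 0.01,
    "Load Stalls": 0.12,
    "Scratchpad Conflict Stalls": 0.05,
    "Pipeline Stalls": 0.10,
}

# Assumption: "theoretical peak" is the 0.87 IPC of the in-order, 1-wide core
# from the ILP table; the slide does not say so explicitly.
THEORETICAL_PEAK_IPC = 0.87


def fraction_of_peak(achieved_ipc, peak_ipc):
    """Achieved IPC as a fraction of the peak IPC."""
    return achieved_ipc / peak_ipc


if __name__ == "__main__":
    total = sum(IPC_COMPONENTS.values())
    share = fraction_of_peak(IPC_COMPONENTS["Execution"], THEORETICAL_PEAK_IPC)
    print(f"components sum to {total:.2f}")
    print(f"execution is {share:.0%} of peak")
```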

Page 22

Reducing Frame Ordering Overheads

Firmware ordering costly - 30% of execution

Synchronization, bitwise check/updates occupy processors, memory

Solution: Atomic bitwise operations that also update a pointer according to last set location
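The slide does not give the instruction semantics, so the following is only a software emulation of what such operations might do: one operation atomically marks a frame as ready, and another atomically consumes the run of consecutive ready frames starting at a head pointer and advances that pointer. The class, method names, and wrap-around behavior are assumptions for illustration; the lock stands in for hardware atomicity.

```python
import threading


class FrameStatus:
    """Software stand-in for a status bit array with atomic set/update operations."""

    def __init__(self, size):
        self.bits = [0] * size
        self.head = 0                      # oldest frame not yet handed to hardware
        self._atomic = threading.Lock()    # models hardware atomicity only

    def set_bit(self, index):
        """Atomically mark frame `index` as ready."""
        with self._atomic:
            self.bits[index % len(self.bits)] = 1

    def update(self):
        """Atomically clear the run of set bits at head, advance head, return its length."""
        with self._atomic:
            count = 0
            while count < len(self.bits) and self.bits[self.head]:
                self.bits[self.head] = 0
                self.head = (self.head + 1) % len(self.bits)
                count += 1
            return count


if __name__ == "__main__":
    status = FrameStatus(8)
    status.set_bit(1)            # frame 1 finishes first (out of order)
    print(status.update())       # frame 0 is not ready, so nothing is released
    status.set_bit(0)
    print(status.update())       # frames 0 and 1 are released together, in order
```

A frame that finishes early is held back until every earlier frame is ready, which preserves ordering without a separate lock-and-scan loop in firmware.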

Page 23

Maintaining Frame Ordering

[Figure: Frame Status Array of per-frame bits (Index 0, Index 1, Index 3, Index 4, … more bits), shown with bits as 0 and later as 1. CPU A prepares frames; CPU B prepares frames; CPU C detects completed frames in four steps: LOCK, iterate, notify hardware (Ethernet Interface), UNLOCK.]

Page 24

RMW Instructions Reduce Clock Frequency

Performance: 6x166 MHz (RMW-enhanced) = 6x200 MHz (software only)
  Performance is equivalent at all frame sizes
  17% reduction in frequency requirement, since (200 - 166) / 200 = 0.17

Dynamically tasked firmware balances the benefit
  Send cycles reduced by 28.4%
  Receive cycles reduced by 4.7%

Page 25

Conclusions: A Programmable 10 Gb/s NIC

This NIC architecture relies on:
  Data Memory System - Partitioned organization, not coherent caches
  Processor Architecture - Parallel scalar processors
  Firmware - Frame-level parallel organization
  RMW Instructions - reduce ordering overheads

A programmable NIC: A substrate for offload services

Page 26

Comparing Frame Ordering Methods

[Figure: full-duplex throughput (Gb/s, axis 0 to 20) versus UDP datagram size (0 to 1400 bytes) for the 6x200 MHz software-only and 6x166 MHz RMW-enhanced configurations, with the duplex Ethernet limit marked.]