Autonomous NIC Offload
Transcript of Autonomous NIC Offload
Boris Pismenny, Yoray Zack, Ben Ben-Ishay and Or Gerlitz
Autonomous NVMe-TCP Offload
Overview
• Motivation
• Storage protocol offload
• Seamless integration
• APIs and implementation
Motivation: offload opportunities
• Transmit-side data checksum calculation
  • PDU data CRC calculation
• Receive-side data checksum validation
  • PDU data CRC verification
• Receive-side copy
  • Need to place data at destination buffers
  • But TCP receives data in anonymous, unaligned buffers
  • Data is copied from TCP buffers to destination buffers
[Figure: NVMe-TCP PDU layout — header, data, CRC trailer]
Motivation: offload opportunities
• Copy and CRC consume up to 50% of the CPU cycles per IO
Motivation: NVMe out-of-order processing
• Generic zerocopy receive does not work
• NVMe supports reordering of storage read/write operations
[Figure: initiator and target exchange in which read completions A and B are reordered; generic zerocopy receive writes data to the wrong buffer]
To solve this problem, we need upper-layer protocol awareness!
Transmit offload overview
[Figure: baseline transmit — the NVMe-TCP layer builds the PDU (header H, data, trailer T) and computes the CRC on the CPU; TCP/IP prepends its headers and the NIC sends the packets unmodified]
Transmit offload overview
[Figure: offloaded transmit — the CPU sends PDUs with a zeroed CRC trailer; the NIC computes the CRC and fills it in on the wire ("Offload CRC"), so packets leave the NIC identical to the baseline]
Receive offload overview
[Figure: baseline receive — the NIC DMA-writes packets into anonymous buffers; the CPU then copies PDU data to the destination buffers and verifies the CRC (copy+crc)]
Receive offload overview
[Figure: offloaded receive — the NIC DMA-writes PDU data directly into the destination buffers and verifies the CRC in the same pass (DMA+copy+crc combined); headers and trailers still flow through the regular receive path]
Seamless integration: crc
• New SKB bit: skb->ddp_crc
  • Used similarly to TLS's skb->decrypted
• On transmit, skb->ddp_crc indicates that CRC offload is expected
• On receive, skb->ddp_crc indicates there are no CRC errors in the packet's payload
  • skb->ddp_crc == 0 triggers software PDU CRC calculation
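The receive-side decision can be sketched in plain C. `skb_model` and `pdu_crc_ok` are illustrative stand-ins for this talk, not the kernel's actual structures; the digest is CRC32C, which NVMe-TCP uses for PDU data:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78), the digest
 * NVMe-TCP uses for PDU data. Slow but dependency-free. */
uint32_t crc32c(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (uint32_t)-(int32_t)(crc & 1));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Toy stand-in for the few sk_buff fields this decision touches. */
struct skb_model {
    int ddp_crc;              /* set by the NIC when it verified the digest */
    const uint8_t *payload;   /* PDU data */
    size_t len;
    uint32_t trailer_crc;     /* digest carried in the PDU trailer */
};

/* ddp_crc set: trust the NIC. ddp_crc == 0: software fallback computes
 * and checks the PDU data digest itself. */
int pdu_crc_ok(const struct skb_model *skb)
{
    if (skb->ddp_crc)
        return 1;
    return crc32c(skb->payload, skb->len) == skb->trailer_crc;
}
```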
Seamless integration: copy
• NIC driver builds SKBs of the packets seen on the wire
  • Packet headers come from the receive ring
  • Storage-protocol headers/trailers come from the receive ring
  • Payload comes from the destination buffers
[Figure: SKB construction — Ethernet/IP/TCP and PDU headers are taken from the receive ring (not DMA target buffers), while PDU data is DMA-written by the offload directly into the application buffers mapped for CID=X]
Seamless integration: copy
• NIC driver builds SKBs of the packets seen on the wire
  • Packet headers come from the receive ring
  • Storage-protocol headers/trailers come from the receive ring
  • Payload comes from the destination buffers (referenced via skb_shinfo(skb))
• Storage protocol skips the copy
  • Only when (src == dst), checked before each memcpy
[Figure: same SKB construction, with skb_shinfo(skb) pointing at the application buffers for CID=X]
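The copy-skip check itself is tiny; `ddp_copy` below is an illustrative name, assuming the offload has already DMA-written the payload into the destination buffer so that src == dst:

```c
#include <string.h>

/* Copy PDU payload into the destination buffer, skipping the memcpy when
 * the offload already placed the data there (src == dst). */
static void ddp_copy(void *dst, const void *src, size_t len)
{
    if (src == dst)
        return;                 /* offload hit: data is already in place */
    memcpy(dst, src, len);
}

/* Self-check: the copy happens for distinct buffers and is a harmless
 * no-op when src and dst alias. */
int ddp_copy_selftest(void)
{
    char src[4] = "abc", dst[4] = { 0 };
    ddp_copy(dst, src, 4);
    if (strcmp(dst, "abc") != 0)
        return 0;
    ddp_copy(dst, dst, 4);      /* skipped path */
    return strcmp(dst, "abc") == 0;
}
```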
Seamless integration: copy
• Need to avoid network-stack copies of data
  • Problem: skb_coalesce copies data from the destination buffer back into the SKB
  • Solution: avoid it by reusing the skb->ddp_crc bit
• Need to map between destination pages and their identifiers
  • The upper-layer protocol maintains the mapping
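The page-to-identifier mapping can be pictured as a small per-connection table keyed by command identifier (CID). `ddp_map` and its limits are hypothetical names for illustration; the real mapping is maintained by the upper-layer protocol:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy CID -> destination-buffer table, kept per connection by the
 * upper-layer protocol. MAX_CID and the names are illustrative. */
#define MAX_CID 128

struct ddp_map {
    void *buf[MAX_CID];
};

int ddp_map_setup(struct ddp_map *m, uint16_t cid, void *buf)
{
    if (cid >= MAX_CID || m->buf[cid] != NULL)
        return -1;              /* out of range or CID already mapped */
    m->buf[cid] = buf;
    return 0;
}

void *ddp_map_lookup(const struct ddp_map *m, uint16_t cid)
{
    return cid < MAX_CID ? m->buf[cid] : NULL;
}

/* Called on IO completion (possibly asynchronously, as a later slide notes). */
void ddp_map_teardown(struct ddp_map *m, uint16_t cid)
{
    if (cid < MAX_CID)
        m->buf[cid] = NULL;
}

int ddp_map_selftest(void)
{
    static struct ddp_map m;
    static int page;
    if (ddp_map_setup(&m, 7, &page) != 0) return 0;
    if (ddp_map_lookup(&m, 7) != &page) return 0;
    if (ddp_map_setup(&m, 7, &page) != -1) return 0;  /* busy CID rejected */
    ddp_map_teardown(&m, 7);
    return ddp_map_lookup(&m, 7) == NULL;
}
```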
Hardware perspective
NIC contexts
• Dynamic state
  • expected TCP sequence number
  • current message offset
  • current message size
  • current message CID
  • CRC state
• Static state
  • CID-to-buffer map
  • protocol version
  • message format
Transmit offload in-sequence
• NIC offload implementation is simple
  • Incrementally offload using NIC contexts
[Figure: in-sequence transmit — the NIC walks TCP segments 1–8 in order, advancing the dynamic context (expected TCP seq, current message offset/size/CID, CRC state) against the static context (CID-to-buffer map, protocol version, message format)]
Transmit offload out-of-sequence
• Out-of-sequence transmit leaves the dynamic NIC context state wrong
• Context recovery needs only the message prefix
  • The driver can get the prefix from the storage-protocol layer
• Reuse the TCP transmit buffer for storing data
  • TCP ACKs release data at storage-protocol PDU granularity
[Figure: out-of-sequence transmit against the same NIC contexts — retransmitted segments no longer match the expected TCP sequence number in the dynamic state]
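A sketch of what transmit-side recovery amounts to: given the PDU boundaries that the storage layer still holds in its transmit buffer, locate the PDU and intra-PDU offset for a retransmitted TCP sequence number. `pdu_bound` and `tx_ctx_recover` are hypothetical names, and 32-bit sequence wraparound is ignored for brevity:

```c
#include <stddef.h>
#include <stdint.h>

/* PDU boundary as the storage layer remembers it in the transmit buffer. */
struct pdu_bound {
    uint32_t start_seq;   /* TCP seq of the PDU's first byte on the wire */
    uint32_t len;         /* whole PDU length on the wire */
};

/* Recover the dynamic context (which PDU, offset within it) for a
 * retransmitted TCP sequence number. Returns 0 on success, -1 if seq
 * falls outside all known PDUs. */
int tx_ctx_recover(const struct pdu_bound *p, size_t n, uint32_t seq,
                   size_t *pdu_idx, uint32_t *pdu_off)
{
    for (size_t i = 0; i < n; i++) {
        if (seq >= p[i].start_seq && seq < p[i].start_seq + p[i].len) {
            *pdu_idx = i;
            *pdu_off = seq - p[i].start_seq;
            return 0;
        }
    }
    return -1;
}

int tx_ctx_selftest(void)
{
    const struct pdu_bound pdus[2] = { { 1000, 100 }, { 1100, 200 } };
    size_t idx;
    uint32_t off;
    if (tx_ctx_recover(pdus, 2, 1150, &idx, &off) != 0) return 0;
    if (idx != 1 || off != 50) return 0;
    return tx_ctx_recover(pdus, 2, 2000, &idx, &off) == -1;
}
```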
Receive offload in-sequence
• NIC offload implementation is simple
  • Incrementally offload using NIC contexts
• Hardware reports one bit per packet: is the packet's CRC OK?
[Figure: packets 1, 3, 5 arriving in sequence, all CRC-verified; message headers marked]
Receive offload retransmission
• Retransmissions bypass the offload
  • Software fallback handles them
[Figure: packets 1, 3, 5 — retransmitted message data is non-verified by the NIC and falls back to software CRC]
Receive offload data reordering
• PDU data reordering
  • Hardware skips ahead to the next record
  • Offloading continues
[Figure: packets 1, 3, 5, 6 — the skipped data is non-verified; CRC verification resumes at the next message header]
Receive offload header reordering
• PDU header reordering
  • Stops the hardware NIC offload
• Software must recover the NIC context to continue
[Figure: packets 1, 3, 5, 6 — once a header is missed, subsequent message data goes non-verified]
Receive offload recovery problem
• NIC context recovery on receive is non-trivial:
  • Stopping packets to recover the NIC context is impossible; packets keep coming
  • Software alone cannot recover during traffic
  • Need to combine software and hardware
[Figure: packets 1, 3, 5, 6 — message data after the lost header remains non-verified]
Receive offload recovery solution
NIC context recovery relies on:
(1) Speculatively finding PDU message header magic pattern
(2) Requesting software to confirm that this is indeed a PDU header, while
(3) Tracking subsequent messages using the message header’s length field
(4) Resuming offload if software confirms the HW speculation
[Figure: packets 1, 3, 5–9 — the NIC speculates that packet 6 starts with a PDU header, tracks packets 7–9 via the header length field, and resumes offload after software confirms "yes, it was a PDU header"]
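The four recovery steps can be sketched in C under a deliberately simplified 5-byte header (a 1-byte type "magic" plus a little-endian 32-bit length); the real NVMe-TCP common header differs, and in the design above the hardware, not software, performs the scan. `PDU_MAGIC` and both function names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

#define HDR_LEN   5u     /* simplified header: 1-byte type + le32 length */
#define PDU_MAGIC 0x7Cu  /* hypothetical PDU-type byte to match on */

/* Step (1): speculatively find a byte that looks like a PDU header start.
 * Software must later confirm the guess (step 2). */
long find_pdu_candidate(const uint8_t *buf, size_t n)
{
    for (size_t i = 0; i + HDR_LEN <= n; i++)
        if (buf[i] == PDU_MAGIC)
            return (long)i;
    return -1;
}

/* Step (3): given a confirmed header offset, use its length field to find
 * where the next PDU header starts. Returns -1 past the end of buf. */
long next_pdu_offset(const uint8_t *buf, size_t n, size_t hdr_off)
{
    if (hdr_off + HDR_LEN > n)
        return -1;
    uint32_t len = (uint32_t)buf[hdr_off + 1]
                 | (uint32_t)buf[hdr_off + 2] << 8
                 | (uint32_t)buf[hdr_off + 3] << 16
                 | (uint32_t)buf[hdr_off + 4] << 24;
    size_t next = hdr_off + HDR_LEN + len;
    return next <= n ? (long)next : -1;
}
```

Step (4) would then re-arm the NIC context at the confirmed offset and resume offload from there.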
APIs and implementation
ULP DDP infrastructure
• ULP DDP interposes between NIC drivers and storage protocols
• Protocol agnostic
• Vendor agnostic
• First users are NVMe-TCP and the Mellanox driver
[Figure: the ULP-DDP infrastructure layer sits between storage protocols above (NVMe-TCP, iSCSI, …) and vendor drivers below (Mellanox, others)]
ULP DDP APIs
• Setup/teardown per-connection state
• Setup/teardown mapping between pages and their identifiers
• Protocol resynchronization
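These API groups could be expressed as an ops table a NIC driver implements. The struct and member names below are illustrative, loosely modeled on the proposed interface rather than its exact kernel signatures:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-driver ops table mirroring the three API groups:
 * per-connection setup/teardown, page<->identifier mapping, and resync. */
struct ulp_ddp_dev_ops {
    int  (*sk_add)(void *netdev, void *sk);               /* connection setup    */
    void (*sk_del)(void *netdev, void *sk);               /* connection teardown */
    int  (*setup)(void *netdev, void *sk, void *io);      /* map IO buffers      */
    void (*teardown)(void *netdev, void *sk, void *io);   /* unmap on completion */
    void (*resync)(void *netdev, void *sk, uint32_t seq); /* confirm PDU header  */
};

/* Dummy driver implementation counting live offloaded connections. */
static int live_connections;

static int dummy_sk_add(void *netdev, void *sk)
{
    (void)netdev; (void)sk;
    live_connections++;
    return 0;
}

static void dummy_sk_del(void *netdev, void *sk)
{
    (void)netdev; (void)sk;
    live_connections--;
}

int ulp_ddp_ops_selftest(void)
{
    const struct ulp_ddp_dev_ops ops = {
        .sk_add = dummy_sk_add,
        .sk_del = dummy_sk_del,
    };
    if (ops.sk_add(NULL, NULL) != 0 || live_connections != 1)
        return 0;
    ops.sk_del(NULL, NULL);
    return live_connections == 0;
}
```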
NVMe-TCP setup per-connection state
• Offload begins after all connection handshakes complete (the "start offload here" point in the setup flow)
• Configure NVMe queue limits (max SGL, max IO size, etc.)
NVMe-TCP mapping pages
• Map buffers before the IO is sent
• Unmap on IO completion
  • Asynchronous unmap was added to improve performance
Netdev features
• We are running out of netdev feature bits!
• Proposal: override __UNUSED_NETIF_F_1
  • A single bit covers both receive and transmit
Future work
• Integration with TLS
  • Data-path POC is working
  • Need a solution for the TLS handshake in NVMe-TCP