
Towards Stateless RNIC for Data Center Networks

Pulin Pan, Guo Chen∗, Xizheng Wang
{panpulin,guochen,xzwang}@hnu.edu.cn
Hunan University

Huichen Dai, Bojie Li, Binzhang Fu, Kun Tan
{daihuichen,libojie2,fubinzhang,kun.tan}@huawei.com
Huawei

APNet '19, August 17–18, 2019, Beijing, China
DOI: 10.1145/3343180.3343183

ABSTRACT

Because of small NIC on-chip memory, the massive connection states maintained on the Remote Direct Memory Access (RDMA) NIC (RNIC) significantly limit its scalability. When the number of concurrent connections grows, RNICs have to frequently fetch connection states from host memory, leading to dramatic performance degradation. In this paper, we propose StaR, which fundamentally solves this scalability issue by making the RNIC stateless. Leveraging the asymmetric communication pattern in data center applications, the StaR RNIC stores zero connection-related states by moving all the connection states to the other end. Through careful design, StaR RNICs can maintain unchanged RDMA semantics and avoid security issues even when processing traffic statelessly. Preliminary simulation results show that StaR can improve the aggregate throughput by more than 160x (stress test) and 4x (application) compared to original RNICs.

KEYWORDS

RDMA, Scalability, Stateless RNIC, Data Center Networks

1 INTRODUCTION

With the increasingly stringent demand for low latency and high bandwidth, Remote Direct Memory Access (RDMA) has been widely deployed in data center networks (DCNs) [1]. RDMA¹ offers superior performance to traditional TCP/IP by implementing the entire network stack in the hardware network interface card (NIC), letting applications directly access remote memory while largely bypassing the CPU.

Existing RDMA NICs (RNICs), however, need to maintain massive connection-related context states to complete the RDMA stack processing. Specifically, for each RDMA connection², the RNIC has to store its memory-related states (e.g., page size, address translation table) and networking-related states (e.g., transport states and packet header information). Since NIC hardware has very limited on-chip memory (e.g., a few MBs), such states greatly limit the RNIC's scalability. Our testbed measurements (Fig. 1) show that a typical RNIC can only support fewer than 300 concurrent connections while maintaining performance.

∗ Corresponding author.
¹ In this paper, RDMA refers to RoCEv2 [2] as it is the de facto standard in current DCNs [1].
² Though RDMA provides the Unreliable Datagram (UD) transport for connectionless communication, it is rarely used in DCNs because UD does not support reliable transmission and congestion control.

Figure 1: RNIC performance collapses due to high memory footprint per connection (PCIe read indicates the NIC fetching states from host memory). Experiment settings: four client machines initiate RDMA WRITEs to one server machine through multiple concurrent reliable connections (one outstanding 8-byte WRITE on each connection). All machines are connected to a 100Gbps switch with Mellanox CX4 NICs. (Axes: rate in millions/s vs. number of concurrent connections; curves: application throughput and PCIe read.)

Although the above problem has been known for a long time and has attracted much attention in the community [3–7], strictly speaking it remains unsolved. Previous works either try to mitigate it (e.g., using large memory pages [3] or connection grouping [7]) or work around it (e.g., using unreliable datagrams [5]), all requiring collaborative changes from the application³.

In this work, we aim to fundamentally solve the above dilemma with the Stateless RNIC (StaR), a novel RNIC that offers ultra-high scalability while being transparent to applications (except for specifying the stateless side during connection setup). By moving states to the other communication side, a StaR NIC can maintain zero connection-related states while all the RDMA data plane processing is still done by NIC hardware. Specifically, during connection setup, the two StaR communication sides negotiate on where to maintain the connection states. The stateless side then relies on the connection states maintained on the other side to complete its network communication, including receiving and sending data packets, notifying applications, and generating ACK packets.

³ We note that there are also many middle layers providing communication interfaces on top of the original RDMA verbs (e.g., [8, 9]). To be general, we regard all software that uses the RDMA verbs as RDMA applications.


Many DCN applications that require many concurrent connections have an asymmetric communication pattern, i.e., only one side (called the server) has huge fan-in/fan-out traffic while the other side (called the client) has only a few connections [5, 10–12]. To name a few: parameter servers (server) and workers (client) in distributed machine learning systems [12], and result aggregators (server) and document lookupers (client) in web-search back-end services [10]. As such, StaR can significantly improve the performance of these applications, which is otherwise bottlenecked by RNIC scalability at the server side.

We have implemented StaR in the NS3 simulator. Evaluation results show that compared to original RNICs, StaR can improve the aggregate throughput by more than 160x under targeted stress tests, and bring more than 4x throughput improvement for upper-layer applications.

2 CHALLENGES AND INSIGHTS

We need to carefully address several challenges brought by moving connection states to the other side. Particularly:

• How to complete RDMA functionalities without states on the NIC? An RNIC has two major functionalities, i.e., DMAing data from/to application memory and transmitting data through the network. Correspondingly, we need to figure out how to: 1) determine the address to write/read data to/from in the application memory; 2) notify applications when operations are completed; 3) complete reliable in-order network transmission and congestion control; 4) encapsulate packets (e.g., filling IP and MAC addresses).

• How to ensure security without states on the NIC? We can verify whether an application's operation to the NIC is secure in the driver software. However, without maintaining any states, the RNIC cannot conduct security checks on received packets. Therefore, the NIC actions triggered by received packets may cause adversarial consequences such as accessing illegal memory addresses and generating malicious traffic to the network.

We address the above challenges in StaR based on the following insights:

• Carry necessary states in packets by trading off some network bandwidth. Specifically, the client tracks the stack processing states for the server, and generates packets carrying those necessary states. Then the stateless side relies on the connection states carried inside the received packets to complete its RDMA stack processing (see the sketch after this list). As §3.1 shows, the header size consumed by those states is acceptable, only incurring small bandwidth overhead.

• Ensure security at the client. Since StaR targets the DCN scenario where all physical machines are managed by a single operator, we can ensure security by controlling the packets sent out by the client NICs. Particularly, we add a security module to each NIC, which is essentially a match-action table that checks all outbound packets. If a NIC is specified as the client side of a StaR connection, its security table will be installed with entries only allowing legal packets to the corresponding server. In addition, switches in the DCN can prevent source address spoofing.
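To make the first insight concrete, the sketch below shows a stateless-side handler that acts purely on the state carried inside each received packet, with no per-connection lookup; the field names and layout are illustrative assumptions rather than StaR's actual wire format.

```python
# Minimal sketch of the first insight: the stateless side acts only on the
# state carried inside each received packet, with no per-connection lookup.
# The packet fields below are illustrative assumptions, not StaR's wire format.
from dataclasses import dataclass

@dataclass
class CarriedState:
    dma_addr: int        # where to DMA the payload in host memory
    length: int          # payload length
    reply_header: bytes  # prebuilt Ethernet/IP header info for the ACK

@dataclass
class DataPacket:
    seq: int
    state: CarriedState  # connection state carried by the stateful client
    payload: bytes

def stateless_process(pkt: DataPacket, dma_write, send_ack):
    """Everything needed is in the packet itself, so the NIC keeps no state."""
    dma_write(pkt.state.dma_addr, pkt.payload[:pkt.state.length])
    send_ack(pkt.state.reply_header, pkt.seq)   # acknowledge and forget

# Toy usage: host memory emulated by a dict, ACKs collected in a list.
host_mem, acks = {}, []
pkt = DataPacket(1, CarriedState(0x1000, 5, b"hdr"), b"hello")
stateless_process(pkt, host_mem.__setitem__, lambda hdr, seq: acks.append((hdr, seq)))
```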

3 STAR DESIGN

We now discuss in detail how StaR conducts stack processing and its security mechanism, respectively, followed by a discussion of StaR's performance.

3.1 Stack Processing

Overview: Fig. 2 overviews the stateless processing in StaR. StaR offers applications an interface to specify which side is to be stateless before the connection starts (e.g., through a connection option). Then, during transmission, applications can benefit from the StaR NIC's high scalability without any constraints or changes beyond standard RDMA usage.

Specifically, during connection setup, the stateless side's (called the server) driver transmits all its connection states to the other, stateful side (called the client). The original RDMA stack creates a pair of work queues, called a queue pair (QP), to maintain the stack processing context of each RDMA connection (including memory-access-related states and network-transmission-related states). As such, the stateless side transfers the whole QP context to the client side during connection setup (maintaining a copy in the local host software), and the client can track and help the server NIC finish all the stack processing.

Whenever the server application initiates an RDMA operation by posting a work queue element (WQE) to its QP on the host, the WQE is directly encapsulated into a packet, then transferred to and stored in the QP maintained on the client side, rather than being stored at the local NIC. The client-side stack can then generate corresponding packets and trigger the server NIC's operations according to the WQEs in that QP. For example, the client can generate packets with DMA addresses according to the RECV/READ WQEs in the remote QP, so that the data can be directly DMAed by the server NIC. Similarly, it can get the server's data using DMA addresses from the SEND/WRITE WQEs in the remote QP. It does incur cost to send WQEs to the other side before transmitting the actual data. However, as introduced later, this has almost no impact (for receiving operations) or very small impact (for sending operations) on the end-to-end delay in data center scenarios.
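A minimal sketch of the setup-time hand-off described above: the stateless side's driver serializes its QP context (including the memory table and network states) and ships it to the stateful side over the ordinary connection-setup channel (the paper notes setup typically runs over TCP). All structures, field names, and the serialization format here are illustrative assumptions.

```python
# Sketch of the setup-time state hand-off: the stateless driver exports its
# whole QP context to the stateful peer, which installs and tracks it.
# Structures and field names are illustrative assumptions.
import json
import socket
from dataclasses import dataclass, field, asdict

@dataclass
class QPContext:
    qpn: int                                         # queue pair number
    psn: int                                         # starting packet sequence number
    mem_table: dict = field(default_factory=dict)    # lkey (as str) -> [phys addr, len]
    net_states: dict = field(default_factory=dict)   # MAC/IP/UDP header info, etc.

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed during setup")
        buf += chunk
    return buf

def export_qp_context(sock: socket.socket, ctx: QPContext) -> None:
    """Server (stateless) driver: send the whole QP context to the client."""
    blob = json.dumps(asdict(ctx)).encode()
    sock.sendall(len(blob).to_bytes(4, "big") + blob)

def import_qp_context(sock: socket.socket) -> QPContext:
    """Client (stateful) side: install the received context and track it."""
    size = int.from_bytes(_recv_exact(sock, 4), "big")
    return QPContext(**json.loads(_recv_exact(sock, size).decode()))
```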

58

Page 3: Towards Stateless RNIC for Data Center NetworksTowards Stateless RNIC for Data Center Networks Pulin Pan Guo Chen∗ Xizheng Wang {panpulin,guochen,xzwang}@hnu.edu.cn Hunan University

Towards Stateless RNIC for Data Center Networks APNet ’19, August 17–18, 2019, Beijing, China

Figure 2: Stateless RDMA processing in StaR NIC: maintain all states at one side.

Since RDMA applications operate on a virtual memory space, the original RNIC maintains a memory table (Mem table in Fig. 2) to translate user-registered buffer addresses to actual physical memory addresses. As such, in StaR, the server side also transmits its memory table to the client during connection setup. Then, whenever the application registers/deregisters a user buffer later, the server driver encapsulates the update into packets and sends them to the client, which then updates the entries of the memory table maintained in the client NIC. Note that registering/deregistering a buffer is never on the critical path of data transmission, so the above procedure will not impair performance.
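To make the memory-table flow above concrete, here is a small sketch of how a server-side buffer registration could be mirrored into the client-held table and later used to fill DMA addresses; the message layout and helper names are assumptions for illustration, not StaR's actual interface.

```python
# Sketch of mirroring buffer (de)registration into the client-held memory
# table. Message layout and names are illustrative assumptions.
import struct

MEM_UPDATE = struct.Struct("!BIQQI")   # op, lkey, virt addr, phys addr, length
OP_REGISTER, OP_DEREGISTER = 1, 2

def server_register_buffer(send, lkey, vaddr, paddr, length):
    """Server driver: encapsulate a registration update and send it to the client."""
    send(MEM_UPDATE.pack(OP_REGISTER, lkey, vaddr, paddr, length))

def client_apply_update(mem_table: dict, msg: bytes):
    """Client NIC: install/remove the entry used later to fill DMA addresses."""
    op, lkey, vaddr, paddr, length = MEM_UPDATE.unpack(msg)
    if op == OP_REGISTER:
        mem_table[lkey] = (vaddr, paddr, length)
    else:
        mem_table.pop(lkey, None)

def client_translate(mem_table: dict, lkey: int, vaddr: int) -> int:
    """Translate a user virtual address into the DMA address carried in data packets."""
    base_v, base_p, length = mem_table[lkey]
    assert base_v <= vaddr < base_v + length, "access outside registered buffer"
    return base_p + (vaddr - base_v)
```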

RDMA applications typically use TCP to set up connections and transmit the necessary information during connection setup. As such, we omit the detailed discussion of connection setup, which is easy to understand, and focus on data transmission in the rest of this section.

Packet types: StaR relies on the following types of packets to complete stateless RDMA processing:

1) WQEP, CQEP and EA. A WQEP is the packet that transmits a WQE to the client, and a CQEP is the packet that transmits a CQE to the server after its operation has finished. A typical WQE/CQE with one memory region is no more than a few tens of bytes [13]. WQEPs and CQEPs carry a per-connection sequence number tagged by the driver software, and an EA (Element ACK) is the ACK that clients return to servers to notify that a WQEP/CQEP has been received.

2) RDP and RA. An RDP (Receiving Data Packet) is the packet that contains data to be received by the stateless server. The data DMA address (8B) and its length (2B) are carried in each RDP's header. RDPs have a sequence number, and an RA (Receiving ACK) is the ACK that servers return to clients after receiving each RDP. RA headers are generated using the header information carried in each RDP (e.g., the MAC and IP addresses for this connection). Such header information is relatively small. For instance, a current RNIC requires only 44B of state to generate a packet's Ethernet and IP headers [13].

3) GD and SDP. A GD (Getting Data) is the packet that a client sends to a server to trigger it to send data. It carries the data DMA address (8B) and length (2B), and the header information to encapsulate the data packets (similar to an RDP). An SDP (Sending Data Packet) is the packet triggered by a GD, containing the data sent from the server to the client.

Due to space limitations, we omit the detailed structure of each packet type. We now introduce the procedures of data reception and sending in StaR, respectively, based on how StaR handles the four RDMA verbs (WRITE/READ and SEND/RECV).
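As a rough illustration of the packet types just listed (whose exact wire formats the paper omits), the sketch below packs hypothetical WQEP/RDP/GD headers using only the field sizes mentioned above (8B DMA address, 2B length, ~44B of Ethernet/IP header state); everything else about the layout is an assumption.

```python
# Hypothetical header layouts for the StaR packet types above. Only the field
# sizes mentioned in the text are taken from the paper; the rest is assumed.
import struct

# WQEP/CQEP: per-connection sequence number + opaque WQE/CQE body.
WQEP_HDR = struct.Struct("!IH")          # seq, body length
# RDP: seq, DMA address, payload length, reply (RA) header info.
RDP_HDR = struct.Struct("!IQH44s")
# GD: seq, DMA address, requested length, header info for the SDPs it triggers.
GD_HDR = struct.Struct("!IQH44s")

def build_wqep(seq: int, wqe: bytes) -> bytes:
    return WQEP_HDR.pack(seq, len(wqe)) + wqe

def build_rdp(seq: int, dma_addr: int, payload: bytes, ra_hdr: bytes) -> bytes:
    return RDP_HDR.pack(seq, dma_addr, len(payload), ra_hdr) + payload

def build_gd(seq: int, dma_addr: int, length: int, sdp_hdr: bytes) -> bytes:
    return GD_HDR.pack(seq, dma_addr, length, sdp_hdr)
```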

Receiving on the stateless side: There are three cases in which the stateless (server) side receives data: the server application proactively READs or RECVs data sent by the client, or the server passively receives data that the client WRITEs. Fig. 3(a) shows the processing procedure on the two sides.

Specifically, when an application on the server initiates a READ/RECV, the driver encapsulates the WQE into a WQEP packet and notifies the NIC to send it out to the client. After receiving the WQEP, the client decapsulates the WQE, stores it in the QP maintained for the server, and immediately generates an explicit acknowledgment (EA). Note that the EA is necessary since WQEPs may get lost (discussed in §3.3). Then, according to the READ/RECV WQE (and also depending on its own SEND WQEs), the client generates a series of data packets (RDPs). Upon receiving RDPs, the server NIC can directly DMA the data to the corresponding memory addresses and generate acknowledgments (RAs) using the information contained in the RDP headers. When all the data of the READ/RECV has been successfully received (tracked by the client through RAs), the client sends a CQEP, which the server can use to notify the application.

It is much simpler for the server to receive data written by the client's WRITEs. In particular, the server only needs to DMA the data and generate RAs according to the RDP headers, which are controlled by the client.
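The sketch below illustrates the stateful client's bookkeeping for one server READ/RECV as described above: store the WQE, acknowledge it, emit RDPs, track the returning RAs, and finally send a CQEP. The class and method names are assumptions for illustration.

```python
# Sketch of the stateful client's bookkeeping for one server READ/RECV.
# Names are illustrative assumptions; send(kind, **fields) hands a packet
# to the client NIC for transmission.
MSS = 1400

class ClientSideReceiveTracker:
    def __init__(self, send):
        self.send = send
        self.pending = {}          # wqe seq -> set of outstanding RDP seqs

    def on_wqep(self, wqe_seq, dma_addr, length, data):
        """Store the server's READ/RECV WQE, ack it, and generate the RDPs."""
        self.send("EA", seq=wqe_seq)
        outstanding = set()
        for off in range(0, length, MSS):
            chunk = data[off:off + MSS]
            rdp_seq = (wqe_seq, off)
            outstanding.add(rdp_seq)
            self.send("RDP", seq=rdp_seq, dma_addr=dma_addr + off, payload=chunk)
        self.pending[wqe_seq] = outstanding

    def on_ra(self, rdp_seq):
        """An RA from the stateless server: once all RDPs are acked, complete."""
        wqe_seq, _ = rdp_seq
        outstanding = self.pending.get(wqe_seq)
        if outstanding is None:
            return
        outstanding.discard(rdp_seq)
        if not outstanding:
            del self.pending[wqe_seq]
            self.send("CQEP", seq=wqe_seq)   # server uses this to notify the app
```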


Figure 3: Procedure of stateless receiving (a) and sending (b).

Since the client tracks the whole transmission state, it can apply any congestion control or loss recovery mechanism.

Sending on the stateless side: Similar to receiving, there are three cases in which the stateless (server) side sends data: the server proactively WRITEs or SENDs data, or the server passively sends data triggered by client READs, as shown in Fig. 3(b).

When an application on the server initiates a WRITE/SEND, the driver encapsulates the WQE into a WQEP packet and notifies the NIC to send it out to the client. After decapsulating the WQE and generating an EA, the client generates several GDs to retrieve the data from the server. Upon receiving GDs, the server NIC can directly fetch the data and encapsulate it into packets (SDPs) using the information contained in the GD headers. The client sends a CQEP to notify the server of the completion of the WRITE/SEND once all the data has been received (by tracking the SDPs). The procedure by which the client READs data from the server is simple and similar to the WRITE case in Fig. 3(a), so we omit the discussion here.

Congestion control. For a StaR stateless sender, all the congestion control calculation is done at the receiver (client) side. StaR can implement various window-based congestion control algorithms (e.g., DCTCP), calculating the congestion window according to the received SDPs. Since the size of each SDP is determined by a GD, the receiver can control the sender's sending rate by adjusting the GDs. Note that rate-based congestion control is not applicable for a StaR stateless sender, as it needs per-connection states for pacing the rate.
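A minimal sketch of the receiver-driven window control described above: the client keeps the congestion window and only emits GDs while the bytes they request fit inside it. The window-update rule here is a plain additive/halving placeholder rather than the DCTCP computation, and all names are assumptions.

```python
# Receiver-driven (client-side) window control sketch: the stateless sender
# only transmits SDPs in response to GDs, so the client paces it by limiting
# the bytes requested by outstanding GDs. The window-growth rule is a plain
# placeholder, not DCTCP; all names here are assumptions.
MSS = 1400

class GdWindowController:
    def __init__(self, send_gd, cwnd_bytes=10 * MSS):
        self.send_gd = send_gd           # send_gd(dma_addr, length)
        self.cwnd = cwnd_bytes           # congestion window kept at the client
        self.inflight = 0                # bytes requested but not yet received
        self.queue = []                  # (dma_addr, length) still to request

    def request(self, dma_addr, length):
        """Queue a server-side region (from a WRITE/SEND WQE) to be pulled."""
        for off in range(0, length, MSS):
            self.queue.append((dma_addr + off, min(MSS, length - off)))
        self._pump()

    def on_sdp(self, length, congested=False):
        """An SDP arrived; update the window and issue more GDs if allowed."""
        self.inflight -= length
        self.cwnd = max(MSS, self.cwnd // 2) if congested else self.cwnd + MSS
        self._pump()

    def _pump(self):
        while self.queue and self.inflight + self.queue[0][1] <= self.cwnd:
            addr, length = self.queue.pop(0)
            self.inflight += length
            self.send_gd(addr, length)
```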

3.2 Security Mechanism

Fig. 4 overviews the mechanism StaR uses to ensure the security of stateless NIC processing. For local operations, the StaR driver checks their security to prevent malicious operations on the NIC. This is typical in device drivers, so we omit the discussion here. We focus on how to prevent malicious behavior triggered by received packets.

Specifically, each NIC is equipped with an extra security check module, and all traffic passes through this module before being sent out. Initially, all StaR traffic (packets with the formats introduced before) is blocked by the security module in case it threatens the stateless NIC. Note that a NIC in stateless mode only reacts to StaR packets, so other packets are safe. Whenever a StaR connection is set up, an entry is inserted into the client's security module indicating the legal packets to the server, e.g., the legal memory access ranges in GDs/RDPs for this connection. Note that we ensure the entries in the security module can only be modified by the stateless server, through control in the local NIC drivers. During the connection lifetime, the entries may also be updated by the stateless server for control plane operations, e.g., registering/deregistering a user buffer.

The above security mechanism gracefully prevents the stateless NIC from disrupting the host or the network. Also, different connections within a NIC are prevented from accessing each other's memory regions. However, as with original RDMA, applications should carefully use their own registered buffers within an RDMA connection to prevent overwriting data or reading outdated data.
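The sketch below mirrors the client-side security module just described as a default-deny match-action table: entries are installed only on behalf of the stateless server, and every outbound GD/RDP is checked against them. The entry layout and names are illustrative assumptions.

```python
# Sketch of the client-side security module: a match-action table consulted
# for every outbound packet, with entries installed only at the stateless
# server's request. Names and the entry layout are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class WhitelistEntry:
    dst_ip: str          # the stateless server this connection talks to
    mem_base: int        # legal DMA range on the server
    mem_len: int

class SecurityModule:
    def __init__(self):
        self.table = {}   # connection id -> WhitelistEntry

    def install(self, conn_id, entry: WhitelistEntry):
        """Only invoked on behalf of the stateless server (e.g., at setup or
        when it registers/deregisters a buffer)."""
        self.table[conn_id] = entry

    def allow_outbound(self, conn_id, dst_ip, dma_addr, length) -> bool:
        """Check an outbound GD/RDP; default-deny if no entry matches."""
        e = self.table.get(conn_id)
        if e is None or dst_ip != e.dst_ip:
            return False
        return e.mem_base <= dma_addr and dma_addr + length <= e.mem_base + e.mem_len

# Toy usage: a GD targeting memory outside the registered range is dropped.
sec = SecurityModule()
sec.install(1, WhitelistEntry("10.0.0.2", 0x1000, 4096))
assert sec.allow_outbound(1, "10.0.0.2", 0x1000, 1400)
assert not sec.allow_outbound(1, "10.0.0.2", 0x3000, 1400)
```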

3.3 Discussions on Performance

No state-fetching bottleneck on data transmission: In StaR, the stateless server receives/sends every data packet based only on the received RDP/GD. Since per-packet stateless processing can be well pipelined and parallelized in NIC hardware, there is no performance bottleneck at any scale. In contrast, in original RDMA, if a received packet belongs to a connection whose state is not on-chip, the processing has to be stalled until the NIC fetches its state from the host. Similarly, when an original RNIC is going to send a packet triggered by a SEND/WRITE WQE, it has to first fetch the corresponding connection state before starting the sending procedure.


Figure 4: Ensuring security for stateless processing in StaR NIC.

Cost of maintaining WQEs at the remote side: This mainly incurs three costs:

1) Extra network delay for transmitting WQEs. As Fig. 3(a) shows, when receiving data from a WRITE, the server need not post any WQE, so there is no extra delay. For READ, StaR also incurs no extra network delay before data transmission, as the READ request and the READ WQE are combined in the same WQEP. For RECV, StaR indeed needs an extra half round-trip time (RTT) to first transmit the WQE to the client. However, the impact on performance is negligible, since typical RDMA applications always pre-post multiple RECV WQEs and then notify the sender to start sending data.

Note that when a READ/RECV is done, applications on StaR need an extra RTT to get notified by CQEPs from the client. However, a network RTT without software involvement in a DCN is negligible (e.g., <8 µs across switches [14]). Moreover, high-performance RDMA applications often proactively poll certain memory addresses to track the transmission status [3], hence this notification delay will not impair such applications' performance.

When sending data, transmitting WQEs incurs an extra RTT. StaR bears this tradeoff for stateless processing so as to improve the throughput under massive concurrent sending. There is no extra delay for application notification, because the original sender RNIC also waits for the receiver's ACK before it knows the sending operation has finished.

2) Processing overhead of generating WQE packets. Besides network delay, generating WQE packets also incurs processing overhead. While in original RDMA WQEs are directly fetched and stored in the NIC, the StaR driver needs to first encapsulate them into WQE packets before passing them to the NIC. However, the WQEs can be pre-posted and well pipelined, which has no impact on application performance (see the results in Fig. 5(b)). Moreover, the processing overhead is small compared with the subsequent data transmission, because WQEPs are small control packets that need no complicated transport logic. Specifically, StaR currently applies no congestion control to WQEPs and simply encapsulates the StaR/UDP/IP/Ethernet headers for them before delivering them to the NIC.

3) Handling WQE loss. WQEs may be lost during transmission. StaR needs to handle this on the stateless side, since the remote stateful side may not even know that any WQEs were posted if they get lost. StaR deals with WQE loss in the driver software so as to keep the NIC processing stateless. In particular, each EA for a WQE is directly passed to the driver by the StaR NIC. As the lossless networks on which RDMA is typically deployed have extremely low loss rates [5], this rarely impacts host performance. Moreover, the EA processing does not delay the data transmission procedure (see Fig. 3).
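To illustrate the driver-side handling just described, here is a small sketch in which the stateless side's driver keeps each WQEP until its EA arrives and retransmits it on timeout; the interfaces and the timeout value are assumptions, and real loss recovery would be tuned to the deployment.

```python
# Sketch of WQE-loss handling in the stateless side's driver: keep each WQEP
# until its EA arrives, and retransmit on timeout. Interfaces and the timeout
# value are illustrative assumptions.
import time

class WqepRetransmitter:
    def __init__(self, send_wqep, timeout_s=0.001):
        self.send_wqep = send_wqep      # hands the WQEP to the NIC for transmission
        self.timeout_s = timeout_s
        self.unacked = {}               # seq -> (wqep bytes, last send time)

    def post(self, seq, wqep: bytes):
        self.send_wqep(wqep)
        self.unacked[seq] = (wqep, time.monotonic())

    def on_ea(self, seq):
        """The NIC passes every EA straight to the driver; drop the saved copy."""
        self.unacked.pop(seq, None)

    def tick(self):
        """Called periodically by the driver; resend WQEPs whose EA is overdue."""
        now = time.monotonic()
        for seq, (wqep, sent_at) in list(self.unacked.items()):
            if now - sent_at > self.timeout_s:
                self.send_wqep(wqep)
                self.unacked[seq] = (wqep, now)
```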

Auto adaption between stateless and stateful modes: To further reduce the costs discussed above, each StaR NIC can be implemented with two processing logics, i.e., one for original RDMA connections and one for stateless connections. StaR automatically chooses the original RDMA mode for a newly created connection if the current number of concurrent connections does not exceed the on-chip memory limit. For additional connections, it turns to the stateless mode once the on-chip memory has been used up. We note that automatic mode transition for active connections may offer better performance as connections come and go, which we plan to study further in future work.
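A minimal sketch of the per-connection mode choice described above: use the original (stateful) RDMA mode while on-chip state still fits, otherwise fall back to stateless mode. The 300-connection budget echoes the paper's measurement and simulation setting; the structure itself is an illustrative assumption.

```python
# Sketch of the per-connection mode choice: stateful (original RDMA) mode
# while on-chip state still fits, otherwise stateless StaR mode. The
# 300-connection budget follows the paper's measurement/simulation setting.
ONCHIP_CONN_BUDGET = 300

class ModeSelector:
    def __init__(self, budget=ONCHIP_CONN_BUDGET):
        self.budget = budget
        self.stateful_conns = set()

    def open_connection(self, conn_id) -> str:
        if len(self.stateful_conns) < self.budget:
            self.stateful_conns.add(conn_id)
            return "stateful"        # states kept on the local NIC as usual
        return "stateless"           # states pushed to the peer, StaR mode

    def close_connection(self, conn_id):
        self.stateful_conns.discard(conn_id)
```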

4 EVALUATION

Our evaluation tries to answer the following questions: 1) How does StaR perform under a dedicated stress test (§4.1)? 2) How does StaR improve the performance of upper-layer applications (§4.2)?

Simulation settings: We evaluate StaR and the original RNIC (denoted as RDMA) in the NS3 simulator. A server and multiple clients are directly connected to a 100Gbps switch, with a 12µs base network RTT between them. The MSS is 1.4KB. The PCIe latency is set to 1µs, both for the RNIC to access states stored in host memory and for the host to deliver a WQE to the RNIC. For RDMA, the RNIC's on-chip memory can store up to 300 connections' states (based on Fig. 1). Since our evaluation targets the RNIC scalability problem caused by on-chip states, to avoid the interference of congestion control and loss recovery, we set the switch queue to be large enough to avoid PFC, ECN, or packet loss. We do not turn on StaR's mode auto-adaption capability, so as to better evaluate its overhead.

Figure 5: Aggregate throughput (MOPS) at the server under the stress test (a) and the RPC application (b). (Axes: rate in millions/s vs. number of connections; curves: StaR and RDMA.)

4.1 Stress test

We simulate the same scenario as our testbed measurements in Fig. 1. Various numbers of clients concurrently WRITE to one server. All WRITEs from one client go through one long-lived RDMA connection. In each connection, there is only one outstanding 8-byte WRITE at any moment, and the client immediately initiates another 8-byte WRITE after the former one has finished. In StaR, the server is the stateless side.

Results: As shown in Fig. 5(a), as the number of concurrent connections grows, the aggregate WRITE rate of RDMA first grows to about 23M operations per second (OPS) and then quickly drops to about 1M OPS due to connection state misses. In contrast, the overall WRITE rate of StaR scales well as the number of connections grows, reaching the 100Gbps bandwidth limit (about 166M OPS). Note that due to the RTT and PCIe latency settings, the absolute results in this simulation differ from those on the testbed (Fig. 1), but the relative trends are the same.

4.2 Application throughput

We simulate a remote procedure call (RPC) application, where multiple clients concurrently call a remote procedure on one server. Each client sends a 2.8KB RPC request to the server (through SEND/RECV), and the server responds with a 1.4KB RPC result (through SEND/RECV) after receiving the request (assuming zero RPC computation time). The client immediately repeats the RPC call after the former one completes. All RPCs between a client and the server use one long-lived RDMA connection. We assume all the RECV WQEs are pre-posted, so they do not affect the RPC performance. In StaR, the server is the stateless side.

Results: Fig. 5(b) shows the overall number of RPCs finished per second at the server. StaR always keeps the maximum RPC rate of 3.8M OPS (limited by the network bandwidth) as the number of concurrent connections grows beyond 280. In contrast, the RPC rate of RDMA drops to about 0.9M OPS due to state misses on the NIC. Note that StaR has a lower RPC rate than RDMA when the concurrency is small, because each SEND in StaR requires an extra RTT to transmit the WQE first. However, this problem can be solved by the auto adaption scheme discussed in §3.3. The maximum RPC rate of StaR is slightly lower (∼2%) than the maximum of RDMA (for 160 to 300 connections) due to the overhead of extra headers and control packets.
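As a rough sanity check of the bandwidth-limited ceiling, the back-of-envelope below divides the server's 100Gbps receive bandwidth by the bytes each RPC places on that direction (the 2.8KB request split into two MSS-sized packets plus an assumed per-packet overhead); the overhead figure is an assumption, so the result only indicates the order of magnitude.

```python
# Back-of-envelope check of the bandwidth-limited RPC ceiling. The per-packet
# overhead below is an assumed figure (headers plus StaR control fields), so
# this only sanity-checks the order of magnitude of the reported 3.8M OPS.
LINK_BPS = 100e9          # 100Gbps toward the server
MSS = 1400                # bytes, as in the simulation settings
REQUEST = 2800            # bytes per RPC request (two MSS-sized packets)
PER_PKT_OVERHEAD = 100    # assumed header/control bytes per packet

packets_per_rpc = REQUEST // MSS
bytes_per_rpc = REQUEST + packets_per_rpc * PER_PKT_OVERHEAD
rpc_ceiling = LINK_BPS / 8 / bytes_per_rpc
print(f"~{rpc_ceiling / 1e6:.1f}M RPC/s upper bound")   # ~4.2M, same order as 3.8M
```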

5 RELATED WORK

StaR shares the idea of moving connection states to the other side with Trickles [15]. However, Trickles only focuses on TCP states in a software network stack, and this TCP-only solution is not fully stateless, since the stateless side still needs to calculate the sending rate according to the connection's TCP congestion control information. In contrast, StaR is the first stateless NIC for RDMA; it maintains all the necessary memory-access-related and network-related states on the other side and carries them in packet headers, making the server-side RNIC completely stateless.

[16] builds an HTTP service based on "stateless" TCP, which is actually a TCP interface working over UDP sockets. The application has to do a fair amount of network processing, such as congestion control and loss recovery, which contradicts RDMA's design rationale.

6 CONCLUSION

This paper presents StaR. To the best of our knowledge, it takes the first step towards fundamentally solving the RNIC's scalability problem, by moving all the connection states to the client side. As such, a StaR server can maintain zero connection-related states while all the RDMA data plane processing is still done by NIC hardware. Preliminary simulations show StaR's potential to significantly improve RNIC performance under large concurrency, which may bring performance and architecture revolutions to upper-layer large-scale RDMA applications.

7 ACKNOWLEDGMENTS

We thank the APNet reviewers for their helpful comments on improving this paper. We thank Yongqiang Xiong and Peng Cheng for valuable discussions on the idea of making one side's NIC stateless. This work was partly supported by the National Natural Science Foundation of China under grant 6187060280, the Fundamental Research Funds for the Central Universities of China, and the Huawei Innovation Research Program.


REFERENCES

[1] Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. RDMA over Commodity Ethernet at Scale. In Proceedings of the ACM SIGCOMM 2016 Conference, pages 202–215. ACM, 2016.
[2] Supplement to InfiniBand Architecture Specification Volume 1 Release 1.2.2, Annex A17: RoCEv2 (IP Routable RoCE). InfiniBand Trade Association, 2012.
[3] Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM: Fast Remote Memory. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pages 401–414, 2014.
[4] Anuj Kalia, Michael Kaminsky, and David G. Andersen. Design Guidelines for High Performance RDMA Systems. In 2016 USENIX Annual Technical Conference (USENIX ATC 16), 2016.
[5] Anuj Kalia, Michael Kaminsky, and David G. Andersen. FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 185–201, GA, 2016. USENIX Association.
[6] Yuanwei Lu, Guo Chen, Bojie Li, Kun Tan, Yongqiang Xiong, Peng Cheng, Jiansong Zhang, Enhong Chen, and Thomas Moscibroda. Multi-Path Transport for RDMA in Datacenters. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 357–371, 2018.
[7] Youmin Chen, Youyou Lu, and Jiwu Shu. Scalable RDMA RPC on Reliable Connection with Efficient Resource Sharing. In Proceedings of the Fourteenth EuroSys Conference 2019 (EuroSys '19), pages 19:1–19:14, New York, NY, USA, 2019. ACM.
[8] Open MPI: Open Source High Performance Computing, 2019. https://www.open-mpi.org/.
[9] Anuj Kalia, Michael Kaminsky, and David G. Andersen. Datacenter RPCs can be General and Fast. In USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 1–16, 2019.
[10] Virajith Jalaparti, Peter Bodik, Srikanth Kandula, Ishai Menache, Mikhail Rybalkin, and Chenyu Yan. Speeding Up Distributed Request-Response Workflows. In Proceedings of the ACM SIGCOMM 2013 Conference (SIGCOMM '13), pages 219–230, New York, NY, USA, 2013. ACM.
[11] Rajesh Nishtala, Hans Fugal, Steven M. Grimm, Marc P. Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, et al. Scaling Memcache at Facebook. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pages 385–398, 2013.
[12] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, 2014.
[13] Linux Cross Reference. linux/include/linux/mlx4/qp.h. http://lxr.free-electrons.com/source/include/linux/mlx4/qp.h.
[14] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, et al. A Cloud-Scale Acceleration Architecture. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–13. IEEE, 2016.
[15] Alan Shieh, Andrew C. Myers, and Emin Gün Sirer. Trickles: A Stateless Network Stack for Improved Scalability, Resilience, and Flexibility. In Proceedings of the 2nd Symposium on Networked Systems Design and Implementation (NSDI), pages 175–188, 2005.
[16] David A. Hayes, Michael Welzl, Grenville Armitage, and Mattia Rossi. Improving HTTP Performance Using "Stateless" TCP. In Proceedings of the 21st International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV), pages 57–62. ACM, 2011.
