R2D2 Reliable and Rapid Data Delivery for DCs

Berk Atikoglu, Mohammad Alizadeh, Tom Yue, Balaji Prabhakar, Mendel Rosenblum

Motivation

Unreliable packet delivery due to
  ▪ Corruption
      ▪ Dealt with via retransmission
  ▪ Congestion
      ▪ Particularly bad due to incast or fan-in congestion

These losses increase the difficulty of reliable transmission
  ▪ Loss of throughput
  ▪ Increase in flow transfer times

Internal Use

Does it increase difficulty? I suggest saying something like decreases efficiency or reduces performance.

Incast

The client sends a request to several servers.

The responses travel to the switch simultaneously.

The switch buffer overflows from the amount of data. Some packets are dropped.
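The overflow can be illustrated with a toy model. This is only a sketch under simplifying assumptions that are not in the slides: time is slotted, every server delivers one packet to the switch per slot, and the port toward the client forwards one packet per slot. The function name and parameters are hypothetical.

```python
def simulate_incast(num_servers, packets_per_server, buffer_packets):
    """Toy model: return (delivered, dropped) packet counts at the switch port."""
    queued = delivered = dropped = 0
    for _ in range(packets_per_server):
        # One packet from every server reaches the switch in the same time slot.
        for _ in range(num_servers):
            if queued < buffer_packets:
                queued += 1
            else:
                dropped += 1  # buffer is full, so the packet is lost
        # The port toward the client forwards one packet per time slot.
        if queued:
            queued -= 1
            delivered += 1
    # After the burst ends, the packets still buffered drain to the client.
    delivered += queued
    return delivered, dropped


# Drops grow quickly once several servers respond at the same time.
for n in (1, 2, 8, 32):
    print(n, simulate_incast(n, packets_per_server=10, buffer_packets=4))
```

With a single server the buffer never fills; with many servers most of the burst is dropped, which is the incast pattern described above.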

[Figure: incast. A client (C) is connected through a switch to a row of servers (S); labels 1, 2, and 3 mark successive steps of the exchange.]

Existing Approaches

High-resolution timers
  ▪ Reduce retransmission timeouts (RTO) to hundreds of µs
      ▪ Proposed in Vasudevan et al. (SIGCOMM 2009); see also Chen et al. (WREN 2009)
  ▪ Large number of CPU cycles spent on rapid interrupts or timer programming
  ▪ In virtualized environments, the high cost of processing hardware interrupts means even higher overhead

Large switch buffers
  ▪ Reduce incast occurrences by caching enough packets
  ▪ Increased packet latency
  ▪ Complex implementation
  ▪ Large caches are expensive
  ▪ Increased power usage

Our Approach: R2D2

R2D2: collapse all flows into a single “meta-flow”
  ▪ Single wait queue holds packets sent by the host that are not yet ACKed
  ▪ Single retransmission timer, no per-flow state
  ▪ Provides reliable packet delivery
  ▪ Resides in Layer 2.5, a shim layer between Layer 2 and Layer 3

Key observation: exploit the uniformity of data center environments
  ▪ Path lengths between hosts are small (3 – 5 hops)
  ▪ RTTs are small (100 – 400 µs)
  ▪ Path bandwidths are uniformly high (1 Gbps, 10 Gbps)
  ▪ Therefore, the amount of data “in flight” from a 1G/10G source is less than 64/640 KB
  ▪ Store source packets in R2D2 on the fly, rapidly retransmit dropped or corrupted packets
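The in-flight bound follows from the bandwidth-delay product (bandwidth × RTT). A minimal sketch of the arithmetic, assuming the worst-case RTT of 400 µs from the list above and 1 KB = 1000 bytes; the function and constant names are illustrative, not from the slides.

```python
def in_flight_bytes(bandwidth_bps, rtt_us):
    """Bandwidth-delay product in bytes, using integer arithmetic."""
    return bandwidth_bps * rtt_us // (8 * 1_000_000)


WORST_CASE_RTT_US = 400  # upper end of the 100 to 400 µs RTT range

for name, bandwidth_bps in (("1G", 1_000_000_000), ("10G", 10_000_000_000)):
    kb = in_flight_bytes(bandwidth_bps, WORST_CASE_RTT_US) / 1000
    print(f"{name}: {kb:.0f} KB in flight")
```

This gives 50 KB at 1 Gbps and 500 KB at 10 Gbps, both under the 64/640 KB figures, so a small buffer per host is enough to hold every un-ACKed packet.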

Internal Use

We have per-flow state actually. I'm not sure if we explicitly say this.

Tom Yue

Should this be replaced with "Provides a TCP keep-alive function"?

Internal Use

Right. I also think we should use something other than providing reliable packet delivery. Keep alive sounds good to me.

[Figure: protocol stack diagrams. The first shows TCP over L3 and L2 at each end; the second adds an L2.5 (R2D2) layer at each end.]

[Figure: R2D2 sender, showing Layer 3, Layer 2.5 (R2D2), and Layer 2.]

When a flow times out:
  ▪ Retransmit the first un-ACKed packet (fill the hole).
  ▪ Back-off: double the flow’s timeout value.

When an ACK comes in:
  ▪ Reset the timeout back-off.

  ▪ Outbound packet is intercepted by R2D2.
  ▪ A timer is started.
  ▪ A copy of the packet is placed in the wait queue.
  ▪ The returned TCP ACK removes all ACKed packets held in the wait queue.
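The sender behaviour can be sketched as a small user-space model. This is not the kernel module: the class, field, and callback names are hypothetical, time is passed in explicitly, and the 3 ms minimum timeout and limit of 10 retransmissions are the values used in the tests. All packets share one queue, while the timeout back-off is tracked per flow, as the timeout rules describe.

```python
from dataclasses import dataclass

MIN_TIMEOUT = 0.003        # seconds; the 3 ms minimum timeout used in the tests
MAX_RETRANSMISSIONS = 10   # retransmissions before giving up on a flow


@dataclass
class HeldPacket:
    flow: str      # hypothetical flow identifier
    end_seq: int   # sequence number just past the packet's last byte
    data: bytes    # copy of the outbound packet


class WaitQueue:
    def __init__(self, retransmit):
        self.retransmit = retransmit  # callback(flow, data) that resends a packet
        self.held = []                # one queue shared by all flows
        self.timeout = {}             # per-flow timeout, doubled on back-off
        self.deadline = {}            # per-flow time at which the timer fires
        self.retries = {}             # per-flow retransmissions since last ACK

    def on_outbound(self, flow, end_seq, data, now):
        """Keep a copy of an intercepted packet and start the flow's timer."""
        self.held.append(HeldPacket(flow, end_seq, data))
        self.timeout.setdefault(flow, MIN_TIMEOUT)
        self.retries.setdefault(flow, 0)
        self.deadline.setdefault(flow, now + self.timeout[flow])

    def on_ack(self, flow, ack_seq, now):
        """Drop every held packet the ACK covers and reset the back-off."""
        self.held = [p for p in self.held
                     if p.flow != flow or p.end_seq > ack_seq]
        self.timeout[flow] = MIN_TIMEOUT
        self.retries[flow] = 0
        if any(p.flow == flow for p in self.held):
            self.deadline[flow] = now + MIN_TIMEOUT
        else:
            self.deadline.pop(flow, None)

    def on_tick(self, now):
        """Handle every flow whose timer has expired."""
        for flow in [f for f, t in self.deadline.items() if now >= t]:
            first = next((p for p in self.held if p.flow == flow), None)
            if first is None or self.retries[flow] >= MAX_RETRANSMISSIONS:
                # Nothing left to protect, or too many attempts: give up.
                self.held = [p for p in self.held if p.flow != flow]
                del self.deadline[flow]
                self.timeout.pop(flow, None)
                self.retries.pop(flow, None)
                continue
            self.retransmit(flow, first.data)   # fill the hole
            self.retries[flow] += 1
            self.timeout[flow] *= 2             # back-off
            self.deadline[flow] = now + self.timeout[flow]
```

A caller would invoke `on_outbound` for each intercepted packet, `on_ack` for each returning TCP ACK, and `on_tick` periodically. Giving up after the retry limit is what makes delivery reliable but not guaranteed.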

Features

Reliable, but not guaranteed, delivery
  ▪ Maximum number of retransmissions before giving up

State-sharing
  ▪ Only one wait queue; all packets go in the same queue

No change to network stack
  ▪ Kernel module in Linux; driver in Windows
  ▪ Hardware version is OS-independent

Incremental deployability
  ▪ Possible to protect a subset of flows

Tom Yue

Avoids TCP timeouts, allowing rapid progress.

Implementation

Implemented as a Linux kernel module on kernel 2.6.*
  ▪ No need to modify the kernel
  ▪ Can be loaded/unloaded easily

Incoming/outgoing TCP/IP packets are captured using Netfilter

Captured packets are put into a queue
  ▪ Only metadata is kept in the queue; the packet is cloned

L2.5 thread processes the packets in the queue periodically

Internal Use

2.6.* is too general. We can say 2.6.28.10 and onwards.

Test Setup

48 Dell PowerEdge 2950 servers
  ▪ Intel Core 2 Quad Q9550 × 2
  ▪ 16GB ECC DRAM
  ▪ Broadcom NetXtreme II 5708 1GbE NIC
  ▪ CentOS 5.3 Final; Linux 2.6.28-10

Switches
  ▪ Netgear GS748TNA (48 ports, GbE)
  ▪ Cisco Catalyst 4948 (48 ports, GbE)
  ▪ BNT RackSwitch G8421 (24 ports, 10GbE)

[Figure: test topology. 1 rack, 48 servers, 1GbE / 10GbE.]

Algorithms

R2D2
  ▪ Minimum timeout: 3ms
  ▪ Max retransmissions: 10
  ▪ Delayed ACK disabled

TCP: CUBIC
  ▪ TCP minRTO: 200ms
  ▪ Segmentation offloading: disabled
  ▪ TCP timestamps: disabled
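To see what these settings imply, the retransmission schedule can be worked out by hand. A rough sketch, assuming the timeout doubles after every retransmission with no upper cap and that no ACK arrives in between (neither detail is stated on the slide); the function name is illustrative.

```python
def retransmit_times_ms(min_timeout_ms=3, max_retransmissions=10):
    """Elapsed time (ms) at which each successive retransmission is sent."""
    times, elapsed, timeout = [], 0, min_timeout_ms
    for _ in range(max_retransmissions):
        elapsed += timeout      # wait for the current timeout to expire
        times.append(elapsed)
        timeout *= 2            # back-off: double the timeout
    return times


times = retransmit_times_ms()
print(times)
# Retransmissions that fit in before a single 200 ms TCP minRTO would fire:
print(len([t for t in times if t < 200]))
```

Under these assumptions R2D2 retransmits at 3, 9, 21, 45, 93, and 189 ms, so it can make six attempts before TCP's 200 ms minimum RTO expires once; the tenth attempt goes out after about 3.07 s.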

Workload – 1 GbE switches

Number of servers (N): 1, 2, 4, 8, 16, 32, 46

File size (S): 1MB, 20MB

Client:
  ▪ Requests (S/N) MB from each server
  ▪ Issues a new request when all servers respond

Measurements:
  ▪ Goodput
  ▪ Retransmission ratio = retransmitted packets / total packets sent by TCP
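The workload parameters and the two measurements reduce to a few lines of arithmetic. A minimal sketch with hypothetical function names; the goodput definition used here (application bytes received divided by elapsed time) is the common one and is an assumption, since the slide does not define it.

```python
def request_size_mb(file_size_mb, num_servers):
    """Each server is asked for S/N MB of the file."""
    return file_size_mb / num_servers


def goodput_mbps(app_bytes_received, elapsed_seconds):
    """Application-level throughput in megabits per second (assumed definition)."""
    return app_bytes_received * 8 / elapsed_seconds / 1e6


def retransmission_ratio(retransmitted_packets, total_packets_sent):
    """Retransmitted packets divided by total packets sent by TCP."""
    return retransmitted_packets / total_packets_sent


# Example: a 20MB file striped over 16 servers.
print(request_size_mb(20, 16))          # MB requested from each server
print(goodput_mbps(20_000_000, 0.25))   # hypothetical transfer taking 0.25 s
print(retransmission_ratio(5, 1000))    # hypothetical packet counts
```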

Netgear Test – Goodput

[Figure: two panels (1MB, 20MB) plotting goodput (Mbps, 0 to 1000) against number of servers (1, 2, 4, 8, 16, 32, 46) for R2D2 and TCP.]

Netgear Test – Retransmission Ratio

[Figure: two panels (1MB, 20MB) plotting retransmission ratio against number of servers (1, 2, 4, 8, 16, 32, 46).]

Netgear Test – Multiple Clients

  ▪ 6 clients (instead of 1 client)
  ▪ 32 servers
  ▪ Each client requests a file from each of the 32 servers

[Figure: two panels (1MB, 20MB) showing goodput (Mbps, 0 to 1000) for the R2D2 and TCP tests.]

Catalyst 4948 Test – Goodput

[Figure: two panels (1MB, 20MB) plotting goodput (Mbps, 0 to 1000) against number of servers (1, 2, 4, 8, 16, 32, 46) for R2D2 and TCP.]

Catalyst 4948 Test – Retransmission Ratio

[Figure: two panels (1MB, 20MB) plotting retransmission ratio against number of servers (1, 2, 4, 8, 16, 32, 46).]

Catalyst 4948 Test – Multiple Clients

[Figure: two panels (1MB, 20MB) showing goodput (Mbps, 0 to 1000) for the R2D2 and TCP tests, with series labeled 1 to 6.]

10GbE test – Goodput

  ▪ File size: 10MB
  ▪ Number of servers: 1, 5, 9, 13, 17, 21

[Figure: goodput (Mbps, 0 to 9000) against number of servers for R2D2 and TCP, and retransmission ratio (0 to 0.002) against number of servers.]

Conclusion

R2D2 is scalable and fast, and provides reliable delivery
  ▪ No need to modify the kernel
  ▪ Can be loaded/unloaded easily
  ▪ Improves reliability in data center networks

Hardware implementation in the NIC can be much faster
  ▪ Can work well with TCP offload options like segmentation and checksum offloading
  ▪ Developing an FPGA implementation

Tom Yue

same comment

Internal Use

same comment

Internal Use

we can say helps reliable delivery
