mSwitch: A Highly-Scalable, Modular Software Switch


Transcript of mSwitch: A Highly-Scalable, Modular Software Switch

Page 1: mSwitch: A Highly-Scalable, Modular Software Switch

mSwitch: A Highly-Scalable, Modular Software Switch

Michio Honda (NetApp)*
Felipe Huici (NEC), Giuseppe Lettieri and Luigi Rizzo (Università di Pisa)
ACM SOSR'15, June 17

* this work was mostly done at NEC

Page 2: mSwitch: A Highly-Scalable, Modular Software Switch

Motivation

• Software switches are important
  – Interconnection between VMs/containers and NICs
  – Middleboxes, SDN, NFV

• Requirements
  – Throughput (e.g., 10 Gbps)
  – Scalability (e.g., 100 ports)
  – Flexibility (e.g., forwarding decision, packet modification)
  – CPU efficiency (e.g., allocate as many CPU resources to VMs as possible)

Are existing software switches able to meet these requirements?

[Diagram: a software switch inside a host, connecting VMs to NICs]

Page 3: mSwitch: A Highly-Scalable, Modular Software Switch

Existing Software Switches

• OS-standard ones don't provide high throughput

[Figure: throughput (Gbps) vs. packet size (bytes, excluding CRC) for the FreeBSD bridge, Linux bridge, Open vSwitch, DPDK vSwitch and VALE]

• High-throughput ones lack port scalability and/or flexibility

while forwarding packets at high rates using DPDK vSwitch or VALE (the other switches do not yield high throughput or are not publicly available). In terms of CPU usage, the fundamental feature of DPDK vSwitch, and indeed of any DPDK-based package, is that DPDK's poll-mode driver results in 100% utilization irrespective of the traffic rates being processed. In contrast, VALE relies on interrupts, so that user processes are woken up only on packet arrival. In our experiments, for the 10 CPU cores handling packets this results in a cumulative CPU utilization of about 140% for mSwitch, which also adopts an interrupt-based model, and a much higher but expected 1,000% for DPDK vSwitch (the full results for these experiments are in section 4.2).

High Density: Despite its high throughput, VALE, as we will show in section 4, scales poorly when packets are forwarded to an increasing number of ports, and the throughput further drops when packets from multiple senders are sent to a common destination port; both of these are common scenarios for a back-end virtualization switch containing a single NIC and multiple VMs.

For DPDK vSwitch, its requirement of having a core dedicated to each port limits its density. While it is possible to have around 62-78 or so cores on a system (e.g., 4 AMD CPU packages with 16 cores each, minus a couple of cores for the control daemon and operating system, or 4 Intel 10-core CPUs with hyper-threading enabled), that type of hardware represents an expensive proposition, and ultimately it may not make sense to have to add a CPU core just to be able to connect an additional VM or process to the switch. Finally, CuckooSwitch targets physical NICs (i.e., no virtual ports), so the experiments presented in that paper are limited to 8 ports total.

Flexibility: Most of the software switches currently available do not expressly target a flexible forwarding plane, limiting themselves to L2 forwarding. This is the case for the standard FreeBSD and Linux bridges, but also for newer systems such as VALE and CuckooSwitch. Instead, Open vSwitch supports the OpenFlow protocol, and as such provides the ability to match packets against a fairly comprehensive number of packet headers, and to apply actions to matching packets. However, as shown in figure 1 and in [19], Open vSwitch does not yield high throughput.

                 Throughput   CPU Usage   Density   Flexibility
FreeBSD switch       ✗            ✓           ✓           ✗
Linux switch         ✗            ✓           ✓           ✗
Open vSwitch         ✗            ✓           ✓           ✓
Hyper-Switch         ✗            ✓           ✗           ✓
DPDK vSwitch         ✓            ✗           ✗           ✓
CuckooSwitch         ✓            ✗           ✗           ✗
VALE                 ✓            ✓           ✗           ✗

Table 1. Characteristics of software switches with respect to throughput, CPU usage, port density and flexibility.

DPDK vSwitch takes the Open vSwitch code base and accelerates it through the use of the DPDK packet framework. DPDK itself introduces a completely different, non-POSIX programming environment, making it difficult to adapt existing code to it. For DPDK vSwitch, this means that every Open vSwitch code release must be manually adapted to work within the DPDK vSwitch framework. In contrast, in section 5 we show how using mSwitch and applying a few, one-time code changes to Open vSwitch results in a 2.6-3 times performance boost.

Summary: Table 1 summarizes the characteristics of each of the currently available software switches with respect to the stated requirements; none of them simultaneously meet them.

3. mSwitch Design

Towards our goal of implementing a software switch with high throughput, reasonable CPU utilization, high port density and a flexible data plane, and taking into consideration the analysis of the problem space in the previous section, we can start to see a number of design principles.

First, in terms of throughput, there is no need to re-invent the wheel: several existing switches yield excellent performance, and we can leverage the techniques they use such as packet batching [2, 6, 17, 18], lightweight packet representation [6, 9, 17] and optimized memory copies [9, 16, 18] to achieve this.

In addition, to obtain relatively low CPU utilization and flexible core assignment we should opt for an interrupt-based model, such that idle ports do not unnecessarily consume cycles that can be better spent by active processes or VMs. This is crucial if the switch is to act as a back-end, and has the added benefit of reducing the system's overall power consumption.

Further, we should design a forwarding algorithm that is lightweight and that, ideally, scales linearly with the number of ports on the switch; this would allow us to reach higher port densities than current software switches are capable of. Moreover, for a back-end switch muxing packets from a large number of sending virtual ports to a common destination port (e.g., a NIC), it is imperative that the forwarding algorithm is able to handle this incast problem efficiently.

Finally, the switch's data plane should be programmable while ensuring that this mechanism does not harm the system's ability to quickly switch packets between ports. This points towards a split between highly optimized switch code in charge of switching packets, and user-provided code to decide destination ports and potentially modify or filter packets.
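To make this split concrete, the following is a minimal C sketch of what such an interface could look like; the structure, constants and function names are illustrative assumptions, not mSwitch's actual in-kernel API (a real module example appears later in the slides).

#include <stdint.h>

/* Illustrative sentinels for the switching-logic return value (assumed names). */
#define DST_DROP       0xfffffffeU   /* discard the packet       */
#define DST_BROADCAST  0xffffffffU   /* copy to every other port */

/*
 * Switching logic: user-provided module code. It only decides where a
 * packet should go (and may inspect or modify it); it never touches
 * queues, locks or NIC registers.
 */
struct switch_logic_ops {
    uint32_t (*lookup)(uint8_t *pkt, uint32_t len, void *priv);
    void *priv;   /* module-private state (e.g., a MAC table) */
};

/*
 * Switching fabric: optimized, module-independent code. For each packet
 * in a batch it asks the logic for a destination, then moves the packet
 * between ports.
 */
static void
fabric_process_batch(const struct switch_logic_ops *ops,
                     uint8_t **pkts, uint32_t *lens, int n)
{
    for (int i = 0; i < n; i++) {
        uint32_t dst = ops->lookup(pkts[i], lens[i], ops->priv);

        if (dst == DST_DROP)
            continue;
        /* ... enqueue pkts[i] to port 'dst', or to all ports on DST_BROADCAST ... */
    }
}

The point of the split is that the per-packet decision code can be swapped out without touching the performance-critical forwarding path.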

3.1 Starting Point

Having identified a set of design principles, the next question is whether we should base a solution on one of the existing switches previously mentioned, or start from scratch. The Linux, FreeBSD and Open vSwitch switches are non-starters since they are not able to process packets with high

Page 4: mSwitch: A Highly-Scalable, Modular Software Switch

mSwitch Design Decisions

• Separation into fabric and logic
  • Fabric: switches packets between ports
  • Logic: modular forwarding decisions

[Diagram: the switching fabric and modular switching logic run in the kernel, connecting NICs, the OS stack (apps using the socket API), and virtual ports (apps/VMs attached via the netmap API)]

• Interrupt model
  • Efficient, flexible CPU utilization
• Runs in the kernel
  • To efficiently handle interrupts
  • Integration with OS subsystems (network stack, device drivers, etc.)

• Separate, per-port packet buffers
  • Isolation
  • Copying is inexpensive anyway

Page 5: mSwitch: A Highly-Scalable, Modular Software Switch

Scalable Packet Switching Algorithms

• Input queue: group packets for each destination port before forwarding
  • For a batch of input packets, lock each destination port and access its device register only once
• Output queue: reserve destination buffers w/ lock and copy packets w/o lock
  • Concurrent senders can perform the copy in parallel (see the sketch below)

[Diagram: two senders (sender1, sender2) concurrently copying packets into a shared output queue]
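To make the output-queue idea concrete, below is a minimal C sketch, not mSwitch's actual code: the queue layout, lock type and function names are assumptions. A sender holds the destination port's lock only long enough to reserve a range of slots, then copies its packets into those slots with the lock released, so concurrent senders copy in parallel.

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE 2048
#define NSLOTS    1024

/* Hypothetical output queue of a destination port. */
struct out_queue {
    pthread_mutex_t lock;                 /* protects 'tail' only              */
    uint32_t        tail;                 /* next free slot index              */
    uint8_t         buf[NSLOTS][SLOT_SIZE];
    uint32_t        len[NSLOTS];
};

/* Forward a batch of packets destined to the same output port. */
static void
send_batch(struct out_queue *q, uint8_t **pkts, uint32_t *lens, uint32_t n)
{
    /* 1. Short critical section: reserve n slots (free-space accounting omitted). */
    pthread_mutex_lock(&q->lock);
    uint32_t first = q->tail;
    q->tail = (q->tail + n) % NSLOTS;
    pthread_mutex_unlock(&q->lock);

    /* 2. Copy without holding the lock: other senders may copy concurrently. */
    for (uint32_t i = 0; i < n; i++) {
        uint32_t s = (first + i) % NSLOTS;
        uint32_t l = lens[i] < SLOT_SIZE ? lens[i] : SLOT_SIZE;

        memcpy(q->buf[s], pkts[i], l);
        q->len[s] = l;
    }
    /* 3. (Not shown) notify the destination once the copies are complete. */
}

Combined with the input-queue grouping above, each destination port's lock and device registers are touched only once per batch of input packets.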

Page 6: mSwitch: A Highly-Scalable, Modular Software Switch

Modular Switching Logic

• Switching logic is implemented as separate kernel modules that provide a lookup function:
  • The return value indicates a destination switch port index, drop or broadcast
  • L2 learning is the default, but the module can be changed at any time while the switch is running

Page 7: mSwitch: A Highly-Scalable, Modular Software Switch

A Full mSwitch Module

u_int
my_lookup(u_char *buf, const struct net_device *dev)
{
        struct ether_hdr *eh;

        eh = (struct ether_hdr *)buf;
        /* least significant byte */
        return eh->ether_dst[0];
}
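The other return values mentioned on the previous slide (drop and broadcast) can be illustrated with a slightly extended sketch; the sentinel names below follow netmap/VALE conventions but should be read as assumptions rather than the exact mSwitch API.

/*
 * Sketch only (not shipped code): broadcast frames sent to the all-ones
 * MAC, drop other multicast frames, and forward unicast frames based on
 * the low byte of the destination MAC. NM_BDG_BROADCAST / NM_BDG_NOPORT
 * are assumed sentinel values in the style of netmap/VALE.
 */
u_int
my_lookup2(u_char *buf, const struct net_device *dev)
{
        struct ether_hdr *eh = (struct ether_hdr *)buf;

        if (eh->ether_dst[0] == 0xff)      /* broadcast MAC (simplified check)   */
                return NM_BDG_BROADCAST;
        if (eh->ether_dst[0] & 0x01)       /* other multicast: drop              */
                return NM_BDG_NOPORT;
        return eh->ether_dst[0];           /* unicast: low byte selects the port */
}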

Page 8: mSwitch: A Highly-Scalable, Modular Software Switch

CPU Utilization

• mSwitch efficiently utilizes CPUs

[Figure: throughput (Gbps) and cumulative CPU utilization (%) vs. number of destination virtual ports (= number of CPU cores - 1), for mSwitch and DPDK vSwitch. Setup: a NIC forwarding to apps on virtual ports, with one CPU core for the NIC and one per virtual port]

Page 9: mSwitch: A Highly-Scalable, Modular Software Switch

Port Scalability

• mSwitch scales to many ports

[Figure: throughput (Gbps) and CPU utilization (%) vs. number of destination virtual ports (up to 120), for mSwitch and VALE (throughput, NIC-side CPU and app-side CPU). Setup: a NIC forwarding to apps on virtual ports, with one CPU core for the NIC and another one shared by all virtual ports]

Page 10: mSwitch: A Highly-Scalable, Modular Software Switch

mSwitch Module Use Cases

[Diagrams of three module deployments: (1) the Open vSwitch datapath running as an mSwitch module in the kernel, switching between a NIC and VMs on virtual ports; (2) a UDP/TCP port filter module steering traffic (e.g., TCP 80/443, UDP/TCP 5004) from the NIC to middlebox apps/VMs on virtual ports; (3) a 3-tuple mux/demux module splitting traffic between user-level stacks on virtual ports (e.g., TCP 80, TCP 53) and the OS stack (e.g., TCP 22) used by socket-API apps]

• Accelerated Open vSwitch datapath
  • 3x speedup

• Filtering for virtualized middleboxes
  • Efficiently directs relevant packets to middleboxes on virtual ports (see the sketch below)

• Support for user-space protocols
  • With isolation
  • Can still use the OS's stack
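As an illustration of the middlebox-filter use case, a lookup function for such a module could look roughly like the following; this is not the actual module code, and the port numbering, header structures and byte-order helpers are assumptions for the sketch.

/*
 * Sketch of a port-filter switching logic (illustrative only). Packets for
 * TCP ports 80/443 are steered to the middlebox on virtual port 1; all
 * other traffic goes to the NIC on port 0. 'struct iphdr'/'struct tcphdr'
 * and ntohs() are assumed available from the kernel networking headers,
 * and 'struct ether_hdr' is assumed to carry an 'ether_type' field.
 */
#define FILTER_PORT_NIC        0     /* assumed port numbering */
#define FILTER_PORT_MIDDLEBOX  1
#define FILTER_ETH_HLEN        14

u_int
portfilter_lookup(u_char *buf, const struct net_device *dev)
{
        struct ether_hdr *eh = (struct ether_hdr *)buf;
        struct iphdr *ip;
        struct tcphdr *tcp;

        if (ntohs(eh->ether_type) != 0x0800)            /* not IPv4 */
                return FILTER_PORT_NIC;
        ip = (struct iphdr *)(buf + FILTER_ETH_HLEN);
        if (ip->protocol != IPPROTO_TCP)
                return FILTER_PORT_NIC;
        tcp = (struct tcphdr *)((u_char *)ip + ip->ihl * 4);

        switch (ntohs(tcp->dest)) {
        case 80:
        case 443:
                return FILTER_PORT_MIDDLEBOX;
        default:
                return FILTER_PORT_NIC;
        }
}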

Page 11: mSwitch: A Highly-Scalable, Modular Software Switch

Conclusion

• A highly-scalable, modular software switch
  • Higher scalability and flexibility compared to DPDK vSwitch and VALE
• Already integrated into the netmap/VALE implementation
  • https://code.google.com/p/netmap/
  • Upstreamed into FreeBSD, works in Linux
• All the modules (e.g., Open vSwitch acceleration) are publicly available
  • https://github.com/cnplab

• The paper is open access: http://web.sfc.wide.ad.jp/~micchie/papers/a1-honda-sosr15.pdf

• Other papers using mSwitch:
  • Martins et al. "ClickOS and the art of network function virtualization", USENIX NSDI'14
  • Honda et al. "Rekindling network protocol innovation with user-level stacks", ACM CCR 2014

Page 12: mSwitch: A Highly-Scalable, Modular Software Switch

Module complexity and performance

70 modified lines) to hook the Open vSwitch code to the mSwitch switching logic. In essence, mSwitch-OVS replaces Open vSwitch's datapath, which normally uses Linux's standard packet I/O, with mSwitch's fast packet I/O. As a result, we can avoid expensive, per-packet sk_buff allocations and deallocations.

Figure 15: Throughput (Gbps) of mSwitch's Open vSwitch module (mSwitch-OVS) as opposed to that of standard Open vSwitch (OVS) on a single CPU core, for different packet sizes (bytes). Measurements are done when forwarding between NICs (a, left) and between virtual ports (b, right). In the latter, for standard Open vSwitch we use tap devices.

The results when forwarding between NICs using a single CPU core (Figure 15(a)) show that with relatively few small changes to Open vSwitch, mSwitch is able to achieve important throughput improvements: for small packets, we notice a 2.6-3x speed-up. The difference is also large when forwarding between virtual ports (Figure 15(b)), although part of those gains are certainly due to the presence of slow tap devices for Open vSwitch.

5.5 Module Complexity

As the final evaluation experiment, we look into how expensive the various modules are with respect to CPU frequency. Figure 16 summarizes how the throughput of mSwitch is affected by the complexity of the switching logic for minimum-sized packets and different CPU frequencies. As shown, hash-based functions (learning bridge or 3-tuple filter) are relatively inexpensive and do not significantly impact the throughput of the system. The middlebox filter is even cheaper, since it does not incur the cost of doing a hash look-up.

Conversely, Open vSwitch processing is much more CPU intensive, because OpenFlow performs packet matching against several header fields across different layers; the result is reflected in a much lower forwarding rate, and also an almost linear curve even at the highest clock frequencies.

6. GENERALITY AND LESSONS LEARNED

Through the process of designing, implementing and experimenting with mSwitch we have learned a number of lessons, as well as developed techniques that we believe are general and thus applicable to other software switch packages:

Figure 16: Throughput (Mpps) comparison between different mSwitch modules (baseline, filter, L2 learn, 3-tuple, mSwitch-OVS) for 60-byte packets at different CPU clock frequencies (GHz).

• Interrupt vs. Polling Model: Using a polling model can yield some throughput gains with respect to an interrupt-based one, but at the cost of much higher CPU utilization. For a dedicated platform (e.g., CuckooSwitch, which uses the server it runs on solely as a hardware-switch replacement) this may not matter so much, but for a system seeking to run a software switch as a back-end to processes, containers, or virtual machines, an interrupt-based model (or a hybrid one such as NAPI) is more efficient and spares cycles that those processes can use. Either way, high CPU utilization equates to higher energy consumption, which is always undesirable. We also showed that the latency penalty arising from the use of an interrupt model is negligible for OpenFlow packet matching (Section 4.5).

• Data Plane Decoupling: Logically separating mSwitch into a fast, optimized switching fabric and a specialized, modular switching logic achieves the best of both worlds: the specialization required to reach high performance with the flexibility and ease of development typically found in general packet processing systems (Section 3.4).

• High Port Density: The algorithm presented in Section 3.3 permits the implementation of a software switch with both high port density and high performance. The algorithm is not particular to mSwitch, and so it can be applied to other software packages.

• Destination Port Parallelism: Given the prevalence of virtualization technologies, it is increasingly common to have multiple sources in a software switch (e.g., the containers or VMs) needing to concurrently access a shared destination port (e.g., a NIC). The algorithm described in Section 3.3 yields high performance under these scenarios and is generic, so applicable to other systems.

• Huge Packets: Figure 6 suggests that huge packets provide significant advantages for the switching plane in a high-performance switch. Supporting these packets is important when interacting with entities (e.g., virtual machines) which have massive per-packet overheads (e.g., see [18]).

• Zero-Copy Client Buffers: Applications or virtual machines connected to ports on a software switch are likely to assemble packets in their own buffers, different from those of the underlying switch. To prevent costly transformations and memory copies, the switch should allow such clients to store output packets in their own buffers (Section 3.5).

• Application Model: How should functionality be

Measurement results for minimum-sized packet forwarding between two NICs