XCo: Coordination to Prevent Network Congestion in Cloud ...dcm/Teaching/CDA5532-Cloud... · XCo...

Post on 05-Mar-2021

5 views 0 download

Transcript of XCo: Coordination to Prevent Network Congestion in Cloud ...dcm/Teaching/CDA5532-Cloud... · XCo...

XCo: Explicit Coordination to Prevent Network Fabric Congestion in Cloud Computing Cluster 

Platforms

Presented by Wei Dai

Reasons for Congestion in CloudReasons for Congestion in Cloud

• Cloud operators use virtualization to consolidateCloud operators use virtualization to consolidate thousands of VMs on shared hardware platforms due to cost concerns.

• Most VMs host service‐oriented applications that are ppinherently communication intensive.

2

Reasons for Congestion in CloudReasons for Congestion in Cloud

• Cloud computing infrastructures consist of large dataCloud computing infrastructures consist of large data center clusters using commodity servers and networking hardware.

Pros:Cheap

Easy to install and manage

Can be shared by a wide range of network services and protocolsprotocols

3

Reasons for Congestion in CloudReasons for Congestion in Cloud

• Cloud computing infrastructures consist of large data p g gcenter clusters using commodity servers and networking hardware.Cons:Higher latencySmaller / lower‐performance packet buffersSmaller / lower‐performance packet buffers

• Switch buffers can easily become overwhelmed by y yhigh‐throughput traffic that can be bursty and synchronized, leading to significant packet losses.

4

Types of CongestionTypes of Congestion

• TCP throughput collapse (also known as Incast)TCP throughput collapse (also known as Incast)

Well‐known example of congestion experienced by barrier‐synchronized traffic, e.g. synchronous reads in networked storage

• Congestion caused by non‐TCP traffic, e.g. UDP

• Congestion caused by traffic not TCP‐friendly

voice/video over IP, and peer‐to‐peer traffic

• Congestion caused by large number of short TCP sessions

5

How to Solve the ProblemHow to Solve the Problem

• Root cause: transient overload of buffers withinRoot cause: transient overload of buffers within switches

• Hardware and software mechanisms are hard to deploy at scale.

• Ethernet flow control in IEEE 802.3x helps in low‐end edge switches, but is counter‐productive in backbone switches.

6

How to Solve the ProblemHow to Solve the Problem

• Current industry practice:Current industry practice:Add higher capacity network switches

Multi‐port network cards

Physically separate networks for data and control traffic

• Drawback: increase cost and complexity without addressing the root cause

7

XCo – Explicit CoordinationXCo Explicit Coordination

• Coordinate network transmissions from multipleCoordinate network transmissions from multiple VMs to avoid throughput collapse and increase network utilization

• Advantages: simple, effective, feasible, and g pindependent of switch‐level hardware support, transparent implementation without modifying any 

l d d l k happlications, standard protocols, network switches or VMs

8

XCo – Explicit CoordinationXCo Explicit Coordination

9

Central ControllerCentral Controller

• Resides in the same switched network as other nodes

• Takes as input:Switch interconnection topology and link capacitiesSwitch interconnection topology and link capacitiesLocation of VMs on physical nodesCurrent traffic matrix of the networkAdministrative policiesAdministrative policies

• Whenever detects congestion buildup at any link, computes and sends transmission directives to local coordinators at each end‐host that is contributing to the congestion

10

Local CoordinatorLocal Coordinator

• Intercepts and regulates the outgoing trafficIntercepts and regulates the outgoing traffic aggregates (VM‐to‐VM flows) from all VMs within the corresponding end‐host according to transmission directives 

• Provides traffic feedback to the central controller

• The specific regulation pattern is dictated by transmission directives.

11

Transmission DirectivesTransmission Directives

• Explicit instructions for transmissionExplicit instructions for transmission

• Various forms:Various forms:

Explicit timeslice scheduling

which V2V flow transmits when and for how longwhich V2V flow transmits when and for how long

Explicit rate limiting

t h t t V2V fl h ld t it f th t Nat what rate a V2V flow should transmit for the next N ms

Combination of the above two or other forms

12

Explicit Timeslice SchedulingExplicit Timeslice Scheduling

13

Work ConservationWork Conservation

• Some nodes may finish early with their timeslice.y y

• Local coordinators return the remaining part of timesliceb k l llback to central controller.

• Central controller then permits another node to transmit• Central controller then permits another node to transmit.

• Local coordinators introduce a small hysteresis delay y ybefore returning the timeslice, in case that more packets might arrive during the delay.

14

Experimental SetupsExperimental Setups

15

Impact of Ethernet CongestionImpact of Ethernet Congestion

16

Performance Evaluation of XCoPerformance Evaluation of XCo

17

Impact of Ethernet CongestionImpact of Ethernet Congestion

18

Performance Evaluation of XCoPerformance Evaluation of XCo

19

Impact of Ethernet CongestionImpact of Ethernet Congestion

20

Performance Evaluation of XCoPerformance Evaluation of XCo

21

Experimental SetupExperimental Setup

22

Impact of Ethernet CongestionImpact of Ethernet Congestion

23

Performance Evaluation of XCoPerformance Evaluation of XCo

24

Live VMMigrationLive VM Migration

25

Fairness among V2V FlowsFairness among V2V Flows

26

Work ConservationWork Conservation 

27

ReferenceReference

• XCo: explicit coordination to prevent network fabric congestion in cloud computing cluster platformsVijay Shankar Rajanna, Smit Shah, Anand Jahagirdar, Christopher Lemoine, and Kartik Gopalan

Proceedings of the 19th ACM International Symposium onProceedings of the 19th ACM International Symposium on High Performance Distributed Computing, June 2010 

28