Peregrine : An All-Layer-2 Container Computer Network
Tzi-cker Chiueh, Cheng-Chun Tu, Yu-Cheng Wang, Pai-Wei Wang, Kai-Wen Li, and Yu-Ming Huang
∗Computer Science Department, Stony Brook University
†Industrial Technology Research Institute, Taiwan
Outline
Motivation: Layer 2 + Layer 3 design; requirements for a cloud-scale data center; problems of classic Ethernet in the cloud
Solutions: related solutions; Peregrine's solution
Implementation and Evaluation: software architecture; performance evaluation
L2 + L3 Architecture: Problems
Configuration burden: routing tables in the routers, IP assignment, DHCP coordination, VLAN and STP
Virtual machine mobility constrained to a physical location
Bandwidth bottleneck
Forwarding table size: commodity switches hold only 16K-32K entries
Ref: Cisco data center with FabricPath and the Cisco FabricPath Switching System
Requirements for a Cloud-Scale DC
Any-to-any connectivity with a non-blocking fabric: scale to more than 10,000 physical nodes
Virtual machine mobility: large Layer 2 domain
Fast fail-over: quick failure detection and recovery
Support for multi-tenancy: share resources between different customers
Load-balancing routing: efficiently use all available links
Solution: A Huge L2 Switch!
Single L2 network with non-blocking backplane bandwidth
Config-free, plug-and-play
Linear cost and power scaling
Scales to 1 million VMs
[Figure: a single huge Layer 2 switch connecting racks of VMs]
However, Ethernet does not scale!
Revisit Ethernet: Spanning Tree Topology
[Figure: example topology with switches s1-s4 and hosts N1-N8; the spanning tree rooted at s1 blocks redundant links (R = root port, D = designated port, B = blocked port)]
Revisit Ethernet: Broadcast and Source Learning
[Figure: an ARP broadcast is flooded along the spanning tree; switches learn the source MAC address from the flooded frames]
Benefit: plug-and-play
Ethernet's Scalability Issues
Limited forwarding table size: commodity switches hold 16K to 64K entries
STP as the loop-prevention mechanism: not all physical links are used; no load-sensitive dynamic routing
Slow fail-over: fail-over latency is high (> 5 seconds)
Broadcast overhead: a typical Layer 2 domain is limited to hundreds of hosts
Related Works / Solution Strategies
Scalability: Clos network / fat-tree to scale out
Alternatives to STP: link aggregation (e.g. LACP, Layer 2 trunking); routing protocols applied to the Layer 2 network
Limited forwarding table size: packet header encapsulation or rewriting
Load balancing: randomized or traffic-engineering approaches
Design of Peregrine
Peregrine's Solutions
Not all links are used: disable the Spanning Tree Protocol
L2 loop prevention: redirect broadcasts and block flooded packets
Source learning and forwarding: the Route Server pre-computes routes for all node pairs
Limited switch forwarding table size: Mac-in-Mac two-stage forwarding by a Dom0 kernel module
ARP Intercept and Redirect
[Figure: host A reaches host B across switches sw1-sw4. (1) A's ARP request is redirected to the Directory Service as a DS-ARP; (2) the DS sends a DS-Reply; (3) A sends data directly. Control flow and data flow are shown separately. DS = Directory Service, RAS = Route Algorithm Server]
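To make the redirect step concrete, the following is a minimal sketch of what the host-side interception could look like. It is illustrative only: names such as DirectoryServiceClient, resolve, and location_cache are hypothetical, not the actual Dom0 kernel module's interface.

```python
# Hypothetical sketch: intercept a VM's ARP broadcast and turn it into a
# unicast query to the Directory Service (DS), then answer the VM locally
# so nothing is flooded through the fabric.

from dataclasses import dataclass

@dataclass
class ArpRequest:
    src_mac: str
    src_ip: str
    target_ip: str

class DirectoryServiceClient:
    """Illustrative stand-in for the DS-ARP / DS-Reply exchange."""
    def __init__(self, table):
        self.table = table          # target_ip -> (vm_mac, edge_switch_mac)

    def resolve(self, target_ip):
        return self.table[target_ip]

location_cache = {}                 # target_ip -> (vm_mac, edge_switch_mac)

def handle_outgoing_arp(arp: ArpRequest, ds: DirectoryServiceClient):
    """Suppress the broadcast, query the DS, and build a synthetic ARP reply."""
    vm_mac, edge_switch_mac = ds.resolve(arp.target_ip)
    # Remember where the destination lives so the data path can later do
    # two-stage (Mac-in-Mac) forwarding toward its edge switch.
    location_cache[arp.target_ip] = (vm_mac, edge_switch_mac)
    return {"op": "arp-reply", "to": arp.src_mac,
            "ip": arp.target_ip, "mac": vm_mac}

# Example corresponding to steps 1-3 of the figure:
ds = DirectoryServiceClient({"10.0.0.8": ("02:00:00:00:00:08", "02:aa:00:00:00:04")})
print(handle_outgoing_arp(ArpRequest("02:00:00:00:00:01", "10.0.0.1", "10.0.0.8"), ds))
```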
Peregrine’s Solutions
Not all links are used: disable the Spanning Tree Protocol
L2 loop prevention: redirect broadcasts and block flooded packets
Source learning and forwarding: the Route Server pre-computes routes for all node pairs; fast fail-over via a primary and a backup route for each pair
Limited switch forwarding table size: Mac-in-Mac two-stage forwarding by a Dom0 kernel module
Mac-in-Mac Encapsulation
[Figure: (1) A's ARP request is redirected to the DS; (2) the DS replies that B is located at sw4; (3) A learns sw4; (4) the Dom0 module encapsulates the frame, writing sw4 into the source MAC field (outer header shown as sw4 | B | A); (5) the frame is decapsulated and the original frame restored before delivery to B. Control flow and data flow are shown separately]
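As a rough illustration of the two-stage idea, here is a minimal sketch of an encapsulate/decapsulate pair. It is not the actual Dom0 module: the real code rewrites fields of the existing Ethernet header (the slide shows sw4 written into the source MAC), whereas this sketch simply prepends and strips a generic outer header carrying the edge switch's address.

```python
import struct

def mac_bytes(mac: str) -> bytes:
    return bytes(int(b, 16) for b in mac.split(":"))

OUTER_ETHERTYPE = 0x88E7   # 802.1ah-style tag, used here purely for illustration

def encap(frame: bytes, edge_switch_mac: str, host_mac: str) -> bytes:
    """Stage 1: wrap the original frame so the fabric forwards it using the
    destination's edge switch address, keeping VM MACs out of switch tables."""
    outer = (mac_bytes(edge_switch_mac) + mac_bytes(host_mac) +
             struct.pack("!H", OUTER_ETHERTYPE))
    return outer + frame

def decap(wrapped: bytes) -> bytes:
    """Stage 2, at the receiving side: strip the 14-byte outer header and
    restore the original frame before delivering it to the destination VM."""
    return wrapped[14:]

# Round-trip check on a dummy inner frame.
inner = (b"\x02\x00\x00\x00\x00\x08" + b"\x02\x00\x00\x00\x00\x01" +
         b"\x08\x00" + b"payload")
wrapped = encap(inner, "02:aa:00:00:00:04", "02:00:00:00:00:01")
assert decap(wrapped) == inner
```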
Fast Fail-Over
Goal: fail-over latency < 100 msec, application agnostic (TCP's minimum retransmission timeout is 200 ms)
Strategy: pre-compute a primary and a backup route for each VM; each VM has two virtual MACs; when a link fails, notify the hosts whose primary routes are affected so that they switch to the corresponding backup routes
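A minimal sketch of the host-side reaction, assuming a per-destination table of pre-installed primary/backup virtual MACs (the data structures and names are hypothetical):

```python
# Hypothetical per-host route table for fast fail-over: every destination has a
# primary and a backup virtual MAC, both pre-computed by the Route Server.
routes = {
    "vm-B": {"primary": "02:aa:04:00:00:0b", "backup": "02:aa:06:00:00:0b",
             "active": "primary", "primary_links": {("sw1", "sw4")}},
}

def on_link_failure(failed_link):
    """Switch every destination whose primary route uses the failed link over
    to its pre-installed backup; nothing is recomputed on the critical path."""
    for entry in routes.values():
        if entry["active"] == "primary" and failed_link in entry["primary_links"]:
            entry["active"] = "backup"

def next_hop_vmac(dst):
    entry = routes[dst]
    return entry[entry["active"]]

on_link_failure(("sw1", "sw4"))
print(next_hop_vmac("vm-B"))   # now returns the backup virtual MAC
```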
When a Network Link Fails
Implementation and Evaluation
Software Architecture
Review All Components
[Figure: hosts A and B connected across switches sw1-sw7, annotated with the DS, the RAS, the MIM module, ARP redirect, the ARP request rate, and backup routes]
Questions: how fast can the DS handle ARP requests? How long does the RAS take to process a route request? What is the performance of the MIM module, the DS, the RAS, and the switches?
Mac-In-Mac Performance
Time spent on decapsulation / encapsulation / total processing: 1 µs / 5 µs / 7 µs on a 2.66 GHz CPU, i.e. roughly 2.66K / 13.3K / 18.6K cycles.
Aggregate Throughput for Multiple VMs
1. ARP table size < 1K entries. 2. Measure the TCP throughput of 1, 2, and 4 VMs communicating with each other.
ARP Broadcast Rate in a Data Center: what is the ARP traffic rate in the real world?
From 2,456 hosts, the CMU CS department reports 1,150 ARP/sec at peak and 89 ARP/sec on average.
From 3,800 hosts on a university network, there are around 1,000 ARP/sec at peak and < 100 ARP/sec on average.
Scaling to 1M nodes implies roughly 20K-30K ARP/sec on average; the current optimized DS handles 100K ARP/sec.
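The 20K-30K figure is consistent with a simple linear extrapolation of the per-host rates in those two traces (linear scaling is an assumption, not stated on the slide):

```python
# Linear extrapolation of the average ARP rate to 1M hosts.
traces = {"CMU CS (2,456 hosts)": 89 / 2456,       # ~0.036 ARP/sec per host
          "university (3,800 hosts)": 100 / 3800}  # ~0.026 ARP/sec per host
for name, per_host in traces.items():
    print(f"{name}: ~{per_host * 1_000_000:,.0f} ARP/sec at 1M hosts")
# ~36,000 and ~26,000 ARP/sec -- the same order as the 20K-30K estimate,
# and comfortably below the DS capacity of 100K ARP/sec.
```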
Fail-over time and its breakdown. Average fail-over time: 75 ms.
Switch: 25~45 ms (sending the trap, i.e. soft unplug)
RS: 25 ms (receiving and processing the trap)
DS: 2 ms (receiving the information from the RS and informing the affected hosts)
The rest is network delay and Dom0 processing time.
Conclusion
A unified Layer-2-only network for LAN and SAN
Centralized control plane, distributed data plane
Uses only commodity Ethernet switches: an army of commodity switches rather than a few high-port-density switches; the only requirements on a switch are fast forwarding and a programmable forwarding table
Centralized load-balancing routing using a real-time traffic matrix
Fast fail-over using pre-computed primary/backup routes
Thank you
Questions?
Review All Components: Result
[Figure: the same component diagram, annotated with the measured results: the DS handles 100K ARP/sec; the RS takes 25 ms per request; link-down notification takes 35 ms; MIM packet processing takes 7 µs; backup routes are pre-installed]
Thank you
Backup Slides
OpenFlow Architecture
OpenFlow switch: a data plane that implements a set of flow rules specified in terms of the OpenFlow instruction set
OpenFlow controller: a control plane that sets up the flow rules in the flow tables of OpenFlow switches
OpenFlow protocol: a secure protocol for an OpenFlow controller to set up the flow tables in OpenFlow switches
[Figure: an OpenFlow switch with a hardware data path and an OpenFlow control path, connected to an OpenFlow controller via the OpenFlow protocol over SSL/TCP]
Conclusion and Contribution
Using commodity switches to build a large-scale Layer 2 network
Solutions to Ethernet's scalability issues: suppressing broadcasts, load-balancing route calculation, controlling the MAC forwarding table, scaling up to one million VMs via Mac-in-Mac two-stage forwarding, fast fail-over
Future work: high availability of the DS and RAS (master-slave model); inter-…
Comparisons
Scalable and available data center fabrics: IEEE 802.1aq Shortest Path Bridging, IETF TRILL; competitors: Cisco, Juniper, Brocade; differences: commodity switches, centralized load-balancing routing, and proactive backup-route deployment
Network virtualization: OpenStack Quantum API; competitors: Nicira, NEC; their generality carries a steep performance price because every virtual network link is a tunnel; differences: simpler and more efficient because Peregrine runs on L2 switches directly
Three-Stage Clos Network (m, n, r)
[Figure: a three-stage Clos network with r ingress switches of size n x m, m middle switches of size r x r, and r egress switches of size m x n]
Clos Network Theory
Clos(m, n, r) configuration: rn inputs and rn outputs
2r n-by-m switches plus m r-by-r switches, fewer crosspoints than a single rn x rn crossbar
Each r x r middle switch can in turn be implemented as a 3-stage Clos network
Clos(m, n, r) is rearrangeably non-blocking iff m >= n
Clos(m, n, r) is strictly non-blocking iff m >= 2n-1
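A small sketch that tallies the switch and crosspoint counts for a Clos(m, n, r) fabric and checks the two non-blocking conditions above (the example parameters are arbitrary):

```python
def clos_stats(m: int, n: int, r: int) -> dict:
    """Switch/crosspoint accounting and non-blocking conditions for Clos(m, n, r)."""
    ports = r * n                                   # rn inputs and rn outputs
    crosspoints = 2 * r * (n * m) + m * (r * r)     # 2r switches of n x m, m of r x r
    crossbar = ports * ports                        # a single rn x rn crossbar
    return {
        "ports": ports,
        "clos_crosspoints": crosspoints,
        "crossbar_crosspoints": crossbar,
        "rearrangeably_nonblocking": m >= n,
        "strictly_nonblocking": m >= 2 * n - 1,
    }

# Example: Clos(3, 3, 4) has 12 ports and 120 crosspoints vs. 144 for a crossbar;
# it is rearrangeably non-blocking (m >= n) but not strictly (m < 2n - 1).
print(clos_stats(m=3, n=3, r=4))
```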
Link Aggregation
ECMP: Equal-Cost Multipath
Pros: multiple links are used. Cons: hash collisions; traffic re-converges downstream onto a single link.
Example: Brocade Data Center
Ref: Deploying Brocade VDX 6720 Data Center Switches with Brocade VCS in Enterprise Data Centers
[Figure: the design combines link aggregation and L3 ECMP]
PortLand
• Scale-out: three-layer, multi-root topology
• Hierarchical: encodes location into the MAC address
• Location Discovery Protocol to find shortest paths; routes by MAC
• Fabric Manager maintains the IP-to-MAC mapping
• 60-80 ms fail-over with centralized control and notification
VL2: Virtual Layer 2
• Three-layer Clos network
• Flat addressing, IP-in-IP: Locator Addresses (LA) and Application Addresses (AA)
• Link-state routing to disseminate LAs
• VLB + flow-based ECMP
• Depends on ECMP to detect link failures
• Packet interception at S; VL2 Directory Service
Monsoon
• Three-layer, multi-root topology
• 802.1ah MAC-in-MAC encapsulation, source routing
• Centralized routing decisions
• VLB + MAC rotation
• Depends on LSAs to detect failures
• Packet interception at S; Monsoon Directory Service maps IP <-> (server MAC, ToR MAC)
TRILL and SPB
TRILL
• Transparent Interconnection of Lots of Links (IETF)
• IS-IS as the topology management protocol
• Shortest-path forwarding
• New TRILL header
• Transit hash to select the next hop
SPB
• Shortest Path Bridging (IEEE)
• IS-IS as the topology management protocol
• Shortest-path forwarding
• 802.1ah MAC-in-MAC
• Computes 16 source-node-based trees
TRILL Packet Forwarding
Link-state routing; TRILL header fields: A-ID = nickname of A, C-ID = nickname of C, HopC = hop count
Ref: NIL Data Communications
SPB Packet Forwarding
Link-state routing; 802.1ah Mac-in-Mac header fields: I-SID = Backbone Service Instance Identifier, B-VID = backbone VLAN identifier
Ref: NIL Data Communications
Rearrangeably Non-Blocking Clos Network
[Figure: a three-stage Clos network with N = 6, n = 2, k = 2: an ingress stage of n x k switches, a middle stage of (N/n) x (N/n) switches, and an egress stage of k x n switches]
Example:
1. Three-stage Clos network
2. Condition: k >= n
3. An unused input at an ingress switch can always be connected to an unused output at an egress switch
4. Existing calls may have to be rearranged
Features of the Peregrine Network
• Utilizes all links
• Load-balancing routing algorithm
• Scales up to 1 million VMs: two-stage dual-mode forwarding
• Fast fail-over
Goal
• Given a mesh network and a traffic profile:
– Load-balance the network resource utilization: prevent congestion by balancing the network load so as to support as much traffic as possible
– Provide fast recovery from failure: provide a primary and a backup route to minimize recovery time
[Figure: source S and destination D connected by a primary route and a backup route]
Factors
• Hop count only
• Hop count and link residual capacity
• Hop count, link residual capacity, and link expected load
• Hop count, link residual capacity, link expected load, and the additional forwarding table entries required
How can they be combined into one number for a particular candidate route?
Route Selection: Idea
[Figure: a topology with switches A, B, C, D, flow S1 -> D1 and flow S2 -> D2; S2-D2 must use link C-D, while S1-D1 can either leave C-D free or share it with S2-D2]
Which route is better from S1 to D1?
Link C-D is more important to S2-D2, so the idea is to use it as sparingly as possible.
Route Selection: Hop Count and Residual Capacity
[Figure: the same topology; the S1-D1 candidates either leave C-D free or share it with S2-D2]
Traffic matrix: S1 -> D1: 1G, S2 -> D2: 1G
Using hop count or residual capacity alone makes no difference between the two candidates!
Determine the Criticality of a Link
Criticality of link l for a pair (s, d) = the fraction of all (s, d) routes that pass through link l; this captures the importance of the link.
Expected load of a link at the initial state = the sum, over all pairs (s, d), of the bandwidth demand matrix entry for (s, d) multiplied by the link's criticality for (s, d).
Criticality Example
Case 2: s = B, d = C. There are four possible routes from B to C, so each link's criticality is the fraction of those four routes that use it: 0 for unused links, 2/4 for links on two of the routes, and 4/4 for a link shared by all four.
Case 3: s = A, d = C is similar.
[Figure: the example topology with nodes A, B, C, annotated with per-link criticality values of 0, 2/4, and 4/4]
Expected Load
Assumption: load is equally distributed over the possible routes between S and D.
Consider a bandwidth demand of 20 for B-C. Each link's expected load is the demand multiplied by its criticality: links with criticality 2/4 get 10, and a link with criticality 4/4 gets 20.
[Figure: the example topology annotated with per-link expected loads of 0, 10, and 20]
Cost Metrics
The cost metric is the expected load per unit of available capacity on the link: cost = expected load / residual capacity.
Idea: pick the route whose links have the minimum cost.
[Figure: the example topology annotated with per-link costs of 0, 0.01, and 0.02]
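Putting the three quantities together, here is a small, self-contained illustration. The toy topology, demand, and capacities are invented (they only loosely mirror the A-B-C example), and the brute-force route enumeration stands in for whatever the RAS actually uses:

```python
# Illustrative computation of link criticality, expected load, and cost.
from collections import defaultdict

# A toy multigraph: two parallel A-B links and two parallel B-C links.
links = [("A", "B", "ab1"), ("A", "B", "ab2"),
         ("B", "C", "bc1"), ("B", "C", "bc2")]

def routes(src, dst):
    """All simple routes from src to dst, each as a tuple of link ids."""
    found = []
    def dfs(node, visited, used):
        if node == dst:
            found.append(tuple(used))
            return
        for u, v, lid in links:
            if node in (u, v):
                nxt = v if node == u else u
                if nxt not in visited:
                    dfs(nxt, visited | {nxt}, used + [lid])
    dfs(src, {src}, [])
    return found

def criticality(src, dst):
    """Fraction of all (src, dst) routes that pass through each link."""
    rs = routes(src, dst)
    frac = defaultdict(float)
    for r in rs:
        for lid in r:
            frac[lid] += 1.0 / len(rs)
    return frac

demand = {("B", "C"): 20}                       # bandwidth demand matrix
capacity = {lid: 1000 for _, _, lid in links}   # residual capacity per link

expected_load = defaultdict(float)
for (s, d), bw in demand.items():
    for lid, f in criticality(s, d).items():
        expected_load[lid] += bw * f            # demand spread equally over routes

cost = {lid: expected_load[lid] / capacity[lid] for _, _, lid in links}
print(dict(expected_load))   # {'bc1': 10.0, 'bc2': 10.0}
print(cost)                  # bc links cost 0.01 each; the unused ab links cost 0
```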
Forwarding Table Metric
Consider a commodity switch with a 16K-32K-entry forwarding table.
Idea: minimize entry consumption to keep the forwarding table from being exhausted.
Two quantities: the number of available forwarding table entries at each node n, and INC_FWD, the number of extra entries needed to route A-C.
[Figure: the example topology annotated with the available forwarding table entries at each node (0, 100, 200, 300)]
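One plausible way to fold the forwarding-table factor into the route score is to add a penalty for every node where the route would consume a new entry, weighted by how scarce that node's free entries are. This combination is a hedged guess, not the paper's exact formula:

```python
# Hypothetical combined route score: sum of per-link costs plus a penalty for
# new forwarding table entries at nodes whose tables are nearly full.
def route_score(route_links, route_nodes, link_cost, free_entries, needs_entry, w=1.0):
    cost = sum(link_cost[l] for l in route_links)
    # INC_FWD-style term: a node that needs a new entry contributes more
    # as its remaining forwarding-table space shrinks.
    fwd_penalty = sum(1.0 / max(free_entries[n], 1)
                      for n in route_nodes if needs_entry[n])
    return cost + w * fwd_penalty

# Example with made-up numbers in the spirit of the slide:
link_cost = {"ab": 0.01, "bc": 0.01}
free_entries = {"A": 100, "B": 200, "C": 300}
needs_entry = {"A": True, "B": True, "C": False}
print(route_score(["ab", "bc"], ["A", "B", "C"], link_cost, free_entries, needs_entry))
```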
Load-Balanced Routing
• Simulated network: 52 PMs with 4 NICs each, 384 links in total; replay 17 multi-VDC 300-second traces
• Compare: Random Shortest Path Routing (RSPR) vs. Full Link Criticality-based Routing (FLCR)
• Metric: congestion count, the number of links whose capacity is exceeded
• FLCR induces little additional traffic