Peregrine : An All-Layer-2 Container Computer Network
Tzi-cker Chiueh, Cheng-Chun Tu, Yu-Cheng Wang, Pai-Wei Wang, Kai-Wen Li, and Yu-Ming Huang
∗Computer Science Department, Stony Brook University
†Industrial Technology Research Institute, Taiwan
Outline
Motivation: Layer 2 + Layer 3 design; requirements for a cloud-scale data center; problems of classic Ethernet in the cloud
Solutions: related solutions; Peregrine's solution
Implementation and Evaluation: software architecture; performance evaluation
L2 + L3 Architecture: Problems
Configuration burden: routing tables in the routers, IP assignment, DHCP coordination, VLAN and STP
Virtual machine mobility constrained to a physical location
Bandwidth bottleneck
Forwarding table size: commodity switches hold only 16K-32K entries
Ref: Cisco data center with FabricPath and the Cisco FabricPath Switching System
Requirements for a Cloud-Scale DC
Any-to-any connectivity with a non-blocking fabric: scale to more than 10,000 physical nodes
Virtual machine mobility: large Layer 2 domain
Fast fail-over: quick failure detection and recovery
Support for multi-tenancy: share resources between different customers
Load-balancing routing: efficiently use all available links
Solution: A Huge L2 Switch!
Single L2 network with non-blocking backplane bandwidth
Config-free, plug-and-play
Linear cost and power scaling
Scales to 1 million VMs
[Figure: a single huge Layer 2 switch connecting racks of VMs]
However, Ethernet does not scale!
Revisit Ethernet: Spanning Tree Topology
[Figure: example topology with switches s1-s4 and hosts N1-N8; the spanning tree rooted at s1 blocks redundant links (R = root port, D = designated port, B = blocked port)]
Revisit Ethernet: Broadcast and Source Learning
[Figure: an ARP broadcast is flooded along the spanning tree; switches learn the source MAC address from the flooded frames]
Benefit: plug-and-play
Ethernet's Scalability Issues
Limited forwarding table size: commodity switches hold 16K to 64K entries
STP as the loop-prevention mechanism: not all physical links are used; no load-sensitive dynamic routing
Slow fail-over: fail-over latency is high (> 5 seconds)
Broadcast overhead: a typical Layer 2 domain is limited to hundreds of hosts
Related Works / Solution Strategies
Scalability: Clos network / fat-tree to scale out
Alternatives to STP: link aggregation (e.g. LACP, Layer 2 trunking); routing protocols applied to the Layer 2 network
Limited forwarding table size: packet header encapsulation or rewriting
Load balancing: randomized or traffic-engineering approaches
Design of Peregrine
Peregrine's Solutions
Not all links are used: disable the Spanning Tree Protocol
L2 loop prevention: redirect broadcasts and block flooded packets
Source learning and forwarding: the Route Server pre-computes routes for all node pairs
Limited switch forwarding table size: Mac-in-Mac two-stage forwarding by a Dom0 kernel module
ARP Intercept and Redirect
[Figure: host A reaches host B across switches sw1-sw4. (1) A's ARP request is redirected to the Directory Service as a DS-ARP; (2) the DS sends a DS-Reply; (3) A sends data directly. Control flow and data flow are shown separately. DS = Directory Service, RAS = Route Algorithm Server]
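To make the redirect step concrete, the following is a minimal sketch of what the host-side interception could look like. It is illustrative only: names such as DirectoryServiceClient, resolve, and location_cache are hypothetical, not the actual Dom0 kernel module's interface.

```python
# Hypothetical sketch: intercept a VM's ARP broadcast and turn it into a
# unicast query to the Directory Service (DS), then answer the VM locally
# so nothing is flooded through the fabric.

from dataclasses import dataclass

@dataclass
class ArpRequest:
    src_mac: str
    src_ip: str
    target_ip: str

class DirectoryServiceClient:
    """Illustrative stand-in for the DS-ARP / DS-Reply exchange."""
    def __init__(self, table):
        self.table = table          # target_ip -> (vm_mac, edge_switch_mac)

    def resolve(self, target_ip):
        return self.table[target_ip]

location_cache = {}                 # target_ip -> (vm_mac, edge_switch_mac)

def handle_outgoing_arp(arp: ArpRequest, ds: DirectoryServiceClient):
    """Suppress the broadcast, query the DS, and build a synthetic ARP reply."""
    vm_mac, edge_switch_mac = ds.resolve(arp.target_ip)
    # Remember where the destination lives so the data path can later do
    # two-stage (Mac-in-Mac) forwarding toward its edge switch.
    location_cache[arp.target_ip] = (vm_mac, edge_switch_mac)
    return {"op": "arp-reply", "to": arp.src_mac,
            "ip": arp.target_ip, "mac": vm_mac}

# Example corresponding to steps 1-3 of the figure:
ds = DirectoryServiceClient({"10.0.0.8": ("02:00:00:00:00:08", "02:aa:00:00:00:04")})
print(handle_outgoing_arp(ArpRequest("02:00:00:00:00:01", "10.0.0.1", "10.0.0.8"), ds))
```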
Peregrine’s Solutions
Not all links are used: disable the Spanning Tree Protocol
L2 loop prevention: redirect broadcasts and block flooded packets
Source learning and forwarding: the Route Server pre-computes routes for all node pairs; fast fail-over via a primary and a backup route for each pair
Limited switch forwarding table size: Mac-in-Mac two-stage forwarding by a Dom0 kernel module
Mac-in-Mac Encapsulation
[Figure: (1) A's ARP request is redirected to the DS; (2) the DS replies that B is located at sw4; (3) A learns sw4; (4) the Dom0 module encapsulates the frame, writing sw4 into the source MAC field (outer header shown as sw4 | B | A); (5) the frame is decapsulated and the original frame restored before delivery to B. Control flow and data flow are shown separately]
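As a rough illustration of the two-stage idea, here is a minimal sketch of an encapsulate/decapsulate pair. It is not the actual Dom0 module: the real code rewrites fields of the existing Ethernet header (the slide shows sw4 written into the source MAC), whereas this sketch simply prepends and strips a generic outer header carrying the edge switch's address.

```python
import struct

def mac_bytes(mac: str) -> bytes:
    return bytes(int(b, 16) for b in mac.split(":"))

OUTER_ETHERTYPE = 0x88E7   # 802.1ah-style tag, used here purely for illustration

def encap(frame: bytes, edge_switch_mac: str, host_mac: str) -> bytes:
    """Stage 1: wrap the original frame so the fabric forwards it using the
    destination's edge switch address, keeping VM MACs out of switch tables."""
    outer = (mac_bytes(edge_switch_mac) + mac_bytes(host_mac) +
             struct.pack("!H", OUTER_ETHERTYPE))
    return outer + frame

def decap(wrapped: bytes) -> bytes:
    """Stage 2, at the receiving side: strip the 14-byte outer header and
    restore the original frame before delivering it to the destination VM."""
    return wrapped[14:]

# Round-trip check on a dummy inner frame.
inner = (b"\x02\x00\x00\x00\x00\x08" + b"\x02\x00\x00\x00\x00\x01" +
         b"\x08\x00" + b"payload")
wrapped = encap(inner, "02:aa:00:00:00:04", "02:00:00:00:00:01")
assert decap(wrapped) == inner
```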
Fast Fail-Over
Goal: fail-over latency < 100 msec, application agnostic (TCP's minimum retransmission timeout is 200 ms)
Strategy: pre-compute a primary and a backup route for each VM; each VM has two virtual MACs; when a link fails, notify the hosts whose primary routes are affected so that they switch to the corresponding backup routes
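A minimal sketch of the host-side reaction, assuming a per-destination table of pre-installed primary/backup virtual MACs (the data structures and names are hypothetical):

```python
# Hypothetical per-host route table for fast fail-over: every destination has a
# primary and a backup virtual MAC, both pre-computed by the Route Server.
routes = {
    "vm-B": {"primary": "02:aa:04:00:00:0b", "backup": "02:aa:06:00:00:0b",
             "active": "primary", "primary_links": {("sw1", "sw4")}},
}

def on_link_failure(failed_link):
    """Switch every destination whose primary route uses the failed link over
    to its pre-installed backup; nothing is recomputed on the critical path."""
    for entry in routes.values():
        if entry["active"] == "primary" and failed_link in entry["primary_links"]:
            entry["active"] = "backup"

def next_hop_vmac(dst):
    entry = routes[dst]
    return entry[entry["active"]]

on_link_failure(("sw1", "sw4"))
print(next_hop_vmac("vm-B"))   # now returns the backup virtual MAC
```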
When a Network Link Fails
Implementation and Evaluation
Software Architecture
Review All Components
[Figure: hosts A and B connected across switches sw1-sw7, annotated with the DS, the RAS, the MIM module, ARP redirect, the ARP request rate, and backup routes]
Questions: how fast can the DS handle ARP requests? How long does the RAS take to process a route request? What is the performance of the MIM module, the DS, the RAS, and the switches?
Mac-In-Mac Performance
Time spent on decapsulation / encapsulation / total processing: 1 µs / 5 µs / 7 µs on a 2.66 GHz CPU, i.e. roughly 2.66K / 13.3K / 18.6K cycles.
Aggregate Throughput for Multiple VMs
1. ARP table size < 1K entries. 2. Measure the TCP throughput of 1, 2, and 4 VMs communicating with each other.
ARP Broadcast Rate in a Data Center: what is the ARP traffic rate in the real world?
From 2,456 hosts, the CMU CS department reports 1,150 ARP/sec at peak and 89 ARP/sec on average.
From 3,800 hosts on a university network, there are around 1,000 ARP/sec at peak and < 100 ARP/sec on average.
Scaling to 1M nodes implies roughly 20K-30K ARP/sec on average; the current optimized DS handles 100K ARP/sec.
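The 20K-30K figure is consistent with a simple linear extrapolation of the per-host rates in those two traces (linear scaling is an assumption, not stated on the slide):

```python
# Linear extrapolation of the average ARP rate to 1M hosts.
traces = {"CMU CS (2,456 hosts)": 89 / 2456,       # ~0.036 ARP/sec per host
          "university (3,800 hosts)": 100 / 3800}  # ~0.026 ARP/sec per host
for name, per_host in traces.items():
    print(f"{name}: ~{per_host * 1_000_000:,.0f} ARP/sec at 1M hosts")
# ~36,000 and ~26,000 ARP/sec -- the same order as the 20K-30K estimate,
# and comfortably below the DS capacity of 100K ARP/sec.
```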
Fail-over time and its breakdown. Average fail-over time: 75 ms.
Switch: 25~45 ms (sending the trap, i.e. soft unplug)
RS: 25 ms (receiving and processing the trap)
DS: 2 ms (receiving the information from the RS and informing the affected hosts)
The rest is network delay and Dom0 processing time.
Conclusion
A unified Layer-2-only network for LAN and SAN
Centralized control plane, distributed data plane
Uses only commodity Ethernet switches: an army of commodity switches rather than a few high-port-density switches; the only requirements on a switch are fast forwarding and a programmable forwarding table
Centralized load-balancing routing using a real-time traffic matrix
Fast fail-over using pre-computed primary/backup routes
Thank you
Questions?
Review All Components: Result
[Figure: the same component diagram, annotated with the measured results: the DS handles 100K ARP/sec; the RS takes 25 ms per request; link-down notification takes 35 ms; MIM packet processing takes 7 µs; backup routes are pre-installed]
Thank you
Backup Slides
OpenFlow Architecture
OpenFlow switch: a data plane that implements a set of flow rules specified in terms of the OpenFlow instruction set
OpenFlow controller: a control plane that sets up the flow rules in the flow tables of OpenFlow switches
OpenFlow protocol: a secure protocol for an OpenFlow controller to set up the flow tables in OpenFlow switches
[Figure: an OpenFlow switch with a hardware data path and an OpenFlow control path, connected to an OpenFlow controller via the OpenFlow protocol over SSL/TCP]
Conclusion and Contribution
Using commodity switches to build a large-scale Layer 2 network
Solutions to Ethernet's scalability issues: suppressing broadcasts, load-balancing route calculation, controlling the MAC forwarding table, scaling up to one million VMs via Mac-in-Mac two-stage forwarding, fast fail-over
Future work: high availability of the DS and RAS (master-slave model); inter-…
Comparisons
Scalable and available data center fabrics: IEEE 802.1aq Shortest Path Bridging, IETF TRILL; competitors: Cisco, Juniper, Brocade; differences: commodity switches, centralized load-balancing routing, and proactive backup-route deployment
Network virtualization: OpenStack Quantum API; competitors: Nicira, NEC; their generality carries a steep performance price because every virtual network link is a tunnel; differences: simpler and more efficient because Peregrine runs on L2 switches directly
Three-Stage Clos Network (m, n, r)
[Figure: a three-stage Clos network with r ingress switches of size n x m, m middle switches of size r x r, and r egress switches of size m x n]
Clos Network Theory
Clos(m, n, r) configuration: rn inputs and rn outputs
2r n-by-m switches plus m r-by-r switches, fewer crosspoints than a single rn x rn crossbar
Each r x r middle switch can in turn be implemented as a 3-stage Clos network
Clos(m, n, r) is rearrangeably non-blocking iff m >= n
Clos(m, n, r) is strictly non-blocking iff m >= 2n-1
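A small sketch that tallies the switch and crosspoint counts for a Clos(m, n, r) fabric and checks the two non-blocking conditions above (the example parameters are arbitrary):

```python
def clos_stats(m: int, n: int, r: int) -> dict:
    """Switch/crosspoint accounting and non-blocking conditions for Clos(m, n, r)."""
    ports = r * n                                   # rn inputs and rn outputs
    crosspoints = 2 * r * (n * m) + m * (r * r)     # 2r switches of n x m, m of r x r
    crossbar = ports * ports                        # a single rn x rn crossbar
    return {
        "ports": ports,
        "clos_crosspoints": crosspoints,
        "crossbar_crosspoints": crossbar,
        "rearrangeably_nonblocking": m >= n,
        "strictly_nonblocking": m >= 2 * n - 1,
    }

# Example: Clos(3, 3, 4) has 12 ports and 120 crosspoints vs. 144 for a crossbar;
# it is rearrangeably non-blocking (m >= n) but not strictly (m < 2n - 1).
print(clos_stats(m=3, n=3, r=4))
```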
Link Aggregation
ECMP: Equal-Cost Multipath
Pros: multiple links are used. Cons: hash collisions; traffic re-converges downstream onto a single link.
Example: Brocade Data Center
Ref: Deploying Brocade VDX 6720 Data Center Switches with Brocade VCS in Enterprise Data Centers
[Figure: the design combines link aggregation and L3 ECMP]
PortLand
• Scale-out: three-layer, multi-root topology
• Hierarchical: encodes location into the MAC address
• Location Discovery Protocol to find shortest paths; routes by MAC
• Fabric Manager maintains the IP-to-MAC mapping
• 60-80 ms fail-over with centralized control and notification
VL2: Virtual Layer 2
• Three-layer Clos network
• Flat addressing, IP-in-IP: Locator Addresses (LA) and Application Addresses (AA)
• Link-state routing to disseminate LAs
• VLB + flow-based ECMP
• Depends on ECMP to detect link failures
• Packet interception at S; VL2 Directory Service
Monsoon
• Three-layer, multi-root topology
• 802.1ah MAC-in-MAC encapsulation, source routing
• Centralized routing decisions
• VLB + MAC rotation
• Depends on LSAs to detect failures
• Packet interception at S; Monsoon Directory Service maps IP <-> (server MAC, ToR MAC)
TRILL and SPB
TRILL
• Transparent Interconnection of Lots of Links (IETF)
• IS-IS as the topology management protocol
• Shortest-path forwarding
• New TRILL header
• Transit hash to select the next hop
SPB
• Shortest Path Bridging (IEEE)
• IS-IS as the topology management protocol
• Shortest-path forwarding
• 802.1ah MAC-in-MAC
• Computes 16 source-node-based trees
TRILL Packet Forwarding
Link-state routing; TRILL header fields: A-ID = nickname of A, C-ID = nickname of C, HopC = hop count
Ref: NIL Data Communications
SPB Packet Forwarding
Link-state routing; 802.1ah Mac-in-Mac header fields: I-SID = Backbone Service Instance Identifier, B-VID = backbone VLAN identifier
Ref: NIL Data Communications
Rearrangeably Non-Blocking Clos Network
[Figure: a three-stage Clos network with N = 6, n = 2, k = 2: an ingress stage of n x k switches, a middle stage of (N/n) x (N/n) switches, and an egress stage of k x n switches]
Example:
1. Three-stage Clos network
2. Condition: k >= n
3. An unused input at an ingress switch can always be connected to an unused output at an egress switch
4. Existing calls may have to be rearranged
Features of the Peregrine Network
• Utilizes all links
• Load-balancing routing algorithm
• Scales up to 1 million VMs: two-stage dual-mode forwarding
• Fast fail-over
Goal
• Given a mesh network and a traffic profile:
– Load-balance the network resource utilization: prevent congestion by balancing the network load so as to support as much traffic as possible
– Provide fast recovery from failure: provide a primary and a backup route to minimize recovery time
[Figure: source S and destination D connected by a primary route and a backup route]
Factors
• Hop count only
• Hop count and link residual capacity
• Hop count, link residual capacity, and link expected load
• Hop count, link residual capacity, link expected load, and the additional forwarding table entries required
How can they be combined into one number for a particular candidate route?
Route Selection: Idea
[Figure: a topology with switches A, B, C, D, flow S1 -> D1 and flow S2 -> D2; S2-D2 must use link C-D, while S1-D1 can either leave C-D free or share it with S2-D2]
Which route is better from S1 to D1?
Link C-D is more important to S2-D2, so the idea is to use it as sparingly as possible.
Route Selection: Hop Count and Residual Capacity
[Figure: the same topology; the S1-D1 candidates either leave C-D free or share it with S2-D2]
Traffic matrix: S1 -> D1: 1G, S2 -> D2: 1G
Using hop count or residual capacity alone makes no difference between the two candidates!
Determine the Criticality of a Link
Criticality of link l for a pair (s, d) = the fraction of all (s, d) routes that pass through link l; this captures the importance of the link.
Expected load of a link at the initial state = the sum, over all pairs (s, d), of the bandwidth demand matrix entry for (s, d) multiplied by the link's criticality for (s, d).
Criticality Example
Case 2: s = B, d = C. There are four possible routes from B to C, so each link's criticality is the fraction of those four routes that use it: 0 for unused links, 2/4 for links on two of the routes, and 4/4 for a link shared by all four.
Case 3: s = A, d = C is similar.
[Figure: the example topology with nodes A, B, C, annotated with per-link criticality values of 0, 2/4, and 4/4]
Expected Load
Assumption: load is equally distributed over the possible routes between S and D.
Consider a bandwidth demand of 20 for B-C. Each link's expected load is the demand multiplied by its criticality: links with criticality 2/4 get 10, and a link with criticality 4/4 gets 20.
[Figure: the example topology annotated with per-link expected loads of 0, 10, and 20]
Cost Metrics
The cost metric is the expected load per unit of available capacity on the link: cost = expected load / residual capacity.
Idea: pick the route whose links have the minimum cost.
[Figure: the example topology annotated with per-link costs of 0, 0.01, and 0.02]
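Putting the three quantities together, here is a small, self-contained illustration. The toy topology, demand, and capacities are invented (they only loosely mirror the A-B-C example), and the brute-force route enumeration stands in for whatever the RAS actually uses:

```python
# Illustrative computation of link criticality, expected load, and cost.
from collections import defaultdict

# A toy multigraph: two parallel A-B links and two parallel B-C links.
links = [("A", "B", "ab1"), ("A", "B", "ab2"),
         ("B", "C", "bc1"), ("B", "C", "bc2")]

def routes(src, dst):
    """All simple routes from src to dst, each as a tuple of link ids."""
    found = []
    def dfs(node, visited, used):
        if node == dst:
            found.append(tuple(used))
            return
        for u, v, lid in links:
            if node in (u, v):
                nxt = v if node == u else u
                if nxt not in visited:
                    dfs(nxt, visited | {nxt}, used + [lid])
    dfs(src, {src}, [])
    return found

def criticality(src, dst):
    """Fraction of all (src, dst) routes that pass through each link."""
    rs = routes(src, dst)
    frac = defaultdict(float)
    for r in rs:
        for lid in r:
            frac[lid] += 1.0 / len(rs)
    return frac

demand = {("B", "C"): 20}                       # bandwidth demand matrix
capacity = {lid: 1000 for _, _, lid in links}   # residual capacity per link

expected_load = defaultdict(float)
for (s, d), bw in demand.items():
    for lid, f in criticality(s, d).items():
        expected_load[lid] += bw * f            # demand spread equally over routes

cost = {lid: expected_load[lid] / capacity[lid] for _, _, lid in links}
print(dict(expected_load))   # {'bc1': 10.0, 'bc2': 10.0}
print(cost)                  # bc links cost 0.01 each; the unused ab links cost 0
```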
Forwarding Table Metric
Consider a commodity switch with a 16K-32K-entry forwarding table.
Idea: minimize entry consumption to keep the forwarding table from being exhausted.
Two quantities: the number of available forwarding table entries at each node n, and INC_FWD, the number of extra entries needed to route A-C.
[Figure: the example topology annotated with the available forwarding table entries at each node (0, 100, 200, 300)]
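One plausible way to fold the forwarding-table factor into the route score is to add a penalty for every node where the route would consume a new entry, weighted by how scarce that node's free entries are. This combination is a hedged guess, not the paper's exact formula:

```python
# Hypothetical combined route score: sum of per-link costs plus a penalty for
# new forwarding table entries at nodes whose tables are nearly full.
def route_score(route_links, route_nodes, link_cost, free_entries, needs_entry, w=1.0):
    cost = sum(link_cost[l] for l in route_links)
    # INC_FWD-style term: a node that needs a new entry contributes more
    # as its remaining forwarding-table space shrinks.
    fwd_penalty = sum(1.0 / max(free_entries[n], 1)
                      for n in route_nodes if needs_entry[n])
    return cost + w * fwd_penalty

# Example with made-up numbers in the spirit of the slide:
link_cost = {"ab": 0.01, "bc": 0.01}
free_entries = {"A": 100, "B": 200, "C": 300}
needs_entry = {"A": True, "B": True, "C": False}
print(route_score(["ab", "bc"], ["A", "B", "C"], link_cost, free_entries, needs_entry))
```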
Load-Balanced Routing
• Simulated network: 52 PMs with 4 NICs each, 384 links in total; replay 17 multi-VDC 300-second traces
• Compare: Random Shortest Path Routing (RSPR) vs. Full Link Criticality-based Routing (FLCR)
• Metric: congestion count, the number of links whose capacity is exceeded
• FLCR induces little additional traffic