DARD: Distributed Adaptive Routing for Datacenter Networks
Xin Wu, Xiaowei Yang
Multiple equal-cost paths in DCNs
• Scale-out topology -> Horizontal expansion -> More paths
[Figure: fat-tree topology with core, Agg, and ToR layers grouped into pods; multiple equal-cost paths connect src to dst]
Suboptimal scheduling -> hot spot
[Figure: flows src1 -> dst1 and src2 -> dst2 colliding on a shared link and creating a hot spot]
Unavoidable intra-datacenter traffic
• Common services: DNS, search, storage
• Auto-scaling: dynamic application instances
To prevent hot spots
• Distributed
– ECMP & VL2: flow-level hashing in switches
• Centralized
– Hedera: compute optimal scheduling in ONE server
Design Space
• Centralized: efficient but not robust
• Distributed: robust but not efficient
Goal: practical, efficient, robust
• Practical
– using well-proven technologies
• Efficient
– close-to-optimal traffic scheduling
• Robust
– no single point of failure
Design Space
• Centralized: efficient but not robust
• Distributed: robust but not efficient
• DARD: distributed, robust, AND efficient
Contributions
• Explore the possibility of distributed yet close-to-optimal flow scheduling in DCNs.
• A working implementation on a testbed.
• A proven upper bound on convergence time.
Intuition: minimize the maximum number of flows via a link
[Figure: three flows (src1 -> dst1, src2 -> dst2, src3 -> dst3) rebalanced across equal-cost paths]
Step 0: maximum # of flows via a link = 3
Step 1: maximum # of flows via a link = 2
Step 2: maximum # of flows via a link = 1
Architecture
Monitor network states
Compute next scheduling
Change flow’s path
• The control loop runs on every server independently (a minimal sketch follows)
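A minimal sketch of this control loop in Python (the callable names and the cycle length are illustrative assumptions, not the paper's code):

    import time

    def control_loop(monitor, compute, change_path, cycle_seconds=10.0):
        # monitor()          -> current network state (per-link flow counts, etc.)
        # compute(state)     -> list of flow moves to apply, possibly empty
        # change_path(move)  -> re-route one flow
        while True:
            state = monitor()              # 1. monitor network states
            for move in compute(state):    # 2. compute next scheduling
                change_path(move)          # 3. change the flow's path
            time.sleep(cycle_seconds)      # wait for the next control cycle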
Monitor network states
• src asks the switches for the #_of_flows and bandwidth of each link toward dst.
• src assembles these link states to identify the most and least congested paths to dst, as sketched below.
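A minimal sketch of the assembly step, assuming a path is treated as being as congested as its busiest link (consistent with the min-max intuition above; all names are illustrative):

    def congestion(path, link_flows):
        # path: list of link ids; link_flows: link id -> #_of_flows
        return max(link_flows[link] for link in path)

    def busiest_and_freest(paths_to_dst, link_flows):
        # Identify the most and least congested equal-cost paths to dst.
        p_busy = max(paths_to_dst, key=lambda p: congestion(p, link_flows))
        p_free = min(paths_to_dst, key=lambda p: congestion(p, link_flows))
        return p_busy, p_free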
Distributed computation
• Runs on every server
for each dst:
    p_busy = the most congested path from src to dst
    p_free = the least congested path from src to dst
    if moving one flow from p_busy to p_free would not
       create a path more congested than p_busy:
        move one flow from p_busy to p_free
• The number of steps to convergence is provably bounded (one pass is sketched below)
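A hypothetical sketch of one pass of this computation (it reuses the congestion/busiest_and_freest helpers sketched above; the strict inequality ensures every move lowers the local maximum, which is what bounds the steps to convergence):

    def schedule_once(dst_paths, link_flows, move_flow):
        # dst_paths: dst -> list of equal-cost paths (each a list of link ids)
        # move_flow(dst, old_path, new_path): re-route one flow, e.g. by
        # switching its src-dst address pair (next slides)
        for dst, paths in dst_paths.items():
            p_busy, p_free = busiest_and_freest(paths, link_flows)
            # After the move, every link on p_free carries one more flow.
            if congestion(p_free, link_flows) + 1 < congestion(p_busy, link_flows):
                move_flow(dst, p_busy, p_free)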
Change path: using a different src-dst address pair
[Figure: core1 through core4 own prefixes 1.0.0.0/8, 2.0.0.0/8, 3.0.0.0/8, 4.0.0.0/8; src holds one address per core (1.1.1.2, 2.1.1.2, 3.1.1.2, 4.1.1.2) and dst likewise (1.2.1.2, 2.2.1.2, 3.2.1.2, 4.2.1.2)]
• A src-dst address pair uniquely encodes a path (illustrated below)
• Forwarding tables are static
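A tiny illustration of the encoding, using the addresses from the figure above (first octet = core index; the helper function is a hypothetical name):

    SRC_ALIASES = {1: "1.1.1.2", 2: "2.1.1.2", 3: "3.1.1.2", 4: "4.1.1.2"}
    DST_ALIASES = {1: "1.2.1.2", 2: "2.2.1.2", 3: "3.2.1.2", 4: "4.2.1.2"}

    def address_pair_for_core(core):
        # Choosing which aliases to put in the packet header chooses the core.
        return SRC_ALIASES[core], DST_ALIASES[core]

    # address_pair_for_core(3) -> ("3.1.1.2", "3.2.1.2"),
    # i.e. the flow traverses core3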
[Figure: tor1 owns 1.1.1.0/24, 2.1.1.0/24, 3.1.1.0/24, 4.1.1.0/24; agg1 owns 1.1.0.0/16 and 2.1.0.0/16]

agg1's down-hill table (matched on dst):
dst          next hop
1.1.1.0/24   tor1
1.1.2.0/24   tor2
2.1.1.0/24   tor1
2.1.2.0/24   tor2

agg1's up-hill table (matched on src):
src          next hop
1.0.0.0/8    core1
2.0.0.0/8    core2
Forwarding example: E2 -> E1
[Figure: E1 (1.1.1.2) under tor1 and E2 (1.2.1.2) under tor2, connected via agg1/agg2 and core1]
Packet header: src 1.2.1.2, dst 1.1.1.2
At agg1, dst 1.1.1.2 hits the down-hill table entry 1.1.1.0/24, so the packet is forwarded to tor1.
Forwarding example: E1 -> E2
Packet header: src 1.1.1.2, dst 1.2.1.2
At agg1, dst 1.2.1.2 misses the down-hill table, so the up-hill table entry 1.0.0.0/8 matches src 1.1.1.2 and the packet is forwarded to core1.
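Both examples can be reproduced with a compact sketch of agg1's two-table lookup (the prefixes are exactly the slide's tables; real switches use longest-prefix match, simplified here to a linear scan, and the function name is an assumption):

    import ipaddress

    DOWN_HILL = {  # dst prefix -> next hop (toward the hosts)
        "1.1.1.0/24": "tor1", "1.1.2.0/24": "tor2",
        "2.1.1.0/24": "tor1", "2.1.2.0/24": "tor2",
    }
    UP_HILL = {    # src prefix -> next hop (toward the cores)
        "1.0.0.0/8": "core1", "2.0.0.0/8": "core2",
    }

    def next_hop(src, dst):
        # Try the down-hill table on dst first; fall back to up-hill on src.
        for prefix, hop in DOWN_HILL.items():
            if ipaddress.ip_address(dst) in ipaddress.ip_network(prefix):
                return hop
        for prefix, hop in UP_HILL.items():
            if ipaddress.ip_address(src) in ipaddress.ip_network(prefix):
                return hop
        raise LookupError("no matching entry")

    print(next_hop("1.2.1.2", "1.1.1.2"))  # E2 -> E1: "tor1" (down-hill hit)
    print(next_hop("1.1.1.2", "1.2.1.2"))  # E1 -> E2: "core1" (up-hill on src)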
Randomness: prevent path oscillation
• Add a random time interval to the control cycle
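A short sketch of the jittered cycle (the base period and jitter range are illustrative values; without jitter, servers reacting to the same stale state could all move flows to the same "free" path at once and oscillate):

    import random
    import time

    BASE_CYCLE = 10.0   # seconds per control cycle
    MAX_JITTER = 5.0    # extra random wait, desynchronizes the servers

    def wait_for_next_cycle():
        time.sleep(BASE_CYCLE + random.uniform(0.0, MAX_JITTER))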
Implementation
• DeterLab testbed
– 16-end-host fat-tree
– Monitoring: OpenFlow API
– Computation: daemon on end hosts
– One NIC, multiple addresses: IP aliases
– Static routes: OpenFlow forwarding table
– Multipath: IP-in-IP encapsulation
• ns-2 simulator
– For different & larger topologies
DARD fully utilizes the bisection bandwidth
[Figure: bisection bandwidth (Gbps) of pVLB, ECMP, DARD, and Hedera under intra-pod-dominant, random, and inter-pod-dominant traffic patterns]
• Simulation, 1024-end-host fat-tree
• pVLB: periodic flow-level VLB
DARD improves large file transfer time
[Figure: DARD vs. ECMP improvement in large-file transfer time vs. # of new files per second, under inter-pod-dominant, intra-pod-dominant, and random traffic patterns]
• Testbed, 16-end-host fat-tree
DARD converges in 2~3 control cycles
[Figure: convergence time (seconds) under inter-pod-dominant, random, and intra-pod-dominant traffic patterns]
• Simulation, 1024-end-host fat-tree, static traffic patterns
• One control cycle ≈ 10 seconds
Randomness prevents path oscillation
[Figure: number of times a flow switches its path under inter-pod-dominant, random, and intra-pod-dominant traffic patterns]
• Simulation, 128-end-host fat-tree
DARD’s control overhead is bounded by the topology
• control_traffic = #_of_servers x #_of_switches.
• Simulation, 128-end-host fat-tree
[Figure: control traffic (MB/s) vs. # of simultaneous flows for DARD and Hedera]
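As a rough sanity check (assuming the standard fat-tree counts, which this slide does not state): a 128-end-host fat-tree has k = 8, i.e. k^3/4 = 128 servers and 5k^2/4 = 80 switches, so the control traffic scales with 128 x 80 = 10,240 server-to-switch queries per control cycle, independent of how many flows are active.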
Conclusion
• DARD: Distributed Adaptive Routing for Datacenters
– Practical: well-proven end-host-based technologies
– Efficient: close-to-optimal traffic scheduling
– Robust: no single point of failure
Monitor network states -> compute next scheduling -> change the flow's path
Thank You!
Questions and comments: xinwu@cs.duke.edu