Application-aware Network Resource Allocation · 2016-04-21 · Emerging Communication Patterns 4...
Application-aware Network Resource Allocation
Chris Cai
Preliminary Exam Presentation
Thesis Committee: Professor Roy Campbell (UIUC), Professor Indranil Gupta (UIUC), Professor Klara Nahrstedt (UIUC), Dr. Franck Le (IBM Research)
In One Internet Minute
Networking In Big-data Era: Fast Growing Data Volume vs Staggering Network
Google: 3.2M searches
Skype: 124,560 calls
Youtube: 7.1M video views
Twitter: 429,600 tweets
Instagram: 42,900 photos
Facebook: 31.25M messages
Sources:
http://www.citoresearch.com/cloud/what-slows-down-enterprise-networks-7-deadly-sins-network-congestion
http://www.cio.com/article/2915592/social-media/7-staggering-social-media-use-by-the-minute-stats.html
http://www.infosecurity-magazine.com/news/half-of-all-network-devices-are
Factors Which Slow Down the Network
• Bad configuration management: 31% had critical configuration violations.
• Outdated hardware: 51% of all corporate network devices globally are aging or obsolete.
• Devices run amok: rogue app updates can conflict with business-critical apps.
Networking in Big-data Era: Data Analytics Across Datacenters
Walmart (Shanghai, China) Walmart (San Jose, the U.S.)
Walmart (Tokyo, Japan) Walmart (São Paulo, Brazil)
Internet
Find top K most popular items sold globally
Networking in Big-data Era: Emerging Communication Patterns
[Figure: four emerging communication patterns: MapReduce shuffle (mappers to reducers), Bulk Synchronous Parallel (workers synchronizing across supersteps t and t+1, e.g., graph processing), dataflow without explicit barriers (map and join operators, e.g., MapReduce Online*), and partition-aggregate (aggregators and workers, e.g., search).]
Original figures from Mosharaf Chowdhury and Ion Stoica. 2012. Coflow: a networking abstraction for cluster applications. *Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. 2010. MapReduce online.
Application-aware Network
Application-aware networking is the capacity of an intelligent network to maintain current information about the applications that connect to it.*
• Goal: to optimize their functioning as well as that of other applications or systems that they control.
• The information maintained includes application state and resource requirements.*
*Adapted from: http://searchsdn.techtarget.com/definition/application-aware-networking-app-ware-networking
Application-aware Network
• Application 1: type MapReduce; state Shuffle; demand BW > 1 Gbps for 10 mins
• Application 2: type Trading; state Bidding; demand latency < 10 ns for 30 mins
• Application 3: type Video Conference; state Live; demand loss rate < 10^-6 for 1 hour
• Application 4: type Data Backup; state Transmitting; demand back up 1 TB of data in 72 hours
Thesis Statement
• An application-aware network should take into account both application-level and network-level information, including network topology, and should leverage the global network footprints of commercial cloud service providers to build overlay networks for wide-area applications, in order to significantly improve performance and benefit a wider range of applications and users.
Phurti and CRONets
• Phurti (focus on a single datacenter network)
  – Phurti: Application and Network-Aware Flow Scheduling for Multi-Tenant MapReduce Clusters
  – Collaboration with Shayan Saeed, Indranil Gupta, and Roy Campbell
  – Published at the IEEE International Conference on Cloud Engineering (IC2E) 2016
• CRONets (focus on the wide-area network)
  – CRONets: Cloud-Routed Overlay Networks
  – Ongoing work in collaboration with IBM Research
  – Preliminary results and research plan
Phurti: Application and Network-Aware Flow Scheduling for Multi-Tenant MapReduce Clusters
Chris X. Cai*, Shayan Saeed*
Indranil Gupta*, Roy Campbell*, Franck Le†
*UIUC †IBM Research
Outline
• Introduction
• System Architecture
• Scheduling Algorithm
• Evaluation
Multi-tenancy in MapReduce Clusters
• Better ROI, higher utilization.
• The network is the primary bottleneck: Facebook jobs spend 33% of their time in communication.
• The reduce phase cannot start before the shuffle phase completes.
[Figure: users submit MapReduce jobs to a shared MapReduce cluster.]
Problem Statement
How to schedule network traffic to improve completion time for MapReduce jobs?
Application-Awareness in Scheduling
[Figure: Job 1 and Job 2 traffic share two links. Job 1 sends 3 units on Link 1 and 2 units on Link 2; Job 2 sends 6 units on Link 1.]
• Fair Sharing: Job 1 completion time = 5, Job 2 completion time = 6.
• Shortest Flow First: Job 1 completion time = 5, Job 2 completion time = 6.
• Application Aware: Job 1 completion time = 3, Job 2 completion time = 6.
Network-Awareness in Scheduling
[Figure: hosts N1–N4 connected through switches S1 and S2 give two paths. Job 1 traffic (3 units) and Job 2 traffic (3 units) can each take Path 1 or Path 2.]
Network-Awareness in Scheduling
[Figure: the same two 3-unit jobs scheduled over Paths 1 and 2.]
• Network-Agnostic: Job 1 completion time = 6, Job 2 completion time = 6.
• Network-Aware: Job 1 completion time = 3, Job 2 completion time = 6.
Takeaway: do not schedule interfering flows of different jobs together.
Related Work
• Traditional flow scheduling: PDQ [SIGCOMM '12], Hedera [NSDI '10]
  – Only improve network-level metrics
• Application-aware traffic schedulers: Baraat [SIGCOMM '14], Varys [SIGCOMM '14]
  – Unaware of network topology
Phurti: Contributions
• Improves Job Completion Time
• Starvation Protection
• Scalable
• API Compatibility
• Hardware Compatibility
Outline
• Introduction
• System Architecture
• Scheduling Algorithm
• Evaluation
Phurti Framework
[Figure: the Phurti scheduling framework connects to Hadoop nodes N1–N6 through a northbound API and to SDN switches S1 and S2 through a southbound API.]
Outline
• Introduction
• System Architecture
• Scheduling Algorithm
• Evaluation
Phurti Algorithm – Intuition
[Figure: Job 1's four flows (1–4) over paths P1 and P2 finish by t = 4; Job 2's flows finish by t = 5.]
• Max. sequential traffic: Job 1 = 4 units, Job 2 = 5 units.
• Job 1 completion time = 4; Job 2 completion time = 5.
Takeaway: job completion time is determined by the maximum sequential traffic.
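The metric above can be sketched in a few lines of Python. This is a simplified model, not the authors' implementation: flows are assumed to be given as (size, set-of-links) pairs, and the metric is the busiest link's total load.

```python
# Simplified sketch of "maximum sequential traffic": flows of one job
# that share a link must transfer one after another, so the busiest
# link bounds the job's completion time. Data model is hypothetical.

def max_sequential_traffic(flows):
    """flows: list of (size_in_units, set_of_link_ids). Returns the
    largest total traffic any single link must carry for this job."""
    load = {}
    for size, links in flows:
        for link in links:
            load[link] = load.get(link, 0) + size
    return max(load.values(), default=0)

# Illustrative numbers in the spirit of the slide: two flows on P1
# (1 + 3 units) and two on P2 (2 + 2 units) give a maximum of 4 units.
job1 = [(1, {"P1"}), (3, {"P1"}), (2, {"P2"}), (2, {"P2"})]
print(max_sequential_traffic(job1))  # -> 4
```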
Phurti Algorithm – Intuition (cont.)
[Figure: the same two jobs scheduled in different orders over P1 and P2. Max. sequential traffic: Job 1 = 4 units, Job 2 = 5 units.]
• If Job 1 is scheduled first: Job 1 completion time = 4, Job 2 completion time = 8.
• If Job 2 is scheduled first: Job 1 completion time = 8, Job 2 completion time = 5.
Observation: it is better to schedule the job with the smaller maximum sequential traffic first.
Phurti Algorithm
• Assign priorities to jobs based on maximum sequential traffic (latency improvement).
• Let the flows of the highest-priority job transfer.
• Let non-interfering flows of the lower-priority jobs transfer (throughput maximization).
• Let all other flows transfer at a small rate (starvation protection).
[Figure: hosts N1–N4 under switches s1, s3, s2.]
Job | Flows | Max Seq. Traffic | Priority
J1 | N1→N4, N4→N1 | 2 | LOW
J2 | N2→N3 | 1 | HIGH
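The rules above can be sketched as follows. This is a simplified model under assumed data structures, not the Phurti implementation: "interference" here means sharing any link, and the actual trickle rate is not modeled.

```python
# Sketch of the Phurti scheduling rule: jobs are prioritized by their
# maximum sequential traffic (MST); the top job's flows run at full
# rate, lower-priority flows run only if they touch no busy link, and
# everything else gets a small trickle rate (starvation protection).

def schedule(jobs):
    """jobs: {name: {"mst": int, "flows": [set_of_link_ids, ...]}}.
    Returns (full_rate, trickle) lists of (job, flow_index)."""
    order = sorted(jobs, key=lambda name: jobs[name]["mst"])
    busy, full_rate, trickle = set(), [], []
    for rank, name in enumerate(order):
        for i, links in enumerate(jobs[name]["flows"]):
            if rank == 0 or busy.isdisjoint(links):
                full_rate.append((name, i))
                busy |= links
            else:
                trickle.append((name, i))
    return full_rate, trickle

# In the spirit of the slide's example: J2 (MST 1) outranks J1 (MST 2),
# and J1's flows share J2's links, so they fall back to the trickle rate.
jobs = {"J1": {"mst": 2, "flows": [{"s1-s3", "s3-s2"}, {"s3-s2", "s1-s3"}]},
        "J2": {"mst": 1, "flows": [{"s1-s3", "s3-s2"}]}}
print(schedule(jobs))  # -> ([('J2', 0)], [('J1', 0), ('J1', 1)])
```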
Evaluation
• Baseline: Fair Sharing (Default in MapReduce)
• Testbed: 6 nodes, 2 HP SDN switches
• SWIM workload: generated from a Facebook Hadoop trace
Job Size Bin | % of total jobs | % of total bytes in shuffled data
Small | 62% | 5.5%
Medium | 16% | 10.3%
Large | 22% | 84.2%
Job Completion Time
[Figure: CDF of the difference in job completion time (sec) under Phurti vs. the baseline, ranging from -800 to 200.]
95% of jobs have better job completion time under Phurti.
Job Completion Time
[Figure: average and 95th-percentile fractional improvement by job type (overall, small, medium, large).]
13% improvement in 95th-percentile job completion time, showing starvation protection.
Flow Scheduling Overhead
Simulated a fat-tree topology with 128 hosts.
[Figure: scheduling time (milliseconds) vs. number of simultaneous flow arrivals, from 20 to 100.]
Even in the unlikely event of 100 simultaneous incoming flows, the scheduling time is 4.5 ms, a negligible overhead.
Flow Scheduling Overhead
Scheduling time for a new flow with 10 ongoing flows in the network:
[Figure: scheduling time vs. number of hosts.]
Scheduling overhead grows much more slowly than linearly, showing that Phurti is scalable with an increasing number of hosts.
Phurti vs Varys
Simulated a 128-host fat-tree topology with the core network having 1x, 5x, and 10x the capacity of the access links.
• Phurti performs at least as well as Varys in every case.
• Phurti outperforms Varys when the core network has much less capacity (is oversubscribed).
Phurti: Contributions
• Improves completion time for 95% of jobs and decreases the average completion time by 20% across all jobs.
• Starvation protection: improves tail job completion time by 15%.
• Scalable: shown to scale to 1024 hosts and 100 simultaneous flow arrivals.
• API compatibility
• Hardware compatibility
CRONets (Cloud-Routed Overlay Networks)
• Transmission over the Internet
  – No single party controls the routing or QoS of the Internet.
  – BGP (Border Gateway Protocol) is the standard protocol for exchanging routing and reachability information among Autonomous Systems (ASes).
  – BGP is designed to follow the commercial relationships among ASes, not to prioritize performance.
[Figure: AS A peers with AS B and has a provider and a customer.]
Motivation
• Current Internet routing does not take performance metrics (e.g., throughput, latency) into account when selecting paths
[Figure: four ISPs; BGP routes through an overloaded ISP even though alternative paths exist.]
CRONets (Cloud-Routed Overlay Networks)
Leverage cloud servers from public cloud providers (Amazon EC2, etc.) as overlay nodes to increase path diversity for users.
[Figure: a client reaches a web server either via the direct path or via an overlay path through an overlay server (Amazon, SoftLayer, etc.).]
Comparison vs Related Work
• DETOUR by Savage et al. (IEEE Micro, 1999)
  – First study to show that a large fraction of Internet paths could get improved performance through indirect routing.
• Resilient Overlay Networks by Andersen et al. (SOSP 2001)
  – An overlay network whose nodes communicate via a distributed protocol and are hosted mainly by universities across the world.
• ARROW by Peter et al. (SIGCOMM 2014)
  – Lets users create tunnels among participating ISPs and stitch together end-to-end paths to improve robustness against attacks and failures.
Comparison vs Related Work
The contributions of CRONets will include:
– The first study of an overlay network in a realistic cloud setting, with more than 6600 observed paths.
– The overlay paths examined do traverse commercial ASes (less biased than some previous studies based on Internet2).
– A detailed analysis of low-level network metrics to understand the key factors behind the performance gains.
– A proposal and evaluation of MPTCP for automatic selection of the best overlay server.
Overlay Modes
• Two overlay modes: non-split overlay and split overlay.
[Figure: in non-split mode, the client reaches the web server through the overlay server over one single TCP connection; in split mode, a TCP proxy at the overlay server terminates and relays the connection.]
Notation: BW = bandwidth; RTT = round-trip time; MSS = maximum segment size; p = probability of packet loss.
Large Scale Measurement
• Goal: evaluate whether CRONets can provide promising improvements in a realistic cloud setting.
• Each direct path is compared with 5 overlay paths (in both non-split and split modes) via a 100 MB file download.
[Figure: clients (PlanetLab nodes) download from web servers (Eclipse mirror servers) over the direct path and over overlay paths through overlay servers.]
Measurement Testbed
• Locations of Eclipse mirrors (blue labels): 3 in Europe, 3 in Asia, and 4 in North America (10 in total).
• Locations of overlay servers (red labels): Washington DC, San Jose, Dallas, Amsterdam, and Tokyo (5 in total). Each server is configured with a 100 Mbps network, 4 GB of memory, and a 2.0 GHz CPU.
• PlanetLab nodes as clients: 48 in Europe, 45 in America, 14 in Asia, and 3 in Australia (110 in total).
Total: 10 mirrors * (1 direct path + 5 overlay paths) * 110 clients = 6600 Internet paths.
Preliminary Results
• Number of measurement samples = 10 * 110 * (1 direct path + 5 overlay paths) = 6600 paths.
• Improvement factor = max overlay TCP throughput / direct TCP throughput.
[Figure: CDF of the improvement factor across the measured paths.]
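The improvement factor above is straightforward to compute per path. The throughput numbers below are made up for illustration, not measured results.

```python
# Improvement factor as defined above: the best of the five overlay
# paths' TCP throughputs divided by the direct path's throughput.

def improvement_factor(direct_tput, overlay_tputs):
    return max(overlay_tputs) / direct_tput

direct = 10.0                              # Mbps (illustrative)
overlays = [8.0, 12.0, 15.0, 9.0, 11.0]    # five overlay paths (illustrative)
factor = improvement_factor(direct, overlays)
print(factor)        # -> 1.5
print(factor > 1.0)  # True: at least one overlay path beats the direct path
```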
Controlled Servers Experiment
To collect more measurement data, we repeat the measurements with SoftLayer servers as the TCP senders, recording:
• Round-trip time (RTT) during TCP transmission
• Packet loss rate
• Path routing
• Discrete overlay throughput = min(throughput of segment A, throughput of segment B)
[Figure: a traffic sender (SoftLayer) reaches a traffic receiver (PlanetLab) through an overlay server (SoftLayer); segment A spans sender to overlay server, segment B spans overlay server to receiver.]
Research Plan: What Types of Paths Benefit the Most from CRONets?
• Throughput improvement vs. throughput of the direct path
• Throughput improvement vs. round-trip time of the direct path
• Throughput improvement vs. packet loss rate of the direct path
• Throughput improvement vs. path diversity
[Example: contrast one direct path with 10 Mbps throughput, 100 ms RTT, and 0.1% packet loss against another with 1 Mbps, 200 ms, and 1%.]
Research Plan: Persistency of Gains
[Figure: throughput of the overlay path vs. the direct path measured at Times 1 through 4.]
Adapting to Network Dynamics
• How to select the best path, i.e., the one offering the highest throughput?
[Figure: a client can reach a public server directly or through either of two overlay servers.]
Research Plan: Adapting to Network Dynamics
[Figure: MPTCP sub-flows run simultaneously over the direct path and the overlay path between the MPTCP endpoints.]
Research Plan: Overlay Network Simulation
[Figure: an AS-level topology containing congestion-free and congested ASes.]
Hypotheses and Expected Outcomes
• Controlled server experiment (controlled server as traffic sender)
  – Comparable performance improvements to those observed in the public web server experiment.
  – CRONets improves not only TCP throughput but also RTT and packet loss rate.
• Understanding the gains
  – CRONets would provide higher improvements for direct paths with larger RTT, higher packet loss rate, or lower direct TCP throughput, and via overlay paths with larger path diversity.
Hypotheses and Expected Outcomes
• Adapting to network dynamics
  – MPTCP would be able to perform close to the best overlay path.
• Overlay network simulation
  – We expect CRONets would remain a useful and efficient way for users to bypass congested ASes in a simulation experiment using a real Internet topology.
Thesis Contributions, Impact and Future Work
• A scheduling framework for multi-tenant MapReduce clusters with rich interfaces (application + network topology) between applications and the network (IC2E 2016).
• A very first attempt to study and understand overlay networks in a realistic cloud setting at large scale, with thousands of Internet paths (work in progress).
• Both Phurti and CRONets are examples of building blocks for implementing application-aware computing frameworks and services.
• Potential future work:
  – Combine Phurti and CRONets to support traffic patterns other than the MapReduce shuffle.
  – Extend CRONets to support streaming, video conferencing, etc.
Questions
Backup Slides
Effective Transmit Rate
[Figure: CDF of the effective transmit rate, from 0 to 1.2.]
80% of jobs have an effective transmit rate larger than 0.9, showing minimal throttling.
Split Overlay versus Non-split Overlay
TCP throughput formula (Mathis model, using the notation defined earlier): BW ≈ (MSS / RTT) · (C / √p), for a constant C.
[Figure: path A → B → C, with B the overlay node.]
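Under that formula, the advantage of split mode can be illustrated with a small sketch. The numbers are illustrative, and it assumes the overlay node sits at the RTT midpoint with each segment seeing the same loss rate; this is not an analysis from the talk.

```python
# Why split overlay can beat non-split: a TCP proxy at the overlay
# node halves the RTT each TCP control loop sees. Using the Mathis
# model BW ~ (MSS / RTT) * (C / sqrt(p)) with C = 1.
from math import sqrt

def mathis_tput(mss_bits, rtt_s, loss):
    return (mss_bits / rtt_s) * (1.0 / sqrt(loss))

MSS = 1460 * 8            # bits per segment
RTT, LOSS = 0.200, 1e-3   # direct path: 200 ms RTT, 0.1% loss (illustrative)

non_split = mathis_tput(MSS, RTT, LOSS)        # one end-to-end connection
# Split mode: two independent TCP connections, each over half the RTT;
# the path throughput is bounded by the slower segment (min of the two).
split = min(mathis_tput(MSS, RTT / 2, LOSS),
            mathis_tput(MSS, RTT / 2, LOSS))
print(split / non_split)  # -> 2.0: halving the RTT doubles Mathis throughput
```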
Split Overlay versus Non-split Overlay
[Figure: throughput comparison over the path A → B → C.]
Overlay Path Throughput Simulation
• Assumptions:
  – CRONets does not increase RTTs for overlay paths significantly: for a given direct path with round-trip time RTT_direct, the round-trip time of its corresponding overlay paths is α·RTT_direct, where α follows the normal distribution N(1, 0.1).
  – For a given direct path with packet loss rate p_direct, the loss rate of each of the two segments of the corresponding one-hop overlay path is β·p_direct.
Overlay Path Throughput Simulation
• Case 1: the cloud provider is able to provision a network with better quality than the direct path.
  – Let β follow the normal distribution N(0.5, 0.05).
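Case 1 can be sketched as a small Monte Carlo simulation. This is a hedged reconstruction of the setup described above using the Mathis throughput model; the exact simulation in the talk may differ, and the midpoint placement of the overlay node is an added assumption.

```python
# Monte Carlo sketch of Case 1: alpha ~ N(1, 0.1) scales the overlay
# RTT, beta ~ N(0.5, 0.05) scales the per-segment loss rate, and
# throughput follows the Mathis model. Assumes the overlay node sits
# midway, so each segment sees half the overlay RTT. Illustrative only.
import random
from math import sqrt

random.seed(0)

def mathis_tput(rtt_s, loss, mss_bits=1460 * 8):
    return (mss_bits / rtt_s) * (1.0 / sqrt(loss))

def improvement(rtt_direct=0.100, p_direct=1e-3):
    alpha = random.gauss(1.0, 0.1)   # overlay RTT = alpha * direct RTT
    beta = random.gauss(0.5, 0.05)   # per-segment loss = beta * direct loss
    direct = mathis_tput(rtt_direct, p_direct)
    segment = mathis_tput(alpha * rtt_direct / 2, beta * p_direct)
    return min(segment, segment) / direct  # slower segment bounds the path

factors = [improvement() for _ in range(10_000)]
print(sum(f > 1.0 for f in factors) / len(factors))  # fraction of paths improved
```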
Overlay Path Throughput Simulation
• Case 2: the packet loss rates of the overlay paths and the direct paths are comparable.
  – Let β follow the normal distribution N(1, 0.05).