Application-aware Network Resource Allocation · 2016-04-21 · Emerging Communication Patterns 4...
Application-aware Network Resource Allocation
Chris Cai
Preliminary Exam Presentation
Thesis Committee: Professor Roy Campbell (UIUC), Professor Indranil Gupta (UIUC), Professor Klara Nahrstedt (UIUC), Dr. Franck Le (IBM Research)
In One Internet Minute
Networking In Big-data Era: Fast Growing Data Volume vs Staggering Network
Google: 3.2M searches
Skype: 124,560 calls
Youtube: 7.1M video views
Twitter: 429,600 tweets
Instagram: 42,900 photos
Facebook: 31.25M messages
Sources:
http://www.citoresearch.com/cloud/what-slows-down-enterprise-networks-7-deadly-sins-network-congestion
http://www.cio.com/article/2915592/social-media/7-staggering-social-media-use-by-the-minute-stats.html
http://www.infosecurity-magazine.com/news/half-of-all-network-devices-are
Factors Which Slow Down the Network
• Bad configuration management: 31% had critical configuration violations.
• Outdated hardware: 51% of all corporate network devices globally are aging or obsolete.
• Devices run amok: rogue app updates can conflict with business-critical apps.
Networking in Big-data Era: Data Analytics Across Datacenters
Walmart (Shanghai, China) Walmart (San Jose, the U.S.)
Walmart (Tokyo, Japan) Walmart (São Paulo, Brazil)
Internet
Find top K most popular items sold globally
Networking in Big-data Era: Emerging Communication Patterns
[Figure: four emerging communication patterns: MapReduce shuffle (mappers to reducers), Bulk Synchronous Parallel (workers synchronizing across supersteps t and t+1, e.g., graph processing), dataflow without explicit barriers (map and join operators, e.g., MapReduce Online*), and partition-aggregate (aggregators and workers, e.g., search).]
Original figures from Mosharaf Chowdhury and Ion Stoica. 2012. Coflow: a networking abstraction for cluster applications. *Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. 2010. MapReduce online.
Application-aware Network
Application-aware networking is the capacity of an intelligent network to maintain current information about the applications that connect to it.*
• Goal: to optimize their functioning as well as that of other applications or systems that they control.
• The information maintained includes application state and resource requirements.*
*Adapted from: http://searchsdn.techtarget.com/definition/application-aware-networking-app-ware-networking
Application-aware Network
• Application 1: type MapReduce; state Shuffle; demand BW > 1 Gbps for 10 mins
• Application 2: type Trading; state Bidding; demand latency < 10 ns for 30 mins
• Application 3: type Video Conference; state Live; demand loss rate < 10^-6 for 1 hour
• Application 4: type Data Backup; state Transmitting; demand back up 1 TB of data in 72 hours
Thesis Statement
• An application-aware network should take into account both application-level and network-level information, including network topology, and should leverage the global network footprints of commercial cloud service providers to build overlay networks for wide-area applications, in order to significantly improve performance and benefit a wider range of applications and users.
Phurti and CRONets
• Phurti (focus on a single datacenter network)
  – Phurti: Application and Network-Aware Flow Scheduling for Multi-Tenant MapReduce Clusters
  – Collaboration with Shayan Saeed, Indranil Gupta, and Roy Campbell
  – Published at the IEEE International Conference on Cloud Engineering (IC2E) 2016
• CRONets (focus on the wide-area network)
  – CRONets: Cloud-Routed Overlay Networks
  – Ongoing work in collaboration with IBM Research
  – Preliminary results and research plan
Phurti: Application and Network-Aware Flow Scheduling for Multi-Tenant MapReduce Clusters
Chris X. Cai*, Shayan Saeed*
Indranil Gupta*, Roy Campbell*, Franck Le†
*UIUC †IBM Research
Outline
• Introduction
• System Architecture
• Scheduling Algorithm
• Evaluation
Multi-tenancy in MapReduce Clusters
• Better ROI, higher utilization.
• The network is the primary bottleneck: Facebook jobs spend 33% of their time in communication.
• The reduce phase cannot start before the shuffle phase completes.
[Figure: users submit MapReduce jobs to a shared MapReduce cluster.]
Problem Statement
How to schedule network traffic to improve completion time for MapReduce jobs?
Application-Awareness in Scheduling
[Figure: Job 1 and Job 2 traffic share two links. Job 1 sends 3 units on Link 1 and 2 units on Link 2; Job 2 sends 6 units on Link 1.]
• Fair Sharing: Job 1 completion time = 5, Job 2 completion time = 6.
• Shortest Flow First: Job 1 completion time = 5, Job 2 completion time = 6.
• Application Aware: Job 1 completion time = 3, Job 2 completion time = 6.
Network-Awareness in Scheduling
[Figure: hosts N1–N4 connected through switches S1 and S2 give two paths. Job 1 traffic (3 units) and Job 2 traffic (3 units) can each take Path 1 or Path 2.]
Network-Awareness in Scheduling
[Figure: the same two 3-unit jobs scheduled over Paths 1 and 2.]
• Network-Agnostic: Job 1 completion time = 6, Job 2 completion time = 6.
• Network-Aware: Job 1 completion time = 3, Job 2 completion time = 6.
Takeaway: do not schedule interfering flows of different jobs together.
Related Work
• Traditional flow scheduling: PDQ [SIGCOMM '12], Hedera [NSDI '10]
  – Only improve network-level metrics
• Application-aware traffic schedulers: Baraat [SIGCOMM '14], Varys [SIGCOMM '14]
  – Unaware of network topology
Phurti: Contributions
• Improves Job Completion Time
• Starvation Protection
• Scalable
• API Compatibility
• Hardware Compatibility
Outline
• Introduction
• System Architecture
• Scheduling Algorithm
• Evaluation
Phurti Framework
[Figure: the Phurti scheduling framework connects to Hadoop nodes N1–N6 through a northbound API and to SDN switches S1 and S2 through a southbound API.]
Outline
• Introduction
• System Architecture
• Scheduling Algorithm
• Evaluation
Phurti Algorithm – Intuition
[Figure: Job 1's four flows (1–4) over paths P1 and P2 finish by t = 4; Job 2's flows finish by t = 5.]
• Max. sequential traffic: Job 1 = 4 units, Job 2 = 5 units.
• Job 1 completion time = 4; Job 2 completion time = 5.
Takeaway: job completion time is determined by the maximum sequential traffic.
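The metric above can be sketched in a few lines of Python. This is a simplified model, not the authors' implementation: flows are assumed to be given as (size, set-of-links) pairs, and the metric is the busiest link's total load.

```python
# Simplified sketch of "maximum sequential traffic": flows of one job
# that share a link must transfer one after another, so the busiest
# link bounds the job's completion time. Data model is hypothetical.

def max_sequential_traffic(flows):
    """flows: list of (size_in_units, set_of_link_ids). Returns the
    largest total traffic any single link must carry for this job."""
    load = {}
    for size, links in flows:
        for link in links:
            load[link] = load.get(link, 0) + size
    return max(load.values(), default=0)

# Illustrative numbers in the spirit of the slide: two flows on P1
# (1 + 3 units) and two on P2 (2 + 2 units) give a maximum of 4 units.
job1 = [(1, {"P1"}), (3, {"P1"}), (2, {"P2"}), (2, {"P2"})]
print(max_sequential_traffic(job1))  # -> 4
```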
Phurti Algorithm – Intuition (cont.)
[Figure: the same two jobs scheduled in different orders over P1 and P2. Max. sequential traffic: Job 1 = 4 units, Job 2 = 5 units.]
• If Job 1 is scheduled first: Job 1 completion time = 4, Job 2 completion time = 8.
• If Job 2 is scheduled first: Job 1 completion time = 8, Job 2 completion time = 5.
Observation: it is better to schedule the job with the smaller maximum sequential traffic first.
Phurti Algorithm
• Assign priorities to jobs based on maximum sequential traffic (latency improvement).
• Let the flows of the highest-priority job transfer.
• Let non-interfering flows of the lower-priority jobs transfer (throughput maximization).
• Let all other flows transfer at a small rate (starvation protection).
[Figure: hosts N1–N4 under switches s1, s3, s2.]
Job | Flows | Max Seq. Traffic | Priority
J1 | N1→N4, N4→N1 | 2 | LOW
J2 | N2→N3 | 1 | HIGH
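The rules above can be sketched as follows. This is a simplified model under assumed data structures, not the Phurti implementation: "interference" here means sharing any link, and the actual trickle rate is not modeled.

```python
# Sketch of the Phurti scheduling rule: jobs are prioritized by their
# maximum sequential traffic (MST); the top job's flows run at full
# rate, lower-priority flows run only if they touch no busy link, and
# everything else gets a small trickle rate (starvation protection).

def schedule(jobs):
    """jobs: {name: {"mst": int, "flows": [set_of_link_ids, ...]}}.
    Returns (full_rate, trickle) lists of (job, flow_index)."""
    order = sorted(jobs, key=lambda name: jobs[name]["mst"])
    busy, full_rate, trickle = set(), [], []
    for rank, name in enumerate(order):
        for i, links in enumerate(jobs[name]["flows"]):
            if rank == 0 or busy.isdisjoint(links):
                full_rate.append((name, i))
                busy |= links
            else:
                trickle.append((name, i))
    return full_rate, trickle

# In the spirit of the slide's example: J2 (MST 1) outranks J1 (MST 2),
# and J1's flows share J2's links, so they fall back to the trickle rate.
jobs = {"J1": {"mst": 2, "flows": [{"s1-s3", "s3-s2"}, {"s3-s2", "s1-s3"}]},
        "J2": {"mst": 1, "flows": [{"s1-s3", "s3-s2"}]}}
print(schedule(jobs))  # -> ([('J2', 0)], [('J1', 0), ('J1', 1)])
```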
Evaluation
• Baseline: Fair Sharing (Default in MapReduce)
• Testbed: 6 nodes, 2 HP SDN switches
• SWIM workload: generated from a Facebook Hadoop trace
Job Size Bin | % of total jobs | % of total bytes in shuffled data
Small | 62% | 5.5%
Medium | 16% | 10.3%
Large | 22% | 84.2%
Job Completion Time
[Figure: CDF of the difference in job completion time (sec) under Phurti vs. the baseline, ranging from -800 to 200.]
95% of jobs have better job completion time under Phurti.
Job Completion Time
[Figure: average and 95th-percentile fractional improvement by job type (overall, small, medium, large).]
13% improvement in 95th-percentile job completion time, showing starvation protection.
Flow Scheduling Overhead
Simulated a fat-tree topology with 128 hosts.
[Figure: scheduling time (milliseconds) vs. number of simultaneous flow arrivals, from 20 to 100.]
Even in the unlikely event of 100 simultaneous incoming flows, the scheduling time is 4.5 ms, a negligible overhead.
Flow Scheduling Overhead
Scheduling time for a new flow with 10 ongoing flows in the network:
[Figure: scheduling time vs. number of hosts.]
Scheduling overhead grows much more slowly than linearly, showing that Phurti is scalable with an increasing number of hosts.
Phurti vs Varys
Simulated a 128-host fat-tree topology with the core network having 1x, 5x, and 10x the capacity of the access links.
• Phurti performs at least as well as Varys in every case.
• Phurti outperforms Varys when the core network has much less capacity (is oversubscribed).
Phurti: Contributions
• Improves completion time for 95% of jobs and decreases the average completion time by 20% across all jobs.
• Starvation protection: improves tail job completion time by 15%.
• Scalable: shown to scale to 1024 hosts and 100 simultaneous flow arrivals.
• API compatibility
• Hardware compatibility
CRONets (Cloud-Routed Overlay Networks)
• Transmission over the Internet
  – No single party controls the routing or QoS of the Internet.
  – BGP (Border Gateway Protocol) is the standard protocol for exchanging routing and reachability information among Autonomous Systems (ASes).
  – BGP is designed to follow the commercial relationships among ASes, not to prioritize performance.
[Figure: AS A peers with AS B and has a provider and a customer.]
Motivation
• Current Internet routing does not take performance metrics (e.g., throughput, latency) into account when selecting paths
[Figure: four ISPs; BGP routes through an overloaded ISP even though alternative paths exist.]
CRONets (Cloud-Routed Overlay Networks)
Leverage cloud servers from public cloud providers (Amazon EC2, etc.) as overlay nodes to increase path diversity for users.
[Figure: a client reaches a web server either via the direct path or via an overlay path through an overlay server (Amazon, SoftLayer, etc.).]
Comparison vs Related Work
• DETOUR by Savage et al. (IEEE Micro, 1999)
  – First study to show that a large fraction of Internet paths could get improved performance through indirect routing.
• Resilient Overlay Networks by Andersen et al. (SOSP 2001)
  – An overlay network whose nodes communicate via a distributed protocol and are hosted mainly by universities across the world.
• ARROW by Peter et al. (SIGCOMM 2014)
  – Lets users create tunnels among participating ISPs and stitch together end-to-end paths to improve robustness against attacks and failures.
Comparison vs Related Work
The contributions of CRONets will include:
– The first study of an overlay network in a realistic cloud setting, with more than 6600 observed paths.
– The overlay paths examined do traverse commercial ASes (less biased than some previous studies based on Internet2).
– A detailed analysis of low-level network metrics to understand the key factors behind the performance gains.
– A proposal and evaluation of MPTCP for automatic selection of the best overlay server.
Overlay Modes
• Two overlay modes: non-split overlay and split overlay.
[Figure: in non-split mode, the client reaches the web server through the overlay server over one single TCP connection; in split mode, a TCP proxy at the overlay server terminates and relays the connection.]
Notation: BW = bandwidth; RTT = round-trip time; MSS = maximum segment size; p = probability of packet loss.
Large Scale Measurement
• Goal: evaluate whether CRONets can provide promising improvements in a realistic cloud setting.
• Each direct path is compared with 5 overlay paths (in both non-split and split modes) via a 100 MB file download.
[Figure: clients (PlanetLab nodes) download from web servers (Eclipse mirror servers) over the direct path and over overlay paths through overlay servers.]
Measurement Testbed
• Locations of Eclipse mirrors (blue labels): 3 in Europe, 3 in Asia, and 4 in North America (10 in total).
• Locations of overlay servers (red labels): Washington DC, San Jose, Dallas, Amsterdam, and Tokyo (5 in total). Each server is configured with a 100 Mbps network, 4 GB of memory, and a 2.0 GHz CPU.
• PlanetLab nodes as clients: 48 in Europe, 45 in America, 14 in Asia, and 3 in Australia (110 in total).
Total: 10 mirrors * (1 direct path + 5 overlay paths) * 110 clients = 6600 Internet paths.
Preliminary Results
• Number of measurement samples = 10 * 110 * (1 direct path + 5 overlay paths) = 6600 paths.
• Improvement factor = max overlay TCP throughput / direct TCP throughput.
[Figure: CDF of the improvement factor across the measured paths.]
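The improvement factor above is straightforward to compute per path. The throughput numbers below are made up for illustration, not measured results.

```python
# Improvement factor as defined above: the best of the five overlay
# paths' TCP throughputs divided by the direct path's throughput.

def improvement_factor(direct_tput, overlay_tputs):
    return max(overlay_tputs) / direct_tput

direct = 10.0                              # Mbps (illustrative)
overlays = [8.0, 12.0, 15.0, 9.0, 11.0]    # five overlay paths (illustrative)
factor = improvement_factor(direct, overlays)
print(factor)        # -> 1.5
print(factor > 1.0)  # True: at least one overlay path beats the direct path
```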
Controlled Servers Experiment
To collect more measurement data, we repeat the measurements with SoftLayer servers as the TCP senders, recording:
• Round-trip time (RTT) during TCP transmission
• Packet loss rate
• Path routing
• Discrete overlay throughput = min(throughput of segment A, throughput of segment B)
[Figure: a traffic sender (SoftLayer) reaches a traffic receiver (PlanetLab) through an overlay server (SoftLayer); segment A spans sender to overlay server, segment B spans overlay server to receiver.]
Research Plan: What Types of Paths Benefit the Most from CRONets?
• Throughput improvement vs. throughput of the direct path
• Throughput improvement vs. round-trip time of the direct path
• Throughput improvement vs. packet loss rate of the direct path
• Throughput improvement vs. path diversity
[Example: contrast one direct path with 10 Mbps throughput, 100 ms RTT, and 0.1% packet loss against another with 1 Mbps, 200 ms, and 1%.]
Research Plan: Persistency of Gains
[Figure: throughput of the overlay path vs. the direct path measured at Times 1 through 4.]
Adapting to Network Dynamics
• How to select the best path, i.e., the one offering the highest throughput?
[Figure: a client can reach a public server directly or through either of two overlay servers.]
Research Plan: Adapting to Network Dynamics
[Figure: MPTCP sub-flows run simultaneously over the direct path and the overlay path between the MPTCP endpoints.]
Research Plan: Overlay Network Simulation
[Figure: an AS-level topology containing congestion-free and congested ASes.]
Hypotheses and Expected Outcomes
• Controlled server experiment (controlled server as traffic sender)
  – Comparable performance improvements to those observed in the public web server experiment.
  – CRONets improves not only TCP throughput but also RTT and packet loss rate.
• Understanding the gains
  – CRONets would provide higher improvements for direct paths with larger RTT, higher packet loss rate, or lower direct TCP throughput, and via overlay paths with larger path diversity.
Hypotheses and Expected Outcomes
• Adapting to network dynamics
  – MPTCP would be able to perform close to the best overlay path.
• Overlay network simulation
  – We expect CRONets would remain a useful and efficient way for users to bypass congested ASes in a simulation experiment using a real Internet topology.
Thesis Contributions, Impact and Future Work
• A scheduling framework for multi-tenant MapReduce clusters with rich interfaces (application + network topology) between applications and the network (IC2E 2016).
• A very first attempt to study and understand overlay networks in a realistic cloud setting at large scale, with thousands of Internet paths (work in progress).
• Both Phurti and CRONets are examples of building blocks for implementing application-aware computing frameworks and services.
• Potential future work:
  – Combine Phurti and CRONets to support traffic patterns other than the MapReduce shuffle.
  – Extend CRONets to support streaming, video conferencing, etc.
Questions
Backup Slides
Effective Transmit Rate
[Figure: CDF of the effective transmit rate, from 0 to 1.2.]
80% of jobs have an effective transmit rate larger than 0.9, showing minimal throttling.
Split Overlay versus Non-split Overlay
TCP throughput formula (Mathis model, using the notation defined earlier): BW ≈ (MSS / RTT) · (C / √p), for a constant C.
[Figure: path A → B → C, with B the overlay node.]
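Under that formula, the advantage of split mode can be illustrated with a small sketch. The numbers are illustrative, and it assumes the overlay node sits at the RTT midpoint with each segment seeing the same loss rate; this is not an analysis from the talk.

```python
# Why split overlay can beat non-split: a TCP proxy at the overlay
# node halves the RTT each TCP control loop sees. Using the Mathis
# model BW ~ (MSS / RTT) * (C / sqrt(p)) with C = 1.
from math import sqrt

def mathis_tput(mss_bits, rtt_s, loss):
    return (mss_bits / rtt_s) * (1.0 / sqrt(loss))

MSS = 1460 * 8            # bits per segment
RTT, LOSS = 0.200, 1e-3   # direct path: 200 ms RTT, 0.1% loss (illustrative)

non_split = mathis_tput(MSS, RTT, LOSS)        # one end-to-end connection
# Split mode: two independent TCP connections, each over half the RTT;
# the path throughput is bounded by the slower segment (min of the two).
split = min(mathis_tput(MSS, RTT / 2, LOSS),
            mathis_tput(MSS, RTT / 2, LOSS))
print(split / non_split)  # -> 2.0: halving the RTT doubles Mathis throughput
```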
Split Overlay versus Non-split Overlay
[Figure: throughput comparison over the path A → B → C.]
Overlay Path Throughput Simulation
• Assumptions:
  – CRONets does not increase RTTs for overlay paths significantly: for a given direct path with round-trip time RTT_direct, the round-trip time of its corresponding overlay paths is α·RTT_direct, where α follows the normal distribution N(1, 0.1).
  – For a given direct path with packet loss rate p_direct, the loss rate of each of the two segments of the corresponding one-hop overlay path is β·p_direct.
Overlay Path Throughput Simulation
• Case 1: the cloud provider is able to provision a network with better quality than the direct path.
  – Let β follow the normal distribution N(0.5, 0.05).
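Case 1 can be sketched as a small Monte Carlo simulation. This is a hedged reconstruction of the setup described above using the Mathis throughput model; the exact simulation in the talk may differ, and the midpoint placement of the overlay node is an added assumption.

```python
# Monte Carlo sketch of Case 1: alpha ~ N(1, 0.1) scales the overlay
# RTT, beta ~ N(0.5, 0.05) scales the per-segment loss rate, and
# throughput follows the Mathis model. Assumes the overlay node sits
# midway, so each segment sees half the overlay RTT. Illustrative only.
import random
from math import sqrt

random.seed(0)

def mathis_tput(rtt_s, loss, mss_bits=1460 * 8):
    return (mss_bits / rtt_s) * (1.0 / sqrt(loss))

def improvement(rtt_direct=0.100, p_direct=1e-3):
    alpha = random.gauss(1.0, 0.1)   # overlay RTT = alpha * direct RTT
    beta = random.gauss(0.5, 0.05)   # per-segment loss = beta * direct loss
    direct = mathis_tput(rtt_direct, p_direct)
    segment = mathis_tput(alpha * rtt_direct / 2, beta * p_direct)
    return min(segment, segment) / direct  # slower segment bounds the path

factors = [improvement() for _ in range(10_000)]
print(sum(f > 1.0 for f in factors) / len(factors))  # fraction of paths improved
```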
Overlay Path Throughput Simulation
• Case 2: the packet loss rates of the overlay paths and the direct paths are comparable.
  – Let β follow the normal distribution N(1, 0.05).