
Optimizing Big Data Analytics Frameworks in Geographically Distributed Datacenters

by

Shuhao Liu

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

The Edward S. Rogers Sr. Department of Electrical and Computer Engineering

University of Toronto

© Copyright 2019 by Shuhao Liu


Abstract

Optimizing Big Data Analytics Frameworks in Geographically Distributed Datacenters

Shuhao Liu
Doctor of Philosophy

The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto

2019

A variety of Internet applications rely on big data analytics frameworks to efficiently

process large volumes of raw data. As such applications expand to a continental or even

global scale, their raw data input can be generated and stored in different datacenters.

The performance of data analytics jobs is likely to suffer, as transferring data among

workers located in different datacenters is expensive.

In this dissertation, we propose a series of system optimizations for big data analytics

frameworks that are deployed across geographically distributed datacenters. Our work optimizes the following three components in their architectural design:

Inter-datacenter network transfers. Few measures have been taken to handle the unpredictability and scarcity of available inter-datacenter bandwidth. With the awareness of job-level performance, we focus on expediting inter-datacenter coflows, which are

collections of parallel flows associated with job-level communication requirements. We

propose novel scheduling and routing strategies to minimize the coflow completion times.

These strategies are implemented in Siphon, a software-defined overlay network deployed

atop inter-datacenter networks.

Shuffles. Data analytics frameworks all provide a set of parallel operators as application building blocks to simplify development. Some operators trigger all-to-all transfers among worker nodes, known as shuffles. Shuffles are the source of inter-datacenter traffic, and their performance is critical. Our work focuses on improving network utilization during shuffles by adopting a new push-based mechanism that allows individual flow transfers to start early.


Graph analytics APIs. Graph analytics is among the most important classes of algorithms natively supported by typical big data analytics frameworks. Bulk Synchronous Parallel (BSP) is the state-of-the-art synchronization model for parallelizing graph algorithms, which generates a significant amount of traffic among worker nodes in every iteration. We propose a Hierarchical Synchronous Parallel (HSP) model, which is designed to reduce the demand for inter-datacenter transfers. HSP achieves this goal without sacrificing algorithm correctness, resulting in better performance and a lower monetary cost.

We have implemented and tested the prototypes based on Apache Spark, one of the

most popular data analytics frameworks. Extensive experimental results on real public

clouds across multiple geographical regions have shown their effectiveness.


To my family


Acknowledgements

It has been more than four years since I started my first day at the University of Toronto. At that time, I was a student who knew little about research in computer science, and was a bit afraid of speaking English in public. Four years later, with all the blood, sweat, and tears of PhD study in memory, I could not be more grateful and thankful.

First and foremost, I would like to express my sincere appreciation to Prof. Baochun Li, my PhD advisor, for his guidance throughout the years. His vision in research and his methodology of mentoring students are innovative and effective. He has also given me career advice, shaped my values, and corrected my bad work habits. I am very lucky to have Prof. Baochun Li as my advisor, my mentor, my collaborator, and my friend.

I would like to thank my colleagues from the iQua research group: Hong Xu, Wei Wang, Jun Li, Li Chen, Liyao Xiang, Weiwei Fang, Xiaoyan Yin, Yilun Wu, Yinan Liu, Zhiming Hu, Jingjie Jiang, Shiyao Ma, Yanjiao Chen, Hao Wang, Hongyu Huang, Wanyu Lin, Xu Yuan, Wenxin Li, Siqi Ji, Jiapin Lin, Tracy Cheng, Jiayue Li, Yuanxiang Gao, Chen Ying, and Yifan Gong. Even when we were working on different projects, they would always offer me a helping hand. Having meetings and discussions with them has broadened my knowledge. I would like to give my special gratitude to Li Chen, who has been my closest collaborator. We worked side by side for countless hours and solved hard research problems together. It has been a joyful experience working with her.

Finally, I would like to thank my family for their unconditional love. Though I have been tens of thousands of miles away from home, I can always feel their care and support. Life changed a lot during these four years, but their love has always been as solid as a rock.

This dissertation serves as the ultimate milestone of my PhD study, and this journey will be my life-long treasure. Without all the help and support I received throughout the years, this dissertation would not have been possible.


Contents

Acknowledgements v

Table of Contents vi

List of Tables ix

List of Figures x

1 Introduction 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 Background and Related Work 10
2.1 Wide-Area Data Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Network Optimization for Data Analytics . . . . . . . . . . . . . . . . . 12
2.3 Software-Defined Networking . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Optimizing Shuffle in Data Analytics . . . . . . . . . . . . . . . . . . . . 14
2.5 Distributed Graph Analytics . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Siphon: Expediting Inter-Datacenter Coflows 16
3.1 Motivation and Background . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Scheduling Inter-Datacenter Coflows . . . . . . . . . . . . . . . . . . . . 21

3.2.1 Inter-Coflow Scheduling . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.2 Intra-Coflow Scheduling . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.3 Multi-Path Routing . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.4 A Flow’s Life in Siphon . . . . . . . . . . . . . . . . . . . . . . . 31

3.3 Siphon: Design and Implementation . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.2 Data Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33


3.3.3 Connections for Inter-Datacenter Links . . . . . . . . . . . . . . . 36

3.3.4 Control Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.4 Controller-Data Plane Interaction in Siphon . . . . . . . . . . . . . . . . 39

3.4.1 Inefficiency of Reactive Control in OpenFlow . . . . . . . . . . . . 40

3.4.2 Caching Dynamic Forwarding Logic . . . . . . . . . . . . . . . . . 42

3.4.3 Customizing the Message Processing Logic . . . . . . . . . . . . . 44

3.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.5.1 Macro-Benchmark Tests . . . . . . . . . . . . . . . . . . . . . . . 47

3.5.2 Single Coflow Tests . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.5.3 Inter-Coflow Scheduling . . . . . . . . . . . . . . . . . . . . . . . 57

3.5.4 Aggregators: Stress Tests . . . . . . . . . . . . . . . . . . . . . . 58

3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4 Optimizing Shuffle in Wide Area Data Analytics 61

4.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.1.1 Fetch-based Shuffle . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.1.2 Problems with Fetch in Wide-Area Data Analytics . . . . . . . . 63

4.2 Transferring Shuffle Input across Datacenters . . . . . . . . . . . . . . . . 64

4.2.1 Transferring Shuffle Input: Timing . . . . . . . . . . . . . . . . . 65

4.2.2 Transferring Shuffle Input: Choosing Destinations . . . . . . . . . 66

4.2.3 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . 69

4.3 Implementation on Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.3.2 transferTo(): Enforced Data Transfer in Spark . . . . . . . . . . 71

4.3.3 Implementation Details of tranferTo() . . . . . . . . . . . . . . 73

4.3.4 Automatic Push/Aggregate . . . . . . . . . . . . . . . . . . . . . 76

4.3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.4.1 Cluster Configurations . . . . . . . . . . . . . . . . . . . . . . . . 80

4.4.2 Job Completion Time . . . . . . . . . . . . . . . . . . . . . . . . 83

4.4.3 Cross-Region Traffic . . . . . . . . . . . . . . . . . . . . . . . . . 85

4.4.4 Stage Execution Time . . . . . . . . . . . . . . . . . . . . . . . . 86

4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87


5 A Hierarchical Synchronous Parallel Model 89
5.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 Hierarchical Synchronous Parallel Model . . . . . . . . . . . . . . . . . . 94

5.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.2 Model Formulation and Description . . . . . . . . . . . . . . . . . 96
5.2.3 Proof of Convergence and Correctness . . . . . . . . . . . . . . . 99
5.2.4 Rate of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2.5 PageRank Example: a Numerical Verification . . . . . . . . . . . 103

5.3 Prototype Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4.2 WAN Bandwidth Usage . . . . . . . . . . . . . . . . . . . . . . . 109
5.4.3 Performance and Total Cost Analysis . . . . . . . . . . . . . . . . 111
5.4.4 Rate of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . 111

5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

6 Concluding Remarks 113
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

Bibliography 116


List of Tables

3.1 Peak TCP throughput (Mbps) achieved across different regions on the Google Cloud Platform. . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.2 Summary of prototype network applications implemented in Siphon . . . 46
3.3 Summary of shuffles in different workloads (present the run with median application run time). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 The overall throughput for 6 concurrent data fetches. . . . . . . . . . . . 59

4.1 The specifications of four workloads used in the evaluation. . . . . . . . . 80

5.1 Summary of the used datasets. . . . . . . . . . . . . . . . . . . . . . . . 108
5.2 WAN bandwidth usage comparison. . . . . . . . . . . . . . . . . . . . . . 109


List of Figures

1.1 An overview of our work in the architectural design of a general data analytics framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1 Key terminologies and concepts in a data analytics framework. . . . . . . 11

3.1 A case of inter-datacenter coflow scheduling. . . . . . . . . . . . . . . . . 21
3.2 A complete execution graph of Monte Carlo simulation. . . . . . . . . . . 24
3.3 Network flows across datacenters in the shuffle phase of a simple job. . . 26
3.4 Job timeline with LFGF scheduling. . . . . . . . . . . . . . . . . . . . . . 27
3.5 Job timeline with naive scheduling. . . . . . . . . . . . . . . . . . . . . . 27
3.6 Flexibility in routing improves performance. . . . . . . . . . . . . . . . . 30
3.7 A flow’s life through Siphon. . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.8 An architectural overview of Siphon. . . . . . . . . . . . . . . . . . . . . 32
3.9 The architectural design of a Siphon aggregator. . . . . . . . . . . . . . . 36
3.10 The architecture of the Siphon Controller. . . . . . . . . . . . . . . . . . 37
3.11 An example to show the benefits of a programmable data plane in software-defined networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.12 The message processing diagram in Siphon. . . . . . . . . . . . . . . . . 43
3.13 Prototype inheritance of Message Processing Objects, illustrating the programming interfaces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.14 Average application run time. . . . . . . . . . . . . . . . . . . . . . . . . 48
3.15 Shuffle completion time and stage completion time comparison (present the run with median application run time). . . . . . . . . . . . . . . . . 49
3.16 Average job completion time across 5 runs. . . . . . . . . . . . . . . . . . 52
3.17 Breakdowns of the reduce stage execution across 5 runs. . . . . . . . . . 52
3.18 CDF of shuffle read time (present the run with median job completion time). 53
3.19 The summary of inter-datacenter traffic in the shuffle phase of the sort application. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.20 Bandwidth distribution among datacenters. . . . . . . . . . . . . . . . . 56


3.21 Average and 90th percentile CCT comparison. . . . . . . . . . . . . . . . 56
3.22 The switching capacities of a Siphon aggregator on three different types of instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.1 Motivation of optimizing shuffle: case 1. . . . . . . . . . . . . . . . . . . 65
4.2 Motivation of optimizing shuffle: case 2. . . . . . . . . . . . . . . . . . . 66
4.3 A snippet of a sample execution graph of a data analytic job. . . . . . . 67
4.4 Implementation of transferTo(). . . . . . . . . . . . . . . . . . . . . . . 73
4.5 Implementation of transferTo() implicit embedding. . . . . . . . . . . . 77
4.6 Geographical deployment of testbed. . . . . . . . . . . . . . . . . . . . . 80
4.7 Average job completion time under HiBench. . . . . . . . . . . . . . . . . 82
4.8 Total volume of cross-datacenter traffic under different workloads. . . . . 85
4.9 Stage execution time breakdown under each workload. . . . . . . . . . . 88

5.1 A motivating example for HSP. . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 An example run of PageRank under HSP. . . . . . . . . . . . . . . . . . 105
5.3 The flow chart that shows central coordination in HSP. . . . . . . . . . . 106
5.4 Application runtime under HSP, normalized by the runtime under BSP. . 109
5.5 Estimated cost breakdown for running applications. . . . . . . . . . . . . 110
5.6 Rate of convergence analysis for PageRank on uk-2014-host. . . . . . . . 110


Chapter 1

Introduction

Since the dawn of cloud computing, an increasing number of applications and services

have been built thanks to the ability to efficiently extract useful information from a

massive amount of raw data in commodity datacenters. Some of these applications need to digest petabytes of raw data daily or even hourly, launching their jobs on tens of thousands of physical machines in parallel. Big data analytics frameworks, e.g., Apache Hadoop [1] and Apache Spark [84], make it feasible to develop and manage such applications at a large scale.

At a high level, a typical big data analytics framework executes a data analytics job

in several stages. Each stage consists of a number of tasks that can process different

partitions of the dataset, running on different worker machines in parallel.

The dependencies among stages can be abstracted as a Directed Acyclic Graph

(DAG), and the child stage cannot start until its parent is completely finished. Between two consecutive stages, the intermediate data layout of the parent stage will be

completely reorganized, resulting in an all-to-all traffic pattern among workers, known

as the shuffle phase. Due to the volume of traffic generated, shuffles may constitute a

significant fraction of the job completion time [40], even if workers are located within a

single datacenter where bandwidth is abundantly available [7].

As many Internet applications and services expand as global businesses, not only have the volumes of input data been increasing continuously and exponentially, but also the geographical diversity of datasets has been growing. User-generated content and system logs, two major sources of data to be processed [26], are naturally generated and stored in servers housed in geographically distributed datacenters. Netflix, for example, houses its global video streaming service in Amazon Web Services (AWS), a global cloud

platform. It is tricky to run a recommender system continuously based on ever-changing

user preference data, which are generated in different AWS regions.

Challenges arise in the efficiency of big data analytics frameworks, which are not

designed with the awareness of datacenter boundaries. Inter-datacenter Wide-Area Networks (WANs) offer orders of magnitude lower bandwidth capacity compared to intra-datacenter networks [38]. Even if such WANs are physically dedicated optical links (e.g., Google’s B4 [42]), the bandwidth available to a single cloud tenant is still scarce, as it is shared among millions of users. As a result, the shuffle phase will take a much

longer time when bulk inter-datacenter flow transfers are required.

Administrators of these data analytics applications can do little — as cloud tenants,

they have zero control over their generated traffic in inter-datacenter WANs.

This problem is known as wide-area data analytics [64, 65]. As a tenant in the

public cloud, how can we process a geographically distributed dataset at a

low cost?

As an additional requirement, a practical solution must be transparent:

1. Compatibility with existing code. With an entire ecosystem of libraries and applications on top of popular data analytics frameworks like Hadoop and Spark, it makes sense to maintain API consistency. Any attempt to modify existing well-defined APIs is undesirable.

2. Optional user intervention. The usefulness of data analytics frameworks stems from the fact that they hide plenty of implementation details from users, allowing them to focus solely on their business logic. From the developers’ point of view, it should make no difference where or how the input datasets are physically stored and processed. To avoid violating this principle, any interface for taking in user insights about the geographical data distribution for the sake of performance must be optional, and it is okay if users are completely unaware.

3. Ease of deployment on mainstream cloud platforms. Deploying and managing a

data analytics framework is easy and well-supported on most cloud platforms, and it

should not be more difficult when the deployment scales out to multiple datacenters.

One straightforward and intuitive solution is to aggregate the entire dataset into a

single datacenter before processing it. However, moving a massive amount of raw data via

inter-datacenter WAN is expensive, both time-wise and money-wise. Both our work [56]

and existing work [64] have evaluation results showing that it is a less effective solution

as compared to processing raw data in-place. Also, it is sometimes infeasible to transfer

raw datasets across borders due to special political regulations [79].

There are generally two lines of existing work trying to tackle the wide-area data

analytics problem. One set of work attempted to design optimal mechanisms of assigning input data and computation tasks across datacenters [39, 64, 78, 79], while the

other adjusted the application workloads towards reducing demands on inter-datacenter

communications [38, 77]. While both lines of work have been proven effective, we wish

to rethink the problem from the architectural perspective: can we redesign some other

key system components in a data analytics framework, so that the bottlenecked inter-datacenter transfers are minimized?

In this dissertation, complementary to all existing proposals in the literature, we propose a series of system component redesigns, which are readily implementable and deployable on existing data analytics frameworks, to optimize the efficiency of wide-area data analytics. We have two major objectives: (1) improving

the job-level performance, i.e., minimizing job completion times; and (2) reducing the

amount of data transferred across datacenters. As a result, the monetary cost of running

a data analytics job across datacenters can be minimized.


Figure 1.1: An overview of our work in the architectural design of a general data analytics framework.

To achieve both objectives, we have redesigned and implemented several system components with the awareness of datacenter boundaries, and we have integrated them into

each architectural layer. Fig. 1.1 shows an overview of our work in the architectural

design of a general data analytics framework. Existing work in the literature focuses on

resource scheduling and Machine Learning APIs. As a comparison, we have redesigned

or improved three different yet important system components (depicted in dark blocks)

from the ground up. Our designs have the potential to work in conjunction with existing

ones for further efficiency improvements.

As the first part of this dissertation, we focus on expediting the delivery of inter-node

data flows. We are motivated by a simple observation: employing the same transport

for both inter- and intra-datacenter data transfers is not efficient. In Apache Spark [84],

for example, the out-of-the-box solution to inter-node data transfers is based on on-demand, single TCP connections. It is effective enough within a single datacenter; however, the performance of inter-datacenter data transfers is very likely to suffer from long-tail, unpredictable flow completion times [38], because of the high-latency, high-jitter, and low-capacity WAN links.

It is conceivable to decouple inter-datacenter from intra-datacenter data transfers


completely and schedule their delivery strategically for the sake of job-level performance.

Specifically, our goal is to collectively optimize the completion time of inter-datacenter

coflows [20,22], which is a flow group abstraction that directly connects the performance

of network transfers to that of a data analytics job. For example, all flows generated in a

shuffle phase can be abstracted as a coflow, as the next-stage computation cannot start

until the coflow is fully completed.

We propose a novel coflow scheduling algorithm tailored to minimizing the

inter-datacenter coflow completion time. To realize it across real datacenters in the

cloud, we have designed and implemented Siphon, an inter-datacenter overlay network

that can transparently aggregate and handle inter-datacenter data transfers.

Inspired by prior work on traffic engineering in software-defined WANs [37, 42], the

architectural design of Siphon follows the software-defined networking principle for the

sake of flexibility in routing and scheduling. Coflow scheduling and flow routing decisions

are made on a central controller, which maintains a global view of the inter-datacenter

overlay.

Siphon can expedite the inter-datacenter coflow transfers, by making optimal scheduling decisions, given an existing collection of flows in the network. As a natural follow-up question, can we optimize by generating fewer competing inter-datacenter flows in wide-area data analytics?

This part of the responsibility is handled by the resource management layer in a data

analytics framework. Most existing works in the literature have been focusing on the

resource assignment and scheduling towards this goal. They optimize the execution of a

wide-area data analytic job by intelligently assigning individual tasks to datacenters, such

that the overhead of moving data across datacenters can be minimized. For example,

Geode [79], WANalytics [78] and Pixida [45] have proposed various task placement

strategies, reducing the volume of inter-datacenter traffic. Iridium [64] achieves shorter

job completion times by leveraging a redistributed input dataset, along with mechanisms


for making optimal task assignment decisions.

Despite their promising outlook, even the best resource scheduling strategies may not

achieve optimality due to the level of abstraction needed to solve the problem. As an

example, in Spark, resource schedulers can only operate at the granularity of computation tasks. Potential optimizations on the actual materialization of API calls, especially

shuffles that generate inter-datacenter data transfers directly, have been overlooked.

As the second part of our work, we take an alternate approach. We put a microscope

on the execution behavior of a single shuffle, analyzing its pros and cons. It turns out

that the state-of-the-art fetch-based shuffle, where receivers of the shuffle traffic initiate

all flows at the same time, may under-utilize the inter-datacenter bandwidth. To solve

this problem, we propose a push-based shuffle mechanism in wide-area data

analytics, which improves the bandwidth utilization and eventually the job

performance by allowing early inter-datacenter transfers.
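To see why early transfers can help, consider a hypothetical and deliberately simplified shuffle: a map stage runs for 100 seconds and produces 10 GB of output at a steady rate, and the inter-datacenter link offers 1 Gbps, so moving 10 GB (80 Gb) takes 80 seconds. With fetch-based shuffle, the link sits idle until the reducers launch at t = 100 s, and the shuffle input finishes arriving at t = 180 s. If instead each map task pushes its output as soon as it finishes, the output is produced at only 0.8 Gbps, the link keeps up with the mappers, and the transfer completes almost immediately after the map stage, at roughly t = 100 s.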

Finally, we focus on graph analytics, an important and special category of big data

analytics applications. Graph analytics, machine learning, and SQL queries are among

the foundations of many big data analytics applications [82]. Popular data analytics

frameworks, e.g., Spark, usually support them as libraries that are built based on basic

data parallel APIs.

In particular, distributed graph analytics relies on a synchronization model to automatically convert itself into an iterative sequence of data parallel operations. Most

data analytics frameworks support graph analytics by adopting BSP [82], which relies on

heavy communications and synchronizations across worker nodes. Representative works

in the literature, e.g., Pregel [59], PowerGraph [29] and GraphX [30], are solely designed and optimized for processing graphs within a single datacenter. Gemini [90], one of the state-of-the-art solutions, even assumes a high-performance cluster with 100 Gbps of bandwidth capacity between worker nodes. As a result, they are not well equipped to address the challenge of running across geographically distributed datacenters,


i.e., wide-area graph analytics.

In wide-area data analytics, Gaia [38] and Clarinet [77] have demonstrated their

efficiency by tweaking the workflow of machine learning and SQL queries, respectively.

Inspired by their work, we seek to answer another question: can we tweak the workflow of graph analytics, so that such workloads can also run efficiently across geographically distributed datacenters?

One possible solution is to allow asynchronous computation on different partitions of

the graph. Existing systems implementing such an asynchronous parallel model include

GraphUC [33] and Maiter [87]. However, neither system can guarantee the convergence

or the correctness of graph applications [82], which is unacceptable.

Inspired by Gaia [38] in machine learning, a promising approach is to partially relax the strong synchronization among worker nodes. As the third part of our work, we propose a new synchronization model for graph analytics, which focuses on reducing the inter-datacenter bandwidth usage while having a strong convergence guarantee in wide-area graph analytics.

1.1 Contributions

We first investigate how to expedite inter-datacenter coflow transfers in wide-area data analytics. To this end, we design and implement Siphon, an inter-datacenter overlay network that can be integrated with any existing data analytics framework (e.g., Apache Spark). With Siphon, inter-datacenter coflows are discovered, routed, and scheduled automatically at runtime. Specifically, Siphon serves as a transport service that accelerates and schedules the inter-datacenter traffic with the awareness of workload-level dependencies and performance, while being utterly transparent to analytics applications. Novel

intra-coflow and inter-coflow scheduling and routing strategies have been designed and

implemented in Siphon, based on a software-defined networking architecture.

On our cloud-based testbeds, we have extensively evaluated Siphon’s performance in


accelerating coflows generated by a broad range of workloads. With a variety of Spark

jobs, Siphon can reduce the completion time of a single coflow by up to 76%. With

respect to the average coflow completion time, Siphon outperforms the state-of-the-art

scheme by 10%.

We proceed to optimize the runtime execution of shuffle, a phase in job execution

that triggers all data transfers in data analytics. We design a new proactive push-based

shuffle mechanism, and implement a prototype based on Apache Spark, with a focus

on minimizing the network traffic incurred in shuffle stages of data analytic jobs. The

objective of this framework is to strategically and proactively aggregate the output data

of mapper tasks to a subset of worker datacenters, as a replacement for Spark’s original

passive fetch mechanism across datacenters. It improves the performance of wide-area

analytic jobs by avoiding repetitive data transfers, which improves the utilization of inter-datacenter links. Our extensive experimental results using standard benchmarks across

six Amazon EC2 regions have shown that our proposed framework is able to reduce job

completion times by up to 73%, as compared to the existing baseline implementation in

Spark.

Finally, we focus on wide-area graph analytics, an important set of iterative algorithms

that are commonly implemented as a library on data analytics frameworks. Existing

graph analytics frameworks are not designed to run well across multiple datacenters,

as they implement a Bulk Synchronous Parallel model that requires excessive wide-area

data transfers. To address this challenge, we propose a new Hierarchical Synchronous

Parallel (HSP) model designed and implemented for synchronization across datacenters

with a much-improved efficiency in inter-datacenter communication. Our new model requires no modifications to graph analytics applications, yet guarantees their convergence and correctness. Our prototype implementation on Apache Spark can achieve up to 32% lower WAN bandwidth usage, 49% faster convergence, and 30% less total cost for benchmark graph algorithms, with input data stored across five geographically distributed


datacenters.

1.2 Organization

The remainder of this dissertation is organized as follows:

Chapter 2 introduces the background and reviews related literature.

Chapter 3 presents our work on expediting inter-datacenter coflow transfers. We propose a set of strategies that can efficiently route and schedule inter-datacenter coflows. We also present the design, implementation, and evaluation of Siphon, a software-defined inter-datacenter framework that can realize these strategies in data analytics frameworks. This chapter is based on our work published in USENIX ATC 2018 [53], in collaboration with Li Chen and Baochun Li, and our work published in IEEE Journal on Selected Areas in Communications [55], in collaboration with Baochun Li.

Chapter 4 presents our work on optimizing shuffle in wide-area data analytics. We

present a detailed system study on the behavior of shuffles and propose a new

Push/Aggregate mechanism to minimize the job completion time. This chapter

is based on our work published in IEEE ICDCS 2017 [56], in collaboration with Hao Wang and Baochun Li.

Chapter 5 presents our work on a Hierarchical Synchronous Parallel model in wide-area graph analytics. This novel synchronization model takes effect by generating less inter-datacenter traffic when the graph algorithm is converted to an iterative workflow. This chapter is based on our work published in IEEE INFOCOM 2018 [54], in collaboration with Li Chen, Baochun Li, and Aiden Carnegie.

Chapter 6 concludes this dissertation, with a summary of our work and a discussion

on future directions.


Chapter 2

Background and Related Work

Since the advent of MapReduce [24], generations of data analytics frameworks have been designed and continuously optimized to support the growing need for big data processing. These frameworks define APIs as a set of well-defined parallel operations, allowing developers to express their data parallel workflow as if it were serial, without caring about the management details of the underlying distributed computing environment. As a result, data analytics frameworks make a data analytics application more maintainable, more scalable, and easier to develop.

In this chapter, we first introduce a few common terms used throughout this dissertation before reviewing the related work in the literature. Our terms are largely consistent with those of Apache Spark [84], because it is popular and the prototype implementations of our work are all based on Spark. They are illustrated in Fig. 2.1.

In a deployed big data analytics framework, we usually refer to the collection of computing resources it manages as a cluster. Clusters are typically organized in a master-slave

architecture. The master node of the cluster is responsible for tracking and maintaining

the status of all slaves, a.k.a. worker nodes or executors.

Developers define their workflow using a sequence of parallel API calls. They then submit the compiled binary to the master to start parallel processing. The master takes the binary, interprets the workflow, and starts it as a data analytics job. The job completion time is one of the most important metrics for evaluating the performance of a data analytics framework.

Figure 2.1: Key terminologies and concepts in a data analytics framework.

Depending on the defined workflow, a job can be further divided into multiple stages, whose dependencies are interpreted as a Directed Acyclic Graph (DAG). A stage can start to run as long as all its dependencies are fulfilled, in the form of a set of parallel tasks. Each task within the same stage runs on a worker node, executing the same piece of code on a different partition of the input data. Note that a stage is not considered completed unless all of its tasks complete successfully. If a task takes a longer time to run for some reason (e.g., a slow worker, network congestion, etc.), it will become a straggler and delay the completion of the entire stage. As a result, stage completion time is also a valid metric in performance evaluation.

In Spark, the input datasets are split into partitions that can be processed in parallel.

Logical computation is organized in several consecutive map() and reduce() stages:

map() operates on each individual partition to filter or sort, while reduce() reorganizes

and collects the summary of map() intermediate results. An all-to-all communication

pattern will be triggered between mappers and reducers, which is called a shuffle phase.

These intermediate data shuffles are well known as costly operations in data analytic

jobs, since they incur intensive traffic across worker nodes.
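To make these terms concrete, the following is a minimal Scala sketch of a Spark word count job; the input and output paths are hypothetical, and the snippet is an illustration of the terminology rather than code taken from our prototypes. The flatMap and map calls form the map side, reduceByKey triggers the shuffle, and the final save runs on the reduce side.

```scala
import org.apache.spark.sql.SparkSession

object WordCountShuffle {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WordCountShuffle").getOrCreate()
    val sc = spark.sparkContext

    // Map side: narrow transformations that run independently on each partition.
    val pairs = sc.textFile("hdfs:///input/logs")       // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // reduceByKey repartitions the intermediate data by key, triggering an
    // all-to-all shuffle between the map tasks and the reduce tasks.
    val counts = pairs.reduceByKey(_ + _)

    // The reduce-side stage cannot start until the shuffle completes.
    counts.saveAsTextFile("hdfs:///output/wordcount")   // hypothetical output path
    spark.stop()
  }
}
```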


2.1 Wide-Area Data Analytics

Running a data analytics job whose input data originates from multiple geographically

distributed datacenters is commonly known as wide area data analytics in the research

literature.

Inter-datacenter networks are critical network resources in the WAN. Even though they might consist of dedicated optical links [42], preliminary measurements suggest that they are heavily shared [27, 38, 57]. Since wide area network links easily become the performance bottleneck, existing works strive to reduce the usage of inter-datacenter bandwidth.

Geode [79], WANalytics [78] and Pixida [45] propose task placement strategies aiming

at reducing the total volume of traffic among datacenters. Iridium [64] and Flutter [39],

on the other hand, argue that less cross-datacenter traffic does not necessarily result in

a shorter job completion time. Thus, they propose an online heuristic to make joint

decisions on both input data migration and task placement across datacenters.

Heintz et al. [35] propose an algorithm to produce an entire job execution plan,

including both data and task placement. However, their model requires too much prior

knowledge such as intermediate data sizes, making it far from practical. Hung et al. [41]

propose a greedy scheduling heuristic to schedule multiple concurrent analytic jobs, with

the objective of reducing average job completion time.

However, all these efforts focus on adding wide-area network awareness to the computation framework, without tackling the lower-level inter-datacenter data transfers directly. Our work is orthogonal and complementary to these efforts.

2.2 Network Optimization for Data Analytics

A variety of flow scheduling algorithms (e.g., [36, 76, 80, 85]) are proposed to improve

flow completion times and meet deadlines in datacenter networks. They focus on the

average behavior of independent flows, which cannot take job-level performance into account.

Considering the impact of flow completion on the job-level performance, coflow scheduling algorithms (e.g., [20,22,25]) are proposed to minimize the average coflow completion

time within a datacenter network, which is assumed to be free of congestion. Without

such assumptions, joint coflow scheduling and routing strategies [52, 88] are proposed in

the datacenter network, where both the core and the edge are congested.

Different from these models, the network in the wide area has a congested core and a congestion-free edge, since the inter-datacenter links have much lower bandwidth than the access links of each datacenter. Apart from the different network model, our proposed coflow scheduling (Chapter 3.2) handles the uncertainty of the fluctuating bandwidth in the wide area, while the existing efforts assume the bandwidth capacities to remain unchanged. Also, previous efforts are limited to performance evaluations with simulations or emulations. In contrast, we have implemented our practical yet effective

scheduling algorithms in Chapter 3 and demonstrated their performance with real-world

deployment and experiments.

2.3 Software-Defined Networking

Software-Defined Networking is a rising star in network management. Starting from

a campus network [61], it has developed into a promising technology in next-generation networks [46]. The principle of software-defined networking is to decouple the

packet forwarding intelligence from the hardware. The system design of Siphon (Chapter

3) follows this software-defined networking principle, applying it in an inter-datacenter

overlay network.

OpenFlow [4] is the first standardized protocol designed for communications between

the controller and the data plane. Static rules are installed and cached on the data plane

switches, to perform longest prefix matches at runtime.

Despite the fine granularity of control, the OpenFlow-based control plane suffers from scalability problems. Some workaround solutions attempt to offload the controller by applying label-switching technology [6,10,11,18] or a distributed control plane [34,62].

However, they are still far from effective in dealing with traffic in production [15].

Reconfigurable Dataplanes have emerged as the future of OpenFlow 2.0 [15]. With

new advancements in the hardware, switches today are able to process packets using

reconfigurable logic, instead of static rules, at the line rate [71]. Following this direction,

network programming primitives (e.g., FAST [63], P4 [15], Probabilistic NetKAT [28],

Domino [70]) and compilers (e.g., SNAP [44]) are proposed.

The interactions between the Siphon controller and aggregators (Chapter 3.4) are greatly inspired by this line of work. However, due to hardware constraints, sophisticated algorithms are difficult to implement with the limited number of operations [69]. By taking advantage of the overlay data plane, which is more easily programmable and reconfigurable, the controller-data plane interaction scheme is designed to be more flexible. It

adapts better to the inter-datacenter overlay environment, getting rid of the unnecessary

complexity incurred by processing multi-layer packet headers.

2.4 Optimizing Shuffle in Data Analytics

When it comes to running data analytics jobs within a datacenter, there exist several

proposals on optimizing shuffle input data placement. iShuffle [31] proposes to shuffle

before the reducer tasks are launched in Hadoop. Shards of the shuffle input are pushed to their predicted reducers during shuffle write, that is, a "shuffle-on-write" service. However, it is not practical to predict the reducer placement beforehand, especially in Spark. MapReduce Online [23] also proposes a push-based shuffle mechanism, in order to optimize performance under continuous queries. Unfortunately, general analytic jobs that are submitted at arbitrary times will not benefit.

As a comparison, our work in Chapter 4 optimizes shuffles that involve inter-datacenter transfers in a general data analytics framework. We assume no prior knowledge about the workload. We focus on generic operations that trigger shuffles at runtime. Further, our solution interacts cleanly with the task scheduling module of the data analytics framework, rather than predicting, and eventually overriding, its decisions, which would violate design principles.

2.5 Distributed Graph Analytics

A variety of distributed graph analytics systems, most of which implement the BSP

model, have been proposed in the literature. Representatives include [59, 90]. These systems focus on computing environments within a high-performance cluster, where bandwidth between worker nodes is abundantly available. In contrast, Chapter 5 proposes a new synchronization model that can be integrated seamlessly with these systems, serving as an alternative to BSP when running across multiple datacenters.

Algorithm-level optimizations such as [68] can certainly reduce the required inter-datacenter traffic. They can be applied to specific categories of graph algorithms, and

are orthogonal to optimizations on the system or the synchronization model.

As closely related works, asynchronous (GraphLab [58], PowerGraph [29]) or partially asynchronous (GraphUC [33], Maiter [87]) synchronization models can potentially reduce the need for inter-datacenter communications, but they cannot always guarantee algorithm convergence [82]. Such guarantees in our work (Chapter 5) are rooted in the strategic switching between synchronous (global) and asynchronous (local) modes. The limited

extent of asynchrony achieves a sweet spot with both minimal inter-datacenter traffic

and convergence guarantees.


Chapter 3

Siphon: Expediting Inter-Datacenter Coflows

Given the particular traffic generated by wide-area data analytics, improving its performance by directly accelerating the completion of its inter-datacenter data transfers has been largely neglected. The literature either attempted to design optimal mechanisms of assigning input data and computation tasks across datacenters [39, 64, 78, 79], or tried to adjust the application workloads towards reducing demands on inter-datacenter communications [38, 77]. On the other hand, inter-datacenter traffic engineering can only be performed by cloud service providers at a flow group level [37, 42]. Cloud tenants have no control over their generated inter-datacenter traffic. Such inability is likely to lead to sub-optimal utilization of inter-datacenter bandwidth.

To fill this gap, we propose a deliberate design of a fast delivery service for data

transfers across datacenters, with the goal of improving application performance from

an orthogonal and complementary perspective to the existing efforts. Moreover, it has

been observed that an application cannot proceed until all its flows complete [21], which

indicates that its performance is determined by the collective behavior of all these flows,

rather than any individual ones. We incorporate the awareness of such an important application semantic, abstracted as coflows [22], into our design, to better satisfy application

requirements and further improve application-level performance.


Existing efforts have investigated the scheduling of coflows within a single datacenter [20,22,52,86], where the network is assumed to be congestion free and abstracted as a giant switch. Unfortunately, such an assumption no longer holds in inter-datacenter WANs, yet the requirement for optimal coflow scheduling to improve application performance becomes even more critical.

In this chapter, given the observation that data analytics frameworks have complete

knowledge of all generated network flows, we propose three strategies that can significantly reduce the coflow completion time, which translates directly into a shorter job completion time.

First, we have designed a novel and practical inter-coflow scheduling algorithm to

minimize the average coflow completion time, despite the unpredictable available bandwidth in wide-area networks. The algorithm is based on Monte Carlo simulations to

handle the uncertainty, with several optimizations to ensure its timely completion and

enforcement.
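As a rough illustration of this idea, and not the actual algorithm developed in Chapter 3.2, the Scala sketch below samples the fluctuating bandwidth of a single bottleneck link many times, simulates every candidate serving order of a small coflow set, and keeps the order with the lowest expected average coflow completion time. All coflow sizes, link rates, and helper names here are hypothetical.

```scala
import scala.util.Random

// A coflow is reduced here to the bytes it must move across one bottleneck
// link; real coflows span many links, so this is only a conceptual sketch.
case class Coflow(id: Int, bytes: Double)

object MonteCarloCoflowOrder {
  // Serve coflows one after another under a sampled bandwidth trace and
  // return the average coflow completion time (CCT) in seconds.
  def averageCct(order: Seq[Coflow], sampleBandwidth: () => Double): Double = {
    var clock = 0.0
    var total = 0.0
    for (c <- order) {
      var left = c.bytes
      while (left > 0) {            // advance the simulation in 1 s steps
        left -= sampleBandwidth()   // bytes served during this step
        clock += 1.0
      }
      total += clock
    }
    total / order.size
  }

  // Evaluate every permutation of a small coflow set over many sampled
  // bandwidth scenarios, and pick the order with the lowest expected CCT.
  def bestOrder(coflows: Seq[Coflow], samples: Int, rng: Random): Seq[Coflow] = {
    def sampler(): Double = 50e6 * (0.5 + rng.nextDouble()) // fluctuating rate
    coflows.permutations.minBy { order =>
      (1 to samples).map(_ => averageCct(order, sampler)).sum / samples
    }
  }

  def main(args: Array[String]): Unit = {
    val coflows = Seq(Coflow(1, 2e9), Coflow(2, 8e9), Coflow(3, 1e9))
    val best = bestOrder(coflows, samples = 100, rng = new Random(42))
    println(best.map(_.id).mkString("Chosen order: ", " -> ", ""))
  }
}
```

Under this toy model, the chosen order tends to serve smaller coflows first, matching the intuition behind shortest-job-first scheduling; the full algorithm must additionally handle multiple links and re-evaluate its decisions quickly as bandwidth conditions change.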

Second, we have proposed a simple yet effective intra-coflow scheduling policy. It tries

to prioritize a subset of flows such that the potential straggler tasks can be accelerated.

Finally, we have designed a greedy multi-path routing algorithm, which detours a

subset of the traffic on a bottlenecked link to an alternate idle path, such that the

slowest flow in a shuffle can be finished earlier.
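The greedy detour idea can likewise be sketched in a few lines of Scala. The sketch below treats each path as a single pipe with an aggregate load, repeatedly moves a fixed-size chunk of traffic from the most congested path to the least congested one, and stops as soon as a move no longer reduces the finish time of the slowest path. The capacities, loads, and names are hypothetical, and the sketch omits the per-flow details handled by the actual algorithm.

```scala
object GreedyDetour {
  // A path is modelled as its bottleneck capacity and the traffic assigned to it.
  final case class Path(name: String, capacityMbps: Double, var loadMb: Double) {
    def finishTime: Double = loadMb / capacityMbps   // seconds to drain the load
  }

  // Greedily shift chunks of traffic from the slowest path to the fastest one
  // while doing so shortens the completion time of the slowest path.
  def balance(paths: Seq[Path], chunkMb: Double = 100.0): Unit = {
    var improved = true
    while (improved) {
      val bottleneck = paths.maxBy(_.finishTime)
      val alternate  = paths.filter(_ ne bottleneck).minBy(_.finishTime)
      val before = bottleneck.finishTime
      val moved = math.min(chunkMb, bottleneck.loadMb)
      bottleneck.loadMb -= moved
      alternate.loadMb  += moved
      improved = moved > 0 && paths.map(_.finishTime).max < before
      if (!improved) {              // undo the last move if it did not help
        bottleneck.loadMb += moved
        alternate.loadMb  -= moved
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical setup: a loaded 100 Mbps direct link and a lightly loaded
    // 80 Mbps detour through a third datacenter.
    val paths = Seq(Path("direct", 100.0, 96000.0), Path("detour", 80.0, 8000.0))
    balance(paths)
    paths.foreach(p => println(f"${p.name}: ${p.loadMb}%.0f Mb, drains in ${p.finishTime}%.1f s"))
  }
}
```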

Further, to enforce these scheduling and routing strategies, we have designed and implemented Siphon, a new building block for data analytics frameworks that provides a transparent and unified platform to expedite inter-datacenter coflows.

From the perspective of big data analytics frameworks, Siphon decouples inter-datacenter transfers from intra-datacenter traffic, serving as a transport with full coflow awareness. It can be easily integrated into existing frameworks with minimal changes in source code, while being completely transparent to the analytics applications atop. We have integrated Siphon into Apache Spark [84].


The aforementioned coflow scheduling strategies become feasible because Siphon’s architectural design follows the software-defined networking principle. The network control plane, which makes control decisions such as routing and scheduling, is logically centralized and decoupled from the network data plane, which deals with the actual flows.

For the datapath, Siphon employs aggregator daemons on all (or a subset of) workers, forming a virtual overlay network atop the inter-datacenter WAN that aggregates and forwards inter-datacenter traffic efficiently. At the same time, a controller can make centralized routing and scheduling decisions on the aggregated traffic and enforce them on the aggregators. Also, the controller can work closely with the resource scheduler of the data parallel framework, to maintain global and up-to-date knowledge about ongoing inter-datacenter coflows at runtime.

Because of the size and scale of inter-datacenter WANs, it takes a noticeable amount

of time for the controller to update a rule on aggregators. This latency is sometimes

significant, in that routing and scheduling decisions cannot take effect immediately. To

address this problem, Siphon implements a novel approach for data plane-controller interactions.

Unlike traditional link-layer software-defined networking [61], the aggregators, which act like switches, take advantage of their software flexibility. Rather than caching static control rules in a flow table, an aggregator can cache dynamic control logic that is installed by the controller. In particular, we can define algorithms using a concise set of Javascript APIs, which will later be interpreted by a light-weight interpreter at each aggregator within the data plane. When new flows arrive, an aggregator can make routing and scheduling decisions based on locally cached control logic without requiring controller intervention. As a result, the volume of interactions with the controller is minimized. This mechanism greatly improves efficiency in Siphon, where network latency makes a significant difference.
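To make the contrast with static rules concrete, the toy Scala sketch below models an aggregator that caches a decision function installed by the controller, so that most flows are routed locally and only unmatched flows cost a WAN round trip to the controller. It is a conceptual model only; Siphon's actual interface is the set of Javascript APIs described in Chapter 3.4, and every identifier below is hypothetical.

```scala
object CachedControlLogic {
  case class FlowKey(coflowId: Int, srcDc: String, dstDc: String)
  case class Decision(nextHop: String, priority: Int)

  // The "program" installed by the controller: evaluated locally per flow.
  type ControlLogic = FlowKey => Option[Decision]

  class Aggregator(askController: FlowKey => Decision) {
    private var cachedLogic: ControlLogic = _ => None   // nothing installed yet

    // The controller pushes a piece of logic once; it then covers whole
    // classes of future flows, unlike a static per-flow rule.
    def installLogic(logic: ControlLogic): Unit = cachedLogic = logic

    // Fast path: evaluate the cached logic locally.
    // Slow path: one round trip to the controller, which costs a WAN RTT.
    def route(flow: FlowKey): Decision =
      cachedLogic(flow).getOrElse(askController(flow))
  }

  def main(args: Array[String]): Unit = {
    val agg = new Aggregator(askController = _ => Decision("default-gw", 0))

    // Example logic: detour Tokyo-bound traffic through Taiwan and give
    // low-numbered (assumed smaller) coflows a higher priority.
    agg.installLogic {
      case FlowKey(id, _, "tokyo") => Some(Decision("taiwan", if (id < 100) 10 else 1))
      case FlowKey(id, _, dst)     => Some(Decision(dst, if (id < 100) 10 else 1))
    }

    println(agg.route(FlowKey(42, "oregon", "tokyo")))   // decided locally
  }
}
```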

We have evaluated our proposed coflow scheduling strategies with Siphon. Across


five geographical regions on Google Cloud, we have evaluated the performance of Siphon

from a variety of aspects, and the effectiveness of intra-coflow scheduling in accelerating

several real Spark jobs. Our experimental results have shown a reduction of up to 76% in the shuffle read time. Further experiments with the Facebook coflow benchmark [22] have shown a reduction of approximately 10% in the average coflow completion time as compared to

the state-of-the-art schemes.

This chapter is organized as follows: We first motivate inter-datacenter coflow schedul-

ing and analyze its practical challenges (Chapter 3.1). Then, in Chapter 3.2, we propose

a novel inter-datacenter coflow scheduling algorithm based on the idea of Monte-Carlo

simulation to tackle these challenges. To realize it, we implement Siphon (Chapter 3.3), a

high-performance inter-datacenter overlay that provides the flexibility of inter-datacenter

traffic engineering. Chapter 3.4 describes the architectural optimization made in Siphon’s

software-defined networking design. The final sections present extensive system evaluations in a real cloud environment.

3.1 Motivation and Background

In modern big data analytics, the network stack traditionally serves to deliver individual

flows in a timely fashion [7, 8, 85], while being oblivious to the application workload.

Recent work argues that, by leveraging workload-level knowledge of flow interdepen-

dence, the proper scheduling of coflows can improve the performance of applications in

datacenter networks [22].

As an application is deployed at an inter-datacenter scale, the network is more likely

to be a system bottleneck [64]. Existing efforts in wide-area data analytics [38, 64, 77]

all seek to avoid this bottleneck, rather than mitigating it. Therefore, it is necessary to

enforce a systematic way of scheduling inter-datacenter coflows for better link utilization,

given the fact that the timely completion of coflows can play an even more significant

role in application performance.


Table 3.1: Peak TCP throughput (Mbps) achieved across different regions on the Google Cloud Platform.

            Oregon  Carolina  Tokyo  Belgium  Taiwan
Oregon        3000       236    250    152.0     194
Carolina       237      3000   83.8      251    45.1
Tokyo         83.8      81.7   3000     89.2     586
Belgium        249       242   86.6     3000    76.0
Taiwan         182      35.8    508     68.0    3000

As an example, suppose we have two coflows, A and B, sharing the same inter-datacenter link. Without a proper scheduling mechanism, A and B will have to fair-share the available bandwidth, delaying the completion of both coflows. However, with a simple preemptive schedule that prioritizes A, A completes significantly faster while B still completes at exactly the same time.
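As a purely illustrative calculation (the numbers are hypothetical and not taken from our measurements): if both coflows must transfer $S$ bytes over a link of bandwidth $B$, fair sharing finishes both at time $2S/B$, so the average coflow completion time is $2S/B$. Prioritizing A instead finishes A at $S/B$ and B at $2S/B$, reducing the average to $1.5\,S/B$ while leaving B's completion time unchanged.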

That said, scheduling inter-datacenter coflows raises several practical challenges. First, inter-datacenter networks follow a different network model. Networks are usually

modeled as a big switch [22] or a fat tree [67] in the recent coflow scheduling literature,

where the ingress and egress ports at the workers are identified as the bottleneck. This is

no longer true in wide area data analytics, as the available bandwidth on inter-datacenter

links is orders of magnitude lower than the edge capacity (see Table 3.1, measured with iperf3

in TCP mode on standard 2-core instances. Rows and columns represent source and

destination datacenters, respectively. These statistics match the reports in [38]).

Second, the available inter-datacenter bandwidth fluctuates over time. Unlike in

datacenter networks, the completion time of a given flow can hardly be predicted,

which makes the effectiveness of existing deterministic scheduling strategies (e.g., [22,86])

questionable. The reason is easily understandable: though the aggregated link bandwidth

between a pair of datacenters might be abundant, it is shared by a large number of tenants and their applications, with varied, unsynchronized and unpredictable networking

patterns.

Third, our ability to properly schedule and route inter-datacenter flows is limited.

We may gain full control via software-defined networking within a datacenter [88], but

such a technology is not readily available in inter-datacenter WANs. Flows through

inter-datacenter links are typically delivered with best effort on direct paths, without the intervention of application developers.

Figure 3.1: An example with two coflows, A and B, being sent through two inter-datacenter links. Based on link bandwidth measurements and flow sizes, the duration distributions of four flows are depicted with box plots. Note that the expected duration of A1 and B2 are the same.

To summarize, wide-area data analytics calls for a redesigned coflow scheduling and routing strategy, as well as a new platform to realize it in existing data analytics frameworks. In this chapter, Siphon is thus designed from the ground up for this purpose.

It is an application-layer, pluggable building block that is readily deployable. It can

support a better WAN transport mechanism and transparently enforce a flexible set of

coflow scheduling disciplines, by closely interacting with the data parallel frameworks.

A Spark job with tasks across multiple datacenters, for example, can take advantage of

Siphon to improve its performance by reducing its inter-datacenter coflow completion

times.

3.2 Scheduling Inter-Datacenter Coflows

3.2.1 Inter-Coflow Scheduling

Inter-coflow scheduling is the primary focus of the literature [20,22,72,86]. In this section,

we first analyze the practical network model of wide-area data analytics. Based on the

new observations, we propose the details of a Monte Carlo simulation-based scheduling

algorithm.


♦ Design Objectives

Similar to [22,88], we assume the complete knowledge of ongoing coflows, i.e., the source,

the destination and the size of each flow are known as soon as the coflow arrives. Despite

recent work [20,86] which deals with zero or partial prior knowledge, we argue that this

assumption is practical in modern data parallel frameworks. It is conceivable that the

task scheduler is fully aware of the potential cross-worker traffic before launching the tasks

in the next stage and triggering the communication stage [1, 21, 84]. We will elaborate

further on its feasibility in Sec. 3.3.4.

Our major objective is to minimize the average coflow completion time, in alignment

with the existing literature. However, we focus on inter-datacenter coflows, which are

constrained by a different network model. In particular, based on the measurement in

Table 3.1, we conclude that inter-datacenter links are the only bottlenecked resources,

and congestion can hardly happen at the ingress or egress port. For convenience, we call

it a dumbbell network structure.

As compared to coflow scheduling within a single datacenter, in which the network is typically modeled as a giant, congestion-free switch [20,22], inter-datacenter coflow scheduling

is different in two ways:

The good news is that, because the number of inter-datacenter links is orders of magnitude lower

than the available paths within a single datacenter, the problem complexity is reduced

significantly, as well. Though the decision space of scheduling a given set of coflows

remains the same, the cost of calculating the coflow completion times under a specific

scheduling order is orders of magnitude lower.

However, the bad news is that inter-datacenter bandwidth is a dynamic resource.

Scheduling across coflows must take runtime variations into account, making a scheduling

decision that has a higher probability of completing coflows faster. This requirement complicates the problem, and it is the challenge we focus on in this chapter.


♦ Schedule with Bandwidth Uncertainty

Coflow scheduling in a big switch network model has been proven to be NP-hard, as it can

be reduced to an instance of the concurrent open shop scheduling with coupled resources

problem [22]. With a dumbbell network structure, as contention is removed from the

edge, each inter-datacenter link can be considered an independent resource that is used to

service the coflows (jobs). Therefore, it makes sense to perform fully preemptive coflow

scheduling, as resource sharing always results in an increased average completion time [36].

The problem may seem simpler with this network model. However, it is the sharing

nature of inter-datacenter links that complicates the scheduling. The real challenge is,

being shared among many unknown tenants, the available bandwidth on a given link is

not predictable. In fact, the available bandwidth is a random variable whose distribution

can be inferred from historical measurements. Thus, the flow durations are also random

variables. The coflow scheduling problem in wide-area data analytics can be reduced to

the independent probabilistic job shop scheduling problem [13], which is also NP-hard.

We seek a heuristic algorithm to solve this online scheduling problem. An intuitive

approach is to make an estimation of the flow completion times, e.g., based on the

expectation of recent measurements, such that we can solve the problem by adopting a

deterministic scheduling policy such as Minimum-Remaining Time First (MRTF) [22,86].

Unfortunately, this naive approach fails to model the probabilistic distribution of flow

durations. Fig. 3.1 shows a simple example in which deterministic scheduling does not

work. In this example, the available bandwidth on Links 1 and 2 has distinct distributions because the users sharing each link have distinct networking behaviors. With Coflows A and

B competing, the box plots depict the skewed distributions of flow durations if the

corresponding coflow gets all the available bandwidth.

With a naive, deterministic approach that considers only the average, scheduling either A or B first appears to yield the minimum average coflow completion time. However, it is an easy

observation that, with a higher probability, the duration of flow A1 will be shorter than that of B2. Thus, prioritizing Coflow A over B should yield an optimal schedule.

Figure 3.2: The complete execution graph of Monte Carlo simulation, given 3 ongoing coflows, A, B and C. The coflow scheduling order is determined by the distributions at the end of all branches.

In conclusion, inter-datacenter coflow scheduling decisions should consider the distri-

bution of available bandwidth, which may be estimated from weighted samples of historical measurements.

♦ Monte Carlo Simulation-based Scheduling

To incorporate such uncertainty, we propose an online Monte Carlo simulation-based

inter-coflow scheduling algorithm, which is greatly inspired by the offline algorithm pro-

posed in [13].

The basic idea of Monte Carlo simulation is simple and intuitive: For every candidate

scheduling order, we repeatedly simulate its execution and calculate its cost, i.e., the

simulated average coflow completion time. With enough rounds of simulations, the cost

distribution will approximate the actual distribution of average coflow completion time.

Based on this simulated cost distribution, we can choose among all candidate scheduling

orders at a certain confidence level.

As an example, Fig. 3.2 illustrates an algorithm execution graph with 3 ongoing

coflows. There are 6 potential scheduling orders, corresponding to the 6 branches in the

graph. To perform one round of simulation, the scheduler generates a flow duration for each node in the graph, by randomly drawing from its estimated distribution.

Summing up the costs along each branch yields the best scheduling decision for this particular instance, whose counter is then incremented. After a sufficient number of rounds, the best scheduling order converges to the branch with the maximum counter value.
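The following JavaScript sketch illustrates one simplified version of this procedure. The data structures (coflows with per-link flow sizes) and the bandwidth sampling function are assumed placeholders rather than Siphon's actual implementation, and the bounded search depth and early termination described below are omitted for brevity.

// A minimal sketch of Monte Carlo-based inter-coflow scheduling.
// Each coflow lists its flows as { link, bytes }; sampleBandwidth(link)
// draws an available-bandwidth sample from that link's empirical history.

function simulateOrder(order, bwSample) {
  // Run coflows sequentially on each link under one bandwidth sample and
  // return the average coflow completion time for this instance.
  const linkFreeAt = {};                  // when each link becomes idle
  let totalCompletion = 0;
  for (const coflow of order) {
    let completion = 0;
    for (const flow of coflow.flows) {
      const start = linkFreeAt[flow.link] || 0;
      const finish = start + flow.bytes / bwSample[flow.link];
      linkFreeAt[flow.link] = finish;
      completion = Math.max(completion, finish);   // a coflow ends with its slowest flow
    }
    totalCompletion += completion;
  }
  return totalCompletion / order.length;
}

function chooseSchedule(candidateOrders, links, sampleBandwidth, rounds) {
  // Count how often each candidate scheduling order wins a simulated round.
  const wins = new Array(candidateOrders.length).fill(0);
  for (let r = 0; r < rounds; r++) {
    const bwSample = {};
    links.forEach(l => { bwSample[l] = sampleBandwidth(l); });  // one draw per link per round
    let best = 0, bestCost = Infinity;
    candidateOrders.forEach((order, i) => {
      const cost = simulateOrder(order, bwSample);
      if (cost < bestCost) { bestCost = cost; best = i; }
    });
    wins[best]++;
  }
  // Return the order whose counter converged to the maximum value.
  return candidateOrders[wins.indexOf(Math.max(...wins))];
}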

One major concern of this algorithm is its high complexity. In fact, it works just like

a brute-force, exhaustive search in each round of simulation. With n ongoing coflows,

there will be up to n! branches in the graph of simulation.

We argue that, given a much smaller number of links to simulate, the algorithm's

complexity is acceptable, especially with the following techniques to limit the simulation

search space.

Bounded search depth. In online coflow scheduling, all we care about is the coflow

that should be scheduled next. This property makes a full simulation towards all leaf

nodes unnecessary. Therefore, we set an upper bound, d, to the search depth, and

simulate the rest of the branches using the MRTF heuristic and the median flow durations. This way, the search space is limited to a polynomial size of $\Theta(n^d)$. In our implementation, we

heuristically set d = 3, for an empirical balance between complexity and precision.

Early termination. Some “bad” scheduling decisions can be identified easily. For ex-

ample, scheduling an elephant coflow first will always result in a longer average. Based on

this observation, after several rounds of full simulation, we prune the branches whose performance is consistently and significantly worse. This technique limits the search breadth, resulting in an $O(nd)$ complexity.

Online incremental simulation. As an online simulation, the scheduling algorithm

should quickly react to recent events, such as coflow arrivals and completions. Whenever

a new event comes, the previous job execution graph will be updated accordingly, by

pruning or creating branches. Luckily, the existing useful simulation results (or partial

results) can be preserved to avoid repetitive computation.

Simulation Timeout. Since Monte-Carlo simulation can be terminated at any time,

we have the flexibility to bound the execution time for each round of simulation, at the

cost of a less reliable simulation result. If a timeout happens, we can simply launch more

parallel processes to run the Monte-Carlo simulation to speed it up.

Figure 3.3: Network flows across datacenters in the shuffle phase of a simple job.

These optimizations are inspired by similar techniques adopted in Monte Carlo Tree

Search (MCTS), but our algorithm differs from MCTS conceptually. In every simulation,

MCTS tends to reach the leaves of a single branch in the decision tree, where the outcome

can be revealed. In comparison, our algorithm has to go through all branches at a certain

depth, otherwise we cannot figure out the optimal scheduling for the particular instance

of available bandwidth.

♦ Scalability

In wide-area data analytics, a centralized Monte Carlo simulation-based scheduling al-

gorithm may be questioned with respect to its scalability, as making and enforcing a

scheduling decision may experience seconds of delays.

We can exploit the parallelism and staleness tolerance of our algorithm. The beauty

of Monte Carlo simulation is that, by nature, the algorithm is infinitely parallelizable

and tolerant of stale synchronization. Thus, we can potentially scale out the

implementation to a great number of scheduler instances placed in all worker datacenters,

to minimize the running time of the scheduling algorithm and the propagation delays in

enforcing scheduling decisions.

3.2.2 Intra-Coflow Scheduling

To schedule flows belonging to the same coflow, we have designed a preemptive scheduling

policy to help flows share the limited link bandwidth efficiently. Our scheduling policy

is called Largest Flow Group First (LFGF), whose goal is to minimize job completion times.

Figure 3.4: Job timeline with LFGF scheduling.

Figure 3.5: Job timeline with naive scheduling.

A Flow Group is defined as a group of all the flows that are destined to the same

reduce task. The size of a flow group is the total size of all the flows within, representing

the total amount of data received in the shuffle phase by the corresponding reduce task.

As suggested by its name, LFGF preemptively prioritizes the flow group of the largest

size.

The rationale of LFGF is to coordinate the scheduling order of flow groups so that

tasks requiring more computation can start earlier by receiving their flows earlier.

Here we assume that the task execution time is proportional to the total amount of data

it receives for processing. This is an intuitive assumption given no prior knowledge about

the job.


As an example, we consider a simple Spark job that consists of two reduce tasks

launched in datacenter 2, both of which fetch data from two mappers in datacenter 1

and one mapper in datacenter 3, as shown in Fig. 3.3. Corresponding to the two reducers

R1 and R2, two flow groups are sharing both inter-datacenter links, with the size of 200

MB and 150 MB, respectively. For simplicity, we assume the two links have the same

bandwidth, and the calculation time per unit of data is the same as the network transfer

time.

With LFGF, Flow Group 1, corresponding to R1, has a larger size and thus will be

scheduled first. As is illustrated in Fig. 3.4, the two flows (M1-R1, M2-R1) in Flow Group

1 are scheduled first through the link between datacenter 1 and 2. The same applies to

another flow (M3-R1) of Flow Group 1 on the link between datacenter 3 and 2. When

Flow Group 1 completes at time 3, i.e., all its flows complete, R1 starts processing the

200 MB data received, and finishes within 4 time units. The other reduce task R2 starts

at time 5, processes the 150 MB data with 3 units of time, and completes at time 8,

which becomes the job completion time.

If the scheduling order is reversed as shown in Fig. 3.5, Flow Group 2 will complete

first, and thus R2 finishes at time 5. Although R1 starts at the same time as R2 in

Fig. 3.4, its execution time is longer due to its larger flow group size, which results in

a longer job completion time. This example intuitively justifies the essence of LFGF —

for a task that takes longer to finish, it is better to start it earlier by scheduling its flow

group earlier.

The proposed scheduling algorithm can be easily implemented in the controller, whose

execution is triggered by the shuffle flows reported by the TaskScheduler. In particular,

for each reduce task, we first calculate the total amount of data sent by all the flows

destined to it, and then assign priorities to all the tasks accordingly. This way, all the

flows destined to a task have the same priority, and flows in a task with more shuffle

data will have a higher priority to be scheduled. The priority number associated with


each flow will be conveyed to each aggregator the flow traverses, where a rule (flowId:

nextHop, priority) is installed in its forwarding table. The priority queues in these

aggregators will enforce the scheduling given the priorities obtained from the installed

rules, to be elaborated in Sec. 3.3.2.
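A minimal sketch of this priority assignment is shown below; the input and rule formats (flows carrying a reduceTaskId, size, and next hop) are hypothetical placeholders rather than the actual controller code.

// A rough sketch of LFGF priority assignment on the controller side.
function assignLfgfPriorities(shuffleFlows) {
  // 1. Compute the flow group size of each reduce task.
  const groupSize = new Map();
  for (const f of shuffleFlows) {
    groupSize.set(f.reduceTaskId, (groupSize.get(f.reduceTaskId) || 0) + f.bytes);
  }

  // 2. Rank reduce tasks by descending group size: the largest flow group
  //    gets the highest priority (smallest priority number here).
  const ranked = [...groupSize.keys()]
    .sort((a, b) => groupSize.get(b) - groupSize.get(a));
  const priorityOf = new Map(ranked.map((taskId, rank) => [taskId, rank]));

  // 3. Every flow inherits the priority of its flow group, producing
  //    (flowId: nextHop, priority)-style rules for the aggregators.
  return shuffleFlows.map(f => ({
    flowId: f.flowId,
    nextHop: f.nextHop,
    priority: priorityOf.get(f.reduceTaskId),
  }));
}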

3.2.3 Multi-Path Routing

Beyond ordering the coflows, optimizing route selection for individual flows is also made

feasible with Siphon. Aggregators, being deployed in geographically distributed data-

centers, can relay flows among datacenters efficiently. Provided that the topology of

inter-datacenter networks is publicly available [42], multi-path routing is an attractive

means of reducing coflow completion time.

Existing solutions, including RAPIER [88] and Li et al. [52], both exploit multiple,

equal-cost, paths within a single datacenter to improve the average coflow completion

time. Unfortunately, it is not ideal to directly apply their multi-path routing algorithms

here. Available paths between two datacenters have different costs, as a multi-hop path will incur more bandwidth usage cost than the direct link.

Being aware of this concern, we design a simple and efficient multi-path routing

algorithm to utilize available link bandwidth better and to balance network load. The

idea is similar to water-filling — it identifies the bottleneck link, and shifts some traffic

to the alternative path with the lightest network load in an iterative fashion.

The intuition that motivates this algorithm can be illustrated with the example shown

in Fig. 3.6. The shuffle phase consists of two fetches, each representing a group of flows

between a pair of source and destination datacenters. The first fetch is between DC2 and

DC3, involving four flows represented by the gray lines; the second fetch is between DC2

and DC1, which consists of two flows represented by the black lines. For simplicity, all

the flows are of the same size (100 MB), and all the inter-datacenter link bandwidth is

the same (10 MB/s). If these flows are routed directly as on the left side of the figure, the

link between DC2 and DC3 will become the bottleneck, resulting in 40 seconds of shuffle completion time.

Figure 3.6: Flexibility in routing improves performance.

However, if we allow the multi-path routing to split some traffic from

the direct path (DC2-DC3) to the alternative path (DC2-DC1, DC1-DC3), the network

load will be better balanced across links, which naturally speeds up the shuffle phase. As

illustrated by the right side of Fig. 3.6, when 100 MB of traffic is shifted from DC2-DC3

to the alternative two-hop path, DC2-DC3 and DC2-DC1 have the same network load.

With this routing, it only takes 30 seconds to complete the shuffle.

The bottleneck link is identified based on the time it takes to finish all the passing

flows. In the first iteration, we calculate all the link load and the time it takes to

finish all the passing flows, given that all the flows go through their direct links. In particular, for each link $l$, the link load is represented as $D_l = \sum_i d_i$, where the sum runs over the fetches $i$ whose direct path is link $l$ and $d_i$ represents the total amount of data of fetch $i$. The completion time is thus calculated as $t_l = D_l / B_l$, where $B_l$ represents the bandwidth of link $l$. We identify the most heavily loaded link $l^*$, which has the largest $t_{l^*}$, and choose, among its alternative paths, the one with the lightest load for traffic re-routing. In order to compute the fraction of traffic to be re-routed from $l^*$, denoted by $\alpha$, we solve the equation $D_{l^*}(1 - \alpha)/B_{l^*} = (D_{l^*}\alpha + D_{l'})/B_{l'}$, where $l'$ is the link with the heaviest load on the selected detour path.
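The sketch below illustrates one iteration of this water-filling procedure under simplifying assumptions: the inputs (per-link load and bandwidth maps, and a pre-selected lightest-loaded detour path per link) are hypothetical, and only a single re-routing step is shown.

// A minimal sketch of one water-filling iteration. `load[l]` is D_l in bytes,
// `bandwidth[l]` is B_l in bytes/s, and `detourOf[l]` lists the links on the
// chosen lightest-loaded alternative path for link l.
function rebalanceOnce(load, bandwidth, detourOf) {
  // Identify the bottleneck link l*: the one with the largest t_l = D_l / B_l.
  let lStar = null, tMax = -Infinity;
  for (const l of Object.keys(load)) {
    const t = load[l] / bandwidth[l];
    if (t > tMax) { tMax = t; lStar = l; }
  }

  // l' is the most heavily loaded link on the selected detour path.
  const detour = detourOf[lStar];
  const lPrime = detour.reduce((a, b) =>
    load[a] / bandwidth[a] >= load[b] / bandwidth[b] ? a : b);

  // Solve D_l*(1 - a)/B_l* = (D_l* a + D_l')/B_l' for the re-routed fraction a.
  const D = load[lStar], B = bandwidth[lStar];
  const Dp = load[lPrime], Bp = bandwidth[lPrime];
  const alpha = (D * Bp - Dp * B) / (D * (B + Bp));

  if (alpha <= 0) return null;              // re-routing would not help

  // Shift alpha * D_l* bytes from the bottleneck onto every link of the detour.
  const shifted = alpha * D;
  load[lStar] -= shifted;
  for (const l of detour) load[l] += shifted;
  return { from: lStar, via: detour, fraction: alpha };
}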

This algorithm can be readily implemented in the controller, since link bandwidth

and the flow sizes are all available as aforementioned. The calculated routing decision

for each flow is conveyed to all the aggregators along its assigned path, where a basic

forwarding rule (flowId: nextHop) indicating its next-hop datacenter will be installed


and enforced in the data plane.

3.2.4 A Flow’s Life in Siphon

We now examine how a flow from a data analytics job is established and routed through

Siphon aggregators, using the simple case shown in Fig. 3.7. The integration between

Siphon and data parallel frameworks, such as Spark, is designed as a simple RPC-based

publish-subscribe API. To initiate a data flow to datacenter 2, the Spark executor in

datacenter 1 simply needs to send its outgoing traffic to the local aggregator, specify-

ing the destination datacenter and host with the publish API. To receive data through

Siphon, the Spark executor in datacenter 2 simply calls the subscribe API.
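Purely as an illustration of how this API is used, the snippet below sketches a hypothetical JavaScript client of the aggregator's RPC interface; the actual callers are Spark executors (JVM code), and the client object, method names and field names here are assumptions rather than Siphon's real API.

// Illustrative only: a hypothetical client for the aggregator's RPC-based
// publish/subscribe API.
async function sendShuffleBlock(aggregatorClient, block) {
  // Sender side (datacenter 1): hand outgoing data to the local aggregator,
  // tagging it with the destination datacenter and host.
  await aggregatorClient.publish({
    flowId: block.flowId,                 // identifies the flow within its coflow
    dstDatacenter: block.dstDatacenter,
    dstHost: block.dstHost,
    payload: block.bytes,
  });
}

async function receiveShuffleBlock(aggregatorClient, flowId, onData) {
  // Receiver side (datacenter 2): subscribe to the flow; the aggregator
  // delivers reassembled messages for this flow as they arrive.
  await aggregatorClient.subscribe(flowId, onData);
}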

Upon receiving a flow, the aggregator checks whether there exists a pre-installed

rule for this flow. If a match is found, the flow will be switched to the corresponding

output queue, which is connected to an aggregator in another datacenter. Otherwise,

the aggregator will consult the Siphon controller for new rules that reflect its routing

and scheduling decisions. In Fig. 3.7, the flow is sent to the output queue corresponding

to datacenter 2, which will be transferred via pre-established parallel TCP connections

between these aggregators.

When a flow arrives at the aggregator in datacenter 2, the destination in its header will

be checked by the aggregator. If the destination executor is within the local datacenter,

the flow will be handled by the RPC server. Since the destination has already subscribed

to the data from the source, the flow will be delivered successfully by the RPC server.

If the destination is in a different datacenter, this aggregator will serve as a relay and

apply rules from the controller to select its output queue.

Figure 3.7: A flow’s life through Siphon.

Figure 3.8: An architectural overview of Siphon.

3.3 Siphon: Design and Implementation

3.3.1 Overview

To realize any coflow scheduling strategies in wide-area data analytics, we need a system

that can flexibly enforce the scheduling decisions. Traditional traffic engineering [37, 42]

techniques can certainly be applied, but they are only available to cloud providers which

have full control over the traffic. Common cloud tenants do not have such flexibility. As

is concluded in Sec. 3.1, Siphon is designed and implemented as a host-based building

block to achieve this goal.

Fig. 3.8 shows a high-level overview of Siphon’s architecture. Processes, called aggre-

gator daemons, are deployed on all (or a subset of) workers, interacting with the worker

processes of the data parallel framework directly. Conceptually, all these aggregators will

form an overlay network, which is built atop inter-datacenter WANs and supports the

data parallel frameworks.

In order to ease the development and deployment of potential optimizations for inter-

datacenter transfers, the Siphon overlay is managed with the software-defined networking


principle. Specifically, aggregators operate as application-layer switches at the data plane,

being responsible for efficiently aggregating, forwarding and scheduling traffic within

the overlay. Network and flow statistics are also collected by the aggregators actively.

Meanwhile, all routing and scheduling decisions are made by the central Siphon controller.

With a flexible design to accommodate a wide variety of flow scheduling disciplines, the

centralized controller can make fine-grained control decisions, based on coflow information

provided by the resource scheduler of data parallel frameworks.

3.3.2 Data Plane

Siphon’s data plane consists of a group of aggregator daemons, collectively forming an

overlay that handles inter-datacenter transfers requested by the data parallel frameworks.

Working as application-layer switches, the aggregators are designed with two objectives: they should be simple for data parallel frameworks to use, and they should support high switching performance.

♦ Software Message Switch

The main functionality of an aggregator is to work as a software switch, which takes care

of fragmenting, forwarding, aggregating and prioritizing the data flows generated by

data parallel frameworks.

After receiving data from a worker in the data parallel framework, an aggregator

will first divide the data into fragments so that they can be easily addressed and scheduled. These data fragments are called messages. Each data flow will be split into

a sequence of messages to be forwarded within Siphon. A minimal header, with a flow

identifier and a sequence number, will be attached to each message. Upon reaching the

desired destination aggregator, they will be again reassembled and delivered to the final

destination worker.

The aggregators can forward the messages to any peer aggregators as an interme-

diate nexthop or the final destination, depending on the forwarding decisions made by


the controller. Inheriting the design of traditional OpenFlow switches, the aggregator

looks up a forwarding table that stores all the forwarding rules in a hash table, to ensure

high performance. Fortunately, wildcards in forwarding rule matching are also available,

thanks to the hierarchical organization of the flow identifiers. If neither the flow iden-

tifier nor the wildcard matches, the aggregator will consult the controller. A forwarding

rule includes a nexthop to enforce routing, and a flow weight to enforce flow scheduling

decisions.

Since messages forwarded to the same nexthop share the same link, we use a priority

queue to buffer all pending outbound messages to support scheduling decisions. Priorities

are allowed to be assigned to individual flows sharing a queue, when it is backlogged

with a fully saturated outbound link. The control plane will be responsible for assigning

priorities to each flow.
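The following is a simplified JavaScript model of this switching step; the real aggregator is a multi-threaded C++ daemon, so the class, rule format and wildcard convention below are illustrative assumptions only.

// A simplified model of the aggregator's lookup and per-nexthop priority queues.
class MessageSwitch {
  constructor(consultController) {
    this.forwardingTable = new Map();   // flowId (or wildcard prefix) -> { nextHop, priority }
    this.outputQueues = new Map();      // nextHop -> messages, kept sorted by priority
    this.consultController = consultController;
  }

  installRule(flowIdOrPrefix, nextHop, priority) {
    this.forwardingTable.set(flowIdOrPrefix, { nextHop, priority });
  }

  lookup(flowId) {
    // Exact match first, then fall back to a hierarchical wildcard prefix.
    if (this.forwardingTable.has(flowId)) return this.forwardingTable.get(flowId);
    const prefix = flowId.split(':').slice(0, -1).join(':') + ':*';
    return this.forwardingTable.get(prefix);
  }

  async forward(message) {
    let rule = this.lookup(message.flowId);
    if (!rule) {
      rule = await this.consultController(message.flowId);   // miss: ask the controller
      this.installRule(message.flowId, rule.nextHop, rule.priority);
    }
    // Enqueue into the per-nexthop priority queue (lower number = higher priority).
    const q = this.outputQueues.get(rule.nextHop) || [];
    q.push({ priority: rule.priority, message });
    q.sort((a, b) => a.priority - b.priority);
    this.outputQueues.set(rule.nextHop, q);
  }
}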

♦ Performance-Centric Implementation

Since an aggregator is I/O-bounded, it is designed and implemented with performance in

mind. It has been implemented in C++ from scratch with the event-driven asynchronous

programming paradigm. Several optimizations are adopted to maximize its efficiency.

Multi-threading. To fully utilize multiple CPU cores in modern VMs, throughout

our implementation, all operations as a result of handling events are executed in parallel

with a worker thread pool. The switch is also designed to be multi-threaded. In order

to reduce the overhead of excessive mutual exclusion locks, lock-free data structures are

used whenever possible. As an example, the forwarding table, which is read much more

frequently than being written, is protected by a reader-writer spin lock instead of a

traditional mutex lock.

Event-driven design. Events are raised and handled asynchronously, including all

network I/O events. All the components are loosely coupled with one another, as each

function in these components is only triggered when specific events it is designed to handle

are raised. As examples of such an event-driven design, the switch will start forwarding


messages in an input queue as soon as the queue raises a PacketIn event, and the output

queue will be consumed as soon as a corresponding worker TCP connection raises a

DataSent event, indicating that the outbound link is ready.

Coroutine-based pipeline design pattern. Because an aggregator may commu-

nicate with a number of peers at the same time, work conservation must be preserved. In

particular, it should avoid head-of-line blocking, where one congested flow may take all

resources and slow down other non-congested flows. An intuitive implementation based

on input and output queues cannot achieve this goal. To solve this problem, our imple-

mentation takes advantage of a utility called “stackful coroutine,” which can be considered

as a procedure that can be paused and resumed freely, just like a thread whose context

switch is controlled explicitly. In an aggregator, each received message is associated with

a coroutine, and the total number of active coroutines is bounded for the same flow. This

way, we can guarantee that non-congested flows can be served promptly, even coexisting

with resource “hogs.”

Minimized memory copying. Excessive memory copying is often an important

design flaw that affects performance negatively. We used smart pointers and reference

counting in our implementation to avoid memory copying as messages are forwarded. In

the lifetime of a message through an aggregator, it is only copied between the kernel

socket buffers for TCP connections and the aggregator’s virtual address space. Within

the aggregator, a message is always accessed using a smart pointer, and passed between

different components by copying the pointer, rather than the data in the message itself.

Periodic measurements of link quality. Siphon aggregators are also respon-

sible for measuring and reporting live statistics, such as the one-way delay, round-trip

time, and throughput on each of its inter-datacenter links, and estimates of available

inter-datacenter bandwidth. These live statistics will be reported to the controller peri-

odically for it to make informed decisions about routing and flow scheduling.

The architecture of an aggregator is illustrated in Fig. 3.9. Externally, all aggrega-

tors are connected to the central Siphon controller for control plane decisions, and are responsible for directly interacting with Apache Spark for data transfers. Internally, each aggregator is designed and implemented as a high-performance software-defined switch that operates in the application layer.

Figure 3.9: The architectural design of a Siphon aggregator.

3.3.3 Connections for Inter-Datacenter Links

In each aggregator, an instance of the Connection component implements all the neces-

sary mechanisms to send or receive data over its corresponding inter-datacenter link. It

collectively manages a set of pre-established TCP connections to another aggregator in

a different datacenter, and handles network I/O asynchronously with I/O events. The

use of multiple pre-established TCP connections in parallel helps to saturate available

bandwidth capacities across datacenters.

To coordinate and synchronize all the pre-established TCP connections to another

aggregator, all underlying TCP connections are implemented as workers, which consistently and automatically produce into or consume from the shared input queue and output queue.

In particular, whenever a worker TCP connection receives a complete message, it en-

queues it into the input queue and raises a notification event, notifying the downstream

component, the Switch, to forward it to its destination. This way, messages received from all underlying TCP connections are consumed sequentially, since the Switch only pulls data from the input queue.

Figure 3.10: The architecture of the Siphon Controller.

The design of the Connection component focuses on both high performance and the

flexibility of enforcing a variety of flow scheduling policies. For the sake of performance,

messages with known next-hop destinations are buffered in the output queue of the

corresponding Connection. Whenever there is a worker TCP connection ready to send

data, it dequeues one message from the output queue and sends it out. This will maximize

the aggregate performance as all worker TCP connections are combined.
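As a rough, event-style JavaScript model of this coordination (the actual component is written in C++, and the socket API and event names below are placeholders), a Connection might be organized as follows.

// A simplified model of the Connection component's shared-queue coordination.
class Connection {
  constructor(parallelSockets, onMessageIn) {
    this.outputQueue = [];                      // messages awaiting transmission
    this.sockets = parallelSockets;             // pre-established parallel TCP connections

    for (const sock of this.sockets) {
      // When a worker connection finishes sending, it pulls the next message
      // from the shared output queue (work-conserving behavior).
      sock.on('dataSent', () => this.trySend(sock));
      // When a worker connection receives a complete message, it hands the
      // message to the downstream Switch.
      sock.on('messageReceived', (msg) => onMessageIn(msg));
    }
  }

  enqueue(message) {
    this.outputQueue.push(message);
    // Kick any idle connection so the queue starts draining immediately.
    for (const sock of this.sockets) {
      if (sock.isIdle()) { this.trySend(sock); break; }
    }
  }

  trySend(sock) {
    const msg = this.outputQueue.shift();
    if (msg) sock.send(msg);                    // one message per ready connection
  }
}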

3.3.4 Control Plane

The controller in Siphon is designed to make flexible control plane decisions, including

flow scheduling and routing.

Although the controller is a logically centralized entity, our design objective is to

make it highly scalable, so that it is easy to be deployed on a cluster of machines or VMs

when needed. As shown in Fig. 3.10, the architectural design of the controller completely

decouples the decision making processes from the server processes that directly respond

to requests from Siphon aggregators, connecting them only with a Redis database server.

Should the need arise, the decision-making processes, server processes, and the Redis

database can be easily distributed across multiple servers or VMs, without incurring

additional configuration or management cost.

The Redis database server provides a reliable and high-performance key-value store

and a publish/subscribe interface for inter-process communication. It is used to keep


all the states within the Siphon datapath, including all the live statistics reported by

the aggregators. The publish/subscribe interface allows server processes to communicate

with decision-making processes via the Redis database.

The server processes, implemented in node.js, directly handle the connections from

all Siphon aggregators. These server processes are responsible for parsing all the reports

or requests sent from the aggregators, storing the parsed information into the Redis

database, and responding to requests with control decisions made by the decision-making

processes. How the decision-making processes are implemented is left flexible, depending on the requirements of the scheduling algorithm.

In inter-coflow scheduling, the controller requires the full knowledge of a coflow before

it starts. This is achieved by integrating the resource scheduler of the data parallel

framework to the controller’s Pub/Sub interface. Particularly in Spark, the task scheduler

running in the driver program have such knowledge as soon as the reduce tasks are

scheduled and placed on workers. We have modified the driver program, such that

whenever there are new tasks being scheduled, the generated traffic information will

be published to the controller. The incremental Monte Carlo simulations will then be

triggered on the corresponding parallel decision makers.

A key advantage of the software-defined networking architecture is that, with a central

“brain” of the whole Siphon substrate, it is straightforward to implement, deploy and

debug a wide variety of decision-making algorithms to improve performance. Our design

of the controller maximizes such flexibility of centralized decision making. Siphon features

built-in support for the rapid development of these decision makers:

First, the event-driven programming model, backed by the node.js server and Redis

publish/subscribe interface, is well suited for rapid decision maker development. Unlike

long-running processes that have to poll the database frequently, the Siphon controller

triggers decision makers on data plane events. This way, a decision maker can be imple-

mented as a callback function.
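As a rough illustration, a decision maker might be wired up as follows using the classic node_redis publish/subscribe interface; the channel names, keys and message formats are assumptions for illustration rather than Siphon's actual schema.

// A sketch of an event-triggered decision maker on the controller
// (v3-style node_redis API; names are illustrative).
const redis = require('redis');

const sub = redis.createClient();      // subscription connection
const db = redis.createClient();       // connection for key-value reads and publishes

sub.subscribe('onShuffleScheduled');

sub.on('message', (channel, payload) => {
  // Triggered only when the Spark driver reports newly scheduled shuffle flows.
  const coflow = JSON.parse(payload);  // e.g., { coflowId, flows: [...] }

  // Read the live link statistics that aggregators keep updated in Redis.
  db.hgetall('link-stats', (err, linkStats) => {
    if (err) return console.error(err);

    // Placeholder for the actual decision logic (e.g., LFGF priorities or
    // Monte Carlo-based ordering).
    const decision = { coflowId: coflow.coflowId, rules: [] };

    // Publish the decision so the server processes can push rules to aggregators.
    db.publish('decisions', JSON.stringify(decision));
  });
});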


Second, real-time measurements from the aggregators, such as available bandwidth,

are accessible to all running decision makers. In particular, such information is frequently

updated by the Siphon aggregators and persisted in the Redis database, from which it can be retrieved by the decision makers with a readily available Python package.

Third, customized events and information reports can be easily plugged in to extend

the decision makers’ capabilities. The controller allows the developers to define their

customized data plane events, and store additional information in the database. Then,

new decision makers can make use of this information in similar ways.

In addition to the measurements available on the controller (bandwidth, etc.), in our

Siphon-Spark integration, we make the controller aware of all concurrent flows generated.

As soon as tasks are scheduled by Spark, a new customized event, onShuffleScheduled,

will be raised. Based on this design and integration with Spark, we have developed two novel, orthogonal decision makers at the Siphon controller out of the box. Both are designed with the sole purpose of improving the aggregate throughput of the concurrent inter-datacenter flows generated in a Spark cluster. Unlike traditional coflow scheduling [20, 22], the proposed policies and algorithms aim to accelerate a single coflow.

3.4 Controller-Data Plane Interaction in Siphon

With software-defined networking as its underlying principle, the control and data planes

in Siphon are completely separated. The control plane is implemented with a centralized

controller, and it communicates — either proactively or reactively — with all the aggre-

gators, collecting real-time measurement statistics and deploying its control decisions.

Traditional ways of scaling the control plane cannot work smoothly in Siphon, whose

aggregators are spread across geographically distributed datacenters, located far from

each other. Siphon implements a novel controller-data plane interaction scheme. By

caching dynamic forwarding logic instead of static flow tables on the aggregators, our

scheme greatly improves the scalability of the software-defined networking architecture.


3.4.1 Inefficiency of Reactive Control in OpenFlow

In the OpenFlow-like controller-data plane interaction, the data plane will query the con-

troller about the forwarding decisions for each new flow. This approach is quite brute-

force but fine-grained because the controller can make flow-specific or even packet-specific

decisions in a straightforward manner.

This scheme incurs a heavy workload on the controller when deployed at large scale. Since the control plane has to handle all data plane events and ensure decision integrity at the same time, the single centralized controller approach suffers from high

workload and reliability issues.

Three major issues limit the scalability of this design: i) There will be a significant flow

initiation delay, since its head packets will be forwarded to the controller, during which

the communication latency adds to the delay; ii) The controller workload is high, and it

might be congested by bursty network messages because of the all-to-one traffic pattern;

iii) Flow rerouting is necessary but expensive, which may encounter severe consistency

issues across the network.

Existing approaches to scale an OpenFlow control plane, such as hierarchical or par-

titioning solutions, either sacrifice the granularity of control or limit the generality of possible

network applications. Recent progress made on the programmable data plane, as is

introduced in OpenFlow 2.0 [3], seems a promising cure for the scalability problem.

In this chapter, we implement a novel scalable and readily-deployable scheme for

controller-data plane interactions in Siphon. Inspired by the idea of a reconfigurable

data plane, the central Siphon controller caches control logic closer to the data plane

by injecting script code into the interpreter runtime on data plane nodes. Whenever a Siphon data plane node processes a new message, it directly dispatches the message

header information to the provided handler in the script code, to perform the desired

computation and actions based on locally cached controller intelligence.

Figure 3.11: An example to show the benefits of a programmable data plane in software-defined networks. There exist two available shortest paths with the same capacity between ingress Switch A and egress Switch B. The network application tries to balance the load between the two paths. The rate of Flow 1 is 3 units, while the rate of Flows 2, 3 and 4 is 2 units each. (a) OpenFlow-style: installing static rules. (b) Siphon: caching dynamic forwarding logic.

An example of a load-balancing network application is shown in Fig. 3.11. For a traditional OpenFlow network (Fig. 3.11(a)), a rule has to be installed on Switch A reactively

for each new flow. In the depicted example, four messages are required to achieve globally

optimal load-balancing. Note that the response time for rule installation will add to the

overhead of flow initiation.

In Siphon (Fig. 3.11(b)), on the contrary, the loadBalancer logic is cached on Switch

A beforehand by the central controller. When a new flow comes in, the path selection

Page 53: Optimizing Big Data Analytics Frameworks in Geographically ...€¦ · Optimizing Big Data Analytics Frameworks in Geographically Distributed Datacenters by Shuhao Liu A thesis submitted

Chapter 3. Siphon: Expediting Inter-Datacenter Coflows 42

is completely handled by the cached loadBalancer function, without querying the con-

troller. At the same time, the controller will proactively update Switch A with mea-

surement statistics of path load, ensuring the global optimality of forwarding decisions.

Only two update messages are necessary in this case. Moreover, loadBalancer takes local measurement statistics into consideration. Notably, the delay() function returns the monitored delay of a given path, and calling it on a failed path returns Inf. Therefore, failed paths are avoided automatically, without waiting for the controller to react.
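Such cached logic, following the sketch depicted in Fig. 3.11(b), might look roughly like the code below; the measurement stub and the knowledge-update handler are illustrative assumptions.

// Local measurement state maintained by the aggregator's monitor (stubbed here).
const pathDelays = { X: 12, Y: 15 };             // ms; a failed path would report Infinity
const delay = (path) => (path in pathDelays ? pathDelays[path] : Infinity);

let sortedPathByLoad = ['X', 'Y'];               // global knowledge pushed by the controller

function loadBalancer(flow) {
  // Choose the least-loaded live path using only cached knowledge and local
  // measurements, without a round trip to the controller.
  for (const path of sortedPathByLoad) {
    if (delay(path) < Infinity) return path;
  }
  return null;                                   // no usable path: fall back to the controller
}

// Illustrative handler for proactive knowledge updates from the controller.
function onKnowledgeUpdate(update) {
  if (update.sortedPathByLoad) sortedPathByLoad = update.sortedPathByLoad;
}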

In conclusion, the benefits of dynamic logic caching are three-fold: i) the time overhead of flow initiation is reduced because a round of communication latency is avoided; ii) both the computation and network load on the controller are likely to be reduced, since statistics are updated on demand; and iii) route selection can react to network failures

quickly, so as to enable fast recovery.

3.4.2 Caching Dynamic Forwarding Logic

Generally speaking, message processing in software-defined networks involves three stages:

header matching, computing and action applying. Messages from different flows are iden-

tified through header matching. Then, computation is required to get the desired list

of actions (e.g., header modification, forwarding to a given next-hop node). Finally, the

actions will be applied to the message.

In software-defined networking, the computing stage is completely migrated to the

control plane. The core concept of our approach is to allow most of the messages to be

processed without consulting the central controller.

To this end, the Siphon controller preemptively caches the control logic (instead of

static forwarding rules) on the data plane. As follow-up actions, it frequently compiles

and updates the required global statistics. In Siphon, the communication messages be-

tween the controller and aggregators are encoded in JavaScript Object Notation (JSON)

strings, which naturally enables the logic caching via Javascript code injection. The Siphon aggregators are embedded with a lightweight Javascript interpreter to execute the code.

Figure 3.12: The message processing diagram in Siphon.

Specifically, the Siphon controller programs the message processing logic in a series of Javascript objects. Each of the objects has a key processMessage() method, defining the logic to process messages with a given forwarding preference label in the header. Each object is then serialized into a JSON message to be sent to aggregators. The message is then parsed and rebuilt into the original Javascript objects. These objects that contain the message processing logic, namely the Message Processing Objects, are then registered with the Javascript interpreter on the data plane node. Note that all of the above tasks can be completed even before the first flow in the network initiates.

The processing of a single message is illustrated in Fig. 3.12. Whenever there is a

new incoming message, the header matcher will unpack the message headers, reading the

specified destination information and forwarding preference labels. Such information will

be used by the message dispatcher to select an appropriate stored Message Processing Object to process the message.

The member function processMessage() of the selected Message Processing Object

will then be called, with an argument indicating the list of destinations. The return value

of this function is a list of pre-defined actions, e.g., rewriting the Siphon message header,

relaying to a given next hop node or further consulting the central controller. Finally,

the action executor deploys the actions, which concludes a message process cycle.
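A simplified sketch of this cycle is shown below; the action names and header layout are assumptions for illustration.

// A simplified sketch of the aggregator-side message processing cycle.
const messageProcessingObjects = new Map();   // preference label -> MPO

function registerMPO(mpo) {
  // MPOs arrive from the controller as code and are registered by label.
  messageProcessingObjects.set(mpo.prefType, mpo);
}

function processIncomingMessage(message, actions) {
  // 1. Header matching: read the destination list and preference label.
  const { prefLabel, destList } = message.header;

  // 2. Dispatch: pick the MPO registered for this preference label.
  const mpo = messageProcessingObjects.get(prefLabel);
  if (!mpo) return actions.toController(message);   // no cached logic: ask the controller

  // 3. Compute: the cached logic returns a list of actions; through prototype
  //    inheritance it can also read this.knowledgeDB and this.monitorStat.
  const actionList = mpo.processMessage(destList);

  // 4. Apply: the action executor deploys each action in order.
  for (const act of actionList) {
    if (act.type === 'toNextHop') actions.toNextHop(message, act.nextHop);
    else if (act.type === 'writeLabel') actions.writeLabel(message, act.label);
    else actions.toController(message);
  }
}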

The above process seems simple; however, it cannot be completed without a partic-

ular slice of global network knowledge. When a new Message Processing Object is installed on an aggregator, it can also subscribe to updates of some customized variables (i.e., global knowledge) from the controller. This global knowledge can consist of intermediate variables used by the algorithm inside processMessage(), which cannot be obtained locally.

Moreover, the measurement results of local network delays and bandwidth statistics

can be used directly by processMessage(). As a result, forwarding decision making can react directly to data plane events, such as link failures, without involving the

controller.

3.4.3 Customizing the Message Processing Logic

The software-defined networking architecture allows network operators to implement a va-

riety of novel network applications in pure software, customizing the control logic over

different network flows. In Siphon, programming a network application is as simple as in

OpenFlow. To implement a customized Siphon network application, there are two steps

in general: defining a Message Processing Object and subscribing a variable to be cached

as global knowledge.

Message Processing Objects are the core objects that define the control logic. Different

Message Processing Objects can be used to process messages with different labels. On

data plane nodes, new Message Processing Objects are registered to the controller proxy,

and inherit the database and monitor access.

Figure 3.13: Prototype inheritance of Message Processing Objects, illustrating the programming interfaces.

Thanks to the prototype inheritance mechanism in JavaScript, customized Message Processing Objects can be programmed with the simple interfaces shown in Fig. 3.13. The prefType property defines the type of messages to be handled, whose value matches the preference label in message headers. The processMessage() function implements the message processing logic, as described previously. Meanwhile, variables that are shared among the flows of the same prefType can be defined or modified through the sharedVar property. For example, a counter can be added as a shared variable to count the number of processed messages. Access to the global knowledge database and the monitor statistics is made directly via the this.knowledgeDB and this.monitorStat interfaces.

The second step is to define the data subscribed from the controller. Together with the Message Processing Object installation, the results of some user-defined functions can be subscribed to update the knowledge database.
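To make these two steps concrete, the following sketch restates the interface in Scala. It is an illustration only: the actual Message Processing Objects are JavaScript objects that inherit from the controller proxy prototype (Fig. 3.13), and every type and member name below is hypothetical.

```scala
// Illustrative restatement of the Message Processing Object interface in Scala.
// The real implementation is JavaScript (Fig. 3.13); all names here are hypothetical.
sealed trait Action
case class RewriteHeader(fields: Map[String, String]) extends Action // rewrite the Siphon header
case class RelayTo(nextHop: String) extends Action                   // relay to a next-hop node
case object ConsultController extends Action                         // fall back to the controller

trait MessageProcessingObject {
  def prefType: String                 // matches the preference label in message headers
  var sharedVar: Long                  // state shared by all flows of the same prefType
  def knowledgeDB: Map[String, Any]    // subscribed global knowledge, kept fresh by the controller
  def monitorStat: Map[String, Double] // local delay and bandwidth measurements

  // Core control logic: decide what to do with one message and its destinations.
  def processMessage(payload: Array[Byte], destinations: Seq[String]): Seq[Action]
}

// A toy object in the spirit of the "max-bandwidth" application (Tab. 3.2): relay each
// destination via the best next hop recorded in the knowledge database, or consult the
// controller when the knowledge is missing.
class MaxBandwidthMPO(val knowledgeDB: Map[String, Any],
                      val monitorStat: Map[String, Double]) extends MessageProcessingObject {
  val prefType: String = "max-bandwidth"
  var sharedVar: Long = 0L             // e.g., a counter of processed messages

  def processMessage(payload: Array[Byte], destinations: Seq[String]): Seq[Action] = {
    sharedVar += 1
    val bestNextHop = knowledgeDB.get("bestNextHop")
      .map(_.asInstanceOf[Map[String, String]])
      .getOrElse(Map.empty[String, String])
    destinations.map { dst =>
      bestNextHop.get(dst) match {
        case Some(hop) => RelayTo(hop)
        case None      => ConsultController // missing knowledge: default to the controller
      }
    }
  }
}
```

In the real system, such an object would additionally subscribe to the controller-side function that keeps entries like bestNextHop up to date, as described above.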

We implemented several prototype network applications in the form of Message Processing Objects. Tab. 3.2 summarizes their implementation.


Network App     | Summary                                                                                                                       | Global Knowledge
"min-delay"     | Choose the next hop which directs to the path with the minimum possible one-way delay. Desired bandwidth can be specified as a prerequisite. | Best 3 paths with the least latency for each possible destination; the corresponding available bandwidth is given, too.
"max-bandwidth" | Choose the next hop which directs to the path with the most available bandwidth.                                              | Best 3 paths with the most available bandwidth for each possible destination.
"bw-allocation" | Allocate a fraction of the available bandwidth to a single flow, according to the desired relative weight.                    | None.
"load-balancer" | Choose the next hop for a message to balance the load among all available paths. Either per-message or per-flow load balancing can be specified. | Same as above.

Table 3.2: Summary of prototype network applications implemented in Siphon

Typically, these applications are sufficient to satisfy the general traffic engineering needs of operating Siphon over inter-datacenter networks. Note that multicast is automatically supported by these applications. Also, per-message multipath routing can be enabled by the options specified in the preference label field of a message header.

3.4.4 Discussion

One may think that caching control logic on the relays violates the software-defined networking principle. Indeed, in some cases, the aggregators in Siphon seem to work in the same way as traditional routers. However, this is not the case, because the cached control logic is provisioned by the central controller. In other words, the controller is free to delete or update the cached logic at any time. Eventually, the forwarding decisions are still under the full control of the central controller.

Caching forwarding logic is compatible with the OpenFlow style of control, that is,

caching static forwarding rules. Directly consulting the controller is the default action to


be applied to a Siphon message. In particular, the message will be forwarded to the con-

troller under the following conditions: i) there is no matching Message Processing Object

installed; ii) processMessage() raises an exception due to lack of global knowledge.

Another concern about this design is that, because forwarding decisions can be made locally, it might be infeasible to ensure the correctness of message forwarding. For example, the forwarding decisions might result in a loop. Siphon solves this problem by consistently updating the knowledge cached on the aggregators. Since the cached knowledge on different relays is drawn from the same database, a loop can easily be avoided even with greedy path selection. Moreover, the locality of forwarding decision making guarantees consistency during a network update: there will not be any inconsistent rules installed on the aggregators.

3.5 Performance Evaluation

In this section, we present our results from a comprehensive set of experimental evaluations with Siphon, organized into three parts. First, we provide a coarse-grained comparison to show the application-level performance improvements of using Siphon, evaluating our framework against the baseline Spark with a comprehensive set of machine learning workloads. Then, we answer the question of how Siphon expedites a single coflow by putting a simple shuffle under the microscope. Finally, we evaluate our inter-coflow scheduling algorithm, using a state-of-the-art heuristic as the baseline.

3.5.1 Macro-Benchmark Tests

Experimental Setup. In this experiment, we run 6 different machine learning workloads on a 160-core cluster that spans 5 geographical regions. Performance

metrics such as application runtime, stage completion time and shuffle read time are to

be evaluated. The shuffle read time is defined as the completion time of the slowest data fetch in a shuffle. It reflects the time needed for the last task to start computing, and it determines the stage completion time to some extent.

[Figure: average application run time in seconds for ALS, PCA, BMM, Pearson, W2V and FG under Siphon and Spark; the annotated differences between the two are −26.1, −14.1, −8.5, −23.8, −4.1 and +1.7, respectively.]

Figure 3.14: Average application run time.

Workload | #Shuffles | Total Bandwidth Usage (GB) | Extra Bandwidth Usage (MB) | Siphon Shuffle Read Time (s) | Spark Shuffle Read Time (s) | Shuffle Read Reduction (%) | Cost Difference (¢)
ALS      | 18 | 40.47 | 2186.3 | 46.8 | 90.5 | 48.3 | -26.56
PCA      | 2  | 0.51  | 37.6   | 3.3  | 13.7 | 76.1 | -6.80
BMM      | 1  | 42.3  | 2911.1 | 48.9 | 97.8 | 50.0 | -29.26
Pearson  | 2  | 0.57  | 23.8   | 3.6  | 13.1 | 72.6 | -6.23
W2V      | 5  | 0.45  | 10.2   | 5.8  | 9.6  | 39.9 | -2.49
FG       | 2  | 0.57  | 20.5   | 1.77 | 1.87 | 5.4  | -0.05

Table 3.3: Summary of shuffles in different workloads (for the run with the median application runtime).

The Spark-Siphon cluster. We set up a 160-core, 520 GB-memory Spark cluster. Specifically, 40 n1-highmem-2 instances are evenly distributed across 5 Google Cloud datacenters (N. Carolina, Oregon, Belgium, Taiwan, and Tokyo). Each instance provides 2 vCPUs, 13 GB of memory, and 20 GB of SSD storage. Except for one instance in the N. Carolina region, which works as both the Spark master and driver, all instances serve as Spark standalone executors. All instances run Apache Spark 2.1.0.

The Siphon aggregators run on 10 of the executors, 2 in each datacenter. An aggrega-

tor is responsible for handling Pub/Sub requests from 4 executors in the same datacenter.

The Siphon controller runs on the same instance as the Spark master, in order to minimize

the communication overhead between them.


[Figure: per-stage comparison of stage completion times and shuffle times under Siphon and Spark for (a) Alternating Least Squares (shown as CDFs), (b) Principal Component Analysis, (c) Block Matrix Multiplication, (d) Pearson's Correlation, (e) Word2Vec distributed presentation of words, and (f) FP-growth frequent item sets.]

Figure 3.15: Shuffle completion time and stage completion time comparison (for the run with the median application run time).

Note that, to make the comparison fair, we do not launch extra resources for the Siphon aggregators. Even though they share some computation resources and system I/O with their co-located Spark executors, the consumption is minimal.

Workload specifications. 6 machine learning workloads, with multiple jobs and

Page 61: Optimizing Big Data Analytics Frameworks in Geographically ...€¦ · Optimizing Big Data Analytics Frameworks in Geographically Distributed Datacenters by Shuhao Liu A thesis submitted

Chapter 3. Siphon: Expediting Inter-Datacenter Coflows 50

multiple stages, are used for evaluation.

• ALS: Alternating Least Squares.

• PCA: Principal Component Analysis.

• BMM: Block Matrix Multiplication.

• Pearson: Pearson’s correlation.

• W2V: Word2Vec distributed presentation of words.

• FG: FP-Growth frequent item sets.

These workloads are representative ones from the Spark-Perf Benchmark1, the official Spark performance test suite created by Databricks2. The workloads that are not evaluated in this chapter share the same characteristics with one or more of the selected ones, in terms of network traffic pattern and computation intensiveness. We set the scale factor to 2.0, which is designed for a 160-core, 600 GB-memory cluster.

1 https://github.com/databricks/spark-perf
2 https://databricks.com/

Methodologies. With different workloads, we compare the performance of job exe-

cutions, with or without Siphon integrated as its cross-datacenter data transfer service.

Note that, without Siphon, Spark works in the same way as out-of-the-box, vanilla Spark, except for one slight change to the TaskScheduler. Our modification eliminates

the randomness in the Spark task placement decisions. In other words, each task in a

given workload will be placed on a fixed executor across different runs. This way, we can

guarantee that the impact of task placement on the performance has been eliminated.

Performance. We run each workload on the same input dataset 5 times. The average application run times across the 5 runs are compared in Fig. 3.14. Later, we focus on job execution details, taking the run with the median application run time as an example. Table 3.3 summarizes the total shuffle size and shuffle read time of each

workload. Further, Fig. 3.15 breaks down the time for network transfers and computation

in each stage, providing more insight.

Among the 6 workloads, BMM, the most network-intensive workload, benefits most


from Siphon. It enjoys a 23.6% reduction in average application run time. The reason

is that it has one huge shuffle — sending more than 40 GB of data in one shot — and

Siphon can help significantly. This observation is corroborated by Fig. 3.15(c), which shows

that Siphon manages to reduce around 50 seconds of shuffle read time.

Another network-intensive workload is ALS, an iterative workload. The average run

time has been reduced by 13.2%. The reason can be easily seen with the information

provided in Table 3.3. During a single run of the application, 40.47 GB of data is shuffled

through the network, in 18 stages. Siphon collectively reduces the shuffle time by more

than 30 seconds. Fig. 3.15(a) shows the CDFs of shuffle completion times and stage

completion times, using Spark and Siphon respectively (note the x-axis is in log scale).

As we observe, the long tail of the stage completion time distribution is reduced because

Siphon has significantly improved the performance of all shuffle phases.

The rest of the workloads generate much less shuffle traffic, but their shuffle read times have also been reduced (by 5.4% to 76.1%).

PCA and Pearson are two workloads that have network-intensive stages. Their shuffle

read time constitutes a significant share of some of the stages, but they also have computation-intensive stages that dominate the application run time. For these workloads,

Siphon greatly impacts the job-level performance, by minimizing the time used for shuffle

(Table 3.3).

W2V and FG are two representative workloads whose computation time dominates

the application execution. With these workloads, Siphon can hardly make a difference

in terms of application run time, which is mostly decided by the computation stragglers.

An extreme example is shown in Fig. 3.15(e). Even though the shuffle read time has been

reduced by 4 seconds (Table 3.3), the computation stragglers in Stage 4 and Stage 6 will

still slow down the application by 0.7% (Fig. 3.14). Siphon is not designed to accelerate

these computation-intensive data analytic applications.

[Figure: average sort job completion times across 5 runs for Spark, Naive, Multipath and Siphon, broken into map and reduce stages.]

Figure 3.16: Average job completion time across 5 runs.

[Figure: reduce stage execution time across 5 runs for Spark, Naive, Multipath and Siphon, broken into task execution and shuffle read.]

Figure 3.17: Breakdowns of the reduce stage execution across 5 runs.

Cost Analysis. As the acceleration of Spark shuffle reads in Siphon is partially due to the relay of traffic through intermediate datacenters, a natural concern is how it affects the

overall cost for running the data analytics jobs. On the one hand, the relay of traffic

increases the total WAN bandwidth usage, which is charged by public cloud providers.

On the other hand, the acceleration of jobs reduces the cost for computation resources.

We present the total cost of running the machine learning jobs in Table 3.3, based on Google Cloud pricing3. Each instance used in our experiment costs $1.184 per hour, and our cluster costs 0.6578¢ per second. As a comparison, inter-datacenter bandwidth only costs 1 cent per GB.

3 https://cloud.google.com/products/calculator/

As a result, Siphon actually reduced the total cost of running all workloads (Table 3.3).

On the one hand, a small portion of inter-datacenter traffic has been relayed. On the

other hand, the idle time of computing resources has been reduced significantly, which

exceeds the extra bandwidth cost.
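The cost differences reported in Table 3.3 are consistent with the following back-of-the-envelope model, reconstructed here from the prices quoted above rather than quoted from the original analysis:

$$\text{cost saving (¢)} \;\approx\; \big(T^{\text{Spark}}_{\text{shuffle}} - T^{\text{Siphon}}_{\text{shuffle}}\big)\,[\text{s}] \times 0.6578~\text{¢/s} \;-\; B_{\text{extra}}\,[\text{GB}] \times 1~\text{¢/GB}.$$

For instance, for BMM, $(97.8 - 48.9) \times 0.6578 - 2.9111 \times 1 \approx 29.26$ cents, matching the $-29.26$¢ entry in Table 3.3.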

3.5.2 Single Coflow Tests

Experimental Setup. In the previous experiment, Siphon works well in terms of speeding up the coflows in complex machine learning workloads. However, one question remains unanswered: how does each component of Siphon contribute to the overall reduction in the coflow completion time? In this experiment, we use a smaller cluster to answer this


[Figure: CDFs of shuffle read time in seconds for Spark, Naive, Multipath and Siphon.]

Figure 3.18: CDF of shuffle read time (for the run with the median job completion time).

question by examining a single coflow more closely.

The cross-datacenter Spark cluster consists of 19 workers and 1 master, spanning

5 datacenters. The Spark master and driver are on a dedicated node in Oregon. The

geographical location of worker nodes is shown in Fig. 3.19, in which the number of

executors in different datacenters is shown in the black squares. The same instance type

(n1-highmem-2) is used.

Most software configurations are the same as the settings used in Sec. 3.5.1, including

the Spark patch. In other words, the cluster still offers a fixed task placement for a given

workload.

In order to study the system performance with a single coflow, we decided to use the Sort application from the HiBench benchmark suite [40]. Sort has only two

stages, one map stage of sorting input data locally and a reduce stage of sorting after

a full shuffle. The only coflow will be triggered at the start of the reduce stage, which

is easier to analyze. We prepare the benchmark by generating 2.73 GB of raw input in

HDFS. Every datacenter in the experiment stores an arbitrary fraction of the input data

without replication, but the distribution of data sizes is skewed.

We compare the shuffle-level performance achieved by the following 4 schemes, with

the hope of providing a comprehensive analysis of the contribution of each component of

Siphon:


• Spark: The vanilla Spark framework, with fixed task placement decisions, as the

baseline for comparison.

• Naive: Spark using Siphon as its data transfer service, without any flow scheduling

or routing decision makers. In this case, messages are scheduled in a round-robin

manner, and the inter-datacenter flows are sent directly through the link between the source and the destination aggregators.

• Multi-path: The Naive scheme with the multi-path routing decision maker enabled

in the controller.

• Siphon: The complete Siphon evaluated in Sec. 3.5.1. Both LFGF intra-coflow

scheduling and multi-path routing decision makers are enabled.

Job and stage level performance. Fig. 3.16 illustrates the performance of sort

jobs achieved by the 4 schemes aforementioned across 5 runs, with respect to their job

completion times, as well as their stage completion times for both map and reduce stages.

As we expected, all 3 schemes using Siphon have improved job performance by accelerat-

ing the reduce stage, as compared to Spark. With Naive, the performance improvement is

due to a higher throughput achieved by pre-established parallel TCP connections between

Siphon aggregators. The improvement of Multi-path over Naive is attributed to a further

reduction of reduce stage completion times — with multi-path routing, the network load

can be better balanced across links to achieve a higher throughput and faster network

transfer times. Finally, it is not surprising that Siphon, benefiting from the advantages of both

intra-coflow scheduling and Multi-path routing, achieves the best job performance.

To obtain fine-grained insights on the performance improvement, we break down

the reduce completion time further into two parts: the shuffle read time (i.e., coflow

completion time) and the task execution time. As is shown in Fig. 3.17, the improvement

of Naive over Spark is mainly attributed to a reduction of the shuffle read time. Multi-path achieves a substantial improvement in shuffle read time over Naive, since the network

transfer completes faster by mitigating the bottleneck through multi-path routing. Siphon


achieves a similar shuffle read time to Multi-path, with a slight reduction in the task execution time. This implies that multi-path routing is the main contributing factor to the performance improvement, while intra-coflow scheduling helps marginally with straggler mitigation, as expected.

Shuffle: Spark vs. Naive. To allow a more in-depth analysis of the performance

improvement achieved by the baseline Siphon (Naive), we present the CDFs of shuffle

read times achieved by Spark and Naive, respectively, in Fig. 3.18. Compared with the

CDF of Spark that exhibits a long tail, all the shuffle read times are reduced by ∼10 s

with Naive, thanks to the improved throughput achieved by persistent, parallel TCP

connections between aggregators.

Shuffle: intra-coflow scheduling and multi-path routing. We further study

the effectiveness of the decision makers, with Multi-path and Siphon’s CDFs presented in

Fig. 3.18.

With multi-path routing enabled, both Multi-path and Siphon achieve shorter com-

pletion times (∼50 s) for their slowest flows respectively, compared to Naive (>60 s) with

direct routing. Such an improvement is contributed by the improved throughput with a

better balanced load across multiple paths. It is also worth noting that the percentage of

short completion times achieved with Multi-path is smaller than with Naive — 22% of shuffle reads complete within 18 s with Multi-path, while 35% do with Naive. The reason

is that by rerouting flows from bottleneck links to lightly loaded ones via their alternative

paths, the network load, as well as shuffle read times, will be better balanced.

It is also clearly shown that with LFGF scheduling, the completion time of the slow-

est shuffle read is almost the same as that achieved by Multi-path. This meets our expectation, since the slowest flow will always finish at the same time regardless of the scheduling order, given a fixed amount of network capacity.

We further illustrate the inter-datacenter traffic during the sort job run time in

Fig. 3.19, to intuitively show the advantage of multi-path routing.

[Figure: the five datacenters (S. Carolina, Tokyo, Taiwan, Oregon and Belgium) with the number of executors in each datacenter (3 or 4) shown in black squares, and the inter-datacenter traffic volumes in MB annotated on the links between them.]

Figure 3.19: The summary of inter-datacenter traffic in the shuffle phase of the sort application.

[Figure: distributions of the measured available bandwidth (Mbps) on each directed inter-datacenter link among Toronto (T), Victoria (V) and Montreal (M): T-V, T-M, V-T, V-M, M-T and M-V.]

Figure 3.20: Bandwidth distribution among datacenters.

[Figure: average and 90th-percentile normalized CCT (%) for coflows grouped by size into the <25%, 25-49%, 50-74% and ≥75% bins, as well as over all coflows.]

Figure 3.21: Average and 90th percentile CCT comparison.

The sizes of the traffic between each pair of datacenters are shown around the bidirectional arrow line,

the thickness of which is proportional to the amount of available bandwidth shown in

Table 3.1.

The narrow link from Taiwan to S. Carolina becomes the bottleneck, which needs to

transfer the largest amount of data. With our multi-path routing algorithm, part of the

traffic will be rerouted through Oregon. We can observe that the original traffic load

along this path is not heavy (only 149 MB from Taiwan to Oregon and 170 MB from

Oregon to S. Carolina), and both alternate links have more available bandwidth. This

demonstrates that our routing algorithm works effectively in selecting optimal paths to

balance loads and alleviate bottlenecks.


3.5.3 Inter-Coflow Scheduling

In this section, we evaluate the effectiveness of our Monte Carlo simulation-based inter-coflow scheduling algorithm, by comparing the average and the 90th-percentile Coflow Completion Time (CCT) with existing heuristics.

Testbed. To make the comparison fair, we set up a testbed on a private cloud, with 3 datacenters located in Victoria, Toronto, and Montreal, respectively. We have conducted

a long-term bandwidth measurement among them, with more than 1000 samples collected

for each link. Their distributions are depicted in Fig. 3.20, which are further used in the

online Monte Carlo simulation.

Benchmark. We use the Facebook benchmark [22] workload, which is a 1-hour

coflow trace from 150 workers. We assume workers are evenly distributed in the 3 data-

centers, and generate aggregated flows on inter-datacenter links. To avoid overflow, the

flow sizes are scaled down, with the average load on inter-datacenter links reduced by

30%.

Methodology. A coflow generator, together with a Siphon aggregator, is deployed

in each datacenter. All generated traffic goes through Siphon, which can enforce proper

inter-coflow scheduling decisions on inter-datacenter links. As a baseline, we experiment

with the Minimum Remaining Time First (MRTF) policy, which is the state-of-the-art

heuristic with full coflow knowledge [86]. The measured CCTs are then normalized to the performance of this baseline algorithm.

Performance. Fig. 3.21 shows that Monte Carlo simulation-based inter-coflow schedul-

ing outperforms MRTF in terms of both average and tail CCTs. Considering all coflows,

the average CCT is reduced by ∼10%. Since the coflow size in the workload follows a

long-tail distribution, we further categorize coflows into 4 bins, based on the total coflow size. Apparently, the performance gain mostly stems from expediting the largest bin – elephant coflows that can easily overlap with each other. Beyond MRTF, Monte Carlo simulations can carefully study all possible near-term coflow orderings with respect to the


unpredictable flow completion times, and enforce a decision that is statistically optimal.

3.5.4 Aggregators: Stress Tests

Switching capacities and CPU overheads. To evaluate the switching capacity and

CPU overhead of our application-layer switch in a Siphon aggregator daemon, we have

conducted two experiments with VM instances that have different numbers of vCPUs.

Two Siphon aggregators, connected to each other in a “dumbbell” topology, are deployed on two VM instances, with one of them receiving data from 16 mock Spark workers, and the other eventually forwarding the data to a destination worker. To avoid

bottlenecks on the network link between the aggregators before reaching their switching

capacities, we run all the VM instances in the same datacenter, and use a message size

of 1 MB, the same fragmentation size as deployed to support Spark.

The switching capacities of the second aggregator, when running on three different

types of VMs, are illustrated in Fig. 3.22, with the x-axis representing the switching

rate, i.e., the number of messages switched per core within a second, and the y-axis standing for the average CPU load of the aggregator. When running on a 2-core instance, the

switching capacity is reached when each core handles 2000 messages per second. When

4-core and 8-core instances are used, the capacities are 1600 and 1000 messages per core

per second, respectively. Due to the increasing synchronization cost incurred between

concurrent threads when writing received messages to their shared output queues, it is

unavoidable that the switching capacity per vCPU core decreases as more cores become

available in a VM instance. Fortunately, the total number of messages that an aggregator

is able to forward per second with a full vCPU load — 4000 (32 Gbps on 2 cores), 6400 (51.2 Gbps on 4 cores) and 8000 (64 Gbps on 8 cores), respectively — is more than

sufficient to saturate inter-datacenter link capacities or even intra-datacenter links.
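As a quick sanity check on these aggregate figures (a unit conversion only, not an additional measurement), with 1 MB messages:

$$4000~\tfrac{\text{messages}}{\text{s}} \times 1~\tfrac{\text{MB}}{\text{message}} \times 8~\tfrac{\text{Mb}}{\text{MB}} = 32{,}000~\text{Mbps} = 32~\text{Gbps},$$

and, likewise, 6400 and 8000 messages per second correspond to 51.2 Gbps and 64 Gbps, respectively.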

[Figure: average CPU load (%) versus switching rate (messages per core per second, up to 2000) for aggregators running on 2-core, 4-core and 8-core instances.]

Figure 3.22: The switching capacities of a Siphon aggregator on three different types of instances.

Data Size (MB)            | 2     | 8      | 32    | 128   | 512
Throughput-Direct (Mbps)  | 51.9  | 98.1   | 159.1 | 224.1 | 250.0
Throughput-Siphon (Mbps)  | 95.7  | 225.9  | 270.1 | 278.6 | 279.4
Improvement (%)           | +84.4 | +130.3 | +69.8 | +24.3 | +11.8

Table 3.4: The overall throughput for 6 concurrent data fetches.

It is worth noting that the aggregator daemons do not necessarily slow down the computation on the co-located worker. Even though an aggregator consumes some CPU resources of the worker, the resources are shared temporally. This is because whenever the worker starts

sending data to other workers, the current phase of computation has already concluded.

The next phase of computation will not start until the worker has received all of its data, at which point the corresponding aggregator is no longer busy.

Inter-datacenter throughput. In the third experiment, we wish to evaluate how

Siphon takes advantage of the pre-established parallel TCP connections between the

aggregators, to accelerate the data fetches in wide-area data analytics.

We deploy 6 nodes to simulate the fetchers in the N. Carolina region, and 6 servers in Belgium to respond to their fetch requests. During the test, all fetchers start at the

same time, and each fetcher will request a file of a given size from its assigned server. As

soon as all fetchers have received all the desired data, we stop timing and calculate the

aggregated average throughput for these fetches.

The results are shown in Table 3.4. The second row of the table shows the achieved

throughput by establishing new TCP connections to fetch data, which simulates the


default data fetch strategies in Spark. The third row shows the average throughput

when the fetched data is sent using Siphon, via two aggregators deployed in the two

datacenters, respectively.

Apparently, Siphon has dramatically improved the aggregated throughput to fetch

data across different datacenters, especially when the sizes of data are relatively small

(<128 MB). When the data size is 8 MB, Siphon has more than doubled the throughput.

This is mostly because the pre-established connections in Siphon skip the slow-start phase

of TCP, which usually takes a long time on such inter-datacenter links.

As the data size increases (>128 MB), the throughput improvement is less significant

because direct TCP connections are able to ramp up. However, we can still benefit from

the multiplexed connections — messages are multiplexed through all parallel connec-

tions, such that stragglers are avoided. When the data size is very small (2 MB), the

improvement is less significant because the network flows are completed rapidly, and the

constant overhead to use the Siphon aggregator as a relay matters. Fortunately, Siphon

still offers >80% better aggregated throughput.

3.6 Summary

We make four original contributions in this work:

• We have proposed a novel and practical inter-coflow scheduling algorithm for wide-

area data analytics. Starting from analyzing the network model, new challenges in inter-

datacenter coflow scheduling have been identified and addressed.

• We have designed an intra-coflow scheduling policy and a multi-path routing algo-

rithm that improve WAN utilization in wide-area data analytics.

• We have designed a novel interaction scheme between the controller and the data plane in software-defined WANs, which can significantly reduce the overhead of rule

updates.

• We have built Siphon, a transparent and unified building block that can easily


extend existing data parallel frameworks with the out-of-the-box capability of expediting inter-

datacenter coflows.


Chapter 4

Optimizing Shuffle in Wide Area Data Analytics

In this chapter, we propose to take a systems-oriented approach to improve the bandwidth

utilization on inter-datacenter links during the shuffle phase. Previous studies show that

the shuffle phase constitutes over 60% of the job completion time for network-intensive jobs [31],

and it will become an even more severe bottleneck in Wide-Area Data Analytics, when

shuffle traffic is sent via inter-datacenter WANs.

Rather than employing the common fetch-based shuffle, we enforce a push-based

shuffle mechanism, which allows early inter-datacenter transfers and reduces link idle

times. As a result, the shuffle completion time and the job completion time can be

reduced.

Our new system framework is first and foremost designed to be practical: it has been

implemented in Apache Spark to optimize the runtime performance of wide-area analytic

jobs in a variety of real-world benchmarks. To achieve this objective, our framework

focuses on the shuffle phase, and strategically aggregates the output data of mapper tasks

in each shuffle phase to a subset of datacenters. In our proposed solution, the output

of mapper tasks is proactively and automatically pushed to be stored in the destination

datacenters, without requiring any intervention from a resource scheduler. Our solution

is orthogonal and complementary to existing task assignment mechanisms proposed in


the literature, and it remains effective even with the simplest task assignment strategy.

Compared to existing task assignment mechanisms, the design philosophy in our

proposed solution is remarkably different. The essence of traditional task assignment is

to move a computation task to be closer to its input data to exploit data locality; in

contrast, by proactively moving data in the shuffle phase from mapper to reducer tasks,

our solution improves data locality even further. As the core of our system framework, we

have implemented a new method, called transferTo(), on Resilient Distributed Datasets

(RDDs), which is a basic data abstraction in Spark. This new method proactively sends

data in the shuffle phase to a specific datacenter that minimizes inter-datacenter traffic.

It can be either used explicitly by application developers or embedded implicitly by the

job scheduler. With the implementation of this method, the semantics of aggregating the

output data of mapper tasks can be captured in a simple and intuitive fashion, making

it straightforward for our system framework to be used by existing Spark jobs.

With the new transferTo() method at its core, our new system framework enjoys a

number of salient performance advantages. First, it pipelines inter-datacenter transfers

with the preceding mappers. Starting data transfers early can help improve the utilization

of inter-datacenter links. Second, when task execution fails at the reducers, repetitive

transfers of the same datasets across datacenters can be avoided, since they are already

stored at the destination datacenter by our new framework. Finally, the application

programming interface (API) in our system framework is intentionally exposed to the

application developers, who are free to use this mechanism explicitly to optimize their

job performance.

Note that, even though the analysis and implementation in this chapter are based entirely on Apache Spark, the push-based shuffle mechanism can be applied to other general data analytics frameworks, including Apache Hadoop.

We have deployed our new system framework in a Spark cluster across six Amazon

EC2 regions. By running workloads from the HiBench [40] benchmark suite, we have


conducted a comprehensive set of experimental evaluations. Our experimental results

have shown that our framework speeds up the completion time of general analytic jobs

by 14% to 73%. Also, with our implementation, the impact of bandwidth and delay

jitters in wide-area networks is minimized, resulting in a lower degree of performance

variations over time.

4.1 Background and Motivation

4.1.1 Fetch-based Shuffle

Both Apache Hadoop and Spark are designed to be deployed in a single datacenter. Since

datacenter networks typically have abundant bandwidth, network transfers are considered

even less expensive than local disk I/O in Spark [84]. With this assumption in mind, the

shuffle phase is implemented with a fetch-based mechanism by default. To understand

the basic idea in our proposed solution, we need to provide a brief explanation of the

fetch-based shuffle in Spark.

In Spark, a data analytic job is divided into several stages, and launched in a stage-

by-stage manner. A typical stage starts with a shuffle, when all the output data from

the previous stages is already available. The workers of the new stage, i.e., reducers in

this shuffle, will fetch the output data from the previous stages, which constitutes the

shuffle input, stored as a collection of local files on the mappers. Because the reducers are

launched at the same time, shuffle input is fetched concurrently, resulting in a concurrent

all-to-all communication pattern. For better fault tolerance, the shuffle input will not be

deleted until the next stage finishes. When failures occur on the reducer side, the related

files will be fetched from the mappers again, without the need to re-run the map tasks.
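As a concrete illustration of these stage boundaries in stock Spark (this snippet involves none of our modifications, and the paths are placeholders), the job below contains exactly one shuffle, triggered by reduceByKey(): the map stage writes its output as local shuffle files, which the reduce-stage tasks later fetch.

```scala
// A vanilla two-stage Spark job. Stage 1 (textFile, flatMap, map) ends with a shuffle
// write to local disk; stage 2, created by reduceByKey, fetches that shuffle input.
import org.apache.spark.{SparkConf, SparkContext}

object WordCountStages {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount-stages"))

    val counts = sc.textFile("hdfs:///input/docs")   // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))                        // end of the map stage: shuffle write
      .reduceByKey(_ + _)                            // stage boundary: reducers fetch shuffle input

    counts.saveAsTextFile("hdfs:///output/wordcount") // placeholder output path
    sc.stop()
  }
}
```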

4.1.2 Problems with Fetch in Wide-Area Data Analytics

Though effective within a single datacenter, it is quite a different story when it comes

to wide-area data analytics across geographically distributed datacenters, due to limited


bandwidth availability on wide-area links between datacenters [39]. Given the potential

bottlenecks on inter-datacenter links, there are two major problems with fetch-based

shuffles.

First, as a shuffle will only begin when all mappers are finished — a barrier-like

synchronization — inter-datacenter links are usually well under-utilized most of the time,

but likely to be congested with bursty traffic when the shuffle begins. The links are

under-utilized, because when some mappers finish their tasks earlier, their output cannot

be transmitted immediately to the reducers. Yet, when the shuffle is started by all

the reducers at the same time, they initiate concurrent network flows to fetch their

corresponding shuffle input, leading to bursty traffic that may contend for the limited

inter-datacenter bandwidth, resulting in potential congestion.

Second, when failures occur on reducers with the traditional fetch-based shuffle mech-

anism, data must be fetched again from the mappers over slower inter-datacenter network

links. Since a stage will not be considered complete until all its tasks are executed successfully, the slowest tasks, called stragglers, will directly affect the overall stage

completion time. Re-fetching shuffle input over inter-datacenter links will slow down

these stragglers even further, and negatively affects the overall job completion times.

4.2 Transferring Shuffle Input across Datacenters

To improve the performance of shuffle in wide-area data analytics, we will need to answer

two important questions: when and where should we transfer the shuffle input from

mappers to reducers? The approach we have taken in our system framework is simple

and quite intuitive: we should proactively push shuffle input as soon as any data partition

is ready, and aggregate it to a subset of worker datacenters.


[Figure: timelines (t = 0 to 16) of stage N (map, shuffle write) on workers A and B and stage N+1 (shuffle read, reduce), comparing (a) fetch-based shuffle, where inter-datacenter data transfers only begin when stage N+1 starts, with (b) proactive pushing, where the data transfers overlap with the map stage.]

Figure 4.1: Mappers typically cannot finish their work at the same time. In this case, if we proactively push the shuffle input to the datacenter where the reducer is located (b), the inter-datacenter link will be better utilized as compared to leaving it on the mappers (a).

4.2.1 Transferring Shuffle Input: Timing

Both problems of fetch-based shuffle stem from the fact that the shuffle input is co-

located with mappers in different datacenters. Therefore, they can be solved if, rather

than asking reducers to fetch the shuffle input from the mappers, we can proactively push

the shuffle input to those datacenters where the reducers are located, as soon as such

input data has been produced by each mapper.

As an example, consider a job illustrated in Fig. 4.1. Reducers in stage N + 1 need to

fetch the shuffle input from mappers A and B, located in another datacenter. We assume

that the available bandwidth across datacenters is 1/4 of the bandwidth of a single datacenter network link,

which is an optimistic estimate. Fig. 4.1(a) shows what happens with the fetch-based

shuffle mechanism, where shuffle input is stored on A and B, respectively, and transferred

as soon as stage N + 1 starts at t = 10. Two flows share the inter-datacenter link, allowing both reducers to start at t = 18. In contrast, in Fig. 4.1(b), shuffle input is pushed to the

datacenter hosting the reducer immediately after it is computed by each mapper. Inter-

datacenter transfers are allowed to start at t = 4 and t = 8, respectively, without the need to share link bandwidth. As a result, the reducers will be able to start at t = 14.

[Figure: timelines of stage N and stage N+1 on workers A and B when a reduce task fails, comparing (a) fetch-based shuffle, where the failed reducer re-fetches its shuffle input across datacenters, with (b) proactive pushing, where the re-fetch is served from the local datacenter.]

Figure 4.2: In the case of reducer failures, if we proactively push the shuffle input to the datacenter where the reducer is located (Fig. 4.2(b)), data re-fetching across datacenters can be eliminated, reducing the time needed for failure recovery as compared to the case where shuffle input is located on the mappers (Fig. 4.2(a)).

Fig. 4.2 shows an example of the case of reducer failures. With the traditional fetch-

based shuffle mechanism, the failed reducer will need to fetch its input data again from

another datacenter, if such data is stored with the mappers, shown in Fig. 4.2(a). In

contrast, if the shuffle input is stored with the reducer instead when it fails, the reducer

can read from the local datacenter, which is much more efficient.

4.2.2 Transferring Shuffle Input: Choosing Destinations

Apparently, proactively pushing shuffle input to be co-located with reducers is beneficial,

but a new problem arises with this new mechanism: since the reduce tasks will not

be placed until the map stage finishes, how can we decide the destination hosts of the

proactive pushes?

It is indeed a tough question, because the placement of reducers is actually decided

by the shuffle input distribution at the start of each stage. In other words, our choice of

push destinations will in turn impact the reducer placement. Although it seems a cycle of

unpredictability, but we think there already exist enough hints to give a valid answer in

Page 79: Optimizing Big Data Analytics Frameworks in Geographically ...€¦ · Optimizing Big Data Analytics Frameworks in Geographically Distributed Datacenters by Shuhao Liu A thesis submitted

Chapter 4. Optimizing Shuffle in Wide Area Data Analytics 68

Datacenter 2

Datacenter 1

Patition 2222

Patition 3333

Patition 1111

Reducer C

Reducer B

Reducer A

123

1

32

321

Final StageWorker

FinalResults

Figure 4.3: A snippet of a sample execution graph of a data analytic job.

wide-area data analytics. Specifically, for the sake of minimizing cross-datacenter traffic,

there is a tendency to co-locate tasks and their shuffle input at the datacenter level. We can exploit this tendency as a vital clue.

Our analysis starts by gaining a detailed understanding of shuffle behaviors in MapRe-

duce. Fig. 4.3 shows a snippet of an abstracted job execution graph and its data transfers. The

depicted shuffle involves 3 partitions of shuffle input, which will then be dispatched to 3

reducers. In this case, each partition of the shuffle input is saved as 3 shards based on

specific user-defined rules, e.g., the keys in the key-value pairs. During data shuffle, each

shard will be fetched by the corresponding reducer, forming an all-to-all traffic pattern.

In other words, every reducer will access all partitions of the shuffle input, fetching the

assigned shards from each.

We assume that the shuffle input is placed in $M$ datacenters. The sizes of the partitions stored in these datacenters are $s_1, s_2, \ldots, s_M$, respectively. Without loss of generality, let the sizes be sorted in non-ascending order, i.e., $s_1 \ge s_2 \ge \ldots \ge s_M$. Also, each partition is divided into $N$ shards, with respect to the $N$ reducers $R_1, R_2, \ldots, R_N$. Though their sizes differ in practice, all shards of a particular partition tend to be about the same size for the sake of load balancing [66]. Thus, we assume the shards in a partition are equal in size.

If a reducer $R_k$ is placed in Datacenter $i_k$, the total volume of its data fetched from non-local datacenters will be
$$d^{(k)}_{i_k} = \sum_{\substack{1 \le j \le M \\ j \ne i_k}} d^{(k)}_{i_k,j} = \sum_{\substack{1 \le j \le M \\ j \ne i_k}} \frac{1}{N} s_j .$$
Each term in the summation, $d^{(k)}_{i_k,j}$, denotes the size of the data to be fetched from Datacenter $j$.

Let $S$ be the total size of the shuffle input, i.e., $S = \sum_{j=1}^{M} s_j$. We have
$$d^{(k)}_{i_k} = \sum_{j=1}^{M} \frac{1}{N} s_j - \frac{1}{N} s_{i_k} = \frac{1}{N}\,(S - s_{i_k}) \ge \frac{1}{N}\,(S - s_1). \quad (4.1)$$
The equality holds if and only if $i_k = 1$. In other words, the minimum cross-datacenter

traffic can be achieved when the reducer is placed in the datacenter which stores the most

shuffle input.

The inequality in Eq. (4.1) holds for every reducer $R_k$ ($k \in \{1, 2, \ldots, N\}$). Then, the total volume of cross-datacenter traffic incurred by this shuffle satisfies
$$D = \sum_{k=1}^{N} d^{(k)}_{i_k} \ge N \cdot \frac{1}{N}\,(S - s_1) = S - s_1. \quad (4.2)$$
Again, the equality holds if and only if $i_1 = i_2 = \ldots = i_N = 1$.

Without any prior knowledge of the application workflow, we reach two conclusions to

optimize a general wide-area data analytic job.

First, given a shuffle input distribution, the datacenter with the largest fraction of

shuffle input will be favored by the reducer placement. This is a direct corollary of


Eq. (4.2).

Second, shuffle input should be aggregated to a subset of datacenters as much as

possible. The minimum volume of data to be fetched across datacenters is $S - s_1$. Therefore, in order to further reduce the cross-datacenter traffic in a shuffle, we should increase $s_1/S$, which is the fraction of shuffle input placed in Datacenter 1. As an extreme case, if all

shuffle input is aggregated in Datacenter 1, there is no need for cross-datacenter traffic

in future stages.

In summary, compared to a scattered placement, a better placement decision would be to aggregate all shuffle input into the subset of datacenters that store the largest fractions.

Without loss of generality, in the subsequent sections of this chapter, we will aggregate

to a single datacenter as an example.
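As a small illustration of this rule (a standalone helper written for exposition only, not part of Spark or of our implementation), the aggregator datacenter and the resulting lower bound on cross-datacenter traffic follow directly from the per-datacenter shuffle input sizes:

```scala
// Sketch: choose the aggregator datacenter and compute the traffic lower bound S - s1
// from Eq. (4.2), given how much shuffle input (in bytes) each datacenter stores.
object AggregatorChoice {
  def choose(inputSizeByDC: Map[String, Long]): (String, Long) = {
    val (bestDC, s1) = inputSizeByDC.maxBy(_._2) // datacenter storing the largest fraction
    val total = inputSizeByDC.values.sum         // S, the total shuffle input size
    (bestDC, total - s1)                         // minimum cross-datacenter traffic
  }
}

// Example with sizes in GB: aggregate to "oregon"; at least 35 GB must cross datacenters.
// AggregatorChoice.choose(Map("oregon" -> 40L, "belgium" -> 25L, "taiwan" -> 10L))
```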

4.2.3 Summary and Discussion

According to the analysis throughout this section, we learn that the strategy of Push/Aggregate,

i.e., proactively pushing the shuffle input to be aggregated in a subset of worker datacen-

ters, can be beneficial in wide-area data analytics. It can reduce both stage completion

time and traffic pressure, because of the higher utilization of inter-datacenter links. Also,

duplicated inter-datacenter data transfers can be avoided in case of task failures, further

reducing the pressure on the bottleneck links.

One may argue that even with the aggregated shuffle input, a good task placement

decision is still required, or it may generate more overall inter-datacenter traffic. Indeed,

the problem itself sounds like a dilemma, where task placement and shuffle input place-

ment depend on each other. However, with the above hints, we are able to break the

dilemma by placing the shuffle input first. After that, a better task placement decision

can be made by even the default resource schedulers, which has a simple straight-forward

strategy to exploit host-level data locality.

Conceptually speaking, the ultimate goal of Push/Aggregate is to proactively improve


the best data locality that a possible task placement decision can achieve. Thus, the

Push/Aggregate operations are completely orthogonal and complementary to the task

placement decisions.

4.3 Implementation on Spark

In this section, we present our implementation of the Push/Aggregate mechanism on

Apache Spark. We take Spark as an example in this chapter due to its better performance [84] and better support for machine learning algorithms with MLlib [2]. However,

the idea can be applied to Hadoop as well.

4.3.1 Overview

In order to implement the Push/Aggregate shuffle mechanism, we are required to modify

two default behaviors in Spark: i) Spark should be allowed to directly push the output of

an individual map task to a remote worker node, rather than storing on the local disk;

and ii) the receivers of the output of map tasks should be selected automatically within

the specific aggregator datacenters.

One possible implementation would be to replace the default shuffle mechanism completely, by enabling remote disks, located in the aggregator datacenters, to serve as potential storage in addition to the local disk on a mapper. Though this approach is straightforward and simple, there are two major issues.

On the one hand, although the aggregator datacenters are specified, it is hard for

mappers to decide the exact destination worker nodes to place the map output. In

Spark, it is the Task Scheduler’s responsibility to make centralized decisions on task

and data placement, considering both data locality and load balance among workers.

However, a mapper by itself, without synchronization with the global Task Scheduler,

can hardly have sufficient information to make the decision in a distributed manner, while

still keeping the Spark cluster load-balanced. On the other hand, the push will not start


until the entire map output is ready in the mapper memory, which introduces unnecessary

buffering time.

Both problems are tough to solve, requiring undesirable changes to other Spark com-

ponents such as the Task Scheduler. To tackle the former issue, it is natural to ask:

rather than implementing a new mechanism to select the storage of map output, is it

possible to leave the decisions to the Task Scheduler?

Because the Task Scheduler only has knowledge of computation tasks, we need to

generate additional tasks in the aggregator datacenter, whose computation is as simple

as receiving the output of mappers.

In this chapter, we add a new transformation on RDDs, transferTo(), to achieve

this goal. From a high level, transferTo() provides a means to explicitly transfer

a dataset to be stored in a specified datacenters, while the host-level data placement

decisions are made by the Spark framework itself for the sake of load balance. In addition,

we implement an optional mechanism in Spark to automatically enforce transferTo()

before a shuffle. This way, if this option is enabled, the developers are allowed to use the

Push/Aggregate mechanism in all shuffles without changing a single line of code in their

applications.

4.3.2 transferTo(): Enforced Data Transfer in Spark

transferTo() is implemented as a method of the base RDD class, the abstraction of

datasets in Spark. It takes one optional parameter, which gives all worker hosts in the

aggregator datacenter. However, in most cases, the parameter can be omitted, such that

all data will be transferred to a datacenter that is likely to store the largest fraction of

the parent RDD, as is suggested in Sec. 4.2.3. It returns a new RDD, TransferredRDD,

which represents the dataset after the transfer operation. Therefore, transferTo() can

be used in the same way as other native RDD transformations, including chaining with

other transformations.
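As a usage sketch (assuming our patched Spark; the optional host-list argument shown in the comment is a simplification of the actual parameter), transferTo() chains with other transformations just like any native one:

```scala
// Sketch of intended usage on our patched Spark, where RDD exposes transferTo().
import org.apache.spark.{SparkConf, SparkContext}

object TransferToExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("transferTo-example"))

    val counts = sc.textFile("hdfs:///logs/*")   // input spread across datacenters
      .map(line => (line.split(" ")(0), 1))      // map stage: emit (key, 1) pairs
      .transferTo()                              // push the map output toward the aggregator
                                                 // datacenter as it is produced; a host list
                                                 // may also be passed explicitly
      .reduceByKey(_ + _)                        // the reducers now read shuffle input locally

    counts.saveAsTextFile("hdfs:///output/counts")
    sc.stop()
  }
}
```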


When an analytic application is submitted, Spark will interpret the transformation by

launching an additional set of receiver tasks, only to receive all data in the parent RDD.

Then, the Task Scheduler can place them in the same manner as other computation tasks,

thus achieving automatic host-level load balancing. Because the preferredLocations

attributes of these receiver tasks are set to be in the aggregator datacenters, the default

Task Scheduler will satisfy these placement requirements as long as the datacenters have

workers available. This way, from the application's perspective, the parent RDD is

explicitly pushed to the aggregator datacenters, without violating any default host-level

scheduling policies.
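For reference, the snippet below sketches how such placement preferences can be expressed to the default Task Scheduler through Spark's getPreferredLocations() hook. It is a simplification only: the actual TransferredRDD additionally receives the parent partition's data over the network from the mapper host, instead of recomputing it as this sketch does.

```scala
// Simplified sketch: a wrapper RDD that only expresses a placement preference for hosts
// in the aggregator datacenter. The real TransferredRDD also receives the parent's output
// from the mapper host rather than recomputing the parent partition.
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class PreferredLocationRDD[T: ClassTag](prev: RDD[T], aggregatorHosts: Seq[String])
  extends RDD[T](prev) {

  override protected def getPartitions: Array[Partition] = prev.partitions

  // Spread receiver tasks over the aggregator datacenter's workers; the default Task
  // Scheduler honors these hints whenever those hosts have free slots.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(aggregatorHosts(split.index % aggregatorHosts.size))

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    prev.iterator(split, context)
}
```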

Besides, a bonus point of transferTo() is that, since the receiver tasks require no

shuffle from the parent RDD, they can be pipelined with the preceding computation

tasks. In other words, if transferTo() is called upon the output dataset of a map task,

the actual data transfer will start as soon as there is a fraction of data available, without

waiting until the entire output dataset is ready. This pipelining feature is enabled by

Spark without any further change, which automatically solves the second issue mentioned

in Sec. 4.3.1.

It is worth noting that transferTo() can be directly used as a developer API. For

developers, it provides the missing function that allows explicit data migration across

worker nodes. transferTo() enjoys the following desirable features:

Non-Intrusiveness and Compatibility. The introduction of the new API modifies

no original behavior of the Spark framework, maximizing the compatibility with ex-

isting Spark applications. In other words, changes made to the Spark codebase regard-

ing transferTo() are completely incremental, rather than being intrusive. Thus, our

patched version of Spark maintains 100% backward compatibility with the legacy code.

Consistency. The principal programming concept remains consistent. In Spark, RDD

is the abstraction of datasets. The APIs allow developers to process a dataset by ap-

plying transformations on the corresponding RDD instance. The implementation of


(a) InputRDD.map(…).reduce(…)   (b) InputRDD.map(…).transferTo([A]).reduce(…)

Figure 4.4: An example to show how the preferredLocations attribute works without (a) or with (b) the transferTo() transformation. A* represents all available hosts in datacenter A, while Ax represents the host which is selected as the storage of the third map output partition.

transferTo() inherits the same principle.

Minimum overhead. transferTo() strives to eliminate unnecessary overhead intro-

duced by enforced data transfers. For example, if a partition of the dataset is already located in the specified datacenter, no cross-node transfer is made. Also, unnecessary disk I/O

is avoided.

4.3.3 Implementation Details of transferTo()

As a framework for building big data analytic applications, Spark strives to serve the

developers. By letting the framework itself make numerous low-level decisions auto-

matically, the developers are no longer burdened by the common problems in distributed

computing, e.g., communication and synchronization among nodes. Spark thus provides

such a high-level abstraction that developers are allowed to program as if the cluster were

a single machine.

An easier life comes at the price of less control. The details of distributed comput-


ing, including communications and data transfers among worker nodes, are completely

hidden from the developers. In implementing transferTo(), where we intend to explicitly control the cross-node transfer of intermediate data, Spark does not expose such functionality to the application developers.

Here we close this gap, by leveraging the internal preferredLocations attribute of

an RDD.

♦ preferredLocations in Spark

It is a native attribute of each partition in every RDD, used to specify host-level data locality preferences. It plays an important role when the Task Scheduler places the corresponding computation on individual worker nodes. In other words, the Task Scheduler takes preferredLocations as a list of higher-priority hosts, and strives to satisfy the placement preferences whenever possible.

A simple example is illustrated in Fig. 4.4 (a), where the input dataset is transformed

by a map() and a reduce(). The input RDD has 3 partitions, located on two hosts in

Datacenter A and one host in Datacenter B, respectively. Thus, 3 corresponding map

tasks are generated, with preferredLocations the same as input data placement. Since

the output of map tasks is stored locally, the preferredLocations of all reducers will

be the union of the mapper hosts.

This way, the Task Scheduler can have enough hints to place tasks to maximize host-

level data locality and minimize network traffic.

♦ Specifying the Preferred Locations for transferTo() Tasks

In our implementation of transferTo(), we generate an additional computation task

right after each map task, whose preferredLocations attribute filters out all hosts that

are not in the aggregator datacenters.
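A minimal sketch of how such a receiver RDD could constrain its placement is shown below. The class and member names (TransferredRDD, aggregatorHosts) follow the description in this chapter, but the body is only illustrative and is not the exact code of our patched Spark.

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative sketch: a receiver RDD whose partitions prefer hosts located in
// the aggregator datacenter. Only the preferredLocations mechanism is shown;
// the computation simply iterates over the parent partition.
class TransferredRDD[T: scala.reflect.ClassTag](
    parent: RDD[T],
    aggregatorHosts: Seq[String])          // worker hosts in the aggregator datacenter
  extends RDD[T](parent) {

  override def getPartitions: Array[Partition] = parent.partitions

  // Restrict the candidate hosts to the aggregator datacenter, so that the Task
  // Scheduler places the receiver task there whenever a worker is available.
  override def getPreferredLocations(split: Partition): Seq[String] =
    aggregatorHosts

  // The receiver computation is trivial: it merely pulls the parent's data.
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    parent.iterator(split, context)
}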

Why do we launch new tasks, rather than directly changing the preferredLocations

of mappers? The reason is simple: if mappers are directly placed in the aggregator data-


center, it will be the raw input data that is transferred across datacenters. In most cases,

it is undesirable because map output is very likely to have a smaller size as compared to

the raw input data.

If the parent mapper is already located in the aggregator datacenter, the generated task will do nothing; if not, i.e., the parent partition of map output needs to be transferred, the corresponding task will provide a list of all worker nodes in the aggregator

datacenter as the preferredLocations. In the latter case, the Task Scheduler will select

one worker node from the list to place the task, which simply receives output from the

corresponding mapper.

As another example, Fig. 4.4 (b) shows how transferTo() can impact the preferredLocations

of all tasks in a simple job. As compared to Fig. 4.4 (a), the map output is explicitly trans-

ferred to Datacenter A. Because the first two partitions are already placed in Datacenter

A, the two corresponding transferTo() tasks are completely transparent. On the other

hand, since the third partition originated in Datacenter B, the subsequent transferTo()

task should prefer any host in Datacenter A. As a result of task execution, the map output partition will eventually be transferred to a host in Datacenter A that is selected by the Task Scheduler. Finally, since all input of the reducers is in Datacenter A, the

shuffle can happen within a single datacenter, realizing the Push/Aggregate mechanism.

Note that we can omit the destination datacenter of transferTo(). If no parameter is

provided, transferTo() will automatically decide the aggregator datacenter, by selecting

the one with the most partitions of map output.
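The default selection could be sketched as follows; datacenterOf() is a hypothetical helper that maps a worker host to its datacenter identifier, and is not part of Spark.

// Sketch: choose the aggregator datacenter as the one hosting the most
// partitions of the parent (map output) RDD.
def chooseAggregatorDatacenter(
    partitionHosts: Seq[String],             // preferred host of each map output partition
    datacenterOf: String => String): String = {
  partitionHosts
    .map(datacenterOf)                       // datacenter of each partition
    .groupBy(identity)                       // group partitions by datacenter
    .mapValues(_.size)                       // count partitions per datacenter
    .maxBy(_._2)._1                          // pick the datacenter with the most partitions
}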

♦ Optimized Transfers in the case of Map-side Combine

There is a special case in some transformations, e.g., reduceByKey(), which require

MapSideCombine before a shuffle. Strictly speaking, MapSideCombine is a part of the reduce task, but it allows the output of map tasks to be combined on the mappers before being sent through the network, in order to reduce the traffic.

In wide-area data analytics, it is critical to reduce cross-datacenter traffic for the sake


of performance. Therefore, our implementation of transferTo() makes smart decisions,

by performing MapSideCombines before transfer whenever possible. In transferTo(),

we pipeline any MapSideCombine operations with the preceding map task, and avoid the

repetitive computation on the receivers before writing the shuffle input to disks.

4.3.4 Automatic Push/Aggregate

Even though transferTo() is enough to serve as the fundamental building block of

Push/Aggregate, a mechanism is required to enforce transferTo() automatically, with-

out the explicit intervention from the application developers. To this end, we modified

the default DAGScheduler component in Spark, to add an optional feature that automat-

ically inserts transferTo() before all potential shuffles in the application.

The programmers can enable this feature by setting a property option, spark.shuffle.aggregation,

to true in their Spark cluster configuration file or in their code. We did not enable

this feature by default for backward compatibility considerations. Once enabled, the

transferTo() method will be embedded implicitly and automatically into the code before

each shuffle, such that the shuffle inputs can be pushed to the aggregator datacenters.
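For example, the option can be enabled either in spark-defaults.conf or programmatically through SparkConf, as in the sketch below; the property name spark.shuffle.aggregation is the one introduced above, and everything else is standard Spark usage.

import org.apache.spark.{SparkConf, SparkContext}

// Enable automatic Push/Aggregate for all shuffles in this application.
val conf = new SparkConf()
  .setAppName("WideAreaWordCount")
  .set("spark.shuffle.aggregation", "true")   // the property added by our patch

val sc = new SparkContext(conf)

// No further code changes are needed: transferTo() is inserted implicitly
// before each shuffle by the modified DAGScheduler.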

Specifically, when a data analytic job is submitted, we use DAGScheduler to embed

the necessary transferTo() transformations into the originally submitted code. In Spark,

DAGScheduler is responsible for rebuilding the entire workflow of a job based on con-

secutive RDD transformations. Also, it decomposes the data analytic job into several

shuffle-separated stages.

Since DAGScheduler natively identifies all data shuffles, we propose to add a transferTo()

transformation ahead of each shuffle, such that the shuffle input can be aggregated.

Fig. 4.5 illustrates an example of implicit transferTo() embedding. Since groupByKey()

triggers a shuffle, the transferTo() transformation is embedded automatically right before it to start proactive transfers of the shuffle input.
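In code, the rewrite mirrors the example in Fig. 4.5; In1 and In2 denote the two key-value input RDDs whose partitions reside in DC1 and DC2, and the filter predicate is hypothetical.

// Origin code, as submitted by the developer (cf. Fig. 4.5):
val inRDD = In1 ++ In2
inRDD
  .filter { case (_, value) => value > 0 }   // hypothetical predicate
  .groupByKey()
  .collect()

// Produced code, after the DAGScheduler implicitly embeds transferTo()
// right before the shuffle triggered by groupByKey():
inRDD
  .filter { case (_, value) => value > 0 }
  .transferTo()
  .groupByKey()
  .collect()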

Note that because transferTo() is inserted automatically, no parameter is


Figure 4.5: An example of implicit embedding of the transferTo() transformation. transferTo() aggregates all shuffle input in DC1, before the groupByKey() transformation starts. For the partition natively stored in DC1, transferTo() simply does nothing.

provided to the method. Therefore, it works the same way as when the aggregator datacenter parameter is omitted, i.e., the datacenter that generates the largest fraction of shuffle input should

be chosen. We approximate the optimal selection by choosing the datacenter storing the

largest amount of map input, which is a known piece of information in MapOutputTracker

at the beginning of the map task.

4.3.5 Discussion

In addition to the basic feature that enforces aggregation of shuffle input, the implemen-

tation of transferTo() can trigger interesting discussions in wide-area data analytics.

Reliability of Proactive Data Transfers. One possible concern is that, since the

push mechanism for shuffle input is a new feature in Spark, the reliability of computation,

e.g., fault tolerance, might be compromised. However, this is not the case.


Because transferTo() is implemented by creating additional receiver tasks rather

than changing any internal implementations, all native features provided by Spark are

inherited. The introduced proactive data transfers, from the Spark framework’s perspec-

tive, are the same as regular data exchanges between a pair of worker nodes. Therefore,

in case of failure, built-in recovery mechanisms, such as retries or relaunches, will be

triggered automatically in the same manner.

Expressing Cross-region Data Transfers as Computation. Essentially, transferTo()

provides a new interpretation of inter-datacenter transfers. In particular, they can be expressed as a form of computation, since transferTo() is implemented as a transformation. This conforms with our intuition that moving a large volume of data across datacenters consumes computation and network resources comparable to those of a normal computing task.

This concept can help in several ways. For example, inter-datacenter data transfers can be shown in the Spark WebUI. This can be helpful for debugging wide-area data analytic jobs, by visualizing the critical inter-datacenter traffic.

Implicit vs. Explicit Embedding. Instead of implicitly embedding transferTo() using the DAGScheduler, the developers are allowed to explicitly control the data placement at the granularity of datacenters. In some real-world data analytic applications, this is meaningful because the developers know their data better than the framework does.

For example, it is possible in production that the shuffle input has a larger size than

the raw data. In this case, to minimize inter-datacenter traffic, it is the raw data rather

than the shuffle input that should be aggregated. The developers can be fully aware of

this situation; however, it is difficult for the Spark framework itself to make this call,

resulting in an unnecessary waste of bandwidth.

Another example is the cached datasets. In Spark, the developers are allowed to

call cache() on any intermediate RDD, in order to persist the represented dataset in

memory. These cached datasets will not be garbage collected until the application exits.


In practice, the intermediate datasets that will be used several times in an application

should be cached to avoid repetitive computation. In wide-area data analytics, caching

these datasets across multiple datacenters is extremely expensive, since reusing them will

induce repetitive inter-datacenter traffic. Fortunately, with the help of transferTo(),

the developers are allowed to cache after all data is aggregated in a single datacenter,

avoiding the duplicated cross-datacenter traffic.
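As an illustration of this pattern, a sketch with a hypothetical iterative computation is shown below; the dataset is pushed to a single datacenter once, cached there, and then reused.

import org.apache.spark.rdd.RDD

// Sketch: aggregate an intermediate RDD in one datacenter before caching it,
// so that iterative reuse does not repeatedly cross datacenter boundaries.
val edges: RDD[(Long, Long)] = sc.textFile("hdfs:///input/edges")
  .map { line =>
    val parts = line.split("\\s+")
    (parts(0).toLong, parts(1).toLong)
  }

// Without transferTo(), the partitions of `edges` stay spread across datacenters,
// and every iteration that reuses them pays the inter-datacenter cost again.
val localizedEdges = edges.transferTo().cache()

for (i <- 1 to 3) {
  // hypothetical iterative computation reusing the cached, aggregated dataset
  val numEdges = localizedEdges.count()
  println(s"iteration $i processed $numEdges edges")
}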

Limitations. Even though the Push/Aggregate shuffle has many desirable features, it does have limitations that users should be aware of. The effectiveness of transferTo() relies on sufficient computation resources in the aggregator datacenter. It will launch additional tasks in the aggregator datacenters, in which more computation resources will be consumed. If the chosen aggregator datacenter cannot complete all reduce tasks because of insufficient resources, the reducers will eventually be placed in other datacenters, which would be less effective.

We think this limitation is acceptable in wide-area data analytics for two reasons. On

the one hand, Push/Aggregate basically trades more computation resources for lower job

completion times and less cross-datacenter traffic. Because the cross-datacenter network

resources are the bottleneck in wide-area data analytics, the trade-off is reasonable. On

the other hand, in practice, it is common that a Spark cluster is shared by multiple jobs,

such that the available resources within one datacenter are more than enough for a single job. Besides, when the cluster is multiplexed by many concurrent jobs, it is very likely that the workload can be rebalanced across datacenters, keeping the utilization high.

4.4 Experimental Evaluation

In this section, we present a comprehensive evaluation of our proposed implementation.

Our experiments are deployed across multiple Amazon Elastic Compute Cloud (EC2)

regions. Selected workloads from HiBench [40] are used as benchmarks to evaluate both the job-level and the stage-level performance.


Figure 4.6: The geographical locations and the number of instances in our Spark cluster. Instances in 6 different Amazon EC2 regions are employed (N. Virginia, N. California, São Paulo, Frankfurt, Singapore, and Sydney). Each region has 4 instances running, except N. Virginia, where two extra special nodes are deployed.

Workload      Specification
WordCount     The total size of generated input files is 3.2 GB.
Sort          The total size of generated input data is 320 MB.
TeraSort      The input has 32 million records. Each record is 100 bytes in size.
PageRank      The input has 500,000 pages. The maximum number of iterations is 3.
NaiveBayes    The input has 100,000 pages, with 100 classes.

Table 4.1: The specifications of the five workloads used in the evaluation.

The highlights of our evaluation results are as follows:

1. Our implementation speeds up workloads from the HiBench benchmark suite, re-

ducing the average job completion time by 14% ∼ 73%.

2. The performance is more predictable and stable, despite the bandwidth jitters on inter-datacenter links.

3. The volume of cross-datacenter traffic can be reduced by about 16% ∼ 90%.

4.4.1 Cluster Configurations

Amazon EC2 is one of the most popular cloud service providers today. It provides

computing resources that are hosted in its datacenters around the globe. Since EC2 is a production environment for a great number of big data analytic applications, we decided to run our experimental evaluation by leasing instances across regions.


Cluster Resources. We set up a Spark cluster with 26 nodes in total, spanning 6

different geographically distributed regions on different continents, as is shown in Fig. 4.6.

Four worker nodes are leased in each datacenter. The Spark master node and the Hadoop Distributed File System (HDFS) NameNode are deployed on 2 dedicated instances in the N. Virginia region, respectively.

All instances in use are of the type m3.large, which has 2 vCPUs, 7.5 GB of memory,

and a 32 GB Solid-State Drive (SSD) as disk storage. The network performance of the

instances is reported as “moderate”. Our measurement shows that there is approximately

1 Gbps of bandwidth capacity between a pair of instances within a region. However, the

cross-region network capacity varies over time. Our preliminary investigation is consistent

with previous empirical studies [39,57]. The available bandwidth of inter-datacenter links

fluctuates greatly. Some links can have as low as 80 Mbps of capacity, while other links

may have up to 300 Mbps bandwidth.

Software Settings. The instances in our cluster are running a Linux Operating

System, Ubuntu 14.04 LTS 64-bit (HVM). To set up a distributed file system, we use

HDFS from Apache Hadoop 2.6.4. Our implementation is developed based on Apache

Spark 1.6.1, built with Java 1.8 and Scala 2.11.8. The Spark cluster is started in standalone mode, without the intervention of external resource managers. This way, we let Spark's internal data locality mechanism make the task placement decisions in a coarse-grained and greedy manner.

Workload Specifications. Within the cluster, we run five selected workloads of

the HiBench benchmark suite, WordCount, Sort, TeraSort, PageRank, and NaiveBayes.

These workloads are good candidates for testing the efficiency of data analytic frameworks, with increasing complexity. Among the workloads, WordCount is the simplest, involving only one shuffle. PageRank and NaiveBayes are relatively more complex and require several iterations at runtime, with multiple consecutive shuffles. They are two representative machine learning workloads. The workloads are


Figure 4.7: The average job completion time under different HiBench workloads. For each workload, we present a 10% trimmed mean over 10 runs, with an error bar representing the interquartile range as well as the median value.

configured to run at “large scale,” which is one of the default options in HiBench. The

specifications of their settings are listed in Table 4.1. The maximum parallelism of both

map and reduce is set to 8, as there are 8 cores available within each datacenter.

Baselines. We use two naive solutions in wide-area data analytics, referred to as “Spark” and “Centralized”, as the baselines to compare with our proposed shuffle mechanism. “Spark” represents the deployment of Spark across geo-distributed datacenters, without any optimization in terms of the wide-area network. The job execution will be completely blind to the network bottleneck. The “Centralized” scheme refers to the naive and greedy solution in which all raw data is sent to a single datacenter before being processed. After all data is centralized, Spark processes it within that single datacenter.

As a comparison, Spark patched with our proposed shuffle mechanism is referred to as “AggShuffle” in the remainder of this section, meaning the shuffle input is aggregated in a single datacenter. Note that we do not explicitly add any transferTo() transformations; only the implicitly embedded transformations are involved in the experiments, leaving the benchmark source code unchanged.


4.4.2 Job Completion Time

The completion time of a data analytic job is the primary performance metric. Here,

we report the measurement results from HiBench, which records the duration of running

each workload. With 10 iterative runs on the 5 different workloads, the mean and the

distribution of completion times are depicted in Fig. 4.7. Note that running Spark applications across EC2 regions is prone to unpredictable network performance, as the available bandwidth and network latency fluctuate dramatically over time. As a result, running the same workload with the same execution plan at different times may result in distinct performance. To eliminate the incurred randomness as much as possible, we

introduce the following statistical methods to process the data.

Trimmed average of the job completion time. The bars in Fig. 4.7 report the 10% trimmed mean of job completion time measurements over 10 runs. In particular, the maximum and the minimum values are discarded before we compute the average. This methodology, in a sense, eliminates the impact of the long-tail distribution on the mean.
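For clarity, the statistics reported in Fig. 4.7 can be reproduced as in the following sketch (plain Scala, independent of Spark); for 10 samples, the 10% trim drops exactly the minimum and the maximum.

// Sketch: the statistics used in Fig. 4.7, computed over the completion times of 10 runs.
def trimmedMean(samples: Seq[Double]): Double = {
  val trimmed = samples.sorted.drop(1).dropRight(1)   // drop min and max (10% of 10 runs)
  trimmed.sum / trimmed.size
}

def median(samples: Seq[Double]): Double = {
  val s = samples.sorted
  val n = s.size
  if (n % 2 == 1) s(n / 2) else (s(n / 2 - 1) + s(n / 2)) / 2.0
}

// Interquartile range: the span between the 25th and the 75th percentiles.
def interquartileRange(samples: Seq[Double]): (Double, Double) = {
  val s = samples.sorted
  val q1 = s(((s.size - 1) * 0.25).round.toInt)
  val q3 = s(((s.size - 1) * 0.75).round.toInt)
  (q1, q3)
}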

According to Fig. 4.7, AggShuffle offers the best performance among the three schemes in our evaluation. For example, AggShuffle shows as much as a 73% and 63% reduction in job completion time, as compared to Spark and Centralized, respectively. Under other workloads, using Spark as the baseline, our mechanism achieves a performance gain of at least 15% in terms of job duration.

As compared to the Centralized mechanism, we can easily find that AggShuffle is still beneficial, except for TeraSort. Under the TeraSort workload, the average job completion time under Centralized is only 4% higher than under AggShuffle, which makes our improvement minor in practice. The reason lies in the TeraSort algorithm itself. In the HiBench implementation, there

is a map transformation before all shuffles, which actually bloats the input data size.

In other words, the input of the first shuffle is even larger in size as compared to the

raw input. Consequently, extra data will be transferred to the destination datacenter,


incurring unnecessary overhead. Looking ahead, this analysis is supported by the cross-datacenter traffic measurement shown in Fig. 4.8. TeraSort turns out to be a perfect example to show the necessity of developers' interventions. Only the application developers can tell beforehand that the data size will increase. This problem can be resolved by

explicitly calling transferTo() before the map, and we can expect further improvement

from AggShuffle.

Interquartile range and the median of the job completion time. In addition to the average job completion time, we think the distribution of durations of the same job matters in wide-area data analytics. We thus provide further distribution information in Fig. 4.7, according to our measurements over 10 iterative runs. To this end, we add the interquartile range and the median into the figure as error bars. The interquartile range shows the range from the 25th percentile to the 75th percentile of the distribution. Also, the median value is shown as a dot in the middle of an error bar.

Fig. 4.7 clearly shows that AggShuffle outperforms both other schemes in terms of minimizing the variance. In other words, it can provide wide-area data analytic applications with more stable performance, making it more predictable. This is an important feature: as the experimental results suggest, even when running with the same environment settings, the completion time of a wide-area analytic job varies significantly over time. We argue that the ability of a data analytics framework to limit this variance is a performance metric that has been overlooked in the literature.

The reason for AggShuffle's stability is two-fold. On the one hand, the major source of performance fluctuation is the network: the wide-area links interconnecting datacenters, unlike the datacenter network, are highly unstable with no performance guarantees. Flash congestion and temporary connection losses are common, and their impact is magnified in the job completion times. On the other hand, AggShuffle initiates early data transfers without waiting for the reducers to start. This way, concurrent bulk traffic on bottleneck links will be smoothed over time, with less link sharing


Figure 4.8: Total volume of cross-datacenter traffic under different workloads.

and a better chance for data transfer to complete quickly.

As for TeraSort, rather than offering help, our proposed aggregation of shuffle input

actually burdens the cross-datacenter network. Again, it can be resolved by explicitly

invoking transferTo() for optimality.

4.4.3 Cross-Region Traffic

The volume of cross-datacenter traffic incurred by wide-area analytic applications is an-

other effective metric for evaluation. During our experiments on EC2, we tracked the

cross-datacenter traffic among Spark worker nodes. The average of our measurements is shown in Fig. 4.8. Note that in this figure, the “Centralized” scheme indicates the cross-region traffic required to aggregate all data into the central datacenter.


Except for TeraSort, where transferTo() is automatically called on a bloated dataset, all other workloads in the evaluation enjoy much lower bandwidth usage under AggShuffle. As shuffle input is proactively aggregated in early stages and all further computation is likely to be scheduled within one datacenter, cross-datacenter traffic will be reduced significantly on average. In particular, it is worth noting that under the PageRank workload, the required cross-datacenter traffic is reduced by 91.3%, a substantial saving.

Fig. 4.8 shows that the “Centralized” scheme requires the least cross-datacenter traf-

fic in TeraSort among the three schemes. This is consistent with our previous discussion.

4.4.4 Stage Execution Time

In Fig. 4.9, we break down the execution of different workloads by putting them under the microscope. Specifically, we show the detailed average stage completion time in our evaluation. The length of the stacked bars represents the trimmed average execution time of each stage under the specific workload. Again, the error bars show the interquartile ranges and median values.

As inferred from the large variances, any stage under Spark may suffer from degraded performance, most likely due to poor data locality. As a comparison, the Centralized strategy usually performs well in late stages, while having the longest average completion time in early stages. This is likely the result of collecting all raw data in early stages. However, AggShuffle can finish both early and late stages quickly. Similar to the Centralized scheme, it offers an exceptionally low variance in the completion time of late stages.

Although different stages under different workloads have specific features and pat-

terns, we are still able to provide some useful insights. The “magic” behind AggShuffle is

that it proactively improves the data locality during shuffle phases, without the need to

transfer excessive data. Then, as shuffle input is aggregated in a smaller number of dat-


acenters, the achievable data locality is high enough to guarantee a better performance.

Note that in Fig. 4.9, the total completion time of all stages is not necessarily equiv-

alent to the job completion time presented in Fig. 4.7. First of all, though being stacked

together in Fig. 4.9, some of the stages may overlap with each other at runtime. The

summation does not directly contribute to the total job completion time. Second, stage

completion time is measured and reported by Spark, while the measurement of job completion time is implemented by HiBench, using different definitions. Third, the cross-stage

delays such as scheduling and queuing are not covered by the stage completion time

measurements.

4.5 Summary

In this chapter, we have designed and implemented a new system framework that opti-

mizes network transfers in the shuffle stages of wide-area data analytic jobs. The gist of

our new framework lies in the design philosophy that the output data from mapper tasks

should be proactively aggregated to a subset of datacenters, rather than passively fetched

as they are needed by reducer tasks in the shuffle stage. The upshot of such proactive

aggregation is that data transfers can be pipelined and started as soon as computation

finishes, and do not need to be repeated when reducer tasks fail. The core of our new

framework is a simple transferTo transformation on Spark RDDs, which allows it to

be implicitly embedded by the Spark DAG scheduler, or explicitly added by application

developers. Our extensive experimental evaluations with the HiBench benchmark suite

on Amazon EC2 have clearly demonstrated the effectiveness of our new framework, which

is complementary to existing task assignment algorithms in the literature.


Figure 4.9: Stage execution time breakdown under each workload. In the graph, each segment in the stacked bars represents the life span of a stage.


Chapter 5

A Hierarchical Synchronous Parallel Model for Wide-Area Graph Analytics

Graph analytics is an important category of big data analytics applications. It serves as

the foundation of many popular Internet services, including PageRank Web search [16]

and social networking [17]. These services are typically deployed at a global scale, with

user-related data naturally generated and stored in geographically distributed, com-

modity datacenters [32, 64, 65, 77]. In fact, popular cloud providers, such as Amazon,

Microsoft, and Google, all operate tens of datacenters around the world [38], offering

convenient access to both storage and computing resources.

It would benefit a wide variety of applications if the performance of processing geographically distributed graph data could be improved. In Internet-scale graph analytics, production graphs typically have billions of vertices and trillions of edges [48, 74], taking terabytes of storage. For example, it is reported that Web search engines operate on an indexable Web graph consisting of an ever-growing 50 billion websites and one trillion hyperlinks between them [48]. Such large scales with rapid rates of change [38], coupled with high costs of Wide-Area Network (WAN) data transfers [64] and possible regulatory constraints [79], make it expensive, inefficient, or simply infeasible to move the entire dataset to a central location, even though this is a commonly used approach [9, 51] for analytics.


Therefore, it is critical to design efficient mechanisms to run graph analytics applica-

tions in a geographically distributed manner across multiple datacenters in a paradigm

called wide-area graph analytics. In particular, the fundamental challenge is to process

the graph with raw input data stored and computing resources distributed in globally

operated datacenters, which are inter-connected by wide-area networks (WANs).

Unfortunately, existing distributed graph analytics frameworks are not sufficiently

competent to address such a challenge. Representative works in the literature, e.g.,

Pregel [59], PowerGraph [29] and GraphX [30], are solely designed and optimized for

processing graphs within a single datacenter. Gemini [90], one of the state-of-the-art solutions, even assumes a high-performance cluster with 100 Gbps of bandwidth capacity between worker nodes. However, this assumption is far beyond the reality of inter-datacenter WANs, whose available capacity is typically hundreds of Mbps [38].

In this chapter, we argue that the inefficiency of wide-area graph analytics stems from

the Bulk Synchronous Parallel (BSP) model [75], which is the dominant synchronization model implemented by most of the popular graph analytics engines [82]. The primary reason for its popularity is that BSP works seamlessly with the vertex-centric programming abstraction [59], which eases the development of graph analytics applications. Many graph algorithms and optimizations, known as vertex programs, are designed exclusively with such an abstraction and BSP [68, 83] in mind.

In a vertex program under BSP, the application runs in a sequence of “supersteps,”

or iterations, which apply updates on vertices and edges iteratively. Message passing

and synchronization is made between two consecutive supersteps, while performing local

computation within each superstep. Since each superstep typically allows communication

between neighboring vertices only, it takes at least k supersteps until the algorithm con-

verges on a graph whose diameter is k [83]. Thus, k message passing phases will happen

in serial, which incurs excessive — and not always necessary [87] — inter-datacenter traf-

fic in wide-area graph analytics. In conclusion, BSP is designed for a deployment where


communication cost among workers is similarly low. While being adopted in Wide-Area

Graph Analytics, excessive inter-datacenter traffic will be generated, lowering the job

performance.

One possible solution is to loosen the BSP model, by allowing asynchronous updates

on different graph partitions. This way, new iterations of computation are able to proceed

with partially stale vertex/edge properties, relaxing the hard requirement on inter-

datacenter communication to a best-effort model. Existing systems implementing such

an asynchronous parallel model include GraphUC [33] and Maiter [87]. However, neither

system can guarantee the convergence or the correctness of graph applications [82].

Our objective is to design a new synchronization model for wide-area graph analytics,

which satisfies three important requirements:

1. WAN efficiency. The new model should require fewer rounds of inter-datacenter

communication and generate less inter-datacenter traffic.

2. Correctness. The new model should ensure the application can return the same

result as if it were executed under BSP.

3. Transparency. The new model should require absolutely no change to the existing

applications, by retaining the same set of vertex-centric abstraction APIs.

In this chapter, we introduce Hierarchical Synchronous Parallel (HSP), a novel syn-

chronization model designed for efficiency in wide-area graph analytics. In contrast to

BSP, which requires complete, global synchronizations among all worker nodes in all

datacenters, HSP allows partial, local synchronizations within each datacenter as addi-

tional updates. Specifically, HSP automatically switches between two modes of execution,

global and local, like a two-level hierarchical organization. The global mode is the

same as BSP, where all datacenters respond to central coordination. The local mode, on

the other hand, allows each datacenter to work autonomously without coordinating with

others. Our theoretical analysis shows that, if the mode switch happens strategically,

HSP can guarantee the convergence and correctness of all vertex programs. In addition,


if the implementation of the vertex program is considered practical [83], HSP can en-

sure a much higher rate of convergence, as compared to BSP with the same amount of

inter-datacenter traffic generated.

We have implemented the HSP model on GraphX [30], an open-source general graph

analytics framework built on top of Apache Spark [84]. The original implementation of

GraphX supports the BSP vertex-centric programming abstraction. In our prototype

implementation, we have extended the framework with HSP, by allowing synchroniza-

tion to be bounded within a single datacenter, and by implementing the feature that

automatically switches the mode of execution on a central coordinator. With our imple-

mentation, we have performed an extensive evaluation of HSP in five real geographically

distributed datacenters on Google Cloud. We experimented with three empirical benchmark workloads on two large-scale real-world graph datasets. The results show that

HSP is efficient in running wide-area graph analytics. It requires significantly fewer cross-

datacenter synchronizations to reach guaranteed algorithm convergence, and reduces WAN

bandwidth usage by 22.4% to 32.2%. The monetary cost of running graph applications

can be reduced by up to 30.4%.

5.1 Background and Motivation

Graphs in production are typically too large to be efficiently processed by a single machine

[59]. Distributed graph analytics frameworks are thus developed to run graph analytics

in parallel on multiple worker nodes. Before running the actual analytics, the input graph

is divided into several partitions, each of which is held and processed by a worker. The

frameworks will then handle the synchronization and necessary message passing among

workers automatically, allowing developers to work solely on the analytic logic itself.

Most of the state-of-the-art solutions [29, 30, 90] provide a vertex-centric abstraction

for developers to work on — similar to Google’s Pregel — and implement the Bulk Syn-

chronous Parallel (BSP) model for inter-node synchronization [82]. Such an integration


(a) Supersteps under BSP. (b) Allow additional local synchronization.

Figure 5.1: A Connected-Component algorithm executed under different synchronization models. White circles indicate active vertices, and the arrows represent message passing.

of programming abstraction and synchronization model allows developers to “think like

a vertex,” making the development of graph analytics applications intuitive and easy to

debug.

Even though it is well-known that the BSP model requires excessive communication,

the bandwidth capacity among workers is seldom considered a system bottleneck [90].

When deployed within a high-performance cluster where bandwidth is readily abundant,

BSP performs well with a large number of system optimization techniques, such as ad-

vanced graph partitioning strategies and load balancing. Unfortunately, this is no longer

true in wide-area data analytics, where inter-datacenter data transfers can incur a much

higher cost, in terms of both time and monetary expenses.

Fig. 5.1a illustrates a sample execution of the Connected-Component (CC) algorithm

under BSP. The algorithm runs on a six-vertex graph, which is cut into two partitions.

Within a superstep, each vertex tries to update itself with the smallest vertex ID seen so far among all its neighbors. A vertex becomes inactive as soon as it cannot get further updates. The algorithm converges when no vertex is active, and we can then compute the number of connected components by counting the remaining vertex IDs. Fig. 5.1a shows that CC converges in a total of 6 supersteps, and the first three supersteps require


message passing from DC A to DC B.

However, it is easy to observe that the first two inter-datacenter messages are un-

necessary in this example. These two messages are in fact transferring values that will

be immediately overridden in the next superstep. The insight behind it is that we can

sometimes hold inter-datacenter synchronization until multiple cycles of synchronization

within a single datacenter have been performed, for the sole purpose of minimizing cross-

datacenter traffic. For example, one possible optimization is illustrated in Fig. 5.1b. It

allows updates of vertex IDs to happen within a datacenter, without updating the vertex

that is owned by both datacenters. Inter-datacenter communication happens once at the

fourth step, when both partitions have already converged locally. This new principle of synchronization reaches the same result for the CC algorithm, yet generates only 1/3 of

the inter-datacenter traffic as compared to BSP.

The example shown in Fig. 5.1 inspires us to explore such a new principle, for the

sake of minimizing inter-datacenter traffic. To achieve this objective, we wish to carefully

design a new synchronization model, called the Hierarchical Synchronous Parallel (HSP)

model. As an alternative to BSP, it needs to guarantee that the correctness of any vertex program is retained, even for programs far more complex than the Connected-Component algorithm.

5.2 Hierarchical Synchronous Parallel Model

In this section, we introduce the Hierarchical Synchronous Parallel (HSP) model. We

will first explain the high-level principle of its design and the general idea behind its

correctness guarantee. We will then formulate the HSP model theoretically and explain it

in greater detail. With our formulation, we present a formal proof of its correctness and

rate of convergence in wide-area graph analytics. Finally, we use a simple PageRank

application as an example to illustrate the effectiveness of HSP.


5.2.1 Overview

Generally speaking, HSP is an extension to the BSP model in wide-area graph analytics,

by performing synchronization in a two-level hierarchy. In addition to BSP, HSP allows

local synchronization among worker nodes located in a single datacenter, completely

avoiding inter-datacenter communication. To achieve this, HSP introduces two modes of

execution, global and local, and switches between them strategically and frequently.

In the global mode, HSP has exactly the same behavior as BSP, where each syn-

chronization is a global, all-to-all communication among all worker nodes, regardless of

which datacenter they are located in. We call one iteration of the execution in HSP a

“global update,” which is equivalent to a superstep in BSP. Global updates are essential

to the correctness of graph analytics, because it is necessary to spread the information

outside of individual datacenters.

In the local mode, the worker nodes are organized in different autonomous datacen-

ters. Workers housed in the same datacenter work synchronously. In particular, they

still run in iterations, or “local updates,” as if they are running the vertex program un-

der BSP. The difference is that, if a vertex has mirrors in multiple datacenters, called

a “global vertex,” we mark it inactive and do not update it until switching back to the

global mode. Since synchronizing the property of these global vertices is the only source

of inter-datacenter traffic, we completely eliminate the need for inter-datacenter commu-

nication in the local mode. In addition, it is worth noting that without the need for

global synchronization, execution at different datacenters can be asynchronous.

Without running in the local mode, HSP is equivalent to BSP. Thus, it still guar-

antees correctness. However, running in the local mode takes advantage of low-cost

synchronization within a datacenter, allowing more updates in the same amount of time.

When to switch between these two modes is an interesting question.

Our mode-switching strategy is designed based on our theoretical analysis in the next

subsection for the sake of the algorithm convergence guarantee. Here we introduce its


general principle. On the one hand, local updates should allow at least one information

exchange between any pair of vertices. Thus, the number of local updates should be

higher than the diameter of the local partition. On the other hand, we should not leave

any worker idle before reaching global convergence. As a result, HSP will switch away from the local mode as soon as every datacenter has executed more local updates than its partition diameter, or the local updates in any datacenter converge. Then, while HSP is running in the global mode, it will switch back to local as soon as the algorithm is considered “more converged,” a metric that will be introduced later.

The intuition behind HSP is that, in general, graph algorithms need to spread the in-

formation on a vertex to all vertices of the entire graph. In other words, every vertex has

to “get its voice heard.” In wide-area data analytics, communicating with neighbors does

not always come at similar prices. Therefore, instead of requiring every vertex to talk

to its neighbors in every iteration, HSP organizes communication hierarchically. It al-

lows information to spread well within a closed neighborhood, before inter-neighborhood

communication. This way, the entire graph can still converge, while generating much less

inter-datacenter WAN traffic.

5.2.2 Model Formulation and Description

Before formulating the HSP model, we first give a formal definition of a vertex program.

Given a graph $G = (V, E)$ with initial properties on all vertices $\boldsymbol{x}^{(0)} \in \mathbb{R}^{|V|}$, a vertex programming application defines the function to update each vertex in a superstep. Specifically, a combiner function $g(\cdot)$ is defined to combine the messages received at each vertex, and a compute function $f(\cdot)$ is defined to compute the updated property on a vertex using the old property and the combined incoming message. Without loss of generality, let $f_i : \mathbb{R}^{|V|} \to \mathbb{R}$ denote the update function defined on vertex $i \in \{1, 2, \ldots, |V|\}$, such that, in BSP, we have

$$x_i^{(k+1)} = f\big(x_i^{(k)},\, g_i(\boldsymbol{x}^{(k)})\big) := f_i(\boldsymbol{x}^{(k)}). \tag{5.1}$$


Or equivalently, define $\boldsymbol{F} : \mathbb{R}^{|V|} \to \mathbb{R}^{|V|}$ such that

$$\boldsymbol{x}^{(k+1)} = \boldsymbol{F}(\boldsymbol{x}^{(k)}).$$

The objective of the vertex programming application is to compute $\boldsymbol{x}^* \in \mathbb{R}^{|V|}$ which satisfies $\boldsymbol{x}^* = \boldsymbol{F}(\boldsymbol{x}^*)$, by iteratively applying the update defined in Eq. (5.1) until convergence under BSP. By definition, $\boldsymbol{x}^*$ is a fixed point under operator $\boldsymbol{F}$, denoted as $\boldsymbol{x}^* \in \operatorname{Fix}\boldsymbol{F}$.

In practice, the application is considered converged when a valid approximation $\boldsymbol{x}^{(N)}$ is obtained after $N$ supersteps. We define a valid approximation as follows:

Definition 1 (Distance metric). A distance metric on $\mathbb{R}^{|V|}$ is a function

$$D : \mathbb{R}^{|V|} \times \mathbb{R}^{|V|} \to [0, \infty),$$

where $[0, \infty)$ is the set of non-negative real numbers and for all $\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z} \in \mathbb{R}^{|V|}$, the following conditions are satisfied:

$$D(\boldsymbol{x}, \boldsymbol{y}) = 0 \iff \boldsymbol{x} = \boldsymbol{y} \quad \text{(identity of indiscernibles)}$$
$$D(\boldsymbol{x}, \boldsymbol{y}) = D(\boldsymbol{y}, \boldsymbol{x}) \quad \text{(symmetry)}$$
$$D(\boldsymbol{x}, \boldsymbol{y}) \leq D(\boldsymbol{x}, \boldsymbol{z}) + D(\boldsymbol{z}, \boldsymbol{y}) \quad \text{(triangle inequality)}.$$

For example, the Chebyshev norm, i.e.,

$$D(\boldsymbol{x}, \boldsymbol{y}) = \max\{|x_1 - y_1|, |x_2 - y_2|, \ldots, |x_{|V|} - y_{|V|}|\},$$

is a commonly used distance metric because it is easy to compute in a distributed environment.
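For instance, with the vertex properties of two successive iterations stored as pair RDDs keyed by vertex ID (a sketch; the RDD names are hypothetical), the Chebyshev distance reduces to a single distributed max:

import org.apache.spark.rdd.RDD

// Sketch: Chebyshev distance D(x, y) = max_i |x_i - y_i| over distributed
// vertex properties, where oldValues and newValues are keyed by vertex ID.
def chebyshevDistance(oldValues: RDD[(Long, Double)],
                      newValues: RDD[(Long, Double)]): Double = {
  oldValues.join(newValues)                                   // co-locate old and new values
    .map { case (_, (oldV, newV)) => math.abs(newV - oldV) }  // per-vertex difference
    .max()                                                    // distributed maximum
}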

Definition 2 (A valid approximation of a fixed point). Given a pre-defined error bound $\delta \in (0, \infty)$, $\boldsymbol{x}^{(N)}$ is a valid approximation of $\boldsymbol{x}^*$, a fixed point under operator $\boldsymbol{F}$, if it satisfies

$$D(\boldsymbol{x}^{(N)}, \boldsymbol{x}^{(N+1)}) = D(\boldsymbol{x}^{(N)}, \boldsymbol{F}(\boldsymbol{x}^{(N)})) \leq \delta \tag{5.2}$$

where $D : \mathbb{R}^{|V|} \times \mathbb{R}^{|V|} \to [0, \infty)$ is a distance metric.

To process the graph in $d$ geographically distributed datacenters, it is partitioned into $d$ large subgraphs. We define each vertex as either a local vertex or a global vertex. Specifically, vertex $i$ is a local vertex of datacenter $j$ if datacenter $j$ stores its original copy and its property $x_i$ is not required to update any vertex stored in a different datacenter. In other words, $x_i$ never needs to be delivered outside of datacenter $j$ when running the application under BSP. Let $I_j$ denote the index set of all local vertices of datacenter $j$. Otherwise, if $x_i$ is required to update vertices stored in multiple datacenters, we define vertex $i$ as a global vertex. Let $I_G$ denote the index set of all global vertices. Clearly, the defined $d + 1$ index sets $I_1, I_2, \ldots, I_d, I_G$ are mutually exclusive and their union is $\{1, 2, \ldots, |V|\}$.

With a globally partitioned graph, we define $\boldsymbol{F}_j(\cdot)$, the local update function in datacenter $j$, as follows:

$$\big[\boldsymbol{F}_j(\boldsymbol{x})\big]_i = \begin{cases} f_i(\boldsymbol{x}) & i \in I_j \\ x_i & \text{otherwise} \end{cases}. \tag{5.3}$$
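As a concrete illustration with hypothetical index sets (loosely following the two-datacenter setting of Fig. 5.1), suppose datacenter A holds local vertices $I_A = \{1, 2\}$, datacenter B holds $I_B = \{4, 5, 6\}$, and vertex 3 is the only global vertex, i.e., $I_G = \{3\}$. A local update in datacenter A then recomputes only the properties of vertices 1 and 2:

$$\boldsymbol{F}_A(\boldsymbol{x}) = \big(f_1(\boldsymbol{x}),\ f_2(\boldsymbol{x}),\ x_3,\ x_4,\ x_5,\ x_6\big),$$

leaving the global vertex and all vertices local to datacenter B untouched until the next global update.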

Using the notations introduced above, we present our detailed description of the HSP model in Procedure 1. The procedure for local updates in a single datacenter is listed separately. Since the local updates in different datacenters are asynchronous, a mechanism of central coordination is required to decide when to switch the mode back to global. This is achieved by two methods, voteModeSwitch() and forceModeSwitch(). The former is called once the number of local updates reaches the subgraph diameter, while the latter is called upon local convergence. The central coordinator receives the signals triggered by these two methods, and enforces the mode switch when appropriate (line 12 of Procedure 1).


Procedure 1 Execution of a vertex programming application under the Hierarchical Synchronous Parallel (HSP) model.
1: Set execution mode to global, global update counter $k \leftarrow 0$, current error $\delta_0 \leftarrow \infty$;
2: while $\delta_k > \delta$ do
3:   if execution mode is global then
4:     Perform a global update: $\mathbf{x}^{(k+1)} \leftarrow \mathbf{F}(\mathbf{x}^{(k)})$;
5:     $\delta_{k+1} \leftarrow D(\mathbf{x}^{(k+1)}, \mathbf{x}^{(k)})$;
6:     if $\delta_{k+1} < \delta_k$ then
7:       Switch execution mode to local;
8:     else
9:       $\delta_{k+1} \leftarrow \delta_k$;
10:    $k \leftarrow k + 1$;
11:  else
12:    Apply local updates in each datacenter concurrently (as in Procedure 2), until any datacenter calls forceModeSwitch() or all datacenters call voteModeSwitch();
13:    Switch execution mode to global;
14: return $\mathbf{x}^{(k)}$.

Procedure 2 Local updates in datacenter $j$.
1: Set local iteration counter $n_j \leftarrow 0$, local error $\delta_0 \leftarrow \infty$; $d$ denotes the diameter of the local partition;
2: repeat
3:   $n_j \leftarrow n_j + 1$;
4:   Perform an in-place local update: $\mathbf{x}^{(k, n_j)} \leftarrow \mathbf{F}_j(\mathbf{x}^{(k, n_j - 1)})$;
5:   $\delta_{n_j} \leftarrow D(\mathbf{x}^{(k, n_j)}, \mathbf{x}^{(k, n_j - 1)})$;
6:   if $\delta_{n_j} < \delta$ then
7:     forceModeSwitch();
8:   if $n_j = d$ then
9:     voteModeSwitch();
10: until mode switch is forced by any or voted by all datacenters;
11: Execution mode switched to global.
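To make the control flow of Procedure 1 concrete, the sketch below condenses it into Scala, with the global update, the distance metric, and one concurrent round of local updates (i.e., Procedure 2 across all datacenters) abstracted as parameters. It is only an illustration of the mode-switching logic under these assumptions, not the prototype described later in Section 5.3.

```scala
// A condensed sketch of Procedure 1. `localRound` stands for one concurrent
// pass of Procedure 2 in every datacenter, and returns true once a switch back
// to the global mode has been forced by any datacenter or voted by all.
def runHsp[X](x0: X,
              globalUpdate: X => X,          // F(.)
              distance: (X, X) => Double,    // D(.,.)
              localRound: X => (X, Boolean), // local updates + mode-switch signal
              delta: Double): X = {
  var x = x0
  var err = Double.PositiveInfinity          // delta_0
  var globalMode = true
  while (err > delta) {
    if (globalMode) {
      val next = globalUpdate(x)             // line 4: global update
      val d = distance(next, x)              // line 5
      if (d < err) globalMode = false        // lines 6-7: progress made, go local
      err = math.min(d, err)                 // line 9: otherwise keep the previous error
      x = next
    } else {
      var switched = false
      while (!switched) {                    // line 12: local updates until a switch
        val (next, done) = localRound(x)
        x = next
        switched = done
      }
      globalMode = true                      // line 13
    }
  }
  x                                          // line 14
}
```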

5.2.3 Proof of Convergence and Correctness

To prove the convergence and correctness guarantee of the proposed HSP model, the only

requirement for the vertex programming application is that the iterative computation is

correctly implemented; that is, the application can converge within a finite number of

iterations under BSP. Formally, we have the following assumption:

Assumption 1 (Convergence under BSP). Given any arbitrary initial value of $\mathbf{x}^{(0)}$ and a distance metric $D(\cdot, \cdot)$, the sequence of successive approximations $\{\mathbf{x}^{(k)}\}$ approaches $\mathbf{x}^*$, a fixed point under operator $\mathbf{F}$, as the number of iterations of global updates Eq. (5.1) approaches infinity. That is,

$$\lim_{k \to \infty} D\big(\mathbf{x}^{(k)}, \mathbf{x}^*\big) = 0, \quad \text{where } \mathbf{x}^* \in \mathrm{Fix}\,\mathbf{F}. \qquad (5.4)$$

Theorem 1 (Convergence and correctness guarantee of HSP). If a vertex programming application satisfies Assumption 1, given any pre-defined error bound $\delta \in (0,\infty)$, it will also return a valid approximation of $\mathbf{x}^*$ in finite time under Procedure 1.

Proof. Since $\delta_k$ is overridden at line 9 using the previous value in the sequence, it is valid to consider its original value before being overridden:

$$\delta_{k+1} = D\big(\mathbf{x}^{(k+1)}, \mathbf{x}^{(k)}\big) \le D\big(\mathbf{x}^{(k+1)}, \mathbf{x}^*\big) + D\big(\mathbf{x}^{(k)}, \mathbf{x}^*\big) \quad \text{(triangle inequality)},$$

which approaches 0 as $k \to \infty$ given Eq. (5.4).

In a real-world implementation where the precision of any number is bounded, the sequence $\{\delta_k\}$ will eventually approach 0, i.e., $\forall \delta \in (0,\infty)$, $\exists k^* < \infty$ s.t. $\delta_{k^*} < \delta$, and Procedure 1 will return an estimation $\mathbf{x}^{(k^*)}$. Since $\mathbf{x}^{(k^*)}$ satisfies Eq. (5.2), it is a valid approximation of $\mathbf{x}^*$.

As is shown in the proof, the convergence guarantee of HSP relies heavily upon its

global updates. As long as the application can converge, applying local updates in the

middle does not affect the result of the vertex programming algorithm.

5.2.4 Rate of Convergence

Even with the convergence and the correctness guarantee, one may still be skeptical

about the effectiveness of HSP. How could the additional local updates help with the application execution? Is it guaranteed to generate less inter-datacenter traffic?

It is difficult to answer these questions without any prior knowledge about the actual

application itself, since the vertex programming model provides developers with a substantial amount of flexibility. Generally speaking, developers are able to code whatever

they desire, making it difficult to reach any useful conclusion about such applications.

However, to ensure scalability while processing very large datasets, vertex programming applications tend to share some characteristics in practice. Yan et al. [83], in particular, investigated a class of well-implemented vertex programming algorithms, namely practical Pregel algorithms, and summarized their common characteristics under BSP. These applications require linear space usage, linear computation complexity and linear communication cost per superstep. In addition, practical Pregel algorithms require at most a logarithmic number of supersteps until convergence, i.e., at least a linear rate of convergence.

With the latter characteristic as an assumption, HSP can ensure effectiveness by

allowing additional local updates in different datacenters.

Assumption 2 (Practical implementation). The vertex programming application converges at a linear or a superlinear rate under BSP, i.e.,

$$\exists \mu \in [0, 1), \ \text{s.t.} \ \lim_{k \to \infty} \frac{D(\mathbf{x}^{(k+1)}, \mathbf{x}^*)}{D(\mathbf{x}^{(k)}, \mathbf{x}^*)} \le \mu. \qquad (5.5)$$

To study the rate of convergence, we consider each cycle of synchronization in HSP. A

cycle of synchronization is defined as the interval between two consecutive mode switches

from local to global. In other words, a cycle includes several consecutive global updates

and the subsequent local updates, until a switch back to the global mode.

In particular, during the local mode within each synchronization cycle, several iter-

ations of local updates are applied in each individual datacenter at the same time. For

the convenience of our subsequent proof, we collectively formulate these local updates as a function

$$\mathbf{F}_{(n_1, n_2, \ldots, n_d)}(\mathbf{x}) := \Big( \prod_{j=1}^{d} \mathbf{F}_j^{\,n_j} \Big)(\mathbf{x}),$$

where $n_j$ denotes the number of iterations of local updates applied in datacenter $j$.


Lemma 1. $\mathbf{x}^*$ is a fixed point under operator $\hat{\mathbf{F}} := \mathbf{F}_{(n_1, n_2, \ldots, n_d)}$ for any $n_1, n_2, \ldots, n_d$.

Proof. Given $j \in \{1, 2, \ldots, d\}$ and $i \in I_j$, according to Eq. (5.3), we have $\big[\mathbf{F}_j(\mathbf{x}^*)\big]_i = f_i(\mathbf{x}^*) = x^*_i$. Also, given $i \notin I_j$, $\big[\mathbf{F}_j(\mathbf{x}^*)\big]_i = x^*_i$ by definition. Thus, $\mathbf{x}^*$ is a fixed point under $\mathbf{F}_j$.

Since no global update is applied, the properties of global vertices remain the same after $\hat{\mathbf{F}}(\cdot)$ is applied, i.e., $\big[\hat{\mathbf{F}}(\mathbf{x})\big]_i = x_i, \ \forall i \in I_G$.

Further, $\mathbf{F}_j(\cdot)$ depends on the local vertices of datacenter $j$ only. Thus, the individual functions $\mathbf{F}_j$ $(j = 1, 2, \ldots, d)$ are commutative and associative in Eq. (5.3). Therefore,

$$\mathbf{F}_{(n_1, n_2, \ldots, n_d)}(\mathbf{x}^*) = \mathbf{F}_{(n_1, \ldots, n_j - 1, \ldots, n_d)}\big(\mathbf{F}_j(\mathbf{x}^*)\big) = \mathbf{F}_{(n_1, \ldots, n_j - 1, \ldots, n_d)}(\mathbf{x}^*) = \cdots = \mathbf{F}_{(0, \ldots, 0)}(\mathbf{x}^*) = \mathbf{x}^*.$$

Lemma 2. $\hat{\mathbf{F}}: \mathbb{R}^{|V|} \to \mathbb{R}^{|V|}$ is a contraction mapping, i.e.,

$$D\big(\hat{\mathbf{F}}(\mathbf{x}), \hat{\mathbf{F}}(\mathbf{y})\big) < D(\mathbf{x}, \mathbf{y}), \quad \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^{|V|}, \ \mathbf{x} \neq \mathbf{y}.$$

Proof. According to the Banach fixed-point theorem [12], an equivalent condition of Eq. (5.5) is that $\mathbf{F}: \mathbb{R}^{|V|} \to \mathbb{R}^{|V|}$ is a contraction mapping. Given any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^{|V|}$, we have

$$D\big(\mathbf{F}(\mathbf{x}), \mathbf{F}(\mathbf{y})\big) \le \mu D(\mathbf{x}, \mathbf{y}).$$

Construct $\bar{\mathbf{y}} \in \mathbb{R}^{|V|}$ by letting $\bar{y}_i = \begin{cases} y_i & i \in I_j \\ x_i & i \notin I_j \end{cases}$. Thus,

$$D\big(\mathbf{F}_j(\mathbf{x}), \mathbf{F}_j(\mathbf{y})\big) = D\big(\mathbf{F}_j(\mathbf{x}), \mathbf{F}_j(\bar{\mathbf{y}})\big) = D\big(\mathbf{F}(\mathbf{x}), \mathbf{F}(\bar{\mathbf{y}})\big) \le \mu D(\mathbf{x}, \bar{\mathbf{y}}) \le \mu D(\mathbf{x}, \mathbf{y}).$$

Therefore,

$$D\big(\hat{\mathbf{F}}(\mathbf{x}), \hat{\mathbf{F}}(\mathbf{y})\big) \le \mu^{\sum_{j=1}^{d} n_j} D(\mathbf{x}, \mathbf{y}).$$


Theorem 2 (Rate of convergence of HSP). If a vertex programming application satisfies Assumptions 1 and 2, after the same number of global updates, HSP will converge to $\mathbf{x}^*$ at a higher rate as compared to BSP.

Proof. Consider an HSP synchronization cycle that includes $n$ global updates and a set of local updates denoted by $\hat{\mathbf{F}}$. After the synchronization cycle, the original estimation $\mathbf{x}^{(k)}$ is updated to $\mathbf{x}^{(k+n)} = (\hat{\mathbf{F}} \cdot \mathbf{F}^n)(\mathbf{x}^{(k)})$.

As a valid comparison, after the same number of global synchronization iterations, BSP will get $\bar{\mathbf{x}}^{(k+n)} = \mathbf{F}^n(\mathbf{x}^{(k)})$, so that $\mathbf{x}^{(k+n)} = \hat{\mathbf{F}}(\bar{\mathbf{x}}^{(k+n)})$.

The average rate of convergence of HSP, $\mu_{\mathrm{HSP}}$, satisfies

$$\begin{aligned}
\mu_{\mathrm{HSP}}^n &= \lim_{k \to \infty} \frac{D(\mathbf{x}^{(k+n)}, \mathbf{x}^*)}{D(\mathbf{x}^{(k)}, \mathbf{x}^*)} = \lim_{k \to \infty} \frac{D(\hat{\mathbf{F}}(\bar{\mathbf{x}}^{(k+n)}), \mathbf{x}^*)}{D(\mathbf{x}^{(k)}, \mathbf{x}^*)} \\
&= \lim_{k \to \infty} \frac{D(\hat{\mathbf{F}}(\bar{\mathbf{x}}^{(k+n)}), \hat{\mathbf{F}}(\mathbf{x}^*))}{D(\mathbf{x}^{(k)}, \mathbf{x}^*)} \quad \text{(Lemma 1)} \\
&\le \mu^{\sum_{j=1}^{d} n_j} \cdot \lim_{k \to \infty} \frac{D(\bar{\mathbf{x}}^{(k+n)}, \mathbf{x}^*)}{D(\mathbf{x}^{(k)}, \mathbf{x}^*)} \quad \text{(Lemma 2)} \\
&= \mu^{\sum_{j=1}^{d} n_j} \cdot \mu_{\mathrm{BSP}}^n < \mu_{\mathrm{BSP}}^n.
\end{aligned}$$

Therefore, HSP provides a higher average rate of convergence as compared to BSP.

Given a distance metric $D: \mathbb{R}^{|V|} \times \mathbb{R}^{|V|} \to [0,\infty)$ and an error bound $\delta \in (0,\infty)$, Theorem 2 implies that HSP requires fewer global updates, thus less inter-datacenter traffic, to reach a practical approximation.

5.2.5 PageRank Example: a Numerical Verification

To verify our findings in the previous theorems, we compare the convergence under HSP

and BSP using a simple PageRank example shown in Fig. 5.2. Fig. 5.2(a) shows the

5-vertex graph in the example, while the graph is partitioned into two datacenters by


cutting the central vertex. Note that the diameter of each partition is 1 (ignoring the global vertex); therefore, HSP runs local updates only once in every local mode of execution. We plot every estimation achieved by both synchronization models in Fig. 5.2(b), whose y-axis shows the Euclidean norm between each estimation and $\mathbf{x}^*$ in log scale.

Since PageRank is a practical Pregel algorithm by definition [83], BSP shows a perfectly linear rate of convergence (the red line in Fig. 5.2(b)). As a comparison, HSP, depicted by the black line that forms the lower bound of the gray area, shows a much higher rate of convergence, given the same number of global updates as the number of supersteps in BSP.

If we consider the x-axis as algorithm run time in real systems, the gray area indicates the range of possible convergence rates of HSP. The reason is that the lower bound, shown by the black line, assumes no cost incurred by local updates because they do not introduce inter-datacenter traffic. The upper bound, on the other hand, assumes that a local update takes exactly the same amount of time as a superstep in BSP. In reality, the time needed by a local update lies between these two extremes, and HSP can always converge faster.

5.3 Prototype Implementation

We have implemented a prototype of the HSP synchronization model on GraphX [30],

and it retains full compatibility with existing analytics applications. Our prototype

implementation is non-trivial to complete, yet it makes a strong case that HSP can be

seamlessly integrated with existing BSP-based graph analytics engines.

Systems design. GraphX is an open-source graph analytics engine built on top of

Apache Spark [84]. It is an ideal platform for us to implement our prototype, due to its

full interoperability with general dataflow analytics and machine learning applications in

the popular Spark framework.


(a) The example directed graph, partitioned between DC A and DC B by cutting the central vertex. The numbers on the vertices represent the final ranks that we used as $\mathbf{x}^*$: 0.170795, 0.030000, 0.331283, 0.307311, and 0.160607.

(b) Convergence of PageRank under BSP and HSP, plotting $\|\mathbf{x}^{(k)} - \mathbf{x}^*\|_2$ (log scale) against the number of global synchronizations, with local and global updates marked separately.

Figure 5.2: A PageRank example with the damping factor set to 0.15 [16]. All values used in computation are rounded to the sixth decimal place; therefore, norms lower than $10^{-6}$ make little sense and are ignored in the figure.

In GraphX, the vertex-centric programming abstraction is supported via a Pregel

API, which allows developers to pass in their customized vertex update and message

passing functions. The graph analytics applications are executed under the BSP model,

with all workers proceeding in supersteps. It takes advantage of the Resilient Distributed

Dataset (RDD) programming abstraction [84]. An RDD represents a collection of data

stored on multiple workers. Parallel computation on the data can be modeled as sequen-

tial transformations made on the RDD, allowing developers to program as if it is running

on a single machine. In particular, Pregel models a graph using a VertexRDD and an

EdgeRDD, while a superstep is modeled as a series of sequential transformations on them.
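For reference, the sketch below shows how an application is typically expressed against this Pregel API, using the canonical single-source shortest paths pattern from the GraphX documentation; it is independent of HSP and assumes a graph whose edge attributes are distances.

```scala
import org.apache.spark.graphx._

// Single-source shortest paths via GraphX's Pregel API: vprog merges incoming
// messages into the vertex value, sendMsg relaxes edges, and mergeMsg combines
// messages destined to the same vertex within a superstep.
def sssp(graph: Graph[Long, Double], sourceId: VertexId): Graph[Double, Double] = {
  val init = graph.mapVertices((id, _) =>
    if (id == sourceId) 0.0 else Double.PositiveInfinity)
  init.pregel(Double.PositiveInfinity)(
    (id, dist, newDist) => math.min(dist, newDist),        // vertex program
    triplet =>                                             // send messages
      if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
        Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
      else Iterator.empty,
    (a, b) => math.min(a, b)                               // merge messages
  )
}
```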

Our implementation retains the original Pregel API, providing full compatibility with

existing applications and requiring no change to their code. To enable HSP instead of

BSP, users can simply pass an additional option, -Dspark.graphx.hsp.enabled=true,

along with the application submission command.
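For instance, assuming the configuration key named above, the same flag can be set programmatically when the SparkContext is constructed (the application name is arbitrary); when the key is absent or set to false, the unmodified BSP execution path is taken.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A hedged usage sketch: toggle the HSP prototype through its configuration key.
val conf = new SparkConf()
  .setAppName("PageRankUnderHSP")
  .set("spark.graphx.hsp.enabled", "true") // the option described above
val sc = new SparkContext(conf)
```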

Figure 5.3: The flow chart that shows central coordination in HSP. Blocks in black indicate the original Pregel implementation in GraphX. (The chart content is omitted here; it depicts the driver partitioning the input graph, selecting between one global update per iteration and launching $n$ DC Manager Threads that run local updates until convergence or the diameter bound, coordinated through an accumulator counter $c$.)

Under the hood, our modifications to the original GraphX codebase take place within the Pregel implementation and include two separate components: the central logic to

switch between local and global execution modes, as well as the RDD transformation

sequence that actually implements the local updates.

Mode switches. We have implemented Procedure 1 and Procedure 2 for mode switching. The coordination logic runs on the Spark “driver” program, which centrally manages all workers acquired by the application. The flow chart is shown in Fig. 5.3. We utilize two Accumulators to help with decision making. An Accumulator can serve as a global variable, which is addable by workers and readable by the central coordinator.

To switch from the global to the local mode, one Accumulator is used to track the

differences (i.e., distance) between consecutive updates. A mode switch will be made once

the difference decreases. On the other hand, to switch back to global, we use the other

Accumulator to track the progress made by individual local updates. voteModeSwitch()

and forceModeSwitch() are implemented by adding to the Accumulator. Each individual

datacenter will check the value of the global variable after each local update to decide

whether to proceed, and change the value once it converges or reaches a preset number.
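A minimal sketch of these two driver-side Accumulators, using the Spark 2.x accumulator API, is shown below. The names are illustrative; the exact vote encoding in our prototype follows the counter shown in Fig. 5.3.

```scala
import org.apache.spark.SparkContext

// Two driver-side accumulators for central coordination. Workers add to them
// inside tasks; only the driver reads their values.
def makeCoordinationState(sc: SparkContext) = {
  // Tracks D(x^(k+1), x^(k)) of the latest global update; the driver compares
  // it against the previous value to decide a switch from global to local mode.
  val globalDelta = sc.doubleAccumulator("hsp.globalDelta")
  // Tracks local-update progress: voteModeSwitch() and forceModeSwitch() both
  // add to this counter (a forced switch adds a larger, decisive amount), and
  // the driver switches back to the global mode once it crosses a threshold.
  val switchVotes = sc.longAccumulator("hsp.switchVotes")
  (globalDelta, switchVotes)
}
```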


Local updates. Implementing local updates in GraphX is challenging, because

the RDD transformations on a graph are designed to hide some runtime details from

the developers. These details include data distributions across the workers, which are

essential to the concept of local updates. We need to dig deep into the RDD internals

for our implementation.

One of the major challenges is to identify global vertices and local vertices.

In Spark, the actual dataset placement is decided at runtime depending on the worker

availabilities. Thus, the co-located graph partitions in the same datacenters can only

be identified at runtime. Fortunately, such runtime information is accessible via the

preferredLocations feature in an RDD, which provides insights about the actual worker

where a data partition is placed. We make full use of this feature, creating a set containing

all global vertices once the graph is fully loaded onto the available workers. Since graph partitioning remains unchanged throughout the application execution, the global vertex set can be cached in memory without being computed again.

With knowledge about all global vertices, we create a SubGraph out of the co-located

graph partitions in each datacenter. The local updates will be carried out asynchronously

on different SubGraph instances, and they will be controlled by DC Manager Threads on the Spark driver, as depicted in Fig. 5.3. All transformations made on the VertexRDD

and the EdgeRDD of a SubGraph are similar to the original BSP transformations. However,

they have been carefully rewritten such that no global vertex is modified during local

updates.
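The sketch below illustrates how the per-datacenter grouping of partitions can be recovered at runtime through preferredLocations. The mapping from a worker host name to its datacenter (for example, by a region prefix in the host name) is an assumption of this sketch, not something provided by GraphX.

```scala
import org.apache.spark.rdd.RDD

// Group the partitions of an RDD (e.g., a VertexRDD or EdgeRDD) by the
// datacenter of the worker that holds them, using runtime placement
// information exposed by preferredLocations().
def partitionsByDatacenter[T](rdd: RDD[T],
                              datacenterOf: String => String): Map[String, Seq[Int]] = {
  rdd.partitions.toSeq
    .map(p => (p.index, rdd.preferredLocations(p)))
    .collect { case (idx, hosts) if hosts.nonEmpty => (datacenterOf(hosts.head), idx) }
    .groupBy { case (dc, _) => dc }
    .map { case (dc, pairs) => (dc, pairs.map(_._2)) }
}
```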

5.4 Experimental Evaluation

We have evaluated the effectiveness and correctness of our prototype implementation in

real datacenters, with empirical benchmarking workloads and real-world graph datasets.

In this section, we summarize and analyze our experimental results.


Dataset         # Vertices    # Edges        Harmonic Diameter    Edgelist Size (MB)
enwiki-2013     4,206,785     101,355,853    5.24                 1556.9
uk-2014-host    4,769,354     50,829,923     21.48                801.4

Table 5.1: Summary of the used datasets.

5.4.1 Methodology

Experiment platforms. We deploy a 10-worker Spark cluster on Google Cloud. Each

worker is a regular Ubuntu 16.04 LTS instance, with 2 CPU cores and 7.5GB of memory.

Our modified version of GraphX is based on Spark v2.2.0, and the cluster is deployed in

the standalone mode.

In particular, two workers are employed in each of the five geographical regions in-

cluding N. Virginia, Oregon, Tokyo, Belgium, and Sydney. Preliminary measurements on

the available bandwidth show ∼3 Gbps of capacity within a datacenter. Inter-datacenter bandwidth is more than an order of magnitude lower, ranging from 50 Mbps to 230 Mbps. The

findings are similar to the measurements reported in [38].

Applications. We use three benchmarking applications to evaluate the effectiveness

of HSP, including PageRank (PR), ConnectedComponents (CC), and ShortestPaths (SP).

We use the default implementations provided by GraphX without changing a single line

of code. Also, the default graph partitioning strategy is used, which preserves the original

edge partitioning in the HDFS input file.
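Concretely, the three benchmarks are the implementations shipped with GraphX and are invoked as shown below; the input path, the PageRank tolerance, and the landmark set for ShortestPaths are illustrative values, not the exact parameters of our runs.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{GraphLoader, VertexId}
import org.apache.spark.graphx.lib.ShortestPaths

// Run the three unmodified GraphX benchmarks on an edge-list input.
def runBenchmarks(sc: SparkContext, edgeListPath: String, landmarks: Seq[VertexId]): Unit = {
  val graph = GraphLoader.edgeListFile(sc, edgeListPath)
  val ranks      = graph.pageRank(0.0001).vertices              // PR, run to a tolerance
  val components = graph.connectedComponents().vertices         // CC
  val paths      = ShortestPaths.run(graph, landmarks).vertices // SP, distances to landmarks
  println(s"PR/CC/SP outputs: ${ranks.count()}, ${components.count()}, ${paths.count()} vertices")
}
```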

PR represents random walk algorithms, an important category of algorithms that seek

to find the steady state in the graph. CC and SP represent graph traversal algorithms.

These two categories of algorithms cover the most common vertex programs in practice

[73]. The three applications show different degrees of network intensiveness: PR requires more time for synchronization, while SP is more computation-intensive.

Input datasets. We use two web datasets from WebGraph [14]. The key features

of the datasets are summarized in Table 5.1. Both datasets have more than 4 million

vertices. However, uk-2014-host has much fewer edges, making the diameter of the


Dataset         Workload    # HSP Global Sync.    # BSP Supersteps    HSP Usage (GB)    BSP Usage (GB)    Reduction (%)
enwiki-2013     PR          46                    74                  18.39             27.14             32.2
                CC          5                     7                   0.69              0.91              23.4
                SP          7                     10                  0.59              0.84              30.6
uk-2014-host    PR          35                    52                  21.48             31.47             31.7
                CC          12                    20                  0.71              0.95              25.4
                SP          15                    23                  0.50              0.64              22.4

Table 5.2: WAN bandwidth usage comparison.

Figure 5.4: Application runtime under HSP, normalized by the runtime under BSP. (Bar chart for enwiki-2013 and uk-2014-host; the normalized runtimes across the two datasets are 0.90x/0.86x for PR, 0.97x/0.93x for CC, and 1.02x/1.04x for SP.)

graph much higher. In other words, enwiki-2013 is more “dense” in terms of vertex

connectivity. Experimenting on these two datasets makes a strong case that HSP can

work well on natural, real-world graphs.

5.4.2 WAN Bandwidth Usage

Apart from the correctness guarantee and the API transparency, the design objective

of HSP is WAN efficiency in wide-area graph analytics. As compared to BSP, HSP is

expected to significantly reduce the required number of global synchronizations as well

as the WAN bandwidth usage. These statistics are calculated from the experiments and summarized in Table 5.2, for all combinations of benchmarks and datasets.

In general, HSP has met our expectations; more than 22% reduction in WAN band-

width usage can be observed in all workloads. Among the applications, PR benefits most


Figure 5.5: Estimated cost breakdown for running applications (WAN usage cost plus instance cost, under BSP and HSP). The calculation follows the Google Cloud pricing model as of July 2017, where 10 instances cost $0.95/h and WAN traffic is $0.08/GB. (a) enwiki-2013: total cost reductions under HSP of 29.79%, 15.54%, and 9.41% for PR, CC, and SP, respectively. (b) uk-2014-host: total cost reductions of 30.43%, 14.79%, and 8.21%.

Figure 5.6: Rate of convergence analysis for PageRank on uk-2014-host, comparing BSP and HSP. The delta $\|\mathbf{x}^{(k)} - \mathbf{x}^{(k-1)}\|_2$ (log scale) is organized by (a) the number of global synchronizations and (b) the PageRank execution time in seconds, respectively.

from running under HSP, enjoying an over 30% reduction in total inter-datacenter traf-

fic. CC and SP require fewer global synchronizations before convergence even under BSP, leaving less room for improvement.

HSP also works well on both datasets, despite the differences in graph diameters.

Because uk-2014-host, with its larger diameter, is partitioned across the 5 datacenters with less fragmentation, graph traversal applications (CC and SP) can see a higher reduction

in the number of global synchronizations under HSP. Another interesting finding is that

running CC on enwiki-2013 under HSP takes only 5 global synchronizations, which

reaches the expected minimum in a 5-datacenter setting.


5.4.3 Performance and Total Cost Analysis

WAN bandwidth usage, along with instance usage, directly contributes to the monetary cost of running analytics in the public cloud. In most cloud pricing models, inter-datacenter

traffic is charged by GBs of usage, while instances are charged by hours of machine time.

In our experiments, the runtime of each workload is summarized in Fig. 5.4. It

shows the normalized time for running applications under HSP as compared to BSP. The performance varies across applications, due to their different degrees of network intensiveness. PR, for example, can achieve 14% less application runtime because it originally spent more time transferring a huge amount of data across datacenters. SP, on the other hand, has slightly degraded performance, since the extra computation time

incurred by local updates exceeds the reduction in network transfers.

However, we argue that the possible performance degradation is acceptable when the total cost is considered. We illustrate the cost breakdown of our experiments in Fig. 5.5. For PR, the WAN usage cost contributes the majority of the final monetary cost. Since both machine time and inter-datacenter traffic have been reduced, HSP is about 30% cheaper than BSP. The other two applications are relatively less network-intensive, but the WAN usage still constitutes a large proportion of their cost. Even though the application runtimes are similar, HSP can still save about 10%.
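As a back-of-the-envelope illustration of this cost model, the sketch below combines the two components using the prices quoted in the caption of Fig. 5.5; the runtimes plugged in are placeholders rather than measured values.

```scala
// Cost model behind Fig. 5.5 (July 2017 prices): the 10-instance cluster costs
// $0.95 per hour in total, and WAN traffic costs $0.08 per GB.
def jobCost(runtimeHours: Double, wanGB: Double,
            clusterPricePerHour: Double = 0.95, wanPricePerGB: Double = 0.08): Double =
  runtimeHours * clusterPricePerHour + wanGB * wanPricePerGB

// Example with Table 5.2's WAN usage for PageRank on enwiki-2013 (27.14 GB
// under BSP, 18.39 GB under HSP) and a hypothetical one-hour BSP run: the WAN
// component alone drops from about $2.17 to about $1.47.
val bspCost = jobCost(runtimeHours = 1.0, wanGB = 27.14)
val hspCost = jobCost(runtimeHours = 0.9, wanGB = 18.39) // HSP ran roughly 10% faster
```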

5.4.4 Rate of Convergence

To further verify the theoretical proof of a higher rate of convergence in HSP (Sec. 5.2.4),

we study the convergence speed of PageRank in our experiments. The results are shown

in Fig. 5.6. Different from Fig. 5.2(b), we measure the ranks’ “delta” (in the form of

Euclidean norm) between two consecutive global synchronizations instead of the distance

to the real ranks, because the real ranks are unknown.
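A minimal sketch of this measurement is shown below, assuming the ranks after two consecutive global synchronizations are available as VertexRDDs; the computation stays distributed and only the final scalar is brought to the driver.

```scala
import org.apache.spark.graphx.VertexRDD

// Euclidean norm of the difference between two consecutive rank vectors,
// computed by joining the two VertexRDDs and summing squared differences.
def globalDelta(prev: VertexRDD[Double], curr: VertexRDD[Double]): Double = {
  val sumOfSquares = prev.innerJoin(curr) { (_, p, c) => (p - c) * (p - c) }
    .map { case (_, sq) => sq }
    .sum()
  math.sqrt(sumOfSquares)
}
```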

We may observe that Fig. 5.6(a) matches the numerical analysis in Fig. 5.2(b). HSP converges linearly, yet at a rate that is 1.49x that of BSP with the same number of global


synchronizations.

In Fig. 5.6(b) we plot the deltas by the end times of global synchronizations. It shows a similar speed of convergence in the early stages of execution, while HSP accelerates more in later stages. The reason is that, in the beginning, HSP takes more time running local updates; these local updates usually double the time interval between global synchronizations. However, local updates take much less time later because the local vertices can be considered “more converged,” and the progress of HSP accelerates significantly.

5.5 Summary

We introduce Hierarchical Synchronous Parallel (HSP), a new synchronization model

that is designed to run graph analytics on geographically distributed datasets efficiently.

HSP has a local mode of execution, which allows workers in different regions to work

asynchronously and avoids WAN traffic. By carefully designing the strategy for mode switches, we have shown, both theoretically and experimentally, that HSP guarantees the convergence and correctness of existing graph applications without any change to their code. Our prototype

implementation and evaluation show that HSP can reduce the WAN bandwidth usage by

up to 32%, leading to a significant reduction in monetary cost for analyzing graph data

in the cloud. We conclude that HSP is a general, efficient, and readily implementable

synchronization model that can benefit wide-area graph analytics systems.


Chapter 6

Concluding Remarks

6.1 Conclusion

In this dissertation, we explore several system approaches to optimizing the performance

of data analytics frameworks that are deployed across multiple geographically distributed

datacenters. By revisiting the design principles of general data analytics frameworks,

we have proposed three sets of system optimizations, targeting different layers in their architecture.

First of all, at the networking layer, we attempt to alleviate the bottleneck of inter-datacenter data transfers in wide-area data analytics directly. We have designed and

implemented Siphon — a building block that can be seamlessly integrated with existing

data parallel frameworks — to expedite coflow transfers. Following the principles of

software-defined networking, a controller implements and enforces several novel coflow

scheduling strategies. A novel approach to caching controller rules at the dataplane has

also been adopted and evaluated.

To evaluate the effectiveness of Siphon in expediting coflows as well as analytics jobs,

we have conducted extensive experiments on real testbeds, with Siphon deployed across

geo-distributed datacenters. The results have demonstrated that Siphon can effectively

reduce the completion time of a single coflow by up to 76% and improve the average

coflow completion time.


Secondly, at the workflow execution layer, we attempt to optimize the timing of inter-datacenter data transfers involved in a shuffle. We have designed and implemented a new

proactive data aggregation framework based on Apache Spark, with a focus on optimizing

the network traffic incurred in shuffle stages of data analytics jobs. The objective of this

framework is to strategically and proactively aggregate the output data of mapper tasks

to a subset of worker datacenters, as a replacement for Spark’s original passive fetch

mechanism across datacenters. It improves the performance of wide-area analytic jobs

by avoiding repetitive data transfers, which improves the utilization of inter-datacenter

links. Our extensive experimental results using standard benchmarks across six Amazon

EC2 regions have shown that our proposed framework is able to reduce job completion

times by up to 73%, as compared to the existing baseline implementation in Spark.

Last but not least, at the algorithm API layer, we focus on optimizing wide-area graph

analytics, an important subset of data analytics applications. We have presented a new

Hierarchical Synchronous Parallel model designed and implemented for synchronization

across datacenters with a much improved efficiency in inter-datacenter communication.

Our new model requires no modifications to graph analytics applications, yet guarantees

their convergence and correctness. Our prototype implementation on Apache Spark can

achieve up to 32% lower WAN bandwidth usage, 49% faster convergence, and 30% less

total cost for benchmark graph algorithms, with input data stored across five geograph-

ically distributed datacenters.

6.2 Future Directions

We will continue our work in wide-area graph analytics, with a further relaxation on

deployment environments.

Both BSP and HSP still require synchronization among all workers after each iteration, so stragglers can degrade the overall system performance significantly. As a result,

existing systems partition the input graph evenly among workers before proceeding to


the iterative computation for load balancing, relying heavily on the carefully designed

heuristics [19, 90].

These heuristics, despite their complexity, usually assume a high-performance, homo-

geneous computing environment. However, this assumption may not hold in practice. In

both the public cloud and the private cloud, it is common to have heterogeneous workers, which possess different levels of computation capacity [5]. For example, virtual

machine instances might be heterogeneous due to resource sharing and the coexistence

of multiple generations of hardware [43]. To be aware of such heterogeneity, some parti-

tioning strategies in the literature require profiling/learning the system behavior [47,50],

which is a costly process, while others assume a known resource model [49, 81] or prior

knowledge [60,89] of the heterogeneity. Because of the additional efforts or assumptions

in deployment, a static heterogeneity-aware partitioning strategy alone is less practical

or effective.

We argue that the input graph should be dynamically repartitioned across the iterative execution of graph analytics for perfect load balance. To this end, we can explore the design of a new distributed graph analytics system that can automatically adapt to a

heterogeneous environment without any prior knowledge. Beyond existing systems that

allow dynamic load balancing among workers, we can redesign the workload migration

framework from the ground up for efficiency, with a complete awareness of the worker

heterogeneity in computing capacities.


Bibliography

[1] Apache Hadoop Official Website. http://hadoop.apache.org/. [Online; accessed

1-May-2016].

[2] MLlib: Apache Spark Website. http://spark.apache.org/mllib/. [Online; ac-

cessed 1-May-2016].

[3] Open Network Foundation Official Website. https://www.opennetworking.org/.

[Online; accessed 6-May-2015].

[4] OpenFlow White Paper. https://www.opennetworking.org/images/stories/

downloads/sdn-resources/white-papers/wp-sdn-newnorm.pdf. [Online; ac-

cessed 6-May-2015].

[5] Martín Abadi, Paul Barham, Jianmin Chen, et al. TensorFlow: A System for Large-

Scale Machine Learning. In Proc. USENIX Symposium on Operating Systems Design

and Implementation (OSDI), 2016.

[6] Kanak Agarwal, Colin Dixon, Eric Rozner, and John Carter. Shadow MACs: Scal-

able Label-Switching for Commodity Ethernet. In Proc. ACM SIGCOMM Workshop

on Hot Topics in Software Defined Networking (HotSDN), 2014.

[7] Mohammad Alizadeh, Albert Greenberg, David A Maltz, Jitendra Padhye, Parveen

Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. Data Center

TCP (DCTCP). In Proc. ACM SIGCOMM, 2010.


[8] Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown,

Balaji Prabhakar, and Scott Shenker. pFabric: Minimal Near-Optimal Datacenter

Transport. In Proc. ACM SIGCOMM, 2013.

[9] Aditya Auradkar, Chavdar Botev, Shirshanka Das, Dave De Maagd, Alex Feinberg,

Phanindra Ganti, Lei Gao, Bhaskar Ghosh, Kishore Gopalakrishna, Brendan Harris,

et al. Data Infrastructure at LinkedIn. In Proc. IEEE International Conference on

Data Engineering (ICDE), 2012.

[10] Daniel O Awduche. MPLS and Traffic Engineering in IP Networks. IEEE Commu-

nications Magazine, 37(12):42–47, 1999.

[11] Daniel O Awduche and Johnson Agogbua. Requirements for Traffic Engineering over

MPLS. Technical report, RFC 2702, September 1999.

[12] Stefan Banach. Sur les Opérations Dans les Ensembles Abstraits et Leur Application

aux Équations Intégrales. Fundamenta Mathematicae, 3(1):133–181, 1922.

[13] J Christopher Beck and Nic Wilson. Proactive Algorithms for Job Shop Scheduling

with Probabilistic Durations. Journal of Artificial Intelligence Research, 28:183–232,

2007.

[14] Paolo Boldi and Sebastiano Vigna. The Webgraph Framework I: Compression Tech-

niques. In Proc. ACM International Conference on World Wide Web (WWW), 2004.

[15] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rex-

ford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, et al. P4:

Programming Protocol-Independent Packet Processors. ACM SIGCOMM Computer

Communication Review, 44(3):87–95, 2014.

[16] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web

Search Engine. Computer Networks and ISDN Systems, 30(1):107–117, 1998.


[17] Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui

Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry C Li, et al. TAO:

Facebook’s Distributed Data Store for the Social Graph. In Proc. USENIX Annual

Technical Conference (ATC), 2013.

[18] Martin Casado, Teemu Koponen, Scott Shenker, and Amin Tootoonchian. Fabric:

a Retrospective on Evolving SDN. In Proc. ACM SIGCOMM Workshop on Hot

Topics in Software Defined Networking (HotSDN), 2012.

[19] Rong Chen, Jiaxin Shi, Yanzhe Chen, and Haibo Chen. PowerLyra: Differentiated

Graph Computation and Partitioning on Skewed Graphs. In Proc. ACM European

Conference on Computer Systems (Eurosys), 2015.

[20] Mosharaf Chowdhury and Ion Stoica. Efficient Coflow Scheduling Without Prior

Knowledge. In Proc. ACM SIGCOMM, 2015.

[21] Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I Jordan, and Ion Stoica.

Managing Data Transfers in Computer Clusters with Orchestra. In Proc. ACM

SIGCOMM, 2011.

[22] Mosharaf Chowdhury, Yuan Zhong, and Ion Stoica. Efficient Coflow Scheduling with

Varys. In Proc. ACM SIGCOMM, 2014.

[23] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M Hellerstein, Khaled Elmeleegy,

and Russell Sears. MapReduce Online. In Proc. USENIX Symposium on Networked

Systems Design and Implementation (NSDI), 2010.

[24] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on

Large Clusters. In Proc. USENIX Symposium on Operating Systems Design and

Implementation (OSDI), 2004.


[25] Fahad R Dogar, Thomas Karagiannis, Hitesh Ballani, and Antony Rowstron. De-

centralized Task-Aware Scheduling for Data Center Networks. In Proc. ACM SIG-

COMM, 2014.

[26] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From Data Mining

to Knowledge Discovery in Databases. AI magazine, 17(3):37, 1996.

[27] Yuan Feng, Baochun Li, and Bo Li. Airlift: Video Conferencing as a Cloud Ser-

vice Using inter-Datacenter Networks. In Proc. IEEE International Conference on

Network Protocols (ICNP), 2012.

[28] Nate Foster, Dexter Kozen, Konstantinos Mamouras, Mark Reitblatt, and Alexan-

dra Silva. Probabilistic NetKAT. In Proc. European Symposium on Programming

Languages and Systems, 2016.

[29] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin.

PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In

Proc. USENIX Symposium on Operating Systems Design and Implementation

(OSDI), 2012.

[30] Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J

Franklin, and Ion Stoica. GraphX: Graph Processing in a Distributed Dataflow

Framework. In Proc. USENIX Symposium on Operating Systems Design and Imple-

mentation (OSDI), 2014.

[31] Yanfei Guo, Jia Rao, Dazhao Cheng, and Xiaobo Zhou. iShuffle: Improving Hadoop

Performance with Shuffle-on-Write. In Proc. USENIX International Conference on

Autonomic Computing (ICAC), 2013.

[32] Ashish Gupta, Fan Yang, Jason Govig, Adam Kirsch, Kelvin Chan, Kevin Lai,

Shuo Wu, Sandeep Govind Dhoot, Abhilash Rajesh Kumar, Ankur Agiwal, et al.


Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing. Proc. VLDB

Endowment, 7(12):1259–1270, 2014.

[33] Minyang Han and Khuzaima Daudjee. Giraph Unchained: Barrierless Asynchronous

Parallel Execution in Pregel-Like Graph Processing Systems. Proc. VLDB Endow-

ment, 8(9):950–961, 2015.

[34] Soheil Hassas Yeganeh and Yashar Ganjali. Kandoo: a Framework for Efficient and

Scalable Offloading of Control Applications. In Proc. ACM SIGCOMM Workshop

on Hot Topics in Software Defined Networking (HotSDN), 2012.

[35] Benjamin Heintz, Abhishek Chandra, Ramesh K Sitaraman, and Jon Weissman.

End-to-End Optimization for Geo-Distributed MapReduce. IEEE Transactions on

Cloud Computing, 2015.

[36] C. Y. Hong, M. Caesar, and P. B. Godfrey. Finishing Flows Quickly with Preemptive

Scheduling. In Proc. ACM SIGCOMM, 2012.

[37] Chi-Yao Hong, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan

Nanduri, and Roger Wattenhofer. Achieving High Utilization with Software-Driven

WAN. In Proc. ACM SIGCOMM, 2013.

[38] Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R.

Ganger, Phillip B. Gibbons, and Onur Mutlu. Gaia: Geo-Distributed Machine

Learning Approaching LAN Speeds. In Proc. USENIX Symposium on Networked

Systems Design and Implementation (NSDI), 2017.

[39] Zhiming Hu, Baochun Li, and Jun Luo. Flutter: Scheduling Tasks Closer to Data

Across Geo-Distributed Datacenters. In Proc. IEEE INFOCOM, 2016.

[40] Shengsheng Huang, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang. The HiBench

Benchmark Suite: Characterization of the MapReduce-Based Data Analysis. In

Proc. International Conference on Data Engineering Workshops (ICDEW), 2010.


[41] Chien-Chun Hung, Leana Golubchik, and Minlan Yu. Scheduling Jobs Across Geo-

Distributed Datacenters. In Proc. ACM Symposium on Cloud Computing (SoCC),

2015.

[42] Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun

Singh, Subbaiah Venkata, Jim Wanderer, Junlan Zhou, Min Zhu, et al. B4: Experi-

ence with a Globally-Deployed Software Defined WAN. In Proc. ACM SIGCOMM,

2013.

[43] Jiawei Jiang, Bin Cui, Ce Zhang, and Lele Yu. Heterogeneity-Aware Dis-

tributed Parameter Servers. In Proc. ACM International Conference on Management

of Data (SIGMOD), 2017.

[44] Lavanya Jose, Lisa Yan, George Varghese, and Nick McKeown. Compiling Packet

Programs to Reconfigurable Switches. In Proc. USENIX Symposium on Networked

Systems Design and Implementation (NSDI), 2015.

[45] Konstantinos Kloudas, Margarida Mamede, Nuno Preguiça, and Rodrigo Rodrigues.

Pixida: Optimizing Data Parallel Jobs in Wide-Area Data Analytics. Proc. VLDB

Endowment, 9(2):72–83, 2015.

[46] Diego Kreutz, Fernando MV Ramos, Paulo Esteves Verissimo, Christian Esteve

Rothenberg, Siamak Azodolmolky, and Steve Uhlig. Software-Defined Networking:

A Comprehensive Survey. Proceedings of the IEEE, 103(1):14–76, 2015.

[47] Dinesh Kumar, Arun Raj, Deepankar Patra, and Dharanipragada Janakiram.

GraphIVE: Heterogeneity-Aware Adaptive Graph Partitioning in GraphLab. In

Proc. IEEE International Conference on Parallel Processing Workshops (ICCPW),

2014.


[48] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Dandapani Sivakumar,

Andrew Tompkins, and Eli Upfal. The Web as a Graph. In Proc. ACM SIGMOD-

SIGACT-SIGART Symposium on Principles of Database Systems, 2000.

[49] Shailendra Kumar, Sajal K Das, and Rupak Biswas. Graph Partitioning for Parallel

Applications in Heterogeneous Grid Environments. In Proc. IEEE International

Parallel and Distributed Processing Symposium (IPDPS), 2002.

[50] Michael LeBeane, Shuang Song, Reena Panda, Jee Ho Ryoo, and Lizy K John.

Data Partitioning Strategies for Graph Workloads on Heterogeneous Clusters. In

Proc. IEEE International Conference for High Performance Computing, Networking,

Storage and Analysis (SC), 2015.

[51] George Lee, Jimmy Lin, Chuang Liu, Andrew Lorek, and Dmitriy Ryaboy. The Uni-

fied Logging Infrastructure for Data Analytics at Twitter. Proc. VLDB Endowment,

5(12):1771–1780, 2012.

[52] Yupeng Li, Shaofeng H-C Jiang, Haisheng Tan, Chenzi Zhang, Guihai Chen, Jipeng

Zhou, and Francis Lau. Efficient Online Coflow Routing and Scheduling. In Proc.

ACM International Symposium on Mobile Ad Hoc Networking and Computing (Mo-

biHoc), 2016.

[53] Shuhao Liu, Li Chen, and Baochun Li. Siphon: a High-Performance Substrate for

Inter-Datacenter Transfers in Wide-Area Data Analytics. In Proc. USENIX Annual

Technical Conference (ATC), 2017.

[54] Shuhao Liu, Li Chen, Baochun Li, and Aiden Carnegie. A Hierarchical Synchronous

Parallel Model for Wide-Area Graph Analytics. In Proc. IEEE INFOCOM, 2018.

[55] Shuhao Liu and Baochun Li. Stemflow: Software-Defined Inter-Datacenter Over-

lay as a Service. IEEE Journal on Selected Areas in Communications (JSAC),

35(11):2563–2573, 2017.


[56] Shuhao Liu, Hao Wang, and Baochun Li. Optimizing Shuffle in Wide-Area Data An-

alytics. In Proc. IEEE International Conference on Distributed Computing Systems

(ICDCS), 2017.

[57] Zimu Liu, Yuan Feng, and Baochun Li. Bellini: Ferrying Application Traffic Flows

through Geo-Distributed Datacenters in the Cloud. In Proc. IEEE GLOBECOM,

2013.

[58] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and

Joseph M Hellerstein. Distributed GraphLab: a Framework for Machine Learning

and Data Mining in the Cloud. Proc. VLDB Endowment, 5(8):716–727, 2012.

[59] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn,

Naty Leiser, and Grzegorz Czajkowski. Pregel: A System for Large-Scale Graph

Processing. In Proc. ACM SIGMOD International Conference on Management of

Data (SIGMOD), 2010.

[60] Christian Mayer, Muhammad Adnan Tariq, Chen Li, and Kurt Rothermel.

GrapH: Heterogeneity-Aware Graph Computation with Adaptive Partitioning. In

Proc. IEEE International Conference on Distributed Computing Systems (ICDCS),

2016.

[61] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson,

Jennifer Rexford, Scott Shenker, and Jonathan Turner. OpenFlow: Enabling Inno-

vation in Campus Networks. ACM SIGCOMM Computer Communication Review,

38(2):69–74, 2008.

[62] Hesham Mekky, Fang Hao, Sarit Mukherjee, Zhi-Li Zhang, and TV Lakshman.

Application-Aware Data Plane Processing in SDN. In Proc. ACM SIGCOMM Work-

shop on Hot Topics in Software Defined Networking (HotSDN), 2014.


[63] Masoud Moshref, Apoorv Bhargava, Adhip Gupta, Minlan Yu, and Ramesh Govin-

dan. Flow-Level State Transition as a New Switch Primitive for SDN. In Proc. ACM

SIGCOMM Workshop on Hot Topics in Software Defined Networking (HotSDN),

2014.

[64] Qifan Pu, Ganesh Ananthanarayanan, Peter Bodik, Srikanth Kandula, Aditya

Akella, Paramvir Bahl, and Ion Stoica. Low Latency Geo-Distributed Data An-

alytics. In Proc. ACM SIGCOMM, 2015.

[65] Ariel Rabkin, Matvey Arye, Siddhartha Sen, Vivek S Pai, and Michael J Freedman.

Aggregation and Degradation in JetStream: Streaming Analytics in the Wide Area.

In Proc. USENIX Symposium on Networked Systems Design and Implementation

(NSDI), 2014.

[66] Smriti R Ramakrishnan, Garret Swart, and Aleksey Urmanov. Balancing Reducer

Skew in MapReduce Workloads Using Progressive Sampling. In Proc. ACM Sympo-

sium on Cloud Computing (SoCC), 2012.

[67] Jeff Rasley, Brent Stephens, Colin Dixon, Eric Rozner, Wes Felter, Kanak Agar-

wal, John Carter, and Rodrigo Fonseca. Planck: Millisecond-Scale Monitoring and

Control for Commodity Networks. In Proc. ACM SIGCOMM, 2014.

[68] Semih Salihoglu and Jennifer Widom. Optimizing Graph Algorithms on Pregel-Like

Systems. Proc. VLDB Endowment, 7(7):577–588, 2014.

[69] Naveen Kr Sharma, Antoine Kaufmann, Thomas E Anderson, Arvind Krishna-

murthy, Jacob Nelson, and Simon Peter. Evaluating the Power of Flexible Packet

Processing for Network Resource Allocation. In Proc. USENIX Symposium on Net-

worked Systems Design and Implementation (NSDI), 2017.

[70] Anirudh Sivaraman, Alvin Cheung, Mihai Budiu, Changhoon Kim, Mohammad

Alizadeh, Hari Balakrishnan, George Varghese, Nick McKeown, and Steve Licking.


Packet Transactions: High-Level Programming for Line-Rate Switches. In Proc.

ACM SIGCOMM, 2016.

[71] Anirudh Sivaraman, Changhoon Kim, Ramkumar Krishnamoorthy, Advait Dixit,

and Mihai Budiu. DC.p4: Programming the Forwarding Plane of a Data-Center

Switch. In Proc. ACM SIGCOMM Symposium on SDN Research (SOSR), 2015.

[72] Hengky Susanto, Hao Jin, and Kai Chen. Stream: Decentralized Opportunistic

Inter-Coflow Scheduling for Datacenter Networks. In Proc. IEEE International Con-

ference on Network Protocols (ICNP), 2016.

[73] Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, and

John McPherson. From Think Like a Vertex to Think Like a Graph. Proc. VLDB

Endowment, 7(3):193–204, 2013.

[74] Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow. The Anatomy

of the Facebook Social Graph. arXiv preprint arXiv:1111.4503, 2011.

[75] Leslie G Valiant. A Bridging Model for Parallel Computation. Communications of

the ACM, 33(8):103–111, 1990.

[76] Balajee Vamanan, Jahangir Hasan, and TN Vijaykumar. Deadline-Aware Datacenter

TCP (D2TCP). In Proc. ACM SIGCOMM, 2012.

[77] Raajay Viswanathan, Ganesh Ananthanarayanan, and Aditya Akella. Clarinet:

Wan-Aware Optimization for Analytics Queries. In Proc. USENIX Symposium on

Operating Systems Design and Implementation (OSDI), 2016.

[78] Ashish Vulimiri, Carlo Curino, Philip Brighten Godfrey, Thomas Jungblut, Kon-

stantinos Karanasos, Jitendra Padhye, and George Varghese. WANalytics: Analyt-

ics for a Geo-Distributed Data-Intensive World. In Proc. Conference on Innovative

Data Systems Research (CIDR), 2015.


[79] Ashish Vulimiri, Carlo Curino, Philip Brighten Godfrey, Thomas Jungblut, Jitu

Padhye, and George Varghese. Global Analytics in the Face of Bandwidth and

Regulatory Constraints. In Proc. USENIX Symposium on Networked Systems Design

and Implementation (NSDI), 2015.

[80] Damon Wischik, Costin Raiciu, Adam Greenhalgh, and Mark Handley. De-

sign, Implementation and Evaluation of Congestion Control for Multipath TCP.

In Proc. USENIX Symposium on Networked Systems Design and Implementation

(NSDI), 2011.

[81] Ning Xu, Bin Cui, Lei Chen, Zi Huang, and Yingxia Shao. Heterogeneous Environ-

ment Aware Streaming Graph Partitioning. IEEE Transactions on Knowledge and

Data Engineering (TKDE), 27(6):1560–1572, June 2015.

[82] Da Yan, Yingyi Bu, Yuanyuan Tian, Amol Deshpande, et al. Big Graph Analytics

Platforms. Foundations and Trends® in Databases, 7(1-2):1–195, 2017.

[83] Da Yan, James Cheng, Kai Xing, Yi Lu, Wilfred Ng, and Yingyi Bu. Pregel Algo-

rithms for Graph Connectivity Problems with Performance Guarantees. Proc. VLDB

Endowment, 7(14):1821–1832, 2014.

[84] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Mur-

phy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient Dis-

tributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.

In Proc. USENIX Symposium on Networked Systems Design and Implementation

(NSDI), 2012.

[85] David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, and Randy Katz.

DeTail: Reducing the Flow Completion Time Tail in Datacenter Networks. In Proc.

ACM SIGCOMM, 2012.


[86] Hong Zhang, Li Chen, Bairen Yi, Kai Chen, Mosharaf Chowdhury, and Yanhui

Geng. CODA: Toward Automatically Identifying and Scheduling Coflows in the

Dark. In Proc. ACM SIGCOMM, 2016.

[87] Yanfeng Zhang, Qixin Gao, Lixin Gao, and Cuirong Wang. Maiter: an Asynchronous

Graph Processing Framework for Delta-Based Accumulative Iterative Computation.

IEEE Transactions on Parallel and Distributed Systems (TPDS), 25(8):2091–2100,

2014.

[88] Yangming Zhao, Kai Chen, Wei Bai, Minlan Yu, Chen Tian, Yanhui Geng, Yiming

Zhang, Dan Li, and Sheng Wang. Rapier: Integrating Routing and Scheduling for

Coflow-Aware Data Center Networks. In Proc. IEEE INFOCOM, 2015.

[89] Amelie Zhou, Shadi Ibrahim, and Bingsheng He. On Achieving Efficient Data Trans-

fer for Graph Processing in Geo-Distributed Datacenters. In Proc. IEEE Interna-

tional Conference on Distributed Computing Systems (ICDCS), 2017.

[90] Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. Gemini: A

Computation-Centric Distributed Graph Processing System. In Proc. USENIX Sym-

posium on Operating Systems Design and Implementation (OSDI), 2016.