NETWORK RESOURCE PROVISIONING IN RESEARCH NETWORKS
By
EUN-SUNG JUNG
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2010
© 2010 Eun-Sung Jung
To my parents, my wife, Hyeseon, and my daughter, Lauren
ACKNOWLEDGMENTS
First of all, I would like to thank my chair, Dr. Sanjay Ranka, and my co-chair, Dr.
Sartaj Sahni. Since I started working with them, they have inspired me, guided me through
all my research, and given me invaluable advice, suggestions, comments, and support
with patience and generosity. I also would like to show my sincere gratitude to my
supervisory committee members for insightful comments on my research.
I would like to give my deepest gratitude to my family and friends. Without their help
and support, this dissertation would not have been possible.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Target Networks and Services . . . . . . . . . . . . . . . . . . . . . . . 14
1.3 Problems Addressed and Our Contributions . . . . . . . . . . . . . . . . 16
    1.3.1 Bandwidth Allocation for Iterative Data-dependent Applications . . 16
    1.3.2 Topology Aggregation for E-Science Networks . . . . . . . . . . . 17
    1.3.3 Workflow Scheduling . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Outline of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2 BANDWIDTH ALLOCATION FOR ITERATIVE DATA-DEPENDENT E-SCIENCE
APPLICATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Synchronous Dataflow for E-Science Applications . . . . . . . . . . . . . 28
2.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
    2.3.1 Illustrative Example . . . . . . . . . . . . . . . . . . . . . . . . 34
    2.3.2 Optimal Bandwidth Allocation with a Feasible Schedule . . . . . . . 37
        2.3.2.1 Modeling communication delays . . . . . . . . . . . . . . 38
        2.3.2.2 Problem formulation . . . . . . . . . . . . . . . . . . . 42
2.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3 TOPOLOGY AGGREGATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 TA for Multiple-Path Multiple-Job (MPMJ) . . . . . . . . . . . . . . . . . 54
    3.3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 54
    3.3.2 New Topology Aggregation Algorithms . . . . . . . . . . . . . . . . 56
        3.3.2.1 Full-mesh method . . . . . . . . . . . . . . . . . . . . . 56
        3.3.2.2 Star method . . . . . . . . . . . . . . . . . . . . . . . 57
        3.3.2.3 Partitioned star method . . . . . . . . . . . . . . . . . 58
3.4 Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 62
    3.6.1 Bulk File Transfers in E-Science . . . . . . . . . . . . . . . . . . 62
    3.6.2 Experiment Testbed . . . . . . . . . . . . . . . . . . . . . . . . . 63
    3.6.3 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . 64
    3.6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4 WORKFLOW SCHEDULING . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Workflow Scheduling in E-Science Networks . . . . . . . . . . . . . . . . . 70
    4.2.1 System Model and Data Structure . . . . . . . . . . . . . . . . . . 71
        4.2.1.1 Time model . . . . . . . . . . . . . . . . . . . . . . . . 71
        4.2.1.2 Network resource model . . . . . . . . . . . . . . . . . . 71
        4.2.1.3 Workflow model . . . . . . . . . . . . . . . . . . . . . . 72
    4.2.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 72
    4.2.3 Construction of an Auxiliary Graph . . . . . . . . . . . . . . . . . 73
4.3 MILP Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
    4.3.1 Single Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . 78
        4.3.1.1 Multi-commodity flow constraints . . . . . . . . . . . . . 78
        4.3.1.2 Task assignment constraints . . . . . . . . . . . . . . . 80
        4.3.1.3 Precedence constraints . . . . . . . . . . . . . . . . . . 80
        4.3.1.4 Deadline constraints . . . . . . . . . . . . . . . . . . . 81
    4.3.2 Multiple Workflows . . . . . . . . . . . . . . . . . . . . . . . . . 81
    4.3.3 Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4 LP Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.5 List Scheduling Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.6 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 89
    4.6.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 89
    4.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
        4.6.2.1 Schedule length of workflows . . . . . . . . . . . . . . . 92
        4.6.2.2 Computational time . . . . . . . . . . . . . . . . . . . . 93
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
LIST OF TABLES
Table page
2-1 Comparison between DSP and e-Science applications . . . . . . . . . . . . . . 30
2-2 Summary of system parameters of the visualization application . . . . . . . . . 36
2-3 Notation for problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3-1 Time Complexity for MPMJ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3-2 Space Complexity for MPMJ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4-1 Notation for problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4-2 Single workflow scheduling formulation time complexity analysis . . . . . . . . 84
4-3 Edge-path form single workflow scheduling formulation time complexity analysis 88
LIST OF FIGURES
Figure page
2-1 An example of SDFG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2-2 A homogeneous SDFG converted from Figure 2-1 (a) . . . . . . . . . . . . . . 33
2-3 A real example of e-Science applications [53] . . . . . . . . . . . . . . . . . . . 35
2-4 An ESDFG model for Figure 2-3 . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2-5 Modeling communication delay in a SDFG . . . . . . . . . . . . . . . . . . . . . 39
2-6 Modeling communication delay in the case of multiple communication channels 41
2-7 More exploited parallelism in case of multiple communication channels . . . . . 42
2-8 BAFS problem formulation in case of the conservative model . . . . . . . . . . 42
2-9 BAFS problem formulation in case of the optimistic model . . . . . . . . . . . . 43
2-10 The Abilene network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2-11 Rejection ratio vs. number of requests . . . . . . . . . . . . . . . . . . . . . . . 47
3-1 An example of inter-domain QoS routing . . . . . . . . . . . . . . . . . . . . . . 51
3-2 An illustrative example for limitations of the line segment algorithm . . . . . . . 54
3-3 Full-mesh AR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3-4 Star AR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3-5 Partitioned star AR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3-6 Earliest finish time on-line scheduling of multiple file transfers . . . . . . . . . . 63
3-7 Error ratio vs. the number of nodes . . . . . . . . . . . . . . . . . . . . . . . . . 65
3-8 Normalized computational time vs. the number of source and destination nodes 65
4-1 A DAG consisting of 17 nodes, representing dependencies among 17 tasks of an application. For example, the arc from task E to task B represents the fact that the output generated by task E is utilized by task B. . . . . . . . . . . 68
4-2 An example of a network resource graph . . . . . . . . . . . . . . . . . . . . . 72
4-3 An example of a task graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4-4 An example of an auxiliary graph . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4-5 Single workflow scheduling problem formulation via network flow model . . . . 79
4-6 Multiple workflow scheduling problem formulation via network flow model . . . 82
4-7 Edge-path form of single workflow scheduling problem formulation . . . . . . . 87
4-8 The Abilene network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4-9 Makespan vs. CCR for all algorithms in the Abilene network when the number of nodes in a workflow is 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4-10 Makespan vs. CCR and the number of nodes in a workflow for LPREdge and LS in the Abilene network . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4-11 Computational time vs. CCR for all algorithms in the Abilene network when the number of nodes in a workflow is 3 . . . . . . . . . . . . . . . . . . . . 94
4-12 Computational time vs. the number of nodes in a workflow for LPREdge and LS in the Abilene network . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
List of Algorithms
2-1 A heuristic for BAFS problem . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3-1 Full-mesh AR construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3-2 Star AR construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3-3 Partitioned star AR construction . . . . . . . . . . . . . . . . . . . . . . . . . 59
4-1 First step - Determination of the mapping of tasks except data transfers . . 85
4-2 Second step - Determination of the mapping of network resources . . . . . 85
4-3 The adapted extended list scheduling algorithm . . . . . . . . . . . . . . . . 89
4-4 Data transfer finish time computation algorithm . . . . . . . . . . . . . . . . 90
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
NETWORK RESOURCE PROVISIONING IN RESEARCH NETWORKS
By
Eun-Sung Jung
December 2010
Chair: Sanjay Ranka
Cochair: Sartaj Sahni
Major: Computer Engineering
Advances in optical communication and networking technologies, together with
the computing and storage technologies, are dramatically changing the ways scientific
research is conducted. A new term, e-Science, has emerged to describe “large-scale
science carried out through distributed global collaborations enabled by networks,
requiring access to very large scale data collections, computing resources, and
high-performance visualization” [12].
E-Science application workflows are complex and require schedulable and
high-bandwidth connectivity with known future characteristics. Moreover, these
workflows have performance requirements or metrics that have not been considered
by conventional networking. For example, a large file transfer may need guarantees
on total turnaround time and rate of progress. Given the long duration of many
requests, the network resources available to a request may change before it completes.
We develop a novel framework for provisioning a variety of e-Science applications
that require complex workflows that span over multiple domains. Our framework
provides guarantees on the performance while incurring minimal overhead, both
necessary conditions for such a framework to be adopted in practice.
CHAPTER 1
INTRODUCTION
1.1 Overview
Advances in optical communication and networking technologies, together with
the computing and storage technologies, are dramatically changing the ways scientific
research is conducted. A new term, e-Science, has emerged to describe “large-scale
science carried out through distributed global collaborations enabled by networks,
requiring access to very large scale data collections, computing resources, and
high-performance visualization” [12]. Frequently cited e-Science (and related grid
computing [47]) examples include high-energy nuclear physics [33], radio astronomy
[19], geoscience [3] and climate studies [13]. To support e-Science activities, a new
generation of high-speed research and education networks have been developed.
These include Internet2 [17], the Department of Energy’s ESnet [14], National Lambda
Rail [21], CA*net4 [9] in Canada, and the pan-Europe GEANT2 [5]. A large portion of all
data traffic supporting U.S. science is carried by ESnet, Internet2, and National Lambda
Rail [55].
E-science activities often need to transport large volumes of data at a very high
rate among a large number of collaborating sites [33, 78], severely stressing network
resources. For instance, the high-energy physics (HEP) data is expected to grow from
the current petabytes (PB, 10^15 bytes) to exabytes (10^18 bytes) by 2015 [33]. Beyond the obvious
need for large amounts of data to be transferred, e-Science requirements for network
use are significantly different from the traditional network applications [7, 46, 55] in the
following ways:
1. Need to support schedulable, long-duration workflows with performance guarantees: The underlying applications require schedulable, high-bandwidth, low-latency connectivity with known future characteristics or performance guarantees [41], for real-time remote visualization, interactions with instruments, distributed simulation or data analysis, etc. In a distributed workflow system that involves many entities such as distant parties, scientific instruments, computation devices, as well as complex feedback in various stages of the workflow, unintended delay due to
lack of planning for future communication paths can ripple through the entire workflow environment, slowing down other participating systems as they wait for intermediate results, thus reducing the overall effectiveness [55].
2. Need to support a large number of network services with novel performance metrics: There are many different types of sciences and scientific activities, which require different types of network services tailored to the specific science activities. (Also see [1, 2, 7, 46, 55].) Moreover, many of the e-Science activities have performance requirements or metrics that have not been considered by conventional networking. Large file transfer may only be concerned with total turnaround time and the rate of progress; streaming consumer-producer type of jobs running at two different sites may require a minimum and maximum data transfer rate; fusion experiments may care about lowering the probability of failure in the experiments due to inadequate network services.
3. Need to support a dynamic user and resource environment with high network efficiency: Given that each job can be a “heavy hitter” in terms of network resource consumption, the network must handle with great efficiency the dynamic arrivals of service requests, and the changes in requirements, traffic patterns, and access policies at different stages of experiments or collaboration. The efficiency requirement is especially important for the new-generation, high-speed, coarse-granular networks, such as wavelength-based systems. In addition, given the long duration of many jobs, the resources available to a particular job or the network topology may change before the job is completed. The network services must be able to adapt to resource changes by incorporating newly added resources (e.g., links or wavelengths) or falling back to alternative resources when the assigned resources are no longer available.
In short, e-Science activities need schedulable, high-bandwidth, flexible and
evolving network services with novel performance guarantees, and the network needs
to provide these services efficiently. There is a large body of research on how to
provide quality-of-service (QoS) guarantees (e.g., IntServ [31], DiffServ [29], ATM
networks [71], or MPLS [86]) for Internet-type networks. Those proposals do not consider
advance reservations with start and end times. Bulk transfer is usually regarded as
low-priority best-effort traffic, not subject to admission control (AC). AC and scheduling
are decoupled from routing in that each connection has a single default path separately
determined by a routing protocol; the routing protocol is usually oblivious of the jobs in
the system. AC is myopic in that each network element on the path determines whether
the connection can be accepted based on a comparison of the remaining link capacity
at the node itself with the requested resource of the job alone. Once admitted, the path
and resource allocation remain fixed throughout the lifetime of the connection.
To meet the needs of e-Science, we propose a framework for conducting advance
reservations, admission control (AC) and scheduling of network service requests in
research networks that (i) supports an evolving large array of network services required
by or useful for e-Science collaboration; (ii) guarantees performance levels that are
based on metrics relevant to the underlying applications; (iii) adapts to underlying
changes in network topology, resources and user dynamics; and (iv) provides efficient
utilization of the underlying resources.
1.2 Target Networks and Services
Recent network signaling protocols, such as Multiprotocol Label Switching
(MPLS) and Generalized MPLS (GMPLS), allow applications to overcome deficiencies
prevalent in existing routed TCP/IP protocols (e.g., the inability to guarantee bandwidth,
or offer Quality of Service). Many high-bandwidth network projects are currently
deploying these protocols in the research and academic domain. This is the case, for
example, in the Internet2’s HOPI testbed [16], the NSF-supported Ultralight [8], Teragrid
[23], CHEETAH [6] and DRAGON [63] networks and the DOE-supported UltraScience
Net (USN) [24], ESnet [14] and LHCNet [20]. We expect these protocols to proliferate
into the production and commercial network domain. As the first evidence, both the
Internet2 and the DOE’s ESnet have chosen to offer dedicated bandwidth capability
and lightpaths using GMPLS and MPLS control plane techniques, developed in the
OSCARS/BRUW projects [10]. The techniques provide a framework to automate the
provisioning process for bandwidth and make it easier for users to access Service
Oriented Bandwidth Management (SOBM) functions, compared to the current
provisioning and bandwidth management practices, which are manual and labor-intensive.
Past and current projects on research networks have focused on addressing the
following challenges [6, 10, 16, 63]: 1) Set up the high-speed data plane by a hybrid of
IP packet-switching and optical circuit-switching technologies with a large footprint and
sufficient connectivity by connecting the national labs and universities and peering with
other networks, 2) Develop support for end-to-end high-speed circuits statically or on
demand, which requires multi-domain interoperability, 3) Set up the basic control plane
and develop signaling and control middleware for handling user requests and basic
network resource reservation, 4) Develop end-to-end transport protocols for supporting
high-speed channels and large volumes of data, 5) Ensure security by encryption,
authentication, authorization (AAA), and 6) Ensure reliability.
Bulk transfer. Being able to transfer very large files is a priority in nearly all
e-Sciences [1, 2, 7, 46, 55]. If the turnaround time is the performance metric that the
user cares about, there is a great deal of flexibility in how the transfer can be carried
out. For instance, the transfer of a 100 GB file can be completed in 8 seconds using
ten 10 Gbps lightpaths (Internet2 links), or in 1 hour and 26 minutes using a 155 Mbps
(OC-3) long-lasting SONET circuit. The transfer choice not only affects the job in
question, but also other current or future jobs in complex ways. For large transfers with
start and end time constraints, peak bandwidth assignment can lead to an undesirable
phenomenon known as fragmentation [77], which in turn leads to low utilization of
network resources. This occurs when some time intervals are lightly loaded but not long
enough to accommodate new large jobs. Greater transfer flexibility is needed to combat
this problem, such as time-varying bandwidth assignment and dynamic re-assignment.
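The transfer-time figures quoted above follow from simple arithmetic; the sketch below reproduces them, assuming ideal, fully dedicated circuits with no protocol overhead (an idealization; real transfers would be slower):

```python
# Transfer-time arithmetic behind the bulk-transfer examples above.
# Assumes ideal, fully dedicated circuits with no protocol overhead.

def transfer_seconds(file_bytes: float, rate_bps: float) -> float:
    """Ideal transfer time: file size in bits divided by line rate."""
    return file_bytes * 8 / rate_bps

GB = 1e9
file_size = 100 * GB

# Ten 10 Gbps lightpaths used in parallel -> 100 Gbps aggregate.
t_lightpaths = transfer_seconds(file_size, 10 * 10e9)   # 8.0 seconds

# A single 155 Mbps (OC-3) SONET circuit.
t_oc3 = transfer_seconds(file_size, 155e6)              # ~5161 s, about 1 h 26 min

print(t_lightpaths, t_oc3 / 60)
```

The two endpoints of this range illustrate why the choice of rate profile matters: the same request can occupy a huge slice of the network briefly or a thin slice for a long time, with very different effects on fragmentation.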
Streaming workflow. For the nanomaterial sciences conducted at DOE’s Center
for Functional Nanomaterials, research often involves distributed collaboration among
smaller research centers with different scientific instruments and capabilities [2]. Data
are generally collected from several centers and/or are compared against each other.
Then, a medium sized cluster of computers processes and analyzes the data. The
visualization is done by special workstations equipped with large memory graphics cards
to handle the large images and volumes of data from the output of the data processing.
The generated animation is then streamed to remote scientists’ desktops, or in the case
where the visualization is in stereo, to a 3D theater. The network requirements vary at
each stage of the workflow.
Data-intensive workflow. Large-scale supercomputing is expected to produce
data at a similar rate to large-scale experiments. In order to post-process the computed
results, high throughput transfers are often required to stage the data at the related
computational resources. Similarly, high-end scientific computing also processes large
amounts of input data that, from a performance perspective, should be accessible as
efficiently as possible. Local parallel file systems are well suited for supporting the
required I/O capabilities, even when data has to be staged to the respective file
systems. Community schedulers need to control multiple distributed computational
resources in order to serve individual workflows.
1.3 Problems Addressed and Our Contributions
E-Science networks usually provide QoS guarantees, i.e., bandwidth guarantees,
through Multiprotocol Label Switching (MPLS) and Generalized MPLS (GMPLS)
to meet the requirements of e-Science applications, e.g., in-advance path
reservations for high-volume data transfers. The distinctive features of e-Science
applications compared with other distributed applications can be summarized by two key
phrases: “network-centric” and “in-advance”. Unlike other grid computing applications,
scheduling of e-Science applications puts more focus on network resources, treating
them as the most important among multiple resource types such as compute
and storage. Moreover, in-advance scheduling of e-Science
applications satisfies the needs of users requesting periodic or predictable services.
1.3.1 Bandwidth Allocation for Iterative Data-dependent Applications
We present a framework for bandwidth scheduling of streaming e-Science
applications. These applications include interactive visualization of simulations, large
data streaming coordinated with job execution for producer consumer applications,
and networked supercomputing [46]. We have adapted the Synchronous Dataflow
(SDF) model to model and analyze iterative data-dependent applications in e-Science.
Synchronous dataflow was proposed in the late 1980s as a modeling method for digital
signal processing (DSP) applications, but it ignores the communication delays. Our
model incorporates the communication delays that are inherent in large-scale distributed
applications. We have formulated the bandwidth allocation problem of iterative
data-dependent e-Science applications with temporal constraints as a multi-commodity
linear programming problem. It incorporates optimal rates and buffer minimization for
streaming applications that can be represented by an SDFG. Our algorithms determine
how much bandwidth is allocated to each edge while satisfying temporal constraints
on collaborative tasks. Using the solution of the bandwidth allocation problem, buffer
requirements for the schedule are then derived using procedures similar to the ones
presented in [50]. To the best of our knowledge, this represents the first attempt to
analyze the temporal behavior of collaborative, iterative tasks and to determine the
optimal bandwidth allocations among distributed nodes.
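The balance equations at the heart of an SDF model determine how many times each task must fire per iteration before bandwidth can be allocated to the edges. A minimal sketch, using a hypothetical three-task graph (the rates below are illustrative, not taken from this dissertation):

```python
from fractions import Fraction
from math import lcm

# Toy SDF graph: edges as (producer, prod_rate, consumer, cons_rate).
# Balance equation per edge: rate[u] * prod_rate == rate[v] * cons_rate.
edges = [("A", 2, "B", 3), ("B", 1, "C", 2)]

def repetition_vector(edges):
    """Smallest positive integer firing counts satisfying the balance
    equations of a connected, consistent SDF graph."""
    rates = {edges[0][0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for u, p, v, c in edges:
            if u in rates and v not in rates:
                rates[v] = rates[u] * p / c
                changed = True
            elif v in rates and u not in rates:
                rates[u] = rates[v] * c / p
                changed = True
    for u, p, v, c in edges:          # consistency check
        assert rates[u] * p == rates[v] * c, "inconsistent SDF graph"
    scale = lcm(*(r.denominator for r in rates.values()))
    return {task: int(r * scale) for task, r in rates.items()}

print(repetition_vector(edges))  # {'A': 3, 'B': 2, 'C': 1}
```

With the firing counts known, the data volume carried by each edge per iteration is fixed, which is what makes it possible to cast bandwidth allocation under temporal constraints as a linear program.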
1.3.2 Topology Aggregation for E-Science Networks
The network supporting e-Science applications is typically composed of multiple
domains. Each domain usually belongs to a different organization and is managed
under its own operational policies. In such cases, the internal topologies of domains
may not be visible to the others for security or other reasons. However, aggregated
information of internal topology and associated attributes is advertised to the other
domains.
The set of techniques for aggregating the data that a domain advertises externally is called
Topology Aggregation (TA). The aggregated data itself is termed the Aggregated
Representation (AR). A survey of TA algorithms is presented in [98]. There exists a
tradeoff between the accuracy and the size of AR. Hence, most algorithms proposed in
the previous work tried to achieve the most efficient AR in terms of both accuracy and
space complexity.
One can classify QoS path requests into two classes: single-path single-job
(SPSJ) and multiple-path multiple-job (MPMJ), depending on the nature of requests.
SPSJ corresponds to a situation in which requests for a single QoS path arrive and
are scheduled in the order of arrival. In contrast, MPMJ corresponds to batch/off-line
scheduling of multiple requests for multiple QoS paths. Many e-Science applications
require simultaneous transfer of data between multiple sources and destinations. Also,
each of these requests (e.g., file transfers) can be supported more efficiently by using
multiple concurrent paths.
We show that existing TA approaches developed for SPSJ do not work well with
MPMJ applications as they overestimate the amount of bandwidth that is available. We
propose a max flow based TA approach that is suitable for this purpose. Our simulation
results demonstrate that our algorithms result in better accuracy or less scheduling time.
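The intuition behind a max-flow-based AR can be illustrated on a toy domain (the topology and capacities below are hypothetical): a single widest path between border nodes s and t carries only 10 Gbps, while the bandwidth jointly available over concurrent paths is 15 Gbps, which is what a max-flow computation reports.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max flow on an adjacency-dict capacity map."""
    flow = 0
    res = {u: dict(nbrs) for u, nbrs in cap.items()}  # residual capacities
    while True:
        # BFS for an augmenting path in the residual graph.
        parent, q = {s: None}, deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        # Recover the path and its bottleneck capacity.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= aug
            res.setdefault(v, {})
            res[v][u] = res[v].get(u, 0) + aug
        flow += aug

# Hypothetical intra-domain topology between border nodes s and t (Gbps).
cap = {"s": {"a": 10, "b": 10}, "a": {"t": 10}, "b": {"t": 5}}
print(max_flow(cap, "s", "t"))  # 15: the joint capacity advertised for (s, t)
```

A per-path AR built for SPSJ would either advertise the single widest path (10 Gbps, an underestimate for MPMJ) or sum path capacities over shared links (an overestimate); the max-flow value is exactly the bandwidth a batch of concurrent transfers can jointly obtain.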
1.3.3 Workflow Scheduling
Workflow/Directed Acyclic Graph (DAG) scheduling has been shown to be NP-hard
[91]. A number of practical heuristics have been developed for this problem. Most of
these ignore the communication costs [26, 39] or assume a very simple interconnection
network model, i.e., a fully-connected network model without contention [59, 60, 79, 97,
99]. The work in [97] proposed the heterogeneous-earliest-finish-time (HEFT) algorithm
extended from the classic list scheduling for heterogeneous computing resources.
However, the advances in computing platforms ranging from clusters to grids and
emerging clouds for data-intensive applications have posed new challenges where
network contention is an important issue that needs to be addressed. We propose
to address this issue by formulating and solving the overall workflow scheduling that
incorporates network contention and overheads of the large scale data transfers. In
particular, we address the following issues for e-Science grids that have networks that
are a mix of IP networks and optical networks:
• Malleable resource allocation.
• Dynamic multipath scheduling.
• Multiple workflows.
We have formulated workflow scheduling problems in e-Science networks, whose
goal is minimizing either makespan or network resource consumption by jointly
scheduling heterogeneous resources such as compute and network resources. The
formulations are different from previous work in the literature in the sense that they
allow dynamic multiple paths for data transfer between tasks and more flexible resource
allocation that may vary over time. Moreover, our work is the first to formulate the
workflow scheduling problem incorporating multiple paths as a mixed integer linear
programming (MILP) problem. We also formulate a linear programming relaxation, LPR, of our
MILP, an edge-path based LP relaxation, LPREdge, and a list scheduling heuristic, LS.
The experimental results show that the makespan of LPR schedules is much closer
to optimal than that of LS schedules when the communication-to-compute ratio (CCR) is
large. The LS algorithm performs roughly on par with the LPR algorithm when CCR = 0.1
and 1.0, but the performance gap of these non-optimal algorithms grows dramatically as
CCR grows from 1 to 10. Our results indicate that data-intensive workflow scheduling,
which is common in e-Science applications, will benefit from dynamic multiple paths and
malleable resource allocation.
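The flavor of a list scheduling heuristic of this kind can be sketched on a toy instance (the task graph, costs, and two-resource system below are hypothetical, not the LS algorithm of Chapter 4): tasks are taken in topological order and each is placed on the resource giving the earliest finish time, paying a communication cost only when a predecessor ran on a different resource.

```python
# Toy list scheduler illustrating why communication cost (CCR) matters.
tasks = ["A", "B", "C", "D"]                  # already in topological order
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
compute = {"A": 2, "B": 3, "C": 3, "D": 2}    # execution times
comm = 4                                      # uniform data-transfer time

def list_schedule(resources=("r1", "r2")):
    free = {r: 0 for r in resources}          # time each resource becomes idle
    placed, finish = {}, {}
    for t in tasks:
        best = None
        for r in resources:
            # Data from predecessors placed elsewhere must be transferred.
            ready = max([finish[p] + (comm if placed[p] != r else 0)
                         for p in preds[t]] or [0])
            f = max(ready, free[r]) + compute[t]
            if best is None or f < best[0]:
                best = (f, r)
        finish[t], placed[t] = best
        free[best[1]] = best[0]
    return max(finish.values())               # makespan

print(list_schedule())  # 10: B and C serialize on one resource to avoid transfers
```

On this instance the heuristic keeps all four tasks on one resource: with comm = 4 the transfer penalty outweighs any parallelism, which mirrors the observation above that greedy list scheduling degrades as CCR grows, while formulations with multiple paths and malleable bandwidth can shrink exactly that penalty.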
1.4 Background and Related Work
Ongoing research projects for supporting e-Science applications (e.g., HOPI [16],
Ultralight [8], Teragrid[23], CHEETAH [6], DRAGON [63], ESnet [14], OSCARS/BRUW
[10]) have mainly focused on setting up a fast data plane with a large footprint and
sufficient connectivity and setting up a basic but functional control plane, such as
developing signaling and control middleware for handling user requests for elementary
network services, ensuring security and improving reliability. However, the control plane
mechanisms lack sophisticated network service support or efficient service reservation
algorithms. They normally only support fixed bandwidth guarantee by reserving circuits
or lightpaths. Using such a restricted set of services or simplistic resource management
algorithms to support diverse e-Science activities can lead to inefficient utilization of
the network resources (especially for the new-generation, high-speed, coarse-granular
networks, such as the wavelength-based systems) and/or not provide the level of
performance required by those activities in desired but varied performance metrics.
Compared with the traditional QoS frameworks, such as IntServ [31], DiffServ
[29], ATM networks [71], or MPLS [86], admission control and scheduling for research
networks are recent concerns with limited published work. Prior work is either about
dedicated path reservation, bulk data transfers or jobs that require minimum bandwidth
guarantee (MBG). None has considered as rich a class of job types as we do.
Control plane protocols, architectures and tools. The NSF-supported DRAGON
[63, 106] project develops control plane architecture and middleware for multi-domain
traffic engineering and resource allocation, e.g., using GMPLS protocols [43] for setting
up SONET circuits or lightpaths. It uses a centralized resource computation element
per domain, which is responsible for computing paths. It supports advance reservations
of label switched paths (LSPs) over requested time periods. CHEETAH [6] is a similar
project to DRAGON but is more traditional in that it focuses on simpler, distributed
operations for path computation and bandwidth management to support high arrival
rates of immediate connection requests. OSCARS [10] is the control plane project for
DOE’s ESnet, also similar to DRAGON. It develops and deploys a prototype service
that enables on-demand provisioning of guaranteed bandwidth circuits for ESnet. HOPI
[16] is a testbed project on research networks that examines how to provide network
services in a hybrid network of shared IP packet switching and dynamically provisioned
lightpaths.
[52] presents an architecture for advance reservation of intra- and interdomain
lightpaths. GARA [48], the reservation and allocation architecture for the grid computing
toolkit Globus [15], supports advance reservation of network and computing resources.
[40] adapts GARA to support advance reservation of lightpaths, MPLS paths and
DiffServ paths. Other related work in this category includes GridJIT [96], ODIN [54], [30]
and [34]. Much of the objectives, architectural framework, and capabilities of the
proposed project coincide with those of the NSF's GENI project [22], for instance, the use of network
controllers and the support of network virtualization. Most of the above control-plane
architectures and tools provide rudimentary AC and scheduling algorithms for simple job
types. However, much more can be done to support more service types or improve the
network resource utilization.
Path reservation. The ability to provide dedicated or on-demand circuits or
lightpaths is currently the focus of many projects, including most aforementioned
major research networks and associated projects, e.g., Internet2, ESnet, National
Lambda Rail, GEANT2, UltraScience Net (USN), HOPI, DRAGON, CHEETAH and
OSCARS/BRUW. Further examples include User Controlled Light Paths (UCLP) [25],
Enlightened [4], Japanese Gigabit Network II [18], LHCNet [20], and Bandwidth Brokers
[109]. In our previous research work, we proposed novel algorithms for advance
path computation and bandwidth scheduling for connection-oriented networks [87] that
have considerably better performance [57]. In [56], we extended these algorithms
to incorporate the wavelength sharing and wavelength continuity constraints.
MBG service. Several earlier studies [32, 36, 89, 104] have considered AC at
an individual link for the MBG (minimum bandwidth guarantee) job type with start and
end times. The concern is typically about designing efficient data structures, such as a
segment tree [32], for recording and querying the link bandwidth usage on different time
intervals. Admission of a new job is based on the availability of the requested bandwidth
between its start time and end time. [35, 44, 51, 100] and [36] tackle the more general
path-finding problem for the MBG class, but typically only for new requests, one at
a time. The routes and bandwidth of existing jobs are unchanged. [64] considers
a network with known routing in which each admitted job derives a profit. It gives
approximation algorithms for admitting a subset of the jobs so as to maximize the total
profit.
Bulk transfer. Recent papers on AC and scheduling algorithms for bulk transfer
with advance reservations include [35, 37, 51, 70, 73–75, 77, 82]. In [77], the AC and
scheduling problem is considered only for the single-link case. Network-level AC and
scheduling are considered to be outside the scope of [77]. As a result, multi-path routing
and network-level bandwidth allocation and re-allocation have no counterpart in [77]. In
contrast, we periodically re-optimize the bandwidth assignment for all the new and old
jobs.
For a one-time scheduling problem, our recent work [82] conducts a detailed
performance comparison between single-slice scheduling and multi-slice scheduling
under various slice sizes, and between single-path routing and multi-path routing. We
conclude that a small number of paths per job is usually sufficient to yield near-optimal
throughput; multi-slice scheduling leads to significant performance (e.g., throughput)
improvement. Other authors have also considered a similar problem but with different
emphasis [37].
In [73–75], the authors consider single-link AC or link-by-link AC under single-path
routing. The AC uses heuristic algorithms instead of solutions to optimization problems.
Based on its size and the deadline, the average required bandwidth of a bulk transfer
job is computed. The AC is based on the job’s average bandwidth requirement. The
bandwidth of existing jobs may be re-allocated only for the single-link case.
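The average-bandwidth admission test described above can be sketched as follows. This is our illustrative reconstruction, not the cited authors' code; the tuple format for reservations is an assumption.

```python
def admit_average_bw(size, start, deadline, reservations, capacity):
    """Single-link AC based on a bulk job's average bandwidth requirement.
    size / (deadline - start) is the constant rate the job would need;
    the job is admitted iff that rate fits under `capacity` at every
    breakpoint of the already admitted (rate, start, end) reservations.
    Returns the allocated rate, or None on rejection.
    """
    rate = size / (deadline - start)
    # Residual capacity only changes where an existing reservation starts or ends.
    breakpoints = {start} | {t for _, s, e in reservations
                             for t in (s, e) if start <= t < deadline}
    for t in breakpoints:
        in_use = sum(r for r, s, e in reservations if s <= t < e)
        if in_use + rate > capacity:
            return None
    reservations.append((rate, start, deadline))
    return rate
```

On a 10-unit link, for example, a 40-unit job over [0, 8) is admitted at rate 5, while a second 48-unit job over the same window would need rate 6 and is rejected.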
The authors of [35] propose a malleable reservation scheme for bulk transfer, which
checks every possible interval between the requested start and end times for the job and
tries to find a path that can accommodate the entire job on that interval. The scheme
favors intervals with earlier deadlines. In [51], the computational complexity of a related
path-finding problem is studied and an approximation algorithm is suggested. [70] starts
with an advance reservation problem for bulk transfer, but converts it into a constant
bandwidth allocation problem to maximize the job acceptance rate. All the requests
are known at the time of AC; AC/scheduling is carried out only once. The bandwidth
constraints are at the ingress and egress links only, and hence, there is no routing issue.
Grid/Utility/Cloud Computing. Network resource provisioning problems in
e-Science networks share some design goals, such as achieving the earliest finish time
of a job, with resource management problems in grid/utility/cloud computing. They differ,
however, in that network resources, i.e., the bandwidth of links, are assumed to be
guaranteed and manageable by emerging technologies such as MPLS and GMPLS.
Such QoS-guaranteeing infrastructures for e-Science applications stem from the fact
that common e-Science applications transport large volumes of data at very high rates.
This difference opens a research area of more principled management of network
resources, which can improve overall system performance.
Optical networks. E-Science networks are a mix of IP networks and optical networks.
In optical networks, the bandwidth along a given link can be decomposed into multiple
wavelengths, which imposes the following constraints.
• Wavelength continuity constraint: This constraint forces a single lightpath to occupy the same wavelength throughout all the links that it spans. The constraint is not required when an optical network is equipped with wavelength converters; when such converters are present, the network is called a wavelength-convertible network.

• Wavelength sharing constraint: For many deployments, it is most effective to consider the bandwidth on a link as consisting of integer multiples of a wavelength, with a single wavelength as the unit of assignment, i.e., one wavelength is occupied by only one reservation at any point in time. It is worth noting that techniques based on Time Division Multiplexing (TDM)/Wavelength Division Multiplexing (WDM) [110] allow for decomposing the bandwidth on a wavelength.
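Both constraints can be illustrated with a first-fit wavelength assignment check. The sketch below is generic and not taken from any of the systems cited here; a lightpath must find one wavelength that is simultaneously free on every hop.

```python
def first_fit_wavelength(path_links, free, num_wavelengths):
    """Assign one wavelength to a lightpath under the two constraints above:
    continuity (the same wavelength end to end, assuming no converters)
    and sharing (each wavelength on a link carries at most one reservation).
    `free[link]` is the set of unused wavelength indices on that link.
    Returns the assigned wavelength, or None if the request is blocked.
    """
    for w in range(num_wavelengths):
        if all(w in free[link] for link in path_links):
            for link in path_links:
                free[link].discard(w)  # the wavelength is now occupied end to end
            return w
    return None
```

In a wavelength-convertible network the `all(...)` test would relax to requiring only that *some* wavelength is free on each hop individually.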
The related issues in the research area of optical networks are the routing and
wavelength assignment (RWA) problem, the virtual topology (VT) problem, the traffic
grooming (TG) problem, and the task scheduling and lightpath establishment (TSLE) problem.
1.5 Outline of Dissertation
The remainder of this dissertation is organized as follows.
Chapter 2 describes an SDF-based model for iterative data-dependent e-Science
applications that incorporates variable communication delays and temporal constraints,
such as throughput. We formulate the problem as a variation of multi-commodity
linear programming with an objective of minimizing network resource consumption
while meeting temporal constraints. The resulting solution can then be used to derive
buffer space requirements by previously developed algorithms in the context of DSP
applications. Finally, an illustrative example of an e-Science application shows that
the framework and algorithm we propose are suitable for modeling and analyzing iterative
data-dependent e-Science applications. The simulation results show that the optimal
bandwidth allocation by the formulated linear program outperforms the bandwidth
allocation by a simple heuristic in terms of the rejection ratio of requests.
Chapter 3 describes topology aggregation algorithms for e-Science networks.
E-Science applications require higher quality intradomain and interdomain QoS
paths, and some of those are distinguished from classic single-path single-job (SPSJ)
applications. We define a new class of requests, called multiple-path multiple-job
(MPMJ), and propose TA algorithms for the new class of applications. The proposed
algorithms, star and partitioned-star aggregated representations (ARs), are shown to be significantly better than naive
approaches.
Chapter 4 describes efficient algorithms for workflow scheduling problems in
e-Science networks, whose goal is minimizing either makespan or network resource
consumption by jointly scheduling heterogeneous resources such as compute and
network resources. Our algorithms are different from previous work in the literature in
the sense that they allow dynamic multiple paths for data transfer between tasks and
more flexible resource allocation that may vary over time. In addition, it is advantageous
that the formulation for single-workflow scheduling can be easily extended to a
formulation for multiple-workflow scheduling.
CHAPTER 2
BANDWIDTH ALLOCATION FOR ITERATIVE DATA-DEPENDENT E-SCIENCE
APPLICATIONS
2.1 Overview
E-Science activities often require the transport of large volumes of data at very high
rates among a large number of collaborating sites [33, 78], severely stressing network
resources. For instance, the high-energy physics (HEP) data are expected to grow from
the current petabytes (10^15 bytes) to exabytes (10^18 bytes) by 2015 [33]. Beyond the obvious need
for large amounts of data to be transferred, e-Science requirements for network use are
significantly different from those of traditional network applications [7, 46, 55]. The underlying
applications require schedulable, high-bandwidth, low-latency connectivity with known
future characteristics or performance guarantees [41] for real-time remote visualization,
interactions with instruments, distributed simulation or data analysis, and so on. In a
distributed workflow system that involves many entities, such as distant parties, scientific
instruments, computation devices, as well as complex feedback in various stages of the
workflow, unintended delays due to a lack of planning for future communication paths
can ripple through the entire workflow environment, slowing down other participating
systems as they wait for intermediate results, thus reducing the overall effectiveness
[55].
The focus of this chapter is on supporting e-Science applications that require
streaming of information between sites. We present a framework for bandwidth
scheduling of streaming e-Science applications. These applications include interactive
visualization of simulations, large data streaming coordinated with job execution for
producer consumer applications, and networked supercomputing [46]. The main
contributions are as follows:
1. We have adapted the Synchronous Dataflow (SDF) model to model and analyze iterative data-dependent applications in e-Science. Synchronous dataflow was proposed in the late 1980s as a modeling method for digital signal processing (DSP) applications, but it ignores communication delays. Our model incorporates the communication delays that are inherent in large-scale distributed applications.

2. We have formulated the bandwidth allocation problem of iterative data-dependent e-Science applications with temporal constraints as a multi-commodity linear programming problem. It incorporates optimal rates and buffer minimization for streaming applications that can be represented by an SDFG. Our algorithms determine how much bandwidth is allocated to each edge while satisfying temporal constraints on collaborative tasks. Using the solution of the bandwidth allocation problem, buffer requirements for the schedule are then derived using procedures similar to the ones presented in [50].
To the best of our knowledge, this represents the first attempt to analyze the temporal
behavior of collaboratively iterative tasks and to determine the optimal bandwidth
allocations among distributed nodes.
The rest of this chapter is organized as follows. We provide a detailed description
of SDF and its operational semantics and examine its applicability to e-Science
applications in Section 2.2. We present an overall process of problem-solving, including
a mathematical formulation as a linear program and a discussion of the
deployment of the obtained solution in real systems in
Section 2.3. We show that our approach outperforms a naive heuristic, also given by us,
in Section 2.4. Lastly, we conclude with a summary and a discussion of the practicality of
our approach in Section 2.5.
2.2 Synchronous Dataflow for E-Science Applications
The SDF model of computation was first proposed by Lee in [62]. The SDF model
has been found to be very useful for expressing DSP applications that have the following
features: infinitely looping execution, discretized communication expressed by tokens,
and parallelism to be exploited for maximizing throughput. Most of the existing research
for these problems is limited to deriving maximal rates and buffer minimization.
An SDFG is a directed graph defined by G = (V, E, I, O, τ, Φ), where V and E
represent the set of nodes and the set of edges, respectively. Each node in an SDFG is
called an actor, and each edge is called a communication channel, or simply a channel. The
notation is based on its earlier use in DSP applications comprising function blocks and
the communication channels interconnecting them. An actor repeats its task infinitely,
and the execution of its task is called a firing. In this chapter, we use the terms node and
actor, and edge and channel, interchangeably. An actor can produce and consume data
per channel at different rates, which are specified by the number of tokens.
The number of tokens is a positive integer. If multiple inputs and outputs are
associated with an actor, it is assumed that the actor waits until every input buffer holds
the tokens to be consumed and every output buffer has space available.
A homogeneous SDFG, in which at most one token is produced or consumed per
firing on each edge, is a special case of an SDFG.
The number of tokens that actors produce and consume is specified by sets, I and
O. I is a set of numbers of tokens consumed by destination actors of edges, and O is a
set of numbers of tokens produced by source actors of edges. Thus, each edge (u, v)
Figure 2-1. An example of SDFG: (a) an SDFG in which actor u (firing r_u times per iteration) produces O_uv = 2 tokens per firing and actor v (firing r_v times) consumes I_uv = 1 token; (b) a timeline of the schedule over iterations 0 and 1.
is associated with two integer values, Iuv and Ouv . Consider the sample SDFG shown
in Figure 2-1 (a). The edge (u, v) has two associated integer values, Iuv and Ouv , which
are 1 and 2, respectively. This represents the fact that actor u produces 2 tokens at each
firing and actor v consumes 1 token at each firing. In addition, τ is a set of execution
times of actors, and the execution time of each actor’s firing is denoted by τi . Finally, a
set Φ represents the initial numbers of tokens on edges, which are necessary for the
start of iterative operations of a SDFG.
Using the known properties of a homogeneous SDFG allows us to derive the
maximal computation rates as well as buffer requirements. Also, it can be shown
that any arbitrary SDFG can be converted into a homogeneous SDFG, although this
conversion may increase the size of the network exponentially.
To adapt the SDFG model for e-Science applications, it is important to understand
the key differences between e-Science and DSP applications. A summary of differences
between DSP and e-Science applications is provided in Table 2-1. Unlike DSP
applications, e-Science applications can be represented by acyclic graphs, have
fixed start and end time, and have communication delays to be considered. The time
unit of DSP applications is on the order of a few milliseconds, compared to the time unit
of e-Science applications that may be from a few hours to several days. Throughput is
the most important objective in both DSP and e-Science applications. However, for DSP
Table 2-1. Comparison between DSP and e-Science applications

Category               DSP application                  e-Science application
Inter-task dependency  Cycles are allowed.              Usually acyclic.
Execution period       Infinite.                        Finite.
Time unit              Small (a few milliseconds).      Ranges from small to large
                                                        (a few minutes).
Compute resource       Unlimited.                       Unlimited, or limited if compute
                                                        resources must be co-allocated.
Communication delay    Assumed to be 0.                 Needs to be considered.
Temporal constraints   Objective is maximizing          Throughput.
                       computation rate (throughput).
Schedule               Static or dynamic.               Static.
applications, tradeoffs are between throughput and buffer size, while for e-Science, the
tradeoff is generally between throughput and network resource requirements. The focus
of our work is on optimizing these resources.
Lee [61] divided scheduling of parallel computation defined by SDFG into four
classes: fully dynamic, static assignment, self-timed, and fully static. Fully dynamic
scheduling schedules actors at run-time only. In static assignment, assignment of actors
to processors is done off-line and a local run-time scheduler of each processor invokes
actors assigned to the processor. In self-timed scheduling, the assignment and ordering
of actors on each processor is determined off-line and exact firing time is scheduled at
run-time. In other words, the actor that will be executed by a certain processor waits
for all input data to be available and is fired once all input data are ready. Finally, fully
static scheduling determines all information off-line. Based on this classification, the
target e-Science applications can be considered to be self-timed. A node of SDFGs for
e-Science applications represents one site, such as a data server or a computing node.
This implies that every actor is assigned to a unique processor that only manages that
task.
As described earlier, an SDFG is represented by G = (V, E, I, O, τ, Φ). Since actors
can produce or consume tokens at different rates, a feasible schedule should guarantee
that tokens are not infinitely accumulated. In Figure 2-1 (a), actor u produces 2 tokens at
each firing, while actor v consumes 1 token. To prevent infinite buffer overflow, actor u
should be fired once for every two firings of actor v . Formally, this can be stated by the
equation, ru×2 = rv×1, where ru and rv denote firing rates of actor u and v , respectively.
These kinds of equations are called balance equations or state equations. To solve
balance equations formally, we need to define a topology matrix, where ei denotes the
i th edge and Oei and Iei denote the number of produced tokens and consumed tokens,
respectively, on an edge ei .
Definition 1 (Topology matrix). A topology matrix Γ is an |E| × |V| matrix with entries

    Γ_ij = O_ei          if edge e_i = (v_j, v_k),
           −I_ei         if edge e_i = (v_k, v_j),
           O_ei − I_ei   if edge e_i = (v_j, v_j), i.e., a self-loop,
           0             otherwise.                                  (2–1)
The topology matrix for Figure 2-1 (a) is Γ = (2 −1). The existence of a solution, as
well as a method to solve the balance equations, can be shown using the following
theorem.
Theorem 2.1 ([62]). A connected SDF graph with n actors has a periodic schedule if and
only if its topology matrix Γ has rank n − 1. Further, if its topology matrix has rank n − 1,
then there exists a unique smallest integer solution q to the balance equations Γq = 0. It
can be shown that the entries in the vector q are coprime.
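The smallest integer solution q of Theorem 2.1 (the repetition vector) can be computed without a full linear-algebra package by propagating fractional rates along the edges. The sketch below is our illustration, not from the dissertation, and assumes a connected graph with actors numbered 0..n−1.

```python
from fractions import Fraction
from functools import reduce
from math import gcd, lcm

def repetition_vector(num_actors, edges):
    """Smallest coprime integer solution q of the balance equations
    Gamma * q = 0 (Theorem 2.1).  `edges` is a list of tuples
    (u, v, produced, consumed): actor u produces `produced` tokens and
    actor v consumes `consumed` tokens per firing on edge (u, v).
    Returns None if the balance equations are inconsistent, i.e., the
    topology matrix does not have rank n - 1 and no periodic schedule
    exists.  Assumes the SDFG is connected.
    """
    rate = [None] * num_actors
    adj = {i: [] for i in range(num_actors)}
    for u, v, p, c in edges:
        adj[u].append((v, Fraction(p, c)))  # balance: r_v * c = r_u * p
        adj[v].append((u, Fraction(c, p)))
    rate[0] = Fraction(1)                   # fix one rate, propagate the rest
    stack = [0]
    while stack:
        u = stack.pop()
        for v, factor in adj[u]:
            r = rate[u] * factor
            if rate[v] is None:
                rate[v] = r
                stack.append(v)
            elif rate[v] != r:              # conflicting balance equations
                return None
    # Scale the fractional rates to the smallest coprime integer vector.
    denom = reduce(lcm, (r.denominator for r in rate))
    q = [int(r * denom) for r in rate]
    g = reduce(gcd, q)
    return [x // g for x in q]
```

For the SDFG of Figure 2-1 (a), `repetition_vector(2, [(0, 1, 2, 1)])` returns `[1, 2]`, matching the balance equation r_u × 2 = r_v × 1.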
Given rates of actors obtained by Theorem 2.1, {r1, r2, · · · , rn}, one iteration is
defined as a schedule containing ri firings of actor i . Figure 2-1 (b) shows the optimal
schedule for the SDFG in Figure 2-1 (a) when both actors u and v have self-dependency
loops and the execution times of the actors are all 1.
Theorem 2.2 ([84]). For a homogeneous SDFG represented by G = (V, E, I, O, τ, Φ),
the maximal computation rate of every node in the graph is given by

    min_{∀C} ( Σ_{(i,j)∈C} Φ_ij ) / ( Σ_{i∈C} τ_i ),    (2–2)

where C is any cycle in the graph.
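Equation 2–2 can be evaluated directly on small graphs by enumerating simple cycles. The brute-force sketch below is our illustration, not the cited authors' algorithm; it returns the minimum token-to-time ratio over all cycles.

```python
from fractions import Fraction

def max_computation_rate(tau, phi):
    """Maximal computation rate of a homogeneous SDFG per Theorem 2.2:
    the minimum over simple cycles C of (sum of initial tokens Phi on
    C's edges) / (sum of execution times tau of C's actors).
    `tau` is a list of actor execution times; `phi` maps edges (u, v)
    to their initial token counts.  Each simple cycle is enumerated
    once, rooted at its smallest-numbered actor.  Returns None if the
    graph is acyclic (the rate is then not bounded by any cycle).
    """
    adj = {}
    for (u, v) in phi:
        adj.setdefault(u, []).append(v)
    best = None

    def dfs(start, u, path, tokens):
        nonlocal best
        for v in adj.get(u, ()):
            if v == start:                       # closed a simple cycle
                ratio = Fraction(tokens + phi[(u, v)],
                                 sum(tau[w] for w in path))
                if best is None or ratio < best:
                    best = ratio
            elif v > start and v not in path:    # keep `start` as the min node
                dfs(start, v, path + [v], tokens + phi[(u, v)])

    for s in range(len(tau)):
        dfs(s, s, [s], 0)
    return best
```

A two-actor loop with one initial token and unit execution times, `max_computation_rate([1, 1], {(0, 1): 0, (1, 0): 1})`, gives `Fraction(1, 2)`: one iteration every two time units, as in the schedule of Figure 2-1 (b).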
Regardless of the SDFG type, i.e., homogeneous or multi-rate, the computation
rate of an SDFG is defined as the number of iterations per unit time. The maximal
computation rate of a homogeneous SDFG can be derived by examining all cycles
in the graph. Theorem 2.2 says the maximal computation rate of a homogeneous
SDFG is bounded by the minimum initial token-to-time ratio cycle in the graph. As
for a homogeneous SDFG, the maximal computation rate of an iteration equals
the maximal computation rate of a node, since the number of firings of a node in
one iteration is 1. For a multi-rate SDFG, however, we can compute the maximal
computation rate of a node in two steps. First, we can compute the maximal computation
rate of an iteration after converting the multi-rate SDFG into a homogeneous SDFG.
Figure 2-2 shows the homogeneous SDFG converted from the multi-rate SDFG in
Figure 2-1 when putting a self-dependency loop on each node. A certain node u with a
rate ru in a multi-rate SDFG will be expanded to ru number of nodes in the homogeneous
SDFG converted from the multi-rate SDFG [49]. Hence, the maximal computation rate of
an iteration with regard to Figure 2-2 is 1/2, obtained through the equation

    min{ 1/(1 × r_u), 1/(1 × r_v) } = min{1, 1/2}.

Figure 2-2. A homogeneous SDFG converted from Figure 2-1 (a)

Next, we can compute the maximal rate of each node by multiplying the maximal rate of
an iteration by the rate of the node. In this example, the maximal rates of nodes u and v
are 1/2 (= 1/2 × r_u) and 1 (= 1/2 × r_v), respectively. In this chapter, we call the number
of firings of a node per unit time the throughput of the node.
2.3 Problem Formulation
In this section, we propose an algorithm for determining efficient bandwidth
allocations to edges of the original network topology graph while satisfying temporal
constraints such as throughput, required by an e-Science application whose data
dependency is given by a SDFG. In addition, with these bandwidth allocations, we can
minimize buffer size requirements and find the corresponding schedules.
The overall process of finding the full-fledged solution for an e-Science application is
summarized as follows.
1. Discretization step: In this step, both time and data size take discretized values: execution time, data transmission time, and data transfer size. Discretization is an important requirement for using the SDFG model. For target applications, a base unit for execution and communication times can be chosen and appropriate rounding can be performed. A base time unit should be fine-grained enough to differentiate each actor's execution time and temporal constraints. The base unit can be a few seconds to several hours, depending on the application.

2. Firing rates of actors: Using Theorem 2.1, firing rates of actors guaranteeing a well-behaved SDFG, i.e., one free of deadlock and infinite buffer accumulation, can be calculated. In MATLAB, the firing rates of actors can be obtained through a simple operation, null(Γ, 'r'), where Γ is the topology matrix of the SDFG and null is a MATLAB function returning a solution Z for Γ × Z = 0.

3. Path bandwidth selection: E-Science applications are distributed, and connection-oriented communication paths among distributed nodes are set up on demand or in advance. The bandwidth of paths is guaranteed by network technologies such as multi-protocol label switching (MPLS) and generalized multi-protocol label switching (GMPLS). The communication delay of a path to which bandwidth is allocated on request, within the available bandwidth, is inversely proportional to the allocated bandwidth. Hence, path bandwidth allocation should also be taken into account, since it can affect throughput as well as total network resource consumption. A formal problem formulation is presented in Section 2.3.2.2.

4. Buffer space minimization: The buffer space requirement is the total number of tokens queued on every edge. Clearly, different schedules can lead to different buffer space requirements. The following buffer minimization problem is NP-complete [76]: given a homogeneous SDFG, is there a valid schedule for the SDFG whose buffer space requirements are less than a constant K? It is easier, however, to find the minimum buffer space when the computation rate is fixed, even though the problem is still NP-complete. Using an approach similar to Govindarajan [50] but adapted to e-Science applications, we use a two-phase approach: we first find the optimal solution for the bandwidth allocation problem, then use this solution to minimize the buffer requirements. After the solution to the BAFS problem (described in the next section) is obtained, we have to find exact schedules and minimize buffer space requirements. Since we choose a model where the communication delays are included in the execution times of actors, previously developed algorithms for buffer space requirement minimization can be applied directly. The buffer space requirement minimization problem has been solved in the context of DSP applications in many papers [50, 94, 103].

5. Adjust for deployment in a real system: The implementation of the derived solution requires a few considerations. The generated solution consists of discretized values in terms of the chosen base time and data units. As long as we can ensure that the discretized problem has stricter constraints than the original problem, such as higher production rates and lower consumption rates, the resulting solution remains feasible. Additionally, in the absence of a global clock, synchronization issues need to be considered to force firings of tasks to follow the computed schedule. Self-timed scheduling may not achieve the maximal rate without a global clock if buffer space is limited and not properly synchronized with the actors' schedules. However, for reasonable buffer sizes, the deterioration of the maximal rate will be small.
2.3.1 Illustrative Example
We pick the visualization application in [53] as a representative example of
e-Science applications that can be modeled by an extended synchronous dataflow
Figure 2-3. A real example of e-Science applications [53]: (a) the distributed visualization demonstration; (b) its abstraction into data source, computing, and visualization stages (LSU, San Diego, Brno).
graph (ESDFG). The visualization application shown in Figure 2-3 (a) has a use-case
scenario as follows.
For the demonstration in San Diego, CCT/LSU (Louisiana), CESNET/MU (Czech
Republic) and iGrid/Calit2 (California) participated in a distributed collaborative session.
The visualization front-end is located at LSU running Amira for the 3D texture-based
volume rendering for distributed visualization. The visualization back-end (data
server) also ran at LSU. The actual data set for the demonstration had a size of 120
Gbytes and contained 400^3 data points at each timestep (4 bytes of data per point, for
256 Mbytes per timestep).
In this chapter, we assume a more general model, similar to the use-case in [46],
extended from this application such that data servers reside at different sites from
computing sites. This general model can be abstracted as the diagram in Figure 2-3 (b).
Figure 2-4. An ESDFG model for Figure 2-3. Actor execution times are shown in parentheses (0(1), 1(1), 2(2), 3(2), 4(1)); the data-center edges carry 128 produced/256 consumed tokens, and the visualization edges 1/1.
The system parameters of the visualization application are summarized in Table
2-2. If not explicitly mentioned, all the parameters are per one firing. The figures
marked by bold type are parameters that are not explicitly given in [53], thus arbitrarily
chosen by us within a reasonable range of the associated hardware's performance.

Table 2-2. Summary of system parameters of the visualization application
Item                            Continuous value        Discretized value
Data centers
  Production                    2560 Mbyte              128
  Execution time                1 second                1
Computing site at LSU
  Consumption                   256 Mbyte               256
  Production                    1 frame (1 Mbyte)       1
  Execution time                100 ms                  2
Visualizing site at San Diego
  Consumption                   1 frame (1 Mbyte)       1
  Execution time                100 ms                  2
  Throughput                    At least 5 frames/sec   0.25
Visualizing site at Brno
  Consumption                   1 frame (1 Mbyte)       1
  Execution time                50 ms                   1
  Throughput                    At least 5 frames/sec   0.25
Base time unit: 50 ms; base data unit: 1 Mbyte

The
discretized values for the parameters are computed with appropriately chosen base time
and data unit. For example, the data production speed of data centers, 2560 Mbyte/s,
is discretized into 128 tokens per unit time, since the base time unit is 50 ms and the rate
of 2560 Mbyte/s equals the rate of 128 Mbyte per 50 ms. The resultant ESDFG for the
application is shown in Figure 2-4.
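The conversion just described can be written down directly. The helpers below are our illustration of the arithmetic, working in integer milliseconds to avoid floating-point rounding.

```python
def tokens_per_unit(rate_mb_per_s, base_ms=50, base_data_mb=1):
    """Discretize a continuous production rate into tokens per base time
    unit, e.g., 2560 Mbyte/s with a 50 ms unit and 1 Mbyte tokens -> 128."""
    return rate_mb_per_s * base_ms // (1000 * base_data_mb)

def units(duration_ms, base_ms=50):
    """Discretize a duration into whole base units, rounding up so the
    discrete schedule never underestimates an actor's execution time."""
    return -(-duration_ms // base_ms)  # ceiling division on integers
```

These reproduce the discretized values for the visualization sites in Table 2-2: a 100 ms firing becomes 2 units and a 50 ms firing becomes 1.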
Second, the firing rates of nodes are calculated using simple math on a topology
matrix of the ESDFG, as described in Section 2.2.
Γ =
    [ 128    0  −256    0    0 ]
    [   0  128  −256    0    0 ]
    [   0    0     1   −1    0 ]
    [   0    0     1    0   −1 ]
The solution for rates of nodes is given by [2, 2, 1, 1, 1]. Each element of the solution
vector corresponds to r1 through r5, respectively.
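The solution can be checked by substituting it back into the balance equations; each row of Γ times q must vanish.

```python
# Verify Gamma * q = 0 for the topology matrix and rate vector above.
gamma = [
    [128,   0, -256,  0,  0],
    [  0, 128, -256,  0,  0],
    [  0,   0,    1, -1,  0],
    [  0,   0,    1,  0, -1],
]
q = [2, 2, 1, 1, 1]
residual = [sum(g * r for g, r in zip(row, q)) for row in gamma]
print(residual)  # [0, 0, 0, 0]
```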
The next step is to formulate the problem as a linear program.
2.3.2 Optimal Bandwidth Allocation with a Feasible Schedule
To include temporal constraints such as throughput, we define extended SDFG
(ESDFG) as follows.
Definition 2 (Extended SDFG (ESDFG)). An ESDFG is represented by

    G = (V, E, I, O, τ, D, st, et, T),

where V, E, I, O, τ, D are the same as in an SDFG,
st and et are the start and end times of the execution period of the SDFG, and
T = {(v, T_v) | v ∈ V, T_v ∈ R}.
The set T contains throughput constraints, each defined as a two-tuple (v, Tv),
where v is a node whose throughput must be equal to or greater than Tv. st
and et are used for in-advance bandwidth reservations. Suppose we maintain
data structures for in-advance bandwidth reservations, such as time-bandwidth lists
that record how the available bandwidth on each edge varies over time. We can then
easily obtain a subgraph in which the available bandwidth of each edge is set to the
minimum bandwidth available during the period [st, et). For example, if an edge eij has
available bandwidth 1 over the period [0, 1) and 2 over [1, 2), and st and et are given as 0
and 2, then eij in the subgraph has an available bandwidth of 1. The BAFS problem
formulation operates on this subgraph when in-advance bandwidth reservations are considered.
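Per edge, the subgraph construction reduces to taking the minimum bandwidth over the segments of the time-bandwidth list that overlap the reservation window. A minimal sketch (the function name and the list format are our own assumptions):

```python
def available_bandwidth(tb_list, st, et):
    """Minimum bandwidth guaranteed on an edge over [st, et), given a
    time-bandwidth list of (t_start, t_end, bandwidth) segments.  This
    sketches the subgraph construction described above."""
    return min(bw for (t0, t1, bw) in tb_list if t0 < et and t1 > st)

# the example from the text: bandwidth 1 over [0, 1) and 2 over [1, 2)
print(available_bandwidth([(0, 1, 1), (1, 2, 2)], 0, 2))  # -> 1
```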
Informally, we can define the bandwidth allocation with a feasible schedule (BAFS)
problem as follows: given a network topology represented by G = (V, E) and iterative
data-dependent tasks represented by an ESDFG, Gt = (Vt, Et, It, Ot, τt, st, et, T), what
is the optimal bandwidth allocation with a feasible schedule that minimizes network
resource consumption and meets the temporal requirements?
The formal problem formulation is presented below, after a discussion of how to
model communication delays in established paths for e-Science applications.
2.3.2.1 Modeling communication delays
A communication delay is composed of four factors: processing delay, transmission
delay, queueing delay, and propagation delay. Processing delay is associated with
operations such as packetizing, and is thus proportional to the data size, as is transmission
delay. Queueing delay is stochastic, and propagation delay is constant for a given link.
In this chapter, we assume that e-Science applications run on dedicated networks, i.e.,
MPLS or GMPLS networks, where the paths are established using label switched paths
(LSPs). For such scenarios, queueing delay and propagation delay can be ignored. We
assume that transmission delay dominates the total delay; the processing delay can be
incorporated into the transmission delay because both kinds of delay are proportional to the
data size. We therefore optimize with respect to transmission delay.
We now investigate how to incorporate communication delays into an optimal
computation rate problem given an SDFG. To the best of our knowledge, this has not
been addressed in the literature on SDF modeling for DSP applications. Although
communication delays have been considered in multiprocessor scheduling, the focus
there is mainly on the makespan of schedules, i.e., the total time taken to execute all the
tasks specified by a precedence task graph, not on the throughput of infinitely repeated
schedules. Our target applications are e-Science applications whose data-dependent
distributed nodes collaborate iteratively.
Figure 2-5 (a) shows an SDFG consisting of two actors, u and v. Actor u produces 2
tokens per firing, and actor v consumes 1 token per firing. We assume that it takes 2 units
of time for actor u to send 2 tokens to actor v. The value in parentheses inside a node
indicates the execution time of the node.
There are two ways to integrate the communication delay within the SDF model.
1. The communication delay can be included in the execution time of the producing actor u (Figure 2-5 (b)).
2. The communication delay can be included by adding a dummy actor c whose execution time is set to the communication delay (Figure 2-5 (c)).
Figure 2-5. Modeling communication delay in a SDFG
The first option (Figure 2-5 (b)) implies that communication can occur right after tokens
are produced in the producer’s buffer and the producer cannot be fired again until
transfer of produced tokens is done. This is the most conservative way of modeling
a communication delay since the relation between the execution and communication
is assumed to be synchronous. We call this model the conservative model in this
chapter. If we are not sure how the program is implemented internally, we can take this
conservative model to guarantee the final solution meets the throughput requirements.
The second option (Figure 2-5 (c)) implies that communication can run independently
of the producer. This, in general, can lead to higher buffer space requirements, but
may result in a higher computation rate: as can be seen in Figure 2-5 (c), the optimal
schedule achieves higher throughput than the optimal schedule in Figure 2-5 (b).
We call this the optimistic model, as opposed to the conservative model. Either
model can be chosen independently for each node, and the details of how this choice
is handled in the problem formulation are presented in Section 2.3.
In some cases, an actor may have multiple outgoing communication channels (Figure
2-6 (a)). As with a single communication channel, we can choose between two options:
a conservative approach and an optimistic approach. The conservative approach
adds max{communication delays of the outgoing channels} to the execution time of
Table 2-3. Notation for problem formulation

Functions:
  vt(v)   vt : Z → Z, maps a vertex v in V to a vertex in Vt.
  Com(a)  Com : V → boolean; returns true if actor a is a dummy node modeling a communication delay.

Constants and sets:
  G       (V, E), the original network topology.
  Gt      (Vt, Et, It, Ot, τt, st, et, T), an ESDFG specifying iterative data-dependent tasks.
  Jc      {(si, di) | si ∈ V, di ∈ V, (vt(si), vt(di)) ∈ Et}, the set of communication jobs modeled by the conservative approach, each defined by a two-tuple of source and destination nodes.
  Jo      {(si, di) | si ∈ V, di ∈ V, c ∈ Vt, (vt(si), c) ∈ Et, (c, vt(di)) ∈ Et}, the set of communication jobs modeled by the optimistic approach, likewise defined by two-tuples of source and destination nodes.
  J       Jc or Jo, depending on the approach.
  sj      sj ∈ V, j ∈ Jc ∨ j ∈ Jo, source node of job j.
  dj      dj ∈ V, j ∈ Jc ∨ j ∈ Jo, destination node of job j.
  τi      Execution time of node (actor) i ∈ Vt.
  ri      Rate of node (actor) i ∈ Vt.
  Ij      j ∈ J, amount of data (number of tokens) consumed by actor vt(dj).
  Oj      j ∈ J, amount of data (number of tokens) produced by actor vt(sj).
  Clk     Available bandwidth on edge (l, k) ∈ E during the period [st, et).
  Vtf     The set of front-end nodes whose throughputs are of concern, Vtf ⊂ Vt.
  Td      Throughput requirement of node (actor) d ∈ Vtf, specified by users.

Variables:
  Rmax    The maximal computation rate of an iteration.
  td      Throughput of node (actor) d ∈ Vtf.
  f^j_lk  Flow of job j on edge (l, k) ∈ E.
  Dj      Allocated bandwidth for job j.
a producer actor. Figure 2-6 (b) shows such a case, where the execution time of actor
u increases by 3, i.e., max{2, 1, 3}. One drawback of this model is that it prevents
early-executable actors from starting on their own schedules. For example, actor w in Figure
2-6 (b) cannot be fired 1 unit of time after u finishes its execution; instead, it must wait 2
more units of time. The other approach, the optimistic one shown in Figure 2-6 (c), is the same
as in the single-channel case: for each channel, a logical actor accounting for the
corresponding communication delay is inserted between the original
producer and consumer actors.
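Both transformations can be sketched as a rewrite of the SDFG itself. The function below is our own illustration of the models of Figures 2-5 and 2-6, not code from the framework; actor names, the edge-tuple layout, and the dummy-actor naming are assumptions.

```python
def apply_delay_model(exec_time, edges, delay, model="conservative"):
    """Rewrite an SDFG to account for communication delays.  `exec_time`
    maps actor -> execution time, `edges` is a list of
    (src, dst, produced, consumed), and `delay` maps (src, dst) -> delay."""
    times = dict(exec_time)
    if model == "conservative":
        # fold the largest outgoing delay into each producer (Figure 2-6 (b))
        worst = {}
        for (s, d, p, c) in edges:
            worst[s] = max(worst.get(s, 0), delay[(s, d)])
        for s, w in worst.items():
            times[s] += w
        return times, list(edges)
    # optimistic: one dummy actor per channel (Figure 2-6 (c)); the dummy
    # fires once per producer firing, so it consumes and produces p tokens
    new_edges = []
    for (s, d, p, c) in edges:
        comm = f"c_{s}_{d}"
        times[comm] = delay[(s, d)]
        new_edges.append((s, comm, p, p))
        new_edges.append((comm, d, p, c))
    return times, new_edges

# Figure 2-6: U(2) feeds V, W, X over channels with delays 2, 1, 3
exec_time = {"U": 2, "V": 1, "W": 1, "X": 1}
edges = [("U", "V", 1, 1), ("U", "W", 1, 1), ("U", "X", 1, 1)]
delay = {("U", "V"): 2, ("U", "W"): 1, ("U", "X"): 3}
t, _ = apply_delay_model(exec_time, edges, delay, "conservative")
print(t["U"])  # -> 5, i.e. 2 + max{2, 1, 3}, matching Figure 2-6 (b)
```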
Figure 2-6. Modeling communication delay in the case of multiple communicationchannels
A more elaborate analysis of an actor's execution pattern may lead to more exact
modeling; Figure 2-7 shows in which cases and how our models can be improved. The
semantics of SDF enforce that the output of an actor is generated at the end of the
actor's execution. Hence, in the case of Figure 2-6 (a), actors v, w, and x can start their
own execution at least 2, 1, and 3 units of time, respectively, after actor u's firing is done,
which means actor x can start at time 5 if actor u is fired at time 0. However, suppose that
the output data for actor x is actually generated at time 1, one unit into u's execution.
The communication delay on the channel between actors u and x can then be adjusted
to 2, as in Figure 2-7. The subsequent procedures for incorporating communication delay
into the SDF model take either of Figure 2-6 (b) and (c) as input.
Figure 2-7. More exploited parallelism in case of multiple communication channels
2.3.2.2 Problem formulation
The notation for the BAFS problem is summarized in Table 2-3. The BAFS problem
can be formulated as the linear programs shown in Figures 2-8 and 2-9 for the
conservative and the optimistic models, respectively.
Objective:

  minimize Σ_{j∈J, (l,k)∈E} f^j_lk                                            (2–1)

Multi-commodity flow constraints:

  Σ_{k:(l,k)∈E} f^j_lk − Σ_{k:(k,l)∈E} f^j_kl = 0,  l ≠ sj, l ≠ dj, ∀j ∈ J    (2–2)

  Σ_{j∈J} f^j_lk ≤ Clk,  ∀(l,k) ∈ E                                           (2–3)

  Σ_{k:(l,k)∈E} f^j_lk − Σ_{k:(k,l)∈E} f^j_kl = Dj if l = sj, −Dj if l = dj,  ∀l ∈ V, j ∈ J    (2–4)

  0 ≤ f^j_lk,  ∀j ∈ J, (l,k) ∈ E                                              (2–5)

  0 ≤ Dj                                                                      (2–6)

Temporal constraints:

  Rmax ≤ 1 / (ri (τi + Oj/Dj)),  i ∈ Vt, j ∈ Jc, (vt(sj), vt(dj)) ∈ Et        (2–7)

  td = Rmax · rd,  d ∈ Vtf                                                    (2–8)

  Td ≤ td ≤ rd / (ri (τi + Oj/Dj)),  d ∈ Vtf, i ∈ Vt, j ∈ Jc, (vt(sj), vt(dj)) ∈ Et    (2–9)

  Td · (τi Dj + Oj) ≤ (rd / ri) Dj,  d ∈ Vtf, i ∈ Vt, j ∈ Jc, (vt(sj), vt(dj)) ∈ Et    (2–10)

Figure 2-8. BAFS problem formulation in the case of the conservative model
Objective:

  minimize Σ_{j∈J, (l,k)∈E} f^j_lk                                            (2–11)

Multi-commodity flow constraints:

  Σ_{k:(l,k)∈E} f^j_lk − Σ_{k:(k,l)∈E} f^j_kl = 0,  l ≠ sj, l ≠ dj, ∀j ∈ J    (2–12)

  Σ_{j∈J} f^j_lk ≤ Clk,  ∀(l,k) ∈ E                                           (2–13)

  Σ_{k:(l,k)∈E} f^j_lk − Σ_{k:(k,l)∈E} f^j_kl = Dj if l = sj, −Dj if l = dj,  ∀l ∈ V, j ∈ J    (2–14)

  0 ≤ f^j_lk,  ∀j ∈ J, (l,k) ∈ E                                              (2–15)

  0 ≤ Dj                                                                      (2–16)

Temporal constraints:

  Rmax ≤ 1 / (ri · Oj/Dj),  if Com(i) = true and j ∈ Jo, (vt(sj), i) ∈ Et     (2–17)

  td = Rmax · rd,  d ∈ Vtf                                                    (2–18)

  Td ≤ td ≤ rd / (ri · Oj/Dj),  if Com(i) = true and d ∈ Vtf, j ∈ Jo, (vt(sj), i) ∈ Et    (2–19)

  Td · Oj ≤ (rd / ri) Dj,  if Com(i) = true and d ∈ Vtf, j ∈ Jo, (vt(sj), i) ∈ Et         (2–20)

Figure 2-9. BAFS problem formulation in the case of the optimistic model
The problem formulation builds on the multi-commodity flow problem, for
which a variety of efficient solutions exist in the literature [27]. The major differences
between a typical multi-commodity flow problem and this problem formulation are as
follows:
1. The demand of each job is not a constant but a decision variable. It determines how much bandwidth is allocated to a job, i.e., to the communication between producer and consumer actors.
2. The decision variables are constrained by temporal constraints pertaining to the throughputs of actors.
The objective of the linear program (given in Equations 2–1 and 2–11) is to
minimize network resource consumption, which is the total amount of bandwidth
allocated over all edges in the original network topology. If the demands were constant
values, the objective could be regarded as minimizing the average hop count of all jobs
(communication channels), approximating the average hop count as
(total network traffic) / (total demand). However, since the demands are decision variables,
this objective is better thought of as minimizing allocated network resources regardless
of the average hop count of the jobs.
The constraints can be divided into two parts. The first part consists of typical
multi-commodity flow constraints. Equations 2–2 and 2–12, the flow conservation
constraints, mandate that for every job the net flow into a node is zero, i.e., the incoming
and outgoing flows of a node are balanced unless the node is a source or a destination.
Equations 2–3 and 2–13, the capacity constraints, mandate that the flow along any edge
cannot exceed the capacity of the edge. Equations 2–4 and 2–14 ensure that the source
and the destination of any job produce and consume, respectively, the flow of the job, Dj.
The second part concerns temporal constraints, guaranteeing the throughputs
of front-end actors. As discussed earlier, Theorem 2.2 states that the maximum
computation rate is limited by the cycle whose cost-to-time ratio is minimum. Accordingly,
Equations 2–7 and 2–17 account for communication delays on the outgoing edges of a
given node. Since the target applications are acyclic e-Science applications, the cycles
we must consider for the maximum computation rate are limited to self-dependency
loops, where the number of tokens is 1 and the total execution times are ri(τi + Oj/Dj) and
ri · Oj/Dj for the conservative and optimistic models, respectively. The term Oj/Dj accounts
for the communication delays of the outgoing edges of actor i. In addition, since Theorem 2.2
applies to homogeneous SDFs, and considering the conversion from the given ESDFG to a
homogeneous SDF [76], the execution time, (τi + Oj/Dj) or Oj/Dj, must be multiplied by the
firing rate ri, as in Equations 2–7 and 2–17. Since Rmax is the number of iterations per unit
time and the firing rate is the number of firings per iteration, the throughput of a
node d equals Rmax · rd, as in Equations 2–8 and 2–18. Equations 2–8 and 2–18 can be
transformed into Equations 2–9 and 2–19, respectively, since the required throughput is
guaranteed if Rmax · rd is greater than or equal to the Td specified by users. With a
few transformations, the throughput constraints yield Equations 2–10 and 2–20, which
are linear inequalities.
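The self-dependency-loop bounds can be checked numerically. The sketch below, with an assumed per-actor tuple layout, reproduces the Rmax bounds for the two-actor example of Figure 2-5; for the optimistic model, the computation self-loop bound 1/(ri·τi) is included alongside the communication bound of constraint 2–17.

```python
def rmax(actors, model):
    """Upper bound on the iteration rate Rmax implied by the
    self-dependency loops of Section 2.3.2.2.  Each actor is a tuple
    (r, tau, O, D): firing rate, execution time, tokens sent per firing,
    and allocated bandwidth for its outgoing job (O = 0 means no job).
    Illustrative sketch; the tuple layout is our own."""
    bounds = []
    for (r, tau, O, D) in actors:
        if O == 0:                           # no outgoing communication job
            bounds.append(1.0 / (r * tau))
            continue
        comm = O / D                         # transmission delay Oj / Dj
        if model == "conservative":
            # communication folded into the actor's execution time (2-7)
            bounds.append(1.0 / (r * (tau + comm)))
        else:
            # optimistic: computation and the dummy communication actor
            # bound the rate separately (2-17)
            bounds.append(min(1.0 / (r * tau), 1.0 / (r * comm)))
    return min(bounds)

# Figure 2-5: u (rate 1, exec time 1) sends 2 tokens at bandwidth
# 1 token/unit to v (rate 2, exec time 1); communication delay = 2
u, v = (1, 1, 2, 1), (2, 1, 0, 1)
print(rmax([u, v], "conservative"))  # -> 0.333..., one iteration per 3 units
print(rmax([u, v], "optimistic"))    # -> 0.5, one iteration per 2 units
```

The optimistic model's higher rate matches the higher-throughput schedule of Figure 2-5 (c).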
The solution of the linear program determines the optimal bandwidth allocation
on edges; exact schedules and the associated buffer space allocation can then be
computed using the algorithms in [50].
2.4 Experimental Evaluation
To our knowledge, there is no other research work on the BAFS problem in the context of
grid computing. We compare our LP-based algorithm for the BAFS problem with a heuristic
that simply uses the definition of throughput between two actors in the assignment process.
We also compare the two variants of our algorithm, the conservative and the optimistic one.
The heuristic is presented in Algorithm 2-1. It enumerates all paths relevant to the
throughput requirements and computes delays on edges under the assumption
that all edges of a given path have the same delay. If a tighter delay is
required while examining paths, the edge's delay is updated to the tighter value.
For example, in Figure 2-4, the throughput of node 3 depends on two paths: 0 → 2 → 3
and 1 → 2 → 3. Suppose the communication delay on edge (2, 3) must be 2 to achieve
the throughput required by node 3 along the path 0 → 2 → 3, but must be 1 when the
path 1 → 2 → 3 is considered; the communication delay on edge (2, 3) is then updated to 1.
The heuristic considers neither possible parallelism of tasks nor possible balanced
bandwidth allocations on edges, since it computes delays by assuming that all delays on a
path are the same.
We compare the two algorithms in terms of the rejection ratio of requests. The bandwidths
of edges are randomly selected from a uniform distribution between 10 and 1024 units of
data per base unit time. We varied the number of requests from 1 to 16 on the Abilene
network [11] (see Figure 2-10), and each request is a specified task graph (Figure
2-4). The nodes of a request were constrained to have a matching node in the original
network topology graph, and the matching node is randomly assigned using a uniform
distribution.
Algorithm 2-1. A heuristic for the BAFS problem
Input: an ESDFG
1: Enumerate all possible paths from front-end nodes to back-end nodes whose throughputs are specified by the temporal requirements.
2: Initialize the delay on each edge i, edi, to ∞.
3: for each path do
4:   Assume the same delay, d, on all edges of the path.
5:   Compute d satisfying the temporal requirements.
6:   if edi > d then
7:     edi ← d
8:   end if
9: end for
10: Compute the bandwidth on each edge based on edi.
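Algorithm 2-1 can be rendered as a short routine. This is our reading of the pseudocode: step 5 is interpreted as splitting the iteration period 1/Td evenly over the hops of a path, and the final bandwidth is the amount needed to move one data unit within the edge's delay budget; all names are our own.

```python
import math

def bafs_heuristic(paths, throughput_req):
    """Sketch of Algorithm 2-1.  `paths` maps a front-end node to the
    list of paths (lists of edges) feeding it; `throughput_req` maps the
    node to its required throughput Td."""
    edge_delay = {}                          # ed_i, lazily initialized to inf
    for node, node_paths in paths.items():
        period = 1.0 / throughput_req[node]  # time budget per iteration
        for path in node_paths:
            d = period / len(path)           # equal delay on every edge
            for e in path:                   # keep the tighter delay
                edge_delay[e] = min(edge_delay.get(e, math.inf), d)
    # bandwidth on edge e to move one data unit within edge_delay[e]
    return {e: 1.0 / d for e, d in edge_delay.items()}

# node n3 requires throughput 0.5 via a two-hop and a one-hop path
bw = bafs_heuristic({"n3": [[("0", "2"), ("2", "3")], [("1", "3")]]},
                    {"n3": 0.5})
print(bw)  # -> {('0', '2'): 1.0, ('2', '3'): 1.0, ('1', '3'): 0.5}
```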
Figure 2-10. The Abilene network
The experimental results are shown in Figure 2-11. Both of our LP-based
approaches achieve better rejection ratios than the heuristic, lower by 5 to 30%.
[Plot: rejection ratio (%) versus number of requests (1, 2, 4, 8, 16) for the Heuristic, LP-Conservative, and LP-Optimistic approaches.]
Figure 2-11. Rejection ratio vs. number of requests
The drawback of the heuristic is twofold. First, it does not consider the fact that the
schedules of iterations can overlap. Second, it cannot allocate bandwidth to the
links interconnecting nodes according to the current network status, i.e., the currently
available bandwidth on each link. In contrast, the optimistic approach, which assumes
that data transfers can occur in parallel with computational executions, uses the
least amount of network resources to achieve the throughput requirements given by users.
Consequently, the optimistic approach yields the lowest rejection ratio: each request
has a better chance of being accepted because it requires less bandwidth, and
subsequent requests benefit from a less loaded network.
2.5 Summary
The dedicated networks on which e-Science applications operate guarantee
that a given path can have a reserved bandwidth over a given period, which means
the communication delays vary depending on the allocated bandwidths. We develop an
SDF-based model for iterative data-dependent e-Science applications that incorporates
variable communication delays and temporal constraints such as throughput. We
formulate the problem as a variant of multi-commodity flow linear programming with
the objective of minimizing network resource consumption while meeting the temporal
constraints. The resulting solution can then be used to derive buffer space requirements
by previously developed algorithms in the context of DSP applications. Finally,
an illustrative example of an e-Science application shows that the proposed framework
and algorithm are suitable for modeling and analyzing iterative data-dependent e-Science
applications. The simulation results show that the optimal bandwidth allocation produced
by the formulated linear program outperforms the bandwidth allocation of a simple
heuristic in terms of the rejection ratio of requests.
In future work, we will extend our framework so that it also schedules computation jobs
onto distributed computing resources when such mappings are not known ahead of time,
and so that it maximizes the overall throughput when multiple SDFGs with throughput
requirements are given.
CHAPTER 3
TOPOLOGY AGGREGATION
3.1 Overview
The need for transporting large volumes of data in e-Science has been well argued
[33, 78]. For instance, HEP data are expected to grow from the current petabytes
(PB, 10^15 bytes) to exabytes (10^18 bytes) sometime between 2012 and 2015. In addition,
e-Scientists desire schedulable network services to support predictable work processes [46].
Quality of service (QoS) in network applications has been an active research area for several
decades. Recently, new technologies such as multiprotocol label switching (MPLS) and
generalized multiprotocol label switching (GMPLS) have drawn renewed attention to QoS
routing, since they make it possible for network managers to set up and tear
down explicit paths while guaranteeing specified amounts of bandwidth.
The network supporting e-Science applications typically comprises multiple
domains. The domains usually belong to different organizations and are managed
under different operational policies. In such cases, the internal topology of a domain
may not be visible to the others, for security or other reasons; however, aggregated
information about the internal topology and its associated attributes is advertised to the
other domains.
The set of techniques for aggregating the data a domain advertises externally is called
Topology Aggregation (TA), and the aggregated data itself is termed an Aggregated
Representation (AR). A survey of TA algorithms is presented in [98]. There is a
tradeoff between the accuracy and the size of an AR; hence, most algorithms proposed in
prior work try to achieve the most efficient AR in terms of both accuracy and
space complexity.
One can classify QoS path requests into two classes, single-path single-job
(SPSJ) and multiple-path multiple-job (MPMJ), depending on the nature of the requests.
SPSJ corresponds to a situation in which requests for a single QoS path arrive and
are scheduled in order of arrival. In contrast, MPMJ corresponds to batch/off-line
scheduling of multiple requests for multiple QoS paths. Many e-Science applications
require simultaneous transfer of data between multiple sources and destinations, and
each of these requests (e.g., file transfers) can be supported more efficiently by using
concurrent multiple paths.
We show that existing TA approaches developed for SPSJ do not work well with
MPMJ applications as they overestimate the amount of bandwidth that is available. We
propose a max flow based TA approach that is suitable for this purpose. Our simulation
results demonstrate that our algorithms result in better accuracy or less scheduling time.
BGP, the deployed inter-domain routing protocol, is of limited use for AR
techniques, as it is not flexible enough to be extended to accommodate many QoS
parameters; it was originally designed only for distributing reachability
information [107]. Recently, a new network model based on the Path Computation
Element (PCE) has been proposed to overcome these drawbacks of BGP [45]. A PCE is
an entity capable of computing network paths using a traffic engineering database that
contains the required network status information, such as topology and link delays. Recent
papers [80, 85, 93] have based their network models on the PCE-based architecture. We
develop TA algorithms in the context of the PCE-based architecture that support most
e-Science applications. In particular, the following network model is assumed throughout
this chapter.
1. A centralized PCE exists in each domain. A node sends a request to the PCE to make a reservation for a QoS path.
2. Centralized PCEs flood aggregated topology information to one another, so that every centralized PCE maintains a complete, aggregated view of the network outside its own domain.
The first condition states that one active element in a domain acts as a supernode
for that domain, knowing all the information essential for QoS path computation.
One possible implementation is that every node in a domain sends a request for a QoS
Figure 3-1. An example of inter-domain QoS routing
path to the designated centralized PCE; the PCE can therefore maintain consistent
information on the network status related to the QoS parameters. The second condition
can reasonably be assumed in e-Science networks, whose size is very small
compared to the Internet; it enables us to directly apply the QoS routing
algorithms that have been developed so far. In this network architecture, one domain
can advertise its aggregated topology information and associated QoS parameters to all
the other domains.
Based on the described network model, a scenario of inter-domain QoS routing
works as in Figure 3-1.
• STEP 1: A source node sends a path computation request to the single centralized PCE in the same domain.
• STEP 2: The PCE replies with a coarse path, i.e., a sequence of border nodes without the detailed hops between them.
• STEP 3: With the coarse path, the source node sends a path setup request that will traverse the border nodes of the coarse path.
• STEPS 4 and 5: A border node that receives a path setup request obtains a strict path for the coarse path from the PCE in its own domain.
• STEP 6: The same steps repeat until the path setup request reaches the destination node.
TA algorithms can also be used for scheduling paths within a single domain. They
are useful here because a large domain can be partitioned into subdomains and a TA
algorithm applied to each. With ARs of the subdomains, the actual
scheduling may be performed either on a single node with rich compute resources or
on a distributed set of nodes; either way, the time complexity of path scheduling is
reduced by running the scheduling algorithms on the smaller partitioned subdomains.
The rest of the chapter is organized as follows. The related work on TA is described
in Section 3.2. Section 3.3 describes novel algorithms for MPMJ. Section 3.4 describes
how real routing works for TA algorithms, and Section 3.5 gives time and space
complexity comparison analysis. The experimental results by simulation are given in
Section 3.6, and, finally, we conclude in Section 3.7.
3.2 Related Work
TA consists of algorithms and mechanisms for reducing the size of the topological
information and associated attributes of a domain or subdomain while maintaining
a certain level of accuracy. Uludag et al. [98] presented a survey of these algorithms for
multi-domain environments. All TA algorithms have two elements: an aggregated graph,
and aggregated QoS parameter values, called epitomes, assigned to the logical links of the
aggregated graph.
Typical topologies for aggregated graphs are full-mesh, simple compaction,
tree-based, and star-based topologies. Some other topologies, e.g., Shufflenet
[108], have been proposed to reduce space complexity in specific cases such as
asymmetric networks. Most TA algorithms start by building a full-mesh graph, i.e., a
complete graph whose nodes are the border nodes of the original
network. Algorithms that focus on the size of the AR usually try to transform
the full-mesh graph into a more compact form, for example a spanning tree or a star
topology, while trying to preserve the accuracy of the full-mesh AR. For the aggregated
QoS parameter values, the epitomes, the maximum, minimum, or average of the QoS
values is typically used.
TA algorithms for SPSJ in large-scale multi-domain networks focus on the
compaction of ARs; accuracy is a secondary issue. For TA algorithms in small-sized
networks, accuracy has been the main focus [80, 85, 88, 93]. For a single QoS
constraint, a distortion-free algorithm exists [98]. But for two QoS constraints, one
additive and one restrictive, the problem gets more complicated: even though
the problem itself is not intractable, a distortion-free representation is not compact. For
this reason, several approximating algorithms that minimize distortion, such as the line
segment algorithm [68], have been proposed. Usually, the multiple-QoS-constraints
problem is generalized as one restrictive constraint with multiple additive constraints, since a
multiplicative constraint such as link reliability can be transformed into an additive one
through a log operation.
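The log transformation is exact rather than approximate, which is why this generalization is safe. A small sketch:

```python
import math

def to_additive(reliabilities):
    """Multiplicative link metrics (reliabilities in (0, 1]) become
    additive costs via -log: the cost of a path is then the sum of its
    link costs, and the most reliable path is the minimum-cost path."""
    return [-math.log(r) for r in reliabilities]

links = [0.99, 0.95, 0.9]
# the transform is exact: exp(-sum of costs) recovers the product
assert abs(math.exp(-sum(to_additive(links))) - math.prod(links)) < 1e-12
```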
To the best of our knowledge, all existing TA algorithms are limited to routing a single QoS
path at a time, i.e., SPSJ, with a few exceptions of algorithms customized for
special purposes such as the computation of reliable paths. MPMJ applications consider
a batch of jobs at a time, and multiple paths are allowed per job. For instance, a
request for the earliest finish time of a given multiple-source multiple-destination data
transfer, one of the important e-Science applications [46], is handled as a whole,
and multiple paths are set up for the request.
Emerging technologies such as MPLS and GMPLS make it possible to implement
applications with strict QoS requirements on networks equipped with such facilities.
Special-purpose networks, such as the research networks linking national
labs in the U.S., can be set up for this purpose [83]. Especially for inter-domain QoS path
routing in such special-purpose networks, the accuracy of the aggregated topologies and
associated QoS parameter values matters more than the size of the data exchanged
among domains, since the number of domains is small relative to the
Internet, which comprises a huge number of hosts and switches. Thus the need for
more accurate ARs is prominent.
As described above, one of the most recent works on TA for two QoS
constraints is the line segment algorithm for delay-bandwidth sensitive networks [68].
The line segment algorithm first computes 2-D charts, whose x-axis and y-axis are delay
[Figure panels: (a) an example of multi-domain networks, AS1–AS2–AS3; (b) the internal topology of AS2, with border nodes B1 and B2 and internal nodes N1 and N2.]
Figure 3-2. An illustrative example for limitations of the line segment algorithm
and bandwidth, respectively, for every pair of border nodes. The chart contains all the
information needed for computing QoS paths with delay and bandwidth constraints. The
authors of [68] proposed approximating this information by a line to
reduce the size of the data representing all possible delay-bandwidth combinations between
two border nodes; this is possible because the charts take the shape of an increasing
staircase function. The next step is to establish a full-mesh topology and convert it to a
star topology to further reduce the space complexity to O(|B|).
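As an illustration of the approximation step, a staircase of (delay, bandwidth) corner points can be collapsed to a single line. A least-squares fit is used here purely as a stand-in; the exact fitting rule of [68] may differ.

```python
def line_segment(points):
    """Least-squares line through the (delay, bandwidth) corner points
    of a staircase chart, returning (slope, intercept).  Illustrative
    stand-in for the fitting rule of the line segment algorithm."""
    n = len(points)
    sx = sum(d for d, _ in points)
    sy = sum(b for _, b in points)
    sxx = sum(d * d for d, _ in points)
    sxy = sum(d * b for d, b in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n

# three corners of a toy staircase; the fit is exact when collinear
print(line_segment([(1, 1), (2, 2), (3, 3)]))  # -> (1.0, 0.0)
```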
With existing TA algorithms for SPSJ, there is no way to estimate whether more than
one path is available between two border nodes. Consider the multi-domain network in
Fig. 3-2: it consists of three domains, where AS1 is connected to AS3 via AS2.
Suppose that a host in AS1 wants to find max-flow paths, or reliable paths composed
of a primary and a backup path, to a certain host in AS3. If a TA algorithm such as the
line segment algorithm is deployed in this network, the PCE in AS1 computes paths
based on the AR from AS2, which only states how much bandwidth is
available within a certain delay. Since the PCE in AS1 has no clue how many paths exist
internally in AS2, the computed max-flow or reliable paths are not necessarily the most
accurate paths that could be computed from complete network status information.
3.3 TA for Multiple-Path Multiple-Job (MPMJ)
3.3.1 Problem Statement
When it comes to scheduling a batch of multiple jobs while allowing multiple paths,
existing TA techniques are not useful, because the performance degradation can be
significant. For example, the full-mesh AR for bandwidth scheduling, in which each logical
link carries as its epitome the maximum available bandwidth between two border nodes,
is known to be distortion-free for single-path bandwidth scheduling. However,
it may not be effective for multi-path bandwidth scheduling, e.g., max-flow bandwidth
scheduling.
An important class of e-Science applications is bulk file transfers. For example,
in high energy physics, large files are routinely transferred between tiered centers
that are geographically distributed around the world. The generated data have to
be transferred from the sites where they are stored to research centers for analysis or
visualization. In the context of e-Science applications, bandwidth scheduling problems
range from single-source single-destination data transfer optimization to multiple-source
multiple-destination data transfer optimization. The computational complexity of such
problems depends directly on the space complexity of the network topology. Generally,
we can break network resource provisioning for e-Science applications
into an admission control phase and a network resource (i.e., bandwidth) allocation
phase. In the admission control phase, the acceptance of requested jobs is determined;
if a job is accepted, explicit bandwidth allocation for each link is carried out in the
network resource allocation phase. With compact network information abstracted from
the complete network topology, chances are that the network resource allocation phase
may fail due to inaccurate network status information. Even though a request accepted
in the admission control phase may be rejected in the network resource allocation phase
due to inaccurate ARs, the benefits of lower space complexity compensate for failed
operations as long as the error rate is fairly small.
In the following subsections, we propose several TA algorithms suited for MPMJ.
Each request consists of one or more data transfer jobs, and multiple paths are allowed
per job. The only QoS parameter considered is bandwidth.
3.3.2 New Topology Aggregation Algorithms
3.3.2.1 Full-mesh method
The most typical way of aggregating networks with QoS parameters is to build a
full-mesh topology by connecting every pair of nodes of interest and assigning epitomes
to the resulting logical links. Following this conventional approach, we can build a full-mesh AR
in which the max flow value between each pair of nodes is assigned to the corresponding
logical link. Consider the edge connecting nodes D1 and D2 in Fig. 3-3: the epitome
associated with the edge ED1D2, F12, can be computed using any known max flow
algorithm. The algorithm for building a full-mesh AR is described in Algorithm 3-1.
Figure 3-3. Full-mesh AR: nodes of interest D1–D4 connected pairwise by logical links carrying epitomes F12,21, F13,31, F14,41, F23,32, F24,42, and F34,43.
Algorithm 3-1 Full-mesh AR construction
Input: a graph G = (V, E).
1: Pick nodes of interest from the full set of nodes, V, and add them to the aggregated representation.
2: for each pair of picked nodes do
3:    Create a link between the two nodes.
4:    Compute a max flow value between the two nodes.
5:    Assign the computed max flow value as an epitome to the link created above.
6: end for
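As an illustrative sketch, Algorithm 3-1 can be implemented in Python with an Edmonds-Karp max flow routine; the toy topology below (an interior node X joining three nodes of interest) is invented for illustration and is not from the experiments in this chapter.

```python
from collections import deque
from itertools import combinations

def max_flow(cap, s, t):
    """Edmonds-Karp max flow on cap: {u: {v: capacity}} (undirected links listed both ways)."""
    res = {u: dict(vs) for u, vs in cap.items()}        # residual capacities
    for u in list(res):
        for v in list(res[u]):
            res.setdefault(v, {}).setdefault(u, 0)      # ensure reverse arcs exist
    flow = 0
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:                    # BFS for a shortest augmenting path
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        bott, v = float("inf"), t                       # bottleneck along the path
        while parent[v] is not None:
            bott = min(bott, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:                    # augment along the path
            res[parent[v]][v] -= bott
            res[v][parent[v]] += bott
            v = parent[v]
        flow += bott

def full_mesh_ar(cap, interest):
    """Algorithm 3-1: one epitome (max flow value) per pair of nodes of interest."""
    return {(a, b): max_flow(cap, a, b) for a, b in combinations(interest, 2)}

def add_edge(cap, u, v, c):                             # undirected link of capacity c
    cap.setdefault(u, {})[v] = c
    cap.setdefault(v, {})[u] = c

cap = {}
add_edge(cap, "D1", "X", 10)                            # X is an interior (non-interest) node
add_edge(cap, "X", "D2", 7)
add_edge(cap, "X", "D3", 5)
print(full_mesh_ar(cap, ["D1", "D2", "D3"]))
# {('D1', 'D2'): 7, ('D1', 'D3'): 5, ('D2', 'D3'): 5}
```

Each epitome is simply the bottleneck between the pair through X, which is exactly the per-pair information a full-mesh AR records.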
This simple method, adapted from existing TA techniques for SPSJ, quickly turns
out to be inappropriate for MPMJ. Take the example of a job requesting the max flow
between D1 and D2, where D1, D2, D3, and D4 are nodes of interest. Since we consider
Figure 3-4. Star AR: nodes of interest D1–D4 connected to a central logical node L by logical links carrying epitomes F1–F4.
MPMJ in both single-domain and multi-domain network environments, the nodes D1, D2,
D3, and D4 need not be border nodes. The max flow value between D1 and
D2 computed on this AR would be far larger than the correct value, since the underlying
capacity is counted again along other logical paths such as D1 → D3 → D2.
3.3.2.2 Star method
A full-mesh AR does not effectively support MPMJ as the maximum amount of flow
that a specific node can push into a network is not restricted.
For single-path computation algorithms, most recent TA techniques start from
a full-mesh AR and derive variants from it, such as partial full-mesh,
star, and tree ARs. For multiple-path computation algorithms, however, the overestimation
described above prevents a full-mesh AR from being used as the base for
other ARs that are efficient in terms of space complexity.
A star AR as in Fig. 3-4 can overcome the drawbacks of a full-mesh AR by limiting
the max flow value from any node. First, the logical node, L, is created and all nodes
of interest are connected to it. Suppose that four nodes of interest (D1, D2, D3 and D4)
are connected to the central logical node L. The epitome assigned to the logical link
connecting a given node and the central logical node L is the max flow value from that
node to all the remaining nodes. It is easily computed by attaching a supersource node
to the node and a supersink node to all the remaining nodes, and
running a max flow algorithm between the supersource and the supersink. In
this case, F1 is the max flow value that node D1 can send into the network, which is easily
computed by adding a supersink node connected to D2, D3, and D4 and running a max flow
algorithm between D1 and the supersink node. Likewise, we can compute the other
epitomes F2, F3, and F4. This AR has only one outgoing link from each node,
which keeps a node from sending data flow beyond the epitome assigned to its
outgoing link. A formal description of the algorithm is presented in Algorithm 3-2.
Algorithm 3-2 Star AR construction
Input: a graph G = (V, E).
1: Pick nodes of interest from the full set of nodes, V.
2: Create a single logical node, L.
3: for each of the picked nodes do
4:    Create a link between the node and the logical node, L.
5:    Compute a max flow value from the node to all the remaining picked nodes.
6:    Assign the computed max flow value as an epitome to the link created above.
7: end for
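A corresponding sketch of Algorithm 3-2, again with an Edmonds-Karp routine and the same invented toy topology; the supersink node T with infinite-capacity edges is one convenient realization of the construction described above.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max flow on cap: {u: {v: capacity}}."""
    res = {u: dict(vs) for u, vs in cap.items()}
    for u in list(res):
        for v in list(res[u]):
            res.setdefault(v, {}).setdefault(u, 0)      # ensure reverse arcs exist
    flow = 0
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:                    # BFS for an augmenting path
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        bott, v = float("inf"), t
        while parent[v] is not None:
            bott = min(bott, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            res[parent[v]][v] -= bott
            res[v][parent[v]] += bott
            v = parent[v]
        flow += bott

def star_ar(cap, interest):
    """Algorithm 3-2: epitome F_i = max flow from node i to all remaining nodes of
    interest, computed via a supersink T attached with infinite-capacity edges."""
    epitomes = {}
    for node in interest:
        aug = {u: dict(vs) for u, vs in cap.items()}    # copy, then attach supersink
        for other in interest:
            if other != node:
                aug.setdefault(other, {})["T"] = float("inf")
        epitomes[node] = max_flow(aug, node, "T")
    return epitomes

cap = {}  # the same toy topology: interior node X joins D1, D2, D3
for u, v, c in [("D1", "X", 10), ("X", "D2", 7), ("X", "D3", 5)]:
    cap.setdefault(u, {})[v] = c
    cap.setdefault(v, {})[u] = c
print(star_ar(cap, ["D1", "D2", "D3"]))  # {'D1': 10, 'D2': 7, 'D3': 5}
```

Note how the epitome caps each node's total injectable flow: D1 can push at most 10 (its link to X), even though the full-mesh AR would have let it push 7 + 5 through its two logical links.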
3.3.2.3 Partitioned star method
After performing experiments on the star AR, we observed that it shows
little distortion compared to the original topology. However, TA algorithms should also take
into account how results computed on an AR can be translated into actual path setups on
the original topology.
Originally, TA arose from efforts to deal with scalability issues related to
space complexity, and with security issues regarding intradomain topology, in multi-domain
network environments. Usually, routing procedures consist of two steps: (1) path
computation and bandwidth allocation with ARs, and (2) explicit path computation and
bandwidth allocation with original network topology for each domain. Similar steps can
also be applied for single domain network environments, where several subdomains
exist for hierarchical routing or we can intentionally partition one domain into several
logical subdomains. In this case, the benefits from TA are almost the same as those in
multi-domain network environments. In the case of MPMJ applications, however, we can
expect more benefits in terms of computational complexity. A detailed computational
complexity analysis is given in Section 3.5.
The partitioned star method tries to combine the benefits of the star and full-mesh
methods by partitioning a domain into k subdomains, each of which is aggregated
using the star method described above. Fig. 3-5 shows an example of a domain with four
partitioned subdomains. In this chapter, we use general graph partitioning algorithms,
which are widely used in many other areas of computer science, including load distribution
in parallel computers, sparse matrix computations, and the design of very large scale
integrated (VLSI) circuits [58]. The algorithm for building a partitioned star AR is described
in Algorithm 3-3.
Algorithm 3-3 Partitioned star AR construction
Input: a graph G = (V, E) and k, the number of partitions (subdomains).
1: Pick nodes of interest from the full set of nodes, V, and add them to the aggregated representation.
2: Partition the graph into k parts so that the nodes of interest are evenly distributed over the parts.
3: Identify cut nodes and cut edges, and add them to the aggregated representation.
4: for each part do
5:    Construct a star AR with the picked nodes and cut nodes in the part.
6: end for
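Steps 2 and 3 can be sketched as follows; here the partition is supplied by hand rather than computed by a graph partitioner such as those surveyed in [58], and the node names are illustrative only.

```python
def cut_structure(edges, part_of):
    """Identify cut edges (endpoints in different parts) and cut nodes (their
    endpoints), per step 3 of Algorithm 3-3. `part_of` maps node -> partition id."""
    cut_edges = [(u, v) for u, v in edges if part_of[u] != part_of[v]]
    cut_nodes = sorted({n for e in cut_edges for n in e})
    return cut_nodes, cut_edges

# A path A - B - C - D split into two subdomains {A, B} and {C, D}.
edges = [("A", "B"), ("B", "C"), ("C", "D")]
part_of = {"A": 0, "B": 0, "C": 1, "D": 1}
print(cut_structure(edges, part_of))  # (['B', 'C'], [('B', 'C')])
```

Step 5 would then run Algorithm 3-2 within each part, treating the picked nodes and cut nodes of that part as its nodes of interest.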
3.4 Routing
With the network model described in Section 3.1, inter-domain QoS path routing
is relatively easy compared to QoS path routing in a distance vector routing protocol.
Any centralized PCE can compute a path to a destination which consists of a strict
path within its own domain and a coarse inter-domain path to the destination domain.
The coarse inter-domain path is composed of border nodes, and when the path setup
Figure 3-5. Partitioned star AR: a domain partitioned into four subdomains with logical nodes L1–L4; nodes of interest D1–D4 and cut nodes C1–C8 connect to their subdomain's logical node via links carrying epitomes FD1–FD4 and FC1–FC8.
request is received by a border node on the intermediate path, it is translated into a
strict path composed of intra-domain routers or switches. On the other hand, when
the line segment algorithm is deployed, computing a QoS path proceeds in two steps.
First, delay values are assigned to the virtual edges of the aggregated representations
of all domains except the one in which the source node resides; given the bandwidth
requirement of a request, the corresponding delay is computed with the line segment
algorithm. Second, any shortest path algorithm, such as Dijkstra's algorithm, is run
on the resulting graph with delay attributes on its edges.
Inter-domain routing for SPSJ applications is described in Section 3.1. The
routing procedure for MPMJ applications is the same as that for SPSJ applications.
The results of any algorithm, e.g., a maximum bandwidth path algorithm, run on the ARs
are expanded within each domain or subdomain by running the same algorithm on the
original topology of that domain or subdomain. If operations fail in any of the domains
or subdomains, the entire operation will fail. Note that the reason MPMJ applications
in intradomain environments use ARs of subdomains is to reduce the time complexity
of scheduling, whereas SPSJ or MPMJ in interdomain environments are forced to use
ARs for security or administrative reasons. The benefits of using ARs in intradomain
environments from the perspective of time complexity will be described in Section 3.5.
Table 3-1. Time complexity for MPMJ

  Method             Time complexity
  Full-mesh          O(n^3 D^2)
  Star               O(n^3 D)
  Partitioned star   O((n/k)^3 (C + D))

  D = number of nodes of interest; C = number of cut nodes; k = number of partitions.
3.5 Complexity Analysis
Usually, algorithms for MPMJ have higher computational complexity than
algorithms for SPSJ. Dijkstra's algorithm can be used to derive the maximum bandwidth
path between two nodes, which corresponds to the single-path max flow. The
complexity of Dijkstra's algorithm is O(n log n + m), where n is the number of vertices
and m is the number of edges. In contrast, the complexity of the push-relabel max flow
algorithm is O(n^3) [27]. This shows that algorithms for MPMJ may incur computational
costs a few orders of magnitude higher than those for SPSJ.
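The maximum bandwidth path computation mentioned above can be sketched as a Dijkstra variant that maximizes the bottleneck bandwidth rather than minimizing path length; the graph and bandwidth values below are invented for illustration.

```python
import heapq

def widest_path(adj, s, t):
    """Maximum-bandwidth (widest) path via a Dijkstra variant: a node's label is the
    best bottleneck bandwidth found from s, and the node with the largest label is
    expanded first. adj: {u: {v: bandwidth}}; undirected links listed both ways."""
    best = {s: float("inf")}
    heap = [(-best[s], s)]                   # max-heap via negated labels
    while heap:
        neg_w, u = heapq.heappop(heap)
        w = -neg_w
        if w < best[u]:
            continue                         # stale heap entry
        if u == t:
            return w
        for v, bw in adj[u].items():
            cand = min(w, bw)                # bottleneck if we extend the path to v
            if cand > best.get(v, 0):
                best[v] = cand
                heapq.heappush(heap, (-cand, v))
    return 0                                 # t unreachable from s

adj = {"A": {"B": 5, "C": 3}, "B": {"A": 5, "C": 10}, "C": {"A": 3, "B": 10}}
print(widest_path(adj, "A", "C"))  # 5, via A -> B -> C (the direct link offers only 3)
```

With a binary heap this runs in O(m log n), matching the single-path cost cited above up to the heap variant used.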
The time complexities of the TA algorithms for MPMJ are summarized in Table 3-1. The
full-mesh method requires O(n^3 D^2) time, and the star and partitioned star methods require
O(n^3 D) and O((n/k)^3 (C + D)) time, respectively, where D is the number of nodes of interest,
C is the number of cut nodes, and k is the number of partitions. The complexity of the max
flow algorithm is assumed to be O(n^3), and the number of partitions in the partitioned
star method is given as k.
The space complexities are summarized in Table 3-2. The space complexities of the
ARs for the full-mesh, star, and partitioned star methods are O(D^2), O(D), and O(C + D),
respectively. Suppose that a certain algorithm for MPMJ applications takes O(n^3) time. If we
run the algorithm on the ARs, it takes O((C + D)^3) time on the AR itself plus k · O((n/k)^3)
time for explicit routing in the partitions. Since (C + D) and k are small compared to n,
and n^3 exceeds (n/k)^3 by a factor of k^3, we can expect that the
Table 3-2. Space complexity for MPMJ

  Method             Space complexity
  Full-mesh          O(D^2)
  Star               O(D)
  Partitioned star   O(C + D)

  D = number of nodes of interest; C = number of cut nodes.
partitioned star method can expedite the path computation and bandwidth allocation
process significantly.
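As a back-of-the-envelope illustration of this speedup, with assumed sizes (n = 300 nodes, k = 4 partitions, and C + D = 20; these values are our own, not from the experiments):

```python
n, k, CD = 300, 4, 20                       # assumed sizes: n nodes, k partitions, C + D = 20
direct = n ** 3                             # O(n^3) MPMJ algorithm on the full topology
partitioned = CD ** 3 + k * (n // k) ** 3   # run on the AR, then expand in each partition
print(round(direct / partitioned, 1))       # 15.9
```

Under these assumptions the partitioned star approach saves roughly a factor of 16 in operation count, dominated by the k^3 reduction of the per-partition term.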
3.6 Experimental Evaluation
3.6.1 Bulk File Transfers in E-Science
We chose a bulk file transfer application in [65] as a typical MPMJ e-Science
application to show that our proposed algorithms perform better than naive algorithms
adapted from SPSJ TA algorithms. In [65], the authors formulated the in-advance
scheduling of multiple bulk file transfers as a linear programming problem. We adapted
their linear programming formulation to on-demand scheduling of multiple bulk file
transfers for our simulation. The linear programming formulation is shown in Figure
3-6. The notations and equations are borrowed from [65] whenever possible. In this
formulation, t_f denotes the time by which all file transfers complete. The objective of
this linear programming problem is to find the earliest finish time. f^j_{lk} is the amount of
file transferred for request j ∈ F on link (l, k) ∈ E, and b_{lk} is the bandwidth available on
link (l, k). Equation 3-3 ensures that for each transfer request j ∈ F and each node l
that is neither the source nor the destination node, the amount of file j that leaves node
l equals the amount that enters it. Equation 3-4 requires the source node of
request j to send a net f_j units of file j out and requires the destination node to receive
a net f_j units. Equation 3-5 ensures that the amount of traffic on each link does not
exceed its available capacity in the interval [0, t_f). Equation 3-6 ensures that
file transfer amounts are non-negative.
minimize   t_f                                                              (3-1)
subject to                                                                  (3-2)
   \sum_{k:(l,k)\in E} f^j_{lk} - \sum_{k:(k,l)\in E} f^j_{kl} = 0,
       \forall j \in F, \forall l \in V, l \neq s_j, l \neq d_j             (3-3)
   \sum_{k:(l,k)\in E} f^j_{lk} - \sum_{k:(k,l)\in E} f^j_{kl} =
       \begin{cases} f_j, & \text{if } l = s_j \\ -f_j, & \text{if } l = d_j \end{cases},
       \forall j \in F                                                      (3-4)
   \sum_{j\in F} f^j_{lk} \le b_{lk} \cdot t_f, \quad \forall (l,k) \in E   (3-5)
   f^j_{lk} \ge 0                                                           (3-6)
Figure 3-6. Earliest finish time on-line scheduling of multiple file transfers
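As a concrete illustration of the formulation in Figure 3-6, the sketch below solves a made-up instance with scipy.optimize.linprog: a single 10-unit transfer from A to B over links A→B, A→C, and C→B, each with 5 units of bandwidth. The variables are x = (f1, f2, f3, t_f); Equation 3-5 is rearranged as Σ_j f^j_{lk} − b_{lk}·t_f ≤ 0 so it stays linear in x. The instance, node names, and sizes are our own, and SciPy is assumed to be available.

```python
from scipy.optimize import linprog

# Variables x = [f1, f2, f3, tf]: flow on A->B, A->C, C->B, and the finish time.
c = [0, 0, 0, 1]                       # minimize tf                       (Eq. 3-1)
A_eq = [[0, 1, -1, 0],                 # conservation at C: f2 - f3 = 0    (Eq. 3-3)
        [1, 1,  0, 0]]                 # source A emits f1 + f2 = 10       (Eq. 3-4)
b_eq = [0, 10]
A_ub = [[1, 0, 0, -5],                 # f1 <= 5*tf                        (Eq. 3-5)
        [0, 1, 0, -5],                 # f2 <= 5*tf
        [0, 0, 1, -5]]                 # f3 <= 5*tf
b_ub = [0, 0, 0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)  # f, tf >= 0 by default
print(round(res.fun, 6))               # 1.0: both paths saturate at 5 units/time
```

Here both the direct link and the two-hop path carry 5 units each, so the 10-unit transfer finishes at t_f = 1.0; using the direct link alone would have taken 2.0.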
3.6.2 Experiment Testbed
For the TA algorithms for MPMJ, we performed experiments on random single-domain
networks. Random network topologies are generated by the BRITE internet
topology generation package [72]. We tried several models, such as Waxman and BRITE,
but the results show similar trends across models; therefore, we report only results
for random topologies following the Waxman model with an average node degree of 4.
The bandwidth of each edge is randomly selected from a uniform distribution between
10 and 1024. The number of nodes in each domain is varied from 100 to 300 in increments
of 50. The nodes of interest are picked randomly within a domain, and their number
ranges from 1 to 16, doubling at each step. We generated a synthetic set of data transfer
requests, each described by the 3-tuple (source node, destination node, requested file
transfer size). The number of requests is randomly selected between 1 and the maximum
possible number of requests determined by the number of nodes of interest. For example,
if the number of nodes of interest is 4, the maximum possible number of requests is 4 × 3. The
source and destination nodes for each request are randomly selected using a uniform
random number generator. The results are averaged over 100 random networks for a
certain number of nodes.
3.6.3 Performance Metrics
The performance metric we use to compare the different approaches is the earliest
finish time (EFT) to complete all of the given data transfer requests. One would expect
a good AR approach to perform close to the original topology. Hence, we use the error
ratio (ER), which measures the deterioration from the correct EFT computed on the
original topology; a TA algorithm with a lower ER performs better. ER is formally defined as

ER = (TA EFT − Original EFT) / Original EFT.
3.6.4 Results
We measured ER according to the equation defined in Section 3.6.3. The
computational times taken for each algorithm are also gathered to show how much
computation cost reduction we can get from the compact representation. Fig. 3-7 shows
that the star and partitioned star methods give around 5% ER. This is because the
EFT application tends to find and allocate all the available bandwidth in the network,
and the star and partitioned star ARs limit that bandwidth in much the same way as the
original network does. In addition, we observed that as the number of requests
increases, the ER improves because all the network resources, i.e., the bandwidths, are
eventually used up. As expected, the full-mesh AR performs worst.
Our simulation results in Fig. 3-8 also show that the star method is comparable
in accuracy to the partitioned star method but significantly faster.
This is not only because the star method is a more compact representation, but also
Figure 3-7. Error ratio (%) vs. the number of nodes, for the full-mesh, star, and partitioned star methods.
Figure 3-8. Normalized computational time vs. the number of source and destination nodes, for the original topology and the full-mesh, star, and partitioned star methods.
partly because, for randomly generated networks, the number of cut nodes is relatively
large. If a domain is structured as a backbone with attached networks, the number of cut
nodes can be reduced to a reasonable value, which would enhance the performance of the
partitioned star method.
3.7 Summary
We propose several topology aggregation algorithms for e-Science networks.
E-Science applications require higher-quality intradomain and interdomain QoS
paths, and some of these are distinct from classic single-path single-job (SPSJ)
applications. We define a new class of requests, called multiple-path multiple-job
(MPMJ), and propose TA algorithms for the new class of applications. The proposed
algorithms, star and partitioned star ARs, are shown to be significantly better than naive
approaches. In particular, the star AR shows the best performance in terms of computational
time, and its accuracy is very close to that of scheduling on the entire topology.
Thus, it is well suited for multi-domain e-Science applications.
CHAPTER 4
WORKFLOW SCHEDULING
4.1 Overview
An application scientist typically solves his/her problem as a series of transformations.
Each transformation may require one or more inputs and may generate one or more
outputs. The inputs and outputs are predominantly files. However, we expect that files
will be replaced by databases for many applications. The sequence of transformations
required to solve a problem can be effectively modeled as a Directed Acyclic Graph
(DAG) for many practical applications of interest that this chapter is targeting.
Figure 4-1 describes a DAG consisting of 17 nodes, representing dependencies
among 17 tasks of an application. For example, the arc from task E to task B represents
the fact that the output generated by task E is utilized by task B. Each task involves
transforming the data or storing an intermediate result for archiving. The time requirement
for solving the entire DAG for large scale applications may be of the order of hours to
days, even assuming that each task is executed on a cluster of workstations
or a parallel supercomputer. DAGs have been widely used for the development of
scheduling algorithms in the computer science literature [38, 105].
The general form of DAG scheduling has been shown to be NP-hard [91], and a
number of heuristics have been proposed. Early research regarded communication
costs as small [26, 39] or assumed a very simple interconnection network model,
i.e., a fully-connected network model without contention [59, 60, 79, 97, 99]. The
heterogeneous-earliest-finish-time (HEFT) algorithm extended for heterogeneous
computing resources was proposed in [97].
For the distributed applications that we are targeting, the amount of data that
needs to be transferred between tasks may be of the order of hundreds of gigabytes to
multiple terabytes. Thus, the key challenge is to be able to schedule a workflow such
that the total execution time and the communication costs are minimized. The former
Figure 4-1. A DAG consisting of 17 nodes, representing dependencies among 17 tasks of an application. For example, the arc from task E to task B represents the fact that the output generated by task E is utilized by task B.
requires mapping the tasks to appropriate machines while the latter requires the use
of high bandwidth networks and effective scheduling of the communication bandwidth.
The past research on scheduling DAGs (e.g. [38, 105]) is generally limited to solving
compute intensive problems. In contrast, we will propose new algorithms to map tasks
that have large data access requirements onto distributed heterogeneous clusters and
supercomputers. For many applications, a node in the task graph can also represent
multiple concurrent and interacting subtasks. If these subtasks are mapped to multiple
machines, the required interaction has to be mapped on to the underlying network to
support this interaction. For such DAGs, the precedence is between sets of multiple
subtasks.
The extended list scheduling algorithms in [92] and [90], targeted at heterogeneous
cluster architectures, address this network contention issue with various priority-assignment
schemes, and assume that the path between any two processors is determined and fixed
by the target system using conventional algorithms such as breadth-first search
(BFS). On the other hand, similar work on DAG scheduling has been done in the
literature of grid computing. Generally, the term workflow is used interchangeably
with DAG in the context of grid computing. A taxonomy of previous work on the
workflow scheduling problem in grid computing was presented in [102]. The goal of
the scheduling algorithms is to map the tasks and subtasks of all the applications on
the grid such that the resources are effectively utilized, while the quality of service
guarantees given to an application are respected. In this chapter, the actual formulation
of the optimization goals will be presented. The network resource mapping can have the
following characteristics:
1. Rigid vs. malleable: Rigid mapping is fixed-bandwidth mapping over the time period of a data transfer, whereas malleable mapping allows for variable bandwidth. If there are no quality-of-service requirements such as a constant data rate, malleable mapping is a viable option for utilizing network resources efficiently, since solutions can be flexible as long as the total amount of data transmitted over time meets the data transmission requirement.

2. Single path vs. multiple paths: For many transfers, multiple paths can be used effectively to reduce the transfer time. However, finding a set of multiple paths requires more computation time, so efficient algorithms are needed.

3. Static vs. dynamic paths: In static mapping, the paths determined at the start of a data transmission do not change until the transmission ends, while in dynamic path mapping, paths can change over time.
Recent work on workflow scheduling in optical grids can provision network
resources dynamically with guaranteed bandwidth [66, 67, 69, 95, 101].
According to the above taxonomy, those methods use rigid, single-path mapping,
and the paths are assumed to be static.
The workflow scheduling problem can be classified into in-advance and on-demand
scheduling depending on the reservation start time. If the reservation start time is the
same as the job request arrival time, it is on-demand scheduling. On the other hand,
the reservation start time is equal to or later than the job request arrival time in case of
in-advance scheduling. On-demand scheduling can also be regarded as a special case
of in-advance scheduling in which the reservation start time equals the job request arrival
time. In this chapter, we solve the in-advance workflow scheduling problem in e-Science
networks, which are a mix of IP networks and optical networks. Our framework supports
in-advance reservation and provides malleable mapping and dynamic paths. Further, it
is able to exploit multiple paths and is applicable to both heterogeneous and homogeneous
resources, especially network resources.
4.2 Workflow Scheduling in E-Science Networks
We develop workflow scheduling algorithms for e-Science networks. A real network
topology and workflows are given as inputs to the workflow scheduling algorithms. The
network topology is represented by a network resource graph, in which a
node denotes a resource, such as a compute resource or a router/switch that only forwards
network traffic, and an edge denotes a physical link between two nodes. A workflow is
represented by a task graph (DAG), in which a node denotes a task associated with a
resource type and amount, and a directed edge connecting two nodes denotes a producer/consumer
relation between them, i.e., a required data transfer from a source node to a destination node.
A task in a task graph is executed only once, and the execution order complies with the
precedence constraints defined by the task graph. In this chapter, the goal of the workflow
scheduling algorithms is to map each node (task) and each edge (data transfer) in a task graph
onto a node and onto dynamic multiple paths, respectively, in a network resource graph. Mapping
a node (task) in a task graph onto a node in a network resource graph implies
that the task is not splittable and that the mapping does not vary over time, although the amount
of resource allocated to it may vary over time. In contrast, mapping an edge (data
transfer) in a task graph onto dynamic multiple paths in a network resource graph means
that the data transfer is fulfilled by multiple paths that vary over time. The time model we
assume is the uniform time slice model, which discretizes the timeline into time
slices of uniform length. More detailed and formal definitions are given in the following
sections.
4.2.1 System Model and Data Structure
4.2.1.1 Time model
The uniform time slice model is represented by τ and M, where τ is the size of a
time slice and M is the maximum number of time slices the system considers. The
start and end times of time slice m are denoted by T_m and T_{m+1}, respectively.
4.2.1.2 Network resource model
A network resource model is represented by G_n = (V, E, r, TR, TB), where V and
E are the sets of nodes and edges, respectively, r_v denotes the resource type of a
node v, and TR_v and TB_e denote the data structures tracking the resource availability of node
v and edge e, respectively, over time. More specifically, we use time-resource (TR) and
time-bandwidth (TB) arrays as the data structures for managing resource availability over
time. A TR or TB array is a set of entries a_m, where m is the index of a time slice and a_m is
the available amount of a resource over the period [T_m, T_{m+1}). These data structures
are necessary for effective in-advance reservation of resources. A TR array and a TB
array are structurally identical; the only difference is that a TB array represents the network
resource type, whereas all other resource types are represented by TR arrays. Thus, a TB
array is assigned to each edge and a TR array is assigned to each node in a network resource graph.
Figure 4-2 shows an example of a network resource graph. Each node represents
one resource, and is associated with a resource type and a TR array, which tracks the
resource availability over time. In Figure 4-2, nodes V1 through V3 are of resource type
1, and nodes V4 and V5 are of resource type 2. We can assign a unique number to a
different resource type excluding the network resource. For example, resource type
1 is pure compute resource and resource type 2 is database service resource. Each
edge represents a physical link connecting two nodes, and is associated with a TB array,
which tracks the network resource availability over time.
Figure 4-2. An example of a network resource graph: nodes V1, V2, V3 (resource type 1) and V4, V5 (resource type 2), each with a TR array; edges V1V3, V2V3, V3V4, and V3V5, each with a TB array.
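These data structures can be sketched in Python as follows; the capacities, the number of slices M, and the reservation helper are illustrative assumptions rather than part of the formal model.

```python
from dataclasses import dataclass, field

M = 8  # number of uniform time slices tracked by the system (assumed value)

@dataclass
class Node:
    rtype: int                   # resource type (e.g. 1 = compute, 2 = database service)
    tr: list = field(default_factory=lambda: [4.0] * M)   # TR array: available units per slice

@dataclass
class Edge:
    tb: list = field(default_factory=lambda: [10.0] * M)  # TB array: available bandwidth per slice

# The graph of Figure 4-2: V1..V3 are resource type 1, V4/V5 are type 2.
nodes = {f"V{i}": Node(rtype=1 if i <= 3 else 2) for i in range(1, 6)}
edges = {("V1", "V3"): Edge(), ("V2", "V3"): Edge(),
         ("V3", "V4"): Edge(), ("V3", "V5"): Edge()}

def reserve_bandwidth(edge, start, end, amount):
    """In-advance reservation: decrement TB entries over slices [start, end) if feasible."""
    if all(edge.tb[m] >= amount for m in range(start, end)):
        for m in range(start, end):
            edge.tb[m] -= amount
        return True
    return False
```

A reservation of 6 units on link V1V3 for slices 0-2 succeeds once (leaving 4 units per slice) and then fails on a second attempt, which is exactly the bookkeeping an in-advance scheduler needs.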
4.2.1.3 Workflow model
A workflow can be represented by a task graph, which is a directed graph and
formally defined as Gt = (N,L, r,RN,RL,ST ,Deadline). N and L represent a node set
and an edge set, respectively. ri denotes the resource type of node Ni . RNi denotes the
required amount of resource for node N_i, and RL_i denotes the required amount of data
transfer between the end nodes of edge L_i. We assume all resource capacities are
normalized with respect to a base capacity, so required or available amounts of
resources can be expressed as rational multiples of the base capacity. ST
is the start time of a workflow, which has to be taken into consideration for in-advance
scheduling. Deadline is an optional parameter. If it is given, we may set the optimization
objective to minimizing network resource consumption. Otherwise, we may set the
optimization objective to minimizing the makespan of the workflow. Figure 4-3 shows an
example of a task graph. Resource requirement and resource type are associated with
each node, and only network resource requirement property is associated with each
edge.
4.2.2 Problem Statement
We solve in-advance workflow scheduling problems in e-Science networks, which
are a mix of IP networks and optical networks. Even though optical networks, where
a physical link carries multiple wavelengths, have inherently integral bandwidth,
Figure 4-3. An example of a task graph: node N1 (resource type 1, resource requirement 3) and node N2 (resource type 2, resource requirement 5), with an edge requiring 10 units of data transfer.
we assume that the bandwidth of a network resource graph is infinitely divisible. The
application of the algorithms developed in this chapter to optical networks is left to future
work. We develop our algorithms for two broad cases:
1. Single workflow: In this case, a single workflow is scheduled based on theavailable (fractional) resources. It is assumed that the previous workflowshave already been scheduled and the goal is to optimize the performancecharacteristics of a single workflow.
2. Multiple workflows: In this case, multiple workflows are scheduled simultaneously. The expectation is that this will achieve better performance than scheduling one workflow at a time.
For both cases, the goals of our algorithms can be minimization of either network
resource consumption or makespan (finish time). For simplicity, when deadlines
for workflows are given, we set the objective to minimization of network resource
consumption. Otherwise we set the objective to minimization of finish time. Therefore
we have four problems in total: (1) minimization of network resource consumption for a
single workflow, (2) minimization of finish time for a single workflow, (3) minimization of
network resource consumption for multiple workflows, and (4) minimization of finish time
for multiple workflows.
4.2.3 Construction of an Auxiliary Graph
We translate the workflow scheduling problem into a network flow problem. The
multicommodity flow problem, which optimizes the cost of multiple commodities with
different source and destination nodes flowing through the network, is a well-known
network flow problem. To formulate the workflow scheduling problem as a multicommodity
flow problem, we first have to construct an auxiliary graph from the given network
resource graph and task graph. The workflow scheduling problem comprises two
mapping problems onto a network resource graph: a node mapping problem and an edge
mapping problem. The goal of constructing an auxiliary graph is to convert the node
mapping problem into an edge mapping problem since the multicommodity flow problem
can deal with only an edge mapping problem.
An illustrative example of the auxiliary graph corresponding to Figures 4-2 and 4-3
is shown in Figure 4-4. An auxiliary graph GA = (VA,EA,TBA) is constructed as follows.
First, we expand the network resource graph by duplicating each node and connecting
from the original one to the duplicated one. For convenience, let’s call the original one a
frontend node, and the duplicated one a backend node. For example, in Figure 4-4, the
node V1 is expanded into two nodes, V1′ and V1′′, and a new edge connecting these two
nodes is inserted with the associated TB array corresponding to V1’s TR array. In this
case, V1′ is a frontend node and V1′′ is a backend node. Obviously, this expansion is to
convert a resource allocation problem into a network flow problem. The original topology
of the network resource graph remains unchanged among the backend nodes of the
expanded graph, as in Figure 4-4. Note that nodes with no chance of being selected
need not be expanded.
Second, we expand the task graph in the same way as we did the network resource
graph. But we do not create any edge connecting nodes in the expanded task graph.
Lastly, we interconnect the expanded network resource graph and the expanded task
graph.
As mentioned above, two kinds of flows are needed for problem conversion from a
general workflow scheduling problem to a network flow problem. One is the resource
allocation flow for the purpose of resource allocation of each task (node) in a task graph.
For example, in Figure 4-4, N1′ is connected to all possible frontend nodes of the same
resource type of N1 in the expanded network resource graph. Similarly, N1′′ is connected
to the backend nodes corresponding to those frontend nodes. Thus, by constraining the flow from
N1′ to N1′′, which demands N1's resource requirement, to take only a single path, we
can solve the resource allocation problem for each task (node).
The other is the data transfer flow, which models data transfers between tasks. These flows are seamlessly modeled as multiple flows with different source and destination nodes in a typical multicommodity flow problem. The source node of a data transfer flow is set to the backend node corresponding to a source task in the task graph, and the destination node is set to the backend node corresponding to a destination task. For instance, a data transfer requirement of 10 units between N1 and N2 in a task graph is modeled by a flow of 10 units of data between N1′′ and N2′′.
The auxiliary graph accounts for the situation where two tasks are mapped onto the same resource, in which case the communication cost between them should be ignored. Since the bandwidth of the interconnecting edges between the expanded network resource graph and the expanded task graph is set to infinity, the communication cost of tasks mapped onto the same resource is effectively zero. Suppose that all tasks and resources in Figure 4-4 are of the same type and that N1 and N2 are mapped onto V1. Then the data flow between N1 and N2 follows the path N1′′ → V1′′ → N2′′, which is composed only of edges with infinite bandwidth.
The space complexity of an auxiliary graph is summarized as follows: $|V_A| = 2(|V| + |N|)$ and $|E_A| = |E| + |V| + 2|N||V|$.
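The expansion and interconnection steps above can be sketched in a few lines of Python (an illustrative sketch under simplifying assumptions: all tasks and resources are of a single type, so every task connects to every resource; the function and variable names are ours, not the dissertation's):

```python
# Sketch of the auxiliary-graph construction of Section 4.2.3 (illustrative only;
# names and data structures are our own, not from the dissertation's implementation).

def build_auxiliary_graph(resource_nodes, resource_edges, task_nodes):
    """Expand each node v into a frontend v' and a backend v'',
    then interconnect the expanded resource and task graphs.

    resource_nodes: dict node -> capacity (a stand-in for the TB array)
    resource_edges: list of (u, v, bandwidth)
    task_nodes:     list of task names
    Returns (nodes, edges) where edges maps (u, v) -> bandwidth, with
    float('inf') marking zero-cost interconnection edges.
    """
    INF = float('inf')
    nodes, edges = [], {}
    # 1) Expand the network resource graph: v' -> v'' carries v's capacity.
    for v, cap in resource_nodes.items():
        nodes += [v + "'", v + "''"]
        edges[(v + "'", v + "''")] = cap
    # The original topology is preserved among backend nodes.
    for u, v, bw in resource_edges:
        edges[(u + "''", v + "''")] = bw
    # 2) Expand the task graph (no edges inside the expanded task graph).
    for t in task_nodes:
        nodes += [t + "'", t + "''"]
    # 3) Interconnect: every task frontend feeds every resource frontend, and
    #    every resource backend feeds every task backend, at infinite bandwidth.
    for t in task_nodes:
        for v in resource_nodes:
            edges[(t + "'", v + "'")] = INF
            edges[(v + "''", t + "''")] = INF
    return nodes, edges

nodes, edges = build_auxiliary_graph({'V1': 10, 'V2': 5}, [('V1', 'V2', 3)], ['N1', 'N2'])
# |V_A| = 2(|V| + |N|) and |E_A| = |E| + |V| + 2|N||V|, matching the counts above.
assert len(nodes) == 2 * (2 + 2) and len(edges) == 1 + 2 + 2 * 2 * 2
```

The final assertion checks the space-complexity formulas against the toy instance of two resource nodes, one resource edge, and two tasks.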
4.3 MILP Formulation
The single or multiple workflow scheduling problem can be formulated as a mixed
integer linear programming (MILP) problem, which is a variant of a multicommodity
flow problem. The objective of the MILP problem can be minimizing the finish time or
minimizing the total network resource consumption depending on whether the deadlines
Figure 4-4. An example of an auxiliary graph
for workflows are given or not. If a deadline is not imposed on a workflow, the user who requests the workflow job wants the job done as fast as possible. If a deadline is imposed, the user is satisfied as long as the deadline is met, which allows the system to utilize resources more efficiently in exchange for a delayed completion.
The constraints of the MILP problem are composed of four parts: 1) multi-commodity flow constraints, 2) task assignment constraints, 3) precedence constraints, and 4) deadline constraints. Since we have transformed the workflow scheduling problem into a multi-commodity flow problem, the typical multi-commodity flow constraints remain valid; additional multi-commodity flow constraints are added to account for malleable resource allocation. Second, the task assignment constraints are integer constraints that enforce that one task is mapped to only one resource node in the network topology graph. Third, the precedence constraints ensure that the precedence constraints of a workflow are obeyed. Finally, the deadline constraints apply only when deadlines for workflows are given. The notation for the MILP formulation is listed in Table 4-1.
Table 4-1. Notation for problem formulation

Function:
  pred(v)              Returns the set of predecessors of node v.

Constant or Set:
  J                    $\{(s_j, d_j, F_j) \mid s_j, d_j \in V_A,\; 0 \le j < |N| + |L|\}$, a set of jobs, where $s_j$ and $d_j$ are the source and destination nodes of job j and $F_j$ is the required amount of flow (resource) for job j. One job corresponds to either a node or an edge of a workflow.
  $J_c$                A set of communication jobs, $J_c \subset J$.
  $J_{nc}$             A set of non-communication jobs, $J_{nc} \subset J$, $J = J_c \cup J_{nc}$.
  $P_j$                A set of allowed paths for job $j \in J$ (from $s_j$ to $d_j$).
  $G_A$                An auxiliary graph, $(V_A, E_A, TB_{E_A})$.
  $E_{inf}$            A set of edges with infinite available bandwidth, $E_{inf} \subset E_A$.
  $b_{lk}(m)$          The available bandwidth on edge (l, k) during time slice m in an auxiliary graph $G_A$.
  $T_m$                The start time of time slice m.
  $T_{m+1}$            The end time of time slice m.
  M                    The number of time slices to be considered.
  N                    The number of workflows in the case of multiple workflow scheduling.
  $Inc_v$              The set of edges incident on node v.
  $D / D_n$            Deadline of a workflow / workflow n.
  $WST / WST_n$        Start time of a workflow / workflow n.
  Q                    Very large number.
  $Q_t$                Tolerance.

Variable:
  $T_f / T_f^n$                  Finish time of a workflow / workflow n.
  $f_{lk}^j(m) / f_{lk}^{jn}(m)$ The amount of flow transferred (resource allocated) for job $j \in J$ on link (l, k) during time slice m for a workflow / workflow n. This variable is defined only when the type of the job is the same as that of the edge.
  $f_p^j(m) / f_p^{jn}(m)$       The amount of flow transferred (resource allocated) for job $j \in J$ on path $p \in P_j$ during time slice m for a workflow / workflow n.
  $x_{lk}^j / x_{lk}^{jn}$       0 or 1; 1 if edge $(l, k) \in Inc_{s_j}$ is selected for job $j \in J$ and 0 otherwise, for a workflow / workflow n.
  $y_{lk}^j / y_{lk}^{jn}$       0 or 1; 1 if edge $(l, k) \in Inc_{d_j}$ is selected for job $j \in J$ and 0 otherwise, for a workflow / workflow n.
  $z_m^j / z_m^{jn}$             0 or 1; 1 if job $j \in J$ is allocated in time slice m and 0 otherwise, for a workflow / workflow n.
  $ST_j / ST_j^n$                Start time of a job $j \in J$ in a workflow / workflow n.
  $END_j / END_j^n$              End time of a job $j \in J$ in a workflow / workflow n.
The tasks and the data transfers among them are all mapped to jobs in a multicommodity flow problem. A job in the formulation is described as a three-tuple (s, d, F), where s and d denote the source and destination nodes of the job and F denotes the required amount of flow (resource). ST and END, the start and end times of the job, are determined by workflow scheduling algorithms. The resource type does not have to be included in this tuple since a flow is forced to take a link of the same resource type due to the carefully chosen connecting edges between the network resource graph and the task graph. Three kinds of binary decision variables are introduced: x, y, and z. The discrete nature of the problem is due to the fact that a task cannot be split and that we have discrete time intervals to accommodate jobs. The binary decision variables $x_{lk}^j$ and $y_{lk}^j$ determine which resource is allocated to a non-split task. For a job j corresponding to a task in a task graph, the flow of the job can take only one outgoing edge from the frontend node of the task and only one incoming edge into the backend node of the task in the auxiliary graph. These constraints reflect the non-split property of a task. $z_m^j$ indicates whether time slice m is used for job j. These binary decision variables can easily be extended to the multiple workflow scheduling problem by using separate variables for each workflow.
4.3.1 Single Workflow
The complete formulation is presented in Figure 4-5. First of all, the problem can be optimized for either the minimum finish time, $T_f$, or the minimum network resource consumption, $\sum_{j \in J} \sum_{(l,k) \in E_A} \sum_{m=0}^{M-1} f_{lk}^j(m)$, as in Expression 4–1. Minimizing network resource consumption can help save resources for future arriving requests so that more requests can be accepted in the long term.
4.3.1.1 Multi-commodity flow constraints
The flow conservation rule at nodes other than the source and destination nodes is ensured by Constraint 4–2. The amount of flow (resource) to be allocated (reserved) is ensured by Constraint 4–3. Constraint 4–4 ensures that the amount of total flow on link (l, k) during time slice m does not exceed the maximum possible amount of flow during that time slice, which is given by $b_{lk}(m) \times (T_{m+1} - T_m)$, where $b_{lk}(m)$
Objective

$\text{minimize } T_f \text{ or } \sum_{j \in J} \sum_{(l,k) \in E_A} \sum_{m=0}^{M-1} f_{lk}^j(m)$   (4–1)

Multi-commodity flow constraints

$\sum_{k:(l,k) \in E_A} f_{lk}^j(m) - \sum_{k:(k,l) \in E_A} f_{kl}^j(m) = 0, \quad \forall j \in J, \forall l \in V_A, 0 \le m < M, l \ne s_j, l \ne d_j$   (4–2)

$\sum_{m=0}^{M-1} \Big( \sum_{k:(l,k) \in E_A} f_{lk}^j(m) - \sum_{k:(k,l) \in E_A} f_{kl}^j(m) \Big) = \begin{cases} F_j, & \text{if } l = s_j \\ -F_j, & \text{if } l = d_j \end{cases}, \quad \forall l \in V_A, \forall j \in J$   (4–3)

$\sum_{j \in J} f_{lk}^j(m) \le b_{lk}(m) \times (T_{m+1} - T_m), \quad \forall (l,k) \in E_A, 0 \le m < M$   (4–4)

$0 \le f_{lk}^j(m) \le b_{lk}(m) \times (T_{m+1} - T_m) \times z_m^j, \quad \forall j \in J, \forall (l,k) \in E_A \setminus E_{inf}, 0 \le m < M$   (4–5)

$ST_j \le T_m \times z_m^j + (1 - z_m^j) Q, \quad \forall j \in J, 0 \le m < M$   (4–6)

$END_j \ge T_{m+1} \times z_m^j, \quad \forall j \in J, 0 \le m < M$   (4–7)

Task assignment constraints

$\sum_{(l,k) \in Inc_{s_j}, l = s_j} x_{lk}^j = 1, \quad \forall j \in J$   (4–8)

$\sum_{(l,k) \in Inc_{d_j}, k = d_j} y_{lk}^j = 1, \quad \forall j \in J$   (4–9)

$\frac{\sum_{m=0}^{M-1} f_{lk}^j(m)}{Q} \le \begin{cases} x_{lk}^j, & \text{if } (l,k) \in Inc_{s_j} \\ y_{lk}^j, & \text{if } (l,k) \in Inc_{d_j} \end{cases}, \quad \forall j \in J$   (4–10)

Precedence constraints

$ST_j = WST, \text{ if } pred(j) = \emptyset, \forall j \in J$   (4–11)
$ST_j \le END_j, \quad \forall j \in J$   (4–12)
$END_p \le ST_j, \text{ if } p \in pred(j), \; p, j \in J$   (4–13)
$END_j \le T_f, \quad \forall j \in J$   (4–14)
$y_{lk}^i = x_{lk}^j, \text{ if } i \in pred(j), i \in J_{nc}, j \in J_c$   (4–15)
$y_{lk}^i = y_{lk}^j, \text{ if } i \in pred(j), i \in J_c, j \in J_{nc}$   (4–16)

Deadline constraints (optional)

$T_f \le D$   (4–17)

Figure 4-5. Single workflow scheduling problem formulation via network flow model
is the available bandwidth in time slice m, and $T_{m+1}$ and $T_m$ are the end time and start time of time slice m. Constraint 4–5 ensures that if time slice m is not used for job j, the amount of flow for job j during that time slice is 0. Note, however, that Constraint 4–5 should not be imposed on edges with infinite available bandwidth, as no cost is required for communications between tasks assigned to the same resource; otherwise, a time slice would be allocated for such communications. Constraint 4–6 ensures that if time slice m is used for job j, the start time of job j is at most $T_m$, the start time of the time slice. If multiple time slices are chosen for job j, this constraint forces $ST_j$ to be less than or equal to the start time of the earliest such time slice, which complies with the definition of $ST_j$. Similarly, Constraint 4–7 ensures that the end time of job j is greater than or equal to the end time of any time slice m in which the job is scheduled.
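The effect of the big-M Constraints 4–6 and 4–7 can be checked with a small Python calculation (a toy illustration with made-up slice boundaries, not part of the formulation):

```python
# Illustration of Constraints 4-6 and 4-7 (our own toy check, not solver code):
# with a big constant Q, ST_j <= T_m * z_m + (1 - z_m) * Q is vacuous when
# z_m = 0 and pins ST_j below the earliest chosen slice start when z_m = 1;
# symmetrically, END_j >= T_{m+1} * z_m pins END_j above the latest chosen
# slice end.

Q = 10**6                     # "very large number" from Table 4-1
T = [0, 5, 10, 15, 20]        # slice boundaries: slice m spans [T[m], T[m+1])
z = [0, 1, 1, 0]              # job j occupies slices 1 and 2

# Tightest ST_j / END_j consistent with the constraints:
ST = min(T[m] * z[m] + (1 - z[m]) * Q for m in range(len(z)))
END = max(T[m + 1] * z[m] for m in range(len(z)))

assert ST == 5    # start of the earliest chosen slice
assert END == 15  # end of the latest chosen slice
```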
4.3.1.2 Task assignment constraints
The second part of the constraints reflects the non-split property of tasks. Constraints 4–8 and 4–9 ensure that only one resource among the possible candidate resources is assigned to each task. Constraint 4–10 relates the discrete selection of a resource to the flow decision variables: only if a resource is chosen for a job j can there be a flow on the related links.
4.3.1.3 Precedence constraints
After transforming all the tasks and data transfers into jobs in a network flow problem, we must ensure that the precedence constraints inherent in a task graph are also embedded in the network flow problem. Accordingly, Constraint 4–11 ensures that the start time of jobs with no precedent jobs is set to the start time of the workflow. Constraint 4–12 ensures that the end time of a job is greater than or equal to its start time. Constraint 4–13 ensures that the start time of a job is not before the end times of its precedent jobs. Constraint 4–14 ensures that all job end times are less than or equal to the global finish time $T_f$. Constraints 4–15 and 4–16 ensure that data transfers between tasks occur between the chosen resources.
4.3.1.4 Deadline constraints
Constraint 4–17 is optional depending on whether we have deadlines on workflows
or not.
4.3.2 Multiple Workflows
The formulation for the single workflow scheduling problem can easily be extended to the multiple workflow scheduling problem by using separate variables for each workflow, as in Figure 4-6.

We can set the objective of the multiple workflow scheduling problem formulation to minimizing either the total sum of the makespans of all workflows or the total network resource consumption of all workflows, as in Expression 4–18. The first term of Expression 4–18 is the total sum of the makespans of all workflows. Although we could optimize the finish time of the whole set of workflows by directly applying the objective of the single workflow scheduling formulation, Expression 4–1, minimizing the overall finish time may not contribute to the efficient resource scheduling of workflows whose timelines finish far ahead of the overall finish time. For this reason, we choose to minimize the total sum of makespans. There remains, however, a concern that this objective cannot achieve balanced optimization of the makespan of every workflow. Suppose that each workflow is issued by a different user. From the perspective of the whole system, this objective achieves balanced scheduling among workflows, but from the perspective of users, the makespan of a particular workflow may be sacrificed to achieve the minimum total sum of makespans by reducing the makespans of other workflows.
4.3.3 Time Complexity
The time complexity of an MILP problem depends on the number of decision variables and the number of constraints. To formally analyze the number of decision
Objective

$\text{minimize } \sum_{n=0}^{N-1} (T_f^n - WST^n) \text{ or } \sum_{n=0}^{N-1} \sum_{j \in J} \sum_{(l,k) \in E_A} \sum_{m=0}^{M-1} f_{lk}^{jn}(m)$   (4–18)

Multi-commodity flow constraints

$\sum_{k:(l,k) \in E_A} f_{lk}^{jn}(m) - \sum_{k:(k,l) \in E_A} f_{kl}^{jn}(m) = 0, \quad \forall j \in J, \forall l \in V_A, 0 \le m < M, 0 \le n < N, l \ne s_j, l \ne d_j$   (4–19)

$\sum_{m=0}^{M-1} \Big( \sum_{k:(l,k) \in E_A} f_{lk}^{jn}(m) - \sum_{k:(k,l) \in E_A} f_{kl}^{jn}(m) \Big) = \begin{cases} F_j, & \text{if } l = s_j \\ -F_j, & \text{if } l = d_j \end{cases}, \quad \forall l \in V_A, \forall j \in J, 0 \le n < N$   (4–20)

$\sum_{n=0}^{N-1} \sum_{j \in J} f_{lk}^{jn}(m) \le b_{lk}(m) \times (T_{m+1} - T_m), \quad \forall (l,k) \in E_A, 0 \le m < M$   (4–21)

$0 \le f_{lk}^{jn}(m) \le b_{lk}(m) \times (T_{m+1} - T_m) \times z_m^{jn}, \quad \forall j \in J, \forall (l,k) \in E_A \setminus E_{inf}, 0 \le m < M, 0 \le n < N$   (4–22)

$ST_j^n \le T_m \times z_m^{jn} + (1 - z_m^{jn}) Q, \quad \forall j \in J, 0 \le m < M, 0 \le n < N$   (4–23)

$END_j^n \ge T_{m+1} \times z_m^{jn}, \quad \forall j \in J, 0 \le m < M, 0 \le n < N$   (4–24)

Task assignment constraints

$\sum_{(l,k) \in Inc_{s_j}, l = s_j} x_{lk}^{jn} = 1, \quad \forall j \in J, 0 \le n < N$   (4–25)

$\sum_{(l,k) \in Inc_{d_j}, k = d_j} y_{lk}^{jn} = 1, \quad \forall j \in J, 0 \le n < N$   (4–26)

$\frac{\sum_{m=0}^{M-1} f_{lk}^{jn}(m)}{Q} \le \begin{cases} x_{lk}^{jn}, & \text{if } (l,k) \in Inc_{s_j} \\ y_{lk}^{jn}, & \text{if } (l,k) \in Inc_{d_j} \end{cases}, \quad \forall j \in J, 0 \le n < N$   (4–27)

Precedence constraints

$ST_j^n = WST^n, \text{ if } pred(j) = \emptyset, \forall j \in J, 0 \le n < N$   (4–28)
$ST_j^n \le END_j^n, \quad \forall j \in J, 0 \le n < N$   (4–29)
$END_p^n \le ST_j^n, \text{ if } p \in pred(j), \; p, j \in J, 0 \le n < N$   (4–30)
$END_j^n \le T_f^n, \quad \forall j \in J, 0 \le n < N$   (4–31)
$y_{lk}^{pn} = x_{lk}^{jn}, \text{ if } p \in pred(j), p \in J_{nc}, j \in J_c, 0 \le n < N$   (4–32)
$y_{lk}^{pn} = y_{lk}^{jn}, \text{ if } p \in pred(j), p \in J_c, j \in J_{nc}, 0 \le n < N$   (4–33)

Deadline constraints (optional)

$T_f^n \le D_n$   (4–34)

Figure 4-6. Multiple workflow scheduling problem formulation via network flow model
variables and the number of constraints, we first define the following variables. $n_n$ and $m_n$ denote the number of nodes and edges, respectively, of a network resource graph. $n_t$ and $m_t$ denote the number of nodes and edges, respectively, of a workflow. $n_J$ denotes the number of jobs. If $n_A$ and $m_A$ represent the number of nodes and edges, respectively, of the auxiliary graph, then $n_A$ equals $2(n_n + n_t)$ and $m_A$ equals $m_n + n_n + 2 n_t n_n$, as described in Section 4.2.3. Table 4-2 shows the number of variables and constraints of the single workflow scheduling problem formulation. The flow variables f consist of flow variables of communication jobs and flow variables of non-communication jobs. The first part is accounted for by $(2n_n + m_n) \cdot m_t \cdot M$ because we need to consider flows on the network resource graph ($m_n$) and on the interconnecting edges between the network resource graph and the task graph related to a job ($2n_n$). The second part is accounted for by $(3n_n \cdot n_t) \cdot M$ because we need to consider flows only on the interconnecting edges between the network resource graph and the task graph and on the edges between frontend and backend nodes in $G_A$. For simplicity, assume that the network resource graph is fixed, as in the experiments of Section 4.6, where only the size of workflows is varied. Then the decision variable f dominates the number of decision variables, and the number of f variables is proportional to $(m_t + n_t) \cdot M$. As shown in Table 4-2, the number of constraints is also proportional to $(m_t + n_t) \cdot M$.
4.4 LP Relaxation
As the experimental results will show, the running time of MILP for workflow scheduling increases exponentially as the number of nodes of a workflow grows. The general workaround for solving the MILP problem fast enough to be useful in practice is linear programming relaxation, which transforms binary variables into real variables ranging between 0 and 1. We can turn the solution of the linear programming relaxation of the MILP problem into an approximate solution of the MILP problem via
Table 4-2. Single workflow scheduling formulation time complexity analysis

Variable/Constraint              Number of variables/constraints
f                                $((2n_n + m_n) \cdot m_t + 3 n_n \cdot n_t) \cdot M$
x                                $n_J \cdot n_n = (m_t + n_t) \cdot n_n$
y                                $n_J \cdot n_n = (m_t + n_t) \cdot n_n$
z                                $n_J \cdot M = (m_t + n_t) \cdot M$
ST                               $n_J = (m_t + n_t)$
END                              $n_J = (m_t + n_t)$
Constraints 4–2, 4–3             $(n_t \cdot (2n_n + 2) + m_t \cdot (n_n + 2)) \cdot M$
Constraint 4–4                   $(m_n + n_n) \cdot M$
Constraint 4–5                   $(n_t \cdot n_n + m_t \cdot (m_n + n_n)) \cdot M$
Constraints 4–6, 4–7             $n_J \cdot M = (m_t + n_t) \cdot M$
Constraints 4–8, 4–9             $n_J = (m_t + n_t)$
Constraint 4–10                  $n_J \cdot n_n = (m_t + n_t) \cdot n_n$
Constraints 4–11, 4–12, 4–14     $n_J = (m_t + n_t)$
Constraint 4–13                  $2 m_t$
Constraints 4–15, 4–16           $2 m_t \cdot n_n$
techniques such as rounding. We propose an LP relaxation (LPR) algorithm for the workflow scheduling problem, consisting of two steps:

• First, we determine which resources are selected for the tasks (nodes) of a task graph.

• Second, we iteratively determine the start and end times of jobs along with network resource allocations for data transfer jobs.
The detailed operations of the first-step algorithm are described in Algorithm 4-1. The goal of the first step is to determine the mapping of resources other than the network, i.e., the related binary variables x and y. In the original MILP formulation, Constraints 4–8 and 4–9 ensure that exactly one x/y variable in each constraint becomes 1. Hence, we can turn the solution of the LP relaxation problem into a solution of the original MILP problem by picking the variable with the maximum value, setting it to 1, and setting all the others to 0. In this step, we ignore the z variables, which are related to time slice assignment.
Algorithm 4-1 First step - Determination of the mapping of tasks except data transfers
Input: A network topology graph $G_n$ and a workflow $G_t$
1: Relax all the binary variables of the MILP problem, i.e., the x, y, and z variables.
2: Solve the LP relaxation of the MILP problem.
3: For the x and y variables, find the maximum relaxed variable among the relaxed variables whose total sum must be 1, set it to 1, and set all other variables to 0.
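Step 3 of Algorithm 4-1, the rounding of the relaxed x/y variables, might look like the following Python sketch (the fractional values are hypothetical stand-ins for an LP solver's output):

```python
# Rounding step of Algorithm 4-1, sketched in plain Python (hypothetical data;
# a real implementation would read these values from the LP solver's solution).

def round_assignment(relaxed):
    """Given relaxed x (or y) values for one job, which sum to 1 by
    Constraint 4-8/4-9, set the largest to 1 and the rest to 0."""
    best = max(relaxed, key=relaxed.get)
    return {edge: 1 if edge == best else 0 for edge in relaxed}

# Fractional LP solution for a job's outgoing edges from its frontend node:
x_relaxed = {('N1f', 'V1f'): 0.7, ('N1f', 'V2f'): 0.2, ('N1f', 'V3f'): 0.1}
x_rounded = round_assignment(x_relaxed)
assert x_rounded == {('N1f', 'V1f'): 1, ('N1f', 'V2f'): 0, ('N1f', 'V3f'): 0}
```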
With the solution of the first-step algorithm, we can determine the start and end times of jobs by iteratively solving small MILP problems over the unscheduled jobs. The basic idea is that finding a solution to the MILP problem with determined x/y binary variables and undetermined z binary variables for a small number of jobs, e.g., 3, takes little time. Thus, we can divide the problem into many small problems and solve them sequentially. To pick appropriate jobs, we use the same bottom-level priority scheme as the heuristic; in our case, however, the node mapping is already determined. The detailed operations of the second-step algorithm are described in Algorithm 4-2.
Algorithm 4-2 Second step - Determination of the mapping of network resources
Input: A network topology graph $G_n$ and a workflow $G_t$ with the fixed resource mapping obtained from Algorithm 4-1
1: while there are network jobs with unfixed end times do
2:   Pick 3 non-communication jobs and their associated communication jobs.
3:   Solve the MILP problem that has only those jobs, with the related z variables as the only binary variables.
4:   Update the start and end times of jobs affected by the solution.
5: end while
As for LP, the computation time is proportional to $p^2 q$ when $q \ge p$, where p is the number of decision variables and q is the number of constraints. The decision variable f is
dominant in the number of decision variables. Suppose that the network resource graph is fixed, as in the experiments of Section 4.6, where only the size of workflows is varied. To address the fast-growing running time with respect to the size of a workflow, we choose another form of the multicommodity flow formulation. There are two kinds of LP formulations for the multicommodity flow problem: the node-arc form and the edge-path form. The MILP formulation in Figure 4-5 takes the node-arc form, which assigns a separate decision variable to a given job on a given link. In contrast, the edge-path form assigns a separate decision variable to a given job on a given path in a set of paths, P, which the job can take. Accordingly, if we limit the number of paths in the set P, we can reduce the number of decision variables, which leads to better performance in terms of time complexity at the cost of solution accuracy. In [81], the authors showed that an edge-path formulation for bulk file transfers can lead to a near-optimal solution with reasonable time complexity by using a limited number of pre-defined paths. The edge-path form of the single workflow scheduling problem formulation is presented in Figure 4-7. We refer to this edge-path based LP relaxation as LPREdge for the rest of this chapter.
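A limited path set $P_j$ of this kind could be built as follows (a sketch; we enumerate simple paths in BFS order and keep the first k, which approximates the pre-defined-path idea but is not necessarily the procedure used in [81]):

```python
# Building the limited path set P_j for the edge-path form (a sketch; we
# enumerate simple paths in BFS order, shortest first, and keep the first k).

from collections import deque

def k_paths(adj, src, dst, k):
    """Return up to k simple src->dst paths, fewest hops first."""
    paths, queue = [], deque([[src]])
    while queue and len(paths) < k:
        path = queue.popleft()
        if path[-1] == dst:
            paths.append(path)
            continue
        for nxt in adj.get(path[-1], []):
            if nxt not in path:          # keep paths simple (no revisits)
                queue.append(path + [nxt])
    return paths

adj = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
P = k_paths(adj, 'A', 'D', k=2)
assert P == [['A', 'B', 'D'], ['A', 'C', 'D']]
```

Limiting k trades solution accuracy for the reduction in the number of $f_p^j(m)$ variables discussed above.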
The time complexity analysis for the edge-path form formulation is summarized in Table 4-3. Compared to the original formulation, the number of variables and constraints is much reduced. In particular, the time complexity of the edge-path form formulation is much less influenced by the size of the network resource graph, i.e., $m_n$ and $n_n$.
4.5 List Scheduling Heuristic
The extended list scheduling algorithm with the bottom-level priority scheme achieves the best performance among priority schemes such as the top-level priority scheme [92]. Even though the authors of [67] tried to enhance performance by considering the properties of a pipelined task graph, their new algorithm does not make much difference in the case of random workflows. Their results show that the new and classic algorithms produce almost the same makespans for workflows with up
Objective

$\text{minimize } T_f \text{ or } \sum_{j \in J} \sum_{p \in P_j} \sum_{m=0}^{M-1} f_p^j(m)$   (4–35)

Multi-commodity flow constraints

$\sum_{0 \le m < M} \sum_{p \in P_j} f_p^j(m) = F_j, \quad \forall j \in J$   (4–36)

$\sum_{j \in J} \sum_{p \in P_j, (l,k) \in p} f_p^j(m) \le b_{lk}(m) \times (T_{m+1} - T_m), \quad \forall (l,k) \in E_A, 0 \le m < M$   (4–37)

$0 \le f_p^j(m) \le b_{lk}(m) \times (T_{m+1} - T_m) \times z_m^j, \quad \forall j \in J, \forall p \in P_j, (l,k) \in p, 0 \le m < M$   (4–38)

$ST_j \le T_m \times z_m^j + (1 - z_m^j) Q, \quad \forall j \in J, 0 \le m < M$   (4–39)

$END_j \ge T_{m+1} \times z_m^j, \quad \forall j \in J, 0 \le m < M$   (4–40)

Task assignment constraints

$\sum_{(l,k) \in Inc_{s_j}, l = s_j} x_{lk}^j = 1, \quad \forall j \in J$   (4–41)

$\sum_{(l,k) \in Inc_{d_j}, k = d_j} y_{lk}^j = 1, \quad \forall j \in J$   (4–42)

$\frac{\sum_{m=0}^{M-1} f_p^j(m)}{Q} \le \begin{cases} x_{lk}^j, & \text{if } (l,k) \in Inc_{s_j} \\ y_{lk}^j, & \text{if } (l,k) \in Inc_{d_j} \end{cases}, \quad \forall j \in J, \forall p \in P_j, (l,k) \in p$   (4–43)

Precedence constraints

$ST_j = WST, \text{ if } pred(j) = \emptyset, \forall j \in J$   (4–44)
$ST_j \le END_j, \quad \forall j \in J$   (4–45)
$END_p \le ST_j, \text{ if } p \in pred(j), \; p, j \in J$   (4–46)
$END_j \le T_f, \quad \forall j \in J$   (4–47)
$y_{lk}^i = x_{lk}^j, \text{ if } i \in pred(j), i \in J_{nc}, j \in J_c$   (4–48)
$y_{lk}^i = y_{lk}^j, \text{ if } i \in pred(j), i \in J_c, j \in J_{nc}$   (4–49)

Deadline constraints (optional)

$T_f \le D$   (4–50)

Figure 4-7. Edge-path form of single workflow scheduling problem formulation
Table 4-3. Edge-path form single workflow scheduling formulation time complexity analysis

Variable/Constraint              Number of variables/constraints
f                                $(k \cdot m_t + n_n \cdot n_t) \cdot M$
x                                $n_J \cdot n_n = (m_t + n_t) \cdot n_n$
y                                $n_J \cdot n_n = (m_t + n_t) \cdot n_n$
z                                $n_J \cdot M = (m_t + n_t) \cdot M$
ST                               $n_J = (m_t + n_t)$
END                              $n_J = (m_t + n_t)$
Constraint 4–36                  $n_J \cdot M = (m_t + n_t) \cdot M$
Constraint 4–37                  $(m_n + n_n) \cdot M$
Constraint 4–38                  $k \cdot n_J \cdot M = k \cdot (m_t + n_t) \cdot M$
Constraints 4–39, 4–40           $n_J \cdot M = (m_t + n_t) \cdot M$
Constraints 4–41, 4–42           $n_J = (m_t + n_t)$
Constraint 4–43                  $n_J \cdot n_n = (m_t + n_t) \cdot n_n$
Constraints 4–44, 4–45, 4–47     $n_J = (m_t + n_t)$
Constraint 4–46                  $2 m_t$
Constraints 4–48, 4–49           $2 m_t \cdot n_n$

Here k denotes the limited number of pre-defined paths allowed per job.
to 60 tasks, and the new algorithm performs at most 5-10% better for workflows with 80 to 100 tasks. Hence, we adapt the general extended list scheduling algorithm with the bottom-level priority scheme to our random workflows.
The direct application of the extended list scheduling algorithm proposed in [92] does not fit e-Science networks well, in three respects. First, the algorithm of [92] allows links on a path to be available in different time periods, as long as descendant links become available after the precedent links of the path. This assumption requires buffers at the ends of links and the intervention of moderators controlling the start and end of data transfer on each link. Second, [92] does not consider in-advance reservation, which means that only the bandwidth available at the time the request is made is taken into account for path computation. The extended list scheduling algorithm adapted for e-Science networks is described in Algorithm 4-3.
The changes necessary for adapting to in-advance workflow reservations in e-Science networks concern the computation of data transfer time as part of computing the earliest finish time. The assumption of synchronized availability of the links on a path is reasonable in e-Science networks, but in-advance reservations pose another challenge. In the case of on-demand reservations, we can compute the data transfer time simply as the amount of data divided by the maximum available bandwidth of a path, where the maximum available bandwidth of a path is the minimum over the maximum available bandwidths of its links. Third, the available bandwidth varying over time due to the nature of in-advance reservation requires careful handling of the data transfer time. We assume rigid mapping for the extended list scheduling, which means the allocated bandwidth of a path does not change over time. To find the data transfer finish time, we use the simple heuristic described in Algorithm 4-4. We refer to the extended list scheduling algorithm adapted for e-Science networks as LS for the rest of this chapter.
Algorithm 4-3 The adapted extended list scheduling algorithm
Input: A network resource graph $G_n$ and a workflow $G_t$
1: Determine the priorities of all nodes in $G_t$ based on the bottom-level priority scheme.
2: Order the nodes with respect to priority while complying with precedence constraints.
3: for each node in the ordered list, in decreasing order of priority do
4:   Find the resource node that allows the earliest finish time among all candidate nodes by virtually scheduling all incoming data transfers. // Network paths between two nodes are predetermined by BFS.
5: end for
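Step 1 of Algorithm 4-3, the bottom-level priority computation, can be sketched as follows (assuming the common definition of bottom level as the longest node-weighted path to an exit node; the exact cost model of [92] may differ):

```python
# Bottom-level priority computation used in step 1 of Algorithm 4-3 (a sketch
# under the common convention: a node's bottom level is the length of the
# longest path from it to an exit node, counting node weights).

def bottom_levels(succ, weight):
    """succ: node -> list of successors; weight: node -> computation cost."""
    memo = {}
    def bl(v):
        if v not in memo:
            memo[v] = weight[v] + max((bl(s) for s in succ[v]), default=0)
        return memo[v]
    return {v: bl(v) for v in succ}

succ = {'N1': ['N2', 'N3'], 'N2': ['N4'], 'N3': ['N4'], 'N4': []}
weight = {'N1': 2, 'N2': 3, 'N3': 1, 'N4': 4}
bl = bottom_levels(succ, weight)
# Nodes are then scheduled in decreasing bottom-level order.
assert bl == {'N4': 4, 'N2': 7, 'N3': 5, 'N1': 9}
```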
4.6 Experimental Evaluation
4.6.1 Experiment Setup
We compare the performance of four algorithms, the optimal MILP algorithm, the LP relaxation algorithm, the edge-path form LP relaxation algorithm, and the list
Algorithm 4-4 Data transfer finish time computation algorithm
Input: A network resource graph $G_n$ and a data transfer specified by (source, destination, amount of data, start time)
1: for each time slice, in increasing order of time slice start times, whose end time is greater than or equal to the transfer start time do
2:   // A basic interval refers to the time period within which the available bandwidth of the links of a path is constant.
3:   AllocBW ← the available bandwidth of the time slice
4:   RemainingData ← the amount of data to transfer
5:   CurTimeSlice ← the time slice
6:   FinishTime ← the start time of the data transfer
7:   while RemainingData > 0 do
8:     if CurTimeSlice has at least AllocBW available bandwidth then
9:       RemainingData ← RemainingData − the amount of data transferred in the current time slice
10:      Update FinishTime
11:    else
12:      exit the while loop
13:    end if
14:    CurTimeSlice ← the next time slice
15:  end while
16:  if RemainingData = 0 then
17:    return FinishTime
18:  end if
19: end for
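Algorithm 4-4 can be rendered as runnable Python under the rigid-mapping assumption (the slice list and bandwidth values below are made up for illustration):

```python
# A runnable rendering of Algorithm 4-4 under the rigid-mapping assumption
# (slice data and bandwidths are hypothetical, for illustration only).

def transfer_finish_time(slices, data, start_time):
    """slices: list of (start, end, available_bw) sorted by start time.
    Tries each candidate starting slice in order; the allocated bandwidth is
    fixed to that slice's availability and must be sustained in every later
    slice the transfer spans. Returns the finish time, or None on failure."""
    for i, (s, e, bw) in enumerate(slices):
        if e < start_time:
            continue                     # slice ends before the transfer starts
        alloc, remaining, t = bw, data, max(s, start_time)
        for s2, e2, bw2 in slices[i:]:
            if bw2 < alloc:
                break                    # rigid allocation cannot be sustained
            begin = max(s2, t)
            sent = min(remaining, alloc * (e2 - begin))
            remaining -= sent
            t = begin + sent / alloc
            if remaining == 0:
                return t
    return None

slices = [(0, 10, 2), (10, 20, 5), (20, 30, 1)]
# 30 units starting at t=0: alloc=2 holds over [0,10) and [10,20);
# 20 units are sent by t=10 and the remaining 10 by t=15.
assert transfer_finish_time(slices, 30, 0) == 15.0
```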
scheduling heuristic of Section 4.5, in terms of the makespan, i.e., the schedule length of workflows, and the computational time of the algorithms. In the following, we refer to the MILP algorithm, the LP relaxation algorithm, the edge-path form LP relaxation algorithm, and the general extended list scheduling algorithm of Section 4.5 as MILP, LPR, LPREdge, and LS, respectively.

We first compare the performance of all four algorithms on workflows with a small number of nodes, 3. This experiment compares the non-optimal algorithms against the optimal algorithm. We then compare two algorithms, LPREdge and LS, on workflows with a larger number of nodes, ranging from 10 to 50
with an increment of 10. The second experiment verifies that our algorithm performs better than the heuristic algorithm in terms of makespan.
As the network resource graph, we use the Abilene network [11] (see Figure 4-8), which is deployed in practice. The resource capacities of the nodes of the network resource graph, as well as the bandwidth capacities of the edges, are randomly selected from a uniform distribution between 10 and 1024. For workflow generation, we can either generate workflows randomly [26, 28, 42, 59, 66] or synthesize them from a set of pre-determined workflows [60]. In our experiments, we use a random workflow generation method that depends on three parameters: the number of nodes, the average degree of nodes, and the communication-to-computation ratio (CCR). The number of nodes is varied according to the aforementioned experiments. The average degree of nodes is related to the level of parallelism of workflows and is fixed at 2. CCRs of 0.1, 1, and 10 are used to assess the impact of the communication factor on the performance of the algorithms; a larger CCR means a workflow is more data-intensive. The weights of the nodes of a workflow are randomly selected from a uniform distribution between 10 and 1024, in the same way the resource capacities of the Abilene network are determined. The weights of the edges of a workflow are then set to the CCR times a value drawn from the uniform distribution between 10 and 1024. One hundred trials were run for every combination of workflow parameters: the number of nodes, the CCR, and the chosen algorithm. We then averaged the results and plotted charts for performance evaluation.
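A minimal generator in this spirit might look as follows (our own sketch; the dissertation specifies only the three parameters, so details such as the acyclicity scheme are assumptions):

```python
# A minimal random workflow generator in the spirit of Section 4.6.1 (our own
# sketch; the exact generator used in the experiments is not specified beyond
# its three parameters).

import random

def random_workflow(n_nodes, avg_degree, ccr, lo=10, hi=1024, seed=0):
    rng = random.Random(seed)
    node_w = {i: rng.uniform(lo, hi) for i in range(n_nodes)}
    edges = {}
    n_edges = max(0, round(avg_degree * n_nodes / 2))
    while len(edges) < n_edges:
        u, v = sorted(rng.sample(range(n_nodes), 2))  # u < v keeps the graph acyclic
        edges[(u, v)] = ccr * rng.uniform(lo, hi)     # edge weight scaled by CCR
    return node_w, edges

nodes, edges = random_workflow(n_nodes=10, avg_degree=2, ccr=10)
assert len(nodes) == 10 and len(edges) == 10
assert all(u < v for (u, v) in edges)                 # DAG by construction
```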
Even though we present formulations for both single and multiple workflow scheduling, we have conducted experiments for single workflow scheduling only, because every multiple workflow scheduling instance can be transformed into a single large workflow scheduling instance.
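The transformation of several workflows into one instance amounts to a disjoint union of their DAGs, e.g. (the node renaming scheme is our own):

```python
# Merging multiple workflows into one scheduling instance (a sketch of the
# transformation mentioned above; the node renaming scheme is our own).

def merge_workflows(workflows):
    """workflows: list of (node_weights, edges) pairs with local node ids.
    Prefix each node with its workflow index so that the union is one big DAG
    with several connected components."""
    nodes, edges = {}, {}
    for n, (w, e) in enumerate(workflows):
        for v, wt in w.items():
            nodes[f"w{n}:{v}"] = wt
        for (u, v), wt in e.items():
            edges[(f"w{n}:{u}", f"w{n}:{v}")] = wt
    return nodes, edges

wf_a = ({'N1': 5, 'N2': 7}, {('N1', 'N2'): 3})
wf_b = ({'N1': 2}, {})
nodes, edges = merge_workflows([wf_a, wf_b])
assert set(nodes) == {'w0:N1', 'w0:N2', 'w1:N1'}
assert edges == {('w0:N1', 'w0:N2'): 3}
```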
As the MILP/LP solver, we used CPLEX, a popular commercial software package. The machines on which CPLEX was installed have the following specification: a 2 GHz dual-core AMD Opteron 280 processor and 7 GB of memory.
Figure 4-8. The Abilene network
4.6.2 Results
We evaluate the performance of workflow scheduling algorithms with regard to two
metrics: schedule length of workflows, i.e., makespan, and computational (running) time.
The detailed results with explanation are presented in this subsection.
4.6.2.1 Schedule length of workflows
Comparison against optimal scheduling results. Since the optimal schedules for randomly generated workflows on the given network resource graph, the Abilene network, are not known ahead of time, the only way to evaluate the makespans of the non-optimal algorithms is to compare them against the makespan of the optimal algorithm.
In Figure 4-9, we can see that the performance of the non-optimal algorithms, i.e., LPR, LPREdge, and LS, is comparable to that of the optimal algorithm, MILP, when CCR = 0.1 and 1.0. However, as CCR grows to 10, the makespan of LS becomes roughly twice the makespan of MILP. In contrast, the makespans of LPR and LPREdge are at most 20% above the optimal makespan.
Comparison between LPREdge and LS. As the general workflow scheduling problem is NP-hard, our corresponding formulation, MILP, requires exponential computational time as the size of workflows increases. For large workflows, it is impractical to determine the optimal makespan using the MILP algorithm. For this reason, we compare the makespans of only our non-optimal algorithms, LPREdge and
Figure 4-9. Makespan vs. CCR for all algorithms in the Abilene network when the number of nodes in a workflow is 3.
LS, in Figure 4-10. We can see that LPREdge performs much better than LS; it achieves half the makespan of LS in some cases.
Figure 4-10. Makespan vs. CCR and the number of nodes in a workflow for LPREdge and LS in the Abilene network
4.6.2.2 Computational time
Comparison against optimal scheduling results. The running time of the optimal algorithm grows exponentially, as shown in Figure 4-11. This algorithm takes approximately 14 seconds when there are 3 nodes and CCR = 0.1. With 3 nodes, the running time becomes approximately 47 seconds when CCR = 10. When the number of nodes is increased to 10 with CCR = 0.1, MILP takes more than 1,500 seconds. By contrast, LPREdge takes less than 5 seconds when there are 3 nodes and less than 150 seconds when the number of nodes is less than 50 (Figure 4-12).
Figure 4-11. Computational time vs. CCR for all algorithms in the Abilene network when the number of nodes in a workflow is 3.
Comparison between LPREdge and LS. The running time of the heuristic is a few
seconds, whereas the running time of LPREdge increases roughly linearly, reaching 150
seconds when the number of nodes is 50.
Figure 4-12. Computational time vs. the number of nodes in a workflow for LPREdge and LS in the Abilene network.
If workflow scheduling requests from users are on-demand and must be handled
in real time, the computational times observed in the experiments favor only the fast
greedy algorithm. However, when requests are made in advance, so that there is
enough time between request arrival and request start, and the centralized
server is a higher-end machine, LPREdge is deployable in practice.
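The deployment rule just described can be sketched as a simple dispatch policy. The function name, the runtime-estimate argument, and the 2x safety margin below are illustrative assumptions, not part of an implemented system:

```python
def choose_scheduler(lead_time_sec, lpredge_runtime_est_sec):
    """Pick a scheduling algorithm based on the gap between request
    arrival and requested start time: use the slower but higher-quality
    LPREdge only when that gap can absorb its estimated running time
    (with a 2x safety margin); otherwise fall back to the fast
    list-scheduling heuristic."""
    if lead_time_sec >= 2 * lpredge_runtime_est_sec:
        return "LPREdge"
    return "LS"

# An in-advance request with a 10-minute lead time vs. an on-demand one.
print(choose_scheduler(600, 150))  # LPREdge
print(choose_scheduler(10, 150))   # LS
```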
4.7 Summary
We have formulated workflow scheduling problems in e-Science networks whose
goal is to minimize either makespan or network resource consumption by jointly
scheduling heterogeneous resources such as compute and network resources. The
formulations differ from previous work in the literature in that they allow dynamic
multiple paths for data transfer between tasks and more flexible resource allocation
that may vary over time. In addition, the formulation for single-workflow scheduling
can be easily extended to multiple-workflow scheduling. Because the computation
time of the optimal formulation increases exponentially with the size of a workflow,
we developed an LP relaxation algorithm, referred to as LPR, for practical deployment
by applying the common linear relaxation technique to the optimal formulation. We
also propose the edge-path form LP relaxation algorithm, LPREdge, to improve time
complexity.
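The relaxation step can be illustrated with a toy rounding pass: solve the LP with the 0/1 assignment variables relaxed to [0, 1], then map each task to the resource carrying its largest fraction. The data layout and the largest-fraction rule below are illustrative assumptions, not the exact LPR procedure:

```python
def round_lp_assignment(x):
    """x[t][r] is the fractional value of the relaxed 0/1 variable
    assigning task t to resource r (each row sums to 1, as it would in
    a solved LP relaxation). Round each task to its dominant resource."""
    return {t: max(fracs, key=fracs.get) for t, fracs in x.items()}

# Hypothetical fractional solution returned by an LP solver.
x = {"t1": {"r1": 0.7, "r2": 0.3},
     "t2": {"r1": 0.2, "r2": 0.8}}
print(round_lp_assignment(x))  # {'t1': 'r1', 't2': 'r2'}
```

The relaxed LP is solvable in polynomial time, which is what buys LPR its large speedup over the exponential MILP at a modest cost in schedule quality.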
The experimental results show that the makespan of LPR and LPREdge is
comparable to that of the optimal algorithm (at most 20% longer) regardless of CCR
for small workflows. In contrast, the general list scheduling algorithm, LS, performs
roughly on par with LPR and LPREdge when CCR = 0.1, but the gap between
LPR/LPREdge and LS grows dramatically as CCR grows from 1 to 10. Data-intensive
workflow scheduling, which is common in e-Science applications, can therefore benefit
from dynamic multiple paths and malleable resource allocation. In terms of
computational time, the heuristic algorithm is of course the fastest because it requires
only trivial computations. The LPR and LPREdge algorithms require more computation,
which may take a few minutes when the number of nodes is 50. Infrequent workflow
scheduling requests from users and reasonable lead time between the arrival and start
times of requests can mitigate this burden.
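For reference, the list scheduling baseline LS follows the classic pattern sketched below. The longest-task priority rule and the omission of communication costs are simplifications for illustration, not the exact variant used in our experiments:

```python
def list_schedule(tasks, deps, num_machines):
    """Generic list scheduling: repeatedly pick the highest-priority
    ready task and place it on the machine that finishes it earliest.
    tasks: task -> execution time; deps: task -> set of predecessors.
    Communication costs are ignored in this toy version."""
    finish = {}                      # task -> finish time
    free = [0.0] * num_machines      # machine -> time it becomes free
    done = set()
    while len(done) < len(tasks):
        ready = [t for t in tasks
                 if t not in done and deps.get(t, set()) <= done]
        t = max(ready, key=lambda t: tasks[t])   # longest task first
        est = max((finish[p] for p in deps.get(t, set())), default=0.0)
        m = min(range(num_machines), key=lambda i: max(free[i], est))
        start = max(free[m], est)
        finish[t] = start + tasks[t]
        free[m] = finish[t]
        done.add(t)
    return max(finish.values())      # makespan

# A 4-task diamond-like workflow on 2 machines.
tasks = {"a": 2, "b": 3, "c": 1, "d": 2}
deps = {"c": {"a", "b"}, "d": {"c"}}
print(list_schedule(tasks, deps, 2))  # 6.0
```

Each scheduling decision is a simple scan, which is why LS finishes in seconds even for the 50-node workflows that take LPREdge minutes.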
To the best of our knowledge, the optimal algorithm, the MILP formulation, is
the first algorithm that jointly schedules heterogeneous resources including network
resources using dynamic multiple network paths and malleable resource allocation.
The approximation based on the optimal algorithm achieves reasonable performance
compared with the optimal algorithm in terms of schedule length (makespan). The
application of these results to optical networks will be future work.
CHAPTER 5
CONCLUSIONS
We propose to develop a novel framework for provisioning a variety of e-Science
applications that require complex workflows that span over multiple domains. Our
framework will provide guarantees on the performance while incurring minimal
overhead, both necessary conditions for such a framework to be adopted in practice.
We have already developed an SDF-based model for iterative data-dependent
e-Science applications that incorporates variable communication delays and temporal
constraints, such as throughput. We formulated the problem as a variant of
multicommodity-flow linear programming with the objective of minimizing network
resource consumption while meeting temporal constraints.
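In generic form, such a formulation minimizes total bandwidth consumption subject to flow conservation and link capacity, with throughput constraints added per iteration. The notation below is illustrative rather than the exact model developed earlier:

```latex
\min \sum_{e \in E} \sum_{t=1}^{T} c_e \, f_e(t)
\quad \text{s.t.} \quad
\sum_{e \in \delta^{+}(v)} f_e(t) - \sum_{e \in \delta^{-}(v)} f_e(t) = b_v(t)
\;\; \forall v \in V,\, t,
\qquad
0 \le f_e(t) \le u_e \;\; \forall e \in E,\, t,
```

where $f_e(t)$ is the flow on link $e$ in time slot $t$, $u_e$ the link capacity, and $b_v(t)$ the net supply or demand at node $v$.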
We also proposed topology aggregation algorithms for e-Science networks.
E-Science applications require high-quality intradomain and interdomain QoS
paths, and some of them are distinct from classic single-path single-job (SPSJ)
applications. We defined a new class of requests, called multiple-path multiple-job
(MPMJ), and proposed TA algorithms for this new class of applications. The proposed
algorithms, star and partitioned-star ARs, are shown to be significantly better than naive
approaches.
Finally, we formulated workflow scheduling problems in e-Science networks
whose goal is to minimize either makespan or network resource consumption by jointly
scheduling heterogeneous resources such as compute and network resources. The
formulations differ from previous work in the literature in that they allow dynamic
multiple paths for data transfer between tasks and more flexible resource allocation
that may vary over time. The LP relaxation algorithm for practical deployment has been
developed from the optimal algorithm through the common linear relaxation technique.
We also proposed the edge-path form LP relaxation algorithm to improve time
complexity.
BIOGRAPHICAL SKETCH
Eun-Sung Jung received B.S. and M.S. degrees in electrical engineering from Seoul
National University, Korea, in 1996 and 1998, respectively. His research interests include
network optimization in connection-oriented networks and its applications to existing
research networks.