NETWORK RESOURCE PROVISIONING IN RESEARCH NETWORKS
By
EUN-SUNG JUNG
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2010
© 2010 Eun-Sung Jung
To my parents, my wife, Hyeseon, and my daughter, Lauren
ACKNOWLEDGMENTS
First of all, I would like to thank my chair, Dr. Sanjay Ranka, and my co-chair, Dr.
Sartaj Sahni. Since I started working with them, they have inspired me, guided me through
all my research, and given me invaluable advice, suggestions, comments, and support
with patience and generosity. I also would like to show my sincere gratitude to my
supervisory committee members for insightful comments on my research.
I would like to give my deepest gratitude to my family and friends. Without their help
and support, this dissertation would not have been possible.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Target Networks and Services . . . . . . . . . . . . . . . . . . . . . . . 14
1.3 Problems Addressed and Our Contributions . . . . . . . . . . . . . . . . 16
    1.3.1 Bandwidth Allocation for Iterative Data-dependent Applications . . 16
    1.3.2 Topology Aggregation for E-Science Networks . . . . . . . . . . . 17
    1.3.3 Workflow Scheduling . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Outline of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2 BANDWIDTH ALLOCATION FOR ITERATIVE DATA-DEPENDENT E-SCIENCE
APPLICATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Synchronous Dataflow for E-Science Applications . . . . . . . . . . . . . 28
2.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
    2.3.1 Illustrative Example . . . . . . . . . . . . . . . . . . . . . . . . 34
    2.3.2 Optimal Bandwidth Allocation with a Feasible Schedule . . . . . . . 37
        2.3.2.1 Modeling communication delays . . . . . . . . . . . . . . 38
        2.3.2.2 Problem formulation . . . . . . . . . . . . . . . . . . . 42
2.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3 TOPOLOGY AGGREGATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 TA for Multiple-Path Multiple-Job (MPMJ) . . . . . . . . . . . . . . . . . 54
    3.3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 54
    3.3.2 New Topology Aggregation Algorithms . . . . . . . . . . . . . . . . 56
        3.3.2.1 Full-mesh method . . . . . . . . . . . . . . . . . . . . . 56
        3.3.2.2 Star method . . . . . . . . . . . . . . . . . . . . . . . 57
        3.3.2.3 Partitioned star method . . . . . . . . . . . . . . . . . 58
3.4 Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 62
    3.6.1 Bulk File Transfers in E-Science . . . . . . . . . . . . . . . . . . 62
    3.6.2 Experiment Testbed . . . . . . . . . . . . . . . . . . . . . . . . . 63
    3.6.3 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . 64
    3.6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4 WORKFLOW SCHEDULING . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Workflow Scheduling in E-Science Networks . . . . . . . . . . . . . . . . . 70
    4.2.1 System Model and Data Structure . . . . . . . . . . . . . . . . . . 71
        4.2.1.1 Time model . . . . . . . . . . . . . . . . . . . . . . . . 71
        4.2.1.2 Network resource model . . . . . . . . . . . . . . . . . . 71
        4.2.1.3 Workflow model . . . . . . . . . . . . . . . . . . . . . . 72
    4.2.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 72
    4.2.3 Construction of an Auxiliary Graph . . . . . . . . . . . . . . . . . 73
4.3 MILP Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
    4.3.1 Single Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . 78
        4.3.1.1 Multi-commodity flow constraints . . . . . . . . . . . . . 78
        4.3.1.2 Task assignment constraints . . . . . . . . . . . . . . . 80
        4.3.1.3 Precedence constraints . . . . . . . . . . . . . . . . . . 80
        4.3.1.4 Deadline constraints . . . . . . . . . . . . . . . . . . . 81
    4.3.2 Multiple Workflows . . . . . . . . . . . . . . . . . . . . . . . . . 81
    4.3.3 Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4 LP Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.5 List Scheduling Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.6 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 89
    4.6.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 89
    4.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
        4.6.2.1 Schedule length of workflows . . . . . . . . . . . . . . . 92
        4.6.2.2 Computational time . . . . . . . . . . . . . . . . . . . . 93
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
LIST OF TABLES
Table page
2-1 Comparison between DSP and e-Science applications . . . . . . . . . . . . . . 30
2-2 Summary of system parameters of the visualization application . . . . . . . . . 36
2-3 Notation for problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3-1 Time Complexity for MPMJ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3-2 Space Complexity for MPMJ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4-1 Notation for problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4-2 Single workflow scheduling formulation time complexity analysis . . . . . . . . 84
4-3 Edge-path form single workflow scheduling formulation time complexity analysis 88
LIST OF FIGURES
Figure page
2-1 An example of SDFG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2-2 A homogeneous SDFG converted from Figure 2-1 (a) . . . . . . . . . . . . . . 33
2-3 A real example of e-Science applications [53] . . . . . . . . . . . . . . . . . . . 35
2-4 An ESDFG model for Figure 2-3 . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2-5 Modeling communication delay in a SDFG . . . . . . . . . . . . . . . . . . . . . 39
2-6 Modeling communication delay in the case of multiple communication channels 41
2-7 More exploited parallelism in case of multiple communication channels . . . . . 42
2-8 BAFS problem formulation in case of the conservative model . . . . . . . . . . 42
2-9 BAFS problem formulation in case of the optimistic model . . . . . . . . . . . . 43
2-10 The Abilene network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2-11 Rejection ratio vs. number of requests . . . . . . . . . . . . . . . . . . . . . . . 47
3-1 An example of inter-domain QoS routing . . . . . . . . . . . . . . . . . . . . . . 51
3-2 An illustrative example for limitations of the line segment algorithm . . . . . . . 54
3-3 Full-mesh AR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3-4 Star AR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3-5 Partitioned star AR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3-6 Earliest finish time on-line scheduling of multiple file transfers . . . . . . . . . . 63
3-7 Error ratio vs. the number of nodes . . . . . . . . . . . . . . . . . . . . . . . . . 65
3-8 Normalized computational time vs. the number of source and destination nodes 65
4-1 A DAG consisting of 17 nodes, representing dependencies among 17 tasks of an application. For example, the arc from task E to task B represents the fact that the output generated by task E is utilized by task B. . . . . . . . . . . 68
4-2 An example of a network resource graph . . . . . . . . . . . . . . . . . . . . . 72
4-3 An example of a task graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4-4 An example of an auxiliary graph . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4-5 Single workflow scheduling problem formulation via network flow model . . . . 79
4-6 Multiple workflow scheduling problem formulation via network flow model . . . 82
4-7 Edge-path form of single workflow scheduling problem formulation . . . . . . . 87
4-8 The Abilene network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4-9 Makespan vs. CCR for all algorithms in the Abilene network when the number of nodes in a workflow is 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4-10 Makespan vs. CCR and the number of nodes in a workflow for LPREdge and LS in the Abilene network . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4-11 Computational time vs. CCR for all algorithms in the Abilene network when the number of nodes in a workflow is 3 . . . . . . . . . . . . . . . . . . . . 94
4-12 Computational time vs. the number of nodes in a workflow for LPREdge and LS in the Abilene network . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
List of Algorithms
2-1 A heuristic for BAFS problem . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3-1 Full-mesh AR construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3-2 Star AR construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3-3 Partitioned star AR construction . . . . . . . . . . . . . . . . . . . . . . . . . 59
4-1 First step - Determination of the mapping of tasks except data transfers . . 85
4-2 Second step - Determination of the mapping of network resources . . . . . 85
4-3 The adapted extended list scheduling algorithm . . . . . . . . . . . . . . . . 89
4-4 Data transfer finish time computation algorithm . . . . . . . . . . . . . . . . 90
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
NETWORK RESOURCE PROVISIONING IN RESEARCH NETWORKS
By
Eun-Sung Jung
December 2010
Chair: Sanjay Ranka
Cochair: Sartaj Sahni
Major: Computer Engineering
Advances in optical communication and networking technologies, together with
the computing and storage technologies, are dramatically changing the ways scientific
research is conducted. A new term, e-Science, has emerged to describe “large-scale
science carried out through distributed global collaborations enabled by networks,
requiring access to very large scale data collections, computing resources, and
high-performance visualization” [12].
E-Science application workflows are complex and require schedulable and
high-bandwidth connectivity with known future characteristics. Moreover, these
workflows have performance requirements or metrics that have not been considered
by conventional networking. For example, a large file transfer may need guarantees
on total turnaround time and rate of progress. Given the long duration of many
requests, the network resources available to a request may change before it completes.
We develop a novel framework for provisioning a variety of e-Science applications
that require complex workflows that span over multiple domains. Our framework
provides guarantees on the performance while incurring minimal overhead, both
necessary conditions for such a framework to be adopted in practice.
CHAPTER 1
INTRODUCTION
1.1 Overview
Advances in optical communication and networking technologies, together with
the computing and storage technologies, are dramatically changing the ways scientific
research is conducted. A new term, e-Science, has emerged to describe “large-scale
science carried out through distributed global collaborations enabled by networks,
requiring access to very large scale data collections, computing resources, and
high-performance visualization” [12]. Frequently cited e-Science (and related grid
computing [47]) examples include high-energy nuclear physics [33], radio astronomy
[19], geoscience [3] and climate studies [13]. To support e-Science activities, a new
generation of high-speed research and education networks have been developed.
These include Internet2 [17], the Department of Energy’s ESnet [14], National Lambda
Rail [21], CA*net4 [9] in Canada, and the pan-Europe GEANT2 [5]. A large portion of all
data traffic supporting U.S. science is carried by ESnet, Internet2, and National Lambda
Rail [55].
E-science activities often need to transport large volumes of data at a very high
rate among a large number of collaborating sites [33, 78], severely stressing network
resources. For instance, the high-energy physics (HEP) data is expected to grow from
the current petabytes (PB, 10^15 bytes) to exabytes (10^18 bytes) by 2015 [33]. Beyond the obvious
need for large amounts of data to be transferred, e-Science requirements for network
use are significantly different from the traditional network applications [7, 46, 55] in the
following ways:
1. Need to support schedulable, long-duration workflows with performance guarantees: The underlying applications require schedulable, high-bandwidth, low-latency connectivity with known future characteristics or performance guarantees [41], for real-time remote visualization, interactions with instruments, distributed simulation or data analysis, etc. In a distributed workflow system that involves many entities such as distant parties, scientific instruments, computation devices, as well as complex feedback in various stages of the workflow, unintended delay due to
lack of planning for future communication paths can ripple through the entire workflow environment, slowing down other participating systems as they wait for intermediate results, thus reducing the overall effectiveness [55].
2. Need to support a large number of network services with novel performance metrics: There are many different types of sciences and scientific activities, which require different types of network services tailored to the specific science activities. (Also see [1, 2, 7, 46, 55].) Moreover, many of the e-Science activities have performance requirements or metrics that have not been considered by conventional networking. Large file transfer may only be concerned with total turnaround time and the rate of progress; streaming consumer-producer type of jobs running at two different sites may require a minimum and maximum data transfer rate; fusion experiments may care about lowering the probability of failure in the experiments due to inadequate network services.
3. Need to support a dynamic user and resource environment with high network efficiency: Given that each job can be a “heavy hitter” in terms of network resource consumption, the network must handle with great efficiency the dynamic arrivals of service requests, and the changes in requirements, traffic patterns, and access policies at different stages of experiments or collaboration. The efficiency requirement is especially important for the new-generation, high-speed, coarse-granular networks, such as wavelength-based systems. In addition, given the long duration of many jobs, the resources available to a particular job or the network topology may change before the job is completed. The network services must be able to adapt to resource changes by incorporating newly added resources (e.g., links or wavelengths) or falling back to alternative resources when the assigned resources are no longer available.
In short, e-Science activities need schedulable, high-bandwidth, flexible and
evolving network services with novel performance guarantees, and the network needs
to provide these services efficiently. There is a large body of research on how to
provide quality-of-service (QoS) guarantees (e.g., IntServ [31], DiffServ [29], ATM
networks [71], or MPLS [86]) for Internet-type networks. Those proposals do not consider
advance reservations with start and end times. Bulk transfer is usually regarded as
low-priority best-effort traffic, not subject to admission control (AC). AC and scheduling
are decoupled from routing in that each connection has a single default path separately
determined by a routing protocol; the routing protocol is usually oblivious of the jobs in
the system. AC is myopic in that each network element on the path determines whether
the connection can be accepted based on a comparison of the remaining link capacity
at the node itself with the requested resource of the job alone. Once admitted, the path
and resource allocation remain fixed throughout the lifetime of the connection.
To meet the needs of e-Science, we propose a framework for conducting advance
reservations, admission control (AC) and scheduling of network service requests in
research networks that (i) supports an evolving large array of network services required
by or useful for e-Science collaboration; (ii) guarantees performance levels that are
based on metrics relevant to the underlying applications; (iii) adapts to underlying
changes in network topology, resources and user dynamics; and (iv) provides efficient
utilization of the underlying resources.
1.2 Target Networks and Services
Recent network signaling protocols, such as Multiprotocol Label Switching
(MPLS) and Generalized MPLS (GMPLS), allow applications to overcome deficiencies
prevalent in existing routed TCP/IP protocols (e.g., the inability to guarantee bandwidth,
or offer Quality of Service). Many high-bandwidth network projects are currently
deploying these protocols in the research and academic domain. This is the case, for
example, in the Internet2’s HOPI testbed [16], the NSF-supported Ultralight [8], Teragrid
[23], CHEETAH [6] and DRAGON [63] networks and the DOE-supported UltraScience
Net (USN) [24], ESnet [14] and LHCNet [20]. We expect these protocols to proliferate
into the production and commercial network domain. As the first evidence, both the
Internet2 and the DOE’s ESnet have chosen to offer dedicated bandwidth capability
and lightpaths using GMPLS and MPLS control plane techniques, developed in the
OSCARS/BRUW projects [10]. The techniques provide a framework to automate the
provisioning process for bandwidth and make it easier for users to access Service
Oriented Bandwidth Management (SOBM) functions, compared to the current
provisioning and bandwidth management practices, which are manual and labor-intensive.
Past and current projects on research networks have focused on addressing the
following challenges [6, 10, 16, 63]: 1) Set up the high-speed data plane by a hybrid of
IP packet-switching and optical circuit-switching technologies with a large footprint and
sufficient connectivity by connecting the national labs and universities and peering with
other networks, 2) Develop support for end-to-end high-speed circuits statically or on
demand, which requires multi-domain interoperability, 3) Set up the basic control plane
and develop signaling and control middleware for handling user requests and basic
network resource reservation, 4) Develop end-to-end transport protocols for supporting
high-speed channels and large volumes of data, 5) Ensure security by encryption,
authentication, authorization (AAA), and 6) Ensure reliability.
Bulk transfer. Being able to transfer very large files is a priority in nearly all
e-Sciences [1, 2, 7, 46, 55]. If the turnaround time is the performance metric that the
user cares about, there is a great deal of flexibility in how the transfer can be carried
out. For instance, the transfer of a 100 GB file can be completed in 8 seconds using
ten 10 Gbps lightpaths (Internet2 links), or in 1 hour and 26 minutes using a 155 Mbps
(OC-3) long-lasting SONET circuit. The transfer choice not only affects the job in
question, but also other current or future jobs in complex ways. For large transfers with
start and end time constraints, peak bandwidth assignment can lead to an undesirable
phenomenon known as fragmentation [77], which in turn leads to low utilization of
network resources. This occurs when some time intervals are lightly loaded but not long
enough to accommodate new large jobs. Greater transfer flexibility is needed to combat
this problem, such as time-varying bandwidth assignment and dynamic re-assignment.
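The transfer-time figures quoted above follow from simple arithmetic; the sketch below reproduces them, assuming ideal, fully dedicated circuits with no protocol overhead (an idealization; real transfers would be slower):

```python
# Transfer-time arithmetic behind the bulk-transfer examples above.
# Assumes ideal, fully dedicated circuits with no protocol overhead.

def transfer_seconds(file_bytes: float, rate_bps: float) -> float:
    """Ideal transfer time: file size in bits divided by line rate."""
    return file_bytes * 8 / rate_bps

GB = 1e9
file_size = 100 * GB

# Ten 10 Gbps lightpaths used in parallel -> 100 Gbps aggregate.
t_lightpaths = transfer_seconds(file_size, 10 * 10e9)   # 8.0 seconds

# A single 155 Mbps (OC-3) SONET circuit.
t_oc3 = transfer_seconds(file_size, 155e6)              # ~5161 s, about 1 h 26 min

print(t_lightpaths, t_oc3 / 60)
```

The two endpoints of this range illustrate why the choice of rate profile matters: the same request can occupy a huge slice of the network briefly or a thin slice for a long time, with very different effects on fragmentation.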
Streaming workflow. For the nanomaterial sciences conducted at DOE’s Center
for Functional Nanomaterials, research often involves distributed collaboration among
smaller research centers with different scientific instruments and capabilities [2]. Data
are generally collected from several centers and/or are compared against each other.
Then, a medium sized cluster of computers processes and analyzes the data. The
visualization is done by special workstations equipped with large memory graphics cards
to handle the large images and volumes of data from the output of the data processing.
The generated animation is then streamed to remote scientists’ desktops, or in the case
where the visualization is in stereo, to a 3D theater. The network requirements vary at
each stage of the workflow.
Data-intensive workflow. Large-scale supercomputing is expected to produce
data at a similar rate to large-scale experiments. In order to post-process the computed
results, high throughput transfers are often required to stage the data at the related
computational resources. Similarly, high-end scientific computing also processes large
amounts of input data that, from a performance perspective, should be accessible as
efficiently as possible. Local parallel file systems are well suited for supporting the
required I/O capabilities, even when data has to be staged to the respective file
systems. Community schedulers need to control multiple distributed computational
resources in order to serve individual workflows.
1.3 Problems Addressed and Our Contributions
E-Science networks usually provide QoS guarantees, i.e., bandwidth guarantees,
through Multiprotocol Label Switching (MPLS) and Generalized MPLS (GMPLS)
to meet the requirements of e-Science applications, e.g., in-advance path
reservations for high-volume data transfers. The distinctive features of e-Science
applications compared with other distributed applications can be summarized by two key
phrases: “network-centric” and “in-advance”. Unlike other grid computing applications,
scheduling of e-Science applications puts more focus on network resources, treating
them as the most important among multiple resource types such as compute
and storage. Moreover, in-advance scheduling of e-Science
applications satisfies the needs of users requesting periodic or predictable services.
1.3.1 Bandwidth Allocation for Iterative Data-dependent Applications
We present a framework for bandwidth scheduling of streaming e-Science
applications. These applications include interactive visualization of simulations, large
data streaming coordinated with job execution for producer consumer applications,
and networked supercomputing [46]. We have adapted the Synchronous Dataflow
(SDF) model to model and analyze iterative data-dependent applications in e-Science.
Synchronous dataflow was proposed in the late 1980s as a modeling method for digital
signal processing (DSP) applications, but it ignores the communication delays. Our
model incorporates the communication delays that are inherent in large-scale distributed
applications. We have formulated the bandwidth allocation problem of iterative
data-dependent e-Science applications with temporal constraints as a multi-commodity
linear programming problem. It incorporates optimal rates and buffer minimization for
streaming applications that can be represented by an SDFG. Our algorithms determine
how much bandwidth is allocated to each edge while satisfying temporal constraints
on collaborative tasks. Using the solution of the bandwidth allocation problem, buffer
requirements for the schedule are then derived using procedures similar to the ones
presented in [50]. To the best of our knowledge, this represents the first attempt to
analyze the temporal behavior of collaborative, iterative tasks and to determine the
optimal bandwidth allocations among distributed nodes.
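The balance equations at the heart of an SDF model determine how many times each task must fire per iteration before bandwidth can be allocated to the edges. A minimal sketch, using a hypothetical three-task graph (the rates below are illustrative, not taken from this dissertation):

```python
from fractions import Fraction
from math import lcm

# Toy SDF graph: edges as (producer, prod_rate, consumer, cons_rate).
# Balance equation per edge: rate[u] * prod_rate == rate[v] * cons_rate.
edges = [("A", 2, "B", 3), ("B", 1, "C", 2)]

def repetition_vector(edges):
    """Smallest positive integer firing counts satisfying the balance
    equations of a connected, consistent SDF graph."""
    rates = {edges[0][0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for u, p, v, c in edges:
            if u in rates and v not in rates:
                rates[v] = rates[u] * p / c
                changed = True
            elif v in rates and u not in rates:
                rates[u] = rates[v] * c / p
                changed = True
    for u, p, v, c in edges:          # consistency check
        assert rates[u] * p == rates[v] * c, "inconsistent SDF graph"
    scale = lcm(*(r.denominator for r in rates.values()))
    return {task: int(r * scale) for task, r in rates.items()}

print(repetition_vector(edges))  # {'A': 3, 'B': 2, 'C': 1}
```

With the firing counts known, the data volume carried by each edge per iteration is fixed, which is what makes it possible to cast bandwidth allocation under temporal constraints as a linear program.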
1.3.2 Topology Aggregation for E-Science Networks
The network supporting e-Science applications is typically composed of multiple
domains. Each domain usually belongs to a different organization and is managed
under its own operational policies. In such cases, the internal topologies of domains
may not be visible to the others for security or other reasons. However, aggregated
information of internal topology and associated attributes is advertised to the other
domains.
The set of techniques for aggregating the data that a domain advertises externally is called
Topology Aggregation (TA). The aggregated data itself is termed the Aggregated
Representation (AR). A survey of TA algorithms is presented in [98]. There exists a
tradeoff between the accuracy and the size of AR. Hence, most algorithms proposed in
the previous work tried to achieve the most efficient AR in terms of both accuracy and
space complexity.
One can classify QoS path requests into two classes: single-path single-job
(SPSJ) and multiple-path multiple-job (MPMJ), depending on the nature of requests.
SPSJ corresponds to a situation in which requests for a single QoS path arrive and
are scheduled in the order of arrival. In contrast, MPMJ corresponds to batch/off-line
scheduling of multiple requests for multiple QoS paths. Many e-Science applications
require simultaneous transfer of data between multiple sources and destinations. Also,
each of these requests (e.g., file transfers) can be supported more efficiently by using
multiple concurrent paths.
We show that existing TA approaches developed for SPSJ do not work well with
MPMJ applications as they overestimate the amount of bandwidth that is available. We
propose a max flow based TA approach that is suitable for this purpose. Our simulation
results demonstrate that our algorithms result in better accuracy or less scheduling time.
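The intuition behind a max-flow-based AR can be illustrated on a toy domain (the topology and capacities below are hypothetical): a single widest path between border nodes s and t carries only 10 Gbps, while the bandwidth jointly available over concurrent paths is 15 Gbps, which is what a max-flow computation reports.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max flow on an adjacency-dict capacity map."""
    flow = 0
    res = {u: dict(nbrs) for u, nbrs in cap.items()}  # residual capacities
    while True:
        # BFS for an augmenting path in the residual graph.
        parent, q = {s: None}, deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        # Recover the path and its bottleneck capacity.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= aug
            res.setdefault(v, {})
            res[v][u] = res[v].get(u, 0) + aug
        flow += aug

# Hypothetical intra-domain topology between border nodes s and t (Gbps).
cap = {"s": {"a": 10, "b": 10}, "a": {"t": 10}, "b": {"t": 5}}
print(max_flow(cap, "s", "t"))  # 15: the joint capacity advertised for (s, t)
```

A per-path AR built for SPSJ would either advertise the single widest path (10 Gbps, an underestimate for MPMJ) or sum path capacities over shared links (an overestimate); the max-flow value is exactly the bandwidth a batch of concurrent transfers can jointly obtain.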
1.3.3 Workflow Scheduling
Workflow/Directed Acyclic Graph (DAG) scheduling has been shown to be NP-hard
[91]. A number of practical heuristics have been developed for this problem. Most of
these ignore the communication costs [26, 39] or assume a very simple interconnection
network model, i.e., a fully-connected network model without contention [59, 60, 79, 97,
99]. The work in [97] proposed the heterogeneous-earliest-finish-time (HEFT) algorithm
extended from the classic list scheduling for heterogeneous computing resources.
However, the advances in computing platforms ranging from clusters to grids and
emerging clouds for data-intensive applications have posed new challenges where
network contention is an important issue that needs to be addressed. We propose
to address this issue by formulating and solving the overall workflow scheduling that
incorporates network contention and overheads of the large scale data transfers. In
particular, we address the following issues for e-Science grids that have networks that
are a mix of IP networks and optical networks:
• Malleable resource allocation.
• Dynamic multipath scheduling.
• Multiple workflows.
We have formulated workflow scheduling problems in e-Science networks, whose
goal is minimizing either makespan or network resource consumption by jointly
scheduling heterogeneous resources such as compute and network resources. The
formulations are different from previous work in the literature in the sense that they
allow dynamic multiple paths for data transfer between tasks and more flexible resource
allocation that may vary over time. Moreover, our work is the first to formulate the
workflow scheduling problem incorporating multiple paths as a mixed integer linear
programming (MILP) problem. We also formulate a linear programming relaxation, LPR, of our
MILP, an edge-path based LP relaxation, LPREdge, and a list scheduling heuristic, LS.
The experimental results show that the makespan of LPR schedules is much closer
to optimal than that of LS schedules when the communication-to-compute ratio (CCR) is
large. The LS algorithm performs roughly on par with the LPR algorithm when CCR = 0.1
and 1.0, but the performance gap of these non-optimal algorithms grows dramatically as
CCR grows from 1 to 10. Our results indicate that data-intensive workflow scheduling,
which is common in e-Science applications, will benefit from dynamic multiple paths and
malleable resource allocation.
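The flavor of a list scheduling heuristic of this kind can be sketched on a toy instance (the task graph, costs, and two-resource system below are hypothetical, not the LS algorithm of Chapter 4): tasks are taken in topological order and each is placed on the resource giving the earliest finish time, paying a communication cost only when a predecessor ran on a different resource.

```python
# Toy list scheduler illustrating why communication cost (CCR) matters.
tasks = ["A", "B", "C", "D"]                  # already in topological order
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
compute = {"A": 2, "B": 3, "C": 3, "D": 2}    # execution times
comm = 4                                      # uniform data-transfer time

def list_schedule(resources=("r1", "r2")):
    free = {r: 0 for r in resources}          # time each resource becomes idle
    placed, finish = {}, {}
    for t in tasks:
        best = None
        for r in resources:
            # Data from predecessors placed elsewhere must be transferred.
            ready = max([finish[p] + (comm if placed[p] != r else 0)
                         for p in preds[t]] or [0])
            f = max(ready, free[r]) + compute[t]
            if best is None or f < best[0]:
                best = (f, r)
        finish[t], placed[t] = best
        free[best[1]] = best[0]
    return max(finish.values())               # makespan

print(list_schedule())  # 10: B and C serialize on one resource to avoid transfers
```

On this instance the heuristic keeps all four tasks on one resource: with comm = 4 the transfer penalty outweighs any parallelism, which mirrors the observation above that greedy list scheduling degrades as CCR grows, while formulations with multiple paths and malleable bandwidth can shrink exactly that penalty.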
1.4 Background and Related Work
Ongoing research projects for supporting e-Science applications (e.g., HOPI [16],
Ultralight [8], Teragrid[23], CHEETAH [6], DRAGON [63], ESnet [14], OSCARS/BRUW
[10]) have mainly focused on setting up a fast data plane with a large footprint and
sufficient connectivity and setting up a basic but functional control plane, such as
developing signaling and control middleware for handling user requests for elementary
network services, ensuring security and improving reliability. However, the control plane
mechanisms lack sophisticated network service support or efficient service reservation
algorithms. They normally only support fixed bandwidth guarantee by reserving circuits
or lightpaths. Using such a restricted set of services or simplistic resource management
algorithms to support diverse e-Science activities can lead to inefficient utilization of
the network resources (especially for the new-generation, high-speed, coarse-granular
networks, such as the wavelength-based systems) and/or not provide the level of
performance required by those activities in desired but varied performance metrics.
Compared with the traditional QoS frameworks, such as IntServ [31], DiffServ
[29], ATM networks [71], or MPLS [86], admission control and scheduling for research
networks are recent concerns with limited published work. Prior work is either about
dedicated path reservation, bulk data transfers or jobs that require minimum bandwidth
guarantee (MBG). None has considered as rich a class of job types as we do.
Control plane protocols, architectures and tools. The NSF-supported DRAGON
[63, 106] project develops control plane architecture and middleware for multi-domain
traffic engineering and resource allocation, e.g., using GMPLS protocols [43] for setting
up SONET circuits or lightpaths. It uses a centralized resource computation element
per domain, which is responsible for computing paths. It supports advance reservations
of label switched paths (LSPs) over requested time periods. CHEETAH [6] is a similar
project to DRAGON but is more traditional in that it focuses on simpler, distributed
operations for path computation and bandwidth management to support high arrival
rates of immediate connection requests. OSCARS [10] is the control plane project for
DOE’s ESnet, also similar to DRAGON. It develops and deploys a prototype service
that enables on-demand provisioning of guaranteed bandwidth circuits for ESnet. HOPI
[16] is a testbed project on research networks that examines how to provide network
services in a hybrid network of shared IP packet switching and dynamically provisioned
lightpaths.
[52] presents an architecture for advance reservation of intra- and interdomain
lightpaths. GARA [48], the reservation and allocation architecture for the grid computing
toolkit Globus [15], supports advance reservation of network and computing resources.
[40] adapts GARA to support advance reservation of lightpaths, MPLS paths and
DiffServ paths. Other related work in this category includes GridJIT [96], ODIN [54], [30]
and [34]. Much of the objectives, architectural framework, and capabilities of the
proposed project coincide with those of the NSF's GENI project [22], for instance, the use of network
controllers and the support of network virtualization. Most of the above control-plane
architectures and tools provide rudimentary AC and scheduling algorithms for simple job
types. However, much more can be done to support more service types or improve the
network resource utilization.
Path reservation. The ability to provide dedicated or on-demand circuits or
lightpaths is currently the focus of many projects, including most aforementioned
major research networks and associated projects, e.g., Internet2, ESnet, National
Lambda Rail, GEANT2, UltraScience Net (USN), HOPI, DRAGON, CHEETAH and
OSCARS/BRUW. Further examples include User Controlled Light Paths (UCLP) [25],
Enlightened [4], Japanese Gigabit Network II [18], LHCNet [20], and Bandwidth Brokers
[109]. In our previous research work, we proposed novel algorithms for advance
path computation and bandwidth scheduling for connection-oriented networks [87] that
have considerably better performance [57]. In [56], we extended these algorithms
to incorporate the wavelength sharing and wavelength continuity constraints.
MBG service. Several earlier studies [32, 36, 89, 104] have considered AC at
an individual link for the MBG (minimum bandwidth guarantee) job type with start and
end times. The concern is typically about designing efficient data structures, such as a
segment tree [32], for recording and querying the link bandwidth usage on different time
intervals. Admission of a new job is based on the availability of the requested bandwidth
between its start time and end time. [35, 44, 51, 100] and [36] tackle the more general
path-finding problem for the MBG class, but typically only for new requests, one at
a time. The routes and bandwidth of existing jobs are unchanged. [64] considers
a network with known routing in which each admitted job derives a profit. It gives
approximation algorithms for admitting a subset of the jobs so as to maximize the total
profit.
Bulk transfer. Recent papers on AC and scheduling algorithms for bulk transfer
with advance reservations include [35, 37, 51, 70, 73–75, 77, 82]. In [77], the AC and
scheduling problem is considered only for the single-link case. Network-level AC and
scheduling are considered to be outside the scope of [77]. As a result, multi-path routing
and network-level bandwidth allocation and re-allocation have no counterpart in [77]. In
contrast, we periodically re-optimize the bandwidth assignment for all the new and old
jobs.
For a one-time scheduling problem, our recent work [82] conducts a detailed
performance comparison between single-slice scheduling and multi-slice scheduling
under various slice sizes, and between single-path routing and multi-path routing. We
conclude that a small number of paths per job is usually sufficient to yield near-optimal
throughput; multi-slice scheduling leads to significant performance (e.g., throughput)
improvement. Other authors have also considered a similar problem but with different
emphasis [37].
In [73–75], the authors consider single-link AC or link-by-link AC under single-path
routing. The AC uses heuristic algorithms instead of solutions to optimization problems.
Based on its size and the deadline, the average required bandwidth of a bulk transfer
job is computed. The AC is based on the job’s average bandwidth requirement. The
bandwidth of existing jobs may be re-allocated only for the single-link case.
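The average-bandwidth admission test described above can be sketched as follows. This is our illustrative reconstruction, not the cited authors' code; the tuple format for reservations is an assumption.

```python
def admit_average_bw(size, start, deadline, reservations, capacity):
    """Single-link AC based on a bulk job's average bandwidth requirement.
    size / (deadline - start) is the constant rate the job would need;
    the job is admitted iff that rate fits under `capacity` at every
    breakpoint of the already admitted (rate, start, end) reservations.
    Returns the allocated rate, or None on rejection.
    """
    rate = size / (deadline - start)
    # Residual capacity only changes where an existing reservation starts or ends.
    breakpoints = {start} | {t for _, s, e in reservations
                             for t in (s, e) if start <= t < deadline}
    for t in breakpoints:
        in_use = sum(r for r, s, e in reservations if s <= t < e)
        if in_use + rate > capacity:
            return None
    reservations.append((rate, start, deadline))
    return rate
```

On a 10-unit link, for example, a 40-unit job over [0, 8) is admitted at rate 5, while a second 48-unit job over the same window would need rate 6 and is rejected.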
The authors of [35] propose a malleable reservation scheme for bulk transfer, which
checks every possible interval between the requested start and end times for the job and
tries to find a path that can accommodate the entire job on that interval. The scheme
favors intervals with earlier deadlines. In [51], the computational complexity of a related
path-finding problem is studied and an approximation algorithm is suggested. [70] starts
with an advance reservation problem for bulk transfer, but converts it into a constant
bandwidth allocation problem to maximize the job acceptance rate. All the requests
are known at the time of AC; AC/scheduling is carried out only once. The bandwidth
constraints are at the ingress and egress links only, and hence, there is no routing issue.
Grid/Utility/Cloud Computing. Network resource provisioning problems in
e-Science networks share some design goals, such as achieving the earliest finish time
of a job, with resource management problems in grid/utility/cloud computing. They differ,
however, in that network resources, i.e., the bandwidth of links, are assumed to be
guaranteed and manageable by emerging technologies such as MPLS and GMPLS.
Such QoS-guaranteeing infrastructures for e-Science applications stem from the fact
that common e-Science applications transport large volumes of data at very high rates.
This difference opens a research area of more principled management of network
resources, which can improve overall system performance.
Optical networks. E-Science networks are a mix of IP networks and optical networks.
In optical networks, the bandwidth along a given link can be decomposed into multiple
wavelengths, which imposes the following constraints.
• Wavelength continuity constraint: This constraint forces a single lightpath to occupy the same wavelength throughout all the links that it spans. The constraint is not required when an optical network is equipped with wavelength converters; when such converters are present, the network is called a wavelength-convertible network.

• Wavelength sharing constraint: For many deployments, it is most effective to consider the bandwidth on a link as consisting of integer multiples of a wavelength, with a single wavelength as the unit of assignment, i.e., one wavelength is occupied by only one reservation at any point in time. It is worth noting that techniques based on Time Division Multiplexing (TDM)/Wavelength Division Multiplexing (WDM) [110] allow for decomposing the bandwidth on a wavelength.
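Both constraints can be illustrated with a first-fit wavelength assignment check. The sketch below is generic and not taken from any of the systems cited here; a lightpath must find one wavelength that is simultaneously free on every hop.

```python
def first_fit_wavelength(path_links, free, num_wavelengths):
    """Assign one wavelength to a lightpath under the two constraints above:
    continuity (the same wavelength end to end, assuming no converters)
    and sharing (each wavelength on a link carries at most one reservation).
    `free[link]` is the set of unused wavelength indices on that link.
    Returns the assigned wavelength, or None if the request is blocked.
    """
    for w in range(num_wavelengths):
        if all(w in free[link] for link in path_links):
            for link in path_links:
                free[link].discard(w)  # the wavelength is now occupied end to end
            return w
    return None
```

In a wavelength-convertible network the `all(...)` test would relax to requiring only that *some* wavelength is free on each hop individually.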
The related issues in the research area of optical networks are the routing and
wavelength assignment (RWA) problem, the virtual topology (VT) problem, the traffic
grooming (TG) problem, and the task scheduling and lightpath establishment (TSLE) problem.
1.5 Outline of Dissertation
The remainder of this dissertation is organized as follows.
Chapter 2 describes an SDF-based model for iterative data-dependent e-Science
applications that incorporates variable communication delays and temporal constraints,
such as throughput. We formulate the problem as a variation of multi-commodity
linear programming with an objective of minimizing network resource consumption
while meeting temporal constraints. The resulting solution can then be used to derive
buffer space requirements by previously developed algorithms in the context of DSP
applications. Finally, an illustrative example of an e-Science application shows that
the framework and algorithm we propose are suitable for modeling and analyzing iterative
data-dependent e-Science applications. The simulation results show that the optimal
bandwidth allocation by the formulated linear program outperforms the bandwidth
allocation by a simple heuristic in terms of the rejection ratio of requests.
Chapter 3 describes topology aggregation algorithms for e-Science networks.
E-Science applications require higher quality intradomain and interdomain QoS
paths, and some of those are distinguished from classic single-path single-job (SPSJ)
applications. We define a new class of requests, called multiple-path multiple-job
(MPMJ), and propose TA algorithms for the new class of applications. The proposed
algorithms, star and partitioned-star aggregated representations (ARs), are shown to be significantly better than naive
approaches.
Chapter 4 describes efficient algorithms for workflow scheduling problems in
e-Science networks, whose goal is minimizing either makespan or network resource
consumption by jointly scheduling heterogeneous resources such as compute and
network resources. Our algorithms are different from previous work in the literature in
the sense that they allow dynamic multiple paths for data transfer between tasks and
more flexible resource allocation that may vary over time. In addition, it is advantageous
that the formulation for single-workflow scheduling can be easily extended to a
formulation for multiple-workflow scheduling.
CHAPTER 2
BANDWIDTH ALLOCATION FOR ITERATIVE DATA-DEPENDENT E-SCIENCE
APPLICATIONS
2.1 Overview
E-Science activities often require the transport of large volumes of data at very high
rates among a large number of collaborating sites [33, 78], severely stressing network
resources. For instance, the high-energy physics (HEP) data are expected to grow from
the current petabytes (10^15 bytes) to exabytes (10^18 bytes) by 2015 [33]. Beyond the obvious need
for large amounts of data to be transferred, e-Science requirements for network use are
significantly different from those of traditional network applications [7, 46, 55]. The underlying
applications require schedulable, high-bandwidth, low-latency connectivity with known
future characteristics or performance guarantees [41] for real-time remote visualization,
interactions with instruments, distributed simulation or data analysis, and so on. In a
distributed workflow system that involves many entities, such as distant parties, scientific
instruments, computation devices, as well as complex feedback in various stages of the
workflow, unintended delays due to a lack of planning for future communication paths
can ripple through the entire workflow environment, slowing down other participating
systems as they wait for intermediate results, thus reducing the overall effectiveness
[55].
The focus of this chapter is on supporting e-Science applications that require
streaming of information between sites. We present a framework for bandwidth
scheduling of streaming e-Science applications. These applications include interactive
visualization of simulations, large data streaming coordinated with job execution for
producer consumer applications, and networked supercomputing [46]. The main
contributions are as follows:
1. We have adapted the Synchronous Dataflow (SDF) model to model and analyze iterative data-dependent applications in e-Science. Synchronous dataflow was proposed in the late 1980s as a modeling method for digital signal processing (DSP) applications, but it ignores communication delays. Our model incorporates the communication delays that are inherent in large-scale distributed applications.

2. We have formulated the bandwidth allocation problem of iterative data-dependent e-Science applications with temporal constraints as a multi-commodity linear programming problem. It incorporates optimal rates and buffer minimization for streaming applications that can be represented by an SDFG. Our algorithms determine how much bandwidth is allocated to each edge while satisfying temporal constraints on collaborative tasks. Using the solution of the bandwidth allocation problem, buffer requirements for the schedule are then derived using procedures similar to the ones presented in [50].
To the best of our knowledge, this represents the first attempt to analyze the temporal
behavior of collaboratively iterative tasks and to determine the optimal bandwidth
allocations among distributed nodes.
The rest of this chapter is organized as follows. We provide a detailed description
of SDF and its operational semantics and examine its applicability to e-Science
applications in Section 2.2. We present an overall process of problem-solving, including
a mathematical formulation as a linear program and a discussion of the
deployment of the obtained solution in real systems in
Section 2.3. We show that our approach outperforms a naive heuristic, also given by us,
in Section 2.4. Lastly, we conclude with a summary and a discussion of the practicality of
our approach in Section 2.5.
2.2 Synchronous Dataflow for E-Science Applications
The SDF model of computation was first proposed by Lee in [62]. The SDF model
has been found to be very useful for expressing DSP applications that have the following
features: infinitely looping execution, discretized communication expressed by tokens,
and parallelism to be exploited for maximizing throughput. Most of the existing research
for these problems is limited to deriving maximal rates and buffer minimization.
An SDFG is a directed graph defined by G = (V, E, I, O, τ, Φ), where V and E
represent the set of nodes and the set of edges, respectively. Each node in an SDFG is
called an actor, and each edge is called a communication channel, or simply a channel. The
notation is based on its earlier use in DSP applications comprising function blocks and
the communication channels interconnecting them. An actor repeats its task infinitely,
and the execution of its task is called a firing. In this chapter, we use the terms node and
actor, and edge and channel, interchangeably. An actor can produce and consume data
per channel at different rates, which are specified by the number of tokens.
The number of tokens is a positive integer. If multiple inputs and outputs are
associated with an actor, it is assumed that the actor waits until every input buffer holds
the tokens to be consumed and every output buffer has space available.
A homogeneous SDFG, in which at most one token is produced or consumed per
firing on each edge, is a special case of an SDFG.
The number of tokens that actors produce and consume is specified by sets, I and
O. I is a set of numbers of tokens consumed by destination actors of edges, and O is a
set of numbers of tokens produced by source actors of edges. Thus, each edge (u, v)
Figure 2-1. An example of SDFG: (a) an SDFG in which actor u (firing r_u times per iteration) produces O_uv = 2 tokens per firing and actor v (firing r_v times) consumes I_uv = 1 token; (b) a timeline of the schedule over iterations 0 and 1.
is associated with two integer values, Iuv and Ouv . Consider the sample SDFG shown
in Figure 2-1 (a). The edge (u, v) has two associated integer values, Iuv and Ouv , which
are 1 and 2, respectively. This represents the fact that actor u produces 2 tokens at each
firing and actor v consumes 1 token at each firing. In addition, τ is a set of execution
times of actors, and the execution time of each actor’s firing is denoted by τi . Finally, a
set Φ represents the initial numbers of tokens on edges, which are necessary for the
start of iterative operations of a SDFG.
Using the known properties of a homogeneous SDFG allows us to derive the
maximal computation rates as well as buffer requirements. Also, it can be shown
that any arbitrary SDFG can be converted into a homogeneous SDFG, although this
conversion may increase the size of the network exponentially.
To adapt the SDFG model for e-Science applications, it is important to understand
the key differences between e-Science and DSP applications. A summary of differences
between DSP and e-Science applications is provided in Table 2-1. Unlike DSP
applications, e-Science applications can be represented by acyclic graphs, have
fixed start and end time, and have communication delays to be considered. The time
unit of DSP applications is on the order of a few milliseconds, compared to the time unit
of e-Science applications that may be from a few hours to several days. Throughput is
the most important objective in both DSP and e-Science applications. However, for DSP
Table 2-1. Comparison between DSP and e-Science applications

Category               DSP application                  e-Science application
Inter-task dependency  Cycles are allowed.              Usually acyclic.
Execution period       Infinite.                        Finite.
Time unit              Small (a few milliseconds).      Ranges from small to large
                                                        (a few minutes).
Compute resource       Unlimited.                       Unlimited, or limited if compute
                                                        resources must be co-allocated.
Communication delay    Assumed to be 0.                 Needs to be considered.
Temporal constraints   Objective is maximizing          Throughput.
                       computation rate (throughput).
Schedule               Static or dynamic.               Static.
applications, tradeoffs are between throughput and buffer size, while for e-Science, the
tradeoff is generally between throughput and network resource requirements. The focus
of our work is on optimizing these resources.
Lee [61] divided scheduling of parallel computation defined by SDFG into four
classes: fully dynamic, static assignment, self-timed, and fully static. Fully dynamic
scheduling schedules actors at run-time only. In static assignment, assignment of actors
to processors is done off-line and a local run-time scheduler of each processor invokes
actors assigned to the processor. In self-timed scheduling, the assignment and ordering
of actors on each processor is determined off-line and exact firing time is scheduled at
run-time. In other words, the actor that will be executed by a certain processor waits
for all input data to be available and is fired once all input data are ready. Finally, fully
static scheduling determines all information off-line. Based on this classification, the
target e-Science applications can be considered to be self-timed. A node of SDFGs for
e-Science applications represents one site, such as a data server or a computing node.
This implies that every actor is assigned to a unique processor that only manages that
task.
As described earlier, an SDFG is represented by G = (V, E, I, O, τ, Φ). Since actors
can produce or consume tokens at different rates, a feasible schedule should guarantee
that tokens are not infinitely accumulated. In Figure 2-1 (a), actor u produces 2 tokens at
each firing, while actor v consumes 1 token. To prevent infinite buffer overflow, actor u
should be fired once for every two firings of actor v . Formally, this can be stated by the
equation, ru×2 = rv×1, where ru and rv denote firing rates of actor u and v , respectively.
These kinds of equations are called balance equations or state equations. To solve
balance equations formally, we need to define a topology matrix, where ei denotes the
i th edge and Oei and Iei denote the number of produced tokens and consumed tokens,
respectively, on an edge ei .
Definition 1 (Topology matrix). A topology matrix Γ is an |E| × |V| matrix with entries

    Γ_ij = O_ei          if edge e_i = (v_j, v_k),
           −I_ei         if edge e_i = (v_k, v_j),
           O_ei − I_ei   if edge e_i = (v_j, v_j), i.e., a self-loop,
           0             otherwise.                                  (2–1)
The topology matrix for Figure 2-1 (a) is Γ = (2 −1). The existence of a solution, as
well as a method to solve the balance equations, can be shown using the following
theorem.
Theorem 2.1 ([62]). A connected SDF graph with n actors has a periodic schedule if and
only if its topology matrix Γ has rank n − 1. Further, if its topology matrix has rank n − 1,
then there exists a unique smallest integer solution q to the balance equations Γq = 0. It
can be shown that the entries in the vector q are coprime.
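The smallest integer solution q of Theorem 2.1 (the repetition vector) can be computed without a full linear-algebra package by propagating fractional rates along the edges. The sketch below is our illustration, not from the dissertation, and assumes a connected graph with actors numbered 0..n−1.

```python
from fractions import Fraction
from functools import reduce
from math import gcd, lcm

def repetition_vector(num_actors, edges):
    """Smallest coprime integer solution q of the balance equations
    Gamma * q = 0 (Theorem 2.1).  `edges` is a list of tuples
    (u, v, produced, consumed): actor u produces `produced` tokens and
    actor v consumes `consumed` tokens per firing on edge (u, v).
    Returns None if the balance equations are inconsistent, i.e., the
    topology matrix does not have rank n - 1 and no periodic schedule
    exists.  Assumes the SDFG is connected.
    """
    rate = [None] * num_actors
    adj = {i: [] for i in range(num_actors)}
    for u, v, p, c in edges:
        adj[u].append((v, Fraction(p, c)))  # balance: r_v * c = r_u * p
        adj[v].append((u, Fraction(c, p)))
    rate[0] = Fraction(1)                   # fix one rate, propagate the rest
    stack = [0]
    while stack:
        u = stack.pop()
        for v, factor in adj[u]:
            r = rate[u] * factor
            if rate[v] is None:
                rate[v] = r
                stack.append(v)
            elif rate[v] != r:              # conflicting balance equations
                return None
    # Scale the fractional rates to the smallest coprime integer vector.
    denom = reduce(lcm, (r.denominator for r in rate))
    q = [int(r * denom) for r in rate]
    g = reduce(gcd, q)
    return [x // g for x in q]
```

For the SDFG of Figure 2-1 (a), `repetition_vector(2, [(0, 1, 2, 1)])` returns `[1, 2]`, matching the balance equation r_u × 2 = r_v × 1.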
Given rates of actors obtained by Theorem 2.1, {r1, r2, · · · , rn}, one iteration is
defined as a schedule containing ri firings of actor i . Figure 2-1 (b) shows the optimal
schedule for the SDFG in Figure 2-1 (a) when both actors u and v have self-dependency
loops and the execution times of the actors are all 1.
Theorem 2.2 ([84]). For a homogeneous SDFG represented by G = (V, E, I, O, τ, Φ),
the maximal computation rate of every node in the graph is given by

    min_{∀C} ( Σ_{(i,j)∈C} Φ_ij ) / ( Σ_{i∈C} τ_i ),    (2–2)

where C is any cycle in the graph.
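Equation 2–2 can be evaluated directly on small graphs by enumerating simple cycles. The brute-force sketch below is our illustration, not the cited authors' algorithm; it returns the minimum token-to-time ratio over all cycles.

```python
from fractions import Fraction

def max_computation_rate(tau, phi):
    """Maximal computation rate of a homogeneous SDFG per Theorem 2.2:
    the minimum over simple cycles C of (sum of initial tokens Phi on
    C's edges) / (sum of execution times tau of C's actors).
    `tau` is a list of actor execution times; `phi` maps edges (u, v)
    to their initial token counts.  Each simple cycle is enumerated
    once, rooted at its smallest-numbered actor.  Returns None if the
    graph is acyclic (the rate is then not bounded by any cycle).
    """
    adj = {}
    for (u, v) in phi:
        adj.setdefault(u, []).append(v)
    best = None

    def dfs(start, u, path, tokens):
        nonlocal best
        for v in adj.get(u, ()):
            if v == start:                       # closed a simple cycle
                ratio = Fraction(tokens + phi[(u, v)],
                                 sum(tau[w] for w in path))
                if best is None or ratio < best:
                    best = ratio
            elif v > start and v not in path:    # keep `start` as the min node
                dfs(start, v, path + [v], tokens + phi[(u, v)])

    for s in range(len(tau)):
        dfs(s, s, [s], 0)
    return best
```

A two-actor loop with one initial token and unit execution times, `max_computation_rate([1, 1], {(0, 1): 0, (1, 0): 1})`, gives `Fraction(1, 2)`: one iteration every two time units, as in the schedule of Figure 2-1 (b).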
Regardless of the SDFG type, i.e., homogeneous or multi-rate, the computation
rate of an SDFG is defined as the number of iterations per unit time. The maximal
computation rate of a homogeneous SDFG can be derived by examining all cycles
in the graph. Theorem 2.2 says the maximal computation rate of a homogeneous
SDFG is bounded by the minimum initial token-to-time ratio cycle in the graph. As
for a homogeneous SDFG, the maximal computation rate of an iteration equals
the maximal computation rate of a node, since the number of firings of a node in
one iteration is 1. For a multi-rate SDFG, however, we can compute the maximal
computation rate of a node in two steps. First, we can compute the maximal computation
rate of an iteration after converting the multi-rate SDFG into a homogeneous SDFG.
Figure 2-2 shows the homogeneous SDFG converted from the multi-rate SDFG in
Figure 2-1 when putting a self-dependency loop on each node. A certain node u with a
rate ru in a multi-rate SDFG will be expanded to ru number of nodes in the homogeneous
SDFG converted from the multi-rate SDFG [49]. Hence, the maximal computation rate of
an iteration with regard to Figure 2-2 is 1/2, obtained through the equation

    min{ 1/(1 × r_u), 1/(1 × r_v) } = min{1, 1/2}.

Figure 2-2. A homogeneous SDFG converted from Figure 2-1 (a)

Next, we can compute the maximal rate of each node by multiplying the maximal rate of
an iteration by the rate of the node. In this example, the maximal rates of nodes u and v
are 1/2 (= 1/2 × r_u) and 1 (= 1/2 × r_v), respectively. In this chapter, we call the number
of firings of a node per unit time the throughput of the node.
2.3 Problem Formulation
In this section, we propose an algorithm for determining efficient bandwidth
allocations to edges of the original network topology graph while satisfying temporal
constraints such as throughput, required by an e-Science application whose data
dependency is given by a SDFG. In addition, with these bandwidth allocations, we can
minimize buffer size requirements and find the corresponding schedules.
The overall process of finding the full-fledged solution for an e-Science application is
summarized as follows.
1. Discretization step: In this step, both time and data size take discretized values: execution time, data transmission time, and data transfer size. Discretization is an important requirement for using the SDFG model. For target applications, a base unit for execution and communication times can be chosen and appropriate rounding can be performed. A base time unit should be fine-grained enough to differentiate each actor's execution time and temporal constraints. The base unit can be a few seconds to several hours, depending on the application.

2. Firing rates of actors: Using Theorem 2.1, firing rates of actors guaranteeing a well-behaved SDFG, i.e., one free of deadlock and infinite buffer accumulation, can be calculated. In MATLAB, the firing rates of actors can be obtained through a simple operation, null(Γ, 'r'), where Γ is the topology matrix of the SDFG and null is a MATLAB function returning a solution Z for Γ × Z = 0.

3. Path bandwidth selection: E-Science applications are distributed, and connection-oriented communication paths among distributed nodes are set up on demand or in advance. The bandwidth of paths is guaranteed by network technologies such as multi-protocol label switching (MPLS) and generalized multi-protocol label switching (GMPLS). The communication delay of a path to which bandwidth is allocated on request, within the available bandwidth, is inversely proportional to the allocated bandwidth. Hence, path bandwidth allocation should also be taken into account, since it can affect throughput as well as total network resource consumption. A formal problem formulation is presented in Section 2.3.2.2.

4. Buffer space minimization: The buffer space requirement is the total number of tokens queued on every edge. Clearly, different schedules can lead to different buffer space requirements. The following buffer minimization problem is NP-complete [76]: given a homogeneous SDFG, is there a valid schedule for the SDFG whose buffer space requirements are less than a constant K? It is easier, however, to find the minimum buffer space when the computation rate is fixed, even though the problem is still NP-complete. Using an approach similar to Govindarajan [50] but adapted to e-Science applications, we use a two-phase approach: we first find the optimal solution for the bandwidth allocation problem, then use this solution to minimize the buffer requirements. After the solution to the BAFS problem (described in the next section) is obtained, we have to find exact schedules and minimize buffer space requirements. Since we choose a model where the communication delays are included in the execution times of actors, previously developed algorithms for buffer space requirement minimization can be applied directly. The buffer space requirement minimization problem has been solved in the context of DSP applications in many papers [50, 94, 103].

5. Adjust for deployment in a real system: The implementation of the derived solution requires a few considerations. The generated solution consists of discretized values in terms of the chosen base time and data units. As long as we can ensure that the discretized problem has stricter constraints than the original problem, such as higher production rates and lower consumption rates, the resulting solution remains feasible. Additionally, in the absence of a global clock, synchronization issues need to be considered to force firings of tasks to follow the computed schedule. Self-timed scheduling may not achieve the maximal rate without a global clock if buffer space is limited and not properly synchronized with the actors' schedules. However, for reasonable buffer sizes, the deterioration of the maximal rate will be small.
2.3.1 Illustrative Example
We pick the visualization application in [53] as a representative example of
e-Science applications that can be modeled by an extended synchronous dataflow
Figure 2-3. A real example of e-Science applications [53]: (a) the distributed visualization demonstration; (b) its abstraction into data source, computing, and visualization stages (LSU, San Diego, Brno).
graph (ESDFG). The visualization application shown in Figure 2-3 (a) has a use-case
scenario as follows.
For the demonstration in San Diego, CCT/LSU (Louisiana), CESNET/MU (Czech
Republic) and iGrid/Calit2 (California) participated in a distributed collaborative session.
The visualization front-end is located at LSU running Amira for the 3D texture-based
volume rendering for distributed visualization. The visualization back-end (data
server) also ran at LSU. The actual data set for the demonstration had a size of 120
Gbytes and contained 400^3 data points at each timestep (4 bytes of data per point, for
256 Mbytes per timestep).
In this chapter, we assume a more general model, similar to the use-case in [46],
extended from this application such that data servers reside at different sites from
computing sites. This general model can be abstracted as the diagram in Figure 2-3 (b).
Figure 2-4. An ESDFG model for Figure 2-3. Actor execution times are shown in parentheses (0(1), 1(1), 2(2), 3(2), 4(1)); the data-center edges carry 128 produced/256 consumed tokens, and the visualization edges 1/1.
The system parameters of the visualization application are summarized in Table
2-2. If not explicitly mentioned, all the parameters are per one firing. The figures
marked by bold type are parameters that are not explicitly given in [53], thus arbitrarily
chosen by us within a reasonable range of the associated hardware's performance.

Table 2-2. Summary of system parameters of the visualization application
Item                            Continuous value        Discretized value
Data centers
  Production                    2560 Mbyte              128
  Execution time                1 second                1
Computing site at LSU
  Consumption                   256 Mbyte               256
  Production                    1 frame (1 Mbyte)       1
  Execution time                100 ms                  2
Visualizing site at San Diego
  Consumption                   1 frame (1 Mbyte)       1
  Execution time                100 ms                  2
  Throughput                    At least 5 frames/sec   0.25
Visualizing site at Brno
  Consumption                   1 frame (1 Mbyte)       1
  Execution time                50 ms                   1
  Throughput                    At least 5 frames/sec   0.25
Base time unit: 50 ms; base data unit: 1 Mbyte

The
discretized values for the parameters are computed with appropriately chosen base time
and data unit. For example, the data production speed of data centers, 2560 Mbyte/s,
is discretized into 128 tokens per unit time, since the base time unit is 50 ms and the rate
of 2560 Mbyte/s equals the rate of 128 Mbyte per 50 ms. The resultant ESDFG for the
application is shown in Figure 2-4.
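The conversion just described can be written down directly. The helpers below are our illustration of the arithmetic, working in integer milliseconds to avoid floating-point rounding.

```python
def tokens_per_unit(rate_mb_per_s, base_ms=50, base_data_mb=1):
    """Discretize a continuous production rate into tokens per base time
    unit, e.g., 2560 Mbyte/s with a 50 ms unit and 1 Mbyte tokens -> 128."""
    return rate_mb_per_s * base_ms // (1000 * base_data_mb)

def units(duration_ms, base_ms=50):
    """Discretize a duration into whole base units, rounding up so the
    discrete schedule never underestimates an actor's execution time."""
    return -(-duration_ms // base_ms)  # ceiling division on integers
```

These reproduce the discretized values for the visualization sites in Table 2-2: a 100 ms firing becomes 2 units and a 50 ms firing becomes 1.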
Second, the firing rates of nodes are calculated using simple math on a topology
matrix of the ESDFG, as described in Section 2.2.
Γ =
    [ 128    0  −256    0    0 ]
    [   0  128  −256    0    0 ]
    [   0    0     1   −1    0 ]
    [   0    0     1    0   −1 ]
The solution for rates of nodes is given by [2, 2, 1, 1, 1]. Each element of the solution
vector corresponds to r1 through r5, respectively.
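The solution can be checked by substituting it back into the balance equations; each row of Γ times q must vanish.

```python
# Verify Gamma * q = 0 for the topology matrix and rate vector above.
gamma = [
    [128,   0, -256,  0,  0],
    [  0, 128, -256,  0,  0],
    [  0,   0,    1, -1,  0],
    [  0,   0,    1,  0, -1],
]
q = [2, 2, 1, 1, 1]
residual = [sum(g * r for g, r in zip(row, q)) for row in gamma]
print(residual)  # [0, 0, 0, 0]
```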
The next step is to formulate the problem as a linear program.
2.3.2 Optimal Bandwidth Allocation with a Feasible Schedule
To include temporal constraints such as throughput, we define extended SDFG
(ESDFG) as follows.
Definition 2 (Extended SDFG (ESDFG)). An ESDFG is represented by

    G = (V, E, I, O, τ, D, st, et, T),

where V, E, I, O, τ, D are the same as in an SDFG,
st and et are the start and end times of the execution period of the SDFG, and
T = {(v, T_v) | v ∈ V, T_v ∈ R}.
The set T contains throughput constraints, each defined as a two-tuple (v, Tv),
where v is a node whose throughput must be equal to or greater than Tv. st
and et are used for in-advance bandwidth reservations. Suppose we maintain
data structures for in-advance bandwidth reservations, such as time-bandwidth lists
that record how the available bandwidth on each edge varies over time. We can then
easily obtain a subgraph in which the available bandwidth of each edge is set to the
minimum bandwidth available during the period [st, et). For example, if an edge eij has
available bandwidth 1 over the period [0, 1) and 2 over [1, 2), and st and et are given as 0
and 2, then eij in the subgraph has an available bandwidth of 1. The BAFS problem
formulation operates on this subgraph when in-advance bandwidth reservations are considered.
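Per edge, the subgraph construction reduces to taking the minimum bandwidth over the segments of the time-bandwidth list that overlap the reservation window. A minimal sketch (the function name and the list format are our own assumptions):

```python
def available_bandwidth(tb_list, st, et):
    """Minimum bandwidth guaranteed on an edge over [st, et), given a
    time-bandwidth list of (t_start, t_end, bandwidth) segments.  This
    sketches the subgraph construction described above."""
    return min(bw for (t0, t1, bw) in tb_list if t0 < et and t1 > st)

# the example from the text: bandwidth 1 over [0, 1) and 2 over [1, 2)
print(available_bandwidth([(0, 1, 1), (1, 2, 2)], 0, 2))  # -> 1
```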
Informally, we can define the bandwidth allocation with a feasible schedule (BAFS)
problem as follows: given a network topology represented by G = (V, E) and iterative
data-dependent tasks represented by an ESDFG, Gt = (Vt, Et, It, Ot, τt, st, et, T), what
is the optimal bandwidth allocation with a feasible schedule that minimizes network
resource consumption and meets the temporal requirements?
The formal problem formulation is presented below, after a discussion of how to
model communication delays in established paths for e-Science applications.
2.3.2.1 Modeling communication delays
A communication delay is composed of four factors: processing delay, transmission
delay, queueing delay, and propagation delay. Processing delay is associated with
operations such as packetizing, and is thus proportional to the data size, as is transmission
delay. Queueing delay is stochastic, and propagation delay is constant for a given link.
In this chapter, we assume that e-Science applications run on dedicated networks, i.e.,
MPLS or GMPLS networks, where the paths are established using label switched paths
(LSPs). For such scenarios, queueing delay and propagation delay can be ignored. We
assume that transmission delay dominates the total delay; the processing delay can be
incorporated into the transmission delay because both kinds of delay are proportional to the
data size. We therefore optimize with respect to transmission delay.
We now investigate how to incorporate communication delays into an optimal
computation rate problem given an SDFG. To the best of our knowledge, this has not
been addressed in the literature on SDF modeling for DSP applications. Although
communication delays have been considered in multiprocessor scheduling, the focus
there is mainly on the makespan of schedules, i.e., the total time taken to execute all the
tasks specified by a precedence task graph, not on the throughput of infinitely repeated
schedules. Our target applications are e-Science applications whose data-dependent
distributed nodes collaborate iteratively.
Figure 2-5 (a) shows an SDFG consisting of two actors, u and v. Actor u produces 2
tokens per firing, and actor v consumes 1 token per firing. We assume that it takes 2 units
of time for actor u to send 2 tokens to actor v. The value in parentheses inside a node
indicates the execution time of the node.
There are two ways to integrate the communication delay within the SDF model.
1. The communication delay can be included in the execution time of the producing actor u (Figure 2-5 (b)).
2. The communication delay can be included by adding a dummy actor c whose execution time is set to the communication delay (Figure 2-5 (c)).
Figure 2-5. Modeling communication delay in a SDFG
The first option (Figure 2-5 (b)) implies that communication can occur right after tokens
are produced in the producer’s buffer and the producer cannot be fired again until
transfer of produced tokens is done. This is the most conservative way of modeling
a communication delay since the relation between the execution and communication
is assumed to be synchronous. We call this model the conservative model in this
chapter. If we are not sure how the program is implemented internally, we can take this
conservative model to guarantee the final solution meets the throughput requirements.
The second option (Figure 2-5 (c)) implies that communication can run independently
of the producer. This, in general, can lead to higher buffer space requirements, but
may result in a higher computation rate: as can be seen in Figure 2-5 (c), the optimal
schedule achieves higher throughput than the optimal schedule in Figure 2-5 (b).
We call this the optimistic model, as opposed to the conservative model. Either
model can be chosen independently for each node, and the details of how this choice
is handled in the problem formulation are presented in Section 2.3.
In some cases, an actor may have multiple outgoing communication channels (Figure
2-6 (a)). As with a single communication channel, we can choose between two options:
a conservative approach and an optimistic approach. The conservative approach
adds max{communication delays of the outgoing channels} to the execution time of
Table 2-3. Notation for problem formulation

Functions:
  vt(v)   vt : Z → Z, maps a vertex v in V to a vertex in Vt.
  Com(a)  Com : V → boolean; returns true if actor a is a dummy node modeling a communication delay.

Constants and sets:
  G       (V, E), the original network topology.
  Gt      (Vt, Et, It, Ot, τt, st, et, T), an ESDFG specifying iterative data-dependent tasks.
  Jc      {(si, di) | si ∈ V, di ∈ V, (vt(si), vt(di)) ∈ Et}, the set of communication jobs modeled by the conservative approach, each defined by a two-tuple of source and destination nodes.
  Jo      {(si, di) | si ∈ V, di ∈ V, c ∈ Vt, (vt(si), c) ∈ Et, (c, vt(di)) ∈ Et}, the set of communication jobs modeled by the optimistic approach, likewise defined by two-tuples of source and destination nodes.
  J       Jc or Jo, depending on the approach.
  sj      sj ∈ V, j ∈ Jc ∨ j ∈ Jo, source node of job j.
  dj      dj ∈ V, j ∈ Jc ∨ j ∈ Jo, destination node of job j.
  τi      Execution time of node (actor) i ∈ Vt.
  ri      Rate of node (actor) i ∈ Vt.
  Ij      j ∈ J, amount of data (number of tokens) consumed by actor vt(dj).
  Oj      j ∈ J, amount of data (number of tokens) produced by actor vt(sj).
  Clk     Available bandwidth on edge (l, k) ∈ E during the period [st, et).
  Vtf     The set of front-end nodes whose throughputs are of concern, Vtf ⊂ Vt.
  Td      Throughput requirement of node (actor) d ∈ Vtf, specified by users.

Variables:
  Rmax    The maximal computation rate of an iteration.
  td      Throughput of node (actor) d ∈ Vtf.
  f^j_lk  Flow of job j on edge (l, k) ∈ E.
  Dj      Allocated bandwidth for job j.
a producer actor. Figure 2-6 (b) shows such a case, where the execution time of actor
u increases by 3, i.e., max{2, 1, 3}. One drawback of this model is that it prevents
early-executable actors from starting on their own schedules. For example, actor w in Figure
2-6 (b) cannot be fired 1 unit of time after u finishes its execution; instead, it must wait 2
more units of time. The other approach, the optimistic one shown in Figure 2-6 (c), is the same
as in the single-channel case: for each channel, a logical actor accounting for the
corresponding communication delay is inserted between the original
producer and consumer actors.
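Both transformations can be sketched as a rewrite of the SDFG itself. The function below is our own illustration of the models of Figures 2-5 and 2-6, not code from the framework; actor names, the edge-tuple layout, and the dummy-actor naming are assumptions.

```python
def apply_delay_model(exec_time, edges, delay, model="conservative"):
    """Rewrite an SDFG to account for communication delays.  `exec_time`
    maps actor -> execution time, `edges` is a list of
    (src, dst, produced, consumed), and `delay` maps (src, dst) -> delay."""
    times = dict(exec_time)
    if model == "conservative":
        # fold the largest outgoing delay into each producer (Figure 2-6 (b))
        worst = {}
        for (s, d, p, c) in edges:
            worst[s] = max(worst.get(s, 0), delay[(s, d)])
        for s, w in worst.items():
            times[s] += w
        return times, list(edges)
    # optimistic: one dummy actor per channel (Figure 2-6 (c)); the dummy
    # fires once per producer firing, so it consumes and produces p tokens
    new_edges = []
    for (s, d, p, c) in edges:
        comm = f"c_{s}_{d}"
        times[comm] = delay[(s, d)]
        new_edges.append((s, comm, p, p))
        new_edges.append((comm, d, p, c))
    return times, new_edges

# Figure 2-6: U(2) feeds V, W, X over channels with delays 2, 1, 3
exec_time = {"U": 2, "V": 1, "W": 1, "X": 1}
edges = [("U", "V", 1, 1), ("U", "W", 1, 1), ("U", "X", 1, 1)]
delay = {("U", "V"): 2, ("U", "W"): 1, ("U", "X"): 3}
t, _ = apply_delay_model(exec_time, edges, delay, "conservative")
print(t["U"])  # -> 5, i.e. 2 + max{2, 1, 3}, matching Figure 2-6 (b)
```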
Figure 2-6. Modeling communication delay in the case of multiple communicationchannels
A more elaborate analysis of an actor's execution pattern may lead to more exact
modeling; Figure 2-7 shows in which cases and how our models can be improved. The
semantics of SDF enforce that the output of an actor is generated at the end of the
actor's execution. Hence, in the case of Figure 2-6 (a), actors v, w, and x can start their
own execution at least 2, 1, and 3 units of time, respectively, after actor u's firing is done,
which means actor x can start at time 5 if actor u is fired at time 0. However, suppose that
the output data for actor x is actually generated at time 1, one unit into u's execution.
The communication delay on the channel between actors u and x can then be adjusted
to 2, as in Figure 2-7. The subsequent procedures for incorporating communication delay
into the SDF model take either of Figure 2-6 (b) and (c) as input.
Figure 2-7. More exploited parallelism in case of multiple communication channels
2.3.2.2 Problem formulation
The notation for the BAFS problem is summarized in Table 2-3. The BAFS problem
can be formulated as the linear programs shown in Figures 2-8 and 2-9 for the
conservative and the optimistic models, respectively.
Objective:

  minimize Σ_{j∈J, (l,k)∈E} f^j_lk                                            (2–1)

Multi-commodity flow constraints:

  Σ_{k:(l,k)∈E} f^j_lk − Σ_{k:(k,l)∈E} f^j_kl = 0,  l ≠ sj, l ≠ dj, ∀j ∈ J    (2–2)

  Σ_{j∈J} f^j_lk ≤ Clk,  ∀(l,k) ∈ E                                           (2–3)

  Σ_{k:(l,k)∈E} f^j_lk − Σ_{k:(k,l)∈E} f^j_kl = Dj if l = sj, −Dj if l = dj,  ∀l ∈ V, j ∈ J    (2–4)

  0 ≤ f^j_lk,  ∀j ∈ J, (l,k) ∈ E                                              (2–5)

  0 ≤ Dj                                                                      (2–6)

Temporal constraints:

  Rmax ≤ 1 / (ri (τi + Oj/Dj)),  i ∈ Vt, j ∈ Jc, (vt(sj), vt(dj)) ∈ Et        (2–7)

  td = Rmax · rd,  d ∈ Vtf                                                    (2–8)

  Td ≤ td ≤ rd / (ri (τi + Oj/Dj)),  d ∈ Vtf, i ∈ Vt, j ∈ Jc, (vt(sj), vt(dj)) ∈ Et    (2–9)

  Td · (τi Dj + Oj) ≤ (rd / ri) Dj,  d ∈ Vtf, i ∈ Vt, j ∈ Jc, (vt(sj), vt(dj)) ∈ Et    (2–10)

Figure 2-8. BAFS problem formulation in the case of the conservative model
Objective:

  minimize Σ_{j∈J, (l,k)∈E} f^j_lk                                            (2–11)

Multi-commodity flow constraints:

  Σ_{k:(l,k)∈E} f^j_lk − Σ_{k:(k,l)∈E} f^j_kl = 0,  l ≠ sj, l ≠ dj, ∀j ∈ J    (2–12)

  Σ_{j∈J} f^j_lk ≤ Clk,  ∀(l,k) ∈ E                                           (2–13)

  Σ_{k:(l,k)∈E} f^j_lk − Σ_{k:(k,l)∈E} f^j_kl = Dj if l = sj, −Dj if l = dj,  ∀l ∈ V, j ∈ J    (2–14)

  0 ≤ f^j_lk,  ∀j ∈ J, (l,k) ∈ E                                              (2–15)

  0 ≤ Dj                                                                      (2–16)

Temporal constraints:

  Rmax ≤ 1 / (ri · Oj/Dj),  if Com(i) = true and j ∈ Jo, (vt(sj), i) ∈ Et     (2–17)

  td = Rmax · rd,  d ∈ Vtf                                                    (2–18)

  Td ≤ td ≤ rd / (ri · Oj/Dj),  if Com(i) = true and d ∈ Vtf, j ∈ Jo, (vt(sj), i) ∈ Et    (2–19)

  Td · Oj ≤ (rd / ri) Dj,  if Com(i) = true and d ∈ Vtf, j ∈ Jo, (vt(sj), i) ∈ Et         (2–20)

Figure 2-9. BAFS problem formulation in the case of the optimistic model
The problem formulation builds on the multi-commodity flow problem, for
which a variety of efficient solutions exist in the literature [27]. The major differences
between a typical multi-commodity flow problem and this problem formulation are as
follows:
1. The demand of each job is not a constant but a decision variable. It determines how much bandwidth is allocated to a job, i.e., to the communication between producer and consumer actors.
2. The decision variables are constrained by temporal constraints pertaining to the throughputs of actors.
The objective of the linear program (given in Equations 2–1 and 2–11) is to
minimize network resource consumption, which is the total amount of bandwidth
allocated over all edges in the original network topology. If the demands were constant
values, the objective could be regarded as minimizing the average hop count of all jobs
(communication channels), approximating the average hop count as
(total network traffic) / (total demand). However, since the demands are decision variables,
this objective is better thought of as minimizing allocated network resources regardless
of the average hop count of the jobs.
The constraints can be divided into two parts. The first part consists of typical
multi-commodity flow constraints. Equations 2–2 and 2–12, the flow conservation
constraints, mandate that for every job the net flow into a node is zero, i.e., the incoming
and outgoing flows of a node are balanced unless the node is a source or a destination.
Equations 2–3 and 2–13, the capacity constraints, mandate that the flow along any edge
cannot exceed the capacity of the edge. Equations 2–4 and 2–14 ensure that the source
and the destination of any job produce and consume, respectively, the flow of the job, Dj.
The second part concerns temporal constraints, guaranteeing the throughputs
of front-end actors. As discussed earlier, Theorem 2.2 states that the maximum
computation rate is limited by the cycle whose cost-to-time ratio is minimum. Accordingly,
Equations 2–7 and 2–17 account for communication delays on the outgoing edges of a
given node. Since the target applications are acyclic e-Science applications, the cycles
we must consider for the maximum computation rate are limited to self-dependency
loops, where the number of tokens is 1 and the total execution times are ri(τi + Oj/Dj) and
ri · Oj/Dj for the conservative and optimistic models, respectively. The term Oj/Dj accounts
for the communication delays of the outgoing edges of actor i. In addition, since Theorem 2.2
applies to homogeneous SDFs, and considering the conversion from the given ESDFG to a
homogeneous SDF [76], the execution time, (τi + Oj/Dj) or Oj/Dj, must be multiplied by the
firing rate ri, as in Equations 2–7 and 2–17. Since Rmax is the number of iterations per unit
time and the firing rate is the number of firings per iteration, the throughput of a
node d equals Rmax · rd, as in Equations 2–8 and 2–18. Equations 2–8 and 2–18 can be
transformed into Equations 2–9 and 2–19, respectively, since the required throughput is
guaranteed if Rmax · rd is greater than or equal to the Td specified by users. With a
few transformations, the throughput constraints yield Equations 2–10 and 2–20, which
are linear inequalities.
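The self-dependency-loop bounds can be checked numerically. The sketch below, with an assumed per-actor tuple layout, reproduces the Rmax bounds for the two-actor example of Figure 2-5; for the optimistic model, the computation self-loop bound 1/(ri·τi) is included alongside the communication bound of constraint 2–17.

```python
def rmax(actors, model):
    """Upper bound on the iteration rate Rmax implied by the
    self-dependency loops of Section 2.3.2.2.  Each actor is a tuple
    (r, tau, O, D): firing rate, execution time, tokens sent per firing,
    and allocated bandwidth for its outgoing job (O = 0 means no job).
    Illustrative sketch; the tuple layout is our own."""
    bounds = []
    for (r, tau, O, D) in actors:
        if O == 0:                           # no outgoing communication job
            bounds.append(1.0 / (r * tau))
            continue
        comm = O / D                         # transmission delay Oj / Dj
        if model == "conservative":
            # communication folded into the actor's execution time (2-7)
            bounds.append(1.0 / (r * (tau + comm)))
        else:
            # optimistic: computation and the dummy communication actor
            # bound the rate separately (2-17)
            bounds.append(min(1.0 / (r * tau), 1.0 / (r * comm)))
    return min(bounds)

# Figure 2-5: u (rate 1, exec time 1) sends 2 tokens at bandwidth
# 1 token/unit to v (rate 2, exec time 1); communication delay = 2
u, v = (1, 1, 2, 1), (2, 1, 0, 1)
print(rmax([u, v], "conservative"))  # -> 0.333..., one iteration per 3 units
print(rmax([u, v], "optimistic"))    # -> 0.5, one iteration per 2 units
```

The optimistic model's higher rate matches the higher-throughput schedule of Figure 2-5 (c).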
The solution of the linear program determines the optimal bandwidth allocation
on edges; exact schedules and the associated buffer space allocation can then be
computed using the algorithms in [50].
2.4 Experimental Evaluation
To our knowledge, there is no other research work on the BAFS problem in the context of
grid computing. We compare our LP-based algorithm for the BAFS problem with a heuristic
that simply uses the definition of throughput between two actors in the assignment process.
We also compare the two variants of our algorithm, the conservative and the optimistic one.
The heuristic is presented in Algorithm 2-1. It enumerates all paths relevant to the
throughput requirements and computes delays on edges under the assumption
that all edges of a given path have the same delay. If a tighter delay is
required while examining paths, the edge's delay is updated to the tighter value.
For example, in Figure 2-4, the throughput of node 3 depends on two paths: 0 → 2 → 3
and 1 → 2 → 3. Suppose the communication delay on edge (2, 3) must be 2 to achieve
the throughput required by node 3 along the path 0 → 2 → 3, but must be 1 when the
path 1 → 2 → 3 is considered; the communication delay on edge (2, 3) is then updated to 1.
The heuristic considers neither possible parallelism of tasks nor possible balanced
bandwidth allocations on edges, since it computes delays by assuming that all delays on a
path are the same.
We compare the two algorithms in terms of the rejection ratio of requests. The bandwidths
of edges are randomly selected from a uniform distribution between 10 and 1024 units of
data per base unit time. We varied the number of requests from 1 to 16 on the Abilene
network [11] (see Figure 2-10), and each request is a specified task graph (Figure
2-4). The nodes of a request were constrained to have a matching node in the original
network topology graph, and the matching node is randomly assigned using a uniform
distribution.
Algorithm 2-1. A heuristic for the BAFS problem
Input: an ESDFG
1: Enumerate all possible paths from front-end nodes to back-end nodes whose throughputs are specified by the temporal requirements.
2: Initialize the delay on each edge i, edi, to ∞.
3: for each path do
4:   Assume the same delay, d, on all edges of the path.
5:   Compute d satisfying the temporal requirements.
6:   if edi > d then
7:     edi ← d
8:   end if
9: end for
10: Compute the bandwidth on each edge based on edi.
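Algorithm 2-1 can be rendered as a short routine. This is our reading of the pseudocode: step 5 is interpreted as splitting the iteration period 1/Td evenly over the hops of a path, and the final bandwidth is the amount needed to move one data unit within the edge's delay budget; all names are our own.

```python
import math

def bafs_heuristic(paths, throughput_req):
    """Sketch of Algorithm 2-1.  `paths` maps a front-end node to the
    list of paths (lists of edges) feeding it; `throughput_req` maps the
    node to its required throughput Td."""
    edge_delay = {}                          # ed_i, lazily initialized to inf
    for node, node_paths in paths.items():
        period = 1.0 / throughput_req[node]  # time budget per iteration
        for path in node_paths:
            d = period / len(path)           # equal delay on every edge
            for e in path:                   # keep the tighter delay
                edge_delay[e] = min(edge_delay.get(e, math.inf), d)
    # bandwidth on edge e to move one data unit within edge_delay[e]
    return {e: 1.0 / d for e, d in edge_delay.items()}

# node n3 requires throughput 0.5 via a two-hop and a one-hop path
bw = bafs_heuristic({"n3": [[("0", "2"), ("2", "3")], [("1", "3")]]},
                    {"n3": 0.5})
print(bw)  # -> {('0', '2'): 1.0, ('2', '3'): 1.0, ('1', '3'): 0.5}
```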
Figure 2-10. The Abilene network
The experimental results are shown in Figure 2-11. Both of our LP-based
approaches achieve better rejection ratios than the heuristic, lower by 5 to 30%.
[Plot: rejection ratio (%) versus number of requests (1, 2, 4, 8, 16) for the Heuristic, LP-Conservative, and LP-Optimistic approaches.]
Figure 2-11. Rejection ratio vs. number of requests
The drawback of the heuristic is twofold. First, it does not consider the fact that the
schedules of iterations can overlap. Second, it cannot allocate bandwidth to the
links interconnecting nodes according to the current network status, i.e., the currently
available bandwidth on each link. In contrast, the optimistic approach, which assumes
that data transfers can occur in parallel with computational executions, uses the
least amount of network resources to achieve the throughput requirements given by users.
Consequently, the optimistic approach yields the lowest rejection ratio: each request
has a better chance of being accepted because it requires less bandwidth, and
subsequent requests benefit from a less loaded network.
2.5 Summary
The dedicated networks on which e-Science applications operate guarantee
that a given path can have a reserved bandwidth over a given period, which means
the communication delays vary depending on the allocated bandwidths. We develop an
SDF-based model for iterative data-dependent e-Science applications that incorporates
variable communication delays and temporal constraints such as throughput. We
formulate the problem as a variant of multi-commodity flow linear programming with
the objective of minimizing network resource consumption while meeting the temporal
constraints. The resulting solution can then be used to derive buffer space requirements
by previously developed algorithms in the context of DSP applications. Finally,
an illustrative example of an e-Science application shows that the proposed framework
and algorithm are suitable for modeling and analyzing iterative data-dependent e-Science
applications. The simulation results show that the optimal bandwidth allocation produced
by the formulated linear program outperforms the bandwidth allocation of a simple
heuristic in terms of the rejection ratio of requests.
In future work, we will extend our framework so that it also schedules computation jobs
onto distributed computing resources when such mappings are not known ahead of time,
and so that it maximizes the overall throughput when multiple SDFGs with throughput
requirements are given.
CHAPTER 3
TOPOLOGY AGGREGATION
3.1 Overview
The need for transporting large volumes of data in e-Science has been well argued
[33, 78]. For instance, HEP data are expected to grow from the current petabytes
(PB, 10^15 bytes) to exabytes (10^18 bytes) sometime between 2012 and 2015. In addition,
e-Scientists desire schedulable network services to support predictable work processes [46].
Quality of service (QoS) in network applications has been an active research area for several
decades. Recently, new technologies such as multiprotocol label switching (MPLS) and
generalized multiprotocol label switching (GMPLS) have drawn renewed attention to QoS
routing, since they make it possible for network managers to set up and tear
down explicit paths while guaranteeing specified amounts of bandwidth.
The network supporting e-Science applications typically comprises multiple
domains. The domains usually belong to different organizations and are managed
under different operational policies. In such cases, the internal topology of a domain
may not be visible to the others, for security or other reasons; however, aggregated
information about the internal topology and its associated attributes is advertised to the
other domains.
The set of techniques for aggregating the data a domain advertises externally is called
Topology Aggregation (TA), and the aggregated data itself is termed an Aggregated
Representation (AR). A survey of TA algorithms is presented in [98]. There is a
tradeoff between the accuracy and the size of an AR; hence, most algorithms proposed in
prior work try to achieve the most efficient AR in terms of both accuracy and
space complexity.
One can classify QoS path requests into two classes, single-path single-job
(SPSJ) and multiple-path multiple-job (MPMJ), depending on the nature of the requests.
SPSJ corresponds to a situation in which requests for a single QoS path arrive and
are scheduled in order of arrival. In contrast, MPMJ corresponds to batch/off-line
scheduling of multiple requests for multiple QoS paths. Many e-Science applications
require simultaneous transfer of data between multiple sources and destinations, and
each of these requests (e.g., file transfers) can be supported more efficiently by using
concurrent multiple paths.
We show that existing TA approaches developed for SPSJ do not work well with
MPMJ applications as they overestimate the amount of bandwidth that is available. We
propose a max flow based TA approach that is suitable for this purpose. Our simulation
results demonstrate that our algorithms result in better accuracy or less scheduling time.
BGP, the deployed inter-domain routing protocol, is of limited use for AR
techniques, as it is not flexible enough to be extended to accommodate many QoS
parameters; it was originally designed only for distributing reachability
information [107]. Recently, a new network model based on the Path Computation
Element (PCE) has been proposed to overcome these drawbacks of BGP [45]. A PCE is
an entity capable of computing network paths using a traffic engineering database that
contains the required network status information, such as topology and link delays. Recent
papers [80, 85, 93] have based their network models on the PCE-based architecture. We
develop TA algorithms in the context of the PCE-based architecture that support most
e-Science applications. In particular, the following network model is assumed throughout
this chapter.
1. A centralized PCE exists in each domain. A node sends a request to the PCE to make a reservation for a QoS path.
2. Centralized PCEs flood aggregated topology information to one another, so that every centralized PCE maintains a complete, aggregated view of the network outside its own domain.
The first condition states that one active element in a domain acts as a supernode
for that domain, knowing all the information essential for QoS path computation.
One possible implementation is that every node in a domain sends a request for a QoS
Figure 3-1. An example of inter-domain QoS routing
path to the designated centralized PCE; the PCE can therefore maintain consistent
information on the network status related to the QoS parameters. The second condition
can reasonably be assumed in e-Science networks, whose size is very small
compared to the Internet; it enables us to directly apply the QoS routing
algorithms that have been developed so far. In this network architecture, one domain
can advertise its aggregated topology information and associated QoS parameters to all
the other domains.
Based on the described network model, a scenario of inter-domain QoS routing
works as in Figure 3-1.
• STEP 1: A source node sends a path computation request to the single centralized PCE in the same domain.
• STEP 2: The PCE replies with a coarse path, i.e., a sequence of border nodes without the detailed hops between them.
• STEP 3: With the coarse path, the source node sends a path setup request that will traverse the border nodes of the coarse path.
• STEPS 4 and 5: A border node that receives a path setup request obtains a strict path for the coarse path from the PCE in its own domain.
• STEP 6: The same steps repeat until the path setup request reaches the destination node.
TA algorithms can also be used for scheduling paths within a single domain. They
are useful here because a large domain can be partitioned into subdomains and a TA
algorithm applied to each. With ARs of the subdomains, the actual
scheduling may be performed either on a single node with rich compute resources or
on a distributed set of nodes; either way, the time complexity of path scheduling is
reduced by running the scheduling algorithms on the smaller partitioned subdomains.
The rest of the chapter is organized as follows. The related work on TA is described
in Section 3.2. Section 3.3 describes novel algorithms for MPMJ. Section 3.4 describes
how real routing works for TA algorithms, and Section 3.5 gives time and space
complexity comparison analysis. The experimental results by simulation are given in
Section 3.6, and, finally, we conclude in Section 3.7.
3.2 Related Work
TA consists of algorithms and mechanisms for reducing the size of the topological
information and associated attributes of a domain or subdomain while maintaining
a certain level of accuracy. Uludag et al. [98] presented a survey of these algorithms for
multi-domain environments. All TA algorithms have two elements: an aggregated graph,
and aggregated QoS parameter values, called epitomes, assigned to the logical links of the
aggregated graph.
Typical topologies for aggregated graphs are full-mesh, simple compaction,
tree-based, and star-based topologies. Some other topologies, e.g., Shufflenet
[108], have been proposed to reduce space complexity in specific cases such as
asymmetric networks. Most TA algorithms start by building a full-mesh graph, i.e., a
complete graph whose nodes are the border nodes of the original
network. Algorithms that focus on the size of the AR usually try to transform
the full-mesh graph into a more compact form, for example a spanning tree or a star
topology, while trying to preserve the accuracy of the full-mesh AR. For the aggregated
QoS parameter values, the epitomes, the maximum, minimum, or average of the QoS
values is typically used.
TA algorithms for SPSJ in large-scale multi-domain networks focus on the
compaction of ARs; accuracy is a secondary issue. For TA algorithms in small-sized
networks, accuracy has been the main focus [80, 85, 88, 93]. For a single QoS
constraint, a distortion-free algorithm exists [98]. But for two QoS constraints, one
additive and one restrictive, the problem gets more complicated: even though
the problem itself is not intractable, a distortion-free representation is not compact. For
this reason, several approximating algorithms that minimize distortion, such as the line
segment algorithm [68], have been proposed. Usually, the multiple-QoS-constraints
problem is generalized as one restrictive constraint with multiple additive constraints, since a
multiplicative constraint such as link reliability can be transformed into an additive one
through a log operation.
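The log transformation is exact rather than approximate, which is why this generalization is safe. A small sketch:

```python
import math

def to_additive(reliabilities):
    """Multiplicative link metrics (reliabilities in (0, 1]) become
    additive costs via -log: the cost of a path is then the sum of its
    link costs, and the most reliable path is the minimum-cost path."""
    return [-math.log(r) for r in reliabilities]

links = [0.99, 0.95, 0.9]
# the transform is exact: exp(-sum of costs) recovers the product
assert abs(math.exp(-sum(to_additive(links))) - math.prod(links)) < 1e-12
```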
To the best of our knowledge, all existing TA algorithms are limited to routing a single QoS
path at a time, i.e., SPSJ, with a few exceptions of algorithms customized for
special purposes such as the computation of reliable paths. MPMJ applications consider
a batch of jobs at a time, and multiple paths are allowed per job. For instance, a
request for the earliest finish time of a given multiple-source multiple-destination data
transfer, one of the important e-Science applications [46], is handled as a whole,
and multiple paths are set up for the request.
Emerging technologies such as MPLS and GMPLS make it possible to implement
applications with strict QoS requirements on networks equipped with such facilities.
Special-purpose networks, such as the research networks linking national
labs in the U.S., can be set up for this purpose [83]. Especially for inter-domain QoS path
routing in such special-purpose networks, the accuracy of the aggregated topologies and
associated QoS parameter values matters more than the size of the data exchanged
among domains, since the number of domains is small relative to the
Internet, which comprises a huge number of hosts and switches. Thus the need for
more accurate ARs is prominent.
As described above, one of the most recent works on TA for two QoS
constraints is the line segment algorithm for delay-bandwidth sensitive networks [68].
The line segment algorithm first computes 2-D charts, whose x-axis and y-axis are delay
[Figure panels: (a) an example of multi-domain networks, AS1–AS2–AS3; (b) the internal topology of AS2, with border nodes B1 and B2 and internal nodes N1 and N2.]
Figure 3-2. An illustrative example for limitations of the line segment algorithm
and bandwidth, respectively, for every pair of border nodes. The chart contains all the
information needed for computing QoS paths with delay and bandwidth constraints. The
authors of [68] proposed approximating this information by a line to
reduce the size of the data representing all possible delay-bandwidth combinations between
two border nodes; this is possible because the charts take the shape of an increasing
staircase function. The next step is to establish a full-mesh topology and convert it to a
star topology to further reduce the space complexity to O(|B|).
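As an illustration of the approximation step, a staircase of (delay, bandwidth) corner points can be collapsed to a single line. A least-squares fit is used here purely as a stand-in; the exact fitting rule of [68] may differ.

```python
def line_segment(points):
    """Least-squares line through the (delay, bandwidth) corner points
    of a staircase chart, returning (slope, intercept).  Illustrative
    stand-in for the fitting rule of the line segment algorithm."""
    n = len(points)
    sx = sum(d for d, _ in points)
    sy = sum(b for _, b in points)
    sxx = sum(d * d for d, _ in points)
    sxy = sum(d * b for d, b in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n

# three corners of a toy staircase; the fit is exact when collinear
print(line_segment([(1, 1), (2, 2), (3, 3)]))  # -> (1.0, 0.0)
```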
With existing TA algorithms for SPSJ, there is no way to estimate whether more than
one path is available between two border nodes. Consider the multi-domain network in
Fig. 3-2: it consists of three domains, where AS1 is connected to AS3 via AS2.
Suppose that a host in AS1 wants to find max-flow paths, or reliable paths composed
of a primary and a backup path, to a certain host in AS3. If a TA algorithm such as the
line segment algorithm is deployed in this network, the PCE in AS1 computes paths
based on the AR from AS2, which only states how much bandwidth is
available within a certain delay. Since the PCE in AS1 has no clue how many paths exist
internally in AS2, the computed max-flow or reliable paths are not necessarily the most
accurate paths that could be computed from complete network status information.
3.3 TA for Multiple-Path Multiple-Job (MPMJ)
3.3.1 Problem Statement
When it comes to scheduling a batch of multiple jobs while allowing multiple paths,
existing TA techniques are not useful, because the performance degradation can be
significant. For example, the full-mesh AR for bandwidth scheduling, in which each logical
link carries as its epitome the maximum available bandwidth between two border nodes,
is known to be distortion-free for single-path bandwidth scheduling. However,
it may not be effective for multi-path bandwidth scheduling, e.g., max-flow bandwidth
scheduling.
An important class of e-Science applications is bulk file transfers. For example,
in high energy physics, large files are routinely transferred between tiered centers
that are geographically distributed around the world. The generated data have to
be transferred from the sites where they are stored to research centers for analysis or
visualization. In the context of e-Science applications, bandwidth scheduling problems
range from single-source single-destination data transfer optimization to multiple-source
multiple-destination data transfer optimization. The computational complexity of such
problems depends directly on the space complexity of the network topology. Generally,
we can break network resource provisioning for e-Science applications
into an admission control phase and a network resource (i.e., bandwidth) allocation
phase. In the admission control phase, the acceptance of requested jobs is determined;
if a job is accepted, explicit bandwidth allocation for each link is carried out in the
network resource allocation phase. With compact network information abstracted from
the complete network topology, chances are that the network resource allocation phase
may fail due to inaccurate network status information. Even though a request accepted
in the admission control phase may be rejected in the network resource allocation phase
due to inaccurate ARs, the benefits of lower space complexity compensate for failed
operations as long as the error rate is fairly small.
In the following subsections, we propose several TA algorithms suited for MPMJ.
Each request consists of one or more data transfer jobs, and multiple paths are allowed
per job. The only QoS parameter considered is bandwidth.
3.3.2 New Topology Aggregation Algorithms
3.3.2.1 Full-mesh method
The most typical way of aggregating networks with QoS parameters is to build a
full-mesh topology by connecting every pair of nodes of interest and assigning epitomes
to the resulting logical links. Following this conventional approach, we can build a full-mesh AR
in which the max flow value between each pair of nodes is assigned to the corresponding
logical link. Consider the edge connecting nodes D1 and D2 in Fig. 3-3: the epitome
associated with the edge ED1D2, F12, can be computed using any known max flow
algorithm. The algorithm for building a full-mesh AR is described in Algorithm 3-1.
Figure 3-3. Full-mesh AR: nodes of interest D1–D4 connected pairwise by logical links carrying epitomes F12,21, F13,31, F14,41, F23,32, F24,42, and F34,43.
Algorithm 3-1 Full-mesh AR construction
Input: a graph G = (V, E).
1: Pick nodes of interest from the full set of nodes, V, and add them to the aggregated representation.
2: for each pair of picked nodes do
3:    Create a link between the two nodes.
4:    Compute a max flow value between the two nodes.
5:    Assign the computed max flow value as an epitome to the link created above.
6: end for
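As an illustrative sketch, Algorithm 3-1 can be implemented in Python with an Edmonds-Karp max flow routine; the toy topology below (an interior node X joining three nodes of interest) is invented for illustration and is not from the experiments in this chapter.

```python
from collections import deque
from itertools import combinations

def max_flow(cap, s, t):
    """Edmonds-Karp max flow on cap: {u: {v: capacity}} (undirected links listed both ways)."""
    res = {u: dict(vs) for u, vs in cap.items()}        # residual capacities
    for u in list(res):
        for v in list(res[u]):
            res.setdefault(v, {}).setdefault(u, 0)      # ensure reverse arcs exist
    flow = 0
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:                    # BFS for a shortest augmenting path
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        bott, v = float("inf"), t                       # bottleneck along the path
        while parent[v] is not None:
            bott = min(bott, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:                    # augment along the path
            res[parent[v]][v] -= bott
            res[v][parent[v]] += bott
            v = parent[v]
        flow += bott

def full_mesh_ar(cap, interest):
    """Algorithm 3-1: one epitome (max flow value) per pair of nodes of interest."""
    return {(a, b): max_flow(cap, a, b) for a, b in combinations(interest, 2)}

def add_edge(cap, u, v, c):                             # undirected link of capacity c
    cap.setdefault(u, {})[v] = c
    cap.setdefault(v, {})[u] = c

cap = {}
add_edge(cap, "D1", "X", 10)                            # X is an interior (non-interest) node
add_edge(cap, "X", "D2", 7)
add_edge(cap, "X", "D3", 5)
print(full_mesh_ar(cap, ["D1", "D2", "D3"]))
# {('D1', 'D2'): 7, ('D1', 'D3'): 5, ('D2', 'D3'): 5}
```

Each epitome is simply the bottleneck between the pair through X, which is exactly the per-pair information a full-mesh AR records.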
This simple method, adapted from existing TA techniques for SPSJ, quickly turns
out to be inappropriate for MPMJ. Take the example of a job requesting the max flow
between D1 and D2, where D1, D2, D3, and D4 are nodes of interest. Since we consider
Figure 3-4. Star AR: nodes of interest D1–D4 connected to a central logical node L by logical links carrying epitomes F1–F4.
MPMJ in both single-domain and multi-domain network environments, the nodes D1, D2,
D3, and D4 need not be border nodes. The max flow value between D1 and
D2 computed on this AR would be far larger than the correct value, since the underlying
capacity is counted again along other logical paths such as D1 → D3 → D2.
3.3.2.2 Star method
A full-mesh AR does not effectively support MPMJ as the maximum amount of flow
that a specific node can push into a network is not restricted.
For single-path computation algorithms, most recent TA techniques start from
a full-mesh AR and derive variants from it, such as partial full-mesh,
star, and tree ARs. For multiple-path computation algorithms, however, the overestimation
described above prevents a full-mesh AR from being used as the base for
other ARs that are efficient in terms of space complexity.
A star AR as in Fig. 3-4 can overcome the drawbacks of a full-mesh AR by limiting
the max flow value from any node. First, the logical node, L, is created and all nodes
of interest are connected to it. Suppose that four nodes of interest (D1, D2, D3 and D4)
are connected to the central logical node L. The epitome assigned to the logical link
connecting a given node and the central logical node L is the max flow value from that
node to all the remaining nodes. It is easily computed by attaching a supersource node
to the node and a supersink node to all the remaining nodes, and
running a max flow algorithm between the supersource and the supersink. In
this case, F1 is the max flow value that node D1 can send into the network, which is easily
computed by adding a supersink node connected to D2, D3, and D4 and running a max flow
algorithm between D1 and the supersink node. Likewise, we can compute the other
epitomes F2, F3, and F4. This AR has only one outgoing link from each node,
which keeps a node from sending data flow beyond the epitome assigned to its
outgoing link. A formal description of the algorithm is presented in Algorithm 3-2.
Algorithm 3-2 Star AR construction
Input: a graph G = (V, E).
1: Pick nodes of interest from the full set of nodes, V.
2: Create a single logical node, L.
3: for each of the picked nodes do
4:    Create a link between the node and the logical node, L.
5:    Compute a max flow value from the node to all the remaining picked nodes.
6:    Assign the computed max flow value as an epitome to the link created above.
7: end for
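A corresponding sketch of Algorithm 3-2, again with an Edmonds-Karp routine and the same invented toy topology; the supersink node T with infinite-capacity edges is one convenient realization of the construction described above.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max flow on cap: {u: {v: capacity}}."""
    res = {u: dict(vs) for u, vs in cap.items()}
    for u in list(res):
        for v in list(res[u]):
            res.setdefault(v, {}).setdefault(u, 0)      # ensure reverse arcs exist
    flow = 0
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:                    # BFS for an augmenting path
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        bott, v = float("inf"), t
        while parent[v] is not None:
            bott = min(bott, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            res[parent[v]][v] -= bott
            res[v][parent[v]] += bott
            v = parent[v]
        flow += bott

def star_ar(cap, interest):
    """Algorithm 3-2: epitome F_i = max flow from node i to all remaining nodes of
    interest, computed via a supersink T attached with infinite-capacity edges."""
    epitomes = {}
    for node in interest:
        aug = {u: dict(vs) for u, vs in cap.items()}    # copy, then attach supersink
        for other in interest:
            if other != node:
                aug.setdefault(other, {})["T"] = float("inf")
        epitomes[node] = max_flow(aug, node, "T")
    return epitomes

cap = {}  # the same toy topology: interior node X joins D1, D2, D3
for u, v, c in [("D1", "X", 10), ("X", "D2", 7), ("X", "D3", 5)]:
    cap.setdefault(u, {})[v] = c
    cap.setdefault(v, {})[u] = c
print(star_ar(cap, ["D1", "D2", "D3"]))  # {'D1': 10, 'D2': 7, 'D3': 5}
```

Note how the epitome caps each node's total injectable flow: D1 can push at most 10 (its link to X), even though the full-mesh AR would have let it push 7 + 5 through its two logical links.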
3.3.2.3 Partitioned star method
After performing experiments on the star AR, we observed that it shows
little distortion compared to the original topology. However, TA algorithms should also take
into account how results computed on an AR can be translated into actual path setups on
the original topology.
Originally, TA arose from efforts to deal with scalability issues related to
space complexity, and with security issues regarding intradomain topology, in multi-domain
network environments. Usually, routing procedures consist of two steps: (1) path
computation and bandwidth allocation with ARs, and (2) explicit path computation and
bandwidth allocation with original network topology for each domain. Similar steps can
also be applied for single domain network environments, where several subdomains
exist for hierarchical routing or we can intentionally partition one domain into several
logical subdomains. In this case, the benefits from TA are almost the same as those in
multi-domain network environments. In the case of MPMJ applications, however, we can
expect more benefits in terms of computational complexity. A detailed computational
complexity analysis is given in Section 3.5.
The partitioned star method tries to combine the benefits of the star and full-mesh
methods by partitioning a domain into k subdomains, each of which is aggregated
using the star method described above. Fig. 3-5 shows an example of a domain with four
partitioned subdomains. In this chapter, we use general graph partitioning algorithms,
which are widely used in many other areas of computer science, including load distribution
in parallel computers, sparse matrix computations, and the design of very large scale
integrated (VLSI) circuits [58]. The algorithm for building a partitioned star AR is described
in Algorithm 3-3.
Algorithm 3-3 Partitioned star AR construction
Input: a graph G = (V, E) and k, the number of partitions (subdomains).
1: Pick nodes of interest from the full set of nodes, V, and add them to the aggregated representation.
2: Partition the graph into k parts so that the nodes of interest are evenly distributed over the parts.
3: Identify cut nodes and cut edges, and add them to the aggregated representation.
4: for each part do
5:    Construct a star AR with the picked nodes and cut nodes in the part.
6: end for
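Steps 2 and 3 can be sketched as follows; here the partition is supplied by hand rather than computed by a graph partitioner such as those surveyed in [58], and the node names are illustrative only.

```python
def cut_structure(edges, part_of):
    """Identify cut edges (endpoints in different parts) and cut nodes (their
    endpoints), per step 3 of Algorithm 3-3. `part_of` maps node -> partition id."""
    cut_edges = [(u, v) for u, v in edges if part_of[u] != part_of[v]]
    cut_nodes = sorted({n for e in cut_edges for n in e})
    return cut_nodes, cut_edges

# A path A - B - C - D split into two subdomains {A, B} and {C, D}.
edges = [("A", "B"), ("B", "C"), ("C", "D")]
part_of = {"A": 0, "B": 0, "C": 1, "D": 1}
print(cut_structure(edges, part_of))  # (['B', 'C'], [('B', 'C')])
```

Step 5 would then run Algorithm 3-2 within each part, treating the picked nodes and cut nodes of that part as its nodes of interest.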
3.4 Routing
With the network model described in Section 3.1, inter-domain QoS path routing
is relatively easy compared to QoS path routing in a distance vector routing protocol.
Any centralized PCE can compute a path to a destination which consists of a strict
path within its own domain and a coarse inter-domain path to the destination domain.
The coarse inter-domain path is composed of border nodes, and when the path setup
Figure 3-5. Partitioned star AR: a domain partitioned into four subdomains with logical nodes L1–L4; nodes of interest D1–D4 and cut nodes C1–C8 connect to their subdomain's logical node via links carrying epitomes FD1–FD4 and FC1–FC8.
request is received by a border node on the intermediate path, it is translated into a
strict path composed of intra-domain routers or switches. On the other hand, when
the line segment algorithm is deployed, computing a QoS path proceeds in two steps.
First, delay values are assigned to the virtual edges of the aggregated representations
of all domains except the one in which the source node resides; given the bandwidth
requirement of a request, the corresponding delay is computed with the line segment
algorithm. Second, any shortest path algorithm, such as Dijkstra's algorithm, is run
on the resulting graph with delay attributes on its edges.
Inter-domain routing for SPSJ applications is described in Section 3.1. The
routing procedure for MPMJ applications is the same as that for SPSJ applications.
The results of any algorithm, e.g., a maximum bandwidth path algorithm, run on the ARs
are expanded within each domain or subdomain by running the same algorithm on the
original topology of that domain or subdomain. If operations fail in any of the domains
or subdomains, the entire operation will fail. Note that the reason MPMJ applications
in intradomain environments use ARs of subdomains is to reduce the time complexity
of scheduling, whereas SPSJ or MPMJ in interdomain environments are forced to use
ARs for security or administrative reasons. The benefits of using ARs in intradomain
environments from the perspective of time complexity will be described in Section 3.5.
Table 3-1. Time complexity for MPMJ

  Method             Time complexity
  Full-mesh          O(n^3 D^2)
  Star               O(n^3 D)
  Partitioned star   O((n/k)^3 (C + D))

  D = number of nodes of interest; C = number of cut nodes; k = number of partitions.
3.5 Complexity Analysis
Usually, algorithms for MPMJ have higher computational complexity than
algorithms for SPSJ. Dijkstra's algorithm can be used to derive the maximum bandwidth
path between two nodes, which corresponds to the single-path max flow. The
complexity of Dijkstra's algorithm is O(n log n + m), where n is the number of vertices
and m is the number of edges. In contrast, the complexity of the push-relabel max flow
algorithm is O(n^3) [27]. This shows that algorithms for MPMJ may incur computational
costs a few orders of magnitude higher than those for SPSJ.
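The maximum bandwidth path computation mentioned above can be sketched as a Dijkstra variant that maximizes the bottleneck bandwidth rather than minimizing path length; the graph and bandwidth values below are invented for illustration.

```python
import heapq

def widest_path(adj, s, t):
    """Maximum-bandwidth (widest) path via a Dijkstra variant: a node's label is the
    best bottleneck bandwidth found from s, and the node with the largest label is
    expanded first. adj: {u: {v: bandwidth}}; undirected links listed both ways."""
    best = {s: float("inf")}
    heap = [(-best[s], s)]                   # max-heap via negated labels
    while heap:
        neg_w, u = heapq.heappop(heap)
        w = -neg_w
        if w < best[u]:
            continue                         # stale heap entry
        if u == t:
            return w
        for v, bw in adj[u].items():
            cand = min(w, bw)                # bottleneck if we extend the path to v
            if cand > best.get(v, 0):
                best[v] = cand
                heapq.heappush(heap, (-cand, v))
    return 0                                 # t unreachable from s

adj = {"A": {"B": 5, "C": 3}, "B": {"A": 5, "C": 10}, "C": {"A": 3, "B": 10}}
print(widest_path(adj, "A", "C"))  # 5, via A -> B -> C (the direct link offers only 3)
```

With a binary heap this runs in O(m log n), matching the single-path cost cited above up to the heap variant used.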
The time complexities of the TA algorithms for MPMJ are summarized in Table 3-1. The
full-mesh method requires O(n^3 D^2) time, and the star and partitioned star methods require
O(n^3 D) and O((n/k)^3 (C + D)) time, respectively, where D is the number of nodes of interest,
C is the number of cut nodes, and k is the number of partitions. The complexity of the max
flow algorithm is assumed to be O(n^3), and the number of partitions in the partitioned
star method is given as k.
The space complexities are summarized in Table 3-2. The space complexities of the
ARs for the full-mesh, star, and partitioned star methods are O(D^2), O(D), and O(C + D),
respectively. Suppose that a certain algorithm for MPMJ applications takes O(n^3) time. If we
run the algorithm on the ARs, it takes O((C + D)^3) time on the AR itself plus k · O((n/k)^3)
time for explicit routing in the partitions. Since (C + D) and k are small compared to n,
and n^3 exceeds (n/k)^3 by a factor of k^3, we can expect that the
Table 3-2. Space complexity for MPMJ

  Method             Space complexity
  Full-mesh          O(D^2)
  Star               O(D)
  Partitioned star   O(C + D)

  D = number of nodes of interest; C = number of cut nodes.
partitioned star method can expedite the path computation and bandwidth allocation
process significantly.
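As a back-of-the-envelope illustration of this speedup, with assumed sizes (n = 300 nodes, k = 4 partitions, and C + D = 20; these values are our own, not from the experiments):

```python
n, k, CD = 300, 4, 20                       # assumed sizes: n nodes, k partitions, C + D = 20
direct = n ** 3                             # O(n^3) MPMJ algorithm on the full topology
partitioned = CD ** 3 + k * (n // k) ** 3   # run on the AR, then expand in each partition
print(round(direct / partitioned, 1))       # 15.9
```

Under these assumptions the partitioned star approach saves roughly a factor of 16 in operation count, dominated by the k^3 reduction of the per-partition term.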
3.6 Experimental Evaluation
3.6.1 Bulk File Transfers in E-Science
We chose a bulk file transfer application in [65] as a typical MPMJ e-Science
application to show that our proposed algorithms perform better than naive algorithms
adapted from SPSJ TA algorithms. In [65], the authors formulated the in-advance
scheduling of multiple bulk file transfers as a linear programming problem. We adapted
their linear programming formulation to on-demand scheduling of multiple bulk file
transfers for our simulation. The linear programming formulation is shown in Figure
3-6. The notations and equations are borrowed from [65] whenever possible. In this
formulation, t_f denotes the time by which all file transfers complete. The objective of
this linear programming problem is to find the earliest finish time. f^j_{lk} is the amount of
file transferred for request j ∈ F on link (l, k) ∈ E, and b_{lk} is the bandwidth available on
link (l, k). Equation 3-3 ensures that for each transfer request j ∈ F and each node l
that is neither the source nor the destination node, the amount of file j that leaves node
l equals the amount that enters it. Equation 3-4 requires the source node of
request j to send a net f_j units of file j out and requires the destination node to receive
a net f_j units. Equation 3-5 ensures that the amount of traffic on each link does not
exceed its available capacity in the interval [0, t_f). Equation 3-6 ensures that
file transfer amounts are non-negative.
minimize   t_f                                                              (3-1)
subject to                                                                  (3-2)
   \sum_{k:(l,k)\in E} f^j_{lk} - \sum_{k:(k,l)\in E} f^j_{kl} = 0,
       \forall j \in F, \forall l \in V, l \neq s_j, l \neq d_j             (3-3)
   \sum_{k:(l,k)\in E} f^j_{lk} - \sum_{k:(k,l)\in E} f^j_{kl} =
       \begin{cases} f_j, & \text{if } l = s_j \\ -f_j, & \text{if } l = d_j \end{cases},
       \forall j \in F                                                      (3-4)
   \sum_{j\in F} f^j_{lk} \le b_{lk} \cdot t_f, \quad \forall (l,k) \in E   (3-5)
   f^j_{lk} \ge 0                                                           (3-6)
Figure 3-6. Earliest finish time on-line scheduling of multiple file transfers
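As a concrete illustration of the formulation in Figure 3-6, the sketch below solves a made-up instance with scipy.optimize.linprog: a single 10-unit transfer from A to B over links A→B, A→C, and C→B, each with 5 units of bandwidth. The variables are x = (f1, f2, f3, t_f); Equation 3-5 is rearranged as Σ_j f^j_{lk} − b_{lk}·t_f ≤ 0 so it stays linear in x. The instance, node names, and sizes are our own, and SciPy is assumed to be available.

```python
from scipy.optimize import linprog

# Variables x = [f1, f2, f3, tf]: flow on A->B, A->C, C->B, and the finish time.
c = [0, 0, 0, 1]                       # minimize tf                       (Eq. 3-1)
A_eq = [[0, 1, -1, 0],                 # conservation at C: f2 - f3 = 0    (Eq. 3-3)
        [1, 1,  0, 0]]                 # source A emits f1 + f2 = 10       (Eq. 3-4)
b_eq = [0, 10]
A_ub = [[1, 0, 0, -5],                 # f1 <= 5*tf                        (Eq. 3-5)
        [0, 1, 0, -5],                 # f2 <= 5*tf
        [0, 0, 1, -5]]                 # f3 <= 5*tf
b_ub = [0, 0, 0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)  # f, tf >= 0 by default
print(round(res.fun, 6))               # 1.0: both paths saturate at 5 units/time
```

Here both the direct link and the two-hop path carry 5 units each, so the 10-unit transfer finishes at t_f = 1.0; using the direct link alone would have taken 2.0.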
3.6.2 Experiment Testbed
For the TA algorithms for MPMJ, we performed experiments on random single-domain
networks. Random network topologies are generated by the BRITE internet
topology generation package [72]. We tried several models, such as Waxman and BRITE,
but the results show similar trends across models; therefore, we report only results
for random topologies following the Waxman model with an average node degree of 4.
The bandwidth of each edge is randomly selected from a uniform distribution between
10 and 1024. The number of nodes in each domain is varied from 100 to 300 in increments
of 50. The nodes of interest are picked randomly within a domain, and their number
ranges from 1 to 16, doubling at each step. We generated a synthetic set of data transfer
requests, each described by the 3-tuple (source node, destination node, requested file
transfer size). The number of requests is randomly selected between 1 and the maximum
possible number of requests determined by the number of nodes of interest. For example,
if the number of nodes of interest is 4, the maximum possible number of requests is 4 × 3. The
source and destination nodes for each request are randomly selected using a uniform
random number generator. The results are averaged over 100 random networks for a
certain number of nodes.
3.6.3 Performance Metrics
The performance metric we use to compare the different approaches is the earliest
finish time (EFT) to complete all of the given data transfer requests. One would expect
a good AR approach to perform close to the original topology. Hence, we use the error
ratio (ER), which measures the deterioration from the correct EFT computed on the
original topology; a TA algorithm with a lower ER performs better. ER is formally defined as

ER = (TA EFT − Original EFT) / Original EFT.
3.6.4 Results
We measured ER according to the equation defined in Section 3.6.3. The
computational times taken for each algorithm are also gathered to show how much
computation cost reduction we can get from the compact representation. Fig. 3-7 shows
that the star and partitioned star methods give around 5% ER. This is because the
EFT application tends to find and allocate all the available bandwidth in the network,
and the star and partitioned star ARs limit that bandwidth in much the same way as the
original network does. In addition, we observed that as the number of requests
increases, the ER improves because all the network resources, i.e., the bandwidths, are
eventually used up. As expected, the full-mesh AR performs worst.
Our simulation results in Fig. 3-8 also show that the star method is comparable
in accuracy to the partitioned star method but significantly faster.
This is not only because the star method is a more compact representation, but also
Figure 3-7. Error ratio (%) vs. the number of nodes, for the full-mesh, star, and partitioned star methods.
Figure 3-8. Normalized computational time vs. the number of source and destination nodes, for the original topology and the full-mesh, star, and partitioned star methods.
partly because, for randomly generated networks, the number of cut nodes is relatively
large. If a domain is structured as a backbone with attached networks, the number of cut
nodes can be reduced to a reasonable value, which would enhance the performance of the
partitioned star method.
3.7 Summary
We propose several topology aggregation algorithms for e-Science networks.
E-Science applications require higher-quality intradomain and interdomain QoS
paths, and some of these are distinct from classic single-path single-job (SPSJ)
applications. We define a new class of requests, called multiple-path multiple-job
(MPMJ), and propose TA algorithms for the new class of applications. The proposed
algorithms, star and partitioned star ARs, are shown to be significantly better than naive
approaches. In particular, the star AR shows the best performance in terms of computational
time, and its accuracy is very close to that of scheduling on the entire topology.
Thus, it is well suited for multi-domain e-Science applications.
CHAPTER 4
WORKFLOW SCHEDULING
4.1 Overview
An application scientist typically solves his/her problem as a series of transformations.
Each transformation may require one or more inputs and may generate one or more
outputs. The inputs and outputs are predominantly files. However, we expect that files
will be replaced by databases for many applications. The sequence of transformations
required to solve a problem can be effectively modeled as a Directed Acyclic Graph
(DAG) for many practical applications of interest that this chapter is targeting.
Figure 4-1 describes a DAG consisting of 17 nodes, representing dependencies
among 17 tasks of an application. For example, the arc from task E to task B represents
the fact that the output generated by task E is utilized by task B. Each task involves
transforming the data or storing an intermediate result for archiving. The time requirement
for solving the entire DAG for large scale applications may be of the order of hours to
days, even assuming that each task is executed on a cluster of workstations
or a parallel supercomputer. DAGs have been widely used for the development of
scheduling algorithms in the computer science literature [38, 105].
The general form of DAG scheduling has been shown to be NP-hard [91], and a
number of heuristics have been proposed. Early research regarded communication
costs as small [26, 39] or assumed a very simple interconnection network model,
i.e., a fully-connected network model without contention [59, 60, 79, 97, 99]. The
heterogeneous-earliest-finish-time (HEFT) algorithm extended for heterogeneous
computing resources was proposed in [97].
For the distributed applications that we are targeting, the amount of data that
needs to be transferred between tasks may be of the order of hundreds of gigabytes to
multiple terabytes. Thus, the key challenge is to be able to schedule a workflow such
that the total execution time and the communication costs are minimized. The former
Figure 4-1. A DAG consisting of 17 nodes, representing dependencies among 17 tasks of an application. For example, the arc from task E to task B represents the fact that the output generated by task E is utilized by task B.
requires mapping the tasks to appropriate machines while the latter requires the use
of high bandwidth networks and effective scheduling of the communication bandwidth.
The past research on scheduling DAGs (e.g. [38, 105]) is generally limited to solving
compute intensive problems. In contrast, we will propose new algorithms to map tasks
that have large data access requirements onto distributed heterogeneous clusters and
supercomputers. For many applications, a node in the task graph can also represent
multiple concurrent and interacting subtasks. If these subtasks are mapped to multiple
machines, the required interaction has to be mapped on to the underlying network to
support this interaction. For such DAGs, the precedence is between sets of multiple
subtasks.
The extended list scheduling algorithms in [92] and [90], targeted at heterogeneous
cluster architectures, address this network contention issue with various priority-assignment
schemes, and assume that the path between any two processors is determined and fixed
by the target system using conventional algorithms such as breadth-first search
(BFS). On the other hand, similar work on DAG scheduling has been done in the
literature of grid computing. Generally, the term workflow is used interchangeably
with DAG in the context of grid computing. A taxonomy of previous work on the
workflow scheduling problem in grid computing was presented in [102]. The goal of
the scheduling algorithms is to map the tasks and subtasks of all the applications on
the grid such that the resources are effectively utilized, while the quality of service
guarantees given to an application are respected. In this chapter, the actual formulation
of the optimization goals will be presented. The network resource mapping can have the
following characteristics:
1. Rigid vs. malleable: Rigid mapping is fixed-bandwidth mapping over the time period of a data transfer, whereas malleable mapping allows for variable bandwidth. If there are no quality-of-service requirements such as a constant data rate, malleable mapping is a viable option for utilizing network resources efficiently, since solutions can be flexible as long as the total amount of data transmitted over time meets the data transmission requirement.

2. Single path vs. multiple paths: For many transfers, multiple paths can be used effectively to reduce the transfer time. However, finding a set of multiple paths requires more computation time, so efficient algorithms are needed.

3. Static vs. dynamic paths: In static mapping, the paths determined at the start of a data transmission do not change until the transmission ends, while in dynamic path mapping, paths can change over time.
Recent work on workflow scheduling in optical grids can provision network
resources dynamically with guaranteed bandwidth [66, 67, 69, 95, 101].
According to the above taxonomy, those methods use rigid, single-path mapping,
and the paths are assumed to be static.
The workflow scheduling problem can be classified into in-advance and on-demand
scheduling depending on the reservation start time. If the reservation start time is the
same as the job request arrival time, it is on-demand scheduling. On the other hand,
the reservation start time is equal to or later than the job request arrival time in case of
in-advance scheduling. On-demand scheduling can also be regarded as a special case
of in-advance scheduling in which the reservation start time equals the job request arrival
time. In this chapter, we solve the in-advance workflow scheduling problem in e-Science
networks, which are a mix of IP networks and optical networks. Our framework supports
in-advance reservation and provides malleable mapping and dynamic paths. Further, it
is able to exploit multiple paths and is applicable to both heterogeneous and homogeneous
resources, especially network resources.
4.2 Workflow Scheduling in E-Science Networks
We develop workflow scheduling algorithms for e-Science networks. A real network
topology and workflows are given as inputs to the workflow scheduling algorithms. The
network topology is represented by a network resource graph, in which a
node denotes a resource, such as a compute resource or a router/switch that only forwards
network traffic, and an edge denotes a physical link between two nodes. A workflow is
represented by a task graph (DAG), in which a node denotes a task associated with a
resource type and amount, and a directed edge connecting two nodes denotes a producer/consumer
relation between them, i.e., a required data transfer from a source node to a destination node.
A task in a task graph is executed only once, and the execution order complies with the
precedence constraints defined by the task graph. In this chapter, the goal of the workflow
scheduling algorithms is to map each node (task) and each edge (data transfer) in a task graph
onto a node and onto dynamic multiple paths, respectively, in a network resource graph. Mapping
a node (task) in a task graph onto a node in a network resource graph implies
that the task is not splittable and that the mapping does not vary over time, although the amount
of resource allocated to it may vary over time. In contrast, mapping an edge (data
transfer) in a task graph onto dynamic multiple paths in a network resource graph means
that the data transfer is fulfilled by multiple paths that vary over time. The time model we
assume is the uniform time slice model, which discretizes the timeline into time
slices of uniform length. More detailed and formal definitions are given in the following
sections.
4.2.1 System Model and Data Structure
4.2.1.1 Time model
The uniform time slice model is represented by τ and M, where τ is the size of a
time slice and M is the maximum number of time slices the system considers. The
start and end times of time slice m are denoted by T_m and T_{m+1}, respectively.
4.2.1.2 Network resource model
A network resource model is represented by G_n = (V, E, r, TR, TB), where V and
E are the sets of nodes and edges, respectively, r_v denotes the resource type of a
node v, and TR_v and TB_e denote the data structures tracking the resource availability of node
v and edge e, respectively, over time. More specifically, we use time-resource (TR) and
time-bandwidth (TB) arrays as the data structures for managing resource availability over
time. A TR or TB array is a set of entries a_m, where m is the index of a time slice and a_m is
the available amount of a resource over the period [T_m, T_{m+1}). These data structures
are necessary for effective in-advance reservation of resources. A TR array and a TB
array are structurally identical; the only difference is that a TB array represents the network
resource type, whereas all other resource types are represented by TR arrays. Thus, a TB
array is assigned to each edge and a TR array is assigned to each node in a network resource graph.
Figure 4-2 shows an example of a network resource graph. Each node represents
one resource, and is associated with a resource type and a TR array, which tracks the
resource availability over time. In Figure 4-2, nodes V1 through V3 are of resource type
1, and nodes V4 and V5 are of resource type 2. We can assign a unique number to a
different resource type excluding the network resource. For example, resource type
1 is pure compute resource and resource type 2 is database service resource. Each
edge represents a physical link connecting two nodes, and is associated with a TB array,
which tracks the network resource availability over time.
Figure 4-2. An example of a network resource graph: nodes V1, V2, V3 (resource type 1) and V4, V5 (resource type 2), each with a TR array; edges V1V3, V2V3, V3V4, and V3V5, each with a TB array.
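These data structures can be sketched in Python as follows; the capacities, the number of slices M, and the reservation helper are illustrative assumptions rather than part of the formal model.

```python
from dataclasses import dataclass, field

M = 8  # number of uniform time slices tracked by the system (assumed value)

@dataclass
class Node:
    rtype: int                   # resource type (e.g. 1 = compute, 2 = database service)
    tr: list = field(default_factory=lambda: [4.0] * M)   # TR array: available units per slice

@dataclass
class Edge:
    tb: list = field(default_factory=lambda: [10.0] * M)  # TB array: available bandwidth per slice

# The graph of Figure 4-2: V1..V3 are resource type 1, V4/V5 are type 2.
nodes = {f"V{i}": Node(rtype=1 if i <= 3 else 2) for i in range(1, 6)}
edges = {("V1", "V3"): Edge(), ("V2", "V3"): Edge(),
         ("V3", "V4"): Edge(), ("V3", "V5"): Edge()}

def reserve_bandwidth(edge, start, end, amount):
    """In-advance reservation: decrement TB entries over slices [start, end) if feasible."""
    if all(edge.tb[m] >= amount for m in range(start, end)):
        for m in range(start, end):
            edge.tb[m] -= amount
        return True
    return False
```

A reservation of 6 units on link V1V3 for slices 0-2 succeeds once (leaving 4 units per slice) and then fails on a second attempt, which is exactly the bookkeeping an in-advance scheduler needs.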
4.2.1.3 Workflow model
A workflow can be represented by a task graph, which is a directed graph and
formally defined as Gt = (N,L, r,RN,RL,ST ,Deadline). N and L represent a node set
and an edge set, respectively. ri denotes the resource type of node Ni . RNi denotes the
required amount of resource for node N_i, and RL_i denotes the required amount of data
transfer between the end nodes of edge L_i. We assume all resource capacities are
normalized with respect to a base capacity, so required or available amounts of
resources can be expressed as rational multiples of the base capacity. ST
is the start time of a workflow, which has to be taken into consideration for in-advance
scheduling. Deadline is an optional parameter. If it is given, we may set the optimization
objective to minimizing network resource consumption. Otherwise, we may set the
optimization objective to minimizing the makespan of the workflow. Figure 4-3 shows an
example of a task graph. Resource requirement and resource type are associated with
each node, and only network resource requirement property is associated with each
edge.
4.2.2 Problem Statement
We solve in-advance workflow scheduling problems in e-Science networks, which
are a mix of IP networks and optical networks. Even though optical networks, where
a physical link carries multiple wavelengths, have inherently integral bandwidth,
Figure 4-3. An example of a task graph: node N1 (resource type 1, resource requirement 3) and node N2 (resource type 2, resource requirement 5), with an edge requiring 10 units of data transfer.
we assume that the bandwidth of a network resource graph is infinitely divisible. The
application of the algorithms developed in this chapter to optical networks is left to future
work. We develop our algorithms for two broad cases:
1. Single workflow: In this case, a single workflow is scheduled based on theavailable (fractional) resources. It is assumed that the previous workflowshave already been scheduled and the goal is to optimize the performancecharacteristics of a single workflow.
2. Multiple workflows: In this case, multiple workflows are scheduled simultaneously. The expectation is that this will achieve better performance than scheduling one workflow at a time.
For both cases, the goals of our algorithms can be minimization of either network
resource consumption or makespan (finish time). For simplicity, when deadlines
for workflows are given, we set the objective to minimization of network resource
consumption. Otherwise we set the objective to minimization of finish time. Therefore
we have four problems in total: (1) minimization of network resource consumption for a
single workflow, (2) minimization of finish time for a single workflow, (3) minimization of
network resource consumption for multiple workflows, and (4) minimization of finish time
for multiple workflows.
4.2.3 Construction of an Auxiliary Graph
We translate the workflow scheduling problem into a network flow problem. The
multicommodity flow problem, which optimizes the cost of multiple commodities with
different source and destination nodes flowing through the network, is a well-known
network flow problem. To formulate the workflow scheduling problem as a multicommodity
flow problem, we first have to construct an auxiliary graph from the given network
resource graph and task graph. The workflow scheduling problem comprises two
mapping problems onto a network resource graph: a node mapping problem and an edge
mapping problem. The goal of constructing an auxiliary graph is to convert the node
mapping problem into an edge mapping problem since the multicommodity flow problem
can deal with only an edge mapping problem.
An illustrative example of the auxiliary graph corresponding to Figures 4-2 and 4-3
is shown in Figure 4-4. An auxiliary graph GA = (VA,EA,TBA) is constructed as follows.
First, we expand the network resource graph by duplicating each node and connecting
from the original one to the duplicated one. For convenience, let’s call the original one a
frontend node, and the duplicated one a backend node. For example, in Figure 4-4, the
node V1 is expanded into two nodes, V1′ and V1′′, and a new edge connecting these two
nodes is inserted with the associated TB array corresponding to V1’s TR array. In this
case, V1′ is a frontend node and V1′′ is a backend node. Obviously, this expansion is to
convert a resource allocation problem into a network flow problem. The original topology
of the network resource graph remains unchanged among the backend nodes of the
expanded graph, as in Figure 4-4. Note that nodes with no chance of being selected
need not be expanded.
Second, we expand the task graph in the same way as we did the network resource
graph. But we do not create any edge connecting nodes in the expanded task graph.
Lastly, we interconnect the expanded network resource graph and the expanded task
graph.
As mentioned above, two kinds of flows are needed for problem conversion from a
general workflow scheduling problem to a network flow problem. One is the resource
allocation flow for the purpose of resource allocation of each task (node) in a task graph.
For example, in Figure 4-4, N1′ is connected to all possible frontend nodes of the same
resource type of N1 in the expanded network resource graph. Similarly, N1′′ is connected
to the backend nodes corresponding to those frontend nodes. Thus, by constraining the flow from
N1′ to N1′′, which demands N1's resource requirement, to take only a single path, we
can solve the resource allocation problem for each task (node).
The other is the data transfer flow, which models data transfers between tasks. These flows are seamlessly modeled as multiple flows with different source and destination nodes in a typical multicommodity flow problem. The source node of a data transfer flow is set to the backend node corresponding to a source task in the task graph, and the destination node is set to the backend node corresponding to a destination task. For instance, a data transfer requirement of 10 units between N1 and N2 in a task graph is modeled by a flow of 10 units of data between N1′′ and N2′′.
The auxiliary graph accounts for the situation where two tasks are mapped onto the same resource, in which case the communication cost between them should be ignored. Since the bandwidth of the interconnecting edges between the expanded network resource graph and the expanded task graph is set to infinity, the communication cost of tasks mapped onto the same resource is effectively zero. Suppose that all tasks and resources in Figure 4-4 are of the same type and that N1 and N2 are mapped onto V1. Then the data flow between N1 and N2 follows the path N1′′ → V1′′ → N2′′, which is composed only of edges with infinite bandwidth.
The space complexity of an auxiliary graph is summarized as follows: $|V_A| = 2(|V| + |N|)$ and $|E_A| = |E| + |V| + 2|N||V|$.
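The expansion and interconnection steps above can be sketched in a few lines of Python (an illustrative sketch under simplifying assumptions: all tasks and resources are of a single type, so every task connects to every resource; the function and variable names are ours, not the dissertation's):

```python
# Sketch of the auxiliary-graph construction of Section 4.2.3 (illustrative only;
# names and data structures are our own, not from the dissertation's implementation).

def build_auxiliary_graph(resource_nodes, resource_edges, task_nodes):
    """Expand each node v into a frontend v' and a backend v'',
    then interconnect the expanded resource and task graphs.

    resource_nodes: dict node -> capacity (a stand-in for the TB array)
    resource_edges: list of (u, v, bandwidth)
    task_nodes:     list of task names
    Returns (nodes, edges) where edges maps (u, v) -> bandwidth, with
    float('inf') marking zero-cost interconnection edges.
    """
    INF = float('inf')
    nodes, edges = [], {}
    # 1) Expand the network resource graph: v' -> v'' carries v's capacity.
    for v, cap in resource_nodes.items():
        nodes += [v + "'", v + "''"]
        edges[(v + "'", v + "''")] = cap
    # The original topology is preserved among backend nodes.
    for u, v, bw in resource_edges:
        edges[(u + "''", v + "''")] = bw
    # 2) Expand the task graph (no edges inside the expanded task graph).
    for t in task_nodes:
        nodes += [t + "'", t + "''"]
    # 3) Interconnect: every task frontend feeds every resource frontend, and
    #    every resource backend feeds every task backend, at infinite bandwidth.
    for t in task_nodes:
        for v in resource_nodes:
            edges[(t + "'", v + "'")] = INF
            edges[(v + "''", t + "''")] = INF
    return nodes, edges

nodes, edges = build_auxiliary_graph({'V1': 10, 'V2': 5}, [('V1', 'V2', 3)], ['N1', 'N2'])
# |V_A| = 2(|V| + |N|) and |E_A| = |E| + |V| + 2|N||V|, matching the counts above.
assert len(nodes) == 2 * (2 + 2) and len(edges) == 1 + 2 + 2 * 2 * 2
```

The final assertion checks the space-complexity formulas against the toy instance of two resource nodes, one resource edge, and two tasks.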
4.3 MILP Formulation
The single or multiple workflow scheduling problem can be formulated as a mixed
integer linear programming (MILP) problem, which is a variant of a multicommodity
flow problem. The objective of the MILP problem can be minimizing the finish time or
minimizing the total network resource consumption depending on whether the deadlines
Figure 4-4. An example of an auxiliary graph
for workflows are given or not. If a deadline is not imposed on a workflow, the user who requests the workflow job wants the job done as fast as possible. If a deadline is imposed, the user is satisfied as long as the deadline is met, which allows the system to utilize resources more efficiently in exchange for a delayed completion.
The constraints of the MILP problem are composed of four parts: 1) multi-commodity flow constraints, 2) task assignment constraints, 3) precedence constraints, and 4) deadline constraints. Since we have transformed the workflow scheduling problem into a multi-commodity flow problem, the typical multi-commodity flow constraints remain valid; additional multi-commodity flow constraints are added to account for malleable resource allocation. Second, the task assignment constraints are integer constraints that enforce that one task is mapped to only one resource node in the network topology graph. Third, the precedence constraints ensure that the precedence constraints of a workflow are obeyed. Finally, the deadline constraints apply only when deadlines for workflows are given. The notation for the MILP formulation is listed in Table 4-1.
Table 4-1. Notation for problem formulation

Function:
  pred(v)              Returns the set of predecessors of node v.

Constant or Set:
  J                    $\{(s_j, d_j, F_j) \mid s_j, d_j \in V_A,\; 0 \le j < |N| + |L|\}$, a set of jobs, where $s_j$ and $d_j$ are the source and destination nodes of job j and $F_j$ is the required amount of flow (resource) for job j. One job corresponds to either a node or an edge of a workflow.
  $J_c$                A set of communication jobs, $J_c \subset J$.
  $J_{nc}$             A set of non-communication jobs, $J_{nc} \subset J$, $J = J_c \cup J_{nc}$.
  $P_j$                A set of allowed paths for job $j \in J$ (from $s_j$ to $d_j$).
  $G_A$                An auxiliary graph, $(V_A, E_A, TB_{E_A})$.
  $E_{inf}$            A set of edges with infinite available bandwidth, $E_{inf} \subset E_A$.
  $b_{lk}(m)$          The available bandwidth on edge (l, k) during time slice m in an auxiliary graph $G_A$.
  $T_m$                The start time of time slice m.
  $T_{m+1}$            The end time of time slice m.
  M                    The number of time slices to be considered.
  N                    The number of workflows in the case of multiple workflow scheduling.
  $Inc_v$              The set of edges incident on node v.
  $D / D_n$            Deadline of a workflow / workflow n.
  $WST / WST_n$        Start time of a workflow / workflow n.
  Q                    Very large number.
  $Q_t$                Tolerance.

Variable:
  $T_f / T_f^n$                  Finish time of a workflow / workflow n.
  $f_{lk}^j(m) / f_{lk}^{jn}(m)$ The amount of flow transferred (resource allocated) for job $j \in J$ on link (l, k) during time slice m for a workflow / workflow n. This variable is defined only when the type of the job is the same as that of the edge.
  $f_p^j(m) / f_p^{jn}(m)$       The amount of flow transferred (resource allocated) for job $j \in J$ on path $p \in P_j$ during time slice m for a workflow / workflow n.
  $x_{lk}^j / x_{lk}^{jn}$       0 or 1; 1 if edge $(l, k) \in Inc_{s_j}$ is selected for job $j \in J$ and 0 otherwise, for a workflow / workflow n.
  $y_{lk}^j / y_{lk}^{jn}$       0 or 1; 1 if edge $(l, k) \in Inc_{d_j}$ is selected for job $j \in J$ and 0 otherwise, for a workflow / workflow n.
  $z_m^j / z_m^{jn}$             0 or 1; 1 if job $j \in J$ is allocated in time slice m and 0 otherwise, for a workflow / workflow n.
  $ST_j / ST_j^n$                Start time of a job $j \in J$ in a workflow / workflow n.
  $END_j / END_j^n$              End time of a job $j \in J$ in a workflow / workflow n.
The tasks and the data transfers among them are all mapped to jobs in a multicommodity flow problem. A job in the formulation is described as a three-tuple (s, d, F), where s and d denote the source and destination nodes of the job and F denotes the required amount of flow (resource). ST and END, the start and end times of the job, are determined by workflow scheduling algorithms. The resource type does not have to be included in this tuple since a flow is forced to take a link of the same resource type due to the carefully chosen connecting edges between the network resource graph and the task graph. Three kinds of binary decision variables are introduced: x, y, and z. The discrete nature of the problem is due to the fact that a task cannot be split and that we have discrete time intervals to accommodate jobs. The binary decision variables $x_{lk}^j$ and $y_{lk}^j$ determine which resource is allocated to a non-split task. For a job j corresponding to a task in a task graph, the flow of the job can take only one outgoing edge from the frontend node of the task and only one incoming edge into the backend node of the task in the auxiliary graph. These constraints reflect the non-split property of a task. $z_m^j$ indicates whether time slice m is used for job j. These binary decision variables can easily be extended to the multiple workflow scheduling problem by using separate variables for each workflow.
4.3.1 Single Workflow
The complete formulation is presented in Figure 4-5. First of all, the problem can be optimized for either the minimum finish time, $T_f$, or the minimum network resource consumption, $\sum_{j \in J} \sum_{(l,k) \in E_A} \sum_{m=0}^{M-1} f_{lk}^j(m)$, as in Expression 4–1. Minimizing network resource consumption can help save resources for future arriving requests so that more requests can be accepted in the long term.
4.3.1.1 Multi-commodity flow constraints
The flow conservation rule at nodes other than the source and destination nodes is ensured by Constraint 4–2. The amount of flow (resource) to be allocated (reserved) is ensured by Constraint 4–3. Constraint 4–4 ensures that the amount of total flow on link (l, k) during time slice m does not exceed the maximum possible amount of flow during that time slice, which is given by $b_{lk}(m) \times (T_{m+1} - T_m)$, where $b_{lk}(m)$
Objective

$\text{minimize } T_f \text{ or } \sum_{j \in J} \sum_{(l,k) \in E_A} \sum_{m=0}^{M-1} f_{lk}^j(m)$   (4–1)

Multi-commodity flow constraints

$\sum_{k:(l,k) \in E_A} f_{lk}^j(m) - \sum_{k:(k,l) \in E_A} f_{kl}^j(m) = 0, \quad \forall j \in J, \forall l \in V_A, 0 \le m < M, l \ne s_j, l \ne d_j$   (4–2)

$\sum_{m=0}^{M-1} \Big( \sum_{k:(l,k) \in E_A} f_{lk}^j(m) - \sum_{k:(k,l) \in E_A} f_{kl}^j(m) \Big) = \begin{cases} F_j, & \text{if } l = s_j \\ -F_j, & \text{if } l = d_j \end{cases}, \quad \forall l \in V_A, \forall j \in J$   (4–3)

$\sum_{j \in J} f_{lk}^j(m) \le b_{lk}(m) \times (T_{m+1} - T_m), \quad \forall (l,k) \in E_A, 0 \le m < M$   (4–4)

$0 \le f_{lk}^j(m) \le b_{lk}(m) \times (T_{m+1} - T_m) \times z_m^j, \quad \forall j \in J, \forall (l,k) \in E_A \setminus E_{inf}, 0 \le m < M$   (4–5)

$ST_j \le T_m \times z_m^j + (1 - z_m^j) Q, \quad \forall j \in J, 0 \le m < M$   (4–6)

$END_j \ge T_{m+1} \times z_m^j, \quad \forall j \in J, 0 \le m < M$   (4–7)

Task assignment constraints

$\sum_{(l,k) \in Inc_{s_j}, l = s_j} x_{lk}^j = 1, \quad \forall j \in J$   (4–8)

$\sum_{(l,k) \in Inc_{d_j}, k = d_j} y_{lk}^j = 1, \quad \forall j \in J$   (4–9)

$\frac{\sum_{m=0}^{M-1} f_{lk}^j(m)}{Q} \le \begin{cases} x_{lk}^j, & \text{if } (l,k) \in Inc_{s_j} \\ y_{lk}^j, & \text{if } (l,k) \in Inc_{d_j} \end{cases}, \quad \forall j \in J$   (4–10)

Precedence constraints

$ST_j = WST, \text{ if } pred(j) = \emptyset, \forall j \in J$   (4–11)
$ST_j \le END_j, \quad \forall j \in J$   (4–12)
$END_p \le ST_j, \text{ if } p \in pred(j), \; p, j \in J$   (4–13)
$END_j \le T_f, \quad \forall j \in J$   (4–14)
$y_{lk}^i = x_{lk}^j, \text{ if } i \in pred(j), i \in J_{nc}, j \in J_c$   (4–15)
$y_{lk}^i = y_{lk}^j, \text{ if } i \in pred(j), i \in J_c, j \in J_{nc}$   (4–16)

Deadline constraints (optional)

$T_f \le D$   (4–17)

Figure 4-5. Single workflow scheduling problem formulation via network flow model
is the available bandwidth in time slice m, and $T_{m+1}$ and $T_m$ are the end time and start time of time slice m. Constraint 4–5 ensures that if time slice m is not used for job j, the amount of flow for job j during that time slice is 0. Note, however, that Constraint 4–5 should not be imposed on edges with infinite available bandwidth, as no cost is required for communications between tasks assigned to the same resource; otherwise, a time slice would be allocated for such communications. Constraint 4–6 ensures that if time slice m is used for job j, the start time of job j is at most $T_m$, the start time of the time slice. If multiple time slices are chosen for job j, this constraint forces $ST_j$ to be less than or equal to the start time of the earliest such time slice, which complies with the definition of $ST_j$. Similarly, Constraint 4–7 ensures that the end time of job j is greater than or equal to the end time of any time slice m in which the job is scheduled.
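The effect of the big-M Constraints 4–6 and 4–7 can be checked with a small Python calculation (a toy illustration with made-up slice boundaries, not part of the formulation):

```python
# Illustration of Constraints 4-6 and 4-7 (our own toy check, not solver code):
# with a big constant Q, ST_j <= T_m * z_m + (1 - z_m) * Q is vacuous when
# z_m = 0 and pins ST_j below the earliest chosen slice start when z_m = 1;
# symmetrically, END_j >= T_{m+1} * z_m pins END_j above the latest chosen
# slice end.

Q = 10**6                     # "very large number" from Table 4-1
T = [0, 5, 10, 15, 20]        # slice boundaries: slice m spans [T[m], T[m+1])
z = [0, 1, 1, 0]              # job j occupies slices 1 and 2

# Tightest ST_j / END_j consistent with the constraints:
ST = min(T[m] * z[m] + (1 - z[m]) * Q for m in range(len(z)))
END = max(T[m + 1] * z[m] for m in range(len(z)))

assert ST == 5    # start of the earliest chosen slice
assert END == 15  # end of the latest chosen slice
```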
4.3.1.2 Task assignment constraints
The second part of the constraints reflects the non-split property of tasks. Constraints 4–8 and 4–9 ensure that only one resource among the possible candidate resources is assigned to each task. Constraint 4–10 relates the discrete selection of a resource to the flow decision variables: only if a resource is chosen for a job j can there be a flow on the related links.
4.3.1.3 Precedence constraints
After transforming all the tasks and data transfers into jobs in a network flow problem, we must ensure that the precedence constraints inherent in a task graph are also embedded in the network flow problem. Accordingly, Constraint 4–11 ensures that the start time of jobs with no precedent jobs is set to the start time of the workflow. Constraint 4–12 ensures that the end time of a job is greater than or equal to its start time. Constraint 4–13 ensures that the start time of a job is not before the end times of its precedent jobs. Constraint 4–14 ensures that all job end times are less than or equal to the global finish time $T_f$. Constraints 4–15 and 4–16 ensure that data transfers between tasks occur between the chosen resources.
4.3.1.4 Deadline constraints
Constraint 4–17 is optional depending on whether we have deadlines on workflows
or not.
4.3.2 Multiple Workflows
The formulation for the single workflow scheduling problem can easily be extended to the multiple workflow scheduling problem by using separate variables for each workflow, as in Figure 4-6.

We can set the objective of the multiple workflow scheduling problem formulation to minimizing either the total sum of the makespans of all workflows or the total network resource consumption of all workflows, as in Expression 4–18. The first term of Expression 4–18 is the total sum of the makespans of all workflows. Although we could optimize the finish time of the whole set of workflows by directly applying the objective of the single workflow scheduling formulation, Expression 4–1, minimizing the overall finish time may not contribute to the efficient resource scheduling of workflows whose timelines finish far ahead of the overall finish time. For this reason, we choose to minimize the total sum of makespans. There remains, however, a concern that this objective cannot achieve balanced optimization of the makespan of every workflow. Suppose that each workflow is issued by a different user. From the perspective of the whole system, this objective achieves balanced scheduling among workflows, but from the perspective of users, the makespan of a particular workflow may be sacrificed to achieve the minimum total sum of makespans by reducing the makespans of other workflows.
4.3.3 Time Complexity
The time complexity of an MILP problem depends on the number of decision variables and the number of constraints. To formally analyze the number of decision
Objective

$\text{minimize } \sum_{n=0}^{N-1} (T_f^n - WST^n) \text{ or } \sum_{n=0}^{N-1} \sum_{j \in J} \sum_{(l,k) \in E_A} \sum_{m=0}^{M-1} f_{lk}^{jn}(m)$   (4–18)

Multi-commodity flow constraints

$\sum_{k:(l,k) \in E_A} f_{lk}^{jn}(m) - \sum_{k:(k,l) \in E_A} f_{kl}^{jn}(m) = 0, \quad \forall j \in J, \forall l \in V_A, 0 \le m < M, 0 \le n < N, l \ne s_j, l \ne d_j$   (4–19)

$\sum_{m=0}^{M-1} \Big( \sum_{k:(l,k) \in E_A} f_{lk}^{jn}(m) - \sum_{k:(k,l) \in E_A} f_{kl}^{jn}(m) \Big) = \begin{cases} F_j, & \text{if } l = s_j \\ -F_j, & \text{if } l = d_j \end{cases}, \quad \forall l \in V_A, \forall j \in J, 0 \le n < N$   (4–20)

$\sum_{n=0}^{N-1} \sum_{j \in J} f_{lk}^{jn}(m) \le b_{lk}(m) \times (T_{m+1} - T_m), \quad \forall (l,k) \in E_A, 0 \le m < M$   (4–21)

$0 \le f_{lk}^{jn}(m) \le b_{lk}(m) \times (T_{m+1} - T_m) \times z_m^{jn}, \quad \forall j \in J, \forall (l,k) \in E_A \setminus E_{inf}, 0 \le m < M, 0 \le n < N$   (4–22)

$ST_j^n \le T_m \times z_m^{jn} + (1 - z_m^{jn}) Q, \quad \forall j \in J, 0 \le m < M, 0 \le n < N$   (4–23)

$END_j^n \ge T_{m+1} \times z_m^{jn}, \quad \forall j \in J, 0 \le m < M, 0 \le n < N$   (4–24)

Task assignment constraints

$\sum_{(l,k) \in Inc_{s_j}, l = s_j} x_{lk}^{jn} = 1, \quad \forall j \in J, 0 \le n < N$   (4–25)

$\sum_{(l,k) \in Inc_{d_j}, k = d_j} y_{lk}^{jn} = 1, \quad \forall j \in J, 0 \le n < N$   (4–26)

$\frac{\sum_{m=0}^{M-1} f_{lk}^{jn}(m)}{Q} \le \begin{cases} x_{lk}^{jn}, & \text{if } (l,k) \in Inc_{s_j} \\ y_{lk}^{jn}, & \text{if } (l,k) \in Inc_{d_j} \end{cases}, \quad \forall j \in J, 0 \le n < N$   (4–27)

Precedence constraints

$ST_j^n = WST^n, \text{ if } pred(j) = \emptyset, \forall j \in J, 0 \le n < N$   (4–28)
$ST_j^n \le END_j^n, \quad \forall j \in J, 0 \le n < N$   (4–29)
$END_p^n \le ST_j^n, \text{ if } p \in pred(j), \; p, j \in J, 0 \le n < N$   (4–30)
$END_j^n \le T_f^n, \quad \forall j \in J, 0 \le n < N$   (4–31)
$y_{lk}^{pn} = x_{lk}^{jn}, \text{ if } p \in pred(j), p \in J_{nc}, j \in J_c, 0 \le n < N$   (4–32)
$y_{lk}^{pn} = y_{lk}^{jn}, \text{ if } p \in pred(j), p \in J_c, j \in J_{nc}, 0 \le n < N$   (4–33)

Deadline constraints (optional)

$T_f^n \le D_n$   (4–34)

Figure 4-6. Multiple workflow scheduling problem formulation via network flow model
variables and the number of constraints, we first define the following variables. $n_n$ and $m_n$ denote the number of nodes and edges, respectively, of a network resource graph. $n_t$ and $m_t$ denote the number of nodes and edges, respectively, of a workflow. $n_J$ denotes the number of jobs. If $n_A$ and $m_A$ represent the number of nodes and edges, respectively, of the auxiliary graph, then $n_A$ equals $2(n_n + n_t)$ and $m_A$ equals $m_n + n_n + 2 n_t n_n$, as described in Section 4.2.3. Table 4-2 shows the number of variables and constraints of the single workflow scheduling problem formulation. The flow variables f consist of flow variables of communication jobs and flow variables of non-communication jobs. The first part is accounted for by $(2n_n + m_n) \cdot m_t \cdot M$ because we need to consider flows on the network resource graph ($m_n$) and on the interconnecting edges between the network resource graph and the task graph related to a job ($2n_n$). The second part is accounted for by $(3n_n \cdot n_t) \cdot M$ because we need to consider flows only on the interconnecting edges between the network resource graph and the task graph and on the edges between frontend and backend nodes in $G_A$. For simplicity, assume that the network resource graph is fixed, as in the experiments of Section 4.6, where only the size of workflows is varied. Then the decision variable f dominates the number of decision variables, and the number of f variables is proportional to $(m_t + n_t) \cdot M$. As shown in Table 4-2, the number of constraints is also proportional to $(m_t + n_t) \cdot M$.
4.4 LP Relaxation
As the experimental results will show, the running time of MILP for workflow scheduling increases exponentially as the number of nodes of a workflow grows. The general workaround for solving the MILP problem fast enough to be useful in practice is linear programming relaxation, which transforms binary variables into real variables ranging between 0 and 1. We can turn the solution of the linear programming relaxation of the MILP problem into an approximate solution of the MILP problem via
Table 4-2. Single workflow scheduling formulation time complexity analysis

Variable/Constraint              Number of variables/constraints
f                                $((2n_n + m_n) \cdot m_t + 3 n_n \cdot n_t) \cdot M$
x                                $n_J \cdot n_n = (m_t + n_t) \cdot n_n$
y                                $n_J \cdot n_n = (m_t + n_t) \cdot n_n$
z                                $n_J \cdot M = (m_t + n_t) \cdot M$
ST                               $n_J = (m_t + n_t)$
END                              $n_J = (m_t + n_t)$
Constraints 4–2, 4–3             $(n_t \cdot (2n_n + 2) + m_t \cdot (n_n + 2)) \cdot M$
Constraint 4–4                   $(m_n + n_n) \cdot M$
Constraint 4–5                   $(n_t \cdot n_n + m_t \cdot (m_n + n_n)) \cdot M$
Constraints 4–6, 4–7             $n_J \cdot M = (m_t + n_t) \cdot M$
Constraints 4–8, 4–9             $n_J = (m_t + n_t)$
Constraint 4–10                  $n_J \cdot n_n = (m_t + n_t) \cdot n_n$
Constraints 4–11, 4–12, 4–14     $n_J = (m_t + n_t)$
Constraint 4–13                  $2 m_t$
Constraints 4–15, 4–16           $2 m_t \cdot n_n$
techniques such as rounding. We propose an LP relaxation (LPR) algorithm for the workflow scheduling problem, consisting of two steps:

• First, we determine which resources are selected for the tasks (nodes) of a task graph.

• Second, we iteratively determine the start and end times of jobs along with network resource allocations for data transfer jobs.
The detailed operations of the first-step algorithm are described in Algorithm 4-1. The goal of the first step is to determine the mapping of resources other than the network, i.e., the related binary variables x and y. In the original MILP formulation, Constraints 4–8 and 4–9 ensure that exactly one x/y variable in each constraint becomes 1. Hence, we can turn the solution of the LP relaxation problem into a solution of the original MILP problem by picking the variable with the maximum value, setting it to 1, and setting all the others to 0. In this step, we ignore the z variables, which are related to time slice assignment.
Algorithm 4-1 First step - Determination of the mapping of tasks except data transfers
Input: A network topology graph $G_n$ and a workflow $G_t$
1: Relax all the binary variables of the MILP problem, i.e., the x, y, and z variables.
2: Solve the LP relaxation of the MILP problem.
3: For the x and y variables, find the maximum relaxed variable among the relaxed variables whose total sum must be 1, set it to 1, and set all other variables to 0.
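Step 3 of Algorithm 4-1, the rounding of the relaxed x/y variables, might look like the following Python sketch (the fractional values are hypothetical stand-ins for an LP solver's output):

```python
# Rounding step of Algorithm 4-1, sketched in plain Python (hypothetical data;
# a real implementation would read these values from the LP solver's solution).

def round_assignment(relaxed):
    """Given relaxed x (or y) values for one job, which sum to 1 by
    Constraint 4-8/4-9, set the largest to 1 and the rest to 0."""
    best = max(relaxed, key=relaxed.get)
    return {edge: 1 if edge == best else 0 for edge in relaxed}

# Fractional LP solution for a job's outgoing edges from its frontend node:
x_relaxed = {('N1f', 'V1f'): 0.7, ('N1f', 'V2f'): 0.2, ('N1f', 'V3f'): 0.1}
x_rounded = round_assignment(x_relaxed)
assert x_rounded == {('N1f', 'V1f'): 1, ('N1f', 'V2f'): 0, ('N1f', 'V3f'): 0}
```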
With the solution of the first-step algorithm, we can determine the start and end times of jobs by iteratively solving small MILP problems over the unscheduled jobs. The basic idea is that finding a solution to the MILP problem with determined x/y binary variables and undetermined z binary variables for a small number of jobs, e.g., 3, takes little time. Thus, we can divide the problem into many small problems and solve them sequentially. To pick appropriate jobs, we use the same bottom-level priority scheme as the heuristic; in our case, however, the node mapping is already determined. The detailed operations of the second-step algorithm are described in Algorithm 4-2.
Algorithm 4-2 Second step - Determination of the mapping of network resources
Input: A network topology graph $G_n$ and a workflow $G_t$ with the fixed resource mapping obtained from Algorithm 4-1
1: while there are network jobs with unfixed end times do
2:   Pick 3 non-communication jobs and their associated communication jobs.
3:   Solve the MILP problem that has only those jobs, with the related z variables as the only binary variables.
4:   Update the start and end times of jobs affected by the solution.
5: end while
As for LP, the computation time is proportional to $p^2 q$ when $q \ge p$, where p is the number of decision variables and q is the number of constraints. The decision variable f is
dominant in the number of decision variables. Suppose that the network resource graph is fixed, as in the experiments of Section 4.6, where only the size of workflows is varied. To address the fast-growing running time with respect to the size of a workflow, we choose another form of the multicommodity flow formulation. There are two kinds of LP formulations for the multicommodity flow problem: the node-arc form and the edge-path form. The MILP formulation in Figure 4-5 takes the node-arc form, which assigns a separate decision variable to a given job on a given link. In contrast, the edge-path form assigns a separate decision variable to a given job on a given path in a set of paths, P, which the job can take. Accordingly, if we limit the number of paths in the set P, we can reduce the number of decision variables, which leads to better performance in terms of time complexity at the cost of solution accuracy. In [81], the authors showed that an edge-path formulation for bulk file transfers can lead to a near-optimal solution with reasonable time complexity by using a limited number of pre-defined paths. The edge-path form of the single workflow scheduling problem formulation is presented in Figure 4-7. We refer to this edge-path based LP relaxation as LPREdge for the rest of this chapter.
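A limited path set $P_j$ of this kind could be built as follows (a sketch; we enumerate simple paths in BFS order and keep the first k, which approximates the pre-defined-path idea but is not necessarily the procedure used in [81]):

```python
# Building the limited path set P_j for the edge-path form (a sketch; we
# enumerate simple paths in BFS order, shortest first, and keep the first k).

from collections import deque

def k_paths(adj, src, dst, k):
    """Return up to k simple src->dst paths, fewest hops first."""
    paths, queue = [], deque([[src]])
    while queue and len(paths) < k:
        path = queue.popleft()
        if path[-1] == dst:
            paths.append(path)
            continue
        for nxt in adj.get(path[-1], []):
            if nxt not in path:          # keep paths simple (no revisits)
                queue.append(path + [nxt])
    return paths

adj = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
P = k_paths(adj, 'A', 'D', k=2)
assert P == [['A', 'B', 'D'], ['A', 'C', 'D']]
```

Limiting k trades solution accuracy for the reduction in the number of $f_p^j(m)$ variables discussed above.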
The time complexity analysis for the edge-path form formulation is summarized in Table 4-3. Compared to the original formulation, the number of variables and constraints is much reduced. In particular, the time complexity of the edge-path form formulation is much less influenced by the size of the network resource graph, i.e., $m_n$ and $n_n$.
4.5 List Scheduling Heuristic
The extended list scheduling algorithm with the bottom-level priority scheme achieves the best performance among priority schemes such as the top-level priority scheme [92]. Even though the authors of [67] tried to enhance performance by considering the properties of a pipelined task graph, their new algorithm does not make much difference in the case of random workflows. Their results show that the new and classic algorithms produce almost the same makespans for workflows with up
Objective

$\text{minimize } T_f \text{ or } \sum_{j \in J} \sum_{p \in P_j} \sum_{m=0}^{M-1} f_p^j(m)$   (4–35)

Multi-commodity flow constraints

$\sum_{0 \le m < M} \sum_{p \in P_j} f_p^j(m) = F_j, \quad \forall j \in J$   (4–36)

$\sum_{j \in J} \sum_{p \in P_j, (l,k) \in p} f_p^j(m) \le b_{lk}(m) \times (T_{m+1} - T_m), \quad \forall (l,k) \in E_A, 0 \le m < M$   (4–37)

$0 \le f_p^j(m) \le b_{lk}(m) \times (T_{m+1} - T_m) \times z_m^j, \quad \forall j \in J, \forall p \in P_j, (l,k) \in p, 0 \le m < M$   (4–38)

$ST_j \le T_m \times z_m^j + (1 - z_m^j) Q, \quad \forall j \in J, 0 \le m < M$   (4–39)

$END_j \ge T_{m+1} \times z_m^j, \quad \forall j \in J, 0 \le m < M$   (4–40)

Task assignment constraints

$\sum_{(l,k) \in Inc_{s_j}, l = s_j} x_{lk}^j = 1, \quad \forall j \in J$   (4–41)

$\sum_{(l,k) \in Inc_{d_j}, k = d_j} y_{lk}^j = 1, \quad \forall j \in J$   (4–42)

$\frac{\sum_{m=0}^{M-1} f_p^j(m)}{Q} \le \begin{cases} x_{lk}^j, & \text{if } (l,k) \in Inc_{s_j} \\ y_{lk}^j, & \text{if } (l,k) \in Inc_{d_j} \end{cases}, \quad \forall j \in J, \forall p \in P_j, (l,k) \in p$   (4–43)

Precedence constraints

$ST_j = WST, \text{ if } pred(j) = \emptyset, \forall j \in J$   (4–44)
$ST_j \le END_j, \quad \forall j \in J$   (4–45)
$END_p \le ST_j, \text{ if } p \in pred(j), \; p, j \in J$   (4–46)
$END_j \le T_f, \quad \forall j \in J$   (4–47)
$y_{lk}^i = x_{lk}^j, \text{ if } i \in pred(j), i \in J_{nc}, j \in J_c$   (4–48)
$y_{lk}^i = y_{lk}^j, \text{ if } i \in pred(j), i \in J_c, j \in J_{nc}$   (4–49)

Deadline constraints (optional)

$T_f \le D$   (4–50)

Figure 4-7. Edge-path form of single workflow scheduling problem formulation
Table 4-3. Edge-path form single workflow scheduling formulation time complexity analysis

Variable/Constraint              Number of variables/constraints
f                                $(k \cdot m_t + n_n \cdot n_t) \cdot M$
x                                $n_J \cdot n_n = (m_t + n_t) \cdot n_n$
y                                $n_J \cdot n_n = (m_t + n_t) \cdot n_n$
z                                $n_J \cdot M = (m_t + n_t) \cdot M$
ST                               $n_J = (m_t + n_t)$
END                              $n_J = (m_t + n_t)$
Constraint 4–36                  $n_J \cdot M = (m_t + n_t) \cdot M$
Constraint 4–37                  $(m_n + n_n) \cdot M$
Constraint 4–38                  $k \cdot n_J \cdot M = k \cdot (m_t + n_t) \cdot M$
Constraints 4–39, 4–40           $n_J \cdot M = (m_t + n_t) \cdot M$
Constraints 4–41, 4–42           $n_J = (m_t + n_t)$
Constraint 4–43                  $n_J \cdot n_n = (m_t + n_t) \cdot n_n$
Constraints 4–44, 4–45, 4–47     $n_J = (m_t + n_t)$
Constraint 4–46                  $2 m_t$
Constraints 4–48, 4–49           $2 m_t \cdot n_n$

Here k denotes the limited number of pre-defined paths allowed per job.
to 60 tasks, and the new algorithm performs at most 5-10% better for workflows with 80 to 100 tasks. Hence, we adapt the general extended list scheduling algorithm with the bottom-level priority scheme to our random workflows.
The direct application of the extended list scheduling algorithm proposed in [92] does not fit e-Science networks well, in three respects. First, the algorithm of [92] allows links on a path to be available in different time periods, as long as descendant links become available after the precedent links of the path. This assumption requires buffers at the ends of links and the intervention of moderators controlling the start and end of data transfer on each link. Second, [92] does not consider in-advance reservation, which means that only the bandwidth available at the time the request is made is taken into account for path computation. The extended list scheduling algorithm adapted for e-Science networks is described in Algorithm 4-3.
The changes necessary for adapting to in-advance workflow reservations in e-Science networks concern the computation of data transfer time as part of computing the earliest finish time. The assumption of synchronized availability of the links on a path is reasonable in e-Science networks, but in-advance reservations pose another challenge. In the case of on-demand reservations, we can compute the data transfer time simply as the amount of data divided by the maximum available bandwidth of a path, where the maximum available bandwidth of a path is the minimum over the maximum available bandwidths of its links. Third, the available bandwidth varying over time due to the nature of in-advance reservation requires careful handling of the data transfer time. We assume rigid mapping for the extended list scheduling, which means the allocated bandwidth of a path does not change over time. To find the data transfer finish time, we use the simple heuristic described in Algorithm 4-4. We refer to the extended list scheduling algorithm adapted for e-Science networks as LS for the rest of this chapter.
Algorithm 4-3 The adapted extended list scheduling algorithm
Input: A network resource graph $G_n$ and a workflow $G_t$
1: Determine the priorities of all nodes in $G_t$ based on the bottom-level priority scheme.
2: Order the nodes with respect to priority while complying with precedence constraints.
3: for each node in the ordered list, in decreasing order of priority do
4:   Find the resource node that allows the earliest finish time among all candidate nodes by virtually scheduling all incoming data transfers. // Network paths between two nodes are predetermined by BFS.
5: end for
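Step 1 of Algorithm 4-3, the bottom-level priority computation, can be sketched as follows (assuming the common definition of bottom level as the longest node-weighted path to an exit node; the exact cost model of [92] may differ):

```python
# Bottom-level priority computation used in step 1 of Algorithm 4-3 (a sketch
# under the common convention: a node's bottom level is the length of the
# longest path from it to an exit node, counting node weights).

def bottom_levels(succ, weight):
    """succ: node -> list of successors; weight: node -> computation cost."""
    memo = {}
    def bl(v):
        if v not in memo:
            memo[v] = weight[v] + max((bl(s) for s in succ[v]), default=0)
        return memo[v]
    return {v: bl(v) for v in succ}

succ = {'N1': ['N2', 'N3'], 'N2': ['N4'], 'N3': ['N4'], 'N4': []}
weight = {'N1': 2, 'N2': 3, 'N3': 1, 'N4': 4}
bl = bottom_levels(succ, weight)
# Nodes are then scheduled in decreasing bottom-level order.
assert bl == {'N4': 4, 'N2': 7, 'N3': 5, 'N1': 9}
```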
4.6 Experimental Evaluation
4.6.1 Experiment Setup
We compare the performance of four algorithms, the optimal MILP algorithm, the LP relaxation algorithm, the edge-path form LP relaxation algorithm, and the list
Algorithm 4-4 Data transfer finish time computation algorithm
Input: A network resource graph $G_n$ and a data transfer specified by (source, destination, amount of data, start time)
1: for each time slice, in increasing order of time slice start times, whose end time is greater than or equal to the transfer start time do
2:   // A basic interval refers to the time period within which the available bandwidth of the links of a path is constant.
3:   AllocBW ← the available bandwidth of the time slice
4:   RemainingData ← the amount of data to transfer
5:   CurTimeSlice ← the time slice
6:   FinishTime ← the start time of the data transfer
7:   while RemainingData > 0 do
8:     if CurTimeSlice has at least AllocBW available bandwidth then
9:       RemainingData ← RemainingData − the amount of data transferred in the current time slice
10:      Update FinishTime
11:    else
12:      exit the while loop
13:    end if
14:    CurTimeSlice ← the next time slice
15:  end while
16:  if RemainingData = 0 then
17:    return FinishTime
18:  end if
19: end for
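Algorithm 4-4 can be rendered as runnable Python under the rigid-mapping assumption (the slice list and bandwidth values below are made up for illustration):

```python
# A runnable rendering of Algorithm 4-4 under the rigid-mapping assumption
# (slice data and bandwidths are hypothetical, for illustration only).

def transfer_finish_time(slices, data, start_time):
    """slices: list of (start, end, available_bw) sorted by start time.
    Tries each candidate starting slice in order; the allocated bandwidth is
    fixed to that slice's availability and must be sustained in every later
    slice the transfer spans. Returns the finish time, or None on failure."""
    for i, (s, e, bw) in enumerate(slices):
        if e < start_time:
            continue                     # slice ends before the transfer starts
        alloc, remaining, t = bw, data, max(s, start_time)
        for s2, e2, bw2 in slices[i:]:
            if bw2 < alloc:
                break                    # rigid allocation cannot be sustained
            begin = max(s2, t)
            sent = min(remaining, alloc * (e2 - begin))
            remaining -= sent
            t = begin + sent / alloc
            if remaining == 0:
                return t
    return None

slices = [(0, 10, 2), (10, 20, 5), (20, 30, 1)]
# 30 units starting at t=0: alloc=2 holds over [0,10) and [10,20);
# 20 units are sent by t=10 and the remaining 10 by t=15.
assert transfer_finish_time(slices, 30, 0) == 15.0
```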
scheduling heuristic of Section 4.5, in terms of the makespan, i.e., the schedule length of workflows, and the computational time of the algorithms. In the following, we refer to the MILP algorithm, the LP relaxation algorithm, the edge-path form LP relaxation algorithm, and the general extended list scheduling algorithm of Section 4.5 as MILP, LPR, LPREdge, and LS, respectively.

We first compare the performance of all four algorithms on workflows with a small number of nodes, 3. This experiment compares the non-optimal algorithms against the optimal algorithm. We then compare two algorithms, LPREdge and LS, on workflows with a larger number of nodes, ranging from 10 to 50
with an increment of 10. The second experiment verifies that our algorithm performs better than the heuristic algorithm in terms of makespan.
As the network resource graph, we use the Abilene network [11] (see Figure 4-8), which is deployed in practice. The resource capacities of the nodes of the network resource graph, as well as the bandwidth capacities of the edges, are randomly selected from a uniform distribution between 10 and 1024. For workflow generation, we can either generate workflows randomly [26, 28, 42, 59, 66] or synthesize them from a set of pre-determined workflows [60]. In our experiments, we use a random workflow generation method that depends on three parameters: the number of nodes, the average degree of nodes, and the communication-to-computation ratio (CCR). The number of nodes is varied according to the aforementioned experiments. The average degree of nodes is related to the level of parallelism of workflows and is fixed at 2. CCRs of 0.1, 1, and 10 are used to assess the impact of the communication factor on the performance of the algorithms; a larger CCR means a workflow is more data-intensive. The weights of the nodes of a workflow are randomly selected from a uniform distribution between 10 and 1024, in the same way the resource capacities of the Abilene network are determined. The weights of the edges of a workflow are then set to the CCR times a value drawn from the uniform distribution between 10 and 1024. One hundred trials were run for every combination of workflow parameters: the number of nodes, the CCR, and the chosen algorithm. We then averaged the results and plotted charts for performance evaluation.
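A minimal generator in this spirit might look as follows (our own sketch; the dissertation specifies only the three parameters, so details such as the acyclicity scheme are assumptions):

```python
# A minimal random workflow generator in the spirit of Section 4.6.1 (our own
# sketch; the exact generator used in the experiments is not specified beyond
# its three parameters).

import random

def random_workflow(n_nodes, avg_degree, ccr, lo=10, hi=1024, seed=0):
    rng = random.Random(seed)
    node_w = {i: rng.uniform(lo, hi) for i in range(n_nodes)}
    edges = {}
    n_edges = max(0, round(avg_degree * n_nodes / 2))
    while len(edges) < n_edges:
        u, v = sorted(rng.sample(range(n_nodes), 2))  # u < v keeps the graph acyclic
        edges[(u, v)] = ccr * rng.uniform(lo, hi)     # edge weight scaled by CCR
    return node_w, edges

nodes, edges = random_workflow(n_nodes=10, avg_degree=2, ccr=10)
assert len(nodes) == 10 and len(edges) == 10
assert all(u < v for (u, v) in edges)                 # DAG by construction
```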
Even though we present formulations for both single and multiple workflow scheduling, we have conducted experiments for single workflow scheduling only, because every multiple workflow scheduling instance can be transformed into a single large workflow scheduling instance.
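The transformation of several workflows into one instance amounts to a disjoint union of their DAGs, e.g. (the node renaming scheme is our own):

```python
# Merging multiple workflows into one scheduling instance (a sketch of the
# transformation mentioned above; the node renaming scheme is our own).

def merge_workflows(workflows):
    """workflows: list of (node_weights, edges) pairs with local node ids.
    Prefix each node with its workflow index so that the union is one big DAG
    with several connected components."""
    nodes, edges = {}, {}
    for n, (w, e) in enumerate(workflows):
        for v, wt in w.items():
            nodes[f"w{n}:{v}"] = wt
        for (u, v), wt in e.items():
            edges[(f"w{n}:{u}", f"w{n}:{v}")] = wt
    return nodes, edges

wf_a = ({'N1': 5, 'N2': 7}, {('N1', 'N2'): 3})
wf_b = ({'N1': 2}, {})
nodes, edges = merge_workflows([wf_a, wf_b])
assert set(nodes) == {'w0:N1', 'w0:N2', 'w1:N1'}
assert edges == {('w0:N1', 'w0:N2'): 3}
```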
As the MILP/LP solver, we used CPLEX, a popular commercial software package. The machines on which CPLEX was installed have the following specification: a 2 GHz dual-core AMD Opteron 280 processor and 7 GB of memory.
Figure 4-8. The Abilene network
4.6.2 Results
We evaluate the performance of workflow scheduling algorithms with regard to two
metrics: schedule length of workflows, i.e., makespan, and computational (running) time.
The detailed results with explanation are presented in this subsection.
4.6.2.1 Schedule length of workflows
Comparison against optimal scheduling results. Since the optimal schedules for randomly generated workflows on the given network resource graph, the Abilene network, are not known ahead of time, the only way to evaluate the makespans of the non-optimal algorithms is to compare them against the makespan of the optimal algorithm.
In Figure 4-9, we can see that the performance of the non-optimal algorithms, i.e., LPR, LPREdge, and LS, is comparable to that of the optimal algorithm, MILP, when CCR = 0.1 and 1.0. However, as CCR grows to 10, the makespan of LS becomes roughly twice the makespan of MILP. In contrast, the makespans of LPR and LPREdge are at most 20% above the optimal makespan.
Comparison between LPREdge and LS. As the general workflow scheduling problem is NP-hard, our corresponding formulation, MILP, requires exponential computational time as the size of workflows increases. For large workflows, it is impractical to determine the optimal makespan using the MILP algorithm. For this reason, we compare the makespans of only our non-optimal algorithms, LPREdge and
Figure 4-9. Makespan vs. CCR for all algorithms in the Abilene network when the number of nodes in a workflow is 3.
LS, in Figure 4-10. We can see that LPREdge performs much better than LS; it achieves half the makespan of LS in some cases.
Figure 4-10. Makespan vs. CCR and the number of nodes in a workflow for LPREdge and LS in the Abilene network
4.6.2.2 Computational time
Comparison against optimal scheduling results. The running time of the optimal algorithm grows exponentially, as shown in Figure 4-11. This algorithm takes approximately 14 seconds when there are 3 nodes and CCR = 0.1. With 3 nodes, the running time becomes approximately 47 seconds when CCR = 10. When the number of nodes is increased to 10 with CCR = 0.1, MILP takes more than 1,500 seconds. By contrast, LPREdge takes less than 5 seconds when there are 3 nodes and less than 150 seconds when the number of nodes is less than 50 (Figure 4-12).
Figure 4-11. Computational time vs. CCR for all algorithms in the Abilene network when the number of nodes in a workflow is 3.
Comparison between LPREdge and LS. The running time of the heuristic is a few
seconds, whereas the running time of LPREdge increases roughly linearly, reaching 150
seconds when the number of nodes is 50.
Figure 4-12. Computational time vs. the number of nodes in a workflow for LPREdge and LS in the Abilene network.
If workflow scheduling requests from users are on-demand and must be handled
in real time, the computational times observed in the experiments favor only the fast
greedy algorithm. However, when requests are made in advance, so that there is
enough time between request arrival and request start, and the centralized
server is a higher-end machine, LPREdge is deployable in practice.
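The deployment rule just described can be sketched as a simple dispatch policy. The function name, the runtime-estimate argument, and the 2x safety margin below are illustrative assumptions, not part of an implemented system:

```python
def choose_scheduler(lead_time_sec, lpredge_runtime_est_sec):
    """Pick a scheduling algorithm based on the gap between request
    arrival and requested start time: use the slower but higher-quality
    LPREdge only when that gap can absorb its estimated running time
    (with a 2x safety margin); otherwise fall back to the fast
    list-scheduling heuristic."""
    if lead_time_sec >= 2 * lpredge_runtime_est_sec:
        return "LPREdge"
    return "LS"

# An in-advance request with a 10-minute lead time vs. an on-demand one.
print(choose_scheduler(600, 150))  # LPREdge
print(choose_scheduler(10, 150))   # LS
```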
4.7 Summary
We have formulated workflow scheduling problems in e-Science networks whose
goal is to minimize either makespan or network resource consumption by jointly
scheduling heterogeneous resources such as compute and network resources. The
formulations differ from previous work in the literature in that they allow dynamic
multiple paths for data transfer between tasks and more flexible resource allocation
that may vary over time. In addition, the formulation for single-workflow scheduling
can be easily extended to multiple-workflow scheduling. Because the computation
time of the optimal formulation increases exponentially with the size of a workflow,
we developed an LP relaxation algorithm, referred to as LPR, for practical deployment
by applying the common linear relaxation technique to the optimal formulation. We
also propose the edge-path form LP relaxation algorithm, LPREdge, to improve time
complexity.
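The relaxation step can be illustrated with a toy rounding pass: solve the LP with the 0/1 assignment variables relaxed to [0, 1], then map each task to the resource carrying its largest fraction. The data layout and the largest-fraction rule below are illustrative assumptions, not the exact LPR procedure:

```python
def round_lp_assignment(x):
    """x[t][r] is the fractional value of the relaxed 0/1 variable
    assigning task t to resource r (each row sums to 1, as it would in
    a solved LP relaxation). Round each task to its dominant resource."""
    return {t: max(fracs, key=fracs.get) for t, fracs in x.items()}

# Hypothetical fractional solution returned by an LP solver.
x = {"t1": {"r1": 0.7, "r2": 0.3},
     "t2": {"r1": 0.2, "r2": 0.8}}
print(round_lp_assignment(x))  # {'t1': 'r1', 't2': 'r2'}
```

The relaxed LP is solvable in polynomial time, which is what buys LPR its large speedup over the exponential MILP at a modest cost in schedule quality.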
The experimental results show that the makespan of LPR and LPREdge is
comparable to that of the optimal algorithm (at most 20% longer) regardless of CCR
for small workflows. In contrast, the general list scheduling algorithm, LS, performs
roughly on par with LPR and LPREdge when CCR = 0.1, but the gap between
LPR/LPREdge and LS grows dramatically as CCR grows from 1 to 10. Data-intensive
workflow scheduling, which is common in e-Science applications, can therefore benefit
from dynamic multiple paths and malleable resource allocation. In terms of
computational time, the heuristic algorithm is of course the fastest because it requires
only trivial computations. The LPR and LPREdge algorithms require more computation,
which may take a few minutes when the number of nodes is 50. Infrequent workflow
scheduling requests from users and reasonable lead time between the arrival and start
times of requests can mitigate this burden.
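For reference, the list scheduling baseline LS follows the classic pattern sketched below. The longest-task priority rule and the omission of communication costs are simplifications for illustration, not the exact variant used in our experiments:

```python
def list_schedule(tasks, deps, num_machines):
    """Generic list scheduling: repeatedly pick the highest-priority
    ready task and place it on the machine that finishes it earliest.
    tasks: task -> execution time; deps: task -> set of predecessors.
    Communication costs are ignored in this toy version."""
    finish = {}                      # task -> finish time
    free = [0.0] * num_machines      # machine -> time it becomes free
    done = set()
    while len(done) < len(tasks):
        ready = [t for t in tasks
                 if t not in done and deps.get(t, set()) <= done]
        t = max(ready, key=lambda t: tasks[t])   # longest task first
        est = max((finish[p] for p in deps.get(t, set())), default=0.0)
        m = min(range(num_machines), key=lambda i: max(free[i], est))
        start = max(free[m], est)
        finish[t] = start + tasks[t]
        free[m] = finish[t]
        done.add(t)
    return max(finish.values())      # makespan

# A 4-task diamond-like workflow on 2 machines.
tasks = {"a": 2, "b": 3, "c": 1, "d": 2}
deps = {"c": {"a", "b"}, "d": {"c"}}
print(list_schedule(tasks, deps, 2))  # 6.0
```

Each scheduling decision is a simple scan, which is why LS finishes in seconds even for the 50-node workflows that take LPREdge minutes.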
To the best of our knowledge, the optimal algorithm, the MILP formulation, is
the first algorithm that jointly schedules heterogeneous resources including network
resources using dynamic multiple network paths and malleable resource allocation.
The approximation based on the optimal algorithm achieves reasonable performance
compared with the optimal algorithm in terms of schedule length (makespan). The
application of these results to optical networks will be future work.
CHAPTER 5
CONCLUSIONS
We propose to develop a novel framework for provisioning a variety of e-Science
applications that require complex workflows that span over multiple domains. Our
framework will provide guarantees on the performance while incurring minimal
overhead, both necessary conditions for such a framework to be adopted in practice.
We have already developed an SDF-based model for iterative data-dependent
e-Science applications that incorporates variable communication delays and temporal
constraints, such as throughput. We formulated the problem as a variant of
multicommodity-flow linear programming with the objective of minimizing network
resource consumption while meeting temporal constraints.
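In generic form, such a formulation minimizes total bandwidth consumption subject to flow conservation and link capacity, with throughput constraints added per iteration. The notation below is illustrative rather than the exact model developed earlier:

```latex
\min \sum_{e \in E} \sum_{t=1}^{T} c_e \, f_e(t)
\quad \text{s.t.} \quad
\sum_{e \in \delta^{+}(v)} f_e(t) - \sum_{e \in \delta^{-}(v)} f_e(t) = b_v(t)
\;\; \forall v \in V,\, t,
\qquad
0 \le f_e(t) \le u_e \;\; \forall e \in E,\, t,
```

where $f_e(t)$ is the flow on link $e$ in time slot $t$, $u_e$ the link capacity, and $b_v(t)$ the net supply or demand at node $v$.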
We also proposed topology aggregation algorithms for e-Science networks.
E-Science applications require high-quality intradomain and interdomain QoS
paths, and some of them are distinct from classic single-path single-job (SPSJ)
applications. We defined a new class of requests, called multiple-path multiple-job
(MPMJ), and proposed TA algorithms for this new class of applications. The proposed
algorithms, star and partitioned-star ARs, are shown to be significantly better than naive
approaches.
Finally, we formulated workflow scheduling problems in e-Science networks
whose goal is to minimize either makespan or network resource consumption by jointly
scheduling heterogeneous resources such as compute and network resources. The
formulations differ from previous work in the literature in that they allow dynamic
multiple paths for data transfer between tasks and more flexible resource allocation
that may vary over time. The LP relaxation algorithm for practical deployment has been
developed from the optimal algorithm through the common linear relaxation technique.
We also proposed the edge-path form LP relaxation algorithm to improve time
complexity.
BIOGRAPHICAL SKETCH
Eun-Sung Jung received B.S. and M.S. degrees in electrical engineering from Seoul
National University, Korea, in 1996 and 1998, respectively. His research interests include
network optimization in connection-oriented networks and its applications to existing
research networks.