2010 11th IEEE/ACM International Conference on Grid Computing (GRID), Brussels, Belgium, 25–28 October 2010

Reliable Workflow Execution in Distributed Systems for Cost Efficiency

Young Choon Lee1, Albert Y. Zomaya1, Mazin Yousif2

1Centre for Distributed and High Performance Computing

School of Information Technologies

The University of Sydney

NSW 2006, Australia

{yclee,zomaya}@it.usyd.edu.au

2Global Technology Services,

IBM Canada

[email protected]

Abstract—Reliability is of great practical importance in distributed computing systems (DCSs) due to its immediate impact on system performance, i.e., quality of service. The issue of reliability becomes particularly crucial for 'cost-conscious' DCSs like grids and clouds, since unreliability brings about additional, often excessive, capital and operating costs. Resource failures are considered the main source of unreliability in this study. We investigate the reliability of workflow execution in the context of scheduling and its effect on operating costs in DCSs, and present the reliability for profit assurance (RPA) algorithm as a novel workflow scheduling heuristic. The proposed RPA algorithm incorporates an (operating) cost-aware replication scheme to increase reliability. The incorporation of cost awareness contributes greatly to efficient replication decisions in terms of profitability. To the best of our knowledge, the work in this paper is the first attempt to explicitly take into account (monetary) reliability cost in workflow scheduling.

Keywords—distributed computing; cost-efficient computing; workflow; scheduling; reliability

I. INTRODUCTION

Distributed systems (grids, data centres and clouds) provide an unparalleled level of computational horsepower for solving challenging problems across a wide spectrum of fields, from scientific inquiry, engineering design and financial analysis to national defense and disaster prediction. Such horsepower, however, usually comes at an enormous cost, not only to provision resources but also to operate them (i.e., capital and operating expenditure). Therefore, the efficient resource management of these costly systems is a major issue in both research and practice. Often, resources in DCSs are made publicly available with costs associated with resource usage, i.e., a pay-per-use pricing model. Typical examples of such systems are market-based grid environments and recent cloud systems [2], [11], which are cost conscious. Clearly, the reconciliation of conflicting objectives between providers and consumers is a major issue that needs to be addressed. In other words, the resource provider aims to accommodate/process as many requests as possible with the main objective of maximizing profit; this may conflict with the consumer's performance requirements (e.g., response time). This already complex issue becomes far more intricate when resource failures are explicitly taken into account.

Reliability is of great practical importance to DCSs for various reasons, including its immediate impact on system performance, such as meeting quality of service constraints [24]. DCSs can be best characterized by their dynamic nature; that is, they typically consist of a large number of heterogeneous, geographically dispersed resources and deal with very diverse applications. Each machine/server in these systems consists of multiple hardware components, including CPUs, disks, memory modules and network devices. As a result, the occurrence of (resource) failures is not rare [23]. Issues in system reliability include availability, fault tolerance and security; these issues have both a direct and an indirect relationship with cost. In other words, downtime incurs both direct (financial) losses and indirect losses (degraded quality of service and, in turn, possible loss of consumers). For example, each 0.1% decrease in service availability results in roughly 9 hours of additional downtime per year, and this increase in downtime costs about $1 million for every $1 billion in annual revenue [24]. Unreliability or instability of DCSs can result from various factors, ranging from typical hardware and software failures to uncertainties in resource participation. Hardware (resource) failures are considered the main source of unreliability in this paper. While the issue of reliability has been extensively studied in the past, most previous efforts tend to neglect its cost implications; however, these cost implications are a major concern, particularly in 'cost-conscious' DCSs like grids and clouds [16], [18]. Performance degradation caused by unreliability is most likely to adversely affect the cost-to-revenue ratio; that is, unreliability brings about additional (often excessive) capital and operating costs.

User applications that require the processing of workflow (mashup) services (e.g., [3], [4], [10], [17]) are common in grids and clouds. As a result, many workflow management systems, such as Condor DAGMan [21], GrADS [7] and Pegasus [9], have been developed and integrated as part of resource management systems in grids. There have also been a number of studies on market-based resource allocation and reliable/robust scheduling for workflow applications in DCSs, e.g., [1], [12], [18], [22]. Most previous workflow scheduling algorithms that take reliability into account focus on reliable workflow execution in terms of performance assurance. However, these techniques do not consider the issues of reliability and cost simultaneously; that is, these two metrics have mostly been dealt with separately. What's more, the impact of system unreliability on workflow applications (i.e., the recursive impact of resource failures on inter-dependent tasks) is not explicitly considered.

We consider a scenario in which a resource provider and a resource consumer reach a service level agreement (SLA). A consumer/client specifies a workflow and its characteristics, e.g., a (soft) deadline, resource requirements, and price. Note that the resource provider may be a single administrative entity (e.g., Amazon, Google or Microsoft), or it can be a high-level (global) scheduler in a grid environment. A resource provider attempts to guarantee quality of service to a certain degree, in a way that maximizes its profit. In this scenario, resource failures should be explicitly and proactively addressed; otherwise, QoS is severely compromised and profit is reduced accordingly. Specifically, a resource failure not only affects the completion of a given workflow execution, but also increases the usage of the resources provisioned (or reserved) for other tasks in the workflow, due to the delay incurred by that failure and the precedence constraints of the workflow. Those reserved resources can be seen as the result of resource co-allocation using advance reservation at the arrival of the workflow in grids, or of resource provisioning for the next time frame (epoch) at the finalization of a given bidding event in clouds.

In this paper, we investigate the reliability of workflow execution in the context of scheduling and its effect on operating costs in DCSs, and present the reliability for profit assurance (RPA) algorithm as a new workflow scheduling heuristic. At its core, the proposed RPA algorithm incorporates an (operating) cost-aware replication scheme to increase reliability. The replication scheme effectively identifies 'replicability' taking into account replication cost, penalty (incurred by deviating from SLA targets) and failure characteristics. The incorporation of cost awareness contributes greatly to judicious replication decisions in terms of profitability. Note that RPA is generic enough to deal with both the interdependent tasks of workflow applications and individual independent jobs, although workflow applications are practically the best application model RPA can exploit for cost efficiency. To the best of our knowledge, the work in this paper is the first attempt to explicitly incorporate (monetary) reliability cost into workflow scheduling.

The rest of the paper is organized as follows. Section 2 describes the system, workflow and reliability models used in this paper. Section 3 presents upper bound calculations of workflow completion time in the presence of estimation errors. In Section 4 we present our workflow scheduling algorithm, explicating its cost-aware replication scheme. Experimental results and our interpretation of them are presented in Section 5. Section 6 discusses related work. We then summarize our work and draw conclusions in Section 7.

II. MODELS

In this section, we describe the scheduling scenario considered in this work in the context of reliable workflow execution. The workflow execution model in this study is described as the process of allocating a set of resources provisioned by a resource provider to a stream of workflow submissions from multiple consumers without violating SLAs.

A. System model

A distributed computing system in this study consists of a set P of p heterogeneous computing resources (server computers, compute nodes or simply processors); these resources are fully interconnected in the sense that a route exists between any two individual resources. The inter-resource/processor communications are assumed to perform at high speed on all links without substantial contention. We also assume resources can be reserved in advance, i.e., advance reservation is possible at the time of a scheduling event (request arrival). Since a market-based resource usage model is adopted in this study, a constant cost ac (the advertised cost) per unit time of resource occupancy is charged to the resource consumer; this advertised cost is derived from the net operating cost oc as ac = oc/(1 − pr_net), where pr_net is the net profit rate set by the provider. By resource occupancy we mean that the resource is exclusively allocated to a particular request. The occupancy time of a resource includes the actual task execution time, idle time due to inaccurate execution time estimates, and waiting time caused by delays (from resource failures) in the completion of predecessor tasks.
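To make the pricing relation concrete, here is a minimal Python sketch; it is ours, not from the paper, and the function name and sample numbers are illustrative.

# Sketch: advertised cost per unit time of resource occupancy,
# derived from the net operating cost and the provider's net
# profit rate, as defined above.
def advertised_cost(oc, pr_net):
    return oc / (1.0 - pr_net)   # ac = oc / (1 - pr_net)

print(advertised_cost(10, 0.2))  # oc = 10, pr_net = 20% -> ac = 12.5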

B. Application model

Workflow applications are basically the same as typical parallel programs, with one exception: a workflow application consists of a set of interdependent applications (not partitioned tasks of a parallel program). Like conventional parallel programs, workflow applications can be represented by a directed acyclic graph (DAG). A DAG, G = (N, E), consists of a set N of n nodes, and a set E of e edges. A DAG (Figure 1) is also known as a task graph or macro-dataflow graph. The nodes usually represent tasks of a workflow application, and the edges usually represent

precedence constraints. An edge (i, j) ∈ E between task ni and task nj represents inter-task communication. Specifically, the output of task ni must be transmitted to task nj for task nj to start its execution. A task with no predecessors is called an entry task, nentry; an exit task, nexit, is one that has no successors. Among the predecessors of a task ni, the predecessor that completes its communication at the latest time is the most influential parent (MIP) of the task, denoted MIP(ni). A task is called a ready task if all of its predecessors have been completed. The longest path of a task graph is the critical path (CP).

The weight on a task ni, denoted as wi, represents the computation time of the task. The computation time of task ni on processor pj is wi,j. The weight on an edge, denoted as ci,j, represents the communication time between two tasks, ni and nj. The communication time is only required when two tasks are assigned to different processors.

The earliest start and finish times of a task ni on a processor pj are defined as:

$$EST(n_i, p_j) = \begin{cases} 0 & \text{if } n_i = n_{entry} \\ EFT(MIP(n_i), p_k) + c_{MIP(n_i),i} & \text{otherwise} \end{cases}$$

$$EFT(n_i, p_j) = EST(n_i, p_j) + w_{i,j}$$

where $p_k$ is the processor executing $MIP(n_i)$.
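The EST/EFT recurrence can be evaluated in a single topological pass over the task graph. The following Python sketch is ours; preds, w, c, assigned and eft_done are hypothetical structures holding the weights and placements defined above, and the communication cost is waived when a predecessor shares the processor, as stated above.

# Sketch: EST/EFT per the recurrence above. preds[i]: predecessors of
# task i; w[(i, p)]: computation time of i on processor p; c[(k, i)]:
# communication time on edge (k, i); assigned[k]/eft_done[k]: processor
# and EFT already fixed for predecessor k. All names are illustrative.
def est(i, p, preds, c, assigned, eft_done):
    if not preds[i]:  # entry task
        return 0.0
    # the max term is realized by MIP(n_i), the most influential parent
    return max(eft_done[k] + (c[(k, i)] if assigned[k] != p else 0.0)
               for k in preds[i])

def eft(i, p, preds, w, c, assigned, eft_done):
    return est(i, p, preds, c, assigned, eft_done) + w[(i, p)]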

The communication to computation ratio (CCR) is a measure that indicates whether a task graph is communication intensive, computation intensive or moderate. For a given task graph, it is computed by the average communication cost divided by the average computation cost on a target system.

C. Reliability model

Cost efficiency in computing systems is becoming increasingly important, particularly because current practices of resource management suffer from low performance-to-cost ratios and struggle to deal with them. In this regard, reliability in DCSs is of great practical importance, since resources in these systems are essentially dynamic and face various unreliability issues, including resource failures.

The resource failure rate function is defined as:

$$\lambda(t) = \lim_{\Delta t \to 0} \frac{1}{\Delta t}\, P(t < T \le t + \Delta t \mid T > t).$$

The reliability of a resource pj during the execution of a task ni is defined as:

$$r(w_{i,j}) = \exp(-\lambda\, w_{i,j}).$$

The unreliability $ur$ (which can be seen as the probability of failure) is then defined as:

$$ur(w_{i,j}) = 1 - r(w_{i,j}).$$

For a given workflow application, a single resource failure can often cause a significant delay in its completion; specifically, such a delay results not only from the re-execution of the task running on that failed resource, but also from a series of delays to tasks whose execution depends on the completion of the discontinued task, i.e., a cascading effect. When a resource fails during the execution of a task in a given workflow application, the task should be rescheduled on an alternative resource. In such a case, the completion time on the new resource is most likely to increase due to two types of delay, i.e., processing delay and rescheduling delay. The processing delay pdi of a task ni rescheduled on a new resource pk is defined as:

$$pd_i = w_{i,k} - w'_{i,j}$$

where $w'_{i,j}$ is the remaining execution time on the failed resource $p_j$ at the time of failure. The rescheduling delay is assumed to be a constant value $rd$. The actual delay in the completion of that task is then defined as:

$$ad_i = pd_i + rd$$
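A minimal Python sketch (ours, not the authors' code) of the reliability and delay model above, assuming the constant failure rate λ; names and sample numbers are illustrative.

import math

# Sketch of the reliability model above with a constant failure rate
# lam (λ). w_ij is a task's execution time on its scheduled resource.
def reliability(lam, w_ij):
    return math.exp(-lam * w_ij)          # r(w_ij) = exp(-λ w_ij)

def unreliability(lam, w_ij):
    return 1.0 - reliability(lam, w_ij)   # ur(w_ij) = 1 - r(w_ij)

def actual_delay(w_ik, w_rem_ij, rd):
    # pd_i = w_ik - w'_ij; ad_i = pd_i + rd (rd: rescheduling delay)
    return (w_ik - w_rem_ij) + rd

# Example: λ = 0.5% failures per unit time, a 100-unit task
print(unreliability(0.005, 100))          # ~ 0.39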

III. PERFORMANCE BOUNDS

Due to uncertainty and inaccuracy in performance estimation in a heterogeneous distributed system, the expected/estimated completion time (or makespan) of a workflow may not be realized. For practical purposes, it may be beneficial to compute an upper bound or confidence interval for the calculated finish time of an application. Such a computation helps the valuation of a given workflow when specifying SLA parameters, and also helps scheduling from the resource provider's perspective. In this section we discuss the upper bound (within the 95 percent confidence interval) of the estimated finish time of a workflow in the presence of prediction errors.

Figure 1. Task graph. Case 1: a serial chain of dependent tasks; Case 2: three branches of independently executing tasks.

The DCS is assumed to have unbiased predictors (for processors' and links' capacity/performance) whose errors are normally distributed with zero mean and a constant standard deviation. Such error leads to an actual task completion time that is normally distributed and centered at the predicted task completion time.

In a workflow schedule, there may be cases where tasks are executed one at a time; such a case arises when each successive task requires results from the previous one (see Case 1 in Figure 1). In such a case, the completion time distribution of this particular batch of tasks, which includes computation and communication time, may be calculated by summing the individual tasks' completion time distributions:

$$Z = X_1 + X_2 + \cdots + X_n,$$

where Xi is a random variable representing a computation time or communication time. If each Xi is normally distributed, then Z is also normally distributed. A workflow schedule may also consist of multiple tasks being executed on different processors at the same time, if the interdependency (or rather independence) of the tasks in the application allows this (see Case 2 in Figure 1, where there are three branches of independently executing tasks). In such a case, the completion time distribution is the maximum of all the branch distributions, that is,

$$Z = \max(X_1, X_2, \ldots, X_n),$$

where $X_i$ is a random variable representing a branch's completion time. Computing both the maximum and the summation of individual distributions would involve possibly time-consuming integrals, and for task graphs with thousands of nodes this may lead to unacceptable computation times. Instead, we consider here a simple approximation to the upper bound (of the 95 percent confidence interval) of the calculated finish time. Assuming a relatively large task graph with good task divisibility, our proposed scheduling algorithm would yield a schedule where multiple tasks are scheduled and executing at the same time on different processors, with little gap (idle time) on the processors; that is, the processors are relatively fully utilized. First, with multiple branches in the schedule, the finish time distribution can be approximated as the maximum of all the branch distributions. Second, with little gap in the branches and, as stated previously, errors that are normally distributed with zero mean and constant standard deviation (expressed as a percentage), each branch distribution would be approximately identical and is the summation of the elements in the branch. Finally, as we are dealing with large task graphs (which may contain thousands of nodes), for large-scale applications and systems in our scheduling model we can take the limit of the distribution of the maximum of the branch distributions as the number of branches grows large (to infinity). From the theory of the Generalized Extreme Value (GEV) distribution [15], the cdf of the finish time can be expressed as:

$$\Pr(X \le x) = \exp\left(-\left[1 + \xi\,\frac{x-\mu}{\sigma}\right]^{-1/\xi}\right), \quad 1 + \xi\,\frac{x-\mu}{\sigma} > 0,\ \mu \in \mathbb{R},\ \sigma > 0,\ \xi \in \mathbb{R},$$

where $\mu$ is the location parameter, $\sigma$ is the scale parameter and $\xi$ is the shape parameter. When the distributions are normal, as in our case, the GEV distribution reduces to a Type I extreme value distribution with $\xi = 0$ and the following cdf:

$$\Pr(X \le x) = \exp\left(-e^{-(x-\mu)/\sigma}\right).$$

To calculate the upper bound of a schedule, the CP is first identified. This CP would represent a branch in our upper bound calculation. From the CP, the branch time distribution and the extreme value distribution can be calculated, and hence the upper 95% confidence interval. Our approximation is a conservative estimate of the upper bound and has the following property:

$$U(S) \le U^*(S),$$

where $S$ is the schedule generated by our proposed scheduling algorithm, $U(S)$ is the actual upper bound (95% confidence interval), and $U^*(S)$ is our approximation to the upper bound. Such an approximation can be useful for planning purposes.
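The approximation can be sketched as follows; this is our Python rendering, not the authors' code. The CP is treated as one branch whose completion time is the sum of normally distributed elements, and the maximum over b roughly identical branches is approximated by a Type I (Gumbel) distribution. The normal-maximum normalizing constants used to pick (μ, σ) are a standard choice and an assumption on our part.

import math
from statistics import NormalDist

# Sketch: 95% upper bound of workflow finish time, assuming the
# critical path (CP) is a sum of independent normal task/communication
# times and the schedule behaves like b identical parallel branches.
def cp_upper_bound(cp_means, cp_stds, branches, q=0.95):
    m = sum(cp_means)                              # branch (CP) mean
    s = math.sqrt(sum(sd * sd for sd in cp_stds))  # branch std dev
    if branches <= 1:
        return NormalDist(m, s).inv_cdf(q)
    a = math.sqrt(2.0 * math.log(branches))
    mu = m + s * (a - (math.log(math.log(branches))
                       + math.log(4.0 * math.pi)) / (2.0 * a))
    sigma = s / a
    # Gumbel quantile: Pr(X <= x) = exp(-exp(-(x - mu)/sigma))
    return mu - sigma * math.log(-math.log(q))

print(cp_upper_bound([10, 20, 15], [1, 2, 1.5], branches=8))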

IV. RELIABLE WORKFLOW SCHEDULING FOR COST EFFICIENCY

The reliable workflow scheduling problem addressed in this paper can be described as the optimization of cost efficiency by minimizing the deterioration of SLA compliance, more specifically, by meeting SLA goals as closely as possible. The main objective is the maximization of profitability by reducing the costs incurred specifically by resource failures.

A. Economics of workflow execution

Resource consumers place requests for resource allocation with valuations of their jobs/applications, e.g., the price they are willing to pay for processing their jobs. For a given workflow, its valuation is carried out based primarily on its performance characteristics, including the processing times of its constituent tasks and their time criticality. These characteristics can be derived and determined from estimates obtained by the calculations detailed in Sections 2 and 3. From the resource provider's perspective, the scheduling of a stream of workflow applications, with the main objective of maximizing profit, is a major challenge, especially when taking resource failures into account. The revenue of a provider for a particular time frame is defined as the total of the values charged to consumers for processing their applications during that time frame. These conflicting objectives (i.e., response time vs. profit) are negotiated, and the compromised results are used to form an SLA. In the SLA for a given application, the consumer specifies the maximum value VMAX they are willing to pay and the value decay rate α for the response time. The provider makes a schedule plan according to these pieces of information.

We assume that each consumer application has an expectation of response time t; i.e., application G expects t < TMAX, where TMAX represents the maximum acceptable response time of application G. We denote the value of finishing the processing of a request of application G in time t as v(t); obviously, this value should be greater than the (minimum) price the consumer has to pay. Note that v is inversely related to t. We define a time-varying utility function as below:

$$v(t) = \begin{cases} V_{MAX}, & t \le T_{MIN} \\ V_{MAX} - \alpha\,(t - T_{MIN}), & T_{MIN} < t \le T_{MAX} \\ 0, & t > T_{MAX} \end{cases}$$

where TMIN is the minimal time required for serving a request of G. The VMAX value of an application request shall be proportional to the processing time of that application. The function is similar to that in [8]; however, the way we treat TMIN is different: TMIN is not the mean service time of a request, as in [8], but a dynamic value that takes into account the dependencies involved in processing the request.


Algorithm 1: RPA

1:  Let Q = Ø
2:  while ∃ ni ∈ N such that ni is not scheduled do
3:      Let n* = the first ready task in N to be scheduled
4:      Let p* = the processor in P on which n*'s EFT is the smallest
5:      Reserve p* for n*
6:      Enqueue n* to Q
7:  end while
8:  Let ms = the makespan based on the current schedule
9:  while Q ≠ Ø do
10:     Let ni = the first task in Q
11:     Dequeue Q
12:     Let cdi = compute_cumulative_delay(ni, wi/2)
13:     Compute mii based on ms            // makespan increase
14:     Let loss = VMAX − (mii·α + cdi·oc)
15:     Let actLoss = loss · ur(wi,*)
16:     Let pj = the processor in P on which ni's EFT is the smallest
17:     Let repCost = wi,j·oc
18:     if actLoss > repCost then
19:         Let ni′ = a replica of ni
20:         Assign ni′ to pj
21:     end if
22: end while

Algorithm 2: compute_cumulative_delay

Input: a task ni ∈ N and the estimated delay edi

1:  for ∀ ni,j ∈ succ(ni) do
2:      Let wi,j = EFT(ni,j) − EST(ni,j)
3:      Update EFT(ni,j) considering pdi
4:      if EFT(ni) > EST(ni,j) then
5:          Let EST(ni,j) = EFT(ni)
6:      end if
7:      Let di,j = EFT(ni,j) − EST(ni,j) − wi,j
8:      Let cdi = cdi + di,j
9:      Let cdi = cdi + compute_cumulative_delay(ni,j, edi)   // recursive call
10: end for
11: Return cdi
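Below is a literal Python rendering (ours) of Algorithm 2, under one reading of Steps 3–6: the successor's finish is pushed by the propagated delay, while its start is pushed only up to the delayed parent's finish, so d measures the extra reserved occupancy. succ, est and eft are illustrative maps, and eft[task] is assumed to have already been pushed by the failure delay.

# Sketch: cumulative delay of a discontinued task's successors
# (Algorithm 2). est/eft are mutable maps of reserved start/finish
# times; succ maps each task to its successor tasks.
def compute_cumulative_delay(task, delay, succ, est, eft):
    cd = 0.0
    for s in succ[task]:
        w = eft[s] - est[s]            # Step 2: reserved duration
        eft[s] += delay                # Step 3: finish pushed by the delay
        if eft[task] > est[s]:         # Steps 4-6: start cannot precede
            est[s] = eft[task]         #   the delayed parent's finish
        cd += eft[s] - est[s] - w      # Steps 7-8: extra occupancy d
        cd += compute_cumulative_delay(s, delay, succ, est, eft)  # Step 9
    return cd

# Example: b reserved [12, 22) after a [0, 10); a's failure adds a
# 6-unit delay, pushing b's finish by 6 but its start by only 4.
est, eft = {"a": 0, "b": 12}, {"a": 16, "b": 22}   # a's EFT already pushed
print(compute_cumulative_delay("a", 6, {"a": ["b"], "b": []}, est, eft))  # -> 2.0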

The minimum value VMIN and the maximum value VMAX of application G are defined as:

$$V_{MIN} = \sum_{i=1}^{n} w_i \cdot ac$$

$$V_{MAX} = V_{MIN} + \Delta v$$

where Δv is an extra cost, for prioritizing the workflow request, that may be added after negotiating the SLA; it may be calculated based on TMAX or vice versa. The decay rate α of a workflow application G is inversely related to the maximum acceptable response time TMAX (more specifically, to TMAX − TMIN) and is defined as:

$$\alpha = \frac{V_{MAX}}{T_{MAX} - T_{MIN}}$$

Note that workflow requests may not be processed if their SLA targets cannot be met due to various reasons, including tight temporal constraints and a lack of resource availability. In such a case, the consumer may renegotiate with the provider, relaxing the SLA targets.
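Putting the valuation pieces together, here is a minimal Python sketch (ours); following the continuity of the reconstructed utility function above, the decay term is written as α(t − TMIN). All names and sample numbers are illustrative.

# Sketch: SLA valuation of a workflow following the definitions above.
# w: task computation times; ac: advertised cost per unit time;
# delta_v: negotiated extra value; t_min/t_max: SLA time parameters.
def valuation(w, ac, delta_v, t_min, t_max):
    v_min = sum(w) * ac                 # V_MIN = sum(w_i) * ac
    v_max = v_min + delta_v             # V_MAX = V_MIN + delta_v
    alpha = v_max / (t_max - t_min)     # decay rate alpha
    def v(t):                           # time-varying utility v(t)
        if t <= t_min:
            return v_max
        if t <= t_max:
            return v_max - alpha * (t - t_min)
        return 0.0
    return v_max, alpha, v

v_max, alpha, v = valuation([10, 20, 15], ac=12.5, delta_v=100,
                            t_min=40, t_max=80)
print(v(40), v(60), v(90))              # V_MAX, mid-decay, 0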

B. RPA algorithm

As many tasks in a workflow application can run in parallel, resource co-allocation plays a crucial role in workflow scheduling; that is, efficient resource co-allocation contributes significantly to minimizing the completion time of workflow applications. Resources reserved as a result of co-allocation are often for tasks that are independent of each other, and these tasks may share the same predecessor task. Due to this precedence constraint, those co-allocated resources (more precisely, the tasks on those resources) are vulnerable to the failure of the resource on which that predecessor task runs. This vulnerability becomes much more serious when resource co-allocation activities take place frequently and co-allocated resources are in close temporal proximity. The rationale behind our RPA algorithm is that cost efficiency in workflow scheduling is heavily dependent on the reliable execution of high-impact tasks (normally small in number) that have a large number of successor tasks.

Task replication is a common approach to dealing with resource failures [17]. However, replicas created to increase reliability are clearly a major source of resource wastage; this side effect becomes even more serious when the costs associated with resource usage come into play. Therefore, replication decisions should be made by explicitly weighing reliability benefits against replication costs. RPA (Algorithm 1) incorporates a replication scheme that determines which tasks to replicate in order to increase reliability, to reduce the costs incurred by delays and, ultimately, to maximize profit. Replication is guided by novel cost functions (Steps 14–17) that enable the identification of 'replicability'; they effectively capture and quantify the trade-off between replication costs and reliability benefits.

Figure 2. Average profit rates with respect to different resource heterogeneity values: (a) 100%, (b) 200%, (c) 400%. Each panel plots the average profit rate (0.4–0.6) against the failure rate (0.1%, 0.5%, 2.0%) for Dynamic, Reactive and RPA.

RPA can be characterized by its essentially greedy nature when generating the initial schedule (Steps 2–7). Specifically, for each task it reserves the resource on which the finish time of that task is minimized. The b-level prioritization method is used to order the tasks to be scheduled, since the quality of initial schedules can vary with the scheduling order of tasks. The b-level value of a task is computed by adding the computation and communication costs along the longest path from the task to the exit task in the task graph (including the task itself). The initial schedule then undergoes the replication process of RPA (Steps 9–22). For each task-resource match in the schedule, RPA estimates the cost of a resource failure, or loss (Steps 12–15). This loss is calculated based on two factors, i.e., the penalty due to a makespan increase (mii·α in Step 14) and the costs incurred by delays in resource reservation periods (cdi·oc in Step 14). As stated in Section 2.3, a task discontinuation recursively affects the tasks whose execution depends on it, delaying their start times (Algorithm 2). Since a failure is equally likely to occur at any point during the execution of a given task, its estimated delay is half of its execution time (Step 12). The actual loss is then calculated from the loss (Step 14) and the unreliability of the originally scheduled resource during the execution time of the task (Step 15). If this actual loss is greater than the cost incurred by replicating the task (i.e., if the replication cost is smaller), the task is replicated (Steps 18–21). In the case of a resource failure during actual processing, this cost estimation is performed along with rescheduling, i.e., dynamic rescheduling; here, the rescheduling delay (rd) is taken into account.
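For concreteness, here is a Python sketch (ours) of the replication decision in Steps 12–21 of Algorithm 1, implemented as printed; ur, cum_delay, makespan_increase and best_alt_time are hypothetical callables standing in for the quantities the scheduler maintains.

# Sketch: RPA's replication decision (Algorithm 1, Steps 12-21). For
# each scheduled task we estimate the expected loss from a failure and
# replicate when that loss exceeds the cost of running a replica.
def replicate_decisions(schedule, v_max, alpha, oc, ur, cum_delay,
                        makespan_increase, best_alt_time):
    replicas = []
    for task in schedule:                         # Q in Algorithm 1
        cd = cum_delay(task, task.w / 2.0)        # Step 12: expected delay
        mii = makespan_increase(task)             # Step 13
        loss = v_max - (mii * alpha + cd * oc)    # Step 14, as printed
        act_loss = loss * ur(task)                # Step 15: weight by failure prob.
        rep_cost = best_alt_time(task) * oc       # Steps 16-17
        if act_loss > rep_cost:                   # Step 18
            replicas.append(task)                 # Steps 19-20: replicate
    return replicas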

V. PERFORMANCE EVALUATION

In this section, we describe the performance evaluation method and present and discuss experimental results obtained from an extensive set of simulations.

A. Experiments

Experiments in this study were carried out using our discrete-event DCS simulator developed in C/C++. For a thorough and practical evaluation of RPA, simulation settings were configured with a diverse set of workflow applications and various resource characteristics (Table 1). The total number of experiments conducted is 1,012,500 (337,500 for each algorithm), covering various combinations of the simulation parameter values in Table 1. We used a total of 500 different workflow applications generated based on ten different task graph (workflow) sizes (n), five maximum widths, five different out-degrees and two CCRs. Note that computationally intensive applications are of particular interest in this study; hence, CCRs of 0.1 and 0.2. The computation and communication costs of the tasks in each workflow were randomly selected from a uniform distribution, with the mean equal to the chosen average computation and communication costs. The processor heterogeneity value is defined as the percentage speed difference between the fastest processor and the slowest processor in a given system. Inter-arrival times of workflow requests were generated by a Poisson process, with mean values randomly generated from a uniform distribution. Each workflow in the incoming workflow request stream was randomly selected (from the 500 base workflows) from a uniform distribution.

Comparisons in our evaluation study were conducted between RPA and two algorithms, Reactive and Dynamic, implemented based on typical failure handling techniques. Neither of them adopts task replication. Rather, Reactive identifies and attempts to reserve an alternative resource at the time of a resource failure, whereas Dynamic, as the name implies, allocates a resource for each ready task of a given workflow on the fly; the latter deals with resource failures in the same way as Reactive. Reactive aims to avoid replication costs while attempting to secure resources to meet SLA targets; however, it is still exposed to profit losses from the (recursive) delays incurred by resource failures. On the other hand, Dynamic does not encounter such delays, since resource allocation is carried out one task at a time; its primary source of profit losses is resource allocation (reservation) delay.

TABLE I. SIMULATION SETTINGS

Global
    # different workflow applications:      500
    mean for workflow inter-arrival times:  U(10, 100)
    simulation duration:                    {2,000, 4,000, 8,000, 16,000, 32,000}
    operating cost (oc):                    10
    net profit rate (pr_net):               {20%, 40%, 60%, 80%, 100%}
    reservation delay (rd):                 U(1, 64)
Application
    # tasks per workflow (n):               U(4, 128)
    maximum width:                          {2, 4, 8, 16, 32}
    out degree of a node:                   U(0, 8)
    comm. to comp. ratio (CCR):             ≤ 0.2
Resource
    # resources:                            U(4, 64)
    failure rate:                           {0.1%, 0.5%, 2.0%}
    heterogeneity:                          {100%, 200%, 400%}
    mean time to repair (MTTR):             U(10, 100)

B. Results

The performance of our RPA algorithm was evaluated based on cost efficiency (profit rate) and response rate. For a given processed workflow Gi, the profit rate pri is defined as its actual profit (the difference between the value realized and the processing/resource costs) normalized by its maximum attainable profit. More formally,

$$pr_i = \frac{v(t)_i - oc\left(\sum_{j=1}^{n} (w_{j,*} + pd_j)\right)}{V_{MAX_i} - oc\left(\sum_{j=1}^{n} w_{j,*}\right)}$$

where $v(t)_i$ is the actual value gained through processing request $G_i$ within time $t$, and $w_{j,*}$ is the actual amount of time taken to execute task $n_j$ on the allocated resource $p_*$.

The average profit rate $\overline{pr}$ is then defined as:

$$\overline{pr} = \frac{\sum_{i=1}^{M} pr_i}{M}$$

where $M$ is the total number of workflow execution requests processed.


Figure 3. Average response rates with respect to different resource heterogeneity values: (a) 100%, (b) 200%, (c) 400%. Each panel plots the average response rate (0.6–1.0) against the failure rate (0.1%, 0.5%, 2.0%) for Dynamic, Reactive and RPA.

The response rate rri of a workflow request Gi is defined as:

$$rr_i = \frac{T_{MIN_i}}{t_i}$$

The average response rate $\overline{rr}$ over all applications processed by the provider is then defined as:

$$\overline{rr} = \frac{\sum_{i=1}^{M} rr_i}{M}$$
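A minimal Python sketch (ours) of the two evaluation metrics as defined above; inputs and sample numbers are illustrative.

# Sketch: per-workflow profit rate pr_i and response rate rr_i, plus
# their averages over M processed requests.
def profit_rate(v_t, v_max, oc, w_actual, pd):
    # pr_i = [v(t)_i - oc*sum(w_j* + pd_j)] / [V_MAXi - oc*sum(w_j*)]
    return ((v_t - oc * sum(w + d for w, d in zip(w_actual, pd)))
            / (v_max - oc * sum(w_actual)))

def response_rate(t_min, t):
    return t_min / t                    # rr_i = T_MINi / t_i

def averages(prs, rrs):
    m = len(prs)
    return sum(prs) / m, sum(rrs) / m   # average pr and rr

print(profit_rate(600, 662.5, 10, [10, 20, 15], [0, 5, 0]),
      response_rate(40, 48))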

The overall results obtained from our simulations are summarized in Table 2, followed by results plotted (Figures 2 and 3) for each of the two performance metrics (profit rate and response rate). These results confirm the effectiveness of the replication scheme incorporated into RPA. That is, the average response rate of RPA is very high (93%), resulting from judicious replication that helps avoid delays in makespan; moreover, this performance gain is realized alongside an appealing average profit rate (55%). The performance of RPA tends not to vary significantly with resource heterogeneity, even though replication is adopted; this can be explained by the cost calculations RPA performs, which effectively quantify the trade-off between replication costs and profit assurance. Specifically, the replicas created for a given workflow application are likely to be limited to a small number; our experimental results showed that the number of replicas made is approximately five (4.7) on average. Replication mostly took place for tasks in the upper levels and on the CP of a task graph. Clearly, these tasks (e.g., entry tasks) have relatively more influence on reliable workflow execution than those towards the bottom of a task graph (e.g., exit tasks). While most results are in line with our expectations, the results in Figure 2c draw our attention. Close observation of those results reveals that delays, processing delay in particular, are quite large due to the high resource heterogeneity (400%); this makes a substantial negative impact on resource occupancy (delays) and, in turn, on the response rate. Another noticeable result in our experiments is the poor average profit rate of Reactive, considering that its average response rate is at a very decent level (85%). This is again due to processing delay and its impact on subsequent resource occupancies in Reactive's schedules; however, this impact is trivial in the schedules generated by Dynamic.

TABLE II. OVERALL COMPARATIVE RESULTS

                              Dynamic   Reactive   RPA
avg. profit rate (pr):        49%       48%        55%
avg. response rate (rr):      77%       85%        93%

VI. RELATED WORK

The primary objective of workflow scheduling is to optimize performance in terms of application completion time. Since the dynamic nature of DCSs is a major source of performance uncertainties, some recent studies in workflow scheduling [12], [15], [22] have explicitly considered this unreliability issue in their approaches. With such highly dynamic resources, resource reliability (or resource failures), modeled by a Weibull distribution in a number of previous works [13], [20], may not hold the same practical value; that is, resource failures in DCSs are relatively more stochastic than those in tightly coupled systems like computer clusters. In [12], several failure recovery mechanisms, including retrying, checkpointing and replication, were incorporated as part of a flexible failure handling framework (Grid-WFS). Tao et al. [22] proposed a Markov chain based grid node availability prediction model, which was then used to devise a grid workflow scheduling algorithm for reliable workflow execution; the degree of reliability in [22] was measured based on their reliability cost function. In [19], a reputation-based grid workflow scheduling algorithm was presented to deal with the unreliability of grid resources. What distinguishes our work from these efforts is that it explicitly correlates costs with reliability in its decision-making process.

Cost efficiency is another important factor in DCSs; thus, it has gained a great deal of attention, particularly in recent times, and has become an important aspect of workflow scheduling and, more generally, of resource management. Many previous efforts in market-based (or cost-aware) scheduling [5], [6], [14], [18] have focused on a one-dimensional linear relationship between performance and operating costs; that is, costs are directly proportional to a particular performance metric, such as the completion time of user applications or energy consumption. In general, this direct relationship holds; however, in the case of workflow


execution accounting for resource unreliability, the slope of this curve is not entirely dependent on that independent variable (completion time). Rather, the total amount of time taken (i.e., the total amount of resource usage) to process the tasks in a given workflow application is what correlates with costs (or revenue/profit). D. E. Irwin et al. [14] extended the time-varying resource valuation function in [8] to take into account a penalty when the value is not realized; the optimization therefore also minimizes the loss due to penalty. In our model, the penalty is not directly reflected in the time-varying valuation function; rather, it is implicitly reflected in the cost of physical resource usage in DCSs. The scheduling algorithm (FirstProfit) proposed in [18] uses a priority queue to maximize the profit of each job independently. In [5], [6], energy efficiency was the primary focus for improving cost effectiveness; reduction in energy consumption was achieved using server consolidation, making these cost-effective rather than cost-efficient solutions.

VII. CONCLUSION

A resource failure causes a delay not only in the completion of the task running on it, but also, recursively, in the completion of that task's successors and eventually in the makespan of the corresponding workflow. This recursive impact of resource unreliability on workflow execution is a serious limiting factor for cost efficiency in DCSs. In this paper, the already intricate workflow scheduling problem in DCSs has been revisited, from the viewpoint of reliability, for cost efficiency, and RPA, which incorporates a novel replication scheme, has been presented. Replication is an appealing solution for reliable workflow execution. However, the inherent cost of replication needs to be explicitly addressed, since replicas incur direct operating costs through their redundant resource usage and indirect costs through service refusals due to the resource unavailability caused by that redundant usage. RPA effectively addresses this cost implication of replication by capturing the trade-off between replication costs and profitability when making replication decisions. We have validated that this cost-aware replication is an effective cost optimization technique for workflow execution in DCSs.

ACKNOWLEDGMENT

We would like to thank Dr. Chen Wang, CSIRO, Australia, for his insightful comments. We would also like to thank the anonymous reviewers for their constructive feedback. This work was supported by an Australian Research Council Grant DP1097110.

REFERENCES

[1] D. Abramson, R. Buyya, and J. Giddy, "A computational economy for Grid computing and its implementation in the Nimrod-G resource broker," Future Generation Computer Systems, vol. 18, no. 8, pp. 1061–1074, 2002.

[2] Amazon Web Services, http://aws.amazon.com/.

[3] G. B. Berriman et al., "Montage: A Grid-Enabled Engine for Delivering Custom Science-Grade Mosaics On Demand," Proc. SPIE, vol. 5493, pp. 221–234, 2004.

[4] P. Blaha, K. Schwarz, G. Madsen, D. Kvasnicka, and J. Luitz, "WIEN2k: An Augmented Plane Wave plus Local Orbitals Program for Calculating Crystal Properties," Institute of Physical and Theoretical Chemistry, Vienna University of Technology, 2001.

[5] J. Burge, P. Ranganathan, and J. L. Wiener, "Cost-aware scheduling for heterogeneous enterprise machines (CASH'EM)," Proc. IEEE Int'l Conf. Cluster Computing, 2007.

[6] J. S. Chase, D. C. Anderson, P. N. Thakar, A. M. Vahdat, and R. P. Doyle, "Managing energy and server resources in hosting centers," Proc. ACM Symp. Operating Systems Principles, 2001.

[7] K. Cooper et al., "New Grid Scheduling and Rescheduling Methods in the GrADS Project," Int'l J. Parallel Programming, vol. 33, no. 2, pp. 209–229, 2005.

[8] B. N. Chun and D. E. Culler, "User-centric performance analysis of market-based cluster batch schedulers," Proc. IEEE/ACM Int'l Symp. Cluster Computing and the Grid, pp. 30–38, May 2002.

[9] E. Deelman et al., "Pegasus and the Pulsar Search: From Metadata to Execution on the Grid," Proc. Int'l Conf. Parallel Processing and Applied Mathematics, Czestochowa, Poland, pp. 821–830, 2003.

[10] Y. Gil, V. Ratnakar, E. Deelman, G. Mehta, and J. Kim, "Wings for Pegasus: Creating Large-Scale Scientific Applications Using Semantic Representations of Computational Workflows," Proc. Conf. Innovative Applications of Artificial Intelligence (IAAI), pp. 1767–1774, 2007.

[11] GoGrid, http://www.gogrid.com/.

[12] S. Hwang and C. Kesselman, "Grid workflow: a flexible failure handling framework for the grid," Proc. Int'l Symp. High Performance Distributed Computing (HPDC), 2003.

[13] T. Heath, R. Martin, and T. D. Nguyen, "Improving cluster availability using workstation validation," Proc. ACM SIGMETRICS, 2002.

[14] D. E. Irwin, L. E. Grit, and J. S. Chase, "Balancing risk and reward in a market-based task service," Proc. IEEE Symp. High Performance Distributed Computing, pp. 160–169, 2004.

[15] A. F. Jenkinson, "The Frequency Distribution of the Annual Maximum (or Minimum) Values of Meteorological Elements," Quarterly J. Royal Meteorological Soc., vol. 87, pp. 145–158, 1955.

[16] Y. C. Lee, C. Wang, A. Y. Zomaya, and B. B. Zhou, "Profit-driven service request scheduling in clouds," Proc. Int'l Symp. Cluster, Cloud and Grid Computing, pp. 15–24, 2010.

[17] S. Ludtke, P. Baldwin, and W. Chiu, "EMAN: Semiautomated software for high-resolution single-particle reconstructions," J. Structural Biology, vol. 128, no. 1, pp. 82–97, 1999.

[18] F. I. Popovici and J. Wilkes, "Profitable services in an uncertain world," Proc. ACM/IEEE SC2005 Conf. High Performance Networking and Computing (SC 2005), 2005.

[19] M. Rahman, R. Ranjan, and R. Buyya, "Dependable workflow scheduling in global grids," Proc. Int'l Conf. Grid Computing, 2009.

[20] R. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, and S. Ma, "Critical event prediction for proactive management in large-scale computing clusters," Proc. ACM SIGKDD, pp. 426–435, 2003.

[21] T. Tannenbaum, D. Wright, K. Miller, and M. Livny, "Condor: A Distributed Job Scheduler," in Beowulf Cluster Computing with Linux, The MIT Press, MA, USA, pp. 307–350, 2002.

[22] Y. Tao, H. Jin, and X. Shi, "Grid workflow scheduling based on reliability cost," Proc. Int'l Conf. Scalable Information Systems, 2007.

[23] K. V. Vishwanath and N. Nagappan, "Characterizing Cloud Computing Hardware Reliability," Proc. ACM Symp. Cloud Computing (SOCC), pp. 193–204, 2010.

[24] M. Xie, Y. S. Dai, and K. L. Poh, Computing System Reliability: Models and Analysis, New York: Kluwer Academic Publishers, 2004.

[25] H. Yu and A. Vahdat, "The Costs and Limits of Availability for Replicated Services," ACM Trans. Computer Systems, vol. 24, no. 1, pp. 70–113, 2006.
