Failure-Aware Workflow Scheduling in Cluster Environments
Cluster Comput (2010) 13: 421–434
DOI 10.1007/s10586-010-0126-7
Failure-aware workflow scheduling in cluster environments
Zhifeng Yu · Chenjia Wang · Weisong Shi
Received: 25 June 2008 / Accepted: 24 February 2010 / Published online: 13 March 2010
© Springer Science+Business Media, LLC 2010
Abstract The goal of workflow application scheduling is to achieve minimal makespan for each workflow. Scheduling workflow applications in high performance cluster environments is an NP-Complete problem, and becomes more complicated when potential resource failures are considered. While more research on failure prediction has been witnessed in recent years to improve system availability and reliability, very little of it attacks the problem in the context of workflow application scheduling. In this paper, we study how a workflow scheduler benefits from failure prediction and propose FLAW, a failure-aware workflow scheduling algorithm. We propose two important definitions of accuracy, Application Oblivious Accuracy (AOA) and Application Aware Accuracy (AAA), from the perspectives of system and scheduling respectively, as we observe that prediction accuracy as conventionally defined has different performance implications for different applications and fails to measure how it improves scheduling effectiveness. Comprehensive evaluation results using real failure traces show that FLAW performs well with practically achievable prediction accuracy, reducing the average makespan, the loss time, and the number of job reschedulings.
Keywords Failure-aware · Workflow scheduling · Cluster computing
This work is in part supported by National Science Foundation CAREER grant CCF-0643521.
Z. Yu · C. Wang · W. Shi (✉), Wayne State University, Detroit, USA, e-mail: firstname.lastname@example.org
Z. Yu, e-mail: email@example.com
C. Wang, e-mail: firstname.lastname@example.org
1 Introduction
Workflow applications have become prevalent in recent years, driven by stronger demand from scientific computation and wider provision of high performance cluster systems. Typically, a workflow application is represented as a Directed Acyclic Graph (DAG), where nodes represent individual jobs and edges represent inter-job dependences. In a DAG, nodes and edges are weighted by computation cost and communication cost respectively. Makespan, the time difference between the start and completion of a workflow application, is used to measure application performance and scheduler efficiency. In the rest of the paper, we use DAG and workflow application interchangeably.
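As an illustration of the DAG model described above, the following minimal Python sketch stores computation costs as node weights and communication costs as edge weights. The job names, the weights, and the helper `critical_path_length` are hypothetical and not taken from the paper. The sketch computes the makespan under simplifying assumptions: unlimited processors, every job on its own node, every edge paying its full communication cost, and no failures.

```python
from functools import lru_cache

# Hypothetical workflow: job -> computation cost (node weight).
compute_cost = {"A": 3, "B": 2, "C": 4, "D": 1}
# (parent, child) -> communication cost (edge weight).
comm_cost = {("A", "B"): 1, ("A", "C"): 2, ("B", "D"): 3, ("C", "D"): 1}


def critical_path_length(compute_cost, comm_cost):
    """Longest weighted path through the DAG (nodes plus edges)."""
    parents = {job: [] for job in compute_cost}
    for (parent, child), weight in comm_cost.items():
        parents[child].append((parent, weight))

    @lru_cache(maxsize=None)
    def finish(job):
        # A job starts once the slowest parent has finished and its data has arrived.
        start = max((finish(p) + w for p, w in parents[job]), default=0)
        return start + compute_cost[job]

    return max(finish(job) for job in compute_cost)


makespan = critical_path_length(compute_cost, comm_cost)
print(makespan)
```

A real scheduler has a finite number of nodes and can avoid communication cost by co-locating dependent jobs, so the makespan it achieves generally differs from this idealized figure.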
The goal of scheduling a workflow application is to achieve minimal makespan. The scheduling problem, often referred to as DAG scheduling in the research literature, was recognized as an NP-Complete problem even in a limited scale environment of a few processors. A sub-optimal schedule often results in unnecessarily excessive data movement and/or long waiting time due to inter-job dependence. With resource failures considered, DAG scheduling is significantly more difficult, and unfortunately few existing algorithms tackle this. Furthermore, the actual failure rate in production environments is extremely high. Both OSG and TeraGrid report over 30% failures at times. On the other hand, the failure tolerance policies applicable to ordinary job scheduling are reactive and deserve another look, as job inter-dependencies in workflows and usually longer computation complicate failure handling.
Recent years have seen many analyses of published failure traces of large scale cluster systems. These research efforts contribute to a better understanding of failure characteristics, result in more advanced failure prediction models, and help improve system reliability and availability. However, insufficient attention has been paid to how workflow scheduling can benefit from these accomplishments to reduce the impact of failures on application performance. In particular, we want to answer the following two related questions. First, what is the right and practical objective of failure prediction in the context of workflow scheduling? Second, how does failure prediction accuracy affect workflow scheduling? Note that we defer the design of a good failure prediction algorithm to future work.
Failures can have a significant impact on job execution under existing scheduling policies that ignore failures. In an error prone computing environment, failure prediction will improve scheduling efficiency if it can answer, with a high success rate, queries such as: "Will a node fail during data staging and job execution if the job is going to execute on this node?" The job will be assigned to this node only when the answer is NO with high confidence. As with any non-workflow application, an unexpected failure causes rescheduling of the failed job and wastes the resources spent on the uncompleted execution. More importantly, job rescheduling affects the subsequent scheduling decisions for child or peer jobs in the workflow that have not yet been scheduled, and can further degrade overall performance, which is determined by the job-resource mapping as a whole. In this paper we propose a FaiLure Aware Workflow scheduling algorithm (FLAW) to schedule workflow applications in the presence of resource failures.
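The query described above can be sketched in Python as follows. This is not the FLAW algorithm itself, only a minimal illustration of the idea of assigning a job to a node only when no failure is predicted during staging and execution. The names `Node`, `pick_node`, and `toy_predictor`, and the table of predicted failure times, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Node:
    name: str
    available_at: float  # earliest time the node can start staging data


# predictor(node, start, end) returns True if a failure is predicted in [start, end).
Predictor = Callable[[Node, float, float], bool]


def pick_node(nodes: Sequence[Node], staging_time: float, exec_time: float,
              predictor: Predictor) -> Optional[Node]:
    """Return the predicted-safe node with the earliest finish time, or None."""
    best, best_finish = None, float("inf")
    for node in nodes:
        start = node.available_at
        finish = start + staging_time + exec_time
        if predictor(node, start, finish):
            continue  # answer to "will this node fail?" is YES: skip it
        if finish < best_finish:
            best, best_finish = node, finish
    return best


# Toy predictor backed by a hypothetical table of predicted failure times.
predicted_failures = {"n1": [5.0], "n2": []}


def toy_predictor(node: Node, start: float, end: float) -> bool:
    return any(start <= t < end for t in predicted_failures.get(node.name, []))


nodes = [Node("n1", 0.0), Node("n2", 2.0)]
chosen = pick_node(nodes, staging_time=1.0, exec_time=6.0, predictor=toy_predictor)
print(chosen.name if chosen else "no safe node")
```

In this example node `n1` is free earlier but is predicted to fail inside the job's window, so the later but safer node `n2` is chosen. The quality of such a decision depends entirely on how often the predictor's answer is correct.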
On the other hand, we argue that the conventional definition of failure prediction accuracy does not reflect well how accuracy affects scheduling effectiveness. The conventional definition borrows the precision and recall metrics from information retrieval theory, which measure prediction effectiveness statistically. Driven by these definitions, most existing failure prediction techniques attempt to forecast failures regardless of which jobs are running. Intuitively, a longer running job may require failure prediction to be more accurate than a shorter one. But what about scheduling a workflow application, which is not just lengthy but also involves job dependence? We argue that the conventional prediction approach and accuracy requirement are not intended for failure-aware scheduling, and we propose two new definitions of failure prediction accuracy as a result: Application Oblivious Accuracy (AOA) from a system's perspective and Application Aware Accuracy (AAA) from a scheduler's perspective, which we believe better reflect how failure prediction accuracy impacts scheduling effectiveness.
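For reference, the conventional precision and recall metrics mentioned above can be computed as in the short Python sketch below. These are the standard information retrieval definitions, not the AOA or AAA definitions proposed in this paper; the function name and the counts are hypothetical.

```python
def precision_recall(true_positives: int, false_positives: int,
                     false_negatives: int) -> tuple:
    """Standard precision and recall for a failure predictor.

    precision: fraction of predicted failures that actually occurred
    recall:    fraction of actual failures that were predicted
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: 30 failures correctly predicted, 10 false alarms, 20 missed.
p, r = precision_recall(true_positives=30, false_positives=10, false_negatives=20)
print(p, r)
```

Both values are computed over all failures in the system, with no reference to which jobs were running when a failure occurred, which is the limitation the paragraph above points out.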
The contributions of this paper are three-fold: (1) we design FLAW, which takes potential resource failures into consideration when scheduling workflow applications and defines the failure prediction requirement in a practical way; (2) we propose two new definitions of failure prediction accuracy, which reflect its effect on workflow application scheduling more precisely, in the context of job scheduling in high performance computing environments; and (3) we perform comprehensive simulations using real failure traces and find that our proposed FLAW algorithm performs well even with moderate prediction accuracy, which is much easier to achieve using existing algorithms. The number of job resubmissions (reschedulings) due to resource failure decreases significantly even at a trivial AOA level under intensive workload.
The rest of the paper is organized as follows. A brief review of related work, which motivates our work, is given in Sect. 2. The new definitions of failure prediction accuracy are proposed in Sect. 3. We then describe the FLAW algorithm design in Sect. 4. Section 5 elaborates on the simulation design and analyzes the simulation results. Finally, we summarize and lay out our future work in Sect. 6.
2 Related work
Noticeable progress has been made in failure prediction research and practice, as more failure traces have been made publicly available since 2006 and failure analysis has revealed more failure characteristics of high performance computing systems. Zhang et al. evaluate the performance implications of failures in large scale clusters. Fu et al. propose both online and offline failure prediction models in coalitions of clusters. Another failure prediction model is proposed by Liang et al. based on failure analysis of the BlueGene/L system. Recently, Ren et al. developed a resource failure prediction model for fine-grained cycle sharing systems. However, most of these efforts focus on improving prediction accuracy, and few address how to leverage the prediction results in practice.
Salfner et al. suggest that proactive failure handling has the potential to improve system availability by up to an order of magnitude, and the FT-Pro project and the FARS project demonstrate a significant performance improvement for long-running applications provided by proactive fault tolerance policies. A fault-aware job scheduling algorithm for the BlueGene/L system has also been demonstrated, with simulation studies showing that it can significantly improve system performance even with trivial fault prediction confidence or accuracy levels (as low as 10%). It is worth noting, though, that as a specially designed toroidal system, BlueGene/L requires job partitions to be both rectangular (in a multidimensional sense) and contiguous. The goal of job scheduling in that setting is therefore to reduce fragmentation with failure awareness and eventually improve system utilization. The jobs in the system are independent of each other and the failure prediction employs the conventional accuracy concept, which is quite different from our study of scheduling workflow applications in a cluster or grid of a common form without the job partition constraint.
Failure handling is considered in