MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Slide 1

MS Thesis Defense
Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project
Young Suk Moon

Chair: Dr. Hans-Peter Bischof
Reader: Dr. Gregor von Laszewski
Observer: Dr. Minseok Kwon

Slide 2

Outline

Introduction to Water Threat Management Project
Motivation
Research Objectives
Fault-Tolerant Queue
Evaluation
Conclusion

Slide 3

Water Threat Management

Motivation
  Urban Water Distribution Systems (WDSs) can be an easy target of terror attacks, e.g. contamination of the water.
Methods
  Detect contamination using sensors located across the WDSs.
  Run algorithms (developed by NCSA) to determine the sensor locations that minimize the time needed to find the contaminant source locations (sensors are expensive).

Notes: Motivation should be here. Security problem, e.g. terror, threat; sensors are expensive. Run the algorithm to place sensors so that the search time is minimized. NCSA developed a statistical algorithm. Fault tolerant. Network graphic.

Slide 4

Water Threat Management

Requirements
  Time sensitive
  Massive calculation
  Dynamic adaptation to a Grid environment
  Fault tolerance
Our goals
  The current system is not fault-tolerant.
  Develop a fault-tolerant framework and increase performance in a faulty environment.

Notes: Add network graphic image.

Slide 5

Existing Water Threat Management System Architecture

Slide 6

EPANET Simulation in the Simulation Engine

Slide 7

Motivation (1) Resource Outages

TeraGrid resource outages during 2009.
TeraGrid User & System News (http://news.teragrid.org/)

Slide 8

Motivation (1) Resource Outages

Outage rate (total outage time / year) in 2009
TeraGrid User & System News (http://news.teragrid.org/)

Notes: The resources may look reliable, but this is threat management.

Slide 9

Motivation (1) Resource Outages

WTM deployment problem with outages
TeraGrid User & System News (http://news.teragrid.org/)

Notes: Suppose there are power outages (caused by thunderstorms and lightning) on Mercury, Abe, and Lincoln.

Slide 10

Motivation (2) Queue Wait Time

Queue wait time

Slide 11

Research Objectives

Develop a fault-tolerant framework dealing with resource outages
  Strategy: generation distribution on multiple sites
Reduce queue wait time
  Strategy: dynamic job dependency

Slide 12

Water Threat Management Application

Sequential & parallel processing

Slide 13

Generation Distribution

Divide the generations into multiple parts and run the parts as multiple jobs.
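As a rough illustration of dividing generations into parts, the sketch below splits an inclusive generation range into contiguous chunks of near-equal size. The function name `split_generations` and the even-split policy are assumptions made for this example; the slides only say that the generations are divided into multiple parts.

```python
def split_generations(first, last, parts):
    """Split the inclusive range [first, last] into `parts` contiguous chunks.

    Hypothetical helper: earlier chunks receive one extra generation
    when the range does not divide evenly.
    """
    total = last - first + 1
    if parts < 1 or parts > total:
        raise ValueError("parts must be between 1 and the number of generations")
    base, extra = divmod(total, parts)
    chunks = []
    start = first
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append((start, start + size - 1))
        start += size
    return chunks


# 20 generations split over two jobs, one job per chunk.
print(split_generations(1, 20, 2))  # [(1, 10), (11, 20)]
```

Each returned pair would become the generation range of one job.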

Notes: The generations can be divided into any number of parts, each with any range.

Slide 14

Generation Distribution

File communication

Notes: An additional framework is needed to make the distribution work.

Slide 15

Dynamic Job Dependency

Problems of generation distribution on multiple sites
  Additional queue wait times
  Each job is dependent on another.
  A job cannot be submitted before the prior job finishes.
Solution: determine job dependency at run time.
  Submit all jobs at the same time.
  Whichever job starts first computes the first set of generations.

Notes: Not understandable:

Slide 16

Dynamic WTM Workflow Management

Example scenario
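The idea of determining job dependency at run time can be sketched as follows. All jobs are submitted together, and whichever job leaves its queue first claims the first set of generations. The `ChunkLedger` class, the job names, and the queue wait values are hypothetical; the sketch covers only the assignment step, not the file communication that passes results between jobs.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkLedger:
    """Hypothetical shared record of which generation ranges are still unclaimed."""
    chunks: list                                  # ordered (first_gen, last_gen) pairs
    claimed: dict = field(default_factory=dict)   # job name -> claimed chunk

    def claim_next(self, job):
        """Give the calling job the earliest unclaimed chunk, or None if none is left."""
        if len(self.claimed) >= len(self.chunks):
            return None
        chunk = self.chunks[len(self.claimed)]
        self.claimed[job] = chunk
        return chunk


def assign_at_run_time(chunks, queue_wait_minutes):
    """Return job -> chunk, with jobs claiming chunks in the order they start."""
    ledger = ChunkLedger(list(chunks))
    # Jobs are submitted at the same time, so the start order is the
    # order of increasing queue wait time.
    for job in sorted(queue_wait_minutes, key=queue_wait_minutes.get):
        ledger.claim_next(job)
    return ledger.claimed


# Illustrative queue waits in minutes; the job on Big Red starts first.
waits = {"job_on_abe": 82, "job_on_big_red": 42}
print(assign_at_run_time([(1, 10), (11, 20)], waits))
```

With a static dependency the ranges would be fixed per site before submission, so a long queue wait on the site holding the first range would delay the whole workflow.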

Slide 17

Fault-tolerant Queue

Most common fault-tolerant strategies in a Grid
  Replication
  Checkpointing
Limitations of checkpointing under time-criticality
  Checkpointing degrades performance.
  A checkpoint may not be compatible with a different site (heterogeneity).
  A job cannot be rescheduled on the same site in case of a site outage.
The replication strategy is chosen within the fault-tolerant queue.

Notes: heterogeneous

Slide 18

Fault-tolerant Queue Design

Architecture

Notes: Point: the resource checker detects site outages.

Slide 19

Fault-tolerant Queue Design

Components
  Command Line Interface
  Task Pool
  Resource Pool
  Scheduler
  Resource Checker (integration with the TeraGrid Information Services)
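A minimal sketch of how the listed components might fit together is shown below. The class and method names, the default of two replicas, and the task name are assumptions made for illustration; the architecture figure is not part of this transcript, and a real resource checker would query an information service rather than receive outage reports by hand.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ResourcePool:
    """Known sites plus the subset currently marked as down."""
    sites: list
    down: set = field(default_factory=set)

    def available(self):
        return [s for s in self.sites if s not in self.down]


class ResourceChecker:
    """Stand-in for the outage lookup (hypothetical interface)."""

    def __init__(self, pool):
        self.pool = pool

    def report_outage(self, site):
        self.pool.down.add(site)

    def report_recovery(self, site):
        self.pool.down.discard(site)


class Scheduler:
    """Replicates each queued task onto up to `replicas` available sites."""

    def __init__(self, pool, replicas=2):
        self.pool = pool
        self.replicas = replicas
        self.task_pool = deque()

    def submit(self, task):
        self.task_pool.append(task)

    def dispatch(self):
        placements = {}
        while self.task_pool:
            task = self.task_pool.popleft()
            sites = self.pool.available()[: self.replicas]
            if not sites:
                # No site is up: keep the task queued for a later attempt.
                self.task_pool.appendleft(task)
                break
            placements[task] = sites
        return placements


pool = ResourcePool(["Abe", "Big Red", "Queen Bee"])
checker = ResourceChecker(pool)
scheduler = Scheduler(pool)

checker.report_outage("Abe")
scheduler.submit("wtm-generations-1-10")
print(scheduler.dispatch())  # {'wtm-generations-1-10': ['Big Red', 'Queen Bee']}
```

Replication is used here because the slides select it over checkpointing; the number of replicas per task is a free parameter of this sketch.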

Slide 20

Fault-tolerant Queue Design

Fault detection
  Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit
    Communicate with GRAM to detect job failure
  TeraGrid Information Services
    The GRAM service may fail when the resource is down.
    Publishes XML documents containing the outage information

Notes: TeraGrid info service fast another service

Slide 21

Evaluation - WTM Performance

WTM application performance (generation)

Notes: Run time grows linearly. EPANET simulates each parameter for the same amount of time. We can predict the run time.

Slide 22

Evaluation - Queue Wait Time

Queue wait time statistics

             Abe      Big Red
Avg. (min)   82       42
Var.         38513    5354
sd.          196      73

Notes: A long queue wait time means the resource is not available.

Slide 23

Evaluation - Overhead

Performance overhead
  Integrating a fault-tolerant framework usually causes performance degradation.
  No performance loss in our framework

Slide 24

Evaluation - Workflow Performance

Run time comparison of different workflow types
  Original deployment vs. fault-tolerant deployment
  Dynamic job dependency vs. static job dependency
Test each type of deployment in the real Grid system, including queue wait time.

Version          Workflow   Site Name       # Jobs   Gen. range
Original         -          Abe             1        1-20
Original         -          Big Red         1        1-20
Fault-tolerant   static     Abe, Big Red    2        1-10 (Abe), 11-20 (Big Red)
Fault-tolerant   dynamic    Abe, Big Red    2        1-10, 11-20

Notes: static: no dynamic job dependency; dynamic: dynamic job dependency.

Slide 25

Evaluation - Workflow Performance

Setup points
  What to measure
    Job run time + queue wait time
  4 different types of deployment
    Original on Abe
    Original on Big Red
    Static fault-tolerant workflow on Abe + Big Red
    Dynamic fault-tolerant workflow on Abe + Big Red
  6 different jobs
    6 = 1 (original) + 1 (original) + 2 (static) + 2 (dynamic)

Slide 26

Evaluation - Workflow Performance

Setup points
  Submit the 4 different deployments at the same time.
    5 of the 6 jobs are submitted at the same time; the remaining job belongs to the static workflow and is submitted after its prior job finishes.
  Repeat this at different times.
    The results will differ because the queue wait times vary.

Slide 27

Evaluation - Workflow Performance

Workflow comparison results

Notes: Blue: original; black: static; red: dynamic. 4 different deployments, 6 different jobs.

Slide 28

Simulation - Run Time Comparison

Average run time
  Statistical model for the original WTM deployment
    t: run time of a job, p: failure rate, q: avg. queue wait time
  Statistical model for the dynamic WTM deployment
    k: number of jobs, q_i: avg. queue wait time of the i-th job, t_i: run time of the i-th job
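The formulas from this slide are not preserved in the transcript, and the sketch below does not attempt to reproduce them. It only shows how the symbols t, p, and q could enter an average-time estimate under one simple assumed model: every attempt costs a full queue wait plus a full run, fails independently with probability p, and is resubmitted until it succeeds. The function name and the example numbers are hypothetical, and the dynamic-deployment symbols k, q_i, and t_i are not modelled.

```python
def expected_time_with_retries(t, q, p):
    """Expected minutes until a job succeeds under the assumed retry model.

    t: run time of a job (minutes)
    q: average queue wait time (minutes)
    p: probability that one attempt fails (0 <= p < 1)

    The number of attempts is geometric with mean 1 / (1 - p), and each
    attempt is assumed to cost q + t minutes.
    """
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return (q + t) / (1 - p)


# Illustrative values only: a 10-minute job, a 90-minute queue wait,
# and a 50% failure rate give an expected 200 minutes.
print(expected_time_with_retries(t=10, q=90, p=0.5))  # 200.0
```

Under this assumption a low failure rate changes the average very little, since the factor 1 / (1 - p) stays close to 1.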

Slide 29

Simulation - Run Time Comparison

Results (queue wait time + job run time + failure time)

Notes: The results are similar because the failure rate is low.

Slide 30

Simulation - Worst Case Run Time Comparison

A threat management system must deliver results under any circumstances.
Thus, the worst-case run time is a critical factor in the Water Threat Management system.

Slide 31

Simulation - Worst Case Run Time Comparison

Simulation setup

Use the 2009 TeraGrid outage data for this simulation.
Submit jobs every 5 minutes during 2009 and compare the worst-case run time of the original deployment with that of the dynamic workflow deployment.

Slide 32

Simulation - Worst Case Run Time Comparison

                             Abe     Big Red   Queen Bee
Max. Queue Wait Time (min)   668     289       302
Run Time per Gen. (min)      0.52    2.07      1.02

Slide 33

Simulation - Worst Case Run Time Comparison
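The simulation setup can be approximated with a short sketch: submit a job every 5 minutes, compute how long each submission takes to complete when outages interrupt it, and keep the worst value. Several parts are assumptions made for this example: the outage windows are invented rather than taken from the 2009 TeraGrid data, the maximum queue wait is charged on every attempt, a job hit by an outage restarts from scratch when the site returns, and the multi-site case is simplified to "the first site to finish wins" instead of the full dynamic workflow. Run times use 20 generations at the per-generation rates given for Abe and Big Red.

```python
def turnaround(submit, run, wait, outages):
    """Minutes from submission to completion on one site.

    A job needs `wait + run` uninterrupted minutes. If an outage window
    (start, end) begins before the job finishes, the job is lost and is
    resubmitted when the outage ends.
    """
    start = submit
    for out_start, out_end in sorted(outages):
        if out_end <= start:
            continue                    # outage is already over
        if out_start >= start + wait + run:
            break                       # job finishes before this outage
        start = out_end                 # restart once the site is back
    return start + wait + run - submit


def worst_case(sites, horizon, step=5):
    """Worst turnaround over submissions every `step` minutes up to `horizon`.

    For each submission time the fastest site counts, so passing one site
    models a single-site deployment and passing several models replication.
    """
    worst = 0.0
    for submit in range(0, horizon, step):
        best = min(
            turnaround(submit, s["run"], s["wait"], s["outages"])
            for s in sites.values()
        )
        worst = max(worst, best)
    return worst


# Hypothetical outage windows, in minutes from the start of the simulation.
abe = {"run": 20 * 0.52, "wait": 668, "outages": [(1000, 3000)]}
big_red = {"run": 20 * 2.07, "wait": 289, "outages": [(5000, 6000)]}

print(worst_case({"Abe": abe}, horizon=10000))
print(worst_case({"Abe": abe, "Big Red": big_red}, horizon=10000))
```

With these invented windows the single-site worst case is dominated by the length of the outage, while the two-site worst case stays close to the slower site's normal turnaround because the outages do not overlap.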

Slide 34

Conclusion

In general, the dynamic fault-tolerant workflow performs similarly to the original deployment.
However, in the worst-case scenario the dynamic workflow performs much better than the original deployment.
