On-line, non-clairvoyant optimization of workflow activity granularity task on grids

  • date post

    21-Jun-2015
  • Category

    Technology

description

Presentation held at Euro-Par 2013, Aachen, Germany. Abstract. Controlling the granularity of workflow activities executed on widely distributed computing platforms such as grids is required to reduce the impact of task queuing and data transfer time. Most existing granularity control approaches assume extensive knowledge about the applications and resources (e.g. task duration on each resource), and that neither the workload nor the available resources change over time. We propose a granularity control algorithm for platforms where such clairvoyant and offline conditions are not realistic. Our method groups tasks when the fineness degree of the application, which takes into account the ratio of shared data and the queuing/round-trip time ratio, becomes higher than a threshold determined from execution traces. The algorithm also de-groups task groups when new resources arrive. The application's behavior is constantly monitored so that the characteristics useful for the optimization are progressively discovered. Experimental results, obtained with three workflow activities deployed on the European Grid Infrastructure, show that (i) the grouping process yields speed-ups of about 2.5 when the amount of available resources is constant and that (ii) the use of de-grouping yields speed-ups of 2 when resources progressively appear. More information: www.rafaelsilva.com
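
As an illustration only, the decision rule described in the abstract can be sketched in Python as follows. The names `Group`, `group_pairwise`, `degroup` and `control_step` are hypothetical and do not come from the presentation. The conditions on the number of waiting groups Q and running groups R are taken from slides 13 and 14 of the transcript; the fineness degree and its threshold are assumed to be computed elsewhere and passed in. Splitting every waiting group at once is a simplification, not necessarily what the authors' implementation does.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Group:
    """A group of task identifiers submitted as one grid job (hypothetical type)."""
    tasks: List[str]


def group_pairwise(waiting: List[Group]) -> List[Group]:
    """Merge waiting groups two by two; an odd group out is kept unchanged."""
    merged = []
    for k in range(0, len(waiting) - 1, 2):
        merged.append(Group(waiting[k].tasks + waiting[k + 1].tasks))
    if len(waiting) % 2 == 1:
        merged.append(waiting[-1])
    return merged


def degroup(waiting: List[Group]) -> List[Group]:
    """Split every waiting group back into single-task groups (simplification)."""
    return [Group([task]) for group in waiting for task in group.tasks]


def control_step(waiting: List[Group], running: int,
                 fineness: float, threshold: float) -> List[Group]:
    """One control decision: group, de-group, or leave the waiting queue as is."""
    q, r = len(waiting), running
    if fineness > threshold and q > r:
        # Tasks are too fine and more groups are waiting than running.
        return group_pairwise(waiting)
    if r > q:
        # More running than waiting groups: give parallelism back.
        return degroup(waiting)
    return waiting
```

With five single-task groups waiting, one running group, and a fineness degree above the threshold, a single call to `control_step` merges the queue into three groups: two pairs and one leftover task. In an online setting the fineness degree would be re-estimated before each call.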

Transcript of On-line, non-clairvoyant optimization of workflow activity granularity task on grids

  • 1. On-line, Non-Clairvoyant Optimization of Workflow Activity Granularity on Grids
    Rafael FERREIRA DA SILVA, Tristan GLATARD (University of Lyon, CNRS, INSERM, CREATIS, Villeurbanne, France)
    Frédéric DESPREZ (INRIA, University of Lyon, LIP, ENS Lyon, Lyon, France)
    Euro-Par 2013, August 26-30, 2013
    Rafael Ferreira da Silva [email protected]

  • 2. Outline
    Context; The Virtual Imaging Platform; Problem definition; Task granularity; Self-healing of workflow executions on grids; Task granularity control process; Experiments and results; Conclusion

  • 4. Context
    Virtual Imaging Platform (VIP): medical imaging science-gateway; grid of ~180 sites (EGI, http://www.egi.eu)
    Significant usage: 452 registered users from 50 countries; consumed 472 CPU years from August 2012 to July 2013 (http://dirac.france-grilles.fr)
    [Figure: VIP consumption since August 2012]

  • 5. Workflow Execution
    (1) Input data upload; (2) User launches a simulation; (3) MOTEUR generates invocations; (4) GASW generates grid jobs; (5) Jobs are submitted to DIRAC; (6) Pilot jobs are submitted to EGI; (7) Pilot jobs fetch grid jobs; (8) Inputs download; (9) Execution; (10) Results upload; (11) Download results

  • 6. Task Granularity
    Low performance of lightweight (a.k.a. fine-grained) tasks: high queuing times; communication overhead
    [Figure: timeline of lightweight tasks t1 to t5 on resources R1 to R3; lightweight task executions are delayed]
    Grouping into coarse-grained tasks reduces the cost of data transfers when grouped tasks share input data, and saves queuing time

  • 7. Workflow Self-Healing
    Problem: costly manual operations (rescheduling tasks, restarting services or replicating data files)
    In this work: task granularity in distributed workflows
    Objective: automated platform administration; autonomous detection of fine-grained tasks; perform appropriate set of actions
    Assumptions: online and non-clairvoyant; only partial information available; decisions must be fast; production conditions, no prediction of user activity or workloads

  • 8. General MAPE-K loop
    Components: Monitoring, Analysis, Planning, Execution, Knowledge; triggered by an event (job completion and failures) or a timeout; fed by monitoring data
    Example: incident 1 degree = 0.8; incident 2 degree = 0.4; incident 3 degree = 0.1; each degree is mapped to level 1, 2 or 3
    Roulette wheel selection: incident $i$ is selected with probability $\eta_i / \sum_{j=1}^{n} \eta_j$; here incident 1 is selected
    Association rules for incident 1:
      Rule     Confidence   Confidence × degree
      2 → 1    0.8          0.32
      3 → 1    0.2          0.02
      1 → 1    1.0          0.80
    Roulette wheel selection based on association rules: incident 2 is selected, which determines the set of actions
    R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), in press, 2013.

  • 9. Incident Levels and Actions
    Incident degrees are quantified in discrete incident levels
    Thresholds are determined from visual mode clustering or K-means
    [Figure annotations: no actions are triggered; triggers a set of actions; thresholds cluster platform configurations into groups]

  • 11. Fineness control: degree
    Incident degree: $\eta_f = \max_{i \in [1,m]} f_i$, with $f_i = d_i \cdot r_i$
    $d_i = \dfrac{\tilde{t}_{shared}}{\tilde{t}_{shared} + n_i (\tilde{t} - \tilde{t}_{shared})}$
    $r_i = \dfrac{\max_{j \in [1,n_i]} q_j}{\max_{j \in [1,n_i]} q_j + \tilde{t}_{shared} + n_i (\tilde{t} - \tilde{t}_{shared})}$
    Task execution phases: queued time ($q_j$), shared input data ($\tilde{t}_{shared}$), other input data, application execution
    $\tilde{t}$: median task phase durations; i = waiting task; n = number of waiting tasks

  • 12. Fineness control: task estimation
    Estimation of task durations
    Job phases: setup, inputs download, execution, outputs upload
    Assumption: bag of tasks (all jobs have equal durations)
    Median-based estimation:
      Phase                          setup   inputs download   execution   outputs upload   sum
      Median duration of job phases  50s     250s              400s        15s              $\tilde{t}$ = 715s
      Real job duration              42s     300s              20s         ?
      Estimated job duration         42s     300s              400s*       15s              $\tilde{t}_i$ = 757s
    Setup and inputs download are completed; execution is the current phase. *: max(400s, 20s) = 400s

  • 13. Fineness control: levels and actions
    Levels: identified from the platform logs; $\eta_f \le \tau_f$: level 1 (no actions); $\eta_f > \tau_f$: level 2 (action: task grouping)
    Actions: task grouping; tasks are grouped pairwise until $\eta_f \le \tau_f$ or the amount of waiting groups Q is smaller than or equal to the amount of running groups R

  • 14. Coarseness control
    Incident degree: $\eta_c = \dfrac{R}{Q + R}$
    Levels: threshold $\tau_c = 0.5$
    De-group tasks when R > Q, which is equivalent to $\eta_c > 0.5$
    [Figure: under non-stationary load, grouped tasks t1, t2+t3, t4+t5 on resources R1 to R3 cause a loss of parallelism; task de-grouping restores it]

  • 15. Workload for Case Studies
    Based on the workload of VIP, January 2011 to April 2012
    Case studies on: pilot jobs; user accounting; task analysis; bag of tasks; workflows
    112 users; 2,941 workflow executions; 339,545 pilot jobs
    680,988 tasks: 338,989 completed; 138,480 error; 105,488 aborted; 15,576 aborted replicas; 48,293 stalled; 34,162 queued
    R. Ferreira da Silva, T. Glatard, A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions, CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing (CGWS), Rhodes Island, Greece, 2012.

  • 17. Experiment Conditions
    Experiment 1: evaluate the fineness control process under stationary load
    Experiment 2: evaluate the de-grouping control process under non-stationary load
    Workflows characteristics

  • 18. Results: stationary load
    Fineness yields significant makespan reduction for all repetitions

  • 19. Results: stationary load (2)
    Task grouping speeds up SimuBloch and FIELD-II by up to a factor of 2.6, and PET-SORTEO/emission by up to a factor of 2.5
    Not able to group all SimuBloch tasks in a single group because 2 tasks must be completed for the task estimation process

  • 20. Results: non-stationary load
    [Figure panels: resources appear progressively; resources appear suddenly]
    Speeds up executions by up to a factor of 1.5 for Fineness, and 2.1 for Fineness-Coarseness
    Fineness is penalized by its lack of adaptation: slowdown of 20%

  • 21. Results: non-stationary load (2)
    The linear correlation coefficient between the makespan and the average queuing time is 0.91, which indicates they are correlated

  • 23. Concluding remarks
    Context: autonomous handling of unfairness among workflow executions; no strong assumptions on resource characteristics and workload
    Summary of the proposed method: implements a generic MAPE-K loop; determines task fineness based on queue waiting time and estimated data transfer time of shared input data; tasks are grouped pairwise as long as Q > R and tasks are too fine; tasks are ungrouped when the number of available resources increases
    Optimizing task granularity: properly detects and handles lightweight tasks; stationary load: fineness control significantly reduces the makespan of all applications; non-stationary load: the de-grouping algorithm compensates for the lack of adaptation of task grouping

  • 24. Thank you for your attention. Questions?
    Acknowledgments: VIP users and project members; French National Agency for Research (ANR-09-COSI-03, ANR-11-LABX-0063); EC FP7 Programme (312579 ER-flow); European Grid Initiative (EGI); France-Grilles
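
The fineness degree of slide 11 and the median-based estimation of slide 12 can be sketched in Python as below. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names, the phase keys in `PHASES`, and the guards that return 0.0 for empty or degenerate inputs are hypothetical. Each activity is assumed to be summarised by its median shared-input transfer time, its summed median phase durations, and the queuing times of its waiting tasks.

```python
from statistics import median
from typing import Dict, Iterable, Optional, Sequence, Tuple

# Job phases named on slide 12 (the key spellings are an assumption).
PHASES = ("setup", "inputs_download", "execution", "outputs_upload")


def median_phase_durations(completed: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Median duration of each phase over the completed jobs."""
    return {p: median(job[p] for job in completed) for p in PHASES}


def estimate_job_duration(medians: Dict[str, float],
                          done: Dict[str, float],
                          current_phase: Optional[str] = None,
                          elapsed: float = 0.0) -> float:
    """Completed phases use real durations, the current phase uses
    max(median, elapsed so far), and phases not yet started use the median."""
    total = 0.0
    for phase in PHASES:
        if phase in done:
            total += done[phase]
        elif phase == current_phase:
            total += max(medians[phase], elapsed)
        else:
            total += medians[phase]
    return total


def activity_fineness(t_shared: float, t_total: float,
                      queue_times: Sequence[float]) -> float:
    """f_i = d_i * r_i for one activity, following the slide 11 formulas."""
    n = len(queue_times)
    if n == 0:
        return 0.0  # hypothetical guard: nothing is waiting
    work = t_shared + n * (t_total - t_shared)
    if work <= 0:
        return 0.0  # hypothetical guard: no duration estimate yet
    d = t_shared / work
    q_max = max(queue_times)
    r = q_max / (q_max + work)
    return d * r


def fineness_degree(activities: Iterable[Tuple[float, float, Sequence[float]]]) -> float:
    """Maximum of f_i over all activities; 0.0 when there are none."""
    return max((activity_fineness(*a) for a in activities), default=0.0)
```

With the medians shown on slide 12 (50s, 250s, 400s, 15s), a job whose setup took 42s and inputs download 300s, and whose execution has been running for 20s, `estimate_job_duration` returns 757, matching the slide. Taking the maximum over activities means a single very fine activity is enough to trigger grouping.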