A science-gateway for workflow executions: online and non-clairvoyant self-healing of workflow...

A science-gateway for workflow executions: online and non-clairvoyant self-healing of workflow executions on grids. Rafael FERREIRA DA SILVA, University of Lyon, CNRS, INSERM, CREATIS, Villeurbanne, France. Supervisors: Frédéric DESPREZ and Tristan GLATARD. This work was funded by the French National Agency for Research under grant ANR-09-COSI-03 "VIP".


PhD Thesis presented on November 29th 2013 at INSA-Lyon

Abstract - Science gateways, such as the Virtual Imaging Platform (VIP), enable transparent access to distributed computing and storage resources for scientific computations. However, their large scale and the number of middleware systems involved lead to many errors and faults. In practice, science gateways are often backed by substantial support staff who monitor running experiments by performing simple yet crucial actions such as rescheduling tasks, restarting services, killing misbehaving runs or replicating data files to reliable storage facilities. Fair quality of service (QoS) can then be delivered, but only with significant human intervention.

Automating such operations is challenging for two reasons. First, the problem is online by nature because no reliable user activity prediction can be assumed, and new workloads may arrive at any time. Therefore, the considered metrics, decisions and actions have to remain simple and to yield results while the application is still executing. Second, it is non-clairvoyant due to the lack of information about applications and resources in production conditions. Computing resources are usually dynamically provisioned from heterogeneous clusters, clouds or desktop grids without any reliable estimate of their availability and characteristics. Models of application execution times are hardly available either, in particular on heterogeneous computing resources.

In this thesis, we propose a general healing process for autonomous detection and handling of operational incidents in workflow executions. Instances are modeled as Fuzzy Finite State Machines (FuSM) where state degrees of membership are determined by an external healing process. Degrees of membership are computed from metrics assuming that incidents have outlier performance, e.g. a site or a particular invocation behaves differently than the others. Based on incident degrees, the healing process identifies incident levels using thresholds determined from the platform history. A specific set of actions is then selected from association rules among incident levels.

For more information visit http://www.rafaelsilva.com


Page 1: A science-gateway for workflow executions: online and non-clairvoyant self-healing of workflow executions on grids

Rafael FERREIRA DA SILVA
University of Lyon, CNRS, INSERM, CREATIS
Villeurbanne, France

Supervisors: Frédéric DESPREZ and Tristan GLATARD

This work was funded by the French National Agency for Research under grant ANR-09-COSI-03 "VIP"

Page 2

Outline

• Technical context and challenges
• Contributions
  • Self-healing of workflow executions on grids
  • Treatment of blocked activities
  • Optimization of task granularity
  • Fairness control among workflow executions
• Conclusions


Page 4

Heavy Medical Simulations

• Treatment planning for prostate proton therapy [L. Grevillot, D. Sarrut]. CPU time: 2 months
• Simulated diffusion-weighted images [L. Wang, Y. Zhu, I. Magnin]. CPU time: 8 years
• Echography simulation [O. Bernard, M. Alessandrini]. CPU time: 42 hours

Virtual Imaging Platform

• Public computing infrastructure: 150 computing sites world-wide
• Medical-imaging execution platform: 491 users from 52 countries

Goal: Self-healing of workflow executions on grids to handle operational issues

Page 5

Workflow Execution

1. Input data upload
2. User launches a simulation (application workflow)
3. Workflow engine generates invocations
4. Invocations are wrapped into grid jobs
5. Jobs are submitted to a Pilot Engine
6. Pilot jobs are submitted to the distributed infrastructure
7. Pilot jobs fetch grid jobs
8. Inputs download
9. Execution
10. Results upload
11. Download results

Science-Gateway: Virtual Imaging Platform (VIP)
• High-level interface
• Software-as-a-Service

Page 6

Workflow Execution (steps 1 to 11 as on Page 5)

Workflow Management System
• Applications described as workflows
• Parallel language
• Grid-aware enactor

Page 7

Workflow Execution (steps 1 to 11 as on Page 5)

Workload Management System
• Pilot jobs run special agents that fetch user tasks from the task queue, set up their environment and steer their execution
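The pull model described on this slide can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the in-memory queue, the `run_pilot` function, the task dictionaries and the `example-site` name are all hypothetical and do not reproduce the API of the pilot-job system used by VIP.

```python
import queue


def run_pilot(task_queue, setup_environment):
    """Hypothetical pilot agent: pull tasks until the queue is empty."""
    results = []
    while True:
        try:
            task = task_queue.get_nowait()   # fetch a user task from the queue
        except queue.Empty:
            break                            # nothing left to fetch: the pilot exits
        env = setup_environment(task)        # set up the task environment
        results.append(task["run"](env))     # steer the task execution
    return results


# Made-up tasks: each one returns the site it ran on and a small result.
tasks = queue.Queue()
for n in (1, 2, 3):
    tasks.put({"name": f"task-{n}", "run": lambda env, n=n: (env["site"], n * n)})

outputs = run_pilot(tasks, lambda task: {"site": "example-site"})
```

The point of the design is that tasks are bound to a resource only when a pilot is actually running there, which is why tasks wait in the task queue rather than in site batch queues.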

Page 8

Workflow Execution (steps 1 to 11 as on Page 5)

European Grid Infrastructure (EGI)
• 100+ computing sites
• 25,000+ job slots
• ~4 PB of storage

Page 9

Challenges

• Several workflow execution errors
• Several dysfunctions and performance problems
  • Require manual interventions
• Problem: costly manual operations
  • e.g.: rescheduling tasks, restarting services, killing misbehaving experiments, or replicating data files

[Figure: Number of launched and completed workflows in VIP from Jan to Dec 2012]

Average workflow completion rate is about 60%

Page 10

Objectives

• Objective: Automated platform administration
  • Autonomous detection of operational incidents
  • Perform an appropriate set of actions
• Assumptions: Online and non-clairvoyant
  • Decisions must be fast
  • No information about tasks (duration, data transfer time, etc.)
  • No information about resources (availability, performance, etc.)
  • No prediction of user activity or workloads


Page 12

State of the Art

• Self-healing of workflow executions
  • Most works from the literature are offline and/or clairvoyant
• Common techniques to address operational incidents
  • Task resubmission: [Kandaswamy et al., 2008], [Zhang et al., 2009], [Montagnat et al., 2010]
  • Task and file replication: [Cirne et al., 2007], [Ben-Yehuda et al., 2012], [Ma et al., 2013]
  • Task grouping: [Muthuvelu et al., 2005-2013], [Liu and Liao, 2009], [Chen et al., 2013]
  • Heuristics to fairly schedule workflow tasks: [Zhao and Sakellariou, 2006], [N'Takpe and Suter, 2009], [Casanova et al., 2010]

Page 13

Fuzzy Finite State Machine

• The healing process sets the degree of FuSM states from incident detection metrics

[Figure: state machine combining crisp states and fuzzy states]

• Crisp states: possible values are 0 or 1
• Fuzzy states: values between 0 and 1

Page 14

General MAPE-K loop

[Figure: MAPE-K loop with Monitoring, Analysis, Planning and Execution phases around Knowledge; the knowledge is illustrated by a histogram of ηu (frequency vs. degree)]

• Monitoring: triggered by an event (job completion and failures) or a timeout; produces monitoring data
• Analysis: incident degrees, each mapped to a level (level 1, level 2, level 3)
    Incident 1 degree η = 0.8
    Incident 2 degree η = 0.4
    Incident 3 degree η = 0.1
• Roulette wheel selection: incident i is drawn with probability $\eta_i / \sum_{j=1}^{n} \eta_j$ (here, Incident 1 is selected)
• Association rules for incident 1:

    Rule     Confidence (ρ)   ρ × η
    2 → 1    0.8              0.32
    3 → 1    0.2              0.02
    1 → 1    1.0              0.80

• Roulette wheel selection based on association rules (here, Incident 2 is selected)
• Planning: set of actions
• Execution

R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), 2013.
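The two selection steps on this slide can be sketched in Python using the slide's own numbers. This is a minimal illustration, not the thesis implementation: the function names `roulette_wheel` and `rule_weights` are hypothetical, and the reading that a rule j → 1 is weighted by ρ × η_j is inferred from the ρ × η column of the table.

```python
import random


def roulette_wheel(weights, rng=random):
    """Pick a key with probability proportional to its (non-negative) weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    pick = rng.uniform(0, total)
    cumulative = 0.0
    chosen = None
    for key, weight in weights.items():
        cumulative += weight
        if weight > 0:
            chosen = key                 # remember the last selectable key
            if pick <= cumulative:
                return key
    return chosen                        # guard against floating-point rounding


def rule_weights(degrees, confidences):
    """Weight of each rule j -> selected incident: rho_j * eta_j."""
    return {j: confidences[j] * degrees[j] for j in confidences}


# Incident degrees from the slide.
degrees = {1: 0.8, 2: 0.4, 3: 0.1}

# Step 1: draw an incident with probability eta_i / sum_j eta_j.
first = roulette_wheel(degrees, random.Random(0))

# Step 2: association rules for incident 1 (confidence rho of rule j -> 1).
rules_for_incident_1 = {2: 0.8, 3: 0.2, 1: 1.0}
weights = rule_weights(degrees, rules_for_incident_1)   # matches the rho x eta column
second = roulette_wheel(weights, random.Random(1))
```

Because both steps are random draws, a run may select different incidents than the ones shown on the slide; the slide illustrates one possible outcome.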

Page 15

Incident Levels and Actions

• Incident degrees are quantified into discrete incident levels
• Thresholds are determined from mode clustering

[Figure: distribution of incident degrees; thresholds τ cluster platform configurations into groups. In the lowest level no actions are triggered; a higher level triggers a set of actions]
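Mapping a degree to a discrete level can be sketched as below. This is a hedged illustration: the function name `incident_level` and the threshold value 0.5 are made up for the example, and the convention that a degree equal to the threshold stays in the lower level is an assumption consistent with the "η ≤ τ" annotation used for Level 1 on the fineness control slide.

```python
def incident_level(degree, thresholds):
    """Return the 1-based level: 1 plus the number of thresholds strictly exceeded."""
    return 1 + sum(1 for tau in thresholds if degree > tau)


# Hypothetical single threshold; real thresholds come from the platform history.
example_thresholds = [0.5]
levels = [incident_level(eta, example_thresholds) for eta in (0.1, 0.5, 0.8)]
```

With a single threshold this yields two levels: level 1 triggers no action and level 2 triggers the set of actions associated with the incident.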

Page 16

A-priori knowledge

• Based on the workload of VIP
  • January 2011 to April 2012
  • 112 users
  • 2,941 workflow executions
  • 680,988 tasks
    • 338,989 completed
    • 138,480 error
    • 105,488 aborted
    • 15,576 aborted replicas
    • 48,293 stalled
    • 34,162 queued
  • 339,545 pilot jobs

R. Ferreira da Silva, T. Glatard, A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions, CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing (CGWS), Rhodes Island, Greece, 2012.


Page 18

Incident: Activity Blocked

• A task is late compared to the others
• Possible causes
  • Longer waiting times
  • Lost tasks (e.g. killed by site due to quota violation)
  • Resources with poor performance

[Figure: task completion rate of a real simulation (FIELD-II/pasa, workflow-9SIeNv): completed jobs vs. time (s), showing a long-tail effect]

[Figure: job flow of a real simulation]

Page 19

Activity Blocked: State of the Art

• Task replication
  • Is commonly used to address non-clairvoyant problems
  • Drawback: may overload the system and degrade fairness
• Task replication in the literature
  • Is used to increase the probability to complete a task [Ramakrishnan et al., 2009]
  • Use of the Weibull distribution to estimate the number of replicas [Litke et al., 2007]
  • Tasks are replicated only in the tail phase [Ben-Yehuda et al., 2012]
  • Evaluation of the waste of resources by using replication [Cirne et al., 2007]

All approaches make strong assumptions on task or resource characteristics

Page 20

Activity Blocked: Degree

• Degree computed from all completed tasks of the activity
  • Task phases: setup → inputs download → execution → outputs upload
  • Assumption: bag of tasks (all tasks have equal durations)
• Median-based estimation:

    Phase             Median duration   Real duration      Estimated duration
    Setup             50s               42s (completed)    42s
    Inputs download   250s              300s (completed)   300s
    Execution         400s              20s (current)      400s*
    Outputs upload    15s               ?                  15s
    Total             Mi = 715s                            Ei = 757s

    *: max(400s, 20s) = 400s

• Incident degree: task performance w.r.t. median
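The median-based estimation above can be reproduced with a short Python sketch. The function name `estimate_duration` and the data layout are hypothetical; the rule itself (completed phases keep their real duration, the current phase takes the maximum of its median and its elapsed time, and phases not yet started take the median) is read directly from the slide's example.

```python
PHASES = ("setup", "inputs_download", "execution", "outputs_upload")


def estimate_duration(medians, observed):
    """Estimate a task duration.

    observed maps a phase to (seconds, finished) for every phase started so far.
    """
    total = 0
    for phase in PHASES:
        if phase not in observed:
            total += medians[phase]                    # not started: use the median
            continue
        seconds, finished = observed[phase]
        if finished:
            total += seconds                           # completed: use the real duration
        else:
            total += max(medians[phase], seconds)      # current: max(median, elapsed)
    return total


# Values from the slide.
medians = {"setup": 50, "inputs_download": 250, "execution": 400, "outputs_upload": 15}
observed = {"setup": (42, True), "inputs_download": (300, True), "execution": (20, False)}

M_i = sum(medians.values())                  # median task duration
E_i = estimate_duration(medians, observed)   # estimated duration of the running task
```

The sketch stops at the two durations because the slide only states that the incident degree compares task performance with the median, without giving the formula.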

Page 21

Activity blocked: levels and actions

• Levels: identified from the platform logs extracted from VIP on EGI

[Figure: histogram of the activity blocked degree ηb (frequency vs. degree from 0 to 1); the threshold τb separates Level 1 (no actions) from Level 2 (action: replicate tasks)]

• Actions
  • Task replication
  • Cancel replicas with bad performance
  • Replicate only if all active replicas are running

[Figure: replication process for one task]

Page 22

Activity Blocked: Results

• Goal: Self-Healing vs No-Healing
  • Cope with recoverable errors

[Figure: two bar charts of makespan (s) over 5 repetitions, No-Healing vs Self-Healing]

Self-Healing process reduced resource consumption up to 35% when compared to the No-Healing execution

Resource waste:

  $w = \frac{(CPU + data)_{self-healing}}{(CPU + data)_{no-healing}} - 1$

Mean-Shift/hs3 FIELD-II/pasa

Average execution speed up: 3.4 Average execution speed up: 2.9
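The resource waste metric from this slide can be written as a small Python function. The function name and the input numbers are made up for illustration; only the formula comes from the slide. A negative value means that the self-healing execution consumed fewer resources than the no-healing one.

```python
def resource_waste(cpu_self, data_self, cpu_none, data_none):
    """w = (CPU + data)_self-healing / (CPU + data)_no-healing - 1."""
    return (cpu_self + data_self) / (cpu_none + data_none) - 1


# Hypothetical consumption figures (same time unit for CPU and data transfers).
w = resource_waste(cpu_self=500, data_self=150, cpu_none=800, data_none=200)
```

With these made-up inputs, w is about -0.35, which is how a 35% reduction in resource consumption would appear in this metric.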

Page 23

Number of Completed Tasks

[Figure: CDF of the number of completed tasks vs. time (min) for repetitions 1 to 5, No-Healing vs Self-Healing]

Curve similarities up to 95% indicate similar grid conditions

Page 24

Activity Blocked: Conclusions

• First results in controlling blocked activities in these conditions
  • Conditions: production system, non-clairvoyant, online
• Limitation
  • The method only works for bag-of-tasks
  • The waste metric does not consider resource performance
• Currently used in production by VIP
  • From Aug 2012 to Oct 2013 more than 6000 workflow executions benefited
• Publications

R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), 2013.

R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of operational workflow incidents on distributed computing infrastructures, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Ottawa, Canada, 2012.


Page 26

Incident: Fineness Control

• Low performance of lightweight (a.k.a. fine-grained) tasks:
  • High queuing times
  • Communication overhead

[Figure: Gantt chart of lightweight tasks t1 to t5 on resources R1 to R3 over time; lightweight task executions are delayed]

Grouping tasks into coarse-grained tasks reduces the cost of data transfers when grouped tasks share input data, and saves queuing time

Page 27

Fineness Control: State of the Art

• Task grouping in the literature
  • Groups tasks based on the granularity size (processing time) [Muthuvelu et al., 2005]
  • Adds bandwidth to the definition of the granularity size [Ng et al., 2006], [Ang et al., 2009]
  • Defines the granularity size based on QoS requirements
    • Task file size, CPU time, resource constraints [Muthuvelu et al., 2008]
  • Drawback: only works under stationary load
• Adaptive algorithms (non-stationary load)
  • Monitor information about the current availability and capability of resources [Liu and Liao, 2009], [Muthuvelu et al., 2013]

All approaches make strong assumptions on task or resource characteristics

Page 28

Fineness Control: Degree

• Task execution phases: Queued Time, Shared Input Data, Other Input Data, Application Execution
• Incident degree:

  $\eta_f = \max_{i \in [1,m]} \{ f_i = d_i \cdot r_i \}$

[Figure residue: the slide annotates the symbols $\tilde{t}_{shared}$ and $t_{q_j}$ with the labels "Median task phase durations" and "i = waiting task, n = number of waiting tasks"]

Page 29

Fineness control: levels and actions

• Levels: identified from the platform logs extracted from VIP on EGI

[Figure: histogram of the fineness control degree ηf (frequency vs. degree from 0 to 1); the threshold τf separates Level 1 (no actions) from Level 2 (action: task grouping)]

• Actions
  • Task grouping
  • Tasks are grouped pairwise until ηf ≤ τf or until Q ≤ R

Page 30

Coarseness control

• Non-stationary load
  • Loss of parallelism
  • Task de-grouping
• Incident degree:

  $\eta_c = \frac{R}{Q + R}$

• Levels: $\tau_c = 0.5$

[Figure: Gantt charts on resources R1 to R3 over time: individual tasks t1 to t5 at time t1, then grouped tasks (t1, t2+t3, t4+t5) at time t2, showing a loss of parallelism]

De-group tasks when R > Q
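The coarseness degree and the de-grouping condition can be checked with a few lines of Python. This is a sketch under an assumption: the slide does not define R and Q, so they are taken here as the numbers of running and waiting tasks, borrowing the legend of the fairness control slide. The function names are hypothetical.

```python
def coarseness_degree(running, queued):
    """eta_c = R / (Q + R); defined as 0.0 when there are no tasks at all."""
    total = running + queued
    return running / total if total else 0.0


def should_degroup(running, queued, tau_c=0.5):
    """De-group when eta_c exceeds tau_c, which for tau_c = 0.5 means R > Q."""
    return coarseness_degree(running, queued) > tau_c


decisions = [should_degroup(r, q) for r, q in ((3, 2), (2, 3), (2, 2))]
```

The sketch makes explicit why the threshold is 0.5: R / (Q + R) > 0.5 is algebraically the same condition as R > Q.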

Page 31

Results: Non-Stationary Load

• Experiment
  • Evaluate the de-grouping control process under non-stationary load

[Figure: makespan (s) for Runs 1 to 5 comparing Fineness, Fineness-Coarseness and No-Granularity; annotations: resources appear progressively, resources appear suddenly]

Executions are sped up by up to a factor of 1.5 for Fineness, and 2.1 for Fineness-Coarseness

Fineness is penalized by its lack of adaptation: slowdown of 20%

The linear correlation coefficient between the makespan and the average queuing time is 0.91, indicating that the two are strongly correlated

Page 32

Task Granularity: Conclusions

• First results in controlling task granularity in these conditions
  • Conditions: production system, non-clairvoyant, online
• Limitation
  • The method only works for data-intensive workloads
• Future Work
  • Task pre-emption to handle the scenario where resources suddenly appear and all tasks are running
• Publications

R. Ferreira da Silva, T. Glatard, F. Desprez, On-line, non-clairvoyant optimization of workflow activity granularity task on grids, Euro-Par, Aachen, 2013.

R. Ferreira da Silva, T. Glatard, F. Desprez, Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions, Concurrency and Computation: Practice and Experience (CCPE), Submitted, 2014.


Page 34

Incident: Unfairness Among Workflow Executions

• Under resource contention, workflows are unequally slowed down by concurrent executions

[Figure: Gantt charts of 3 identical workflows (5 tasks each, t_{i,j} = 10s) submitted sequentially on resources R1 to R3]

Slowdown:

  $s = \frac{M_{multi}}{M_{own}}$

  M_multi: makespan with concurrent executions
  M_own: makespan without concurrent executions

  $s_1 = \frac{20}{20} = 1.0$, $s_2 = \frac{40}{20} = 2.0$, $s_3 = \frac{50}{20} = 2.5$

Identical workflow executions do not experience the same slowdown
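The slowdown values of this example can be recomputed directly from the makespans shown on the slide; the function and variable names below are only illustrative.

```python
def slowdown(m_multi, m_own):
    """s = M_multi / M_own (makespan with / without concurrent executions)."""
    return m_multi / m_own


# Makespans (s) of the three identical workflows from the slide.
m_own = 20
makespans_multi = [20, 40, 50]
slowdowns = [slowdown(m, m_own) for m in makespans_multi]
```

Although the three workflows are identical, the later submissions are slowed down more, which is the unfairness the slide points out.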

Page 35

Fairness: State of the Art

• Workflow execution fairness in the literature
  • Addresses fairness based on the slowdown of DAGs based on execution and data transfer times [Zhao and Sakellariou, 2006], [Casanova et al., 2010]
  • Proposes a mapping procedure to increase fairness based on the critical path length [N'Takpe and Suter, 2009]
  • Online, but clairvoyant, HEFT-like algorithms [Hsu et al., 2011], [Sommerfield and Richter, 2011], [Arabnejad and Barbosa, 2012]
  • Non-clairvoyant, but offline, scheduling strategy based on task labeling and adaptive allocation [Hirales-Carbajal et al., 2012]

No algorithm was proposed in a non-clairvoyant and online case

Page 36

Fairness Control: Degree

• Unfairness degree: max difference between the fractions of pending work

  $\eta_u = W_{max} - W_{min}$

  where:

  $W_i = \max_{j \in [1, n_i]} \left( \frac{Q_{i,j}}{Q_{i,j} + R_{i,j} \cdot P_{i,j}} \cdot T_{i,j} \right)$

  i = activity, n_i = active activities
  Q_{i,j} = number of waiting tasks
  R_{i,j} = number of running tasks
  P_{i,j} = performance
  T_{i,j} = relative observed duration
  P_{i,j} and T_{i,j} are derived from median task phase durations

A low P_{i,j} indicates that resources allocated to the activity have bad performance for the activity
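A minimal Python sketch of the unfairness degree follows. It assumes the reading W_i = max_j ( Q / (Q + R·P) · T ) of the slide's formula; how P and T are obtained from the median phase durations is not shown on the slide, so they are plain inputs here. The function names and all numbers are hypothetical.

```python
def pending_work(entries):
    """W_i = max over entries j of Q / (Q + R * P) * T.

    entries: list of (Q, R, P, T) tuples; entries with Q + R * P == 0 are skipped.
    """
    return max(
        (q / (q + r * p) * t for q, r, p, t in entries if q + r * p > 0),
        default=0.0,
    )


def unfairness_degree(groups):
    """eta_u = W_max - W_min over all groups i."""
    fractions = [pending_work(entries) for entries in groups]
    return max(fractions) - min(fractions)


# Made-up example with two groups.
group_a = [(8, 2, 1.0, 1.0)]                        # W = 8 / 10 = 0.8
group_b = [(2, 8, 1.0, 1.0), (1, 3, 0.5, 1.0)]      # W = max(0.2, 0.4) = 0.4
eta_u = unfairness_degree([group_a, group_b])
```

In this made-up case one group still has 80% of its work pending while the other has 40%, giving a degree of 0.4; equal fractions would give 0.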

Page 37

Fairness Control: Levels and Actions

• Levels: identified from the platform logs extracted from VIP on EGI

[Figure: histogram of the fairness control degree ηu (frequency vs. degree from 0 to 1); the threshold τu separates Level 1 (no actions) from Level 2 (action: task prioritization)]

• Actions
  • Task prioritization
  • Task priority is an integer initialized to 1
  • Increase priority of Δi,j tasks

Page 38

Fairness Control: Metrics

• Unfairness
  • Is the area under the curve ηu during the execution:

    $\mu = \sum_{i=2}^{M} \eta_u(t_i) \cdot (t_i - t_{i-1})$

  • This metric measures whether the fairness process can indeed minimize its own criterion ηu
• Slowdown

    $s = \frac{M_{multi}}{M_{own}}$

  where:

    $M_{own} = \max_{p \in \Omega} \sum_{u \in p} t_u$
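The unfairness metric μ is a discrete area under the ηu curve, which the following Python sketch computes from time-stamped samples. The function name and the sample values are made up; the sum starts at the second sample, as in the slide's formula.

```python
def unfairness_area(samples):
    """mu = sum over i >= 2 of eta_u(t_i) * (t_i - t_{i-1}).

    samples: list of (t_i, eta_u(t_i)) pairs sorted by increasing time.
    """
    return sum(
        eta * (t - t_prev)
        for (t_prev, _), (t, eta) in zip(samples, samples[1:])
    )


# Made-up measurements: (time in seconds, unfairness degree).
samples = [(0, 0.0), (10, 0.5), (30, 0.2)]
mu = unfairness_area(samples)   # 0.5 * 10 + 0.2 * 20
```

Because each degree is weighted by the time elapsed since the previous sample, irregular sampling intervals are handled without resampling.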

Page 39

Results: identical workflows

• Tests whether unfairness among identical workflows is properly addressed

[Figure: unfairness degree vs. time (s) for repetitions 1 to 4, Fairness vs No-Fairness]

[Figure: makespan (s) of Gate 1, Gate 2 and Gate 3 for repetitions 1 to 4, Fairness vs No-Fairness]

Makespans and unfairness degree values are significantly reduced

Page 40

Results: different workflows

• Tests whether unfairness among different workflows is detected and properly handled

[Figure: slowdown (log scale) of FIELD-II, Gate, PET-Sorteo and SimuBloch for repetitions 1 to 4, Fairness vs No-Fairness]

[Figure: unfairness degree vs. time (s) for repetitions 1 to 4, Fairness vs No-Fairness]

The standard deviation of the slowdown is reduced by up to a factor of 3.8, and the unfairness value by up to a factor of 1.9

Page 41

Fairness Control: Conclusions

• First results in controlling fairness among workflow executions in these conditions
  • Conditions: production system, non-clairvoyant, online
• Limitation
  • Fairness optimization is delayed due to the acquisition of information about the applications
  • The method works best for applications with a lot of short tasks
• Future Work
  • Evaluation of the influence of the metrics' parameters
• Publications

R. Ferreira da Silva, T. Glatard, F. Desprez, Workflow fairness control on online and non-clairvoyant distributed computing platforms, Euro-Par, Aachen, 2013.

R. Ferreira da Silva, T. Glatard, F. Desprez, Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions, Concurrency and Computation: Practice and Experience (CCPE), Submitted, 2014.


Page 43

Contributions Summary

Self-healing of workflow incidents
 - Generic MAPE-K loop
 - Non-clairvoyance and online
[Ferreira da Silva et al., CCGRID'12, FGCS'13]

Treatment of blocked activities
 - Properly detects and handles blocked activities

Optimization of task granularity
 - Properly detects and handles lightweight tasks under stationary and non-stationary loads
[Ferreira da Silva et al., Euro-Par'13a]

Fairness control among workflow executions
 - Properly detects and handles unfairness among workflow executions
[Ferreira da Silva et al., Euro-Par'13b, CPE'14]

Science-gateway model for workload archive
 - Illustration by using traces of the VIP from 2011/2012
[Ferreira da Silva and Glatard, CGWS'12]

All methods were evaluated on VIP
 - Production platform with about 500 users
[Ferreira da Silva et al., HealthGrid'11; Glatard et al., TMI'13]

Page 44

Perspectives

• Mode detection automation
  • Automatically detect variations in threshold values
• Time-windowed historical information
  • User behavior may change
  • Errors may be restricted to a specific time span
• Optimization of the incident selection method
  • There is no mechanism to prevent an incident from being selected repeatedly
• Sensitivity analysis of parameters
  • Evaluate the influence of parameters on the metrics
• Workflow workload archive
  • The science gateway workload archive model does not embrace all characteristics inherent to a workflow execution

Page 45

Thank you for your attention. Questions?

http://vip.creatis.insa-lyon.fr

A science-gateway for workflow executions: online and non-clairvoyant self-healing of workflow executions on grids

Rafael FERREIRA DA SILVA
University of Lyon, CNRS, INSERM, CREATIS
Villeurbanne, France

Supervisors: Frédéric DESPREZ and Tristan GLATARD