Workflow fairness control on online and non-clairvoyant distributed computing platforms
1 Rafael Ferreira da Silva – [email protected]
Workflow Fairness Control on Online and Non-Clairvoyant
Distributed Computing Platforms
Rafael FERREIRA DA SILVA, Tristan GLATARD University of Lyon, CNRS, INSERM, CREATIS
Villeurbanne, France
Frédéric DESPREZ INRIA, University of Lyon, LIP, ENS Lyon
Lyon, France
Euro-Par 2013 August 26-30, 2013
Outline
Context: The Virtual Imaging Platform
Problem definition: Fairness among workflow executions, Self-healing of workflow executions on grids
Fairness control process
Experiments and results
Conclusion
Context Virtual Imaging Platform (VIP)
Medical imaging science-gateway
Grid of ~180 sites (EGI – http://www.egi.eu)
Significant usage: 452 registered users from 50 countries
Consumed 472 CPU years from August 2012 to July 2013 (http://dirac.france-grilles.fr)
VIP consumption since August 2012
Workflow Execution
1. Input data upload
2. User launches a simulation
3. MOTEUR generates invocations
4. GASW generates grid jobs
5. Jobs are submitted to DIRAC
6. Pilot jobs are submitted to EGI
7. Pilot jobs fetch grid jobs
8. Inputs download
9. Execution
10. Results upload
11. Download results
Under resource contention workflows are unequally slowed down by concurrent executions
Fairness among workflow executions
Example: 3 identical workflows submitted sequentially ($t_{i,j}$ = 10 s)
[Figure: Gantt chart of tasks $t_{1,1}$ to $t_{3,5}$ scheduled on resources R1, R2 and R3; horizontal axis: time (s)]
Slowdown: $s = \frac{M_{multi}}{M_{own}}$
$M_{multi}$: makespan with concurrent executions
$M_{own}$: makespan without concurrent executions
$s_1 = \frac{20}{20} = 1.0$, $s_2 = \frac{40}{20} = 2.0$, $s_3 = \frac{50}{20} = 2.5$
Identical workflow executions do not experience the same slowdown
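The slowdown values of this example can be reproduced in a few lines of Python. This is only a minimal sketch: the function name `slowdown` and the dictionary layout are illustrative assumptions, and the makespans are the ones read off the example.

```python
def slowdown(m_multi: float, m_own: float) -> float:
    """Slowdown s = M_multi / M_own of one workflow execution."""
    if m_own <= 0:
        raise ValueError("m_own must be positive")
    return m_multi / m_own


# Makespans in seconds, taken from the example:
# workflow -> (makespan with concurrent executions, makespan without)
makespans = {1: (20, 20), 2: (40, 20), 3: (50, 20)}

slowdowns = {w: slowdown(multi, own) for w, (multi, own) in makespans.items()}
```

A slowdown of 1.0 means the workflow was not delayed by the other executions.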
Example: 2 identical workflows submitted sequentially ($t_{i,j}$ = 10 s), and a very short workflow ($t$ = 2 s)
[Figure: Gantt chart of tasks $t_{1,1}$ to $t_{3,5}$ scheduled on resources R1, R2 and R3; horizontal axis: time (s)]
Slowdown: $s = \frac{M_{multi}}{M_{own}}$
$s_1 = \frac{20}{20} = 1.0$, $s_2 = \frac{40}{20} = 2.0$, $s_3 = \frac{36}{6} = 6.0$
Very short workflow executions are extremely slowed down
Workflow Self-Healing
Problem: costly manual operations
Rescheduling tasks, restarting services or replicating data files
In this work: fairly allocating computing resources among workflow executions
Objective: automated platform administration
Autonomous detection of unfairness among workflow executions
Perform appropriate set of actions
Assumptions: online and non-clairvoyant
Only partial information available
Decisions must be fast
Production conditions, no prediction of user activity or workloads
General MAPE-K loop
[Figure: MAPE-K loop with Monitoring, Analysis, Planning and Execution phases and a Knowledge component; the loop is triggered by an event (job completion and failures) or a timeout, and Monitoring produces monitoring data]
Analysis: incident degrees, each quantified in levels 1 to 3
Incident 1: degree η = 0.8
Incident 2: degree η = 0.4
Incident 3: degree η = 0.1
Planning: roulette wheel selection, where incident i is selected with probability $\frac{\eta_i}{\sum_{j=1}^{n} \eta_j}$ (in the example: Incident 1 selected)
Planning: roulette wheel selection based on association rules (in the example: Incident 2 selected)
Association rules for incident 1:
Rule     Confidence (ρ)   ρ × η
2 → 1    0.8              0.32
3 → 1    0.2              0.02
1 → 1    1.0              0.80
Execution: set of actions
R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), in press, 2013.
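The two roulette wheel selections of the MAPE-K loop can be sketched in Python as follows. This is an illustration under assumptions, not the platform's implementation: the helper `roulette_wheel`, the dictionary layout and the fixed random seeds are hypothetical, while the degrees and confidences are the ones shown in the example.

```python
import random


def roulette_wheel(weights, rng=random):
    """Pick a key with probability weight / sum(weights)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    pick = rng.random() * total  # uniform in [0, total)
    cumulative = 0.0
    for key, weight in weights.items():
        cumulative += weight
        if pick < cumulative:
            return key
    return key  # guard against floating-point rounding


# Incident degrees from the example (incident id -> eta).
degrees = {1: 0.8, 2: 0.4, 3: 0.1}
selected_incident = roulette_wheel(degrees, random.Random(0))

# Association rules for incident 1: (source incident, target incident) -> rho.
confidence = {(2, 1): 0.8, (3, 1): 0.2, (1, 1): 1.0}
# Each rule is weighted by rho * eta of its source incident.
rule_weights = {rule: rho * degrees[rule[0]] for rule, rho in confidence.items()}
selected_rule = roulette_wheel(rule_weights, random.Random(1))
```

With these weights, incident 1 is selected with probability 0.8 / 1.3, and rule 1 → 1 with probability 0.80 / 1.14.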
Incident degrees are quantified in discrete incident levels
Thresholds are determined from visual mode clustering or K-means
Incident Levels and Actions
[Figure: incident levels separated by thresholds; one region is labeled "No actions are triggered", the other "Triggers a set of actions"]
Thresholds cluster platform configurations into groups
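Mapping an incident degree to a discrete level amounts to locating it among the thresholds. The sketch below assumes hypothetical threshold values (0.3 and 0.6); on the platform they would come from visual mode clustering or K-means, and the function name `incident_level` is illustrative.

```python
from bisect import bisect_right


def incident_level(degree, thresholds):
    """Return the 1-based level of `degree` given ascending thresholds.

    A degree equal to a threshold is assigned to the higher level.
    """
    return bisect_right(sorted(thresholds), degree) + 1


# Hypothetical thresholds, for illustration only.
thresholds = [0.3, 0.6]
levels = {eta: incident_level(eta, thresholds) for eta in (0.1, 0.4, 0.8)}
```

With these thresholds, the degrees 0.1, 0.4 and 0.8 fall in levels 1, 2 and 3 respectively.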
Fairness control: degree
Unfairness degree (max difference between the fractions of pending work):
$\eta_u = W_{max} - W_{min}$
where:
$W_i = \max_{j \in [1, n_i]} \left\{ \frac{Q_{i,j}}{Q_{i,j} + R_{i,j} \cdot P_{i,j}} \cdot T_{i,j} \right\}$
i = workflow, j = activity, $n_i$ = number of active activities of workflow i
$Q_{i,j}$ = number of waiting tasks
$R_{i,j}$ = number of running tasks
Relative observed duration:
$T_{i,j} = \frac{\tilde{t}_{i,j}}{\max_{v \in [1,m],\, w \in [1,n_i^*]} (\tilde{t}_{v,w})}$
Performance:
$P_{i,j} = 2 \cdot \left( 1 - \max_{u \in [1,k_j]} \left\{ \frac{t_u}{\tilde{t}_{i,j} + t_u} \right\} \right)$
$\tilde{t}_{i,j}$ = median task phase durations
A low $P_{i,j}$ indicates that the resources allocated to the activity perform badly for this activity
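The unfairness degree can be computed directly from the definitions of this slide. The sketch below is only an illustration: the function names and the tuple layout `(Q, R, P, T)` are assumptions, `t_u` is taken to be the duration of a task of the activity, and an activity with no waiting and no running tasks is assumed to contribute no pending work.

```python
def performance(t_median, task_durations):
    """P = 2 * (1 - max_u(t_u / (t_median + t_u)))."""
    return 2.0 * (1.0 - max(t / (t_median + t) for t in task_durations))


def relative_duration(t_median, all_medians):
    """T = t_median divided by the largest median over all activities."""
    return t_median / max(all_medians)


def pending_fraction(q, r, p, t_rel):
    """Q / (Q + R * P) * T for one activity."""
    denominator = q + r * p
    if denominator == 0:
        return 0.0  # assumption: nothing waiting or running means no pending work
    return q / denominator * t_rel


def unfairness_degree(workflows):
    """eta_u = W_max - W_min, with W_i the max pending fraction of workflow i.

    `workflows` is a list of workflows, each a list of (Q, R, P, T) tuples.
    """
    w = [max(pending_fraction(*activity) for activity in acts) for acts in workflows]
    return max(w) - min(w)


# Two workflows with one active activity each: the first only has waiting
# tasks, the second only running tasks.
eta_u = unfairness_degree([[(10, 0, 1.0, 1.0)], [(0, 5, 1.0, 1.0)]])
```

In this toy case the first workflow has all of its work pending and the second none, so the unfairness degree reaches its maximum.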
Fairness control: task estimation
Estimation of task durations
Job phases: setup, inputs download, execution, outputs upload
Assumption: bag of tasks (all jobs have equal durations)
Median-based estimation:
Phase                            setup             inputs download    execution        outputs upload   Total
Median duration of job phases    50s               250s               400s             15s              $\tilde{t}$ = 715s
Real job duration                42s (completed)   300s (completed)   20s (current)    ?
Estimated job duration           42s               300s               400s*            15s              $t_{i,j}$ = 757s
*: max(400s, 20s) = 400s
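The median-based estimation of this example can be written as a short function. This is a sketch under assumptions: the function name, argument layout and phase ordering (setup, inputs download, execution, outputs upload) are illustrative. Completed phases keep their real duration, the current phase takes the maximum of the median and the time already spent, and phases not yet started take the median.

```python
def estimate_job_duration(median_phases, observed, current_index):
    """Estimate a job duration from median phase durations.

    median_phases: median duration of each phase, in order
    observed: real durations of the completed phases, followed by the
              time spent so far in the current phase
    current_index: index of the phase currently running
    """
    total = 0
    for k, median in enumerate(median_phases):
        if k < current_index:        # completed phase: real duration
            total += observed[k]
        elif k == current_index:     # current phase: at least the median
            total += max(median, observed[k])
        else:                        # future phase: median
            total += median
    return total


medians = [50, 250, 400, 15]   # seconds, from the example
observed = [42, 300, 20]       # two completed phases, 20s into execution
median_total = sum(medians)
estimate = estimate_job_duration(medians, observed, current_index=2)
```

On the values of the example, this gives the median total of 715s and the estimated job duration of 757s.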
Fairness control: levels and actions
Levels: identified from the platform logs
[Figure: threshold $\tau_u$ separating Level 1 (no actions) from Level 2 (action: task prioritization)]
Actions: task prioritization
Task priority is an integer initialized to 1
Increase priority of $\Delta_{i,j}$ tasks:
$\Delta_{i,j} = Q_{i,j} - \left\lfloor \frac{(\tau_u + W_{min})(Q_{i,j} + R_{i,j} P_{i,j})}{T_{i,j}} \right\rfloor$
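The number of tasks to prioritize follows directly from the formula for Δi,j. The sketch below is illustrative only: the function name and the input values are assumptions, not figures from the platform.

```python
import math


def tasks_to_prioritize(q, r, p, t_rel, tau_u, w_min):
    """Delta = Q - floor((tau_u + W_min) * (Q + R * P) / T)."""
    return q - math.floor((tau_u + w_min) * (q + r * p) / t_rel)


# Hypothetical activity: 10 waiting tasks, none running, P = T = 1,
# with a threshold tau_u = 0.25 and W_min = 0.25.
delta = tasks_to_prioritize(q=10, r=0, p=1.0, t_rel=1.0, tau_u=0.25, w_min=0.25)
```

Here 5 of the 10 waiting tasks would have their priority increased, which is the number needed to bring the activity's fraction of pending work down to τu + Wmin.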
Workload for Case Studies
Based on the workload of VIP, January 2011 to April 2012
112 users
2,941 workflow executions
339,545 pilot jobs
680,988 tasks: 338,989 completed, 138,480 error, 105,488 aborted, 15,576 aborted replicas, 48,293 stalled, 34,162 queued
Case studies on: pilot jobs, user accounting, task analysis, bag of tasks, workflows
R. Ferreira da Silva, T. Glatard, A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions, CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing (CGWS), Rhodes Island, Greece, 2012.
Experiment Conditions
Experiment 1: tests whether unfairness among identical workflows is properly addressed
Experiment 2: tests whether the performance of very short workflow executions is improved by the fairness mechanism
Experiment 3: tests whether unfairness among different workflows is detected and properly handled
[Table: workflow characteristics]
The experiments are performed on the Virtual Imaging Platform
Experiments: metrics
Unfairness: area under the curve $\eta_u$ during the execution:
$\mu = \sum_{i=2}^{M} \eta_u(t_i) \cdot (t_i - t_{i-1})$
This metric measures whether the fairness process can indeed minimize its own criterion $\eta_u$
Slowdown:
$s = \frac{M_{multi}}{M_{own}}$
where:
$M_{own} = \max_{p \in \Omega} \sum_{u \in p} t_u$
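Both metrics are simple to compute from logged values. The sketch below is a hedged illustration: the function names and sample values are assumptions, and Ω is assumed to be the set of paths of the workflow, each path given as a list of task durations.

```python
def unfairness_area(times, eta_values):
    """mu = sum over i >= 2 of eta_u(t_i) * (t_i - t_{i-1})."""
    if len(times) != len(eta_values):
        raise ValueError("times and eta_values must have the same length")
    return sum(
        eta_values[i] * (times[i] - times[i - 1]) for i in range(1, len(times))
    )


def own_makespan(paths):
    """M_own = max over paths p of the sum of the task durations in p."""
    return max(sum(path) for path in paths)


# Hypothetical samples of eta_u at three measurement times (seconds).
mu = unfairness_area([0, 10, 20], [0.0, 0.5, 0.2])

# Hypothetical workflow with two paths of task durations (seconds).
m_own = own_makespan([[10, 10], [10, 5, 2]])
```

The first sample only provides the starting time, since the sum starts at i = 2.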
Results: identical workflows
Makespans and unfairness degree values are significantly reduced: σm is reduced by up to a factor of 15, σs by up to a factor of 7, and µ by a factor of about 2
Results: very short workflows
Makespans of very short workflow executions are significantly reduced: σs is reduced by up to a factor of 5.9, and µ by up to a factor of 1.9
Results: very short workflows (2)
Speeds up executions by up to a factor of 2.9, and reduces the average task waiting time by up to a factor of 4.4 and the slowdown by up to a factor of 5.9
Results: different workflows
σs is reduced by up to a factor of 3.8, and µ by up to a factor of 1.9
Concluding remarks
Context
Autonomous handling of unfairness among workflow executions
No strong assumptions on resource characteristics and workload
Summary of the proposed method
Implements a generic MAPE-K loop
Quantifies unfairness based on the fraction of pending work: ratio of queuing tasks, relative durations, and performance
Controlling fairness among workflow executions
Properly detects and handles unfairness among workflow executions
Significantly reduced the standard deviation of the slowdown and the unfairness metric for: identical workflows, very short workflow executions, different workflows
Thank you for your attention. Questions?
Acknowledgments:
VIP users and project members
French National Agency for Research (ANR-09-COSI-03, ANR-11-LABX-0063)
EC FP7 Programme (312579 ER-flow)
European Grid Initiative (EGI)
France-Grilles