Fault Tolerant Clustering (IEEE Services 2012)
![Page 1: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/1.jpg)
Fault Tolerant Clustering in Scientific Workflows
Weiwei Chen, Ewa Deelman
Information Sciences Institute
University of Southern California
![Page 2: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/2.jpg)
Outline
• Introduction
• Workflow and Failure Model
• Fault Tolerant Clustering
• Experiments
• Task Specific Failures
• Location Specific Failures
![Page 3: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/3.jpg)
Introduction
• Task-based Scientific Workflows
  – Task
  – Job
• Task Clustering
  – Merges multiple small tasks into a job
  – Reduces scheduling and submit overhead
• Fault Tolerance in Task Clustering
  – Existing techniques underestimate or ignore the influence of failures
![Page 4: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/4.jpg)
Task Clustering
• Task Clustering
  – Horizontal Clustering
  – Vertical Clustering
  – Arbitrary Clustering
• Clustering Factor (k): number of tasks in a job
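As a rough illustration of the clustering factor, the sketch below groups tasks into jobs of at most k tasks. It assumes horizontal clustering means merging tasks that sit at the same level of the workflow, which is how the term is commonly used; the function name and task names are hypothetical and not taken from the slides.

```python
from typing import List, Sequence


def horizontal_clustering(tasks: Sequence[str], k: int) -> List[List[str]]:
    """Group tasks from one workflow level into jobs of at most k tasks."""
    if k < 1:
        raise ValueError("clustering factor k must be at least 1")
    # Consecutive slices of length k; the last job may hold fewer tasks.
    return [list(tasks[i:i + k]) for i in range(0, len(tasks), k)]


# Hypothetical task names for a single workflow level.
level_tasks = [f"task_{i}" for i in range(7)]
jobs = horizontal_clustering(level_tasks, k=3)
print(jobs)
```

With seven tasks and k = 3 the last job is only partly filled, which is one reason the later slides treat the tail of the execution separately.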
![Page 5: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/5.jpg)
System Overview
[Figure: execution timeline with clustering vs. without clustering; annotations: improvement, scheduling and submit delay]
![Page 6: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/6.jpg)
Task Failures and Job Failures
• We focus only on transient failures and job retry
• We do not differentiate the causes of failures; we are concerned only with the average failure rate
• Assumption: a failure is a random event independent of workflow characteristics or execution environment
• Two categories
  o Task failure: a task fails; other tasks in the same job may not fail
    § E.g. application
  o Job failure: a job fails; all of its tasks fail
    § E.g. scheduling system
![Page 7: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/7.jpg)
Influence of Failures on Clustering

| Symbol | Meaning |
|---|---|
| $t_{total}$ | Estimated overall runtime |
| $n$ | Number of tasks to run |
| $t$ | Runtime of a single task |
| $r$ | Number of available resources |
| $d$ | Time delay between jobs |
| $N$ | Expected retry times for a single task |
| $k$ | Number of tasks in a job |
| $\beta$ | Job failure rate |
| $\alpha$ | Task failure rate |

Target function: $\min(t_{total})$, given n tasks to run on r resources, where either the task failure rate (α) is measurable (Task Failure Model) or the job failure rate (β) is measurable (Job Failure Model)
Assumption: n >> r, but n/k >> r
![Page 8: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/8.jpg)
Job Failure Model
Runtime for a single job:
$$t_{job} = kt + d$$
Average retry times for a single job:
$$N_{job} = \frac{1}{1-\beta}$$
Retry times for all jobs:
$$N_{total} = \begin{cases} \dfrac{N_{job}\,n}{rk}, & \text{if } \dfrac{n}{k} \ge r \\[2ex] N_{job}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
Overall runtime:
$$t_{total} = t_{job}\,N_{total}$$
Combined, writing $N$ for $N_{job}$:
$$t_{total} = \begin{cases} \dfrac{N n (kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\beta)}, & \text{if } \dfrac{n}{k} \ge r \\[2ex] N(kt+d) = \dfrac{kt+d}{1-\beta}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
![Page 9: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/9.jpg)
Job Failure Model
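$$t_{total} = \begin{cases} \dfrac{N n (kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\beta)}, & \text{if } \dfrac{n}{k} \ge r \\[2ex] N(kt+d) = \dfrac{kt+d}{1-\beta}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
$$k^* = \frac{n}{r}, \qquad t_{total}^* = \frac{k^* t + d}{1-\beta}$$
• k* is independent of β
• It is not necessary to adjust k; just set it to k* = n/r
[Figure: plot for n = 1000, t = 5 sec, d = 5 sec, r = 20]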
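The claim that the optimal clustering factor does not depend on β can be checked numerically. The following is a minimal sketch of the job failure model as read from the slide's formula, evaluated with the parameter values quoted on the slide; the function name is hypothetical, and the model treats n/(rk) as a continuous quantity rather than a whole number of scheduling rounds.

```python
def t_total_jfm(n: int, t: float, d: float, r: int, k: int, beta: float) -> float:
    """Estimated overall runtime under the job failure model."""
    n_job = 1.0 / (1.0 - beta)   # N_job = 1 / (1 - beta)
    t_job = k * t + d            # runtime of one job attempt
    # The condition n/k >= r is tested as n >= r*k to stay in integer arithmetic.
    n_total = n_job * n / (r * k) if n >= r * k else n_job
    return t_job * n_total


n, t, d, r = 1000, 5.0, 5.0, 20  # values quoted on the slide
for beta in (0.1, 0.5, 0.9):
    best_k = min(range(1, n + 1), key=lambda k: t_total_jfm(n, t, d, r, k, beta))
    print(beta, best_k, t_total_jfm(n, t, d, r, best_k, beta))
```

Because β only enters as the common factor 1/(1 − β), it scales every candidate k equally, so the minimizer stays at n/r = 50 for each failure rate tried here.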
![Page 10: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/10.jpg)
Task Failure Model
Runtime for a single job:
$$t_{job} = kt + d$$
Average retry times for a single job:
$$N_{job} = \frac{1}{(1-\alpha)^k}$$
Retry times for all jobs:
$$N_{total} = \begin{cases} \dfrac{N_{job}\,n}{rk}, & \text{if } \dfrac{n}{k} \ge r \\[2ex] N_{job}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
Overall runtime:
$$t_{total} = t_{job}\,N_{total}$$
Combined, writing $N$ for $N_{job}$:
$$t_{total} = \begin{cases} \dfrac{N n (kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^k}, & \text{if } \dfrac{n}{k} \ge r \\[2ex] N(kt+d) = \dfrac{kt+d}{(1-\alpha)^k}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
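One way to read N_job = 1/(1 − α)^k is as the mean of a geometric distribution: an attempt of a job succeeds only if all k of its tasks succeed, each independently with probability 1 − α, and the whole job is retried otherwise. That reading is an assumption made for this sketch rather than something the slide states, and the function names are hypothetical. The simulation below compares the empirical mean number of attempts with the formula.

```python
import random


def attempts_until_success(k: int, alpha: float, rng: random.Random) -> int:
    """Count attempts until one attempt runs all k tasks without a failure."""
    attempts = 0
    while True:
        attempts += 1
        # A task succeeds with probability 1 - alpha; the job needs all k to succeed.
        if all(rng.random() >= alpha for _ in range(k)):
            return attempts


def mean_attempts(k: int, alpha: float, trials: int = 20000, seed: int = 0) -> float:
    """Average attempts per job over many simulated jobs."""
    rng = random.Random(seed)
    return sum(attempts_until_success(k, alpha, rng) for _ in range(trials)) / trials


k, alpha = 5, 0.1
simulated = mean_attempts(k, alpha)
predicted = 1.0 / (1.0 - alpha) ** k
print(simulated, predicted)
```

The exponent k is what makes large jobs expensive under task failures: every extra task in a job multiplies the expected number of attempts by 1/(1 − α).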
![Page 11: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/11.jpg)
Task Failure Model
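$$t_{total} = \begin{cases} \dfrac{N n (kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^k}, & \text{if } \dfrac{n}{k} \ge r \\[2ex] N(kt+d) = \dfrac{kt+d}{(1-\alpha)^k}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
Setting the derivative of the $n/k \ge r$ branch with respect to k to zero gives $t k^2 + d k + d/\ln(1-\alpha) = 0$, whose positive root is
$$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
$$t_{total}^* = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$
• k* depends on α
• It is necessary to adjust k according to α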
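A quick numerical cross-check of the task failure model optimum is sketched below. The closed form used is the positive root of t k² + d k + d/ln(1 − α) = 0, which is what differentiating the n/k ≥ r branch of t_total yields; it is compared with a brute-force search over integer k. The values n = 1000, t = 5, d = 5, r = 20 are borrowed from the job failure model slide, and the function names are hypothetical.

```python
import math


def t_total_tfm(n: int, t: float, d: float, r: int, k: int, alpha: float) -> float:
    """Estimated overall runtime under the task failure model."""
    n_job = 1.0 / (1.0 - alpha) ** k   # N_job = 1 / (1 - alpha)^k
    t_job = k * t + d                  # runtime of one job attempt
    n_total = n_job * n / (r * k) if n >= r * k else n_job
    return t_job * n_total


def k_star_tfm(t: float, d: float, alpha: float) -> float:
    """Continuous minimizer of the n/k >= r branch, valid for 0 < alpha < 1."""
    c = math.log(1.0 - alpha)          # negative for 0 < alpha < 1
    return (-d + math.sqrt(d * d - 4.0 * t * d / c)) / (2.0 * t)


n, t, d, r = 1000, 5.0, 5.0, 20
for alpha in (0.01, 0.05, 0.2):
    # Search only k <= n/r, where the first branch of the model applies.
    brute = min(range(1, n // r + 1), key=lambda k: t_total_tfm(n, t, d, r, k, alpha))
    print(alpha, round(k_star_tfm(t, d, alpha), 2), brute)
```

The integer found by brute force should sit within one of the continuous optimum, and both shrink as α grows, which is the behaviour the slide summarizes.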
![Page 12: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/12.jpg)
Comparing TFM and JFM
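Task Failure Model:
$$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
$$t_{total}^* = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$
Job Failure Model:
$$k^* = \frac{n}{r}, \qquad t_{total}^* = \frac{k^* t + d}{1-\beta}$$
1. Linear increase vs exponential increase
2. Optimal clustering factor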
![Page 13: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/13.jpg)
Fault Tolerant Clustering
• Job Failure Model: k = n/r
• Selective Reclustering (SR)
  – Select the failed tasks in a clustered job and cluster them into a new clustered job
  – It requires the identification of failed tasks
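Selective reclustering can be sketched as a retry loop in which only the failed tasks of a job are merged into the next job. The executor below is a hypothetical stand-in that fails each task independently with probability α; it is not the simulator used in the talk, and all names are illustrative.

```python
import random
from typing import List, Tuple


def run_job(tasks: List[str], alpha: float, rng: random.Random) -> List[str]:
    """Run one clustered job and return the tasks that failed (hypothetical executor)."""
    return [task for task in tasks if rng.random() < alpha]


def selective_reclustering(tasks: List[str], alpha: float, rng: random.Random,
                           max_rounds: int = 1000) -> Tuple[int, int]:
    """Retry only failed tasks, merged into a new clustered job each round.

    Returns (number of jobs submitted, number of task executions).
    """
    job, rounds, executed = list(tasks), 0, 0
    while job:
        if rounds >= max_rounds:
            raise RuntimeError("tasks still failing after max_rounds retries")
        rounds += 1
        executed += len(job)
        job = run_job(job, alpha, rng)   # failed tasks form the next clustered job
    return rounds, executed


tasks = [f"task_{i}" for i in range(10)]
print(selective_reclustering(tasks, alpha=0.3, rng=random.Random(42)))
```

Compared with retrying the whole job, tasks that already succeeded are never executed again, which is why the method depends on being able to identify the failed tasks.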
![Page 14: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/14.jpg)
Fault Tolerant Clustering
• Dynamic Clustering (DC)
  – Adjust the clustering factor dynamically according to the task failure rate
$$t_{total,DC}^* = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$
$$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
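The sketch below shows one way dynamic clustering could be wired up: keep running counts of executed and failed tasks, estimate α from them, and recompute the clustering factor. The class, its fallback to the largest allowed k when no failure has been seen, and the rounding to an integer are assumptions made for illustration, not details taken from the talk. The closed form is the positive root of t k² + d k + d/ln(1 − α) = 0, obtained by minimizing the task failure model runtime.

```python
import math


def optimal_k(t: float, d: float, alpha: float, k_max: int) -> int:
    """Clustering factor from the task failure model, clamped to [1, k_max]."""
    if alpha <= 0.0:
        return k_max                   # assumption: no observed failures, use largest k
    if alpha >= 1.0:
        return 1                       # every task fails: smallest possible jobs
    c = math.log(1.0 - alpha)
    k_star = (-d + math.sqrt(d * d - 4.0 * t * d / c)) / (2.0 * t)
    return max(1, min(k_max, round(k_star)))


class DynamicClustering:
    """Tracks observed task failures and derives the current clustering factor."""

    def __init__(self, t: float, d: float, k_max: int) -> None:
        self.t, self.d, self.k_max = t, d, k_max
        self.executed = 0
        self.failed = 0

    def record(self, executed: int, failed: int) -> None:
        self.executed += executed
        self.failed += failed

    def clustering_factor(self) -> int:
        alpha = self.failed / self.executed if self.executed else 0.0
        return optimal_k(self.t, self.d, alpha, self.k_max)


dc = DynamicClustering(t=5.0, d=5.0, k_max=50)
dc.record(executed=1000, failed=10)    # estimated alpha = 0.01
print(dc.clustering_factor())
```

Clamping to k_max = n/r keeps the result inside the n/k ≥ r branch of the model, where the closed form applies.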
![Page 15: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/15.jpg)
Fault Tolerant Clustering
• Dynamic Reclustering (DR)
  – A combination of SR and DC
![Page 16: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/16.jpg)
Evaluation
• Simulations are based on real traces from runs by the Pegasus group
• Each workflow was simulated 100 times so that the standard deviation is less than 10%
• Two workflows were used
• 20 worker nodes were used in each experiment
![Page 17: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/17.jpg)
Workflows Used
• Montage
  – An astronomy application used to construct large image mosaics of the sky
  – Montage has complex data dependencies between tasks
  – 10,422 tasks, 57 GB data
Image from http://montage.ipac.caltech.edu/
![Page 18: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/18.jpg)
Workflows Used
• Periodogram
  – Identifies periodic signals in light curves that arise from transiting planets
  – 216,600 tasks, 19 GB input data
  – Periodogram has only one level
Image from http://pegasus.isi.edu/presentations/2011/sci709-voeckler-talk.ppt/
![Page 19: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/19.jpg)
Simulator
• Extension to CloudSim
  – Workflow Engine
  – Clustering Engine
  – Scheduler
  – Failure Generator
  – Failure Monitor
![Page 20: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/20.jpg)
Performance
• NOOP: no optimization (k = n/r)
• DC (Dynamic Clustering)
• SR (Selective Reclustering)
• DR (Dynamic Reclustering)
• Overall runtime in seconds
![Page 21: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/21.jpg)
Performance
• Periodogram
![Page 22: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/22.jpg)
Performance
• Montage
![Page 23: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/23.jpg)
Task Specific Failure Detection (TSFD)
• Task failures are related to the type of task
• The Failure Monitor classifies failures based on the task type
• The Clustering Engine merges tasks based on the different task failure rates
• In this Montage experiment, the task failure rate of mProjectPP and mDiffFit is set to 0.001, while that of mBackground ranges from 0.2 to 0.8

Results by optimization method:

| α¹ | DR | DR+TSFD | DC | DC+TSFD |
|---|---|---|---|---|
| 0.2 | 10415 | 10412 | 13804 | 13820 |
| 0.4 | 11830 | 11839 | 22946 | 22923 |
| 0.6 | 14704 | 14688 | 60429 | 60414 |
| 0.8 | 23238 | 23229 | 436638 | 435297 |

¹ The task failure rate of mBackground only.
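The division of labour between the Failure Monitor and the Clustering Engine could look roughly like the sketch below: failures are counted per task type, and each type gets its own clustering factor from the task failure model. The class and function names, and the values t = 5 and d = 5, are hypothetical; only the task type names and failure rates come from the slide. The closed form is the positive root of t k² + d k + d/ln(1 − α) = 0.

```python
import math
from collections import defaultdict


def k_star(t: float, d: float, alpha: float) -> float:
    """Continuous optimum of the task failure model, valid for 0 < alpha < 1."""
    c = math.log(1.0 - alpha)
    return (-d + math.sqrt(d * d - 4.0 * t * d / c)) / (2.0 * t)


class FailureMonitor:
    """Counts executions and failures separately for each task type."""

    def __init__(self) -> None:
        self.executed = defaultdict(int)
        self.failed = defaultdict(int)

    def record(self, task_type: str, executed: int, failed: int) -> None:
        self.executed[task_type] += executed
        self.failed[task_type] += failed

    def rate(self, task_type: str) -> float:
        done = self.executed[task_type]
        return self.failed[task_type] / done if done else 0.0


monitor = FailureMonitor()
monitor.record("mProjectPP", executed=1000, failed=1)   # rate 0.001, as on the slide
monitor.record("mBackground", executed=1000, failed=200)  # rate 0.2, as on the slide

# Hypothetical task runtime t and delay d, identical for both types.
factors = {name: k_star(5.0, 5.0, monitor.rate(name)) for name in ("mProjectPP", "mBackground")}
print(factors)
```

A single global failure rate would force both task types onto the same k, whereas per-type rates let the reliable type keep large jobs.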
![Page 24: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/24.jpg)
Task Failure Model
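$$t_{total} = \begin{cases} \dfrac{N n (kt+d)}{rk} = \dfrac{n(kt+d)}{rk(1-\alpha)^k}, & \text{if } \dfrac{n}{k} \ge r \\[2ex] N(kt+d) = \dfrac{kt+d}{(1-\alpha)^k}, & \text{if } \dfrac{n}{k} < r \end{cases}$$
$$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
$$t_{total}^* = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$
• $t_{total}$ is not sensitive to α
• Simplification of failures is acceptable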
![Page 25: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/25.jpg)
Location Specific Failure Detection (LSFD)
• Task failures are related to the location of execution
• The Failure Monitor classifies failures based on resource id
• The Scheduler orders resources based on their reliability
• Two of the twenty nodes have a higher task failure rate (from 0.2 to 0.8), while the others keep a task failure rate of 0.001
• DC generates many small jobs if the task failure rate is high
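Ordering resources by reliability can be sketched as sorting per-resource failure statistics. The data class, its field names, and the sample counts below are hypothetical; the slide only states that failures are classified by resource id and that the scheduler orders resources by reliability.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceStats:
    """Per-resource counters as a failure monitor might keep them (hypothetical)."""
    resource_id: str
    executed: int = 0
    failed: int = 0

    @property
    def failure_rate(self) -> float:
        return self.failed / self.executed if self.executed else 0.0


def order_by_reliability(stats: List[ResourceStats]) -> List[ResourceStats]:
    """Most reliable resources first; ties keep their original order."""
    return sorted(stats, key=lambda s: s.failure_rate)


nodes = [
    ResourceStats("node-1", executed=500, failed=100),  # observed rate 0.2
    ResourceStats("node-2", executed=1000, failed=1),   # observed rate 0.001
    ResourceStats("node-3", executed=500, failed=400),  # observed rate 0.8
]
ordered = order_by_reliability(nodes)
print([s.resource_id for s in ordered])
```

A scheduler that fills this list from the front sends work to the unreliable nodes only when the reliable ones are busy.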
![Page 26: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/26.jpg)
Conclusion
• We present three basic methods to improve fault tolerance in task clustering
• If the system supports identification of failed tasks, dynamic reclustering performs best
• Otherwise, use dynamic clustering
• The improvement is significant even for very basic methods
![Page 27: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/27.jpg)
Future Work
• Vertical Clustering and Arbitrary Clustering
• Intelligent Scheduler
• More Workflow Examples
• Distribution of Failures
![Page 28: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/28.jpg)
Questions?
• Thank you for coming!
• For further info, please visit pegasus.isi.edu or email [email protected]
![Page 29: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/29.jpg)
Refinements
• When n >> r does not hold at the end of execution:
  – Default: $k_{actual} = k^*$, $n_{jobs} = \dfrac{n_{task}}{k} < r$
  – Replicative: $k_{actual} = k^*$, $n_{jobs} = r$; replicate jobs by $\dfrac{r}{n_{task}/k}$
  – Even: $k_{actual} = \dfrac{n_{task}}{r}$, $n_{jobs} = r$
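The three tail strategies can be compared with a small planning function, sketched below for the case where fewer than r jobs of size k* remain. The function name, the returned dictionary, and the rounding choices (ceiling for job counts, floor for the replication factor) are assumptions made for this illustration; the slide gives only the ratios.

```python
import math
from typing import Dict


def plan_tail(n_task: int, r: int, k_star: int, strategy: str) -> Dict[str, int]:
    """Plan the last jobs when n_task / k_star < r (hypothetical helper)."""
    n_distinct = math.ceil(n_task / k_star)        # jobs needed at size k*
    if strategy == "default":
        # Keep k*; some resources stay idle because n_distinct < r.
        return {"k_actual": k_star, "n_jobs": n_distinct, "replicas": 1}
    if strategy == "replicative":
        # Keep k*; fill idle resources with copies of each job.
        replicas = max(1, r // n_distinct)         # about r / (n_task / k)
        return {"k_actual": k_star, "n_jobs": n_distinct * replicas, "replicas": replicas}
    if strategy == "even":
        # Shrink the jobs so the remaining tasks spread over all r resources.
        k_actual = math.ceil(n_task / r)
        return {"k_actual": k_actual, "n_jobs": math.ceil(n_task / k_actual), "replicas": 1}
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("default", "replicative", "even"):
    print(name, plan_tail(n_task=40, r=20, k_star=10, strategy=name))
```

With 40 tasks left, 20 resources, and k* = 10, the default plan uses only 4 resources, the replicative plan runs 5 copies of each of the 4 jobs, and the even plan submits 20 jobs of 2 tasks each.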
![Page 30: Fault Tolerant Clustering (IEEE Services 2012)](https://reader033.fdocuments.in/reader033/viewer/2022052905/5586559ad8b42a221b8b4699/html5/thumbnails/30.jpg)
Dynamic Performance
• TFM and DC