Fault Tolerant Clustering (IEEE Services 2012)

Post on 21-Jun-2015

106 views 1 download

Tags:

description

fault tolerant clustering for workflows

Transcript of Fault Tolerant Clustering (IEEE Services 2012)

Fault  Tolerant  Clustering  in  Scien2fic  Workflows  

Weiwei  Chen,  Ewa  Deelman  Informa2on  Sciences  Ins2tute  

University  of  Southern  California  

1  

Outline  

•  Introduc2on  •  Workflow  and  Failure  Model  •  Fault  Tolerant  Clustering  •  Experiments  •  Task  Specific  Failures  •  Loca2on  Specific  Failures  

2  

Introduc2on    •  Task  based  Scien2fic  Workflows  –  Task  –  Job  

•  Task  Clustering    – Merges  mul2ple  small  tasks  into  a  job    –  Reduce  scheduling  and  submit  overhead  

•  Fault  Tolerance  in  Task  Clustering  –  Exis2ng  techniques  underes2mate  or  ignore  the  influences  of  failures  

3  

Task  Clustering    •  Task  Clustering    – Horizontal  Clustering  – Ver2cal  Clustering  – Arbitrary  Clustering  

Clustering  Factor  (k):  number  of  tasks  in  a  job  4  

System  Overview    

5  

Timeline  

with  clustering  

without  clustering  

Improvement  

scheduling  and  submit  delay  

Task  Failures  and  Job  Failures  •  We  only  focus  on  Transient  Failure  and  Job  Retry  •  We  don’t  differen2ate  the  causes  of  failures  but  we  concern  about  the  average  failure  rate.    

•  Assump2on:  a  failure  is  a  random  event  independent  of  workflow  characteris2cs  or  execu2on  environment    

•  Two  Categories  o  Task  Failure:  a  task  fails,  other  

tasks  in  the  same  job  may  not  fail  §  E.g.  Applica2on  

  o  Job  Failure:  a  job  fails,  all  of  its  tasks  fail  §  E.g.  Scheduling  System  

 

6  

Influence  of  Failures  on  Clustering  ttotal   Es2mated  Overall  Run2me  

n   Number  of  tasks  to  run  

t   Run2me  of  a  single  task  

r   Number  of  available  resources  

d   Time  delay  between  jobs  

N   Expected  retry  2mes  for  a  single  task  

k   Number  of  tasks  in  a  job  

β   Job  failure  rate  

α   Task  failure  rate  Target  Func2on:  Min  (ttotal)  

given  n  tasks  to  run  on  r  resources  task  failure  rate  (α)  is  measurable  (Task  Failure  Model)  or  job  failure  rate  (β)  is  measurable  (Job  Failure  Model)  

 Assump2on:  n  >>  r,  but  n/k  >>  r    7  

Job  Failure  Model  

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−β)

, if nk≥ r

N(kt + d) = (kt + d)1−β

, if nk< r

#

$

%%

&

%%

8  

ttotal   Es2mated  Overall  Run2me  

n   Number  of  tasks  to  run  

t   Run2me  of  a  single  task  

r   Number  of  available  resources  

d   Time  delay  between  jobs  

N   Expected  retry  2mes  for  a  single  task  

k   Number  of  tasks  in  a  job  

β   Job  failure  rate  

α   Task  failure  rate  

N job =1

(1−β)

Ntotal =

N jobnrk

if nk≥ r

N job, if nk< r

"

#

$$

%

$$

t job = kt + d

ttotal = t jobNtotal

Run2me  for  a  single  job  

Avg  retry  2me  for  a  single  job  

Retry  2me  for  all  jobs  

Overall  run2me  

Job  Failure  Model  

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−β)

, if nk≥ r

N(kt + d) = (kt + d)1−β

, if nk< r

#

$

%%

&

%%

k* = nr

ttotal* =(kt + d)1−β

It’s  not  necessary  to  adjust  k.  Just  set  it  to  be  

9  

n=1000,  t=5  sec,  d=5  sec,  r=20  

k*  is  independent  of  β    

Task  Failure  Model  

10  

ttotal   Es2mated  Overall  Run2me  

n   Number  of  tasks  to  run  

t   Run2me  of  a  single  task  

r   Number  of  available  resources  

d   Time  delay  between  jobs  

N   Expected  retry  2mes  for  a  single  task  

k   Number  of  tasks  in  a  job  

β   Job  failure  rate  

α   Task  failure  rate  

N job =1

(1−α)k

Ntotal =

N jobnrk

if nk≥ r

N job, if nk< r

"

#

$$

%

$$

t job = kt + d

ttotal = t jobNtotal

Run2me  for  a  single  job  

Avg  retry  2me  for  a  single  job  

Retry  2me  for  all  jobs  

Overall  run2me  

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−α)k

, if nk≥ r

N(kt + d) = (kt + d)(1−α)k

, if nk< r

#

$

%%

&

%%

Task  Failure  Model  

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−α)k

, if nk≥ r

N(kt + d) = (kt + d)(1−α)k

, if nk< r

#

$

%%

&

%%

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

ttotal* =

n(k*t + d)rk(1−α)k

*

k*  is  dependent  of  α    

It’s  necessary  to  adjust  k  according  to  α  

11  

Comparing  TFM  and  JFM  

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

ttotal* =

n(k*t + d)rk(1−α)k

*

ttotal* =(kt + d)1−β

k* = nr

1.  Linear  increase  vs  exponen2al  increase  2.  Op2mal  clustering  factor  

12  

Fault  Tolerant  Clustering  •  Job  Failure  Model:  k=n/r  •  Selec2ve  Reclustering  (SR)  – select  the  failed  tasks  in  a  clustered  job  and  cluster  them  into  a  new  clustered  job    

–  It  requires  the  iden2fica2on  of  failed  tasks.  

13  

Fault  Tolerant  Clustering  

•  Dynamic  Clustering  (DC)  – adjust  the  clustering  factor  according  to  the  task  failure  rates  dynamically  

ttotal,DC* =

n(k*t + d)rk*(1−α)k

*

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

14  

Fault  Tolerant  Clustering  

•  Dynamic  Reclustering  (DR)  – A  combina2on  of  SR  and  DC  

15  

Evalua2on  

•  Run  simula2ons  based  on  the  real  traces  that  were  run  by  the  Pegasus  group.    

•  Each  workflow  was  simulated  100  2mes  so  that  the  standard  devia2on  is  less  than  10%  

•  Two  workflows  were  used.    •  20  worker  nodes  were  used  in  each  experiment.    

16  

Workflows  Used  •  Montage  – An  astronomy  applica2on  used  to  construct   large  image  mosaics  of  the  sky.    

– Montage   has   complex   data   dependencies  between  tasks    

– 10,422  tasks,  57GB  data.    

17  Image  from  hhp://montage.ipac.caltech.edu/  

Workflows  Used  •  Periodogram  –  Iden2fy   periodic   signals   from   light   curves   that  arise  from  transi2ng  planets.    

– 216,600  tasks,  19GB  input  data.    – Periodogram  has  only  one  level  

18  Image  from  hhp://pegasus.isi.edu/presenta2ons/2011/sci709-­‐voeckler-­‐talk.ppt/  

Simulator  

•  Extension  to  CloudSim  – Workflow  Engine  – Clustering  Engine  – Scheduler  – Failure  Generator  – Failure  Monitor  

19  

Performance  •  NOOP:  no  op2miza2on,  (k=n/r)  •  DC  (Dynamic  Clustering)    •  SR  (Selec2ve  Reclustering)  •  DR  (  Dynamic  Reclustering)  •  Overall  Run2me  in  seconds  

20  

Performance  

•  Periodogram  

21  

Performance  

•  Montage  

22  

Task  Specific  Failure  Detec2on  (TSFD)  •  Task  Failures  are  related  to  the  type  of  tasks  •  Failure  Monitor  classifies  failures  based  on  the  type    •  Clustering   Engine   merges   tasks   based   on   different   task  

failure  rates  •  In   this   experiment   of  Montage,   we   set   the   task   failure  

rate   of   mProjectPP   and   mDiffFit   to   be   0.001   while  mBackground  ranges  from  0.2  to  0.8.    

α1 Optimization Methods

DR DR+TSFD DC DC+TSFD

0.2 10415 10412 13804 13820

0.4 11830 11839 22946 22923

0.6 14704 14688 60429 60414

0.8 23238 23229 436638 435297

1 The task failure rate of mBackground only!

23  

Task  Failure  Model  

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−α)k

, if nk≥ r

N(kt + d) = (kt + d)(1−α)k

, if nk< r

#

$

%%

&

%%

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

ttotal* =

n(k*t + d)rk(1−α)k

*

ttotal  is  not  sensi2ve  to  α    

24  Simplifica2on  of  failures  is  acceptable    

Loca2on  Specific  Failure  Detec2on  (LSFD)  •  Task  Failures  are  related  to  the  loca2on  of  execu2on  •  Failure  Monitor  classifies   failures  based  on  resource  id  

•  Scheduler  orders  resources  based  on  their  reliability.  •  Two  out  of   twenty  nodes  have  a  higher   task   failure  rates   (from  0.2   to  0.8)  while  others   s2ll  have  a   task  failure  rate  of  0.001.    

25  

DC  generates  many  small  tasks  if  task  failure  rate  is  high  

Conclusion  

•  We  present  three  basic  methods  to  improve  fault  tolerance  in  task  clustering  

•  If  the  system  supports  iden2fica2on  of  failed  tasks,  dynamic  reclustering  performs  best  

•  Otherwise,  use  dynamic  clustering  •  Improvement  is  significant  even  for  very  basic  method  

26  

Future  Work  

•  Ver2cal  Clustering  and  Arbitrary  Clustering  •  Intelligent  Scheduler  •  More  Workflow  Examples  •  Distribu2on  of  Failures  

27  

Ques2ons?  

•  Thank  you  for  coming!  •  For  further  info,  please  visit:  pegasus.isi.edu  or  email  wchen@isi.edu  

28  

Refinements  •  When  n>>r  does  not  hold  in  the  end  of  execu2on  

•  Default:    •  Replica2ve:                                                    replicate  jobs  by  •  Even:    

kactual = k* njobs =

ntaskk

< r

kactual = k* njobs = r

rntask / k

kactual =ntaskr

njobs = r

29  

Dynamic  Performance  

•  TFM  and  DC  

30