Fault Tolerant Clustering (IEEE Services 2012)

Fault Tolerant Clustering in Scien2fic Workflows

Weiwei Chen, Ewa Deelman Informa2on Sciences Ins2tute

University of Southern California

Outline

•  Introduc2on •  Workflow and Failure Model •  Fault Tolerant Clustering •  Experiments •  Task Specific Failures •  Loca2on Specific Failures

Introduc2on •  Task based Scien2fic Workflows –  Task –  Job

•  Task Clustering – Merges mul2ple small tasks into a job –  Reduce scheduling and submit overhead

•  Fault Tolerance in Task Clustering –  Exis2ng techniques underes2mate or ignore the influences of failures

Task Clustering •  Task Clustering – Horizontal Clustering – Ver2cal Clustering – Arbitrary Clustering

Clustering Factor (k): number of tasks in a job 4

System Overview

Timeline

with clustering

without clustering

Improvement

scheduling and submit delay

Task Failures and Job Failures •  We only focus on Transient Failure and Job Retry •  We don’t differen2ate the causes of failures but we concern about the average failure rate.

•  Assump2on: a failure is a random event independent of workflow characteris2cs or execu2on environment

•  Two Categories o  Task Failure: a task fails, other

tasks in the same job may not fail §  E.g. Applica2on

o  Job Failure: a job fails, all of its tasks fail §  E.g. Scheduling System

Influence of Failures on Clustering ttotal Es2mated Overall Run2me

n Number of tasks to run

t Run2me of a single task

r Number of available resources

d Time delay between jobs

N Expected retry 2mes for a single task

k Number of tasks in a job

β Job failure rate

α Task failure rate Target Func2on: Min (ttotal)

given n tasks to run on r resources task failure rate (α) is measurable (Task Failure Model) or job failure rate (β) is measurable (Job Failure Model)

Assump2on: n >> r, but n/k >> r 7

Job Failure Model

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−β)

, if nk≥ r

N(kt + d) = (kt + d)1−β

, if nk< r

ttotal Es2mated Overall Run2me

β Job failure rate

α Task failure rate

N job =1

(1−β)

Ntotal =

N jobnrk

if nk≥ r

N job, if nk< r

t job = kt + d

ttotal = t jobNtotal

Run2me for a single job

Avg retry 2me for a single job

Retry 2me for all jobs

Overall run2me

Job Failure Model

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−β)

, if nk≥ r

N(kt + d) = (kt + d)1−β

, if nk< r

k* = nr

ttotal* =(kt + d)1−β

It’s not necessary to adjust k. Just set it to be

n=1000, t=5 sec, d=5 sec, r=20

k* is independent of β

Task Failure Model

ttotal Es2mated Overall Run2me

β Job failure rate

α Task failure rate

N job =1

(1−α)k

Ntotal =

N jobnrk

if nk≥ r

N job, if nk< r

t job = kt + d

ttotal = t jobNtotal

Run2me for a single job

Avg retry 2me for a single job

Retry 2me for all jobs

Overall run2me

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−α)k

, if nk≥ r

N(kt + d) = (kt + d)(1−α)k

, if nk< r

Task Failure Model

ttotal =

Nn(kt + d)rk

, if nk≥ r

N(kt + d) = (kt + d)(1−α)k

, if nk< r

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

ttotal* =

n(k*t + d)rk(1−α)k

k* is dependent of α

It’s necessary to adjust k according to α

Comparing TFM and JFM

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

ttotal* =

ttotal* =(kt + d)1−β

k* = nr

1. Linear increase vs exponen2al increase 2. Op2mal clustering factor

Fault Tolerant Clustering •  Job Failure Model: k=n/r •  Selec2ve Reclustering (SR) – select the failed tasks in a clustered job and cluster them into a new clustered job

–  It requires the iden2fica2on of failed tasks.

Fault Tolerant Clustering

•  Dynamic Clustering (DC) – adjust the clustering factor according to the task failure rates dynamically

ttotal,DC* =

n(k*t + d)rk*(1−α)k

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

Fault Tolerant Clustering

•  Dynamic Reclustering (DR) – A combina2on of SR and DC

Evalua2on

•  Run simula2ons based on the real traces that were run by the Pegasus group.

•  Each workflow was simulated 100 2mes so that the standard devia2on is less than 10%

•  Two workflows were used. •  20 worker nodes were used in each experiment.

Workflows Used •  Montage – An astronomy applica2on used to construct large image mosaics of the sky.

– Montage has complex data dependencies between tasks

– 10,422 tasks, 57GB data.

17 Image from hhp://montage.ipac.caltech.edu/

Workflows Used •  Periodogram –  Iden2fy periodic signals from light curves that arise from transi2ng planets.

– 216,600 tasks, 19GB input data. – Periodogram has only one level

18 Image from hhp://pegasus.isi.edu/presenta2ons/2011/sci709-‐voeckler-‐talk.ppt/

Simulator

•  Extension to CloudSim – Workflow Engine – Clustering Engine – Scheduler – Failure Generator – Failure Monitor

Performance •  NOOP: no op2miza2on, (k=n/r) •  DC (Dynamic Clustering) •  SR (Selec2ve Reclustering) •  DR ( Dynamic Reclustering) •  Overall Run2me in seconds

Performance

•  Periodogram

Performance

•  Montage

Task Specific Failure Detec2on (TSFD) •  Task Failures are related to the type of tasks •  Failure Monitor classifies failures based on the type •  Clustering Engine merges tasks based on different task

failure rates •  In this experiment of Montage, we set the task failure

rate of mProjectPP and mDiffFit to be 0.001 while mBackground ranges from 0.2 to 0.8.

α1 Optimization Methods

DR DR+TSFD DC DC+TSFD

0.2 10415 10412 13804 13820

0.4 11830 11839 22946 22923

0.6 14704 14688 60429 60414

0.8 23238 23229 436638 435297

1 The task failure rate of mBackground only!

Task Failure Model

ttotal =

Nn(kt + d)rk

, if nk≥ r

N(kt + d) = (kt + d)(1−α)k

, if nk< r

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

ttotal* =

ttotal is not sensi2ve to α

24 Simplifica2on of failures is acceptable

Loca2on Specific Failure Detec2on (LSFD) •  Task Failures are related to the loca2on of execu2on •  Failure Monitor classifies failures based on resource id

•  Scheduler orders resources based on their reliability. •  Two out of twenty nodes have a higher task failure rates (from 0.2 to 0.8) while others s2ll have a task failure rate of 0.001.

DC generates many small tasks if task failure rate is high

Conclusion

•  We present three basic methods to improve fault tolerance in task clustering

•  If the system supports iden2fica2on of failed tasks, dynamic reclustering performs best

•  Otherwise, use dynamic clustering •  Improvement is significant even for very basic method

Future Work

•  Ver2cal Clustering and Arbitrary Clustering •  Intelligent Scheduler •  More Workflow Examples •  Distribu2on of Failures

Ques2ons?

•  Thank you for coming! •  For further info, please visit: pegasus.isi.edu or email wchen@isi.edu

Refinements •  When n>>r does not hold in the end of execu2on

•  Default: •  Replica2ve: replicate jobs by •  Even:

kactual = k* njobs =

ntaskk

kactual = k* njobs = r

rntask / k

kactual =ntaskr

njobs = r

Dynamic Performance

•  TFM and DC

Fault Tolerant Clustering (IEEE Services 2012)

Technology

Transcript of Fault Tolerant Clustering (IEEE Services 2012)

Technical Roadmap for Fault-Tolerant Quantum Computing · Technical Roadmap for Fault-Tolerant Quantum Computing October 2016 ... of the fault-tolerant quantum computing technology,

FAULT DETECTION AND FAULT TOLERANT APPROACHES … · FAULT DETECTION AND FAULT TOLERANT APPROACHES WITH AIRCRAFT APPLICATION ... Fault Detection and Identification: ... G., "Application

FAULT MODEL DEVELOPMENT FOR FAULT TOLERANT VLSI …

Fault Tolerant Ethernet (FTE)

Fault Tolerant ES Subramanian

Building Fault-Tolerant Applications on AWSd0.awsstatic.com/whitepapers/aws-building-fault-tolerant-application… · Amazon Web Services – Building Fault-Tolerant Applications

TRICON Fault Tolerant Systems

Fault-Tolerant Broadcast

Synchrony and Time in Fault-Tolerant Distributed Algorithmspub.ist.ac.at/formats2010/schmid-formats.pdfin Fault-Tolerant Distributed Algorithms ... FORMATS‘10 Tutorial. Target: Fault-tolerant

Advanced Design Scheme for Fault Tolerant Distributed ... · Advanced design scheme for fault tolerant distributed networked control systems B ... coordinated and ... for Fault Tolerant

Building Fault-Tolerant Applications on AWSd36cz9buwru1tt.cloudfront.net/AWS_Building_Fault_Tolerant_Applications.pdf · Amazon Web Services – Building Fault-Tolerant Applications

Fault Tolerant Virtualization

Fault Tolerant Configuration

Distributed Fault Tolerant Controllers

X10 Using ULFM for implementing Fault Tolerant Applications · Using ULFM for implementing Fault Tolerant Applications Fault Tolerant PDE Applications Md Mohsin Ali1 (md.ali@anu.edu.au)

Fault Tolerant Server

AWS fault tolerant architecture

Fault tolerant systems

Fault-tolerant Control Systems

Distributed Transactional Memory for Fault -Tolerant … · Distributed Transactional Memory for Fault -Tolerant ... Library Support for Fault-Tolerant ... class stores the set of