Towards an Efficient Fault-Tolerance Scheme for...

GLB Fault Tolerance Scheme Experimental Results

Towards an Efficient Fault-Tolerance

Scheme for GLB

Claudia Fohry, Marco Bungart and Jonas Posner

Programming Languages / Methodologies

June 14, 2015

1 / 18

Global Load Balancing

1 Global LoadBalancing

2 Fault ToleranceScheme

3 ExperimentalResults

2 / 18

Worker-local Pools

Examples:

UTS: counting nodes in an unbalanced tree

BC: calculate a property of each node in a graph

3 / 18

Task pool framework for inter-place load balancing

Utilizes cooperative work stealing

Tasks are free of side effects and can spawn new task atexecution time

Final result computed by reduction

Only one worker per place

Worker-private pool

4 / 18

GLB’s main processing loop

while (process(n)) {

Runtime.probe();

distribute();

reject();

} while (steal());

5 / 18

Fault Tolerance Scheme

6 / 18

Conceptual Ideas

One backup-place per place (cyclic)

Write backup periodically and when necessary (stealing)

Exploit stealing-induces redundancy

Write incremental backups whenever possible

Each information at exactly two places

7 / 18

Incremental Backup of stable Tasks

st-1... ...

snap snap snapt-1

backup

mint-2

mint-1

8 / 18

Actor Scheme

No blocking constructs (except one outer finish)

split and merge have to operate on the bottom of theTask Pool

Actor Scheme

Worker is passive entity (only processing tasks)Worker becomes active when a message is receivedTwo kinds of messages:

executed directly orstored and processed later

→ Worker stays responsive

9 / 18

Stealing Protocol

trySteal

steal-backup

VBack(F) Back(V)

● continue processing non-stolen tasks

● record stolen tasks in Open(F)

● valid = false● update

backup

link to V

save linkinsert + process

FendBVend

STLack

valid = true

At next backup of F:

non-incremental

● delete Open(F)

delOpen

● update backup

● delete link to V

10 / 18

Asynchronism

11 / 18

Asynchronism with Fault-Tolerance

12 / 18

Detection of dead Places

Cannot use DeadPlaceExceptions

Check relevant places regularly via isDead(), as well asthe own backup place

What if a place P is inactive?

Does not check its backup-place for lifenessBut its predecessor Forth(P) does check P

If P is active, it checks lifeness of Back(P)Recursive process

13 / 18

Experimental Results

14 / 18

Experiments were conductet on an Infiniband-connectedCluster

One place per node

Up to 128 Nodes

Configuration:

small UTS: -d=13large UTS: -d=17

15 / 18

UTS, small

0 10 20 30 40 50 60

Places

GLBFTGLB

FTGLB-Incremental

16 / 18

UTS, small

0 10 20 30 40 50 60

Places

GLBFTGLB

FTGLB-Incremental

17 / 18

Thank you for your attention!

Please feel free to ask questions.

18 / 18

Towards an Efficient Fault-Tolerance Scheme for...

Documents

Transcript of Towards an Efficient Fault-Tolerance Scheme for...

Report on the Language X10x10.sourceforge.net/documentation/languagespec/x10-203.pdfReport on the Programming Language X10 Version 2.0.3 DRAFT — April 16, 2010 Vijay Saraswat Please

X10 Tutorial x10.sf

konicaminolta - SourceForge.net

X10 Language Specificationx10.sourceforge.net/documentation/languagespec/x10-262.pdf · X10 Language Speciﬁcation Version 2.6.2 Vijay Saraswat, Bard Bloom, Igor Peshansky, Olivier

Performance Optimization in X10 Graph Processing Libraryx10.sourceforge.net/documentation/presentations/X...Parallel Serialization for Collective Communication 10 Message communication

Introduction to X10x10.sourceforge.net/documentation/papers/X10... · reified generics templates; constrained kinds 11 gradual typing . Tool Chain 12 . Tool Chain ! Eclipse Public

Resilient X10 Reviewx10.sourceforge.net/.../ppopp_2014_resilient_talk.pdf · • If lost data is unmodified: reload it • If data is mutated: checkpoint it • Libraries can hide,

INTRODUCING SCALEGRAPH : AN X10 LIBRARY …x10.sourceforge.net/documentation/papers/X10Workshop2012/...INTRODUCING SCALEGRAPH : AN X10 LIBRARY FOR BILLION SCALE GRAPH ANALYTICS Miyuru

X10 Language Specificationx10.sourceforge.net/documentation/languagespec/x10-latest.pdf · X10 Language Speciﬁcation Version 2.6.1 Vijay Saraswat, Bard Bloom, Igor Peshansky, Olivier

X10: An Object-Oriented Approach to Non-Uniform …x10.sourceforge.net/documentation/papers/oopsla05-final.pdf · X10: An Object-Oriented Approach to Non-Uniform Cluster Computing

APGAS Programming in X10x10.sourceforge.net/tutorials/x10-2.4/Hartree-July-2013/x10-hartree-slides-july2013.pdfForeword X10 is ! A language ! Scala-like syntax ! object-oriented, imperative,

X10 Language Specificationx10.sourceforge.net/documentation/languagespec/x10-230.pdf · X10 Language Speciﬁcation Version 2.3 Vijay Saraswat, Bard Bloom, Igor Peshansky, Olivier

Performance Analysis of Lattice QCD in X10 CUDAx10.sourceforge.net/documentation/presentations/X10DayTokyo2015/x...Performance Analysis of Lattice QCD in X10 CUDA ... – Dominated

Report on the Language X10x10.sourceforge.net › documentation › languagespec › x10-201.pdf · Report on the Programming Language X10 Version 2.01 DRAFT — January 13, 2010

APGAS Programming in X10x10.sourceforge.net/tutorials/x10-2.4/...slides-V7.pdf · Design for scale ... enable full utilization of HPC hardware capabilities 7. X10 Tool Chain Open-source

X10x10.sourceforge.net › documentation › languagespec › x10-176.pdf · Report on the Experimental Language X10 Version 1.7.6 Vijay Saraswat Nathaniel Nystrom Please send comments

Developing Scalable Parallel Applications in X10x10.sourceforge.net/tutorials/x10-2.2/SC_2012/SC12_X10_Tutorial.pdf · Developing Scalable Parallel Applications in X10 David Grove,

Report on the Language X10x10.sourceforge.net/documentation/languagespec/x10-206.pdf · Report on the Programming Language X10 Version 2.0.6 DRAFT — August 28, ... Patrick Gallop,

X10 Language Specificationx10.sourceforge.net/documentation/languagespec/x10-260.pdf · X10 Language Speciﬁcation Version 2.6 Vijay Saraswat, Bard Bloom, Igor Peshansky, Olivier

Boost Sandbox download | SourceForge.net