Towards an Efficient Fault-Tolerance Scheme for...

18
GLB Fault Tolerance Scheme Experimental Results Towards an Efficient Fault-Tolerance Scheme for GLB Claudia Fohry, Marco Bungart and Jonas Posner Programming Languages / Methodologies June 14, 2015 1 / 18

Transcript of Towards an Efficient Fault-Tolerance Scheme for...

Page 1: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

Towards an Efficient Fault-Tolerance

Scheme for GLB

Claudia Fohry, Marco Bungart and Jonas Posner

Programming Languages / Methodologies

June 14, 2015

1 / 18

Page 2: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

Global Load Balancing

1 Global LoadBalancing

2 Fault ToleranceScheme

3 ExperimentalResults

2 / 18

Page 3: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

Worker-local Pools

Examples:

UTS: counting nodes in an unbalanced tree

BC: calculate a property of each node in a graph

3 / 18

Page 4: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

GLB

Task pool framework for inter-place load balancing

Utilizes cooperative work stealing

Tasks are free of side effects and can spawn new task atexecution time

Final result computed by reduction

Only one worker per place

Worker-private pool

4 / 18

Page 5: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

GLB’s main processing loop

do {

while (process(n)) {

Runtime.probe();

distribute();

reject();

}

} while (steal());

5 / 18

Page 6: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

Fault Tolerance Scheme

1 Global LoadBalancing

2 Fault ToleranceScheme

3 ExperimentalResults

6 / 18

Page 7: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

Conceptual Ideas

One backup-place per place (cyclic)

Write backup periodically and when necessary (stealing)

Exploit stealing-induces redundancy

Write incremental backups whenever possible

Each information at exactly two places

7 / 18

Page 8: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

Incremental Backup of stable Tasks

s

AR

R

...

t-1

...

RA

s

RA

s...

RA

st-1... ...

R

s

A

snap snap snapt-1

backup

t

mint-2

mint-1

st-2

send

st-2

8 / 18

Page 9: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

Actor Scheme

No blocking constructs (except one outer finish)

split and merge have to operate on the bottom of theTask Pool

Actor Scheme

Worker is passive entity (only processing tasks)Worker becomes active when a message is receivedTwo kinds of messages:

executed directly orstored and processed later

→ Worker stays responsive

9 / 18

Page 10: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

Stealing Protocol

F

trySteal

steal-backup

VBack(F) Back(V)

● continue processing non-stolen tasks

● record stolen tasks in Open(F)

● valid = false● update

backup

link to V

save linkinsert + process

FendBVend

STLack

BFack

give

valid = true

At next backup of F:

non-incremental

● delete Open(F)

delOpen

XYack

● update backup

● delete link to V

V1

V2

V3

10 / 18

Page 11: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

Asynchronism

11 / 18

Page 12: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

Asynchronism with Fault-Tolerance

12 / 18

Page 13: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

Detection of dead Places

Cannot use DeadPlaceExceptions

Check relevant places regularly via isDead(), as well asthe own backup place

What if a place P is inactive?

Does not check its backup-place for lifenessBut its predecessor Forth(P) does check P

If P is active, it checks lifeness of Back(P)Recursive process

13 / 18

Page 14: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

Experimental Results

1 Global LoadBalancing

2 Fault ToleranceScheme

3 ExperimentalResults

14 / 18

Page 15: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

Setup

Experiments were conductet on an Infiniband-connectedCluster

One place per node

Up to 128 Nodes

Configuration:

small UTS: -d=13large UTS: -d=17

15 / 18

Page 16: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

UTS, small

0

10

20

30

40

50

60

0 10 20 30 40 50 60

Tim

e(s

econ

ds)

Places

GLBFTGLB

FTGLB-Incremental

16 / 18

Page 17: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

UTS, small

0

500

1000

1500

2000

0 10 20 30 40 50 60

Tim

e(s

econ

ds)

Places

GLBFTGLB

FTGLB-Incremental

17 / 18

Page 18: Towards an Efficient Fault-Tolerance Scheme for GLBx10.sourceforge.net/documentation/papers/X10... · Thank you for your attention! Please feel free to ask questions. 18/18. Title:

GLB Fault Tolerance Scheme Experimental Results

Thank you for your attention!

Please feel free to ask questions.

18 / 18