GLB Fault Tolerance Scheme Experimental Results
Towards an Efficient Fault-Tolerance Scheme for GLB
Claudia Fohry, Marco Bungart and Jonas Posner
Programming Languages / Methodologies
June 14, 2015
Global Load Balancing
1 Global Load Balancing
2 Fault Tolerance Scheme
3 Experimental Results
Worker-local Pools
Examples:
UTS (Unbalanced Tree Search): counting the nodes of an unbalanced tree
BC (Betweenness Centrality): calculating a property of each node in a graph
GLB
Task pool framework for inter-place load balancing
Utilizes cooperative work stealing
Tasks are free of side effects and can spawn new tasks at execution time
Final result computed by reduction
Only one worker per place
Worker-private pool
GLB’s main processing loop
do {
    while (process(n)) {  // process up to n tasks from the local pool
        Runtime.probe();  // poll the network for incoming messages
        distribute();     // answer pending steal requests with loot
        reject();         // turn down requests that cannot be served
    }
} while (steal());        // pool empty: try to steal; finish when stealing fails
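The loop above is X10 from the GLB library. As an illustration only, here is a minimal single-threaded Java sketch of the same control flow; all names are hypothetical, the message handling is simulated with a local mailbox, and `distribute()`/`reject()` are omitted:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of GLB's processing loop: run a chunk of n tasks, then
// poll for incoming messages; when the local pool runs dry, try to steal.
class Worker {
    final Deque<Runnable> pool = new ArrayDeque<>();
    final Deque<Runnable> mailbox = new ArrayDeque<>(); // stands in for Runtime.probe()
    Worker victim;                                      // single steal target (simplified)
    long processed = 0;

    void run(int n) {
        do {
            while (process(n)) {
                probe(); // handle incoming messages between chunks
            }
        } while (steal());
    }

    boolean process(int n) { // run up to n tasks; false when the pool is empty
        for (int i = 0; i < n && !pool.isEmpty(); i++) {
            pool.poll().run();
            processed++;
        }
        return !pool.isEmpty();
    }

    void probe() { while (!mailbox.isEmpty()) mailbox.poll().run(); }

    boolean steal() { // move half of the victim's tasks into the own pool
        if (victim == null || victim.pool.isEmpty()) return false;
        int half = (victim.pool.size() + 1) / 2;
        for (int i = 0; i < half; i++) pool.add(victim.pool.poll());
        return true;
    }
}

public class LoopSketch {
    public static void main(String[] args) {
        Worker w = new Worker(), v = new Worker();
        w.victim = v;
        for (int i = 0; i < 10; i++) v.pool.add(() -> {}); // 10 tasks at the victim
        for (int i = 0; i < 5; i++)  w.pool.add(() -> {}); // 5 local tasks
        w.run(4);
        System.out.println(w.processed); // all own and stolen tasks processed
    }
}
```

The outer do/while makes the worker alternate between draining its own pool and stealing, exactly the shape of the X10 loop.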
Fault Tolerance Scheme
1 Global Load Balancing
2 Fault Tolerance Scheme
3 Experimental Results
Conceptual Ideas
One backup place per place (cyclic)
Write backup periodically and when necessary (stealing)
Exploit stealing-induced redundancy
Write incremental backups whenever possible
Each piece of information is kept at exactly two places
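The cyclic backup mapping can be sketched in a few lines of Java; this assumes the ring runs in increasing place order, which the slide does not specify:

```java
// Sketch of the cyclic backup mapping: assuming the ring runs in increasing
// place order, place p writes its backups to Back(p) = (p + 1) mod n, and
// Forth(p) = (p - 1) mod n is the predecessor that backs up onto p. Every
// pool state then lives at exactly two places: the owner and Back(owner).
public class BackupRing {
    static int back(int p, int n)  { return (p + 1) % n; }
    static int forth(int p, int n) { return (p - 1 + n) % n; }

    public static void main(String[] args) {
        int n = 4;
        System.out.println(back(3, n));           // wraps around to place 0
        System.out.println(back(forth(2, n), n)); // Back(Forth(p)) == p
    }
}
```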
Incremental Backup of stable Tasks
[Diagram: successive pool snapshots snap_{t-1}, snap_t with fill-level minima min_{t-2}, min_{t-1}; at backup t only the stable increment s_{t-2} is sent to the backup place.]
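The core idea of the diagram is that a backup need not re-send the whole pool. A toy Java sketch, under the simplifying assumption that the stable bottom segment only grows between snapshots (GLB's actual pool layout and min-tracking differ):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of an incremental backup: each backup transfers only the stable
// tasks added since the previous watermark instead of re-sending everything.
class IncrementalBackup {
    final List<String> stable = new ArrayList<>();      // stable bottom of the pool
    final List<String> backupPlace = new ArrayList<>(); // stands in for Back(p)
    private int watermark = 0;                          // already backed-up prefix

    void backup() {
        // send only stable[watermark..end), the increment since the last backup
        backupPlace.addAll(stable.subList(watermark, stable.size()));
        watermark = stable.size();
    }
}

public class IncrementalDemo {
    public static void main(String[] args) {
        IncrementalBackup b = new IncrementalBackup();
        b.stable.add("t0"); b.stable.add("t1");
        b.backup();                        // transfers t0 and t1
        b.stable.add("t2");
        b.backup();                        // transfers only t2
        System.out.println(b.backupPlace); // full copy now at the backup place
    }
}
```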
Actor Scheme
No blocking constructs (except one outer finish)
split and merge have to operate on the bottom of the task pool
Worker is a passive entity (only processes tasks)
Worker becomes active when a message is received
Two kinds of messages: executed directly, or stored and processed later
→ Worker stays responsive
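A minimal Java sketch of such an actor-style worker (hypothetical names; in X10 the messages would arrive as asyncs rather than through a local queue):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the actor scheme: the worker is passive and only processes tasks;
// an incoming message is either executed directly or stored and handled
// between tasks, so the worker never blocks and stays responsive.
class ActorWorker {
    final Deque<Runnable> tasks = new ArrayDeque<>();
    final Deque<Runnable> deferred = new ArrayDeque<>();       // stored messages

    void receiveImmediate(Runnable msg) { msg.run(); }         // kind 1: run directly
    void receiveDeferred(Runnable msg)  { deferred.add(msg); } // kind 2: run later

    void step() { // one scheduling step: a task, then all pending messages
        if (!tasks.isEmpty()) tasks.poll().run();
        while (!deferred.isEmpty()) deferred.poll().run();
    }
}

public class ActorDemo {
    public static void main(String[] args) {
        ActorWorker w = new ActorWorker();
        int[] log = {0};
        w.tasks.add(() -> log[0] += 1);        // a task
        w.receiveDeferred(() -> log[0] += 10); // handled only between tasks
        w.receiveImmediate(() -> log[0] += 100);
        w.step();
        System.out.println(log[0]);
    }
}
```

Because messages only run between tasks (or immediately, if safe), there is never a point where the worker waits on a blocking construct.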
Stealing Protocol
[Protocol diagram: thief and victim F, V with their backup places Back(F), Back(V); messages trySteal, steal-backup, give, delOpen and acknowledgements STLack, BFack, XYack.]
● continue processing non-stolen tasks
● record stolen tasks in Open(F)
● valid = false; update backup, link to V
● save link; insert + process loot; valid = true
At next backup of F (non-incremental):
● delete Open(F)
● update backup
● delete link to V
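One detail of the protocol, the Open(F) bookkeeping, can be sketched as follows (all names hypothetical): the victim keeps the loot recoverable until the thief's next non-incremental backup covers it, at which point a delOpen message lets the record be discarded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of Open(F): the victim records the tasks stolen by each thief F
// while the loot is in flight; after F's next backup contains it, a delOpen
// message removes the record.
class OpenSteals {
    final Map<Integer, List<String>> open = new HashMap<>(); // Open(F) per thief F

    void recordSteal(int thief, List<String> loot) {
        open.computeIfAbsent(thief, k -> new ArrayList<>()).addAll(loot);
    }

    void onDelOpen(int thief) { // thief's backup now contains the loot
        open.remove(thief);
    }
}

public class OpenDemo {
    public static void main(String[] args) {
        OpenSteals v = new OpenSteals();
        v.recordSteal(3, java.util.Arrays.asList("t1", "t2"));
        System.out.println(v.open); // loot still recoverable via Open(3)
        v.onDelOpen(3);
        System.out.println(v.open); // record dropped after the backup
    }
}
```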
Asynchronism
Asynchronism with Fault-Tolerance
Detection of Dead Places
Cannot use DeadPlaceExceptions
Check relevant places regularly via isDead(), as well as the own backup place
What if a place P is inactive?
P does not check its backup place for liveness
But P's predecessor Forth(P) does check P
If P is active, it checks liveness of Back(P)
Recursive process
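One way to read the recursive process is as a walk along the backup ring: each live place checks its backup place, and if that place is dead, the checker walks on, so every dead place is detected by its nearest live predecessor. A hypothetical sketch (in X10 the actual check would use `Place.isDead()`):

```java
// Sketch of the recursive liveness check along the ring Back(p) = (p+1) mod n.
public class LivenessRing {
    // Counts how many failures the chain of checks detects; with at least one
    // live place this equals the number of dead places.
    static int detectedFailures(boolean[] alive) {
        int n = alive.length, detected = 0;
        for (int p = 0; p < n; p++) {
            if (!alive[p]) continue;        // a dead place checks nothing
            int q = (p + 1) % n;            // Back(p) in the ring
            while (q != p && !alive[q]) {   // dead: p detects it and walks on
                detected++;
                q = (q + 1) % n;
            }
        }
        return detected;
    }

    public static void main(String[] args) {
        // places 1 and 2 are dead; both are detected (place 0 finds both)
        System.out.println(detectedFailures(new boolean[]{true, false, false, true}));
    }
}
```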
Experimental Results
1 Global Load Balancing
2 Fault Tolerance Scheme
3 Experimental Results
Setup
Experiments were conducted on an InfiniBand-connected cluster
One place per node
Up to 128 nodes
Configuration:
small UTS: -d=13
large UTS: -d=17
UTS, small
[Plot: execution time in seconds (0–60) over number of places (0–60) for GLB, FTGLB and FTGLB-Incremental]
UTS, large
[Plot: execution time in seconds (0–2000) over number of places (0–60) for GLB, FTGLB and FTGLB-Incremental]
Thank you for your attention!
Please feel free to ask questions.