Towards an Efficient Fault-Tolerance Scheme for...

Post on 06-Aug-2020

0 views 0 download

Transcript of Towards an Efficient Fault-Tolerance Scheme for...

GLB Fault Tolerance Scheme Experimental Results

Towards an Efficient Fault-Tolerance

Scheme for GLB

Claudia Fohry, Marco Bungart and Jonas Posner

Programming Languages / Methodologies

June 14, 2015

1 / 18

GLB Fault Tolerance Scheme Experimental Results

Global Load Balancing

1 Global LoadBalancing

2 Fault ToleranceScheme

3 ExperimentalResults

2 / 18

GLB Fault Tolerance Scheme Experimental Results

Worker-local Pools

Examples:

UTS: counting nodes in an unbalanced tree

BC: calculate a property of each node in a graph

3 / 18

GLB Fault Tolerance Scheme Experimental Results

GLB

Task pool framework for inter-place load balancing

Utilizes cooperative work stealing

Tasks are free of side effects and can spawn new task atexecution time

Final result computed by reduction

Only one worker per place

Worker-private pool

4 / 18

GLB Fault Tolerance Scheme Experimental Results

GLB’s main processing loop

do {

while (process(n)) {

Runtime.probe();

distribute();

reject();

}

} while (steal());

5 / 18

GLB Fault Tolerance Scheme Experimental Results

Fault Tolerance Scheme

1 Global LoadBalancing

2 Fault ToleranceScheme

3 ExperimentalResults

6 / 18

GLB Fault Tolerance Scheme Experimental Results

Conceptual Ideas

One backup-place per place (cyclic)

Write backup periodically and when necessary (stealing)

Exploit stealing-induces redundancy

Write incremental backups whenever possible

Each information at exactly two places

7 / 18

GLB Fault Tolerance Scheme Experimental Results

Incremental Backup of stable Tasks

s

AR

R

...

t-1

...

RA

s

RA

s...

RA

st-1... ...

R

s

A

snap snap snapt-1

backup

t

mint-2

mint-1

st-2

send

st-2

8 / 18

GLB Fault Tolerance Scheme Experimental Results

Actor Scheme

No blocking constructs (except one outer finish)

split and merge have to operate on the bottom of theTask Pool

Actor Scheme

Worker is passive entity (only processing tasks)Worker becomes active when a message is receivedTwo kinds of messages:

executed directly orstored and processed later

→ Worker stays responsive

9 / 18

GLB Fault Tolerance Scheme Experimental Results

Stealing Protocol

F

trySteal

steal-backup

VBack(F) Back(V)

● continue processing non-stolen tasks

● record stolen tasks in Open(F)

● valid = false● update

backup

link to V

save linkinsert + process

FendBVend

STLack

BFack

give

valid = true

At next backup of F:

non-incremental

● delete Open(F)

delOpen

XYack

● update backup

● delete link to V

V1

V2

V3

10 / 18

GLB Fault Tolerance Scheme Experimental Results

Asynchronism

11 / 18

GLB Fault Tolerance Scheme Experimental Results

Asynchronism with Fault-Tolerance

12 / 18

GLB Fault Tolerance Scheme Experimental Results

Detection of dead Places

Cannot use DeadPlaceExceptions

Check relevant places regularly via isDead(), as well asthe own backup place

What if a place P is inactive?

Does not check its backup-place for lifenessBut its predecessor Forth(P) does check P

If P is active, it checks lifeness of Back(P)Recursive process

13 / 18

GLB Fault Tolerance Scheme Experimental Results

Experimental Results

1 Global LoadBalancing

2 Fault ToleranceScheme

3 ExperimentalResults

14 / 18

GLB Fault Tolerance Scheme Experimental Results

Setup

Experiments were conductet on an Infiniband-connectedCluster

One place per node

Up to 128 Nodes

Configuration:

small UTS: -d=13large UTS: -d=17

15 / 18

GLB Fault Tolerance Scheme Experimental Results

UTS, small

0

10

20

30

40

50

60

0 10 20 30 40 50 60

Tim

e(s

econ

ds)

Places

GLBFTGLB

FTGLB-Incremental

16 / 18

GLB Fault Tolerance Scheme Experimental Results

UTS, small

0

500

1000

1500

2000

0 10 20 30 40 50 60

Tim

e(s

econ

ds)

Places

GLBFTGLB

FTGLB-Incremental

17 / 18

GLB Fault Tolerance Scheme Experimental Results

Thank you for your attention!

Please feel free to ask questions.

18 / 18