GLB Fault Tolerance Scheme Experimental Results
Towards an Efficient Fault-Tolerance Scheme for GLB
Claudia Fohry, Marco Bungart and Jonas Posner
Programming Languages / Methodologies
June 14, 2015
Global Load Balancing
1 Global Load Balancing
2 Fault Tolerance Scheme
3 Experimental Results
Worker-local Pools
Examples:
UTS (Unbalanced Tree Search): counting the nodes of an unbalanced tree
BC (Betweenness Centrality): calculating a property of each node in a graph
GLB
Task pool framework for inter-place load balancing
Utilizes cooperative work stealing
Tasks are free of side effects and can spawn new tasks at execution time
Final result computed by reduction
Only one worker per place
Worker-private pool
GLB’s main processing loop
do {
    while (process(n)) {  // process up to n tasks from the local pool
        Runtime.probe();  // poll the network for incoming messages
        distribute();     // answer pending steal requests with loot
        reject();         // turn down requests that cannot be served
    }
} while (steal());        // pool empty: try to steal; finish when stealing fails
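The loop above is X10 from the GLB library. As an illustration only, here is a minimal single-threaded Java sketch of the same control flow; all names are hypothetical, the message handling is simulated with a local mailbox, and `distribute()`/`reject()` are omitted:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of GLB's processing loop: run a chunk of n tasks, then
// poll for incoming messages; when the local pool runs dry, try to steal.
class Worker {
    final Deque<Runnable> pool = new ArrayDeque<>();
    final Deque<Runnable> mailbox = new ArrayDeque<>(); // stands in for Runtime.probe()
    Worker victim;                                      // single steal target (simplified)
    long processed = 0;

    void run(int n) {
        do {
            while (process(n)) {
                probe(); // handle incoming messages between chunks
            }
        } while (steal());
    }

    boolean process(int n) { // run up to n tasks; false when the pool is empty
        for (int i = 0; i < n && !pool.isEmpty(); i++) {
            pool.poll().run();
            processed++;
        }
        return !pool.isEmpty();
    }

    void probe() { while (!mailbox.isEmpty()) mailbox.poll().run(); }

    boolean steal() { // move half of the victim's tasks into the own pool
        if (victim == null || victim.pool.isEmpty()) return false;
        int half = (victim.pool.size() + 1) / 2;
        for (int i = 0; i < half; i++) pool.add(victim.pool.poll());
        return true;
    }
}

public class LoopSketch {
    public static void main(String[] args) {
        Worker w = new Worker(), v = new Worker();
        w.victim = v;
        for (int i = 0; i < 10; i++) v.pool.add(() -> {}); // 10 tasks at the victim
        for (int i = 0; i < 5; i++)  w.pool.add(() -> {}); // 5 local tasks
        w.run(4);
        System.out.println(w.processed); // all own and stolen tasks processed
    }
}
```

The outer do/while makes the worker alternate between draining its own pool and stealing, exactly the shape of the X10 loop.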
Fault Tolerance Scheme
1 Global Load Balancing
2 Fault Tolerance Scheme
3 Experimental Results
Conceptual Ideas
One backup place per place (cyclic)
Write backup periodically and when necessary (stealing)
Exploit stealing-induced redundancy
Write incremental backups whenever possible
Each piece of information is kept at exactly two places
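The cyclic backup mapping can be sketched in a few lines of Java; this assumes the ring runs in increasing place order, which the slide does not specify:

```java
// Sketch of the cyclic backup mapping: assuming the ring runs in increasing
// place order, place p writes its backups to Back(p) = (p + 1) mod n, and
// Forth(p) = (p - 1) mod n is the predecessor that backs up onto p. Every
// pool state then lives at exactly two places: the owner and Back(owner).
public class BackupRing {
    static int back(int p, int n)  { return (p + 1) % n; }
    static int forth(int p, int n) { return (p - 1 + n) % n; }

    public static void main(String[] args) {
        int n = 4;
        System.out.println(back(3, n));           // wraps around to place 0
        System.out.println(back(forth(2, n), n)); // Back(Forth(p)) == p
    }
}
```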
Incremental Backup of stable Tasks
[Diagram: successive pool snapshots snap_{t-1}, snap_t with fill-level minima min_{t-2}, min_{t-1}; at backup t only the stable increment s_{t-2} is sent to the backup place.]
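The core idea of the diagram is that a backup need not re-send the whole pool. A toy Java sketch, under the simplifying assumption that the stable bottom segment only grows between snapshots (GLB's actual pool layout and min-tracking differ):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of an incremental backup: each backup transfers only the stable
// tasks added since the previous watermark instead of re-sending everything.
class IncrementalBackup {
    final List<String> stable = new ArrayList<>();      // stable bottom of the pool
    final List<String> backupPlace = new ArrayList<>(); // stands in for Back(p)
    private int watermark = 0;                          // already backed-up prefix

    void backup() {
        // send only stable[watermark..end), the increment since the last backup
        backupPlace.addAll(stable.subList(watermark, stable.size()));
        watermark = stable.size();
    }
}

public class IncrementalDemo {
    public static void main(String[] args) {
        IncrementalBackup b = new IncrementalBackup();
        b.stable.add("t0"); b.stable.add("t1");
        b.backup();                        // transfers t0 and t1
        b.stable.add("t2");
        b.backup();                        // transfers only t2
        System.out.println(b.backupPlace); // full copy now at the backup place
    }
}
```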
Actor Scheme
No blocking constructs (except one outer finish)
split and merge have to operate on the bottom of the task pool
Worker is a passive entity (only processes tasks)
Worker becomes active when a message is received
Two kinds of messages: executed directly, or stored and processed later
→ Worker stays responsive
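A minimal Java sketch of such an actor-style worker (hypothetical names; in X10 the messages would arrive as asyncs rather than through a local queue):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the actor scheme: the worker is passive and only processes tasks;
// an incoming message is either executed directly or stored and handled
// between tasks, so the worker never blocks and stays responsive.
class ActorWorker {
    final Deque<Runnable> tasks = new ArrayDeque<>();
    final Deque<Runnable> deferred = new ArrayDeque<>();       // stored messages

    void receiveImmediate(Runnable msg) { msg.run(); }         // kind 1: run directly
    void receiveDeferred(Runnable msg)  { deferred.add(msg); } // kind 2: run later

    void step() { // one scheduling step: a task, then all pending messages
        if (!tasks.isEmpty()) tasks.poll().run();
        while (!deferred.isEmpty()) deferred.poll().run();
    }
}

public class ActorDemo {
    public static void main(String[] args) {
        ActorWorker w = new ActorWorker();
        int[] log = {0};
        w.tasks.add(() -> log[0] += 1);        // a task
        w.receiveDeferred(() -> log[0] += 10); // handled only between tasks
        w.receiveImmediate(() -> log[0] += 100);
        w.step();
        System.out.println(log[0]);
    }
}
```

Because messages only run between tasks (or immediately, if safe), there is never a point where the worker waits on a blocking construct.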
Stealing Protocol
[Protocol diagram: thief and victim F, V with their backup places Back(F), Back(V); messages trySteal, steal-backup, give, delOpen and acknowledgements STLack, BFack, XYack.]
● continue processing non-stolen tasks
● record stolen tasks in Open(F)
● valid = false; update backup, link to V
● save link; insert + process loot; valid = true
At next backup of F (non-incremental):
● delete Open(F)
● update backup
● delete link to V
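One detail of the protocol, the Open(F) bookkeeping, can be sketched as follows (all names hypothetical): the victim keeps the loot recoverable until the thief's next non-incremental backup covers it, at which point a delOpen message lets the record be discarded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of Open(F): the victim records the tasks stolen by each thief F
// while the loot is in flight; after F's next backup contains it, a delOpen
// message removes the record.
class OpenSteals {
    final Map<Integer, List<String>> open = new HashMap<>(); // Open(F) per thief F

    void recordSteal(int thief, List<String> loot) {
        open.computeIfAbsent(thief, k -> new ArrayList<>()).addAll(loot);
    }

    void onDelOpen(int thief) { // thief's backup now contains the loot
        open.remove(thief);
    }
}

public class OpenDemo {
    public static void main(String[] args) {
        OpenSteals v = new OpenSteals();
        v.recordSteal(3, java.util.Arrays.asList("t1", "t2"));
        System.out.println(v.open); // loot still recoverable via Open(3)
        v.onDelOpen(3);
        System.out.println(v.open); // record dropped after the backup
    }
}
```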
Asynchronism
Asynchronism with Fault-Tolerance
Detection of Dead Places
Cannot use DeadPlaceExceptions
Check relevant places regularly via isDead(), as well as the own backup place
What if a place P is inactive?
P does not check its backup place for liveness
But P's predecessor Forth(P) does check P
If P is active, it checks liveness of Back(P)
Recursive process
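One way to read the recursive process is as a walk along the backup ring: each live place checks its backup place, and if that place is dead, the checker walks on, so every dead place is detected by its nearest live predecessor. A hypothetical sketch (in X10 the actual check would use `Place.isDead()`):

```java
// Sketch of the recursive liveness check along the ring Back(p) = (p+1) mod n.
public class LivenessRing {
    // Counts how many failures the chain of checks detects; with at least one
    // live place this equals the number of dead places.
    static int detectedFailures(boolean[] alive) {
        int n = alive.length, detected = 0;
        for (int p = 0; p < n; p++) {
            if (!alive[p]) continue;        // a dead place checks nothing
            int q = (p + 1) % n;            // Back(p) in the ring
            while (q != p && !alive[q]) {   // dead: p detects it and walks on
                detected++;
                q = (q + 1) % n;
            }
        }
        return detected;
    }

    public static void main(String[] args) {
        // places 1 and 2 are dead; both are detected (place 0 finds both)
        System.out.println(detectedFailures(new boolean[]{true, false, false, true}));
    }
}
```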
Experimental Results
1 Global Load Balancing
2 Fault Tolerance Scheme
3 Experimental Results
Setup
Experiments were conducted on an InfiniBand-connected cluster
One place per node
Up to 128 nodes
Configuration:
small UTS: -d=13
large UTS: -d=17
UTS, small
[Plot: execution time in seconds (0–60) over number of places (0–60) for GLB, FTGLB and FTGLB-Incremental]
UTS, large
[Plot: execution time in seconds (0–2000) over number of places (0–60) for GLB, FTGLB and FTGLB-Incremental]
Thank you for your attention!
Please feel free to ask questions.