Intermediate Presentation(05/04/15)
Autonomous Failure Detection for Supporting Fault Tolerant Parallel Computation
05/04/15 Taura Lab. Master 2nd 46432 Yuuki Horita
Background
Large-scale computation runs in parallel on a great number of nodes in distributed environments (Grid) over a long period of time
High failure rate
• Node / Process Failures
• Network Failures
Fault tolerance is becoming more important
Fault tolerant computing
[Figure: flow diagram with the stages Computing, Failures, Failure Detection, Recovery, Resuming, The end]
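Failure Detection
Heartbeat strategy
[Figure: timeline of process Y sending heartbeat messages (msg) to process X at interval Thb; after a further Tto without a heartbeat, X concludes "Y is probably dead"]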
① A process Y sends a message, called a heartbeat, to another process X at a regular time interval Thb
② After Y dies, X receives no more heartbeats from Y
③ X suspects Y once Thb+Tto has elapsed since the last heartbeat was received
Objective
To design and implement a failure detection service for supporting fault-tolerant parallel computation
Contributions
Propose a new failure detection approach for fault-tolerant parallel computation
• high autonomy: address join/leave of processes; support Grid environments with less manual configuration
• high consistency: all the processes obtain consistent failure information
• high efficiency: more efficient than other autonomous approaches (the overhead with 313 processes was at most about 2% where the heartbeat interval is 0.1 [s])
Agenda
• Background
• Demands / Related Works
• Our Approach
• Experiments
• Summary
Demands for Failure Detection
System demand (Autonomy)
• Adaptability/Fault-tolerance: address join/leave of processes
• Accessibility: need less manual configuration
Information demand (Consistency)
• Consistency: must provide consistent information
Performance demand (Efficiency)
• Low overhead: don’t deteriorate application performance
• Low detection latency: inform failure events ASAP
• Accuracy: fewer false positives
Hierarchical style
MDS (Globus Project), NWS [R. Wolski ’97, N.T.Spring ’99]
• a single point of failure may lead to system failure
• manual configuration may be cumbersome
Autonomy problem
Gossip style [R. Renesse ’98]
• utilizes the mechanism of rumor spreading
• each process periodically sends a gossip message (like a heartbeat) to a randomly selected process
• a gossip message includes {node, heartbeat} for all processes
  node: a process identifier
  heartbeat: the latest time when some node received node’s heartbeat
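The gossip exchange described above could look roughly like this. It is a sketch under assumptions, not code from the cited work: the function names, the dictionary layout, and the failure threshold `t_fail` are all hypothetical, and heartbeats are modeled as timestamps as the slide describes.

```python
import random


def merge(local, received):
    """Keep, for every node, the most recent heartbeat time seen so far."""
    for node, heartbeat in received.items():
        if heartbeat > local.get(node, float("-inf")):
            local[node] = heartbeat


def gossip_round(tables, now, rng):
    """Each process refreshes its own entry and gossips to one random peer."""
    nodes = list(tables)
    for sender in nodes:
        tables[sender][sender] = now
        peers = [n for n in nodes if n != sender]
        if not peers:
            continue  # a lone process has nobody to gossip with
        target = rng.choice(peers)
        # The gossip message carries {node: heartbeat} for all known nodes.
        merge(tables[target], dict(tables[sender]))


def suspects(table, now, t_fail):
    """Each process judges failures independently from its own table."""
    return {n for n, hb in table.items() if now - hb > t_fail}


local = {"a": 1.0, "b": 5.0}
merge(local, {"a": 3.0, "b": 2.0, "c": 4.0})
suspected = suspects(local, now=10.0, t_fail=6.0)
```

Because every process evaluates `suspects` on its own table, two processes can disagree about which nodes have failed at a given moment.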
Gossip style
Heartbeats are propagated to all processes automatically within a certain amount of time
• each process judges process failure independently: Consistency problem
• it takes longer to detect failures: Efficiency problem
Basic Design
Separation of failure detection and information propagation
Each process is monitored by some processes (Failure-detection phase)
If a process detects process failures, it broadcasts the information (Information-propagation phase)
• the overhead under normal conditions will be low (Efficiency)
• the failure information will be shared (Consistency)
Failure Detection
Each process acts autonomously so that it is always monitored by some processes
Each process
• requests k randomly selected neighbor processes to monitor it (neighbor: directly connectable)
• sends heartbeats to them at a regular time interval Thb
• requests again in the same way if a monitoring process has failed (self-repairing)
[Figure: monitoring network with k = 2; an arrow A → B means A sends heartbeats to B (B monitors A)]
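The request and self-repair behavior above can be sketched as follows. This is a minimal illustration under assumptions: the `Process` class, its method names, and the process identifiers are hypothetical, and actual message passing is omitted.

```python
import random


class Process:
    """One process choosing its own monitors (hypothetical sketch)."""

    def __init__(self, pid, neighbors, k, rng):
        self.pid = pid
        self.neighbors = set(neighbors)   # directly connectable processes
        self.k = k
        self.rng = rng
        self.monitors = set()             # processes that receive our heartbeats
        self.request_monitors()

    def request_monitors(self):
        """Top the monitor set back up to k with randomly chosen neighbors."""
        candidates = sorted(self.neighbors - self.monitors)
        missing = min(self.k - len(self.monitors), len(candidates))
        if missing > 0:
            self.monitors.update(self.rng.sample(candidates, missing))

    def on_failure(self, failed_pid):
        """Self-repair: forget a failed process and replace it if it was a monitor."""
        self.neighbors.discard(failed_pid)
        if failed_pid in self.monitors:
            self.monitors.discard(failed_pid)
            self.request_monitors()


p = Process(pid=1, neighbors=range(2, 8), k=2, rng=random.Random(0))
initial_monitors = set(p.monitors)
failed = next(iter(initial_monitors))
p.on_failure(failed)
```

After `on_failure`, the process is again monitored by k live neighbors as long as enough neighbors remain.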
Information Propagation
Flood along the monitoring network
• no need for extra connections
• redundant paths for broadcast (fault-tolerant)
• at most 2k messages per process (scalable)
Can we guarantee that the monitoring network is connected?
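Flooding along the monitoring network can be sketched as a breadth-first traversal. This is an illustrative simulation rather than the presented implementation: the function name `flood` and the edge-list representation are assumptions, and each monitoring link is treated as usable in both directions.

```python
from collections import deque


def flood(edges, origin):
    """Simulate a flood; return (reached processes, total messages sent).

    `edges` holds pairs (a, b) meaning "a sends heartbeats to b".
    """
    links = {}
    for a, b in edges:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    reached = {origin}
    queue = deque([origin])
    messages = 0
    while queue:
        p = queue.popleft()
        for q in links.get(p, ()):
            messages += 1            # p forwards once on each of its links
            if q not in reached:     # duplicates are dropped by the receiver
                reached.add(q)
                queue.append(q)
    return reached, messages


reached, messages = flood([(1, 2), (2, 3), (3, 1), (4, 1)], origin=1)
partial, _ = flood([(1, 2), (3, 4)], origin=1)
```

In this simple variant every reached process sends one message per monitoring link, so the per-process cost depends only on its number of links and not on the total number of processes. The second call shows why connectivity matters: processes 3 and 4 never hear about the failure.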
Connectivity of Monitoring Network
We calculated the probability that the monitoring network is disconnected
[Figure: probability of disconnection (log scale, 1.00E-16 to 1.00E+00) versus # of nodes (1 to 36), plotted for k = 1, k = 2, and k = 3]
The probability of disconnection is negligible if k >= 3
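One way to get a feel for this probability is a Monte Carlo estimate. The slide does not say how the values were calculated, so the following is a swapped-in simulation under assumptions: every process can reach every other process, each picks k distinct monitors uniformly at random, and all function names are hypothetical.

```python
import random


def random_monitoring_network(n, k, rng):
    """Each of n processes asks k randomly chosen other processes to monitor it."""
    edges = set()
    for p in range(n):
        others = [q for q in range(n) if q != p]
        for q in rng.sample(others, min(k, len(others))):
            edges.add((p, q))
    return edges


def is_connected(nodes, edges):
    """Check connectivity, treating monitoring links as undirected."""
    nodes = set(nodes)
    if not nodes:
        return True
    adj = {p: set() for p in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        p = stack.pop()
        for q in adj[p] - seen:
            seen.add(q)
            stack.append(q)
    return seen == nodes


def estimate_disconnection(n, k, trials, seed=0):
    """Fraction of random monitoring networks that turn out disconnected."""
    rng = random.Random(seed)
    bad = sum(
        not is_connected(range(n), random_monitoring_network(n, k, rng))
        for _ in range(trials)
    )
    return bad / trials


estimate_k1 = estimate_disconnection(n=10, k=1, trials=200)
```

Sampling cannot resolve probabilities as small as those on the plot's lower axis, so it is only useful for the k = 1 range or as a sanity check.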
Support Grid Environments
The connectivity between different networks is often limited (e.g. NAT, Firewall)
[Figure: Cluster A and Cluster B, each with a Gateway; labeled "Disconnected!"]
Support Grid Environments
[Figure: diagram with two "K monitoring requests" labels]
For each process, each of its neighbor processes should be either monitoring it directly or adjacent to k of its monitoring processes
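The rule above can be checked mechanically. The sketch below reads "adjacent to k of its monitoring processes" as "directly connectable to at least k of them", which is an interpretation rather than something the slide spells out; the function name and the example topology are hypothetical.

```python
def uncovered_neighbors(p, neighbors, monitors, k):
    """Neighbors of p that violate the rule.

    neighbors: dict mapping each process to the set of processes it can
               connect to directly
    monitors:  set of processes currently monitoring p
    """
    uncovered = set()
    for n in neighbors[p]:
        if n in monitors:
            continue                           # monitoring p directly
        if len(neighbors[n] & monitors) >= k:
            continue                           # adjacent to k of p's monitors
        uncovered.add(n)
    return uncovered


# Process 1 is a gateway: 2, 3, 4 share its cluster, 5 sits in another network.
neighbors = {
    1: {2, 3, 4, 5},
    2: {1, 3, 4},
    3: {1, 2, 4},
    4: {1, 2, 3},
    5: {1},
}
before = uncovered_neighbors(1, neighbors, monitors={2, 3}, k=2)
after = uncovered_neighbors(1, neighbors, monitors={2, 3, 5}, k=2)
```

In the example, process 5 cannot reach monitors 2 and 3, so process 1 would have to ask 5 to monitor it as well, which ties the two networks together.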
Support Grid Environments
[Figure: worked example with k = 2 on a network of processes numbered 1 to 9; legend: neighbor processes, monitoring it directly, adjacent to k of its monitoring processes]
Support Grid Environments (continued)
[Figure: continuation of the worked example with k = 2; legend: monitoring it directly, adjacent to k of its monitoring processes]
Experiment Environment
• ISTBS Cluster (112 nodes × 2 CPU): Xeon 2.4GHz × 70 + Xeon 2.8GHz × 42; 105 nodes (7 nodes down) located at Hongo
• SHEEP Cluster (65 nodes × 2 CPU): Xeon 2.4GHz × 65; 65 nodes located at Kashiwa
[Figure: the SHEEP cluster in Kashiwa and the ISTBS cluster in Hongo, connected over the Internet]
Demonstration (Java Applet)
[Figure legend: a process (node), monitoring (edge)]
• many processes will die concurrently, three times (they turn black and disappear)
• the surviving processes will detect all of the failures (they change color)
• processes will repair the broken monitoring relations (new edges are added)
Connectivity under failures
Simulate the connectivity of the monitoring network under some failures
• check whether the monitoring network is connected when F failures happen concurrently
• 1.8×10^9 trials in each case
Connectivity under failures
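Maximum number of failures for which the probability of disconnection is less than p:

| # of procs.   | 10 | 20 | 40 | 80 | 160 |
|---------------|----|----|----|----|-----|
| k=3, p=0.01   | 3  | 4  | 5  | 8  | 13  |
| k=3, p=0.0001 | 2  | 2  | 3  | 3  | 4   |
| k=4, p=0.01   | 4  | 6  | 9  | 14 | 24  |
| k=4, p=0.0001 | 3  | 4  | 4  | 6  | 10  |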
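An experiment of this kind could be organized as below. This is a rough sketch under assumptions and will not reproduce the table: it assumes full connectivity between processes, ignores self-repair because the failures are concurrent, and uses hypothetical function names. With a small number of trials it can only resolve fairly large values of p.

```python
import random


def connected_after_failures(n, k, f, rng):
    """Build a random monitoring network, kill f processes, test the survivors."""
    adj = {p: set() for p in range(n)}
    for p in range(n):
        for q in rng.sample([x for x in range(n) if x != p], k):
            adj[p].add(q)
            adj[q].add(p)
    alive = set(range(n)) - set(rng.sample(range(n), f))
    start = next(iter(alive))
    seen, stack = {start}, [start]
    while stack:
        p = stack.pop()
        for q in (adj[p] & alive) - seen:
            seen.add(q)
            stack.append(q)
    return seen == alive


def max_tolerable_failures(n, k, p, trials, seed=0):
    """Largest F whose estimated disconnection probability stays below p."""
    rng = random.Random(seed)
    best = 0
    for f in range(1, n - 1):
        bad = sum(
            not connected_after_failures(n, k, f, rng) for _ in range(trials)
        )
        if bad / trials >= p:
            break
        best = f
    return best


estimate = max_tolerable_failures(n=10, k=3, p=0.01, trials=200)
```

Resolving p = 0.0001 reliably needs far more trials than this, which is consistent with the very large trial count quoted on the previous slide.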
Efficiency
Measured the execution time of a Fibonacci program under each of the following autonomous failure detection services
• all-to-all
• Gossip
• ours
Parameters
• # of processes: 2 ~ 313
• k = 3
• Thb = 0.1, 1.0 [s]
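To see why the three services might load the application differently, it helps to count messages per heartbeat interval. The formulas below are back-of-the-envelope assumptions derived from how each scheme is described, not measurements from the experiment: all-to-all is taken to mean every process sends a heartbeat to every other process, gossip sends one message per process (but that message carries one entry per process), and ours sends one heartbeat to each of k monitors.

```python
def messages_per_interval(n, k):
    """Rough system-wide message counts per heartbeat interval (assumed model)."""
    return {
        "all-to-all": n * (n - 1),   # every process to every other process
        "gossip": n,                 # one gossip message per process
        "ours": n * k,               # one heartbeat to each of k monitors
    }


def gossip_entries_per_message(n):
    """A gossip message carries a {node, heartbeat} entry for every process."""
    return n


counts = messages_per_interval(n=313, k=3)
entries = gossip_entries_per_message(313)
```

Under this model only the all-to-all count grows quadratically with the number of processes, while gossip keeps the message count low but pays with larger messages.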
Results (Efficiency)
[Figure: execution time [s] (20 to 27) versus # of processes N (0 to 400) for all-to-all (Thb=0.1[s]), gossip (Thb=0.1[s]), ours (Thb=0.1[s]), and all-to-all (Thb=1.0[s]); annotations: 10% overhead (N = 127), over 5% overhead (N = 153)]
The overhead of our approach is at most around 2%
Summary
Proposed a new failure detection technique for fault-tolerant parallel computation
Showed that
• our system could be autonomously constructed in Grid environments
• our system has high fault-tolerance
• it is more efficient than other autonomous approaches
Future Work
• handling network partitioning
• sharing load on dynamic process join
• showing its practicality by implementing a fault-tolerant parallel application using it
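Publications
• Yuuki Horita, Kenjiro Taura, Takashi Chikayama. A communication library supporting fault-tolerant parallel computation in distributed environments. Symposium on Advanced Computing Systems and Infrastructures (SACSIS2004). May 2004. (poster paper)
• Yuuki Horita, Kenjiro Taura, Takashi Chikayama. A failure detection mechanism for the Phoenix programming model. Summer Workshop on Parallel/Distributed/Cooperative Processing (SWoPP2004). July 2004.
• Yuuki Horita, Kenjiro Taura, Takashi Chikayama. An autonomous failure detection mechanism supporting fault-tolerant parallel computation. Symposium on Advanced Computing Systems and Infrastructures (SACSIS2005). May 2005. (to be presented)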