Computing in the Reliable Array of Independent Nodes
Vasken Bohossian, Charles Fan, Paul LeMahieu, Marc Riedel, Lihao Xu, Jehoshua Bruck
May 5, 2000
IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems
California Institute of Technology
Presented by Marc Riedel
RAIN Project
Collaboration:
• Caltech’s Parallel and Distributed Computing Group www.paradise.caltech.edu
• JPL’s Center for Integrated Space Microsystems www.csmt.jpl.nasa.gov
RAIN Platform
Heterogeneous network of nodes and switches.
[diagram: nodes attached to switches over a bus network]
RAIN Testbed
www.paradise.caltech.edu
10 Pentium boxes w/ multiple NICs
4 eight-way Myrinet Switches
Proof of Concept: Video Server
[diagram: nodes A, B, C, D connected to two switches]
Video client & server on every node.
Limited Storage
Insufficient storage to replicate all the data on each node.
[diagram: nodes A, B, C, D connected to two switches]
k-of-n Code
Erasure-correcting code: recover the data a, b, c, d from any k of the n columns.

  a     b     c     d
  d+c   d+a   a+b   b+c

For example, given the columns (a, d+c) and (c, a+b):
  b = (a+b) + a
  d = (d+c) + c
Encoding
Encode video using a 2-of-4 code.
[diagram: nodes A, B, C, D connected to two switches]
Decoding
Retrieve data and decode.
[diagram: nodes A, B, C, D connected to two switches]
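The 2-of-4 encode/decode on these slides is plain XOR arithmetic, so it can be sketched in a few lines. The following is a hypothetical illustration (not the RAIN implementation): each stored symbol is an XOR of a subset of {a, b, c, d}, and decoding solves the resulting linear system over GF(2).

```python
from itertools import combinations

# Hypothetical sketch of the 2-of-4 array code from the slides. Each column
# stores one data symbol and one XOR parity symbol; decoding solves the
# resulting linear system over GF(2).

SYMS = "abcd"
# Column layout from the slides: (a, d+c), (b, d+a), (c, a+b), (d, b+c).
LAYOUT = [("a", "cd"), ("b", "ad"), ("c", "ab"), ("d", "bc")]

def mask(group):
    """Bitmask over (a, b, c, d) marking which symbols are XORed together."""
    m = 0
    for s in group:
        m |= 1 << SYMS.index(s)
    return m

def encode(data):
    """data maps 'a'..'d' to ints; returns the 4 stored columns."""
    cols = []
    for col in LAYOUT:
        vals = []
        for group in col:
            v = 0
            for s in group:
                v ^= data[s]
            vals.append(v)
        cols.append(vals)
    return cols

def decode(cols, avail):
    """Recover all 4 symbols from the columns indexed by avail (any 2 suffice)."""
    eqs = []
    for i in avail:
        for group, v in zip(LAYOUT[i], cols[i]):
            eqs.append((mask(group), v))
    # Gaussian elimination over GF(2) on (mask, value) rows.
    pivots = {}  # pivot bit -> (mask, value)
    for m, v in eqs:
        for bit in sorted(pivots):
            if m >> bit & 1:
                pm, pv = pivots[bit]
                m, v = m ^ pm, v ^ pv
        if m:
            pivots[(m & -m).bit_length() - 1] = (m, v)
    # Back-substitution, highest pivot first.
    for bit in sorted(pivots, reverse=True):
        m, v = pivots[bit]
        for hb in range(bit + 1, 4):
            if m >> hb & 1:
                pm, pv = pivots[hb]
                m, v = m ^ pm, v ^ pv
        pivots[bit] = (m, v)
    return {SYMS[bit]: v for bit, (_, v) in pivots.items()}

data = {"a": 0x12, "b": 0x34, "c": 0x56, "d": 0x78}
cols = encode(data)
# Every pair of columns recovers the full data set.
assert all(decode(cols, pair) == data for pair in combinations(range(4), 2))
```

Any two surviving columns yield four independent XOR equations, which is why the video server can keep decoding after losing any two of the four stores.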
Node Failure
Dynamically switch to another node.
[diagram: node D fails; retrieval continues from the remaining nodes]
Link Failure
Dynamically switch to another network path.
[diagram: one of a node's links fails; traffic is rerouted through its other link]
Switch Failure
Dynamically switch to another network path.
[diagram: one switch fails; nodes communicate through the remaining switch]
Node Recovery
Continuous reconfiguration (e.g., load-balancing).
[diagram: the failed node rejoins the cluster]
Features
High availability:
• tolerates multiple node/link/switch failures
• no single point of failure
Efficient use of resources:
• multiple data paths
• redundant storage
• graceful degradation
Dynamic scalability/reconfigurability.
Certified Buzz-Word Compliant
RAIN Project: Goals
Efficient, reliable distributed computing and storage systems.
Key building blocks: Communication/Networks, Storage, Applications.
Topics
• Fault-Tolerant Interconnect Topologies
• Connectivity
• Group Membership
• Distributed Storage
Today’s Talk:
[diagram: building blocks covered today: Communication/Networks, Storage, Applications]
Interconnect Topologies
Goal: lose at most a constant number of nodes for a given network loss.
[diagram: computing/storage nodes attached to a network]
Resistance to Partitions
Large partitions are problematic for distributed services/computation.
[diagram: a network failure splits the computing/storage nodes into large partitions]
Related Work
Embedding hypercubes, rings, meshes, trees in fault-tolerant networks:
• Hayes et al., Bruck et al., Boesch et al.
Bus-based networks which are resistant to partitioning:
• Ku and Hayes, 1997. “Connective Fault-Tolerance in Multiple-Bus Systems”
A Ring of Switches
A naïve solution: degree-2 compute nodes, degree-4 switches.
[diagram: compute nodes attached to a ring of switches]
Easily partitioned.
Resistance to Partitioning
Nodes on diagonals: degree-2 compute nodes, degree-4 switches.
[diagram: switches in a ring with compute nodes 1–8 attached along the diagonals; an isomorphic redrawing of the same topology]
• tolerates any 3 switch failures (optimal)
• generalizes to arbitrary node/switch degrees
Details: paper IPPS’98, www.paradise.caltech.edu
Point-to-Point Connectivity
Is the path from A to B up or down?
[diagram: nodes A and B communicating across the network]
Connectivity
Link is seen as up or down ({U, D}) by each node.
[diagram: Node A and Node B each hold a link state in {U, D}]
Bi-directional communication: each node sends out pings; a node may time out, deciding the link is down.
Consistent History
A B
U
D
UU
U
D
DD
Time
NodeState
A B
U
D
U
U
U
D D
D
Time
U
UD
NodeState
A B
The Slack
NodeState
A B
Time
U
D
U
U
U
D D
DU
UD
A is 1 ahead
A is 2 ahead
Now A will wait for B to transition
Slack n=2:at most 2 unacknowledged transitions before a node waits
Consistent History
Consistency in error reporting: if A sees a channel error, B sees a channel error.
(Birman et al.: “Reliability Through Consistency”)
Details: paper IPPS’99, www.paradise.caltech.edu
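The slack rule above can be sketched as a small state machine. This is a hypothetical illustration of the bound only (class and method names are invented; the real protocol also carries the transitions over the network):

```python
# Hypothetical sketch of the slack rule from the slides: a node may run at
# most `slack` unacknowledged U/D transitions ahead of its peer before it
# must wait. Names are illustrative, not from the RAIN implementation.
class HistoryEnd:
    def __init__(self, slack=2):
        self.slack = slack
        self.state = "U"   # current view of the channel: up
        self.unacked = 0   # transitions the peer has not yet mirrored

    def can_transition(self):
        # With slack n = 2: at most 2 unacknowledged transitions.
        return self.unacked < self.slack

    def transition(self):
        """Flip the channel state and record one unacknowledged transition."""
        assert self.can_transition(), "must wait for the peer to catch up"
        self.state = "D" if self.state == "U" else "U"
        self.unacked += 1
        return self.state  # in the real protocol, sent to the peer

    def ack(self):
        """The peer acknowledged (mirrored) one of our transitions."""
        self.unacked -= 1

a = HistoryEnd(slack=2)
a.transition()                 # U -> D: A is 1 ahead
a.transition()                 # D -> U: A is 2 ahead
assert not a.can_transition()  # now A waits for B to transition
a.ack()                        # B catches up by one transition
assert a.can_transition()
```

Because neither side can drift more than n transitions ahead, the two histories stay consistent: every error A records is eventually recorded by B.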
Group Membership
Consistent global view given local, point-to-point connectivity information:
• link/node failures
• dynamic reconfiguration
[diagram: nodes A–D, each holding the membership list ABCD]
Related Work
Systems: Totem, Isis/Horus, Transis
Theory: Chandra et al., “Impossibility of Group Membership in an Asynchronous Environment”
Token-Ring based Group Membership Protocol
Token carries:
• group membership list
• sequence number
[diagram: the token circulates A → B → C → D, incrementing its sequence number at each node: 1: ABCD, 2: ABCD, 3: ABCD, 4: ABCD, ...]
Group Membership: Node or Link Fails
If a node is inaccessible, it is excluded and bypassed.
[diagram: node B becomes unreachable; the token bypasses B and circulates as 5: ACD, 6: ACD, ...]
Group Membership: Node with Token Fails
If the token is lost, it is regenerated. The highest sequence number prevails.
[diagram: B fails while holding the token; regenerated tokens 5: ACD and 6: AD compete, and the one with the highest sequence number survives]
Group Membership: Node Recovers
Recovering nodes are added.
[diagram: node D rejoins; the token circulates as 7: ADC, 8: ADC, 9: ADC, ...]
Features:
• unicast messages
• dynamic reconfiguration
• mean time-to-failure > convergence time
Details: publication forthcoming.
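The token's exclude/bypass and regeneration behavior can be sketched as a toy simulation. This is a hypothetical illustration under simplifying assumptions (names, data layout, and the single-failure scenario are invented; the real protocol works over unreliable links):

```python
# Hypothetical sketch of the token-ring membership protocol from the slides.
# The token carries the group membership list and a sequence number,
# incremented at each live node it visits.
class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.last_seq = 0  # highest sequence number this node has seen
        self.members = []  # membership list from the last token it held

def circulate_once(ring, token):
    """Pass the token once around the ring, bypassing inaccessible nodes."""
    for node in ring:
        if not node.alive:
            # If a node is inaccessible, it is excluded and bypassed.
            if node.name in token["members"]:
                token["members"].remove(node.name)
            continue
        token["seq"] += 1
        node.last_seq = token["seq"]
        node.members = list(token["members"])
    return token

def regenerate(ring):
    """If the token is lost, regenerate it; the highest sequence number prevails."""
    holder = max((n for n in ring if n.alive), key=lambda n: n.last_seq)
    return {"seq": holder.last_seq, "members": list(holder.members)}

ring = [Node(x) for x in "ABCD"]
token = {"seq": 0, "members": list("ABCD")}
circulate_once(ring, token)   # 1: ABCD, 2: ABCD, 3: ABCD, 4: ABCD
assert token["seq"] == 4
ring[1].alive = False         # node B fails, taking the token with it
token = regenerate(ring)      # the node with the freshest view recreates it
circulate_once(ring, token)   # B is excluded and bypassed
assert token["members"] == ["A", "C", "D"]
```

Resolving competing regenerated tokens by highest sequence number is what lets the surviving nodes converge on a single membership view.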
Distributed Storage
Focus: reliability and performance.
[diagram: a stream of bits striped across four disks]
Array Codes
The “B-code”:
  a     b     c     d
  d+c   d+a   a+b   b+c
(top row: data; bottom row: redundancy)
Ideally suited for distributed storage; low encoding/decoding complexity.
Recover the data a, b, c, d from any k of the n columns, e.g. b = (a+b) + a, d = (d+c) + c.
B-Code and X-Code:
• optimally redundant
• optimal encoding/decoding complexity
Details: IEEE Trans. Info Theory, www.paradise.caltech.edu
Summary
• Fault-Tolerant Interconnect Topologies: ring of switches with nodes on the diagonals.
• Connectivity: consistent {U, D} link states between nodes A and B.
• Group Membership: a circulating token carrying the membership list and a sequence number (1: ABCD, 2: ABCD, ...).
• Distributed Storage: array codes with columns (a, d+c), (b, d+a), (c, a+b), (d, b+c).
Proof-of-Concept Applications
• RAINVideo: high-availability video server
• RAINCheck: distributed checkpoint rollback/recovery system
• SNOW: Stable Network of Webservers
Rainfinity
www.rainfinity.com
Start-up based on RAIN technology
Business Plan: clustered solutions for Internet data centers, focusing on:
• availability
• scalability
• performance
Rainfinity
Company:
• Founded Sept. 1998
• Released first product April 1999
• Received $15 million funding in Dec. 1999
• Now over 50 employees
Future Research
• Development of APIs
• Fault-Tolerant Distributed Filesystem
• Fault-Tolerant MPI/PVM implementation
End of Talk
Material that was cut...
Erasure Correcting Codes
Strategy: encode the data with an erasure-correcting code.
[diagram: k data symbols are encoded into n symbols; up to m coordinates are lost; the original k data symbols are reconstructed from the survivors]
A code is optimally redundant (MDS) if m = n - k, i.e., any k of the n symbols suffice. Example: Reed-Solomon codes.
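For the 2-of-4 B-code used throughout this talk, the MDS property can be checked mechanically: any 2 of the 4 columns must contribute 4 linearly independent XOR equations over GF(2). A small hypothetical check (the bitmask encoding, where bit i stands for the i-th symbol of a, b, c, d, is my own notation):

```python
from itertools import combinations

# Each stored symbol is an XOR of a subset of {a, b, c, d}, encoded as a
# 4-bit mask (bit 0 = a, ..., bit 3 = d). Columns from the slides:
# (a, d+c), (b, d+a), (c, a+b), (d, b+c).
MASKS = [[0b0001, 0b1100], [0b0010, 0b1001], [0b0100, 0b0011], [0b1000, 0b0110]]

def rank_gf2(masks):
    """Rank over GF(2) of the rows given as bitmasks (Gaussian elimination)."""
    rows = list(masks)
    rank = 0
    for bit in range(4):
        piv = next((r for r in rows if r >> bit & 1), None)
        if piv is None:
            continue
        rows.remove(piv)
        rows = [r ^ piv if r >> bit & 1 else r for r in rows]
        rank += 1
    return rank

# MDS for k = 2: any 2 of the 4 columns yield a full-rank (rank-4) system,
# so all four data symbols can be recovered from any m = n - k = 2 erasures.
assert all(rank_gf2(MASKS[i] + MASKS[j]) == 4
           for i, j in combinations(range(4), 2))
```

The same rank test generalizes to other (n, k) array codes: the code is MDS exactly when every k-subset of columns gives a full-rank system.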
RAIN: Distributed Store
• Encode data with an (n, k) array code
• Store one symbol per node
[diagram: the code columns (a, d+c), (b, d+a), (c, a+b), (d, b+c), one per disk]
RAIN: Distributed Retrieve
• Retrieve encoded data from any k nodes and reconstruct the data
• Reliability (similar to RAID systems)
• Performance: load-balancing (a busy or failed node is simply bypassed)
[diagram: the data a, b, c, d is reconstructed from the disks holding (a, d+c) and (c, a+b) while another disk is busy]