Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan...
-
Upload
bethanie-cook -
Category
Documents
-
view
215 -
download
0
Transcript of Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan...
Computing in the RAIN: A Reliable Array of Independent Nodes
Group A3Ka Hou Wong
Jahanzeb FaizanJonathan Sippel
Introduction
Presenter: Ka Hou Wong
Introduction RAIN
Research collaboration between Caltech and Jet Propulsion Laboratory
Goal Identify and develop key building
blocks for reliable distributed systems built with inexpensive off-the-shelf components
Hardware Platform Heterogeneous cluster of computing and/or
storage nodes connected via multiple interfaces through a network of switches
C0 C1 C2 C3 C4
S0 S1
C5 C6 C7 C8 C9
S2 S3
C = Computer
S = Switch
Software Platform Collection of software modules that run
in conjunction with operating system services and standard network protocols
Network Connections
Application MPI/PVM
TCP/IP
RAIN
Ethernet Myrinet ATM Servernet
Key Building Blocks For Distributed Computer Systems Communication
Fault-tolerant communication topologies
Reliable communication protocols Fault Management
Group membership techniques Storage
Distributed data storage schemes based on error-control codes
Features of RAIN Communication
Provides fault tolerance in the network via the following mechanisms
Bundled interfaces Link monitoring Fault-tolerant interconnect topologies
Features of RAIN (cont’d) Group membership
Identifies healthy nodes that are participating in the cluster
Data storage Uses redundant storage schemes
over multiple disks for fault tolerance
Communication
Presenter: Jahanzeb Faizan
Communication Fault-tolerant interconnect topologies Network interfaces
Fault-tolerant Interconnect Technologies Goal
To connect computer nodes to a network of switches in order to maximize the network’s resistance to partitioning
SSS
S S
SSS
C
C
C
CC C
CC
How do you connect n nodes to a ring of n switches?
Naïve Approach Connect the computer nodes to the
nearest switches in a regular fashion
SSS
S S
SSS
C
C
C
C C
C
C
C
1-fault-tolerant
The network is easily partitioned with two switch failures
Diameter Construction Approach Connect computer nodes to the switching
network in the most non-local way possible Computer nodes are connected to maximally
distant switches Nodes of degree 2 connected between
switches should form a diameter
Diameter Construction Approach (cont’d)Construction (Diameters). Let ds = 4 and dc = 2. i, 0 < i < n, label all compute nodes ci and switches si. Connect switch si to s(i+1)mod n, i.e., in a ring. Connect node ci to switches si and s(i+ n/2 +1)mod n.S0
S1S6
S2
S4 S3
S5
C0
C1
C2C3
C4
C5
C6
n = 7
S0
S1S7
S6 S2
S4
S3S5
C2
C1
C0
C7C6
C5
C4
C3
n = 8
Can tolerate 3 faults of any kind without partitioning the network
Protocol for Link Failure Goal
Monitoring of available paths Requirements
Correctness Bounded Slack Stability
Correctness Must correctly reflect the true
state of the channel
Bi-directional Communication
A B If one side sees timeouts…
Both sides should mark the channel as being down
Bounded Slack Ensure that both have a maximum
slack of n transactions
Link History
Time
U = link upD = link down
DD
U
D
U
DU
U
BA
DD
U
D
U
D
U
D
U
D
UU
BANode A sees many more transactions than node B
Nodes A and B see tightly
coupled views of the channel
Stability Each real channel event (i.e. time-
out) should cause at most some bounded number state transactions at each endpoint
Consistent-History Protocol for Link Failures Monitor available paths in the
network for proper functioning Modified Ping Protocol guarantees
each side of communication channel sees the same history (bounded slack)
The Protocol Reliable Message Passing Implementation:
Sliding window protocol Existing reliable communication layer
not needed Reliable messaging built on top of
ping messages
The Protocol (cont’d)
Protocol
Sending and receiving of token using reliable messaging
Tokens are sent on request
Consistent history maintained
Sending and receiving of Ping messages using unreliable messaging
Detect when link is up or down
Implemented by Pings or hardware feedback
Demonstration
Downt = 1
Upt = 2
Downt = 2
Downt = 0
Upt = 1
T/0 T/1 T/0
tout/1
tout/1
T/1
T/1
t: token countT: token arrival eventtout: time-out event
trigger event / token sent
Start
Group Membership
Presenter: Jonathan Sippel
Group Membership Provides a level of agreement
between non-faulty processes in a distributed application
Tolerates permanent and transient failures in both nodes and links
Based on two mechanisms Token Mechanism 911 Mechanism
Token Mechanism Nodes in the membership are ordered in
a logical ring Token passed at a regular interval from
one node to the next Token carries the authoritative
knowledge of the membership Node updates its local membership
information according to the received token
Token Mechanism (cont’d) Aggressive Failure Detection
A D
B C
A D
B C
Token Mechanism (cont’d) Conservative Failure Detection
A D
B C
A D
B C
911 Mechanism When is the 911 Mechanism used?
Token Regeneration - Regenerate a token that is lost if a node or a link fails
Dynamic Scalability - Add a new node to the system
What is a 911 message? Request for the right to regenerate the lost
token Must be approved by all the live nodes in
the membership
Token Regeneration Only one node is allowed to regenerate the token Token sequence number is used to guarantee
mutual exclusivity and is incremented every time the token is passed from one node to the next
Each node makes a local copy of the token on receipt
Sequence number on the node’s local copy of the token is added to the 911 message and compared to all the sequence numbers on the local copies of the token on the other live nodes
911 request is denied by any node with a more recent copy of the token
Dynamic Scalability 911 message sent by a new node
to join the group Receiving node
Treats the message as a join request because the originating node is not in the membership
Updates the membership the next time it receives the token and sends it to the new node
Data Storage The RAIN system provides a
distributed storage system based on a class of erasure-correcting codes called array codes that provide a mathematical means of representing data so lost information can be recovered
Data Storage (cont’d) Array codes
With an (n, k) erasure-correcting code, k symbols of original data are represented with n symbols of encoded data
With an m-erasure-correcting code, the original data can be recovered even if m symbols of the encoded data are lost
A code is said to be Maximum Distance Separable (MDS) if m = n – k
The only operations necessary to encode/decode an array code are simple binary XOR operations
Data Storage (cont’d)
A+C+d+eF+B+c+dE+A+b+cD+F+a+bC+E+f+aB+D+e+f
FDDCBA
fedcba
Data Placement Scheme for a (6, 4) Array Code
Data Storage (cont’d)
A+C+d+eF+B+c+dE+A+b+cD+F+a+b??
FDDC??
fedc??
Data Placement Scheme for a (6, 4) Array Code
A = C + d + e + (A + C + d + e)b = A + (E + A + b + c) + c + Ea = b + (D + F + a + b) + D + F
B = a + c + (F + B + c + d) + d
Data Storage (cont’d) Distributed store/retrieve operations
For a store operation a block of data of size d is encoded into n symbols, each of size d/k, using an (n, k) MDS array code
For a retrieve operation, symbols are collected from any k nodes and decoded
The original data can be recovered with up to n – k node failures
The encoding scheme provides for dynamic reconfigurability and load balancing
RAIN Contributions to Distributed Computing Systems Fault-tolerant interconnect
topologies and communication protocols providing consistent error reporting of link failures
Fault management techniques based on group membership
Data storage schemes based on computationally efficient error-control codes
References Vasken Bohossian, Chenggong C. Fan,
Paul S. LeMahieu, Marc D. Riedel, Lihao Xu, Jehoshua Bruck, “Computing in the RAIN: A Reliable Array of Independent Nodes,” IEEE Transactions On Parallel and Distributed Systems, Vol. 12, No. 2, February 2001
http://www.rainfinity.com/