Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan...

Computing in the RAIN: A Reliable Array of Independent Nodes

Group A3

Ka Hou Wong, Jahanzeb Faizan, Jonathan Sippel

Introduction

Presenter: Ka Hou Wong

Introduction: RAIN

Research collaboration between Caltech and the Jet Propulsion Laboratory

Goal: Identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components

Hardware Platform: Heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces through a network of switches

[Figure: hardware platform with ten computers (C0 to C9) and four switches (S0 to S3). C = Computer, S = Switch.]

Software Platform: Collection of software modules that run in conjunction with operating system services and standard network protocols

[Figure: software layers as labeled on the slide: Application, MPI/PVM, TCP/IP, RAIN, and Network Connections (Ethernet, Myrinet, ATM, Servernet).]

Key Building Blocks for Distributed Computer Systems

Communication: Fault-tolerant communication topologies; reliable communication protocols

Fault Management: Group membership techniques

Storage: Distributed data storage schemes based on error-control codes

Features of RAIN

Communication: Provides fault tolerance in the network via bundled interfaces, link monitoring, and fault-tolerant interconnect topologies

Features of RAIN (cont’d)

Group membership: Identifies healthy nodes that are participating in the cluster

Data storage: Uses redundant storage schemes over multiple disks for fault tolerance

Communication

Presenter: Jahanzeb Faizan

Communication: Fault-tolerant interconnect topologies; network interfaces

Fault-tolerant Interconnect Topologies

Goal: To connect computer nodes to a network of switches in order to maximize the network’s resistance to partitioning

[Figure: a ring of switches (S) surrounded by computer nodes (C) that are still to be connected.]

How do you connect n nodes to a ring of n switches?

Naïve Approach

Connect the computer nodes to the nearest switches in a regular fashion

[Figure: a ring of switches (S) with each computer node (C) attached to its nearest switches.]

This arrangement is only 1-fault-tolerant: the network is easily partitioned by two switch failures

Diameter Construction Approach

Connect computer nodes to the switching network in the most non-local way possible

Computer nodes are connected to maximally distant switches

Nodes of degree 2 connected between switches should form a diameter

Diameter Construction Approach (cont’d)

Construction (Diameters). Let $d_s = 4$ and $d_c = 2$ be the degrees of the switches and of the compute nodes. For all $i$, $0 \le i < n$, label the compute nodes $c_i$ and the switches $s_i$. Connect switch $s_i$ to $s_{(i+1) \bmod n}$, i.e., in a ring. Connect node $c_i$ to switches $s_i$ and $s_{(i + \lfloor n/2 \rfloor + 1) \bmod n}$.

[Figure: the diameter construction for n = 7 (switches S0 to S6, nodes C0 to C6) and for n = 8 (switches S0 to S7, nodes C0 to C7).]

The construction can tolerate 3 faults of any kind without partitioning the network
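The construction rule can be tried out on a small instance. The Python sketch below builds the link list for the diameter construction and, for comparison, for a naïve layout. It rests on two assumptions: the naïve layout attaches c_i to s_i and s_(i+1), and the offset in the construction is floor(n/2) + 1. The function names and the connectivity test (a breadth-first search over the surviving elements) are illustrative and not part of RAIN.

```python
from collections import deque


def ring_links(n):
    """Links of a ring of n switches: s_i -- s_((i+1) mod n)."""
    return [(("S", i), ("S", (i + 1) % n)) for i in range(n)]


def diameter_topology(n):
    """Diameter construction: c_i -- s_i and c_i -- s_((i + n//2 + 1) mod n)."""
    links = ring_links(n)
    for i in range(n):
        links.append((("C", i), ("S", i)))
        links.append((("C", i), ("S", (i + n // 2 + 1) % n)))
    return links


def naive_topology(n):
    """Assumed naive layout: c_i -- s_i and c_i -- s_((i+1) mod n)."""
    links = ring_links(n)
    for i in range(n):
        links.append((("C", i), ("S", i)))
        links.append((("C", i), ("S", (i + 1) % n)))
    return links


def is_connected(links, failed=frozenset()):
    """True if all elements that still have a working link form one component.

    Elements listed in `failed` are removed together with their links.
    Elements left without any working link are ignored (treated as lost).
    """
    adjacency = {}
    for a, b in links:
        if a in failed or b in failed:
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    if not adjacency:
        return True
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        for neighbor in adjacency[queue.popleft()]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(adjacency)
```

For n = 8, failing switches S0 and S4 cuts the ring into two arcs. In the naïve layout no computer node bridges the arcs, whereas in the diameter layout node C1 (attached to S1 and S6) still joins them.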

Protocol for Link Failure

Goal: Monitoring of available paths

Requirements: Correctness, bounded slack, stability

Correctness

Must correctly reflect the true state of the channel

Bi-directional communication between A and B: if one side sees time-outs, both sides should mark the channel as being down

Bounded Slack

Ensure that both endpoints have a maximum slack of n transitions

[Figure: link histories over time at nodes A and B (U = link up, D = link down). In one history, node A sees many more transitions than node B; in the other, nodes A and B see tightly coupled views of the channel.]

Stability

Each real channel event (i.e., time-out) should cause at most some bounded number of state transitions at each endpoint

Consistent-History Protocol for Link Failures

Monitors available paths in the network for proper functioning

A modified ping protocol guarantees that each side of the communication channel sees the same history (bounded slack)

The Protocol

Reliable message passing

Implementation: sliding window protocol

An existing reliable communication layer is not needed; reliable messaging is built on top of ping messages

The Protocol (cont’d)

Tokens are sent and received using reliable messaging: tokens are sent on request, and a consistent history is maintained

Ping messages are sent and received using unreliable messaging: they detect when the link is up or down, which can be implemented by pings or hardware feedback

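Demonstration

[Figure: state machine of the consistent-history protocol. States shown: Down (t = 0), Down (t = 1), Down (t = 2), Up (t = 1), Up (t = 2), plus a Start marker. Edge labels shown: T/0, T/1, tout/1, written as trigger event / token sent. Legend: t = token count, T = token arrival event, tout = time-out event.]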

Group Membership

Presenter: Jonathan Sippel

Group Membership

Provides a level of agreement between non-faulty processes in a distributed application

Tolerates permanent and transient failures in both nodes and links

Based on two mechanisms: the token mechanism and the 911 mechanism

Token Mechanism

Nodes in the membership are ordered in a logical ring

The token is passed at a regular interval from one node to the next

The token carries the authoritative knowledge of the membership

Each node updates its local membership information according to the received token

Token Mechanism (cont’d): Aggressive Failure Detection

[Figure: two diagrams of the ring of nodes A, B, C, D.]

Token Mechanism (cont’d): Conservative Failure Detection

[Figure: two diagrams of the ring of nodes A, B, C, D.]

911 Mechanism

When is the 911 mechanism used?

Token regeneration: regenerate a token that is lost if a node or a link fails

Dynamic scalability: add a new node to the system

What is a 911 message?

A request for the right to regenerate the lost token

Must be approved by all the live nodes in the membership

Token Regeneration

Only one node is allowed to regenerate the token

A token sequence number is used to guarantee mutual exclusivity; it is incremented every time the token is passed from one node to the next

Each node makes a local copy of the token on receipt

The sequence number on the requesting node’s local copy of the token is added to the 911 message and compared with the sequence numbers on the local copies held by the other live nodes

The 911 request is denied by any node with a more recent copy of the token
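The approval rule can be illustrated with a minimal Python sketch. The names `approves` and `may_regenerate` and the dictionary of sequence numbers are hypothetical; the sketch also assumes that no two local copies carry the same sequence number, which follows if the number is incremented on every pass.

```python
def approves(own_seq, request_seq):
    """A live node denies the request only if its own copy is more recent."""
    return own_seq <= request_seq


def may_regenerate(requester, local_seq, live_nodes):
    """True if every other live node approves the requester's 911 message.

    local_seq maps each node name to the sequence number on its local
    copy of the token; live_nodes lists the nodes that can still answer.
    """
    request_seq = local_seq[requester]
    return all(
        approves(local_seq[node], request_seq)
        for node in live_nodes
        if node != requester
    )


# Example: the token was lost while node C held it, so C cannot answer.
local_seq = {"A": 12, "B": 13, "C": 14, "D": 11}
live_nodes = ["A", "B", "D"]
winners = [node for node in live_nodes if may_regenerate(node, local_seq, live_nodes)]
```

In this example only node B, which holds the most recent surviving copy, collects approval from all live nodes, so at most one node regenerates the token.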

Dynamic Scalability

A 911 message is sent by a new node to join the group

The receiving node treats the message as a join request because the originating node is not in the membership

The receiving node updates the membership the next time it receives the token and then sends the token to the new node
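The join path can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the RAIN implementation: the `Token` and `Node` classes, the `pending_joins` list, and the choice to insert the newcomer directly after the receiving node in the ring are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Token:
    seq: int            # incremented on every pass
    members: List[str]  # ring order of the membership


@dataclass
class Node:
    name: str
    local_copy: Optional[Token] = None
    pending_joins: List[str] = field(default_factory=list)

    def on_911(self, sender: str) -> None:
        """Treat a 911 message from a non-member as a join request."""
        known = self.local_copy.members if self.local_copy else []
        if sender not in known and sender not in self.pending_joins:
            self.pending_joins.append(sender)

    def on_token(self, token: Token) -> str:
        """Apply pending joins, keep a local copy, return the next holder."""
        for newcomer in self.pending_joins:
            if newcomer not in token.members:
                # Assumption: the newcomer is placed right after this node.
                token.members.insert(token.members.index(self.name) + 1, newcomer)
        self.pending_joins.clear()
        self.local_copy = Token(token.seq, list(token.members))
        token.seq += 1  # the token is about to be passed on
        position = token.members.index(self.name)
        return token.members[(position + 1) % len(token.members)]
```

With a ring A, B, C and a 911 message from a new node E arriving at B, the next call to `on_token` on B inserts E after B and returns E as the next token holder.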

Data Storage

The RAIN system provides a distributed storage system based on a class of erasure-correcting codes called array codes, which provide a mathematical means of representing data so that lost information can be recovered

Data Storage (cont’d)

Array codes

With an (n, k) erasure-correcting code, k symbols of original data are represented with n symbols of encoded data

With an m-erasure-correcting code, the original data can be recovered even if m symbols of the encoded data are lost

A code is said to be Maximum Distance Separable (MDS) if m = n – k

The only operations necessary to encode/decode an array code are simple binary XOR operations
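The definitions can be made concrete with the simplest XOR-based code: a (3, 2) single-parity code, which is MDS because it corrects m = n – k = 1 erasure. This is a generic illustration, not one of the array codes used in RAIN; the function names are hypothetical.

```python
def xor_bytes(x: bytes, y: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    if len(x) != len(y):
        raise ValueError("symbols must have equal length")
    return bytes(p ^ q for p, q in zip(x, y))


def encode(a: bytes, b: bytes) -> list:
    """Encode k = 2 data symbols into n = 3 symbols: a, b, and a XOR b."""
    return [a, b, xor_bytes(a, b)]


def decode(symbols: list) -> tuple:
    """Recover (a, b) from three symbols, at most one of which is None."""
    a, b, parity = symbols
    if a is None:
        a = xor_bytes(b, parity)
    elif b is None:
        b = xor_bytes(a, parity)
    return a, b
```

Whichever single symbol is erased, the remaining two are enough to rebuild the original pair using XOR alone.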

Data Storage (cont’d)

Data Placement Scheme for a (6, 4) Array Code (each column is stored on one node)

A+C+d+e | F+B+c+d | E+A+b+c | D+F+a+b | C+E+f+a | B+D+e+f
F | E | D | C | B | A
f | e | d | c | b | a

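Data Storage (cont’d)

Data Placement Scheme for a (6, 4) Array Code, with two columns lost

A+C+d+e | F+B+c+d | E+A+b+c | D+F+a+b | ? | ?
F | E | D | C | ? | ?
f | e | d | c | ? | ?

The lost symbols a, A, b, and B are recovered from the surviving columns using XOR (written +):

A = C + d + e + (A + C + d + e)
b = A + (E + A + b + c) + c + E
a = b + (D + F + a + b) + D + F
B = F + c + (F + B + c + d) + d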
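The recovery chain for the (6, 4) placement can be checked mechanically. The Python sketch below encodes twelve data symbols into six columns following the placement on the slide (the column holding a and A carries parity B+D+e+f, and so on) and then rebuilds the two lost columns from the other four. Symbols are modeled as integers and + as bitwise XOR; the names `PARITY`, `encode`, and `recover_columns_a_b` are hypothetical.

```python
from functools import reduce
from operator import xor

# Column x holds data symbols x and X plus the XOR of the four symbols listed.
PARITY = {
    "a": "BDef",
    "b": "CEfa",
    "c": "DFab",
    "d": "EAbc",
    "e": "FBcd",
    "f": "ACde",
}


def encode(data):
    """Map 12 data symbols ('a'..'f', 'A'..'F') to 6 columns of 3 symbols."""
    return {
        x: (data[x], data[x.upper()], reduce(xor, (data[s] for s in PARITY[x])))
        for x in "abcdef"
    }


def recover_columns_a_b(columns):
    """Rebuild a, A, b, B from the surviving columns c, d, e, f."""
    c, C, p_c = columns["c"]  # p_c = D + F + a + b
    d, D, p_d = columns["d"]  # p_d = E + A + b + c
    e, E, p_e = columns["e"]  # p_e = F + B + c + d
    f, F, p_f = columns["f"]  # p_f = A + C + d + e
    A = C ^ d ^ e ^ p_f
    b = A ^ p_d ^ c ^ E
    a = b ^ p_c ^ D ^ F
    B = F ^ c ^ p_e ^ d
    return {"a": a, "A": A, "b": b, "B": B}
```

Each step uses only symbols that are either stored on a surviving column or were recovered in an earlier step, which is why the order A, b, a matters; B depends only on surviving symbols.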

Data Storage (cont’d)

Distributed store/retrieve operations

For a store operation, a block of data of size d is encoded into n symbols, each of size d/k, using an (n, k) MDS array code

For a retrieve operation, symbols are collected from any k nodes and decoded

The original data can be recovered with up to n – k node failures

The encoding scheme provides for dynamic reconfigurability and load balancing
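The sizes involved in a store operation follow directly from d, n, and k. A small Python sketch of the arithmetic, with a hypothetical helper name and the assumption that d is divisible by k:

```python
def storage_plan(d: int, n: int, k: int) -> dict:
    """Sizes for storing a block of d bytes with an (n, k) MDS code."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    if d % k:
        raise ValueError("this sketch assumes d is divisible by k")
    symbol_size = d // k
    return {
        "symbol_size": symbol_size,        # size of each of the n symbols
        "total_stored": n * symbol_size,   # bytes written across all nodes
        "tolerated_failures": n - k,       # nodes that may be lost
    }


plan = storage_plan(4096, 6, 4)
```

For a 4096-byte block and a (6, 4) code, each node stores a 1024-byte symbol, 6144 bytes are stored in total, and any 4 of the 6 nodes suffice for a retrieve.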

RAIN Contributions to Distributed Computing Systems

Fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures

Fault management techniques based on group membership

Data storage schemes based on computationally efficient error-control codes

References

Vasken Bohossian, Chenggong C. Fan, Paul S. LeMahieu, Marc D. Riedel, Lihao Xu, Jehoshua Bruck, “Computing in the RAIN: A Reliable Array of Independent Nodes,” IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 2, February 2001

http://www.rainfinity.com/
