
Computing in the Reliable Array of Independent Nodes

Vasken Bohossian, Charles Fan, Paul LeMahieu, Marc Riedel, Lihao Xu, Jehoshua Bruck

May 5, 2000

IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems

California Institute of Technology

Marc Riedel

RAIN Project

Collaboration:

• Caltech’s Parallel and Distributed Computing Group www.paradise.caltech.edu

• JPL’s Center for Integrated Space Microsystems www.csmt.jpl.nasa.gov

RAIN Platform

Heterogeneous network of nodes and switches.

[Diagram: compute nodes attached to multiple switches and a bus network]

RAIN Testbed

www.paradise.caltech.edu

• 10 Pentium boxes with multiple NICs

• 4 eight-way Myrinet switches

Proof of Concept: Video Server

[Diagram: nodes A, B, C, D connected through two switches]

Video client & server on every node.

Limited Storage

Insufficient storage to replicate all the data on each node.

k-of-n Code

Erasure-correcting code: each of n = 4 columns stores one data symbol and one parity symbol:

data:   a   | b   | c   | d
parity: d+c | d+a | a+b | b+c

Recover the data from any k = 2 of the n = 4 columns. For example, from columns (a, d+c) and (c, a+b): b = a + (a+b) and d = c + (d+c).
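As a concrete illustration, the 2-of-4 code above can be sketched in a few lines. This is a reconstruction of the XOR array code on the slide (symbols as integers, "+" as XOR), not the project's actual implementation:

```python
# 2-of-4 XOR array code: column i stores data symbol x[i] plus the parity
# x[(i+2) % 4] ^ x[(i+3) % 4]  (column 0 holds a and d+c, column 1 holds
# b and d+a, etc.). Any 2 of the 4 columns recover all four data symbols.

def encode(x):
    """x: list of 4 integer symbols -> 4 columns of (data, parity)."""
    assert len(x) == 4
    return [(x[i], x[(i + 2) % 4] ^ x[(i + 3) % 4]) for i in range(4)]

def decode(columns):
    """columns: dict {column index: (data, parity)} with >= 2 entries."""
    known = {i: d for i, (d, _) in columns.items()}    # directly read symbols
    parity = {i: p for i, (_, p) in columns.items()}   # parity symbols
    # Each parity equation: parity[i] == x[(i+2)%4] ^ x[(i+3)%4].
    # Propagate known symbols through the equations until all four are known.
    while len(known) < 4:
        progress = False
        for i, p in parity.items():
            a, b = (i + 2) % 4, (i + 3) % 4
            if a in known and b not in known:
                known[b] = p ^ known[a]; progress = True
            elif b in known and a not in known:
                known[a] = p ^ known[b]; progress = True
        if not progress:
            raise ValueError("not enough columns to decode")
    return [known[i] for i in range(4)]

cols = encode([10, 20, 30, 40])
# Lose columns 1 and 3; recover everything from columns 0 and 2:
print(decode({0: cols[0], 2: cols[2]}))   # [10, 20, 30, 40]
```

The same propagation works for every pair of surviving columns, which is exactly the 2-of-4 property.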

Encoding

Encode video using the 2-of-4 code.

Decoding

Retrieve data and decode.

Node Failure

Dynamically switch to another node.

Link Failure

Dynamically switch to another network path.

Switch Failure

Dynamically switch to another network path.

Node Recovery

Continuous reconfiguration (e.g., load-balancing).

Features

High availability:
• tolerates multiple node/link/switch failures
• no single point of failure

Efficient use of resources:
• multiple data paths
• redundant storage
• graceful degradation

Dynamic scalability/reconfigurability

(Certified Buzz-Word Compliant)

RAIN Project: Goals

Efficient, reliable distributed computing and storage systems.

Key building blocks: Networks, Communication, Storage, Applications.

Topics

Today's Talk:

• Fault-Tolerant Interconnect Topologies

• Connectivity

• Group Membership

• Distributed Storage

Interconnect Topologies

Goal: lose at most a constant number of nodes for a given network loss.

[Diagram: computing/storage nodes (N) attached to a network]

Resistance to Partitions

Large partitions are problematic for distributed services and computation.

[Diagram: a network failure splitting the nodes into two large partitions]

Related Work

Embedding hypercubes, rings, meshes, and trees in fault-tolerant networks:
• Hayes et al., Bruck et al., Boesch et al.

Bus-based networks that are resistant to partitioning:
• Ku and Hayes, 1997, "Connective Fault-Tolerance in Multiple-Bus Systems"

A Ring of Switches

A naïve solution: degree-2 compute nodes and degree-4 switches, arranged as a ring of switches with the nodes hanging off the ring.

[Diagram: a ring of switches (S), each with compute nodes (N) attached]

Such a ring is easily partitioned: two well-separated switch failures split it.
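The "easily partitioned" claim can be made concrete with a small brute-force connectivity check. The topology encoding and function names below are illustrative assumptions, not from the RAIN paper:

```python
# Brute-force partition check for a ring-of-switches topology:
# switch i links to switch (i+1) mod n, and each compute node attaches
# to two switches. After some switches fail, do the surviving compute
# nodes still form one connected group?

def is_partitioned(n, node_switches, failed):
    """True if surviving compute nodes no longer form one connected group."""
    alive = [s for s in range(n) if s not in failed]
    adj = {f"S{s}": set() for s in alive}
    for s in alive:                      # ring edges between live switches
        t = (s + 1) % n
        if t not in failed:
            adj[f"S{s}"].add(f"S{t}"); adj[f"S{t}"].add(f"S{s}")
    live_nodes = []
    for i, (a, b) in enumerate(node_switches):
        links = [s for s in (a, b) if s not in failed]
        if not links:
            continue                     # node lost entirely, not a partition
        live_nodes.append(f"N{i}")
        adj[f"N{i}"] = set()
        for s in links:
            adj[f"N{i}"].add(f"S{s}"); adj[f"S{s}"].add(f"N{i}")
    if not live_nodes:
        return False
    seen, stack = set(), [live_nodes[0]]  # BFS/DFS from one surviving node
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(adj[v] - seen)
    return any(nd not in seen for nd in live_nodes)

# Naive ring: node i sits between adjacent switches i and i+1.
n = 8
naive = [(i, (i + 1) % n) for i in range(n)]
# Two well-separated switch failures already split the surviving nodes:
print(is_partitioned(n, naive, {0, 4}))   # True
```

The same checker can be pointed at other attachment patterns (such as the diagonal construction below) to compare their fault tolerance.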

Resistance to Partitioning

11

11 22

33

44

55

66

77

88

22

33

4455

66

77

88nodes on diagonals

degree-2 compute nodes,degree-4 switches

Resistance to Partitioning

11

11 22

33

44

55

66

77

88

22

33

4455

66

77

88nodes on diagonals

degree-2 compute nodes,degree-4 switches

nodes on diagonals

degree-2 compute nodes,degree-4 switches

• tolerates any 3 switch failures (optimal)

• generalizes to arbitrary node/switch degrees.

Resistance to Partitioning

Details: paper IPPS’98, www.paradise.caltech.edu

22

33

55

77

88

22

33

4455

77

88

1

1

46

6

11

11 22

33

44

55

77

88

22

33

4455

66

77

88

66

11

22

33

44

55

66

77

88

11

22

33 44

55

66

7788

Resistance to Partitioning

Isomorphic

Details: paper IPPS’98, www.paradise.caltech.edu

Point-to-Point Connectivity

Is the path from A to B up or down?

[Diagram: nodes A and B at opposite ends of a network of nodes]

Connectivity

Link is seen as up or down by each node.

[Diagram: Node A and Node B each see the link state as U (up) or D (down)]

Bi-directional communication. Each node sends out pings; a node may time out, deciding the link is down.

Consistent History

[Timeline: nodes A and B each record their local history of link-state transitions (U, D, U, ...); the protocol keeps the two histories consistent]

The Slack

[Timeline: A's transition history runs ahead of B's (A is 1 ahead, then 2 ahead); now A will wait for B to transition]

Slack n=2: at most 2 unacknowledged transitions before a node waits.
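The slack mechanism can be sketched as follows; the class and method names are illustrative assumptions, not the RAIN API:

```python
# Sketch of the slack rule: an endpoint may run at most `slack` link-state
# transitions ahead of what its peer has acknowledged; beyond that it
# holds its current state and waits for the peer to catch up.

class LinkMonitor:
    def __init__(self, slack=2):
        self.slack = slack
        self.state = "U"           # local view of the link: U(p) or D(own)
        self.transitions = 0       # transitions we have made so far
        self.peer_acked = 0        # transitions the peer has acknowledged

    def can_transition(self):
        # At most `slack` unacknowledged transitions before waiting.
        return self.transitions - self.peer_acked < self.slack

    def observe(self, link_ok):
        """Record a new observation; toggle state only if slack allows."""
        desired = "U" if link_ok else "D"
        if desired != self.state and self.can_transition():
            self.state = desired
            self.transitions += 1
        return self.state

    def ack(self, count):
        """Peer acknowledges having seen `count` of our transitions."""
        self.peer_acked = max(self.peer_acked, count)

a = LinkMonitor(slack=2)
a.observe(False)   # U -> D (1 unacknowledged transition)
a.observe(True)    # D -> U (2 unacknowledged transitions)
a.observe(False)   # blocked: A now waits for B
print(a.state, a.transitions)   # U 2
a.ack(1)
a.observe(False)   # acknowledged, so the transition is allowed again
print(a.state, a.transitions)   # D 3
```

Bounding the gap this way is what lets both endpoints report the same sequence of channel errors.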

Consistent History

Consistency in error reporting: if A sees a channel error, B sees a channel error.

Birman et al.: “Reliability Through Consistency”


Details: paper IPPS’99, www.paradise.caltech.edu

Group Membership

[Diagram: nodes A, B, C, D, each holding the membership view ABCD]

Consistent global view given local, point-to-point connectivity information, despite:
• link/node failures
• dynamic reconfiguration

Related Work

Systems: Totem, Isis/Horus, Transis

Theory: Chandra et al., impossibility of group membership in an asynchronous environment

Token-Ring based Group Membership Protocol

[Diagram: nodes A, B, C, D passing a token around a ring]

The token carries:

• the group membership list

• a sequence number

As the token circulates, each receiving node increments the sequence number and records it (1: ABCD, 2: ABCD, 3: ABCD, 4: ABCD, ...).

Node or link fails: if a node is inaccessible, it is excluded and bypassed (e.g., 5: ACD).

Node with token fails: if the token is lost, it is regenerated. When competing regenerated tokens exist (5: ACD vs. 6: AD), the highest sequence number prevails.

Node recovers: recovering nodes are added back to the membership list (7: ADC).

Features:

• Unicast messages

• Dynamic reconfiguration

• Mean time-to-failure > convergence time

Details: publication forthcoming.
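A toy simulation of the protocol steps above; the data layout (sequence number, membership list, holder) and function names are illustrative assumptions, not the RAIN implementation:

```python
# Token-ring membership: the token circulates and bumps a sequence number;
# an unreachable node is excluded and bypassed; a lost token is regenerated
# from the survivors' state, and the highest sequence number prevails.

class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.last_seq = 0                 # highest sequence number seen

def advance(token, nodes):
    """Pass the token one hop around the ring, excluding dead nodes."""
    seq, members, holder = token
    ring = list(members)
    i = ring.index(holder)
    while True:
        i = (i + 1) % len(ring)
        nxt = ring[i]
        if nodes[nxt].alive:
            break
        members = [m for m in members if m != nxt]   # exclude and bypass
    nodes[nxt].last_seq = seq + 1
    return (seq + 1, members, nxt)

def regenerate(nodes, members):
    """Token lost: survivors propose tokens; highest sequence number wins."""
    alive = [m for m in members if nodes[m].alive]
    best = max(alive, key=lambda m: nodes[m].last_seq)
    return (nodes[best].last_seq + 1, alive, best)

nodes = {c: Node(c) for c in "ABCD"}
token = (0, list("ABCD"), "D")            # D holds the token, sequence 0
for _ in range(4):                        # one full round: sequence 1..4
    token = advance(token, nodes)

nodes["B"].alive = False                  # B fails
token = advance(token, nodes)             # token moves on to A
token = advance(token, nodes)             # B is excluded and bypassed
print(token)                              # (6, ['A', 'C', 'D'], 'C')

nodes["C"].alive = False                  # the token holder fails
token = regenerate(nodes, token[1])       # survivors regenerate the token
print(token)                              # (6, ['A', 'D'], 'A')
```

Note how the regenerated token reuses sequence number 6: the node with the highest recorded sequence number wins the regeneration, so stale competing tokens are discarded.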

Distributed Storage

[Diagram: a bit stream (1 0 1 0 0 1 ...) striped across four disks]

Focus: reliability and performance.

Array Codes

The "B-code": each column stores one data symbol and one redundancy symbol.

data:       a   | b   | c   | d
redundancy: d+c | d+a | a+b | b+c

Ideally suited for distributed storage; low encoding/decoding complexity. The data (a, b, c, d) can be recovered from any k of the n columns: e.g., b = a + (a+b) and d = c + (d+c).

B-Code and X-Code:
• optimally redundant
• optimal encoding/decoding complexity

Details: IEEE Trans. Info Theory, www.paradise.caltech.edu

Summary

• Fault-Tolerant Interconnect Topologies

• Connectivity

• Group Membership

• Distributed Storage

Proof-of-Concept Applications

• RAINVideo: high-availability video server

• RAINCheck: distributed checkpoint rollback/recovery system

• SNOW: Stable Network of Webservers

Rainfinity

www.rainfinity.com

Start-up based on RAIN technology

Business Plan: clustered solutions for Internet data centers, focusing on:

• availability

• scalability

• performance

Rainfinity

Company:

• Founded Sept. 1998

• Released first product April 1999

• Received $15 million funding in Dec. 1999

• Now over 50 employees

Future Research

• Development of APIs

• Fault-Tolerant Distributed Filesystem

• Fault-Tolerant MPI/PVM implementation

End of Talk

Material that was cut...

Erasure Correcting Codes

Strategy: encode data with an erasure-correcting code.

[Diagram: k data symbols encoded into n symbols; up to m coordinates are lost; the k data symbols are reconstructed from the survivors]

A code is optimally redundant (MDS) if k = n − m. Example: Reed-Solomon code.

RAIN: Distributed Store

[Diagram: columns (a, d+c), (b, d+a), (c, a+b), (d, b+c), one per disk]

• Encode data with an (n, k) array code

• Store one symbol per node

RAIN: Distributed Retrieve

[Diagram: encoded columns retrieved from the four disks]

• Retrieve encoded data from any k nodes

• Reconstruct data (a, b, c, d)
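Because any k of the n nodes suffice to decode, a client is free to prefer the least busy nodes. A minimal sketch, assuming a per-node load metric (all names here are illustrative):

```python
# Load-balanced retrieval: with an (n, k) erasure code, any k nodes can
# serve a read, so pick the k nodes with the lowest current load.

def choose_replicas(loads, k):
    """loads: {node: load metric}. Return the k least-loaded nodes."""
    ranked = sorted(loads, key=loads.get)   # ascending by load
    return ranked[:k]

loads = {"A": 0.9, "B": 0.2, "C": 0.5, "D": 0.1}   # e.g. queue depth
print(choose_replicas(loads, 2))   # ['D', 'B']
```

This is the freedom the next slides exploit: a busy or failed node is simply left out of the chosen k.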

RAIN: Distributed Retrieve

• Reliability (similar to RAID systems)

• Performance: load-balancing (if a node is busy, retrieve from any other k nodes instead)