Replica Control for Peer-to-Peer Storage Systems


Transcript of Replica Control for Peer-to-Peer Storage Systems.

Page 1: Replica Control for Peer-to-Peer Storage Systems.

Replica Control for Peer-to-Peer Storage Systems

Page 2: Replica Control for Peer-to-Peer Storage Systems.

P2P

• Peer-to-peer (P2P) has emerged as an important paradigm for sharing resources at the edges of the Internet.

• The most widely exploited resource is storage, as typified by P2P music file sharing:
– Napster
– Gnutella

• Following the great success of P2P file sharing, a natural next step is to develop wide-area P2P storage systems that aggregate storage across the Internet.

Page 3: Replica Control for Peer-to-Peer Storage Systems.

Replica Control Protocol

• Replication
– maintain multiple copies of some critical data to increase availability
– reduce read access times

• Replica Control Protocol
– avoid inconsistent updates
– guarantee a consistent view of the replicated data

Page 4: Replica Control for Peer-to-Peer Storage Systems.

Resiliency Requirement

• Need data replication
– Even if some nodes fail, the computation can progress
– Consistency requirement
– Failures may partition the network
– Rejoining nodes need to use consistency control algorithms

Page 5: Replica Control for Peer-to-Peer Storage Systems.

One-copy equivalence consistency criteria

• The set of replicas must behave as if there were only a single copy. Conditions to ensure one-copy equivalence are:
– no two write operations can proceed at the same time
– no pair of a read operation and a write operation can proceed at the same time
– a read operation always returns the replica that the last write operation wrote
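To make the first two conditions concrete, here is a minimal Python sketch (not from the slides; all names are invented for illustration) that enforces write-write and read-write exclusion on a single node with a readers-writer discipline:

    import threading

    class OneCopyGuard:
        # Enforces the two exclusion conditions: writes are mutually
        # exclusive, and a read never overlaps a write (reads may overlap).
        def __init__(self):
            self._write_lock = threading.Lock()   # held by a writer or the reader group
            self._readers = 0
            self._readers_lock = threading.Lock()

        def begin_read(self):
            with self._readers_lock:
                self._readers += 1
                if self._readers == 1:
                    self._write_lock.acquire()    # first reader blocks writers

        def end_read(self):
            with self._readers_lock:
                self._readers -= 1
                if self._readers == 0:
                    self._write_lock.release()

        def begin_write(self):
            self._write_lock.acquire()            # excludes writers and readers

        def end_write(self):
            self._write_lock.release()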

Page 6: Replica Control for Peer-to-Peer Storage Systems.

Replica Control Methods

• Optimistic
– Proceed with computation on the available subgroup
– Optimistically merge the subgroups later while restoring consistency

• Pessimistic
– Restrict computations with worst-case assumptions
– Approaches
• Primary site
• Voting

Page 7: Replica Control for Peer-to-Peer Storage Systems.

Optimistic Approach

• Version vector for file f
– an N-element vector, where N is the number of nodes on which f is stored
– the ith element represents the number of updates done by node i

• A vector V dominates V’ if
– every element in V >= the corresponding element in V’

• The vectors conflict if neither dominates
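The dominance test can be written in a few lines of Python; this is a minimal sketch with invented function names, not code from the slides:

    from typing import List

    def dominates(v: List[int], w: List[int]) -> bool:
        # V dominates V' when every element of V is >= the corresponding element of V'.
        return len(v) == len(w) and all(a >= b for a, b in zip(v, w))

    def in_conflict(v: List[int], w: List[int]) -> bool:
        # Two version vectors conflict when neither dominates the other.
        return not dominates(v, w) and not dominates(w, v)

    # Node 0 has applied one more update than the second replica has seen: no conflict.
    assert dominates([2, 1, 0], [1, 1, 0])
    # Concurrent updates at nodes 0 and 1: a genuine conflict.
    assert in_conflict([2, 0, 0], [0, 1, 0])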

Page 8: Replica Control for Peer-to-Peer Storage Systems.

Optimistic (cont’d)

• Consistency resolution
– If V dominates V’, the replicas are inconsistent; this can be resolved by copying the data under V to the replica holding V’
– If V and V’ conflict, the inconsistency cannot be resolved automatically

• Version vectors can resolve only update conflicts; they cannot resolve read-write conflicts
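A sketch of the resolution rule, reusing dominates from the previous sketch; representing a replica as a (version_vector, data) pair is an assumption made for illustration:

    class UpdateConflict(Exception):
        # Neither vector dominates: version vectors alone cannot resolve this.
        pass

    def reconcile(replica_a, replica_b):
        (va, da), (vb, db) = replica_a, replica_b   # (version_vector, data) pairs
        if dominates(va, vb):
            return (va, da)    # b is stale: overwrite it with a's data
        if dominates(vb, va):
            return (vb, db)
        raise UpdateConflict("concurrent updates; application-level resolution needed")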

Page 9: Replica Control for Peer-to-Peer Storage Systems.

Primary Site Approach

• Data is replicated on at least k+1 nodes (for k-resiliency)

• One node acts as the primary site (PS)
– Any read request is served by the PS
– Any write request is copied to all other backup sites
– Any write request arriving at a backup site is forwarded to the PS
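A toy Python sketch of this read/write forwarding; the class and helper names are assumptions for illustration, not an actual system API:

    class Node:
        def __init__(self, node_id):
            self.node_id = node_id
            self.data = {}
            self.primary = self    # set by make_group()
            self.backups = []      # populated on the PS only

        def read(self, key):
            return self.primary.data.get(key)          # reads are served by the PS

        def write(self, key, value):
            if self.primary is not self:
                return self.primary.write(key, value)  # backups forward writes to the PS
            self.data[key] = value
            for b in self.backups:                     # the PS copies the write to all backups
                b.data[key] = value

    def make_group(nodes):
        # The first node becomes the PS; with k backups the group is k-resilient.
        ps = nodes[0]
        ps.backups = nodes[1:]
        for n in nodes:
            n.primary = ps
        return ps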

Page 10: Replica Control for Peer-to-Peer Storage Systems.

PS Failure Handling

• If a backup fails, there is no interruption in service
• If the PS fails, there are two possibilities
– If the network is not partitioned
• Choose another node in the set as the primary
• If checkpointing has been active, only a restart from the previous checkpoint is needed
– If the network is partitioned
• Only the partition containing the PS can make progress
• The other partitions stop updates on the data
• It is necessary to distinguish between site failures and network partitions

Page 11: Replica Control for Peer-to-Peer Storage Systems.

Witnesses

Witness: a small entity that maintains enough information to identify the replicas that contain the most recent version of the data.

– This information could be a timestamp recording the time of the latest update.

– The timestamp can be replaced by a version number, an integer incremented each time the data are updated.
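A witness can be sketched as a tiny bookkeeping object that holds no data at all; the structure below (version-number variant) is an assumption for illustration:

    class Witness:
        # Tracks only a version number and the IDs of up-to-date replicas.
        def __init__(self):
            self.version = 0
            self.current_replicas = set()

        def record_write(self, replica_ids):
            self.version += 1                  # incremented on every update
            self.current_replicas = set(replica_ids)

        def freshest(self, replica_versions):
            # Given {replica_id: version}, return the replicas holding the latest version.
            return {r for r, v in replica_versions.items() if v == self.version}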

Page 12: Replica Control for Peer-to-Peer Storage Systems.

Voting Approach

• V votes are distributed among n replicas such that
– Vr + Vw > V
– Vw + Vw > V

• Obtain Vr or more votes to read

• Obtain Vw or more votes to write

• Quorum systems are more general than voting
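A brief Python sketch of these constraints (Gifford-style weighted voting); the function names are illustrative assumptions:

    def valid_vote_assignment(v_total, vr, vw):
        # Read and write quorums intersect (Vr + Vw > V), and any two
        # write quorums intersect (Vw + Vw > V): one-copy equivalence holds.
        return vr + vw > v_total and 2 * vw > v_total

    def has_quorum(collected_votes, threshold):
        # An operation proceeds once the gathered votes reach its threshold.
        return sum(collected_votes) >= threshold

    # Example: V = 5 votes, one per replica, with majority quorums Vr = Vw = 3.
    assert valid_vote_assignment(5, 3, 3)
    assert has_quorum([1, 1, 1], 3)    # replies from any three replicas suffice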

Page 13: Replica Control for Peer-to-Peer Storage Systems.

Quorum Systems

• Trees

• Grid-based (array-based)

• Torus

• Hierarchical

• Multi-column

and so on…

Page 14: Replica Control for Peer-to-Peer Storage Systems.

Classification of P2P Storage Sys.

• Unstructured
– “Replication Strategies for Highly Available Peer-to-peer Storage”
– “Replication Strategies in Unstructured Peer-to-peer Networks”

• Structured
– Read-only: CFS, PAST, LAR
– Read/write (mutable): Ivy, Oasis, Om, Eliot
– Sigma (for a mutual exclusion primitive)

Page 15: Replica Control for Peer-to-Peer Storage Systems.

Ivy

• Ivy stores a set of logs with the aid of a distributed hash table.

• Ivy keeps, for each participant, a log storing all of its updates, and maintains data consistency optimistically by performing conflict resolution among all logs (i.e., data consistency is maintained in a best-effort manner).

• The logs must be kept indefinitely, and a participant must scan all the logs related to a file to look up the up-to-date file data. Thus, Ivy is only suitable for small groups of participants.
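A sketch of the log-scanning lookup just described; Ivy actually orders log records with version vectors, but this illustration uses a scalar sequence stamp for brevity:

    from dataclasses import dataclass

    @dataclass
    class LogRecord:
        seq: int     # update order stamp (Ivy really uses version vectors)
        key: str
        value: str

    def lookup(logs, key):
        # Scan every participant's append-only log and return the most recent
        # write to `key`; cost grows with total log length, hence small groups.
        best = None
        for log in logs:
            for rec in log:
                if rec.key == key and (best is None or rec.seq > best.seq):
                    best = rec
        return best.value if best is not None else None

    logs = [[LogRecord(1, "a", "x")], [LogRecord(2, "a", "y")]]
    assert lookup(logs, "a") == "y"    # the later write wins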

Page 16: Replica Control for Peer-to-Peer Storage Systems.

Eliot

• Eliot relies on a reliable, fault-tolerant, immutable P2P storage substrate, Charles, to store data blocks, and uses an auxiliary metadata service (MS) to store mutable metadata.

• It supports NFS-like consistency semantics; however, these semantics incur high traffic between the MS and the client.

• It also supports AFS open-close consistency semantics; however, these semantics may cause lost updates.

• The MS is provided by a conventional replicated database, which may not fit dynamic P2P environments.

Page 17: Replica Control for Peer-to-Peer Storage Systems.

Oasis

• Oasis is based on Gifford’s weighted voting quorum concept and allows dynamic quorum membership.

• It spreads versioned metadata along with data replicas over the P2P network.

• To complete an operation on a data object, a client must first find a metadata record for the object and determine the total number of votes, the votes required for read/write operations, the replica list, and so on, to form a quorum accordingly.

• One drawback of Oasis is that if a node happens to use stale metadata, data consistency may be violated.
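A rough sketch of forming a quorum from such a metadata record; the field names and the greedy selection are assumptions for illustration, not Oasis’s actual interface:

    from dataclasses import dataclass, field

    @dataclass
    class Metadata:
        version: int
        total_votes: int
        read_votes: int          # votes required for a read quorum
        write_votes: int         # votes required for a write quorum
        replicas: dict = field(default_factory=dict)   # replica_id -> vote weight

    def form_read_quorum(meta):
        # Greedily pick the heaviest replicas until the read threshold is met.
        picked, votes = [], 0
        for rid, weight in sorted(meta.replicas.items(), key=lambda kv: -kv[1]):
            picked.append(rid)
            votes += weight
            if votes >= meta.read_votes:
                return picked
        return None    # not enough reachable votes: wait or retry

If the client’s copy of this record is stale (e.g., its vote thresholds are out of date), the quorums it forms may no longer intersect, which is exactly the consistency hazard noted above.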

Page 18: Replica Control for Peer-to-Peer Storage Systems.

Om

• Om is based on the concepts of automatic replica regeneration and replica membership reconfiguration.

• The consistency is maintained by two quorum systems: a read-one-write-all quorum system for accessing replicas, and a witness-modeled quorum system for reconfiguration.

• Om allows replica regeneration from a single replica. However, a write in Om is always first forwarded to the primary copy, which serializes all writes and uses a two-phase procedure to propagate each write to all secondary replicas.
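A toy sketch of this primary-serialized, two-phase write path; the prepare/commit structure is an assumption for illustration, not Om’s actual message set:

    class OmReplica:
        def __init__(self):
            self.data = {}      # committed, visible state
            self.staged = {}    # phase-1 updates, not yet visible

        def prepare(self, key, value):
            self.staged[key] = value

        def commit(self, key):
            self.data[key] = self.staged.pop(key)

    def om_write(primary, secondaries, key, value):
        # All writes funnel through the primary, which serializes them.
        for s in secondaries:
            s.prepare(key, value)    # phase 1: stage at every secondary
        primary.data[key] = value
        for s in secondaries:
            s.commit(key)            # phase 2: make the update visible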

• The drawbacks of Om are that (1) the primary replica may become a bottleneck; (2) the overhead incurred by the two-phase procedure may be too high; and (3) reconfiguration via the witness model has some probability of violating consistency.

Page 19: Replica Control for Peer-to-Peer Storage Systems.

Sigma

• The Sigma protocol collects state from all replicas to achieve mutual exclusion.

• The basic idea of the Sigma protocol is as follows. A node u wishing to win the mutual exclusion sends a timestamped request to each of the n (n = 3k+1) replicas and waits for replies. On receiving a request from u, a node v puts u’s request into a local queue ordered by timestamp, takes as the winner the node whose request is at the front of the queue, and replies to u with the winner’s ID.

Page 20: Replica Control for Peer-to-Peer Storage Systems.

Sigma (cont’d)

• When the number of replies received by u reaches m (m = 2k+1), u acts according to the following conditions:
(1) if at least m replies take u as the winner, then u is the winner;
(2) if at least m replies take w (w ≠ u) as the winner, then w is the winner and u just keeps waiting;
(3) if no node is regarded as the winner by at least m replies, then u sends a YIELD message to cancel its request temporarily and then re-inserts it.

• In this manner, one node can eventually be elected as the winner even when communication delay variance is large.
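A simplified simulation of the reply-counting logic above; the queue handling and names are assumptions for illustration, and the real protocol’s messages differ:

    import heapq
    from collections import Counter

    class ReplicaState:
        # Each replica queues requests by timestamp and votes for the queue head.
        def __init__(self):
            self.queue = []    # (timestamp, node_id) min-heap

        def on_request(self, ts, node_id):
            heapq.heappush(self.queue, (ts, node_id))
            return self.queue[0][1]    # reply with this replica's current winner

    def tally(replies, m):
        # Cases (1)/(2): some node gathered >= m votes; case (3): None => YIELD and retry.
        if not replies:
            return None
        winner, votes = Counter(replies).most_common(1)[0]
        return winner if votes >= m else None

    # k = 1: n = 3k+1 = 4 replicas, quorum m = 2k+1 = 3.
    replicas = [ReplicaState() for _ in range(4)]
    replies = [r.on_request(ts=1, node_id="u") for r in replicas]
    assert tally(replies, m=3) == "u"    # u is at the head everywhere, so u wins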

• A drawback of the Sigma protocol is that a node must send requests to all replicas and obtain favorable replies from a large portion (2/3) of the nodes to win the mutual exclusion, which incurs large overhead. Moreover, the overhead grows even larger under high contention.

Page 21: Replica Control for Peer-to-Peer Storage Systems.

MUREX comes to the rescue!