Pisa

Pisa: Block Distribution and Replication Framework. Fabrizio Manfredi Furuholmen, Federico Mosca (Beolink.org)

Description

Pisa is a decentralized block storage distribution and replication framework with the specific goal of simplifying the development of storage back-end services in a distributed environment. The main characteristics of the project are message security, cluster self-organization, and simple setup. Pisa is a subproject of the RestFS project, and the talk explains the experience acquired during the development of this subcomponent and the decisions taken in the design of the framework.

Transcript of Pisa

Page 1: Pisa


Pisa Block Distribution and Replication Framework

Fabrizio Manfredi Furuholmen, Federico Mosca

Page 2: Pisa

Buzzwords 2014

Agenda

Introduction: Overview, Problem, Common Pattern

Implementation: Data Placement, Data Consistency, Cluster Coordination, Data Transmission

Page 3: Pisa

Block Storage Devices

Pisa is a simple block data distribution and replication framework that runs on a wide range of nodes.

[Diagram: data blocks, identified by a key (the hash of the data), being transferred from existing nodes to newly joined nodes.]

Page 4: Pisa


Build a solution

Page 5: Pisa

What is it?

RestFS is a highly scalable, highly available network object storage.

Page 6: Pisa

Five pylons

Objects
• Separation between data and metadata
• Each element is marked with a revision
• Each element is marked with a hash

Cache
• Client side
• Callback/Notify
• Persistent

Transmission
• Parallel operation
• HTTP-like protocol
• Compression
• Transfer by difference

Distribution
• Resource discovery by DNS
• Data spread on a multi-node cluster
• Decentralized
• Independent clusters
• Data replication

Security
• Secure connection
• Client-side encryption
• Extended ACL
• Delegation/Federation
• Admin delegation

Page 7: Pisa


RestFS Key Words

RestFS

Cell: collection of servers

Bucket: virtual container, hosted by one or more servers

Object: entity (file, dir, …) contained in a Bucket

Page 8: Pisa

Object

An object is made of data and metadata:
• Data: segments composed of blocks (Block 1, Block 2, …, Block n), where each block carries a hash and a serial number
• Metadata: properties, ACL, extended properties, and attributes set by the user
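
As a rough illustration of the block layout above, here is a minimal Python sketch that splits an object's data into blocks, each with a serial number and a content hash. The block size, the use of SHA-1, and all names are assumptions made for the example, not taken from the RestFS code.

    import hashlib

    BLOCK_SIZE = 64 * 1024   # hypothetical block size; the slides do not give one

    def split_into_blocks(data, block_size=BLOCK_SIZE):
        # Split an object's data into fixed-size blocks; each block gets a serial
        # number (its position) and a hash of its content, mirroring the
        # Block / Hash / Serial layout above.
        blocks = []
        for serial, offset in enumerate(range(0, len(data), block_size)):
            chunk = data[offset:offset + block_size]
            blocks.append({'serial': serial,
                           'hash': hashlib.sha1(chunk).hexdigest(),
                           'data': chunk})
        return blocks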

Page 9: Pisa


Main Goal …

Storage as Lego Brick

The infrastructure has to be inexpensive, with high scalability and reliability.

Page 10: Pisa


Problems

Page 11: Pisa


Main Problem


Page 12: Pisa


Main Problem

Page 13: Pisa


CAP theorem

According to Brewer’s CAP theorem, it is impossible for any distributed computer system to simultaneously provide all three of Consistency, Availability and Partition Tolerance.

You cannot have all three at the same time and still get acceptable latency.

Page 14: Pisa


CAP

ACID (RDBMS)
• Atomic: everything in a transaction succeeds, or the entire transaction is rolled back
• Consistent: a transaction cannot leave the database in an inconsistent state
• Isolated: transactions cannot interfere with each other
• Durable: completed transactions persist, even when servers restart
• Strong consistency for transactions is the highest priority; pessimistic; complex mechanisms

BASE (NoSQL)
• Basic Availability
• Soft state
• Eventual consistency
• Availability and scaling are the highest priorities; weak consistency; optimistic; best effort; simple and FAST

Page 15: Pisa

First of all …

“Think as a child…”

Page 16: Pisa

Second …

“There is always a failure waiting around the corner”

(Werner Vogels)

Page 17: Pisa


Data Distribution

Replication

Data Placement

Data Consistency

Cluster Coordination

Data Transmission

Page 18: Pisa


Data Placement

Better distribution = partitioning
Parallel operation = parallel streams / multiple cores

Page 19: Pisa


Data Distribution: DHT

Distributed Hash Table

Blocks are distributed across partitions

Partitions are identified by a hash prefix

Partitions are hosted on servers

Part id   Node id
1         2
2         …

Node id   Node
1         obj
2         obj

Key (hash) = 0000010000; the leading prefix of the key is the Partition id

Page 20: Pisa

Data Distribution

Zero Hop Hash (consistent hashing)
• Partition location with 0 hops
• Adding 1% capacity moves about 1% of the data

Node
• Zone
• Weight

Partition: fixed-size array list
• Position = key prefix
• Value = node id

Shuffle: avoid sequential allocation

    part_list = array('H')                  # partition -> node id
    part_key_shift = 32 - part_exp          # keep the top part_exp bits of the hash
    part_count = 2 ** part_exp
    # partition id = top bits of the key hash (a 32-bit unsigned prefix is
    # assumed here to make the slide fragment self-contained)
    part_id = unpack_from('>I', sha(data).digest())[0] >> part_key_shift

    shuffle(part_list)

Example node attributes:

    ip = 10.1.0.1
    zone = 1
    weight = 3.0
    class = 1
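
To make the fragments above concrete, here is a minimal, self-contained sketch of how such a zero-hop ring could be built and queried. The weighting scheme, function names, and parameter values are our assumptions for illustration, not the actual Pisa implementation.

    import hashlib
    from array import array
    from random import shuffle
    from struct import unpack_from

    def build_ring(nodes, part_exp=8):
        # Assign 2**part_exp partitions to nodes in proportion to their weight,
        # then shuffle to avoid sequential allocation.
        part_count = 2 ** part_exp
        total_weight = sum(n['weight'] for n in nodes)
        part_list = array('H')
        for node_id, node in enumerate(nodes):
            share = int(round(part_count * node['weight'] / total_weight))
            part_list.extend([node_id] * share)
        del part_list[part_count:]               # trim rounding overflow
        while len(part_list) < part_count:       # pad rounding underflow
            part_list.append(0)
        shuffle(part_list)
        return part_list, 32 - part_exp          # table and partition shift

    def node_for_key(key, part_list, part_shift, nodes):
        # Zero-hop lookup: the top bits of the key hash give the partition id,
        # the partition table gives the node.
        part_id = unpack_from('>I', hashlib.sha1(key).digest())[0] >> part_shift
        return part_id, nodes[part_list[part_id]]

    nodes = [{'ip': '10.1.0.1', 'zone': 1, 'weight': 3.0},
             {'ip': '10.1.0.2', 'zone': 2, 'weight': 1.0}]
    ring, shift = build_ring(nodes)
    print(node_for_key(b'some-block-key', ring, shift, nodes))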

Page 21: Pisa


Data placement

Vnode-based / client-based

Replication

Page 22: Pisa


Data Distribution

Proximity-based

http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/

    # Walk the partition list to pick the remaining replicas, skipping nodes
    # that already hold a copy; the master node is always the first entry.
    node_ids = [master_node]
    zones = [self.nodes[node_ids[0]]]
    for replica in xrange(1, replicas):
        while self.part_list[part_id] in node_ids:
            part_id += 1
            if part_id >= len(self.part_list):
                part_id = 0
        node_ids.append(self.part_list[part_id])
    return [self.nodes[n] for n in node_ids]

Part   Serv
1      xxxx
2      yyyyy
3      zzzzz
4
5

Partition 1 will also be stored on nodes 2 and 3; the master node is always the first.

Page 23: Pisa


Data Consistency

http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/

To avoid a full ACID implementation while still guaranteeing consistency, some solutions leave ownership of the consistency algorithm to the client.

Page 24: Pisa


Data Consistency

http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/

Tunable trade-offs for distribution and replication (N, R, W)

The read operation is implemented with a hash check.
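
The slides do not show the quorum logic itself; the following is a minimal sketch of how tunable (N, R, W) writes and hash-checked reads could look. The node objects with put/get methods are hypothetical, and the block key is assumed to be the hash of the block content, as elsewhere in the deck.

    import hashlib

    N, R, W = 3, 2, 2   # example values; choosing R + W > N gives overlapping quorums

    def quorum_write(replicas, key, block):
        # Send the block to every replica of the partition; the write succeeds
        # once at least W nodes have acknowledged it (node.put is hypothetical).
        acks = sum(1 for node in replicas[:N] if node.put(key, block))
        return acks >= W

    def quorum_read(replicas, key):
        # Collect answers until R of them pass the hash check; since the block
        # key is the hash of its content, every answer can be verified locally.
        good = []
        for node in replicas[:N]:
            block = node.get(key)
            if block is not None and hashlib.sha1(block).hexdigest() == key:
                good.append(block)
            if len(good) >= R:
                return good[0]
        return None   # quorum not reached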

Page 25: Pisa


Cluster Coordination

Cluster communication

Table distribution

(routing table)

Failure detection

Joining / leaving nodes in the cluster

Page 26: Pisa


Cluster Coordination

Epidemic (Gossip)

epidemic: anybody can infect anyone else with equal probability

O(log n)

http://www.cis.cornell.edu/IAI/events/Gossip_Tutorial.pdf

Periodic anti-entropy exchanges among nodes ensure that they eventually converge, even if updates are lost.

Arbitrary pairs of replicas periodically establish contact and resolve all differences between their databases.

Hashes reduce the volume of data exchanged in the common case.
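
As a toy illustration of the anti-entropy exchange described above, here is a sketch of one push-pull gossip round between two nodes, where a hash digest of each table is compared first so that nothing is exchanged in the common case. The node objects, the versioned table entries, and all names are assumptions for the example.

    import hashlib
    import random

    def table_digest(table):
        # Compact summary of a routing table, used to detect divergence cheaply.
        items = ''.join('%s=%s;' % (k, v) for k, v in sorted(table.items()))
        return hashlib.sha1(items.encode()).hexdigest()

    def anti_entropy_round(local, peers):
        # One push-pull gossip round: pick a random peer, compare digests, and
        # merge the newer (higher-version) entries in both directions.
        peer = random.choice(peers)
        if table_digest(local.table) == table_digest(peer.table):
            return                                   # already converged
        for key, (version, value) in peer.table.items():
            if key not in local.table or local.table[key][0] < version:
                local.table[key] = (version, value)  # pull newer entries
        for key, (version, value) in local.table.items():
            if key not in peer.table or peer.table[key][0] < version:
                peer.table[key] = (version, value)   # push newer entries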

Page 27: Pisa


Cluster Coordination

Table items (routing table)
• Node table list
• Partition-to-node list

Bootstrap
• DNS name or IP at startup
• DNS lookup (SRV)
• Multicast

Transfer type
• Complete transfer
• Resync by diff (Merkle tree)
• Notification for a single change

• Join node
• Leave node
• Partition owner

Part   Serv
1      xxxx
2      …
3
4
5

Node ID   Object
1         xxxx
2         …
3
4
5

Segment   hash
1-100     xxxx
101-200   …
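
For the DNS SRV bootstrap path, a minimal sketch using the dnspython library could look like the following; the SRV record name is a made-up example and the library choice is ours, not necessarily what Pisa uses.

    import dns.resolver   # dnspython

    def bootstrap_nodes(srv_name='_pisa._tcp.example.org'):
        # Discover seed nodes through a DNS SRV lookup; lower priority first,
        # higher weight first within the same priority.
        answers = dns.resolver.resolve(srv_name, 'SRV')
        records = sorted(answers, key=lambda r: (r.priority, -r.weight))
        return [(str(r.target).rstrip('.'), r.port) for r in records]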

Page 28: Pisa

Cluster Coordination

Join flow (Node X, New Node Z, Node Y, Client):

1. New node Z bootstraps against node X.
2. X notifies the cluster of the new node.
3. Z claims partition x; the routing table entry for partition x now points to Z.
4. The table change is propagated via gossip; node Y accepts it.
5. A client requesting partition x is answered with the new owner.
6. The client requests partition x from Z, which returns the data.

If the data is not yet present on the new node, the new node acts as a proxy (lazy transfer).
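
A minimal sketch of the lazy-transfer behaviour described above could look like this, assuming a dict-like local store and a client for the previous owner; the class and method names are illustrative, not Pisa's API.

    class LazyPartitionOwner:
        # New owner of a partition that has not finished migrating its blocks yet.
        def __init__(self, local_store, previous_owner):
            self.local_store = local_store        # dict-like: block key -> data
            self.previous_owner = previous_owner  # client for the old owner node

        def get_block(self, key):
            block = self.local_store.get(key)
            if block is None:
                # Not migrated yet: act as a proxy towards the previous owner
                # and keep a local copy as blocks are pulled in.
                block = self.previous_owner.get_block(key)
                if block is not None:
                    self.local_store[key] = block
            return block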

Page 29: Pisa


Transport Protocol

ZeroMQ and MessagePack (RPC)

Cluster Communications

Client Data transfer

Partition replication/Relocation
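
As an illustration of the ZeroMQ plus MessagePack transport, here is a minimal request/reply client sketch; the endpoint, operation name, and message fields are invented for the example and do not reflect Pisa's actual wire protocol.

    import msgpack
    import zmq

    ctx = zmq.Context()

    # Client side of an RPC-style exchange: ask a node for a block.
    # Assumes a matching REP server is listening on the endpoint below.
    req = ctx.socket(zmq.REQ)
    req.connect('tcp://10.1.0.1:5555')                        # example endpoint
    req.send(msgpack.packb({'op': 'get_block', 'key': '0000010000'}))
    reply = msgpack.unpackb(req.recv(), raw=False)
    print(reply)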

Page 30: Pisa


Status

Eeeemmm… not really perfect …

Page 31: Pisa


Next

http://www.cs.rutgers.edu/~pxk/417/notes/23-lookup.html

Chord

Space-based / multi-dimensional

New data distribution model: Chord / cluster node

Vector clock

Rebalance, handover partition (weight change)

Locking

WAN replication (async)

Config Replication (pub/sub, event)

Server Priority

Page 32: Pisa


Thank you

http://restfs.beolink.org

[email protected]@gmail.com