myS3 Fabrizio Manfredi Furuholmen Federico Mosca
![Page 1: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/1.jpg)
Beolink.org
myS3
Fabrizio Manfredi Furuholmen, Federico Mosca
![Page 2: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/2.jpg)
Beolink.org
FOSDEM 2014
2
Agenda
• Introduction: Goals, Principles
• myS3: Architecture, Internals, Sub project
• Conclusion: Developments
![Page 3: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/3.jpg)
Beolink.org
3
Unsolved problem
![Page 4: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/4.jpg)
Beolink.org
4
Web Interface
“Amazon S3 provides a simple web-services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web…”
![Page 5: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/5.jpg)
Beolink.org S3
5
• Every file you upload to Amazon S3 is stored in a container called a bucket.
• Each bucket name must be unique.
• Each bucket can contain an unlimited number of objects (key/value).
• Buckets cannot be nested: you cannot create a bucket within a bucket.
• Object
– Id
– Version
– Metadata
– Subresources
– ACL
• HTTP REST call
• Byte-range transfer
• Parallel transfer
![Page 6: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/6.jpg)
Beolink.org myS3
6
Translate S3 Request to local Disk
![Page 7: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/7.jpg)
Beolink.org Mapping
7
S3 Bucket: a directory in the AFS space
S3 Object: a file or a directory, the directory
S3 ACL: fake object
AFS ACL permissions are returned as S3 metadata
Unix permissions are returned as S3 metadata
All other S3 features are not implemented
![Page 8: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/8.jpg)
Beolink.org S3 Request
8
GET /mybucket/puppy.jpg HTTP/1.1
User-Agent: dotnet
Host: s3.amazonaws.com
Date: Tue, 15 Jan 2008 21:20:27 +0000
x-amz-date: Tue, 15 Jan 2008 21:20:27 +0000
Authorization: AWS AKIAIOSFODNN7EXAMPLE:k3nL7gH3+PadhTEVn5EXAMPLE
Objects in the same bucket have no relation to each other: there is no hierarchy.
GET /mybucket/puppy.jpg
GET /mybucket/yesterday/puppy.jp
“yesterday” doesn’t exist
![Page 9: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/9.jpg)
Beolink.org S3 Request
9
To retrieve directory content:
- Prefix: the parent directory
- Delimiter: ‘/’ marks the end of the name
To create a directory:
- Object name with ‘/’ at the end
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>ExampleBucket</Name>
  <Prefix>/mydir/</Prefix>
  <Marker></Marker>
  <MaxKeys>1000</MaxKeys>
  <Delimiter>/</Delimiter>
  <IsTruncated>false</IsTruncated>
  <Contents>
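The prefix/delimiter listing described above can be sketched over a flat set of keys. This is a minimal illustration, not the myS3 code: the function name `list_bucket` and the sample keys are hypothetical, and the sketch ignores markers and `MaxKeys`.

```python
def list_bucket(keys, prefix="", delimiter="/"):
    """Emulate an S3-style listing over a flat set of object keys."""
    contents = []          # objects directly under the prefix
    common_prefixes = []   # "subdirectories" rolled up by the delimiter
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if rest == "":
            continue  # the directory marker object itself
        cut = rest.find(delimiter) if delimiter else -1
        if cut == -1:
            contents.append(key)
        else:
            rolled = prefix + rest[:cut + len(delimiter)]
            if rolled not in common_prefixes:
                common_prefixes.append(rolled)
    return contents, common_prefixes


# Hypothetical keys: the store is flat, "directories" exist only as names.
keys = [
    "mydir/",
    "mydir/puppy.jpg",
    "mydir/yesterday/",
    "mydir/yesterday/puppy.jpg",
    "other.txt",
]
contents, prefixes = list_bucket(keys, prefix="mydir/")
```

Keys that contain the delimiter after the prefix are rolled up into a common prefix, which is how a client can show "yesterday/" as a folder even though the store is flat.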
![Page 10: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/10.jpg)
Beolink.org AWS Auth
10
Authorization = "AWS" + " " + AWSAccessKeyId + ":" + Signature;
Signature = Base64( HMAC-SHA1( YourSecretAccessKeyID, UTF-8-Encoding-Of( StringToSign ) ) );
StringToSign = HTTP-Verb + "\n" +
    Content-MD5 + "\n" +
    Content-Type + "\n" +
    Date + "\n" +
    CanonicalizedAmzHeaders +
    CanonicalizedResource;
CanonicalizedResource = [ "/" + Bucket ] +
    <HTTP-Request-URI, from the protocol name up to the query string> +
    [ subresource, if present. For example "?acl", "?location", "?logging", or "?torrent" ];
CanonicalizedAmzHeaders = <described below>
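The signing formula above can be sketched with the Python standard library. This is an illustration under assumptions: the function names are hypothetical, the secret key is a made-up placeholder, and header canonicalization is left to the caller (passed in as already-built strings).

```python
import base64
import hashlib
import hmac


def string_to_sign(verb, content_md5, content_type, date, amz_headers, resource):
    """Build StringToSign exactly as laid out in the formula above."""
    return "\n".join([verb, content_md5, content_type, date]) + "\n" + amz_headers + resource


def sign(secret_key, to_sign):
    """Signature = Base64(HMAC-SHA1(secret, UTF-8(StringToSign)))."""
    digest = hmac.new(secret_key.encode("utf-8"), to_sign.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def authorization_header(access_key, secret_key, to_sign):
    """Authorization = "AWS" + " " + AWSAccessKeyId + ":" + Signature."""
    return "AWS " + access_key + ":" + sign(secret_key, to_sign)


# Placeholder secret for illustration only; no x-amz-* headers in this request.
to_sign = string_to_sign("GET", "", "", "Tue, 15 Jan 2008 21:20:27 +0000", "", "/mybucket/puppy.jpg")
header = authorization_header("AKIAIOSFODNN7EXAMPLE", "not-a-real-secret", to_sign)
```

A server checking the request rebuilds the same string from the incoming headers, signs it with the stored secret key, and compares the result with the signature it received.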
![Page 11: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/11.jpg)
Beolink.org Authentication
11
IP based: computer account; user authentication is handled by the internal DB
Impersonate: forge the ticket for the user on the server side; authentication is handled by the internal DB
Token generation: web interface authentication (Kerberos auth), one-time AWS token generation
![Page 12: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/12.jpg)
Beolink.org
12
Server Architecture (diagram)
Layers: Interface, Managers, Drivers, Plugin
Components: S3 Interface, Web Interface, Storage Manager, Auth Manager, Bucket Manager, Token Manager, Storage Driver, Cache, /afs
![Page 13: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/13.jpg)
Beolink.org InternalDB
13
Bucket DB: contains the map between the bucket name and the AFS path, e.g. Myhome -> /afs/beolink/home/manfred
Token DB: contains the access key and secret key for Amazon authentication; with web-based authentication the DB contains the Kerberos token
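As an illustration of the Bucket DB lookup, the sketch below maps a bucket name and an object key to a local path. It is not the myS3 code: `BUCKET_DB` and `resolve` are hypothetical names, and the traversal check is an added assumption about what a safe mapping needs.

```python
import posixpath

# Hypothetical in-memory Bucket DB, using the example mapping from the slide.
BUCKET_DB = {"Myhome": "/afs/beolink/home/manfred"}


def resolve(bucket, key, bucket_db=BUCKET_DB):
    """Translate (bucket, object key) into a path under the bucket directory."""
    root = bucket_db[bucket]  # raises KeyError for an unknown bucket
    path = posixpath.normpath(posixpath.join(root, key.lstrip("/")))
    # Refuse keys such as "../x" that would leave the bucket directory.
    if path != root and not path.startswith(root + "/"):
        raise ValueError("key escapes the bucket directory")
    return path


local_path = resolve("Myhome", "yesterday/puppy.jpg")
```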
![Page 14: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/14.jpg)
Beolink.org Storage Manager
14
NFS style: most operations are made on a temporary file (.NFSXXX)
Caching: save temporary files in non-AFS space
NoWait: return OK as soon as the file is on the S3 server
Mem: keep the transferred file in memory (max 100 MB)
ACL: enable write operations on AFS ACLs
MD5: enable or disable MD5
![Page 15: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/15.jpg)
Beolink.org TODO
15
• Parallel transfer
• Locking
• Kerberos token based
• Chunk transfer (HTTP 100) / byte-range transfer
• Create an interface for CloudStack
• Automatic volume release
![Page 16: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/16.jpg)
Beolink.org
16
RestFS
![Page 17: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/17.jpg)
Beolink.org
17
GOAL
Create a framework for testing new technologies and paradigms
![Page 18: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/18.jpg)
Beolink.org Principle 1/3
18
“Moving Computation is Cheaper than Moving Data”
![Page 19: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/19.jpg)
Beolink.org Principle 2/3
19
“There is always a failure waiting around the corner”
* Werner Vogels
![Page 20: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/20.jpg)
Beolink.org Principle 3/3
20
“Decompose into small loosely coupled, stateless building blocks”
* ‘Leaving a Legacy System Revisited’, Chad Fowler
![Page 21: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/21.jpg)
Beolink.org Five pylons
21
Objects
• Separation between data and metadata
• Each element is marked with a revision
• Each element is marked with a hash
Cache
• Client side
• Callback/Notify
• Persistent
Transmission
• Parallel operation
• HTTP-like protocol
• Compression
• Transfer by difference
Distribution
• Resource discovery by DNS
• Data spread on multi-node cluster
• Decentralized
• Independent clusters
• Data replication
Security
• Secure connection
• Client-side encryption
• Extended ACL
• Delegation/Federation
• Admin delegation
![Page 22: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/22.jpg)
Beolink.org
22
RestFS Key Words
RestFS
Cell: collection of servers
Bucket: virtual container, hosted by one or more servers
Object: entity (file, dir, …) contained in a Bucket
![Page 23: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/23.jpg)
Beolink.org Object
23
Object (diagram): an object is split into Data and Metadata.
• Data: Segments, made of Block 1, Block 2, …, Block n
• Metadata: Attributes set by user, Properties, ACL, Ext Properties
• Elements in the diagram carry Hash and Serial labels
![Page 24: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/24.jpg)
Beolink.org Bucket Discovery
24
Bucket discovery (diagram)
• Client → DNS lookup: bucket name → Cell RL IP list
• Client → Cell 1, Cell 2 (N servers each): bucket name → server list + load info
• Server list: priority list with columns Server, Priority, Type (IP 1, …)
![Page 25: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/25.jpg)
Beolink.org
25
RestFS Cache client side (diagram)
• Remote services: DNS, RestFS Metadata, RestFS Block (several), Federated Auth, Callbacks
• Client caches: Metadata cache, Block cache
• Persistent cache: Resource Locator, Server List, Tokens, Pub/Sub List
• Temporary: Locks
![Page 26: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/26.jpg)
Beolink.org
26
Server Architecture (diagram)
Layers: Interface, Managers, Drivers, Plugin
Interfaces: S3, RestFS RPC
Services: Meta Service, RL Service, Callback Service, Auth Service, Token Service, Block Service, Locks Service
Managers: Storage Mgr, Auth Manager, Meta Mgr, Resource Manager, Callbacks Manager, Token Manager, Locks Mgr
Drivers: Storage Driver, Token Driver, Meta Driver, Auth Driver, Callbacks Driver, Resource Driver, Locks Driver
Other labels: Distributed Cache, Auth, Resource Locator, Backends, Token Sub/Pub
![Page 27: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/27.jpg)
Beolink.org
27
Mounting
Cell
Bucket N
Objects
Cell
Bucket N
Objects
![Page 28: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/28.jpg)
Beolink.org
28
Object Versioning
Cell
Bucket N
Objects
Objects
Objects
The segment contains the diff against the upstream object
Each object knows the previous and the next. The current object knows the previous and the last
![Page 29: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/29.jpg)
Beolink.org
29
Block Storage
![Page 30: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/30.jpg)
Beolink.org
30
Backend: Consistent Hashing
Number of keys to move when adding or removing a node:
Keys / Nodes = keys to relocate
Blocks are collected in shards
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
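The relocation estimate above can be checked with a small consistent-hashing ring. This is a generic sketch, not the RestFS backend: the `Ring` class, the use of SHA-1 for ring positions, and the 100 virtual nodes per server are all assumptions made for the example.

```python
import bisect
import hashlib


def _h(value):
    """Map a string to a position on the ring."""
    return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Consistent-hashing ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._points = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._hashes = [p for p, _ in self._points]

    def node_for(self, key):
        # First ring point clockwise from the key's position, wrapping around.
        i = bisect.bisect(self._hashes, _h(key)) % len(self._points)
        return self._points[i][1]


keys = [f"block-{i}" for i in range(10000)]
before = Ring(["n1", "n2", "n3", "n4"])
after = Ring(["n1", "n2", "n3", "n4", "n5"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding one node")
```

Only keys that now belong to the new node change owner, so the number moved stays close to keys / nodes instead of reshuffling everything.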
![Page 31: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/31.jpg)
Beolink.org Block Storage
31
AFS
- A volume stores a range of hashes
- Each chunk is written in 3 volumes
- Server
PISA
- Cluster of nodes
- Communication based on zmq
- Consensus based on raft
CEPH
- Use CEPH nodes directly
![Page 32: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/32.jpg)
Beolink.org
32
Backend: Storage
3 copies
Configurable read and write consistency level and security:
- 2W1R
- 2W2R
- 1W1R
- …
Monitoring of neighboring nodes in small clusters of 3 nodes (GOSSIP)
Mini-cluster election: key space reclaim for replica coordination, leave/join cluster
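One way to compare the consistency levels listed above is the usual quorum rule: with N copies, a read is guaranteed to see the latest acknowledged write when R + W > N. The slides do not state this rule; the sketch applies it as an assumption, with N = 3 taken from "3 copies" and a hypothetical function name.

```python
def quorum_overlaps(n, w, r):
    """True when every read set must intersect every write set (R + W > N)."""
    return r + w > n


# (writes acknowledged, replicas read) for the levels named on the slide.
levels = {"2W1R": (2, 1), "2W2R": (2, 2), "1W1R": (1, 1)}
report = {name: quorum_overlaps(3, w, r) for name, (w, r) in levels.items()}
for name, strong in report.items():
    print(name, "overlapping quorums" if strong else "may read stale data")
```

Under this rule only 2W2R gives overlapping quorums with three copies; the other levels trade that guarantee for lower latency.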
![Page 33: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/33.jpg)
Beolink.org
33
Protocols
Europython 2013
![Page 34: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/34.jpg)
Beolink.org
34
RestFS Protocol
BSON, short for Binary JSON, is a binary-encoded serialization of JSON-like documents. BSON can be compared to binary interchange formats.
{"hello": "world"} → "\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00"
JSON-RPC is a lightweight remote procedure call protocol similar to XML-RPC. It is designed to be simple, and it is simple to convert into a Python dict.
--> { "method": "readBlock", "params": ["…"], "id": 1}
<-- { "result": [..], "error": null, "id": 1}
WebSocket is a web technology for multiplexing bi-directional, full-duplex communication channels over a single TCP connection. It uses the standard HTTP/HTTPS port.
GET /mychat HTTP/1.1
Host: server.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==
Sec-WebSocket-Protocol: chat
Sec-WebSocket-Version: 13
Origin: http://example.com
*Compression is a long story…
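The message layering above (a JSON-RPC call, optionally BSON-encoded) can be sketched with the standard library. This is an illustration under assumptions: `bson_encode_strings` is a hypothetical helper that handles only flat string-to-string documents, and `jsonrpc_request` simply mirrors the request shape shown on the slide. A real client would use a full BSON library.

```python
import json
import struct


def bson_encode_strings(doc):
    """Encode a flat dict of str -> str as a BSON document (string elements only)."""
    body = b""
    for name, value in doc.items():
        raw = value.encode("utf-8") + b"\x00"
        body += b"\x02" + name.encode("utf-8") + b"\x00"  # type 0x02 (string), key as C string
        body += struct.pack("<i", len(raw)) + raw          # int32 length incl. trailing NUL, then bytes
    total = 4 + len(body) + 1                              # size field + elements + terminator
    return struct.pack("<i", total) + body + b"\x00"


def jsonrpc_request(method, params, request_id):
    """Build a request in the shape shown above: method, params, id."""
    return json.dumps({"method": method, "params": params, "id": request_id})


encoded = bson_encode_strings({"hello": "world"})
request = jsonrpc_request("readBlock", ["…"], 1)
```

The leading `\x16` in the slide's example is the total document length (22 bytes) as a little-endian 32-bit integer, which is what the first `struct.pack` call produces.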
![Page 35: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/35.jpg)
Beolink.org Protocols Metadata
35
Europython 2013
Read blocks, collected per segment (parallel request per segment):
--> { "method": "readBlock", "params": ["bucket_name: test, segment:1, blocks:[1,2,3,4]"], "id": 1}
Check cached data:
--> { "method": "getSegmentVer", "params": ["bucket_name: test, segment:1"], "id": 1}
<-- { "result": [ver: 1335519328.091779], "error": null, "id": 1}
Block hash list for a specific segment:
--> { "method": "getSegmentHash", "params": ["bucket_name: test, segment:1"], "id": 1}
<-- { "result": [1:16db0420c9cc29a9d89ff89cd191bd2045e47378, 2:9bcf720b1d5aa9b78eb1bcdbf3d14c353517986c, …], "error": null, "id": 1}
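The three calls above suggest a simple cache check on the client: compare the segment version first, and only when it differs compare block hashes to decide what to download. The sketch below is one hypothetical way to do that; the function name, the cache layout, and the short placeholder hashes are assumptions.

```python
def blocks_to_fetch(cached, server_version, server_hashes):
    """Return the block positions whose cached copy is missing or stale."""
    if cached is not None and cached["ver"] == server_version:
        return []  # same segment version: the cached blocks are still valid
    old = cached["hashes"] if cached else {}
    return sorted(pos for pos, h in server_hashes.items() if old.get(pos) != h)


# Hypothetical cache entry and server answers (hashes shortened for readability).
cached = {"ver": "1335519328.091779", "hashes": {1: "16db04", 2: "aaaaaa"}}
server_hashes = {1: "16db04", 2: "9bcf72", 3: "158aa4"}
stale = blocks_to_fetch(cached, "1335519400.000001", server_hashes)
```

Only the blocks returned here would need a `readBlock` call; everything else is served from the local block cache.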
![Page 36: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/36.jpg)
Beolink.org
36
NOSQL DB
![Page 37: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/37.jpg)
Beolink.org
37
Redis performance
$ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q
SET: 552028.75 requests per second
GET: 707463.75 requests per second
LPUSH: 767459.75 requests per second
LPOP: 770119.38 requests per second
![Page 38: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/38.jpg)
Beolink.org
38
Code
![Page 39: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/39.jpg)
Beolink.org
39
Pluggable
Protocol
• Connection handler
• Data transcoding
Service
• High-level operations across multiple functions (like locking)
• Integrity operations/transactions
Manager
• Operations handler for a specific area (e.g. metadata)
• Split info into sub-info
Driver
• Read and write operations to the storage system, agnostic operation
Interface, dynamic load
![Page 40: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/40.jpg)
Beolink.org Support
40
![Page 42: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/42.jpg)
Beolink.org
42
Bucket
Europython 2013
![Page 43: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/43.jpg)
Beolink.org
43
Bucket
Europython 2013
Bucket name: zebra
Property (Python dict, default parameters):
segment_size = 512
block_size = 16k
max_read = 1000
Bucket_size = 0
Bucket_quota = 10000
storage_class = STANDARD
compression = none
logging = enable
bucket_type = fs
…
The bucket has many properties; the property element is a collection of object information. With this element you can retrieve the default values for the bucket (logging level, security level, etc.).
Properties objects:
- Property
- Property Ext
- Property ACL
- Property Stats
- Filesystem: the bucket is used as a filesystem
- Logging: logging of the operations done on the specific bucket
- Replica RO: bucket shadow replication
- …
- Custom definition
![Page 44: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/44.jpg)
Beolink.org
44
Objects
Europython 2013
![Page 46: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/46.jpg)
Beolink.org
46
MetaData Properties
Europython 2013
Object
zebra.c1d2197420bd41ef24fc665f228e2c76e98da247
Property
Object_type = data
segment_size = 512
block_size = 16k
content_type =
md5 = ab86d732d11beb65ed0183d6a87b9b0
max_read = 1000
storage_class = STANDARD
compression = none
Name = “my first object”
Object_size = 10000
Object_prev = zebra.c1d2197420bd41ef24fc665f228e2c76e98dartg…
vers: 1335519328.091779
Object id (the special id bucket_name.ROOT is the starting point of the file system)
Object default
Object version
Object hash (replaced by Merkle tree)
Pointer to the previous Object
Object type:
- Data: contains files
- Folder: special object that contains other objects
- Mount point: contains the name of the buckets
- Link: contains the name of the objects
- Immutable: gold image
- Custom: defined by the users
Bucket name
![Page 47: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/47.jpg)
Beolink.org
47
Metadata Segment
Europython 2013
Segment: Segment-1
Segment-id:
1:16db0420c9cc29a9d89ff89cd191bd2045e47378
2:9bcf720b1d5aa9b78eb1bcdbf3d14c353517986c
3:158aa47df63f79fd5bc227d32d52a97e1451828c
4:1ee794c0785c7991f986afc199a6eee1fa4
5:c3c662928ac93e206e025a1b08b14ad02e77b29d
…
vers:1335519328.091779
…
Segment element: block position : integrity hash
Version based on timestamp + incremental, useful for vector clock conflict resolution
Total segments = Data_size / (block_size * segment_size)
Python Dict
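The segment-count formula can be written out as a short helper. The sketch assumes that `block_size` is in bytes, that `segment_size` is the number of blocks per segment, and that a partially filled segment still counts as one (hence the rounding up); none of these is stated on the slide.

```python
import math


def total_segments(data_size, block_size, segment_size):
    """Data_size / (block_size * segment_size), rounded up to whole segments."""
    return math.ceil(data_size / (block_size * segment_size))


# Example with the slide's defaults: 16k blocks, 512 blocks per segment.
segments = total_segments(20 * 1024 * 1024, 16 * 1024, 512)
```

With these defaults one segment covers 8 MiB, so a 20 MiB object spans three segments.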
![Page 48: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/48.jpg)
Beolink.org
48
Restfs ID
Europython 2013
Id Bucket: plain text DNS name
Id Object: UUID random generation
Id segment and id block: based on the position of the content
Chunk data on the storage: SHA-1 hash of the concatenation of Bucket.object.segment.block_id
Id Object is unique inside the Bucket; with the bucket name the id is a UUID
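The id scheme above can be sketched as follows. The function names are hypothetical, and the exact string that gets hashed (dot-separated fields, UTF-8 encoded) is an assumption based on the "Bucket.object.segment.block_id" wording.

```python
import hashlib
import uuid


def new_object_id():
    """Randomly generated object id (UUID4, hex form)."""
    return uuid.uuid4().hex


def chunk_id(bucket, object_id, segment, block):
    """SHA-1 over the dot-joined bucket, object, segment and block ids."""
    name = f"{bucket}.{object_id}.{segment}.{block}"
    return hashlib.sha1(name.encode("utf-8")).hexdigest()


obj = new_object_id()
first = chunk_id("zebra", obj, 1, 1)
second = chunk_id("zebra", obj, 1, 2)
```

Because the chunk id depends only on names and positions, any node can recompute where a block lives without asking a central index.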
![Page 57: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/57.jpg)
Beolink.org
57
Cache
Europython 2013
![Page 58: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/58.jpg)
Beolink.org
58
Cache
Europython 2013
Server Side
Client Side
Distribute Cache
Publish Subscribe
Pattern matching
Persistent cache
![Page 59: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/59.jpg)
Beolink.org
59
Security
Europython 2013
![Page 60: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/60.jpg)
Beolink.org
60
Security
Europython 2013
Protocol
• SSL protocol
Authentication
• Token for devices (Enrollment)
• Session token for user
• External password provider
Data Integrity
• Encryption on block level
Authorization
• Extended ACL based on NFS4 ACL
• Admin delegation on the Bucket level
![Page 61: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/61.jpg)
Beolink.org
61
NOSQL DB
Europython 2013
![Page 62: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/62.jpg)
Beolink.org
62
redis as much as Possible
Europython 2013
Main characteristics
- Fast
- Store hash of hash
- Atomic operations
- Sub/Pub primitives
Hash of hash layout:
Primary key: object id, e.g. zebra.c1d2197420bd41ef24fc665f228e2c76e98da247 (dot format to simplify subscription operations (callback))
Subkey: name of the properties
Value: serialized Python dict (bson in the future), e.g. 00101010101010
GLP
* Version and hash of the objects have a dedicated subkey, no serialization
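The hash-of-hash layout can be illustrated without a Redis server by using a plain dict as a stand-in. Everything here is a sketch: `hset` and `hget` only mimic the shape of Redis hash operations, JSON stands in for the serialization format, and the sample property values are copied from the earlier slides.

```python
import json

# Stand-in for Redis: primary key -> {subkey -> value}.
store = {}


def hset(key, subkey, value):
    store.setdefault(key, {})[subkey] = value


def hget(key, subkey):
    return store[key][subkey]


def save_properties(bucket, object_id, name, props, version):
    key = f"{bucket}.{object_id}"        # dot format: bucket name, then object id
    hset(key, name, json.dumps(props))   # property group stored as a serialized dict
    hset(key, "vers", version)           # version kept in its own subkey, not serialized


def load_properties(bucket, object_id, name):
    return json.loads(hget(f"{bucket}.{object_id}", name))


save_properties(
    "zebra",
    "c1d2197420bd41ef24fc665f228e2c76e98da247",
    "Property",
    {"segment_size": 512, "block_size": "16k"},
    "1335519328.091779",
)
loaded = load_properties("zebra", "c1d2197420bd41ef24fc665f228e2c76e98da247", "Property")
```

Keeping the version in its own subkey means a client can check freshness without deserializing the whole property dict.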
![Page 66: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/66.jpg)
Beolink.org
66
What we are using
Module | Software
Storage | Filesystem, DHT (kademlia, Pastry*)
Metadata | SQL (mysql, sqlite), NoSQL (Redis)
Auth | OAuth (google, twitter, facebook), kerberos*, internal
Protocol | Websocket
Message Format | JSON-RPC 2.0, Amazon S3
Encoding | Plain, bson
CallBack | Subscribe/Publish Websocket/Redis, Async I/O TornadoWeb, ZeroMQ*
HASH | Sha-XXX, MD5-XXX, AES
Encryption | SSL, ciphers supported by crypto++
Discovery | DNS, file base*
* are planned
Europython 2013
![Page 67: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/67.jpg)
Beolink.org What is it good for?
67
User
• Home directory
• Remote/Internet disks
Application
• Object storage
• Shared space
• Virtual machine
Distribution
• CDN (multimedia)
• Data replication
• Disaster recovery
Europython 2013
![Page 68: myS3 Fabrizio Manfredi Furuholmen Federico Mosca](https://reader035.fdocuments.in/reader035/viewer/2022062813/56816679550346895dda176d/html5/thumbnails/68.jpg)
Beolink.org
68
Backend: Storage
Transport layer: ZeroMQ
Storage: compressed data
Europython 2013