ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL,...

51
ZBS SmartX Distributed Block Storage Kyle Zhang Cofounder & CTO SmartX 2018-04-19

Transcript of ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL,...

Page 1: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

ZBS:SmartX Distributed Block Storage

Kyle ZhangCofounder & CTO SmartX

2018-04-19

Page 2: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

About Me

q Kyle Zhang 张凯

q Computer Science, Tsinghua University

q Software Engineer, Infrastructure Team, Baidu

q Cofounder & CTO, SmartX

q Contributor, Sheepdog/InfluxDB

Page 3: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

About SmartX

q Founded by three geeks in 2013

q Focuses on distributed storage and virtualizationq Currently leading in distributed block storage technology

q Product has been deployed on thousands of hosts and running for 2 years

q Cooperates with Chinese leading hardware venders and cloud venders

q Has now raised 10s of millions of dollars

Page 4: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

What is Block Storage Used For

q Virtual Machine

q Database

q Container

Page 5: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Outlines

q How SmartX Build ZBS

q Roadmap of ZBS

Page 6: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Outlines

q How SmartX Build ZBS

q Roadmap of ZBS

Page 7: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

How to Build a Distributed Storage

Page 8: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

How to Build a Distributed Storage

q Meta data service

q Data storage engine

q Consensus protocol

Page 9: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

How to Build a Distributed Storage

q Meta data service

q Data storage engine

q Consensus protocol

Page 10: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Meta Data Service Requirements

q Reliableq Data should be protected by replicationq Failover

q High performanceq Average latency per request should be less than 5msq No necessary for high throughput

q Lightweightq Easy to operation

Page 11: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

How to Build Meta Data Service

q MySQL, LevelDB/RocksDB ...q No data protectionq No failover

q MongoDB/Cassandraq No ACIDq Hard to operation

Page 12: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

How to Build Meta Data Service

q Paxos/RAFTq Not available in 2013

q Zookeeperq Limited storage capacity

q DHT (Distributed Hash Table)q Loss control replica locationq Hard for load balance

Page 13: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Meta Data Service Solution

q Combined LevelDB and Zookeeperq Lightweight and stable

q Log replication

Page 14: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Meta Server Cluster

Elect leader through Zookeeper

Page 15: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Meta Server Cluster

Submit operation log to Zookeeper

Page 16: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Meta Server Cluster

Apply changes to local DB

Page 17: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Meta Server Cluster

Learn logs from Zookeeper and apply to local DB

Page 18: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Meta Server Cluster

Operation logs is cleaned after learned by all meta servers

Meta data is always consistent

Page 19: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Meta Server Failover

q Elect new leader through Zookeeper

q Consume all logs from Zookeeper

q Provide meta data service

Page 20: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Meta Server Summary

q Easy to understand and implementq Zookeeper: leader election and log replicationq LevelDB: meta data storage

q Fast enough for meta data service

q Failoverq Failover time: leader election + log consumption

q Deploymentq 3~5 Zookeeper and Meta Server in one cluster

Page 21: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Meta Server Summary

Page 22: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Meta Server Summary

Page 23: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

How to Build a Distributed Storage

q Meta data service

q Data storage engine

q Consensus protocol

Page 24: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Data Storage Engine Requirements

q Reliable

q Performance

q Efficiencyq CPU, Memory, IO Bandwidth ...q Space

q Easy to debug and upgrade

Page 25: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Kernel Space

q Kernel spaceq Hard for debugging and

managementq Upgrading is costly: host restartq Large failure domain: kernel

panic!q No context switch

Page 26: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Kernel Space vs. User Space

q User space Storage Engineq Flexible and easy for debuggingq Upgrading is cheap: process

restartq Isolated failure domain: process

crashq Context switch is costly

Page 27: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

LSM Tree

q LSM Treeq Easy to implementq Optimized for small objectq Read/Write amplification

Page 28: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Kernel Filesystems

q Kernel filesystemsq Too much features not

necessary for building storage engine: performance overhead

q Filesystem is designed for single disk

q Bad support for asynchronous IO

q Write amplification of duplicated journaling

Page 29: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Userspace Storage Engine

Context switch No context switch

Page 30: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Userspace Storage Engine

q IO Schedulerq Construct IO transaction and

submit to specified IO worker

q IO Workerq Execute IO transactionq Sequentialize overlapping IO

requests

q Journalq Providing transaction

implementation

Page 31: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

How to Build a Distributed Storage

q Meta data service

q Data storage engine

q Consensus protocol

Page 32: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Consensus Protocol Requirements

q Strong consistency

q Performanceq Fast enough for flash devices

Page 33: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

How to Build Consistent Protocol

q Basic-Paxos

q Multi-Paxos

q Raft

q ...

Page 34: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Generation-Based Protocol

q Generation is the version of dataq Write increase data generation by 1q Generation is persist along with dataq Request is valid only when generation match

q 1 leader can access dataq Leader is chosen by Meta Server

q 2 phasesq Prepare: only necessary for first IOq Commit: update both data and generation atomically

Page 35: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Prepare 1/3

Page 36: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Prepare 2/3

Page 37: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Prepare 3/3

Page 38: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Read

Page 39: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Write 1/2

Page 40: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Write 2/2

Page 41: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Membership Change 1/2

Page 42: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Membership Change 2/2

Page 43: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Consistency Protocol Summary

q Easy to understand and implementq Like Multi-Paxos and Raft: single leader (proposer)q Leader is chosen by leader of Meta Server leaderq Maintain leadership by updating lease

q About Generationq Valid membership and least valid generation is persist to Meta Data

Serviceq Transaction of updating data and generation is ensured by Storage Engine

q Performanceq Less RTT than Basic Paxos

Page 44: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

How to Build a Distributed Storage

Page 45: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Architecture of ZBS

Page 46: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

ZBS Interface

q libzbs (c library)q Optimized for performance

q iSCSI/NFSq Based on libzbsq Friendly for operating

system, hypervisor...

Page 47: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

IO Path

Page 48: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Scalability Evaluation

K

50K

100K

150K

200K

250K

300K

100% Read 75% Read 25% Write

100% Write

4K Random IOPS

2 nodes 4 nodes

~2.1x

~1.98x

~2.04x

q Node Configurationq SATA SSD * 2q SATA HDD * 4

q Replica factor: 2

Page 49: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Outlines

q Deep Dive into SmartX ZBS

q Roadmap of ZBS

Page 50: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Roadmap of ZBS

q More data protectionq Erasure codeq Cloud backupq ...

q More efficiencyq Compressionq Deduplicationq ...

Page 51: ZBS SmartXDistributed Block Storage - …...2018/04/28  · How to Build Meta Data Service q MySQL, LevelDB/RocksDB ... q No data protection q No failover q MongoDB/Cassandra q No

Roadmap of ZBS

q More intelligenceq Failure predictionq Cache algorithmq ...

q More performanceq SPDKq RDMAq ...