Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Replication is Expensive
HDFS inherits 3-way replication from the Google File System:
- Simple, scalable, and robust
- 200% storage overhead
- Secondary replicas rarely accessed
Erasure Coding Saves Storage
Simplified example: storing 2 bits (1 and 0)
- Replication: store an extra copy of each bit (1 0 → 1 0) — 2 extra bits
- XOR coding: store one parity bit (1 ⊕ 0 = 1) — 1 extra bit
Same data durability (can lose any 1 bit), half the storage overhead, but slower recovery.
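The 2-bit example can be sketched in a few lines of Python; the function names here are illustrative, not HDFS APIs:

```python
def xor_encode(data_bits):
    """Return a single parity bit: the XOR of all data bits."""
    parity = 0
    for b in data_bits:
        parity ^= b
    return parity

def xor_recover(surviving_bits, parity):
    """Recover one lost bit by XOR-ing the parity with the surviving bits."""
    lost = parity
    for b in surviving_bits:
        lost ^= b
    return lost

data = [1, 0]
parity = xor_encode(data)              # 1 ^ 0 = 1 -> only one extra bit
# Suppose the first bit (1) is lost; XOR with the survivor recovers it:
recovered = xor_recover([data[1]], parity)
print(parity, recovered)               # 1 1
```

Recovery requires reading every surviving bit, which is why the slide notes slower recovery than simply fetching a replica.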
Erasure Coding Saves Storage
Facebook
- f4 stores 65 PB of BLOBs in EC
Windows Azure Storage (WAS)
- A PB of new data every 1~2 days
- All “sealed” data stored in EC
Google File System
- Large portion of data stored in EC
Roadmap
Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
HDFS-EC Architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction
Hardware-accelerated Codec Framework
Durability and Efficiency
Data durability = how many simultaneous failures can be tolerated.
Storage efficiency = what portion of the raw storage holds useful (rather than redundant) data.
3-way Replication: data durability = 2; storage efficiency = 1/3 (33%)
XOR: data durability = 1; storage efficiency = 2/3 (67%)
XOR truth table:
X Y X⊕Y
0 0  0
0 1  1
1 0  1
1 1  0
A lost cell is recovered by XOR-ing the survivors: Y = 0 ⊕ 1 = 1
Reed-Solomon (RS): data durability = 2; storage efficiency = 4/6 (67%). Very flexible!
Comparison of schemes:
Scheme                  Data Durability   Storage Efficiency
Single Replica                0                100%
3-way Replication             2                 33%
XOR with 6 data cells         1                 86%
RS (6,3)                      3                 67%
RS (10,4)                     4                 71%
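The table's numbers follow from simple arithmetic: a (k, m) erasure code tolerates m failures at efficiency k / (k + m), while r-way replication tolerates r − 1 failures at efficiency 1 / r. A quick sketch:

```python
# Storage efficiency of a (k, m) erasure code is k / (k + m);
# it tolerates m simultaneous failures.
def ec_stats(k, m):
    return m, k / (k + m)

# r-way replication tolerates r - 1 failures at efficiency 1 / r.
def replication_stats(r):
    return r - 1, 1 / r

print(replication_stats(3))   # (2, ~0.33)  3-way replication
print(ec_stats(6, 1))         # (1, ~0.86)  XOR over 6 data cells
print(ec_stats(6, 3))         # (3, ~0.67)  RS(6,3)
print(ec_stats(10, 4))        # (4, ~0.71)  RS(10,4)
```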
EC in Distributed Storage
Block Layout — Contiguous Layout:
The file is divided into full 128 MB blocks stored whole: block 0 (0~128M) on DataNode 0, block 1 (128~256M) on DataNode 1, …, block 5 (640~768M) on DataNode 5; parity blocks go to DataNode 6 and beyond.
Data Locality 👍🏻 Small Files 👎🏻
EC in Distributed Storage
Block Layout — Striped Layout:
The file is divided into small cells (0~1M, 1~2M, …, 5~6M, 6~7M, …) written round-robin across the group's storage blocks: block 0 on DataNode 0, block 1 on DataNode 1, …, block 5 on DataNode 5; parity cells go to DataNode 6 and beyond. Each storage block still spans a 128 MB range (0~128M, 128~256M, …).
Data Locality 👎🏻 Small Files 👍🏻 Parallel I/O 👍🏻
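The round-robin cell placement above can be sketched as an offset-mapping function. This is a minimal sketch assuming 1 MB cells and 6 data blocks per group, as in the slide's example; it is not the actual HDFS client code:

```python
CELL_SIZE = 1 * 1024 * 1024   # 1 MB striping cell, as in the slide's example
DATA_BLOCKS = 6               # RS(6,3): 6 data blocks per group

def locate(file_offset):
    """Map a logical file offset to (block index, offset within that block).

    Cells are laid out round-robin across the group's data blocks, so
    consecutive 1 MB cells land on different DataNodes.
    """
    cell = file_offset // CELL_SIZE
    block_index = cell % DATA_BLOCKS      # which storage block in the group
    stripe = cell // DATA_BLOCKS          # which round of cells (stripe)
    offset_in_block = stripe * CELL_SIZE + file_offset % CELL_SIZE
    return block_index, offset_in_block

print(locate(0))                    # (0, 0): first cell goes to block 0
print(locate(5 * CELL_SIZE))        # (5, 0): cell 5 goes to block 5
print(locate(6 * CELL_SIZE + 10))   # (0, 1048586): wraps back to block 0
```

Because one stripe touches all six data blocks, even a small file spreads across many DataNodes — hence better small-file handling and parallel I/O, but worse locality.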
EC in Distributed Storage
Spectrum (redundancy form × block layout):
- Striped layout, replication or EC: Ceph, Quantcast File System (support striping under both forms)
- Contiguous layout, replication: HDFS
- Contiguous layout, EC: Facebook f4, Windows Azure
Roadmap
Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
HDFS-EC Architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction
Hardware-accelerated Codec Framework
Choosing Block Layout
Assuming (6,3) coding:
- Small files: < 1 block
- Medium files: 1~6 blocks
- Large files: > 6 blocks (more than 1 group)
Cluster A Profile:
- File count: small 96.29%, medium 1.86%, large 1.85%
- Space usage: small 26.06%, medium 9.33%, large 64.61%
- Top 2% of files occupy ~65% of space
Cluster B Profile:
- File count: small 86.59%, medium 11.38%, large 2.03%
- Space usage: small 23.89%, medium 36.03%, large 40.08%
- Top 2% of files occupy ~40% of space
Cluster C Profile:
- File count: small 99.64%, medium 0.36%, large 0.00%
- Space usage: small 76.05%, medium 20.75%, large 3.20%
- Dominated by small files
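The small/medium/large buckets used in these profiles can be sketched as a classifier, assuming the default 128 MB HDFS block size (an assumption; the slides do not state the block size explicitly):

```python
BLOCK = 128 * 1024 * 1024      # 128 MB HDFS block (assumed default)
GROUP = 6 * BLOCK              # one (6,3) group = 6 data blocks

def classify(file_size):
    """Bucket a file as in the cluster profiles: small / medium / large."""
    if file_size < BLOCK:
        return "small"         # < 1 block
    if file_size <= GROUP:
        return "medium"        # 1 ~ 6 blocks
    return "large"             # > 6 blocks (more than one group)

print(classify(4 * 1024 * 1024))     # small
print(classify(300 * 1024 * 1024))   # medium
print(classify(10 * GROUP))          # large
```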
Choosing Block Layout
Current HDFS uses the contiguous layout.
NameNode — Generalizing the Block Concept
Mapping logical blocks to storage blocks. Too many storage blocks? Hierarchical naming protocol.
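The hierarchical naming idea can be sketched as bit manipulation on block IDs. This is an illustrative sketch, assuming the group ID reserves its low 4 bits for a storage block's index within the group; the exact HDFS-EC bit layout may differ:

```python
INDEX_BITS = 4  # assumed: low 4 bits of the ID hold the in-group index

def storage_block_id(group_id, index):
    """Derive a storage block's ID from its logical group ID and index."""
    assert 0 <= index < (1 << INDEX_BITS)
    return group_id | index

def group_of(block_id):
    """Recover the logical block-group ID from a storage block ID."""
    return block_id & ~((1 << INDEX_BITS) - 1)

gid = 0xABC0                   # group IDs end in zeroed index bits
ids = [storage_block_id(gid, i) for i in range(9)]  # 6 data + 3 parity
print(group_of(ids[8]) == gid)
```

With this scheme the NameNode tracks one logical block per group and derives the nine storage-block IDs on demand, which is how the "too many storage blocks" problem is avoided.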
Client — Parallel Writing
One streamer per storage block, each with its own queue, managed by a Coordinator.
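The write path can be sketched as a coordinator feeding per-streamer queues. This is a minimal sketch: class and method names are illustrative (not the real HDFS client API), and a single XOR parity cell stands in for the real Reed-Solomon encoder's three parity cells:

```python
from queue import Queue

DATA = 6  # data streamers; one XOR parity streamer stands in for RS parity

def xor_parity(cells):
    """Stand-in for the real RS encoder: a single XOR parity cell."""
    out = bytearray(len(cells[0]))
    for cell in cells:
        for i, b in enumerate(cell):
            out[i] ^= b
    return bytes(out)

class Coordinator:
    def __init__(self):
        # One queue per streamer: 6 data streamers + 1 parity streamer.
        self.queues = [Queue() for _ in range(DATA + 1)]

    def write_stripe(self, data_cells):
        assert len(data_cells) == DATA
        for i, cell in enumerate(data_cells):
            self.queues[i].put(cell)                   # data streamers 0..5
        self.queues[DATA].put(xor_parity(data_cells))  # parity streamer

coord = Coordinator()
coord.write_stripe([bytes([i]) * 4 for i in range(DATA)])
print([q.get() for q in coord.queues])
```

Each streamer drains its own queue to one DataNode, so a full stripe is written to all blocks of the group in parallel.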
Client — Parallel Reading
Data blocks are read in parallel; parity blocks are fetched when data blocks are unavailable.
Reconstruction on DataNode
Important to avoid delaying the critical path:
- Especially if original data is lost
Integrated with the Replication Monitor:
- Under-protected EC blocks are scheduled together with under-replicated blocks
- New priority algorithms
New ErasureCodingWorker component on the DataNode
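The shared scheduling described above can be sketched with a single priority queue ordered by how close each block is to data loss. The priority rule below is an illustrative guess, not the exact HDFS algorithm:

```python
import heapq

def remaining_tolerance(live_units, needed_for_read):
    """How many more failures this block can survive.

    Replicated block: needed_for_read = 1 (any one replica suffices).
    RS(6,3) group:    needed_for_read = 6 (any 6 of 9 units suffice).
    """
    return live_units - needed_for_read

# EC groups and replicated blocks share one queue, lowest tolerance first.
pq = []
heapq.heappush(pq, (remaining_tolerance(7, 6), "EC group, 2 units lost"))
heapq.heappush(pq, (remaining_tolerance(1, 1), "replicated block, 2 replicas lost"))
heapq.heappush(pq, (remaining_tolerance(8, 6), "EC group, 1 unit lost"))

while pq:
    print(heapq.heappop(pq))   # most endangered blocks are reconstructed first
```

The point is that an EC group missing two of its three parity units is treated with the same urgency as a replicated block missing two of its three replicas.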
Roadmap
Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
HDFS-EC Architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction
Hardware-accelerated Codec Framework
Acceleration with Intel ISA-L
1 legacy coder:
- From Facebook’s HDFS-RAID project
2 new coders:
- Pure Java — code improvement over HDFS-RAID
- Native coder using Intel’s Intelligent Storage Acceleration Library (ISA-L)
Microbenchmark: Codec Calculation
Microbenchmark: HDFS I/O
Hive-on-Spark — locality sensitive
Conclusion
Erasure coding expands effective storage space by ~50%!
HDFS-EC phase I implements erasure coding in the striped block layout.
Upstream effort (HDFS-7285):
- Design finalized Nov. 2014
- Development started Jan. 2015
- 218 commits, ~25k LoC changed
- Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo (Japan)
Phase II will support contiguous block layout for better locality
Acknowledgements
Cloudera
- Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
Intel
- Kai Zheng, Uma Maheswara Rao G, Vinayakumar B, Yi Liu, Weihua Jiang
Hortonworks
- Jing Zhao, Tsz Wo Nicholas Sze
Huawei
- Walter Su, Rakesh R, Xinwei Qin
Yahoo (Japan)
- Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng
Questions?
Zhe Zhang, [email protected] | @oldcaphttp://zhe-thoughts.github.io/