Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Replication is Expensive
HDFS inherits 3-way replication from the Google File System:
- Simple, scalable, and robust
- 200% storage overhead
- Secondary replicas rarely accessed
Erasure Coding Saves Storage
Simplified example: storing 2 bits (1 and 0)
- Replication: store an extra copy of each bit (1 0 → 1 0) — 2 extra bits
- XOR coding: store one parity bit (1 ⊕ 0 = 1) — 1 extra bit
Same data durability (can lose any 1 bit), half the storage overhead, but slower recovery.
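The 2-bit example can be sketched in a few lines of Python; the function names here are illustrative, not HDFS APIs:

```python
def xor_encode(data_bits):
    """Return a single parity bit: the XOR of all data bits."""
    parity = 0
    for b in data_bits:
        parity ^= b
    return parity

def xor_recover(surviving_bits, parity):
    """Recover one lost bit by XOR-ing the parity with the surviving bits."""
    lost = parity
    for b in surviving_bits:
        lost ^= b
    return lost

data = [1, 0]
parity = xor_encode(data)              # 1 ^ 0 = 1 -> only one extra bit
# Suppose the first bit (1) is lost; XOR with the survivor recovers it:
recovered = xor_recover([data[1]], parity)
print(parity, recovered)               # 1 1
```

Recovery requires reading every surviving bit, which is why the slide notes slower recovery than simply fetching a replica.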
Erasure Coding Saves Storage
Facebook
- f4 stores 65 PB of BLOBs in EC
Windows Azure Storage (WAS)
- A PB of new data every 1~2 days
- All “sealed” data stored in EC
Google File System
- Large portion of data stored in EC
Roadmap
Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
HDFS-EC Architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction
Hardware-accelerated Codec Framework
Durability and Efficiency
Data durability = how many simultaneous failures can be tolerated.
Storage efficiency = what portion of the raw storage holds useful (rather than redundant) data.
3-way Replication: data durability = 2; storage efficiency = 1/3 (33%)
XOR: data durability = 1; storage efficiency = 2/3 (67%)
XOR truth table:
X Y X⊕Y
0 0  0
0 1  1
1 0  1
1 1  0
A lost cell is recovered by XOR-ing the survivors: Y = 0 ⊕ 1 = 1
Reed-Solomon (RS): data durability = 2; storage efficiency = 4/6 (67%). Very flexible!
Comparison of schemes:
Scheme                  Data Durability   Storage Efficiency
Single Replica                0                100%
3-way Replication             2                 33%
XOR with 6 data cells         1                 86%
RS (6,3)                      3                 67%
RS (10,4)                     4                 71%
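The table's numbers follow from simple arithmetic: a (k, m) erasure code tolerates m failures at efficiency k / (k + m), while r-way replication tolerates r − 1 failures at efficiency 1 / r. A quick sketch:

```python
# Storage efficiency of a (k, m) erasure code is k / (k + m);
# it tolerates m simultaneous failures.
def ec_stats(k, m):
    return m, k / (k + m)

# r-way replication tolerates r - 1 failures at efficiency 1 / r.
def replication_stats(r):
    return r - 1, 1 / r

print(replication_stats(3))   # (2, ~0.33)  3-way replication
print(ec_stats(6, 1))         # (1, ~0.86)  XOR over 6 data cells
print(ec_stats(6, 3))         # (3, ~0.67)  RS(6,3)
print(ec_stats(10, 4))        # (4, ~0.71)  RS(10,4)
```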
EC in Distributed Storage
Block Layout — Contiguous Layout:
The file is divided into full 128 MB blocks stored whole: block 0 (0~128M) on DataNode 0, block 1 (128~256M) on DataNode 1, …, block 5 (640~768M) on DataNode 5; parity blocks go to DataNode 6 and beyond.
Data Locality 👍🏻 Small Files 👎🏻
EC in Distributed Storage
Block Layout — Striped Layout:
The file is divided into small cells (0~1M, 1~2M, …, 5~6M, 6~7M, …) written round-robin across the group's storage blocks: block 0 on DataNode 0, block 1 on DataNode 1, …, block 5 on DataNode 5; parity cells go to DataNode 6 and beyond. Each storage block still spans a 128 MB range (0~128M, 128~256M, …).
Data Locality 👎🏻 Small Files 👍🏻 Parallel I/O 👍🏻
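The round-robin cell placement above can be sketched as an offset-mapping function. This is a minimal sketch assuming 1 MB cells and 6 data blocks per group, as in the slide's example; it is not the actual HDFS client code:

```python
CELL_SIZE = 1 * 1024 * 1024   # 1 MB striping cell, as in the slide's example
DATA_BLOCKS = 6               # RS(6,3): 6 data blocks per group

def locate(file_offset):
    """Map a logical file offset to (block index, offset within that block).

    Cells are laid out round-robin across the group's data blocks, so
    consecutive 1 MB cells land on different DataNodes.
    """
    cell = file_offset // CELL_SIZE
    block_index = cell % DATA_BLOCKS      # which storage block in the group
    stripe = cell // DATA_BLOCKS          # which round of cells (stripe)
    offset_in_block = stripe * CELL_SIZE + file_offset % CELL_SIZE
    return block_index, offset_in_block

print(locate(0))                    # (0, 0): first cell goes to block 0
print(locate(5 * CELL_SIZE))        # (5, 0): cell 5 goes to block 5
print(locate(6 * CELL_SIZE + 10))   # (0, 1048586): wraps back to block 0
```

Because one stripe touches all six data blocks, even a small file spreads across many DataNodes — hence better small-file handling and parallel I/O, but worse locality.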
EC in Distributed Storage
Spectrum (redundancy form × block layout):
- Striped layout, replication or EC: Ceph, Quantcast File System (support striping under both forms)
- Contiguous layout, replication: HDFS
- Contiguous layout, EC: Facebook f4, Windows Azure
Roadmap
Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
HDFS-EC Architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction
Hardware-accelerated Codec Framework
Choosing Block Layout
Assuming (6,3) coding:
- Small files: < 1 block
- Medium files: 1~6 blocks
- Large files: > 6 blocks (more than 1 group)
Cluster A Profile:
- File count: small 96.29%, medium 1.86%, large 1.85%
- Space usage: small 26.06%, medium 9.33%, large 64.61%
- Top 2% of files occupy ~65% of space
Cluster B Profile:
- File count: small 86.59%, medium 11.38%, large 2.03%
- Space usage: small 23.89%, medium 36.03%, large 40.08%
- Top 2% of files occupy ~40% of space
Cluster C Profile:
- File count: small 99.64%, medium 0.36%, large 0.00%
- Space usage: small 76.05%, medium 20.75%, large 3.20%
- Dominated by small files
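The small/medium/large buckets used in these profiles can be sketched as a classifier, assuming the default 128 MB HDFS block size (an assumption; the slides do not state the block size explicitly):

```python
BLOCK = 128 * 1024 * 1024      # 128 MB HDFS block (assumed default)
GROUP = 6 * BLOCK              # one (6,3) group = 6 data blocks

def classify(file_size):
    """Bucket a file as in the cluster profiles: small / medium / large."""
    if file_size < BLOCK:
        return "small"         # < 1 block
    if file_size <= GROUP:
        return "medium"        # 1 ~ 6 blocks
    return "large"             # > 6 blocks (more than one group)

print(classify(4 * 1024 * 1024))     # small
print(classify(300 * 1024 * 1024))   # medium
print(classify(10 * GROUP))          # large
```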
Choosing Block Layout
Current HDFS uses the contiguous layout.
NameNode — Generalizing the Block Concept
Mapping logical blocks to storage blocks. Too many storage blocks? Hierarchical naming protocol.
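The hierarchical naming idea can be sketched as bit manipulation on block IDs. This is an illustrative sketch, assuming the group ID reserves its low 4 bits for a storage block's index within the group; the exact HDFS-EC bit layout may differ:

```python
INDEX_BITS = 4  # assumed: low 4 bits of the ID hold the in-group index

def storage_block_id(group_id, index):
    """Derive a storage block's ID from its logical group ID and index."""
    assert 0 <= index < (1 << INDEX_BITS)
    return group_id | index

def group_of(block_id):
    """Recover the logical block-group ID from a storage block ID."""
    return block_id & ~((1 << INDEX_BITS) - 1)

gid = 0xABC0                   # group IDs end in zeroed index bits
ids = [storage_block_id(gid, i) for i in range(9)]  # 6 data + 3 parity
print(group_of(ids[8]) == gid)
```

With this scheme the NameNode tracks one logical block per group and derives the nine storage-block IDs on demand, which is how the "too many storage blocks" problem is avoided.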
Client — Parallel Writing
One streamer per storage block, each with its own queue, managed by a Coordinator.
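The write path can be sketched as a coordinator feeding per-streamer queues. This is a minimal sketch: class and method names are illustrative (not the real HDFS client API), and a single XOR parity cell stands in for the real Reed-Solomon encoder's three parity cells:

```python
from queue import Queue

DATA = 6  # data streamers; one XOR parity streamer stands in for RS parity

def xor_parity(cells):
    """Stand-in for the real RS encoder: a single XOR parity cell."""
    out = bytearray(len(cells[0]))
    for cell in cells:
        for i, b in enumerate(cell):
            out[i] ^= b
    return bytes(out)

class Coordinator:
    def __init__(self):
        # One queue per streamer: 6 data streamers + 1 parity streamer.
        self.queues = [Queue() for _ in range(DATA + 1)]

    def write_stripe(self, data_cells):
        assert len(data_cells) == DATA
        for i, cell in enumerate(data_cells):
            self.queues[i].put(cell)                   # data streamers 0..5
        self.queues[DATA].put(xor_parity(data_cells))  # parity streamer

coord = Coordinator()
coord.write_stripe([bytes([i]) * 4 for i in range(DATA)])
print([q.get() for q in coord.queues])
```

Each streamer drains its own queue to one DataNode, so a full stripe is written to all blocks of the group in parallel.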
Client — Parallel Reading
Data blocks are read in parallel; parity blocks are fetched when data blocks are unavailable.
Reconstruction on DataNode
Important to avoid delaying the critical path:
- Especially if original data is lost
Integrated with the Replication Monitor:
- Under-protected EC blocks are scheduled together with under-replicated blocks
- New priority algorithms
New ErasureCodingWorker component on the DataNode
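The shared scheduling described above can be sketched with a single priority queue ordered by how close each block is to data loss. The priority rule below is an illustrative guess, not the exact HDFS algorithm:

```python
import heapq

def remaining_tolerance(live_units, needed_for_read):
    """How many more failures this block can survive.

    Replicated block: needed_for_read = 1 (any one replica suffices).
    RS(6,3) group:    needed_for_read = 6 (any 6 of 9 units suffice).
    """
    return live_units - needed_for_read

# EC groups and replicated blocks share one queue, lowest tolerance first.
pq = []
heapq.heappush(pq, (remaining_tolerance(7, 6), "EC group, 2 units lost"))
heapq.heappush(pq, (remaining_tolerance(1, 1), "replicated block, 2 replicas lost"))
heapq.heappush(pq, (remaining_tolerance(8, 6), "EC group, 1 unit lost"))

while pq:
    print(heapq.heappop(pq))   # most endangered blocks are reconstructed first
```

The point is that an EC group missing two of its three parity units is treated with the same urgency as a replicated block missing two of its three replicas.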
Roadmap
Background of EC
- Redundancy Theory
- EC in Distributed Storage Systems
HDFS-EC Architecture
- Choosing Block Layout
- NameNode — Generalizing the Block Concept
- Client — Parallel I/O
- DataNode — Background Reconstruction
Hardware-accelerated Codec Framework
Acceleration with Intel ISA-L
1 legacy coder:
- From Facebook’s HDFS-RAID project
2 new coders:
- Pure Java — code improvement over HDFS-RAID
- Native coder using Intel’s Intelligent Storage Acceleration Library (ISA-L)
Microbenchmark: Codec Calculation
Microbenchmark: HDFS I/O
Hive-on-Spark — locality sensitive
Conclusion
Erasure coding expands effective storage space by ~50%!
HDFS-EC phase I implements erasure coding in the striped block layout.
Upstream effort (HDFS-7285):
- Design finalized Nov. 2014
- Development started Jan. 2015
- 218 commits, ~25k LoC changed
- Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo (Japan)
Phase II will support contiguous block layout for better locality
Acknowledgements
Cloudera
- Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
Intel
- Kai Zheng, Uma Maheswara Rao G, Vinayakumar B, Yi Liu, Weihua Jiang
Hortonworks
- Jing Zhao, Tsz Wo Nicholas Sze
Huawei
- Walter Su, Rakesh R, Xinwei Qin
Yahoo (Japan)
- Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng
Questions?
Zhe Zhang, [email protected] | @oldcaphttp://zhe-thoughts.github.io/