2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD...
Transcript of 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD...
TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD
PARADIGM BASED ON TRIPARTITE GRAPH MODEL
Xiaoyi Lu and Haiyang Shi
The Ohio State University
{lu.932, shi.876}@osu.edu
2020 OFA Virtual Workshop
CHALLENGES OF DATA EXPLOSION
2 © OpenFabrics Alliance
0
20
40
60
80
100
120
140
160
180
200
0.00
0.20
0.40
0.60
0.80
1.00
Year
Disk Drive Cost and Annual Data Size with Time
Hard Disk Drives
Solid State Drives
Annual Data Size
Pri
ce (
$/G
B)
Dat
a Si
ze (
ZB)
Disk Drive Prices (1955-2019), https://jcmit.net/diskprice.htmData Age 2025, sponsored by Seagate with data from IDC Global DataSphere, Nov 2018
RETHINKING OF DISTRIBUTED STORAGE SYSTEMS
3 © OpenFabrics Alliance
▪ No Fault Tolerance
❖High performance
❖Not reliable
▪ N-way Replication
• Tolerate up to 𝑛 − 1 node failures
• Performance degradation
• Large storage overhead Perf
orm
ance
Storage Overhead
No FT
Replication
1 𝑛
Erasure Coding (EC)
EC BASICS - REED-SOLOMON CODE
4 © OpenFabrics Alliance
REED-SOLOMON (RS) CODE
5 © OpenFabrics Alliance
▪ 𝑅𝑆(𝑘,𝑚)
• Widely used (RAID, Microsoft Azure, HDFS 3.x)
• Encodes on 𝑘 data chunks to generate 𝑚parity chunks
• Decodes on any 𝑘 data/parity chunks to recover the original data
▪ Storage Overhead of Common Codes
• RS(4,2) => 1.5x overhead
• RS(6,3) => 1.5x overhead
Reed-Solomon EC for k=4 and m=2
01 00 00 00
00 01 00 00
00 00 01 00
00 00 00 01
1B 1C 12 14
1C 1B 14 12
A B C D
E F G H
I J K L
M N O P
A B C D
E F G H
I J K L
M N O P
51 52 53 49
55 56 57 25Original Data
Generator
Matrix
Encoded Data
× =Data Chunks
Parity Chunks
REED-SOLOMON (RS) CODE
6 © OpenFabrics Alliance
Reed-Solomon EC for k=4 and m=2
01 00 00 00
00 01 00 00
00 00 01 00
00 00 00 01
1B 1C 12 14
1C 1B 14 12
A B C D
E F G H
I J K L
M N O P
A B C D
E F G H
I J K L
M N O P
51 52 53 49
55 56 57 25Original Data
Generator
Matrix
Encoded Data
× =
▪ 𝑅𝑆(𝑘,𝑚)
• Widely used (RAID, Microsoft Azure, HDFS 3.x)
• Encodes on 𝑘 data chunks to generate 𝑚parity chunks
• Decodes on any 𝑘 data/parity chunks to recover the original data
▪ Storage Overhead of Common Codes
• RS(4,2) => 1.5x overhead vs. 3x in 3-way replication
• RS(6,3) => 1.5x overhead vs. 4x in 4-way replication
REED-SOLOMON (RS) CODE
7 © OpenFabrics Alliance
Reed-Solomon EC for k=4 and m=2
01 00 00 00
00 01 00 00
1B 1C 12 14
1C 1B 14 12
A B C D
E F G H
I J K L
M N O P
A B C D
E F G H
51 52 53 49
55 56 57 25
Original Data Survived Data
× =
01 00 00 00
00 01 00 00
8D F6 7B 01
F6 8D 01 7B
×
01 00 00 00
00 01 00 00
8D F6 7B 01
F6 8D 01 7B
×
Inverse Matrix
Inverse Matrix
▪ 𝑅𝑆(𝑘,𝑚)
• Widely used (RAID, Microsoft Azure, HDFS 3.x)
• Encodes on 𝑘 data chunks to generate 𝑚parity chunks
• Decodes on any 𝑘 data/parity chunks to recover the original data
▪ Storage Overhead of Common Codes
• RS(4,2) => 1.5x overhead vs. 3x in 3-way replication
• RS(6,3) => 1.5x overhead vs. 4x in 4-way replication
EC: THE ALTERNATIVE RESILIENCE TECHNIQUE
8 © OpenFabrics Alliance
▪ Exploiting EC will introduce both computation overhead and
communication overhead
Client
Write(Encoding) Read with Erasures (Decoding)
ClientComputation Computation
Communication Communication
Are there any good approaches to improve EC performance?
EC OFFLOAD ON SMARTNICS
9 © OpenFabrics Alliance
EC OFFLOAD ON SMARTNICS
10 © OpenFabrics Alliance
▪ Promising Capability on Modern SmartNICs
❖Mellanox InfiniBand ConnectX-4 and later
• About 100 supercomputers in latest TOP500 are equipped with
Mellanox InfiniBand NICs
OVERVIEW EC CAPABILITIES ON MODERN SMARTNICS
11 © OpenFabrics Alliance
Initiator ReceiverGroup
MemoryAccessbyCPU (R)DMA
PostSend/EC
Send(D1)Send(D2)Send(D3)
EC(D1-3)
Send(P1)
Send(P2)
PostECSend
CPU MEM NIC NICs MEMs CPUs
Initiator ReceiverGroup
ECSend
(D1-3)
CPU MEM NIC NICs MEMs CPUs
HardwareACK
EC
Incoherent and Coherent EC Calculation and Networking
▪ Coherent EC Calculation
and Networking
❖Less CPU involvement
❖Less DMA operations
RETHINKING TRADITIONAL EC PARADIGM ON SMARTNICS
12 © OpenFabrics Alliance
𝑁𝑦𝑥 Node y in layer x
Exclusive OR
Erasure Coder
Data Chunk
Parity Chunk
IntermediateChunk
EC𝑁11
𝑁𝑎1
𝑁𝑏1
𝑁𝑐1
𝑁𝑑1
𝑁𝑒1
𝑁𝑓1
𝑁𝑎2
𝑁𝑏2
𝑁𝑐2
𝑁𝑑2
𝑁𝐼𝐶
𝑁12
𝑁22
𝑁𝑘2
𝑁𝑘+12
𝑁𝑘+𝑚2
…k … EC…
…
k
m
• Encoding Procedure❖Collect 𝑘 data chunks
❖Post encode-and-send
• One set of nodes (e.g., clients) perform EC encoding calculations
• The other set of nodes (e.g., data nodes) receive encoded chunks
• We name it as Bipartite Graph Based EC (BiEC) Paradigm
RETHINKING TRADITIONAL EC PARADIGM ON SMARTNICS
13 © OpenFabrics Alliance
▪ Decoding Procedure
❖Fetch 𝑘 survived chunks
❖Decode to reconstruct corrupt chunks
• One set of nodes (e.g., data nodes)
send requested chunks
• The other set of nodes (e.g., clients)
receive and decode to reconstruct
• Follows Bipartite Graph Based EC
(BiEC) Paradigm
𝑁𝑦𝑥 Node y in layer x
Exclusive OR
Erasure Coder
Data Chunk
Parity Chunk
IntermediateChunk
EC
Corrupt Node
RecoveredChunk
𝑁11
𝑁𝑎1
𝑁𝑏1
𝑁𝑐1
𝑁𝑑1
𝑁𝑒1
𝑁𝑓1
𝑁𝑎2
𝑁𝑏2
𝑁𝑐2
𝑁𝑑2
𝑁𝐼𝐶
𝑁12
𝑁22
𝑁𝑘2
𝑁𝑘+12
𝑁𝑘+𝑚2
……EC
…
…
…
LIMITATIONS AND CHALLENGES
14 © OpenFabrics Alliance
LIMITATIONS OF BIEC PARADIGM
15 © OpenFabrics Alliance
▪ Limitation 1: Lack of parallelism
▪ Limitation 2: Treat a NIC as a power processor, not
fully exploit networked resources
𝑁11
𝑁𝑎1
𝑁𝑏1
𝑁𝑐1
𝑁𝑑1
𝑁𝑒1
𝑁𝑓1
𝑁𝑎2
𝑁𝑏2
𝑁𝑐2
𝑁𝑑2
𝑁𝐼𝐶
𝑁12
𝑁22
𝑁𝑘2
𝑁𝑘+12
𝑁𝑘+𝑚2
… … EC…
…
LIMITATIONS OF EXISTING APIS FOR EC ON SMARTNICS
16 © OpenFabrics Alliance
▪ Encode and Send Primitive
❖Offload encoding operation and data transmissions simultaneously
• Limitation 3: No receive and decode primitive support
CHALLENGES
1. Can we propose a better EC paradigm, which is able to exploit the networked
resources in an optimized approach?
2. If such an EC paradigm exists, how can we achieve better overlapping?
3. Can such a new EC paradigm be applied uniformly to both encoding and decoding
procedures?
4. Are there any additional challenges to co-design applications with the proposed EC
paradigm?
17 © OpenFabrics Alliance
TRIEC IDEAS
18 © OpenFabrics Alliance
TRIPARTITE GRAPH BASED EC PARADIGM (TRIEC)
19 © OpenFabrics Alliance
Key Observation: An EC computation can be decomposed into sub EC calculations,
which are possible to be performed on a set of nodes in parallel
▪ Bipartite EC graph transforms to Tripartite EC graph
▪ We name it as Tripartite Graph Based EC Paradigm
TRIEC FOR ENCODING
20 © OpenFabrics Alliance
▪ Encoding Procedure❖Client (e.g., 𝑁1
1) sends each data chunk
❖Data node in layer-2 (e.g., 𝑁12) posts
encode-and-send
❖Data node in layer-3 (e.g., 𝑁13) sums up
𝑘 intermediate chunks to generate a
parity chunk
𝑁𝑦𝑥 Node y in layer x
Exclusive OR
Erasure Coder
Data Chunk
Parity Chunk
IntermediateChunk
EC 𝑁11
𝑁𝑎1
𝑁𝑏1
𝑁𝑐1
𝑁𝑑1
𝑁𝑒1
𝑁𝑓1
𝑁𝑎2
𝑁𝑏2
𝑁𝑐2
𝑁𝑑2
𝑁12
𝑁22
𝑁𝑎3
𝑁𝑏3
𝑁𝑐3
𝑁13
𝑁𝑚3𝑁𝑘
2
𝑁𝑑3
…
𝑁𝐼𝐶
𝑁𝐼𝐶
𝑁𝐼𝐶
…
EC
…
EC
…
……
EC
m
TRIEC FOR DECODING
21 © OpenFabrics Alliance
▪ Decoding Procedure❖Data node in layer-3 (e.g., 𝑁1
3) decodes
to generate 𝑝 intermediate chunks
❖Data node in layer-2 (e.g., 𝑁12) fetches 𝑘
intermediate chunks and sums them up
to reconstruct corrupt chunk
❖Client in layer-1 (e.g., 𝑁11) gets
recovered chunks
𝑁𝑦𝑥 Node y in layer x
Exclusive OR
Erasure Coder
Data Chunk
Parity Chunk
IntermediateChunk
EC
Corrupt Node
RecoveredChunk
...
...
...
𝑁11
𝑁𝑎1
𝑁𝑏1
𝑁𝑐1
𝑁𝑑1
𝑁𝑒1
𝑁𝑓1
𝑁𝑒1
𝑁𝑎2
𝑁𝑏2
𝑁𝑐2
𝑁𝑑2
𝑁𝑒2
𝑁𝑓2
𝑁12
𝑁𝑝2
𝑁𝑎3
𝑁𝑏3
𝑁𝑐3
𝑁13
𝑁𝑥3
𝑁𝑘3
𝑁𝐼𝐶
𝑁𝐼𝐶
𝑁𝐼𝐶
EC
…
EC
…
EC
…
…
…
…
TriEC can be uniformly applied to both encoding and decoding
TRIEC-CACHE
22 © OpenFabrics Alliance
ARCHITECTURE OVERVIEW OF TRIEC-CACHE
23 © OpenFabrics Alliance
▪ Based on memcached (v1.5.12)
▪ Interleaved Architecture
❖EC Group: Leader, data processes, and
parity processes
❖Interleave EC groups into the cluster such
that data processes and parity processes are
evenly distributed
❖Balance workload and resource utilizations
Node 1 Node 2 Node 3 Node 4 Node 5
Group 1
Group 2
Group 3
Group 4
Group 5
𝐿𝑖 Leader of 𝑖th Group 𝐷𝑖 𝑖th Data Chunk
𝑖th Parity Chunk𝑃𝑖
𝐿1𝐷1
𝐷2 𝐷3 𝑃1 𝑃2
𝐷1𝐷2 𝐷3 𝑃1𝑃2
𝐿2
𝐷1𝐷2 𝐷3𝑃1 𝑃2
𝐿3
𝐷1𝐷2
𝐿4
𝐷1
𝐿5𝐷2 𝐷3 𝑃1 𝑃2
𝐷3 𝑃1 𝑃2
Corrupt Chunk
ARCHITECTURE OVERVIEW OF TRIEC-CACHE
24 © OpenFabrics Alliance
▪ set(key, value)
❖Implemented with encode-and-send
primitive
▪ get(key)
❖No EC operations
▪ get(key) with erasures
❖Involve data recoveries
❖Implemented with recv-and-decode
primitive
Node 1 Node 2 Node 3 Node 4 Node 5
Group 1
Group 2
Group 3
Group 4
Group 5
𝐿𝑖 Leader of 𝑖th Group 𝐷𝑖 𝑖th Data Chunk
𝑖th Parity Chunk𝑃𝑖
Client
𝐷1𝐷2 𝐷3 𝑃1𝑃2
𝐿2
𝐷1𝐷2
𝐿4𝐷3 𝑃1 𝑃2
𝐿1
𝐷1𝐷2 𝐷3
𝑃1𝑃2
𝐿3
𝐷1𝐷2
𝐷3
𝐷3
𝐷2 𝐷1𝐿5
𝑃1 𝑃2
𝑃1 𝑃2
Corrupt Chunk
RECOVERY MECHANISM OF BIEC AND TRIEC
25 © OpenFabrics Alliance
Out-of-Band Recovery for BiEC In-Band Recovery for TriEC
• Out-of-Band Recovery❖Write back
❖Recover in the background
• In-Band Recovery❖No extra computations or
communications
OPTIMIZATIONS WITH MLNX OFED
26 © OpenFabrics Alliance
▪ Avoidance of Sending Unrequested
Chunks
❖Nullify the work requests to avoid sending out
unrequested chunks
▪ EC Calculator Cache
❖Initializing EC calculators is very expensive
with Mellanox’s EC offload APIs
❖Static EC calculator cache vs. dynamic EC
calculator cache
0
1000
2000
3000
4000
5000
6000
7000
Lat
ency
(us)
No Cache
Dynamic Cache
Static Cache
Performance Impact of Calculator Cache (OSU RI2 Cluster)
EVALUATION
27 © OpenFabrics Alliance
MICROBENCHMARK – TRIEC ENCODING
28 © OpenFabrics Alliance
▪ TriEC reduces the encoding time with RS(12,4) by 23.5% - 45.1%
0500
10001500200025003000
BiEC TriEC
Ex
ecu
tion
Tim
e (m
s)
Encoding Performance Comparisons for RS(12,4) with Diverse Chunk Sizes (OSU RI2 Cluster)
MICROBENCHMARK – TRIEC ENCODING
29 © OpenFabrics Alliance
0
1500
3000
4500
600051
2
2K
B
8K
B
32
KB
12
8K
B
51
2K
B
2M
B
8M
B
1K
B
4K
B
16
KB
64
KB
25
6K
B
1M
B
4M
B
51
2
2K
B
8K
B
32
KB
12
8K
B
51
2K
B
2M
B
8M
B
1K
B
4K
B
16
KB
64
KB
25
6K
B
1M
B
4M
B
(3,2) (6,3) (8,3) (10,4)
BiEC TriEC
Exec
uti
on T
ime
(ms)
Encoding Performance Comparisons with Varied Configurations and Chunk Sizes (OSU RI2 Cluster)
Up to 1.29x Up to 1.43x Up to 1.61x Up to 1.72x
▪ Evaluate TriEC with all widely-used EC configurations
MICROBENCHMARK – TRIEC DECODING
30 © OpenFabrics Alliance
▪ TriEC reduces the decoding time with RS(12,4) by 18.6% - 64.6%
Ex
ecu
tion
Tim
e (m
s)
Performance Comparisons for Recovering Four Data Chunks for RS(12,4) with Diverse Chunk Sizes (OSU RI2 Cluster)
0100020003000400050006000
BiEC TriEC
MICROBENCHMARK – TRIEC DECODING
31 © OpenFabrics Alliance
0
3000
6000
9000
1200051
2
2K
B
8K
B
32
KB
12
8K
B
51
2K
B
2M
B
8M
B
1K
B
4K
B
16
KB
64
KB
25
6K
B
1M
B
4M
B
51
2
2K
B
8K
B
32
KB
12
8K
B
51
2K
B
2M
B
8M
B
1K
B
4K
B
16
KB
64
KB
25
6K
B
1M
B
4M
B
(3,2) (6,3) (8,3) (10,4)
BiEC TriEC
Performance Comparisons for Recovering 𝒎 Data Chunks with Varied Configurations and Chunk Sizes (OSU RI2 Cluster)
Exec
uti
on T
ime
(ms)
m=2 m=3 m=3 m=4
Up to 2.33x Up to 2.26x Up to 2.17x Up to 2.16x
▪ Evaluate TriEC with all widely-used EC configurations
THROUGHPUT OF TRIEC-CACHE
32 © OpenFabrics Alliance
▪ TriEC speeds up the overall throughput performance by up to 13.3% for
equal-shares, 14.8% for read-mostly, and 13.9% for read-only workloads.
Throughput Comparisons for RS(6, 3) (OSC Pitzer Cluster)
BiEC TriEC
1KB 4KB 16KB1KB 4KB 16KB1KB 4KB 16KB
Thro
ughput
(103
op
s/s)
1KB 4KB 16KB1KB 4KB 16KB1KB 4KB 16KB0
40
80
120
r:w=50:50 r:w=95:5 r:w=100:0 r:w=50:50 r:w=95:5 r:w=100:0
1% of reads with one erasure 1% of reads with three erasures
CONCLUSION
33 © OpenFabrics Alliance
Haiyang Shi and Xiaoyi Lu. TriEC: Tripartite Graph Based Erasure Coding NIC Offload. In Proceedings of the 32nd International Conference for High
Performance Computing, Networking, Storage and Analysis (SC), 2019. (Best Student Paper Finalist)
CONCLUSION AND FUTURE WORK
34 © OpenFabrics Alliance
▪ A new EC NIC offload paradigm based on tripartite graph model (i.e., TriEC)
▪ A new receive-and-decode primitive on top of Mellanox EC NIC offload APIs
▪ Co-design a TriEC-based key value store based on Memcached
▪ TriEC reduces the average write latency by up to 23.2% and the recovery time by up to 37.8%
even with only 1% failure occurrences
▪ In the future, we plan to apply TriEC to other storage systems and hardware platforms
ACKNOWLEDGEMENT
35 © OpenFabrics Alliance
▪ We would like to thank Ohio Supercomputer Center (OSC) for providing the cluster access.
▪ We would also like to thank Mellanox for donating SmartNICs.
▪ We also thank the anonymous reviewers for their precious feedback to this work.
▪ This work is supported in part by National Science Foundation grant #CCF-1822987.
THANK YOUXiaoyi Lu, Research Assistant Professor
The Ohio State University
2020 OFA Virtual Workshop