
TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL

Xiaoyi Lu and Haiyang Shi

The Ohio State University

{lu.932, shi.876}@osu.edu

2020 OFA Virtual Workshop

CHALLENGES OF DATA EXPLOSION

[Figure: Disk Drive Cost and Annual Data Size with Time. Axes: Year, Price ($/GB), Data Size (ZB). Series: Hard Disk Drives, Solid State Drives, Annual Data Size.]

Sources: Disk Drive Prices (1955-2019), https://jcmit.net/diskprice.htm; Data Age 2025, sponsored by Seagate with data from IDC Global DataSphere, Nov 2018.

RETHINKING OF DISTRIBUTED STORAGE SYSTEMS


▪ No Fault Tolerance

❖High performance

❖Not reliable

▪ N-way Replication

• Tolerate up to 𝑛 − 1 node failures

• Performance degradation

• Large storage overhead

[Figure: Performance vs. Storage Overhead, positioning No FT, Replication, and Erasure Coding (EC). Storage-overhead axis marks: 1 and 𝑛.]

EC BASICS - REED-SOLOMON CODE


REED-SOLOMON (RS) CODE


▪ 𝑅𝑆(𝑘,𝑚)

• Widely used (RAID, Microsoft Azure, HDFS 3.x)

• Encodes 𝑘 data chunks to generate 𝑚 parity chunks

• Decodes from any 𝑘 data/parity chunks to recover the original data

▪ Storage Overhead of Common Codes

• RS(4,2) => 1.5x overhead

• RS(6,3) => 1.5x overhead

Reed-Solomon EC for k=4 and m=2: Generator Matrix × Original Data = Encoded Data

Generator Matrix:

01 00 00 00
00 01 00 00
00 00 01 00
00 00 00 01
1B 1C 12 14
1C 1B 14 12

Original Data:

A B C D
E F G H
I J K L
M N O P

Encoded Data (rows 1-4: Data Chunks; rows 5-6: Parity Chunks):

A B C D
E F G H
I J K L
M N O P
51 52 53 49
55 56 57 25










• RS(4,2) => 1.5x overhead vs. 3x in 3-way replication
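The overhead figures quoted on these slides follow from simple ratios. The sketch below is illustrative only; the function names are hypothetical and not part of any library. It treats storage overhead as raw bytes stored per byte of user data.

```python
from fractions import Fraction


def rs_storage_overhead(k: int, m: int) -> Fraction:
    """Bytes stored per byte of user data for RS(k, m): (k + m) / k."""
    if k <= 0 or m < 0:
        raise ValueError("k must be positive and m non-negative")
    return Fraction(k + m, k)


def replication_storage_overhead(n: int) -> Fraction:
    """Bytes stored per byte of user data for n-way replication: n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return Fraction(n)


for k, m in [(4, 2), (6, 3)]:
    overhead = float(rs_storage_overhead(k, m))
    # RS(k, m) can rebuild the data from any k chunks, so it survives m losses.
    print(f"RS({k},{m}): {overhead}x overhead, tolerates {m} lost chunks")

n = 3
print(f"{n}-way replication: {float(replication_storage_overhead(n))}x overhead, "
      f"tolerates {n - 1} lost copies")
```

RS(4,2) and 3-way replication both survive two failures, but the code stores 1.5 bytes per user byte instead of 3.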





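Decoding for k=4 and m=2 when the third and fourth data chunks are lost: Surviving Generator Rows × Original Data = Survived Data

Surviving Generator Rows:

01 00 00 00
00 01 00 00
1B 1C 12 14
1C 1B 14 12

Original Data:

A B C D
E F G H
I J K L
M N O P

Survived Data:

A B C D
E F G H
51 52 53 49
55 56 57 25

Both sides are multiplied by the Inverse Matrix of the surviving generator rows to recover the Original Data.

Inverse Matrix:

01 00 00 00
00 01 00 00
8D F6 7B 01
F6 8D 01 7B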
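The matrix arithmetic on these two slides can be sketched in software. The following is a minimal illustration, not the presenters' implementation: it assumes GF(2^8) with the reduction polynomial 0x11D and a Cauchy-style parity matrix, so its coefficients differ from the 1B/1C/12/14 values shown on the slide, whose field parameters are not stated. All function names are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements (assumed polynomial 0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse: a^254, since a^255 == 1 for nonzero a."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def mat_mul_chunks(matrix, chunks):
    """Multiply a coefficient matrix by a column of equal-length byte chunks."""
    length = len(chunks[0])
    out = []
    for row in matrix:
        acc = bytearray(length)
        for coeff, chunk in zip(row, chunks):
            for pos in range(length):
                acc[pos] ^= gf_mul(coeff, chunk[pos])  # addition in GF(2^8) is XOR
        out.append(bytes(acc))
    return out


def gf_mat_inv(matrix):
    """Invert a square matrix over GF(2^8) by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [int(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = gf_inv(aug[col][col])
        aug[col] = [gf_mul(scale, v) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [v ^ gf_mul(factor, p) for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


k, m = 4, 2
# Identity rows keep the data chunks; Cauchy rows (assumed) produce the parity chunks.
generator = [[int(i == j) for j in range(k)] for i in range(k)]
generator += [[gf_inv((k + i) ^ j) for j in range(k)] for i in range(m)]

data = [b"ABCD", b"EFGH", b"IJKL", b"MNOP"]
encoded = mat_mul_chunks(generator, data)  # 4 data chunks followed by 2 parity chunks

survivors = [0, 1, 4, 5]  # third and fourth data chunks lost, as on the slide
sub_matrix = [generator[i] for i in survivors]
recovered = mat_mul_chunks(gf_mat_inv(sub_matrix), [encoded[i] for i in survivors])
print(recovered == data)
```

Any 4 of the 6 encoded chunks can be used as `survivors`, because every square submatrix of a Cauchy matrix is invertible.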








EC: THE ALTERNATIVE RESILIENCE TECHNIQUE


▪ Exploiting EC introduces both computation overhead and communication overhead

[Figure: client-side Write (Encoding) and Read with Erasures (Decoding) paths, each involving Computation and Communication.]

Are there any good approaches to improve EC performance?

EC OFFLOAD ON SMARTNICS




▪ Promising Capability on Modern SmartNICs

❖Mellanox InfiniBand ConnectX-4 and later

• About 100 supercomputers in the latest TOP500 list are equipped with Mellanox InfiniBand NICs

OVERVIEW OF EC CAPABILITIES ON MODERN SMARTNICS


[Figure: Incoherent and Coherent EC Calculation and Networking. Two timelines between an Initiator (CPU, MEM, NIC) and a Receiver Group (NICs, MEMs, CPUs), distinguishing memory access by CPU from (R)DMA. Diagram labels: PostSend/EC, Send(D1), Send(D2), Send(D3), EC(D1-3), Send(P1), Send(P2); PostECSend, ECSend(D1-3), HardwareACK, EC.]

▪ Coherent EC Calculation and Networking

❖Less CPU involvement

❖Fewer DMA operations

RETHINKING TRADITIONAL EC PARADIGM ON SMARTNICS


[Figure: BiEC encoding graph. 𝑁 with subscript y and superscript x denotes node y in layer x. Legend: Exclusive OR, Erasure Coder (EC), Data Chunk, Parity Chunk, Intermediate Chunk. Layer-1 node 𝑁₁¹ encodes on its NIC and sends 𝑘 data chunks and 𝑚 parity chunks to layer-2 nodes 𝑁₁² … 𝑁ₖ₊ₘ².]

▪ Encoding Procedure

❖Collect 𝑘 data chunks

❖Post encode-and-send

• One set of nodes (e.g., clients) performs the EC encoding calculations

• The other set of nodes (e.g., data nodes) receives the encoded chunks

• We call this the Bipartite Graph Based EC (BiEC) paradigm

RETHINKING TRADITIONAL EC PARADIGM ON SMARTNICS


▪ Decoding Procedure

❖Fetch 𝑘 survived chunks

❖Decode to reconstruct corrupt chunks

• One set of nodes (e.g., data nodes) sends the requested chunks

• The other set of nodes (e.g., clients) receives them and decodes to reconstruct the corrupt chunks

• Follows the Bipartite Graph Based EC (BiEC) paradigm

[Figure: BiEC decoding graph. Legend: Exclusive OR, Erasure Coder (EC), Data Chunk, Parity Chunk, Intermediate Chunk, Corrupt Node, Recovered Chunk. Layer-2 nodes 𝑁₁² … 𝑁ₖ₊ₘ² send surviving chunks to layer-1 node 𝑁₁¹, which decodes on its NIC.]

LIMITATIONS AND CHALLENGES


LIMITATIONS OF BIEC PARADIGM


▪ Limitation 1: Lack of parallelism

▪ Limitation 2: Treats the NIC as a powerful processor and does not fully exploit networked resources

[Figure: BiEC encoding graph, in which a single layer-1 NIC performs the EC calculations for layer-2 nodes 𝑁₁² … 𝑁ₖ₊ₘ².]

LIMITATIONS OF EXISTING APIS FOR EC ON SMARTNICS


▪ Encode and Send Primitive

❖Offloads the encoding operation and the data transmissions together

• Limitation 3: No receive-and-decode primitive support

CHALLENGES

1. Can we propose a better EC paradigm that exploits the networked resources in an optimized way?

2. If such an EC paradigm exists, how can we achieve better overlapping?

3. Can such a new EC paradigm be applied uniformly to both encoding and decoding procedures?

4. Are there any additional challenges to co-design applications with the proposed EC paradigm?


TRIEC IDEAS


TRIPARTITE GRAPH BASED EC PARADIGM (TRIEC)


Key Observation: An EC computation can be decomposed into sub-EC calculations, which can be performed on a set of nodes in parallel

▪ The bipartite EC graph transforms into a tripartite EC graph

▪ We call this the Tripartite Graph Based EC (TriEC) paradigm

TRIEC FOR ENCODING


▪ Encoding Procedure

❖Client (e.g., 𝑁₁¹) sends each data chunk

❖Data node in layer-2 (e.g., 𝑁₁²) posts encode-and-send

❖Data node in layer-3 (e.g., 𝑁₁³) sums up 𝑘 intermediate chunks to generate a parity chunk


[Figure: TriEC encoding graph. Legend: Exclusive OR, Erasure Coder (EC), Data Chunk, Parity Chunk, Intermediate Chunk. Layer-1 client 𝑁₁¹ sends data chunks to layer-2 nodes 𝑁₁² … 𝑁ₖ², each running EC on its NIC; the resulting intermediate chunks go to the 𝑚 layer-3 nodes 𝑁₁³ … 𝑁ₘ³.]
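The three encoding steps above can be mimicked in software to show that splitting the work across layers yields the same parity chunks as encoding on one node. This is a sketch under assumptions, not the NIC offload itself: GF(2^8) with polynomial 0x11D, a Cauchy-style coding matrix, and hypothetical function names. Plain Python stands in for the NIC primitives.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements (assumed polynomial 0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse of a nonzero element: a^254."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def scale(coeff: int, chunk: bytes) -> bytes:
    """Sub-EC calculation: multiply every byte of one chunk by one coefficient."""
    return bytes(gf_mul(coeff, byte) for byte in chunk)


def xor_all(chunks) -> bytes:
    """Sum chunks in GF(2^8), which is a bytewise exclusive OR."""
    acc = bytearray(len(chunks[0]))
    for chunk in chunks:
        for pos, byte in enumerate(chunk):
            acc[pos] ^= byte
    return bytes(acc)


def biec_encode(coding, data):
    """BiEC: a single node computes every parity chunk from all k data chunks."""
    k = len(data)
    return [xor_all([scale(row[i], data[i]) for i in range(k)]) for row in coding]


def triec_encode(coding, data):
    """TriEC: layer-2 nodes scale their own chunk, layer-3 nodes sum the results."""
    k, m = len(data), len(coding)
    # Layer 2: node i holds data chunk i and emits one intermediate chunk per parity.
    intermediates = [[scale(coding[j][i], data[i]) for j in range(m)] for i in range(k)]
    # Layer 3: node j sums the k intermediate chunks addressed to it.
    return [xor_all([intermediates[i][j] for i in range(k)]) for j in range(m)]


k, m = 3, 2
coding = [[gf_inv((k + j) ^ i) for i in range(k)] for j in range(m)]  # assumed matrix
data = [b"chunk-one", b"chunk-two", b"chunk-3!!"]
print(triec_encode(coding, data) == biec_encode(coding, data))
```

In the sketch each layer-2 node touches only its own chunk, so the 𝑘 scaling steps are independent and can run in parallel, which is the property the key observation relies on.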

TRIEC FOR DECODING


▪ Decoding Procedure

❖Data node in layer-3 (e.g., 𝑁₁³) decodes to generate 𝑝 intermediate chunks

❖Data node in layer-2 (e.g., 𝑁₁²) fetches 𝑘 intermediate chunks and sums them up to reconstruct the corrupt chunk

❖Client in layer-1 (e.g., 𝑁₁¹) gets the recovered chunks


[Figure: TriEC decoding graph. Legend: Exclusive OR, Erasure Coder (EC), Data Chunk, Parity Chunk, Intermediate Chunk, Corrupt Node, Recovered Chunk. Layer-3 nodes 𝑁₁³ … 𝑁ₖ³ run EC on their NICs and send intermediate chunks to layer-2 nodes 𝑁₁² … 𝑁ₚ², which deliver recovered chunks to layer-1 client 𝑁₁¹.]

TriEC can be uniformly applied to both encoding and decoding
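One way to write the decoding steps down, using notation that is assumed here rather than taken from the slides: let S be the k × k matrix of generator rows belonging to the k surviving chunks s_1, …, s_k, and let d_ℓ be one of the p lost chunks. Multiplication ⊗ and summation ⊕ are over the code's Galois field.

```latex
% Requires amsmath. S, s_i, d_\ell and t_{\ell,i} are illustrative symbols.
\begin{align}
  d_\ell     &= \bigoplus_{i=1}^{k} \left(S^{-1}\right)_{\ell,i} \otimes s_i
             && \text{(standard RS decoding of lost chunk } \ell\text{)} \\
  t_{\ell,i} &= \left(S^{-1}\right)_{\ell,i} \otimes s_i
             && \text{(intermediate chunk, computed on the layer-3 node holding } s_i\text{)} \\
  d_\ell     &= \bigoplus_{i=1}^{k} t_{\ell,i}
             && \text{(summed on the layer-2 node reconstructing } d_\ell\text{)}
\end{align}
```

Each layer-3 node produces one t_{ℓ,i} per lost chunk, giving the 𝑝 intermediate chunks named in the procedure, and each layer-2 node sums 𝑘 of them. The structure mirrors the encoding case with the inverse matrix in place of the parity rows.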

TRIEC-CACHE


ARCHITECTURE OVERVIEW OF TRIEC-CACHE


▪ Based on memcached (v1.5.12)

▪ Interleaved Architecture

❖EC Group: Leader, data processes, and parity processes

❖Interleave EC groups into the cluster such that data processes and parity processes are evenly distributed

❖Balance workload and resource utilization

[Figure: Five EC groups (Group 1 to Group 5) interleaved across Node 1 to Node 5. Legend: 𝐿𝑖 = leader of the 𝑖th group, 𝐷𝑖 = 𝑖th data chunk, 𝑃𝑖 = 𝑖th parity chunk, Corrupt Chunk. Each group holds 𝐷1, 𝐷2, 𝐷3, 𝑃1, and 𝑃2.]
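The slide does not spell out the exact placement rule, so the following is only one possible rotation scheme that yields an even spread. The function name and the rule "role r of group g goes to node (g + r) mod N" are assumptions made for illustration.

```python
from collections import Counter


def interleave_groups(num_nodes: int, num_groups: int, k: int, m: int) -> dict:
    """Map (group, role) to a node by rotating each group's start node by one."""
    if k + m > num_nodes:
        raise ValueError("need at least k + m nodes so a group's chunks sit on distinct nodes")
    placement = {}
    for group in range(num_groups):
        roles = [f"D{i + 1}" for i in range(k)] + [f"P{j + 1}" for j in range(m)]
        for offset, role in enumerate(roles):
            # Hypothetical rule: shift the whole group by its index, wrap around.
            placement[(group + 1, role)] = (group + offset) % num_nodes + 1
    return placement


placement = interleave_groups(num_nodes=5, num_groups=5, k=3, m=2)
data_per_node = Counter(node for (_, role), node in placement.items() if role.startswith("D"))
parity_per_node = Counter(node for (_, role), node in placement.items() if role.startswith("P"))
print("data processes per node:", dict(sorted(data_per_node.items())))
print("parity processes per node:", dict(sorted(parity_per_node.items())))
```

With five groups of RS(3,2) on five nodes, this rotation gives every node the same number of data processes and the same number of parity processes.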

ARCHITECTURE OVERVIEW OF TRIEC-CACHE


▪ set(key, value)

❖Implemented with the encode-and-send primitive

▪ get(key)

❖No EC operations

▪ get(key) with erasures

❖Involves data recovery

❖Implemented with the recv-and-decode primitive

[Figure: A Client issuing requests to five EC groups (Group 1 to Group 5) interleaved across Node 1 to Node 5. Legend: 𝐿𝑖 = leader of the 𝑖th group, 𝐷𝑖 = 𝑖th data chunk, 𝑃𝑖 = 𝑖th parity chunk, Corrupt Chunk.]

RECOVERY MECHANISM OF BIEC AND TRIEC


[Figure panels: Out-of-Band Recovery for BiEC; In-Band Recovery for TriEC]

• Out-of-Band Recovery

❖Write back

❖Recover in the background

• In-Band Recovery

❖No extra computations or communications

OPTIMIZATIONS WITH MLNX OFED


▪ Avoidance of Sending Unrequested Chunks

❖Nullify the work requests to avoid sending out unrequested chunks

▪ EC Calculator Cache

❖Initializing EC calculators is very expensive with Mellanox’s EC offload APIs

❖Static EC calculator cache vs. dynamic EC calculator cache

[Figure: Performance Impact of Calculator Cache (OSU RI2 Cluster). Y-axis: Latency (us), 0 to 7000. Series: No Cache, Dynamic Cache, Static Cache.]
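The slides do not define the two cache variants, so the sketch below shows one plausible reading: a dynamic cache creates a calculator on first use and reuses it afterwards, while a static cache builds all expected calculators up front. `Calculator` is a hypothetical stand-in for an expensive-to-initialize offload context and does not call any Mellanox API.

```python
class Calculator:
    """Hypothetical stand-in for an EC calculator with a costly setup."""

    created = 0  # counts initializations so the effect of caching is visible

    def __init__(self, k: int, m: int):
        Calculator.created += 1
        self.k, self.m = k, m


class DynamicCalculatorCache:
    """Creates a calculator the first time a configuration is requested."""

    def __init__(self):
        self._cache = {}

    def get(self, k: int, m: int) -> Calculator:
        key = (k, m)
        if key not in self._cache:
            self._cache[key] = Calculator(k, m)  # pay the setup cost once
        return self._cache[key]


class StaticCalculatorCache:
    """Builds every expected calculator at startup; lookups never initialize."""

    def __init__(self, configs):
        self._cache = {(k, m): Calculator(k, m) for k, m in configs}

    def get(self, k: int, m: int) -> Calculator:
        return self._cache[(k, m)]


dynamic = DynamicCalculatorCache()
for _ in range(1000):
    dynamic.get(6, 3)
print("initializations with dynamic cache:", Calculator.created)

static = StaticCalculatorCache([(3, 2), (6, 3)])
print("same object on repeated lookup:", static.get(6, 3) is static.get(6, 3))
```

Under this reading, the static variant moves all setup cost out of the request path, and the dynamic variant pays it only on the first request per configuration.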

EVALUATION


MICROBENCHMARK – TRIEC ENCODING


▪ TriEC reduces the encoding time with RS(12,4) by 23.5% - 45.1%

[Figure: Encoding Performance Comparisons for RS(12,4) with Diverse Chunk Sizes (OSU RI2 Cluster). Y-axis: Execution Time (ms), 0 to 3000. Series: BiEC, TriEC.]

MICROBENCHMARK – TRIEC ENCODING


▪ Evaluate TriEC with all widely used EC configurations

[Figure: Encoding Performance Comparisons with Varied Configurations and Chunk Sizes (OSU RI2 Cluster). Y-axis: Execution Time (ms), 0 to 6000. X-axis: chunk sizes from 512 B to 8 MB. Series: BiEC, TriEC. Panels and annotated speedups: (3,2) up to 1.29x; (6,3) up to 1.43x; (8,3) up to 1.61x; (10,4) up to 1.72x.]

MICROBENCHMARK – TRIEC DECODING


▪ TriEC reduces the decoding time with RS(12,4) by 18.6% - 64.6%

[Figure: Performance Comparisons for Recovering Four Data Chunks for RS(12,4) with Diverse Chunk Sizes (OSU RI2 Cluster). Y-axis: Execution Time (ms), 0 to 6000. Series: BiEC, TriEC.]

MICROBENCHMARK – TRIEC DECODING


▪ Evaluate TriEC with all widely used EC configurations

[Figure: Performance Comparisons for Recovering 𝑚 Data Chunks with Varied Configurations and Chunk Sizes (OSU RI2 Cluster). Y-axis: Execution Time (ms), 0 to 12000. X-axis: chunk sizes from 512 B to 8 MB. Series: BiEC, TriEC. Panels and annotated speedups: (3,2) with m=2, up to 2.33x; (6,3) with m=3, up to 2.26x; (8,3) with m=3, up to 2.17x; (10,4) with m=4, up to 2.16x.]

THROUGHPUT OF TRIEC-CACHE


▪ TriEC improves overall throughput by up to 13.3% for equal-shares, 14.8% for read-mostly, and 13.9% for read-only workloads.

[Figure: Throughput Comparisons for RS(6,3) (OSC Pitzer Cluster). Y-axis: Throughput (10³ ops/s), 0 to 120. Series: BiEC, TriEC. Value sizes: 1KB, 4KB, 16KB. Workloads: r:w=50:50, r:w=95:5, r:w=100:0. Panels: 1% of reads with one erasure; 1% of reads with three erasures.]

CONCLUSION


Haiyang Shi and Xiaoyi Lu. TriEC: Tripartite Graph Based Erasure Coding NIC Offload. In Proceedings of the 32nd International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2019. (Best Student Paper Finalist)

CONCLUSION AND FUTURE WORK


▪ A new EC NIC offload paradigm based on tripartite graph model (i.e., TriEC)

▪ A new receive-and-decode primitive on top of Mellanox EC NIC offload APIs

▪ A co-designed TriEC-based key-value store built on Memcached

▪ TriEC reduces the average write latency by up to 23.2% and the recovery time by up to 37.8%, even with only 1% failure occurrences

▪ In the future, we plan to apply TriEC to other storage systems and hardware platforms

ACKNOWLEDGEMENT


▪ We would like to thank the Ohio Supercomputer Center (OSC) for providing the cluster access.

▪ We would also like to thank Mellanox for donating SmartNICs.

▪ We also thank the anonymous reviewers for their valuable feedback on this work.

▪ This work is supported in part by National Science Foundation grant #CCF-1822987.

THANK YOU

Xiaoyi Lu, Research Assistant Professor

The Ohio State University
