Erasure Code Offload for Distributed Software Defined...

32
Dror Goldenberg 7 th Annual TCE Conference June 21-22/2017 Erasure Code Offload for Distributed Software Defined Storage

Transcript of Erasure Code Offload for Distributed Software Defined...

Page 1: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

Dror Goldenberg

7th Annual TCE Conference – June 21-22/2017

Erasure Code Offload for

Distributed Software Defined Storage

Page 2: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 2- Mellanox Confidential -

Enable the Most Resilient Cloud Infrastructure

Page 3: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 3- Mellanox Confidential -

Cloud Infrastructure – The Software Defined Data Center

Resource virtualization

Efficient services

• VM, Containers

• uServices scale out arch

• DevOps automation

Tenant isolation & security

Visibility and telemetry

vCPU

vStorage (SDS)vMem

vAcceleratorvNetwork (SDN)

Page 4: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 4- Mellanox Confidential -

Software Defined Storage

Software Defined Datacenter

• Software Defined Networks

• Software Defined Storage

Scale Out Advantage

Scale-out Storage Examples

• Ceph, Swift, Gluster and many more

Compute

Network

Storage

Page 5: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 5- Mellanox Confidential -

Storage Feeds and Speeds Drive the Need for Efficient Network

Network BW is critical to enable

Scale Out

10

100

1,000

10,000

100,000

1,000,000

10,000,000

100,000,000

HDD SAS SSD NMVe NVDIMM

Storage Access Latency (ns)

Low latency is imperative for

Software Defined Storage

Page 6: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 6- Mellanox Confidential -

“To make storage cheaper we use lots more network!

How do we make Azure Storage scale? RoCE (RDMA over

Converged Ethernet) enabled at 40GbE for Windows

Azure Storage, achieving massive COGS savings”

Interconnect that Powers Azure Cloud Storage

Microsoft Keynote at Open Networking Summit 2014 on RDMA

KeynoteAlbert Greenberg, Microsoft

SDN Azure Infrastructure

Page 7: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 7- Mellanox Confidential -

RoCE enables Virtualized Infrastructure without Compromise

With RDMAWithout RDMA

Scalability

Reduce Overhead

<1usec Latency

VM to VM Communication

<2% CPU Utilization

Delivering I/O at 100Gb/s

Efficiency

Lower CPU Utilization

Performance

Higher Throughput & IOPS

6X The Throughput

Compared to iWARP

Achieve Tomorrow’s Cloud Efficiency Today

Page 8: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 8- Mellanox Confidential -

Data Availability

https://www.nextofwindows.com/raid-terms-explained-in-water-cooler

Page 9: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 9- Mellanox Confidential -

D3D0D0

Ensuring Data Availability

D0 D0 D0 D0 P0

D

D1

D2

D3

D1 D1

D2 D2

D3 D3

D1

D2

D3

P1

P2

P3

Data Replication Erasure Coding

Page 10: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 10- Mellanox Confidential -

Ensuring Data Availability

D0 D0 D0 D0 P0

D1

D2

D3

D1 D1

D2 D2

D3 D3

D1

D2

D3

P1

P2

P3

Data Replication Erasure Coding

Page 11: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 11- Mellanox Confidential -

Ensuring Data Availability - Summary

Capacity: 3x data typical

Resilient to 2 failures

Capacity: 1.4x data typical

Better failure resilience

• e.g. 10+4: 1.4x capacity, 4 failures

CPU & network hungry

Longer & data intensive rebuild

Partial update requires read

D

D

D

D

D

D

D

D

D

D

D

D

D0

D0

D0

D0

D1

D1

D1

D1

D2

D2

D2

D2

D3

D3

D3

D3

P0

P0

P0

P0

P1

P1

P1

P1

Data Replication Erasure Coding

Page 12: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 12- Mellanox Confidential -

Erasure Codes Theory

K data + M parity = N total

• Tolerates up to M failures

• Total overhead N/K

Systematic Codes – encoded data contains original data

Maximal Distance Separable (MDS)

• Can survive any M erasures

Reed Solomon Coding

• Many other codes exist: RAID 6, XOR, Pyramid, LRC, …

D D D D D P P P

K M

Page 13: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 13- Mellanox Confidential -

Reed Solomon Encoding

Generating parity is matrix multiplication

1 D0 D0

1 D1 D1

1 D2 D2

1 D3 D3

1 D4 D4

1 D5 D5

1 D6 D6

1 D7 D7

1 D8 D8

1 D9 D9

B11 B12 B13 B14 B15 B16 B17 B18 B19 B110 P0

B21 B22 B23 B24 B25 B26 B27 B28 B29 B210 P1

B31 B32 B33 B34 B35 B36 B37 B38 B39 B310 P2

B41 B42 B43 B44 B45 B46 B47 B48 B49 B410 P3

* =

B * D = {D,P}

k

k

m

Page 14: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 14- Mellanox Confidential -

Reed Solomon Decoding (1/2)

In case of failures

• Matrix is reduced

• Calculate Inverse matrix

• Recover Data

• Recover Parity

1 D0 D0

1 D1 D1

1 D2 D2

1 D3 D3

1 D4 D4

1 D5 D5

1 D6 D6

1 D7 D7

1 D8 D8

1 D9 D9

B11 B12 B13 B14 B15 B16 B17 B18 B19 B110 P0

B21 B22 B23 B24 B25 B26 B27 B28 B29 B210 P1

B31 B32 B33 B34 B35 B36 B37 B38 B39 B310 P2

B41 B42 B43 B44 B45 B46 B47 B48 B49 B410 P3

* =

B' * D = {D,P}'

k

k

m

Page 15: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 15- Mellanox Confidential -

Reed Solomon Decoding (2/2)

Recovery - how?

• B’ * D = S

• B’-1 * B’ * D = B’ -1 * S

• D=B’ -1 * S

• {D,P}=B * D

S = Survivors vector

1 D0 D0

1 D1 D1

1 D2 D2

1 D3 D4

1 D4 D5

1 D5 D7

1 D6 D8

B11 B12 B13 B14 B15 B16 B17 B18 B19 B110 D7 P0

B21 B22 B23 B24 B25 B26 B27 B28 B29 B210 D8 P1

B41 B42 B43 B44 B45 B46 B47 B48 B49 B410 D9 P3

* =

B' * D = {D,P}'= S

Page 16: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 16- Mellanox Confidential -

Algebra Magic (1/2) – Matrix Generation

All submatrices must be invertible

Vandermonde matrix is used as baseline

Derived matrix through elementary operations

1 1 1^2 1^3 1^4 … 1^(k-1) 1

1 2 2^2 2^3 2^4 … 2^(k-1) 1

1 3 3^2 3^3 … 1

1 4 4^2 … 1

. . 1

. . 1

. . 1

. . 1

1

1 k-1 1

1 k B11 B12 B13 B14 B15 B16 B17 B18 B19 B110

1 k+1 B21 B22 B23 B24 B25 B26 B27 B28 B29 B210

… … B31 B32 B33 B34 B35 B36 B37 B38 B39 B310

1 n-1 (n-1)^2 (n-1)^3 … B41 B42 B43 B44 B45 B46 B47 B48 B49 B410

Vandermonde Matrix Derived Matrix

k

k

m

k

k

m

Page 17: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 17- Mellanox Confidential -

Algebra Magic (2/2) – Arithmetic OPs

Needed: finite field, multiplicative inverse

Operation done over Galois Field – GF(2w)

• Sum is XOR operation

• Multiplication is more complicated…

- Numbers are multiplied and then divided by an irreducible polynomial

N must be ≤ 2w

Page 18: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 18- Mellanox Confidential -

High CPU Demand

Matrix multiplication compute needed

• O(k*m) multiply-add operations

Cache & TLB intensive (large data sets)

Page 19: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 19- Mellanox Confidential -

Erasure Code History

Évariste Galois 1811-1832

Alexandre-Théophile Vandermonde 1735 – 1796

Irving S. Reed 1923-2012

Gustave Solomon 1930-1996

http://selunec.deviantart.com/art/Evariste-Galois-189586719

http://www.giancarlodurso.altervista.org/blog/

interpolazione-polinomiale-con-matrice-di-vandermonde/

http://reedsolomon.tripod.com/

Page 20: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 20- Mellanox Confidential -

Network Traffic – Sunny Day Scenario (Replication)

Client OSD

Write

Write Ack

OSD OSD

Replications

NWrite Operation

Client OSD

Read

Read Reply

OSD OSD

NREAD Operation

Typically 2x cluster network traffic

(N-1)∙xNo extra cluster network traffic

Page 21: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 21- Mellanox Confidential -

Network Traffic – Sunny Day Scenario (Erasure Coding)

Client OSD

Read

Read Reply

K

OSD OSD OSD OSD OSD

M

Read

Shards

decode

READ Operation

Client OSD

Write

Write Ack

K

OSD OSD OSD OSD OSD

M

Write

Shards

encode

Write Operation

~1x cluster network traffic

((k-1)/k)∙x

Typically ~1.4x cluster network traffic

((k+m-1)/k) ∙x

Page 22: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 22- Mellanox Confidential -

Network Traffic – Recovery (Replication)

Example - Time to recover• Net networking time to move data

• 20TB system @40GE 1.1hrs

• 200TB system @40GE 11.1hrs

§ Similar flows for scrubbing

Client OSD OSD OSD OSD

Read

Read Reply

Recovery backend traffic 1x times lost data

Page 23: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 23- Mellanox Confidential -

Network Traffic – Recovery (Erasure Coding)

Example - Time to recover (10+4)

• Net networking time to move data

• 20TB system @40GE 14.4hrs

• 200TB system @40GE 144.4hrs

Similar flows for scrubbing

Client OSD

K

OSD OSD OSD OSD OSD

M

Read

Shards

OSD

decode

Recovery backend traffic typically ~14x times lost data

(K+M-1)∙x

Tradeoff:

recovery time

vs

storage efficiency

Page 24: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 24- Mellanox Confidential -

Typical Workflow - Long Write (Encode)

write(*data)

Data stripe

Encode

D D D D D

D D D D D P P P

Send

Calculation

B*D=P

Long WriteD

Page 25: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 25- Mellanox Confidential -

Typical Workflow - Reconstruct (Decode)

decode(*data)

Retrieve Available Elements

Decode (recover lost data)

D D D D D P P P

Send

CalculationD=B’-1*Survivors

D D D D P PD P

D D D D D

Calculation

B*D=PEncode (calculate parity)

Page 26: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 26- Mellanox Confidential -

Onload vs Offload

Onload

• Computation all done on CPU

• CPU at 100% during calculation

• Cache/TLB pollution

• Example: ISA-L

Offload

• Computation all done in accelerator

• CPU at 0% during calculation

• Cache/TLB unaffected

• Example: ec_offload APIs

Page 27: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 27- Mellanox Confidential -

Synchronous vs Asynchronous

Synchronous APIs block until operation completes

• Suitable to the current common API semantics

• CPU computes – runs to completion

• Onload operation – faster ISA, but still 100% CPU utilization

Asynchronous enables CPU to focus on computation

• Suitable to offload semantics

• Operation starts, completion reported upon callback

• Can implement Synchronous using Asynchronous calls

- Easy fit for today’s integrations

Page 28: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 28- Mellanox Confidential -

EC Offload APIs Cheat Sheet

Synchronous Asynchronous

Initialization ibv_exp_alloc_ec_calc()

ibv_exp_dealloc_ec_calc()-

Encode ibv_exp_ec_encode_sync() ibv_exp_ec_encode_async()

Decode ibv_exp_ec_decode_sync() ibv_exp_ec_decode_async()

Update ibv_exp_ec_update_sync() ibv_exp_ec_update_async()

Encode & Send - ibv_exp_ec_encode_send()

Book keeping - ibv_exp_ec_poll()

Page 29: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 29- Mellanox Confidential -

Encoding Performance (Single Core)

4x

Fa

ste

r

50

x L

ow

er C

PU

%

Page 30: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 30- Mellanox Confidential -

Encoding Performance (Single Core) - ARM

30

0x

90

x L

ow

er C

PU

%

Page 31: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

© 2017 Mellanox Technologies 31- Mellanox Confidential -

Summary

Cloud infrastructure requires efficient and scalable storage

Software Defined Storage drives scale out storage

Erasure Codes enable data availability at lower capacity

• Tradeoff: CPU & Network intensive, complexity

Erasure Codes offload offers

• Better performance

• Lower cost

Library available today

• Integration to storage systems underway

Challenges

• Efficient use of asynchronous acceleration

• Combining Erasure Codes offload and networking

Page 32: Erasure Code Offload for Distributed Software Defined Storagetce.webee.eedev.technion.ac.il/.../07/Goldenberg.pdf · Erasure Codes enable data availability at lower capacity • Tradeoff:

Thank You