Erasure Code Offload for Distributed Software Defined...
Transcript of Erasure Code Offload for Distributed Software Defined...
Dror Goldenberg
7th Annual TCE Conference – June 21-22/2017
Erasure Code Offload for
Distributed Software Defined Storage
© 2017 Mellanox Technologies 2- Mellanox Confidential -
Enable the Most Resilient Cloud Infrastructure
© 2017 Mellanox Technologies 3- Mellanox Confidential -
Cloud Infrastructure – The Software Defined Data Center
Resource virtualization
Efficient services
• VM, Containers
• uServices scale out arch
• DevOps automation
Tenant isolation & security
Visibility and telemetry
vCPU
vStorage (SDS)vMem
vAcceleratorvNetwork (SDN)
© 2017 Mellanox Technologies 4- Mellanox Confidential -
Software Defined Storage
Software Defined Datacenter
• Software Defined Networks
• Software Defined Storage
Scale Out Advantage
Scale-out Storage Examples
• Ceph, Swift, Gluster and many more
Compute
Network
Storage
© 2017 Mellanox Technologies 5- Mellanox Confidential -
Storage Feeds and Speeds Drive the Need for Efficient Network
Network BW is critical to enable
Scale Out
10
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
HDD SAS SSD NMVe NVDIMM
Storage Access Latency (ns)
Low latency is imperative for
Software Defined Storage
© 2017 Mellanox Technologies 6- Mellanox Confidential -
“To make storage cheaper we use lots more network!
How do we make Azure Storage scale? RoCE (RDMA over
Converged Ethernet) enabled at 40GbE for Windows
Azure Storage, achieving massive COGS savings”
Interconnect that Powers Azure Cloud Storage
Microsoft Keynote at Open Networking Summit 2014 on RDMA
KeynoteAlbert Greenberg, Microsoft
SDN Azure Infrastructure
© 2017 Mellanox Technologies 7- Mellanox Confidential -
RoCE enables Virtualized Infrastructure without Compromise
With RDMAWithout RDMA
Scalability
Reduce Overhead
<1usec Latency
VM to VM Communication
<2% CPU Utilization
Delivering I/O at 100Gb/s
Efficiency
Lower CPU Utilization
Performance
Higher Throughput & IOPS
6X The Throughput
Compared to iWARP
Achieve Tomorrow’s Cloud Efficiency Today
© 2017 Mellanox Technologies 8- Mellanox Confidential -
Data Availability
https://www.nextofwindows.com/raid-terms-explained-in-water-cooler
© 2017 Mellanox Technologies 9- Mellanox Confidential -
D3D0D0
Ensuring Data Availability
D0 D0 D0 D0 P0
D
D1
D2
D3
D1 D1
D2 D2
D3 D3
D1
D2
D3
P1
P2
P3
Data Replication Erasure Coding
© 2017 Mellanox Technologies 10- Mellanox Confidential -
Ensuring Data Availability
D0 D0 D0 D0 P0
D1
D2
D3
D1 D1
D2 D2
D3 D3
D1
D2
D3
P1
P2
P3
Data Replication Erasure Coding
© 2017 Mellanox Technologies 11- Mellanox Confidential -
Ensuring Data Availability - Summary
Capacity: 3x data typical
Resilient to 2 failures
Capacity: 1.4x data typical
Better failure resilience
• e.g. 10+4: 1.4x capacity, 4 failures
CPU & network hungry
Longer & data intensive rebuild
Partial update requires read
D
D
D
D
D
D
D
D
D
D
D
D
D0
D0
D0
D0
D1
D1
D1
D1
D2
D2
D2
D2
D3
D3
D3
D3
P0
P0
P0
P0
P1
P1
P1
P1
Data Replication Erasure Coding
© 2017 Mellanox Technologies 12- Mellanox Confidential -
Erasure Codes Theory
K data + M parity = N total
• Tolerates up to M failures
• Total overhead N/K
Systematic Codes – encoded data contains original data
Maximal Distance Separable (MDS)
• Can survive any M erasures
Reed Solomon Coding
• Many other codes exist: RAID 6, XOR, Pyramid, LRC, …
D D D D D P P P
K M
© 2017 Mellanox Technologies 13- Mellanox Confidential -
Reed Solomon Encoding
Generating parity is matrix multiplication
1 D0 D0
1 D1 D1
1 D2 D2
1 D3 D3
1 D4 D4
1 D5 D5
1 D6 D6
1 D7 D7
1 D8 D8
1 D9 D9
B11 B12 B13 B14 B15 B16 B17 B18 B19 B110 P0
B21 B22 B23 B24 B25 B26 B27 B28 B29 B210 P1
B31 B32 B33 B34 B35 B36 B37 B38 B39 B310 P2
B41 B42 B43 B44 B45 B46 B47 B48 B49 B410 P3
* =
B * D = {D,P}
k
k
m
© 2017 Mellanox Technologies 14- Mellanox Confidential -
Reed Solomon Decoding (1/2)
In case of failures
• Matrix is reduced
• Calculate Inverse matrix
• Recover Data
• Recover Parity
1 D0 D0
1 D1 D1
1 D2 D2
1 D3 D3
1 D4 D4
1 D5 D5
1 D6 D6
1 D7 D7
1 D8 D8
1 D9 D9
B11 B12 B13 B14 B15 B16 B17 B18 B19 B110 P0
B21 B22 B23 B24 B25 B26 B27 B28 B29 B210 P1
B31 B32 B33 B34 B35 B36 B37 B38 B39 B310 P2
B41 B42 B43 B44 B45 B46 B47 B48 B49 B410 P3
* =
B' * D = {D,P}'
k
k
m
© 2017 Mellanox Technologies 15- Mellanox Confidential -
Reed Solomon Decoding (2/2)
Recovery - how?
• B’ * D = S
• B’-1 * B’ * D = B’ -1 * S
• D=B’ -1 * S
• {D,P}=B * D
S = Survivors vector
1 D0 D0
1 D1 D1
1 D2 D2
1 D3 D4
1 D4 D5
1 D5 D7
1 D6 D8
B11 B12 B13 B14 B15 B16 B17 B18 B19 B110 D7 P0
B21 B22 B23 B24 B25 B26 B27 B28 B29 B210 D8 P1
B41 B42 B43 B44 B45 B46 B47 B48 B49 B410 D9 P3
* =
B' * D = {D,P}'= S
© 2017 Mellanox Technologies 16- Mellanox Confidential -
Algebra Magic (1/2) – Matrix Generation
All submatrices must be invertible
Vandermonde matrix is used as baseline
Derived matrix through elementary operations
1 1 1^2 1^3 1^4 … 1^(k-1) 1
1 2 2^2 2^3 2^4 … 2^(k-1) 1
1 3 3^2 3^3 … 1
1 4 4^2 … 1
. . 1
. . 1
. . 1
. . 1
1
1 k-1 1
1 k B11 B12 B13 B14 B15 B16 B17 B18 B19 B110
1 k+1 B21 B22 B23 B24 B25 B26 B27 B28 B29 B210
… … B31 B32 B33 B34 B35 B36 B37 B38 B39 B310
1 n-1 (n-1)^2 (n-1)^3 … B41 B42 B43 B44 B45 B46 B47 B48 B49 B410
Vandermonde Matrix Derived Matrix
k
k
m
k
k
m
© 2017 Mellanox Technologies 17- Mellanox Confidential -
Algebra Magic (2/2) – Arithmetic OPs
Needed: finite field, multiplicative inverse
Operation done over Galois Field – GF(2w)
• Sum is XOR operation
• Multiplication is more complicated…
- Numbers are multiplied and then divided by an irreducible polynomial
N must be ≤ 2w
© 2017 Mellanox Technologies 18- Mellanox Confidential -
High CPU Demand
Matrix multiplication compute needed
• O(k*m) multiply-add operations
Cache & TLB intensive (large data sets)
© 2017 Mellanox Technologies 19- Mellanox Confidential -
Erasure Code History
Évariste Galois 1811-1832
Alexandre-Théophile Vandermonde 1735 – 1796
Irving S. Reed 1923-2012
Gustave Solomon 1930-1996
http://selunec.deviantart.com/art/Evariste-Galois-189586719
http://www.giancarlodurso.altervista.org/blog/
interpolazione-polinomiale-con-matrice-di-vandermonde/
http://reedsolomon.tripod.com/
© 2017 Mellanox Technologies 20- Mellanox Confidential -
Network Traffic – Sunny Day Scenario (Replication)
Client OSD
Write
Write Ack
OSD OSD
Replications
NWrite Operation
Client OSD
Read
Read Reply
OSD OSD
NREAD Operation
Typically 2x cluster network traffic
(N-1)∙xNo extra cluster network traffic
© 2017 Mellanox Technologies 21- Mellanox Confidential -
Network Traffic – Sunny Day Scenario (Erasure Coding)
Client OSD
Read
Read Reply
K
OSD OSD OSD OSD OSD
M
Read
Shards
decode
READ Operation
Client OSD
Write
Write Ack
K
OSD OSD OSD OSD OSD
M
Write
Shards
encode
Write Operation
~1x cluster network traffic
((k-1)/k)∙x
Typically ~1.4x cluster network traffic
((k+m-1)/k) ∙x
© 2017 Mellanox Technologies 22- Mellanox Confidential -
Network Traffic – Recovery (Replication)
Example - Time to recover• Net networking time to move data
• 20TB system @40GE 1.1hrs
• 200TB system @40GE 11.1hrs
§ Similar flows for scrubbing
Client OSD OSD OSD OSD
Read
Read Reply
Recovery backend traffic 1x times lost data
© 2017 Mellanox Technologies 23- Mellanox Confidential -
Network Traffic – Recovery (Erasure Coding)
Example - Time to recover (10+4)
• Net networking time to move data
• 20TB system @40GE 14.4hrs
• 200TB system @40GE 144.4hrs
Similar flows for scrubbing
Client OSD
K
OSD OSD OSD OSD OSD
M
Read
Shards
OSD
decode
Recovery backend traffic typically ~14x times lost data
(K+M-1)∙x
Tradeoff:
recovery time
vs
storage efficiency
© 2017 Mellanox Technologies 24- Mellanox Confidential -
Typical Workflow - Long Write (Encode)
write(*data)
Data stripe
Encode
D D D D D
D D D D D P P P
Send
Calculation
B*D=P
Long WriteD
© 2017 Mellanox Technologies 25- Mellanox Confidential -
Typical Workflow - Reconstruct (Decode)
decode(*data)
Retrieve Available Elements
Decode (recover lost data)
D D D D D P P P
Send
CalculationD=B’-1*Survivors
D D D D P PD P
D D D D D
Calculation
B*D=PEncode (calculate parity)
© 2017 Mellanox Technologies 26- Mellanox Confidential -
Onload vs Offload
Onload
• Computation all done on CPU
• CPU at 100% during calculation
• Cache/TLB pollution
• Example: ISA-L
Offload
• Computation all done in accelerator
• CPU at 0% during calculation
• Cache/TLB unaffected
• Example: ec_offload APIs
© 2017 Mellanox Technologies 27- Mellanox Confidential -
Synchronous vs Asynchronous
Synchronous APIs block until operation completes
• Suitable to the current common API semantics
• CPU computes – runs to completion
• Onload operation – faster ISA, but still 100% CPU utilization
Asynchronous enables CPU to focus on computation
• Suitable to offload semantics
• Operation starts, completion reported upon callback
• Can implement Synchronous using Asynchronous calls
- Easy fit for today’s integrations
© 2017 Mellanox Technologies 28- Mellanox Confidential -
EC Offload APIs Cheat Sheet
Synchronous Asynchronous
Initialization ibv_exp_alloc_ec_calc()
ibv_exp_dealloc_ec_calc()-
Encode ibv_exp_ec_encode_sync() ibv_exp_ec_encode_async()
Decode ibv_exp_ec_decode_sync() ibv_exp_ec_decode_async()
Update ibv_exp_ec_update_sync() ibv_exp_ec_update_async()
Encode & Send - ibv_exp_ec_encode_send()
Book keeping - ibv_exp_ec_poll()
© 2017 Mellanox Technologies 29- Mellanox Confidential -
Encoding Performance (Single Core)
4x
Fa
ste
r
50
x L
ow
er C
PU
%
© 2017 Mellanox Technologies 30- Mellanox Confidential -
Encoding Performance (Single Core) - ARM
30
0x
90
x L
ow
er C
PU
%
© 2017 Mellanox Technologies 31- Mellanox Confidential -
Summary
Cloud infrastructure requires efficient and scalable storage
Software Defined Storage drives scale out storage
Erasure Codes enable data availability at lower capacity
• Tradeoff: CPU & Network intensive, complexity
Erasure Codes offload offers
• Better performance
• Lower cost
Library available today
• Integration to storage systems underway
Challenges
• Efficient use of asynchronous acceleration
• Combining Erasure Codes offload and networking
Thank You