Koh-i-Noor

1

Koh-i-Noor

Mark Manasse, Chandu Thekkath

Microsoft Research

Silicon Valley

04/19/23

Motivation

Large-scale storage systems can be expensive to build and maintain: Cost of managing the storage, Cost of provisioning the storage.

Total cost of ownership is dominated by management cost, but provisioning 10,000 spindles worth of useful

data is expensive, especially with redundant storage.

3

Rationale / Mathematics I

Hotmail has storage needs approaching a petabyte.

Currently built from difficult-to-manage components. 1,000 10-disk RAID arrays Requires significant backup

effort, incessant replacement of failed drives, etc.

Reliability could be improved by mirroring. But 10,000 mirrored drives is

a lot of work to manage. For either system, difficult to

expand gracefully, since the storage is in small, isolated clumps.

The Galois Field of order pk (for p prime) is formed by considering polynomials in Z/Zp[x] modulo a primitive polynomial of degree k.

Facts Any primitive polynomial will

do; all the resulting fields are isomorphic.

We write GF(pk) to denote one such field.

x is a generator of the field. Everything you know about

algebra is still true.

4

Reducing the Total Cost of Ownership

Reduce management cost: Use automatic management, automatic load balancing,

incremental system expansion, fast backup. (Focus of earlier research and the Sepal project.)

Reduce provisioning cost: Use redundancy schemes that tolerate multiple failed

components without the cost of duplicating them as is done in mirroring or triplexing.

5

Rationale / Mathematics II

There are existing product that improve storage management.

It would be difficult to use them to build petabyte storage systems that can support something like Hotmail.

Need thousands of these components and even with squadrons of perl and java programmers, it would be difficult to configure and manage such a system.

These products lack formal management interfaces that can be used by external programs or each other to reliably manage the system.

A Vandermonde matrix is of the form

The determinant of a Vandermonde matrix is (a-b)(a-c)···(a-z)(b-c)···(b-z)···(y-z).

kkkk zcba

zcba

zcba

...

.......

.......

.......

...

...

1...111

2222

6

Large-Scale Mirroring

Assume disk mean-time-to-failure of 50 years. Mirroring needs 20,000 disks for 10,000 data disks. Expect 8 disks to fail every week.

Assume 1 day to repair a disk by copying frommirror.

Mean time before data is unavailable (because of a double failure) is 45 years.

Cost of storage is double what is actually needed.

7

Rationale / Mathematics III

Goal: build a distributed block-level storage system that is highly-available, scalable, and easy to manage.

Design each component of the system to automate or eliminate many management decisions that are today made by human beings.

Automatically tolerate and recover from exceptional conditions such as component failures that traditionally require human intervention.

Facts about determinants det(AB) = det(A) det(B) det(A | ka) = k det(A | a)

Facts about GF(256) a + b = a XOR b. Every element other than 0 is xk,

for some 0≤k≤255. If a=xk and b=xj, then ab=xk+j. With a 512-byte table of

logarithms and a 1024-byte table of anti-logarithms, multiplication becomes trivial.

A byte of data, viewed as an element of GF(256), multiplied and added with other bytes still occupies a byte of storage.

8

Improving Time to Data Unavailability

Consider 10,000 disks, each with 50 years MTTF. ~ 50 years with mirroring at 2x provisioning cost.

May be too short a period at too high a price. ~ ½M years with triplexing at 3x provisioning cost.

Very long at very high price. An alternative: use Reed-Solomon Erasure Codes.

40 clusters of 256 disks each. Can tolerate triple (or more) failures . ~50K years before data unavailability. ~1.03x privisioning cost.

9

Rationale IV

Suppose the instantaneous probability of a disk being in a failed state is p (computed as MTTR/MTTF).

The probability of k disks failed is pk. The probability of finding k disks failed out of j is

(j C k)pk=q. Cluster MTTF=MTTR/q; with N total disks,

System MTTF =~ k! MTTFk/N (j MTTR)k-1 (if j >> k). RAID sets

N=10,000, k=2, j=10, MTTF=50 years, MTTR=1 day, System MTTF=5023651/50000 =~ 9 years.

Duplexing N=20,000, k=2, j=2, System MTTF=5023651/20000 years =~45 years.

Triplexing N=30,000, k=3, j=3, System MTTF=5033652/30000 years =~ 555000 years.

Erasure codes N=10,000, k=4, j=256, System MTTF=~ 5043653 24/2563 10000 years =~ 43500

years. Is that enough?

10

)det())...()((

0)(...)()(

.......

.......

.......

0)(...)()(

01)(...1)(1)(

11...11

det

...

.......

.......

...

...

...

11...11

det

)det(

1...00

.1.....

.......

.......

00...0

00...1

00...01

det)det(

11211

1212111

1212111

11211

11111

112

1211

11

121

311

231

22

321

21

31

11211

212

2211

21

1111211

1

1

1

1

1

kkkkk

kkkk

kk

kk

kkkkk

kkkk

kkk

kkk

kk

kkk

kkk

kk

kkkkkkkk

kkkkkkkk

kkkkkk

k

k

k

k

k

Vxxxxxx

xxxxxxxxx

xxxxxxxxx

xxxxxx

xxxxxxxxxxxx

xxxxxxxxxxxx

xxxxxxxxxxxx

xxxxxxxx

V

x

x

x

V

Inductive step proving the determinant of a Vandermonde matrix is the product of differences.

Determinant here is 1.

Expand on last column, after removing common factors, what’s left if Vk.

11

Reed-Solomon Erasure Codes

3

2

1

3

2

1

3

2

1

242

2

.

.

.

.

.1

.1

1.111

1.000

.....

0.100

0.010

0.001

c

c

c

d

d

d

d

d

d

d

d

xxx

xxx

n

nn

n

3

2

1

3

2

1

3

2

1

242

2

.

.

.

.

.1

.1

1.111

1.000

.....

0.100

0.010

0.001

c

c

c

d

d

d

d

d

d

d

d

xxx

xxx

n

nn

n

2. Suppose data disks 2,3 and check disk 3 fail.

2

1

5

4

1

3

2

1

432

.

.

.

.

.1

1.11111

1.00000

.......

0.10000

0.01000

0.00001

c

c

d

d

d

d

d

d

d

d

xxxxx

n

nn

4. Multiplying both sides by R-1, we recover all the data.

2

1

5

4

1

1

432

3

2

1

.

.1

1.11111

1.00000

.......

0.10000

0.01000

0.00001

.

.

.

c

c

d

d

d

d

xxxxxd

d

d

d

n

nn

3. Omitting failed rows, we get an invertible n×n matrix R.

1. We use an n×(n+k) coding matrix to store data on n data disks and k check disks. (k=3 in our example)

12

Rationale / Mathematics Va

The matrices were transposed so that the data would be in column vectors, which fits better on the slide.

The correction matrix in the example uses my special erasure code for k = 3.

The correction rows are taken from a Vandermonde matrix. As you might recognize, had

you been reading the other half of the slides.

For general k, we take a Vandermonde matrix using elements 0, 1, …, 255.

The product of element differences is non-zero, because all the individual element differences are non-zero.

Make it tall enough for all the data disks.

Diagonalize the square part, and use the remaining columns for check disks.

Row operations change the determinant in simple ways.

13

Rationale / Mathematics Vb

Computing check digits is easy, and well-parallelized.

To update a block on data disk j, compute the XOR of the old block with the new block.

Multicast the log of the XOR together with j to the check disks (and have them start reading the block off disk).

Each check disk multiplies the XOR’ed data block by the j’th entry in their column, XOR’s that with the old check value, and writes the new check block.

Rotate which disks are check disks, making the load uniform.

No difference in latency compared to RAID-5 (but double the disk bandwidth).

For k=3, we can just take the identity matrix plus three columns of Vandermonde, using 1, x, x2.

Computation of check digits is faster: the logarithm for the k’th check disk and j’th data disk is kj.

We omit non-failed data-disks in computing determinants.

What remains is a small minor of the whole matrix.

1x1 works because all entries are non-zero.

3x3 works because it’s a trasposed Vandermonde matrix.

2x2 works because it’s either a Vandermonde matrix, or a Vandermonde matrix with columns multiplied by powers of x.

14

Parallel Reconstruction of a Failed Disk

D2 ← XOR

XOR XOR

D1 × (1+x+x2+x3+x7) D4 × x C1 × (1+x2+x4+x5+x6+x7) C2 × (x+x3+x4+x5+x6)

Given the inverse matrix (as above), a failed disk is the sum (exclusive or) of products of the individual bytes on data and check disks with the elements of the matrix.

The example below shows reconstruction of disk 2, in a system with 4 data disks and 2 check disks; disks 2 and 3 failed.

15

Rationale / Mathematics VI

The reconstruction on the previous slide requires: Reading all of the surviving

disks. Parallel multiplication by

precomputed entries from the inverse matrix.

XOR of pairs of disks entries working up a binary tree.

A network of small, cheap switches provides the necessary connectivity.

Put hot spare CPUs and drives into network; skip up tree at nodes with no useful partner.

Given nature of matrix, the inverse matrix can be constructed via Gaussian elimination very quickly for small values of k.

With more complicated erasure coding (2D parity, for example), it’s harder to see how to wire the network to accommodate hot-spares without requiring more from the interconnect.

16

Related Work

Reed-Solomon error correcting codes have been used for single disks and CDs.

Reed-Solomon and Even-Odd erasure codes have been proposed: For software RAID [ABC], For wide area storage systems [LANL].

Our usage of the code is different.Parallel reconstruction is novel.Patents in progress on both.

17

Potential deficiencies

This could be a bad idea, depending on what you’re trying to do: The bandwidth in at the top of the tree isn’t enough to keep all

the disk arms busy with small reads and writes. This could be a bad disk for building databases.

The total disk bandwidth at least doubles during degraded operation, which might affect many more disks than with mirroring or standard RAID-5.

Failures may not be independent, invalidating predicted reliability.

CPUs (which, if you attach multiple disks, should serve disks in different pods) may fail too often, in ways that rebooting doesn’t fix.

Power supplies, cables, switches, … may fail, which we haven’t accounted for.

Writes are going to be relatively expensive.

18

Comparisons and future work

Other people have already built small erasure-coded arrays. Is there enough new here? We think so.

Real systems “needs”: Backup, Geographical redundancy.

Koh-i-Noor

Documents

Transcript of Koh-i-Noor