
Giza: Erasure Coding Objects across Global Data Centers

Yu Lin Chen*ꭞ, Shuai Mu ꭞ, Jinyang Li ꭞ, Cheng Huang *, Jin li *, Aaron Ogus*, and Douglas Phillips*

†New York University, *Microsoft Corporation

USENIX ATC, Santa Clara, CA, July 2017

1

Azure data centers span the globe

2

Data center fault tolerance: Geo Replication

[Diagram: the blob is replicated in full to multiple data centers]

3

Data center fault tolerance: Erasure coding

[Diagram: the blob is erasure-coded into data fragments a and b plus parity p, each stored in a different data center]

4
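To make the fragments a, b, and p on this slide concrete, here is a minimal sketch of a 2+1 erasure code using XOR parity. This is illustrative only; a production system would use a more general Reed-Solomon-style code, and the function names are ours, not Giza's.

```python
# Minimal 2+1 erasure code sketch using XOR parity (illustrative, not Giza's codec).

def encode(blob: bytes) -> tuple[bytes, bytes, bytes]:
    """Split a blob into two data fragments plus one parity fragment."""
    half = (len(blob) + 1) // 2
    a = blob[:half]
    b = blob[half:].ljust(half, b"\0")       # pad b so both fragments have equal length
    p = bytes(x ^ y for x, y in zip(a, b))   # parity = a XOR b
    return a, b, p

def reconstruct_a(b: bytes, p: bytes) -> bytes:
    """Recover fragment a after losing the data center that stored it."""
    return bytes(x ^ y for x, y in zip(b, p))

a, b, p = encode(b"hello giza")
assert reconstruct_a(b, p) == a
```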

The cost of erasure coding

[Diagram: reading the blob requires reconstructing it from fragments a and b fetched across data centers]

5

When is cross DC coding a win?

1. Inexpensive cross DC communication.

2. Coding-friendly workload:

a) Stores a lot of data in large objects (less coding overhead; a rough overhead comparison follows this slide).

b) Infrequent reconstruction, whether due to reads or failures.

6
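A rough way to quantify the storage savings behind this slide, using illustrative parameters rather than numbers from the talk: full geo-replication to r data centers stores r copies of the object, while a cross-DC code with k data and m parity fragments stores (k + m)/k of the original size.

```python
# Back-of-the-envelope storage-overhead arithmetic (assumed parameters, not
# measured Giza numbers).

def replication_overhead(r: int) -> float:
    """Geo-replication: r full copies of the object."""
    return float(r)

def coding_overhead(k: int, m: int) -> float:
    """Cross-DC erasure coding: k data fragments plus m parity fragments."""
    return (k + m) / k

print(replication_overhead(2))     # 2.00x, e.g. replicating to a paired DC
print(coding_overhead(k=2, m=1))   # 1.50x, e.g. a 2+1 code across three DCs
print(coding_overhead(k=6, m=1))   # ~1.17x, e.g. a 6+1 code across seven DCs
```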

Cross DC communication is becoming less expensive

Largest bandwidth capacity today: 160 Tbps.

7

In search of a coding friendly workload

8

3-month trace: January – March 2016

OneDrive is occupied mostly by large objects

9

OneDrive is occupied mostly by large objects

Objects > 1GB occupy 53% of total storage capacity.

10

OneDrive is occupied mostly by large objects

Objects < 4MB occupy only 0.9% of total storage capacity.

11

OneDrive reads are infrequent after a month

12

OneDrive reads are infrequent after a month

47% of the total bytes read occur within the first day of object creation.

13

OneDrive reads are infrequent after a month

Less than 2% of the total bytes read occur after a month.

14

Outline

1. The case for erasure coding across data centers.

2. Giza design and challenges.

3. Experimental results.

15

Giza: a strongly consistent versioned object store

• Erasure code each object individually.

• Optimize latency for Put and Get.

• Leverage existing single-DC cloud storage API.

16
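As a sketch of what such a store exposes to applications, here is a hypothetical client-facing interface. The names and signatures are ours for illustration, not Giza's actual API.

```python
# Hypothetical client-facing surface for a strongly consistent, versioned object
# store in the spirit of Giza (illustrative only).

from typing import Optional, Protocol

class VersionedObjectStore(Protocol):
    def put(self, key: str, data: bytes) -> int:
        """Erasure-code data, spread fragments across DCs, return the new version."""
        ...

    def get(self, key: str, version: Optional[int] = None) -> bytes:
        """Return the requested version, or the latest committed one."""
        ...

    def delete(self, key: str, version: int) -> None:
        """Mark a version deleted; its fragments are garbage-collected later."""
        ...
```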

Giza’s architecture

[Diagram: three data centers (DC1, DC2, DC3), each running Giza nodes on top of Azure Table and Blob storage]

17

Giza’s workflow

Client: Put(Blob1, Data)

[Diagram: the client issues Put(Blob1, Data) to a Giza node; each of DC1, DC2, DC3 runs Giza nodes on top of Azure Table and Blob storage]

18

Giza’s workflow

Client: Put(Blob1, Data)

[Diagram: the Giza node erasure-codes the data and writes one fragment to each DC's blob store, e.g. Put(GUID1, Fragment A)]

19

Giza’s workflow

Client: Put(Blob1, Data)

[Diagram: the Giza node records the fragment locations {DC1: GUID1, DC2: GUID2, DC3: GUID3} as metadata in each DC's table store]

20
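Putting slides 18–20 together, a sketch of the data path. The per-DC blob clients and the erasure codec are assumed stand-ins, not a real Azure SDK or Giza code.

```python
# Sketch of the put data path from slides 18-20. `blob_stores` maps a DC name to
# an assumed per-DC blob client with a put(name, bytes) method; `encode` is an
# assumed erasure codec (e.g. the XOR sketch earlier).

import uuid
from typing import Callable, Dict, List

def put_data_path(
    blob_stores: Dict[str, object],
    data: bytes,
    encode: Callable[[bytes, int], List[bytes]],
) -> Dict[str, str]:
    fragments = encode(data, len(blob_stores))
    locations = {}
    for (dc, store), fragment in zip(blob_stores.items(), fragments):
        guid = str(uuid.uuid4())            # fresh name for this fragment
        store.put(guid, fragment)           # slide 19: Put(GUID1, Fragment A)
        locations[dc] = guid
    return locations                        # slide 20: {DC1: GUID1, DC2: GUID2, DC3: GUID3}
```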

Giza metadata versioning

Key      Metadata (version #: actual metadata)
Blob1    v1: md1

Let mdthis = {DC1: GUID4, DC2: GUID5, DC3: GUID6}

Key      Metadata (version #: actual metadata)
Blob1    v1: md1, v2: mdthis

21
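In plain Python terms, the metadata row on this slide might be modeled as below. The dict is a stand-in; the real row lives in Azure Table storage.

```python
# Stand-in for the versioned metadata row of slide 21: each version maps to the
# per-DC fragment GUIDs.

row = {
    "key": "Blob1",
    "versions": {
        1: {"DC1": "GUID1", "DC2": "GUID2", "DC3": "GUID3"},   # md1
    },
}

md_this = {"DC1": "GUID4", "DC2": "GUID5", "DC3": "GUID6"}

# Appending a new put as the next version; which metadata gets which version
# number is exactly what the consensus protocol on the following slides decides.
next_version = max(row["versions"]) + 1
row["versions"][next_version] = md_this
assert row["versions"] == {
    1: {"DC1": "GUID1", "DC2": "GUID2", "DC3": "GUID3"},
    2: {"DC1": "GUID4", "DC2": "GUID5", "DC3": "GUID6"},
}
```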

Inconsistency during concurrent put operations

[Diagram: two clients concurrently issue Put(Blob1, Data1) and Put(Blob1, Data2) against DC1, DC2, DC3]

22

Inconsistency during concurrent put operations

If each DC assigns version numbers in its local order of arrival, the DCs diverge: md1 arrives first at DC1 and DC3, while md2 arrives first at DC2.

DC1   Key: Blob1   Metadata: v1: md1
DC2   Key: Blob1   Metadata: v1: md2
DC3   Key: Blob1   Metadata: v1: md1

WARNING: INCONSISTENT STATE

23

Inconsistency during concurrent put operations

FAILURE: the data centers disagree on which metadata is version v1 (DC1 and DC3 hold md1, DC2 holds md2).

24
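A toy illustration of the race on slides 22–24 (not Giza code): if each DC simply assigns version numbers in its own arrival order, concurrent puts leave the DCs disagreeing about v1.

```python
# Uncoordinated version assignment diverges under concurrent puts.

def apply_in_arrival_order(arrivals):
    """Assign v1, v2, ... to metadata in whatever order it happens to arrive."""
    return {f"v{i}": md for i, md in enumerate(arrivals, start=1)}

# md1 and md2 race; the network delivers them in different orders per DC.
dc_state = {
    "DC1": apply_in_arrival_order(["md1", "md2"]),
    "DC2": apply_in_arrival_order(["md2", "md1"]),
    "DC3": apply_in_arrival_order(["md1", "md2"]),
}

# DC2 thinks v1 is md2 while DC1/DC3 think v1 is md1 -> inconsistent state.
assert len({state["v1"] for state in dc_state.values()}) > 1
```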

Choosing version value via consensus

All data centers must agree on the same version assignment. Either:

DC1, DC2, DC3   Key: Blob1   Metadata: v1: md1, v2: md2

OR

DC1, DC2, DC3   Key: Blob1   Metadata: v1: md2, v2: md1

25

Paxos and Fast Paxos properties

26

Paxos and Fast Paxos properties

(Classic Paxos needs two cross-DC round trips, prepare and accept; Fast Paxos skips the prepare phase and can commit in a single cross-DC round trip when there is no contention, falling back to the classic path on conflict.)

27

Implementing Fast Paxos with Azure table API

Let mdthis = ({DC1: GUID1, DC2: GUID2, DC3: GUID3}, Paxos states)

[Diagram: three data centers (DC1, DC2, DC3), each with Giza nodes on top of Azure Table and Blob storage]

28

Implementing Fast Paxos with Azure table API

[Diagram: the proposing Giza node sends the proposal (v2 = mdthis) to the metadata tables in the remote DCs]

29

Implementing Fast Paxos with Azure table API

Remote DC: read the metadata row (ETag = 1) and decide whether to accept or reject the proposal (v2 = mdthis).

30

Implementing Fast Paxos with Azure table API

Remote DC: write the metadata row back to the table if the proposal is accepted. Here the row's ETag has changed since the read (ETag = 2), so the conditional update fails.

31

Implementing Fast Paxos with Azure table API

Remote DC: write the metadata row back to the table if the proposal is accepted. Here the ETag hasn't changed (still 1), so the update succeeds.

32
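The accept step on slides 30–32 can be sketched with optimistic concurrency on the metadata row. The table client below is a stand-in with read(key) -> (row, etag) and update(key, row, if_match=etag) that raises PreconditionFailed when the ETag no longer matches; Azure Table storage offers this style of conditional update, but this is not the real SDK or Giza's implementation.

```python
class PreconditionFailed(Exception):
    """Raised by the stand-in table client when the ETag no longer matches."""

def try_accept(table, key, version, ballot, proposal):
    """One DC's accept step for (key, version); returns True if the proposal sticks."""
    while True:
        row, etag = table.read(key)                       # slide 30: read row + its ETag
        state = row["paxos"].setdefault(version, {"promised": 0, "accepted": None})
        if ballot < state["promised"]:
            return False                                  # reject: promised a higher ballot
        if state["accepted"] is not None and state["promised"] == ballot:
            return state["accepted"] == proposal          # this ballot already accepted a value
        state["promised"] = ballot
        state["accepted"] = proposal                      # accept v_n = mdthis
        try:
            table.update(key, row, if_match=etag)         # slide 32: ETag unchanged, write wins
            return True
        except PreconditionFailed:
            continue                                      # slide 31: ETag changed, re-read and retry
```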

Parallelizing metadata and data path

[Diagram: for a put, the metadata path (table stores) and the data path (blob stores) run in parallel across DC1, DC2, DC3]

33

Data path fails, metadata path succeeds

• During reads, Giza falls back to the latest version whose data is available.

Key      Metadata (version #: actual metadata)
Blob1    v1: md1, v2: md2   (v2 ongoing, data path failed)

34

Giza Put

• Execute metadata and data path in parallel.

• Upon success, return acknowledgement to the client.

• In the background, update highest committed version number in the metadata row.

Key      Committed    Metadata (version #: actual metadata)
Blob1    1            v1: md1, v2: md2   (v2 ongoing, not committed)

35
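A structural sketch of this put flow using asyncio. The metadata_path, data_path, and commit_version coroutines are assumptions supplied by the caller, standing in for Giza's real paths.

```python
# Structural sketch of Giza Put (slide 35), not the actual implementation.

import asyncio

async def giza_put(key, data, metadata_path, data_path, commit_version):
    # Run the metadata path (agree on the new version number) and the data path
    # (erasure-code and upload fragments) in parallel.
    version, _ = await asyncio.gather(
        metadata_path(key, data),
        data_path(key, data),
    )
    # Both succeeded: advance the highest committed version in the background
    # and acknowledge the client right away.
    asyncio.create_task(commit_version(key, version))
    return version
```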

Giza Get

• Optimistically read the latest committed version locally.

• Execute both data path and metadata path in parallel:

• Data path retrieves the data fragments indicated by the local latest committed version.

• Metadata path retrieves the metadata row from all data centers and validates the latest committed version.

36
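And a matching sketch of Giza Get, with the same caveats: the fetch helpers are assumed coroutines, not the real implementation.

```python
# Structural sketch of Giza Get (slide 36): optimistically fetch fragments for the
# locally known committed version while validating it against all DCs.

import asyncio

async def giza_get(key, local_committed_version, fetch_fragments, fetch_metadata):
    data, latest_committed = await asyncio.gather(
        fetch_fragments(key, local_committed_version),   # data path
        fetch_metadata(key),                             # metadata path: rows from all DCs
    )
    if latest_committed != local_committed_version:
        # The local view was stale; re-read using the validated latest version.
        data = await fetch_fragments(key, latest_committed)
    return data
```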

Outline

1. The case for erasure coding across data centers.

2. Giza design and challenges.

3. Experimental results.

37

Giza deployment configurations

US-2-1 (2 data fragments + 1 parity fragment, across US data centers)

38

Giza deployment configurations

US-6-1 (6 data fragments + 1 parity fragment, across US data centers)

39

Giza deployment configurations

World-2-1 (2 data fragments + 1 parity fragment, across world-wide data centers)

40

Giza deployment configurations

World-6-1 (6 data fragments + 1 parity fragment, across world-wide data centers)

41

Metadata Path: Fast vs Classic Paxos

[Chart: metadata path latencies of 23 ms, 32 ms, and 64 ms]

42

Metadata Path: Fast vs Classic Paxos

[Chart: metadata path latencies of 102 ms, 139 ms, and 240 ms]

43

Giza Put Latency (World-2-1)

44

Giza Get Latency (World-2-1)

45

Summary

• Recent trends in cross-DC bandwidth and cloud workloads support the economic feasibility of erasure coding across data centers.

• Giza is a strongly consistent versioned object store, built on top of Azure storage.

• Giza optimizes for latency and is fast in the common case.

46