Separating Data and Metadata for Robustness and Scalability
Yang Wang, University of Texas at Austin
Goal: A better storage system
• Data is important.
• Data keeps growing.
• Data is accessed in different ways.
Challenge: achieve multiple goals simultaneously
• Robustness
– Durable and available despite failures
• Scalability
– Thousands of machines or more
• Efficiency
– Good performance at a reasonable cost
Solution
Separating data and metadata
My work
• Design: Gnothi (small scale, crash failures) and Salus (large scale, arbitrary failures)
• Evaluation: Exalt
How to design?
• Problem: stronger protection -> higher cost
• Key observation:
– Data: big (4 KB to several MBs)
– Metadata: small (tens of bytes); it can validate the data (see the sketch below)
• Solution:
– Strong protection for metadata -> robustness
– Minimal replication for data -> scalability and efficiency
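As a toy illustration of why tens of bytes of metadata can validate a much larger block (a sketch with assumed field names, not the format any of these systems actually uses), the metadata can carry a hash of the data:

```python
import hashlib

def make_metadata(block_no: int, client_id: int, data: bytes) -> dict:
    # Tens of bytes of metadata describing a 4 KB to multi-MB data block.
    return {
        "block_no": block_no,
        "client_id": client_id,
        "checksum": hashlib.sha256(data).digest(),
    }

def validate(metadata: dict, data: bytes) -> bool:
    # A node holding only the strongly protected metadata can check whether
    # a data block fetched from a minimally replicated copy is the right one.
    return hashlib.sha256(data).digest() == metadata["checksum"]
```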
How to evaluate?
• Exalt: evaluate large-scale storage systems on small-to-medium platforms.
Outline
• Gnothi: Efficient and Available Storage Replication
– Small scale; tolerates crash faults and timing errors
• Salus: Robust and Scalable Block Store
– Large scale; tolerates arbitrary failures
• Exalt: Evaluating large-scale storage systems
Resolving a long-standing trade-off
• Efficiency
– Write to f+1 nodes and read from 1 node
• Robustness
– Availability: aggressive timeout for failure detection
– Consistency: a read returns the data of the latest write
(The slide contrasts synchronous primary backup, asynchronous replication, and Gnothi on these properties.)
Gnothi Overview
Gnothi resolves the trade-off, but only for block storage:
– A fixed number of fixed-size blocks.
– A request reads or writes a single block.
Key ideas:
– Don't insist that nodes have identical state.
– A node knows which blocks are fresh and which are stale.
Gnothi Seauton – Know yourself
Separating Data and Metadata
• Clients send requests to 2f+1 storage nodes over a LAN.
• A write request carries the data block plus a small piece of metadata (blockNo, client ID, ...).
• Metadata size: about 24 bytes for a block of 4 KB to 1 MB.
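As a rough illustration of how small that metadata is, the record below packs the fields the slide lists into 24 bytes; the exact layout, and the third field (a per-block version number), are assumptions, not Gnothi's actual format.

```python
import struct

# Hypothetical 24-byte metadata record: the slide lists blockNo and client ID;
# the third 8-byte field (a version/sequence number) is an assumed placeholder.
META_FMT = "!QQQ"                      # network byte order: three unsigned 64-bit ints
META_SIZE = struct.calcsize(META_FMT)  # == 24 bytes

def pack_meta(block_no: int, client_id: int, version: int) -> bytes:
    return struct.pack(META_FMT, block_no, client_id, version)

def unpack_meta(buf: bytes) -> tuple:
    return struct.unpack(META_FMT, buf)

# 24 bytes describe a block of 4 KB to 1 MB, so fully replicating metadata
# (2f+1 copies) is cheap compared to replicating the data itself.
```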
Rest of Gnothi
• Why is the trade-off challenging?
• How does Gnothi resolve the trade-off?
• How well does Gnothi perform?
Why is the trade-off challenging?
How to handle a timeout? Can we have both f+1 replication and a short timeout?
• Synchronous primary backup (Remus, HBase, Hypervisor, ...): continues with 1 node after a timeout, so it must use a conservative timeout.
• Asynchronous replication (Paxos, ...): sends to 2f+1 nodes and waits for f+1 ACKs.
• Partial replication to f+1 nodes:
– Continue with 1 node? Not safe.
– Wait? Not live.
– Switch to another node (Cheap Paxos, ZZ, ...)? The state of the newly enlisted node may be incomplete; one solution is to copy all data to the new node on a switch, which gives bad availability.
Rest of Gnothi
• Why is the trade-off challenging?
• How does Gnothi resolve the trade-off?
• How well does Gnothi perform?
Gnothi: nodes can be incomplete
• A new write will overwrite the block anyway.
• A read can be processed correctly as long as the node knows which blocks are stale.
• Recovery can be processed correctly as long as a node knows which version of a block is the latest.
[Figure: a node that lacks the current version of block 2 rejects a read of block 2; during recovery it fetches block 2 and writes the latest version.]
How does Gnothi work?
How to perform writes and reads efficiently when no failures occur?
- Write to f+1 and read from 1
How to continue processing requests during failures?
- Still write to f+1 and read from 1
How to recover the failed node efficiently?
How to perform writes and reads efficiently when no failures occur?
• Maintain a single bit for each block: "Do I have the current data?"
• Data is replicated f+1 times.
• Metadata ensures a read can be processed correctly.
[Figure: three nodes; for each block, f+1 nodes hold both data and metadata, and the remaining node holds only metadata.]
Gaios: Bolosky et al. NSDI 2011
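A minimal sketch of this bookkeeping (illustrative only, not Gnothi's real implementation; the per-block version number is an assumed field):

```python
# Each replica stores metadata for every block but data only for the blocks
# shipped to it; one bit per block records "do I have the current data?".
class Replica:
    def __init__(self):
        self.version = {}   # block_no -> latest version known (metadata, on all 2f+1 nodes)
        self.data = {}      # block_no -> (version, bytes), on the f+1 nodes holding the data
        self.fresh = {}     # block_no -> the single "is my copy current?" bit

    def write(self, block_no, version, data=None):
        # Metadata is applied everywhere; data only where it was sent.
        self.version[block_no] = version
        if data is not None:
            self.data[block_no] = (version, data)
        self.fresh[block_no] = data is not None

    def read(self, block_no):
        # Serve the read only if the local copy is current; otherwise the
        # client retries at a replica whose bit is set.
        if self.fresh.get(block_no):
            return self.data[block_no][1]
        return None
```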
Load-balanced data distribution
• Divide the space into multiple slices.
• Evenly distribute slices to different preferred nodes.
[Figure: Gnothi block drivers reach three storage nodes over a LAN; each node devotes preferred storage to the slices it is preferred for and keeps reserve storage for the others.]
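One simple way to get such an even spread (an assumed scheme for illustration, not necessarily Gnothi's exact policy) is a round-robin assignment of each slice's f+1 preferred nodes:

```python
# Assign each slice f+1 preferred nodes out of 2f+1, rotating the starting
# node so every node is preferred for roughly the same number of slices.
def preferred_nodes(slice_id: int, num_nodes: int, f: int) -> list:
    return [(slice_id + i) % num_nodes for i in range(f + 1)]

# With f = 1 and 3 nodes: slice 0 -> [0, 1], slice 1 -> [1, 2], slice 2 -> [2, 0].
for s in range(3):
    print(s, preferred_nodes(s, num_nodes=3, f=1))
```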
How to continue processing requests during failures?
• A write does not wait for data or metadata transfer to the failed node; reads continue as well.
• Metadata is replicated 2f+1 times.
• Metadata allows a node to process requests correctly.
[Figure: with one of the three nodes down, the remaining nodes keep serving writes and reads.]
Catch-up problem in recovery
• Recovery speed vs. execution speed: traditional systems have the catch-up problem.
[Figure: a recovering node asks "Can I catch up?" while the other nodes keep executing new requests.]
How to recover the failed node efficiently?
• Separate metadata recovery from data recovery:
– Phase 1: metadata recovery, which is fast.
– Phase 2: data recovery, which is slow and runs in the background.
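A simplified sketch of the two phases (assumed interfaces, reusing the fields of the Replica sketch above; not Gnothi's actual recovery code):

```python
def recover(node, peer):
    # Phase 1: metadata recovery (fast). Learn the latest version of every
    # block and mark blocks whose local data is missing or outdated as stale.
    # For brevity this asks one up-to-date peer; a real system would consult
    # enough replicas to see the latest writes.
    for block_no, version in peer.all_metadata():
        node.version[block_no] = version
        local = node.data.get(block_no)
        node.fresh[block_no] = local is not None and local[0] == version
    node.online = True   # illustrative flag: the node can already serve new requests

    # Phase 2: data recovery (slow, in the background). Fetch the stale blocks.
    for block_no, is_fresh in list(node.fresh.items()):
        if not is_fresh:
            version, data = peer.fetch(block_no)
            # Install only if no newer write arrived in the meantime.
            if version == node.version[block_no]:
                node.data[block_no] = (version, data)
                node.fresh[block_no] = True
```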
Rest of Gnothi
• Why is the trade-off challenging?
• How does Gnothi resolve the trade-off?
• How well does Gnothi perform?
Evaluation
• Throughput
– Compare to G', a system modeled on Gaios (Bolosky et al., NSDI 2011).
– Sequential/random reads and writes.
– f=1 (Gnothi-3 vs. G'-3) and f=2 (Gnothi-5 vs. G'-5).
– Block sizes of 4 KB, 64 KB, and 1 MB.
• Failure recovery
– Compare Gnothi to G' and Cheap Paxos.
– How long does recovery take?
– What is the client throughput during recovery?
Gnothi achieves higher throughput
Gnothi achieves 40%-64% more write throughput than G' and scalable read throughput.
Higher throughput during recovery
Gnothi does not block for long on failures and achieves 100%-200% more throughput during recovery.
[Figure: throughput over time around a kill and restart; Cheap Paxos blocks while copying data, Gnothi does not block, and both complete recovery at almost the same time.]
Gnothi can always catch up
Recovery speed is tunable, and in Gnothi the recovering node can always catch up with the others.
[Figure: throughput (MB/s) over time for Gnothi and G'; Gnothi catches up at every recovery speed, while G' cannot catch up.]
Gnothi conclusion
• Separate data and metadata
– Replication: improves efficiency and ensures availability during failures.
– Recovery: ensures the recovering node can catch up.
Outline
• Gnothi: Efficient and Available Storage Replication
– Small scale; tolerates crash faults and timing errors
• Salus: Robust and Scalable Block Store
– Large scale; tolerates arbitrary failures
• Exalt: Evaluating large-scale storage systems
Problem: Not enough machines
• In practice
– WAS at Microsoft: 60 PB
– HDFS at Facebook: 4,000 servers
– ...
• In research
– Salus: 100 servers
– COPS: 300 servers
– Spanner: 200 servers
Research should go beyond practice, yet research prototypes are evaluated at a far smaller scale than deployed systems.
Public testbeds
• Utah Emulab: 588 machines
• CMU Emulab: 1,024 machines
• TACC (Texas Advanced Computing Center): 6,400 machines, but not enough storage
• Amazon EC2: cost $1,400 for our 108-server Salus experiment
Solution 1: Extrapolation
• Measure with a small cluster.
• Predict the bottleneck.
• Assumption: resource consumption grows linearly with scale.
• Problem: the assumption may not hold.
[Figure: resource utilization vs. scale; with CPU at 10% and network at 5% on 100 nodes, linear extrapolation predicts the system can scale to about 1,000 nodes, but actual utilization may grow non-linearly.]
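The arithmetic behind that prediction, and why it is fragile, is just a linear projection to the point where the busiest resource saturates (a toy illustration, not a tool from the talk):

```python
def extrapolated_max_scale(measured_nodes: int, utilization: dict) -> int:
    # Linear assumption: per-node resource cost is constant, so the system
    # saturates when the most-utilized resource would reach 100%.
    worst = max(utilization.values())
    return round(measured_nodes / worst)

# CPU at 10% and network at 5% on 100 nodes -> predicts about 1,000 nodes.
print(extrapolated_max_scale(100, {"cpu": 0.10, "network": 0.05}))
# If per-node CPU cost actually grows with scale, this estimate is too optimistic.
```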
Solution 2: Stub
• Build stub components to simulate real components
• Problem: a stub component can be as complex as the original one.
Solution 3: Simulation
• Problem: a simulation does not run the real code.
Exalt: Evaluate 10,000 nodes on 100 machines
• Run real code
• Use fewer resources
• Seems impossible?
– In general, yes.
– But for storage systems with big data, we can achieve it.
Key insight
• I/O is the bottleneck.
• However, the content of the data does not matter.
• Solution:
– Choose a highly compressible data pattern.
– Build emulated I/O devices that compress the data.
[Figure: an emulated network compresses a block of one million zeros on the sender and decompresses it at the receiver.]
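In its simplest form the trick looks like the toy channel below (an illustration of the idea, not Exalt's implementation): a run of zero bytes is shipped as just its length.

```python
# Emulated channel: all-zero payloads travel as a length; anything else is raw.
def emulate_send(payload: bytes):
    if payload.count(0) == len(payload):
        return ("zeros", len(payload))       # "1 million zeros" becomes a few bytes
    return ("raw", payload)

def emulate_recv(msg) -> bytes:
    kind, body = msg
    return b"\x00" * body if kind == "zeros" else body

assert emulate_recv(emulate_send(b"\x00" * 1_000_000)) == b"\x00" * 1_000_000
```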
Challenge
• The system may add metadata.
• The system may split the data (possibly nondeterministically).
• Existing approaches are either inaccurate or inefficient on such mixed patterns.
[Figure: the zero-filled data ends up as several chunks with system metadata interleaved.]
Goals
• Cannot lose metadata.
• High compression ratio.
• Computationally efficient.
• Works with the mixed data/metadata pattern.
Existing approaches
• David (FAST '11): discards file content, so it loses metadata that is mixed with the data.
• Gzip and similar compressors: not computationally efficient.
• Writing all zeros and scanning for zero runs: still not efficient enough.
Solution: Tardis
• Key: we cannot choose the metadata, but we can choose the data, so make the data distinguishable from the metadata.
• The data pattern consists of a magic sequence of bytes that does not appear in metadata, followed by an integer giving the number of data bytes that remain.
Tardis compression:
1. Search for the magic sequence.
2. Retrieve the number of bytes left (Nleft) and jump Nleft bytes.
3. Search for the magic sequence again.
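A sketch of the scheme (the 8-byte magic value and the exact block layout are assumptions; it also assumes the system never splices its own bytes into the middle of a run, the case the next slide handles with a check and a binary search):

```python
MAGIC = b"\xde\xad\xbe\xef\xfe\xed\xfa\xce"   # assumed 8-byte magic, absent from metadata
HDR = len(MAGIC) + 8                          # marker plus 8-byte "bytes left" count

def make_block(block_len: int) -> bytes:
    """Writer side: MAGIC, a count of the bytes left, then that many zero bytes."""
    n = block_len - HDR
    return MAGIC + n.to_bytes(8, "big") + b"\x00" * n

def compress(stream: bytes) -> bytes:
    """Keep metadata verbatim; after each marker, jump over the Nleft data bytes."""
    out, i = bytearray(), 0
    while i < len(stream):
        j = stream.find(MAGIC, i)
        if j < 0:
            out += stream[i:]                  # trailing metadata
            break
        out += stream[i:j]                     # metadata before the marker
        n = int.from_bytes(stream[j + len(MAGIC):j + HDR], "big")
        out += stream[j:j + HDR]               # keep marker + count for decompression
        i = j + HDR + n                        # jump the n pattern bytes
    return bytes(out)

def decompress(compressed: bytes) -> bytes:
    """Re-expand every marker back into its run of pattern bytes."""
    out, i = bytearray(), 0
    while i < len(compressed):
        j = compressed.find(MAGIC, i)
        if j < 0:
            out += compressed[i:]
            break
        out += compressed[i:j]
        n = int.from_bytes(compressed[j + len(MAGIC):j + HDR], "big")
        out += compressed[j:j + HDR] + b"\x00" * n
        i = j + HDR
    return bytes(out)

# Metadata the storage system wraps around the block survives untouched.
stream = b"hdr:" + make_block(4096) + b":ftr"
assert decompress(compress(stream)) == stream and len(compress(stream)) < 50
```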
Problems
• How to find a magic sequence?
– A randomly chosen 8-byte sequence works for HDFS.
– Run the system, record a trace, and analyze it.
• What if the system inserts metadata into the data?
– After jumping, check whether the expected pattern matches the jumped bytes.
– If not, binary-search until a match is found.
Using Exalt
• Emulated devices do not have accurate performance.
• If one or a few nodes are the bottleneck:
– Run those nodes in real mode.
– Run the other nodes in emulation mode.
• What if the behavior depends on a large number of nodes (e.g., 99th-percentile latency or parallel recovery)?
– Then we need to model the behavior of the emulated devices.
[Figure: the device model maps the number of bytes transferred to disk/network latency and energy consumption.]
Implementation
• Bytecode Instrumentation (BCI)
• Emulated devices:
– Disk (transparent)
– Network (transparent)
– Memory (requires code modification)
Preliminary results on HDFS
Proposed work
• Apply “separating data and metadata” to active storage in Salus
• Complete Exalt:
– Incorporate latency modeling.
– Apply Exalt to more applications.
– Complete the Tardis implementation.
• Multiple-RSM communication
– Join the project led by Manos.
– Not part of my thesis.
Publications
"Robustness in the Salus scalable block store". Y. Wang, M. Kapritsos, Z. Ren, P. Mahajan, J. Kirubanandam, L. Alvisi, and M. Dahlin, in NSDI 2013.
"All about Eve: Execute-Verify Replication for Multi-Core Servers". M. Kapritsos, Y. Wang, V. Quema, A. Clement, L. Alvisi, and M. Dahlin, in OSDI 2012.
"Gnothi: Separating Data and Metadata for Efficient and Available Storage Replication". Y. Wang, L. Alvisi, and M. Dahlin, in USENIX ATC 2012.
"UpRight Cluster Services". A. Clement, M. Kapritsos, S. Lee, Y. Wang, L. Alvisi, M. Dahlin, T. Riche, in SOSP 2009.
Backup slides
Cost of Gnothi
• Higher write latency:
– In a LAN, the major latency comes from the disk.
– Write metadata and data together to disk.
– "Rethink the Sync"-style writes should also help.
• Loss of generality:
– Gnothi is designed only for block storage.
How does Gnothi compare to GFS/HDFS/xFS/… ?
• Those systems have a metadata server and multiple data servers.
• Gnothi updates metadata for every write and checks metadata for every read.
• They do that at a coarse granularity.
– Advantages: high scalability.
– Disadvantages: weaker consistency guarantees, an append-only interface, worse availability, ...
Efficient Recovery
• Recovery speed vs. execution speed: traditional systems have the catch-up problem.
[Figure: a recovering node asks "Can I catch up?" while the other two nodes keep executing.]
Is timing error a real threat?
• Timing errors can cause data inconsistency.
• Causes: network partitions, server overloading, ...
• A real concern in practical systems, e.g. HBASE-2238: "Because HDFS and ZK are partitioned (in the sense that there's no communication between them) and there may be an unknown delay between acquiring the lock and performing the operation on HDFS you have no way of knowing that you still own the lock, like you say."
Interface & Models
• Disk interface
– A fixed number of fixed-size blocks.
– A request reads or writes a single block.
– Linearizable reads and writes.
• Asynchronous model: no maximum delay.
– Omission failures only.
– Always safe.
– Live when the network is synchronous.
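A minimal sketch of that disk interface (names are illustrative, not Gnothi's API):

```python
# Fixed number of fixed-size blocks; each request touches exactly one block.
class BlockStore:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]

    def write(self, block_no: int, data: bytes) -> None:
        assert len(data) == self.block_size
        self.blocks[block_no] = data

    def read(self, block_no: int) -> bytes:
        return self.blocks[block_no]
```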
Architecture
• Fully replicated metadata.
• Partially replicated data, load-balanced across preferred storage and reserve storage.
[Figure: each slice of the virtual disk has preferred replicas that hold its data and a reserve replica that normally holds only metadata.]
• Data can be stored outside its preferred replicas: after a network problem, the metadata may record that replica 0 does not have the current data and only replica 2 has it.
Gnothi: Available and Efficient
• Availability: same as asynchronous replication.
– Safe regardless of timing errors.
– Can use an aggressive timeout.
• Efficiency:
– Storage/bandwidth efficiency: write to f+1 replicas.
– Read efficiency: read from 1 replica.
[Figure: applications use Gnothi block drivers to reach Gnothi storage servers over a LAN.]
Previous work cannot achieve both
• Synchronous primary backup (Remus, Hypervisor, HBase, ...): uses conservative timeouts.
• Preferred quorum (Cheap Paxos, ZZ, ...): uses cold backups.
• Gaios: scalable reads.
• Asynchronous replication (Paxos, ...): uses 2f+1 replicas.
• Gnothi: separating data and metadata.
[Figure: each approach plotted on availability vs. efficiency (read and storage/bandwidth) axes; only Gnothi reaches both.]
Resolving a long-standing trade-off
• Efficiency: write to f+1 replicas and read from 1 replica.
• Availability: aggressive timeout for failure detection.
• Consistency: a read always returns the data of the latest write.
[Figure: comparison of synchronous primary backup, asynchronous replication, and Gnothi (this talk).]
Catch-up problem in recovery
• Recovery speed vs. execution speed: traditional systems have the catch-up problem.
• Traditional approaches fetch the missing data before processing new requests; the recovering node cannot catch up, so the system has to block or throttle.
[Figure: node 1 fails and recovers while nodes 2 and 3 keep executing.]
Separate Metadata and Data Recovery
• Metadata recovery: fast. Data recovery: slow, in the background.
• The recovering node can process new requests as soon as metadata recovery completes.
[Figure: node 1 recovers by first fetching metadata from nodes 2 and 3.]
Separate Metadata and Data Recovery
• Data recovery then runs in the background; once it completes, the reserve storage on the other nodes is released.
[Figure: node 1 fetches its missing data from nodes 2 and 3 in the background.]
Gnothi ensures catch-up
• Gnothi fetches the missing metadata before processing new requests, whereas traditional approaches fetch the missing data.
• Node 1 is never left behind after metadata recovery.
[Figure: node 1 fails and recovers; it resumes processing as soon as its metadata is restored from nodes 2 and 3.]
How does Gnothi work?
• How to perform writes and reads efficiently when no failures occur?
• How to continue processing requests during failures?
• How to recover the failed node efficiently?