Fault-Tolerance, Fast and Slow: Exploiting Failure Asynchrony in Distributed Systems

Ramnatthan Alagappan, Aishwarya Ganesan, Jing Liu, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau

OSDI ‘18

Replication Protocols

Paxos, Viewstamped replication, Raft

Foundation upon which datacenter systems are built: GFS, Colossus, BigTable

The Two Different Worlds of Replication

How and where to store system state?

World-1: Disk-durable
  synchronously persist updates to disks
  Paxos, Raft [ATC ‘14], ZAB [DSN ‘11], Gaios [NSDI ‘11], ZooKeeper, etcd, LogCabin …
  safe but suffer from poor performance

World-2: Memory-durable
  buffer updates only in volatile memory
  Viewstamped replication, NOPaxos [OSDI ‘16], SpecPaxos [NSDI ’15] …
  performant but risk unsafety or unavailability

Neither approach is ideal: reliable or performant, but not both

Can a protocol provide strong reliability while maintaining high performance?


SAUCR: Situation-Aware Updates and Crash Recovery

Simple insight: dynamically (based on the situation) decide how to commit updates
  with many or all nodes up, buffer in memory – fast mode
  with failures, if only bare majority up, flush to disk – slow mode

Strong reliability while maintaining high performance

Simultaneity of Failures

SAUCR’s effectiveness depends upon simultaneity of failures

independent and non-simultaneous correlated failures (gap of a few milliseconds to a few seconds)
  can react and switch from fast to slow mode
  preserves durability and availability

many truly simultaneous correlated failures
  no gap and so cannot react
  remain unavailable
  however, existing data hints they are extremely rare – the Non-Simultaneity Conjecture

Results

Implemented in ZooKeeper

SAUCR improves reliability compared to memory-durable systems
  durable and available in 100s of crash scenarios
  memory-durable loses data or becomes unavailable

Improvements at no or little cost
  overheads within 0%-9% of memory-durable systems

Compared to disk-durable
  slight reduction in availability in extremely rare cases
  improves performance – 2.5x on SSDs, 100x on HDDs

Outline

Introduction
Distributed updates and crash recovery
  disk-durable protocols
  memory-durable protocols
Situation-aware updates and crash recovery
Results
Summary and conclusion

Disk-Durable Protocols

(Figure: a leader and two followers, each holding A=1; the client sends update A=2 to the leader, and each node fsyncs A=2 to disk.)

Update: committed once fsync completed on a majority

if ack’d anyone, data on disk – safe

Recovery: just read from local disk; ready immediately, even when lagging

Safe and available

But poor performance due to fsync – 50x on HDDs, 2.5x on SSDs
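The disk-durable commit rule above can be sketched in a few lines. This is only an illustration, not ZooKeeper's or any real protocol's code: the "nodes" are hypothetical local log files, and `persist`, `replicate_disk_durable`, and `recover` are names invented for this sketch.

```python
import os
import tempfile

def persist(log_path, record):
    """Append one record and force it to stable storage before returning."""
    with open(log_path, "ab") as log:
        log.write(record + b"\n")
        log.flush()             # push Python's buffer to the OS
        os.fsync(log.fileno())  # ask the OS to push it to the device
    return True                 # this node may now acknowledge the update

def replicate_disk_durable(log_paths, record):
    """Return True once a majority of nodes have fsync'ed the record."""
    majority = len(log_paths) // 2 + 1
    acks = 0
    for path in log_paths:      # hypothetical: each node is a local file
        if persist(path, record):
            acks += 1
    return acks >= majority

def recover(log_path):
    """Recovery in this model: just read the local log back."""
    with open(log_path, "rb") as log:
        return log.read().splitlines()

with tempfile.TemporaryDirectory() as d:
    paths = [os.path.join(d, f"node{i}.log") for i in range(3)]
    committed = (replicate_disk_durable(paths, b"A=1")
                 and replicate_disk_durable(paths, b"A=2"))
    recovered = recover(paths[0])
```

Every acknowledged update costs at least one `os.fsync` per node, which is where the slowdown on the slide comes from.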

Memory-Durable Protocols (Oblivious Recovery)

(Figure: a leader and two followers, each holding A=1 in memory; the client sends update A=2, and the nodes buffer A=2 in memory.)

Update: committed once buffered on a majority

Recovery: oblivious – doesn’t realize loss on failure; ready immediately

e.g., ZooKeeper with forceSync = false
  practitioners do use this config!

Performant

But can lead to data loss

Data Loss Example in Oblivious Approach

A=1 committed (three nodes hold A=1); two nodes slow or failed

one node that holds A=1 crashes, recovers

loses its data but oblivious: immediately joins

lagging nodes along with recovered node form majority; lose committed update

majority do not know of previously committed update
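The scenario above can be replayed as a tiny simulation. It is a toy model under stated assumptions: a five-node cluster, a hypothetical `Node` class whose `memory` dictionary stands in for volatile state, and node names S1 to S5 chosen for illustration.

```python
def majority(n):
    return n // 2 + 1

class Node:
    def __init__(self, name):
        self.name = name
        self.memory = {}   # volatile: wiped by a crash
        self.up = True

    def crash_and_recover_obliviously(self):
        self.memory = {}   # buffered updates are gone
        self.up = True     # rejoins at once, unaware of the loss

nodes = [Node(f"S{i}") for i in range(1, 6)]

# A=1 is buffered on S1..S3 (a majority of 5), so it is committed.
for n in nodes[:3]:
    n.memory["A"] = 1
committed = sum("A" in n.memory for n in nodes) >= majority(len(nodes))

# S4 and S5 were slow and never saw the update; S1 crashes and recovers.
nodes[0].crash_and_recover_obliviously()

# S1, S4, S5 now form a majority in which nobody knows about A.
new_quorum = [nodes[0], nodes[3], nodes[4]]
quorum_is_majority = len(new_quorum) >= majority(len(nodes))
update_known = any("A" in n.memory for n in new_quorum)
```

The update was committed, yet a legitimate majority exists that has never heard of it, which is the data loss the slide describes.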

Memory-Durable Protocols (Loss-Aware Recovery)

(Figure: a leader and two followers, each holding A=1 in memory; the client sends update A=2, and the nodes buffer A=2 in memory.)

Update: committed once buffered on a majority

Recovery: loss-aware – realizes loss, waits for majority
  recovering: wait for majority responses, then ready

e.g., Viewstamped replication

Avoids loss (unlike oblivious) but can lead to unavailability

Unavailability Example in Loss-Aware Approach

A=1 committed (three nodes hold A=1); two nodes crashed

one more node crashes, recovers

cannot collect majority responses although majority up – unavailable

failed nodes recover, stay in recovering

unavailable even after all nodes recover
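The same kind of toy model shows how loss-aware recovery gets stuck. The sketch assumes a five-node cluster and assumes that only nodes in the normal state answer recovery requests; the state names and the helper `try_finish_recovery` are invented for illustration.

```python
NORMAL, RECOVERING, DOWN = "normal", "recovering", "down"

def majority(n):
    return n // 2 + 1

def try_finish_recovery(states, node):
    """A recovering node becomes normal only after a majority of
    nodes in the normal state respond (assumed rule, see lead-in)."""
    responders = sum(1 for s in states.values() if s == NORMAL)
    if states[node] == RECOVERING and responders >= majority(len(states)):
        states[node] = NORMAL
    return states[node]

def available(states):
    normal = sum(1 for s in states.values() if s == NORMAL)
    return normal >= majority(len(states))

states = {f"S{i}": NORMAL for i in range(1, 6)}
states["S4"] = states["S5"] = DOWN         # two nodes crashed
states["S1"] = RECOVERING                  # a third crashes, then restarts
try_finish_recovery(states, "S1")
stuck_with_majority_up = states["S1"] == RECOVERING

states["S4"] = states["S5"] = RECOVERING   # failed nodes restart too
for name in ("S1", "S4", "S5"):
    try_finish_recovery(states, name)
available_after_all_recover = available(states)
```

Only S2 and S3 ever remain in the normal state, so no recovering node can collect three responses and the cluster stays unavailable even with all five processes running.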

Outline

Introduction
Distributed updates and crash recovery
Situation-aware updates and crash recovery
  SAUCR insights, guarantees, and overview
  situation-aware updates
  situation-aware crash recovery
Results
Summary and conclusion

SAUCR Intuition and Insight

Existing protocols are static in nature: do not adapt to failures
  Memory-durable: always buffer, even with many failures – poor reliability
  Disk-durable: always persist, even when no failures – poor performance

Insight: reacting to failures and adapting to situation can achieve reliability and performance
  when no or few failures, could buffer in memory
  when failures arise, flush

common case, when many or all up: buffer in memory (memory-durable)
with failures, when only minimum up: flush to disk (disk-durable)

Guarantees Depend upon Simultaneity of Failures

With non-simultaneous, gap exists, SAUCR can react and ensures durability
  independent: likelihood of many nodes failing together is negligible
  correlated: many nodes fail together
    although many nodes fail, not necessarily simultaneous; most cases, non-simultaneous

With simultaneous correlated, no gap, SAUCR cannot react, unavailable

We conjecture they are extremely rare: a gap exists between failures
  correlated but a few seconds apart [Ford et al., OSDI ‘10]
  analysis reveals a gap of 50 ms or more almost always

Most cases: any no. of independent and non-simultaneous correlated – same as disk-durable
Rare cases: more than a majority crash truly simultaneously – remain unavailable

SAUCR Overview

Updates
  when more than a majority up, buffer updates in memory – fast mode
    e.g., 4 or 5 nodes up in a 5-node cluster
  when nodes fail and only a bare majority alive, flush to disk – slow mode
    e.g., only 3 nodes up in a 5-node cluster

Crash Recovery
  when a node recovers from a crash, it recovers its data
    either from its disk (if crashed in slow mode)
    or from other nodes (if crashed in fast mode)

Situation-Aware Updates

(Figure: a sequence of five-node cluster states; L marks the leader.)

all nodes up: fast mode - buffer updates

4 nodes up (more than majority): remain in fast mode

only majority up: switch to slow, flush to disk

commit subsequent updates in slow mode

one node recovers and catches up; switch to fast
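The walkthrough above reduces to a comparison against the bare majority. A minimal sketch, assuming a five-node cluster; `bare_majority` and `choose_mode` are hypothetical names, and the "unavailable" result for fewer than a majority is this sketch's assumption rather than something the slide states.

```python
def bare_majority(cluster_size):
    return cluster_size // 2 + 1

def choose_mode(cluster_size, nodes_up):
    if nodes_up < bare_majority(cluster_size):
        return "unavailable"   # assumed: no majority, cannot commit at all
    if nodes_up == bare_majority(cluster_size):
        return "slow"          # bare majority: flush to disk
    return "fast"              # more than a bare majority: buffer in memory

# Replay the slide: 5 up, 4 up, 3 up, then one node recovers.
trace = [(up, choose_mode(5, up)) for up in (5, 4, 3, 4)]
```

Here `trace` walks through fast, fast, slow, and back to fast, matching the sequence of cluster states on the slide.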

Failure Reaction

Basic failure-detection mechanism: heartbeats

Follower failures: remain in fast mode; switch to slow mode

Leader failures: on a missing heartbeat, followers flush to disk

(Figure labels: Leader; steps down)

Challenges: too many packets, spurious elections, too much data to flush
  Techniques in the paper …

Result: can react to failures even when they are only a few milliseconds apart, preserving durability and availability
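The follower side of the heartbeat mechanism might look like the sketch below. It is a simplification that ignores the challenges the slide lists (packet volume, spurious elections, flush size); the `Follower` class, its method names, and the 10 ms timeout are all assumptions for illustration, and a list stands in for the on-disk log.

```python
class Follower:
    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout  # seconds (assumed value)
        self.last_heartbeat = 0.0
        self.buffer = []   # updates held only in memory
        self.disk = []     # stand-in for the on-disk log

    def on_update(self, update):
        self.buffer.append(update)

    def on_heartbeat(self, now):
        self.last_heartbeat = now

    def on_tick(self, now):
        """Flush buffered updates if the leader's heartbeat is overdue."""
        overdue = now - self.last_heartbeat > self.heartbeat_timeout
        if overdue and self.buffer:
            self.disk.extend(self.buffer)   # a real system would fsync here
            self.buffer.clear()
            return True
        return False

f = Follower(heartbeat_timeout=0.010)
f.on_heartbeat(0.000)
f.on_update("A=2")
flushed_early = f.on_tick(0.005)   # heartbeat still fresh
flushed_late = f.on_tick(0.020)    # heartbeat missed: flush
```

Passing the clock in as `now` keeps the sketch deterministic; a real follower would read a monotonic clock.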

Mode-Aware Crash Recovery

Disk-durable: always recover from disk
Memory-durable: always recover from other nodes (loss-aware)

SAUCR
  recover from local disk: ready immediately
  lost updates: recover from other nodes; ready after a bare minority (bare majority - 1) responses
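The recovery choice can be written as a small decision function. This sketch assumes the restarting node already knows which mode it crashed in (how it learns that is not covered on the slide); `recovery_plan` and its return values are hypothetical.

```python
def bare_majority(cluster_size):
    return cluster_size // 2 + 1

def recovery_plan(cluster_size, crashed_in_mode):
    """Return (source, responses_needed) for a node restarting after a crash."""
    if crashed_in_mode == "slow":
        return ("local disk", 0)   # data was flushed: ready immediately
    # Fast mode: buffered updates were lost, so ask peers.
    bare_minority = bare_majority(cluster_size) - 1
    return ("other nodes", bare_minority)
```

In a five-node cluster, `recovery_plan(5, "fast")` asks for two responses, one fewer than the three a loss-aware protocol such as viewstamped replication waits for.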

Intuition for why SAUCR’s recovery is safe

Assume update-A committed, S1 recovers and has seen A before crash

Safety condition: update-A must be recovered

If update-A was committed in fast mode, then at least one in any bare minority must contain update-A

If update-A was committed in slow mode, S1 recovers from disk

Proof sketch in the paper …
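The fast-mode claim is a counting argument that can be checked by brute force. The sketch assumes that a fast-mode commit places the update on one node more than a bare majority, which is consistent with the later note that SAUCR writes to one additional node but is not stated on this slide. It also ignores which nodes are currently up, so it illustrates the intuition and is not the paper's proof. The function name is hypothetical.

```python
from itertools import combinations

def every_bare_minority_sees_update(cluster_size, copies):
    """True if every group of (bare majority - 1) nodes includes
    at least one of the `copies` nodes that hold the update."""
    nodes = range(cluster_size)
    bare_minority = cluster_size // 2   # equals bare majority - 1
    for holders in combinations(nodes, copies):
        for responders in combinations(nodes, bare_minority):
            if not set(holders) & set(responders):
                return False
    return True

fast_mode_ok = every_bare_minority_sees_update(5, 4)       # bare majority + 1 copies
plain_majority_ok = every_bare_minority_sees_update(5, 3)  # only a bare majority
```

With four copies in a five-node cluster only one node lacks the update, so any two responders include a holder; with three copies the two nodes that lack it could be the only responders.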


Evaluation

We implement SAUCR in ZooKeeper

Compare SAUCR’s reliability and performance against
  disk-durable ZooKeeper (forceSync = true)
  memory-durable ZooKeeper (forceSync = false)
  viewstamped replication (ideal model)

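Reliability Testing

Cluster crash-testing framework

Generates cluster-state sequences

(Figure: cluster states of a five-node cluster, from 12345 through 1234, 1235, 1245, 1345, 2345 and 123, 124, 125, …, 345 down to pairs such as 12, 13, 14, 45, … and single nodes 1, 2, 3, 4, 5.)

How does it work? Please see our paper …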
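One way to picture cluster-state sequences is to enumerate them. The sketch below is not the authors' framework, which is described in the paper. It assumes that each step crashes or recovers exactly one node; `successors`, `sequences`, and `label` are hypothetical helpers.

```python
def successors(state, all_nodes):
    """States reachable by crashing or recovering exactly one node."""
    crashed = [state - {n} for n in state]
    recovered = [state | {n} for n in all_nodes - state]
    return crashed + recovered

def sequences(all_nodes, length):
    """All sequences of `length` cluster states, starting with every node up."""
    start = frozenset(all_nodes)
    paths = [[start]]
    for _ in range(length - 1):
        paths = [p + [s] for p in paths for s in successors(p[-1], start)]
    return paths

def label(state):
    """Render a state the way the slide does, e.g. {1, 2, 3, 5} as '1235'."""
    return "".join(str(n) for n in sorted(state))

second_states = sorted(label(p[-1]) for p in sequences({1, 2, 3, 4, 5}, 2))
```

For five nodes, `second_states` contains the five four-node states shown in the second row of the figure, and the number of sequences grows quickly with length.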

Reliability Results

System                      Non-Simultaneous                   Simultaneous
                            Correct  Unavailable  Data loss    Correct  Unavailable  Data loss
memory-durable zookeeper    703      0            561          703      0            561
viewstamped replication     217      1047         0            217      1047         0
disk-durable zookeeper      1264     0            0            1264     0            0
SAUCR                       1264     0            0            1200     64           0

non-simultaneous: gap of 50 ms, simultaneous: no gap

memory-durable zookeeper silently loses data

viewstamped replication leads to permanent unavailability

SAUCR reacts to non-simultaneous – durable and available

other systems behave the same as non-simultaneous cases

simultaneous: SAUCR by design remains unavailable in some cases

Macro-benchmark Performance: YCSB-load

(Figure: throughput (KOps/s) on HDD and SSD for memory-durable ZooKeeper, SAUCR, and disk-durable ZooKeeper; annotations: 100x on HDD, 2.5x on SSD, 9% on each.)

Compared to disk-durable, both memory-durable and SAUCR are faster

SAUCR’s performance matches memory-durable ZooKeeper
  within 9% of memory-durable ZooKeeper even for write-intensive workloads
  overheads because SAUCR writes to one additional node

Summary

Replication protocols are an important foundation
  need to be performant, yet also provide high reliability

Dichotomy: disk-durable vs. memory-durable protocols
  unsavory choices: either performant or reliable

SAUCR – situation-aware updates and crash recovery
  provides both high performance and reliability

Conclusions

Paying careful attention to how failures occur
  can find approaches that provide both performance and reliability
  more data from real-world deployments?

Hybrid approach – an effective systems-design technique – applicable to distributed updates and recovery too
  worthwhile to look at other important protocols/systems where we make similar two-ends-of-the-spectrum tradeoffs?

Thank you!
Poster #6