Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault...
Transcript of Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault...
![Page 1: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/1.jpg)
Redundancy Does Not Imply Fault Tolerance:Analysis of Distributed Storage Reactions
to Single Errors and Corruptions
Aishwarya Ganesan, Ramnatthan Alagappan,
Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
![Page 2: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/2.jpg)
Redundancy for Fault Tolerance
2
![Page 3: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/3.jpg)
Redundancy for Fault Tolerance
2
Replication can mask failures
![Page 4: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/4.jpg)
Redundancy for Fault Tolerance
2
Replication can mask failures
- System crashes
![Page 5: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/5.jpg)
Redundancy for Fault Tolerance
2
Replication can mask failures
- System crashes
- Machine reboots
![Page 6: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/6.jpg)
Redundancy for Fault Tolerance
2
Replication can mask failures
- Network failures
- System crashes
- Machine reboots
![Page 7: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/7.jpg)
Redundancy in Distributed Storage
3
![Page 8: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/8.jpg)
Redundancy in Distributed Storage
3
Depend on local file systems to store data
![Page 9: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/9.jpg)
Redundancy in Distributed Storage
3
How about partial storage faults?
Depend on local file systems to store data
![Page 10: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/10.jpg)
Redundancy in Distributed Storage
3
File system may return corrupted data on reads
How about partial storage faults?
Depend on local file systems to store data
![Page 11: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/11.jpg)
Redundancy in Distributed Storage
3
File system may return corrupted data on reads
How about partial storage faults?
Depend on local file systems to store data
![Page 12: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/12.jpg)
Redundancy in Distributed Storage
3
File system may return corrupted data on reads
- disk block corruption
How about partial storage faults?
Depend on local file systems to store data
![Page 13: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/13.jpg)
Redundancy in Distributed Storage
3
File system may return corrupted data on reads
- disk block corruption
File system may return I/O error on a read
How about partial storage faults?
Depend on local file systems to store data
![Page 14: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/14.jpg)
Redundancy in Distributed Storage
3
File system may return corrupted data on reads
- disk block corruption
File system may return I/O error on a read
- latent sector error
How about partial storage faults?
Depend on local file systems to store data
![Page 15: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/15.jpg)
Redundancy in Distributed Storage
3
File system may return corrupted data on reads
- disk block corruption
File system may return I/O error on a read
- latent sector error
We call these file-system faults
How about partial storage faults?
Depend on local file systems to store data
![Page 16: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/16.jpg)
Redundancy in Distributed Storage
3
File system may return corrupted data on reads
- disk block corruption
File system may return I/O error on a read
- latent sector error
We call these file-system faults
How about partial storage faults?
Depend on local file systems to store data
Do distributed storage systems use redundancy to recover from local file-system faults?
![Page 17: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/17.jpg)
Our Study
4
![Page 18: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/18.jpg)
Our StudyBehavior of eight distributed systems in response to file-system faults
4
![Page 19: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/19.jpg)
Our StudyBehavior of eight distributed systems in response to file-system faults
Broad spectrum of replication and consensus protocols
4
![Page 20: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/20.jpg)
Our StudyBehavior of eight distributed systems in response to file-system faults
Broad spectrum of replication and consensus protocols
Replicated state machines- ZooKeeper (uses ZAB for consensus)
- LogCabin, CockroachDB, and RethinkDB (uses RAFT for consensus)
4
![Page 21: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/21.jpg)
Our StudyBehavior of eight distributed systems in response to file-system faults
Broad spectrum of replication and consensus protocols
Replicated state machines- ZooKeeper (uses ZAB for consensus)
- LogCabin, CockroachDB, and RethinkDB (uses RAFT for consensus)
Primary backup replication- MongoDB
- Redis
- Kafka (in-sync replicas for leader election)
4
![Page 22: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/22.jpg)
Our StudyBehavior of eight distributed systems in response to file-system faults
Broad spectrum of replication and consensus protocols
Replicated state machines- ZooKeeper (uses ZAB for consensus)
- LogCabin, CockroachDB, and RethinkDB (uses RAFT for consensus)
Primary backup replication- MongoDB
- Redis
- Kafka (in-sync replicas for leader election)
Dynamo-style quorum- Cassandra (decentralized, no leader/follower)
4
![Page 23: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/23.jpg)
Fault model
5
![Page 24: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/24.jpg)
Fault model
A single fault to a single file-system block in a single node
5
![Page 25: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/25.jpg)
Fault model
A single fault to a single file-system block in a single node
Type of faults:
- corruptions
- read errors
- write errors
5
![Page 26: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/26.jpg)
Common Expectation
6
![Page 27: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/27.jpg)
Common Expectation
6
![Page 28: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/28.jpg)
Common Expectation
6
Fault Model: A single fault to one block in only one replica
![Page 29: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/29.jpg)
Common Expectation
6
Fault Model: A single fault to one block in only one replica
Redundancy would enable recovery from local file-system faults
![Page 30: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/30.jpg)
7
Redundancy Does Not Imply Fault Tolerance
![Page 31: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/31.jpg)
7
Redundancy Does Not Imply Fault ToleranceA single fault in one node can cause catastrophic outcomes
![Page 32: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/32.jpg)
7
Redundancy Does Not Imply Fault ToleranceA single fault in one node can cause catastrophic outcomes
- data loss, corruption, unavailability, and spread of corruption to other intact replicas
![Page 33: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/33.jpg)
7
Redundancy Does Not Imply Fault ToleranceA single fault in one node can cause catastrophic outcomes
- data loss, corruption, unavailability, and spread of corruption to other intact replicas
Silent corruption
Unavailability Data lossReduced
redundancyQuery
failures
Redis
ZooKeeper
Cassandra
Kafka
RethinkDB
MongoDB
LogCabin
CockroachDB
![Page 34: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/34.jpg)
7
Redundancy Does Not Imply Fault ToleranceA single fault in one node can cause catastrophic outcomes
- data loss, corruption, unavailability, and spread of corruption to other intact replicas
Silent corruption
Unavailability Data lossReduced
redundancyQuery
failures
Redis
ZooKeeper
Cassandra
Kafka
RethinkDB
MongoDB
LogCabin
CockroachDB
![Page 35: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/35.jpg)
Why does Redundancy Not Imply Fault Tolerance?
8
![Page 36: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/36.jpg)
Why does Redundancy Not Imply Fault Tolerance?
Some fundamental problems across multiple systems – not just implementation bugs!
8
![Page 37: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/37.jpg)
Why does Redundancy Not Imply Fault Tolerance?
Some fundamental problems across multiple systems – not just implementation bugs!
Faults are often undetected locally – leads to harmful global effects
8
![Page 38: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/38.jpg)
Why does Redundancy Not Imply Fault Tolerance?
Some fundamental problems across multiple systems – not just implementation bugs!
Faults are often undetected locally – leads to harmful global effects
On detection, crashing is the common action – redundancy underutilized
8
![Page 39: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/39.jpg)
Why does Redundancy Not Imply Fault Tolerance?
Some fundamental problems across multiple systems – not just implementation bugs!
Faults are often undetected locally – leads to harmful global effects
On detection, crashing is the common action – redundancy underutilized
Crash and corruption handling are entangled – loss of committed data
8
![Page 40: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/40.jpg)
Why does Redundancy Not Imply Fault Tolerance?
Some fundamental problems across multiple systems – not just implementation bugs!
Faults are often undetected locally – leads to harmful global effects
On detection, crashing is the common action – redundancy underutilized
Crash and corruption handling are entangled – loss of committed data
Unsafe interaction between local behavior and global distributed protocols can spread corruption or data loss
8
![Page 41: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/41.jpg)
9
Fundamental Problems: Summary
![Page 42: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/42.jpg)
9
Locally undetected
faults?
Crashing -common local
action?
Crash & corruption handling
entangled?
Unsafe interaction of local & global
protocols?
Redundancy underutilized for recovery?
Redis
ZooKeeper
Cassandra
Kafka
RethinkDB
MongoDB
LogCabin
CockroachDB
Fundamental Problems: Summary
![Page 43: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/43.jpg)
9
Locally undetected
faults?
Crashing -common local
action?
Crash & corruption handling
entangled?
Unsafe interaction of local & global
protocols?
Redundancy underutilized for recovery?
Redis
ZooKeeper
Cassandra
Kafka
RethinkDB
MongoDB
LogCabin
CockroachDB
Fundamental Problems: Summary
![Page 44: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/44.jpg)
9
Locally undetected
faults?
Crashing -common local
action?
Crash & corruption handling
entangled?
Unsafe interaction of local & global
protocols?
Redundancy underutilized for recovery?
Redis
ZooKeeper
Cassandra
Kafka
RethinkDB
MongoDB
LogCabin
CockroachDB
Fundamental Problems: Summary
![Page 45: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/45.jpg)
9
Locally undetected
faults?
Crashing -common local
action?
Crash & corruption handling
entangled?
Unsafe interaction of local & global
protocols?
Redundancy underutilized for recovery?
Redis
ZooKeeper
Cassandra
Kafka
RethinkDB
MongoDB
LogCabin
CockroachDB
Fundamental Problems: Summary
![Page 46: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/46.jpg)
9
Locally undetected
faults?
Crashing -common local
action?
Crash & corruption handling
entangled?
Unsafe interaction of local & global
protocols?
Redundancy underutilized for recovery?
Redis
ZooKeeper
Cassandra
Kafka
RethinkDB
MongoDB
LogCabin
CockroachDB
Fundamental Problems: Summary
![Page 47: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/47.jpg)
9
Locally undetected
faults?
Crashing -common local
action?
Crash & corruption handling
entangled?
Unsafe interaction of local & global
protocols?
Redundancy underutilized for recovery?
Redis
ZooKeeper
Cassandra
Kafka
RethinkDB
MongoDB
LogCabin
CockroachDB
Fundamental Problems: Summary
![Page 48: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/48.jpg)
Outline
Introduction
Fault Injection
System Behavior Analysis
Major Results
Observations Across Systems
Conclusion
10
![Page 49: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/49.jpg)
Fault Model
11
![Page 50: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/50.jpg)
Fault Model
11
Server 1 Server 2 Server 3
![Page 51: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/51.jpg)
Request
Fault Model
11
Server 1 Server 2 Server 3Client
![Page 52: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/52.jpg)
Request
Fault Model
11
Server 1 Server 2 Server 3Client
![Page 53: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/53.jpg)
Request
Fault Model
11
Server 1 Server 2 Server 3Client
File System
read
/w
rite
File System
read
/w
rite
File System
read
/w
rite
![Page 54: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/54.jpg)
Request
Fault Model
11
Server 1 Server 2 Server 3Client
File System
read
/w
rite
A single fault to a single file-system block in a single node
File System
read
/w
rite
File System
read
/w
rite
![Page 55: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/55.jpg)
Request
Fault Model
11
Server 1 Server 2 Server 3Client
File System
read
/w
rite
A single fault to a single file-system block in a single node
Fault for current run:server 1, block B1 read corruption
File System
read
/w
rite
File System
read
/w
rite
![Page 56: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/56.jpg)
Request
Fault Model
11
Server 1 Server 2 Server 3Client
File System
read
/w
rite
A single fault to a single file-system block in a single node
Fault for current run:server 1, block B1 read corruption
Fault for next run:server 1, block B1read error
File System
read
/w
rite
File System
read
/w
rite
![Page 57: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/57.jpg)
Request
Fault Model
11
Server 1 Server 2 Server 3Client
File System
read
/w
rite
A single fault to a single file-system block in a single node
Faults injected only to user data not filesystem metadata
Fault for current run:server 1, block B1 read corruption
Fault for next run:server 1, block B1read error
File System
read
/w
rite
File System
read
/w
rite
![Page 58: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/58.jpg)
Request
Fault Model: ext4 and btrfs
12
Server 1 Server 2 Server 3Client
File System File System File System
![Page 59: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/59.jpg)
Request
Fault Model: ext4 and btrfs
12
Server 1 Server 2 Server 3Client
File System File System File Systemext4 ext4ext4
![Page 60: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/60.jpg)
Request
Fault Model: ext4 and btrfs
12
Server 1 Server 2 Server 3Client
File System File System File Systemext4 ext4ext4
![Page 61: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/61.jpg)
Request
Fault Model: ext4 and btrfs
12
Server 1 Server 2 Server 3Client
Co
rru
pt
dat
a
File System File System File Systemext4 ext4ext4
![Page 62: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/62.jpg)
Request
Fault Model: ext4 and btrfs
12
Server 1 Server 2 Server 3Client
Co
rru
pt
dat
aC
orr
up
t d
ata
File System File System File Systemext4 ext4ext4
![Page 63: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/63.jpg)
Request
Fault Model: ext4 and btrfs
12
Server 1 Server 2 Server 3Client
Co
rru
pt
dat
aC
orr
up
t d
ata
Ext4: disk corruption → corrupted data to applications
File System File System File Systemext4 ext4ext4
![Page 64: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/64.jpg)
Request
Fault Model: ext4 and btrfs
12
Server 1 Server 2 Server 3Client
Ext4: disk corruption → corrupted data to applications
File System File System File Systemext4 ext4ext4 btrfsbtrfs btrfs
![Page 65: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/65.jpg)
Request
Fault Model: ext4 and btrfs
12
Server 1 Server 2 Server 3Client
Co
rru
pt
dat
a
Ext4: disk corruption → corrupted data to applications
File System File System File Systemext4 ext4ext4 btrfsbtrfs btrfs
![Page 66: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/66.jpg)
Request
Fault Model: ext4 and btrfs
12
Server 1 Server 2 Server 3Client
Co
rru
pt
dat
a
Ext4: disk corruption → corrupted data to applications
File System File System File Systemext4 ext4ext4 btrfsbtrfs btrfs
I/O
Er
ror
![Page 67: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/67.jpg)
Request
Fault Model: ext4 and btrfs
12
Server 1 Server 2 Server 3Client
Co
rru
pt
dat
a
Ext4: disk corruption → corrupted data to applications
Btrfs: disk corruption → I/O error to applications
File System File System File Systemext4 ext4ext4 btrfsbtrfs btrfs
I/O
Er
ror
![Page 68: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/68.jpg)
Fault Injection Methodology
13
Server 1 Server 2 Server 3
File System File SystemFile System
![Page 69: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/69.jpg)
Fault Injection Methodology
13
Server 1 Server 2 Server 3
File System File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)
errfs - a FUSE file system to inject file-system faults
![Page 70: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/70.jpg)
Fault Injection Methodology
13
Server 1 Server 2 Server 3
File System
Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)
errfs - a FUSE file system to inject file-system faults
![Page 71: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/71.jpg)
Read
Fault Injection Methodology
13
Server 1 Server 2 Server 3Client
File System
Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)
errfs - a FUSE file system to inject file-system faults
![Page 72: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/72.jpg)
Read
Fault Injection Methodology
13
Server 1 Server 2 Server 3Client
File System
Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)
read B1-B4
errfs - a FUSE file system to inject file-system faults
![Page 73: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/73.jpg)
Read
Fault Injection Methodology
13
Server 1 Server 2 Server 3Client
File System
Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)
read B1-B4
read B1-B4
return B1-B4
errfs - a FUSE file system to inject file-system faults
![Page 74: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/74.jpg)
Read
Fault Injection Methodology
13
Server 1 Server 2 Server 3Client
File System
Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)
read B1-B4
read B1-B4
return B1-B4
return B1’-B4
errfs - a FUSE file system to inject file-system faults
![Page 75: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/75.jpg)
Read
Fault Injection Methodology
13
Server 1 Server 2 Server 3Client
File System
Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)
read B1-B4
read B1-B4
return B1-B4
return B1’-B4
errfs - a FUSE file system to inject file-system faults
Local Behavior
![Page 76: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/76.jpg)
Read
Fault Injection Methodology
13
Server 1 Server 2 Server 3Client
File System
Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)
read B1-B4
read B1-B4
return B1-B4
return B1’-B4
errfs - a FUSE file system to inject file-system faults
Local BehaviorCrash RetryIgnore faulty dataNo detection/recovery
![Page 77: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/77.jpg)
Read
Fault Injection Methodology
13
Server 1 Server 2 Server 3Client
File System
Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)
read B1-B4
read B1-B4
return B1-B4
return B1’-B4
errfs - a FUSE file system to inject file-system faults
Local BehaviorCrash RetryIgnore faulty dataNo detection/recovery Global Effect
![Page 78: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/78.jpg)
Read
Fault Injection Methodology
13
Server 1 Server 2 Server 3Client
File System
Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)
read B1-B4
read B1-B4
return B1-B4
return B1’-B4
errfs - a FUSE file system to inject file-system faults
Local BehaviorCrash RetryIgnore faulty dataNo detection/recovery Global Effect
CorruptionData lossUnavailability
![Page 79: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/79.jpg)
Outline
Introduction
Fault Injection
System Behavior Analysis
Major Results
Observations Across Systems
Conclusion
14
![Page 80: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/80.jpg)
System Behavior AnalysisBehavior of eight distributed systems in response to file-system faults
Broad spectrum of replication and consensus protocols
Replicated state machines- ZooKeeper (uses ZAB for consensus)
- LogCabin, CockroachDB, and RethinkDB (uses RAFT for consensus)
Primary backup replication- MongoDB
- Redis
- Kafka (in-sync replicas for leader election)
Dynamo-style quorum- Cassandra (decentralized, no leader/follower)
15
![Page 81: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/81.jpg)
An Example: Redis
16
![Page 82: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/82.jpg)
An Example: Redis
16
Redis is a popular data structure store
![Page 83: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/83.jpg)
Leader
An Example: Redis
16
Redis is a popular data structure store
![Page 84: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/84.jpg)
Follower
Follower
Leader
An Example: Redis
16
Redis is a popular data structure store
![Page 85: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/85.jpg)
Follower
Follower
Leader
An Example: Redis
16
appendonlyfile
appendonlyfile
appendonlyfile
Redis is a popular data structure store
![Page 86: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/86.jpg)
Follower
Follower
Leader
An Example: Redis
16
appendonlyfile
appendonlyfile
appendonlyfile
ClientWrite
Redis is a popular data structure store
![Page 87: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/87.jpg)
Follower
Follower
Leader
An Example: Redis
16
appendonlyfile
appendonlyfile
appendonlyfile
ClientWrite
Redis is a popular data structure store
![Page 88: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/88.jpg)
Follower
Follower
Leader
An Example: Redis
16
appendonlyfile
appendonlyfile
appendonlyfile
ClientWrite
Redis is a popular data structure store
![Page 89: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/89.jpg)
Follower
Follower
Leader
An Example: Redis
16
appendonlyfile
appendonlyfile
appendonlyfile
ClientWrite
Redis is a popular data structure store
![Page 90: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/90.jpg)
Follower
Follower
Leader
An Example: Redis
16
appendonlyfile
appendonlyfile
appendonlyfile
Redis is a popular data structure store
![Page 91: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/91.jpg)
Follower
Follower
Leader
An Example: Redis
16
redis_database
appendonlyfile
appendonlyfile
appendonlyfile
Redis is a popular data structure store
![Page 92: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/92.jpg)
Follower
Follower
Leader
An Example: Redis
16
redis_database
appendonlyfile
appendonlyfile
appendonlyfile
Redis is a popular data structure store
![Page 93: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/93.jpg)
Follower
Follower
Leader
An Example: Redis
16
redis_database
appendonlyfile
appendonlyfile
appendonlyfile
Redis is a popular data structure store
![Page 94: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/94.jpg)
Follower
Follower
Leader
An Example: Redis
16
redis_database
redis_database
redis_database
appendonlyfile
appendonlyfile
appendonlyfile
Redis is a popular data structure store
![Page 95: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/95.jpg)
Follower
Follower
Leader
An Example: Redis
16
redis_database
redis_database
redis_database
appendonlyfile
appendonlyfile
appendonlyfile
Redis is a popular data structure store
![Page 96: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/96.jpg)
Redis: Behavior Analysis
17
![Page 97: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/97.jpg)
Redis: Behavior Analysis
17
Read Workload
![Page 98: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/98.jpg)
Redis: Behavior Analysis
17
Local Behavior
Read Workload
![Page 99: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/99.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
Local Behavior
Read Workload
![Page 100: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/100.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
![Page 101: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/101.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderL
![Page 102: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/102.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF
![Page 103: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/103.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
![Page 104: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/104.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
![Page 105: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/105.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
![Page 106: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/106.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Local Behavior
![Page 107: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/107.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
![Page 108: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/108.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
![Page 109: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/109.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
![Page 110: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/110.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
![Page 111: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/111.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
UnavailabilityNo checksums to detect corruptionLeader crashes due to failed deserializationNo automatic failover - cluster unavailable
![Page 112: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/112.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
![Page 113: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/113.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
![Page 114: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/114.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
![Page 115: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/115.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
![Page 116: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/116.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
ReducedRedundancy
![Page 117: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/117.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
ReducedRedundancy
![Page 118: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/118.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
ReducedRedundancy
![Page 119: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/119.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
ReducedRedundancy
![Page 120: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/120.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
ReducedRedundancy
No Detection/ No Recovery
![Page 121: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/121.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
ReducedRedundancy
No Detection/ No Recovery
![Page 122: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/122.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
ReducedRedundancy
No Detection/ No Recovery
Corruption
![Page 123: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/123.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
ReducedRedundancy
No Detection/ No Recovery
CorruptionNo checksums to detect corruptionLeader returns corrupted data on queriesCorruption propagation to followers
![Page 124: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/124.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
ReducedRedundancy
No Detection/ No Recovery
Corruption
![Page 125: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/125.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
ReducedRedundancy
No Detection/ No Recovery
Corruption
![Page 126: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/126.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
ReducedRedundancy
No Detection/ No Recovery
Corruption
![Page 127: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/127.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
ReducedRedundancy
No Detection/ No Recovery
Corruption
Correct
![Page 128: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/128.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
ReducedRedundancy
No Detection/ No Recovery
Corruption
Correct
![Page 129: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/129.jpg)
Redis: Behavior Analysis
17
CorruptRead
I/O Error
L LF F
Local Behavior
Read Workload
LeaderLFollowerF Crash
On-disk Structures
appendonlyfile.metadata
appendonlyfile.data
redis_database.block_0
redis_database.metadata
redis_database.userdata
Global Effect
CorruptRead
I/O Error
L LF F
Local Behavior
Global Effect
Unavailability
ReducedRedundancy
No Detection/ No Recovery
Corruption
Retry
WriteUnavailability
Correct
![Page 130: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/130.jpg)
Other Systems
18
![Page 131: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/131.jpg)
Other Systems
Metadata stores: ZooKeeper, LogCabin
Wide column store: Cassandra
Document stores: MongoDB
Distributed databases: RethinkDB, CockroachDB
Message Queues: Kafka
18
![Page 132: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/132.jpg)
Outline
Introduction
Fault Injection
System Behavior Analysis
Major Results
Redundancy Does not Provide Fault Tolerance
Observations Across Systems
Conclusion
19
![Page 133: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/133.jpg)
Redundancy Does not Provide Fault Tolerance
20
![Page 134: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/134.jpg)
Redundancy Does not Provide Fault Tolerance
20
Redis Read
ZooKeeper Write
Kafka Read
RethinkDB Read
Cassandra ReadKafka Write
![Page 135: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/135.jpg)
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
ZooKeeper WriteWrite Error
Kafka Read
RethinkDB ReadCorrupt
Cassandra ReadKafka Write
CorruptReadError Corrupt
ReadError
CorruptReadError
![Page 136: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/136.jpg)
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
Corrupt
Cassandra ReadKafka Write
checkpoint
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
![Page 137: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/137.jpg)
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
L F L F
L F
L F L F
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
L F
Corrupt
Cassandra ReadKafka Write
checkpointL F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
![Page 138: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/138.jpg)
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
L F L F
L F
L F L F
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
L F
CorruptionCorrupt
Cassandra ReadKafka Write
checkpointL F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
![Page 139: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/139.jpg)
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
L F L F
L F
L F L F
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
L F
Corruption
Data Loss
Corrupt
Cassandra ReadKafka Write
checkpointL F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
![Page 140: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/140.jpg)
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
L F L F
L F
L F L F
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
L F
Corruption
Write Unavailability
Data Loss
Corrupt
Cassandra ReadKafka Write
checkpointL F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
![Page 141: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/141.jpg)
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
L F L F
L F
L F L F
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
L F
Corruption
Write Unavailability
Data Loss
UnavailabilityCorrupt
Cassandra ReadKafka Write
checkpointL F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
![Page 142: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/142.jpg)
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
L F L F
L F
L F L F
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
L F
Corruption
Write Unavailability
Data Loss
UnavailabilityCorrupt
Query Failure
Cassandra ReadKafka Write
checkpointL F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
![Page 143: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/143.jpg)
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
L F L F
L F
L F L F
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
L F
Corruption
Write Unavailability
Data Loss
UnavailabilityCorrupt
Query Failure
Cassandra ReadKafka Write
checkpointL F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
Reduced Redundancy
![Page 144: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/144.jpg)
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
L F L F
L F
L F L F
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
L F
Corruption
Write Unavailability
Data Loss
UnavailabilityCorrupt
Query Failure
Cassandra ReadKafka Write
checkpointL F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
Reduced Redundancy
Harmful global effects despite redundancy
![Page 145: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/145.jpg)
Redundancy Does not Provide Fault Tolerance
20
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
L F L F
L F
L F L F
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
L F
Corruption
Write Unavailability
Data Loss
UnavailabilityCorrupt
Query Failure
Cassandra ReadKafka Write
checkpointL F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
Reduced Redundancy
Harmful global effects despite redundancyNot simple implementation bugs - fundamental problems across multiple systems!
![Page 146: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/146.jpg)
OutlineIntroduction
Fault Injection
System Behavior Analysis
Major Results
Observations Across Systems
Faults are Often Undetected Locally
Crashing: Common Local Reaction
Crash and Corruption Handling are Entangled
Unsafe interaction between local and global protocols
Conclusion
21
![Page 147: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/147.jpg)
Faults are Often Undetected Locally
22
![Page 148: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/148.jpg)
Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect
22
![Page 149: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/149.jpg)
Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect
- Locally undetected corruption → global silent corruption
22
![Page 150: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/150.jpg)
Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect
- Locally undetected corruption → global silent corruption
22
Cassandra: Locally Undetected Fault
![Page 151: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/151.jpg)
Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect
- Locally undetected corruption → global silent corruption
22
Client Replica 1 Other Replicas
Cassandra: Locally Undetected Fault
![Page 152: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/152.jpg)
sstable
Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect
- Locally undetected corruption → global silent corruption
22
key | value
Client Replica 1 Other Replicas
key | value
Cassandra: Locally Undetected Fault
sstable
![Page 153: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/153.jpg)
sstable
Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect
- Locally undetected corruption → global silent corruption
22
key | value
Client Replica 1 Other Replicas
key | value
sstable.userdata corrupted
Cassandra: Locally Undetected Fault
sstable
![Page 154: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/154.jpg)
sstable compression = offNo checksums to detect corruption
sstable
Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect
- Locally undetected corruption → global silent corruption
22
key | value
Client Replica 1 Other Replicas
key | value
sstable.userdata corrupted
Cassandra: Locally Undetected Fault
sstable
![Page 155: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/155.jpg)
sstable compression = offNo checksums to detect corruption
sstable
Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect
- Locally undetected corruption → global silent corruption
22
key | value
Client Replica 1 Other Replicas
key | value
sstable.userdata corrupted
Cassandra: Locally Undetected Fault
sstable
READ (R=1)
![Page 156: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/156.jpg)
sstable compression = offNo checksums to detect corruption
sstable
Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect
- Locally undetected corruption → global silent corruption
22
key | value
Client Replica 1 Other Replicas
key | value
sstable.userdata corrupted
Cassandra: Locally Undetected Fault
sstable
CORRUPT
READ (R=1)
![Page 157: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/157.jpg)
Read Repair
sstable compression = offNo checksums to detect corruption
sstable
Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect
- Locally undetected corruption → global silent corruption
22
key | value
Client Replica 1 Other Replicas
key | value
sstable.userdata corrupted
Cassandra: Locally Undetected Fault
sstable
CORRUPT
READ (R=1)
![Page 158: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/158.jpg)
Read Repair
sstable compression = offNo checksums to detect corruption
sstable
Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect
- Locally undetected corruption → global silent corruption
22
key | value
Client Replica 1 Other Replicas
key | value
sstable.userdata corrupted
key | value
Cassandra: Locally Undetected Fault
sstable
CORRUPT
READ (R=1)
![Page 159: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/159.jpg)
Read Repair
sstable compression = offNo checksums to detect corruption
sstable
Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect
- Locally undetected corruption → global silent corruption
22
key | value
Client Replica 1 Other Replicas
key | value
sstable.userdata corrupted
key | valueREAD
CORRUPT
Cassandra: Locally Undetected Fault
sstable
CORRUPT
READ (R=1)
![Page 160: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/160.jpg)
Read Repair
sstable compression = offNo checksums to detect corruption
sstable
Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect
- Locally undetected corruption → global silent corruption
22
key | value
Client Replica 1 Other Replicas
key | value
sstable.userdata corrupted
key | valueREAD
CORRUPT
Cassandra: Locally Undetected Fault
sstable
CORRUPT
READ (R=1)
Need for end-to-end integrity and error handling
![Page 161: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/161.jpg)
Crashing - Common Local Reaction
23
![Page 162: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/162.jpg)
Crashing - Common Local Reaction
23
Many systems that reliably detect fault simply crash on encountering faults
![Page 163: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/163.jpg)
collections.headercollections.metadatacollections.dataindexjournal.headerjournal.otherstorage_bsonwiredtiger_wt
Crashing - Common Local Reaction
23
Many systems that reliably detect fault simply crash on encountering faults
MongoDBBlock Corruption during Read Workloads
L F
Crash
Leader
Follower
L
F
![Page 164: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/164.jpg)
collections.headercollections.metadatacollections.dataindexjournal.headerjournal.otherstorage_bsonwiredtiger_wt
Crashing - Common Local Reaction
23
Many systems that reliably detect fault simply crash on encountering faults
MongoDBBlock Corruption during Read Workloads
L F
Crash
Leader
Follower
L
F
![Page 165: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/165.jpg)
collections.headercollections.metadatacollections.dataindexjournal.headerjournal.otherstorage_bsonwiredtiger_wt
Crashing - Common Local Reaction
23
Many systems that reliably detect fault simply crash on encountering faults
MongoDBBlock Corruption during Read Workloads
L F
epochepoch_tmpmyidlog.transaction_headlog.transaction_bodylog.transaction_taillog.remaininglog.tail
ZooKeeper
L F
Crash
Leader
Follower
L
F
![Page 166: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/166.jpg)
collections.headercollections.metadatacollections.dataindexjournal.headerjournal.otherstorage_bsonwiredtiger_wt
Crashing - Common Local Reaction
23
Many systems that reliably detect fault simply crash on encountering faults
MongoDBBlock Corruption during Read Workloads
L F
epochepoch_tmpmyidlog.transaction_headlog.transaction_bodylog.transaction_taillog.remaininglog.tail
ZooKeeper
L F
Crash
Leader
Follower
L
F
![Page 167: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/167.jpg)
collections.headercollections.metadatacollections.dataindexjournal.headerjournal.otherstorage_bsonwiredtiger_wt
Crashing - Common Local Reaction
23
Many systems that reliably detect fault simply crash on encountering faults
MongoDBBlock Corruption during Read Workloads
L F
epochepoch_tmpmyidlog.transaction_headlog.transaction_bodylog.transaction_taillog.remaininglog.tail
ZooKeeper
L F
Crash
Leader
Follower
L
F
Crashing leads to reduced redundancy and imminent unavailability
![Page 168: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/168.jpg)
collections.headercollections.metadatacollections.dataindexjournal.headerjournal.otherstorage_bsonwiredtiger_wt
Crashing - Common Local Reaction
23
Many systems that reliably detect fault simply crash on encountering faults
MongoDBBlock Corruption during Read Workloads
L F
epochepoch_tmpmyidlog.transaction_headlog.transaction_bodylog.transaction_taillog.remaininglog.tail
ZooKeeper
L F
Crash
Leader
Follower
L
F
Crashing leads to reduced redundancy and imminent unavailabilityPersistent fault -- Requires manual intervention
![Page 169: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/169.jpg)
collections.headercollections.metadatacollections.dataindexjournal.headerjournal.otherstorage_bsonwiredtiger_wt
Crashing - Common Local Reaction
23
Many systems that reliably detect fault simply crash on encountering faults
MongoDBBlock Corruption during Read Workloads
L F
epochepoch_tmpmyidlog.transaction_headlog.transaction_bodylog.transaction_taillog.remaininglog.tail
ZooKeeper
L F
Crash
Leader
Follower
L
F
Crashing leads to reduced redundancy and imminent unavailabilityPersistent fault -- Requires manual intervention
Redundancy underutilized!
![Page 170: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/170.jpg)
Crash and Corruption Handling are Entangled
24
![Page 171: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/171.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
24
![Page 172: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/172.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
24
![Page 173: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/173.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
data
24
![Page 174: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/174.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
24
![Page 175: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/175.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1
24
![Page 176: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/176.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1
Append(log, entry 2)
24
![Page 177: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/177.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1
Append(log, entry 2)
24
![Page 178: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/178.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
24
![Page 179: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/179.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
Checksum mismatch
24
![Page 180: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/180.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
Checksum mismatch
Action: Truncate log at 1
24
![Page 181: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/181.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
Checksum mismatch
Action: Truncate log at 1
24
![Page 182: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/182.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
Checksum mismatch
Action: Truncate log at 1
Lose uncommitted data
24
![Page 183: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/183.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
Checksum mismatch
Action: Truncate log at 1
Lose uncommitted data
24
![Page 184: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/184.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
Checksum mismatch
Action: Truncate log at 1
Lose uncommitted data
0 1 2
24
![Page 185: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/185.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
Checksum mismatch
Action: Truncate log at 1
Lose uncommitted data
0 1 2
24
![Page 186: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/186.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
Checksum mismatch
Action: Truncate log at 1
Disk corruption
Lose uncommitted data
0 1 2
24
![Page 187: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/187.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
Checksum mismatch
Checksum mismatch
Action: Truncate log at 1
Disk corruption
Lose uncommitted data
0 1 2
24
![Page 188: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/188.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
Checksum mismatch
Checksum mismatch
Action: Truncate log at 1
Disk corruption
Action: Truncate log at 0
Lose uncommitted data
0 1 2
24
![Page 189: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/189.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
Checksum mismatch
Checksum mismatch
Action: Truncate log at 1
Disk corruption
Action: Truncate log at 0
Lose uncommitted data
0 1 2
24
![Page 190: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/190.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
Checksum mismatch
Checksum mismatch
Action: Truncate log at 1
Disk corruption
Action: Truncate log at 0
Lose uncommitted data
Lose committed data!
0 1 2
24
![Page 191: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/191.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
Checksum mismatch
Checksum mismatch
Action: Truncate log at 1
Disk corruption
Action: Truncate log at 0
Lose uncommitted data
Lose committed data!
0 1 2
Developers of LogCabin and RethinkDB agree entanglement is the problem
24
![Page 192: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/192.jpg)
Crash and Corruption Handling are EntangledKafka Message Log
0
checksum data
1 2
Append(log, entry 2)
Checksum mismatch
Checksum mismatch
Action: Truncate log at 1
Disk corruption
Action: Truncate log at 0
Lose uncommitted data
Lose committed data!
0 1 2
Developers of LogCabin and RethinkDB agree entanglement is the problem
24Need for discerning corruptions due to crashes from other type of corruptions
![Page 193: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/193.jpg)
Unsafe Interaction between Local & Global Protocols
25
![Page 194: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/194.jpg)
Unsafe Interaction between Local & Global Protocols
0 1 2
25
Kafka: Message log at Node 1
![Page 195: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/195.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatch
0 1 2
25
Kafka: Message log at Node 1
![Page 196: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/196.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0
0 1 2
25
Kafka: Message log at Node 1
![Page 197: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/197.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
![Page 198: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/198.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
Local Behavior
![Page 199: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/199.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
Local Behavior
![Page 200: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/200.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
Local Behavior
0 1 2
Node1 Other Nodes
0 1 2
![Page 201: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/201.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
Local Behavior
0 1 2
Node1 Other Nodes
0 1 2
Set of in-sync replicas
![Page 202: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/202.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
Local Behavior
0 1 2
Node1 Other Nodes
0 1 2
Set of in-sync replicas
Node1 with truncated log not removed from in-sync replicas
![Page 203: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/203.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
Local Behavior
0 1 2
Node1 Other Nodes
0 1 2
Leader FollowersSet of in-sync replicas
Node1 with truncated log not removed from in-sync replicas
Node 1 elected as leader
![Page 204: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/204.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
Local Behavior
0 1 2
Client Node1 Other NodesREAD
0 1 2
Leader FollowersSet of in-sync replicas
Node1 with truncated log not removed from in-sync replicas
Node 1 elected as leader
![Page 205: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/205.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
Local Behavior
0 1 2
Client Node1 Other Nodes
message:0
[Silent data loss]
READ
0 1 2
Leader FollowersSet of in-sync replicas
Node1 with truncated log not removed from in-sync replicas
Node 1 elected as leader
![Page 206: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/206.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
Local Behavior
0 1 2
Client Node1 Other Nodes
message:0
[Silent data loss]
READ
Truncate upto message 0
0 1 2
Leader FollowersSet of in-sync replicas
Node1 with truncated log not removed from in-sync replicas
Node 1 elected as leader
![Page 207: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/207.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
Local Behavior
0 1 2
Client Node1 Other Nodes
message:0
[Silent data loss]
READ
Truncate upto message 0
0 1 2
Assertion failure
Leader FollowersSet of in-sync replicas
Node1 with truncated log not removed from in-sync replicas
Node 1 elected as leader
![Page 208: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/208.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
Local Behavior
0 1 2
Client Node1 Other Nodes
message:0
[Silent data loss]
READ
Truncate upto message 0
0 1 2
Assertion failureWRITE (W=2)
Leader FollowersSet of in-sync replicas
Node1 with truncated log not removed from in-sync replicas
Node 1 elected as leader
![Page 209: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/209.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
Local Behavior
0 1 2
Client Node1 Other Nodes
message:0
[Silent data loss]
READ
Truncate upto message 0
0 1 2
Assertion failure
Failure
WRITE (W=2)
Leader FollowersSet of in-sync replicas
Node1 with truncated log not removed from in-sync replicas
Node 1 elected as leader
![Page 210: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/210.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
Local Behavior
0 1 2
Client Node1 Other Nodes
message:0
[Silent data loss]
READ
Truncate upto message 0
0 1 2
Assertion failure
Failure
WRITE (W=2)
Leader FollowersSet of in-sync replicas
Node1 with truncated log not removed from in-sync replicas
Node 1 elected as leader
Unsafe interaction between local behavior and leader election protocol leads to data loss and write unavailability
![Page 211: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/211.jpg)
Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2
25
Kafka: Message log at Node 1
Local Behavior
0 1 2
Client Node1 Other Nodes
message:0
[Silent data loss]
READ
Truncate upto message 0
0 1 2
Assertion failure
Failure
WRITE (W=2)
Leader FollowersSet of in-sync replicas
Node1 with truncated log not removed from in-sync replicas
Node 1 elected as leader
Need for synergy between local behavior and global protocol
Unsafe interaction between local behavior and leader election protocol leads to data loss and write unavailability
![Page 212: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/212.jpg)
Redundancy Does not Provide Fault Tolerance
26
Redis Read
CorruptReadError
txn_headlog.tail
ZooKeeper WriteWrite Error
log.headerlog.otherreplication
L F L F
L F
L F L F
Kafka Read
aof.metadataaof.datardb.metadatardb.userdata
RethinkDB Read
db.txn_headdb.txn_bodydb.txn_taildb.metablock
L F
Corruption
Write Unavailability
Data Loss
UnavailabilityCorrupt
Query Failure
Cassandra ReadKafka Write
checkpointL F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
sstable.block0sstable.metadatasstable.userdatasstable.index
Reduced Redundancy
![Page 213: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/213.jpg)
Why does Redundancy Not Imply Fault Tolerance?
27
Redis Read
CorruptReadError
ZooKeeper Write
Write Error
L F L F
L F
L F L F
Kafka Read
RethinkDBRead
L F
Corrupt
Cassandra Read
Kafka Write
L F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
![Page 214: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/214.jpg)
Why does Redundancy Not Imply Fault Tolerance?
27
Redis Read
CorruptReadError
ZooKeeper Write
Write Error
L F L F
L F
L F L F
Kafka Read
RethinkDBRead
L F
Corrupt
Cassandra Read
Kafka Write
L F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
![Page 215: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/215.jpg)
Why does Redundancy Not Imply Fault Tolerance?
27
Redis Read
CorruptReadError
ZooKeeper Write
Write Error
L F L F
L F
L F L F
Kafka Read
RethinkDBRead
L F
Corrupt
Cassandra Read
Kafka Write
L F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
Faults are often locally undetected
![Page 216: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/216.jpg)
Why does Redundancy Not Imply Fault Tolerance?
27
Redis Read
CorruptReadError
ZooKeeper Write
Write Error
L F L F
L F
L F L F
Kafka Read
RethinkDBRead
L F
Corrupt
Cassandra Read
Kafka Write
L F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
Faults are often locally undetected
![Page 217: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/217.jpg)
Why does Redundancy Not Imply Fault Tolerance?
27
Redis Read
CorruptReadError
ZooKeeper Write
Write Error
L F L F
L F
L F L F
Kafka Read
RethinkDBRead
L F
Corrupt
Cassandra Read
Kafka Write
L F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
Faults are often locally undetected
Crashing on detecting faults is the common reaction
![Page 218: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/218.jpg)
Why does Redundancy Not Imply Fault Tolerance?
27
Redis Read
CorruptReadError
ZooKeeper Write
Write Error
L F L F
L F
L F L F
Kafka Read
RethinkDBRead
L F
Corrupt
Cassandra Read
Kafka Write
L F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
Faults are often locally undetected
Crashing on detecting faults is the common reaction
![Page 219: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/219.jpg)
Why does Redundancy Not Imply Fault Tolerance?
27
Redis Read
CorruptReadError
ZooKeeper Write
Write Error
L F L F
L F
L F L F
Kafka Read
RethinkDBRead
L F
Corrupt
Cassandra Read
Kafka Write
L F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
Faults are often locally undetected
Crashing on detecting faults is the common reaction
Crash and corruption handling are entangled
![Page 220: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/220.jpg)
Why does Redundancy Not Imply Fault Tolerance?
27
Redis Read
CorruptReadError
ZooKeeper Write
Write Error
L F L F
L F
L F L F
Kafka Read
RethinkDBRead
L F
Corrupt
Cassandra Read
Kafka Write
L F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
Faults are often locally undetected
Crashing on detecting faults is the common reaction
Crash and corruption handling are entangled
![Page 221: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/221.jpg)
Why does Redundancy Not Imply Fault Tolerance?
27
Redis Read
CorruptReadError
ZooKeeper Write
Write Error
L F L F
L F
L F L F
Kafka Read
RethinkDBRead
L F
Corrupt
Cassandra Read
Kafka Write
L F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
Faults are often locally undetected
Crashing on detecting faults is the common reaction
Crash and corruption handling are entangled
Unsafe interaction between local and global protocols
![Page 222: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/222.jpg)
Why does Redundancy Not Imply Fault Tolerance?
27
Redis Read
CorruptReadError
ZooKeeper Write
Write Error
L F L F
L F
L F L F
Kafka Read
RethinkDBRead
L F
Corrupt
Cassandra Read
Kafka Write
L F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
Faults are often locally undetected
Crashing on detecting faults is the common reaction
Crash and corruption handling are entangled
Unsafe interaction between local and global protocols
![Page 223: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/223.jpg)
Why does Redundancy Not Imply Fault Tolerance?
27
Redis Read
CorruptReadError
ZooKeeper Write
Write Error
L F L F
L F
L F L F
Kafka Read
RethinkDBRead
L F
Corrupt
Cassandra Read
Kafka Write
L F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
Faults are often locally undetected
Not simple implementation bugs - fundamental problems across multiple systems!
Crashing on detecting faults is the common reaction
Crash and corruption handling are entangled
Unsafe interaction between local and global protocols
![Page 224: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/224.jpg)
Why does Redundancy Not Imply Fault Tolerance?
27
Redis Read
CorruptReadError
ZooKeeper Write
Write Error
L F L F
L F
L F L F
Kafka Read
RethinkDBRead
L F
Corrupt
Cassandra Read
Kafka Write
L F L F
CorruptReadError Corrupt
ReadError
CorruptReadError
Faults are often locally undetected
Not simple implementation bugs - fundamental problems across multiple systems!Redundancy underutilized as a source of recovery
Crashing on detecting faults is the common reaction
Crash and corruption handling are entangled
Unsafe interaction between local and global protocols
![Page 225: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/225.jpg)
28
Fundamental Problems: Summary
![Page 226: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/226.jpg)
28
Fundamental Problems: Summary
Locally undetected
faults?
Crashing -common local
action?
Crash & corruption handling
entangled?
Unsafe interaction of local & global
protocols?
Redundancy underutilized for recovery?
Redis
ZooKeeper
Cassandra
Kafka
RethinkDB
MongoDB
LogCabin
CockroachDB
![Page 227: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/227.jpg)
28
Fundamental Problems: Summary
Locally undetected
faults?
Crashing -common local
action?
Crash & corruption handling
entangled?
Unsafe interaction of local & global
protocols?
Redundancy underutilized for recovery?
Redis
ZooKeeper
Cassandra
Kafka
RethinkDB
MongoDB
LogCabin
CockroachDB
![Page 228: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/228.jpg)
28
Fundamental Problems: Summary
Locally undetected
faults?
Crashing -common local
action?
Crash & corruption handling
entangled?
Unsafe interaction of local & global
protocols?
Redundancy underutilized for recovery?
Redis
ZooKeeper
Cassandra
Kafka
RethinkDB
MongoDB
LogCabin
CockroachDB
![Page 229: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/229.jpg)
28
Fundamental Problems: Summary
More observations, results, and discussions in the paper …
Locally undetected
faults?
Crashing -common local
action?
Crash & corruption handling
entangled?
Unsafe interaction of local & global
protocols?
Redundancy underutilized for recovery?
Redis
ZooKeeper
Cassandra
Kafka
RethinkDB
MongoDB
LogCabin
CockroachDB
![Page 230: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/230.jpg)
OutlineIntroduction
Fault Injection
System Behavior Analysis
Major Results
Observations Across Systems
Faults are Often Undetected Locally
Crashing: Common Local Reaction
Crash and Corruption Handling are Entangled
Unsafe interaction between local and global protocols
Conclusion
29
![Page 231: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/231.jpg)
Summary
30
![Page 232: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/232.jpg)
SummaryWe analyzed distributed storage reactions to single file-system faults
30
![Page 233: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/233.jpg)
SummaryWe analyzed distributed storage reactions to single file-system faults
Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB
30
![Page 234: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/234.jpg)
SummaryWe analyzed distributed storage reactions to single file-system faults
Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB
Redundancy does not provide fault tolerance
30
![Page 235: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/235.jpg)
SummaryWe analyzed distributed storage reactions to single file-system faults
Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB
Redundancy does not provide fault tolerance
A single fault in one node can cause catastrophic outcomes
30
![Page 236: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/236.jpg)
SummaryWe analyzed distributed storage reactions to single file-system faults
Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB
Redundancy does not provide fault tolerance
A single fault in one node can cause catastrophic outcomes
data loss, corruption, unavailability, and spread of corruption to other intact replicas
30
![Page 237: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/237.jpg)
SummaryWe analyzed distributed storage reactions to single file-system faults
Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB
Redundancy does not provide fault tolerance
A single fault in one node can cause catastrophic outcomes
data loss, corruption, unavailability, and spread of corruption to other intact replicas
Some fundamental problems across multiple systems:
30
![Page 238: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/238.jpg)
SummaryWe analyzed distributed storage reactions to single file-system faults
Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB
Redundancy does not provide fault tolerance
A single fault in one node can cause catastrophic outcomes
data loss, corruption, unavailability, and spread of corruption to other intact replicas
Some fundamental problems across multiple systems:
Faults are often undetected locally – leads to harmful global effects
30
![Page 239: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/239.jpg)
SummaryWe analyzed distributed storage reactions to single file-system faults
Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB
Redundancy does not provide fault tolerance
A single fault in one node can cause catastrophic outcomes
data loss, corruption, unavailability, and spread of corruption to other intact replicas
Some fundamental problems across multiple systems:
Faults are often undetected locally – leads to harmful global effects
On detection, crashing is the common action – redundancy underutilized
30
![Page 240: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/240.jpg)
SummaryWe analyzed distributed storage reactions to single file-system faults
Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB
Redundancy does not provide fault tolerance
A single fault in one node can cause catastrophic outcomes
data loss, corruption, unavailability, and spread of corruption to other intact replicas
Some fundamental problems across multiple systems:
Faults are often undetected locally – leads to harmful global effects
On detection, crashing is the common action – redundancy underutilized
Crash and corruption handling are entangled – loss of committed data
30
![Page 241: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/241.jpg)
SummaryWe analyzed distributed storage reactions to single file-system faults
Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB
Redundancy does not provide fault tolerance
A single fault in one node can cause catastrophic outcomes
data loss, corruption, unavailability, and spread of corruption to other intact replicas
Some fundamental problems across multiple systems:
Faults are often undetected locally – leads to harmful global effects
On detection, crashing is the common action – redundancy underutilized
Crash and corruption handling are entangled – loss of committed data
Unsafe interaction between local behavior and global distributed protocols can spread corruption or data loss
30
![Page 242: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/242.jpg)
Conclusion
31
![Page 243: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/243.jpg)
Conclusion
Most distributed systems not yet resilient
31
![Page 244: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/244.jpg)
Conclusion
Most distributed systems not yet resilient
Always detect faults – important in layered stacks on commodity hardware
31
![Page 245: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/245.jpg)
Conclusion
Most distributed systems not yet resilient
Always detect faults – important in layered stacks on commodity hardware
Detecting faults and not using redundancy to recover is undesirable
31
![Page 246: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/246.jpg)
Conclusion
Most distributed systems not yet resilient
Always detect faults – important in layered stacks on commodity hardware
Detecting faults and not using redundancy to recover is undesirable
Cannot always assume corruption to be caused by a crash
31
![Page 247: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/247.jpg)
Conclusion
Most distributed systems not yet resilient
Always detect faults – important in layered stacks on commodity hardware
Detecting faults and not using redundancy to recover is undesirable
Cannot always assume corruption to be caused by a crash
Local behavior has implications for distributed systems
31
![Page 248: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/248.jpg)
Conclusion
Most distributed systems not yet resilient
Always detect faults – important in layered stacks on commodity hardware
Detecting faults and not using redundancy to recover is undesirable
Cannot always assume corruption to be caused by a crash
Local behavior has implications for distributed systems
Our study provides directions for more robust distributed storage design
31
![Page 249: Redundancy Does Not Imply Fault Tolerance · 2017. 3. 3. · Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://reader035.fdocuments.in/reader035/viewer/2022062612/613a831b0051793c8c0115a9/html5/thumbnails/249.jpg)
Conclusion
Most distributed systems not yet resilient
Always detect faults – important in layered stacks on commodity hardware
Detecting faults and not using redundancy to recover is undesirable
Cannot always assume corruption to be caused by a crash
Local behavior has implications for distributed systems
Our study provides directions for more robust distributed storage design
Our fault injection framework available online: http://research.cs.wisc.edu/adsl/Software/cords/
Thank you!31