Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen
Joel Knighton
@joelknighton
DataStax
#CassandraSummit
Who I am
Mathematician
Software hobbyist
Logic enthusiast
Former DataStax Intern
DataStax Cassandra Developer
What I Do
Deconstruct
Formalize
Communicate
Prove
Automate
How We Test #1
Unit Tests
ant test
in-tree
How We Test #2
Distributed Tests
nosetests
On GitHub – available at riptano/cassandra-dtest
Why You’re Here
Jepsen
Kyle Kingsbury (aphyr)
https://aphyr.com/tags/jepsen
What Jepsen Is
A blog series about distributed systems behavior
A talk series about distributed systems behavior
A Clojure library to test the behavior of distributed systems
A collection of tests written using that library
What We Hope
Jepsen 💘 Cassandra
What I Did
Jepsen Tests
lein test
On GitHub – available at riptano/jepsen
A Test Incarnate
{:name …                        ; names the results
 :os …                          ; prepares the os
 :db …                          ; configures/starts/stops the db
 :client …                      ; interacts with the db
 :generator …                   ; instructions on how to interact
 :conductors {:nemesis …}       ; interacts with the environment
 :checker …}                    ; looks at and assesses test run
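Since a test is an ordinary Clojure map, the slots above can be sketched with plain data. This is only an illustration: the keyword values are placeholders for the real os, db, client, generator, conductor, and checker objects, and `slot-roles` and `missing-slots` are hypothetical helpers that are not part of Jepsen.

```clojure
;; Placeholder keywords stand in for the real objects a test would carry.
(def example-test
  {:name       "cassandra cql set isolate node decommission"
   :os         :debian
   :db         :cassandra
   :client     :cql-set-client
   :generator  :adds-then-read
   :conductors {:nemesis :partition-random-node}
   :checker    :set})

;; The role of each slot, as listed on the slide.
(def slot-roles
  {:name       "names the results"
   :os         "prepares the os"
   :db         "configures/starts/stops the db"
   :client     "interacts with the db"
   :generator  "instructions on how to interact"
   :conductors "interacts with the environment"
   :checker    "looks at and assesses test run"})

(defn missing-slots
  "Returns the set of slots named in slot-roles that test-map leaves unfilled."
  [test-map]
  (set (remove #(contains? test-map %) (keys slot-roles))))

(missing-slots example-test)                    ; no slot is missing
(missing-slots (dissoc example-test :checker))  ; only :checker is missing
```

Keeping the test as data is what lets a single test differ from another by swapping one slot, for example a different nemesis under `:conductors`.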
What You Need
One machine to run the tests
+
n machines to run Cassandra
How A Test Runs
[Diagram: lein test runs the os step on nodes n1, n2, n3, n4, and n5]
How A Test Runs
[Diagram: lein test runs the db step on nodes n1, n2, n3, n4, and n5]
How A Test Runs
[Diagram: lein test drives client 1 through client 5 and the nemesis against nodes n1 through n5]
Example operations: read, write 3, start nemesis, write 4, read, stop nemesis, write 1, cas 2 -> 3, …
How A Test Runs
[Diagram: lein test passes the recorded history to the checker]
1 – read
2 – write 3
1 – read 0
n – start nemesis
2 – write timed-out
3 – write 4
n – started nemesis
3 – wrote 4
4 – read
4 – read 4
n – stop nemesis
0 – write 1
1 – cas 2 -> 3
n – stopped nemesis
…
valid?
Latency
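A recorded history like the one above can be treated as a plain sequence of operation maps. The sketch below assumes an op shape with `:process`, `:type`, `:f`, and `:value` keys and records the timed-out write as `:info` (outcome unknown); that encoding and the helper `outcome-counts` are assumptions made for illustration, not code from the talk.

```clojure
;; A hand-written fragment of a history, one map per event.
(def history
  [{:process 1        :type :invoke :f :read  :value nil}
   {:process 2        :type :invoke :f :write :value 3}
   {:process 1        :type :ok     :f :read  :value 0}
   {:process :nemesis :type :info   :f :start :value nil}
   {:process 2        :type :info   :f :write :value 3}   ; timed out
   {:process 3        :type :invoke :f :write :value 4}
   {:process 3        :type :ok     :f :write :value 4}
   {:process 4        :type :invoke :f :read  :value nil}
   {:process 4        :type :ok     :f :read  :value 4}])

(defn outcome-counts
  "Counts client completions by :type, skipping invocations and nemesis events."
  [history]
  (->> history
       (remove #(= :nemesis (:process %)))
       (remove #(= :invoke (:type %)))
       (map :type)
       frequencies))

(outcome-counts history)
;; => {:ok 3, :info 1}
```

A checker only ever sees this kind of record, which is why a timed-out write has to be kept as "unknown" rather than guessed to be a success or a failure.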
Single Test Deep-Dive
lein test :only cassandra.collections.set-test/cql-set-isolate-node-decommission
Single Test Name
The test name labels the timestamped folder where test results, logs, and history are stored
cassandra cql set isolate node decommission
Single Test Nodes
[:n1 :n2 :n3 :n4 :n5]
Single Test Net
net/iptables
(drop!) ; use iptables to drop packets
(heal!) ; flush iptables
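Before any packets are dropped, something has to decide which links to cut. A minimal sketch of that decision for isolating a single node follows; `isolate-grudge` is a hypothetical helper written for this illustration, and it only computes the plan as data without touching iptables.

```clojure
(defn isolate-grudge
  "Maps each node to the set of nodes whose packets it should drop
   so that victim is cut off from every other node."
  [nodes victim]
  (let [others (set (remove #{victim} nodes))]
    (into {victim others}
          (for [n others] [n #{victim}]))))

(isolate-grudge [:n1 :n2 :n3] :n2)
;; => {:n2 #{:n1 :n3}, :n1 #{:n2}, :n3 #{:n2}}
```

Dropping in both directions keeps the partition symmetric; healing is then just a matter of flushing the rules on every node.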
Single Test OS
debian/os
(setup!)    ; adjust hostfile
            ; update package manager
            ; install base packages like curl, iptables, etc.
            ; make sure network is healed
(teardown!)
Single Test DB
cassandra.core/db
(setup!)    ; shutdown and wipe Cassandra if running
            ; install, configure, and start Cassandra
(teardown!) ; shutdown and wipe Cassandra
(log-files) ; return path to log files
Single Test Client
cql-set-client
(setup!)    ; driver connects to all nodes
            ; create schema
(invoke!)   ; add? Run CQL to add to set, handle errors
            ; read? Read value of CQL set, handle errors
(teardown!) ; disconnect driver
Single Test Generator
(gen/phases
  (->> (adds)
       (gen/stagger 1/10)
       (gen/delay 1/2)
       std-gen)
  (read-once))
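The shape of this generator is "a stream of adds, then one final read". The sketch below models only that ordering with lazy sequences; it ignores the timing that stagger and delay introduce, and `adds`, `read-once`, and `in-phases` here are stand-ins written for this illustration, not the functions used in the test.

```clojure
(defn adds
  "An endless stream of add operations with increasing values."
  []
  (map (fn [i] {:type :invoke :f :add :value i}) (range)))

(defn read-once
  "A single final read."
  []
  [{:type :invoke :f :read :value nil}])

(defn in-phases
  "Runs each phase to completion before starting the next one."
  [& phases]
  (apply concat phases))

;; The first phase must be bounded, otherwise the read is never reached.
(def ops (in-phases (take 3 (adds)) (read-once)))

(map :f ops)
;; => (:add :add :add :read)
```

The final read has to come last because the set checker compares everything that was added against what that one read returns.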
Single Test Conductors
{:nemesis (nemesis/partition-random-node)
:decommissioner (c/decommissioner)}
What a Conductor Is
It’s just a client
Single Test Checker
checker/set
(check) ; look at history of run
        ; find ok or uncertain adds
        ; compare these to final read
        ; return map with validity and ok, lost, unexpected, recovered
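The comparison described above can be sketched with set arithmetic. This follows the slide's description rather than the source of checker/set: the op shape (`:type`, `:f`, `:value`), the convention that `:info` marks an uncertain add, and the function `check-set` are all assumptions for illustration.

```clojure
(require '[clojure.set :as set])

(defn check-set
  "history is a sequence of op maps; final-read is the set returned by the last read."
  [history final-read]
  (let [adds       (filter #(= :add (:f %)) history)
        attempts   (set (map :value (filter #(= :invoke (:type %)) adds)))
        acked      (set (map :value (filter #(= :ok (:type %)) adds)))
        lost       (set/difference acked final-read)       ; acknowledged but absent
        unexpected (set/difference final-read attempts)    ; present but never added
        recovered  (set/difference (set/intersection final-read attempts) acked)]
    {:valid?     (and (empty? lost) (empty? unexpected))
     :ok         (set/intersection final-read acked)
     :lost       lost
     :unexpected unexpected
     :recovered  recovered}))

(def history
  [{:type :invoke :f :add :value 1} {:type :ok   :f :add :value 1}
   {:type :invoke :f :add :value 2} {:type :info :f :add :value 2} ; uncertain
   {:type :invoke :f :add :value 3} {:type :ok   :f :add :value 3}])

(check-set history #{1 2})
;; => {:valid? false, :ok #{1}, :lost #{3}, :unexpected #{}, :recovered #{2}}
```

An uncertain add that shows up in the final read counts as recovered, not as an error; only a lost acknowledged add or an element nobody tried to add makes the run invalid.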
Invariants We Test
Do CQL collections (maps, sets) merge cleanly when add-only?
Do counters merge to accurately reflect increments/decrements?
Does LWT in a single datacenter give us linearizability?
Do materialized views converge to match the base table?
Do batch writes eventually get applied atomically?
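To make one of these invariants concrete, here is one possible way to state the counter check as code: the final value must fall between the sum of acknowledged deltas adjusted by every uncertain decrement and the same sum adjusted by every uncertain increment. This formulation and the names `counter-bounds` and `counter-valid?` are assumptions for illustration; the talk does not show how its counter tests are written.

```clojure
(defn counter-bounds
  "acked and uncertain are sequences of integer deltas."
  [acked uncertain]
  (let [base (reduce + 0 acked)]
    {:lower (+ base (reduce + 0 (filter neg? uncertain)))
     :upper (+ base (reduce + 0 (filter pos? uncertain)))}))

(defn counter-valid?
  "True when the final counter value is explainable by the recorded deltas."
  [final acked uncertain]
  (let [{:keys [lower upper]} (counter-bounds acked uncertain)]
    (<= lower final upper)))

(counter-bounds [1 1 -1 2] [1 -2])
;; => {:lower 1, :upper 4}
```

A final value above the upper bound indicates overcounting and one below the lower bound indicates undercounting.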
Failures We Consider
How does this work under a variety of network partitions?
What about with node crashes?
Even if nodes are flushing and compacting?
And when nodes are being bootstrapped?
Or decommissioned?
While clocks drift?
How We Run
Start the Docker container
Install Java driver, Cassaforte, clj-ssh, and Jepsen
Use environment variables to point to build under test
Run lein test with any desired selectors and profiles
Tunable Options
Should we make a best-effort attempt to scale test length?
Should we enable commitlog compression, the coordinator batchlog on materialized views, or hinted handoff?
Is a different compaction strategy or phi value in the failure detector appropriate for this test?
Should we install from a tagged release, a URL pointing to a tarball, or a local tarball?
Should we leave Cassandra running after the test?
What We’ve Found
Issues with counter undercounting/overcounting (#10143)
Decommission race conditions causing gossip problems (#10231)
Write durability violations when recovering commitlog (#9851)
Problems with merging of collections (#10001)
Batchlog replay failures after decommission/crash (#10068)
Incorrect asserts in counter write-path when timestamps collide
A variety of materialized view issues during development
Work We Shared
Minor Jepsen fixes/features (Jepsen PRs #58, 59, 62)
Docker images to run Jepsen tests (Docker Hub: tjake/jepsen)
Multibox Vagrant configurations to run Jepsen tests (on GitHub)
Upstream library fixes (clj-ssh PR #36)
Cassandra Jepsen tests (on GitHub)
Available on CassCI (on cassci.datastax.com)
Jepsen on CassCI
Lessons I Learned
Tests verifying invariants under failures are valuable and practical
These tests can and should be a part of regular development
Testing complex systems is hard, but there is low-hanging fruit
Jepsen provides one readily available way to accomplish this goal
Considering invariants against a recorded test run is effective
Invariants should be explicit and carefully considered in design
Thanks
Jake Luciani
DataStax
The Cassandra community
Kyle Kingsbury
QUESTIONS?
TLA+ • TLC • TLAPS • Clojure • Formal Methods • Jepsen • CRDTs • Cassandra • Gossip • Consistency Models • Alloy • Model Checking • Testing
@joelknighton
#CassandraSummit