IMC Summit 2016 Breakout - Noah Arliss - The Truth: How to Test Your Distributed System for the Real...
TESTING DISTRIBUTED SYSTEMS IN ANGER
NOAH ARLISS
SENIOR DEVELOPMENT MANAGER, WORKDAY
See all the presentations from the In-Memory Computing Summit at http://imcsummit.org
WHO AM I?
Noah Arliss
Workday Senior Development Manager
16+ years of experience in software development, architecture, design and management
Distributed systems domain expert
  Decentralized Security (WebLogic Enterprise Security)
  Data Fabrics (Oracle Coherence)
  Fabric Team (Workday)
Passionate about distributed computing and building teams to deliver complex technologies with quality and reliability
WHY ARE WE HERE?
THE PROMISE
The distributed systems “holy grail”
  Reliability
  Availability
  Scalability
  Performance
“Let’s just throw hardware at the problem”
THE PARADIGM SHIFT
It’s all about partitions, not physical location
Data is highly available
Idempotent operations
Run your code where the data lives
Event driven architectures
Lock free parallel algorithms vs. procedural processing
Know the rules of physics for your system
THE TRADEOFF: CAP THEOREM
In a distributed system, you can only have two out of the following three guarantees* across consecutive read and write operations:
Consistency - a read is guaranteed to return the most recent write for a given client
Availability - a non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout)
Partition Tolerance - the system will continue to function when network partitions occur
*Or what I call the “you can’t have your distributed cake and eat it too” theorem
NETWORKS ARE UNRELIABLE
A fallacy of distributed computing is that networks are reliable. They aren’t!
In the real world partitions will happen, so partition tolerance is not optional and there are only two choices:
CP (Consistency/Partition Tolerance) - wait for a response from the partitioned node, which could result in a timeout error
AP (Availability/Partition Tolerance) - return the most recent version of the data the node has, which could be stale
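The CP/AP choice can be sketched as a toy read path for a single node. This is only an illustration under stated assumptions: PartitionedRead, ReadResult and ReadTimeoutException are hypothetical names, and the partition is modeled as a simple boolean rather than detected from real network behavior.

```java
/** Toy model of one node's read path during a partition; all names are hypothetical. */
final class PartitionedRead {
    enum Mode { CP, AP }

    /** Unchecked stand-in for the timeout a CP system surfaces to the caller. */
    static final class ReadTimeoutException extends RuntimeException {
        ReadTimeoutException(String message) {
            super(message);
        }
    }

    /** Result of a read, flagged when the value may be out of date. */
    static final class ReadResult {
        final String value;
        final boolean possiblyStale;

        ReadResult(String value, boolean possiblyStale) {
            this.value = value;
            this.possiblyStale = possiblyStale;
        }
    }

    /**
     * @param localValue  the most recent value this node has seen
     * @param partitioned true if the node cannot reach the rest of the cluster
     * @param mode        which guarantee the system keeps during a partition
     */
    static ReadResult read(String localValue, boolean partitioned, Mode mode) {
        if (!partitioned) {
            // Peers are reachable, so the value can be confirmed as current.
            return new ReadResult(localValue, false);
        }
        if (mode == Mode.CP) {
            // Consistency wins: fail the read rather than risk returning stale data.
            throw new ReadTimeoutException("no response from partitioned peers");
        }
        // Availability wins: answer from local state and accept possible staleness.
        return new ReadResult(localValue, true);
    }
}
```

A test can then assert both behaviors explicitly: the CP read fails during a partition, while the AP read succeeds but is marked as possibly stale.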
WHAT COULD POSSIBLY GO WRONG
Node GC
Node deadlock
Loss of a node
Loss of a machine
Network failures
Network saturation
Over-provisioned systems
Loss of a data center
DETERMINISTIC TESTING: A FRAMEWORK
Challenge: develop a multi-process test suite where a single test can start and stop multiple nodes deterministically (when a predicate is satisfied)
Bonus points for running the same local test on multiple remote hosts
Bonus points for running a single unit test as a load test!
Bonus points for throwing chaos at the system
EVENTUAL ASSERTIONS (AN ATOMIC BUILDING BLOCK)
In a distributed system answers aren’t always readily available
Test a condition for a period of time before failing
Exponentially back off the test so as not to overtax the system
I want to…
  Assert that my cluster is balanced
  Assert that my process is running
  Assert that my service is deployed
  …
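A minimal "roll your own" sketch of an eventual assertion, assuming a simple polling loop with a doubling delay. The class and method names here are hypothetical and are not the Awaitility or Oracle Bedrock API.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: retries a condition with exponential backoff before failing. */
final class Eventually {
    private Eventually() {
    }

    /**
     * Polls the condition until it holds or the timeout elapses.
     * The pause between attempts doubles each time, capped at maxDelay.
     */
    static void assertEventually(String description, BooleanSupplier condition,
                                 Duration timeout, Duration initialDelay, Duration maxDelay) {
        long deadline = System.nanoTime() + timeout.toNanos();
        long delayMillis = Math.max(1, initialDelay.toMillis());
        long maxDelayMillis = Math.max(1, maxDelay.toMillis());
        while (true) {
            if (condition.getAsBoolean()) {
                return;
            }
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000;
            if (remainingMillis <= 0) {
                throw new AssertionError("Not satisfied within " + timeout + ": " + description);
            }
            try {
                // Never sleep past the deadline, so the condition gets one last check.
                Thread.sleep(Math.min(delayMillis, remainingMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("Interrupted while waiting for: " + description, e);
            }
            delayMillis = Math.min(delayMillis * 2, maxDelayMillis);
        }
    }
}
```

A call such as `Eventually.assertEventually("cluster is balanced", () -> cluster.isBalanced(), Duration.ofSeconds(30), Duration.ofMillis(50), Duration.ofSeconds(2))` would cover the first item in the list above, where `cluster.isBalanced()` is a placeholder for whatever check the system under test exposes.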
EVENTUAL ASSERTION PROJECTS
Roll your own
Awaitility - https://github.com/jayway/awaitility
Oracle Bedrock Eventually - https://github.com/coherence-community/oracle-bedrock/blob/master/bedrock-testing-support/src/main/java/com/oracle/bedrock/deferred/Eventually.java
PROCESS MANAGEMENT (AN ATOMIC BUILDING BLOCK)
We prefer multi-JVM tests
Deterministically and programmatically manage process lifecycle
  Local host processes
  Remote host processes
Extensibility through configuration over code
  Container Support
  AWS
PROCESS MANAGEMENT PROJECTS
Java Process Management Projects
  Rolled our own
  Ignite Project currently uses GridAbstractTest.startGrid(…) and GridAbstractTest.stopGrid(…)
  Oracle Bedrock - https://github.com/coherence-community/oracle-bedrock/tree/master/bedrock-runtime/src/main/java/com/oracle/bedrock/runtime
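For local processes, programmatic lifecycle management can be sketched with the JDK's ProcessBuilder alone. NodeProcess is a hypothetical wrapper written for illustration; it is not the framework described in the talk, nor the Bedrock or Ignite API. It assumes the child node should run on the same JVM installation as the test, and it discards child output where a real framework would capture it per node.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical wrapper around one child JVM started by a test. */
final class NodeProcess {
    private final Process process;

    private NodeProcess(Process process) {
        this.process = process;
    }

    /** Launches the current JVM binary with the given arguments, e.g. "-cp", classpath, mainClass. */
    static NodeProcess startJvm(String... jvmArgs) {
        List<String> command = new ArrayList<>();
        command.add(Paths.get(System.getProperty("java.home"), "bin", "java").toString());
        command.addAll(Arrays.asList(jvmArgs));
        try {
            Process started = new ProcessBuilder(command)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // assumption: logs not needed here
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            return new NodeProcess(started);
        } catch (IOException e) {
            throw new UncheckedIOException("could not start " + command, e);
        }
    }

    boolean isAlive() {
        return process.isAlive();
    }

    /** Waits for the node to exit and returns its exit code. */
    int awaitExit(long timeout, TimeUnit unit) {
        try {
            if (!process.waitFor(timeout, unit)) {
                throw new IllegalStateException("node still running after " + timeout + " " + unit);
            }
            return process.exitValue();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for node", e);
        }
    }

    /** Kills the node abruptly, as a crash would, then waits until it is gone. */
    void kill() {
        process.destroyForcibly();
        awaitExit(30, TimeUnit.SECONDS);
    }
}
```

Combined with an eventual assertion on `isAlive()`, a single test can start several nodes, wait for a condition, and kill one of them at a known point instead of relying on sleeps.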
TESTING WITH CHAOS
The Process Monkey:
  Thread 1: Perform some deterministic operation against the system
  Thread 2: Validate that the operation is successful
  Thread 3: Throw chaos at the system by randomly killing nodes in the system
Example: Aggregation Test
  Thread 1: Insert monotonically increasing values into a cache
  Thread 2: Calculate a checksum by getting all values and ensuring their sum is equal to the highest value inserted
  Thread 3: Randomly kill a node in the system
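The three-thread shape of the aggregation test can be sketched against an in-process stand-in for a cluster. Everything here is an assumption made for illustration: SimulatedCluster is a toy fully replicated map, not a real data grid, and killing a "node" just drops one replica. The slide does not spell out the checksum arithmetic, so this sketch inserts the values 1..n and checks that the sum read back equals n(n+1)/2 for the highest value n observed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

/** Toy fully replicated cache. Every method is synchronized to keep the simulation simple. */
final class SimulatedCluster {
    private final List<Map<Long, Long>> liveReplicas = new ArrayList<>();
    private final Random random = new Random();

    SimulatedCluster(int nodes) {
        if (nodes < 1) throw new IllegalArgumentException("need at least one node");
        for (int i = 0; i < nodes; i++) liveReplicas.add(new HashMap<>());
    }

    synchronized void put(long key, long value) {
        for (Map<Long, Long> replica : liveReplicas) replica.put(key, value);
    }

    /** Kills a random node unless it holds the last copy of the data. */
    synchronized boolean killRandomNode() {
        if (liveReplicas.size() <= 1) return false;
        liveReplicas.remove(random.nextInt(liveReplicas.size()));
        return true;
    }

    /** Starts a replacement node that copies its state from a survivor. */
    synchronized void startNode() {
        liveReplicas.add(new HashMap<>(liveReplicas.get(0)));
    }

    /** Reads every value from one random live node and returns {sum, highest}. */
    synchronized long[] sumAndHighest() {
        Map<Long, Long> replica = liveReplicas.get(random.nextInt(liveReplicas.size()));
        long sum = 0, highest = 0;
        for (long value : replica.values()) {
            sum += value;
            highest = Math.max(highest, value);
        }
        return new long[] {sum, highest};
    }
}

final class ProcessMonkeyTest {
    /** Runs writer, validator and monkey threads; returns the final sum read from the cluster. */
    static long run(int nodes, long inserts) {
        SimulatedCluster cluster = new SimulatedCluster(nodes);
        AtomicBoolean done = new AtomicBoolean(false);
        AtomicReference<String> failure = new AtomicReference<>();

        Thread writer = new Thread(() -> {      // Thread 1: deterministic operation
            try {
                for (long i = 1; i <= inserts; i++) cluster.put(i, i);
            } finally {
                done.set(true);
            }
        });
        Thread validator = new Thread(() -> {   // Thread 2: validate while chaos is running
            while (!done.get()) {
                check(cluster, failure);
                LockSupport.parkNanos(100_000L);
            }
        });
        Thread monkey = new Thread(() -> {      // Thread 3: kill and replace random nodes
            while (!done.get()) {
                if (cluster.killRandomNode()) cluster.startNode();
                LockSupport.parkNanos(100_000L);
            }
        });
        List<Thread> threads = List.of(writer, validator, monkey);
        threads.forEach(Thread::start);
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException("interrupted while waiting for test threads", e);
            }
        }
        check(cluster, failure);                // final validation after the chaos stops
        if (failure.get() != null) throw new AssertionError(failure.get());
        long[] last = cluster.sumAndHighest();
        if (last[1] != inserts) throw new AssertionError("highest value " + last[1] + " != " + inserts);
        return last[0];
    }

    private static void check(SimulatedCluster cluster, AtomicReference<String> failure) {
        long[] result = cluster.sumAndHighest();
        long expected = result[1] * (result[1] + 1) / 2; // closed-form sum of 1..highest
        if (result[0] != expected) failure.compareAndSet(null, "sum " + result[0] + " != " + expected);
    }
}
```

In this toy the check passes because writes reach every live replica and a replacement node copies a survivor's state. Changing `startNode` to add an empty map models a node that rejoins without recovering its data, and the validator would then typically report a mismatch, which is the kind of bug the process monkey is meant to surface.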
We want a network monkey too
Inspired by:
PERFORMANCE TESTING
JMeterJUnitTestRunner test: structure a JUnit test to run as a JMeter test.
Modify @BeforeClass and @AfterClass to run at the beginning and end of the full test run
Run the same @Test N iterations across M threads
Write the results out over Graphite to InfluxDB
Track system telemetry while tracking test performance
Graphs for everything
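The "same test, N iterations across M threads" idea can be sketched with a plain thread pool. LoadRunner is a hypothetical harness that uses no JMeter or JUnit APIs. It assumes N means iterations per thread, and it only collects latencies in memory rather than shipping them to Graphite or InfluxDB.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical harness that reuses one test body as a small load test. */
final class LoadRunner {
    static final class Result {
        final long[] sortedLatenciesNanos; // successful iterations only, ascending
        final int failures;

        Result(long[] sortedLatenciesNanos, int failures) {
            this.sortedLatenciesNanos = sortedLatenciesNanos;
            this.failures = failures;
        }

        /** Nearest-rank percentile over successful iterations, for p in (0, 100]. */
        long percentileNanos(double p) {
            int n = sortedLatenciesNanos.length;
            if (n == 0) throw new IllegalStateException("no successful iterations");
            int rank = (int) Math.ceil(p / 100.0 * n);
            return sortedLatenciesNanos[Math.max(1, Math.min(rank, n)) - 1];
        }
    }

    /** Runs the test body iterationsPerThread times on each of the given number of threads. */
    static Result run(Runnable test, int threads, int iterationsPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<long[]>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(() -> timeIterations(test, iterationsPerThread)));
            }
            long[] all = new long[threads * iterationsPerThread];
            int ok = 0;
            int failures = 0;
            for (Future<long[]> future : futures) {
                for (long latency : future.get()) {
                    if (latency < 0) failures++;
                    else all[ok++] = latency;
                }
            }
            long[] sorted = Arrays.copyOf(all, ok);
            Arrays.sort(sorted);
            return new Result(sorted, failures);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("load run did not complete", e);
        } finally {
            pool.shutdownNow();
        }
    }

    private static long[] timeIterations(Runnable test, int iterations) {
        long[] latencies = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            try {
                test.run();
                latencies[i] = System.nanoTime() - start;
            } catch (RuntimeException | AssertionError e) {
                latencies[i] = -1; // marks a failed iteration
            }
        }
        return latencies;
    }
}
```

Because the body is just a Runnable, the same lambda can back both a functional unit test and a load run, with setup and teardown executed once around the whole call to `run`.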
WHERE WE ARE NOW
Internal test framework is run on every Ignite/GridGain upgrade and every platform improvement
Working on giving back to the community: expect to see a pull request in the near future to improve the multi-process JVM testing
Network Monkey: can we leverage the Simian Army from Netflix to create the same process-level problems on the network in AWS? Can we do it in our own data centers as well?
The Fabric team: putting the easy button on distributed computing (we’re hiring!)
RECOMMENDED READING
Kyle Kingsbury’s blog - https://aphyr.com
Brewer’s Conjecture - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.6951&rep=rep1&type=pdf
Why you can’t sacrifice the P in CAP - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.6951&rep=rep1&type=pdf
The Fallacies of distributed computing - http://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
Q&A