IMC Summit 2016 Breakout - Noah Arliss - The Truth: How to Test Your Distributed System for the Real...

TESTING DISTRIBUTED SYSTEMS IN ANGER
NOAH ARLISS
SENIOR DEVELOPMENT MANAGER, WORKDAY

See all the presentations from the In-Memory Computing Summit at http://imcsummit.org

WHO AM I?

Workday Senior Development Manager
16+ years of experience in software development, architecture, design and management
Distributed systems domain expert
  Decentralized Security (WebLogic Enterprise Security)
  Data Fabrics (Oracle Coherence)
  Fabric Team (Workday)
Passionate about distributed computing and building teams to deliver complex technologies with quality and reliability

Noah Arliss

WHY ARE WE HERE?

THE PROMISE

The distributed systems “holy grail”:
  Reliability
  Availability
  Scalability
  Performance

“Let’s just throw hardware at the problem”

THE PARADIGM SHIFT

It’s all about partitions, not physical location
Data is highly available
Idempotent operations
Run your code where the data lives
Event-driven architectures
Lock-free parallel algorithms vs. procedural processing
Know the rules of physics for your system

THE TRADEOFF: CAP THEOREM

In a distributed system, you can only have two out of the following guarantees* across consecutive read and write operations:
  Consistency - a read is guaranteed to return the most recent write for a given client
  Availability - a non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout)
  Partition Tolerance - the system will continue to function when network partitions occur

*Or what I call the “you can’t have your distributed cake and eat it too” theorem

NETWORKS ARE UNRELIABLE

A fallacy of distributed computing is that networks are reliable - they aren’t!

In the real world, there are only two choices:
  CP (Consistency/Partition Tolerance) - wait for a response from the partitioned node, which could result in a timeout error
  AP (Availability/Partition Tolerance) - return the most recent version of the data the node has, which could be stale
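The two choices above can be illustrated with a small sketch. Everything in it (the Replica class, the Mode enum, the partitioned flag) is hypothetical and exists only for illustration; it models a single replica that may be cut off from the node holding the latest write, and is not how any particular data grid implements its reads.

```java
import java.util.concurrent.TimeoutException;

class PartitionedReadSketch {
    enum Mode { CP, AP }

    /** Hypothetical replica that may be cut off from the node holding the latest write. */
    static class Replica {
        private final Mode mode;
        private String localValue;
        private boolean partitioned;

        Replica(Mode mode, String initialValue) {
            this.mode = mode;
            this.localValue = initialValue;
        }

        void setPartitioned(boolean partitioned) {
            this.partitioned = partitioned;
        }

        /** Applies a write made elsewhere; it never arrives while the replica is partitioned. */
        void replicate(String value) {
            if (!partitioned) {
                localValue = value;
            }
        }

        String read() throws TimeoutException {
            if (partitioned && mode == Mode.CP) {
                // CP: refuse to answer rather than risk returning stale data.
                throw new TimeoutException("cannot confirm the most recent write");
            }
            // AP (or a healthy network): answer from local state, which may be stale.
            return localValue;
        }
    }

    public static void main(String[] args) throws Exception {
        Replica ap = new Replica(Mode.AP, "v1");
        ap.setPartitioned(true);
        ap.replicate("v2"); // lost: the replica is partitioned
        System.out.println("AP read during partition: " + ap.read());
    }
}
```

A test for a CP system would assert on the timeout; a test for an AP system would assert that the stale value is eventually replaced once the partition heals.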

WHAT COULD POSSIBLY GO WRONG

Node GC
Node deadlock
Loss of a node
Loss of a machine
Network failures
Network saturation
Over-provisioned systems
Loss of a data center

DETERMINISTIC TESTING: A FRAMEWORK

Challenge: develop a multi-process test suite where a single test can start and stop multiple nodes deterministically (when a predicate is satisfied)

Bonus points for running the same local test on multiple remote hosts

Bonus points for running a single unit test as a load test!

Bonus points for throwing chaos at the system

EVENTUAL ASSERTIONS (AN ATOMIC BUILDING BLOCK)

In a distributed system answers aren’t always readily available

Test a condition for a period of time before failing
Exponentially back off the test so as not to overtax the system

I want to…
  Assert that my cluster is balanced
  Assert that my process is running
  Assert that my service is deployed
  …
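A "roll your own" eventual assertion might look like the minimal sketch below. The class name EventualAssertion, the method assertEventually and its parameters are all hypothetical; this is not the Awaitility or Oracle Bedrock API, only an illustration of polling a condition with exponential back-off until a timeout.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

class EventualAssertion {
    /**
     * Polls the condition until it holds or the timeout elapses.
     * The pause between attempts doubles each time, capped at maxDelay,
     * so that a slow system is not hammered with checks.
     */
    static void assertEventually(String description, BooleanSupplier condition,
                                 Duration timeout, Duration initialDelay, Duration maxDelay) {
        long deadline = System.nanoTime() + timeout.toNanos();
        long maxDelayMillis = Math.max(1, maxDelay.toMillis());
        long delayMillis = Math.max(1, initialDelay.toMillis());
        while (true) {
            if (condition.getAsBoolean()) {
                return;
            }
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000;
            if (remainingMillis <= 0) {
                throw new AssertionError("Condition not met within " + timeout + ": " + description);
            }
            try {
                Thread.sleep(Math.min(delayMillis, remainingMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("Interrupted while waiting for: " + description, e);
            }
            delayMillis = Math.min(delayMillis * 2, maxDelayMillis);
        }
    }

    public static void main(String[] args) {
        long readyAt = System.currentTimeMillis() + 200;
        // Stand-in for "my service is deployed": becomes true after roughly 200 ms.
        assertEventually("service is deployed",
                () -> System.currentTimeMillis() >= readyAt,
                Duration.ofSeconds(5), Duration.ofMillis(10), Duration.ofMillis(500));
        System.out.println("condition eventually held");
    }
}
```

In a real test the lambda would query the cluster (for example, whether partitions are balanced) instead of checking the clock.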

EVENTUAL ASSERTION PROJECTS

Roll your own
Awaitility - https://github.com/jayway/awaitility
Oracle Bedrock Eventually - https://github.com/coherence-community/oracle-bedrock/blob/master/bedrock-testing-support/src/main/java/com/oracle/bedrock/deferred/Eventually.java

PROCESS MANAGEMENT (AN ATOMIC BUILDING BLOCK)

We prefer multi-JVM tests
Deterministically and programmatically manage process lifecycle
  Local host processes
  Remote host processes
Extensibility through configuration over code
  Container Support
  AWS
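For local host processes, programmatic lifecycle management can be sketched with the JDK's own ProcessBuilder. The LocalJvmNode class and its method names are hypothetical and are not taken from Ignite or Oracle Bedrock; the sketch also assumes the child JVM can be launched from the java.home of the JVM running the test.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

class LocalJvmNode {
    private final Process process;

    private LocalJvmNode(Process process) {
        this.process = process;
    }

    /** Starts a child JVM on the local host with the given arguments (for example -cp and a main class). */
    static LocalJvmNode start(List<String> jvmArgs) throws IOException {
        // Assumption: reuse the java binary of the JVM that is running the test.
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> command = new ArrayList<>();
        command.add(javaBin);
        command.addAll(jvmArgs);
        return new LocalJvmNode(new ProcessBuilder(command).inheritIO().start());
    }

    boolean isRunning() {
        return process.isAlive();
    }

    /** Waits for the node to exit on its own; returns true if it did so within the timeout. */
    boolean awaitExit(long timeout, TimeUnit unit) throws InterruptedException {
        return process.waitFor(timeout, unit);
    }

    /** Kills the node abruptly, as a chaos test would, and waits for the process to disappear. */
    void kill() throws InterruptedException {
        process.destroyForcibly().waitFor();
    }

    int exitCode() {
        return process.exitValue();
    }

    public static void main(String[] args) throws Exception {
        LocalJvmNode node = LocalJvmNode.start(Arrays.asList("-version"));
        boolean exited = node.awaitExit(30, TimeUnit.SECONDS);
        System.out.println("exited=" + exited + " running=" + node.isRunning());
    }
}
```

A test would typically combine this with an eventual assertion: start a node, wait until a predicate such as "cluster size is 3" holds, then kill the node and wait for the next predicate.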

PROCESS MANAGEMENT PROJECTS

Java Process Management Projects
  Rolled our own
  Ignite Project currently uses GridAbstractTest.startGrid(…) and GridAbstractTest.stopGrid(…)
  Oracle Bedrock - https://github.com/coherence-community/oracle-bedrock/tree/master/bedrock-runtime/src/main/java/com/oracle/bedrock/runtime



TESTING WITH CHAOS

The Process Monkey:
  Thread 1: Perform some deterministic operation against the system
  Thread 2: Validate that the operation is successful
  Thread 3: Throw chaos at the system by randomly killing nodes in the system

Example: Aggregation Test
  Thread 1: Insert monotonically increasing values into a cache
  Thread 2: Calculate a checksum by getting all values and ensuring their sum is equal to the highest value inserted
  Thread 3: Randomly kill a node in the system
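The three-thread structure can be sketched against an in-memory stand-in for a cluster. Everything here is an assumption made for illustration: SimulatedCluster is a hypothetical fully replicated map rather than a real cache, "killing" a node just discards one replica and starts a replacement, and the checksum is one assumed reading of the slide (the values 1..n must sum to n(n+1)/2, where n is the highest value seen).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class ProcessMonkeySketch {

    /** Hypothetical stand-in for a cluster: every live node holds a full replica. */
    static class SimulatedCluster {
        private final List<Map<Long, Long>> nodes = new ArrayList<>();

        SimulatedCluster(int size) {
            for (int i = 0; i < size; i++) nodes.add(new HashMap<>());
        }

        synchronized void put(long key, long value) {
            for (Map<Long, Long> node : nodes) node.put(key, value);
        }

        /** Returns a copy of the data as seen by the first live node. */
        synchronized Map<Long, Long> snapshot() {
            return new HashMap<>(nodes.get(0));
        }

        /** Kills one random node, then starts a replacement that copies a survivor. */
        synchronized void killAndReplaceRandomNode(Random random) {
            nodes.remove(random.nextInt(nodes.size()));
            nodes.add(new HashMap<>(nodes.get(0)));
        }
    }

    /** True when no inserted value is missing: the values 1..max must sum to max * (max + 1) / 2. */
    static boolean checksumHolds(Map<Long, Long> data) {
        long sum = 0;
        long max = 0;
        for (long value : data.values()) {
            sum += value;
            max = Math.max(max, value);
        }
        return sum == max * (max + 1) / 2;
    }

    /** Runs the writer, validator and monkey threads; returns the number of failed checks. */
    static int run(int inserts, long seed) {
        SimulatedCluster cluster = new SimulatedCluster(3);
        AtomicBoolean done = new AtomicBoolean(false);
        AtomicInteger failures = new AtomicInteger();

        Thread writer = new Thread(() -> {           // Thread 1: deterministic operation
            for (long i = 1; i <= inserts; i++) cluster.put(i, i);
            done.set(true);
        });
        Thread validator = new Thread(() -> {        // Thread 2: validation
            while (!done.get()) {
                if (!checksumHolds(cluster.snapshot())) failures.incrementAndGet();
            }
        });
        Thread monkey = new Thread(() -> {           // Thread 3: chaos
            Random random = new Random(seed);
            while (!done.get()) {
                cluster.killAndReplaceRandomNode(random);
                Thread.yield();
            }
        });

        writer.start();
        validator.start();
        monkey.start();
        try {
            writer.join();
            validator.join();
            monkey.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for test threads", e);
        }
        if (!checksumHolds(cluster.snapshot())) failures.incrementAndGet();
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println("failed checks: " + run(1000, 42L));
    }
}
```

Against a real cluster the monkey thread would kill actual processes, and the validator would read through the cache API, so a non-zero failure count would point at lost or duplicated data.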

We want a network monkey too

Inspired by:

PERFORMANCE TESTING

JMeterJUnitTestRunner test – structure a JUnit test to run as a JMeter test.
  Modify @BeforeClass and @AfterClass to run at the beginning and end of the full test run
  Run the same @Test N iterations across M threads
  Write the results out over Graphite to InfluxDB
  Track system telemetry while tracking test performance
  Graphs for everything
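The "N iterations across M threads" idea can be sketched without JMeter using a plain thread pool. LoadRunnerSketch and its methods are hypothetical and are not the JMeterJUnitTestRunner mentioned above; the sketch assumes N means iterations per thread, and the metric path in main is made up. The only external fact relied on is Graphite's plaintext line format of metric path, value and epoch seconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class LoadRunnerSketch {

    /** Runs the test body `iterations` times on each of `threads` threads; returns every latency in nanoseconds. */
    static List<Long> run(Runnable testBody, int threads, int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Callable<List<Long>> worker = () -> {
                List<Long> latencies = new ArrayList<>();
                for (int i = 0; i < iterations; i++) {
                    long start = System.nanoTime();
                    testBody.run();
                    latencies.add(System.nanoTime() - start);
                }
                return latencies;
            };
            List<Future<List<Long>>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(worker));
            }
            List<Long> all = new ArrayList<>();
            for (Future<List<Long>> future : futures) {
                all.addAll(future.get());
            }
            return all;
        } catch (Exception e) {
            throw new IllegalStateException("load run failed", e);
        } finally {
            pool.shutdownNow();
        }
    }

    /** Formats one data point in Graphite's plaintext protocol: "<path> <value> <epoch-seconds>\n". */
    static String graphiteLine(String metricPath, double value, long epochSeconds) {
        return metricPath + " " + value + " " + epochSeconds + "\n";
    }

    public static void main(String[] args) {
        List<Long> latencies = run(() -> Math.sqrt(12345.0), 4, 1000);
        double meanMicros = latencies.stream().mapToLong(Long::longValue).average().orElse(0) / 1_000.0;
        // Hypothetical metric path; a real run would send this line to the Graphite listener.
        System.out.print(graphiteLine("tests.sketch.mean_us", meanMicros, System.currentTimeMillis() / 1000));
    }
}
```

Keeping the test body a plain Runnable is what lets the same unit test double as a load test: the functional assertion stays inside the body while the runner only adds repetition, concurrency and timing.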

WHERE WE ARE NOW

Internal Test Framework run on every Ignite/GridGain upgrade and every platform improvement

Working on giving back to the community – expect to see a pull request in the near future to improve the multi-process JVM testing

Network Monkey – can we leverage the Simian Army from Netflix to create the same process-level problems on the network in AWS? Can we do it in our own data centers as well?

The Fabric team – putting the easy button on distributed computing (we’re hiring!)

RECOMMENDED READING

Kyle Kingsbury’s blog - https://aphyr.com
Brewer’s Conjecture - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.6951&rep=rep1&type=pdf
Why you can’t sacrifice the P in CAP - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.6951&rep=rep1&type=pdf
The Fallacies of distributed computing - http://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
