Deterministic behaviour and performance in trading systems

Peter Lawrey, CEO of Higher Frequency Trading. Google Developer Group 2015. Deterministic Behaviour and Performance in Trading

Peter Lawrey

Java Developer/Consultant for hedge fund and trading firms for 6 years.

Most answers for Java and JVM on stackoverflow.com

Founder of the Performance Java User’s Group.

Architect of Chronicle Software

Agenda

• Lambda functions and state machines

• Record every input

• Determinism by Design

• Record every output

• The consequences of Little’s Law

• Java 8 and Garbage

• Chronicle Queue demo

Lambda Functions

• No mutable state

• Easy to reason about

• Easy to componentize

• But … no mutable state.

State Machine

• Local mutable state

• Easier to reason about than shared state

• Easier to componentize

• Not as simple as Lambda Functions

Lambda functions and a state machine
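As a minimal sketch of how the two ideas combine: the transition logic is a stateless lambda, and the state machine owns the only mutable field. The names PositionKeeper, APPLY_FILL and onFill are hypothetical and chosen for illustration only.

```java
import java.util.function.LongBinaryOperator;

/** Hypothetical example: a position keeper written as a small state machine. */
final class PositionKeeper {
    // Lambda function with no mutable state: (position, signed fill quantity) -> new position.
    static final LongBinaryOperator APPLY_FILL = (position, quantity) -> position + quantity;

    // Local mutable state, owned by this state machine and never shared.
    private long position;

    /** Single entry point for events; the only place the state changes. */
    void onFill(long signedQuantity) {
        position = APPLY_FILL.applyAsLong(position, signedQuantity);
    }

    long position() {
        return position;
    }

    public static void main(String[] args) {
        PositionKeeper keeper = new PositionKeeper();
        keeper.onFill(100);  // buy 100
        keeper.onFill(-30);  // sell 30
        System.out.println(keeper.position()); // prints 70
    }
}
```

The lambda can be tested on its own with plain values, while the state machine can be tested by feeding it a sequence of events.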

Record every input

• By recording every input you can recreate the state of the system at any point and recreate bugs, test rare conditions, and test the latency distribution of your system.

But

• This approach doesn’t support software upgrades.

• A replay facility which is implemented after the fact might not recreate your system completely.

Determinism by design

• You want a system where producers write every event, and consumers are continuously in replay. This way you can be sure that you have this facility early in the development cycle and you know that you have recorded every event/input.

• This facility helps you test your system by allowing you to build anything from small, simple tests to huge, complex, data-driven tests.
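A minimal sketch of "consumers are continuously in replay", assuming an in-memory journal in place of a persisted queue. EventJournal, RunningTotal and JournalDemo are hypothetical names, not part of any Chronicle API. The live consumer and a consumer rebuilt from scratch use the same read path, so they reach the same state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class JournalDemo {
    public static void main(String[] args) {
        EventJournal<Long> journal = new EventJournal<>();
        RunningTotal live = new RunningTotal();
        int next = 0;

        journal.append(5L);
        journal.append(7L);
        next = journal.replay(next, live);   // the live consumer reads by replaying
        journal.append(-2L);
        next = journal.replay(next, live);

        // Recreate the state later (e.g. in a test) by replaying from the start.
        RunningTotal rebuilt = new RunningTotal();
        journal.replay(0, rebuilt);
        System.out.println(live.total + " " + rebuilt.total); // prints 10 10
    }
}

/** Hypothetical in-memory journal; a real system would persist every event. */
final class EventJournal<E> {
    private final List<E> events = new ArrayList<>();

    /** Producers only ever append. */
    void append(E event) {
        events.add(event);
    }

    /** Delivers events from fromIndex onwards and returns the next index to read. */
    int replay(int fromIndex, Consumer<? super E> consumer) {
        for (int i = fromIndex; i < events.size(); i++) {
            consumer.accept(events.get(i));
        }
        return events.size();
    }
}

/** A consumer whose state depends only on the events it has seen. */
final class RunningTotal implements Consumer<Long> {
    long total;

    @Override
    public void accept(Long value) {
        total += value;
    }
}
```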

Record every output

• Supports live software upgrades. By recording and replaying outcomes, you can have a system which commits to any decision the previous version made, i.e. you can change the software to make different decisions from then on.

• This can be tested at the API level by having two state machines, where the input of one is the output of the other.
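One way to picture this, as a sketch under stated assumptions: a hypothetical Trader state machine records each order it sends, and its position changes only when it processes such an output. An upgraded version with a different rule is fed the recorded outputs of the old one, so it starts committed to the orders already sent. The class names, prices and the fixed order size of 10 are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

final class OutputReplayDemo {
    public static void main(String[] args) {
        List<Long> recordedOutputs = new ArrayList<>();

        // Version 1 buys whenever the price is below 100.
        Trader v1 = new Trader(price -> price < 100, recordedOutputs);
        v1.onPrice(99);
        v1.onPrice(101);
        v1.onPrice(98);

        // Version 2 has a different rule. The outputs of v1 become its inputs,
        // so it honours decisions already made instead of re-deciding them.
        Trader v2 = new Trader(price -> price < 95, new ArrayList<>());
        recordedOutputs.forEach(v2::onOrderSent);
        v2.onPrice(97); // v1 would have bought here; v2 does not

        System.out.println(v1.position() + " " + v2.position()); // prints 20 20
    }
}

/** Hypothetical state machine: prices in, orders out. */
final class Trader {
    private static final long ORDER_SIZE = 10;
    private final LongPredicate shouldBuy;
    private final List<Long> outputs;
    private long position;

    Trader(LongPredicate shouldBuy, List<Long> outputs) {
        this.shouldBuy = shouldBuy;
        this.outputs = outputs;
    }

    /** Input event; may produce an output, which is recorded before it is applied. */
    void onPrice(long price) {
        if (shouldBuy.test(price)) {
            outputs.add(ORDER_SIZE);
            onOrderSent(ORDER_SIZE);
        }
    }

    /** Output event fed back as an input; the only place the position changes. */
    void onOrderSent(long quantity) {
        position += quantity;
    }

    long position() {
        return position;
    }
}
```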

Little’s law

Little’s law states:

The long-term average number of customers in a stable system L is equal to the long-term average effective arrival rate, λ, multiplied by the (Palm-)average time a customer spends in the system, W; or expressed algebraically: L = λW

Little’s law as work.

The number of active workers must be at least the average arrival rate of tasks multiplied by the average time to complete those tasks.

workers >= tasks/second * seconds to perform task.

Or

throughput <= workers / latency.
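The two inequalities above can be checked with a few lines of arithmetic. This is a sketch only; the class name, method names and sample figures are made up for illustration, and times are in whole microseconds to keep the arithmetic exact.

```java
final class LittlesLaw {
    private static final long MICROS_PER_SECOND = 1_000_000;

    /** workers >= tasks/second * seconds per task, rounded up to whole workers. */
    static long minWorkers(long tasksPerSecond, long microsPerTask) {
        return (tasksPerSecond * microsPerTask + MICROS_PER_SECOND - 1) / MICROS_PER_SECOND;
    }

    /** throughput <= workers / latency, in tasks per second. */
    static long maxThroughput(long workers, long microsPerTask) {
        return workers * MICROS_PER_SECOND / microsPerTask;
    }

    public static void main(String[] args) {
        // Many independent tasks: 10,000 requests/s at 2 ms each need 20 workers.
        System.out.println(minWorkers(10_000, 2_000)); // prints 20
        // One worker only: halving latency from 20 us to 10 us doubles throughput.
        System.out.println(maxThroughput(1, 20));      // prints 50000
        System.out.println(maxThroughput(1, 10));      // prints 100000
    }
}
```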

Consequences of Little’s law

• If you have a problem with a high degree of independent tasks, you can throw more workers at the problem to handle the load. E.g. web services

• If you have a problem with a low degree of independent tasks, adding more workers will mean more will be idle. E.g. many trading systems. The solution is to reduce latency to increase throughput.

Consequences of Little’s law

• Average latency is a function, sometimes the inverse, of the throughput.

• Throughput focuses on the average experience. The worst cases are often the ones which hurt you, but averages are very good at hiding them, e.g. long GC pauses.

• Testing with co-ordinated omission also hides worst-case latencies.

Co-ordinated omission

• A term coined by Gil Tene.

• Co-ordinated omission occurs when the system being tested is allowed to apply back pressure on the system doing the testing. When the system being tested is slow, it can effectively pause the test. This matters especially when averages or latency percentiles are considered.

Co-ordinated omission: Example

• A shop is open 10 hours a day between 8 AM and 6 PM.

• A customer comes every 5 minutes, waits to be served and leaves.

• When the shop keeper is there, he takes 1 minute to serve.

• But if he takes a 2 hour lunch break, how does this affect the average latency or the 98th percentile?

How not to measure latency.

• You have one person go to the shop and time how long she has to wait. Once per day she has to wait 2 hours and 1 minute, but the rest of the day it only takes 1 minute.

• The average of 97 tests is 2.2 minutes. Had the shop keeper been there all day, there would have been 120 tests, but one test took 2 hours. Not great, but it doesn’t sound much worse than 1 minute.

• The 98th percentile is 1 minute.

Avoiding co-ordinated omission

• You have as many people as you need. Most of the time, only one is waiting. However, over the lunch break, there are 31 people delayed by 121, 117, 113, 109 … 5 mins.

• The average of 120 tests is 16.5 minutes wait time. This is much higher than the 2.2 minutes calculated previously.

• The 98th percentile is 111 minutes, instead of 1 minute in the previous test.
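The shop example can be reproduced with a small simulation. This is a sketch under assumptions the slides do not state: lunch runs from minute 240 to minute 360 of the 600 minute day, the single tester starts her next test as soon as the previous one ends if she is already late, and the percentile is computed by linear interpolation between ranks. ShopLatency and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class ShopLatency {
    static final int OPEN_MINUTES = 600, INTERVAL = 5, SERVICE = 1;
    static final int LUNCH_START = 240, LUNCH_END = 360; // assumed lunch window

    /** Minute at which a customer leaves, given when the shop keeper is next free. */
    static int finishTime(int arrival, int keeperFreeAt) {
        int start = Math.max(arrival, keeperFreeAt);
        if (start >= LUNCH_START && start < LUNCH_END) start = LUNCH_END;
        return start + SERVICE;
    }

    /** One tester: she cannot start a new test until the previous one has finished. */
    static List<Integer> coordinated() {
        List<Integer> latencies = new ArrayList<>();
        int t = 0;
        while (t < OPEN_MINUTES) {
            int finish = finishTime(t, t);
            latencies.add(finish - t);
            t = Math.max(t + INTERVAL, finish); // the slow shop delays the next test
        }
        return latencies;
    }

    /** Independent customers arrive on schedule whether or not the shop keeps up. */
    static List<Integer> uncoordinated() {
        List<Integer> latencies = new ArrayList<>();
        int keeperFreeAt = 0;
        for (int arrival = 0; arrival < OPEN_MINUTES; arrival += INTERVAL) {
            int finish = finishTime(arrival, keeperFreeAt);
            latencies.add(finish - arrival);
            keeperFreeAt = finish;
        }
        return latencies;
    }

    static double mean(List<Integer> xs) {
        return xs.stream().mapToInt(Integer::intValue).average().orElse(Double.NaN);
    }

    /** Percentile by linear interpolation between ranks (one of several definitions). */
    static double percentile(List<Integer> xs, double p) {
        List<Integer> sorted = new ArrayList<>(xs);
        Collections.sort(sorted);
        double rank = p * (sorted.size() - 1);
        int lo = (int) Math.floor(rank), hi = (int) Math.ceil(rank);
        return sorted.get(lo) + (rank - lo) * (sorted.get(hi) - sorted.get(lo));
    }

    public static void main(String[] args) {
        for (List<Integer> run : List.of(coordinated(), uncoordinated())) {
            System.out.println(String.format(Locale.ROOT, "tests=%d mean=%.1f p98=%.0f",
                    run.size(), mean(run), percentile(run, 0.98)));
        }
        // prints tests=97 mean=2.2 p98=1
        // prints tests=120 mean=16.5 p98=111
    }
}
```

A nearest-rank percentile would give 113 rather than 111 for the second run, so the exact figure depends on the definition used. The gap between the two measurements does not.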

Doesn’t the GC stop the world?

• The GC only pauses the JVM when it has some work to do. Produce less garbage and it will pause less often.

• Produce less than 1 GB/hour of garbage and you can get less than one pause per day. (With a 24 GB Eden)
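The figure above is simple arithmetic: a young collection is triggered when Eden fills, so the time between collections is roughly the Eden size divided by the allocation rate. A sketch of that calculation follows, with hypothetical names. It ignores any other reason the JVM might have to collect.

```java
final class EdenBudget {
    /** Approximate hours until Eden fills at a steady allocation rate. */
    static double hoursBetweenCollections(double edenGigabytes, double garbageGigabytesPerHour) {
        return edenGigabytes / garbageGigabytesPerHour;
    }

    public static void main(String[] args) {
        // 24 GB of Eden at 1 GB/hour lasts about a day.
        System.out.println(hoursBetweenCollections(24, 1));  // prints 24.0
        // At 12 GB/hour the same Eden fills every two hours.
        System.out.println(hoursBetweenCollections(24, 12)); // prints 2.0
    }
}
```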

Do I need to avoid all objects?

• In Java 8 you can have very short lived objects placed on the stack. This requires your code to be inlined and escape analysis to kick in. When this happens, no garbage is created and the code is faster.

• You can have very long lived objects, provided you don’t have too many.

• The rest of your data you can place in native memory (off heap)

• You can create 1 GB/hour of garbage and still not trigger a GC
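A sketch of the kind of code the first bullet describes. The Vec objects below never leave the method that creates them, so once the method is hot and minus() and lengthSquared() are inlined, the JIT may eliminate the allocations (HotSpot does this by scalar replacement). Whether that happens depends on the JVM and its settings, so treat it as something to measure rather than a guarantee. The class and method names are hypothetical.

```java
final class EscapeAnalysisSketch {
    /** Small immutable value type; instances are created and dropped inside one loop iteration. */
    static final class Vec {
        final double x, y;

        Vec(double x, double y) {
            this.x = x;
            this.y = y;
        }

        Vec minus(Vec other) {
            return new Vec(x - other.x, y - other.y);
        }

        double lengthSquared() {
            return x * x + y * y;
        }
    }

    /** Sums the squared distance between consecutive points. */
    static double sumOfSquaredSteps(double[] xs, double[] ys) {
        double sum = 0;
        for (int i = 1; i < xs.length; i++) {
            // None of these three objects escapes this method.
            Vec current = new Vec(xs[i], ys[i]);
            Vec previous = new Vec(xs[i - 1], ys[i - 1]);
            sum += current.minus(previous).lengthSquared();
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] xs = {0, 3, 3};
        double[] ys = {0, 4, 6};
        System.out.println(sumOfSquaredSteps(xs, ys)); // prints 29.0
    }
}
```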

How does Java 8 avoid creating objects?

One way to think of Java 8 lambdas is the ability to pass behaviour to a library. With inlining, an alternative view is the ability to template your code. Consider this locking example:

    lock.lock();
    try {
        doSomething();
    } finally {
        lock.unlock();
    }

How does Java 8 avoid creating objects?

This boilerplate can be templated:

    public static void withLock(Lock lock, Runnable runnable) {
        lock.lock();
        try {
            runnable.run();
        } finally {
            lock.unlock();
        }
    }

How does Java 8 avoid creating objects?

This simplifies the code to:

    withLock(lock, () -> doSomething());

Doesn’t using a Runnable create an object?

With inlining and escape analysis, the Runnable can be placed on the stack and eliminated (as it has no fields).
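Put together as a runnable sketch: the withLock template from the slides is used with a ReentrantLock and a lambda that captures nothing. The WithLockDemo class, the counter and the thread counts are assumptions added for the demonstration.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

final class WithLockDemo {
    static final Lock LOCK = new ReentrantLock();
    static long counter;

    /** The locking boilerplate, written once. */
    static void withLock(Lock lock, Runnable runnable) {
        lock.lock();
        try {
            runnable.run();
        } finally {
            lock.unlock();
        }
    }

    /** The lambda only touches a static field, so it captures no state. */
    static void increment() {
        withLock(LOCK, () -> counter++);
    }

    static long counter() {
        return counter;
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) increment();
        };
        Thread first = new Thread(work);
        Thread second = new Thread(work);
        first.start();
        second.start();
        first.join();
        second.join();
        System.out.println(counter()); // prints 200000
    }
}
```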

Low Latency with lots of Lambdas

Chronicle Wire is an API for generic serialization and deserialization. You determine what you want to read/write, but the exact wire format can be injected. This works for Yaml, Binary Yaml, and raw data. It will support XML, FIX, JSON and BSON.

This uses lambdas extensively but the objects associated can be eliminated.

Low Latency with lots of Lambdas

    wire.writeDocument(false, out ->
        out.write(() -> "put")
           .marshallable(m ->
               m.write(() -> "key").int64(n)
                .write(() -> "value").text(words[n])));

As Yaml

--- !!data

put: { key: 1, value: hello }

As Binary Yaml

⒗٠٠٠Ãput\u0082⒎٠٠٠⒈åhello

Isn’t writing to disk slow?

• Uncommitted synchronous writes can be extremely fast, typically around a microsecond. The writes are synchronous to the application, so data is not lost if the application dies, but they are not actually committed to disk.

• To prevent loss of data on power failure, you can use replication.
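One common way to get this behaviour on the JVM is a memory-mapped file. The slides do not say which mechanism is used, so the sketch below is an assumption rather than a description of Chronicle Queue. A write to the mapping is a plain memory write into the OS page cache. On mainstream operating systems it survives the death of the process, but it is only guaranteed to be on the device after force(). MappedWriteSketch and its method are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedWriteSketch {
    /** Writes a length-prefixed message through a mapping, then reads it back from the file. */
    static String writeAndReadBack(String text) {
        try {
            Path file = Files.createTempFile("journal", ".dat");
            file.toFile().deleteOnExit();
            try (FileChannel channel = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
                byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
                // Uncommitted write: returns as soon as the bytes are in the page cache.
                buffer.putInt(bytes.length).put(bytes);
                // buffer.force() here would wait for the storage device (a committed write).
            }
            // Reading through the file system sees the data although force() was never called.
            ByteBuffer fromFile = ByteBuffer.wrap(Files.readAllBytes(file));
            byte[] read = new byte[fromFile.getInt()];
            fromFile.get(read);
            return new String(read, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeAndReadBack("hello"));
    }
}
```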

Low latency with failover

• Data sent between servers takes half a round trip.

• Inputs are written on both servers.

• Outputs are written on both servers.

• The end to end latency can be 25 µs, 99% of the time.

Demo from http://chronicle.software/products/chronicle-queue/

Next Steps

• Chronicle is open source so you can start right away!

• Working with clients to produce Chronicle Enterprise

• Support contract for Chronicle and consultancy

Q & A

Peter Lawrey

@PeterLawrey

http://chronicle.software

http://vanillajava.blogspot.com