C* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian

Time is Money

Financial Time Series Jake Luciani and Carl Yeksigian

BlueMountain Capital

About this talk Part 1: Our use case and architecture Part 2: Our deployment and tuning Part 3: Q&A

Know your problem. 1000s of consumers ..creating and reading data as fast as possible ..consistent to all readers ..and handle ad-hoc user queries ..quickly ..across data centers.

Know your data.

AAPL price

MSFT price

Know your queries.

Time Series Query

Start, End, Periodicity defines query

1 minute periods

Know your queries.

Cross Section Query

As Of time defines the query

As Of Time (11am)

Know your queries. Cross sections are random Storing for all possible Cross Sections is not possible. We also support bi-temporality

Let's optimize for Time Series.

CREATE TABLE tsdata ( id blob, property string, asof_ticks bigint, knowledge_ticks bigint, value blob, PRIMARY KEY(id,property,asof_ticks,knowledge_ticks)

) WITH COMPACT STORAGE AND CLUSTERING ORDER BY(asof_ticks DESC, knowledge_ticks DESC)

Data Model (CQL 3)

SELECT * FROM tsdata WHERE id = 0x12345 AND property = 'lastPrice' AND asof_ticks >= 1234567890 AND asof_ticks <= 2345678901

CQL3 Queries: Time Series

CQL3 Queries: Cross Section SELECT * FROM tsdata WHERE id = 0x12345 AND property = 'lastPrice' AND asof_ticks = 1234567890 AND knowledge_ticks < 2345678901 LIMIT 1

A Service, not an app

C*

Olympus

Olym

pus

Olympus

Oly

mpu

s

App

App

App

App

App

App

App

App

App

App

Fat Client

Olympus Thrift Service Olympus Thrift Service

Complex Value Types Not every value is a double Some values belong together (Bid and Ask should always come back together) Thrift structures as values Typed, extensible schema Union types give us a way to deserialize any type

Ad-hoc querying UI

But that's the easy part...

(queue transition)

Scaling... The first rule of scaling is you do not just turn everything to 11.

Scaling... Step 1 - Fast Machines for your workload Step 2 - Avoid Java GC for your workload Step 3 - Tune Cassandra for your workload Step 4 - Prefetch and cache for your workload

Can't fix what you can't measure Riemann (http://riemann.io) Easily push application and system metrics into a single system We push 6k metrics per second to a single Riemann instance

Metrics: Riemann Yammer Metrics with Riemann

https://gist.github.com/carlyeks/5199090

Metrics: Riemann Push stream based metrics library Riemann Dash for Why is it Slow? Graphite for Why was it Slow?

VisualVM: The greatest tool EVER Many useful plugins... Just start jstatd on each server and go!

Scaling Reads: Machines SSDs for hot data JBOD config As many cores as possible (> 16) 10GbE network Bonded network cards Jumbo frames

JBOD is a lifesaver SSDs are great until they aren't anymore JBOD allowed passive recovery in the face of simultaneous disk failures (SSDs had a bad firmware)

Scaling Reads: Cassandra Changes we've made: • Configuration • Compaction • Compression

Leveled Compaction Wide rows means data can be spread across a huge number of SSTables Leveled Compaction puts a bound on the worst case (*) Fewer SSTables to read means lower latency, as shown below; orange SSTables get read

L0

L1

L2

L3

L4

L5

* In Theory

Leveled Compaction: Breaking Bad Under high write load, forced to read all of the L0 files

L0

L1

L2

L3

L4

L5

Hybrid Compaction: Breaking Better Size Tiering Level 0 On by default in 2.0

L0

L1

L2

L3

L4

L5

{ Hybrid

Compaction

Size Tiered

Leveled

Overlapping Compaction Instead of forcing a combination of L0 files with L1, we can just push up files This allows a higher level of concurrency in compactions We still know the SSTables that might contain the keys We can force a proper compaction at any configurable level

L0

L1

L2

L3

L4

L5

C optimized library Read path needs to be fast for our workload CRC check, composite comparison eat a lot of cycles CRC is implemented on chip for some architectures (why not use it?) We want to move some of the operations into a JNI library to reduce latency and improve throughput

Current Stats 16 nodes 2 Data Centers Replication Factor 6 200k Writes/sec at EACH_QUORUM 150k Reads/sec at LOCAL_QUORUM > 30 Million time series > 15 Billion points 10 TB on disk (compressed) Read Latency 50%/95% is 1ms/5ms

Questions? Thank you! @tjake and @carlyeks

C* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian

Technology

Transcript of C* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian