ScyllaDB: NoSQL at Ludicrous Speed
Uploaded by j-on-the-beach, category: Technology
ScyllaDB
● Clustered NoSQL database compatible with Apache Cassandra
● ~10X performance on the same hardware
● Low latency, especially at the higher percentiles
● Self-tuning
● Mechanically sympathetic C++14
YCSB benchmark: a 3-node Scylla cluster vs. 3, 9, 15, and 30 Cassandra machines
[Chart: throughput of 3 Scylla nodes compared against 3 and 30 Cassandra nodes]
Scylla vs. Cassandra at CL=LOCAL_QUORUM: Outbrain case study
Scylla and Cassandra handling the full load (peak of ~12M requests per minute)
[Chart: Cassandra latency ~200ms vs. Scylla ~10ms, i.e. 20x lower latency]
Scylla benchmark by Samsung
[Chart: operations per second]
Full report: http://tinyurl.com/msl-scylladb
Data model
[Diagram: a partition key maps to many rows, ordered within the partition by one or more clustering keys]
CREATE TABLE playlists (
    id int,
    song_id int,
    title text,
    PRIMARY KEY (id, song_id)
);
INSERT INTO playlists (id, song_id, title) VALUES (62, 209466, 'Ænima');
Sorted by Primary Key
Log-Structured Merge Tree
[Diagram: over time, flushes produce SSTables 1 through 5 (foreground job); compaction merges SSTables 1+2+3 into a single SSTable (background job)]
Implementation Goals
● Efficiency:
○ Make the most out of every cycle
● Utilization:
○ Squeeze every cycle from the machine
● Control:
○ Spend the cycles on what we want, when we want
AGENDA
❏ Introducing ScyllaDB
❏ System Architecture
❏ Node Architecture
❏ Seastar
❏ Resource Management
❏ Workload Conditioning
❏ Closing
Enter Seastar: www.seastar-project.org
● Thread-per-core design (shard)
○ No blocking. Ever.
● Asynchronous networking, file I/O, multicore
● Future/promise based APIs
● Usermode TCP/IP stack included in the box
Seastar task scheduler: traditional stack vs. Seastar stack

Traditional stack: each CPU runs a scheduler over many threads, each with its own stack.
● A thread is a function pointer
● A stack is a byte array, from 64k to megabytes
● Context switch cost is high; large stacks pollute the caches

Seastar stack: each CPU runs one scheduler over many promise/task pairs.
● A promise is a pointer to an eventually computed value
● A task is a pointer to a lambda function
● No sharing between cores; millions of parallel events
Futures

future<> f = _conn->read_exactly(4).then([this] (temporary_buffer<char> buf) {
    int id = buf_to_id(buf);
    unsigned core = id % smp::count;
    return smp::submit_to(core, [id] {
        return lookup(id);
    }).then([this] (sstring result) {
        return _conn->write(result);
    });
});
Seastar memory allocator
● Not thread-safe!
○ Each core gets a private memory pool
● Allocation back pressure
○ The allocator calls a callback when low on memory
○ Scylla evicts cache in response
● Inter-core free() through message passing
Resource Management
[Diagram: the Cassandra cache sits above the Linux page cache, which sits above the SSTables]

The Linux page cache:
● 4k granularity
● Thread-safe
● Synchronous APIs
● General-purpose
● Lack of control
● ...on the other hand:
○ It exists
○ Hundreds of man-years of work
○ Handles lots of edge cases
● Page faults:
[Diagram: the app thread takes a page fault and is suspended; the kernel initiates I/O (context switch); the SSD completes the I/O (interrupt, context switch); the kernel maps the page and resumes the thread]
Yet another allocator (problems with malloc/free)
● Memory gets fragmented over time if the workload changes the sizes of allocated objects
● Allocating a large contiguous block requires evicting most of the cache
Log-structured memory allocation
● Bump-pointer allocation into the current segment
● Frees leave holes in segments
● Compaction will try to solve this
Compacting LSA
● Teach the allocator how to move objects around
○ Updating references
● Garbage collect and compact
○ Starting with the most sparse segments
○ Lock to pin objects
● Used mostly for the cache
○ A large majority of memory allocated
○ A small subset of allocation sites
Workload Conditioning
Workload Conditioning

[Diagram: the Seastar scheduler multiplexes memtable flushes, compaction, queries, repair, and the commitlog across the CPU, SSD, and WAN]

● A Compaction Backlog Monitor watches the compaction backlog and tells the scheduler to adjust compaction priority
● A Memory Monitor watches memory pressure and adjusts priorities the same way
Closing
Conclusions
● Careful system design and control of the software stack can maximize throughput
○ Without sacrificing latency
○ Without requiring complex end-user tuning
○ While having a lot of fun
How to interact
● Download: http://www.scylladb.com
● Twitter: @ScyllaDB
● Source: http://github.com/scylladb/scylla
● Mailing lists: scylladb-user @ groups.google.com
● Slack: ScyllaDB-Users
● Blog: http://www.scylladb.com/blog
● Join: http://www.scylladb.com/company/careers
● Me: [email protected]