Cloudius Systems presents:
Writing a Modern Highly Scalable Application
Where Linux Helps You, Where Linux Stands in Your Way
@glcst - Linuxcon 2016
Part 1: The application
Part 2: The framework
Part 1: The application
The basics:
- Scylla is a datastore.
- Scylla is a NoSQL datastore.
- Scylla is a highly available, eventually consistent datastore.
- Scylla is a highly available, eventually consistent datastore, compatible with Apache Cassandra.
Some examples of datastores
- SQL: structured, no scale
- Document store: no structure, some scale
- Column store: some structure, scale out, awesome HA/DR
- Key-value: simple, scale, not a real DB
Part 1: The application
The basics:
- Scylla is a highly available, eventually consistent datastore, compatible with Apache Cassandra, but with 10x its throughput.
Where you had consistency/durability:
- User-defined replication factor (RF) and consistency level (CL)
- Write behavior determined by RF:
  - Durable for fewer than RF failures.
- Read behavior determined by CL:
  - Consistent for CL >= RF / 2 + 1
- Availability increases as RF increases, CL decreases.
- Tunable consistency: meet the needs of the application.
  - Tables where eventual consistency can be tolerated use high RF, low CL.
  - Tables with data that must remain in sync use high CL.
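The RF/CL rules on this slide can be written down as a few one-line functions. This is a minimal sketch, not Scylla code: the function names are hypothetical, and the overlap test `write_cl + read_cl > rf` is the usual quorum argument, assumed here rather than stated on the slide.

```cpp
// Quorum size as written on the slide: RF / 2 + 1 (integer division).
constexpr int quorum(int rf) { return rf / 2 + 1; }

// "Durable for fewer than RF failures": data survives while one replica remains.
constexpr int max_survivable_failures(int rf) { return rf - 1; }

// Assumption: a read observes the latest write when the replicas contacted by
// the write and by the read are forced to overlap in at least one node.
constexpr bool read_sees_latest_write(int rf, int write_cl, int read_cl) {
    return write_cl + read_cl > rf;
}

static_assert(quorum(3) == 2, "RF=3 needs 2 replicas for a quorum");
static_assert(read_sees_latest_write(3, quorum(3), quorum(3)),
              "quorum reads and quorum writes overlap");
static_assert(!read_sees_latest_write(3, 1, 1),
              "CL=1 on both sides may miss the latest write");
```

With RF = 3, lowering CL to 1 keeps the table available with two replicas down, at the price of possibly stale reads; that is the trade-off the "tunable consistency" bullet refers to.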
Where you had a “primary key”:
- 2 components: partition key, clustering key (optional)
https://jslvtr.gitbooks.io/big-data-analysis/
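A rough way to picture the two key components is a map of maps. The sketch below is only an in-memory model with hypothetical names (`table`, `owner_of`, `rows_in_order`); it assumes that a hash of the partition key alone decides where a partition lives and that rows inside a partition are kept sorted by clustering key. The real token ring and on-disk format are not modeled.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: partition key -> rows ordered by clustering key.
using partition = std::map<int, std::string>;
using table = std::map<std::string, partition>;

// Assumption: the owner is picked from a hash of the partition key alone,
// so all rows of one partition live together.
std::size_t owner_of(const std::string& partition_key, std::size_t node_count) {
    return std::hash<std::string>{}(partition_key) % node_count;
}

// Rows inside a partition come back sorted by clustering key.
std::vector<std::string> rows_in_order(const table& t, const std::string& partition_key) {
    std::vector<std::string> rows;
    auto it = t.find(partition_key);
    if (it != t.end()) {
        for (const auto& kv : it->second) rows.push_back(kv.second);
    }
    return rows;
}

// Example data: three rows in one partition, inserted out of order.
table demo_table() {
    table t;
    t["sensor-7"][30] = "third";
    t["sensor-7"][10] = "first";
    t["sensor-7"][20] = "second";
    return t;
}
```

The clustering key is optional in the sense that a partition may hold a single row; in this model that is simply a partition with one entry.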
YCSB Benchmark: 3-node Scylla cluster vs 3-, 9-, 15-, and 30-node Cassandra clusters
Throughput
YCSB Benchmark:
How do we get 10x throughput?
- “Just rewrite in C++ can’t make it 10x faster”
  - True, but it allows us to (easily) do the things that can.
- Control how we use memory
  - Per-core memory allocation
  - No garbage collections -> no (unpredictable) pauses.
- Proximity to the hardware
  - Examples are userspace disk scheduler, and userspace network stack
Part 2: The framework
- Seastar is a highly scalable thread-per-core framework
  - I/O intensive applications
  - Turns out a datastore is a good example of an I/O intensive application
- Cost of a context switch: 1 us (Paul Turner, LPC 2013)
  - “Majority of the context-switching cost attributable to the complexity of the scheduling decision by a modern SMP cpu scheduler.”
  - For a 100 ms CPU hog: 0.001 %
  - For a 1 ms HDD latency (not counting seek): 0.1 %
  - For a single NVMe request (Samsung SM951-NVMe M.2: avg. lat = 22 µs): ~5 %
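The percentages on this slide are the context switch cost divided by the length of the operation it interrupts. A minimal sketch of that arithmetic, with hypothetical names and the slide's figures as constants:

```cpp
// Share of one operation spent on a single context switch, in percent.
constexpr double switch_overhead_percent(double switch_us, double operation_us) {
    return 100.0 * switch_us / operation_us;
}

// Figures quoted on the slide, all in microseconds.
constexpr double context_switch_us = 1.0;
constexpr double cpu_hog_us = 100000.0;  // 100 ms CPU hog
constexpr double hdd_us = 1000.0;        // 1 ms HDD latency
constexpr double nvme_us = 22.0;         // 22 µs average NVMe latency
```

`switch_overhead_percent(context_switch_us, nvme_us)` comes out at about 4.5, which the slide rounds to ~5 %. The faster the device, the larger the share of each request that goes to the scheduler.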
SCYLLA AND SEASTAR ARE DIFFERENT
❏ DMA
❏ Log structured merge tree
❏ DB-aware cache
❏ Userspace I/O scheduler
❏ NUMA friendly
❏ Log structured allocator
❏ Zero copy
❏ Thread per core
❏ Lock-free
❏ Task scheduler
❏ Reactor programming
❏ Multi queue
❏ Poll mode
❏ Userspace TCP/IP
SCYLLA DB: NETWORK COMPARISON
● KVM was invented by Avi in 2006, development was managed by Dor
● It was a new hypervisor after VMW, Xen had dominated the market
● By smart design choices and leveraging Linux and the hardware it became the most performing hypervisor.
  ○ KVM holds SPECvirt performance record
  ○ KVM holds max IOPS record
● The Open Virtualization Alliance includes hundreds of companies, including HP, IBM, Intel, AMD, Red Hat, etc
● KVM is the engine behind many clouds such as OpenStack, IBM, NTT, Fujitsu, HP, Google, DigitalOcean, etc.
[Diagram: traditional stack vs. Seastar’s sharded stack]
Traditional stack: Cassandra threads and memory above the kernel, which holds TCP/IP, the scheduler, queues, and the NIC queues.
- Lock contention
- Cache contention
- NUMA unfriendly
Seastar’s sharded stack: per core, in userspace: application (core database), TCP/IP, task scheduler, queues, smp queue, NIC queue, DPDK. The kernel isn’t involved.
- No contention
- Linear scaling
- NUMA friendly
Seastar Programming model
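    // pos, buf and size are captured because the lambda runs later, once the file is open.
    return open_file_dma(name, flags).then([pos, buf, size] (file f) {
        return f.dma_read(pos, buf, size);
    }).then([] (size_t bytes_read) {
        /* do something else */
    }).handle_exception([] (std::exception_ptr ep) {
        /* handle an exception */
    });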
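To show how such a chain behaves without pulling in Seastar, here is a toy stand-in written with the standard library only. `mini_future` is hypothetical and always already resolved, so nothing here is asynchronous. It illustrates one property of the chain: once a step throws, later `then` steps are skipped and control lands in `handle_exception`.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for a future: it is always already
// resolved and holds either a value or an exception.
template <typename T>
class mini_future {
    T value_{};
    std::exception_ptr error_;
public:
    explicit mini_future(T v) : value_(std::move(v)) {}
    explicit mini_future(std::exception_ptr e) : error_(std::move(e)) {}

    const T& get() const {
        if (error_) std::rethrow_exception(error_);
        return value_;
    }

    // Run fn on the value; skip it if an earlier step failed.
    template <typename Fn>
    auto then(Fn fn) -> mini_future<decltype(fn(std::declval<T&>()))> {
        using R = decltype(fn(value_));
        if (error_) return mini_future<R>(error_);
        try {
            return mini_future<R>(fn(value_));
        } catch (...) {
            return mini_future<R>(std::current_exception());
        }
    }

    // Run fn only if an earlier step failed; fn supplies a fallback value.
    template <typename Fn>
    mini_future<T> handle_exception(Fn fn) {
        if (!error_) return *this;
        return mini_future<T>(fn(error_));
    }
};

// Mirrors the shape of the slide's chain: open, read, continue, handle errors.
int read_size(bool fail) {
    return mini_future<std::string>(std::string("file"))
        .then([fail](const std::string& name) -> int {
            if (fail) throw std::runtime_error("read failed");
            return static_cast<int>(name.size());
        })
        .then([](int n) { return n * 2; })
        .handle_exception([](std::exception_ptr) { return -1; })
        .get();
}
```

In Seastar the continuations run later, on the same core, when the I/O completes; the toy version runs them immediately, which keeps the control flow easy to follow.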
Seastar has its own task scheduler
[Diagram: traditional stack vs. Scylla’s stack]
Traditional stack: threads, each with its own stack, scheduled onto CPUs.
- Thread is a function pointer
- Stack is a byte array from 64k to megabytes
- Context switch cost is high. Large stacks pollute the caches.
Scylla’s stack: promises and tasks queued per CPU.
- Promise is a pointer to eventually computed value
- Task is a pointer to a lambda function
- No sharing, millions of parallel events
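The "task is a pointer to a lambda function" idea can be sketched as a per-core run queue. The class below is a hypothetical illustration, not Seastar's scheduler: tasks are plain callables with no stack of their own, and each one runs to completion before the next one starts.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical single-core run queue: a task is just a callable.
class run_queue {
    std::deque<std::function<void()>> tasks_;
public:
    void schedule(std::function<void()> task) { tasks_.push_back(std::move(task)); }

    // Run tasks to completion in FIFO order, including tasks that other
    // tasks schedule while the queue is being drained.
    std::size_t run_all() {
        std::size_t executed = 0;
        while (!tasks_.empty()) {
            auto task = std::move(tasks_.front());
            tasks_.pop_front();
            task();
            ++executed;
        }
        return executed;
    }
};

// The first task schedules a follow-up, which runs after the second task.
std::vector<int> demo_order() {
    run_queue q;
    std::vector<int> order;
    q.schedule([&] {
        order.push_back(1);
        q.schedule([&] { order.push_back(3); });
    });
    q.schedule([&] { order.push_back(2); });
    q.run_all();
    return order;
}
```

Switching between tasks here is a function call and a queue pop, with no register save or stack switch, which is the contrast with thread context switches that the slide draws.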
Seastar minimizes cross CPU access
❏ A task is always scheduled in the same CPU it was originated
❏ Local memo
Seastar minimizes cross CPU access
- A task is always scheduled in the CPU in which it originated
- local memory allocation, local memory freeing
- Cross-cpu communication can happen, but is explicit
  - submit_to()
  - map_reduce()
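The sketch below models explicit cross-shard communication in a single thread. It borrows the names `submit_to` and `map_reduce` from the slide, but the signatures and the `shard` struct are hypothetical and do not match Seastar's API. It shows that a shard never touches another shard's state directly: it enqueues work that the owner runs itself.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical model of a shard: private state plus an inbox of messages.
struct shard {
    long counter = 0;                               // touched only by the owner
    std::deque<std::function<void(shard&)>> inbox;  // work sent by other shards
};

// Explicit cross-shard call: enqueue work for the target shard.
void submit_to(std::vector<shard>& shards, std::size_t target,
               std::function<void(shard&)> work) {
    shards[target].inbox.push_back(std::move(work));
}

// Each shard drains its own inbox, so state is only modified by its owner.
void poll_all(std::vector<shard>& shards) {
    for (auto& s : shards) {
        while (!s.inbox.empty()) {
            auto work = std::move(s.inbox.front());
            s.inbox.pop_front();
            work(s);
        }
    }
}

// Map a function over every shard's state and fold the results together.
long map_reduce(const std::vector<shard>& shards,
                const std::function<long(const shard&)>& map,
                const std::function<long(long, long)>& reduce, long initial) {
    long acc = initial;
    for (const auto& s : shards) acc = reduce(acc, map(s));
    return acc;
}

// Four shards each receive one message, then the counters are summed.
long demo_total() {
    std::vector<shard> shards(4);
    for (std::size_t i = 0; i < shards.size(); ++i) {
        submit_to(shards, i, [i](shard& s) { s.counter += static_cast<long>(i) + 1; });
    }
    poll_all(shards);
    return map_reduce(shards, [](const shard& s) { return s.counter; },
                      [](long a, long b) { return a + b; }, 0);
}
```

In a real thread-per-core program each shard would run on its own core and the inbox would be a cross-core queue; the single-threaded loop here only keeps the example short.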
Linux page cache
- Modern NoSQL databases trust it too much.
  - Both MongoDB and Cassandra just trust the Linux page cache
- Wrong granularity, false sharing, unpredictable latencies.
  - Example: 1k rows per page. 3 hot rows, but also the coldest row. Which to evict?
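One way around the granularity problem is to cache rows rather than pages. The class below is a hypothetical row-granular LRU cache built from standard containers. It is not Scylla's cache, only a sketch of the idea that the coldest row can be evicted on its own while hot neighbours stay.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical row-granular LRU cache: eviction is decided per row,
// so a cold row never shares its fate with hot rows stored next to it.
class row_cache {
    using entry = std::pair<std::string, std::string>;  // key, row
    std::size_t capacity_;
    std::list<entry> lru_;                              // front = most recently used
    std::unordered_map<std::string, std::list<entry>::iterator> index_;
public:
    explicit row_cache(std::size_t capacity) : capacity_(capacity) {}

    void put(const std::string& key, std::string row) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            lru_.erase(it->second);
            index_.erase(it);
        }
        lru_.emplace_front(key, std::move(row));
        index_[key] = lru_.begin();
        if (lru_.size() > capacity_) {  // evict only the coldest row
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
    }

    // Returns nullptr on a miss; a hit marks the row as most recently used.
    const std::string* get(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second);
        return &it->second->second;
    }
};

// Three hot rows and one cold row compete for three slots.
bool demo_coldest_row_evicted() {
    row_cache cache(3);
    cache.put("hot1", "a");
    cache.put("hot2", "b");
    cache.put("cold", "c");
    cache.get("hot1");
    cache.get("hot2");
    cache.put("hot3", "d");  // over capacity: "cold" is least recently used
    return cache.get("cold") == nullptr && cache.get("hot1") != nullptr
        && cache.get("hot2") != nullptr && cache.get("hot3") != nullptr;
}
```

A page cache facing the same access pattern has to keep or drop the whole page, which is the "which to evict?" dilemma in the example.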
Linux filesystems: our greatest enemies.
- Asynchronous I/O is not really asynchronous
  - “It’s ok, if it blocks something else runs instead”
    - there is no something else
  - “Thread per core” really becomes “two threads per core”
- XFS blocks under heavy load. Otherwise ok.
I/O Scheduling
[Diagram: Query, Commitlog, and Compaction each feed a queue; the queues feed the userspace I/O scheduler, which feeds the disk.]
I/O Scheduling
ext4, 4.3.3:
    # ./fsqual
    context switch per appending io: 1 (BAD)
XFS, 3.15:
    # ./fsqual
    context switch per appending io: 0 (GOOD)
I/O Scheduling
I/O Scheduling
- increased latency for no gain
- XFS screams. Better avoid it.
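I/O Scheduling
| Shares distribution (C1, C2, C3, C4) | C1 throughput (KB/s) | C2 throughput (KB/s) | C3 throughput (KB/s) | C4 throughput (KB/s) |
| --- | --- | --- | --- | --- |
| 10, 10, 10, 10 | 137506 | 137501 | 137501 | 137501 |
| 100, 100, 100, 100 | 137504 | 137499 | 137499 | 137499 |
| 10, 20, 40, 80 | 37333 | 73732 | 146566 | 292375 |
| 100, 10, 10, 10 | 421211 | 42922 | 42922 | 42922 |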
4 classes competing for the same I/O queue, with various shares distributions, on a single core. A 550 MB/s SSD is fully saturated.
From ScyllaDB’s blog: http://www.scylladb.com/2016/04/29/io-scheduler-2/
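The measured numbers track a simple proportional-share rule: each class gets the total bandwidth times its shares divided by the sum of all shares. A minimal sketch of that rule follows; the function name is hypothetical and this is not the scheduler's actual algorithm, only the arithmetic the results can be checked against.

```cpp
#include <vector>

// Expected throughput per class when bandwidth is split in proportion to shares.
std::vector<double> expected_throughput(const std::vector<unsigned>& shares, double total) {
    double sum = 0;
    for (unsigned s : shares) sum += s;
    std::vector<double> out;
    for (unsigned s : shares) out.push_back(sum > 0 ? total * s / sum : 0.0);
    return out;
}
```

For shares 10, 20, 40, 80 and a total of 550000 KB/s, the rule gives roughly 36667, 73333, 146667, and 293333 KB/s, close to the third row of the table. Only the ratios matter: 10, 10, 10, 10 and 100, 100, 100, 100 produce the same split.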
How to interact
+ Download: http://www.scylladb.com
+ Twitter: @ScyllaDB
+ Source: http://github.com/scylladb/scylla
+ Mailing lists: scylladb-user @ groups.google.com
+ Company site & blog: http://www.scylladb.com/
SCYLLA, NoSQL GOES NATIVE
Thank you.