Workload Analysis of Key-Value Stores on Non-Volatile Media
2017 Storage Developer Conference
Vishal Verma (Performance Engineer, Intel)
Tushar Gohad (Cloud Software Architect, Intel)
Outline
- KV Stores – What and Why
- Data Structures for KV Stores
- Design Choices and Trade-offs
- Performance on Non-Volatile Media
- Key Takeaways
What are KV Stores
A type of NoSQL database that stores data as simple key/value pairs.
An alternative to the limitations of traditional relational databases (DBs):
- Data is structured and the schema pre-defined – a mismatch with today's workloads
- Data growth is large and unstructured
- Lots of random writes and reads
NoSQL brings flexibility, as the application has complete control over what is stored inside the value.
What are KV Stores
The key in a key-value pair must (or at least, should) be unique.
Values are identified via a key; stored values can be numbers, strings, images, videos, etc.
API operations: get(key) to read data, put(key, value) to write data, and delete(key) to delete keys.
Phone directory example:

Key      Value
Bob      (123) 456-7890
Kyle     (245) 675-8888
Richard  (787) 122-2212
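The get/put/delete API above can be sketched with an in-memory dict. A minimal illustration (the class and method names are hypothetical, not from any particular KV store):

```python
# Minimal key-value store sketch: get/put/delete over an in-memory dict.

class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Keys are unique; put overwrites any existing value for the key.
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

# The phone-directory example from the slide:
store = KVStore()
store.put("Bob", "(123) 456-7890")
store.put("Kyle", "(245) 675-8888")
store.put("Richard", "(787) 122-2212")
```

Note that the application decides what goes in the value: a phone number here, but equally a serialized image or video blob.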
KV Stores: Benefits
- High performance: enables fast location of an object, rather than searching through columns or tables to find an object as in a traditional relational DB
- Highly scalable: can scale over several machines or devices by several orders of magnitude, without the need for significant redesign
- Flexible: no structure is enforced on the data
- Low TCO: simplifies the operations of adding or removing capacity as needed; a hardware or network failure does not create downtime
Outline
- KV Stores – What and Why
- Design Choices and Trade-offs
- Performance on Non-Volatile Media
- Key Takeaways
Design Choice: B-Tree
[Diagram: B-Tree with internal nodes (green) holding pivot keys and leaf nodes (blue) holding data records (KV)]
- Query time proportional to tree height – logarithmic time
- Insertion / deletion: many random I/Os to disk
- May incur rebalance – read and write amplification
- Full leaf nodes may be split – space fragmentation
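The height-proportional lookup described above can be sketched as a descent over sorted pivot keys. This is a simplified, illustrative structure; in a real B-Tree each node is an on-disk block, which is where the random I/Os come from:

```python
import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted pivots (internal) or record keys (leaf)
        self.children = children    # None for leaf nodes
        self.values = values        # data records, leaves only

def search(node, key):
    # Descends root-to-leaf, so cost is proportional to tree height.
    if node.children is None:                    # leaf: binary-search the records
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]
        return None
    i = bisect.bisect_right(node.keys, key)      # pivot keys select the child
    return search(node.children[i], key)

# Tiny two-level tree: pivots [13, 37]; keys < 13 left, 13..36 middle, >= 37 right.
leaves = [
    Node([2, 3, 5, 7, 11], values=["r2", "r3", "r5", "r7", "r11"]),
    Node([13, 17, 19, 23, 29, 31], values=["r13", "r17", "r19", "r23", "r29", "r31"]),
    Node([37, 41, 43, 47, 53], values=["r37", "r41", "r43", "r47", "r53"]),
]
root = Node([13, 37], children=leaves)
```

With a fan-out of hundreds of keys per block, even billion-row trees stay only a few levels deep, which is why query time is logarithmic.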
B-Tree: Trade-offs
- Example DB engines: BerkeleyDB, MySQL InnoDB, MongoDB WiredTiger
- Good design choice for read-intensive workloads
- Achieves read performance at the cost of increased read / write / space amplification
- Nodes of B-Trees are on-disk blocks – aligned IO means space amplification
- Compression is not effective: with a 16KB block size, 10KB of data compressed to 5KB will still occupy 16KB
- Byte-sized updates also end up as page-sized reads / writes
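The compression point is simple arithmetic: with block-aligned I/O, on-disk usage rounds up to a whole block. A sketch assuming the slide's 16KB block size:

```python
import math

BLOCK_SIZE = 16 * 1024          # 16KB on-disk B-Tree block, as on the slide

def on_disk_bytes(compressed_bytes, block_size=BLOCK_SIZE):
    # Block-aligned I/O rounds every write up to a whole block.
    return math.ceil(compressed_bytes / block_size) * block_size

# 10KB of data compressed to 5KB still occupies a full 16KB block:
occupied = on_disk_bytes(5 * 1024)   # 16384 bytes
```

So compressing 10KB to 5KB saves nothing on disk here; the savings only appear once compressed data is packed across block boundaries, as log-structured stores do.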
Performance Test Configuration
System Config:
- CPU: Intel(R) Xeon(R) CPU E5-2618L v4, 2.20GHz, HT disabled, 20 cores
- Memory: 64GB
Non-Volatile Storage:
- Intel® P4500 SSD (4TB)
OS Config:
- Distro: Ubuntu 16.04.1
- Kernel: 4.12.4
- Arch: x86_64
- Linux kernel tuning parameters applied
- Drop page cache after every test run
- XFS filesystem, agcount=32, mounted with discard
Key-Value Stores:
- WiredTiger 3.0.0
- RocksDB 5.4
- TokuDB 4.6.119
Benchmark:
- Dataset: 500GB – 1TB (500 million – 1 billion records); dataset-to-DRAM ratio > 3:1; compression off
- Key_size: 16 bytes, Value_size: 1000 bytes, Cache size: 32GB
- db_bench source: https://github.com/wiredtiger/leveldb.git, test duration: 30 minutes
- Workloads: ReadWrite (16 Readers, 1 Writer)
Readwrite: WiredTiger B-Tree, 500 million rows
Ops completed: 1 write thread 2.15 million, 16 read threads 27.36 million. Read plot represents a single reader.
[Plot: WiredTiger B-Tree Writes (Ops/sec, log scale) vs row count, 5000 ops per marker, 30 min run]
[Plot: WiredTiger B-Tree Reads (Ops/sec, log scale) vs row count, 5000 ops per marker, 30 min run]
Readwrite: WiredTiger B-Tree, 500 million rows
Test: single writer, 16 readers. Read plot represents single-reader latency.
[Plot: WiredTiger B-Tree Read Latency (sec/op, log scale) vs row count, 5000 ops per marker, 30 min run]
[Plot: WiredTiger B-Tree Write Latency (sec/op, log scale) vs row count, 5000 ops per marker, 30 min run]
Design Choice: LSM Tree
Log-Structured Merge-tree: two or more tree-like components
- In-memory tree (RocksDB: memtable)
- One or more trees on the persistent store (RocksDB: SST files)
Transforms random writes into a few sequential writes:
- Write-Ahead Log (WAL) – append-only journal
- In-memory store – inexpensive writes to memory as the first-level write, flushed sequentially to the first level in the persistent store
- Compaction – merge-sort-like background operation: a few sequential writes (better than the random IOs in the B-Tree case); trims duplicate data – minimal space amplification
Example DB engines: RocksDB, WiredTiger, Cassandra, CouchBase, LevelDB
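The memtable / flush / compaction flow can be sketched in a few lines. This is a toy model, not RocksDB's actual implementation; plain dicts stand in for the sorted in-memory tree and the SST files, and the WAL is omitted:

```python
# Toy LSM sketch: an in-memory memtable that flushes to immutable sorted
# runs; reads check the memtable first, then runs from newest to oldest.

class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # C0: in-memory tree
        self.runs = []                # C1..Cn: sorted runs on "disk", newest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # inexpensive first-level write
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # Sequential flush: sort the memtable and write it out as one run.
        run = sorted(self.memtable.items())
        self.runs.insert(0, dict(run))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:         # may scan every level: read amplification
            if key in run:
                return run[key]
        return None

    def compact(self):
        # Background merge: combine runs, dropping stale duplicate versions.
        merged = {}
        for run in reversed(self.runs):   # oldest first, so newer versions win
            merged.update(run)
        self.runs = [merged]
```

Note how `get` may touch every run before giving up: that is the read amplification the next slide discusses, and why duplicate trimming in `compact` keeps space amplification minimal.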
LSM Trees – Operation
[Diagram: N-level merge tree – the C0 in-memory tree merges into C1 … Cn trees on the persistent store via multi-page block merges]
- Transforms random writes into sequential writes using the WAL and in-memory tables
- Optimized for insertions by buffering
- Key-value items in the store are sorted to support faster lookups
Design Choice: LSM Tree
Writes – append-only constructs:
- No read-modify-write, no double write
- Reduced fragmentation, but sorts / merges to multiple levels cause write amplification
Reads – expensive point, order-by, and range queries:
- Might call for compaction
- Might need to scan all levels – read amplification
- Can be optimized with Bloom Filters (RocksDB)
Deletes – deferred via tombstones:
- Tombstones are scanned during queries
- Tombstones don't disappear until compaction
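The Bloom-filter optimization mentioned for reads works because a filter can answer "definitely absent" without touching the level on disk, so a point read only pays I/O for levels that might hold the key. A minimal sketch (sizes and hash scheme are illustrative, not RocksDB's actual filter):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may return false positives, never false negatives."""

    def __init__(self, nbits=1024, nhashes=3):
        self.bits = bytearray(nbits)
        self.nbits = nbits
        self.nhashes = nhashes

    def _positions(self, key):
        # Derive nhashes bit positions from salted hashes of the key.
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # All bits set -> key *might* be present; any bit clear -> definitely absent.
        return all(self.bits[p] for p in self._positions(key))
```

In an LSM read path, each on-disk level keeps a filter over its keys; a lookup probes the filter first and skips the level on a negative answer, cutting read amplification for keys that don't exist.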
LSM Trees: Trade-offs
Read / write / space amplification summary:
- Read amp: 1 up to the number of levels
- Write amp: 1 + 1 + fan-out
Most useful in applications with:
- Tiered storage with varying price / performance points (RAM, Persistent Memory, NVMe SSDs, etc.)
- Large datasets where inserts / deletes are more common than reads / searches
Better suited for write-intensive workloads.
Better compression: page/block alignment overhead is small compared to the size of the persistent trees (SST files in RocksDB).
Leveled LSMs have lower write and space amplification compared to B-Trees.
LSM Tree Example: RocksDB, 1 billion rows
Ops completed: 1 write thread 5.2 million, 16 read threads 6.3 million. Read plot represents a single reader.
[Plot: RocksDB LSM Write Throughput (Ops/sec, log scale) vs row count, 5000 ops per marker, 30 min run]
[Plot: RocksDB LSM Read Throughput (Ops/sec, log scale) vs row count, 5000 ops per marker, 30 min run]
LSM Tree Example: RocksDB, 1 billion rows
Test: single writer, 16 readers. Read plot represents single-reader latency.
[Plot: RocksDB LSM Read Latency (sec/op, log scale) vs row count, 5000 ops per marker, 30 min run]
[Plot: RocksDB LSM Write Latency (sec/op, log scale) vs row count, 5000 ops per marker, 30 min run]
Design Choice: Fractal Trees
- Merges features from B-Trees and LSM trees
- A Fractal Tree index has buffers at each node, which allow insertions, deletions, and other changes to be stored in intermediate locations
- Fast writes, slower reads and updates
- Better than LSM for range reads on a cold cache, but the same on a warm cache
- Commercialized in DBs by Percona (Tokutek)
Fractal Trees
[Diagram: B-Tree-like structure in which each internal node contains a buffer (shaded red); Insert(31, …) moves buffered records 41, 43 into the next level]
- Maintains a B-Tree in which each internal node contains a buffer
- To insert a data record, simply insert it into the buffer at the root of the tree, vs. traversing the entire tree (B-Tree)
- When the root buffer fills, inserted records move down one level until they reach the leaves and get stored in a leaf node, like a B-Tree
- Several leaf nodes are grouped to create sequential IO (~1-4MB)
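The buffered-insert behavior can be sketched with a two-way node carrying a message buffer. This is a toy model; TokuDB's real nodes have much wider fan-out and the buffers hold typed insert/delete/update messages:

```python
# Toy fractal-tree node: inserts land in the root buffer and are pushed
# down a level in batches only when the buffer overflows.

class FTNode:
    def __init__(self, pivot=None, children=None, capacity=4):
        self.buffer = {}                  # pending records (key -> value)
        self.pivot = pivot                # single pivot for this 2-way sketch
        self.children = children or []    # empty list means leaf
        self.capacity = capacity

    def insert(self, key, value):
        # Cheap on the happy path: just drop the record in this node's buffer.
        self.buffer[key] = value
        if len(self.buffer) > self.capacity and self.children:
            self.flush()

    def flush(self):
        # Move buffered records one level down, routed by the pivot key:
        # this batches many small writes into fewer, larger ones.
        for key, value in self.buffer.items():
            child = self.children[0] if key < self.pivot else self.children[1]
            child.insert(key, value)
        self.buffer = {}

    def get(self, key):
        if key in self.buffer:            # a pending record shadows the leaf
            return self.buffer[key]
        if not self.children:
            return None
        child = self.children[0] if key < self.pivot else self.children[1]
        return child.get(key)
```

Reads must check the buffer at every node on the root-to-leaf path, which is why fractal trees trade some read latency for their fast writes.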
Fractal Trees: Trade-offs
Read / write / space amplification summary:
- Write amplification similar to leveled LSM (RocksDB); better than B-Tree and LSM (in theory)
Most useful in applications with:
- Tiered storage with varying price / performance points (RAM, Persistent Memory, NVMe SSDs, etc.)
- Large datasets where inserts / deletes are more common than reads / searches
Better suited for write-intensive workloads.
Lower write and space amplification compared to B-Trees.
Fractal Tree Example: TokuDB, 1 billion rows
Test: single writer, 16 readers. Read plot represents a single reader.
[Plot: TokuDB Fractal Write Throughput (Ops/sec, log scale) vs row count, 5000 ops per marker, 30 min run]
[Plot: TokuDB Fractal Read Throughput (Ops/sec, log scale) vs row count, 5000 ops per marker, 30 min run]
Fractal Tree Example: TokuDB, 1 billion rows
Test: single writer, 16 readers. Read plot represents single-reader latency.
[Plot: TokuDB Fractal Write Latency (micros/op, log scale) vs row count, 5000 ops per marker, 30 min run]
[Plot: TokuDB Fractal Read Latency (micros/op, log scale) vs row count, 5000 ops per marker, 30 min run]
Outline
- KV Stores – What and Why
- Design Choices and Trade-offs
- Performance Comparisons
- Key Takeaways
Performance Test Configuration
System Config:
- CPU: Intel(R) Xeon(R) CPU E5-2618L v4, 2.20GHz, HT disabled, 20 cores
- Memory: 64GB
Non-Volatile Storage:
- Intel® P4500 SSD (4TB)
OS Config:
- Distro: Ubuntu 16.04.1
- Kernel: 4.12.4
- Arch: x86_64
- Linux kernel tuning parameters applied
- Drop page cache after every test run
- XFS filesystem, agcount=32, mounted with discard
Key-Value Stores:
- WiredTiger 3.0.0
- RocksDB 5.4
- TokuDB 4.6.119
Benchmark:
- Dataset: 200GB (200 million records); dataset-to-DRAM ratio > 3:1; compression off
- Key_size: 16 bytes, Value_size: 1000 bytes, Cache size: 32GB
- db_bench source: https://github.com/wiredtiger/leveldb.git, test duration: 30 minutes
- Workloads: ReadWrite (16 Readers, 1 Writer), Overwrite (4 Writers)
KV Workload #0: Fill (Insert), 1 Writer – Space and Write Amplification

Columns: App Writes (GB) | Disk Usage (GB) | Space Amplification | Data Bytes Written (per iostat, GB) | Write Amplification (per iostat) | SSD Writes incl. NAND GC (GB)* | Total Write Amplification (incl. NAND GC)

WiredTiger (B-Tree):    ~190 | 223 | 1.17 | 223 | 1.07 | 239 | 1.3
RocksDB (Leveled LSM):  ~190 | 197 | 1.04 | 395 | 1.11 | 438 | 2.3
TokuDB (Fractal-Tree):  ~190 | 196 | 1.03 | 605 | 1.14 | 688 | 3.6

* SSD device writes obtained with nvme-cli from SMART stats for the Intel® P4500.
Space amplification is the amount of disk space consumed relative to the total size of the KV write operations.
Write amplification (per iostat) is the amount of data written relative to the total size of the KV operations.
Write amplification (including NAND GC) is the work done by the storage media relative to the total size of the KV operations.
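The amplification definitions above are plain ratios. Checking them against the RocksDB row of the table (values from the slide):

```python
def space_amplification(disk_usage_gb, app_writes_gb):
    # Disk space consumed relative to the total size of the KV writes.
    return disk_usage_gb / app_writes_gb

def write_amplification(bytes_written_gb, app_writes_gb):
    # Bytes written by the device relative to the total size of the KV writes.
    return bytes_written_gb / app_writes_gb

# RocksDB row: ~190GB of application writes, 197GB on disk,
# 438GB of total device writes including NAND garbage collection.
space_amp = space_amplification(197, 190)        # ~1.04
total_write_amp = write_amplification(438, 190)  # ~2.3
```

The same two ratios reproduce the WiredTiger and TokuDB figures from their rows of the table.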
KV Workload #1: Readwhilewriting, 1 Writer, 16 Readers
ReadWrite: KV Store Read Throughput (Ops/sec): WiredTiger (B-Tree) 6894, RocksDB 2444, TokuDB 1966
ReadWrite: KV Store Write Throughput (Ops/sec): WiredTiger (B-Tree) 652, RocksDB 4580, TokuDB 900
[Chart: 99.99pct Write Latency (s), 0-35s scale, for WiredTiger (B-Tree), RocksDB, and TokuDB]
KV Workload #2: Overwrite, 4 Writers
OverWrite: KV Store Write Throughput (Ops/sec): WiredTiger (B-Tree) 4204, RocksDB 20793, TokuDB 21602
[Chart: 99.99pct Write Latency (s), 0-140s scale, for WiredTiger (B-Tree), RocksDB, and TokuDB]
Outline
- KV Stores – What and Why
- Design Choices and Trade-offs
- Performance on Non-Volatile Media
- Key Takeaways
Key Takeaways
- Key-value stores are important for unstructured data
- The choice of an SSD-backed key-value store is workload-dependent: the performance vs. space / write amplification trade-off is the key decision factor
- Traditional B-Tree implementations: great for read-intensive workloads; better write amplification compared to alternatives; poor space amplification
- LSM and Fractal Trees: well-suited for write-intensive workloads; better space amplification compared to B-Tree
Thank you!
Comments / Questions?
Vishal Verma ([email protected])