Workload Analysis of Key-Value Stores on Non-Volatile Media
2017 Storage Developer Conference
Vishal Verma (Performance Engineer, Intel)
Tushar Gohad (Cloud Software Architect, Intel)
Outline
- KV Stores – What and Why
- Data Structures for KV Stores
- Design Choices and Trade-offs
- Performance on Non-Volatile Media
- Key Takeaways
What are KV Stores
A type of NoSQL database that stores data as simple key/value pairs.
An alternative to the limitations of traditional relational databases (DBs):
- Data is structured and the schema pre-defined – a mismatch with today's workloads
- Data growth is large and unstructured
- Lots of random writes and reads
NoSQL brings flexibility, as the application has complete control over what is stored inside the value.
What are KV Stores
The key in a key-value pair must (or at least, should) be unique.
Values are identified via a key; stored values can be numbers, strings, images, videos, etc.
API operations: get(key) to read data, put(key, value) to write data, and delete(key) to delete keys.
Phone directory example:

Key      Value
Bob      (123) 456-7890
Kyle     (245) 675-8888
Richard  (787) 122-2212
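The get/put/delete API above can be sketched with an in-memory dict. A minimal illustration (the class and method names are hypothetical, not from any particular KV store):

```python
# Minimal key-value store sketch: get/put/delete over an in-memory dict.

class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Keys are unique; put overwrites any existing value for the key.
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

# The phone-directory example from the slide:
store = KVStore()
store.put("Bob", "(123) 456-7890")
store.put("Kyle", "(245) 675-8888")
store.put("Richard", "(787) 122-2212")
```

Note that the application decides what goes in the value: a phone number here, but equally a serialized image or video blob.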
KV Stores: Benefits
- High performance: enables fast location of an object, rather than searching through columns or tables to find an object as in a traditional relational DB
- Highly scalable: can scale over several machines or devices by several orders of magnitude, without the need for significant redesign
- Flexible: no structure is enforced on the data
- Low TCO: simplifies the operations of adding or removing capacity as needed; a hardware or network failure does not create downtime
Outline
- KV Stores – What and Why
- Design Choices and Trade-offs
- Performance on Non-Volatile Media
- Key Takeaways
Design Choice: B-Tree
[Diagram: B-Tree with internal nodes (green) holding pivot keys and leaf nodes (blue) holding data records (KV)]
- Query time proportional to tree height – logarithmic time
- Insertion / deletion: many random I/Os to disk
- May incur rebalance – read and write amplification
- Full leaf nodes may be split – space fragmentation
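The height-proportional lookup described above can be sketched as a descent over sorted pivot keys. This is a simplified, illustrative structure; in a real B-Tree each node is an on-disk block, which is where the random I/Os come from:

```python
import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted pivots (internal) or record keys (leaf)
        self.children = children    # None for leaf nodes
        self.values = values        # data records, leaves only

def search(node, key):
    # Descends root-to-leaf, so cost is proportional to tree height.
    if node.children is None:                    # leaf: binary-search the records
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]
        return None
    i = bisect.bisect_right(node.keys, key)      # pivot keys select the child
    return search(node.children[i], key)

# Tiny two-level tree: pivots [13, 37]; keys < 13 left, 13..36 middle, >= 37 right.
leaves = [
    Node([2, 3, 5, 7, 11], values=["r2", "r3", "r5", "r7", "r11"]),
    Node([13, 17, 19, 23, 29, 31], values=["r13", "r17", "r19", "r23", "r29", "r31"]),
    Node([37, 41, 43, 47, 53], values=["r37", "r41", "r43", "r47", "r53"]),
]
root = Node([13, 37], children=leaves)
```

With a fan-out of hundreds of keys per block, even billion-row trees stay only a few levels deep, which is why query time is logarithmic.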
B-Tree: Trade-offs
- Example DB engines: BerkeleyDB, MySQL InnoDB, MongoDB WiredTiger
- Good design choice for read-intensive workloads
- Achieves read performance at the cost of increased read / write / space amplification
- Nodes of B-Trees are on-disk blocks – aligned IO means space amplification
- Compression is not effective: with a 16KB block size, 10KB of data compressed to 5KB will still occupy 16KB
- Byte-sized updates also end up as page-sized reads / writes
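The compression point is simple arithmetic: with block-aligned I/O, on-disk usage rounds up to a whole block. A sketch assuming the slide's 16KB block size:

```python
import math

BLOCK_SIZE = 16 * 1024          # 16KB on-disk B-Tree block, as on the slide

def on_disk_bytes(compressed_bytes, block_size=BLOCK_SIZE):
    # Block-aligned I/O rounds every write up to a whole block.
    return math.ceil(compressed_bytes / block_size) * block_size

# 10KB of data compressed to 5KB still occupies a full 16KB block:
occupied = on_disk_bytes(5 * 1024)   # 16384 bytes
```

So compressing 10KB to 5KB saves nothing on disk here; the savings only appear once compressed data is packed across block boundaries, as log-structured stores do.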
Performance Test Configuration
System Config:
- CPU: Intel(R) Xeon(R) CPU E5-2618L v4, 2.20GHz, HT disabled, 20 cores
- Memory: 64GB
Non-Volatile Storage:
- Intel® P4500 SSD (4TB)
OS Config:
- Distro: Ubuntu 16.04.1
- Kernel: 4.12.4
- Arch: x86_64
- Linux kernel tuning parameters applied
- Drop page cache after every test run
- XFS filesystem, agcount=32, mounted with discard
Key-Value Stores:
- WiredTiger 3.0.0
- RocksDB 5.4
- TokuDB 4.6.119
Benchmark:
- Dataset: 500GB – 1TB (500 million – 1 billion records); dataset-to-DRAM ratio > 3:1; compression off
- Key_size: 16 bytes, Value_size: 1000 bytes, Cache size: 32GB
- db_bench source: https://github.com/wiredtiger/leveldb.git, test duration: 30 minutes
- Workloads: ReadWrite (16 Readers, 1 Writer)
Readwrite: WiredTiger B-Tree, 500 million rows
Ops completed: 1 write thread 2.15 million, 16 read threads 27.36 million. Read plot represents a single reader.
[Plot: WiredTiger B-Tree Writes (Ops/sec, log scale) vs row count, 5000 ops per marker, 30 min run]
[Plot: WiredTiger B-Tree Reads (Ops/sec, log scale) vs row count, 5000 ops per marker, 30 min run]
Readwrite: WiredTiger B-Tree, 500 million rows
Test: single writer, 16 readers. Read plot represents single-reader latency.
[Plot: WiredTiger B-Tree Read Latency (sec/op, log scale) vs row count, 5000 ops per marker, 30 min run]
[Plot: WiredTiger B-Tree Write Latency (sec/op, log scale) vs row count, 5000 ops per marker, 30 min run]
Design Choice: LSM Tree
Log-Structured Merge-tree: two or more tree-like components
- In-memory tree (RocksDB: memtable)
- One or more trees on the persistent store (RocksDB: SST files)
Transforms random writes into a few sequential writes:
- Write-Ahead Log (WAL) – append-only journal
- In-memory store – inexpensive writes to memory as the first-level write, flushed sequentially to the first level in the persistent store
- Compaction – merge-sort-like background operation: a few sequential writes (better than the random IOs in the B-Tree case); trims duplicate data – minimal space amplification
Example DB engines: RocksDB, WiredTiger, Cassandra, CouchBase, LevelDB
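The memtable / flush / compaction flow can be sketched in a few lines. This is a toy model, not RocksDB's actual implementation; plain dicts stand in for the sorted in-memory tree and the SST files, and the WAL is omitted:

```python
# Toy LSM sketch: an in-memory memtable that flushes to immutable sorted
# runs; reads check the memtable first, then runs from newest to oldest.

class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # C0: in-memory tree
        self.runs = []                # C1..Cn: sorted runs on "disk", newest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # inexpensive first-level write
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # Sequential flush: sort the memtable and write it out as one run.
        run = sorted(self.memtable.items())
        self.runs.insert(0, dict(run))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:         # may scan every level: read amplification
            if key in run:
                return run[key]
        return None

    def compact(self):
        # Background merge: combine runs, dropping stale duplicate versions.
        merged = {}
        for run in reversed(self.runs):   # oldest first, so newer versions win
            merged.update(run)
        self.runs = [merged]
```

Note how `get` may touch every run before giving up: that is the read amplification the next slide discusses, and why duplicate trimming in `compact` keeps space amplification minimal.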
LSM Trees – Operation
[Diagram: N-level merge tree – the C0 in-memory tree merges into C1 … Cn trees on the persistent store via multi-page block merges]
- Transforms random writes into sequential writes using the WAL and in-memory tables
- Optimized for insertions by buffering
- Key-value items in the store are sorted to support faster lookups
Design Choice: LSM Tree
Writes – append-only constructs:
- No read-modify-write, no double write
- Reduced fragmentation, but sorts / merges to multiple levels cause write amplification
Reads – expensive point, order-by, and range queries:
- Might call for compaction
- Might need to scan all levels – read amplification
- Can be optimized with Bloom Filters (RocksDB)
Deletes – deferred via tombstones:
- Tombstones are scanned during queries
- Tombstones don't disappear until compaction
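The Bloom-filter optimization mentioned for reads works because a filter can answer "definitely absent" without touching the level on disk, so a point read only pays I/O for levels that might hold the key. A minimal sketch (sizes and hash scheme are illustrative, not RocksDB's actual filter):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may return false positives, never false negatives."""

    def __init__(self, nbits=1024, nhashes=3):
        self.bits = bytearray(nbits)
        self.nbits = nbits
        self.nhashes = nhashes

    def _positions(self, key):
        # Derive nhashes bit positions from salted hashes of the key.
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # All bits set -> key *might* be present; any bit clear -> definitely absent.
        return all(self.bits[p] for p in self._positions(key))
```

In an LSM read path, each on-disk level keeps a filter over its keys; a lookup probes the filter first and skips the level on a negative answer, cutting read amplification for keys that don't exist.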
LSM Trees: Trade-offs
Read / write / space amplification summary:
- Read amp: 1 up to the number of levels
- Write amp: 1 + 1 + fan-out
Most useful in applications with:
- Tiered storage with varying price / performance points (RAM, Persistent Memory, NVMe SSDs, etc.)
- Large datasets where inserts / deletes are more common than reads / searches
Better suited for write-intensive workloads.
Better compression: page/block alignment overhead is small compared to the size of the persistent trees (SST files in RocksDB).
Leveled LSMs have lower write and space amplification compared to B-Trees.
LSM Tree Example: RocksDB, 1 billion rows
Ops completed: 1 write thread 5.2 million, 16 read threads 6.3 million. Read plot represents a single reader.
[Plot: RocksDB LSM Write Throughput (Ops/sec, log scale) vs row count, 5000 ops per marker, 30 min run]
[Plot: RocksDB LSM Read Throughput (Ops/sec, log scale) vs row count, 5000 ops per marker, 30 min run]
LSM Tree Example: RocksDB, 1 billion rows
Test: single writer, 16 readers. Read plot represents single-reader latency.
[Plot: RocksDB LSM Read Latency (sec/op, log scale) vs row count, 5000 ops per marker, 30 min run]
[Plot: RocksDB LSM Write Latency (sec/op, log scale) vs row count, 5000 ops per marker, 30 min run]
Design Choice: Fractal Trees
- Merges features from B-Trees and LSM trees
- A Fractal Tree index has buffers at each node, which allow insertions, deletions, and other changes to be stored in intermediate locations
- Fast writes, slower reads and updates
- Better than LSM for range reads on a cold cache, but the same on a warm cache
- Commercialized in DBs by Percona (Tokutek)
Fractal Trees
[Diagram: B-Tree-like structure in which each internal node contains a buffer (shaded red); Insert(31, …) moves buffered records 41, 43 into the next level]
- Maintains a B-Tree in which each internal node contains a buffer
- To insert a data record, simply insert it into the buffer at the root of the tree, vs. traversing the entire tree (B-Tree)
- When the root buffer fills, inserted records move down one level until they reach the leaves and get stored in a leaf node, like a B-Tree
- Several leaf nodes are grouped to create sequential IO (~1-4MB)
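The buffered-insert behavior can be sketched with a two-way node carrying a message buffer. This is a toy model; TokuDB's real nodes have much wider fan-out and the buffers hold typed insert/delete/update messages:

```python
# Toy fractal-tree node: inserts land in the root buffer and are pushed
# down a level in batches only when the buffer overflows.

class FTNode:
    def __init__(self, pivot=None, children=None, capacity=4):
        self.buffer = {}                  # pending records (key -> value)
        self.pivot = pivot                # single pivot for this 2-way sketch
        self.children = children or []    # empty list means leaf
        self.capacity = capacity

    def insert(self, key, value):
        # Cheap on the happy path: just drop the record in this node's buffer.
        self.buffer[key] = value
        if len(self.buffer) > self.capacity and self.children:
            self.flush()

    def flush(self):
        # Move buffered records one level down, routed by the pivot key:
        # this batches many small writes into fewer, larger ones.
        for key, value in self.buffer.items():
            child = self.children[0] if key < self.pivot else self.children[1]
            child.insert(key, value)
        self.buffer = {}

    def get(self, key):
        if key in self.buffer:            # a pending record shadows the leaf
            return self.buffer[key]
        if not self.children:
            return None
        child = self.children[0] if key < self.pivot else self.children[1]
        return child.get(key)
```

Reads must check the buffer at every node on the root-to-leaf path, which is why fractal trees trade some read latency for their fast writes.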
Fractal Trees: Trade-offs
Read / write / space amplification summary:
- Write amplification similar to leveled LSM (RocksDB); better than B-Tree and LSM (in theory)
Most useful in applications with:
- Tiered storage with varying price / performance points (RAM, Persistent Memory, NVMe SSDs, etc.)
- Large datasets where inserts / deletes are more common than reads / searches
Better suited for write-intensive workloads.
Lower write and space amplification compared to B-Trees.
Fractal Tree Example: TokuDB, 1 billion rows
Test: single writer, 16 readers. Read plot represents a single reader.
[Plot: TokuDB Fractal Write Throughput (Ops/sec, log scale) vs row count, 5000 ops per marker, 30 min run]
[Plot: TokuDB Fractal Read Throughput (Ops/sec, log scale) vs row count, 5000 ops per marker, 30 min run]
Fractal Tree Example: TokuDB, 1 billion rows
Test: single writer, 16 readers. Read plot represents single-reader latency.
[Plot: TokuDB Fractal Write Latency (micros/op, log scale) vs row count, 5000 ops per marker, 30 min run]
[Plot: TokuDB Fractal Read Latency (micros/op, log scale) vs row count, 5000 ops per marker, 30 min run]
Outline
- KV Stores – What and Why
- Design Choices and Trade-offs
- Performance Comparisons
- Key Takeaways
Performance Test Configuration
System Config:
- CPU: Intel(R) Xeon(R) CPU E5-2618L v4, 2.20GHz, HT disabled, 20 cores
- Memory: 64GB
Non-Volatile Storage:
- Intel® P4500 SSD (4TB)
OS Config:
- Distro: Ubuntu 16.04.1
- Kernel: 4.12.4
- Arch: x86_64
- Linux kernel tuning parameters applied
- Drop page cache after every test run
- XFS filesystem, agcount=32, mounted with discard
Key-Value Stores:
- WiredTiger 3.0.0
- RocksDB 5.4
- TokuDB 4.6.119
Benchmark:
- Dataset: 200GB (200 million records); dataset-to-DRAM ratio > 3:1; compression off
- Key_size: 16 bytes, Value_size: 1000 bytes, Cache size: 32GB
- db_bench source: https://github.com/wiredtiger/leveldb.git, test duration: 30 minutes
- Workloads: ReadWrite (16 Readers, 1 Writer), Overwrite (4 Writers)
KV Workload #0: Fill (Insert), 1 Writer – Space and Write Amplification

Columns: App Writes (GB) | Disk Usage (GB) | Space Amplification | Data Bytes Written (per iostat, GB) | Write Amplification (per iostat) | SSD Writes incl. NAND GC (GB)* | Total Write Amplification (incl. NAND GC)

WiredTiger (B-Tree):    ~190 | 223 | 1.17 | 223 | 1.07 | 239 | 1.3
RocksDB (Leveled LSM):  ~190 | 197 | 1.04 | 395 | 1.11 | 438 | 2.3
TokuDB (Fractal-Tree):  ~190 | 196 | 1.03 | 605 | 1.14 | 688 | 3.6

* SSD device writes obtained with nvme-cli from SMART stats for the Intel® P4500.
Space amplification is the amount of disk space consumed relative to the total size of the KV write operations.
Write amplification (per iostat) is the amount of data written relative to the total size of the KV operations.
Write amplification (including NAND GC) is the work done by the storage media relative to the total size of the KV operations.
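The amplification definitions above are plain ratios. Checking them against the RocksDB row of the table (values from the slide):

```python
def space_amplification(disk_usage_gb, app_writes_gb):
    # Disk space consumed relative to the total size of the KV writes.
    return disk_usage_gb / app_writes_gb

def write_amplification(bytes_written_gb, app_writes_gb):
    # Bytes written by the device relative to the total size of the KV writes.
    return bytes_written_gb / app_writes_gb

# RocksDB row: ~190GB of application writes, 197GB on disk,
# 438GB of total device writes including NAND garbage collection.
space_amp = space_amplification(197, 190)        # ~1.04
total_write_amp = write_amplification(438, 190)  # ~2.3
```

The same two ratios reproduce the WiredTiger and TokuDB figures from their rows of the table.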
KV Workload #1: Readwhilewriting, 1 Writer, 16 Readers
ReadWrite: KV Store Read Throughput (Ops/sec): WiredTiger (B-Tree) 6894, RocksDB 2444, TokuDB 1966
ReadWrite: KV Store Write Throughput (Ops/sec): WiredTiger (B-Tree) 652, RocksDB 4580, TokuDB 900
[Chart: 99.99pct Write Latency (s), 0-35s scale, for WiredTiger (B-Tree), RocksDB, and TokuDB]
KV Workload #2: Overwrite, 4 Writers
OverWrite: KV Store Write Throughput (Ops/sec): WiredTiger (B-Tree) 4204, RocksDB 20793, TokuDB 21602
[Chart: 99.99pct Write Latency (s), 0-140s scale, for WiredTiger (B-Tree), RocksDB, and TokuDB]
Outline
- KV Stores – What and Why
- Design Choices and Trade-offs
- Performance on Non-Volatile Media
- Key Takeaways
Key Takeaways
- Key-value stores are important for unstructured data
- The choice of an SSD-backed key-value store is workload-dependent: the performance vs. space / write amplification trade-off is the key decision factor
- Traditional B-Tree implementations: great for read-intensive workloads; better write amplification compared to alternatives; poor space amplification
- LSM and Fractal Trees: well-suited for write-intensive workloads; better space amplification compared to B-Tree
Thank you!
Comments / Questions?
Vishal Verma ([email protected])