Optimizing ForestDB for Flash-Based SSD: Couchbase Connect 2015
OPTIMIZING FORESTDB FOR FLASH-BASED SSD
Sang-Won Lee, Professor, Sungkyunkwan University
Sundar Sridharan, Senior Software Engineer, Couchbase Inc.
©2015 Couchbase Inc.
Contents
▪ Introduction
▪ SHARE Interface in Flash-Based SSD for ForestDB
▪ ForestDB Optimizations at File System Layer
▪ Evaluation Results
▪ Future Work
▪ Summary
Introduction
▪ It is the all-flash storage era!
▪ The hard-disk-era legacy in system software is suboptimal on top of flash storage
▪ ForestDB: the next-generation KV engine of Couchbase
▪ Opportunities
  ▪ Exploit flash storage characteristics (SHARE interface)
  ▪ Leverage modern CoW-based file systems
SHARE Interface in Flash-Based SSD for ForestDB
Characteristics of Flash Storage (vs. Hard Disk)
▪ No-overwrite and the FTL layer
  ▪ Overwrite is not allowed
  ▪ Another layer of address mapping inside flash storage
▪ Limited lifetime
▪ Write time in flash storage ~ amount written
▪ Write time in a hard disk ~ mechanical disk head movement
Copy-on-Write in ForestDB
▪ Document update
  ▪ Copy-on-write, instead of in-place update
Copy-On-Write in ForestDB (2)
▪ Why CoW?
  ▪ 1) Write atomicity and 2) multi-version concurrency control
  ▪ A reasonable solution on HDDs
▪ Problems with CoW on flash storage
  ▪ Tree wandering → write amplification → low performance
  ▪ Shortened flash storage lifetime
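The copy-on-write idea above can be sketched in a few lines. This is a minimal illustrative model, not ForestDB's actual code: updates append a new document version and repoint an in-memory index, so old versions remain on disk for point-in-time snapshot readers.

```python
# Minimal sketch of an append-only, copy-on-write KV store (illustrative,
# not ForestDB's real implementation).

class AppendOnlyKV:
    def __init__(self):
        self.log = []      # the append-only "file": list of (key, value) records
        self.index = {}    # key -> offset of the latest version

    def put(self, key, value):
        # Copy-on-write: never overwrite in place; append a new version.
        self.log.append((key, value))
        self.index[key] = len(self.log) - 1

    def get(self, key):
        return self.log[self.index[key]][1]

    def snapshot(self):
        # A snapshot is just a frozen copy of the index; readers using it
        # see a consistent point-in-time view while the writer keeps appending.
        frozen = dict(self.index)
        return lambda key: self.log[frozen[key]][1]

kv = AppendOnlyKV()
kv.put("doc1", "v1")
snap = kv.snapshot()
kv.put("doc1", "v2")       # appends; does not overwrite "v1"
print(kv.get("doc1"))      # latest version: "v2"
print(snap("doc1"))        # snapshot still sees "v1"
print(len(kv.log))         # both versions live in the file: 2
```

The downside shown by `len(kv.log)` is exactly the slide's point: every update grows the file, which on flash turns into write amplification and wear.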
Opportunities in Flash Storage
▪Address mapping inside flash storage (by FTL)
Opportunities in Flash Storage (2)
▪SHARE interface: explicit address remapping
Opportunities in Flash Storage (3)
▪ ForestDB compaction with SHARE
  ▪ No write of valid documents to the new file
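SHARE is a research firmware extension, so there is no public API to show; the toy model below only illustrates the idea, and every name in it is hypothetical. An FTL maps logical block addresses (LBAs) to physical pages; normal compaction rewrites valid data to new LBAs, while a SHARE-style remap points the new file's LBAs at the existing physical pages, so nothing is rewritten.

```python
# Hypothetical model of the SHARE idea (the real interface is firmware on
# an OpenSSD board; these names are illustrative, not the actual API).

class ToyFTL:
    def __init__(self):
        self.mapping = {}       # LBA -> physical page number
        self.flash_writes = 0   # count of physical page programs

    def write(self, lba, page_data, flash):
        # Out-of-place write: program a fresh physical page, update mapping.
        ppn = len(flash)
        flash.append(page_data)
        self.mapping[lba] = ppn
        self.flash_writes += 1

    def share(self, dst_lba, src_lba):
        # SHARE-style remap: dst LBA points at src's physical page.
        # No flash program happens, so no extra write amplification.
        self.mapping[dst_lba] = self.mapping[src_lba]

    def read(self, lba, flash):
        return flash[self.mapping[lba]]

flash = []                           # the physical pages
ftl = ToyFTL()
ftl.write(0, "valid-doc", flash)     # old file's valid document
ftl.write(1, "stale-doc", flash)     # old file's stale document
writes_before = ftl.flash_writes

# Compaction with SHARE: place the valid doc in the new file (LBA 100)
# by remapping only; the stale page is simply left behind.
ftl.share(100, 0)
print(ftl.read(100, flash))              # -> valid-doc
print(ftl.flash_writes - writes_before)  # -> 0 (nothing was rewritten)
```

This is why the measured compaction below writes ~150 MB instead of ~1.1 GB: valid documents move by remapping, not by copying.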
SHARE Implementation
▪ Firmware extension for SHARE
  ▪ OpenSSD board (http://www.openssd-project.org/)
  ▪ Atomic and recoverable
Performance Evaluation
▪ Normal runtime performance: YCSB Workload F
Performance Evaluation (2)
▪ Compaction performance

                        Elapsed Time (sec)   Written Bytes (MB)
  Original ForestDB           227.5               1126.4
  ForestDB with SHARE          88.4                150.6
ForestDB Optimizations at File System Layer
Overview
▪Motivation – the catch-22
▪Why B-Tree file system (Btrfs)
▪How ForestDB solves the catch-22 using Btrfs
▪Optimizing with Linux Asynchronous library (libaio)
▪Performance Results
Append-Only Key-Value Stores are Great!
▪ Consistency
  ▪ Stable access to multiple point-in-time snapshots of data
▪ Performance with isolation
  ▪ Multi-Version Concurrency Control (MVCC) means readers and writers do not block each other
▪ Recoverability
  ▪ Can easily roll back the entire database to a stable past state
▪ SSD friendly
  ▪ Avoids in-place updates and extra Flash Translation Layer work
Append-Only KV Stores are Great!
MVCC: Readers & Writer Run Unblocked!
But...
▪Disk can fill up with stale data
▪ Need to do garbage collection: compaction
Compactions Do Garbage Collection...
Compactions for Garbage Collection
What if the size of active data exceeds the available free space?
A Fundamental Problem with Disk Space
Writer appends too much data
A Fundamental Problem: Catch-22
“My disk is getting full... I want to free up space but don’t have enough free space to free up space!”
The size of active data must be strictly less than the free space available on disk!
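The constraint behind the catch-22 is simple arithmetic. A minimal sketch with illustrative numbers: classic compaction copies all active data into a fresh file before the old one can be deleted, so it needs at least that much free space; clone-based compaction (introduced below with Btrfs) shares the valid extents and only needs room for new metadata.

```python
# Sketch of the space constraint behind the catch-22 (numbers illustrative).

def classic_compaction_possible(active_bytes, free_bytes):
    # The new file must hold a full copy of the active data while the
    # old file still occupies its space.
    return free_bytes >= active_bytes

def clone_compaction_possible(free_bytes, metadata_bytes):
    # With extent cloning, valid data is shared rather than copied; only
    # new index/metadata blocks need fresh space.
    return free_bytes >= metadata_bytes

disk = 20 * 1024      # 20 GB disk, in MB
active = 12 * 1024    # 12 GB of live documents
stale = 6 * 1024      # 6 GB of garbage awaiting compaction
free = disk - active - stale   # only 2 GB free

print(classic_compaction_possible(active, free))  # False: stuck!
print(clone_compaction_possible(free, 64))        # True
```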
B-Tree File System (Btrfs)
▪Btrfs is a copy-on-write filesystem for Linux
▪ Development began at Oracle in 2007; marked stable since August 2014 (http://goo.gl/upukn4)
▪ Industry support from Facebook, Fujitsu, Fusion-io, Intel, Netgear, Novell/SUSE, Oracle, Red Hat, etc.
▪Available as an option in all major Linux distributions
Btrfs Features (Short List)
▪ Max file size up to 16 exbibytes (1 exbibyte in ext4)
▪ Self-healing due to its copy-on-write nature
▪ Online defragmentation
▪ Online volume growth and shrinking
▪ Online block device addition and removal
▪ Block discards for improved wear levelling on SSDs using TRIM
▪ Transparent compression, configurable per file or volume
▪ Online data scrubbing
▪ Send/receive of diffs
▪ Snapshots and subvolumes
▪File Cloning!
Btrfs Basics - Representation
File P with reference counted extents
Btrfs Feature - Copy File Range
The copy-file-range API lets a new file "Q" share physical disk extents with file "P"
Btrfs Feature - Blocks shared across files
Copy-on-write lets new updates happen on file Q
Btrfs Basics - Deleting File
Deleting file Q
Btrfs Basics - Freeing up space
Freeing up space
ForestDB Compaction Using Btrfs Cloning
Compaction works by using Btrfs copy-on-write cloning to share valid block ranges from the old file into the new file...
ForestDB Compaction Using Btrfs Cloning
Deleting the old file.fdb.0 frees only the space belonging to its stale blocks; the valid blocks shared with file.fdb.1 stay intact!
Performance Results
Ubuntu 14.04, Btrfs v3.12, 4 CPU cores, 20 GB SSD drive, 8 GB DRAM
Performance (1) – ForestDB on Btrfs
~1.25 - 2x faster! ½ the write amplification!
Performance (2) – ForestDB on Btrfs
~1.5 - 4x faster! ½ the write amplification!
Performance (3) – ForestDB on Btrfs
~2x faster! ½ the write amplification!
Speeding up Reads with libaio
▪Modern SSDs have multiple I/O channels
▪Asynchronous I/O maximizes throughput
▪Well suited for ForestDB compaction tasks
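ForestDB's compactor uses Linux libaio (`io_submit`/`io_getevents`) to keep many reads in flight across the SSD's channels. Python has no libaio binding in the stdlib, so the sketch below uses positioned `pread` calls issued from a thread pool as a rough analogue of having multiple reads outstanding at once; it illustrates the access pattern, not the actual libaio code path.

```python
# Analogue of async batched reads: several positioned reads in flight at
# once, as ForestDB's compactor achieves with libaio on Linux.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096

with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    for i in range(8):                 # 8 distinct 4 KiB blocks
        f.write(bytes([i]) * BLOCK)

fd = os.open(path, os.O_RDONLY)

def read_block(i):
    # pread is positioned, so concurrent readers need no shared seek state.
    return os.pread(fd, BLOCK, i * BLOCK)

# Issue the 8 block reads concurrently instead of one at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    blocks = list(pool.map(read_block, range(8)))

os.close(fd)
os.unlink(path)
print(all(blocks[i] == bytes([i]) * BLOCK for i in range(8)))  # -> True
```

The benefit on a real SSD comes from queue depth: with several requests outstanding, the drive can service them from multiple channels in parallel instead of idling between sequential reads.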
Performance (4) ForestDB on Btrfs with libaio
13x faster!
7x faster!
4x faster!
Advantages of Btrfs with libaio
▪ Efficiently uses disk space, avoiding the catch-22
▪ Halves write amplification
  ▪ Longer SSD lifespan due to reduced wear
▪ Over 13x faster compaction
▪ A generic file-system-layer solution that applies to SSDs as well as spinning disks
Future Work
Future Work
▪ Optimize the Btrfs clone feature for better performance
  ▪ Working with the Linux Btrfs community
▪ Optimize ForestDB to skip reading when cloning during compaction
▪ Adapt the ext4 file system to add a new system call that allows sharing physical blocks among multiple files
Summary
Summary
▪ ForestDB with the SHARE interface in the SSD
  ▪ Speeds up compaction by 3x with 10x lower write amplification
▪ ForestDB with the Btrfs clone feature at the file system layer
  ▪ Speeds up compaction by 2x with 2x lower write amplification
▪ ForestDB with the Btrfs clone feature plus Linux libaio
  ▪ Speeds up compaction by 13x with 2x lower write amplification
Initial Load Performance
3x ~ 6x less time
Initial Load Performance
4x less write overhead
Read-Only Performance
[Chart: throughput (operations per second, 0-30000) vs. number of reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB. ForestDB is 2x ~ 5x faster.]
Write-Only Performance
[Chart: throughput (operations per second, 0-12000) vs. write batch size in documents (1, 4, 16, 64, 256) for ForestDB, LevelDB, and RocksDB. ForestDB is 3x ~ 5x faster. Note: a small batch size (e.g., < 10) is uncommon in practice.]
Write-Only Performance
[Chart: write amplification (normalized to a single document size, 0-450) vs. write batch size in documents (1, 4, 16, 64, 256) for ForestDB, LevelDB, and RocksDB. ForestDB shows 4x ~ 20x less write amplification.]
Mixed Workload Performance
[Chart: mixed (unrestricted) workload throughput (operations per second, 0-12000) vs. number of reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB. ForestDB is 2x ~ 5x faster.]