Ceph Day Beijing - SPDK for Ceph

Ziye Yang, Senior Software Engineer

Transcript of Ceph Day Beijing - SPDK for Ceph

Page 1: Ceph Day Beijing - SPDK for Ceph

Ziye Yang, Senior Software Engineer

Page 2: Ceph Day Beijing - SPDK for Ceph

Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

© 2017 Intel Corporation.

Page 3: Ceph Day Beijing - SPDK for Ceph

• SPDK introduction and status update

• Current SPDK support in BlueStore

• Case study: Accelerate iSCSI service exported by Ceph

• SPDK support for Ceph in 2017

• Summary

Page 4: Ceph Day Beijing - SPDK for Ceph
Page 5: Ceph Day Beijing - SPDK for Ceph

The Problem: Software is becoming the bottleneck

The Opportunity: Use Intel software ingredients to unlock the potential of new media

[Figure: I/O performance and latency by media type]

HDD: <500 IO/s, >2 ms latency

SATA NAND SSD: >25,000 IO/s, <100 µs latency

NVMe* NAND SSD: >400,000 IO/s, <100 µs latency

Intel® Optane™ SSD

Page 6: Ceph Day Beijing - SPDK for Ceph

Storage Performance Development Kit

Scalable and Efficient Software Ingredients

• User space, lockless, polled-mode components

• Up to millions of IOPS per core

• Designed for Intel Optane™ technology latencies

Intel® Platform Storage Reference Architecture

• Optimized for Intel platform characteristics

• Open source building blocks (BSD licensed)

• Available via spdk.io

Page 7: Ceph Day Beijing - SPDK for Ceph

Architecture

[Diagram: SPDK architecture]

Storage Protocols: iSCSI Target, NVMe-oF* Target, vhost-scsi Target, vhost-blk Target, SCSI, Object

Storage Services: Block Device Abstraction (BDEV), Blobstore, BlobFS; bdev modules: NVMe, Ceph RBD, Linux Async IO, Blob bdev, 3rd party

Drivers: NVMe* PCIe driver (NVMe devices), NVMe-oF* Initiator, Intel® QuickData Technology driver

Integration: RocksDB, Ceph

Core: Application framework

(Status legend in the original diagram: Released, Q2’17, Pathfinding)

Page 8: Ceph Day Beijing - SPDK for Ceph

Benefits of using SPDK

SPDK delivers more performance from Intel CPUs, non-volatile media, and networking:

• Up to 10X more IOPS/core for NVMe-oF* vs. the Linux kernel

• Up to 8X more IOPS/core for NVMe vs. the Linux kernel

• Up to 350% better tail latency for RocksDB workloads

• Faster TTM / fewer resources than developing components from scratch

• Future proofing as NVM technologies increase in performance

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

Page 9: Ceph Day Beijing - SPDK for Ceph

SPDK Updates: 17.03 Release (Mar 2017)

New components (broader set of use cases for SPDK libraries & ingredients):

• Blobstore: block allocator for applications; variable granularity, defaults to 4KB

• BlobFS: lightweight, non-POSIX filesystem; page caching & prefetch; initially limited to DB file semantic requirements (e.g. file name and size)

• RocksDB SPDK environment: implements a RocksDB Env using BlobFS

• QEMU vhost-scsi target: simplified I/O path to local QEMU guest VMs with unmodified apps

Existing components (feature and hardening improvements):

• NVMe over Fabrics improvements: read latency improvement, NVMe-oF host (initiator) zero-copy, discovery code simplification, quality, performance & hardening fixes

Page 10: Ceph Day Beijing - SPDK for Ceph

Current status

Fully realizing new media performance requires software optimizations.

SPDK is positioned to enable developers to realize this performance.

SPDK is available today at http://spdk.io

Help us build SPDK as an open source community!

Page 11: Ceph Day Beijing - SPDK for Ceph
Page 12: Ceph Day Beijing - SPDK for Ceph

Current SPDK support in BlueStore

New features

Support for multiple threads issuing I/O to NVMe SSDs via the SPDK user-space NVMe driver

Support for running SPDK I/O threads on designated CPU cores specified in the configuration file (see the ceph.conf sketch below)

Upgrades in Ceph (currently at 17.03)

Upgraded SPDK to 16.11 in December 2016

Upgraded SPDK to 17.03 in April 2017

Stability

Fixed several compilation issues and runtime bugs encountered while using SPDK

In total, 16 SPDK-related patches have been merged into BlueStore (mainly in the NVMEDEVICE module)
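For reference, BlueStore's SPDK backend is selected and tuned through ceph.conf. Below is a minimal sketch, assuming the Luminous-era option names (bluestore_block_path with an spdk: prefix plus the bluestore_spdk_* tunables); check the options shipped with your Ceph version before relying on them.

```ini
# ceph.conf sketch (OSD section). Option names assumed from the Luminous-era
# BlueStore/SPDK integration; verify against the Ceph version in use.
[osd]
# Bind BlueStore's data device to the SPDK user-space NVMe driver.
# The value after "spdk:" selects the NVMe SSD; the serial number is a placeholder.
bluestore_block_path = spdk:55cd2e404bd73932
# Hugepage memory (in MB) reserved for the SPDK environment.
bluestore_spdk_mem = 1024
# Hex core mask: run the SPDK I/O threads on designated CPU cores.
bluestore_spdk_coremask = 0x3
```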

Page 13: Ceph Day Beijing - SPDK for Ceph

(From iStaury’s talk in SPDK PRC meetup 2016)

Page 14: Ceph Day Beijing - SPDK for Ceph

Block service exported by Ceph via iSCSI protocol

Cloud service providers that provision VM services can use iSCSI.

If Ceph can export a block service with good performance, it becomes easy to glue such providers to a Ceph cluster solution.

[Diagram: client (APP, multipath dm-1 over sdx/sdy, iSCSI initiator) connects to two iSCSI gateway nodes (iSCSI target + RBD), which in turn connect to the Ceph cluster OSDs]

Page 15: Ceph Day Beijing - SPDK for Ceph

iSCSI + RBD Gateway

Ceph server

CPU: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz

Four Intel P3700 SSDs, one OSD on each SSD (4 OSDs total)

4 pools with PG number 512, one 10G image per pool

iSCSI target server (librbd + SPDK iSCSI target / librbd + TGT)

CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

Only one core enabled

iSCSI initiator

CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

[Diagram: iSCSI initiator connects to the iSCSI target server (iSCSI target + librbd), which connects to the Ceph server (OSD0, OSD1, OSD2, OSD3)]
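The IOPS figures on the next two slides come from fio runs against the iSCSI-attached block devices. A representative job file for the 4K random-read case might look like the sketch below; the device path, queue depth, and runtime are assumptions, not the exact settings used in these tests.

```ini
; Illustrative fio job for the 4K random-read rows (not the exact job file
; used for these slides; device path, queue depth and runtime are assumptions).
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=32
numjobs=1
runtime=300
time_based=1
group_reporting=1

[randread-img1]
; iSCSI-attached RBD image as seen on the initiator (placeholder device name)
filename=/dev/sdb
; use rw=randwrite for the 4k_randwrite rows
rw=randread
```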

Page 16: Ceph Day Beijing - SPDK for Ceph

iSCSI + RBD Gateway

One CPU core (IOPS):

iSCSI target + op             | 1 FIO + 1 img | 2 FIO + 2 img | 3 FIO + 3 img | SPDK iSCSI tgt / TGT ratio
TGT + 4k_randread             | 10K           | 20K           | 20K           |
SPDK iSCSI tgt + 4k_randread  | 20K           | 24K           | 28K           | 140%
TGT + 4k_randwrite            | 6.5K          | 9.5K          | 18K           |
SPDK iSCSI tgt + 4k_randwrite | 14K           | 19K           | 24K           | 133%

Page 17: Ceph Day Beijing - SPDK for Ceph

iSCSI + RBD Gateway

Two CPU cores (IOPS):

iSCSI target + op             | 1 FIO + 1 img | 2 FIO + 2 img | 3 FIO + 3 img | 4 FIO + 4 img | SPDK iSCSI tgt / TGT ratio
TGT + 4k_randread             | 12K           | 24K           | 26K           | 26K           |
SPDK iSCSI tgt + 4k_randread  | 37K           | 47K           | 47K           | 47K           | 181%
TGT + 4k_randwrite            | 9.5K          | 13.5K         | 19K           | 22K           |
SPDK iSCSI tgt + 4k_randwrite | 16K           | 24K           | 25K           | 27K           | 123%

Page 18: Ceph Day Beijing - SPDK for Ceph

Reading Comparison

[Chart: 4K_randread IOPS (K) by number of streams (1-3). One core: TGT 10/20/20, SPDK-iSCSI 20/24/28. Two cores: TGT 12/24/26, SPDK-iSCSI 37/47/47]

Page 19: Ceph Day Beijing - SPDK for Ceph

Writing Comparison

[Chart: 4K_randwrite IOPS (K) by number of streams (1-4). One core: TGT 6.5/9.5/18, SPDK-iSCSI 14/19/24. Two cores: TGT 9.5/13.5/19/22, SPDK-iSCSI 16/24/25/27]

Page 20: Ceph Day Beijing - SPDK for Ceph
Page 21: Ceph Day Beijing - SPDK for Ceph

SPDK support for Ceph in 2017

To make SPDK truly useful in Ceph, we will continue the following work with partners:

Continue stability maintenance

– Version upgrades, fixes for compilation and runtime bugs.

Performance enhancement

– Continue optimizing the NVMEDEVICE module according to customer and partner feedback.

New feature development

– Pick up common requirements and feedback from the community and upstream those features into the NVMEDEVICE module.

Page 22: Ceph Day Beijing - SPDK for Ceph

Proposals/opportunities for better leveraging SPDK

Multiple OSDs on the same NVMe device by using SPDK

Leverage the multi-process feature of SPDK's user-space NVMe driver (a minimal sketch follows below).

Risk: same as with the kernel driver, i.e., all OSDs on the device fail if the device fails.
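A minimal sketch of the multi-process idea, assuming direct use of the SPDK env/NVMe APIs rather than BlueStore's NVMEDEVICE module: each OSD-like process initializes the SPDK environment with the same shm_id so the user-space NVMe driver state is shared across processes. The process name and core mask below are placeholders.

```c
/* Minimal sketch (not BlueStore code): two or more OSD-like processes sharing
 * one NVMe SSD through the SPDK multi-process feature. Every process passes
 * the same shm_id so the user-space NVMe driver state is shared. */
#include "spdk/stdinc.h"
#include "spdk/env.h"
#include "spdk/nvme.h"

static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
         struct spdk_nvme_ctrlr_opts *opts)
{
        return true; /* attach to every NVMe controller that is found */
}

static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
          struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
        /* Hand the controller to this process's I/O path (one OSD per process). */
}

int
main(int argc, char **argv)
{
        struct spdk_env_opts opts;

        spdk_env_opts_init(&opts);
        opts.name = "osd_proc";   /* placeholder process name */
        opts.shm_id = 1;          /* same shm_id in every process sharing the SSD */
        opts.core_mask = "0x1";   /* pin this process's SPDK threads */
        spdk_env_init(&opts);

        /* Enumerate and attach local NVMe controllers through the shared driver. */
        spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL);

        return 0;
}
```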

Enhance cache support in NVMEDEVICE by using SPDK

A better cache/buffer strategy is needed to improve read/write performance.

Optimize RocksDB usage in BlueStore with SPDK's BlobFS/Blobstore

Make RocksDB use SPDK's BlobFS/Blobstore instead of a kernel file system for metadata management.

Page 23: Ceph Day Beijing - SPDK for Ceph

Leverage SPDK to accelerate the block service exported by Ceph

Optimization in front of Ceph

Use an optimized block service daemon, e.g., the SPDK iSCSI target or NVMe-oF target (see the RPC sketch below).

Introduce a cache policy in the block service daemon.

Store optimization inside Ceph

Use SPDK's user-space NVMe driver instead of the kernel NVMe driver (already available).

Possibly replace “BlueRocksEnv + BlueFS” with “SPDK BlobfsEnv + BlobFS/Blobstore”.
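As a rough illustration of the "optimization in front of Ceph" path, SPDK already ships an RBD bdev module that connects through librbd/librados; with a current SPDK tree the bdev can be created over JSON-RPC and then exported by the iSCSI or NVMe-oF target. The pool and image names below are placeholders, and the RPC names in 2017-era releases were different.

```sh
# Sketch only: create an SPDK block device backed by a Ceph RBD image through
# librbd/librados. Pool/image names are placeholders; older SPDK releases used
# different RPC names (e.g. construct_rbd_bdev), so match your SPDK version.
scripts/rpc.py bdev_rbd_create rbd vm-image-01 4096

# The resulting bdev can then be attached as an iSCSI target LUN or an
# NVMe-oF namespace with the corresponding iscsi_*/nvmf_* RPCs.
```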

Page 24: Ceph Day Beijing - SPDK for Ceph

[Diagram: Accelerate the block service exported by Ceph via SPDK]

In front of Ceph: an SPDK-optimized iSCSI target and NVMe-oF target export the block service through an SPDK Ceph RBD bdev module (leveraging librbd/librados) and an SPDK cache module. Some of these are existing SPDK apps/modules; the optimized modules are still to be developed (TBD in the SPDK roadmap).

Inside Ceph: among the existing store backends (FileStore, KVStore, BlueStore), BlueStore today manages metadata via RocksDB over BlueRocksEnv + BlueFS on the kernel or SPDK NVMe driver. The proposed path runs RocksDB over SPDK BlobfsEnv + SPDK BlobFS/Blobstore on the SPDK NVMe driver, possibly even replacing RocksDB.

Page 25: Ceph Day Beijing - SPDK for Ceph
Page 26: Ceph Day Beijing - SPDK for Ceph

Summary

SPDK proves useful for exploiting the capabilities of fast storage devices (e.g., NVMe SSDs).

However, significant development work is still needed to make SPDK useful for BlueStore at production quality.

Call to action:

Contribute code to the SPDK community.

Leverage SPDK for Ceph optimization; you are welcome to contact the SPDK dev team for help and collaboration.

Page 27: Ceph Day Beijing - SPDK for Ceph
Page 28: Ceph Day Beijing - SPDK for Ceph


Page 29: Ceph Day Beijing - SPDK for Ceph
Page 30: Ceph Day Beijing - SPDK for Ceph

Vhost-scsi Performance

SPDK provides 1 million IOPS with 1 core and 8x VM performance vs. the kernel!

Feature                                  | Realized benefit
High performance storage virtualization  | Increased VM density
Reduced VM exits                         | Reduced tail latencies

System Configuration: Target system: 2x Intel® Xeon® E5-2695v4 (HT off), Intel® Speed Step enabled, Intel® Turbo Boost Technology enabled, 8x 8GB DDR4 2133 MT/s, 1 DIMM per channel, 8x Intel® P3700 NVMe SSD (800GB), 4x per CPU socket, FW 8DV10102, Network: Mellanox* ConnectX-4 100Gb RDMA, direct connection between initiator and target; Initiator OS: CentOS* Linux* 7.2, Linux kernel 4.7.0-rc2, Target OS (SPDK): CentOS Linux 7.2, Linux kernel 3.10.0-327.el7.x86_64, Target OS (Linux kernel): CentOS Linux 7.2, Linux kernel 4.7.0-rc2 Performance as measured by: fio, 4KB Random Read I/O, 2 RDMA QP per remote SSD, Numjobs=4 per SSD, Queue Depth: 32/job

[Chart: VM cores vs. I/O processing cores for QEMU virtio-scsi, kernel vhost-scsi, and SPDK vhost-scsi]

[Chart: I/Os handled per I/O processing core for QEMU virtio-scsi, kernel vhost-scsi, and SPDK vhost-scsi]

Page 31: Ceph Day Beijing - SPDK for Ceph

Alibaba* Cloud ECS Case Study: Write Performance

Source: http://mt.sohu.com/20170228/n481925423.shtml

* Other names and brands may be claimed as the property of others

Ali Cloud sees 300% improvement in IOPS and latency using SPDK

[Chart: Random Write latency (µs) vs. queue depth (1-32): General Virtualization Infrastructure vs. Ali Cloud High-Performance Storage Infrastructure with SPDK]

[Chart: Random Write 4K IOPS vs. queue depth (1-32): General Virtualization Infrastructure vs. Ali Cloud High-Performance Storage Infrastructure with SPDK]

Page 32: Ceph Day Beijing - SPDK for Ceph

Alibaba* Cloud ECS Case Study: MySQL Sysbench

Source: http://mt.sohu.com/20170228/n481925423.shtml

* Other names and brands may be claimed as the property of others

Sysbench Update sees 4.6X QPS at 10% of the latency!

[Chart: MySQL Sysbench latency (ms) for Select and Update: General Virtualization Infrastructure vs. High Performance Virtualization with SPDK]

[Chart: MySQL Sysbench TPS/QPS for Select and Update: General Virtualization Infrastructure vs. High Performance Virtualization with SPDK]

Page 33: Ceph Day Beijing - SPDK for Ceph
Page 34: Ceph Day Beijing - SPDK for Ceph

SPDK Blobstore Vs. Kernel: Key Tail Latency

db_bench 99.99th percentile latency (µs), lower is better:

Workload  | Kernel (256KB sync) | SPDK Blobstore (20GB cache + readahead)
Insert    | 366                 | 444
Randread  | 6444                | 3607
Overwrite | 1675                | 1200
Readwrite | 122500              | 33052

SPDK Blobstore reduces readwrite tail latency by 3.7X (372%); randread and overwrite latency drop by 44% and 28%, while insert is 21% higher.
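For context, tail-latency comparisons of this kind are typically collected with RocksDB's db_bench tool and its percentile histogram output. The sketch below is illustrative only; the workload names and sizes are assumptions and do not reproduce the exact settings behind these slides. The SPDK RocksDB port adds its own flags to point db_bench at BlobFS/Blobstore instead of a kernel filesystem.

```sh
# Illustrative db_bench run for gathering 99.99th-percentile latency numbers.
# Workload names and sizes are assumptions, not the settings behind the slides.
./db_bench --benchmarks=fillrandom,readrandom,overwrite,readwhilewriting \
           --num=50000000 --value_size=500 --key_size=16 \
           --histogram=1 --statistics=1
```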

Page 35: Ceph Day Beijing - SPDK for Ceph

SPDK Blobstore Vs. Kernel: Key Transactions per sec

db_bench key transactions per second, higher is better:

Workload  | Kernel (256KB sync) | SPDK Blobstore (20GB cache + readahead)
Insert    | 547046              | 1011245
Randread  | 92582               | 99918
Overwrite | 51421               | 53495
Readwrite | 30273               | 29804

Chart annotations: Insert +85%, Randread +8%, Overwrite +4%, Readwrite ~0%. SPDK Blobstore improves insert throughput by 85%.