Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Transcript of Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Page 1: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Ceph Day Amsterdam – March 31st, 2015

Unleash Ceph over Flash Storage Potential with

Mellanox High-Performance Interconnect

Chloe Jian Ma, Senior Director, Cloud Marketing

Page 2: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Leading Supplier of End-to-End Interconnect Solutions

[Portfolio diagram: Virtual Protocol Interconnect spanning Server/Compute, Switch/Gateway, Storage (front- and back-end), and Metro/WAN; 56G InfiniBand & FCoIB and 10/40/56GbE & FCoE links; products ranging from ICs, adapter cards, switches/gateways, and cables/modules to host/fabric software.]

Comprehensive End-to-End InfiniBand and Ethernet Portfolio

Page 3: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Solid State Storage Technology Evolution – Lower Latency

Advanced Networking and Protocol Offloads Required to Match Storage Media Performance

[Chart: access time in microseconds (log scale, 0.1 to 1000) across storage media generations: hard drives, NAND flash (SSD), and next-generation NVM. As media latency falls, the network hardware/software and the storage protocol (software) account for an ever larger share of total networked-storage access time.]

Page 4: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Scale-Out Architecture Requires A Fast Network

Scale-out grows capacity and performance in parallel

Requires a fast network for replication, sharing, and metadata (file)

• Throughput requires bandwidth

• IOPS requires low latency

• Efficiency requires CPU offload

Proven in HPC, storage appliances, cloud, and now… Ceph

Interconnect Capabilities Determine Scale Out Performance
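
A back-of-the-envelope check of the bandwidth and latency bullets above, using the block sizes from the fio tests cited later in this deck; the 5 GB/s throughput figure and the queue-depth-1 assumption are purely illustrative:

    # Rough arithmetic only: why throughput needs bandwidth and IOPS needs low latency.
    def gbits_per_sec(iops, block_bytes):
        return iops * block_bytes * 8 / 1e9

    print("100K IOPS @ 4K blocks -> %.1f Gb/s" % gbits_per_sec(100_000, 4 * 1024))      # ~3.3 Gb/s, fits in 10GbE
    print("5 GB/s of 8M blocks   -> %.0f Gb/s" % (5e9 * 8 / 1e9))                        # 40 Gb/s, needs 40/56GbE
    print("latency budget at 100K IOPS, queue depth 1 -> %.0f usec" % (1e6 / 100_000))   # 10 usec per I/O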

Page 5: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Ceph and Networks

High performance networks enable maximum cluster availability

• Clients, OSDs, Monitors, and Metadata servers communicate over multiple network layers

• Real-time requirements for heartbeat, replication, recovery and re-balancing

Cluster (“backend”) network performance dictates cluster’s performance and scalability

• “Network load between Ceph OSD Daemons easily dwarfs the network load between Ceph Clients and the Ceph Storage Cluster” (Ceph documentation)
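
The public/cluster split above is expressed in ceph.conf. A minimal sketch follows; the subnets are hypothetical, while "public network" and "cluster network" are the standard option names. Python's configparser is used here only to emit the INI-style file for illustration.

    import configparser

    conf = configparser.ConfigParser()
    conf["global"] = {
        "public network": "192.168.10.0/24",   # client <-> OSD/monitor traffic
        "cluster network": "192.168.20.0/24",  # OSD <-> OSD replication, recovery, rebalancing
    }

    with open("ceph.conf.sample", "w") as f:
        conf.write(f)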

Page 6: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

How Customers Deploy Ceph with Mellanox Interconnect

Building Scalable, High-Performance Storage Solutions

• Cluster network @ 40Gb Ethernet

• Clients @ 10G/40Gb Ethernet

High Performance at Low Cost

• Allows more capacity per OSD

• Lower cost/TB

Flash Deployment Options

• All HDD (no flash)

• Flash for OSD Journals (sketch below)

• 100% Flash in OSDs

Faster Cluster Network Improves Price/Capacity and Price/Performance
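
For the "flash for OSD journals" option, a common pattern at the time was to keep data on HDDs and place each OSD's journal on an SSD/NVMe partition. A hedged sketch using ceph-deploy's host:data:journal form; host and device names are hypothetical:

    import subprocess

    # HDD-backed OSDs whose journals live on flash partitions (hypothetical layout).
    layout = {
        "osd-node1": [("/dev/sdb", "/dev/nvme0n1p1"),
                      ("/dev/sdc", "/dev/nvme0n1p2")],
    }

    for host, disks in layout.items():
        for data_dev, journal_dev in disks:
            subprocess.check_call(
                ["ceph-deploy", "osd", "prepare",
                 "{}:{}:{}".format(host, data_dev, journal_dev)]
            )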

Page 7: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Ceph Deployment Using 10GbE and 40GbE

Cluster (Private) Network @ 40/56GbE

• Smooth HA, unblocked heartbeats, efficient data balancing

Throughput Clients @ 40/56GbE

• Guarantees line rate for high ingress/egress clients

IOPs Clients @ 10GbE or 40/56GbE

• 100K+ IOPs/Client @4K blocks

20x Higher Throughput, 4x Higher IOPs with 40Gb Ethernet Clients! (http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Ceph_over_High_Performance_Networks.pdf)

Throughput testing results based on fio benchmark: 8M blocks, 20GB file, 128 parallel jobs, RBD kernel driver with Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2

IOPs testing results based on fio benchmark: 4K blocks, 20GB file, 128 parallel jobs, RBD kernel driver with Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2
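
A hedged reconstruction of the 4K IOPS test described above, wrapped in Python. The RBD device path, runtime, and randread pattern are assumptions; the block size, file size, and job count come from the slide.

    import subprocess

    subprocess.check_call([
        "fio", "--name=ceph-4k-iops",
        "--filename=/dev/rbd0",           # RBD kernel driver device (assumed path)
        "--ioengine=libaio", "--direct=1",
        "--rw=randread",                  # read/write mix not stated on the slide
        "--bs=4k", "--size=20g", "--numjobs=128",
        "--runtime=60", "--time_based", "--group_reporting",
    ])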

[Topology: Admin Node and Ceph Nodes (Monitors, OSDs, MDS) on a 40GbE cluster network; Client Nodes attach over a 10GbE/40GbE public network.]

Page 8: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

OK, But How Do We Further Improve IOPS? We Use RDMA!

RDMA over InfiniBand or Ethernet

[Diagram: TCP/IP vs. RDMA data path between applications in two racks, across user space, kernel, and hardware layers. With TCP/IP, data is copied through application, OS, and NIC buffers; with RDMA, the HCAs move data directly between application buffers, bypassing the OS.]

Page 9: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Ceph Throughput using 40Gb and 56Gb Ethernet

[Chart: throughput in MB/s (axis 0 to 6000) for 64KB and 256KB random reads with one OSD, one client, 8 threads, comparing 40Gb TCP (MTU=1500), 56Gb TCP (MTU=4500), and 56Gb RDMA (MTU=4500).]

Page 10: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Optimizing Ceph for Flash

By SanDisk & Mellanox

Page 11: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Ceph Flash Optimization

Highlights Compared to Stock Ceph

• Read performance up to 8x better

• Write performance up to 2x better with tuning

Optimizations

• All-flash storage for OSDs

• Enhanced parallelism and lock optimization

• Optimization for reads from flash

• Improvements to Ceph messenger

Test Configuration

• InfiniFlash Storage with IFOS 1.0 EAP3

• Up to 4 RBDs

• 2 Ceph OSD nodes, connected to InfiniFlash

• 40GbE NICs from Mellanox

SanDisk InfiniFlash

Page 12: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

8K Random - 2 RBD/Client with File System

[Charts: IOPS and latency (ms), 2 LUNs per client, 4 clients total. X-axis: queue depth (1, 2, 4, 8, 16, 32) at read percentages of 0, 25, 50, 75, and 100; IOPS axis up to 300K, latency axis up to 120 ms; IFOS 1.0 vs. stock Ceph.]
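
The charts above sweep queue depth and read/write mix. A hedged sketch of how such a matrix could be driven with fio; the target file, runtime, and the randrw/rwmixread mapping are assumptions, while the queue depths and read percentages follow the chart axes.

    import itertools
    import subprocess

    for read_pct, qd in itertools.product([0, 25, 50, 75, 100], [1, 2, 4, 8, 16, 32]):
        subprocess.check_call([
            "fio", "--name=sweep-r{}-qd{}".format(read_pct, qd),
            "--filename=/mnt/rbd0/testfile",    # file system on an RBD (assumed mount)
            "--ioengine=libaio", "--direct=1",
            "--rw=randrw", "--rwmixread={}".format(read_pct),
            "--bs=8k", "--iodepth={}".format(qd),
            "--size=20g", "--runtime=60", "--time_based", "--group_reporting",
        ])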

Page 13: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Performance: 64K Random – 2 RBD/Client with File System

[Charts: IOPS and latency (ms), 2 LUNs per client, 4 clients total. X-axis: queue depth (1, 2, 4, 8, 16, 32) at read percentages of 0, 25, 50, 75, and 100; IOPS axis up to 160K, latency axis up to 180 ms; IFOS 1.0 vs. stock Ceph.]

Page 14: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

XioMessenger

Adding RDMA To Ceph

Page 15: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

I/O Offload Frees Up CPU for Application Processing

[Chart: CPU utilization split into user space and system space. Without RDMA: ~53% CPU efficiency, ~47% CPU overhead/idle. With RDMA and offload: ~88% CPU efficiency, ~12% CPU overhead/idle.]

Page 16: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Adding RDMA to Ceph

RDMA Beta to be Included in Hammer

• Mellanox, Red Hat, CohortFS, and Community collaboration

• Full RDMA expected in Infernalis

Refactoring of Ceph messaging layer

• New RDMA messenger layer called XioMessenger

• New class hierarchy allowing multiple transports (the simple one is TCP)

• Async design that leverages Accelio

• Reduced locks; Reduced number of threads

XioMessenger built on top of Accelio (RDMA abstraction layer)

• Integrated into all Ceph user-space components: daemons and clients

• “public network” and “cluster network”
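
A hedged illustration of selecting the messenger transport from a librados client using the python-rados bindings. "ms_type = xio" follows the XioMessenger naming used in this work and is an assumption here, not a shipped default; the pool name is hypothetical.

    import rados

    cluster = rados.Rados(
        conffile="/etc/ceph/ceph.conf",
        conf={"ms_type": "xio"},   # default messenger ("simple") speaks TCP
    )
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("rbd")          # hypothetical pool
        ioctx.write_full("hello", b"over RDMA")    # ordinary librados I/O; transport is transparent
        ioctx.close()
    finally:
        cluster.shutdown()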

Page 17: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Accelio: High-Performance Reliable Messaging and RPC Library

Open source!

• https://github.com/accelio/accelio/ and www.accelio.org

Faster RDMA integration into applications

Asynchronous

Maximizes message and CPU parallelism

• Enables >10GB/s from a single node

• Enables <10 usec latency under load

In Giant and Hammer

• http://wiki.ceph.com/Planning/Blueprints/Giant/Accelio_RDMA_Messenger

Page 18: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Ceph 4KB Read IOPS: 40Gb TCP vs. 40Gb RDMA

[Chart: Ceph 4KB read IOPS (thousands; axis up to 450K) for 2, 4, and 8 OSDs with 4 clients, comparing 40Gb TCP vs. 40Gb RDMA. Annotations show CPU cores in use per test: roughly 34 to 38 cores on the OSD side and 24 to 32 cores on the client side.]

Page 19: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Ceph-Powered Solutions

Deployment Examples

Page 20: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Ceph for Large-Scale Storage – Fujitsu Eternus CD10000

Hyperscale Storage

• 4 to 224 nodes

• Up to 56 PB raw capacity

Runs Ceph with Enhancements

• 3 different storage nodes

• Object, block, and file storage

Mellanox InfiniBand Cluster Network

• 40Gb InfiniBand cluster network

• 10Gb Ethernet front end network

Page 21: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Media & Entertainment Storage – StorageFoundry Nautilus

Turnkey Object Storage

• Built on Ceph

• Pre-configured for rapid deployment

• Mellanox 10/40GbE networking

High-Capacity Configuration

• 6-8TB Helium-filled drives

• Up to 2PB in 18U

High-Performance Configuration

• Single client read 2.2 GB/s

• SSD caching + Hard Drives

• Supports Ethernet, IB, FC, FCoE front-end ports

More information: www.storagefoundry.net

Page 22: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

SanDisk InfiniFlash

Flash Storage System

• Announced March 3, 2015

• InfiniFlash OS uses Ceph

• 512 TB (raw) in one 3U enclosure

• Tested with 40GbE networking

High Throughput

• Up to 7GB/s

• Up to 1M IOPS with two nodes

More information:

• http://bigdataflash.sandisk.com/infiniflash

Page 23: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

More Ceph Solutions

Cloud – OnyxCCS ElectraStack

• Turnkey IaaS

• Multi-tenant computing system

• 5x faster Node/Data restoration

• https://www.onyxccs.com/products/8-series

Flextronics CloudLabs

• OpenStack on CloudX design

• 2SSD + 20HDD per node

• Mix of 1Gb/40GbE network

• http://www.flextronics.com/

ISS Storage Supercore

• Healthcare solution

• 82,000 IOPS on 512B reads

• 74,000 IOPS on 4KB reads

• 1.1GB/s on 256KB reads

• http://www.iss-integration.com/supercore.html

Scalable Informatics Unison

• High availability cluster

• 60 HDD in 4U

• Tier 1 performance at archive cost

• https://scalableinformatics.com/unison.html

Page 24: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Summary

Ceph scalability and performance benefit from high performance networks

Ceph is being optimized for flash storage

End-to-end 40/56 Gb/s transport accelerates Ceph today

• 100Gb/s testing has begun!

• Available in various Ceph solutions and appliances

RDMA is the next step in optimizing flash performance (beta in Hammer)

Page 25: Ceph Day Amsterdam 2015 - Deploying flash storage for Ceph without compromising performance

Thank You