Ceph at Work in Bloomberg: Object Store, RBD and OpenStack
CEPH AT WORK IN BLOOMBERG
Object Store, RBD and OpenStack
January 19, 2016
By: Chris Jones & Chris Morgan
BLOOMBERG
30 Years in Under 30 Seconds
● Subscriber-based financial provider (Bloomberg Terminal)
● Online, TV, print, real-time streaming information
● Offices and customers in every major financial market and institution worldwide
BLOOMBERG
Primary Product – Information
● Bloomberg Terminal
− Approximately 60,000 features/functions. For example, the ability to track oil tankers in real time via satellite feeds
− Note: Exact numbers are not specified. Contact media relations for specifics and other important information.
CLOUD INFRASTRUCTURE
CLOUD INFRASTRUCTURE GROUP
Primary customers – Developers – Product Groups
● Many different development groups throughout our organization
● Currently about 3,000 R&D developers
● Every one of them wants and needs resources
CLOUD INFRASTRUCTURE GROUP
Resource Challenges
● Developers
− Development
− Testing
− Automation (Cattle vs. Pets)
● Organizations
− POC
− Products in production
− Automation
● Security/Networking
− Compliance
USER BASE (EXAMPLES)
Resources and Use Cases
● Multiple Data Centers
− Each DC contains *many* Network Tiers, including a DMZ for public-facing Bloomberg assets
− There is at least one Ceph/OpenStack Cluster per Network Tier
● Developer Community Supported
− Public facing Bloomberg products
− Machine learning backend for smart apps
− Compliance-based resources
− Use cases continue to climb as Devs need more storage and compute capacity
INFRASTRUCTURE
USED IN BLOOMBERG
● Ceph – RGW (Object Store)
● Ceph – Block/Volume
● OpenStack
− Different flavors of compute
− Ephemeral storage
● The Object Store is becoming one of the most popular offerings (see the S3 sketch after this list)
● OpenStack compute with Ceph-backed block store volumes is very popular
● We introduced ephemeral compute storage
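The object store is consumed over the S3 API through RGW, so a minimal client sketch helps make that concrete. This is illustrative only: the endpoint URL, credentials, and bucket name below are placeholders, not actual Bloomberg configuration.

```python
import boto3
from botocore.client import Config

# Hypothetical RGW endpoint and credentials -- replace with real values.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.internal:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    config=Config(signature_version="s3"),  # RGW of this era speaks S3 v2-style signing
)

# Create a bucket, write an object, and read it back through RGW.
s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello ceph")
obj = s3.get_object(Bucket="demo-bucket", Key="hello.txt")
print(obj["Body"].read())
```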
SUPER HYPER-CONVERGED STACK
On EVERY Network Tier
SUPER HYPER-CONVERGED STACK
(Original) Converged Architecture Rack Layout
● 3 Head Nodes (Controller Nodes)
− Ceph Monitor
− Ceph OSD
− OpenStack Controllers (All of them!)
− HAProxy
● 1 Bootstrap Node
− Cobbler (PXE Boot)
− Repos
− Chef Server
− Rally/Tempest
● Remaining Nodes
− Nova Compute
− Ceph OSDs
− RGW – Apache
● Ubuntu
● Shared spine with Hadoop resources
[Rack diagram – sliced view of the stack: 3 head nodes (Ceph Mon, OS controllers, OS compute, OSD, other OS services), 1 bootstrap node, and the remaining stack of compute/Ceph OSD/RGW/Apache nodes]
NEW POD ARCHITECTURE
[Diagram – illustrative only, not representative: separate PODs behind top-of-rack (TOR) switches – an OpenStack POD (HAProxy, OS-Nova, OS-Rabbit, OS-DB), a Ceph POD (3 Ceph Mons plus Ceph OSDs, RBD only), plus Bootstrap and Monitoring nodes, and an Ephemeral POD (fast/dangerous, selected via host aggregates & flavors, not Ceph backed – sketched below). A number of large providers have taken similar approaches.]
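The ephemeral POD above is selected through Nova host aggregates and flavors rather than Ceph. A hedged sketch of how that pairing is commonly wired up with the OpenStack CLI follows; the aggregate, host, and flavor names are made up for illustration.

```python
import subprocess

def run(cmd):
    """Run an OpenStack CLI command and echo it for visibility."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Hypothetical names: ephemeral-agg, compute-eph-01 and m1.ephemeral are placeholders.
run(["openstack", "aggregate", "create", "--property", "ephemeral=true", "ephemeral-agg"])
run(["openstack", "aggregate", "add", "host", "ephemeral-agg", "compute-eph-01"])
run(["openstack", "flavor", "create", "--vcpus", "4", "--ram", "8192", "--disk", "200", "m1.ephemeral"])
# The AggregateInstanceExtraSpecsFilter matches this key against the aggregate metadata above,
# so instances booted with this flavor land only on the ephemeral (non-Ceph) hosts.
run(["openstack", "flavor", "set", "--property",
     "aggregate_instance_extra_specs:ephemeral=true", "m1.ephemeral"])
```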
POD ARCHITECTURE (OPENSTACK/CEPH)
[Diagram – illustrative only, not representative: an OpenStack POD (OS-Nova nodes plus OS-Rabbit and OS-DB) consuming Ceph block storage, and two Ceph PODs (3 Ceph Mons plus Ceph OSDs each), all behind TOR switches. A number of large providers have taken similar approaches.]
● Scale and re-provision as needed
● 3 PODs per rack
EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
● Ceph vs. Ephemeral storage
● Ephemeral is a new feature option added to address high-IOPS applications
EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
Ceph – Advantages
● All data is replicated at least 3 ways across the cluster
● Ceph RBD volumes can be created, attached and detached from any hypervisor
● Very fast provisioning using COW (copy-on-write) images (see the sketch after this comparison)
● Allows easy instance re-launch in the event of hypervisor failure
● High read performance
Ephemeral – Advantages
● Offers read/write speeds that can be 3-4 times faster than Ceph with lower latency
● Can provide fairly large volumes for cheap
Ceph – Disadvantages
● All writes must be acknowledged by multiple nodes before being considered committed (a tradeoff for reliability)
● Higher latency due to Ceph being network based instead of local
Ephemeral – Disadvantages
● Trades data integrity for speed: if one drive in a RAID 0 set fails, all data on that node is lost
● May be difficult to add more capacity (depends on the type of RAID)
● Running in JBOD/LVM mode without RAID, performance was not as good as Ceph
● Less important: with RAID, drives need to be the same size or you lose capacity
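The fast COW provisioning noted in the Ceph advantages can be sketched with the python-rbd bindings: snapshot a golden image, protect the snapshot, and clone it, copying only metadata. The pool and image names are placeholders, and the parent image needs the layering feature enabled.

```python
import rados
import rbd

# Connect using the local ceph.conf; the pool and image names are placeholders.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("vms")
    try:
        # Snapshot and protect the golden image, then clone it copy-on-write:
        # the clone shares data blocks with the parent until they are written to.
        with rbd.Image(ioctx, "golden-image") as img:
            img.create_snap("base")
            img.protect_snap("base")
        rbd.RBD().clone(ioctx, "golden-image", "base", ioctx, "instance-0001-disk")
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```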
EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
Metric                             Ephemeral   Ceph
Block write bandwidth (MB/s)       1,094.02    642.15
Block read bandwidth (MB/s)        1,826.43    639.47
Character read bandwidth (MB/s)    4.93        4.31
Character write bandwidth (MB/s)   0.83        0.75
Block write latency (ms)           9.502       37.096
Block read latency (ms)            8.121       4.941
Character read latency (ms)        2.395       3.322
Character write latency (ms)       11.052      13.587
Note: Ephemeral in JBOD/LVM mode is not as fast as Ceph. Numbers can also increase with additional tuning and different devices.
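The block/character bandwidth and latency rows above are the kind of figures a Bonnie++ run (listed later under Testing) reports. A rough sketch of driving one run from Python; the mount point and file size are placeholders, and flags can differ between Bonnie++ versions.

```python
import subprocess

# Hypothetical mount point of the volume under test (Ceph RBD or local ephemeral).
TARGET = "/mnt/voltest"

# -s: file size in MB (should exceed RAM to defeat caching), -n 0: skip small-file tests,
# -u root: user to run as when invoked as root.
subprocess.run(
    ["bonnie++", "-d", TARGET, "-s", "65536", "-n", "0", "-u", "root"],
    check=True,
)
```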
CHALLENGES – LESSONS LEARNED
Network
● It’s all about the network.
− Changed MTU from 1500 to 9000 on certain interfaces (Float interface – Storage interface)
− Hardware Load Balancers – keep an eye on performance
● Hardware
− Moving to more commodity-driven hardware
− All flash storage in compute cluster (high cost, good for block and ephemeral)
Costs
● Storage costs for the Object Store are very high in a converged compute cluster
Analytics
● Need to know how the cluster is being used
● Need to know if the TPS (transactions per second) meets the SLA
● Test directly against nodes, then layer in network components until you can verify every choke point in the data-flow path (see the sketch after this list)
● Monitor and test always
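One way to test directly against nodes and then layer in network components, as suggested above, is to time the same request against an RGW node and against the load-balancer VIP and compare. A rough sketch using the Python requests library; both URLs are placeholders.

```python
import time
import requests

# Placeholders: one RGW node reached directly, and the same service via the hardware LB.
ENDPOINTS = {
    "direct-node": "http://rgw-node-01.example.internal:8080/",
    "via-lb": "http://objectstore.example.internal/",
}

for name, url in ENDPOINTS.items():
    samples = []
    for _ in range(20):
        start = time.time()
        requests.get(url, timeout=5)
        samples.append((time.time() - start) * 1000.0)
    samples.sort()
    # Comparing medians per hop helps isolate which network component adds latency.
    print(f"{name}: median {samples[len(samples) // 2]:.1f} ms")
```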
NEW CEPH OBJECT STORE
OBJECT STORE STACK (RACK CONFIG)
Red Hat 7.1
● 1 TOR and 1 Rack Mgmt Node
● 3 1U nodes (Mon, RGW, Util)
● 17 2U Ceph OSD nodes
● 2x or 3x Replication depending on need (3x default)
● Secondary RGW (may coexist with OSD Node)
● 10g Cluster interface
● 10g Public interface
● 1 IPMI interface
● OSD Nodes (high density server nodes)
− 6TB HDD x 12 – journal partitions on SSD (as sketched below)
− No RAID1 OS drives – instead we partitioned off a small amount of SSD1 for the OS and swap, with the remainder of SSD1 used for some journals and SSD2 used for the remaining journals
− Failure domain is a node
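For the journal-on-SSD layout described above, a hedged sketch of the era-appropriate ceph-disk workflow follows; the device names are placeholders, and newer Ceph releases replace ceph-disk with ceph-volume.

```python
import subprocess

# Placeholders: /dev/sdc is one of the 6TB data HDDs, /dev/sda5 is a journal
# partition carved out of SSD1 (the same SSD that also holds the OS and swap).
DATA_DEV = "/dev/sdc"
JOURNAL_PART = "/dev/sda5"

# ceph-disk creates the OSD data filesystem on the HDD and points its journal
# at the SSD partition; activation then brings the OSD up.
subprocess.run(["ceph-disk", "prepare", "--fs-type", "xfs", DATA_DEV, JOURNAL_PART], check=True)
subprocess.run(["ceph-disk", "activate", DATA_DEV + "1"], check=True)
```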
[Rack diagram: TOR/IPMI switch, 3 1U nodes (Mon/RGW/Util), and converged storage nodes at 2U each]
OBJECT STORE STACK (ARCHITECTURE)
[Architecture diagram: 1 Mon/RGW node per rack; TOR leaf switches connect the storage nodes up to redundant spines and load balancers]
OBJECT STORE STACK
Standard configuration (Archive Cluster)
● Min of 3 Racks = Cluster
● OS – Red Hat 7.1
● Cluster Network: Bonded 10g or higher depending on size of cluster
● Public Network: Bonded 10g for RGW interfaces
● 1 Ceph Mon node per rack, except on clusters with more than 3 racks – we need to keep an odd number of Mons, so some racks may not have one. We try to keep the racks and Mons of larger clusters in different power zones
● We have developed a healthy “Pain” tolerance. We mainly see drive failures and some node failures.
● Min 1 RGW (dedicated Node) per rack (may want more)
● Hardware load balancers to RGWs with redundancy
● Erasure coded pools (no cache tiers at present – testing). We also use a host profile with 8/3 (k/m) – see the sketch after this list
● Near-full and full ratios are .75/.85 respectively
● Index sharding
● Federated (regions/zones)
● All server nodes, no JBOD expansions
● S3 only at present but we do have a few requests for Swift
● Fully AUTOMATED – Chef cookbooks to configure and manage cluster (some Ansible)
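A hedged sketch of how the 8/3 erasure-code profile with a host failure domain and the .75/.85 ratios could be expressed with the ceph CLI. The pool name and PG counts are placeholders, and option spellings vary by Ceph release (older releases use ruleset-failure-domain and injectargs; newer ones crush-failure-domain and set-nearfull-ratio).

```python
import subprocess

def ceph(*args):
    """Run a ceph CLI command and echo it for visibility."""
    cmd = ["ceph"] + list(args)
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 8 data + 3 coding chunks, with each chunk on a different host (node = failure domain).
ceph("osd", "erasure-code-profile", "set", "ec-8-3",
     "k=8", "m=3", "crush-failure-domain=host")

# Placeholder pool name and PG count; PGs should be sized for the actual cluster.
ceph("osd", "pool", "create", "archive-objects", "2048", "2048", "erasure", "ec-8-3")

# Near-full warning at 75% and full (write-blocking) at 85%, as on the slide.
ceph("tell", "mon.*", "injectargs",
     "--mon_osd_nearfull_ratio=0.75 --mon_osd_full_ratio=0.85")
```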
AUTOMATION
All of what we do only happens because of automation
● Company policy – Chef
● Cloud Infrastructure Group uses Chef and Ansible. We use Ansible for orchestration and maintenance
● Bloomberg Github: https://github.com/bloomberg/bcpc
● Ceph specific options
− Ceph Chef: https://github.com/ceph/ceph-chef
− Bloomberg Object Store: https://github.com/bloomberg/chef-bcs
− Ceph Deploy: https://github.com/ceph/ceph-deploy
− Ceph Ansible: https://github.com/ceph/ceph-ansible
● Our bootstrap server is our Chef server per cluster
TESTING
Testing is critical. We use different strategies for the different parts of OpenStack and Ceph we test.
● OpenStack
− Tempest – We currently only use this for patches we make. We plan to use this more in our DevOps pipeline
− Rally – Can’t do distributed testing but we use it to test bottlenecks in OpenStack itself
● Ceph
− RADOS Bench (see the benchmark sketch after this list)
− COS Bench – Going to try this with CBT
− CBT – Ceph Benchmark Testing
− Bonnie++
− FIO
● Ceph – RGW
− Jmeter – Need to test load at scale. It takes a cloud to test a cloud
● A lot of the time you find it’s your network, load balancers, etc.
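A hedged sketch of two of the raw-Ceph tests listed above, RADOS bench and FIO, driven from Python. The pool name and test-file path are placeholders, and the fio options shown are just one common random-write profile.

```python
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 60-second 4MB-object write test against a throwaway pool (placeholder name),
# followed by a sequential-read pass over the objects it left behind.
run(["rados", "bench", "-p", "benchpool", "60", "write", "--no-cleanup"])
run(["rados", "bench", "-p", "benchpool", "60", "seq"])

# fio 4k random writes with direct I/O against a file on the volume under test.
run(["fio", "--name=randwrite", "--filename=/mnt/voltest/fio.dat",
     "--rw=randwrite", "--bs=4k", "--size=4G", "--iodepth=32",
     "--ioengine=libaio", "--direct=1", "--runtime=60", "--time_based"])
```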
CEPH USE CASE DEMAND – GROWING!
[Diagram: Ceph at the center of growing use-case demand – Object, Immutable, OpenStack, Real-time*, Big Data*?]
*Possible use cases if performance is enhanced
WHAT’S NEXT?
Continue to evolve our POD architecture
● OpenStack
− Work on performance improvements and track stats on usage for departments
− Better monitoring
− LBaaS, Neutron
● Containers and PaaS
− We’re currently evaluating PaaS software and container strategies
● Better DevOps Pipelining
− Improved GoCD and/or Jenkins strategies
− Continue to enhance automation and re-provisioning
− Add testing to automation
● Ceph
− New Block Storage Cluster
− Super Cluster design
− Performance improvements – testing Jewel
− RGW Multi-Master (multi-sync) between datacenters
− Enhanced security – encryption at rest (which we can already do) but with better key management
− NVMe for journals and maybe for high-IOPS block devices
− Cache Tier (need validation tests)
THANK YOU
ADDITIONAL RESOURCES
● Chris Jones: [email protected]
− Github: cloudm2
● Chris Morgan: [email protected]
− Github: mihalis68
Cookbooks:
● BCC: https://github.com/bloomberg/bcpc
− Current repo for Bloomberg’s Converged OpenStack and Ceph cluster
● BCS: https://github.com/bloomberg/chef-bcs
● Ceph-Chef: https://github.com/ceph/ceph-chef
The last two repos make up the Ceph Object Store and full Ceph Chef Cookbooks.