Ceph at Work in Bloomberg: Object Store, RBD and OpenStack
CEPH AT WORK IN BLOOMBERG
Object Store, RBD and OpenStack
January 19, 2016
By: Chris Jones & Chris Morgan
BLOOMBERG
30 Years in Under 30 Seconds
● Subscriber-based financial provider (Bloomberg Terminal)
● Online, TV, print, real-time streaming information
● Offices and customers in every major financial market and institution worldwide
BLOOMBERG
Primary Product – Information
● Bloomberg Terminal
− Approximately 60,000 features/functions. For example, the ability to track oil tankers in real time via satellite feeds
− Note: Exact numbers are not specified. Contact media relations for specifics and other important information.
CLOUD INFRASTRUCTURE
CLOUD INFRASTRUCTURE GROUP
Primary customers – Developers – Product Groups
● Many different development groups throughout our organization
● Currently about 3,000 R&D developers
● Every one of them wants and needs resources
CLOUD INFRASTRUCTURE GROUP
Resource Challenges
● Developers
− Development
− Testing
− Automation (Cattle vs. Pets)
● Organizations
− POC
− Products in production
− Automation
● Security/Networking
− Compliance
USER BASE (EXAMPLES)
Resources and Use Cases
● Multiple Data Centers
− Each DC contains *many* Network Tiers, including a DMZ for public-facing Bloomberg assets
− There is at least one Ceph/OpenStack Cluster per Network Tier
● Developer Community Supported
− Public facing Bloomberg products
− Machine learning backend for smart apps
− Compliance-based resources
− Use cases continue to climb as Devs need more storage and compute capacity
INFRASTRUCTURE
USED IN BLOOMBERG
● Ceph – RGW (Object Store)
● Ceph – Block/Volume
● OpenStack
− Different flavors of compute
− Ephemeral storage
● The Object Store is becoming one of the most popular offerings (see the S3 sketch after this list)
● OpenStack compute with Ceph-backed block store volumes is very popular
● We introduced ephemeral compute storage
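The object store is consumed over the S3 API through RGW, so a minimal client sketch helps make that concrete. This is illustrative only: the endpoint URL, credentials, and bucket name below are placeholders, not actual Bloomberg configuration.

```python
import boto3
from botocore.client import Config

# Hypothetical RGW endpoint and credentials -- replace with real values.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.internal:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    config=Config(signature_version="s3"),  # RGW of this era speaks S3 v2-style signing
)

# Create a bucket, write an object, and read it back through RGW.
s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello ceph")
obj = s3.get_object(Bucket="demo-bucket", Key="hello.txt")
print(obj["Body"].read())
```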
SUPER HYPER-CONVERGED STACK
On EVERY Network Tier
SUPER HYPER-CONVERGED STACK
(Original) Converged Architecture Rack Layout
● 3 Head Nodes (Controller Nodes)
− Ceph Monitor
− Ceph OSD
− OpenStack Controllers (All of them!)
− HAProxy
● 1 Bootstrap Node
− Cobbler (PXE Boot)
− Repos
− Chef Server
− Rally/Tempest
● Remaining Nodes
− Nova Compute
− Ceph OSDs
− RGW – Apache
● Ubuntu
● Shared spine with Hadoop resources
[Rack diagram – sliced view of the stack: 3 head nodes (Ceph Mon, OS controllers, OS compute, OSD, other OS services), 1 bootstrap node, and the remaining stack of compute/Ceph OSD/RGW/Apache nodes]
NEW POD ARCHITECTURE
[Diagram – illustrative only, not representative: separate PODs behind top-of-rack (TOR) switches – an OpenStack POD (HAProxy, OS-Nova, OS-Rabbit, OS-DB), a Ceph POD (3 Ceph Mons plus Ceph OSDs, RBD only), plus Bootstrap and Monitoring nodes, and an Ephemeral POD (fast/dangerous, selected via host aggregates & flavors, not Ceph backed – sketched below). A number of large providers have taken similar approaches.]
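The ephemeral POD above is selected through Nova host aggregates and flavors rather than Ceph. A hedged sketch of how that pairing is commonly wired up with the OpenStack CLI follows; the aggregate, host, and flavor names are made up for illustration.

```python
import subprocess

def run(cmd):
    """Run an OpenStack CLI command and echo it for visibility."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Hypothetical names: ephemeral-agg, compute-eph-01 and m1.ephemeral are placeholders.
run(["openstack", "aggregate", "create", "--property", "ephemeral=true", "ephemeral-agg"])
run(["openstack", "aggregate", "add", "host", "ephemeral-agg", "compute-eph-01"])
run(["openstack", "flavor", "create", "--vcpus", "4", "--ram", "8192", "--disk", "200", "m1.ephemeral"])
# The AggregateInstanceExtraSpecsFilter matches this key against the aggregate metadata above,
# so instances booted with this flavor land only on the ephemeral (non-Ceph) hosts.
run(["openstack", "flavor", "set", "--property",
     "aggregate_instance_extra_specs:ephemeral=true", "m1.ephemeral"])
```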
POD ARCHITECTURE (OPENSTACK/CEPH)
[Diagram – illustrative only, not representative: an OpenStack POD (OS-Nova nodes plus OS-Rabbit and OS-DB) consuming Ceph block storage, and two Ceph PODs (3 Ceph Mons plus Ceph OSDs each), all behind TOR switches. A number of large providers have taken similar approaches.]
● Scale and re-provision as needed
● 3 PODs per rack
EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
● Ceph vs. Ephemeral storage
● Ephemeral is a new feature option added to address high-IOPS applications
EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
Ceph – Advantages
● All data is replicated at least 3 ways across the cluster
● Ceph RBD volumes can be created, attached and detached from any hypervisor
● Very fast provisioning using COW (copy-on-write) images (see the sketch after this comparison)
● Allows easy instance re-launch in the event of hypervisor failure
● High read performance
Ephemeral – Advantages
● Offers read/write speeds that can be 3-4 times faster than Ceph with lower latency
● Can provide fairly large volumes for cheap
Ceph – Disadvantages
● All writes must be acknowledged by multiple nodes before being considered committed (a tradeoff for reliability)
● Higher latency due to Ceph being network based instead of local
Ephemeral – Disadvantages
● Trades data integrity for speed: if one drive in a RAID 0 set fails, all data on that node is lost
● May be difficult to add more capacity (depends on the type of RAID)
● Running in JBOD/LVM mode without RAID, performance was not as good as Ceph
● Less important: with RAID, drives need to be the same size or you lose capacity
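The fast COW provisioning noted in the Ceph advantages can be sketched with the python-rbd bindings: snapshot a golden image, protect the snapshot, and clone it, copying only metadata. The pool and image names are placeholders, and the parent image needs the layering feature enabled.

```python
import rados
import rbd

# Connect using the local ceph.conf; the pool and image names are placeholders.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("vms")
    try:
        # Snapshot and protect the golden image, then clone it copy-on-write:
        # the clone shares data blocks with the parent until they are written to.
        with rbd.Image(ioctx, "golden-image") as img:
            img.create_snap("base")
            img.protect_snap("base")
        rbd.RBD().clone(ioctx, "golden-image", "base", ioctx, "instance-0001-disk")
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```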
EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
Metric                             Ephemeral   Ceph
Block write bandwidth (MB/s)       1,094.02    642.15
Block read bandwidth (MB/s)        1,826.43    639.47
Character read bandwidth (MB/s)    4.93        4.31
Character write bandwidth (MB/s)   0.83        0.75
Block write latency (ms)           9.502       37.096
Block read latency (ms)            8.121       4.941
Character read latency (ms)        2.395       3.322
Character write latency (ms)       11.052      13.587
Note: Ephemeral in JBOD/LVM mode is not as fast as Ceph. Numbers can also increase with additional tuning and different devices.
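The block/character bandwidth and latency rows above are the kind of figures a Bonnie++ run (listed later under Testing) reports. A rough sketch of driving one run from Python; the mount point and file size are placeholders, and flags can differ between Bonnie++ versions.

```python
import subprocess

# Hypothetical mount point of the volume under test (Ceph RBD or local ephemeral).
TARGET = "/mnt/voltest"

# -s: file size in MB (should exceed RAM to defeat caching), -n 0: skip small-file tests,
# -u root: user to run as when invoked as root.
subprocess.run(
    ["bonnie++", "-d", TARGET, "-s", "65536", "-n", "0", "-u", "root"],
    check=True,
)
```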
CHALLENGES – LESSONS LEARNED
Network
● It’s all about the network.
− Changed MTU from 1500 to 9000 on certain interfaces (Float interface – Storage interface)
− Hardware Load Balancers – keep an eye on performance
● Hardware
− Moving to more commodity-driven hardware
− All flash storage in compute cluster (high cost, good for block and ephemeral)
Costs
● Storage costs for the Object Store are very high in a converged compute cluster
Analytics
● Need to know how the cluster is being used
● Need to know if the TPS (transactions per second) meets the SLA
● Test directly against nodes, then layer in network components until you can verify every choke point in the data-flow path (see the sketch after this list)
● Monitor and test always
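One way to test directly against nodes and then layer in network components, as suggested above, is to time the same request against an RGW node and against the load-balancer VIP and compare. A rough sketch using the Python requests library; both URLs are placeholders.

```python
import time
import requests

# Placeholders: one RGW node reached directly, and the same service via the hardware LB.
ENDPOINTS = {
    "direct-node": "http://rgw-node-01.example.internal:8080/",
    "via-lb": "http://objectstore.example.internal/",
}

for name, url in ENDPOINTS.items():
    samples = []
    for _ in range(20):
        start = time.time()
        requests.get(url, timeout=5)
        samples.append((time.time() - start) * 1000.0)
    samples.sort()
    # Comparing medians per hop helps isolate which network component adds latency.
    print(f"{name}: median {samples[len(samples) // 2]:.1f} ms")
```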
NEW CEPH OBJECT STORE
OBJECT STORE STACK (RACK CONFIG)
Red Hat 7.1
● 1 TOR and 1 Rack Mgmt Node
● 3 1U nodes (Mon, RGW, Util)
● 17 2U Ceph OSD nodes
● 2x or 3x Replication depending on need (3x default)
● Secondary RGW (may coexist with OSD Node)
● 10g Cluster interface
● 10g Public interface
● 1 IPMI interface
● OSD Nodes (high density server nodes)
− 6TB HDD x 12 – journal partitions on SSD (as sketched below)
− No RAID1 OS drives – instead we partitioned off a small amount of SSD1 for the OS and swap, with the remainder of SSD1 used for some journals and SSD2 used for the remaining journals
− Failure domain is a node
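For the journal-on-SSD layout described above, a hedged sketch of the era-appropriate ceph-disk workflow follows; the device names are placeholders, and newer Ceph releases replace ceph-disk with ceph-volume.

```python
import subprocess

# Placeholders: /dev/sdc is one of the 6TB data HDDs, /dev/sda5 is a journal
# partition carved out of SSD1 (the same SSD that also holds the OS and swap).
DATA_DEV = "/dev/sdc"
JOURNAL_PART = "/dev/sda5"

# ceph-disk creates the OSD data filesystem on the HDD and points its journal
# at the SSD partition; activation then brings the OSD up.
subprocess.run(["ceph-disk", "prepare", "--fs-type", "xfs", DATA_DEV, JOURNAL_PART], check=True)
subprocess.run(["ceph-disk", "activate", DATA_DEV + "1"], check=True)
```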
[Rack diagram: TOR/IPMI switch, 3 1U nodes (Mon/RGW/Util), and converged storage nodes at 2U each]
OBJECT STORE STACK (ARCHITECTURE)
[Architecture diagram: 1 Mon/RGW node per rack; TOR leaf switches connect the storage nodes up to redundant spines and load balancers]
OBJECT STORE STACK
Standard configuration (Archive Cluster)
● Min of 3 Racks = Cluster
● OS – Red Hat 7.1
● Cluster Network: Bonded 10g or higher depending on size of cluster
● Public Network: Bonded 10g for RGW interfaces
● 1 Ceph Mon node per rack, except on clusters with more than 3 racks – we need to keep an odd number of Mons, so some racks may not have one. We try to keep the racks and Mons of larger clusters in different power zones
● We have developed a healthy “Pain” tolerance. We mainly see drive failures and some node failures.
● Min 1 RGW (dedicated Node) per rack (may want more)
● Hardware load balancers to RGWs with redundancy
● Erasure coded pools (no cache tiers at present – testing). We also use a host profile with 8/3 (k/m) – see the sketch after this list
● Near-full and full ratios are .75/.85 respectively
● Index sharding
● Federated (regions/zones)
● All server nodes, no JBOD expansions
● S3 only at present but we do have a few requests for Swift
● Fully AUTOMATED – Chef cookbooks to configure and manage cluster (some Ansible)
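A hedged sketch of how the 8/3 erasure-code profile with a host failure domain and the .75/.85 ratios could be expressed with the ceph CLI. The pool name and PG counts are placeholders, and option spellings vary by Ceph release (older releases use ruleset-failure-domain and injectargs; newer ones crush-failure-domain and set-nearfull-ratio).

```python
import subprocess

def ceph(*args):
    """Run a ceph CLI command and echo it for visibility."""
    cmd = ["ceph"] + list(args)
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 8 data + 3 coding chunks, with each chunk on a different host (node = failure domain).
ceph("osd", "erasure-code-profile", "set", "ec-8-3",
     "k=8", "m=3", "crush-failure-domain=host")

# Placeholder pool name and PG count; PGs should be sized for the actual cluster.
ceph("osd", "pool", "create", "archive-objects", "2048", "2048", "erasure", "ec-8-3")

# Near-full warning at 75% and full (write-blocking) at 85%, as on the slide.
ceph("tell", "mon.*", "injectargs",
     "--mon_osd_nearfull_ratio=0.75 --mon_osd_full_ratio=0.85")
```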
AUTOMATION
All of what we do only happens because of automation
● Company policy – Chef
● Cloud Infrastructure Group uses Chef and Ansible. We use Ansible for orchestration and maintenance
● Bloomberg Github: https://github.com/bloomberg/bcpc
● Ceph specific options
− Ceph Chef: https://github.com/ceph/ceph-chef
− Bloomberg Object Store: https://github.com/bloomberg/chef-bcs
− Ceph Deploy: https://github.com/ceph/ceph-deploy
− Ceph Ansible: https://github.com/ceph/ceph-ansible
● Our bootstrap server is our Chef server per cluster
TESTING
Testing is critical. We use different strategies for the different parts of OpenStack and Ceph we test.
● OpenStack
− Tempest – We currently only use this for patches we make. We plan to use this more in our DevOps pipeline
− Rally – Can’t do distributed testing but we use it to test bottlenecks in OpenStack itself
● Ceph
− RADOS Bench (see the benchmark sketch after this list)
− COS Bench – Going to try this with CBT
− CBT – Ceph Benchmark Testing
− Bonnie++
− FIO
● Ceph – RGW
− Jmeter – Need to test load at scale. It takes a cloud to test a cloud
● A lot of the time you find it’s your network, load balancers, etc.
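A hedged sketch of two of the raw-Ceph tests listed above, RADOS bench and FIO, driven from Python. The pool name and test-file path are placeholders, and the fio options shown are just one common random-write profile.

```python
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 60-second 4MB-object write test against a throwaway pool (placeholder name),
# followed by a sequential-read pass over the objects it left behind.
run(["rados", "bench", "-p", "benchpool", "60", "write", "--no-cleanup"])
run(["rados", "bench", "-p", "benchpool", "60", "seq"])

# fio 4k random writes with direct I/O against a file on the volume under test.
run(["fio", "--name=randwrite", "--filename=/mnt/voltest/fio.dat",
     "--rw=randwrite", "--bs=4k", "--size=4G", "--iodepth=32",
     "--ioengine=libaio", "--direct=1", "--runtime=60", "--time_based"])
```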
CEPH USE CASE DEMAND – GROWING!
[Diagram: Ceph at the center of growing use-case demand – Object, Immutable, OpenStack, Real-time*, Big Data*?]
*Possible use cases if performance is enhanced
WHAT’S NEXT?
Continue to evolve our POD architecture
● OpenStack
− Work on performance improvements and track stats on usage for departments
− Better monitoring
− LBaaS, Neutron
● Containers and PaaS
− We’re currently evaluating PaaS software and container strategies
● Better DevOps Pipelining
− Improved GoCD and/or Jenkins strategies
− Continue to enhance automation and re-provisioning
− Add testing to automation
● Ceph
− New Block Storage Cluster
− Super Cluster design
− Performance improvements – testing Jewel
− RGW Multi-Master (multi-sync) between datacenters
− Enhanced security – encryption at rest (which we can already do) but with better key management
− NVMe for journals and maybe for high-IOPS block devices
− Cache Tier (need validation tests)
THANK YOU
ADDITIONAL RESOURCES
● Chris Jones: [email protected]
− Github: cloudm2
● Chris Morgan: [email protected]
− Github: mihalis68
Cookbooks:
● BCC: https://github.com/bloomberg/bcpc
− Current repo for Bloomberg’s Converged OpenStack and Ceph cluster
● BCS: https://github.com/bloomberg/chef-bcs
● Ceph-Chef: https://github.com/ceph/ceph-chef
The last two repos make up the Ceph Object Store and full Ceph Chef Cookbooks.