Ceph and Mirantis OpenStack

Ceph in Mirantis OpenStack Dmitry Borodaenko Mountain View, 2014


On January 14, 2014, Dmitry Borodaenko presented information on Ceph and OpenStack. This is the slide deck for that presentation.

Transcript of Ceph and Mirantis OpenStack

Page 1: Ceph and Mirantis OpenStack

Ceph in Mirantis OpenStack

Dmitry Borodaenko

Mountain View, 2014

Page 2: Ceph and Mirantis OpenStack

The Plan

1. What is Ceph?
2. What is Mirantis OpenStack?
3. How does Ceph fit into OpenStack?
4. What has Fuel ever done for Ceph?
5. What does it look like?
6. Things we’ve done
7. Disk partitioning for Ceph OSD
8. Cephx authentication settings
9. Types of VM migrations
10. Live VM migrations with Ceph
11. Things we left undone
12. Diagnostics and troubleshooting
13. Resources

Page 3: Ceph and Mirantis OpenStack

What is Ceph?

Ceph is a free clustered storage platform that provides unified object, block, and file storage.

Object Storage: RADOS objects support snapshotting, replication, and consistency.

Block Storage: RBD block devices are thinly provisioned over RADOS objects and can be accessed by QEMU via the librbd library.

File Storage: CephFS metadata servers (MDS) provide a POSIX-compliant overlay over RADOS.

[Diagram labels: Kernel Module, librbd, RADOS Protocol, OSDs, Monitors]

Page 4: Ceph and Mirantis OpenStack

What is Mirantis OpenStack?

OpenStack is an open source cloud computing platform.

[Diagram: Nova, VM, Swift, Cinder, and Glance, linked by arrows labeled "provisions", "stores objects in", "stores images in", "provides volumes for", and "provides images for"]

Mirantis ships hardened OpenStack packages and provides the Fuel utility to simplify deployment of OpenStack and Ceph.

Fuel uses Cobbler, MCollective, and Puppet to discover nodes, provision the OS, and set up OpenStack services.

[Diagram: Fuel master node and target node, with components Nailgun, Astute, Cobbler, MCollective, MCollective Agent, facts, and Puppet, and arrows labeled "serialize", "orchestrate", "provision", "configure", and "start"]

Page 5: Ceph and Mirantis OpenStack

How does Ceph fit into OpenStack?

RBD drivers for OpenStack make libvirt configure the QEMU interface to librbd.

Ceph benefits:

- Multi-node striping and redundancy for block storage (Cinder volumes and Nova ephemeral drives)
- Copy-on-write cloning of images to volumes and instances
- Unified storage pool for all types of storage (object, block, POSIX)
- Live migration of Ceph-backed instances

[Diagram: OpenStack configures libvirt; stack of libvirt, QEMU, librbd, librados, then OSDs and Monitors]

Problems: sensitivity to clock drift, multi-site (async replication in Emperor), block storage density (erasure coding in Firefly), Swift API gap (rbd backend for Swift)

Page 6: Ceph and Mirantis OpenStack

What has Fuel ever done for Ceph?

1. Fuel deploys Ceph Monitors and OSDs on dedicated nodes or in combination with OpenStack components.

[Diagram: controllers 1 to 3, each running controller and ceph-mon; compute nodes 1 to n running nova and the ceph client; storage nodes 1 to n running ceph-osd; nodes connected by the management network and the storage network]

2. Creates partitions for OSDs when nodes are provisioned.
3. Creates separate RADOS pools and sets up Cephx authentication for Cinder, Glance, and Nova.
4. Configures Cinder, Glance, and Nova to use RBD backend with the right pools and credentials.
5. Deploys RADOS Gateway (S3 and Swift API frontend to Ceph) behind HAProxy on controller nodes.
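As a rough illustration of items 3 and 4, the sketch below renders the kind of RBD settings that could end up in cinder.conf and glance-api.conf. The option names are the Havana-era ones described in the Ceph rbd-openstack guide and the pool names follow the slides; the user names, the secret UUID, and the helper `render_rbd_config` are placeholders for illustration, not values or code taken from Fuel.

```python
import configparser
import io


def render_rbd_config(options):
    """Render a flat dict of options as the [DEFAULT] section of an ini file."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names exactly as given
    parser["DEFAULT"] = options
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


# Cinder: store volumes in the "volumes" pool.
cinder_conf = render_rbd_config({
    "volume_driver": "cinder.volume.drivers.rbd.RBDDriver",
    "rbd_pool": "volumes",
    "rbd_user": "volumes",  # assumption: Cephx user name is a placeholder
    "rbd_secret_uuid": "00000000-0000-0000-0000-000000000000",  # placeholder
})

# Glance: store images in the "images" pool.
glance_conf = render_rbd_config({
    "default_store": "rbd",
    "rbd_store_pool": "images",
    "rbd_store_user": "images",  # assumption: Cephx user name is a placeholder
})

print(cinder_conf)
print(glance_conf)
```

Each service gets its own pool and its own Cephx user, so a compromised Glance credential cannot write to Cinder volumes.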

Page 7: Ceph and Mirantis OpenStack

What does it look like?

Select storage options ⇒ assign roles to nodes ⇒ allocate disks:

Page 8: Ceph and Mirantis OpenStack

Things we’ve done

1. Set the right GPT type GUIDs on OSD and journal partitions for udev automount rules
2. ceph-deploy: set up root SSH between Ceph nodes
3. Basic Ceph settings: cephx, pool size, networks
4. Cephx: ceph auth command line can’t be split
5. Rados Gateway: has to be Inktank’s fork of FastCGI, set an infinite revocation interval for UUID auth tokens to work
6. Patch Cinder to convert non-raw images when creating an RBD backed volume from Glance
7. Patch Nova: clone RBD backed Glance images into RBD backed ephemeral volumes, pass RBD user to qemu-img
8. Ephemeral RBD: disable SSH key injection, set up Nova, libvirt, and QEMU for live migrations

Page 9: Ceph and Mirantis OpenStack

Disk partitioning for Ceph OSD

Flow of disk partitioning information during discovery, configuration, provisioning, and deployment:

[Diagram: Fuel master node and target node. Components: Fuel UI, Nailgun, Cobbler, MCAgent, pmanager, parted, sgdisk, Facter, Puppet ceph::osd, ceph-deploy. Data: ceph-osd role volumes, ks_spaces, openstack.json, osd_devices_list, osd:journal. Steps: scan disks, allocation, create partitions (Base OS, OSD, Journal), set type]

GPT partition type GUIDs according to ceph-disk:

JOURNAL_UUID = '45b0969e-9b03-4f30-b4c6-b4b80ceff106'
OSD_UUID = '4fbd7e29-9d25-41b8-afd0-062c0ceff05d'
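A minimal sketch of how a type GUID could be applied to a partition. It only builds the sgdisk command line and does not run it; the device path, the partition number, and the helper name `sgdisk_typecode_cmd` are hypothetical, and this is not the code Fuel's provisioning scripts use.

```python
JOURNAL_UUID = "45b0969e-9b03-4f30-b4c6-b4b80ceff106"
OSD_UUID = "4fbd7e29-9d25-41b8-afd0-062c0ceff05d"


def sgdisk_typecode_cmd(device, partnum, type_guid):
    """Build an sgdisk command that sets the GPT type GUID of one partition."""
    return ["sgdisk", "--typecode={}:{}".format(partnum, type_guid), device]


# Hypothetical layout: partition 1 of /dev/sdb holds OSD data,
# partition 1 of /dev/sdc holds a journal.
print(" ".join(sgdisk_typecode_cmd("/dev/sdb", 1, OSD_UUID)))
print(" ".join(sgdisk_typecode_cmd("/dev/sdc", 1, JOURNAL_UUID)))
```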

If more than one device is allocated for OSD Journal, journal devices are evenly distributed between OSDs.
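One way to read "evenly distributed" is a round-robin assignment of journal devices to OSDs. The sketch below shows that interpretation only; the function name and device paths are made up for the example, and Fuel's actual allocation logic may differ.

```python
def assign_journals(osd_devices, journal_devices):
    """Pair each OSD device with a journal device, cycling through journals.

    Returns a dict mapping OSD device -> journal device (or None when no
    journal devices are allocated).
    """
    if not journal_devices:
        return {osd: None for osd in osd_devices}
    return {
        osd: journal_devices[index % len(journal_devices)]
        for index, osd in enumerate(osd_devices)
    }


pairs = assign_journals(
    ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"],  # hypothetical OSD disks
    ["/dev/sdf", "/dev/sdg"],                          # hypothetical journal disks
)
for osd, journal in pairs.items():
    print("{}:{}".format(osd, journal))  # osd:journal pairs
```

With four OSDs and two journal devices, each journal device ends up serving two OSDs.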

Page 10: Ceph and Mirantis OpenStack

Cephx authentication settings

Monitor ACL is the same for all Cephx users:

    allow r

OSD ACLs vary per OpenStack component:

    Glance: allow class-read object_prefix rbd_children, allow rwx pool=images
    Cinder: allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rx pool=images
    Nova: allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rx pool=images, allow rwx pool=compute

Watch out: Cephx is easily tripped up by unexpected whitespace in ceph auth command line parameters, so we have to keep them all on a single line.
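The whitespace pitfall can be avoided by normalizing the capability string before it reaches the command line. The sketch below builds a `ceph auth get-or-create` invocation as an argument list with the OSD caps collapsed onto one line. The client name `images` and the helper `ceph_auth_cmd` are assumptions for illustration; the command is only constructed, not executed.

```python
def ceph_auth_cmd(client, mon_caps, osd_caps):
    """Build a `ceph auth get-or-create` argv with single-line capabilities."""
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    mon_caps = " ".join(mon_caps.split())
    osd_caps = " ".join(osd_caps.split())
    return ["ceph", "auth", "get-or-create", "client." + client,
            "mon", mon_caps, "osd", osd_caps]


# Caps written across several lines for readability in the source file.
glance_osd_caps = """
    allow class-read object_prefix rbd_children,
    allow rwx pool=images
"""

cmd = ceph_auth_cmd("images", "allow r", glance_osd_caps)
print(cmd)
```

Passing the result to a process launcher as a list (rather than as one shell string) also keeps each capability string as a single argument.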

Page 11: Ceph and Mirantis OpenStack

Types of VM migrations

OpenStack:

Live vs offline: Is VM stopped during migration?
Block vs shared storage vs volume-backed: Is VM data shared between nodes? Is VM metadata (e.g. libvirt domain XML) shared?

Libvirt:

Native vs tunneled: Is VM state transferred directly between hypervisors or tunneled by libvirtd?
Direct vs peer-to-peer: Is migration controlled by libvirt client or by source libvirtd?
Managed vs unmanaged: Is migration controlled by libvirt or by hypervisor itself?

Our type: Live, volume-backed*, native, peer-to-peer, managed.

Page 12: Ceph and Mirantis OpenStack

Live VM migrations with Ceph

- Enable native peer to peer live migration:

  [Diagram: source compute node (VM-A, VM-B, VM-C) and destination compute node (VM-C, VM-D, VM-E), with Nova and libvirtd on each side]

  libvirt VIR_MIGRATE_* flags: LIVE, PEER2PEER, UNDEFINE_SOURCE, PERSIST_DEST

- Patch Nova to decouple shared volumes from shared libvirt metadata logic during live migration
- Set VNC listen address to 0.0.0.0 and block VNC from outside the management network in iptables
- Open ports 49152+ between computes for QEMU migrations
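The four VIR_MIGRATE_* flags named above combine into a single bitmask that libvirt receives with the migration call. The sketch below mirrors the relevant values of libvirt's virDomainMigrateFlags enum in a local class so it runs without the libvirt bindings; the class and the parsing helper are illustrative and are not Nova's code.

```python
import enum


class MigrateFlag(enum.IntFlag):
    """Local copy of selected libvirt virDomainMigrateFlags values."""
    LIVE = 1
    PEER2PEER = 2
    TUNNELLED = 4
    PERSIST_DEST = 8
    UNDEFINE_SOURCE = 16


def parse_flags(names):
    """Turn a comma-separated list of VIR_MIGRATE_* names into a bitmask."""
    flags = MigrateFlag(0)
    for name in names.split(","):
        short = name.strip().replace("VIR_MIGRATE_", "", 1)
        flags |= MigrateFlag[short]
    return flags


flags = parse_flags(
    "VIR_MIGRATE_LIVE, VIR_MIGRATE_PEER2PEER, "
    "VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PERSIST_DEST"
)
print(int(flags))
```

TUNNELLED is left out, which matches the "native" migration type chosen on the previous slide.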

Page 13: Ceph and Mirantis OpenStack

Things we left undone

1. Non-root user with sudo for ceph-deploy
2. Calculate PG numbers based on the number of OSDs
3. Ceph public network should go to a second storage network instead of management
4. Dedicated Monitor nodes, list all Monitors in ceph.conf on each Ceph node
5. Multi-backend configuration for Cinder
6. A better way to configure pools for OpenStack services (than CEPH_ARGS in the init script)
7. Make Nova update VM’s VNC listen address to vncserver_listen of the destination compute after migration
8. Replace ’qemu-img convert’ with clone_image() in LibvirtDriver.snapshot() in Nova
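For item 2, a commonly cited rule of thumb is to aim for roughly 100 placement groups per OSD, divide by the replica count, and round up to a power of two. The sketch below implements that heuristic only; the function name and the default of 100 are assumptions for illustration, and this is not something Fuel did at the time of the talk.

```python
def suggested_pg_num(num_osds, replicas, pgs_per_osd=100):
    """Estimate a total PG count: (OSDs * PGs per OSD) / replicas,
    rounded up to the next power of two."""
    if num_osds < 1 or replicas < 1:
        raise ValueError("num_osds and replicas must be positive")
    target = num_osds * pgs_per_osd / replicas
    pg_num = 1
    while pg_num < target:
        pg_num *= 2
    return pg_num


print(suggested_pg_num(10, 3))
print(suggested_pg_num(4, 2))
```

The result is a budget for the whole cluster, so it still has to be split between the images, volumes, and compute pools.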

Page 14: Ceph and Mirantis OpenStack

Diagnostics and troubleshooting

    ceph -s
    ceph osd tree
    cinder create 1
    rados df
    qemu-img convert -O raw cirros.qcow2 cirros.raw
    glance image-create --name cirros-raw --is-public yes \
        --container-format bare --disk-format raw < cirros.raw
    nova boot --flavor 1 --image cirros-raw vm0
    nova live-migration vm0 node-3

disk partitioning failed during provisioning – check if traces of previous partition tables are left on any drives

’ceph-deploy config pull’ failed – check if the node can ssh to the primary controller over management network

HEALTH_WARN: clock skew detected – check your ntpd settings, make sure your NTP server is reachable from all nodes

ENOSPC when storing small objects in RGW – try setting a smaller rgw object stripe size

Page 15: Ceph and Mirantis OpenStack

Resources

Read the docs:

http://ceph.com/docs/next/rbd/rbd-openstack/
http://docs.mirantis.com/fuel/fuel-4.0/
http://libvirt.org/migration.html
http://docs.openstack.org/admin-guide-cloud/content/ch_introduction-to-openstack-compute.html

Get the code:

- Mirantis OpenStack ISO image and VirtualBox scripts,
- ceph Puppet module for Fuel,
- Josh Durgin’s havana-ephemeral-rbd branch for Nova.

Vote on Nova bugs: #1226351, #1261675, #1262450, #1262914.

Sign up for Mirantis and Inktank webcast on Ceph and OpenStack.