Ceph Block Devices: A Deep Dive

Josh Durgin, RBD Lead, June 24, 2015

Transcript of Ceph Block Devices: A Deep Dive

Page 1: Ceph Block Devices: A Deep Dive

Ceph Block Devices: A Deep Dive
Josh Durgin, RBD Lead
June 24, 2015

Page 2: Ceph Block Devices: A Deep Dive

Ceph Motivating Principles

● All components must scale horizontally
● There can be no single point of failure
● The solution must be hardware agnostic
● Should use commodity hardware
● Self-manage wherever possible
● Open Source (LGPL)
● Move beyond legacy approaches
  – client/cluster instead of client/server
  – Ad hoc HA

Page 3: Ceph Block Devices: A Deep Dive

Ceph Components

● RGW: A web services gateway for object storage, compatible with S3 and Swift
● LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● RBD: A reliable, fully-distributed block device with cloud platform integration
● CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management

(Diagram: APP uses RGW or LIBRADOS, HOST/VM uses RBD, CLIENT uses CEPHFS; all are built on RADOS)

Page 4: Ceph Block Devices: A Deep Dive

Ceph Components (same component overview and diagram as Page 3)

Page 5: Ceph Block Devices: A Deep Dive

Storing Virtual Disks

(Diagram: a VM's disk is provided by LIBRBD inside the HYPERVISOR, which talks directly to OSDs and monitors in the RADOS CLUSTER)

Page 6: Ceph Block Devices: A Deep Dive

Kernel Module

(Diagram: a LINUX HOST uses the KRBD kernel module to talk directly to OSDs and monitors in the RADOS CLUSTER)

Page 7: Ceph Block Devices: A Deep Dive

RBD

● Stripe images across entire cluster (pool)
● Read-only snapshots
● Copy-on-write clones
● Broad integration
  – QEMU, libvirt
  – Linux kernel
  – iSCSI (STGT, LIO)
  – OpenStack, CloudStack, OpenNebula, Ganeti, Proxmox, oVirt
● Incremental backup (relative to snapshots)
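As an illustration of the incremental-backup building block, here is a minimal python-rbd sketch using Image.diff_iterate to list the extents that changed since a snapshot. The pool, image, and snapshot names are placeholders, not from the talk.

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')            # placeholder pool name

changed = []

def record_extent(offset, length, exists):
    # exists=False means the extent was discarded since the snapshot
    changed.append((offset, length, exists))

with rbd.Image(ioctx, 'vm-disk') as image:   # placeholder image name
    # Walk only the data that changed since snapshot 'nightly'
    image.diff_iterate(0, image.size(), 'nightly', record_extent)

print(changed)
ioctx.close()
cluster.shutdown()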

Page 8: Ceph Block Devices: A Deep Dive
Page 9: Ceph Block Devices: A Deep Dive

Ceph Components (same component overview and diagram as Page 3)

Page 10: Ceph Block Devices: A Deep Dive

RADOS

● Flat namespace within a pool
● Rich object API (sketch below)
  – Bytes, attributes, key/value data
  – Partial overwrite of existing data
  – Single-object compound atomic operations
  – RADOS classes (stored procedures)
● Strong consistency (CP system)
● Infrastructure aware, dynamic topology
● Hash-based placement (CRUSH)
● Direct client to server data path
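A minimal python-rados sketch of that object API, writing bytes, an attribute, and key/value (omap) data to one object. The pool name 'rbd' and object name 'greeting' are placeholders, and the omap write-op calls assume a python-rados version that provides WriteOpCtx (the exact usage varies slightly across versions).

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                 # placeholder pool name

ioctx.write_full('greeting', b'hello rados')      # bytes
ioctx.set_xattr('greeting', 'owner', b'demo')     # attributes

# key/value (omap) data applied through a single atomic write operation;
# WriteOpCtx construction differs between python-rados versions
with rados.WriteOpCtx() as write_op:
    ioctx.set_omap(write_op, ('color',), (b'blue',))
    ioctx.operate_write_op(write_op, 'greeting')

ioctx.close()
cluster.shutdown()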

Page 11: Ceph Block Devices: A Deep Dive

RADOS Components

OSDs:
● 10s to 1000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery

Monitors (M):
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number (e.g., 5)
● Not part of data path

Page 12: Ceph Block Devices: A Deep Dive

Ceph Components (same component overview and diagram as Page 3)

Page 13: Ceph Block Devices: A Deep Dive

Metadata

● rbd_directory
  – Maps image name to id, and vice versa
● rbd_children
  – Lists clones in a pool, indexed by parent
● rbd_id.$image_name
  – The internal id, locatable using only the user-specified image name
● rbd_header.$image_id
  – Per-image metadata
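For illustration, a hedged python-rados sketch of poking at these objects directly; the pool and image names are placeholders, and librbd normally resolves all of this for you.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')              # pool containing the image

# rbd_id.$image_name stores the internal image id
# (as a length-prefixed string in ceph's encoding)
raw_id = ioctx.read('rbd_id.vm-disk')          # 'vm-disk' is a placeholder
print(raw_id)

# Per-image metadata (size, features, snapshot info, ...) lives in
# rbd_header.$image_id, mostly as omap key/value pairs.

ioctx.close()
cluster.shutdown()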

Page 14: Ceph Block Devices: A Deep Dive

Data

● rbd_data.* objects
  – Named based on offset in image (sketch below)
  – Non-existent to start with
  – Plain data in each object
  – Snapshots handled by rados
  – Often sparse
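A rough sketch of the naming scheme for format 2 images, assuming the default 4 MiB object size; the image id shown is made up, and the real prefix comes from the image header.

# Which rbd_data object backs a given byte offset of the image
image_id = '203d1b5a7682'          # hypothetical internal image id
object_size = 4 * 1024 * 1024      # default 4 MiB objects

def data_object_name(offset):
    object_no = offset // object_size
    return 'rbd_data.%s.%016x' % (image_id, object_no)

print(data_object_name(0))             # rbd_data.203d1b5a7682.0000000000000000
print(data_object_name(10 * 2**20))    # 10 MiB falls into object number 2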

Page 15: Ceph Block Devices: A Deep Dive

Striping

● Objects are uniformly sized
  – Default is simple 4MB divisions of device (example below)
● Randomly distributed among OSDs by CRUSH
● Parallel work is spread across many spindles
● No single set of servers responsible for image
● Small objects lower OSD usage variance
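A hedged python-rbd sketch of choosing a non-default object size at creation time (pool and image names are placeholders): 'order' is log2 of the object size, so 22 is the 4 MB default and 23 gives 8 MB objects.

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

# 10 GiB format 2 image striped into 8 MiB objects instead of the default 4 MiB
rbd.RBD().create(ioctx, 'striped-demo', 10 * 2**30, order=23, old_format=False)

ioctx.close()
cluster.shutdown()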

Page 16: Ceph Block Devices: A Deep Dive

I/O

(Diagram: VM I/O passes through LIBRBD and its CACHE inside the HYPERVISOR, then goes directly to OSDs and monitors in the RADOS CLUSTER)

Page 17: Ceph Block Devices: A Deep Dive

Snapshots

● Object granularity
● Snapshot context [list of snap ids, latest snap id]
  – Stored in rbd image header (self-managed)
  – Sent with every write
● Snapshot ids managed by monitors
● Deleted asynchronously
● RADOS keeps per-object overwrite stats, so diffs are easy
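A minimal python-rbd sketch of the user-visible side (image and snapshot names are placeholders); librbd updates the snapshot context in the image header and tags later writes with it behind the scenes.

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

with rbd.Image(ioctx, 'vm-disk') as image:
    image.write(b'old data', 0)
    image.create_snap('before-upgrade')    # read-only, object-granularity snapshot
    image.write(b'new data', 0)            # this write carries the new snap context

# Reading through the snapshot returns the frozen contents
with rbd.Image(ioctx, 'vm-disk', snapshot='before-upgrade') as snap_view:
    print(snap_view.read(0, 8))            # b'old data'

ioctx.close()
cluster.shutdown()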

Page 18: Ceph Block Devices: A Deep Dive

Snapshots

(Diagram: LIBRBD, with snap context ([], 7), sends a write)

Page 19: Ceph Block Devices: A Deep Dive

Snapshots

(Diagram: LIBRBD issues "Create snap 8")

Page 20: Ceph Block Devices: A Deep Dive

Snapshots

(Diagram: LIBRBD sets its snap context to ([8], 8))

Page 21: Ceph Block Devices: A Deep Dive

Snapshots

(Diagram: LIBRBD writes with snap context ([8], 8))

Page 22: Ceph Block Devices: A Deep Dive

Snapshots

(Diagram: LIBRBD writes with snap context ([8], 8); the OSD preserves the object's previous contents for snap 8 before applying the write)

Page 23: Ceph Block Devices: A Deep Dive

Watch/Notify

● establish stateful 'watch' on an object
  – client interest persistently registered with object
  – client keeps connection to OSD open
● send 'notify' messages to all watchers
  – notify message (and payload) sent to all watchers
  – notification (and reply payloads) on completion
● strictly time-bounded liveness check on watch
  – no notifier falsely believes we got a message
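A hedged sketch of the same mechanism through python-rados, assuming a build that exposes Ioctx.watch() and Ioctx.notify() (newer than the Hammer-era bindings); the header object name is a placeholder and the callback arguments are left generic since they vary by binding version.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

HEADER = 'rbd_header.1234'                # hypothetical image header object

def on_notify(*args):
    # Invoked for every notify delivered to this watcher
    print('notified:', args)

watch = ioctx.watch(HEADER, on_notify)    # stateful watch, kept alive by the client

# Another client (or this one) tells all watchers to refresh their view;
# it returns once acks and reply payloads are collected or the timeout hits.
ioctx.notify(HEADER, 'header updated')

watch.close()
ioctx.close()
cluster.shutdown()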

Page 24: Ceph Block Devices: A Deep Dive

Watch/Notify

(Diagram: two LIBRBD clients, one inside a HYPERVISOR and one inside /usr/bin/rbd, each establish a watch on the image's header object)

Page 25: Ceph Block Devices: A Deep Dive

Watch/Notify

(Diagram: the /usr/bin/rbd client adds a snapshot, updating the image header)

Page 26: Ceph Block Devices: A Deep Dive

Watch/Notify

(Diagram: a notify is sent to all watchers of the header object)

Page 27: Ceph Block Devices: A Deep Dive

Watch/Notify

(Diagram: the watcher sends a notify ack; once all acks arrive, the notifier gets a notify complete)

Page 28: Ceph Block Devices: A Deep Dive

Watch/Notify

(Diagram: after the notification, the hypervisor's LIBRBD re-reads the image metadata)

Page 29: Ceph Block Devices: A Deep Dive

Clones

● Copy-on-write (and optionally copy-on-read, in Hammer)
● Object granularity
● Independent settings
  – striping, feature bits, object size can differ
  – can be in different pools
● Clones are based on protected snapshots
  – 'protected' means they can't be deleted
● Can be flattened
  – Copy all data from parent
  – Remove parent relationship
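A hedged python-rbd sketch of that lifecycle: snapshot and protect a parent, clone it, and later flatten the clone. All pool, image, and snapshot names are placeholders.

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
r = rbd.RBD()

# Protect the parent snapshot so clones can depend on it
with rbd.Image(ioctx, 'golden-image') as parent:
    parent.create_snap('base')
    parent.protect_snap('base')

# Copy-on-write clone; the clone could use a different pool or settings
r.clone(ioctx, 'golden-image', 'base', ioctx, 'vm-disk',
        features=rbd.RBD_FEATURE_LAYERING)

# Flattening copies the remaining parent data and removes the parent link
with rbd.Image(ioctx, 'vm-disk') as child:
    child.flatten()

ioctx.close()
cluster.shutdown()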

Page 30: Ceph Block Devices: A Deep Dive

Clones - read

(Diagram: LIBRBD reads from the clone; objects not yet present in the clone are read from the parent snapshot)

Page 31: Ceph Block Devices: A Deep Dive

Clones - write

(Diagram: LIBRBD sends a write to a clone object that does not exist yet)

Page 32: Ceph Block Devices: A Deep Dive

Clones - write

(Diagram: because the clone object does not exist, LIBRBD first reads the corresponding object from the parent)

Page 33: Ceph Block Devices: A Deep Dive

Clones - write

(Diagram: LIBRBD copies the parent data up into the clone object and then applies the write: "copy up, write")

Page 34: Ceph Block Devices: A Deep Dive

Ceph and OpenStack

Page 35: Ceph Block Devices: A Deep Dive

Virtualized Setup

● Secret key stored by libvirt
● XML defining VM fed in, includes
  – Monitor addresses
  – Client name
  – QEMU block device cache setting
    ● Writeback recommended
  – Bus on which to attach block device
    ● virtio-blk/virtio-scsi recommended
    ● IDE ok for legacy systems
  – Discard options (ide/scsi only), I/O throttling if desired
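For illustration, a hedged sketch of wiring this up through the libvirt Python bindings, with only the RBD <disk> element shown inline; the monitor address, secret UUID, pool/image name, and the rest of the domain XML are placeholder assumptions.

import libvirt

# Only the <disk> element is shown; the full domain XML is elided.
rbd_disk_xml = """
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>  <!-- writeback recommended -->
  <auth username='libvirt'>
    <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
  </auth>
  <source protocol='rbd' name='rbd/vm-disk'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>                     <!-- virtio recommended -->
</disk>
"""

conn = libvirt.open('qemu:///system')
# domain_xml would be the full VM definition embedding rbd_disk_xml above
# conn.defineXML(domain_xml)
conn.close()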

Page 36: Ceph Block Devices: A Deep Dive

Virtualized Setup

(Diagram: libvirt reads vm.xml and launches "qemu -enable-kvm ..."; the VM's block device goes through LIBRBD and its CACHE inside the QEMU process)

Page 37: Ceph Block Devices: A Deep Dive

Kernel RBD

● rbd map sets everything up
● /etc/ceph/rbdmap is like /etc/fstab (example below)
● udev adds handy symlinks:
  – /dev/rbd/$pool/$image[@snap]
● striping v2 and later feature bits not supported yet
● Can be used to back LIO, NFS, SMB, etc.
● No specialized cache; the page cache is used by the filesystem on top
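For illustration, a hedged example of an /etc/ceph/rbdmap entry and the device paths it produces; the pool, image, user, and keyring path are placeholders.

# /etc/ceph/rbdmap
rbd/vm-disk    id=admin,keyring=/etc/ceph/ceph.client.admin.keyring

# After 'rbd map' (or the rbdmap service) runs, udev creates:
#   /dev/rbd0
#   /dev/rbd/rbd/vm-disk -> /dev/rbd0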

Page 38: Ceph Block Devices: A Deep Dive
Page 39: Ceph Block Devices: A Deep Dive

What's new in Hammer

● Copy-on-read
● Librbd cache enabled in safe mode by default
● Readahead during boot
● LTTng tracepoints
● Allocation hints
● Cache hints
● Exclusive locking (off by default for now)
● Object map (off by default for now)

Page 40: Ceph Block Devices: A Deep Dive

Infernalis

● Easier space tracking (rbd du)
● Faster differential backups
● Per-image metadata
  – Can persist rbd options
● RBD journaling
● Enabling new features on the fly

Page 41: Ceph Block Devices: A Deep Dive

Future Work

● Kernel client catch-up
● RBD mirroring
● Consistency groups
● QoS for rados (policy at rbd level)
● Active/Active iSCSI
● Performance improvements
  – NewStore OSD backend
  – Improved cache tiering
  – Finer-grained caching
  – Many more

Page 42: Ceph Block Devices: A Deep Dive

Questions?

Josh Durgin

[email protected]