Overview of sheepdog

Transcript of Overview of sheepdog

Page 1: Overview of sheepdog

Sheepdog Overview

Liu Yuan

2013.4.27

Page 2: Overview of sheepdog

Sheepdog – Distributed Object Storage

● Replicated shared storage for VM

● Most intelligent storage in OSS

– Self-healing

– Self-managing

– No configuration file

– One-liner setup

● Scale-out (1000+ nodes)

● Integrates well with QEMU/libvirt/OpenStack

Page 3: Overview of sheepdog

Agenda

● Background Knowledge

● Node management

● Data management

● Thin-provisioning

● Sheepfs

● Future features

Page 4: Overview of sheepdog

Background Knowledge

● VM Storage stack

● QEMU/KVM stack

● Virtual Disk

● IO Requests Type

● Write Cache

● QEMU Snapshot

Page 5: Overview of sheepdog

VM Storage Stack

Guest File System

Guest Block Driver

QEMU Image Format

QEMU Disk Emulation

QEMU Format Protocol

POSIX file, Raw device, Sheepdog, Ceph

● Sheepdog block driver in QEMU is implemented at the protocol layer

● Supports all QEMU image formats

● Raw format as default

– Best performance

● Snapshot is supported by the Sheepdog protocol

Page 6: Overview of sheepdog

QEMU/KVM Stack

VCPU VCPU

Kernel

VM

PCPU PCPU

VM_ENTRY

IO Requests

KVMeventfd

Virtual Disk

VM_EXIT

Sheepdog

QEMU

Network

Page 7: Overview of sheepdog

Virtual Disk

● Transports

– ATA, SCSI, Virtio

– Virtio: designed for VMs

● Simpler interface, better performance

● Virtio-scsi

– Enhancement of virtio-blk

– Advanced DISCARD operation support

● Write cache

– Essential for distributed backend storage to boost performance

Page 8: Overview of sheepdog

IO Request Types of a VD

● Read/Write

● Discard

– The VM's file system (ext4, XFS) transparently informs the underlying storage backend to release blocks

● Flush

– Ensures dirty data reaches the underlying backend storage

● Write Cache Enable (WCE)

– The VM uses it to change the VD cache mode on the fly
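The four request types can be sketched as a toy dispatcher — a hypothetical illustration only, with class and method names made up for this example, not Sheepdog's internals:

```python
# Toy model of a virtual disk serving the four request types.
class VirtualDisk:
    def __init__(self):
        self.blocks = {}        # block index -> data, allocated on demand
        self.dirty = set()      # written blocks not yet flushed
        self.write_cache = True

    def handle(self, req, idx=None, data=None):
        if req == "READ":
            # unwritten blocks read back as zeros
            return self.blocks.get(idx, b"\x00" * 512)
        if req == "WRITE":
            self.blocks[idx] = data
            self.dirty.add(idx)            # sits in the write cache
        elif req == "DISCARD":
            self.blocks.pop(idx, None)     # release the backing blocks
            self.dirty.discard(idx)
        elif req == "FLUSH":
            self.dirty.clear()             # dirty data is now durable
        elif req == "WCE":
            self.write_cache = data        # toggle cache mode on the fly

vd = VirtualDisk()
vd.handle("WRITE", 7, b"x" * 512)
vd.handle("FLUSH")
vd.handle("DISCARD", 7)
```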

Page 9: Overview of sheepdog

Write Cache

● Not a memory cache like the page cache

– Direct IO (O_DIRECT) bypasses the page cache but not the write cache

– O_SYNC or fsync(2) flushes the write cache

● All modern disks have one, and it is well supported by the OS

● Most virtual devices emulate a write cache

– As safe as a well-behaved hard-disk cache
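A minimal demonstration of the flush semantics above: O_DIRECT only skips the OS page cache, while fsync(2) (or O_SYNC) is what drains the device's volatile write cache so dirty data becomes durable.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"dirty data")
os.fsync(fd)        # page cache AND the disk write cache are flushed here
os.close(fd)

with open(path, "rb") as f:
    content = f.read()
os.unlink(path)
```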

Page 10: Overview of sheepdog

QEMU Snapshot

● Two types of state

– Memory state (VM state) and disk state

● Users can optionally save

– VM state only

– VM state + disk state

– Disk state only

● Internal snapshot & external snapshot

– Sheepdog chooses external snapshots

Page 11: Overview of sheepdog

Node management

● Node Add/Delete

● Dual NIC

Page 12: Overview of sheepdog

Node Add/Delete

● One-liner to add or delete a node

– Add node

● $ sheep /store               # use corosync, or:

● $ sheep /store -c zookeeper:IP

– Delete node

● $ kill sheep

– Supports group add/kill

● Relies on Corosync or ZooKeeper

– Membership change events

– Cluster-wide ordered messages

Page 13: Overview of sheepdog

Pic. from http://www.osrg.net/sheepdog/

Page 14: Overview of sheepdog

Dual NIC

● One NIC for control messages (heartbeat), the other for data transfer

– If the data NIC goes down, data transfer falls back to the control NIC

– But if the control NIC goes down, the node is considered dead

● Single NIC

– Control and data share it

Page 15: Overview of sheepdog

Data Management

● Object Management

● VM Request Management

● Auto-weighting

● Multi-disk

● Object Cache

● Journaling

Page 16: Overview of sheepdog

Object Management

● Data is stored as replicated objects

– An object is a plain fixed-size POSIX file

● Objects are auto-rebalanced on node add/delete/crash events

● Replicas are auto-recovered

● Each VDI can have a different copy count

● Supports SAN-like, SAN-less, or even mixed architectures
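A hypothetical sketch of how replicated-object placement can work: hash every node and every object onto one ring; an object's N copies live on the next N distinct nodes clockwise. When a node joins or leaves, only objects whose successor set changed need to move, which is what makes auto-rebalance cheap. Function names here are illustrative, not Sheepdog's code.

```python
import hashlib
from bisect import bisect_right

def ring_hash(key):
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

def replica_nodes(obj_id, nodes, copies):
    # sort nodes by their position on the hash ring
    ring = sorted((ring_hash(n), n) for n in nodes)
    points = [p for p, _ in ring]
    start = bisect_right(points, ring_hash(obj_id))
    out = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]   # walk clockwise
        if node not in out:
            out.append(node)                       # next distinct node
        if len(out) == copies:
            break
    return out

nodes = ["node-a", "node-b", "node-c", "node-d"]
owners = replica_nodes("vdi1/obj42", nodes, 3)
```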

Page 17: Overview of sheepdog

Pic. from http://www.osrg.net/sheepdog/

Page 18: Overview of sheepdog

VM Request Management

● Parallel request handling

– Every node can handle requests concurrently

● Requests are served even during node change events

– VM requests are prioritized against replica recovery requests

– VM requests are retried until they succeed during node change events

Page 19: Overview of sheepdog

Auto-weighting

● Node storage is auto-weighted

– Nodes of different sizes store only their proportional share

● Uses consistent hashing + virtual nodes

● Users can specify the exported space

– All free space is used by default
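The weighting idea can be sketched with virtual nodes: each node gets ring points in proportion to its capacity, so a 2 TB node ends up owning roughly twice as many objects as a 1 TB node. The point counts and names below are illustrative, not Sheepdog's actual parameters.

```python
import hashlib
from collections import Counter

def ring_hash(key):
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

def build_ring(capacity_tb, points_per_tb=128):
    ring = []
    for node, tb in capacity_tb.items():
        for v in range(tb * points_per_tb):       # virtual nodes
            ring.append((ring_hash(f"{node}#{v}"), node))
    return sorted(ring)

def lookup(ring, obj_id):
    x = ring_hash(obj_id)
    for point, node in ring:
        if point >= x:
            return node
    return ring[0][1]       # wrap around the ring

ring = build_ring({"small": 1, "big": 2})
owners = Counter(lookup(ring, f"obj{i}") for i in range(3000))
# "big" should own roughly twice as many objects as "small"
```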

Page 20: Overview of sheepdog

Multi-disk

● A single daemon manages multiple disks

– $ sheep /disk1,/disk2{,disk3...}

– Auto-weighting

– Auto-rebalance

– Recovers objects from other sheep

● Simply put, MD = RAID 0 + auto-recovery

● Eliminates the need for hardware RAID

– Supports hot-plug/unplug

Page 21: Overview of sheepdog

Object cache

● Sheepdog's write cache for the virtual disk

– $ sheep -w size=100G /store

● $ qemu -drive cache={writeback|writethrough|off}

– Supports writeback, writethrough, and directio

– LRU algorithm for reclaiming

– Objects are shared between VMs cloned from the same base

● Use an SSD for the object cache to get a boost
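LRU reclaiming can be sketched like this: when the cache is full, the least-recently-used object is pushed back to the cluster and evicted. Class and attribute names are made up for illustration.

```python
from collections import OrderedDict

class ObjectCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.objects = OrderedDict()   # object id -> data, in LRU order
        self.pushed = []               # ids written back to the cluster

    def access(self, oid, data):
        if oid in self.objects:
            self.objects.move_to_end(oid)       # now most recently used
            return
        if len(self.objects) >= self.capacity:
            victim, _ = self.objects.popitem(last=False)   # evict LRU
            self.pushed.append(victim)          # push back before reclaim
        self.objects[oid] = data

cache = ObjectCache(capacity=2)
cache.access("a", b"...")
cache.access("b", b"...")
cache.access("a", b"...")   # touch "a" so "b" becomes least recently used
cache.access("c", b"...")   # cache full: "b" is pushed and evicted
```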

Page 22: Overview of sheepdog

Object cache

Virtual Disk

Object Cache

R&W, FLUSH

VM

PUSH & PULL

Sheepdog Cluster

Page 23: Overview of sheepdog

Journaling

● $ sheep -j dir=/path/to/journal /store

● Sheepdog uses O_SYNC writes by default

● Object writes are fairly random

● All write operations are logged as appends to a rotated log file

– Transforms random writes into sequential writes

– Object writes can then drop O_SYNC

● Boosts performance + avoids partial writes
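The journaling idea, sketched under assumed names and a made-up record format: every random object write is first appended to a sequential journal and synced; the object files can then be written without O_SYNC, and a crash is repaired by replaying the journal.

```python
import json
import os
import tempfile

class Journal:
    def __init__(self, path):
        self.f = open(path, "a+b")

    def log(self, oid, offset, data):
        rec = {"oid": oid, "off": offset, "data": data.decode()}
        self.f.write(json.dumps(rec).encode() + b"\n")   # sequential append
        self.f.flush()
        os.fsync(self.f.fileno())   # journal is synced: the write is durable

    def replay(self):
        self.f.seek(0)
        return [json.loads(line) for line in self.f]

fd, path = tempfile.mkstemp()
os.close(fd)
j = Journal(path)
j.log("obj1", 4096, b"hello")   # scattered offsets ...
j.log("obj2", 0, b"world")      # ... become one sequential log
records = j.replay()
os.unlink(path)
```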

Page 24: Overview of sheepdog

Thin-provisioning

● Sparse Volume

● Discard Operation

● COW Snapshot

Page 25: Overview of sheepdog

Sparse Volume

● Only one inode object is allocated for a new VDI by default

– Instant creation of new VDIs

● Data objects are created on demand

● Users can preallocate data objects

– Not recommended; the performance gain is very limited
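A small demonstration of why sparse allocation makes creation instant: a sparse file has its full logical size immediately, but data blocks are only allocated where something is actually written, and unwritten ranges read back as zeros.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "r+b") as f:
    f.truncate(10 * 1024 * 1024)   # 10 MB logical size, no data written yet
    f.seek(1024 * 1024)
    f.write(b"data")               # allocate only this region on demand

size = os.stat(path).st_size
with open(path, "rb") as f:
    head = f.read(4)               # an unwritten range
os.unlink(path)
```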

Page 26: Overview of sheepdog

Discard operation

● Releases objects when users delete files inside the VM

● Only supported on IDE and virtio-scsi devices

– CentOS 6.3+

– OS running a vanilla kernel 3.4+

– Requires QEMU 1.5+

Page 27: Overview of sheepdog

Snapshot

● Live snapshot (VM state + vdisk)

– Save the snapshot in Sheepdog

● QEMU monitor > savevm tag

– Restore the snapshot on the fly

● QEMU monitor > loadvm tag

– Restore the snapshot at boot

● $ qemu -hda sheepdog -loadvm tag

● Live or offline snapshot (vdisk only)

– $ qemu-img snapshot sheepdog:disk

Page 28: Overview of sheepdog

Snapshot cont.

● Tree-structured snapshots

[diagram: snapshot tree rooted at "base"]

● Roll back to any snapshot and make your own branch

Page 29: Overview of sheepdog

Snapshot cont.

● All snapshots are COW-based

– Only an inode object is created for the snapshot

– Taken instantly

● Supports incremental snapshot backup

● Read a snapshot from outside the cluster

– $ collie vdi read -s tag disk

● Snapshots are stored in Sheepdog storage and thus shared by all nodes
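The COW mechanism can be sketched as follows: taking a snapshot copies only the inode (the map from VDI blocks to object ids), never the data objects; a later write to the active VDI allocates a fresh object, leaving the snapshot's view untouched. All names below are illustrative, not Sheepdog's data structures.

```python
object_store = {}    # object id -> data, shared by all inodes
_next_oid = 0

def new_object(data):
    global _next_oid
    object_store[_next_oid] = data
    _next_oid += 1
    return _next_oid - 1

class Inode:
    def __init__(self, table=None):
        self.table = dict(table or {})   # VDI block -> object id

    def snapshot(self):
        return Inode(self.table)         # copy the inode only: instant

    def write(self, block, data):
        self.table[block] = new_object(data)   # COW: always a new object

    def read(self, block):
        return object_store.get(self.table.get(block), b"\x00")

base = Inode()
base.write(0, b"v1")
snap = base.snapshot()   # no data objects copied
base.write(0, b"v2")     # the snapshot still sees the old object
```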

Page 30: Overview of sheepdog

Sheepfs

● FUSE-based pseudo file system that exports Sheepdog's virtual disks

– $ sheepfs /mountpoint

● Mounts a vdisk into the local file system hierarchy as a block file

– $ echo vdisk > /mountpoint/vdi/mount

– Then /mountpoint/volume/vdisk will show up

Page 31: Overview of sheepdog

Future Features

● Cluster-wide snapshot

– Useful for backup and inter-cluster VDI migration/sharing

– Dedup, compression, incremental snapshots

● QEMU-SD connection auto-restart

– Useful for upgrading sheep without stopping the VM

● QEMU-SD multi-connection

– Higher-availability VMs

Page 32: Overview of sheepdog

Thank You