Ceph Day New York 2014: Future of CephFS
Future of CephFS
Sage Weil
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
[Diagram: apps, hosts/VMs, and clients sitting on top of the Ceph stack]
[Diagram: a CephFS client sends metadata operations to the MDS cluster and file data directly to the OSDs; monitors (M) track cluster state]
Metadata Server
● Manages metadata for a POSIX-compliant shared filesystem
  – Directory hierarchy
  – File metadata (owner, timestamps, mode, etc.)
● Stores metadata in RADOS
● Does not serve file data to clients
● Only required for shared filesystem
legacy metadata storage
● a scaling disaster
  – name → inode → block list → data
  – no inode table locality
  – fragmentation
    – inode table
    – directory
  – many seeks
  – difficult to partition
[Figure: conventional directory tree (/usr, /etc, /var, /home, ...) with inodes and block lists scattered across the disk]
ceph fs metadata storage
● block lists unnecessary
● inode table mostly useless
  – APIs are path-based, not inode-based
  – no random table access, sloppy caching
● embed inodes inside directories
  – good locality, prefetching
  – leverage key/value object
[Figure: the same directory tree with inodes (1, 100, 102) embedded directly in their parent directory objects]
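The embedded-inode layout maps naturally onto RADOS key/value objects. As a purely illustrative sketch (not CephFS's actual on-disk format), a directory can be modeled as one omap object with the python-rados bindings; the pool, object name, and record format below are assumptions:

import rados

# Illustrative only: one RADOS object per directory, whose omap maps entry
# names to embedded inode records. Pool, object, and record layout are made up.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('metadata')

with rados.WriteOpCtx() as op:
    ioctx.set_omap(op,
                   ('hosts', 'mtab', 'passwd'),                 # directory entries
                   (b'inode:100', b'inode:101', b'inode:102'))  # embedded inode records
    ioctx.operate_write_op(op, 'dir.etc')  # all entries land in one atomic update

ioctx.close()
cluster.shutdown()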
controlling metadata io
● view ceph-mds as cache
● reduce reads
  – dir+inode prefetching
● reduce writes
  – consolidate multiple writes
● large journal or log
  – stripe over objects
● two tiers
  – journal for short term
  – per-directory for long term
● fast failure recovery
one tree, three metadata servers: how do we partition?
load distribution
● coarse (static subtree)
  – preserve locality
  – high management overhead
● fine (hash)
  – always balanced
  – less vulnerable to hot spots
  – destroy hierarchy, locality
● can a dynamic approach capture benefits of both extremes?
[Diagram: spectrum from static subtree (good locality) through hashed directories to hashed files (good balance)]
DYNAMIC SUBTREE PARTITIONING
● scalable
  – arbitrarily partition metadata
● adaptive
  – move work from busy to idle servers
  – replicate hot metadata
● efficient
  – hierarchical partition preserves locality
● dynamic
  – daemons can join/leave
  – take over for failed nodes
dynamic subtree partitioning
Dynamic partitioning: many directories vs. the same directory
Failure recovery
Metadata replication and availability
Metadata cluster scaling
client protocol
● highly stateful
  – consistent, fine-grained caching
● seamless hand-off between ceph-mds daemons
  – when client traverses hierarchy
  – when metadata is migrated between servers
● direct access to OSDs for file I/O
an example (RT = round trip)
● mount -t ceph 1.2.3.4:/ /mnt
  – 3 ceph-mon RT
  – 2 ceph-mds RT (1 ceph-mds to ceph-osd RT)
● cd /mnt/foo/bar
  – 2 ceph-mds RT (2 ceph-mds to ceph-osd RT)
● ls -al
  – open
  – readdir
    – 1 ceph-mds RT (1 ceph-mds to ceph-osd RT)
  – stat each file
  – close
● cp * /tmp
  – N ceph-osd RT
recursive accounting
● ceph-mds tracks recursive directory stats
  – file sizes
  – file and directory counts
  – modification time
● virtual xattrs present full stats
● efficient

$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
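From a mounted client, these recursive stats are readable as virtual extended attributes. A minimal sketch in Python, assuming a hypothetical CephFS mount at /mnt/ceph; the xattr names are the ones exposed by the kernel and FUSE clients:

import os

path = '/mnt/ceph/some/dir'            # hypothetical directory on a CephFS mount
for name in ('ceph.dir.rbytes',        # recursive byte count
             'ceph.dir.rfiles',        # recursive file count
             'ceph.dir.rsubdirs',      # recursive subdirectory count
             'ceph.dir.rctime'):       # most recent recursive change time
    value = os.getxattr(path, name)    # answered from MDS-maintained stats, no tree walk
    print(name, value.decode())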
snapshots
● volume or subvolume snapshots unusable at petabyte scale
● snapshot arbitrary subdirectories
● simple interface
  – hidden '.snap' directory
  – no special tools

$ mkdir foo/.snap/one        # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776           # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile bar/
$ rmdir foo/.snap/one        # remove snapshot
multiple client implementations
● Linux kernel client
  – mount -t ceph 1.2.3.4:/ /mnt
  – export (NFS), Samba (CIFS)
● ceph-fuse
● libcephfs.so
  – your app
  – Samba (CIFS)
  – Ganesha (NFS)
  – Hadoop (map/reduce)
[Diagram: kernel client, ceph-fuse, and libcephfs consumers (your app, Samba for SMB/CIFS, Ganesha for NFS, Hadoop)]
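For the "your app" case, the libcephfs.so route looks roughly like this with the python-cephfs bindings; treat the config path, file names, and exact call signatures as assumptions rather than a definitive recipe:

import cephfs

fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')  # config path is an assumption
fs.mount()                                 # talk to MDS/OSDs without a kernel or FUSE mount
fs.mkdirs('/demo', 0o755)                  # create a directory tree
fd = fs.open('/demo/hello.txt', 'w', 0o644)
fs.write(fd, b'hello from libcephfs\n', 0) # file data goes straight to the OSDs
fs.close(fd)
print(fs.stat('/demo/hello.txt'))          # metadata served by the MDS
fs.unmount()
fs.shutdown()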
Recent work
● Groundwork for hosting multiple file systems
● Admin-ability
  – Improved health reporting, logging
  – cephfs-journal-tool, improved journal flexibility
  – Ability to query, manipulate cluster state
● Consistency checking
  – forward scrub: verify correctness / integrity
  – backward scrub: recovery from corruption, data loss
● Performance
Dog food
● Using CephFS for internal build/test lab
  – 80 TB (80 x 1 TB HDDs, 10 hosts)
  – Old, crummy hardware with lots of failures
  – Linux kernel clients (ceph.ko, bleeding-edge kernels)
● Lots of good lessons
  – Several kernel bugs found
  – Recovery performance issues
  – Lots of painful admin processes identified
  – Several fat fingers, facepalms
[Stack diagram revisited: RADOS, LIBRADOS, RBD, and RADOSGW are labeled AWESOME; CEPH FS is labeled NEARLY AWESOME]
Path forward
● Testing
  – Various workloads
  – Multiple active MDSs
● Test automation
  – Simple workload generator scripts
  – Bug reproducers
● Hacking
  – Bug squashing
  – Long-tail features
● Integrations
  – Ganesha, Samba, *stacks
librados

object model
● pools
  – 1s to 100s
  – independent namespaces or object collections
  – replication level, placement policy
● objects
  – bazillions
  – blob of data (bytes to gigabytes)
  – attributes (e.g., “version=12”; bytes to kilobytes)
  – key/value bundle (bytes to gigabytes)
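This is the model librados exposes directly. A minimal sketch with the Python bindings; the pool name, object name, and config path are assumptions:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')            # a pool: its own object namespace

ioctx.write_full('greeting', b'hello rados')    # object byte payload (the blob)
ioctx.set_xattr('greeting', 'version', b'12')   # small attribute on the object
print(ioctx.read('greeting'))                   # b'hello rados'
print(ioctx.get_xattr('greeting', 'version'))   # b'12'

ioctx.close()
cluster.shutdown()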
atomic transactions
● client operations sent to the OSD cluster
  – operate on a single object
  – can contain a sequence of operations, e.g.
    – truncate object
    – write new object data
    – set attribute
● atomicity
  – all operations commit or do not commit atomically
● conditional
  – 'guard' operations can control whether the operation is performed
    – verify xattr has specific value
    – assert object is a specific version
  – allows atomic compare-and-swap, etc.
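In the Python bindings a compound update can be expressed with a WriteOpCtx: every mutation queued on the op is applied to the object atomically when the op is submitted. The guard/assert operations above live in the C/C++ API and are not all exposed here; the pool, object, and key names are assumptions:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')

with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ('state',), (b'ready',))   # mutation 1: set a key/value pair
    ioctx.remove_omap_keys(op, ('state.old',))    # mutation 2: drop a stale key
    ioctx.operate_write_op(op, 'mydir')           # both commit, or neither does

ioctx.close()
cluster.shutdown()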
key/value storage
● store key/value pairs in an object
  – independent from object attrs or byte data payload
● based on google's leveldb
  – efficient random and range insert/query/removal
  – based on BigTable SSTable design
● exposed via key/value API
  – insert, update, remove
  – individual keys or ranges of keys
● avoid read/modify/write cycle for updating complex objects
  – e.g., file system directory objects
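Reading the key/value bundle back goes through a read op. A short sketch with the Python bindings, reusing the assumed 'mypool'/'mydir' names from the write sketch above:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')

with rados.ReadOpCtx() as op:
    # (start_after, filter_prefix, max_return): iterate up to 100 omap pairs
    it, ret = ioctx.get_omap_vals(op, "", "", 100)
    ioctx.operate_read_op(op, 'mydir')
    for key, value in it:
        print(key, value)

ioctx.close()
cluster.shutdown()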
watch/notify
● establish stateful 'watch' on an object
  – client interest persistently registered with object
  – client keeps session to OSD open
● send 'notify' messages to all watchers
  – notify message (and payload) is distributed to all watchers
  – variable timeout
  – notification on completion
    – all watchers got and acknowledged the notify
● use any object as a communication/synchronization channel
  – locking, distributed coordination (ala ZooKeeper), etc.
[Sequence diagram: three clients watch an object on an OSD; a notify is fanned out to every watcher, each acks, and the notifier receives a completion once all acks have arrived]
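A rough sketch of that round trip using the watch/notify wrappers in recent python-rados bindings (the underlying C calls are rados_watch/rados_notify); the object name, payload, and callback signature here are assumptions:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')
ioctx.write_full('sync-channel', b'')             # any object can be the channel

def on_notify(notify_id, notifier_id, watch_id, data):
    # invoked once per notify delivered to this watcher
    print('got notify:', data)

watch = ioctx.watch('sync-channel', on_notify)    # persistent interest, OSD session kept open
ioctx.notify('sync-channel', 'cache-invalidate')  # blocks until all watchers ack or timeout
watch.close()

ioctx.close()
cluster.shutdown()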
watch/notify example
● radosgw cache consistency
  – radosgw instances watch a single object (.rgw/notify)
  – locally cache bucket metadata
● on bucket metadata changes (removal, ACL changes)
  – write change to relevant bucket object
  – send notify with bucket name to other radosgw instances
● on receipt of notify
  – invalidate relevant portion of cache
rados classes
● dynamically loaded .so
  – /var/lib/rados-classes/*
  – implement new object “methods” using existing methods
  – part of I/O pipeline
  – simple internal API
● reads
  – can call existing native or class methods
  – do whatever processing is appropriate
  – return data
● writes
  – can call existing native or class methods
  – do whatever processing is appropriate
  – generates a resulting transaction to be applied atomically
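From the client side, a class method is invoked on an object with an exec-style call. A sketch with the python-rados bindings; it assumes a class named 'hello' with a 'say_hello' method is installed under /var/lib/rados-classes on the OSDs (Ceph ships a cls_hello example, but treat the names as illustrative):

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')
ioctx.write_full('greeting', b'world')

# Run the class method on the OSD that holds the object; the method sees the
# object's data and returns its output in-band with the operation.
ret, out = ioctx.execute('greeting', 'hello', 'say_hello', b'')
print(ret, out)

ioctx.close()
cluster.shutdown()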
class examples
● grep
  – read an object, filter out individual records, and return those
● sha1
  – read object, generate fingerprint, return that
● images
  – rotate, resize, crop image stored in object
  – remove red-eye
● crypto
  – encrypt/decrypt object data with provided key
ideas
● distributed key/value table
  – aggregate many k/v objects into one big 'table'
  – working prototype exists (thanks, Eleanor!)
ideas
● lua rados class
  – embed lua interpreter in a rados class
  – ship semi-arbitrary code for operations
● json class
  – parse, manipulate json structures
ideas
● rados mailbox (RMB?)
  – plug librados backend into dovecot, postfix, etc.
  – key/value object for each mailbox
    – key = message id
    – value = headers
  – object for each message or attachment
  – watch/notify for delivery notification
hard links?
● rare
● useful locality properties
  – intra-directory
  – parallel inter-directory
● on miss, file objects provide per-file backpointers
  – degenerates to log(n) lookups
  – optimistic read complexity