What you need to know about Ceph
What you need to know about Ceph
Gluster Community Day, 20 May 2014
Haruka Iwao
Index
What is Ceph?
Ceph architecture
Ceph and OpenStack
Wrap-up
What is Ceph?
Ceph
The name "Ceph" is a common nickname given to pet octopuses, short for cephalopod.
Cephalopod?
Ceph is...
Open-source, massively scalable, software-defined object storage and file system.
History of Ceph
2003: Project born at UCSC
2006: Open sourced; papers published
2012: Inktank founded; "Argonaut" released
In April 2014, Red Hat announced it would acquire Inktank.
Yesterday
Red Hat acquires me
I joined Red Hat as an architect of storage systems. (This is just a coincidence.)
Ceph releases
A major release every 3 months:
Argonaut, Bobtail, Cuttlefish, Dumpling, Emperor, Firefly, Giant (coming in July)
Ceph architecture
Ceph at a glance
Layers in Ceph
RADOS is to Ceph FS what /dev/sda is to ext4: RADOS is the underlying storage layer, and Ceph FS is the file system built on top of it.
RADOS
Reliable: replicated to avoid data loss
Autonomic: daemons communicate with each other to detect failures; replication is done transparently
Distributed Object Store
RADOS (2)
The fundamental layer of Ceph: everything is stored in RADOS, including Ceph FS metadata.
Two components: mon and osd. Uses the CRUSH algorithm.
OSD
Object storage daemon: one OSD per disk.
Uses xfs or btrfs as the backend (btrfs is experimental!).
Write-ahead journal for integrity and performance.
3 to 10,000s of OSDs in a cluster.
OSD (2)
[Diagram: multiple OSDs, one per disk, each running on top of a local file system (btrfs, xfs, or ext4)]
MON
Monitoring daemon: maintains the cluster map and state.
Deployed as a small, odd number of daemons (for quorum).
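Why an odd number? Monitors form a quorum by simple majority, so an even count adds a daemon without tolerating an extra failure. A quick illustrative sketch (not Ceph code):

```python
def has_quorum(alive: int, total: int) -> bool:
    # Majority rule: strictly more than half the monitors must be up.
    return alive > total // 2

def failures_tolerated(total: int) -> int:
    # Largest number of failed mons that still leaves a majority.
    return (total - 1) // 2

for n in (3, 4, 5):
    print(f"{n} mons tolerate {failures_tolerated(n)} failure(s)")
```

Note that 3 and 4 mons both tolerate only one failure, which is why odd counts are recommended.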
Locating objects
RADOS uses the CRUSH algorithm to locate objects. An object's location is decided by pure calculation:
no central metadata server, no SPoF, massive scalability.
CRUSH
1. Assign a placement group: pg = Hash(object name) % num_pg
2. CRUSH(pg, cluster map, rule) → object locations
Cluster map
A hierarchical map of OSDs, used for replicating across failure domains and avoiding network congestion.
Object locations are computed
Name: abc, Pool: test
Hash("abc") % 256 = 0x23; pool "test" has id 3
Placement Group: 3.23
PG to OSD
Placement Group: 3.23
CRUSH(PG 3.23, Cluster Map, Rule) → osd.1, osd.5, osd.9
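The two-step lookup can be sketched in a few lines of Python. This is an illustrative stand-in, not Ceph's actual CRUSH implementation: real CRUSH walks the hierarchical cluster map with weighted buckets, while this sketch uses rendezvous (highest-random-weight) hashing, which shares the key properties — deterministic, computed entirely on the client, and stable when the map changes. All names here (`pg_for_object`, `crush_like`) are made up for the example.

```python
import hashlib

def pg_for_object(name: str, num_pg: int) -> int:
    # Step 1: object name -> placement group.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % num_pg

def crush_like(pg: int, osds: list, replicas: int = 3) -> list:
    # Step 2: placement group -> ordered list of OSDs.
    # Rendezvous hashing: score every OSD for this PG, take the top `replicas`.
    def score(osd: str) -> int:
        return int(hashlib.md5(f"{pg}:{osd}".encode()).hexdigest(), 16)
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(10)]
pg = pg_for_object("abc", 256)
print(crush_like(pg, osds))  # any client computes the same 3 OSDs
```

Because placement is a pure function of (object name, cluster map), there is nothing to look up and nothing to keep consistent besides the small cluster map itself.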
Synchronous Replication
Replication is synchronous, to maintain strong consistency.
When OSD fails
The OSD is marked "down"; 5 minutes later it is marked "out" and the cluster map is updated:
CRUSH(PG 3.23, Cluster Map #1, Rule) → osd.1, osd.5, osd.9
CRUSH(PG 3.23, Cluster Map #2, Rule) → osd.1, osd.3, osd.9
Wrap-up: CRUSH
Object name + cluster map → object locations.
Deterministic: no metadata at all; the calculation is done on the clients.
The cluster map reflects the network hierarchy.
RADOSGW
RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
RADOSGW
An S3- and Swift-compatible gateway to RADOS.
RBD
RADOS Block Devices
RBD
Directly mountable:
rbd map foo --pool rbd
mkfs -t ext4 /dev/rbd/rbd/foo
OpenStack integration (Cinder & Glance): explained later.
Ceph FS
A POSIX-compliant file system built on top of RADOS.
Can be mounted with the native Linux kernel driver (cephfs) or via FUSE.
Metadata servers (mds) manage the metadata of the file system tree.
Ceph FS is reliable
The MDS writes its journal to RADOS, so metadata is not lost on MDS failure.
Multiple MDS daemons can run for HA and load balancing.
Ceph FS and OSD
[Diagram: clients send data I/O directly to the OSDs; the MDS holds POSIX metadata (directory, time, owner, etc.) in memory and writes its metadata journal to the OSDs]
DYNAMIC SUBTREE PARTITIONING
Ceph FS is experimental
Other features
Rolling upgrades
Erasure coding
Cache tiering
Key-value OSD backend
Separate backend network
Rolling upgrades
No interruption to the service when upgrading.
Stop/start daemons one by one: mon → osd → mds → radosgw
Erasure coding
Uses erasure coding instead of replication for data durability.
Suitable for rarely modified or accessed objects.

| | Erasure coding | Replication |
|---|---|---|
| Space overhead (to survive 2 failures) | Approx. 40% | 200% |
| CPU | High | Low |
| Latency | High | Low |
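A quick sanity check on the 40% vs 200% figures above (the exact erasure-coding ratio depends on the chosen profile; k=5, m=2 is assumed here for illustration): with k data chunks and m coding chunks, an erasure-coded pool survives m lost chunks at m/k extra space, while n-way replication survives n−1 lost copies at (n−1)×100% overhead.

```python
def ec_overhead(k: int, m: int) -> float:
    # k data chunks + m coding chunks: survives m lost chunks,
    # stores (k + m)/k times the raw data -> overhead of m/k.
    return m / k

def replication_overhead(n: int) -> float:
    # n full copies: survives n - 1 lost copies, overhead (n - 1) * 100%.
    return n - 1

print(f"EC k=5, m=2:       {ec_overhead(5, 2):.0%} overhead, survives 2 failures")
print(f"3-way replication: {replication_overhead(3):.0%} overhead, survives 2 failures")
```

Both configurations survive two failures, but erasure coding pays for the saved space with extra CPU (encoding/decoding) and latency, as the table notes.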
Cache tiering
[Diagram: librados clients read and write through a cache tier (e.g. SSD); objects are fetched from the base tier (e.g. HDD, erasure coded) on a miss and flushed back to the base tier; tiering is transparent to clients]
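The fetch-on-miss / flush-to-base behaviour can be sketched as a tiny read-through, write-back cache. This is only an illustration of the idea; the `CacheTier` name is made up, and Ceph's real promotion and eviction policies are far more sophisticated.

```python
class CacheTier:
    """Toy two-tier store: a small fast dict in front of a slow dict."""

    def __init__(self, base: dict, capacity: int = 2):
        self.base = base          # base tier (e.g. HDD, erasure coded)
        self.cache = {}           # cache tier (e.g. SSD)
        self.capacity = capacity

    def read(self, key):
        if key not in self.cache:          # miss: fetch from the base tier
            self._make_room()
            self.cache[key] = self.base[key]
        return self.cache[key]

    def write(self, key, value):
        self._make_room()
        self.cache[key] = value            # writes land in the cache tier

    def _make_room(self):
        while len(self.cache) >= self.capacity:
            k, v = self.cache.popitem()    # flush an object to the base tier
            self.base[k] = v

base = {"a": 1, "b": 2, "c": 3}
tier = CacheTier(base)
tier.write("d", 4)
print(tier.read("a"), tier.read("d"))
```

Clients only ever talk to `read`/`write`; which tier actually holds an object at any moment is invisible to them, matching the "transparent to clients" note on the slide.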
Key-value OSD backend
Uses LevelDB as the OSD backend (instead of xfs).
Better performance, especially for small objects.
Plans to support RocksDB, NVMKV, etc.
Separate backend network
[Diagram: clients write to OSDs over the frontend (service) network; OSDs replicate to each other over a separate backend network]
OpenStack Integration
OpenStack with Ceph
RADOSGW and Keystone
[Diagram: clients query a token from the Keystone server, then access RADOSGW (a RESTful object store) with that token; Keystone grants and revokes access]
Glance Integration
Glance stores and downloads images on RBD. Just 3 lines in /etc/glance/glance-api.conf are needed:
default_store=rbd
rbd_store_user=glance
rbd_store_pool=images
Cinder/Nova Integration
[Diagram: the Cinder server manages volumes on RBD; nova-compute boots VMs from volumes, with qemu accessing RBD through librbd; volumes are copy-on-write clones of images]
Benefits of using Ceph with OpenStack
Unified storage for both images and volumes.
Copy-on-write cloning and snapshot support.
Native qemu/KVM support for better performance.
Wrap-up
Ceph is
Massively scalable storage.
A unified architecture for object / block / POSIX FS.
OpenStack integration: ready to use & awesome.
Ceph and GlusterFS

| | Ceph | GlusterFS |
|---|---|---|
| Distribution | Object based | File based |
| File location | Deterministic algorithm (CRUSH) | Distributed hash table, stored in xattr |
| Replication | Server side | Client side |
| Primary usage | Object / block storage | POSIX-like file system |
| Challenge | POSIX file system needs improvement | Object / block storage needs improvement |
Further readings
Ceph documentation
https://ceph.com/docs/master/
Well documented.
Sébastien Han
http://www.sebastien-han.fr/blog/
An awesome blog.
CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
http://ceph.com/papers/weil-crush-sc06.pdf
CRUSH algorithm paper
Ceph: A Scalable, High-Performance Distributed File System
http://www.ssrc.ucsc.edu/Papers/weil-osdi06.pdf
Ceph paper
Ceph の覚え書きのインデックス ("Index of notes on Ceph")
http://www.nminoru.jp/~nminoru/unix/ceph/
A well-written introduction in Japanese.
One more thing
Calamari will be open sourced
“Calamari, the monitoring and diagnostics tool that Inktank has developed as part of the Inktank Ceph Enterprise product, will soon be open sourced.”
http://ceph.com/community/red-hat-to-acquire-inktank/
Calamari screens
Thank you!