GlusterFS Update and OpenStack Integration
v1.0 2014/05/14
Etsuji Nakai, Senior Solution Architect and Cloud Evangelist, Red Hat K.K.
Contents
– Recap: What is GlusterFS?
– Recap: DHT architecture overview
– The current status of OpenStack integration
– libgfapi: Mini Tutorial
Recap: what is GlusterFS?
What is GlusterFS?
GlusterFS is open source software to create a scale-out distributed filesystem on top of commodity x86_64 servers.
– It aggregates local storage of many servers into a single logical volume.
– You can extend the volume just by adding more servers.
GlusterFS runs on top of Linux. You can use it wherever you can use Linux.
– Physical/virtual machines in your data center.
– Linux VMs on public clouds.
GlusterFS provides a wide variety of APIs.
– FUSE mount (using the native client).
– NFSv3 (supporting the distributed NFS lock).
– CIFS (using libgfapi native access from Samba).
– REST API (compatible with OpenStack Swift).
– Native application library (providing POSIX-like system calls).
Brief history of GlusterFS
[Figure: timeline from 2005 to 2014 (marked years: 2005, 2011, 2012, 2013, 2014) showing the early days of Gluster Inc., the Red Hat acquisition of Gluster Inc., and the GlusterFS 3.3, 3.4, and 3.5 releases.]
Source: http://www.slideshare.net/johnmarkorg/gluster-where-weve-been-a-history
Architecture overview
The standard filesystem (typically XFS) of each storage node is used as a backend device of the logical volume.
– Each file in the volume is physically stored in the filesystem of one of the storage nodes, as the same plain file that the client sees.
The hash value of the file name is used to decide which node stores the file.
– GlusterFS does not use a metadata server to record file locations.
[Figure: a GlusterFS client sees the GlusterFS volume as a single filesystem mounted on a local directory tree and containing file01, file02, file03. The files are distributed across the local filesystems of the storage nodes.]
Hierarchy of Node / Brick / Volume
A volume is created as a "bundle" of bricks which are provided by storage nodes.
– A single node can provide multiple bricks to create multiple volumes.
– You don't need to use the same number of bricks or the same directory name on each node.
– You can add/remove bricks to extend/reduce the size of volumes.
[Figure: Node01, Node02, and Node03 each have a filesystem mounted on /data that contains the bricks /data/brick01 and /data/brick02 (a brick is just a directory). Bricks from the nodes are bundled into volume vol01.]
Volume configuration examples
– Distribution: files are distributed across multiple bricks. (Each file is stored in one of the bricks.)
– Replication: a file is replicated between the specified brick pairs.
– Striping: a file is split into fixed-size chunks, and the chunks are distributed to bricks.
– Combining striping and replication.
[Figure: storage nodes node01 to node04, each providing /brick01 to /brick04, with the bricks grouped into replication pairs and stripes.]
Recap: DHT architecture
DHT: Distributed Hash Table
A distributed hash table is:
– A rule for deciding which brick stores a file, based on the hash value of the filename.
– More precisely, it's just a table of bricks and their corresponding hash ranges.
[Figure: DHT (Distributed Hash Table). The hash value of the filename file01 is calculated (127 in this example), and the file is stored in the brick that is responsible for this hash value.]
             Brick1   Brick2    Brick3    ・・・
Hash range   0〜99    100〜199  200〜299
The actual hash length is 32 bits: 0x00000000 〜 0xFFFFFFFF
DHT structure in GlusterFS
Hash tables are created for each directory in a single volume.
– Two files with the same name (in different directories) can be placed in different bricks.
– By assigning different hash ranges for different directories, files are more evenly distributed.
The hash range of each brick (directory) is recorded in the extended attribute of the directory.
Example on Brick1:
[root@gluster01 ~]# getfattr -d -m . /data/brick01/dir01
getfattr: Removing leading '/' from absolute path names
# file: data/brick01/dir01
trusted.gfid=0shk2IwdFdT0yI1K7xXGNSdA==
trusted.glusterfs.10d3504b-7111-467d-8d4f-d25f0b504df6.xtime=0sT+vTRwADqyI=
trusted.glusterfs.dht=0sAAAAAQAAAAB//////////w==

         Brick1    Brick2    Brick3    ・・・
/dir01   0〜99     100〜199  200〜299  ・・・
/dir02   100〜199  400〜499  300〜399  ・・・
/dir03   500〜599  200〜299  100〜199  ・・・
・・・
How GlusterFS client recognizes the hash table
# mount -t glusterfs gluster01:/vol01 /vol01
– On mount, the client learns that volume "vol01" is provided by gluster01〜gluster04.
# cat /vol01/dir01/file01
– On access, each brick reports its own hash range for dir01 ("The hash range of dir01 is xxx.", "The hash range of dir01 is yyy.", and so on).
– The client then constructs the whole hash table for dir01 in memory:
        Brick1   Brick2    Brick3    ・・・
dir01   0〜99    100〜199  200〜299
Translator modules
GlusterFS works with multiple translator modules.
– There are modules running on clients and modules running on servers.
Each module has its own role.
– Translator modules are built as shared libraries.
– Original modules can be added as plug-ins.
[root@gluster01 ~]# ls -l /usr/lib64/glusterfs/3.3.0/xlator/
total 48
drwxr-xr-x 2 root root 4096 Jun 16 15:25 cluster        <- DHT, replication, etc.
drwxr-xr-x 2 root root 4096 Jun 16 15:25 debug
drwxr-xr-x 2 root root 4096 Jun 16 15:25 encryption
drwxr-xr-x 2 root root 4096 Jun 16 15:25 features       <- quota, file lock, etc.
drwxr-xr-x 2 root root 4096 Jun 16 15:25 mgmt
drwxr-xr-x 2 root root 4096 Jun 16 15:25 mount
drwxr-xr-x 2 root root 4096 Jun 16 15:25 nfs
drwxr-xr-x 2 root root 4096 Jun 16 15:25 performance    <- caching, read ahead, etc.
drwxr-xr-x 2 root root 4096 Jun 16 15:25 protocol
drwxr-xr-x 2 root root 4096 Jun 16 15:25 storage        <- physical I/O
drwxr-xr-x 2 root root 4096 Jun 16 15:25 system
drwxr-xr-x 3 root root 4096 Jun 16 15:25 testing
Typical combination of translator modules
Client modules (*1), from top to bottom:
  io-stats                                           Recording statistics information
  md-cache                                           Metadata caching
  quick-read, io-cache, read-ahead, write-behind     Data caching
  dht                                                Handling DHT
  replicate-1, replicate-2                           Replication
  client-1, client-2, client-3, client-4             Communication with servers
Server modules (*2), one stack per brick (four bricks in this example), from top to bottom:
  server                                             Communication with clients
  brick
  marker
  index
  io-threads                                         Activating I/O threads
  locks                                              File locking
  access-control                                     ACL management
  posix                                              Physical access to bricks
(*1) Defined in /var/lib/glusterd/vols/<Vol>/<Vol>-fuse.vol
(*2) Defined in /var/lib/glusterd/vols/<Vol>/<Vol>.<Node>.<Brick>.vol
The past wish list for GlusterFS
– Volume Snapshot (master branch)
– File Snapshot (GlusterFS 3.5)
– On-wire compression / decompression (GlusterFS 3.4)
– Disk Encryption (GlusterFS 3.4)
– Journal based distributed GeoReplication (GlusterFS 3.5)
– Erasure coding (Not yet...)
– Integration with OpenStack
– etc...
http://www.gluster.org/
The current status of OpenStack Integration
Four locations where you need a storage system in OpenStack
– Swift (Object Store): typically, Swift's own distributed object store using commodity x86_64 servers is used.
– Cinder (Application Data): typically, external hardware storage (iSCSI) is used.
– Nova Compute (OS): typically, local storage of compute nodes is used.
– Glance (Template Image): typically, Swift or NFS storage is used.
Using GlusterFS for Glance backend
Just use a GlusterFS volume instead of local storage. So simple. This is actually being used in many production clusters.
[Figure: the Glance server stores its images on a GlusterFS volume provided by a GlusterFS cluster. GlusterFS manages scalability, redundancy and consistency.]
How Nova and Cinder work together
In a typical configuration, block volumes are created as LUNs in iSCSI storage boxes. Cinder operates on the management interface of the storage through the corresponding driver.
Nova Compute attaches the LUN to the host Linux using the software initiator, and then the LUN is attached to the VM instance through the KVM hypervisor.
[Figure: Cinder creates LUNs on the storage box (iSCSI target). The iSCSI software initiator on Nova Compute exposes a LUN as /dev/sdX, which Linux KVM presents to the VM instance as the virtual disk /dev/vdb.]
Using NFS driver
Cinder also provides the NFS driver, which uses an NFS server as a storage backend.
– The driver simply mounts the NFS-exported directory and creates disk image files in it. Compute nodes use an NFS mount to access the image files.
[Figure: Cinder and Nova Compute both NFS-mount the NFS server. An image file on it appears as the virtual disk /dev/vdb of the VM instance through Linux KVM.]
Using GlusterFS driver for Cinder
There is a driver for the GlusterFS distributed filesystem, too.
– Currently it uses the FUSE mount mechanism. This will be replaced with a more optimized mechanism (libgfapi) which bypasses the FUSE layer.
[Figure: Cinder and Nova Compute both FUSE-mount the GlusterFS cluster. An image file on it appears as the virtual disk /dev/vdb of the VM instance through Linux KVM.]
GlusterFS shared volume for Nova Compute
The same can work for Nova Compute. You can store a running VM's OS image on a locally mounted GlusterFS volume.
[Figure: Nova Compute FUSE-mounts the GlusterFS cluster. The virtual disk stored there appears as /dev/vda of the VM instance through Linux KVM. The figure also shows the template image.]
The challenge in Cinder/Nova Compute integration
The FUSE mount/file-based architecture is not well suited to the workload of VM disk images (small random I/O).
How can we improve it?
Using Ceph? (http://www.inktank.com/) CENSORED
GlusterFS way for qemu integration
"libgfapi" is an application library with which user applications can directly access a GlusterFS volume via the native protocol.
– It reduces the overhead of the FUSE architecture.
Now qemu is integrated with libgfapi so that it can directly access disk image files placed in a GlusterFS volume.
– This feature has been available since the Havana release.
[Figure: the FUSE mount access path compared with the libgfapi access path.]
Architecture of Swift
– Proxy Servers: handle REST requests from clients.
– Authentication Server.
– Account Servers: maintain mappings between accounts and containers. (DB)
– Container Servers: maintain lists and ACLs of objects in each container. (DB)
– Object Servers: store object contents in the file system. (File System)
Architecture of GlusterFS with Swift API
– Proxy / Account / Container / Object "all in one" server & GlusterFS client, with an Authentication Server, in front of the GlusterFS cluster.
– One volume is used for one account.
– The volume for each account is locally mounted at: /mnt/gluster-object/AUTH_<account name>
– Account/Container/Object Server modules retrieve required information directly from locally mounted volumes.
– GlusterFS manages scalability, redundancy and consistency.
libgfapi: Mini Tutorial
Using libgfapi with RHEL6/CentOS6
Install development tools and the libgfapi library from the EPEL repository.
# yum install http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
# yum groupinstall "Development Tools"
# yum install glusterfs-api-devel
Build your application with libgfapi.
# gcc hellogluster.c -lgfapi
# ./a.out
That's all!
Pseudo-POSIX I/O system calls are listed in the header file.
– https://github.com/gluster/glusterfs/blob/release-3.5/api/src/glfs.h
– File stream and mmap are not there :-(
"Hello, World!" with libgfapi
    #include <fcntl.h>      /* O_RDWR */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <glusterfs/api/glfs.h>

    int main(int argc, char **argv)
    {
        const char *gfserver = "gluster01";
        const char *gfvol = "testvol01";
        const char *greet = "Hello, Gluster!\n";
        int ret;
        glfs_t *fs;       /* struct representing the volume (filesystem) "testvol01" */
        glfs_fd_t *fd;

        /* Connecting to the volume. */
        fs = glfs_new(gfvol);
        if (fs == NULL) {
            fprintf(stderr, "Failed to allocate volume object: %s\n", gfvol);
            exit(EXIT_FAILURE);
        }
        glfs_set_volfile_server(fs, "tcp", gfserver, 24007);
        ret = glfs_init(fs);
        if (ret) {
            fprintf(stderr, "Failed to connect server/volume: %s/%s\n",
                    gfserver, gfvol);
            exit(EXIT_FAILURE);
        }

        /* Opening a new file on the volume. */
        fd = glfs_creat(fs, "greeting.txt", O_RDWR, 0644);
        if (fd == NULL) {
            perror("glfs_creat");
            glfs_fini(fs);
            exit(EXIT_FAILURE);
        }

        /* Write and close the file. */
        glfs_write(fd, greet, strlen(greet), 0);
        glfs_close(fd);

        /* Disconnect from the volume. */
        glfs_fini(fs);
        return 0;
    }
Thank you