GlusterFS Update and OpenStack Integration


v1.0 2014/05/14

Etsuji Nakai, Senior Solution Architect and Cloud Evangelist, Red Hat K.K.


Contents

Recap: What is GlusterFS?
Recap: DHT architecture overview
The current status of OpenStack integration
libgfapi: Mini Tutorial

Recap: what is GlusterFS?


What is GlusterFS?

GlusterFS is open source software for creating a scale-out distributed filesystem on top of commodity x86_64 servers.
– It aggregates the local storage of many servers into a single logical volume.
– You can extend the volume just by adding more servers.

GlusterFS runs on top of Linux. You can use it wherever you can use Linux.
– Physical/virtual machines in your data center.
– Linux VMs on public clouds.

GlusterFS provides a wide variety of APIs.
– FUSE mount (using the native client.)
– NFSv3 (supporting the distributed NFS lock.)
– CIFS (using libgfapi native access from Samba.)
– REST API (compatible with OpenStack Swift.)
– Native application library (providing POSIX-like system calls.)


Brief history of GlusterFS

[Timeline, 2005 to 2014: the early days of Gluster Inc.; Red Hat acquisition of Gluster Inc.; GlusterFS 3.3; GlusterFS 3.4; GlusterFS 3.5]

Source: http://www.slideshare.net/johnmarkorg/gluster-where-weve-been-a-history


Architecture overview

The standard filesystem (typically XFS) of each storage node is used as the backend device of the logical volume.
– Each file in the volume is physically stored in the filesystem of one of the storage nodes, as the same plain file that the client sees.

The hash value of the file name decides which node stores the file.
– GlusterFS does not use a metadata server to record file locations.

[Figure: the GlusterFS client sees the GlusterFS volume as a single filesystem mounted on a local directory tree, containing file01, file02 and file03. The files are distributed across the local filesystems of the storage nodes.]


Hierarchy consisting of Node / Brick / Volume

A volume is created as a "bundle" of bricks which are provided by storage nodes.

A single node can provide multiple bricks to create multiple volumes. You don't need to use the same number of bricks or the same directory names on each node. You can add or remove bricks to extend or reduce the size of a volume.

[Figure: Node01, Node02 and Node03 each have a filesystem mounted on /data, containing the directories /data/brick01 and /data/brick02. A brick is just a directory. Volume vol01 is drawn as a bundle of bricks across the nodes.]


Volume configuration examples

– Distribution: files are distributed across multiple bricks. (Each file is stored in one of the bricks.)
– Replication: a file is replicated between the specified brick pairs.
– Striping: a file is split into fixed-size chunks, and the chunks are distributed to bricks.
– Combining striping and replication: striped chunks are stored on replicated brick pairs.

[Figure: four storage nodes, node01 to node04, each providing /brick01 to /brick04, with one configuration illustrated per row of bricks.]

Recap: DHT architecture


DHT: Distributed Hash Table

A distributed hash table is:
– A rule for deciding which brick stores a file, based on the hash value of the filename.
– More precisely, it's just a table of bricks and their corresponding hash ranges.

DHT (Distributed Hash Table)

Brick    Hash range
Brick1   0-99
Brick2   100-199
Brick3   200-299
...

Example: the hash value of the filename "file01" is calculated, giving 127, so the file is stored in Brick2, the brick responsible for that hash value.

The actual hash length is 32 bits: 0x00000000 to 0xFFFFFFFF.


DHT structure in GlusterFS

A hash table is created for each directory in a volume.
– Two files with the same name (in different directories) can be placed in different bricks.
– By assigning different hash ranges to different directories, files are distributed more evenly.

The hash range of each brick (directory) is recorded in the extended attribute of the directory.

[root@gluster01 ~]# getfattr -d -m . /data/brick01/dir01
getfattr: Removing leading '/' from absolute path names
# file: data/brick01/dir01
trusted.gfid=0shk2IwdFdT0yI1K7xXGNSdA==
trusted.glusterfs.10d3504b-7111-467d-8d4f-d25f0b504df6.xtime=0sT+vTRwADqyI=
trusted.glusterfs.dht=0sAAAAAQAAAAB//////////w==

Hash ranges per directory and brick:

Directory  Brick1   Brick2   Brick3   ...
/dir01     0-99     100-199  200-299  ...
/dir02     100-199  400-499  300-399  ...
/dir03     500-599  200-299  100-199  ...
...


How the GlusterFS client recognizes the hash table

# mount -t glusterfs gluster01:/vol01

[Figure: volume "vol01" is provided by four storage nodes, gluster01 to gluster04.]



# cat /vol01/dir01/file01


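[Figure: each storage node replies with the hash range assigned to its own brick for dir01, e.g. "The hash range of dir01 is xxx." and "The hash range of dir01 is yyy."]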


The client constructs the whole hash table for dir01 in memory:

dir01    0-99    100-199    200-299


Translator modules

GlusterFS works with multiple translator modules.
– There are modules running on clients and modules running on servers.

Each module has its own role.
– Translator modules are built as shared libraries.
– Your own modules can be added as plug-ins.

[root@gluster01 ~]# ls -l /usr/lib64/glusterfs/3.3.0/xlator/
total 48
drwxr-xr-x 2 root root 4096 Jun 16 15:25 cluster
drwxr-xr-x 2 root root 4096 Jun 16 15:25 debug
drwxr-xr-x 2 root root 4096 Jun 16 15:25 encryption
drwxr-xr-x 2 root root 4096 Jun 16 15:25 features
drwxr-xr-x 2 root root 4096 Jun 16 15:25 mgmt
drwxr-xr-x 2 root root 4096 Jun 16 15:25 mount
drwxr-xr-x 2 root root 4096 Jun 16 15:25 nfs
drwxr-xr-x 2 root root 4096 Jun 16 15:25 performance
drwxr-xr-x 2 root root 4096 Jun 16 15:25 protocol
drwxr-xr-x 2 root root 4096 Jun 16 15:25 storage
drwxr-xr-x 2 root root 4096 Jun 16 15:25 system
drwxr-xr-x 3 root root 4096 Jun 16 15:25 testing

What some of the directories contain:
– cluster: DHT, replication, etc.
– features: quota, file lock, etc.
– performance: caching, read ahead, etc.
– storage: physical I/O


Typical combination of translator modules

Client modules (*1), from top to bottom:
– io-stats: recording statistics information
– md-cache: metadata caching
– quick-read, io-cache, read-ahead, write-behind: data caching
– dht: handling DHT
– replicate-1, replicate-2: replication
– client-1, client-2, client-3, client-4: communication with servers

Server modules (*2), one stack per brick (four bricks in this example), from top to bottom:
– server: communication with clients
– brick
– marker
– index
– io-threads: activating I/O threads
– locks: file locking
– access-control: ACL management
– posix: physical access to bricks

(*1) Defined in /var/lib/glusterd/vols/<Vol>/<Vol>-fuse.vol
(*2) Defined in /var/lib/glusterd/vols/<Vol>/<Vol>.<Node>.<Brick>.vol


The past wish list for GlusterFS

– Volume Snapshot (master branch)
– File Snapshot (GlusterFS 3.5)
– On-wire compression / decompression (GlusterFS 3.4)
– Disk Encryption (GlusterFS 3.4)
– Journal based distributed GeoReplication (GlusterFS 3.5)
– Erasure coding (Not yet...)
– Integration with OpenStack
– etc...

http://www.gluster.org/

The current status of OpenStack Integration


Four places where you need a storage system in OpenStack

– Swift (Object Store): typically, Swift's own distributed object store running on commodity x86_64 servers is used.
– Glance (Template Image): typically, Swift or NFS storage is used.
– Nova Compute (OS): typically, local storage of the compute nodes is used.
– Cinder (Application Data): typically, external hardware storage (iSCSI) is used.

Using GlusterFS for Glance backend

Just use a GlusterFS volume instead of local storage. So simple. This is actually being used in many production clusters.

[Figure: the Glance server stores its images on a GlusterFS volume provided by a GlusterFS cluster. GlusterFS manages scalability, redundancy and consistency.]


How Nova and Cinder work together

In a typical configuration, block volumes are created as LUNs in iSCSI storage boxes. Cinder operates on the management interface of the storage through the corresponding driver.

Nova Compute attaches the LUN to the host Linux using the software initiator; it is then attached to the VM instance through the KVM hypervisor.

[Figure: Cinder creates LUNs on the storage box, which acts as the iSCSI target. On the Nova Compute node, the iSCSI software initiator exposes the LUN as /dev/sdX, and Linux KVM presents it to the VM instance as the virtual disk /dev/vdb.]


Using NFS driver

Cinder also provides an NFS driver, which uses an NFS server as the storage backend.
– The driver simply mounts the NFS-exported directory and creates disk image files in it. Compute nodes use an NFS mount to access the image files.

[Figure: Cinder and the Nova Compute nodes each NFS-mount the NFS server. Linux KVM presents an image file on the mount to the VM instance as the virtual disk /dev/vdb.]


Using GlusterFS driver for Cinder

There is a driver for the GlusterFS distributed filesystem, too.
– Currently it uses the FUSE mount mechanism. This will be replaced with a more optimized mechanism (libgfapi) which bypasses the FUSE layer.

[Figure: Cinder and the compute nodes each FUSE-mount the GlusterFS cluster. Linux KVM presents an image file on the mount to the VM instance as the virtual disk /dev/vdb.]


GlusterFS shared volume for Nova Compute

The same can work for Nova Compute. You can store the OS images of running VMs on a locally mounted GlusterFS volume.

[Figure: compute nodes FUSE-mount the GlusterFS cluster. Linux KVM presents a disk image on the mount to the VM instance as the virtual disk /dev/vda. A template image is also shown.]


The challenge in Cinder/Nova Compute integration

The FUSE mount/file-based architecture is not well suited to the workload of VM disk images (small random I/O).

How can we improve it?


http://www.inktank.com/

Using Ceph?

CENSORED

GlusterFS way for qemu integration

"libgfapi" is an application library with which user applications can directly access a GlusterFS volume via the native protocol.
– It reduces the overhead of the FUSE architecture.

Now qemu is integrated with libgfapi so that it can directly access disk image files placed in a GlusterFS volume.
– This feature has been available since the Havana release.

[Figure: comparison of the I/O paths for FUSE mount and libgfapi.]

Architecture of Swift

– Proxy Servers: handle REST requests from clients, working with the Authentication Server.
– Account Servers: maintain mappings between accounts and containers. (Stored in a DB.)
– Container Servers: maintain lists and ACLs of objects in each container. (Stored in a DB.)
– Object Servers: store object contents in the file system.

Architecture of GlusterFS with Swift API

– A Proxy / Account / Container / Object "all in one" server, which is also a GlusterFS client, works with the Authentication Server.
– The Account/Container/Object Server modules retrieve the required information directly from locally mounted volumes.
– One volume is used for one account.
– The volume for each account is locally mounted at: /mnt/gluster-object/AUTH_<account name>
– The GlusterFS cluster provides the GlusterFS volumes. GlusterFS manages scalability, redundancy and consistency.

libgfapi: Mini Tutorial

Using libgfapi with RHEL6/CentOS6

Install the development tools and the libgfapi library from the EPEL repository.

# yum install http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
# yum groupinstall "Development Tools"
# yum install glusterfs-api-devel

Build your application with libgfapi.

# gcc hellogluster.c -lgfapi
# ./a.out

That's all!

Pseudo-POSIX I/O system calls are listed in the header file.
– https://github.com/gluster/glusterfs/blob/release-3.5/api/src/glfs.h
– File streams and mmap are not there :-(

"Hello, World!" with libgfapi

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>                  /* O_RDWR */
#include <glusterfs/api/glfs.h>

int main (int argc, char** argv) {
    const char *gfserver = "gluster01";
    const char *gfvol = "testvol01";
    const char *greet = "Hello, Gluster!\n";
    int ret;
    glfs_t *fs;
    glfs_fd_t *fd;

    /* glfs_t is the struct type representing the volume (filesystem) "testvol01". */
    fs = glfs_new(gfvol);

    /* Connecting to the volume. */
    glfs_set_volfile_server(fs, "tcp", gfserver, 24007);
    ret = glfs_init(fs);
    if (ret) {
        printf("Failed to connect server/volume: %s/%s\n", gfserver, gfvol);
        exit(ret);
    }

    /* Opening a new file on the volume. */
    fd = glfs_creat(fs, "greeting.txt", O_RDWR, 0644);
    if (fd == NULL) {
        printf("Failed to create greeting.txt\n");
        glfs_fini(fs);
        exit(1);
    }

    /* Write and close the file, then release the volume connection. */
    glfs_write(fd, greet, strlen(greet), 0);
    glfs_close(fd);
    glfs_fini(fs);
    return 0;
}

Thank you
