GlusterFS and RHS for the SysAdmin
An In-Depth Look and Demo
Dustin L. Black, RHCA
Sr. Technical Account Manager & Team Lead
Red Hat Global Support Services
--
Dustin L. Black, RHCA
Sr. Technical Account Manager
Red Hat, Inc.
[email protected]
@dustinlblack
My name is Dustin Black. I'm a Red Hat Certified Architect, and a Senior Technical Account Manager with Red Hat. I specialize in Red Hat Storage and GlusterFS, and I've been working closely with our partner, Intuit, on a very interesting and exciting implementation of Red Hat Storage.
#whatis TAM
Premium named-resource support
Proactive and early access
Regular calls and on-site engagements
Customer advocate within Red Hat and upstream
Multi-vendor support coordinator
High-touch access to engineering
Influence for software enhancements
NOT Hands-on or consulting
-What is a Technical Account Manager?
-We provide semi-dedicated support to many of the world's largest enterprise Linux consumers.
-This is not a hands-on role, but rather a collaborative support relationship with the customer.
-Our goal is to provide a proactive and high-touch customer relationship with close ties to Red Hat Engineering.
Agenda
Technology Overview
Scaling Up and Out
A Peek at GlusterFS Logic
Redundancy and Fault Tolerance
Data Access
General Administration
Use Cases
Common Pitfalls
I hope to make the GlusterFS concepts more tangible. I want you to walk away with the confidence to start working with GlusterFS today.
Technology Overview
What is GlusterFS?
Scalable, general-purpose storage platform
POSIX-y Distributed File System
Object storage (swift)
Distributed block storage (qemu)
Flexible storage (libgfapi)
No Metadata Server
Heterogeneous Commodity Hardware
Standards-Based Clients, Applications, Networks
Flexible and Agile Scaling
Capacity: Petabytes and beyond
Performance: Thousands of clients
-Commodity hardware: aggregated as building blocks for a clustered storage resource.
-Standards-based: No need to re-architect systems or applications, and no long-term lock-in to proprietary systems or protocols.
-Simple and inexpensive scalability.
-Scaling is non-interruptive to client access.
-Aggregated resources into a unified storage volume abstracted from the hardware.
[Diagram: client access paths (FUSE native, NFS, Samba, and QEMU/applications via libgfapi) over the network interconnect]
What is Red Hat Storage?
Enterprise Implementation of GlusterFS
Software Appliance
Bare Metal Installation
Built on RHEL + XFS
Subscription Model
Storage Software Appliance
Datacenter and Private Cloud Deployments
Virtual Storage Appliance
Amazon Web Services Public Cloud Deployments
-Provided as an ISO for direct bare-metal installation.
-XFS filesystem is usually an add-on subscription for RHEL, but is included with Red Hat Storage.
-XFS supports metadata journaling for quicker crash recovery and can be defragged and expanded online.
-RHS:GlusterFS::RHEL:Fedora
-GlusterFS and gluster.org are the upstream development and test bed for Red Hat Storage.
[Diagram: upstream development cycle (release, iterate, debug, enhance) feeding the glusterfs-3.4 release]
With the direct partnership between Red Hat and Intuit, we were able to save precious time and get the updated code into Intuit's hands very early.
We immediately began working this into the upstream GlusterFS project, where the effect was a significant change in the priorities of the next release.
The code was then further optimized and additional features were added. And now all of these changes are ready to benefit the community with the GlusterFS 3.4 release, and will be ready for wide enterprise support in the upcoming Red Hat Storage 2.1 product.
GlusterFS vs. Traditional Solutions
A basic NAS has limited scalability and redundancy
Other distributed filesystems limited by metadata
SAN is costly and complicated, but offers high performance and scalability
GlusterFS =
Linear Scaling
Minimal Overhead
High Redundancy
Simple and Inexpensive Deployment
Use Cases
Common Solutions
Media / Content Distribution Network (CDN)
Backup / Archive / Disaster Recovery (DR)
Large Scale File Server
Home directories
High Performance Computing (HPC)
Infrastructure as a Service (IaaS) storage layer
Hadoop Map Reduce
Access data within and outside of Hadoop
No HDFS name node single point of failure / bottleneck
Seamless replacement for HDFS
Scales with the massive growth of big data
Technology Stack
Terminology
Brick
A filesystem mountpoint
A unit of storage used as a GlusterFS building block
Translator
Logic between the bits and the Global Namespace
Layered to provide GlusterFS functionality
Volume
Bricks combined and passed through translators
Peer
Server running the gluster daemon and sharing volumes
-Bricks are stacked to increase capacity
-Translators are stacked to increase functionality
Disk, LVM, and Filesystems
Direct-Attached Storage (DAS) or Just a Bunch Of Disks (JBOD)
Hardware RAID
RHS: RAID 6 required
Logical Volume Management (LVM)
XFS, EXT3/4, BTRFS
Extended attributes support required
RHS: XFS required
-XFS is the only filesystem supported with RHS.
-Extended attribute support is necessary because the file hash is stored there.
Gluster Components
glusterd
Elastic volume management daemon
Runs on all export servers
Interfaced through gluster CLI
glusterfsd
GlusterFS brick daemon
One process for each brick
Managed by glusterd
Gluster Components
glusterfs
NFS server daemon
Self-heal daemon
FUSE client daemon
mount.glusterfs
FUSE native mount tool
gluster
Gluster Console Manager (CLI)
-gluster console commands can be run directly or in interactive mode, similar to virsh and ntpq.
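For illustration (assuming an existing volume named my-dist-vol), the same information can be retrieved with a one-shot command or from the interactive prompt:

# gluster volume info my-dist-vol
# gluster
gluster> peer status
gluster> volume info my-dist-vol
gluster> quit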
Data Access Overview
GlusterFS Native Client
Filesystem in Userspace (FUSE)
NFS
Built-in Service
SMB/CIFS
Samba server required; NOW libgfapi-integrated!
Gluster For OpenStack (G4O; aka UFO)
Simultaneous object-based access via Swift
NEW! libgfapi flexible abstracted storage
Integrated with upstream Samba and Ganesha-NFS
-The native client uses fuse to build complete filesystem functionality without gluster itself having to operate in kernel space. This offers benefits in system stability and time-to-end-user for code updates.
Putting it All Together
Scaling Up
Add disks and filesystems to a server
Expand a GlusterFS volume by adding bricks
Scaling Out
Add GlusterFS nodes to trusted pool
Add filesystems as new bricks
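A minimal scale-out sketch, assuming a hypothetical new node server4 with a prepared brick at /brick4 and an existing volume my-dist-vol (the individual commands are covered in the General Administration section):

gluster> peer probe server4
gluster> volume add-brick my-dist-vol server4:/brick4
gluster> volume rebalance my-dist-vol start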
Under
the Hood
Elastic Hash Algorithm
No central metadata
No Performance Bottleneck
Eliminates risk scenarios
Location hashed intelligently on filename
Unique identifiers, similar to md5sum
The Elastic Part
Files assigned to virtual volumes
Virtual volumes assigned to multiple bricks
Volumes easily reassigned on the fly
No metadata == no performance bottleneck, no single point of failure (compared to a single metadata node), and no corruption issues (compared to distributed metadata).
Hash calculation is faster than metadata retrieval.
The elastic hash is the core of how gluster scales linearly.
Elastic Hash Algorithm (simplified):
* The hash is based on the filename; the filename gets hashed to a value, say a, b, c, or d.
* Each brick in a distributed volume holds the files for a hash range, say brick1 has a-b and brick2 has c-d.
* The hash is calculated client-side; say the filename README hashes to c.
* The client knows the volume layout: c is in the range c-d, and the range c-d is located on brick2.
* In case of a replicated volume, a sub-volume is created. Instead of locating the files on the bricks directly, the DHT xlator assigns the hash ranges to sub-volumes.
* A sub-volume acts like a brick from the point of view of DHT; the sub-volume contains the AFR xlator, which manages the duplication.
* Hash collisions do not matter: two filenames that hash to the same value just land on the same brick.
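The hash ranges and file IDs can be inspected on a storage server as extended attributes of the brick contents; a hedged example, assuming a brick at /brick1 containing a directory dir1 and a file README (the trusted.glusterfs.dht attribute on a brick directory encodes the hash range assigned to that brick):

# getfattr -m . -d -e hex /brick1/dir1
# getfattr -n trusted.gfid -e hex /brick1/dir1/README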
Translators
Modular building blocks for functionality, like bricks are for storage
Your Storage Servers are Sacred!
Don't touch the brick filesystems directly!
They're Linux servers, but treat them like appliances
Separate security protocols
Separate access standards
Don't let your Jr. Linux admins in!
A well-meaning sysadmin can quickly break your system or destroy your data
Basic Volumes
Distributed Volume
Files evenly spread across bricks
Similar to file-level RAID 0
Server/Disk failure could be catastrophic
Replicated Volume
Copies files to multiple bricks
Similar to file-level RAID 1
Striped Volumes
Individual files split among bricks
Similar to block-level RAID 0
Limited Use Cases: HPC Pre/Post Processing
Limited read/write performance increase; in some cases the overhead can actually cause a performance degradation
Layered Functionality
Distributed Replicated Volume
Distributes files across replicated bricks
Distributed Striped Volume
Files striped across two or more nodes
Striping plus scalability
Striped Replicated Volume
RHS 2.0 / GlusterFS 3.3+
Similar to RAID 10 (1+0)
-Graphic is wrong:
--Replica 0 is exp1 and exp3
--Replica 1 is exp2 and exp4
Distributed Striped Replicated Volume
RHS 2.0 / GlusterFS 3.3+
Limited Use Cases: MapReduce
This graphic from the RHS 2.0 beta documentation actually represents a non-optimal configuration. We'll discuss this in more detail later.
Asynchronous Replication
Geo Replication
Asynchronous across LAN, WAN, or Internet
Master-Slave model -- Cascading possible
Continuous and incremental
Data is passed between defined master and slave only
When you configure geo-replication, you do so on a specific master (or source) host, replicating to a specific slave (or destination) host. Geo-replication does not scale linearly since all data is passed through the master and slave nodes specifically.
Because geo-replication is configured per-volume, you can gain scalability by choosing different master and slave geo-rep nodes for each volume.
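A minimal sketch of establishing a session, assuming a master volume my-vol and a slave volume my-vol-slave on host slave1 (syntax shown is the newer style used by distributed geo-replication; older releases take a slave path such as a host:directory pair instead):

gluster> volume geo-replication my-vol slave1::my-vol-slave create push-pem
gluster> volume geo-replication my-vol slave1::my-vol-slave start
gluster> volume geo-replication my-vol slave1::my-vol-slave status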
?
Geo-Replication was in need of a whole new approach. At the time, multi-master replication was on the product roadmap, pending a future release. But as it turned out, Intuit had already resolved the multi-master limitation with a novel approach utilizing their Cassandra NoSQL system. I'll let Mohit describe that in more detail in a moment.
NEW! Distributed Geo-Replication
In a very short time, Red Hat engineers were able to characterize the limitations that Intuit was encountering, and propose new code to address them: a completely re-designed Geo-Replication stack.
Where the traditional GlusterFS Geo-Replication code relied on a serial stream between only two nodes, the new code employed an algorithm that would divide the replication workload between all members of the volumes on both the sending and receiving sides.
Additionally, the new code introduced an improved filesystem scanning mechanism that was significantly more efficient for Intuit's small-file use case.
Distributed Geo-Replication
Drastic performance improvements
Parallel transfers
Efficient source scanning
File type/layout agnostic
Available now in RHS 2.1
Planned for GlusterFS 3.5
Perhaps it's not just for DR anymore...
http://www.redhat.com/resourcelibrary/case-studies/intuit-leverages-red-hat-storage-for-always-available-massively-scalable-storage
Replicated Volumes vs Geo-replication
Replicated Volumes:
- Mirrors data across clusters
- Provides high availability
- Synchronous replication (each and every file operation is sent across all the bricks)

Geo-replication:
- Mirrors data across geographically distributed clusters
- Ensures backing up of data for disaster recovery
- Asynchronous replication (checks for changes in files periodically and syncs them on detecting differences)
Data Access
GlusterFS Native Client (FUSE)
FUSE kernel module allows the filesystem to be built and operated entirely in userspace
Specify mount to any GlusterFS server
Native Client fetches volfile from mount server, then communicates directly with all nodes to access data
Recommended for high concurrency and high write performance
Load is inherently balanced across distributed volumes
-Native client will be the best choice when you have many nodes concurrently accessing the same data.
-Client access to data is naturally load-balanced because the client is aware of the volume structure and the hash algorithm.
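A hedged mount example, assuming a volume my-dist-vol and a mount point /mnt/gluster; server1 is only contacted for the volfile, after which the client speaks to all bricks directly:

# mount -t glusterfs server1:/my-dist-vol /mnt/gluster
# echo 'server1:/my-dist-vol /mnt/gluster glusterfs defaults,_netdev 0 0' >> /etc/fstab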
NFS
Standard NFS v3 clients
Standard automounter is supported
Mount to any server, or use a load balancer
GlusterFS NFS server includes Network Lock Manager (NLM) to synchronize locks across clients
Better performance for reading many small files from a single client
Load balancing must be managed externally
...mount with nfsvers=3 in modern distros that default to NFSv4. The need for this seems to be a bug, and I understand it is in the process of being fixed.
NFS will be the best choice when most of the data is accessed by one client, and for small files. This is mostly due to the benefits of native NFS caching.
Load balancing will need to be accomplished by external mechanisms
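A hedged NFS mount example, assuming the same my-dist-vol volume and a client distro that defaults to NFSv4 (hence the explicit version):

# mount -t nfs -o vers=3,mountproto=tcp server1:/my-dist-vol /mnt/gluster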
NEW! libgfapi
Introduced with GlusterFS 3.4
User-space library for accessing data in GlusterFS
Filesystem-like API
Runs in application process
no FUSE, no copies, no context switches
...but same volfiles, translators, etc.
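For example, a QEMU built with GlusterFS support can address an image through a gluster:// URL with no FUSE mount at all; a hedged sketch, assuming a volume my-vol on server1 and a hypothetical image name vm1.img:

# qemu-img create gluster://server1/my-vol/vm1.img 10G
# qemu-img info gluster://server1/my-vol/vm1.img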
SMB/CIFS
NEW! In GlusterFS 3.4 Samba + libgfapi
No need for local native client mount & re-export
Significant performance improvements with FUSE removed from the equation
Must be set up on each server you wish to connect to via CIFS
CTDB is required for Samba clustering
-Use the GlusterFS native client first to mount the volume on the Samba server, and then share that mount point with Samba via normal methods.
-GlusterFS nodes can act as Samba servers (packages are included), or it can be an external service.
-Load balancing will need to be accomplished by external mechanisms.
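A hedged smb.conf sketch of the newer libgfapi-based share, assuming Samba built with the vfs_glusterfs module and a local volume named my-vol (option names can vary slightly between Samba versions):

[gluster-share]
path = /
vfs objects = glusterfs
glusterfs:volume = my-vol
glusterfs:volfile_server = localhost
kernel share modes = no
read only = no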
Demo Time!
General Administration
Preparing a Brick
# lvcreate -L 100G -n lv_brick1 vg_server1
# mkfs -t xfs -i size=512 /dev/vg_server1/lv_brick1
# mkdir /brick1
# mount /dev/vg_server1/lv_brick1 /brick1
# echo '/dev/vg_server1/lv_brick1 /brick1 xfs defaults 1 2' >> /etc/fstab
An inode size smaller than 512 leaves no room for extended attributes (xattrs), which means every active inode will require a separate block to store them. This incurs both a performance hit and a disk space penalty.
Adding Nodes (peers) and Volumes
gluster> peer probe server3
gluster> peer status
Number of Peers: 2

Hostname: server2
Uuid: 5e987bda-16dd-43c2-835b-08b7d55e94e5
State: Peer in Cluster (Connected)

Hostname: server3
Uuid: 1e0ca3aa-9ef7-4f66-8f15-cbc348f29ff7
State: Peer in Cluster (Connected)

gluster> volume create my-dist-vol server2:/brick2 server3:/brick3
gluster> volume info my-dist-vol
Volume Name: my-dist-vol
Type: Distribute
Status: Created
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: server2:/brick2
Brick2: server3:/brick3
gluster> volume start my-dist-vol
Distributed Volume
Peer Probe
-The peer status command shows all other peer nodes but excludes the local node.
-I understand this to be a bug that's in the process of being fixed.
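For comparison, a hedged example of creating a two-way distributed-replicated volume across four hypothetical bricks; with replica 2, consecutive bricks on the command line form the replica pairs, so they are ordered to keep each pair on different servers:

gluster> volume create my-dist-rep-vol replica 2 transport tcp \
server2:/brick2 server3:/brick3 server4:/brick4 server5:/brick5
gluster> volume start my-dist-rep-vol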
Distributed Striped Replicated Volume
gluster> volume create test-volume replica 2 stripe 2 transport tcp \
server1:/exp1 server1:/exp2 server2:/exp3 server2:/exp4 \
server3:/exp5 server3:/exp6 server4:/exp7 server4:/exp8
Multiple bricks of a replicate volume are present on the same server. This setup is not optimal.
Do you still want to continue creating the volume? (y/n) y
Creation of volume test-volume has been successful. Please start the volume to access data.
gluster> volume info test-volume
Volume Name: test-volume
Type: Distributed-Striped-Replicate
Volume ID: 8f8b8b59-d1a1-42fe-ae05-abe2537d0e2d
Status: Created
Number of Bricks: 2 x 2 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: server1:/exp1
Brick2: server2:/exp3
Brick3: server1:/exp2
Brick4: server2:/exp4
Brick5: server3:/exp5
Brick6: server4:/exp7
Brick7: server3:/exp6
Brick8: server4:/exp8
gluster> volume create test-volume stripe 2 replica 2 transport tcp \
server1:/exp1 server2:/exp3 server1:/exp2 server2:/exp4 \
server3:/exp5 server4:/exp7 server3:/exp6 server4:/exp8
Creation of volume test-volume has been successful. Please start the volume to access data.
-Brick order is corrected to ensure replication is between bricks on different nodes.
-Replica is always processed first in building the volume, regardless of whether it's before or after stripe on the command line.
-So, a 'replica 2' will create the replica between matching pairs of bricks in the order that the bricks are passed to the command. 'replica 3' will be matching trios of bricks, and so on.
Manipulating Bricks in a Volume
gluster> volume add-brick my-dist-vol server4:/brick4
gluster> volume remove-brick my-dist-vol server2:/brick2 start
gluster> volume remove-brick my-dist-vol server2:/brick2 status
Node          Rebalanced-files      size        scanned    failures    status
---------     ----------------      --------    -------    --------    -----------
localhost            16             16777216       52         0        in progress
192.168.1.1          13             16723211       47         0        in progress
gluster> volume remove-brick my-dist-vol server2:/brick2 commit
gluster> volume rebalance my-dist-vol fix-layout start
gluster> volume rebalance my-dist-vol start
gluster> volume rebalance my-dist-vol status
Node            Rebalanced-files      size      scanned    failures    status
---------       ----------------      ------    -------    --------    ---------
localhost             112             15674       170         0        completed
10.16.156.72          140              3423       321         2        completed
Must add and remove in multiples of the replica and stripe counts
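For example, expanding a hypothetical replica 2 volume (such as the my-dist-rep-vol sketched earlier) requires bricks in pairs:

gluster> volume add-brick my-dist-rep-vol server6:/brick6 server7:/brick7
gluster> volume rebalance my-dist-rep-vol start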
Migrating Data / Replacing Bricks
gluster> volume replace-brick my-dist-vol server3:/brick3 server5:/brick5 start
gluster> volume replace-brick my-dist-vol server3:/brick3 server5:/brick5 status
Current File = /usr/src/linux-headers-2.6.31-14/block/Makefile
Number of files migrated = 10567
Migration complete
gluster> volume replace-brick my-dist-vol server3:/brick3 server5:/brick5 commit
Volume Options
Auth:
gluster> volume set my-dist-vol auth.allow 192.168.1.*
gluster> volume set my-dist-vol auth.reject 10.*

NFS:
gluster> volume set my-dist-vol nfs.volume-access read-only
gluster> volume set my-dist-vol nfs.disable on

Other advanced options:
gluster> volume set my-dist-vol features.read-only on
gluster> volume set my-dist-vol performance.cache-size 67108864
Volume Top Command
gluster> volume top my-dist-vol read brick server3:/brick3 list-cnt 3
Brick: server:/export/dir1
==========Read file stats========

read call count    filename
        116        /clients/client0/~dmtmp/SEED/LARGE.FIL
         64        /clients/client0/~dmtmp/SEED/MEDIUM.FIL
         54        /clients/client2/~dmtmp/SEED/LARGE.FIL
Many top commands are available for analysis of files, directories, and bricks
Read and write performance test commands available
Perform active dd tests and measure throughput
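A hedged example of the built-in throughput tests, where bs and count are the dd parameters (exact option syntax may vary slightly between releases):

gluster> volume top my-dist-vol read-perf bs 256 count 1 brick server3:/brick3 list-cnt 10
gluster> volume top my-dist-vol write-perf bs 256 count 1 brick server3:/brick3 list-cnt 10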
Volume Profiling
gluster> volume profile my-dist-vol start
gluster> volume profile my-dist-vol info
Brick: Test:/export/2
Cumulative Stats:

Block Size:      1b+     32b+     64b+
Read:              0        0        0
Write:           908       28        8
...
%-latency    Avg-latency    Min-Latency    Max-Latency    calls    Fop
___________________________________________________________
     4.82        1132.28          21.00      800970.00     4575    WRITE
     5.70         156.47           9.00      665085.00    39163    READDIRP
    11.35         315.02           9.00     1433947.00    38698    LOOKUP
    11.88        1729.34          21.00     2569638.00     7382    FXATTROP
    47.35      104235.02        2485.00     7789367.00      488    FSYNC
------------------
Duration : 335
BytesRead : 94505058
BytesWritten : 195571980
??where is this stored, and how does it impact performance when on??
Common Pitfalls
Split-Brain Syndrome
Communication lost between replicated peers
Clients write separately to multiple copies of a file
No automatic fix
It may be subjective which copy is right; ALL may be!
Admin determines the bad copy and removes it
Self-heal will correct the volume. Trigger a recursive stat to initiate.
Proactive self-healing in RHS 2.0 / GlusterFS 3.3
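A hedged sketch of the related commands, assuming a replicated volume my-rep-vol mounted at /mnt/gluster; the find/stat walk is the pre-3.3 way to trigger healing from a client, while the heal commands apply to RHS 2.0 / GlusterFS 3.3 and later:

# find /mnt/gluster -noleaf -print0 | xargs --null stat > /dev/null
gluster> volume heal my-rep-vol
gluster> volume heal my-rep-vol info
gluster> volume heal my-rep-vol info split-brain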
Quorum Enforcement
Disallows writes (EROFS) on non-quorum peers
Significantly reduces files affected by split-brain
Preferred when data integrity is the priority
Not preferred when application integrity is the priority
NEW! Server-Side Quorum
In GlusterFS 3.3
Client-side
Replica set level
NOW in GlusterFS 3.4
Server-side
Cluster-level (glusterd)
This quorum is on the server side, i.e. in glusterd. Whenever the glusterd on a machine observes that the quorum is not met, it brings down the bricks to prevent data split-brains. When the network connections are brought back up and the quorum is restored, the bricks in the volume are brought back up. In glusterd, commands to configure the quorum values are provided. When the quorum is not met for a volume, any commands that update the volume configuration or peer addition or detach are not allowed.
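A hedged example of enabling both quorum types, assuming a replicated volume my-rep-vol (cluster.server-quorum-ratio is a cluster-wide setting applied to 'all'):

gluster> volume set my-rep-vol cluster.quorum-type auto
gluster> volume set my-rep-vol cluster.server-quorum-type server
gluster> volume set all cluster.server-quorum-ratio 51%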
Do it!
Do it!
Build a test environment in VMs in just minutes!
Get the bits:
Fedora 19 has GlusterFS packages natively
RHS 2.1 ISO available on Red Hat Portal
Go upstream: www.gluster.org
Amazon Web Services (AWS)
Amazon Linux AMI includes GlusterFS packages
RHS AMI is available
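A hedged quick start on Fedora 19, assuming the native packages (glusterfs-server provides the glusterd service):

# yum install glusterfs-server
# systemctl enable glusterd
# systemctl start glusterd
# gluster peer status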
Thank You!
RHS: www.redhat.com/storage/
GlusterFS: www.gluster.org
TAM: access.redhat.com/support/offerings/tam/
@Glusterorg (Gluster)
@RedHatStorage (Red Hat Storage)
Slides Available at: http://people.redhat.com/dblack
Question notes:
-Vs. CEPH:
-CEPH is object-based at its core, with distributed filesystem as a layered function. GlusterFS is file-based at its core, with object methods (UFO) as a layered function.
-CEPH stores underlying data in files, but outside the CEPH constructs they are meaningless. Except for striping, GlusterFS files maintain complete integrity at the brick level.
-With CEPH, you define storage resources and data architecture (replication) separately, and CEPH actively and dynamically manages the mapping of the architecture to the storage. With GlusterFS, you manually manage both the storage resources and the data architecture.