1
The Extent of GFS2
Dr Steven Whitehouse
22/23 March 2017
Linux Foundation Vault 2017
2
Topics
● Quick tour of GFS2
● Where have we got to?
● Where are we going?
3
What is GFS2?
● 64 bit, symmetric cluster filesystem
● Uses DLM for locking
● Abstracted through glocks – cache control mechanism
● Inodes are single blocks (unit of caching)
● Equal height metadata tree using pointer blocks
● Directories use “extensible hashing”
● Hidden metadata filesystem contains system data
● One journal per node
● Also quota & statfs data
4
Where did GFS2 come from?
● GFS started out as a research project at the University of Minnesota
● Initial purpose was storage of ocean current simulation data
● Spun out into Sistina Software circa 2000
● Red Hat bought Sistina Software in Dec 2003
● GFS2 was a development from GFS
● Very similar on-disk structures – allows in-place upgrade
● Code clean-up & some improvements
● Went upstream in 2.6.19 (Nov 2006)
5
Where is GFS2 used today?
● Lots of different applications…
● Web/FTP servers
● Backup solutions
● Message queues (IBM Websphere, Tibco MQ, ActiveMQ)
● Various SAS workloads
● and many more...
● Many different sectors
● Financial, IT, Retail, Manufacturing, ...
6
What workloads is GFS2 best at?
● Small numbers of nodes (<=16)
● When (almost) POSIX compliance is required
● When the workload can be mostly localized
● This point is very important for performance
● When HA is an important consideration
● Avoid:
● Highly non-local workloads
● Polling the filesystem for inter-node communication
7
Recent Developments
8
Resource Group Scalability (1)
● Like ext3 block group / XFS allocation group
● Subsection of the filesystem with allocation bitmap
● Internally held in an rbtree for quick access
● At allocation time we have a choice of which rgrp to use
● We want locality with previous allocations
● We want to avoid inter-node contention (a simplified selection loop is sketched below)
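A minimal userspace sketch of that choice, assuming illustrative names throughout; the linked list stands in for the kernel's rbtree of resource groups, and this is not the actual GFS2 code:

    #include <stdbool.h>
    #include <stdint.h>

    struct rgrp {
        uint64_t first_block;   /* first block covered by this rgrp */
        uint64_t block_count;   /* blocks covered by its bitmap */
        uint64_t free_blocks;   /* free space remaining */
        bool     contended;     /* heuristic: another node is using it */
        struct rgrp *next;      /* stand-in for the in-kernel rbtree */
    };

    /* Pick an rgrp for an allocation whose ideal position ("goal") is
     * just after the inode's previous allocation; otherwise fall back
     * to the first uncontended rgrp with enough free space. */
    static struct rgrp *pick_rgrp(struct rgrp *list, uint64_t goal,
                                  uint64_t needed)
    {
        struct rgrp *fallback = NULL;

        for (struct rgrp *rg = list; rg; rg = rg->next) {
            if (rg->free_blocks < needed)
                continue;
            /* Locality: prefer the rgrp that contains the goal block. */
            if (goal >= rg->first_block &&
                goal < rg->first_block + rg->block_count)
                return rg;
            /* Contention avoidance: remember an unloaded alternative. */
            if (!rg->contended && !fallback)
                fallback = rg;
        }
        return fallback;
    }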
9
Resource Group Scalability (2 - locality)
● Each (in core) resource group has a list of block reservations associated with it
● The reservations are created at write or page_mkwrite time, where a size hint is calculated
● A node-local reservation is then created for a number of blocks, even though fewer may be allocated
● Future allocations will try to use the reservation, before looking elsewhere for space
● Avoids the multiple streaming writes issue
● A big performance improvement for that specific case
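A sketch of the reservation idea under the same caveat (illustrative names, simplified logic):

    #include <stdint.h>

    /* A node-local reservation: a run of blocks claimed for one inode,
     * usually larger than the current write actually needs. */
    struct reservation {
        uint64_t start;   /* first still-reserved block */
        uint64_t count;   /* blocks remaining in the reservation */
    };

    /* Satisfy an allocation from the inode's reservation if possible;
     * returning 0 means "fall back to a bitmap search elsewhere". */
    static uint64_t alloc_from_reservation(struct reservation *rsv,
                                           uint64_t want)
    {
        if (rsv->count < want)
            return 0;
        uint64_t block = rsv->start;
        rsv->start += want;
        rsv->count -= want;
        return block;
    }

Because each writer consumes its own reservation, several files streaming into the same rgrp no longer interleave their blocks.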
10
Resource Group Scalability (3 – inter-node)
● We want to avoid inter-node contention on rgrps● How hard should we try to allocate from a particuar
rgrp?● Orlov allocator (as per ext3) gives first level of
contention avoidance
● The second level is given by lock stats – did we have to wait longer than average for this rgrp? If so it might be contended
● If we have a reservation we ignore the lock stats, to avoid excessive fragmentation
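The two heuristics combine roughly like this (again a hedged sketch, not the upstream code):

    #include <stdbool.h>
    #include <stdint.h>

    /* Treat an rgrp as contended when our last DLM wait on it was
     * longer than the smoothed average wait, unless we already hold a
     * block reservation there. */
    static bool rgrp_contended(uint64_t last_wait_ns,
                               uint64_t avg_wait_ns,
                               bool have_reservation)
    {
        if (have_reservation)
            return false;   /* keep locality, avoid fragmentation */
        return last_wait_ns > avg_wait_ns;
    }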
11
Glock scalability
● Glocks are kept in a single big hash table
● Indexed by type and glock number (inode/rgrp number)
● Lookups mostly occur on inode creation/lookup
● Glock references are kept by inodes for their lifetime
● Recent change to use rhashtable improves scalability (sketched below)
● Keeps RCU locking & lockref advantages
● Scales according to number of glocks/inodes
● Big performance improvement with lots of inodes (>1m)
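In outline, the change keys each glock by (number, type) in a resizable hash table. This is a simplified sketch after the pattern in fs/gfs2/glock.c, with shortened, illustrative names rather than the exact upstream identifiers:

    #include <linux/rhashtable.h>

    struct lockname {
        u64 number;            /* inode or rgrp number */
        unsigned int type;     /* glock type */
    };

    struct glock {
        struct lockname name;
        struct rhash_head node;
        /* ... holders, state, reference count ... */
    };

    static struct rhashtable glock_table;   /* rhashtable_init() at mount */

    static const struct rhashtable_params glock_ht_params = {
        .key_len     = sizeof(struct lockname),
        .key_offset  = offsetof(struct glock, name),
        .head_offset = offsetof(struct glock, node),
    };

    /* rhashtable grows and shrinks itself under RCU, so this lookup
     * stays lock-free however many glocks the node is caching. */
    static struct glock *find_glock(const struct lockname *name)
    {
        return rhashtable_lookup_fast(&glock_table, name,
                                      glock_ht_params);
    }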
12
Xattrs & SELinux (1)
● In GFS2 xattrs are stored in a separate block to the inode
● Two disk reads may be required for each inode
● Solution:
● If we create xattrs at inode creation time (e.g. for SELinux labels) then we can allocate 2 blocks (inode & xattr) contiguously
● We then mark the directory entry, so we know that there are two blocks to read, not just one.
● When we read the inode, we can then issue a single read for both blocks
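A hypothetical userspace model of the read side; read_blocks() is an assumed helper that issues one contiguous I/O, not a GFS2 function:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Assumed helper: read 'count' consecutive blocks in one I/O. */
    int read_blocks(uint64_t first_block, size_t count,
                    void *buf, size_t block_size);

    /* If the directory entry says the inode and its xattr block were
     * allocated together, fetch both in a single read. */
    static int read_inode(uint64_t inode_block, bool adjacent_xattr,
                          void *buf, size_t block_size)
    {
        return read_blocks(inode_block, adjacent_xattr ? 2 : 1,
                           buf, block_size);
    }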
13
Xattrs & SELinux (2)
● SELinux has historically not been cluster coherent
● No way for GFS2 to invalidate SELinux labels
● This is now fixed upstream, so SELinux can be used in a fully cluster coherent manner
● Combined with the xattr performance improvement, SELinux is now a viable option for GFS2
14
Multi-threaded streaming scalability
● Journal can be a source of contention with multi-threaded workloads
● A recent patch avoids taking the journal lock when the block in question is already in the journal
● For streaming workloads this is very likely to be the case for the inode and some of the indirect blocks, for example
● Improvements seen of around 50% with fast storage
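The shape of the optimisation is a classic double-checked test; these are hypothetical types, not the actual patch:

    #include <pthread.h>
    #include <stdbool.h>

    struct buffer {
        bool in_journal;            /* already on the journal's list? */
        struct buffer *next;
    };

    struct journal {
        pthread_mutex_t lock;
        struct buffer *buffers;
    };

    /* For streaming writes the inode and indirect blocks are almost
     * always journaled already, so the lock is usually skipped. */
    static void add_to_journal(struct journal *jnl, struct buffer *bh)
    {
        if (bh->in_journal)
            return;                     /* fast path: no journal lock */

        pthread_mutex_lock(&jnl->lock);
        if (!bh->in_journal) {          /* re-check under the lock */
            bh->next = jnl->buffers;
            jnl->buffers = bh;
            bh->in_journal = true;
        }
        pthread_mutex_unlock(&jnl->lock);
    }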
15
Fsck.gfs2 performance improvements
● As filesystems get larger, fsck time becomes a major issue
● The design of GFS2’s fsck is based on multiple passes
● The amount of memory used for storage of state has been reduced
● Readahead has been added (illustrated below)
● pass1c has been removed (combined with pass1)
● Work is continuing on improvements in this area
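As an illustration of the readahead point only (not the actual fsck.gfs2 code), a userspace checker can hint its next scan range to the kernel:

    #include <fcntl.h>

    /* Ask the kernel to start reading the blocks the next pass will
     * scan, so disk I/O overlaps with processing the current ones. */
    static void prefetch_blocks(int fd, off_t first_block, off_t count,
                                off_t block_size)
    {
        posix_fadvise(fd, first_block * block_size,
                      count * block_size, POSIX_FADV_WILLNEED);
    }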
16
What’s next?
17
DLM Lock Timing Analysis
● Using the gfs2_glock_lock_time tracepoint (see the example below)
● The tdiff field reports the time of each DLM lock request
● srtt, srttb, srttvar, srttvarb
● Smoothed round trip times (b = blocking) and variance
● sirt, sirtvar
● Smoothed inter-request times
● dcount – Number of DLM requests
● qcount – Number of (local) glock requests
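For example, the tracepoint can be enabled through tracefs and its events streamed; this small program assumes tracefs is mounted at /sys/kernel/debug/tracing:

    #include <stdio.h>

    int main(void)
    {
        /* Enable just the gfs2_glock_lock_time events. */
        FILE *f = fopen("/sys/kernel/debug/tracing/events/gfs2/"
                        "gfs2_glock_lock_time/enable", "w");
        if (!f) { perror("enable"); return 1; }
        fputs("1\n", f);
        fclose(f);

        /* Stream events; each line carries the fields listed above
         * (tdiff, srtt/srttb, sirt, dcount, qcount, ...). */
        f = fopen("/sys/kernel/debug/tracing/trace_pipe", "r");
        if (!f) { perror("trace_pipe"); return 1; }
        for (int c; (c = fgetc(f)) != EOF; )
            putchar(c);
        fclose(f);
        return 0;
    }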
18
19
20
Journal Flushing (1)
● This can take a long time
● Increases glock release latency
● Stops new transactions while journal is being flushed
● Causes:
● Ordered write mode means data is flushed before the journal
● Inability to start transactions while a journal flush is in progress
21
Journal Flushing (2)
● Things are not all bad
● We have streamlined the journal I/O already
● Builds large bio I/Os – very efficient
● Works well under memory pressure
● Design allows adding new data and being backwards compatible
● Some space left in data structures, so lots of options
● A big win would be to eliminate the ordered write list flushing
22
Ordered write list
● A list of inodes to which data has been written
● At journal flush time:
● Sorts the ordered write list by inode number
● Writes back the data for each inode
● Waits for the data for each inode
● Then flushes the journal
● Can we avoid this?
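The sequence above as a simplified model; the list type and helper functions are illustrative stand-ins for the real writeback machinery:

    #include <stdint.h>

    struct ordered_inode {
        uint64_t inode_number;
        struct ordered_inode *next;
    };

    struct journal {
        struct ordered_inode *ordered;   /* inodes with new data */
    };

    /* Assumed helpers standing in for the real implementation. */
    void sort_by_inode_number(struct ordered_inode **list);
    void start_writeback(struct ordered_inode *ip);
    void wait_for_writeback(struct ordered_inode *ip);
    void write_journal_records(struct journal *jnl);

    /* Ordered mode: all data must be on disk before the journal
     * records that reference it, hence the write+wait passes. */
    static void journal_flush(struct journal *jnl)
    {
        sort_by_inode_number(&jnl->ordered);

        for (struct ordered_inode *ip = jnl->ordered; ip; ip = ip->next)
            start_writeback(ip);            /* issue the I/O... */
        for (struct ordered_inode *ip = jnl->ordered; ip; ip = ip->next)
            wait_for_writeback(ip);         /* ...then wait for it */

        write_journal_records(jnl);         /* only now flush the log */
    }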
23
Introducing extents
● One potential solution to the ordered write issue
● Add additional information to the journal indicating newly allocated extents
● Then we can avoid the pre-journal-flush writeback
● Backwards compatibility
● Yes, from journal PoV
● No, in case of mixed clusters (old & new)
● Could provide a way in which to introduce more general support for extents into GFS2
24
iomap
● Recently introduced upstream
● Would enable multi-page write
● Spread locking overhead across multiple pages
● Performance win for streaming writes
● Also to fix FIEMAP issue
● Improve efficiency of mapping holes in sparse files
● One nice side effect
● Should be possible to write a generic SEEK_DATA/SEEK_HOLE for iomap-based filesystems
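A hedged sketch of what such a generic helper could look like, using a simplified userspace model of the iomap mapping callback rather than the real in-kernel API:

    #include <stdint.h>

    enum iomap_type { IOMAP_HOLE, IOMAP_MAPPED, IOMAP_UNWRITTEN };

    struct iomap {
        uint64_t offset;       /* file offset the mapping starts at */
        uint64_t length;       /* bytes covered by the mapping */
        enum iomap_type type;
    };

    /* Assumed per-filesystem mapping callback, after the upstream
     * iomap_begin() pattern. */
    int iomap_begin(void *inode, uint64_t pos, struct iomap *map);

    /* Walk the mappings from 'offset' and return the first hole. */
    static int64_t seek_hole(void *inode, uint64_t offset, uint64_t size)
    {
        struct iomap map;

        while (offset < size) {
            if (iomap_begin(inode, offset, &map) != 0)
                return -1;
            if (map.type == IOMAP_HOLE)
                return (int64_t)offset;
            offset = map.offset + map.length;   /* skip this extent */
        }
        return (int64_t)size;   /* implicit hole at end of file */
    }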
25
Thank you!