1
The Extent of GFS2
Dr Steven Whitehouse
22/23 March 2017
Linux Foundation Vault 2017
2
Topics
● Quick tour of GFS2
● Where have we got to?
● Where are we going?
3
What is GFS2?
● 64 bit, symmetric cluster filesystem
● Uses DLM for locking
● Abstracted through glocks – cache control mechanism
● Inodes are single blocks (unit of caching)
● Equal height metadata tree using pointer blocks
● Directories use “extensible hashing”
● Hidden metadata filesystem contains system data
● One journal per node
● Also quota & statfs data
4
Where did GFS2 come from?
● GFS started out as a research project at the University of Minnesota
● Initial purpose was storage of ocean current simulation data
● Spun out into Sistina Software circa 2000
● Red Hat bought Sistina Software in Dec 2003
● GFS2 was a development from GFS
● Very similar on-disk structures – allows in-place upgrade
● Code clean-up & some improvements
● Went upstream in 2.6.19 (Nov 2006)
5
Where is GFS2 used today?
● Lots of different applications…
● Web/FTP servers
● Backup solutions
● Message queues (IBM Websphere, Tibco MQ, ActiveMQ)
● Various SAS workloads
● and many more...
● Many different sectors
● Financial, IT, Retail, Manufacturing, ...
6
What workloads is GFS2 best at?
● Small numbers of nodes (<=16)
● When (almost) POSIX compliance is required
● When the workload can be mostly localized
● This point is very important for performance
● When HA is an important consideration
● Avoid:
● Highly non-local workloads
● Polling the filesystem for inter-node communication
7
Recent Developments
8
Resource Group Scalability (1)
● Like ext3 block group / XFS allocation group
● Subsection of the filesystem with allocation bitmap
● Internally held in an rbtree for quick access
● At allocation time we have a choice of which rgrp to use
● We want locality with previous allocations
● We want to avoid inter-node contention (a simplified selection loop is sketched below)
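A minimal userspace sketch of that choice, assuming illustrative names throughout; the linked list stands in for the kernel's rbtree of resource groups, and this is not the actual GFS2 code:

    #include <stdbool.h>
    #include <stdint.h>

    struct rgrp {
        uint64_t first_block;   /* first block covered by this rgrp */
        uint64_t block_count;   /* blocks covered by its bitmap */
        uint64_t free_blocks;   /* free space remaining */
        bool     contended;     /* heuristic: another node is using it */
        struct rgrp *next;      /* stand-in for the in-kernel rbtree */
    };

    /* Pick an rgrp for an allocation whose ideal position ("goal") is
     * just after the inode's previous allocation; otherwise fall back
     * to the first uncontended rgrp with enough free space. */
    static struct rgrp *pick_rgrp(struct rgrp *list, uint64_t goal,
                                  uint64_t needed)
    {
        struct rgrp *fallback = NULL;

        for (struct rgrp *rg = list; rg; rg = rg->next) {
            if (rg->free_blocks < needed)
                continue;
            /* Locality: prefer the rgrp that contains the goal block. */
            if (goal >= rg->first_block &&
                goal < rg->first_block + rg->block_count)
                return rg;
            /* Contention avoidance: remember an unloaded alternative. */
            if (!rg->contended && !fallback)
                fallback = rg;
        }
        return fallback;
    }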
9
Resource Group Scalability (2 - locality)
● Each (in core) resource group has a list of block reservations associated with it
● The reservations are created at write or page_mkwrite time, where a size hint is calculated
● A node-local reservation is then created for a number of blocks, even though fewer may be allocated
● Future allocations will try to use the reservation, before looking elsewhere for space
● Avoids the multiple streaming writes issue
● A big performance improvement for that specific case
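A sketch of the reservation idea under the same caveat (illustrative names, simplified logic):

    #include <stdint.h>

    /* A node-local reservation: a run of blocks claimed for one inode,
     * usually larger than the current write actually needs. */
    struct reservation {
        uint64_t start;   /* first still-reserved block */
        uint64_t count;   /* blocks remaining in the reservation */
    };

    /* Satisfy an allocation from the inode's reservation if possible;
     * returning 0 means "fall back to a bitmap search elsewhere". */
    static uint64_t alloc_from_reservation(struct reservation *rsv,
                                           uint64_t want)
    {
        if (rsv->count < want)
            return 0;
        uint64_t block = rsv->start;
        rsv->start += want;
        rsv->count -= want;
        return block;
    }

Because each writer consumes its own reservation, several files streaming into the same rgrp no longer interleave their blocks.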
10
Resource Group Scalability (3 – inter-node)
● We want to avoid inter-node contention on rgrps● How hard should we try to allocate from a particuar
rgrp?● Orlov allocator (as per ext3) gives first level of
contention avoidance
● The second level is given by lock stats – did we have to wait longer than average for this rgrp? If so it might be contended
● If we have a reservation we ignore the lock stats, to avoid excessive fragmentation
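The two heuristics combine roughly like this (again a hedged sketch, not the upstream code):

    #include <stdbool.h>
    #include <stdint.h>

    /* Treat an rgrp as contended when our last DLM wait on it was
     * longer than the smoothed average wait, unless we already hold a
     * block reservation there. */
    static bool rgrp_contended(uint64_t last_wait_ns,
                               uint64_t avg_wait_ns,
                               bool have_reservation)
    {
        if (have_reservation)
            return false;   /* keep locality, avoid fragmentation */
        return last_wait_ns > avg_wait_ns;
    }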
11
Glock scalability
● Glocks are kept in a single big hash table
● Indexed by type and glock number (inode/rgrp number)
● Lookups mostly occur on inode creation/lookup
● Glock references are kept by inodes for their lifetime
● Recent change to use rhashtable improves scalability (sketched below)
● Keeps RCU locking & lockref advantages
● Scales according to number of glocks/inodes
● Big performance improvement with lots of inodes (>1m)
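In outline, the change keys each glock by (number, type) in a resizable hash table. This is a simplified sketch after the pattern in fs/gfs2/glock.c, with shortened, illustrative names rather than the exact upstream identifiers:

    #include <linux/rhashtable.h>

    struct lockname {
        u64 number;            /* inode or rgrp number */
        unsigned int type;     /* glock type */
    };

    struct glock {
        struct lockname name;
        struct rhash_head node;
        /* ... holders, state, reference count ... */
    };

    static struct rhashtable glock_table;   /* rhashtable_init() at mount */

    static const struct rhashtable_params glock_ht_params = {
        .key_len     = sizeof(struct lockname),
        .key_offset  = offsetof(struct glock, name),
        .head_offset = offsetof(struct glock, node),
    };

    /* rhashtable grows and shrinks itself under RCU, so this lookup
     * stays lock-free however many glocks the node is caching. */
    static struct glock *find_glock(const struct lockname *name)
    {
        return rhashtable_lookup_fast(&glock_table, name,
                                      glock_ht_params);
    }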
12
Xattrs & SELinux (1)
● In GFS2 xattrs are stored in a separate block to the inode
● Two disk reads may be required for each inode
● Solution:
● If we create xattrs at inode creation time (e.g. for SELinux labels) then we can allocate 2 blocks (inode & xattr) contiguously
● We then mark the directory entry, so we know that there are two blocks to read, not just one.
● When we read the inode, we can then issue a single read for both blocks
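A hypothetical userspace model of the read side; read_blocks() is an assumed helper that issues one contiguous I/O, not a GFS2 function:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Assumed helper: read 'count' consecutive blocks in one I/O. */
    int read_blocks(uint64_t first_block, size_t count,
                    void *buf, size_t block_size);

    /* If the directory entry says the inode and its xattr block were
     * allocated together, fetch both in a single read. */
    static int read_inode(uint64_t inode_block, bool adjacent_xattr,
                          void *buf, size_t block_size)
    {
        return read_blocks(inode_block, adjacent_xattr ? 2 : 1,
                           buf, block_size);
    }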
13
Xattrs & SELinux (2)
● SELinux has historically not been cluster coherent
● No way for GFS2 to invalidate SELinux labels
● This is now fixed upstream, so SELinux can be used in a fully cluster coherent manner
● Combined with the xattr performance improvement, SELinux is now a viable option for GFS2
14
Multi-threaded streaming scalability
● Journal can be a source of contention with multi-threaded workloads
● A recent patch avoids taking the journal lock when the block in question is already in the journal
● For streaming workloads this is very likely to be the case for the inode and some of the indirect blocks, for example
● Improvements seen of around 50% with fast storage
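The shape of the optimisation is a classic double-checked test; these are hypothetical types, not the actual patch:

    #include <pthread.h>
    #include <stdbool.h>

    struct buffer {
        bool in_journal;            /* already on the journal's list? */
        struct buffer *next;
    };

    struct journal {
        pthread_mutex_t lock;
        struct buffer *buffers;
    };

    /* For streaming writes the inode and indirect blocks are almost
     * always journaled already, so the lock is usually skipped. */
    static void add_to_journal(struct journal *jnl, struct buffer *bh)
    {
        if (bh->in_journal)
            return;                     /* fast path: no journal lock */

        pthread_mutex_lock(&jnl->lock);
        if (!bh->in_journal) {          /* re-check under the lock */
            bh->next = jnl->buffers;
            jnl->buffers = bh;
            bh->in_journal = true;
        }
        pthread_mutex_unlock(&jnl->lock);
    }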
15
Fsck.gfs2 performance improvements
● As filesystems get larger, fsck time becomes a major issue
● The design of GFS2’s fsck is based on multiple passes
● The amount of memory used for storage of state has been reduced
● Readahead has been added (illustrated below)
● pass1c has been removed (combined with pass1)
● Work is continuing on improvements in this area
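As an illustration of the readahead point only (not the actual fsck.gfs2 code), a userspace checker can hint its next scan range to the kernel:

    #include <fcntl.h>

    /* Ask the kernel to start reading the blocks the next pass will
     * scan, so disk I/O overlaps with processing the current ones. */
    static void prefetch_blocks(int fd, off_t first_block, off_t count,
                                off_t block_size)
    {
        posix_fadvise(fd, first_block * block_size,
                      count * block_size, POSIX_FADV_WILLNEED);
    }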
16
What’s next?
17
DLM Lock Timing Analysis
● Using the gfs2_glock_lock_time tracepoint (see the example below)
● The tdiff field reports the time of each DLM lock request
● srtt, srttb, srttvar, srttvarb
● Smoothed round trip times (b = blocking) and variance
● sirt, sirtvar
● Smoothed inter-request times
● dcount – Number of DLM requests
● qcount – Number of (local) glock requests
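For example, the tracepoint can be enabled through tracefs and its events streamed; this small program assumes tracefs is mounted at /sys/kernel/debug/tracing:

    #include <stdio.h>

    int main(void)
    {
        /* Enable just the gfs2_glock_lock_time events. */
        FILE *f = fopen("/sys/kernel/debug/tracing/events/gfs2/"
                        "gfs2_glock_lock_time/enable", "w");
        if (!f) { perror("enable"); return 1; }
        fputs("1\n", f);
        fclose(f);

        /* Stream events; each line carries the fields listed above
         * (tdiff, srtt/srttb, sirt, dcount, qcount, ...). */
        f = fopen("/sys/kernel/debug/tracing/trace_pipe", "r");
        if (!f) { perror("trace_pipe"); return 1; }
        for (int c; (c = fgetc(f)) != EOF; )
            putchar(c);
        fclose(f);
        return 0;
    }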
18
19
20
Journal Flushing (1)
● This can take a long time
● Increases glock release latency
● Stops new transactions while journal is being flushed
● Causes:
● Ordered write mode means data is flushed before the journal
● Inability to start transactions while a journal flush is in progress
21
Journal Flushing (2)
● Things are not all bad
● We have streamlined the journal I/O already
● Builds large bio I/Os – very efficient
● Works well under memory pressure
● Design allows adding new data and being backwards compatible
● Some space left in data structures, so lots of options
● A big win would be to eliminate the ordered write list flushing
22
Ordered write list
● A list of inodes to which data has been written
● At journal flush time:
● Sorts the ordered write list by inode number
● Writes back the data for each inode
● Waits for the data for each inode
● Then flushes the journal
● Can we avoid this?
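The sequence above as a simplified model; the list type and helper functions are illustrative stand-ins for the real writeback machinery:

    #include <stdint.h>

    struct ordered_inode {
        uint64_t inode_number;
        struct ordered_inode *next;
    };

    struct journal {
        struct ordered_inode *ordered;   /* inodes with new data */
    };

    /* Assumed helpers standing in for the real implementation. */
    void sort_by_inode_number(struct ordered_inode **list);
    void start_writeback(struct ordered_inode *ip);
    void wait_for_writeback(struct ordered_inode *ip);
    void write_journal_records(struct journal *jnl);

    /* Ordered mode: all data must be on disk before the journal
     * records that reference it, hence the write+wait passes. */
    static void journal_flush(struct journal *jnl)
    {
        sort_by_inode_number(&jnl->ordered);

        for (struct ordered_inode *ip = jnl->ordered; ip; ip = ip->next)
            start_writeback(ip);            /* issue the I/O... */
        for (struct ordered_inode *ip = jnl->ordered; ip; ip = ip->next)
            wait_for_writeback(ip);         /* ...then wait for it */

        write_journal_records(jnl);         /* only now flush the log */
    }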
23
Introducing extents
● One potential solution to the ordered write issue
● Add additional information to the journal indicating newly allocated extents
● Then we can avoid the pre-journal-flush writeback
● Backwards compatibility
● Yes, from journal PoV
● No, in case of mixed clusters (old & new)
● Could provide a way in which to introduce more general support for extents into GFS2
24
iomap
● Recently introduced upstream
● Would enable multi-page write
● Spread locking overhead across multiple pages
● Performance win for streaming writes
● Also to fix FIEMAP issue
● Improve efficiency of mapping holes in sparse files
● One nice side effect
● Should be possible to write a generic SEEK_DATA/SEEK_HOLE for iomap-based filesystems
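A hedged sketch of what such a generic helper could look like, using a simplified userspace model of the iomap mapping callback rather than the real in-kernel API:

    #include <stdint.h>

    enum iomap_type { IOMAP_HOLE, IOMAP_MAPPED, IOMAP_UNWRITTEN };

    struct iomap {
        uint64_t offset;       /* file offset the mapping starts at */
        uint64_t length;       /* bytes covered by the mapping */
        enum iomap_type type;
    };

    /* Assumed per-filesystem mapping callback, after the upstream
     * iomap_begin() pattern. */
    int iomap_begin(void *inode, uint64_t pos, struct iomap *map);

    /* Walk the mappings from 'offset' and return the first hole. */
    static int64_t seek_hole(void *inode, uint64_t offset, uint64_t size)
    {
        struct iomap map;

        while (offset < size) {
            if (iomap_begin(inode, offset, &map) != 0)
                return -1;
            if (map.type == IOMAP_HOLE)
                return (int64_t)offset;
            offset = map.offset + map.length;   /* skip this extent */
        }
        return (int64_t)size;   /* implicit hole at end of file */
    }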
25
Thank you!