
General Parallel File System

GPFS Native RAID Administration and Programming Reference
Version 3 Release 4

SA23-1354-00


Note: Before using this information and the product it supports, read the information in “Notices” on page 115.

This edition applies to version 3 release 4 of IBM General Parallel File System for AIX (program number 5765-G66) with APAR IV00760, and to all subsequent fix pack levels until otherwise indicated in new editions.

Previously published descriptions of GPFS commands that are not specific to, but are related to, GPFS Native RAID are included in this information. Significant changes or additions to the text of those previously published command descriptions are indicated by a vertical line (|) to the left of the change.

GPFS Native RAID is supported only on hardware on which it has been tested and on certain levels of GPFS. For the list of supported hardware and levels of GPFS, see the GPFS FAQ topic in the GPFS library section of the IBM Cluster Information Center (http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp).

IBM welcomes your comments; see the topic “How to send your comments” on page x. When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.

© Copyright IBM Corporation 2011.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.


Contents

Figures . . . . . . . . . . . . . . . v

Tables . . . . . . . . . . . . . . . vii

About this information . . . . . ix
  Who should read this information . . . . . ix
  Conventions used in this information . . . . . ix
  Prerequisite and related information . . . . . x
  How to send your comments . . . . . x

Chapter 1. Introduction . . . . . 1
  Overview . . . . . 1
  GPFS Native RAID features . . . . . 2
    RAID codes . . . . . 2
    End-to-end checksum . . . . . 3
    Declustered RAID . . . . . 3
  Disk configurations . . . . . 5
    Recovery groups . . . . . 5
    Declustered arrays . . . . . 6
  Virtual and physical disks . . . . . 7
    Virtual disks . . . . . 7
    Physical disks . . . . . 7
    Solid-state disks . . . . . 8
  Disk hospital . . . . . 8
    Health metrics . . . . . 8
    Pdisk discovery . . . . . 8
    Disk replacement . . . . . 8

Chapter 2. Managing GPFS Native RAID . . . . . 11
  Recovery groups . . . . . 11
    Recovery group server parameters . . . . . 11
    Recovery group creation . . . . . 11
    Recovery group server failover . . . . . 12
  Pdisks . . . . . 12
    Pdisk paths . . . . . 13
    Pdisk states . . . . . 13
  Declustered arrays . . . . . 15
    Declustered array parameters . . . . . 15
    Declustered array size . . . . . 15
    Spare space . . . . . 16
  Vdisks . . . . . 16
    RAID code . . . . . 16
    Block size . . . . . 16
    Vdisk size . . . . . 17
    The log vdisk . . . . . 17
    The relationship between vdisks and NSDs . . . . . 17
  Maintenance . . . . . 17
    Disk diagnosis . . . . . 17
    Background tasks . . . . . 19
    Server failover . . . . . 19
    Data checksums . . . . . 19
    Disk replacement . . . . . 19
    Other hardware service . . . . . 20
  Overall management of GPFS Native RAID . . . . . 20
    Planning considerations for GPFS Native RAID . . . . . 20
    Monitoring GPFS Native RAID . . . . . 22
    Displaying vdisk I/O statistics . . . . . 22
    GPFS Native RAID callbacks . . . . . 23

Chapter 3. GPFS Native RAID setup and disk replacement on the IBM Power 775 Disk Enclosure . . . . . 25
  Example scenario: Configuring GPFS Native RAID recovery groups . . . . . 25
    Preparing recovery group servers . . . . . 25
    Creating recovery groups on a Power 775 Disk Enclosure . . . . . 29
  Example scenario: Replacing failed disks in a Power 775 Disk Enclosure recovery group . . . . . 36

Chapter 4. GPFS Native RAID commands . . . . . 43
  mmaddpdisk command . . . . . 44
  mmchcarrier command . . . . . 46
  mmchpdisk command . . . . . 49
  mmchrecoverygroup command . . . . . 51
  mmcrrecoverygroup command . . . . . 53
  mmcrvdisk command . . . . . 56
  mmdelpdisk command . . . . . 60
  mmdelrecoverygroup command . . . . . 62
  mmdelvdisk command . . . . . 64
  mmlspdisk command . . . . . 66
  mmlsrecoverygroup command . . . . . 69
  mmlsrecoverygroupevents command . . . . . 72
  mmlsvdisk command . . . . . 74

Chapter 5. Other GPFS commands related to GPFS Native RAID . . . . . 77
  mmaddcallback command . . . . . 78
  mmchconfig command . . . . . 86
  mmcrfs command . . . . . 95
  mmexportfs command . . . . . 102
  mmimportfs command . . . . . 104
  mmpmon command . . . . . 107

Accessibility features for GPFS . . . . . 113
  Accessibility features . . . . . 113
  Keyboard navigation . . . . . 113
  IBM and accessibility . . . . . 113

Notices . . . . . 115
  Trademarks . . . . . 116

Glossary . . . . . 117

Index . . . . . 123


Figures

1. Redundancy codes supported by GPFS Native RAID . . . . . 2
2. Conventional RAID versus declustered RAID layouts . . . . . 4
3. Lower rebuild overhead in conventional RAID versus declustered RAID . . . . . 5
4. GPFS Native RAID server and recovery groups in a ring configuration . . . . . 6
5. Minimal configuration of two GPFS Native RAID servers and one storage JBOD . . . . . 6
6. Example of declustered arrays and recovery groups in storage JBOD . . . . . 7


Tables

1. Conventions . . . . . ix
2. Pdisk states . . . . . 14
3. Background tasks . . . . . 19
4. Keywords and descriptions of values provided in the mmpmon vio_s response . . . . . 23
5. GPFS Native RAID callbacks and parameters . . . . . 24
6. NSD block size, vdisk track size, vdisk RAID code, vdisk strip size, and non-default operating system I/O size for permitted GPFS Native RAID vdisks . . . . . 26
7. GPFS Native RAID commands . . . . . 43
8. Other GPFS commands related to GPFS Native RAID . . . . . 77


About this information

This information explains how to use the commands unique to the General Parallel File System function GPFS™ Native RAID.

To find out which version of GPFS is running on a particular AIX® node, enter:

lslpp -l gpfs\*

Throughout this information you will see various command and component names beginning with the prefix mm. This is not an error. GPFS shares many components with the related products IBM® Multi-Media Server and IBM Video Charger.

Who should read this information

This information is designed for system administrators and programmers of GPFS Native RAID. To use this information, you should be familiar with the GPFS licensed product and the AIX operating system. Where necessary, some background information relating to AIX is provided. More commonly, you are referred to the appropriate documentation.

Conventions used in this information

Table 1 describes the typographic conventions used in this information. UNIX file name conventions are used throughout this information.

Note: Users of GPFS for Windows must be aware that on Windows, UNIX-style file names need to be converted appropriately. For example, the GPFS cluster configuration data is stored in the /var/mmfs/gen/mmsdrfs file. On Windows, the UNIX name space starts under the %SystemRoot%\SUA directory, so this cluster configuration file is C:\Windows\SUA\var\mmfs\gen\mmsdrfs.

Table 1. Conventions.

This table describes the typographic conventions used throughout this information unit.

Convention Usage

bold  Bold words or characters represent system elements that you must use literally, such as commands, flags, values, and selected menu options.

Depending on the context, bold typeface sometimes represents path names, directories, or file names.

bold underlined  Bold underlined keywords are defaults. These take effect if you do not specify a different keyword.

constant width Examples and information that the system displays appear in constant-width typeface.

Depending on the context, constant-width typeface sometimes represents path names, directories, or file names.

italic  • Italic words or characters represent variable values that you must supply.

• Italics are also used for information unit titles, for the first use of a glossary term, and for general emphasis in text.

<key>  Angle brackets (less-than and greater-than) enclose the name of a key on the keyboard. For example, <Enter> refers to the key on your terminal or workstation that is labeled with the word Enter.


\  In command examples, a backslash indicates that the command or coding example continues on the next line. For example:

mkcondition -r IBM.FileSystem -e "PercentTotUsed > 90" \
  -E "PercentTotUsed < 85" -m p "FileSystem space used"

{item} Braces enclose a list from which you must choose an item in format and syntax descriptions.

[item] Brackets enclose optional items in format and syntax descriptions.

<Ctrl-x>  The notation <Ctrl-x> indicates a control character sequence. For example, <Ctrl-c> means that you hold down the control key while pressing <c>.

item... Ellipses indicate that you can repeat the preceding item one or more times.

|  • In synopsis statements, vertical lines separate a list of choices. In other words, a vertical line means Or.

   • In the left margin of the document, vertical lines indicate technical changes to the information.

Prerequisite and related information

For updates to this information, see the GPFS library at (http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfsbooks.html).

For the latest support information, see the GPFS Frequently Asked Questions at (http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.html).

How to send your comments

Your feedback is important in helping us to produce accurate, high-quality information. If you have any comments about this information or any other GPFS documentation, send your comments to the following e-mail address:

[email protected]

Include the publication title and order number, and, if applicable, the specific location of the information about which you have comments (for example, a page number or a table number).

To contact the GPFS development organization, send your comments to the following e-mail address:

[email protected]


Chapter 1. Introduction

GPFS Native RAID is a software implementation of storage RAID technologies within GPFS. Using conventional dual-ported disks in a JBOD configuration, GPFS Native RAID implements sophisticated data placement and error correction algorithms to deliver high levels of storage reliability, availability, and performance. Standard GPFS file systems are created from the NSDs defined through GPFS Native RAID.

This chapter describes the basic concepts, advantages, and motivations behind GPFS Native RAID: redundancy codes; end-to-end checksums; data declustering; and administrator configuration, including recovery groups, declustered arrays, virtual disks, and virtual disk NSDs.

Overview

GPFS Native RAID integrates the functionality of an advanced storage controller into the GPFS NSD server. Unlike an external storage controller, where configuration, LUN definition, and maintenance are beyond the control of GPFS, GPFS Native RAID takes ownership of a JBOD array to directly match LUN definition, caching, and disk behavior to GPFS file system requirements.

Sophisticated data placement and error correction algorithms deliver high levels of storage reliability, availability, serviceability, and performance. GPFS Native RAID provides a variation of the GPFS network shared disk (NSD) called a virtual disk, or vdisk. Standard NSD clients transparently access the vdisk NSDs of a file system using the conventional NSD protocol.

The features of GPFS Native RAID include:
• Software RAID: GPFS Native RAID runs on standard AIX disks in a dual-ported JBOD array, which does not require external RAID storage controllers or other custom hardware RAID acceleration.
• Declustering: GPFS Native RAID distributes client data, redundancy information, and spare space uniformly across all disks of a JBOD. This distribution reduces the rebuild (disk failure recovery process) overhead compared to conventional RAID.
• Checksum: An end-to-end data integrity check, using checksums and version numbers, is maintained between the disk surface and NSD clients. The checksum algorithm uses version numbers to detect silent data corruption and lost disk writes.
• Data redundancy: GPFS Native RAID supports highly reliable 2-fault-tolerant and 3-fault-tolerant Reed-Solomon based parity codes and 3-way and 4-way replication.
• Large cache: A large cache improves read and write performance, particularly for small I/O operations.
• Arbitrarily sized disk arrays: The number of disks is not restricted to a multiple of the RAID redundancy code width, which allows flexibility in the number of disks in the RAID array.
• Multiple redundancy schemes: One disk array can support vdisks with different redundancy schemes, for example Reed-Solomon and replication codes.
• Disk hospital: A disk hospital asynchronously diagnoses faulty disks and paths, and requests replacement of disks by using past health records.
• Automatic recovery: Seamlessly and automatically recovers from primary server failure.
• Disk scrubbing: A disk scrubber automatically detects and repairs latent sector errors in the background.
• Familiar interface: Standard GPFS command syntax is used for all configuration commands, including maintaining and replacing failed disks.
• Flexible hardware configuration: Support of JBOD enclosures with multiple disks physically mounted together on removable carriers.


• Configuration and data logging: Internal configuration and small-write data are automatically logged to solid-state disks for improved performance.

GPFS Native RAID features

This section introduces three key features of GPFS Native RAID and how they work: data redundancy using RAID codes, end-to-end checksums, and declustering.

RAID codes

GPFS Native RAID automatically corrects for disk failures and other storage faults by reconstructing the unreadable data using the available data redundancy of either a Reed-Solomon code or N-way replication. GPFS Native RAID uses the reconstructed data to fulfill client operations, and in the case of disk failure, to rebuild the data onto spare space. GPFS Native RAID supports 2- and 3-fault-tolerant Reed-Solomon codes and 3-way and 4-way replication, which respectively detect and correct up to two or three concurrent faults¹. The redundancy code layouts supported by GPFS Native RAID, called tracks, are illustrated in Figure 1.

GPFS Native RAID automatically creates redundancy information depending on the configured RAID code. Using a Reed-Solomon code, GPFS Native RAID equally divides a GPFS block of user data into eight data strips and generates two or three redundant parity strips. This results in a stripe or track width of 10 or 11 strips and storage efficiency of 80% or 73% (excluding user configurable spare space for rebuild).

Using N-way replication, a GPFS data block is simply replicated N − 1 times, in effect implementing 1 + 2 and 1 + 3 redundancy codes, with the strip size equal to the GPFS block size. Thus, for every block/strip written to the disks, N replicas of that block/strip are also written. This results in track width of three or four strips and storage efficiency of 33% or 25%.

¹ An f-fault-tolerant Reed-Solomon code or a (1 + f)-way replication can survive the concurrent failure of f disks, read faults, or either. Also, if there are s equivalent spare disks in the array, an f-fault-tolerant array can survive the sequential failure of f + s disks, where disk failures occur between successful rebuild operations.


Figure 1. Redundancy codes supported by GPFS Native RAID. GPFS Native RAID supports 2- and 3-fault-tolerant Reed-Solomon codes, which partition a GPFS block into eight data strips and two or three parity strips. The N-way replication codes duplicate the GPFS block on N - 1 replica strips.


End-to-end checksum

Most implementations of RAID codes implicitly assume that disks reliably detect and report faults, hard-read errors, and other integrity problems. However, studies have shown that disks do not report some read faults and occasionally fail to write data, while actually claiming to have written the data. These errors are often referred to as silent errors, phantom-writes, dropped-writes, and off-track writes. To cover for these shortcomings, GPFS Native RAID implements an end-to-end checksum that can detect silent data corruption caused by either disks or other system components that transport or manipulate the data.

When an NSD client is writing data, a checksum of 8 bytes is calculated and appended to the data before it is transported over the network to the GPFS Native RAID server. On reception, GPFS Native RAID calculates and verifies the checksum. Then, GPFS Native RAID stores the data, a checksum, and version number to disk and logs the version number in its metadata for future verification during read.

When GPFS Native RAID reads disks to satisfy a client read operation, it compares the disk checksum against the disk data and the disk checksum version number against what is stored in its metadata. If the checksums and version numbers match, GPFS Native RAID sends the data along with a checksum to the NSD client. If the checksum or version numbers are invalid, GPFS Native RAID reconstructs the data using parity or replication and returns the reconstructed data and a newly generated checksum to the client. Thus, both silent disk read errors and lost or missing disk writes are detected and corrected.

Declustered RAID

Compared to conventional RAID, GPFS Native RAID implements a sophisticated data and spare space disk layout scheme that allows for arbitrarily sized disk arrays while also reducing the overhead to clients when recovering from disk failures. To accomplish this, GPFS Native RAID uniformly spreads or declusters user data, redundancy information, and spare space across all the disks of a declustered array. Figure 2 on page 4 compares a conventional RAID layout versus an equivalent declustered array.


As illustrated in Figure 3 on page 5, a declustered array can significantly shorten the time required to recover from a disk failure, which lowers the rebuild overhead for client applications. When a disk fails, erased data is rebuilt using all the operational disks in the declustered array, the bandwidth of which is greater than that of the fewer disks of a conventional RAID group. Furthermore, if an additional disk fault occurs during a rebuild, the number of impacted tracks requiring repair is markedly less than the previous failure and less than the constant rebuild overhead of a conventional array.

The decrease in declustered rebuild impact and client overhead can be a factor of three to four times less than a conventional RAID. Because GPFS stripes client data across all the storage nodes of a cluster, file system performance becomes less dependent upon the speed of any single rebuilding storage array.


Figure 2. Conventional RAID versus declustered RAID layouts. This figure is an example of how GPFS Native RAID improves client performance during rebuild operations by utilizing the throughput of all disks in the declustered array. This is illustrated here by comparing a conventional RAID of three arrays versus a declustered array, both using 7 disks. A conventional 1-fault-tolerant 1 + 1 replicated RAID array in the lower left is shown with three arrays of two disks each (data and replica strips) and a spare disk for rebuilding. To decluster this array, the disks are divided into seven tracks, two strips per array, as shown in the upper left. The strips from each group are then combinatorially spread across all seven disk positions, for a total of 21 virtual tracks, per the upper right. The strips of each disk position for every track are then arbitrarily allocated onto the disks of the declustered array of the lower right (in this case, by vertically sliding down and compacting the strips from above). The spare strips are uniformly inserted, one per disk.


Disk configurations

This section describes recovery group and declustered array configurations.

Recovery groups

GPFS Native RAID divides disks into recovery groups where each is physically connected to two servers: primary and backup. All accesses to any of the disks of a recovery group are made through the active server of the recovery group, either the primary or backup.

Building on the inherent NSD failover capabilities of GPFS, when a GPFS Native RAID server stops operating because of a hardware fault, software fault, or normal shutdown, the backup GPFS Native RAID server seamlessly takes over control of the associated disks of its recovery groups.

Typically, a JBOD array is divided into two recovery groups controlled by different primary GPFS Native RAID servers. If the primary server of a recovery group fails, control automatically switches over to its backup server. Within a typical JBOD, the primary server for a recovery group is the backup server for the other recovery group.

Figure 4 on page 6 illustrates the ring configuration where GPFS Native RAID servers and storage JBODs alternate around a loop. A particular GPFS Native RAID server is connected to two adjacent storage JBODs and vice versa. The ratio of GPFS Native RAID servers to storage JBODs is thus one-to-one. Load on servers increases by 50% when a server fails.


Figure 3. Lower rebuild overhead in conventional RAID versus declustered RAID. When a single disk fails in the 1-fault-tolerant 1 + 1 conventional array on the left, the redundant disk is read and copied onto the spare disk, which requires a throughput of 7 strip I/O operations. When a disk fails in the declustered array, all replica strips of the six impacted tracks are read from the surviving six disks and then written to six spare strips, for a throughput of 2 strip I/O operations. The bar chart illustrates disk read and write I/O throughput during the rebuild operations.


For small configurations, Figure 5 illustrates a setup with two GPFS Native RAID servers connected to one storage JBOD. For handling server failures, this configuration can be less efficient for large clusters because it requires 2 × N servers each capable of serving two recovery groups, where N is the number of JBOD arrays. Conversely, the ring configuration requires 1 × N servers each capable of serving three recovery groups.

Declustered arrays

A declustered array is a subset of the physical disks (pdisks) in a recovery group across which data, redundancy information, and spare space are declustered. The number of disks in a declustered array is determined by the RAID code-width of the vdisks that will be housed in the declustered array. For more information, see “Virtual disks” on page 7. There can be one or more declustered arrays per recovery group. Figure 6 on page 7 illustrates a storage JBOD with two recovery groups, each with four declustered arrays.

A declustered array can hold one or more vdisks. Since redundancy codes are associated with vdisks, a declustered array can simultaneously contain both Reed-Solomon and replicated vdisks.

If the storage JBOD supports multiple disks physically mounted together on removable carriers, removal of a carrier temporarily disables access to all the disks in the carrier. Thus, pdisks on the same carrier should not be in the same declustered array, as vdisk redundancy protection would be weakened upon carrier removal.


Figure 4. GPFS Native RAID server and recovery groups in a ring configuration. A recovery group is illustrated as the dashed-line enclosed group of disks within a storage JBOD. Server N is the primary controller of the left recovery group in JBOD N (and backup for its right recovery group), and the primary controller of the right recovery group in JBOD N + 1 (and backup for its left recovery group). As shown, when server 2 fails, control of the left recovery group in JBOD 2 is taken over by its backup server 1, and control of the right recovery group in JBOD 3 is taken over by its backup server 3. During the failure of server 2, the load on backup servers 1 and 3 increases by 50% from two to three recovery groups.


Figure 5. Minimal configuration of two GPFS Native RAID servers and one storage JBOD. GPFS Native RAID server 1 is the primary controller for the left recovery group and backup for the right recovery group. GPFS Native RAID server 2 is the primary controller for the right recovery group and backup for the left recovery group. As shown, when server 1 fails, control of the left recovery group is taken over by its backup server 2. During the failure of server 1, the load on backup server 2 increases by 100% from one to two recovery groups.


Declustered arrays are normally created at recovery group creation time but new ones can be created or existing ones grown by adding pdisks at a later time.

Virtual and physical disks

A virtual disk (vdisk) is a type of NSD, implemented by GPFS Native RAID across all the physical disks (pdisks) of a declustered array. Multiple vdisks can be defined within a declustered array, typically Reed-Solomon vdisks for GPFS user data and replicated vdisks for GPFS metadata.

Virtual disks

Whether a vdisk of a particular capacity can be created in a declustered array depends on its redundancy code, the number of pdisks and equivalent spare capacity in the array, and other small GPFS Native RAID overhead factors. The mmcrvdisk command can automatically configure a vdisk of the largest possible size given a redundancy code and configured spare space of the declustered array.

In general, the number of pdisks in a declustered array cannot be less than the widest redundancy code of a vdisk plus the equivalent spare disk capacity of a declustered array. For example, a vdisk using the 11-strip-wide 8 + 3p Reed-Solomon code requires at least 13 pdisks in a declustered array with the equivalent spare space capacity of two disks. A vdisk using the 3-way replication code requires at least five pdisks in a declustered array with the equivalent spare capacity of two disks.

Vdisks are partitioned into virtual tracks, which are the functional equivalent of a GPFS block. All vdisk attributes are fixed at creation and cannot be subsequently altered.

Physical disks

A pdisk is used by GPFS Native RAID to store both user data and GPFS Native RAID internal configuration data.

A pdisk is either a conventional rotating magnetic-media disk (HDD) or a solid-state disk (SSD). All pdisks in a declustered array must have the same capacity.

Pdisks are also assumed to be dual ported, with one or more paths connected to the primary GPFS Native RAID server and one or more paths connected to the backup server. There are typically two redundant paths between a GPFS Native RAID server and connected JBOD pdisks.


Figure 6. Example of declustered arrays and recovery groups in storage JBOD. This figure shows a storage JBOD with two recovery groups, each recovery group with four declustered arrays, and each declustered array with five disks.


Solid-state disks

GPFS Native RAID assumes several solid-state disks (SSDs) in each recovery group in order to redundantly log changes to its internal configuration and fast-write data in non-volatile memory, which is accessible from either the primary or backup GPFS Native RAID servers after server failure. A typical GPFS Native RAID log vdisk might be configured as 3-way replication over a dedicated declustered array of 4 SSDs per recovery group.

Disk hospital

The disk hospital is a key feature of GPFS Native RAID that asynchronously diagnoses errors and faults in the storage subsystem. GPFS Native RAID times out an individual pdisk I/O operation after about ten seconds, thereby limiting the impact from a faulty pdisk on a client I/O operation. When a pdisk I/O operation results in a timeout, an I/O error, or a checksum mismatch, the suspect pdisk is immediately admitted into the disk hospital. When a pdisk is first admitted, the hospital determines whether the error was caused by the pdisk itself or by the paths to it. While the hospital diagnoses the error, GPFS Native RAID, if possible, uses vdisk redundancy codes to reconstruct lost or erased strips for I/O operations that would otherwise have used the suspect pdisk.

Health metrics

The disk hospital maintains internal health assessment metrics for each pdisk: time badness, which characterizes response times; and data badness, which characterizes media errors (hard errors) and checksum errors. When a pdisk health metric exceeds the threshold, it is marked for replacement according to the disk maintenance replacement policy for the declustered array.

The disk hospital logs selected Self-Monitoring, Analysis and Reporting Technology (SMART) data, including the number of internal sector remapping events for each pdisk.

Pdisk discovery

GPFS Native RAID discovers all connected pdisks when it starts up, and then regularly schedules a process that will rediscover a pdisk that newly becomes accessible to the GPFS Native RAID server. This allows pdisks to be physically connected or connection problems to be repaired without restarting the GPFS Native RAID server.

Disk replacement

The disk hospital keeps track of disks that require replacement according to the disk replacement policy of the declustered array, and it can be configured to report the need for replacement in a variety of ways. It records and reports the FRU number and physical hardware location of failed disks to help guide service personnel to the correct location with replacement disks.

When multiple disks are mounted on a removable carrier, each a member of a different declustered array, disk replacement requires the hospital to temporarily suspend other disks in the same carrier. In order to guard against human error, carriers are also not removable until GPFS Native RAID actuates a solenoid-controlled latch. In response to administrative commands, the hospital quiesces the appropriate disks, releases the carrier latch, and turns on identify lights on the carrier adjacent to the disks that require replacement.

After one or more disks are replaced and the carrier is re-inserted, the hospital, in response to administrative commands, verifies that the repair has taken place and automatically adds any new disks to the declustered array, which causes GPFS Native RAID to rebalance the tracks and spare space across all the disks of the declustered array. If service personnel fail to re-insert the carrier within a reasonable period, the hospital declares the disks on the carrier as missing and starts rebuilding the affected data.


Chapter 2. Managing GPFS Native RAID

This section describes, in more detail, the characteristics and behavior of GPFS Native RAID entities: recovery groups, pdisks, declustered arrays, and vdisks. Disk maintenance and overall GPFS Native RAID management are also described.

Recovery groups

A recovery group is the fundamental organizing structure employed by GPFS Native RAID. A recovery group is conceptually the internal GPFS equivalent of a hardware disk controller. Within a recovery group, individual JBOD disks are defined as pdisks and assigned to declustered arrays. Each pdisk belongs to exactly one declustered array within one recovery group. Within a declustered array of pdisks, vdisks are defined. The vdisks are the equivalent of the RAID LUNs for a hardware disk controller. One or two GPFS cluster nodes must be defined as the servers for a recovery group, and these servers must have direct hardware connections to the JBOD disks in the recovery group. Two servers are recommended for high availability server failover, but only one server will actively manage the recovery group at any given time. One server is the preferred and primary server, and the other server, if defined, is the backup server.

Multiple recovery groups can be defined, and a GPFS cluster node can be the primary or backup server for more than one recovery group. The name of a recovery group must be unique within a GPFS cluster.

Recovery group server parameters

To enable a GPFS cluster node as a recovery group server, it must have the mmchconfig configuration parameter nsdRAIDTracks set to a nonzero value, and the GPFS daemon must be restarted on the node. The nsdRAIDTracks parameter defines the maximum number of vdisk track descriptors that the server can have in memory at a given time. The volume of actual vdisk data that the server can cache in memory is governed by the size of the GPFS pagepool on the server and the value of the nsdRAIDBufferPoolSizePct configuration parameter. The nsdRAIDBufferPoolSizePct parameter defaults to 50% of the pagepool on the server. A recovery group server should be configured with a substantial amount of pagepool, on the order of tens of gigabytes. A recovery group server becomes an NSD server after NSDs are defined on the vdisks in the recovery group, so the nsdBufSpace parameter also applies. The default for nsdBufSpace is 30% of the pagepool, and it can be decreased to its minimum value of 10% because the vdisk data buffer pool is used directly to serve the vdisk NSDs.

The vdisk track descriptors, as governed by nsdRAIDTracks, include such information as the RAID code, track number, and status. The descriptors also contain pointers to vdisk data buffers in the GPFS pagepool, as governed by nsdRAIDBufferPoolSizePct. It is these buffers that hold the actual vdisk data and redundancy information.

For more information on how to set the nsdRAIDTracks and nsdRAIDBufferPoolSizePct parameters, see “Planning considerations for GPFS Native RAID” on page 20.

For more information on the nsdRAIDTracks, nsdRAIDBufferPoolSizePct, and nsdBufSpace parameters, see the “mmchconfig command” on page 86.
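For example, a node might be enabled as a recovery group server with commands similar to the following sketch. The node name and the specific values shown are illustrative assumptions only; choose values appropriate for your servers as described above:

# Enable vdisk track descriptors and the vdisk data buffer pool on the intended
# recovery group server (node name and values are examples only).
mmchconfig nsdRAIDTracks=16384,nsdRAIDBufferPoolSizePct=50 -N server1

# Give the server a substantial pagepool and reduce nsdBufSpace to its minimum.
mmchconfig pagepool=32G,nsdBufSpace=10 -N server1

After the parameters are changed, restart the GPFS daemon on the node so that nsdRAIDTracks takes effect.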

Recovery group creation

Recovery groups are created using the mmcrrecoverygroup command, which takes the following arguments:
• The name of the recovery group to create.


• The name of a stanza file describing the declustered arrays and pdisks within the recovery group.
• The names of the GPFS cluster nodes that will be the primary and, if specified, backup servers for the recovery group.

When a recovery group is created, the GPFS daemon must be running with the nsdRAIDTracks configuration parameter in effect on the specified servers.

For more information see the “mmcrrecoverygroup command” on page 53.
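As a sketch, a recovery group might be created with a command similar to the following. The recovery group name, stanza file name, and server names are illustrative assumptions; see the mmcrrecoverygroup command description for the authoritative syntax:

# Create recovery group RG1 from the declustered array and pdisk definitions in
# rg1.stanza, with server1 as primary server and server2 as backup
# (all names are examples only).
mmcrrecoverygroup RG1 -F rg1.stanza --servers server1,server2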

Recovery group server failover

When, as is recommended, a recovery group is assigned two servers, one server is the preferred and primary server for the recovery group and the other server is the backup server. Only one server can serve the recovery group at any given time; this server is known as the active recovery group server. The server that is not currently serving the recovery group is the standby server. If the active recovery group server is unable to serve a recovery group, it will relinquish control of the recovery group and pass it to the standby server, if available. The failover from the active to the standby server should be transparent to any GPFS file system using the vdisk NSDs in the recovery group. There will be a pause in access to the file system data in the vdisk NSDs of the recovery group while the recovery operation takes place on the new server. This server failover recovery operation involves the new server opening the component disks of the recovery group and playing back any logged RAID transactions.

The active server for a recovery group can be changed by the GPFS administrator using the mmchrecoverygroup command. This command can also be used to change the primary and backup servers for a recovery group. For more information, see “mmchrecoverygroup command” on page 51.
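As a sketch, the active server or the defined servers might be changed with commands like the following. The recovery group and node names are examples, and the option names shown are assumptions; see the mmchrecoverygroup command description for the authoritative syntax:

# Make server2 the active server for recovery group RG1 (names are examples only).
mmchrecoverygroup RG1 --active server2

# Redefine the primary and backup servers for recovery group RG1.
mmchrecoverygroup RG1 --servers server2,server1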

Pdisks

The GPFS Native RAID pdisk is an abstraction of a physical disk. A pdisk corresponds to exactly one physical disk, and belongs to exactly one declustered array within exactly one recovery group. Before discussing how declustered arrays collect pdisks into groups, it will be useful to describe the characteristics of pdisks.

A recovery group may contain a maximum of 512 pdisks. A declustered array within a recovery group may contain a maximum of 128 pdisks. The name of a pdisk must be unique within a recovery group; that is, two recovery groups may each contain a pdisk named disk10, but a recovery group may not contain two pdisks named disk10, even if they are in different declustered arrays.

A pdisk is usually created using the mmcrrecoverygroup command, whereby it is assigned to a declustered array within a newly created recovery group. In unusual situations, pdisks may also be created and assigned to a declustered array of an existing recovery group by using the mmaddpdisk command.

To create a pdisk, a stanza must be supplied to the mmcrrecoverygroup or mmaddpdisk commands specifying the pdisk name, the declustered array name to which it is assigned, and a block device special file name for the entire physical disk as it is configured by the operating system on the active recovery group server. The following is an example pdisk creation stanza:

%pdisk: pdiskName=c073d1
        da=DA1
        device=/dev/hdisk192

The device name for a pdisk must refer to the entirety of a single physical disk; pdisks should not be created using virtualized or software-based disks (for example, logical volumes, disk partitions, logical units from other RAID controllers, or network-attached disks). For a pdisk to be successfully created, the physical disk must be present and functional at the specified device name on the active server. The physical disk must also be present on the standby recovery group server, if one is configured (note that the physical disk block device special name on the standby server will almost certainly be different, and will automatically be discovered by GPFS).

Pdisks that have failed and been marked for replacement by the disk hospital are replaced using the mmchcarrier command. In unusual situations, pdisks may be added or deleted using the mmaddpdisk or mmdelpdisk commands. When deleted, either through replacement or the mmdelpdisk command, the pdisk abstraction will only cease to exist when all the data it contained has been rebuilt onto spare space (even though the physical disk may have been removed from the system).
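As a sketch of the replacement flow, the commands below release the carrier holding a failed pdisk and then complete the replacement after a new disk is inserted. The recovery group and pdisk names are examples, and the option names are assumptions; see the mmchcarrier command description for the authoritative syntax:

# Quiesce the disks on the carrier holding failed pdisk c014d3 and release the
# carrier latch (names are examples only).
mmchcarrier RG1 --release --pdisk c014d3

# After physically replacing the disk and re-inserting the carrier, complete the
# replacement so the new disk is added back to its declustered array.
mmchcarrier RG1 --replace --pdisk c014d3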

Pdisks are normally under the control of GPFS Native RAID and the disk hospital. In unusual situations, the mmchpdisk command may be used to directly manipulate pdisks.

The attributes of a pdisk include the physical disk's unique world wide name (WWN), its field replaceable unit (FRU) code, and its physical location code. Pdisk attributes may be displayed by the mmlspdisk command; of particular interest here are the pdisk device paths and the pdisk states.
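For example, pdisk attributes might be examined with invocations like the following. The recovery group and pdisk names are examples, and the --pdisk option is an assumption; see the mmlspdisk command description for the exact syntax:

# Display all pdisks in recovery group RG1, or a single pdisk by name
# (names are examples only).
mmlspdisk RG1
mmlspdisk RG1 --pdisk c073d1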

Pdisk paths

To the operating system, physical disks are made visible as block devices with device special file names, such as /dev/hdisk32. To achieve high availability and throughput, the physical disks of a JBOD array are connected to each server by multiple (usually two) interfaces in a configuration known as multipath (or dualpath). When two operating system block devices are visible for each physical disk, GPFS Native RAID refers to them as the paths to the pdisk.

In normal operation, the paths to individual pdisks are automatically discovered by GPFS Native RAID. There are only two instances when a pdisk must be referred to by its explicit block device path name: during recovery group creation using the mmcrrecoverygroup command, and when adding new pdisks to an existing recovery group with the mmaddpdisk command. In both of these cases, only one of the block device path names as seen on the active server needs to be specified; any other paths on the active and standby servers will be automatically discovered.

The operating system may have the ability to internally merge multiple paths to a physical disk into a single block device. When GPFS Native RAID is in use, the operating system multipath merge function must be disabled because GPFS Native RAID itself manages the individual paths to the disk. For more information, see “Example scenario: Configuring GPFS Native RAID recovery groups” on page 25.

Pdisk states

GPFS Native RAID maintains its view of a pdisk and its corresponding physical disk by means of a pdisk state. The pdisk state consists of multiple keyword flags, which may be displayed using the mmlsrecoverygroup or mmlspdisk commands. The pdisk state flags indicate how GPFS Native RAID is currently using or managing a disk.
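As a sketch, the recovery groups of a cluster and the pdisk states within one of them might be reviewed with commands like the following. The recovery group name is an example and the -L option is an assumption; see the mmlsrecoverygroup command description for the exact syntax:

# List all recovery groups, then show detailed status, including pdisk states,
# for recovery group RG1 (name is an example only).
mmlsrecoverygroup
mmlsrecoverygroup RG1 -L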

In normal circumstances, the state of the vast majority of pdisks will be represented by the sole keyword ok. This means that GPFS Native RAID considers the pdisk to be healthy: The recovery group server is able to communicate with the disk, the disk is functioning normally, and the disk can be used to store data. The diagnosing flag will be present in the pdisk state when the GPFS Native RAID disk hospital suspects, or attempts to correct, a problem. If GPFS Native RAID is unable to communicate with a disk, the pdisk state will include the keyword missing. If a missing disk becomes reconnected and functions properly, its state will change back to ok. The readonly flag means that a disk has indicated that it can no longer safely write data. A disk can also be marked by the disk hospital as failing, perhaps due to an excessive number of media or checksum errors. When the disk hospital concludes that a disk is no longer operating effectively, it will declare the disk to be dead. If the number of dead pdisks reaches or exceeds the replacement threshold of their declustered array, the disk hospital will add the flag replace to the pdisk state, which indicates that physical disk replacement should be performed as soon as possible.

When the state of a pdisk indicates that it can no longer behave reliably, GPFS Native RAID will rebuild the pdisk's data onto spare space on the other pdisks in the same declustered array. This is called draining the pdisk. That a pdisk is draining or has been drained will be indicated by a keyword in the pdisk state flags. The flag systemDrain means that GPFS Native RAID has decided to rebuild the data from the pdisk; the flag adminDrain means that the GPFS administrator issued the mmdelpdisk command to delete the pdisk.

GPFS Native RAID stores both user (GPFS file system) data and its own internal recovery group data and vdisk configuration data on pdisks. Additional pdisk state flags indicate when these data elements are not present on a pdisk. When a pdisk starts draining, GPFS Native RAID first replicates the recovery group data and vdisk configuration data onto other pdisks. When this completes, the flags noRGD (no recovery group data) and noVCD (no vdisk configuration data) are added to the pdisk state flags. When the slower process of removing all user data completes, the noData flag will be added to the pdisk state.

To summarize, the vast majority of pdisks will be in the ok state during normal operation. The ok state indicates that the disk is reachable, functioning, not draining, and that the disk contains user data and GPFS Native RAID recovery group and vdisk configuration information. A more complex example of a pdisk state is dead/systemDrain/noRGD/noVCD/noData for a single pdisk that has failed. This set of pdisk state flags indicates that the pdisk was declared dead by the system, was marked to be drained, and that all of its data (recovery group, vdisk configuration, and user) has been successfully rebuilt onto the spare space on other pdisks.

In addition to those discussed here, there are some transient pdisk states that have little impact on normal operations; the complete set of states is documented in Table 2.

Table 2. Pdisk states

State Description

ok The disk is functioning normally.

dead The disk failed.

missing GPFS Native RAID is unable to communicate with the disk.

diagnosing The disk is temporarily unusable while its status is determined by the disk hospital.

suspended The disk is temporarily unusable as part of a service procedure.

readonly The disk is no longer writeable.

failing The disk is not healthy but not dead.

systemDrain The disk is faulty, so data and configuration data must be drained.

adminDrain An administrator requested that this pdisk be deleted.

noRGD The recovery group data was drained from the disk.

noVCD All vdisk configuration data was drained from the disk.

noData All vdisk user data was drained from the disk.

replace Replacement of the disk was requested.

noPath There was no functioning path found to this disk.

PTOW The disk is temporarily unusable because of a pending timed-out write.

init The pdisk object is being initialized or removed.

formatting Initial configuration data is being written to the disk.


Declustered arrays

Declustered arrays are disjoint subsets of the pdisks in a recovery group. Vdisks are created within declustered arrays, and vdisk tracks are declustered across all of an array's pdisks. A recovery group may contain up to 16 declustered arrays. A declustered array may contain up to 128 pdisks (but the total number of pdisks in all declustered arrays within a recovery group may not exceed 512). A pdisk may belong to only one declustered array. The name of a declustered array must be unique within a recovery group; that is, two recovery groups may each contain a declustered array named DA3, but a recovery group may not contain two declustered arrays named DA3. The pdisks within a declustered array must all be of the same size and should all have similar performance characteristics.

A declustered array is usually created together with its member pdisks and its containing recovery group through the use of the mmcrrecoverygroup command. A declustered array may also be created using the mmaddpdisk command to add pdisks to a declustered array that does not yet exist in a recovery group. A declustered array may be deleted by deleting its last member pdisk, or by deleting the recovery group in which it resides. Any vdisk NSDs and vdisks within the declustered array must already have been deleted. There are no explicit commands to create or delete declustered arrays.

Declustered arrays serve two purposes:
• Segregating a small number of fast SSDs into their own group for storing the vdisk log (the RAID update and recovery group event log).
• Partitioning the disks of a JBOD enclosure into smaller subsets exclusive of a common point of failure, such as removable carriers that hold multiple disks.

The latter consideration comes into play when one considers that removing a disk carrier to perform disk replacement also temporarily removes some good disks, perhaps a number in excess of the fault tolerance of the vdisk NSDs. This would cause temporary suspension of file system activity until the disks are restored. To avoid this, each disk position in a removable carrier should be used to define a separate declustered array, such that disk position one defines DA1, disk position two defines DA2, and so on. Then when a disk carrier is removed, each declustered array will suffer the loss of just one disk, which is within the fault tolerance of any GPFS Native RAID vdisk NSD.
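For example, the pdisk stanzas for one removable carrier might assign each disk position to a different declustered array, following the pdisk creation stanza format shown earlier in this chapter. The pdisk names, declustered array names, and device names below are illustrative assumptions only:

%pdisk: pdiskName=c014d1  da=DA1  device=/dev/hdisk25
%pdisk: pdiskName=c014d2  da=DA2  device=/dev/hdisk26
%pdisk: pdiskName=c014d3  da=DA3  device=/dev/hdisk27
%pdisk: pdiskName=c014d4  da=DA4  device=/dev/hdisk28

With this layout, removing carrier c014 takes exactly one pdisk out of each of DA1 through DA4.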

Declustered array parameters

Declustered arrays have three parameters that may be changed using the mmchrecoverygroup command with the --declustered-array option (an example follows this list). These are:
• The number of disks' worth of equivalent spare space. This defaults to one for arrays with nine or fewer pdisks, and two for arrays with 10 or more pdisks.
• The number of disks that must fail before the declustered array is marked as needing to have disks replaced. The default is the number of spares.
• The number of days over which all the vdisks in the declustered array are scrubbed for errors. The default is 14 days.
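As a sketch, these parameters might be adjusted with a command like the following. The recovery group and declustered array names are examples, and the option names for the spare space, replacement threshold, and scrub duration are assumptions; see the mmchrecoverygroup command description for the authoritative option names:

# Give declustered array DA1 in recovery group RG1 two disks' worth of spare
# space, a replacement threshold of two disks, and a 14-day scrub cycle
# (names and values are examples only).
mmchrecoverygroup RG1 --declustered-array DA1 --spares 2 --replace-threshold 2 --scrub-duration 14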

Declustered array size

GPFS Native RAID distinguishes between large and small declustered arrays. A declustered array is considered large if at the time of its creation it contains at least 11 pdisks, including an equivalent spare space of two disks (or at least 10 pdisks, including an equivalent spare space of one disk). All other declustered arrays are considered small. At least one declustered array in each recovery group must be large, because only large declustered arrays have enough pdisks to safely store an adequate number of replicas of the GPFS Native RAID configuration data for the recovery group.


Because the narrowest RAID code that GPFS Native RAID supports is 3-way replication, the smallest possible declustered array contains four pdisks, including the minimum required equivalent spare space of one disk. The RAID code width of the intended vdisk NSDs and the amount of equivalent spare space also affect declustered array size; if Reed-Solomon 8 + 3p vdisks, which have a code width of 11, are required, and two disks of equivalent spare space is also required, the declustered array must have at least 13 member pdisks.

Spare space

While operating with a failed pdisk in a declustered array, GPFS Native RAID continues to serve file system I/O requests by using redundancy information on other pdisks to reconstruct data that cannot be read, and by marking data that cannot be written to the failed pdisk as stale. Meanwhile, to restore full redundancy and fault tolerance, the data on the failed pdisk is rebuilt onto spare space, reserved unused portions of the declustered array that are declustered over all the member pdisks. The failed disk is thereby drained of its data by copying it to the spare space.

The amount of spare space in a declustered array is set at creation time and may be changed later. The spare space is expressed in whole units equivalent to the capacity of a member pdisk of the declustered array, but is spread among all the member pdisks. There are no dedicated spare pdisks. This implies that a number of pdisks equal to the specified spare space may fail, and the full redundancy of all the data in the declustered array can be restored through rebuild.

At minimum, each declustered array requires spare space equivalent to the size of one member pdisk. Because large declustered arrays have a greater probability of disk failure, the default amount of spare space depends on the size of the declustered array. A declustered array with nine or fewer pdisks defaults to having one disk of equivalent spare space. A declustered array with 10 or more disks defaults to having two disks of equivalent spare space. These defaults can be overridden, especially at declustered array creation. However, if at a later point too much of the declustered array is already allocated for use by vdisks, it may not be possible to increase the amount of spare space.

Vdisks

Vdisks are created across the pdisks within a declustered array. Each recovery group requires a special log vdisk to function, which will be discussed in “The log vdisk” on page 17. All other vdisks are created for use as GPFS file system NSDs.

A recovery group can contain at most 64 vdisks, and the first must be the log vdisk. Vdisks can be allocated arbitrarily among declustered arrays. Vdisks are created with the mmcrvdisk command. The mmdelvdisk command destroys vdisks and all their contained data.

When creating a vdisk, specify the RAID code, block size, vdisk size, and a name that is unique within the recovery group and the GPFS cluster. There are no adjustable parameters available for vdisks.
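For example, vdisks might be described in a stanza file and created with the mmcrvdisk command, as in the following sketch. The vdisk, recovery group, and declustered array names, the sizes, and the exact stanza attribute names are illustrative assumptions; see the mmcrvdisk command description for the authoritative stanza format. The log vdisk must be created before any other vdisk in the recovery group. A hypothetical stanza file, vdisk.stanza, might contain:

%vdisk: vdiskName=rg1LOG    rg=RG1  da=LOGSSD  raidCode=3WayReplication  blocksize=1m  size=4g
%vdisk: vdiskName=rg1DATA1  rg=RG1  da=DA1     raidCode=8+3p             blocksize=4m  size=8t

The vdisks would then be created with:

mmcrvdisk -F vdisk.stanza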

RAID code

The type, performance, and space efficiency of the RAID codes used for vdisks, discussed in “RAID codes” on page 2, should be considered when choosing the RAID code for a particular set of user data. GPFS storage pools and policy-based data placement can be used to ensure data is stored with appropriate RAID codes.

Block size

The vdisk block size must equal the GPFS file system block size of the storage pool where the vdisk is assigned. For replication codes, the supported block sizes are 256 KiB, 512 KiB, 1 MiB and 2 MiB. For Reed-Solomon codes, they are 1 MiB, 2 MiB, 4 MiB, 8 MiB and 16 MiB. See “Planning considerations for GPFS Native RAID” on page 20 for an overview of vdisk configuration considerations.

Vdisk size

The minimum vdisk size is 1 GiB. The maximum vdisk size is the total space available on the pdisks in the declustered array, taking into account the overhead of the RAID code, minus spare space, minus vdisk configuration data, and minus a small amount of space reserved as a buffer for write operations. GPFS Native RAID will round up the requested vdisk size as required. When creating a vdisk, the user can specify to use all remaining space in the declustered array for that vdisk.

The log vdisk

Every recovery group requires one log vdisk to function. The log vdisk must be created before any other vdisks in the recovery group, and it can only be deleted after all other vdisks in the recovery group have been deleted. The log vdisk is used to temporarily record changes to the GPFS Native RAID configuration data, and to log small writes. The log vdisk must be allocated on the declustered array made up of SSDs. All other vdisks must be placed on declustered arrays that use HDDs, not SSDs.

Only the 3-way and 4-way replication codes are supported for the log vdisk. In the typical system with four SSDs, with spare space equal to the size of one disk, the 3-way replication code would be used for the log vdisk.

The relationship between vdisks and NSDs

After creating a vdisk with the mmcrvdisk command, NSDs are created from the vdisks by using the mmcrnsd command. The relationship between vdisks and NSDs is described as follows:
v GPFS file systems are built from vdisk NSDs in the same way as they are built from any other NSDs.
v While an NSD exists for a vdisk, that vdisk cannot be deleted.
v A node cannot serve both vdisk-based NSDs and non-vdisk-based NSDs.
v A file system cannot support both vdisk-based NSDs and non-vdisk-based NSDs.
v Vdisk NSDs should not be used as tiebreaker disks.

Maintenance

Very large disk systems, with thousands or tens of thousands of disks and servers, will likely experience a variety of failures during normal operation. To maintain system productivity, the vast majority of these failures must be handled automatically: without loss of data, without temporary loss of access to the data, and with minimal impact on the performance of the system. Some failures require human intervention, such as replacing failed components with spare parts or correcting faults that cannot be corrected by automated processes.

Disk diagnosis

The disk hospital was introduced in “Disk hospital” on page 8. When an individual disk I/O operation (read or write) encounters an error, GPFS Native RAID completes the NSD client request by reconstructing the data (for a read) or by marking the unwritten data as stale and relying on successfully written parity or replica strips (for a write), and starts the disk hospital to diagnose the disk. While the disk hospital is diagnosing, the affected disk will not be used for serving NSD client requests.

Similarly, if an I/O operation does not complete in a reasonable time period, it is timed out, and the client request is treated just like an I/O error. Again, the disk hospital will diagnose what went wrong. If the timed-out operation is a disk write, the disk remains temporarily unusable until a pending timed-out write (PTOW) completes.

The disk hospital then determines the exact nature of the problem. If the cause of the error was an actual media error on the disk, the disk hospital marks the offending area on disk as temporarily unusable, and overwrites it with the reconstructed data. This cures the media error on a typical HDD by relocating the data to spare sectors reserved within that HDD.

If the disk reports that it can no longer write data, the disk is marked as readonly. This can happen when no spare sectors are available for relocating data in HDDs, or when the flash memory write endurance in SSDs has been reached. Similarly, if a disk reports that it cannot function at all, for example cannot spin up, the disk hospital marks the disk as dead.

The disk hospital also maintains various forms of disk badness, which measure accumulated errors from the disk. If the badness level is high, the disk can be marked dead. For less severe cases, the disk can be marked failing.

Finally, the GPFS Native RAID server might lose communication with a disk. This can be caused either by an actual failure of an individual disk, or by a fault in the disk interconnect network. In this case, the disk is marked as missing.

If a disk would have to be marked dead or missing, and the problem affects only individual disks, not a large set of disks, the disk hospital attempts to recover the disk. If the disk reports that it is not started, the disk hospital attempts to start the disk. If nothing else helps, the disk hospital power-cycles the disk and then waits for the disk to return online.

Before actually reporting an individual disk as missing, the disk hospital starts a search for that disk by polling all disk interfaces to locate the disk. Only after that fast poll fails is the disk actually declared missing.

If a large set of disks has faults, the GPFS Native RAID server can continue to serve read and write requests, provided that the number of failed disks does not exceed the fault tolerance of either the RAID code for the vdisk or the GPFS Native RAID internal configuration data. When any disk fails, the server begins rebuilding its data onto spare space. If the failure is not considered critical, rebuilding is throttled when user workload is present. This ensures that the performance impact to user workload is minimal. A failure might be considered critical if a vdisk has no remaining redundancy information, for example three disk faults for 4-way replication and 8 + 3p, or two disk faults for 3-way replication and 8 + 2p. During a critical failure, critical rebuilding will run as fast as possible because the vdisk is in imminent danger of data loss, even if that impacts the user workload. Since the data is declustered, or spread out over many disks, and all disks in the declustered array participate in rebuilding, a vdisk will remain in critical rebuild only for short periods (2 - 3 minutes for a typical system). A double or triple fault is extremely rare, so the performance impact of critical rebuild is minimized.

In a multiple fault scenario, the server might not have enough disks to fulfill a request. More specifically, the number of unavailable disks exceeds the fault tolerance of the RAID code. If some of the disks are only temporarily unavailable, and are expected back online soon, the server will stall the client I/O and wait for the disk to return to service. Disks can be temporarily unavailable for three reasons:
v The disk hospital is diagnosing an I/O error.
v A timed-out write operation is pending.
v A user intentionally suspended the disk, perhaps because it is on a carrier with another failed disk that has been removed for service.

If too many disks become unavailable for the primary server to proceed, it will fail over. In other words, the whole recovery group is moved to the backup server. If the disks are not reachable from the backup server either, then all vdisks in that recovery group become unavailable until connectivity is restored.

A vdisk will suffer data loss when the number of permanently failed disks exceeds the vdisk fault tolerance. This data loss is reported to NSD clients when the data is accessed.

Background tasks

While GPFS Native RAID primarily performs NSD client read and write operations in the foreground, it also performs several long-running maintenance tasks in the background, which are referred to as background tasks. The background task that is currently in progress for each declustered array is reported in the long-form output of the mmlsrecoverygroup command. Table 3 describes the long-running background tasks.

Table 3. Background tasks

Task Description

repair-RGD/VCD Repairing the internal recovery group data and vdisk configuration data from the failed disk onto the other disks in the declustered array.

rebuild-critical Rebuilding virtual tracks that cannot tolerate any more disk failures.

rebuild-1r Rebuilding virtual tracks that can tolerate only one more disk failure.

rebuild-2r Rebuilding virtual tracks that can tolerate two more disk failures.

rebuild-offline Rebuilding virtual tracks where failures exceeded the fault tolerance.

rebalance Rebalancing the spare space in the declustered array for either a missing pdisk that was discovered again, or a new pdisk that was added to an existing array.

scrub Scrubbing vdisks to detect any silent disk corruption or latent sector errors by reading the entire virtual track, performing checksum verification, and performing consistency checks of the data and its redundancy information. Any correctable errors found are fixed.

Server failover

If the primary GPFS Native RAID server loses connectivity to a sufficient number of disks, the recovery group attempts to fail over to the backup server. If the backup server is also unable to connect, the recovery group becomes unavailable until connectivity is restored. If the backup server had taken over, it will relinquish the recovery group to the primary server when it becomes available again.

Data checksums

GPFS Native RAID stores checksums of the data and redundancy information on all disks for each vdisk. Whenever data is read from disk or received from an NSD client, checksums are verified. If the checksum verification on a data transfer to or from an NSD client fails, the data is retransmitted. If the checksum verification fails for data read from disk, the error is treated similarly to a media error:
v The data is reconstructed from redundant data on other disks.
v The data on disk is rewritten with reconstructed good data.
v The disk badness is adjusted to reflect the silent read error.

Disk replacement

When one disk fails, the system will rebuild the data that was on the failed disk onto spare space and continue to operate normally, but at slightly reduced performance because the same workload is shared among fewer disks. With the default setting of two spare disks for each large declustered array, failure of a single disk would typically not be a sufficient reason for maintenance.

When several disks fail, the system continues to operate even if there is no more spare space. The next disk failure would make the system unable to maintain the redundancy the user requested during vdisk creation. At this point, a service request is sent to a maintenance management application that requests replacement of the failed disks and specifies the disk FRU numbers and locations.

In general, disk maintenance is requested when the number of failed disks in a declustered array reaches the disk replacement threshold. By default, that threshold is identical to the number of spare disks. For a more conservative disk replacement policy, the threshold can be set to smaller values using the mmchrecoverygroup command.
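For example, assuming a recovery group named rg01 with a declustered array DA1 that has two effective spares, a more conservative policy that requests replacement after a single disk failure might be set as in the following sketch. The option names used here (--declustered-array, --replace-threshold) are assumptions and should be confirmed against the “mmchrecoverygroup command” description before use.

# mmchrecoverygroup rg01 --declustered-array DA1 --replace-threshold 1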

Disk maintenance is performed using the mmchcarrier command with the --release option, which:
v Suspends all functioning disks on the carrier that is shared with the disk being replaced.
v Powers down all the disks on that carrier.
v Turns on indicators on the disk enclosure and carrier to help locate and identify the disk that requires replacement.
v Unlocks the carrier for disk replacement.

After the disk is replaced and the carrier reinserted, another mmchcarrier command with the --replace option powers on the disks.
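For example, assuming a failed pdisk named c014d3 in a recovery group named rg01 (both names hypothetical, and the --pdisk form of the command is an assumption to be confirmed against the mmchcarrier command description), the sequence might look like this sketch:

# mmchcarrier rg01 --release --pdisk c014d3
   (physically replace the drive and reinsert the carrier)
# mmchcarrier rg01 --replace --pdisk c014d3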

Other hardware service

While GPFS Native RAID can easily tolerate a single disk fault with no significant impact, and failures of up to three disks with various levels of impact on performance and data availability, it still relies on the vast majority of all disks being functional and reachable from the server. If a major equipment malfunction prevents both the primary and backup server from accessing more than that number of disks, or if those disks are actually destroyed, all vdisks in the recovery group will become either unavailable or suffer permanent data loss. As GPFS Native RAID cannot recover from such catastrophic problems, it also does not attempt to diagnose them or orchestrate their maintenance.

If a GPFS Native RAID server becomes permanently disabled, a manual failover procedure exists that requires recabling to an alternate server (see the “mmchrecoverygroup command” on page 51). If both the primary and backup GPFS Native RAID servers for a recovery group fail, the recovery group is unavailable until one of the servers is repaired.

Overall management of GPFS Native RAID

This section summarizes how to plan and monitor a GPFS Native RAID system. For an example of setting up a GPFS Native RAID system, see Chapter 3, “GPFS Native RAID setup and disk replacement on the IBM Power 775 Disk Enclosure,” on page 25.

Planning considerations for GPFS Native RAID

Planning a GPFS Native RAID implementation requires consideration of the nature of the JBOD arrays being used, the required redundancy protection and usable disk capacity, the required spare capacity and maintenance strategy, and the ultimate GPFS file system configuration. This section is a set of best-practice recommendations for using GPFS Native RAID.

v Assign a primary and backup server to each recovery group.

Each JBOD array should be connected to two servers to protect against server failure. Each server should also have two independent paths to each physical disk to protect against path failure and provide higher throughput to the individual disks.

Define multiple recovery groups on a JBOD array, if the architecture suggests it, and use mutually reinforcing primary and backup servers to spread the processing evenly across the servers and the JBOD array.

Recovery group server nodes can be designated GPFS quorum or manager nodes, but they should otherwise be dedicated to GPFS Native RAID and not run application workload.

v Configure recovery group servers with a large vdisk track cache and a large pagepool.

The nsdRAIDTracks configuration parameter tells GPFS Native RAID how many vdisk track descriptors, not including the actual track data, to cache in memory.

In general, a large number of vdisk track descriptors should be cached. The nsdRAIDTracks value for the recovery group servers should be 10000 - 60000. If the expected vdisk NSD access pattern is random across all defined vdisks and within individual vdisks, a larger value for nsdRAIDTracks might be warranted. If the expected access pattern is sequential, a smaller value can be sufficient.

The amount of actual vdisk data (including user data, parity, and checksums) that can be cached depends on the size of the GPFS pagepool on the recovery group servers and the percentage of pagepool reserved for GPFS Native RAID. The nsdRAIDBufferPoolSizePct parameter specifies what percentage of the pagepool should be used for vdisk data. The default is 50%, but it can be set as high as 90% or as low as 10%. Because a recovery group server is also an NSD server and the vdisk buffer pool also acts as the NSD buffer pool, the configuration parameter nsdBufSpace should be reduced to its minimum value of 10%.

As an example, to have a recovery group server cache 20000 vdisk track descriptors (nsdRAIDTracks), where the data size of each track is 4 MiB, using 80% (nsdRAIDBufferPoolSizePct) of the pagepool, an approximate pagepool size of 20000 * 4 MiB * (100/80) ≈ 100000 MiB ≈ 98 GiB would be required. It is not necessary to configure the pagepool to cache all the data for every cached vdisk track descriptor, but this example calculation can provide some guidance in determining appropriate values for nsdRAIDTracks and nsdRAIDBufferPoolSizePct.

v Define each recovery group with at least one large declustered array.

A large declustered array contains enough pdisks to store the required redundancy of GPFS Native RAID vdisk configuration data. This is defined as at least nine pdisks plus the effective spare capacity. A minimum spare capacity equivalent to two pdisks is strongly recommended in each large declustered array. The code width of the vdisks must also be considered. The effective number of non-spare pdisks must be at least as great as the largest vdisk code width. A declustered array with two effective spares where 11 is the largest code width (8 + 3p Reed-Solomon vdisks) must contain at least 13 pdisks. A declustered array with two effective spares where 10 is the largest code width (8 + 2p Reed-Solomon vdisks) must contain at least 12 pdisks.

v Place the log vdisk in a separate declustered array of solid-state disks (SSDs).

The SSDs in the JBOD array should be used for the log vdisk of each recovery group. These SSDs should be isolated in a small log declustered array, and the log vdisk should be the only vdisk defined there. One pdisk of spare capacity should be defined, which is the default for a small declustered array. For example, if the log declustered array contains four physical SSDs, it should have one spare defined and the log vdisk should use 3-way replication. The recommended track size for the log vdisk is 1 MiB, and the recommended total size is 2 - 4 GiB.

v Determine the declustered array maintenance strategy.

Disks will fail and need replacement, so a general strategy of deferred maintenance can be used. For example, failed pdisks in a declustered array are only replaced when the spare capacity of the declustered array is exhausted. This is implemented with the replacement threshold for the declustered array set equal to the effective spare capacity. This strategy is useful in installations with a large number of recovery groups where disk replacement might be scheduled on a weekly basis. Smaller installations can have GPFS Native RAID require disk replacement as disks fail, which means the declustered array replacement threshold can be set to one.

v Choose the vdisk RAID codes based on GPFS file system usage.

The choice of vdisk RAID codes depends on the level of redundancy protection required versus the amount of actual space required for user data, and the ultimate intended use of the vdisk NSDs in a GPFS file system.

Reed-Solomon vdisks are more space efficient. An 8 + 3p vdisk uses approximately 27% of actual disk space for redundancy protection and 73% for user data. An 8 + 2p vdisk uses 20% for redundancy and 80% for user data. Reed-Solomon vdisks perform best when writing whole tracks (the GPFS block size) at once. When partial tracks of a Reed-Solomon vdisk are written, parity recalculation must occur.

Replicated vdisks are less space efficient. A vdisk with 3-way replication uses approximately 67% of actual disk space for redundancy protection and 33% for user data. A vdisk with 4-way replication uses 75% of actual disk space for redundancy and 25% for user data. The advantage of vdisks with N-way replication is that small or partial write operations can complete faster.

For file system applications where write performance must be optimized, the preceding considerations make replicated vdisks most suitable for use as GPFS file system metadataOnly NSDs, and Reed-Solomon vdisks most suitable for use as GPFS file system dataOnly NSDs. The volume of GPFS file system metadata is usually small (1% - 3%) relative to file system data, so the impact of the space inefficiency of a replicated RAID code is minimized. The file system metadata is typically written in small chunks, which takes advantage of the faster small and partial write operations of the replicated RAID code. Applications are often tuned to write file system user data in whole multiples of the file system block size, which works to the strengths of the Reed-Solomon RAID codes both in terms of space efficiency and speed.

When segregating vdisk NSDs for file system metadataOnly and dataOnly disk usage, the metadataOnly replicated vdisks can be created with a smaller block size and assigned to the GPFS file system storage pool. The dataOnly Reed-Solomon vdisks can be created with a larger block size and assigned to GPFS file system data storage pools. When using multiple storage pools, a GPFS placement policy must be installed to direct file system data to non-system storage pools; a sketch of such a policy follows this list.

When write performance optimization is not important, it is acceptable to use Reed-Solomon vdisks as dataAndMetadata NSDs for better space efficiency.

When assigning the failure groups to vdisk NSDs in a GPFS file system, the JBOD array should be considered the common point of failure. All vdisks within all recovery groups in a given JBOD array should be assigned the same failure group number.
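As referenced above, a minimal placement policy for a metadataOnly/dataOnly split might look like the following sketch. The pool name data, the file system device gpfs, and the policy file name are assumptions taken from the example in Chapter 3; the policy is installed with the mmchpolicy command.

/* policy.rules: send all new file data to the 'data' pool; metadata stays in 'system' */
RULE 'DefaultToData' SET POOL 'data'

# mmchpolicy gpfs policy.rules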

Monitoring GPFS Native RAID

To monitor GPFS Native RAID during normal operation, use the mmlsrecoverygroup, mmlspdisk, and mmpmon commands. Pay particular attention to the GPFS Native RAID event log, which is visible using the mmlsrecoverygroupevents command.
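For example, to review the event log and list pdisks for a recovery group named DE00022TOP, the following sketch could be used; the --not-ok filter shown for mmlspdisk is an assumption and should be confirmed against the mmlspdisk command description.

# mmlsrecoverygroupevents DE00022TOP
# mmlspdisk DE00022TOP --not-ok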

Consider using GPFS Native RAID user exits to notify an automated system management tool if critical events, such as disk failures, occur during normal operation. For more information, see “mmaddcallback command” on page 78.

If disk maintenance is indicated, use the mmchcarrier command to release the failed disk, replace the failed drive, and use the mmchcarrier command again to inform GPFS Native RAID that the failed disk has been replaced.

Displaying vdisk I/O statistics

To display vdisk I/O statistics, run mmpmon with the following command included in the input file:

vio_s [f [rg RecoveryGroupName [da DeclusteredArrayName [v VdiskName]]]] [reset]

This request returns strings containing vdisk I/O statistics as seen by that node. The values are presented as total values for the node, or they can be filtered with the f option. The reset option indicates that the statistics should be reset after the data is sampled.
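For example, a one-shot sample of the vdisk I/O statistics on the local node might be collected as in the following sketch; the input file name is arbitrary.

# echo "vio_s" > /tmp/vio.in
# mmpmon -p -i /tmp/vio.in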

If the -p option is specified when running mmpmon, the vdisk I/O statistics are provided in the form of keywords and values in the vio_s response. Table 4 lists and describes these keywords in the order in which they appear in the output.

Table 4. Keywords and descriptions of values provided in the mmpmon vio_s response

Keyword Description

_n_ The IP address of the node responding. This is the address by which GPFS knows the node.

_nn_ The name by which GPFS knows the node.

_rc_ The reason or error code. In this case, the reply value is 0 (OK).

_t_ The current time of day in seconds (absolute seconds since Epoch (1970)).

_tu_ The microseconds part of the current time of day.

_rg_ The name of the recovery group.

_da_ The name of the declustered array.

_r_ The total number of read operations.

_sw_ The total number of short write operations.

_mw_ The total number of medium write operations.

_pfw_ The total number of promoted full track write operations.

_ftw_ The total number of full track write operations.

_fuw_ The total number of flushed update write operations.

_fpw_ The total number of flushed promoted full track write operations.

_m_ The total number of migrate operations.

_s_ The total number of scrub operations.

_l_ The total number of log write operations.

To display these statistics, use the sample script /usr/lpp/mmfs/samples/vdisk/viostat. The following shows the usage of the viostat script:

viostat [-F NodeFile | [--recovery-group RecoveryGroupName
        [--declustered-array DeclusteredArrayName
        [--vdisk VdiskName]]]]
        [Interval [Count]]
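For example, to display the statistics for a single recovery group every 5 seconds, 10 times (the recovery group name here is hypothetical):

# /usr/lpp/mmfs/samples/vdisk/viostat --recovery-group DE00022TOP 5 10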

For more information, see the topic about monitoring GPFS I/O performance with the mmpmon command in General Parallel File System: Advanced Administration Guide.

GPFS Native RAID callbacks

GPFS Native RAID introduces 12 new GPFS callbacks for events that can occur during recovery group operations. These callbacks can be installed by the system administrator using the mmaddcallback command.

The callbacks are provided primarily as a method for system administrators to take notice when important GPFS Native RAID events occur. For example, a GPFS administrator can use the pdReplacePdisk callback to send an e-mail to notify system operators that the replacement threshold for a declustered array was reached and that pdisks must be replaced. Similarly, the preRGTakeover callback can be used to inform system administrators of a possible server failover.

As notification methods, no real processing should occur in the callback scripts. GPFS Native RAID callbacks should not be installed for synchronous execution; the default of asynchronous callback execution should be used in all cases. Synchronous or complicated processing within a callback might delay GPFS daemon execution pathways and cause unexpected and undesired results, including loss of file system availability.

Table 5 lists the callbacks and their corresponding parameters available through the mmaddcallback command:

Table 5. GPFS Native RAID callbacks and parameters

Callbacks Parameters

preRGTakeover myNode, rgName, rgErr, rgCount, rgReason

postRGTakeover myNode, rgName, rgErr, rgCount, rgReason

preRGRelinquish myNode, rgName, rgErr, rgCount, rgReason

postRGRelinquish myNode, rgName, rgErr, rgCount, rgReason

rgOpenFailed myNode, rgName, rgErr, rgReason

rgPanic myNode, rgName, rgErr, rgReason

pdFailed myNode, rgName, daName, pdName, pdLocation, pdFru, pdWwn, pdState

pdRecovered myNode, rgName, daName, pdName, pdLocation, pdFru, pdWwn

pdReplacePdisk myNode, rgName, daName, pdName, pdLocation, pdFru, pdWwn, pdState, pdPriority

pdPathDown myNode, rgName, daName, pdName, pdPath, pdLocation, pdFru, pdWwn

daRebuildFailed myNode, rgName, daName, daRemainingRedundancy

nsdCksumMismatch myNode, ckRole, ckOtherNode, ckNSD, ckReason, ckStartSector, ckDataLen, ckErrorCountClient, ckErrorCountServer, ckErrorCountNSD, ckReportingInterval

All GPFS Native RAID callbacks are local, which means that the event triggering the callback occurs only on the involved node or nodes (in the case of nsdCksumMismatch), rather than on every node in the GPFS cluster. The nodes where GPFS Native RAID callbacks should be installed are, by definition, the recovery group server nodes. An exception is the case of nsdCksumMismatch, where it makes sense to install the callback on GPFS client nodes as well as recovery group servers.
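For example, a notification-only callback for pdReplacePdisk might be installed as in the following sketch. The callback identifier and script path are hypothetical, and the %parameter substitution form used with --parms should be confirmed against the “mmaddcallback command” description.

# mmaddcallback pdReplaceNotify --command /usr/local/bin/notify-replace.sh \
    --event pdReplacePdisk \
    --parms "%myNode %rgName %daName %pdName %pdLocation %pdFru"

Because this is intended purely as a notification method, the script should do no more than queue an e-mail or open a ticket, and it should be left to run asynchronously (the default).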

For more information about GPFS Native RAID callbacks, see “mmaddcallback command” on page 78.

Chapter 3. GPFS Native RAID setup and disk replacement on the IBM Power 775 Disk Enclosure

Example scenario: Configuring GPFS Native RAID recovery groups

This topic provides a detailed example of configuring GPFS Native RAID using the JBOD SAS disks on the Power 775 Disk Enclosure. The example considers one fully populated Power 775 Disk Enclosure cabled to two recovery group servers, and shows how the architecture of the Power 775 Disk Enclosure determines the structure of the recovery groups. Throughout this topic, it may be helpful to have Power 775 Disk Enclosure documentation at hand.

Preparing recovery group servers

Disk enclosure and HBA cabling

The Power 775 Disk Enclosure should be cabled to the intended recovery group servers according to the Power 775 Disk Enclosure hardware installation instructions. The fully populated Power 775 Disk Enclosure consists of 8 STORs of 48 disks, for a total of 384 JBOD disks. Each STOR provides redundant left and right port cards for host server HBA connections (STOR is short for physical storage group, meaning the part of the disk enclosure controlled by a pair of port cards). To ensure proper multi-pathing and redundancy, each recovery group server must be connected to each port card using different HBAs. For example, STOR 1 has port cards P1-C4 and P1-C5. Server 1 may be connected to P1-C4 using HBA hba1 and to P1-C5 using HBA hba2; similarly for server 2 and its respective HBAs hba1 and hba2.

GPFS Native RAID provides system administration tools for verifying the correct connectivity of the Power 775 Disk Enclosure, which will be seen later during the operating system preparation.

When the port cards of the Power 775 Disk Enclosure have been cabled to the appropriate HBAs of the two recovery group servers, the Power 775 Disk Enclosure should be powered on and the recovery group servers should be rebooted.

Initial operating system verification

Preparation then continues with the operating system, which must be AIX 7.1, and which must be the same on both recovery group servers. It is not necessary to do a complete verification of the Power 775 Disk Enclosure connectivity at this point. Logging in to the servers to perform a quick check that at least some disks have been detected and configured by the operating system will suffice. The operating system device configuration should be examined for the Power 775 Disk Enclosure VPD enclosure type, which is 78AD.001.

One way to quickly verify that AIX has configured devices with enclosure type 78AD.001 for the Power 775 Disk Enclosure is:

# lsdev -t ses -F 'name physloc parent' | grep 78AD.001

The output should include lines resembling the following:

ses12 U78AD.001.000DE37-P1-C4 sas3

This is the SAS expander device on port card P1-C4 of the Power 775 Disk Enclosure with serial number 000DE37, together with the SAS protocol device driver sas3 under which it has been configured. To see what disks have been detected by the SAS protocol driver, use:

# lsdev -p sas3

The output should include all the disks and port card expanders that successfully configured under the sas3 SAS protocol driver (which corresponds to the HBA device mpt2sas3).

If AIX has not configured any port card expanders of enclosure type 78AD.001, the hardware installation of the server HBAs and the Power 775 Disk Enclosure must be reexamined and corrected.

Disabling operating system multi-pathing

Once it has been verified that at least some of the Power 775 Disk Enclosure has been configured by the operating system, the next step is to disable any operating system multi-pathing. Since GPFS Native RAID performs its own disk multi-pathing, AIX MPIO (Multiple Path I/O) must be disabled as appropriate.

To disable AIX MPIO for SAS disks, use:

# manage_disk_drivers -d SAS_SCSD -o AIX_non_MPIO

Note: This blanket disabling of operating system multi-pathing is appropriate because a Power 775 Disk Enclosure installation provides the only available disk devices to the recovery group servers. Once operating system multi-pathing has been disabled, the recovery group servers should be rebooted.

Operating system device attributes

For best performance, the operating system disk device driver should be configured to allow GPFS Native RAID I/O operations to be made with one disk access, rather than being fragmented. Under AIX this is controlled by the max_transfer attribute of the HBAs and disk devices.

The disk I/O size performed by GPFS Native RAID depends on the strip size of the RAID code of the vdisk NSD. This in turn is related to the vdisk track size and its corresponding GPFS file system block size. The operating system I/O size should be equal to or greater than the largest strip size of the planned vdisk NSDs.

Because GPFS Native RAID stores checksums with each strip, strips are 4 KiB or 8 KiB larger than might be expected just from the user data (strips containing 2 MiB of user data have an additional 8 KiB; all smaller strips have an additional 4 KiB). The strip size for a replicated vdisk RAID code is equal to the vdisk track size plus the size of the checksum. The strip size for a Reed-Solomon vdisk RAID code is equal to one-eighth of the vdisk track size plus the size of the checksum.
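As a worked example of these formulas, an 8 + 3p vdisk with a 16 MiB track splits the track into eight 2 MiB data strips, and a 2 MiB strip carries an 8 KiB checksum:

    16384 KiB / 8 + 8 KiB = 2056 KiB

so the strip size is 2056 KiB, which exceeds the 1 MiB default I/O size (compare Table 6).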

The default max_transfer value of 1 MiB under AIX is suitable for GPFS Native RAID vdisk strip sizes under 1 MiB.

For vdisk strip sizes greater than 1 MiB under AIX, the operating system disk device driver I/O size should be increased for best performance.

The following table indicates the relationship between file system NSD block size, vdisk track size, vdisk RAID code, vdisk strip size, and the non-default operating system I/O size for all permitted GPFS Native RAID vdisks. The AIX max_transfer attribute is specified in hexadecimal, and the only allowable values greater than the 1 MiB default are 0x200000 (2 MiB) and 0x400000 (4 MiB).

Table 6. NSD block size, vdisk track size, vdisk RAID code, vdisk strip size, and non-default operating system I/O size for permitted GPFS Native RAID vdisks

NSD block size   vdisk track size   vdisk RAID code           RAID code strip size   AIX max_transfer
256 KiB          256 KiB            3- or 4-way replication   260 KiB                default
512 KiB          512 KiB            3- or 4-way replication   516 KiB                default
1 MiB            1 MiB              3- or 4-way replication   1028 KiB               0x200000
2 MiB            2 MiB              3- or 4-way replication   2056 KiB               0x400000
512 KiB          512 KiB            8 + 2p or 8 + 3p          68 KiB                 default
1 MiB            1 MiB              8 + 2p or 8 + 3p          132 KiB                default
2 MiB            2 MiB              8 + 2p or 8 + 3p          260 KiB                default
4 MiB            4 MiB              8 + 2p or 8 + 3p          516 KiB                default
8 MiB            8 MiB              8 + 2p or 8 + 3p          1028 KiB               0x200000
16 MiB           16 MiB             8 + 2p or 8 + 3p          2056 KiB               0x400000

If the largest strip size of all the vdisk NSDs planned for a GPFS Native RAID installation exceeds the operating system default I/O size, the operating system I/O size should be changed.

Under AIX, this involves changing the HBA max_transfer size. The disk devices seen over the HBA will then inherit the max_transfer size of the HBA (unless the disk max_transfer size has itself been customized to a different value).

To change the max_transfer attribute to 2 MiB for the HBA mpt2sas0 under AIX, use the following command:

# chdev -P -l mpt2sas0 -a max_transfer=0x200000

Repeat the previous command for each HBA. The new max_transfer size will not take effect until AIX reconfigures the HBA. This can be done either by rebooting the recovery group server, or by deconfiguring (but not removing) and then reconfiguring the affected HBAs. The HBA max_transfer attribute is recorded in the CuAt ODM class and will persist across reboots.
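To confirm the setting after the HBA has been reconfigured, the attribute can be displayed with the AIX lsattr command, as in the following sketch (the device name mpt2sas0 is carried over from the example above); the value column should show 0x200000.

# lsattr -El mpt2sas0 -a max_transfer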

For optimal performance, additional device attributes may need to be changed (for example, the HBA and block device command queue depths); consult the operating system documentation for the device attributes.

Verifying that a Power 775 Disk Enclosure is configured correctly

Once a superficial inspection indicates that the Power 775 Disk Enclosure has been configured on the recovery group servers, and especially once operating system multi-pathing has been disabled, it is necessary to perform a thorough discovery of the disk topology on each server.

To proceed, GPFS must be installed on the recovery group servers, and they should be members of the same GPFS cluster. Consult GPFS: Administration and Programming Reference for instructions for creating a GPFS cluster.

GPFS Native RAID provides tools in /usr/lpp/mmfs/samples/vdisk for collecting and collating information on any attached Power 775 Disk Enclosure and for verifying that the detected topology is correct. The mmgetpdisktopology command examines the operating system's list of connected devices and produces a colon-delimited database with a line for each discovered Power 775 Disk Enclosure physical disk, port card expander device, and HBA. mmgetpdisktopology should be run on each of the two intended recovery group server nodes, and the results examined to verify that the disk enclosure hardware and software configuration is as expected. An additional tool called topsummary concisely summarizes the output of the mmgetpdisktopology command.

Create a directory in which to work, and then capture the output of the mmgetpdisktopology command from each of the two intended recovery group server nodes:

# mkdir p7ihde
# cd p7ihde
# ssh server1 /usr/lpp/mmfs/samples/vdisk/mmgetpdisktopology > server1.top
# ssh server2 /usr/lpp/mmfs/samples/vdisk/mmgetpdisktopology > server2.top

Then view the summary for each of the nodes (server1 example shown):

# /usr/lpp/mmfs/samples/vdisk/topsummary server1.top
P7IH-DE enclosures found: DE00022
Enclosure DE00022:
Enclosure DE00022 STOR P1-C4/P1-C5 sees both portcards: P1-C4 P1-C5
Portcard P1-C4: ses0[0150]/mpt2sas0/24 diskset "37993" ses1[0150]/mpt2sas0/24 diskset "18793"
Portcard P1-C5: ses4[0150]/mpt2sas1/24 diskset "37993" ses5[0150]/mpt2sas1/24 diskset "18793"
Enclosure DE00022 STOR P1-C4/P1-C5 sees 48 disks
Enclosure DE00022 STOR P1-C12/P1-C13 sees both portcards: P1-C12 P1-C13
Portcard P1-C12: ses8[0150]/mpt2sas2/24 diskset "40657" ses9[0150]/mpt2sas2/24 diskset "44382"
Portcard P1-C13: ses12[0150]/mpt2sas3/24 diskset "40657" ses13[0150]/mpt2sas3/24 diskset "44382"
Enclosure DE00022 STOR P1-C12/P1-C13 sees 48 disks
Enclosure DE00022 STOR P1-C20/P1-C21 sees both portcards: P1-C20 P1-C21
Portcard P1-C20: ses16[0150]/mpt2sas4/24 diskset "04091" ses17[0150]/mpt2sas4/24 diskset "31579"
Portcard P1-C21: ses20[0150]/mpt2sas5/24 diskset "04091" ses21[0150]/mpt2sas5/24 diskset "31579"
Enclosure DE00022 STOR P1-C20/P1-C21 sees 48 disks
Enclosure DE00022 STOR P1-C28/P1-C29 sees both portcards: P1-C28 P1-C29
Portcard P1-C28: ses24[0150]/mpt2sas6/24 diskset "64504" ses25[0150]/mpt2sas6/24 diskset "62361"
Portcard P1-C29: ses28[0150]/mpt2sas7/24 diskset "64504" ses29[0150]/mpt2sas7/24 diskset "62361"
Enclosure DE00022 STOR P1-C28/P1-C29 sees 48 disks
Enclosure DE00022 STOR P1-C60/P1-C61 sees both portcards: P1-C60 P1-C61
Portcard P1-C60: ses30[0150]/mpt2sas7/24 diskset "10913" ses31[0150]/mpt2sas7/24 diskset "52799"
Portcard P1-C61: ses26[0150]/mpt2sas6/24 diskset "10913" ses27[0150]/mpt2sas6/24 diskset "52799"
Enclosure DE00022 STOR P1-C60/P1-C61 sees 48 disks
Enclosure DE00022 STOR P1-C68/P1-C69 sees both portcards: P1-C68 P1-C69
Portcard P1-C68: ses22[0150]/mpt2sas5/24 diskset "50112" ses23[0150]/mpt2sas5/24 diskset "63400"
Portcard P1-C69: ses18[0150]/mpt2sas4/24 diskset "50112" ses19[0150]/mpt2sas4/24 diskset "63400"
Enclosure DE00022 STOR P1-C68/P1-C69 sees 48 disks
Enclosure DE00022 STOR P1-C76/P1-C77 sees both portcards: P1-C76 P1-C77
Portcard P1-C76: ses14[0150]/mpt2sas3/23 diskset "45948" ses15[0150]/mpt2sas3/24 diskset "50856"
Portcard P1-C77: ses10[0150]/mpt2sas2/24 diskset "37258" ses11[0150]/mpt2sas2/24 diskset "50856"
Enclosure DE00022 STOR P1-C76/P1-C77 sees 48 disks
Enclosure DE00022 STOR P1-C84/P1-C85 sees both portcards: P1-C84 P1-C85
Portcard P1-C84: ses6[0150]/mpt2sas1/24 diskset "13325" ses7[0150]/mpt2sas1/24 diskset "10443"
Portcard P1-C85: ses2[0150]/mpt2sas0/24 diskset "13325" ses3[0150]/mpt2sas0/24 diskset "10443"
Enclosure DE00022 STOR P1-C84/P1-C85 sees 48 disks
Carrier location P1-C79-D4 appears only on the portcard P1-C77 path
Enclosure DE00022 sees 384 disks

mpt2sas7[1005470001] U78A9.001.9998884-P1-C1 DE00022 STOR 4 P1-C29 (ses28 ses29) STOR 5 P1-C60 (ses30 ses31)
mpt2sas6[1005470001] U78A9.001.9998884-P1-C3 DE00022 STOR 4 P1-C28 (ses24 ses25) STOR 5 P1-C61 (ses26 ses27)
mpt2sas5[1005470001] U78A9.001.9998884-P1-C5 DE00022 STOR 3 P1-C21 (ses20 ses22) STOR 6 P1-C68 (ses21 ses23)
mpt2sas4[1005470001] U78A9.001.9998884-P1-C7 DE00022 STOR 3 P1-C20 (ses16 ses17) STOR 6 P1-C69 (ses18 ses19)
mpt2sas3[1005470001] U78A9.001.9998884-P1-C9 DE00022 STOR 2 P1-C13 (ses12 ses13) STOR 7 P1-C76 (ses14 ses15)
mpt2sas2[1005470001] U78A9.001.9998884-P1-C11 DE00022 STOR 2 P1-C12 (ses8 ses9) STOR 7 P1-C77 (ses10 ses11)
mpt2sas1[1005470001] U78A9.001.9998884-P1-C13 DE00022 STOR 1 P1-C5 (ses4 ses5) STOR 8 P1-C84 (ses6 ses7)
mpt2sas0[1005470001] U78A9.001.9998884-P1-C15 DE00022 STOR 1 P1-C4 (ses0 ses1) STOR 8 P1-C85 (ses2 ses3)

In the preceding output, the Power 775 Disk Enclosure with serial number DE00022 is discovered, together with its eight individual STORs and the component port cards, port card expanders (with their firmware levels in brackets), and physical disks. One minor discrepancy is noted: the physical disk in location P1-C79-D4 is only seen over one of the two expected HBA paths. This can also be seen in the output for the STOR with port cards P1-C76 and P1-C77:

Enclosure DE00022 STOR P1-C76/P1-C77 sees both portcards: P1-C76 P1-C77
Portcard P1-C76: ses14[0150]/mpt2sas3/23 diskset "45948" ses15[0150]/mpt2sas3/24 diskset "50856"
Portcard P1-C77: ses10[0150]/mpt2sas2/24 diskset "37258" ses11[0150]/mpt2sas2/24 diskset "50856"
Enclosure DE00022 STOR P1-C76/P1-C77 sees 48 disks

Here the connection through port card P1-C76 sees just 23 disks on the expander ses14 and all 24 disks on the expander ses15, while the connection through port card P1-C77 sees all 24 disks on each of the expanders ses10 and ses11. The “disksets” that are reached over the expanders are identified by a checksum of the unique SCSI WWNs of the physical disks that are present; equal disksets represent the same collection of physical disks.

The preceding discrepancy can either be corrected or ignored, as it is probably due to a poorly seated or defective port on the physical disk. The disk is still present on the other port.

If other discrepancies are noted (for example, physical disks that are expected but do not show up at all, or SSDs or HDDs in the wrong locations), they should be corrected before proceeding.

The HBAs (firmware levels in brackets) are also listed with their slot location codes to show the cabling pattern. Each HBA sees two STORs, and each STOR is seen by two different HBAs, which provides the multiple paths and redundancy required by a correct Power 775 Disk Enclosure installation.

This output can be compared to the hardware cabling specification to verify that the disk enclosure is connected correctly.

The server2.top topology database should also be examined with the topsummary sample script and verified to be correct.

Once the Power 775 Disk Enclosure topologies are verified to be correct on both intended recovery group server nodes, the recommended recovery group configuration can be created using GPFS Native RAID commands.

Creating recovery groups on a Power 775 Disk Enclosure

Configuring GPFS nodes to be recovery group servers

Before a GPFS node can create and serve recovery groups, it must be configured with a vdisk track cache. This is accomplished by setting the nsdRAIDTracks configuration parameter.

nsdRAIDTracks is the GPFS configuration parameter essential to define a GPFS cluster node as a recovery group server. It specifies the number of vdisk tracks of which the attributes will be held in memory by the GPFS daemon on the recovery group server.

The actual contents of the vdisk tracks, the user data and the checksums, are stored in the standard GPFS pagepool. Therefore, the size of the GPFS pagepool configured on a recovery group server should be considerable, on the order of tens of gigabytes. The amount of pagepool dedicated to hold vdisk track data is governed by the nsdRAIDBufferPoolSizePct parameter, which defaults to 50%. In practice, a recovery group server will not need to use the GPFS pagepool for any significant amount of standard file caching, and the nsdRAIDBufferPoolSizePct value can be increased to 80%. Also applicable, since a recovery group server is by definition an NSD server, is the nsdBufSpace parameter, which defaults to 30% of pagepool. Since the vdisk buffer pool doubles as the NSD buffer pool, the nsdBufSpace parameter should be decreased to its minimum of 10%. Together these values leave only 10% of the pagepool for application program file cache, but this should not be a problem as a recovery group server should not be running application programs.

In this example, the recovery group servers will be configured to cache the information on 16384 vdisk tracks and to have 64 GiB of pagepool, of which 80% will be used for vdisk data. Once the configuration changes are made, the servers will need to be restarted.

# mmchconfig nsdRAIDTracks=16384,nsdRAIDBufferPoolSizePct=80,nsdBufSpace=10,pagepool=64G -N server1,server2
# mmshutdown -N server1,server2
# mmstartup -N server1,server2

Defining the recovery group layout

The definition of recovery groups on a Power 775 Disk Enclosure is dictated by the architecture and cabling of the disk enclosure. Two servers sharing a Power 775 Disk Enclosure implies two recovery groups; one is served by one node and one by the other, and each server acts as the other's backup. Half the disks in each STOR should belong to one recovery group, and half to the other. One recovery group will therefore be defined on the disks and carriers in the top halves of the eight STORs, and one on the bottom halves. Since the disks in a STOR are placed four to a removable carrier, thereby having a common point of failure, each disk in a carrier should belong to one of four different declustered arrays. Should a carrier fail or be removed, then each declustered array will only suffer the loss of one disk. There are four SSDs distributed among the top set of carriers, and four in the bottom set of carriers. These groups of four SSDs will make up the vdisk log declustered arrays in their respective halves.

GPFS Native RAID provides a tool that understands the layout of the Power 775 Disk Enclosure and will automatically generate the mmcrrecoverygroup stanza files for creating the top and bottom recovery groups. /usr/lpp/mmfs/samples/vdisk/mkp7rginput, when supplied with output of the mmgetpdisktopology command, will create recovery group stanza files for the top and bottom halves of each Power 775 Disk Enclosure found in the topology.

Each recovery group server, though it may see the same functional disk enclosure topology, will almost certainly differ in the particulars of which disk device names (e.g., /dev/rhdisk77 on AIX) refer to which physical disks in what disk enclosure location.

There are two possibilities then for creating the recovery group stanza files and the recovery groups themselves:

Alternative 1:
        Generate the recovery group stanza files and create the recovery groups from the perspective of just one of the servers, as if that server were to be primary for both recovery groups, and then use the mmchrecoverygroup command to swap the primary and backup servers for one of the recovery groups.

Alternative 2:
        Generate the recovery group stanza files for each server's primary recovery group using the primary server's topology file.

This example will show both alternatives.

Creating the recovery groups, alternative 1

To create the recovery group input stanza files from the perspective of server1, run:

# /usr/lpp/mmfs/samples/vdisk/mkp7rginput server1.top

This will create two files for each disk enclosure present in the server1 topology; in this case, DE00022TOP.server1 for the top half of disk enclosure DE00022 and DE00022BOT.server1 for the bottom half. (An extra file, DEXXXXXbad, may be created if any discrepancies are present in the topology; if such a file is created by mkp7rginput, it should be examined and the discrepancies corrected.)

The recovery group stanza files will follow the recommended best practice for the Power 775 Disk Enclosure of defining in each half of the disk enclosure a separate declustered array of 4 SSDs for recovery group transaction logging, and four file system data declustered arrays using the regular HDDs according to which of the four disk enclosure carrier slots each HDD resides in.

The defaults are accepted for other recovery group declustered array parameters such as scrub duration, spare space, and disk replacement policy.

The stanza file will look something like this:

# head DE00022TOP.server1
%pdisk: pdiskName=c081d1
        device=/dev/hdisk10
        da=DA1
%pdisk: pdiskName=c065d1
        device=/dev/hdisk211
        da=DA1
%pdisk: pdiskName=c066d1
        device=/dev/hdisk259
        da=DA1
%pdisk: pdiskName=c067d1

All the pdisk stanzas for declustered array DA1 will be listed first, followed by those for DA2, DA3, DA4, and the LOG declustered array. The pdisk names will indicate the carrier and disk location in which the physical disk resides. Notice that only one block device path to the disk is given; the second path will be discovered automatically soon after the recovery group is created.

Now that the DE00022TOP.server1 and DE00022BOT.server1 stanza files have been created from the perspective of recovery group server node server1, these two recovery groups can be created using two separate invocations of the mmcrrecoverygroup command:

# mmcrrecoverygroup DE00022TOP -F DE00022TOP.server1 --servers server1,server2
mmcrrecoverygroup: Propagating the cluster configuration data to all
  affected nodes. This is an asynchronous process.

# mmcrrecoverygroup DE00022BOT -F DE00022BOT.server1 --servers server1,server2
mmcrrecoverygroup: Propagating the cluster configuration data to all
  affected nodes. This is an asynchronous process.

Note that both recovery groups were created with server1 as primary and server2 as backup. It is now necessary to swap the primary and backup servers for DE00022BOT using the mmchrecoverygroup command:

# mmchrecoverygroup DE00022BOT --servers server2,server1
mmchrecoverygroup: Propagating the cluster configuration data to all
  affected nodes. This is an asynchronous process.

GPFS Native RAID will automatically discover the appropriate disk devices on server2.

Creating the recovery groups, alternative 2

To create the recovery groups from the start with the intended primary and backup servers, the stanza files from both server topologies will need to be created.

To create the server1 recovery group input stanza files, run:

# /usr/lpp/mmfs/samples/vdisk/mkp7rginput server1.top

To create the server2 recovery group input stanza files, run:

# /usr/lpp/mmfs/samples/vdisk/mkp7rginput server2.top

These two commands will result in four stanza files: DE00022TOP.server1, DE00022BOT.server1, DE00022TOP.server2, and DE00022BOT.server2. (As in alternative 1, if any files named DEXXXXXbad are created, they should be examined and the errors within should be corrected.)

The DE00022TOP recovery group must then be created using server1 as the primary and the DE00022TOP.server1 stanza file. The DE00022BOT recovery group must be created using server2 as the primary and the DE00022BOT.server2 stanza file.

# mmcrrecoverygroup DE00022TOP -F DE00022TOP.server1 --servers server1,server2
mmcrrecoverygroup: Propagating the cluster configuration data to all
  affected nodes. This is an asynchronous process.

# mmcrrecoverygroup DE00022BOT -F DE00022BOT.server2 --servers server2,server1
mmcrrecoverygroup: Propagating the cluster configuration data to all
  affected nodes. This is an asynchronous process.

Since each recovery group was created using the intended primary server and the stanza file for that server, it is not necessary to swap the primary and backup servers.

Verifying recovery group creation

Use the mmlsrecoverygroup command to verify that each recovery group was created:

# mmlsrecoverygroup DE00022TOP -L

                     declustered
 recovery group         arrays    vdisks  pdisks
 -----------------  -----------  ------  ------
 DE00022TOP                   5       0     192

 declustered  needs                          replace               scrub     background activity
    array    service vdisks pdisks spares threshold  free space  duration   task     progress  priority
 -----------  ------- ------ ------ ------ ---------  ----------  --------  -------------------------
 DA1          no           0     47      2         2      24 TiB   14 days  inactive     0%    low
 DA2          no           0     47      2         2      24 TiB   14 days  inactive     0%    low
 DA3          no           0     47      2         2      24 TiB   14 days  inactive     0%    low
 DA4          no           0     47      2         2      24 TiB   14 days  inactive     0%    low
 LOG          no           0      4      1         1     558 GiB   14 days  inactive     0%    low

                                          declustered
 vdisk               RAID code            array        vdisk size  remarks
 ------------------  ------------------  -----------  ----------  -------

 active recovery group server                     servers
 -----------------------------------------------  -------
 server1                                          server1,server2

# mmlsrecoverygroup DE00022BOT -L

                     declustered
 recovery group         arrays    vdisks  pdisks
 -----------------  -----------  ------  ------
 DE00022BOT                   5       0     192

 declustered  needs                          replace               scrub     background activity
    array    service vdisks pdisks spares threshold  free space  duration   task     progress  priority
 -----------  ------- ------ ------ ------ ---------  ----------  --------  -------------------------
 DA1          no           0     47      2         2      24 TiB   14 days  inactive     0%    low
 DA2          no           0     47      2         2      24 TiB   14 days  inactive     0%    low
 DA3          no           0     47      2         2      24 TiB   14 days  inactive     0%    low
 DA4          no           0     47      2         2      24 TiB   14 days  inactive     0%    low
 LOG          no           0      4      1         1     558 GiB   14 days  inactive     0%    low

                                          declustered
 vdisk               RAID code            array        vdisk size  remarks
 ------------------  ------------------  -----------  ----------  -------

 active recovery group server                     servers
 -----------------------------------------------  -------
 server1                                          server2,server1

Notice that the vdisk sections of the newly created recovery groups are empty; the next step is to create the vdisks.

Defining and creating the vdisks

Once the recovery groups are created and being served by their respective servers, it is time to create the vdisks using the mmcrvdisk command.

Each recovery group requires a single log vdisk for recording RAID updates and diagnostic information. This is internal to the recovery group, cannot be used for user data, and should be the only vdisk in the LOG declustered array. The log vdisks in this example use 3-way replication in order to fit in the LOG declustered array, which contains 4 SSDs and spare space equivalent to one disk.

Data vdisks are required to be defined in the four data declustered arrays for use as file system NSDs. In this example, each of the declustered arrays for file system data is divided into two vdisks with different characteristics: one using 4-way replication and a 1 MiB block size and a total vdisk size of 250 GiB, suitable for file system metadata, and one using Reed-Solomon 8 + 3p encoding and a 16 MiB block size, suitable for file system data. The vdisk size is omitted for the Reed-Solomon vdisks, meaning that they will default to use the remaining non-spare space in the declustered array (for this to work, any vdisks with specified total sizes must of course be defined first).

The possibilities for the vdisk creation stanza file are quite great, depending on the number and type of vdisk NSDs required for the number and type of file systems desired, so the vdisk stanza file will need to be created by hand, possibly following a template.

In this example, a single stanza file, mmcrvdisk.DE00022ALL, is used. The single file contains the specifications for all the vdisks in both the DE00022TOP and DE00022BOT recovery groups. Here is what the example stanza file for use with mmcrvdisk should look like:

# cat mmcrvdisk.DE00022ALL
%vdisk: vdiskName=DE00022TOPLOG
        rg=DE00022TOP
        da=LOG
        blocksize=1m
        size=4g
        raidCode=3WayReplication
        diskUsage=vdiskLog

%vdisk: vdiskName=DE00022BOTLOG
        rg=DE00022BOT
        da=LOG
        blocksize=1m
        size=4g
        raidCode=3WayReplication
        diskUsage=vdiskLog

%vdisk: vdiskName=DE00022TOPDA1META
        rg=DE00022TOP
        da=DA1
        blocksize=1m
        size=250g
        raidCode=4WayReplication
        diskUsage=metadataOnly
        failureGroup=22
        pool=system

%vdisk: vdiskName=DE00022TOPDA1DATA
        rg=DE00022TOP
        da=DA1
        blocksize=16m
        raidCode=8+3p
        diskUsage=dataOnly
        failureGroup=22
        pool=data

%vdisk: vdiskName=DE00022BOTDA1META
        rg=DE00022BOT
        da=DA1
        blocksize=1m
        size=250g
        raidCode=4WayReplication
        diskUsage=metadataOnly
        failureGroup=22
        pool=system

%vdisk: vdiskName=DE00022BOTDA1DATA
        rg=DE00022BOT
        da=DA1
        blocksize=16m
        raidCode=8+3p
        diskUsage=dataOnly
        failureGroup=22
        pool=data

[DA2, DA3, DA4 vdisks omitted.]

Notice how the file system metadata vdisks are flagged for eventual file system usage as metadataOnly and for placement in the system storage pool, and the file system data vdisks are flagged for eventual dataOnly usage in the data storage pool. (After the file system is created, a policy will be required to allocate file system data to the correct storage pools; see “Creating the GPFS file system” on page 35.)

Importantly, also notice that the block sizes for the file system metadata and file system data vdisks must be specified at this time, cannot be changed later, and must match the block sizes supplied to the eventual mmcrfs command.

Notice also that the eventual failureGroup=22 value for the NSDs on the file system vdisks is the same for vdisks in both the DE00022TOP and DE00022BOT recovery groups. This is because the recovery groups, although they have different servers, still share a common point of failure in the disk enclosure DE00022, and GPFS should be informed of this through a distinct failure group designation for each disk enclosure. It is up to the GPFS system administrator to decide upon the failure group numbers for each Power 775 Disk Enclosure in the GPFS cluster.

To create the vdisks specified in the mmcrvdisk.DE00022ALL file, use the following mmcrvdisk command:

# mmcrvdisk -F mmcrvdisk.DE00022ALL
mmcrvdisk: [I] Processing vdisk DE00022TOPLOG
mmcrvdisk: [I] Processing vdisk DE00022BOTLOG
mmcrvdisk: [I] Processing vdisk DE00022TOPDA1META
mmcrvdisk: [I] Processing vdisk DE00022TOPDA1DATA
mmcrvdisk: [I] Processing vdisk DE00022TOPDA2META
mmcrvdisk: [I] Processing vdisk DE00022TOPDA2DATA
mmcrvdisk: [I] Processing vdisk DE00022TOPDA3META
mmcrvdisk: [I] Processing vdisk DE00022TOPDA3DATA
mmcrvdisk: [I] Processing vdisk DE00022TOPDA4META
mmcrvdisk: [I] Processing vdisk DE00022TOPDA4DATA
mmcrvdisk: [I] Processing vdisk DE00022BOTDA1META
mmcrvdisk: [I] Processing vdisk DE00022BOTDA1DATA
mmcrvdisk: [I] Processing vdisk DE00022BOTDA2META
mmcrvdisk: [I] Processing vdisk DE00022BOTDA2DATA
mmcrvdisk: [I] Processing vdisk DE00022BOTDA3META
mmcrvdisk: [I] Processing vdisk DE00022BOTDA3DATA
mmcrvdisk: [I] Processing vdisk DE00022BOTDA4META
mmcrvdisk: [I] Processing vdisk DE00022BOTDA4DATA
mmcrvdisk: Propagating the cluster configuration data to all
  affected nodes. This is an asynchronous process.

Creation of the vdisks may be verified through the mmlsvdisk command (the mmlsrecoverygroup command may also be used):

# mmlsvdisk
                                                          declustered  block size
 vdisk name          RAID code        recovery group        array        in KiB   remarks
 ------------------  ---------------  ------------------  -----------  ----------  -------
 DE00022BOTDA1DATA   8+3p             DE00022BOT          DA1               16384
 DE00022BOTDA1META   4WayReplication  DE00022BOT          DA1                1024
 DE00022BOTDA2DATA   8+3p             DE00022BOT          DA2               16384
 DE00022BOTDA2META   4WayReplication  DE00022BOT          DA2                1024
 DE00022BOTDA3DATA   8+3p             DE00022BOT          DA3               16384
 DE00022BOTDA3META   4WayReplication  DE00022BOT          DA3                1024
 DE00022BOTDA4DATA   8+3p             DE00022BOT          DA4               16384
 DE00022BOTDA4META   4WayReplication  DE00022BOT          DA4                1024
 DE00022BOTLOG       3WayReplication  DE00022BOT          LOG                1024  log
 DE00022TOPDA1DATA   8+3p             DE00022TOP          DA1               16384
 DE00022TOPDA1META   4WayReplication  DE00022TOP          DA1                1024
 DE00022TOPDA2DATA   8+3p             DE00022TOP          DA2               16384
 DE00022TOPDA2META   4WayReplication  DE00022TOP          DA2                1024
 DE00022TOPDA3DATA   8+3p             DE00022TOP          DA3               16384
 DE00022TOPDA3META   4WayReplication  DE00022TOP          DA3                1024
 DE00022TOPDA4DATA   8+3p             DE00022TOP          DA4               16384
 DE00022TOPDA4META   4WayReplication  DE00022TOP          DA4                1024
 DE00022TOPLOG       3WayReplication  DE00022TOP          LOG                1024  log

Creating NSDs from vdisks

The mmcrvdisk command rewrites the input file so that it is ready to be passed to the mmcrnsd command that creates the NSDs from which GPFS builds file systems. To create the vdisk NSDs, run the mmcrnsd command on the rewritten mmcrvdisk stanza file:

# mmcrnsd -F mmcrvdisk.DE00022ALL
mmcrnsd: Processing disk DE00022TOPDA1META
mmcrnsd: Processing disk DE00022TOPDA1DATA
mmcrnsd: Processing disk DE00022TOPDA2META
mmcrnsd: Processing disk DE00022TOPDA2DATA
mmcrnsd: Processing disk DE00022TOPDA3META
mmcrnsd: Processing disk DE00022TOPDA3DATA
mmcrnsd: Processing disk DE00022TOPDA4META
mmcrnsd: Processing disk DE00022TOPDA4DATA
mmcrnsd: Processing disk DE00022BOTDA1META
mmcrnsd: Processing disk DE00022BOTDA1DATA
mmcrnsd: Processing disk DE00022BOTDA2META
mmcrnsd: Processing disk DE00022BOTDA2DATA
mmcrnsd: Processing disk DE00022BOTDA3META
mmcrnsd: Processing disk DE00022BOTDA3DATA
mmcrnsd: Processing disk DE00022BOTDA4META
mmcrnsd: Processing disk DE00022BOTDA4DATA
mmcrnsd: Propagating the cluster configuration data to all
  affected nodes. This is an asynchronous process.

Notice how the recovery group log vdisks are omitted from NSD processing.

The mmcrnsd command then once again rewrites the stanza file in preparation for use as input to the mmcrfs command.

Creating the GPFS file system

Run the mmcrfs command to create the file system:

# mmcrfs gpfs -F mmcrvdisk.DE00022ALL -B 16m --metadata-block-size 1m -T /gpfs -A no

The following disks of gpfs will be formatted on node c250f09c01ap05.ppd.pok.ibm.com:
    DE00022TOPDA1META: size 262163456 KB
    DE00022TOPDA1DATA: size 8395522048 KB
    DE00022TOPDA2META: size 262163456 KB
    DE00022TOPDA2DATA: size 8395522048 KB
    DE00022TOPDA3META: size 262163456 KB
    DE00022TOPDA3DATA: size 8395522048 KB
    DE00022TOPDA4META: size 262163456 KB
    DE00022TOPDA4DATA: size 8395522048 KB
    DE00022BOTDA1META: size 262163456 KB
    DE00022BOTDA1DATA: size 8395522048 KB
    DE00022BOTDA2META: size 262163456 KB
    DE00022BOTDA2DATA: size 8395522048 KB
    DE00022BOTDA3META: size 262163456 KB
    DE00022BOTDA3DATA: size 8395522048 KB
    DE00022BOTDA4META: size 262163456 KB
    DE00022BOTDA4DATA: size 8395522048 KB
Formatting file system ...
Disks up to size 2.5 TB can be added to storage pool ’system’.
Disks up to size 79 TB can be added to storage pool ’data’.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
Formatting Allocation Map for storage pool ’system’
Formatting Allocation Map for storage pool ’data’
Completed creation of file system /dev/gpfs.
mmcrfs: Propagating the cluster configuration data to all
  affected nodes. This is an asynchronous process.

Notice how the 16 MiB data block size is specified with the traditional -B parameter and the 1 MiB metadata block size is specified with the --metadata-block-size parameter. Since a file system with different metadata and data block sizes requires the use of multiple GPFS storage pools, a file system placement policy is needed to direct user file data to the data storage pool. In this example, the file placement policy is very simple:

# cat policy
rule ’default’ set pool ’data’

The policy must then be installed in the file system using the mmchpolicy command:

# mmchpolicy gpfs policy -I yes
Validated policy `policy’: parsed 1 Placement Rules, 0 Restore Rules, 0 Migrate/Delete/Exclude Rules,
  0 List Rules, 0 External Pool/List Rules
Policy `policy’ installed and broadcast to all nodes.

If a policy is not placed in a file system with multiple storage pools, attempts to place data into files will return ENOSPC as if the file system were full.
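As a quick sanity check, not part of the original example, the installed rules can be listed back from the file system with the general GPFS mmlspolicy command; a minimal sketch, assuming the file system device is gpfs as above:

# Display the policy rules currently installed in the file system (sketch).
mmlspolicy gpfs -L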

This file system, built on a Power 775 Disk Enclosure using two recovery groups, two recovery group servers, eight file system metadata vdisk NSDs and eight file system data vdisk NSDs, may now be mounted and placed into service:

# mmmount gpfs -a

Example scenario: Replacing failed disks in a Power 775 Disk Enclosure recovery group

The scenario presented here shows how to detect and replace failed disks in a recovery group built on a Power 775 Disk Enclosure.

Detecting failed disks in your enclosure

Assume a fully populated Power 775 Disk Enclosure (serial number 000DE37) on which the following two recovery groups are defined:
v 000DE37TOP containing the disks in the top set of carriers
v 000DE37BOT containing the disks in the bottom set of carriers

Each recovery group contains the following:
v one log declustered array (LOG)
v four data declustered arrays (DA1, DA2, DA3, DA4)

The data declustered arrays are defined according to Power 775 Disk Enclosure best practice as follows:
v 47 pdisks per data declustered array
v each member pdisk from the same carrier slot
v default disk replacement threshold value set to 2

The replacement threshold of 2 means that GPFS Native RAID will only require disk replacement when two or more disks have failed in the declustered array; otherwise, rebuilding onto spare space or reconstruction from redundancy will be used to supply affected data.
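If a site prefers to be prompted for replacement as soon as a single disk fails, the threshold can be lowered with the mmchrecoverygroup command described later in this information; a minimal sketch (the choice of declustered array DA1 here is only illustrative):

# Lower the replacement threshold for one declustered array so that a single
# failed pdisk is immediately marked for replacement (illustrative values).
mmchrecoverygroup 000DE37TOP --declustered-array DA1 --replace-threshold 1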


This configuration can be seen in the output of mmlsrecoverygroup for the recovery groups, shown here for 000DE37TOP:

# mmlsrecoverygroup 000DE37TOP -L

                     declustered
 recovery group        arrays     vdisks  pdisks
 -----------------  -----------  ------  ------
 000DE37TOP                   5       9     192

 declustered  needs                          replace              scrub        background activity
    array     service vdisks pdisks spares  threshold free space  duration  task        progress  priority
 -----------  ------- ------ ------ ------  --------- ----------  --------  -------------------------
 DA1          no           2     47      2          2   3072 MiB   14 days  scrub            63%  low
 DA2          no           2     47      2          2   3072 MiB   14 days  scrub            19%  low
 DA3          yes          2     47      2          2        0 B   14 days  rebuild-2r       48%  low
 DA4          no           2     47      2          2   3072 MiB   14 days  scrub            33%  low
 LOG          no           1      4      1          1    546 GiB   14 days  scrub            87%  low

                                         declustered
 vdisk               RAID code              array     vdisk size  remarks
 ------------------  ------------------  -----------  ----------  -------
 000DE37TOPLOG       3WayReplication     LOG            4144 MiB  log
 000DE37TOPDA1META   4WayReplication     DA1             250 GiB
 000DE37TOPDA1DATA   8+3p                DA1              17 TiB
 000DE37TOPDA2META   4WayReplication     DA2             250 GiB
 000DE37TOPDA2DATA   8+3p                DA2              17 TiB
 000DE37TOPDA3META   4WayReplication     DA3             250 GiB
 000DE37TOPDA3DATA   8+3p                DA3              17 TiB
 000DE37TOPDA4META   4WayReplication     DA4             250 GiB
 000DE37TOPDA4DATA   8+3p                DA4              17 TiB

 active recovery group server                     servers
 -----------------------------------------------  -------
 server1                                          server1,server2

The indication that disk replacement is called for in this recovery group is the value of yes in the needs service column for declustered array DA3.

The fact that DA3 (the declustered array on the disks in carrier slot 3) is undergoing rebuild of its RAID tracks that can tolerate two strip failures is by itself not an indication that disk replacement is required; it merely indicates that data from a failed disk is being rebuilt onto spare space. Only if the replacement threshold has been met will disks be marked for replacement and the declustered array marked as needing service.

GPFS Native RAID provides several indications that disk replacement is required:
v entries in the AIX error report
v the GPFS pdReplacePdisk callback, which can be configured to run an administrator-supplied script at
  the moment a pdisk is marked for replacement (a registration sketch follows this list)
v the POWER7® cluster event notification TEAL agent, which can be configured to send disk replacement
  notices when they occur to the POWER7 cluster EMS
v the output from the following commands, which may be performed from the command line on any
  GPFS cluster node (see the examples that follow):
  1. mmlsrecoverygroup with the -L flag shows yes in the needs service column
  2. mmlsrecoverygroup with the -L and --pdisk flags; this shows the states of all pdisks, which may be
     examined for the replace pdisk state
  3. mmlspdisk with the --replace flag, which lists only those pdisks that are marked for replacement
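The following is a minimal sketch of how such a callback might be registered with the general GPFS mmaddcallback command; the callback identifier replacePdiskWatch and the script path /usr/local/bin/notify_replace.sh are hypothetical, and only the pdReplacePdisk event name comes from this information:

# Hypothetical registration of an administrator-supplied script to run when a
# pdisk is marked for replacement (identifier and script path are placeholders).
mmaddcallback replacePdiskWatch --command /usr/local/bin/notify_replace.sh \
              --event pdReplacePdisk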

Note: Because the output of mmlsrecoverygroup -L --pdisk for a fully-populated disk enclosure is very long, this example shows only some of the pdisks (but includes those marked for replacement).


# mmlsrecoverygroup 000DE37TOP -L --pdisk

                     declustered
 recovery group        arrays     vdisks  pdisks
 -----------------  -----------  ------  ------
 000DE37TOP                   5       9     192

 declustered  needs                          replace              scrub        background activity
    array     service vdisks pdisks spares  threshold free space  duration  task        progress  priority
 -----------  ------- ------ ------ ------  --------- ----------  --------  -------------------------
 DA1          no           2     47      2          2   3072 MiB   14 days  scrub            63%  low
 DA2          no           2     47      2          2   3072 MiB   14 days  scrub            19%  low
 DA3          yes          2     47      2          2        0 B   14 days  rebuild-2r       68%  low
 DA4          no           2     47      2          2   3072 MiB   14 days  scrub            34%  low
 LOG          no           1      4      1          1    546 GiB   14 days  scrub            87%  low

                      number of   declustered
 pdisk               active paths    array     free space  state
 -----------------   ------------  -----------  ----------  -----
 [...]
 c014d1                         2  DA1              23 GiB  ok
 c014d2                         2  DA2              23 GiB  ok
 c014d3                         2  DA3             510 GiB  dead/systemDrain/noRGD/noVCD/replace
 c014d4                         2  DA4              12 GiB  ok
 [...]
 c018d1                         2  DA1              24 GiB  ok
 c018d2                         2  DA2              24 GiB  ok
 c018d3                         2  DA3             558 GiB  dead/systemDrain/noRGD/noVCD/noData/replace
 c018d4                         2  DA4              12 GiB  ok
 [...]

The preceding output shows that the following pdisks are marked for replacement:
v c014d3 in DA3
v c018d3 in DA3

The naming convention used during recovery group creation indicates that these are the disks in slot 3 of carriers 14 and 18. To confirm the physical locations of the failed disks, use the mmlspdisk command to list information about those pdisks in declustered array DA3 of recovery group 000DE37TOP that are marked for replacement:

# mmlspdisk 000DE37TOP --declustered-array DA3 --replace
pdisk:

   replacementPriority = 1.00
   name = "c014d3"
   device = "/dev/rhdisk158,/dev/rhdisk62"
   recoveryGroup = "000DE37TOP"
   declusteredArray = "DA3"
   state = "dead/systemDrain/noRGD/noVCD/replace"
   freeSpace = 513745775616
   fru = "74Y4936"
   location = "78AD.001.000DE37-C14-D3"
   WWN = "naa.5000C5001D90E17F"
   server = "server1"
   reads = 986
   writes = 293368
   bytesReadInGiB = 0.826
   bytesWrittenInGiB = 286.696
   IOErrors = 0
   IOTimeouts = 12
   mediaErrors = 0
   checksumErrors = 0
   pathErrors = 0
   timeBadness = 33.278
   dataBadness = 0.000

pdisk:


   replacementPriority = 1.00
   name = "c018d3"
   device = "/dev/rhdisk630,/dev/rhdisk726"
   recoveryGroup = "000DE37TOP"
   declusteredArray = "DA3"
   state = "dead/systemDrain/noRGD/noVCD/noData/replace"
   freeSpace = 599147937792
   fru = "74Y4936"
   location = "78AD.001.000DE37-C18-D3"
   WWN = "naa.5000C5001DC5DF3F"
   server = "server1"
   reads = 1104
   writes = 379053
   bytesReadInGiB = 0.844
   bytesWrittenInGiB = 370.717
   IOErrors = 0
   IOTimeouts = 0
   mediaErrors = 0
   checksumErrors = 10
   pathErrors = 0
   timeBadness = 0.000
   dataBadness = 10.000

The preceding location code attributes confirm the pdisk naming convention:

 Disk           Location code             Interpretation
 -------------  ------------------------  ---------------------------------------------
 pdisk c014d3   78AD.001.000DE37-C14-D3   Disk 3 in carrier 14 in the disk enclosure
                                          identified by enclosure type 78AD.001 and
                                          serial number 000DE37
 pdisk c018d3   78AD.001.000DE37-C18-D3   Disk 3 in carrier 18 in the disk enclosure
                                          identified by enclosure type 78AD.001 and
                                          serial number 000DE37

Replacing the failed disks in a Power 775 Disk Enclosure recovery group

Note: In this example, it is assumed that two new disks with the appropriate Field Replaceable Unit (FRU) code, as indicated by the fru attribute (74Y4936 in this case), have been obtained as replacements for the failed pdisks c014d3 and c018d3.

Replacing each disk is a three-step process:
1. Using the mmchcarrier command with the --release flag to suspend use of the other disks in the
   carrier and to release the carrier.
2. Removing the carrier and replacing the failed disk within with a new one.
3. Using the mmchcarrier command with the --replace flag to resume use of the suspended disks and to
   begin use of the new disk.

GPFS Native RAID assigns a priority to pdisk replacement. Disks with smaller values for the replacementPriority attribute should be replaced first. In this example, the only failed disks are in DA3 and both have the same replacementPriority.

Disk c014d3 is chosen to be replaced first.
1. To release carrier 14 in disk enclosure 000DE37:

   # mmchcarrier 000DE37TOP --release --pdisk c014d3
   [I] Suspending pdisk c014d1 of RG 000DE37TOP in location 78AD.001.000DE37-C14-D1.
   [I] Suspending pdisk c014d2 of RG 000DE37TOP in location 78AD.001.000DE37-C14-D2.
   [I] Suspending pdisk c014d3 of RG 000DE37TOP in location 78AD.001.000DE37-C14-D3.
   [I] Suspending pdisk c014d4 of RG 000DE37TOP in location 78AD.001.000DE37-C14-D4.
   [I] Carrier released.


       - Remove carrier.
       - Replace disk in location 78AD.001.000DE37-C14-D3 with FRU 74Y4936.
       - Reinsert carrier.
       - Issue the following command:

           mmchcarrier 000DE37TOP --replace --pdisk ’c014d3’

   Repair timer is running. Perform the above within 5 minutes
   to avoid pdisks being reported as missing.

   GPFS Native RAID issues instructions as to the physical actions that must be taken. Note that disks may be suspended only so long before they are declared missing; therefore the mechanical process of physically performing disk replacement must be accomplished promptly.

   Use of the other three disks in carrier 14 has been suspended, and carrier 14 is unlocked. The identify lights for carrier 14 and for disk 3 are on.

2. Carrier 14 should be unlatched and removed. The failed disk 3, as indicated by the internal identify light, should be removed, and the new disk with FRU 74Y4936 should be inserted in its place. Carrier 14 should then be reinserted and the latch closed.

3. To finish the replacement of pdisk c014d3:

   # mmchcarrier 000DE37TOP --replace --pdisk c014d3
   [I] The following pdisks will be formatted on node server1:
       /dev/rhdisk354
   [I] Pdisk c014d3 of RG 000DE37TOP successfully replaced.
   [I] Resuming pdisk c014d1 of RG 000DE37TOP.
   [I] Resuming pdisk c014d2 of RG 000DE37TOP.
   [I] Resuming pdisk c014d3#162 of RG 000DE37TOP.
   [I] Resuming pdisk c014d4 of RG 000DE37TOP.
   [I] Carrier resumed.

When the mmchcarrier --replace command returns successfully, GPFS Native RAID has resumed use of the other 3 disks. The failed pdisk may remain in a temporary form (indicated here by the name c014d3#162) until all data from it has been rebuilt, at which point it is finally deleted. The new replacement disk, which has assumed the name c014d3, will have RAID tracks rebuilt and rebalanced onto it. Notice that only one block device name is mentioned as being formatted as a pdisk; the second path will be discovered in the background.

This can be confirmed with mmlsrecoverygroup -L --pdisk:

# mmlsrecoverygroup 000DE37TOP -L --pdisk

                     declustered
 recovery group        arrays     vdisks  pdisks
 -----------------  -----------  ------  ------
 000DE37TOP                   5       9     193

 declustered  needs                          replace              scrub        background activity
    array     service vdisks pdisks spares  threshold free space  duration  task        progress  priority
 -----------  ------- ------ ------ ------  --------- ----------  --------  -------------------------
 DA1          no           2     47      2          2   3072 MiB   14 days  scrub            63%  low
 DA2          no           2     47      2          2   3072 MiB   14 days  scrub            19%  low
 DA3          yes          2     48      2          2        0 B   14 days  rebuild-2r       89%  low
 DA4          no           2     47      2          2   3072 MiB   14 days  scrub            34%  low
 LOG          no           1      4      1          1    546 GiB   14 days  scrub            87%  low

                      number of   declustered
 pdisk               active paths    array     free space  state
 -----------------   ------------  -----------  ----------  -----
 [...]
 c014d1                         2  DA1              23 GiB  ok
 c014d2                         2  DA2              23 GiB  ok
 c014d3                         2  DA3             550 GiB  ok
 c014d3#162                     0  DA3             543 GiB  dead/adminDrain/noRGD/noVCD/noPath
 c014d4                         2  DA4              23 GiB  ok
 [...]
 c018d1                         2  DA1              24 GiB  ok
 c018d2                         2  DA2              24 GiB  ok
 c018d3                         2  DA3             558 GiB  dead/systemDrain/noRGD/noVCD/noData/replace
 c018d4                         2  DA4              23 GiB  ok
 [...]

Notice that the temporary pdisk c014d3#162 is counted in the total number of pdisks in declustered array DA3 and in the recovery group, until it is finally drained and deleted.

Notice also that pdisk c018d3 is still marked for replacement, and that DA3 still needs service. This is because the GPFS Native RAID replacement policy expects all failed disks in the declustered array to be replaced once the replacement threshold is reached. The replace state on a pdisk is not removed when the total number of failed disks falls below the threshold.
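At this point in the scenario, the pdisks still awaiting replacement could be listed again with the mmlspdisk command already shown above; a minimal sketch:

# List only the pdisks of this recovery group that remain marked for replacement
# (at this point, only c018d3 would be expected to appear).
mmlspdisk 000DE37TOP --replace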

Pdisk c018d3 is replaced following the same process.
1. Release carrier 18 in disk enclosure 000DE37:

   # mmchcarrier 000DE37TOP --release --pdisk c018d3
   [I] Suspending pdisk c018d1 of RG 000DE37TOP in location 78AD.001.000DE37-C18-D1.
   [I] Suspending pdisk c018d2 of RG 000DE37TOP in location 78AD.001.000DE37-C18-D2.
   [I] Suspending pdisk c018d3 of RG 000DE37TOP in location 78AD.001.000DE37-C18-D3.
   [I] Suspending pdisk c018d4 of RG 000DE37TOP in location 78AD.001.000DE37-C18-D4.
   [I] Carrier released.

       - Remove carrier.
       - Replace disk in location 78AD.001.000DE37-C18-D3 with FRU 74Y4936.
       - Reinsert carrier.
       - Issue the following command:

           mmchcarrier 000DE37TOP --replace --pdisk ’c018d3’

   Repair timer is running. Perform the above within 5 minutes
   to avoid pdisks being reported as missing.

2. Unlatch and remove carrier 18, remove and replace failed disk 3, reinsert carrier 18, and close the latch.

3. To finish the replacement of pdisk c018d3:

   # mmchcarrier 000DE37TOP --replace --pdisk c018d3
   [I] The following pdisks will be formatted on node server1:
       /dev/rhdisk674
   [I] Pdisk c018d3 of RG 000DE37TOP successfully replaced.
   [I] Resuming pdisk c018d1 of RG 000DE37TOP.
   [I] Resuming pdisk c018d2 of RG 000DE37TOP.
   [I] Resuming pdisk c018d3#166 of RG 000DE37TOP.
   [I] Resuming pdisk c018d4 of RG 000DE37TOP.
   [I] Carrier resumed.

Running mmlsrecoverygroup again will confirm the second replacement:

# mmlsrecoverygroup 000DE37TOP -L --pdisk

                     declustered
 recovery group        arrays     vdisks  pdisks
 -----------------  -----------  ------  ------
 000DE37TOP                   5       9     192

 declustered  needs                          replace              scrub        background activity
    array     service vdisks pdisks spares  threshold free space  duration  task        progress  priority
 -----------  ------- ------ ------ ------  --------- ----------  --------  -------------------------
 DA1          no           2     47      2          2   3072 MiB   14 days  scrub            64%  low
 DA2          no           2     47      2          2   3072 MiB   14 days  scrub            22%  low
 DA3          no           2     47      2          2   2048 MiB   14 days  rebalance        12%  low
 DA4          no           2     47      2          2   3072 MiB   14 days  scrub            36%  low
 LOG          no           1      4      1          1    546 GiB   14 days  scrub            89%  low

                      number of   declustered
 pdisk               active paths    array     free space  state
 -----------------   ------------  -----------  ----------  -----
 [...]
 c014d1                         2  DA1              23 GiB  ok
 c014d2                         2  DA2              23 GiB  ok
 c014d3                         2  DA3             271 GiB  ok
 c014d4                         2  DA4              23 GiB  ok
 [...]
 c018d1                         2  DA1              24 GiB  ok
 c018d2                         2  DA2              24 GiB  ok
 c018d3                         2  DA3             542 GiB  ok
 c018d4                         2  DA4              23 GiB  ok
 [...]

Notice that both temporary pdisks have been deleted. This is because c014d3#162 has finished draining, and because pdisk c018d3#166 had, before it was replaced, already been completely drained (as evidenced by the noData flag). Declustered array DA3 no longer needs service and once again contains 47 pdisks, and the recovery group once again contains 192 pdisks.


Chapter 4. GPFS Native RAID commands

The following table summarizes the GPFS Native RAID commands.

(See also Chapter 5, “Other GPFS commands related to GPFS Native RAID,” on page 77.)

Table 7. GPFS Native RAID commands

 Command                                           Purpose

 “mmaddpdisk command” on page 44                   Adds a pdisk to a GPFS Native RAID recovery group.

 “mmchcarrier command” on page 46                  Allows GPFS Native RAID pdisks to be physically removed
                                                   and replaced.

 “mmchpdisk command” on page 49                    Changes GPFS Native RAID pdisk states. This command is to
                                                   be used only in extreme situations under the guidance of
                                                   IBM service personnel.

 “mmchrecoverygroup command” on page 51            Changes GPFS Native RAID recovery group and declustered
                                                   array attributes.

 “mmcrrecoverygroup command” on page 53            Creates a GPFS Native RAID recovery group and its
                                                   component declustered arrays and pdisks and specifies the
                                                   servers.

 “mmcrvdisk command” on page 56                    Creates a vdisk within a declustered array of a GPFS Native
                                                   RAID recovery group.

 “mmdelpdisk command” on page 60                   Deletes GPFS Native RAID pdisks.

 “mmdelrecoverygroup command” on page 62           Deletes a GPFS Native RAID recovery group.

 “mmdelvdisk command” on page 64                   Deletes vdisks from a declustered array in a GPFS Native
                                                   RAID recovery group.

 “mmlspdisk command” on page 66                    Lists information for one or more GPFS Native RAID pdisks.

 “mmlsrecoverygroup command” on page 69            Lists information about GPFS Native RAID recovery groups.

 “mmlsrecoverygroupevents command” on page 72      Displays the GPFS Native RAID recovery group event log.

 “mmlsvdisk command” on page 74                    Lists information for one or more GPFS Native RAID vdisks.


mmaddpdisk command

Adds a pdisk to a GPFS Native RAID recovery group.

Synopsis

mmaddpdisk RecoveryGroupName -F StanzaFile [--replace] [-v {yes | no}]

Description

The mmaddpdisk command adds a pdisk to a recovery group.

Note: The GPFS daemon must be running on both the primary and backup servers to run this command.

Parameters

RecoveryGroupName
   Specifies the recovery group to which the pdisks are being added.

-F StanzaFile
   Specifies a file containing pdisk stanzas that identify the pdisks to be added.

   Pdisk stanzas look like the following:

   %pdisk: pdiskName=PdiskName
           device=BlockDeviceName
           da=DeclusteredArrayName

   Examples of values for BlockDeviceName are hdisk3 or /dev/hdisk3.

--replace
   Indicates that any existing pdisk that has the same name as a pdisk listed in the stanza file is to be replaced. (This is an atomic deletion and addition.)

-v {yes | no}
   Verifies that specified pdisks do not belong to an existing recovery group. The default is -v yes.

   Specify -v no only when you are certain that the specified disk does not belong to an existing recovery group. Using -v no on a disk that already belongs to a recovery group will corrupt that recovery group.

Exit status

0 Successful completion.

nonzero
   A failure has occurred.

Security

You must have root authority to run the mmaddpdisk command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. For additional details, see the following topic in GPFS: Administration and Programming Reference: Requirements for administering a GPFS file system.

Examples

In this example, assume that the input stanza file, pdisk.c033d2, contains the following lines:


%pdisk: pdiskName=c033d2
        device=/dev/hdisk674
        da=DA2

This command example shows how to add the pdisk described in stanza file pdisk.c033d2 to recovery group 000DE37BOT:

mmaddpdisk 000DE37BOT -F pdisk.c033d2
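As a further illustration, not part of the original example, the --replace flag described above could be used to atomically swap in a new device for a pdisk of the same name; the stanza file name pdisk.c033d2.replace and device /dev/hdisk680 below are hypothetical:

# Hypothetical stanza file pdisk.c033d2.replace, reusing pdisk name c033d2 with a new device.
%pdisk: pdiskName=c033d2
        device=/dev/hdisk680
        da=DA2

# With --replace, the existing pdisk of the same name is deleted and the new one
# added as a single (atomic) operation.
mmaddpdisk 000DE37BOT -F pdisk.c033d2.replace --replace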

See also

See also the following topics in GPFS Native RAID Administration and Programming Reference:
v “mmchpdisk command” on page 49
v “mmdelpdisk command” on page 60
v “mmlspdisk command” on page 66
v “mmlsrecoverygroup command” on page 69

Location

/usr/lpp/mmfs/bin


mmchcarrier command

Allows GPFS Native RAID pdisks to be physically removed and replaced.

Synopsis

mmchcarrier RecoveryGroupName --release
            {[--pdisk "Pdisk[;Pdisk]" [--location "Location[;Location]"]}
            [--force-release] [--force-rg]

or

mmchcarrier RecoveryGroupName --resume
            {[--pdisk "Pdisk[;Pdisk]" [--location "Location[;Location]"]}
            [--force-rg]

or

mmchcarrier RecoveryGroupName --replace
            {[--pdisk "Pdisk[;Pdisk]" [--location "Location[;Location]"]}
            [-v {yes|no}] [--force-fru] [--force-rg]

Description

The mmchcarrier command is used to control disk carriers and replace failed pdisks.

Replacing a pdisk requires the following three steps:
1. Run the mmchcarrier --release command to prepare the carrier for removal.
   The mmchcarrier --release command suspends I/O to all disks in the carrier, turns off power to the
   disks, illuminates identify lights on the carrier, and unlocks the carrier latch (if applicable).
2. Remove the carrier from the disk drawer, replace the failed disk or disks with new disks, and reinsert
   the carrier into the disk drawer.
3. Run the mmchcarrier --replace command to complete the replacement.
   The mmchcarrier --replace command powers on the disks, verifies that the new disks have been
   installed, resumes I/O, and begins the rebuilding and rebalancing process onto the new disks.

   Note: New disks will take the name of the replaced pdisks. In the event that replaced pdisks have
   not completely drained, they will be given a temporary name consisting of the old pdisk name with a
   suffix of the form #nnnn. The temporary pdisk will have the adminDrain pdisk state flag set and will
   be deleted once drained. For example, a pdisk named p25 will receive a temporary name similar to
   p25#0010 when the adminDrain state flag is set. This allows the new disk that is replacing it to be
   named p25 immediately rather than waiting for the old disk to be completely drained and deleted.
   Until the draining and deleting process completes, both the new pdisk p25 and the old pdisk p25#0010
   will show up in the output of the mmlsrecoverygroup and mmlspdisk commands.

Both the release and replace commands require either a recovery group name and a location code, or a recovery group name and a pdisk name to identify the carrier and particular disk slot within the carrier. It is acceptable to provide more than one location code or pdisk name to replace multiple disks within the same carrier.

The mmchcarrier --resume command reverses the effect of the release command without doing disk replacements. It can be used to cancel the disk replacement procedure after running the mmchcarrier --release command.

Parameters

RecoveryGroupName
   Specifies the name of the recovery group to which the carrier belongs. This is used to identify the active server where the low level commands will be issued.

46 GPFS Native RAID Administration and Programming Reference

Page 59: A 2313540

--release
   Suspends all disks in the carrier, activates identify lights, and unlocks the carrier.

--resume
   Resumes all disks in the carrier without doing disk replacements.

--replace
   Formats the replacement disks for use and resumes all disks in the carrier.

--pdisk
   Specifies the target pdisk or pdisks and identifies the carrier. All specified pdisks must belong to the same carrier.

--location
   Specifies the target pdisk or pdisks and identifies the carrier by location code (consisting of machine type, model, serial number, carrier number, and disk-on-carrier number; for example, 58DE-001-000000D-C111-D3). All specified pdisks must belong to the same carrier.

--force-release
   This is a force flag for the --release option, to release the carrier even if the target is not marked for replacement. Disks marked for replacement are identified via the mmlspdisk --replace command.

--force-fru
   This is a force flag for the --replace option, to allow the replacement even if the field replaceable unit (FRU) number of the new disk does not match that of the old disk.

--force-rg
   This is a force flag for the --release, --resume, and --replace options to allow actions on the carrier even if all the pdisks do not belong to the same recovery group.

-v {yes | no}
   Verification flag for the --replace option; indicates whether or not to verify that the new disk does not already have a valid pdisk descriptor. The default is -v yes.

   Specify -v no to allow a disk that was formerly part of some other recovery group to be reused.

Exit status

0 Successful completion.

nonzero
   A failure has occurred.

Security

You must have root authority to run the mmchcarrier command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. For additional details, see the following topic in GPFS: Administration and Programming Reference: Requirements for administering a GPFS file system.

Examples
1. The following command example shows how to release the carrier containing failed pdisk c014d3 in
   recovery group 000DE37BOT:

   mmchcarrier 000DE37BOT --release --pdisk c014d3

   The system displays output similar to the following:

   [I] Suspending pdisk c014d1 of RG 000DE37TOP in location 78AD.001.000DE37-C14-D1.
   [I] Suspending pdisk c014d2 of RG 000DE37TOP in location 78AD.001.000DE37-C14-D2.
   [I] Suspending pdisk c014d3 of RG 000DE37TOP in location 78AD.001.000DE37-C14-D3.
   [I] Suspending pdisk c014d4 of RG 000DE37TOP in location 78AD.001.000DE37-C14-D4.
   [I] Carrier released.

       - Remove carrier.
       - Replace disk in location 78AD.001.000DE37-C14-D3 with FRU 74Y4936.
       - Reinsert carrier.
       - Issue the following command:

           mmchcarrier 000DE37TOP --replace --pdisk ’c014d3’

   Repair timer is running. Perform the above within 5 minutes
   to avoid pdisks being reported as missing.

2. The following command example shows how to tell GPFS that the carrier containing pdisk c014d3 in
   recovery group 000DE37BOT has been reinserted and is ready to be brought back online:

   mmchcarrier 000DE37BOT --replace --pdisk c014d3

   The system displays output similar to the following:

   [I] The following pdisks will be formatted on node server1:
       /dev/rhdisk354
   [I] Pdisk c014d3 of RG 000DE37TOP successfully replaced.
   [I] Resuming pdisk c014d1 of RG 000DE37TOP.
   [I] Resuming pdisk c014d2 of RG 000DE37TOP.
   [I] Resuming pdisk c014d3#162 of RG 000DE37TOP.
   [I] Resuming pdisk c014d4 of RG 000DE37TOP.
   [I] Carrier resumed.
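If, after releasing a carrier, the administrator decides not to replace anything after all, the procedure can be cancelled with the --resume option described above; a minimal sketch using the same pdisk name:

# Cancel the replacement procedure and resume use of all disks in the carrier,
# without formatting any replacement disks.
mmchcarrier 000DE37BOT --resume --pdisk c014d3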

See also

See also the following topics in GPFS Native RAID Administration and Programming Reference:
v “mmchpdisk command” on page 49
v “mmchrecoverygroup command” on page 51

Location

/usr/lpp/mmfs/bin


mmchpdisk command

Changes GPFS Native RAID pdisk states. This command is to be used only in extreme situations under the guidance of IBM service personnel.

Synopsis

mmchpdisk RecoveryGroupName --pdisk PdiskName
          {--kill | --revive | --suspend | --resume | --diagnose | --identify {on|off}}

Description

The mmchpdisk command changes the states of pdisks.

Attention: This command is to be used only in extreme situations under the guidance of IBM service personnel.

Parameters

RecoveryGroupName
   Specifies the recovery group containing the pdisk for which one or more states are to be changed.

--pdisk PdiskName
   Specifies the target pdisk.

--kill
   Forces the failure of a pdisk by setting the dead pdisk state flag.

   Attention: This option must be used with caution; if the total number of failures in a declustered array exceeds the fault tolerance of any vdisk in that array, permanent data loss might result.

--revive
   Attempts to make a failed disk usable again by removing dead, failing, and readonly pdisk state flags. Data allocated on the disk that has not been rebuilt onto spare space can become readable again; however, any data that has already been reported as lost cannot be recovered.

--suspend
   Suspends I/O to the pdisk until a subsequent resume command is given. If a pdisk remains in the suspended state for longer than a predefined timeout period, GPFS Native RAID begins rebuilding the data from that pdisk into spare space.

   Attention: This option is to be used with caution and only when performing maintenance on disks manually, bypassing the automatic system provided by mmchcarrier.

--resume
   Cancels a previously given mmchpdisk --suspend command and resumes use of the pdisk.

   Use this option only when performing maintenance on disks manually and bypassing the automatic system provided by mmchcarrier.

--diagnose
   Runs basic tests on the pdisk. If no problems are found, the pdisk state automatically returns to ok.

--identify {on | off}
   Turns on or off the disk identify light, if available.

Exit status

0 Successful completion.

nonzero
   A failure has occurred.

Chapter 4. GPFS Native RAID commands 49

Page 62: A 2313540

Security

You must have root authority to run the mmchpdisk command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. For additional details, see the following topic in GPFS: Administration and Programming Reference: Requirements for administering a GPFS file system.

Examples

The following command example shows how to tell GPFS to attempt to revive the failed pdisk c036d3 in recovery group 000DE37BOT:

mmchpdisk 000DE37BOT --pdisk c036d3 --revive
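As another illustration, not part of the original example, the same pdisk's identify light could be turned on before manual maintenance and off again afterward, using the --identify option described above:

# Turn the identify light for pdisk c036d3 on, and later off again
# (if the enclosure hardware provides such a light).
mmchpdisk 000DE37BOT --pdisk c036d3 --identify on
mmchpdisk 000DE37BOT --pdisk c036d3 --identify off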

See also

See also the following topics in GPFS Native RAID Administration and Programming Reference:
v “mmaddpdisk command” on page 44
v “mmchcarrier command” on page 46
v “mmdelpdisk command” on page 60
v “mmlspdisk command” on page 66
v “mmlsrecoverygroup command” on page 69

Location

/usr/lpp/mmfs/bin


mmchrecoverygroup command

Changes GPFS Native RAID recovery group and declustered array attributes.

Synopsis

mmchrecoverygroup RecoveryGroupName {--declustered-array DeclusteredArrayName
                  {[--spares NumberOfSpares] [--scrub-duration NumberOfDays]
                  [--replace-threshold NumberOfDisks]}}

or

mmchrecoverygroup RecoveryGroupName --active ServerName

or

mmchrecoverygroup RecoveryGroupName --servers Primary[,Backup] [-v {yes | no}]

Description

The mmchrecoverygroup command changes recovery group and declustered array attributes.

Parameters

RecoveryGroupName
   Specifies the name of the recovery group being changed.

--declustered-array DeclusteredArrayName
   Specifies the name of the declustered array being changed.

--spares NumberOfSpares
   Specifies the number of disks' worth of spare space to set aside in the declustered array. This space is used to rebuild the declustered arrays when physical disks fail.

--scrub-duration NumberOfDays
   Specifies the number of days, from 1 to 60, for the duration of the scrub. The default value is 14.

--replace-threshold NumberOfDisks
   Specifies the number of pdisks that must fail in the declustered array before mmlsrecoverygroup will report that service (disk replacement) is needed.

--active ServerName
   Changes the active server for the recovery group.

--servers Primary[,Backup]
   Changes the defined list of recovery group servers.

   Note: To use this option, all file systems that use this recovery group must be unmounted.

-v {yes | no}
   Specifies whether the new server or servers should verify access to the pdisks of the recovery group. The default is -v yes. Use -v no to specify that configuration changes should be made without verifying pdisk access; for example, you could use this if all the servers were down.

Exit status

0 Successful completion.

nonzero
   A failure has occurred.

Security

You must have root authority to run the mmchrecoverygroup command.

Chapter 4. GPFS Native RAID commands 51

Page 64: A 2313540

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. For additional details, see the following topic in GPFS: Administration and Programming Reference: Requirements for administering a GPFS file system.

Examples
1. The following command example shows how to change the number of spares to one and the
   replacement threshold to one for declustered array DA4 in recovery group 000DE37TOP:

   mmchrecoverygroup 000DE37TOP --declustered-array DA4 --spares 1 --replace-threshold 1

2. The following command example shows how to change the scrub duration to one day for declustered
   array DA2 of recovery group 000DE37BOT:

   mmchrecoverygroup 000DE37BOT --declustered-array DA2 --scrub-duration 1

3. The following command example shows how to replace the servers for a recovery group. In this
   example, assume that the two current servers for recovery group RG1 have been taken down for
   extended maintenance. GPFS is shut down on these current RG1 servers. The new servers have already
   been configured to be recovery group servers, with mmchconfig parameter nsdRAIDTracks set to a
   nonzero value. The disks for RG1 have not yet been connected to the new servers, so verification of
   disk availability must be disabled by specifying -v no as shown here:

   mmchrecoverygroup RG1 --servers newprimary,newbackup -v no
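A further sketch, not part of the original examples, shows how the active server for a recovery group could be moved using the --active option described above; the server name server2 follows the earlier example listings:

# Make server2 the active server for recovery group 000DE37TOP
# (server2 must already be one of the defined servers for the recovery group).
mmchrecoverygroup 000DE37TOP --active server2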

See also

See also the following topics in GPFS Native RAID Administration and Programming Reference:
v “mmchpdisk command” on page 49
v “mmdelrecoverygroup command” on page 62
v “mmcrrecoverygroup command” on page 53
v “mmlsrecoverygroup command” on page 69

Location

/usr/lpp/mmfs/bin


mmcrrecoverygroup command

Creates a GPFS Native RAID recovery group and its component declustered arrays and pdisks and specifies the servers.

Synopsis

mmcrrecoverygroup RecoveryGroupName -F StanzaFile --servers {Primary[,Backup]} [-v {yes | no}]

Description

The mmcrrecoverygroup command is used to define a cluster-wide recovery group for use by GPFS Native RAID. A recovery group is a set of physical disks shared by up to two server nodes. The set of disks must be partitioned into one or more declustered arrays.

See the following topic in GPFS Native RAID Administration and Programming Reference: Chapter 2, “Managing GPFS Native RAID,” on page 11.

The pdisk stanzas assigned to a recovery group must contain at least one declustered array that meets the definition of large.

While the mmcrrecoverygroup command creates the declustered arrays and pdisks and defines the servers, further processing with the mmcrvdisk and mmcrnsd commands is necessary to create GPFS Native RAID vdisk NSDs within the recovery group.

Note: The recovery group servers must be active to run this command.

Parameters

RecoveryGroupName
   Name of the recovery group being created.

-F StanzaFile
   Specifies a file that includes pdisk stanzas and declustered array stanzas that are used to create the recovery group. The declustered array stanzas are optional.

   Pdisk stanzas look like the following:

   %pdisk: pdiskName=PdiskName
           device=BlockDeviceName
           da=DeclusteredArrayName

   where:

   pdiskName=PdiskName
      Specifies the name of a pdisk.

   device=BlockDeviceName
      Specifies the name of a block device. The value provided for BlockDeviceName must refer to the block device as configured by the operating system on the primary recovery group server.

      Example values for BlockDeviceName are hdisk3 and /dev/hdisk3.

      Only one BlockDeviceName needs to be used, even if the device uses multipath and has multiple device names.

   da=DeclusteredArrayName
      Specifies the DeclusteredArrayName in the pdisk stanza, which implicitly creates the declustered array with default parameters.

   Declustered array stanzas look like the following:


   %da: daName=DeclusteredArrayName
        spares=Number
        replaceThreshold=Number
        scrubDuration=Number

   where:

   daName=DeclusteredArrayName
      Specifies the name of the declustered array for which you are overriding the default values.

   spares=Number
      Specifies the number of disks' worth of spare space to set aside in the declustered array. The number of spares can be 1 or higher. The default values are the following:

      1 for arrays with 9 or fewer disks

      2 for arrays with 10 or more disks

   replaceThreshold=Number
      Specifies the number of pdisks that must fail in the declustered array before mmlspdisk will report that pdisks need to be replaced. The default is equal to the number of spares.

   scrubDuration=Number
      Specifies the length of time (in days) by which the scrubbing of the entire array must be completed. Valid values are 1 to 60. The default value is 14 days.

--servers {Primary[,Backup]}
   Specifies the primary server and, optionally, a backup server.

-v {yes | no}
   Verification flag that specifies whether each pdisk in the stanza file should only be created if it has not been previously formatted as a pdisk (or NSD). The default is -v yes. Use -v no to specify that the disks should be created regardless of whether they have been previously formatted or not.

Exit status

0 Successful completion.

nonzero
   A failure has occurred.

Security

You must have root authority to run the mmcrrecoverygroup command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages.

Examples

Assume that input stanza file 000DE37BOT contains the following lines:

%pdisk: pdiskName=c034d1
        device=/dev/hdisk316
        da=DA1
%pdisk: pdiskName=c034d2
        device=/dev/hdisk317
        da=DA2
%pdisk: pdiskName=c034d3
        device=/dev/hdisk318
        da=DA3
%pdisk: pdiskName=c034d4
        device=/dev/hdisk319
        da=DA4
%pdisk: pdiskName=c033d1
        device=/dev/hdisk312
        da=LOG
[...]

The following command example shows how to create recovery group 000DE37BOT using stanza file 000DE37BOT, with c250f10c08ap01-hf0 as the primary server and c250f10c07ap01-hf0 as the backup server:

mmcrrecoverygroup 000DE37BOT -F 000DE37BOT --servers c250f10c08ap01-hf0,c250f10c07ap01-hf0

The system displays output similar to the following:

mmcrrecoverygroup: Propagating the cluster configuration data to all
  affected nodes. This is an asynchronous process.
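If the default declustered array parameters described above need to be overridden, %da stanzas could be added to the same stanza file; a minimal sketch with illustrative values:

# Illustrative declustered array stanza overriding the defaults for DA1:
# two spares, report replacement as soon as one pdisk fails, and a 7-day scrub.
%da: daName=DA1
     spares=2
     replaceThreshold=1
     scrubDuration=7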

See also

See also the following topics in GPFS Native RAID Administration and Programming Reference:
v “mmchrecoverygroup command” on page 51
v “mmcrvdisk command” on page 56
v “mmdelrecoverygroup command” on page 62
v “mmlsrecoverygroup command” on page 69

Location

/usr/lpp/mmfs/bin


mmcrvdisk command

Creates a vdisk within a declustered array of a GPFS Native RAID recovery group.

Synopsis

mmcrvdisk -F StanzaFile

Description

The mmcrvdisk command creates one or more vdisks. Upon successful completion of the mmcrvdisk command, the vdisk stanza file is rewritten in a form that can be used as input to the mmcrnsd command.

The first vdisk that is created in a recovery group must be a log vdisk, which is indicated by a disk usage of vdiskLog.

Note: The recovery group must be active to run this command.

Parameters

-F StanzaFile
   Specifies a file that includes vdisk stanzas identifying the vdisks to be created.

   Vdisk stanzas look like the following:

   %vdisk: vdiskName=VdiskName
           rg=RecoveryGroupName
           da=DeclusteredArrayName
           blocksize=BlockSize
           size=Size
           raidCode=RaidCode
           diskUsage=DiskUsage
           failureGroup=FailureGroup
           pool=StoragePool

   where:

   vdiskName=VdiskName
      Specifies the name you wish to assign to the vdisk NSD to be created. This name must not already be used as another GPFS disk name, and it must not begin with the reserved string 'gpfs'.

      Note: This name can contain only the following characters:
      Uppercase letters 'A' through 'Z'
      Lowercase letters 'a' through 'z'
      Numerals '0' through '9'
      Underscore character '_'
      All other characters are not valid.

   rg=RecoveryGroupName
      Specifies the recovery group where the vdisk is to be created.

   da=DeclusteredArrayName
      Specifies the declustered array within the recovery group where the vdisk is to be created.

   blocksize=BlockSize
      Specifies the size of the data blocks for the vdisk. This should match the block size planned for the file system and must be one of the following: 256 KiB (the default), 512 KiB, 1 MiB, 2 MiB, 4 MiB, 8 MiB, or 16 MiB. Specify this value with the character K or M (for example, 512K).

      If you are using the system pool as metadata only and placing your data in a separate pool, you can specify your metadata-only vdisks with one block size and your data vdisks with a different block size.

   size=Size
      Specifies the size of the vdisk. If size=Size is omitted, it defaults to using all the available space in the declustered array. The requested vdisk size is equally allocated across all of the pdisks within the declustered array.

   raidCode=RaidCode
      Specifies the RAID code to be used for the vdisk. Valid codes are the following:

      3WayReplication
         Indicates three-way replication.

      4WayReplication
         Indicates four-way replication.

      8+2p
         Indicates Reed-Solomon 8 + 2p.

      8+3p
         Indicates Reed-Solomon 8 + 3p.

   diskUsage=DiskUsage
      Specifies a disk usage or accepts the default. With the exception of the vdiskLog value, this field is ignored by the mmcrvdisk command and is passed unchanged to the output stanza file.

      Possible values are the following:

      dataAndMetadata
         Indicates that the disk contains both data and metadata. This is the default for disks in the system pool.

      dataOnly
         Indicates that the disk contains data and does not contain metadata. This is the default for disks in storage pools other than the system pool.

      metadataOnly
         Indicates that the disk contains metadata and does not contain data.

      descOnly
         Indicates that the disk contains no data and no file metadata. Such a disk is used solely to keep a copy of the file system descriptor, and can be used as a third failure group in certain disaster recovery configurations. For more information, see General Parallel File System: Advanced Administration and search on Synchronous mirroring utilizing GPFS replication.

      vdiskLog
         Indicates that this is the log vdisk for the recovery group.

         The first vdisk created in the recovery group must be the log vdisk. The output stanza file will have this vdisk line commented out because you cannot create an NSD on this vdisk.

   failureGroup=FailureGroup
      This field is ignored by the mmcrvdisk command and is passed unchanged to the output stanza file. It is recommended that a value between 1 and 4000 that is unique to the recovery group be specified; however, in the case of a vdisk, this value defaults to -1, which indicates that the disk has no point of failure in common with any other disk.

   pool=StoragePool
      This field is ignored by the mmcrvdisk command and is passed unchanged to the output stanza file. It specifies the name of the storage pool to which the vdisk NSD is assigned.


Exit status

0 Successful completion.

nonzero
   A failure has occurred.

Security

You must have root authority to run the mmcrvdisk command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. For additional details, see the following topic in GPFS: Administration and Programming Reference: Requirements for administering a GPFS file system.

Examples

Assume that input stanza file 000DE37TOP.vdisk contains the following lines:

%vdisk: vdiskName=000DE37TOPLOG
        rg=000DE37TOP
        da=LOG
        blocksize=1m
        size=4g
        raidCode=3WayReplication
        diskUsage=vdiskLog

%vdisk: vdiskName=000DE37TOPDA1META
        rg=000DE37TOP
        da=DA1
        blocksize=1m
        size=250g
        raidCode=4WayReplication
        diskUsage=metadataOnly
        failureGroup=37
        pool=system

%vdisk: vdiskName=000DE37TOPDA1DATA
        rg=000DE37TOP
        da=DA1
        blocksize=8m
        raidCode=8+3p
        diskUsage=dataOnly
        failureGroup=37
        pool=data

[...]

The following command example shows how to create the vdisks described in the stanza file 000DE37TOP.vdisk:

mmcrvdisk -F 000DE37TOP.vdisk

The system displays output similar to the following:

mmcrvdisk: [I] Processing vdisk 000DE37TOPLOG
mmcrvdisk: [I] Processing vdisk 000DE37TOPDA1META
mmcrvdisk: [I] Processing vdisk 000DE37TOPDA1DATA
mmcrvdisk: [I] Processing vdisk 000DE37TOPDA2META
mmcrvdisk: [I] Processing vdisk 000DE37TOPDA2DATA
mmcrvdisk: [I] Processing vdisk 000DE37TOPDA3META
mmcrvdisk: [I] Processing vdisk 000DE37TOPDA3DATA
mmcrvdisk: [I] Processing vdisk 000DE37TOPDA4META
mmcrvdisk: [I] Processing vdisk 000DE37TOPDA4DATA
mmcrvdisk: Propagating the cluster configuration data to all
  affected nodes. This is an asynchronous process.


See also

See also the following topics in GPFS Native RAID Administration and Programming Reference:
v “mmcrrecoverygroup command” on page 53
v “mmdelvdisk command” on page 64
v “mmlsvdisk command” on page 74

Location

/usr/lpp/mmfs/bin


mmdelpdisk command

Deletes GPFS Native RAID pdisks.

Synopsis

mmdelpdisk RecoveryGroupName {--pdisk "PdiskName[;PdiskName...]" | -F StanzaFile} [-a]

or

mmdelpdisk RecoveryGroupName --declustered-array DeclusteredArrayName

Description

The mmdelpdisk command deletes one or more pdisks. Deleting a pdisk causes any data allocated to that disk to be moved or rebuilt (drained) to spare space in the declustered array.

The mmdelpdisk command first renames each pdisk that is to be deleted, giving it a temporary name. The command then drains each renamed pdisk to remove all data from it. Finally, the command destroys each renamed pdisk once all data has been drained from it.

Note: The temporary name is obtained by appending a suffix in the form #nnnn to the pdisk name. For example, a pdisk named p25 will receive a temporary name similar to p25#0010; this allows you to use the mmaddpdisk command to add a new pdisk with the name p25 immediately rather than waiting for the old disk to be completely drained and removed. Until the draining and removing process is complete, both the new pdisk p25 and the old pdisk p25#0010 will show up in the output of the mmlsrecoverygroup and mmlspdisk commands.

If mmdelpdisk is interrupted (by an interrupt signal or by GPFS server failover), the deletion will proceed and will be completed as soon as another GPFS Native RAID server becomes the active vdisk server of the recovery group.

If you wish to delete a declustered array and all pdisks in that declustered array, use the --declustered-array DeclusteredArrayName form of the command.

The mmdelpdisk command cannot be used if the declustered array does not have enough spare space to hold the data that needs to be drained, or if it attempts to reduce the size of a large declustered array below the limit for large declustered arrays. Normally, all of the space in a declustered array is allocated to vdisks and spares, and therefore the only times the mmdelpdisk command typically can be used is after adding pdisks, after deleting vdisks, or after reducing the designated number of spares. See the following topics in GPFS Native RAID Administration and Programming Reference: “mmaddpdisk command” on page 44 and “mmchcarrier command” on page 46.

Note: The recovery group must be active to run this command.

Parameters

RecoveryGroupName
   Specifies the recovery group from which the pdisks are being deleted.

--pdisk "PdiskName[;PdiskName...]"
   Specifies a semicolon-separated list of pdisk names identifying the pdisks to be deleted.

-F StanzaFile
   Specifies a file that contains pdisk stanzas identifying the pdisks to be deleted.

   Pdisk stanzas look like the following:

   %pdisk: pdiskName=PdiskName
           device=BlockDeviceName
           da=DeclusteredArrayName

   where:

   pdiskName=PdiskName
      Specifies the name of the pdisk to be deleted.

   device=BlockDeviceName
      Specifies the name of a block device.

      Example values for BlockDeviceName are hdisk3 and /dev/hdisk3.

   da=DeclusteredArrayName
      Specifies the name of the declustered array containing the pdisk to be deleted.

-a Indicates that the data on the deleted pdisks is to be drained asynchronously. The pdisk will continue to exist, with its name changed to a temporary name, while the deletion progresses.

--declustered-array DeclusteredArrayName
   Specifies the name of the declustered array whose pdisks are to be deleted.

Exit status

0 Successful completion.

nonzero
   A failure has occurred.

Security

You must have root authority to run the mmdelpdisk command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. For additional details, see the following topic in GPFS: Administration and Programming Reference: Requirements for administering a GPFS file system.

Examples

The following command example shows how to remove pdisk c016d1 from recovery group 000DE37TOP and have it be drained in the background, thereby returning the administrator immediately to the command prompt:

mmdelpdisk 000DE37TOP --pdisk c016d1 -a
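A further sketch, not part of the original example, shows the --declustered-array form described above, which deletes a declustered array together with all of its pdisks; the array name DA4 is used purely for illustration:

# Delete declustered array DA4 of recovery group 000DE37TOP along with all of its pdisks.
mmdelpdisk 000DE37TOP --declustered-array DA4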

See also

See also the following topics in GPFS Native RAID Administration and Programming Reference:
v “mmaddpdisk command” on page 44
v “mmchpdisk command” on page 49
v “mmdelrecoverygroup command” on page 62
v “mmdelvdisk command” on page 64
v “mmlspdisk command” on page 66
v “mmlsrecoverygroup command” on page 69

Location

/usr/lpp/mmfs/bin


mmdelrecoverygroup command

Deletes a GPFS Native RAID recovery group.

Synopsis

mmdelrecoverygroup RecoveryGroupName [-p]

Description

The mmdelrecoverygroup command deletes the specified recovery group and the declustered arrays and pdisks that it contains. The recovery group must not contain any vdisks; use the mmdelvdisk command to delete vdisks prior to running this command.

Note: The recovery group must be active to run this command, unless the -p option is specified.

Parameters

RecoveryGroupName
   Specifies the name of the recovery group to delete.

-p Indicates that the recovery group is permanently damaged and that the recovery group information should be removed from the GPFS cluster configuration data. The -p option may be used when the GPFS daemon is down on the recovery group servers.

Exit status

0 Successful completion.

nonzero
   A failure has occurred.

Security

You must have root authority to run the mmdelrecoverygroup command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. For additional details, see the following topic in GPFS: Administration and Programming Reference: Requirements for administering a GPFS file system.

Examples

The following command example shows how to delete recovery group 000DE37BOT:

mmdelrecoverygroup 000DE37BOT

The system displays output similar to the following:

mmdelrecoverygroup: [I] Recovery group 000DE37BOT deleted on node c250f10c08ap01-hf0.ppd.pok.ibm.com.
mmdelrecoverygroup: Propagating the cluster configuration data to all
  affected nodes. This is an asynchronous process.

See also

See also the following topics in GPFS Native RAID Administration and Programming Reference:
v “mmchrecoverygroup command” on page 51
v “mmcrrecoverygroup command” on page 53
v “mmlsrecoverygroup command” on page 69


Location

/usr/lpp/mmfs/bin


mmdelvdisk command

Deletes vdisks from a declustered array in a GPFS Native RAID recovery group.

Synopsis

mmdelvdisk {"VdiskName[;VdiskName...]" | -F StanzaFile} [-p | --recovery-group RecoveryGroupName]

Description

The mmdelvdisk command is used to delete vdisks from the declustered arrays of recovery groups. The log vdisk in a recovery group must not be deleted until all other vdisks in the recovery group have been deleted. There must be no NSD defined on the specified vdisk; if necessary, use the mmdelnsd command prior to running this command.

Parameters

VdiskName[;VdiskName...]
   Specifies the vdisks to be deleted.

-F StanzaFile
   Specifies the name of a stanza file in which stanzas of the type %vdisk identify the vdisks to be deleted. Only the vdisk name is required to be included in the vdisk stanza; however, for a complete description of vdisk stanzas, see the following topic in GPFS Native RAID Administration and Programming Reference: “mmcrvdisk command” on page 56.

-p Indicates that the recovery group is permanently damaged, and therefore the vdisk information should be removed from the GPFS cluster configuration data. This option can be used when the GPFS daemon is down on the recovery group servers.

--recovery-group RecoveryGroupName
   Specifies the name of the recovery group that contains the vdisks. This option is for the rare case where the vdisks have been removed from the GPFS cluster configuration data but are still present in the recovery group.

   You can see the vdisks in the recovery group by issuing either of the following commands:
   mmlsvdisk --recovery-group RecoveryGroupName

   or

   mmlsrecoverygroup RecoveryGroupName -L

Exit status

0 Successful completion.

nonzero
   A failure has occurred.

Security

You must have root authority to run the mmdelvdisk command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. For additional details, see the following topic in GPFS: Administration and Programming Reference: Requirements for administering a GPFS file system.


Examples

The following command example shows how to delete vdisk 000DE37BOTDA4DATA:
mmdelvdisk 000DE37BOTDA4DATA

The system displays output similar to the following:
mmdelvdisk: Propagating the cluster configuration data to all
  affected nodes. This is an asynchronous process.
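To delete several vdisks at once, their names can be listed in a stanza file that is passed with the -F parameter. A minimal sketch of such a file, assuming the %vdisk stanza syntax described for the mmcrvdisk command and a hypothetical file name of delvdisk.stanza, would contain only the vdisk names:
%vdisk: vdiskName=000DE37BOTDA3DATA
%vdisk: vdiskName=000DE37BOTDA4DATA

The file would then be supplied to the command as follows:
mmdelvdisk -F delvdisk.stanza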

See also

See also the following topics in GPFS Native RAID Administration and Programming Reference:
v “mmcrvdisk command” on page 56
v “mmdelpdisk command” on page 60
v “mmdelrecoverygroup command” on page 62
v “mmlsvdisk command” on page 74

Location

/usr/lpp/mmfs/bin


mmlspdisk command

Lists information for one or more GPFS Native RAID pdisks.

Synopsis

mmlspdisk {all | RecoveryGroupName [--declustered-array DeclusteredArrayName | --pdisk pdiskName]}
          [--not-in-use | --not-ok | --replace]

Description

The mmlspdisk command lists information for one or more pdisks, which can be specified in various ways.

Parameters

all | RecoveryGroupName
   Specifies the recovery group for which the pdisk information is to be listed.

   all specifies that pdisk information for all recovery groups is to be listed.

   RecoveryGroupName specifies the name of a particular recovery group for which the pdisk information is to be listed.

--declustered-array DeclusteredArrayName
   Specifies the name of a declustered array for which the pdisk information is to be listed.

--pdisk pdiskName
   Specifies the name of a single pdisk for which the information is to be listed.

--not-in-use
   Indicates that information is to be listed only for pdisks that are draining.

--not-ok
   Indicates that information is to be listed only for pdisks that are not functioning correctly.

--replace
   Indicates that information is to be listed only for pdisks in declustered arrays that are marked for replacement.

Exit status

0 Successful completion.

nonzero
   A failure has occurred.

Security

You must have root authority to run the mmlspdisk command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. For additional details, see the following topic in GPFS: Administration and Programming Reference: Requirements for administering a GPFS file system.

Examples
1. The following command example shows how to display the details regarding pdisk c112d3 in recovery group 000DE37BOT:
   mmlspdisk 000DE37BOT --pdisk c112d3

   The system displays output similar to the following:
   pdisk:
      replacementPriority = 1000
      name = "c112d3"
      device = "/dev/rhdisk142,/dev/rhdisk46"
      recoveryGroup = "000DE37BOT"
      declusteredArray = "DA3"
      state = "ok"
      freeSpace = 0
      fru = "74Y4936"
      location = "78AD.001.000DE37-C112-D3"
      WWN = "naa.5000C5001DC70D77"
      server = "c078p01.pok.ibm.com"
      reads = 750
      writes = 6450
      bytesReadInGiB = 0.735
      bytesWrittenInGiB = 6.078
      IOErrors = 0
      IOTimeouts = 0
      mediaErrors = 0
      checksumErrors = 0
      pathErrors = 0
      timeBadness = 0.000
      dataBadness = 0.000

2. To show which pdisks in recovery group 000DE37BOT need replacing:
   mmlspdisk 000DE37BOT --replace

   The system displays output similar to the following:
   pdisk:
      replacementPriority = 0.98
      name = "c052d1"
      device = "/dev/rhdisk556,/dev/rhdisk460"
      recoveryGroup = "000DE37BOT"
      declusteredArray = "DA1"
      state = "dead/systemDrain/noRGD/noVCD/replace"
      freeSpace = 373125283840
      fru = "74Y4936"
      location = "78AD.001.000DE37-C52-D1"
      WWN = "naa.5000C5001DB334CF"
      server = "c08ap01.pok.ibm.com"
      reads = 28445
      writes = 156116
      bytesReadInGiB = 13.883
      bytesWrittenInGiB = 225.826
      IOErrors = 1
      IOTimeouts = 27
      mediaErrors = 0
      checksumErrors = 0
      pathErrors = 0
      timeBadness = 28.206
      dataBadness = 0.000

   pdisk:
      replacementPriority = 0.98
      name = "c096d1"
      device = "/dev/rhdisk508,/dev/rhdisk412"
      recoveryGroup = "000DE37BOT"
      declusteredArray = "DA1"
      state = "dead/systemDrain/noRGD/noVCD/replace"
      freeSpace = 390305153024
      fru = "74Y4936"
      location = "78AD.001.000DE37-C96-D1"
      WWN = "naa.5000C5001DB45393"
      server = "c08ap01.pok.ibm.com"
      reads = 45204
      writes = 403217
      bytesReadInGiB = 29.200
      bytesWrittenInGiB = 635.833
      IOErrors = 6
      IOTimeouts = 18
      mediaErrors = 0
      checksumErrors = 0
      pathErrors = 0
      timeBadness = 33.251
      dataBadness = 0.000
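3. The following command (an illustrative sketch; any output has the same format as shown above) lists only the pdisks of declustered array DA1 in recovery group 000DE37BOT that are not functioning correctly:
   mmlspdisk 000DE37BOT --declustered-array DA1 --not-ok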

See also

See also the following topics in GPFS Native RAID Administration and Programming Reference:
v “mmchpdisk command” on page 49
v “mmdelpdisk command” on page 60
v “mmlsrecoverygroup command” on page 69
v “mmlsrecoverygroupevents command” on page 72
v “mmlsvdisk command” on page 74

Location

/usr/lpp/mmfs/bin


mmlsrecoverygroup command

Lists information about GPFS Native RAID recovery groups.

Synopsis

mmlsrecoverygroup [ RecoveryGroupName [-L [--pdisk] ] ]

Description

The mmlsrecoverygroup command lists information about recovery groups. The command displays various levels of information, depending on the parameters specified.

Parameters

RecoveryGroupName
   Specifies the recovery group for which the information is being requested. If no other parameters are specified, the command displays only the information that can be found in the GPFS cluster configuration data.

-L Displays more detailed runtime information for the specified recovery group.

--pdisk
   Indicates that pdisk information is to be listed for the specified recovery group.

Exit status

0 Successful completion.

nonzero
   A failure has occurred.

Security

You must have root authority to run the mmlsrecoverygroup command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. For additional details, see the following topic in GPFS: Administration and Programming Reference: Requirements for administering a GPFS file system.

Examples
1. The following command example shows how to list all the recovery groups in the GPFS cluster:

mmlsrecoverygroup

   The system displays output similar to the following:
                        declustered
                        arrays with
    recovery group        vdisks     vdisks  servers
    ------------------  -----------  ------  -------
    000DE37BOT                    5       9  c07ap01.pok.ibm.com,c08ap01.pok.ibm.com
    000DE37TOP                    5       9  c08ap01.pok.ibm.com,c07ap01.pok.ibm.com

2. The following command example shows how to list the basic non-runtime information for recovery group 000DE37BOT:
   mmlsrecoverygroup 000DE37BOT

   The system displays output similar to the following:
                        declustered
                        arrays with
    recovery group        vdisks     vdisks  servers
    ------------------  -----------  ------  -------
    000DE37BOT                    5       9  c08ap01.pok.ibm.com,c07ap01.pok.ibm.com

    declustered array
       with vdisks       vdisks
    ------------------   ------
    DA1                       2
    DA2                       2
    DA3                       2
    DA4                       2
    LOG                       1

                                            declustered
    vdisk               RAID code              array     remarks
    ------------------  ------------------  -----------  -------
    000DE37BOTDA1DATA   8+3p                DA1
    000DE37BOTDA1META   4WayReplication     DA1
    000DE37BOTDA2DATA   8+3p                DA2
    000DE37BOTDA2META   4WayReplication     DA2
    000DE37BOTDA3DATA   8+3p                DA3
    000DE37BOTDA3META   4WayReplication     DA3
    000DE37BOTDA4DATA   8+3p                DA4
    000DE37BOTDA4META   4WayReplication     DA4
    000DE37BOTLOG       3WayReplication     LOG          log

3. The following command example shows how to display the runtime status of recovery group 000DE37TOP:
   mmlsrecoverygroup 000DE37TOP -L

   The system displays output similar to the following:
                        declustered
    recovery group        arrays     vdisks  pdisks
    -----------------  -----------  ------  ------
    000DE37TOP                   5       9     192

    declustered  needs                            replace                 scrub      background activity
       array    service  vdisks  pdisks  spares  threshold  free space  duration     task      progress  priority
    -----------  -------  ------  ------  ------  ---------  ----------  --------  ----------  --------  --------
    DA1          no            2      47       2          2    3072 MiB   14 days  scrub             6%  low
    DA2          no            2      47       2          2    3072 MiB   14 days  scrub            42%  low
    DA3          no            2      47       2          2    2048 MiB   14 days  rebalance        39%  low
    DA4          yes           2      47       2          2         0 B   14 days  rebuild-2r        9%  low
    LOG          no            1       4       1          1     192 GiB   14 days  scrub            90%  low

                                            declustered
    vdisk               RAID code              array     vdisk size  remarks
    ------------------  ------------------  -----------  ----------  -------
    000DE37TOPLOG       3WayReplication     LOG            4144 MiB  log
    000DE37TOPDA1META   4WayReplication     DA1             250 GiB
    000DE37TOPDA1DATA   8+3p                DA1              17 TiB
    000DE37TOPDA2META   4WayReplication     DA2             250 GiB
    000DE37TOPDA2DATA   8+3p                DA2              17 TiB
    000DE37TOPDA3META   4WayReplication     DA3             250 GiB
    000DE37TOPDA3DATA   8+3p                DA3              17 TiB
    000DE37TOPDA4META   4WayReplication     DA4             250 GiB
    000DE37TOPDA4DATA   8+3p                DA4              17 TiB

    active recovery group server                     servers
    -----------------------------------------------  -------
    c07ap01.pok.ibm.com                               c07ap01.pok.ibm.com,c08ap01.pok.ibm.com

4. The following example shows how to include pdisk information for 000DE37TOP:
   mmlsrecoverygroup 000DE37TOP -L --pdisk

   The system displays output similar to the following:
                        declustered
    recovery group        arrays     vdisks  pdisks
    -----------------  -----------  ------  ------
    000DE37TOP                   5       9     192

    declustered  needs                            replace                 scrub      background activity
       array    service  vdisks  pdisks  spares  threshold  free space  duration     task      progress  priority
    -----------  -------  ------  ------  ------  ---------  ----------  --------  ----------  --------  --------
    DA1          no            2      47       2          2    3072 MiB   14 days  scrub             7%  low
    DA2          no            2      47       2          2    3072 MiB   14 days  scrub            42%  low
    DA3          no            2      47       2          2    2048 MiB   14 days  rebalance        61%  low
    DA4          yes           2      47       2          2    3072 MiB   14 days  scrub             0%  low
    LOG          no            1       4       1          1     192 GiB   14 days  scrub            90%  low

                         number of   declustered
    pdisk               active paths     array    free space  state
    -----------------   ------------  -----------  ----------  -----
    c001d1                         2  DA1              23 GiB  ok
    c001d2                         2  DA2              23 GiB  ok
    .                              .  .                     .  .
    .                              .  .                     .  .
    .                              .  .                     .  .
    c015d4                         2  DA4             558 GiB  dead/systemDrain/noRGD/noVCD/noData/replace
    .                              .  .                     .  .
    .                              .  .                     .  .
    .                              .  .                     .  .
    c088d4                         2  DA4                 0 B  ok

                                            declustered
    vdisk               RAID code              array     vdisk size  remarks
    ------------------  ------------------  -----------  ----------  -------
    000DE37TOPLOG       3WayReplication     LOG            4144 MiB  log
    000DE37TOPDA1META   4WayReplication     DA1             250 GiB
    000DE37TOPDA1DATA   8+3p                DA1              17 TiB
    000DE37TOPDA2META   4WayReplication     DA2             250 GiB
    000DE37TOPDA2DATA   8+3p                DA2              17 TiB
    000DE37TOPDA3META   4WayReplication     DA3             250 GiB
    000DE37TOPDA3DATA   8+3p                DA3              17 TiB
    000DE37TOPDA4META   4WayReplication     DA4             250 GiB
    000DE37TOPDA4DATA   8+3p                DA4              17 TiB

    active recovery group server                     servers
    -----------------------------------------------  -------
    c07ap01.pok.ibm.com                               c07ap01.pok.ibm.com,c08ap01.pok.ibm.com

See also

See also the following topics in GPFS Native RAID Administration and Programming Reference:
v “mmchrecoverygroup command” on page 51
v “mmcrrecoverygroup command” on page 53
v “mmdelrecoverygroup command” on page 62
v “mmlspdisk command” on page 66
v “mmlsrecoverygroupevents command” on page 72
v “mmlsvdisk command” on page 74

Location

/usr/lpp/mmfs/bin


mmlsrecoverygroupevents command

Displays the GPFS Native RAID recovery group event log.

Synopsis

mmlsrecoverygroupevents RecoveryGroupName [-T] [--days Days]
                        [--long-term Codes] [--short-term Codes]

Description

The mmlsrecoverygroupevents command displays the recovery group event log, internally divided into the following two logs:

short-term log
   Contains more detail than the long-term log, but due to space limitations may not extend far back in time

long-term log
   Contains only brief summaries of important events and therefore extends much further back in time

Both logs use the following severity codes:

C Commands (or configuration)

These messages record a history of commands that changed the specified recovery group.

E Errors

W Warnings

I Informational messages

D Details

By default, mmlsrecoverygroupevents displays both long-term and short-term logs merged together in order of message time stamp. Given the --long-term option, it displays only the requested severities from the long-term log. Given the --short-term option, it displays only the requested severities from the short-term log. Given both --long-term and --short-term options, it displays the requested severities from each log, merged by time stamp.

Note: The recovery group must be active to run this command.

Parameters

RecoveryGroupName
   Specifies the name of the recovery group for which the event log is to be displayed.

-T Indicates that the time is to be shown in decimal format.

--days Days
   Specifies the number of days for which events are to be displayed.

For example, --days 3 specifies that only the events of the last three days are to be displayed.

--long-term Codes
   Specifies that only the indicated severity or severities from the long-term log are to be displayed. You can specify any combination of the severity codes listed in “Description.”

For example, --long-term EW specifies that only errors and warnings are to be displayed.

72 GPFS Native RAID Administration and Programming Reference

Page 85: A 2313540

--short-term Codes
   Specifies that only the indicated severity or severities from the short-term log are to be displayed. You can specify any combination of the severity codes listed in “Description” on page 72.

For example, --short-term EW specifies that only errors and warnings are to be displayed.

Exit status

0 Successful completion.

nonzero
   A failure has occurred.

Security

You must have root authority to run the mmlsrecoverygroupevents command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. For additional details, see the following topic in GPFS: Administration and Programming Reference: Requirements for administering a GPFS file system.

Examples

The following command example shows how to print the event logs of recovery group 000DE37BOT:
mmlsrecoverygroupevents 000DE37BOT

The system displays output similar to the following:
Mon May 23 12:17:36.916 2011 c08ap01 ST [I] Start scrubbing tracks of 000DE37BOTDA4META.
Mon May 23 12:17:36.914 2011 c08ap01 ST [I] Finish rebalance of DA DA4 in RG 000DE37BOT.
Mon May 23 12:13:00.033 2011 c08ap01 ST [D] Pdisk c109d4 of RG 000DE37BOT state changed from noRGD to ok.
Mon May 23 12:13:00.010 2011 c08ap01 ST [D] Pdisk c109d4 of RG 000DE37BOT state changed from noRGD/noVCD to noRGD.
Mon May 23 12:11:29.676 2011 c08ap01 ST [D] Pdisk c109d4 of RG 000DE37BOT state changed from noRGD/noVCD/noData to noRGD/noVCD.
Mon May 23 12:11:29.672 2011 c08ap01 ST [I] Start rebalance of DA DA4 in RG 000DE37BOT.
Mon May 23 12:11:29.469 2011 c08ap01 ST [I] Finished repairing metadata in RG 000DE37BOT.
Mon May 23 12:11:29.409 2011 c08ap01 ST [I] Start repairing metadata in RG 000DE37BOT.
Mon May 23 12:11:29.404 2011 c08ap01 ST [I] Abort scrubbing tracks of 000DE37BOTDA4META.
Mon May 23 12:11:29.404 2011 c08ap01 ST [D] Pdisk c109d4 of RG 000DE37BOT state changed from missing/systemDrain/noRGD/noVCD/noData/noPath to noRGD/noVCD/noData.
Mon May 23 12:11:29.401 2011 c08ap01 ST [D] Pdisk c109d4 of RG 000DE37BOT: path index 0 (/dev/rhdisk131): up.
Mon May 23 12:11:29.393 2011 c08ap01 ST [I] Path /dev/rhdisk131 of pdisk c109d4 reenabled.
Mon May 23 12:09:49.004 2011 c08ap01 ST [I] Start scrubbing tracks of 000DE37BOTDA4META.
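The severity and time-range options can be combined to limit the display; for example, the following command (an illustrative sketch built from the option descriptions above) would show only errors and warnings from the long-term log for the last three days:
mmlsrecoverygroupevents 000DE37BOT --days 3 --long-term EW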

See also

See also the following topics in GPFS Native RAID Administration and Programming Reference:
v “mmaddpdisk command” on page 44
v “mmchrecoverygroup command” on page 51
v “mmchcarrier command” on page 46
v “mmdelpdisk command” on page 60
v “mmdelvdisk command” on page 64
v “mmlspdisk command” on page 66
v “mmlsrecoverygroup command” on page 69
v “mmlsvdisk command” on page 74

Location

/usr/lpp/mmfs/bin


mmlsvdisk command

Lists information for one or more GPFS Native RAID vdisks.

Synopsis

mmlsvdisk [--vdisk "VdiskName[;VdiskName...]" | --non-nsd]

or

mmlsvdisk --recovery-group RecoveryGroupName
          [--declustered-array DeclusteredArrayName]

Description

The mmlsvdisk command lists information for one or more vdisks, which can be specified in various ways. Unless the --recovery-group option is specified, the information comes from the GPFS cluster configuration data.

Parameters

--vdisk VdiskName[;VdiskName...]
   Specifies the name or names of the vdisk or vdisks for which the information is to be listed.

--non-nsd
   Indicates that information is to be listed for the vdisks that are not associated with NSDs.

--recovery-group RecoveryGroupName
   Specifies the name of the recovery group.

   Note: The specified recovery group must be active to run this command.

--declustered-array DeclusteredArrayName
   Specifies the name of the declustered array.

Exit status

0 Successful completion.

nonzero
   A failure has occurred.

Security

You must have root authority to run the mmlsvdisk command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. For additional details, see the following topic in GPFS: Administration and Programming Reference: Requirements for administering a GPFS file system.

Examples
1. The following command example shows how to list all vdisks in the GPFS cluster:

mmlsvdisk

   The system displays output similar to the following:
                                                          declustered  block size
    vdisk name          RAID code        recovery group       array       in KiB   remarks
    ------------------  ---------------  ------------------  -----------  ----------  -------
    000DE37BOTDA1DATA   8+3p             000DE37BOT          DA1                8192
    000DE37BOTDA1META   4WayReplication  000DE37BOT          DA1                1024
    000DE37BOTDA2DATA   8+3p             000DE37BOT          DA2                8192
    000DE37BOTDA2META   4WayReplication  000DE37BOT          DA2                1024
    000DE37BOTDA3DATA   8+3p             000DE37BOT          DA3                8192
    000DE37BOTDA3META   4WayReplication  000DE37BOT          DA3                1024
    000DE37BOTDA4DATA   8+3p             000DE37BOT          DA4                8192
    000DE37BOTDA4META   4WayReplication  000DE37BOT          DA4                1024
    000DE37BOTLOG       3WayReplication  000DE37BOT          LOG                1024  log
    000DE37TOPDA1DATA   8+3p             000DE37TOP          DA1                8192
    000DE37TOPDA1META   4WayReplication  000DE37TOP          DA1                1024
    000DE37TOPDA2DATA   8+3p             000DE37TOP          DA2                8192
    000DE37TOPDA2META   4WayReplication  000DE37TOP          DA2                1024
    000DE37TOPDA3DATA   8+3p             000DE37TOP          DA3                8192
    000DE37TOPDA3META   4WayReplication  000DE37TOP          DA3                1024
    000DE37TOPDA4DATA   8+3p             000DE37TOP          DA4                8192
    000DE37TOPDA4META   4WayReplication  000DE37TOP          DA4                1024
    000DE37TOPLOG       3WayReplication  000DE37TOP          LOG                1024  log

2. The following command example shows how to list only those vdisks in the cluster that do not have NSDs defined on them:
   # mmlsvdisk --non-nsd

   The system displays output similar to the following:
                                                          declustered  block size
    vdisk name          RAID code        recovery group       array       in KiB   remarks
    ------------------  ---------------  ------------------  -----------  ----------  -------
    000DE37BOTLOG       3WayReplication  000DE37BOT          LOG                1024  log
    000DE37TOPLOG       3WayReplication  000DE37TOP          LOG                1024  log

3. The following command example shows how to see complete information about the vdisks in declustered array DA1 of recovery group 000DE37TOP:
   mmlsvdisk --recovery-group 000DE37TOP --declustered-array DA1

   The system displays output similar to the following:
   vdisk:
      name = "000DE37TOPDA1META"
      raidCode = "4WayReplication"
      recoveryGroup = "000DE37TOP"
      declusteredArray = "DA1"
      blockSizeInKib = 1024
      size = "250 GiB"
      state = "ok"
      remarks = ""

   vdisk:
      name = "000DE37TOPDA1DATA"
      raidCode = "8+3p"
      recoveryGroup = "000DE37TOP"
      declusteredArray = "DA1"
      blockSizeInKib = 16384
      size = "17 TiB"
      state = "ok"
      remarks = ""

See also

See also the following topics in GPFS Native RAID Administration and Programming Reference:
v “mmcrvdisk command” on page 56
v “mmdelvdisk command” on page 64
v “mmlspdisk command” on page 66
v “mmlsrecoverygroup command” on page 69
v “mmlsrecoverygroupevents command” on page 72

Location

/usr/lpp/mmfs/bin


Chapter 5. Other GPFS commands related to GPFS Native RAID

The following table summarizes other GPFS commands that are related to GPFS Native RAID.

Table 8. Other GPFS commands related to GPFS Native RAID

Command Purpose

“mmaddcallback command” on page 78    Registers a user-defined command that GPFS will execute when certain events occur.

“mmchconfig command” on page 86 Changes GPFS configuration parameters.

“mmcrfs command” on page 95 Creates a GPFS file system.

“mmexportfs command” on page 102    Retrieves the information needed to move a file system to a different cluster.

“mmimportfs command” on page 104    Imports into the cluster one or more file systems that were created in another GPFS cluster.

“mmpmon command” on page 107    Manages performance monitoring and displays performance information.


mmaddcallback command

Registers a user-defined command that GPFS will execute when certain events occur.

Synopsis

mmaddcallback CallbackIdentifier --command CommandPathname
              --event Event[,Event...] [--priority Value]
              [--async | --sync [--timeout Seconds] [--onerror Action]]
              [-N {Node[,Node...] | NodeFile | NodeClass}]
              [--parms ParameterString ...]

Or,

mmaddcallback {-S Filename | --spec-file Filename}

Description

Use the mmaddcallback command to register a user-defined command that GPFS will execute when certain events occur.

The callback mechanism is primarily intended for notification purposes. Invoking complex or long-running commands, or commands that involve GPFS files, may cause unexpected and undesired results, including loss of file system availability. This is particularly true when the --sync option is specified.

Parameters

CallbackIdentifier
   Specifies a user-defined unique name that identifies the callback. It can be up to 255 characters long. It cannot contain special characters (for example, a colon, semicolon, blank, tab, or comma) and it cannot start with the letters gpfs or mm (which are reserved for GPFS internally defined callbacks).

--command CommandPathname
   Specifies the full path name of the executable to run when the desired event occurs.

   The executable that will be called by the mmaddcallback facility should be installed on all nodes on which the callback can be triggered. Place the executable in a local file system (not in a GPFS file system) so that it is accessible even when the networks fail.

--event Event[,Event...]
   Specifies a list of events that trigger the callback. The value defines when the callback will be invoked. There are two kinds of events: global events and local events. A global event is an event that will trigger a callback on all nodes in the cluster, such as a nodeLeave event, which informs all nodes in the cluster that a node has failed. A local event is an event that will trigger a callback only on the node on which the event occurred, such as mounting a file system on one of the nodes. The following is a list of supported global and local events.

Global events include:

nodeJoin
   Triggered when one or more nodes join the cluster.

nodeLeave
   Triggered when one or more nodes leave the cluster.

quorumReached
   Triggered when a quorum has been established in the GPFS cluster.

quorumLoss
   Triggered when a quorum has been lost in the GPFS cluster.


quorumNodeJoin
   Triggered when one or more quorum nodes join the cluster.

quorumNodeLeave
   Triggered when one or more quorum nodes leave the cluster.

clusterManagerTakeover
   Triggered when a new cluster manager node has been elected. This happens when a cluster first starts up or when the current cluster manager fails or resigns and a new node takes over as cluster manager.

Local events include:

lowDiskSpace
   Triggered when the file system manager detects that disk space is running below the low threshold that is specified in the current policy rule.

noDiskSpace
   Triggered when the file system manager detects that a disk ran out of space.

softQuotaExceeded
   Triggered when the file system manager detects that a user or fileset quota has been exceeded.

preMount, preUnmount, mount, unmount
   Specifies that these events will be triggered when a file system is about to be mounted or unmounted or has been mounted or unmounted successfully. These events will be generated for explicit mount or unmount commands, a remount after GPFS recovery and a forced unmount when GPFS panics and shuts down.

preStartup
   Triggered after the GPFS daemon completes its internal initialization and joins the cluster, but before the node runs recovery for any VFS mount points that were already mounted, and before the node starts accepting user initiated sessions.

startup
   Triggered after a successful GPFS startup and when the node is ready for user initiated sessions.

preShutdown
   Triggered when GPFS detects a failure and is about to shut down.

shutdown
   Triggered when GPFS completed the shutdown.

Local events for GPFS Native RAID include the following callbacks. For more information, see the topic about GPFS Native RAID callbacks in GPFS Native RAID Administration and Programming.

preRGTakeover
   The preRGTakeover callback is invoked on a recovery group server prior to attempting to open and serve recovery groups. The rgName parameter may be passed into the callback as the keyword value _ALL_, indicating that the recovery group server is about to open multiple recovery groups; this is typically at server startup, and the parameter rgCount will be equal to the number of recovery groups being processed. Additionally, the callback will be invoked with the rgName of each individual recovery group and an rgCount of 1 whenever the server checks to determine whether it should open and serve recovery group rgName.

   The following parameters are available to this callback: %myNode, %rgName, %rgErr, %rgCount, and %rgReason.

postRGTakeover
   The postRGTakeover callback is invoked on a recovery group server after it has checked, attempted, or begun to serve a recovery group. If multiple recovery groups have been taken over, the callback will be invoked with rgName keyword _ALL_ and an rgCount equal to the total number of involved recovery groups. The callback will also be triggered for each individual recovery group.

   The following parameters are available to this callback: %myNode, %rgName, %rgErr, %rgCount, and %rgReason.

preRGRelinquish
   The preRGRelinquish callback is invoked on a recovery group server prior to relinquishing service of recovery groups. The rgName parameter may be passed into the callback as the keyword value _ALL_, indicating that the recovery group server is about to relinquish service for all recovery groups it is serving; the rgCount parameter will be equal to the number of recovery groups being relinquished. Additionally, the callback will be invoked with the rgName of each individual recovery group and an rgCount of 1 whenever the server relinquishes serving recovery group rgName.

   The following parameters are available to this callback: %myNode, %rgName, %rgErr, %rgCount, and %rgReason.

postRGRelinquish
   The postRGRelinquish callback is invoked on a recovery group server after it has relinquished serving recovery groups. If multiple recovery groups have been relinquished, the callback will be invoked with rgName keyword _ALL_ and an rgCount equal to the total number of involved recovery groups. The callback will also be triggered for each individual recovery group.

   The following parameters are available to this callback: %myNode, %rgName, %rgErr, %rgCount, and %rgReason.

rgOpenFailed
   The rgOpenFailed callback will be invoked on a recovery group server when it fails to open a recovery group that it is attempting to serve. This may be due to loss of connectivity to some or all of the disks in the recovery group; the rgReason string will indicate why the recovery group could not be opened.

   The following parameters are available to this callback: %myNode, %rgName, %rgErr, and %rgReason.

rgPanic
   The rgPanic callback will be invoked on a recovery group server when it is no longer able to continue serving a recovery group. This may be due to loss of connectivity to some or all of the disks in the recovery group; the rgReason string will indicate why the recovery group can no longer be served.

   The following parameters are available to this callback: %myNode, %rgName, %rgErr, and %rgReason.

pdFailed
   The pdFailed callback is generated whenever a pdisk in a recovery group is marked as dead, missing, failed, or readonly.

   The following parameters are available to this callback: %myNode, %rgName, %daName, %pdName, %pdLocation, %pdFru, %pdWwn, and %pdState.

pdRecovered
   The pdRecovered callback is generated whenever a missing pdisk is rediscovered.

   The following parameters are available to this callback: %myNode, %rgName, %daName, %pdName, %pdLocation, %pdFru, and %pdWwn.

pdReplacePdisk
   The pdReplacePdisk callback is generated whenever a pdisk is marked for replacement according to the replace threshold setting of the declustered array in which it resides.


   The following parameters are available to this callback: %myNode, %rgName, %daName, %pdName, %pdLocation, %pdFru, %pdWwn, %pdState, and %pdPriority.

pdPathDown
   The pdPathDown callback is generated whenever one of the block device paths to a pdisk disappears or becomes inoperative. The occurrence of this event can indicate connectivity problems with the JBOD array in which the pdisk resides.

   The following parameters are available to this callback: %myNode, %rgName, %daName, %pdName, %pdPath, %pdLocation, %pdFru, and %pdWwn.

daRebuildFailed
   The daRebuildFailed callback is generated when the spare space in a declustered array has been exhausted, and vdisk tracks involving damaged pdisks can no longer be rebuilt. The occurrence of this event indicates that fault tolerance in the declustered array has become degraded and that disk maintenance should be performed immediately. The daRemainingRedundancy parameter indicates how much fault tolerance remains in the declustered array.

   The following parameters are available to this callback: %myNode, %rgName, %daName, %daRemainingRedundancy.

nsdCksumMismatch
   The nsdCksumMismatch callback is generated whenever transmission of vdisk data by the NSD network layer fails to verify the data checksum. This can indicate problems in the network between the GPFS client node and a recovery group server. The first error between a given client and server generates the callback; subsequent callbacks are generated for each ckReportingInterval occurrence.

   The following parameters are available to this callback: %myNode, %ckRole, %ckOtherNode, %ckNSD, %ckReason, %ckStartSector, %ckDataLen, %ckErrorCountClient, %ckErrorCountNSD, and %ckReportingInterval.

--priority Value
   Specifies a floating point number that controls the order in which callbacks for a given event will be run. Callbacks with a smaller numerical value will be run before callbacks with a larger numerical value. Callbacks that do not have an assigned priority will be run last. If two callbacks have the same priority, the order in which they will be run is undetermined.

--async | --sync [--timeout Seconds] [--onerror Action]
   Specifies whether GPFS will wait for the user program to complete and for how long it will wait. The default is --async (GPFS invokes the command asynchronously). Action specifies the action GPFS will take if the callback command returns a nonzero error code. Action can either be shutdown or continue. The default is continue.

-N {Node[,Node...] | NodeFile | NodeClass}
   Allows restricting the set of nodes on which the callback will be invoked. For global events, the callback will only be invoked on the specified set of nodes. For local events, the callback will only be invoked if the node on which the event occurred is one of the nodes specified by the -N option. The default is -N all.

   This command does not support a NodeClass of mount.

--parms ParameterString ...
   Specifies parameters to be passed to the executable specified with the --command parameter. The --parms parameter can be specified multiple times.

   When the callback is invoked, the combined parameter string is tokenized on white-space boundaries. Constructs of the form %name and %name.qualifier are assumed to be GPFS variables and are replaced with their appropriate values at the time of the event. If a variable does not have a value in the context of a particular event, the string UNDEFINED is returned instead.

GPFS recognizes the following variables:


%blockLimit
   Specifies the current hard quota limit in KB.

%blockQuota
   Specifies the current soft quota limit in KB.

%blockUsage
   Specifies the current usage in KB for quota-related events.

%clusterManager[.qualifier]
   Specifies the current cluster manager node.

%clusterName
   Specifies the name of the cluster where this callback was triggered.

%downNodes[.qualifier]
   Specifies a comma-separated list of nodes that are currently down.

%eventName[.qualifier]
   Specifies the name of the event that triggered this callback.

%eventNode[.qualifier]
   Specifies a node or comma-separated list of nodes on which this callback is triggered.

%filesLimit
   Specifies the current hard quota limit for the number of files.

%filesQuota
   Specifies the current soft quota limit for the number of files.

%filesUsage
   Specifies the current number of files for quota-related events.

%filesetName
   Specifies the name of a fileset for which the callback is being executed.

%filesetSize
   Specifies the size of the fileset.

%fsName
   Specifies the file system name for file system events.

%myNode
   Specifies the node where the callback script is invoked.

%quorumNodes[.qualifier]
   Specifies a comma-separated list of quorum nodes.

%quotaID
   Specifies the numerical ID of the quota owner (UID, GID, or fileset ID).

%quotaOwnerName
   Specifies the name of the quota owner (user name, group name, or fileset name).

%quotaType
   Specifies the type of quota for quota-related events. Possible values are USR, GRP, or FILESET.

%reason
   Specifies the reason for triggering the event. For the preUnmount and unmount events, the possible values are normal and forced. For the preShutdown and shutdown events, the possible values are normal and abnormal. For all other events, the value is UNDEFINED.

%storagePool
   Specifies the storage pool name for space-related events.


%upNodes[.qualifier]
   Specifies a comma-separated list of nodes that are currently up.

%userName
   Specifies the user name.

GPFS Native RAID recognizes the following variables (in addition to the %myNode variable). For more information, see the topic about GPFS Native RAID callbacks in GPFS Native RAID Administration and Programming.

%ckDataLen
   The length of data involved in a checksum mismatch.

%ckErrorCountClient
   The cumulative number of errors for the client side in a checksum mismatch.

%ckErrorCountServer
   The cumulative number of errors for the server side in a checksum mismatch.

%ckErrorCountNSD
   The cumulative number of errors for the NSD side in a checksum mismatch.

%ckNSD
   The NSD involved.

%ckOtherNode
   The IP address of the other node in an NSD checksum event.

%ckReason
   The reason string indicating why a checksum mismatch callback was invoked.

%ckReportingInterval
   The error-reporting interval in effect at the time of a checksum mismatch.

%ckRole
   The role (client or server) of a GPFS node.

%ckStartSector
   The starting sector of a checksum mismatch.

%daName
   The name of the declustered array involved.

%daRemainingRedundancy
   The remaining fault tolerance in a declustered array.

%pdFru
   The FRU (field replaceable unit) number of the pdisk.

%pdLocation
   The physical location code of a pdisk.

%pdName
   The name of the pdisk involved.

%pdPath
   The block device path of the pdisk.

%pdPriority
   The replacement priority of the pdisk.

%pdState
   The state of the pdisk involved.

%pdWwn
   The worldwide name of the pdisk.


%rgCount
   The number of recovery groups involved.

%rgErr
   A code from a recovery group, where 0 indicates no error.

%rgName
   The name of the recovery group involved.

%rgReason
   The reason string indicating why a recovery group callback was invoked.

Variables that represent node identifiers accept an optional qualifier that can be used to specify how the nodes are to be identified. When specifying one of these optional qualifiers, separate it from the variable with a period, as shown here:
variable.qualifier

The value for qualifier can be one of the following:

ip Specifies that GPFS should use the nodes' IP addresses.

name
   Specifies that GPFS should use fully-qualified node names. This is the default.

shortName
   Specifies that GPFS should strip the domain part of the node names.
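For example, a parameter string such as the following (illustrative only) would pass the name of the triggering event and the short host names of any nodes that are currently down to the callback executable:
--parms "%eventName %downNodes.shortName"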

Options

-S Filename | --spec-file Filename
   Specifies a file with multiple callback definitions, one per line. The first token on each line must be the callback identifier.

Exit status

0 Successful completion.

nonzero
   A failure has occurred.

Security

You must have root authority to run the mmaddcallback command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages.

Examples
1. To register command /tmp/myScript to run after GPFS startup, issue this command:

mmaddcallback test1 --command=/tmp/myScript --event startup

   The system displays information similar to:
   mmaddcallback: Propagating the cluster configuration data to all
     affected nodes. This is an asynchronous process.

2. To register a callback that NFS exports or unexports a particular file system after it has been mounted or before it has been unmounted, issue this command:
   mmaddcallback NFSexport --command /usr/local/bin/NFSexport --event mount,preUnmount -N c26f3rp01 --parms "%eventName %fsName"

The system displays information similar to:


   mmaddcallback: 6027-1371 Propagating the cluster configuration data to all
     affected nodes. This is an asynchronous process.
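3. A GPFS Native RAID example (an illustrative sketch only; the notification script /usr/local/bin/gnrNotify is a hypothetical placeholder): to run a script whenever a pdisk is marked for replacement, a callback for the pdReplacePdisk event could be registered with the parameters documented for that event:
   mmaddcallback gnrPdiskReplace --command /usr/local/bin/gnrNotify --event pdReplacePdisk --parms "%rgName %daName %pdName %pdLocation %pdPriority"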

See also

See also the following topics in GPFS: Administration and Programming Reference:

mmdelcallback command

mmlscallback command

Location

/usr/lpp/mmfs/bin


mmchconfig command

Changes GPFS configuration parameters.

Synopsis

mmchconfig Attribute=value[,Attribute=value...] [-i | -I] [-N {Node[,Node...] | NodeFile | NodeClass}]

Description

Use the mmchconfig command to change the GPFS configuration attributes on a single node, a set of nodes, or globally for the entire cluster.

The Attribute=value options must come before any operand.

When changing both maxblocksize and pagepool, the command fails unless these conventions are followed:
v When increasing the values, pagepool must be specified first.
v When decreasing the values, maxblocksize must be specified first.

Results

The configuration is updated on each node in the GPFS cluster.

Parameters

-N {Node[,Node...] | NodeFile | NodeClass}
   Specifies the set of nodes to which the configuration changes apply.

This command does not support a NodeClass of mount.

Options

-I Specifies that the changes take effect immediately, but do not persist when GPFS is restarted. This option is valid only for the dataStructureDump, dmapiEventTimeout, dmapiMountTimeout, dmapiSessionFailureTimeout, maxMBpS, nsdBufSpace, pagepool, unmountOnDiskFail, and verbsRdma attributes.

-i Specifies that the changes take effect immediately and are permanent. This option is valid only for the dataStructureDump, dmapiEventTimeout, dmapiMountTimeout, dmapiSessionFailureTimeout, maxMBpS, nsdBufSpace, pagepool, unmountOnDiskFail, and verbsRdma attributes.

Attribute=value
   Specifies the name of the attribute to be changed and its associated value. More than one attribute and value pair, in a comma-separated list, can be changed with one invocation of the command.

To restore the GPFS default setting for any given attribute, specify DEFAULT as its value.

adminMode
   Specifies whether all nodes in the cluster will be used for issuing GPFS administration commands or just a subset of the nodes. Valid values are:

   allToAll
      Indicates that all nodes in the cluster will be used for running GPFS administration commands and that all nodes are able to execute remote commands on any other node in the cluster without the need of a password.

   central
      Indicates that only a subset of the nodes will be used for running GPFS commands and that only those nodes will be able to execute remote commands on the rest of the nodes in the cluster without the need of a password.


autoload
   Starts GPFS automatically whenever the nodes are rebooted. Valid values are yes or no.

automountDir
   Specifies the directory to be used by the Linux automounter for GPFS file systems that are being mounted automatically. The default directory is /gpfs/automountdir. This parameter does not apply to AIX and Windows environments.

The -N flag is valid for this attribute.

cipherList
   Controls whether GPFS network communications are secured. If cipherList is not specified, or if the value DEFAULT is specified, GPFS does not authenticate or check authorization for network connections. If the value AUTHONLY is specified, GPFS does authenticate and check authorization for network connections, but data sent over the connection is not protected. Before setting cipherList for the first time, you must establish a public/private key pair for the cluster by using the mmauth genkey new command.

   GPFS must be down on all the nodes if you are switching from a non-secure environment to a secure environment and vice versa.

   See the Frequently Asked Questions (http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.html) for a list of the ciphers supported by GPFS.

cnfsMountdPort
   Specifies the port number to be used for rpc.mountd. See General Parallel File System: Advanced Administration Guide for restrictions and additional information.

cnfsNFSDprocs
   Specifies the number of nfsd kernel threads. The default is 32.

cnfsSharedRoot
   Specifies a directory in a GPFS file system to be used by the clustered NFS subsystem.

GPFS must be down on all the nodes in the cluster when changing the cnfsSharedRoot attribute.

The -N flag is valid for this attribute.

See the General Parallel File System: Advanced Administration Guide for restrictions and additional information.

cnfsVIP
   Specifies a virtual DNS name for the list of CNFS IP addresses assigned to the nodes with the mmchnode command. This allows NFS clients to be distributed among the CNFS nodes using round-robin DNS. For additional information, see General Parallel File System: Advanced Administration Guide.

dataStructureDump
   Specifies a path for the storage of dumps. The default is to store dumps in /tmp/mmfs. Specify no to not store dumps.

   It is suggested that you create a directory for the placement of certain problem determination information. This can be a symbolic link to another location if more space can be found there. Do not place it in a GPFS file system, because it might not be available if GPFS fails. If a problem occurs, GPFS may write 200 MB or more of problem determination data into the directory. These files must be manually removed when problem determination is complete. This should be done promptly so that a NOSPACE condition is not encountered if another failure occurs.

   The -N flag is valid for this attribute.

defaultHelperNodes {Node[,Node...] | NodeFile | NodeClass}
   Overrides the default behavior for the -N option on commands that use -N to identify a set of nodes to do work, but if the -N option is explicitly specified on such commands, it takes precedence.


defaultMountDir
   Specifies the default parent directory for GPFS file systems. The default value is /gpfs. If an explicit mount directory is not provided with the mmcrfs, mmchfs, or mmremotefs command, the default mount point will be set to DefaultMountDir/DeviceName.

dmapiDataEventRetry
   Controls how GPFS handles data events that are enabled again immediately after the event is handled by the DMAPI application. Valid values are:

   -1 Specifies that GPFS will always regenerate the event as long as it is enabled. This value should only be used when the DMAPI application recalls and migrates the same file in parallel by many processes at the same time.

   0  Specifies to never regenerate the event. This value should not be used if a file could be migrated and recalled at the same time.

   RetryCount
      Specifies the number of times the data event should be retried. The default is 2.

   For further information regarding DMAPI for GPFS, see the General Parallel File System: Data Management API Guide.

dmapiEventTimeout
   Controls the blocking of file operation threads of NFS, while in the kernel waiting for the handling of a DMAPI synchronous event. The parameter value is the maximum time, in milliseconds, the thread will block. When this time expires, the file operation returns ENOTREADY, and the event continues asynchronously. The NFS server is expected to repeatedly retry the operation, which eventually will find the response of the original event and continue. This mechanism applies only to read, write, and truncate event types, and only when such events come from NFS server threads. The timeout value is given in milliseconds. The value 0 indicates immediate timeout (fully asynchronous event). A value greater than or equal to 86400000 (which is 24 hours) is considered infinity (no timeout, fully synchronous event). The default value is 86400000.

   For further information regarding DMAPI for GPFS, see the General Parallel File System: Data Management API Guide.

The -N flag is valid for this attribute.

dmapiMountEvent
   Controls the generation of the mount, preunmount, and unmount events. Valid values are:

   all
      mount, preunmount, and unmount events are generated on each node. This is the default behavior.

   SessionNode
      mount, preunmount, and unmount events are generated on each node and are delivered to the session node, but the session node will not deliver the event to the DMAPI application unless the event is originated from the SessionNode itself.

   LocalNode
      mount, preunmount, and unmount events are generated only if the node is a session node.

   The -N flag is valid for this attribute.

   For further information regarding DMAPI for GPFS, see the General Parallel File System: Data Management API Guide.

dmapiMountTimeout
   Controls the blocking of mount operations, waiting for a disposition for the mount event to be set. This timeout is activated, at most once on each node, by the first external mount of a file system that has DMAPI enabled, and only if there has never before been a mount disposition. Any mount operation on this node that starts while the timeout period is active will wait for the mount disposition. The parameter value is the maximum time, in seconds, that the mount operation will wait for a disposition. When this time expires and there is still no disposition for the mount event, the mount operation fails, returning the EIO error. The timeout value is given in full seconds. The value 0 indicates immediate timeout (immediate failure of the mount operation). A value greater than or equal to 86400 (which is 24 hours) is considered infinity (no timeout, indefinite blocking until there is a disposition). The default value is 60.

   The -N flag is valid for this attribute.

   For further information regarding DMAPI for GPFS, see the General Parallel File System: Data Management API Guide.

dmapiSessionFailureTimeout
   Controls the blocking of file operation threads, while in the kernel, waiting for the handling of a DMAPI synchronous event that is enqueued on a session that has experienced a failure. The parameter value is the maximum time, in seconds, the thread will wait for the recovery of the failed session. When this time expires and the session has not yet recovered, the event is cancelled and the file operation fails, returning the EIO error. The timeout value is given in full seconds. The value 0 indicates immediate timeout (immediate failure of the file operation). A value greater than or equal to 86400 (which is 24 hours) is considered infinity (no timeout, indefinite blocking until the session recovers). The default value is 0.

   For further information regarding DMAPI for GPFS, see the General Parallel File System: Data Management API Guide.

The -N flag is valid for this attribute.

failureDetectionTime
   Indicates to GPFS the amount of time it will take to detect that a node has failed.

GPFS must be down on all the nodes when changing the failureDetectionTime attribute.

maxblocksize
   Changes the maximum file system block size. Valid values are 64 KiB, 256 KiB, 512 KiB, 1 MiB, 2 MiB, 4 MiB, 8 MiB (for GPFS Native RAID only), and 16 MiB (for GPFS Native RAID only). The default value is 1 MiB. Specify this value with the character K or M, for example 512K.

   File systems with block sizes larger than the specified value cannot be created or mounted unless the block size is increased.

GPFS must be down on all the nodes when changing the maxblocksize attribute.

The -N flag is valid for this attribute.
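For example, to allow the 8 MiB file system block sizes supported only with GPFS Native RAID, a command of the following form could be used once GPFS is down on all nodes (an illustrative sketch only):
mmchconfig maxblocksize=8M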

maxFcntlRangesPerFile
   Specifies the number of fcntl locks that are allowed per file. The default is 200. The minimum value is 10 and the maximum value is 200000.

maxFilesToCache
   Specifies the number of inodes to cache for recently used files that have been closed.

   Storing a file's inode in cache permits faster re-access to the file. The default is 1000, but increasing this number may improve throughput for workloads with high file reuse. However, increasing this number excessively may cause paging at the file system manager node. The value should be large enough to handle the number of concurrently open files plus allow caching of recently used files.

The -N flag is valid for this attribute.

maxMBpS
   Specifies an estimate of how many megabytes of data can be transferred per second into or out of a single node. The default is 150 MB per second. The value is used in calculating the amount of I/O that can be done to effectively prefetch data for readers and write-behind data from writers. By lowering this value, you can artificially limit how much I/O one node can put on all of the disk servers.

The -N flag is valid for this attribute.

maxStatCache
   Specifies the number of inodes to keep in the stat cache. The stat cache maintains only enough inode information to perform a query on the file system. The default value is:

4 × maxFilesToCache

The -N flag is valid for this attribute.

mmapRangeLock
   Specifies POSIX or non-POSIX mmap byte-range semantics. Valid values are yes or no (yes is the default). A value of yes indicates POSIX byte-range semantics apply to mmap operations. A value of no indicates non-POSIX mmap byte-range semantics apply to mmap operations.

   If using InterProcedural Analysis (IPA), turn this option off:
   mmchconfig mmapRangeLock=no -i

   This will allow more lenient intranode locking, but impose internode whole file range tokens on files using mmap while writing.

nsdBufSpace
   This option specifies the percentage of the pagepool reserved for the network transfer of NSD requests. Valid values are within the range of 10 to 70. The default value is 30. On GPFS Native RAID recovery group NSD servers, this value should be decreased to its minimum of 10, since vdisk-based NSDs are served directly from the RAID buffer pool (as governed by nsdRAIDBufferPoolSizePct). On all other NSD servers, increasing either this value or the amount of pagepool, or both, could improve NSD server performance. On NSD client-only nodes, this parameter is ignored.

The -N flag is valid for this attribute.

nsdRAIDtracks
   This option specifies the number of tracks in the GPFS Native RAID buffer pool, or 0 if this node does not have a GPFS Native RAID vdisk buffer pool. This controls whether GPFS Native RAID services are configured.

Valid values are: 0; 256 or greater.

The -N flag is valid for this attribute.

nsdRAIDBufferPoolSizePct
   This option specifies the percentage of the page pool that is used for the GPFS Native RAID vdisk buffer pool. Valid values are within the range of 10 to 90. The default is 50 when GPFS Native RAID is configured on the node in question; 0 when it is not.

The -N flag is valid for this attribute.
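As an illustration of how these related attributes might be set together on GPFS Native RAID recovery group server nodes, a command of the following form could be used (a sketch only; the pagepool, nsdRAIDtracks, and nsdRAIDBufferPoolSizePct values are placeholders that must be sized for the actual servers, and the node names are taken from the examples in this document):
mmchconfig pagepool=16G,nsdRAIDtracks=131072,nsdRAIDBufferPoolSizePct=80,nsdBufSpace=10 -N c07ap01.pok.ibm.com,c08ap01.pok.ibm.com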

nsdServerWaitTimeForMount
   When mounting a file system whose disks depend on NSD servers, this option specifies the number of seconds to wait for those servers to come up. The decision to wait is controlled by the criteria managed by the nsdServerWaitTimeWindowOnMount option.

   Valid values are between 0 and 1200 seconds. The default is 300. A value of zero indicates that no waiting is done. The interval for checking is 10 seconds. If nsdServerWaitTimeForMount is 0, nsdServerWaitTimeWindowOnMount has no effect.

   The mount thread waits when the daemon delays for safe recovery. The mount wait for NSD servers to come up, which is covered by this option, occurs after expiration of the recovery wait allows the mount thread to proceed.


The -N flag is valid for this attribute.

nsdServerWaitTimeWindowOnMount
   Specifies a window of time (in seconds) during which a mount can wait for NSD servers as described for the nsdServerWaitTimeForMount option. The window begins when quorum is established (at cluster startup or subsequently), or at the last known failure times of the NSD servers required to perform the mount.

   Valid values are between 1 and 1200 seconds. The default is 600. If nsdServerWaitTimeForMount is 0, nsdServerWaitTimeWindowOnMount has no effect.

   The -N flag is valid for this attribute.

   When a node rejoins the cluster after having been removed for any reason, the node resets all the failure time values that it knows about. Therefore, when a node rejoins the cluster it believes that the NSD servers have not failed. From the node's perspective, old failures are no longer relevant.

   GPFS checks the cluster formation criteria first. If that check falls outside the window, GPFS then checks for NSD server fail times being within the window.

pagepool
   Changes the size of the cache on each node. The default value is 64 MB. The minimum allowed value is 4 MB. The maximum GPFS pagepool size depends on the value of the pagepoolMaxPhysMemPct parameter and the amount of physical memory on the node. This value can be specified using the suffix K, M, or G; for example, 128M.

   The -N flag is valid for this attribute.

pagepoolMaxPhysMemPct
   Percentage of physical memory that can be assigned to the page pool. Valid values are 10 through 90 percent. The default is 75 percent (with the exception of Windows, where the default is 50 percent).

The -N flag is valid for this attribute.
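For example, to assign a 4 GB page pool to the NSD server nodes only, a command of the following form could be used. This is a sketch; the node class name nsdNodes is a placeholder for whatever node class or node list is defined in your cluster:

mmchconfig pagepool=4G -N nsdNodes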

prefetchThreads
Controls the maximum possible number of threads dedicated to prefetching data for files that are read sequentially, or to handle sequential write-behind.

Functions in the GPFS daemon dynamically determine the actual degree of parallelism for prefetching data. The default value is 72. The minimum value is 2. The maximum value of prefetchThreads plus worker1Threads is:
• 164 on 32-bit kernels
• 550 on 64-bit kernels

The -N flag is valid for this attribute.

release=LATEST
Changes the GPFS configuration information to the latest format supported by the currently installed level of GPFS. Perform this operation after all nodes in the GPFS cluster have been migrated to the latest level of GPFS. For additional information see Completing the migration to a new level of GPFS in the General Parallel File System: Concepts, Planning and Installation Guide.

This command attempts to access each of the nodes in the cluster to verify the level of the installed GPFS code. If one or more nodes cannot be reached, you will have to rerun the command until the information for all nodes can be confirmed.

sidAutoMapRangeLength
Controls the length of the reserved range for Windows SID to UNIX ID mapping. See Identity management on Windows in the General Parallel File System: Advanced Administration Guide for additional information.


sidAutoMapRangeStart
Specifies the start of the reserved range for Windows SID to UNIX ID mapping. See Identity management on Windows in the General Parallel File System: Advanced Administration Guide for additional information.

subnets
Specifies subnets used to communicate between nodes in a GPFS cluster or a remote GPFS cluster.

The subnets option must use the following format:
subnets="Subnet[/ClusterName[;ClusterName...][ Subnet[/ClusterName[;ClusterName...]...]"

The order in which you specify the subnets determines the order that GPFS uses these subnets to establish connections to the nodes within the cluster. For example, subnets="192.168.2.0" refers to IP addresses 192.168.2.0 through 192.168.2.255.

This feature cannot be used to establish fault tolerance or automatic failover. If the interface corresponding to an IP address in the list is down, GPFS does not use the next one on the list. For more information about subnets, see General Parallel File System: Advanced Administration Guide and search on Using remote access with public and private IP addresses.
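For example, the following is a hedged sketch of a two-subnet specification in which the second subnet is restricted to the nodes of one remote cluster; the addresses and the cluster name remote.cluster are placeholders:

mmchconfig subnets="192.168.2.0 10.0.10.0/remote.cluster"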

tiebreakerDisks
Controls whether GPFS will use the node quorum with tiebreaker algorithm in place of the regular node-based quorum algorithm. See General Parallel File System: Concepts, Planning, and Installation Guide and search for node quorum with tiebreaker. To enable this feature, specify the names of one or three disks. Separate the NSD names with a semicolon (;) and enclose the list in quotes. The disks do not have to belong to any particular file system, but must be directly accessible from the quorum nodes. For example:
tiebreakerDisks="gpfs1nsd;gpfs2nsd;gpfs3nsd"

To disable this feature, use:
tiebreakerDisks=no

When changing the tiebreakerDisks, GPFS must be down on all nodes in the cluster.

uidDomain
Specifies the UID domain name for the cluster.

GPFS must be down on all the nodes when changing the uidDomain attribute.

A detailed description of the GPFS user ID remapping convention is contained in the UID Mapping for GPFS in a Multi-Cluster Environment white paper at http://www.ibm.com/systems/clusters/library/wp_lit.html.

unmountOnDiskFail
Controls how the GPFS daemon will respond when a disk failure is detected. Valid values are yes or no.

When unmountOnDiskFail is set to no, the daemon marks the disk as failed and continues as long as it can without using the disk. All nodes that are using this disk are notified of the disk failure. The disk can be made active again by using the mmchdisk command. This is the suggested setting when metadata and data replication are used because the replica can be used until the disk is brought online again.

When unmountOnDiskFail is set to yes, any disk failure will cause only the local node to force-unmount the file system that contains that disk. Other file systems on this node and other nodes continue to function normally, if they can. The local node can try and remount the file system when the disk problem has been resolved. This is the suggested setting when using SAN-attached disks in large multinode configurations, and when replication is not being used. This setting should also be used on a node that hosts descOnly disks. See Establishing disaster recovery for your GPFS cluster in General Parallel File System: Advanced Administration Guide.


The -N flag is valid for this attribute.

usePersistentReserve
Specifies whether to enable or disable Persistent Reserve (PR) on the disks. Valid values are yes or no (no is the default). GPFS must be stopped on all nodes when setting this attribute.
• PR is only supported on AIX nodes.
• PR is only supported on NSDs that are built directly on hdisks.
• The disk subsystem must support PR.
• GPFS supports a mix of PR disks and other disks. However, you will only realize improved failover times if all the disks in the cluster support PR.
• GPFS only supports PR in the home cluster. Remote mounts must access the disks using an NSD server.

For more information, see Reduced recovery time using Persistent Reserve in the General Parallel File System: Concepts, Planning, and Installation Guide.

verbsPorts
Specifies the InfiniBand device names and port numbers used for RDMA transfers between an NSD client and server. You must enable verbsRdma to enable verbsPorts.

The format for verbsPorts is:
verbsPorts="device/port[ device/port ...]"

In this format, device is the HCA device name (such as mthca0) and port is the one-based port number (such as 1 or 2). If you do not specify a port number, GPFS uses port 1 as the default.

For example:
verbsPorts="mthca0/1 mthca0/2"

will create two RDMA connections between the NSD client and server using both ports of a dual-ported adapter.

The -N flag is valid for this attribute.

verbsRdma
Enables or disables InfiniBand RDMA using the Verbs API for data transfers between an NSD client and NSD server. Valid values are enable or disable.

The -N flag is valid for this attribute.

Note: InfiniBand RDMA for Linux X86_64 is supported only on GPFS V3.2 Multiplatform. For the latest support information, see the GPFS Frequently Asked Questions at: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.html.
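For example, a sketch of enabling RDMA for a set of nodes; the node class name verbsNodes is a placeholder, and the device names must match the HCAs actually installed:

mmchconfig verbsPorts="mthca0/1 mthca0/2" -N verbsNodes
mmchconfig verbsRdma=enable -N verbsNodes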

worker1Threads
Controls the maximum number of concurrent file operations at any one instant. If there are more requests than that, the excess will wait until a previous request has finished.

This attribute is primarily used for random read or write requests that cannot be pre-fetched, random I/O requests, or small file activity. The default value is 48. The minimum value is 1. The maximum value of prefetchThreads plus worker1Threads is:
• 164 on 32-bit kernels
• 550 on 64-bit kernels

The -N flag is valid for this attribute.
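For example, on 64-bit NSD server nodes these two thread pools could be raised together while keeping their sum under the 550 limit. The values and the node class name nsdNodes are illustrative placeholders only:

mmchconfig prefetchThreads=288,worker1Threads=200 -N nsdNodes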

Exit status

0 Successful completion.

Chapter 5. Other GPFS commands related to GPFS Native RAID 93

Page 106: A 2313540

nonzero
A failure has occurred.

Security

You must have root authority to run the mmchconfig command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages.

Examples

To change the maximum file system block size allowed to 4 MB, issue this command:
mmchconfig maxblocksize=4M

The system displays information similar to:
Verifying GPFS is stopped on all nodes ...
mmchconfig: Command successfully completed
mmchconfig: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.

To confirm the change, issue this command:
mmlsconfig

The system displays information similar to:
Configuration data for cluster ib.cluster:
------------------------------------------
clusterName ib.cluster
clusterId 13882433899463047326
autoload no
minReleaseLevel 3.4.0.0
dmapiFileHandleSize 32
maxblocksize 4m
pagepool 2g
[c21f1n18]
pagepool 5g
[common]
verbsPorts mthca0/1
verbsRdma enable
subnets 10.168.80.0
adminMode central

File systems in cluster ib.cluster:
-----------------------------------
/dev/fs1

Location

/usr/lpp/mmfs/bin


mmcrfs command
Creates a GPFS file system.

Synopsis
mmcrfs Device {"DiskDesc[;DiskDesc...]" | -F DescFile}
       [-A {yes | no | automount}] [-B BlockSize] [-D {posix | nfs4}]
       [-E {yes | no}] [-j {cluster | scatter}] [-k {posix | nfs4 | all}]
       [-K {no | whenpossible | always}] [-L LogFileSize]
       [-m DefaultMetadataReplicas] [-M MaxMetadataReplicas]
       [-n NumNodes] [-Q {yes | no}] [-r DefaultDataReplicas]
       [-R MaxDataReplicas] [-S {yes | no}] [-T Mountpoint]
       [-t DriveLetter] [-v {yes | no}] [-z {yes | no}]
       [--filesetdf | --nofilesetdf]
       [--inode-limit MaxNumInodes[:NumInodesToPreallocate]]
       [--metadata-block-size MetadataBlockSize]
       [--mount-priority Priority] [--version VersionString]

Description

Use the mmcrfs command to create a GPFS file system. The first two parameters must be Device and either DiskDescList or DescFile and they must be in that order. The block size and replication factors chosen affect file system performance. A maximum of 256 file systems can be mounted in a GPFS cluster at one time, including remote file systems.

When deciding on the maximum number of files (number of inodes) in a file system, consider that for file systems that will be doing parallel file creates, if the total number of free inodes is not greater than 5% of the total number of inodes, there is the potential for slowdown in file system access. The total number of inodes can be increased using the mmchfs command.

When deciding on a block size for a file system, consider these points:
1. Supported block sizes are 16 KiB, 64 KiB, 128 KiB, 256 KiB, 512 KiB, 1 MiB, 2 MiB, 4 MiB, 8 MiB (for GPFS Native RAID only), and 16 MiB (for GPFS Native RAID only).
2. The GPFS block size determines:
   • The minimum disk space allocation unit. The minimum amount of space that file data can occupy is a sub-block. A sub-block is 1/32 of the block size.
   • The maximum size of a read or write request that GPFS sends to the underlying disk driver.
3. From a performance perspective, it is recommended that you set the GPFS block size to match the application buffer size, the RAID stripe size, or a multiple of the RAID stripe size. If the GPFS block size does not match the RAID stripe size, performance may be severely degraded, especially for write operations. If GPFS Native RAID is in use, the block size must equal the vdisk track size.
4. In file systems with a high degree of variance in the size of files within the file system, using a small block size would have a large impact on performance when accessing large files. In this kind of system it is suggested that you use a block size of 256 KB (8 KB sub-block). Even if only 1% of the files are large, the amount of space taken by the large files usually dominates the amount of space used on disk, and the waste in the sub-block used for small files is usually insignificant. For further performance information, see the GPFS white papers at http://www.ibm.com/systems/clusters/library/wp_lit.html.
5. The effect of block size on file system performance largely depends on the application I/O pattern.
   • A larger block size is often beneficial for large sequential read and write workloads.
   • A smaller block size is likely to offer better performance for small file, small random read and write, and metadata-intensive workloads.
6. The efficiency of many algorithms that rely on caching file data in a GPFS page pool depends more on the number of blocks cached rather than the absolute amount of data. For a page pool of a given size, a larger file system block size would mean fewer blocks cached. Therefore, when you create file systems with a block size larger than the default of 256 KB, it is recommended that you increase the page pool size in proportion to the block size.
7. The file system block size must not exceed the value of the GPFS maxblocksize configuration parameter. The maxblocksize parameter is set to 1 MB by default. If a larger block size is desired, use the mmchconfig command to increase the maxblocksize before starting GPFS, as shown in the example that follows this list.
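For example, a minimal sketch of creating a GPFS Native RAID file system with an 8 MiB block size. The file system name fs8m, the descriptor file vdisk.desc, and the mount point /gpfs/fs8m are placeholders, and the vdisks listed in the descriptor file are assumed to have an 8 MiB track size. As the earlier mmchconfig example output shows, GPFS must be stopped on all nodes when maxblocksize is changed:

mmchconfig maxblocksize=8M
mmcrfs fs8m -F vdisk.desc -B 8M -T /gpfs/fs8m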

Results

Upon successful completion of the mmcrfs command, these tasks are completed on all GPFS nodes:
• Mount point directory is created.
• File system is formatted.

Parameters

Device
The device name of the file system to be created.

File system names need not be fully-qualified. fs0 is as acceptable as /dev/fs0. However, file system names must be unique within a GPFS cluster. Do not specify an existing entry in /dev.

-F DescFile
Specifies a file containing a list of disk descriptors, one per line. You may use the rewritten DiskDesc file created by the mmcrnsd command, create your own file, or enter the disk descriptors on the command line. When using the DiskDesc file created by the mmcrnsd command, the values supplied on input to the command for Disk Usage and FailureGroup are used. When creating your own file or entering the descriptors on the command line, you must specify these values or accept the system defaults.

"DiskDesc[;DiskDesc...]"
A descriptor for each disk to be included. Each descriptor is separated by a semicolon (;). The entire list must be enclosed in quotation marks (' or ").

A disk descriptor is defined as (second, third and sixth fields reserved):
DiskName:::DiskUsage:FailureGroup::StoragePool

DiskName
You must specify the name of the NSD previously created by the mmcrnsd command. For a list of available disks, issue the mmlsnsd -F command.

DiskUsage
Specify a disk usage or accept the default:

dataAndMetadata
Indicates that the disk contains both data and metadata. This is the default for disks in the system pool.

dataOnly
Indicates that the disk contains data and does not contain metadata. This is the default for disks in storage pools other than the system pool.

metadataOnly
Indicates that the disk contains metadata and does not contain data.

descOnly
Indicates that the disk contains no data and no file metadata. Such a disk is used solely to keep a copy of the file system descriptor, and can be used as a third failure group in certain disaster recovery configurations. For more information, see General Parallel File System: Advanced Administration and search on Synchronous mirroring utilizing GPFS replication.


FailureGroup
A number identifying the failure group to which this disk belongs. You can specify any value from -1 (where -1 indicates that the disk has no point of failure in common with any other disk) to 4000. If you do not specify a failure group, the value defaults to the node number of the first NSD server defined in the NSD server list plus 4000. If you do not specify an NSD server list, the value defaults to -1. GPFS uses this information during data and metadata placement to assure that no two replicas of the same block are written in such a way as to become unavailable due to a single failure. All disks that are attached to the same NSD server or adapter should be placed in the same failure group.

If replication of -m or -r is set to 2, storage pools must have two failure groups for the commands to work properly.

StoragePool
Specifies the storage pool to which the disk is to be assigned. If this name is not provided, the default is system.

Only the system pool may contain descOnly, metadataOnly or dataAndMetadata disks.
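For example, a hedged sketch of a two-disk descriptor list following the format above; the NSD names, failure group numbers, and the user storage pool name datapool are placeholders:

"gpfs10nsd:::dataAndMetadata:1001::system;gpfs11nsd:::dataOnly:1002::datapool"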

Options

-A {yes | no | automount}
Indicates when the file system is to be mounted:

yes
When the GPFS daemon starts. This is the default.

no
Manual mount.

automount
On non-Windows nodes, when the file system is first accessed. On Windows nodes, when the GPFS daemon starts.

-B BlockSize
Size of data blocks. Must be 16 KiB, 64 KiB, 128 KiB, 256 KiB (the default), 512 KiB, 1 MiB, 2 MiB, 4 MiB, 8 MiB (for GPFS Native RAID only), or 16 MiB (for GPFS Native RAID only). Specify this value with the character K or M, for example 512K.

-D {nfs4 | posix}
Specifies whether a deny-write open lock will block writes, which is expected and required by NFS V4. File systems supporting NFS V4 must have -D nfs4 set. The option -D posix allows NFS writes even in the presence of a deny-write open lock. If you intend to export the file system using NFS V4 or Samba, you must use -D nfs4. For NFS V3 (or if the file system is not NFS exported at all) use -D posix. The default is -D nfs4.

-E {yes | no}
Specifies whether to report exact mtime values (-E yes), or to periodically update the mtime value for a file system (-E no). If it is more desirable to display exact modification times for a file system, specify or use the default -E yes option.

-j {cluster | scatter}
Specifies the block allocation map type. When allocating blocks for a given file, GPFS first uses a round-robin algorithm to spread the data across all disks in the file system. After a disk is selected, the location of the data block on the disk is determined by the block allocation map type. If cluster is specified, GPFS attempts to allocate blocks in clusters. Blocks that belong to a particular file are kept adjacent to each other within each cluster. If scatter is specified, the location of the block is chosen randomly.

The cluster allocation method may provide better disk performance for some disk subsystems in relatively small installations. The benefits of clustered block allocation diminish when the number of nodes in the cluster or the number of disks in a file system increases, or when the file system's free space becomes fragmented. The cluster allocation method is the default for GPFS clusters with eight or fewer nodes and for file systems with eight or fewer disks.

The scatter allocation method provides more consistent file system performance by averaging out performance variations due to block location (for many disk subsystems, the location of the data relative to the disk edge has a substantial effect on performance). This allocation method is appropriate in most cases and is the default for GPFS clusters with more than eight nodes or file systems with more than eight disks.

The block allocation map type cannot be changed after the file system has been created.

-k {posix | nfs4 | all}
Specifies the type of authorization supported by the file system:

posix
Traditional GPFS ACLs only (NFS V4 and Windows ACLs are not allowed). Authorization controls are unchanged from earlier releases.

nfs4
Support for NFS V4 and Windows ACLs only. Users are not allowed to assign traditional GPFS ACLs to any file system objects (directories and individual files).

all
Any supported ACL type is permitted. This includes traditional GPFS (posix) and NFS V4 and Windows ACLs (nfs4).

The administrator is allowing a mixture of ACL types. For example, fileA may have a posix ACL, while fileB in the same file system may have an NFS V4 ACL, implying different access characteristics for each file depending on the ACL type that is currently assigned. The default is -k all.

Avoid specifying nfs4 or all unless files will be exported to NFS V4 or Samba clients, or the file system will be mounted on Windows. NFS V4 and Windows ACLs affect file attributes (mode) and have access and authorization characteristics that are different from traditional GPFS ACLs.

-K {no | whenpossible | always}
Specifies whether strict replication is to be enforced:

no
Indicates that strict replication is not enforced. GPFS will try to create the needed number of replicas, but will still return EOK as long as it can allocate at least one replica.

whenpossible
Indicates that strict replication is enforced provided the disk configuration allows it. If the number of failure groups is insufficient, strict replication will not be enforced. This is the default value.

always
Indicates that strict replication is enforced.

For more information, see the topic "Strict replication" in the GPFS: Problem Determination Guide.

-L LogFileSize
Specifies the size of the internal log file. The default size is 4 MB or 32 times the file system block size, whichever is smaller. The minimum size is 256 KB and the maximum size is 32 times the file system block size or 16 MB, whichever is smaller. Specify this value with the K or M character, for example: 8M. This value cannot be changed after the file system has been created.

In most cases, allowing the log file size to default works well. An increased log file size is useful for file systems that have a large amount of metadata activity, such as creating and deleting many small files or performing extensive block allocation and deallocation of large files.

98 GPFS Native RAID Administration and Programming Reference

Page 111: A 2313540

-m DefaultMetadataReplicas
Specifies the default number of copies of inodes, directories, and indirect blocks for a file. Valid values are 1 and 2, but cannot be greater than the value of MaxMetadataReplicas. The default is 1.

-M MaxMetadataReplicas
Specifies the default maximum number of copies of inodes, directories, and indirect blocks for a file. Valid values are 1 and 2, but cannot be less than the value of DefaultMetadataReplicas. The default is 2.

-n NumNodes
The estimated number of nodes that will mount the file system. This is used as a best guess for the initial size of some file system data structures. The default is 32. This value can be changed after the file system has been created.

When you create a GPFS file system, you might want to overestimate the number of nodes that will mount the file system. GPFS uses this information for creating data structures that are essential for achieving maximum parallelism in file system operations (see the topic GPFS architecture in General Parallel File System: Concepts, Planning, and Installation Guide). Although a large estimate consumes additional memory, underestimating the data structure allocation can reduce the efficiency of a node when it processes some parallel requests such as the allotment of disk space to a file. If you cannot predict the number of nodes that will mount the file system, allow the default value to be applied. If you are planning to add nodes to your system, you should specify a number larger than the default. However, do not make estimates that are not realistic. Specifying an excessive number of nodes may have an adverse effect on buffer operations.

-Q {yes | no}
Activates quotas automatically when the file system is mounted. The default is -Q no. Issue the mmdefedquota command to establish default quota values. Issue the mmedquota command to establish explicit quota values.

To activate GPFS quota management after the file system has been created:
1. Mount the file system.
2. To establish default quotas:
   a. Issue the mmdefedquota command to establish default quota values.
   b. Issue the mmdefquotaon command to activate default quotas.
3. To activate explicit quotas:
   a. Issue the mmedquota command to activate quota values.
   b. Issue the mmquotaon command to activate quota enforcement.
A sample command sequence follows this list.
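The following sketch illustrates the default-quota path of the steps above; the file system name fsq, the descriptor file disks.desc, and the mount point /gpfs/fsq are placeholders, and the exact options of the quota commands should be verified in their own command descriptions:

mmcrfs fsq -F disks.desc -Q yes -T /gpfs/fsq
mmmount fsq -a
mmdefedquota -u fsq
mmdefquotaon fsq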

-r DefaultDataReplicas
Specifies the default number of copies of each data block for a file. Valid values are 1 and 2, but cannot be greater than the value of MaxDataReplicas. The default is 1.

-R MaxDataReplicas
Specifies the default maximum number of copies of data blocks for a file. Valid values are 1 and 2. The value cannot be less than the value of DefaultDataReplicas. The default is 2.

-S {yes | no}
Suppresses the periodic updating of the value of atime as reported by the gpfs_stat(), gpfs_fstat(), stat(), and fstat() calls. The default value is -S no. Specifying -S yes for a new file system results in reporting the time the file system was created.

-t DriveLetter
Specifies the drive letter to use when the file system is mounted on Windows.

-T MountPoint
Specifies the mount point directory of the GPFS file system. If it is not specified, the mount point will be set to DefaultMountDir/Device. The default value for DefaultMountDir is /gpfs, but it can be changed with the mmchconfig command.


-v {yes | no}
Verifies that specified disks do not belong to an existing file system. The default is -v yes. Specify -v no only when you want to reuse disks that are no longer needed for an existing file system. If the command is interrupted for any reason, use the -v no option on the next invocation of the command.

Important: Using -v no on a disk that already belongs to a file system will corrupt that file system. This will not be noticed until the next time that file system is mounted.

-z {yes | no}
Enable or disable DMAPI on the file system. Turning this option on will require an external data management application such as Tivoli® Storage Manager (TSM) hierarchical storage management (HSM) before the file system can be mounted. The default is -z no. For further information on DMAPI for GPFS, see General Parallel File System: Data Management API Guide.

--filesetdf | --nofilesetdf
When this option is enabled and quotas are enforced for a fileset, the df command reports numbers based on the quotas for the fileset and not for the total file system.

--inode-limit MaxNumInodes[:NumInodesToPreallocate]
Specifies the maximum number of files in the file system.

For file systems that will be creating parallel files, if the total number of free inodes is not greater than 5% of the total number of inodes, file system access might slow down. Take this into consideration when creating your file system.

The parameter NumInodesToPreallocate specifies the number of inodes that the system will immediately preallocate. If you do not specify a value for NumInodesToPreallocate, GPFS will dynamically allocate inodes as needed.

You can specify the NumInodes and NumInodesToPreallocate values with a suffix, for example 100K or 2M. Note that in order to optimize file system operations, the number of inodes that are actually created may be greater than the specified value.
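For example, a sketch of creating a file system limited to roughly two million inodes with half a million preallocated; the device name fs1 and the descriptor file disks.desc are placeholders:

mmcrfs fs1 -F disks.desc --inode-limit 2M:500K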

--metadata-block-size MetadataBlockSize
Specifies the block size for the system storage pool, provided its usage is set to metadataOnly. Valid values are the same as those listed for -B BlockSize in “Options” on page 97.

--mount-priority Priority
Controls the order in which the individual file systems are mounted at daemon startup or when one of the all keywords is specified on the mmmount command.

File systems with higher Priority numbers are mounted after file systems with lower numbers. File systems that do not have mount priorities are mounted last. A value of zero indicates no priority. This is the default.

--version VersionString
Enable only the file system features that are compatible with the specified release. The lowest allowed Version value is 3.1.0.0.

The default is 3.4.0.0, which will enable all currently available features but will prevent nodes that are running earlier GPFS releases from accessing the file system. Windows nodes can mount only file systems that are created with GPFS 3.2.1.5 or later.

Exit status

0 Successful completion.

nonzero
A failure has occurred.


Security

You must have root authority to run the mmcrfs command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages.

Examples

This example shows how to create a file system named gpfs1 using three disks, with a block size of 512 KB, allowing metadata and data replication to be 2, turning quotas on, and creating /gpfs1 as the mount point. To complete this task, issue the command:
mmcrfs gpfs1 "hd2n97;hd3n97;hd4n97" -B 512K -m 2 -r 2 -Q yes -T /gpfs1

The system displays output similar to:
GPFS: 6027-531 The following disks of gpfs1 will be formatted on node e109c4rp1.gpfs.net:
    hd2n97: size 9765632 KB
    hd3n97: size 9765632 KB
    hd4n97: size 9765632 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 102 GB can be added to storage pool 'system'.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
Formatting Allocation Map for storage pool 'system'
GPFS: 6027-572 Completed creation of file system /dev/gpfs1.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.

See also

See also the following topics in GPFS: Administration and Programming Reference:

mmchfs command

mmdelfs command

mmdf command

mmedquota command

mmfsck command

mmlsfs command

Location

/usr/lpp/mmfs/bin


mmexportfs command
Retrieves the information needed to move a file system to a different cluster.

Synopsis
mmexportfs {Device | all} -o ExportfsFile

Description

The mmexportfs command, in conjunction with the mmimportfs command, can be used to move one or more GPFS file systems from one GPFS cluster to another GPFS cluster, or to temporarily remove file systems from the cluster and restore them at a later time. The mmexportfs command retrieves all relevant file system and disk information and stores it in the file specified with the -o parameter. This file must later be provided as input to the mmimportfs command. When running the mmexportfs command, the file system must be unmounted on all nodes.

When all is specified in place of a file system name, any disks that are not associated with a file system will be exported as well.

Exported file systems remain unusable until they are imported back with the mmimportfs command to the same or a different GPFS cluster.

Results

Upon successful completion of the mmexportfs command, all configuration information pertaining to the exported file system and its disks is removed from the configuration data of the current GPFS cluster and is stored in the user specified file ExportfsFile.

Parameters

Device | all
The device name of the file system to be exported. File system names need not be fully-qualified. fs0 is as acceptable as /dev/fs0. Specify all to export all GPFS file systems, as well as all disks that do not currently belong to a file system.

If the specified file system device is a GPFS Native RAID-based file system, then all affected GPFS Native RAID objects will be exported as well. This includes recovery groups, declustered arrays, vdisks, and any other file systems that are based on these objects.

This must be the first parameter.

-o ExportfsFile
The path name of a file to which the file system information is to be written. This file must be provided as input to the subsequent mmimportfs command.

Exit status

0 Successful completion.

nonzero
A failure has occurred.

Security

You must have root authority to run the mmexportfs command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. See the topic about administration requirements in GPFS: Administration and Programming Reference.


Examples

To export all file systems in the current cluster, issue this command:
mmexportfs all -o /u/admin/exportfile

The output is similar to this:
mmexportfs: Processing file system fs1 ...

mmexportfs: Processing file system fs2 ...

mmexportfs: Processing disks that do not belong to any file system ...
mmexportfs: 6027-1371 Propagating the changes to all affected nodes. This is an asynchronous process.

See also

See also the following topic in GPFS: Administration and Programming Reference or GPFS Native RAID Administration and Programming Reference: “mmimportfs command” on page 104.

Location

/usr/lpp/mmfs/bin


mmimportfs command
Imports into the cluster one or more file systems that were created in another GPFS cluster.

Synopsis
mmimportfs {Device | all} -i ImportfsFile [-S ChangeSpecFile]

Description

The mmimportfs command, in conjunction with the mmexportfs command, can be used to move into the current GPFS cluster one or more file systems that were created in another GPFS cluster. The mmimportfs command extracts all relevant file system and disk information from the ExportFilesysData file specified with the -i parameter. This file must have been created by the mmexportfs command.

When all is specified in place of a file system name, any disks that are not associated with a file system will be imported as well.

If the file systems being imported were created on nodes that do not belong to the current GPFS cluster, the mmimportfs command assumes that all disks have been properly moved, and are online and available to the appropriate nodes in the current cluster.

If any node in the cluster, including the node on which you are running the mmimportfs command, does not have access to one or more disks, use the -S option to assign NSD servers to those disks.

The mmimportfs command attempts to preserve any NSD server assignments that were in effect when the file system was exported.

After the mmimportfs command completes, use mmlsnsd to display the NSD server names that are assigned to each of the disks in the imported file system. Use mmchnsd to change the current NSD server assignments as needed.

After the mmimportfs command completes, use mmlsdisk to display the failure groups to which each disk belongs. Use mmchdisk to make adjustments if necessary.

If you are importing file systems into a cluster that already contains GPFS file systems, it is possible to encounter name conflicts. You must resolve such conflicts before the mmimportfs command can succeed. You can use the mmchfs command to change the device name and mount point of an existing file system. If there are disk name conflicts, use the mmcrnsd command to define new disks and specify unique names (rather than let the command generate names). Then replace the conflicting disks using mmrpldisk and remove them from the cluster using mmdelnsd.

Results

Upon successful completion of the mmimportfs command, all configuration information pertaining to the file systems being imported is added to configuration data of the current GPFS cluster.

Parameters

Device | all
The device name of the file system to be imported. File system names need not be fully-qualified. fs0 is as acceptable as /dev/fs0. Specify all to import all GPFS file systems, as well as all disks that do not currently belong to a file system.

If the specified file system device is a GPFS Native RAID-based file system, then all affected GPFS Native RAID objects will be imported as well. This includes recovery groups, declustered arrays, vdisks, and any other file systems that are based on these objects.


This must be the first parameter.

-i ImportfsFile
The path name of the file containing the file system information. This file must have previously been created with the mmexportfs command.

-S ChangeSpecFile
The path name of an optional file containing disk descriptors or recovery group stanzas, or both, specifying the changes that are to be made to the file systems during the import step.

Disk descriptors have the following format:
DiskName:ServerList

where:

DiskName
Is the name of a disk from the file system being imported.

ServerList
Is a comma-separated list of NSD server nodes. You can specify up to eight NSD servers in this list. The defined NSD will preferentially use the first server on the list. If the first server is not available, the NSD will use the next available server on the list.

When specifying server nodes for your NSDs, the output of the mmlscluster command lists the host name and IP address combinations recognized by GPFS. The utilization of aliased host names not listed in the mmlscluster command output may produce undesired results.

If you do not define a ServerList, GPFS assumes that the disk is SAN-attached to all nodes in the cluster. If all nodes in the cluster do not have access to the disk, or if the file system to which the disk belongs is to be accessed by other GPFS clusters, you must specify a ServerList.

Recovery group stanzas have the following format:
%rg: rgName=RecoveryGroupName
     servers=Primary [,Backup ]

where:

RecoveryGroupName
Specifies the name of the recovery group being imported.

Primary [,Backup ]
Specifies the primary server and, optionally, a backup server to be associated with the recovery group.

Note:

1. You cannot change the name of a disk. You cannot change the disk usage or failure group assignment with the mmimportfs command. Use the mmchdisk command for this purpose.

2. All disks that do not have descriptors in ChangeSpecFile are assigned the NSD servers that they had at the time the file system was exported. All disks with NSD servers that are not valid are assumed to be SAN-attached to all nodes in the cluster. Use the mmchnsd command to assign new or change existing NSD server nodes.

3. Use the mmchrecoverygroup command to activate recovery groups that do not have stanzas in ChangeSpecFile.
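For example, a minimal sketch of a ChangeSpecFile that combines one disk descriptor and one recovery group stanza; the disk, server, and recovery group names are placeholders:

gpfs2nsd:server1,server2
%rg: rgName=rg01
     servers=server1,server2

The file would then be supplied during the import, for example:

mmimportfs all -i /u/admin/exportfile -S /u/admin/changes.spec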

Exit status

0 Successful completion.

nonzero
A failure has occurred.


Security

You must have root authority to run the mmimportfs command.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. See the topic about administration requirements in GPFS: Administration and Programming Reference.

Examples

To import all file systems in the current cluster, issue this command:
mmimportfs all -i /u/admin/exportfile

The output is similar to this:
mmimportfs: Processing file system fs1 ...
mmimportfs: Processing disk gpfs2nsd
mmimportfs: Processing disk gpfs3nsd
mmimportfs: Processing disk gpfs4nsd

mmimportfs: Processing file system fs2 ...
mmimportfs: Processing disk gpfs1nsd1
mmimportfs: Processing disk gpfs5nsd

mmimportfs: Processing disks that do not belong to any file system ...
mmimportfs: Processing disk gpfs6nsd
mmimportfs: Processing disk gpfs1001nsd

mmimportfs: Committing the changes ...

mmimportfs: The following file systems were successfully imported:
    fs1
    fs2

mmimportfs: 6027-1371 Propagating the changes to all affected nodes. This is an asynchronous process.

See also

See also the following topic in GPFS: Administration and Programming Reference or GPFS Native RAID Administration and Programming Reference: “mmexportfs command” on page 102.

Location

/usr/lpp/mmfs/bin


mmpmon command
Manages performance monitoring and displays performance information.

Synopsis
mmpmon [-i CommandFile] [-d IntegerDelayValue] [-p]
       [-r IntegerRepeatValue] [-s] [-t IntegerTimeoutValue]

Description

Before attempting to use mmpmon, IBM suggests that you review this command entry, then read the entire topic, Monitoring GPFS I/O performance with the mmpmon command in General Parallel File System: Advanced Administration Guide.

Use the mmpmon command to manage GPFS performance monitoring functions and display performance monitoring data. The mmpmon command reads requests from an input file or standard input (stdin), and writes responses to standard output (stdout). Error messages go to standard error (stderr). Prompts, if not suppressed, go to stderr.

When running mmpmon in such a way that it continually reads input from a pipe (the driving script or application never intends to send an end-of-file to mmpmon), set the -r option value to 1 (or use the default value of 1) to prevent mmpmon from caching the input records. This avoids unnecessary memory consumption.

This command cannot be run from a Windows node.

Results

The performance monitoring request is sent to the GPFS daemon running on the same node that is running the mmpmon command.

All results from the request are written to stdout.

There are two output formats:
• Human readable, intended for direct viewing.
  In this format, the results are keywords that describe the value presented, followed by the value. For example:
  disks: 2
• Machine readable, an easily parsed format intended for further analysis by scripts or applications.
  In this format, the results are strings with values presented as keyword/value pairs. The keywords are delimited by underscores (_) and blanks to make them easier to locate.

For details on how to interpret the mmpmon command results, see the topic Monitoring GPFS I/O performance with the mmpmon command in General Parallel File System: Advanced Administration Guide.

Parameters

-i CommandFile
The input file contains mmpmon command requests, one per line. Use of the -i flag implies use of the -s flag. For interactive use, just omit the -i flag. In this case, the input is then read from stdin, allowing mmpmon to take keyboard input or output piped from a user script or application program.

Leading blanks in the input file are ignored. A line beginning with a pound sign (#) is treated as a comment. Leading blanks in a line whose first non-blank character is a pound sign (#) are ignored.


Input requests to the mmpmon command are:

fs_io_s
Displays I/O statistics per mounted file system

io_s
Displays I/O statistics for the entire node

nlist add name [name...]
Adds node names to a list of nodes for mmpmon processing

nlist del
Deletes a node list

nlist new name [name...]
Creates a new node list

nlist s
Shows the contents of the current node list.

nlist sub name [name...]
Deletes node names from a list of nodes for mmpmon processing.

once request
Indicates that the request is to be performed only once.

reset
Resets statistics to zero.

rhist nr
Changes the request histogram facility request size and latency ranges.

rhist off
Disables the request histogram facility. This is the default.

rhist on
Enables the request histogram facility.

rhist p
Displays the request histogram facility pattern.

rhist reset
Resets the request histogram facility data to zero.

rhist s
Displays the request histogram facility statistics values.

ver
Displays mmpmon version.

vio_s [f rg RecoveryGroupName [da DeclusteredArray [v Vdisk]]] [reset]
Displays GPFS Native RAID vdisk I/O statistics.

vio_s_reset [f rg RecoveryGroupName [da DeclusteredArray [v Vdisk]]]
Resets GPFS Native RAID vdisk I/O statistics.
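For example, a hedged sketch of input lines that restrict the vdisk I/O statistics display to a single declustered array and then reset those statistics; the recovery group name rg01 and the declustered array name DA1 are placeholders:

vio_s f rg rg01 da DA1
vio_s_reset f rg rg01 da DA1

Such lines can be placed in the file supplied with the -i flag or typed interactively.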

Options

-d IntegerDelayValue
Specifies a number of milliseconds to sleep after one invocation of all the requests in the input file. The default value is 1000. This value must be an integer greater than or equal to 500 and less than or equal to 8000000.

The input file is processed as follows: The first request is processed, it is sent to the GPFS daemon, the responses for this request are received and processed, the results for this request are displayed, and then the next request is processed and so forth. When all requests from the input file have been processed once, the mmpmon command sleeps for the specified number of milliseconds. When this time elapses, mmpmon wakes up and processes the input file again, depending on the value of the -r flag.

-p Indicates to generate output that can be parsed by a script or program. If this option is not specified, human-readable output is produced.

-r IntegerRepeatValue
Specifies the number of times to run all the requests in the input file.

The default value is one. Specify an integer between zero and 8000000. Zero means to run forever, in which case processing continues until it is interrupted. This feature is used, for example, by a driving script or application program that repeatedly reads the result from a pipe.

The once prefix directive can be used to override the -r flag. See the description of once in Monitoring GPFS I/O performance with the mmpmon command in General Parallel File System: Advanced Administration Guide.

-s Indicates to suppress the prompt on input.

Use of the -i flag implies use of the -s flag. For use in a pipe or with redirected input (<), the -s flag is preferred. If not suppressed, the prompts go to standard error (stderr).

-t IntegerTimeoutValue
Specifies a number of seconds to wait for responses from the GPFS daemon before considering the connection to have failed.

The default value is 60. This value must be an integer greater than or equal to 1 and less than or equal to 8000000.

Exit status

0 Successful completion.

1 Various errors (insufficient memory, input file not found, incorrect option, and so forth).

3 Either no commands were entered interactively, or there were no mmpmon commands in the input file. The input file was empty, or consisted of all blanks or comments.

4 mmpmon terminated due to a request that was not valid.

5 An internal error has occurred.

111 An internal error has occurred. A message will follow.

Restrictions
1. Up to five instances of mmpmon may be run on a given node concurrently. However, concurrent users may interfere with each other. See Monitoring GPFS I/O performance with the mmpmon command in General Parallel File System: Advanced Administration Guide.
2. Do not alter the input file while mmpmon is running.
3. The input file must contain valid input requests, one per line. When an incorrect request is detected by mmpmon, it issues an error message and terminates. Input requests that appear in the input file before the first incorrect request are processed by mmpmon.

Security

The mmpmon command must be run by a user with root authority, on the node for which statistics are desired.

The node on which the command is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages.

Chapter 5. Other GPFS commands related to GPFS Native RAID 109

Page 122: A 2313540

Examples
1. Assume that infile contains these requests:
   ver
   io_s
   fs_io_s
   rhist off

   and this command is issued:
   mmpmon -i infile -r 10 -d 5000

   The output (sent to stdout) is similar to this:
   mmpmon node 192.168.1.8 name node1 version 3.1.0
   mmpmon node 192.168.1.8 name node1 io_s OK
   timestamp: 1083350358/935524
   bytes read: 0
   bytes written: 0
   opens: 0
   closes: 0
   reads: 0
   writes: 0
   readdir: 0
   inode updates: 0
   mmpmon node 192.168.1.8 name node1 fs_io_s status 1
   no file systems mounted
   mmpmon node 192.168.1.8 name node1 rhist off OK

   The requests in the input file are run 10 times, with a delay of 5000 milliseconds (5 seconds) between invocations.

2. Here is the previous example with the -p flag:
   mmpmon -i infile -p -r 10 -d 5000

   The output (sent to stdout) is similar to this:
   _ver_ _n_ 192.168.1.8 _nn_ node1 _v_ 2 _lv_ 3 _vt_ 0
   _io_s_ _n_ 192.168.1.8 _nn_ node1 _rc_ 0 _t_ 1084195701 _tu_ 350714 _br_ 0 _bw_ 0 _oc_ 0
   _cc_ 0 _rdc_ 0 _wc_ 0 _dir_ 0 _iu_ 0
   _fs_io_s_ _n_ 192.168.1.8 _nn_ node1 _rc_ 1 _t_ 1084195701 _tu_ 364489 _cl_ - _fs_ -
   _rhist_ _n_ 192.168.1.8 _nn_ node1 _req_ off _rc_ 0 _t_ 1084195701 _tu_ 378217

3. This is an example of fs_io_s with a mounted file system:
   mmpmon node 198.168.1.8 name node1 fs_io_s OK
   cluster: node1.localdomain
   filesystem: gpfs1
   disks: 1
   timestamp: 1093352136/799285
   bytes read: 52428800
   bytes written: 87031808
   opens: 6
   closes: 4
   reads: 51
   writes: 83
   readdir: 0
   inode updates: 11

   mmpmon node 198.168.1.8 name node1 fs_io_s OK
   cluster: node1.localdomain
   filesystem: gpfs2
   disks: 2
   timestamp: 1093352136/799285
   bytes read: 87031808
   bytes written: 52428800
   opens: 4
   closes: 3
   reads: 12834
   writes: 50
   readdir: 0
   inode updates: 9

4. Here is the previous example with the -p flag:
   _fs_io_s_ _n_ 198.168.1.8 _nn_ node1 _rc_ 0 _t_ 1093352061 _tu_ 93867 _cl_ node1.localdomain
   _fs_ gpfs1 _d_ 1 _br_ 52428800 _bw_ 87031808 _oc_ 6 _cc_ 4 _rdc_ 51 _wc_ 83 _dir_ 0 _iu_ 10
   _fs_io_s_ _n_ 198.168.1.8 _nn_ node1 _rc_ 0 _t_ 1093352061 _tu_ 93867 _cl_ node1.localdomain
   _fs_ gpfs2 _d_ 2 _br_ 87031808 _bw_ 52428800 _oc_ 4 _cc_ 3 _rdc_ 12834 _wc_ 50 _dir_ 0 _iu_ 8

   This output consists of two strings.

5. This is an example of io_s with a mounted file system:
   mmpmon node 198.168.1.8 name node1 io_s OK
   timestamp: 1093351951/587570
   bytes read: 139460608
   bytes written: 139460608
   opens: 10
   closes: 7
   reads: 12885
   writes: 133
   readdir: 0
   inode updates: 14

6. Here is the previous example with the -p flag:
   _io_s_ _n_ 198.168.1.8 _nn_ node1 _rc_ 0 _t_ 1093351982 _tu_ 356420 _br_ 139460608
   _bw_ 139460608 _oc_ 10 _cc_ 7 _rdc_ 0 _wc_ 133 _dir_ 0 _iu_ 14

This output consists of one string.

For several more examples, see Monitoring GPFS I/O performance with the mmpmon command in General Parallel File System: Advanced Administration Guide.

Location

/usr/lpp/mmfs/bin


Accessibility features for GPFS

Accessibility features help users who have a disability, such as restricted mobility or limited vision, to use information technology products successfully.

Accessibility features
The following list includes the major accessibility features in GPFS:
• Keyboard-only operation
• Interfaces that are commonly used by screen readers
• Keys that are discernible by touch but do not activate just by touching them
• Industry-standard devices for ports and connectors
• The attachment of alternative input and output devices

The IBM Cluster Information Center, and its related publications, are accessibility-enabled. The accessibility features of the information center are described in the Accessibility topic at the following URL: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.addinfo.doc/access.html.

Keyboard navigation
This product uses standard Microsoft Windows navigation keys.

IBM and accessibility
See the IBM Human Ability and Accessibility Center for more information about the commitment that IBM has to accessibility:
http://www.ibm.com/able/


Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only IBM's product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any of IBM's intellectual property rights may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
USA

For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to:

Intellectual Property Licensing
Legal and Intellectual Property Law
IBM Japan Ltd.
1623-14, Shimotsuruma, Yamato-shi
Kanagawa 242-8502 Japan

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law:

INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.


Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact:

IBM Corporation
Intellectual Property Law
Mail Station P300
2455 South Road,
Poughkeepsie, NY 12601-5400
USA

Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrates programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to the application programming interfaces for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

If you are viewing this information softcopy, the photographs and color illustrations may not appear.

Trademarks

IBM, the IBM logo, and ibm.com® are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at http://www.ibm.com/legal/copytrade.shtml.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of the Open Group in the United States and other countries.

Other company, product, or service names may be trademarks or service marks of others.


Glossary

This glossary defines technical terms and abbreviations used in GPFS documentation. If you do not find the term you are looking for, refer to the index of the appropriate book or view the IBM Glossary of Computing Terms, located on the Internet at: http://www-306.ibm.com/software/globalization/terminology/index.jsp.

B

block utilization
The measurement of the percentage of used subblocks per allocated blocks.

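The ratio behind this measurement can be illustrated with a small sketch. This example is not taken from the manual; it assumes the subblock definition given later in this glossary (a data block is divided into 32 subblocks), and the function and variable names are hypothetical.

# Illustrative only: block utilization as the percentage of used subblocks
# over the subblocks contained in the allocated blocks.
SUBBLOCKS_PER_BLOCK = 32  # per the "subblock" entry: 1/32 of a data block

def block_utilization(used_subblocks, allocated_blocks):
    if allocated_blocks == 0:
        return 0.0
    return 100.0 * used_subblocks / (allocated_blocks * SUBBLOCKS_PER_BLOCK)

print(block_utilization(2400, 100))  # 75.0: 2400 of 3200 subblocks in use
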
C

cluster
A loosely-coupled collection of independent systems (nodes) organized into a network for the purpose of sharing resources and communicating with each other. See also GPFS cluster.

cluster configuration data
The configuration data that is stored on the cluster configuration servers.

cluster manager
The node that monitors node status using disk leases, detects failures, drives recovery, and selects file system managers. The cluster manager is the node with the lowest node number among the quorum nodes that are operating at a particular time.

control data structures
Data structures needed to manage file data and metadata cached in memory. Control data structures include hash tables and link pointers for finding cached data; lock states and tokens to implement distributed locking; and various flags and sequence numbers to keep track of updates to the cached data.

D

Data Management Application Program Interface (DMAPI)
The interface defined by the Open Group's XDSM standard as described in the publication System Management: Data Storage Management (XDSM) API Common Application Environment (CAE) Specification C429, The Open Group ISBN 1-85912-190-X.

deadman switch timer
A kernel timer that works on a node that has lost its disk lease and has outstanding I/O requests. This timer ensures that the node cannot complete the outstanding I/O requests (which would risk causing file system corruption), by causing a panic in the kernel.

disk descriptor
A definition of the type of data that the disk contains and the failure group to which this disk belongs. See also failure group.

disposition
The session to which a data management event is delivered. An individual disposition is set for each type of event from each file system.

disk leasing
A method for controlling access to storage devices from multiple host systems. Any host that wants to access a storage device configured to use disk leasing registers for a lease; in the event of a perceived failure, a host system can deny access, preventing I/O operations with the storage device until the preempted system has reregistered.

domain
A logical grouping of resources in a network for the purpose of common management and administration.

F

failback
Cluster recovery from failover following repair. See also failover.

failover
(1) The assumption of file system duties by another node when a node fails. (2) The process of transferring all control of the ESS to a single cluster in the ESS when the other clusters in the ESS fail. See also cluster. (3) The routing of all transactions to a second controller when the first controller fails. See also cluster.

failure group
A collection of disks that share common access paths or adapter connection, and could all become unavailable through a single hardware failure.

fileset
A hierarchical grouping of files managed as a unit for balancing workload across a cluster.

file-management policy
A set of rules defined in a policy file that GPFS uses to manage file migration and file deletion. See also policy.

file-placement policy
A set of rules defined in a policy file that GPFS uses to manage the initial placement of a newly created file. See also policy.

file system descriptor
A data structure containing key information about a file system. This information includes the disks assigned to the file system (stripe group), the current state of the file system, and pointers to key files such as quota files and log files.

file system descriptor quorum
The number of disks needed in order to write the file system descriptor correctly.

file system manager
The provider of services for all the nodes using a single file system. A file system manager processes changes to the state or description of the file system, controls the regions of disks that are allocated to each node, and controls token management and quota management.

fragment
The space allocated for an amount of data too small to require a full block. A fragment consists of one or more subblocks.

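As a worked illustration (the 1 MiB block size here is an assumption made for the example, not a value taken from this manual): with a 1 MiB block, each subblock is 1 MiB / 32 = 32 KiB, so a 70 KiB tail of data occupies a fragment of three subblocks (96 KiB). A minimal sketch of that arithmetic, with hypothetical names:

SUBBLOCKS_PER_BLOCK = 32  # a data block is divided into 32 subblocks

def fragment_subblocks(data_bytes, block_size_bytes):
    # Round up to whole subblocks for data too small to need a full block.
    subblock_size = block_size_bytes // SUBBLOCKS_PER_BLOCK
    return -(-data_bytes // subblock_size)  # ceiling division

print(fragment_subblocks(70 * 1024, 1024 * 1024))  # 3 subblocks (96 KiB)
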
G

GPFS cluster
A cluster of nodes defined as being available for use by GPFS file systems.

GPFS portability layer
The interface module that each installation must build for its specific hardware platform and Linux distribution.

GPFS recovery log
A file that contains a record of metadata activity, and exists for each node of a cluster. In the event of a node failure, the recovery log for the failed node is replayed, restoring the file system to a consistent state and allowing other nodes to continue working.

I

ill-placed file
A file assigned to one storage pool, but having some or all of its data in a different storage pool.

ill-replicated file
A file with contents that are not correctly replicated according to the desired setting for that file. This situation occurs in the interval between a change in the file's replication settings or suspending one of its disks, and the restripe of the file.

indirect block
A block containing pointers to other blocks.

inode
The internal structure that describes the individual files in the file system. There is one inode for each file.

J

journaled file system (JFS)
A technology designed for high-throughput server environments, which are important for running intranet and other high-performance e-business file servers.

junction
A special directory entry that connects a name in a directory of one fileset to the root directory of another fileset.

K

kernel
The part of an operating system that contains programs for such tasks as input/output, management and control of hardware, and the scheduling of user tasks.


M

metadata
Data structures that contain access information about file data. These include: inodes, indirect blocks, and directories. These data structures are not accessible to user applications.

metanode
The one node per open file that is responsible for maintaining file metadata integrity. In most cases, the node that has had the file open for the longest period of continuous time is the metanode.

mirroring
The process of writing the same data to multiple disks at the same time. The mirroring of data protects it against data loss within the database or within the recovery log.

multi-tailed
A disk connected to multiple nodes.

N

namespace
Space reserved by a file system to contain the names of its objects.

Network File System (NFS)
A protocol, developed by Sun Microsystems, Incorporated, that allows any host in a network to gain access to another host or netgroup and their file directories.

Network Shared Disk (NSD)
A component for cluster-wide disk naming and access.

NSD volume ID
A unique 16-digit hex number that is used to identify and access all NSDs.

node
An individual operating-system image within a cluster. Depending on the way in which the computer system is partitioned, it may contain one or more nodes.

node descriptor
A definition that indicates how GPFS uses a node. Possible functions include: manager node, client node, quorum node, and nonquorum node.

node number
A number that is generated and maintained by GPFS as the cluster is created, and as nodes are added to or deleted from the cluster.

node quorum
The minimum number of nodes that must be running in order for the daemon to start.

node quorum with tiebreaker disks
A form of quorum that allows GPFS to run with as little as one quorum node available, as long as there is access to a majority of the quorum disks.

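A simplified sketch of the decision these two quorum definitions describe. This is illustrative logic only, not the GPFS implementation; it assumes the usual reading of node quorum as a strict majority of the quorum nodes, and the function name is hypothetical.

def have_quorum(quorum_nodes_up, quorum_nodes_total,
                tiebreaker_disks_up=0, tiebreaker_disks_total=0):
    # With tiebreaker disks: as little as one quorum node suffices,
    # provided a majority of the tiebreaker disks is accessible.
    if tiebreaker_disks_total > 0:
        return (quorum_nodes_up >= 1
                and tiebreaker_disks_up > tiebreaker_disks_total // 2)
    # Without tiebreaker disks: a majority of the quorum nodes must be up.
    return quorum_nodes_up > quorum_nodes_total // 2

print(have_quorum(1, 3, 2, 3))  # True: one node up, 2 of 3 disks reachable
print(have_quorum(1, 3))        # False: node quorum alone needs 2 of 3
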
non-quorum node
A node in a cluster that is not counted for the purposes of quorum determination.

P

policy
A list of file-placement and service-class rules that define characteristics and placement of files. Several policies can be defined within the configuration, but only one policy set is active at one time.

policy rule
A programming statement within a policy that defines a specific action to be performed.

pool
A group of resources with similar characteristics and attributes.

portability
The ability of a programming language to compile successfully on different operating systems without requiring changes to the source code.

primary GPFS cluster configuration server
In a GPFS cluster, the node chosen to maintain the GPFS cluster configuration data.

private IP address
An IP address used to communicate on a private network.

public IP address
An IP address used to communicate on a public network.

Q

quorum node
A node in the cluster that is counted to determine whether a quorum exists.

quota
The amount of disk space and number of inodes assigned as upper limits for a specified user, group of users, or fileset.


quota management
The allocation of disk blocks to the other nodes writing to the file system, and comparison of the allocated space to quota limits at regular intervals.

R

Redundant Array of Independent Disks (RAID)
A collection of two or more disk physical drives that present to the host an image of one or more logical disk drives. In the event of a single physical device failure, the data can be read or regenerated from the other disk drives in the array due to data redundancy.

recovery
The process of restoring access to file system data when a failure has occurred. Recovery can involve reconstructing data or providing alternative routing through a different server.

replication
The process of maintaining a defined set of data in more than one location. Replication involves copying designated changes for one location (a source) to another (a target), and synchronizing the data in both locations.

rule
A list of conditions and actions that are triggered when certain conditions are met. Conditions include attributes about an object (file name, type or extension, dates, owner, and groups), the requesting client, and the container name associated with the object.

S

SAN-attached
Disks that are physically attached to all nodes in the cluster using Serial Storage Architecture (SSA) connections or using fibre channel switches.

secondary GPFS cluster configuration server
In a GPFS cluster, the node chosen to maintain the GPFS cluster configuration data in the event that the primary GPFS cluster configuration server fails or becomes unavailable.

Secure Hash Algorithm digest (SHA digest)
A character string used to identify a GPFS security key.

session failure
The loss of all resources of a data management session due to the failure of the daemon on the session node.

session node
The node on which a data management session was created.

Small Computer System Interface (SCSI)
An ANSI-standard electronic interface that allows personal computers to communicate with peripheral hardware, such as disk drives, tape drives, CD-ROM drives, printers, and scanners faster and more flexibly than previous interfaces.

snapshot
A copy of changed data in the active files and directories of a file system with the exception of the inode number, which is changed to allow application programs to distinguish between the snapshot and the active files and directories.

source node
The node on which a data management event is generated.

stand-alone client
The node in a one-node cluster.

storage area network (SAN)
A dedicated storage network tailored to a specific environment, combining servers, storage products, networking products, software, and services.

storage pool
A grouping of storage space consisting of volumes, logical unit numbers (LUNs), or addresses that share a common set of administrative characteristics.

stripe group
The set of disks comprising the storage assigned to a file system.

striping
A storage process in which information is split into blocks (a fixed amount of data) and the blocks are written to (or read from) a series of disks in parallel.

subblock
The smallest unit of data accessible in an I/O operation, equal to one thirty-second of a data block.

system storage pool
A storage pool containing file system control structures, reserved files, directories, symbolic links, special devices, as well as the metadata associated with regular files, including indirect blocks and extended attributes. The system storage pool can also contain user data.

T

token management
A system for controlling file access in which each application performing a read or write operation is granted some form of access to a specific block of file data. Token management provides data consistency and controls conflicts. Token management has two components: the token management server, and the token management function.

token management function
A component of token management that requests tokens from the token management server. The token management function is located on each cluster node.

token management server
A component of token management that controls tokens relating to the operation of the file system. The token management server is located at the file system manager node.

twin-tailed
A disk connected to two nodes.

U

user storage pool
A storage pool containing the blocks of data that make up user files.

V

virtual file system (VFS)
A remote file system that has been mounted so that it is accessible to the local user.

virtual node (vnode)
The structure that contains information about a file system object in a virtual file system (VFS).


Index

A
accessibility features for the GPFS product 113
adding pdisks 44
adminDrain, pdisk state 14
adminMode attribute 86
array, declustered 6, 15
    background tasks 19
    large 15
    managing 11
    parameters 15
    size 15
    small 15
    spare space 16
attributes
    adminMode 86
    automountDir 87
    cipherList 87
    cnfsMountdPort 87
    cnfsNFSDprocs 87
    cnfsSharedRoot 87
    cnfsVIP 87
    dataStructureDump 87
    defaultHelperNodes 87
    defaultMountDir 88
    dmapiDataEventRetry 88
    dmapiEventTimeout 88
    dmapiMountEvent 88
    dmapiMountTimeout 88
    dmapiSessionFailureTimeout 89
    failureDetectionTime 89
    maxblocksize 89
    maxFcntlRangesPerFile 89
    maxFilesToCache 89
    maxMBpS 89
    maxStatCache 90
    mmapRangeLock 90
    nsdBufSpace 90
    nsdRAIDBufferPoolSizePct 90
    nsdRAIDtracks 90
    nsdServerWaitTimeForMount 90
    nsdServerWaitTimeWindowOnMount 91
    pagepool 91
    pagepoolMaxPhysMemPct 91
    prefetchThreads 91
    release 91
    sidAutoMapRangeLength 91
    sidAutoMapRangeStart 92
    subnets 92
    tiebreakerDisks 92
    uidDomain 92
    unmountOnDiskFail 92
    usePersistentReserve 93
    verbsPorts 93
    verbsRdma 93
    worker1Threads 93
automatic mount, indicating 97
automountDir attribute 87

B
background tasks 19
block size 16
    choosing 97
    effect on maximum mounted file system size 89, 97

C
callbacks 23, 78
    daRebuildFailed 23, 81
    nsdCksumMismatch 23, 81
    pdFailed 23, 80
    pdPathDown 23, 81
    pdRecovered 80
    pdReplacePdisk 23, 80
    postRGRelinquish 23, 80
    postRGTakeover 23, 79
    preRGRelinquish 23, 80
    preRGTakeover 23, 79
    rgOpenFailed 23, 80
    rgPanic 23, 80
carrier (disk), changing 46
changing
    attributes
        adminMode 86
        automountDir 87
        cipherList 87
        cluster configuration 86
        cnfsMountdPort 87
        cnfsNFSDprocs 87
        cnfsSharedRoot 87
        cnfsVIP 87
        dataStructureDump 87
        defaultHelperNodes 87
        defaultMountDir 88
        dmapiDataEventRetry 88
        dmapiEventTimeout 88
        dmapiMountEvent 88
        dmapiMountTimeout 88
        dmapiSessionFailureTimeout 89
        failureDetectionTime 89
        maxblocksize 89
        maxFcntlRangesPerFile 89
        maxFilesToCache 89
        maxMBpS 89
        maxStatCache 90
        mmapRangeLock 90
        nsdBufSpace 90
        nsdRAIDBufferPoolSizePct 90
        nsdRAIDtracks 90
        nsdServerWaitTimeForMount 90
        nsdServerWaitTimeWindowOnMount 91
        pagepool 91
        pagepoolMaxPhysMemPct 91
        prefetchThreads 91
        release 91
        sidAutoMapRangeLength 91
        sidAutoMapRangeStart 92
        subnets 92
        tiebreakerDisks 92
        uidDomain 92
        unmountOnDiskFail 92
        usePersistentReserve 93
        verbsPorts 93
        verbsRdma 93
        worker1Threads 93
    disk carrier 46
    pdisk state flags 49
    recovery group attributes 51

checksum
    data 19
    end-to-end 3
cipherList attribute 87
cluster
    changing configuration attributes 86
    configuration data 102
cnfsMountdPort attribute 87
cnfsNFSDprocs attribute 87
cnfsSharedRoot attribute 87
cnfsVIP attribute 87
commands
    GPFS Native RAID 43
    mmaddcallback 23, 78
    mmaddpdisk 44
    mmchcarrier 46
    mmchconfig 86
    mmchpdisk 49
    mmchrecoverygroup 51
    mmcrfs 95
    mmcrrecoverygroup 53
    mmcrvdisk 56
    mmdelpdisk 60
    mmdelrecoverygroup 62
    mmdelvdisk 64
    mmexportfs 102
    mmimportfs 104
    mmlspdisk 66
    mmlsrecoverygroup 69
    mmlsrecoverygroupevents 72
    mmlsvdisk 74
    mmpmon 107
    other related GPFS 77
creating
    file systems 95
    pdisks 53
    recovery groups 29, 53
    vdisks 56

D
daRebuildFailed callback 23, 81
data checksum 19
data redundancy 2
dataStructureDump attribute 87
dead, pdisk state 14
declustered array 6, 15
    background tasks 19
    large 15
    managing 11
    parameters 15
    size 15
    small 15
    spare space 16
    stanza files 53
declustered RAID 3
defaultHelperNodes attribute 87
defaultMountDir attribute 88
deleting
    pdisks 60
    recovery groups 62
    vdisks 64
diagnosing, pdisk state 14
diagnosis, disk 17
disks
    carrier, changing 46
    configuration 5, 15
        declustered array 6
        recovery group 5
    declustered array 15
    diagnosis 17
    hardware service 20
    hospital 8, 17
    maintenance 17
    pdisk 7, 12
    pdisk paths 13
    pdisk states 13
    physical 7, 12
    replacement 8, 19
    replacing failed 36
    setup example 25
    solid-state 8
    spare space 16
    SSD 8
    usage 96
    vdisk 7, 16
    vdisk size 17
    virtual 7, 16
displaying
    information for pdisks 66
    information for recovery groups 69
    information for vdisks 74
    the recovery group event log 72
    vdisk I/O statistics 22
dmapiDataEventRetry attribute 88
dmapiEventTimeout attribute 88
dmapiMountEvent attribute 88
dmapiMountTimeout attribute 88
dmapiSessionFailureTimeout attribute 89
dumps, storage of information 87

E
end-to-end checksum 3
event log (recovery group), displaying 72
examples
    creating recovery groups 29
    disk setup 25
    preparing recovery group servers 25

F
failed disks, replacing 36
failing, pdisk state 14
failover, server 19
failure group 96
failureDetectionTime attribute 89
features, GPFS Native RAID 1, 2
file systems
    block size 95
    checking 104
    creating 95
    exporting 102
    formatting 96
    importing 104
    mounted file system sizes 89, 97
    moving to another cluster 102

formatting, pdisk state 14

G
GPFS cluster configuration data 102
GPFS commands, other related 77
GPFS Native RAID
    callbacks 23
    commands 43
        mmaddpdisk 44
        mmchcarrier 46
        mmchpdisk 49
        mmchrecoverygroup 51
        mmcrrecoverygroup 53
        mmcrvdisk 56
        mmdelpdisk 60
        mmdelrecoverygroup 62
        mmdelvdisk 64
        mmlspdisk 66
        mmlsrecoverygroup 69
        mmlsrecoverygroupevents 72
        mmlsvdisk 74
    data redundancy 2
    declustered array 6
    declustered RAID 3
    disk configuration 5
    disk hospital 8
    disk replacement 8
    end-to-end checksum 3
    features 1, 2
    health metrics 8
    introduction 1
    managing 11
    monitoring 22
    overview 1
    pdisk 7
    pdisk discovery 8
    physical disk 7
    planning considerations 20
    RAID code 2, 16
    recovery group 5
    recovery group server parameters 11
    solid-state disk 8
    SSD 8
    system management 20
    vdisk 7
    virtual disk 7
group, recovery 5
    attributes, changing 51
    creating 11, 29, 53
    deleting 62
    listing information for 69
    log vdisk 17
    managing 11
    overview 11
    server failover 12
    server parameters 11
    verifying 32

H
hardware service 20
health metrics 8
hospital, disk 8, 17

I
information for recovery groups, listing 69
init, pdisk state 14
introduction, GPFS Native RAID 1

L
large declustered array 15
license inquiries 115
listing information
    for pdisks 66
    for recovery groups 69
    for vdisks 74
    vdisk I/O statistics 22
log (recovery group event), displaying 72
log vdisk 17

M
maintenance, disk 17
management, system 20
managing GPFS Native RAID 11
maxblocksize attribute 89
maxFcntlRangesPerFile attribute 89
maxFilesToCache attribute 89
maxMBpS attribute 89
maxStatCache attribute 90
missing, pdisk state 14
mmaddcallback 23, 78
mmaddpdisk 44
mmapRangeLock attribute 90
mmchcarrier 46
mmchconfig 86
mmchpdisk 49
mmchrecoverygroup 51
mmcrfs 95
mmcrrecoverygroup 53
mmcrvdisk 56
mmdelpdisk 60
mmdelrecoverygroup 62
mmdelvdisk 64
mmexportfs 102
mmimportfs 104
mmlspdisk 66
mmlsrecoverygroup 69
mmlsrecoverygroupevents 72
mmlsvdisk 74
mmpmon 107
mmpmon command input 22
monitoring
    performance 107
    system 22
mount point directory 96
mtime 97

N
NFS V4 97
noData, pdisk state 14


node failure detection 89
noPath, pdisk state 14
noRGD, pdisk state 14
notices 115
noVCD, pdisk state 14
NSD server 104
nsdBufSpace attribute 90
nsdCksumMismatch callback 23, 81
nsdRAIDBufferPoolSizePct attribute 90
nsdRAIDtracks attribute 90
nsdServerWaitTimeForMount attribute 90
nsdServerWaitTimeWindowOnMount attribute 91

O
ok, pdisk state 14
operating system
    AIX ix
other related GPFS commands 77
overview, GPFS Native RAID 1

P
pagepool attribute 91
pagepoolMaxPhysMemPct attribute 91
patent information 115
paths, pdisk 13
pdFailed callback 23, 80
pdisks 7, 12
    adding 44
    changing 49
    creating 53
    deleting 60
    discovery 8
    displaying 66
    health metrics 8
    listing information for 66
    managing 11
    overview 12
    paths 13
    stanza files 44, 53, 60
    states 13, 49, 66
pdPathDown callback 23, 81
pdRecovered callback 80
pdReplacePdisk callback 23, 80
performance, monitoring 107
physical disk 7, 12
planning considerations 20
postRGRelinquish callback 23, 80
postRGTakeover callback 23, 79
prefetchThreads attribute 91
preparing recovery group servers 25
preRGRelinquish callback 23, 80
preRGTakeover callback 23, 79
problem determination information, placement of 87
PTOW, pdisk state 14

R
RAID code
    comparison 2
    planning considerations 20
    Reed-Solomon 2
    replication 2
    vdisk configuration 16
RAID layouts
    conventional 3
    declustered 3
RAID stripe size 95
RAID, declustered 3
readonly, pdisk state 14
rebalance, background task 19
rebuild-1r, background task 19
rebuild-2r, background task 19
rebuild-critical, background task 19
rebuild-offline, background task 19
recovery group servers, preparing 25
recovery groups 5
    attributes, changing 51
    configuring 29
    creating 11, 29, 53
    deleting 62
    event log, displaying 72
    layout 30
    listing information for 69
    log vdisk 17
    managing 11
    overview 11
    preparing servers 25
    server failover 12, 19
    server parameters 11
    stanza files 30, 105
    verifying 32
redundancy codes
    comparison 2
    Reed-Solomon 2
    replication 2
registering user event commands 78
release attribute 91
repair-RGD/VCD, background task 19
replace, pdisk state 14
replacement, disk 8, 19
replacing failed disks 36
replication, strict 98
rgOpenFailed callback 23, 80
rgPanic callback 23, 80

S
scrub, background task 19
server failover 19
service, hardware 20
setup example, disk 25
sidAutoMapRangeLength attribute 91
sidAutoMapRangeStart attribute 92
size, vdisk 17
small declustered array 15
solid-state disk 8
spare space 16
SSD 8
stanza files
    declustered array 53
    pdisk 44, 53, 60
    recovery group 30, 105
    vdisk 32, 56, 64
states, pdisk 13
statistics, vdisk I/O
    displaying 22, 108
    resetting 22, 108
strict replication 98
subnets attribute 92


supported operating system
    AIX ix
suspended, pdisk state 14
system
    management 20
    monitoring 22
    planning 20
systemDrain, pdisk state 14

T
tasks, background 19
tiebreakerDisks attribute 92
trademarks 116
traditional ACLs
    NFS V4 ACL 98
    Windows 98

U
uidDomain attribute 92
unmountOnDiskFail attribute 92
usePersistentReserve attribute 93
user event commands, registering 78

V
vdisks 7, 16
    block size 16
    creating 32, 56
    creating NSDs 35
    data checksum 19
    defining 32
    deleting 64
    I/O statistics, displaying 22
    listing information for 74
    log vdisk 17
    managing 11
    RAID code 16
    relationship with NSDs 17
    size 17
    stanza files 32, 56, 64
verbsPorts attribute 93
verbsRdma attribute 93
virtual disk 7, 16

W
worker1Threads attribute 93


Product Number: 5765-G66

Printed in USA

SA23-1354-00