Multipathing, SAN, and Direct-Attached Storage Considerations for AIX Administrators
John Hock
IBM Power Systems Strategic Initiatives
1
Agenda
• Direct-attached storage options review
• Disk formatting update
• Working with MPIO
• Monitoring, Measuring, Tuning topics
2
3
DAS & SAN - Two Good Options

DAS - Direct-Attached Storage ("internal")
SAN - Storage Area Network ("external")

• Both options are strategic
• Both options have their strengths
• You can use both options on the same server

DAS: fastest (lower latency); typically lower-cost hardware/software; often simpler configuration
SAN: fast; multi-server sharing; advanced functions/values such as FlashCopy, Metro/Global Mirror, Live Partition Mobility, and Easy Tier
4
SAN vs. DAS Decision Points

• Configuration control
  SAN: required resources may not be under your control
  DAS: direct local control
• Performance, general
  SAN: main benefits lie in improved sharing, scalability, and manageability
  DAS: can meet the most demanding application IO requirements
• Performance, MB/s
  SAN: additional components may be needed to support parallelism of paths to achieve MB/s targets
  DAS: configuring for MB/s throughput is simpler, as you most likely have full control of all the key components
• Performance, IO/s
  SAN: ideal for IO/s, with cache and multiple processors in the subsystems
  DAS: SSDs and Gen3 PCIe provide IO/s equal to or better than SAN
• Performance, latency
  SAN: latency depends upon current load conditions and request contention
  DAS: provides a shorter path for processing IO requests than SAN, resulting in smaller latencies
• Multiple-server sharing
  SAN: key strength
  DAS: limited sharing
• Availability/backup
  SAN: RAID-level data protection; extensive backup and replication tools
  DAS: RAID-level data protection; traditional backup technologies such as tape
• Partition Mobility
  SAN: supported
  DAS: not supported
• Placement flexibility
  SAN: SAN cable and switch distances
  DAS: intra-rack distances
• Cost
  SAN: can be allocated across many servers and applications
  DAS: generally less than SAN, but the cost must be absorbed by a smaller set of applications on a single server
• Scalability
  SAN: key strength
  DAS: limited, but may be sufficient for applications on a single server
• Slot usage efficiency
  SAN: FC adapters make efficient use of slots since cache is in the subsystem
  DAS: non-RAID SAS adapters have the same efficiency as FC; RAID adapters must be paired
5
IBM SAS Hardware Offerings - The SAS Ecosystem

• SAS adapters
  • Some adapters must be purchased and used in pairs
  • Capabilities vary
  • PCIe or PCI-X; PCIe Gen1, Gen2, or Gen3
• SAS disk bays
  • In the CECs (optional split backplanes)
  • In IO drawers: FC 5802/5803
  • In SAS disk bay drawers: EXP12S, EXP24S, EXP30, FC 5887
• HDDs and SSDs: 3.5-inch vs. 2.5-inch SFF, 1.8-inch SSD
• Cables
  • Various types and lengths: AA, AI, AE, AT, EE, X, YO, YI, YR
  • The number of ways to cable SAS is far greater than the number of IBM-supported ways

(Figure: SAS hardware feature codes - FC 5887 EXP24S, FC 5886 EXP12S, FC 5802, FC 5803, FC 5901, FC 5913, FC 5805/5903, FC EDR1/5888 EXP30, FC ESA1/2, FC EJ0M/J/L)
6
Created by the SCSI Trade Association (STA) to facilitate the identification of SAS architecture in the marketplace
(Figure: first- and second-generation SAS logos)
7
SAS Roadmap
(Figure: STA SAS technology roadmap; arrow marks the generations shipping in volume)
SAS Environment Planning
• How many HDDs and SSDs do you need?
  Capacity and performance needs
• What RAID configuration will you use?
  Number of disks in each array and the RAID level; how about hot spares? Hdisks (RAID arrays) are usually assigned as a unit; operating system mirroring?
• Where will the HDDs and SSDs go?
  Internal to the CECs (is the RAID daughter card offered in the system?); in IO drawers offering SAS disk bays; in SAS disk bay drawers
• What are your availability requirements for adapter failure?
  Dual adapters in an LPAR, or dual adapters across LPARs (or across VIOSs)?
• Cabling
• Understand the limitations
  Max HDDs/SSDs per adapter; mix of HDDs and SSDs
9
SAS Adapter Planning
• Plan your availability requirements
  Single adapter or dual adapters; RAID, JBOD, LVM mirroring; don't forget hot spares!
• Plan your cabling
  Generally, balance IOs evenly across physical disks… use the same RAID levels and number of disks per RAID array
• Plan your performance
  Order enough disks to get the IOPS needed, taking into account the availability configuration
  HDD max ~150 IOPS; SSD > 10,000 IOPS
10
Sizing the storage subsystem - DAS formulas & rules of thumb
For random workloads at the application layer

• Formulas
  N = number of physical disks
  R = proportion of IOs that are reads
  W = 1 - R = proportion of IOs that are writes
  D = IOPS for a single physical disk (approximately 175 IOPS for 15K RPM disks)

  JBOD, RAID 0, and OS mirroring: IOPS bandwidth = N x D
  RAID 1 or RAID 10: IOPS bandwidth = N x D / (R + 2W)
  RAID 5 or 5E: IOPS bandwidth = N x D / (R + 4W)
  RAID 6: IOPS bandwidth = N x D / (R + 6W)

• Rules of thumb for 15K RPM disks
  RAID 6 - 65 to 85 IOPS/HDD
  RAID 5 - 80 to 100 IOPS/HDD
  Mirrored/RAID 0/JBOD - 100 to 120 IOPS/HDD

• 10K RPM disks deliver approximately 30% fewer IOPS per disk
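The sizing formulas can be sketched in a few lines of code. This is a minimal illustration (the function and names are mine, not from any IBM sizing tool), assuming the write-penalty model on this slide:

```python
# Minimal sketch of the DAS sizing formulas above (illustrative names,
# not an IBM tool). Host-visible IOPS capacity for an array of N disks.

def iops_bandwidth(n_disks, read_fraction, disk_iops=175, raid="raid5"):
    """Each host write costs extra back-end IOs depending on RAID level:
    1 (JBOD/RAID 0/OS mirror), 2 (RAID 1/10), 4 (RAID 5/5E), 6 (RAID 6)."""
    r = read_fraction
    w = 1.0 - r
    penalty = {"jbod": 1, "raid0": 1, "os_mirror": 1,
               "raid1": 2, "raid10": 2,
               "raid5": 4, "raid5e": 4,
               "raid6": 6}[raid]
    return n_disks * disk_iops / (r + penalty * w)   # N x D / (R + kW)

# Example: 24 x 15K RPM HDDs (~175 IOPS each), 70% reads, RAID 5
print(round(iops_bandwidth(24, 0.7)))          # -> 2211
print(iops_bandwidth(10, 0.5, raid="jbod"))    # -> 1750.0
```

Note that for a read-only workload (W = 0) every denominator collapses to 1, so all RAID levels deliver the full N x D; the penalty only applies to the write fraction.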
11
SAS Hot-Pluggability
• SAS fully supports hot plugging
• You can hot-add SAS adapters, cable them to drive or media drawers, power them up, and configure them
• On SAS drawers you can hot-repair port expander cards and power/cooling assemblies
• If you have dual adapters (X cables) you can hot-repair adapters (you cannot hot-replace an X cable)
• You can hot-add a SAS media drawer
(Figure: SAS cabling diagrams - dual adapters cabled with 4x and 2x links from ports C1-T1/C1-T2/C2-T1/C2-T2 to the ESM pairs in the disk drawers)
12
EXP24S SFF Gen-2-bay disk expansion drawer
• Twenty-four Gen2 2.5-inch small form factor SAS bays
• Gen-2 SAS bays (or SFF-2) are not compatible with Gen-1 SAS bays
  • Not compatible with CEC SFF Gen-1 SAS bays or with #5802/5803 Gen-1 SFF SAS bays
• Supports 3G and 6G SAS interface speeds
• Redundant AC power supplies
• Supports HDDs and SSDs
• The EXP24S can be ordered in one of three manufacturing-configured mode settings (not customer set up), resulting in 1, 2, or 4 sets of disk partitions
• SAS controller implementation options are #5887 mode dependent:
  • a single controller
  • one pair of controllers
  • two pairs of controllers, or up to four separate controllers
(Figure: #5887 EXP24S rear view - two ESMs (C1 and C2), each with ports T1, T2, and T3)
13
#5887 EXP24S Modes

• Mode 1: 1 set of 24 bays
  • AIX, IBM i, Linux, VIOS
  • All adapters/controllers
• Mode 2: 2 sets of 12 bays
  • AIX, Linux, VIOS; not IBM i
  • All adapters/controllers
  • The T3 ports connect drives D1-D12 and the T2 ports connect drives D13-D24
• Mode 4: 4 sets of 6 bays
  • AIX, Linux, VIOS; not IBM i
  • Only #5901/5278 adapters; each adapter could be a separate partition/system

• #5887 modes are set by IBM Manufacturing; there is an option to reset the mode outside of manufacturing, but it is cumbersome
Note: IBM Manufacturing will attach an external label on the EXP24S to indicate the mode setting
(Figure: #5887 EXP24S rear view - two ESMs (C1 and C2), each with ports T1, T2, and T3)
POWER8 S824 Backplane
• Choice of two storage features:
  • Choice one: twelve SFF-3 bays, one DVD bay, one integrated SAS controller without cache, and JBOD, RAID 0, 5, 6, or 10
    • Optionally, split the SFF bays and add a second integrated SAS controller without cache
  • Choice two: eighteen SFF-3 bays, one DVD bay, a pair of integrated SAS controllers with cache, RAID 0, 5, 6, 10, 5T2, 6T2, and 10T2
    • Optionally, attach an EXP24S SAS HDD/SSD Expansion Drawer to the dual IOA
14
EXP30 Ultra SSD I/O Drawer
Attaches to one or two GX++ slots (not supported on POWER8)

#EDR1 - for POWER7+ 770/780 "D" models
• 1U drawer, up to 30 SSDs; 30 x 387 GB drives = up to 11.6 TB
• Great performance: up to 480,000 IOPS (100% read), up to 4.5 GB/s bandwidth
• Slightly faster PowerPC processor in the SAS controller and faster internal instruction memory
• Concurrent maintenance
• Up to 48 drives and up to 43 TB of downstream HDD in a #5887

#5888 - for POWER7 710/720/730/740 "C" models
• 1U drawer, up to 30 SSDs
• Great performance: up to 400,000 IOPS (100% read), up to 4.5 GB/s bandwidth
• Slightly slower PowerPC processor in the SAS controller and slower internal instruction memory
• Limited concurrent maintenance
• No downstream HDD
15
POWER8 I/O Migration Considerations: Disk & SSD
• Drives in #5887 EXP24S I/O Drawer (SFF-2)
• All disk drives supported.
• All 387GB or larger SSD supported (177GB SSD not supported)
• PCIe-attached drawer can move “intact”. Simply move attaching PCIe adapter to POWER8 PCIe slot
• Drives in POWER7 system unit or in #5802/5803 I/O drawer (SFF-1)
• Drives supported but need to be “retrayed”. Once on SFF-2 carrier/tray can use in #5887 EXP24S drawer.
• “Retraying” done either in advance on POWER7 server or during upgrade/migration to POWER8 server
• All disk drives supported. All 387GB or larger SSD supported (177GB SSD not supported)
• Drives in FC-attached SAN not impacted, just move FC card.
• PowerVM VIOS and Live Partition Mobility make the migration even easier
16
17
PCIe SAS Adapters

(Columns: #5901 | #5805/5903 | #5913 (paired) | #ESA1/#ESA2)
Write cache, effective:     0 | 380 MB | 1800 MB | N
Write cache, real:          0 | 380 MB | 1800 MB | N
PCI slots per adapter:      1 | 1 | 2 per pair | 1
Two cards required:         optional | required | required | optional
Rule of thumb - max HDD:    modest | 24-36+ | 72 | no
Rule of thumb - max SSD:    0 | 3-5 | 26 | 18-26
Minimum AIX support:        5.3 | 5.3 | 5.3 | 5.3
Minimum IBM i support:      6.1 | 6.1 | 6.1/7.1 | 6.1/7.1 (supported native only, no VIOS)
Cache battery maintenance:  N | Y | N | N
See the complete comparison table at
http://publib.boulder.ibm.com/infocenter/powersys/v3r1m5/topic/p7ebj/pciexpresssasraidcards.htm
PCIe3 RAID SAS Adapters Optimized for SSDs
• PCIe3 RAID SAS Adapter Quad-port 6Gb x8 (#EJ0J/EJ0M)
• PCIe3 12GB Cache RAID SAS Adapter Quad-port 6Gb x8 (#EJ0L)
18
FC EJ0J & EJ0M (LP) (CCIN 57B4)
(Figure: adapter layout - application-specific integrated circuit (ASIC) with the PPC now inside the ASIC; HD SAS ports T0-T3)
19
EJ0J & EJ0M Supported Configuration Summary
• SSDs and/or HDDs
• 48 SSDs max (specific rules for each enclosure)
• Single- or dual-controller configs (no write cache)
• RAID 0, 10, 5, 6 (JBOD also supported, on a single controller only)
• Max devices per array = 32
• SSDs/HDDs in EXP24S (#5887) or #5803
  • Specific rules for mixing SSDs and HDDs (cannot mix on a port)
  • #5803 mode 1 not supported with SSDs
• SAS tape supported on a single controller (cannot mix tape and disk on the same controller)
20
21
New 6Gb HD SAS Cables
• Narrower backshell than on the existing cables, due to the higher-density 4-gang SAS connectors
• Backwards compatible with current HD SAS adapters (57B5, 57C4)
• YO, X, AT, AA (*no AA cables with EJ0J & EJ0M)
• New feature codes and P/Ns; same lengths, same basic function
• New HD SAS cables for R/M: AE1 (4m) and YE1 (3m)

IBM PN 00E9345 - HD SAS connector plug
• Dust plug
• Strengthens the connector for cable plug/unplug
22
Single EJ0J/EJ0M with SSD in EXP24S (Mode 1)
(Figure: one EJ0J/EJ0M connected via a 6G YO cable to a #5887 in Mode 1 - up to 24 SSDs)
23
Dual EJ0J/EJ0M with Max 48 SSDs in EXP24S (Mode 1)
(Figure: two EJ0J/EJ0M adapters connected via 6G YO cables to two #5887 drawers in Mode 1)
24
Dual EJ0J/EJ0M to one EXP24S (Mode 2) with SSDs - Special High-IOPS Configuration
(Figure: two EJ0J/EJ0M adapters connected with 6G X cables to one #5887 in Mode 2 - 2-24 SSDs, 1-12 in each half drawer; max IOPS using only SSDs)
• No AA cables needed (no write cache)
• AIX & Linux only; no IBM i support for EXP24S Mode 2
25
Dual EJ0J/EJ0M to four EXP24S (Mode 1) with Max 96 HDDs
(Figure: two EJ0J/EJ0M adapters connected via 6G YO cables to four #5887 drawers in Mode 1; max 96 HDDs)
26
4 Single EJ0J/EJ0M to one EXP24S in 4-way split mode
(Figure: four EJ0J/EJ0M adapters connected via X cables to one #5887 in Mode 4, 6G SAS; drive bays D1-D24 zoned into four sets)
• Mode 4: 4 sets of 6 drives; dual X cables; up to 24 SSDs
• AIX & Linux only; no IBM i support
• The adapters do not see each other and can only see the 6 drives in the drive bays that are zoned for them to access
27
Dual EJ0J/EJ0M in a #5803 drawer to #5803 drawer drives
(Figure: two EJ0J/EJ0M adapters connected via 6G AT cables to #5803 drawers A and B; #5803 in Mode 2, with 1-13 HDDs or 1-13 SSDs in each half of the #5803)
• Field-support configuration only; not supported in the SBT/econfig tools
28
Single EJ0J/EJ0M to SAS media drawer (one adapter port to one device)
• 2 SAS tape drives per media drawer (4 max physically)
(Figure: one EJ0J/EJ0M connected to a media drawer holding devices 1-4, using 1-4 individual AE1 cables)
29
Single EJ0J/EJ0M to SAS media drawer - Alternate Configuration (one HBA port to two devices)
• 2 SAS tape drives per media drawer
• 8 tape drives max per adapter (physically)
(Figure: one EJ0J/EJ0M connected to a media drawer holding devices 1-4, using 1-4 YE1 cables)

PCIe3 12GB Cache RAID SAS Adapter Quad-port 6Gb x8 - FC EJ0L (CCIN 57CE)
30
31
EJ0L (57CE)
(Figure: EJ0L adapter - Crocodile ASIC with the PPC now inside the ASIC; HD SAS ports T0-T3; PowerGEM capacitor card)
32
Super Caps vs. Battery (flash-backed DRAM vs. battery-backed DRAM)
Super caps provide power long enough to copy data from write-cache memory to non-volatile flash on an abnormal power down/loss
• No battery maintenance
• No ~10-day retention limitation after an abnormal power off
• No 3-year end-of-life battery replacement
• Firmware monitors PowerGEM health; a hardware error is logged if card replacement is necessary
• Projected to last the life of the adapter
EJ0L - Supported Configs
• SSDs and/or HDDs
• 48 SSDs max (specific rules for each enclosure)
• Dual controller only
• Max devices per array = 32
• RAID 0, 10, 5, 6; no JBOD (except for initial formatting)
• EXP24S (#5887), #5803
  • Specific rules for mixing SSDs and HDDs
  • #5803 mode 1 not supported with SSDs
33
34
EJ0L
• Always paired, full-high, single-slot adapters
  • One exception: if AIX is running HACMP, the pair can be split between two servers, each server having one adapter
• Two AA cables are defaulted between the two EJ0Ls
  • The top ports (3rd and 4th) are reserved for the AA cables or for I/O drawers with HDDs
  • All four ports can have drive attachment
  • AA cables: 3m, 6m, 1.5m, and 0.6m
• Supports SAS HDD/SSD
• Drives located in #5803 12X I/O drawers or EXP24S SFF I/O drawers
• Drives NOT located in system units or the EXP12S
(Figure: EJ0L ports - T3 top, T2 mid, T1 mid, T0 bottom)
35
EJ0L to two EXP24S (Mode 1) with Max 48 SSDs
(Figure: an EJ0L pair connected via 6G YO cables to two #5887 drawers in Mode 1, with AA cables connected on the 3rd and 4th ports)
• Max 24 SSDs per EXP24S; max 48 SSDs per EJ0L pair
• No SSDs on the 3rd or 4th adapter ports
36
EJ0L to four EXP24S (Mode 1) with Max 96 HDDs
(Figure: an EJ0L pair connected via 6G YO cables to four #5887 drawers in Mode 1; max 96 HDDs)
• Both cables must go to the same port on the pair of adapters
37
Dual EJ0L in a #5803 drawer to #5803 drawer drives only
(Figure: two EJ0L adapters connected via 6G AT cables to #5803 drawers A and B, with 6G AA cables between the adapters; #5803 in Mode 2, with 1-13 HDDs or 1-13 SSDs in each half - 13 drives shown in half of the #5803)
• Field-support configuration only; not supported in the SBT/econfig tools
38
Configuration alternatives

• Single adapter configuration
  • If a VIOS, the array/hdisk can be split into multiple LV VSCSI disks
• Dual adapters in one LPAR
  • Provides availability for adapter failure
  • If a VIOS, the array/hdisk can be split into multiple LV VSCSI disks
  • Not supported for IBM i
• Dual adapters across LPARs
  • Clustered disk solution: two-node PowerHA; two-node Oracle RAC, GPFS, …
  • Or use of hdisks split among the LPARs
  • Similar to the single adapter configuration regarding availability without PowerHA
  • Not supported for IBM i
• Dual adapters across VIOSs
  • Provides availability for adapter or VIOS failure to the VIOCs
  • No splitting of RAID arrays into multiple LV VSCSI hdisks

(Figure: SAS adapter and drive (D1, D2, D3, …) placement for each alternative)
VIOS = VIO Server LPAR; VIOC = VIO Client LPAR
39
SAS with VIO VSCSI
(Figure: Power server with VIOSa and VIOSb, each with a SAS RAID adapter, serving AIX VIOCs; drives D1, D2, D3, D4, …)
• How is storage availability provided here?
  • Disk drive failure
    • Use OS mirroring across D1 and D2
    • Use SAS RAID 1, 5, 6, or 10 arrays
    • Use AIX LVM mirroring across JBOD or RAID 0 arrays
  • SAS RAID adapter failure
    • IO from the VIOC to the VIOS whose adapter failed is redirected to the other VIOS
  • VIOS failure
    • IO from the VIOC to that VIOS fails and is redirected to the other VIOS
• AIX uses the fail_over algorithm for a LUN, so its IOs go through only one VIOS; VIOSa can handle IOs for half the LUNs, and similarly for VIOSb
• Entire RAID arrays are allocated to the VIOCs (no LV VSCSI hdisks, just PV VSCSI hdisks)
40
Setting Parity Optimization steps for AIX/VIO
Setting which adapter handles which LUN…
• Changes can be made from the current primary/optimized adapter only
• Non-optimized adapters are the passive/secondary adapters
• Change "Preferred Access" to Optimized or Non-Optimized for the primary or secondary adapter, respectively
41
SAS setup for dual adapter configurations
• For JBOD: controller settings in dual adapter environments must be changed from Dual Initiator to JBOD HA Single Path via the SAS Disk Array Manager
• For RAID: assign a primary adapter for each hdisk if needed via smit (also called setting access optimization)
  • One adapter owns/handles all IOs for a RAID array
  • Balance RAID arrays across adapters, and/or make the primary adapter the one in the LPAR initiating the IO
• Once set up on one LPAR, verify the setting on the other LPAR; if necessary, run cfgmgr on the other LPAR
42
Active/passive nature of SAS
• IOs can be sent to either the active or the passive adapter
• If an adapter or its VIOS fails, the surviving adapter becomes the active adapter if it isn't already
• If an IO is sent to the passive SAS adapter, it is routed through the SAS network to the active adapter for the LUN
• Half the LUNs can be assigned to Adapter1 and half to Adapter2 to balance the use of resources
• Path priorities are best set at the VIOCs so that IOs for each LUN normally go to the VIOS with the active adapter
• AA cables are useful here - they reduce IO latency
  • They get IOs from the passive to the active adapter quicker
  • They get write data onto the cache of both adapters quicker
(Figure: Power server with VIOSa and VIOSb, each with a SAS RAID adapter, joined by an AA cable; AIX VIOC; drives D1-D4)
43
Disabling write cache
• From the Disk Array Manager menu -> Diagnostics and Recovery Options -> Change/Show SAS RAID Controller
• Adapter cache can be set to Disabled
• For adapters in HA configurations with SSDs, turning the cache off is sometimes best -- TEST

RAID vs. LVM, 528 vs. 512 vs. 4K
Questions from the field…
44
The Industry Trend to Larger Sectors

Rising bit density means smaller magnetic areas and more noise. The underlying (raw) disk media error rate is approaching 1 error in every thousand bits on average, while tiny media defects can lose hundreds of bytes in a row. Larger sectors enable more powerful ECC to fix those gaps.

A 512-byte sector can't carry enough ECC to correct for the higher raw error rates; hence the move to bigger sectors with stronger ECC, capable of detecting and correcting much larger errors - up to 400 bytes on a 4K sector.

The 4K sector enables disk manufacturers to keep cramming more bits onto a disk. Without it, the annual 40% capacity increases we've come to expect would stop.
45
512-byte, 528-byte and 4K sector sizes

• If we change the 512-byte sector size to 528 bytes, how does it affect the filesystem block size and features such as concurrent IO?
  The answer is: it doesn't. When HDDs/SSDs are formatted to 528 bytes/block, the host still sees the block size as 512. This is because the RAID adapter attaches the extra header/trailer bytes to every block as it is written and removes them from every block as it is read. The AIX filesystem does not see a difference.
• Likewise, native 4K HDDs/SSDs are formatted to 4224 bytes/block but are seen by AIX as 4096-byte blocks.
• The extra header/trailer bytes on every block are used for standard T10 Data Integrity Fields and other purposes.
  The T10 Protection Information model (also known as Data Integrity Field, or DIF) provides a means to protect the communication between the host adapter and the storage device.
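For illustration, the standard 8-byte T10 DIF tuple that rides in those extra bytes can be sketched as follows. This models only the standard DIF portion; the remaining header/trailer bytes are for other purposes, as the slide notes, and are not modeled here:

```python
# Sketch of the standard 8-byte T10 DIF appended to each sector: a 16-bit
# guard CRC of the data, a 16-bit application tag, and a 32-bit reference
# tag. The 528- and 4224-byte formats carry this plus other bytes.
import struct

def t10_dif(guard_crc, app_tag, ref_tag):
    return struct.pack(">HHI", guard_crc, app_tag, ref_tag)

print(len(t10_dif(0xBEEF, 0, 42)))   # 8 bytes of protection information
print(528 - 512, 4224 - 4096)        # 16 128: per-sector overhead in bytes
```

Note the per-sector overhead grows with the sector size (16 bytes on a 512-byte sector, 128 bytes on a 4K sector), which is why the formatted capacity differs between the 5xx and 4K feature codes.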
46
512? 528? 4K? What will I get?

• (#ESDP) - 600GB 15K RPM SAS SFF-2 Disk Drive - 5xx Block (AIX/Linux)
  2.5-inch (Small Form Factor (SFF)) 15K RPM SAS disk drive mounted in a Gen-2 carrier and supported in SAS SFF-2 bays. With 512-byte sectors (JBOD) the drive capacity is 600 GB. With 528-byte sectors (RAID) the drive capacity is 571 GB and the drive has additional data integrity protection. #ESDN and #ESDP are physically identical drives with the same CCIN; however, IBM Manufacturing always formats the #ESDN with 528-byte sectors, while #ESDP ships with either 512- or 528-byte formatting depending on how it is ordered. Reformatting a disk drive can take significant time, especially on larger-capacity drives.

• (#ELFP) - 600GB 15K RPM SAS SFF-2 4K Block - 4096 Disk Drive
  600 GB 2.5-inch (SFF) 15K RPM SAS disk drive on a Gen-2 carrier/tray, supported in SFF-2 SAS bays of the EXP24S drawer. The disk is formatted for 4096-byte sectors; if reformatted to 4224-byte sectors, capacity would be 571 GB.
47
HDD Format Shipments from IBM Manufacturing (bytes)

(Columns: P7/P7+ CECs (Gen1) | P7/P7+ EXP24S (Gen2S) | P8 S-series CEC (Gen3) | P8 S-series EXP24S (Gen2S))
AIX/Linux HDD 5xx features:     512 | 512 (528 w/#5913) | 528 | 528
AIX/Linux HDD service FRUs:     512 | 512 | 528 | 512
AIX/Linux HDD MES:              512 | 512 | 528 | 512
IBM i HDD 5xx features:         528 | 528 | 528 | 528
IBM i HDD service FRUs:         528 | 528 | 528 | 528
IBM i HDD MES:                  528 | 528 | 528 | 528
AIX/Linux HDD 4K features:      NA | NA | 4224 | 4224
AIX/Linux HDD 4K service FRUs:  NA | NA | 4224 | 4224
AIX/Linux HDD 4K MES:           NA | NA | 4224 | 4224
IBM i HDD 4K features:          NA | NA | 4224 | 4224
IBM i HDD 4K service FRUs:      NA | NA | 4224 | 4224
IBM i HDD 4K MES:               NA | NA | 4224 | 4224
48
49
SFF HDD Interchangeability - Can I reformat? …from April 2012 Deep Dive

IBM configuration/ordering systems are not aware of HDD interchangeability. AIX/Linux/VIOS can use either the 512-byte (JBOD) or the 528-byte (RAID) format. IBM i only uses 528-byte (RAID). Currently shipping SFF SAS drives are fully interchangeable between the two formats. Earlier drives shipped as 512-byte could not be used by IBM i even when reformatted to 528-byte.

Note: when the #1886 drive was first introduced, a few hundred were shipped with a different CCIN. This very small percentage of #1886 drives cannot be used by an IBM i partition even when formatted to 528-byte.

SAS SFF drives shipped formatted as 528-byte sectors, labeled "IBM i":
  15K RPM 139 GB: SFF-1 #1888 CCIN=198C | SFF-2 #1956 CCIN=19B0
  15K RPM 283 GB: SFF-1 #1879 CCIN=19A1 | SFF-2 #1947 CCIN=19B1
  10K RPM 283 GB: SFF-1 #1911 CCIN=198D | SFF-2 #1956 CCIN=19B7
  10K RPM 571 GB: SFF-1 #1916 CCIN=19A3 | SFF-2 #1962 CCIN=19B3

SAS SFF drives shipped formatted as 512-byte sectors, labeled "AIX":
  15K RPM 146 GB: SFF-1 #1886 CCIN=198C | SFF-2 #1917 CCIN=19B0
  15K RPM 300 GB: SFF-1 #1880 CCIN=19A1 | SFF-2 #1953 CCIN=19B1
  10K RPM 300 GB: SFF-1 #1885 CCIN=198D | SFF-2 #1925 CCIN=19B7
  10K RPM 600 GB: SFF-1 #1790 CCIN=19A3 | SFF-2 #1964 CCIN=19B3
4K and VIOS
• The block size of the virtual device matches that of the back-end physical device (the block size is not abstracted)
• The client OS must have support for this block size. AIX introduced support for 4K block sizes in AIX 7.1 TL3 SP3 (2Q 2014). Example: VSCSI presents the 4K block size to the client LPAR when the backing device is a logical volume on a disk that has a 4K block size. Earlier AIX releases see the disks, but can't use them at all.
• This 4K block size support was not added to previous AIX TLs.
50
JFS and 4k Disks
Current system documentation acknowledges restrictions for JFS on 4K disks:
http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.install/install_sys_backups.htm?lang=en
Installing system backupsFile systems are created on the target system at the same size as they were on the source system, unless the backup image was created with SHRINK set to yes in the image.data file, or you selected yes in the BOS Install menus. An exception is the /tmp directory, which can be increased to allocate enough space for the bosboot command. If you are installing the AIX® operating system from a system backup that uses the JFS file system, you cannot use a disk with 4K sector sizes.
http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.install/alt_disk_rootvg_cloning.htm?lang=en
Cloning the rootvg to an alternate diskIf your current rootvg uses the JFS file system, then the alternate disk cannot have 4K sector sizes.
JFS2 will not create a file system with block size less than the disk sector size.
51
Using single-disk RAID vs. JBOD - Which is better?
• A single-disk RAID 0 array has the same performance as JBOD
• Consider the effort of changing the sector size (i.e., reformatting) versus staying with what is delivered
• SSDs do not support JBOD
• Certain adapters and configurations do not support JBOD
52
Using (hardware) RAID vs. LVM to mirror rootvg - Which should I use?
• If a customer is all AIX, use LVM; if heterogeneous, RAID makes the environment more uniform
• Hardware RAID is probably technically faster
• RAID 1, 10, 5, 6 options (depending on cache)
53
Working with Multi-path IO
54
What is MPIO?
• MPIO is an architecture designed by AIX development (released in AIX V5.2)
• MPIO is also a commonly used acronym for Multi-Path IO (the AIX PCM, aka MPIO)
  – In this presentation, MPIO refers explicitly to the architecture, not the acronym
• Why was the MPIO architecture developed?
– With the advent of SANs, each disk subsystem vendor wrote their own multi-path code
– These multi-path code sets were usually incompatible
• Mixing disk subsystems was usually not supported on the same system, and if they were, they usually required their own FC adapters
– Integration with AIX IO error handling and recovery
• Several levels of IO timeouts: basic IO timeout, FC path timeout, etc
• MPIO architecture details available to disk subsystem vendors
– Compliant code requires a Path Control Module (PCM) for each disk subsystem
• AIX PCMs for SCSI and FC ship with AIX and are often used by the vendors
– MPIO allows vendors to develop their own path selection algorithms
– Disk vendors have been moving towards MPIO compliant code
55
MPIO Common Interface

Overview of MPIO Architecture
• LUNs show up as an hdisk
  – Architected for 32K paths
  – No more than 16 paths are necessary
• PCM: Path Control Module
  – AIX PCMs exist for FC and SCSI
  – Vendors may write optional PCMs
  – PCMs may provide commands to manage paths
• Allows various algorithms to balance the use of paths
• Full support for multiple paths to rootvg
56
Disk configuration
• Hdisks can be Available, Defined, or non-existent
• Paths can also be Available, Defined, Missing, or non-existent
• Path status can be Enabled, Disabled, or Failed if the path is Available (use the chpath command to change status)
• Add a path: e.g., after installing a new adapter and cable to the disk, run cfgmgr (or cfgmgr -l <adapter>)
• One must get the device layer correct before working with the path-status layer
Tip: to keep paths <= 16, group sets of 4 host ports and 4 storage ports and balance LUNs across them
57
• The disk vendor…
  • Dictates what multi-path code can be used
  • Supplies the filesets for the disks and the multipath code
  • Supports the components that they supply
    Hitachi: https://tuf.hds.com/instructions/servers/AIXODMUpdates.php
    EMC: ODM filesets are available via EMC Powerlink (registration required)
• A fileset is loaded to update the ODM to support the storage
  • AIX then recognizes and appropriately configures the disk
  • Without this, disks are configured using a generic ODM definition, and performance and error handling may suffer as a result
  • # lsdev -Pc disk displays supported storage
• The multi-path code will be a different fileset, unless using the MPIO that's included with AIX
• Beware of the generic "Other" disk definition: no command queuing, poor performance and error handling
How many paths for a LUN?
58
Server
FC Switch
Storage
• Paths per LUN = (# of paths from server to switch) x (# of paths from storage to switch)
• Here there are potentially 6 paths per LUN, but the count is reduced via:
• LUN masking at the storage
Assign LUNs to specific FC adapters at the host, and thru specific ports on the storage
• Zoning
WWPN or SAN switch port zoning
• Dual SAN fabrics
Divides the potential paths by two
• 4 paths per LUN are sufficient for availability and reduce the CPU overhead of choosing a path
• Path selection overhead is relatively low (usually negligible)
• MPIO has no practical limits to the number of paths
• Other products have path limits
• SDDPCM limited to 16 paths per LUN
How many paths for a LUN?, cont’d
Dual SAN Fabric for SAN Zoning Reduces Potential Paths
59
4 X 4 = 16 paths
Server
FC Switch
Storage
2 X 2 + 2 X 2 = 8 paths
� With single initiator to single target zoning, both examples would have 4 paths
� A popular approach is to use 4 host and 4 storage ports, zoning one host port to one storage port, yielding 4 paths
Fabric 1 Fabric 2
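The path arithmetic above can be sketched in a few lines of shell; the 4x4 port counts are the ones from the diagrams, not a recommendation.

```shell
# Paths per LUN = (host ports on a fabric) x (storage ports on that fabric),
# summed over fabrics. On a single fabric, every host port sees every storage port.
paths_single_fabric() { echo $(( $1 * $2 )); }           # $1=host ports, $2=storage ports
paths_dual_fabric()   { echo $(( ($1/2) * ($2/2) * 2 )); } # ports split evenly across 2 fabrics

echo "single fabric, 4x4: $(paths_single_fabric 4 4) paths"  # 16
echo "dual fabric,   4x4: $(paths_dual_fabric 4 4) paths"    # 2x2 + 2x2 = 8
```

Single-initiator-to-single-target zoning reduces either layout further, to 4 paths.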
Path selection benefits and costs
60
� Path selection algorithms choose a path to minimize the latency added to
an IO to send it over the SAN to the storage
� Latency to send a 4 KB IO over an 8 Gbps SAN link is
4 KB / (8 Gb/s x 0.1 B/b x 1048576 KB/GB) = 0.0048 ms
� Multiple links may be involved, and IOs are round trip
� As compared to fastest IO service times around 1 ms
� If the links aren’t busy, there likely won’t be much, if any, savings from
use of sophisticated path selection algorithms vs. round robin
� Costs of path selection algorithms (could outweigh latency savings)
� CPU cycles to choose the best path
� Memory to keep track of in-flight IOs down each path, or
� Memory to keep track of IO service times down each path
� Latency added to the IO to choose the best path
Generally utilization of links is low
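The wire-time figure above can be reproduced with a short awk sketch; the 0.1 B/b factor models 8b/10b encoding (10 bits on the wire per byte of data).

```shell
# Wire time for one IO on an FC link (8b/10b encoding: 10 bits per byte)
link_ms() {  # $1 = IO size in KB, $2 = link speed in Gbps
  awk -v kb="$1" -v gbps="$2" 'BEGIN {
    kb_per_sec = gbps * 0.1 * 1048576      # Gb/s -> GB/s (8b/10b) -> KB/s
    printf "%.4f", kb / kb_per_sec * 1000  # seconds -> ms
  }'
}
echo "$(link_ms 4 8) ms"   # ~0.0048 ms, tiny next to ~1 ms IO service times
```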
Balancing IOs with algorithms fail_over and round_robin
– Any load balancing algorithm must consume CPU and memory resources to
determine the best path to use.
– Using path priorities, it is possible to set up fail_over LUNs so that the loads are
balanced across the available FC adapters.
– Let's use an example with 2 FC adapters. Assume we correctly lay out our data so
that the IOs are balanced across the LUNs (this is usually a best practice). Then if
we assign half the LUNs to FC adapterA and half to FC adapterB, then the IOs are
evenly balanced across the adapters!
– A question to ask is, “If one adapter is handling more IO than another, will this have a
significant impact on IO latency?”
– Since the FC adapters are capable of handling more than 50,000 IOPS then we're
unlikely to bottleneck at the adapter and add significant latency to the IO.
61
A fail_over algorithm can be efficiently used to balance IOs!
round_robin may more easily ensure balanced IOs across the links for each LUN
● e.g., if the IOs to the LUNs aren't balanced, then it may be difficult to balance the LUNs and their IO rates across the adapter ports with fail_over
● requires fewer resources than load balancing
Multi-path IO with VIO and VSCSI LUNs
• Two layers of multi-path code: VIOC and VIOS
• VSCSI disks always use AIX PCM and all IO for a LUN normally goes to one VIOS
– algorithm = fail_over only
• Set the path priorities for the VSCSI hdisks so half use one VIOS, and half use the other
• VIOS uses the multi-path code specified for the disk subsystem
• Typical setup: Set the vscsi device’s vscsi_err_recov attribute to fast_fail. The default is delayed_fail. This will speed up path failover in the event of VIOS failure.
62
VIO ClientAIX PCM
VIO ServerMulti-path code
VIO ServerMulti-path code
Disk Subsystem
Multi-path IO with VIO and NPIV
• One layer of multi-path code
– round_robin is an appropriate load-balancing scheme
• VIOC has virtual FC adapters (vFC)
– Potentially one vFC adapter for every real FC
adapter in each VIOC
– Maximum of 64 vFC adapters per real FC adapter
recommended
• VIOC uses multi-path code that the disk subsystem
supports
• IOs for a LUN can go thru both VIOSs
63
VIO ClientMulti-path code
VIO Server VIO Server
Disk Subsystem
VFC VFC VFC VFC
HBA HBA HBA HBA
Multi-path code sets that are incompatible on a single LPAR can still be used on different VIOC
LPARs sharing the same physical adapter via NPIV, provided the incompatible sets aren't mixed on
the same LPAR. E.g. PowerPath + EMC and MPIO + DS8000.
Active/Active, Active/Passive and Asymmetric Logical Unit Access (ALUA) Disk Subsystem Controllers
• Active/Active controllers
– IOs can be sent to any controller for a LUN
– DS8000, DS6000 and XIV
• Active/Passive controllers
– IOs for a LUN are sent to the primary controller for the LUN, except in failure scenarios
– The storage administrator balances LUNs across the controllers
• Controllers should be active for some LUNs and passive for others
– DS3/4/5000
• ALUA – Asymmetric Logical Unit Access
– IOs can be sent to any controller, but one controller is preferred (IOs passed to primary)
• Preferred due to performance considerations
– SVC, V7000 and NSeries/NetApp
• Using ALUA on NSeries/NetApp is preferred
– Set on the storage
• MPIO supports Active/Passive and Active/Active disk subsystems
– SVC and V7000 are treated as Active/Passive
• Terminology regarding active/active and active/passive varies considerably
64
MPIO support
Storage Subsystem Family MPIO code Multi-path algorithm
IBM ESS, DS6000, DS8000,
DS3950, DS4000, DS5000,
SVC, V7000
IBM Subsystem Device
Driver Path Control
Module (SDDPCM) or AIX
PCM
fail_over, round_robin and for
SDDPCM: load balance, load
balance port
DS3/4/5000 in VIOS
AIX FC PCM recommended
fail_over, round_robin
IBM XIV Storage System AIX FC PCM fail_over, round_robin
IBM System Storage N Series AIX FC PCM fail_over, round_robin
EMC Symmetrix AIX FC PCM fail_over, round_robin
HP & HDS
(varies by model)
Hitachi Dynamic Link
Manager (HDLM)
fail_over, round robin,
extended round robin
AIX FC PCM fail_over, round_robin
SCSI AIX SCSI PCM fail_over, round_robin
VIO VSCSI AIX SCSI PCM fail_over
65
Non-MPIO multi-path code
Storage subsystem family Multi-path code
IBM DS6000, DS8000, SVC, V7000 SDD
IBM DS4000 Redundant Disk Array Controller (RDAC)
EMC Power Path
HP AutoPath
HDS HDLM (older versions)
Veritas-supported storage Dynamic MultiPathing (DMP)
66
AIX Path Control Module (PCM) IO basics
67
The AIX PCM…
� Is part of the MPIO architecture
� Chooses the path each IO will take
� Is used to balance the use of resources used to connect to the storage
� Depends on the algorithm attribute for each hdisk
� Handles path failures to ensure availability with multiple paths
� Handles path failure recovery
� Checks the status of paths
� Supports boot disks
� Not all multi-path code sets support boot disks
� Offers PCMs for both Fibre Channel and SCSI protocol disks
� Supports active/active, active/passive and ALUA disk subsystems
� But not all disk subsystems
� Supports SCSI-2 and SCSI-3 reserves
� SCSI reserves are often not used
68
Path management with AIX PCM
� Includes examining, adding, removing, enabling and disabling paths
► Adapter failure/replacement or addition
► Planned VIOS outages
► Cable failure and replacement
► Storage controller/port failure and repair
� Adapter replacement
► If the adapter has failed, its paths will not be in use and will be in the Failed state
1. Remove the adapter and its child devices including the paths using the adapter with
# rmdev -Rdl <fcs#>
2. Replace the adapter
3. cfgmgr
4. Check the paths with lspath
� It’s better to stop using a path ahead of time when you know it will disappear
► Avoid timeouts, application delays or performance impacts and potential error
recovery bugs
► To disable all paths using a specific FC port on the host:
# chpath -l hdisk1 -p <parent> -s disable
SDDPCM: An Overview
• SDDPCM = Subsystem Device Driver Path Control Module
• SDDPCM is MPIO compliant and can be used with IBM DS6000, DS8000,
DS4000 (most models), DS5000, DS3950, V7000 and the SVC
– A “host attachment” fileset (populates the ODM) and SDDPCM fileset are both installed
– Host attachment: devices.fcp.disk.ibm.mpio.rte
– SDDPCM: devices.sddpcm.<version>.rte
• LUNs show up as hdisks, paths shown with pcmpath or lspath commands
– 16 paths per LUN supported
• Provides a PCM per the MPIO architecture
• One installs SDDPCM or SDD, not both.
69
SDDPCM
• Load balancing algorithms
– rr - round robin
– lb - load balancing based on in-flight IOs per adapter
– fo - failover policy
– lbp - load balancing port (for ESS, DS6000, DS8000, V7000 and SVC
only) based on in-flight IOs per adapter and per storage port
• The pcmpath command is used to examine hdisks, adapters, paths, hdisk statistics,
path statistics, adapter statistics; to dynamically change the load balancing algorithm,
and to perform other administrative tasks such as adapter replacement.
• SDDPCM automatically recovers failed paths that have been repaired via the pcmsrv
daemon
– MPIO health checking can also be used, and can be dynamically set via
the pcmpath command. This is recommended. Set the hc_interval to a
non-zero value to turn on path health checking
70
71
Path management with SDDPCM and the pcmpath command
# pcmpath query adapter
# pcmpath query device
# pcmpath query port
# pcmpath query devstats
# pcmpath query adaptstats
# pcmpath query portstats
# pcmpath query essmap
# pcmpath set adapter …
# pcmpath set device path …
# pcmpath set device algorithm
# pcmpath set device hc_interval
# pcmpath disable/enable ports …
# pcmpath query wwpn
And more
� SDD offers the similar datapath command
List adapters and status
List hdisks and paths
List DS8000/DS6000/SVC… ports
List hdisk/path IO statistics
List adapter IO statistics
List DS8000/DS6000/SVC port statistics
List rank, LUN ID and more for each hdisk
Disable/enable paths to adapter
Disable/enable paths to a hdisk
Dynamically change path algorithm
Dynamically change health check interval
Disable/enable paths to a disk port
Display all FC adapter WWPNs
72
Path management with SDDPCM and the pcmpath command
# pcmpath query device
…
DEV#: 2 DEVICE NAME: hdisk2 TYPE: 2145 ALGORITHM: Load Balance
SERIAL: 600507680190013250000000000000F4
==========================================================================
Path# Adapter/Path Name State Mode Select Errors
0 fscsi0/path0 OPEN NORMAL 40928736 0
1* fscsi0/path1 OPEN NORMAL 16 0
2 fscsi2/path4 OPEN NORMAL 43927751 0
3* fscsi2/path5 OPEN NORMAL 15 0
4 fscsi1/path2 OPEN NORMAL 44357912 0
5* fscsi1/path3 OPEN NORMAL 14 0
6 fscsi3/path6 OPEN NORMAL 43050237 0
7* fscsi3/path7 OPEN NORMAL 14 0
…
• * Indicates path to passive controller
• 2145 is a SVC which has active/passive nodes for a LUN
• DS4000, DS5000, V7000 and DS3950 also have active/passive controllers
• IOs will be balanced across paths to the active controller
73
Path management with SDDPCM and the pcmpath command
# pcmpath query devstats
Total Dual Active and Active/Asymmetric Devices : 67
DEV#: 2 DEVICE NAME: hdisk2
===============================
Total Read Total Write Active Read Active Write Maximum
I/O: 169415657 2849038 0 0 20
SECTOR: 2446703617 318507176 0 0 5888
Transfer Size: <= 512 <= 4k <= 16K <= 64K > 64K
183162 67388759 35609487 46379563 22703724
…
• Maximum value useful for tuning hdisk queue depths
• “20” is the maximum in-flight requests for the IOs shown
• Increase queue depth until the queue is not filling up or until IO service times suffer (bottleneck is pushed to the subsystem)
• writes > 3 ms
• reads > 15-20 ms
• See References for queue depth tuning whitepaper
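The tuning check above can be sketched as a tiny helper; the values plugged in are the slide's devstats Maximum of 20 and an assumed queue_depth of 20 (on a real system, read queue_depth from lsattr -El <hdisk>).

```shell
# Compare the devstats "Maximum" in-flight count against the hdisk queue_depth.
queue_check() {  # $1 = Maximum from pcmpath query devstats, $2 = queue_depth
  if [ "$1" -ge "$2" ]; then
    echo "queue filling: consider raising queue_depth (watch service times)"
  else
    echo "queue not filling: queue_depth adequate"
  fi
}
queue_check 20 20   # slide's Maximum of 20 vs. an assumed queue_depth of 20
```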
SDD & SDDPCM: Getting Disks configured correctly
• Install the appropriate filesets
– SDD or SDDPCM for the required disks (and the host attachment fileset)
– If you are using SDDPCM, also install the MPIO fileset that comes with AIX:
• devices.common.IBM.mpio.rte
– Host attachment scripts:
• http://www.ibm.com/support/dlsearch.wss?rs=540&q=host+scripts&tc=ST52G7&dc=D410
• Reboot or start the sddsrv/pcmsrv daemon
• smitty disk -> List All Supported Disk
– Displays disk types for which software support has been installed
• Or # lsdev -Pc disk | grep MPIO
disk mpioosdisk fcp MPIO Other FC SCSI Disk Drive
disk 1750 fcp IBM MPIO FC 1750 …DS6000
disk 2105 fcp IBM MPIO FC 2105 …ESS
disk 2107 fcp IBM MPIO FC 2107 …DS8000
disk 2145 fcp MPIO FC 2145 …SVC
disk DS3950 fcp IBM MPIO DS3950 Array Disk
disk DS4100 fcp IBM MPIO DS4100 Array Disk
disk DS4200 fcp IBM MPIO DS4200 Array Disk
disk DS4300 fcp IBM MPIO DS4300 Array Disk
disk DS4500 fcp IBM MPIO DS4500 Array Disk
disk DS4700 fcp IBM MPIO DS4700 Array Disk
disk DS4800 fcp IBM MPIO DS4800 Array Disk
disk DS5000 fcp IBM MPIO DS5000 Array Disk
disk DS5020 fcp IBM MPIO DS5020 Array Disk
74
75
www-01.ibm.com/support/docview.wss?rs=540&uid=ssg1S7001350#AIXSDDPCM
76
Comparing AIX PCM & SDDPCM
Feature/Function AIX PCM SDDPCM
How obtained Included with VIOS and AIX Downloaded from IBM website
Supported Devices
Supports most disk devices that the AIX operating system and VIOS PowerVM firmware support, including selected third-party devices
Supports specific IBM devices and is referenced by the particular device support statement. The supported devices differ between AIX and PowerVM VIOS
OS Integration Considerations
Update levels are provided and are updated and migrated as a mainline part of all the normal AIX and VIOS service strategy and upgrade/migration paths
Add-on software entity that has its own update strategy and process for obtaining fixes. The customer must manage coexistence levels between both the mix of devices, operating system levels and VIOS levels. NOT a licensed program product.
Path Selection Algorithms
Fail over (default)Round Robin (excluding VSCSI disks)
Fail overRound RobinLoad Balancing (default)Load Balancing Port
Algorithm Selection Disk access must be stopped in order to change algorithm Dynamic
SAN boot, dump, paging support Yes Yes. Restart required if SDDPCM is installed after the
AIX PCM and SDDPCM boot is desired.
PowerHA & GPFS Support Yes Yes
Utilities standard AIX performance monitoring tools such as iostat and fcstat
Enhanced utilities (pcmpath commands) to show mappings from adapters, paths, devices, as well as performance and error statistics
77
Mixing multi-path code sets
• The disk subsystem vendor specifies what multi-path code is supported for their storage
– The disk subsystem vendor supports their storage, the server vendor generally doesn’t
• You can mix multi-path code compliant with MPIO and even share adapters
– There may be exceptions. Contact vendor for latest updates.
HP example: “Connection to a common server with different HBAs requires separate
HBA zones for XP, VA, and EVA”
• Generally one non-MPIO compliant code set can exist with other MPIO compliant code sets
– Except that SDD and RDAC can be mixed on the same LPAR
– The non-MPIO compliant code must be using its own adapters
• Except RDAC can share adapter ports with MPIO
• Devices of a given type use only one multi-path code set
– e.g., you can’t use SDDPCM for one DS8000 and SDD for another DS8000 on the same
AIX instance
78
Sharing Fibre Channel Adapter ports
• Disk using MPIO compliant code sets
can share adapter ports
• It’s recommended that disk and tape
use separate ports
79
Disk (typically small block random) and tape (typically large block sequential) IO are different, and stability issues have
been seen at high IO rates
MPIO Command Set
• lspath – list paths, path status, path ID, and path attributes for a disk
• chpath – change path status or path attributes
– Enable or disable paths
• rmpath – delete or change path state
– Putting a path into the defined mode means it won’t be used (from available to
defined)
– One cannot define/delete the last path of an open device
• mkpath – add another path to a device or makes a defined path available
– Generally cfgmgr is used to add new paths
• chdev – change a device’s attributes (not specific to MPIO)
• cfgmgr – add new paths to an hdisk or make defined paths available
(not specific to MPIO)
80
Useful MPIO Commands
• List status of the paths and the parent device (or adapter)
# lspath -Hl <hdisk#>
• List connection information for a path
# lspath -l hdisk2 -F"status parent connection path_status path_id"
Enabled fscsi0 203900a0b8478dda,f000000000000 Available 0
Enabled fscsi0 201800a0b8478dda,f000000000000 Available 1
Enabled fscsi1 201900a0b8478dda,f000000000000 Available 2
Enabled fscsi1 203800a0b8478dda,f000000000000 Available 3
• The connection field contains the storage port WWPN
– In the case above, each adapter has paths to two storage ports, e.g. for fscsi0:
203900a0b8478dda
201800a0b8478dda
• List a specific path's attributes
# lspath -AEl hdisk2 -p fscsi0 -w "203900a0b8478dda,f000000000000"
scsi_id 0x30400 SCSI ID False
node_name 0x200800a0b8478dda FC Node Name False
priority 1 Priority True
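A quick way to see which storage ports a disk's paths use is to cut the WWPN out of the connection field. This sketch runs against the sample lspath output above, copied into a variable; on a live system you would pipe lspath output in instead.

```shell
# List the distinct storage-port WWPNs from the lspath connection field
# (field 3 once the line is split on spaces and the comma).
lspath_out='Enabled fscsi0 203900a0b8478dda,f000000000000 Available 0
Enabled fscsi0 201800a0b8478dda,f000000000000 Available 1
Enabled fscsi1 201900a0b8478dda,f000000000000 Available 2
Enabled fscsi1 203800a0b8478dda,f000000000000 Available 3'

echo "$lspath_out" | awk -F'[ ,]' '{ print $3 }' | sort -u
```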
81
Path priorities
• A Priority attribute for paths can be used to specify a preference for path IOs. How it works depends on whether the hdisk’s algorithm attribute is set to fail_over or round_robin.
The value specified is inverse to the priority, i.e. “1” is the highest priority
• algorithm=fail_over
– the path with the lowest priority value handles all the IOs unless there’s a path failure.
– Set the primary path by giving it a priority value of 1, the next path (used in case of path failure) a value of 2, and so on.
– if the path priorities are the same, the primary path will be the first listed for the hdisk in the CuPath ODM as shown by # odmget CuPath
• algorithm=round_robin
– If the priority attributes are the same, then IOs go down each path equally.
– In the case of two paths, if you set path A’s priority to 1 and path B’s to 255, then for every IO going down path A, there will be 255 IOs sent down path B.
• To change the path priority of an MPIO device on a VIO client:
# chpath -l hdisk0 -p vscsi1 -a priority=2
–Set path priorities for VSCSI disks to balance use of VIOSs
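The VIOS balancing above can be scripted by alternating the priority pair across the LUN list. The hdisk and vscsi names below are placeholders, and the sketch only prints the chpath commands rather than running them; review and substitute your own device names first.

```shell
# Print chpath commands that alternate VSCSI path priority across two VIOSs:
# even-numbered entries prefer vscsi0, odd-numbered entries prefer vscsi1.
gen_vscsi_priorities() {
  i=0
  for d in "$@"; do
    if [ $((i % 2)) -eq 0 ]; then p0=1 p1=2; else p0=2 p1=1; fi
    echo "chpath -l $d -p vscsi0 -a priority=$p0"
    echo "chpath -l $d -p vscsi1 -a priority=$p1"
    i=$((i + 1))
  done
}
gen_vscsi_priorities hdisk0 hdisk1 hdisk2 hdisk3
```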
82
Path priorities
Note that a lower value for the path priority is a higher priority
# lsattr -El hdisk9
PCM PCM/friend/otherapdisk Path Control Module False
algorithm fail_over Algorithm True
hcheck_interval 60 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
lun_id 0x5000000000000 Logical Unit Number ID False
node_name 0x20060080e517b6ba FC Node Name False
queue_depth 10 Queue DEPTH True
reserve_policy single_path Reserve Policy True
ww_name 0x20160080e517b6ba FC World Wide Name False
…
# lspath -l hdisk9 -F"parent connection status path_status"
fscsi1 20160080e517b6ba,5000000000000 Enabled Available
fscsi1 20170080e517b6ba,5000000000000 Enabled Available
# lspath -AEl hdisk9 -p fscsi1 -w"20160080e517b6ba,5000000000000"
scsi_id 0x10a00 SCSI ID False
node_name 0x20060080e517b6ba FC Node Name False
priority 1 Priority True
Note: whether or not path priorities apply depends on the PCM. With SDDPCM, path priorities only apply when the algorithm used is fail over (fo). Otherwise, they aren’t used.
83
Path priorities – why change them?
• With VIOCs, send the IOs for half the LUNs to one VIOS and half to the other
–Set priorities for half the LUNs to use VIOSa/vscsi0 and half to use VIOSb/vscsi1
–Uses both VIOSs’ CPU and virtual adapters
–algorithm=fail_over is the only option at the VIOC for VSCSI disks
• With NSeries – have the IOs go to the primary controller for the LUN if not using ALUA (ALUA is preferred)
–When not using ALUA, use the dotpaths utility to set path priorities to ensure most IOs go to the preferred controller
84
Hints & Tips
Here’s a good command to determine which VIOS a vscsi adapter is connected to:
# echo "cvai" | kdb
…
NAME STATE CMDS_ACTIVE ACTIVE_QUEUE HOST
vscsi0 0x000007 0x0000000000 0x0 vios1->vhost0
vscsi1 0x000007 0x0000000000 0x0 vios2->vhost0
This will be especially useful if you have more than one vscsi adapter connected to a VIOS.
85
Path Health Checking and Recovery
Validates a path is working & automates recovery of failed paths
Note: applies to open disks only
• For SDDPCM and MPIO compliant disks, two hdisk attributes apply:
# lsattr -El hdisk26
hcheck_interval 0 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
• hcheck_interval – Defines how often (1-3600 seconds) the health check is performed on the paths for a device.
When a value of 0 is selected (the default), health checking is disabled– Preferably set to at least 2X IO timeout value…often 30 seconds
• hcheck_mode– Determines which paths should be checked when the health check capability is used:
• enabled: Sends the healthcheck command down paths with a state of enabled • failed: Sends the healthcheck command down paths with a state of failed• nonactive: (Default) Sends the healthcheck command down paths that have no active I/O,
including paths with a state of failed. If the algorithm selected is failover, then the healthcheck command is also sent on each of the paths that have a state of enabled but have no active IO. If the algorithm selected is round_robin, then the healthcheck command is only sent on paths with a state of failed, because the round_robin algorithm keeps all enabled paths active with IO.
• Consider setting up error notification for path failures (later slide)
86
Path Recovery
• MPIO will recover failed paths if path health checking is enabled with
hcheck_mode=nonactive or failed and the device has been opened
• Trade-offs exist:
– Lots of path health checking can create a lot of SAN traffic
– Automatic recovery requires turning on path health checking for each LUN
– Lots of time between health checks means paths will take longer to recover after repair
– Health checking for a single LUN is often sufficient to monitor all the physical paths, but not to recover them
• SDD and SDDPCM also recover failed paths automatically
• In addition, SDDPCM provides a health check daemon to provide an automated method of reclaiming failed paths to a closed device.
• To manually enable a failed path after repair or re-enable a disabled path:
# chpath -l hdisk1 -p <parent> -w <connection> -s enable
or run cfgmgr or reboot
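With many failed paths after a repair, the chpath commands can be generated from lspath output rather than typed one by one. This is a sketch: the -F"name parent connection status" field order is an assumption, the sample lines are hard-coded, and the commands are printed for review, not executed.

```shell
# Emit a "chpath ... -s enable" command for every path whose status is Failed.
gen_enables() {
  awk '$4 == "Failed" {
    printf "chpath -l %s -p %s -w %s -s enable\n", $1, $2, $3
  }'
}

# Sample lines standing in for: lspath -F"name parent connection status"
printf '%s\n' \
  'hdisk1 fscsi0 203900a0b8478dda,f000000000000 Failed' \
  'hdisk1 fscsi1 201800a0b8478dda,f000000000000 Enabled' | gen_enables
```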
87
Path Recovery With Flaky Links
• When a path fails, it takes AIX time to recognize it, and to redirect in-flight IOs previously sent
down the failed path
– IO stalls during this time, along with processes waiting on the IO
– Turning off a switch port results in a 20 second stall
• Other types of failures may take longer
– AIX must distinguish between slow IOs and path failures
• With flaky paths that go up and down, this can be a problem
• The MPIO timeout_policy attribute for hdisks addresses this for command timeouts
– IZ96396 for AIX 7.1, IZ96302 for AIX 6.1
– timeout_policy=retry_path Default and similar to before the attribute existed. The first occurrence of a command timeout on the path does not cause immediate path failure.
– timeout_policy=fail_path Fail the path on the first command timeout; recover it only after several clean health checks
– timeout_policy=disable_path Disable the path and leave it that way
• Manual intervention will be required so be sure to use error notification in this case
• SDDPCM recoverDEDpath attribute – similar to timeout_policy but for all kinds of path errors
– recoverDEDpath=no Default and failed paths stay that way
– recoverDEDpath=yes Allows failed paths to be recovered
– SDDPCM V2.6.3.0 or later
88
Path Health Checking and Recovery – Notification!
• One should also set up error notification for path failure, so that someone knows
about it and can correct it before something else fails.
• This is accomplished by determining the error that shows up in the error log when a
path fails (via testing), and then
• Adding an entry to the errnotify ODM class for that error which calls a script (that
you write) that notifies someone that a path has failed.
Hint: You can use # odmget errnotify to see what the entries (or stanzas) look like,
then you create a stanza and use the odmadd command to add it to the errnotify
class.
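The hint above can be sketched as follows. The error label SC_DISK_PCM_ERR8 comes from the error table on the next slide, but the en_name, the notification script path, and the choice of label are examples only; use whatever label your own failure testing shows in errpt, and the odmadd step is printed rather than run.

```shell
# Build an errnotify stanza that calls a notification script on path failure.
# $1 passed to en_method is the error log sequence number.
cat > /tmp/path_notify.add <<'EOF'
errnotify:
    en_name = "path_fail_notify"
    en_persistenceflg = 1
    en_label = "SC_DISK_PCM_ERR8"
    en_method = "/usr/local/bin/notify_path_fail.sh $1"
EOF
echo "then load it with: odmadd /tmp/path_notify.add"   # odmadd not run here
```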
89
Notification, cont’d
Path and Fibre Channel Related Errors (examples)
# errpt -t | egrep "PATH|FCA"
02A8BC99 SC_DISK_PCM_ERR8 PERM H PATH HAS FAILED
080784A7 DISK_ERR6 PERM H PATH HAS FAILED
13484BD0 SC_DISK_PCM_ERR16 PERM H PATH ID
14C8887A FCA_ERR10 PERM H COMMUNICATION PROTOCOL ERROR
1D20EC72 FCA_ERR1 PERM H ADAPTER ERROR
1F22F4AA FCA_ERR14 TEMP H DEVICE ERROR
278804AD FCA_ERR5 PERM S SOFTWARE PROGRAM ERROR
2BD0BD1A FCA_ERR9 TEMP H ADAPTER ERROR
3B511B1A FCA_ERR8 UNKN H UNDETERMINED ERROR
40535DDB SC_DISK_PCM_ERR17 PERM H PATH HAS FAILED
7BFEEA1F FCA_ERR4 TEMP H LINK ERROR
84C2184C FCA_ERR3 PERM H LINK ERROR
9CA8C9AD SC_DISK_PCM_ERR12 PERM H PATH HAS FAILED
A6F5AE7C SC_DISK_PCM_ERR9 INFO H PATH HAS RECOVERED
D666A8C7 FCA_ERR2 TEMP H ADAPTER ERROR
DA930415 FCA_ERR11 TEMP H COMMUNICATION PROTOCOL ERROR
DE3B8540 SC_DISK_ERR7 PERM H PATH HAS FAILED
E8F9BA61 CRYPT_ERROR_PATH INFO H SOFTWARE PROGRAM ERROR
ECCE4018 FCA_ERR6 TEMP S SOFTWARE PROGRAM ERROR
F29DB821 FCA_ERR7 UNKN H UNDETERMINED ERROR
FF3E9550 FCA_ERR13 PERM H DEVICE ERROR
90
# errpt -atJ FCA_ERR4
---------------------------------------------------------------------------
IDENTIFIER 7BFEEA1F
Label: FCA_ERR4
Class: H
Type: TEMP
Loggable: YES   Reportable: YES   Alertable: NO
Description
LINK ERROR
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
SENSE DATA
91
Options for Error Notification in AIX
• ODM-Based
errdemon program uses the errnotify ODM class for error notification
• diag Command Diagnostics
The diag command package contains a periodic diagnostic procedure called diagela. Hardware (only) errors generate mail messages to members of the system group, or other email
addresses, as configured.
• Custom Notification
Write a shell script to check the error log periodically
• Concurrent Error Logging
Start errpt -c and each error is then reported when it occurs. Can redirect output to the console to notify the operator.
Storage Area Network (SAN) Boot
92
Boot Directly from SAN
� Storage is zoned directly to the client
� HBAs used for boot and/or data access
� Multi-path code for the storage runs in client
SAN Sourced VSCSI Boot
� Affected LUNs are zoned to VIOS(s) and mapped to client from the VIOS
� VIOC uses AIX PCM
� VIOS uses multi-path code specified by the storage
� Two layers of multi-path code
[Diagrams: AIX multipath code with dual FC adapters direct to the SAN; AIX MPIO over dual VSCSI adapters through two VIOSs; AIX multipath code over virtual FC adapters through two NPIV-enabled VIOSs]
NPIV Boot
� Affected LUNs are zoned to VIOS(s) and mapped to client from the VIOS
� VIOC uses multi-path code specified by the storage
Monitoring, Measuring and Basic IO Tuning for SAN Storage from AIX
93
Monitoring SAN storage performance
• For random IO, look at read and write service times from
# iostat -RDTl <interval> <# intervals>
Disks: xfers read write
-------------- -------------------------------- ------------------------------------ ------------------------------------ ----------
%tm bps tps bread bwrtn rps avg min max time fail wps avg min max time fail avg
act serv serv serv outs serv serv serv outs time t
hdisk1 0.9 12.0K 2.5 0.0 12.0K 0.0 0.0 0.0 0.0 0 0 2.5 9.2 0.6 92.9 0 0 3.7 0
hdisk0 0.8 12.1K 2.6 119.4 12.0K 0.0 4.4 0.1 12.1 0 0 2.5 8.7 0.8 107.0 0 0 3.3 0
hdisk2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0 0 0.0 0
hdisk3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0 0 0.0 0
hdisk4 66.4 58.9M 881.1 58.9M 28.8K 879.1 6.4 0.1 143.5 0 0 2.0 1.8 0.2 27.5 0 0 54.7 0
hdisk6 66.3 51.1M 797.4 51.0M 24.5K 795.9 7.6 0.1 570.1 0 0 1.5 1.5 0.2 32.9 0 0 51.3 0
hdisk5 61.9 55.9M 852.9 55.9M 28.5K 850.5 6.0 0.1 120.8 0 0 2.4 1.6 0.1 33.6 0 0 46.1 0
hdisk7 58.3 55.4M 843.1 55.4M 21.2K 841.9 6.7 0.1 167.6 0 0 1.3 1.3 0.2 20.8 0 0 48.3 0
hdisk8 42.6 53.5M 729.1 53.5M 3.4K 728.9 5.7 0.1 586.4 0 0 0.2 0.9 0.2 5.9 0 0 54.3 0
hdisk10 44.1 37.1M 583.0 37.0M 16.9K 582.0 3.7 0.1 467.7 0 0 1.0 1.4 0.2 12.9 0 0 23.1 0
• Misleading indicators of disk subsystem performance
• %tm_act (percent time active)
• 100% busy is not meaningful for virtual disks, but is meaningful for real physical disks
• %iowait
• A measure of CPU idle while there are outstanding IOs
• IOPS, tps, and xfers… all refer to the same thing
94
Monitoring SAN storage performance
95
# topas -D
or just press D when in topas
Avg. Write Time, Avg. Read Time, Avg. Queue Wait
Total Service Time R = Average Read Time + Average Queue Wait Time
Total Service Time W = Average Write Time + Average Queue Wait Time
Disk IO service times
96
� Multiple interface types
� ATA
� SATA
� SCSI
� FC
� SAS
“ZBR” Geometry
� If the disk is very busy, IOs will wait for IOs ahead of it
� Queueing time on the disk (not queueing in the hdisk driver or elsewhere)
Seagate 7200 RPM SATA HDD performance
97
� As IOPS increase, IOs queue on the disk and wait for IOs ahead to complete first
What are reasonable IO service times?
98
� Assuming the disk isn’t too busy and IOs are not queueing there
� SSD IO service times around 0.2 to 0.4 ms and they can do over 10,000 IOPS
What are reasonable IO service times?
• Rules of thumb for IO service times for random IO and typical disk subsystems that are not mirroring data
synchronously and using HDDs
• Writes should average <= 2.5 ms
• Typically they will be around 1 ms
• Reads should average < 15 ms
• Typically they will be around 5-10 ms
• For random IO with synchronous mirroring
  • Writes take longer: they must get to the remote disk subsystem, write to its cache, and return an acknowledgement
  • Expect 2.5 ms + round-trip latency between sites (light through fiber travels 1 km in 0.005 ms)
• When using SSDs
• For SSDs on SAN, reads and writes should average < 2.5 ms, typically around 1 ms
• For SSDs attached to Power via SAS adapters without write cache
• Reads and writes should average < 1 ms
• Typically < 0.5 ms
• Writes take longer than reads for SSDs
• What if we don’t know if the data resides on SSDs or HDDs (e.g. in an EasyTier environment)?
• Look to the disk subsystem performance reports
• For sequential IO, don't worry about IO service times; worry about throughput
• We hope IOs queue, wait, and are ready to process
99
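The synchronous-mirroring rule of thumb above works out as follows for a hypothetical 50 km site separation (the distance is assumed for illustration):

```shell
# Light through fiber covers 1 km in ~0.005 ms; a synchronous write pays one round trip.
awk -v km=50 'BEGIN {
    rtt = 2 * km * 0.005             # round-trip latency between sites, ms
    printf "round trip: %.2f ms, expected avg write: %.2f ms\n", rtt, 2.5 + rtt
}'
```

At a few hundred kilometers of separation the round trip, not the disk subsystem, dominates write service time.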
What if IO times are worse than that?
• You have a bottleneck somewhere from the hdisk driver to the physical disks
• Possibilities include:
• CPU (local LPAR or VIOS)
• Adapter driver
• Physical host adapter/port
• Overloaded SAN links (unlikely)
• Storage port(s) overloaded
• Disk subsystem processor overloaded
• Physical disks overloaded (most common)
• SAN switch buffer credits
• Temporary hardware errors
• Evaluate VIOS, adapter, adapter driver from AIX/VIOS
• Evaluate the storage from the storage side
• If write IO service times are marginal, the write IO rate is low, and the read IO rate is high, it's often not worth worrying about
• Can occur due to caching algorithms in the storage
100
What about IO size and sequential IO?

Disks:          xfers
--------------  --------------------------------
                %tm    bps    tps   bread  bwrtn
                act
hdisk4          99.6 591.4M 2327.5 590.7M 758.7K
101
• Large IOs typically imply sequential IO; check your iostat data
• bps/tps = bytes per transfer, i.e. bytes/IO
  • 591.4 MB / 2327.5 tps = 260 KB/IO, likely sequential IO
• Use filemon to examine sequentiality, e.g.:
# filemon -o /tmp/filemon.out -O all,detailed -T 1000000; sleep 60; trcstop
VOLUME: /dev/hdisk4 description: N/A
reads: 9156 (0 errs)
read sizes (blks): avg 149.2 min 8 max 512 sdev 218.2
read times (msec): avg 6.817 min 0.386 max 1635.118 sdev 22.469
read sequences: 7155*
read seq. lengths: avg 191.0 min 8 max 34816 sdev 811.9
writes: 806 (0 errs)
write sizes (blks): avg 352.3 min 8 max 512 sdev 219.2
write times (msec): avg 20.705 min 0.702 max 7556.756 sdev 283.167
write sequences: 377*
write seq. lengths: avg 753.1 min 8 max 8192 sdev 1136.7
seeks: 7531 (75.6%)*
• Here % sequential = 1 - 75.6% = 24.4%
• Perhaps multiple sequential IO threads were accessing hdisk4
• The seeks value is the number of times the actuator had to move to a different place on the disk; the smaller the number, the more sequential the IO
* Adjacent IOs are coalesced into fewer IOs
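The arithmetic behind the filemon sample can be sketched directly (filemon sizes are in 512-byte blocks; the input values are copied from the output above):

```shell
# avg read 149.2 blks, avg write 352.3 blks, seeks on 75.6% of IOs (from filemon above)
awk -v rd_blks=149.2 -v wr_blks=352.3 -v seek_pct=75.6 'BEGIN {
    printf "avg read size:  %.2f KiB\n", rd_blks * 512 / 1024
    printf "avg write size: %.2f KiB\n", wr_blks * 512 / 1024
    printf "sequential: %.2f%%\n", 100 - seek_pct
}'
```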
A situation you may see
• Note the low write rate and high write IO service times
• Disk subsystem cache and algorithms may favor disks doing sequential or heavy IO relative to disks doing limited IO or no IO for several seconds
• The idea being to reduce overall IO service times
• Varies among disk subsystems
• Overall performance impact is low due to low write rates
102
# iostat -lD
Disks: xfers read write
-------------- -------------------------------- ------------------------------------ ------------------------------------
%tm bps tps bread bwrtn rps avg min max time fail wps avg min max time fail
act serv serv serv outs serv serv serv outs
hdisk0 0.3 26.7K 3.1 19.3K 7.5K 1.4 1.7 0.4 19.8 0 0 1.6 0.8 0.6 6.9 0 0
hdisk1 0.1 508.6 0.1 373.0 135.6 0.1 8.1 0.5 24.7 0 0 0.0 0.8 0.6 1.0 0 0
hdisk2 0.0 67.8 0.0 0.0 67.8 0.0 0.0 0.0 0.0 0 0 0.0 0.8 0.7 1.0 0 0
hdisk3 1.1 37.3K 4.4 25.1K 12.2K 2.0 0.8 0.3 10.4 0 0 2.4 4.4 0.6 638.4 0 0
hdisk4 80.1 33.6M 592.5 33.6M 38.2K 589.4 2.4 0.3 853.6 0 0 3.1 6.5 0.5 750.3 0 0
hdisk5 53.2 16.9M 304.2 16.9M 21.5K 302.2 3.0 0.3 1.0S 0 0 2.0 16.4 0.7 749.3 0 0
hdisk6 1.1 21.7K 4.2 1.9K 19.8K 0.1 0.6 0.5 0.8 0 0 4.0 2.7 0.6 495.6 0 0
Basic AIX IO Tuning
103
Introduction to AIX IO Tuning
104
• Tuning IO involves removing logical bottlenecks in the AIX IO stack
• Requires some understanding of the AIX IO stack
• General rule: increase buffers and queue depths so no IOs wait unnecessarily due to lack of a resource, but don't send so many IOs to the disk subsystem that it loses IO requests
• Four possible situations:
  1. No IOs waiting unnecessarily
     • No tuning needed
  2. Some IOs are waiting and IO service times are good
     • Tuning will help
  3. Some IOs are waiting and IO service times are poor
     • Tuning may or may not help
     • Poor IO service times indicate a bottleneck further down the stack, typically at the storage
     • Often needs more storage resources or storage tuning
  4. The disk subsystem is losing IOs and IO service times are bad
     • Leads to IO retransmissions, error-handling code, blocked IO, stalls, and crashes
Filesystem and Disk Buffers
105
# vmstat -v
…
0 pending disk I/Os blocked with no pbuf
171 paging space I/Os blocked with no psbuf
2228 filesystem I/Os blocked with no fsbuf
66 client filesystem I/Os blocked with no fsbuf
17 external pager filesystem I/Os blocked with no fsbuf
• Numbers are counts of temporarily blocked IOs since boot; the rate is what matters (use uptime):
  blocked count / uptime = rate of IOs blocked per second
• Low rates of blocking imply less improvement from tuning
• For pbufs, use lvmo to increase pv_pbuf_count (see the next slide)
• For psbufs, stop paging (add memory or use less of it) or add paging spaces
• For filesystem fsbufs, increase numfsbufs with ioo
• For external pager fsbufs, increase j2_dynamicBufferPreallocation with ioo
• For client filesystem fsbufs, increase nfso's nfs_v3_pdts and nfs_v3_vm_bufs (or the NFSv4 equivalents)
• Run # ioo -FL to see defaults, current settings, and what's required for the changes to take effect
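A quick sketch of the rate calculation, using assumed numbers (the 2228 fsbuf-blocked IOs from the vmstat sample above, and a hypothetical 30 days of uptime):

```shell
# blocked count / uptime in seconds = blocked IOs per second
awk -v blocked=2228 -v uptime_days=30 'BEGIN {
    printf "blocked IOs per second: %.6f\n", blocked / (uptime_days * 86400)
}'
```

At well under one blocked IO per second, tuning fsbufs would buy little in this example.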
Disk Buffers
106
# lvmo -v rootvg -a
vgname = rootvg
pv_pbuf_count = 512             Number of pbufs added when one PV is added to the VG
total_vg_pbufs = 512            Current pbufs available for the VG
max_vg_pbuf_count = 16384       Max pbufs available for this VG; takes effect at the next varyon of the VG
pervg_blocked_io_count = 1243   Delayed IO count since last varyon for this VG
pv_min_pbuf = 512               Minimum number of pbufs added when a PV is added to any VG
global_blocked_io_count = 1243  System-wide delayed IO count for all VGs and disks

# lvmo -v rootvg -o pv_pbuf_count=1024    Increases pbufs for rootvg; the change is dynamic
• Check disk buffers for each VG
FC adapter port tuning
107
• The num_cmd_elems attribute controls the maximum number of in-flight IOs for the FC port
• The max_xfer_size attribute controls the maximum IO size the adapter will send to the storage, and also sizes a DMA memory area that holds IO data
  • Doesn't apply to virtual adapters
  • The default memory area is 16 MB at the default max_xfer_size=0x100000
  • The memory area is 128 MB for any other allowable value
  • This cannot be changed dynamically; it requires stopping use of the adapter port
# lsattr -El fcs0
DIF_enabled no DIF (T10 protection) enabled True
bus_intr_lvl Bus interrupt level False
bus_io_addr 0xff800 Bus I/O address False
bus_mem_addr 0xffe76000 Bus memory address False
bus_mem_addr2 0xffe78000 Bus memory address False
init_link auto INIT Link flags False
intr_msi_1 209024 Bus interrupt level False
intr_priority 3 Interrupt priority False
lg_term_dma 0x800000 Long term DMA True
max_xfer_size 0x100000 Maximum Transfer Size True
num_cmd_elems 200 Maximum number of COMMANDS to queue to the adapter True
pref_alpa 0x1 Preferred AL_PA True
sw_fc_class 2 FC Class for Fabric True
tme no Target Mode Enabled True
FC adapter port queue depth tuning
Determining when to change the attributes
108
# fcstat fcs0
FIBRE CHANNEL STATISTICS REPORT: fcs0
Device Type: 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)
…
World Wide Port Name: 0x10000000C99C184E
…
Port Speed (supported): 8 GBIT
Port Speed (running): 8 GBIT
…
FC SCSI Adapter Driver Information <- Look at this section: numbers are counts of blocked IOs since boot
No DMA Resource Count: 452380 <- increase max_xfer_size for large values
No Adapter Elements Count: 726832 <- increase num_cmd_elems for large values
No Command Resource Count: 342000 <- increase num_cmd_elems for large values
…
FC SCSI Traffic Statistics
…
Input Bytes: 56443937589435
Output Bytes: 4849112157696
# chdev -l fcs0 -a num_cmd_elems=4096 -a max_xfer_size=0x200000 -P <- requires reboot
fcs0 changed
• Calculate the rate at which IOs are blocked:
  # blocked / uptime (or time since the adapter was made Available)
• The higher the rate of blocked IOs, the bigger the improvement from tuning
• If you've increased num_cmd_elems to its maximum value, increased max_xfer_size, and still get blocked IOs, you need another adapter port for more bandwidth
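Those counters only matter as a rate, so one way to read them is to divide by time up. The awk sketch below uses the counter lines from the fcstat sample above and an assumed 14 days (1209600 s) since the adapter came online:

```shell
awk -v up=1209600 '
    /No DMA Resource Count/     { printf "no-DMA rate:     %.3f/s\n", $NF/up }  # raise max_xfer_size
    /No Command Resource Count/ { printf "no-command rate: %.3f/s\n", $NF/up }  # raise num_cmd_elems
' <<'EOF'
No DMA Resource Count: 452380
No Command Resource Count: 342000
EOF
```

In real use you would pipe `fcstat fcs0` into the awk program instead of the here-document; rates like these (a blocked IO roughly every three seconds) would justify tuning.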
VIO IO tuning with VSCSI, NPIV, and SSP
109
• VSCSI
  • IOs from the vscsi adapter driver are DMA-transferred via the hypervisor to the hdisk driver in the VIOS
  • iostat statistics at the VIOS show these IOs
  • Set the VIOC queue_depth to <= the VIOS queue_depth for the LUN
    • Requires unmapping/remapping the disk, or rebooting the VIOS
  • Tune both queue_depths together, or set the VIOS queue_depth high and tune only the VIOC queue_depth
    • Ensures no blocking of IOs at the VIOC hdisk driver
• NPIV
  • IOs from the vFC adapter driver are DMA-transferred via the hypervisor to the FC adapter driver in the VIOS
  • iostat statistics at the VIOS do not capture these IOs
  • fcstat statistics at the VIOS do capture these IOs
  • NMON also captures this data at the VIOS
  • Higher blocked IO rates mean tuning will yield a bigger performance improvement
  • If you've increased num_cmd_elems to its maximum value, increased max_xfer_size, and still get blocked IOs, you need another adapter port for more bandwidth
VSCSI adapter queue depth sizing
hdisk queue depth   Max hdisks per vscsi adapter*
3 (default)         85
10                  39
24                  18
32                  14
64                  7
100                 4
128                 3
252                 2
256                 1
110
• VSCSI adapters also have a queue, but it's not tunable, and there is no tool to show how full the queue on the adapter driver is getting
• We ensure we don't run out of VSCSI queue slots by limiting the number of hdisks using the adapter, and their individual queue depths
• Adapter queue slots are a resource shared by the hdisks
• Max hdisks per adapter = INT[510 / (queue_depth + 3)]; for mixed queue depths, keep the sum of (queue_depth + 3) over all hdisks <= 510
  Note: the vscsi adapter has space for 512 command elements, of which 2 are used by the adapter itself, 3 are reserved for each VSCSI LUN for error recovery, and the rest are available for IO requests
• You can exceed these limits to the extent that the average service queue size is less than the queue depth
* To assure no blocking of IOs at the vscsi adapter
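The table above follows directly from the formula; a small loop reproduces it, assuming every hdisk on the adapter uses the same queue depth:

```shell
# 510 usable command elements / (queue_depth + 3 reserved per LUN) = max hdisks
for qd in 3 10 24 32 64 100 128 252 256; do
    printf "queue_depth %3d -> max %2d hdisks per vscsi adapter\n" "$qd" $(( 510 / (qd + 3) ))
done
```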
NPIV adapter tuning
111
• The real adapter's queue slots and DMA memory area are shared by the vFC NPIV adapters
• Tip: set num_cmd_elems to its maximum value and max_xfer_size to 0x200000 on the real FC adapter for maximum bandwidth, to avoid having to tune it later. Some configurations won't allow this and will produce errors in the error log or devices showing up as Defined.
• Only tune num_cmd_elems for the vFC adapter, based on fcstat statistics
112
The End
Questions?