Multipathing, SAN, and Direct-Attached Storage Considerations for AIX Administrators
John Hock
IBM Power Systems Strategic Initiatives
1
Agenda
• Direct-attached storage options review
• Disk formatting update
• Working with MPIO
• Monitoring, Measuring, Tuning topics
2
3
DAS & SAN - Two Good Options

DAS - Direct-Attached Storage ("internal")
SAN - Storage Area Network ("external")

• Both options are strategic
• Both options have their strengths
• You can use both options on the same server

DAS: fastest (lower latency); typically lower-cost hardware/software; often simpler configuration
SAN: fast; multi-server sharing; advanced functions/values such as FlashCopy, Metro/Global Mirror, Live Partition Mobility, and Easy Tier
4
SAN vs. DAS Decision Points

• Configuration control
  SAN: required resources may not be under your control
  DAS: direct local control
• Performance, general
  SAN: main benefits lie in improved sharing, scalability, and manageability
  DAS: can meet the most demanding application IO requirements
• Performance, MB/s
  SAN: additional components may be needed to support parallelism of paths to achieve MB/s targets
  DAS: configuring for MB/s throughput is simpler, as you most likely have full control of all the key components
• Performance, IO/s
  SAN: ideal for IO/s, with cache and multiple processors in the subsystems
  DAS: SSDs and Gen3 PCIe provide IO/s equal to or better than SAN
• Performance, latency
  SAN: latency depends upon current load conditions and request contention
  DAS: provides a shorter path for processing IO requests than SAN, resulting in smaller latencies
• Multiple-server sharing
  SAN: key strength
  DAS: limited sharing
• Availability/backup
  SAN: RAID-level data protection; extensive backup and replication tools
  DAS: RAID-level data protection; traditional backup technologies such as tape
• Partition Mobility
  SAN: supported
  DAS: not supported
• Placement flexibility
  SAN: SAN cable and switch distances
  DAS: intra-rack distances
• Cost
  SAN: can be allocated across many servers and applications
  DAS: generally less than SAN, but the cost must be absorbed by a smaller set of applications on a single server
• Scalability
  SAN: key strength
  DAS: limited, but may be sufficient for applications on a single server
• Slot usage efficiency
  SAN: FC adapters make efficient use of slots since cache is in the subsystem
  DAS: non-RAID SAS adapters have the same efficiency as FC; RAID adapters must be paired
5
IBM SAS Hardware Offerings - The SAS Ecosystem

• SAS adapters
  • Some adapters must be purchased and used in pairs
  • Capabilities vary
  • PCIe or PCI-X; PCIe Gen1, Gen2, or Gen3
• SAS disk bays
  • In the CECs (optional split backplanes)
  • In IO drawers: FC 5802/5803
  • In SAS disk bay drawers: EXP12S, EXP24S, EXP30, FC 5887
• HDDs and SSDs: 3.5-inch vs. 2.5-inch SFF, 1.8-inch SSD
• Cables
  • Various types and lengths: AA, AI, AE, AT, EE, X, YO, YI, YR
  • The number of ways to cable SAS is far greater than the number of IBM-supported ways

(Figure: SAS hardware feature codes - FC 5887 EXP24S, FC 5886 EXP12S, FC 5802, FC 5803, FC 5901, FC 5913, FC 5805/5903, FC EDR1/5888 EXP30, FC ESA1/2, FC EJ0M/J/L)
6
Created by the SCSI Trade Association (STA) to facilitate the identification of SAS architecture in the marketplace
(Figure: first- and second-generation SAS logos)
7
SAS Roadmap
(Figure: STA SAS technology roadmap; arrow marks the generations shipping in volume)
SAS Environment Planning
• How many HDDs and SSDs do you need?
  Capacity and performance needs
• What RAID configuration will you use?
  Number of disks in each array and the RAID level; how about hot spares? Hdisks (RAID arrays) are usually assigned as a unit; operating system mirroring?
• Where will the HDDs and SSDs go?
  Internal to the CECs (is the RAID daughter card offered in the system?); in IO drawers offering SAS disk bays; in SAS disk bay drawers
• What are your availability requirements for adapter failure?
  Dual adapters in an LPAR, or dual adapters across LPARs (or across VIOSs)?
• Cabling
• Understand the limitations
  Max HDDs/SSDs per adapter; mix of HDDs and SSDs
9
SAS Adapter Planning
• Plan your availability requirements
  Single adapter or dual adapters; RAID, JBOD, LVM mirroring; don't forget hot spares!
• Plan your cabling
  Generally, balance IOs evenly across physical disks… use the same RAID levels and number of disks per RAID array
• Plan your performance
  Order enough disks to get the IOPS needed, taking into account the availability configuration
  HDD max ~150 IOPS; SSD > 10,000 IOPS
10
Sizing the storage subsystem - DAS formulas & rules of thumb
For random workloads at the application layer

• Formulas
  N = number of physical disks
  R = proportion of IOs that are reads
  W = 1 - R = proportion of IOs that are writes
  D = IOPS for a single physical disk (approximately 175 IOPS for 15K RPM disks)

  JBOD, RAID 0, and OS mirroring: IOPS bandwidth = N x D
  RAID 1 or RAID 10: IOPS bandwidth = N x D / (R + 2W)
  RAID 5 or 5E: IOPS bandwidth = N x D / (R + 4W)
  RAID 6: IOPS bandwidth = N x D / (R + 6W)

• Rules of thumb for 15K RPM disks
  RAID 6 - 65 to 85 IOPS/HDD
  RAID 5 - 80 to 100 IOPS/HDD
  Mirrored/RAID 0/JBOD - 100 to 120 IOPS/HDD

• 10K RPM disks deliver approximately 30% fewer IOPS per disk
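The sizing formulas can be sketched in a few lines of code. This is a minimal illustration (the function and names are mine, not from any IBM sizing tool), assuming the write-penalty model on this slide:

```python
# Minimal sketch of the DAS sizing formulas above (illustrative names,
# not an IBM tool). Host-visible IOPS capacity for an array of N disks.

def iops_bandwidth(n_disks, read_fraction, disk_iops=175, raid="raid5"):
    """Each host write costs extra back-end IOs depending on RAID level:
    1 (JBOD/RAID 0/OS mirror), 2 (RAID 1/10), 4 (RAID 5/5E), 6 (RAID 6)."""
    r = read_fraction
    w = 1.0 - r
    penalty = {"jbod": 1, "raid0": 1, "os_mirror": 1,
               "raid1": 2, "raid10": 2,
               "raid5": 4, "raid5e": 4,
               "raid6": 6}[raid]
    return n_disks * disk_iops / (r + penalty * w)   # N x D / (R + kW)

# Example: 24 x 15K RPM HDDs (~175 IOPS each), 70% reads, RAID 5
print(round(iops_bandwidth(24, 0.7)))          # -> 2211
print(iops_bandwidth(10, 0.5, raid="jbod"))    # -> 1750.0
```

Note that for a read-only workload (W = 0) every denominator collapses to 1, so all RAID levels deliver the full N x D; the penalty only applies to the write fraction.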
11
SAS Hot-Pluggability
• SAS fully supports hot plugging
• You can hot-add SAS adapters, cable them to drive or media drawers, power them up, and configure them
• On SAS drawers you can hot-repair port expander cards and power/cooling assemblies
• If you have dual adapters (X cables) you can hot-repair adapters (you cannot hot-replace an X cable)
• You can hot-add a SAS media drawer
(Figure: SAS cabling diagrams - dual adapters cabled with 4x and 2x links from ports C1-T1/C1-T2/C2-T1/C2-T2 to the ESM pairs in the disk drawers)
12
EXP24S SFF Gen-2-bay disk expansion drawer
• Twenty-four Gen2 2.5-inch small form factor SAS bays
• Gen-2 SAS bays (or SFF-2) are not compatible with Gen-1 SAS bays
  • Not compatible with CEC SFF Gen-1 SAS bays or with #5802/5803 Gen-1 SFF SAS bays
• Supports 3G and 6G SAS interface speeds
• Redundant AC power supplies
• Supports HDDs and SSDs
• The EXP24S can be ordered in one of three manufacturing-configured mode settings (not customer set up), resulting in 1, 2, or 4 sets of disk partitions
• SAS controller implementation options are #5887 mode dependent:
  • a single controller
  • one pair of controllers
  • two pairs of controllers, or up to four separate controllers
(Figure: #5887 EXP24S rear view - two ESMs (C1 and C2), each with ports T1, T2, and T3)
13
#5887 EXP24S Modes

• Mode 1: 1 set of 24 bays
  • AIX, IBM i, Linux, VIOS
  • All adapters/controllers
• Mode 2: 2 sets of 12 bays
  • AIX, Linux, VIOS; not IBM i
  • All adapters/controllers
  • The T3 ports connect drives D1-D12 and the T2 ports connect drives D13-D24
• Mode 4: 4 sets of 6 bays
  • AIX, Linux, VIOS; not IBM i
  • Only #5901/5278 adapters; each adapter could be a separate partition/system

• #5887 modes are set by IBM Manufacturing; there is an option to reset the mode outside of manufacturing, but it is cumbersome
Note: IBM Manufacturing will attach an external label on the EXP24S to indicate the mode setting
(Figure: #5887 EXP24S rear view - two ESMs (C1 and C2), each with ports T1, T2, and T3)
POWER8 S824 Backplane
• Choice of two storage features:
  • Choice one: twelve SFF-3 bays, one DVD bay, one integrated SAS controller without cache, and JBOD, RAID 0, 5, 6, or 10
    • Optionally, split the SFF bays and add a second integrated SAS controller without cache
  • Choice two: eighteen SFF-3 bays, one DVD bay, a pair of integrated SAS controllers with cache, RAID 0, 5, 6, 10, 5T2, 6T2, and 10T2
    • Optionally, attach an EXP24S SAS HDD/SSD Expansion Drawer to the dual IOA
14
EXP30 Ultra SSD I/O Drawer
Attaches to one or two GX++ slots (not supported on POWER8)

#EDR1 - for POWER7+ 770/780 "D" models
• 1U drawer, up to 30 SSDs; 30 x 387 GB drives = up to 11.6 TB
• Great performance: up to 480,000 IOPS (100% read), up to 4.5 GB/s bandwidth
• Slightly faster PowerPC processor in the SAS controller and faster internal instruction memory
• Concurrent maintenance
• Up to 48 drives and up to 43 TB of downstream HDD in a #5887

#5888 - for POWER7 710/720/730/740 "C" models
• 1U drawer, up to 30 SSDs
• Great performance: up to 400,000 IOPS (100% read), up to 4.5 GB/s bandwidth
• Slightly slower PowerPC processor in the SAS controller and slower internal instruction memory
• Limited concurrent maintenance
• No downstream HDD
15
POWER8 I/O Migration Considerations: Disk & SSD
• Drives in #5887 EXP24S I/O Drawer (SFF-2)
• All disk drives supported.
• All 387GB or larger SSD supported (177GB SSD not supported)
• PCIe-attached drawer can move “intact”. Simply move attaching PCIe adapter to POWER8 PCIe slot
• Drives in POWER7 system unit or in #5802/5803 I/O drawer (SFF-1)
• Drives supported but need to be “retrayed”. Once on SFF-2 carrier/tray can use in #5887 EXP24S drawer.
• “Retraying” done either in advance on POWER7 server or during upgrade/migration to POWER8 server
• All disk drives supported. All 387GB or larger SSD supported (177GB SSD not supported)
• Drives in FC-attached SAN not impacted, just move FC card.
• PowerVM VIOS and Live Partition Mobility make the migration even easier
16
17
PCIe SAS Adapters

(Columns: #5901 | #5805/5903 | #5913 (paired) | #ESA1/#ESA2)
Write cache, effective:     0 | 380 MB | 1800 MB | N
Write cache, real:          0 | 380 MB | 1800 MB | N
PCI slots per adapter:      1 | 1 | 2 per pair | 1
Two cards required:         optional | required | required | optional
Rule of thumb - max HDD:    modest | 24-36+ | 72 | no
Rule of thumb - max SSD:    0 | 3-5 | 26 | 18-26
Minimum AIX support:        5.3 | 5.3 | 5.3 | 5.3
Minimum IBM i support:      6.1 | 6.1 | 6.1/7.1 | 6.1/7.1 (supported native only, no VIOS)
Cache battery maintenance:  N | Y | N | N
See the complete comparison table at
http://publib.boulder.ibm.com/infocenter/powersys/v3r1m5/topic/p7ebj/pciexpresssasraidcards.htm
PCIe3 RAID SAS Adapters Optimized for SSDs
• PCIe3 RAID SAS Adapter Quad-port 6Gb x8 (#EJ0J/EJ0M)
• PCIe3 12GB Cache RAID SAS Adapter Quad-port 6Gb x8 (#EJ0L)
18
FC EJ0J & EJ0M (LP) (CCIN 57B4)
(Figure: adapter layout - application-specific integrated circuit (ASIC) with the PPC now inside the ASIC; HD SAS ports T0-T3)
19
EJ0J & EJ0M Supported Configuration Summary
• SSDs and/or HDDs
• 48 SSDs max (specific rules for each enclosure)
• Single- or dual-controller configs (no write cache)
• RAID 0, 10, 5, 6 (JBOD also supported, on a single controller only)
• Max devices per array = 32
• SSDs/HDDs in EXP24S (#5887) or #5803
  • Specific rules for mixing SSDs and HDDs (cannot mix on a port)
  • #5803 mode 1 not supported with SSDs
• SAS tape supported on a single controller (cannot mix tape and disk on the same controller)
20
21
New 6Gb HD SAS Cables
• Narrower backshell than on the existing cables, due to the higher-density 4-gang SAS connectors
• Backwards compatible with current HD SAS adapters (57B5, 57C4)
• YO, X, AT, AA (*no AA cables with EJ0J & EJ0M)
• New feature codes and P/Ns; same lengths, same basic function
• New HD SAS cables for R/M: AE1 (4m) and YE1 (3m)

IBM PN 00E9345 - HD SAS connector plug
• Dust plug
• Strengthens the connector for cable plug/unplug
22
Single EJ0J/EJ0M with SSD in EXP24S (Mode 1)
(Figure: one EJ0J/EJ0M connected via a 6G YO cable to a #5887 in Mode 1 - up to 24 SSDs)
23
Dual EJ0J/EJ0M with Max 48 SSDs in EXP24S (Mode 1)
(Figure: two EJ0J/EJ0M adapters connected via 6G YO cables to two #5887 drawers in Mode 1)
24
Dual EJ0J/EJ0M to one EXP24S (Mode 2) with SSDs - Special High-IOPS Configuration
(Figure: two EJ0J/EJ0M adapters connected with 6G X cables to one #5887 in Mode 2 - 2-24 SSDs, 1-12 in each half drawer; max IOPS using only SSDs)
• No AA cables needed (no write cache)
• AIX & Linux only; no IBM i support for EXP24S Mode 2
25
Dual EJ0J/EJ0M to four EXP24S (Mode 1) with Max 96 HDDs
(Figure: two EJ0J/EJ0M adapters connected via 6G YO cables to four #5887 drawers in Mode 1; max 96 HDDs)
26
4 Single EJ0J/EJ0M to one EXP24S in 4-way split mode
(Figure: four EJ0J/EJ0M adapters connected via X cables to one #5887 in Mode 4, 6G SAS; drive bays D1-D24 zoned into four sets)
• Mode 4: 4 sets of 6 drives; dual X cables; up to 24 SSDs
• AIX & Linux only; no IBM i support
• The adapters do not see each other and can only see the 6 drives in the drive bays that are zoned for them to access
27
Dual EJ0J/EJ0M in a #5803 drawer to #5803 drawer drives
(Figure: two EJ0J/EJ0M adapters connected via 6G AT cables to #5803 drawers A and B; #5803 in Mode 2, with 1-13 HDDs or 1-13 SSDs in each half of the #5803)
• Field-support configuration only; not supported in the SBT/econfig tools
28
Single EJ0J/EJ0M to SAS media drawer (one adapter port to one device)
• 2 SAS tape drives per media drawer (4 max physically)
(Figure: one EJ0J/EJ0M connected to a media drawer holding devices 1-4, using 1-4 individual AE1 cables)
29
Single EJ0J/EJ0M to SAS media drawer - Alternate Configuration (one HBA port to two devices)
• 2 SAS tape drives per media drawer
• 8 tape drives max per adapter (physically)
(Figure: one EJ0J/EJ0M connected to a media drawer holding devices 1-4, using 1-4 YE1 cables)

PCIe3 12GB Cache RAID SAS Adapter Quad-port 6Gb x8 - FC EJ0L (CCIN 57CE)
30
31
EJ0L (57CE)
(Figure: EJ0L adapter - Crocodile ASIC with the PPC now inside the ASIC; HD SAS ports T0-T3; PowerGEM capacitor card)
32
Super Caps vs. Battery (flash-backed DRAM vs. battery-backed DRAM)
Super caps provide power long enough to copy data from write-cache memory to non-volatile flash on an abnormal power down/loss
• No battery maintenance
• No ~10-day retention limitation after an abnormal power off
• No 3-year end-of-life battery replacement
• Firmware monitors PowerGEM health; a hardware error is logged if card replacement is necessary
• Projected to last the life of the adapter
EJ0L - Supported Configs
• SSDs and/or HDDs
• 48 SSDs max (specific rules for each enclosure)
• Dual controller only
• Max devices per array = 32
• RAID 0, 10, 5, 6; no JBOD (except for initial formatting)
• EXP24S (#5887), #5803
  • Specific rules for mixing SSDs and HDDs
  • #5803 mode 1 not supported with SSDs
33
34
EJ0L
• Always paired, full-high, single-slot adapters
  • One exception: if AIX is running HACMP, the pair can be split between two servers, each server having one adapter
• Two AA cables are defaulted between the two EJ0Ls
  • The top ports (3rd and 4th) are reserved for the AA cables or for I/O drawers with HDDs
  • All four ports can have drive attachment
  • AA cables: 3m, 6m, 1.5m, and 0.6m
• Supports SAS HDD/SSD
• Drives located in #5803 12X I/O drawers or EXP24S SFF I/O drawers
• Drives NOT located in system units or the EXP12S
(Figure: EJ0L ports - T3 top, T2 mid, T1 mid, T0 bottom)
35
EJ0L to two EXP24S (Mode 1) with Max 48 SSDs
(Figure: an EJ0L pair connected via 6G YO cables to two #5887 drawers in Mode 1, with AA cables connected on the 3rd and 4th ports)
• Max 24 SSDs per EXP24S; max 48 SSDs per EJ0L pair
• No SSDs on the 3rd or 4th adapter ports
36
EJ0L to four EXP24S (Mode 1) with Max 96 HDDs
(Figure: an EJ0L pair connected via 6G YO cables to four #5887 drawers in Mode 1; max 96 HDDs)
• Both cables must go to the same port on the pair of adapters
37
Dual EJ0L in a #5803 drawer to #5803 drawer drives only
(Figure: two EJ0L adapters connected via 6G AT cables to #5803 drawers A and B, with 6G AA cables between the adapters; #5803 in Mode 2, with 1-13 HDDs or 1-13 SSDs in each half - 13 drives shown in half of the #5803)
• Field-support configuration only; not supported in the SBT/econfig tools
38
Configuration alternatives

• Single adapter configuration
  • If a VIOS, the array/hdisk can be split into multiple LV VSCSI disks
• Dual adapters in one LPAR
  • Provides availability for adapter failure
  • If a VIOS, the array/hdisk can be split into multiple LV VSCSI disks
  • Not supported for IBM i
• Dual adapters across LPARs
  • Clustered disk solution: two-node PowerHA; two-node Oracle RAC, GPFS, …
  • Or use of hdisks split among the LPARs
  • Similar to the single adapter configuration regarding availability without PowerHA
  • Not supported for IBM i
• Dual adapters across VIOSs
  • Provides availability for adapter or VIOS failure to the VIOCs
  • No splitting of RAID arrays into multiple LV VSCSI hdisks

(Figure: SAS adapter and drive (D1, D2, D3, …) placement for each alternative)
VIOS = VIO Server LPAR; VIOC = VIO Client LPAR
39
SAS with VIO VSCSI
(Figure: Power server with VIOSa and VIOSb, each with a SAS RAID adapter, serving AIX VIOCs; drives D1, D2, D3, D4, …)
• How is storage availability provided here?
  • Disk drive failure
    • Use OS mirroring across D1 and D2
    • Use SAS RAID 1, 5, 6, or 10 arrays
    • Use AIX LVM mirroring across JBOD or RAID 0 arrays
  • SAS RAID adapter failure
    • IO from the VIOC to the VIOS whose adapter failed is redirected to the other VIOS
  • VIOS failure
    • IO from the VIOC to that VIOS fails and is redirected to the other VIOS
• AIX uses the fail_over algorithm for a LUN, so its IOs go through only one VIOS; VIOSa can handle IOs for half the LUNs, and similarly for VIOSb
• Entire RAID arrays are allocated to the VIOCs (no LV VSCSI hdisks, just PV VSCSI hdisks)
40
Setting Parity Optimization steps for AIX/VIO
Setting which adapter handles which LUN…
• Changes can be made from the current primary/optimized adapter only
• Non-optimized adapters are the passive/secondary adapters
• Change "Preferred Access" to Optimized or Non-Optimized for the primary or secondary adapter, respectively
41
SAS setup for dual adapter configurations
• For JBOD: controller settings in dual adapter environments must be changed from Dual Initiator to JBOD HA Single Path via the SAS Disk Array Manager
• For RAID: assign a primary adapter for each hdisk if needed via smit (also called setting access optimization)
  • One adapter owns/handles all IOs for a RAID array
  • Balance RAID arrays across adapters, and/or make the primary adapter the one in the LPAR initiating the IO
• Once set up on one LPAR, verify the setting on the other LPAR; if necessary, run cfgmgr on the other LPAR
42
Active/passive nature of SAS
• IOs can be sent to either the active or the passive adapter
• If an adapter or its VIOS fails, the surviving adapter becomes the active adapter if it isn't already
• If an IO is sent to the passive SAS adapter, it is routed through the SAS network to the active adapter for the LUN
• Half the LUNs can be assigned to Adapter1 and half to Adapter2 to balance the use of resources
• Path priorities are best set at the VIOCs so that IOs for each LUN normally go to the VIOS with the active adapter
• AA cables are useful here - they reduce IO latency
  • They get IOs from the passive to the active adapter quicker
  • They get write data onto the cache of both adapters quicker
(Figure: Power server with VIOSa and VIOSb, each with a SAS RAID adapter, joined by an AA cable; AIX VIOC; drives D1-D4)
43
Disabling write cache
• From the Disk Array Manager menu -> Diagnostics and Recovery Options -> Change/Show SAS RAID Controller
• Adapter cache can be set to Disabled
• For adapters in HA configurations with SSDs, turning the cache off is sometimes best -- TEST

RAID vs. LVM, 528 vs. 512 vs. 4K
Questions from the field…
44
The Industry Trend to Larger Sectors

Rising bit density means smaller magnetic areas and more noise. The underlying (raw) disk media error rate is approaching 1 error in every thousand bits on average, while tiny media defects can lose hundreds of bytes in a row. Larger sectors enable more powerful ECC to fix those gaps.

A 512-byte sector can't carry enough ECC to correct for the higher raw error rates; hence the move to bigger sectors with stronger ECC, capable of detecting and correcting much larger errors - up to 400 bytes on a 4K sector.

The 4K sector enables disk manufacturers to keep cramming more bits onto a disk. Without it, the annual 40% capacity increases we've come to expect would stop.
45
512-byte, 528-byte and 4K sector sizes

• If we change the 512-byte sector size to 528 bytes, how does it affect the filesystem block size and features such as concurrent IO?
  The answer is: it doesn't. When HDDs/SSDs are formatted to 528 bytes/block, the host still sees the block size as 512. This is because the RAID adapter attaches the extra header/trailer bytes to every block as it is written and removes them from every block as it is read. The AIX filesystem does not see a difference.
• Likewise, native 4K HDDs/SSDs are formatted to 4224 bytes/block but are seen by AIX as 4096-byte blocks.
• The extra header/trailer bytes on every block are used for standard T10 Data Integrity Fields and other purposes.
  The T10 Protection Information model (also known as Data Integrity Field, or DIF) provides a means to protect the communication between the host adapter and the storage device.
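For illustration, the standard 8-byte T10 DIF tuple that rides in those extra bytes can be sketched as follows. This models only the standard DIF portion; the remaining header/trailer bytes are for other purposes, as the slide notes, and are not modeled here:

```python
# Sketch of the standard 8-byte T10 DIF appended to each sector: a 16-bit
# guard CRC of the data, a 16-bit application tag, and a 32-bit reference
# tag. The 528- and 4224-byte formats carry this plus other bytes.
import struct

def t10_dif(guard_crc, app_tag, ref_tag):
    return struct.pack(">HHI", guard_crc, app_tag, ref_tag)

print(len(t10_dif(0xBEEF, 0, 42)))   # 8 bytes of protection information
print(528 - 512, 4224 - 4096)        # 16 128: per-sector overhead in bytes
```

Note the per-sector overhead grows with the sector size (16 bytes on a 512-byte sector, 128 bytes on a 4K sector), which is why the formatted capacity differs between the 5xx and 4K feature codes.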
46
512? 528? 4K? What will I get?

• (#ESDP) - 600GB 15K RPM SAS SFF-2 Disk Drive - 5xx Block (AIX/Linux)
  2.5-inch (Small Form Factor (SFF)) 15K RPM SAS disk drive mounted in a Gen-2 carrier and supported in SAS SFF-2 bays. With 512-byte sectors (JBOD) the drive capacity is 600 GB. With 528-byte sectors (RAID) the drive capacity is 571 GB and the drive has additional data integrity protection. #ESDN and #ESDP are physically identical drives with the same CCIN; however, IBM Manufacturing always formats the #ESDN with 528-byte sectors, while #ESDP ships with either 512- or 528-byte formatting depending on how it is ordered. Reformatting a disk drive can take significant time, especially on larger-capacity drives.

• (#ELFP) - 600GB 15K RPM SAS SFF-2 4K Block - 4096 Disk Drive
  600 GB 2.5-inch (SFF) 15K RPM SAS disk drive on a Gen-2 carrier/tray, supported in SFF-2 SAS bays of the EXP24S drawer. The disk is formatted for 4096-byte sectors; if reformatted to 4224-byte sectors, capacity would be 571 GB.
47
HDD Format Shipments from IBM Manufacturing (bytes)

(Columns: P7/P7+ CECs (Gen1) | P7/P7+ EXP24S (Gen2S) | P8 S-series CEC (Gen3) | P8 S-series EXP24S (Gen2S))
AIX/Linux HDD 5xx features:     512 | 512 (528 w/#5913) | 528 | 528
AIX/Linux HDD service FRUs:     512 | 512 | 528 | 512
AIX/Linux HDD MES:              512 | 512 | 528 | 512
IBM i HDD 5xx features:         528 | 528 | 528 | 528
IBM i HDD service FRUs:         528 | 528 | 528 | 528
IBM i HDD MES:                  528 | 528 | 528 | 528
AIX/Linux HDD 4K features:      NA | NA | 4224 | 4224
AIX/Linux HDD 4K service FRUs:  NA | NA | 4224 | 4224
AIX/Linux HDD 4K MES:           NA | NA | 4224 | 4224
IBM i HDD 4K features:          NA | NA | 4224 | 4224
IBM i HDD 4K service FRUs:      NA | NA | 4224 | 4224
IBM i HDD 4K MES:               NA | NA | 4224 | 4224
48
49
SFF HDD Interchangeability - Can I reformat? …from April 2012 Deep Dive

IBM configuration/ordering systems are not aware of HDD interchangeability. AIX/Linux/VIOS can use either the 512-byte (JBOD) or the 528-byte (RAID) format. IBM i only uses 528-byte (RAID). Currently shipping SFF SAS drives are fully interchangeable between the two formats. Earlier drives shipped as 512-byte could not be used by IBM i even when reformatted to 528-byte.

Note: when the #1886 drive was first introduced, a few hundred were shipped with a different CCIN. This very small percentage of #1886 drives cannot be used by an IBM i partition even when formatted to 528-byte.

SAS SFF drives shipped formatted as 528-byte sectors, labeled "IBM i":
  15K RPM 139 GB: SFF-1 #1888 CCIN=198C | SFF-2 #1956 CCIN=19B0
  15K RPM 283 GB: SFF-1 #1879 CCIN=19A1 | SFF-2 #1947 CCIN=19B1
  10K RPM 283 GB: SFF-1 #1911 CCIN=198D | SFF-2 #1956 CCIN=19B7
  10K RPM 571 GB: SFF-1 #1916 CCIN=19A3 | SFF-2 #1962 CCIN=19B3

SAS SFF drives shipped formatted as 512-byte sectors, labeled "AIX":
  15K RPM 146 GB: SFF-1 #1886 CCIN=198C | SFF-2 #1917 CCIN=19B0
  15K RPM 300 GB: SFF-1 #1880 CCIN=19A1 | SFF-2 #1953 CCIN=19B1
  10K RPM 300 GB: SFF-1 #1885 CCIN=198D | SFF-2 #1925 CCIN=19B7
  10K RPM 600 GB: SFF-1 #1790 CCIN=19A3 | SFF-2 #1964 CCIN=19B3
4K and VIOS
• The block size of the virtual device matches that of the back-end physical device (the block size is not abstracted)
• The client OS must have support for this block size. AIX introduced support for 4K block sizes in AIX 7.1 TL3 SP3 (2Q 2014). Example: VSCSI presents the 4K block size to the client LPAR when the backing device is a logical volume on a disk that has a 4K block size. Earlier AIX releases see the disks, but can't use them at all.
• This 4K block size support was not added to previous AIX TLs.
50
JFS and 4k Disks
Current system documentation acknowledges restrictions for JFS on 4K disks:
http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.install/install_sys_backups.htm?lang=en
Installing system backupsFile systems are created on the target system at the same size as they were on the source system, unless the backup image was created with SHRINK set to yes in the image.data file, or you selected yes in the BOS Install menus. An exception is the /tmp directory, which can be increased to allocate enough space for the bosboot command. If you are installing the AIX® operating system from a system backup that uses the JFS file system, you cannot use a disk with 4K sector sizes.
http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.install/alt_disk_rootvg_cloning.htm?lang=en
Cloning the rootvg to an alternate diskIf your current rootvg uses the JFS file system, then the alternate disk cannot have 4K sector sizes.
JFS2 will not create a file system with block size less than the disk sector size.
51
Using single-disk RAID vs. JBOD - Which is better?
• A single-disk RAID 0 array has the same performance as JBOD
• Consider the effort of changing the sector size (i.e., reformatting) versus staying with what is delivered
• SSDs do not support JBOD
• Certain adapters and configurations do not support JBOD
52
Using (hardware) RAID vs. LVM to mirror rootvg - Which should I use?
• If a customer is all AIX, use LVM; if heterogeneous, RAID makes the environment more uniform
• Hardware RAID is probably technically faster
• RAID 1, 10, 5, 6 options (depending on cache)
53
Working with Multi-path IO
54
What is MPIO?
• MPIO is an architecture designed by AIX development (released in AIX V5.2)
• MPIO is also a commonly used acronym for Multi-Path IO (the AIX PCM, aka MPIO)
  – In this presentation, MPIO refers explicitly to the architecture, not the acronym
• Why was the MPIO architecture developed?
– With the advent of SANs, each disk subsystem vendor wrote their own multi-path code
– These multi-path code sets were usually incompatible
• Mixing disk subsystems was usually not supported on the same system, and if they were, they usually required their own FC adapters
– Integration with AIX IO error handling and recovery
• Several levels of IO timeouts: basic IO timeout, FC path timeout, etc
• MPIO architecture details available to disk subsystem vendors
– Compliant code requires a Path Control Module (PCM) for each disk subsystem
• AIX PCMs for SCSI and FC ship with AIX and are often used by the vendors
– MPIO allows vendors to develop their own path selection algorithms
– Disk vendors have been moving towards MPIO compliant code
55
MPIO Common Interface

Overview of MPIO Architecture
• LUNs show up as an hdisk
  – Architected for 32K paths
  – No more than 16 paths are necessary
• PCM: Path Control Module
  – AIX PCMs exist for FC and SCSI
  – Vendors may write optional PCMs
  – PCMs may provide commands to manage paths
• Allows various algorithms to balance the use of paths
• Full support for multiple paths to rootvg
56
Disk configuration
• Hdisks can be Available, Defined, or non-existent
• Paths can also be Available, Defined, Missing, or non-existent
• Path status can be Enabled, Disabled, or Failed if the path is Available (use the chpath command to change status)
• Add a path: e.g., after installing a new adapter and cable to the disk, run cfgmgr (or cfgmgr -l <adapter>)
• One must get the device layer correct before working with the path-status layer
Tip: to keep paths <= 16, group sets of 4 host ports and 4 storage ports and balance LUNs across them
57
• The disk vendor…
  • Dictates what multi-path code can be used
  • Supplies the filesets for the disks and the multipath code
  • Supports the components that they supply
    Hitachi: https://tuf.hds.com/instructions/servers/AIXODMUpdates.php
    EMC: ODM filesets are available via EMC Powerlink (registration required)
• A fileset is loaded to update the ODM to support the storage
  • AIX then recognizes and appropriately configures the disk
  • Without this, disks are configured using a generic ODM definition, and performance and error handling may suffer as a result
  • # lsdev -Pc disk displays supported storage
• The multi-path code will be a different fileset, unless using the MPIO that's included with AIX
• Beware of the generic "Other" disk definition: no command queuing, poor performance and error handling
How many paths for a LUN?
58
Server
FC Switch
Storage
• Paths per LUN = (# of paths from server to switch) x (# of paths from storage to switch)
• Here there are potentially 6 paths per LUN, but the count is reduced via:
• LUN masking at the storage
Assign LUNs to specific FC adapters at the host, and thru specific ports on the storage
• Zoning
WWPN or SAN switch port zoning
• Dual SAN fabrics
Divides the potential paths by two
• 4 paths per LUN are sufficient for availability and reduce the CPU overhead of choosing a path
• Path selection overhead is relatively low (usually negligible)
• MPIO has no practical limits to the number of paths
• Other products have path limits
• SDDPCM limited to 16 paths per LUN
How many paths for a LUN?, cont’d
Dual SAN Fabric for SAN Zoning Reduces Potential Paths
59
4 X 4 = 16 paths
Server
FC Switch
Storage
2 X 2 + 2 X 2 = 8 paths
� With single initiator to single target zoning, both examples would have 4 paths
� A popular approach is to use 4 host and 4 storage ports, zoning one host port to one storage port, yielding 4 paths
Fabric 1 Fabric 2
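The path arithmetic above can be sketched in a few lines of shell; the 4x4 port counts are the ones from the diagrams, not a recommendation.

```shell
# Paths per LUN = (host ports on a fabric) x (storage ports on that fabric),
# summed over fabrics. On a single fabric, every host port sees every storage port.
paths_single_fabric() { echo $(( $1 * $2 )); }           # $1=host ports, $2=storage ports
paths_dual_fabric()   { echo $(( ($1/2) * ($2/2) * 2 )); } # ports split evenly across 2 fabrics

echo "single fabric, 4x4: $(paths_single_fabric 4 4) paths"  # 16
echo "dual fabric,   4x4: $(paths_dual_fabric 4 4) paths"    # 2x2 + 2x2 = 8
```

Single-initiator-to-single-target zoning reduces either layout further, to 4 paths.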
Path selection benefits and costs
60
� Path selection algorithms choose a path to minimize the latency added to
an IO to send it over the SAN to the storage
� Latency to send a 4 KB IO over an 8 Gbps SAN link is
4 KB / (8 Gb/s x 0.1 B/b x 1048576 KB/GB) = 0.0048 ms
� Multiple links may be involved, and IOs are round trip
� As compared to fastest IO service times around 1 ms
� If the links aren’t busy, there likely won’t be much, if any, savings from
use of sophisticated path selection algorithms vs. round robin
� Costs of path selection algorithms (could outweigh latency savings)
� CPU cycles to choose the best path
� Memory to keep track of in-flight IOs down each path, or
� Memory to keep track of IO service times down each path
� Latency added to the IO to choose the best path
Generally utilization of links is low
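The wire-time figure above can be reproduced with a short awk sketch; the 0.1 B/b factor models 8b/10b encoding (10 bits on the wire per byte of data).

```shell
# Wire time for one IO on an FC link (8b/10b encoding: 10 bits per byte)
link_ms() {  # $1 = IO size in KB, $2 = link speed in Gbps
  awk -v kb="$1" -v gbps="$2" 'BEGIN {
    kb_per_sec = gbps * 0.1 * 1048576      # Gb/s -> GB/s (8b/10b) -> KB/s
    printf "%.4f", kb / kb_per_sec * 1000  # seconds -> ms
  }'
}
echo "$(link_ms 4 8) ms"   # ~0.0048 ms, tiny next to ~1 ms IO service times
```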
Balancing IOs with algorithms fail_over and round_robin
– Any load balancing algorithm must consume CPU and memory resources to
determine the best path to use.
– Using path priorities, it is possible to set up fail_over LUNs so that the loads are
balanced across the available FC adapters.
– Let's use an example with 2 FC adapters. Assume we correctly lay out our data so
that the IOs are balanced across the LUNs (this is usually a best practice). Then if
we assign half the LUNs to FC adapterA and half to FC adapterB, then the IOs are
evenly balanced across the adapters!
– A question to ask is, “If one adapter is handling more IO than another, will this have a
significant impact on IO latency?”
– Since the FC adapters are capable of handling more than 50,000 IOPS then we're
unlikely to bottleneck at the adapter and add significant latency to the IO.
61
A fail_over algorithm can be efficiently used to balance IOs!
round_robin may more easily ensure balanced IOs across the links for each LUN
● e.g., if the IOs to the LUNs aren't balanced, then it may be difficult to balance the LUNs and their IO rates across the adapter ports with fail_over
● requires fewer resources than load balancing
Multi-path IO with VIO and VSCSI LUNs
• Two layers of multi-path code: VIOC and VIOS
• VSCSI disks always use AIX PCM and all IO for a LUN normally goes to one VIOS
– algorithm = fail_over only
• Set the path priorities for the VSCSI hdisks so half use one VIOS, and half use the other
• VIOS uses the multi-path code specified for the disk subsystem
• Typical setup: Set the vscsi device’s vscsi_err_recov attribute to fast_fail. The default is delayed_fail. This will speed up path failover in the event of VIOS failure.
62
VIO ClientAIX PCM
VIO ServerMulti-path code
VIO ServerMulti-path code
Disk Subsystem
Multi-path IO with VIO and NPIV
• One layer of multi-path code
– round_robin is an appropriate load-balancing scheme
• VIOC has virtual FC adapters (vFC)
– Potentially one vFC adapter for every real FC
adapter in each VIOC
– Maximum of 64 vFC adapters per real FC adapter
recommended
• VIOC uses multi-path code that the disk subsystem
supports
• IOs for a LUN can go thru both VIOSs
63
VIO ClientMulti-path code
VIO Server VIO Server
Disk Subsystem
VFC VFC VFC VFC
HBA HBA HBA HBA
Multi-path code sets that are incompatible on a single LPAR can still be used on different VIOC
LPARs sharing the same physical adapter via NPIV, provided the incompatible sets aren't mixed on
the same LPAR. E.g. PowerPath + EMC and MPIO + DS8000.
Active/Active, Active/Passive and Asymmetric Logical Unit Access (ALUA) Disk Subsystem Controllers
• Active/Active controllers
– IOs can be sent to any controller for a LUN
– DS8000, DS6000 and XIV
• Active/Passive controllers
– IOs for a LUN are sent to the primary controller for the LUN, except in failure scenarios
– The storage administrator balances LUNs across the controllers
• Controllers should be active for some LUNs and passive for others
– DS3/4/5000
• ALUA – Asymmetric Logical Unit Access
– IOs can be sent to any controller, but one controller is preferred (IOs passed to primary)
• Preferred due to performance considerations
– SVC, V7000 and NSeries/NetApp
• Using ALUA on NSeries/NetApp is preferred
– Set on the storage
• MPIO supports Active/Passive and Active/Active disk subsystems
– SVC and V7000 are treated as Active/Passive
• Terminology regarding active/active and active/passive varies considerably
64
MPIO support
Storage Subsystem Family MPIO code Multi-path algorithm
IBM ESS, DS6000, DS8000,
DS3950, DS4000, DS5000,
SVC, V7000
IBM Subsystem Device
Driver Path Control
Module (SDDPCM) or AIX
PCM
fail_over, round_robin and for
SDDPCM: load balance, load
balance port
DS3/4/5000 in VIOS
AIX FC PCM recommended
fail_over, round_robin
IBM XIV Storage System AIX FC PCM fail_over, round_robin
IBM System Storage N Series AIX FC PCM fail_over, round_robin
EMC Symmetrix AIX FC PCM fail_over, round_robin
HP & HDS
(varies by model)
Hitachi Dynamic Link
Manager (HDLM)
fail_over, round robin,
extended round robin
AIX FC PCM fail_over, round_robin
SCSI AIX SCSI PCM fail_over, round_robin
VIO VSCSI AIX SCSI PCM fail_over
65
Non-MPIO multi-path code
Storage subsystem family Multi-path code
IBM DS6000, DS8000, SVC, V7000 SDD
IBM DS4000 Redundant Disk Array Controller (RDAC)
EMC Power Path
HP AutoPath
HDS HDLM (older versions)
Veritas-supported storage Dynamic MultiPathing (DMP)
66
AIX Path Control Module (PCM) IO basics
67
The AIX PCM…
� Is part of the MPIO architecture
� Chooses the path each IO will take
� Is used to balance the use of resources used to connect to the storage
� Depends on the algorithm attribute for each hdisk
� Handles path failures to ensure availability with multiple paths
� Handles path failure recovery
� Checks the status of paths
� Supports boot disks
� Not all multi-path code sets support boot disks
� Offers PCMs for both Fibre Channel and SCSI protocol disks
� Supports active/active, active/passive and ALUA disk subsystems
� But not all disk subsystems
� Supports SCSI-2 and SCSI-3 reserves
� SCSI reserves are often not used
68
Path management with AIX PCM
� Includes examining, adding, removing, enabling and disabling paths
► Adapter failure/replacement or addition
► Planned VIOS outages
► Cable failure and replacement
► Storage controller/port failure and repair
� Adapter replacement
► If the adapter has failed, its paths will not be in use and will be in the Failed state
1. Remove the adapter and its child devices including the paths using the adapter with
# rmdev -Rdl <fcs#>
2. Replace the adapter
3. cfgmgr
4. Check the paths with lspath
� It’s better to stop using a path ahead of time when you know it will disappear
► Avoid timeouts, application delays or performance impacts and potential error
recovery bugs
► To disable all paths using a specific FC port on the host:
# chpath -l hdisk1 -p <parent> -s disable
SDDPCM: An Overview
• SDDPCM = Subsystem Device Driver Path Control Module
• SDDPCM is MPIO compliant and can be used with IBM DS6000, DS8000,
DS4000 (most models), DS5000, DS3950, V7000 and the SVC
– A “host attachment” fileset (populates the ODM) and SDDPCM fileset are both installed
– Host attachment: devices.fcp.disk.ibm.mpio.rte
– SDDPCM: devices.sddpcm.<version>.rte
• LUNs show up as hdisks, paths shown with pcmpath or lspath commands
– 16 paths per LUN supported
• Provides a PCM per the MPIO architecture
• One installs SDDPCM or SDD, not both.
69
SDDPCM
• Load balancing algorithms
– rr - round robin
– lb - load balancing based on in-flight IOs per adapter
– fo - failover policy
– lbp - load balancing port (for ESS, DS6000, DS8000, V7000 and SVC
only) based on in-flight IOs per adapter and per storage port
• The pcmpath command is used to examine hdisks, adapters, paths, hdisk statistics,
path statistics, adapter statistics; to dynamically change the load balancing algorithm,
and to perform other administrative tasks such as adapter replacement.
• SDDPCM automatically recovers failed paths that have been repaired via the pcmsrv
daemon
– MPIO health checking can also be used, and can be dynamically set via
the pcmpath command. This is recommended. Set the hc_interval to a
non-zero value to turn on path health checking
70
71
Path management with SDDPCM and the pcmpath command
# pcmpath query adapter
# pcmpath query device
# pcmpath query port
# pcmpath query devstats
# pcmpath query adaptstats
# pcmpath query portstats
# pcmpath query essmap
# pcmpath set adapter …
# pcmpath set device path …
# pcmpath set device algorithm
# pcmpath set device hc_interval
# pcmpath disable/enable ports …
# pcmpath query wwpn
And more
� SDD offers the similar datapath command
List adapters and status
List hdisks and paths
List DS8000/DS6000/SVC… ports
List hdisk/path IO statistics
List adapter IO statistics
List DS8000/DS6000/SVC port statistics
List rank, LUN ID and more for each hdisk
Disable/enable paths to adapter
Disable/enable paths to a hdisk
Dynamically change path algorithm
Dynamically change health check interval
Disable/enable paths to a disk port
Display all FC adapter WWPNs
72
Path management with SDDPCM and the pcmpath command
# pcmpath query device
…
DEV#: 2 DEVICE NAME: hdisk2 TYPE: 2145 ALGORITHM: Load Balance
SERIAL: 600507680190013250000000000000F4
==========================================================================
Path# Adapter/Path Name State Mode Select Errors
0 fscsi0/path0 OPEN NORMAL 40928736 0
1* fscsi0/path1 OPEN NORMAL 16 0
2 fscsi2/path4 OPEN NORMAL 43927751 0
3* fscsi2/path5 OPEN NORMAL 15 0
4 fscsi1/path2 OPEN NORMAL 44357912 0
5* fscsi1/path3 OPEN NORMAL 14 0
6 fscsi3/path6 OPEN NORMAL 43050237 0
7* fscsi3/path7 OPEN NORMAL 14 0
…
• * Indicates path to passive controller
• 2145 is a SVC which has active/passive nodes for a LUN
• DS4000, DS5000, V7000 and DS3950 also have active/passive controllers
• IOs will be balanced across paths to the active controller
73
Path management with SDDPCM and the pcmpath command
# pcmpath query devstats
Total Dual Active and Active/Asymmetric Devices : 67
DEV#: 2 DEVICE NAME: hdisk2
===============================
Total Read Total Write Active Read Active Write Maximum
I/O: 169415657 2849038 0 0 20
SECTOR: 2446703617 318507176 0 0 5888
Transfer Size: <= 512 <= 4k <= 16K <= 64K > 64K
183162 67388759 35609487 46379563 22703724
…
• Maximum value useful for tuning hdisk queue depths
• “20” is the maximum in-flight requests for the IOs shown
• Increase queue depth until the queue is not filling up or until IO service times suffer (bottleneck is pushed to the subsystem)
• writes > 3 ms
• reads > 15-20 ms
• See References for queue depth tuning whitepaper
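The tuning check above can be sketched as a tiny helper; the values plugged in are the slide's devstats Maximum of 20 and an assumed queue_depth of 20 (on a real system, read queue_depth from lsattr -El <hdisk>).

```shell
# Compare the devstats "Maximum" in-flight count against the hdisk queue_depth.
queue_check() {  # $1 = Maximum from pcmpath query devstats, $2 = queue_depth
  if [ "$1" -ge "$2" ]; then
    echo "queue filling: consider raising queue_depth (watch service times)"
  else
    echo "queue not filling: queue_depth adequate"
  fi
}
queue_check 20 20   # slide's Maximum of 20 vs. an assumed queue_depth of 20
```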
SDD & SDDPCM: Getting Disks configured correctly
• Install the appropriate filesets
– SDD or SDDPCM for the required disks (and the host attachment fileset)
– If you are using SDDPCM, also install the MPIO fileset that comes with AIX:
• devices.common.IBM.mpio.rte
– Host attachment scripts:
• http://www.ibm.com/support/dlsearch.wss?rs=540&q=host+scripts&tc=ST52G7&dc=D410
• Reboot or start the sddsrv/pcmsrv daemon
• smitty disk -> List All Supported Disk
– Displays disk types for which software support has been installed
• Or # lsdev -Pc disk | grep MPIO
disk mpioosdisk fcp MPIO Other FC SCSI Disk Drive
disk 1750 fcp IBM MPIO FC 1750 …DS6000
disk 2105 fcp IBM MPIO FC 2105 …ESS
disk 2107 fcp IBM MPIO FC 2107 …DS8000
disk 2145 fcp MPIO FC 2145 …SVC
disk DS3950 fcp IBM MPIO DS3950 Array Disk
disk DS4100 fcp IBM MPIO DS4100 Array Disk
disk DS4200 fcp IBM MPIO DS4200 Array Disk
disk DS4300 fcp IBM MPIO DS4300 Array Disk
disk DS4500 fcp IBM MPIO DS4500 Array Disk
disk DS4700 fcp IBM MPIO DS4700 Array Disk
disk DS4800 fcp IBM MPIO DS4800 Array Disk
disk DS5000 fcp IBM MPIO DS5000 Array Disk
disk DS5020 fcp IBM MPIO DS5020 Array Disk
74
75
www-01.ibm.com/support/docview.wss?rs=540&uid=ssg1S7001350#AIXSDDPCM
76
Comparing AIX PCM & SDDPCM
Feature/Function AIX PCM SDDPCM
How obtained Included with VIOS and AIX Downloaded from IBM website
Supported Devices
Supports most disk devices that the AIX operating system and VIOS PowerVM firmware support, including selected third-party devices
Supports specific IBM devices and is referenced by the particular device support statement. The supported devices differ between AIX and PowerVM VIOS
OS Integration Considerations
Update levels are provided and are updated and migrated as a mainline part of all the normal AIX and VIOS service strategy and upgrade/migration paths
Add-on software entity that has its own update strategy and process for obtaining fixes. The customer must manage coexistence levels between both the mix of devices, operating system levels and VIOS levels. NOT a licensed program product.
Path Selection Algorithms
Fail over (default)Round Robin (excluding VSCSI disks)
Fail overRound RobinLoad Balancing (default)Load Balancing Port
Algorithm Selection Disk access must be stopped in order to change algorithm Dynamic
SAN boot, dump, paging support Yes Yes. Restart required if SDDPCM is installed after the
AIX PCM and SDDPCM boot is desired.
PowerHA & GPFS Support Yes Yes
Utilities standard AIX performance monitoring tools such as iostat and fcstat
Enhanced utilities (pcmpath commands) to show mappings from adapters, paths, devices, as well as performance and error statistics
77
Mixing multi-path code sets
• The disk subsystem vendor specifies what multi-path code is supported for their storage
– The disk subsystem vendor supports their storage, the server vendor generally doesn’t
• You can mix multi-path code compliant with MPIO and even share adapters
– There may be exceptions. Contact vendor for latest updates.
HP example: “Connection to a common server with different HBAs requires separate
HBA zones for XP, VA, and EVA”
• Generally one non-MPIO compliant code set can exist with other MPIO compliant code sets
– Except that SDD and RDAC can be mixed on the same LPAR
– The non-MPIO compliant code must be using its own adapters
• Except RDAC can share adapter ports with MPIO
• Devices of a given type use only one multi-path code set
– e.g., you can’t use SDDPCM for one DS8000 and SDD for another DS8000 on the same
AIX instance
78
Sharing Fibre Channel Adapter ports
• Disk using MPIO compliant code sets
can share adapter ports
• It’s recommended that disk and tape
use separate ports
79
Disk (typically small block random) and tape (typically large block sequential) IO are different, and stability issues have
been seen at high IO rates
MPIO Command Set
• lspath – list paths, path status, path ID, and path attributes for a disk
• chpath – change path status or path attributes
– Enable or disable paths
• rmpath – delete or change path state
– Putting a path into the defined mode means it won’t be used (from available to
defined)
– One cannot define/delete the last path of an open device
• mkpath – add another path to a device or makes a defined path available
– Generally cfgmgr is used to add new paths
• chdev – change a device’s attributes (not specific to MPIO)
• cfgmgr – add new paths to an hdisk or make defined paths available
(not specific to MPIO)
80
Useful MPIO Commands
• List status of the paths and the parent device (or adapter)
# lspath -Hl <hdisk#>
• List connection information for a path
# lspath -l hdisk2 -F"status parent connection path_status path_id"
Enabled fscsi0 203900a0b8478dda,f000000000000 Available 0
Enabled fscsi0 201800a0b8478dda,f000000000000 Available 1
Enabled fscsi1 201900a0b8478dda,f000000000000 Available 2
Enabled fscsi1 203800a0b8478dda,f000000000000 Available 3
• The connection field contains the storage port WWPN
– In the case above, each adapter has paths to two storage ports, e.g. for fscsi0:
203900a0b8478dda
201800a0b8478dda
• List a specific path's attributes
# lspath -AEl hdisk2 -p fscsi0 -w "203900a0b8478dda,f000000000000"
scsi_id 0x30400 SCSI ID False
node_name 0x200800a0b8478dda FC Node Name False
priority 1 Priority True
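A quick way to see which storage ports a disk's paths use is to cut the WWPN out of the connection field. This sketch runs against the sample lspath output above, copied into a variable; on a live system you would pipe lspath output in instead.

```shell
# List the distinct storage-port WWPNs from the lspath connection field
# (field 3 once the line is split on spaces and the comma).
lspath_out='Enabled fscsi0 203900a0b8478dda,f000000000000 Available 0
Enabled fscsi0 201800a0b8478dda,f000000000000 Available 1
Enabled fscsi1 201900a0b8478dda,f000000000000 Available 2
Enabled fscsi1 203800a0b8478dda,f000000000000 Available 3'

echo "$lspath_out" | awk -F'[ ,]' '{ print $3 }' | sort -u
```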
81
Path priorities
• A Priority attribute for paths can be used to specify a preference for path IOs. How it works depends on whether the hdisk’s algorithm attribute is set to fail_over or round_robin.
The value specified is inverse to the priority, i.e. “1” is the highest priority
• algorithm=fail_over
– the path with the lowest priority value handles all the IOs unless there’s a path failure.
– Set the primary path by giving it a priority value of 1, the next path (used in case of path failure) a value of 2, and so on.
– if the path priorities are the same, the primary path will be the first listed for the hdisk in the CuPath ODM as shown by # odmget CuPath
• algorithm=round_robin
– If the priority attributes are the same, then IOs go down each path equally.
– In the case of two paths, if you set path A’s priority to 1 and path B’s to 255, then for every IO going down path A, there will be 255 IOs sent down path B.
• To change the path priority of an MPIO device on a VIO client:
# chpath -l hdisk0 -p vscsi1 -a priority=2
–Set path priorities for VSCSI disks to balance use of VIOSs
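The VIOS balancing above can be scripted by alternating the priority pair across the LUN list. The hdisk and vscsi names below are placeholders, and the sketch only prints the chpath commands rather than running them; review and substitute your own device names first.

```shell
# Print chpath commands that alternate VSCSI path priority across two VIOSs:
# even-numbered entries prefer vscsi0, odd-numbered entries prefer vscsi1.
gen_vscsi_priorities() {
  i=0
  for d in "$@"; do
    if [ $((i % 2)) -eq 0 ]; then p0=1 p1=2; else p0=2 p1=1; fi
    echo "chpath -l $d -p vscsi0 -a priority=$p0"
    echo "chpath -l $d -p vscsi1 -a priority=$p1"
    i=$((i + 1))
  done
}
gen_vscsi_priorities hdisk0 hdisk1 hdisk2 hdisk3
```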
82
Path priorities
Note that a lower value for the path priority is a higher priority
# lsattr -El hdisk9
PCM PCM/friend/otherapdisk Path Control Module False
algorithm fail_over Algorithm True
hcheck_interval 60 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
lun_id 0x5000000000000 Logical Unit Number ID False
node_name 0x20060080e517b6ba FC Node Name False
queue_depth 10 Queue DEPTH True
reserve_policy single_path Reserve Policy True
ww_name 0x20160080e517b6ba FC World Wide Name False
…
# lspath -l hdisk9 -F"parent connection status path_status"
fscsi1 20160080e517b6ba,5000000000000 Enabled Available
fscsi1 20170080e517b6ba,5000000000000 Enabled Available
# lspath -AEl hdisk9 -p fscsi1 -w"20160080e517b6ba,5000000000000"
scsi_id 0x10a00 SCSI ID False
node_name 0x20060080e517b6ba FC Node Name False
priority 1 Priority True
Note: whether or not path priorities apply depends on the PCM. With SDDPCM, path priorities only apply when the algorithm used is fail over (fo). Otherwise, they aren’t used.
83
Path priorities – why change them?
• With VIOCs, send the IOs for half the LUNs to one VIOS and half to the other
–Set priorities for half the LUNs to use VIOSa/vscsi0 and half to use VIOSb/vscsi1
–Uses both VIOSs’ CPU and virtual adapters
–algorithm=fail_over is the only option at the VIOC for VSCSI disks
• With NSeries – have the IOs go to the primary controller for the LUN if not using ALUA (ALUA is preferred)
–When not using ALUA, use the dotpaths utility to set path priorities to ensure most IOs go to the preferred controller
84
Hints & Tips
Here’s a good command to determine which VIOS a vscsi adapter is connected to:
# echo "cvai" | kdb
…
NAME STATE CMDS_ACTIVE ACTIVE_QUEUE HOST
vscsi0 0x000007 0x0000000000 0x0 vios1->vhost0
vscsi1 0x000007 0x0000000000 0x0 vios2->vhost0
This will be especially useful if you have more than one vscsi adapter connected to a VIOS.
85
Path Health Checking and Recovery
Validates a path is working & automates recovery of failed paths
Note: applies to open disks only
• For SDDPCM and MPIO compliant disks, two hdisk attributes apply:
# lsattr -El hdisk26
hcheck_interval 0 Health Check Interval True
hcheck_mode nonactive Health Check Mode True
• hcheck_interval – Defines how often (1-3600 seconds) the health check is performed on the paths for a device.
When a value of 0 is selected (the default), health checking is disabled– Preferably set to at least 2X IO timeout value…often 30 seconds
• hcheck_mode– Determines which paths should be checked when the health check capability is used:
• enabled: Sends the healthcheck command down paths with a state of enabled • failed: Sends the healthcheck command down paths with a state of failed• nonactive: (Default) Sends the healthcheck command down paths that have no active I/O,
including paths with a state of failed. If the algorithm selected is failover, then the healthcheck command is also sent on each of the paths that have a state of enabled but have no active IO. If the algorithm selected is round_robin, then the healthcheck command is only sent on paths with a state of failed, because the round_robin algorithm keeps all enabled paths active with IO.
• Consider setting up error notification for path failures (later slide)
86
Path Recovery
• MPIO will recover failed paths if path health checking is enabled with
hcheck_mode=nonactive or failed and the device has been opened
• Trade-offs exist:
– Lots of path health checking can create a lot of SAN traffic
– Automatic recovery requires turning on path health checking for each LUN
– Lots of time between health checks means paths will take longer to recover after repair
– Health checking for a single LUN is often sufficient to monitor all the physical paths, but not to recover them
• SDD and SDDPCM also recover failed paths automatically
• In addition, SDDPCM provides a health check daemon to provide an automated method of reclaiming failed paths to a closed device.
• To manually enable a failed path after repair or re-enable a disabled path:
# chpath -l hdisk1 -p <parent> -w <connection> -s enable
or run cfgmgr or reboot
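With many failed paths after a repair, the chpath commands can be generated from lspath output rather than typed one by one. This is a sketch: the -F"name parent connection status" field order is an assumption, the sample lines are hard-coded, and the commands are printed for review, not executed.

```shell
# Emit a "chpath ... -s enable" command for every path whose status is Failed.
gen_enables() {
  awk '$4 == "Failed" {
    printf "chpath -l %s -p %s -w %s -s enable\n", $1, $2, $3
  }'
}

# Sample lines standing in for: lspath -F"name parent connection status"
printf '%s\n' \
  'hdisk1 fscsi0 203900a0b8478dda,f000000000000 Failed' \
  'hdisk1 fscsi1 201800a0b8478dda,f000000000000 Enabled' | gen_enables
```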
87
Path Recovery With Flaky Links
• When a path fails, it takes AIX time to recognize it, and to redirect in-flight IOs previously sent
down the failed path
– IO stalls during this time, along with processes waiting on the IO
– Turning off a switch port results in a 20 second stall
• Other types of failures may take longer
– AIX must distinguish between slow IOs and path failures
• With flaky paths that go up and down, this can be a problem
• The MPIO timeout_policy attribute for hdisks addresses this for command timeouts
– IZ96396 for AIX 7.1, IZ96302 for AIX 6.1
– timeout_policy=retry_path Default and similar to before the attribute existed. The first occurrence of a command timeout on the path does not cause immediate path failure.
– timeout_policy=fail_path Fail the path on the first command timeout; recover it only after several clean health checks
– timeout_policy=disable_path Disable the path and leave it that way
• Manual intervention will be required so be sure to use error notification in this case
• SDDPCM recoverDEDpath attribute – similar to timeout_policy but for all kinds of path errors
– recoverDEDpath=no Default and failed paths stay that way
– recoverDEDpath=yes Allows failed paths to be recovered
– SDDPCM V2.6.3.0 or later
88
Path Health Checking and Recovery – Notification!
• One should also set up error notification for path failure, so that someone knows
about it and can correct it before something else fails.
• This is accomplished by determining the error that shows up in the error log when a
path fails (via testing), and then
• Adding an entry to the errnotify ODM class for that error which calls a script (that
you write) that notifies someone that a path has failed.
Hint: You can use # odmget errnotify to see what the entries (or stanzas) look like,
then you create a stanza and use the odmadd command to add it to the errnotify
class.
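The hint above can be sketched as follows. The error label SC_DISK_PCM_ERR8 comes from the error table on the next slide, but the en_name, the notification script path, and the choice of label are examples only; use whatever label your own failure testing shows in errpt, and the odmadd step is printed rather than run.

```shell
# Build an errnotify stanza that calls a notification script on path failure.
# $1 passed to en_method is the error log sequence number.
cat > /tmp/path_notify.add <<'EOF'
errnotify:
    en_name = "path_fail_notify"
    en_persistenceflg = 1
    en_label = "SC_DISK_PCM_ERR8"
    en_method = "/usr/local/bin/notify_path_fail.sh $1"
EOF
echo "then load it with: odmadd /tmp/path_notify.add"   # odmadd not run here
```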
89
Notification, cont’d
Path and Fibre Channel Related Errors (examples)
# errpt -t | egrep "PATH|FCA"
02A8BC99 SC_DISK_PCM_ERR8 PERM H PATH HAS FAILED
080784A7 DISK_ERR6 PERM H PATH HAS FAILED
13484BD0 SC_DISK_PCM_ERR16 PERM H PATH ID
14C8887A FCA_ERR10 PERM H COMMUNICATION PROTOCOL ERROR
1D20EC72 FCA_ERR1 PERM H ADAPTER ERROR
1F22F4AA FCA_ERR14 TEMP H DEVICE ERROR
278804AD FCA_ERR5 PERM S SOFTWARE PROGRAM ERROR
2BD0BD1A FCA_ERR9 TEMP H ADAPTER ERROR
3B511B1A FCA_ERR8 UNKN H UNDETERMINED ERROR
40535DDB SC_DISK_PCM_ERR17 PERM H PATH HAS FAILED
7BFEEA1F FCA_ERR4 TEMP H LINK ERROR
84C2184C FCA_ERR3 PERM H LINK ERROR
9CA8C9AD SC_DISK_PCM_ERR12 PERM H PATH HAS FAILED
A6F5AE7C SC_DISK_PCM_ERR9 INFO H PATH HAS RECOVERED
D666A8C7 FCA_ERR2 TEMP H ADAPTER ERROR
DA930415 FCA_ERR11 TEMP H COMMUNICATION PROTOCOL ERROR
DE3B8540 SC_DISK_ERR7 PERM H PATH HAS FAILED
E8F9BA61 CRYPT_ERROR_PATH INFO H SOFTWARE PROGRAM ERROR
ECCE4018 FCA_ERR6 TEMP S SOFTWARE PROGRAM ERROR
F29DB821 FCA_ERR7 UNKN H UNDETERMINED ERROR
FF3E9550 FCA_ERR13 PERM H DEVICE ERROR
90
# errpt -atJ FCA_ERR4
---------------------------------------------------------------------------
IDENTIFIER 7BFEEA1F
Label: FCA_ERR4
Class: H
Type: TEMP
Loggable: YES   Reportable: YES   Alertable: NO
Description
LINK ERROR
Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
SENSE DATA
91
Options for Error Notification in AIX
• ODM-Based
errdemon program uses the errnotify ODM class for error notification
• diag Command Diagnostics
The diag command package contains a periodic diagnostic procedure called diagela. Hardware (only) errors generate mail messages to members of the system group, or other email
addresses, as configured.
• Custom Notification
Write a shell script to check the error log periodically
• Concurrent Error Logging
Start errpt -c and each error is then reported when it occurs. Can redirect output to the console to notify the operator.
Storage Area Network (SAN) Boot
92
Boot Directly from SAN
� Storage is zoned directly to the client
� HBAs used for boot and/or data access
� Multi-path code for the storage runs in client
SAN Sourced VSCSI Boot
� Affected LUNs are zoned to VIOS(s) and mapped to client from the VIOS
� VIOC uses AIX PCM
� VIOS uses multi-path code specified by the storage
� Two layers of multi-path code
[Diagrams: AIX multipath code with dual FC adapters direct to the SAN; AIX MPIO over dual VSCSI adapters through two VIOSs; AIX multipath code over virtual FC adapters through two NPIV-enabled VIOSs]
NPIV Boot
� Affected LUNs are zoned to VIOS(s) and mapped to client from the VIOS
� VIOC uses multi-path code specified by the storage
Monitoring, Measuring and Basic IO Tuning for SAN Storage from AIX
93
Monitoring SAN storage performance
• For random IO, look at read and write service times from
# iostat -RDTl <interval> <# intervals>
Disks: xfers read write
-------------- -------------------------------- ------------------------------------ ------------------------------------ ----------
%tm bps tps bread bwrtn rps avg min max time fail wps avg min max time fail avg
act serv serv serv outs serv serv serv outs time t
hdisk1 0.9 12.0K 2.5 0.0 12.0K 0.0 0.0 0.0 0.0 0 0 2.5 9.2 0.6 92.9 0 0 3.7 0
hdisk0 0.8 12.1K 2.6 119.4 12.0K 0.0 4.4 0.1 12.1 0 0 2.5 8.7 0.8 107.0 0 0 3.3 0
hdisk2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0 0 0.0 0
hdisk3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.0 0 0 0.0 0
hdisk4 66.4 58.9M 881.1 58.9M 28.8K 879.1 6.4 0.1 143.5 0 0 2.0 1.8 0.2 27.5 0 0 54.7 0
hdisk6 66.3 51.1M 797.4 51.0M 24.5K 795.9 7.6 0.1 570.1 0 0 1.5 1.5 0.2 32.9 0 0 51.3 0
hdisk5 61.9 55.9M 852.9 55.9M 28.5K 850.5 6.0 0.1 120.8 0 0 2.4 1.6 0.1 33.6 0 0 46.1 0
hdisk7 58.3 55.4M 843.1 55.4M 21.2K 841.9 6.7 0.1 167.6 0 0 1.3 1.3 0.2 20.8 0 0 48.3 0
hdisk8 42.6 53.5M 729.1 53.5M 3.4K 728.9 5.7 0.1 586.4 0 0 0.2 0.9 0.2 5.9 0 0 54.3 0
hdisk10 44.1 37.1M 583.0 37.0M 16.9K 582.0 3.7 0.1 467.7 0 0 1.0 1.4 0.2 12.9 0 0 23.1 0
• Misleading indicators of disk subsystem performance
• %tm_act (percent time active)
• 100% busy is not meaningful for virtual disks, but is meaningful for real physical disks
• %iowait
• A measure of CPU idle while there are outstanding IOs
• IOPS, tps, and xfers… all refer to the same thing
94
Monitoring SAN storage performance
95
# topas -D
or just press D when in topas
Avg. Write Time, Avg. Read Time, Avg. Queue Wait
Total Service Time R = Average Read Time + Average Queue Wait Time
Total Service Time W = Average Write Time + Average Queue Wait Time
Disk IO service times
96
� Multiple interface types
� ATA
� SATA
� SCSI
� FC
� SAS
“ZBR” Geometry
� If the disk is very busy, IOs will wait for IOs ahead of it
� Queueing time on the disk (not queueing in the hdisk driver or elsewhere)
Seagate 7200 RPM SATA HDD performance
97
� As IOPS increase, IOs queue on the disk and wait for IOs ahead to complete first
What are reasonable IO service times?
98
� Assuming the disk isn’t too busy and IOs are not queueing there
� SSD IO service times around 0.2 to 0.4 ms and they can do over 10,000 IOPS
What are reasonable IO service times?
• Rules of thumb for IO service times for random IO and typical disk subsystems that are not mirroring data
synchronously and using HDDs
• Writes should average <= 2.5 ms
• Typically they will be around 1 ms
• Reads should average < 15 ms
• Typically they will be around 5-10 ms
• For random IO with synchronous mirroring
  • Writes take longer: they must get to the remote disk subsystem, write to its cache, and return an acknowledgement
  • Expect 2.5 ms + round-trip latency between sites (light through fiber travels 1 km in 0.005 ms)
• When using SSDs
• For SSDs on SAN, reads and writes should average < 2.5 ms, typically around 1 ms
• For SSDs attached to Power via SAS adapters without write cache
• Reads and writes should average < 1 ms
• Typically < 0.5 ms
• Writes take longer than reads for SSDs
• What if we don’t know if the data resides on SSDs or HDDs (e.g. in an EasyTier environment)?
• Look to the disk subsystem performance reports
• For sequential IO, don't worry about IO service times; worry about throughput
• We hope IOs queue, wait, and are ready to process
99
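The synchronous-mirroring rule of thumb above works out as follows for a hypothetical 50 km site separation (the distance is assumed for illustration):

```shell
# Light through fiber covers 1 km in ~0.005 ms; a synchronous write pays one round trip.
awk -v km=50 'BEGIN {
    rtt = 2 * km * 0.005             # round-trip latency between sites, ms
    printf "round trip: %.2f ms, expected avg write: %.2f ms\n", rtt, 2.5 + rtt
}'
```

At a few hundred kilometers of separation the round trip, not the disk subsystem, dominates write service time.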
What if IO times are worse than that?
• You have a bottleneck somewhere from the hdisk driver to the physical disks
• Possibilities include:
• CPU (local LPAR or VIOS)
• Adapter driver
• Physical host adapter/port
• Overloaded SAN links (unlikely)
• Storage port(s) overloaded
• Disk subsystem processor overloaded
• Physical disks overloaded (most common)
• SAN switch buffer credits
• Temporary hardware errors
• Evaluate VIOS, adapter, adapter driver from AIX/VIOS
• Evaluate the storage from the storage side
• If write IO service times are marginal, the write IO rate is low, and the read IO rate is high, it's often not worth worrying about
• Can occur due to caching algorithms in the storage
100
What about IO size and sequential IO?

Disks:          xfers
--------------  --------------------------------
                %tm    bps    tps   bread  bwrtn
                act
hdisk4          99.6 591.4M 2327.5 590.7M 758.7K
101
• Large IOs typically imply sequential IO; check your iostat data
• bps/tps = bytes per transfer, i.e. bytes/IO
  • 591.4 MB / 2327.5 tps = 260 KB/IO, likely sequential IO
• Use filemon to examine sequentiality, e.g.:
# filemon -o /tmp/filemon.out -O all,detailed -T 1000000; sleep 60; trcstop
VOLUME: /dev/hdisk4 description: N/A
reads: 9156 (0 errs)
read sizes (blks): avg 149.2 min 8 max 512 sdev 218.2
read times (msec): avg 6.817 min 0.386 max 1635.118 sdev 22.469
read sequences: 7155*
read seq. lengths: avg 191.0 min 8 max 34816 sdev 811.9
writes: 806 (0 errs)
write sizes (blks): avg 352.3 min 8 max 512 sdev 219.2
write times (msec): avg 20.705 min 0.702 max 7556.756 sdev 283.167
write sequences: 377*
write seq. lengths: avg 753.1 min 8 max 8192 sdev 1136.7
seeks: 7531 (75.6%)*
• Here % sequential = 1 - 75.6% = 24.4%
• Perhaps multiple sequential IO threads were accessing hdisk4
• The seeks value is the number of times the actuator had to move to a different place on the disk; the smaller the number, the more sequential the IO
* Adjacent IOs are coalesced into fewer IOs
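The arithmetic behind the filemon sample can be sketched directly (filemon sizes are in 512-byte blocks; the input values are copied from the output above):

```shell
# avg read 149.2 blks, avg write 352.3 blks, seeks on 75.6% of IOs (from filemon above)
awk -v rd_blks=149.2 -v wr_blks=352.3 -v seek_pct=75.6 'BEGIN {
    printf "avg read size:  %.2f KiB\n", rd_blks * 512 / 1024
    printf "avg write size: %.2f KiB\n", wr_blks * 512 / 1024
    printf "sequential: %.2f%%\n", 100 - seek_pct
}'
```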
A situation you may see
• Note the low write rate and high write IO service times
• Disk subsystem cache and algorithms may favor disks doing sequential or heavy IO relative to disks doing limited IO or no IO for several seconds
• The idea being to reduce overall IO service times
• Varies among disk subsystems
• Overall performance impact is low due to low write rates
102
# iostat -lD
Disks: xfers read write
-------------- -------------------------------- ------------------------------------ ------------------------------------
%tm bps tps bread bwrtn rps avg min max time fail wps avg min max time fail
act serv serv serv outs serv serv serv outs
hdisk0 0.3 26.7K 3.1 19.3K 7.5K 1.4 1.7 0.4 19.8 0 0 1.6 0.8 0.6 6.9 0 0
hdisk1 0.1 508.6 0.1 373.0 135.6 0.1 8.1 0.5 24.7 0 0 0.0 0.8 0.6 1.0 0 0
hdisk2 0.0 67.8 0.0 0.0 67.8 0.0 0.0 0.0 0.0 0 0 0.0 0.8 0.7 1.0 0 0
hdisk3 1.1 37.3K 4.4 25.1K 12.2K 2.0 0.8 0.3 10.4 0 0 2.4 4.4 0.6 638.4 0 0
hdisk4 80.1 33.6M 592.5 33.6M 38.2K 589.4 2.4 0.3 853.6 0 0 3.1 6.5 0.5 750.3 0 0
hdisk5 53.2 16.9M 304.2 16.9M 21.5K 302.2 3.0 0.3 1.0S 0 0 2.0 16.4 0.7 749.3 0 0
hdisk6 1.1 21.7K 4.2 1.9K 19.8K 0.1 0.6 0.5 0.8 0 0 4.0 2.7 0.6 495.6 0 0
Basic AIX IO Tuning
103
Introduction to AIX IO Tuning
104
• Tuning IO involves removing logical bottlenecks in the AIX IO stack
• Requires some understanding of the AIX IO stack
• General rule: increase buffers and queue depths so no IOs wait unnecessarily due to lack of a resource, but don't send so many IOs to the disk subsystem that it loses IO requests
• Four possible situations:
  1. No IOs waiting unnecessarily
     • No tuning needed
  2. Some IOs are waiting and IO service times are good
     • Tuning will help
  3. Some IOs are waiting and IO service times are poor
     • Tuning may or may not help
     • Poor IO service times indicate a bottleneck further down the stack, typically at the storage
     • Often needs more storage resources or storage tuning
  4. The disk subsystem is losing IOs and IO service times are bad
     • Leads to IO retransmissions, error-handling code, blocked IO, stalls, and crashes
Filesystem and Disk Buffers
105
# vmstat -v
…
0 pending disk I/Os blocked with no pbuf
171 paging space I/Os blocked with no psbuf
2228 filesystem I/Os blocked with no fsbuf
66 client filesystem I/Os blocked with no fsbuf
17 external pager filesystem I/Os blocked with no fsbuf
• Numbers are counts of temporarily blocked IOs since boot; the rate is what matters (use uptime):
  blocked count / uptime = rate of IOs blocked per second
• Low rates of blocking imply less improvement from tuning
• For pbufs, use lvmo to increase pv_pbuf_count (see the next slide)
• For psbufs, stop paging (add memory or use less of it) or add paging spaces
• For filesystem fsbufs, increase numfsbufs with ioo
• For external pager fsbufs, increase j2_dynamicBufferPreallocation with ioo
• For client filesystem fsbufs, increase nfso's nfs_v3_pdts and nfs_v3_vm_bufs (or the NFSv4 equivalents)
• Run # ioo -FL to see defaults, current settings, and what's required for the changes to take effect
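A quick sketch of the rate calculation, using assumed numbers (the 2228 fsbuf-blocked IOs from the vmstat sample above, and a hypothetical 30 days of uptime):

```shell
# blocked count / uptime in seconds = blocked IOs per second
awk -v blocked=2228 -v uptime_days=30 'BEGIN {
    printf "blocked IOs per second: %.6f\n", blocked / (uptime_days * 86400)
}'
```

At well under one blocked IO per second, tuning fsbufs would buy little in this example.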
Disk Buffers
106
# lvmo -v rootvg -a
vgname = rootvg
pv_pbuf_count = 512             Number of pbufs added when one PV is added to the VG
total_vg_pbufs = 512            Current pbufs available for the VG
max_vg_pbuf_count = 16384       Max pbufs available for this VG; takes effect at the next varyon of the VG
pervg_blocked_io_count = 1243   Delayed IO count since last varyon for this VG
pv_min_pbuf = 512               Minimum number of pbufs added when a PV is added to any VG
global_blocked_io_count = 1243  System-wide delayed IO count for all VGs and disks

# lvmo -v rootvg -o pv_pbuf_count=1024    Increases pbufs for rootvg; the change is dynamic
• Check disk buffers for each VG
FC adapter port tuning
107
• The num_cmd_elems attribute controls the maximum number of in-flight IOs for the FC port
• The max_xfer_size attribute controls the maximum IO size the adapter will send to the storage, and also sizes a DMA memory area that holds IO data
  • Doesn't apply to virtual adapters
  • The default memory area is 16 MB at the default max_xfer_size=0x100000
  • The memory area is 128 MB for any other allowable value
  • This cannot be changed dynamically; it requires stopping use of the adapter port
# lsattr -El fcs0
DIF_enabled no DIF (T10 protection) enabled True
bus_intr_lvl Bus interrupt level False
bus_io_addr 0xff800 Bus I/O address False
bus_mem_addr 0xffe76000 Bus memory address False
bus_mem_addr2 0xffe78000 Bus memory address False
init_link auto INIT Link flags False
intr_msi_1 209024 Bus interrupt level False
intr_priority 3 Interrupt priority False
lg_term_dma 0x800000 Long term DMA True
max_xfer_size 0x100000 Maximum Transfer Size True
num_cmd_elems 200 Maximum number of COMMANDS to queue to the adapter True
pref_alpa 0x1 Preferred AL_PA True
sw_fc_class 2 FC Class for Fabric True
tme no Target Mode Enabled True
FC adapter port queue depth tuning
Determining when to change the attributes
108
# fcstat fcs0
FIBRE CHANNEL STATISTICS REPORT: fcs0
Device Type: 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)
…
World Wide Port Name: 0x10000000C99C184E
…
Port Speed (supported): 8 GBIT
Port Speed (running): 8 GBIT
…
FC SCSI Adapter Driver Information <- Look at this section: numbers are counts of blocked IOs since boot
No DMA Resource Count: 452380 <- increase max_xfer_size for large values
No Adapter Elements Count: 726832 <- increase num_cmd_elems for large values
No Command Resource Count: 342000 <- increase num_cmd_elems for large values
…
FC SCSI Traffic Statistics
…
Input Bytes: 56443937589435
Output Bytes: 4849112157696
# chdev -l fcs0 -a num_cmd_elems=4096 -a max_xfer_size=0x200000 -P <- requires reboot
fcs0 changed
• Calculate the rate at which IOs are blocked:
  # blocked / uptime (or time since the adapter was made Available)
• The higher the rate of blocked IOs, the bigger the improvement from tuning
• If you've increased num_cmd_elems to its maximum value, increased max_xfer_size, and still get blocked IOs, you need another adapter port for more bandwidth
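Those counters only matter as a rate, so one way to read them is to divide by time up. The awk sketch below uses the counter lines from the fcstat sample above and an assumed 14 days (1209600 s) since the adapter came online:

```shell
awk -v up=1209600 '
    /No DMA Resource Count/     { printf "no-DMA rate:     %.3f/s\n", $NF/up }  # raise max_xfer_size
    /No Command Resource Count/ { printf "no-command rate: %.3f/s\n", $NF/up }  # raise num_cmd_elems
' <<'EOF'
No DMA Resource Count: 452380
No Command Resource Count: 342000
EOF
```

In real use you would pipe `fcstat fcs0` into the awk program instead of the here-document; rates like these (a blocked IO roughly every three seconds) would justify tuning.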
VIO IO tuning with VSCSI, NPIV, and SSP
109
• VSCSI
  • IOs from the vscsi adapter driver are DMA-transferred via the hypervisor to the hdisk driver in the VIOS
  • iostat statistics at the VIOS show these IOs
  • Set the VIOC queue_depth to <= the VIOS queue_depth for the LUN
    • Requires unmapping/remapping the disk, or rebooting the VIOS
  • Tune both queue_depths together, or set the VIOS queue_depth high and tune only the VIOC queue_depth
    • Ensures no blocking of IOs at the VIOC hdisk driver
• NPIV
  • IOs from the vFC adapter driver are DMA-transferred via the hypervisor to the FC adapter driver in the VIOS
  • iostat statistics at the VIOS do not capture these IOs
  • fcstat statistics at the VIOS do capture these IOs
  • NMON also captures this data at the VIOS
  • Higher blocked IO rates mean tuning will yield a bigger performance improvement
  • If you've increased num_cmd_elems to its maximum value, increased max_xfer_size, and still get blocked IOs, you need another adapter port for more bandwidth
VSCSI adapter queue depth sizing
hdisk queue depth   Max hdisks per vscsi adapter*
3 (default)         85
10                  39
24                  18
32                  14
64                  7
100                 4
128                 3
252                 2
256                 1
110
• VSCSI adapters also have a queue, but it's not tunable, and there is no tool to show how full the queue on the adapter driver is getting
• We ensure we don't run out of VSCSI queue slots by limiting the number of hdisks using the adapter, and their individual queue depths
• Adapter queue slots are a resource shared by the hdisks
• Max hdisks per adapter = INT[510 / (queue_depth + 3)]; for mixed queue depths, keep the sum of (queue_depth + 3) over all hdisks <= 510
  Note: the vscsi adapter has space for 512 command elements, of which 2 are used by the adapter itself, 3 are reserved for each VSCSI LUN for error recovery, and the rest are available for IO requests
• You can exceed these limits to the extent that the average service queue size is less than the queue depth
* To assure no blocking of IOs at the vscsi adapter
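The table above follows directly from the formula; a small loop reproduces it, assuming every hdisk on the adapter uses the same queue depth:

```shell
# 510 usable command elements / (queue_depth + 3 reserved per LUN) = max hdisks
for qd in 3 10 24 32 64 100 128 252 256; do
    printf "queue_depth %3d -> max %2d hdisks per vscsi adapter\n" "$qd" $(( 510 / (qd + 3) ))
done
```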
NPIV adapter tuning
111
• The real adapter's queue slots and DMA memory area are shared by the vFC NPIV adapters
• Tip: set num_cmd_elems to its maximum value and max_xfer_size to 0x200000 on the real FC adapter for maximum bandwidth, to avoid having to tune it later. Some configurations won't allow this and will produce errors in the error log or devices showing up as Defined.
• Only tune num_cmd_elems for the vFC adapter, based on fcstat statistics
112
The End
Questions?