PowerVM AIX Monitoring Mitec W210A 2&3

76
PowerVM/AIX Monitoring MITEC Session W 210A-2 & 3 Steve Nasypany [email protected]

description

PowerVM Monitoring

Transcript of PowerVM AIX Monitoring Mitec W210A 2&3

  • PowerVM/AIX Monitoring

    MITEC Session W 210A-2 & 3

    Steve [email protected]

  • 2013 IBM Corporation2

    Agenda

    What This Presentation Covers PowerVM Review Monitoring Questions Topology/System Aggregation VIOS Summary Metrics/Tools

    CPU Memory IO

  • 2013 IBM Corporation3

    There are a plethora of Performance Management and Monitoring products and free software. Listings and links are provided for reference later in this presentation.

    This presentation does not focus on any product solution you should review your requirements and contact the vendors that interest you. Many higher end products have trial versions or can demonstrate theirfunctionality in person. Many of the free products have vast capabilities but require study, experimentation and configuration work.

    This presentation is focused on the basic metrics that should be monitored for assessing performance, and where you can find them in AIX

    What This Presentation Covers

  • 2013 IBM Corporation4

    Logical Partitions (LPARs) are defined to be dedica ted or shared Shared partitions use whole or fractions of CPUs (smallest increment is 0.1, can be greater than 1.0) Dedicated partitions use whole number of CPUs, traditional Unix environment Dedicated-donating can donate free cycles to a shared pool to allow higher capacity

    Shared Processor Pools are a subset or all of the p hysical CPUs in a system. Desire is to virtualize all partitions to maximize exploitatio n of physical resources.

    Shared Pool partitions run on Virtual Processors (V P). Each Virtual Processor maps to one physical processing unit in capacity (a physica l core in AIX).

    Capacity is expressed in the form of a number of 10 % CPU units, called Entitlement Entitlement value for an active partition is a guarantee of processing resources Sum of partition entitlements must be less than or equal to physical resources in shared pool Desired: Desired size of partition at boot time Minimum: Partition will start will less than desired, but wont start if Minimum not available Maximum: Dynamic changes to desired cannot exceed this capacity

    Capped vs Uncapped Capped: Entitlement is a hard limit, similar to a dedicated partitions Uncapped: Capacity is limited by unused capacity in pool and number of Virtual Processors a

    partition has configured

    When the pool is constrained, Variable Capacity Wei ght setting provides automatic load balancing of cycles over entitlement

    PowerVM Virtualization Definitions Review

  • 2013 IBM Corporation5

    How do we monitor for shared CPU pool constraints? AIX provides metrics to show physical and entitlement utilization on

    each partition AIX optionally provides the amount of the shared pool that is idle -

    when you are out of shared pool, you are constrained

    What about Hypervisor metrics? There are really no metrics that see into the hypervisor layer Metrics like %hypv in AIX lparstat are dominated by idle time, as idle

    cycles are ceded to they hypervisor layer

    This presentation will help you with determining what performance metrics are important at the partition and frame level

    Questions

  • 2013 IBM Corporation6

    What product is best for interactive and short-term analysis of AIX resources? nmon provides access to most of the important metrics required for

    benchmarks, proof-of-concepts and regular monitoring CPU vmstat, sar, lparstat, mpstat Memory vmstat, svmon Paging vmstat Hdisk iostat, sar, filemon Adapter iostat, fcstat Network entstat, netstat Process ps, svmon, trace tools (tprof, curt) Threads ps, trace tools (curt)

    nmon Analyser & Consolidator provide free and simple trend reports Not provided in AIX/nmon

    Java/GC must use java tools Transaction times application specific Database database products

    Questions

  • 2013 IBM Corporation7

    Ideal for benchmarks, proof-of-concepts and problem analysis

    Allows high-resolution recordings to be made while in monitoring mode Records samples at the interactive monitoring rate AIX 5.3 TL12 & AIX 6.1 TL05

    Usage Start nmon, use [ and ] brackets to start and end a recording

    Records standard background recording metrics, not just what is on screen.

    You can adjust the recorded sampling interval with-s [seconds] on startupInteractive options - and + ( +) do NOT change ODR interval

    Generates a standard nmon recording of format:__.nmon

    Tested with nmon Analyser v33C, and works fine

    nmon On Demand Recording (ODR)

  • 2013 IBM Corporation8

    nmon ODR

  • 2013 IBM Corporation9

    The majority of commercial products supporting PowerVM/AIX monitoring provide access to all of the important virtualization metrics

    Metric naming is a problem. In many cases, different implementations display the same data, but have different naming conventions. Any customer wanting to evaluate these products will have to have an AIX specialist study the products metric definitions to map apples-to-apples.

    Many products pre-package user interface views that aggregate these metrics

    But usage modes for recording, post-processing, capacity planning can all vary

    Many customers are using products in older dedicated modes and are not aware of differences in monitoring virtualized systems

    Customers interested in a particular solution should evaluate each product for complete support. Most, if not all, of these products are available for trial evaluation.

    Currency/Naming

  • 2013 IBM Corporation10

    AIX LPARKPX_memrepage_Info

    KPX_vmm_pginwait_InfoKPX_vmm_pgfault_Info

    KPX_vmm_pgreclm_Info

    KPX_vmm_unpin_low_Warn KPX_vmm_pgout_pend_InfoKPX_Pkts_Sent_Errors_InfoKPX_Sent_Pkts_Dropped_InfoKPX_Pkts_Recv_Errors_Info KPX_Bad_Pkts_Recvd_InfoKPX_Recv_pkts_dropped_Info KPX_Qoverflow_Info

    KPX_perip_InputErrs_InfoKPX_perip_InputPkts_Drop_InfoKPX_perip_OutputErrs_Info KPX_TCP_ConnInit_InfoKPX_TCP_ConnEst_Info

    KPX_totproc_cs_Info KPX_totproc_runq_avg_InfoKPX_totproc_load_avg_Info KPX_totnum_procs_InfoKPX_perproc_IO_pgf_Info KPX_perproc_nonIO_pgf_InfoKPX_perproc_memres_datasz_InfoKPX_perproc_memres_textsz_InfoKPX_perproc_mem_textsz_Info KPX_perproc_vol_cs_Info

    KPX_Active_Disk_Pct_InfoKPX_Avg_Read_Transfer_MS_InfoKPX_Read_Timeouts_Per_Sec_InfoKPX_Failed_Read_Per_Sec_InfoKPX_Avg_Write_Transfer_MS_InfoKPX_Write_Timeout_Per_Sec_InfoKPX_Failed_Writes_Per_Sec_InfoKPX_Avg_Req_In_WaitQ_MS_InfoKPX_ServiceQ_Full_Per_Sec_InfoKPX_perCPU_syscalls_Info KPX_perCPU_forks_InfoKPX_perCPU_execs_Info

    KPX_perCPU_cs_Info

    KPX_Tot_syscalls_Info

    KPX_Tot_forks_Info

    KPX_Tot_execs_InfoKPX_LPARBusy_pct_WarnKPX_LPARPhyBusy_pct_Warn KPX_LPARvcs_Info

    KPX_LPARfreepool_Warn KPX_LPARPhanIntrs_InfoKPX_LPARentused_Info KPX_LPARphyp_used_Info

    KPX_user_acct_locked_Info KPX_user_login_retries_InfoKPX_user_idletime_Info

    VIOSKVA_memrepage_InfoKVA_vmm_pginwait_InfoKVA_vmm_pgfault_InfoKVA_vmm_pgreclm_InfoKVA_vmm_unpin_low_Warn KVA_vmm_pgout_pend_Infov Networking KVA_Pkts_Sent_Errors_Info KVA_Sent_Pkts_Dropped_InfoKVA_Pkts_Recv_Errors_Info KVA_Bad_Pkts_Recvd_InfoKVA_Recv_pkts_dropped_InfoKVA_Qoverflow_InfoKVA_Real_Pkts_Dropped_Info KVA_Virtual_Pkts_Dropped_InfoKVA_Output_Pkts_Dropped_Info KVA_Output_Pkts_Failures_InfoKVA_Mem_Alloc_Failures_WarnKVA_ThreadQ_Overflow_Pkts_Info KVA_HA_State_InfoKVA_Times_Primary_Per_Sec_Info KVA_perip_InputErrs_InfoKVA_perip_InputPkts_Drop_Info KVA_perip_OutputErrs_InfoKVA_TCP_ConnInit_InfoKVA_TCP_ConnEst_Infov Process KVA_totproc_cs_InfoKVA_totproc_runq_avg_Info KVA_totproc_load_avg_InfoKVA_totnum_procs_InfoKVA_perproc_IO_pgf_Info KVA_perproc_nonIO_pgf_InfoKVA_perproc_memres_datasz_InfoKVA_perproc_memres_textsz_Info KVA_perproc_mem_textsz_InfoKVA_perproc_vol_cs_InfoKVA_Firewall_InfoKVA_memrepage_InfoKVA_vmm_pginwait_InfoKVA_vmm_pgfault_InfoKVA_vmm_pgreclm_InfoKVA_vmm_unpin_low_Warn KVA_vmm_pgout_pend_Infov Networking KVA_Pkts_Sent_Errors_Info KVA_Sent_Pkts_Dropped_InfoKVA_Pkts_Recv_Errors_Info KVA_Bad_Pkts_Recvd_InfoKVA_Recv_pkts_dropped_InfoKVA_Qoverflow_InfoKVA_Real_Pkts_Dropped_InfoKVA_Virtual_Pkts_Dropped_InfoKVA_Output_Pkts_Dropped_Info KVA_Output_Pkts_Failures_InfoKVA_Mem_Alloc_Failures_WarnKVA_ThreadQ_Overflow_Pkts_Info KVA_HA_State_InfoKVA_Times_Primary_Per_Sec_Info KVA_perip_InputErrs_InfoKVA_perip_InputPkts_Drop_Info KVA_perip_OutputErrs_InfoKVA_TCP_ConnInit_InfoKVA_TCP_ConnEst_Infov Process

    VIOS (cont)KVA_totproc_cs_Info

    KVA_totproc_runq_avg_Info KVA_totproc_load_avg_Info

    KVA_totnum_procs_Info

    KVA_perproc_IO_pgf_Info KVA_perproc_nonIO_pgf_InfoKVA_perproc_memres_datasz_InfoKVA_perproc_memres_textsz_InfoKVA_perproc_mem_textsz_Info KVA_perproc_vol_cs_Info

    KVA_Firewall_Info

    KVA_Active_Disk_Pct_Info KVA_Avg_Read_Transfer_MS_InfoKVA_Read_Timeouts_Per_Sec_InfoKVA_Failed_Read_Per_Sec_InfoKVA_Avg_Write_Transfer_MS_InfoKVA_Write_Timeout_Per_Sec_Info

    KVA_Failed_Writes_Per_Sec_InfoKVA_Avg_Req_In_WaitQ_MS_InfoKVA_ServiceQ_Full_Per_Sec_Info

    KVA_perCPU_syscalls_Info

    KVA_perCPU_forks_Info

    KVA_perCPU_execs_Info

    KVA_perCPU_cs_Info

    KVA_Tot_syscalls_Info KVA_Tot_forks_InfoKVA_Tot_execs_Info

    KVA_LPARBusy_pct_Warn KVA_LPARPhyBusy_pct_Warn

    KVA_LPARvcs_Info

    KVA_LPARfreepool_Warn

    KVA_LPARPhanIntrs_Info

    KVA_LPARentused_Info

    KVA_LPARphyp_used_Info KVA_user_acct_locked_InfoKVA_user_login_retries_Info

    KVA_user_idletime_Info

    HMC

    KPH_Busy_CPU_Info

    KPH_Paging_Space_Full_Info

    KPH_Disk_Full_Warn

    KPH_Runaway_Process_InfoThe

    Currency/Naming Example (Tivoli)

  • 2013 IBM Corporation11

    Excerpts from Agent definitions:

    BYLS_CPU_PHYS_TOTAL_UTIL: On AIX, this metric is eq uivalent to sum of BYLS_CPU_PHYS_USER_MODE_UTIL and BYLS_CPU_PHYS_SYS_MODE_UTIL. For AIX lpars, the metric is calculated with respect to the available physica l CPUs in the pool to which this LPAR belongs to

    BYLS_LS_MODE This metric indicates whether the CPU entitlement for the logical system is Capped or Uncapped.

    BYLS_LS_SHARED This metric indicates whether the ph ysical CPUs are dedicated to this logical system or shared.

    GBL_CPU_PHYS_TOTAL_UTIL: The percentage of time the available physical CPUs were not idle for this logical system during the interval. This m etric is calculated as GBL_CPU_PHYS_TOTAL_UTIL =GBL_CPU_PHYS_USER_MODE_UTIL + GBL_CPU_SYS_MODE_UTIL

    GBL_POOL_CPU_AVAIL: The available physical processo rs in the shared processor pool during the interval.

    GBL_POOL_CPU_ENTL: The number of physical processor s available in the shared processor pool to which this logical system belongs. On AIX SPLPAR, this metric is equivalent to "Active Physical CPUs in system" field of 'lparstat -i' command. On a standalone system, the value is "na".

    GBL_POOL_TOTAL_UTIL: Percentage of time, the pool C PU was not idle during the interval.

    Currency/Naming Example (HP)

  • 2013 IBM Corporation12

    AIX Service Tool PERFPMR:ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr/

    IBMs PM for Power SystemsSubscription reports for capacity planninghttp://www-03.ibm.com/systems/power/support/pm/index.html

    The AIX Community Wikihttp://www.ibm.com/systems/p/community/

    AIX Wiki, use links under Performance to find nmon, nmon Analyser and other tools:

    https://www.ibm.com/developerworks/community/wikis/home?lang=en#/wiki/Power%20Systems/page/AIX

    IBM AIX Links

  • 2013 IBM Corporation13

    This is not an all-inclusive list, search for the VIOS Recognized here:http://www-304.ibm.com/partnerworld/wps/pub/systems/power/solutions

    ATS Group Galileo http://www.theatsgroup.com/ BMC ProactiveNet Performance Management

    http://www.bmc.com/products/product-listing/ProactiveNet-Performance-Management.html

    CA http://www.ca.com HP GlancePlus http://www.hp.com/go/software

    Metron Athene http://www.metron-athene.com/index.htmll Orsyp Sysload http://www.orsyp.com/products/software/sysload.html Power Navigator http://www.mpginc.com SAP AIX CCMS Agents http://www.sap.com Teamquest

    http://www.teamquest.com/products-services/full-product-services-list/index.htm

    ISV Monitoring Products

  • 2013 IBM Corporation14

    CEC Central Electronics Complex (System)

    ProcessorsCPU type, frequency, count

    # of processors in the shared pool, pool utilization

    Total CEC utilizationNumber of unused processors

    I/OSet of storage adaptersSet of network adapters

    Other adaptersVirtual I/O connections

    MemoryTotal amount of memory in CEC,

    Memory allocated to partitionsUnallocated memory

    PartitionsActive partitions,

    operating systems,names and typesinactive partitions

    System Topology

  • 2013 IBM Corporation15

    Virtualization allows and even encourages that a system contains multiple independent partitions, all with resources

    Ideally, monitoring tools will easily determine all of the active partitions on a system and organize the data for those partitions together Normally, this requires consulting with an HMC to identify the active

    partitions on a system With mobile partitions, the tools must accommodate the movement of

    partitions between systems If automatic detection of partitions to systems isnt possible, some means

    of organizing them manually must be applied

    Various products do automatic CEC aggregation

    Nmon Analyzer and Consolidator allow manual aggregation

    The most important metric for CPU pool monitoring is the Available Pool Processor (APP) value. It represents the amount of a pool not currently consumed, and can be retrieved from individual partitions or calculated by aggregators

    System Aggregation

  • 2013 IBM Corporation16

    We do not distinguish metrics between regular shared partitions and Virtual I/O Server partitions

    All the metrics have the same meanings nmon recording works on both We treat the VIOS just like wed treat dedicated partitions hosting adapters As many VIOS host Shared Ethernet devices, we will cover this information in the

    network section

    Properly configured, a VIOS system can perform as well and achieve the same throughputs as a dedicated system

    We tune the OS, adapters, queue depths exactly the same All IBM storage devices have Redbooks with best practices/performance sections See Best Practices section for VIOS recommendations

    If you have IO performance issues, review the VIOS first Make sure you arent out of CPU Make sure you arent out of memory

    A persistent complaint from customers is that there are no breakdowns of NPIV clients activity on VIOS, and each client must be individually monitored. We are working on it.

    VIOS Performance Overview

  • 2013 IBM Corporation17

    VIOS overhead is not a problem

    Virtual SCSI architecture between client and server is simple, a read is a read and a write is a write. These are in-memory transfers.

    CPU overhead is a tiny fraction of the overall I/O time VIOS cpu entitlements are typically fractions of a physical core, more Virtual

    Processors are better than running VIOS dedicated Always run production VIOS uncapped

    Monitor the same CPU and memory statistics covered later in this presentation

    Adjusting the VIO servers Variable Capacity Weighting (biasing a partitions access to free cycles when the shared pool is constrained) is advised for heavier production IO workloads

    Hypervisor will allocate private memory for adapters depending on type and number. Allocations supported by the System Planning Toolhttp://www.ibm.com/systems/support/tools/systemplanningtool/

    VIOS CPU & Memory

  • 2013 IBM Corporation18

    CPU Number of CPUs

    Dedicated Shared pool Unallocated

    Entitlement settings Utilization %

    Dedicated consumed Pool consumed (alternatively, entitlement consumed or free)

    Memory Allocated, in use, computational, non-computational Unallocated

    IO Aggregated adapter totals read, write, IOPS, MB/sec

    Metrics: CEC

  • 2013 IBM Corporation19

    HMC -> System Management -> Server -> Operations -> Utilization Data

    Change Sampling Rate 30 seconds, 1/5/30 minutes,

    1 hourView

    Snapshot, Hourly, Daily, Monthly

    Beginning and ending time/date

    All, Utilization sample, Configuration change, Utility CoD usage

    Maximum number of events

    HMC Performance Information

  • 2013 IBM Corporation20

    System Utilization Views System summary Partition Processor/Memory Physical Processor, Shared Process or Shared Memory Pool

    HMC Performance Information

  • 2013 IBM Corporation21

    Unix CPU buckets utilization%User: Fraction of entitlement consumed in user program%System: Fraction of entitlement consumed running in the AIX kernel

    The time in the kernel could be from system calls made by user programs, interrupts, or kernel daemons

    Sometimes %system is used as an indicator of health. If system becomes very high, it *might* mean that a problem exists. But, some subsystems such as the NFS server code run in system mode, so this is not always a good indicator

    %Idle: Fraction of entitlement that the partition was not running any processes or threads

    %I/O wait: Fraction of the entitlement that the partition was not running any processes or threads, and there was disk I/O outstanding

    %I/O wait is also sometimes used as a health indicator. However, like system time, it is not always a good indicator due to the nature of the programs that may run

    %User + %System + %Idle + %I/O wait = 100% of a fixed entitlement

    In the dedicated world, the sum of %user + %system is the utilization percentage or busytime of the dedicated partition

    Physical Busy = (%User + %System) X (# of dedicated cpus)

    Metrics: Dedicated CPU Partition

  • 2013 IBM Corporation22

    CPU buckets and cpu busy percentages no longer tell us physical utilization

    This is because the number of physical cores being utilized is no longer a fixed whole number, and can always be changing

    User and System percentages are relative to consumption A logical CPU can be 99% busy on 0.01 to 1.0 physical cores

    For shared partitions, you must understand new metrics Entitlement: physical resources guaranteed to a shared partition

    Can range from 0.1 to maximum number of physical cores When entitlement is not being used, it can be ceded back to the pool

    Entitlement Consumed (reported as entc% or ec%) Physical Consumed (reported as physc or pc in different tools)

    Entitlement Consumed Capped partitions can only go to 100% Uncapped partitions can go over 100%

    How far depends on the number of Virtual Processors (VP) Active VP count defines # of physical cores a partition can consume If a partition has an entitlement of 0.5 and 4 VPs, the maximum entitlement

    consumed possible is 800% (4.0/0.5 = 8), and maximum physical consumption is 4.0

    Metrics: PowerVM Shared CPU Partition

  • 2013 IBM Corporation23

    POWER6 vs POWER7 SMT Utilization

    Simulating a single threaded process on 1 core, 1 Virtual Processor, utilization values change. In each of these cases, physical consumption can be reported as 1.0.

    Real world production workloads will involve dozens to thousands of threads, so many users may not notice any difference in the macro scale

    Whitepapers on POWER7 SMT and utilizationSimultaneous Multi-Threading on POWER7 Processors by Mark Funkhttp://www.ibm.com/systems/resources/pwrsysperf_SMT4OnP7.pdfProcessor Utilization in AIX by Saravanan Devendranhttps://www.ibm.com/developerworks/mydeveloperworks/wikis/home?lang=en#/wiki

    /Power%20Systems/page/Understanding%20CPU%20utilization%20on%20AIX

    busy

    idle

    POWER6 SMT2

    Htc0

    Htc1

    100% busy

    busy

    idle

    POWER7 SMT4

    ~65% busy

    idle

    idle

    busy

    idle

    POWER7 SMT2

    ~70% busy

    busy

    busy

    100% busy

    busy

    busy

    100% busy

    Htc0

    Htc1

    Htc0

    Htc1

    Htc0

    Htc1

    Htc0

    Htc1

    Htc2

    Htc3

    busy = user% + system%

  • 2013 IBM Corporation24

    SMT, Dispatch Behavior & Consumption

    0

    1

    2 3

    SMT Thread

    Primary

    Secondary

    Tertiary

    POWER7 processors can run in Single-thread, SMT2, SMT4 modes Like POWER6, the SMT threads will dynamically

    adjust based on workload SMT threads dispatch via a Virtual Processor (VP) POWER7 threads start with different priorities on

    Primary, Secondary and Tertiary instances, but can be equally weighted for highly parallel workloads

    POWER5 and POWER6 overstate utilization as the CPU utilization algorithm does not account for how many SMT threads are active One or both SMT threads can fully consume a physical core and utilization is 100% On POWER7, a single thread cannot exceed ~65% utilization. Values are calibrated

    in hardware to provide a linear relationship between utilization and throughput

    When core utilization reaches a certain threshold, a Virtual Processor is unfolded and work begins to be dispatched to another physical core

  • 2013 IBM Corporation25

    POWER6 vs POWER7 Dispatch

    There is a difference between how workloads are distributed across cores in POWER7 and earlier architectures

    In POWER5 & POWER6, the primary and secondary SMT threads are loaded to ~80% utilization before another Virtual Processor is unfolded

    In POWER7, all of the primary threads (defined by how many VPs are available) are loaded to at least ~50% utilization before the secondary threads are used. Once the secondary threads are loaded, only then will the tertiary threads be dispatched. This is referred to as Raw Throughput mode.

    Why? Raw Throughput provides the highest per-thread throughput and best response times at the expense of activating more physical cores

    POWER6 SMT2busy

    idle

    POWER7 SMT4

    ~50% busy

    idle

    idle

    busy

    busy

    ~80% busy

    Htc0

    Htc1

    Htc0

    Htc1

    Htc2

    Htc3

    Another Virtual Processor is activated at the utilization values below (both systems may have a reported physical consumption of 1.0):

    VirtualProcessorActivate

  • 2013 IBM Corporation26

    POWER6 vs POWER7 Dispatch

    proc0 proc1 proc2 proc3

    Once a Virtual Processor is dispatched, the Physical Consumption metric will typically increase to the next whole number

    Put another way, the more Virtual Processors you assign, the higher your Physical Consumption is likely to be

    proc0 proc1 proc2 proc3

    Primary

    Secondary

    Tertiaries

    POWER7

    POWER6 PrimarySecondary

  • 2013 IBM Corporation27

    POWER7 may activate more cores at lower utilization levels than earlier architectures when excess Virtual Processors are present

    Customers may complain that the physical consumption (physc or pc) metric is equal to or possibly even higher after migrations to POWER7 from earlier architectures. They may also note that CPU capacity planning is more difficult in POWER7 (discussion to follow)

    Expect every POWER7 customer with this complaint to also have significantly higher idle% percentages over earlier architectures

    Expect that they are consolidating workloads and may also have many more VPs assigned to the POWER7 partition.

    POWER7 Consumption: A Problem?

  • 2013 IBM Corporation28

    Because POWER5 and POWER6 SMT utilization will always be at or above 80% before another VP is activated, utilization ratios (80% or 0.8 of a core) and physc of 1.0 core may be closer to each other than POWER7 environments

    Physical Consumption alone was close enough for capacity planning in POWER5/POWER6 and many customers use this

    This may not be true in POWER7 environments when excess VPs are present

    Under the default Raw throughput mode, customers that do not want to reduce VPs may want to deduct higher idle buckets (idle + wait) from capacity planning metric(s)

    Physical Busy = (User + System)% X Reported Physica l Consumption

    This is reasonable presuming the workload benefits from SMT. This will not work with single-threaded hog processes that want to consume a full core.

    AIX 6.1 TL8 & AIX 7.1 TL2 offer an alternative VP activation mechanism known as Scaled Throughput. This can provide the option to make POWER7 behavior more POWER6 Like but this is a generalized statement and not a technical one.

    POWER7 Consumption: Capacity Planning

  • 2013 IBM Corporation29

    Metrics: Shared Pool Monitoring, Step 1

    The most important metric in PowerVM Virtualization monitoring is the Available Pool Processor (APP) value. This represents the current number of free physical core resources in the shared pool.

    Only partitions with this collection setting can display this pool value it is obtained via a hypervisor mechanism

    The topas C command calculates this value automatically because it collects utilization values from each AIX instance.

    Change is dynamic and takes effect immediately for lparstat (various products may not see the value until they restart or recycle for recording agents, typically at beginning of day)

  • 2013 IBM Corporation30

    [email protected] /tmp # lparstat -h 1 4

    System configuration: type=Shared mode=Capped smt=O n lcpu=4 mem=4096 psize=2 ent=0.40

    %user %sys %wait %idle physc %entc lbusy app vcsw phint %hypv hcalls

    ----- ---- ----- ----- ----- ----- ------ --- ---- ----- ---- - ------

    84.9 2.0 0.2 12.9 0.40 99.9 27.5 1.59 521 2 13.5 2093

    86.5 0.3 0.0 13.1 0.40 99.9 25.0 1.59 518 1 13.1 490

    Shows the number of available processors in the shared pool. The shared pool psize is 2 processors. Must set Allow performance information collection. View the properties for a partition and click the Hardware tab, then Processors andMemory.

    Shows the percentage of entitled capacity consumed. For a capped partition the percentage will not exceed 100%; however, for uncapped partitions the percentage can exceed 100%.

    Shows the number of physical processors consumed. For a capped partition this number will not exceed the entitled capacity. For an uncapped partition this number could match the number of processors in the shared pool; however, this my be limited based on the number of on-line Virtual Processors.

    app

    %entcShared Mode Only

    physc

    Pool sizeLogical CPUs

    Entitlement

    Metrics: APP in lparstat

  • 2013 IBM Corporation31

    Metrics: nmon APP and UtilizationCPU and Entitlement Consumption

    Available Pool

    p display option

  • 2013 IBM Corporation

    Available PoolLocal LPAR Physical Used

    Other LPAR(s) Utilization of Pool =

    Pool Size APP Local Utilization

    Analyser can only see:

    - Pool Size

    - APP

    - Local Utilization

    LPAR(s) View: nmon Analyser

  • 2013 IBM Corporation

    View depends on partitions selected

    Breakdowns by dedicated and shared partitions

    LPAR(s) View: nmon Consolidator

  • 2013 IBM Corporation

    Dashes represent data not available at OS level

    Can be provided via command-line

    Topas can be configured to collect via ssh to HMC

    AIX 5.3 TL-08 topas supports

    View of each shared pool, if AIX partitions

    CPU cycle donations made by dedicated partitions

    topas CUpper section displays aggregated CEC informationLower section displays shared/dedicated data closely mimics lparstat

    Topas CEC Monitor Interval: 10 Thu Jul 28 17:04:57 2006Partition Info Memory (GB) ProcessorMonitored : 6 Monitored :24.6 Monitored :1. 2 Shr Physical Busy: 0.30 UnMonitored: - UnMonitored: - UnMonitored: - Ded P hysical Busy: 2.40Shared : 3 Available :24.6 Available : -Dedicated : 3 UnAllocated: 0 UnAllocated: - HypervisorCapped : 1 Consumed : 2.7 Shared :1. 5 Virt. Context Switch: 632Uncapped : 1 Dedicated : 5 Phantom Interrupts : 7

    Pool Size : 3Avail Pool :2.7

    Host OS M Mem InU Lp Us Sy Wa Id PhysB Ent %EntC Vcsw PhI-------------------------------------shared-------- -------------------ptoolsl3 A53 c 4.1 0.4 2 14 1 0 84 0.08 0.50 15.0 208 0ptoolsl2 A53 C 4.1 0.4 4 20 13 5 62 0.18 0.50 36.5 219 5ptoolsl5 A53 U 4.1 0.4 4 5 0 0 95 0.04 0.50 7.6 205 2

    ------------------------------------dedicated------ -------------------ptoolsl1 A53 S 4.1 0.5 4 20 10 0 70 0.30ptoolsl4 A53 4.1 0.5 2 100 0 0 0 2.00ptoolsl6 A52 4.1 0.5 1 5 5 12 88 0.10

    APP = Pool Size -Shared Physical Busy

    Topas Partitions & CEC View

  • 2013 IBM Corporation35

    Run queue length is another well known metric of CPU usage It refers to the number of software threads that are ready to run, but have to wait

    because the CPU(s) is/are busy or waiting on interrupts The length is sometimes used as a measure of health, and long run queues usually

    mean worse performance, but many workloads can vary dramatically It is quite possible for a pair of single-threaded workloads to contend for a single

    physical resource (batch, low run queue, bad performance) while dozens of multi-threaded workloads share it (OLTP, high run queue, good performance)

    In a dedicated processor partition or a capped shared processor partition, the run queue length will increase with higher CPU utilization

    The longer the run queue length, the more time that software threads must wait for CPU, increasing response times and generally degrading the end user experience

    In an uncapped shared processor partition, the run queue length may not increase with higher consumption

    Extra capacity of the shared processor pool can be used This assumes shared pool resources are available and the partition has adequate

    Virtual Processors assigned to fluctuate with the demand Because run queue is not a consistent indicator depending on workload type and the ability

    of shared partitions to vary physical CPU resources, run queue is no longer considered a good generic indicator

    Tools output vmstat reports global run queue as r mpstat reports per logical CPU run queue as rq

    Metrics: Partition CPU

  • 2013 IBM Corporation36

    Context switches The number of times a running entity was stopped and replaced by another Collected for Threads (operating system) and Virtual Processors (hypervisor) There are voluntary and involuntary context switches

    How Many context switches are Too Many? No rules of thumb exist Voluntary: Not an issue because it means no work for the CPU Involuntary: Could be an issue, but generally the bottleneck will materialize in a

    easier to diagnosis metric; such as, CPU utilization, physical consumption, entitlement consumed, run queue

    How can context switch metrics be used? Establish a baseline and compare when system encounter performance problems When benchmarking or performing a PoC, if these values blow up, this is

    indicative of software scaling issues (architecture, latches, locks, etc)

    Tool outputs vmstat reports total context switches as cs mpstat reports total cs and involuntary ics lparstat reports virtual processor context switches as vcsw

    Metrics: Partition CPU

  • 2013 IBM Corporation37

    Other operating systems focus only on total real used and unusedmemory. This is not enough for AIX.

    Because AIX memory maps files, the length of the free list (or number of free pages in the partition) is not usually a good indicator of memory utilization If software does not actually use memory pages it requested, VMM does

    not allocate them AIX does not actively scrub memory. It scans and frees pages

    depending on demand The free list will likely spike when a large process exits, as those pages

    will become free Default AIX memory tunings typically result in total memory use being

    reported as near 100%

    Partition Memory (AIX)

  • 2013 IBM Corporation38

    In AIX, you must understand the difference between these types:

    Computational Code and process/kernel working storage This metric is not available visible to the HMC

    File Cache (sometimes referred to as non-computational) File: legacy JFS file pages in memory Client: other file pages (JFS2, NFS, etc) in memory May be broken out in some tools or reported as one file cache value Cache will not be cleared from file systems until other IO workloads

    write over them or the file systems are remounted

    Computational% is the only memory metric that reall y matters in AIX. Computational memory ALWAYS has priority over file cache until you reach Computational = 97%

    Metrics: Partition Memory (AIX)

  • 2013 IBM Corporation39

    When computational memory becomes short, or the memory management is mistuned, paging occurs. Paging is the worst consequence of memory problems, so it should be monitored. Metric names used in vmstat: pi: Reads from paging devices po: Writes to paging devices sr & fr: Pages scanned and freed to satisfy new allocation request. A

    persistently high scan rate (tens of thousands of pages per second) and low free rate typically indicate the system is struggling to allocate memory.

    All physical paging is bad. If you are not near the computational threshold, you may need APARs specific to 4K or 64K memory page defects.

    Advanced features like Active Memory Sharing and Active Memory Expansion (AME) require a more complex set of metrics AME adds metrics for pages in (ci) and out (co) out of a compressed pool If a requested expansion factor is too aggressive, a memory hole can be

    created and may result in paging

    Metrics: Partition Memory (AIX)

  • 2013 IBM Corporation40

    How can I figure out process memory? The best tool is the svmon command Use O filters to view

    Process or user breakdowns of cache, kernel, shared and private memory It is most important to filter out kernel and shared segments all sophisticated

    software products share kernel extension, shared/memory mapped areas (SGA, PGA, etc), libraries and some code text

    Ive used nmon, topas or svmon and I cant account for all the used memory. Where is it?

    Likely you a lot of file inode & metadata cached This is only available via the proc file system. Worst case, these areas in AIX 6.1

    can take up to 20% of system memory (tuned lower in AIX 7.1)

    # cat /proc/sys/fs/jfs2/memory_usagemetadata cache: 254160896 (values in bytes)inode cache: 57212928total: 311373824

    Metrics: Process & inode

  • 2013 IBM Corporation41

    The number of threads blocked waiting for a file sy stem I/O operation to complete.bkthr

    The number of threads blocked waiting for a raw dev ice I/O operation to complete.p

    The number of pages scanned sr and the number of p ages stolen (or freed) fr. The ratio of scanned to freed represents relative memory activity. The ratio will start at 1 and increase as memory contention increases. Exa mine Sr # of pages to steal Fr pages. Note: Interrupts are disabled a t times when lrud is running.

    fr / sr

    Page Space Page In and Page Space Page Out per seco nd, which represents paging.pi / po

    File pages In and File pages Out per second, which represents I/O to and from a file system. fi / fo

    Page

    The number of frames of memory on the free list. N ote: A frame refers to physical memory vs. a page w hich refers to virtual memory.

    fre

    The number of active virtual memory pages, which re presents computational memory requirements. The ma ximum avm number divided by number of real memory frames equa ls the computational memory requirement.

    avmMem

    ory

    # vmstat -I 1

    System configuration: lcpu=2 mem=912MB

    kthr memory page faul ts cpu -------- ----------- ------------------------------ -- ---------- -----------

    r b p avm fre fi fo pi po fr sr in sy cs us sy i d wa1 1 0 139893 2340 12288 0 0 0 0 0 200 25283 496 77 16 0 71 1 0 139893 1087 4503 0 8 733 3260 126771 415 9291 440 82 15 0 33 0 0 139893 1088 9472 0 1 95 9344 100081 191 19414 420 77 20 0 31 1 0 139893 1087 12547 0 6 0 12681 13407 207 25762 584 71 21 0 71 2 0 140222 1013 6110 1 39 0 6169 6833 160 15451 471 83 11 0 51 2 0 139923 1087 6976 0 31 2 7062 7599 183 19306 544 79 14 0 7

    Key Points:

    Computational (avm)

    Paging Rates

    Scanning Rates

    Memory Monitoring with vmstat

  • 2013 IBM Corporation42

    Where is computational memory? vmstat

    # vmstat -v1048576 memory pages Total Real Memory

    992384 lruable pages Memory addressable by lrud668618 free pages Free List

    1 memory pools One lrud per memory pool152370 pinned pages

    80.0 maxpin percentage3.0 minperm percentage

    90.0 maxperm percentage11.4 numperm percentage

    113546 file pages JFS-only, or reports client if no JFS11.4 numclient percentage % file cache if JFS2 only90.0 maxclient percentage

    113546 client pages JFS2/NFS

    25.4 percentage of memory used for computational pages

    New in AIX 6.1 TL06

  • 2013 IBM Corporation43

    Where is computational memory? svmon

    # svmon -Gsize inuse free pin virtual

    memory 233472 125663 107809 108785 140123pg space 262144 54233

    work pers clnt lpagepin 67825 0 0 4096 0in use 79725 536 4442 0

    Size: Total # of Memory Frames (Frames are in 4K units)

    Inuse: # of Frames in Use

    Free: # of Frames on Free List

    Pin: # of pinned Frames

    Virtual: Computational

    Working (or computational) memory = 140123

    %Computational = virtual/size = 140213 / 233472 = 60%

    Pers (or JFS file cache) memory = 536

    Clnt (or JFS2 and NFS file cache) memory = 4442

  • 2013 IBM Corporation44

    Where is computational memory? topas

    Metrics related to Memory monitoring

    %Comp Working memory

    %Noncomp, %Client File memory

  • 2013 IBM Corporation45

    nmon Memory (m) + Kernel (s)

    Computational

    Run Queue + Process Context Switches

    Scans & Frees (Steals)

  • 2013 IBM Corporation

    Where is computational memory? nmon Analyser

    Computational%

    nmon Analyser cannot graph computational rates over 100% (physical paging)

    System+Process = Computational

  • 2013 IBM Corporation47

    Cache(vmstat v client,

    svmon pers + clnt)

    Available Pool(lparstat)

    Context Switches(vmstat, mpstat)

    Run Queue(vmstat, sar q)

    User/System/Idle/Wait(vmstat, lparstat, sar)

    Dedicated PagingRAMShared

    Memory CPU

    Pages In/Out(vmstat, vmstat -s)

    In Use(lspa a, svmon)

    Total Size(lsps -a)

    Scan Rate & Free Rate(vmstat sr & fr)

    Computational(vmstat avm, vmstat v,

    svmon virtual)

    Physical consumedEntitlement consumed (lparstat, vmstat, sar)

    Total Size(vmstat v memory

    svmon size)

    Nmon, nmon Analyser and topas provide all of these metrics

    Metrics: CPU & Partition Memory/Paging

  • 2013 IBM Corporation48

    The following sample threshold tables are intended to be examples and not standard recommendations by IBM

    We do not maintain or advise one set of thresholds for all environments

    Thresholds

  • 2013 IBM Corporation49

    Sample Thresholds: CPU

    Possible entitlement under-sizing

    > 100% for 60 minutes, 3X dayEntitlement %

    Workload dependentRun Queue

    Highly variable, monitor for unusual spikes

    Workload dependentContext Swithces

    Pool size >= 4 cores

    Ceiling = round up func()

    < CEILING(#cores in pool x .10)

    for 10 minutes

    Available Pool

    Adjusts for idle time in physical consumed, you must compute this

    80% of lpar VPs

    #VP x .80 = physb_thresh

    Physical busy

    cpu busy% x physc

    90% of lpar Virtual Processors

    #VP x .90 = physc_thresh

    Physical consumed

    (physc or pc)

    relative to physical consumption in shared partitions

    Dedicated > 80%

    Shared n/a

    cpu busy%

    (user+sys)%

    ExplanationThresholdMetric

  • 2013 IBM Corporation50

    Sample Thresholds: Memory

    AIX 6.1 will use all free memfor file cache (~100% in use)

    noneFree Memory %

    Useful for determining what percentage of memory is just cache

    noneFile Cache %

    Any persistent paging indicates over computational rate or defect

    AnyPhysical Paging in/out

    Potential to crash> 80% of page spacePhysical Paging %

    High filecache turnover, likely nearing 97% computational

    > 8:1 and scanning > 10K pages/sec for 10 minutes

    Scan:Free ratio

    Assumes AIX 6.1 Physical paging begins at 97%

    Green: < 80%Yellow: 80-92%

    Red: > 92%

    Computational %

    ExplanationThresholdMetric

  • 2013 IBM Corporation51

    We need to know how many I/Os are being read/written and the total bytes read/written to know where we are relative to bandwidth

    nmon, topas and iostat provide aggregate values, but not relative to bandwidth

    Many tools/products support this at the hdisk layer, and can become very complicated on large systems with hundreds or thousands of hdisks.

    Many tools/products collect adapter statistics, but understanding their relationship to physical adapters in a virtualized environment can get complicated. For customers using newer features like NPIV, there is no simple aggregated view available within AIX at the VIOS level at this time

    Metrics: Disk/Storage Activity

  • 2013 IBM Corporation52

    %utilization (or %busy) of devices and adapters

    While device utilization is a commonly used metric, it is not always a good metric of quality of service, it can indicate for simple devices (SCSI) or imply for more complex devices (e.g. FC-attached LUNs) performance issues

    Disk utilization measures the fraction of time that AIX has I/Os outstanding to a device. A very fast device (say a RAID5 4+1P LUN on DS4800) can process many I/Os per second. Even if a LUN is 100% busy, it may be offering good response time and be capable of processing more requests per second

    These metrics are good for one thing sorting between hundreds or thousands of active vs in-active disks

    Metrics: Disk/Storage Activity

  • 2013 IBM Corporation53

    Response & service times

    Provide a much better view of whether I/Os are delayed. There are two commonly used measures of response time.

    One is the time the I/O is queued at the device. This measures the responsiveness of the device to I/Os.

    Another is the time the I/O is queued for service, which might include time queued in the operating system or device driver. If a large number of I/Os are launched at a device, the queued time may become an important metric.

    AIX is well instrumented for these metrics now and they are the primary means for assessing storage performance

    Metrics: Disk/Storage Activity

  • 2013 IBM Corporation54

    If IO service times are reasonably good, but queues are getting filled up, then Increase queue depths until:

    You arent filling the queues or IO service times start degrading (bottleneck at disk)

    For hdisks, queue_depth controls the maximum number of in-flight IOs

    For FC adapters, num_cmd_elems controls maximum number of in-flight IOs

    Drivers for hdisks and adapters have service and wait queues When the queue is full and an IO completes, then another is issued

    Tools used on partitions to identify queue issuesSDDPCM:# pcmpath query devstats

    # pcmpath query adaptstats SDD: # datapath query devstats

    # datapath query adaptstats iostat: # iostat D fcstat: # fcstat fcs*

    Queue Depths

  • 2013 IBM Corporation55

    hdisk1 xfer: %tm_act bps tps bread bwrtn

    87.7 62.5M 272.3 62.5M 823.7

    read: rps avgserv minserv maxserv timeouts fails

    271.8 9.0 0.2 168.6 0 0

    write: wps avgserv minserv maxserv timeouts fails

    0.5 4.0 1.9 10.4 0 0

    queue: avgtime mintime maxtime avgwqsz avgsqsz sqfull

    1.1 0.0 14.1 0.2 1.2 60

    Virtual adapters extended throughput report (-D)

    Metrics related to transfers (xfer:)tps Indicates the number of transfers per second issued to the adapter.recv The total number of responses received from the hosting server to this adapter.sent The total number of requests sent from this adapter to the hosting server.partition id The partition ID of the hosting server, which serves the requests sent by this adapter.

    Adapter Read/Write Service Metrics (read:)avgserv Indicates the average time. Default is in milliseconds.minserv Indicates the minimum time. Default is in milliseconds.maxserv Indicates the maximum time. Default is in milliseconds.

    Adapter Wait Queue Metrics (wait:)avgtime Indicates the average time spent in wait queue. Default is in milliseconds.mintime Indicates the minimum time spent in wait queue. Default is in milliseconds.maxtime Indicates the maximum time spent in wait queue. Default is in milliseconds.avgwqsz Indicates the average wait queue size.qvgsqsz Indicates the average service queue size Waiting to be sent to the disk.sqfull Indicates the number of times the service queue becomes full.

    Cant exceed queue_depth for the disk

    If this is often > 0, then increase queue_depth

    All D outputs are rates, except sqfull, which is an interval delta. Recent APARs change this to rate.

    Service times. For SAN environment,

    Reads > 10 msec & writes > 2 msec are high

    Average IO Sizes

    read = bread/rps

    write = bwrtn/wps

    Use l option

    For wide output

    hdisk service times: iostat

  • 2013 IBM Corporation56

    nmon adapter (a) + hdisk busy (o) and detail (D)

    D

    DDD

    DDDD

  • 2013 IBM Corporation57

    # fcstat fcs0FC SCSI Adapter Driver Information No DMA Resource Count: 4490

  • 2013 IBM Corporation58

    Fibre channel adapter attributes:# lsattr -El fcs0bus_intr_lvl 8355 Bus interrupt level Falsebus_io_addr 0xffc00 Bus I/O address Falsebus_mem_addr 0xf8040000 Bus memory address Falseinit_link al INIT Link flags Trueintr_priority 3 Interrupt priority Falselg_term_dma 0x1000000 Long term DMA Truemax_xfer_size 0x100000 Maximum Transfer Size Truenum_cmd_elems 200 Maximum number of COMMANDS to queue to t he adapter Truepref_alpa 0x1 Preferred AL_PA Truesw_fc_class 2 FC Class for Fabric True

    The max_xfer_size attribute also controls a DMA memory area used to hold data for transfer, and at the default is 16 MB

    Changing to other allowable values increases it to 128 MB and increases the adapters bandwidth Usually not required unless adapter is pushing many 10s MB/sec Change to 0x200000 based on guidance from Redbooks or tools

    This can result in a problem if there isnt enough memory on PHB chips in the IO drawer with too many adapters/devices on the PHB

    Make the change and reboot check for Defined devices or errors in the error log, and change back if necessary

    NPIV and virtual FC adapters the DMA memory area is 128 MB at 6.1 TL2 or later

    FC Adapter Attributes

  • 2013 IBM Corporation59

    Option a for all adapters or ^ for FC adapters

    nmon Analyser shows IO over time

    nmon FC Monitoring

  • 2013 IBM Corporation60

    fcstat NPIV Monitoring NEW!

    New breakdown by World Wide Port Name fcstat -n wwpn device_name Displays the statistics on a virtual port level that is specified by the

    worldwide port number (WWPN) of the virtual adapter.

    #fcstat -n C050760547E90000

    FIBRE CHANNEL STATISTICS REPORT: fcs0

    Device Type: 8Gb PCI Express Dual Port FC Adapter ( df1000f114108a03) (adapter/pciex/df1000f114108a0)Serial Number: 1B03205232Option ROM Version: 02781135ZA: U2D1.10X5World Wide Node Name: 0xC050760547E90000World Wide Port Name: 0xC050760547E90000FC-4 TYPES:

    Supported: 0x00000120000000000000000000000000000000 00000000000000000000000000Active: 0x00000100000000000000000000000000000000 00000000000000000000000000

    Class of Service: 3Port Speed (supported): 8 GBITPort Speed (running): 8 GBITPort FC ID: 0x010f00Port Type: FabricSeconds Since Last Reset: 431494

    Transmit Statistics Receive Statistics------------------- ------------------

    Frames: 2145085 1702630Words: 758610432 187172864

    http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=/p7hcg/fcstat.htm

  • 2013 IBM Corporation61

    ~750 MB/sn/a57744 Gbps FC adapter (dual) PCI-e

    750 MB/s per port simplex, 997 MB/s duplex per port1475 MB/s simplex per adapter, 2000 MB/s duplex per

    142,00057358 Gbps FC dual port PCI-e

    930 MB/s per port simplex, 1900 MB/s per port duplex1630 MB/s simplex per adapter, 2290 MB/s duplex per adapter

    150,000570810 Gb FCoE PCIe Dual Port

    400 MB/s simplex, ~750 MB/s duplexn/a57734 Gbps FC adapter PCI-e

    DDR slots: ~750 MB/s, SDR slots: ~500 MB/s

    n/a57594 Gbps FC adapter (dual)

    DDR slots: 400 MB/s simplex, ~750 MB/s duplex, SDR slots: 400 MB/s simplex, 500 MB/s duplex

    n/a57584 Gbps FC adapter (single port)

    198 MB/s simplex, 385 MB/s duplex38,46157162 Gbps FC adapter (single port)

    Sustained Sequential b/wIOPS 4KFCAdapter

    Adapter Performance Chart

  • 2013 IBM Corporation62

    Network monitoring is different than disk monitoring, as it is not a master/slave model Disk requests follow a start/end model Network requests may be all send, all receive, or any arbitrary mix

    Measure bandwidth A 1GB Ethernet link can sustain at most 125MB/sec of bandwidth If a link approaches the bandwidth limit, it is likely a point of resource

    contention Knowledge of topology is important to identify if the links are 100MB,

    1GB, or 10GB per second

    Metrics: Network Monitoring

  • 2013 IBM Corporation63

    Measure packet rates. Each adapter will be limited to some maximum number of packets per second for small packets (though the cumulative bandwidth may still approach the wire limits)

    Related health metrics may include: Packets dropped (a variety of reasons exist), collision errors, timeout

    errors, etc These health metrics may imply topology issues or logical resource limits VIOS Performance Advisor will report on these errors

    Metrics: Network Monitoring

  • 2013 IBM Corporation64

    How do I know network capacity? You should be able to reach 70-80% of line speeds with 1 Gb but 10 Gb

    may require special tuning (beyond tcp/send receive, rfc1323, etc) On VIOS, if you are not near limits

    Review CPU and Memory, always run uncapped Review netstat s/-v and entstat for any errors Review CPU, Memory and network on problem client

    On 10 GB, if you are driving hundreds of thousands of tiny/smallpackets/sec at 1500 MTU, or have very short latency requirements, tuning will be required Two POWER7 cores will be required to reach 10Gb Large Send/Receive, mtu_bypass (assuming AIX clients only) Virtual buffer tunings (backup slide) Nodelay Ack, nagle, dog threads In extreme environments, using dedicated-donating mode on VIOS

    If tuning exhausted and still have issues, then likely network environment or APARs required

    Network Capacity

  • 2013 IBM Corporation65

    Network requirements are the same for CPU, regardless of operation as dedicated or shared lpar Driving 10 Gb adapters to capacity for workloads can take two physical

    cores Larger packets, larger MTU sizes dramatically decrease CPU utilization

    Integrated Virtual Ethernet vs Shared Ethernet IVE is the performance solution, will take more memory Shared ethernet 1 Gb performance is competitive with IVE, but 10 Gb

    performance is more limited for small MTU receives (~5-6 Gb/sec) without tuning

    Virtual Ethernet (Switch) Reliable < 1 ms latency times, but driving 1 Gb at normal packet sizes will

    consume up to two physical cores Virtual switch within hypervisor is not designed to scale to 10 Gb at 1500

    MTU for a single session (usually gated by core throughput for single session)

    VIOS Network

  • 2013 IBM Corporation66

    Throughput is a function of entitled capacity

    *POWER5 1.65 GHz

    VIOS Network: CPU Capacity = Throughput

    Effect of 'capping' on VIO TCP/IP throughput

    020,00040,00060,00080,000

    100,000120,000140,000

    1 2 4 8 16 32 64 128

    256

    512

    1024

    2048

    4096

    8192

    1638

    4

    Packet Size (Bytes)

    Thro

    ughp

    ut

    (KB

    ytes

    /Sec

    ond)

    1.0

    0.9

    0.8

    0.7

    0.6

    0.5

    0.4

    0.3

    0.2

    0.1

    CPU Entitlement(capped)

  • 2013 IBM Corporation67

    Review physical adapter values seastat nmon/topas entstat/netstat VIOS Performance Advisor

    Check the virtual adapter Check CPU utilization Shared Ethernet

    lsattr El en# entstat topas (AIX 6.1)

    Shared Ethernet Tools

  • 2013 IBM Corporation68

    SEA Monitoring: seastat

    Accounting must first be enabled per devicechdev -dev ent* -attr accounting=enabled

    Command line for seastatseastat -d -c [-n | -s search_criterio n=value]

    shared adapter device-c clears per-client SEA statistics-n displays name resolution on the IP addresses-s search values

    MAC address (mac)VLAN id (vlan)IP address (ip)Hostname (host)Greater than bytes sent (gbs)Greater than bytes recv (gbr)Greater than packets sent (gps)Greater than packets recv (gpr)Smaller than bytes sent (sbs)Smaller than bytes recv (sbr)Smaller than packets sent (sps)Smaller than packets recv (spr)

  • 2013 IBM Corporation69

    SEA Monitoring: seastat$ seastat d ent5

    =================================================== =============================Advanced Statistics for SEADevice Name: ent5=================================================== =============================MAC: 32:43:23:7A:A3:02----------------------VLAN: NoneVLAN Priority: NoneHostname: mob76.dfw.ibm.comIP: 9.19.51.76Transmit Statistics: Receive Stat istics:-------------------- -------------------Packets: 9253924 Packets: 112 75899Bytes: 10899446310 Bytes: 64519 56041=================================================== =============================MAC: 32:43:23:7A:A3:02----------------------VLAN: NoneVLAN Priority: NoneTransmit Statistics: Receive Stat istics:-------------------- -------------------Packets: 36787 Packets: 349 2188Bytes: 2175234 Bytes: 27220 7726=================================================== =============================MAC: 32:43:2B:33:8A:02----------------------VLAN: NoneVLAN Priority: NoneHostname: sharesvc1.dfw.ibm.comIP: 9.19.51.239Transmit Statistics: Receive Stat istics:-------------------- -------------------Packets: 10 Packets: 644 762Bytes: 420 Bytes: 48476 4292

  • 2013 IBM Corporation70

    SEA Monitoring: topas -E Usage

    chdev dev ent* attr accounting=enabled

    topas E or from topas screen hit E

    Topas Monitor for host: P7_1_vios1 Interval: 2 Wed D ec 15 10:09:13 2010=================================================== ====================Network KBPS I-Pack O-Pack KB-In KB-Outent6 (SEA PRIM) 38.7 5.0 29.0 1.8 36.9|\--ent0 (PHYS) 19.6 4.0 14.0 1.7 17.9|\--ent5 (VETH) 19.2 1.0 15.0 0.1 19.0\--ent4 (VETH CTRL) 0.1 0.0 3.5 0.0 0.1lo0 2.7 14.0 14.0 1.3 1.3

    Note: In order for this tool to work on a Shared Ethernet Adapter, the state of the layer-3 device (en) cannot be in the defined state. If you are not using the layer-3 device on the SEA, the easiest way to change the state of the device is to change one of its parameters. The following command will change the state of a Shared Ethernet Adapters layer-3 device, without affecting bridging.chdev -l -a state=down

  • 2013 IBM Corporation71

    SEA Monitoring: nmon O on VIOS

  • 2013 IBM Corporation

    Packet Counts

    Throughput in KB/sec

    nmon Analyser

    V34a SEA & SEAPACKET tab

    SEA Monitoring: nmon O recording option

    Unfortunately, Analyser does not provide stacked gra phs for SEA aggregation views

  • 2013 IBM Corporation73

    9000 MTU1500 MTUSessionsDirectionTest

    1756 MB/s

    1733 MB/s

    1179 MB/s

    1173 MB/s

    1111 MB/s

    1076 MB/s

    Single port

    237415 TPS

    26171 TPS

    1914 MB/s

    1712 MB/s

    992 MB/s

    1015 MB/s

    1402 MB/s

    1311 MB/s

    Both ports

    182062 TPS 150

    13324 TPS111 byte message

    TCP_Request& Response

    2176 MB/s1527 MB/s4

    2106 MB/s1439 MB/s 1duplex

    1393 MB/s925 MB/s4

    1393 MB/s785 MB/s1receive

    1668 MB/s1068 MB/s4

    1647 MB/s870 MB/s1sendTCP STREAM

    Both portsSingle port

    Host: P7 750 4-way, SMT-2, 3.3 GHz, AIX 5.3 TL12, dedicated LPAR, dedicated adapter

    Client: P6 570 with two single-port 10 Gb (FC 5769), point-to-point wiring (no ethernet switch)

    1Single session 1/TPS round trip latency is 75 microseconds, default ISNO settings, no interrupt coalescing

    AIX 6.1 should do better with SMT4, Disk I/O will do better due to larger blocks/buffers

    FC over Ethernet Adapter (10Gb FC5708)

  • 2013 IBM Corporation74

    %IO relative to bandwidth (esitimate from adapter rates)

    %IO relative to bandwidth(estimate from adapter rates)

    Avg Service Time(s)(iostat D, sar d)

    Service Queue Full or Wait(iostat -D, sar -d)

    Read/Write IOPS & KB(iostat)

    %busy, IO/sec, KB/sec(iostat, sar -d)

    HdiskNetworkAdapter

    StorageAdapter

    IO

    Packet errors, drops, timeouts(netstat)

    Service Queue Counters(fcstat, MPIO pkg commands)

    Send/Receive MB(entstat, netstat)

    Read/Write Bytes(iostat -as, fcstat)

    Send/Receive Packets(entstat, netstat)

    IO/sec(iostat -as)

    nmon, nmon Analyser provide metrics with yellow bac kground

    Metrics: IO

  • 2013 IBM Corporation75

    Sample Thresholds: IO

    Indicates queue depth tuning should be reviewed

    Any persistence over 3 minHdisk queue full

    Assuming tuned SANRead > 15 msec

    Write > 2 msec

    Hdisk service time

    Use IBM ATS thruput charts80% of nominal capacityFC IOPS & Bytes

    Expectation of 15K RPM disk

    Hdisks comprising array of physical disks can go higher

    > 300 IOPS per spindle, poor service times

    Hdisk IOPS

    Useful for benchmarkingn/aHdisk KB in/out

    High packet rates typically require advanced tuning

    > 100K/sec @ 10Gb

    80% of nominal capacity

    Network Packet/sec

    Byte rates

    Busy is not a reliable relative throughput metric in modern SANs, but useful for sorting most active hdisks

    n/aHdisk %busy

    ExplanationThresholdMetric

  • 2013 IBM Corporation76

    The following are trademarks of the International B usiness Machines Corporation in the United States, other countries, or both.

    The following are trademarks or registered trademar ks of other companies.

    * All other products may be trademarks or registered trademarks of their respective companies.

    Notes : Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here. IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.All customer examples cited or described in this presentation are presented as illustrations of the manner in which some customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics will vary depending on individual customer configurations and conditions.This publication was produced in the United States. IBM may not offer the products, services or features discussed in this document in other countries, and the information may be subject to change without notice. Consult your local IBM business contact for information on the product or services available in your area.All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.Information about non-IBM products is obtained from the manufacturers of those products or their published announcements. IBM has not tested those products and cannot confirm the performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.Prices subject to change without notice. Contact your IBM representative or Business Partner for the most current pricing in your geography.

    Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.HP is a registered trademark of Hewlett-Packard Development Company in the United States, other countries, or both.Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office.IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency, which is now part of the Office of Government Commerce.

    For a complete list of IBM Trademarks, see www.ibm.com/legal/copytrade.shtml:

    *, AS/400, e business(logo), DBE, ESCO, eServer, FICON, IBM, IBM (logo), iSeries, MVS, OS/390, pSeries, RS/6000, S/30, VM/ESA, VSE/ESA, WebSphere, xSeries, z/OS, zSeries, z/VM, System i, System i5, System p, System p5, System x, System z, System z9, BladeCenter

    Not all common law marks used by IBM are listed on this page. Failure of a mark to appear does not mean that IBM does not use the mark nor does it mean that the product is not actively marketed or is not significant within its relevant market.

    Those trademarks followed by are registered trademarks of IBM in the United States; all others are trademarks or common law marks of IBM in the United States.

    Trademarks