Advanced Performance Tuning


HP-UX Advanced Performance Tuning Class

Module 1 Hardware

1) The limitations of the hardware

a) CPU speed and quantity
b) Amount of physical memory (RAM)
c) Amount of virtual memory and configuration (swap)
d) Disk type and configuration
e) Type of bus architecture and configuration

Module 2 CPU

a) What tools are there to measure CPU load?
b) CPU and process management
c) Scalability

Module 3 Process Management

a) Process creation
b) Process execution
c) Process termination
d) Kernel threads
e) Process Resource Manager

Module 4 Memory

a) Memory ranges for HP systems
b) Configurable memory parameters
c) The system memory map
d) The McKusick & Karels memory allocator
e) The Arena Allocator
f) Performance Optimized Page Sizing

Module 5 Disk I/O

a) Tools to measure disk I/O
b) Factors that affect disk I/O
c) Data configurations and their effects

Module 6 Network Performance

a) NFS performance
b) Fibre Channel performance
c) Fibre Channel Mass Storage performance

Module 7 General Tools

a) ADB - the absolute debugger
b) SAR - the System Activity Reporter
c) iostat, vmstat
d) time, timex, ipcs
e) Glance

Module 8 WTEC Tools

a) kmeminfo
b) shminfo
c) vmtrace
d) tusc

Module 1

HARDWARE

It is essential to determine the hardware configuration and limitations to set reasonable expectations of overall performance. Information about current HP-UX server configurations is available at: http://www.hp.com/products1/servers/operating/hpux.html

CPU

The current range of HP servers extends from the rp2405, with one or two 650 MHz PA-8700 CPUs and 256 MB to 8 GB of RAM, through the Superdome, with up to 64 875 MHz PA-8700 CPUs and 256 GB of RAM. The current range of workstations extends from the b2600, with a single 500 MHz PA-8600 CPU and up to 4 GB of RAM, through the j6750, with dual 875 MHz PA-8700 CPUs and up to 16 GB of RAM. There are many legacy systems running CPUs as slow as 96 MHz and with as little as 128 MB of RAM. The last of the 32-bit servers ran at a maximum processor speed of 240 MHz. 10.X systems are OS-limited to 3.75 GB of RAM.

SAM can be used to determine system properties; look under Performance Monitors -> System Properties. The available categories are Processor, Memory, Operating System, Network, and Dynamic. System properties can also be accessed via the adb command from the command line. To determine processor speed:

echo itick_per_usec/D | adb -k /stand/vmunix /dev/mem
itick_per_usec:
itick_per_usec:         650

The itick_per_usec value is the processor speed in MHz.

RAM

Current HP servers range from 256 MB to 256 GB of RAM. For the 32-bit architecture, the maximum amount of usable RAM is 4 GB, and any 32-bit process is limited to a 4 GB memory map. A 64-bit system has a 4 TB limit on addressable space. To determine physical memory:

echo phys_mem_pages/D | adb64 /stand/vmunix /dev/mem
phys_mem_pages:
phys_mem_pages:         262144

The result is expressed in 4 KB memory pages. To determine the size in megabytes: phys_mem_pages x 4096 / 1024 / 1024. NOTE: If Performance Optimized Page Sizing is implemented, be sure to adjust your calculations accordingly; more about this topic will be discussed later.
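That conversion is plain arithmetic; the sketch below works it through in C using the sample figure from the adb output above (the 262144-page value is just that sample, not a constant):

    #include <stdio.h>

    int main(void)
    {
        long pages  = 262144;   /* sample phys_mem_pages from adb above */
        long pgsize = 4096;     /* base page size in bytes */

        /* 262144 pages x 4096 bytes = 1 GB, printed as 1024 MB */
        printf("%ld MB of RAM\n", pages * pgsize / 1024 / 1024);
        return 0;
    }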


Disk Type and Configuration

The type and configuration of the system's disks is a key factor in determining the I/O speed the system is capable of. Standard disks require more RAM for buffering file system writes; most modern disk arrays have onboard memory for this. The number of controllers and the volume of read and write transactions on the disks are also major factors in overall performance. To determine specific information on the type of disks on the system, run:

ioscan -fnC disk

then

diskinfo -v /dev/rdsk/cXtXdX

These commands will let you determine the size, type, and hardware path of the system disks. The type of bus is also a determining factor; the maximum data transfer rate depends on the adapter type. Currently the fastest is the A6829A PCI Dual-Channel Ultra160 SCSI adapter. The firmware-suggested default for the A6829A adapter's maximum data transfer rate is the adapter's maximum speed (160 MB/s). The A6829A can communicate with all LVD or SE devices that have speeds up to 160 MB/s. This includes the following speeds (synchronous communication over a Wide bus):

Fast (20 MB/s)
Ultra (40 MB/s)
Ultra2 (80 MB/s)
Ultra160 (160 MB/s)

Note that the actual transfer rate between the adapter and a SCSI device depends on the transfer rate negotiated between them; the maximum speed will be that of the slowest device. As of 11i, the SCSI interface drivers for HP-PB, HSC, and EISA SCSI cards have been obsoleted.

Buses on HP-UX Servers:

PCI

This bus is spec'd at a peak of 120 MB/s. Because of the excellent performance of this bus, it is possible to have multiple high-speed SCSI and/or network cards installed on a single bus.

HSC

This bus is spec'd at 132 MB/s. Again, because of the speed of this bus, it is possible to have multiple high-speed SCSI and/or network cards installed on a single bus.

HP-PB (NIO)

The HP-PB (NIO) system bus found on many HP Servers including the T500 and H Class Servers is spec'd at 32 MB/s. Realistic performance numbers for this bus are ~10 MB/s.


Module 2

CPU

What tools can be used to determine CPU metrics? GlancePlus, top, ps -elf, sar, vmstat, and SAM. To get an idea of which processes are most CPU intensive, you can use SAM's Performance Monitors, which invokes top; top can also be run from the command line. Alternatively, use GlancePlus or ps -elf to see which processes have the highest cumulative CPU time.

SAR CPU data

The system activity reporter provides a number of useful CPU statistics. Example:

sar -Mu 5 100

This will produce 100 data points 5 seconds apart. The output will look similar to:

11:20:05     cpu    %usr    %sys    %wio   %idle
11:20:10       0       1       1       0      99
               1      17      83       0       0
          system       9      42       0      49

After all samples are taken, an average is printed. This will return data on the CPU load for each processor:

cpu - cpu number (only on a multi-processor system and used with the -M option)

%usr - This is the percentage of time spent executing code in user mode, as opposed to code within the kernel.

%sys - The percentage of time running in system or kernel mode.

%wio - idle with some process waiting for I/O (only block I/O, raw I/O, or Virtual memory pageins/swapins indicated)


%idle - other idle

To find out what the run queue load is, run:

sar -q 5 100

This will produce 100 data points 5 seconds apart. The output will look similar to:

            runq-sz  %runocc  swpq-sz  %swpocc
10:06:36        0.0        0      0.0        0
10:06:41        1.5       40      0.0        0
10:06:46        3.0       20      0.0        0
10:06:51        1.0       20      0.0        0
Average         1.8       16      0.0        0

runq-sz - Average length of the run queue(s) of processes (in memory and runnable)

%runocc - The percentage of time the run queue(s) were occupied by processes (in memory and runnable)

swpq-sz - Average length of the swap queue of runnable processes (processes swapped out but ready to run)

These CPU reports can be combined using sar -Muq. Typically the %usr value will be higher than %sys. If the system is making many read/write transactions this may not be true, as these are system calls. Out-of-memory errors can occur when excessive CPU time is given to system versus user processes; these can also be caused by an undersized maxdsiz. As a rule, we should expect to see %usr at 80% or less and %sys at 50% or less; values higher than these can indicate a CPU bottleneck. The %wio should ideally be 0%; values less than 15% are acceptable. A low %idle over short periods of time is not a major concern; this is the percentage of time that the CPU is not running processes. However, a low %idle over a sustained period can be an indication of a CPU bottleneck. If %wio is greater than 15% and %idle is low, consider the size of the run queue (runq-sz); ideally we would like to see values less than 4. If runq-sz is high and %wio is 0, then there is no bottleneck; this is usually a case of many small processes running that do not overload the processors. If the system is a single-processor system under heavy load, a CPU bottleneck may be unavoidable. Other metrics to consider are:

Nice Utilization

This is the percentage of CPU time spent running user processes with nice values of 21-39. This is typically included in user CPU utilization, but some tools, such as Glance, track it separately to determine how much CPU time is being spent on lower-priority processes.

System Call Rate

The system call rate is the rate at which system calls are generated by user processes. Every system call results in a switch between user


and system (kernel) mode. A high system call rate typically correlates with high system CPU utilization.

Context Switch Rate

This is the number of times a CPU switches processes, on average, per second. This is typically included in the system CPU rate, but tools such as Glance can track it separately. Context switches occur based on the priority of the processes in the run queue and the time set in the kernel by the parameter timeslice, which by default is 100 milliseconds (timeslice = 10 clock ticks).

Using Glance for CPU metrics

Glance allows a more in-depth look at CPU statistics. The Glance CPU report consists of two pages:

Page 1

CPU REPORT

Users= 5

State Current Average High Time Cum Time

--------------------------------------------------------------------------------

User 1.5 3.2 4.3 0.08 0.51

Nice 0.0 0.1 0.2 0.00 0.02

Negative Nice 1.1 1.9 16.0 0.06 0.30

RealTime 1.1 0.5 1.1 0.06 0.08

System 2.3 2.8 4.0 0.12 0.44

Interrupt 0.8 0.7 0.8 0.04 0.11

ContextSwitch 0.6 0.6 1.3 0.03 0.10

Traps 0.0 0.0 0.0 0.00 0.00

Vfaults 0.0 0.1 1.3 0.00 0.01

Idle 92.6 90.0 92.6 4.87 14.18

Top CPU user: PID 2206, scopeux 1.0% cpu util

Page 2

CPU REPORT Users= 5

State Current Cumulative High

--------------------------------------------------------------------------------

Load Average 4.4 4.5 4.5

Syscall Rate 1209.5 1320.0 1942.8

Intrpt Rate 412.8 380.1 412.8

CSwitch Rate 359.2 355.1 360.7

Top CPU user: PID 5916, glance 2.4% cpu util

The CPU Report screen shows global processor time allocation for different activities such as User, Nice, Real-time, System, Interrupt, Context Switch and Idle. Several values are given for each activity. On multi-processor systems, the values represent the average over all CPUs, so the percentage columns never exceed 100. For individual processor detail, use the 'a' (CPU By Processor) screen. For each of the activities, the Current column displays the percentage of CPU time devoted to this activity during the last interval.

The Average column shows the average percentage of time spent in this activity since data collection was started or the statistics were reset using the 'z' (zero) command.

The High column shows the highest percentage ("high water mark") for this activity over all intervals.

The Time column displays the amount of time spent in this activity during the last interval (displayed as a percentage in the Current column).

The Cum Time column stores the cumulative total CPU time allocated to this activity since the start of data collection. The final entry indicates the current highest CPU consumer process on the system.


The CPU report's second screen, accessed by hitting the + key, shows the Load Average (run queue) and event rate statistics for System Calls, Interrupts and Context Switches. For each event, current, cumulative and high rates are shown. The final entry indicates the current highest CPU consumer process on the system.

Glance Metrics for CPU

Metric common parameters: The cumulative collection times are defined from the point in time when either a) the process or kernel thread was first started, b) the performance tool was first started, or c) the cumulative counters were reset (relevant only to GlancePlus), whichever occurred last. On systems with multiple CPUs, these metrics are normalized: the CPU used over all processors is divided by the number of processors online, representing the usage of the total processing capacity available. For example, on a four-way system a single process consuming one entire CPU appears as 25% normalized utilization.

* GBL_CPU_NORMAL_UTIL

The percentage of time that the CPU was in user mode at normal priority during the interval. Normal priority user mode CPU excludes CPU used at real-time and nice priorities.

* GBL_CPU_NORMAL_UTIL_CUM

The percentage of time that the CPU was in user mode at normal priority over the cumulative collection time.

Normal priority user mode CPU excludes CPU used at real-time and nice priorities.

* GBL_CPU_NORMAL_UTIL_HIGH

The highest percentage of time that the CPU was in user mode at normal priority during any one interval over the cumulative collection time.

Normal priority user mode CPU excludes CPU used at real-time and nice priorities.

* GBL_CPU_NORMAL_TIME

The time, in seconds, that the CPU was in user mode at normal priority during the interval. Normal priority user mode CPU excludes CPU used at real-time and nice priorities.

* GBL_CPU_NORMAL_TIME_CUM

The time, in seconds, that the CPU was in user mode at normal priority over the cumulative collection time. Normal priority user mode CPU excludes CPU used at real-time and nice priorities.

Nice

Nice common metric parameters: The NICE metrics include positive nice value CPU time only. Negative nice value CPU is broken out into the NNICE (negative nice) metrics. Positive nice values range from 20 to 39; negative nice values range from 0 to 19.
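To make those ranges concrete, the sketch below nudges the calling process into the positive-nice band described above (a minimal sketch; it assumes NZERO is 20, the usual HP-UX default):

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Add 10 to our nice value: default 20 -> 30, a "positive nice"
         * process in the terms above. nice() returns the new value
         * minus NZERO, so distinguish -1-as-value from -1-as-error. */
        errno = 0;
        int rc = nice(10);
        if (rc == -1 && errno != 0) {
            perror("nice");
            return 1;
        }
        printf("nice value is now %d\n", rc + 20);  /* assumes NZERO == 20 */
        return 0;
    }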

* GBL_CPU_NICE_UTIL


The percentage of time that the CPU was in user mode at a nice priority during the interval.

* GBL_CPU_NICE_UTIL_CUM

The percentage of time that the CPU was in user mode at a nice priority over the cumulative collection time.

* GBL_CPU_NICE_UTIL_HIGH

The highest percentage of time during any one interval that the CPU was in user mode at a nice priority over the cumulative collection time.

* GBL_CPU_NICE_TIME

The time, in seconds, that the CPU was in user mode at a nice priority during the interval.

* GBL_CPU_NICE_TIME_CUM

The time, in seconds, that the CPU was in user mode at a nice priority over the cumulative collection time.

* GBL_CPU_NNICE_UTIL

The percentage of time that the CPU was in user mode at a nice priority computed from processes with negative nice values during the interval.

* GBL_CPU_NNICE_UTIL_CUM

The percentage of time that the CPU was in user mode at a nice priority computed from processes with negative nice values over the cumulative collection time.

* GBL_CPU_NNICE_UTIL_HIGH

The highest percentage of time during any one interval that the CPU was in user mode at a nice priority computed from processes with negative nice values over the cumulative collection time.

* GBL_CPU_NNICE_TIME

The time, in seconds, that the CPU was in user mode at a nice priority computed from processes with negative nice values during the interval.

* GBL_CPU_NNICE_TIME_CUM

The time, in seconds, that the CPU was in user mode at a nice priority computed from processes with negative nice values over the cumulative collection time.


Real Time

* GBL_CPU_REALTIME_UTIL

The percentage of time that the CPU was in user mode at a realtime priority during the interval. Running at a realtime priority means that the process or kernel thread was run using the rtprio command or the rtprio system call to alter its priority. Realtime priorities range from zero to 127 and are absolute priorities, meaning the realtime process with the lowest (strongest) priority number runs as long as it wants to. Since this can have a huge impact on the system, realtime CPU is tracked separately to make visible the effect of using realtime priorities.

* GBL_CPU_REALTIME_UTIL_CUM

The percentage of time that the CPU was in user mode at a realtime priority over the cumulative collection time.

* GBL_CPU_REALTIME_UTIL_HIGH

The highest percentage of time that the CPU was in user mode at a realtime priority during any one interval over the cumulative collection time.

* GBL_CPU_REALTIME_TIME

The time, in seconds, that the CPU was in user mode at a realtime priority during the interval.

* GBL_CPU_REALTIME_TIME_CUM

The time, in seconds, that the CPU was in user mode at a realtime priority over the cumulative collection time.

System

* GBL_CPU_SYSCALL_UTIL

The percentage of time that the CPU was in system mode (excluding interrupt, context switch, trap, or vfault CPU) during the interval.

* GBL_CPU_SYSCALL_UTIL_CUM

The percentage of time that the CPU was in system mode (excluding interrupt, context switch, trap, or vfault CPU) over the cumulative collection time.

* GBL_CPU_SYSCALL_UTIL_HIGH

The highest percentage of time that the CPU was in system mode (excluding interrupt, context switch, trap, or vfault CPU) during any one interval over the cumulative collection time.

* GBL_CPU_SYSCALL_TIME

The time, in seconds, that the CPU was in system mode (excluding interrupt, context switch, trap, or vfault CPU) during the interval.

* GBL_CPU_SYSCALL_TIME_CUM


The time, in seconds, that the CPU was in system mode (excluding interrupt, context switch, trap, or vfault CPU) over the cumulative collection time.


Interrupt

* GBL_CPU_INTERRUPT_UTIL

The percentage of time that the CPU spent processing interrupts during the interval.

* GBL_CPU_INTERRUPT_UTIL_CUM

The percentage of time that the CPU spent processing interrupts over the cumulative collection time.

* GBL_CPU_INTERRUPT_UTIL_HIGH

The highest percentage of time that the CPU spent processing interrupts during any one interval over the cumulative collection time.

* GBL_CPU_INTERRUPT_TIME

The time, in seconds, that the CPU spent processing interrupts during the interval.

* GBL_CPU_INTERRUPT_TIME_CUM

The time, in seconds, that the CPU spent processing interrupts over the cumulative collection time.

ContextSwitch

* GBL_CPU_CSWITCH_UTIL

The percentage of time that the CPU spends context switching during the interval. This includes context switches that result in the execution of a different process and those caused by a process stopping, then resuming, with no other process running in the meantime.

* GBL_CPU_CSWITCH_UTIL_CUM

The percentage of time that the CPU spent context switching over the cumulative collection time.

* GBL_CPU_CSWITCH_UTIL_HIGH

The highest percentage of time during any one interval that the CPU spent context switching over the cumulative collection time.

* GBL_CPU_CSWITCH_TIME

The time, in seconds, that the CPU spent context switching during the interval.

* GBL_CPU_CSWITCH_TIME_CUM

The time, in seconds, that the CPU spent context switching over the cumulative collection time.


Traps

* GBL_CPU_TRAP_UTIL

The percentage of time the CPU was executing trap handler code during the interval.

* GBL_CPU_TRAP_UTIL_CUM

The percentage of time the CPU was in trap handler code over the cumulative collection time.

* GBL_CPU_TRAP_UTIL_HIGH

The highest percentage of time during any one interval the CPU was in trap handler code over the cumulative collection time.

* GBL_CPU_TRAP_TIME

The time, in seconds, the CPU was in trap handler code during the interval.

* GBL_CPU_TRAP_TIME_CUM

The time, in seconds, the CPU was in trap handler code over the cumulative collection time.

Vfaults

* GBL_CPU_VFAULT_UTIL

The percentage of time the CPU was handling page faults during the interval.

* GBL_CPU_VFAULT_UTIL_CUM

The percentage of time the CPU was handling page faults over the cumulative collection time.

* GBL_CPU_VFAULT_UTIL_HIGH

The highest percentage of time during any one interval the CPU was handling page faults over the cumulative collection time.

* GBL_CPU_VFAULT_TIME

The time, in seconds, the CPU was handling page faults during the interval.

* GBL_CPU_VFAULT_TIME_CUM

The time, in seconds, the CPU was handling page faults over the cumulative collection time.

Idle

* GBL_CPU_IDLE_UTIL

The percentage of time that the CPU was idle during the interval.

* GBL_CPU_IDLE_UTIL_CUM

The percentage of time that the CPU was idle over the cumulative collection time.


* GBL_CPU_IDLE_UTIL_HIGH

The highest percentage of time that the CPU was idle during any one interval over the cumulative collection time.

* GBL_CPU_IDLE_TIME

The time, in seconds, that the CPU was idle during the interval.

* GBL_CPU_IDLE_TIME_CUM

The time, in seconds, that the CPU was idle over the cumulative collection time.

CPU process priority

Processes are assigned priority by the system in three categories:

Real Time      0-127
System Mode    128-177
User Mode      178-255

While on processor, a priority by default ages and rises (weakens) within its range; the nice value determines how quickly a priority regains strength while waiting on CPU. This aging can be defeated by implementing the sched_noage policy, which prevents processes from losing priority while on CPU. Caution should be used when implementing this.
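As a hedged illustration only: on HP-UX 11.x releases that define SCHED_NOAGE in <sched.h>, a process can request the policy for itself with the standard sched_setscheduler() call. The priority value of 178 below is an assumption (the strongest user timeshare priority described later in this module); check sched(2) on your release before relying on this.

    #include <sched.h>     /* sched_setscheduler(); SCHED_NOAGE on HP-UX */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        struct sched_param sp;

        sp.sched_priority = 178;   /* assumed strongest user timeshare value */

        /* Pin our priority: it will no longer age while on CPU. */
        if (sched_setscheduler(getpid(), SCHED_NOAGE, &sp) == -1) {
            perror("sched_setscheduler(SCHED_NOAGE)");
            return 1;
        }
        printf("sched_noage in effect at priority %d\n", sp.sched_priority);
        return 0;
    }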

Process Scheduling

To understand how the threads of a process run, we have to understand how they are scheduled. Although processes appear to the user to run simultaneously, a single processor is in fact executing only one thread of execution at any given moment. Several factors contribute to process scheduling:

· Kind of scheduling policy required -- timeshare or real-time. Scheduling policy governs how the process (or thread of execution) interacts with other processes (or threads of execution) at the same priority.

· Choice of scheduler. Four schedulers are available: the HP-UX timeshare scheduler (SCHED_HPUX), HP Process Resource Manager (a timeshare scheduler), the HP-UX real-time scheduler (HPUX_RTPRIO), and the POSIX-compliant real-time scheduler.

· Priority of the process. Priority denotes the relative importance of the process or thread of execution.

· Run queues from which the process is scheduled.

· Kernel routines that schedule the process.

Scheduling Policies

HP-UX scheduling is governed by policy that connotes the urgency with which the CPU is needed, either timeshare or real-time. The following table compares the two policies in very general terms.

Comparison of Timeshare vs. Real-Time Scheduling

Timeshare: Typically implemented round-robin.
Real-time: Implemented as either round-robin or first-in-first-out (FIFO), depending on scheduler.

Timeshare: Kernel lowers priority when the process is running; that is, timeshare priorities decay. As you use CPU, your priority becomes weaker; as you become starved for CPU, your priority becomes stronger. The scheduler tends to regress toward the mean.
Real-time: Priority not adjusted by kernel; that is, real-time priorities are non-decaying. If one real-time priority is set at 50 and another at 40 (where 40 is stronger than 50), the process or thread of priority 40 will always be more important than the process or thread of priority 50.

Timeshare: Runs in timeslices that can be preempted by a process running at higher priority.
Real-time: Runs until it exits or is blocked. Always runs at higher priority than timeshare.

The principle behind the distribution of CPU time is called a timeslice. A timeslice is the amount of time a process can run before the kernel checks to see if there is an equal or stronger priority process ready to run.

· If a timeshare policy is implemented, a process might begin to run and then relinquish the CPU to a process with a stronger priority.

· Real-time processes running round-robin typically run until they are blocked or relinquish the CPU after a certain timeslice has occurred.

Real-time processes running FIFO run until completion, without being preempted.

Scheduling policies act upon sets of thread lists, one thread list for each priority. Any runnable thread may be in any thread list. Multiple scheduling policies are provided. Each nonempty list is ordered, and contains a head (th_link) as one end of its order and a tail (th_rlink) as the other. The purpose of a scheduling policy is to define the allowable operations on this set of lists (for example, moving threads between and within lists). Each thread is controlled by an associated scheduling policy and priority. Applications can specify these parameters by explicitly executing the sched_setscheduler() or sched_setparam() functions.
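A minimal, self-contained sketch of that API (standard POSIX calls; the choice of SCHED_RR and a mid-range priority here is arbitrary, and real-time policies generally require superuser or PRIV_RTPRIO privilege):

    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        struct sched_param sp;

        /* Pick a priority midway through the valid SCHED_RR range. */
        sp.sched_priority = (sched_get_priority_min(SCHED_RR) +
                             sched_get_priority_max(SCHED_RR)) / 2;

        if (sched_setscheduler(getpid(), SCHED_RR, &sp) == -1) {
            perror("sched_setscheduler");   /* likely EPERM if unprivileged */
            return 1;
        }

        /* Read the priority back with sched_getparam(). */
        if (sched_getparam(getpid(), &sp) == 0)
            printf("now SCHED_RR at priority %d\n", sp.sched_priority);
        return 0;
    }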

Hierarchy of Priorities

All POSIX real-time priority threads have greater scheduling importance than threads with HP-UX real-time or HP-UX timeshare priority. By comparison, all HP-UX real-time priority threads are of greater scheduling importance than HP-UX timeshare priority threads, but are of lesser importance than POSIX real-time threads. Neither POSIX nor HP-UX real-time threads are subject to degradation.

Schedulers

As of release 10.0, HP-UX implements four schedulers: two timeshare and two real-time. To choose a scheduler, you can use the user command rtsched(1), which executes processes with your choice of scheduler and enables you to change the real-time priority of a currently executing process ID:

rtsched -s scheduler -p priority command [arguments]
rtsched [-s scheduler] -p priority -P pid

Likewise, the system call rtsched(2) provides programmatic access to POSIX real-time scheduling operations.

RTSCHED (POSIX) Scheduler

The RTSCHED POSIX-compliant real-time deterministic scheduler provides three scheduling policies, whose characteristics are compared in the following table.

RTSCHED policies

Scheduling Policy    How it works

SCHED_FIFO Strict first in-first out (FIFO) scheduling policy. This policy contains a range of at least 32 priorities. Threads scheduled under this policy are chosen from a thread list ordered according to the time its threads have been in the list without being executed. The head of the list is the thread that has been in the list the longest time; the tail is the thread that has been in the list the shortest time.

SCHED_RR Round-robin scheduling policy with a per-system time slice (time quantum). This policy contains a range of at least 32 priorities and is identical to the SCHED_FIFO policy with an additional condition: when the implementation detects that a running process has been executing as a running thread for a time period of the length returned by the function sched_rr_get_interval(), or longer, the thread becomes the tail of its thread list, and the head of that thread list is removed and made a running thread.

SCHED_RR2 Round-robin scheduling policy, with a per-priority time slice (time quantum). The priority range for this policy contains at least 32 priorities. This policy is identical to the SCHED_RR policy except that the round-robin time slice interval returned by sched_rr_get_interval() depends upon the priority of the specified thread (see the sketch after this table).
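The quantum both round-robin rows refer to can be read back with the standard sched_rr_get_interval() call; a minimal sketch:

    #include <sched.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        struct timespec ts;

        /* Ask the kernel for this process's round-robin time slice. */
        if (sched_rr_get_interval(getpid(), &ts) == -1) {
            perror("sched_rr_get_interval");
            return 1;
        }
        printf("RR timeslice: %ld.%09ld seconds\n",
               (long)ts.tv_sec, (long)ts.tv_nsec);
        return 0;
    }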

SCHED_RTPRIO Scheduler

Realtime scheduling policy with nondecaying priorities (like SCHED_FIFO and SCHED_RR) and a priority range between the POSIX real-time policies and the HP-UX policies.

For threads executing under this policy, the implementation must use only priorities within the range returned by the functions sched_get_priority_max() and sched_get_priority_min() when SCHED_RTPRIO is provided as the parameter.

NOTE: In the SCHED_RTPRIO scheduling policy, smaller numbers represent higher (stronger) priorities, which is the opposite of the POSIX scheduling policies. This is done to provide continuing support for existing applications that depend on this priority ordering.

The strongest priority in the priority range for SCHED_RTPRIO is weaker than the weakest priority in the priority ranges for any of the POSIX policies: SCHED_FIFO, SCHED_RR, and SCHED_RR2.


SCHED_HPUX Scheduler

The SCHED_OTHER policy, also known as SCHED_HPUX and SCHED_TIMESHARE, provides a way for applications to indicate, in a portable way, that they no longer need a real-time scheduling policy. For threads executing under this policy, the implementation can use only priorities within the range returned by the functions sched_get_priority_max() and sched_get_priority_min() when SCHED_OTHER is provided as the parameter. Note that for the SCHED_OTHER scheduling policy, like SCHED_RTPRIO, smaller numbers represent higher (stronger) priorities, which is the opposite of the POSIX scheduling policies. This is done to provide continuing support for existing applications that depend on this priority ordering. However, it is guaranteed that the priority range for the SCHED_OTHER scheduling policy is properly disjoint from the priority ranges of all of the other scheduling policies described, and the strongest priority in the priority range for SCHED_OTHER is weaker than the weakest priority in the priority ranges for any of the other policies: SCHED_FIFO, SCHED_RR, and SCHED_RR2.

Scheduling Priorities

All processes have a priority, set when the process is invoked and based on factors such as whether the process is running on behalf of a user or the system and whether the process is created in a timeshare or real-time environment. Associated with each policy is a priority range. The priority ranges for each policy can (but need not) overlap the priority ranges of other policies. Two separate ranges of priorities exist: a range of POSIX standard priorities and a range of other HP-UX priorities. The POSIX standard priorities are always higher than all other HP-UX priorities. Processes are chosen by the scheduler to execute a time-slice based on priority. Priorities range from highest to lowest and are classified by need. The thread selected to run is at the head of the highest priority nonempty thread list.

Internal vs. External Priority Values

With the implementation of the POSIX rtsched, HP-UX priorities are enumerated from two perspectives -- internal and external priority values.

· The internal value represents the kernel's view of the priority.

· The external value represents the user's view of the priority, as is visible using the ps(1) command. In addition, legacy HP-UX priority values are ranked in the opposite sequence from POSIX priority values:

· In the POSIX standard, the higher the priority number, the stronger the priority.

· In the legacy HP-UX implementation, the lower the priority number, the stronger the priority. The following macros are defined in pm_rtsched.h to enable a program to convert between POSIX and HP-UX priorities and between internal and external values (a small arithmetic sketch follows the list):

· PRI_ExtPOSIXPri_To_IntHpuxPri

To derive the HP-UX kernel (internal) value from the value passed by a user invoking the rtsched command (that is, using the POSIX priority value).

· PRI_IntHpuxPri_To_ExtPOSIXPri()

To convert HP-UX (kernel) internal priority value to POSIX priority value.

· PRI_IntHpuxPri_To_ExtHpuxPri

To convert HP-UX internal to HP-UX external priority values.
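A minimal sketch of the arithmetic the first of these macros performs, assuming MAX_RTSCHED_PRI is 512 as described later in this module (the real macros live in pm_rtsched.h; this merely mirrors their math):

    #include <stdio.h>

    #define MAX_RTSCHED_PRI 512   /* assumed ceiling; see rtsched_numpri */

    /* POSIX external priority -> HP-UX internal priority. Higher POSIX
     * numbers are stronger, while smaller internal numbers are stronger,
     * so the conversion inverts the scale. */
    static int ext_posix_to_int_hpux(int ext_pri)
    {
        return (MAX_RTSCHED_PRI - 1) - ext_pri;
    }

    int main(void)
    {
        /* rtsched -p 31 (strongest of the 32 default priorities)
         * maps to internal priority 480, as computed later in the text. */
        printf("%d\n", ext_posix_to_int_hpux(31));   /* prints 480 */
        return 0;
    }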

rtsched_numpri Parameter

A configurable parameter, rtsched_numpri, controls:

· The number of scheduling priorities supported by the POSIX rtsched scheduler.

· The range of valid values is 32 to 512 (32 is the default). Increasing rtsched_numpri provides more scheduling priorities at the cost of increased context switch time and, to a minor degree, increased memory consumption.

Schedulers and Priority Values

There are now four sets of thread priorities: (Internal to External View)

Scheduler priority values

Type of Scheduler    Internal Values    External Values

POSIX Standard       512 to 480         0 to 31
Real-time            512 to 640         0 to 127
System, timeshare    640 to 689         128 to 177
User, timeshare      690 to 767         178 to 255


NOTE: For the POSIX standard scheduler, the higher the number, the stronger the priority. For the RTPRIO scheduler, the lower the number, the stronger the priority. The following lists the categories of priority, from highest to lowest:

· RTSCHED (POSIX standard) ranks as highest priority range, and is separate from other HP-UX priorities.

The number of RTSCHED priorities ranges between 32 and 512 (default 32) and can be set by the tunable parameter rtsched_numpri.

· SCHED_RTPRIO (real-time priority) ranges from 0-127 and is reserved for processes started with rtprio() system calls.

· Two priorities are used in a timeshare environment:

User priority (178-255), assigned to user processes in a time-share environment.

System priority (128-177), used by system processes in a time-share environment.

The kernel can alter the priority of timeshare priorities (128-255) but not real-time priorities (0-127). The following priority values, internal to the kernel, are defined in param.h:

PRTSCHEDBASE Smallest (strongest) RTSCHED priority

MAX_RTSCHED_PRI Maximum number of RTSCHED priorities

PRTBASE Smallest (strongest) RTPRIO priority. Defined as PRTSCHED + MAX_RTSCHED_PRI.

PTIMESHARE Smallest (strongest) timeshare priority. Defined as PRTBASE + 128.

PMAX_TIMESHARE Largest (weakest) timeshare priority. Defined as 127 + PTIMESHARE.

Priorities stronger (smaller number) than or equal to PZERO cannot be signaled. Priorities weaker (bigger number) than PZERO can be signaled.

RTSCHED Priorities

The following discussion illustrates the HP-UX internal view, based on how the user specifies a priority to the rtsched command. Each available real-time scheduler policy has a range of priorities (default values shown below).

Scheduler Policy highest priority lowest priority

SCHED_FIFO 31 0

SCHED_RR 31 0

SCHED_RR2 31 0

SCHED_RTPRIO 0 127

The user may invoke the rtsched(1) command to assign a scheduler policy and priority. For example:

rtsched -s SCHED_RR -p 31 ls

Within kernel mode, sched_setparam() is called to set the scheduling parameters of a process. It (along with sched_setscheduler()) is the mechanism by which a process changes its (or another process's) scheduling parameters. Presently the only scheduling parameter is priority, sched_priority. The sched_setparam() and sched_setscheduler() system calls look up the process associated with the user argument pid and call the internal routine sched_setcommon() to complete the execution. sched_setcommon() is the common code for sched_setparam() and sched_setscheduler(); it modifies the thread's scheduling priority and policy. The scheduler information for a thread is kept in its thread structure. It is used by the scheduling code, particularly setrq(), to decide when the thread runs with respect to the other threads in the system. sched_setcommon() is called with the sched_lock held. sched_setcommon() calls the macro PRI_ExtPOSIXPri_To_IntHpuxPri, defined in pm_rtsched.h, and the requested priority is then converted. Since priorities in HP-UX are stronger for smaller values, and the POSIX specification requires the opposite behavior, the two are merged by running the rtsched priorities from ((MAX_RTSCHED_PRI-1) - rtsched_info.rts_numpri) (strongest) to (MAX_RTSCHED_PRI-1) (weakest). Based on the macro definition, using the value passed by the user, the internal value seen by the kernel is calculated as follows:

((MAX_RTSCHED_PRI - 1) - (ExtP_pri))

512 - 1 - 31 = 480


The kernel priority of the user's process is 480. The value of 480 is the strongest priority available to the user.

Run Queues

A process must be on a queue of runnable processes before the scheduler can choose it to run. Processes get linked into the run queue based on the process's priority, set in the process table. Run queues are link-listed in decreasing priority. The scheduler chooses the process with the highest priority to run for a given time-slice. Each process is represented by its header on the list of run queue headers; each entry in the list of run queue headers points to the process table entry for its respective process. The kernel maintains separate queues for system-mode and user-mode execution. System-mode execution takes precedence for CPU time. User-mode priorities can be preempted -- stopped and swapped out to secondary storage; kernel-mode execution cannot. Processes run until they have to wait for a resource (such as data from a disk), until the kernel preempts them when their run time exceeds a time-slice limit, until an interrupt occurs, or until they exit. The scheduler then chooses a new eligible highest-priority process to run; eventually, the original process will run again when it has the highest priority of any runnable process. When a timeshare process is not running, the kernel improves the process's priority (lowers its number); when a process is running, its priority worsens. The kernel does not alter priorities on real-time processes. Timeshared processes (both system and user) lose priority as they execute and regain priority when they do not execute.

Run Queue Initialization

Run queues are initialized by the routine rqinit(), which is called from init_main.c after the system monarch processor is established and before final kernel initialization. rqinit() examines all potential entries in the system global per-processor information structure (struct mpinfo) and gets the run queue information and pointers to the linked list of running threads. It then clears the run queue data in bestq (an index into the array of run queue pointers that points to the highest priority non-empty queue), newavg_on_rq (the run queue average for the processor), and nready_locked and nready_free (sums providing the total threads in the processor's run queues). rqinit() then sets the current itimer value for all run queues, links the queue header as the sole element, and sets up the queue. Next, the RTSCHED-related global run data structures are initialized with the global structure rtsched_info (defined in pm_rtsched.h), which describes the RTSCHED run queues.

Entries in rtsched_info

Entry Purpose

rts_nready Total number of threads on queues

rts_bestq Hint of which queue to find threads

rts_numpri Number of RTSCHED priorities

rts_rr_timeslice Global timeslice for SCHED_RR threads

*rts_timeslicep Round-robin timeslices for each priority (used by SCHED_RR2 threads)

*rts_qp Pointer to run queues

*rts_lock Spinlock for the run queues

The tunable parameter rtsched_numpri determines how many run queues exist:

· The minimum value allowed is 32, imposed by the POSIX.4 specification and defined as RTSCHED_NUMPRI_FLOOR.

· The maximum supported value of 512 is a constant of the implementation, defined as RTSCHED_NUMPRI_CEILING and set equal to MAX_RTSCHED_PRI. If a higher maximum is required, the latter definition must be changed.

malloc() is called to allocate space for the RTSCHED run queues; (rtsched_numpri * sizeof(struct mp_threadhd)) bytes are required, and the resulting pointer is stored in rtsched_info.rts_qp. Timeslice is checked to ensure that it is set to a valid value, which may be either -1 (meaning no timeslicing) or a positive integer; if it is invalid, it is set to the default, HZ/10. rtsched_info.rts_rr_timeslice is set to timeslice, which round-robins with that many clock ticks. For each of the rtsched_numpri run queues, the struct mp_threadhd header block is linked circularly to itself. Finally, a spinlock is allocated to lock the run queue.

Note: There is one RTSCHED run queue systemwide, though a separate track is kept for each processor. The queue for a given thread is based on how the scheduling policy is defined. One global set of run queues is maintained for RTSCHED (SCHED_FIFO, SCHED_RR, SCHED_RR2) threads. Run queues are maintained for each SPU for SCHED_TIMESHARE and SCHED_RTPRIO threads.

RTSCHED Run Queue


[Figure: SCHED_RTPRIO (HP-UX REAL TIME) run queue]

[Figure: SCHED_TIMESHARE run queue]

The following figure shows threads set to run at various RTSCHED priorities. The global RTSCHED run queues are searched for the strongest (most deserving) thread to run; the best candidate is returned as a kthread_t. Each priority has one thread list. Any runnable thread may be in any thread list. Multiple scheduling policies are provided. Each nonempty list is ordered, and contains a head (th_link) at one end of its order and a tail (th_rlink) at the other.

· rtsched_info.rts_qp points to the strongest RTSCHED queue.

· rtsched_info.rts_bestq points to the queue at which to begin the search. The search (by the routine find_thread_rtsched()) proceeds from rts_bestq downwards looking for non-empty run queues. When the first non-empty queue is found, its index is noted in the local first_busyq. All threads in that queue are checked to determine whether they are truly runnable or blocked on a semaphore.

· If there is a runnable thread, the rts_bestq value is updated to the present queue and a pointer to the thread found is returned to the caller.

· If no truly runnable thread is found, threads blocked on semaphores are considered. If first_busyq is set, the rts_bestq value is updated to it and the thread at the head of that queue is returned to the caller. If first_busyq did not get set in the loop, the routine panics, because it should be called only if rtsched_info.rts_nready is non-zero. Although the thread scheduler is set to a default value of 32 (RTSCHED_NUMPRI_FLOOR), it can be expanded to a system limit of PRTSCHEDBASE (a value of 0).

The Combined SCHED_RTPRIO and SCHED_TIMESHARE Run Queue

The SCHED_RTPRIO and SCHED_TIMESHARE priorities use the same queue, and it is searched with the same technique as the RTSCHED queue: the most deserving thread is found to run on the current processor. The search starts at bestq, which is an index into the table of run queues. There is one thread list for each priority. Any runnable thread may be in any thread list. Multiple scheduling policies are provided. Each nonempty list is ordered, and contains a head (th_link) as one end of its order and a tail (th_rlink) as the other.

The mp_rq structure constructs the run queues by linking threads together. The structure qs is an array of pointer pairs that act as a doubly linked list of threads. Each entry in qs[] represents a different priority queue; the array is sized by NQS, which is 160. The qs[].th_link pointer points to the first thread in the queue and the qs[].th_rlink pointer points to the tail.
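A toy sketch of that queue shape (the field names mirror the text -- th_link, th_rlink, qs[] -- but the struct layouts are invented for illustration and simplified to NULL-terminated rather than circular lists):

    #include <stdio.h>

    #define NQS 160   /* number of priority queues, per the text */

    struct kthread {                /* illustrative stand-in only */
        struct kthread *th_link;    /* next thread in the list */
        struct kthread *th_rlink;   /* previous thread in the list */
        int             kt_pri;
    };

    struct mp_threadhd {            /* illustrative queue header */
        struct kthread *th_link;    /* first thread in this queue */
        struct kthread *th_rlink;   /* last thread in this queue */
    };

    /* Enqueue at the tail, as round-robin does after a timeslice. */
    static void enqueue_tail(struct mp_threadhd *q, struct kthread *t)
    {
        t->th_link  = NULL;
        t->th_rlink = q->th_rlink;
        if (q->th_rlink)
            q->th_rlink->th_link = t;
        else
            q->th_link = t;         /* queue was empty */
        q->th_rlink = t;
    }

    int main(void)
    {
        struct mp_threadhd qs[NQS];
        for (int i = 0; i < NQS; i++)
            qs[i].th_link = qs[i].th_rlink = NULL;

        struct kthread a = { NULL, NULL, 150 };
        enqueue_tail(&qs[a.kt_pri], &a);    /* one queue per priority */
        printf("head of queue 150 has pri %d\n", qs[150].th_link->kt_pri);
        return 0;
    }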

Priorities 0 (highest realtime priority) through 127 (least realtime priority) are reserved for real-time threads. A real-time priority thread will run until it sleeps, exits, or is preempted by a higher priority real-time thread; equal priority threads are run in a round-robin fashion. The rtprio(1) command may be used to give a thread a real-time priority. To use the rtprio(1) command, a user must belong to the PRIV_RTPRIO privilege group or be the superuser (root). The priorities of real-time threads are never modified by the system unless explicitly requested by a user (via a command or system call). Also, a real-time thread will always run before a timeshare thread. The following are a few key points regarding a real-time thread (a hedged programmatic sketch follows the list):

· Priorities are not adjusted by the kernel

· Priorities may be adjusted by a system call

· Real-time priority is set in kt_pri

· The p_nice value has no effect
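As a hedged illustration of the rtprio(2) system call mentioned above (this assumes the conventional HP-UX signature, int rtprio(pid_t, int), and the <sys/rtprio.h> header; consult rtprio(2) on your release before relying on it):

    #include <sys/rtprio.h>   /* assumed header for rtprio(2) */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Request real-time priority 64 for ourselves (0 = strongest,
         * 127 = weakest). Needs superuser or PRIV_RTPRIO membership. */
        int old = rtprio(getpid(), 64);
        if (old == -1) {
            perror("rtprio");
            return 1;
        }
        printf("previous rtprio value: %d\n", old);
        return 0;
    }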


SCHED_TIMESHARE run queue

Timeshare threads are grouped into system priorities (128 through 177) and user priorities (178 through 255). The queues are four priorities wide. The system picks the highest priority timeshare thread and lets it run for a specific period of time (timeslice); as the thread runs, its priority decreases. At the end of the timeslice, a new highest priority thread is chosen. Waiting threads gain priority and running threads lose priority in order to favor threads that perform I/O and give lesser attention to compute-bound threads. SCHED_TIMESHARE priorities are grouped as follows:

· Real-time priority thread: range 0-127

· Time-share priority thread: range 128-255

· System-level priority thread: range 128-177

· User-level priority thread: range 178-255

RTSCHED priority queues are one priority wide; timeshare priority queues are four priorities wide.

Thread Scheduling

The thread of a parent process forks a child process. The child process inherits the scheduling policy and priority of the parent process. As with the parent thread, it is the child thread whose scheduling policy and priority will be used.

· Each thread in a process is independently scheduled.

· Each thread contains its own scheduling policy and priority

· Thread scheduling policies and priorities may be assigned before a thread is created (in the thread's attributes object) or set dynamically while a thread is running.

· Each thread may be bound directly to a CPU.

· Each thread may be suspended (and later resumed) by any thread within the process.

The following scheduling attributes may be set in the thread's attributes object; the newly created thread will contain these scheduling attributes (a short pthread example follows the list):

contentionscope

PTHREAD_SCOPE_SYSTEM specifies a bound (1 x 1, kernel-space) thread. When a bound thread is created, both a user thread and a kernel-scheduled entity are created.

PTHREAD_SCOPE_PROCESS specifies an unbound (M x N, combination user- and kernel-space) thread. (Note: HP-UX release 10.30 does not support unbound threads.)

inheritsched

PTHREAD_INHERIT_SCHED specifies that the created thread will inherit its scheduling values from the creating thread, instead of from the threads attributes object.

PTHREAD_EXPLICIT_SCHED specifies that the created thread will get its scheduling values from the threads attributes object.

schedpolicy The scheduling policy of the newly created thread

schedparam The scheduling parameter (priority) of the newly created thread.
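A minimal sketch tying these attributes together with the standard pthread calls (the SCHED_RR policy and minimum priority are arbitrary choices here, and real-time policies may require privilege):

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        (void)arg;
        puts("worker running");
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_attr_t attr;
        struct sched_param sp;

        pthread_attr_init(&attr);

        /* contentionscope: bound (1 x 1) thread. */
        pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

        /* inheritsched: take policy/priority from this attributes
         * object rather than from the creating thread. */
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);

        /* schedpolicy and schedparam. */
        pthread_attr_setschedpolicy(&attr, SCHED_RR);
        sp.sched_priority = sched_get_priority_min(SCHED_RR);
        pthread_attr_setschedparam(&attr, &sp);

        int rc = pthread_create(&tid, &attr, worker, NULL);
        if (rc != 0) {
            fprintf(stderr, "pthread_create failed: %d\n", rc);
            return 1;
        }
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }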

Timeline


A process and its threads change with the passage of time. A thread's priority is adjusted at four key times:

Thread priority adjustments

Interval What happens

10 milliseconds

The clock-interrupt handling routine clock_int() adjusts a time interval on the monarch every clock tick. The monarch processor calls hardclock() to handle clock ticks on the monarch for general maintenance (such as disk and LAN states). hardclock() calls per_spu_hardclock() to charge the running thread with the CPU time accumulated (kt_cpu).

40 milliseconds

per_spu_hardclock() determines that the running thread has accumulated 40 ms of time and calls setpri(). setpri() calls calcusrpri() to adjust the running thread's user priority (kt_usrpri).

100 milliseconds

By default, 10 clock ticks represent the value of timeslice, the configurable kernel parameter that defines the amount of time one thread is allowed to run before the CPU is given to the next thread. Once a timeslice interval has expired, a call to swtch() is made to enact a context switch.

one second

statdaemon() loops on the thread list and once every second calls schedcpu() to update all thread priorities. The kt_usrpri priority is given to the thread on the next context switch; if in user mode, kt_usrpri is given immediately.

Thread scheduling routines

Routine Purpose

hardclock() Runs on the monarch processor to handle clock ticks.

per_spu_hardclock() handles per-processor hardclock activities.

setpri() Called with a thread as its argument and returns a user priority for that thread. Calls calcusrpri() to get the new user priority. If the new priority is stronger than that of the currently running thread, setpri() generates an MPSCHED interrupt on the processor executing that thread, stores the new user priority in kt_usrpri, and returns it to its caller.

calcusrpri() The user priority (kt_usrpri) portion of setpri(). calcusrpri() uses the kt_cpu and p_nice (proc) fields of the thread, tt, to determine tt's user priority, and returns that value without changing any fields in *tt. If tt is an RTPRIO or RTSCHED thread, kt_usrpri is the current value of kt_pri.

swtch() Finds the most deserving runnable thread, takes it off the run queue, and sets it to run.

statdaemon() A general-purpose kernel process run once per second to check and update process and virtual memory artifacts, such as signal queueing and free protection IDs. Calls schedcpu() to recompute thread priorities and statistics.

schedcpu() Once a second, schedcpu() loops through the thread list to update thread scheduling priorities. If the system has more than one SPU, it balances SPU loads. schedcpu() updates thread usage information (kt_prevrecentcycles and kt_fractioncpu), calculates a new kt_cpu for the current thread (info used by setpri()), updates the statistics of runnable threads on run queues and those swapped out, and awakens the swapper. Calls setpri().

setrq() Routine used to put threads onto the run queues. Sets the appropriate protection (spl7 in the UP case, thread lock in the MP case). Asserts a valid HP-UX priority and scheduling policy and performs policy-specific setup.

remrq() Routine used to remove a thread from its run queue. With a valid kt_link, sets the appropriate protection (spl7 in the UP case or thread lock in the MP case). Finds the processor on which the thread is running. Decrements the thread count on run queues. Updates the mpinfo structure. Restores the old spl level and updates RTSCHED counts if necessary. Adjusts kt_pri and returns to schedcpu().

Adjusting a Thread Priority

Every 10 msecs, the routine hardclock() is called with spinlock SPL5 to disable I/O modules and software interrupts. hardclock() calls the per-processor routine per_spu_hardclock(), which looks for threads whose priority is high enough to run. (Searching the processor


run queues depends on the scheduling policy.) If a thread is found, the MPSCHED_INT_BIT in the processor EIRR (External Interrupt Request Register) is set. When the system receives an MPSCHED_INT interrupt while running a thread in user mode, the trap handler puts the thread on a run queue and switches context to bring in the high-priority thread. If the currently executing thread is the thread with the highest priority, it is given 100 ms (one timeslice) to run. hardclock() calls setpri() every 40 ms to review the thread's working priority (kt_pri). setpri() adjusts the user priority (kt_usrpri) of a time-share thread based on CPU usage and nice values. While a time-share thread is running, its kt_cpu time increases and its priority (kt_pri) worsens; RTSCHED or RTPRIO thread priorities do not change. Every 1 second, schedcpu() decrements the kt_cpu value for each thread on the run queue. setpri() is called to calculate a new priority for the current thread being examined in the schedcpu() loop. remrq() is called to remove that thread from the run queue, and then setrq() places the thread back into the run queue according to its new priority. If a process is sleeping or on a swap device (that is, not on the run queue), the user priority (kt_usrpri) is adjusted in setpri() and kt_pri is set in schedcpu().

Context Switching

In a thread-based kernel, the kernel manages context switches between kernel threads, rather than processes. Context switching occurs when the kernel switches from executing one thread to executing another. The kernel saves the context of the currently running thread and resumes the context of the next thread that is scheduled to run. When the kernel preempts a thread, its context is saved. Once the preempted thread is scheduled to run again, its context is restored and it continues as if it had never stopped. The kernel allows a context switch to occur under the following circumstances:

· Thread exits

· Thread's time slice has expired and a trap is generated.

· Thread puts itself to sleep, while awaiting a resource.

· Thread puts itself into a debug or stop state

· Thread returns to user mode from a system call or trap

· A higher-priority thread becomes ready to run

If a kernel thread has a higher priority than the running thread, it can preempt the current running thread. This occurs if the thread is awakened by a resource it has requested. Only user threads can be preempted; HP-UX does not allow preemption in the kernel except when a kernel thread is returning to user mode. In the case where a single process can schedule multiple kernel threads (1 x 1 and M x N), the kernel will preempt the running thread when it is executing in user space, but not when it is executing in kernel space (for example, during a system call).

The swtch() Routine

The swtch() routine finds the most deserving runnable thread, takes it off the run queue, and starts running it.

swtch() routines

Routine Purpose

swidle() (asm_utl.c) Performs an idle loop while waiting to take action. Checks for a valid kt_link. On a uniprocessor machine, without a thread lock, goes to spl7. Finds the thread's SPU. Decrements the count of threads on run queues. Updates ndeactivated, nready_free, and nready_locked in the mpinfo structure. Removes the thread from its run queue. Restores the old spl level. Updates RTSCHED counts.

save() (resume.s) Routine called to save state. Saves the thread's process control block (pcb) marker.

find_thread_my_spu() (pm_policy.c) For the current CPU, finds the most deserving thread to run and removes the old one. The search starts at bestq, an index into the table of run queues. When found, sets up the new thread to run. Marks the interval timer in the SPU's mpinfo. Sets the processor state as MPSYS. Removes the thread from its run queue. Verifies that it is runnable (kt_stat == TSRUN). Sets the EIRR to MPSCHED_INT_ENABLE. Sets the thread context bit to TSRUNPROC to indicate the thread is running.

resume() (resume.s) Restores the register context from the pcb and transfers control to enable the thread to resume execution.

Process and Processor Interval Timing

Timing intervals are used to measure user, system, and interrupt times for threads, and idle time for processors. These measurements are taken and recorded in machine cycles for maximum precision and accountability. The algorithm for interval timing is described in pm_cycles.h.

Each processor maintains its own timing state by criteria defined in struct mpinfo, found in mp.h.


Processor timing states

Timing state Purpose

curstate The current state of the processor (spustate_t)

starttime Start time (CR16) of the current interval

prevthreadp Thread to which the current interval is attributed

idlecycles Total cycles the SPU has spent idling since boot (cycles_t)

Processor states

SPU state Meaning

SPUSTATE_NONE Processor is booting and has not yet entered another state

SPUSTATE_IDLE Processor is idle.

SPUSTATE_USER Processor is in user mode

SPUSTATE_SYSTEM Processor is in syscall() or trap.

Time spent processing interrupts is attributed to the running process as user or system time, depending on the state of the process when the interrupt occurred. Each time the kernel calls wakeup() while on the interrupt stack, a new interval starts and the time of the previous interval is attributed to the running process. If the processor is idle, the interrupt time is added to the processor's idle time.

State Transitions

A thread leaves resume(), either from another thread or the idle loop. Protected by a lock, the routine resume_cleanup() notes the time, attributes the interval to the previous thread if there was one (or to the processor's idle time if not), marks the new interval's start time, and changes the current state to SPUSTATE_SYSTEM.

When the processor idles, the routine swtch(), protected by a currently held lock, notes the time, attributes the interval to the previous thread, marks the new interval as starting at the noted time, and changes the current state to SPUSTATE_IDLE.

A user process makes a system call.


A user process running in user mode at (a) makes a system call at (b). It returns from the system call at (e) to run again in user mode. Between (b) and (e) it is running in system mode. Toward the beginning of syscall() at (c), a new system-mode interval starts; the previous interval is attributed to the thread as user time. Toward the end of syscall() at (d), a new user-mode interval starts; the previous interval is attributed to the thread as system time. For timing purposes, traps are handled identically, with the following exceptions:

· (c) and (d) are located in trap(), not syscall(), and

· whether (d) starts a user- or system-mode interval depends on the state of the thread at the time of the trap.

An interrupt occurs


Interrupts are handled much like traps, but any wakeup that occurs while on the interrupt stack (such as w1 and w2 in the figure above) starts a new interval, and its time is attributed to the thread being awakened rather than to the previous thread. Interrupt time attributed to processes is stored in the kt_interrupttime field of the thread structure. Concurrent writes to this field are prevented because wakeup() is the only routine (other than allocproc()) that writes to the field, and it only does so under the protection of a spinlock. Reads are performed (by pstat() and others) without locking, by using timecopy() instead. Conceptually, the work being done is on behalf of the thread being awakened instead of the previously running thread.

CPU Bottlenecks

To determine which processes are taking up the majority of CPU time, run:

# ps -ef | sort -rnk 8 | more

The TIME column is the 8th one. CPU bottlenecks show up as a high %wio (wait on I/O) from sar -u; for multiprocessor systems, add the capital M option (sar -Mu). High wait on I/O can be caused by a number of factors, including the total number of jobs in the CPU run queue. This can be detected with sar -q by looking at the runq-sz column; typically this value is 1.0, and as the number of jobs increases, the effect is logarithmic.
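Besides sar, run-queue depth can be read programmatically through the HP-UX pstat(2) interface. The sketch below is hedged: the psd_rq and psd_avg_1_min field names are the author's reading of sys/pstat.h and should be verified against pstat(2) on the release in question.

#include <stdio.h>
#include <sys/param.h>
#include <sys/pstat.h>

int main(void)
{
    struct pst_dynamic psd;

    /* One pst_dynamic structure holds system-wide dynamic statistics. */
    if (pstat_getdynamic(&psd, sizeof(psd), 1, 0) == -1) {
        perror("pstat_getdynamic");
        return 1;
    }
    /* Field names are assumptions; check sys/pstat.h on your system. */
    printf("run queue length : %ld\n", (long)psd.psd_rq);
    printf("1-minute load avg: %.2f\n", (double)psd.psd_avg_1_min);
    return 0;
}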

Priority

The highest-priority processes in the CPU run queue receive processor time; as processes run, their priority ages to allow other processes access. If there are processes with significantly more important priorities running, a low-priority process may get very little or no CPU time. As a result, more jobs accumulate in the run queue, which increases the wait on I/O.

Kernel parameters that affect CPU

Timeslice

This kernel parameter defaults to 10, which equals 100 milliseconds. The implementation of some tuned parameter sets or user-implemented changes can alter this, with some dramatic effects. Shorter values can increase CPU overhead by causing excessive context switching. Every context switch requires the system to re-prioritize the jobs in the run queue based on their relative importance. Processes context-switch out either at the end of their allotted timeslice or when a more important process enters the CPU run queue.

System Tables

The system tables include the process, inode, and file tables. These are the most frequently misconfigured kernel parameters. They are represented in the kernel as nproc, ninode, vx_ninode, and nfile. As they are by default controlled by maxusers, they are frequently oversized. Most vendor recommendations do not specify values for the individual tables; instead they recommend setting maxusers. This may be an appropriate starting point; however, since the vast majority of systems do not use HFS file systems beyond the requirement for /stand, this creates a problem: the HFS inode table, controlled by ninode, is frequently oversized. System tables should be sized based on system load. Oversizing the tables causes excessive system overhead reading the tables, and excessive kernel memory use.

Inode Table

On 10.20 the inode table and dnlc (directory name lookup cache) are combined. As most systems run only the /stand file system as HFS, this parameter does not need to be any larger than the number of HFS inodes the system requires plus enough space for an adequate HFS dnlc; 1024 is an adequate value to address this. On 11.00 the dnlc is configurable using the ncsize and vx_ncsize kernel parameters.


By default, ncsize = (ninode + vx_ncsize) + (8 * dnlc_hash_locks). The parameter vx_ncsize defines the memory space reserved for the VxFS directory path-name cache (in bytes). The default value for vx_ncsize is 1024; dnlc_hash_locks defaults to 512.

A VxFS file system obtains the value of vx_ninode from the system configuration file used for making the kernel (/stand/system, for example). This value is used to determine the number of entries in the VxFS inode table. By default, vx_ninode initializes at zero; the file system then computes a value based on the system memory size (see the Inode Table Size table below). To change the computed value of vx_ninode, you can add an entry to the system configuration file. For example, an entry such as vx_ninode 1000000 sets the inode table size to 1,000,000 inodes after making a new kernel using mk_kernel and then rebooting.

The number of inodes in the inode table is calculated according to the following table. The first column is the amount of system memory, the second is the number of inodes. If the available memory is a value between two entries, the value of vx_ninode is interpolated.

Table 1 Inode Table Size

Total Memory in Mbytes   Maximum Number of Inodes
8                        400
16                       1000
32                       2500
64                       6000
128                      8000
256                      16,000
512                      32,000
1024                     64,000
2048                     128,000
8192                     256,000
32,768                   512,000
131,072                  1,024,000
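As a worked illustration of the two defaults just described, the following C sketch computes ncsize from the stated formula and linearly interpolates vx_ninode for a memory size that falls between two rows of Table 1 (the table values are taken from this document; the function names are hypothetical):

#include <stdio.h>

struct row { long mem_mb; long inodes; };

static const struct row tbl[] = {
    {8, 400},       {16, 1000},     {32, 2500},      {64, 6000},
    {128, 8000},    {256, 16000},   {512, 32000},    {1024, 64000},
    {2048, 128000}, {8192, 256000}, {32768, 512000}, {131072, 1024000},
};

static long vx_ninode_default(long mem_mb)
{
    size_t i, n = sizeof(tbl) / sizeof(tbl[0]);

    if (mem_mb <= tbl[0].mem_mb)
        return tbl[0].inodes;
    for (i = 1; i < n; i++)
        if (mem_mb <= tbl[i].mem_mb)   /* interpolate between rows i-1, i */
            return tbl[i - 1].inodes +
                   (mem_mb - tbl[i - 1].mem_mb) *
                   (tbl[i].inodes - tbl[i - 1].inodes) /
                   (tbl[i].mem_mb - tbl[i - 1].mem_mb);
    return tbl[n - 1].inodes;
}

int main(void)
{
    long ninode = 1024, vx_ncsize = 1024, dnlc_hash_locks = 512;

    printf("default ncsize     = %ld\n",
           (ninode + vx_ncsize) + 8 * dnlc_hash_locks);
    printf("vx_ninode for 3 GB = %ld\n", vx_ninode_default(3072));
    return 0;
}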

Inode Tables

The HFS Inode Cache

The HFS inode cache contains information about the file type, size, timestamps, permissions, and block map. This information is stored in the on-disk inode. The in-memory inode contains information from the on-disk inode, linked-list and other pointers, the inode number, and lock primitives. One inode entry must exist in memory for every open file. Closed file inodes are kept on the free list.

The HFS inode table is controlled by the kernel parameter ninode.

Memory cost in bytes per inode/vnode/hash entry for the HFS inode cache:

Release       Bytes
10.20         424
11.0 32-bit   444
11.0 64-bit   680
11i 32-bit    475
11i 64-bit    688

On 10.20 the inode table and dnlc (directory name lookup cache) are combined; the tunable parameter for the dnlc, ncsize, was introduced in patch PHKL_18335. On 11.00 the dnlc is configurable using the ncsize and vx_ncsize kernel parameters. By default, ncsize = (ninode + vx_ncsize) + (8 * dnlc_hash_locks). The parameter vx_ncsize defines the memory space reserved for the VxFS directory path-name cache (in bytes). The default value for vx_ncsize is 1024; dnlc_hash_locks defaults to 512. As of JFS 3.5, vx_ncsize is obsolete.

The JFS Inode Cache

A VxFS file system obtains the value of vx_ninode from the system configuration file used for making the kernel (/stand/system, for example). This value is used to determine the number of entries in the VxFS inode table. By default, vx_ninode initializes at zero; the file system then computes a value based on the system memory size (see the table below). To change the computed value of vx_ninode, you can hard-code the value in SAM; for example, setting vx_ninode=16000 fixes the table at 16,000 inodes. The number of inodes in the inode table is calculated according to the following table. The first column is the amount of system memory, the second is the number of inodes. If the available memory is a value between two entries, the value of vx_ninode is interpolated.


The memory requirements for JFS are dependent on the revision of JFS and system memory.

Maximum VxFS inodes in the cache based on system memory

System Memory in MB   JFS 3.1   JFS 3.3-3.5
256                   18666     16000
512                   37333     32000
1024                  74666     64000
2048                  149333    128000
8192                  149333    256000
32768                 149333    512000
131072                149333    1024000

To check how many inodes are in the JFS inode cache for JFS 3.1 or 3.3:

# echo "vxfs_ninode/D" | adb -k /stand/vmunix /dev/mem

For JFS 3.5, use the vxfsstat command:

# vxfsstat -v / | grep maxino
vxi_icache_maxino      128000   vxi_icache_peakino     128002

The JFS daemon (vxfsd) scans the free list; if inodes have been on the free list for a given length of time, they are freed back to the kernel memory allocator. The amount of time this takes, and the amount freed, varies by revision:

JFS revision   Maximum time in seconds before being freed   Maximum inodes to free per second
JFS 3.1        300                                          1/300th of current
JFS 3.3        500                                          50
JFS 3.5        1800                                         1-25

Memory cost in bytes per JFS inode (inode/vnode/locks) by revision:

JFS 3.1   11.0    32-bit 1220   64-bit 2244
JFS 3.3   11.0    32-bit 1494   64-bit 1632
JFS 3.3   11.11   32-bit 1352   64-bit 1902
JFS 3.5   11.11   64-bit 1850

Tuning the maximum size of the JFS Inode Cache

Remember, each environment is different:

· There must be one inode entry for each file opened at any given time.

· Most systems will run fine with 2% or less of memory used for the JFS inode cache.

· Large file servers (for example, Web servers and NFS servers) that randomly access a large set of inodes benefit from a large cache.

· The inode cache typically appears full after accessing many files sequentially (for example, find, ll, backups).

· The HFS ninode parameter has no impact on the JFS inode cache.

While a static cache (setting a non-zero value for vx_ninode) may save memory, there are factors to keep in mind:

· Inodes freed to the kernel memory allocator may not be available for immediate use by other objects.

· Static inode caches keep inodes in the cache longer.

Process Table

The process table has two levels of control: nproc for the system-wide limit and maxuprc for the per-user limit. When configuring these parameters it is important to take into account the amount of configured memory. Ideally, all running processes will use no more than the amount of device swap configured. When configuring maxuprc it is prudent to consider user environments that require large numbers of user processes; the most common of these are databases and environments with a large number of printers and print jobs. As databases typically fall in the user domain (i.e., Oracle is considered a user), a value of 60% of nproc is a good starting point. As remote and network print jobs require 4 processes per job, a value of 4 times the number of printers is suggested. For example, with nproc set to 4096 and 50 network printers, a starting point would be a maxuprc of about 2458 (60% of 4096), with roughly 200 processes allowed for printing.

Can't fork errors will result if the limits of table size or virtual memory are reached. If possible, sar -v should be run to check process table use. If there is not an overflow, the total number of system processes can be determined. If there is insufficient virtual memory to satisfy the fork call, the system will indicate "can't fork: out of virtual memory". This is not a process table problem; increasing nproc or maxuprc will only make matters worse. Increasing device swap is appropriate if this error is encountered.

File Table

The file table imposes the lightest impact on performance. High values may be needed to satisfy the system's need for many concurrent file opens. By using Glance or sar, a high-water mark can be determined for both the process and file tables. Setting the process table 25% above the peak usage value provides a sufficient worst-case load buffer; setting nfile 50% above peak usage likewise provides a sufficient buffer. For example, a peak of 8,000 open files suggests an nfile of about 12,000.

Module 3

Process Management

Process creation

Process 0 is created and initialized at system boot time, but all other processes are created by a fork() or vfork() system call.

· The fork() system call causes the creation of a new process. The new (child) process is an exact copy of the calling (parent) process.

· vfork() differs from fork() only in that the child process can share code and data with the calling process (parent process). This speeds cloning activity significantly, at a risk to the integrity of the parent process if vfork() is misused.

The use of vfork() for any purpose except as a prelude to an immediate exec() or exit() is not supported. Any program that relies upon the differences between fork() and vfork() is not portable across HP-UX systems. A minimal sketch of the supported pattern follows.
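The following hedged C example shows the only supported pattern, vfork() followed immediately by exec() or _exit(); /usr/bin/ll is used as the target purely for illustration:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = vfork();   /* child borrows the parent's address space */

    if (pid == 0) {
        /* Child: exec at once; touch nothing in the parent's data. */
        execl("/usr/bin/ll", "ll", (char *)0);
        _exit(127);        /* reached only if the exec failed */
    } else if (pid > 0) {
        int status;
        /* Parent was blocked until the child exec'd or exited;
         * now reap the child when it finishes. */
        waitpid(pid, &status, 0);
    } else {
        perror("vfork");
        return 1;
    }
    return 0;
}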


Comparison of fork() and vfork()

fork()    Sets context to point to parent. The child process is an exact copy of the parent process (see the fork(2) manpage for inherited attributes). Copy on access. Must reserve swap.

vfork()   Can share the parent's data and code. vfork() returns 0 in the child's context and (later) the pid of the child in the parent's context. The child borrows the parent's memory and thread of control until a call to exec() or exit(); the parent must sleep while the child is using its resources, since the child shares its stack and uarea. No reservation of swap.

At user (application) level, processes or threads can create new processes via fork() or vfork().

At kernel level, only threads can fork new processes. When forked, the child process inherits the following attributes from the parent process:

· Real, effective, and saved user IDs.

· Real, effective, and saved group IDs.

· List of supplementary group IDs (see getgroups(2)).

· Process group ID.

· File descriptors.

· Close-on-exec flags (see exec(2)).

· Signal handling settings (SIG_DFL, SIG_IGN, address).

· Signal mask (see sigvector(2)).

· Profiling on/off status (see profil(2)).

· Command name in the accounting record (see acct(4)).

· Nice value (see nice(2)).

· All attached shared memory segments (see shmop(2)).

· Current working directory

· Root directory (see chroot(2)).

· File mode creation mask (see umask(2)).

· File size limit (see ulimit(2)).

· Real-time priority (see rtprio(2)).

Each child file descriptor shares a common open file description with the corresponding parent file descriptor. Thus, changes to the file offset, file access mode, and file status flags of file descriptors in the parent also affect those in the child, and vice versa (see the sketch after the following list). The child process differs from the parent process in the following ways:

· The child process has a unique process ID.

· The child process has a different parent process ID (which is the process ID of the parent process).

· The set of signals pending for the child process is initialized to the empty set.

· The trace flag (see the ptrace(2) PT_SETTRC request) is cleared in the child process.

· The AFORK flag in the ac_flags component of the accounting record is set in the child process.

· Process locks, text locks, and data locks are not inherited by the child (see plock(2)).

· All semadj values are cleared (see semop(2)).

· The child process's values for tms_utime, tms_stime, tms_cutime, and tms_cstime are set to zero.

· The time left until an alarm clock signal is reset to 0 (clearing any pending alarm), and all interval timers are set to 0 (disabled).
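The shared open file description mentioned before this list can be demonstrated directly. In this hedged C sketch, a read in the child advances the file offset seen by the parent, because both descriptors refer to one open file description:

#include <stdio.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char c;
    int fd = open("/etc/passwd", O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (fork() == 0) {
        read(fd, &c, 1);   /* child consumes byte 0 */
        _exit(0);
    }
    wait(NULL);
    read(fd, &c, 1);       /* parent reads byte 1, not byte 0 */
    printf("offset seen by parent: %ld\n", (long)lseek(fd, 0, SEEK_CUR));
    return 0;
}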

Process Execution

Once a process is created with fork() or vfork(), the process calls exec() (found in kern_exec.c) to begin executing program code. For example, a user might run the command /usr/bin/ll from the shell; to execute the command, a call is made to exec(). exec(), in all its forms, loads a program from an ordinary executable file onto the current process, replacing the existing process's text with a new copy of an executable file.


An executable object file consists of a header (see a.out(4)), text segment, and data segment. The data segment contains an initialized portion and an uninitialized portion (bss). The path or file argument refers to either an executable object file or a script file of data for an interpreter. The entire user context (text, data, bss, heap, and user stack) is replaced. Only the arguments passed to exec() are passed from the old address space to the new address space. A successful call to exec() does not return, because the new program overwrites the calling program.

Process states

Through the course of its lifetime, a process transits through several states. Queues in main memory keep track of the process by its process ID. A process resides on a queue according to its state; process states are defined in the proc.h header file. Events such as receipt of a signal cause the process to transit from one state to another.

Process states

State What Takes Place

idle (SIDL) Process is created by a call to fork, vfork, or exec; can be scheduled to run.

run (SRUN) Process is on a run queue, available to execute in either kernel or user mode.

stopped (SSTOP) Executing process is stopped by a signal or parent process.

sleep (SSLEEP) Process is not executing; may be waiting for resources.

zombie (SZOMB) Having exited, the process no longer exists, but leaves behind for the parent process some record of its execution.

When a program starts up a process, the kernel allocates a structure for it from the process table. The process is now in the idle state, waiting for system resources. Once it acquires the resource, the process is linked onto a run queue and made runnable. When the process acquires a time-slice, it runs, switching as necessary between kernel mode and user mode.

If a running process receives a SIGSTOP signal (as with control-Z in vi) or is being traced, it enters the stop state. On receiving a SIGCONT signal, the process returns to a run queue (in-core, runnable). If a running process must wait for a resource (such as a semaphore or completion of I/O), the process goes on a sleep queue (sleep state) until it gets the resource, at which time the process wakes up and is put on a run queue (in-core, runnable).

A sleeping process might also be swapped out, in which case, when it receives its resource (or wakeup signal), the process might be made runnable but remain swapped out. The process is swapped in and put on a run queue. Once a process ends, it exits into the zombie state.

The sleep*() Routines

Unless a thread is running with real-time priority, it will eventually exhaust its time slice and be put to sleep. sleep() causes the calling thread (not the process) to suspend execution for the required time period. A sleeping thread gives up the processor until a wakeup() occurs on the channel on which the thread is placed. During sleep() the thread enters the scheduling queue at priority (pri).

· When pri <= PZERO, a signal cannot disturb the sleep

· If pri > PZERO the signal request will be processed.

· In the case of RTPRIO scheduling, the sleep can be disturbed by a signal only if SSIGABL is set. Setting SSIGABL is dependent on the value of pri.

Note: The sleep.h header file has parameter and sleep hash queue definitions for use by the sleep routines. The ksleep.h header file hasstructure definitions for the channel queues to which the kernel thread is linked when asleep.

· sleep() is passed the following parameters:

Address of the channel on which to sleep.

Priority at which to sleep and sleep flags.

Advanced Performance Tuning for HP-UX file:///C:/satish/hp-ux/Performance/Advanced%20Performance%20Tuni...

27 of 135 10/28/2010 11:19 PM

Address of thread that called sleep().

· The priority of the sleeping thread is determined.

If the thread is scheduled real-time, sleep() makes its priority the stronger of the requested value and kt_pri.

Otherwise, sleep() uses the requested priority.

· The thread is placed on the appropriate sleep queue and the sleep-queue lock is unlocked.

If sleeping at an interruptible priority, the thread is marked SSIGABL and will handle any signals received.

If sleeping at an uninterruptible priority, the thread is marked !TSSIGABL and will not handle any signals.

· The thread's voluntary context-switch count is increased and swtch() is called to block the thread.

· Once time passes and the thread awakens, it checks to determine if a signal was received, and if so, handles it.

· Semaphores previously set aside are now acquired again.

wakeup()

The wakeup() routine is the counterpart to the sleep() routine. If a thread is put to sleep with a call to sleep(), it must be awakened by calling wakeup(). When wakeup() is called, all threads sleeping on the wakeup channel are awakened. The actual work of awakening a thread is accomplished by the real_wakeup() routine, called by wakeup() with the type set to ST_WAKEUP_ALL. When real_wakeup() is passed the channel being aroused, it takes the following actions:

· Determines appropriate sleep queue (slpque) data structure, based on the type of wakeup passed in.

· Acquires the sleep queue lock if needed in the multiprocessing (MP) case; goes to spl6 in the uniprocessing (UP) case.

· Acquires the thread lock for all threads on the appropriate sleep queue.

If the kt_wchan matches the argument chan, removes the threads from the sleep queue and updates the sleep tail array, if needed.

Clears kt_wchan and its sleeping time.

If threads were in TSSLEEP and not sleeping on a beta semaphore, real_wakeup() assumes they were not on a run queue and calls force_run() to force the thread into the TSRUN state.

Otherwise, if threads were swapped out (TSRUN && !SLOAD), real_wakeup() takes steps to get them swapped in.

If the thread is on the ICS, attributes this time to the thread being awakened. Starts a new timing interval, attributing the previous one to the thread being awakened.


· Restores the spl level, in the UP case; releases the sleep queue lock as needed in the MP case.
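The sleep-channel and wake-all behavior has a rough user-space analogue in POSIX condition variables; this is an analogy only, not the kernel implementation. pthread_cond_broadcast(), like ST_WAKEUP_ALL, wakes every thread waiting on the "channel":

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t chan = PTHREAD_COND_INITIALIZER; /* the "channel" */
static int resource_ready = 0;

static void *sleeper(void *arg)
{
    pthread_mutex_lock(&mtx);
    while (!resource_ready)          /* like sleep(chan, pri): block */
        pthread_cond_wait(&chan, &mtx);
    pthread_mutex_unlock(&mtx);
    printf("thread %ld awakened\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    long i;

    for (i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, sleeper, (void *)i);

    pthread_mutex_lock(&mtx);
    resource_ready = 1;              /* the awaited resource arrives */
    pthread_cond_broadcast(&chan);   /* like wakeup(chan): wake them all */
    pthread_mutex_unlock(&mtx);

    for (i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}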

force_run()

The force_run() subroutine marks a thread TSRUN, asserts that the thread is in memory (SLOAD), and puts the thread on a run queue with setrq(). If its priority is stronger than that of the running thread, it forces a context switch, sets the processor's wakeup flag, and notifies the thread's processor (kt_spu) with the mpsched_set() routine. Otherwise, force_run() improves the swapper's priority if needed, sets wantin, and wakes up the swapper.

Process Termination

When a process finishes executing, HP-UX terminates it using the exit system call. Circumstances might require a process to synchronize its execution with a child process. This is done with the wait system call, which has several related routines. During the exit system call, a process enters the zombie state and must dispose of its child processes. Releasing process and thread structures no longer needed by the exiting process or thread is handled by three routines: freeproc(), freethread(), and kissofdeath(). This section describes each process-termination routine in turn.

The exit System Call

exit() may be called by a process upon completion, or the kernel may make the call on behalf of the process due to a problem. If the parent process of the calling process is executing a wait(), wait3(), or waitpid(), it is notified of the calling process's termination. If the parent of the calling process is not executing a wait(), wait3(), or waitpid(), and does not have the SIGCLD (death of a child) signal set to SIG_IGN (ignore signal), the calling process is transformed into a zombie process. The parent process ID is set to 1 for all of the calling process's existing child processes and zombie processes; process 1 (init) inherits each of them. A hedged sketch of parent-side reaping follows.
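In this short, illustrative C sketch, the parent calls waitpid(), so the child's zombie entry is reaped promptly instead of lingering:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == 0)
        _exit(42);                   /* child terminates, enters SZOMB */
    if (pid > 0) {
        int status;
        waitpid(pid, &status, 0);    /* parent reaps; zombie is released */
        if (WIFEXITED(status))
            printf("child %ld exited with %d\n",
                   (long)pid, WEXITSTATUS(status));
    }
    return 0;
}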

Process Management Structures

The process management system contains the kernel's scheduling subsystem and interprocess communication (IPC) subsystem. The process management system interacts with the memory management system to make use of virtual memory space. The process control system interacts with the file system when reading files into memory before executing them.

Processes communicate with other processes via shared memory or system calls. Communication between processes (IPC) includes asynchronous signaling of events and synchronous transmission of messages between processes. System calls are requests by a process for some service from the kernel, such as I/O, process coordination, system status, and data exchange.

The effort of coordinating the aspects of a process in and out of execution is handled by a complex of process management structures in the kernel. Every process has an entry in a kernel process table and a uarea structure, which contains private data such as control and status information. The context of a process is defined by all the unique elements identifying it -- the contents of its user and kernel stacks, the values of its registers, its data structures and variables -- and is tracked in the process management structures.

Process management code is divided into external interface and internal implementation parts. proc_iface.h defines the interface and contains the utility and access functions, external interface types, and utility macros. proc_private.h defines the implementation and contains internal functions, types, and macros. Kernel threads code is similarly organized into kthread_iface.h and kthread_private.h.

Process structure, virtual layout overview


Principal structures of process management

Structure Purpose

proc table   Allocated at boot time; remains resident in memory (non-swappable). For every process, contains an entry with the process's status, signal, and size information, as well as per-process data that is shared by the kernel threads.

kthread structure   One of two structures representing the kernel thread (the other is the user structure). Contains the scheduling, priority, state, and CPU usage information of a kernel thread. Remains resident in memory.

vas   Contains all the information about a process's virtual address space. It is dynamically allocated as needed and is memory resident.

pregion   Contains process and thread information about use of virtual address space for text, data, stack, and shared memory, including page count, protections, and starting addresses of each.

uarea   The user structure contains the per-thread data that is swappable.


proc Table

The proc table comprises identifying and functional information about every individual process. Each active process has a proc table entry, which includes information on process identification, process threads, process state, process priority, and process signal handling. The table resides in memory and may not be swapped, as it must be accessible by the kernel at all times.

Definitions for the proc table are found in the proc_private.h header file.

Principal fields in the proc structure

Type of Field   Name and Purpose

Process identification   Process ID (p_pid). Parent process ID (p_ppid). Real user ID (p_uid), used to direct tty signals. Process group ID (p_pgrp). Pointer to the pgroup structure (*p_pgrp_p). Maximum number of open files allowed (p_max). Pointer to the region containing the uarea (p_upreg).

threads   Values for first and subsequent threads (p_created_threads). Pointers to first and last thread in the process (p_firstthreadp, p_lastthreadp). Number of live threads in the process, excluding zombies (p_livethreads). List of cached threads (*p_cached_threads).

process state   Current process state (p_stat). Priority (p_pri). Per-process flags (p_flag).

process signaling   Signals pending on the process (p_sig). Active list of pending signals (*p_ksiactive). Signals being ignored (p_sigignore). Signals being caught by user (p_sigcatch). Number of signals recognized by process (p_nsig).

Locking information   Thread lock for all threads (*thread_lock). Per-process lock (*p_lock).

What are Kernel Threads?

A process is a representation of an entire running program. By comparison, a kernel thread is a fraction of that program. Like a process, a thread is a sequence of instructions being executed in a program. Kernel threads exist within the context of a process and provide the operating system the means to address and execute smaller segments of the process. They also enable programs to take advantage of capabilities provided by the hardware for concurrent and parallel processing. The concept of threads is interpreted numerous ways, but to quote a definitive source on the HP-UX implementation (S.J. Norton and M.D. DiPasquale, ThreadTime: Multithreaded Programming Guide, Upper Saddle River, NJ: Prentice Hall PTR, Hewlett-Packard Professional Books, 1997, p. 2):

A thread is "an independent flow of control within the process", composed of a [process's register] context,

program counter, and a sequence of instructions to execute. An independent flow of control is an execution

path through the program code in a process. The register context and program counter contain values that

indicate the current state of program execution. The sequence of instructions to execute is the actual program

code.

Further, threads are

· A programming paradigm and associated set of interfaces allowing applications to be broken up into logically distinct tasks that, when supported by hardware, can be run in parallel.

· Multiple, independent, executable entities within a process, all sharing the process's address space, yet owning unique resources within the process.

Each thread can be scheduled, synchronized, and prioritized, and can send and receive signals. Threads share many of the resources of a process, eliminating much of the overhead involved during creation, termination, and synchronization.

A thread's "management facilities" (register context et al.) are used to maintain the thread's "state" information throughout its lifetime. State information monitors the condition of an entity (like a thread or process); it provides a snapshot of the entity's current condition. For example, when a thread context switch takes place, the newly scheduled thread's register information tells the processor where the thread left off in its execution. More specifically, a thread's program counter would contain the current instruction to be executed upon start-up.

As of release 10.30, HP-UX has kernel threads, which change the role of processes. A process is now just a logical container used to group related threads of an application. Each process contains at least one thread. This single (initial) thread is created automatically by the system when the process starts up. An application must explicitly create the additional threads. A process with only one thread is a "single-threaded process." A process with more than one thread is a "multi-threaded process." Currently, the HP-UX kernel manages single-threaded processes as executable entities that can be scheduled to run on a processor (that is, each process contains only one thread). Development of HP-UX is moving toward an operating system that supports multi-threaded processes.

Comparison of Threads and Processes

The following lists process resources shared by all threads within a process:

· File descriptors, file creation mask

· User and group IDs, tty

· Root working directory, current working directory

· semaphores, memory, program global variables

· signal actions, message queues, timers

The following lists thread resources private to each thread within a process:

· User registers

· Error number (errno)

· Scheduling policy and priority

· Processor affinity

· Signal mask

· Stack

· Thread-specific data

· Kernel uarea

Like the context of a process, the context of a thread consists of instructions, attributes, a user structure with register context, private storage, a thread structure, and a thread stack. Two kernel data structures -- proc and user -- represent every process in a process-based kernel. (The proc structure is non-swappable and user is swappable.) In addition, each process has a kernel stack allocated with the user structure in the uarea.

A threads-based kernel also uses a proc and a user structure. Like the proc structure of the process-based kernel, the threads-based proc structure remains memory resident and contains per-process data shared by all the kernel threads within the process. Each thread shares its host process's address space for access to resources owned or used by the process (such as a process's pointers into the file descriptor table). Head and tail pointers to a process's thread list are included in the proc structure.

Each thread manages its own kernel resources with private data structures to maintain state information and a unique counter. A thread is represented by a kthread structure (always memory resident), a user structure (swappable), and a separate kernel stack for each kernel thread. Every kthread structure contains a pointer to its associated proc structure and a pointer to the next thread within the same process. All the active threads in the system are linked together on the active threads list.

Like a process, a thread has a kind of life cycle based on the execution of a program or script. Through the course of time, threads, like processes, are created, run, sleep, and are terminated.

User and Kernel Mode

A kernel thread, like a process, operates in user and kernel modes, and through the course of its lifetime switches between the stacks maintained in each mode. Stacks for each mode accumulate information such as variables, addresses, and buffer counts, and it is through these stacks that the thread executes instructions and switches modes. Certain kinds of instructions trigger mode changes. For example, when a program invokes a system call, the system call stub code passes the system call number through a gateway page that adjusts privilege bits to switch to kernel mode. When a thread switches mode to the kernel, it executes kernel code and uses the kernel stack.

Thread's Life Cycle

Like the process, the thread can be understood in terms of its "life cycle":

1. A process is created via a call to fork() or vfork(); the fork1() routine sets up the process's pid (process ID) and tid (thread ID). The process and its thread are linked to the active list. The thread is given a creation state flag of TSIDL.

2. fork1() calls newproc() to create the thread and process, and to set up the pointers to the parent. newproc() calls procdup() to create a duplicate copy of the parent and allocate the uarea for the new child process. The new child thread is flagged runnable and given a flag of TSRUN. Once the thread has this flag, it is placed in the run queue.

3. The kernel schedules the thread to run; its state becomes TSRUNPROC (running). While in this state, the thread is given the resources it requests. This continues until a clock interrupt occurs, or the thread relinquishes its time to wait for a requested resource, or the thread is preempted by another (higher priority) thread. If this occurs, the thread's context is switched out.

4. A thread is switched out if it must wait for a requested resource. This causes the thread to go into a state of TSLEEP. The thread sleeps until its requested resource returns and makes it eligible to run again. During the thread's TSLEEP state, the kernel calls hardclock() every clock tick (10 ms) to charge the currently running thread with CPU usage. After 4 clock ticks (40 ms), hardclock() calls setpri() to adjust the thread's user priority. The thread is given this value on the next context switch. After 10 clock ticks (100 ms), a context switch occurs. The next thread to run will be the thread with the highest priority in a state of TSRUN. For the remaining threads in the TSRUN state, schedcpu() is called after 100 clock ticks (1 second). schedcpu() adjusts all thread priorities at this time.

5. Once a thread acquires the requested resource, it calls the wakeup() routine and again changes state from TSLEEP to TSRUN. This makes the thread eligible to run again.

6. On the next context switch the thread is allowed to run, provided it is the next eligible candidate. When allowed to run, the thread state changes again to TSRUNPROC.

7. Once the thread completes its task, it calls exit(). It releases all resources and transfers to the TSZOMB state. Once all resources are released, the thread and the process entries are released.

8. If the thread is being traced, it enters the TSSTOP state.

9. Once the thread is resumed, it transfers from TSSTOP to TSRUN.

Multi-Threading

When a task has two or more semi-independent subtasks, multiple threading can increase throughput, give better response time, speed operations, improve program structure, use fewer system resources, and make more efficient use of multiprocessors. With multi-threading, a process has many threads of control. Note that order of execution is still important! The following terminology is useful for understanding multi-threading:

User threads

Handled in user space and controlled using the threads APIs provided in the threads library. Also referred to as user-level or application-level threads.

Kernel threads

Handled in kernel space and created by the thread functions in the threads library. Kernel threads are kernel schedulable entities visible to the operating system.

Lightweight processes (LWPs)

Threads in the kernel that execute kernel code and system calls.

Bound threads

Threads that are permanently bound to LWPs. A bound thread is a user thread bound directly to a kernel thread. Both a user threadand a kernel-scheduled entity are created when a bound thread is created.

Unbound threads

Threads that attach to and detach from the LWP pool. An unbound thread is a user thread that can execute on top of any available LWP. Both bound and unbound threads have their advantages and disadvantages, depending entirely on the application that uses them.

Concurrency   At least two threads are in progress at the same time.

Parallelism   At least two threads are executing simultaneously.
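A minimal multi-threaded process in C using the POSIX threads API (on HP-UX, typically compiled against the pthread library, e.g. cc ... -lpthread; details vary by release) shows two concurrent flows of control sharing process globals, with a mutex imposing the ordering the text warns about:

#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;   /* process-global: visible to all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int i;
    for (i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* order of execution still matters */
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, NULL);   /* additional threads are */
    pthread_create(&t2, NULL, worker, NULL);   /* created explicitly     */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", shared_counter);  /* deterministic: 200000 */
    return 0;
}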

Kernel Thread Structure

Each process has an entry in the proc table; this information is shared by all kernel threads within the process. One kernel thread structure (kthread) is allocated per active thread. The kthread structure is not swappable. It contains all thread-specific data needed while the thread is swapped out, including the process ID, a pointer to the process address space, file descriptors, current directory, UID, and GID. Other per-thread data (in user.h) is swapped with the thread.

Information shared by all threads within a process is stored in the proc structure, rather than the kthread structure. The kthread structure contains a pointer to its associated proc structure. (In a multi-threads environment the kthread would point to the other threads that make up the process, controlled by a threads listing maintained in the proc table.)

In a threads-based kernel, the run and sleep queues consist of kthreads instead of processes. Each kthread contains forward and backward pointers for traversing these queues. All schedule-related attributes, such as priority and states, are kept at the threads level. Definitions for the kernel threads structure are found in the kthread_private.h header file and include general information, scheduling information, CPU affinity information, state and flag information, and signal information.

Principal entries in kernel thread structure

Entry in struct kthread Purpose

*kt_link, *kt_rlink pointers to forward run/sleep queue link and backward run queue link

*kt_procp Pointer to proc structure


kt_fandx, kt_pandx Free and active kthread structure indices

kt_nextp, kt_prevp Other threads in the same process

kt_flag, kt_flag2 Per-thread flags

kt_cntxt_flags thread context flags

kt_fractioncpu fraction of cpu during recent p_deactime

kt_wchan Event thread is sleeping on

*kt_upreg pointer to the pregion containing the uarea

kt_deactime seconds since last deact or react

kt_sleeptime seconds since last sleep or wakeup

kt_usrpri User priority (based on kt_cpu and p_nice)

kt_pri priority (lower numbers are stronger)

kt_cpu decaying cpu usage for scheduling

kt_stat Current thread state

kt_cursig number of current pending signal, if any

kt_spu SPU number to which thread is assigned

kt_spu_wanted preference to desired SPU

kt_spu_group SPU group to which thread is associated

kt_spu_mandatory, kt_sleep_type Assignment as to whether SPU is mandatory or advisory; directive to wake up all or one SPU

kt_sync_flag Reader synchronization flags

kt_interruptible Is the thread interruptible?

kt_wake_suspend Is a resource waiting for the thread to suspend?

kt_active Is the thread alive?

kt_halted Is the thread halted cleanly?

kt_tid unique thread ID

kt_user_suspcnt, kt_user_stopcnt User-initiated suspend and job-control stop counts

kt_suspendcnt Suspend count

*kt_krusagep Pointer to kernel resource usages

kt_usertime, kt_systemtime, kt_interrupttime Machine cycles spent in user mode, system mode, and handling interrupts

kt_sig signals pending to the thread

kt_sigmask Current signal mask

kt_schedpolicy scheduling policy for the thread

kt_ticksleft Round-robin clock ticks left

*kt_timers Pointer to thread's timer structures


*kt_slink Pointer to linked list of sleeping threads

*kt_sema Head of per-thread alpha semaphore list

*kt_msem_info Pointer to msemaphore info structure

*kt_chanq_infop Pointer to channel queue info structure

kt_dil_signal Signal to use for DIL interrupts

*kt_cred Pointer to user credentials [1]

*kt_cdir, *kt_rdir Current and root directories of current thread, as shown in struct vnode

*kt_fp Current file pointer to struct file.


Role of the vas structure

Every process has a proc entry containing a pointer (p_vas) to the process's virtual address space. The vas maintains a doubly linked list of pregions that belong to a given process and thread. The vas is always memory resident and provides information based on the process's virtual address space.

Note: Do not confuse the vas structure with virtual address space (VAS) in memory. The vas structure is a few bytes; the VAS is 4 gigabytes. The following table (derived from vas.h) shows the principal entries in struct vas.

Entries in vas structure

Entry in struct vas Purpose

va_ll Doubly linked list of pregions

va_refcnt Number of pointers to the vas

va_rss, va_prss, va_dprss Cached approximation of shared and private resident set size, and private RSS in memory and on swap

*va_proc Pointer to existing process in struct proc

va_flags Various flags (itemized after this table)

va_wcount Number of writable memory-mapped files sharing the pseudo-vas

va_vaslock Field in struct rw_lock that controls access to the vas

*va_cred Pointer to process credentials in struct ucred

va_hdl vas hardware-dependent information

va_ki_vss Total virtual memory

va_ki_flag Indication of whether vss has changed

va_ucount Total virtual memory of user space.

The following definitions correspond to va_flags:

VA_HOLES vas might have holes within pregions

VA_IOMAP IOMAP pregion within the vas

VA_WRTEXT writable text

VA_PSEUDO pseudo vas, not a process vas

VA_MULTITHEADED vas connected to a multithreaded process


VA_MCL_FUTURE new pages that must be mlocked

VA_Q2SHARED quadrant 2 used for shared data

Pregion Structure

The pregion represents an active part of the process's virtual address space (VAS). This may consist of the text, data, stack, and shared memory. A pregion is memory resident and dynamically allocated as needed. Each process has a number of pregions that describe the regions attached to the process. In this module we discuss only down to the pregion level; the HP-UX Memory Management white paper provides more information about regions.

pregion types

Type Definition

PT_UNUSED unused pregion

PT_UAREA User area

PT_TEXT Text region

PT_DATA Data region

PT_STACK Stack region

PT_SHMEM Shared memory region

PT_NULLDREF Null pointer dereference

PT_SIGSTACK Signal stack

PT_IO I/O region

These pregion types are defined based on the value of p_type within the pregion structure and can be useful for determining the characteristics of a given process. They may be accessed via the kt_upreg pointer in the thread table. A process has a minimum of four defined pregions under normal conditions. The total number of pregion types defined may be identified with the definition PT_NTYPES.

Entries comprising a pregion

Type Purpose

Structure information   Pointers to next and previous pregions. Pointer and offset into the region. Virtual space and offset for the region. Number of pages mapped by the pregion. Pointer to the VAS.

Flags and type   Referenced by p_flags and p_type.

Scheduling information   Remaining pages to age (p_ageremain). Indices of next scans for vhand's age and steal hands (p_agescan, p_stealscan). Best nice value for all processes sharing the region used by the pregion (p_bestnice). Sleep address for deactivation (p_deactsleep).

Thread information   Value to identify thread, for uarea pregion (p_tid).

Traversing pregion Skip List

Pregion linked lists can get quite large if a process is using many discrete memory-mapped pregions. When this happens, the kernel spends a lot of time walking the pregion list. To avoid walking the list linearly, HP-UX uses skip lists [2], which enable it to use four forward links instead of one. These are found in the beginning of the vas and pregion structures, in the p_ll element.

User Structures (uarea)

The user area is a per-process structure containing data not needed in core when a process is swapped out.


The threads of a process point to the pregion containing the process's user structure, which consists of the uarea and kernel stack. The user structure contains the information necessary for the execution of a system call by a thread. The kernel thread's uarea is special in that it resides in the same address space as the process data, heap, private MMFs, and user stack. In a multi-threaded environment, each kernel thread is given a separate space for its uarea.

Each thread has a separate kernel stack.

Addressing the uarea is analogous to the prior process-based kernel structure. A kernel thread references its own uarea through struct user. However, you cannot index directly into the user structure as is possible with the proc table; the only way into the uarea is through the kt_upreg pointer in the thread table.

Principal entries in the uarea (struct user)

Type Purpose

user structure pointers   Pointers to proc and thread structures (u_procp, u_kthreadp). Pointers to saved state and most recent savestate (u_sstatep, u_pfaultssp).

system call fields   Arguments to current system call (u_arg[]). Pointer to the arglist (u_ap). Return error code (u_error). System call return values (r_val(n)).

signal management   Signals to take on sigstack (u_sigonstack). Saved mask from before sigpause (u_oldmask). Code to trap (u_code).

The user credentials pointer (for uid, gid, etc.) has been moved from the uarea and is now accessed through the p_cred() accessor for the proc structure and the kt_cred() accessor for the kthread structure. See the comments under the kt_cred() field in kthread.h for details governing usage.

Process Control Block (pcb)

Note: HP-UX now handles context switching on a per-thread basis.

A process control block (pcb) is maintained in the user structure of each kernel thread as a repository for thread scheduling information. The pcb contains all the register states of a kernel thread that are saved or restored during a context switch from one thread's environment to another. The context of the currently running thread is saved in its associated uarea pcb when a call to swtch() is made. The save() routine saves the current thread state in the pcb on the switch out. The resume() routine maps the user area of the newly selected thread and restores the process registers from the pcb. When we return from resume(), the selected thread becomes the currently running thread and its uarea is automatically mapped into the virtual memory address of the system's global uarea. (A user-space analogy to save() and resume() appears after the following list.) The register context includes:

· General-purpose Registers

· Space registers

· Control registers

· Instruction Address Queues (Program Counter)

· Processor Status Word (PSW)

· Floating point registers
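A loose user-space analogy to save() and resume() (an analogy only, not the kernel mechanism) is C's setjmp()/longjmp() pair, which saves and later restores enough register context to continue where execution left off:

#include <setjmp.h>
#include <stdio.h>

static jmp_buf saved_context;   /* plays the role of the pcb here */

static void worker(void)
{
    printf("worker: restoring the saved context\n");
    longjmp(saved_context, 1);  /* "resume": reload registers and jump */
}

int main(void)
{
    if (setjmp(saved_context) == 0) {  /* "save": capture register state */
        worker();
    } else {
        printf("main: continued where the context was saved\n");
    }
    return 0;
}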

Contents of the Process Control Block (pcb)

Context element Purpose

General registers pcb_r1 --> pcb_r31 [GR0 - GR31]   Thirty-two general registers that provide the central resource for all computation. These are available to programs at all privilege levels.

Space registers pcb_sr0 --> pcb_sr7 [SR0 - SR7]   Eight space ID registers for virtual addressing.

Control registers pcb_cr0 --> pcb_cr31 [CR0, CR8 - CR31]   Twenty-five control registers that contain system state information.

Advanced Performance Tuning for HP-UX file:///C:/satish/hp-ux/Performance/Advanced%20Performance%20Tuni...

36 of 135 10/28/2010 11:19 PM

Program counters (pcb_pc)   Two registers that hold the virtual addresses of the current and next instruction to be executed.

· The Instruction Address Offset Queue (IAOQ) is 32 bits long. The upper 30 bits contain the word offset of the instruction and the lower 2 bits maintain the privilege level of the corresponding instruction.

· The Instruction Address Space Queue (IASQ) is 32 bits long in a PA-RISC 2.0 (64-bit) system or 16 bits in a PA-RISC 1.x (32-bit) system. It contains the space ID for instructions.

Processor Status Word (pcb_psw)   Contains the machine-level status that relates to a process as it does operations and computations.

Floating point registers pcb_fr1 --> pcb_fr32   Maintain the floating point status for the process.

Footnotes

1. UID, GID, and other credentials are pointed to as a snapshot of the process-wide cred structures when the thread enters the kernel. These are only valid while a thread operates in kernel mode. Permanent changes to the cred structure (e.g., setuid()) should be made to the cred structure pointed to by the proc structure element p_cred.

2. Skip lists were developed by William Pugh of the University of Maryland. An article he wrote for CACM can be found at ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.ps.Z.

PRM - Process Resource Manager

Process Resource Manager (PRM) is a resource management tool used to control the amount of resources that processes use during peak system load (at 100% CPU, 100% memory, or 100% disk bandwidth utilization). PRM can guarantee a minimum allocation of system resources available to a group of processes through the use of PRM groups.

A PRM group is a collection of users and applications that are joined together and assigned certain amounts of CPU, memory, and disk bandwidth. The two types of PRM groups are FSS PRM groups and PSET PRM groups. An FSS PRM group is the traditional PRM group, whose CPU entitlement is specified in shares. This group uses the Fair Share Scheduler (FSS) in the HP-UX kernel within the system's default processor set (PSET). A PSET PRM group is a PRM group whose CPU entitlement is specified by assigning it a subset of the system's processors (PSET). Processes in a PSET have equal access to CPU cycles on their assigned CPUs through the HP-UX standard scheduler.

PRM has four managers:

CPU

Ensures that each PRM group is granted at least its allocation of CPU. Optionally, for FSS PRM groups, this resource manager ensures no more than its capped amount of CPU. For PSET PRM groups, processes are capped on CPU usage by the number of processors assigned to the group.

Memory

Ensures that each PRM group is granted at least its share, but (optionally) no more than its capped amount, of memory. Additionally, under prm2d memory management, you can specify that memory shares be isolated so that a group's assigned memory shares cannot be loaned out to, or borrowed from, other groups.

Disk

Ensures that each FSS PRM group is granted at least its share of disk bandwidth. PRM disk bandwidth management can only control disks that are managed by HP's Logical Volume Manager (LVM) or by VERITAS Volume Manager (VxVM). PSET PRM groups are treated as part of PRM_SYS (PRMID 0) for disk bandwidth purposes.

Application

Ensures that specified applications and their child processes run in the appropriate PRM groups.

The managers control resources, user processes, and applications based on records in the configuration. Each manager has its own record type. The most important records are PRM group/CPU records, because all other records must reference these defined PRM groups. The various records are described below.

Group / CPU


Specifies a PRM group's name and its CPU allocation. The two types of PRM group records are FSS PRM group records and PSET PRM group records. An FSS PRM group is the traditional PRM group, whose CPU entitlement is specified in shares. This group uses the Fair Share Scheduler (FSS) in the HP-UX kernel within the system's default processor set (PSET). A PSET PRM group is a PRM group whose CPU entitlement is specified by assigning it a subset of the system's processors (PSET). Processes in a PSET have equal access to CPU cycles on their assigned CPUs through the HP-UX standard scheduler.

Memory

Specifies a PRM group's memory shares and its optional cap. In addition, the prm2d memory manager allows you to specify memory isolation field values. This allows you to isolate a group's memory shares so that memory is not loaned out to or borrowed from other groups.

Disk Bandwidth

Specifies an FSS PRM group's disk bandwidth shares for a given logical volume group (LVM) or disk group (VxVM). You cannot specify disk bandwidth records for PSET PRM groups. PSET PRM groups are treated as part of PRM_SYS (PRMID 0) for disk bandwidth purposes.

Application

Specifies an application (either explicitly or by regular expression) and the PRM group in which the application should run. Optionally, it specifies alternate names the application can take at execution. (Alternate names are most common for complex programs such as database programs that launch many processes and rename them.)

User

Specifies a user or a collection of users (through a netgroup) and assigns the user or netgroup to an initial PRM group. Optionally, it specifies alternate PRM groups. A user or netgroup member then has permissions to use these PRM groups with the prmmove and prmrun commands.
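To make the record types concrete, a minimal configuration file might look like the sketch below. All group, user, and path names are hypothetical; the application record layout matches the examples later in this module, while the group/CPU and user record layouts are assumptions here -- verify the exact field order against the prmconf(4) manpage:

    # Group/CPU records (assumed layout: name:PRMID:CPU shares::)
    OTHERS:1:20::
    Finance:2:30::
    Dev:3:50::

    # Application record: run the payroll binary (hypothetical path) in Finance
    /opt/payroll/bin/payroll::::Finance

    # User record (assumed layout): user amy starts in Dev, may also use Finance
    amy::::Dev,Finance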

PRM Resource Management

PRM places limits on resource use based on values specified in a configuration file. These values always indicate a minimum amount and in some cases can indicate a maximum amount of a resource.

Note: Do not use PRM with gang scheduling, which is the concurrent scheduling of multiple threads from a single process as a group (gang).

PRM groups

PRM groups are integral to how PRM works. These groups are assigned per process and are independent of any other groups, such as user groups that are defined in /etc/group. You assign applications and users to PRM groups. PRM then manages each group's CPU, disk bandwidth, and real memory resources according to the current configuration. If multiple users or applications within a PRM group are competing for resources, standard HP-UX resource management determines the resource allocation.

There are two types of PRM groups:

FSS PRM groups are the traditional and most commonly used type of PRM group. These groups have CPU, memory, and disk bandwidth resources allocated to them using the shares model. FSS PRM groups use the Fair Share Scheduler in the HP-UX kernel within the system's default processor set (PSET).

PSET PRM groups are the second type of PRM group. In PSET PRM groups, the CPU entitlement is specified by assigning them a subset of the system's processors instead of using the shares model. The memory allocation is still specified in shares; however, the PSET PRM groups are treated as part of PRM_SYS (PRMID 0) for disk bandwidth purposes. Processes in a PSET PRM group have equal access to CPU cycles through the HP-UX time-share scheduler.

Because resource management is performed on a group level, individual users or applications may not get the resources required in a group consisting of many users or applications. In such cases, reduce the size of the group or create a group specifically for the resource-intensive user or application.

Resource allocation

Resources are allocated to PRM groups differently depending on the resource and the type of PRM group. You allocate CPU resources to


PSET PRM groups using processor sets. All resources for FSS PRM groups and real memory resources for PSET PRM groups are allocated in shares. You cannot allocate disk bandwidth resources to PSET PRM groups.

What are processor sets?

Processor sets allow CPUs on your system to be grouped together in a set by the system administrator and assigned to a PSET PRM group. Once these processors are assigned to a PSET PRM group, they are reserved for use by the applications and users assigned to that group. Using processor sets allows the system administrator to isolate applications and users that are CPU-intensive, or that need dedicated on-demand CPU resources.

How processor sets work

Processor sets are a way of allocating dedicated CPU resources to designated applications and users. At system initialization time, a default PSET is created. This default PSET initially consists of all of your system's processors. All FSS PRM group CPU allocation can only occur in the default PSET. The system administrator can create additional PSET PRM groups and assign processors, applications, and users to those groups. Once processors are assigned to a PSET PRM group, they cannot be used by another group until a new configuration is loaded.

Applications and users that are assigned to a PSET PRM group have dedicated CPU cycles from the CPUs assigned to the group. Competition for CPU cycles within the processor set is handled using the HP-UX time-share scheduler.

Processor sets example shows a 16-processor system that has four FSS PRM groups defined within the default PSET, and two additional system-administrator-defined PSET PRM groups. The default PSET contains eight processors, one of which is Processor 0. This is the only processor that is required to be in the default PSET. The remaining processors in the default PSET are used by the Dev, Appl, and OTHERS FSS PRM groups. There are two databases on this system that each have four processors assigned to them. Unlike the processors in the default PSET, the processors in the database PSET PRM groups are dedicated CPUs using the HP-UX time-share scheduler. This creates an isolated area for the databases.

Processor sets example

PRM Group Type                 Group Name                    CPU IDs                    Use
Default PSET, FSS PRM groups   PRM_SYS, OTHERS, Dev, Appl    0, 1, 4, 5, 8, 9, 12, 13   System processes, general users and developers
PSET PRM group                 SalesDB                       2, 3, 6, 7                 Sales database
PSET PRM group                 FinanceDB                     10, 11, 14, 15             Financial database

What are shares?

Resource shares are the minimum amounts of a resource assigned to each PRM group in a PRM configuration file (default name /etc/prmconf). For FSS PRM groups, you can assign CPU, disk bandwidth, and real memory shares, although only CPU share assignments are required. For PSET PRM groups, you can only assign real memory in shares.

In addition to minimum amounts, you can specify maximum amounts of some resources that PRM groups can use. For FSS PRM groups, you can specify maximum amounts of CPU and memory. For PSET PRM groups, you can only assign maximum amounts of memory. These maximum amounts, known as caps, are not available for disk bandwidth for either type of PRM group.

How shares work

A share is a guaranteed minimum when the system is at peak load. When the system is not at peak load, PRM shares are not enforced--unless CPU capping is enabled, in which case CPU shares are always enforced.

Valid values for shares are integers from one to MAXINT (the maximum integer value allowed for the system). PRM calculates the sum of the shares, then allocates a percentage of the system resource to each PRM group based on its shares relative to the sum.

Converting shares to percentages shows how shares determine CPU percentage. The total number of shares assigned is four. Divide each group's number of shares by four to find that group's CPU percentage. This CPU percentage applies only to those CPUs available to FSS PRM groups. If PSET PRM groups are configured, the processors assigned to them are no longer available to the FSS PRM groups. In this case, the CPU percentage would be based on a reduced number of CPUs.


Converting shares to percentages

PRM group CPU shares CPU %

GroupA 1 1/4 = 25.00%

GroupB 2 2/4 = 50.00%

OTHERS 1 1/4 = 25.00%

Shares allow you to add a PRM group to a configuration, remove one from it, or alter the distribution of resources in an existing configuration, concentrating only on the relative proportion of resources and not the total sum. For example, assume we add another group to our configuration in Converting shares to percentages, giving us the new configuration in Altered configuration. To give the new group 50% of the available CPU, we assign it four shares, the total number of shares in the old configuration, thereby doubling the total number of shares in the new configuration.

Altered configuration

PRM group CPU shares CPU percentage determined by PRM

GroupA 1 12.50%

GroupB 2 25.00%

GroupC 4 50.00%

OTHERS 1 12.50%

Hierarchical PRM groups

In addition to the flat divisions of resources presented so far, you can nest FSS PRM groups inside one another--forming a hierarchy of groups similar to a directory structure. Hierarchies allow you to divide groups and allocate resources more intuitively than you can with flat allocations. Note that PSET PRM groups cannot be part of a hierarchy.

When forming a hierarchy, any group that contains other groups is known as a parent group. Naturally, the groups it contains are known as child groups. All the child groups of the same parent group are called sibling groups. Any group that does not have child groups is called a leaf group.

There is also an implied parent group of all groups where the implied parent has 100% of the resource to distribute.

Parent, child, sibling, and leaf PRM groups illustrates a configuration with hierarchical groups, indicating the parent, child, sibling, and leaf PRM groups.


Parent, child, sibling, and leaf PRM groups

In Parent, child, sibling, and leaf PRM groups, parent groups are the Development and Development/Compilers groups. There is also an implied parent group to the Finance, Development, and OTHERS groups. The Development group has the children Development/Compilers, Development/Debuggers, and Development/Profilers. The Compilers group is broken down further with two children of its own: Development/Compilers/C and Development/Compilers/Fortran. These two groups are also known as sibling groups. Leaf groups are groups that have no children. In the illustration above, leaf groups include the Finance, Development/Debuggers, and OTHERS groups, among others.

You specify resource shares for each group in a hierarchy. If a group has child groups, the parent group's resource shares are distributed to the children based on the shares they are assigned. If a group has no children, it uses the shares itself. More explicitly, the percentage that a group's shares equate to is determined as follows:

1. Start at the top level in the hierarchy. Consider these groups as sibling groups with an implied parent. This implied parent has 100% of the CPU to distribute. (Shares work the same way for CPU, memory, and disk bandwidth.)
2. Add all the CPU shares of the first level of sibling groups together into a variable, TOTAL.
3. Each sibling group receives a percentage of CPU equal to its number of shares divided by TOTAL.
4. If the sibling group has no child groups, it uses the CPU itself.
5. If the sibling group does have child groups, the CPU is distributed further based on the shares assigned to the child groups. Calculate the percentages of the resource they receive by repeating steps 2 through 5.

Consider the example in Hierarchical PRM groups--top level, which shows the PRM groups at the top level.


Hierarchical PRM groups--top level

Group CPU shares Percent of system's available CPU

Finance 3 30.00%

Development 5 50.00%

OTHERS 2 20.00%

Hierarchical PRM groups--Development's child groups shows how the CPU percentages for the child groups of the Development group are determined from their shares. It also shows how the child groups for the Development/Compilers group further divide the CPU.

Hierarchical PRM groups--Development's child groups

Group                           CPU shares   Percent of system's available CPU
Development                     5            5/10 = 50.00%, passed to child groups
Development/Debuggers           1            1/4 of its parent's CPU (50.00%) = 12.50% of system CPU
Development/Profilers           1            1/4 of its parent's CPU (50.00%) = 12.50% of system CPU
Development/Compilers           2            2/4 of its parent's CPU (50.00%) = 25.00%, passed to child groups
Development/Compilers/C         4            4/8 of its parent's CPU (25.00%) = 12.50% of system CPU
Development/Compilers/Fortran   4            4/8 of its parent's CPU (25.00%) = 12.50% of system CPU

There is no requirement that the sum of the shares for a set of sibling groups be less than their parent's shares. For example, Hierarchical PRM groups--Development's child groups shows the Development/Compilers group has 2 shares, while the sum of the shares for its child groups is 8. You can assign any group any number of shares between one and MAXINT (the system's maximum integer value), setting the proportions between groups as you consider appropriate.

The maximum number of leaf nodes is 64, which is the maximum number of PRM groups you can have.

NOTE: Application records must assign applications only to leaf groups--not parent groups. Similarly, user records must assign users only to leaf groups. In group/CPU records, each PRM group--regardless of where it is in the hierarchy--must be assigned resource shares.

Hierarchies offer a number of advantages, as explained below:

· Facilitates less intrusive changes--Similar to how shares in a flat configuration allow you to alter one record while leaving all the others alone, hierarchies enable you to alter the hierarchy in one area, leaving the rest unchanged.

· Enables you to use a configuration template--Create a configuration file that provides each department access to the system, then distribute the configuration and assign resources giving preference to certain departments on different machines.

· Allows continued use of percentages--If you prefer using percentages instead of shares, you can assign each level in the hierarchy only 100 resource shares.

· Facilitates giving equal access--If you want each PRM group to have equal access to a resource, simply assign each group the same number of shares. When you add a group, you do not have to recalculate resources and divide by the new number of groups; just assign the new group the same number of shares as the other groups. Similarly, removing a group does not require a recalculation of resources; just remove the group.

· Allows for more intuitive groups--Hierarchies enable you to place similar items together, such as all databases or a business entity/goal, and assign them resources as a single item.


· Enables making higher-level policy decisions--By placing groups in a hierarchy, you can implement changes in policy or funding at a higher level in a configuration without affecting all elements of the configuration.

· Facilitates system upgrades, capacity planning, and partitioning--If you are moving from a two-CPU system to a four-CPU system, you can reserve the two additional CPUs by adding a place-holder group at the top level in the hierarchy, assigning it shares equal to 50% of the CPU, and enabling capping. This place-holder prevents users from getting a boost in performance from the new CPUs, then being frustrated by poor performance when more applications are added to the system.

The syntax for hierarchical groups is explained in Group/CPU record syntax.

By default, PRM utilities (prmconfig, prmlist, prmmonitor) include only leaf groups in their output. Use the -h option to display information for parent groups as well.
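For example, with the hierarchical configuration above loaded, the commands below contrast the two views (group names taken from the figure; output omitted):

    prmlist        # leaf groups only: Finance, Development/Debuggers, ...
    prmlist -h     # also shows parent groups such as Development
    prmmonitor -h  # monitor parent and leaf groups together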

Precision of shares

PRM's calculation of groups' resources is most accurate when the maximum number of shares assigned divided by the minimum number of shares assigned is less than or equal to 100, as shown in When resource percentages are most precise.

When resource percentages are most precise

For example, Example with large difference in assigned max/min shares shows a situation in which the expected percentage is not achieved due to a large difference in the maximum and minimum shares.

Example with large difference in assigned max/min shares

PRM group Shares Expected percentage Actual percentage

GroupA 1 1/425 = 0.24% 0.48%

GroupB 200 200/425 = 47.06% 46.89%

GroupC 199 199/425 = 46.82% 46.89%

OTHERS 25 25/425 = 5.88% 5.74%

How PRM manages CPU

To understand PRM's CPU management, it is useful to know how the standard HP-UX scheduler works.

The HP-UX scheduler chooses which process to run based on priority. Except for real-time processes, the system dynamically adjusts the priority of a process based on resource requirements and resources used. In general, when processes are not running, the HP-UX scheduler raises their priorities; and while they are running, their priorities are lowered. The rate at which priority declines during execution is linear. The rate at which priority increases while waiting is exponential, with the rate of increase fastest when the CPU load is low and slowest when the CPU load is high. When a process other than the current process attains a higher priority, the scheduler suspends the current process and starts running the higher priority process.

Because the rate at which the priority increases is slowest when CPU load is high, the result is that a process with a heavy demand for



CPU time is penalized by the standard HP-UX scheduler as its CPU use increases.

With PRM you can reverse the effects of the standard scheduler. By placing users with greater demands for CPU in an FSS PRM group with a higher relative number of CPU shares than other groups, you give them a higher priority for CPU time. In a similar manner, you can assign an application to an FSS PRM group with a higher relative number of shares. The application will run in its assigned FSS PRM group, regardless of which user invokes it. This way you can ensure that critical applications have enough CPU resources. You can also isolate applications and users with greater demands for CPU by placing them in a PSET PRM group and assigning the desired number of processors to the group. The applications and users will have dedicated access to the processors in the PSET PRM group, ensuring CPU cycles when needed. This method of isolating applications and users effectively creates a partition on your system.

PRM manages CPU by using the Fair Share Scheduler (FSS) for FSS PRM groups. When the PRM CPU manager is enabled, FSS runs for FSS PRM groups instead of the HP-UX standard scheduler.

When PSET PRM groups are configured, FSS still runs for FSS PRM groups, but the standard HP-UX scheduler is used within PSET PRM groups.

PRM gives higher-priority FSS PRM groups more opportunities to use CPU time. Free CPU time is available for use by any FSS PRM group and is divided up between FSS PRM groups based on relative number of CPU shares. As a result, tasks are given CPU time when needed, in proportion to their stated importance, relative to others with a demand.

PRM itself has low system overhead.

Example: PRM CPU management

PRM CPU management illustrates PRM's CPU management for two FSS PRM groups.

In this example, Group1 has 33 CPU shares, and Group2 has 66 CPU shares.

Note that the percentage of CPU referred to may not be total system CPU if PSET PRM groups are configured. The percentage is of CPU available on the processors assigned to the default PSET. If PSET PRM groups are not configured, then the available CPU is the same as the system CPU.

At Time A:

· Group1 is using 40% of the available CPU, which is more than its share.

· Group2 is using 15% of the available CPU, which is less than its share.

· 45% of the available CPU is not used.

· PRM scheduling is not in effect.

At Time B:


· Group1's processes are now using 80% of available CPU time, which consists of all of Group1's shares and an unused portion of Group2's share.

· Group2 processes continue at a steady 15%.

· PRM scheduling is not in effect.

Between Time B and Time C:

· Group2's demands start to increase.

· With available CPU use approaching 100%, PRM starts to have an effect on CPU allocation.

· Both groups' CPU use begins moving toward their assigned number of shares. In this case, the increasing demand of Group2 causes Group1 to be pulled toward the 33% mark despite its desire for more CPU.

At Time C:

· CPU use for Group1 and Group2 is limited to the assigned shares.

After Time C:

PRM holds each group to its assigned available CPU percentage until total available CPU demand is less than 100%. This gives Group2 a priority for CPU over Group1. In contrast, in the standard HP-UX scheduler, processor time is allocated based upon the assumption that all processes are of equal importance. Assuming there is one process associated with each PRM group, the standard HP-UX scheduler would allocate each process 50% of the available CPU after Time C.

CPU allocation and number of shares assigned

PRM favors processes in FSS PRM groups with a larger number of CPU shares over processes in FSS PRM groups with fewer CPU shares. Processes in FSS PRM groups with a larger number of CPU shares are scheduled to run more often and are given more opportunities to consume CPU time than processes in other FSS PRM groups. This preference implies that a process in an FSS PRM group with a larger number of shares may have better response times with PRM than with the standard HP-UX scheduler. PRM does not prevent processes from using more than their CPU share when the system is at nonpeak load, unless a CPU maximum has been assigned.

Capping CPU use

PRM gives you the option of capping CPU use. When enabled, CPU capping is in effect for all user-configured FSS PRM groups on a system--regardless of CPU load.

CPU use can be capped for either all FSS PRM groups or no FSS PRM groups.

When CPU usage is capped, each FSS PRM group takes its entire CPU allocation. Thus, no group can obtain more CPU. The FSS PRM group's minimum allocation becomes its maximum allocation.

The PRM_SYS group is exempt from capping, however. If it gets CPU time and has no work, the PRM scheduler immediately goes to the next FSS PRM group.

For PSET PRM groups, capping is a result of the number of CPUs assigned to the group.

Capping CPU usage can be a good idea when migrating users and applications to a new system. When the system is first introduced, the few users on the system may become accustomed to having all of the machine's resources. However, by setting CPU caps early after the system's introduction, you can simulate the performance of the system under heavier use. Consequently, when the system becomes more heavily used, performance does not noticeably degrade. For information on capping CPU use, see Specifying PRM groups/controlling CPU use.
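As a sketch of how capping is toggled: it is enabled system-wide through a prmconfig mode switch rather than per group. The CPUCAPON/CPUCAPOFF keywords shown here are assumptions from memory of prmconfig(1) and should be verified against the manpage:

    prmconfig -M CPUCAPON     # assumed keyword: enforce shares as caps for all FSS PRM groups
    prmconfig -M CPUCAPOFF    # assumed keyword: return to uncapped, shares-as-minimums behavior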

How PRM manages CPU for real-time processes

Although PRM is designed to treat processes fairly based upon their assigned shares, PRM does not restrict real-time processes. Real-time processes using either the POSIX.4 real-time scheduler (rtsched) or the HP-UX real-time scheduler (rtprio) keep their assigned priorities because timely scheduling is crucial to their operation. Hence, they are permitted to exceed their group's CPU share and cap. The CPU they use is charged to their groups. Thus, they can prevent other processes in their groups from running.

Multiprocessors and PRM

PRM takes into account architectural differences between multiprocessor (MP) and single-processor systems.

In the case of memory management, Hewlett-Packard multiprocessor systems share the same physical address space. Therefore PRM



memory management is the same as on a single-processor system.

However, in the case of CPU management, PRM makes accommodations for MP systems. The normal HP-UX scheduling scheme for MP systems keeps the CPU load average at a uniform level across the processors. PRM tries to even the mix of FSS PRM groups on each available processor (those not assigned to PSET PRM groups). This is done by assigning each process in an FSS PRM group to a different processor, stepping round-robin through the available processors. Only processes that can be run or processes that are likely to run soon are actually assigned in this manner.

For example, on a two-way MP system, FSS PRM Group1 has two active processes A and B, and FSS PRM Group2 has two active processes C and D. In this example, PSET PRM groups are not configured. PRM assigns process A to the first processor, process B to the second processor, process C to the first processor, and finally process D to the second processor--as shown in PRM's process scheduling on MP systems.

If a process is locked down on a particular processor, PRM does not reassign it, but does take it into account when distributing other processes across the processors. PRM manages the CPU only for the processors on a single system; it cannot distribute processes across processors on different systems.

As implied above, PRM provides a PRM group its entitlement on a symmetric multiprocessing (SMP) system by granting the group its entitlement on each CPU. If the group does not have at least one process for each CPU, PRM increases the entitlements for the processes to compensate. For example, a PRM group with a 10% entitlement on a 4-CPU system gets 10% of each CPU. If the group is running on only one CPU because it has only one process, the 10% entitlements from the three unused CPUs are given to the group on the CPU where it has the process running. Thus, it gets 40% on that one CPU.

NOTE:

A PRM group may not be able to get its entitlement because it has too few processes. For example, if the PRM group above--with only one single-threaded process--were to have a 50% entitlement for the 4-CPU system, it would never get its entitlement. PRM would give the group its 50% of the CPU where the process is running and its 50% from one other CPU. However, the group cannot get the 50% entitlements from the two remaining CPUs. As a result, the PRM group only gets a 25% entitlement (one CPU out of four).

How PRM manages real memory

Memory management refers to the rules that govern real and virtual memory and allow for sharing system resources by user and system processes.

In order to understand how PRM manages real memory, it is useful to understand how PRM interacts with standard HP-UX memory management.

How HP-UX manages memory

The data and instructions of any process (a program in execution) must be available to the CPU by residing in real memory at the time of execution. Real memory is shared by all processes and the kernel.

To execute a process, the kernel executes through a per-process virtual address space that has been mapped into real memory. Memory management allows the total size of user processes to exceed real memory by using an approach termed demand-paged virtual memory. Virtual memory enables you to execute a process by bringing into real memory parts of the process only as needed and pushing out parts of a process that have not been recently used.


The system uses a combination of paging and swapping to manage virtual memory. Paging involves periodically writing unreferenced pages from real memory to disk.

Swapping takes place if the system is unable to maintain a large enough free pool of memory. In such a case, entire processes are swapped. The pages associated with these processes can be written out by the pager to secondary storage over a period of time.

The more real memory a system has available, the more data it can access and the more (or larger) processes it can execute without having to page or cause swapping.

Available memory

A portion of real memory is always reserved for the kernel (/stand/vmunix) and its data structures, which are dynamically allocated. The amount of real memory not reserved for the kernel and its data structures is termed available memory. Available memory is consumed by user processes and also nonkernel system processes such as network daemons. Because the size of the kernel varies depending on the number of interface cards, users, and values of the tunable parameters, available memory varies from system to system.

For example, Example of available memory on a 1024-Mbyte system shows a system with 1024 Mbytes of physical memory. Approximately 112 Mbytes of that memory is used by the kernel and its data structures, leaving 912 Mbytes of memory available for all processes, including system processes. In this example, 62 Mbytes are used by system processes, leaving 850 Mbytes of memory available for user processes. PRM reserves 11% of the remaining memory to ensure processes in PRM_SYS have immediate access to needed memory. Although you cannot initially allocate this reserve to your PRM groups, it is still available for your PRM groups to borrow from when needed. So, in this example, the prmavail command would show 850 Mbytes of available memory before PRM is configured, and 756 Mbytes of available memory after PRM is configured.

Example of available memory on a 1024-Mbyte system

Mbyte Memory type

1024 Physical memory available on the system

912 Memory available for all processes

850 Memory available for user processes

756 Memory available after PRM is configured
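To see these figures on a live system, run prmavail before and after loading a configuration. The output shown below is hypothetical; only the commands themselves come from this section and the PRM manpages:

    prmavail            # reports ~850 MB available before PRM is configured
    prmconfig -i        # load and enable /etc/prmconf (activates the PRM_SYS reserve)
    prmavail            # now reports ~756 MB available to PRM groups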

How PRM controls memory usage

PRM memory management allows you to prioritize how available memory is allocated to user and application processes. This control enables you to ensure that critical users and applications have enough real memory to make full use of their CPU time.

Processes in the PRM_SYS group (PRMID 0) and the kernel get as much memory as they need. They are not subject to PRM constraints.

PRM provides two memory managers:

· /opt/prm/bin/prm0d

· /opt/prm/bin/prm2d

The prm0d manager is the original memory manager. It is the default on HP-UX versions prior to 11i. The prm2d manager is the default as of HP-UX 11i V1.0 (B.11.11). prm2d is the recommended memory manager.

When the prm0d memory manager is enabled, available memory continues to be distributed to active processes using the standard HP-UX method. However, when system memory use is at a peak and a PRM group is exceeding its share of memory, the prm0d memory manager suppresses processes in that group. These suppressed processes give memory back to the pool, and therefore more memory is available for use by other PRM groups, which may not be getting their fair share. PRM suppresses a process by stopping it. Once the PRM group's memory use is below its share or memory pressure ceases, the process is reactivated.

The prm0d memory manager selects processes for suppression based on the method specified in the memory records in the PRM configuration file. The selection methods are:


· ALL--Suppress all processes in the group.

· LARGEST--Suppress the largest processes in the group, then continue suppressing smaller and smaller processes until the goal is met.

Typically, you might assign the ALL parameter to a PRM group with low priority so that PRM will be more aggressive in suppressing processes within the group. Groups with higher priority would typically be assigned the LARGEST parameter, causing PRM to be more selective in suppressing processes.

prm0d stops processes by attaching to the processes as a debugger would. You can restrict the processes that prm0d can stop, as explained in the section Exempting processes from memory control.

NOTE: PRM does not suppress processes using locked memory. For more information, see How PRM manages locked memory.

The prm2d memory manager uses the in-kernel memory feature to partition memory (when a configuration is loaded) with each PRM group getting a partition. A partition includes x Mbytes of memory, where x Mbytes is equivalent to the group's entitled percent of the available memory. Each partition pages separately.

When system memory use is not at 100%, a PRM group that does not have its memory use capped or isolated can freely borrow excess memory pages from other PRM groups. If a process requires memory and its memory use is capped, processes in the same PRM group as the original process are forced to page to free up memory.

When system memory use is at a peak, any borrowed memory pages are returned to the owning PRM groups. The time involved for the borrowed memory pages to be returned is dependent on the swap rate and the order in which old pages are paged out.

If a group is exceeding its memory shares on a system that is under stress, prm2d uses proportional overachievement logic to determine which groups need their import shares reduced. Overachievement for a group is the ratio of memory used to memory entitlement. This value is then compared to the average overachievement of all groups. If a PRM group is overachieving compared to the average, then the import shares for that group are lowered. This allows other groups to start importing the newly available memory.

Groups are not allowed to exceed their memory caps with the prm2d memory manager.

Reducing shares under prm2d

If a PRM group's memory share is reduced while the group is using most of its memory pages, the reduction is not immediately visible. The memory must be paged out to the swap device. The time involved for the reduction to take effect is determined by the memory transfer rate (for example, 2 Mbytes/second) and the order in which the old pages are paged out.

When changing shares, give the new values time to take effect before making further changes.

Exempting processes from memory control

You can prevent the prm0d memory manager from suppressing (stopping) certain processes. Specify the processes that the PRM memory manager should not suppress by adding their path names (one per line) to the file /opt/prm/exempt.

The prm0d memory manager consults the files /opt/prm/shells and /etc/shells to properly identify shell scripts. These interactive shells are not stopped and do not need to be added to the /opt/prm/exempt file.
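For example, a hypothetical /opt/prm/exempt protecting two latency-sensitive daemons (invented paths, one per line as required) might contain:

    /opt/monitor/bin/heartbeatd
    /opt/db/bin/db_listener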

The following processes are exempt:

· Login shells

· PRM commands

· Applications listed in /opt/prm/exempt

· Processes with locked memory

· The kernel

· Processes in the PRM_SYS group (PRMID 0)

Capping memory use

You can optionally specify a memory cap for a PRM group. With the prm0d memory manager, a memory cap is a soft upper bound. With prm2d, a PRM group cannot exceed its memory cap. Typically, you might choose to assign a memory cap to a PRM group of relatively low priority, so that it does not place excessive memory demands on the system. For information on setting a memory cap, see Controlling memory use.

Implementation of shares and caps



In addition to specifying memory shares (a lower bound), you can optionally specify a memory cap (upper bound) for a PRM group.

It is important to note the difference between memory shares and a memory cap. Shares guarantee the minimum amount of real memory that a group is allowed to consume at times of peak system load. The memory cap is an upper bound. The prm0d memory manager has different criteria for suppressing processes when group memory use exceeds these boundaries.

With the prm0d memory manager, memory caps are not really upper bounds: processes are allowed to exceed the caps. By placing a memory cap on a group, you instruct PRM to suppress the processes in that group before suppressing the processes in groups that do not have a cap. If memory is still being requested by a group below its share, prm0d continues to suppress processes until no PRM group is exceeding its memory share.

The prm2d memory manager handles caps more strictly than prm0d. prm2d does not allow the memory use of processes in a PRM group to exceed the memory cap of that PRM group.

Isolating a group's memory resources

In addition to specifying memory shares, the prm2d memory manager allows you to optionally specify that a group's memory resources be restricted from use by other groups and processes on the system. This type of restriction is called memory isolation.

When a group's memory shares are isolated, those memory shares cannot be loaned out to other groups. Memory isolation also means that memory cannot be borrowed from other groups.

PRM allows groups that do not have memory isolation turned on to freely borrow memory from other groups as needed. The lending groups are restricted in their giving by their physical entitlement size. A group cannot lend its memory resources if memory isolation is turned on.

Memory isolation can be useful for applications that need dedicated memory resources, or that tune their own memory needs based on their allocated resources.

How PRM manages locked memory

Real memory that can be locked (that is, its pages kept in memory for the lifetime of a process) by the kernel, by the plock() system call, or by the mlock() system call, is known as lockable memory.

Locked memory cannot be paged or swapped out. Typically, locked real memory holds frequently accessed programs or data structures, such as critical sections of application code. Keeping them memory-resident improves system performance. Lockable memory is extensively used in real-time environments, like hospitals, where some processes require immediate response and must be constantly available.

prm0d does not suppress a process that uses locked memory once the process has the memory, because suppressing the process will not cause it to give back memory pages. However, the memory resources that such a process consumes are still charged against its PRM group. If processes using locked memory consume as much or more memory than their group is entitled to, other processes in that group may be suppressed until the demands of the processes with locked memory are lower than the group's share.

With the prm2d memory manager, locked memory is distributed based on the assigned memory shares. For example, assume a system has 200 Mbytes of available memory, 170 Mbytes of which is lockable. Lockable memory divided by available memory is 85%. If GroupA has a 50% memory share, it gets 100 Mbytes of real memory. Of that amount, 85% (or 85 Mbytes) is lockable. Notice that 85 Mbytes/170 Mbytes is 50%, which is the group's memory share. Locked memory distribution by prm2d memory manager illustrates this idea.


Locked memory distribution by prm2d memory manager

How PRM manages shared memory

With the prm0d memory manager, PRM charges a PRM group for its use of shared memory based on the number of the group's processes that are attached to the shared memory segment, relative to the total number of attached processes. For example, assume a system has a 100-Mbyte shared memory segment and two PRM groups, Group1 and Group2. If Group1 has three processes attached to the segment and Group2 has one attached, Group1 is charged with 75 Mbytes, while Group2 is charged with 25 Mbytes.

With the prm2d memory manager, if a group is exceeding its memory shares as system memory utilization approaches 100%, prm2d determines which groups are importing the most memory above their entitlement, as compared with the average overachievement of all groups. If a PRM group is overachieving compared to the average, then the import shares for that group are lowered. This allows other groups to start importing the newly available memory.

Example: prm2d memory management

This example shows how prm2d manages the competing memory demands of three PRM groups as system memory utilization approaches 100%.

prm2d memory management

At Time A:

· There is plenty of memory available on the system for the processes that are running.

· Group1 is using its share, and Group2 is using slightly more than its share, borrowing excess from Group3.

· Group3 is using much less than its share.

At Time B:

· System memory use approaches 100%.

· Group1 is borrowing excess memory from Group3.

· Group2 processes reach the group's 30% memory cap. Unlike prm0d, prm2d does not allow a group to exceed its memory cap. Consequently, Group2's processes are forced to page, causing a performance hit.

Between Time B and Time C, Group3's demands continue to increase.

At Time C:

· System memory use is near 100%.

· Group3 is not getting sufficient memory and needs its loaned-out memory back. PRM then determines which groups are


overachieving with respect to their memory entitlement. In this case, the increasing demand of Group3 causes Group1 and Group2 to be pulled toward their shares of 30% and 10% respectively, despite their desire for more memory. Group3 is allowed to freely consume up to 60% of available memory, which it reaches at Time D.

After Time D:

PRM now holds each group to its entitled memory percentage. If a group requests more memory, the request is filled with pages already allocated to the group.

How resource allocations interact

You can assign different numbers of shares for CPU (for FSS PRM groups), memory, and disk bandwidth to a PRM group depending on the group's requirements for each type of resource. To optimize resource use, it is important to understand the typical demands for resources within a PRM group.

For example, suppose the DesignTool application is assigned to PRM group DTgroup, and it is the only application running in that group. Suppose also that the DesignTool application uses CPU and memory in an approximate ratio of two to three. For optimal results, you should assign the resource shares for DTgroup in the same ratio. For example, assign 10 CPU shares and 15 memory shares, or 20 CPU shares and 30 memory shares.

If the percentages assigned do not reflect actual usage, then a PRM group may not be able to fully utilize a resource to which it is entitled. For instance, assume you assign 50 CPU shares and 30 memory shares to DTgroup. At times of peak system load, DTgroup is able to use only approximately 20 CPU shares (although it is assigned 50 shares) because it is limited to 30 memory shares. (Recall that DesignTool uses CPU and memory at a ratio of two to three.) Conversely, if DTgroup is assigned 10 CPU shares and 30 memory shares, then at times of peak system load, DTgroup is only able to utilize 15 memory shares (not its 30 shares), because it is restricted to 10 CPU shares.

To use system resources in the most efficient way, monitor typical resource use in PRM groups and adjust shares accordingly. You can monitor resource use with the prmanalyze command, the prmmonitor command, or the optional HP product GlancePlus.

How PRM manages disk bandwidth

PRM manages disk bandwidth at the logical volume group/disk group level. As such, your disks must be mounted and under the control of either HP's Logical Volume Manager (LVM) or VERITAS Volume Manager (VxVM) to take advantage of PRM disk bandwidth management. PRM controls disk bandwidth by re-ordering the I/O requests of volume groups and disk groups. This has the effect of delaying the I/O requests of low-priority processes and accelerating those of higher-priority processes.

NOTE: Disk bandwidth management works only when there is contention for disk bandwidth, and it works only for actual I/O to the disk. (Commonly, I/O on HP-UX is staged through the buffer cache to minimize or eliminate as much disk I/O as possible.) Also, note that you cannot allocate disk bandwidth shares for PSET PRM groups. PSET PRM groups are treated as part of PRM_SYS (PRMID 0) for disk bandwidth purposes.

Disk bandwidth management works on disk devices, stripes, and disk arrays. It does not work on tape or network devices. When you change share allocations on a busy disk device, it typically takes 30 seconds for the actual bandwidth to conform to the new allocations.

Multiple users accessing raw devices (raw logical volumes) will tend to spend most of their time seeking. The overall throughput on this group will tend to be very low. This degradation is not due to PRM's disk bandwidth management.

When performing file system accesses, you need approximately six disk bandwidth consumers in each PRM group before I/O scheduling becomes noticeable. With two users, you just take turns. With four, you still spend a lot of your time in system call overhead relative to the peak device bandwidth. At six, PRM disk bandwidth management begins to take effect. The more demand you put on the system, the closer the disk bandwidth manager approaches the specified values for the shares.

How PRM manages applications

When an application is started, it runs in the initial PRM group of the user that invoked it. If the application is assigned to a PRM group by a record in the configuration file, the application manager soon moves the application to its assigned group. A user who does not have access to an application's assigned PRM group can still launch the application as long as the user has execute permission to the application. An application can be assigned to only one PRM group at a time. Child processes inherit their parent's PRM group. Therefore, all the application's child processes run in the same PRM group as the parent application by default.

You can explicitly place an application in a PRM group of your choosing with two commands. Use the prmmove command to move an


existing application to another group. Use the prmrun command to start an application in a specified group. These rules may not apply to processes that bypass login.
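For example, using the hypothetical SalesDB group from earlier and an invented PID (see prmrun(1) and prmmove(1) for the full syntax):

    prmrun -g SalesDB /opt/sales/bin/nightly_report    # start the report in SalesDB
    prmmove SalesDB -p 4321                            # move running process 4321 into SalesDB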

How application processes are assigned to PRM groups at start-up

PRM's group assignments at process start-up describes what PRM group an application process is started in, based on how the application is started.

PRM's group assignments at process start-up

Process initiated: by user, by at, by cron, or upon login
    Process runs in the user's initial group. If the user does not have an initial group, the process runs in the user default group, OTHERS. (If the process has an application record, it still starts in the invoking user's initial group. However, the application manager will soon move the process to its assigned group.)

Process initiated: by prmrun {-g targetgrp | -i}
    Process runs in the PRM group specified by targetgrp or in the user's initial group. The PRM application manager cannot move a process started in this manner to another group.

Process initiated: by prmrun application (-g targetgrp is not specified)
    Process runs in the application's assigned PRM group. If the application does not have a group, an error is returned.

Process initiated: by prmmove {targetgrp | -i}
    Process runs in the PRM group specified by targetgrp or in the user's initial group. The PRM application manager cannot move a process started in this manner to another group.

Process initiated: by another process
    Process runs in the parent process's group.

How PRM handles child processes

When they first start, child processes inherit the PRM groups of their parent processes. At configurable polling intervals, the application manager checks the PRM configuration file against all processes currently running. If any processes should be assigned to different PRM groups, the application manager moves those applications to the correct PRM groups.

If you move a parent process to another PRM group (with the prmmove command), all of its child processes remain in the original PRM group. If the parent and child processes should be kept together, move them as a process group or by user login name.

Pattern matching for filenames

Application filenames in application records can contain pattern matching notation as described in the regexp(5) man page. This feature allows you to assign all appropriate applications that reside in a single directory to a PRM group--without creating an application record for each individual application.

The wildcard characters ([, ], *, and ?) can be used to specify application filenames. However, these characters cannot be used in directory names. For example, the following record is valid:

/opt/prm/bin/x[opq]rm::::PRM_SYS

However, the next record uses a wildcard in the directory name and is not valid:

/opt/pr?/bin/xprm::::PRM_SYS # INVALID

To assign all the applications in a directory to a PRM group, create an application record similar to the following, with the filename specified only by an asterisk (*):

/opt/special_apps/bin/*::::GroupS

Filenames are expanded to their complete names when a PRM configuration is loaded. Explicit application records take precedence over application records that use wildcards. If an application is matched by several records that use pattern matching, the application is assigned to the PRM group specified in the "first" matching record. The "first" matching record is determined by sorting the matching patterns in ASCII dictionary order.

NOTE: If you use wildcards in an application record to specify the application filename, you cannot use alternate names for that application record.


Pattern matching for renamed application processes

Alternate names specified in application records can also contain pattern matching notation as described in the regexp(5) man page.

NOTE: Use pattern matching only when it is not practical to list all possible alternate names.

Many complex applications, such as database applications, may assign unique names to new processes or rename themselves while running. For example, some database applications rename processes based on the database instance, as shown in this list of processes associated with a payroll database instance:

db02_payroll
db03_payroll
db04_payroll
dbsmon_payroll
dbwr_payroll
dbreco_payroll

To make sure all payroll processes are put in the same PRM group, use pattern matching in the alternate names field of the application record, as shown below:

/usr/bin/database::::business_apps,db*payroll

For alternate names and pattern matching to work, the processes must share the same file ID. (The file ID is based on the file system device and the file's inode number.) PRM performs this check to make sure that only processes associated with the application named in the application record are put in a configured PRM group. The only case where alternate names might not share the file ID is if you have specified a symbolic link as the fully qualified executable in the application record. For this reason, avoid using (or referencing) symbolic links in application records.

If there are multiple application records that match an application name due to redundant pattern matching resolutions, the "first" record to match the application name takes precedence. For example, the application abb matches both of the following application records:

/opt/foo/bin/bar::::GroupA,a*
/opt/foo/bin/bar::::GroupB,*b

Because the *b record is first (based on ASCII dictionary order), the application abb would be assigned to the PRM group GroupB.

Knowing the names of all the processes spawned and renamed by the applications can help in creating pattern matching that is only as general as it needs to be. Eliminate redundant name resolutions whenever possible, and make sure pattern matching does not cause unwarranted moves.

The PRM application manager checks that applications are running in the correct PRM groups every interval seconds. The default interval is 30 seconds.


Module 4

Memory

Memory use and configuration is one of the most complex and misunderstood areas of performance tuning.

Factors to consider when configuring memory on a system:
 - What are the hardware limitations for physical memory (RAM)?
 - What are the disk limitations for device swap?
 - What architecture are the OS and the application?
 - What are the memory requirements for the OS and applications?
 - What are the cost limitations?

Current HP Servers memory ranges

Model          Physical memory range
Superdome 64   16 GB - 256 GB
Superdome 32   2 GB - 128 GB
Superdome 16   2 GB - 64 GB
rp8400         2 GB - 64 GB
rp7410         2 GB - 32 GB
rp7400         1 GB - 32 GB
rp5470         1 GB - 16 GB
rp5430         1 GB - 8 GB
rp2470         1 GB - 2 GB

Recent HP servers

Model     Physical memory range
V Class   1 GB - 128 GB
N Class   512 MB - 16 GB
L Class   256 MB - 16 GB
K Class   128 MB - 8 GB
D Class   64 MB - 3 GB
R Class   128 MB - 3 GB
A Class   128 MB - 2 GB

Current HP Workstations

Model            Physical memory range
hp b2600         512 MB - 4 GB


hp c3700         512 MB - 4 GB
hp c3750         512 MB - 8 GB
hp j6700/j6750   1 GB - 16 GB

For complete specs on current hp servers see: http://welcome.hp.com/country/us/eng/prodserv/servers.html

For workstations see:

http://www.hp.com/workstations/risc/index.html

Commonly tuned memory paging parameters:

bufpages - Pages of static buffer cache
  Minimum: 0 or 6 (nbuf*2 or 64 pages)
  Maximum: memory limited
  Default: 0

nbuf - Number of static buffer headers
  Minimum: 0 or 16
  Maximum: memory limited
  Default: 0

These parameters are used when the dynamic buffer cache is disabled.

dbc_min_pct - Minimum dynamic buffer cache (percent of RAM)
  Minimum: 2
  Maximum: 90
  Default: 5

dbc_max_pct - Maximum dynamic buffer cache (percent of RAM)
  Minimum: 2
  Maximum: 90
  Default: 50

As these parameters represent a percentage of RAM, care should be taken when selecting a value. Values in excess of 400 Mb are counterproductive from a performance perspective. For systems with greater than 16 Gb of RAM it may be advisable to disable the dynamic buffer cache and hard-code the values. The buffer cache, whether static or dynamic, is used to facilitate faster disk read/write transactions. If the total amount of maximum disk I/O volume is determined, the amount of buffer cache can be set accordingly.
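On 11.x kernels these values are typically examined and changed with kmtune (or through SAM). As an illustration only, capping the dynamic buffer cache near 400 Mb on a hypothetical 4 Gb system might look like this (a kernel rebuild and reboot are required for the change to take effect):

    kmtune -q dbc_max_pct        # query the current value
    kmtune -s dbc_max_pct=10     # 10% of 4 Gb is ~400 Mb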

maxswapchunks - Maximum swap space available to client
  Minimum: 1
  Maximum: 16384
  Default: 256

swchunk - Client swap-chunk size
  Minimum: 2048

To determine the maximum amount of swap configurable on a system, multiply maxswapchunks by swchunk. By default the maximum configurable swap area is 32 Gb. Typically swchunk is not altered from its default. If greater than 32 Gb of device swap must be configured, set swchunk to 4096.
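Worked example, assuming swchunk is counted in 1 Kb blocks: with the default swchunk of 2048 (2 Mb per chunk) and maxswapchunks at its maximum of 16384, the ceiling is 16384 x 2 Mb = 32 Gb, matching the limit noted above. Doubling swchunk to 4096 doubles the chunk to 4 Mb, raising the ceiling to 64 Gb.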

nswapdev - Number of available swap devices
  Minimum: 1
  Maximum: 25
  Default: 10

nswapfs - Number of file systems available for swap
  Minimum: 1
  Maximum: 25


  Default: 10

page_text_to_local - Enable/disable text swap on client
  Minimum: 0 (stand-alone or client uses file-system server)
  Maximum: 1 (use client local swap)
  Default: 1 (use client local swap)

remote_nfs_swap - Enable/disable swap to remote NFS
  Minimum: 0
  Maximum: 1
  Default: 0

swapmem_on - Enable/disable pseudo-swap reservation
  Minimum: 0 (disable pseudo-swap reservation)
  Maximum: 1 (enable pseudo-swap reservation)
  Default: 1

The swapmem_on parameter should be left at the default of 1 unless the total lockable memory exceeds 25% of RAM. Typically the total lockable memory on a system is between 15-20% of RAM.

Configurable IPC Shared Memory Parameters

shmem - Enable/disable shared memory (Series 700 only)
  Minimum: 0 (exclude System V IPC shared memory code from kernel)
  Maximum: 1 (include System V IPC shared memory code in kernel)
  Default: 1

shmmax - Maximum shared memory segment size
  Minimum: 2 Kbytes
  Maximum: memory limited
  Default: 0x04000000 (64 Mbytes)

The shmmax parameter should be set to no greater than one memory quadrant, i.e. 1/4 of total system memory. For 32-bit systems this has a maximum value of 1 Gb; for 64-bit systems it should be set to one quadrant.
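For example, on a hypothetical 32-bit system with 4 Gb of RAM, one quadrant is 1 Gb, so the sizing rule above would translate to:

    kmtune -s shmmax=0x40000000    # 1 Gb = one quadrant on this 4 Gb system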

shmmni - Maximum segments on system
  Minimum: 3
  Maximum: memory limited
  Default: 200 identifiers

shmseg - Maximum segments per process
  Minimum: 1
  Maximum: shmmni
  Default: 120

Process Management Subsystem

maxdsiz - Maximum process data segment size (32-bit)
  Minimum: 0x400000 (4 Mbytes)
  Maximum: 0x7B03A000 (1.92 Gbytes)
  Default: 0x4000000 (64 Mbytes)

The practical limit for maxdsiz is the free space in quadrants 1 and 2, i.e. 2 Gb minus maxtsiz, minus maxssiz, minus the 64 Mb uarea.


maxdsiz_64bit maximum process data segment size (64-bit) Minimum: 0x400000 (4 Mbytes) Maximum: 4396972769279 Default: 0x40000000 (1 Gbyte)

The same sizing rule applies to the 64-bit equivalent. The practical limit for maxdsiz_64bit is the free space in quadrants 2 and 3. If maxdsiz is exceeded, the process will be terminated, usually with a SIGSEGV (segmentation violation), and you will probably see the following message:

Memory fault(coredump)

maxssiz maximum process storage segment size (32-bit) Minimum: 0x4000 (16 Kbytes) Maximum: 0x17F00000 (383 MB) Default: 0x800000 (8 Mbytes)

maxssiz_64bit maximum process storage segment size (64-bit) Minimum: 0x4000 (16 Kbytes) Maximum: 1073741824 Default: 0x800000 (8 Mbytes) Kernel stack is assigned memory before data in its quadrant. Unless a SIGSEGV stack-growth error is received, or a vendor specifically recommends a size, keep this parameter at its default.

maxtsiz maximum process text segment size (32-bit) Minimum: 0x40000 (256 Kbytes) Maximum: 0x7B033000 (approx 2 Gbytes) Default: 0x4000000 (64 Mbytes)

maxtsiz_64bit maximum process text segment size (64-bit) Minimum: 0x40000 (256 Kbytes) Maximum: 4398046511103 (approx 4 Tbytes) Default: 0x4000000 (64 Mbytes) Text rarely requires more space than the default value. Unless the error "/usr/lib/dld.sl: Call to mmap() failed - TEXT" is received, or a vendor specifically recommends a size, keep the memory for text at the default value.

Other memory configurables

max_mem_window Enables/configures number of Memory Windows in system Minimum: 0 Maximum: memory limited Default: 0

unlockable_mem Memory size reserved for system use Minimum: 0 Maximum: Available memory indicated at power-up Default: 0 (system sets to appropriate value) Typically unlockable memory is best left to the system to set.

Memory Bottlenecks:

Process Deactivations


When a process is deactivated
-----------------------------
Once a process and its pregions are marked for deactivation, sched():
 * removes the process from the run queue.
 * adds its uarea to the active pregion list so that vhand can page it out.
 * moves all the pregions associated with the target process in front of the steal hand, so that vhand can steal from them immediately.
 * enables vhand to scan and steal pages from the entire pregion, instead of 1/16.
Eventually, vhand pushes the deactivated process's pages to secondary storage.

When a process is reactivated
-----------------------------
Processes stay deactivated until the system has freed up enough memory and the paging rate has slowed sufficiently to return processes to the run queue. The process with the highest reactivation priority is then returned to the run queue. Once a process and its pregions are marked for reactivation, sched():
 * removes the process's uarea from the active pregion list.
 * clears all deactivation flags.
 * brings in the vfd/dbd pairs.
 * faults in the uarea.
 * adds the process to the run queue.

Swap

There are three types of swap: device swap, file system swap and pseudo-swap. Device swap is divided into two different types. The first is primary swap. This swap device is usually /dev/vg00/lvol2 and is created when the operating system is installed. Primary swap can only be configured on the boot drive. The second type of device swap is called secondary swap. Secondary swap areas can be configured in any volume group, or on any disk, on the system. Ideally device swap should be configured in equal-sized partitions of the same priority to promote interleaving. Device swap makes swap requests to disk in 256KB chunks.
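A sketch of the interleaving guideline follows; the logical volume names are hypothetical and must already exist as swap-type volumes.

    # enable two equal-sized device swap areas at the same priority
    /usr/sbin/swapon -p 1 /dev/vg01/lvswap1
    /usr/sbin/swapon -p 1 /dev/vg02/lvswap2
    swapinfo -tam        # verify both areas report the same PRI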

For details on swap configuration refer to the following documents :

Configuring Device Swap DocId: KBAN00000218

How to configure device swap in VxVM DocId: VXVMKBRC00005232

File system swap

File system swap allows a system administrator to add more swap to the system even when all of the disk space has been allocated to other logical volumes, and there is no space left to create a device swap area. With file system swap, you can configure available space within a file system to be used for swap. This type of swap has the poorest I/O performance and should only be used as a last resort. File system swap requests are made in only 8KB chunks. File system swap can cause corruption with VxFS file systems; this issue has been addressed by patches PHKL_23940 (Critical, Reboot) s700_800 11.00 JFS 3.3 File system swap corruption and PHKL_24026 (Critical, Reboot) s700_800 11.11 JFS File system swap corruption.

Device Swap

In modern HP-UX systems device swap is typically used only to hold reserve memory for open processes. Most systems satisfy active process memory calls from RAM; this is done by default with the kernel parameter swapmem_on set to 1.
We refer to this feature as pseudo-swap. Normally this area is not used for reserve; however, if the system is under-configured for device swap, RAM will be used. It is not recommended to disable pseudo-swap unless the system's lockable memory exceeds 25% of RAM, which is unusual.

The amount of device swap that should be configured on a system depends on the requirement for memory for open processes. For 64-bit systems, there should be a sufficient amount of device swap to allow all processes that will run concurrently to reserve their memory. To allow for a full 32-bit memory map, the system should have a total of 4GB of device swap and RAM. A 1:1 ratio of RAM to device swap is a good starting point.

For a system that has local swap and also serves other systems with swap space, make a second estimation in addition to the one above:
1. Include the local swap space requirements for the server machine, based on the estimation from above.
2. Add up the total swap space you estimate each client requires. At a minimum, this number should equal the sum of physical memory for each client.

The parameter maxswapchunks limits the number of swap space chunks; the default is 256 and the maximum is 16384. The default size of each chunk of swap space is 2MB (swchunk=2048). swchunk can be increased to 4096 if greater than 32GB of swap is required. The OS limit for swap is 64GB.

RAM used to satisfy process memory calls is called pseudo-swap space. It allows users to execute processes in memory without allocating physical swap. Pseudo-swap is controlled by an operating-system parameter; by default, swapmem_on is set to 1, enabling pseudo-swap. Pseudo-swap space allows for the use of system memory (RAM) as a third type of swap space. It can range from 75% to 87.5% of RAM depending on how much lockable memory and buffer cache is used. Typically systems use between 15-20% of RAM for lockable memory. This region is used for kernel memory and by applications that want to ensure RAM is always available.

To determine lockable memory:

echo total_lockable_mem/D | adb /stand/vmunix /dev/mem
total_lockable_mem:
total_lockable_mem:     185280

This returns the amount of lockable memory in use in Kbytes. Divide this by 1024 to get the size in megabytes, then divide by the amount of RAM in megabytes to determine the percentage.

Typically, when the system executes a process, swap space is reserved for the entire process, in case it must be paged out. According to this model, to run one gigabyte of processes, the system would have to have one gigabyte of configured swap space. With pseudo-swap, system memory serves two functions: as process-execution space and as swap space. Because memory calls are able to access RAM directly instead of paging to disk, disk I/O is reduced and the calls to memory are satisfied in a shorter time. As before, if a process attempts to grow or be created beyond this extended threshold, it will fail.

For systems which perform best when the entire application is resident in memory, pseudo-swap space can be used to enhance performance: you can either lock the application in memory or make sure the total number of processes created does not exceed the allocated space. The unused portion of physical memory provides a buffer between the system and the swapper to give the system computational flexibility. When the number of processes created approaches capacity, the system might exhibit thrashing and a decrease in system response time. If necessary, you can disable pseudo-swap space by setting the tunable parameter swapmem_on in /usr/conf/master.d/core-hpux to zero.
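The adb query above can be wrapped into a quick percentage check; this is a sketch only, and the RAM figure is a hypothetical value supplied by the administrator.

    # report lockable memory as MB and as a percentage of RAM
    ram_mb=2048                        # hypothetical: total RAM in MB
    lock_kb=$(echo "total_lockable_mem/D" | adb /stand/vmunix /dev/mem |
              awk '$1 == "total_lockable_mem:" && $2 != "" { print $2 }')
    echo "lockable: $(( lock_kb / 1024 )) MB, $(( lock_kb * 100 / 1024 / ram_mb ))% of RAM"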

Estimating Your Swap Space Needs

Your swap space must be large enough to hold all the processes that could be running at your system's peak usage times.


As a result of the larger physical memory limits of the 64-bit hardware platforms introduced at 11.0, you may need to significantly increase the amount of swap space for certain applications on these systems. For optimum performance, all reserve memory should be allocated from the available device swap. All active process calls to memory should be satisfied within RAM. When a system is forced to page to disk, performance will be impacted. Device swap requests are 256KB in size; file system swap requests are 8KB in size. If a system is forced to page large quantities of memory to disk, the resulting increase in disk I/O will slow the transaction, and will also affect the speed of read/write transactions on that disk due to increased I/O demands.

Swap space usage increases with system load. If you are adding (or removing) a large number of additional users or applications, you will need to re-evaluate your swap space needs.

To determine the virtual memory configuration, run the swapinfo command:

# swapinfo -tam
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI NAME
dev        1024       0    1024     0                   1  /dev/vg00/lvol1
reserve       -     184    -184
memory      372      96     276    26
total      1396     280    1116    20

To determine the processes that are using the most memory, run:

# ps -elf | sort -rnk 10 | more

The 10th column, SZ, refers to the size in memory pages. Take into account whether standard page sizes are implemented.
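A sketch converting that column into megabytes, assuming the standard 4KB base page reported by getconf PAGE_SIZE:

    # five largest processes by SZ (field 10), converted from pages to MB
    ps -elf | sort -rnk 10 | head -5 |
      awk '{ printf "uid=%s pid=%s sz=%s pages (%.1f MB)\n", $3, $4, $10, $10 * 4 / 1024 }'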

The McKusick & Karels memory allocator

We first look at a high-level picture of how the operating system manages memory.


In HP-UX all memory is divided into 4K pages. Since we have a virtual memory system, pages that are logically contiguous do not need to be adjacent in physical memory. The virtual memory system maintains a mapping between the virtual and physical memory pages. As a result the operating system can satisfy requests for larger memory allocations by setting up a contiguous virtual memory range that has mappings to physical pages that are not necessarily adjacent in physical memory. The smallest amount of memory we can allocate this way is a full page, 4KB. But the kernel often creates objects that are much smaller than 4K. If we always allocated a full page for these small objects, we would waste a lot of memory. This is where the McKusick & Karels memory allocator comes into play.

The goal for the kernel memory allocator is to allow quick allocation and release of memory in an efficient way. The kernel memory allocator makes use of the page-level allocator. As pointed out already, the main problem we try to solve with the kernel memory allocator is to be able to satisfy requests for memory allocations of less than 4K.

Therefore the kernel memory allocator requests a page from the page allocator, which is then broken down into smaller chunks of equal size. This is done on a power-of-two basis, starting with the smallest implemented size of 32 bytes and going up in powers of two to a whole page. Requests of 1 page up to 8 pages are satisfied through the kernel allocator too, but with a slightly different mechanism (calling kmalloc()).

The chunks of memory generated from that one 4K page are put on a freelist. For each chunk size we have a separate freelist. A page of memory has to be broken down into chunks of a single size; we cannot use the same page for different sizes of chunks. So as an example, if we try to allocate a 128-byte chunk and no entry is available on the free list, we will allocate a new page and break it down into 32 chunks of 128 bytes each. The remaining 31 chunks are then put onto the 128-byte freelist.

Advanced Performance Tuning for HP-UX file:///C:/satish/hp-ux/Performance/Advanced%20Performance%20Tuni...

61 of 135 10/28/2010 11:19 PM

We determine the right freelist simply by taking the next larger power-of-two size into which our request fits.
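A sketch of that selection rule, rounding a request up through the power-of-two bucket sizes (32 bytes up to one 4K page):

    req=100                      # hypothetical request size in bytes
    bucket=32                    # smallest implemented chunk size
    while [ "$bucket" -lt "$req" ] && [ "$bucket" -lt 4096 ]; do
        bucket=$(( bucket * 2 ))
    done
    echo "${req}-byte request -> ${bucket}-byte freelist"    # -> 128-byte freelist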

The general workings of the allocator did not change much over the course of time; what changed are the list header names and the number of lists we maintain on a multiprocessor system. Originally we had only one spinlock guarding the whole bucket free list, so on a multiprocessor system we might end up with considerable time spent in memory allocation due to lock contention. Therefore it was decided, beginning with 10.20, to have free lists per processor. When a list runs out of free entries of a certain size, it first tries to "steal" an entry from a different processor before allocating a new full page and splitting it up.

Another change was the introduction of 32- and 64-bit kernel versions with 11.00. The 32-bit kernel uses an array called bucket_32bit[] and the 64-bit kernel an array called bucket_64bit[]. For memory allocations larger than a page we make use of page_buckets_32bit[] and page_buckets_64bit[].

Besides allocating new pages to the bucket pool when we need more memory, we might be in a situation where the system runs low on physical memory. Imagine a subsystem that required thousands of small 32-byte memory chunks due to a certain load situation. Then the load has gone, and the subsystem returns all the many chunks to the bucket free list. A lot of unused memory is now allocated to the bucket allocator and is no longer available for user applications or general use. In 10.20, vhand would check the bucket pool when under physical memory pressure, trying to coalesce the single chunks belonging together into a full page. If that could be managed, the full page was returned to the page free pool or to the superpage pool, depending on which pool the page originally came from.

In 11.00 this mechanism was switched off for performance reasons. Unfortunately, because we could now pile up lots of entries per CPU, we could end up exhausting the virtual address space on 32-bit systems. Therefore the algorithm was changed again to allow reclaiming of unused pages.

This is a rough overview of the workings and the background of the McKusick and Karels memory allocator.

The Arena Allocator for 11.11

The arena allocator is a new kernel memory allocator introduced in 11.11 that replaces the older MALLOC()/FREE() interface, which was implemented using the McKusick and Karels allocator (also known as the Bucket allocator). The new Arena allocator is very similar to the slab and zone allocators used in the Sun and Mach operating systems, respectively. The important features of the Arena allocator are object caching, improved fault isolation, reduced memory fragmentation and better system balance.

The following is only a brief description of arena allocation for the sake of first-pass dump analysis. A full discussion of Arena internals is available at 11i Internals.

Arenas and objects

All memory allocations in the kernel can be categorized as fixed-size objects and variable-size objects. A fixed-size object is an object whose size remains the same at every memory allocation request. A variable-size object may vary in size at different allocation requests.


Each type of object will have its own arena. Before any allocation is requested, the user needs to create an arena by calling kmem_arena_create(). Once created, each arena is assigned a unique handle and all allocations are made using the handle. Allocations are made by calling kmem_arena_alloc() for fixed-size objects and kmem_arena_varalloc() for variable-size objects.

Objects are de-allocated by calling kmem_arena_free(). Each arena will have its own free lists to manage allocation and deallocation from its private memory pool. This helps improve the performance of allocation/deallocation requests. When the free list is empty, VM will refill the list by allocating more pages from the superpage pool or from the page allocator. When there is memory pressure, vhand() will initiate garbage collection among the arenas' free lists to reclaim memory from the arenas.

For a smoother transition, the Arena allocator provides source compatibility for the previous MALLOC()/FREE() interface. During kernel bootup, arenas are implicitly created for each type of memory that will be allocated using the old interface. These arenas are given names starting with "M_" (e.g. M_DYNAMIC, M_MBUF). A full list of names can be found in /usr/include/sys/malloc.h. Currently these arenas are implemented as variable-size arenas.

The following diagram gives a pictorial description of the basic operations of the arena allocator.

As shown in the above diagram, each arena in the kernel is represented by a kmem_arena_t structure, which describes the arena and its attributes as well as storing the heads of its free lists.

Implementation of Arena's Free Lists

To facilitate the management of each memory chunk in an arena's free list, we associate each chunk with an object header.

The free list of fixed-size objects is kept in a one-dimensional array. Each element in the array corresponds to a spu in the system, as shown in the figure below. The head of the free list is kfh_head, and this points to the linked list of all the free fixed-size object headers.

The free list management for variable-size objects is similar to that for fixed-size objects. However, the free list for variable-size objects is kept in a two-dimensional array. Each element in the array corresponds to a spu and a bucket index, similar to the Bucket allocator. The major difference from the Bucket allocator is that the size of each bucket is not necessarily 2 to the power of its corresponding index. A bucket map is used instead to map bucket indices to bucket sizes. Two additional fields are defined in the arena structure, ka_bkt_size_map and ka_bkt_idx_map, to allow the mapping between sizes and bucket indices.


Performance Optimized Page Sizing

In 11.0 a new performance-related memory feature became available: variable page sizing. To determine the size of memory pages being used, run 'getconf PAGE_SIZE'. The following white paper discusses this new option in detail: KBAN00000849, Performance Optimized Page Sizing in HP-UX 11.0 White Paper.


1. Objectives

HP-UX 11.0 will be the first release of the operating system to have general support for variable-sized pages, also known as POPS (Performance Optimized Page Sizing) or Large Pages. Partial support for variable-sized pages has existed since HP-UX 10.20 for kernel text, kernel data, kernel dynamic data, locked user shared memory and text. HP-UX 11.0 allows a customer to configure an executable to use specific variable page sizes and/or configure the system to transparently select variable page sizes based upon program heuristics and size. The document answers the questions:

Who benefits from variable sized pages and why?
What are the drawbacks of using variable sized pages?
Where can variable sized pages be used in an application?
How are page sizes selected for an application?
What configured kernel parameters influence page size selection?
How can I select page sizes for an application?
Why am I not getting my variable sized pages?
What statistics/counters are available to assist me?

1.1 Requirements for Variable Pages

To get variable pages on an executable, the hardware platform must support it. The only machines supporting variable pages are the PA-8000 and any follow-on processors based on the PA 2.0 architecture. PA 1.1 machines such as the PA-7200 do not support variable pages and will not use any page size other than 4K.

2.0 Who benefits from large pages?

When a memory address is referenced, the hardware must know where to locate that memory. The translation lookaside buffer, or TLB, is the mechanism the hardware uses to accomplish this task. If the TLB doesn't contain information for the request, the hardware generates a TLB miss. The software takes over and performs whatever tasks are necessary to enter the required information into the TLB so that a subsequent request will succeed. This miss handling has been part of all processors to date. However, the newer processors, specifically the PA-8000, have fewer TLB entries (96 vs. 120), no hardware TLB walker, and a higher cycle cost to handle the miss. Combined, these mean applications with large data sets can end up spending more time in TLB misses on newer hardware than on old. This has been measured in several real-world applications under HP-UX 10.20 on the PA-8000.

With variable-sized pages a larger portion of the virtual address space can be mapped using a single TLB entry. Consequently, applications with large reference sets can be mapped using fewer TLB entries. Fewer entries means fewer TLB misses and increased performance.

Of course, all of this assumes that the application is experiencing TLB miss performance problems to begin with. Applications with large reference sets *NOT* experiencing TLB miss problems see no measurable gain in performance with variable-sized pages. For example, an application spending 20% of its time handling TLB misses can hope to gain 20% in performance using variable-sized pages, while an application spending 1% of its time can only gain 1%.

3.0 What are the drawbacks of using variable sized pages?

Prior to variable-sized pages, every page fault resulted in the use of a 4K page. When an application consciously (or unconsciously) uses a larger page, it is using more physical space per fault than it did previously. Depending on the reference pattern, the application may end up using more physical space than before. For example, if only 4K out of a 16K page is referenced, 12K of the space allocated is not used. The increased physical consumption results in fewer available pages for other applications in the system, possibly leading to increased paging activity and performance degradation. To avoid this, the system takes into consideration the availability of memory when selecting page sizes.

4.0 Where can variable sized pages be used in an application?

The HP-UX 11.0 release will support the use of variable sized pages on the following user objects:

· Program Text
· Program Initialized data
· Program private data and BSS
· Program private dynamic data (as allocated via sbrk())
· Program Stack
· Shared Memory
· Anonymous memory mapped files (MAP_ANONYMOUS)
· Shared libraries

The following will not employ the use of variable sized pages in the 11.0 release:

· Memory Mapped Files (MAP_FILE) that are not shared libraries.

The MMF restriction is thought to be temporary, but there is no current date to support the use of variable-sized pages on MMFs.

5.0 How are page sizes selected for an application?

Page size selection is driven by two mechanisms, the user and the kernel. The user has control over page size selection by setting the configured kernel parameters (discussed in the next section) or by selecting a specific page size (chatr(1)) for a specific application. The kernel honors page size specification in the following order:

1. The user specified a text/data page size via the chatr(1) interface. chatr(1) is described in more detail later.
2. If no chatr hint is specified, the kernel selects what it decides is a suitable page size based upon system configuration and the object size. This is what we call transparent selection. That size is then compared to the boot-time configured page size (vps_pagesize). If that value is smaller than vps_pagesize, then vps_pagesize is used.

One of the more important uses of transparent selection comes into play with dynamic objects like stack and data. How can we determine a suitable size for them if we don't know how big they get? If the user specified a data page size via chatr(1), the kernel uses that page size hint when the caller increases their data size (stack or data). If chatr(1) is not specified, the kernel tracks the user's dynamic growth over time. As the size of the object increases, the kernel increases the page size used by the object. The net effect is a scaling of page size based upon the growth of the segment in question. An application with a large data set of, say, 140 megabytes ends up selecting larger pages than an application whose size is, say, 64K. As the data/stack grows, the maximum page size is restricted by vps_ceiling (described later).

Even though the object desires a 64K or 256K page size, there are other restrictions that may result in the denial of such a desired size. To use a variable-sized page of 64K, both the virtual address and the physical page must be aligned on that size. The denial of a 64K physical page would only result if there are no 64K pages available, or no pages larger than 64K (say 256K) that could be broken into 64K pieces. In that case the most the application can hope for is the use of a smaller page like 16K, but there is a good chance it will resort to 4K. The virtual restriction comes into play for starting and ending boundaries. If the starting base address of the fault is not a multiple of the desired page size, then we can't accommodate that size. This would be the case for objects such as anonymous MMFs where the starting address is randomly chosen. Suppose our anonymous mapping started at address 0xc0005000 for a length of 1 megabyte and desires a page size of 64K. Because 0xc0005000 is not aligned to a 64K boundary, the first few pages are split up using a mixture of 4K and 16K pages until the 64K boundary at 0xc0010000 is encountered.
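The alignment arithmetic in that example is easy to verify in the shell (ksh/sh arithmetic accepts hex constants); this sketch only restates the rule, it does not query the kernel:

    addr=$(( 0xc0005000 ))       # starting address from the example
    pgsz=$(( 64 * 1024 ))        # desired 64K page size
    if [ $(( addr % pgsz )) -eq 0 ]; then
        echo "aligned: 64K pages usable from the start"
    else
        echo "misaligned: smaller pages used up to the next 64K boundary"
    fi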

6.0 What configured kernel parameters influence page size selection?


The kernel currently supports 3 configured kernel parameters that influence the use of variable-sized pages. They are:

1. vps_pagesize
2. vps_ceiling
3. vps_chatr_ceiling

vps_pagesize represents the default or minimum page size the kernel should use if the user has not chatr(1)'d a specific value. vps_pagesize is specified in units of kilobytes and should equate to one of the supported values. In the event vps_pagesize does not correspond to a supported page size, the closest page size smaller than the user's specification is used. For example, specifying 20K would result in the kernel using 16K. vps_pagesize is essentially a boot-time configured page size for all user objects created. The actual effectiveness of that size for the system is unknown. As described earlier, the actual page size used is dependent on virtual alignment as well. Even though vps_pagesize is configured to 16K, if the virtual alignment is not suitable for a 16K page then 4K pages are used instead. The current default value is 4 kilobytes (vps_pagesize = 4).

vps_ceiling represents the maximum page size the kernel uses when selecting page size "transparently". vps_ceiling is specified in units of kilobytes. Like vps_pagesize, vps_ceiling should be a valid page size. If not, the value is rounded down to the closest valid page size. vps_ceiling places a limit on the size used for process data/stack and the size of pages the kernel selects transparently for non-dynamic objects (text, shared memory, etc.). The default value is 16K (vps_ceiling = 16).

vps_chatr_ceiling places a restriction on the largest value a user is able to chatr(1). The command itself is not limited, but the kernel checks the chatr'd value against the maximum, and only values below vps_chatr_ceiling are actually used. In the event the value exceeds vps_chatr_ceiling, the actual value used is the value of vps_chatr_ceiling. Like the others, vps_chatr_ceiling is specified in units of kilobytes and will be rounded down to the closest page size if an invalid size is specified. chatr(1) does not require any sort of user privilege and can therefore be used by any user. This configured kernel parameter allows the system administrator to restrict the use of large pages if there are "bad citizens" who abuse the facility. The default value is 64 megabytes, the largest possible page size (vps_chatr_ceiling = 16384).

7.0 How can I select page sizes for an application? chatr(1)

chatr(1) is the user command to change the default page size for a process's text and data segments. chatr(1) has the ability to specify a page size selection for text (+pi option) and for data (+pd option). Valid page sizes are from 4K to 64 megabytes. When an executable is built, its page size value is set to "default". The kernel performs transparent selection on default settings. When a user chatr's a specific page size, that size is used for the object's existence. Note there is a difference between 4K and default: there is a specific value for setting "default", i.e. the transparent selection of pages (see chatr(1)). If an executable is chatr'd to 4K, only 4K pages are used.

The +pi option is used for text. The kernel uses the chatr'd value as the desired page size for the user's text segment (PT_TEXT). No other objects within the process are affected by the setting of +pi. To set an executable to 16K you would specify:

chatr +pi 16K my_executable

The +pd option is used to specify the page size for data. Data is a bit different from text in that it affects more objects than simply the user's initialized data. The value specified in +pd of an executable is used when creating any of the following: program private data, anonymous MMFs, shared memory segments and user stack. In order for the page size to be passed to shared memory segments, the chatr'd process must be the one to create the segment. Processes simply attaching to an existing segment have no effect on the desired page size. To set data to 64K specify:

chatr +pd 64K my_executable
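Both options can be combined, and chatr with no options displays the current header settings, so a change can be verified immediately. The binary name here is hypothetical:

    chatr +pi 16K +pd 64K ./my_app     # request 16K text and 64K data pages
    chatr ./my_app                     # re-display the header, including page size hints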

What about those pesky shared libraries? Shared libraries do not inherit page size; shared libraries themselves must be chatr'd. If the user wants to change the text/data page size of a shared library, then the caller must chatr the shared library. Because chatr(1) must write to the executable header, the shared library may not be active at the time of the chatr.

8.0 Why am I not getting my variable sized pages?

You've set the configured kernel parameters (vps_pagesize/vps_ceiling) and/or chatr'd your executable, but you don't see any gain in performance. Are you getting large pages? If not, why? First of all, the next section details the statistics kept at the system as well as the process level. Verifying system/process counters can certainly help. Let us suppose you verify your executable that has been chatr'd to 256K for data, but it's not receiving 256K pages. What could be happening?

"Do I have enough physical memory?" If the system is paging because of an over-commitment of memory, there may be no physical pages of the desired size available. Variable-sized pages are allocated based not only on the page size desired, but also on the availability of that physical page size. Using pstat_getvminfo(), determine just how much memory is available. If it is extremely low relative to your page size, then the system will not allocate a page of that size. One way you can tell this might be happening is by examining the psv_select_success[]/psv_select_failure[] statistics. You'd expect to see failures for your page size. You should also look at psv_pgalloc_failure[] to see what physical allocation failures have occurred as well. One reason memory could be low, resulting in variable page size request failures, could be the dynamic buffer cache. If the system has been configured with a dynamic buffer cache and the maximum percentage is reasonably large compared to system size, it's possible the buffer cache is expanding and using available free memory.


Perhaps you're getting large pages, but not for the entire object. What's the starting virtual alignment? Using pstat_getprocvm(), determine the starting virtual address for the object. pstat_getprocvm() will also return the page sizes being used by the object. The alignment of that address dictates what page size can be used starting from that location. If it's not a multiple of the specified page size, then smaller pages are used up to the point where the virtual address becomes aligned with the page size. You would expect to see some number of 4K pages, then some number of 16K pages, then some number of 64K pages, etc.

You may not get any benefit if you chatr(1) a program you have run recently to a new larger size and then run that program with the new size. Note, this only happens if you have run the program after the last mount of the file system on which the program resides and before the chatr(). The reason for this is that on the first running of the program, pages would have been brought into the page cache with the old page size. When you rerun after chatr'ing, the pages are found in the page cache at a different size than what is required. To get around this problem you need to unmount the file system on which the program resides and remount it.

So all the counters look good, but the statistics show you don't have any large pages within your object.

There is the possibility your pages are being demoted. There are operations, both user-generated and kernel-generated, that require a page to be specifically 4K. For example, the mprotect() system call can operate on a per-page basis. Having 3 out of 4 pages be read/write and 1 page read-only won't work for a large page, so prior to the mprotect() operation the underlying large page is converted from its size 'X' to 4K. Operations resulting in demotion are:

· mprotect()
· partial munmap()
· negative sbrk()
· mlock() of 4K pieces in a large page

Note, you can determine if demotions are occurring by examining psv_demotions[] returned from pstat().

Along the lines of "memory depletion", I want to point out the possible side effect of physical size inflation. This occurs when a large page is used and not all the 4K pieces are accessed. By using too large a page, the object itself uses more physical memory than before and can create paging activity where there was none before. Performance may actually decrease because the process spends time waiting for page faults to complete.

9.0 What statistics/counters are available to assist me?

9.1 vps_stats.32 & vps_stats.64

vps_stats.{32/64} is an unsupported command provided by HP to report large page size statistics. The .32 version is for 32-bit systems and the .64 version is for 64-bit systems. To get a copy of the tool, please contact your HP representative and ask him/her to extract the unsupported tool shar file from: ftp://hpchs.cup.hp.com/tools/11.X/vps_stats.shar

What vps_stats.{32/64} reports can be accessed through existing interfaces, each of which is described below.

9.2 Large page statistics

For large pages we maintain several kernel statistics of system activity to track performance. These statistics are accessible to user space via pstat(2). The statistics and the pstat() calls that access them are given below (note that only the 32-bit versions are shown).

9.3 Supported page sizes

struct pst_static {
    ...
    int32_t pst_supported_pgsize[PST_N_PG_SIZES];
    ...
};

This system-static value is accessible via pstat_getstatic(). It returns an array of valid page sizes, each given as a number of 4K pages. If there are fewer than PST_N_PG_SIZES page sizes, the array is padded with zeroes.

9.4 User-supplied hints of running processes

struct pst_status {
    ...
    int32_t pst_text_size;   /* Page size used for text objects. */
    int32_t pst_data_size;   /* Page size used for data objects. */
    ...
};

These per-process values are accessible via pstat_getproc(). They reflect the executable's desired text and/or data page size supplied via chatr(1). The page size value is given as a number of 4K pages. If no chatr has been performed on the executable, so that the default page size selection heuristic is being used, the field value is PST_SZ_DEFAULT.

9.5 Per-region statistics

struct pst_vm_status {
    ...
    int32_t pst_pagesize_hint;
    int32_t pst_vps_pgsizes[PST_N_PG_SIZES];
    ...
};

These values are accessible via pstat_getprocvm(). They are per-region values, i.e. there are separate values for text, data, stack, each memory-mapped and shared memory region, etc. pst_pagesize_hint is a usually-static value that indicates the preferred page size for the region. It is set at region creation time, either from the default page size selection heuristic or from explicit user page size information supplied via chatr. The hint remains the same throughout the life of the region, except for data and stack regions, whose hints can be adjusted upwards as they grow in size. pst_vps_pgsizes[] gives the total number of pages, by page size, currently in use by the region. The array index is the base-2 log of the number of 4K pages in a particular page size, e.g. 0=4K, 2=16K, 4=64K, etc. Note that only translated pages are accounted.

9.6 Global statistics

struct pst_vminfo {
    ...
    int32_t psv_select_success[PST_N_PG_SIZES];
    int32_t psv_select_failure[PST_N_PG_SIZES];
    int32_t psv_pgalloc_success[PST_N_PG_SIZES];
    int32_t psv_pgalloc_failure[PST_N_PG_SIZES];
    int32_t psv_demotions[PST_N_PG_SIZES];
    ...
};

These global values are accessible via pstat_getvminfo(). They tally the success/failure of different stages of large page creation. First, we select a page size to attempt to create. We start with pst_pagesize_hint and adjust due to conditions such as:

· virtual address misalignment
· neighboring pages already in memory or with different Copy-On-Write status
· neighboring pages backed by different sources (e.g. some from file system and some from swap space)

After selecting a size, we increment the psv_select_success counter corresponding to the size. If the size is less than pst_pagesize_hint, we increment psv_select_failure for all the page sizes up to and including pst_pagesize_hint. In this fashion we can determine which page sizes are asked for but are failing, and which are actually being used. Note that the counters may be inflated in some cases. Under certain conditions, we may select a size to try, encounter an exceptional event (e.g. a wait for memory or I/O alignment), and go back and redo the selection stage. Thus, we may tally several times for the same large page creation, possibly on different size counters. We expect this situation to be rare.

After settling on a page size to try, we allocate physical space with the page allocator. psv_pgalloc_success and psv_pgalloc_failure count the success/failure of the allocator. The counts are broken down by page size. We tally a success if we ask for a page of a particular size and successfully allocate it, a failure otherwise. In some cases, we specify both a desired and a minimum acceptable page size. If we succeed at a page size smaller than desired, we increment failure for all the page sizes up to the desired one (similar to the above). Thus, failure counts may appear larger than expected. Note that a psv_select_failure doesn't necessarily generate a psv_pgalloc_failure. The allocator doesn't know if we've adjusted downward before asking for physical space; it only knows if it handed us the page size we requested.

Finally, we may incur trouble even after a large page is allocated and in use. Certain system operations work only on 4K pages; if they encounter a large page, they must demote it to a series of 4K pages. For example, we might need to temporarily re-map or copy an existing large page, and cannot get the resources for the temporary large page. In order to do the re-map/copy, we demote the original page and retry for (less restrictive) resources. Demotions are tallied by page size in psv_demotions. Almost all demotions result in 4K pages, though in rare cases we demote to the next smaller page size.


Variable-Page-Size Parameters:

vps_ceiling Maximum system-selected page size in Kbytes Minimum: 4 Maximum: 65536 Default: 16

vps_chatr_ceiling Maximum chatr-selected page size in Kbytes Minimum: 4 Kbytes Maximum: 65536 Kbytes Default: 65536 Kbytes

vps_pagesize Default user page size in Kbytes Minimum: 4 Maximum: 65536 Default: 4

These three parameters, and the vps_stats.{32/64} statistics tool, are described in detail in sections 6.0 and 9.1 above.

The 32 bit memory map

The current 32-bit address space layout can be depicted by comparing how the virtual address space is used in kernel mode and user mode.


32-bit address space layout on PA1.x

Shared memory

Shared object space, otherwise known as shared memory, occupies 2 quadrants in the system memory map. The size of the memory map is determined by the total amount of memory on the system, i.e. all RAM + swap. The memory map is divided into 4 quadrants. For 32-bit systems shared object space is allocated in quadrants 3 and 4. Within quadrant 3 there is a maximum address space of 1GB; for quadrant 4 there is a maximum address space of 768MB. The last 256MB of quadrant 4 is reserved for kernel I/O. If additional shared object space is required for 32-bit operation, there are 2 alternatives available. Regardless of MAGIC type, or whether memory windows are implemented, there is a 1GB 32-bit architectural limit for a single segment. There is further information on shared memory in the WTEC tools section under SHMEMINFO.

Shared memory use can be monitored by using the ipcs utility.

ipcs -mob

You will see output similar to this:

IPC status from /dev/kmem as of Tue Apr 17 09:29:33 2001
T     ID     KEY        MODE        OWNER   GROUP  NATTCH    SEGSZ
Shared Memory:
m      0  0x411c0359 --rw-rw-rw-    root    root       0      348
m      1  0x4e0c0002 --rw-rw-rw-    root    root       1    61760
m      2  0x412006c9 --rw-rw-rw-    root    root       1     8192
m      3  0x301c3445 --rw-rw-rw-    root    root       3  1048576
m   4004  0x0c6629c9 --rw-r-----    root    root       2  7235252
m      5  0x06347849 --rw-rw-rw-    root    root       1    77384
m    206  0x4918190d --rw-r--rw-    root    root       0    22908
m   6607  0x431c52bc --rw-rw-rw-  daemon  daemon       1  5767168

Advanced Performance Tuning for HP-UX file:///C:/satish/hp-ux/Performance/Advanced%20Performance%20Tuni...

72 of 135 10/28/2010 11:19 PM

The two fields of most interest are NATTCH and SEGSZ.

NATTCH - The number of processes attached to the associated shared memory segment. Look for segments where this is 0; they indicate shared memory segments that were never released by their processes. If there are multiple segments showing an NATTCH of zero, especially if they are owned by a database, this can be an indication that the segments are not being efficiently released. This is due to the program not calling detachreg. These segments can be removed using ipcrm -m shmid. Note: even though there is no process attached to the segment, the data structure is still intact. The shared memory segment and the data structure associated with it are destroyed by executing this command.
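A sketch for spotting such candidates from the ipcs -mob output above (field 7 is NATTCH, field 8 is SEGSZ); verify ownership before removing anything:

    ipcs -mob | awk '$1 == "m" && $7 == 0 { print "shmid " $2 ": " $8 " bytes, no attachments" }'
    # then, after review:  ipcrm -m <shmid>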

SEGSZ - The size of the associated shared memory segment in bytes. The total of SEGSZ for a 32-bit system using EXEC_MAGIC cannot exceed 1879048192 bytes (1.75GB), or 2952790016 bytes (2.75GB) for SHMEM_MAGIC.

The first alternative for additional shared object space is to utilize quadrant 2 by converting the executable to SHMEM_MAGIC (via chatr -m). An existing application may be relinked as the new executable type SHMEM_MAGIC, or the application can be linked as type EXEC_MAGIC and then chatr'd to the new executable type SHMEM_MAGIC. It is important to remember that if this choice is made, the available memory for data, kernel stack, text and uarea will be confined to the 1GB maximum in quadrant 1.

The second alternative is to implement memory windows. This alternative allows for discrete shared object space, called by the getmemwindow command. The ability to create a unique memory window removes the current system-wide 1.75 gigabyte limitation (2.75 gigabytes if compiled as SHMEM_MAGIC). A 32-bit process can create a unique memory window for shared objects like shared memory. Other processes can then use this window for shared objects as well. To enable the use of memory windows, the kernel tunable max_mem_window must be set to the desired number of memory windows. The disabled value is 0. The number of memory windows is limited by the total system memory. The theoretical limit is 8192 1GB windows; at this time the OS and available hardware prevent this.

Magic number review

There are 3 magic numbers that can be used for a 32-bit executable at 11.00. They are SHARE_MAGIC (DEMAND_MAGIC), EXEC_MAGIC, and SHMEM_MAGIC. For 64-bit 11.00 executables there is currently no need to have different memory maps available, as the standard one allows up to 4TB for the program text, another 4TB for its private data and a total of 8TB for shared areas. SHARE_MAGIC is the default at 11.0; SHARE_MAGIC is also called DEMAND_MAGIC. With SHARE_MAGIC, quadrant 1 is used for program text, quadrant 2 is used for program data, kernel stack and uarea, and quadrants 3 and 4 are for shared objects. EXEC_MAGIC allows a greater process data space by allowing text and data to share quadrant 1.

Note: Even with SHMEM_MAGIC executables, a single shared memory segment must be contained completely in one quadrant, so 1GB is still the maximum size of a single shared memory segment.

Memory windows will run on HP-UX 11.0 on either a 32- or 64-bit installation. To implement memory windows for 11.0 the following patches must be installed:

PHKL_18543 (Critical, Reboot) s700_800 11.00 PM/VM/UFS/async/scsi/io/DMAPI/JFS/perf patch
PHCO_23705 s700_800 11.00 memory windows cumulative patch
PHCO_27375 s700_800 11.00 cumulative SAM/ObAM patch
PHKL_28766 (Critical, Reboot) s700_800 11.00 Probe,IDDS,PM,VM,PA-8700,AIO,T600,FS,PDC,CLK


These have dependencies; for the latest revisions check http://itrc.hp.com/

To configure memory windows, the file /etc/services.window needs to be set up with the correct information. A line should be added in /etc/services.window to associate an application with a memory window id. Here is a sample /etc/services.window for 3 Oracle instances. In the example that follows, Oracledb1 uses memory window id 20, Oracledb2 has id 30 and Oracledb3 has id 40.

Oracledb1 20
Oracledb2 30
Oracledb3 40

Two new commands have been added to support memory windows. The getmemwindow command extracts window ids of user processes from the /etc/services.window file. The setmemwindow command changes the window id of a running process or starts a specified program in a particular memory window. The following is a simple script to start a process in a memory window:

# more startDB1.sh
WinId=$(getmemwindow Oracledb1)
setmemwindow -i $WinId /home/dave/memwinbb/startDB1 "DB1 says hello!"

Run the script and see the output from the binary. The setmemwindow command didn't produce any of the output:

# ./startDB1.sh
writing to segment: "DB1 says hello!"
Key is 1377042728

Shared memory is allocated in segments. The size of the segments is limited by the free space within the memory quadrants allotted for shared object space. A shared memory segment must have all of its memory addresses allocated in sequence, with no gaps between addresses. Memory allocated in this manner is known as contiguous memory. No process or shared memory segment may cross a quadrant boundary. This is true regardless of architecture or OS.

There is an issue with fragmentation of shared memory. The most common cause is excessively sized inode tables. This is true of HFS as well as VxFS. As mentioned in the section on system tables, setting the inode tables to appropriate sizes will minimize this issue.

VMSTAT

The vmstat tool reports virtual memory statistics; it can be a key tool to track suspected memory leaks. The vmstat command reports certain statistics kept about process, virtual memory, trap, and CPU activity. It can also clear the accumulators in the kernel sum structure.

Options

vmstat recognizes the following options:

-d Report disk transfer information as a separate section, in the form of transfers per second.

-n Provide an output format that is more easily viewed on an 80-column display device.

Advanced Performance Tuning for HP-UX file:///C:/satish/hp-ux/Performance/Advanced%20Performance%20Tuni...

74 of 135 10/28/2010 11:19 PM

This format separates the default output into two groups: virtual memory information and CPU data. Each group is displayed as a separate line of output. On multiprocessor systems, this display format also provides CPU utilization on a per-CPU basis.

-S Report the number of processes swapped in and out (si and so) instead of page reclaims and address translation faults (re and at).

interval Display successive lines, which are summaries over the last interval seconds. If interval is zero, the output is displayed once only. If the -d option is specified, the column headers are repeated. If -d is omitted, the column headers are not repeated. The command vmstat 5 prints what the system is doing every five seconds. This is a good choice of printing interval since this is how often some of the statistics are sampled in the system; others vary every second.

count Repeat the summary statistics count times. If count is omitted or zero, the output is repeated until an interrupt or quit signal is received. From the terminal, these are commonly ^C and ^\, respectively (see stty(1)).

-f Report on the number of forks and the number of pages of virtual memory involved since boot-up.

-s Print the total number of several kinds of paging-related events from the kernel sum structure that have occurred since boot-up or since vmstat was last executed with the -z option.

-z Clear all accumulators in the kernel sum structure. This option is restricted to the super-user.

If none of these options is given, vmstat displays a one-line summary of the virtual memory activity since boot-up or since the -z option was last executed.

Column Descriptions

The column headings and the meaning of each column are:

procs    Information about numbers of processes in various states.
  r   In run queue
  b   Blocked for resources (I/O, paging, etc.)
  w   Runnable or short sleeper (< 20 secs) but swapped

memory   Information about the usage of virtual and real memory. Virtual pages are considered active if they belong to processes that are running or have run in the last 20 seconds.
  avm   Active virtual pages
  free  Size of the free list

page     Information about page faults and paging activity. These are averaged each five seconds, and given in units per second.
  re  Page reclaims (without -S)
  at  Address translation faults (without -S)
  si  Processes swapped in (with -S)
  so  Processes swapped out (with -S)
  pi  Pages paged in
  po  Pages paged out
  fr  Pages freed per second
  de  Anticipated short-term memory shortfall
  sr  Pages scanned by clock algorithm, per second

faults   Trap/interrupt rate averages per second over last 5 seconds.
  in  Device interrupts per second (nonclock)
  sy  System calls per second
  cs  CPU context switch rate (switches/sec)

cpu      Breakdown of percentage usage of CPU time.
  us  User time for normal and low priority processes
  sy  System time
  id  CPU idle


EXAMPLES

The following examples show the output for various command options. For formatting purposes, some leading blanks have been deleted.

1. Display the default output.

vmstat

   procs        memory              page                faults        cpu
 r  b  w     avm    free  re at pi po fr de sr   in  sy  cs  us sy id
 0  0  0    1158     511   0  0  0  0  0  0  0  111  18   7   0  0 100

2. Add the disk transfer information to the default output.

vmstat -d

procs       memory            page                         faults         cpu
r  b  w    avm   free   re at pi po fr de sr   in  sy  cs   us sy id
0  0  0   1158    511    0  0  0  0  0  0  0  111  18   7    0  0 100

Disk Transfers
device   xfer/sec
c0t6d0          0
c0t1d0          0
c0t3d0          0
c0t5d0          0

3. Display the default output in 80-column format.

vmstat -n

VM
memory            page                         faults
   avm   free   re at pi po fr de sr   in  sy  cs
  1158    430    0  0  0  0  0  0  0  111  18   7
CPU
cpu           procs
us sy  id     r  b  w
 0  0 100     0  0  0

4. Replace the page reclaims and address translation faults with process swapping in the default output.

vmstat -S

procs       memory            page                         faults         cpu
r  b  w    avm   free   si so pi po fr de sr   in  sy  cs   us sy id
0  0  0   1158    430    0  0  0  0  0  0  0  111  18   7    0  0 100

5. Display the default output twice at five-second intervals. Note that the headers are not repeated.

vmstat 5 2

procs       memory            page                         faults         cpu
r  b  w    avm   free   re at pi po fr de sr   in  sy  cs   us sy id
0  0  0   1158    456    0  0  0  0  0  0  0  111  18   7    0  0 100
0  0  0   1221    436    5  0  5  0  0  0  0  108  65  18    0  1  99

6. Display the default output twice in 80-column format at five-second intervals. Note that the headers are not repeated.

vmstat -n 5 2

VM
memory            page                         faults
   avm   free   re at pi po fr de sr   in  sy  cs
  1221    436    0  0  0  0  0  0  0  111  18   7
CPU
cpu           procs
us sy  id     r  b  w
 0  0 100     0  0  0
  1221    435    2  0  2  0  0  0  0  109  35  17
 0  1  99     0  0  0

7. Display the default output and disk transfers twice, in 80-column format, at five-second intervals. Note that the headers are repeated.

vmstat -dn 5 2

VM
memory            page                         faults
   avm   free   re at pi po fr de sr   in  sy  cs
  1221    435    0  0  0  0  0  0  0  111  18   7
CPU
cpu           procs
us sy  id     r  b  w
 0  0 100     0  0  0

Disk Transfers
device   xfer/sec
c0t6d0          0
c0t1d0          0
c0t3d0          0
c0t5d0          0

VM
memory            page                         faults
   avm   free   re at pi po fr de sr   in  sy  cs
  1219    425    0  0  0  0  0  0  0  111  54  15
CPU
cpu           procs
us sy  id     r  b  w
 1  8  92     0  0  0

Disk Transfers
device   xfer/sec
c0t6d0          0
c0t1d0          0
c0t3d0          0
c0t5d0          0

8. Display the number of forks and pages of virtual memory since boot-up.

vmstat -f

24558 forks, 1471595 pages, average = 59.92

9. Display the counts of paging-related events.

vmstat -s

0 swap ins
0 swap outs
0 pages swapped in
0 pages swapped out
1344563 total address trans. faults taken
542093 page ins
2185 page outs
602573 pages paged in
4346 pages paged out
482343 reclaims from free list
504621 total page reclaims
124 intransit blocking page faults
1460755 zero fill pages created
404137 zero fill page faults
366022 executable fill pages created
71578 executable fill page faults
0 swap text pages found in free list
162043 inode text pages found in free list
196 revolutions of the clock hand
45732 pages scanned for page out
4859 pages freed by the clock daemon
36680636 cpu context switches
1497746186 device interrupts
1835626 traps
87434493 system calls

WARNINGS

Users of vmstat must not rely on the exact field widths and spacing of its output, as these will vary depending on the system, the release of HP-UX, and the data to be displayed.

Module 5

DISK I/O

Disk Bottlenecks:

high disk activity

high idle CPU time waiting for I/O requests to finish

long disk queues

Efforts to optimize disk performance will be wasted if the server has insufficient memory.

Running iostat with a five-second interval (iostat 5) will report activity every five seconds. Look at the bps and sps columns for the disks (device) that hold exported file systems. bps shows the number of kilobytes transferred per second during the period; sps shows the number of seeks per second (ignore msps).

To optimize disk I/O, consider the following layout:

Put your most frequently accessed information on your fastest disks, and distribute the workload evenly among identical, mounted disks so as to prevent overload on one disk while another is under-utilized.

Whenever possible, if you plan to have a file system span disks, have the logical volume span identical disk interface types. Best performance results from a striped logical volume that spans similar disks. The more closely you match the striped disks in terms of speed, capacity, and interface type, the better the performance you can expect. So, for example, when striping across several disks of varying speeds, performance will be no faster than that of the slowest disk.

Increasing the number of disks may not necessarily improve performance. This is because the maximum efficiency that can be achieved by combining disks in a striped logical volume is limited by the maximum throughput of the file system itself and of the buses to which the disks are attached.

If you have more than one interface card or bus to which you can connect disks, distribute the disks as evenly as possible among them. That is, each interface card or bus should have roughly the same number of disks attached to it. You will achieve the best I/O performance when you use more than one bus and interleave the stripes of the logical volume. A logical volume's stripe size identifies the size of each of the blocks of data that make up the stripe. You can set the stripe size to four, eight, 16, 32, or 64 kilobytes (KB) (the default is eight KB).
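For illustration, such a striped logical volume could be created with lvcreate(1M); the volume group, size, and name below are hypothetical:

# Create a 1024 MB logical volume striped across 3 disks with a 64 KB stripe size
lvcreate -i 3 -I 64 -L 1024 -n lvol_data /dev/vg01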


The stripe size of a logical volume is not related to the physical sector size of a disk, which is typically 512 bytes. Other factors to consider when optimizing file system performance for VxFS are the block size, intent log size and mount options.

Block size

The unit of allocation in VxFS is a block. There are no fragments because storage is allocated in extents that consist of one or more blocks. The smallest block size available is 1K, which is also the default block size for VxFS file systems created on file systems of less than 8 gigabytes.

Choose a block size based on the type of application being run. For example, if there are many small files, a 1K block size may save space. For large file systems with relatively few files, a larger block size is more appropriate. The trade-offs of specifying larger block sizes are a decrease in the amount of space used to hold the free extent bitmaps for each allocation unit, an increase in the maximum extent size, and a decrease in the number of extents used per file, versus an increase in the amount of space wasted at the end of files that are not a multiple of the block size. Larger block sizes use less disk space in file system overhead, but consume more space for files that are not a multiple of the block size. The easiest way to judge which block sizes provide the greatest system efficiency is to try representative system loads against various sizes and pick the fastest.

To determine the current block size for a VxFS file system:

fstyp -v /dev/vg00/lvol#

For example:

# fstyp -v /dev/vg00/lvol1
f_bsize: 8192

Note: The f_bsize parameter reports the block size for the VxFS file system.
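The block size is fixed when the file system is created; as a hedged sketch (device name hypothetical):

# Create a VxFS file system with a 4K block size
mkfs -F vxfs -o bsize=4096 /dev/vg01/rlvol_data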

Intent Log size

The intent log size is chosen when a file system is created and cannot be subsequently changed. The mkfs utility uses a default intent log size of 1024 blocks. The default size is sufficient for most workloads. If the system is used as an NFS server or for intensive synchronous write workloads, performance may be improved using a larger log size. With larger intent log sizes, recovery time is proportionately longer and the file system may consume more system resources (such as memory) during normal operation.

There are several system performance benchmark suites for which VxFS performs better with larger log sizes. As with block sizes, the best way to pick the log size is to try representative system loads against various sizes and pick the fastest.
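Like the block size, the intent log size is set at creation time; a hedged sketch (device name and value hypothetical):

# Create a VxFS file system with a 2048-block intent log
mkfs -F vxfs -o logsize=2048 /dev/vg01/rlvol_nfs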

Mount options

Standard mount options:

Intent Log Options

These options control how transactions are logged to disk:

1. Full Logging (log): File system structural changes are logged to disk before the system call returns to the application. If the system crashes, fsck(1M) will complete logged operations that have not completed.

2. Delayed Logging (delaylog): Some system calls return before the intent log is written. This improves the performance of the system, but some changes are not guaranteed until a short time later when the intent log is written. This mode approximates traditional UNIX system guarantees for correctness in case of system failure.

3. Temporary Logging (tmplog): The intent log is almost always delayed. This improves performance, but recent changes may disappear if the system crashes. This mode is only recommended for temporary file systems.

4. No Logging (nolog): The intent log is disabled. The other three logging modes provide for fast file system recovery; nolog does not. With nolog mode, a full structural check must be performed after a crash; this may result in loss of substantial portions of the file system, depending upon activity at the time of the crash. Usually, a nolog file system should be rebuilt with mkfs(1M) after a crash. The nolog mode should only be used for memory-resident or very temporary file systems.
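As an illustrative sketch, the logging mode is chosen at mount time; the device and mount point below are hypothetical:

# Mount a VxFS file system with delayed logging
mount -F vxfs -o delaylog /dev/vg01/lvol_data /data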

Write Options

These options control how user data is written to disk:

1. Direct writes (direct): The direct value causes any writes without the O_SYNC flag, and all reads, to be handled as if the VX_DIRECT caching advisory had been set instead.

2. Data Synchronous Writes (dsync): A write operation returns to the caller after the data has been transferred to external media, but the inode is not updated synchronously if only the times in the inode need to be updated.

3. Sync on Close Writes (closesync): Sync-on-close I/O mode causes writes to be delayed rather than to take effect immediately, and causes the equivalent of an fsync(2) to be run when a file is closed.

4. Delayed Writes (delay): This causes writes to be delayed rather than to take effect immediately. No special action is performed when closing a file.

5. Temporary caching (tmpcache): The tmpcache value disables delayed extended writes, trading off integrity for performance. When this option is chosen, JFS does not zero out new extents allocated as files are sequentially written, so uninitialized data may appear in files being written at the time of a system crash.

The system administrator can independently control the way writes with and without O_SYNC are handled. The mincache mount option determines how ordinary writes are treated; the convosync option determines how synchronous writes are treated:

mincache=direct|dsync|closesync|tmpcache

convosync=direct|dsync|closesync|delay
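For example (a sketch only; the device, mount point, and option mix are hypothetical), a scratch file system might trade integrity for speed with:

mount -F vxfs -o tmplog,mincache=tmpcache,convosync=delay /dev/vg01/lvol_scratch /scratch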


In addition to the standard mount mode (log mode), VxFS provides blkclear, delaylog, tmplog, nolog, and nodatainlog modes of operation. Caching behavior can be altered with the mincache option, and the behavior of O_SYNC and D_SYNC (see fcntl(2)) writes can be altered with the convosync option.

The delaylog and tmplog modes are capable of significantly improving performance. The improvement over log mode is typically about 15 to 20 percent with delaylog; with tmplog, the improvement is even higher. Performance improvement varies, depending on the operations being performed and the workload. Read/write-intensive loads should show less improvement, while file system structure-intensive loads (such as mkdir, create, and rename) may show over 100 percent improvement. The best way to select a mode is to test representative system loads against the logging modes and compare the performance results.

Most of the modes can be used in combination. For example, a desktop machine might use both the blkclear and mincache=closesync modes.

If you plan to use the striped logical volume as a raw data partition (for example, for a database application that uses the device directly), the stripe size should be the same as the primary I/O size for the application.

This section describes the kernel tunables in VxFS. See the System Tables section regarding vx_ninode.

Monitoring free space

In general, VxFS works best if the percentage of free space in the file system does not get below 10 percent. This is because file systems with 10 percent or more free space have less fragmentation and better extent allocation. Regular use of the df_vxfs(1M) command to monitor free space is desirable. Full file systems may have an adverse effect on file system performance. Full file systems should therefore have some files removed, or should be expanded (see fsadm_vxfs(1M) for a description of online file system expansion).

The reorganization and resize features of fsadm_vxfs(1M) are available only with the optional HP OnLineJFS product. If Advanced JFS (OnLine) is installed, defragmentation may yield performance gains. Fragmentation means that files are scattered haphazardly across a disk or disks, the result of growth over time. Multiple disk-head movements are needed to read and update such files, slowing response time.

Defragmentation can be done either from the command line or via SAM. Ideally a server's JFS file systems should be defragmented regularly. The frequency should be based on the volatility of reads, writes, and deletes. The easiest way to ensure that fragmentation does not become a problem is to schedule regular defragmentation runs from cron. Defragmentation scheduling should range from weekly (for frequently used file systems) to monthly (for infrequently used file systems). Extent fragmentation should be monitored with fsadm_vxfs(1M) or the -o s option of df_vxfs(1M).

There are three factors which can be used to determine the degree of fragmentation:

1) percentage of free space in extents of less than eight blocks in length
2) percentage of free space in extents of less than 64 blocks in length
3) percentage of free space in extents of length 64 blocks or greater

An unfragmented file system will have the following characteristics:

less than 1 percent of free space in extents of less than eight blocks in length
less than 5 percent of free space in extents of less than 64 blocks in length
more than 5 percent of the total file system size available as free extents in lengths of 64 or more blocks

A badly fragmented file system will have one or more of the following characteristics:

greater than 5 percent of free space in extents of less than eight blocks in length
more than 50 percent of free space in extents of less than 64 blocks in length
less than 5 percent of the total file system size available as free extents in lengths of 64 or more blocks


Note: Defragmentation can also be done at the directory level; depending on the volatility of the directory structure, this can be less efficient and may not always provide significant increases in performance.

For an extent fragmentation report, run: fsadm -E /mount_point
To execute an extent defragmentation, run: fsadm -e /mount_point
For a directory fragmentation report, run: fsadm -D /mount_point
To execute a directory defragmentation, run: fsadm -d /mount_point

Defragmentation can also be done through SAM:

Execute sam.

Select Disks and File Systems functional area.

Select the File Systems application.

Select the JFS (VxFS) file system.

Select Actions

Select the VxFS Maintenance menu item.

You can choose to view reports:

Select View Extent Fragmentation Report

Select View Directory Fragmentation Report

or perform the defragmentation:

Select Reorganize Extents

Reorganize Directories
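Returning to the command line, a weekly extent defragmentation run could be scheduled from the root crontab; a hedged sketch (mount point hypothetical):

# Run extent reorganization on /data every Sunday at 02:00
0 2 * * 0 /usr/sbin/fsadm -F vxfs -e /data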

Performance of a file system can be enhanced by a suitable choice of I/O sizes and proper alignment of the I/O requests based on the requirements of the underlying special device. VxFS provides tools to tune the file systems.

Tuning VxFS I/O Parameters (Online JFS 3.3 or higher)

The VxFS file system provides a set of tunable I/O parameters that control some of its behavior. If the default parameters are not acceptable, the /etc/vx/tunefstab file can be used to set values for I/O parameters. The mount_vxfs(1M) command invokes the vxtunefs(1M) command to process the contents of the /etc/vx/tunefstab file. Note that the mount command will continue even if the call to vxtunefs fails or if vxtunefs detects invalid parameters. While the file system is mounted, any I/O parameters can be changed using the vxtunefs command, which can have tunables specified on the command line or can read them from the /etc/vx/tunefstab file. For more details, see vxtunefs(1M) and tunefstab(4). The vxtunefs command can be used to print the current values of the I/O parameters.
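For example (a sketch; the mount point is hypothetical), the current I/O parameters of a mounted file system can be printed and one of them changed:

# Print the current I/O parameters
vxtunefs -p /data
# Set the preferred read size to 256K
vxtunefs -o read_pref_io=262144 /data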

Tunable VxFS I/O Parameters

read_pref_io

The preferred read request size. The file system uses this in conjunction with the read_nstream value to determine how much data to read ahead. The default value is 64K.

write_pref_io

The preferred write request size. The file system uses this in conjunction with the write_nstream value to determine how to do flush behind on writes. The default value is 64K.


read_nstream

The number of parallel read requests of size read_pref_io to have outstanding at one time. The file system uses the product of read_nstream multiplied by read_pref_io to determine its read-ahead size. The default value for read_nstream is 1.

write_nstream

The number of parallel write requests of size write_pref_io to have outstanding at one time. The file system uses the product of write_nstream multiplied by write_pref_io to determine when to do flush behind on writes. The default value for write_nstream is 1.

default_indir_size

On VxFS, files can have up to ten direct extents of variable size stored in the inode. Once these extents are used up, the file must use indirect extents, which are a fixed size that is set when the file first uses indirect extents. These indirect extents are 8K by default. The file system does not use larger indirect extents because it must fail a write and return ENOSPC if there are no extents available that are the indirect extent size.

For file systems with a lot of large files, the 8K indirect extent size is too small. The files that get into indirect extents use a lot of smaller extents instead of a few larger ones. By using this parameter, the default indirect extent size can be increased so that large files in indirects use fewer, larger extents.

NOTE: The tunable default_indir_size should be used carefully. If it is set too large, writes will fail when they are unable to allocate extents of the indirect extent size to a file. In general, the fewer and the larger the files on a file system, the larger default_indir_size can be set. This parameter should generally be set to some multiple of the read_pref_io parameter. default_indir_size is not applicable on Version 4 disk layouts.

discovered_direct_iosz

Any file I/O requests larger than discovered_direct_iosz are handled as discovered direct I/O. A discovered direct I/O is unbuffered, similar to direct I/O, but it does not require a synchronous commit of the inode when the file is extended or blocks are allocated. For larger I/O requests, the CPU time for copying the data into the page cache and the cost of using memory to buffer the I/O data becomes more expensive than the cost of doing the disk I/O. For these I/O requests, using discovered direct I/O is more efficient than regular I/O. The default value of this parameter is 256K.

initial_extent_size

Changes the default initial extent size. VxFS determines, based on the first write to a new file, the size of the first extent to be allocated to the file. Normally the first extent is the smallest power of 2 that is larger than the size of the first write. If that power of 2 is less than 8K, the first extent allocated is 8K. After the initial extent, the file system increases the size of subsequent extents (see max_seqio_extent_size) with each allocation.

Since most applications write to files using a buffer size of 8K or less, the increasing extents start doubling from a small initial extent. initial_extent_size can change the default initial extent size to be larger, so the doubling policy will start from a much larger initial size and the file system will not allocate a set of small extents at the start of the file. Use this parameter only on file systems that will have a very large average file size. On these file systems it will result in fewer extents per file and less fragmentation. initial_extent_size is measured in file system blocks.

max_buf_data_size

The maximum buffer size allocated for file data; either 8K bytes or 64K bytes. Use the larger value for workloads where large reads and writes are performed sequentially. Use the smaller value on workloads where the I/O is random or is done in small chunks. 8K bytes is the default value.

max_direct_iosz

The maximum size of a direct I/O request that will be issued by the file system. If a larger I/O request comes in, it is broken up into max_direct_iosz chunks. This parameter defines how much memory an I/O request can lock at once, so it should not be set to more than 20 percent of memory.

max_diskq

Limits the maximum disk queue generated by a single file. When the file system is flushing data for a file and the number of pages being flushed exceeds max_diskq, processes will block until the amount of data being flushed decreases. Although this does not limit the actual disk queue, it prevents flushing processes from making the system unresponsive. The default value is 1 MB.


max_seqio_extent_size

Increases or decreases the maximum size of an extent. When the file system is following its default allocation policy for sequential writes to a file, it allocates an initial extent that is large enough for the first write to the file. When additional extents are allocated, they are progressively larger (the algorithm tries to double the size of the file with each new extent), so each extent can hold several writes' worth of data. This is done to reduce the total number of extents in anticipation of continued sequential writes. When the file stops being written, any unused space is freed for other files to use. Normally this allocation stops increasing the size of extents at 2048 blocks, which prevents one file from holding too much unused space. max_seqio_extent_size is measured in file system blocks.

Tips

Try to align the parameters to match the geometry of the logical disk. With striping or RAID-5, it is common to set read_pref_io to the stripe unit size and read_nstream to the number of columns in the stripe. For striped arrays, use the same values for write_pref_io and write_nstream, but for RAID-5 arrays, set write_pref_io to the full stripe size and write_nstream to 1.

For an application to do efficient disk I/O, it should issue read requests that are equal to the product of read_nstream multiplied by read_pref_io. Generally, any multiple or factor of read_nstream multiplied by read_pref_io should be a good size for performance. For writing, the same rule of thumb applies to the write_pref_io and write_nstream parameters. When tuning a file system, the best thing to do is try out the tuning parameters under a real-life workload; a sample tunefstab entry follows below.

If an application is doing sequential I/O to large files, it should try to issue requests larger than discovered_direct_iosz. This causes the I/O requests to be performed as discovered direct I/O requests, which are unbuffered like direct I/O but do not require synchronous inode updates when extending the file. If the file is larger than can fit in the cache, using unbuffered I/O avoids throwing useful data out of the cache, and it avoids a lot of CPU overhead.
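As a hedged illustration of the alignment advice above, a volume striped across four columns with a 64K stripe unit might get an /etc/vx/tunefstab entry like the following (device name hypothetical; see tunefstab(4) for the authoritative format):

/dev/vg01/lvol_data read_pref_io=65536,read_nstream=4,write_pref_io=65536,write_nstream=4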

Cache Advisories

The VxFS file system allows an application to set cache advisories for use when accessing files. These advisories are in memory only and they do not persist across reboots. Some advisories are currently maintained on a per-file, not a per-file-descriptor, basis. This means that only one set of advisories can be in effect for all accesses to the file. If two conflicting applications set different advisories, both use the last advisories that were set.

All advisories are set using the VX_SETCACHE ioctl. The current set of advisories can be obtained with the VX_GETCACHE ioctl. For details on the use of these ioctls, see vxfsio(7). The VX_SETCACHE ioctl is available only with the HP OnLineJFS product.

Direct I/O

Direct I/O is an unbuffered form of I/O. If the VX_DIRECT advisory is set, the user is requesting direct data transfer between the disk and the user-supplied buffer for reads and writes. This bypasses the kernel buffering of data, and reduces the CPU overhead associated with I/O by eliminating the data copy between the kernel buffer and the user's buffer. This also avoids taking up space in the buffer cache that might be better used for something else. The direct I/O feature can provide significant performance gains for some applications.

For an I/O operation to be performed as direct I/O, it must meet certain alignment criteria. The alignment constraints are usually determined by the disk driver, the disk controller, and the system memory management hardware and software. The file offset must be aligned on a 4-byte boundary.

If a request fails to meet the alignment constraints for direct I/O, the request is performed as data synchronous I/O. If the file is currently being accessed by using memory mapped I/O, any direct I/O accesses are done as data synchronous I/O.

Since direct I/O maintains the same data integrity as synchronous I/O, it can be used in many applications that currently use synchronous I/O. If a direct I/O request does not allocate storage or extend the file, the inode is not immediately written.

The CPU cost of direct I/O is about the same as a raw disk transfer. For sequential I/O to very large files, using direct I/O with large transfer sizes can provide the same speed as buffered I/O with much less CPU overhead.

If the file is being extended or storage is being allocated, direct I/O must write the inode change before returning to the application. This eliminates some of the performance advantages of direct I/O.

The direct I/O and VX_DIRECT advisories are maintained on a per-file-descriptor basis.

Unbuffered I/O

If the VX_UNBUFFERED advisory is set, I/O behavior is the same as direct I/O with the VX_DIRECT advisory set, so the alignment constraints that apply to direct I/O also apply to unbuffered I/O. For unbuffered I/O, however, if the file is being extended, or storage is being allocated to the file, inode changes are not updated synchronously before the write returns to the user. The VX_UNBUFFERED advisory is maintained on a per-file-descriptor basis.

Discovered Direct I/O

Discovered Direct I/O is not a cache advisory that the user can set using the VX_SETCACHE ioctl. When the file system gets an I/O request larger than the default size of 128K, it tries to use direct I/O on the request. For large I/O sizes, Discovered Direct I/O can perform much better than buffered I/O.

Discovered Direct I/O behavior is similar to direct I/O and has the same alignment constraints, except writes that allocate storage or extend the file size do not require writing the inode changes before returning to the application.

Data Synchronous I/O

If the VX_DSYNC advisory is set, the user is requesting data synchronous I/O. In synchronous I/O, the data is written, and the inode is written with updated times and (if necessary) an increased file size. In data synchronous I/O, the data is transferred to disk synchronously before the write returns to the user. If the file is not extended by the write, the times are updated in memory, and the call returns to the user. If the file is extended by the operation, the inode is written before the write returns.

Like direct I/O, the data synchronous I/O feature can provide significant application performance gains. Since data synchronous I/O maintains the same data integrity as synchronous I/O, it can be used in many applications that currently use synchronous I/O. If the data synchronous I/O does not allocate storage or extend the file, the inode is not immediately written. Data synchronous I/O does not have any alignment constraints, so applications that find it difficult to meet the alignment constraints of direct I/O should use data synchronous I/O.

If the file is being extended or storage is allocated, data synchronous I/O must write the inode change before returning to the application. This case eliminates the performance advantage of data synchronous I/O.

The direct I/O and VX_DSYNC advisories are maintained on a per-file-descriptor basis.

Other Advisories

The VX_SEQ advisory indicates that the file is being accessed sequentially. When the file is being read, the maximum read-ahead is always performed. When the file is written, instead of trying to determine whether the I/O is sequential or random by examining the write offset, sequential I/O is assumed. The pages for the write are not immediately flushed. Instead, pages are flushed some distance behind the current write point.

The VX_RANDOM advisory indicates that the file is being accessed randomly. For reads, this disables read-ahead. For writes, this disables the flush-behind. The data is flushed by the pager, at a rate based on memory contention.

The VX_NOREUSE advisory is used as a modifier. If both VX_RANDOM and VX_NOREUSE are set, pages are immediately freed and put on the quick reuse free list as soon as the data has been used. If VX_NOREUSE is set when doing sequential I/O, pages are also put on the quick reuse free list when they are flushed. VX_NOREUSE may slow down access to the file, but it can reduce the cached data held by the system. This can allow more data to be cached for other files and may speed up those accesses.

VxFS provides the VX_GET_IOPARAMETERS ioctl to get the recommended I/O sizes to use on a file system. This ioctl can be used by the application to make decisions about the I/O sizes issued to VxFS for a file or file device. For more details on this ioctl, refer to vxfsio(7).


Raw asynchronous logical volumes

Some database vendors recommend using raw logical volumes for faster I/O. This is best implemented with asynchronous I/O.

The difference between asynchronous I/O and synchronous I/O is that async does not wait for confirmation of the write before moving on to the next task. This increases disk performance at the expense of robustness. Synchronous I/O waits for acknowledgement of the write (or of its failure) before continuing. The write may have physically taken place or may still be in the buffer cache, but in either case acknowledgement has been sent. In the case of async, there is no waiting.

To implement asynchronous I/O on HP-UX for raw logical volumes:

* Set the asyncdsk driver (Asynchronous Disk Pseudo Driver) to IN in the HP-UX kernel; this will require generating a new kernel and rebooting.

* Create the device file:

# mknod /dev/async c 101 0x00000#

The minor number (#) can be one of the following values:

0x000000 default
0x000001 enable immediate reporting
0x000002 flush the CPU cache after reads
0x000004 allow disks to timeout
0x000005 is a combination of 1 and 4
0x000007 is a combination of 1, 2 and 4

Note: Contact the database vendor or product vendor to determine the correct minor number for your application.

* Change the ownership to the appropriate group and owner:

# chown oracle:dba /dev/async

* Change the permissions:

# chmod 660 /dev/async

* Give the group MLOCK privileges. Edit /etc/privgroup (vi /etc/privgroup) and add one line:

dba MLOCK

then apply it:

# /usr/sbin/setprivgrp dba MLOCK

To verify whether a group has the MLOCK privilege, execute:

# /usr/bin/getprivgrp

Disk I/O monitoring tools

The two standard utilities to measure disk I/O are sar -d and iostat. In order to get a statistically significant sample, run them over a sufficient time to detect load variances. sar -d 5 100 is a good starting point (an 8.3-minute sample). The output will look similar to:

device   %busy  avque  r+w/s  blks/s  avwait  avserv
c1t6d0    0.80   0.50      1       4    0.27   13.07
c4t0d0    0.60   0.50      1       4    0.26    8.60

There will be an average printed at the end of the report.

%busy   Portion of time the device was busy servicing a request
avque   Average number of requests outstanding for the device
r+w/s   Number of data transfers per second (reads and writes) from and to the device
blks/s  Number of bytes transferred (in 512-byte units) from and to the device
avwait  Average time (in milliseconds) that transfer requests waited idly on queue for the device
avserv  Average time (in milliseconds) to service each transfer request (includes seek, rotational latency, and data transfer times) for the device

When the average wait (avwait) is greater than the average service time (avserv), it indicates the disk could not keep up with the load during that sample. When the average queue length (avque) exceeds the norm of 0.50, it is an indication of jobs stacking up. These conditions are considered to be a bottleneck. It is prudent to keep in mind how long these conditions last: if the queue flushes, or the avwait clears, in a reasonable time (for example, 5 seconds), it is not a cause for concern.

Keep in mind that the more jobs in a queue, the greater the effect on wait on I/O, even if they are small. Large jobs, those greater than 1000 blks/s, will also affect throughput. Also consider the type of disks being used: modern disk arrays are capable of handling very large amounts of data in very short processing times, handling loads of 5000 blks/s or greater in under 10 ms. Older standard disks may show far less capability. The avwait is similar to the %wio returned by sar -u for the CPU.

IOSTAT

Another way to sample disk activity is to run iostat with a time interval, for example: iostat 5

iostat iteratively reports I/O statistics for each active disk on the system. Disk data is arranged in a four-column format:

Column Heading   Interpretation
device           Device name
bps              Kilobytes transferred per second
sps              Number of seeks per second
msps             Milliseconds per average seek

If two or more disks are present, data is presented on successive lines for each disk.

To compute this information, seeks, data transfer completions, and the number of words transferred are counted for each disk. Also, the state of each disk is examined HZ times per second (as defined in <sys/param.h>) and a tally is made if the disk is active. These numbers can be combined with the transfer rates of each device to determine average seek times for each device.

With the advent of new disk technologies, such as data striping, where a single data transfer is spread across several disks, the number of milliseconds per average seek becomes impossible to compute accurately. At best it is only an approximation, varying greatly based on several dynamic system conditions. For this reason, and to maintain backward compatibility, the milliseconds per average seek (msps) field is set to the value 1.0.

Options

iostat recognizes the following options and command-line arguments:

-t Report terminal statistics as well as disk statistics. Terminal statistics include:

   tin   Number of characters read from terminals
   tout  Number of characters written to terminals
   us    Percentage of time the system has spent in user mode
   ni    Percentage of time the system has spent in user mode running low-priority (nice) processes
   sy    Percentage of time the system has spent in system mode
   id    Percentage of time the system has spent idling

interval Display successive lines, which are summaries of the last interval seconds. The first line reported is for the time since a reboot and each subsequent line is for the last interval only.

count Repeat the statistics count times.

EXAMPLES

1. Show current I/O statistics for all disks:

iostat

2. Display I/O statistics for all disks every 10 seconds until INTERRUPT or QUIT is pressed:

iostat 10

3. Display I/O statistics for all disks every 10 seconds and terminate after 5 successive readings:

iostat 10 5

4. Display I/O statistics for all disks every 10 seconds, also show terminal and processor statistics, and terminate after 5 successive

readings:

iostat -t 10 5

WARNINGS

Users of iostat must not rely on the exact field widths and spacing of its output, as these will vary depending on the system, the release of HP-UX, and the data to be displayed.


Module 6

Network Performance

Excessive demand on an NFS server.

LAN bandwidth limitations

Guidelines

Keep NFS servers and their clients on the same LAN segment or subnet. If this is not practical, and you have control over the network hardware, use switches, rather than hubs, bridges and routers, to connect the workgroup.

As far as possible, dedicate a given server to one type of task. For example, in our sample network (see A Sample Workgroup / Network), flserver acts as a file server, exporting directories to the workstations, whereas appserver is running applications. If the workgroup needed a web server, it would be wise to configure it on a third, high-powered system that was not doing other heavy work.

On file servers, use your fastest disks for the exported file systems, and for swap. Distribute the workload evenly across these disks. For example, if two teams are doing I/O-intensive work, put their files on different disks or volume groups. Distribute the disks evenly among the system's I/O controllers. Make sure servers have ample memory.

For exported HFS file systems, make sure the NFS read and write buffer sizes on the client match the block size on the server. You can set these values when you import the file system onto the NFS client; see the Advanced Options pop-up menu on SAM's Mounted Remote File Systems screen. See Checking NFS Server/Client Block Size for directions for checking and changing the values.

Enable asynchronous writes on exported file systems. For HFS, set fsasync to 1 in the kernel.

Make sure enough nfsd daemons are running on the servers. As a rule, the number of nfsds running should be twice the number of disk spindles available to NFS clients. For example, if a server is exporting one file system, and it resides on a volume group comprising three disks, you should probably be running six nfsds on the server.
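As a rough sketch of applying that rule of thumb, count the nfsd daemons currently running and adjust the configured count; on HP-UX the value is set via the NUM_NFSD variable in /etc/rc.config.d/nfsconf (the count of 6 below matches the three-spindle example):

# Count the nfsd daemons currently running
ps -ef | grep -c '[n]fsd'
# Set NUM_NFSD=6 in /etc/rc.config.d/nfsconf, then restart NFS:
/sbin/init.d/nfs.server stop
/sbin/init.d/nfs.server start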

TIPS

Monitor server memory frequently.

Keep exported files and directories as small as possible. Large files require more NFS operations than small ones, and large directories take longer to search. Encourage your users to weed out large, unnecessary files regularly (see Finding Large Files).

Monitor server and client performance regularly.

In practice, a server is dealing with many I/O requests at a time, and intelligence is designed into the drivers to take account of the current head location and direction when deciding on the next seek. This means that defragmenting an HFS file system on HP-UX may never be necessary; JFS file systems, however, do need to be defragmented regularly.
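As a hedged example of locating large files under an exported directory (the path and size threshold are hypothetical; -size counts 512-byte blocks, so +20480 is roughly 10 MB):

find /export -type f -size +20480 -exec ls -l {} \;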

Tuning Fibre Channel Network Performance


Two TCP extensions have been designed to improve performance over paths with large bandwidth-delay products, and to provide reliable operation over very high-speed paths. The first extension increases the number of TCP packets that can be sent before the first packet sent is acknowledged. This is called window scaling. The second extension, time stamping, provides a more reliable data delivery mechanism. The following two options turn these extensions on or off in the kernel:

tcp_dont_winscale

Setting this option to 0 means do window scaling; any non-zero value means do not do window scaling.

tcp_dont_tsecho

Setting this option to 0 means enable the time stamp option; any non-zero value disables the time stamp option.

Two TCP variables, which determine the amount of memory used for socket buffers, affect the window scaling and time stamp options. The following are the default settings for HSC Fibre Channel recommended by Hewlett-Packard:

tcp_sendspace = 0x30000

tcp_recvspace = 0x30000

To change these defaults, copy, modify, and execute the following script. Since your performance improvement depends on a number of factors, including available memory and network load, you may want to experiment with these settings. Hewlett-Packard recommends that you enable both window scaling and time stamping.

/****************Begin Script***************************/

#!/bin/ksh
adb -w /stand/vmunix /dev/kmem << EOF
#This script is used for changing sb_max, tcp_sendspace
# and tcp_recvspace in the live kernel (/dev/kmem).
#
tcp_sendspace/W 20000
tcp_recvspace/W 20000
#For HSC Fibre Channel, use tcp_sendspace/W 30000
#For HSC Fibre Channel, use tcp_recvspace/W 30000
#This script is used for changing sb_max, tcp_sendspace
# and tcp_recvspace in the on-disk kernel (/hp-ux).
#
tcp_sendspace?W 20000
tcp_recvspace?W 20000
#For HSC Fibre Channel, use tcp_sendspace?W 30000
#For HSC Fibre Channel, use tcp_recvspace?W 30000
#window scaling enabled in live kernel
tcp_dont_winscale/W 0
#window scaling enabled in the kernel
tcp_dont_winscale?W 0


#Timestamp option enabled in live kernel
tcp_dont_tsecho/W 0
#Timestamp option enabled in the kernel
tcp_dont_tsecho?W 0
EOF

/******************End Script********************************/

Tuning Fibre Channel Mass Storage Performance

Two parameters are available for configuring HP Fibre Channel/9000 for maximum mass storage performance. The first parameter controls the type of memory that is allocated by the system; this is dependent upon the number of FC adapters in the system. The second parameter may be used to override the default number of concurrent FCP requests allowed on the adapter. The optimal number of concurrent requests is dependent upon a number of factors, including device characteristics, I/O load, and host memory. The following two options set these parameters in the kernel:

num_tachyon_adapters

Set this parameter to the number of HSC Fibre Channel adapters in the system.

max_fcp_reqs

Set this parameter to the number of concurrent FCP requests allowed on the adapter. The default value is 512.
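These are kernel tunables; as a sketch, they could be set in the kernel configuration file before regenerating the kernel (values hypothetical, assuming two adapters):

# In /stand/system:
num_tachyon_adapters 2
max_fcp_reqs 512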

Module 7

General Tools

It is important to keep in mind that no single utility is completely accurate. The shortest sample period of any utility is 1 second. Considering that every CPU will have no fewer than 10 processes running per second, at best we are only getting an approximation of the true activity.

When analyzing performance data, it is wise to look at all the data, as problems often have causes related to resource issues in regions of the system other than what appears to be the obvious cause.

ADB – adb is a general-purpose debugging program that is sensitive to the underlying architecture of the processor and operating system on which it runs. It can be used to examine files and provide a controlled environment for executing HP-UX programs. For 64-bit kernels, adb64 is invoked. It operates on assembly language programs. It allows you to look at object files and "core" files that result from aborted programs, to print output files in a variety of formats, to patch files, and to run programs with embedded breakpoints.

Information on adb can be found at http://docs.hp.com/ in the following manuals:

HP-UX 11 Reference Manual, volume 1, section 1
ADB Tutorial
Streams/UX for the HP 9000 Reference Manual

adb can also be used to gather system information useful in performance tuning.

To determine the physical memory (RAM), for HP-UX 10.x:

echo physmem/D | adb /stand/vmunix /dev/kmem

physmem:
physmem:        24576

For HP-UX 11.x systems running on 32-bit architecture:

echo phys_mem_pages/D | adb /stand/vmunix /dev/kmem

phys_mem_pages:
phys_mem_pages: 24576


for HP-UX 11.x systems running on 64 bit architecture:

echo phys_mem_pages/D | adb64 /stand/vmunix /dev/mem

phys_mem_pages:
phys_mem_pages: 262144

The results of these commands are in memory pages; multiply by 4096 to obtain the size in bytes.

To determine the amount of lockable memory:

echo total_lockable_mem/D | adb /stand/vmunix /dev/mem

total_lockable_mem:
total_lockable_mem:     185280

For HP-UX 11.x systems running on 64-bit architecture:

echo total_lockable_mem/D | adb64 /stand/vmunix /dev/mem

To determine the number of free swap pages :

echo swapspc_cnt/D | adb /stand/vmunix /dev/kmem

swapspc_cnt:
swapspc_cnt:    216447

This will display the number of free swap pages. Multiply the number returned by 4096 for the number of free swap bytes.

To determine the processor speed:

echo itick_per_usec/D | adb /stand/vmunix /dev/mem

itick_per_usec:
itick_per_usec: 360

(itick_per_usec reports interval timer ticks per microsecond, which corresponds to the processor clock frequency in MHz.)

To determine the number of processors in use:

echo "runningprocs/D" | adb /stand/vmunix /dev/mem

runningprocs:
runningprocs:   2

To determine the number of pages of buffer cache (each 4 KB in size):

echo bufpages/D | adb /stand/vmunix /dev/mem

bufpages:
bufpages:       18848

To display kernel parameters using adb, use the parameter name:

echo nproc/D | adb /stand/vmunix /dev/mem

nproc:
nproc:  276

To determine the number of vxfs inodes in use:

echo vxfs_ninode/D | adb /stand/vmunix /dev/mem

vxfs_ninode:


vxfs_ninode:    64000

To determine the kernel you are booted from:

For HP-UX 10.x:

echo 'boot_string/S' | adb /stand/vmunix /dev/mem

boot_string:
boot_string:    disc(52.6.0;0)/stand/vmunix

For HP-UX 11.x, for example:

echo 'boot_string/S' | adb /stand/vmunix /dev/mem

boot_string:
boot_string:    disk(0/0/2/0.6.0.0.0.0.0;0)/stand/vmunix

IPCS - report status of interprocess communication facilities

ipcs displays certain information about active interprocess communication facilities. With no options, ipcs displays information in short format for the message queues, shared memory segments, and semaphores that are currently active in the system.

Options

The following options restrict the display to the corresponding facilities:

(none) This is equivalent to -mqs.
-m Display information about active shared memory segments.
-q Display information about active message queues.
-s Display information about active semaphores.

The following options add columns of data to the display. See "Column Descriptions" below.

(none) Display default columns: for all facilities: T, ID, KEY, MODE, OWNER, GROUP.
-a Display all columns, as appropriate. This is equivalent to -bcopt.
-b Display largest-allowable-size information: for message queues: QBYTES; for shared memory segments: SEGSZ; for semaphores: NSEMS.
-c Display creator's login name and group name: for all facilities: CREATOR, CGROUP.
-o Display information on outstanding usage: for message queues: CBYTES, QNUM; for shared memory segments: NATTCH.
-p Display process number information: for message queues: LSPID, LRPID; for shared memory segments: CPID, LPID.
-t Display time information: for all facilities: CTIME; for message queues: STIME, RTIME; for shared memory segments: ATIME, DTIME; for semaphores: OTIME.

The following options redefine the sources of information:

-C core Use core in place of /dev/kmem. core can be a core file or a directory created by savecrash or savecore.
-N namelist Use file namelist, or the namelist within core, in place of /stand/vmunix.
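For example, a representative invocation that lists active shared memory segments with their maximum sizes, creators, and outstanding attaches:

ipcs -mbco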

Column Descriptions

The column headings and the meaning of the columns in an ipcs listing are given below. The columns are printed from left to right in the order shown below.

T Facility type:
  m  Shared memory segment
  q  Message queue
  s  Semaphore


ID The identifier for the facility entry.
KEY The key used as an argument to msgget(), semget(), or shmget() to create the facility entry. (Note: The key of a shared memory segment is changed to IPC_PRIVATE when the segment has been removed, until all processes attached to the segment detach it.)
MODE The facility access modes and flags. The mode consists of 11 characters that are interpreted as follows.

The first two characters can be:
  R  A process is waiting on a msgrcv().
  S  A process is waiting on a msgsnd().
  D  The associated shared memory segment has been removed. It will disappear when the last process attached to the segment detaches it.
  C  The associated shared memory segment is to be cleared when the first attach is executed.
  -  The corresponding special flag is not set.

The next 9 characters are interpreted as three sets of three characters each. The first set refers to the owner's permissions, the next to permissions of others in the group of the facility entry, and the last to all others. Within each set, the first character indicates permission to read, the second character indicates permission to write or alter the facility entry, and the last character is currently unused.
  r  Read permission is granted.
  w  Write permission is granted.
  a  Alter permission is granted.
  -  The indicated permission is not granted.

OWNER The login name of the owner of the facility entry.
GROUP The group name of the group of the owner of the facility entry.
CREATOR The login name of the creator of the facility entry.
CGROUP The group name of the group of the creator of the facility entry.
CBYTES The number of bytes in messages currently outstanding on the associated message queue.
QNUM The number of messages currently outstanding on the associated message queue.
QBYTES The maximum number of bytes allowed in messages outstanding on the associated message queue.
LSPID The process ID of the last process to send a message to the associated message queue.
LRPID The process ID of the last process to receive a message from the associated message queue.
STIME The time the last msgsnd() message was sent to the associated message queue.
RTIME The time the last msgrcv() message was received from the associated message queue.
CTIME The time when the associated facility entry was created or changed.
NATTCH The number of processes attached to the associated shared memory segment.
SEGSZ The size of the associated shared memory segment.
CPID The process ID of the creating process of the shared memory segment.
LPID The process ID of the last process to attach or detach the shared memory segment.
ATIME The time the last shmat() attach was completed to the associated shared memory segment.
DTIME The time the last shmdt() detach was completed on the associated shared memory segment.
NSEMS The number of semaphores in the set associated with the semaphore entry.
OTIME The time the last semop() semaphore operation was completed on the set associated with the semaphore entry.

WARNINGS

ipcs produces only an approximate indication of actual system status because system processes are continually changing while ipcs is acquiring the requested information. Do not rely on the exact field widths and spacing of the output, as these will vary depending on the system, the release of HP-UX, and the data to be displayed.

SAR - The System Activity Reporter

The sar utility is available on all systems and can provide valuable data to assist in identifying problems and making changes to optimize the system's efficiency. It is one of the least intrusive tools for gathering performance-related statistics. Interpretation of the data takes time and experience. It is important when gathering any data to get a statistically significant sample. For analysis purposes, a sample every 5 seconds for at least 100 iterations is the smallest amount of data to consider. For example, to look at the disk I/O on a system, run sar -d 5 100.

Only disks with activity will report; you may see some samples during the report in which not all disks are present. This only indicates there was no disk activity during that sample period.

device   %busy  avque  r+w/s  blks/s  avwait  avserv
c1t6d0    0.80   0.50      1       4    0.27   13.07
c4t0d0    0.60   0.50      1       4    0.26    8.60

Keep in mind that read and write transactions are system calls. When an application is producing a heavy load on disk, the %sys value in the CPU reports may appear higher than expected.

Data can be obtained on the following areas:

-d      Block device (disk or tape)
-b      Buffer cache
-u, -q  CPU use and run queue
-a      File system access routines
-m      Message and semaphore activity
-c      System calls
-w      System swapping and context switching
-v      System tables: process, inode, file
-y      TTY device

On multiprocessor systems you must use the -M switch for a detailed report of each CPU.

Time

/usr/bin/time is a UNIX command that can be used to run a program and determine what percentage of time is being spent in user code and what percentage is being spent in the system. Upon completion, time prints the elapsed time during the command, the time spent in the system, and the time spent executing the command. Times are reported in seconds. Execution time can depend on the performance of the memory in which the program is running. The times are printed on standard error. For example:

$ /bin/time bdf
real 15.2
user 11.4
sys 0.4

Timex

When run without switches, this is equivalent to the time command. The timex command can be useful for determining the impact a command has on the system.

-o Report the total number of blocks read or written and total characters transferred by command and all its children. This option works only if the process accounting software is installed.

-p [fhkmrt] List process accounting records for command and all its children. The suboptions f, h, k, m, r, and t modify the data items reported. They behave as defined in acctcom(1M). The number of blocks read or written and the number of characters transferred are always reported. This option works only if the process accounting software is installed and /usr/lib/acct/turnacct has been invoked to create /var/adm/pacct.

-s Report total system activity (not just that due to command) that occurred during the execution interval of command. All the data items listed in sar(1M) are reported.
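For example, to capture total system activity during the execution of a command (the command shown is arbitrary):

timex -s bdf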

Tools available for purchase

Glanceplus

While Glance is not available on all systems, it is a key diagnostic tool to use when available. It is resource intensive, so running it while the system is already under severe load may impose an unacceptable additional burden. The OpenView GlancePlus Concepts Guide is available at:

http://ovweb.external.hp.com/ovnsmdps/pdf/concepts_lx.pdf

The following white paper is available internally at :

http://rpm-www.rose.hp.com/rpm/papers/glanceff.htm

Using Glance Effectively

Doug Grumann, Hewlett-Packard Company

Introduction

Many people have used Glance, which is a powerful tool for system performance diagnosis. Although Glance is very popular, many users do not take full advantage of the capabilities of the product, or do not understand how its many metrics can be used to optimize their systems' performance.

In this article, I will present some background information on general system performance principles, cover some tips and techniques for getting the most from Glance, and list some common performance problems and how Glance can be used to characterize them. I'll also discuss how to customize your use of Glance to best suit your environment. This article is intended primarily for those who have a basic knowledge of the product; it is not intended as a tutorial for new users.

Performance Analysis

Many articles have been written on the art of system performance analysis. In an ideal situation, performance tools would not be necessary at all. Your computer system would optimize its resources automatically, and continually adjust its behavior based on the workload. In reality, it is up to system administrators to optimize system performance manually. I believe that tuning performance will always remain somewhat of an art. There are too many variables and dependencies in constant flux for a self-diagnostic to handle. For example, even engineers who write the HP-UX operating system cannot always determine the performance impact of every change and feature they code into the kernel. This is one reason why we have user-configurable kernel parameters and options such as disk mirroring, the logical volume manager, and commands to adjust process scheduling priorities. These facilities allow you to manage your configuration to best optimize the performance of your particular system. Different features affect performance in different ways. To optimize performance in your environment, you need to understand your workload and understand the major resources of the system which may be under stress.

Let's briefly review some of the guiding principles of performance analysis:

Know your system. Your task of solving a performance problem will be much harder if you don't know what the system looks like when it is performing well. If you're lucky and proactive, you can get an understanding of the normal everyday workloads, throughput, and response times of the systems you manage before a performance crisis occurs. Then when you later take steps to tune a system, you'll have baseline knowledge to compare against.

Define the symptom. Users like to say things like "the system's too slow," but subjective complaints are hard to address. Before you start changing things, define exactly what's wrong and try to set goals so that you'll know if you were successful. Many administrators use response time or throughput metrics to define their goals. Try to find something quantifiable, and write the goals down along with your measurements.

Characterize the bottleneck. People who do performance analysis consulting use the term bottleneck a lot. A bottleneck is a resource which is at maximum capacity, and cannot keep up with the demands being placed on it. In other words, the bottlenecked resource is the part of the computer responsible for the system not running faster. A more powerful CPU will do you no good if your performance bottleneck is in the disk I/O subsystem. Measuring performance with a tool like Glance allows you to characterize which resources are constrained so you can determine how to alleviate the bottleneck.


Change one thing at a time. Once you've isolated a performance problem and you decide how to address it, change only one thing at a time. If you change more than one thing at once, then you will not know which change helped performance. It's also possible that one change will improve performance while another makes it worse, but you won't know that unless you implement them separately and measure performance in-between.

A complete discussion of performance analysis might include information on topics such as benchmarking, system sizing, workgroup computing, and capacity planning. Other HP products such as MeasureWare and PerfView address more long-term performance data collection and analysis needs. These topics are beyond the scope of this article. I will concentrate on the area of performance analysis that Glance is made to address: single-system on-line performance diagnosis.

Glance Overview

The Glance product is available on several platforms including HP-UX, Solaris, and AIX. I will focus this material on HP-UX Glance. Note that the implementations of Glance differ in minor ways on the different platforms. In all cases, the purpose of the product is to address the "what's going on right now" type of question for system administrators.

There are two user interfaces for Glance. The original interface is a curses-based character mode interface named simply glance. Two years ago, a second user interface was added to the product. This Motif-based interface is named gpm. You may use either or both programs to display performance data. The gpm interface imposes more memory and CPU overhead on your system; however, you may find it more intuitive, and some of its features go beyond what the character-mode interface provides. For the remainder of this article I will refer to gpm exclusively; however, most of the examples apply equally well to either interface.

People often ask me why the data shown in Glance sometimes differs from the data shown by tools such as sar, vmstat, iostat, and top. Most often, the discrepancies stem from the underlying collection methodology. Glance uses special tracing features in the HP-UX kernel which are translated into performance metrics via the midaemon process. The "kmem" tools like top get their data from counters in the kernel which are maintained by sampling instrumentation. Because a tracing methodology can capture all system state information, it is more accurate than data which is obtained via periodic sampling.

I strongly encourage new Glance users to get into gpm's award-winning on-line help subsystem, and view its Guided Tour topics. The Guided Tour introduces you to the product and its concepts. Experienced Glance users (like myself!) also find the on-line help invaluable, with its topics such as Adviser Syntax Reference and Performance Metric Definitions.

Top-Down Approach

There are over 1000 performance metrics accessible from Glance. You do not need to understand even a small percentage of them in order to get your work done. This is because the tool organizes its data in two different ways so that you only need to look at the metrics important to your situation. First of all, metrics are organized according to resource: there is a group of reports and graphs oriented around CPU, Memory, I/O, Networking, Swap, and System Tables. If your system is having no problems with I/O, then you need never investigate the reports in that area. Secondly, metrics are organized from a Global level down to an Application level and finally down to a Process level. As a whole, global metrics show you an overall summarization of what is going on with the entire system. Application metrics allow you to group your system's workload into sets of processes representing different tasks. Then you can compare the relative impact of different applications on overall performance. Process metrics let you zoom in on specific processes and their individual attributes.

Use Glance in a top-down manner to be most effective. When you first start gpm, the main graphical display will show you four potential bottleneck areas of performance: CPU, Memory, Disk, and Network. Each of these areas is represented by a graph and a button. The graphs show metrics for these resources over time, while the buttons give you status on adviser symptoms. The gpm adviser is a complex but powerful feature of Glance that I will delve more into later. For now, just note that the color of the buttons can be your first clue to a performance bottleneck. If the CPU button turns yellow or red, then this means you should investigate the utilization of the CPU resource. Use the main window to determine which area might be impacting performance. Drill down into report screens for that resource to characterize the problem further. Then use the Application or Process list reports to pinpoint the cause of the problem. Once you are down to the process level, you can determine which actions to take to correct the situation. It sounds easy, huh? Before we go into some examples, let's discuss a few important techniques.

Applications

It's useful to view Application data in Glance as an intermediate step between Global and Process data. If you manage hundreds of diverse systems, you may not have the time to group processes into applications on each system. Likewise, if you are managing systems that only have essentially one application on them, then tuning your parm file application groupings may not be a good use of your time. On the other hand, if you are doing frequent performance analysis on multi-user systems, application groupings can be very useful. Frequently on my systems I'll have separate applications defined for backups and builds. Without looking at individual processes, I can quickly tell if my backups have been running too long or if a software build is interfering with my NFS server's other activities. Just keep in mind that if you don't want to use application groupings, you don't have to. Neither global nor process data is affected by your application parm file definitions.

Sorting, Arranging, and Filtering

Too often I see gpm users scrolling through hundreds of lines of Process List detail, looking for an item of interest. It would save them time to just set up some filtering or sorting for the Process List report to bring the data they want into one window. For example, I usually set up a sort by Current CPU so that the processes that are most active will be at the top of the list. The default column arrangement can also be changed. For example, you can bring the memory fields RSS and VSS into view and sort by those fields if you are looking for memory hogs. Filtering allows you to set up intricate thresholds based on what type of data you'd like to view or highlight. For example, you can filter on the Physical I/O field so the Process List will only report processes doing I/O, and you can highlight processes that exceed a threshold you define.

Remember that your customizations of gpm such as report field column arrangements, sort fields, and filter values (as well as colors, fonts, and measurement intervals) are saved in a ".gpmhp" file in your home directory. This file saves your customizations so that they stay in effect between invocations of gpm. Normally, I like to run gpm under my own non-root user login so that other people who share the root login with me won't change my gpm settings. If you have several users who share root access on a system, you can also create separate su accounts for them with different home directories so they keep separate gpm configurations.

Overhead

Any performance tool will impose a certain amount of additional overhead on the system. In the case of Glance, this overhead is significant only for CPU and memory. There is a tradeoff here: the more data you want to gather, the more overhead is required to get the data. If you're concerned about CPU overhead, you can reduce the impact by running Glance with longer update intervals. One trick I've used in the past is to set the update interval way up to, say, 5 minutes, and then use the Update Now menu selection (just a carriage return in character-mode glance) to update intermittently when I want to see fresh data. You'll notice that gpm's memory usage is higher than character-mode glance's because it loads the Motif libraries. With systems getting faster and bigger all the time, you rarely need to be concerned about Glance overhead.
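A sketch of the long-interval trick described above, assuming the character-mode glance accepts -j to set the update interval in seconds (check glance(1) on your release):

$ glance -j 300    # update every 5 minutes; press Return for an immediate refresh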

New in Glance

New features are always being added to Glance. The gpm interface now has a continuous on-item help feature which lets you get instant information about any metric. The "choose metric" functionality of gpm allows you to select exactly which metrics go into the report windows. The glance character-mode interface now uses the Adviser technology (you'll also find the adviser alarm syntax used in HP MeasureWare). There have been reports added for Distributed Computing Environment (DCE) support and Transaction Tracker metrics. Transaction Tracker is a user-defined transaction library which is bundled with HP MeasureWare. Glance for HP-UX 10.0 has several new features including new Disk I/O detail information, Global and Process-level System Calls reports, and reporting and manipulation of Process Resource Manager variables. Details about these new features are found in the product Release Notes and on-line help.

Examples

We'll now go through a few examples of using Glance to address specific performance problems. Hopefully these will provide insight as to how to drill down into the data to characterize problems, but remember that every system is different and it's impossible to cover even a small percentage of all possible performance problem scenarios. Note that gpm's on-line help contains a few short case studies under its Performance Sleuthing topic that might also be useful to you.

CPU Bottlenecks

Let's say the main window shows the CPU to be 100% utilized. This might mean that the CPU is bottlenecked, but then again it may not. Realize that it is good to have a resource fully utilized: it means that your system is fully taking advantage of its capabilities. On a single user workstation, the CPU might always be 100% busy because the user has their x11fish backdrop program running to entertain and distract visitors. If overall performance is fine, there is no problem. On another system, however, the CPU might be 100% busy and users might be complaining because their response time has fallen off dramatically. Then it's time to delve deeper into Glance.

The CPU graph and report will tell you whether there is contention for the CPU. If so, then you may want to go straight to the Process List and sort on the top CPU consumers. The simplest common source of CPU contention is a runaway process. Often shell programmers will get scripts stuck in a loop, and sometimes they'll leave the loops active. When you see a process using as much of the CPU as it can get, spending all its time in User mode, and doing no I/O, then it might be looping. Check with the owner of the process. Compiles are also a major culprit in CPU bottlenecks. In software development environments, I've seen cases where a whole project team was slowed down because they all were doing their build on the same system. By mounting the source on an NFS server, we separated the compiles onto different systems and alleviated the bottleneck.
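Outside of Glance, a quick way to list the top CPU consumers is the XPG4 ps; a sketch (setting UNIX95 for the command enables the -o format option on HP-UX):

$ UNIX95= ps -e -o pcpu,pid,comm | sort -rn | head    # ten busiest processes by %CPU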

On a multiprocessor system, CPU bottlenecks can be very interesting to diagnose. For example, on a 2-way MP system, a looping process can consume 100% of one processor while leaving the other processor idle. Also, it might "bounce" between processors, keeping each about 50% busy. Note that Glance normalizes CPU at the global level, so 100% busy in the main window means that all processors are 100% busy. The CPU By Processor report will show you how this breaks down into individual processor loads. The Process List report, since it is oriented on a process level, does not normalize CPU utilization. In our example of a looping process on a 2-way MP system, the global CPU utilization might show 50% but the looping pid will show 100% utilization in the Process List. This makes sense because if you had an 8-way MP system you would want to see that process stand out: a single process in that environment could only use 12.5% of the overall system CPU resource because it could only be active on one processor at any one time.

In gpm, double-clicking on a process in the Process List gets you into the Process Resource report. This report is very useful because it shows you a lot of detail about what the process is doing. For example, some processes that use timers often have a very high proportion of System CPU, and you'll see a lot of context switching and perhaps Signals Received. I've sometimes surprised developers by showing them gpm screens of their programs in action, doing outrageous things like opening and closing 50 files a second.

Memory Bottlenecks

Often in today's Unix environments, it is normal to see physical memory fully utilized. This is a sign of a well-tuned system, because access to memory is so much faster than access to disk. Keeping text and data pages in memory speeds up subsequent references, and only becomes a problem when processes try to allocate more memory and the system is forced to flush buffers, page out data, or (worst case) start swapping. Sometimes, a memory bottleneck will disguise itself as a disk bottleneck, because memory management activities cause disk I/O. Normally, a good rule of thumb is to avoid swapping at all costs.

A certain amount of paging is very normal (especially page-ins), but swapping only occurs when there are excessive demands for physical memory. Note: in HP-UX 10.0, swapping is called deactivation, but the basic concept remains the same.
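To check from the command line whether a system is merely paging or actually swapping, the standard HP-UX utilities help; a sketch:

$ swapinfo -tam    # swap reservation and usage totals, in MB
$ vmstat 5 10      # watch the po column; sustained page-outs indicate real memory pressure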

Too often, I've seen the attitude that the simplest way to solve any performance problem is to buy more memory. Although this solution frequently works because memory bottlenecks are common, there are many instances where a cheaper alternative exists. In HP-UX 9.0, some systems experienced memory bottlenecks caused by the dynamic buffer cache growing too large. The buffer cache speeds up filesystem I/O, but on some systems it can grow too large and start causing excessive paging. Many administrators are familiar with the dynamic buffer cache patch to HP-UX which puts a limit on the size the cache can grow. In 10.0, there are dbc_min and dbc_max kernel parameters that allow you to fine-tune the cache to meet your needs. In most instances, I've found the 10.0 default values for these variables to be appropriate.
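On HP-UX 11.x the current buffer cache limits can be queried with the kernel tuning utility; a sketch (on 11.x the tunables are named dbc_min_pct and dbc_max_pct):

$ kmtune -q dbc_min_pct
$ kmtune -q dbc_max_pct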

Glance's Memory report shows you the relative amount of physical memory allocated to the system, user pages, and the buffer cache. The System Tables report will help you decide if you should reconfigure the kernel with larger or smaller table sizes. Normally, you want to allocate enough space in tables so that you never have to worry about running out (the same goes for swap areas). If your system is tight on memory, you may consider reducing the size of some tables in order to make the kernel smaller, leaving more room for user data. I've seen systems running with unnecessarily huge values for maxusers or nproc, which increases the size of the kernel and can impact performance. A single user workstation does not normally need nproc values over 1000!
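System table utilization can also be sampled outside of Glance; a sketch using sar:

$ sar -v 5 10    # the proc-sz, inod-sz and file-sz columns show used/configured table entries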

Isolating memory hogs in gpm's process list is easy: sort or filter the report on the Resident Memory Set Size or Virtual Memory Set Size. Processes that allocate a lot of memory may have huge VSS values, most of which is paged or swapped out most of the time. I've often spotted programs with memory leaks by just watching their VSS size grow over time. Sometimes a developer will not find a memory leak during testing because a program is never executed for very long periods, but then when the program moves into production, the process will slowly consume memory over a matter of days until it aborts. The Process Memory Regions report available for individual processes in Glance is extremely useful for debugging memory problems.


Disk Bottlenecks

Disk bottlenecks are very common on multi-user systems and servers. Glance's I/O By Filesystem and I/O By Disk reports are extremely useful in isolating these problems. Look for filesystems and disks with consistently high activity. Look at the Disk Queues to see if there are a lot of I/Os waiting for service. I've often seen cases where all the most frequently-used files are on the root filesystem, which gets bottlenecked while other disks sit idle. Load balancing across disks can be an easy way to improve performance. Common techniques include moving swap areas and heavily accessed filesystems off the root disk, or using disk striping, LVM and/or mirroring to spread I/Os out across multiple disks. You can use Glance to verify the effectiveness of these methods. For example, LVM mirroring can improve read performance but degrade write performance. Using Glance, you can look at a volume that you're considering mirroring in order to verify that many more read than write I/Os are occurring.
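As an illustration of the striping technique mentioned above (the volume group, size, and volume name here are hypothetical):

$ lvcreate -i 2 -I 64 -L 1024 -n lvol_data /dev/vg01    # 1 GB volume striped across 2 disks in 64 KB stripes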

The filesystem buffer cache, mentioned above, is very important in understanding disk bottlenecks. If your workload is filesystem disk I/O intensive, a large buffer cache can be useful in reducing the number of disk I/Os. However, certain environments such as database servers using raw disk partitions don't make use of the buffer cache, and so a large buffer cache could hurt performance by wasting memory.

Ideally, what you'd like to see on your system are processes doing lots of logical I/Os and very few physical I/Os. Because of the effects of read-ahead and buffering, it isn't always easy to determine why an application is doing more or fewer physical I/Os, but Glance's Process Resource report and Process Open File report can be useful.

Network Bottlenecks

As systems rely more and more on the network, we've begun to see more instances of bottlenecks relating to network activity. Unfortunately, at a system level, there are not as many good metrics for isolating network performance problems as there are for other bottleneck areas. For network servers such as NFS servers, you can sometimes use the process of elimination to isolate a network bottleneck: if your server has ample CPU, memory, and disk resources but is still slow, it may be due to LAN bandwidth limitations. You can use Glance to look at the client side and server side simultaneously. Glance has several NFS reports which can be useful, especially NFS By System which will tell you which clients are pounding your server the hardest. One example I've seen is where a user was repeatedly executing the find command on NFS clients looking for old core files, but each find hit the same NFS-mounted disk over and over, causing needless overhead on the server.
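A couple of standard commands complement Glance's NFS reports when hunting network bottlenecks; a sketch:

$ nfsstat -s    # server-side RPC/NFS call counts; a high badcalls count suggests trouble
$ netstat -i    # per-interface packet and error counts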

In large environments, tools such as OpenView and PerfView are useful for monitoring the overall network; administrators can then turn to Glance to zoom in on the specific systems with the greatest activity.

The Adviser

Although you can use gpm effectively without ever even knowing about its Adviser feature, taking the time to understand it can be very profitable. I don't have space to cover this topic thoroughly, but it is covered extensively in gpm's on-line help.

Basically, think of the Adviser as a set of rules based on performance metrics which can be used to take different actions. If you look at the default Adviser syntax, you'll see some rules that control the colors of the four primary bottleneck indicators, and some supplementary rules that generate alarms for things like high system table utilization. These default rules control the color of the buttons on gpm's main window, and they are also visible in the Adviser Status and History reports. You should feel free to edit the rules to have more meaning for your own unique environment, because every system is different. You can always return to the default rules. Your customized rules are another aspect of the configuration of gpm that's stored in the .gpmhp file.

As an example of when you might want to change the default Adviser symptoms, let's say you have a large server system that's always fully CPU utilized, and frequently also has a high run queue. Although on many systems a large run queue is a CPU bottleneck indicator, this isn't always the case, especially on large servers. In our example, the high CPU utilization and the high run queue always make the CPU bottleneck symptom in the default Adviser syntax go red. It isn't helpful if a button is always red. You would want to edit the Adviser syntax to bump up the criteria so that a CPU red alert only goes off when the run queue exceeds, say, 10 instead of 3.

The full potential value of the Adviser is in adding syntax for your own particular environment. For example, let's say that you know from past experience that when the physical disk I/O rate exceeds a certain value, user response time degrades. If you are willing to let gpm stay running for longer periods, you can put in some adviser syntax which will alert you via email of the problem:

if gbl_disk_phys_io_rate > 2000 then exec "echo 'disk i/o rate is high' | mail root"

You can get really fancy with these rules. You can combine metrics, define variables, and use looping constructs. You can generate Alerts, execute Unix commands, and print information to gpm's stdout. What follows is a more complex example to illustrate Adviser "programming". More examples are in gpm's on-line help.

# check for high system-mode cpu utilization, and when it is high
# print the highest sys cpu consuming process.
if gbl_cpu_sys_mode_util > 50 then
{
  highestsys = 0
  process loop
    if proc_cpu_sys_mode_util > highestsys then
    {
      highestpid = proc_proc_id
      highestname = proc_proc_name
      highestsys = proc_cpu_sys_mode_util
    }
  print "--- High system mode cpu rate = ", gbl_cpu_sys_mode_util, " at ", gbl_stattime, " ---"
  print " Process with highest system cpu was pid ", highestpid|5|0, ", name: ", highestname
  print " which had", highestsys, " percent system mode cpu utilization"
}

Summary

In order to manage the performance of your systems effectively, you need to understand a little about the art of performance analysis, and you need a good tool like Glance. I encourage you to spend some time getting to know your system's performance characteristics before a problem occurs. When you are involved in a performance crisis, objectively define the symptoms of the problem, and then use them to guide you through analysis. Use Glance to characterize the bottlenecked resource. Follow the tool's top-down methodology to go from a high-level bottleneck down to the process responsible if possible. When you know what's wrong, make a change, but change only one variable in the environment at a time so you can gauge its success.

MeasureWare / PerfView

The PerfView product is an HP-UX Motif-based tool designed for analysis, monitoring, and forecasting of system performance and resource utilization data. Data collection and threshold alarming are provided by MeasureWare.

This is also known as the HP OpenView VantagePoint Performance Agent for HP-UX

Information on this product can be found in the HP OpenView Performance Agent for HP-UX Installation & Configuration Guide for HP-UX 10.20 and 11.x, available at http://docs.hp.com/
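On a system with the agent installed, the MeasureWare daemons are typically managed through the mwa script; a sketch (the path may vary by release):

$ /opt/perf/bin/mwa status    # check whether scopeux and the other daemons are running
$ /opt/perf/bin/mwa start     # start data collection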

Module 8

WTEC Tools

There are many useful tools available from the Worldwide Technical Expert Center (WTEC). Unless instructed by WTEC, or unless completely familiar with these tools, it is ill-advised to distribute them.

The following are the most appropriate tools for use by the Response Center.

kmeminfo

http://oceanie.grenoble.hp.com/georges/kmeminfo.html

A tool to troubleshoot kernel and user memory (VM) problems, by Georges Aureau, WTEC HP-UX.

Usage: kmeminfo [options ...] [coredir | kernel core]
Default: coredir="." if "INDEX" file present, else kernel="/stand/vmunix" core="/dev/kmem"

Options:
-summary
-dynamic
-bucket ...
-static
-user ...
-pid ... [-prot] [-parse]
-bufcache
-eqalloc
-help
-physical ...
-physmem
-pfdat ...
-sysmap
-kas
-vmtrace ...
-alias
-virtual ...
-pdk_malloc ...

When invoked with no option, kmeminfo tries to open the current directory as a dump directory; if the current directory does not contain a crash dump, it opens /stand/vmunix and /dev/kmem. It then prints statistics about physical memory utilisation, with a focus on memory allocated by the kernel. kmeminfo supports the McKusick bucket kernel allocator (HP-UX 10.x and 11.0).

Example: one error that is occasionally seen is "equivalent mapped reserve pool exhausted". By running kmeminfo -eqalloc you can determine the size of eqalloc, set eqmemsize to that value, and resolve the issue (a command sketch follows the examples below).

The example below shows a memory leak:

$ kmeminfo /dumps/LABS/Dumps/MemLeak.1100
Pfdat processing: Scanning 900603 pfdat entries (be patient) ...
Physical memory usage summary (in page/byte/percent):
Physmem  = 917248   3.5g 100% Physical memory
Freemem  =  24059  94.0m   3% Free physical memory
Used     = 893189   3.4g  97% Used physical memory
System   = 537002   2.0g  59% By kernel:
 Static   =  31028 121.2m   3% for text/static data
 Dynamic  = 229014 894.6m  25% for dynamic data
 Bufcache = 275174   1.0g  30% for buffer cache
 Eqmem    =     26 104.0k   0% for equiv. mapped memory
 SCmem    =   1760   6.9m   0% for critical memory
User     = 356183   1.4g  39% By user processes:
 Uarea    =   3704  14.5m   0% for thread uareas
Disowned =      4  16.0k   0% Disowned pages
Kernel dynamic memory usage (in page/byte/percent):
Dynamic  = 229014 894.6m  25% Kernel dynamic memory
 MALLOC   = 205447 802.5m  22% Memory buckets
  bucket[5]  =   1606   6.3m   0% size 32 bytes
  bucket[6]  =    150 600.0k   0% size 64 bytes
  bucket[7]  =   4472  17.5m   0% size 128 bytes
  bucket[8]  =   1586   6.2m   0% size 256 bytes
  bucket[9]  = 169755 663.1m  19% size 512 bytes

  bucket[10] =  20396  79.7m   2% size 1024 bytes
  bucket[11] =   1863   7.3m   0% size 2048 bytes
  bucket[12] =    318   1.2m   0% size 4096 bytes
  bucket[13] =    234 936.0k   0% size 2 pages
  bucket[14] =    102 408.0k   0% size 3 pages
  bucket[15] =      8  32.0k   0% size 4 pages
  bucket[16] =     70 280.0k   0% size 5 pages
  bucket[17] =    180 720.0k   0% size 6 pages
  bucket[18] =    490   1.9m   0% size 7 pages
  bucket[19] =    120 480.0k   0% size 8 pages
  bucket[20] =   4097  16.0m   0% size > 8 pages
 Reserved     =     13  52.0k   0% Reserved pools
 Kalloc       =  20304  79.3m   2% kalloc()
 SuperPagePool =     0   0.0k   0% Kernel superpage cache
 BufcacheBufs = 15353  60.0m   2% Buffer cache bufs
 BufcacheHash =  1280   5.0m   0% Buffer cache hash heads
 Other        =  3671  14.3m   0% Other...
 Eqalloc      =  3250  12.7m   0% eqalloc()
Checking bucket free list heads:
No corruption detected...

kmeminfo also supports the arena kernel allocator (HP-UX 11i and up). The example below shows the corruption of an arena free list head:

$ kmeminfo /dumps/pa/arena.11i
Pfdat processing: Scanning 2044581 pfdat entries (be patient) ...
Physical memory usage summary (in page/byte/percent):
Physmem  = 2093056   8.0g 100% Physical memory
Freemem  = 1630660   6.2g  78% Free physical memory
Used     =  462396   1.8g  22% Used physical memory
System   =  335982   1.3g  16% By kernel:
 Static   =   96457 376.8m   5% for text/static data
 Dynamic  =  100043 390.8m   5% for dynamic data
 Bufcache =  135290 528.5m   6% for buffer cache
 Eqmem    =      46 184.0k   0% for equiv. mapped memory
 SCmem    =    4146  16.2m   0% for critical memory
User     =  126406 493.8m   6% By user processes:
 Uarea    =    2984  11.7m   0% for thread uareas
Disowned =       8  32.0k   0% Disowned pages
Kernel dynamic memory usage (in page/byte/percent):
Dynamic  =  100043 390.8m   5% Kernel dynamic memory
 Arenas   =   64553 252.2m   3% Kernel arenas
  M_TEMP          = 19504  76.2m   1%
  M_SWAP          = 12468  48.7m   1%
  M_LVM           =  4761  18.6m   0%
  KMEM_ALLOC      =  4647  18.2m   0%
  ALLOCB_MBLK_LM  =  4052  15.8m   0%
  M_IOSYS         =  3368  13.2m   0%
  ALLOCB_MBLK_DA  =  2941  11.5m   0%
  M_SPINLOCK      =  2416   9.4m   0%
  VFD_BT_NODE     =  1312   5.1m   0%
  ALLOCB_MBLK_SM  =  1296   5.1m   0%
  M_DYNAMIC       =  1067   4.2m   0%
  KMEM_VARFLIST_H =   882   3.4m   0%
  ALLOCB_MBLK_MH  =   780   3.0m   0%
  M_PREG          =   597   2.3m   0%
  M_REG           =   590   2.3m   0%
  Other           =  3872  15.1m   0% Other arenas...
 Kalloc        = 35399 138.3m   2% kalloc()
 SuperPagePool = 17584  68.7m   1% Kernel superpage cache
 BufcacheBufs  = 11235  43.9m   1% Buffer cache bufs
 BufcacheHash  =  5120  20.0m   0% Buffer cache hash heads
 Other         =  1460   5.7m   0% Other...
 Eqalloc       =    91 364.0k   0% eqalloc()
Checking locked arena free list heads:
The following free list is locked:
kmem_arena_t 0x0000000040001240 "M_DYNAMIC"
kmem_flist_hdr_t 0x0000000040012480 (cpu 3, index 1, size 56)
Error while scanning a "M_DYNAMIC" free list:
kmem_flist_hdr_t 0x0000000040012480 (cpu 3, index 1, size 56)
kfh_head 0x000e40c500000000 (no translation!)
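A sketch of the eqmemsize fix mentioned above (kmtune is the 11.x kernel tuning utility; the value shown simply echoes the Eqalloc figure from the first example):

$ kmeminfo -eqalloc           # note the reported eqalloc size, e.g. 3250 pages
$ kmtune -s eqmemsize=3250    # set the tunable to match, then rebuild the kernel and reboot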

kmeminfo -summary

The -summary option prints the memory usage summary (see above default output).

kmeminfo -dynamic

The -dynamic option prints the memory usage summary and the kernel dynamic memory usage (see the default output above).

kmeminfo -static

The -static option prints the details of kernel static memory usage. For example:

$ kmeminfo -static /padumps/arena_corr
Pfdat processing: Scanning 2044581 pfdat entries (be patient) ...
Static kernel memory usage (in page/byte/percent):
Static = 96457 376.8m 5% Static memory
 Text   =  2308   9.0m 0% Text
 Data   =   450   1.8m 0% Data
 Bss    =   482   1.9m 0% Bss
 Tables = 93217 364.1m 4% System tables
  pfdat = 47919 187.2m 2% pfdat
Static system memory (size in bytes and pages):
Name                Start-End              Nent     Size
text                0x00020000-0x00924000  1        9453568
data                0x00924000-0x00ae6000  1        1843200
bss                 0x00ae6000-0x00cc89f0  1        1976816
phys_mem_tbl        0x00ce05a0-0x00ce05ac  1        12
sysmap_32bit        0x00defd40-0x00e47f40  22560    360960
sysmap_64bit        0x00e47f40-0x00ea0140  22560    360960
pgclasstab          0x013ce000-0x015cd000  2093056  2093056
mpproc_info         0x015d1000-0x015eae40  8        106048
htbl2_0             0x04000000-0x08000000  2097152  67108864
pfn_to_virt_ptr     0x0962e000-0x0962fff0  511      8176
pfn_to_virt         0x09630000-0x0b620000  2093056  33488896
inode               0x0b628000-0x0ba08000  8192     4063232
file                0x0ba08000-0x0bab8370  8202     721776
ncache              0x0bab8380-0x0bc8c380  13312    1916928
nc_hash             0x0bc8c380-0x0bccc380  16384    262144
nc_lru              0x0bccc380-0x0bcd6380  512      40960
callout_info_array  0x0bcd6380-0x0bcd6c00  8        2176
cfree               0x0bd56400-0x0bd6b3a0  2148     85920
physio_buf_list     0x0bd6b3c0-0x0be39a18  1409     845400
pfdat               0x0bdf3880-0x0be53880  4096     393216
pfdat               0x0bdfdc00-0x1797dc00  2048000  196608000
tmp_save_states     0x0be39a40-0x0be3be40  8        9216
memWindows          0x0be3be40-0x0be3bea0  2        96
quad4map_32bit      0x0be3f880-0x0be41510  457      7312
quad1map_64bit      0x0be41540-0x0be431d0  457      7312
quad4map_64bit      0x0be43200-0x0be44e90  457      7312
pfdat_ptr           0x0be44ec0-0x0be46eb0  511      8176
page_groups         0x0be4b400-0x0be4b4e0  7        224
space_map           0x17980000-0x17988000  262144   32768
kmem_lobj_hdr_tbl   0x40080000-0x400e3d10  25553    408848
Total accounted static memory = 78668 pages

kmeminfo -user [swap,max=<count>,all]

The -user option prints a list of user processes sorted by physical size (aka resident), or sorted by swap size when specifying the swap flag. kmeminfo scans the pregions and prints the following information:

virtual sums the pregions' p_count.
physical sums r_nvalid for private regions, and r_nvalid/r_refcnt for shared regions.
swap sums r_swalloc for private regions, and r_swalloc/r_refcnt for shared regions.
When used, r_refcnt is adjusted to skip references from pseudo-pregions.

The all flag also includes system daemons (as opposed to just user processes).

Some examples. Top 5 processes using the most physical memory:

$ kmeminfo -user max=5 /padumps/eqalloc.11o
Summary of processes memory usage:
List sorted by physical size, in pages/bytes:
              virtual         physical        swap
 pid  ppid  pages / bytes   pages / bytes   pages / bytes  command
3073     1  100586 392.9m   11013  43.0m    9421  36.8m   oninit
3083  3082  100586 392.9m   10827  42.3m    9443  36.9m   oninit
3084  3082  100586 392.9m   10824  42.3m    9427  36.8m   oninit
3090  3082  100591 392.9m   10413  40.7m    9417  36.8m   oninit
3088  3082  100586 392.9m   10410  40.7m    9415  36.8m   oninit
Total:                      53487 208.9m   47123 184.1m

Top 10 processes having reserved the most swap space:

$ kmeminfo -user swap,max=10 /iadumps/bbds_panic
Summary of processes memory usage:
List sorted by swap size, in pages/bytes:
               virtual         physical        swap
  pid   ppid  pages / bytes   pages / bytes   pages / bytes  command
24409  24363  268924   1.0g   1275   5.0m     3182  12.4m   sim
 1085      1    3758  14.7m   1741   6.8m     1658   6.5m   dced
 1207      1    3591  14.0m   1654   6.5m     1475   5.8m   swagentd
 1382   1370  269990   1.0g    761   3.0m     1344   5.2m   httpd
 1381   1370  269990   1.0g    761   3.0m     1344   5.2m   httpd
 1385   1370  269990   1.0g    761   3.0m     1344   5.2m   httpd
 1380   1370  269990   1.0g    761   3.0m     1344   5.2m   httpd
 1384   1370  269990   1.0g    761   3.0m     1344   5.2m   httpd
 1941   1370  269990   1.0g    761   3.0m     1344   5.2m   httpd
 1370      1  269974   1.0g    757   3.0m     1343   5.2m   httpd
Total:                        9993  39.0m    15722  61.4m

In the above example, there are a few processes with a virtual set size of 1 GB, but those processes are using only a few MB of swap space. This results from IA64 stacks being lazy-swap evaluated; see -pid 24409 below.

kmeminfo -pid <pid> [-prot] [-parse]

The -pid option prints the list of the pregions in the vas of the specified process pid. A pid of -1 selects all processes.

The type column gives the pregion type. Note that kmeminfo uses pseudo pregion types for shared libraries:

SHLDATA is an MMAP pregion with PF_SHLIB set (and PF_VTEXT clear).
SHLTEXT is an MMAP pregion with both PF_SHLIB and PF_VTEXT set.

The ref column is the region r_refcnt. The virtual column is the pregion p_count. The physical column is the region r_nvalid, i.e. the number of resident physical memory pages. The swap column is the region r_swalloc, i.e. reserved swap space. The total on the last line is computed as described above for the -user option. An example:

$ kmeminfo -pid 24409 /iadumps/bbds_panic
Process's memory regions (in pages):
Process "sim", pid 24409, 64bit ASL, R_SHARE_MAGIC:
type      space vaddr                          ref  virt    phys  swap
NULLDREF  0x27ad231.0x0000000000000000         412       1     1     1
TEXT      0x3aa6231.0x4000000000000000           2      36    35     0
DATA      0x0399031.0x6000000000000000           1     704   676   704
UAREA     0x1fe0831.0x8003ffff7fefc000           1      20    16    20
UAREA     0x1fe0831.0x8003ffff7ff10000           1      20    16    20
UAREA     0x1fe0831.0x8003ffff7ff24000           1      20    16    20
UAREA     0x1fe0831.0x8003ffff7ff38000           1      20    16    20
UAREA     0x1fe0831.0x8003ffff7ff4c000           1      20    16    20
......
RSESTACK  0x1fe0831.0x8003ffffbf7ff000           1    2048     1     2
STACK     0x1fe0831.0x8003ffffc0000000           1  262144     5     5
......
total                                              268924  1275  3182

When the -prot option is specified along with -pid, kmeminfo prints two additional columns: the key column gives the pregion p_hdl.hdlprot field, i.e. a protection id on PA-RISC or a protection key on IA64, and the ar column gives the pregion p_hdl.hdlar field, i.e. access rights.

$ ps
  PID TTY    TIME COMMAND
27385 pts/4  0:00 rlogind
27443 pts/4  0:00 ps
27387 pts/4  0:00 sh
$ kmeminfo -prot -pid 27387
Process's memory regions (in pages):
Process "sh", pid 27387, 32bit ASL, R_SHARE_MAGIC:
type      space vaddr                   key     ar    ref  virt  phys  swap
NULLDREF  0x522e800.0x0000000000000000  0x6865  URX    98     1     1     1
TEXT      0x522e800.0x0000000000001000  0x6865  URX    17    45    41     0
DATA      0x8b22c00.0x0000000040001000  0x1e24  URW     1    19    18    19
MMAP      0x8b22c00.0x000000007f7d2000  0x1e24  URWX    1     1     0     1
SHLDATA   0x8b22c00.0x000000007f7d3000  0x1e24  URWX    1     1     1     1
SHLDATA   0x8b22c00.0x000000007f7d4000  0x1e24  URWX    1     9     8     9
MMAP      0x8b22c00.0x000000007f7dd000  0x1e24  URWX    1    14     7    14
MMAP      0x8b22c00.0x000000007f7eb000  0x1e24  URWX    1     2     2     2
SHLDATA   0x8b22c00.0x000000007f7ed000  0x1e24  URWX    1     3     3     3
STACK     0x8b22c00.0x000000007f7f0000  0x1e24  URW     1     8     8     8
SHLTEXT   0xd99dc00.0x00000000c0004000  PUBLIC  URX    83     2     2     0
SHLTEXT   0xd99dc00.0x00000000c0010000  PUBLIC  URX    98    21    19     0
SHLTEXT   0xd99dc00.0x00000000c0100000  PUBLIC  URX    83   299   231     0
UAREA     0xa78c800.0x400003ffffff0000  KERNEL  KRW     1     8     8     8
total                                                  433    60    65

kmeminfo -parse

When the -parse option is specified along with -pid, kmeminfo prints an additional pid column in first position. This format allows the output to be parsed to search for specific patterns. For example, checking whether the shared libraries on the system are properly configured can be done by filtering the output with grep commands:

$ kmeminfo -parse -prot -pid -1 /padumps/java_perf | \
  grep SHLTEXT | grep -v PUBLIC
1595 SHLTEXT 0xc6f6c00.0x00000000c022f000 0x17b5 URX 2 1 1 0
1595 SHLTEXT 0xc6f6c00.0x00000000c025f000 0x524e URX 2 1 1 0
1595 SHLTEXT 0xc6f6c00.0x00000000c027e000 0x500e URX 2 2 2 0
1595 SHLTEXT 0xc6f6c00.0x00000000c02b3000 0x45b1 URX 2 1 1 0
1595 SHLTEXT 0xc6f6c00.0x00000000c02ba000 0x5d89 URX 2 2 2 0
1595 SHLTEXT 0xc6f6c00.0x00000000c02dd000 0x3a09 URX 2 3 3 0
...

i.e. the process with pid 1595 is using shared libraries which are not configured with the usual read-only/execute 0555 (chmod a=rx) mode.

kmeminfo -bucket [index,flags]

Without flags, the -bucket option prints statistics about kernel memory buckets, such as the number of pages allocated per bucket and per cpu, the total number of objects, the number of free objects, and the number of objects in use. Note that if not already printed, the summary and dynamic sections are printed. An example:

$ kmeminfo -bucket /padumps/buckcorr.11o
...
Per cpu kernel dynamic memory usage (in pages):
Only the byte buckets are per cpu (bucket 5 to 12, i.e. size 32 to 4096)
CPU # 0 = 1488 :                    ( objects:  used,  free)
 bucket[ 5] =   35 size   32 bytes  ( 4480:  4010,   470)
 bucket[ 6] =   17 size   64 bytes  ( 1088:   929,   159)
 bucket[ 7] =   76 size  128 bytes  ( 2432:  1163,  1269)
 bucket[ 8] =   93 size  256 bytes  ( 1488:   972,   516)
 bucket[ 9] =  188 size  512 bytes  ( 1504:  1206,   298)
 bucket[10] =  223 size 1024 bytes  (  892:   838,    54)
 bucket[11] =  548 size 2048 bytes  ( 1096:  1019,    77)
 bucket[12] =  308 size 4096 bytes  (  308:   296,    12)
CPU # 1 = 2576 :                    ( objects:  used,  free)
 bucket[ 5] =  125 size   32 bytes  (16000: 15367,   633)
 bucket[ 6] =   22 size   64 bytes  ( 1408:    -1,    -1)
 bucket[ 7] =  153 size  128 bytes  ( 4896:  4543,   353)
 bucket[ 8] =  158 size  256 bytes  ( 2528:  2319,   209)
 bucket[ 9] =  332 size  512 bytes  ( 2656:  2435,   221)
 bucket[10] =  164 size 1024 bytes  (  656:   611,    45)
 bucket[11] = 1504 size 2048 bytes  ( 3008:  2675,   333)
 bucket[12] =  118 size 4096 bytes  (  118:   114,     4)
...

i.e. we had 17 pages allocated to cpu 0 bucket 6. Those 17 pages represented a total of 1088 objects of 64 bytes: 929 objects were used, and 159 objects were on the bucket's free list. A negative used count indicates that kmeminfo couldn't walk the free list. Below is the list of the possible error codes:

-1 means P4_ACCESS_ERR_NO_TRANS
-2 means P4_ACCESS_ERR_NO_PHYSMEM
-3 means P4_ACCESS_ERR_NOT_SELECTED
-4 means P4_ACCESS_ERR_BAD_CORE_ACCESS
-5 means P4_ACCESS_ERR_PG0
-99 means that a next pointer wasn't aligned on the bucket size.

i.e. cpu 1 bucket 6 was likely to be corrupted.

Bucket index

When specifying a bucket index, kmeminfo prints the list of the objects belonging to the specified bucket. It actually scans the kmemusage array for pages allocated to the bucket, and those pages are sliced into objects, reporting the state of each object, i.e. free vs. used. When a bucket free list is corrupted, the state of some of the objects might be reported as n/a, as kmeminfo couldn't walk the free list to properly determine the state of the object:

$ kmeminfo -bucket 6 /padumps/buckcorr.11o
Error while scanning a bucket_64bit free list:
cpu 1, index 6 (64 bytes), head 0x00a55e78
next 0x0000be48 (no translation!)
used 0x104004000
used 0x104004040
used 0x104004080
used 0x1040040c0
...
n/a  0x1025fb000
n/a  0x1025fb040
n/a  0x1025fb080
n/a  0x1025fb0c0
...
free 0x109203000
used 0x109203040
free 0x109203080
free 0x1092030c0
...


Bucket flags

The following flags control the processing of bucket objects:

free to print only free objects.
used to print only used objects.
cpu=number to print only objects from the specified cpu.
skip=number to skip number objects before starting to print objects.
max=number to limit the output to number objects.
dump[=size] to include a hex dump of size bytes, the default size being the size of the object.
type=struct to include a struct print out.
offset=size to specify an offset of size bytes to add to the object address before dumping hex or printing the struct.

$ kmeminfo -bucket 10,dump=32,cpu=0,free,max=3 /padumps/memcorr.11o
bucket_64bit[0][10]:
0x5a701400
0x5a701400 : 0x00000000631c1000 0x2f64617465003200 ....c.../date.2.
0x5a701410 : 0x3200474c2f6c6962 0x2f6c69626f676c74 2.GL/lib/liboglt
0x631c1000
0x631c1000 : 0x000000005e9c7c00 0x000000007f7f0005 ....^.|.........
0x631c1010 : 0x7f7f00117f7f0036 0x7f7f00817f7f00a5 .......6........
0x5e9c7c00
0x5e9c7c00 : 0x0000000057e44800 0x6f6c2f707767722f ....W.H.ol/pwgr/
0x5e9c7c10 : 0x6461656d6f6e002f 0x636c69656e743135 daemon./client15

The listing ends here, as the head of the free list has no translation.

kmeminfo -pgclass

The -pgclass option prints page classification statistics. This can be useful to check for partial selective dumps (dumps where not all the selected pages have been dumped) as in the example below:

$ kmeminfo -pgclass
Page class statistics:
PC_UNUSED =  73401 excluded,      0 dumped
PC_USERPG = 361086 excluded,      0 dumped
PC_BCACHE =  78500 excluded,      0 dumped
PC_KCODE  =   1656 excluded,      0 dumped
PC_USTACK =   2927 included,      0 dumped
PC_FSDATA =     99 included,      0 dumped
PC_KDDATA = 204688 included, 198296 dumped
PC_KSDATA =  63819 included,  63819 dumped
Total     = 271533 included, 262115 dumped


kmeminfo -pdk_malloc

The -pdk_malloc option prints the pinned and unpinned pdk malloc maps, along with other pdk malloc related tables, such as the Translation Registers. Note that the pseudo TRs for the unpinned pdk space are tagged as DATA*; see the example below. This option is most useful when debugging pdk malloc related problems.

$ kmeminfo -pdk_malloc
Translation Registers:
dtr type   ar   virt_addr range                        phys_addr range
 0  TEXT   KRW  0xe000000000000000-0xe000000003ffffff  0x04000000-0x07ffffff
 1  DATA   KRW  0xe000000100000000-0xe000000103ffffff  0x08000000-0x0bffffff
 2  DATA   KRW  0xe000000104000000-0xe000000107ffffff  0x0c000000-0x0fffffff
 3  SAPIC  KRW  0xe000eeeefee00000-0xe000eeeefeefffff  0xfee00000-0xfeefffff
 6  VHPT   KRW  0xe000000120000000-0xe0000001203fffff  0x00400000-0x007fffff
 7  VHPT   KRW  0xe000000120400000-0xe0000001207fffff  0x00800000-0x00bfffff
64  DATA*  KRW  0xe000000108000000-0xe00000010bffffff  0x10000000-0x13ffffff
itr type   ar   virt_addr range                        phys_addr range
 1  TEXT   KRX  0xe000000000000000-0xe000000003ffffff  0x04000000-0x07ffffff
 2  DATA   KRX  0xe000000100000000-0xe000000103ffffff  0x08000000-0x0bffffff
 3  DATA   KRX  0xe000000104000000-0xe000000107ffffff  0x0c000000-0x0fffffff
PDK Malloc:
pinned_pdk_malloc_base          = 0xe0000001004e3000
pinned_pdk_malloc_unused_base   = 0xe000000100757000
pinned_pdk_malloc_unused_end    = 0xe000000106ffffff
pinned_pdk_malloc_end           = 0xe000000107ffffff
unpinned_pdk_malloc_base        = 0xe000000108000000
unpinned_pdk_malloc_unused_base = 0xe00000010976f000
unpinned_pdk_malloc_unused_end  = 0xe000000109800fff
unpinned_pdk_malloc_end         = 0xe00000010bffffff
unpinned_pdk_malloc_itir.ps     = 0x0000000004000000
unpinned_pdk_malloc_va2pa_delta = 0xe0000000f8000000
pinned_pdk_malloc_map:
map                 size      vaddr: first        last
0xe00000010036a710  12288     0xe0000001004e3000  0xe0000001004e5fff
0xe00000010036a720  64        0xe000000100692000  0xe00000010069203f
0xe00000010036cd00  64        0xe000000107fed880  0xe000000107fed8bf
0xe00000010036cd10  64        0xe000000107ff47c0  0xe000000107ff47ff
0xe00000010036cd20  18688     0xe000000107ffb700  0xe000000107ffffff
unpinned_pdk_malloc_map:
map                 size      vaddr: first        last
0xe000000100371590  2624      0xe00000010806e5c0  0xe00000010806efff
0xe0000001003715a0  3392      0xe0000001088fa2c0  0xe0000001088fafff
0xe0000001003715b0  3392      0xe000000108cf52c0  0xe000000108cf5fff
0xe0000001003715c0  4032      0xe00000010976e040  0xe00000010976efff
0xe0000001003715d0  64        0xe00000010b000fc0  0xe00000010b000fff
0xe0000001003715e0  16142336  0xe00000010b09b000  0xe00000010bffffff

kmeminfo -sysmap

The -sysmap option prints the system virtual address resource maps. These resource maps track ranges of free virtual addresses for kernel dynamic memory allocations and for buffer cache allocations:

                   PARISC 10.x      PARISC 11.x                IA64 11.x
KERNEL SYSMAP's    sysmap           sysmap_32bit sysmap_64bit  sysmap sextmap
BUFCACHE BUFMAP's  bufmap bufmap2   bufmap bufmap2              bufmap

When free virtual address ranges get fragmented, the resource map entries are used up, and the kernel is no longer able to return free addresses to the resource map. In this case, an "rmap ovflo" message is printed and the freed address range is lost. Eventually, if we keep losing addresses, the kernel might no longer be able to allocate virtual space when needed. For the kernel sysmap, this could result in a panic: out of kernel virtual space. For the bufcache bufmaps, this could result in poor performance (hang in bcfeeding_frenzy()). Below is an example from a PARISC 11.11 machine:

$ kmeminfo -sysmap
Resource maps for kernel dynamic virtual addresses:
sysmap_32bit at 0xca2890 (6399 entries max):
m_addr           m_size  vaddr_range: first   last
0x0000000000c93       9  0x0000000000c92000  0x0000000000c9afff
0x0000000000cd6     213  0x0000000000cd5000  0x0000000000da9fff
0x0000000002135      16  0x0000000002134000  0x0000000002143fff
0x0000000002148      57  0x0000000002147000  0x000000000217ffff
0x000000000218c       6  0x000000000218b000  0x0000000002190fff
0x00000000021a2   17104  0x00000000021a1000  0x0000000006470fff
0x0000000006473   25949  0x0000000006472000  0x000000000c9cefff
0x000000000c9da       6  0x000000000c9d9000  0x000000000c9defff
0x000000000c9ec       4  0x000000000c9eb000  0x000000000c9eefff
0x000000000ca01    1181  0x000000000ca00000  0x000000000ce9cfff
0x000000000ce9f       1  0x000000000ce9e000  0x000000000ce9efff
0x000000000cea1  209248  0x000000000cea0000  0x000000003fffffff
Total size: 253794 (12 entries used)
sysmap_64bit at 0xcbb890 (6399 entries max):
m_addr           m_size  vaddr_range: first   last
0x000000004106b      96  0x000000004106a000  0x00000000410c9fff
0x0000000041191    3696  0x0000000041190000  0x0000000041ffffff
0x0000000044001  278528  0x0000000044000000  0x0000000087ffffff
Total size: 282320 (3 entries used)
Bitmaps for buffer cache virtual addresses:
bm_t bufmap at 0x40589000
Address space (in pages):
Total = 98304
Used  = 75006
Free  = 23298
bitmap              nbit/mask   vaddr_range: first  last
0x000000004058d100  128         0x8000000008800000  0x800000000887ffff
0x000000004058d110  0x0fffffff  0x8000000008880000  n/a
0x000000004058d130  0xffff0000  0x8000000008980000  n/a
0x000000004058d134  160         0x80000000089a0000  0x8000000008a3ffff
...
0x000000004058edf8  32          0x8000000016fc0000  0x8000000016fdffff
0x000000004058edfc  0x7e7fffff  0x8000000016fe0000  n/a
0x000000004058ee00  1504        0x8000000017000000  0x80000000175dffff
0x000000004058eebc  0xfff3ffff  0x80000000175e0000  n/a


kmeminfo -alias

This option prints the list of the Physical Addresses aliased to different Virtual Addresses. To do so, kmeminfo searches the pfn_to_virt table for entries having aliases. Except for the NULLDREF pages, there should not be many aliased pages in the system.

$ kmeminfo -alias
List of alias entries:
Printing all the alias entries ...
PA 0x04e5c000 aliased to VA 0x0204b000.0x0000000000000000
PA 0x04e5c000 aliased to VA 0x03c11c00.0x0000000000000000
PA 0x04e5c000 aliased to VA 0x098f8000.0x0000000000000000
PA 0x04e5c000 aliased to VA 0x0cf9f400.0x0000000000000000
Used aliases:  4 entries
Free aliases:  336 entries
Total aliases: 340 entries (1 pages)

kmeminfo -kas

This option prints information about the Kernel Allocator Superpage, aka the kernel superpage pool. The 11.00 implementation of the superpage pool was vulnerable to fragmentation of the physical memory pool. For 11.00, the highest field would typically be used to check for a fragmentation issue causing this kas allocator to perform poorly. The 11.11 implementation fixed the performance issue observed when fragmenting superpages.

$ kmeminfo -kas
Kernel memory Allocation Superpage pool (KAS):
super_page_pool at 0x69b668
kas_total_in_use = 10372
kas_max_total_in_use = 15216
kas_force_free_on_coalesce = 1
kas_total_freed = 0
     size   sp_pool_t           count  free  highest  sp_next
 0    4KB   0x000000000069b668      0   384      825  0x000000010380f000
 1    8KB   0x000000000069b680      0   450      540  0x00000001009f2000
 2   16KB   0x000000000069b698      0   354      363  0x0000000100864000
 3   32KB   0x000000000069b6b0      0   238      241  0x0000000100a68000
 4   64KB   0x000000000069b6c8      0    10       53  0x0000000103410000
 5  128KB   0x000000000069b6e0      0     1       24  0x0000000104b20000
 6  256KB   0x000000000069b6f8      0     1        7  0x0000000104b40000
 7  512KB   0x000000000069b710      0     1        3  0x0000000104b80000
 8    1MB   0x000000000069b728      0     0        2  0x0000000000000000
 9    2MB   0x000000000069b740      0     0        2  0x0000000000000000
10    4MB   0x000000000069b758      0     1        2  0x0000000104c00000
11    8MB   0x000000000069b770      0     0        1  0x0000000000000000
12   16MB   0x000000000069b788      4     0        0  0x0000000000000000
Total number of free page on pools: 6012

kmeminfo -vmtrace [flag,flag,...]

Without any flags, the -vmtrace option causes kmeminfo to dump all the vmtrace logs:

memory corruption log
memory leak log
general memory tracing log

Each record in the corruption log and the general tracing log is printed in the following format: address, size, arena, pid, tid, date/time, stack trace.

Memory log for cpu 0:
0xe0000001400d09c0 56 M_TEMP 68 77 Sep 5 17:29:16
vmtrace_free+0x1d0
kfree+0x150
vx_worklist_process+0x290
vx_worklist_thread+0x70
...

By default, the memory leak log is printed grouping allocation patterns together, and sorting them by increasing occurrences, e.g.:

Vmtrace Leak Log:
Note: "Total allocated memory" is the number of pages allocated since vmtrace was started.
Repeated 752 times, malloc size 24 bytes:
vmtrace_alloc+0x160
kmalloc+0x240
vx_zalloc+0x30
vx_inode_alloc+0x220
vx_ireuse+0x340
vx_iget+0x270
Total allocated memory 5 pages
Latest on Sep 5 17:47:39
Oldest on Sep 5 13:30:57
...

To obtain a detailed output of each memory leak log entry, the -verbose option should be specified, in which case the leak log entries are sorted by time. The following flags can be used to limit the output to specific logs:

bucket=<bucket>  # limit output to the specified bucket index (10.x and 11.0)
arena=<arena>    # limit output to the specified arena (11.11 and beyond)
count=<num>      # limit output to the first log entries
leak             # limit output to the leak log
cor              # limit output to the corruption log
log              # limit output to the general tracing log
parse            # produce an output which can be easily parsed

For more information about vmtrace, please visit the vmtrace web site.

kmeminfo -virtual [<space>.]<vaddr>[,trans,hash,pid=<pid>]The -virtual option prints both translation and ownership for the specified space.vaddr virtual address. TranslationWhen focusing on translation, you may skip the ownership information by specifying the trans flag. Translation hash chains may be printedby specifying the hash flag. When hash is specified, the primary hash chain is printed, but kmeminfo also prints the secondary hash chainon systems supporting dual pdir (ie. post 11.22). When specifying a process id using the pid flag, you may specify only the vaddr offset (skipping the space). The space would be takenfrom the pregion holding vaddr (if any). A PA2.0 dual pdir example: $ kmeminfo -virtual 0x7ac02300,trans,hash,pid=1802 /padumps/type9 VA 0x813f400.0x7ac02300 translates to PA 0x77802300 Page table entry: hpde2_0_t 0x4b51fa0 Access rights : 0x1f PDE_AR_URW Protection key: 0x4f7a Page size : 4MB Large page details: Addr : virtual physical Start: 0x7ac00000 0x77800000 End : 0x7b000000 0x77c00000


Hashing details:
   Primary: pdirhash=0x04b51fe0=htbl[370943] vtag=0x0813f400 0x0007ac02
   hpde2_0_t  pde_next   pde_space  pde_vpage
   0x04b51fe0 0x00000000 0xffffffff 0x00000000
   Secondary: mask=0x3fffff base=0x7ac00000
   pdirhash=0x04b51fa0=htbl[370941] vtag=0x0813f400 0x0007ac00
   hpde2_0_t  pde_next   pde_space  pde_vpage
   0x04b51fa0 0x00000000 0x0813f400 0x0007ac00

When space is omitted, and a pid is not specified, kmeminfo will use the kernel space if the vaddr is within the kernel segment.

An IA64 example:

$ kmeminfo -virtual 0xfffc00005cefa000,trans,hash /iadumps/bbds_panic

VA 0xdead31.0xfffc00005cefa000 translates to PA 0x15fa000

Page table entry: pte_t 0xe000000108ab8820
   Access rights : 0x0c PTE_AR_KRWX
   Protection key: 0xbeef KERNEL/PUBLIC
   Page size     : 4KB

Hashing details:
   thash=0xe000000120220ae0=vhpt[69719] ttag=0x006f56c00005cefa
   pte_t              pte_next           pte_tag
   0xe000000120220ae0 0x00000000107d54c0 0x00a8180000004067
   0xe0000001087d54c0 0x000000001119c340 0x006f56800011cefa
   0xe00000010919c340 0x0000000010ab8820 0x0088680000040087
   0xe000000108ab8820 0x0000000000000000 0x006f56c00005cefa

Ownership

When trans is not specified, kmeminfo also prints the owner of the virtual address.

User private:

$ kmeminfo -virtual 0x813f400.0x7ac02300 /padumps/type9
...
VA belongs to PRIVATE reg_t 0x494cb4c0:
   Region index: 2
   Page valid  : 1
   Page frame  : 0x77802
   dbd_type    : DBD_NONE
   dbd_data    : 0xfffff0c
   Front store : struct vnode 0
   Back store  : struct vnode 0x413ff140
VA belongs to process "a.out" pid 1802, MMAP preg_t 0x4949fb80.

User shared:

$ kmeminfo -virtual 0x0c6f6c00.0xc1381000 /dumps/pa/java_perf
...
VA belongs to SHARED reg_t 0x4a75a600:
   Region index: 0
   Page valid  : 1


   Page frame  : 0x5b211
   dbd_type    : DBD_NONE
   dbd_data    : 0x1fffff0c
   Front store : struct vnode 0
   Back store  : struct vnode 0x401f3e00
List of pregions sharing the region:
   pid  preg_t             type  vaddr              bytes      command
   1551 0x000000004a90e500 SHMEM 0x00000000c1381000 0x00074000 HPSS7
   1550 0x000000004a8c6b00 SHMEM 0x00000000c1381000 0x00074000 ss7waiter.TSC
   1530 0x000000004a7cbd00 SHMEM 0x00000000c1381000 0x00074000 ttlRecover

Kernel buffer cache:

$ kmeminfo -virtual 0xfffc00005cefa000 /iadumps/bbds_panic
...
VA belongs to buffer cache, struct buf 0xe000000118b54500.

Kernel byte bucket:

$ kmeminfo -virtual 0x4a90e520 /padumps/java_perf
...
VA belongs to "bucket_64bit" (cpu 0, index 8, size 256 bytes).
VA is within the object at:
   Start: 0x4a90e500
   Size : 0x00000100 256 bytes
   End  : 0x4a90e600
The object is currently in use.

Kernel page bucket:

$ kmeminfo -virtual 0x4b9f83c0 /padumps/java_perf
...
VA belongs to "page_buckets_64bit" (index 2, size 8 KB).
VA is within the object at:
   Start: 0x4b9f8000
   Size : 0x00002000 2 pages 8 KB
   End  : 0x4b9fa000
The object is free (on the bucket free list at position 3).

Kernel arena:

$ kmeminfo -virtual 0x4bf4a940 /dumps/pa/arena_corr
...
VA belongs to variable arena "M_PREG" (cpu 0, index 4, size 248 bytes).
VA is within the object at:
   Start: 0x4bf4a940
   Size : 0x000000f8 248 bytes
   End  : 0x4bf4aa38
The object is in use.

Kernel super page pool:


$ kmeminfo -virtual 0xe000000145812f00 /iadumps/bbds_panic
...
VA belongs to a free chunk on "super_page_pool.sp_pool_list[11]":
   Start: 0xe000000145800000
   Size : 0x0000000000800000 2048 pages 8 MB
   End  : 0xe000000146000000

Kernel sysmap:

$ kmeminfo -virtual 0xe00000011022d000 /dumps/ia/bbds_panic

VA 0xdead31.0xe00000011022d000 does not have a translation.

Hashing details:
   thash=0xe0000001203b9000=vhpt[121984] ttag=0x006f56800011022d
   pte_t              pte_next           pte_tag
   0xe0000001203b9000 0x0000000000000000 0x006f56c00005022d

VA belongs to a free chunk on "sysmap":
   Start: 0xe00000011022d000
   Size : 0x0000000000003000 3 pages 12 KB
   End  : 0xe00000011023
0000

shminfo

http://oceanie.grenoble.hp.com/georges/shminfo.html

A tool to troubleshoot shared memory allocation
by Georges Aureau - WTEC HPUX

Description
shminfo looks at the resource maps of the available shared quadrants (global and/or private) and at the allocated shared memory segments, then it prints a consolidated map of what is free/used. It also prints information about system limitations (shmmax, swap space) and about allocation policies.

shminfo supports the following features/releases:

   10.20 MR
   10.20 PHKL_8327  SHMEM_MAGIC executables
   10.20 PHKL_15058 Best-fit allocation policy
   10.30 MR/LATEST
   11.00 MR         32bit or 64bit shared space
   11.00 PHKL_13810 Memory windows
   11.00 PHKL_16236 BigSpace memory windows
   11.00 PHKL_20224 Q3 Private executables
   11.00 PHKL_20995 Fix for Q3 Private breaking memory windows

Usage
When run without any options, shminfo prints the following 4 sections:

   Global 32-bit shared quadrants.
   Private 32-bit shared quadrants.
   Limits for 32-bit SHMEM allocation.
   Allocation policy for 32-bit shared segments.

If the current directory contains a crash dump (an INDEX file is present) then the crash dump is opened; otherwise shminfo opens /stand/vmunix and /dev/kmem.

The -help (or -h for short) option gives the full usage:

Usage: shminfo [options ...] [coredir | kernel core]
Default: coredir="." if "INDEX" file present
         else kernel="/stand/vmunix" core="/dev/kmem"


Options: -s | -shmem
         -w | -window
         -g | -global
         -p | -private
         -f | -free
         -F | -bigfree
         -l | -limits
         -W | -64bit
         -h | -help
         -v | -verbose
         -a | -async
         -V | -version
         -u | -update

The -global option prints only the maps of the global shared quadrants. The -private option prints only the maps of the private shared quadrants (memory windows). The -window <id> option prints the map of the specified memory window. The -free option prints only the free virtual space of the shared quadrants, and the -F (-bigfree) option prints the largest free chunk within each shared quadrant (very useful when debugging shmem allocation problems). The -limits option prints only the system limitations for 32bit shared memory segment allocation. The -shmem <id> option prints the list of the processes currently attached to the specified shmem id. The -W (-64bit) option prints the maps of 64bit shared quadrants. The -async option (courtesy of Peter Hryczanek, thanks Pete :-) prints information about the registered async segments.
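As a quick usage sketch built from the options above: when a shmget() call unexpectedly fails, the largest free chunk per quadrant and the processes attached to a suspect segment are usually the first things to check (the shmem id below is purely illustrative):

$ shminfo -F          # largest free chunk within each shared quadrant
$ shminfo -s 42       # processes attached to shmem id 42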

Shared quadrants

Shared space is allocated from shared quadrants. A shared quadrant can be global or private. Space allocated from a global shared quadrant can be shared by all processes on the system, whereas space allocated from a private shared quadrant can only be shared by a group of processes (see memory windows below).

10.20 MR
There are only two global shared quadrants, with a total of 1.75GB of shared space:

   10.20 MR global shared quadrants
        Space     Start       End         Size
   Q3   q3 space  0x80000000  0xc0000000  1GB
   Q4   0         0xc0000000  0xf0000000  .75GB

An application can allocate up to 1.75GB of shared segments. However, the largest individual shared segment is limited to 1GB (the size of a quadrant), i.e. to allocate 1.75GB of shared segments, the application would have to get at least 2 individual shared segments.

10.20 PHKL_8327 (SHMEM_MAGIC)
A third global shared quadrant was added in order to support SHMEM_MAGIC executables ("chatr -N", PHSS_8358), thus giving a total shared space of ~2.75GB:

   10.20 LATEST global shared quadrants
        Space     Start       End         Size
   Q2   q2 space  0x40000000  0x7ffe6000  ~1GB
   Q3   q3 space  0x80000000  0xc0000000  1GB
   Q4   0         0xc0000000  0xf0000000  .75GB

A SHMEM_MAGIC application can allocate up to ~2.75GB of shared segments. However, the largest individual shared segment is limited to 1GB (the size of a quadrant), i.e. to allocate ~2.75GB of shared segments, the application would have to get at least 3 individual shared segments. Also, note that there is a gap of virtual space between the Q2 and Q3 shared quadrants: this gap is actually used for the UAREA (starting at 0x7ffe6000 on 10.x).

10.20 PHKL_15058 (BEST FIT)
Best-fit allocation policy was introduced to reduce the fragmentation of the shared space resource maps.

11.0 MR
Support for 32bit shared space and 64bit shared space. There are three 32bit global shared quadrants; however, the layout of those quadrants differs between 32bit and 64bit kernels at the end of 32bit Q2. For 32bit kernels, the UAREA sits at the end of 32bit Q2 (starting at 0x7fff0000 on 11.0), thus creating a virtual space gap between the Q2 and Q3 32bit shared quadrants:


   11.0 MR global 32bit shared quadrants on 32bit kernel
        Space     Start       End         Size
   Q2   q2 space  0x40000000  0x7fff0000  ~1GB
   Q3   q3 space  0x80000000  0xc0000000  1GB
   Q4   0         0xc0000000  0xf0000000  .75GB

   11.0 MR global 32bit shared quadrants on 64bit kernel
        Space     Start       End         Size
   Q2   q2 space  0x40000000  0x80000000  1GB
   Q3   q3 space  0x80000000  0xc0000000  1GB
   Q4   0         0xc0000000  0xf0000000  .75GB

There are two 64bit global shared quadrants (64bit kernel only). Those 64bit shared quadrants are used by 64bit applications:

   11.0 MR global 64bit shared quadrants on 64bit kernel
        Space     Start               End                 Size
   Q1   q1 space  0x0000000000000000  0x0000040000000000  4TB
   Q4   q4 space  0xc000000000000000  0xc000040000000000  4TB

Also, note that for 64bit kernels the UAREA sits at the end of 64bit Q2. The table below gives the UAREA location given the kernel bits:

   UAREA location
   Kernel  Q2 start/end                                 UAREA start
   32bit   0x40000000 - 0x80000000                      0x7fff0000
   64bit   0x40000000'00000000 - 0x40000400'00000000    0x400003ff'ffff0000

As you can see, the end of the 32bit Q2 shared quadrant overlaps with the UAREA on 32bit kernels but not on 64bit kernels. On a 64bit kernel, the 32bit Q2 shared quadrant is actually within the 64bit Q1 quadrant, hence it does not overlap with the UAREA, which sits in the 64bit Q2 quadrant. This is an important point to notice as it explains why the BigSpace Window and Q3 Private features (described below) are only available on 64bit kernels.

11.0 PHKL_13810 (MEMORY WINDOW)
This patch introduced support for memory windows. In previous HP-UX releases, the available shared quadrants were global shared quadrants. With memory windows, each window provides 2 private shared quadrants. Memory windows are only used by 32bit applications (running on either 32bit or 64bit kernels), i.e. a memory window provides 32bit shared space. The memory window with id 0 is known as the "global memory window" (or "default memory window"). Each memory window provides ~2GB of private shared space:

   Private shared quadrants for memory window index i on 32bit kernel
        Space           Start       End         Size
   Q2   mw[i] q2 space  0x40000000  0x7fff0000  ~1GB
   Q3   mw[i] q3 space  0x80000000  0xc0000000  1GB

   Private shared quadrants for memory window index i on 64bit kernel
        Space           Start       End         Size
   Q2   mw[i] q2 space  0x40000000  0x80000000  1GB
   Q3   mw[i] q3 space  0x80000000  0xc0000000  1GB

   Global 32bit shared quadrant
        Space  Start       End         Size
   Q4   0      0xc0000000  0xf0000000  .75GB

11.0 PHKL_16236 (BIG SPACE MEMWIN)
BigSpace memory windows ("setmemwindow -b", PHCO_16795), on 64bit kernels only. When configured for "big space", the memory window is set with its q2 space matching its q3 space, thus allowing (on 64bit kernels only) 2GB of virtually-contiguous shared space (same space id, and no gap at the end of Q2 for the UAREA):

        Space           Start       End         OS     Size
   Q2   mw[b] q2 space  0x40000000  0x80000000  64bit  1GB
   Q3   mw[b] q3 space  0x80000000  0xc0000000  64bit  1GB

The big space window can be viewed as:


        Space        Start       End         OS     Size
   Q2Q3 mw[b] space  0x40000000  0xc0000000  64bit  2GB

A 32bit SHMEM_MAGIC application can then allocate two 1GB shared memory segments and treat those two segments as a single 2GB segment. Note that the largest 32bit shared memory segment which can be allocated through a shmget() call is limited by the size of a quadrant, hence 1GB. The Big Space window allows working around this limitation: through 2 shmget() calls, an application is now able to get a "2GB" shared memory segment.

11.0 PHKL_20224... (Q3 PRIVATE)
Q3 private executables ("chatr +q3p enable", PHSS_19866), on 64bit kernels only. Q3 private processes are attached to the Q3 private memory window (id 1). The Q3 private memory window does not hold any virtual space; its resource maps for both its q2 and q3 quadrants are kept empty, thus forcing allocation of shared space from the Q4 global shared quadrant. This essentially gives 3GB of private space (Q1, Q2, Q3 are considered as quadrants for private data).

Running shminfo -w 1 (i.e. the q3 private window id is 1) reports no free space:

$ shminfo -w 1
libp4 (7.98): Opening /stand/vmunix /dev/kmem
shminfo (3.7)

Shared space from Window id 1 (q3private):
   Space  Start                  End         Kbytes   Usage
   Q2     0xffffffff.0x40000000-0x7fffffff   1048576  OTHER
   Q3     0xffffffff.0x80000000-0xbfffffff   1048576  OTHER

11.0 PHKL_20995 (Q3 PRIVATE FIX)
Fix for Q3 private breaking memory windows. The initial Q3 private patches (PHKL_20227 and PHKL_20836) had a defect causing memory windows (other than the global window) to not be properly initialized. When Q3 private is configured and all the windows except the global window are empty, we are very likely running into the Q3 private bug, and thus shminfo prints the following warning:

WARNING: Q3 Private is enabled but "shminfo" couldn't find a memory window
with free virtual space: this is suggesting that one of PHKL_20227 or
PHKL_20836 is installed (both are BAD patches).
Please, make sure that PHKL_20995 is installed !

shmalloc

Along with the shminfo executable, the ktools shminfo.exe archive provides shmalloc.c, a C program to exercise shared memory allocation:

# shmalloc -h
Purpose: To troubleshoot 32bit shared memory allocation problems.
         It should be used along with "shminfo" (v3.0 or greater)...
Usage: shmalloc [options ...]
Options:
   -s size      segment size in KB (default=4, max=1048576=1GB)
   -c number    number of segments (default=1, max=10)


   -g           Global space (IPC_GLOBAL), instead of window space
   -n           Narrow mode (IPC_SHARE32), useful when compiling +DD64
   -l           Lock segments in memory (root or PRIV_MLOCK groups only)
   -t seconds   Touch all the pages during seconds
   -h           prints the help

Compiling shmalloc.c:
   SHARED_MAGIC: cc -o shmalloc shmalloc.c
   EXEC_MAGIC:   cc -Wl,-N -o shmalloc shmalloc.c
   SHMEM_MAGIC:  cc -Wl,-N -o shmalloc shmalloc.c; chatr -M shmalloc

Example: Allocating 2 shmem segments of 1GB each:
   Default window : ./shmalloc -c 2 -s 1048576
   Global space   : ./shmalloc -c 2 -s 1048576 -g
   Window id 100  : setmemwindow -i 100 ./shmalloc -c 2 -s 1048576
   BigSpace id 100: setmemwindow -b -i 100 ./shmalloc -c 2 -s 1048576

vmtrace

http://psweb1.cup.hp.com/~projects/vm/tools/vmtrace/index.html

Introduction
Vmtrace is a tool for debugging incorrect use of dynamically allocated kernel memory in HPUX. This is memory allocated using kmem_arena_alloc(), kmem_arena_varalloc(), MALLOC(), kmalloc(), and other related routines and macros.

Vmtrace consists of 3 parts:

The main portion is built into all kernels since 10.10, and was available as a patch for some earlier releases. It is normally inactive, and must be enabled in order to use it.
A user space tool called 'vmtrace' is the normal means of enabling vmtrace.
A perl script called 'vmtrace.pl' is used with Q4 to assist in analyzing information recorded by vmtrace.

Users should be aware that the implementation of vmtrace has changed over time, and many of the early versions of the user space vmtrace components lack version information. It is thus possible to get confusing results when using the wrong versions of the tools. They should also be aware that the user interface has changed over time. In particular, the kernel memory allocator was rewritten in 11.11, leading to many changes in the behavior and interface of vmtrace.

In order to reduce confusion caused by version incompatibilities, the most recent (11.11) versions of the user space components have built-in backwards compatibility; if they are run on an older kernel, they will emulate the behavior of their 11.10 versions. This means that the 11.11 version of the user space tool will work on all vmtrace kernel implementations, except for a Beta version of vmtrace that existed for about a month during 11.11 development. The 11.11 version of the vmtrace perl script will work for 11.00 and later kernels except, once again, the 11.11 Beta version. Be aware that when operating in compatibility mode, the user interface is significantly different.

Users should also be aware that when built in OSDEBUG mode, 11.11 and later kernels have significant built-in kernel memory allocation debugging capability even without vmtrace. Thus, you may not need vmtrace to find your problem on such a kernel.

VMTRACE - How It Works (Pre 11.11)

Introduction

It has often been the case that one subsystem allocates a chunk of kernel memory through a MALLOC call and, after releasing the memory through a FREE call, incorrectly accesses it. This illegal operation often ends up corrupting other data structures in the kernel which use the same size. The consequences can be very drastic since there is no way to track down the offending code.

The other common case which has been noticed is that a subsystem allocates a chunk of kernel memory but never releases it. Repeated use of this path can lead to a drain on system memory and eventually to a system hang or very poor performance.

A mechanism has been put in place in the regular kernel to enable tracking of these often difficult bugs. This tool enables an engineer to start online tracing without rebooting. If there is a memory corruption problem the system will eventually panic, and the stack trace should show the culprit. If there is a leak, the system needs to be brought to single user mode and then a memory dump should be taken for further analysis of the log.


What needs to be considered before using vmtrace?

First of all, vmtrace will have an impact on performance as well as on memory consumption. The performance impact is typically not too bad, about 2-3%, but in cases where applications benefit greatly from large pages, the performance impact can be more visible.

Memory consumption is more of a concern. When tracing for memory corruption, all sizes less than a page get a full page allocated, so there is increased use of system memory. For example, a 256 byte allocation now uses a page (4096 bytes). So if a particular system uses the 256 byte size very heavily, there is a considerable increase in memory usage. For normal to heavy workloads 64 MB should be sufficient. This was a problem a while ago, but with current systems and their memory sizes there should not be too much concern in using vmtrace.

When tracing for memory corruption, the system will be paniced by vmtrace as soon as a corruption is detected. This is certainly something the customer needs to be prepared for. For all other cases, the system will need to be brought down manually. This can be accomplished by shutting down the system to single user mode and then TOC'ing the system. Make sure that the dump space is large enough to save the majority of the memory contents. To be on the safe side, allow a full dump to be saved.

On 10.20 systems that use PA2.0 processors, usage of superpages has to be switched off before enabling vmtrace. In 11.00 this is no longer necessary. Here is how to do so with adb:

# adb -w /stand/vmunix

kas_force_superpages_off?W 0x1

$q

Lastly, here is the list of patches that need to be installed before using vmtrace:

10.20:

PHKL_8377 s800 vmtrace:malloc()
PHKL_8376 s700 10.20 vmtrace and malloc() patch

11.00:

PHKL_17038 vmtrace:data:page:fault (a dependency on PHKL_18543!)
PHNE_18486 streams:cumulative

On 11.11, vmtrace operation can be enabled as well as stopped while the system is alive. This allows you to start vmtrace, reproduce the problem, and then disable vmtrace again afterwards. So far no patches are needed to use vmtrace on 11.11.

Common Cases and their symptoms

Memory Corruption
This is the most common and the most difficult to debug. There are several ways in which memory corruption can happen.

Case 1:

A subsystem allocates memory, uses it and then releases it. After releasing it, the subsystem continues to write to it. This causes corruption of other data structures which have currently been allocated this chunk of memory.

Case 2:

A subsystem allocates a chunk of memory but writes beyond the size allocated to it. This causes corruption of data structures which have chunks of memory adjacent to this chunk.

Case 3:

A subsystem allocates a chunk of memory, uses it and then releases it twice. This often causes corruption of the first word since it is used as a link in the free list. The symptoms can be drastic since the same chunk could be allocated to two subsystems simultaneously. It also could lead to severe damage of the free list.
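These three cases are easy to reproduce in miniature as user-space C; the kernel equivalents simply use MALLOC/FREE instead of malloc/free. This deliberately buggy sketch shows exactly the access patterns vmtrace is designed to catch:

#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *p = malloc(16);
    char *q = malloc(16);

    /* Case 1: stale pointer (use after free) */
    free(p);
    p[0] = 'x';           /* writes into memory someone else may now own */

    /* Case 2: overrun */
    memset(q, 0, 32);     /* writes 16 bytes past the end of q */

    /* Case 3: double free */
    free(q);
    free(q);              /* corrupts the allocator's free list */

    return 0;
}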

Memory Leaks

In this situation a subsystem allocates a chunk of memory but does not release it when done. If there is continuous use of this path, there is a constant drain of memory which could lead to a hang or poor performance.



Tracing Mechanism

This mechanism does not change the MALLOC and FREE macros, so this avoids any performance overhead when tracing is disabled. It does have a couple of checks in kmalloc() and kfree(), but since these routines are not called often, the performance overhead is almost negligible.

When tracing is enabled for memory corruption, sizes less than a page are always allocated a page. When the virtual address is released, we remove the translation and the virtual address is kept in a backup pool. So if the offending code touches the virtual address which has been released, it should cause a Data fault at the exact location of the illegal access. Also, the remaining portion of the page for sizes less than a page is scrambled with some known values before giving the virtual address to the subsystem. When the subsystem returns the memory, we verify that the scrambled portion remains the same. If it has been written to, then we panic the kernel. This happens in cases when someone writes beyond their requested size. This situation is detected only when the subsystem releases the memory, and not at the exact instruction when someone wrote to this illegal location.

When tracing is enabled for memory leaks, we allocate the usual sizes, but every time, before we give the virtual address to the subsystem, we enter it into a log. When the address is released we remove it from the log. So the log consists of virtual addresses currently in use. If there is a drain of system memory we can analyze the log and pinpoint the culprit with the information in the log.

When tracing is enabled for general logging, all allocations and deallocations are tracked in a circular log.

Memory and Performance Degradation

When tracing is enabled, there is some memory and performance degradation.

When tracing for memory corruption, all sizes less than a page are allocated a page, so there is increased use of system memory. For example, a 256 byte allocation now uses a page (4096 bytes), so if a particular system uses the 256 byte size very heavily there is a considerable loss of memory. For normal to heavy workloads 64MB should be sufficient. But if, for instance, there is an application which memory maps 10,000 segments of 32 pages or less, it could use 10,000 256 byte allocations, which get allocated 10,000 pages (~40 MB). Since there is no way to determine the impact of this in advance, the best way is to enable it online, and if the system runs out of memory, add lots of memory and run it again. This should not be a problem in most cases and it is definitely worth enabling online until we hit the problem.

When tracing for memory leaks or general tracing, the default size for the log is 2MB each. The allocation sizes remain the same, so there is no memory issue as in the case of tracing memory corruption, and additional memory is never needed.

Whenever vmtrace is turned on (any mode), large pages will no longer be allocated for kernel memory, not even for bucket sizes not being traced. This can cause performance degradation on large memory systems.

VMTRACE - How to Use It (11.00 to 11.10)

By default the performance kernel does not do any tracing. However, the kernel contains the necessary code capable of online tracing of buckets. Tracing for chosen buckets and for a chosen type can be enabled through a tool called vmtrace, or by setting certain kernel global variables with adb and rebooting the kernel. The steps to be followed are:

Identify the bucket size from the dump. This can be done in two ways for memory corruption: given the data structure which is corrupted, find the size of the data structure, and this should give the bucket size. The second way is to use the function WhichBucket in the perl script kminfo.pl with q4. For memory leaks you can use the function Dbuckets in the perl script kminfo.pl with q4. If there is a bucket which has a large number of pages allocated for that size, then it is very probable that the corresponding bucket is leaking. For more information on how to use the perl script kminfo.pl, please refer to the documentation in the perl script.

Turn on vmtrace. There are several possible ways to do this:

Run the tracing tool vmtrace. This is an interactive tool which will prompt for bucket sizes and the type of tracing, i.e. whether you want to detect memory corruption, memory leaks, or just want to log all memory allocation/deallocation calls.


This tool can also be run with command line arguments, bypassing the menus entirely. This can be useful when you want to run it from a script. The syntax for this option is:

vmtrace -b <bucket map> -f <flags>

where bucket map is a bit map. The corresponding bits need to be set for the appropriate sizes being traced. The chart below shows the mapping of the bits to the corresponding bucket sizes. The bits are numbered from right to left starting with 0.

   BIT    Description
   0-4    ******** ( NOT VALID )
   5      32 byte bucket ( VALID )
   6      64 byte bucket ( VALID )
   7      128 byte bucket ( VALID )
   8      256 byte bucket ( VALID )
   9      512 byte bucket ( VALID )
   10     1024 byte bucket ( VALID )
   11     2048 byte bucket ( VALID )
   12     4096 byte bucket ( VALID )
   13     2 page bucket ( VALID )
   14     3 page bucket ( VALID )
   15     4 page bucket ( VALID )
   16     5 page bucket ( VALID )
   17     6 page bucket ( VALID )
   18     7 page bucket ( VALID )
   19     8 page bucket ( VALID )
   20     > 8 pages ( VALID )
   21-31  ******* ( NOT VALID )

The flags parameter should be an OR of the following values:

   1 = Tracing for Memory Corruption
   2 = Tracing for Memory Leaks
   4 = Tracing for Logging

For example, vmtrace -b 0x180 -f 0x1 would enable tracing of the 256 and 128 byte buckets for memory corruption. If you type vmtrace -b 0x100 -f 0x7, tracing would be enabled for the 256 byte bucket for memory corruption, memory leaks and general logging. Please note that for sizes up to a page the bit corresponds to the size, i.e. you can just OR the sizes.

If you prefer, you can turn on vmtrace by setting global variables and rebooting. You must do this to trace problems that occur before reaching the multi-user prompt. PLEASE NOTE that this tool should be used with caution. In the case of tracing for memory corruption, please understand the memory issues as described in the next section before tracing multiple buckets.

After the dump is taken, analyze the logs using the perl script vmtrace.pl in conjunction with Q4. Note that it is important to have the right version of this script to match the kernel being debugged. After tracing is enabled, the following symptoms would be observed:

Memory Corruption

For case 1, the kernel panics with "Data Fault" exactly at the location of the offending instruction which was accessing an address that was released earlier. By running the Q4 perl script vmtrace.pl you can look at a log and find the stack trace which released this address. This should be sufficient to help one find the bug easily.

For case 2, the kernel panics when it detects that someone wrote beyond their allocated size. The stack trace should show the location where the kernel released this memory. This does not give the exact offending instruction where someone wrote beyond their allocated size, but gives approximately which data structure was the offending one. Then one needs to match the corresponding MALLOC and find the bug through other variables. If this becomes hard, at least we know that this was the case, and the tracing mechanism in the kernel can be enhanced to detect this case more precisely.

For case 3, the kernel panics in either FREE() or in vmtrace_kfree(). In vmtrace_kfree() we print that there was no translation for the address. This is because the translation for the virtual address was removed by an earlier FREE(). If the kernel panic'd in FREE(), it would panic with a data fault. Again, by looking at the log one can find when it was released earlier and find the bug very easily.


Memory Leaks

In this case the symptom would be a drain in memory. After the system displays an appreciable loss of memory, the system should be brought to single user mode through the command "shutdown". After the system comes to single user mode, a dump should be taken with a "TOC". The dump can then be analyzed with the Q4 perl script vmtrace.pl to find the culprit. The output of the script should give a list of the outstanding MALLOC's with the corresponding stack traces. If there is one stack trace with very many entries in the log, it is the most probable culprit.

Caveats

When tracing for memory corruption, there could be fragmentation of the virtual address resource map called the sysmap. This could cause it to lose some virtual addresses. Losing virtual addresses does not mean losing physical memory. On 32 bit systems which have large physical memory (> 2GB), we could reach the virtual space limit (approx. 1GB) under very heavy load. If this limit is reached, the kernel will panic with the message "kalloc: out of kernel virtual space". For large 32-bit systems, do not trace multiple buckets if this panic happens. If this panic occurs even when tracing one bucket, the instrumentation needs to be enhanced for this customer to avoid this case.

There's also the possibility of vmtrace for corruption causing a small memory system to use so much extra physical memory that thrashing occurs, producing severe performance degradation.

There are multiple versions of the vmtrace.pl perl script. Some do not work with 64 bit kernels. Others do not work with pre-11.00 kernels.

Vmtrace in 11.11 or Later Kernels

VMTRACE - How It Works

Introduction

Kernel memory corruption has been a recurrent problem in HPUX. This occurs when one caller allocates a chunk of memory, and then either writes beyond the end of the chunk size requested, or frees the memory and continues to use it, or frees it more than once. This is likely to result in corrupting other structures allocated from the same kernel memory arena; sometimes it even spills beyond the specific arena and affects the superpage pool.

These problems can be extremely difficult to debug. The resulting panic is likely to occur long after the memory was corrupted, and may affect any code that uses the same arena, or code remotely connected with a user of that arena (such as something called with a parameter taken from an object allocated from that arena). Moreover, some arenas (particularly compatibility mode arenas like M_TEMP) have a very large number of users, not always even from the same subsystem. This can make diagnosis and triage rather difficult.

In pre-11.11 kernels the situation was even worse. There were no arenas; instead, all allocations of approximately the same size (to within a power of 2, e.g. 129-256 byte allocations) shared the same "bucket", and could corrupt each other.

Another common problem is that a subsystem allocates a chunk of kernel memory but never releases it. Repeated use of this path can lead to a drain on system memory and eventually lead to a system hang or very poor performance.

The arena allocator, when built in OSDEBUG mode, contains some mechanisms intended for fast detection and isolation of memory corruption problems. However, to supplement these, and to handle performance kernels and memory leaks, we have a kernel memory debugging tool called vmtrace.

Vmtrace is built into the regular kernel, whether debug or performance. It simply needs to be enabled, using a user space tool called, simply, "vmtrace", or (when looking for problems that occur early in system initialization) by setting certain kernel global variables with adb and rebooting.

If there is a memory corruption problem, the vmtraced system will panic with a stack trace showing the culprit. In the case of leaks, the system needs to be brought to single user mode and then a memory dump should be taken for further analysis of the vmtrace leak log.

Common Cases and their symptoms

Memory Corruption

This is the most common and the most difficult to debug. There are several ways in which memory corruption can happen.

Case 1 (Stale Pointer):


A subsystem allocates memory, uses it and then releases it. After releasing it, the subsystem continues to write to it. This causes corruption of other data structures which have currently been allocated this chunk of memory.

Case 2 (Overrun):
A subsystem allocates a chunk of memory but writes beyond the size allocated to it. This causes corruption of data structures which have chunks of memory adjacent to this chunk.

Case 3 (Double Free):
A subsystem allocates a chunk of memory, uses it and then releases it twice. This often causes corruption of the first word since it is used as a link in the free list. The symptoms can be drastic since the same chunk could be allocated to two subsystems simultaneously. It also could lead to severe damage of the free list.

Memory Leaks

In this situation a subsystem allocates a chunk of memory but does not release it when done. If there is continuous use of this path, there is a constant drain of memory which could lead to a hang or poor performance.

Tracing Mechanism

This mechanism does not change the kmem_arena_alloc(), kmem_arena_varalloc() or kmem_arena_free() routines or associated macros. So, this avoids any performance overhead when tracing is disabled. Instead, it uses the function pointer associated with each arena free list; if vmtrace is in use on that free list, the function pointer will be set, and a special vmtrace version of the function will be called.
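A minimal C sketch of this dispatch scheme follows; the structure and function names are invented for illustration and do not match the actual HP-UX kernel sources:

#include <stddef.h>

/* Hypothetical per-free-list hook: the common path pays nothing when
 * tracing is off, because the pointer is simply NULL. */
struct kmem_freelist {
    void *head;                                         /* free objects */
    void (*trace_free)(struct kmem_freelist *, void *); /* NULL = vmtrace off */
};

static void arena_free(struct kmem_freelist *fl, void *obj)
{
    if (fl->trace_free != NULL) {   /* vmtrace enabled on this free list */
        fl->trace_free(fl, obj);    /* divert to the vmtrace version */
        return;
    }
    *(void **)obj = fl->head;       /* normal fast path: push onto free list */
    fl->head = obj;
}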

Corruption Tracing Modes

There are 3 different modes for handling memory corruption. These are called, somewhat unimaginatively:

Light corruption mode

Standard corruption mode

Heavy corruption mode

The names are based on the effect on memory usage. The light mode uses only a little extra memory; standard mode uses rather more (at least a page per traced allocation) and heavy mode uses even more than that (at least 2 pages per traced allocation). In particular, the 3 corruption modes do the following:

Light Corruption Mode

This mode is pretty much the same as the built-in features of the OSDEBUG kernel, except that it's available in customer/performance kernels.

It detects double frees when the second free occurs.
It detects overruns, but only when the memory is freed.
It cannot detect underruns.
It cannot detect stale pointer references.

Each allocation has a few extra bytes of padding added at the end, with well known contents; these are checked on free to detect overrun. When memory is in use, a bit is set in the object header to indicate this. When it's freed, the bit is checked, and then cleared. This is used to detect double frees.
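The padding and in-use bit checks can be shown with a small user-space C sketch; the header layout, flag bit and 0xA5 fill pattern are assumptions made up for this example, not the kernel's actual values:

#include <stdio.h>
#include <stdlib.h>

#define PAD_BYTES 8
#define PAD_FILL  0xA5
#define F_IN_USE  0x1

struct objhdr { unsigned flags; size_t size; };

static struct objhdr *light_alloc(size_t size)
{
    struct objhdr *h = malloc(sizeof *h + size + PAD_BYTES);
    unsigned char *pad = (unsigned char *)(h + 1) + size;
    int i;

    h->flags = F_IN_USE;
    h->size  = size;
    for (i = 0; i < PAD_BYTES; i++)
        pad[i] = PAD_FILL;          /* well known contents after the object */
    return h;
}

static void light_free(struct objhdr *h)
{
    unsigned char *pad = (unsigned char *)(h + 1) + h->size;
    int i;

    if (!(h->flags & F_IN_USE)) {   /* in-use bit already clear? */
        fprintf(stderr, "panic: double free\n");
        abort();
    }
    h->flags &= ~F_IN_USE;
    for (i = 0; i < PAD_BYTES; i++)
        if (pad[i] != PAD_FILL) {   /* pad overwritten => overrun */
            fprintf(stderr, "panic: overrun detected on free\n");
            abort();
        }
    free(h);
}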

Standard Corruption Mode

This mode is particularly useful for detecting stale pointer use.

It detects double frees when the second free occurs.
It detects overruns, but only when the memory is freed.
It detects stale pointer use.

With this mode, sizes less than a page are increased to be a full page. The rest of the page is filled with well known contents; these are checked on free to detect overrun. When memory is freed, the page(s) are protected (made inaccessible) and placed on the free list. Any attempt to access this memory (via a stale pointer) before it is reallocated for some other use will cause a protection fault panic.

Heavy Corruption Mode

This mode is designed to diagnose overruns efficiently.


It also gives significantly more information in case of double frees and stale pointer references, and is the only mode that can give any information at all in case of underrun.

It detects double frees when the second free occurs.
It detects overruns immediately.
It detects stale pointer use.
It can detect some underrun errors.

With this mode, sizes are increased to whole page(s), and then an extra page is added. That extra page is protected (made inaccessible). The caller is given an object that ends just before the beginning of the last page, so that an overrun will immediately panic with a protection fault. Known contents are written to the part of the page before the object and its object header; these might possibly detect an underrun problem.

When memory is freed, its translation is deleted. The physical memory is returned to the physical memory allocator, but the virtual addresses are placed in an aging queue, along with information (stack trace, etc.) about when they were freed. A stale pointer reference will cause a data page fault panic; its stack trace will show where the stale reference was made, and the information in the aging queue can be used to find out when the memory was freed.

If a double free occurs, a panic will occur with vmtrace_free_heavy_corruption() in the stack trace, possibly when attempting to delete the non-existent translation, or possibly earlier, e.g. a data page fault in the destructor function.

If an underrun occurs, it might simply corrupt the object header, in which case there is little useful the code can do. But if it corrupts the early part of the page (before the object header), it will be detected when the object is freed. (This isn't an especially good method, but underruns are extremely rare.)
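The guard-page idea behind heavy mode can be demonstrated in user space with mmap() and mprotect(); this is only an analogy to the kernel mechanism described above, not the vmtrace implementation itself:

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long pg = sysconf(_SC_PAGESIZE);
    char *base, *obj;

    /* two pages: the first backs the object, the second is the guard */
    base = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return 1;
    mprotect(base + pg, pg, PROT_NONE);  /* make the guard page inaccessible */

    obj = base + pg - 256;  /* a 256 byte object ending right at the guard */
    obj[255] = 'x';         /* last valid byte: fine */
    obj[256] = 'x';         /* one byte past the end: faults immediately */
    return 0;
}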

Other Modes

There are also 2 other modes, which may be enabled individually, or in combination with each other and/or any single corruption mode.

Leak Mode

When tracing is enabled for memory leaks, we allocate the usual sizes, but every time, before we give the virtual address to the subsystem, we enter it into a log. When the address is released we remove it from the log. So the log consists of virtual addresses currently in use. If there is a drain of system memory we can analyze the log and pinpoint the culprit with the information in the log.

Each log record also contains information about the allocation, in particular the stack trace of the code that called the allocation function. To improve performance (when searching for the address being freed), the log is organized as a hash table. This has the side effect that when printing out addresses allocated but not yet freed, the addresses are printed in no particular order.
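A minimal user-space C sketch of such a hash-organized leak log follows; the record layout and hash function are invented for illustration (the real log also records the size, arena and allocating stack trace):

#include <stdlib.h>

#define NBUCKET 1024

struct leakrec {
    void           *addr;   /* address handed to the subsystem */
    struct leakrec *next;
};

static struct leakrec *leaklog[NBUCKET];

static unsigned addr_hash(void *addr)
{
    return (unsigned)(((unsigned long)addr >> 4) % NBUCKET);
}

static void log_alloc(void *addr)   /* record before handing memory out */
{
    struct leakrec *r = malloc(sizeof *r);

    r->addr = addr;
    r->next = leaklog[addr_hash(addr)];
    leaklog[addr_hash(addr)] = r;
}

static void log_free(void *addr)    /* drop the record when freed */
{
    struct leakrec **pp = &leaklog[addr_hash(addr)];

    while (*pp && (*pp)->addr != addr)
        pp = &(*pp)->next;
    if (*pp) {
        struct leakrec *r = *pp;
        *pp = r->next;
        free(r);
    }
}

Anything still in the table when the dump is analyzed is an outstanding allocation, and walking the buckets naturally yields those addresses in no particular order, which matches the behavior noted above.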

Logging Mode

When tracing is enabled for general logging, all allocations and deallocations are tracked in a circular log.

To reduce lock contention, each cpu has its own log. Allocations and frees are recorded in the log corresponding to the cpu where the allocation was performed, even though the memory may have been freed on some other cpu. This corresponds to the arena free list used, as these free lists are also organized on a per-cpu basis to reduce lock contention.

Memory and Performance Degradation

When tracing is enabled, there is some memory and performance degradation.

Memory Usage

When tracing for memory corruption, standard and heavy corruption modes use significantly more memory for each allocation. An allocation that normally uses 256 bytes would use a whole 4096 byte page in standard corruption mode, and two of these pages (8192 bytes) in heavy corruption mode. This can add up fast.

On performance kernels, light corruption mode uses slightly more memory for each allocation. (This is the same increase seen between performance and OSDEBUG kernels.)

Leak Mode, General Logging Mode, and Heavy Corruption Mode all allocate logs or buffers to store the information they record. Leak mode uses 2 MB for its log records, plus a little extra for its hash table. General logging mode uses 2 MB for each cpu. Heavy corruption mode allocates 4096 aging records per cpu; their size depends on the CPU architecture. On a 64 bit PA-RISC system they are presently 136 bytes each, for a total of approx. 0.5 MB per cpu, in addition to the extra memory used for each allocation.

All of these sizes are only defaults; they can be changed by adb'ing kernel global variables and rebooting.


Large Pages

In 11.00 and 11.10 kernels, any use of vmtrace disabled the use of large pages for all kernel memory allocations, whether or not they were traced, even those not involving MALLOC. This had significant performance impact, especially on systems with many cpus.

In 11.11, large pages are only disabled for allocations where this is needed, in particular allocations being traced in standard or heavy corruption mode. Allocations from other arenas are not affected. This greatly reduces lock contention; when the lower level allocator allocates a kernel virtual address (needed for every small page allocation), it passes through a choke point where a single lock (not a per cpu lock) is held across an inefficient algorithm. It also reduces kernel memory fragmentation and the risk of overflowing the sysmap, which were problems with the earlier versions of vmtrace.

VMTRACE - How to Use It

By default the performance kernel does not do any tracing. However, the kernel contains the necessary code capable of online tracing of arenas. Tracing for chosen arena(s) can be enabled through a tool called vmtrace, or by setting certain kernel global variables with adb and rebooting the kernel. The steps to be followed are:

Determine the mode(s) to use. There are several modes available; the appropriate mode(s) depend on the problems to be diagnosed, and to some extent on the platform and scenario that replicates the problem. See the scenarios below for details.

Identify the arena(s) to be traced. See the scenarios below for examples of how to do this.

Determine the size to be traced. Most of the time, it is simplest to trace all sizes (enter zero in response to the query about the size to be traced); if tracing a heavily used variable sized arena in a situation where you believe only one size is affected, you might want to specify the size.

Run the tracing tool vmtrace. This is an interactive tool. See How to Use the Vmtrace Tool below for details. If you prefer, you can turn on vmtrace by setting global variables and rebooting. You must do this to trace problems that occur before reaching the multi-user prompt. After the dump is taken you will probably need to analyze the logs using the perl script vmtrace.pl in conjunction with Q4.

PLEASE NOTE that running vmtrace can have a number of possible effects on the system. Depending on the mode, it can significantly increase kernel memory usage, increase kernel memory fragmentation, and/or introduce additional lock contention into the kernel memory allocation path. It should be used with caution.

Scenarios

Here are some likely problems, and how to use vmtrace to help debug them.

Stale Pointer

This is the situation when some code frees memory, but retains a pointer to it, which is later used. In some cases, by the time the stale pointer is used, the memory may already have been reallocated for some other purpose. Depending on the code, you may want information on the code that used the stale pointer, the code that freed the memory, and/or (least common) the code that allocated it originally.

You usually discover this situation because of an assertion failure or other panic, traced back to a corrupt dynamically allocated data structure, with contents that appear to be reasonable ... but incorrect. Alternatively, it is noticed while running "standard" or "heavy" corruption mode, which will panic on the access rather than on the data contents.

To get information on this, run heavy or standard corruption mode. Both panic when the stale pointer is used, but heavy mode will store information on the code that did the free, as well as producing a panic stack trace that points at the code which uses the stale pointer. If you can panic reliably without vmtrace's assistance, you don't absolutely have to run either of these modes.

If you need information about the stack trace that freed the memory, run heavy corruption mode or general logging, and extract the information from the core with vmtrace.pl.

If you need information about the original allocator of the abused memory, run general logging mode, and extract the information from the core with vmtrace.pl.


Overrun

This is the situation when some not so clever piece of code allocates fewer bytes of memory than it needs, and then writes beyond the end of the area allocated. Any corruption mode will detect this and panic when the memory is freed. This includes "light" corruption mode and the built-in debugging features of OSDEBUG kernels. Sometimes you can determine what's wrong simply by looking at the memory contents.

In the more common case, you want to panic when the memory is actually overwritten. To do this, run heavy corruption mode; it will panic with a stack trace pointing to the culprit. If you want to know where the memory was allocated, run general logging mode, and extract the information from the core with vmtrace.pl. (Or you could also get the same information from leak mode.)

If all you care about is where the memory was allocated, you could simply use general logging mode, and rely on (e.g.) the OSDEBUG kernel features to panic when the overwritten memory is freed. But this is unlikely to be very useful.

Underrun

Double Free

Leak

How to Use the Vmtrace Tool

There are two steps to turning on vmtrace. They must be done in the correct order.

WARNING
Once you select vmtrace modes, the only way to change them is to reboot. If this is inconvenient, be sure you know what modes you really want before using the vmtrace utility. Also, due to a defect (in the tool, not the kernel), once you select the mode(s) you should not exit the vmtrace tool until you are certain that you will not want to enable tracing on any additional arenas. (Yes, the fix is well known and obvious. However, actually making the change hasn't been given sufficient priority by management.)

Main Vmtrace Menu

1) Start vmtrace (do this before selecting any arenas)
2) Select an arena to trace
3) Disable vmtrace (prevents additional memory consumption ONLY)
4) Reenable vmtrace
5) DONE

Pick option 1 first. This is where you select modes. If you haven't done this, anything else you attempt will fail with EINVAL.

Selecting Modes in the Vmtrace Tool

Enter action desired [ 1- 5]> 1
1) Lightweight Corruption
2) Standard Corruption
3) Heavyweight Corruption
4) Leak Detection
5) General Logging
6) DONE

See the discussion above to determine which modes to select. Select modes one at a time by number. When finished, select DONE.

Once you have done this, your next step will be to select arena(s) to trace.

Warning
Once you select any modes and leave this menu, you cannot select additional modes, or change the ones already selected, without rebooting.

Selecting Arenas in the Vmtrace Tool

Enter action desired [ 1- 5]> 2
Enter arena name (hit return for none) > M_TEMP
Enter allocation size to trace (zero for all supported sizes) >


Known Problems

In HP-UX 10.30, and HP-UX 11.00 without PHKL_17038, tracing for leaks can cause corruption on multi-processor systems. The workaround is to trace for both corruption and leaks, rather than tracing for leaks alone.

In releases prior to HP-UX 11.00, vmtrace does not work on machines which support superpages (i.e. Mohawk). On Mohawk machines, for both the 32bit kernel and the 64bit kernel, you need to do the following through adb and reboot the kernel:

adb -w /stand/vmunix
kas_force_superpages_off?W 0x1
$q

In the HP-UX 10.20 release (a.k.a. Davis), vmtrace could cause panics on MP systems. You need the patches PHKL_8376 (Series 700) or PHKL_8377 (Series 800) installed on the system.

Getting Help

The HPUX kernel virtual memory (VM) group maintains vmtrace. The CHART project is hpux.kern.vm. The VM group's policy is to have designated owners of all VM code. The current list of VM owners can be found at http://integration.cup.hp.com/~cmather/ownership_table.html. At present (07/06/00) the owner of vmtrace is Arlie Stephens. She can be reached at:

Arlie Stephens
ESTL Lab
Hewlett-Packard Co., MS 47LA2
Cupertino, California 95014

Vmtrace for the arena memory allocator

Usage of the vmtrace utility on 11.11 and later releases is very similar to the usage on the McKusick-based bucket allocator OS releases. The main difference is the use of arena names instead of bucket sizes.

Let's simply start by looking at the vmtrace menus again:

# ./vmtrace_11.11
1) Start vmtrace (do this before selecting any arenas)
2) Select an arena to trace
3) Disable vmtrace (prevents additional memory consumption ONLY)
4) Reenable vmtrace
5) DONE
Enter action desired [ 1- 5]> 1

As we are told by the tool, we first start vmtrace, enter 1:

1) Lightweight Corruption
2) Standard Corruption
3) Heavyweight Corruption
4) Leak Detection
5) General Logging
6) DONE
Enter type of tracing (one per prompt) [ 1- 6]> 4

Here we are going for leak detection, as we would do for investigating a kernel memory leak.

Enter type of tracing (one per prompt) [ 1- 6]> 4
Enter type of tracing (one per prompt) [ 1- 6]> 6
Enabling vmtrace
1) Start vmtrace (do this before selecting any arenas)
2) Select an arena to trace
3) Disable vmtrace (prevents additional memory consumption ONLY)
4) Reenable vmtrace
5) DONE
Enter action desired [ 1- 5]>


Now we are going to select the arena we want to trace:

Enter action desired [ 1- 5]> 2
Enter arena name (hit return for none) > M_TEMP
Enter allocation size to trace (zero for all supported sizes) > 0
Enabling vmtrace on arena M_TEMP
Enter arena name (hit return for none) >
1) Start vmtrace (do this before selecting any arenas)
2) Select an arena to trace
3) Disable vmtrace (prevents additional memory consumption ONLY)
4) Reenable vmtrace
5) DONE
Enter action desired [ 1- 5]> 5
#

We have selected the M_TEMP arena and then left the vmtrace tool. Remember that you can now easily disable vmtrace without the need to reboot the system. Tracing and logging can later be re-enabled too. There is one caveat: tracing modes cannot be changed on the fly; you will need to reboot for this. So please make up your mind what exactly to do before running vmtrace.

Selecting a corruption detection mode is not that simple. Deciding that a certain crash was caused by a corruption issue usually requires some experience in reading dumps.

Tusc

NOTE:

Tusc is not an official HP product, and therefore is not currently supported by HP.

It is available along with the other GR8 tools at ftp://hpchs.cup.hp.com/tools/11.X

If you intend to run the kitrace tool, GR8 advises running tusc first.

tusc works with HP-UX 11.* PA-RISC systems, and HP-UX 11i Version 1.5 (Itanium® Processor Family) systems. It is not supported on HP-UX 10.20.

tusc is a great program for working with Java. It gives you another view into the system activity in addition to Java stack traces, GlancePlus, and HPjmeter. tusc has many options, which can be displayed with the command tusc -help.

Below you'll find a list of the available options, plus a few examples of using tusc for debugging and performance tuning.

Below is the output from tusc -help:

Usage: tusc [-<options>] <command [args ...]> -OR- <pid [pid ...]>
   -a: show exec arguments
   -A: append to output file
   -b bsize: dump 'bsize' max bytes (-r/-w)
   -c: count syscalls instead of printing trace
   -d [+][!][fd | all]: select only syscalls using fd
   -e: show environment variables
   -E: show syscall entries
   -f: follow forks
   -F: show kernel's ttrace feature level
   -g: don't attach to members of my session
   -h: show state of all processes when idle
   -i: don't display interruptible syscalls
   -I start[/stop]: single-step and show instructions
   -k: keep alive (wait for *all* processes)
   -l: print lwpids
   -n: print process names


   -o [file|fd]: send trace output to file or fd
   -p: print pids
   -Q: be quiet about some warnings
   -r [!][fd | all]: dump read buffers
   -R: show syscall restarts
   -s [!]syscalls: [un]select these syscalls
   -S [!]signals: [un]select these signals
   -t: detach process if it becomes traced
   -T timestamp: print time stamps
   -u: print user thread IDs (pthreads)
   -v: verbose (some system calls only)
   -V: print version
   -w [!][fd | all]: dump write buffers
   -x: print raw (hex) arguments
   -z: only show failing syscalls
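For instance, to attach to a running process, follow any children it forks, print lwp ids and basic timestamps, and send the trace to a file, one might combine the options above like this (the pid is illustrative):

$ tusc -f -l -T "" -o /tmp/java.tusc 1234
$ tusc -c 1234        # or just count syscalls instead of tracing them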

Here are a few examples of debugging and performance tuning information you can see with tusc.

thread.interrupt

The Thread.interrupt() call is implemented using SIGUSR1. Hence, if you see SIGUSR1 in the tusc output, the program must be making Thread.interrupt() calls. You can confirm this by making an -Xeprof trace and viewing the data with HPjmeter. It's not necessarily good or bad to use Thread.interrupt(), but you can monitor it with tusc, and it may be helpful information in various performance or correctness situations. Here is an example of a Thread.interrupt(). Threads are identified by their lwp id, shown in the second column. Thread 19729 interrupts thread 19731 with the signal.

1008628500.571138 {19731} write(1, "\n", 1) ...................... = 1
1008628500.571337 {19731} gettimeofday(0x6c258910, NULL) ......... = 0
1008628500.571444 {19731} clock_gettime(CLOCK_REALTIME, 0x6c258a40) = 0
1008628500.571625 {19729} _lwp_kill(19731, SIGUSR1) .............. = 0
1008628500.571737 {19731} Received signal 16, SIGUSR1, in ksleep(), [caught]
1008628500.571757 {19731} Siginfo: sent by pid 10468 (uid 565), si_errno: 0
1008628500.571939 {19731} ksleep(PTH_CONDVAR_OBJECT, 0x1fde70, 0x1fde78, 0x6c258908) = -EINTR
1008628500.572143 {19731} gettimeofday(0x6c258910, NULL) ......... = 0
1008628500.572258 {19801} ksleep(PTH_MUTEX_OBJECT, 0xaae8, 0xaaf0, NULL) = 0
1008628500.572438 {19731} clock_gettime(CLOCK_REALTIME, 0x6c258a40) = 0
1008628500.572522 {19801} kwakeup(PTH_CONDVAR_OBJECT, 0x309580, WAKEUP_ALL, 0x6b6c1848) = 0
1008628500.572611 {19802} ksleep(PTH_CONDVAR_OBJECT, 0x309580, 0x309588, 0x6b640908) = 0
1008628500.572704 {19729} kwakeup(PTH_MUTEX_OBJECT, 0xaae8, WAKEUP_ONE, 0x6c2d978c) = 0
1008628500.572800 {19778} sched_yield() .......................... = 0

Here we used -T "" and -l to show the timestamp in basic format and the lwp id. This time we happened to interrupt a thread sleeping on a pthread_cond_wait call. You can see how it wakes up with EINTR. This will cause an InterruptedException in the Java program.

implicit null pointer checks

The hotspot-compiled code uses SIGSEGV and SIGBUS to implement implicit null pointer checks, which result in NullPointerExceptions in the Java application (for example, when trying to perform a method dispatch while the "this" pointer is null). To a Java programmer, it is not particularly important whether the exceptions come from interpreted or compiled code, but it is helpful to understand the distinction for performance tuning. If SIGSEGV or SIGBUS appears in the output, the program must be throwing these exceptions from a frequently called method that has been compiled. The interpreter uses SIGFPE for its null pointer checks; if SIGFPE appears in the output, the program is causing these exceptions from interpreted methods. The JVM is designed to execute the normal, non-exception-throwing case as fast as possible, but the exception-throwing case is quite expensive, so to get good performance it is important to eliminate any extra exceptions caused by careless coding. You can use tusc to detect whether this is happening: with correct exception-handling routines in your program you might never notice the exceptions, but you would get lower overall performance than you could otherwise achieve.
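As a hypothetical sketch of the careless coding meant here (class and method names are invented for illustration), the pattern to avoid is using a caught NullPointerException as control flow:

    public class NpeControlFlow {
        // Expensive: every null argument trips the implicit null check,
        // raising SIGSEGV (compiled) or SIGFPE (interpreted) and throwing
        // a full NullPointerException.
        static int lengthOrZeroSlow(String s) {
            try {
                return s.length();      // implicit null check happens here
            } catch (NullPointerException e) {
                return 0;               // handled, so easy to miss, but costly
            }
        }

        // Cheap: an explicit test avoids the exception entirely.
        static int lengthOrZeroFast(String s) {
            return (s == null) ? 0 : s.length();
        }

        public static void main(String[] args) {
            System.out.println(lengthOrZeroSlow(null));   // 0, via an exception
            System.out.println(lengthOrZeroFast(null));   // 0, no exception
        }
    }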

read(24, "\0q \0\006\0\0\0", 8) .......................... = 8
send(42, "\0\0\01706e \0\0\098a5\0\098a5\0".., 23, 0) .... = 23
sigsetstatemask(0x17, NULL, 1135461680) .................. = 0
read(24, "\0\006020105\001\n\0\a\0810102c1".., 105) ...... = 105
  Received signal 11, SIGSEGV, in user mode, [0xbaf215a2], partial siginfo
  Siginfo: si_code: I_NONEXIST, faulting address: 0x8, si_errno: 0
  PC: 0xb92eb3c7, instruction: 0x0eb01096
sigsetstatemask(0x17, NULL, 1135460976) .................. = 0
  Received signal 8, SIGFPE, in user mode, [0xbaf215a2], partial siginfo
  Siginfo: si_code: I_COND, faulting address: 0x1132ab, si_errno: 0
  PC: 0x1132ab, instruction: 0x0a6024c0
send(24, "\00f\0\006\0\0\0\0\00314\00114", 15, 0) ........ = 15
  Received signal 11, SIGSEGV, in user mode, [0xbaf215a2], partial siginfo
  Siginfo: si_code: I_NONEXIST, faulting address: 0x8, si_errno: 0
  PC: 0xb8d73d4b, instruction: 0x0cd01096
sigsetstatemask(0x17, NULL, 1135461104) .................. = 0

In this output, we are not showing the lwp id or timestamp. Here the program has thrown a couple of exceptions in a row. The SIGSEGV will result in NullPointerExceptions from hotspot-compiled code, and the SIGFPE will result in a NullPointerException from interpreted Java code. To get the best performance, avoid throwing exceptions whenever possible: throwing one costs thousands of machine instructions, while it usually takes little effort up front to prevent it. You can measure the count of such exceptions happening at runtime with tusc, then use HPjmeter to determine where they are happening. To determine the source of the NullPointerExceptions, make an -Xeprof trace and view the data with HPjmeter, which has built-in features for examining exception handling.
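For example (a hypothetical invocation using flags from the tusc -help output above; the pid and file name are placeholders), you could capture a trace from a running JVM and count the signals afterwards:

    tusc -l -T "" -o /tmp/jvm.trace <jvm-pid>
    grep -c SIGSEGV /tmp/jvm.trace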

return values

Another great thing about tusc is that you can see the system call return values.

close(34) ................................................ = 0
send(20, "\00f\0\006\0\0\0\0\00314\00115", 15, 0) ........ = 15
sigsetstatemask(0x17, NULL, 1134403888) .................. = 0
read(6, 0x43852a68, 5) ................................... ERR#11 EAGAIN
sigsetstatemask(0x17, NULL, 1133867888) .................. = 0
poll(0x4a1af0, 6, -1) .................................... = 1

Here we can see that the read returned with EAGAIN. This kind of information may be useful for diagnosing various problems. Below we can see a new thread being created. First comes the mmap for the thread stack, then the _lwp_create happens. Lastly, sigaltstack() installs the signal stack on the new thread.

1008628500.575606 {19792} sched_yield() .......................... = 0
1008628500.575784 {19792} sched_yield() .......................... = 0
1008628500.575952 {19792} sched_yield() .......................... = 0
1008628500.576072 {19616} mmap(NULL, 528384, PROT_READ|PROT_WRITE, \
    MAP_PRIVATE|MAP_ANONYMOUS, 0, NULL) = 0x6a8a5000
1008628500.576197 {19792} sched_yield() .......................... = 0
1008628500.576312 {19616} mprotect(0x6a925000, 4096, PROT_NONE) .. = 0
1008628500.576424 {19792} sched_yield() .......................... = 0
1008628500.576588 {19792} sched_yield() .......................... = 0
1008628500.576753 {19616} _lwp_create(0x77ff1c00, \
    LWP_DETACHED|LWP_INHERIT_SIGMASK|LWP_USER_TID, \
    0x471fd4, 0x77ff20d8) = 0 (19934)
1008628500.576948 {19934} _lwp_self() ............................ = 19934
1008628500.577179 {19792} sched_yield() .......................... = 0
1008628500.577274 {19616} kwakeup(PTH_MUTEX_OBJECT, 0xaae8, WAKEUP_ONE, \
    0x77ff1a8c) = 0
1008628500.577365 {19869} ksleep(PTH_MUTEX_OBJECT, 0xaae8, 0xaaf0, \
    NULL) = 0
1008628500.577462 {19896} kwakeup(PTH_CONDVAR_OBJECT, 0x45ee20, \
    WAKEUP_ALL, 0x6acad848) = 0
1008628500.577552 {19897} ksleep(PTH_CONDVAR_OBJECT, 0x45ee20, 0x45ee28, \
    0x6ac2c908) = 0
1008628500.577663 {19778} sched_yield() .......................... = 0
1008628500.577769 {19934} sigaltstack(0x6a8a50b8, NULL) .......... = 0
1008628500.577881 {19792} sched_yield() .......................... = 0
1008628500.578008 {19616} sched_yield() .......................... = 0

The new thread is 19934, and the first thing it does is call _lwp_self(). Remember that the lwp id is also shown in Java stack traces, and in GlancePlus in the Process Thread List window, so you can correlate the tusc data with other data.
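For context, the Java-side code behind a trace like this is just an ordinary thread start; a minimal sketch (class name is illustrative, and the exact syscall sequence is the JVM and pthread library's doing, not guaranteed by Java):

    public class ThreadStartDemo {
        public static void main(String[] args) {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    // runs on the new LWP; its lwp id (19934 in the trace
                    // above) also appears in Java stack dumps and in the
                    // GlancePlus Process Thread List window
                }
            });
            worker.start();   // behind this call: mmap() for the stack,
                              // _lwp_create(), then sigaltstack() on the new LWP
        }
    }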

This course was written by James Tonguet, 1/20/04.

References

Process Management Whitepaper

Online JFS 3.3 Guide

HP Process Resource Manager User's Guide

The HPUX Buffer Cache, by Jan Weaver, Crisis Management Team Engineer

The HFS Inode Cache and The JFS Inode Cache, by Mark Ray, GR8 Engineer

Tusc review, by Eric Caspole, Enterprise Java Lab

Performance Optimized Page Sizing in HP-UX 11.0 White Paper (KBAN00000849), by Ute Kavanaugh, Response Center Engineer

By James Tonguet, Response Center Engineer:
    Introduction to Performance Tuning (UPERFKBAN00000726)
    Configuring Device Swap (KBAN00000218)
    How to configure device swap in VxVM (VXVMKBRC00005232)
    How to use adb to get system information (KBRC00004523)
    Using GlancePlus for CPU metrics (UPERFKBAN00001048)

The McKusick & Karels bucket allocator and The Arena Allocator, by Markus Ostrowicki, GSE Engineer: http://wtec.cup.hp.com/~hpux/crash/FirstPassWeb/PA/contents.htm
