CPU Monitoring and Tuning SMT
Introduction
AIX 5L Version 5.3 is the latest version of the AIX operating system that offers
simultaneous multi-threading (SMT) on eServer p5 systems to deliver industry leading
throughput and performance levels. With support for advanced virtualization, AIX 5L
Version 5.3 helps you to dramatically increase your server utilization and consolidate workloads for more efficient management.
A review of computing history and operating systems shows that computer scientists have
developed many CPU scheduling policies. First-in, first-out (FIFO), shortest job first, and
round robin are just a few. Scheduling policies are important because a single policy might
not be best suited to all applications. Some applications in certain workloads can run well in a
default scheduling policy. However, the same applications with a different workload might
require a scheduling policy adjustment in order to achieve the optimal performance.
Note: This article is an update for AIX 5.3 performance. Advanced virtualization is not discussed in this article. The article has enhancements and updates to emphasize AIX 5L Version 5.3 features, tools, and capabilities.
What is SMT?
SMT is the ability of a single physical processor to concurrently dispatch instructions from
more than one hardware thread. In AIX 5L Version 5.3, a dedicated partition created with one
physical processor is configured as a logical two-way by default. Two hardware threads can
run on one physical processor at the same time. SMT is a good choice when overall
throughput is more important than the throughput of an individual thread. For example, Web
servers and database servers are good candidates for SMT.
Viewing processor and attribute information
By default, SMT is enabled, as shown in Listing 1 below.
Listing 1. SMT
# smtctl
This system is SMT capable.
SMT is currently enabled.
SMT threads are bound to the same physical processor.
Proc0 has 2 SMT threads
Bind processor 0 is bound with proc0
Bind processor 2 is bound with proc0
Proc2 has 2 SMT threads
Bind processor 1 is bound with proc2
Bind processor 3 is bound with proc2
# lsattr -El proc0
frequency 1656376000 Processor Speed False
smt_enabled true Processor SMT enabled False
smt_threads 2 Processor SMT threads False
state enable Processor state False
type PowerPC_POWER5 Processor type False
The smtctl command provides privileged users and applications the ability to control the
utilization of processors with SMT support. With this command, you can turn SMT on or off.
The smtctl command syntax is:
smtctl [-m off | on [ -w boot | now] ]
What are shared processors?
Shared processors are physical processors that are allocated to partitions on a timeslice basis.
You can use any physical processor in the shared processor pool to meet the execution needs
of any partition using the shared processor pool. An eServer p5 system can contain a mix of
shared and dedicated partitions. A partition must be all shared or all dedicated, and you can
not use dynamic LPAR (DLPAR) commands to change between the two. You need to bring
down the partition and switch it from using dedicated to shared, or vice versa.
Processing units
After a partition is configured, you can assign it an amount of processing units. A partition
must have a minimum of 1/10 of a processor. After that requirement has been met, you
can configure processing units at a granularity of 1/100 of a processor. A partition that uses
shared processors is often called a shared partition. A dedicated partition is one that uses dedicated processors.
Each partition is configured with a percentage of execution dispatch time for each 10
milliseconds (ms) timeslice. For example:
A partition with 0.2 processing units is entitled to 20 percent capacity during each timeslice.
A partition with 1.8 processing units is entitled to 18 ms of processing time for each 10 ms timeslice (using multiple processors).

There is no accumulation of unused cycles. If a partition does not use its entitled processing capacity, the excess processing time is ceded back to the shared processing pool.
Partitions with shared processors are either capped or uncapped. A capped partition is
assigned a hard limit on capacity. An uncapped partition that needs extra CPU cycles (more
than its total processing units) can utilize unused capacity in the shared pool.
Scheduling algorithms
AIX 5 implements the following scheduling policies: FIFO, round robin, and a fair round
robin. The FIFO policy has three different implementations: FIFO, FIFO2, and FIFO3. The
round robin policy is named SCHED_RR in AIX, and the fair round robin is called
SCHED_OTHER. We discuss these policies in greater detail in the upcoming sections.
Scheduling policies can have a major impact on system performance (response time and
throughput), depending on how one assigns and manages them. For example, FIFO is a good
choice for a job that uses a lot of CPU, but it also can choke out all of the other jobs waiting
in line. A basic round robin gives a "timeslice" or "quantum" to each job in a time-shared
manner. As a result, it tends to discriminate against I/O-intensive tasks, since those tasks
often give up CPU voluntarily due to I/O wait. The fair round robin is "fair" because
scheduling priorities change as the jobs accumulate quantums of CPU time during execution.
This allows the operating system to demote a CPU hugger so that an I/O bound job has a fair
chance to use the CPU resource.
Let's go over two important concepts before getting into the scheduling details: the nice value
and the AIX priority and run queue structure.
The nice and renice commands
AIX has two important scheduling commands: nice and renice. A user job in AIX carries a
base priority level of 40 and a default nice value of 20. Together, these two numbers form
the default priority level of 60. This value applies to most of the jobs you see in a system.
When you start a job with a nice command, such as nice -n 10 myjob, the number 10
becomes the delta_NICE. This number is added to the default 20 to create the new nice
value of 30. In AIX, the higher this number, the lower the priority. Using this example, your
job now starts with a priority of 70, which is 10 levels worse in priority than the default.
The renice command applies to a job that has already started. For example, the renice -n
5 -p 2345 command causes process 2345 to have a nice value of 25. Note that the renice
value is always applied to a base nice of 20, regardless of the current nice value of the
process.
AIX priority and run queue structure
A thread carries a priority range from 0 to 255 (the range is from 0 to 127 on systems prior to
AIX 5). Priority 0 is the highest, or most favorable, and 255 is the lowest, or least
favorable. AIX maintains the run queue as a 256-level priority queue to efficiently support the 256 priority levels of threads.
AIX also implements a 256-bit array to map to the 256 levels of the queue. If a particular
queue level is empty, the corresponding bit is set to 0. This design allows the AIX scheduler
to quickly identify the first non-empty level and start the first ready-to-run job in that level.
See the AIX run queue structure in Figure 1 below.
Figure 1. Scheduler run queue
In Figure 1, the scheduler maintains a run queue of all the threads that are ready to be dispatched. All dispatchable threads of a given priority occupy consecutive positions in the
run queue.
AIX 5L implements one run queue for each CPU and a global queue. For example, there are
32 run queues and one global queue in an eServer pSeries p590 machine. With per-CPU
run queues, a thread has a better chance of returning to the same CPU after a preemption, which
is an affinity enhancement. Also, the contention among CPUs to lock the run queue structure
is much reduced with multiple run queues.
However, in some situations, a multiple run queue structure might not be desirable.
Exporting the system environment variable RT_GRQ=ON causes a thread to be placed on
the global run queue when it becomes runnable. This can improve performance for threads
that are interrupt-driven and running SCHED_OTHER. If schedo -o fixed_pri_global=1 is run on AIX 5L Version 5.2 and later, threads running at a fixed priority are placed on the global run queue.
For local run queues, the dispatcher picks the best priority thread in the run queue when a
CPU is available. When a thread has been running on a CPU, it tends to stay on that CPU's
run queue. If that CPU is busy, then the thread can be dispatched to another idle CPU and
assigned to that CPU's run queue.
FIFO
Although the FIFO policy is the simplest, it is rarely used because of its non-preemptive
nature. A thread with this scheduling policy runs all the way to completion, unless one of the
following happens:
It gives up the CPU voluntarily by executing a function that would put the thread to sleep, such as sleep() or select().
It gets blocked due to resource contention.
It has to wait for I/O completion.
The checkout lane at a grocery store uses a typical FIFO policy. Imagine yourself in the
checkout lane with only one TV dinner (and you're hungry), but the person in front has a full
load in his cart. What can you do? Not much. Since this is a FIFO, you must wait patiently
for your turn.
Similarly, it is obvious that job response time can suffer severely if several tasks are running
FIFO mode in AIX. Consequently, FIFO is rarely used in AIX. Only a process owned by root
can set itself or another thread to FIFO with the thread_setsched() system call.
There are two variations of the FIFO policy: FIFO2 and FIFO3. FIFO2 says that a thread is
put at the head of its run queue if it was asleep for only a short period of time less than a
predefined number of ticks (affinity_lim ticks, tunable with the schedo -p command).
This allows a thread to have a good chance to reuse the cache content. For FIFO3, a thread is
always put at the head of the queue when it becomes runnable.
Round robin
The well-known round robin scheduling policy is even older than UNIX itself. AIX 5L implements round robin on top of its multilevel priority queue of 256 levels. At a given
priority level, a round robin thread shares the CPU timeslices with all other entries of the
same priority. A thread is scheduled to run until one of the following occurs:
It yields the CPU to other tasks.
It is blocked for I/O.
It uses up its timeslice.
When the timeslice is exhausted, if a thread of equal or better priority is available to run on
that CPU, the thread that is currently running is then placed at the end of the queue for the
next turn to own the processor. A thread can be preempted because of a higher priority job waking up or a device interrupt (for example, after an I/O is done).
For a round robin task only, this preempted thread is placed at the beginning of its queue
level, because AIX wants to ensure that a round robin job has a full timeslice before it is
moved to the end of the round robin chain. It is important to note that the priority of a round
robin thread is fixed and does not change over time. This makes the priority of a round robin
task persistent (as opposed to the changing priorities in fair round robin) and more
predictable.
Since a round robin thread has special status, only root can set a thread to run with the round
robin scheduling policy. To set SCHED_RR for a thread, use one of the following application programming interfaces (APIs): thread_setsched() or setpri().
SCHED_OTHER
This last scheduling policy is also the default. While trying to establish the fairest policy
among tasks, this innovative SCHED_OTHER algorithm was created with a not so innovative
POSIX-defined name. The AIX SCHED_OTHER is a priority-queue round robin design at the
core, with one major difference: the priority is no longer fixed. If a task is using an excessive amount of CPU time, its priority level should be downgraded to allow other jobs an
opportunity to access the CPU.
If a task is at a priority level so low (a high number) that it does not have an opportunity to
run, then its priority should be upgraded to a higher level (a lower number) so it can run to
finish. A new concept was also implemented to further enhance the effectiveness of the nice
value: If a task is nice (the UNIX nice value) at the beginning, the system will then force it
to be nice all the time. I discuss this feature later.
Traditional CPU utilization
Prior to AIX 5.3 or with SMT disabled, AIX processor utilization uses a sample-based
approach to approximate:
The percentage of processor time spent executing user programs
The percentage spent executing system code
The percentage spent waiting for disk I/O
Idle time
AIX produces 100 interrupts per second to take samples. At each interrupt, a local timer tick
(10 ms) is charged to the currently running thread, the one preempted by the timer interrupt. One of the following utilization categories is chosen based on the state of the interrupted thread:

If the thread was executing kernel code through a system call, the entire tick is charged to the process's system time.
If the thread was executing application code, the entire tick is charged to the process's user time.
If the currently running thread was the operating system's idle process, the tick is charged to a separate idle variable.

The problem with this method is that the process receiving the tick most likely did not run for
the entire timer period; it merely happened to be executing when the timer expired. With SMT
enabled in AIX 5.3, the traditional utilization metrics are misleading because each physical
processor presents two logical processors. If one logical processor is 100 percent busy and the
other is idle, the traditional method reports 50 percent utilization. In reality, if one SMT
thread is using all of the CPU resources, that CPU is 100 percent busy, as reported by the new
Processor Utilization Resource Register (PURR) based method.
PURR
Beginning in AIX 5.3, the number of dispatch cycles for each thread can be measured using a
new register called the PURR. Each physical processor has two PURR registers (one for each
hardware thread). The PURR is a new register provided by the POWER5 processor, which is
used to provide an actual count of physical processing time units that a logical processor has used. All performance tools and APIs utilize this PURR value to report CPU utilization
metrics for SMT systems. This register is a special-purpose register that can be read or
written by the POWER Hypervisor; however, it is read-only by the operating system.
The hardware increments each PURR based on how its thread is using the resources of
the processor, including the dispatch cycles that are allocated to each thread. For a cycle in
which no instructions are dispatched, the PURR of the thread that last dispatched an
instruction is incremented. The register advances automatically, so the operating system can always get the current, up-to-date value.
When the processor is in single-thread mode, the PURR increments by one every eight
processor clock cycles. When the processor is in SMT mode, the thread that dispatches a
group of instructions in a cycle increments the counter by 1/8 in that cycle. If no group
dispatch occurs in a given cycle, both threads increment their PURR by 1/16. Over a period
of time, the sum of the two PURR registers, when running in SMT mode, should be very
close, but not greater than the number of timebase ticks.
AIX 5.3 CPU utilization
In AIX 5L V5.3, new metrics collected by the kernel are state-based rather than
sample-based. State-based collection accumulates information on PURR increments rather
than at a set time of 10 ms. AIX 5.3 uses the PURR for process accounting.
Instead of charging the entire 10ms clock tick to the interrupted process as before, processes
are charged on the PURR delta for the hardware thread since the last interval. At each
interrupt:
The elapsed PURR is calculated for the current sample period.
This value is added to the appropriate utilization category (user, sys, iowait, or idle), instead of the fixed-size increment (10 ms) that was previously added.
There are two different things to measure: the thread's processor time and the elapsed
time. To measure the elapsed time, the timebase register (TB) is still used. The physical
resource utilization metrics for a logical processor are:

delta PURR/delta TB represents the fraction of the physical processor consumed by a logical processor.
(delta PURR/delta TB) * 100 over an interval represents the percentage of dispatch cycles given to a logical processor.
CPU utilization example
Assume two threads are running on one physical processor with SMT enabled. Both SMT
threads of a physical CPU are busy. Using the old tick-based method, both SMT threads
would be reported as 100 percent busy but, in reality, they are really sharing the CPU
resources evenly. This means the new PURR-based method would show each SMT thread as
50 percent busy.
Using the PURR methods, each logical processor reports a utilization of 50 percent
representing the proportion of physical processor resources that it used, assuming equal
distribution of physical processor resources to both the hardware threads.
Additional CPU utilization metrics
The following metrics use the per-thread PURR to measure the thread's processor
time and the TB register to measure the elapsed time.

Table 1. Per-thread PURR method

Metric: %sys = (delta PURR in system mode / entitled PURR) * 100, where entitled PURR = ENT * delta TB, and ENT is the entitlement in number of processors (entitlement/100)
Provides: Physical CPU utilization, calculated using the PURR-based samples and the entitlement.

Metric: sum(delta PURR / delta TB) for each logical processor in a partition
Provides: The Physical Processors Consumed (PPC) over an interval.

Metric: (PPC/ENT) * 100
Provides: The percentage of entitlement consumed.

Metric: delta PIC / delta TB, where PIC is the Pool Idle Count, the clock ticks during which the POWER Hypervisor was idle
Provides: The available pool of processors.

Metric: Sum of the traditional 10 ms tick-based %sys and %user
Provides: Logical processor utilization, which helps you determine whether more virtual processors should be added to a partition.
AIX 5.3 command changes
When AIX is running with SMT enabled, commands that display CPU information, such as
vmstat, iostat, topas, and sar, display the PURR-based statistics, rather than the
traditional sample-based statistics. In SMT mode, additional columns of information are
displayed, as shown in Table 2 below.
Table 2. SMT mode
Column         Description
pc or physc    Physical Processors Consumed by the partition
pec or %entc   Percentage of Entitlement Consumed by the partition
Another tool that needed modification was trace/trcrpt, along with several other tools that are based on the trace utility. In an SMT environment, trace can optionally collect PURR register
values at each trace hook, and trcrpt can display the elapsed PURR.
Table 3 below shows the arguments to use for SMT.
Table 3. Arguments for SMT
Argument                  Description
trace -r PURR             Collects the PURR register values. Only valid for a trace run on a 64-bit kernel.
trcrpt -O PURR=[on|off]   Tells trcrpt to show the PURR along with any timestamps.
netpmon -r PURR           Uses the PURR time instead of the timebase in percent and CPU calculations. Elapsed time calculations are unaffected.
pprof -r PURR             Uses the PURR time instead of the timebase in percent and CPU calculations. Elapsed time calculations are unaffected.
gprof                     GPROF is the new environment variable to support SMT.
curt -r PURR              Specifies the use of the PURR register to calculate CPU times.
splat -p                  Specifies the use of the PURR register to calculate CPU times.
Thread priority formulas
You can calculate the priority of a thread using the formulas shown in Listing 2 below. It
is a function of the nice value, the CPU usage C, and a tuning factor r.
How AIX calculates the new priority
The clock timer interrupt occurs every 10ms or 1 tick on each CPU. The timers are staggered
so that a CPU's clock timer does not go off at the same time as another CPU's clock timer.
When the CPU clock timer interrupt occurs (even before the thread has run for a full 10ms),
the thread has its CPU usage value (the CPU charge) incremented by one, up to a maximum
of 120. If a job does not get a full 10ms slice and is running the RR policy, the system dispatcher changes the thread's priority in the run queue to allow it to run again soon.
The priority of most user processes varies with the amount of CPU time the process has used
recently. The CPU scheduler's priority calculations are based on two parameters that are set
with schedo: sched_R and sched_D. The sched_R and sched_D values are in units of 1/32.
The scheduler uses this formula to calculate the amount to add to a process's priority value as
a penalty for recent CPU use. For example:
CPU penalty = (recently used CPU value of the process) * (r/32)
The recalculation (once per second) of the recently used CPU value of each process is:
New recently used CPU value = (old recently used CPU value of the process)* (d/32)
Both r (sched_R parameter) and d (sched_D parameter) have default values of 16.
The recent CPU charge C is then used to determine the priority penalty and to recalculate the new thread priority. Using the first formula as a reference (see Listing 2), you know that a
newly started user task, which carries a base priority 40, a default nice value of 20, and no
CPU charge so far (C=0), begins with a priority level 60.
Also, in the first formula, the value r determines the penalty ratio, with a range from zero to
32. An r value of zero means no penalty for CPU usage, since the penalty term (C*r/32) is
always zero. If r=32, it yields the highest possible penalty: each tick (10ms) of CPU usage translates to one priority-level downgrade.
In most cases, the value of r lies near the middle between zero and 32. AIX defaults r to 16;
that is, every two ticks of CPU charge become one level of priority penalty. When the r value
is high, the impact of a nice value becomes less important, since the CPU usage penalty
prevails. A smaller r, on the contrary, makes the effect of the nice value more obvious.
Based on this discussion, the effectiveness of the nice value diminishes after a while. The
reason for this is because the CPU charge grows in time and gradually becomes the main
factor in determining the new priority.
This formula has been modified in AIX 5L to increase the weight of the nice value in
calculating the priority level. Two new factors have been introduced: x_nice and
x_nice_factor ("extra nice" and "extra nice factor"). See the second formula in Listing 2 below.
Listing 2. Thread priority formulas
Priority = p_nice + (C * r/32)                    (1)
Priority = x_nice + (C * r/32 * x_nice_factor)    (2)

Where:
p_nice = base_PRIORITY + NICE
base_PRIORITY = 40
NICE = 20 + delta_NICE (20 is the default nice value)
That is, p_nice = 60 + delta_NICE
C is the CPU usage charge; the maximum value of C is 120

If NICE > 20 then
x_nice = p_nice * 2 - 60, or
x_nice = p_nice + delta_NICE, or                  (3)
x_nice = 60 + (2 * delta_NICE)                    (3a)
x_nice_factor = (x_nice + 4)/64                   (4)

Priority has a maximum value of 255
As you can see from Formulas 2 and 3, x_nice now doubles the increase in the
nice value, and the x_nice_factor further strengthens the r ratio. For example, an initial nice of
16, which gives a nice value of 36, results in an x_nice_factor of 1.5. That is a 50
percent higher penalty for the CPU usage part over the lifetime of the thread.
Decaying the CPU usage
It is possible that a thread can get a priority so low that it never has a chance to run. This
would occur if you use only Formulas 1 and 2 without a mechanism to push a thread's
priority level back up.
When a thread runs with SCHED_OTHER, its priority is degraded for its use of CPU time. When
it is not running and is waiting for its turn, AIX tries to regain its priority by "decaying" its
CPU charges, about once a second. The rule is simple: A CPU-bound job should be assigned
a lower priority to allow other jobs to run, but it should not be discriminated against to the
point that it cannot finish. Every thread's CPU charge is decayed by a predefined factor
once per second, as follows:

New Charge C = (Old Charge C) * d / 32 (5)

A kernel process, swapper, does this job. Once every second, swapper wakes up and handles the CPU charge decaying for all the threads. The default decay factor is 0.5 (d=16), which
"discounts" or "waives" half of the CPU charge.
With this mechanism, a CPU-intensive job accumulates CPU charge, gets to a lower priority
level, and then advances to a much higher level at the end of a second. On the other hand, an
I/O-intensive job does not vary its priority up and down as much, since it generally
accumulates less CPU time.
Have you exhausted your CPU?
Now that you understand how the AIX scheduler prioritizes the workload, let's look at several
commonly used commands. If AIX seems to take too long to finish your workload or it does
not respond quickly enough, try these commands to investigate whether your system is CPU-
bound: vmstat, iostat, and sar.
We do not discuss all the possible ways to use these commands, but instead emphasize the
information they convey to you. For a detailed description of these commands, see your AIXpublications or visit the IBM System p and AIX Information Center at
http://publib16.boulder.ibm.com/pseries/index.htm. Scroll down, if necessary, and click AIX
5L Version 5.3 information center to start using the AIX 5 publications.
The priority change history of a thread
Listing 3 shows how the CPU charge can change the priority of a thread:
Listing 3. Change of CPU charge and the priority of a thread
Base priority is 40
Default NICE value is 20; assume the task was run using the default nice value
p_nice = base_priority + NICE = 40 + 20 = 60
Assume r = 2 to slow down the penalty increase (default r value is 16)
Priority = p_nice + C*r/32 = 60 + C * r / 32
Tick 0    P = 60 + 0 * 2 / 32 = 60
Tick 1    P = 60 + 1 * 2 / 32 = 60
Tick 2    P = 60 + 2 * 2 / 32 = 60
...
Tick 15   P = 60 + 15 * 2 / 32 = 60
Tick 16   P = 60 + 16 * 2 / 32 = 61
Tick 17   P = 60 + 17 * 2 / 32 = 61
...
Tick 100  P = 60 + 100 * 2 / 32 = 66
Tick 100  Swapper decays the CPU usage charges for all threads:
          New C CPU Charge = (Current CPU Charge) * d / 32
          Assume d = 16 (the default)
          For the test thread, new C = 100 * 16 / 32 = 50
Tick 101  P = 60 + 51 * 2 / 32 = 63
Listing 4 contrasts a fast, CPU-bound job with a slow, mostly sleeping one:

Listing 4. Priority change of a typical CPU-bound job (fast versus slow)
fast.c:
main()
{
    for (;;);
}

slow.c:
main()
{
    sleep(80);
}
Common commands
The vmstat, iostat, and sar commands are used frequently for CPU monitoring. You
should be familiar with the usage and the meaning of the reports each command generates.
vmstat
The vmstat command provides an overview of resource utilization through a report of CPU,
disk, and memory activity in a one-line-per-report format. The sample output in Listing 5 is
generated on an AIX 5L Version 5.3 system running "vmstat 1 6". This report was
generated every second, as requested. Since a count of six was specified following the
interval, reporting stops after the sixth report. One popular way to run the vmstat command
is to leave out the count parameter; vmstat then generates reports continuously until the
command terminates.
Except for the avm and fre columns, the first report contains average statistics per second
since system startup. Subsequent reports contain statistics collected during the interval since
the previous report.
Beginning with AIX 5L Version 5.3, the vmstat command reports the number of physical
processors consumed (pc) and the percentage of entitlement consumed (ec) in Micro-Partitioning and SMT environments. These metrics are displayed only in Micro-Partitioning and SMT environments.
AIX 5L adds a useful new option, -I, to vmstat that shows the number of threads waiting for raw I/O to complete (the p column) and the number of file pages paged in/out per second
(fi/fo columns).
The following detailed descriptions of the columns convey useful information about CPU
utilization. Listing 5 shows the output of the vmstat 1 6 command:
Listing 5. Output of the vmstat 1 6 command from a p520 system (two CPUs)

vmstat 1 6

System configuration: lcpu=4 mem=15808MB

kthr    memory               page                   faults         cpu
----- --------------- ------------------------ ---------------- -----------
 r  b    avm    fre   re pi po fr sr cy   in   sy  cs  us sy id wa
 1  1 110996 763741    0  0  0  0  0  0  231   96  91   0  0 99  0
 0  0 111002 763734    0  0  0  0  0  0  332 2365 179   0  1 99  0
 0  0 111002 763734    0  0  0  0  0  0  330 2283 139   0  5 93  1
 0  0 111002 763734    0  0  0  0  0  0  310 2212 153   0  0 99  0
 1  0 111002 763734    0  0  0  0  0  0  314 2259 173   0  0 99  0
 0  0 111002 763734    0  0  0  0  0  0  321 2261 177   0  1 99  0
Figure 2 shows the output of the command vmstat -I 1 (issued during a software
installation):
Figure 2. Output of the vmstat -I 1 command
See Table 4 below for a listing of relevant columns with descriptions.
Table 4. Description of relevant columns
Column   Description

kthr     Kernel thread state changes per second over the sampling interval.

r        Number of kernel threads placed in the run queue.

b        Number of kernel threads placed in the Virtual Memory Manager (VMM)
         wait queue (awaiting resource, awaiting input/output).
p        The number of threads waiting on raw I/Os (bypassing the journaled
         file system (JFS)) to complete. This is only available on AIX 5 and
         later.

fi/fo    Number of file pages paged in/out per second. Note: This column is
         available only on AIX 5 and later systems.

cpu      Breakdown of percentage usage of CPU time. For multiprocessor
         systems, CPU values are global averages among all processors. Also,
         the I/O wait state is defined system-wide and not per processor.

us       Average percentage of CPU time executing in user mode.

sy       Average percentage of CPU time executing in system mode.

id       Average percentage of time that CPUs were idle and the system did
         not have an outstanding disk I/O request.

wa       Average percentage of CPU idle time during which the system had
         outstanding disk/NFS I/O request(s). If there is at least one
         outstanding I/O to a disk when wait is running, the time is
         classified as waiting for I/O. Unless asynchronous I/O is being used
         by the process, an I/O request to disk causes the calling process to
         block (or sleep) until the request has been completed. Once an I/O
         request for a process completes, it is placed on the run queue. If
         the I/Os were completing faster, more CPU time could be used.

pc       Number of physical processors consumed. Displayed only if the
         partition is running with shared processors.

ec       The percentage of entitled capacity consumed. Displayed only if the
         partition is running with shared processors.
A CPU is marked wio at the time of a clock interrupt (every 1/100 second) if the CPU is idle and an outstanding I/O was initiated on that CPU. If a CPU is only idling, with no outstanding I/O from that CPU, it is marked as id instead of wa. For example, a system with four CPUs and one thread doing I/O reports a maximum of 25 percent wio time. A system with 12 CPUs and one thread doing I/O reports a maximum of 8.3 percent wio time. To be precise, wio measures the percentage of time the CPU is idle while it waits for an I/O to complete.
These four columns should total 100 percent, or very close to it. If the sum of the user and system (us and sy) CPU-utilization percentages consistently approaches 100 percent, the system might be encountering a CPU bottleneck.
iostat
The iostat command is used primarily to monitor system input and output devices, but it
can also provide CPU utilization data. Beginning with AIX 5.3, the iostat command reports the number of physical processors consumed (physc) and the percentage of entitlement consumed (%entc) in Micro-Partitioning and SMT environments. These metrics are displayed only in Micro-Partitioning/SMT environments. When SMT is enabled, iostat automatically uses new PURR-based data and formulas for:
%user %sys %wait %idle
Listing 6 is generated on an AIX 5L Version 5.3 system by entering "iostat 5 3", as
follows:
Listing 6. iostat report

System configuration: lcpu=4 drives=9

tty:   tin   tout   avg-cpu:  %user  %sys  %idle  %iowait
       0.0    4.3               0.2   0.6   98.8      0.4

Disks:    %tm_act   Kbps   tps   Kb_read   Kb_wrtn
hdisk0        0.0    0.2   0.0      7993      4408
hdisk1        0.0    0.0   0.0      2179      1692
hdisk2        0.4    1.5   0.3     67548     59151
cd0           0.0    0.0   0.0         0         0

tty:   tin   tout   avg-cpu:  %user  %sys  %idle  %iowait
       0.0   30.3               8.8   7.2   83.9      0.2

Disks:    %tm_act   Kbps   tps   Kb_read   Kb_wrtn
hdisk0        0.2    0.8   0.2         4         0
hdisk1        0.0    0.0   0.0         0         0
hdisk2        0.0    0.0   0.0         0         0
cd0           0.0    0.0   0.0         0         0

tty:   tin   tout   avg-cpu:  %user  %sys  %idle  %iowait
       0.0    8.4               0.2   5.8    0.0     93.8

Disks:    %tm_act   Kbps   tps   Kb_read   Kb_wrtn
hdisk0        0.0    0.0   0.0         0         0
hdisk1        0.0    0.0   0.0         0         0
hdisk2       98.4   75.6  61.9       396      2488
cd0           0.0    0.0   0.0         0         0

Example iostat report with an SPLPAR configuration:

# iostat -t 2 3

System configuration: lcpu=4 ent=0.80

avg-cpu:  %user  %sys  %idle  %iowait  physc  %entc
            0.1   0.2   99.7      0.0    0.0    0.9
            0.1   0.4   99.5      0.0    0.0    1.1
            0.1   0.2   99.7      0.0    0.0    0.9
Just like the vmstat command report, the first report contains statistics averaged since system startup. Subsequent reports contain statistics collected during the interval since the
The four columns that show the breakdown of CPU usage time convey the same information
as the vmstat command. The columns should total approximately 100 percent. If the sum of
user and system (us and sy) CPU-utilization percentages consistently approaches 100 percent,
the system might be encountering a CPU bottleneck.
On systems running one application, a high I/O wait percentage might be related to the
workload. On systems with many processes, some will be running while others wait for I/O.
In this case, the %iowait can be small or zero because running processes "hide" some wait
time. Although %iowait is low, a bottleneck can still limit application performance. If the
iostat command indicates that a CPU-bound situation does not exist and %iowait time is
greater than 20 percent, you might have an I/O or disk-bound situation.
sar
The sar command has two forms: the first form samples, displays, and/or saves system statistics, and the second form processes and displays previously captured data. The sar
command can provide queue and processor statistics just like the vmstat and iostat
commands. However, it has two additional features:
- Each sample has a leading time stamp, and an overall average appears at the
  end of the samples.

- The -P option can be used to generate per-processor statistics, in addition
  to the global averages among all processors. The sample output below, from
  a four-way symmetric multiprocessor (SMP) system, resulted from entering
  two commands:

  sar -o savefile 5 3 > /dev/null &

  Note: This command collects the data three times at five-second intervals,
  saves the collected data in savefile, and redirects the report to null so
  that no report is written to the terminal.

  sar -P ALL -u -f savefile

  Note: The -P ALL flag is specified to get per-processor statistics for each
  individual processor, and -u requests CPU usage data. In addition, -f
  savefile tells sar to generate the report using the data saved in savefile.
  The sar -P ALL output for all logical processors with SMT enabled shows the
  physical processor consumed, physc (delta PURR / delta TB). This column
  shows the relative SMT split between processors -- in other words, it
  measures the fraction of time a logical processor was getting physical
  processor cycles. Whenever the percentage of entitled capacity consumed is
  under 100 percent, a line beginning with U is added to represent the unused
  capacity. When running in shared mode, sar displays the percentage of
  entitlement consumed, %entc, which is ((PPC/ENT)*100).
Listing 7. A typical sar report from a 2-way p520 system with dedicated LPAR configuration

AIX nutmeg 3 5 00CD241F4C00    06/14/05

System configuration: lcpu=4

11:51:33  cpu  %usr  %sys  %wio  %idle  physc
11:51:34    0     0     0     0    100   0.30
            1     1     1     1     98   0.69
            2     2     1     0     96   0.69
            3     0     0     0    100   0.31
            -     1     1     0     98   1.99
11:51:35    0     0     0     0    100   0.31
            1     0     0     0    100   0.69
            2     0     0     0    100   0.73
            3     0     0     0    100   0.31
            -     0     0     0    100   2.04
11:51:36    0     0     0     0    100   0.31
            1     0     0     0    100   0.69
            2     0     0     0    100   0.70
            3     0     0     0    100   0.31
            -     0     0     0    100   2.01
11:51:37    0     0     0     0    100   0.31
            1     0     0     0    100   0.69
            2     0     0     0    100   0.69
            3     0     0     0    100   0.31
            -     0     0     0    100   2.00

Average     0     0     0     0    100   0.31
            1     0     0     0     99   0.69
            2     1     0     0     99   0.70
            3     0     0     0    100   0.31
            -     0     0     0     99   2.01
mpstat
The mpstat command collects and displays performance statistics for all logical CPUs in the
system. If SMT is enabled, the mpstat -s command displays physical as well as logical processor usage, as shown in Listing 8 below.
Listing 8. A typical mpstat report from a 2-way p520 system with SPLPAR configuration

System configuration: lcpu=4

  Proc0             Proc1
 63.65%            63.65%
  cpu2    cpu0     cpu1    cpu3
58.15%   5.50%   61.43%   2.22%
lparstat
The lparstat command provides a report of LPAR-related information and utilization
statistics. This command provides a display of current LPAR-related parameters and
hypervisor information, as well as utilization statistics for the LPAR. An interval mechanism can be used to produce a given number of reports at a specified interval.
The following statistics are displayed only when the partition type is shared:
physc   Shows the number of physical processors consumed.

%entc   Shows the percentage of the entitled capacity consumed.

lbusy   Shows the percentage of logical processor utilization that occurred
        while executing at the user and system level.

app     Shows the available physical processors in the shared pool.

phint   Shows the number of phantom interruptions (targeted to another
        shared partition in this pool) received.
The following statistics are displayed only when the -h flag is specified:
%hypv   Shows the percentage of time spent in the hypervisor.

hcalls  Shows the number of hypervisor calls executed.
Listing 9. A typical lparstat report from a 2-way p520 machine

System configuration: type=Dedicated mode=Capped smt=On lcpu=4 mem=15808

%user  %sys  %wait  %idle
-----  ----  -----  -----
  0.0   0.1    0.0   99.9
  0.0   0.1    0.0   99.9
  0.4   0.2    0.1   99.3

# lparstat 1 3

System configuration: type=Shared mode=Uncapped smt=On lcpu=2 mem=2560 ent=0.50

%user  %sys  %wait  %idle  physc  %entc  lbusy  app  vcsw  phint
-----  ----  -----  -----  -----  -----  -----  ---  ----  -----
  0.3   0.4    0.0   99.3   0.01    1.1    0.0    -   346      0
 43.2   6.9    0.0   49.9   0.29   58.4   12.7    -   389      0
  0.1   0.4    0.0   99.5   0.00    0.9    0.0    -   312      0
Improving system performance
For a CPU-bound system, you can improve performance by manipulating the thread and process priorities of specific processes, or by tuning the scheduler algorithm to set a different system-wide scheduling policy.
Changing user-process priority
The commands to change or set user task priority include the nice and renice commandsand two system calls that allow thread priority and scheduling policy to be changed through
API calls.
Using the nice command
The standard nice value of a foreground process is 20; the standard nice value of a background process is 24 if started from ksh or csh (20 if started by tcsh or bsh). The
system uses the nice value to calculate the priority of all threads associated with the process.
Using the nice command, a user can specify an increment or decrement to the standard nice
value so that a process can be started with a different priority. The thread priority is still non-
fixed and gets different values based on the thread's CPU usage.
By using nice, any user can run a command at a lower priority than normal. Only root can
use nice to run commands at a priority higher than normal. For example, the command nice -5 iostat 10 3 > iostat.out causes the iostat command to start with a nice value of 25
https://www.ibm.com/developerworks/aix/library/au-aix5_cpu/#ibm-pconhttps://www.ibm.com/developerworks/aix/library/au-aix5_cpu/#ibm-pconhttps://www.ibm.com/developerworks/aix/library/au-aix5_cpu/#ibm-pcon -
7/27/2019 Cpu Monitoring and Tunig SMIT
19/26
(instead of 20), resulting in a lower starting priority. The values of nice and priority can be viewed using the ps command with the -l flag. Listing 10 shows typical output from the
ps -l command:
Listing 10. Using ps -l to observe process priority

     F S UID   PID  PPID C PRI NI  ADDR  SZ WCHAN TTY    TIME CMD
240001 A   0 15396  5746 1  60 20 393ce 732       pts/3  0:00 ksh
200001 A   0 15810 15396 3  70 25 793fe 524       pts/3  0:00 iostat
As root, you can run iostat at a higher priority with # nice --5 iostat 10 3 > io.out. The iostat command then runs with a nice value of 15, resulting in a higher starting priority.
Using the renice command
If a process is already running, you can use the renice command to alter the nice value, and
thus the priority. The processes are identified by process ID, process group ID, or the name of
the user who owns the processes. The renice command cannot be used on fixed priority
processes.
Using the setpri() and thread_setsched() subroutines
Two system calls allow individual processes or threads to be scheduled with a fixed priority. The setpri() system call is process-oriented, and thread_setsched() is thread-oriented. Use caution when calling these two subroutines, since improper use might cause the system to hang.
An application that runs under the root user ID can invoke the setpri() subroutine to set its
own priority or the priority of another process. The target process is scheduled using the
SCHED_RR scheduling policy with a fixed priority. The change is applied to all the threads in
the process. Note the following two examples:
retcode = setpri(0, 45);
Gives the calling process a fixed priority of 45.
retcode = setpri(1234, 35);
Gives the process with PID of 1234 a fixed priority of 35.
If the change is intended for a specific thread, the thread_setsched() subroutine can be
used:
retcode = thread_setsched(thread_id, priority_value, scheduling_policy);
The parameter scheduling_policy can be one of the following:
SCHED_OTHER, SCHED_FIFO, or SCHED_RR.
When SCHED_OTHER is specified as the scheduling policy, the second parameter
(priority_value) is ignored.
Changing the scheduling algorithm globally
AIX allows users to make changes to the priority calculation formula using the schedo
command.
Adjusting r and d
As mentioned earlier, the formula for calculating the priority value is as follows:
Priority = x_nice + (C * r/32 * x_nice_factor)
The recent CPU usage value is displayed as the C column in the ps command output. Themaximum value of recent CPU usage is 120. Once every second, the CPU usage value for
each thread is degraded using the following formula:
New Charge C = (Old Charge C) * d / 32
The default value of r is 16; therefore, the thread priority is penalized by recent CPU usage * 0.5. The d value also defaults to 16, which means the recent CPU usage value of every process is reduced to half of its original value once every second. For some users, the default values of sched_R and sched_D do not allow enough distinction between foreground and background processes. These two values can be tuned using the sched_R and sched_D options of the schedo command. Note the following two examples:
# schedo -o sched_R=0
(R=0, D=.5) indicates that the CPU penalty was always 0. The priority value of the
process would effectively be fixed, although it is not treated like an RR process.
# schedo -o sched_D=32
(R=0.5, D=1) indicates that long-running processes would reach a C value of 120 and
stay there. The recent CPU usage value does not get reduced once every second and
the priority of long-running processes would not fluctuate back to low numbers
(higher importance) to compete with new processes.
Changing the timeslice
Although the schedo command can modify the length of the scheduler timeslice, thetimeslice change only applies to RR threads. This does not affect threads running with other
scheduling policies. The syntax for this command is:
schedo -o timeslice=n

where n is the number of 10 ms clock ticks to be used as the timeslice. For example, schedo -p -o timeslice=2 would set the timeslice length to 20 ms.
You must log on as root to make changes using the schedo command.
Using additional techniques
Other techniques that can help a CPU-bound system include the following.
Scheduling
Depending on the relative importance of applications, you could schedule less important ones for off-shift hours using the at, cron, or batch commands.
Using the mkpasswd command
If your system has thousands of entries in the /etc/passwd file, you could use the mkpasswd command to create a hashed or indexed version of the /etc/passwd file, saving the CPU time spent looking up user IDs.
Tuning individual applications
The following techniques can help you diagnose and improve the performance of specific
applications running under AIX.
Using the ps command
The ps command or profiling can identify an application that is consuming large fractions of
CPU time. This information can then be used to narrow the search for a CPU bottleneck.
After you find the problem area, you can tune up or improve the application. You might need
to recompile the application or change the source code.
Using the schedo command
The schedo command is used to set or display current or next-boot values for all CPU scheduler tuning parameters. This command can only be executed by the root user. The schedo command can also make changes permanent or defer them until the next reboot. Beginning with AIX 5L Version 5.3, several tuning parameters have been added to the schedo command. Listing 11 shows all the CPU scheduler parameters.
Listing 11. CPU scheduler parameters

# schedo -a
                 %usDelta = 100
             affinity_lim = 7
            big_tick_size = 1
         fixed_pri_global = 0
                force_grq = 0
          hotlocks_enable = 0
   idle_migration_barrier = 4
       krlock_confer2self = n/a
     krlock_conferb4alloc = n/a
            krlock_enable = n/a
       krlock_spinb4alloc = n/a
      krlock_spinb4confer = n/a
                  maxspin = 16384
       n_idle_loop_vlopri = 100
                 pacefork = 10
                  sched_D = 16
                  sched_R = 16
    search_globalrq_mload = 256
     search_smtrunq_mload = 256
     setnewrq_sidle_mload = 384
      shed_primrunq_mload = 64
       sidle_S1runq_mload = 64
       sidle_S2runq_mload = 134
       sidle_S3runq_mload = 134
       sidle_S4runq_mload = 4294967040
       slock_spinb4confer = 1024
         smt_snooze_delay = 0
        smtrunq_load_diff = 2
                timeslice = 1
           unboost_inflih = 1
            v_exempt_secs = 2
            v_min_process = 2
              v_repage_hi = 0
            v_repage_proc = 4
               v_sec_wait = 1
Upgrading
Upgrading the system to a faster CPU or more CPUs might be necessary if tuning does notimprove the performance.
Case studies
Two real-world examples show how the performance experts from IBM implemented these
theories and techniques.
Case 1
Symptoms: The user has a batch script that starts up 500 other batch scripts, and each of
these scripts queries and updates a database. Each script also starts as a client request from
another machine. Each client request creates a database user thread on the database server
machine. The response time began at less than 10 seconds for a period of time. Then the
response time gradually became worse. At times it was more than a minute -- sometimes two
minutes.
Diagnosis: The run queue began growing until it reached into the hundreds. Another
symptom included the CPU being 100 percent utilized (this was an eight-way SMP system),
with 99 percent in user mode. By examining an AIX trace sample collected for a few
seconds, we saw a pattern emerge. While a thread was using the CPU, a network packet
would arrive and cause a network adapter interrupt. This would take the currently running
thread off its CPU so the interrupt could be serviced.
After servicing the interrupt, the scheduler checks whether any other threads are runnable and have a better priority than the currently running thread. Since the currently running thread had already run for a few timeslices, its priority value had increased (become less favored) as it accumulated CPU ticks.
Each of the 500 scripts began with priority 60. If they were runnable, they would preempt any
currently running thread with a thread priority higher than 60. The preempted thread would
then be put at the end of the run queue and have to wait for the CPU until its priority rose
again.
One effect of this preemption was that sometimes a thread would be preempted while holding
a database lock. Since this type of lock is implemented at the application layer within the
database software, the kernel does not know that the thread is holding a lock. If the lock was
a kernel-level lock or a pthread library mutex lock, then the kernel could perform priority
boosting and boost a thread's priority to the same level as that of a running thread that is requesting the lock. This way, the requesting thread does not have to wait long for the lock
holder to get the CPU again and release the lock.
Since the lock in this scenario was a user lock, the database thread would spin on the lock
until it exhausted its spin count (a tunable database parameter), and then go to sleep. So the
99 percent used CPU was mostly due to the threads spinning on database locks.
Prescription: After determining that priority preemption was having a negative effect, we
tuned the scheduler formula, which calculates the thread priority. This particular formula is:
pri = base_pri + NICE + (C * r/32)
pri is the new priority, base_pri is 40, NICE is the nice value (20 in this case), C is the CPU
usage in ticks, and r is 16.
As a thread accumulates CPU ticks, its priority value becomes larger, thereby making its
priority lower.
The schedo command provides a way to change the value of r by using the sched_R option.
Running schedo -p -o sched_R=0 causes r to be 0, which then causes the CPU penalty
factor (C * r/32) to be 0. This prevents priorities from changing, unless the nice value is
changed. If the nice value is the same for all threads, then threads can complete their
timeslices without being preempted due to priority changes. This allows the thread that is
currently running and holding the database lock to keep running and then release the lock.
Results: These changes had an instantaneous impact on the performance. The response time,
which was over two minutes by this time, started getting better until all of the scripts were
completing in just a few seconds. The C value in the priority formula is recalculated once a
second by a CPU usage decay factor (C = C*d/32). Setting the d value to 0 when using the
schedo command would have accomplished the same result. In this case, if d=0, then C*d/32
= 0. Since the CPU penalty factor is C*r/32, this also becomes 0 so that the priority will be
just 40 + NICE.
Case 2
Symptoms: A pSeries machine was used as both a database and an application server. Users
would input requests into a forms-based application and then submit the transactions. They
noticed that at certain times the forms would take longer to get updated on their screens and
their usual short-running queries would return in a longer time period.
Diagnosis: When this slowness was observed, there were also some long-running database
batch jobs that were submitted to the system. Normally, such batch jobs would be run at
night, but near the end of the month additional batch jobs were run during the day while the
users were on the system. The batch jobs were CPU-intensive and constantly on the run
queue. Therefore, users' threads had to compete with the threads of the batch jobs for the
CPU.
With priorities degrading as CPU usage increased, the batch jobs' priorities became worse
and allowed the users' threads to run. However, the kernel decays the CPU usage value C by
half once a second. This allowed the priorities of the batch jobs to improve in a short timeperiod. So the batch jobs would again compete for the CPU with the users' threads.
Prescription: By changing the decay factor (d/32) used to reduce CPU usage once a second,
we improved performance for the users. We used the schedo command to set the d value to 31. The higher the value of d, the higher the value of C remains (C = C*d/32).
Since C is used to calculate priorities (pri=40+NICE+C*r/32), the priority would get worse as
C became larger. By setting the d value to a higher number, the C value is reduced at a slower
than usual rate.
Results: The users' threads get the CPU more often than the batch threads. As a result, the
users saw an immediate improvement in performance. Of course, the batch jobs would be
slowed down somewhat, but these jobs would get the CPU whenever the users had any
"think" time or had to wait on I/O. The impact was minimal on the batch jobs, but
performance improvement for the users was dramatic.
Case study notes: Tracing a pattern
A final tip describes some odd things that impact performance. During one of our
benchmarks, we noticed that the CPU usage reached 100 percent, with most of the time being
charged to "system". At that time, the application performance degraded noticeably.
After we collected an AIX trace, we noticed a repeating pattern. One application process
would encounter a page fault on an address. That page fault caused a protection exception in
the VMM, which in turn caused the kernel to send this process a SIGSEGV (segmentationviolation) signal. When the process resumed, the page faulted on the same address again,
which then caused yet another protection exception and another SIGSEGV signal to be sent to the process. The default signal disposition for the SIGSEGV signal is to kill the process and generate a core dump, but in this case, the application continued on and stayed in this loop.
Most of the CPU time was spent in this loop.
After investigation, we discovered the problem: A developer for another component had
installed a signal handler to catch the SIGSEGV signal in the code during the test process.
After the testing was completed, the developer had forgotten to remove the signal handler.
That component was then linked with the rest of the application and, during the benchmark, another unrelated component of the application caused a segmentation fault. This old signal handler caught the exception, ignored it, and caused the process to resume. The current instruction (the one that caused the exception) was then restarted, causing an infinite loop
to occur.
Resources
- The AIX 5L Support for Micro-Partitioning and Simultaneous Multi-threading
  white paper describes the new simultaneous multi-threading and
  Micro-Partitioning technologies and the AIX 5L support for them.

- The article Operating system exploitation of the POWER5 system discusses
  how new performance features deliver improved system scalability and
  performance.

- The AIX 5L Differences Guide Version 5.3 Edition Redbook focuses on the
  differences introduced in AIX 5L Version 5.3 when compared to AIX 5L
  Version 5.2.

- The Capped and Uncapped Partitions in IBM POWER5 white paper introduces
  and explains the concepts of capped and uncapped partitions and discusses
  priority weighting and CPU utilization by memory pools.
- The AIX 5L Practical Performance Tools and Tuning Guide Redbook is a
  comprehensive guide to the performance monitoring and tuning tools that
  are provided with AIX 5L Version 5.3.

- Want more? The developerWorks AIX and UNIX zone hosts hundreds of
  informative articles and introductory, intermediate, and advanced
  tutorials.

- Get involved in the developerWorks community by participating in
  developerWorks blogs.