
© 2002 Oracle Corporation, all rights reserved

    FROM THE CENTER OF EXPERTISE

    THE COE PERFORMANCE METHOD

A PERFORMANCE METHODOLOGY FOR ENTERPRISE-WIDE INFORMATION SYSTEMS

    Roger Snowden

    Center of Expertise, Oracle Corporation


ABSTRACT

A holistic method for diagnosing performance problems in a complex information system is presented. The COE Performance Method (CPM) relies on proven techniques and offers a simple, end-to-end approach to isolating performance bottlenecks based on evidence of their actual causes. There are many excellent Oracle solutions which treat single, individual technology components in greater depth than this paper, but the purpose of this document is to provide a complete method of end-to-end performance analysis for an entire application of perhaps many synergistic components. While this approach is shown in the context of a networked enterprise database application, the CPM described here can be easily applied to any computing environment. An explicit goal of the COE Performance Method is to achieve the performance commitments of Service Level Agreements and to quickly diagnose variances from those SLAs.

    TRADITIONAL APPROACHES TO PERFORMANCE TUNING

Although Oracle's relational database products have never been exactly simple, the software's complexity has grown significantly, particularly in recent versions. With increased complexity has come a great deal of confusion and misinformation regarding performance management of the server and related technologies.

    Often, performance issues are treated as though they were concealed in magic, heralded by mystery.

Bookshelves are full of offerings of "tips 'n' tricks" and secret knowledge about squeezing performance from an Oracle database. Authors with many years of database experience suggest various parameter settings and configurations with little expository justification. While there is some excellent material on the subject of Oracle performance, making the best use of that information requires a methodology based on facts and logic rather than guesswork.

The problem with the common best-guess approach becomes apparent when the database administrator encounters a situation where parameters are adjusted and expensive memory or disk is added, changes made just as the experts recommended, yet performance is still abysmal and Oracle appears to be the problem. What to do? Throwing memory or CPU at a problem may not address the underlying issues at all. In some cases, such a blanket approach may simply make things worse, until the system in question grinds to an unimpressive halt.

Therefore, it is reasonable and proper for us to seek a rational, comprehensive approach to managing the performance of an Oracle database without reliance on guesswork. We need some consistent, uncomplicated method of finding and relieving bottlenecks in a complex enterprise information architecture.

CPM: THE SOLUTION FOR AN ORDERED APPROACH

The movement toward network architectures has significantly added to the complexity of the computing environment. Years ago it was only necessary to manage a single, unified set of technology components to achieve optimal performance. Now we have to manage multiple technology stacks: the database and its host platform, application servers, varied client workstations and operating systems, as well as the network that glues it all together. It is no longer possible to examine a single component and perform effective diagnostics for the system at large.

The methodology for diagnosing and analyzing performance put forth here not only encompasses all technology stacks in the realm of an application system, but also does so in an orderly manner, quickly leading the diagnostician toward a positive result. Moreover, it takes into account the disparate pieces of the computing puzzle that other, purely component-based approaches ignore.

A SYSTEM IS NO FASTER THAN ITS WORST BOTTLENECK

The study of performance and capacity analysis of complex systems relies on a mathematical discipline known as queuing theory. Queuing theory relies on statistical methods to make possible an effective analysis of the behaviour of systems of processes, particularly as interrelated processes affect one another. While this description suggests a level of complexity that might discourage the non-mathematician, it is not necessary to have a mathematics background to develop a reasoned understanding of the principles involved.

    The fundamental equation we need to understand is this:

    Response Time = Service Time + Wait Time

Response time refers to the total time a process consumes, start to finish. In a rush hour traffic example, response time would be measured from the time a car entered a freeway to the time it left an off-ramp. In a retail service scenario, it might be from the time a customer gets into a bank teller's line (to cash a check, perhaps) to the moment cash is in hand. Service time is the amount of time consumed by the process itself, the teller's busy time. Wait time refers to the time spent in line waiting for service. Optimal processes have minimal service and wait times. The target in the performance method discussed here is overall response time. For the most part, the focus will be on the causes of wait time, but by no means will service time be ignored.
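To make the arithmetic concrete, here is a minimal sketch (in Python, purely for illustration) of the teller example: a single first-come, first-served queue in which each customer's response time is exactly wait time plus service time. The arrival and service times are invented.

# Single-teller queue: response time = service time + wait time.
arrivals = [0.0, 1.0, 1.5, 2.0]   # when each customer joins the line
service  = [2.0, 1.0, 1.5, 0.5]   # the teller's busy time per customer

teller_free = 0.0
for arrive, svc in zip(arrivals, service):
    start = max(teller_free, arrive)   # wait if the teller is still busy
    wait = start - arrive              # time spent standing in line
    response = svc + wait              # total time, start to finish
    teller_free = start + svc
    print(f"arrived {arrive:.1f}: waited {wait:.1f}, served {svc:.1f}, response {response:.1f}")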

Most of us already understand these concepts, and we only need to observe the events of our daily lives to reinforce this understanding. Consider the commuter driving to work during rush hour on a typical morning. If traffic is moving rapidly, but congestion is heavy and cars are close together, a simple near miss caused by one car stopping suddenly can create instant havoc. As following cars are forced to brake suddenly, even more cars further back are affected and are forced to slam on their brakes. The effect ripples backward through the highway, perhaps for miles. Even if the original incident involves no actual damage and traffic at that initial site begins moving again immediately, the delaying after-effects are likely to continue for perhaps an additional hour. Once congestion has set in, it seems to feed on itself long after the cause of the bottleneck is removed. It may be impractical to attempt to solve all of the mathematical equations demonstrating the various events and collective consequences, but certainly the rush hour driving experience reinforces the conclusion that a relatively small event can have severe performance consequences.

    As with traffic jams, computer systems suffer similar congestion.

Service time deserves some consideration. In the case of a database application, a session's process might be found to spend too much service time, in the form of CPU time, processing extra data blocks because of the lack of a proper index on a particular table. That is, the server performs a full table scan instead of an index range scan, retrieving many more data blocks than otherwise would be necessary. While this additional work might be initially regarded as service time (indeed, each block retrieval operation will consist of some CPU processing time), the operation will involve even more I/O wait time, as the user's process must wait for each additional block's disk read requests. So, while the full table scan certainly incurs additional CPU service time, the symptom of poor performance will most obviously be exhibited by excessive wait time (disk I/O) rather than service (CPU) time.
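A quick way to confirm this kind of access-path problem is to look at the execution plan. The sketch below is illustrative only: it assumes the python-oracledb driver and a hypothetical ORDERS table, with placeholder credentials. EXPLAIN PLAN and the DBMS_XPLAN package are standard Oracle features on recent releases.

# Check whether a query full-scans or uses an index (names are placeholders).
import oracledb

conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb")
cur = conn.cursor()
cur.execute("EXPLAIN PLAN FOR SELECT * FROM orders WHERE customer_id = 42")
cur.execute("SELECT plan_table_output FROM TABLE(DBMS_XPLAN.DISPLAY())")
for (line,) in cur:
    print(line)   # look for TABLE ACCESS FULL versus INDEX RANGE SCAN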

Consider another example from daily life: the junk food lunch. We drop by our favorite hamburger restaurant for a quick bite and are faced with three lines of people waiting to order food from three employees acting as servers. Which line do we choose? Almost automatically, we choose the shortest line available. After several minutes, we notice someone who arrived after us is being served before us. It dawns on us the person serving our line might still be in training. It takes that person about twice as long to fill an order as the more experienced workers. So, we intuitively understand that service time, the time it takes to actually take and fill an order, is a vital component of response time. Response time in this case is the time it takes to get our food in hand, starting from the moment we step into line in the restaurant.

Another example of the importance of wait time as a primary measure of poor performance would be CPU time consumed by excess SQL parsing operations. A well-designed application will not only make use of sharable SQL and avoid hard parses, but will also avoid soft parses by keeping frequently used cursors open for immediate execution without reparsing at all, neither hard nor soft. A poorly designed application will certainly exhibit a high percentage of parse time CPU, but will probably also incur a disproportionate amount of time waiting for latches, most notably the library cache latch. As such, even a highly CPU-consumptive process is likely to cause measurable, disproportionate waits. So, while service time must be monitored, performance problems are more likely to be quickly spotted by focusing on wait time.
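The application-side half of this is easy to illustrate. A minimal sketch, again assuming python-oracledb and a hypothetical CUSTOMERS table: the bind variable keeps the SQL text sharable (no hard parse per value), and re-executing the same statement on an open cursor lets the driver's statement cache avoid repeated soft parses.

# Sharable SQL via a bind variable; the cursor is opened once and reused.
import oracledb

conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb")
cur = conn.cursor()
for cust_id in (42, 99, 7):
    # Same SQL text every time; only the bind value changes.
    cur.execute("SELECT name FROM customers WHERE id = :id", id=cust_id)
    print(cur.fetchone())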

CPM as presented here takes a holistic approach to performance analysis and encourages the analyst to concentrate on service time or wait time as appropriate for the situation at hand. If the real problem is service-time related rather than wait time, it will be indicated by CPM and its cause corrected.

Although the earlier automobile traffic example is easy to understand, the importance of wait time is all too easy to forget when dealing with the abstractions of computer software. However, that example can highlight how a database server might have a buffer cache hit ratio of ninety-nine percent and at the same time exhibit abysmal response time. Or, how a large parallel query might take too long to complete while CPU consumption mysteriously drops to near-idle levels. When the CPU is not working, it is waiting.

    VARIANCE, UTILIZATION AND CAPACITY

Queuing analysis is helpful in understanding resource utilization and for optimizing service levels. In queuing analysis, the exact timing of an event is not always known. Customer arrivals, or computer users clicking the submit button to invoke a database request, tend not to be uniformly timed, and often come in groups. This is a common statistical phenomenon known as variance. It is simpler and more effective to instead deal with the aggregation of events and construct a mathematical model based on the probability of each event. Since customer arrival times and hamburger preparation times vary, a model can take the form of a graph illustrating the effects of congestion, or busy-ness. From that model, an analysis can be performed of response time, throughput, or the nature of a bottleneck.

The manager of the hamburger restaurant knows from experience that people arrive at random intervals. That is, while there might be an average of three customers per minute during the mid-morning hours, people don't actually arrive at exactly twenty-second intervals. They come in groups and as individuals at unpredictable times. Thus, variances in arrival rates may have an effect on our response time.

An idle resource, like an employee or a CPU, is often seen as wasted capacity. However, having an occasionally idle resource may be the price one pays to provide a level of service needed to be competitive. Similarly, the freeway we use to drive to work during rush hour may have several lanes idle at two o'clock in the morning. During rush hour, all lanes may be full and backed up. Extra slack capacity is traded off for busy-time response and throughput.
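Queuing theory makes the cost of running near full utilization concrete. As a sketch (the formula is the classic M/M/1 single-server result, not something from this paper), mean response time is R = S / (1 - U) for service time S and utilization U, so response time explodes as slack capacity disappears:

# Response time versus utilization for a single server (M/M/1).
S = 0.010   # 10 ms of service time
for U in (0.50, 0.80, 0.90, 0.95, 0.99):
    R = S / (1 - U)
    print(f"utilization {U:.0%}: mean response time {R * 1000:6.1f} ms")

At 50% utilization the response time is 20 ms; at 99% it is a full second. That is the slack being traded for busy-time response.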

In computing systems, congestion can be experienced as either idle CPU time or growing process run queues; unused memory or swapping virtual memory; idle or busy disk. We may not be able to determine precisely how many users will be logged on at one time, or exactly what the workload will be, so we may have to provide some margin of extra capacity in order to get our business completed on time.

In a large enterprise, the queuing model presents itself within the measure of end-to-end application response. A user pressing a mouse button in an office may be physically and logically miles from the data of business interest. The total time a user waits before their screen responds with a display of data is the sum total time for each system component between that mouse and the distant repository of data, as well as the return trip. Each component of technology has its own process model and is a potential contributor to response delay. We will refer to these interconnected technology components as technology stacks. Examples include the network, database server, application server, the underlying hardware machines, and their operating systems.

With a basic understanding of queuing theory, we need to develop a way to apply it to the technology problem at hand. We need to have access to information which tells us when system components are busy, how busy they are, and what they are waiting for when they are not busy. Fortunately, there are numerous sources for this information. All we need is to identify them and to find a cohesive way to bring this information together in an understandable manner.

Although each of these stacks consists of sub-processes, each with their own queuing models, we can view the overall stack as an aggregate process and consider its response as a unit. For the Oracle Database Server there exist a number of statistical values available for report, called wait events, indicating the presence or absence of internal bottlenecks. Measuring changes in the performance of an Oracle database involves viewing these wait events by value of time waited and comparing these wait times to the same measure from a different time period. Other stacks involved in the end-to-end application view typically have tools to provide similar information. We will discuss some of those tools in more detail later. Let's now forge on to the practical details of diagnosing performance issues.
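As a hedged sketch of "time waited, compared across periods": the lines below snapshot V$SYSTEM_EVENT twice and print the events whose cumulative wait time grew the most, in the spirit of what Statspack reports. It assumes python-oracledb, SELECT privilege on the view, and placeholder credentials; the sixty-second interval is arbitrary.

# Crude wait-event snapshot-and-diff (TIME_WAITED is in centiseconds).
import time
import oracledb

def snapshot(cur):
    cur.execute("SELECT event, time_waited FROM v$system_event")
    return dict(cur.fetchall())

conn = oracledb.connect(user="perf", password="secret", dsn="dbhost/orclpdb")
cur = conn.cursor()
before = snapshot(cur)
time.sleep(60)                        # the measurement interval
after = snapshot(cur)
deltas = {e: t - before.get(e, 0) for e, t in after.items()}
for event, waited in sorted(deltas.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{waited / 100:10.2f}s  {event}")   # top ten waits this interval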

    THE ENGINEERING APPROACH

Certainly, the need for engineering discipline in the deployment and management of mission critical applications is well understood. Such discipline may be currently less widely applied toward performance management than other areas of enterprise technology, but an engineering approach to the performance of an application is equally as important as engineering an initial deployment. While practices vary from enterprise to enterprise, certain key practices have been identified by Oracle's Center of Expertise as essential to effective performance management. First among these is the establishment of a Service Level Agreement (SLA). It is beyond the scope of this paper to fully define the nature of such an agreement. Nevertheless, it is clear that in order to declare a particular aspect of system performance as "bad", one must first have a clear definition of "good". One goal of the COE Performance Method described here is to achieve the performance commitments of the SLA and to diagnose variances from that SLA.

    SERVICE LEVEL AGREEMENT

Since an SLA is an agreement between a technology service provider and a user, it tends to be a bottom-line document. That is, the agreement is for a particular specification of availability and performance for a technology-based service. As such, it tends to focus on end-to-end service and does not bother with the interconnected details in the middle. It is up to the technology provider to understand and define the interconnected components (stacks) and to support them. Technology stacks in a contemporary information environment will include database servers, application servers, hardware and operating system platforms on which to run those servers, network components such as routers, hubs, gateways and firewalls, and workstations with user interface software for end users. Each stack has its own set of support issues and available tools for management.

In order to be able to effectively respond to reactive performance issues, the service provider should take a proactive approach. The tools and techniques needed to diagnose wait time versus service time for each technology stack must be implemented and in place, and they should be well understood by the service provider prior to any actual performance diagnostic engagement. This deployment includes not only the tools, but also the engineering training and support to use them.

Oracle Database Server from version 8.1.6 and beyond has been shipped with a tool called Statspack. Statspack is specifically designed to monitor server performance and offers a high level view of server wait events, the key to tracking down database performance bottlenecks. Operating system tools such as sar, netstat, glance, vmstat and iostat, among others, are also available on most UNIX platforms and are quite effective in combination with Statspack for overall proactive diagnostic monitoring. Windows NT and its successors, Windows 2000 and Windows XP, also come packaged with performance monitoring tools. Third party tools are also available and many are quite effective, although they generally have a price tag associated with them. Statspack is available free of charge, as is usually the case with the operating system tools mentioned above.

PERFORMANCE BASELINE REFERENCE

Whatever our toolset choices, we need to use those tools to establish and maintain a performance metric baseline. This takes the form of actual performance data gathered at appropriate times, using tools such as those already mentioned, to establish some measurable norm. A baseline might consist of an elaborate set of gathered data, or may be as simple as a benchmark timing of a standard query. The important characteristic of the baseline is that it is consistent and offers a reasonable basis of comparison. Data gathered should represent actual system performance taken during one or more periods of busy activity. A baseline of data gathered while the system is idle is of little use.
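At the simple end of that spectrum, a baseline can be as small as a timed standard query logged for later comparison. A minimal sketch, assuming python-oracledb; the query, credentials and log file are placeholders.

# Time a standard benchmark query and append the result to a log.
import time, datetime
import oracledb

conn = oracledb.connect(user="perf", password="secret", dsn="dbhost/orclpdb")
cur = conn.cursor()
t0 = time.perf_counter()
cur.execute("SELECT COUNT(*) FROM orders WHERE order_date > SYSDATE - 7")
cur.fetchall()
elapsed = time.perf_counter() - t0
with open("baseline.log", "a") as log:
    log.write(f"{datetime.datetime.now().isoformat()} {elapsed:.3f}\n")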

The baseline will need to be maintained as the system evolves with respect to workload, functionality and configuration. If you add new application features, upgrade the database version, or add or replace CPUs or other hardware, the environment has changed and therefore performance may have changed. If the baseline is not reestablished, any understanding of a future performance complaint by the user community will be compromised and blurred: one will not be able to know if a performance change is due to a configuration issue or is a bug introduced with a new application feature. The baseline is established for this system in this environment and enables a comparative analysis to be made to diagnose a specific problem.

The issue of the performance complaint itself is worthy of some note. One of the problems inherent in managing complex systems is the uncertainty of the performance metric. Performance is largely a matter of perception. A user may decide one day that a two second response for the execution of a particular form is acceptable, but unacceptable the next day, depending on issues like how hurried or relaxed the user feels on a particular day. This suggests the information used for the reference baseline needs to be coordinated with the metrics used for the SLA. Even though performance complaints may still be lodged, at least the system or database administrator has either a defense to offer or a starting point to diagnose the issue.

ENGINEERING A SIMPLE METHOD

One of the best features of the COE Performance Methodology is that it lends itself to performance analysis of large systems of interconnected technology stacks. Since our premise is that a system is no faster than its worst bottleneck, it is obviously important to be able to identify the location of that bottleneck. Moreover, although Oracle tends to be the common denominator from the perspective of users and management alike, we know from experience bottlenecks can just as well reside in the network, the application server, or an operating system.

In order to identify the problem technology stack, and ultimately the actual problem itself, we need a systematic approach. The essential steps of the CPM approach, illustrated in Figure 1, will now be discussed briefly.

The COE Performance Methodology, in a nutshell

    Problem Statement

    Information Gathering / Stack Identification

    Stack Drill-Down

    Fix the Stack

    Test Against Baseline

    Repeat Until Complete

As illustrated, the basic steps of the COE Performance Methodology are straightforward. By starting at a high level, broad view of the enterprise system and rigorously following the steps in an orderly manner, positive results are achieved simply, quickly and without expensive and time-consuming guesswork.

    Figure 1


PROBLEM STATEMENT

A clear and unambiguous definition of both good and bad behaviour is essential. The problem statement is more than half of the battle for a solution and defines success for us. Moreover, the discipline of stating the problem clearly and concisely often uncovers possible solutions. There is an undeniable and innate siren song offering a temptation to gloss over this step, but this temptation must be resisted so that misunderstandings and inefficiencies are avoided. If you think you are solving one problem and the customer or user has a different expectation, valuable time will be wasted addressing misguided issues. An example of a weak problem definition would be, "Online queries are slow and need to be much faster," while a good problem statement might be, "The Customer Service Name Lookup screen normally returns query results in 3-4 seconds, but has been taking more than 20 seconds since the open of business this morning."

Define the problem specifically and concisely, establish the measure of success with the customer, and make certain you have agreement. The accordant goal must, of course, be reasonable and realistic. The definition needs to be quantifiable in terms conforming to the SLA metrics. The weak problem statement example above is harmfully vague: how would we know when we have succeeded in finding a solution? In our good example, if the SLA requires specific response times for the application function in question, we at least have a target for success and therefore a greater probability of success.

Sometimes a clear problem statement is elusive. When things go wrong, often during critical business hours, tempers flare and communication lines break down. Sometimes the issue is obvious, while at other times we wonder if we are simply imagining a problem that does not exist. When in doubt, ask yourself the simple question, "What makes you think there is a problem here?" and then demand of yourself a very specific answer based on symptomatic behavior. As Winston Churchill said, "Never overlook the obvious." It may well be the cause of the problem is already understood or suspected. A clear description of what the problem is and isn't will go a long way toward quickly resolving both obvious and obscure problems.

Take the time to clearly define the nature of the performance symptom and the time and circumstances of its appearance or disappearance, and to establish a valid test. Say what is known about the problem, and describe what is not known. A previously developed test case is ideal, and if one does not exist in advance, now is the time to create one. A test case can be as simple as the execution of a procedure through SQL*Plus and then also through the web server, with a measurement of response times. The result of the test needs to be compared to the baseline, so the importance of a valid and current baseline is therefore apparent. If a baseline was not established in advance, get one now so that you at least have the current bad performance captured and have something against which to measure the impact of changes. Not all changes are good.

INFORMATION GATHERING / STACK IDENTIFICATION

Execute the test case and record the result. Gather associated performance data from all technology stacks defined earlier, using appropriate tools. Compare the test results for each stack to the baseline for that stack and identify the most probable stack as the source of the bottleneck.

What is needed for this critical stack identification step is a cursory check of each stack potentially involved in the problem. For hardware platforms, it may be a straightforward tool such as sar, iostat or vmstat. Network tools include netstat and ping. For the Oracle database server, a quick review of the alert log or error trace files will frequently turn up critical evidence for the trained database administrator. The ideal test is the one that yields the most information with the least effort, so proceed accordingly.

Ideally, we will gather overall system resource data as well as service and wait times for each individual stack in order to determine which stack is the biggest bottleneck. This is one of the biggest challenges: getting a coherent, end-to-end measure of response time through each stack. Some organizations prefer to develop and maintain their own monitoring tools, and there are plenty of open source and freeware resources available for use, including various scripting languages such as perl and tcl. A common practice is to use operating system command utilities such as vmstat and iostat, shown in Figures 2 and 3, and to use a scripting language such as perl to analyze the text output. The tool can then "phone home" when exceptions are encountered or predefined thresholds exceeded.
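A sketch of that practice, using Python rather than perl purely for illustration: run vmstat, parse the run-queue column, and "phone home" (here, just a printed alert) when a threshold is crossed. The threshold and the number of header lines to skip are placeholders to adapt to your platform's vmstat layout.

# Watch the vmstat run queue and flag an exception past a threshold.
import subprocess

RUNQ_THRESHOLD = 4   # arbitrary; tune to CPU count and workload

out = subprocess.run(["vmstat", "5", "5"], capture_output=True, text=True)
lines = [l for l in out.stdout.splitlines() if l.strip()]
for sample in lines[3:]:          # skip two header lines and the first sample
    runq = int(sample.split()[0]) # 'r' column: processes waiting to run
    if runq > RUNQ_THRESHOLD:
        print(f"ALERT: run queue {runq} exceeds {RUNQ_THRESHOLD}")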

Having an integrated monitoring environment will facilitate rapid and accurate stack identification during a performance crisis. While elaborate third party tools are available for such an infrastructure, off-the-shelf and freeware tools are often entirely adequate, although any tools you choose will have to be integrated into your environment. For example, each UNIX platform in the enterprise might have a scheduled process to gather sar and netstat statistics on regular intervals. If Statspack snapshots are also collected at similar times, it is a simple matter to analyze reports from those tools for a period of concern and compare the available data to reports from, say, exactly one week or one month earlier. If the application workload is similar for both periods, but the performance problem did not exist in the earlier period, we have a fast way to compare bad performance data to baseline data. If the problem is with the underlying UNIX platform or the network, it should be apparent immediately. Even without the baseline, a trained technician will recognize symptoms of constraint: a high percentage of CPU wait time or process swapping activity, for example. See Figure 2 for an example of vmstat output.

If no obvious starting point presents itself, we recommend you start with the database server itself. One obvious reason is the database administrator understands that stack best. Another advantage is the Oracle server gathers and provides information offering clues to problems across other stacks. For example, network problems often show up as a specific Oracle wait event, "SQL*Net more data to client".

Knowing the response time through the database stack will allow you to determine whether most of the overall response time is spent in the database or not. This in turn will direct your attention to the database itself or to another stack.


    $ vmstat 5 5

procs     memory              page                        disk          faults           cpu
r b w   swap     free   re  mf   pi  po  fr    de  sr s0 s1 s2 s3   in    sy    cs  us sy id
0 1 0     6968  20376   49 1775  37 399 577 87248  27  8  0 37 36 4520  6649  7044  17  6 77
1 1 0 31807664 443376   10 1037   0   0   0 51536   0  1  0 49 49 5251 61709  8798  29  6 64
1 4 0 31798856 443008    1  934  68   0   0 30456   0  1  2 54 55 5277 77163 17295  32  8 60
0 1 0 31807744 441872    0 1051 164   0   0 18000   0  4  8 49 49 4755 74029 11738  31  7 62
0 0 0 31808072 441376    0  986   0   0   0 10640   0 10  0 47 47 4720 48430  5518  23  5 72

This is a vmstat sample taken from a 32-processor Sun system for five intervals of five seconds each. Statistical sampling is such that we ignore the first line of vmstat. A quick glance under the procs section tells us there is some process run queue wait time (r is either 0 or 1 in this example) and some resource waiting (b > 0 for most interval samples). This is generally considered good, non-bottlenecked performance, although the b value indicates a process blocked by an I/O wait, so disk may need balancing if that b value grows. Run queues are averaged for all CPUs for Solaris.

Memory paging and swapping are not the same. Paging, even with these seemingly large numbers, is quite normal. The sr column tells you how often the page scanner daemon is looking for memory pages to reclaim, shown in pages scanned per second. Consistent high numbers here (> 200) are a good indication of real (not virtual) memory shortage.

The fields displayed are:

procs    Report the number of processes in each of the three following states:
         r    in run queue
         b    blocked for resources (I/O, paging, and so forth)
         w    runnable but swapped

memory   Report on usage of virtual and real memory:
         swap    amount of swap space currently available (Kbytes)
         free    size of the free list (Kbytes)

page     Report information about page faults and paging activity, in units per second:
         re    page reclaims
         mf    minor faults
         pi    kilobytes paged in
         po    kilobytes paged out
         fr    kilobytes freed
         de    anticipated short-term memory shortfall (Kbytes)
         sr    pages scanned by clock algorithm

disk     Report the number of disk operations per second, per disk unit shown.

faults   Report the trap/interrupt rates (per second):
         in    (non-clock) device interrupts
         sy    system calls
         cs    CPU context switches

cpu      Give a breakdown of percentage usage of CPU time (on MP systems, an average across all processors):
         us    user time
         sy    system time
         id    idle time

    Figure 2

TIMING IS EVERYTHING

An important consideration when evaluating third party tools, or rolling your own, is to gather and analyze data in a meaningful manner. For the most part, we are dealing with statistical samples when we monitor hardware and software resources, so sampling techniques must be sensible with respect to sample size and interval. The vmstat report shown in Figure 2 was taken at five-second intervals. While short intervals show performance spikes quite well, they also tend to exaggerate variances in values and therefore contain statistical noise. A better method is to take concurrent short and long samples to be able to analyze both averages and variances to get a meaningful picture of performance.

$ iostat -xtc

    extended device statistics tty cpu

    device r/s w/s kr/s kw/s wait actv svc_t %w %b tin tout us sy wt id

    sd0 1.7 6.1 34.5 46.9 0.0 0.2 26.1 0 4 0 48 18 6 5 70

    sd1 0.1 0.0 1.1 1.7 0.0 0.0 7.6 0 0

    sd2 2.0 35.9 24.0 416.1 0.0 0.2 4.9 0 18

    sd3 1.2 35.9 9.5 416.1 0.0 0.2 4.9 0 18

    sd4 0.3 1.5 15.8 13.5 0.0 0.0 17.7 0 1

    sd5 0.4 7.8 28.4 17.1 0.0 0.1 10.2 0 6

    sd15 8.8 2.3 97.5 11.8 0.0 0.2 14.0 0 5

    sd16 14.1 6.7 140.1 46.7 0.0 0.3 14.6 0 12

    . . .

This is an abbreviated iostat report from the same 32-processor system as shown in Figure 2. The svc_t column is actually the response time for the disk device, however misleading the name. When looking for input/output bottlenecks on disks, a rule of thumb is to look for response time greater than 30 milliseconds for any single device. A well-buffered and managed disk system can show response times under 10 milliseconds.

Here are the field names and their meanings:

device   name of the disk
r/s      reads per second
w/s      writes per second
kr/s     kilobytes read per second
kw/s     kilobytes written per second
wait     average number of transactions waiting for service (queue length)
actv     average number of transactions actively being serviced
svc_t    average service time, in milliseconds
%w       percent of time there are transactions waiting for service (queued)
%b       percent of time the disk is busy (transactions in progress)

    Figure 3

A sudden burst of activity might cause a single disk drive to be so busy as to cause process queuing, yet may not be of any real concern unless it becomes chronic. On the other hand, long iostat samples will average disk service time and tend to hide frequent spikes, possibly masking a real problem. See Figure 4 for an example of a CPU resource measurement illustrating how large variances in reported data can be misleading.

If you look at the data for too short an interval, you might conclude CPU idle time is nearly seventy percent, or nearly as low as twenty percent. If you are trying to analyze a performance anomaly during a period of high or low CPU usage, such a narrow slice of data can be quite helpful. On the other hand, taken as an indication of the norm, such a microscopic view could be completely misleading.
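The point is easy to demonstrate in a few lines of code, here with made-up %idle samples: the long-run average and spread tell one story, while short four-sample windows tell quite another.

# Averages versus variance: long samples hide what short windows show.
from statistics import mean, pstdev

idle = [70, 65, 22, 25, 68, 71, 30, 66, 69, 24, 67, 70]   # %idle, invented

print(f"overall: avg {mean(idle):.1f}%  spread {pstdev(idle):.1f}")
for i in range(0, len(idle), 4):
    w = idle[i:i + 4]                   # a short window of four samples
    print(f"window {i // 4}: avg {mean(w):.1f}%  min {min(w)}%  max {max(w)}%")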

The first priority at this early juncture is to eliminate obvious problems that can skew performance data and blur the analysis. We are concerned with quickly ascertaining the overall health of the components of each technology stack to make sure we know where the possible problem both is and isn't. We do this by looking for exceptions to what we know to be normal behavior.


[Figure 4: a line chart titled "CPU Idle Time", plotting %idle samples taken at fifteen-minute intervals from roughly 6:51 to 9:36 against a trendline, with one interval marked "Low" and another marked "High".]

CPU idle times extracted from a sar report. The jagged line represents samples taken at fifteen-minute intervals. The trend line is shown to illustrate the degree to which variances among individual samples can be distracting and misleading. You need both average and variance information to get a true picture of what is happening at the hardware and operating system levels. The interval marked Low is entirely different from the interval marked High. A narrow peek at a performance variation can be useful for analyzing bottlenecks, but can be misleading if taken as an indication of the norm.

    Figure 4

For example, perhaps we received a report that an Oracle server had severe latch free wait events during a period of bad performance. If we respond directly to that symptom without adequate high-level analysis of the overall platform/database technology stack, we might overlook heavy process queuing at the operating system level. That is, the Oracle database might appear to be the problem, when the real issue is a lack of capacity. Reports from vmstat or iostat would indicate chronic process run queues, so we would know that the Oracle database itself is probably not the culprit, at least not the primary culprit. Once the resource limit is addressed, by tuning the application, rescheduling processes or adding more or faster processors, we can proceed once again with the stack analysis and identify server constraints in their proper context.


tracert mail12

Tracing route to mail12.us.snerdley.com [148.22.88.200]
over a maximum of 30 hops:

  1


TEST AGAINST BASELINE

Now that the single bottleneck has been identified and relieved, it is time to rerun the test case and compare the result to the baseline and SLA to establish relative success. We use the term relative here to suggest the problem might not be altogether solved. It is common to find the relief of one bottleneck only serves to reveal another. If you have achieved success, document that fact, stop tuning and go home. You do get to go home, don't you?
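Continuing the earlier baseline sketch, the comparison itself can be nearly trivial. This assumes the baseline.log file written above; the twenty percent tolerance is an arbitrary stand-in for whatever the SLA actually specifies.

# Compare the latest benchmark timing against the first recorded baseline.
timings = []
with open("baseline.log") as log:
    for line in log:
        timings.append(float(line.split()[1]))

baseline, latest = timings[0], timings[-1]
if latest > baseline * 1.2:
    print(f"regression: {latest:.3f}s versus baseline {baseline:.3f}s")
else:
    print(f"within tolerance: {latest:.3f}s versus baseline {baseline:.3f}s")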

Performance management is, of course, an ongoing process. This is not meant to suggest the diagnostician will walk away and not continue to monitor performance. On the contrary, proactive monitoring is the best way to avoid emergencies. It is important, however, to distinguish between reactive and proactive efforts and not to be caught in the trap of managing one crisis into the next. After the crisis is resolved, review performance against the baseline and update the baseline if hardware or software configurations have changed. Continue to monitor proactively.

    REPEAT UNTIL COMPLETE

If success, as defined by the agreement established in the problem statement, is not yet declared, go back to the second step above and rerun the analysis to identify the stack now containing the worst bottleneck. Consider the possibility the bottleneck has moved to another stack. It is also possible there is no ready relief for the problem. This may be a case where a performance problem is actually a capacity issue, in which case an investment decision may need to be made. Alternatively, the root of the problem may be a bug or a hardware failure for which there is no immediate solution.

Often one symptom will mask another. It is not uncommon for multiple, unrelated problems to manifest themselves at the same time. In a recent engagement involving a sudden and dramatic increase in response time in a production database, heavy contention was discovered within the file system. Once several large objects were moved to other, less busy disk drives, throughput increased fourfold, but response time for individual users was still slow. Further investigation from the top down revealed certain SQL statements did not properly use an index. Both issues surfaced at the same time because of the introduction of a new business transaction type causing a concentration of activity on the affected disk objects, while at the same time invoking SQL statements not previously executed. Once the SQL statement was corrected to be more selective, performance returned to normal, acceptable levels and the engagement ended.

    Performance problems are like onions: you peel them one layer at a time.

    TOOLS TO DO THE JOB

In order to perform the multiple levels of diagnostics required for each stack, a number of tools will be needed. Commercial software and hardware products are available from various vendors, and free software tools abound. It is beyond the scope of this paper to attempt to identify all such tools, but some obvious sources are hardware and software vendors as well as the various open source consortia. Commonly used diagnostic tools mentioned already include sar, iostat, vmstat, netstat and ping for UNIX platforms. Some tools offer varying degrees of comprehensiveness and integration. Naturally, an integrated tool is likely to be more convenient to implement than a set of point-solution tools.

For Oracle servers, obvious choices include Oracle Enterprise Manager (EM), the utlbstat/utlestat scripts, and Statspack. EM has features incorporating the basic methodology described here. Utlbstat/utlestat and Statspack have the virtue of being included with the server at no extra charge. Statspack has been shipped with Oracle database servers since 8.1.6 and is intended as a replacement for the utlbstat/utlestat scripts. It offers excellent and comprehensive features for ongoing monitoring of the database. All of these tools will report data for selected intervals and will provide a view of the wait event interface built into the Oracle server kernel.
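Snapshots can also be scheduled from any scripting environment. A hedged sketch, assuming Statspack is installed under its default PERFSTAT schema (with a placeholder password) and using python-oracledb; reports are then generated with spreport.sql as usual.

# Record one Statspack snapshot now.
import oracledb

conn = oracledb.connect(user="perfstat", password="secret", dsn="dbhost/orclpdb")
cur = conn.cursor()
cur.callproc("statspack.snap")
conn.commit()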


A MEASURE OF DIPLOMACY

Besides tools to cover the technology spectrum under your domain, you will also need occasional cooperation from other experts. One of the more common problems of the contemporary enterprise is a direct outgrowth of the integration of disparate technologies: communication barriers. Often, the administrators of the database, the hardware platform and the network belong to entirely different management structures. While a performance methodology such as this cannot address political turf, cooperation is necessary to quickly diagnose potentially complex problems.

BACK TO THE CONCEPTS MANUAL

An understanding of Oracle concepts is fundamental to effective performance analysis. Have you read the Concepts Manual lately? An understanding of all components of the Oracle server is contained in that material, including Buffer Cache operations, enqueues, latches, the Library Cache, the Shared Pool, redo, undo, and the lgwr, dbwr and smon background processes. Oracle9i documentation includes Oracle9i Database Performance Methods, which along with Oracle9i Performance Guide and Reference provides an in-depth discussion of server and application tuning.

For technology stacks other than the database, there is a wealth of material to read. Some excellent sources are listed in the bibliography below. Bear in mind some of them are written from the perspective of a particular operating system, but contain concepts applicable to all brands and flavors of platform.

Documents are available on the Oracle Technology Network site providing an understanding of the wait events Oracle records, to give you the queuing analysis perspective you need to apply this methodology and to tune the database product effectively. There is a discussion of Oracle wait events, in some detail, as well as an introduction to wait event analysis known as Yet Another Performance Profiling Method (YAPP), by Anjo Kolk. Also, Oracle9i Database Performance Methods applies the holistic approach to the database in particular. Both are well worth reading. See the Bibliography for details and additional reading.

    ACKNOWLEDGEMENTS

The Center of Expertise Performance Methodology has been a collaborative work of many individuals. Current and former members of COE, including Jim Viscusi, Ray Dutcher, Kevin Reardon and others, provided much of the early research. Cary Millsap offered the theoretical foundation for this effort.

BIBLIOGRAPHY

Practical Queueing Analysis, Mike Tanner, McGraw-Hill Book Company (out of print in the United States, but a classic worth finding, available at Amazon's United Kingdom site)

The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, Raj Jain, John Wiley & Sons

    Capacity Planning for Web Performance, Daniel A. Menasce, Virgilio A. F. Almeida, Prentice Hall

Oracle8i Designing and Tuning for Performance, Release 2 (8.1.6), Oracle Corporation, part A76992-01

Oracle9i Database Performance Methods, Oracle Corporation, part A87504-02

Oracle9i Database Performance Guide and Reference, Oracle Corporation, part A87503-02

Sun Performance and Tuning: Java and the Internet, Adrian Cockcroft, Richard Pettit, Sun Microsystems Press, a Prentice Hall title

Oracle Performance Tuning 101, Gaja Krishna Vaidyanatha, Kirtikumar Deshpande, John A. Kostelac, Jr., Oracle Press, Osborne/McGraw-Hill

    Oracle Applications Performance Tuning Handbook, Andy Tremayne, Oracle Press, Osborne/McGraw-Hill

Yet Another Performance Profiling Method (YAPP), Anjo Kolk, http://metalink.oracle.com