
Establishing a System Resource Usage Baseline Profile

August 10, 2001

Report By: Larry Higa, Lawrence Higa Consulting, Inc.


Contents

Overview
Elements of a System Usage Profile

Primary Elements of System Usage
  1. CPU (ResusageSpma Data) – Node Level
  2. Disk I/O (ResusageSpma Data) – Node Level
  3. Available Free Memory & Paging/Swapping (ResusageSpma Data) – Node Level
  4. Number of Concurrent Active Sessions

Secondary Elements of System Usage
  1. CPU (ResusageSvpr Data) – Vproc Level
  2. Disk I/O (ResusageSvpr Data) – Vproc Level
  3. Bynet I/O (ResusageSpma Data) – Node Level
  4. Host I/O (ResusageShst Data) – Vproc Level

Sample Charts of Typical and Problem Situations
  1. Parallel Efficiency Average / Max Node CPU Chart
  2. OS as PCT of CPU vs. AVG CPU Busy Chart (OS % CPU Problem)
  3. OS vs. DBS CPU Busy Chart (Different view of OS % CPU Problem)
  4. Poor Parallel Efficiency Average / Max Node CPU Chart
  5. Avg CPU Busy vs. CPU Idle Waiting for I/O Chart
  6. I/O Wait vs. Disk I/Os Chart
  7. CPU Idle Waiting for I/O vs. Buddy Backup Bynet I/O Chart
  8. Average & Min Free Memory Available vs. Total Page/Swap IOs
  9. Concurrent Sessions Chart


Overview

The purpose of establishing a system resource usage profile is to obtain a graphical and numerical picture of the usage of a system, to help isolate and identify performance problems that may be due to application changes, new software releases, hardware upgrades, etc. Having a long-term pattern of usage also enables one to see trends and helps with capacity planning. The pattern or profile of usage can be seen as a cycle: daily, weekly, monthly, etc., corresponding to the customer’s business or workload cycle.

From a performance monitoring / debugging perspective, one is looking for changes in the pattern. Usually, one is looking for a marked increase in a particular resource. Oftentimes, the system may be at 100% CPU capacity and the users’ applications are running fine with no complaints. Then something happens and the users are complaining about response time. The system is at 100% CPU busy, but this is no different from before. The change could be an increase in the number of concurrent queries in the system, or it could be an increase in the volume of disk I/O or in Bynet broadcast messages. In some cases, a longer term of several months may be necessary to see a significant change in the pattern. Once a change in pattern is correlated with a performance problem or degradation, one can eliminate possible causes of the problem and narrow the search for the basic causes.

A baseline profile should be established when the system is at a semi-steady state. This means data is loaded, updated on a regular basis (daily, weekly), and accessed by users or by a production application. The baseline period could be all hours of a day or just the on-line daytime hours when users are running their queries. For some users, the critical period may be the nighttime load of the data; for others, it could be a monthly load and report generation over a short 2-5 day period. There are also different levels of data summarization for the purpose of establishing a baseline profile. At one level, one can maintain the detailed logging period data. Other levels could be totals by hour or totals by day.

For some users, a profile can be established for a known set of benchmark queries. With benchmark queries, the common basis for determining whether the performance of a new software release is acceptable is whether the response times of the queries are nearly the same or better, or considerably slower. In situations where the system is “running slower”, the baseline profile will provide a contrast in the system resource usage between the different instances of running the benchmark.

Elements of a System Usage Profile

The primary sources of data for a system usage profile are the DBC system tables and views:

DBC.ResusageSpma
DBC.ResusageSvpr
DBC.ResusageSvpr2
DBC.ResusageShst
DBC.Diskspace
DBC.AccessLog
DBC.Ampusage

In general, the bulk of the system usage profile comes from the Resusage tables, in which data is recorded on a periodic basis, usually every 10 minutes. Established Resusage macros are used to extract the data, which is then automatically charted with an Excel program. For data from other tables, separate procedures need to be established in order to capture the data periodically.
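As a purely illustrative sketch of such a procedure (the execute() helper below is a hypothetical stand-in for whatever client interface is available, such as an ODBC connection; the column selection is one plausible choice, not a prescribed one), a daily DBC.Diskspace snapshot could be appended to a history file:

    # Hypothetical sketch: append a daily DBC.Diskspace snapshot to a history
    # file. execute() is a stand-in for whatever client interface is available;
    # it is assumed to return rows as tuples.
    import csv
    import datetime

    def capture_diskspace(execute, outfile="diskspace_history.csv"):
        rows = execute(
            "SELECT DatabaseName, SUM(CurrentPerm), SUM(MaxPerm) "
            "FROM DBC.Diskspace GROUP BY 1 ORDER BY 1"
        )
        stamp = datetime.datetime.now().isoformat()
        with open(outfile, "a", newline="") as f:
            writer = csv.writer(f)
            for database_name, current_perm, max_perm in rows:
                writer.writerow([stamp, database_name, current_perm, max_perm])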


The data in a profile can be grouped into primary and secondary elements. The primary elements are the more important ones that will generally give a first-level indication of a performance problem. For example, the system may be at a 100% CPU utilization bottleneck where normal is 80% or lower, or the system is at maximum disk I/O capacity compared to a normal of 50%. Another common situation involves the number of concurrent queries (tasks) in the system. Normal may have been 10 to 15 concurrent queries, and a performance problem occurs when there are 50 to 60 concurrent queries.

Secondary elements are useful for a more detailed analysis. In such situations, several factors may need to be considered together to get a proper interpretation of what is affecting performance.

When establishing a baseline profile, one must first gather and chart the data. Then, one needs to describe what one sees in the charts. The following sections contain a description of the individual elements that make up a profile and what one generally can see / interpret from the different elements.

Primary Elements of System Usage

Three primary elements give the best picture of a system baseline profile:

- CPU busy
- Disk I/O activity
- Concurrent active sessions

Data columns to look at and key interpretations are described below.

1. CPU (ResusageSpma Data) – Node Level

UNIX categorizes CPU busy information into 4 categories:

- busy executing user code
- busy executing operating system code
- idle waiting for I/O
- idle

The total of CPU busy executing user code and operating system code indicates how much of the CPU capacity is being used. UNIX detects the CPU busy state and passes this data on to the Teradata RDBMS where the data is recorded in the Resusage tables. This same data is passed to the UNIX sar (System Activity Reporter) which records the data in a flat file. While the source of the data is the same, the numbers can have a slight variation due to different logging time periods.

The important performance information that can be extracted about the system is:

- how busy the system is and whether there is more available capacity
- whether there is an imbalance of work across the nodes (skewing)
- whether there is an application problem causing the system to do inefficient repetitive processing

Average CPU Busy
Represents the average CPU utilization of all CPUs in all nodes. The current norm is 4 CPUs per node.¹ Within a node, the node CPU busy number is “normalized” by dividing the sum of the utilization of all the CPUs by the number of CPUs in the node.

This is the most important column in telling how much of the system is being used (from a CPU point of view). When this number is 100%, the system is running at maximum capacity.

¹ For the 5100, the norm is 8 CPUs per node.


Maximum CPU Busy
Represents the normalized CPU utilization of the busiest node in the system. For a parallel processing system, the overall relative efficiency of the system is calculated by dividing the average CPU busy by the maximum CPU busy. When this number varies greatly from the average CPU busy number for the same log period, there is a skew in the processing of the system. Skewing is a problem when it persists for multiple logging periods. Usually, this is an application issue rather than a system software or hardware issue. The reason for the imbalance in the workload of the system could be due to a number of different conditions. One, UNIX applications could be running in a single node. Two, partitioning of table data based on the primary index could be skewed where some nodes have significantly more data than other nodes. Three, processing of SQL join condition statements could cause a skewed redistribution of the data to a single AMP. Data going to a single AMP in turn means running in a single node.
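A minimal sketch of this arithmetic (illustrative Python with made-up utilization numbers, not part of the Resusage macros):

    # Illustrative sketch of node CPU normalization and parallel efficiency.
    # Utilization numbers are made up; 4 CPUs per node is the current norm.
    node_cpu_busy = {
        "node1": [100, 100, 100, 100],
        "node2": [100, 95, 90, 100],
        "node3": [40, 35, 45, 40],   # a lagging node drags efficiency down
    }

    # Normalize: node busy = sum of CPU utilizations / number of CPUs in node
    node_busy = {n: sum(c) / len(c) for n, c in node_cpu_busy.items()}

    avg_cpu_busy = sum(node_busy.values()) / len(node_busy)
    max_cpu_busy = max(node_busy.values())

    # Parallel efficiency = average node CPU busy / maximum node CPU busy
    parallel_efficiency = avg_cpu_busy / max_cpu_busy
    print(f"avg={avg_cpu_busy:.1f}%  max={max_cpu_busy:.1f}%  "
          f"efficiency={parallel_efficiency:.0%}")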

OS as Pct of CPU
Percent of CPU busy time that the system was executing operating system work as opposed to database work. The formula for calculating this column is the CPU time for the operating system divided by the sum of the CPU time for both the user and the operating system. This column does not represent the percent of absolute time that the system was spending in the operating system.

The lower the value of this column, the more time the CPUs are spending executing DBMS code. Conditions where this column goes below 20% are large product joins, duplicate row checks, collect statistics, SQL statements doing aggregation on many data columns, and SQL statements with a lot of numeric expressions or use of SQL functions such as INDEX, SUBSTR, etc. Oftentimes, when this number goes below 20% for lengthy periods of time and the maximum CPU busy is around 100%, it is an indication of a duplicate row check problem or a very large product join. Duplicate row check problems can be resolved by changing the physical data model of the table. The change could be modifying the primary index, adding a unique secondary index, or changing the table from a SET table to a MULTISET table.² For large product joins, this usually can be corrected by collecting statistics on the join columns of the tables involved in the joins.
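A minimal sketch of the OS as Pct of CPU calculation and a scan for the sustained below-20% condition (illustrative Python; the input layout is an assumption, while the 10-minute logging period follows the usual Resusage setting noted earlier):

    # Illustrative sketch: OS as Pct of CPU, and a scan for the sustained
    # below-20% condition discussed above.
    def os_pct_of_cpu(os_cpu_time, user_cpu_time):
        # OS CPU time divided by total CPU busy time (user + OS)
        total = os_cpu_time + user_cpu_time
        return 100.0 * os_cpu_time / total if total else 0.0

    def sustained_low_os_pct(os_pct_series, threshold=20.0, periods=12):
        # 12 consecutive 10-minute logging periods = 2 continuous hours
        run = 0
        for i, pct in enumerate(os_pct_series):
            run = run + 1 if pct < threshold else 0
            if run >= periods:
                return i   # index where the 2-hour condition is first met
        return None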

I/O Wait %
This is the most misunderstood column of all the Resusage columns. I/O Wait % is the percent of time the system is waiting for completion of disk or Bynet I/O and there are no other tasks available to run in the system. It does not necessarily mean the system is at I/O throughput capacity.

The more common cause of I/O Wait is that the database software (user queries) has requested a disk read and there is a lack of concurrent tasks running in the system. For example, if there is a single job running in the system, when the task finishes processing a data block and requests another data block, the system may have to do a physical I/O to get the data block. At this point, the system will initiate a physical I/O and schedule another task for execution. As long as there is another task to run, the CPU will not be idle even though there is a pending I/O completion. When there are no other tasks to run, the system will record that a CPU is idle and waiting for an I/O completion.

Some disk writes can occur asynchronously, in which case the writing task continues without waiting for I/O completion. However, when disk writes are for table data modifications that were not sent to another node for buddy backup logic, the task must wait for completion of the physical disk I/O.

I/O Waits can occur when there is a lot of Bynet Buddy backup activity. The task in the sending node must wait for acknowledgement from the receiving node that it received the buddy backup data before

² Duplicate row checks can happen inadvertently when the user neglects to define a primary index. In this case, by default, the first column of the table becomes the primary index. In an extreme case, the data for this column has only a single distinct value, which will result in on the order of (n² / 2) duplicate row checks (where n is the number of rows in the table). For example, a table of 1,000,000 such rows would require on the order of 500 billion row comparisons.


the task in the sending node can continue its processing. Because the receiving node can be busy for a number of different reasons, there is no Bynet threshold number to indicate that the buddy backup traffic over the Bynet is the bottleneck. In general, one can get an indication if this is the bottleneck by charting the I/O Wait column with the buddy backup column (in the ResNet macros).

The most common situation is that there simply are not enough concurrent tasks running in the system. A true disk I/O bottleneck occurs when the disk subsystem is transferring data at its maximum throughput capacity. Depending on the I/O hardware and configuration, I/O bottlenecks generally do not occur until the nodes are doing over 1400 I/Os per second, or transferring at least 70-80 MB/sec.
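Those rules of thumb lend themselves to a rough screening check, sketched below (illustrative Python; the example rates are made up):

    # Rough screen using the rules of thumb above: a true disk I/O bottleneck
    # is unlikely below ~1400 I/Os per second per node or 70-80 MB/sec.
    def io_bottleneck_suspected(ios_per_sec, mb_per_sec,
                                io_limit=1400, mb_limit=70):
        return ios_per_sec > io_limit or mb_per_sec >= mb_limit

    # A busy period that is still well under both limits: any I/O wait seen
    # here is more likely a lack of concurrent tasks than a true bottleneck.
    print(io_bottleneck_suspected(ios_per_sec=600, mb_per_sec=35))   # False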

There are no known cases of a Bynet bottleneck. The speed of the Bynet is significantly faster than the rate at which the nodes can provide data for transfer over the Bynet.

On TNT systems, I/O Waits are not detected and zero is recorded for this column.

CPU profile summary. The key interpretations of the CPU profiles are:

- CPU is at maximum CPU busy capacity (100%) for most (much/a high proportion) of the critical period (day shift, end of month processing, all the time, etc.). Another interpretation could be: the system averages 60% (an arbitrary percent less than 100%) and is almost never at 100% busy, i.e., there is meaningful CPU capacity available to do more work.

- There is significant node CPU skewing for a number of periods during the critical time periods. Or, there is occasional skewing, but only for brief periods of time, none large enough to cause a significant performance problem.

- The OS as percent of CPU is below 20% for over 2 continuous hours. This generally implies an application problem that should be investigated.

- The OS as percent of CPU varies from 10% to 80% without any special pattern. This value is below 20% only for 10-20 minutes at a time and happens only occasionally during the critical time periods. This could be due to the applications doing a lot of aggregation, collect statistics, or numerous arithmetic expressions in the queries.

2. Disk I/O (ResusageSpma Data) – Node Level

Disk I/O is recorded for both the number of reads and writes, and for the number of bytes transferred. For disk reads, both logical and physical reads are recorded in the ResusageSpma table. For disk writes, only the physical writes are recorded in the ResusageSpma table. In the ResusageSvpr table, logical and physical writes are recorded. Also recorded in the ResusageSvpr table is a breakdown of the disk I/O by type of I/O: table data, cylinder index (CI) for table data, spool, spool CI, transient journal (TJ) and permanent journal (PJ).

Position Reads (Logical and Physical)
Number of disk positioning reads. A position read occurs for the first data block of a cylinder, for cylinder indexes, and for random data block accesses.

Pre-Reads (Logical and Physical)
Number of disk pre-reads. Pre-reads, also commonly referred to as pre-fetches, occur only for full table scans. This could be for table data or for spools. When a query does a full table scan, the first block of the cylinder is accessed by a positioning read, and all other accesses to the cylinder are done with pre-reads. When there are no pre-reads, there are no full table scan queries.


Data Base Reads (Logical and Physical)
Sum of the disk position reads and pre-reads. This column is output as an easier means of seeing the total number of disk reads.

Disk Read Kbytes (Logical and Physical)
Number of Kbytes read for both the position reads and pre-reads.

Database Writes (Physical only)
Number of database disk writes. Writes occur for table data, spool data, cylinder indices, transient journal and permanent journal.

Disk Write Kbytes (Physical only)
Number of Kbytes written.

3. Available Free Memory & Paging/Swapping (ResusageSpma Data) – Node Level

At system start up, memory is logically divided into FSG Cache, for the Teradata file system to manage, and available free memory, for UNIX to manage. FSG Cache is used for table data, spools, TJs, PJs, buddy backup, etc. Basically, FSG Cache is used to manage the database data for queries and data modification. Free memory is managed by UNIX for AMP and PE code and data and Bynet buffers, including row redistribution and duplication. For non-TPA (trusted parallel application) work, i.e., a UNIX job rather than a Teradata RDBMS task, memory is also allocated from free memory. When the amount of available free memory goes below a certain threshold, UNIX initiates paging out of code or data from the UNIX-managed portion of memory. The key value for available free memory is 40 MB per node. Customers have experienced UNIX panics when the amount of free memory goes below the 40 MB threshold. In essence, the Teradata RDBMS puts such a heavy and quick demand on memory that it will exhaust the amount of free memory before UNIX can free up enough memory by paging out segments.

Guarding against these UNIX panics can be handled in two different ways. One is to set the tunable parameter FSG Cache Percent to a lower number. This essentially reduces the amount of memory dedicated to FSG Cache and makes it available to UNIX to manage. The drawback with this approach is that it tends to leave too much free memory that is never used. The second way is to set the UNIX memory tunable parameters LOTSFREE, DESFREE and MINFREE to higher values so that UNIX starts its paging earlier. Having the paging safeguard allows one to tune the FSG Cache Percent parameter to a higher value so that less memory is taken away from FSG Cache to give to UNIX.
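A minimal sketch of a check against the 40 MB floor described above (illustrative Python; the input layout is an assumption):

    # Flag logging periods where free memory fell below the 40 MB floor
    # associated with UNIX panics above.
    FREE_MEMORY_FLOOR_MB = 40

    def low_memory_periods(free_mem_by_period):
        # free_mem_by_period: list of (period_label, min_free_mb) tuples
        return [(label, mb) for label, mb in free_mem_by_period
                if mb < FREE_MEMORY_FLOOR_MB]

    samples = [("09:00", 180), ("09:10", 52), ("09:20", 31), ("09:30", 44)]
    print(low_memory_periods(samples))   # [('09:20', 31)]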

4. Number of Concurrent Active Sessions

Query response time is dependent on the number of active concurrent queries running in the system. The common situation is that many users are logged on to Teradata, but not all are running queries at the same time. Teradata Manager provides a method for logging sessions to a log file that can be processed later. The Performance Monitor Application Program Interface (PM/API) also provides a means for capturing session data and charting the active sessions in real time. Establishing this data as part of the profile allows one to correlate an increase in response time with an increase in the number of concurrent active AMP sessions.

Secondary Elements of System Usage


The secondary elements of system usage provide a profile of the system usage, but are not critical for immediate problem detection. They provide a background for comparison when problems occur to help identify the kind of changes in system usage that occur at the time of the problem. The secondary elements include:

- CPU busy at the vproc level
- Disk I/O by type – table data, table data CI, spool, transient journal, permanent journal
- Bynet I/O by type – buddy backup (complete and partial segments), point to point, broadcast
- Host I/O

1. CPU (ResusageSvpr Data) – Vproc Level

CPU time is recorded for each AMP and PE vproc in the system. In addition, every node also has a node vproc which generally handles the physical I/O to disk and to the Bynet. Meaningful Resusage profiles to look at are:

AMP Vproc Skewing
Hot AMP problem.

Comparison of Vproc Level CPU Use vs. Node Level CPU Use
Indication of non-TPA work on the system, which could be the cause of node skewing.

PE and Node Vproc Skewing
Imbalance of PE or node level processing.

Combined AMP, PE and Node Use
Shows the proportion of AMP, PE and node vproc use. The unusual case is when the PE has a relatively higher percentage than normal.

2. Disk I/O (ResusageSvpr Data) – Vproc Level

Disk I/O at the vproc level is broken down by type of I/O. Some applications will do a lot more spool I/O than table data I/O due to complex, multiple-table joins. Other applications may build only small spools because data is aggregated and only a small answer set is built. At different times of the day, one may see update processing going on, indicated by the existence of TJ I/Os. Some users’ workloads are such that table data is supposed to be updated only at night, so the day shift should only read table data and build intermediate spools. If a large number of TJ I/Os take place during the day shift (more than the system overhead), this could be an indication of a job running at the wrong time and a reason why normal query response time is slower than normal.

A comparison between logical and physical I/Os gives an indication of how well memory is used to cache data. Typically, spools are highly cached.
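In sketch form (illustrative Python; the counts are made up), the cache hit ratio implied by that comparison is:

    # Cache effectiveness: fraction of logical reads satisfied from memory,
    # i.e., reads that did not require a physical disk I/O.
    def cache_hit_ratio(logical_ios, physical_ios):
        return 1.0 - (physical_ios / logical_ios) if logical_ios else 0.0

    # Spools are typically highly cached, as noted above.
    print(f"{cache_hit_ratio(logical_ios=50000, physical_ios=4000):.0%}")   # 92%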

Disk I/O data to look at includes the logical and physical number of disk reads and writes, and Mbytes transferred, for:

- Table data blocks and table data CIs
- Spool blocks and spool CIs
- TJ (Transient Journal) and PJ (Permanent Journal)

For disk management data associated with running out of free disk cylinders, the data to look at are:

- Mini-Cylinder Packs
- Cylinder Defragmentations


3. Bynet I/O (ResusageSpma Data) – Node Level

Bynet I/O covers inter-nodal communication. From a query viewpoint, after a statement is parsed and steps are created, the Dispatcher sends steps to the AMPs for execution. For all-AMP operations, a broadcast message is sent to all AMPs for step execution. Step completion is indicated by point to point messages. Generally, query step messages are few in number compared to the actual processing of data. For join processing, data is often redistributed (Bynet point to point messages) or duplicated (Bynet broadcast). For updates via BTEQ or TPump, updated data blocks, CIs for updated table blocks, TJs and PJs are sent to a buddy node as part of the buddy backup process. When data blocks that were sent to a buddy node are written to disk, a flush message is sent to the buddy node to tell it to discard the backed-up data block.

The buddy backup is especially useful when there are many updates to the same data block. This can occur when doing volume updates via TPump, or when primary index updates are made through BTEQ or a user’s pre-processor program. The many updates to the same block can be detected in the data by looking at the buddy backup complete and partial segments. When the number of buddy backup partial segments is nearly zero or is relatively small compared to the complete segments, the user has the option of turning off the use of the buddy backup mechanism for table data by setting the DBS tunable parameter WrtDBsToDisk to TRUE. If the important factor to optimize is throughput, then setting the parameter to TRUE is helpful. If response time for individual transactions is more important, then setting the value to FALSE is the better option.
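A hedged sketch of this decision (illustrative Python; WrtDBsToDisk is the tunable named above, but the 5% “relatively small” cutoff is an assumption, not a documented threshold):

    # Decision heuristic from the discussion above: if partial-segment buddy
    # backups are small relative to complete segments, and throughput matters
    # more than individual response time, WrtDBsToDisk = TRUE may be worthwhile.
    def suggest_wrt_dbs_to_disk(complete_segs, partial_segs,
                                optimize_for_throughput,
                                partial_ratio_cutoff=0.05):   # cutoff is assumed
        mostly_complete = partial_segs <= partial_ratio_cutoff * complete_segs
        # TRUE favors throughput; FALSE favors individual response time
        return mostly_complete and optimize_for_throughput

    print(suggest_wrt_dbs_to_disk(complete_segs=100000, partial_segs=1200,
                                  optimize_for_throughput=True))   # True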

Bynet messages are:

Point to Point Messages
Number of I/Os & Mbytes transferred; also KB per I/O.

Broadcast Messages
Number of I/Os & Mbytes transferred; also KB per I/O.

Redistribution
This is a derived value based on the buddy backup and the point to point messages. For each buddy backup message (which is a point to point message), there is a corresponding acknowledgement (ACK) message (also a point to point message). Thus, the estimated number of redistribution messages is the number of point to point messages minus 2 times the number of buddy backup messages. Because messages can be “piggy-backed”, i.e., more than one ACK can be sent in a point to point message, the number of redistribution messages can only be regarded as a calculated estimate.
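In sketch form (illustrative Python; the message counts are made up):

    # Estimated redistribution messages: each buddy backup message accounts
    # for itself plus one ACK among the point to point messages, so subtract
    # two per buddy backup. Piggy-backed ACKs make this only an estimate.
    def estimated_redistribution(point_to_point_msgs, buddy_backup_msgs):
        return max(point_to_point_msgs - 2 * buddy_backup_msgs, 0)

    print(estimated_redistribution(point_to_point_msgs=120000,
                                   buddy_backup_msgs=35000))   # 50000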

Buddy Backup Complete and Partial Segments
Buddy backup messages are included in the count of point to point messages. Number of I/Os & Mbytes transferred; also KB per I/O. Broken down by:

- Table data and CIs for table data
- TJs and PJs

Buddy Backup Flushes³
Buddy backup flushes are included in the count of point to point messages. Number of I/Os & Mbytes transferred; also KB per I/O. Broken down by:

- Table data and CIs for table data
- TJs and PJs

Bynet Merges
Number of rows returned for a SQL SELECT statement. This does not include data sent back for FastExport, nor for archiving of data.

³ In a steady state, table data buddy backup flushes should be approximately the same in number as complete segment backups. This column is generally not important unless there is a specific performance problem that cannot be understood or explained. Then any discrepancy between the flushes and the complete segments can point to an internal problem with the system.


4. Host I/O (ResusageShst Data) - Vproc Level

The Host I/O data presents a picture of when data is loaded and how much data is loaded via either a mainframe connection or LAN connection. It also shows when a large volume of data is sent back to a host, usually for archiving of data.

Data Read From and Written To a Host (mainframe channel connection or LAN connection)
Number of I/Os & Mbytes transferred; also KB per I/O.

Sample Charts of Typical and Problem Situations

The charts described below appeared on the succeeding pages of the original report; the chart images are not reproduced here. While many of the samples are oriented toward problem identification, they can still be considered part of a baseline profile.

1. Parallel Efficiency Average / Max Node CPU Chart

The chart shows the system is at maximum 100% utilization on 10/13, from 23:00 to 10:00 the next morning. The parallel efficiency at this time is also at 100%, an excellent situation. However, this does not say anything about efficient use of the system. One needs to check the next chart, which shows OS as Pct of CPU.

On 10/13, from 0:00 to 6:30, there is significant skewing even though the maximum utilization is less than 40%. This may or may not be a problem.

2. OS as PCT of CPU vs. AVG CPU Busy Chart (OS % CPU Problem)

During the peak CPU utilization, the OS as % of CPU is extremely low for a lengthy period of time. Below 20% for 11 hours indicates there is an application problem, usually a large number of duplicate row checks or a very large product join. (This turned out to be a duplicate row check problem where an Insert into a table was executing and the choice of primary index was a poor one.)

3. OS vs. DBS CPU Busy Chart (Different view of OS % CPU Problem)

During the peak period, the sum of OS % busy and DBS % busy adds up to 100%. The chart shows the DBS was executing at about 90% busy and the OS at only about 10% busy. The lengthy duration of the DBS executing at such a high percentage indicates an application problem, usually a large number of duplicate row checks or a very large product join. (This turned out to be a duplicate row check problem where an Insert into a table was executing and the choice of primary index was a poor one.)

4. Poor Parallel Efficiency Average / Max Node CPU Chart

On 8/25, from midnight through 7:00, there was extreme skewing where one node was running at 100% and the other nodes averaged as low as 20%. This definitely needs to be investigated.

On 8/26, for 2 hours (21:00 and 22:00), and on 8/27 from 1:00 through 7:00, the valley-shaped area shows the max CPU busy a little over 25% with an average around 6%. This is an indication of a single node (out of 4) with a single CPU running at 100% busy and all other CPUs, in that node and in the other nodes, running virtually at 0%. (With one CPU in a node at 100% and the 3 other CPUs at 0%, the overall node CPU average would be 25%. With the one node at 25% and the other nodes at 0%, the overall average for all nodes would be 6.25%.) Again, this is a problem that needs to be investigated.

10

Page 11: Establishing+Profile

5. Avg CPU Busy vs. CPU Idle Waiting for I/O Chart

The upper area of the chart indicates the CPU was idle waiting for disk or Bynet I/O completion. The I/O wait could be due to a real disk I/O bottleneck or simply not having enough jobs in the system. When there are not enough jobs in the system to keep the CPUs busy, the I/O wait could be due to disk I/O or Bynet buddy backup I/O.

6. I/O Wait vs. Disk I/Os Chart

The chart shows high spikes of I/O wait without corresponding spikes in disk I/Os. (A better chart would have shown Mbytes transferred per second.) By and large, the I/O wait is not due to an I/O throughput bottleneck.

7. CPU Idle Waiting for I/O vs. Buddy Backup Bynet I/O Chart

The chart shows a high correlation between the occurrence of the I/O wait and the buddy backup traffic. This indicates the users’ workload was doing table updates throughout most of the time period. However, this does not mean the buddy backup was at a throughput bottleneck. The I/O wait looks more like too few jobs in the system. If the data is available, the number of concurrent active sessions should also be looked at for this time period.

8. Average & Min Free Memory Available vs. Total Page/Swap IOs

For most of the time, available free memory is over 150 Mbytes. On occasion, it drops so low that it causes a high number of page/swap I/Os. This looks like a case where the FSG Cache Percent should be raised so as not to leave so much free memory unused. Also, the UNIX memory parameters LOTSFREE, DESFREE and MINFREE should be raised to reduce the risk of UNIX panics.

9. Concurrent Sessions Chart

The Concurrent Sessions Chart shows the average and max node CPU busy at 10-minute logging periods, and the number of concurrent active sessions at a less-than-a-minute frequency. The beginning of the chart shows the number of concurrent active sessions at 10 (scale on the right-hand side of the chart), fluctuating to about 20 concurrent sessions for an hour or so, then picking up to 40, dropping down, and going back to 40. At approximately 17:22, the concurrent load on the system dropped from about 40 down to 12 while the CPU still remained at 100% busy. This helps to explain why response time is longer at different times of the day even though the CPU is 100% busy throughout most of the time period.
