Diagnosing Oracle® Database Performance on AIX® Using IBM® NMON and Oracle Statspack
How STATSPACK Was Used to Solve Common Performance Issues Brian Hitchcock OCP 8, 8i, 9i DBA Sun...
-
Upload
andra-haynes -
Category
Documents
-
view
219 -
download
0
Transcript of How STATSPACK Was Used to Solve Common Performance Issues Brian Hitchcock OCP 8, 8i, 9i DBA Sun...
How STATSPACK Was Used to Solve Common Performance Issues
Brian HitchcockOCP 8, 8i, 9i DBA
Sun Microsystems
Dedicated to Pramitha Chowrira, the Goddess of the Rockies,Mike Waldron, the Student who became the Master,Sheryl Driscoll, who leads purely for the glory andAnn Bischoff, who took me in trade for a developer...
NoCOUG
Brian Hitchcock November 13, 2003 Page 1
NoCOUG
Brian Hitchcock November 13, 2003 Page 2
What STATSPACK Is
Set of SQL and PL/SQL Collects performance data from v$ tables Stores collected data in separate tables Each collection of data is a ‘snapshot’ Reports deltas in data between snapshots Supports ad-hoc SQL queries of the snapshot
data
NoCOUG
Brian Hitchcock November 13, 2003 Page 3
STATSPACK Details
Works for 8.1.7 onwards Gathers data for a single instance Snapshot levels
– Determine how much data is collected– Defaults are fine
Snapshot interval of 15 minutes suggested Long report periods miss transient events
– Reports over an instance restart are not valid
NoCOUG
Brian Hitchcock November 13, 2003 Page 4
STATSPACK -- Good
Free (very cool!) Gathers a wide range of data
– You don’t know what you’re looking for at first– Root cause isn’t usually obvious
Standard process to collect performance data– Gathers the same data on all instances– Easy to share with vendors, support groups
NoCOUG
Brian Hitchcock November 13, 2003 Page 5
STATSPACK -- Not Perfect
Gathers a wide range of data– Ocean of data– Any information?– Easy to get lost
Does not tell you what the problem is– Shows what is happening in the instance– You need to figure out if this is a problem or not
Does not tell you the solution Does not tell you that you are done tuning...
NoCOUG
Brian Hitchcock November 13, 2003 Page 6
How to Interpret Output?
Requires experience with your system– No single way to analyze output– Must have history of your system– Look for possible problem areas– Trial and error to change problem behavior– Only you can tell if you have a performance
problem
NoCOUG
Brian Hitchcock November 13, 2003 Page 7
STATSPACK Report Sections– Instance Summary, Efficiency– Top 5 Wait Events, Wait Events– SQL Ordered by Gets, Reads, Executions– Instance Activity Stats– Tablespace IO Stats Ordered by IOs, Tblspc-file– Buffer Pool Statistics– Rollback Segment Stats, Storage– Latch Activity, Sleep, Miss Sources– Dictionary Cache Stats– Library Cache Activity– SGA Memory Summary, Breakdown– init.ora Parameters
NoCOUG
Brian Hitchcock November 13, 2003 Page 8
Documentation of output No comprehensive documentation
– Oracle 8i Reference Appendix A -- Wait Events defined Appendix B -- Enqueue Names defined Appendix C -- Statistics Descriptions
– Database Performance Guide and Ref 9.0.1 Chapters 21-23, Supplied packages, how to use
– $ORACLE_HOME/rdbms/admin/spdoc.txt– ORACLE High-Performance Tuning with STATSPACK, Donald K.
Burleson, Oracle Press ISBN 0-07-213378-3
No explanation of – What output means for your system
NoCOUG
Brian Hitchcock November 13, 2003 Page 9
Configuration Used
Oracle 8.1.7.2 Snapshots every 15 minutes
– snapshots taken continuously
Default STATSPACK snapshot ‘level’ Application loads and analyzes web site click
stream data– Lots of data– More data all the time– We don’t know what vendor code looks like
NoCOUG
Brian Hitchcock November 13, 2003 Page 10
Actual Use
4 Performance issues in 2002– Case 1) Reports Running Slow
STATSPACK output didn’t show the problem
– Case 2) Vendor Demo Slow STATSPACK output didn’t show the problem
– Case 3) Data Load Slow STATSPACK output led to 18x speedup (1800%)
– Case 4) Data Load Time Varies STATSPACK output led to the root cause
NoCOUG
Brian Hitchcock November 13, 2003 Page 11
Case 1) Reports Running Slow Vendor code allows users to setup reports
– Vendor code generates SQL for report
Long run interferes with next day’s data load STATSPACK captures SQL
– Generate explain plan(s)
Report SQL doesn’t generate where clause properly to use partition pruning
Vendor refuses to change their code We simply removed the reports
– Performance issue ‘resolved’
NoCOUG
Brian Hitchcock November 13, 2003 Page 12
Case 2) Vendor Demo Slow
Due to issues like Case 1)– New vendor sets up demo, data load slow– Data load runs twice as fast at vendor– Statspack output doesn’t show anything obvious– Compare configuration of vendor and our dbs– Vendor has only one redo log file per group– We had two redo log files per group– We drop one file per group, performance issue
resolved
NoCOUG
Brian Hitchcock November 13, 2003 Page 13
Case 3) Data Load Slow First time loading new type of web log data No baseline to compare with
– Classical performance tuning doesn’t always apply to the real world
Data load so slow no time for daily reporting Must run faster or the data won’t be loaded We don’t know if this load will run faster Do we have a ‘performance’ issue?
– Yes, data load must run faster to be useful– No, perhaps this is as fast as it can be...
NoCOUG
Brian Hitchcock November 13, 2003 Page 14
Case 3) Data Load Slow
SQL -- Highest Gets per Exec
SQL ordered by Gets for DB: BHDATA04 Instance: BHDATA04 Snaps: 4426 -4427-> End Buffer Gets Threshold: 10000-> Note that resources reported for PL/SQL includes the resources used by all SQL statements called within the PL/SQL code. As individual SQL statements are also reported, it is possible and valid for the summed total % to exceed 100
Buffer Gets Executions Gets per Exec % Total Hash Value--------------- ------------ -------------- ------- ------------ 634,304 68 9,328.0 97.5 3948948039SELECT t526.keyvalueid FROM bh_lqueryvalue t526 WHERE t526.queryinfoid = :ph0 ORDER BY t526.keyvalueid ASC
NoCOUG
Brian Hitchcock November 13, 2003 Page 15
Bad SQL? SQL shouldn’t cost much
– Select looking for one row– Table has two indexes– Explain plan shows ‘index full scan’– Should show ‘index range scan’– Explain plan with hint to force one index– Verify cost of each index– Optimizer is choosing wrong index!
Drop the costly index!– Indexes added by vendor ‘to be safe’...
NoCOUG
Brian Hitchcock November 13, 2003 Page 16
Explain Plan Force Index1
SQL> truncate table plan_table;
Table truncated.
SQL> explain plan set Statement_Id = 'TEST' forSELECT /*+ INDEX(t526 X_LQRYVL_QUERYIDKYVL) */t526.keyvalueid FROM bh_lqueryvalue t526WHERE t526.queryinfoid = 100ORDER BY t526.keyvalueid ASC; 2 3 4 5
Explained.
SQL> @$ORACLE_HOME/rdbms/admin/utlxpls
Plan Table--------------------------------------------------------------------------------| Operation | Name | Rows | Bytes| Cost | Pstart| Pstop |--------------------------------------------------------------------------------| SELECT STATEMENT | | 3 | 24 | 1 | | || INDEX RANGE SCAN |X_LQRYVL_ | 3 | 24 | 3 | | |--------------------------------------------------------------------------------
Cost 4
NoCOUG
Brian Hitchcock November 13, 2003 Page 17
Explain Plan Force Index2
SQL> explain plan set Statement_Id = 'TEST' forSELECT /*+ INDEX(t526 X_LQYVAL_KYVLQRYID) */t526.keyvalueid FROM bh_lqueryvalue t526WHERE t526.queryinfoid = 100ORDER BY t526.keyvalueid ASC; 2 3 4 5
Explained.
SQL> @$ORACLE_HOME/rdbms/admin/utlxpls
Plan Table--------------------------------------------------------------------------------| Operation | Name | Rows | Bytes| Cost | Pstart| Pstop |--------------------------------------------------------------------------------| SELECT STATEMENT | | 3 | 24 | 2334 | | || INDEX FULL SCAN |X_LQYVAL_ | 3 | 24 | 9333 | | |--------------------------------------------------------------------------------
SQL>
Cost 11,667
NoCOUG
Brian Hitchcock November 13, 2003 Page 18
Solution
After dropping costly index– data load time was 18 hours, became 1 hour– 18:1 improvement (1800%)
Why did optimizer choose wrong index?– No idea, Oracle requested running 18 hour data load
to gather instance data– Business users said “NO!”– Indexes created by vendor, no need for both indexes
Know when to quit tuning!
NoCOUG
Brian Hitchcock November 13, 2003 Page 19
What About Wait Events?
Popular DBAs– Wait Events are all that matters
I want (desperately) to be popular too…– Return to Case 3)– Examine Top 5 Wait Events section– Try to understand what is causing the wait time
NoCOUG
Brian Hitchcock November 13, 2003 Page 20
Case 3) Wait Events
Top 5 Wait Events
Top 5 Wait Events~~~~~~~~~~~~~~~~~ Wait % TotalEvent Waits Time (cs) Wt Time-------------------------------------------- ------------ ------------ -------PX Deq: Execution Msg 335 1,156 45.30latch free 1,740 522 20.45control file parallel write 88 329 12.89db file sequential read 253 323 12.66log file parallel write 2 42 1.65 -------------------------------------------------------------
PX -- parallel query issues?– Contact Oracle Tech Support
NoCOUG
Brian Hitchcock November 13, 2003 Page 21
Ask the Experts
Oracle Tech Support– Many wait events in STATSPACK output should
be ignored (a bug perhaps? Or an RFE?)– Requests the full STATSPACK report– Tells me that the latch free wait event must be
addressed– set session_cached_cursors = 100– Performance improvement will be ‘significant’– Data load now takes 19 hours (about 10% worse)
NoCOUG
Brian Hitchcock November 13, 2003 Page 22
What Happened? How could the experts miss the bad SQL? Wait events are important
– If the total wait time is the largest problem
In this case– Bad SQL dominated the overall run time
Wait event analysis– Many events should be ignored– Need to determine how much of total run time is due to
wait events only
Go back and fix the SQL issue
NoCOUG
Brian Hitchcock November 13, 2003 Page 23
Total CPU Time
Total CPU time is 27475 10s of milliseconds– 10 milliseconds is 1 centisecond (cs) = 0.01 sec– 27457 --> 27475 cs = 274.75 seconds– Report interval was 4.60 minutes (276 seconds)– confused? How many cs left until Happy Hour?
From Instance Activity StatsInstance Activity Stats for DB: BHDATA04 Instance: BHDATA04 Snaps: 4426 -4427
Statistic Total per Second per Trans--------------------------------- ---------------- ------------ ------------CPU used by this session 27,441 99.4 27,441.0CPU used when call started 27,475 99.6 27,475.0………
NoCOUG
Brian Hitchcock November 13, 2003 Page 24
Total Wait Time Look at all Wait Events
Wait Events for DB: BHDATA04 Instance: BHDATA04 Snaps: 4426 -4427-> cs - centisecond - 100th of a second-> ms - millisecond - 1000th of a second-> ordered by wait time desc, waits desc (idle events last) Avg Total Wait wait WaitsEvent Waits Timeouts Time (cs) (ms) /txn---------------------------- ------------ ---------- ----------- ------ ------PX Deq: Execution Msg 335 0 1,156 35 335.0latch free 1,740 1,046 522 3 ######control file parallel write 88 0 329 37 88.0db file sequential read 253 0 323 13 253.0log file parallel write 2 0 42 210 2.0enqueue 61 0 40 7 61.0refresh controlfile command 602 0 37 1 602.0PX Deq: Msg Fragment 64 0 29 5 64.0PX Deq: Parse Reply 59 0 26 4 59.0control file sequential read 2,673 0 16 0 ######log file sync 1 0 15 150 1.0PX Deq: Signal ACK 2 2 11 55 2.0PX Deq: Join ACK 57 0 3 1 57.0PX Deq: Execute Reply 5 0 3 6 5.0SQL*Net more data to client 37 0 0 0 37.0file open 6 0 0 0 6.0db file parallel write 4 0 0 0 4.0PX Idle Wait 4,316 4,287 877,818 2034 ######SQL*Net message from client 2,213 0 27,615 125 ######SQL*Net message to client 2,214 0 1 0 ###### -------------------------------------------------------------
-----------> Total Wait Time 907997 cs
NoCOUG
Brian Hitchcock November 13, 2003 Page 25
Real Total Wait Time
Remove idle events– MetaLink Note:191103.1 PQ Wait Events– STATSPACK report should filter out idle events– Database Performance Guide and Ref 9.0.1
Explains more about this ‘feature’
NoCOUG
Brian Hitchcock November 13, 2003 Page 26
Real Total Wait Time Idle Events
Wait Events for DB: BHDATA04 Instance: BHDATA04 Snaps: 4426 -4427-> cs - centisecond - 100th of a second-> ms - millisecond - 1000th of a second-> ordered by wait time desc, waits desc (idle events last) Avg Total Wait wait WaitsEvent Waits Timeouts Time (cs) (ms) /txn---------------------------- ------------ ---------- ----------- ------ ------PX Deq: Execution Msg 335 0 1,156 35 335.0 <----- removelatch free 1,740 1,046 522 3 ######control file parallel write 88 0 329 37 88.0db file sequential read 253 0 323 13 253.0log file parallel write 2 0 42 210 2.0enqueue 61 0 40 7 61.0refresh controlfile command 602 0 37 1 602.0PX Deq: Msg Fragment 64 0 29 5 64.0PX Deq: Parse Reply 59 0 26 4 59.0control file sequential read 2,673 0 16 0 ######log file sync 1 0 15 150 1.0PX Deq: Signal ACK 2 2 11 55 2.0 <----- removePX Deq: Join ACK 57 0 3 1 57.0PX Deq: Execute Reply 5 0 3 6 5.0SQL*Net more data to client 37 0 0 0 37.0file open 6 0 0 0 6.0db file parallel write 4 0 0 0 4.0PX Idle Wait 4,316 4,287 877,818 2034 ###### <----- removeSQL*Net message from client 2,213 0 27,615 125 ###### <----- removeSQL*Net message to client 2,214 0 1 0 ###### -------------------------------------------------------------
----------> Total Wait Time 1397 cs
NoCOUG
Brian Hitchcock November 13, 2003 Page 27
Total Response Time
Total CPU Time + Total Wait Time– 27475 cs + 1397 cs = 28872 cs
Total Wait Time– 1397/28872 = 0.05– 5% of Total Response Time
Wait Time was never an issue! If you don’t remove the idle events
– 907997/(27475+907997) = 97%
NoCOUG
Brian Hitchcock November 13, 2003 Page 28
Bad SQL Rules Slow data load time
– Time due to Bad SQL– Time due All Others
Including Wait Events
Bad SQL
All Others
NoCOUG
Brian Hitchcock November 13, 2003 Page 29
Case 4) Data Load Time Varies
Data Load Time– Normal 6.5 hours, Long 16 hours– Varies randomly, no pattern
Generate STATSPACK report– Normal, Long– Compare reports– Look for differences between reports
Tablespace IO Stats Section– Normal -- 11 tablespaces accessed– Long -- 74 tablespaces accessed
NoCOUG
Brian Hitchcock November 13, 2003 Page 30
Problem and Solution
Vendor data load shouldn’t touch all tables What process would access all tables? Production db supported by another group
– We aren’t allowed to connect as ‘oracle’– Can’t see what they might be running (cron?)
Turns out– Production DBAs decided we needed full exports– We weren’t notified– Stop the exports, performance issue goes away!
NoCOUG
Brian Hitchcock November 13, 2003 Page 31
What About Cache Hit Rates? Back to the subject of experts
– Remember when it was cool to discuss hit rates?
For Case 4)– Compute buffer cache hit ratio from tables– Tables larger than physical memory– Can’t have all pages in memory at once– Buffer cache hit ratio won’t be 100%
Even if we had 100%– Bad SQL (index) was the real problem– Buffer cache hit ratio wasn’t relevant
NoCOUG
Brian Hitchcock November 13, 2003 Page 32
Select Buffer Cache Hit Ratio
00.20.40.60.8
11.2
Snapshot Id
Hit
Rati
o
Case 4)
Data Load without Exports running
NoCOUG
Brian Hitchcock November 13, 2003 Page 33
Total Wait Time? For Case 4), 15 minute report interval
– Wait Time is 28% of total time
Top 5 Wait Events~~~~~~~~~~~~~~~~~ Wait % TotalEvent Waits Time (cs) Wt Time--------------------------------------- ------------ ------------ -------PX Deq: Table Q Normal 495,566 1,017,115 30.12 <-- removeslave wait 25,580 744,985 22.06PX Deq Credit: send blkd 406,262 678,556 20.10 <-- removePX Deq: Execution Msg 33,009 438,309 12.98 <-- removelatch free 42,575 134,840 3.99 ------------------------------------------------------------- 879825 cs
total time = 2237915 + 879825 = 3117740 Wait time is 879825/3117740 = 28%
Instance Activity Stats for DB: BHDATA01 Instance: BHDATA01 Snaps: 13817 -1381
Statistic Total per Second per Trans--------------------------------- ---------------- ------------ ------------CPU used by this session 1,422,921 1,556.8 154.4CPU used when call started 2,237,915 2,448.5 242.8 <-----
NoCOUG
Brian Hitchcock November 13, 2003 Page 34
Review For the 4 issues we had, STATSPACK output
– Was useful for all 4 Provided standard set of data for all involved
– Fixed 2 issues Provided the data that led to the root cause Verified that the fix was working
Performance improvements were substantial– Tuning process much faster with STATSPACK– Same process worked for all 4 issues– Decided not to look for further improvements
Wait Time analysis might be useful...
NoCOUG
Brian Hitchcock November 13, 2003 Page 35
oraperf.com Analyzer
Website oraperf.com– Submit STATSPACK report– Analyzer reviews report– Generates detailed analysis
CPU time Wait time
– Gives specific advice– Not perfect, but it is fast and free!
Has same issues with idle wait events as STATSPACK report
NoCOUG
Brian Hitchcock November 13, 2003 Page 36
oraperf.com Who or what is oraperf.com?
– From the website...
Oraperf.com is run by Anjo Kolk. Anjo has worked for over 16 years at Oracle
(1985-2001). While at Oracle he worked in different countries and different departments.
Many people generate utlbstat/utlestats and statspack reports, but don't know how to interpret the data. People that do look at these are reports also mostly looking at the wrong information and end up making the wrong tuning decisions. That is why the reports are analyzed based on the YAPP method. The YAPP method will show what component of the total response time should be tuned first.
YAPP-Method -- Yet Another Performance Profiling Method
http://oraperf.com/download/yapp_anjo_kolk.pdf
NoCOUG
Brian Hitchcock November 13, 2003 Page 37
oraperf.com -- Case 3)
Upload report from slow data load Analyzer shows
– Response time 91.63% CPU Time 8.37% Wait Time
– Advice? Reduce the number of buffer gets or executions
Wait time– Matters only as a % of total response time
NoCOUG
Brian Hitchcock November 13, 2003 Page 38
oraperf.com -- Case 3)
Upload report from fast data load Analyzer shows
– Response time 5.16% CPU Time 94.84% Wait Time
– Advice? Tune PX Deq: Execution Msg event
But this is an idle event...
Non-idle wait time is only about 25% total time
NoCOUG
Brian Hitchcock November 13, 2003 Page 39
oraperf.com -- Case 4)
Conclusion– oraperf.com analyzer provides another tool for
performance tuning– Well worth using if only to compute
Response Time CPU Time Wait Time
Check for idle wait events...
NoCOUG
Brian Hitchcock November 13, 2003 Page 40
No Excuses
Install STATSPACK Generate two snapshots Generate standard report Upload to oraperf.com Review advice Fast, free performance analysis!
NoCOUG
Brian Hitchcock November 13, 2003 Page 41
Installing STATSPACK
Create separate tablespace Create PERFSTAT user Execute SQL script to create tables Setup job to execute snapshots Setup process to purge data over time Set timed_statistics = TRUE
– Not required, but needed to get wait time data
NoCOUG
Brian Hitchcock November 13, 2003 Page 42
Installing STATSPACK
As user ‘SYS’
create tablespace perfstat datafile'/xxx/xxx/perfstat_01.dbf' size 500M;
cd $ORACLE_HOME/rdbms/adminsqlplus [email protected]
Enter value for default_tablespace: perfstatEnter value for temporary_tablespace: temp
NoCOUG
Brian Hitchcock November 13, 2003 Page 43
Generate Standard Report
Report SQL supplied by Oracle
sqlplus perfstat/perfstat@<SID>execute statspack.snapexecute statspack.snap
sqlplus perfstat/perfstat@<SID>@$ORACLE_HOME/rdbms/admin/spreport.sqlEnter value for begin_snap: 1Enter value for end_snap: 2Enter value for report_name: testing
NoCOUG
Brian Hitchcock November 13, 2003 Page 44
Select STATSPACK Data
Query the tables directlyselect to_char(snap_time,'yyyy-mm-dd HH24') mydate,new.name buffer_pool_name, (((new.consistent_gets-old.consistent_gets)+(new.db_block_gets-old.db_block_gets))-(new.physical_reads-old.physical_reads))/((new.consistent_gets-old.consistent_gets)+(new.db_block_gets-old.db_block_gets)) bhrfromperfstat.stats$buffer_pool_statistics old,perfstat.stats$buffer_pool_statistics new,perfstat.stats$snapshot snwherenew.snap_id > 13125and new.snap_id < 13149andnew.name = old.nameandnew.snap_id = sn.snap_idandold.snap_id = sn.snap_id-1;
Based on SQL fromORACLE High-Performance Tuning with STATSPACKDonald K. BurlesonOracle Press ISBN 0-07-213378-3
NoCOUG
Brian Hitchcock November 13, 2003 Page 45
Buffer Cache Hit Ratio Case 4)
Output of SQL on previous slideyr. mo dy Hr BUFFER_POOL_NAME BHR------------- -------------------- -----2002-09-17 02 DEFAULT 1.002002-09-17 02 DEFAULT 1.002002-09-17 02 DEFAULT .942002-09-17 02 DEFAULT .872002-09-17 03 DEFAULT .902002-09-17 03 DEFAULT .932002-09-17 03 DEFAULT .972002-09-17 03 DEFAULT .822002-09-17 04 DEFAULT .782002-09-17 04 DEFAULT .782002-09-17 04 DEFAULT .762002-09-17 04 DEFAULT .802002-09-17 05 DEFAULT .692002-09-17 05 DEFAULT .732002-09-17 05 DEFAULT .762002-09-17 05 DEFAULT .692002-09-17 06 DEFAULT .812002-09-17 06 DEFAULT 1.002002-09-17 06 DEFAULT 1.002002-09-17 06 DEFAULT 1.002002-09-17 07 DEFAULT 1.002002-09-17 07 DEFAULT 1.002002-09-17 07 DEFAULT 1.00
NoCOUG
Brian Hitchcock November 13, 2003 Page 46
Space Used
Snapshot size varies with– Number of tablespaces– Number of SQL statements captured
Db1 21 tablespaces --> 0.15 Mb/snapshot Db2 376 tablespaces --> 0.37 Mb/snapshot Assuming a snapshot every 15 minutes
– 96 snapshots per day– Db1 --> 14.4 Mb/day– Db2 --> 35.6 Mb/day
NoCOUG
Brian Hitchcock November 13, 2003 Page 47
Removing Snapshot Data
Oracle supplied SQL– SQL removes snapshot data for a range of
snapshot id numbers
Example sqlplus perfstat/perfstat@<SID>@$ORACLE_HOME/rdbms/admin/sppurge...... (listing of all existing snapshots)...Specify the Lo Snap Id and Hi Snap Id range to purge~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Enter value for losnapid: 4001Using 4001 for lower bound.
Enter value for hisnapid: 5000Using 5000 for upper bound.
Deleting snapshots 4001 - 5000.
commit;Note: large deletes may fill rollback segments
NoCOUG
Brian Hitchcock November 13, 2003 Page 48
Summary
STATSPACK– Free, easy to install, easy to run– Output can be very useful or confusing– Real-world use has resulted in big performance
gains– Useful for all instances– Standard way to gather performance data