Near-Real-Time IT-Control Charts
CMG’09
Igor Trubin, PhD, SunTrust Bank
http://www.itrubin.blogspot.com/
Near-Real-Time IT-Control Charts
CMG’09 2
Introduction: Agenda
o Where and why the Control Chart is used: review of some systems performance tools on the market that build and use control charts.
o What is the Control Chart? - A little bit of theory and history.
o How SEDS (Statistical Exception Detection System) uses it - MASF charts vs. SPC ones.
o IT-Chart concept. The best control chart type for IT data visualization.
o Long gallery of charts already published in the CMG papers.
o Plus some new ones with explanations of how to read them.
o How to build a Control Chart: using Excel for interactive analysis and R to automate control chart generation, with a live demonstration of the technique.
CMG’09 3
Where the Control Chart is used in IT
o BMC Software (www.bmc.com): MASF technique in Performance Analysis for Servers and Performance Assurance tools; BMC ProactiveNet Analytics (http://documents.bmc.com/products/documents/49/13/84913/84913.pdf)
o Fujitsu (www.fujitsu.com): ACTIVE BASELINING Technique (www.fujitsu.com/downloads/AU/active_baselining_in_passive_data_environments.pdf)
o McAfee (www.mcafee.com): Anomaly-Based Intrusion Detection (www.mcafee.com/us/local_content/white_papers/wp_ddt_anomaly.pdf)
o BEZ systems (www.bez.com): for Oracle and Teradata performance (www.wmoug.org/bezPresentation.pdf)
o Integrien Alive™ (http://www.integrien.com/)
o Netuitive (http://netuitive.com/)
o Firescope (http://www.firescope.com/default.htm)
o Managed Objects (http://managedobjects.com/)
o Six Sigma (http://www.isixsigma.com/st/control_charts/)
o SEDS (Statistical Exception Detection System) (http://www.itrubin.blogspot.com/)
CMG’09 4
Why the Control Chart is used for Capacity Management
The Control Chart can uncover hidden trends and patterns in systems performance data.
The Control Chart is a truly proactive tool and can capture unusual resource usage before something breaks.
The Control Chart is the best base-lining tool and can show how actual data deviate from the historical baseline.
The Control Chart provides dynamic thresholds: no need for manual settings.
The Control Chart is the tool to detect workload pathologies (run-aways, memory leaks and others).
CMG’09 5
Control Chart Definitions
o The control chart, also known as the Shewhart chart or process-behavior chart, is a statistical process control tool used to determine whether a manufacturing or business process is in a state of statistical control.
o A graphical tool for monitoring changes that occur within a process, by distinguishing variation that is inherent in the process (common cause) from variation that yields a change to the process (special cause). This change may be a single point or a series of points in time; each is a signal that something is different from what was previously observed and measured.
CMG’09 6
What the Control Chart is
Chart details
o Points representing measurements of a quality characteristic in samples taken from the process at different times [the data].
o A centre line, drawn at the process characteristic mean, which is calculated from the data.
o Upper (UCL) and lower (LCL) control limits (sometimes called "natural process limits") that indicate the threshold at which the process output is considered statistically 'unlikely'.
CMG’09 7
What the Control Chart is (continued)
Choice of limits
o UCL = Mean + 3σ; LCL = Mean − 3σ; Centerline = Mean (or Average), where σ is the standard deviation. (The reason that 3σ control limits balance the risk of error is that, for normally distributed data, data points will fall inside the 3σ limits 99.7% of the time when a process is in control.)
o UCL = 95th percentile; LCL = 5th percentile; Centerline = 50th percentile. (A percentile or centile is the value of a variable below which a certain percent of observations fall.) That choice is good if the data is far from a normal distribution.
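The two limit choices on this slide can be compared side by side. The following is a minimal Python sketch, not the SEDS implementation; the function names and the sample CPU values are made up for illustration.

```python
import statistics

def sigma_limits(samples, k=3.0):
    """Return (LCL, centerline, UCL) as mean -/+ k sample standard deviations."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    return mean - k * sd, mean, mean + k * sd

def percentile_limits(samples, low=5, high=95):
    """Return (LCL, centerline, UCL) as the low, 50th and high percentiles."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[low - 1], cuts[49], cuts[high - 1]

# Hypothetical hourly CPU utilization values with one spike.
cpu = [22, 25, 24, 27, 23, 26, 25, 24, 90, 25, 23, 26]
print("sigma limits:     ", sigma_limits(cpu))
print("percentile limits:", percentile_limits(cpu))
```

With skewed data like this, a single spike inflates the standard deviation and widens the sigma limits, while the percentile limits stay closer to the bulk of the data.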
CMG’09 8
What the Control Chart is (continued)
Special Types of Control Charts
o There are X-bar, R, S, U, Np, P and C control charts.
o X-bar is the most common and is used in Capacity Management. In this chart the sample means are plotted in order to control the mean value of a variable.
o The C control chart (Poisson or counts) plots the number of defectives and is sensitive to changes in the number of defectives in the measurement process. In our area it could be used to control workload pathologies (e.g. run-aways, memory leaks and so on). For the C-chart the control limits are calculated as LCL = c − 3√c and UCL = c + 3√c, where c is the mean number of defectives. Zero serves as a lower bound on the LCL.
o Other types are more appropriate for the mechanical engineering area.
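As a worked illustration of the C-chart formula, here is a small Python sketch. The function name and the daily counts are hypothetical; only the formula itself comes from the slide.

```python
import math

def c_chart_limits(counts):
    """Return (LCL, c, UCL) for a C-chart, where c is the mean count."""
    c = sum(counts) / len(counts)
    spread = 3 * math.sqrt(c)
    # Zero serves as a lower bound on the LCL.
    return max(0.0, c - spread), c, c + spread

# Hypothetical daily counts of run-away processes on one server.
daily_runaways = [3, 5, 4, 6, 2, 4, 5, 3]
lcl, c, ucl = c_chart_limits(daily_runaways)
print(f"LCL={lcl:.2f}  c={c:.2f}  UCL={ucl:.2f}")
```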
CMG’09 9
MASF, SPC control and histogram charts comparison
MASF: Reference set vs. Actual data.
All three charts demonstrate different views of exceptions for CPU utilization that occurred at 8 am.
As opposed to the classical univariate X-bar control chart, the MASF chart is most useful for showing a 24-hour (or 7x24) profile of resource usage, and it is actually a multivariate Control Chart.
NOTE: Limits might need to be cut at the natural 100% or 0% thresholds.
CMG’09 10
How close is the data to normal distribution … for global CPU utilization on the same Unix server?
Example of 6-month hourly histograms for the global CPU utilization exception on an HP rp7400/550MHz/6-way server.
Reference set grouped by hours
CMG’09 11
Types of Control Charts against performance data
o Classical SPC type: daily or hourly aggregated data (SEDS, BMC Visualizer) or raw granular data (Integrien, for near-real-time data alerting)
o 24 hour profile for Global or application level data (MASF type) (SEDS, BMC)
o Weekly profile of daily data (SEDS)
o Weekly profile of hourly data (IT-chart, main SEDS tool): the most efficient way to visualize IT systems performance
o Monthly profile of daily data
CMG’09 12
Control Chart and other types of graphs
The control chart is one of the possible graphical tools, and one of the most powerful, but other types of charts could be used:
Top Bar Chart
CONTROL CHART
Trend Forecast Charts
CASE: SEDS detected that a VM server was moved to another host by v-motion
CMG’09 13
IT-Chart Concept
Radar screen analogy. The refresh border (line) separates the current period (week) data from the previous period (week). Refreshing speed:
o Day: every morning the border shifts by 24 hours.
o Hour: an hourly refreshed control chart; the border moves every hour. Good for near-real-time monitoring.
o Minutes? Or seconds, like real radar refreshing? Could be a capacity problem…
The weekly IT-control chart is the best, as it shows weekend, night and even lunch time seasonality.
CMG’09 14
How to read weekly IT-charts
This is the SEDS view comparing the last 7 days (actual) with the last 6 months of baseline (historical) data.
The green curve is the hourly average (mean) for a particular weekday and hour over the 6-month history. Red is the UCL; blue is the LCL.
The black curve is the actual hourly data. To the left of the vertical line is THIS WEEK's data up to yesterday; to the right is last week's data.
CMG’09 15
How the weekly IT-chart is built
Take one week of recent data…
CMG’09 17
How the weekly IT-chart is built
Take one week of recent data and put that in weekly profile form;
CMG’09 19
How the weekly IT-chart is built
o Take one week of recent data and put it in weekly profile form.
o Take some representative historical reference data, set it as a baseline, and then compare it with the most recent actual data.
o If the actual data exceeds some statistical thresholds (e.g. Upper (UCL) and Lower (LCL) Control Limits at the mean plus/minus 3 standard deviations, or at some percentiles), generate an exception (alert via e-mail) and build a control chart.
NOTE: the chart also predicts what is supposed to happen tomorrow.
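The steps for building the weekly IT-chart can be sketched in Python as follows. This is an illustrative outline only, assuming the data arrives as (weekday, hour, value) tuples; the function names are hypothetical and the e-mail alert is left out.

```python
import statistics
from collections import defaultdict

def weekly_baseline(history, k=3.0):
    """Group reference data by (weekday, hour) and return {slot: (LCL, mean, UCL)}."""
    groups = defaultdict(list)
    for weekday, hour, value in history:
        groups[(weekday, hour)].append(value)
    baseline = {}
    for slot, values in groups.items():
        mean = statistics.mean(values)
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        baseline[slot] = (mean - k * sd, mean, mean + k * sd)
    return baseline

def find_exceptions(actual, baseline):
    """Return the actual points that fall outside the limits for their weekly slot."""
    exceptions = []
    for weekday, hour, value in actual:
        limits = baseline.get((weekday, hour))
        if limits is None:
            continue  # no history for this slot, so nothing to compare against
        lcl, _, ucl = limits
        if value > ucl or value < lcl:
            exceptions.append((weekday, hour, value))
    return exceptions

# Hypothetical history: five Mondays (weekday 0) at 8:00 and 9:00.
history = [(0, 8, v) for v in (40, 42, 44, 41, 43)] + [(0, 9, v) for v in (50, 52, 51, 49, 53)]
last_week = [(0, 8, 60), (0, 9, 51)]
print(find_exceptions(last_week, weekly_baseline(history)))
```

Because the baseline is keyed by weekday and hour, its mean curve for tomorrow's slots is already known today, which is the sense in which the chart shows what is expected to happen next.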
CMG’09 20
Why is it so powerful? Forecasting vs. exception detecting
In addition to capturing unusual resource usage, the Weekly Control Chart has the following features:
o “Summarization”: it uses summarized data (6-8 month history of hourly data).
o “Correlation”: it lets you see where system performance and/or business driver metrics correlate, simply by analyzing synchronized control charts.
o “Do Not Mix Shifts”: the Control Chart by nature visualizes the separation of work or peak time from off time.
o “Statistical Model Choice”: playing with different statistical limits (e.g. 1 st. dev. vs. 3 or more st. dev., or percentiles) to tune the system and reduce the rate of false positives.
o “Significant Events”: the chart adjusts itself statistically to some events, because the historical period follows the actual data and every event will eventually be older than the oldest day in the reference set.
o “Outliers detection”: all workload pathologies are statistically unusual; they are captured and are then supposed to be removed from the historical data.
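The "Statistical Model Choice" point can be illustrated by counting how many points are flagged under different limit widths. This is a hedged sketch with made-up numbers; the function name is hypothetical.

```python
import statistics

def exception_rate(reference, actual, k):
    """Fraction of actual points farther than k standard deviations from the reference mean."""
    mean = statistics.mean(reference)
    sd = statistics.stdev(reference)
    flagged = [v for v in actual if abs(v - mean) > k * sd]
    return len(flagged) / len(actual)

reference = [10, 12, 11, 13, 12, 11, 10]  # hypothetical baseline values
actual = [11, 13, 14, 20]                 # hypothetical recent values
for k in (1, 2, 3):
    print(f"k={k}: {exception_rate(reference, actual, k):.0%} of points flagged")
```

Narrow limits flag more points (more false positives); wider limits keep only the most unusual ones.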
CMG’09 21
The SEDS and Memory Metrics (Paging exceptions )
Why is it so powerful? EXAMPLES
This metric has the following problem: there is no simple calculated threshold and, as such, it is hard to say if the 2 am spike is big enough to worry about.
CMG’09 23
Why is it so powerful? EXAMPLES
The control chart shows unusual paging activity. That is confirmed by reviewing the historical paging trend:
The SEDS and Memory Metrics (Paging exceptions )
CMG’09 24
Why is it so powerful? EXAMPLES
The SEDS and Memory Metrics (Weekly IT-chart )
CMG’09 26
Why is it so powerful? EXAMPLES
This example shows the weekly scheduled server reboot (to avoid memory leak issues). This kind of graph is also useful since, even if there were no exceptions from yesterday, it may show exceptions from previous days.
The SEDS and Memory Metrics (Weekly IT-chart )
CMG’09 27
Why is it so powerful? EXAMPLES
The SEDS and Memory Metrics (Weekly IT charts: Memory Leaks)
CMG’09 28
Why is it so powerful? EXAMPLES
The SEDS and CPU Metrics (24 hour and weekly control charts)
Some Citrix apps defect on VMs
Global exception correlates with some apps
CMG’09 29
Why is it so powerful? EXAMPLES
The SEDS and Virtual Machine metrics
[Chart labels: HOST; Running-away VM]
The Control Chart detects a run-away of the VM even though the CPU utilization is <80%.
CMG’09 30
Why is it so powerful? EXAMPLES
The SEDS and CPU Run Queue metric
Run Queue is useful for capturing CPU bottlenecks, and it indirectly relates to the system response time.
This is a Sun Fire V880 4-way box.
CMG’09 31
Why is it so powerful? EXAMPLES
The SEDS and CPU Run Queue metric
This is a Sun Fire V880 4-way box.
If a CPU Queue exception is detected, CPU utilization had an exception for the same hour, and CPU utilization was close to 100%, there is a high probability of a CPU capacity issue.
But which application caused the exceptions?
CMG’09 32
Why is it so powerful? EXAMPLES
The SEDS and CPU Run Queue metric
When a global exception occurs (CPU Queue), the workload level data can be scanned to identify what particular application on the server was responsible for the exception.
CMG’09 33
Why is it so powerful? EXAMPLES
The SEDS and CPU Run Queue metric
The scan against the application level data showed Application5 had a similar exception.
CONCLUSION: An unusual number of active processes is the cause of global CPU Queue exception and indicates a potential application performance problem!
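The scan of workload-level data for the application behind a global exception might look like the sketch below. It assumes exceptions are already available as lists of hours per workload; the function name and the sample data are hypothetical and not taken from SEDS.

```python
def responsible_workloads(global_exception_hours, workload_exception_hours):
    """Return (workload, overlapping hours) pairs, largest overlap first."""
    global_hours = set(global_exception_hours)
    hits = {}
    for name, hours in workload_exception_hours.items():
        overlap = global_hours & set(hours)
        if overlap:
            hits[name] = sorted(overlap)
    # Most overlapping hours first; ties are broken by workload name.
    return sorted(hits.items(), key=lambda item: (-len(item[1]), item[0]))

# Hypothetical data: global CPU Queue exceptions at 14:00-16:00.
global_hours = [14, 15, 16]
per_workload = {
    "Application1": [3],
    "Application2": [16],
    "Application5": [15, 16],
}
print(responsible_workloads(global_hours, per_workload))
```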
CMG’09 34
Why is it so powerful? EXAMPLES
The SEDS and response time and some other application metrics
SEDS can capture exceptions in Application Response Time (ART) and in the call volume of particular functions (API calls) within the Middleware tier.
CMG’09 35
Why is it so powerful? EXAMPLES
The SEDS and response time and some other application metrics
IT-chart with E2E response time for “signon” application:
Historical trend chart with E2E response time and transaction volume:
CMG’09 36
Why is it so powerful? EXAMPLES
The SEDS and disk space metrics
CMG’09 37
Why is it so powerful? EXAMPLES
The SEDS and disk I/O metrics
o SEDS captured a Disk I/O rate exception at about 4:00 PM on ServerB,
o and the application detector found that the workload “Appl2” had an exception as well.
CMG’09 38
Why is it so powerful? EXAMPLES
The SEDS and Unisys and Tandem metrics
The Unisys server had unusually low utilization, which might indicate disk or database performance problems.
The Tandem server, in contrast, had two unusual spikes of CPU utilization that crossed the upper limit.
CMG’09 39
Why is it so powerful? EXAMPLES
Mainframe metrics Control Chart
BMC Visualizer was used to find any exceptions based on different filtering policies. For that, the BMC collector needed to be installed on the server and BMC Visualizer used manually to capture any MASF exceptions.
BMC Visualizer example: the System Hierarchy (spectrum) and Control Charts
CMG’09 40
Why is it so powerful? EXAMPLES
The SEDS and Mainframe metrics
Looking at a stacked workload data chart, it is difficult to find the application that is responsible for spikes in overall CPU usage.
SEDS shows that Appl1 was responsible for the global maxima in the overall MIPS chart.
Exceptions captured for one of the LPARs
CMG’09 41
Why is it so powerful? EXAMPLES
The SEDS and Mainframe metrics
Hourly SUM of the average response per transaction - RESP (it shows values consistently higher than average)
Hourly SUM of ended transaction count - TRANS
Hourly SUM of elapsed tasks duration - CPUsec
CMG’09 42
Why is it so powerful? EXAMPLES
The SEDS and Mainframe metrics
To capture an unusual behavior of a relatively small application that was not big enough to create a global exception.
HEALTH CHECK: To prove a stable behavior of any essential or critical application.
CMG’09 43
Near-Real Time Control Chart (Proactive Availability Management)
Real-time statistical exception detection requires processing data every interval (at least hourly) to do smart alerting based on dynamic (statistical) thresholds rather than static ones (currently more common). That can be used for Proactive Availability Management.
Some tools do that:
- Integrien (http://www.integrien.com/)
- Netuitive (http://netuitive.com/)
- ProactiveNet (www.BMC.com)
SEDS was tested to do that too (see the IT-chart on this slide).
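The difference between dynamic and static thresholds can be sketched as a per-interval check. This is an assumption-laden Python illustration: the baseline dictionary, its (weekday, hour) keys and the function names are hypothetical, and a real monitor would send an alert rather than yield a tuple.

```python
def smart_alert(value, lcl, ucl):
    """Compare one interval's value with the limits for that time slot."""
    if value > ucl:
        return "above UCL"
    if value < lcl:
        return "below LCL"
    return None

def monitor(stream, baseline):
    """Yield an alert for each (weekday, hour, value) that breaks its slot's limits."""
    for weekday, hour, value in stream:
        lcl, ucl = baseline[(weekday, hour)]
        status = smart_alert(value, lcl, ucl)
        if status:
            yield (weekday, hour, value, status)

# Hypothetical limits: the server is normally idle at 2:00 and busy at 14:00.
baseline = {(1, 2): (0.0, 10.0), (1, 14): (40.0, 80.0)}
stream = [(1, 2, 30.0), (1, 14, 70.0), (1, 14, 85.0)]
for alert in monitor(stream, baseline):
    print(alert)
```

A static 80% threshold would miss the 30% reading at 2:00, which is unusual for that hour, while the dynamic limits flag it.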
CMG’09 44
How to build a Control Chart
Using existing statistical tools:
o SAS/Base and SAS/Graph
o SAS/QC (Quality Control)
o JMP from SAS
o Minitab and others
o qcc: An R package for quality control charting
Using a built-in Control Chart builder (BMC, BEZ and so on)
CMG’09 45
How to build a Control Chart - EXCEL
What about just Excel!
o EXAMPLE: CPS Control Chart with moving or static reference set
LINK TO SPREADSHEET
Spreadsheet formulas (row numbers as given on the slide):
o 7-day Moving Average (column F) =AVERAGE(B:B+10)
o 1 st. dev. (referenced as G in the limit formulas) =STDEV(B:B+10)
o UpperLimit (column H) =F+M$2*G
o LowerLimit (column J) =F-M$2*G
o Other limits can be used: =PERCENTILE(B3:B+10,0.05) and =PERCENTILE(B3:B+10,0.95)
o S+ (column I) =IF(B-H<0,0,B-H)
o S- (column K) =IF(B-J>0,0,B-J)
o EV = ExtraValue = I+K; see [1]
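For readers without the spreadsheet, the column logic of this example can be approximated in Python. This is a sketch under assumptions: the window length, the multiplier `k` (the role played by cell M$2) and the choice to use the window preceding each point are illustrative, since the slide does not show the exact cell ranges.

```python
import statistics

def extra_value_rows(values, window=7, k=3.0):
    """For each point after the first window, return (value, lower, upper, EV)."""
    rows = []
    for i in range(window, len(values)):
        reference = values[i - window:i]         # moving reference set
        mean = statistics.mean(reference)        # moving average (column F)
        sd = statistics.stdev(reference)         # standard deviation (G)
        upper = mean + k * sd                    # UpperLimit (column H)
        lower = mean - k * sd                    # LowerLimit (column J)
        s_plus = max(0.0, values[i] - upper)     # S+: excess above the upper limit
        s_minus = min(0.0, values[i] - lower)    # S-: shortfall below the lower limit
        rows.append((values[i], lower, upper, s_plus + s_minus))  # EV = S+ + S-
    return rows

series = [10, 12, 11, 13, 12, 11, 10, 30, 11]  # made-up daily values
for value, lower, upper, ev in extra_value_rows(series):
    print(f"value={value} lower={lower:.2f} upper={upper:.2f} EV={ev:.2f}")
```

EV is zero while the value stays within the limits, positive when it exceeds the upper limit and negative when it drops below the lower one.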
CMG’09 46
How to build a Control Chart - EXCEL
What about just Excel!
o EXAMPLE2: Weekly Health Index (Concord metric) MASF Control Chart Builder
LINK TO SPREADSHEET
DATE        HEALTH INDEX   WEEKDAY
6-Dec-05    2.3            3
7-Dec-05    1.5            4
8-Dec-05    0.0            5
9-Dec-05    1.1            6
…           …              …
23-May-06   4.4            3
24-May-06   6.0            4
25-May-06   0.3            5
26-May-06   1.0            6
• For SUNday (Column “B”):
• Mean = AVERAGE(B2:B25)
• Upperlimit = AVERAGE(B2:B25) + 3*STDEV(B2:B25)
• Lowerlimit = IF(AVERAGE(B2:B25) - 3*STDEV(B2:B25) < 0, 0, AVERAGE(B2:B25) - 3*STDEV(B2:B25))
• StdDeviation = STDEV(B2:B25)
• For other columns, “B” should be replaced with the other column letter (e.g. MONday – “C” and so on)
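The weekday grouping behind this spreadsheet can be mirrored in a few lines of Python. The sketch uses only the rows visible on the slide (the elided "…" rows are not reconstructed), so the resulting limits are illustrative; the function names are hypothetical.

```python
import statistics
from collections import defaultdict
from datetime import datetime

def excel_weekday(date_text):
    """Parse a date like '6-Dec-05' and number the day as Excel WEEKDAY does (Sunday = 1)."""
    day = datetime.strptime(date_text, "%d-%b-%y")
    return day.isoweekday() % 7 + 1

def weekday_limits(rows, k=3.0):
    """Return {weekday: (lower, mean, upper)} with the lower limit floored at zero."""
    by_day = defaultdict(list)
    for date_text, value in rows:
        by_day[excel_weekday(date_text)].append(value)
    result = {}
    for weekday, values in sorted(by_day.items()):
        mean = statistics.mean(values)
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        result[weekday] = (max(0.0, mean - k * sd), mean, mean + k * sd)
    return result

rows = [("6-Dec-05", 2.3), ("7-Dec-05", 1.5), ("8-Dec-05", 0.0), ("9-Dec-05", 1.1),
        ("23-May-06", 4.4), ("24-May-06", 6.0), ("25-May-06", 0.3), ("26-May-06", 1.0)]
for weekday, (lower, mean, upper) in weekday_limits(rows).items():
    print(weekday, round(lower, 2), round(mean, 2), round(upper, 2))
```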
CMG’09 48
How to use Control Chart (e.g. in the SEDS structure)
To visualize SEDS findings (exceptions)
[Diagram components:]
o PDB – raw data
o Script to build data for charting/detecting (Subsystem 1, Subsystem 2, …, Subsystem n)
o SEDS DB
o Script to detect exceptions (Subsystem 1, Subsystem 2, …, Subsystem n)
o Knowledge Base (Filtering Rules): statistical (e.g. Metric > UCL or < LCL); empirical (e.g. Duration > 2 hours)
o Exception DB: exceptional subsystems (e.g. servers); severity of exception (e.g. number of hours)
o Script to chart: DONE! See next slides (cchrt.r)
o Control Chart
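The two kinds of filtering rules can be combined as in the following Python sketch. It is only an illustration: the function names are hypothetical, and reading "Duration > 2 hours" as "more than two consecutive exception hours" is an assumption.

```python
def exception_hours(actual, lcl, ucl):
    """Statistical rule: indexes of hours where the metric is above UCL or below LCL."""
    return [i for i, (v, lo, hi) in enumerate(zip(actual, lcl, ucl)) if v > hi or v < lo]

def passes_duration_rule(hours, min_duration=2):
    """Empirical rule: the longest run of consecutive exception hours must exceed min_duration."""
    longest = run = 0
    previous = None
    for hour in hours:
        run = run + 1 if previous is not None and hour == previous + 1 else 1
        longest = max(longest, run)
        previous = hour
    return longest > min_duration

# Hypothetical five hours of data with flat limits.
actual = [10, 50, 55, 60, 12]
hours = exception_hours(actual, lcl=[0] * 5, ucl=[40] * 5)
severity = len(hours)  # severity of exception as the number of exception hours
print(hours, severity, passes_duration_rule(hours))
```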
CMG’09 49
How to build a Control Chart – EXCEL vs. SAS vs. R
What about SAS, Excel or R!
o EXAMPLE3: Monthly Profile vs. Weekly Profile
LINK TO SPREADSHEET
The data is Unix File Space Utilization
EXCEL
SAS
CMG’09 50
How to build a Control Chart – EXCEL vs. SAS vs. R
What about SAS, Excel or R!
o EXAMPLE3: Monthly Profile
R download: http://www.r-project.org/
The data is Unix File Space Utilization: INPUT is CSV
R-script (published on my blog):
Output is JPEG
(FYI: qcc: An R package for quality control charting: http://cran.r-project.org/web/packages/qcc/index.html)
CMG’09 51
How to build an IT-Chart: SAS vs. EXCEL vs. R
What about SAS!
o EXAMPLE4: IT-chart
The raw data was captured by SEDS, based on MXG data
SAS Version
This real performance issue was captured, and the data was provided, by John Shuck (SunTrust)
CMG’09 52
How to build an IT-Chart: SAS vs. EXCEL vs. R
What about Excel!
o EXAMPLE4: IT-chart
LINK TO SPREADSHEET
EXCEL Pivot Table Version
The raw data was captured by a SEDS based on MXG data
CMG’09 53
How to build an IT-Chart: SAS vs. EXCEL vs. R
What about R!
o EXAMPLE4: IT-CHART
The raw data was captured by SEDS, based on MXG data
R Version
CMG’09 55
Summary
o The Control Chart is a truly proactive tool and can help to capture unusual resource usage before something breaks.
o The Control Chart is the best base-lining tool and can show how actual data deviate from the historical baseline.
o The IT-Chart is like a radar that shows what’s coming.
o The Control Chart is the tool to detect workload pathologies (run-aways, memory leaks).
o The Control Chart can uncover trends and patterns by showing deviations of actual data from a historical baseline.
o The Control Chart can be Classical (SPC) or MASF (Actual vs. Reference set, with grouping by hour and weekday).
o The Control Chart provides dynamic thresholds: no need for manual settings.
o The Control Chart is used to visualize exceptional behavior of a subsystem (e.g. as one of the outputs from SEDS).
o The Control Chart can be built using just Excel or R.
CMG’09 56
References
Jeffrey Buzen and Annie Shum: “MASF -- Multivariate Adaptive Statistical Filtering”, Proceedings of the Computer Measurement Group, 1995, pp. 1-10.
Igor Trubin: “Global and Application Levels Exception Detection System, Based on MASF Technique”, Proceedings of the Computer Measurement Group, 2002. (http://www.cmg.org/measureit/shared/trubin_02.pdf)
Linwood Merritt, Igor Trubin: “Disk Subsystem Capacity Management Based on Business Drivers I/O Performance Metrics and MASF”, Proceedings of the Computer Measurement Group, 2003. (http://regions.cmg.org/regions/ncacmg/downloads/june162004_session3.doc)
Linwood Merritt, Igor Trubin: “Mainframe Global and Workload Level Statistical Exception Detection System, Based on MASF”, Proceedings of the Computer Measurement Group, 2004. (http://www.cmg.org/membersonly/2004/papers/4179.pdf)
Igor Trubin: “Capturing Workload Pathology by Statistical Exception Detection System”, Proceedings of the Computer Measurement Group, 2005. (http://www.cmg.org/membersonly/2005/papers/5016.pdf)
Igor Trubin: “System Management by Exception, Part 6”, Proceedings of the Computer Measurement Group, 2006. (http://www.cmg.org/membersonly/2006/papers/6120.pdf)
Igor Trubin: “System Management by Exception, Part Final”, Proceedings of the Computer Measurement Group, 2007.
Igor Trubin: “Exception Based Modeling and Forecasting”, Proceedings of the Computer Measurement Group, 2008.
CMG’09 57
Near-Real-Time IT-Control Charts
Igor Trubin, PhD, SunTrust Bank
http://www.itrubin.blogspot.com/
Questions?