Near-Real-Time IT-Control Charts

CMG’09

Igor Trubin, PhD, SunTrust Bank

http://www.itrubin.blogspot.com/


Introduction: Agenda

o Where and why the Control Chart is used: a review of some systems performance tools on the market that build and use control charts.

o What is the Control Chart? A little bit of theory and history.

o How SEDS (Statistical Exception Detection System) uses it: MASF charts vs. SPC ones.

o The IT-Chart concept: the best control chart type for IT data visualization.

o A long gallery of charts already published in the CMG papers.

o Plus some new ones, with explanations of how to read them.

o How to build a Control Chart: using Excel for interactive analysis and R to automate control chart generation, with a live demonstration of the technique.

Where the Control Chart is used in IT

o BMC software (www.bmc.com): MASF technique in the Performance Analysis for Servers and Performance Assurance tools; BMC ProactiveNet Analytics (http://documents.bmc.com/products/documents/49/13/84913/84913.pdf)

o Fujitsu (www.fujitsu.com): ACTIVE BASELINING technique (www.fujitsu.com/downloads/AU/active_baselining_in_passive_data_environments.pdf)

o McAfee (www.mcafee.com): Anomaly-Based Intrusion Detection (www.mcafee.com/us/local_content/white_papers/wp_ddt_anomaly.pdf)

o BEZ systems (www.bez.com): Oracle and Teradata performance (www.wmoug.org/bezPresentation.pdf)

o Integrien Alive™ (http://www.integrien.com/)

o Netuitive (http://netuitive.com/)

o Firescope (http://www.firescope.com/default.htm)

o Managed Objects (http://managedobjects.com/)

o Six Sigma (http://www.isixsigma.com/st/control_charts/)

o SEDS (Statistical Exception Detection System) (http://www.itrubin.blogspot.com/)


Why the Control Chart is used for Capacity Management

The Control Chart can uncover hidden trends and patterns in systems performance data.

The Control Chart is a really proactive tool: it can capture unusual resource usage before something breaks.

The Control Chart is the best base-lining tool and can show how actual data deviate from the historical baseline.

The Control Chart provides dynamic thresholds: no manual settings are needed.

The Control Chart is the tool to detect workload pathologies (runaway processes, memory leaks and others).

Control Chart Definitions

o The control chart, also known as the Shewhart chart or process-behavior chart, is a tool used in statistical process control to determine whether a manufacturing or business process is in a state of statistical control.

o A graphical tool for monitoring changes that occur within a process, by distinguishing variation that is inherent in the process (common cause) from variation that yields a change to the process (special cause). This change may be a single point or a series of points in time; each is a signal that something is different from what was previously observed and measured.

http://en.wikipedia.org/wiki/Statistical_process_control

http://en.wikipedia.org/wiki/Process_(general)

http://en.wikipedia.org/wiki/Statistical_control

http://www.isixsigma.com/dictionary/Control_Chart-7.htm

What the Control Chart is

Chart details

o Points representing measurements of a quality characteristic in samples taken from the process at different times [the data].

o A centre line, drawn at the process characteristic mean, which is calculated from the data.

o Upper (UCL) and lower (LCL) control limits (sometimes called "natural process limits") that indicate the threshold at which the process output is considered statistically 'unlikely'.

http://en.wikipedia.org/wiki/Mean

What the Control Chart is (continued)

Choice of limits

o UCL = Mean + 3σ; LCL = Mean - 3σ; Centerline = Mean (or Average), where σ is the standard deviation. (The reason that 3σ control limits balance the risk of error is that, for normally distributed data, data points will fall inside the 3σ limits 99.7% of the time when a process is in control.)

o UCL = 95th percentile; LCL = 5th percentile; Centerline = 50th percentile. (A percentile or centile is the value of a variable below which a certain percent of observations fall.) That choice is good if the data is far from a normal distribution.
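Both choices of limits can be sketched in a few lines of Python. This is only an illustration, not the SEDS implementation: the function names and the sample CPU values are made up, and the percentile variant assumes linear interpolation between data points (the "inclusive" method of the standard statistics module).

```python
import statistics

def sigma_limits(values, k=3.0):
    """Return (LCL, centerline, UCL) as mean -/+ k sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return mean - k * sd, mean, mean + k * sd

def percentile_limits(values, low=5, high=95):
    """Return (LCL, centerline, UCL) as the low-th, 50th and high-th percentiles."""
    # 99 cut points; index p-1 holds the p-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[low - 1], cuts[49], cuts[high - 1]

# Made-up hourly CPU utilization (%), with one spike that skews the data.
cpu = [22, 25, 24, 27, 23, 26, 24, 25, 95, 24]
print("3-sigma limits:   ", sigma_limits(cpu))
print("percentile limits:", percentile_limits(cpu))
```

With skewed data like this sample, a single spike inflates the standard deviation and pushes the 3σ LCL below zero, which is one reason the percentile form may be preferable for non-normal data.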

http://en.wikipedia.org/wiki/Standard_deviation

http://en.wikipedia.org/wiki/Percentile

What the Control Chart is (continued)

Special Types of Control Charts

o There are X-bar, R, S, U, Np, P and C control charts.

o X-bar is the most common and is the one used in Capacity Management. In this chart the sample means are plotted in order to control the mean value of a variable.

o The C control chart (Poisson or counts) plots the number of defectives and is sensitive to changes in the number of defectives in the measurement process. In our area it could be used to control workload pathologies (e.g. runaways, memory leaks and so on). For the C-chart the control limits are calculated as LCL = c - 3√c and UCL = c + 3√c, where c is the mean number of defectives. Zero serves as a lower bound on the LCL.

o Other types are more appropriate for the mechanical engineering area.
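The C-chart limits can be checked with a short Python sketch. The function name and the sample counts are illustrative assumptions; only the formula comes from the slide.

```python
import math

def c_chart_limits(counts):
    """Return (LCL, c, UCL) for a C-chart, where c is the mean count."""
    c = sum(counts) / len(counts)
    spread = 3 * math.sqrt(c)
    # Zero serves as a lower bound on the LCL.
    return max(0.0, c - spread), c, c + spread

# Made-up daily counts of runaway processes on one server.
daily_runaways = [4, 3, 5, 4, 6, 2, 4]
print(c_chart_limits(daily_runaways))
```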

http://www.statsoft.com/textbook/stquacon.html

MASF, SPC control and histogram charts comparison

MASF: Reference set vs. Actual data.

All three charts demonstrate different views of exceptions for CPU utilization that occurred at 8 am.

As opposed to the classical univariate X-bar control chart, the MASF chart is most useful for showing a 24-hour (or 7x24) profile of resource usage, and is actually a multivariate Control Chart.

NOTE: Limits might need to be cut at the natural thresholds of 100% or 0%.
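A minimal sketch of a MASF-style 24-hour profile follows, assuming the reference set is a list of (hour, value) pairs and that limits are the mean plus/minus 3 standard deviations, cut at 0% and 100%. The function name, parameters and data are hypothetical.

```python
import statistics
from collections import defaultdict

def masf_hourly_limits(reference, k=3.0, floor=0.0, ceiling=100.0):
    """Group a reference set by hour of day and return {hour: (LCL, mean, UCL)}."""
    by_hour = defaultdict(list)
    for hour, value in reference:
        by_hour[hour].append(value)
    limits = {}
    for hour, values in by_hour.items():
        mean = statistics.mean(values)
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        # Cut the limits at the natural thresholds of the metric.
        limits[hour] = (max(floor, mean - k * sd), mean, min(ceiling, mean + k * sd))
    return limits

# Made-up CPU utilization (%) for two hours of the day over three days.
reference = [(8, 90), (8, 98), (8, 94), (2, 1), (2, 3), (2, 2)]
for hour, (lcl, mean, ucl) in sorted(masf_hourly_limits(reference).items()):
    print(hour, lcl, mean, ucl)
```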

How close is the data to a normal distribution for global CPU utilization on the same Unix server?

Example of 6-month hourly histograms for the global CPU utilization exception on an HP rp7400/550MHz/6-way server.

Reference set grouped by hours

Types of Control Charts against performance data

o Classical SPC type: daily or hourly aggregated data (SEDS, BMC Visualizer) or raw granular data (Integrien, for near-real-time data alerting)

o 24-hour profile for global or application-level data, MASF type (SEDS, BMC)

o Weekly profile of daily data (SEDS)

o Weekly profile of hourly data (IT-chart, the main SEDS tool): the most efficient way to visualize IT systems performance

o Monthly profile of daily data

Control Chart and other types of graphs

The control chart is only one of the possible graphical tools. It is one of the most powerful, but other types of charts could be used:

o Top Bar Chart

o CONTROL CHART

o Trend Forecast Charts

CASE: SEDS detected that a VM server was moved to another host by VMotion

IT-Chart Concept

Radar screen analogy. The refresh border (line) separates the current period (week) data from the previous period (week) data. Refreshing speed:

o Day: every morning the border shifts by 24 hours.

o Hour: an hourly refreshed control chart, where the border moves every hour. Good for near-real-time monitoring.

o Minutes? Or seconds, like real radar refreshing? That could be a capacity problem…

The weekly IT-control chart is the best, as it shows weekend, night and even lunch-time seasonality.

How to read weekly IT-charts

This is the SEDS view to compare the last 7 days (actual) vs. the last 6 months of baseline (historical) data.

The green curve is the hourly average (Mean) for a particular weekday and hour over the 6-month history. Red is the UCL; blue is the LCL.

The black curve is the actual hourly data. To the left of the vertical line is THIS WEEK's data up to yesterday; to the right is last week's data.

How the weekly IT-chart is built

o Take one week of recent data and put it in weekly profile form.

o Take some representative historical reference data, set it as a baseline, and then compare it with the most recent actual data.

o If the actual data exceeds some statistical thresholds (e.g. Upper (UCL) and Lower (LCL) Control Limits set at the mean plus/minus 3 standard deviations, or at some percentiles), generate an exception (alert via e-mail) and build a control chart.

NOTE: the chart predicts what is supposed to happen tomorrow.
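The steps above can be sketched in Python. This is a simplified illustration under stated assumptions, not the SEDS code: history and actual data are lists of (datetime, value) pairs, the baseline is grouped by weekday and hour, limits are the mean plus/minus 3 standard deviations, and the alerting step is reduced to returning the exceptional points. All names and numbers are made up.

```python
import statistics
from collections import defaultdict
from datetime import datetime, timedelta

def weekly_baseline(history, k=3.0):
    """Return {(weekday, hour): (LCL, mean, UCL)} from historical hourly data."""
    groups = defaultdict(list)
    for ts, value in history:
        groups[(ts.weekday(), ts.hour)].append(value)
    baseline = {}
    for slot, values in groups.items():
        mean = statistics.mean(values)
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        baseline[slot] = (mean - k * sd, mean, mean + k * sd)
    return baseline

def find_exceptions(actual, baseline):
    """Return the (timestamp, value) pairs of actual data outside the limits."""
    hits = []
    for ts, value in actual:
        slot = (ts.weekday(), ts.hour)
        if slot not in baseline:
            continue  # no history for this weekday/hour
        lcl, _, ucl = baseline[slot]
        if value > ucl or value < lcl:
            hits.append((ts, value))
    return hits

# Four historical Mondays at 09:00, then one recent Monday at 09:00.
start = datetime(2009, 1, 5, 9)
history = [(start + timedelta(weeks=i), v) for i, v in enumerate([50, 52, 48, 50])]
actual = [(start + timedelta(weeks=4), 70.0)]
print(find_exceptions(actual, weekly_baseline(history)))
```

Keying the baseline by weekday and hour is what lets the chart show the weekend, night and lunch-time seasonality described for the weekly IT-chart.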

Why is it so powerful? Forecasting vs. exception detecting

In addition to capturing unusual resource usage, the Weekly Control Chart has the following features:

o “Summarization”: it uses summarized data (6-8 months of hourly history).

o “Correlation”: it lets you see where system performance and/or business driver metrics correlate, simply by analyzing synchronized control charts.

o “Do Not Mix Shifts”: the Control Chart by nature visualizes the separation of work or peak time from off time.

o “Statistical Model Choice”: playing with different statistical limits (e.g. 1 st. dev. vs. 3 or more st. dev., or percentiles) to tune the system and reduce the rate of false positives.

o “Significant Events”: the chart adjusts itself statistically to events, because the historical period follows the actual data and every event will eventually be older than the oldest day in the reference set.

o “Outlier detection”: all workload pathologies are statistically unusual; they are captured and are then supposed to be removed from the historical data.

Why is it so powerful? EXAMPLES

The SEDS and Memory Metrics (Paging exceptions)

This metric has the following problem: there is no simple calculated threshold and, as such, it is hard to say whether the 2 am spike is big enough to worry about.

The control chart shows unusual paging activity. That is confirmed by reviewing the historical paging trend.

The SEDS and Memory Metrics (Weekly IT-chart)

This example shows the weekly scheduled server reboot (to avoid memory leak issues). This kind of graph is also useful since, even if there were no exceptions yesterday, it may show exceptions from previous days.

The SEDS and Memory Metrics (Weekly IT-charts: Memory Leaks)

The SEDS and CPU Metrics (24-hour and weekly control charts)

Some Citrix apps defect on VMs

Global exception correlates with some apps

The SEDS and Virtual Machine metrics

Chart labels: HOST; Runaway VM

The Control Chart detects a runaway VM even though the CPU utilization is below 80%.

The SEDS and CPU Run Queue metric

Run Queue is useful for capturing CPU bottlenecks, and it indirectly relates to the system response time.

This is a Sun Fire V880 4-way box.

If a CPU Queue exception is detected, CPU utilization had an exception for the same hour, and CPU utilization was close to 100%, there is a high probability of a CPU capacity issue.

But which application caused the exceptions?

When a global exception occurs (CPU Queue), the workload-level data can be scanned to identify which application on the server was responsible for the exception.

The scan against the application-level data showed that Application5 had a similar exception.

CONCLUSION: An unusual number of active processes is the cause of the global CPU Queue exception and indicates a potential application performance problem!
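The scan of workload-level data can be pictured with a small Python sketch. It assumes each workload already has its own control limits for the hour of the global exception; the function name, the workload names other than Application5, and all numbers are made up for illustration.

```python
def responsible_workloads(app_values, app_limits):
    """Return workloads whose value for the exception hour is outside their own limits.

    app_values: {workload: value for the hour of the global exception}
    app_limits: {workload: (LCL, UCL)} built from each workload's own history
    """
    hits = []
    for name, value in app_values.items():
        lcl, ucl = app_limits[name]
        if value > ucl or value < lcl:
            hits.append(name)
    return sorted(hits)

# Made-up numbers of active processes per workload for the exception hour.
values = {"Application1": 12, "Application5": 48, "Application7": 9}
limits = {"Application1": (5, 20), "Application5": (10, 30), "Application7": (4, 15)}
print(responsible_workloads(values, limits))
```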

The SEDS and response time and some other application metrics

SEDS could capture exceptions of Application Response Time (ART) and Calls Volume of particular functions (API calls) within the Middleware tier.

IT-chart with E2E response time for the “signon” application

Historical trend chart with E2E response time and transaction volume

The SEDS and disk space metrics

The SEDS and disk I/O metrics

o SEDS captured a Disk I/O rate exception at about 4:00 PM on ServerB,

o and the application detector found that the workload “Appl2” had an exception as well.

The SEDS and Unisys and Tandem metrics

The Unisys server had unusually low utilization, which might indicate Disk or Database performance problems.

The Tandem server, in contrast, had two unusual spikes of CPU utilization that crossed the upper limit.

Mainframe metrics Control Chart

BMC Visualizer was used to find exceptions based on different filtering policies. For that, the BMC collector needed to be installed on the server, and BMC Visualizer had to be used manually to capture any MASF exceptions.

BMC Visualizer example: the System Hierarchy (spectrum) and Control Charts

The SEDS and Mainframe metrics

Looking at a stacked workload data chart, it is difficult to find the application that is responsible for spikes in overall CPU usage.

SEDS shows that Appl1 was responsible for the global maxima in the overall MIPS chart.

Exceptions captured for one of the LPARs

Hourly SUM of the average response per transaction - RESP (it shows values consistently higher than average)

Hourly SUM of ended transaction count - TRANS

Hourly SUM of elapsed tasks duration - CPUsec

To capture unusual behavior of a relatively small application that was not big enough to create a global exception.

HEALTH CHECK: To prove the stable behavior of any essential or critical application.

Near-Real-Time Control Chart (Proactive Availability Management)

Some tools do that:
- Integrien (http://www.integrien.com/)
- Netuitive (http://netuitive.com/)
- ProactiveNet (www.BMC.com, http://proactivenet.com/)

SEDS was tested to do that too (see the IT-chart on this slide and http://itrubin.blogspot.com/2009/02/realtime-statistical-exception.html).

Real-Time Statistical Exception Detection requires processing data every interval (at least hourly) to do smart alerting based on dynamic (statistical) thresholds rather than static ones (currently more common). That can be used for Proactive Availability Management.
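Processing data every interval against dynamic thresholds could look like the following Python sketch. It is not how SEDS or the listed tools work internally. It assumes one reading per hour, a reference set of the last 26 readings for the same weekday and hour (roughly 6 months), and 3-standard-deviation limits. The class name and all parameters are hypothetical.

```python
import statistics
from collections import defaultdict, deque
from datetime import datetime, timedelta

class HourlyDetector:
    """Checks each new hourly reading against limits from its own weekday/hour history."""

    def __init__(self, weeks=26, k=3.0, min_samples=4):
        self.k = k
        self.min_samples = min_samples
        # One rolling reference set per (weekday, hour); old weeks drop out.
        self.reference = defaultdict(lambda: deque(maxlen=weeks))

    def observe(self, ts, value):
        """Return 'above', 'below' or None, then add the value to the reference set."""
        slot = self.reference[(ts.weekday(), ts.hour)]
        verdict = None
        if len(slot) >= self.min_samples:
            mean = statistics.mean(list(slot))
            sd = statistics.stdev(list(slot))
            if value > mean + self.k * sd:
                verdict = "above"
            elif value < mean - self.k * sd:
                verdict = "below"
        slot.append(value)  # the reference set follows the actual data
        return verdict

detector = HourlyDetector()
start = datetime(2009, 1, 5, 9)
for week, value in enumerate([50, 52, 48, 50, 70]):
    print(week, detector.observe(start + timedelta(weeks=week), value))
```

In this sketch every reading joins the reference set, so the limits adapt to lasting changes; a variant could skip exceptional readings to keep outliers out of the baseline.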

How to build a Control Chart

Using existing statistical tools:

o SAS/Base and SAS/Graph

o SAS/QC (Quality Control)

o JMP from SAS

o Minitab and others

o qcc: an R package for quality control charting (http://www.stat.unipg.it/~luca/Rnews_2004-1-pag11-17.pdf)

Using a built-in Control Chart builder (BMC, BEZ and so on)






How to build a Control Chart - EXCEL

What about just Excel!

o EXAMPLE: CPS Control Chart with a moving or static reference set

7-day Moving Average (column F): =AVERAGE(B:B+10)

1 st. dev (column G): =STDEV(B:B+10)

UpperLimit (column H): =F+M$2*G

LowerLimit (column J): =F-M$2*G

Other limits can be used: =PERCENTILE(B3:B+10,0.05) and =PERCENTILE(B3:B+10,0.95)

S+ (column I): =IF(B-H<0,0,B-H)

S- (column K): =IF(B-J>0,0,B-J)

EV = ExtraValue = I+K (see [1])

LINK TO SPREADSHEET
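The spreadsheet logic of this example (moving reference set, upper and lower limits, S+, S- and ExtraValue) can be mirrored in Python. The sketch assumes the reference window is the 7 points preceding the current one, which may differ from the ranges in the actual spreadsheet; the function name and the sample series are made up.

```python
import statistics

def extra_values(values, window=7, k=3.0):
    """Return a list of (index, lower, upper, EV) for each point after the first window."""
    rows = []
    for i in range(window, len(values)):
        ref = values[i - window:i]           # moving reference set
        mean = statistics.mean(ref)
        sd = statistics.stdev(ref)
        upper, lower = mean + k * sd, mean - k * sd
        s_plus = max(0.0, values[i] - upper)   # like =IF(B-H<0,0,B-H)
        s_minus = min(0.0, values[i] - lower)  # like =IF(B-J>0,0,B-J)
        rows.append((i, lower, upper, s_plus + s_minus))
    return rows

# Made-up daily metric with one spike on the eighth day.
series = [10, 12, 10, 12, 10, 12, 11, 30, 11]
for row in extra_values(series):
    print(row)
```

EV is zero while the data stays inside the limits, positive by the amount it exceeds the upper limit, and negative by the amount it falls below the lower limit.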

How to build a Control Chart - EXCEL

What about just Excel!

o EXAMPLE2: Weekly Health Index (Concord metric) MASF Control Chart Builder

LINK TO SPREADSHEET

DATE         HEALTH INDEX   WEEKDAY
6-Dec-05     2.3            3
7-Dec-05     1.5            4
8-Dec-05     0.0            5
9-Dec-05     1.1            6
…            …              …
23-May-06    4.4            3
24-May-06    6.0            4
25-May-06    0.3            5
26-May-06    1.0            6

• For SUNday (column “B”):

• Mean = AVERAGE(B2:B25)

• Upperlimit = AVERAGE(B2:B25)+3*STDEV(B2:B25)

• Lowerlimit = IF(AVERAGE(B2:B25)-3*STDEV(B2:B25)<0, 0, AVERAGE(B2:B25)-3*STDEV(B2:B25))

• StdDeviation = STDEV(B2:B25)

• For other columns, “B” should be replaced with the other column letter (e.g. MONday is “C” and so on)
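The same per-weekday limits can be computed without a spreadsheet. The Python sketch below assumes rows of (date, health index), numbers weekdays with Sunday as 1 to match the WEEKDAY column, and floors the lower limit at zero as the IF formula does. The function names are illustrative, and the sample rows are a subset of the table above.

```python
import statistics
from collections import defaultdict
from datetime import date

def excel_weekday(d):
    """Weekday number with Sunday=1 ... Saturday=7, as in the WEEKDAY column."""
    return d.isoweekday() % 7 + 1

def weekday_limits(rows, k=3.0):
    """rows: iterable of (date, health_index). Returns {weekday: (lower, mean, upper)}."""
    groups = defaultdict(list)
    for d, value in rows:
        groups[excel_weekday(d)].append(value)
    limits = {}
    for day, values in groups.items():
        mean = statistics.mean(values)
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        limits[day] = (max(0.0, mean - k * sd), mean, mean + k * sd)
    return limits

rows = [(date(2005, 12, 6), 2.3), (date(2005, 12, 7), 1.5),
        (date(2006, 5, 23), 4.4), (date(2006, 5, 24), 6.0)]
for day, lim in sorted(weekday_limits(rows).items()):
    print(day, lim)
```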


How to use Control Chart (e.g. in the SEDS structure)

To visualize SEDS findings (exceptions)

SEDS structure (diagram labels):

o PDB – raw data

o Script to build data for charting/detecting (Subsystem 1, Subsystem 2, …, Subsystem n)

o SEDS DB

o Script to detect exceptions (Subsystem 1, Subsystem 2, …, Subsystem n)

o Knowledge Base (Filtering Rules): statistical (e.g. Metric > UCL or < LCL) and empirical (e.g. Duration > 2 hours)

o Exception DB

o Exceptional subsystems (e.g. servers); severity of exception (e.g. number of hours)

o Script to chart: DONE! See next slides (cchrt.r)

o Control Chart
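The two kinds of filtering rules can be combined as in the Python sketch below. It assumes the statistical rule is applied per hour and that "duration" means the total number of exceptional hours for a subsystem in the scanned period, which is used here as the severity. The function name, data layout and numbers are hypothetical.

```python
def filter_exceptions(hourly, limits, min_hours=2):
    """Return {subsystem: exceptional_hours} for subsystems that pass both rules.

    hourly: {subsystem: [value per hour]}
    limits: {subsystem: [(LCL, UCL) per hour]}
    """
    report = {}
    for name, values in hourly.items():
        # Statistical rule: Metric > UCL or < LCL.
        hours = sum(1 for value, (lcl, ucl) in zip(values, limits[name])
                    if value > ucl or value < lcl)
        # Empirical rule: keep only exceptions lasting more than min_hours.
        if hours > min_hours:
            report[name] = hours
    return report

# Made-up data: four hours for two servers, same limits for every hour.
hourly = {"server1": [40, 95, 96, 97], "server2": [40, 95, 41, 42]}
limits = {"server1": [(10, 80)] * 4, "server2": [(10, 80)] * 4}
print(filter_exceptions(hourly, limits))
```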

How to build a Control Chart – EXCEL vs. SAS vs. R

What about SAS, Excel or R!

o EXAMPLE3: Monthly Profile vs. Weekly Profile

The data is Unix File Space Utilization.

LINK TO SPREADSHEET

Chart versions shown: EXCEL, SAS

How to build a Control Chart – EXCEL vs. SAS vs. R

What about SAS, Excel or R!

o EXAMPLE3: Monthly Profile

R download: http://www.r-project.org/

The data is Unix File Space Utilization. INPUT is CSV; output is JPEG.

R-script (published on my blog): http://itrubin.blogspot.com/2009/03/power-of-control-charts.html

(FYI: qcc, an R package for quality control charting: http://cran.r-project.org/web/packages/qcc/index.html)

How to build an IT-Chart: SAS vs. EXCEL vs. R

o EXAMPLE4: IT-chart

The raw data was captured by SEDS based on MXG data. This real performance issue was captured, and the data provided, by John Shuck (SunTrust).

What about SAS! SAS version

What about Excel! EXCEL Pivot Table version (LINK TO SPREADSHEET)

What about R! R version

Summary

o The Control Chart is a really proactive tool and can help to capture unusual resource usage before something breaks.

o The Control Chart is the best base-lining tool and can show how actual data deviate from the historical baseline.

o The IT-Chart is like a radar that shows what is coming.

o The Control Chart is the tool to detect workload pathologies (runaways, memory leaks).

o The Control Chart can uncover trends and patterns by showing deviations of actual data from a historical baseline.

o A Control Chart can be Classical (SPC) or MASF (Actual vs. Reference set, grouped by hour and weekday).

o The Control Chart provides dynamic thresholds: no manual settings are needed.

o The Control Chart is used to visualize exceptional behavior of a subsystem (e.g. as one of the outputs from SEDS).

o A Control Chart can be built using just Excel or R.

References

Jeffrey Buzen and Annie Shum: “MASF -- Multivariate Adaptive Statistical Filtering”, Proceedings of the Computer Measurement Group, 1995, pp. 1-10.

Igor Trubin: “Global and Application Levels Exception Detection System, Based on MASF Technique”, Proceedings of the Computer Measurement Group, 2002. (http://www.cmg.org/measureit/shared/trubin_02.pdf)

Linwood Merritt, Igor Trubin: “Disk Subsystem Capacity Management Based on Business Drivers I/O Performance Metrics and MASF”, Proceedings of the Computer Measurement Group, 2003. (http://regions.cmg.org/regions/ncacmg/downloads/june162004_session3.doc)

Linwood Merritt, Igor Trubin: “Mainframe Global and Workload Level Statistical Exception Detection System, Based on MASF”, Proceedings of the Computer Measurement Group, 2004. (http://www.cmg.org/membersonly/2004/papers/4179.pdf)

Igor Trubin: “Capturing Workload Pathology by Statistical Exception Detection System”, Proceedings of the Computer Measurement Group, 2005. (http://www.cmg.org/membersonly/2005/papers/5016.pdf)

Igor Trubin: “System Management by Exception, Part 6”, Proceedings of the Computer Measurement Group, 2006. (http://www.cmg.org/membersonly/2006/papers/6120.pdf)

Igor Trubin: “System Management by Exception, Part Final”, Proceedings of the Computer Measurement Group, 2007.

Igor Trubin: “Exception Based Modeling and Forecasting”, Proceedings of the Computer Measurement Group, 2008.

Related links:

http://www.cmg.org/proceedings/1995/95INT089.pdf

http://www.cmg.org/membersonly/2002/papers/paper526.pdf

http://regions.cmg.org/regions/scmg/fall_07/richmond/SEDSCMG2007_v4.pdf



Questions?



http://itrubin.blogspot.com/2009/05/seds-charts-at-scmg.html

