The definition of normal - An introduction and guide to anomaly detection.
Transcript of The definition of normal - An introduction and guide to anomaly detection.
ruxit theme 2014.05.15
The definition of normal
An introduction and guide to anomaly detection
Alois Reitbauer, ruxit, @aloisreitbauer
Some background
Who I am and what I do
Anomaly Detection
What is an anomaly anyways?
What is an anomaly?
In data mining, anomaly detection (or outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset. Typically the anomalous items will translate to some kind of problem such as ……….
Source: Wikipedia
How many metrics would we have to look at?
3 Metrics per Service, 5 Metrics per Host, 5 Metrics per Runtime
40 Services = 120 Metrics
20 Hosts = 100 Metrics
40 Runtimes = 200 Metrics
Total: 420 Metrics
We cannot watch 400+ metrics
So we need to find ways to automate finding anomalies
Anomaly Detection Workflow
[Diagram. Nodes: Historic Data, “Normal” Model, New Data, Hypothesis, Likeliness, Judgement (Anomaly?). Edge labels: calculate, derive, test, produces, defines, update.]
We will look at three types of data
Response Times: Did our response times increase significantly?
Error Rates: Did the error rate of any of our services change?
Load: Is there anything unusual happening to our service load?
Finding error rate anomalies
Are we having more errors than usual?
How can we get our baseline?
Average or Mean: Easy to calculate but does not learn over time
Median: Needs more raw data than the average, but is precise. Does not learn well either
Exponential Smoothing: Easy to calculate and learns over time
Using exponential smoothing for baseline
Source: Wikipedia
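The formula on this slide did not survive extraction. As a minimal sketch, assuming the common textbook form of simple exponential smoothing (s_0 = x_0, s_t = alpha * x_t + (1 - alpha) * s_{t-1}), a baseline could be computed as follows. The function name, the value of alpha and the sample error rates are illustrative assumptions, not taken from the talk.

```python
def exponential_smoothing(values, alpha):
    """Return the smoothed series s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))  # s_0 = x_0
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical error rates in percent, one value per interval.
error_rates = [3.0, 3.2, 2.8, 3.1, 6.0, 3.0]
baseline = exponential_smoothing(error_rates, alpha=0.5)
print(baseline[-1])  # latest baseline estimate
```

A larger alpha makes the baseline follow recent values more closely; a smaller alpha makes it learn more slowly but react less to single spikes.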
Example
Is this an anomaly?
Our Observation:
Typical error rate of 3 percent at 10,000 transactions/min
Current System Behavior:
During the night we see 5 errors in 100 requests
Binomial Distribution
Tells us how likely it is to see n successes in a certain number of trials
Applying Binomial Distribution to our problem.
[Chart: likeliness of at least n errors, for n = 1 to 19]
18 % probability to see 5 or more errors
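The 18 percent figure can be reproduced with the binomial tail probability for 100 requests at a 3 percent baseline error rate. The sketch below uses only the Python standard library; the function name is an assumption made for illustration.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 5 or more errors in 100 requests at a baseline error rate of 3 percent.
likeliness = prob_at_least(5, 100, 0.03)
print(f"{likeliness:.0%}")  # roughly 18 percent
```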
Derive an anomaly from a forecast
What is unlikely enough to be interpreted as an anomaly?
95 % Probability Window
Borrowing from the Standard Deviation
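One possible reading of the 95 percent window, sketched here under the assumption of a one-sided test: alert only when the observed error count would occur with less than 5 percent probability under the baseline. The function names and the 0.05 default are assumptions for illustration.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def alert_threshold(n, p, significance=0.05):
    """Smallest error count k whose tail probability P(X >= k) is below significance."""
    for k in range(n + 1):
        if prob_at_least(k, n, p) < significance:
            return k
    return None  # no count is unlikely enough, e.g. when p == 1


# 100 requests at a baseline error rate of 3 percent.
threshold = alert_threshold(100, 0.03)
print(threshold)
```

With these inputs, the 5 errors from the earlier example stay below the threshold, which matches the conclusion that 18 percent is not unlikely enough to alert.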
Response Time Anomalies
Are our response times higher than usual?
Challenges in finding response time anomalies
Data representation is important
Proper data representation with Median
If our data were normally distributed …
Mean: 500 ms, Std. Dev.: 100 ms
68 %: 400 ms to 600 ms
95 %: 300 ms to 700 ms
99.7 %: 200 ms to 800 ms
However, we can generalize the model
μ corresponds to the Median: 50 percent of values are slower than μ
μ + 2σ corresponds to roughly the 97th Percentile: about 97.7 percent of values are faster than μ + 2σ
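Percentiles need no assumption about the shape of the distribution. A minimal sketch, assuming the nearest-rank definition of a percentile (other definitions interpolate between values); the function name and sample values are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q percent of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical response times in milliseconds.
response_times_ms = [120, 180, 250, 300, 310, 330, 400, 450, 900, 2500]
median_ms = percentile(response_times_ms, 50)
p97_ms = percentile(response_times_ms, 97)
```

The median plays the role of μ and the 97th percentile the role of μ + 2σ, without requiring normally distributed data.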
Is this an anomaly?
Our Observation:
Usually we see a median response time of 300 ms.
Current System Behavior:
During the night, with low traffic, response times go up to 600 ms.
Testing against new data
Our median response time is 300 ms and we measure
200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms
Using Binomial distribution on median
Check all values above 300 ms
200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms
7 values are higher than the median. Is this normal?
Applying percentile drift detection
We have a 50 percent likeliness to see values above the median.
How likely is it that 7 or more out of 10 samples are higher? The probability is about 17 percent, so we should not alert.
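The check above can be sketched in a few lines: count the samples above the baseline median and ask how likely that count (or a higher one) is when each sample has a 50 percent chance of exceeding the median. The function names and the 5 percent alerting threshold are assumptions for illustration; the sample values are the ones from the slide.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def percentile_drift(samples, baseline_value, fraction_above=0.5, significance=0.05):
    """Return (count above baseline, likeliness of that count or more, alert?)."""
    above = sum(1 for s in samples if s > baseline_value)
    likeliness = prob_at_least(above, len(samples), fraction_above)
    return above, likeliness, likeliness < significance


measured_ms = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]
above, likeliness, alert = percentile_drift(measured_ms, baseline_value=300)
print(above, round(likeliness, 3), alert)
```

The same function could be applied to other percentiles by changing fraction_above, for example 0.03 for values above the 97th percentile.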
Load Anomalies
Are we seeing unusually high or low load?
We will look at three types of data
Seasonality: Load is often directly related to time-based usage.
Trend: Growth patterns are not necessarily the source of a problem.
We need a different approach
Holt-Winters Seasonal Forecasting
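The slide names the method without showing details. The following is a minimal sketch of the textbook additive Holt-Winters variant (level, trend and seasonal components); the initialization from the first two seasons, the smoothing parameters and the load values are assumptions for illustration and not necessarily what the talk used.

```python
def holt_winters_additive(series, season_length, alpha, beta, gamma, horizon):
    """Forecast `horizon` future points with additive Holt-Winters smoothing."""
    m = season_length
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    # Initial estimates taken from the first two seasons (one common heuristic).
    first_mean = sum(series[:m]) / m
    second_mean = sum(series[m:2 * m]) / m
    level = first_mean
    trend = (second_mean - first_mean) / m
    seasonals = [x - first_mean for x in series[:m]]

    # Update level, trend and seasonal components for every later observation.
    for t in range(m, len(series)):
        x = series[t]
        season = seasonals[t % m]
        previous_level = level
        level = alpha * (x - season) + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonals[t % m] = gamma * (x - level) + (1 - gamma) * season

    n = len(series)
    return [level + h * trend + seasonals[(n + h - 1) % m] for h in range(1, horizon + 1)]


# Hypothetical load (requests per interval) with a season of four intervals.
load = [10, 20, 30, 20] * 3
forecast = holt_winters_additive(load, season_length=4, alpha=0.5, beta=0.5, gamma=0.5, horizon=4)
print(forecast)
```

Observed load can then be compared with the forecast for the same interval, so that a regular daily peak or steady growth is not flagged on its own.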
Example
Causality Analysis of Anomalies
How to derive meaningful information from anomalies.
Anomalies vs. Health
Anomaly: A system shows an anomaly if it does not exhibit the expected behavior.
Health: A system is unhealthy if it does not operate within well-defined boundaries.
Health and Anomaly Matrix
               Healthy              Unhealthy
No Anomalies   Operating normally   Unstable System
Anomalies      Resilient            Operational issues
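The matrix can be expressed as a simple lookup. This is only an illustrative sketch; the function and parameter names are hypothetical.

```python
def classify_system(healthy: bool, has_anomalies: bool) -> str:
    """Map the two observations to the cell of the health and anomaly matrix."""
    matrix = {
        (True, False): "Operating normally",
        (False, False): "Unstable System",
        (True, True): "Resilient",
        (False, True): "Operational issues",
    }
    return matrix[(healthy, has_anomalies)]


print(classify_system(healthy=True, has_anomalies=True))
```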
Judging Anomalies by Impact
1st Degree Anomaly - CPU saturation on a host, or similar
2nd Degree Anomaly - Application functionality affected
3rd Degree Anomaly - Externally visible effects - users notice the problem
Relationships of anomalies
Transferring system knowledge to monitoring systems
The model
Interpretation with expert knowledge
Strong Relationship: Response time slowdown impacted by CPU saturation
Potential Relationship: Response time slowdown potentially impacted by code deployment
No Relationship: CPU saturation not impacted by load drop
Distinguish Impact from Cause
How to infer root cause information from monitoring data
Automated Analysis of Problems
Service slowdown
Dependent services slow down
Users are affected
Analyze dependencies
Exclude non-relevant services
Follow causality chain
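The three analysis steps can be sketched as a walk over a service dependency graph. This is a hypothetical illustration, not the implementation from the talk: the service names, the graph and the set of anomalous services are made up, and a real system would also weigh the strength of each relationship.

```python
def find_root_causes(start, dependencies, anomalous):
    """Follow anomalous dependencies from `start` and return the ends of the chain."""
    root_causes = []
    visited = set()
    stack = [start]
    while stack:
        service = stack.pop()
        if service in visited:
            continue
        visited.add(service)
        # Exclude non-relevant services: only anomalous dependencies are followed.
        relevant = [d for d in dependencies.get(service, []) if d in anomalous]
        if relevant:
            stack.extend(relevant)  # follow the causality chain
        elif service in anomalous:
            root_causes.append(service)  # anomalous, with no anomalous dependency
    return sorted(root_causes)


# Hypothetical topology: each service maps to the services it calls.
dependencies = {
    "frontend": ["checkout", "search"],
    "checkout": ["payment", "inventory"],
    "payment": ["database"],
    "search": [],
    "inventory": [],
    "database": [],
}
anomalous = {"frontend", "checkout", "payment", "database"}
print(find_root_causes("frontend", dependencies, anomalous))
```

In this made-up topology the slowdown seen by users at the frontend is traced through checkout and payment to the database, while search and inventory are excluded because they show no anomaly.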
Real World Example
Alois [email protected]@ruxit.com
blog.ruxit.com