Better service monitoring through histograms
Fred Moyer - @phredmoyer
San Francisco Perl Mongers, 07-26-2016
Systems break while we sleep
How often are you woken up for false alarms?
Welcome
Synthetics
Easy to set up, but not a real user
Synthetics
Stephen Falken: Uh, uh, General, what you see on these screens up here is a fantasy; a computer-enhanced hallucination. Those blips are not real missiles. They're phantoms. (War Games, 1983)
Real Users
These are your users, right?
Real data
Real Users
500 ms is really 2,000 ms
Spike Erosion
What threshold do you choose?
Threshold Alerting
“Alert me if requests take longer than 200 ms”
10,10,10,10,10,10,10,10,10,5000
Alerts on one outlier in 10
Threshold Alerting
“Alert if request average over one minute is longer than 200 ms”
avg(10,10,210,210,210,210) = 143 (860/6)
Does not alert on multiple high samples
Threshold Alerting
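A minimal sketch of the two threshold rules above, using the sample sets from the slides. The function names are hypothetical; only the 200 ms threshold and the data come from the talk.

```python
from statistics import mean

THRESHOLD_MS = 200  # threshold quoted on the slides


def alert_on_any_sample(samples, threshold=THRESHOLD_MS):
    """Fire if any single request exceeds the threshold."""
    return any(s > threshold for s in samples)


def alert_on_mean(samples, threshold=THRESHOLD_MS):
    """Fire if the arithmetic mean of the window exceeds the threshold."""
    return mean(samples) > threshold


one_outlier = [10] * 9 + [5000]             # one slow request in ten
mostly_slow = [10, 10, 210, 210, 210, 210]  # four of six requests over 200 ms

print(alert_on_any_sample(one_outlier))  # True: woken up for a single outlier
print(alert_on_mean(mostly_slow))        # False: mean is 860/6, about 143 ms
```

The per-sample rule is too noisy and the mean hides a window in which most requests were slow.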
‘average’ eq ‘arithmetic mean’
A = S/N
A = average
S = the sum of the numbers in the set
N = the number of terms
Math Refresher
median = midpoint of data set
The 50th percentile is 555 - q(0.5)
Value     111  222  333  444  555  666  777  888  999
Sample #    1    2    3    4    5    6    7    8    9
Math Refresher
90th percentile - 90% of samples below it
The 90th percentile is 1,000 - q(0.9)
Value     111  222  333  444  555  666  777  888  999  1,000  1,111
Sample #    1    2    3    4    5    6    7    8    9     10     11
Math Refresher
100th Percentile - the maximum value
The 100th percentile is 1,111 - q(1)
Value     111  222  333  444  555  666  777  888  999  1,000  1,111
Sample #    1    2    3    4    5    6    7    8    9     10     11
Math Refresher
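The three percentile slides can be reproduced with a nearest-rank quantile. This is a sketch under one assumption: that q(p) means "the smallest sample with at least a fraction p of the data at or below it". The slides do not say which definition they use, and interpolating definitions give different answers on small sets.

```python
import math


def q(p, samples):
    """Nearest-rank quantile for 0 < p <= 1 (assumed definition, see above)."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered))  # 1-based rank into the sorted samples
    return ordered[rank - 1]


nine = [111, 222, 333, 444, 555, 666, 777, 888, 999]
eleven = nine + [1000, 1111]

print(q(0.5, nine))    # 555, the median
print(q(0.9, eleven))  # 1000
print(q(1, eleven))    # 1111, the maximum
```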
[Figure: number of samples plotted against sample value]
Histogram
[Figure: number of samples plotted against sample value]
Normal Distribution
[Figure: number of samples plotted against sample value]
Normal Distribution
68% within one sigma (σ) of the mean, 34% on each side
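A quick check of the one-sigma figure, using the standard library's `statistics.NormalDist`. About 34% of a normal distribution lies between the mean and one sigma on one side, which makes about 68% within one sigma on both sides.

```python
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)

one_side = standard.cdf(1.0) - standard.cdf(0.0)     # mean to +1 sigma
both_sides = standard.cdf(1.0) - standard.cdf(-1.0)  # -1 sigma to +1 sigma

print(f"{one_side:.1%} on one side, {both_sides:.1%} within one sigma")
```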
[Figure: number of samples plotted against sample value]
Non-Normal Distribution
[Figure: number of samples plotted against sample value]
Non-Normal Distribution
Non-Normal Distribution
Operations data groups at different points
Non-Normal Distribution
Users to the right of the red line are gone
Request latency
“We keep hearing from people that the website is slow. But it is fine when we test it, and the request latency graph is constant”
You are only looking at part of the picture.
Heat Map
Histograms over time windows
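One way to picture "histograms over time windows": bucket each latency sample by value, and keep one set of bucket counts per window. The sketch below is illustrative only; the 60 second window, the 100 ms bucket width, the function name, and the sample data are all assumptions.

```python
from collections import Counter, defaultdict


def heatmap(samples, window_s=60, bucket_ms=100):
    """Group (timestamp_s, latency_ms) pairs into one histogram per time window."""
    windows = defaultdict(Counter)
    for timestamp_s, latency_ms in samples:
        window_start = int(timestamp_s // window_s) * window_s
        bucket_start = int(latency_ms // bucket_ms) * bucket_ms
        windows[window_start][bucket_start] += 1
    # Sort windows and buckets so the result reads left to right, low to high.
    return {w: dict(sorted(h.items())) for w, h in sorted(windows.items())}


samples = [(0, 12), (5, 18), (30, 250), (61, 15), (90, 980), (119, 940)]
for window_start, histogram in heatmap(samples).items():
    print(window_start, histogram)
# 0 {0: 2, 200: 1}
# 60 {0: 1, 900: 2}
```

Each row is one column of the heat map: the count in a bucket becomes the color intensity of that cell.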
Percentiles
Practical Percentiles
Bandwidth usage is often billed at 95th percentile usage
Record 5 minute data usage intervals
Sort samples by value of sample
Throw out the highest 5% of samples
Charge usage based on the remaining top sample, e.g. 300 MB transferred over 5 minutes = 1 MB/s rate billing
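The billing steps above can be sketched as follows. The function name and the usage figures are made up for illustration; real providers differ in how they round the 5% cut.

```python
INTERVAL_S = 5 * 60  # each sample covers a 5 minute interval


def billable_rate_mb_per_s(usage_mb):
    """usage_mb: MB transferred in each 5 minute interval of the billing period."""
    if not usage_mb:
        raise ValueError("need at least one interval")
    ordered = sorted(usage_mb)
    discard = len(ordered) * 5 // 100  # number of samples in the highest 5%
    top_remaining = ordered[len(ordered) - discard - 1]
    return top_remaining / INTERVAL_S


# 100 intervals: mostly quiet, five moderate, five large bursts that get discarded.
usage = [100] * 90 + [300] * 5 + [3000] * 5
print(billable_rate_mb_per_s(usage))  # 1.0, i.e. 300 MB over 300 s
```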
Practical Percentiles
If I measure 95th percentile per 5 minutes all month long,
I CANNOT calculate 95th percentile over the month.
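A small illustration of why per-window percentiles cannot be combined afterwards. The two windows and their sizes are invented, and q() is a nearest-rank quantile (an assumed definition).

```python
import math


def q(p, samples):
    """Nearest-rank quantile (assumed definition)."""
    ordered = sorted(samples)
    return ordered[math.ceil(p * len(ordered)) - 1]


quiet = [10] * 1000             # busy window, every request fast
slow = [10] * 10 + [500] * 10   # nearly idle window, half the requests slow

per_window = [q(0.95, quiet), q(0.95, slow)]
true_p95 = q(0.95, quiet + slow)             # computed from the raw samples
mean_of_p95 = sum(per_window) / len(per_window)

print(per_window)   # [10, 500]
print(true_p95)     # 10
print(mean_of_p95)  # 255.0, not the true value
```

The per-window numbers have lost the sample counts, so no combination of them recovers the true value. Histograms do not have this problem: bucket counts from different windows can simply be added.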
Angry users
How many users are you pissing off?
Angry users
“Alert me if the 90th percentile of request latency over one minute exceeds the threshold”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,10,10,10,5000] == 10
Alert IS NOT triggered
Do you want to be woken up for this? NO!
“Alert me if the 90th percentile of request latency over one minute exceeds the threshold”
Percentile based alerting
q(0.9)[10,10,10,10,10,10,250,300] = ~270
Alert IS triggered
Do you want to be woken up for this? YES!
Percentile based alerting
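A sketch of the percentile rule, assuming the 200 ms threshold from the earlier slides and a nearest-rank q(). The first data set is the ten-sample set from the threshold slide. With nearest-rank the second set yields 300 rather than the ~270 on the slide, which suggests the slide interpolates between 250 and 300; both values are over the threshold.

```python
import math

THRESHOLD_MS = 200  # assumed: the threshold used on the earlier slides


def q(p, samples):
    """Nearest-rank quantile (assumed definition)."""
    ordered = sorted(samples)
    return ordered[math.ceil(p * len(ordered)) - 1]


def alert_on_p90(samples, threshold_ms=THRESHOLD_MS):
    """Fire if the 90th percentile of the window exceeds the threshold."""
    return q(0.9, samples) > threshold_ms


one_outlier = [10] * 9 + [5000]
several_slow = [10, 10, 10, 10, 10, 10, 250, 300]

print(alert_on_p90(one_outlier))   # False: q(0.9) is 10, stay asleep
print(alert_on_p90(several_slow))  # True: q(0.9) is 300, wake up
```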
Who’s using this approach?
Google.com
Circonus.com
You?
Questions?
Thanks to Circonus.com for the tools and help with the math
http://www.circonus.com/free-account/