Top-Down Approach to Monitoring
![Page 1: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/1.jpg)
Top-Down Approach to Monitoring
July 30, 2015
![Page 2: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/2.jpg)
1996
Tivoli Software acquired by IBM
Patrol Software acquired by BMC
Ethan Galstad creates a simple MS-DOS application designed to "ping" Novell Netware servers
“HOW to monitor?” is the primary question
![Page 4: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/4.jpg)
Shifting from “How?” to “What?”
![Page 5: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/5.jpg)
![Page 6: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/6.jpg)
Bottom-Up Approach
Network, Servers, Apps
Overall System Health
![Page 7: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/7.jpg)
Problem #1: Inflation of Tools
![Page 8: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/8.jpg)
Problem #2: Inflation of “Whats”
![Page 9: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/9.jpg)
Problem #3: Inflation of Alerts
![Page 10: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/10.jpg)
![Page 11: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/11.jpg)
We’re trying to answer a simple question:
Is our system in a healthy state?
![Page 12: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/12.jpg)
No Alerts ≠ Healthy System
Many Alerts ≠ Unhealthy System
![Page 13: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/13.jpg)
Healthy System = A system that continuously generates value for its users under a well-known set of KPIs
![Page 14: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/14.jpg)
Top-Down Approach
KPIs, UX
Overall System Health
![Page 15: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/15.jpg)
Top-Down vs Bottom-Up
Top-Down: KPIs, UX; Overall System Health
• Selective
• Proactive
Bottom-Up: Network, Servers, Apps; Overall System Health
• Exhaustive
• Reactive
![Page 16: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/16.jpg)
Definition of KPI
A key performance indicator (KPI) is a business metric used to evaluate factors that are crucial to the success of an organization. KPIs differ per organization;
![Page 17: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/17.jpg)
Let’s play a game!
This is Sam
CPU Utilization, # Clicks on a button, Temperature
What does Sam’s company do?
![Page 18: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/18.jpg)
The Pulse of Netflix
We sought out a single indicator that closely approximated our most important activity: viewing. We discovered that a server-side metric related to playback starts (the act of “clicking play”) had both a predictable pattern and fluctuated significantly when UI/device/server problems were happening. The Netflix streaming pulse was created. We named it “SPS” for “starts per second”.
http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
![Page 19: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/19.jpg)
Healthy SPS Pattern
http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
![Page 20: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/20.jpg)
Unhealthy SPS Pattern
http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
![Page 21: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/21.jpg)
What’s so special about SPS?
• SPS is easy for all stakeholders to understand
• One metric that covers different points of failure: server problems, device problems, etc.
• Most important: it’s a clear KPI that indicates when user experience is compromised
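The quoted Netflix post describes SPS as having a predictable pattern that fluctuates when something breaks. One way to act on that property, sketched here under assumptions, is to compare the current SPS value with a baseline built from the same time slot in previous weeks and flag large relative deviations. The function names, the averaging baseline, and the 30% tolerance are all hypothetical and are not taken from the Netflix post.

```python
from statistics import mean
from typing import Sequence


def expected_sps(history: Sequence[float]) -> float:
    """Baseline SPS: the average of samples taken at the same time slot
    in previous weeks (an assumed, deliberately simple baseline)."""
    if not history:
        raise ValueError("need at least one historical sample")
    return mean(history)


def sps_is_anomalous(current: float, history: Sequence[float],
                     tolerance: float = 0.3) -> bool:
    """Return True when the current SPS deviates from the baseline by more
    than `tolerance` (a relative fraction; 0.3 is a hypothetical default)."""
    baseline = expected_sps(history)
    if baseline == 0:
        # No expected traffic: any traffic at all counts as a deviation.
        return current != 0
    return abs(current - baseline) / baseline > tolerance


# Illustrative numbers only: three previous weeks at the same time slot.
previous_weeks = [980.0, 1010.0, 1020.0]
print(sps_is_anomalous(1000.0, previous_weeks))  # close to the baseline
print(sps_is_anomalous(500.0, previous_weeks))   # roughly half the baseline
```

Because the check is relative to the expected pattern rather than to a fixed threshold, the same rule can apply at peak and off-peak hours.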
![Page 22: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/22.jpg)
But what about root cause analysis?
KPIs, UX
Overall System Health
Network, Servers, Apps
![Page 23: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/23.jpg)
GitHub: need for speed
The most important factor in web application design is responsiveness. And the first step toward responsiveness is speed. But speed within a web application is complicated.
https://github.com/blog/1252-how-we-keep-github-fast
![Page 24: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/24.jpg)
Start from the Top: Response Times Dashboard
https://github.com/blog/1252-how-we-keep-github-fast
• Each row represents a different major component
• Clicking one of the rows lets you dive in and see the mean, 98th percentile, and 99.9th percentile response times
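To make the three dashboard numbers concrete, here is a minimal sketch of computing the mean, 98th percentile, and 99.9th percentile from a list of response times. It uses the nearest-rank percentile definition, which is an assumption; the slides do not say how GitHub computes its percentiles, and the function names are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Summary of response times in the shape of the dashboard drill-down."""
    return {
        "mean": mean(samples),
        "p98": percentile(samples, 98),
        "p99.9": percentile(samples, 99.9),
    }


# Illustrative data: 1000 requests taking 1 ms, 2 ms, ..., 1000 ms.
response_times_ms = list(range(1, 1001))
print(summarize(response_times_ms))
```

The high percentiles matter because a mean can look fine while a small fraction of requests is very slow.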
![Page 25: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/25.jpg)
Digging Deeper: Mission Control Bar
https://github.com/blog/1252-how-we-keep-github-fast
Total Time, Render Time, Cache & Database, JS & CSS Size
![Page 26: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/26.jpg)
And Deeper
https://github.com/blog/1252-how-we-keep-github-fast
Render Breakdown
SQL Query Viewer
![Page 27: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/27.jpg)
Why talk about BigPanda?
Because Pandas are awesome!
![Page 28: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/28.jpg)
BigPanda
Because...
• We’re not Netflix or GitHub: we’re a growing startup (7 devs, 1 full-time Ops)
• We feel the pain!
• Our KPIs are easy to describe and understand (especially if you’re an Ops person)
![Page 29: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/29.jpg)
BigPanda
As a unified dashboard on top of all your monitoring systems, and eventually a single point of truth for production incidents, our data pipeline has to be reliable and fast.
KPI: Low data pipeline latency
![Page 30: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/30.jpg)
Pipeline Latency Metric
• Metrics are sent from within the apps
• Stored in Graphite
• Sum of all the average latencies of all alerts that went through the pipeline
• Monitored by Nagios
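The flow above (the app emits a metric, Graphite stores it, Nagios checks it) can be sketched roughly as follows. The code formats a sample in Graphite's plaintext protocol (`path value timestamp`, conventionally sent to port 2003) and evaluates a latency value using the standard Nagios plugin exit codes. The metric path, host, and thresholds are hypothetical placeholders, not BigPanda's actual configuration.

```python
import socket
from typing import Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one sample in Graphite's plaintext protocol."""
    return f"{path} {value} {timestamp}\n"


def send_to_graphite(line: str, host: str = "localhost", port: int = 2003) -> None:
    """Send a formatted sample to a Carbon plaintext listener.
    Host and port are assumptions; adjust to the actual deployment."""
    with socket.create_connection((host, port), timeout=2.0) as sock:
        sock.sendall(line.encode("ascii"))


def check_pipeline_latency(latency_seconds: float,
                           warn_after: float = 30.0,
                           crit_after: float = 60.0) -> Tuple[int, str]:
    """Map a latency value to a Nagios status (thresholds are hypothetical)."""
    if latency_seconds >= crit_after:
        return CRITICAL, f"CRITICAL - pipeline latency {latency_seconds:.1f}s"
    if latency_seconds >= warn_after:
        return WARNING, f"WARNING - pipeline latency {latency_seconds:.1f}s"
    return OK, f"OK - pipeline latency {latency_seconds:.1f}s"


# Hypothetical metric path and sample; nothing is sent over the network here.
print(graphite_line("bigpanda.pipeline.latency", 1.5, 1438214400), end="")
print(check_pipeline_latency(45.0))
```

A real Nagios plugin would print the status message and exit with the returned code, reading the latest value from Graphite rather than taking it as an argument.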
![Page 31: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/31.jpg)
Pipeline Latency Metric
• Very good indicator of a possible service outage
• Must-have for detecting SLA violations
• Very good indicator of performance bottlenecks (can be broken down by sub-pipeline, specific organization, etc.)
• Simple and high-level: easy to explain to non-technical stakeholders (e.g. sales)
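As a rough illustration of breaking the metric down, the sketch below treats the top-level number as the sum of per-stage average latencies and then picks the stage that contributes most. This is one possible reading of the slides, and the stage names and sample values are invented for the example.

```python
from statistics import mean
from typing import Dict, List


def stage_averages(samples: Dict[str, List[float]]) -> Dict[str, float]:
    """Average latency per pipeline stage (stage names are hypothetical)."""
    return {stage: mean(values) for stage, values in samples.items()}


def total_latency(averages: Dict[str, float]) -> float:
    """Top-level KPI: the sum of the per-stage averages."""
    return sum(averages.values())


def bottleneck(averages: Dict[str, float]) -> str:
    """The stage with the highest average latency."""
    return max(averages, key=averages.get)


# Invented latency samples, in seconds, for three made-up stages.
samples = {
    "ingest": [0.25, 0.75],
    "normalize": [1.0, 3.0],
    "correlate": [0.5, 0.5],
}
averages = stage_averages(samples)
print(total_latency(averages))  # the single number shown at the top
print(bottleneck(averages))     # where to dig when that number rises
```

The top-level number is what gets alerted on; the per-stage view is only consulted once it moves.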
![Page 32: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/32.jpg)
TL;DR
• The bottom-up approach (“monitor all the things”) is easier to start with, but soon leads to alert fatigue and disorientation.
• The top-down approach requires thought and custom instrumentation, but keeps you focused on what’s important.
• High-level metrics can be complemented by low-level metrics. Trying to deduce the former from the latter is futile.
• Take advantage of the rich monitoring landscape, but as a means to an end. Don’t let the tools dictate what you need to measure.
• Monitoring is, first of all, about your business.
![Page 33: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/33.jpg)
Questions?
![Page 34: Top-Down Approach to Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022042514/55d3390dbb61eb78068b46b2/html5/thumbnails/34.jpg)
Thanks!