That Conference 2017: Refactoring your Monitoring
Jamie Riedesel, DevOps Engineer
@sysadm1138
Route-Planning your Monitoring Stack Climb
@sysadm1138 | ThatConference 2017
Today’s Climb
Overview
Your monitoring stack
Deciding what to monitor
The monitoring project-plan
Extra: Humane on-call rotations
Your Monitoring Stack
LEARNING THE TERRITORY
This is your stack. Really
Polling Engine
Aggregation Engine
Alerting Engine
Reporting Engine
User Interface
API
Policy Engine
Humans
Scheduled-tasks & PowerShell
Scheduler runs scripts on a schedule.
Scripts emit email or update a database.
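The scheduled-task pattern above can be sketched in a few lines. This is an illustration only (in Python rather than PowerShell, for brevity): the database path and threshold are invented, and a real script would use smtplib to emit the email.

```python
"""Sketch of a scheduler-driven monitoring script: poll something,
update a database, complain when a threshold is crossed."""
import shutil
import sqlite3
import time

DB_PATH = "metrics.db"      # hypothetical metrics database
ALERT_THRESHOLD_PCT = 90    # hypothetical "disk too full" threshold

def disk_used_pct(path="/"):
    """Percent of the filesystem at `path` currently in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def record_sample(conn, pct):
    """Update the database, as the slide suggests."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS disk_samples (ts REAL, used_pct REAL)")
    conn.execute("INSERT INTO disk_samples VALUES (?, ?)", (time.time(), pct))
    conn.commit()

def main():
    pct = disk_used_pct("/")
    record_sample(sqlite3.connect(DB_PATH), pct)
    if pct > ALERT_THRESHOLD_PCT:
        # A real script would emit email here instead of printing.
        print(f"ALERT: disk {pct:.1f}% full")

if __name__ == "__main__":
    main()
```

Cron or Task Scheduler runs this every few minutes; that scheduler plus this script is a complete, if tiny, polling engine.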
“Full-Stack”
SolarWinds.
Zenoss.
Nagios.
Open-Source Medley
Nagios + Graphite + Grafana
Logstash + InfluxDB + Kibana + Bosun
Graylog + New Relic + Hash.io + DataDog
Nexosis + Go + OpenTSDB + Grafana
Polling Engine
The whatever that fetches data.
● SNMP agents
● WMI endpoints
● Nagios agent
● Solarwinds agent
● Powershell scripts
● Bash scripts
● Polling Engines in Nagios & SolarWinds
● Daily runbooks and spreadsheets
Aggregation Engine
Turns raw data into useful data.
● Summarizes over time (think RRDTool)
● Does stats (min/max/%-tile) on incoming
stream.
● Summarizes over system/rack/datacenter
No one (except possibly Google) keeps full
granularity monitoring logs forever and ever in
a trivially queryable way. Too expensive, and
you don’t usually care about 2 years ago.
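The summarization the slide describes can be sketched as a rollup function: collapse raw (timestamp, value) samples into fixed-width time buckets, keeping only summary statistics. Everything here (bucket width, which stats to keep) is illustrative, not a recommendation.

```python
"""Toy aggregation engine: RRDTool-style time-bucket rollups."""
from collections import defaultdict

def rollup(samples, bucket_seconds=60):
    """samples: iterable of (unix_ts, value). Returns {bucket_start: stats}."""
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket = int(ts // bucket_seconds) * bucket_seconds
        buckets[bucket].append(value)
    summary = {}
    for bucket, values in buckets.items():
        ordered = sorted(values)
        # Nearest-rank 95th percentile; a real engine may interpolate.
        p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
        summary[bucket] = {
            "min": ordered[0],
            "max": ordered[-1],
            "avg": sum(ordered) / len(ordered),
            "p95": p95,
        }
    return summary
```

Once a bucket is summarized, the raw samples can be discarded, which is exactly why full-granularity data doesn't live forever.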
Alerting Engine
Bothering humans in realtime!
● May do analytics.
● May be threshold-based, or trigger on
very sophisticated conditions.
● Scripts that send email every time.
● Scripts that drop notices in group-chat.
● Night-operator calling the Systems
Engineers
● PagerDuty.
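A minimal sketch of the threshold-based case above: only fire when a metric stays over its threshold for several consecutive polls, which damps one-sample spikes. The threshold and window are made up.

```python
"""Toy threshold-based alerting check with a sustained-duration damp."""

def should_alert(recent_values, threshold, sustained=3):
    """True when the last `sustained` samples all exceed `threshold`."""
    if len(recent_values) < sustained:
        return False
    return all(v > threshold for v in recent_values[-sustained:])

def notify(message, channel="chat"):
    """Stand-in for dropping a notice in group-chat or paging someone."""
    print(f"[{channel}] {message}")
```

The "very sophisticated conditions" on the slide replace `should_alert` with anomaly detection or multi-signal rules; the shape stays the same.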
Reporting Engine
Bothering humans on a lag!
● Long-term trends
● Capacity analysis
● Growth tracking
● Full-bore big-data analytics
● SLA pass/fail reporting
● Track user behaviors across features
● BA building reports for executives
API
Programmatic interfaces into your monitoring
system.
● Build feedback systems
● Manage policy-engine details
● Could be your CM system
Good monitoring systems have APIs. It makes
them easier to integrate with. And integration is
usage.
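As one hedged example of "build feedback systems": a deploy script could call the monitoring API to open a maintenance window so alarms stay quiet during a rollout. The endpoint shape and payload fields here are invented; every real product's API differs.

```python
"""Sketch of a deploy tool building a maintenance-window request for a
hypothetical monitoring API (e.g. POST /api/maintenance)."""
import json
import time

def maintenance_window_request(host, minutes, reason):
    """Return the JSON body a deploy tool might POST."""
    now = int(time.time())
    body = {
        "host": host,
        "start": now,
        "end": now + minutes * 60,
        "reason": reason,
    }
    return json.dumps(body)
```

The point is the integration, not the payload: once deploys, CM runs, and incident tooling all talk to the monitoring system, it becomes part of the workflow instead of a thing people forget to check.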
User Interface
How humans interface with it.
A monitoring system with a bad user interface is a bad monitoring system.
- Jamie Riedesel, lots of times
I’ve seen things.
User Interface
To access a previous job’s monitoring system:
1. Open a browser.
2. Log in using 2-factor to our SSL-VPN.
3. Connect to RDP using same password as VPN.
4. Open another browser.
5. Hit Monitoring site.
6. Using a non-SSO-ed password, log in.
7. See what’s going on.
Policy Engine
This defines the behavior of each stage of the
stack.
Configured as part of the User Interface and
API.
Policy Engine + Polling Engine
● How often are things polled?
○ Every 10s, 1m, 2m, 5m, 1d?
● Does polling get paused for
maintenance-windows?
● What data gets reported to the
Aggregation Engine?
Policy Engine + Aggregation Engine
● How long do you keep data at all?
● How long do you keep full granularity
data?
● How long do you keep summarized data?
● Where do you keep full granularity data?
● Where do you keep summarized data?
● How do you summarize data?
○ Time? System? Location?
● Do maintenance windows affect any of these?
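The retention questions above can be answered as data rather than prose: a tiered policy mapping sample age to the resolution you keep. The tiers below are examples for illustration, not recommendations.

```python
"""Sketch of a tiered retention policy: older data gets coarser, then
dropped. Tier values are hypothetical."""

# (max age in days, resolution kept)
RETENTION_TIERS = [
    (7, "10s"),     # one week of full granularity
    (90, "5m"),     # three months of 5-minute rollups
    (730, "1h"),    # two years of hourly rollups, then drop
]

def resolution_for_age(age_days):
    """Return the resolution kept for data this old, or None if purged."""
    for max_age, resolution in RETENTION_TIERS:
        if age_days <= max_age:
            return resolution
    return None
```

Writing the policy down like this makes the "how long" and "how summarized" questions explicit and reviewable, instead of whatever the tool's defaults happened to be.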
Policy Engine + Alerting Engine
● Which alarms merit bothering humans?
● Which alarms merit automatic fixing?
● Which alarms can be ignored?
● How do maintenance-windows impact
alarms?
● What escalation policies are in place?
Policy Engine + Reporting Engine
● Do reports get automatically generated?
● What reports are viewable on-demand?
● What reports are defined?
● Are ad-hoc reports possible?
● Who gets automatically generated
reports?
● What trends are we looking for?
That’s cleared up!
Deciding What To Monitor
PLANNING THE APPROACH
Different Kinds of Monitoring
Granularity and goals differ
from type to type. Be aware of
these as you build your system.
Performance Monitoring
Operational Monitoring
Capacity Monitoring
SLA Monitoring
Performance Monitoring
Granularity: Very high (10s, 1s, or even sub-second)
Duration: As-needed
Response: Realtime
Tools: Procmon, Wireshark, strace, perf, Performance Monitor, gdb
Typically done as part of debugging, troubleshooting, and profiling activities. Granularity is much
higher than operational monitoring. Typically, results are reviewed in near realtime and not
persisted long.
Operational Monitoring
Granularity: Medium (1m, 2m, 5m, 10m, 1h, etc)
Duration: Continuous.
Response: Rapid.
Tools: Dell OpenManage, HP Operations Manager, Cisco OpManager, NetApp
What most people think of when you say monitoring (but they’re wrong). This type of monitoring
catches the health of your infrastructure and is not directly related to the services it provides.
Think disk replacements, switch failures, and tornadoes.
Operational Monitoring
This one is easy
The SLA for this is: our infrastructure can support the
delivery of our products and services.
● Switch failures.
● Disk failures.
● Blade-chassis failures.
● UPS failures.
● PSU / PDU failures.
● Compliance failures.
Capacity Monitoring
Granularity: Low (1h, 1d, 1w, 1m)
Duration: Continual or occasional
Response: Slow
Tools: Grafana, Kibana, Graphite, Nagios, Excel
Monitoring the capacity of your system to do work. Lead times can be quite long for some
replacements (SAN arrays), and capacity can be budgetary more than hardware, especially in
cloud contexts.
Capacity Monitoring
How much do I need, and when do I need it?
Every product or service uses consumables. This is
where you track them:
● Disk-space
● Cloud budget
● Overtime allowance
● P1 incident usage
● SmartHands budget
Service Level Agreement Monitoring
Granularity: Medium to Low
Duration: Continual
Response: Rapid and Slow
Tools: Everything
Monitoring to detect whether or not you’re meeting your SLA for a given service or services.
Where most monitoring really exists.
SLA Monitoring
This one is complicated
How your product or service is supposed to perform. Not
just executives care about SLAs.
SLA: Service Level Agreement
SLO: Service Level Objectives
SLI: Service Level Indicators
We’ll get into these.
What if we don’t have SLAs? That’s like… commitment. We avoid that around here!
Yes, you have an SLA
No, really. You do.
DE FACTO SERVICE LEVEL AGREEMENT
The service is up when our users need it to be.
And if it isn’t, they’re allowed to slag us on Twitter.
In short, 100% uptime or your reputation will be hauled through the meat-grinder.
DEFINED SERVICE LEVEL AGREEMENT
We promise X availability, on penalty of Y things, outside of Q maintenance periods. Planned outages will have no less than Z days notice...
Less likely to end up as a meme on Twitter. This can be 100% an internal-only document!
DEFINITIONS
Service Level Agreement (SLA): An agreement, written in Human; or sometimes Lawyer. Sets goalposts, defines penalties (if any), defines terms.
Service Level Objective (SLO): A set of objectives, written in Engineer. Technical definition of the goalposts in the SLA.
Service Level Indicator (SLI): Something that tells you whether or not you’re meeting your SLO.
SLOs - SERVICE LEVEL OBJECTIVES
SLA: The service is up 99.99% of the time, not including scheduled maintenance.
● The settings page renders in under 10 seconds.
● The site returns HTTP-200 from Europe within 2 seconds.
● Branch-office ADC01 can reach the service.
● 98%-tile end to end request time is not more than 3 seconds.
● The SSL certificate is valid and chains to our CA.
● The text, “Welcome to Example Co,” is on the main page.
SLOs - SERVICE LEVEL OBJECTIVES: HasDCDoneSomethingStupidToday.com
SLA: The site is up 99.99% of the time, not including scheduled maintenance.
SLO:
● Site is reachable.
● The site is showing the right content.
● Scheduled maintenance is tracked.
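The first two SLOs above reduce to small, mechanical checks. A hedged sketch (the third SLO, tracking scheduled maintenance, lives in the policy engine rather than a check); the expected text is whatever the "right content" is for the site, passed in by the caller:

```python
"""Toy SLO evaluation for one fetched page. Takes the already-fetched
status code and body so no network I/O is needed here."""

def check_slo(status_code, body, expected_text):
    """Return a dict of SLO-name -> pass/fail for one page fetch."""
    return {
        "site_reachable": status_code == 200,
        "showing_right_content": expected_text in body,
    }
```

Each pass/fail result is itself an SLI sample; the alerting and reporting engines consume streams of these.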
SLOs - SERVICE LEVEL OBJECTIVES: University Print Services
SLA: Printing is available in Computer Labs 99.99% of the time, outside of scheduled closures and maintenance.
SLO:
● Every Computer Lab has at least one working printer with paper.
● Printers service only the central print queues.
● The swipe-card terminal in Computer Labs must work for the printers to be considered ‘working’.
● Printers do not work if they can’t talk to the payment processor.
SLIs - SERVICE LEVEL INDICATORS: Specific monitorables!
SLO: The settings page renders in under 10 seconds.
SLI:
● Logins work.
● Page render-time from same data-center.
● Page render-time from Europe.
● Database disk-queue length.
SLIs - SERVICE LEVEL INDICATORS: Specific monitorables!
SLO: 98%-tile end to end request time is not more than 3 seconds.
SLI:
● Time-to-process for all requests.
● Request processing is functional at least 30 seconds ago.
● 10 minute 98th percentile request-time average.
● 10 minute 50th percentile request-time average.
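The percentile SLIs above are easy to compute once the raw request times exist. A sketch under stated assumptions: nearest-rank percentiles (a production system may interpolate), and a flat list standing in for the real sample store.

```python
"""Compute the 98th/50th-percentile SLIs over a sliding 10-minute
window, and check them against the SLO."""

WINDOW_SECONDS = 600  # the 10-minute window from the SLI

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = min(len(ordered) - 1, int(pct / 100.0 * len(ordered)))
    return ordered[rank]

def window_slis(samples, now):
    """samples: (unix_ts, request_seconds) pairs. Returns both SLIs."""
    recent = [t for ts, t in samples if now - ts <= WINDOW_SECONDS]
    return {"p98": percentile(recent, 98), "p50": percentile(recent, 50)}

def slo_met(slis, limit_seconds=3.0):
    """SLO: 98%-tile request time is not more than 3 seconds."""
    return slis["p98"] <= limit_seconds
```

Tracking the 50th percentile alongside the 98th is what lets a report distinguish "everything got slower" from "a small tail got much worse."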
DEFINITIONS
Alarm: Informing humans of failing SLI/SLOs in realtime.
Report: Eventually informing humans of failing SLI/SLOs.
Which humans do you bother for each SLI/SLO? Only you can figure that out!
GOOD ALARMS
Specific: Must tell me something specific is wrong.
Alarms that require a human to log in to figure out what is actually wrong, if anything is, are bad alarms.
FYI alarms lead to high cognitive load and decrease worker satisfaction.
GOOD ALARMS
Actionable: Must be something I can directly fix.
Getting alarmed for things you can’t fix is a great road to burnout. These are especially great at 3:19 AM.
The failure mode is teaching people that some alarms can be ignored safely. Eventually, they’ll ignore the wrong one. This is bad.
GOOD ALARMS
Format Agnostic: Don’t be a dick about format.
If a team wants full HTML with links to runbooks and wiki-pages, let ‘em.
If a team wants the entire alert to fit into their iPhone lock-screen, let ‘em.
Better, allow both!
GOOD ALARMS
Specific.
Actionable.
In the format you want.
The Monitoring Project-Plan
MAKING THE ASCENT
Step 0: Get Approval For The Project
● If it’s just you, that’s easy! Do it.
● A good monitoring product is used by many people
○ Get buy-in from not just IT, but sales, support, etc.
● Pitch the business case, not process improvement for your department.
○ We will reduce customer churn by enabling our CSMs.
○ We will improve our reaction time to reputation-impacting events.
○ This will increase buy-in from other departments, enabling our IT
goals
Step 1: Figure out high-level needs (SLA)
● If you have a written one? Great! Work backwards from that.
● If you have an unwritten one, ask people to see what they think it is.
○ Play 20-questions with higher-level execs on the impacts of downtime
and service degradations.
○ Point out the de facto SLA, see how they react.
○ Point out we don’t need to publish the SLA to our customers, but can
have one internally.
● If you have microservices, each service will need its own SLA.
Step 2: Figure out concrete definitions (SLO)
● Now that you have an SLA, or many SLAs, do the analysis to determine
what ‘up’ and ‘responsive’ mean in a concrete way.
● Ask other people to get involved. Involvement keeps the project rolling.
● This is an opportunity for education with business leaders.
Step 3: Figure out specific monitorables (SLI)
● Take your SLO list and figure out how to monitor for each.
● You may need to monitor new things.
● You may be able to stop monitoring/alarming some other things.
● Magic happens: your first opportunity to turn off existing alarms!
Step 4: Figure out how to monitor those things
● Some of this may already exist. If so, cool.
● Some may need to be monitored in a different way.
● Some may need to be monitored for the first time.
● This defines how the Polling Engine works.
● Build new engines if you need to.
● Poll direct measurements where you can; try not to use proxy
measurements.
Step 5: Decide on your aggregation techniques
● Some of this may already exist. If so, cool.
● Perhaps you don’t need to keep data as long as you thought.
● Perhaps you need to keep high granularity data longer than you thought.
● Perhaps you need to start tracking things like percentiles and standard-deviations.
● This defines how the Aggregation Engine works.
Step 6: Alert Definition (Operational/SLA monitoring)
● Some of this may already exist. If so, cool.
● Figure out who needs to know what and how fast they need to know it.
● One person shop? Easy!
● Ops team of 80? There will be meetings.
○ Work with each group individually.
○ Be flexible with requirements in each.
○ Don’t force communications-format standards without good cause.
○ Ensure the alarms are specific and actionable.
Step 7: Report Definition (Capacity/SLA monitoring)
● Some of this may already exist. If so, cool.
● Figure out how to write the pass/fail report for your SLAs.
● Determine what kind of response-times are needed to address SLA risks.
● Determine what kind of response-times are needed for capacity risks.
● Determine who gets what.
Step 8: Periodic Review
● Run the system for a while.
● Come back 3 months, 6 months later and ask questions.
○ How are the alarms working for you?
○ What changes do you think need to be made?
○ What new things have shown up?
● Especially important for departments that haven’t been attached to a
monitoring system before.
Step 0: Get approval
Step 1: Figure out high level needs (Service Level Agreement)
Step 2: Turn that into concrete definitions (Service Level Objectives)
Step 3: Figure out specific monitorables (Service Level Indicators)
Step 4: Decide how to monitor it (Polling Engine)
Step 5: Determine aggregation requirements (Aggregation Engine)
Step 6: Define Alerts (Operational and SLA monitoring)
Step 7: Define Reports (Capacity and SLA monitoring)
Step 8: Periodic Review
Step 9, Project Maintenance: Post-Incident Review Questions
1. Did the monitoring system see the problem?
2. Did we react to the monitoring system, or humans?
3. Is it worth our time to catch this problem in the monitoring system?
4. What changes do we need to make, including to alerts, to deal with this in
the future?
Questions?
STACK CLIMBING