That Conference 2017: Refactoring your Monitoring


Transcript of That Conference 2017: Refactoring your Monitoring

Page 1: That Conference 2017: Refactoring your Monitoring
Page 2: That Conference 2017: Refactoring your Monitoring

Jamie Riedesel, DevOps Engineer

@sysadm1138

Route-Planning your Monitoring Stack Climb


Page 3: That Conference 2017: Refactoring your Monitoring

Today’s Climb

Overview

Your monitoring stack

Deciding what to monitor

The monitoring project-plan

Extra: Humane on-call rotations


Page 4: That Conference 2017: Refactoring your Monitoring

Your Monitoring Stack

LEARNING THE TERRITORY


Page 5: That Conference 2017: Refactoring your Monitoring

This is your stack. Really

Polling Engine

Aggregation Engine

Alerting Engine

Reporting Engine

API

User Interface

Policy Engine

Humans


Page 6: That Conference 2017: Refactoring your Monitoring

Scheduled-tasks & Powershell

Scheduler runs scripts on a schedule.

Scripts emit email or update a database.
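
A home-grown stack can be as small as one script run by cron or Task Scheduler. A minimal Python sketch of that pattern; the SMTP relay, threshold, and database path are made-up values for illustration:

```python
# disk_check.py - run from cron/Task Scheduler; the whole "monitoring stack" in one file.
# SMTP_HOST, THRESHOLD_PCT, and DB_PATH are hypothetical values; adjust for your site.
import shutil, smtplib, socket, sqlite3, time
from email.message import EmailMessage

SMTP_HOST = "mail.example.internal"
THRESHOLD_PCT = 90                    # alert when the disk is fuller than this
DB_PATH = "/var/lib/monitoring/checks.db"

def check_disk(path="/"):
    """Polling: fetch one data point."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total

def record(value):
    """Aggregation (sort of): a table a reporting script can query later."""
    with sqlite3.connect(DB_PATH) as db:
        db.execute("CREATE TABLE IF NOT EXISTS disk (ts REAL, host TEXT, pct REAL)")
        db.execute("INSERT INTO disk VALUES (?, ?, ?)",
                   (time.time(), socket.gethostname(), value))

def alert(value):
    """Alerting: email somebody every time the threshold is crossed."""
    msg = EmailMessage()
    msg["Subject"] = f"Disk {value:.0f}% full on {socket.gethostname()}"
    msg["From"] = "monitoring@example.internal"
    msg["To"] = "ops@example.internal"
    msg.set_content(f"/ is {value:.1f}% full (threshold {THRESHOLD_PCT}%).")
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    pct = check_disk()
    record(pct)
    if pct > THRESHOLD_PCT:
        alert(pct)
```

Every box from the diagram is in there somewhere: polling, aggregation, alerting, and a policy (the hard-coded threshold and the cron schedule).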


Page 7: That Conference 2017: Refactoring your Monitoring

“Full-Stack”

SolarWinds.

Zenoss.

Nagios.


Page 8: That Conference 2017: Refactoring your Monitoring

Open-Source Medley

Nagios + Graphite + Grafana

Logstash + InfluxDB + Kibana + Bosun

Graylog + New Relic + Hash.io + DataDog

Nexosis + Go + OpenTSDB + Grafana


Page 9: That Conference 2017: Refactoring your Monitoring

Polling Engine

The whatever that fetches data.

● SNMP agents

● WMI endpoints

● Nagios agent

● SolarWinds agent

● PowerShell scripts

● Bash scripts

● Polling Engines in Nagios & SolarWinds

● Daily runbooks and spreadsheets


Page 10: That Conference 2017: Refactoring your Monitoring

Aggregation Engine

Turns raw data into useful data.

● Summarizes over time (think RRDTool)

● Does stats (min/max/%-tile) on the incoming stream.

● Summarizes over system/rack/datacenter

No one (except possibly Google) keeps full granularity monitoring logs forever and ever in a trivially queryable way. Too expensive, and you don’t usually care about 2 years ago.
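
As a rough sketch of what “summarizes over time” means in practice; the 5-minute window and the statistics kept are arbitrary choices, not a prescription:

```python
# Roll raw samples up into fixed windows, keeping only summary statistics.
# A sketch of what an aggregation engine does; real ones (RRDTool, Graphite)
# also expire old windows and re-summarize them into coarser ones.
from statistics import quantiles

WINDOW = 300  # seconds: keep 5-minute summaries instead of every raw point

def summarize(samples):
    """samples: list of (unix_ts, value). Returns {window_start: summary}."""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(int(ts) // WINDOW * WINDOW, []).append(value)
    return {
        start: {
            "min": min(vals),
            "max": max(vals),
            "avg": sum(vals) / len(vals),
            # 98th percentile, the kind of stat the SLO examples later rely on
            "p98": quantiles(vals, n=100)[97] if len(vals) > 1 else vals[0],
            "count": len(vals),
        }
        for start, vals in buckets.items()
    }
```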


Page 11: That Conference 2017: Refactoring your Monitoring

Alerting Engine

Bothering humans in realtime!

● May do analytics.

● May be threshold-based, or trigger on very sophisticated conditions (a minimal rule sketch follows this list).

● Scripts that send email every time.

● Scripts that drop notices in group-chat.

● Night-operator calling the Systems Engineers.

● PagerDuty.
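
The threshold-based case can be pictured as a small rule table. A sketch only, with invented metric names; the routes stand in for email, group-chat, or a pager:

```python
# Threshold-based alert rules: metric, condition, and who gets bothered.
RULES = [
    ("disk_pct",       lambda v: v > 90,   "page-oncall"),
    ("queue_depth",    lambda v: v > 1000, "chat-only"),
    ("cert_days_left", lambda v: v < 14,   "email-team"),
]

def evaluate(metrics):
    """metrics: latest values, e.g. {"disk_pct": 93.1}.
    Yields (route, message) for every rule that fires."""
    for name, fires, route in RULES:
        value = metrics.get(name)
        if value is not None and fires(value):
            yield route, f"{name} is {value}"

if __name__ == "__main__":
    for route, message in evaluate({"disk_pct": 93.1, "queue_depth": 12}):
        print(route, "->", message)   # swap the print for your real notifiers
```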


Page 12: That Conference 2017: Refactoring your Monitoring

Reporting Engine

Bothering humans on a lag!

● Long-term trends

● Capacity analysis

● Growth tracking

● Full-bore big-data analytics

● SLA pass/fail reporting

● Track user behaviors across features

● BA building reports for executives


Page 13: That Conference 2017: Refactoring your Monitoring

API

Programmatic interfaces into your monitoring system.

● Build feedback systems

● Manage policy-engine details

● Could be your CM system

Good monitoring systems have APIs. It makes them easier to integrate with. And integration is usage.
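
For example, a deploy pipeline might call the monitoring API to open a maintenance window before a rollout. The endpoint, payload, and token below are entirely hypothetical; the point is the shape of the integration:

```python
# Hypothetical integration: a deploy script silencing alarms during a rollout.
import json, urllib.request

def open_maintenance_window(host, minutes, token):
    payload = json.dumps({"host": host, "duration_minutes": minutes,
                          "reason": "deploy"}).encode()
    req = urllib.request.Request(
        "https://monitoring.example.internal/api/v1/maintenance",  # made-up URL
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```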


Page 14: That Conference 2017: Refactoring your Monitoring

User Interface

How humans interface with it.

A monitoring system with a bad user-interface is a bad monitoring system.

- Jamie Riedesel, lots of times

I’ve seen things.


Page 15: That Conference 2017: Refactoring your Monitoring

User Interface

To access a previous job’s monitoring system:

1. Open a browser.

2. Log in using 2-factor to our SSL-VPN.

3. Connect to RDP using same password as VPN.

4. Open another browser.

5. Hit Monitoring site.

6. Log in with a non-SSO'd password.

7. See what’s going on.


Page 16: That Conference 2017: Refactoring your Monitoring

Policy Engine

This defines the behavior of each stage of the stack.

Configured as part of the User Interface and API.


Page 17: That Conference 2017: Refactoring your Monitoring

Policy Engine + Polling Engine

● How often are things polled?

○ Every 10s, 1m, 2m, 5m, 1d?

● Does polling get paused for maintenance-windows?

● What data gets reported to the Aggregation Engine? (See the policy-as-data sketch below.)
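
One way to picture the answers: the polling policy is just data the Polling Engine consults each cycle. A sketch with invented targets, checks, and intervals:

```python
# Polling policy as plain data: what to poll, how often, and whether a
# maintenance window pauses it.
POLLING_POLICY = {
    "web-frontend": {"check": "http_200",          "interval_s": 60,
                     "pause_in_maintenance": True},
    "core-switch":  {"check": "snmp_ifOperStatus", "interval_s": 120,
                     "pause_in_maintenance": False},  # keep watching the network
    "batch-server": {"check": "disk_pct",          "interval_s": 300,
                     "pause_in_maintenance": True},
}

def due_checks(policy, now, last_run, in_maintenance):
    """Yield the (target, check) pairs the polling engine should run this cycle."""
    for target, rule in policy.items():
        if in_maintenance and rule["pause_in_maintenance"]:
            continue
        if now - last_run.get(target, 0) >= rule["interval_s"]:
            yield target, rule["check"]
```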


Page 18: That Conference 2017: Refactoring your Monitoring

Policy Engine + Aggregation Engine

● How long do you keep data at all?

● How long do you keep full granularity data?

● How long do you keep summarized data?

● Where do you keep full granularity data?

● Where do you keep summarized data?

● How do you summarize data?

○ Time? System? Location?

● Do maintenance windows affect any of the above? (See the retention sketch below.)
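
The retention answers often reduce to a small table like this; the numbers are made up, in the spirit of RRDTool/Graphite-style retention schemes:

```python
# Retention as policy data: how long each resolution is kept, and where.
RETENTION_POLICY = [
    # resolution, keep for,   stored in
    ("10s",       "2 days",   "hot time-series store"),
    ("1m",        "30 days",  "hot time-series store"),
    ("15m",       "1 year",   "warm store"),
    ("1d",        "5 years",  "cheap object storage"),
]
```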


Page 19: That Conference 2017: Refactoring your Monitoring

Policy Engine + Alerting Engine

● Which alarms merit bothering humans?

● Which alarms merit automatic fixing?

● Which alarms can be ignored?

● How do maintenance-windows impact alarms?

● What escalation policies are in place? (Sketched below.)
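
An escalation policy can also be expressed as data: who gets bothered, and how long to wait for an acknowledgement before moving on. A sketch with invented names and times:

```python
# Escalation as policy data, plus the lookup the alerting engine would do.
ESCALATION_POLICY = [
    {"notify": "oncall-primary",   "wait_minutes": 15},
    {"notify": "oncall-secondary", "wait_minutes": 15},
    {"notify": "team-lead",        "wait_minutes": 30},
]

def next_step(policy, minutes_unacknowledged):
    """Who should be notified, given how long the alarm has gone unacknowledged."""
    elapsed = 0
    for step in policy:
        if minutes_unacknowledged < elapsed + step["wait_minutes"]:
            return step["notify"]
        elapsed += step["wait_minutes"]
    return policy[-1]["notify"]   # keep bothering the last step
```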


Page 20: That Conference 2017: Refactoring your Monitoring

Policy Engine + Reporting Engine

● Do reports get automatically generated?

● What reports are viewable on-demand?

● What reports are defined?

● Are ad-hoc reports possible?

● Who gets automatically generated reports?

● What trends are we looking for?


Page 21: That Conference 2017: Refactoring your Monitoring

That’s cleared up!


Page 22: That Conference 2017: Refactoring your Monitoring

Deciding What To Monitor

PLANNING THE APPROACH


Page 23: That Conference 2017: Refactoring your Monitoring

Different Kinds of Monitoring

Granularity and goals differ from type to type. Be aware of these as you build your system.

Performance Monitoring

Operational Monitoring

Capacity Monitoring

SLA Monitoring


Page 24: That Conference 2017: Refactoring your Monitoring

Performance Monitoring

Granularity: Very high (10s, 1s, or even sub-second)

Duration: As-needed

Response: Realtime

Tools: Procmon, Wireshark, strace, perf, Performance Monitor, gdb

Typically done as part of debugging, troubleshooting, and profiling. Granularity is much higher than operational monitoring, and results are usually reviewed in near-realtime rather than persisted for long.


Page 25: That Conference 2017: Refactoring your Monitoring

Operational Monitoring

Granularity: Medium (1m, 2m, 5m, 10m, 1h, etc)

Duration: Continuous.

Response: Rapid.

Tools: Dell OpenManage, HP Operations Manager, Cisco OpManager, NetApp

What most people think of when you say monitoring (but they’re wrong). This type of monitoring catches the health of your infrastructure and is not directly related to the services it provides. Think disk replacements, switch failures, and tornados.


Page 26: That Conference 2017: Refactoring your Monitoring

This one is easy

OPERATIONAL MONITORING (1)

The SLA for this is: our infrastructure can support the delivery of our products and services.

● Switch failures.

● Disk failures.

● Blade-chassis failures.

● UPS failures.

● PSU / PDU failures.

● Compliance failures.


Page 27: That Conference 2017: Refactoring your Monitoring

Capacity Monitoring

Granularity: Low (1h, 1d, 1w, 1mo)

Duration: Continual or occasional

Response: Slow

Tools: Grafana, Kibana, Graphite, Nagios, Excel

Monitoring the capacity of your system to do work. Lead times can be quite long for some replacements (SAN arrays), and capacity can be budgetary more than hardware, especially in cloud contexts.


Page 28: That Conference 2017: Refactoring your Monitoring

How much do I need, and when do I need it?

CAPACITY MONITORING (2)

Every product or service uses consumables. This is where you track them:

● Disk-space

● Cloud budget

● Overtime allowance

● P1 incident usage

● SmartHands budget


Page 29: That Conference 2017: Refactoring your Monitoring

Service Level Agreement Monitoring

Granularity: Medium to Low

Duration: Continual

Response: Rapid and Slow

Tools: Everything

Monitoring to detect whether or not you’re meeting your SLA for a given service or services. Where most monitoring really exists.


Page 30: That Conference 2017: Refactoring your Monitoring

This one is complicated

SERVICE LEVEL AGREEMENT MONITORING (3)

How your product or service is supposed to perform. Not just executives care about SLAs.

SLA: Service Level Agreement

SLO: Service Level Objectives

SLI: Service Level Indicators

We’ll get into these.


Page 31: That Conference 2017: Refactoring your Monitoring

What if we don’t have SLAs? That’s like… commitment. We avoid that around here!


Page 32: That Conference 2017: Refactoring your Monitoring


Yes, you have an SLA

No, really. You do.


Page 33: That Conference 2017: Refactoring your Monitoring

The service is up when our users need it to be.

And if it isn’t, they’re allowed to slag us on Twitter.


DE FACTO SERVICE LEVEL AGREEMENT

Page 34: That Conference 2017: Refactoring your Monitoring


In short, 100% uptime or your reputation will be hauled through the meat-grinder.


DE FACTO SERVICE LEVEL AGREEMENT

Page 35: That Conference 2017: Refactoring your Monitoring

We promise X availability, on penalty of Y things, outside of Q maintenance periods. Planned outages will have no less than Z days notice...

Less likely to end up as a meme on Twitter. This can be 100% an internal-only document!


DEFINED SERVICE LEVEL AGREEMENT

Page 36: That Conference 2017: Refactoring your Monitoring

Service Level Agreement (SLA): An agreement, written in Human; or sometimes Lawyer. Sets goalposts, defines penalties (if any), defines terms.

Service Level Objective (SLO): A set of objectives, written in Engineer. Technical definition of the goalposts in the SLA.

Service Level Indicator (SLI): Something that tells you whether or not you’re meeting your SLO.


DEFINITIONS

Page 37: That Conference 2017: Refactoring your Monitoring

SLA: The service is up 99.99% of the time, not including scheduled maintenance.
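
To make “four nines” concrete, the downtime budget is a quick back-of-the-envelope calculation:

```python
# What 99.99% actually allows, ignoring scheduled maintenance.
for period_name, hours in [("day", 24), ("month (30d)", 24 * 30), ("year", 24 * 365)]:
    allowed_minutes = hours * 60 * (1 - 0.9999)
    print(f"per {period_name}: {allowed_minutes:.1f} minutes of downtime")
# per day: 0.1 minutes; per month (30d): 4.3 minutes; per year: 52.6 minutes
```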


SLOs - SERVICE LEVEL OBJECTIVES

Page 38: That Conference 2017: Refactoring your Monitoring

SLA: The service is up 99.99% of the time, not including scheduled maintenance.

● The settings page renders in under 10 seconds.
● The site returns HTTP-200 from Europe within 2 seconds (probed in the sketch below).
● Branch-office ADC01 can reach the service.
● 98%-tile end to end request time is not more than 3 seconds.
● The SSL certificate is valid and chains to our CA.
● The text, “Welcome to Example Co,” is on the main page.
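
The HTTP-200-from-Europe objective, for instance, can be probed directly. A sketch meant to run from a European vantage point; the URL is a placeholder:

```python
# Probe: does the site return HTTP-200 within the 2-second budget?
import time, urllib.request

def check_http_slo(url="https://www.example.com/", budget_s=2.0):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=budget_s) as resp:
            elapsed = time.monotonic() - start
            return resp.status == 200 and elapsed <= budget_s, elapsed
    except Exception:
        return False, time.monotonic() - start

ok, elapsed = check_http_slo()
print("PASS" if ok else "FAIL", f"({elapsed:.2f}s)")
```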


SLOs - SERVICE LEVEL OBJECTIVES

Page 39: That Conference 2017: Refactoring your Monitoring

SLA: The site is up 99.99% of the time, not including scheduled maintenance.

SLO:
● Site is reachable.
● The site is showing the right content.
● Scheduled maintenance is tracked.


SLOs - SERVICE LEVEL OBJECTIVES: HasDCDoneSomethingStupidToday.com

Page 40: That Conference 2017: Refactoring your Monitoring

SLA: Printing is available in Computer Labs 99.99% of the time, outside of scheduled closures and maintenance.

SLO:
● Every Computer Lab has at least one working printer with paper.
● Printers service only the central print queues.
● The swipe-card terminal in Computer Labs must work for the printers to be considered ‘working’.
● Printers do not work if they can’t talk to the payment processor.


SLOs - SERVICE LEVEL OBJECTIVES: University Print Services

Page 41: That Conference 2017: Refactoring your Monitoring

SLO: The settings page renders in under 10 seconds.

SLI:
● Logins work.
● Page render-time from same data-center.
● Page render-time from Europe.
● Database disk-queue length.


SLIs - SERVICE LEVEL INDICATORS: Specific monitorables!

Page 42: That Conference 2017: Refactoring your Monitoring

SLO: 98%-tile end to end request time is not more than 3 seconds.

SLI:
● Time-to-process for all requests.
● Request processing was functional within the last 30 seconds.
● 10 minute 98th percentile request-time average (computed in the sketch below).
● 10 minute 50th percentile request-time average.
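
A sketch of how the 98th-percentile indicator might be computed, assuming the request path records every duration; the in-memory window is a simplification:

```python
# Keep the last 10 minutes of request durations and report the 98th percentile.
import time
from statistics import quantiles

WINDOW_S = 600
durations = []   # list of (unix_ts, seconds), appended by the request path

def record_request(seconds, now=None):
    now = now or time.time()
    durations.append((now, seconds))
    cutoff = now - WINDOW_S
    while durations and durations[0][0] < cutoff:   # drop samples outside the window
        durations.pop(0)

def p98():
    values = [d for _, d in durations]
    if len(values) < 2:
        return None                                 # not enough data yet
    return quantiles(values, n=100)[97]
```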


SLIs - SERVICE LEVEL INDICATORS: Specific monitorables!

Page 43: That Conference 2017: Refactoring your Monitoring

Service Level Agreement (SLA): An agreement, written in Human; or sometimes Lawyer. Sets goalposts, defines penalties (if any), defines terms.

Service Level Objective (SLO): A set of objectives, written in Engineer. Technical definition of the goalposts in the SLA.

Service Level Indicator (SLI): Something that tells you whether or not you’re meeting your SLO.


DEFINITIONS

Page 44: That Conference 2017: Refactoring your Monitoring

Alarm: Informing humans of failing SLI/SLOs in realtime.

Report: Eventually informing humans of failing SLI/SLOs.

Which humans do you bother for each SLI/SLO? Only you can figure that out!


DEFINITIONS

Page 45: That Conference 2017: Refactoring your Monitoring

Specific: Must tell me something specific is wrong.

Alarms that require a human to log in to figure out what is actually wrong, if anything is, are bad alarms.

FYI alarms lead to high cognitive load and decrease worker satisfaction.


GOOD ALARMS

Page 46: That Conference 2017: Refactoring your Monitoring

Actionable: Must be something I can directly fix

Getting alarmed for things you can’t fix is a great road to burnout. These are especially great at 3:19 AM.

The failure mode is teaching people that some alarms can be ignored safely. Eventually, they’ll ignore the wrong one. This is bad.


GOOD ALARMS

Page 47: That Conference 2017: Refactoring your Monitoring

Format Agnostic: Don’t be a dick about format

If a team wants full HTML with links to runbooks and wiki-pages, let ‘em.

If a team wants the entire alert to fit into their iPhone lock-screen, let ‘em.

Better, allow both!
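
One alarm, rendered both ways. A sketch; the field names and runbook URL are invented:

```python
# Same alarm, two renderings: detailed HTML with a runbook link, and a
# lock-screen-sized one-liner.
ALARM = {
    "host": "web03",
    "check": "disk_pct",
    "value": 93,
    "threshold": 90,
    "runbook": "https://wiki.example.internal/runbooks/disk-full",  # placeholder
}

def render_full(a):
    return (f"<b>{a['check']} on {a['host']}</b>: {a['value']}% "
            f"(threshold {a['threshold']}%).<br>"
            f"Runbook: <a href=\"{a['runbook']}\">{a['runbook']}</a>")

def render_short(a):
    return f"{a['host']} {a['check']} {a['value']}% > {a['threshold']}%"

print(render_full(ALARM))
print(render_short(ALARM))
```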


GOOD ALARMS

Page 48: That Conference 2017: Refactoring your Monitoring

Specific.

Actionable.

In the format you want.


GOOD ALARMS

Page 49: That Conference 2017: Refactoring your Monitoring

The Monitoring Project-Plan

MAKING THE ASCENT


Page 50: That Conference 2017: Refactoring your Monitoring

Get Approval For The Project:

● If it’s just you, that’s easy! Do it.

● A good monitoring product is used by many people

○ Get buy-in from not just IT, but sales, support, etc.

● Pitch the business case, not process improvement for your department.

○ We will reduce customer churn by enabling our CSMs.

○ We will improve our reaction time to reputation-impacting events.

○ This will increase buy-in from other departments, enabling our IT goals.

PROJECT PLAN: STEP 0


Page 51: That Conference 2017: Refactoring your Monitoring

Figure out high-level needs (SLA)

● If you have a written one? Great! Work backwards from that.

● If you have an unwritten one, ask people to see what they think it is.

○ Play 20-questions with higher-level execs on the impacts of down-time and service degradations.

○ Point out the de facto SLA, see how they react.

○ Point out we don’t need to publish the SLA to our customers, but can have one internally.

● If you have microservices, each service will need its own SLA.

PROJECT PLAN: STEP 1


Page 52: That Conference 2017: Refactoring your Monitoring

Figure out concrete definitions (SLO)

● Now that you have an SLA, or many SLAs, do the analysis to determine what ‘up’ and ‘responsive’ mean in a concrete way.

● Ask other people to get involved. Involvement keeps the project rolling.

● This is an opportunity for education with business leaders.

PROJECT PLAN: STEP 2


Page 53: That Conference 2017: Refactoring your Monitoring

Figure out specific monitorables (SLI)

● Take your SLO list and figure out how to monitor for each.

● You may need to monitor new things.

● You may be able to stop monitoring/alarming some other things.

● Magic happens: your first opportunity to turn off existing alarms!

PROJECT PLAN: STEP 3


Page 54: That Conference 2017: Refactoring your Monitoring

Figure out how to monitor those things

● Some of this may already exist. If so, cool.

● Some may need to be monitored in a different way.

● Some may need to be monitored for the first time.

● This defines how the Polling Engine works.

● Build new engines if you need to.

● Poll direct measurements where you can; try not to use proxy measurements.

PROJECT PLAN: STEP 4


Page 55: That Conference 2017: Refactoring your Monitoring

Decide on your aggregation techniques

● Some of this may already exist. If so, cool.

● Perhaps you don’t need to keep data as long as you thought.

● Perhaps you need to keep high granularity data longer than you thought.

● Perhaps you need to start tracking things like percentiles and standard deviations.

● This defines how the Aggregation Engine works.

PROJECT PLAN: STEP 5


Page 56: That Conference 2017: Refactoring your Monitoring

Alert Definition (Operational/SLA monitoring)

● Some of this may already exist. If so, cool.

● Figure out who needs to know what and how fast they need to know it.

● One person shop? Easy!

● Ops team of 80? There will be meetings.

○ Work with each group individually.

○ Be flexible with requirements in each.

○ Don’t force communications-format standards without good cause.

○ Ensure the alarms are specific and actionable.

PROJECT PLAN: STEP 6


Page 57: That Conference 2017: Refactoring your Monitoring

Report Definition (Capacity/SLA monitoring)

● Some of this may already exist. If so, cool.

● Figure out how to write the pass/fail report for your SLAs (a minimal sketch follows this list).

● Determine what kind of response-times are needed to address SLA risks.

● Determine what kind of response-times are needed for capacity risks.

● Determine who gets what.
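
The pass/fail report itself can start very small. A sketch against the 99.99% SLA from earlier, with made-up outage minutes:

```python
# SLA pass/fail: given unplanned outage minutes per service this month,
# did each one hit 99.99%? The outage numbers are invented.
MINUTES_PER_MONTH = 30 * 24 * 60
SLA_TARGET = 0.9999

outages = {"web": 3.0, "api": 7.5, "payments": 0.0}

for service, down_minutes in outages.items():
    availability = 1 - down_minutes / MINUTES_PER_MONTH
    status = "PASS" if availability >= SLA_TARGET else "FAIL"
    print(f"{service:10s} {availability:.5%}  {status}")
```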

PROJECT PLAN: STEP 7


Page 58: That Conference 2017: Refactoring your Monitoring

Periodic Review

● Run the system for a while.

● Come back 3 months, 6 months later and ask questions.

○ How are the alarms working for you?

○ What changes do you think need to be made?

○ What new things have shown up?

● Especially important for departments that haven’t been attached to a monitoring system before.

PROJECT PLAN: STEP 8


Page 59: That Conference 2017: Refactoring your Monitoring

Step 0: Get approval

Step 1: Figure out high level needs (Service Level Agreement)

Step 2: Turn that into concrete definitions (Service Level Objectives)

Step 3: Figure out specific monitorables (Service Level Indicators)

Step 4: Decide how to monitor it (Polling Engine)

Step 5: Determine aggregation requirements (Aggregation Engine)

Step 6: Define Alerts (Operational and SLA monitoring)

Step 7: Define Reports (Capacity and SLA monitoring)

Step 8: Periodic Review


Page 60: That Conference 2017: Refactoring your Monitoring
Page 61: That Conference 2017: Refactoring your Monitoring

Post-Incident Review Questions

1. Did the monitoring system see the problem?

2. Did we react to the monitoring system, or humans?

3. Is it worth our time to catch this problem in the monitoring system?

4. What changes do we need to make, including to alerts, to deal with this in the future?

PROJECT MAINTENANCE: STEP 9


Page 62: That Conference 2017: Refactoring your Monitoring

Questions?

STACK CLIMBING


Page 63: That Conference 2017: Refactoring your Monitoring