That Conference 2017: Refactoring your Monitoring


Transcript of That Conference 2017: Refactoring your Monitoring

Page 1: That Conference 2017: Refactoring your Monitoring
Page 2: That Conference 2017: Refactoring your Monitoring

Jamie Riedesel, DevOps Engineer

@sysadm1138

Route-Planning your Monitoring Stack Climb


Page 3: That Conference 2017: Refactoring your Monitoring

Today’s Climb

Overview

Your monitoring stack

Deciding what to monitor

The monitoring project-plan

Extra: Humane on-call rotations


Page 4: That Conference 2017: Refactoring your Monitoring

Your Monitoring Stack

LEARNING THE TERRITORY


Page 5: That Conference 2017: Refactoring your Monitoring

This is your stack. Really

Polling Engine

Aggregation Engine

Alerting Engine

Reporting Engine

API

User Interface

Policy Engine

Humans


Page 6: That Conference 2017: Refactoring your Monitoring

Scheduled-tasks & Powershell

Scheduler runs scripts on a schedule.

Scripts emit email or update a database.
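
A home-grown stack can be as small as one script run by cron or Task Scheduler. A minimal Python sketch of that pattern; the SMTP relay, threshold, and database path are made-up values for illustration:

```python
# disk_check.py - run from cron/Task Scheduler; the whole "monitoring stack" in one file.
# SMTP_HOST, THRESHOLD_PCT, and DB_PATH are hypothetical values; adjust for your site.
import shutil, smtplib, socket, sqlite3, time
from email.message import EmailMessage

SMTP_HOST = "mail.example.internal"
THRESHOLD_PCT = 90                    # alert when the disk is fuller than this
DB_PATH = "/var/lib/monitoring/checks.db"

def check_disk(path="/"):
    """Polling: fetch one data point."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total

def record(value):
    """Aggregation (sort of): a table a reporting script can query later."""
    with sqlite3.connect(DB_PATH) as db:
        db.execute("CREATE TABLE IF NOT EXISTS disk (ts REAL, host TEXT, pct REAL)")
        db.execute("INSERT INTO disk VALUES (?, ?, ?)",
                   (time.time(), socket.gethostname(), value))

def alert(value):
    """Alerting: email somebody every time the threshold is crossed."""
    msg = EmailMessage()
    msg["Subject"] = f"Disk {value:.0f}% full on {socket.gethostname()}"
    msg["From"] = "monitoring@example.internal"
    msg["To"] = "ops@example.internal"
    msg.set_content(f"/ is {value:.1f}% full (threshold {THRESHOLD_PCT}%).")
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    pct = check_disk()
    record(pct)
    if pct > THRESHOLD_PCT:
        alert(pct)
```

Every box from the diagram is in there somewhere: polling, aggregation, alerting, and a policy (the hard-coded threshold and the cron schedule).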


Page 7: That Conference 2017: Refactoring your Monitoring

“Full-Stack”

SolarWinds.

Zenoss.

Nagios.


Page 8: That Conference 2017: Refactoring your Monitoring

Open-Source Medley

Nagios + Graphite + Grafana

Logstash + InfluxDB + Kibana + Bosun

Graylog + New Relic + Hash.io + DataDog

Nexosis + Go + OpenTSDB + Grafana


Page 9: That Conference 2017: Refactoring your Monitoring

Polling Engine

The whatever that fetches data.

● SNMP agents

● WMI endpoints

● Nagios agent

● SolarWinds agent

● PowerShell scripts

● Bash scripts

● Polling Engines in Nagios & SolarWinds

● Daily runbooks and spreadsheets


Page 10: That Conference 2017: Refactoring your Monitoring

Aggregation Engine

Turns raw data into useful data.

● Summarizes over time (think RRDTool)

● Does stats (min/max/%-tile) on the incoming stream.

● Summarizes over system/rack/datacenter

No one (except possibly Google) keeps full granularity monitoring logs forever and ever in a trivially queryable way. Too expensive, and you don’t usually care about 2 years ago.
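
As a rough sketch of what “summarizes over time” means in practice; the 5-minute window and the statistics kept are arbitrary choices, not a prescription:

```python
# Roll raw samples up into fixed windows, keeping only summary statistics.
# A sketch of what an aggregation engine does; real ones (RRDTool, Graphite)
# also expire old windows and re-summarize them into coarser ones.
from statistics import quantiles

WINDOW = 300  # seconds: keep 5-minute summaries instead of every raw point

def summarize(samples):
    """samples: list of (unix_ts, value). Returns {window_start: summary}."""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(int(ts) // WINDOW * WINDOW, []).append(value)
    return {
        start: {
            "min": min(vals),
            "max": max(vals),
            "avg": sum(vals) / len(vals),
            # 98th percentile, the kind of stat the SLO examples later rely on
            "p98": quantiles(vals, n=100)[97] if len(vals) > 1 else vals[0],
            "count": len(vals),
        }
        for start, vals in buckets.items()
    }
```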


Page 11: That Conference 2017: Refactoring your Monitoring

Alerting Engine

Bothering humans in realtime!

● May do analytics.

● May be threshold-based, or trigger on very sophisticated conditions (a minimal rule sketch follows this list).

● Scripts that send email every time.

● Scripts that drop notices in group-chat.

● Night-operator calling the Systems Engineers.

● PagerDuty.
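
The threshold-based case can be pictured as a small rule table. A sketch only, with invented metric names; the routes stand in for email, group-chat, or a pager:

```python
# Threshold-based alert rules: metric, condition, and who gets bothered.
RULES = [
    ("disk_pct",       lambda v: v > 90,   "page-oncall"),
    ("queue_depth",    lambda v: v > 1000, "chat-only"),
    ("cert_days_left", lambda v: v < 14,   "email-team"),
]

def evaluate(metrics):
    """metrics: latest values, e.g. {"disk_pct": 93.1}.
    Yields (route, message) for every rule that fires."""
    for name, fires, route in RULES:
        value = metrics.get(name)
        if value is not None and fires(value):
            yield route, f"{name} is {value}"

if __name__ == "__main__":
    for route, message in evaluate({"disk_pct": 93.1, "queue_depth": 12}):
        print(route, "->", message)   # swap the print for your real notifiers
```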


Page 12: That Conference 2017: Refactoring your Monitoring

Reporting Engine

Bothering humans on a lag!

● Long-term trends

● Capacity analysis

● Growth tracking

● Full-bore big-data analytics

● SLA pass/fail reporting

● Track user behaviors across features

● BA building reports for executives


Page 13: That Conference 2017: Refactoring your Monitoring

API

Programmatic interfaces into your monitoring system.

● Build feedback systems

● Manage policy-engine details

● Could be your CM system

Good monitoring systems have APIs. It makes them easier to integrate with. And integration is usage.
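
For example, a deploy pipeline might call the monitoring API to open a maintenance window before a rollout. The endpoint, payload, and token below are entirely hypothetical; the point is the shape of the integration:

```python
# Hypothetical integration: a deploy script silencing alarms during a rollout.
import json, urllib.request

def open_maintenance_window(host, minutes, token):
    payload = json.dumps({"host": host, "duration_minutes": minutes,
                          "reason": "deploy"}).encode()
    req = urllib.request.Request(
        "https://monitoring.example.internal/api/v1/maintenance",  # made-up URL
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```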


Page 14: That Conference 2017: Refactoring your Monitoring

User Interface

How humans interface with it.

A monitoring system with a bad user-interface is a bad monitoring system.

- Jamie Riedesel, lots of times

I’ve seen things.


Page 15: That Conference 2017: Refactoring your Monitoring

User Interface

To access a previous job’s monitoring system:

1. Open a browser.

2. Log in using 2-factor to our SSL-VPN.

3. Connect to RDP using same password as VPN.

4. Open another browser.

5. Hit Monitoring site.

6. Log in with a non-SSO'd password.

7. See what’s going on.


Page 16: That Conference 2017: Refactoring your Monitoring

Policy Engine

This defines the behavior of each stage of the stack.

Configured as part of the User Interface and API.


Page 17: That Conference 2017: Refactoring your Monitoring

Policy Engine + Polling Engine

● How often are things polled?

○ Every 10s, 1m, 2m, 5m, 1d?

● Does polling get paused for maintenance-windows?

● What data gets reported to the Aggregation Engine? (See the policy-as-data sketch below.)
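
One way to picture the answers: the polling policy is just data the Polling Engine consults each cycle. A sketch with invented targets, checks, and intervals:

```python
# Polling policy as plain data: what to poll, how often, and whether a
# maintenance window pauses it.
POLLING_POLICY = {
    "web-frontend": {"check": "http_200",          "interval_s": 60,
                     "pause_in_maintenance": True},
    "core-switch":  {"check": "snmp_ifOperStatus", "interval_s": 120,
                     "pause_in_maintenance": False},  # keep watching the network
    "batch-server": {"check": "disk_pct",          "interval_s": 300,
                     "pause_in_maintenance": True},
}

def due_checks(policy, now, last_run, in_maintenance):
    """Yield the (target, check) pairs the polling engine should run this cycle."""
    for target, rule in policy.items():
        if in_maintenance and rule["pause_in_maintenance"]:
            continue
        if now - last_run.get(target, 0) >= rule["interval_s"]:
            yield target, rule["check"]
```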


Page 18: That Conference 2017: Refactoring your Monitoring

Policy Engine + Aggregation Engine

● How long do you keep data at all?

● How long do you keep full granularity data?

● How long do you keep summarized data?

● Where do you keep full granularity data?

● Where do you keep summarized data?

● How do you summarize data?

○ Time? System? Location?

● Do maintenance windows affect any of the above? (See the retention sketch below.)
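
The retention answers often reduce to a small table like this; the numbers are made up, in the spirit of RRDTool/Graphite-style retention schemes:

```python
# Retention as policy data: how long each resolution is kept, and where.
RETENTION_POLICY = [
    # resolution, keep for,   stored in
    ("10s",       "2 days",   "hot time-series store"),
    ("1m",        "30 days",  "hot time-series store"),
    ("15m",       "1 year",   "warm store"),
    ("1d",        "5 years",  "cheap object storage"),
]
```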


Page 19: That Conference 2017: Refactoring your Monitoring

Policy Engine + Alerting Engine

● Which alarms merit bothering humans?

● Which alarms merit automatic fixing?

● Which alarms can be ignored?

● How do maintenance-windows impact alarms?

● What escalation policies are in place? (Sketched below.)
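
An escalation policy can also be expressed as data: who gets bothered, and how long to wait for an acknowledgement before moving on. A sketch with invented names and times:

```python
# Escalation as policy data, plus the lookup the alerting engine would do.
ESCALATION_POLICY = [
    {"notify": "oncall-primary",   "wait_minutes": 15},
    {"notify": "oncall-secondary", "wait_minutes": 15},
    {"notify": "team-lead",        "wait_minutes": 30},
]

def next_step(policy, minutes_unacknowledged):
    """Who should be notified, given how long the alarm has gone unacknowledged."""
    elapsed = 0
    for step in policy:
        if minutes_unacknowledged < elapsed + step["wait_minutes"]:
            return step["notify"]
        elapsed += step["wait_minutes"]
    return policy[-1]["notify"]   # keep bothering the last step
```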


Page 20: That Conference 2017: Refactoring your Monitoring

Policy Engine + Reporting Engine

● Do reports get automatically generated?

● What reports are viewable on-demand?

● What reports are defined?

● Are ad-hoc reports possible?

● Who gets automatically generated reports?

● What trends are we looking for?


Page 21: That Conference 2017: Refactoring your Monitoring

That’s cleared up!


Page 22: That Conference 2017: Refactoring your Monitoring

Deciding What To Monitor

PLANNING THE APPROACH


Page 23: That Conference 2017: Refactoring your Monitoring

Different Kinds of Monitoring

Granularity and goals differ from type to type. Be aware of these as you build your system.

Performance Monitoring

Operational Monitoring

Capacity Monitoring

SLA Monitoring


Page 24: That Conference 2017: Refactoring your Monitoring

Performance Monitoring

Granularity: Very high (10s, 1s, or even sub-second)

Duration: As-needed

Response: Realtime

Tools: Procmon, Wireshark, strace, perf, Performance Monitor, gdb

Typically done as part of debugging, troubleshooting, and profiling. Granularity is much higher than operational monitoring, and results are usually reviewed in near-realtime rather than persisted for long.


Page 25: That Conference 2017: Refactoring your Monitoring

Operational Monitoring

Granularity: Medium (1m, 2m, 5m, 10m, 1h, etc)

Duration: Continuous.

Response: Rapid.

Tools: Dell OpenManage, HP Operations Manager, Cisco OpManager, NetApp

What most people think of when you say monitoring (but they’re wrong). This type of monitoring catches the health of your infrastructure and is not directly related to the services it provides. Think disk replacements, switch failures, and tornados.


Page 26: That Conference 2017: Refactoring your Monitoring

This one is easy

OPERATIONAL MONITORING (1)

The SLA for this is: our infrastructure can support the delivery of our products and services.

● Switch failures.

● Disk failures.

● Blade-chassis failures.

● UPS failures.

● PSU / PDU failures.

● Compliance failures.


Page 27: That Conference 2017: Refactoring your Monitoring

Capacity Monitoring

Granularity: Low (1h, 1d, 1w, 1mo)

Duration: Continual or occasional

Response: Slow

Tools: Grafana, Kibana, Graphite, Nagios, Excel

Monitoring the capacity of your system to do work. Lead times can be quite long for some replacements (SAN arrays), and capacity can be budgetary more than hardware, especially in cloud contexts.


Page 28: That Conference 2017: Refactoring your Monitoring

How much do I need, and when do I need it?

CAPACITY MONITORING (2)

Every product or service uses consumables. This is where you track them:

● Disk-space

● Cloud budget

● Overtime allowance

● P1 incident usage

● SmartHands budget


Page 29: That Conference 2017: Refactoring your Monitoring

Service Level Agreement Monitoring

Granularity: Medium to Low

Duration: Continual

Response: Rapid and Slow

Tools: Everything

Monitoring to detect whether or not you’re meeting your SLA for a given service or services. Where most monitoring really exists.


Page 30: That Conference 2017: Refactoring your Monitoring

This one is complicated

SERVICE LEVEL AGREEMENT MONITORING (3)

How your product or service is supposed to perform. Not just executives care about SLAs.

SLA: Service Level Agreement

SLO: Service Level Objectives

SLI: Service Level Indicators

We’ll get into these.


Page 31: That Conference 2017: Refactoring your Monitoring

What if we don’t have SLAs? That’s like… commitment. We avoid that around here!


Page 32: That Conference 2017: Refactoring your Monitoring


Yes, you have an SLA

No, really. You do.


Page 33: That Conference 2017: Refactoring your Monitoring

The service is up when our users need it to be.

And if it isn’t, they’re allowed to slag us on Twitter.


DE FACTO SERVICE LEVEL AGREEMENT

Page 34: That Conference 2017: Refactoring your Monitoring


In short, 100% uptime or your reputation will be hauled through the meat-grinder.


DE FACTO SERVICE LEVEL AGREEMENT

Page 35: That Conference 2017: Refactoring your Monitoring

We promise X availability, on penalty of Y things, outside of Q maintenance periods. Planned outages will have no less than Z days notice...

Less likely to end up as a meme on Twitter. This can be 100% an internal-only document!


DEFINED SERVICE LEVEL AGREEMENT

Page 36: That Conference 2017: Refactoring your Monitoring

Service Level Agreement (SLA): An agreement, written in Human; or sometimes Lawyer. Sets goalposts, defines penalties (if any), defines terms.

Service Level Objective (SLO): A set of objectives, written in Engineer. Technical definition of the goalposts in the SLA.

Service Level Indicator (SLI): Something that tells you whether or not you’re meeting your SLO.


DEFINITIONS

Page 37: That Conference 2017: Refactoring your Monitoring

SLA: The service is up 99.99% of the time, not including scheduled maintenance.
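
To make “four nines” concrete, the downtime budget is a quick back-of-the-envelope calculation:

```python
# What 99.99% actually allows, ignoring scheduled maintenance.
for period_name, hours in [("day", 24), ("month (30d)", 24 * 30), ("year", 24 * 365)]:
    allowed_minutes = hours * 60 * (1 - 0.9999)
    print(f"per {period_name}: {allowed_minutes:.1f} minutes of downtime")
# per day: 0.1 minutes; per month (30d): 4.3 minutes; per year: 52.6 minutes
```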


SLOs - SERVICE LEVEL OBJECTIVES

Page 38: That Conference 2017: Refactoring your Monitoring

SLA: The service is up 99.99% of the time, not including scheduled maintenance.

● The settings page renders in under 10 seconds.
● The site returns HTTP-200 from Europe within 2 seconds (probed in the sketch below).
● Branch-office ADC01 can reach the service.
● 98%-tile end to end request time is not more than 3 seconds.
● The SSL certificate is valid and chains to our CA.
● The text, “Welcome to Example Co,” is on the main page.
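
The HTTP-200-from-Europe objective, for instance, can be probed directly. A sketch meant to run from a European vantage point; the URL is a placeholder:

```python
# Probe: does the site return HTTP-200 within the 2-second budget?
import time, urllib.request

def check_http_slo(url="https://www.example.com/", budget_s=2.0):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=budget_s) as resp:
            elapsed = time.monotonic() - start
            return resp.status == 200 and elapsed <= budget_s, elapsed
    except Exception:
        return False, time.monotonic() - start

ok, elapsed = check_http_slo()
print("PASS" if ok else "FAIL", f"({elapsed:.2f}s)")
```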


SLOs - SERVICE LEVEL OBJECTIVES

Page 39: That Conference 2017: Refactoring your Monitoring

SLA: The site is up 99.99% of the time, not including scheduled maintenance.

SLO:
● Site is reachable.
● The site is showing the right content.
● Scheduled maintenance is tracked.


SLOs - SERVICE LEVEL OBJECTIVES: HasDCDoneSomethingStupidToday.com

Page 40: That Conference 2017: Refactoring your Monitoring

SLA: Printing is available in Computer Labs 99.99% of the time, outside of scheduled closures and maintenance.

SLO:
● Every Computer Lab has at least one working printer with paper.
● Printers service only the central print queues.
● The swipe-card terminal in Computer Labs must work for the printers to be considered ‘working’.
● Printers do not work if they can’t talk to the payment processor.


SLOs - SERVICE LEVEL OBJECTIVES: University Print Services

Page 41: That Conference 2017: Refactoring your Monitoring

SLO: The settings page renders in under 10 seconds.

SLI:
● Logins work.
● Page render-time from same data-center.
● Page render-time from Europe.
● Database disk-queue length.


SLIs - SERVICE LEVEL INDICATORS: Specific monitorables!

Page 42: That Conference 2017: Refactoring your Monitoring

SLO: 98%-tile end to end request time is not more than 3 seconds.

SLI:
● Time-to-process for all requests.
● Request processing was functional within the last 30 seconds.
● 10 minute 98th percentile request-time average (computed in the sketch below).
● 10 minute 50th percentile request-time average.
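
A sketch of how the 98th-percentile indicator might be computed, assuming the request path records every duration; the in-memory window is a simplification:

```python
# Keep the last 10 minutes of request durations and report the 98th percentile.
import time
from statistics import quantiles

WINDOW_S = 600
durations = []   # list of (unix_ts, seconds), appended by the request path

def record_request(seconds, now=None):
    now = now or time.time()
    durations.append((now, seconds))
    cutoff = now - WINDOW_S
    while durations and durations[0][0] < cutoff:   # drop samples outside the window
        durations.pop(0)

def p98():
    values = [d for _, d in durations]
    if len(values) < 2:
        return None                                 # not enough data yet
    return quantiles(values, n=100)[97]
```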


SLIs - SERVICE LEVEL INDICATORS: Specific monitorables!

Page 43: That Conference 2017: Refactoring your Monitoring

Service Level Agreement (SLA): An agreement, written in Human; or sometimes Lawyer. Sets goalposts, defines penalties (if any), defines terms.

Service Level Objective (SLO): A set of objectives, written in Engineer. Technical definition of the goalposts in the SLA.

Service Level Indicator (SLI): Something that tells you whether or not you’re meeting your SLO.


DEFINITIONS

Page 44: That Conference 2017: Refactoring your Monitoring

Alarm: Informing humans of failing SLI/SLOs in realtime.

Report: Eventually informing humans of failing SLI/SLOs.

Which humans do you bother for each SLI/SLO? Only you can figure that out!


DEFINITIONS

Page 45: That Conference 2017: Refactoring your Monitoring

Specific: Must tell me something specific is wrong.

Alarms that require a human to log in to figure out what is actually wrong, if anything is, are bad alarms.

FYI alarms lead to high cognitive load and decrease worker satisfaction.


GOOD ALARMS

Page 46: That Conference 2017: Refactoring your Monitoring

Actionable: Must be something I can directly fix

Getting alarmed for things you can’t fix is a great road to burnout. These are especially great at 3:19 AM.

The failure mode is teaching people that some alarms can be ignored safely. Eventually, they’ll ignore the wrong one. This is bad.


GOOD ALARMS

Page 47: That Conference 2017: Refactoring your Monitoring

Format Agnostic: Don’t be a dick about format

If a team wants full HTML with links to runbooks and wiki-pages, let ‘em.

If a team wants the entire alert to fit into their iPhone lock-screen, let ‘em.

Better, allow both!
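
One alarm, rendered both ways. A sketch; the field names and runbook URL are invented:

```python
# Same alarm, two renderings: detailed HTML with a runbook link, and a
# lock-screen-sized one-liner.
ALARM = {
    "host": "web03",
    "check": "disk_pct",
    "value": 93,
    "threshold": 90,
    "runbook": "https://wiki.example.internal/runbooks/disk-full",  # placeholder
}

def render_full(a):
    return (f"<b>{a['check']} on {a['host']}</b>: {a['value']}% "
            f"(threshold {a['threshold']}%).<br>"
            f"Runbook: <a href=\"{a['runbook']}\">{a['runbook']}</a>")

def render_short(a):
    return f"{a['host']} {a['check']} {a['value']}% > {a['threshold']}%"

print(render_full(ALARM))
print(render_short(ALARM))
```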


GOOD ALARMS

Page 48: That Conference 2017: Refactoring your Monitoring

Specific.

Actionable.

In the format you want.


GOOD ALARMS

Page 49: That Conference 2017: Refactoring your Monitoring

The Monitoring Project-Plan

MAKING THE ASCENT


Page 50: That Conference 2017: Refactoring your Monitoring

Get Approval For The Project:

● If it’s just you, that’s easy! Do it.

● A good monitoring product is used by many people

○ Get buy-in from not just IT, but sales, support, etc.

● Pitch the business case, not process improvement for your department.

○ We will reduce customer churn by enabling our CSMs.

○ We will improve our reaction time to reputation-impacting events.

○ This will increase buy-in from other departments, enabling our IT goals.

PROJECT PLAN: STEP 0


Page 51: That Conference 2017: Refactoring your Monitoring

Figure out high-level needs (SLA)

● If you have a written one? Great! Work backwards from that.

● If you have an unwritten one, ask people to see what they think it is.

○ Play 20-questions with higher-level execs on the impacts of down-time and service degradations.

○ Point out the de facto SLA, see how they react.

○ Point out we don’t need to publish the SLA to our customers, but can have one internally.

● If you have microservices, each service will need its own SLA.

PROJECT PLAN: STEP 1


Page 52: That Conference 2017: Refactoring your Monitoring

Figure out concrete definitions (SLO)

● Now that you have an SLA, or many SLAs, do the analysis to determine what ‘up’ and ‘responsive’ mean in a concrete way.

● Ask other people to get involved. Involvement keeps the project rolling.

● This is an opportunity for education with business leaders.

PROJECT PLAN: STEP 2


Page 53: That Conference 2017: Refactoring your Monitoring

Figure out specific monitorables (SLI)

● Take your SLO list and figure out how to monitor for each.

● You may need to monitor new things.

● You may be able to stop monitoring/alarming some other things.

● Magic happens: your first opportunity to turn off existing alarms!

PROJECT PLAN: STEP 3


Page 54: That Conference 2017: Refactoring your Monitoring

Figure out how to monitor those things

● Some of this may already exist. If so, cool.

● Some may need to be monitored in a different way.

● Some may need to be monitored for the first time.

● This defines how the Polling Engine works.

● Build new engines if you need to.

● Poll direct measurements where you can; try not to use proxy measurements.

PROJECT PLAN: STEP 4


Page 55: That Conference 2017: Refactoring your Monitoring

Decide on your aggregation techniques

● Some of this may already exist. If so, cool.

● Perhaps you don’t need to keep data as long as you thought.

● Perhaps you need to keep high granularity data longer than you thought.

● Perhaps you need to start tracking things like percentiles and standard deviations.

● This defines how the Aggregation Engine works.

PROJECT PLAN: STEP 5


Page 56: That Conference 2017: Refactoring your Monitoring

Alert Definition (Operational/SLA monitoring)

● Some of this may already exist. If so, cool.

● Figure out who needs to know what and how fast they need to know it.

● One person shop? Easy!

● Ops team of 80? There will be meetings.

○ Work with each group individually.

○ Be flexible with requirements in each.

○ Don’t force communications-format standards without good cause.

○ Ensure the alarms are specific and actionable.

PROJECT PLAN: STEP 6


Page 57: That Conference 2017: Refactoring your Monitoring

Report Definition (Capacity/SLA monitoring)

● Some of this may already exist. If so, cool.

● Figure out how to write the pass/fail report for your SLAs (a minimal sketch follows this list).

● Determine what kind of response-times are needed to address SLA risks.

● Determine what kind of response-times are needed for capacity risks.

● Determine who gets what.
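
The pass/fail report itself can start very small. A sketch against the 99.99% SLA from earlier, with made-up outage minutes:

```python
# SLA pass/fail: given unplanned outage minutes per service this month,
# did each one hit 99.99%? The outage numbers are invented.
MINUTES_PER_MONTH = 30 * 24 * 60
SLA_TARGET = 0.9999

outages = {"web": 3.0, "api": 7.5, "payments": 0.0}

for service, down_minutes in outages.items():
    availability = 1 - down_minutes / MINUTES_PER_MONTH
    status = "PASS" if availability >= SLA_TARGET else "FAIL"
    print(f"{service:10s} {availability:.5%}  {status}")
```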

PROJECT PLAN: STEP 7


Page 58: That Conference 2017: Refactoring your Monitoring

Periodic Review

● Run the system for a while.

● Come back 3 months, 6 months later and ask questions.

○ How are the alarms working for you?

○ What changes do you think need to be made?

○ What new things have shown up?

● Especially important for departments that haven’t been attached to a monitoring system before.

PROJECT PLAN: STEP 8


Page 59: That Conference 2017: Refactoring your Monitoring

Step 0: Get approval

Step 1: Figure out high level needs (Service Level Agreement)

Step 2: Turn that into concrete definitions (Service Level Objectives)

Step 3: Figure out specific monitorables (Service Level Indicators)

Step 4: Decide how to monitor it (Polling Engine)

Step 5: Determine aggregation requirements (Aggregation Engine)

Step 6: Define Alerts (Operational and SLA monitoring)

Step 7: Define Reports (Capacity and SLA monitoring)

Step 8: Periodic Review


Page 60: That Conference 2017: Refactoring your Monitoring
Page 61: That Conference 2017: Refactoring your Monitoring

Post-Incident Review Questions

1. Did the monitoring system see the problem?

2. Did we react to the monitoring system, or humans?

3. Is it worth our time to catch this problem in the monitoring system?

4. What changes do we need to make, including to alerts, to deal with this in the future?

PROJECT MAINTENANCE: STEP 9


Page 62: That Conference 2017: Refactoring your Monitoring

Questions?

STACK CLIMBING


Page 63: That Conference 2017: Refactoring your Monitoring