Monitoring "unknown unknowns" - Guy Fighel - DevOpsDays Tel Aviv 2017
@guyfig
On-Call Engineer by Nature
"If a tree falls in a forest and no one is
around to hear it, does it make a sound?"
Observability is a superset of monitoring and instrumentation.
Making systems debuggable and
understandable
@mipsytipsy
Do you really know what to observe?
Instrumentation - mostly Developer driven
What is the output? Dashboard? Exploration tool?
Observability In Control Theory
A system is observable if one can determine the behavior of the entire system from the system's outputs
Unknown Unknowns - Rumsfeld Quadrant
- Static thresholds, defined alerts, static runbooks
- Anomaly detection, predictions, external knowledge
- Knowledge, recommendations, auto collaboration
- Inference, auto correlations, semantic analysis, decision making
The Observability Quadrant (Based on Johari window)
Human-Driven Detection
Set thresholds to find patterns
Simulate based on known
Use percentiles, basic stats
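A minimal sketch of a percentile-based threshold, using only the Python standard library. The latency values, the choice of the 95th percentile, and the helper names `percentile_threshold` and `breaches` are illustrative assumptions, not from the talk.

```python
import statistics


def percentile_threshold(samples, pct=95):
    """Return the pct-th percentile of samples, for use as an alert threshold."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[pct - 1]


def breaches(samples, threshold):
    """Return the indices of samples strictly above the threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


# Hypothetical latency readings in milliseconds, with one obvious spike
latencies_ms = [120, 125, 118, 130, 122, 127, 119, 640, 124, 121]

# Derive the threshold from a window assumed to be "known good"
p95 = percentile_threshold(latencies_ms[:7], pct=95)
print(p95, breaches(latencies_ms, p95))
```

By construction, the highest reading inside the baseline window also sits above its own 95th percentile, which is one reason a static percentile threshold still needs a human to interpret it.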
Find The Problem
Thresholds? Baseline? Anomaly?
- Scale matters
- Stationary noise matters
- Use Autocorrelation
Independent component analysis (ICA)
separates a multivariate signal into
additive subcomponents that are
maximally independent.
from sklearn.decomposition import FastICA, PCA
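A hedged sketch of how FastICA could be applied, assuming NumPy and scikit-learn are installed. The two synthetic sources and the mixing matrix are invented for illustration; real metrics are only approximately linear mixes of independent causes.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two hypothetical independent sources: a periodic load and a square-wave job
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
sources += 0.05 * rng.normal(size=sources.shape)  # small measurement noise

# Each observed metric is assumed to be a linear mix of the sources
mixing = np.array([[1.0, 0.5], [0.4, 1.0]])
observed = sources @ mixing.T

# Estimate the independent components from the observed metrics alone
ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)  # shape: (2000, 2)
print(recovered.shape)
```

ICA recovers components only up to order, sign, and scale, so the recovered columns need to be matched back to real-world causes by inspection or by correlation with known events.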
Find The Problem
[Figure: CPU over time in minutes; labels: 90%, EC2 instance changed from t2.small to m3.xl]
Events & context matters
Anomaly?
What Can Machines Do?
Process different types of data, transform it fast and handle huge amounts in real-time
Automate and adapt Anomaly Detection
Apply Semantic text similarities to find patterns (Information Retrieval)
Apply auto correlation models
Evolve and adapt (over time) based on human interaction
The Goal - Centralization
Observability for systems with imperfect outputs
Event enrichment, symptom detection and inference
Automatic Outlier Detection
Automatic Correlation
Get closer to the Control Theory mathematical definition
- Define the model. Use a single schema (Apache Avro)
- Events are agnostic. Can represent logs, stack trace, metric, user action, HTTP event, etc.
- Every event should have a set of common fields as well as optional key/value attributes
Use a Common Schema
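One way the schema points above could look in practice: an Avro record definition written as a plain Python dict, plus a helper that builds events of any kind in the same shape. The field names (`timestamp`, `source`, `kind`, `message`, `attributes`) and the helper `make_event` are hypothetical; the talk names Apache Avro but does not show its schema.

```python
import json
import time

# Hypothetical Avro record schema: field names are illustrative, not from the talk
EVENT_SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "timestamp", "type": "long"},   # epoch milliseconds
        {"name": "source", "type": "string"},    # host, service, or user
        {"name": "kind", "type": "string"},      # log, metric, http, ...
        {"name": "message", "type": "string"},
        {
            "name": "attributes",                # optional key/value pairs
            "type": {"type": "map", "values": "string"},
            "default": {},
        },
    ],
}


def make_event(source, kind, message, timestamp=None, **attributes):
    """Build an event dict with the common fields plus optional attributes."""
    if timestamp is None:
        timestamp = time.time() * 1000
    return {
        "timestamp": int(timestamp),
        "source": source,
        "kind": kind,
        "message": message,
        "attributes": {key: str(value) for key, value in attributes.items()},
    }


# A log line and a metric sample end up with the same top-level shape
log_event = make_event("api-gateway", "log", "upstream timeout", timestamp=1, status=504)
metric_event = make_event("host-17", "metric", "cpu_percent", timestamp=2, value=90)
print(json.dumps(EVENT_SCHEMA, indent=2))
```

Keeping every attribute value a string keeps the map type simple, at the cost of parsing numbers back out downstream; a union type would be the alternative.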
Deterministic models are better to start with (Fuzzy Logic, Rules)
Choose your logic and start running it across your data (schema)
Apply similarity checks to strings first (TF-IDF, BM25, Fuzzy, other classifiers)
Look into correlations, start with simple obvious ones, before building classifiers (Unsupervised/Semi-supervised learning is much more relevant overall)
Build your prediction models on time series data first. (Statistics has solid models)
Time and context are dimensions you will be able to start addressing
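To make the string-similarity step concrete, here is a small standard-library sketch of TF-IDF weighting with cosine similarity. It uses a smoothed IDF and whitespace tokenization as simplifying assumptions; the sample messages and function names are invented for the example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical alert messages
messages = [
    "connection timeout to database db-1",
    "connection timeout to database db-2",
    "disk usage above 90 percent on host-17",
]
vectors = tfidf_vectors(messages)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

The two timeout messages score as near-duplicates while the disk alert shares no terms with them, which is the kind of grouping a deterministic first pass can provide before any classifier is trained.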
Best Practices
Use It In Production
- Your team == your users
- Ask for feedback
- Re-calculate relevancy
- Apply recommendations based on your own team's knowledge