Anomaly Detection and Root Cause Analysis in Distributed Application Transactions
Yuchen Zhao @
Software is Eating the World
It’s critical to make sure the software
is running properly
How? Through monitoring!
Monitoring shouldn’t be very hard… right?
Well, it can become a bit more complex...
Or… really complex...
Keeping applications running is hard.
Challenge 1: Enterprise applications are complex
Challenge 2: Data is heterogeneous.
Its volume is massive and growing
Challenge 3: Too many signals.
Finding anomalies & root causes is non-trivial.
Our solution: Relevant Fields
Machine Learning + Engineering
Q1: How to get & organize data?
Collect data in the form of Business Transactions
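As a rough illustration of what one Business Transaction record could look like, here is a minimal Python sketch. The schema is hypothetical: the field names (`response_time_ms`, `fields`, and so on) and the 1000 ms slow threshold are assumptions for illustration, since the talk does not specify a data format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class BusinessTransaction:
    """One end-to-end request, e.g. a single travel booking (hypothetical schema)."""
    name: str                 # e.g. "book_flight"
    timestamp: float          # epoch seconds when the transaction started
    response_time_ms: float   # end-to-end latency
    error: bool = False       # whether the transaction failed
    # Free-form key/value context attached to the transaction.
    fields: Dict[str, str] = field(default_factory=dict)

    def is_slow(self, threshold_ms: float = 1000.0) -> bool:
        # The threshold is an assumed value, not one given in the talk.
        return self.response_time_ms > threshold_ms


bt = BusinessTransaction(
    name="book_flight",
    timestamp=1_700_000_000.0,
    response_time_ms=2350.0,
    fields={"airline": "AA", "tier": "checkout"},
)
```

Keeping the context as a flat key/value map is one simple way to make every field a candidate for the relevancy analysis described later.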
Q2: Can you give a real use case?
A hypothetical travel booking site with data in BT
An unexpected incident:
Step 1: filtering
Step 2: find relevant fields
the relevancy score
“airline:AA” related transactions:
● 2% occurrence normally among all travel bookings
● 82% of the current slow transactions are from “AA”.
● 41 times more significant than normal.
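The “41 times” figure is consistent with dividing the share among slow transactions by the normal share (82% / 2% = 41). The sketch below reproduces that arithmetic; the function name `relevancy_ratio` is hypothetical, and reading the score as a plain ratio is an interpretation of the numbers on the slide, not a confirmed definition.

```python
def relevancy_ratio(query_share: float, baseline_share: float) -> float:
    """Ratio of a field value's share in the query set to its normal share."""
    if baseline_share <= 0:
        raise ValueError("baseline_share must be positive")
    return query_share / baseline_share


# Numbers taken from the slide: 82% of slow transactions vs. 2% normally.
ratio = relevancy_ratio(query_share=0.82, baseline_share=0.02)
print(round(ratio))  # prints 41
```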
What’s the root cause?
Step 3: take actions!
Q3: What’s the design of the system?
Architecture Overview
Data Collection
Smart Code Instrumentation
watch every line of code, self-learning, automatic
Stream Processing & Storage
Relevant Fields Processing
all transactions (baseline)
error/slow transactions (query)
Baseline & Query Sets
Q4: How to score the field?
all transactions (baseline)
error/slow transactions (query)
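One way to turn the baseline/query comparison into a ranking is to compute, for every field value, its share in each set and sort by the ratio. The following Python sketch does that under stated assumptions: transactions are plain dictionaries of field name to value, `rank_relevant_fields` and `min_baseline_share` are hypothetical names, and the plain ratio stands in for whatever scoring function the actual system uses.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Transaction = Dict[str, str]
FieldValue = Tuple[str, str]


def field_value_shares(transactions: Iterable[Transaction]) -> Dict[FieldValue, float]:
    """Fraction of transactions in which each (field, value) pair occurs."""
    txs = list(transactions)
    if not txs:
        return {}
    counts = Counter((k, v) for tx in txs for k, v in tx.items())
    return {kv: c / len(txs) for kv, c in counts.items()}


def rank_relevant_fields(
    baseline: Iterable[Transaction],
    query: Iterable[Transaction],
    min_baseline_share: float = 1e-6,  # assumed floor to avoid division by zero
) -> List[Tuple[FieldValue, float]]:
    """Rank field values by how over-represented they are in the query set."""
    base = field_value_shares(baseline)
    qry = field_value_shares(query)
    scored = [
        (kv, share / max(base.get(kv, 0.0), min_baseline_share))
        for kv, share in qry.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data mirroring the travel booking example: 2% "AA" normally, 82% when slow.
baseline = [{"airline": "AA", "region": "US"}] * 2 + [{"airline": "UA", "region": "US"}] * 98
query = [{"airline": "AA", "region": "US"}] * 41 + [{"airline": "UA", "region": "US"}] * 9

ranked = rank_relevant_fields(baseline, query)
top_field, top_score = ranked[0]
```

In this toy data `("airline", "AA")` ranks first, while a field value that is equally common in both sets, such as `("region", "US")`, scores about 1 and is not highlighted.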
Optimization: Dynamic Baseline
Infer baseline context from query automatically
[Figure: query transactions compared against the transactions of Entity 1, Entity 2, ..., Entity n]
Baseline entity is auto-learned from two dimensions:
● physical (applications, tiers, nodes, etc)
● temporal
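The idea of inferring the baseline context from the query can be sketched as follows. This is a simplified stand-in, not the learned approach the talk describes: it uses a fixed rule that picks the entities the query transactions touch (physical dimension) and a look-back window before them (temporal dimension). The names `infer_baseline`, `entity_key`, and `lookback_s` are hypothetical.

```python
from typing import Any, Dict, List

Transaction = Dict[str, Any]


def infer_baseline(
    all_txs: List[Transaction],
    query_txs: List[Transaction],
    entity_key: str = "tier",    # assumed physical dimension
    lookback_s: float = 3600.0,  # assumed temporal window before the query
) -> List[Transaction]:
    """Select baseline transactions sharing the query's entities and time range."""
    if not query_txs:
        return []
    entities = {tx[entity_key] for tx in query_txs}
    start = min(tx["timestamp"] for tx in query_txs) - lookback_s
    end = max(tx["timestamp"] for tx in query_txs)
    return [
        tx for tx in all_txs
        if tx[entity_key] in entities and start <= tx["timestamp"] <= end
    ]


all_txs = [
    {"tier": "checkout", "timestamp": 1000.0},  # too old for the window
    {"tier": "checkout", "timestamp": 2000.0},
    {"tier": "search", "timestamp": 3000.0},    # different entity
    {"tier": "checkout", "timestamp": 5000.0},
    {"tier": "checkout", "timestamp": 5100.0},
]
query_txs = [tx for tx in all_txs if tx["timestamp"] >= 5000.0]

baseline = infer_baseline(all_txs, query_txs)
```

Restricting the baseline this way compares slow transactions against traffic from the same part of the system and the same period, rather than against everything ever recorded.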
Score Normalization
Normalize the score using a function derived from
sigmoid:
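The transcript does not preserve the actual formula. Purely as an illustration of what a sigmoid-derived normalization can look like, the sketch below rescales the logistic function so that non-negative raw scores map into the range [0, 1). The specific form `2 * sigmoid(k * x) - 1` and the steepness `k` are assumptions, not the function used by the authors.

```python
import math


def normalize_score(raw: float, k: float = 0.1) -> float:
    """Squash a non-negative raw score into [0, 1) with a rescaled sigmoid.

    Hypothetical form: 2 * sigmoid(k * x) - 1, with negative inputs clamped to 0.
    """
    x = max(raw, 0.0)
    return 2.0 / (1.0 + math.exp(-k * x)) - 1.0


low = normalize_score(1.0)
high = normalize_score(41.0)
```

A bounded score of this kind makes values comparable across fields: a raw score of 0 maps to 0, and larger raw scores approach but never reach 1.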
Score Example
For more details, please check out our demo paper in ICDM 2015:
Discovering Anomalies and Root Causes in Applications via Relevant Fields Analysis,
in Proceedings of the 15th IEEE International Conference on Data Mining
Ongoing work...
Support rich data types
● time series
● text
● graphs
● ...
We’re selling!
Thank you!