UC Berkeley
Online System Problem Detection by Mining Console Logs
Wei Xu* Ling Huang†
Armando Fox* David Patterson* Michael Jordan*
*UC Berkeley † Intel Labs Berkeley
Why console logs?
• Detecting problems in large scale Internet services often requires detailed instrumentation
• Instrumentation can be costly to insert & maintain
  – High code churn
  – Often combine open-source building blocks that are not all instrumented
• Can we use console logs in lieu of instrumentation?
+ Easy for developers, so nearly all software has them
– Imperfect: not originally intended for instrumentation
Problems we are looking for
• The easy case – rare messages
• Harder but useful - abnormal sequences
NORMAL
receiving blk_1
received blk_1
receiving blk_2
ERROR
what is wrong with blk_2 ???
Overview and Contributions
• Accurate online detection with small latency
[Figure: detection pipeline. Free-text logs (200 nodes) → Parsing* → Frequent-pattern-based filtering (dominant cases: OK; non-pattern events passed on) → PCA detection (normal cases: OK; real anomalies: ERROR)]
* Large-scale system problem detection by mining console logs (SOSP '09)
Constructing event traces from console logs
• Parse: message type + variables
• Group messages by identifiers (automatically discovered)
  – Group ~= event trace
[Figure: an interleaved log of receiving / received / reading messages for blk_1 and blk_2, shown before and after grouping by block identifier]
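The two steps above (parse, then group by identifier) can be sketched as follows. This is a minimal illustration, not the authors' parser: the slides say identifiers are discovered automatically, whereas this sketch assumes a hard-coded, hypothetical `BLOCK_ID` pattern for HDFS-style block names, and the `<blk>` placeholder is likewise an assumption.

```python
import re
from collections import defaultdict

# Assumption: identifiers look like "blk_<number>". The slides describe
# automatic discovery; this fixed pattern only stands in for it.
BLOCK_ID = re.compile(r"blk_\d+")

def parse(line):
    """Split a log line into (message type, list of variables)."""
    variables = BLOCK_ID.findall(line)
    message_type = BLOCK_ID.sub("<blk>", line).strip()
    return message_type, variables

def group_by_identifier(lines):
    """Group message types by the identifier they mention, in arrival order."""
    traces = defaultdict(list)
    for line in lines:
        message_type, variables = parse(line)
        for identifier in variables:
            traces[identifier].append(message_type)
    return dict(traces)

log = [
    "receiving blk_1",
    "receiving blk_2",
    "received blk_1",
    "reading blk_1",
    "received blk_2",
]
traces = group_by_identifier(log)
print(traces["blk_1"])
print(traces["blk_2"])
```

Each value in `traces` is one group of messages, which the slide treats as approximately one event trace.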
Online detection: When to make detection?
• Cannot wait for the entire trace
  – Can last arbitrarily long time
• How long do we have to wait?
  – Long enough to keep correlations
  – Wrong cut = false positive
• Difficulties
  – No obvious boundaries
  – Inaccurate message ordering
  – Variations in session duration
[Figure: blk_1 messages along a time axis: receiving, received, reading, reading, deleting, deleted, receiving, received]
Frequent patterns help determine session boundaries
• Key Insight: Most messages/traces are normal
  – Strong patterns
• “Make common paths fast”
• Tolerate noise
Two stage detection overview
[Figure: two-stage pipeline. Free-text logs (200 nodes) → Parsing → Frequent-pattern-based filtering (dominant cases: OK; non-pattern events passed on) → PCA detection (normal cases: OK; real anomalies: ERROR)]
Stage 1 - Frequent patterns (1): Frequent event sets
• Coarse cut by time
• Find frequent item set
• Refine time estimation
• Repeat until all patterns found
[Figure: blk_1 messages along a time axis (receiving, received, reading, deleting, deleted, receiving, received), with leftover events (reading blk_1, error blk_1) and a PCA Detection box]
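A rough sketch of the first two steps (coarse cut by time, then find frequent item sets) is given below. It is a simplification under stated assumptions: it counts whole event sets instead of running a full frequent-itemset miner, it skips the "refine time estimation" and "repeat" steps, and the names `coarse_cut`, `frequent_event_sets`, `window`, and `min_support` are hypothetical.

```python
from collections import Counter

def coarse_cut(trace, window):
    """Keep the event types seen within `window` seconds of the first event."""
    start = trace[0][0]
    return frozenset(event for t, event in trace if t - start <= window)

def frequent_event_sets(traces, window, min_support):
    """Return the event sets whose share of traces is at least `min_support`."""
    counts = Counter(coarse_cut(trace, window) for trace in traces if trace)
    total = sum(counts.values())
    return {events: n / total
            for events, n in counts.items()
            if n / total >= min_support}

# Toy traces: (timestamp in seconds, event type), one list per identifier.
traces = [
    [(0, "receiving"), (2, "received"), (300, "reading")],
    [(0, "receiving"), (1, "received")],
    [(0, "receiving"), (3, "received"), (500, "deleting"), (501, "deleted")],
    [(0, "receiving"), (400, "error")],
]
patterns = frequent_event_sets(traces, window=10, min_support=0.5)
print(patterns)
```

With a 10-second cut, three of the four toy traces reduce to the set {receiving, received}, so that set is reported as frequent, while the events after the cut (reading, deleting, error) are left for later rounds or for the second stage.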
Stage 1 - Frequent patterns (2) : Modeling session duration time
• Assuming Gaussian?
  – 99.95th percentile estimation is off by half
  – 45% more false alarms
• Mixture distribution
  – Power-law tail + histogram head
[Figure: two plots of session duration: count vs. duration, and Pr(X >= x) vs. duration]
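To illustrate why a Gaussian fit can misjudge a high percentile on heavy-tailed durations, the sketch below compares it with a head-plus-tail estimate. Everything here is an assumption for illustration: the head uses the sorted sample directly instead of a binned histogram, the tail exponent comes from the standard maximum-likelihood (Hill) estimator, and `head_fraction` and the toy Pareto data are hypothetical, not the HDFS measurements.

```python
import math
from statistics import NormalDist, mean, pstdev

def gaussian_quantile(durations, p):
    """Quantile under a Gaussian fitted by mean and standard deviation."""
    return NormalDist(mean(durations), pstdev(durations)).inv_cdf(p)

def mixture_quantile(durations, p, head_fraction=0.9):
    """Empirical quantile in the head, fitted power-law quantile in the tail."""
    data = sorted(durations)
    n = len(data)
    if p <= head_fraction:
        return data[min(int(p * n), n - 1)]
    split = int(head_fraction * n)
    x_min = data[split]
    tail = data[split:]
    # Maximum-likelihood (Hill) estimate of the power-law exponent.
    alpha = len(tail) / sum(math.log(x / x_min) for x in tail)
    tail_share = (1 - p) / (1 - head_fraction)
    return x_min * tail_share ** (-1 / alpha)

# Deterministic heavy-tailed toy data: Pareto quantiles with exponent 2.
n = 10_000
durations = [(1 - (i + 0.5) / n) ** -0.5 for i in range(n)]
true_q = (1 - 0.9995) ** -0.5  # exact 99.95th percentile of that Pareto

print(round(true_q, 1),
      round(mixture_quantile(durations, 0.9995), 1),
      round(gaussian_quantile(durations, 0.9995), 1))
```

On this toy data the Gaussian estimate lands far below the true 99.95th percentile, while the head-plus-tail estimate stays close to it. A timeout set from the Gaussian value would therefore cut many sessions too early.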
Stage 2 - Handling noise with PCA detection
• More tolerant to noise
• Principal Component Analysis (PCA) based detection
[Figure: detection pipeline (parsing, frequent-pattern-based filtering, PCA detection), repeated from the overview]
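The idea behind PCA-based detection can be sketched in a few lines: learn the dominant direction of normal feature vectors, then flag vectors that lie far from it. The sketch below is an illustration only. It assumes per-trace event-count vectors as features, keeps a single principal component found by power iteration, and uses a hypothetical fixed `THRESHOLD`; the slides do not specify any of these choices.

```python
import math

def top_component(rows, iterations=100):
    """Return (column means, unit vector of the top principal direction)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d  # starting guess for power iteration
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate data or unlucky start: keep the last guess
            break
        v = [x / norm for x in w]
    return means, v

def residual_error(row, means, v):
    """Squared distance between a centered row and its projection onto v."""
    c = [x - m for x, m in zip(row, means)]
    proj = sum(a * b for a, b in zip(c, v))
    return sum((a - proj * b) ** 2 for a, b in zip(c, v))

# Toy count vectors [receiving, received]: normal traces have equal counts.
normal = [[1, 1], [2, 2], [3, 3], [2, 2]]
means, v = top_component(normal)

THRESHOLD = 0.5  # hypothetical cut-off; a real detector would derive it from data
for row in ([3, 3], [3, 1]):
    error = residual_error(row, means, v)
    print(row, "ERROR" if error > THRESHOLD else "OK")
```

The vector [3, 1] (three "receiving" messages but only one "received") breaks the correlation learned from the normal traces, so its residual is large, while [3, 3] projects almost exactly onto the learned direction.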
Frequent pattern matching filters most of the normal events
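[Figure: detection pipeline annotated with event shares. 100% of parsed events enter frequent-pattern-based filtering; 86% are dominant cases (OK); the remaining 14% are non-pattern events sent to PCA detection, of which 13.97% are normal cases (OK) and 0.03% are real anomalies (ERROR)]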
Evaluation setup
• Hadoop file system (HDFS)
• Experiment on Amazon’s EC2 cloud
  – 203 nodes x 48 hours
  – Running standard map-reduce jobs
  – ~24 million lines of console logs
• 575,000 traces
  – ~680 distinct ones
• Manual label from previous work
  – Normal/abnormal + why it is abnormal
  – “Eventually normal” – did not consider time
  – For evaluation only
Frequent patterns in HDFS
Frequent pattern               99.95th percentile duration   % of messages
Allocate, begin write          13 sec                        20.3%
Done write, update metadata    8 sec                         44.6%
Delete                         -                             12.5%
Serving block                  -                             3.8%
Read exception                 -                             3.2%
Verify block                   -                             1.1%
Total                                                        85.6%
(Total events ~20 million)
• Covers most messages
• Short durations
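One way to read the pattern table above is as a lookup that the online filter consults per identifier: a trace that completes a pattern within its duration bound is filtered out as normal, and anything else is passed on. The sketch below works under that assumption. The event names in `PATTERNS` are hypothetical stand-ins derived loosely from the table rows, and `classify` is an offline simplification of what would be a streaming matcher.

```python
# Hypothetical pattern table: event set -> maximum expected duration (seconds).
PATTERNS = {
    frozenset({"allocate", "begin write"}): 13,
    frozenset({"done write", "update metadata"}): 8,
}

def classify(events, patterns):
    """Classify one identifier's time-ordered (timestamp, event) list.

    Returns "matched", "timed out", or "non-pattern".
    """
    if not events:
        return "non-pattern"
    start, first_event = events[0]
    for pattern, max_duration in patterns.items():
        if first_event not in pattern:
            continue
        seen = set()
        for t, event in events:
            if t - start > max_duration:
                break  # waited as long as the duration model allows
            if event in pattern:
                seen.add(event)
            if seen == pattern:
                return "matched"
        return "timed out"
    return "non-pattern"

print(classify([(0, "allocate"), (5, "begin write")], PATTERNS))
print(classify([(0, "allocate"), (60, "begin write")], PATTERNS))
print(classify([(0, "read exception")], PATTERNS))
```

In this reading, "matched" traces are the dominant cases that need no further work, while "timed out" and "non-pattern" traces would go to the PCA stage.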
Detection latency
• Detection latency is dominated by the wait time
[Figure: detection latency for four cases: single-event pattern, frequent pattern (matched), frequent pattern (timed out), non-pattern events]
Detection accuracy
           True positives   False positives   False negatives   Precision   Recall
Online     16,916           2,748             0                 86.0%       100.0%
Offline    16,808           1,746             108               90.6%       99.3%
(Total traces = 575,319)
• Ambiguity on “abnormal”
  – Manual labels: “eventually normal”
  – More than 600 FPs in online detection are due to very long latency
  – E.g. a write session takes > 500 sec to complete (99.99th percentile is 20 sec)
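The precision and recall columns follow from the counts in the accuracy table via the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). A small check, using only the numbers reported on the slide:

```python
def precision_recall(tp, fp, fn):
    """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

online = precision_recall(16916, 2748, 0)
offline = precision_recall(16808, 1746, 108)

for name, (precision, recall) in (("Online", online), ("Offline", offline)):
    print(f"{name}: precision {precision:.2%}, recall {recall:.2%}")
```

The computed values agree with the slide to within the last digit. The offline recall evaluates to about 99.36%, which the slide appears to report truncated as 99.3%.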
Future work
• Distributed log stream processing
  – Handle large scale cluster + partial failures
• Clustering alarms
• Allowing feedback from operators
• Correlation on logs from multiple applications / layers
Summary
http://www.cs.berkeley.edu/~xuw/
Wei Xu <[email protected]>
           Precision   Recall
Online     86.0%       100.0%
[Figure: detection pipeline recap. Free-text logs (200 nodes, 24 million lines) → Parsing → Frequent-pattern-based filtering (dominant cases: OK; non-pattern events passed on) → PCA detection (normal cases: OK; real anomalies: ERROR)]