Detecting Large-Scale System Problems by Mining Console Logs
Authors: Wei Xu*, Ling Huang†, Armando Fox*, David Patterson*, Michael Jordan*
Conference: ICML 2010, ACM SOSP 2009
Advisor: Yuh-Jye Lee
Reporter: Yi-Hsiang Yang
Email: [email protected]
Outline
• Introduction
• Methodology
• Evaluation and Visualization
• Conclusion
Introduction
• What information do console logs provide?
  Console logs rarely help operators in large-scale datacenter services
  Operational problems depend on the deployment and runtime environment
  A typical console log is much more structured than it appears
• Anomaly detection
  Unusual log messages often indicate the source of the problem
Workflow
• Log parsing
  Convert a log message from unstructured text to a data structure
• Feature creation
  Construct the state ratio vector and the message count vector features
• Anomaly detection
  Principal Component Analysis (PCA)-based anomaly detection method
• Visualization
  Decision tree
Log Parsing with Source Code
• Difficulty: templatizing messages automatically
  C: fprintf(LOG, "starting: xact %d is %s")
  Java: CLog.info("starting: " + txn)
• Not easy to distinguish variables and states
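To make the templatizing step concrete, here is a minimal sketch (not the authors' implementation) that turns the C format string from the slide into a regular expression and uses it to pull the variables out of one message. The helper name template_to_regex and the sample message are assumptions made for illustration, and only the %d and %s specifiers are handled.

```python
import re

def template_to_regex(fmt):
    """Turn a printf-style format string into a regex with one group per specifier."""
    # Split on %d / %s and keep the specifiers so they can become capture groups.
    parts = re.split(r"(%[ds])", fmt)
    pattern = ""
    for part in parts:
        if part == "%d":
            pattern += r"(-?\d+)"       # integer variable
        elif part == "%s":
            pattern += r"(.*)"          # free-form string variable
        else:
            pattern += re.escape(part)  # constant text of the template
    return re.compile("^" + pattern + "$")

# Format string taken from the slide; the log message itself is made up.
regex = template_to_regex("starting: xact %d is %s")
match = regex.match("starting: xact 325 is COMMITTING")
fields = match.groups() if match else None
```

The Java case is harder because the message is built by string concatenation, so the variable part is only visible in the source code, not in a format string.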
Parsing Approach-Source Code
• Generate the source code’s abstract syntax tree (AST)
• Use the AST to identify all method calls on objects of the logger classes (or their subclasses)
• Deduce the types of variables in message templates
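The AST steps above can be sketched with Python's own ast module. This is only an analogy: the slides describe Java source, and this sketch does not deduce variable types. The logger variable name log, the sample source, and the helper names are all assumptions for illustration.

```python
import ast

# Hypothetical source to analyze; only calls on the name `log` count as logging calls.
SOURCE = '''
log.info("starting: " + txn)
log.info("closing connection")
print("not a log call")
'''

def flatten_concat(node):
    """Flatten a chain of '+' into template pieces; non-literals become '(.*)'."""
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        return flatten_concat(node.left) + flatten_concat(node.right)
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return [node.value]
    return ["(.*)"]

def extract_templates(source, logger_name="log"):
    """Collect one message template per method call on the logger object."""
    templates = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == logger_name
                and node.args):
            templates.append("".join(flatten_concat(node.args[0])))
    return templates

templates = extract_templates(SOURCE)
```

Each variable in a concatenation becomes a wildcard in the template, which is the information that is lost if only the log text is available.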
Parsing Approach-Log
• Apache Lucene reverse index
• Implemented as a Hadoop map-reduce job
  Replicate the index to every node and partition the logs
  The map stage performs the reverse-index search
  The reduce stage processing depends on the features to be constructed
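The reverse-index search can be illustrated with a toy in-memory index. This is a stand-in written in plain Python, not Lucene and not a Hadoop job; the second template, the tokenizer, and the function names are assumptions for illustration.

```python
import re
from collections import defaultdict

# Hypothetical message templates; "(.*)" marks a variable position.
TEMPLATES = {
    0: "starting: xact (.*) is (.*)",
    1: "closing connection to (.*)",
}

def tokens(text):
    """Lower-cased alphabetic words of a string."""
    return {t.lower() for t in re.findall(r"[A-Za-z]+", text)}

def build_index(templates):
    """Map every constant word of a template to the ids of templates containing it."""
    index = defaultdict(set)
    for tid, template in templates.items():
        for tok in tokens(template.replace("(.*)", " ")):
            index[tok].add(tid)
    return index

def lookup(index, templates, message):
    """Rank candidate templates by shared words, then confirm with a full regex match."""
    scores = defaultdict(int)
    for tok in tokens(message):
        for tid in index.get(tok, ()):
            scores[tid] += 1
    for tid in sorted(scores, key=scores.get, reverse=True):
        m = re.fullmatch(templates[tid], message)
        if m:
            return tid, m.groups()
    return None  # parse failure: no template matches

index = build_index(TEMPLATES)
result = lookup(index, TEMPLATES, "starting: xact 325 is COMMITTING")
```

In a map-reduce setting, a function like lookup would run inside the map stage on each node's partition of the log, using a local copy of the index.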
Feature Creation
• The state ratio vector
  Each state ratio vector: a group of state variables in a time window
• The message count vector
  Each vector dimension: a different message type
  Value of the dimension: the number of messages of that type that appear in the message group
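Both features amount to counting. The sketch below assumes parsed messages arrive as simple tuples and that a message group is keyed by an identifier such as blk_1; these input shapes, the identifiers, and the function names are assumptions for illustration, not the authors' data model.

```python
from collections import Counter, defaultdict

def state_ratio_vectors(events, states, window):
    """events: iterable of (timestamp_seconds, state). Returns one count vector per time window."""
    per_window = defaultdict(Counter)
    for timestamp, state in events:
        per_window[int(timestamp // window)][state] += 1
    return {w: [c[s] for s in states] for w, c in sorted(per_window.items())}

def message_count_vectors(messages, num_types):
    """messages: iterable of (group_id, message_type_index). Returns one count vector per group."""
    per_group = defaultdict(Counter)
    for group_id, msg_type in messages:
        per_group[group_id][msg_type] += 1
    return {g: [c[t] for t in range(num_types)] for g, c in per_group.items()}

# Made-up inputs for illustration.
ratios = state_ratio_vectors(
    [(0, "COMMITTING"), (5, "COMMITTING"), (12, "ABORTING")],
    states=["ABORTING", "COMMITTING"], window=10)
counts = message_count_vectors(
    [("blk_1", 0), ("blk_1", 2), ("blk_2", 0), ("blk_1", 0)], num_types=3)
```

Here ratios holds one [ABORTING, COMMITTING] count pair per 10-second window, and counts holds one vector per group with one dimension per message type.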
Feature Creation-The message count vector
Anomaly Detection-Principal Component Analysis (PCA)
• Apply Term Frequency / Inverse Document Frequency (TF-IDF) weighting
• Replace each entry $y_{i,j}$ with a weighted entry $w_{i,j} \equiv y_{i,j} \log(n / df_j)$, where $n$ is the total number of message groups and $df_j$ is the number of message groups that contain the $j$-th message type
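A minimal sketch of the weighting above, followed by one common way to score rows with PCA: the squared prediction error (SPE), i.e. the squared distance of a row from the subspace spanned by the leading principal components. The slides do not spell out the scoring or the threshold, so the SPE choice, the single-component subspace, the power-iteration solver, and the toy data are assumptions for illustration.

```python
import math

def tfidf_weight(Y):
    """Replace each count y[i][j] with y[i][j] * log(n / df[j])."""
    n, num_types = len(Y), len(Y[0])
    # df[j]: number of message groups containing message type j at least once.
    df = [sum(1 for row in Y if row[j] > 0) for j in range(num_types)]
    return [[row[j] * math.log(n / df[j]) if df[j] else 0.0
             for j in range(num_types)] for row in Y]

def top_component(X, iterations=200):
    """Leading principal direction of mean-centred rows X, found by power iteration."""
    dim = len(X[0])
    v = [1.0] * dim
    for _ in range(iterations):
        # Apply X^T X to v without forming the covariance matrix explicitly.
        proj = [sum(x[k] * v[k] for k in range(dim)) for x in X]
        w = [sum(p * x[k] for p, x in zip(proj, X)) for k in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v

def squared_prediction_error(Y):
    """SPE of each row: squared residual after projecting onto the top component."""
    n, dim = len(Y), len(Y[0])
    mean = [sum(row[k] for row in Y) / n for k in range(dim)]
    X = [[row[k] - mean[k] for k in range(dim)] for row in Y]
    v = top_component(X)
    spe = []
    for x in X:
        p = sum(x[k] * v[k] for k in range(dim))
        spe.append(sum((x[k] - p * v[k]) ** 2 for k in range(dim)))
    return spe

weighted = tfidf_weight([[1, 0], [2, 3]])                    # made-up counts
toy = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [3, 0]]       # last row breaks the pattern
spe = squared_prediction_error(toy)
most_anomalous = max(range(len(spe)), key=spe.__getitem__)
```

Note that a message type present in every group gets weight log(1) = 0, so it no longer influences the detector. In a full pipeline the weighted matrix, rather than the raw toy counts used here, would be passed to squared_prediction_error, and rows whose SPE exceeds a chosen threshold would be flagged.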
Evaluation and Visualization
• Logs collected from Elastic Compute Cloud (EC2)
• 203 nodes of HDFS and 1 node of Darkstar
Evaluation and Visualization
• Parsing fails when no message template can be found that matches the message and extracts its message variables.
Evaluation and Visualization
• With 50 nodes, the job takes less than 3 minutes; with 10 nodes, less than 10 minutes
Evaluation and Visualization-Darkstar
• DarkMud
  Provided by the Darkstar team
  Emulated 60 user clients in the DarkMud virtual world performing random operations
  Ran the experiment for 4800 seconds
  Injected a performance disturbance by capping the CPU from 1400 to 1800 sec
Disturbance by capping the CPU
Evaluation and Visualization-Darkstar
• The ratio of ABORTING to COMMITTING messages increases from about 1:2000 to about 1:2
• Darkstar does not adjust the transaction timeout accordingly
Evaluation and Visualization-Darkstar
• Augmented each feature vector using the timestamp of the last message in that group
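As a small illustration of the augmentation step, the sketch below appends the latest timestamp in a group as one extra dimension. Treating the last message as the one with the largest timestamp, and the function name and sample values, are assumptions for illustration.

```python
def augment_with_last_timestamp(vector, timestamps):
    """Append the timestamp of the group's last message as an extra dimension."""
    return list(vector) + [max(timestamps)]

# Made-up feature vector and message timestamps (seconds) for one group.
augmented = augment_with_last_timestamp([3, 0, 1], [1400.2, 1401.7, 1400.9])
```

With the timestamp attached, detected anomalies can be placed on the experiment's timeline, for example against the 1400 to 1800 sec disturbance window.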
Evaluation and Visualization-Hadoop
Conclusion
• Using source code as a reference to understand the structure of console logs makes it possible to parse logs accurately
• New opportunities for turning built-in console logs into a powerful monitoring system for problem detection
Thanks for your attention
Q&A