Detecting Large-Scale System Problems by Mining Console Logs


Authors: Wei Xu*, Ling Huang†, Armando Fox*, David Patterson*, Michael Jordan*
Conference: ICML 2010, ACM SOSP 2009
Advisor: Yuh-Jye Lee
Reporter: Yi-Hsiang Yang
Email: [email protected]

Outline

• Introduction
• Methodology
• Evaluation and Visualization
• Conclusion


Introduction

• Information in console logs?
  - Console logs rarely help in large-scale datacenter services
  - Operational problems depend on the deployment and runtime environment
  - A typical console log is much more structured than it appears
• Anomaly detection
  - Unusual log messages often indicate the source of the problem


Workflow

• Log parsing
  - Convert a log message from unstructured text to a data structure
• Feature creation
  - Construct the state ratio vector and the message count vector features
• Anomaly detection
  - Principal Component Analysis (PCA)-based anomaly detection method
• Visualization
  - Decision tree


Workflow


Log Parsing with Source Code

• Difficulty: templatizing automatically
  - C: fprintf(LOG, "starting: xact %d is %s")
  - Java: CLog.info("starting: " + txn)
• Not easy to distinguish variables and states


Parsing Approach-Source Code

• Generate the source code's abstract syntax tree (AST)
• Use the AST to identify all method calls on objects of the classes (or their subclasses)
• Deduce the types of variables in message templates
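Once message templates are available, matching a log line against them can be sketched as below. This is a minimal illustration and not the authors' parser: it assumes templates are printf-style strings with only %d and %s placeholders, and that each %s value is a single whitespace-free token. The helper names `template_to_regex` and `parse_message` and the sample message are hypothetical.

```python
import re

def template_to_regex(template):
    """Turn a printf-style template into a regex with one group per variable."""
    pattern = ""
    # Split on %d / %s while keeping the specifiers themselves.
    for piece in re.split(r"(%[ds])", template):
        if piece == "%d":
            pattern += r"(-?\d+)"        # integer variable
        elif piece == "%s":
            pattern += r"(\S+)"          # string variable (assumed to be one token)
        else:
            pattern += re.escape(piece)  # constant text of the message
    return re.compile("^" + pattern + "$")

def parse_message(message, templates):
    """Return (template, variables) for the first matching template, or None."""
    for template in templates:
        match = template_to_regex(template).match(message)
        if match:
            return template, list(match.groups())
    return None  # parse failure: no template matches this message

# Template taken from the C example on the slide; the log line is made up.
templates = ["starting: xact %d is %s"]
result = parse_message("starting: xact 325 is COMMITTING", templates)
print(result)
```

The extracted values are returned as strings; deciding which of them is an identifier and which is a state is a separate step that this sketch does not attempt.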


Parsing Approach-Source Code


Parsing Approach-Log

• Apache Lucene reverse index
• Implemented as a Hadoop map-reduce job
  - Replicate the index to every node and partition the logs
  - The map stage performs the reverse-index search
  - The reduce stage processing depends on the features to be constructed
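The idea behind the reverse-index search can be illustrated with a toy in-memory index. This is only a stand-in for Apache Lucene and does not use Lucene or Hadoop: it assumes templates are whitespace-separated strings whose variables are written as %d or %s, and it picks the template sharing the most constant words with the message. The function names and the second template are hypothetical.

```python
from collections import defaultdict

def build_reverse_index(templates):
    """Map each constant word to the set of template ids that contain it."""
    index = defaultdict(set)
    for template_id, template in enumerate(templates):
        for word in template.split():
            if not word.startswith("%"):  # skip variable placeholders
                index[word].add(template_id)
    return index

def best_template(message, templates, index):
    """Return the template sharing the most constant words with the message, or None."""
    hits = defaultdict(int)
    for word in set(message.split()):
        for template_id in index.get(word, ()):
            hits[template_id] += 1
    if not hits:
        return None  # no template shares any word with the message
    return templates[max(hits, key=hits.get)]

# First template is from the slides; the second is a made-up example.
templates = ["starting: xact %d is %s", "closing: connection %d"]
index = build_reverse_index(templates)
candidate = best_template("starting: xact 325 is COMMITTING", templates, index)
print(candidate)
```

In a map-reduce setting, a lookup like `best_template` would run in the map stage over each node's share of the log, with the index copied to every node.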


Parsing Approach


Feature Creation

• The state ratio vector
  - Each state ratio vector: a group of state variables in a time window
• The message count vector
  - Each vector dimension: a different message type
  - Value of the dimension: the number of messages of that type that appear in the message group
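A message count vector as described above can be sketched in a few lines. The grouping key is an assumption: here messages are grouped by a hypothetical identifier such as a block ID, and the identifiers and message type names in the sample data are invented for illustration.

```python
from collections import Counter, defaultdict

def message_count_vectors(parsed_messages, message_types):
    """Build one count vector per message group.

    parsed_messages: iterable of (group_identifier, message_type) pairs.
    message_types:   ordered list of all message types (the vector dimensions).
    """
    groups = defaultdict(Counter)
    for identifier, message_type in parsed_messages:
        groups[identifier][message_type] += 1
    # One dimension per message type; the value is how often it appears in the group.
    return {
        identifier: [counts[t] for t in message_types]
        for identifier, counts in groups.items()
    }

# Hypothetical parsed messages: (identifier, message type).
parsed = [
    ("blk_1", "receiving"), ("blk_1", "received"), ("blk_1", "received"),
    ("blk_2", "receiving"), ("blk_2", "deleting"),
]
types = ["receiving", "received", "deleting"]
vectors = message_count_vectors(parsed, types)
print(vectors)
```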


Feature Creation-The message count vector


Anomaly Detection-Principal Component Analysis (PCA)

• Applied Term Frequency / Inverse Document Frequency (TF-IDF) weighting
• Replace each entry $y_{i,j}$ with a weighted entry $w_{i,j} \equiv y_{i,j} \log(n / df_j)$, where $n$ is the total number of message groups and $df_j$ is the number of message groups that contain the $j$-th message type
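The weighting above can be written out directly. This is a minimal sketch that assumes the message count vectors are rows of a list-of-lists matrix, takes n as the number of rows (message groups), and assigns weight 0 to a message type that appears in no group; the function name and the sample matrix are hypothetical.

```python
import math

def tfidf_weight(count_matrix):
    """Apply w[i][j] = y[i][j] * log(n / df[j]) to a matrix of message counts."""
    n = len(count_matrix)              # number of message groups
    num_types = len(count_matrix[0])   # number of message types
    # df[j]: how many message groups contain message type j at least once.
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(num_types)]
    return [
        [row[j] * math.log(n / df[j]) if df[j] else 0.0 for j in range(num_types)]
        for row in count_matrix
    ]

# Three message groups, two message types (made-up counts).
counts = [[2, 0], [1, 0], [1, 3]]
weighted = tfidf_weight(counts)
print(weighted)
```

A message type that appears in every group gets weight log(1) = 0, so common messages stop dominating the vectors, while a type seen in only one group is amplified.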


Anomaly Detection-Principal Component Analysis (PCA)
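The slides name the method but do not spell it out, so the following is only a rough sketch of one common PCA-based scheme: project each feature vector onto the dominant principal direction and score it by the squared length of what is left over. It assumes a single principal component, estimates it with power iteration in plain Python, and ranks vectors by residual instead of applying a statistical threshold. All function names and data are hypothetical.

```python
import math

def top_component(rows, iterations=200):
    """Approximate the first principal direction of mean-centered rows by power iteration."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim  # fixed start vector keeps the run deterministic
    for _ in range(iterations):
        # Multiply by the unnormalized covariance matrix: w = X^T (X v).
        proj = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        w = [sum(proj[i] * rows[i][j] for i in range(len(rows))) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def squared_prediction_errors(vectors):
    """Score each vector by its squared distance from the top principal direction."""
    n, dim = len(vectors), len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(dim)]
    centered = [[v[j] - means[j] for j in range(dim)] for v in vectors]
    pc = top_component(centered)
    errors = []
    for row in centered:
        score = sum(row[j] * pc[j] for j in range(dim))
        residual = [row[j] - score * pc[j] for j in range(dim)]
        errors.append(sum(x * x for x in residual))
    return errors

# Six vectors follow the same pattern; the last one breaks it in the third dimension.
vectors = [[1, 1, 0], [2, 2, 0], [3, 3, 0], [4, 4, 0], [5, 5, 0], [6, 6, 0], [3, 3, 2]]
errors = squared_prediction_errors(vectors)
most_anomalous = errors.index(max(errors))
print(most_anomalous, errors)
```

In practice more than one component would normally be kept and a threshold chosen on the residual; both are left out here to keep the sketch short.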

Evaluation and Visualization

• From Elastic Compute Cloud (EC2)
• 203 nodes of HDFS and 1 node of Darkstar



• Parsing fails when the parser cannot find a message template that matches the message and therefore cannot extract the message variables.



• With 50 nodes, parsing takes less than 3 minutes; with 10 nodes, it takes less than 10 minutes


Evaluation and Visualization-Darkstar

• DarkMud
  - Provided by the Darkstar team
  - Emulated 60 user clients in the DarkMud virtual world performing random operations
  - Ran the experiment for 4800 seconds
  - Injected a performance disturbance by capping the CPU from 1400 to 1800 seconds


Disturbance by capping the CPU


Evaluation and Visualization-Darkstar

• The ratio of the number of ABORTING to the number of COMMITTING increases from about 1:2000 to about 1:2
• Darkstar does not adjust the transaction timeout accordingly
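A ratio like the one above falls out of counting state values per time window. The sketch below assumes each parsed message yields a (timestamp in seconds, state) pair and uses fixed, non-overlapping windows; the function name, window length, and event data are hypothetical.

```python
from collections import Counter

def state_ratio_vectors(events, window_seconds, states):
    """Count how often each state occurs in each fixed-length time window.

    events: iterable of (timestamp_seconds, state) pairs.
    states: ordered list of state names (the vector dimensions).
    """
    windows = {}
    for timestamp, state in events:
        index = int(timestamp // window_seconds)  # which window the event falls into
        windows.setdefault(index, Counter())[state] += 1
    return {
        index: [counts[s] for s in states]
        for index, counts in sorted(windows.items())
    }

# Made-up events: (timestamp, state).
events = [(1, "COMMITTING"), (2, "COMMITTING"), (3, "ABORTING"), (12, "COMMITTING")]
ratios = state_ratio_vectors(events, window_seconds=10, states=["ABORTING", "COMMITTING"])
print(ratios)
```

Comparing the two counts in each window gives the ABORTING to COMMITTING ratio for that window, so a shift such as 1:2000 to 1:2 shows up as a change in the vector's direction.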


Evaluation and Visualization-Darkstar

•Augmented each feature vector using the timestamp of the last message in that group


Evaluation and Visualization-Hadoop



Conclusion

• Using source code as a reference to understand the structure of console logs makes it possible to parse logs accurately

• New opportunities for turning built-in console logs into a powerful monitoring system for problem detection


Thanks for your attention

Q&A
