Information-Theoretic Measures for Anomaly Detection


Transcript of Information-Theoretic Measures for Anomaly Detection

Page 1: Information-Theoretic Measures for Anomaly Detection

Information-Theoretic Measures for Anomaly Detection

Wenke Lee and Dong Xiang (North Carolina State University)

IEEE Security and Privacy, 2001

Speaker: Chang Huan Wu

2009/4/14

Page 2: Information-Theoretic Measures for Anomaly Detection


Outline

Introduction

Information-Theoretic Measures

Case Studies

Conclusions

Page 3: Information-Theoretic Measures for Anomaly Detection


Introduction (1/2)

Misuse detection – uses the “signatures” of known attacks

Anomaly detection – uses established normal profiles

The basic premise for anomaly detection: there is regularity in audit data that is consistent with normal behavior and thus distinct from abnormal behavior

Page 4: Information-Theoretic Measures for Anomaly Detection


Introduction (2/2)

Most anomaly detection models are built solely on “expert” knowledge or intuition

Goal: provide theoretical foundations as well as useful tools that can facilitate the IDS development process and improve the effectiveness of ID technologies

Page 5: Information-Theoretic Measures for Anomaly Detection


Information-Theoretic Measures (1/7)

Entropy

Use entropy as a measure of the regularity of audit data
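For reference (the formula itself is not reproduced in this transcript), the standard Shannon entropy being used here can be written as

H(X) = \sum_{x \in C_X} P(x) \log \frac{1}{P(x)}

where C_X is the set of classes (distinct items) in dataset X and P(x) is the relative frequency of class x; smaller entropy means more regular data.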

Page 6: Information-Theoretic Measures for Anomaly Detection


Information-Theoretic Measures (2/7)

Conditional Entropy

Let X be a collection of sequences, each of the form (e1, e2, …, en-1, en), where each ei is an audit event; let Y be the collection of subsequences, each of the form (e1, e2, …, ek), with k < n

H(X | Y) tells us how much uncertainty remains for the rest of the audit events in a sequence x after we have seen y
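The corresponding conditional entropy (the standard definition, consistent with the usage here) is

H(X \mid Y) = \sum_{x, y} P(x, y) \log \frac{1}{P(x \mid y)}

where P(x, y) is the joint probability of a sequence x and its prefix y, and P(x | y) is the probability of x given y.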

Page 7: Information-Theoretic Measures for Anomaly Detection


Information-Theoretic Measures (3/7)

Relative Entropy

Relative entropy measures the distance between the regularities of two datasets – e.g., the training dataset and the testing dataset
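In its standard (Kullback-Leibler) form, with p the distribution of one dataset and q the distribution of the other, this is

relEnt(p \mid q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}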

Page 8: Information-Theoretic Measures for Anomaly Detection


Information-Theoretic Measures (4/7)

When we use conditional entropy to measure the regularity of sequential dependencies, we can use relative conditional entropy to measure the distance between two audit datasets
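Analogously, a relative conditional entropy between distributions p and q can be written (a sketch of the standard form, consistent with the definitions above) as

relCondEnt(p \mid q) = \sum_{x, y} p(x, y) \log \frac{p(x \mid y)}{q(x \mid y)}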

Page 9: Information-Theoretic Measures for Anomaly Detection


Information-Theoretic Measures (5/7)

Intrusion detection can be cast as a classification problem

When constructing a classifier, a classification algorithm needs to search for features with high information gain – when the dataset is partitioned according to the values of such a feature, the subsets have lower entropy

Page 10: Information-Theoretic Measures for Anomaly Detection


Information-Theoretic Measures (6/7)

Information Gain
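The formula on this slide is not reproduced in the transcript; the standard information gain of an attribute A over a dataset X is

Gain(X, A) = H(X) - \sum_{v \in \mathrm{Values}(A)} \frac{|X_v|}{|X|} H(X_v)

where X_v is the subset of X whose value of attribute A is v.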

Page 11: Information-Theoretic Measures for Anomaly Detection


Information Gain

H(X) = -((4/16)*log2(4/16) + (12/16)*log2(12/16)) = 0.8113

E(age) = (6/16)*H(<35) + (10/16)*H(>35) = 0.7946

Gain(age) = H(X) - E(age) = 0.0167

Gain(age) = 0.0167, Gain(gender) = 0.0972, Gain(household income) = 0.0177
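The numbers above can be reproduced with a short Python check. The per-branch class counts used below (1 of 6 positives in the <35 branch, 3 of 10 in the >35 branch) are an assumption chosen only because it is consistent with the slide's figures; the slide itself does not list them:

    from math import log2

    def H(pos, total):
        # Binary entropy of a subset with `pos` positives out of `total`.
        out = 0.0
        for p in (pos / total, 1 - pos / total):
            if p > 0:
                out -= p * log2(p)
        return out

    H_X = H(4, 16)                                     # 0.8113
    E_age = (6 / 16) * H(1, 6) + (10 / 16) * H(3, 10)  # assumed split -> 0.7946
    print(round(H_X, 4), round(E_age, 4), round(H_X - E_age, 4))  # 0.8113 0.7946 0.0167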

Page 12: Information-Theoretic Measures for Anomaly Detection


Information-Theoretic Measures (7/7)

Intuitively, the more information we have, the better the detection performance – but there is always a cost for any gain

We can define information cost as the average time for processing an audit record and checking against the detection model
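A minimal sketch (my own construction, not from the paper) of measuring such a cost: average the wall-clock time the detection model spends per audit record.

    import time

    def information_cost(check_record, records):
        # Average seconds to process one audit record, where check_record is
        # any callable that checks a record against the detection model
        # (a hypothetical stand-in for the real model).
        start = time.perf_counter()
        for record in records:
            check_record(record)
        return (time.perf_counter() - start) / len(records)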

Page 13: Information-Theoretic Measures for Anomaly Detection


UNM sendmail System Call Data (1/6)

University of New Mexico (UNM) sendmail system call data

Each trace contains the consecutive system calls made by the run-time processes

Used the first 80% of the traces as the training data and the last 20% as part of the testing data

Page 14: Information-Theoretic Measures for Anomaly Detection


UNM sendmail System Call Data (2/6)

H(length-n sequences | subsequences of length n-1) measures the regularity of how the first n-1 system calls determine the n-th system call

=> Conditional entropy drops as the sequence length increases
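A minimal sketch of this measurement, assuming a trace is simply a list of system call names; the sliding length-n windows play the role of X and their length-(n-1) prefixes the role of Y:

    from collections import Counter
    from math import log2

    def conditional_entropy(trace, n):
        # H(length-n windows | their length-(n-1) prefixes) for one trace.
        windows = [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]
        joint = Counter(windows)                    # counts of (prefix + next call)
        prefix = Counter(w[:-1] for w in windows)   # counts of the prefixes alone
        total = len(windows)
        # H(X|Y) = -sum_{x,y} P(x, y) * log2 P(x | y)
        return -sum((c / total) * log2(c / prefix[w[:-1]])
                    for w, c in joint.items())

    # Toy trace with some branching; the value shrinks as n grows.
    trace = ["open", "read", "read", "write", "close",
             "open", "read", "write", "write", "close"] * 20
    for n in (2, 3, 5):
        print(n, round(conditional_entropy(trace, n), 4))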

Page 15: Information-Theoretic Measures for Anomaly Detection


UNM sendmail System Call Data (3/6)

For normal data, the trend of misclassification rate coincides with the trend of conditional entropy

Page 16: Information-Theoretic Measures for Anomaly Detection


UNM sendmail System Call Data (4/6)

Misclassification rates for the intrusion traces are much higher

This suggests that we can use the range of the misclassification rate as an indicator of whether a given trace is normal or abnormal (an intrusion)
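A sketch of one way to compute such a misclassification rate, assuming (as the slides imply) that the model predicts the n-th system call from the preceding n-1 calls and that unseen prefixes count as misclassifications; the exact classifier used in the paper is not specified in this transcript:

    from collections import Counter, defaultdict

    def train_predictor(traces, n):
        # For each length-(n-1) prefix seen in training, remember the most
        # frequent next system call.
        counts = defaultdict(Counter)
        for t in traces:
            for i in range(len(t) - n + 1):
                counts[tuple(t[i:i + n - 1])][t[i + n - 1]] += 1
        return {prefix: c.most_common(1)[0][0] for prefix, c in counts.items()}

    def misclassification_rate(model, trace, n):
        # Fraction of windows whose n-th call differs from the prediction;
        # prefixes never seen in training also count as misclassified.
        windows = [(tuple(trace[i:i + n - 1]), trace[i + n - 1])
                   for i in range(len(trace) - n + 1)]
        if not windows:
            return 0.0
        wrong = sum(1 for prefix, nxt in windows if model.get(prefix) != nxt)
        return wrong / len(windows)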

Page 17: Information-Theoretic Measures for Anomaly Detection


UNM sendmail System Call Data (5/6)

When the training and testing normal datasets differ more, the misclassification rate on the testing normal data is also higher

Page 18: Information-Theoretic Measures for Anomaly Detection


UNM sendmail System Call Data (6/6)

The cost is a linear function of the sequence length – length ↑, accuracy ↑, but cost ↑ as well

Page 19: Information-Theoretic Measures for Anomaly Detection


MIT Lincoln Lab sendmail BSM Data (1/6)

BSM data developed and distributed by MIT Lincoln Lab for the 1999 DARPA evaluation

Each audit record corresponds to a system call made by sendmail – contains additional information (e.g., user and group IDs, the object name)

Page 20: Information-Theoretic Measures for Anomaly Detection


MIT Lincoln Lab sendmail BSM Data (2/6)

UNM data: (s1, s2, …, sl)

BSM data

– so: (s1_o1, s2_o2, …, sl_ol)

– s-o: (s1, o1, s2, o2, …, sl, ol)

– s: system call, o: object name (system, user, or other)
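A small sketch (function names are my own) of the three encodings, assuming each BSM record has been reduced to a (system call, object category) pair:

    def encode_s(records):
        # s: system call names only
        return [syscall for syscall, obj in records]

    def encode_so(records):
        # so: each system call fused with its object category into one event
        return [f"{syscall}_{obj}" for syscall, obj in records]

    def encode_s_o(records):
        # s-o: system call and object category interleaved as separate events
        seq = []
        for syscall, obj in records:
            seq.extend([syscall, obj])
        return seq

    records = [("open", "user"), ("read", "user"), ("write", "system")]
    print(encode_so(records))   # ['open_user', 'read_user', 'write_system']
    print(encode_s_o(records))  # ['open', 'user', 'read', 'user', 'write', 'system']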

Page 21: Information-Theoretic Measures for Anomaly Detection


MIT Lincoln Lab sendmail BSM Data (3/6)

Conditional entropy drops as sequence length increases

Page 22: Information-Theoretic Measures for Anomaly Detection


MIT Lincoln Lab sendmail BSM Data (4/6)

For in-bound mails, the testing data have clearly higher misclassification rates than the training data

Page 23: Information-Theoretic Measures for Anomaly Detection


MIT Lincoln Lab sendmail BSM Data (5/6)

Out-bound mails have much smaller relative conditional entropy than in-bound mails

Page 24: Information-Theoretic Measures for Anomaly Detection


MIT Lincoln Lab sendmail BSM Data (6/6)

Though the performance with the object name is slightly better, once cost is considered, it is actually better to use the system call name only

Page 25: Information-Theoretic Measures for Anomaly Detection


MIT Lincoln Lab Network Data (1/4)

tcpdump data developed and distributed by MIT Lincoln Lab for the 1998 DARPA evaluation

Each record describes a connection using the following features: timestamp, duration, source port, source host, service…

Page 26: Information-Theoretic Measures for Anomaly Detection


MIT Lincoln Lab Network Data (2/4)

Destination host was used for partitioning the data into per-host subsets
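A minimal sketch of this partitioning, assuming each connection record is a dict with a "dst_host" field (a hypothetical field name):

    from collections import defaultdict

    def partition_by_dst_host(connections):
        # Group connection records so a separate model can be built per host.
        subsets = defaultdict(list)
        for conn in connections:
            subsets[conn["dst_host"]].append(conn)
        return subsets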

Page 27: Information-Theoretic Measures for Anomaly Detection


MIT Lincoln Lab Network Data (3/4)

We can see from the figure that intrusion datasets have much higher misclassification rates

Models from the (more) partitioned datasets have much better performance

Page 28: Information-Theoretic Measures for Anomaly Detection


MIT Lincoln Lab Network Data (4/4)

Conditional entropy decreases as the window size grows

Page 29: Information-Theoretic Measures for Anomaly Detection


Conclusion

Proposed using information-theoretic measures – entropy, conditional entropy, relative entropy, relative conditional entropy, information gain, and information cost – for anomaly detection

Page 30: Information-Theoretic Measures for Anomaly Detection


Comments

Provides theoretical foundations and uses concrete numbers to present the results

Plentiful experimental results