
Combining Supervised and Unsupervised Learning for Zero-Day Malware Detection

© 2013 Narus, Inc.

Prakash Comar¹, Lei Liu¹, Sabyasachi (Saby) Saha², Pang-Ning Tan¹, Antonio Nucci²

¹Michigan State University, Michigan, USA
²Narus, Inc., Sunnyvale, California, USA


Introduction

• Increasing threats
– Continuous and increasing attacks on infrastructure
– Threats to business and national security
• Huge financial stake (Conficker: 10 million machines, $9.1 billion in losses)
• Zeus: 3.6 million machines [HTML injection]
• Koobface: 2.9 million machines [social networking sites]
• TidServ: 1.5 million machines [email spam attachment]
• Attacks are becoming more advanced and sophisticated!
• Malware is …
– Malicious software
– Viruses, phishing, spam, …


Introduction (Contd.)

• Host-based vs. network-based approaches
• Limitations of existing techniques
– Signature-based approach
  • Fails to detect zero-day attacks.
  • Fails to detect threats with evolving capabilities, such as metamorphic and polymorphic malware.
– Anomaly-based approach
  • Produces a high false-alarm rate.
– Supervised-learning-based approach
  • Performs poorly on new and evolving malware.
  • Building a classifier model is challenging due to the diversity of malware classes, imbalanced class distribution, data imperfection issues, etc.

There is no Silver Bullet


Our Goal

• Focus on layer-3/layer-4 features
• Threats often exhibit specific behavior in their layer-3/layer-4 flow-level features
– Even when the payload is encrypted
• Machine-learning-based approach
– Two-level supervised learning approach to detect malicious flows and then identify the specific malware type
– Combine unsupervised learning with supervised learning to address the new-class discovery problem


Challenges

• Imbalanced class representation
– The majority of flows belong to a few dominant classes
• Missing values
– The features used to characterize a network flow may contain missing values (only 7% of records have all features)
• Noise in the training data
– Training data labeled as good by the IDS may contain malware
• New class discovery
– Not all classes are present when the classifier is initially trained
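The first two challenges can be quantified before any model is trained. The following is a minimal sketch, assuming flows are stored as lists in which `None` marks a missing feature value; the helper names and the toy data are hypothetical and are not taken from the slides.

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Sequence


def complete_fraction(X: Sequence[Sequence[Optional[float]]]) -> float:
    """Fraction of records in which every feature is observed."""
    complete = sum(all(value is not None for value in row) for row in X)
    return complete / len(X)


def class_shares(y: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Share of flows per class, most frequent class first."""
    counts = Counter(y)
    return {label: count / len(y) for label, count in counts.most_common()}


# Toy data (illustrative only): None marks a missing feature value.
X = [[1.0, 2.0], [None, 2.0], [3.0, None], [4.0, 5.0]]
y = ["good"] * 8 + ["Conficker", "Tidserv"]

fraction = complete_fraction(X)
shares = class_shares(y)
```

On the real data set, the first quantity corresponds to the "only 7%" figure quoted on the slide, and the second makes the dominance of a few classes visible.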


System Architecture


Proposed Framework

• Two-level malware detection framework:
• Macro-level classifier
– Used to isolate malicious flows from the non-malicious ones.
• Micro-level classifier
– Further categorizes the malicious flows into one of the pre-existing malware classes or as new malware


Methodology: Two-layered Learning Framework

• L1: Ensemble-learning-based binary classifier (Random Forest classifier)
– Classifies each flow as unknown or malicious
• L2: One-class SVM with a tree-based kernel, along with probabilistic class profiling, for identifying the specific malware class and detecting novel classes

[Figure: Classification Process]
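The control flow of the two levels can be sketched as follows. This is only an illustration under stated assumptions: the macro-level model is abstracted as a function returning `True` for malicious flows, each known malware class is abstracted as a scoring function, and the rule "no class scores above `accept_threshold` means new malware" is a simplification standing in for the probabilistic class profiling named on the slide. The function names, threshold, and toy rules are all hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence

Flow = Sequence[Optional[float]]  # one flow's features; None marks a missing value


def classify_flow(
    flow: Flow,
    macro_is_malicious: Callable[[Flow], bool],
    micro_scorers: Dict[str, Callable[[Flow], float]],
    accept_threshold: float = 0.0,
) -> str:
    """Route one flow through the macro level and then the micro level."""
    # Level 1 (macro): separate malicious flows from the rest.
    if not macro_is_malicious(flow):
        return "unknown"
    # Level 2 (micro): score the flow against every known malware class.
    scores = {name: scorer(flow) for name, scorer in micro_scorers.items()}
    best_class = max(scores, key=scores.get)
    # Assumption: if no per-class model accepts the flow, flag it as a new class.
    if scores[best_class] < accept_threshold:
        return "new malware"
    return best_class


# Toy stand-ins for trained models (hypothetical rules, for illustration only).
def toy_macro(flow: Flow) -> bool:
    return flow[0] is not None and flow[0] > 10


toy_micro = {
    "Conficker": lambda flow: 1.0 - abs(flow[1] - 5.0),
    "Tidserv": lambda flow: 1.0 - abs(flow[1] - 9.0),
}

flows = [[2.0, 5.0], [20.0, 5.2], [20.0, 8.9], [20.0, 50.0]]
results = [classify_flow(flow, toy_macro, toy_micro) for flow in flows]
```

In a real system the toy functions would be replaced by the trained Random Forest at level 1 and the per-class one-class SVMs at level 2.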


Proposed Framework

• 1-Class SVM for Known Malware Detection:


Proposed Framework

• Tree based feature transformation

[Figure: data matrix X (n flows × d features, entries x11 … xnd) with label vector Y (class labels 1, 2, 3)]

Proposed Framework

• Example of tree-based features with three classes (C1, C2, C3)

[Figure: for each class in turn, flows of that class are labeled +1 and all other flows −1; m out of the n rows and f out of the d features of X are sampled, and P trees are built for each class]

Proposed Framework

• Example of tree-based feature transformation.
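The transformation illustrated on the preceding slides (one-vs-rest relabeling to +1/−1, sampling m out of n rows and f out of d features, and building P trees per class) can be sketched in plain Python. Several simplifications here are assumptions rather than the authors' method: each "tree" is a depth-1 decision stump, a stump outputs 0 when the feature it tests is missing, and the helper names and toy data are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]   # None marks a missing feature value
Stump = Tuple[int, float, int]    # (feature index, threshold, sign)


def one_vs_rest(y: Sequence[int], target: int) -> List[int]:
    """Relabel flows: +1 for the target class, -1 for every other class."""
    return [1 if label == target else -1 for label in y]


def stump_predict(stump: Stump, row: Row) -> int:
    feature, threshold, sign = stump
    value = row[feature]
    if value is None:  # assumption: a tree abstains (outputs 0) on a missing value
        return 0
    return sign if value > threshold else -sign


def fit_stump(X, labels, rows, features) -> Stump:
    """Pick the (feature, threshold, sign) with the fewest errors on the sample."""
    best, best_errors = (features[0], 0.0, 1), len(rows) + 1
    for feature in features:
        values = sorted({X[i][feature] for i in rows if X[i][feature] is not None})
        for threshold in values:
            for sign in (1, -1):
                stump = (feature, threshold, sign)
                errors = sum(stump_predict(stump, X[i]) != labels[i] for i in rows)
                if errors < best_errors:
                    best, best_errors = stump, errors
    return best


def build_forest(X, y, classes, P, m, f, seed=0) -> List[Stump]:
    """Build P stumps per class, each on m sampled rows and f sampled features."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for target in classes:
        labels = one_vs_rest(y, target)
        for _ in range(P):
            rows = rng.sample(range(n), m)        # requires m <= n
            features = rng.sample(range(d), f)    # requires f <= d
            forest.append(fit_stump(X, labels, rows, features))
    return forest


def transform(forest: List[Stump], row: Row) -> List[int]:
    """Tree-based feature vector: one output per tree."""
    return [stump_predict(stump, row) for stump in forest]


# Toy data with three classes and some missing values (illustrative only).
X = [[1.0, None], [1.2, 0.1], [5.0, 0.2], [5.1, None], [9.0, 0.9], [None, 1.0]]
y = [1, 1, 2, 2, 3, 3]
forest = build_forest(X, y, classes=[1, 2, 3], P=4, m=4, f=1)
Z = [transform(forest, row) for row in X]
```

Every flow maps to a vector of length P × (number of classes) that is fully defined even when the original features are incomplete, which is the property the kernel construction relies on.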


Proposed Framework

• Kernel matrix for 1-class SVM:
– Existing kernels, such as the RBF or polynomial kernel, assume that feature vectors do not have missing values
– We propose a weighted linear kernel matrix for the 1-class SVM, based on the transformed tree-based features, obtained by minimizing the following objective function: [equation not preserved in this transcript]
– Wij is the model regularizer and Gij is a ground-truth kernel, which is defined as [definition not preserved in this transcript]
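To make the idea of a weighted linear kernel on tree-based features concrete, the sketch below computes K[i][j] = Σ_p w[p] · Z[i][p] · Z[j][p] for a given weight vector. The per-tree weight vector `w` is an assumption made for illustration: the slide's objective function and the definition of the ground-truth kernel are not preserved in this transcript, so the weights are simply supplied here rather than learned.

```python
from typing import List, Sequence


def weighted_linear_kernel(
    Z: Sequence[Sequence[float]], w: Sequence[float]
) -> List[List[float]]:
    """Kernel matrix with K[i][j] = sum_p w[p] * Z[i][p] * Z[j][p]."""
    n = len(Z)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = sum(wp * zi * zj for wp, zi, zj in zip(w, Z[i], Z[j]))
            K[i][j] = K[j][i] = value  # the kernel matrix is symmetric
    return K


# Toy tree-based features for three flows and three trees (illustrative only).
Z = [[1, 1, -1], [1, -1, -1], [-1, -1, 1]]
w = [0.5, 0.25, 0.25]  # hypothetical per-tree weights
K = weighted_linear_kernel(Z, w)
```

With non-negative weights this matrix is positive semidefinite, so it can be passed to any one-class SVM implementation that accepts a precomputed kernel.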


Proposed Framework

• Probabilistic Profiling for New Class Discovery:


Experimental Evaluation

• Data:
– Network flow data from an Internet service provider in Asia; a subset of 108 flow features was extracted.
– An IDS/IPS system is used to generate the class label for each flow by analyzing the payload.
• 38 different malware classes were identified by the IDS/IPS, including Conficker, Tidserv, Trojans, etc.
• Flows left unlabeled by the IDS/IPS are assigned to the “good” (unknown) category.



• Data:



• Comparison of tree-based feature transformation against missing-value imputation
– Original: data without any missing-value treatment
– OMI: overall mean value of the feature across all classes
– CMI: mean value of the feature for the given class
– LKNN: local KNN imputation
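The two mean-based baselines follow directly from their definitions above and can be sketched in a few lines. The function names and toy data are hypothetical; missing values are represented by `None`, and a feature that is never observed (overall or within a class) is left as `None`.

```python
from collections import defaultdict
from typing import Hashable, List, Optional, Sequence

Row = Sequence[Optional[float]]


def column_means(rows: Sequence[Row]) -> List[Optional[float]]:
    """Mean of each feature over its observed (non-missing) values."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else None)
    return means


def impute(row: Row, means: Sequence[Optional[float]]) -> List[Optional[float]]:
    return [means[j] if value is None else value for j, value in enumerate(row)]


def overall_mean_imputation(X: Sequence[Row]) -> List[List[Optional[float]]]:
    """OMI: fill with the feature mean computed across all classes."""
    means = column_means(X)
    return [impute(row, means) for row in X]


def class_mean_imputation(
    X: Sequence[Row], y: Sequence[Hashable]
) -> List[List[Optional[float]]]:
    """CMI: fill with the feature mean computed within the flow's own class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    class_means = {label: column_means(rows) for label, rows in by_class.items()}
    return [impute(row, class_means[label]) for row, label in zip(X, y)]


# Toy data (illustrative only).
X = [[1.0, None], [3.0, 4.0], [None, 8.0], [8.0, 6.0]]
y = ["a", "a", "b", "b"]
omi = overall_mean_imputation(X)
cmi = class_mean_imputation(X, y)
```

Note that CMI needs the class label of each flow, so as written it applies to labeled data only.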



• Results comparison at macro level

• Results comparison at micro level
– ROC curve for new malware detection
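For reference, an ROC curve for new malware detection can be computed from per-flow novelty scores as sketched below. This assumes that a higher score means "more likely to be new malware" and that the evaluation set contains both new and known flows; the scores and labels shown are made up for illustration and are not results from the talk.

```python
from typing import List, Sequence, Tuple


def roc_points(
    scores: Sequence[float], is_new: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(is_new)
    negatives = len(is_new) - positives
    ranked = sorted(zip(scores, is_new), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for index, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last flow sharing this score (handles ties).
        if index + 1 == len(ranked) or ranked[index + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up novelty scores and ground truth (illustrative only).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
is_new = [True, True, False, True, False, False]
points = roc_points(scores, is_new)
area = auc(points)
```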



• Overall Results Comparison for detecting both known and new malware


Conclusion

• We proposed an effective malware detection framework based on statistical flow-level features
– Two-level ML-based classifier
– New class detection
– Encrypted data
• A tree-based kernel for the 1-class SVM was proposed to handle the data imperfection issue in network flow data


Future Work

• Extend the formulation to an online learning setting

• Develop a hierarchical multi-class learning method to enhance the testing efficiency when the number of malware classes becomes extremely large.
