
Combining Supervised and Unsupervised Learning for Zero-Day Malware Detection

© 2013 Narus, Inc.

Prakash Comar¹, Lei Liu¹, Sabyasachi (Saby) Saha², Pang-Ning Tan¹, Antonio Nucci²

¹ Michigan State University, Michigan, USA

² Narus, Inc., Sunnyvale, California, USA


Introduction

• Increasing threats
  – Continuous and increased attacks on infrastructure
  – Threats to business and national security

• Huge financial stake (Conficker: 10 million machines, loss of $9.1 billion)
  – Zeus: 3.6 million machines [HTML injection]
  – Koobface: 2.9 million machines [social networking sites]
  – TidServ: 1.5 million machines [email spam attachment]

• Attacks are becoming more advanced and sophisticated!

• Malware is …
  – Malicious software
  – Virus, phishing, spam, …


Introduction (Contd.)

• Host-based vs. network-based approaches

• Limitations of existing techniques
  – Signature-based approach
    • Fails to detect zero-day attacks
    • Fails to detect threats with evolving capabilities, such as metamorphic and polymorphic malware
  – Anomaly-based approach
    • Produces a high false alarm rate
  – Supervised-learning-based approach
    • Performs poorly on new and evolving malware
    • Building a classifier model is challenging due to the diversity of malware classes, imbalanced class distribution, data imperfection issues, etc.

There is no silver bullet.


Our Goal

• Focus on layer-3/layer-4 features

• Threats often exhibit specific behavior in their layer-3/layer-4 flow-level features
  – Even when the payload is encrypted

• Machine-learning-based approach
  – Two-level supervised learning approach to detect malicious flows and further identify the specific malware type
  – Combine unsupervised learning with supervised learning to address the new class discovery problem


Challenges

• Imbalanced class representation
  – The majority of flows belong to a few dominant classes

• Missing values
  – The features used to characterize a network flow may contain missing values (only 7% of records have all features)

• Noise in the training data
  – Training data labeled as good by the IDS may contain malware

• New class discovery
  – Not all classes are present at the time the classifier is initially trained


System Architecture


Proposed Framework

• Two-level malware detection framework:

• Macro-level classifier
  – Used to isolate malicious flows from the non-malicious ones

• Micro-level classifier
  – Further categorizes the malicious flows into one of the pre-existing malware classes or a new malware class


Methodology: Two-layered Learning Framework

• L1: Ensemble-learning-based binary classifier
  – Classifies each flow as unknown or malicious
  – Random Forest classifier

• L2: One-class SVM with a tree-based kernel, along with probabilistic class profiling, for specific malware class identification and novel class detection

[Figure: Classification Process]
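The two-layer classification process can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `classify_flow`, `macro_is_malicious`, `micro_scores`, and `novelty_threshold` are hypothetical names, and the rule "report a new class when no known class scores above a threshold" is an assumption standing in for the one-class SVMs and probabilistic profiling named on the slide.

```python
from typing import Callable, Dict, Sequence

Flow = Sequence[float]


def classify_flow(
    flow: Flow,
    macro_is_malicious: Callable[[Flow], bool],
    micro_scores: Dict[str, Callable[[Flow], float]],
    novelty_threshold: float = 0.0,
) -> str:
    """Route one flow through both levels and return a label."""
    # Level 1 (macro): separate malicious flows from the rest.
    if not macro_is_malicious(flow):
        return "unknown"
    # Level 2 (micro): score the flow against every known malware class.
    scores = {name: score(flow) for name, score in micro_scores.items()}
    if not scores:
        return "new malware"
    best = max(scores, key=scores.get)
    # Assumption: if even the best known class rejects the flow, call it new.
    if scores[best] < novelty_threshold:
        return "new malware"
    return best


# Toy stand-ins for trained models (purely illustrative).
macro = lambda f: f[0] > 0.5
micro = {"conficker": lambda f: f[1] - 1.0, "zeus": lambda f: -f[1]}

print(classify_flow((0.1, 2.0), macro, micro))
print(classify_flow((0.9, 2.0), macro, micro))
print(classify_flow((0.9, 0.5), macro, micro))
```

In a real system each entry of `micro_scores` would be the decision function of one per-class model, and the threshold would be tuned on held-out data.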


Proposed Framework

• 1-Class SVM for Known Malware Detection:


Proposed Framework

• Tree-based feature transformation

Proposed Framework

• Example of tree-based features with three classes (C1, C2, C3)

[Figure: data matrix X with n rows and d columns (x11 … x1d through xn1 … xnd) next to a label vector Y with values 1, 1, 2, 2, 3, 3.]

[Figures: three animation steps, one per class. In each step the label column is shown as +1 for the rows of one class and -1 for all other rows, and P trees are built from X, each on a sample of m out of n rows and f out of d features.]


Proposed Framework

• Example of tree-based feature transformation.
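A rough sketch of a tree-based feature transformation in this spirit is given below. It is an illustration under several assumptions that are not taken from the slides: each "tree" is reduced to a one-split decision stump, rows are sampled without replacement, a stump outputs 0 when the feature it tests is missing, and all function names are hypothetical. What it keeps from the slides is the overall shape: for each class, recode labels as +1/-1, build P trees on m out of n rows and f out of d features, and use the tree outputs as the new features.

```python
import random
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]   # None marks a missing value
Stump = Tuple[int, float, int]    # (feature index, threshold, polarity)


def fit_stump(X: List[Row], y: List[int], features: List[int]) -> Stump:
    """Pick the (feature, threshold, polarity) with the fewest errors."""
    best, best_err = (features[0], 0.0, 1), float("inf")
    for j in features:
        seen = [(r[j], t) for r, t in zip(X, y) if r[j] is not None]
        missing = len(X) - len(seen)  # rows the stump cannot decide count as errors
        for thr in sorted({v for v, _ in seen}):
            for pol in (1, -1):
                wrong = sum(1 for v, t in seen if (pol if v >= thr else -pol) != t)
                if missing + wrong < best_err:
                    best, best_err = (j, thr, pol), missing + wrong
    return best


def stump_output(stump: Stump, row: Row) -> int:
    """Return +1 or -1, or 0 (assumed 'abstain') if the tested feature is missing."""
    j, thr, pol = stump
    if row[j] is None:
        return 0
    return pol if row[j] >= thr else -pol


def fit_transformer(X: List[Row], y: List[int], classes: List[int],
                    P: int, m: int, f: int, seed: int = 0) -> List[Stump]:
    """Build P stumps per class on one-vs-rest labels."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    stumps = []
    for c in classes:
        target = [1 if label == c else -1 for label in y]
        for _ in range(P):
            rows = rng.sample(range(n), m)    # m out of n rows
            feats = rng.sample(range(d), f)   # f out of d features
            stumps.append(fit_stump([X[i] for i in rows],
                                    [target[i] for i in rows], feats))
    return stumps


def transform(stumps: List[Stump], row: Row) -> List[int]:
    """Map a raw flow to its vector of stump outputs."""
    return [stump_output(s, row) for s in stumps]


X = [[0.1, None], [0.2, 5.0], [0.9, None], [1.0, 4.0], [None, 9.0], [0.5, 9.5]]
y = [1, 1, 2, 2, 3, 3]
stumps = fit_transformer(X, y, classes=[1, 2, 3], P=4, m=4, f=1)
print(transform(stumps, X[0]))
```

Every flow, whatever its pattern of missing values, is mapped to a fixed-length vector with no missing entries, which is what a downstream kernel needs.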


Proposed Framework

• Kernel matrix for 1-class SVM:
  – Existing kernels, such as the RBF or polynomial kernel, assume that the feature vectors do not have missing values
  – We propose a weighted linear kernel matrix for the 1-class SVM, based on the transformed tree-based features, obtained by minimizing the following objective function.
  – Wij is the model regularizer and Gij is a ground-truth kernel, which is defined as
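As an illustration of what a weighted linear kernel over transformed features looks like, the sketch below computes K[i][j] = Σ_p w[p] · Z[i][p] · Z[j][p], where Z holds one transformed feature vector per flow. The weights `w` are simply supplied by the caller here; how they are learned from the objective function is not shown, and the function name and the toy values are hypothetical.

```python
from typing import List, Sequence


def weighted_linear_kernel(Z: List[Sequence[float]],
                           w: Sequence[float]) -> List[List[float]]:
    """Return the matrix K with K[i][j] = sum_p w[p] * Z[i][p] * Z[j][p]."""
    return [
        [sum(wp * a * b for wp, a, b in zip(w, zi, zj)) for zj in Z]
        for zi in Z
    ]


# Three flows described by three tree outputs each (toy values).
Z = [[1, -1, 0], [1, 1, -1], [-1, -1, 1]]
w = [0.5, 0.25, 0.25]
K = weighted_linear_kernel(Z, w)
print(K[0][0], K[0][1], K[1][1])  # 0.75 0.25 1.0
```

With non-negative weights the result is symmetric and positive semidefinite, so it can be passed to an SVM solver that accepts a precomputed kernel. A feature encoded as 0 contributes nothing to the sum.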


Proposed Framework

• Probabilistic Profiling for New Class Discovery:


Experimental Evaluation

• Data:
  – Network flow data from an Internet service provider in Asia; a subset of 108 flow features was extracted.
  – An IDS/IPS system generates the class label for each flow by analyzing the payload.
    • 38 different malware classes were identified by the IDS/IPS, including Conficker, Tidserv, Trojans, etc.
    • Flows left unlabeled by the IDS/IPS are assigned to the “good” (unknown) category.


Experimental Evaluation

• Data:


Experimental Evaluation

• Comparison of tree-based feature transformation against missing value imputation
  – Original: data without any missing value treatment
  – OMI: overall mean value of the feature across all the classes
  – CMI: mean value of the feature for the given class
  – LKNN: local KNN imputation
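The two mean-based baselines can be sketched as below. This is an illustrative reading of the OMI and CMI definitions above, not the authors' code: the function names are hypothetical, `None` is assumed to mark a missing value, and the fallbacks (0.0 when a feature is never observed, the overall mean when a class never observes it) are assumptions. LKNN is not shown.

```python
from collections import defaultdict
from typing import Hashable, List, Optional, Sequence

Row = Sequence[Optional[float]]


def column_means(X: List[Row], fallback: Optional[List[float]] = None) -> List[float]:
    """Mean of the observed values in each column."""
    means = []
    for j in range(len(X[0])):
        seen = [r[j] for r in X if r[j] is not None]
        if seen:
            means.append(sum(seen) / len(seen))
        else:
            # Assumed fallback when the column has no observed value.
            means.append(fallback[j] if fallback is not None else 0.0)
    return means


def fill(X: List[Row], means: List[float]) -> List[List[float]]:
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in X]


def impute_omi(X: List[Row]) -> List[List[float]]:
    """OMI: replace a missing value with the feature's overall mean."""
    return fill(X, column_means(X))


def impute_cmi(X: List[Row], y: List[Hashable]) -> List[List[float]]:
    """CMI: replace a missing value with the feature's mean within the row's class."""
    overall = column_means(X)
    by_class = defaultdict(list)
    for r, label in zip(X, y):
        by_class[label].append(r)
    class_means = {c: column_means(rows, fallback=overall)
                   for c, rows in by_class.items()}
    return [fill([r], class_means[label])[0] for r, label in zip(X, y)]


X = [[1.0, None], [3.0, 4.0], [None, 8.0], [5.0, None]]
y = ["a", "a", "b", "b"]
print(impute_omi(X))     # [[1.0, 6.0], [3.0, 4.0], [3.0, 8.0], [5.0, 6.0]]
print(impute_cmi(X, y))  # [[1.0, 4.0], [3.0, 4.0], [5.0, 8.0], [5.0, 8.0]]
```

Note that CMI needs the class label of each row, so it can only be applied as written where labels are available.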


Experimental Evaluation

• Results comparison at macro-level

• Results comparison at micro-level
  – ROC curve for new malware detection
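For readers unfamiliar with how such a ROC curve is produced, here is a small generic sketch. It assumes each test flow has a novelty score where higher means "more likely new malware", plus a ground-truth flag; the function names and the toy numbers are hypothetical and are not results from the talk.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float],
               is_new: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    Requires at least one positive and one negative example.
    """
    pos = sum(1 for flag in is_new if flag)
    neg = len(is_new) - pos
    points = [(0.0, 0.0)]
    # Sweep the threshold from the highest score down to the lowest.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, flag in zip(scores, is_new) if s >= thr and flag)
        fp = sum(1 for s, flag in zip(scores, is_new) if s >= thr and not flag)
        points.append((fp / neg, tp / pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.4, 0.3]
is_new = [True, False, True, False]
pts = roc_points(scores, is_new)
print(pts)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(pts))  # 0.75
```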


Experimental Evaluation

• Overall Results Comparison for detecting both known and new malware


Conclusion

• We proposed an effective malware detection framework based on statistical flow-level features
  – Two-level ML-based classifier
  – New class detection
  – Encrypted data

• A tree-based kernel for 1-class SVM was proposed to handle the data imperfection issue in network flow data


Future Work

• Extend the formulation to an online learning setting

• Develop a hierarchical multi-class learning method to enhance the testing efficiency when the number of malware classes becomes extremely large.