Combining Supervised and Unsupervised Learning for Zero-Day Malware Detection
© 2013 Narus, Inc.
Prakash Comar1 Lei Liu1 Sabyasachi (Saby) Saha2
Pang-Ning Tan1 Antonio Nucci2
1Michigan State University, Michigan, USA
2Narus, Inc., Sunnyvale, California, USA
Introduction
• Increasing threats
– Continuous and increased attacks on infrastructure
– Threats to business, national security
• Huge financial stake (Conficker: 10 million machines, loss $9.1 Billion)
• Zeus: 3.6 million machines [HTML Injection]
• Koobface: 2.9 million machines [Social Networking Sites]
• TidServ: 1.5 million machines [Email spam attachment]
• Attacks are becoming more advanced and sophisticated!
• Malware is …
– Malicious software
– Virus, Phishing, Spam, …
Introduction (Contd.)
• Host vs Network based approaches
• Limitation of existing techniques
– Signature-based approach
  • Fails to detect zero-day attacks.
  • Fails to detect threats with evolving capabilities such as metamorphic and polymorphic malware.
– Anomaly-based approach
  • Produces a high false alarm rate.
– Supervised learning based approach
  • Poor performance on new and evolving malware
  • Building a classifier model is challenging due to the diversity of malware classes, imbalanced distribution, data imperfection issues, etc.
There is no Silver Bullet
Our Goal
• Focus on Layer 3/4 features
• Threats often exhibit specific behavior in their layer-3/layer-4 flow level features
– Even when the payload is encrypted
• Machine Learning based approach
– Two level supervised learning approach to detect malicious flows and further identify the specific type
– Combine unsupervised learning with supervised learning to address the new class discovery problem
Challenges
• Imbalanced class representation
– The majority of flows belong to a few dominant classes
• Missing values
– The features used to characterize a network flow may contain missing values (only 7% of records have all features)
• Noise in the training data
– Training data labeled as good by the IDS may contain malware
• New class discovery
– Not all classes are present at the time the classifier is initially trained.
System Architecture
Proposed Framework
• Two level malware detection framework:
• Macro-level classifier
– Used to isolate malicious flows from the non-malicious ones.
• Micro-level classifier
– Further categorizes the malicious flows into one of the pre-existing malware classes or a new malware class
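The macro/micro routing above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `macro_is_malicious`, `micro_scores`, the toy scorers, and the `threshold` value are all hypothetical stand-ins for the macro-level classifier, the per-class micro-level scorers, and a new-class cutoff.

```python
from typing import Callable, Dict, Sequence

Flow = Sequence[float]


def classify_flow(
    flow: Flow,
    macro_is_malicious: Callable[[Flow], bool],
    micro_scores: Dict[str, Callable[[Flow], float]],
    threshold: float = 0.5,
) -> str:
    """Route one flow through the macro level, then the micro level."""
    # Macro level: separate malicious flows from non-malicious ones.
    if not macro_is_malicious(flow):
        return "non-malicious"
    # Micro level: score the flow against every known malware class.
    scores = {name: scorer(flow) for name, scorer in micro_scores.items()}
    best = max(scores, key=scores.get) if scores else None
    # Assumption: a malicious flow that no known class accepts is
    # reported as new malware.
    if best is None or scores[best] < threshold:
        return "new malware"
    return best


# Toy stand-ins, invented for illustration only.
def toy_macro(flow: Flow) -> bool:
    return sum(flow) > 1.0


toy_micro = {
    "Conficker": lambda flow: 1.0 if flow[0] > 0.8 else 0.0,
    "Tidserv": lambda flow: 1.0 if flow[1] > 0.8 else 0.0,
}

print(classify_flow([0.1, 0.1], toy_macro, toy_micro))
print(classify_flow([0.9, 0.3], toy_macro, toy_micro))
print(classify_flow([0.6, 0.6], toy_macro, toy_micro))
```

Keeping the two levels as separate callables means either one can be swapped out without touching the routing logic.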
Methodology: Two-layered Learning Framework
• L1: Ensemble learning based binary classifier
– Classifies flows as Unknown or Malicious
– Random Forest Classifier
• L2: One class SVM with tree-based kernel, along with probabilistic class profiling for specific malware class and novel class detection
Classification Process
Proposed Framework
• 1-Class SVM for Known Malware Detection:
Proposed Framework
• Tree based feature transformation
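Proposed Framework
• Example of tree based features with three classes (C1, C2, C3)
$$
X = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1d} \\
x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{md} \\
\vdots & \vdots & & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nd}
\end{bmatrix},
\qquad
Y = (1, \dots, 1,\ 2, \dots, 2,\ 3, \dots, 3)^{\top}
$$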
[Figure: one-vs-rest labels for class C1 (+1 for the C1 rows of X, -1 for all other rows). Each of the P trees is built from a sample of m out of n rows and f out of d features of X.]
[Figure: one-vs-rest labels for class C2 (+1 for the C2 rows of X, -1 for all other rows). Each of the P trees is built from a sample of m out of n rows and f out of d features of X.]
[Figure: one-vs-rest labels for class C3 (+1 for the C3 rows of X, -1 for all other rows). Each of the P trees is built from a sample of m out of n rows and f out of d features of X.]
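The relabeling and sampling steps shown on these slides can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the toy data is invented, and sampling is done without replacement because the slides do not say which scheme is used. Tree construction itself is not shown.

```python
import random
from typing import List, Sequence, Tuple


def one_vs_rest_labels(y: Sequence[int], positive_class: int) -> List[int]:
    """Relabel rows: +1 for positive_class, -1 for every other class."""
    return [1 if label == positive_class else -1 for label in y]


def sample_rows_and_features(
    X: Sequence[Sequence[float]], m: int, f: int, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Pick m of the n row indices and f of the d feature indices."""
    n, d = len(X), len(X[0])
    # Assumption: sampling without replacement (not stated in the slides).
    rows = sorted(rng.sample(range(n), m))
    cols = sorted(rng.sample(range(d), f))
    return rows, cols


# Toy data, invented for illustration: n = 4 rows, d = 3 features.
X = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
y = [1, 1, 2, 3]

labels_c2 = one_vs_rest_labels(y, positive_class=2)  # [-1, -1, 1, -1]

# One (rows, features) draw per tree; P = 5 trees here.
rng = random.Random(0)
draws = [sample_rows_and_features(X, m=3, f=2, rng=rng) for _ in range(5)]
```

Each entry in `draws` identifies the sub-matrix of X that one tree would see; repeating the relabeling for C1, C2, and C3 gives one set of P trees per class.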
Proposed Framework
• Example of tree based feature transformation.
Proposed Framework
• Kernel matrix for 1-class SVM:
– Existing kernels, such as the RBF or polynomial kernel, assume that feature vectors do not have missing values
– Propose a weighted linear kernel matrix for 1-class SVM based on the transformed tree-based features, obtained by minimizing the following objective function.
– Wij is the model regularizer and Gij is a ground truth kernel, which is defined as
Proposed Framework
• Probabilistic Profiling for New Class Discovery:
Experimental Evaluation
• Data:
– Network flow data from an Internet service provider in Asia; a subset of 108 flow features was extracted.
– An IDS/IPS system generates the class label for each flow by analyzing the payload.
• 38 different malware classes were identified by the IDS/IPS, including Conficker, Tidserv, Trojans, etc.
• Flows left unlabeled by the IDS/IPS are assigned to the “good” (unknown) category.
Experimental Evaluation
• Data:
Experimental Evaluation
• Comparison of Tree-based Feature Transformation against Missing Value Imputation
– Original: data without any missing value treatment
– OMI: overall mean value of the feature across all the classes
– CMI: mean value of the feature for the given class
– LKNN: Local KNN Imputation
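The two mean-based baselines, OMI and CMI, can be sketched as below. This is an illustrative sketch, not the code used in the experiments: missing values are represented as `None`, the function names and toy data are hypothetical, and a feature with no observed value falls back to 0.0 by assumption.

```python
from typing import Dict, Hashable, List, Optional, Sequence

Row = Sequence[Optional[float]]


def column_means(rows: Sequence[Row]) -> List[float]:
    """Mean of each feature over its observed (non-None) values."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        # Assumption: fall back to 0.0 when a feature is never observed.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def fill_missing(row: Row, means: Sequence[float]) -> List[float]:
    """Replace each None in a row with the matching feature mean."""
    return [means[j] if value is None else value for j, value in enumerate(row)]


def overall_mean_imputation(X: Sequence[Row]) -> List[List[float]]:
    """OMI: fill with the feature mean computed across all classes."""
    means = column_means(X)
    return [fill_missing(row, means) for row in X]


def class_mean_imputation(X: Sequence[Row], y: Sequence[Hashable]) -> List[List[float]]:
    """CMI: fill with the feature mean computed within the row's own class."""
    rows_by_class: Dict[Hashable, List[Row]] = {}
    for row, label in zip(X, y):
        rows_by_class.setdefault(label, []).append(row)
    means_by_class = {label: column_means(rows) for label, rows in rows_by_class.items()}
    return [fill_missing(row, means_by_class[label]) for row, label in zip(X, y)]


# Toy data, invented for illustration.
X = [[1.0, None], [3.0, 4.0], [None, 8.0], [8.0, 12.0]]
y = ["a", "a", "b", "b"]

omi = overall_mean_imputation(X)   # [[1.0, 8.0], [3.0, 4.0], [4.0, 8.0], [8.0, 12.0]]
cmi = class_mean_imputation(X, y)  # [[1.0, 4.0], [3.0, 4.0], [8.0, 8.0], [8.0, 12.0]]
```

Note that CMI needs the class label of each row, so it can only be applied as written to labeled data.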
Experimental Evaluation
• Results Comparison at Macro-level
• Results Comparison at Micro-level
– ROC curve for new malware detection
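For readers unfamiliar with how an ROC curve is produced from detector scores, the following sketch computes the curve points and the area under them. It is a generic illustration, not the evaluation code behind the slides: the scores and labels are invented, a higher score is assumed to mean "more likely new malware", and both classes are assumed to be present.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], is_new: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(is_new)
    negatives = len(is_new) - positives  # assumes both counts are non-zero
    points = [(0.0, 0.0)]
    # Sweep thresholds from the highest score down; a flow is flagged
    # as new malware when its score is >= the threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, p in zip(scores, is_new) if s >= threshold and p)
        fp = sum(1 for s, p in zip(scores, is_new) if s >= threshold and not p)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy scores and ground truth, invented for illustration.
scores = [0.9, 0.8, 0.4, 0.3]
is_new = [True, False, True, False]

points = roc_points(scores, is_new)
print(points)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(points))  # 0.75
```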
Experimental Evaluation
• Overall Results Comparison for detecting both known and new malware
Conclusion
• We proposed an effective malware detection framework based on statistical flow-level features
– Two level ML based classifier
– New class detection
– Encrypted data
• A tree based kernel for 1-class SVM was proposed to handle the data imperfection issue in network flow data
Future Work
• Extend the formulation to an online learning setting
• Develop a hierarchical multi-class learning method to enhance the testing efficiency when the number of malware classes becomes extremely large.