Data Mining for Security Applications:
Detecting Malicious Executables
Mr. Mehedy M. Masud (PhD Student)
Prof. Latifur Khan
Prof. Bhavani Thuraisingham
Department of Computer Science
The University of Texas at Dallas
Outline and Acknowledgement
● Vision for Assured Information Sharing
● Handling Different Trust Levels
● Defensive Operations between Untrustworthy Partners
– Detecting Malicious Executables using Data Mining
● Research funded by the Air Force Office of Scientific Research and Texas Enterprise Funds
Vision: Assured Information Sharing
[Figure: Agencies A, B, and C each hold component data/policy and publish data/policy to the shared data/policy for the coalition.]
1. Trustworthy Partners
2. Semi-Trustworthy partners
3. Untrustworthy partners
4. Dynamic Trust
Our Approach
● Integrate the Medicaid claims data and mine the data; next, enforce policies and determine how much information has been lost by enforcing policies
– Prof. Khan, Dr. Awad (Postdoc) and Student Workers (MS students)
● Apply game theory and probing techniques to extract information from semi-trustworthy partners
– Prof. Murat Kantarcioglu and Ryan Layfield (PhD Student)
● Data Mining for Defensive and Offensive Operations
– E.g., malicious code detection, honeypots
– Prof. Latifur Khan and Mehedy Masud
● Dynamic Trust Levels, Peer-to-Peer Communication
– Prof. Kevin Hamlen and Nathalie Tsybulnik (PhD student)
Introduction: Detecting Malicious Executables using Data Mining
● What are malicious executables?
– Harm computer systems
– Virus, Exploit, Denial of Service (DoS), Flooder, Sniffer, Spoofer, Trojan, etc.
– Exploit software vulnerabilities on a victim
– May remotely infect other victims
– Incur great losses. Example: the Code Red epidemic cost $2.6 billion
● Malicious code detection: traditional approach
– Signature based
– Requires signatures to be generated by human experts
– So, not effective against “zero day” attacks
State of the Art: Automated Detection
Automated detection approaches:
● Behavioural: analyse behaviours such as source and destination addresses, attachment type, statistical anomalies, etc.
● Content-based: analyse the content of the malicious executable
– Autograph (H.-A. Kim, CMU): based on an automated signature generation process
– N-gram analysis (Maloof, M. A. et al.): based on mining features and using machine learning
New Ideas
● Content-based approaches consider only machine code (byte code).
● Is it possible to consider higher-level source code for malicious code detection?
● Yes: disassemble the binary executable and retrieve the assembly program
● Extract important features from the assembly program
● Combine with machine-code features
Feature Extraction
● Binary n-gram features
– Sequence of n consecutive bytes of the binary executable
● Assembly n-gram features
– Sequence of n consecutive assembly instructions
● System API call features
– DLL function call information
The Hybrid Feature Retrieval Model
● Collect training samples of normal and malicious executables.
● Extract features
● Train a Classifier and build a model
● Test the model against test samples
Hybrid Feature Retrieval (HFR)
● Training
Hybrid Feature Retrieval (HFR)
● Testing
Feature Extraction
Binary n-gram features
– Features are extracted from the byte codes in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on.
Example: Given an 11-byte sequence: 0123456789abcdef012345,
the 2-grams (2-byte sequences) are: 0123, 2345, 4567, 6789, 89ab, abcd, cdef, ef01, 0123, 2345
the 4-grams (4-byte sequences) are: 01234567, 23456789, 456789ab, ..., ef012345, and so on.
Problem:
– Large dataset. Too many features (millions!).
Solution:
– Use secondary memory, efficient data structures
– Apply feature selection
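The sliding-window extraction in the example above can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `byte_ngrams` is hypothetical, and the slides do not say how the authors' extractor is implemented or how it stores the resulting features.

```python
def byte_ngrams(hex_string: str, n: int) -> list[str]:
    """Return every n-byte sliding-window n-gram of a hex-encoded byte string."""
    data = bytes.fromhex(hex_string)
    # Slide a window of n bytes forward one byte at a time.
    return [data[i:i + n].hex() for i in range(len(data) - n + 1)]


sample = "0123456789abcdef012345"  # the 11-byte sequence from the slide
two_grams = byte_ngrams(sample, 2)
four_grams = byte_ngrams(sample, 4)

# Repeated n-grams (0123 and 2345 here) collapse into one feature each.
distinct_two_grams = set(two_grams)

print(two_grams)
print(four_grams)
```

Collecting the n-grams of each executable into a set, as in `distinct_two_grams`, is one simple way to treat every distinct n-gram as a single candidate feature.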
Feature Extraction
Assembly n-gram features
– Features are extracted from the assembly programs in the form of n-grams, where n = 2, 4, 6, 8, 10 and so on.
Example: three instructions
“push eax”; “mov eax, dword[0f34]”; “add ecx, eax”;
2-grams:
(1) “push eax”; “mov eax, dword[0f34]”;
(2) “mov eax, dword[0f34]”; “add ecx, eax”;
Problem:
– Same problem as binary
Solution:
– Same solution
Feature Selection
● Select best K features
● Selection criterion: Information Gain
● The gain of an attribute A on a collection of examples S is given by
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
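As a rough illustration of information-gain-based selection, the sketch below computes the gain of a feature over labelled examples and ranks features by it. The toy data, the feature names, and the helper `select_top_k` are assumptions made for this example; they are not taken from the slides.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy(S) minus the weighted entropy of each subset S_v."""
    total = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain


def select_top_k(features, labels, k):
    """features maps a feature name to its per-example values (hypothetical layout)."""
    ranked = sorted(features, key=lambda name: information_gain(features[name], labels),
                    reverse=True)
    return ranked[:k]


# Toy data: 1 means the n-gram occurs in the executable, 0 means it does not.
labels = ["malicious", "malicious", "benign", "benign"]
features = {
    "ngram_a": [1, 1, 0, 0],  # separates the classes perfectly
    "ngram_b": [1, 0, 1, 0],  # carries no information about the class
}
best = select_top_k(features, labels, k=1)
print(best)
```

On this toy data `ngram_a` has a gain of 1 bit and `ngram_b` a gain of 0, so only `ngram_a` survives selection with k = 1.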
Experiments
● Dataset
– Dataset1: 838 malicious and 597 benign executables
– Dataset2: 1082 malicious and 1370 benign executables
– Collected malicious code from VX Heavens (http://vx.netlux.org)
● Disassembly
– Pedisassem (http://www.geocities.com/~sangcho/index.html)
● Training, Testing
– Support Vector Machine (SVM)
– C-Support Vector Classifiers with an RBF kernel
Results
● HFS = Hybrid Feature Set
● BFS = Binary Feature Set
● AFS = Assembly Feature Set
Future Plans
● System call:
– Seems to be very useful.
– Need to consider frequency of calls
– Call sequence pattern (following program path)
– Actions immediately preceding or following a call
● Detect malicious code by program slicing
– Requires analysis
Data Mining to Detect Buffer Overflow Attack
Mohammad M. Masud, Latifur Khan,
Bhavani Thuraisingham
Department of Computer Science
The University of Texas at Dallas
Introduction
● Goal
– Intrusion detection.
– E.g., worm attack, buffer overflow attack.
● Main Contribution
– 'Worm' code detection by data mining coupled with 'reverse engineering'.
– Buffer overflow detection by combining data mining with static analysis of assembly code.
Background
● What is 'buffer overflow'?
– A situation in which a fixed-size buffer is overflowed by a larger input.
● How does it happen?
– Example:
    ........
    char buff[100];
    gets(buff);
    ........
[Figure: stack memory containing buff, with the input string written into it.]
Background (cont...)
● Then what?
[Figure: the same code fragment and stack. The oversized input overruns buff and overwrites the return address; the new return address points to the memory location in buff that holds the attacker's code.]
Background (cont...)
● So what?
– The program may crash, or
– The attacker can execute arbitrary code
● The attacker's code can now
– Execute any system function
– Communicate with some host, download some 'worm' code, and install it!
– Open a backdoor to take full control of the victim
● How to stop it?
Background (cont...)
● Stopping buffer overflow
– Preventive approaches
– Detection approaches
● Preventive approaches
– Finding bugs in source code. Problem: can only work when source code is available.
– Compiler extension. Same problem.
– OS/HW modification
● Detection approaches
– Capture code running symptoms. Problem: may require long running time.
– Automatically generating signatures of buffer overflow attacks.
CodeBlocker (Our approach)
● A detection approach
● Based on the observation:
– Attack messages usually contain code while normal messages contain data.
● Main idea
– Check whether a message contains code
● Problem to solve:
– Distinguishing code from data
Severity of the problem
● It is not easy to identify the actual instruction sequence in a given string of bits
Our solution
● Apply data mining.
● Formulate the problem as a classification problem (code, data)
● Collect a set of training examples containing instances of both classes
● Train a machine learning algorithm on these examples to obtain a model
● Test this model against a new message
CodeBlocker Model
Feature Extraction
Disassembly
● We apply the SigFree tool, implemented by Xinran Wang et al. (Penn State)
Feature extraction
● Features are extracted using
– N-gram analysis
– Control flow analysis
● N-gram analysis
[Figure: an assembly program and its corresponding IFG.]
What is an n-gram?
– Sequence of n instructions
Traditional approach:
– Flow of control is ignored
2-grams are: 02, 24, 46, ..., CE
Feature extraction (cont...)
● Control-flow based n-gram analysis
[Figure: an assembly program and its corresponding IFG.]
What is an n-gram?
– Sequence of n instructions
Proposed control-flow based approach:
– Flow of control is considered
2-grams are: 02, 24, 46, ..., CE, E6
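The difference between the two 2-gram sets can be illustrated with a small sketch. It assumes, for illustration only, eight instructions at hexadecimal offsets 0 through E and a single jump from the instruction at E back to the one at 6, which is one reading of the extra 2-gram E6. The function names are hypothetical and this is not the CodeBlocker implementation.

```python
def sequential_2grams(offsets):
    """Traditional 2-grams: each instruction paired with the next one in the listing."""
    return [a + b for a, b in zip(offsets, offsets[1:])]


def control_flow_2grams(offsets, jump_edges):
    """Sequential 2-grams plus one 2-gram per (source, target) jump edge."""
    return sequential_2grams(offsets) + [src + dst for src, dst in jump_edges]


offsets = ["0", "2", "4", "6", "8", "A", "C", "E"]  # instruction start offsets
jump_edges = [("E", "6")]  # assumed: the instruction at E jumps back to 6

traditional = sequential_2grams(offsets)
control_flow_based = control_flow_2grams(offsets, jump_edges)

print(traditional)
print(control_flow_based)
```

Longer control-flow based n-grams would follow paths of n instructions through the same graph rather than windows over the listing.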
Feature extraction (cont...)
● Control flow analysis. Generated features
– Invalid Memory Reference (IMR)
– Undefined Register (UR)
– Invalid Jump Target (IJT)
● Checking IMR
– Memory is referenced using register addressing and the register value is undefined
– E.g.: mov ax, [dx + 5]
● Checking UR
– Check whether the register value is set properly
● Checking IJT
– Check whether the jump target violates an instruction boundary
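The three checks can be sketched as a single pass over a disassembled instruction list. The slides do not describe how the checks are implemented, so everything below is an assumption: the `Instr` record, its field names, and the simplification of scanning instructions in listing order instead of following the flow graph.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instr:
    """Hypothetical, simplified view of one disassembled instruction."""
    offset: int                                 # byte offset where the instruction starts
    reads: set = field(default_factory=set)     # registers whose values are used
    writes: set = field(default_factory=set)    # registers that are assigned
    mem_base: Optional[str] = None              # base register of a memory operand, if any
    jump_target: Optional[int] = None           # absolute target offset, if this is a jump


def control_flow_features(instrs):
    """Count IMR, UR and IJT occurrences in a linear scan (assumed simplification)."""
    defined = set()                         # registers assigned so far
    starts = {i.offset for i in instrs}     # valid instruction boundaries
    counts = {"IMR": 0, "UR": 0, "IJT": 0}
    for ins in instrs:
        if ins.mem_base is not None and ins.mem_base not in defined:
            counts["IMR"] += 1              # memory addressed via an undefined register
        counts["UR"] += len(ins.reads - defined)
        if ins.jump_target is not None and ins.jump_target not in starts:
            counts["IJT"] += 1              # jump lands inside another instruction
        defined |= ins.writes
    return counts


program = [
    Instr(0, writes={"ax"}),                  # e.g. mov ax, 1
    Instr(3, reads={"ax"}, writes={"bx"}),    # e.g. mov bx, ax
    Instr(5, writes={"ax"}, mem_base="dx"),   # mov ax, [dx + 5], dx never set
    Instr(8, reads={"cx"}),                   # uses cx before any assignment
    Instr(10, jump_target=4),                 # offset 4 is not an instruction start
]
features = control_flow_features(program)
print(features)
```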
Feature extraction (cont...)
● Why n-gram analysis?
– Intuition: in general, disassembled executables should have a different pattern of instruction usage than disassembled data.
● Why control flow analysis?
– Intuition: real code should contain no invalid memory references or invalid jump targets.
Putting it together
● Compute all possible n-grams
● Select best k of them
● Compute feature vector (binary vector) for each training example
● Supply these vectors to the training algorithm
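The binary feature vector step can be sketched as follows, assuming the best k n-grams have already been chosen. The function name and the sample n-grams are hypothetical; how the selected n-grams are ordered is also an assumption.

```python
def feature_vector(example_ngrams, selected_ngrams):
    """1 if the selected n-gram occurs in the example, else 0, in selection order."""
    present = set(example_ngrams)
    return [1 if gram in present else 0 for gram in selected_ngrams]


selected = ["02", "24", "E6"]            # assumed output of the selection step
message_ngrams = ["02", "46", "E6", "E6"]  # n-grams extracted from one example
vector = feature_vector(message_ngrams, selected)
print(vector)
```

Each training example yields one such vector, and the vectors together with their class labels form the input to the training algorithm.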
Experiments
● Dataset
– Real traces of normal messages
– Real attack messages
– Polymorphic shellcodes
● Training, Testing
– Support Vector Machine (SVM)
Results
● CFBn: Control-flow based n-gram feature
● CFF: Control-flow feature
Novelty / contribution
● We introduce the notion of control flow based n-gram
● We combine control flow analysis with data mining to detect code / data
● Significant improvement over other methods (e.g. SigFree)
Advantages
● 1) Fast testing
● 2) Signature free operation
● 3) Low overhead
● 4) Robust against many obfuscations
Limitations
● Need samples of attack and normal messages.
● May not be able to detect a completely new type of attack.
Future Works
● Find more features
● Apply dynamic analysis techniques
● Semantic analysis
Reference / suggested readings
– Wang, X., Pan, C., Liu, P., and Zhu, S. SigFree: A signature-free buffer overflow attack blocker. In USENIX Security, July 2006.
– Kolter, J. Z., and Maloof, M. A. Learning to detect malicious executables in the wild. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, pages 470–478, 2004.
Email Worm Detection (behavioural approach)
[Figure: features are extracted from outgoing emails; a machine learning classifier is trained on the training data to build the model, which labels test data as clean or infected.]
Feature Extraction
Per-email features
● Binary-valued features
– Presence of HTML; script tags/attributes; embedded images; hyperlinks
– Presence of binary, text attachments; MIME types of file attachments
● Continuous-valued features
– Number of attachments; number of words/characters in the subject and body
Per-window features
● Number of emails sent; number of unique email recipients; number of unique sender addresses; average number of words/characters per subject, body; average word length; variance in number of words/characters per subject, body; variance in word length
● Ratio of emails with attachments
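A few of the per-window features can be computed as in the sketch below. The dictionary layout of an email record (`sender`, `recipients`, `body`, `attachments`) is an assumption made for this example, and only a subset of the listed features is shown.

```python
from statistics import mean, pvariance


def window_features(emails):
    """Compute some per-window features over a list of outgoing emails."""
    body_words = [len(e["body"].split()) for e in emails]
    return {
        "num_emails": len(emails),
        "unique_recipients": len({r for e in emails for r in e["recipients"]}),
        "unique_senders": len({e["sender"] for e in emails}),
        "mean_body_words": mean(body_words),
        "var_body_words": pvariance(body_words),
        "attachment_ratio": sum(1 for e in emails if e["attachments"] > 0) / len(emails),
    }


# Hypothetical window of three outgoing emails.
window = [
    {"sender": "u@x", "recipients": ["a@x", "b@x"], "body": "hello there", "attachments": 1},
    {"sender": "u@x", "recipients": ["a@x"], "body": "see the attached file", "attachments": 0},
    {"sender": "u@x", "recipients": ["c@x"], "body": "this is a much longer message", "attachments": 2},
]
stats = window_features(window)
print(stats)
```

Population variance (`pvariance`) is used here for simplicity; the slides do not say which variance estimator was used.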
Feature Reduction & Selection
● Principal Component Analysis
– Reduces higher-dimensional data to a lower dimension
– Helps reduce noise and overfitting
● Decision Tree
– Used to select the best features
Experiments
● Data Set
– Contains instances for both normal and viral emails.
– Six worm types: bagle.f, bubbleboy, mydoom.m, mydoom.u, netsky.d, sobig.f
– Collected from UC Berkeley
● Training, Testing:
– Decision Tree: C4.5 algorithm (J48) in Weka
– Support Vector Machine (SVM) and Naïve Bayes (NB).
Results
Conclusion & Future Work
● Three approaches have been tested
– Apply classifier directly
– Apply dimension reduction (PCA) and then classify
– Apply feature selection (decision tree) and then classify
● Decision tree has the best performance
● Future Plans
– Combine content-based with behavioral approaches
● Offensive Operations
– Honeypots, Information operations