A Hybrid Model to Detect Malicious Executables
Mohammad M. Masud
Latifur Khan
Bhavani Thuraisingham
Department of Computer Science
The University of Texas at Dallas
Presentation Outline
Overview
Background
Our approach
Feature description
Feature extraction
Experiments
Results
Conclusion
Overview
Goal: detecting malicious executables
Contribution:
A new model that combines binary, assembly, and library call features
An efficient technique to retrieve assembly features from binary features
A scalable solution to n-gram feature extraction
Novelty: combining classical binary n-gram features with features extracted through reverse engineering
Malicious Executables: Background
Programs that perform malicious activities, such as
destroying data
stealing information
clogging networks, etc.
Built with different architectures, such as
Independent programs (e.g., worms)
Dependent (piggybacked) on a host program (e.g., viruses)
Propagation mechanisms
Mobile: propagates automatically through networks (worms)
Static: propagates when infected files are transferred (viruses)
Detecting Malicious Executables
Traditional way: signature-based detection
Problems:
Requires human intervention
Not effective against "zero-day" attacks, because it is too slow
Requirements:
Fast detection
No human intervention (automatic)
Recent techniques:
Signature auto-generation (EarlyBird, Autograph, Polygraph)
Data mining based (Stolfo et al., Maloof et al.)
Our Approach
Design goals: to obtain a solution that
Is free of signatures
Requires no human intervention
Can detect new variants and/or zero-day attacks
Our "Hybrid Feature Retrieval" (HFR) model
Is based on data mining
Meets all three design goals
Steps:
Collection of training data (malicious & benign executables)
Feature extraction & selection
Training with a classifier
Testing and detection
Top-level Architecture
Training: training data (executables) -> feature extraction & selection -> training (SVM) -> classifier
Testing: new executable -> feature extraction -> testing (SVM) -> infected? Yes: delete; No: keep
Features
Binary n-grams
Assembly instruction sequences (corresponding to the binary n-grams)
DLL function calls
Binary n-gram Features
Each binary executable is a 'string' of bytes
An n-gram of the binary is a sequence of n consecutive bytes
Example: a string of four bytes: "ab05ef23" (in hexadecimal)
1-grams: "ab", "05", "ef", "23" (single bytes)
2-grams: "ab05", "05ef", "ef23" (2-byte sequences)
3-grams: "ab05ef", "05ef23" (3-byte sequences)
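The sliding-window extraction described above can be sketched in a few lines (an illustration, not the authors' Java implementation):

```python
def byte_ngrams(data: bytes, n: int) -> list:
    """Return every n-gram (n consecutive bytes) of `data` as a hex string."""
    return [data[i:i + n].hex() for i in range(len(data) - n + 1)]

sample = bytes.fromhex("ab05ef23")  # the four-byte example from the slide
print(byte_ngrams(sample, 2))       # ['ab05', '05ef', 'ef23']
```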
Binary n-gram Feature Extraction
Each binary executable is scanned
Each extracted n-gram is stored in a balanced binary search tree to avoid duplicates
Each n-gram's frequency of occurrence in the training data is also stored in the tree
Binary n-gram Feature Extraction (contd...)
Using an AVL tree (a balanced binary search tree), we ensure fast insertion and searching
Using disk I/O, we overcome memory limitations
Example: executables being scanned: 1: "abcdef"  2: "93abcd"  3: "dc0ef2"  4: "0ef7gh"
The AVL tree stores each 2-gram with its frequency, e.g.: (abcd, 2), (93ab, 1), (cdef, 1), (dc0e, 1)
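The counting step can be sketched as follows; the slides use a disk-backed AVL tree to stay within memory limits, while this illustration substitutes an in-memory hash map (`Counter`):

```python
from collections import Counter

def count_ngrams(executables, n):
    """Tally how often each byte n-gram occurs across the training executables.
    A hash map stands in here for the disk-backed AVL tree of the slides."""
    counts = Counter()
    for data in executables:
        for i in range(len(data) - n + 1):
            counts[data[i:i + n].hex()] += 1
    return counts

execs = [bytes.fromhex(s) for s in ("abcdef", "93abcd", "dc0ef2")]
counts = count_ngrams(execs, 2)
print(counts["abcd"])  # 2 -- "abcd" occurs in the first and second executables
```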
Feature Selection
Motivation:
The total number of extracted n-grams may be very large (on the order of millions)
A classifier cannot be trained with so many features
We select the K best n-grams using the Information Gain criterion
The Information Gain of a binary attribute A on a collection of examples S is given by:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

Values(A): set of all possible values for attribute A
S_v: subset of S for which attribute A has value v
Selected binary features are called the "Binary Feature Set" (BFS)
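For a binary feature (present/absent), the formula reduces to a short computation. A minimal sketch (function and variable names are ours, not from the slides):

```python
import math

def entropy(pos, neg):
    """Entropy of a set containing pos positive and neg negative examples."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

def info_gain(malicious, present):
    """Gain(S, A) for a binary feature A over examples S.
    malicious[i] -- True if example i is malicious (the class label)
    present[i]   -- True if the candidate n-gram occurs in example i"""
    n = len(malicious)
    gain = entropy(sum(malicious), n - sum(malicious))
    for v in (True, False):
        sv = [m for m, p in zip(malicious, present) if p == v]
        if sv:
            gain -= len(sv) / n * entropy(sum(sv), len(sv) - sum(sv))
    return gain
```

A feature that perfectly separates the classes has gain 1.0; one independent of the class has gain 0.0.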
Assembly Features
An assembly feature is a sequence of assembly instructions
We call these features "Derived Assembly Features" (DAF)
Every DAF corresponds to a selected binary n-gram
Motivation for extracting DAF:
An n-gram may contain only partial information
A DAF contains more complete information
Assembly Feature Extraction
Disassemble all executables
For each selected binary n-gram Q do
S <- all assembly instruction sequences in the disassembled executables corresponding to Q
DAF_Q <- best assembly instruction sequence in S according to information gain
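The selection step can be sketched as below, assuming the candidate sequences for Q have already been gathered along with counts of how many malicious and benign executables contain each one (the function name and data shapes are our illustration; the slides do not specify this bookkeeping):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

def best_daf(candidates, n_mal, n_ben):
    """Choose DAF_Q: among the candidate instruction sequences found for one
    n-gram Q, pick the one whose presence best separates malicious from
    benign executables, by information gain.
    candidates: {sequence: (count in malicious, count in benign)}"""
    base, total = entropy(n_mal, n_ben), n_mal + n_ben

    def gain(seq):
        m, b = candidates[seq]
        am, ab = n_mal - m, n_ben - b
        g = base
        if m + b:
            g -= (m + b) / total * entropy(m, b)
        if am + ab:
            g -= (am + ab) / total * entropy(am, ab)
        return g

    return max(candidates, key=gain)
```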
Assembly Feature Extraction (contd...)
Example: Let “00005068” be a selected 4-gram (Q)
Following Assembly instruction sequences (S) corresponding to Q are found in the disassembled executables:
DAFQ is selected from these sequences using information gain
Op-code sequence                Assembly sequence
E8B7020000 50 6828234000        call 00401818; push eax; push 00402328
0FB6800D020000 50 68CC000000    movzx eax, byte[eax+20]; push eax; push 000000CC
8B805C040000 50 6801040000      mov eax, dword[eax+45]; push eax; push 00000401
0FB6800D020000 50 68CC000000    movzx eax, byte[eax+20]; push eax; push 000000CC
6890100000 50 683C614000        push 00001090; push eax; push 0040613C
DLL Function Call Features
DLL function call features are the names of system functions called from the executables
Example: call getProcAddress()
These features are extracted from the executable header
We extract all DLL call features from the training data and select a subset using information gain
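As a toy illustration of collecting these names (the sample text and regex are ours; in the actual system the names come from the import table in the executable's header):

```python
import re

# Hypothetical disassembly snippet, for illustration only
disasm = """
call getProcAddress()
mov eax, ebx
call LoadLibraryA()
call getProcAddress()
"""

def dll_call_features(text):
    """Collect the distinct function names invoked via `call name()` lines."""
    return sorted(set(re.findall(r"call\s+(\w+)\(\)", text)))

print(dll_call_features(disasm))  # ['LoadLibraryA', 'getProcAddress']
```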
Combining Features
Each feature is treated as a 'binary' (present/absent) feature
We create a vector V of all selected features, where V[i] corresponds to the i-th feature
This vector is called the Hybrid Feature Set (HFS)
For each executable E in the training data, we create a binary feature vector B corresponding to V, where
B[i] is 1 if V[i] is present in E
B[i] is 0 if V[i] is absent in E
We train a classifier using these vectors
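Building the vector B against the HFS V is a direct membership test; a minimal sketch with hypothetical feature names:

```python
def feature_vector(hfs, features_in_exe):
    """B[i] = 1 if the i-th HFS feature occurs in the executable, else 0."""
    present = set(features_in_exe)
    return [1 if f in present else 0 for f in hfs]

# A hypothetical three-entry HFS mixing the three feature kinds
hfs = ["ab05ef23", "movzx eax, byte[eax+20]; push eax", "getProcAddress"]
print(feature_vector(hfs, {"ab05ef23", "getProcAddress"}))  # [1, 0, 1]
```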
Experiments
Collect real samples of malicious and normal executables
Extract and select features
Combine the features into HFS
We also extract assembly n-gram features (sequences of n assembly instructions), called the Assembly Feature Set (AFS)
Test the accuracy of each of the three feature sets (BFS, AFS, HFS) using SVM with three-fold cross-validation
Data Set
There are two datasets, with the following distribution:
Malicious instances are collected from http://vx.netlux.org/
Benign instances are collected from Windows XP machines and other sources

Dataset   # malicious   # benign   Total
1         838           597        1,435
2         1,082         1,370      2,452
Experimental Setup
OS & H/W: Sun Solaris & Linux machines (2 GHz CPU, 4 GB RAM)
Disassembler: PEdisassem (disassembles Windows Portable Executables), available from http://www.geocities.com/~sangcho/
Feature extraction implemented in Java (JDK 1.5); K = 500 (number of binary n-grams selected)
Support Vector Machine: libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/); SVM parameters: C-SVC with polynomial kernel
Results
HFS (Hybrid Feature Set) has the highest accuracy
AFS: Assembly Feature Set
BFS: Binary Feature Set
DLL features are not shown, because DLL n-gram features perform poorly for n > 1; we therefore use only DLL 1-grams in the HFS

Classification accuracy (%) of SVM on different feature sets for different values of n:

        Dataset 1              Dataset 2
n       HFS    BFS    AFS      HFS    BFS    AFS
1       93.4   63.0   88.4     92.1   59.4   88.6
2       96.8   94.1   88.1     96.3   92.1   87.9
4       96.3   95.6   90.9     97.4   92.8   89.4
6       97.4   95.5   87.2     96.9   93.0   86.7
8       96.9   95.1   87.7     97.2   93.4   85.1
10      97.0   95.7   73.7     97.3   92.8   75.8
Avg     96.30  89.83  86.00    96.15  87.52  85.58
Results (contd...)
HFS (Hybrid Feature Set) has the lowest false positive and false negative rates
AFS: Assembly Feature Set
BFS: Binary Feature Set

False positive (%) / false negative (%) on different feature sets for different values of n:

        Dataset 1                       Dataset 2
n       HFS      BFS       AFS          HFS      BFS       AFS
1       8.0/5.6  77.7/7.9  12.4/11.1    7.5/8.3  65.0/9.8  12.8/9.6
2       5.3/1.7  6.0/5.7   22.8/4.2     3.4/4.1  5.6/10.6  15.1/8.3
4       4.9/2.9  6.4/3.0   16.4/3.8     2.5/2.2  7.4/6.9   12.6/8.1
6       3.5/2.0  5.7/3.7   24.5/4.5     3.2/2.9  6.1/8.1   17.8/7.6
8       4.9/1.9  6.0/4.1   26.3/2.3     3.1/2.3  6.0/7.5   19.9/8.6
10      5.5/1.2  5.2/3.6   43.9/1.7     3.4/1.9  6.3/8.4   30.4/16.4
Avg     5.4/2.6  17.8/4.7  24.4/3.3     3.9/3.6  16.1/8.9  18.1/9.8
Results (contd...)
Receiver Operating Characteristic (ROC) curves: HFS has the best ROC curve (a better curve means a greater area under the curve)
Results (contd...)
HFS has the greatest area under the curve

Area under the ROC curve on different feature sets:

        Dataset 1                     Dataset 2
n       HFS     BFS     AFS           HFS     BFS     AFS
1       0.9767  0.7023  0.9467        0.9666  0.7250  0.9489
2       0.9883  0.9782  0.9403        0.9919  0.9720  0.9373
4       0.9928  0.9825  0.9651        0.9948  0.9708  0.9515
6       0.9949  0.9831  0.9421        0.9951  0.9733  0.9358
8       0.9946  0.9766  0.9398        0.9956  0.9760  0.9254
10      0.9929  0.9777  0.8663        0.9967  0.9700  0.8736
Avg     0.9900  0.9334  0.9334        0.9901  0.9312  0.9288
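The area under an ROC curve is commonly computed with the trapezoidal rule over the curve's (FP rate, TP rate) points; a minimal sketch (the points below are hypothetical, not from the experiments):

```python
def roc_auc(points):
    """Area under an ROC curve given (FP rate, TP rate) points,
    by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# A hypothetical three-point curve from (0, 0) to (1, 1)
print(roc_auc([(0.0, 0.0), (0.1, 0.9), (1.0, 1.0)]))  # ~0.9
```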
Conclusion
The Hybrid Feature Retrieval (HFR) model retrieves a novel combination of three different kinds of features
We have implemented an efficient, scalable solution to n-gram feature extraction in general
Our results are better than those of other techniques
Future work:
Handle obfuscation
Operate online, in real time
Thank you