A Hybrid Model to Detect Malicious Executables
Mohammad M. Masud
Latifur Khan
Bhavani Thuraisingham
Department of Computer Science
The University of Texas at Dallas
Presentation Outline
Overview
Background
Our approach
Feature description
Feature extraction
Experiments
Results
Conclusion
Overview
Goal: detecting malicious executables
Contribution:
A new model that combines binary, assembly, and library call features
An efficient technique to retrieve assembly features from binary features
A scalable solution to n-gram feature extraction
Novelty: combining classical binary n-gram features with features extracted through reverse engineering
Malicious Executables: Background
Programs that perform malicious activities, such as
destroying data
stealing information
clogging networks, etc.
Built with different architectures, such as
Independent programs (e.g., worms)
Dependent (piggybacked) on a host program (e.g., viruses)
Propagation mechanisms
Mobile: propagates automatically through networks (worms)
Static: propagates when infected files are transferred (viruses)
Detecting Malicious Executables
Traditional way: signature-based detection
Problems:
Requires human intervention
Not effective against "zero-day" attacks, because it is too slow
Requirements:
Fast detection
No human intervention (automatic)
Recent techniques:
Signature auto-generation (EarlyBird, Autograph, Polygraph)
Data mining based (Stolfo et al., Maloof et al.)
Our Approach
Design goals: to obtain a solution that
Is free of signatures
Requires no human intervention
Can detect new variants and/or zero-day attacks
Our "Hybrid Feature Retrieval" (HFR) model
Is based on data mining
Meets all three design goals
Steps:
Collection of training data (malicious & benign executables)
Feature extraction & selection
Training with a classifier
Testing and detection
Top-level Architecture
Training: training data (executables) -> feature extraction & selection -> training (SVM) -> classifier
Testing: new executable -> feature extraction -> testing (SVM) -> infected? Yes: delete; No: keep
Features
Binary n-grams
Assembly instruction sequences (corresponding to the binary n-grams)
DLL function calls
Binary n-gram Features
Each binary executable is a 'string' of bytes
An n-gram of the binary is a sequence of n consecutive bytes
Example: a string of four bytes: "ab05ef23" (in hexadecimal)
1-grams: "ab", "05", "ef", "23" (single bytes)
2-grams: "ab05", "05ef", "ef23" (2-byte sequences)
3-grams: "ab05ef", "05ef23" (3-byte sequences)
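The sliding-window extraction described above can be sketched in a few lines (an illustration, not the authors' Java implementation):

```python
def byte_ngrams(data: bytes, n: int) -> list:
    """Return every n-gram (n consecutive bytes) of `data` as a hex string."""
    return [data[i:i + n].hex() for i in range(len(data) - n + 1)]

sample = bytes.fromhex("ab05ef23")  # the four-byte example from the slide
print(byte_ngrams(sample, 2))       # ['ab05', '05ef', 'ef23']
```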
Binary n-gram Feature Extraction
Each binary executable is scanned
Each extracted n-gram is stored in a balanced binary search tree to avoid duplicates
Each n-gram's frequency of occurrence in the training data is also stored in the tree
Binary n-gram Feature Extraction (contd...)
Using an AVL tree (a balanced binary search tree), we ensure fast insertion and searching
Using disk I/O, we overcome memory limitations
Example: executables being scanned: 1: "abcdef"  2: "93abcd"  3: "dc0ef2"  4: "0ef7gh"
The AVL tree stores each 2-gram with its frequency, e.g.: (abcd, 2), (93ab, 1), (cdef, 1), (dc0e, 1)
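The counting step can be sketched as follows; the slides use a disk-backed AVL tree to stay within memory limits, while this illustration substitutes an in-memory hash map (`Counter`):

```python
from collections import Counter

def count_ngrams(executables, n):
    """Tally how often each byte n-gram occurs across the training executables.
    A hash map stands in here for the disk-backed AVL tree of the slides."""
    counts = Counter()
    for data in executables:
        for i in range(len(data) - n + 1):
            counts[data[i:i + n].hex()] += 1
    return counts

execs = [bytes.fromhex(s) for s in ("abcdef", "93abcd", "dc0ef2")]
counts = count_ngrams(execs, 2)
print(counts["abcd"])  # 2 -- "abcd" occurs in the first and second executables
```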
Feature Selection
Motivation:
The total number of extracted n-grams may be very large (on the order of millions)
A classifier cannot be trained with so many features
We select the K best n-grams using the Information Gain criterion
The Information Gain of a binary attribute A on a collection of examples S is given by:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

Values(A): set of all possible values for attribute A
S_v: subset of S for which attribute A has value v
Selected binary features are called the "Binary Feature Set" (BFS)
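For a binary feature (present/absent), the formula reduces to a short computation. A minimal sketch (function and variable names are ours, not from the slides):

```python
import math

def entropy(pos, neg):
    """Entropy of a set containing pos positive and neg negative examples."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

def info_gain(malicious, present):
    """Gain(S, A) for a binary feature A over examples S.
    malicious[i] -- True if example i is malicious (the class label)
    present[i]   -- True if the candidate n-gram occurs in example i"""
    n = len(malicious)
    gain = entropy(sum(malicious), n - sum(malicious))
    for v in (True, False):
        sv = [m for m, p in zip(malicious, present) if p == v]
        if sv:
            gain -= len(sv) / n * entropy(sum(sv), len(sv) - sum(sv))
    return gain
```

A feature that perfectly separates the classes has gain 1.0; one independent of the class has gain 0.0.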
Assembly Features
An assembly feature is a sequence of assembly instructions
We call these features "Derived Assembly Features" (DAF)
Every DAF corresponds to a selected binary n-gram
Motivation for extracting DAF:
An n-gram may contain only partial information
A DAF contains more complete information
Assembly Feature Extraction
Disassemble all executables
For each selected binary n-gram Q do
S <- all assembly instruction sequences in the disassembled executables corresponding to Q
DAF_Q <- best assembly instruction sequence in S according to information gain
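The selection step can be sketched as below, assuming the candidate sequences for Q have already been gathered along with counts of how many malicious and benign executables contain each one (the function name and data shapes are our illustration; the slides do not specify this bookkeeping):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

def best_daf(candidates, n_mal, n_ben):
    """Choose DAF_Q: among the candidate instruction sequences found for one
    n-gram Q, pick the one whose presence best separates malicious from
    benign executables, by information gain.
    candidates: {sequence: (count in malicious, count in benign)}"""
    base, total = entropy(n_mal, n_ben), n_mal + n_ben

    def gain(seq):
        m, b = candidates[seq]
        am, ab = n_mal - m, n_ben - b
        g = base
        if m + b:
            g -= (m + b) / total * entropy(m, b)
        if am + ab:
            g -= (am + ab) / total * entropy(am, ab)
        return g

    return max(candidates, key=gain)
```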
Assembly Feature Extraction (contd...)
Example: Let “00005068” be a selected 4-gram (Q)
Following Assembly instruction sequences (S) corresponding to Q are found in the disassembled executables:
DAFQ is selected from these sequences using information gain
Op-code sequence                Assembly sequence
E8B7020000 50 6828234000        call 00401818; push eax; push 00402328
0FB6800D020000 50 68CC000000    movzx eax, byte[eax+20]; push eax; push 000000CC
8B805C040000 50 6801040000      mov eax, dword[eax+45]; push eax; push 00000401
0FB6800D020000 50 68CC000000    movzx eax, byte[eax+20]; push eax; push 000000CC
6890100000 50 683C614000        push 00001090; push eax; push 0040613C
DLL Function Call Features
DLL function call features are the names of system functions called from the executables
Example: call getProcAddress()
These features are extracted from the executable header
We extract all DLL call features from the training data and select a subset using information gain
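As a toy illustration of collecting these names (the sample text and regex are ours; in the actual system the names come from the import table in the executable's header):

```python
import re

# Hypothetical disassembly snippet, for illustration only
disasm = """
call getProcAddress()
mov eax, ebx
call LoadLibraryA()
call getProcAddress()
"""

def dll_call_features(text):
    """Collect the distinct function names invoked via `call name()` lines."""
    return sorted(set(re.findall(r"call\s+(\w+)\(\)", text)))

print(dll_call_features(disasm))  # ['LoadLibraryA', 'getProcAddress']
```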
Combining Features
Each feature is treated as a 'binary' (present/absent) feature
We create a vector V of all selected features, where V[i] corresponds to the i-th feature
This vector is called the Hybrid Feature Set (HFS)
For each executable E in the training data, we create a binary feature vector B corresponding to V, where
B[i] is 1 if V[i] is present in E
B[i] is 0 if V[i] is absent in E
We train a classifier using these vectors
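Building the vector B against the HFS V is a direct membership test; a minimal sketch with hypothetical feature names:

```python
def feature_vector(hfs, features_in_exe):
    """B[i] = 1 if the i-th HFS feature occurs in the executable, else 0."""
    present = set(features_in_exe)
    return [1 if f in present else 0 for f in hfs]

# A hypothetical three-entry HFS mixing the three feature kinds
hfs = ["ab05ef23", "movzx eax, byte[eax+20]; push eax", "getProcAddress"]
print(feature_vector(hfs, {"ab05ef23", "getProcAddress"}))  # [1, 0, 1]
```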
Experiments
Collect real samples of malicious and normal executables
Extract and select features
Combine the features into HFS
We also extract assembly n-gram features (sequences of n assembly instructions), called the Assembly Feature Set (AFS)
Test the accuracy of each of the three feature sets (BFS, AFS, HFS) using SVM with three-fold cross-validation
Data Set
There are two datasets, with the following distribution:
Malicious instances are collected from http://vx.netlux.org/
Benign instances are collected from Windows XP machines and other sources

Dataset   # malicious   # benign   Total
1         838           597        1,435
2         1,082         1,370      2,452
Experimental Setup
OS & H/W: Sun Solaris & Linux machines (2 GHz CPU, 4 GB RAM)
Disassembler: PEdisassem (disassembles Windows Portable Executables), available from http://www.geocities.com/~sangcho/
Feature extraction implemented in Java (JDK 1.5); K = 500 (number of binary n-grams selected)
Support Vector Machine: libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/); SVM parameters: C-SVC with polynomial kernel
Results
HFS (Hybrid Feature Set) has the highest accuracy
AFS: Assembly Feature Set
BFS: Binary Feature Set
DLL features are not shown, because DLL n-gram features perform poorly for n > 1; we therefore use only DLL 1-grams in the HFS

Classification accuracy (%) of SVM on different feature sets for different values of n:

        Dataset 1              Dataset 2
n       HFS    BFS    AFS      HFS    BFS    AFS
1       93.4   63.0   88.4     92.1   59.4   88.6
2       96.8   94.1   88.1     96.3   92.1   87.9
4       96.3   95.6   90.9     97.4   92.8   89.4
6       97.4   95.5   87.2     96.9   93.0   86.7
8       96.9   95.1   87.7     97.2   93.4   85.1
10      97.0   95.7   73.7     97.3   92.8   75.8
Avg     96.30  89.83  86.00    96.15  87.52  85.58
Results (contd...)
HFS (Hybrid Feature Set) has the lowest false positive and false negative rates
AFS: Assembly Feature Set
BFS: Binary Feature Set

False positive (%) / false negative (%) on different feature sets for different values of n:

        Dataset 1                       Dataset 2
n       HFS      BFS       AFS          HFS      BFS       AFS
1       8.0/5.6  77.7/7.9  12.4/11.1    7.5/8.3  65.0/9.8  12.8/9.6
2       5.3/1.7  6.0/5.7   22.8/4.2     3.4/4.1  5.6/10.6  15.1/8.3
4       4.9/2.9  6.4/3.0   16.4/3.8     2.5/2.2  7.4/6.9   12.6/8.1
6       3.5/2.0  5.7/3.7   24.5/4.5     3.2/2.9  6.1/8.1   17.8/7.6
8       4.9/1.9  6.0/4.1   26.3/2.3     3.1/2.3  6.0/7.5   19.9/8.6
10      5.5/1.2  5.2/3.6   43.9/1.7     3.4/1.9  6.3/8.4   30.4/16.4
Avg     5.4/2.6  17.8/4.7  24.4/3.3     3.9/3.6  16.1/8.9  18.1/9.8
Results (contd...)
Receiver Operating Characteristic (ROC) curves: HFS has the best ROC curve (a better curve means a greater area under the curve)
Results (contd...)
HFS has the greatest area under the curve

Area under the ROC curve on different feature sets:

        Dataset 1                     Dataset 2
n       HFS     BFS     AFS           HFS     BFS     AFS
1       0.9767  0.7023  0.9467        0.9666  0.7250  0.9489
2       0.9883  0.9782  0.9403        0.9919  0.9720  0.9373
4       0.9928  0.9825  0.9651        0.9948  0.9708  0.9515
6       0.9949  0.9831  0.9421        0.9951  0.9733  0.9358
8       0.9946  0.9766  0.9398        0.9956  0.9760  0.9254
10      0.9929  0.9777  0.8663        0.9967  0.9700  0.8736
Avg     0.9900  0.9334  0.9334        0.9901  0.9312  0.9288
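The area under an ROC curve is commonly computed with the trapezoidal rule over the curve's (FP rate, TP rate) points; a minimal sketch (the points below are hypothetical, not from the experiments):

```python
def roc_auc(points):
    """Area under an ROC curve given (FP rate, TP rate) points,
    by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# A hypothetical three-point curve from (0, 0) to (1, 1)
print(roc_auc([(0.0, 0.0), (0.1, 0.9), (1.0, 1.0)]))  # ~0.9
```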
Conclusion
The Hybrid Feature Retrieval (HFR) model retrieves a novel combination of three different kinds of features
We have implemented an efficient, scalable solution to n-gram feature extraction in general
Our results are better than those of other techniques
Future work:
Handle obfuscation
Operate online, in real time
Thank you