Transcript of: Statistical Structures: Fingerprinting Malware for Classification and Analysis
Statistical Structures: Fingerprinting Malware for Classification and Analysis
Daniel Bilar
Wellesley College (Wellesley, MA)
Colby College (Waterville, ME)
bilar <at> alum dot dartmouth dot org
Why Structural Fingerprinting?
Goal: Identifying and classifying malware
Problem: For any single fingerprint, the balance between over-fitting (type II error) and under-fitting (type I error) is hard to achieve
Approach: View binaries simultaneously from different structural perspectives and perform statistical analysis on these 'structural fingerprints'
Different Perspectives

Statistical Fingerprint       | Description                                          | Structural Perspective  | Static/Dynamic?
Opcode frequency distribution | Count different instructions                         | Assembly instruction    | Primarily static
API call vector               | Observe API calls made                               | Win32 API call          | Primarily dynamic
Graph structural properties   | Explore graph-modeled control and data dependencies  | System Dependence Graph | Primarily static
Idea: Multiple perspectives may increase likelihood of correct identification and classification
Fingerprint: Opcode frequency distribution
Synopsis: Statically disassemble the binary, tabulate the opcode frequencies and construct a statistical fingerprint from a subset of said opcodes.
Goal: Compare opcode fingerprints across non-malicious software and malware classes for quick identification and classification purposes.
Main result: 'Rare' opcodes explain more data variation than common ones
Goodware: Opcode Distribution
Procedure:
1. Inventoried PEs (EXE, DLL, etc.) on XP box with Advanced Disk Catalog
2. Chose random EXE samples with MS Excel and Index your Files
3. Ran IDA with modified InstructionCounter plugin on sample PEs
4. Augmented IDA output files with PEiD results (compiler) and general 'functionality class' (e.g. file utility, IDE, network utility, etc.)
5. Wrote Java parser for raw data files and fed JAMA'ed matrix into Excel for analysis
---------.exe  -------.exe  ---------.exe
size: 122880
totalopcodes: 10680
compiler: MS Visual C++ 6.0
class: utility (process)
0001. 002145 20.08% mov
0002. 001859 17.41% push
0003. 000760 7.12% call
0004. 000759 7.11% pop
0005. 000641 6.00% cmp
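The tabulation in steps 3-5 (count each opcode, then normalize into a frequency listing like the one above) can be sketched in a few lines of Python. The `opcode_fingerprint` helper and the toy opcode stream below are illustrative only, not the actual InstructionCounter output format:

```python
from collections import Counter

def opcode_fingerprint(opcodes, top_n=5):
    """Tabulate opcode frequencies and return the top-N counts and shares,
    mirroring the InstructionCounter-style listing above."""
    counts = Counter(opcodes)
    total = sum(counts.values())
    return [(op, n, 100.0 * n / total) for op, n in counts.most_common(top_n)]

# Toy stream of disassembled opcodes (illustrative only)
stream = ["mov"] * 6 + ["push"] * 3 + ["call"]
for op, n, pct in opcode_fingerprint(stream):
    print(f"{n:06d} {pct:5.2f}% {op}")
```

Collecting such vectors per binary, stacked into a matrix, gives the input for the statistical analysis that follows.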
Malware: Opcode Distribution
Procedure:
1. Booted VMPlayer with XP image
2. Inventoried PEs from C. Ries malware collection with Advanced Disk Catalog
3. Fixed 7 classes (e.g. virus, rootkit, etc.), chose random PE samples with MS Excel and Index your Files
4. Ran IDA with modified InstructionCounter plugin on sample PEs
5. Augmented IDA output files with PEiD results (compiler, packer) and 'class'
6. Wrote Java parser for raw data files and fed JAMA'ed matrix into Excel for analysis
Giri.5209  Gobi.a  ---------.b
size: 12288
totalopcodes: 615
compiler: unknown
class: virus
0001. 000112 18.21% mov
0002. 000094 15.28% push
0003. 000052 8.46% call
0004. 000051 8.29% cmp
0005. 000040 6.50% add
AFXRK2K4.root.exe  vanquish.dll
Aggregate (Goodware): Opcode Breakdown

[Pie chart, labelled slices: mov 25%, push 19%, call 9%, pop 6%, cmp 5%, jz 4%, lea 4%, test 3%, jmp 3%, jnz 3%, add 3%, retn 2%, xor 2%, and 1%]

20 EXEs (size-blocked random samples from 538 inventoried EXEs)
~1,520,000 opcodes read
192 out of 420 possible opcodes found
72 opcodes in pie chart account for >99.8%
14 opcodes labelled account for ~90%
Top 5 opcodes account for ~64%
Aggregate (Malware):Opcode Breakdown
sub1%
xor3%
add3%
lea3%
jmp3%
jz4%
cmp4% pop
6% call10%
push16%
mov30%
test3%
retn3%
jnz3%
67 PEs (class-blocked random samples from 250 inventoried PEs)
~665,000 opcodes read
141 out of 420 possible opcodes found (two undocu-mented)
60 opcodes in pie chart account for >99.8%
14 opcodes labelled account for >92%
Top 5 opcodes account for ~65%
67 PEs (class-blocked random samples from 250 inventoried PEs)
~665,000 opcodes read
141 out of 398 possible opcodes found (two undocu-mented)
60 opcodes in pie chart account for >99.8%
14 opcodes labelled account for >92%
Top 5 opcodes account for ~65%
Class-blocked (Malware): Opcode Breakdown Comparison

Aggregate breakdown: mov 30%, push 16%, call 10%, pop 6%, cmp 4%, jz 4%, jmp 4%, lea 3%, add 3%, test 3%, retn 3%, jnz 2%, xor 2%, sub 1%

[Per-class charts compare the aggregate breakdown against: worms, virus, tools, trojans, rootkit (kernel), rootkit (user), bots]
Top 14 Opcodes: Frequency (%)

Opcode | KernelRK | UserRK | Tools | Bot  | Trojan | Virus | Goodware | Worms
mov    | 37.0     | 29.0   | 25.4  | 34.6 | 30.5   | 16.1  | 25.3     | 22.2
push   | 15.6     | 16.6   | 19.0  | 14.1 | 15.4   | 22.7  | 19.5     | 20.7
call   | 5.5      | 8.9    | 8.2   | 11.0 | 10.0   | 9.1   | 8.7      | 8.7
pop    | 2.7      | 5.1    | 5.9   | 6.8  | 7.3    | 7.0   | 6.3      | 6.2
cmp    | 6.4      | 4.9    | 5.3   | 3.6  | 3.6    | 5.9   | 5.1      | 5.0
jz     | 3.3      | 3.9    | 4.3   | 3.3  | 3.5    | 4.4   | 4.3      | 4.0
lea    | 1.8      | 3.3    | 3.1   | 2.6  | 2.7    | 5.5   | 3.9      | 4.2
test   | 1.8      | 3.2    | 3.7   | 2.6  | 3.4    | 3.1   | 3.2      | 3.0
jmp    | 4.1      | 3.8    | 3.4   | 3.0  | 3.4    | 2.7   | 3.0      | 4.5
add    | 5.8      | 3.7    | 3.4   | 2.5  | 3.0    | 3.5   | 3.0      | 3.0
jnz    | 3.7      | 3.1    | 3.4   | 2.2  | 2.6    | 3.2   | 2.6      | 3.2
retn   | 1.7      | 2.3    | 2.9   | 3.0  | 3.2    | 2.0   | 2.2      | 2.3
xor    | 1.1      | 2.3    | 2.1   | 3.2  | 2.7    | 2.1   | 1.9      | 2.3
and    | 1.5      | 1.0    | 1.3   | 0.5  | 0.6    | 1.5   | 1.3      | 1.6
Comparison: Opcode Frequencies

[Per-class bar charts of the same top 14 opcode frequencies as on the previous slide]

Perform distribution tests for top 14 opcodes on 7 classes of malware:
Rootkit (kernel + user)
Virus and worms
Trojan and tools
Bots

Investigate: Which, if any, opcode frequencies are significantly different for malware?
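The deck does not spell out the exact test behind the z-scores that follow (the references point to Haberman's residual analysis); one simple procedure consistent with z-scores of this kind is a two-proportion z-test of an opcode's share in a malware class against its share in goodware. A minimal sketch with illustrative counts, not the study's data:

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """z-score for the difference of two proportions, pooled standard error.
    x = occurrences of one opcode, n = total opcodes read in that corpus."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))     # standard error under H0
    return (p1 - p2) / se

# e.g. 'mov' share in one malware class vs. the goodware baseline (toy counts)
z = two_proportion_z(30_500, 100_000, 25_300, 100_000)
print(round(z, 1))  # positive => class uses the opcode more than goodware
```

Large |z| values, as in the table that follows, flag opcodes whose frequency deviates significantly from goodware.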
Top 14 Opcode Testing (z-scores)

Opcode | KernelRK | UserRK | Tools | Bot   | Trojan | Virus | Worms
mov    | 36.8     | 20.6   | 2.0   | 70.1  | 28.7   | -27.9 | -20.1
push   | -15.5    | -21.0  | 4.6   | -59.9 | -31.2  | 12.1  | 6.9
call   | -17.0    | 1.2    | 5.2   | 26.0  | 10.6   | 2.6   | -0.3
pop    | -22.0    | -13.5  | 4.9   | 5.1   | 9.8    | 4.8   | -1.1
cmp    | 7.4      | -3.5   | -0.6  | -30.8 | -21.2  | 4.7   | -1.8
jz     | -7.4     | -6.1   | 0.9   | -20.9 | -11.0  | 1.4   | -4.4
lea    | -16.2    | -8.4   | 10.9  | -29.2 | -18.3  | 11.5  | 4.2
test   | -12.2    | 0.0    | -6.6  | -14.6 | 1.8    | -0.2  | -3.4
jmp    | 8.5      | 11.7   | -5.0  | -2.2  | 5.0    | -2.3  | 20.4
add    | 22.9     | 10.8   | -6.4  | -13.5 | -0.1   | 4.3   | 0.5
jnz    | 8.7      | 7.4    | -11.7 | -12.2 | -0.9   | 5.3   | 8.0
retn   | -5.5     | 2.5    | -12.3 | 18.4  | 17.8   | -1.4  | 2.6
xor    | -8.9     | 6.7    | -2.6  | 29.5  | 15.3   | 2.7   | 7.7
and    | 1.9      | -7.3   | -0.7  | -33.6 | -17.0  | 2.4   | 5.9

[Cells shaded by whether the opcode frequency is higher, lower, or similar vs. goodware]

Tests suggest opcode frequency is roughly 1/3 same, 1/3 lower, 1/3 higher vs. goodware
Top 14 Opcodes: Results Interpretation

Cramer's V (in %): Krn 5.2, Usr 5.6, Tools 9.5, Bot 15.0, Trojan 4.0, Virus 6.1, Worm 10.3

Kernel-mode rootkit: most deviations; hand-coded assembly, 'evasive' opcodes?
Virus + worms: few deviations; more jumps; smaller size, simpler malicious function, more control flow?
Tools: (almost) no deviation in top 5 opcodes; more 'benign' (i.e. similar to goodware)?

Most frequent 14 opcodes are a weak predictor: they explain just 5-15% of variation!
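Cramér's V, the effect-size measure quoted above, can be computed from the chi-square statistic of a class-vs-goodware contingency table: V = sqrt(chi2 / (n * (min(r, c) - 1))). A minimal pure-Python sketch with toy counts, not the study's data:

```python
from math import sqrt

def cramers_v(table):
    """Cramer's V for an r x c contingency table given as a list of rows:
    V = sqrt(chi2 / (n * (min(r, c) - 1))), with chi2 the usual
    sum over cells of (observed - expected)^2 / expected."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    chi2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))
    return sqrt(chi2 / (n * (min(len(rows), len(cols)) - 1)))

# 2 x 2 toy table: counts of one opcode vs. all others, class vs. goodware
print(round(cramers_v([[112, 503], [94, 521]]), 3))
```

V ranges from 0 (no association) to 1; the 5-15% figures above correspond to V values of 0.05-0.15.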
Rare 14 Opcodes (parts per million)

[Table of parts-per-million counts for 14 rare opcodes (std, shld, setle, setb, sbb, rdtsc, pushf, nop, int, imul, fstcw, fild, fdivp, bt) across Worms, Virus, Trojan, Bot, Tools, UserRK, KernelRK and Goodware]
Rare 14 Opcode Testing (z-scores)

Opcode | KernelRK | UserRK | Tools | Bot  | Trojan | Virus | Worms
std    | 4.8      | 1.4    | 0.8   | 0.3  | 2.4    | -0.6  | 4.8
shld   | -1.0     | 0.6    | 0.6   | -1.1 | -0.9   | 1.0   | 0.2
setle  | -1.0     | -1.6   | -1.4  | -1.6 | 1.3    | -0.6  | -1.2
setb   | -0.5     | 4.7    | 0.6   | 4.6  | 7.9    | -0.3  | 2.1
sbb    | -6.5     | -2.0   | 3.4   | -2.2 | 0.3    | 0.8   | -2.0
rdtsc  | -0.7     | -1.2   | -1.1  | 1.1  | -0.7   | 3.8   | -0.9
pushf  | -2.4     | -3.7   | -1.8  | -3.9 | -2.2   | -0.7  | -2.6
nop    | -2.3     | -3.6   | -3.2  | -5.0 | -1.6   | 4.5   | -2.3
int    | 45.0     | 26.2   | 28.7  | -1.8 | -1.0   | 2.4   | -1.4
imul   | -3.3     | 1.3    | -5.9  | 4.4  | -1.4   | -1.7  | 0.9
fstcw  | -0.7     | -1.2   | -1.0  | 3.3  | 2.2    | -0.4  | 0.2
fild   | -4.3     | -6.5   | -6.1  | -1.5 | -0.8   | -2.6  | 2.1
fdivp  | -1.3     | -2.2   | -0.3  | 3.8  | 2.8    | -0.8  | 1.3
bt     | -1.2     | -0.4   | 0.7   | 6.6  | 5.9    | -0.7  | 4.8

[Cells shaded by whether the opcode frequency is higher, lower, or similar vs. goodware]

Tests suggest opcode frequency is roughly 7/10 same, 1/10 lower, 1/5 higher vs. goodware
Rare 14 Opcodes: Interpretation

Cramer's V (in %): Krn 12, Usr 10, Tools 16, Bot 17, Trojan 42, Virus 36, Worm 63

NOP: virus makes use of NOP sleds, padding?
INT: rootkits (and tools) make heavy use of software interrupts; a tell-tale sign of RK?

Infrequent 14 opcodes are a much better predictor: they explain 12-63% of variation
Summary: Opcode Distribution

Malware opcode frequency distribution seems to deviate significantly from non-malicious software
'Rare' opcodes explain more frequency variation than common ones
Compare opcode fingerprints against various software classes for quick identification and classification
Opcodes: Further Directions

Acquire more samples and software class differentiation
Investigate sophisticated tests for stronger control of false discovery rate and type I error
Study n-way association with more factors (compiler, type of opcodes, size)
Go beyond isolated opcodes to semantic 'nuggets' (size-wise between isolated opcodes and basic blocks)
Investigate equivalent opcode substitution effects
Related Work

M. Weber (2002): PEAT - A Toolkit for Detecting and Analyzing Malicious Software
R. Chinchani (2005): A Fast Static Analysis Approach to Detect Exploit Code Inside Network Flows
S. Stolfo (2005): Fileprint Analysis for Malware Detection
Fingerprint: Win32 API calls

Goal: Classify malware quickly into a family (a set of variants makes up a family)
Synopsis: Observe and record Win32 API calls made by malicious code during execution, then compare them to calls made by other malicious code to find similarities
Joint work with Chris Ries
Main result: Simple model yields > 80% correct classification; call vectors seem robust towards different packers
Win32 API call: System Overview

[Diagram: Data Collection -> Log File -> Vector Builder -> Database -> Vector Comparison -> Malicious Code Family]

Data Collection: Run malicious code, recording Win32 API calls it makes
Vector Builder: Build count vector from collected API call data and store in database
Comparison: Compare vector to all other vectors in the database to see if it's related to any of them
Win32 API call: Data Collection

[Diagram: malware and logger run in a Win 2000 / WinXP guest under VMWare; a relayer and a fake network (Fake DNS Server, Honeyd on Linux) complete the sandbox]

Malware runs for short period of time on VMWare machine, can interact with fake network
API calls recorded by logger, passed on to relayer
Relayer forwards logs to file, console
Win32 API call: Call Recording

Malicious process is started in suspended state
DLL is injected into the process's address space
When the DLL's DllMain() function is executed, it hooks the Win32 API functions

[Diagram: before hooking, the calling function invokes the target function directly; after hooking, the call goes through the hook, which reaches the target via a trampoline]

The hook records the call's time and arguments, calls the target, records the return value, and then returns the target's return value to the calling function.
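The hook's record-call-return behavior can be illustrated with a Python decorator; this is only a conceptual analogue of the in-process trampoline hook, and `CreateFileA` here is a stand-in stub, not the real Win32 function:

```python
import time

call_log = []

def hook(target):
    """Conceptual analogue of the trampoline hook: record the call's time
    and arguments, invoke the real target, record the return value, then
    hand it back to the caller unchanged."""
    def wrapper(*args):
        entry = {"func": target.__name__, "time": time.time(), "args": args}
        result = target(*args)
        entry["ret"] = result
        call_log.append(entry)
        return result  # caller sees the unmodified return value
    return wrapper

@hook
def CreateFileA(path, mode):
    # Hypothetical stand-in for a hooked Win32 API function
    return 42  # pretend file handle

CreateFileA("C:\\test.txt", "r")
print(call_log[0]["func"], call_log[0]["ret"])  # CreateFileA 42
```

The calling code is unaware of the interception, which is what lets the logger run against unmodified malware.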
Win32 API call: Call Vector

Each column of the vector represents a hooked function and the # of times it was called
1200+ different functions recorded during execution
For each malware specimen, vector values recorded to database

Function Name:   FindClose | FindFirstFileA | CloseHandle | EndPath | ...
Number of Calls: 2         | 6              | 12          | 156     | ...
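Turning a call log into such a count vector is a simple tally; a minimal sketch, assuming a hypothetical log format of one `FunctionName(args)` line per call:

```python
from collections import Counter

def build_vector(log_lines):
    """Count how often each hooked function appears in the log;
    the per-function counts form the specimen's call vector."""
    return Counter(line.split("(")[0] for line in log_lines)

# Hypothetical log excerpt (format assumed for illustration)
log = ["FindFirstFileA(C:\\*)", "FindClose(h)",
       "CloseHandle(h)", "CloseHandle(h2)"]
vec = build_vector(log)
print(vec["CloseHandle"])  # 2
```

Functions never called simply have an implicit count of zero, which matters for the vector comparison on the next slide.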
Win32 API call: Comparison

Compute the cosine similarity measure csm between the vector and each vector in the database:

csm(v1, v2) = (v1 · v2) / (||v1|| ||v2||)

If csm(vector, most similar vector in the database) > threshold, the vector is classified as a member of family_most-similar-vector
Otherwise the vector is classified as a member of family_no-variants-yet
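The csm formula and the threshold rule above can be sketched directly; the family names and counts in `db` are illustrative, not the study's database:

```python
from math import sqrt

def csm(v1, v2):
    """Cosine similarity of two call-count vectors (dicts: function -> count):
    dot product divided by the product of the vector norms."""
    dot = sum(v1.get(k, 0) * v2.get(k, 0) for k in set(v1) | set(v2))
    norm = (sqrt(sum(x * x for x in v1.values()))
            * sqrt(sum(x * x for x in v2.values())))
    return dot / norm if norm else 0.0

def classify(vector, database, threshold=0.8):
    """Assign the family of the most similar stored vector, or
    'no-variants-yet' if nothing exceeds the threshold."""
    family, best = "no-variants-yet", threshold
    for fam, stored in database.items():
        s = csm(vector, stored)
        if s > best:
            family, best = fam, s
    return family

db = {"Netsky": {"CloseHandle": 12, "FindFirstFileA": 6},
      "Beagle": {"send": 40, "connect": 9}}
print(classify({"CloseHandle": 10, "FindFirstFileA": 5}, db))  # Netsky
```

Because cosine similarity compares direction rather than magnitude, a variant that makes proportionally similar calls matches even if it ran for a different length of time.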
Win32 API call: Results

Collected 77 malware samples

Family     | # correct | # of members
Moega      | 3  | 3
SDBot      | 5  | 8
Welchia    | 6  | 6
Pestlogger | 1  | 1
Spybot     | 0  | 1
Randex     | 1  | 2
Sasser     | 2  | 3
Netsky     | 8  | 8
MyLife     | 5  | 5
MyDoom     | 8  | 10
Mitgleider | 2  | 2
Klez       | 1  | 1
Inor       | 0  | 2
Gibe       | 1  | 1
Frethem    | 2  | 3
Blaster    | 1  | 1
Beagle     | 14 | 15
Banker     | 4  | 4
Nibu       | 1  | 1
Tarno      | 2  | 2
Apost      | 0  | 1

Threshold | # correct | %   | miss. fam. | both | false fam.
0.99      | 48        | 62% | 27         | 2    | 0
0.95      | 61        | 79% | 11         | 3    | 2
0.90      | 61        | 79% | 10         | 4    | 1
0.85      | 63        | 82% | 8          | 4    | 2
0.80      | 63        | 82% | 5          | 6    | 3
0.75      | 62        | 80% | 4          | 6    | 5
0.70      | 62        | 80% | 2          | 8    | 5

Classification made by 17 major AV scanners produced 21 families (some aliases)
~80% correct with csm threshold 0.8
Discrepancies
Misclassifications
Win32 API call: Packers

Wide variety of different packers used within same families
Dynamic Win32 API call fingerprint seems robust towards packers
8 Netsky variants in sample, 7 identified

Variant   | Identified Packer
Netsky.Y  | PE Pack
Netsky.S  | PE-Patch, UPX
Netsky.P  | FSG
Netsky.K  | tElock
Netsky.D  | PEtite
Netsky.C  | PEtite
Netsky.B  | UPX
Netsky.AB | PECompact
Summary: Win32 API calls

Simple model yields > 80% correct classification
Resolved discrepancies between some AV scanners
Dynamic API call vectors seem robust towards different packers
Allows researchers and analysts to quickly identify variants reasonably well, without manual analysis
API call: Further Directions

Acquire more malware samples for better variant classification
Explore resiliency to obfuscation techniques (substitutions of Win32 API calls, call spamming)
Investigate patterns of 'call bundles' instead of just isolated calls for richer identification
Replace VSM with a finite state automaton that captures a rich set of call relations
Related Work

R. Sekar et al (2001): A Fast Automaton-Based Method for Detecting Anomalous Program Behaviour
J. Rabek et al (2003): DOME - Detection of Injected, Dynamically Generated, and Obfuscated Malicious Code
K. Rozinov (2005): Efficient Static Analysis of Executables for Detecting Malicious Behaviour
Fingerprint: PDG measures

Goal: Compare 'graph structure' fingerprints of unknown binaries across non-malicious software and malware classes for identification, classification and prediction purposes
Synopsis: Represent binaries as a System Dependence Graph, extract graph features to construct 'graph-structural' fingerprints for particular software classes
Main result: Work in progress
Program Dependence Graph

A PDG models intra-procedural dependences:
Data dependence: program statements compute data that are used by other statements.
Control dependence: arises from the ordered flow of control in a program.

Picture from J. Stafford (Colorado, Boulder); material from CodeSurfer
System Dependence Graph

An SDG models control, data, and call dependences in a program
The SDG of a program is the aggregate of its PDGs, augmented with the calls between functions

Picture from J. Stafford (Colorado, Boulder); material from CodeSurfer
Graph measures as a fingerprint

Binaries represented as a System Dependence Graph (SDG)
Tally distributions on graph measures:
Edge weights (weight is jump distance)
Node weights (weight is number of statements in basic block)
Centrality ("How important is the node?")
Clustering coefficient ("probability of connected neighbours")
Motifs ("recurring patterns")
=> Statistical structural fingerprint

Example of a graph-structural measure: H. Flake (BH 05)
Primer: Graphs

Graphs (networks) are made up of vertices (nodes) and edges
An edge connects two vertices
Nodes and edges can have different weights (b, c)
Edges can have directions (d)

Picture from Mark Newman (2003)
Modeling with Graphs

Category (what interacts) | Examples (nodes, edges)
Biological (biological 'entities' interacting): genetic regulatory network (proteins, dependence); cardiovascular (organs, veins) [also transmission]; food web (predator, prey); neural (neurons, axons)
Social (set of people/groups with some interaction between them): friendship (people, friendship bond); business (companies, business dealings); movies (actors, collaboration); phone calls (number, call)
Physical science (physical 'entities' interacting): chemistry (conformation of polymers, transitions)
Technology / transmission (resource or commodity distribution): power grid (power stations, lines); telephone; Internet (routers, physical links)
Information (information linked together): citation (paper, cited); thesaurus (words, synonym); www (html pages, URL links)
Measures: Centrality

Centrality tries to measure the 'importance' of a vertex
Degree centrality: "How many nodes are connected to me?"
Closeness centrality: "How close am I to all other nodes?"
Betweenness centrality: "How important am I for any two nodes?"
The Freeman metric computes centralization for the entire graph
Measure: Degree Centrality

Degree centrality: "How many nodes are connected to me?"

M_d(n_i) = sum_{j=1, j != i}^{N} edge(n_i, n_j)

Normalized:

M'_d(n_i) = M_d(n_i) / (N - 1)
Measure: Closeness Centrality

Closeness centrality: "How close am I to all other nodes?"

M_c(n_i) = [ sum_{j=1, j != i}^{N} d(n_i, n_j) ]^(-1)

Normalized:

M'_c(n_i) = M_c(n_i) * (N - 1)
Measure: Betweenness Centrality

Betweenness centrality: "How important am I for any two nodes?"

C_B(n_i) = sum_{j < k, j != i != k} g_jk(n_i) / g_jk

(g_jk = number of shortest paths between n_j and n_k; g_jk(n_i) = those passing through n_i)

Normalized:

C'_B(n_i) = C_B(n_i) / [ (N - 1)(N - 2) / 2 ]
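The normalized degree and closeness measures above are easy to compute on an adjacency-list graph; a minimal pure-Python sketch (toy three-node path graph, not an SDG):

```python
from collections import deque

def bfs_dists(adj, src):
    """Shortest-path hop counts from src in an unweighted graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def degree_centrality(adj, n):
    return len(adj[n]) / (len(adj) - 1)        # normalized M'_d

def closeness_centrality(adj, n):
    d = bfs_dists(adj, n)
    return (len(adj) - 1) / sum(d.values())    # normalized M'_c = (N-1)/sum d

# Path graph a-b-c: the middle node is maximally central
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(degree_centrality(adj, "b"), closeness_centrality(adj, "b"))  # 1.0 1.0
```

Betweenness follows the same pattern but needs shortest-path counting over all node pairs, which is why it is the most expensive of the three.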
Local structure: Clusters and Motifs

Graphs can be decomposed into constitutive subgraphs
Subgraph: a subset of nodes of the original graph and of edges connecting them (does not have to contain all the edges of a node)
Cluster: connected subgraph
Motif: recurring subgraph in networks at higher frequencies than expected by random chance
Measure: Clustering Coefficient
Slide from Albert (Penn State)
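The standard definition behind that slide: a node's clustering coefficient is the fraction of its neighbour pairs that are themselves connected, C = 2E / (k(k-1)) for a node with k neighbours and E edges among them. A minimal sketch on a toy graph:

```python
def clustering_coefficient(adj, n):
    """C(n) = (edges among n's neighbours) / (k*(k-1)/2): the probability
    that two neighbours of n are themselves connected."""
    nbrs = adj[n]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

# 'a' has neighbours b, c, d; only the pair b-c is connected (1 of 3 pairs)
adj = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
print(clustering_coefficient(adj, "a"))
```

Averaging C over all nodes gives the graph-level coefficient that would feed into the structural fingerprint.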
Measure: Network Motifs

A motif in a network is a subgraph 'pattern' that recurs in networks at higher frequencies than expected by random chance
A motif may reflect underlying generative processes, design principles and constraints, and the driving dynamics of the network

Motif example: the feed-forward loop (X regulates Y and Z; Y regulates Z)

Found in: biochemistry (transcriptional regulation), neurobiology (neuron connectivity), ecology (food webs), engineering (electronic circuits) ... and maybe computer science (PDGs)?
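Counting a small motif like the feed-forward loop is a direct enumeration over node triples; a brute-force sketch (fine for toy graphs, not for large SDGs):

```python
from itertools import permutations

def count_ffl(edges):
    """Count feed-forward loops (X->Y, X->Z, Y->Z) in a directed graph
    given as a list of (source, target) edges."""
    es = set(edges)
    nodes = {n for e in edges for n in e}
    return sum(1 for x, y, z in permutations(nodes, 3)
               if (x, y) in es and (x, z) in es and (y, z) in es)

# X regulates Y and Z; Y regulates Z  =>  exactly one feed-forward loop
print(count_ffl([("X", "Y"), ("X", "Z"), ("Y", "Z")]))  # 1
```

Motif analysis proper then compares such counts against randomized graphs with the same degree sequence to decide which counts are surprisingly high.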
Graph measures as a fingerprint

Binaries represented as a System Dependence Graph (SDG)
Tally distributions on graph measures:
Edge weights (weight is jump distance)
Node weights (weight is number of statements in basic block)
Centrality ("How important is the node?")
Clustering coefficient ("probability of connected neighbours")
Motifs ("recurring patterns")
=> Statistical structural fingerprint
Summary: SDG measures

Goal: Compare 'graph structure' fingerprints of unknown binaries across non-malicious software and malware classes for identification, classification and prediction purposes
Synopsis: Represent binaries as a System Dependence Graph, extract graph features to construct 'graph-structural' fingerprints for particular software classes
Main result: Work in progress
Related Work
H. Flake (2005): Compare, Port, Navigate
M. Christodorescu (2003): Static Analysis of Executables to Detect Malicious Patterns
Á. Kiss (2005): Using Dynamic Information in the Interprocedural Static Slicing of Binary Executables
References

Statistical testing:
S. Haberman (1973): The Analysis of Residuals in Cross-Classified Tables, pp. 205-220
B.S. Everitt (1992): The Analysis of Contingency Tables (2nd Ed.)

Network graph measures and network motifs:
L. Amaral et al (2000): Classes of Small-World Networks
R. Milo, U. Alon et al (2002): Network Motifs: Simple Building Blocks of Complex Networks
M. Newman (2003): The Structure and Function of Complex Networks
D. Bilar (2006): Science of Networks. http://cs.colby.edu/courses/cs298

System Dependence Graphs:
GrammaTech Inc.: Static Program Dependence Analysis via Dependence Graphs. http://www.codesurfer.com/papers/
Á. Kiss et al (2003): Interprocedural Static Slicing of Binary Executables