Transcript of: Statistical Structures: Fingerprinting Malware for Classification and Analysis
Statistical Structures: Fingerprinting Malware for Classification and Analysis
Daniel Bilar
Wellesley College (Wellesley, MA)
Colby College (Waterville, ME)
bilar <at> alum dot dartmouth dot org
Why Structural Fingerprinting?
Goal: Identifying and classifying malware
Problem: For any single fingerprint, the balance between over-fitting (type II error) and under-fitting (type I error) is hard to achieve
Approach: View binaries simultaneously from different structural perspectives and perform statistical analysis on these 'structural fingerprints'
Different Perspectives

Statistical Fingerprint       | Description                                          | Structural Perspective  | Static/Dynamic?
Opcode frequency distribution | Count different instructions                         | Assembly instruction    | Primarily static
API call vector               | Observe API calls made                               | Win32 API call          | Primarily dynamic
Graph structural properties   | Explore graph-modeled control and data dependencies  | System Dependence Graph | Primarily static
Idea: Multiple perspectives may increase likelihood of correct identification and classification
Fingerprint: Opcode frequency distribution
Synopsis: Statically disassemble the binary, tabulate the opcode frequencies and construct a statistical fingerprint from a subset of said opcodes.
Goal: Compare opcode fingerprints across non-malicious software and malware classes for quick identification and classification purposes.
Main result: 'Rare' opcodes explain more data variation than common ones
Goodware: Opcode Distribution
Procedure:
1. Inventoried PEs (EXE, DLL, etc.) on XP box with Advanced Disk Catalog
2. Chose random EXE samples with MS Excel and Index your Files
3. Ran IDA with modified InstructionCounter plugin on sample PEs
4. Augmented IDA output files with PEiD results (compiler) and general 'functionality class' (e.g. file utility, IDE, network utility, etc.)
5. Wrote Java parser for raw data files and fed JAMA'ed matrix into Excel for analysis
---------.exe  -------.exe  ---------.exe
size: 122880
totalopcodes: 10680
compiler: MS Visual C++ 6.0
class: utility (process)
0001. 002145 20.08% mov
0002. 001859 17.41% push
0003. 000760 7.12% call
0004. 000759 7.11% pop
0005. 000641 6.00% cmp
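The tabulation in steps 3-5 (count each opcode, then normalize into a frequency listing like the one above) can be sketched in a few lines of Python. The `opcode_fingerprint` helper and the toy opcode stream below are illustrative only, not the actual InstructionCounter output format:

```python
from collections import Counter

def opcode_fingerprint(opcodes, top_n=5):
    """Tabulate opcode frequencies and return the top-N counts and shares,
    mirroring the InstructionCounter-style listing above."""
    counts = Counter(opcodes)
    total = sum(counts.values())
    return [(op, n, 100.0 * n / total) for op, n in counts.most_common(top_n)]

# Toy stream of disassembled opcodes (illustrative only)
stream = ["mov"] * 6 + ["push"] * 3 + ["call"]
for op, n, pct in opcode_fingerprint(stream):
    print(f"{n:06d} {pct:5.2f}% {op}")
```

Collecting such vectors per binary, stacked into a matrix, gives the input for the statistical analysis that follows.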
Malware: Opcode Distribution
Procedure:
1. Booted VMPlayer with XP image
2. Inventoried PEs from C. Ries malware collection with Advanced Disk Catalog
3. Fixed 7 classes (e.g. virus, rootkit, etc.), chose random PE samples with MS Excel and Index your Files
4. Ran IDA with modified InstructionCounter plugin on sample PEs
5. Augmented IDA output files with PEiD results (compiler, packer) and 'class'
6. Wrote Java parser for raw data files and fed JAMA'ed matrix into Excel for analysis
Giri.5209  Gobi.a  ---------.b
size: 12288
totalopcodes: 615
compiler: unknown
class: virus
0001. 000112 18.21% mov
0002. 000094 15.28% push
0003. 000052 8.46% call
0004. 000051 8.29% cmp
0005. 000040 6.50% add
AFXRK2K4.root.exe  vanquish.dll
Aggregate (Goodware): Opcode Breakdown

[Pie chart, labelled slices: mov 25%, push 19%, call 9%, pop 6%, cmp 5%, jz 4%, lea 4%, test 3%, jmp 3%, jnz 3%, add 3%, retn 2%, xor 2%, and 1%]

20 EXEs (size-blocked random samples from 538 inventoried EXEs)
~1,520,000 opcodes read
192 out of 420 possible opcodes found
72 opcodes in pie chart account for >99.8%
14 opcodes labelled account for ~90%
Top 5 opcodes account for ~64%
Aggregate (Malware):Opcode Breakdown
sub1%
xor3%
add3%
lea3%
jmp3%
jz4%
cmp4% pop
6% call10%
push16%
mov30%
test3%
retn3%
jnz3%
67 PEs (class-blocked random samples from 250 inventoried PEs)
~665,000 opcodes read
141 out of 420 possible opcodes found (two undocu-mented)
60 opcodes in pie chart account for >99.8%
14 opcodes labelled account for >92%
Top 5 opcodes account for ~65%
67 PEs (class-blocked random samples from 250 inventoried PEs)
~665,000 opcodes read
141 out of 398 possible opcodes found (two undocu-mented)
60 opcodes in pie chart account for >99.8%
14 opcodes labelled account for >92%
Top 5 opcodes account for ~65%
Class-blocked (Malware): Opcode Breakdown Comparison

Aggregate breakdown: mov 30%, push 16%, call 10%, pop 6%, cmp 4%, jz 4%, jmp 4%, lea 3%, add 3%, test 3%, retn 3%, jnz 2%, xor 2%, sub 1%

[Per-class charts compare the aggregate breakdown against: worms, virus, tools, trojans, rootkit (kernel), rootkit (user), bots]
Top 14 Opcodes: Frequency (%)

Opcode | KernelRK | UserRK | Tools | Bot  | Trojan | Virus | Goodware | Worms
mov    | 37.0     | 29.0   | 25.4  | 34.6 | 30.5   | 16.1  | 25.3     | 22.2
push   | 15.6     | 16.6   | 19.0  | 14.1 | 15.4   | 22.7  | 19.5     | 20.7
call   | 5.5      | 8.9    | 8.2   | 11.0 | 10.0   | 9.1   | 8.7      | 8.7
pop    | 2.7      | 5.1    | 5.9   | 6.8  | 7.3    | 7.0   | 6.3      | 6.2
cmp    | 6.4      | 4.9    | 5.3   | 3.6  | 3.6    | 5.9   | 5.1      | 5.0
jz     | 3.3      | 3.9    | 4.3   | 3.3  | 3.5    | 4.4   | 4.3      | 4.0
lea    | 1.8      | 3.3    | 3.1   | 2.6  | 2.7    | 5.5   | 3.9      | 4.2
test   | 1.8      | 3.2    | 3.7   | 2.6  | 3.4    | 3.1   | 3.2      | 3.0
jmp    | 4.1      | 3.8    | 3.4   | 3.0  | 3.4    | 2.7   | 3.0      | 4.5
add    | 5.8      | 3.7    | 3.4   | 2.5  | 3.0    | 3.5   | 3.0      | 3.0
jnz    | 3.7      | 3.1    | 3.4   | 2.2  | 2.6    | 3.2   | 2.6      | 3.2
retn   | 1.7      | 2.3    | 2.9   | 3.0  | 3.2    | 2.0   | 2.2      | 2.3
xor    | 1.1      | 2.3    | 2.1   | 3.2  | 2.7    | 2.1   | 1.9      | 2.3
and    | 1.5      | 1.0    | 1.3   | 0.5  | 0.6    | 1.5   | 1.3      | 1.6
Comparison: Opcode Frequencies

[Per-class bar charts of the same top 14 opcode frequencies as on the previous slide]

Perform distribution tests for top 14 opcodes on 7 classes of malware:
Rootkit (kernel + user)
Virus and worms
Trojan and tools
Bots

Investigate: Which, if any, opcode frequencies are significantly different for malware?
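The deck does not spell out the exact test behind the z-scores that follow (the references point to Haberman's residual analysis); one simple procedure consistent with z-scores of this kind is a two-proportion z-test of an opcode's share in a malware class against its share in goodware. A minimal sketch with illustrative counts, not the study's data:

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """z-score for the difference of two proportions, pooled standard error.
    x = occurrences of one opcode, n = total opcodes read in that corpus."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))     # standard error under H0
    return (p1 - p2) / se

# e.g. 'mov' share in one malware class vs. the goodware baseline (toy counts)
z = two_proportion_z(30_500, 100_000, 25_300, 100_000)
print(round(z, 1))  # positive => class uses the opcode more than goodware
```

Large |z| values, as in the table that follows, flag opcodes whose frequency deviates significantly from goodware.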
Top 14 Opcode Testing (z-scores)

Opcode | KernelRK | UserRK | Tools | Bot   | Trojan | Virus | Worms
mov    | 36.8     | 20.6   | 2.0   | 70.1  | 28.7   | -27.9 | -20.1
push   | -15.5    | -21.0  | 4.6   | -59.9 | -31.2  | 12.1  | 6.9
call   | -17.0    | 1.2    | 5.2   | 26.0  | 10.6   | 2.6   | -0.3
pop    | -22.0    | -13.5  | 4.9   | 5.1   | 9.8    | 4.8   | -1.1
cmp    | 7.4      | -3.5   | -0.6  | -30.8 | -21.2  | 4.7   | -1.8
jz     | -7.4     | -6.1   | 0.9   | -20.9 | -11.0  | 1.4   | -4.4
lea    | -16.2    | -8.4   | 10.9  | -29.2 | -18.3  | 11.5  | 4.2
test   | -12.2    | 0.0    | -6.6  | -14.6 | 1.8    | -0.2  | -3.4
jmp    | 8.5      | 11.7   | -5.0  | -2.2  | 5.0    | -2.3  | 20.4
add    | 22.9     | 10.8   | -6.4  | -13.5 | -0.1   | 4.3   | 0.5
jnz    | 8.7      | 7.4    | -11.7 | -12.2 | -0.9   | 5.3   | 8.0
retn   | -5.5     | 2.5    | -12.3 | 18.4  | 17.8   | -1.4  | 2.6
xor    | -8.9     | 6.7    | -2.6  | 29.5  | 15.3   | 2.7   | 7.7
and    | 1.9      | -7.3   | -0.7  | -33.6 | -17.0  | 2.4   | 5.9

[Cells shaded by whether the opcode frequency is higher, lower, or similar vs. goodware]

Tests suggest opcode frequency is roughly 1/3 same, 1/3 lower, 1/3 higher vs. goodware
Top 14 Opcodes: Results Interpretation

Cramer's V (in %): Krn 5.2, Usr 5.6, Tools 9.5, Bot 15.0, Trojan 4.0, Virus 6.1, Worm 10.3

Kernel-mode rootkit: most deviations; hand-coded assembly, 'evasive' opcodes?
Virus + worms: few deviations; more jumps; smaller size, simpler malicious function, more control flow?
Tools: (almost) no deviation in top 5 opcodes; more 'benign' (i.e. similar to goodware)?

Most frequent 14 opcodes are a weak predictor: they explain just 5-15% of variation!
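Cramér's V, the effect-size measure quoted above, can be computed from the chi-square statistic of a class-vs-goodware contingency table: V = sqrt(chi2 / (n * (min(r, c) - 1))). A minimal pure-Python sketch with toy counts, not the study's data:

```python
from math import sqrt

def cramers_v(table):
    """Cramer's V for an r x c contingency table given as a list of rows:
    V = sqrt(chi2 / (n * (min(r, c) - 1))), with chi2 the usual
    sum over cells of (observed - expected)^2 / expected."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    chi2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2
               / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))
    return sqrt(chi2 / (n * (min(len(rows), len(cols)) - 1)))

# 2 x 2 toy table: counts of one opcode vs. all others, class vs. goodware
print(round(cramers_v([[112, 503], [94, 521]]), 3))
```

V ranges from 0 (no association) to 1; the 5-15% figures above correspond to V values of 0.05-0.15.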
Rare 14 Opcodes (parts per million)

[Table of parts-per-million counts for 14 rare opcodes (std, shld, setle, setb, sbb, rdtsc, pushf, nop, int, imul, fstcw, fild, fdivp, bt) across Worms, Virus, Trojan, Bot, Tools, UserRK, KernelRK and Goodware]
Rare 14 Opcode Testing (z-scores)

Opcode | KernelRK | UserRK | Tools | Bot  | Trojan | Virus | Worms
std    | 4.8      | 1.4    | 0.8   | 0.3  | 2.4    | -0.6  | 4.8
shld   | -1.0     | 0.6    | 0.6   | -1.1 | -0.9   | 1.0   | 0.2
setle  | -1.0     | -1.6   | -1.4  | -1.6 | 1.3    | -0.6  | -1.2
setb   | -0.5     | 4.7    | 0.6   | 4.6  | 7.9    | -0.3  | 2.1
sbb    | -6.5     | -2.0   | 3.4   | -2.2 | 0.3    | 0.8   | -2.0
rdtsc  | -0.7     | -1.2   | -1.1  | 1.1  | -0.7   | 3.8   | -0.9
pushf  | -2.4     | -3.7   | -1.8  | -3.9 | -2.2   | -0.7  | -2.6
nop    | -2.3     | -3.6   | -3.2  | -5.0 | -1.6   | 4.5   | -2.3
int    | 45.0     | 26.2   | 28.7  | -1.8 | -1.0   | 2.4   | -1.4
imul   | -3.3     | 1.3    | -5.9  | 4.4  | -1.4   | -1.7  | 0.9
fstcw  | -0.7     | -1.2   | -1.0  | 3.3  | 2.2    | -0.4  | 0.2
fild   | -4.3     | -6.5   | -6.1  | -1.5 | -0.8   | -2.6  | 2.1
fdivp  | -1.3     | -2.2   | -0.3  | 3.8  | 2.8    | -0.8  | 1.3
bt     | -1.2     | -0.4   | 0.7   | 6.6  | 5.9    | -0.7  | 4.8

[Cells shaded by whether the opcode frequency is higher, lower, or similar vs. goodware]

Tests suggest opcode frequency is roughly 7/10 same, 1/10 lower, 1/5 higher vs. goodware
Rare 14 Opcodes: Interpretation

Cramer's V (in %): Krn 12, Usr 10, Tools 16, Bot 17, Trojan 42, Virus 36, Worm 63

NOP: virus makes use of NOP sleds, padding?
INT: rootkits (and tools) make heavy use of software interrupts; a tell-tale sign of RK?

Infrequent 14 opcodes are a much better predictor: they explain 12-63% of variation
Summary: Opcode Distribution

Malware opcode frequency distribution seems to deviate significantly from non-malicious software
'Rare' opcodes explain more frequency variation than common ones
Compare opcode fingerprints against various software classes for quick identification and classification
Opcodes: Further Directions

Acquire more samples and software class differentiation
Investigate sophisticated tests for stronger control of false discovery rate and type I error
Study n-way association with more factors (compiler, type of opcodes, size)
Go beyond isolated opcodes to semantic 'nuggets' (size-wise between isolated opcodes and basic blocks)
Investigate equivalent opcode substitution effects
Related Work

M. Weber (2002): PEAT - A Toolkit for Detecting and Analyzing Malicious Software
R. Chinchani (2005): A Fast Static Analysis Approach to Detect Exploit Code Inside Network Flows
S. Stolfo (2005): Fileprint Analysis for Malware Detection
Fingerprint: Win32 API calls

Goal: Classify malware quickly into a family (a set of variants makes up a family)
Synopsis: Observe and record Win32 API calls made by malicious code during execution, then compare them to calls made by other malicious code to find similarities
Joint work with Chris Ries
Main result: Simple model yields > 80% correct classification; call vectors seem robust towards different packers
Win32 API call: System Overview

[Diagram: Data Collection -> Log File -> Vector Builder -> Database -> Vector Comparison -> Malicious Code Family]

Data Collection: Run malicious code, recording Win32 API calls it makes
Vector Builder: Build count vector from collected API call data and store in database
Comparison: Compare vector to all other vectors in the database to see if it's related to any of them
Win32 API call: Data Collection

[Diagram: malware and logger run in a Win 2000 / WinXP guest under VMWare; a relayer and a fake network (Fake DNS Server, Honeyd on Linux) complete the sandbox]

Malware runs for short period of time on VMWare machine, can interact with fake network
API calls recorded by logger, passed on to relayer
Relayer forwards logs to file, console
Win32 API call: Call Recording

Malicious process is started in suspended state
DLL is injected into the process's address space
When the DLL's DllMain() function is executed, it hooks the Win32 API functions

[Diagram: before hooking, the calling function invokes the target function directly; after hooking, the call goes through the hook, which reaches the target via a trampoline]

The hook records the call's time and arguments, calls the target, records the return value, and then returns the target's return value to the calling function.
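The hook's record-call-return behavior can be illustrated with a Python decorator; this is only a conceptual analogue of the in-process trampoline hook, and `CreateFileA` here is a stand-in stub, not the real Win32 function:

```python
import time

call_log = []

def hook(target):
    """Conceptual analogue of the trampoline hook: record the call's time
    and arguments, invoke the real target, record the return value, then
    hand it back to the caller unchanged."""
    def wrapper(*args):
        entry = {"func": target.__name__, "time": time.time(), "args": args}
        result = target(*args)
        entry["ret"] = result
        call_log.append(entry)
        return result  # caller sees the unmodified return value
    return wrapper

@hook
def CreateFileA(path, mode):
    # Hypothetical stand-in for a hooked Win32 API function
    return 42  # pretend file handle

CreateFileA("C:\\test.txt", "r")
print(call_log[0]["func"], call_log[0]["ret"])  # CreateFileA 42
```

The calling code is unaware of the interception, which is what lets the logger run against unmodified malware.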
Win32 API call: Call Vector

Each column of the vector represents a hooked function and the # of times it was called
1200+ different functions recorded during execution
For each malware specimen, vector values recorded to database

Function Name:   FindClose | FindFirstFileA | CloseHandle | EndPath | ...
Number of Calls: 2         | 6              | 12          | 156     | ...
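Turning a call log into such a count vector is a simple tally; a minimal sketch, assuming a hypothetical log format of one `FunctionName(args)` line per call:

```python
from collections import Counter

def build_vector(log_lines):
    """Count how often each hooked function appears in the log;
    the per-function counts form the specimen's call vector."""
    return Counter(line.split("(")[0] for line in log_lines)

# Hypothetical log excerpt (format assumed for illustration)
log = ["FindFirstFileA(C:\\*)", "FindClose(h)",
       "CloseHandle(h)", "CloseHandle(h2)"]
vec = build_vector(log)
print(vec["CloseHandle"])  # 2
```

Functions never called simply have an implicit count of zero, which matters for the vector comparison on the next slide.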
Win32 API call: Comparison

Compute the cosine similarity measure csm between the vector and each vector in the database:

csm(v1, v2) = (v1 · v2) / (||v1|| ||v2||)

If csm(vector, most similar vector in the database) > threshold, the vector is classified as a member of family_most-similar-vector
Otherwise the vector is classified as a member of family_no-variants-yet
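The csm formula and the threshold rule above can be sketched directly; the family names and counts in `db` are illustrative, not the study's database:

```python
from math import sqrt

def csm(v1, v2):
    """Cosine similarity of two call-count vectors (dicts: function -> count):
    dot product divided by the product of the vector norms."""
    dot = sum(v1.get(k, 0) * v2.get(k, 0) for k in set(v1) | set(v2))
    norm = (sqrt(sum(x * x for x in v1.values()))
            * sqrt(sum(x * x for x in v2.values())))
    return dot / norm if norm else 0.0

def classify(vector, database, threshold=0.8):
    """Assign the family of the most similar stored vector, or
    'no-variants-yet' if nothing exceeds the threshold."""
    family, best = "no-variants-yet", threshold
    for fam, stored in database.items():
        s = csm(vector, stored)
        if s > best:
            family, best = fam, s
    return family

db = {"Netsky": {"CloseHandle": 12, "FindFirstFileA": 6},
      "Beagle": {"send": 40, "connect": 9}}
print(classify({"CloseHandle": 10, "FindFirstFileA": 5}, db))  # Netsky
```

Because cosine similarity compares direction rather than magnitude, a variant that makes proportionally similar calls matches even if it ran for a different length of time.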
Win32 API call: Results

Collected 77 malware samples

Family     | # correct | # of members
Moega      | 3  | 3
SDBot      | 5  | 8
Welchia    | 6  | 6
Pestlogger | 1  | 1
Spybot     | 0  | 1
Randex     | 1  | 2
Sasser     | 2  | 3
Netsky     | 8  | 8
MyLife     | 5  | 5
MyDoom     | 8  | 10
Mitgleider | 2  | 2
Klez       | 1  | 1
Inor       | 0  | 2
Gibe       | 1  | 1
Frethem    | 2  | 3
Blaster    | 1  | 1
Beagle     | 14 | 15
Banker     | 4  | 4
Nibu       | 1  | 1
Tarno      | 2  | 2
Apost      | 0  | 1

Threshold | # correct | %   | miss. fam. | both | false fam.
0.99      | 48        | 62% | 27         | 2    | 0
0.95      | 61        | 79% | 11         | 3    | 2
0.90      | 61        | 79% | 10         | 4    | 1
0.85      | 63        | 82% | 8          | 4    | 2
0.80      | 63        | 82% | 5          | 6    | 3
0.75      | 62        | 80% | 4          | 6    | 5
0.70      | 62        | 80% | 2          | 8    | 5

Classification made by 17 major AV scanners produced 21 families (some aliases)
~80% correct with csm threshold 0.8
Discrepancies
Misclassifications
Win32 API call: Packers

Wide variety of different packers used within same families
Dynamic Win32 API call fingerprint seems robust towards packers
8 Netsky variants in sample, 7 identified

Variant   | Identified Packer
Netsky.Y  | PE Pack
Netsky.S  | PE-Patch, UPX
Netsky.P  | FSG
Netsky.K  | tElock
Netsky.D  | PEtite
Netsky.C  | PEtite
Netsky.B  | UPX
Netsky.AB | PECompact
Summary: Win32 API calls

Simple model yields > 80% correct classification
Resolved discrepancies between some AV scanners
Dynamic API call vectors seem robust towards different packers
Allows researchers and analysts to quickly identify variants reasonably well, without manual analysis
API call: Further Directions

Acquire more malware samples for better variant classification
Explore resiliency to obfuscation techniques (substitutions of Win32 API calls, call spamming)
Investigate patterns of 'call bundles' instead of just isolated calls for richer identification
Replace VSM with a finite state automaton that captures a rich set of call relations
Related Work

R. Sekar et al (2001): A Fast Automaton-Based Method for Detecting Anomalous Program Behaviour
J. Rabek et al (2003): DOME - Detection of Injected, Dynamically Generated, and Obfuscated Malicious Code
K. Rozinov (2005): Efficient Static Analysis of Executables for Detecting Malicious Behaviour
Fingerprint: PDG measures

Goal: Compare 'graph structure' fingerprints of unknown binaries across non-malicious software and malware classes for identification, classification and prediction purposes
Synopsis: Represent binaries as a System Dependence Graph, extract graph features to construct 'graph-structural' fingerprints for particular software classes
Main result: Work in progress
Program Dependence Graph

A PDG models intra-procedural dependences:
Data dependence: program statements compute data that are used by other statements.
Control dependence: arises from the ordered flow of control in a program.

Picture from J. Stafford (Colorado, Boulder); material from CodeSurfer
System Dependence Graph

An SDG models control, data, and call dependences in a program
The SDG of a program is the aggregate of its PDGs, augmented with the calls between functions

Picture from J. Stafford (Colorado, Boulder); material from CodeSurfer
Graph measures as a fingerprint

Binaries represented as a System Dependence Graph (SDG)
Tally distributions on graph measures:
Edge weights (weight is jump distance)
Node weights (weight is number of statements in basic block)
Centrality ("How important is the node?")
Clustering coefficient ("probability of connected neighbours")
Motifs ("recurring patterns")
=> Statistical structural fingerprint

Example of a graph-structural measure: H. Flake (BH 05)
Primer: Graphs

Graphs (networks) are made up of vertices (nodes) and edges
An edge connects two vertices
Nodes and edges can have different weights (b, c)
Edges can have directions (d)

Picture from Mark Newman (2003)
Modeling with Graphs

Category (what interacts) | Examples (nodes, edges)
Biological (biological 'entities' interacting): genetic regulatory network (proteins, dependence); cardiovascular (organs, veins) [also transmission]; food web (predator, prey); neural (neurons, axons)
Social (set of people/groups with some interaction between them): friendship (people, friendship bond); business (companies, business dealings); movies (actors, collaboration); phone calls (number, call)
Physical science (physical 'entities' interacting): chemistry (conformation of polymers, transitions)
Technology / transmission (resource or commodity distribution): power grid (power stations, lines); telephone; Internet (routers, physical links)
Information (information linked together): citation (paper, cited); thesaurus (words, synonym); www (html pages, URL links)
Measures: Centrality

Centrality tries to measure the 'importance' of a vertex
Degree centrality: "How many nodes are connected to me?"
Closeness centrality: "How close am I to all other nodes?"
Betweenness centrality: "How important am I for any two nodes?"
The Freeman metric computes centralization for the entire graph
Measure: Degree Centrality

Degree centrality: "How many nodes are connected to me?"

M_d(n_i) = sum_{j=1, j != i}^{N} edge(n_i, n_j)

Normalized:

M'_d(n_i) = M_d(n_i) / (N - 1)
Measure: Closeness Centrality

Closeness centrality: "How close am I to all other nodes?"

M_c(n_i) = [ sum_{j=1, j != i}^{N} d(n_i, n_j) ]^(-1)

Normalized:

M'_c(n_i) = M_c(n_i) * (N - 1)
Measure: Betweenness Centrality

Betweenness centrality: "How important am I for any two nodes?"

C_B(n_i) = sum_{j < k, j != i != k} g_jk(n_i) / g_jk

(g_jk = number of shortest paths between n_j and n_k; g_jk(n_i) = those passing through n_i)

Normalized:

C'_B(n_i) = C_B(n_i) / [ (N - 1)(N - 2) / 2 ]
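The normalized degree and closeness measures above are easy to compute on an adjacency-list graph; a minimal pure-Python sketch (toy three-node path graph, not an SDG):

```python
from collections import deque

def bfs_dists(adj, src):
    """Shortest-path hop counts from src in an unweighted graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def degree_centrality(adj, n):
    return len(adj[n]) / (len(adj) - 1)        # normalized M'_d

def closeness_centrality(adj, n):
    d = bfs_dists(adj, n)
    return (len(adj) - 1) / sum(d.values())    # normalized M'_c = (N-1)/sum d

# Path graph a-b-c: the middle node is maximally central
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(degree_centrality(adj, "b"), closeness_centrality(adj, "b"))  # 1.0 1.0
```

Betweenness follows the same pattern but needs shortest-path counting over all node pairs, which is why it is the most expensive of the three.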
Local structure: Clusters and Motifs

Graphs can be decomposed into constitutive subgraphs
Subgraph: a subset of nodes of the original graph and of edges connecting them (does not have to contain all the edges of a node)
Cluster: connected subgraph
Motif: recurring subgraph in networks at higher frequencies than expected by random chance
Measure: Clustering Coefficient
Slide from Albert (Penn State)
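The standard definition behind that slide: a node's clustering coefficient is the fraction of its neighbour pairs that are themselves connected, C = 2E / (k(k-1)) for a node with k neighbours and E edges among them. A minimal sketch on a toy graph:

```python
def clustering_coefficient(adj, n):
    """C(n) = (edges among n's neighbours) / (k*(k-1)/2): the probability
    that two neighbours of n are themselves connected."""
    nbrs = adj[n]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

# 'a' has neighbours b, c, d; only the pair b-c is connected (1 of 3 pairs)
adj = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
print(clustering_coefficient(adj, "a"))
```

Averaging C over all nodes gives the graph-level coefficient that would feed into the structural fingerprint.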
Measure: Network Motifs

A motif in a network is a subgraph 'pattern' that recurs in networks at higher frequencies than expected by random chance
A motif may reflect underlying generative processes, design principles and constraints, and the driving dynamics of the network

Motif example: the feed-forward loop (X regulates Y and Z; Y regulates Z)

Found in: biochemistry (transcriptional regulation), neurobiology (neuron connectivity), ecology (food webs), engineering (electronic circuits) ... and maybe computer science (PDGs)?
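Counting a small motif like the feed-forward loop is a direct enumeration over node triples; a brute-force sketch (fine for toy graphs, not for large SDGs):

```python
from itertools import permutations

def count_ffl(edges):
    """Count feed-forward loops (X->Y, X->Z, Y->Z) in a directed graph
    given as a list of (source, target) edges."""
    es = set(edges)
    nodes = {n for e in edges for n in e}
    return sum(1 for x, y, z in permutations(nodes, 3)
               if (x, y) in es and (x, z) in es and (y, z) in es)

# X regulates Y and Z; Y regulates Z  =>  exactly one feed-forward loop
print(count_ffl([("X", "Y"), ("X", "Z"), ("Y", "Z")]))  # 1
```

Motif analysis proper then compares such counts against randomized graphs with the same degree sequence to decide which counts are surprisingly high.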
Graph measures as a fingerprint

Binaries represented as a System Dependence Graph (SDG)
Tally distributions on graph measures:
Edge weights (weight is jump distance)
Node weights (weight is number of statements in basic block)
Centrality ("How important is the node?")
Clustering coefficient ("probability of connected neighbours")
Motifs ("recurring patterns")
=> Statistical structural fingerprint
Summary: SDG measures

Goal: Compare 'graph structure' fingerprints of unknown binaries across non-malicious software and malware classes for identification, classification and prediction purposes
Synopsis: Represent binaries as a System Dependence Graph, extract graph features to construct 'graph-structural' fingerprints for particular software classes
Main result: Work in progress
Related Work
H. Flake (2005): Compare, Port, Navigate
M. Christodorescu (2003): Static Analysis of Executables to Detect Malicious Patterns
Á. Kiss (2005): Using Dynamic Information in the Interprocedural Static Slicing of Binary Executables
References

Statistical testing:
S. Haberman (1973): The Analysis of Residuals in Cross-Classified Tables, pp. 205-220
B.S. Everitt (1992): The Analysis of Contingency Tables (2nd Ed.)

Network graph measures and network motifs:
L. Amaral et al (2000): Classes of Small-World Networks
R. Milo, U. Alon et al (2002): Network Motifs: Simple Building Blocks of Complex Networks
M. Newman (2003): The Structure and Function of Complex Networks
D. Bilar (2006): Science of Networks. http://cs.colby.edu/courses/cs298

System Dependence Graphs:
GrammaTech Inc.: Static Program Dependence Analysis via Dependence Graphs. http://www.codesurfer.com/papers/
Á. Kiss et al (2003): Interprocedural Static Slicing of Binary Executables