Post on 25-Feb-2016
Eureka: A Framework for Enabling Static Analysis on Malware
MARS.MTC.SRI.COM
Motivation
  Malware landscape is diverse and constantly evolving
    Large botnets
    Diverse propagation vectors, exploits, C&C
    Capabilities: backdoor, keylogging, rootkits, logic bombs, time bombs
  Malware is not about script kiddies anymore; it's real business.
  Manual reverse engineering is close to impossible
    Need automated techniques to extract system logic, interactions, and side effects
Dynamic vs Static Malware Analysis
  Dynamic Analysis
    Techniques that profile actions of binary at runtime
    Better track record to date
      CWSandbox, TTAnalyze
    Only provides a partial "effects-oriented profile" of malware potential
  Static Analysis
    Can provide complementary insights
    Potential for more comprehensive assessment
Malware Evasions and Obfuscations
  To defeat signature-based detection schemes
    Polymorphism, metamorphism: started appearing in viruses of the 90's, primarily to defeat AV tools
  To defeat dynamic malware analysis
    Anti-debugging, anti-tracing, anti-memory dumping
    VMM detection, emulator detection
  To defeat static malware analysis
    Encryption (packing)
    API and control-flow obfuscations
    Anti-disassembly
System Goals
Desiderata for a Static Analysis Framework
  Unpack over 90% of contemporary malware
  Handle most if not all packers
  Deobfuscate API references
  Automate identification of capabilities
  Provide feedback on unpacking success
  Simplify and annotate call graphs to illustrate interactions between key logical blocks
The Eureka Framework
  Novel unpacking technique based on coarse-grained execution tracing
    Heuristic-based and statistic-based unpacking
  Implements several techniques to handle obfuscated API references
  Multiple metrics to evaluate unpack success
  Annotated call graphs provide bird's eye view of system interaction
The Eureka Workflow
  [Figure: Eureka workflow diagram. Box and arrow labels, in reading order:]
    Packed Binary
    Trace Malware syscalls in VM
    Syscall trace
    Heuristic-based offline analysis
    Favorable execution point
    Eureka's Unpacker
    Statistics-based Evaluator
    Unpacked Binary
    Disassembly (IDA Pro): Packed .ASM, Unpacked .ASM
    Unpack Evaluation
    Eureka's API Resolver (Control and Data-flow Analysis)
    Un-obfuscated .ASM
    Detailed call graph
    Annotated Call Graphs
Coarse-grained Execution Monitoring
  Generalized unpacking principle
    Execute binary till it has sufficiently revealed itself
    Dump the process execution image for static analysis
  Monitoring execution progress
    Eureka employs a Windows driver that hooks the SSDT (System Service Dispatch Table)
    Callback invoked on each NTDLL system call
    Filtering based on malware process pid
Related Work
  PolyUnpack (Royal et al., ACSAC 2006)
    Static model using program analysis
    Fine-grained execution tracking detects execution steps outside the model
  Renovo (Kang et al., WORM 2007)
    Fine-grained execution tracking using QEMU
    Dumping trigger: execution of newly written code
  OmniUnpack (Martignoni et al., ACSAC 2007)
    Coarse-grained monitoring using page-level protection mechanisms
Design Space
System      Environment  Granularity  Trigger               Child process monitoring  Output layers  Speed  Evasions
PolyUnpack  Inside VM    Instruction  Model                 No                        1              Slow   1, 2, 3
Renovo      Outside VM   Instruction  Heuristic             Yes                       Many           Slow   2, 4
OmniUnpack  Inside VM    Page         Heuristic             No                        Many           Fast   2, 3
Eureka      Inside VM    System call  Heuristic, Statistic  Yes                       1, Many        Fast   2, 3
Evasions: (1) multiple packing, (2) partial code-revealing packers, (3) VM detection, (4) emulator detection
Heuristic-based Unpacking
  How do you determine when to dump?
    Heuristic #1: Dump as late as possible. NtTerminateProcess
    Heuristic #2: Dump when your program generates errors. NtRaiseHardError
    Heuristic #3: Dump when program forks a child process. NtCreateProcess
  Issues
    Weak adversarial model, too simple to evade…
    Doesn't work well for packed non-malware programs
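The three heuristics above can be sketched as a scan over a recorded syscall trace. This is only an illustration: the trace format (a list of `(pid, syscall_name)` pairs), the function name `find_dump_point`, and the choice to dump at the first trigger seen are assumptions, not details taken from Eureka itself.

```python
from typing import Optional, Sequence, Tuple

# Syscalls named on the slide as dump triggers.
DUMP_TRIGGERS = ("NtTerminateProcess", "NtRaiseHardError", "NtCreateProcess")


def find_dump_point(trace: Sequence[Tuple[int, str]], malware_pid: int) -> Optional[int]:
    """Return the index of the first trigger syscall issued by the monitored pid.

    Assumption: the process image is dumped just before this syscall completes.
    Events from other processes are ignored (pid-based filtering).
    """
    for index, (pid, syscall) in enumerate(trace):
        if pid == malware_pid and syscall in DUMP_TRIGGERS:
            return index
    return None


# Hypothetical trace: pid 100 is the malware, pid 200 is unrelated.
trace = [
    (100, "NtOpenFile"),
    (200, "NtTerminateProcess"),
    (100, "NtCreateProcess"),
    (100, "NtTerminateProcess"),
]
dump_index = find_dump_point(trace, malware_pid=100)
```

Because the scan only looks for three well-known syscalls, a packer that never exits, never faults, and never spawns a child gives it nothing to trigger on, which is the weakness listed under Issues.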
Statistics-based Unpacking
  Observations
    Statistical properties of a packed executable differ from those of an unpacked executable
    As malware executes, code-to-data ratio increases
  Complications
    Code and data sections are interleaved in PE executables
    Data directories (import tables) look similar to data but are often found in code sections
    Properties of data sections vary with packers
Statistics-based Unpacking (2)
  Our Approach
    Model statistical properties of unpacked code
    Volume of unpacked code must strictly increase
  Estimating unpacked code
    N-gram analysis to look for frequent instructions
    We use bi-grams (2-grams) because x86 opcodes are 1 or 2 bytes
    Extract subroutine code from 9 benign executables
    FF 15 (call), FF 75 (push), E8 _ _ _ FF (call), E8 _ _ _ 00 (call)
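A minimal sketch of counting the four byte patterns in a memory image follows. On x86, `FF 15` encodes an indirect `call [mem32]`, `FF 75` encodes `push [ebp+disp8]`, and `E8` is a relative call whose 32-bit little-endian displacement ends in `FF` for short backward targets and `00` for short forward targets. The function name and the simple sliding-window scan are assumptions; the slide does not say how Eureka scans the image.

```python
from typing import Dict


def count_code_patterns(image: bytes) -> Dict[str, int]:
    """Count occurrences of the four byte patterns listed on the slide."""
    counts = {"FF 15": 0, "FF 75": 0, "E8 _ _ _ FF": 0, "E8 _ _ _ 00": 0}
    for i in range(len(image) - 1):
        first = image[i]
        if first == 0xFF:
            if image[i + 1] == 0x15:
                counts["FF 15"] += 1
            elif image[i + 1] == 0x75:
                counts["FF 75"] += 1
        elif first == 0xE8 and i + 4 < len(image):
            # Fifth byte is the most significant byte of the call displacement.
            if image[i + 4] == 0xFF:
                counts["E8 _ _ _ FF"] += 1
            elif image[i + 4] == 0x00:
                counts["E8 _ _ _ 00"] += 1
    return counts


# Hand-built byte string containing each pattern exactly once.
sample = bytes([
    0xFF, 0x15, 0x00, 0x00, 0x00, 0x00,   # call [mem32]
    0xE8, 0x10, 0x00, 0x00, 0x00,         # call +0x10
    0xE8, 0xF0, 0xFF, 0xFF, 0xFF,         # call -0x10
    0xFF, 0x75, 0x08,                     # push [ebp+8]
])
sample_counts = count_code_patterns(sample)
```

The scan does not disassemble anything, so it also counts these byte sequences when they occur by chance inside data; the approach relies on them being far more frequent in real code.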
Statistics-based Unpacking (3)
Executable (size)   FF 15 (call)  FF 75 (push)  E8 _ _ _ 0xff (call)  E8 _ _ _ 0x00 (call)
Calc (117 KB)       246           235           1583                  746
Explorer (1010 KB)  3045          2494          2201                  1091
Ipconfig (59 KB)    184           272           181                   152
lpr (11 KB)         24            33            19                    62
Mshearts (131 KB)   192           274           369                   641
Notepad (72 KB)     415           254           180                   108
Ping (21 KB)        58            41            87                    57
Shutdown (23 KB)    132           63            49                    66
Taskman (19 KB)     126           85            41                    50
Statistics-based Unpacking (4)
  Feasibility test
    Corpus of (pre- and post-unpacked) executables unpacked with heuristic unpacking
    1090 executables: 125 originally unpacked, 965 unpacked by the heuristic unpacker
    Simple bi-gram counting was able to distinguish 922 out of 965 unpacked executables (95% success rate)
STOP Algorithm
  STOP: Statistical Test for Online unPacking
    Online algorithm for determining dumping trigger
    Simple hypothesis test for change in mean
  Null hypothesis: mean bigram count has not increased
    Assumption: bigram counts are normally distributed with prior mean μ0.
    If (μ1 - μ0) / σ1 > 1.645, we reject the null hypothesis with confidence level of 0.95.
  Test is repeated to determine beginning of unpacking and end of unpacking.
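The test statistic above can be sketched in a few lines. This follows the formula exactly as the slide states it; treating μ1 and σ1 as the sample mean and sample standard deviation of a recent window of bigram counts is an assumption, as are the function name and the windowing.

```python
import statistics
from typing import Sequence

Z_CRITICAL = 1.645  # one-sided critical value quoted on the slide (0.95 level)


def mean_has_increased(prior_mean: float, recent_counts: Sequence[float]) -> bool:
    """Reject the null hypothesis 'mean bigram count has not increased'.

    Assumption: mu1 and sigma1 are estimated from recent_counts, which must
    hold at least two observations.
    """
    mu1 = statistics.mean(recent_counts)
    sigma1 = statistics.stdev(recent_counts)
    if sigma1 == 0:
        # Degenerate window: fall back to a direct comparison of means.
        return mu1 > prior_mean
    return (mu1 - prior_mean) / sigma1 > Z_CRITICAL


# Hypothetical bigram counts observed at successive syscalls.
window = [100, 102, 98, 101, 99]
started_unpacking = mean_has_increased(prior_mean=10, recent_counts=window)
still_rising = mean_has_increased(prior_mean=99, recent_counts=window)
```

Running the same check repeatedly, first against the packed baseline and later against the post-increase mean, is one way to read the slide's "beginning of unpacking and end of unpacking".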
API Resolution
  User-level malware programs require system calls to perform malicious actions
    Use Win32 API to access user-level libraries
  Obfuscations impede malware analysis using IDA Pro or OllyDbg
    Packers use non-standard linking and loading of DLLs
    Obfuscated API resolution
Standard API Resolution
  API Calls
    Calls to various user-level DLLs linked by the Windows linker/loader
    Legitimate executables have an import table
    Import table is used to fill up the IAT with virtual addresses at run-time
    Imports in the IAT are identified by IDA by looking at the import table
  [Figure: dynamic linking diagram. Labels:]
    CALL F ; call by thunk
    CALL [X] ; indirect call
    CALL X
    F: JMP [X] ; thunk
    Imports: X KERNEL32.OpenFile
    IAT (Import Address Table), entry X: B+R
    KERNEL32.DLL, starting at B:
    Exports: OpenFile R
    R: Entrypoint to OpenFile
    Dynamic linking
API Obfuscation by Packers
  Import table is removed
  IAT is not filled in by the linker and loader
  Unpacker fills in IAT or similar data structure by itself
  Hard to identify corresponding API call in executable
  [Figure: linking diagram. Labels:]
    CALL F
    CALL X
    F: JMP [X] ; thunk
    Imports: X KERNEL32.OpenFile
    IAT (Import Address Table), entry X: B+R
    KERNEL32.DLL, starting at B:
    Exports: OpenFile R
    R: Entrypoint to OpenFile
Identifying APIs by Address
  For each DLL, build a relative and absolute address database
    Default "Image address" is the base address
    Calculate corresponding virtual address for each exported API
  Match addresses used in calls with the database
  [Figure: linking diagram with concrete addresses. Labels:]
    CALL [X]
    CALL X
    Imports: X KERNEL32.OpenFile
    IAT (Import Address Table), entry X: 7c810332
    KERNEL32.DLL, starting at 7c800000:
    Exports: OpenFile R
    7c810332: Entrypoint to OpenFile
    Dynamic linking
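The address database can be sketched as a plain dictionary from absolute virtual address to API name. The base address `7c800000` and the entry point `7c810332` come from the slide; the function names and the export-table input format are assumptions made for illustration.

```python
from typing import Dict, Optional


def build_address_db(dll_name: str, image_base: int, exports: Dict[str, int]) -> Dict[int, str]:
    """Map each exported API's absolute address (base + RVA) to 'DLL.Name'."""
    return {image_base + rva: f"{dll_name}.{api}" for api, rva in exports.items()}


def resolve_call_target(db: Dict[int, str], address: int) -> Optional[str]:
    """Look up an address found in a call operand or IAT-like slot."""
    return db.get(address)


# KERNEL32 loaded at its default image address, OpenFile at RVA 0x10332.
db = build_address_db("KERNEL32", 0x7C800000, {"OpenFile": 0x10332})
resolved = resolve_call_target(db, 0x7C810332)
unknown = resolve_call_target(db, 0x00401000)
```

Keeping the RVAs alongside the absolute addresses lets the same export data be rebased when a DLL is found at a non-default address.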
Handling DLL Load Obfuscations
  Intercept dynamic loading at arbitrary addresses
    Look for "NtOpenSection" and "NtMapViewOfSection" in trace
    Search for DLL headers in memory during dumping
  Can even identify DLL code that is copied to an arbitrary location
  [Figure: linking diagram with a relocated DLL. Labels:]
    CALL F
    CALL X
    Imports: X KERNEL32.OpenFile
    IAT (Import Address Table), entry X: 21810332
    KERNEL32.DLL, starting at RVA:00000:
    Exports: OpenFile R
    RVA:10332: Entrypoint to OpenFile
    Dynamic linking
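Searching a memory dump for DLL headers can be sketched by looking for the DOS `MZ` magic and checking that the `e_lfanew` field at offset 0x3C points to a `PE\0\0` signature. The function name and the flat `bytes` view of the dump are assumptions; how Eureka walks process memory is not stated on the slide.

```python
import struct
from typing import List


def find_pe_headers(memory: bytes) -> List[int]:
    """Return offsets where a DOS header is followed by a valid PE signature."""
    offsets = []
    start = memory.find(b"MZ")
    while start != -1:
        if start + 0x40 <= len(memory):
            # e_lfanew: 4-byte little-endian offset of the PE header.
            (e_lfanew,) = struct.unpack_from("<I", memory, start + 0x3C)
            pe_offset = start + e_lfanew
            if memory[pe_offset:pe_offset + 4] == b"PE\x00\x00":
                offsets.append(start)
        start = memory.find(b"MZ", start + 1)
    return offsets


# Hypothetical dump: a stray "MZ" string at 0x10 and a real header at 0x100.
dump = bytearray(0x200)
dump[0x10:0x12] = b"MZ"
dump[0x100:0x102] = b"MZ"
struct.pack_into("<I", dump, 0x100 + 0x3C, 0x80)
dump[0x180:0x184] = b"PE\x00\x00"
headers = find_pe_headers(bytes(dump))
```

Once a header is found at some base, exported APIs can be matched as base plus RVA, which is consistent with the slide's example of RVA 10332 appearing in the IAT as 21810332.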
Handling Thunks
  Identify subroutines with a JMP instruction only
  Treat any calls to these subs as an API call
  [Figure: example, labeled IsDebuggerPresent]
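A minimal sketch of thunk detection, assuming the thunk is the 6-byte x86 form `FF 25 imm32` (`JMP [mem32]`) and that resolved IAT slots are available as a dictionary. The function name, the slot address `0x41E000`, and the dictionary format are hypothetical.

```python
import struct
from typing import Dict, Optional


def thunk_target(subroutine: bytes, iat: Dict[int, str]) -> Optional[str]:
    """Return the API name if the subroutine is exactly one `JMP [mem32]`."""
    if len(subroutine) == 6 and subroutine[:2] == b"\xFF\x25":
        # Operand is the 4-byte little-endian address of the pointer slot.
        (slot,) = struct.unpack_from("<I", subroutine, 2)
        return iat.get(slot)
    return None


# Hypothetical resolved slot and a thunk that jumps through it.
iat = {0x41E000: "KERNEL32.IsDebuggerPresent"}
thunk = b"\xFF\x25" + struct.pack("<I", 0x41E000)
api_name = thunk_target(thunk, iat)
not_a_thunk = thunk_target(b"\x55\x8B\xEC", iat)
```

Any call whose target is such a subroutine can then be annotated with the returned API name.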
Using Dataflow Analysis
  Identify register based indirect calls
  [Figure labels: GetEnvironmentStringW, use, def]
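Resolving a register-based indirect call amounts to walking back from the use (the `call reg`) to the nearest def of that register. The sketch below does this for straight-line code only; the tuple instruction format, the function name, and the slot name `dword_41e2f0` are assumptions, and a real analysis would have to follow control flow across basic blocks.

```python
from typing import Dict, List, Optional, Tuple

Instr = Tuple[str, str, str]  # (mnemonic, destination, source)


def resolve_register_call(block: List[Instr], call_index: int,
                          iat: Dict[str, str]) -> Optional[str]:
    """Find the def that reaches `call reg` and map its source slot to an API."""
    mnemonic, reg, _ = block[call_index]
    if mnemonic != "call":
        return None
    # Walk backwards to the nearest instruction that writes the register.
    for op, dst, src in reversed(block[:call_index]):
        if dst == reg:
            return iat.get(src) if op == "mov" else None
    return None


# Hypothetical block: esi is loaded from a resolved pointer slot, then called.
iat = {"dword_41e2f0": "KERNEL32.GetEnvironmentStringW"}
block = [
    ("mov", "esi", "dword_41e2f0"),   # def
    ("push", "", "eax"),
    ("call", "esi", ""),              # use
]
target = resolve_register_call(block, 2, iat)
```

If the def loads from a slot with no static value, the lookup returns `None`, which is the case that requires searching for an earlier write to that slot.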
Handling Dynamic Pointer Updates
  Identify register based indirect calls
  dword_41e304 has no static value to look up API
  A def to dword_41e308 is found
  Look for probable call to GetProcAddress earlier
  [Figure labels: use, def, Call to GetProcAddress]
Evaluation Metrics
  Measuring analyzability
    Code-to-data ratio
      Use disassembler to separate code and data.
      Most successfully unpacked malware have code-to-data ratio over 50%
    API resolution success
      Percentage of API calls that have been resolved from the set of all call sites.
      A higher percentage implies the malware is more amenable to static analysis.
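Both metrics reduce to simple fractions. The sketch below assumes the "code-to-data ratio" is reported as the share of code bytes among all disassembled bytes, which is one reading of "over 50%"; the slide does not define it precisely, and the function names are hypothetical.

```python
def code_fraction(code_bytes: int, data_bytes: int) -> float:
    """Share of bytes the disassembler classified as code (assumed definition)."""
    total = code_bytes + data_bytes
    return code_bytes / total if total else 0.0


def api_resolution_rate(resolved_calls: int, total_call_sites: int) -> float:
    """Fraction of call sites whose API target was resolved."""
    return resolved_calls / total_call_sites if total_call_sites else 0.0


# Hypothetical unpacked sample: 60 KB code, 40 KB data, 99 of 100 calls resolved.
looks_unpacked = code_fraction(60_000, 40_000) > 0.5
resolution = api_resolution_rate(99, 100)
```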
Graph Generation
  Call graph simplification
    Most malware contain hundreds of functions
    Remove nodes without APIs, connecting inbound and outbound edges
  Micro-ontology labeling
    Bird's eye view of malware instance
    Translate API functions into categories based on functionality
    Categories based on Microsoft's classifications
    Common Filesystem, Random, Time, Registry, Socket, File Management
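The simplification step can be sketched on an adjacency-set call graph: every function that calls no API is removed and its callers are wired directly to its callees. The graph representation, the `keep` set (functions that call APIs, plus any node such as the entry point that should survive), and the function name are assumptions.

```python
from typing import Dict, Set


def simplify_call_graph(graph: Dict[str, Set[str]], keep: Set[str]) -> Dict[str, Set[str]]:
    """Remove nodes not in `keep`, connecting their callers to their callees.

    Assumption: every node appears as a key in `graph`.
    """
    simplified = {node: set(callees) for node, callees in graph.items()}
    for node in list(simplified):
        if node in keep:
            continue
        callees = simplified.pop(node) - {node}
        for caller, targets in simplified.items():
            if node in targets:
                targets.discard(node)
                # Bypass the removed node without introducing self-loops.
                targets |= callees - {caller}
    return simplified


# Hypothetical graph: only main and send_data are worth keeping.
graph = {
    "main": {"f1"},
    "f1": {"f2", "f3"},
    "f2": {"send_data"},
    "f3": set(),
    "send_data": set(),
}
small = simplify_call_graph(graph, keep={"main", "send_data"})
```

After simplification the remaining nodes can be labeled by API category to give the bird's eye view described above.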
Storm Worm Case Study
Storm Worm: Bird's Eye View (Semi-manually generated)
Storm Worm Case Study (2)
Control Flow Graph: eDonkey Handler
Eureka Ontology Graph
Experimental Evaluation
  Evaluation using three different datasets
    Goat (packed benign executable) dataset
      15 common packers
      Provides ground truth for what packer is used and what is expected after unpacking
    Spam malware corpus
    Honeynet malware corpus
Goat Dataset
Packer      PolyUnpack  Renovo   Eureka   Eureka-API
Armadillo   No          Partial  Yes      64%
ASPack      Partial     Yes      Yes      99%
ASProtect   Partial     Yes      No       -
ExeCryptor  Yes         Partial  Yes      2%
ExeStealth  No          Yes      Yes      97%
FSG         Yes         Yes      Yes      0%
MEW         Yes         Yes      Yes      97%
MoleBox     No          Yes      Yes      98%
Morphine    Yes         Partial  Yes      0%
Obsidium    No          No       Yes      99%
PeCompact   No          Yes      Yes      99%
Themida     No          Partial  Partial  -
UPX         Yes         Yes      Yes      99%
WinUPack    Partial     Yes      Yes      99%
Yoda        Partial     Partial  Yes      97%
Evaluation (ASPack)
Evaluation (MoleBox)
Evaluation (Armadillo)
Spam Corpus Evaluation
  Evaluation of a corpus of 481 executables
    Binaries collected at spam traps
  470 executables successfully unpacked (over 97% success)
    401 unpacked simply using heuristic unpacker
    Rest unpacked using statistical hypothesis test
  Most API references were successfully deobfuscated
Spam Corpus Evaluation (2)
Packer     Count  Eureka  Eureka-API
Unknown    186    184     85%
UPX        134    132     78%
Virus      79     79      79%
PEX        18     18      58%
MEW        12     11      70%
Rest (10)  52     46      83%
Spam Corpus Evaluation (3)
Virus Family  Count  Eureka  Eureka-API
TRSmall       98     98      93%
TRDldr        63     61      48%
Bagle         67     67      84%
Mydoom        45     44      99%
Klez          77     77      78%
Rest (39)     131    123     78%
Honeynet Corpus Evaluation
  Evaluation of a corpus of 435 executables
    Binaries collected at SRI honeynet
    178 out of 435 packed with Themida (only partially analyzable)
  Analysis of the 257 non-Themida binaries
    20 did not execute on Win XP
    Eureka unpacks 228 / 237 remaining binaries
    High API resolution rates on unpacked binaries
Honeynet Corpus Evaluation (2)
Packer    Count*  Eureka  Eureka-API
PolyEne   109     109     97%
FSG       36      35      94%
Unknown   33      29      67%
ASPack    23      22      93%
tELock    9       9       91%
Rest (9)  27      24      62%
*Includes all binaries except those packed with Themida
Honeynet Corpus Evaluation (3)
Virus Family  Count*  Eureka  Eureka-API
Korgo         70      70      86%
Virut         24      24      90%
Padobot       21      21      82%
Sality        17      17      96%
Parite        15      15      96%
Rest (19)     90      81      90%
*Includes all binaries except those packed with Themida
Runtime Performance