Decompilers and beyond
Hex-Rays
Ilfak Guilfanov
2(c) 2008 Hex-Rays SA
Presentation Outline
Why do we need decompilers?
  Complexity must be justified
Typical decompiler design
  There are some misconceptions
Decompiler based analysis
  New analysis types and tools become possible
Future
  “...is bright and sunny”
Your feedback
Online copy of this presentation is available at http://www.hex-rays.com/idapro/ppt/decompilers_and_beyond.ppt
Disassemblers
We need disassemblers to analyze binary code
Simple disassemblers produce a listing with instructions
Better disassemblers assist in analysis by annotating the code, providing good navigation, etc. You know the difference.
Even the ideal disassembler stays at low level: the output is an assembler listing
The main output of a disassembler is still a one-to-one mapping of opcodes to instruction mnemonics
No leverage, no abstractions, little insight
The analyst must mentally map assembly instructions to higher level abstractions and concepts
A boring and routine task after a while
Disassembler limitations
The output is
  Boring
  Inhuman
  Repetitive
  Error prone
  Requires special skills
  Did I say repetitive?
Yet some geeks like it?...
Decompilers
The need:
  Software grows like gas
  Time spent on analysis skyrockets
  Malware proliferates and mutates
We need better tools to handle this
Decompilation is the next logical step, yet a tough one
Building ideal decompiler
The answer is clear and easy to give: ideal decompilers do not exist
It is customary to compare compilers and decompilers:
  Preprocessing
  Lexical analysis
  Syntax analysis
  Code generation
  Optimization
This comparison is correct but superficial
Compilers are privileged
Strictly defined input language
  Anything nonconforming – spit out an error message
Reasonable amount of information on all functions, variables, types, etc.
The output may be ugly
  Who will ever read it but some geeks? :)
Machine code decompilers are impossible
Informal and sometimes hostile input
Many problems are unsolved or proved to be unsolvable in general
The output is examined in detail by a human being, any suboptimality is noticed because it annoys the analyst
Conclusion: robust decompilers are impossible
What if we address the common cases? For example, if we cover 90%, will the rest be handled manually?
Easy for humans, hard for computers
In fact, many (all?) problems encountered during decompilation are hard
For every problem, there is a naïve solution, which, unfortunately, does not work
Just a few examples...
Function calls are a problem
Function calls require answering the following questions:
  Where does the function expect its input registers?
  Where does it return the result?
  What registers or memory cells does it spoil?
  How does it change the stack pointer?
  Does it return to the caller or somewhere else?
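To make the stack pointer question concrete, here is a minimal sketch that sums the effect of a few 32-bit x86 instructions on esp. The tuple-based instruction format and the name stack_delta are assumptions made for this example, not part of any real disassembler API.

```python
# Hypothetical instruction format: (mnemonic, operand, ...) tuples, 32-bit x86.
def stack_delta(instructions):
    """Return the net change of esp, in bytes, over a straight-line sequence."""
    delta = 0
    for insn in instructions:
        op = insn[0]
        if op == "push":
            delta -= 4                      # push moves esp down by one dword
        elif op == "pop":
            delta += 4                      # pop moves it back up
        elif op in ("add", "sub") and insn[1] == "esp":
            amount = insn[2]
            delta += amount if op == "add" else -amount
        elif op == "ret":
            # 'ret N' pops the return address plus N bytes of arguments
            delta += 4 + (insn[1] if len(insn) > 1 else 0)
    return delta

# A callee that removes 8 bytes of arguments itself (stdcall style).
callee = [("push", "ebp"), ("sub", "esp", 16), ("add", "esp", 16),
          ("pop", "ebp"), ("ret", 8)]
print(stack_delta(callee))  # 12: return address (4) plus arguments (8)
```

This is the naive solution: it only holds for straight-line code whose instructions have known effects. An indirect call to a function with an unknown convention is enough to break it.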
Function return values are a problem
Does the function return anything?
How big is the return value?
Function input arguments are a problem
When a register is accessed, it can be
  To save its value
  To allocate the stack frame
  Used as a function argument
Indirect jumps are a problem
Indirect jumps are used for switch idioms and tail calls
Recognizing them is necessary to build the control flow graph
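One common switch idiom is a bounds check followed by a jump through a table. The sketch below matches only that single shape; the tuple format, the "jmp_indirect" pseudo-mnemonic and the name match_switch are assumptions for illustration. Real compilers emit many variants, which is why this is heuristic territory.

```python
# Hypothetical decoded instructions: (mnemonic, operands...) tuples.
def match_switch(insns):
    """Recognize 'cmp reg, N / ja default / jmp [table + reg*4]'.

    Returns (reg, case_count, table, default), or None if the pattern
    does not match.
    """
    if len(insns) != 3:
        return None
    cmp_, ja, jmp = insns
    if cmp_[0] != "cmp" or ja[0] != "ja" or jmp[0] != "jmp_indirect":
        return None
    if len(jmp) != 4:
        return None
    reg, bound = cmp_[1], cmp_[2]
    table, index_reg, scale = jmp[1], jmp[2], jmp[3]
    if index_reg != reg or scale != 4:
        return None          # the jump must be indexed by the checked register
    # 'ja' lets the values 0..bound through, so the table has bound + 1 entries
    return reg, bound + 1, table, ja[1]

idiom = [("cmp", "eax", 5),
         ("ja", 0x401100),
         ("jmp_indirect", 0x402000, "eax", 4)]
print(match_switch(idiom))
```

Once the table address and the case count are known, the table entries become the successor edges of the block in the control flow graph.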
Problems, problems, problems...
Save-restore (push/pop) pairs
Partial register accesses (al/ah/ax/eax)
64-bit arithmetic
Compiler idioms
Variable live ranges (for stack variables)
Lost type information
Pointers vs. numbers
Virtual functions
Recursive functions
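To illustrate the partial register problem from the list above: a write to al changes ax and eax but not ah, so data flow analysis has to reason about bit ranges rather than register names. A minimal sketch, with the table SUBREGS and the function overlaps as hypothetical names:

```python
# (parent register, low bit, high bit exclusive) for the eax family.
SUBREGS = {
    "eax": ("eax", 0, 32),
    "ax":  ("eax", 0, 16),
    "al":  ("eax", 0, 8),
    "ah":  ("eax", 8, 16),
}

def overlaps(a, b):
    """True if writing register a can change the value read through register b."""
    base_a, lo_a, hi_a = SUBREGS[a]
    base_b, lo_b, hi_b = SUBREGS[b]
    return base_a == base_b and lo_a < hi_b and lo_b < hi_a

print(overlaps("al", "ah"))   # False: bits 0..7 and 8..15 are disjoint
print(overlaps("ah", "ax"))   # True
print(overlaps("al", "eax"))  # True
```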
Hopeless situation?
Well, yes and no
While a fully automatic decompiler capable of handling arbitrary input is impossible, approximate solutions exist
We could start with a “simple” case:
  Compiler generated output (no hostile adversary generating increasingly complex input)
  Only 32-bit code
  No floating point, exception handling and other fancy stuff
Basic ideas
Make some configurable assumptions about the input (calling conventions, stack frames, memory model, etc)
Use sound theoretical approach to solvable problems (data flow analysis on registers, peephole optimization within basic blocks, instruction simplification, etc)
Use heuristics for unsolvable problems (indirect jumps, function prolog/epilogs, call arguments)
Prefer to generate ugly but correct output rather than nice but incorrect code
Let the user guide the decompilation in difficult cases (specify indirect call targets, function prototypes, etc)
Interactivity is necessary to achieve good results
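As an illustration of instruction simplification within a basic block, the sketch below applies a few local rewrites to a list of (op, dst, src) tuples. The format and the name simplify_block are assumptions for this example. It deliberately ignores condition codes, which a real implementation cannot do, since the dropped instructions also set flags.

```python
def simplify_block(block):
    """Apply a few local rewrites to one basic block of (op, dst, src) tuples.

    Simplification made for this sketch: flag side effects are ignored.
    """
    out = []
    for op, dst, src in block:
        if op in ("xor", "sub") and dst == src:
            out.append(("mov", dst, 0))   # r ^ r and r - r are always 0
        elif op in ("add", "sub", "or", "xor") and src == 0:
            continue                      # leaves dst unchanged, drop it
        elif op == "mov" and dst == src:
            continue                      # self-move does nothing
        else:
            out.append((op, dst, src))
    return out

block = [("xor", "eax", "eax"), ("add", "ebx", 0),
         ("mov", "ecx", "ecx"), ("add", "eax", "ebx")]
print(simplify_block(block))
```

Each rewrite is correct on its own terms, which fits the "ugly but correct" preference: when a pattern is not recognized, the instruction is passed through unchanged.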
Decompiler architecture
Overall, it could look like this:
Disassembler: read input file, decode instructions and divide into functions
Microgen: translate decoded instructions to microcode; handle all platform specific aspects
Kernel: decompiler core engine
Add-ons: decompiler based analysis tools, plugins, visualizers, etc
Decompilation phases - 1
Microcode generation: Analyze function prolog and epilog, switch idioms, verify the function
Local optimization: Simplify instructions, propagate expressions, determine block types and control graph edges
Global optimization: Globally propagate expressions, delete dead code, resolve memory references, analyze call instructions, determine input/output registers of the function
Local variable allocation: Determine variable live ranges and their sizes, get rid of all stack and register references, schedule instruction combinations, assign simple types to all variables
continued...
Decompilation phases - 2
Structural analysis: Analyze control flow graph and create while/if/switch and other constructs
Pseudocode generation: Based on the microcode and structural analysis results, generate output text
Pseudocode transformation: Massage the output to make it more readable, create for-loops, remove superfluous gotos, create break/continue, add/remove casts, etc
Type analysis: Analyze pseudocode, build type equations and solve them, modify variable types
Final touch: Rename variables, create va_list, etc
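The type analysis phase mentions building type equations and solving them. As a rough illustration only, the sketch below solves the simplest kind of equation, "these two things have the same type", with a union-find structure. The naming convention (v followed by digits is a variable, anything else is a concrete type) and the function solve_types are assumptions for this example; real type recovery also has to deal with pointers, structures, sizes and signedness.

```python
def solve_types(equations):
    """Solve equations (a, b) meaning 'a and b have the same type'.

    Returns {variable: type or None}; raises TypeError on a conflict.
    """
    parent = {}

    def is_var(x):
        return x[0] == "v" and x[1:].isdigit()

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in equations:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        if not is_var(ra) and not is_var(rb):
            raise TypeError(f"conflict: {ra} vs {rb}")
        # Keep a concrete type, if any, as the representative of the class.
        if is_var(ra):
            parent[ra] = rb
        else:
            parent[rb] = ra

    result = {}
    for x in parent:
        if is_var(x):
            root = find(x)
            result[x] = None if is_var(root) else root
    return result

eqs = [("v1", "int"), ("v2", "v1"), ("v3", "char*"), ("v4", "v5")]
print(solve_types(eqs))
```

Variables that never meet a concrete type stay unresolved (None), which is one place where interactive type adjustments by the user come in.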
Microcode – just generated
It is very detailed
Redundant
One basic block at a time
After local optimization
This is much better
Please note that the condition codes are still present because they might be used by other blocks
Use-def lists are calculated dynamically
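Whether a condition code set in one block is read by another block is a liveness question, and it can only be answered by looking across blocks. A minimal sketch of the textbook backward liveness iteration follows; the block representation and the name live_out are assumptions for this example, not the decompiler's actual data structures.

```python
def live_out(blocks, succs):
    """Backward liveness analysis.

    blocks: {name: (use, defs)}, where 'use' is the set of registers read
            before any write in the block and 'defs' the set written.
    succs:  {name: [successor names]}.
    Returns {name: set of registers live at the end of the block}.
    """
    live_in = {b: set() for b in blocks}
    out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, (use, defs) in blocks.items():
            new_out = set()
            for s in succs.get(b, []):
                new_out |= live_in[s]
            new_in = use | (new_out - defs)
            if new_out != out[b] or new_in != live_in[b]:
                out[b], live_in[b] = new_out, new_in
                changed = True
    return out

# b1 sets eax and the zero flag; b2 branches on zf; b3 reads eax.
blocks = {"b1": (set(), {"eax", "zf"}),
          "b2": ({"zf"}, set()),
          "b3": ({"eax"}, set())}
succs = {"b1": ["b2"], "b2": ["b3"]}
print(sorted(live_out(blocks, succs)["b1"]))  # ['eax', 'zf']
```

Looking at b1 alone, nothing reads zf, so a purely local pass has to keep it. Only the result for the whole graph says whether it can go.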
After global optimization
Condition codes are gone
The LDX instruction got propagated to jz and all references to eax are gone
Note that the jz target has changed (@3) since global optimization removed some unused code and blocks
We are ready for local variable allocation
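Why a jump target changes when blocks are removed can be shown with a small sketch: drop the blocks that are not reachable from the entry, renumber the rest compactly, and rewrite the edges. The list-of-successors representation and the name remove_unreachable are assumptions for this example.

```python
def remove_unreachable(succs, entry=0):
    """succs[i] is the list of successor block numbers of block i.

    Drops blocks not reachable from entry and renumbers the rest.
    Returns (new_succs, old_to_new).
    """
    reachable, stack = set(), [entry]
    while stack:
        b = stack.pop()
        if b not in reachable:
            reachable.add(b)
            stack.extend(succs[b])
    kept = sorted(reachable)
    old_to_new = {old: new for new, old in enumerate(kept)}
    new_succs = [[old_to_new[s] for s in succs[old]] for old in kept]
    return new_succs, old_to_new

# Blocks 2 and 3 are never reached from block 0.
cfg = [[1, 4], [4], [3], [4], []]
new_cfg, mapping = remove_unreachable(cfg)
print(new_cfg)   # [[1, 2], [2], []]
print(mapping)   # {0: 0, 1: 1, 4: 2}
```

The jump that used to target block 4 now targets block 2, although the code it reaches is the same.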
After local variable allocation
All registers have been replaced by local variables (ecx0, esi1; except ds)
Use-def lists are useless now but we do not need them anymore
Now we will perform structural analysis and create pseudocode
Graph structure as a tree
Structural analysis extracts the standard control flow constructs from CFG
The result is a tree similar to the one below. It will be used to generate pseudocode
The structural analysis algorithm is robust and can handle any graphs, including irreducible ones
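Once the control flow graph has been reduced to a tree, generating text from it is a straightforward recursive walk. The sketch below shows only that last step, not the structural analysis itself; the node shapes and the function emit are hypothetical and chosen for this example.

```python
def emit(node, indent=0):
    """Render a structure tree as C-like text, one string per line.

    Node shapes (all hypothetical):
      ("block", [statement strings])
      ("seq", [nodes])
      ("if", cond, then_node, else_node_or_None)
      ("while", cond, body_node)
    """
    pad = "  " * indent
    kind = node[0]
    if kind == "block":
        return [pad + s + ";" for s in node[1]]
    if kind == "seq":
        return [line for child in node[1] for line in emit(child, indent)]
    if kind == "if":
        lines = [pad + f"if ( {node[1]} ) {{"] + emit(node[2], indent + 1)
        if node[3] is not None:
            lines += [pad + "} else {"] + emit(node[3], indent + 1)
        return lines + [pad + "}"]
    if kind == "while":
        return ([pad + f"while ( {node[1]} ) {{"]
                + emit(node[2], indent + 1) + [pad + "}"])
    raise ValueError(f"unknown node kind: {kind}")

tree = ("seq", [
    ("block", ["i = 0"]),
    ("while", "i < n", ("seq", [
        ("if", "a[i] < 0", ("block", ["a[i] = 0"]), None),
        ("block", ["++i"]),
    ])),
])
print("\n".join(emit(tree)))
```

Graphs that do not fit any of the standard constructs would need an extra node kind carrying gotos, which this sketch leaves out.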
Interactive operation allows us to fine tune it
Final result after some renamings and type adjustments:
The initial assembly is too long to be displayed on a slide
Pseudocode is much shorter and more readable
What decompilation gives us
Obvious benefits
  Saves time
  Eliminates routine tasks
  Makes source code recovery easier (...)
New things
  Next abstraction level - closer to application domain
  Data flow based tools (vulnerability scanner, anyone? :)
  Binary translation
Base to build on...
To be useful and make other tools possible, decompiler must have a programmable API
It already exists but it needs some refinement
  Microcode is not accessible yet
Decompiler is retargetable (x86 now, ARM will be next)
Both interactive and batch modes are possible
In addition to being a tool to examine binaries, decompiler could be used for...
...program verification
Well, “verification” won't be strict but it can help to spot interesting locations in the code:
  Missing return value validations (e.g. for NULL pointers)
  Missing input value validations
  Taint analysis
  Insecure code patterns
  Uninitialized variables
  etc.
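As a sketch of the first item, a missing NULL check on a return value, the code below scans a tiny linear IR for results of malloc or fopen that are dereferenced before being checked. The IR tuples and the name unchecked_results are hypothetical and do not correspond to any existing decompiler API. The scan is linear and ignores control flow, so it is a locator of interesting places, not a proof.

```python
def unchecked_results(statements, may_return_null=("malloc", "fopen")):
    """Find variables dereferenced before any NULL check.

    statements: tuples in a hypothetical linear IR:
      ("call", result_var, callee), ("check", var), ("deref", var)
    """
    unchecked, findings = set(), []
    for stmt in statements:
        if stmt[0] == "call":
            # A new assignment replaces whatever was known about the variable.
            unchecked.discard(stmt[1])
            if stmt[2] in may_return_null:
                unchecked.add(stmt[1])
        elif stmt[0] == "check":
            unchecked.discard(stmt[1])
        elif stmt[0] == "deref" and stmt[1] in unchecked:
            findings.append(stmt[1])
    return findings

code = [("call", "p", "malloc"), ("deref", "p"),
        ("call", "f", "fopen"), ("check", "f"), ("deref", "f")]
print(unchecked_results(code))  # ['p']
```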
...assembly listing improvement
Hardcore users who prefer to work with assembly instructions can benefit from data flow analysis results
Hover the mouse over a register or data to get:
  Its possible values or value ranges
  Locations where it is defined
  Locations where it is used
Highlight definitions or uses of the current register in two different colors
Show list of indirect call targets, calling conventions, etc
Gray out dead instructions
Determine if a value comes from a system call (ReadFile)
etc...
...more insight into the application domain
One could reconstruct data types used by the application
In fact, serious reverse engineering is impossible without knowing data types (...)
Fortunately API already exposes all necessary information for type handling
Plenty of work ahead
...more abstract representations
Tools to build more abstract representations
  Function clustering (think of modules or libraries)
  Global data flow diagrams (functions exposed to tainted data in red)
  Statistical analysis of pseudocode
  C++ template detection, generic code detection
...binary code comparison
You know better than me the possible applications
To find code plagiarisms
To detect changes between program versions
To find library functions (high-gear FLIRT)
etc... (you know better than me :)
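One simple way to compare functions at the pseudocode level is to normalize away what differs between builds and then compare hashes. The sketch below renames identifiers of the form v1, a2, etc. in order of first appearance and collapses whitespace. Both the normalization and the name fingerprint are assumptions for this example; a practical tool would need a far more tolerant comparison.

```python
import hashlib
import re

def fingerprint(pseudocode):
    """Hash pseudocode after renaming v<N>/a<N> identifiers by first use."""
    names = {}

    def rename(match):
        # The same original name always maps to the same canonical name.
        return names.setdefault(match.group(0), f"id{len(names)}")

    text = re.sub(r"\b[av]\d+\b", rename, pseudocode)
    text = " ".join(text.split())          # collapse all whitespace
    return hashlib.sha256(text.encode()).hexdigest()

old = "v3 = a1 + 1;\nreturn v3;"
new = "v7  =  a2 + 1; return v7;"
print(fingerprint(old) == fingerprint(new))  # True: same code, other names
```

Two functions with equal fingerprints are identical up to variable numbering. Unequal fingerprints say nothing about how different they are, which is where smarter comparison would start.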
Back to the earth
The tools and possibilities described on the previous slides do not exist yet
Yet they become possible thanks to decompilation
We have a long way to go
  More processors and platforms
  Floating point calculations
  Exception handling
  Type recovery
  Handling hostile code
  In fact, too many ideas to enumerate them here
The future is bright... is it?...