Decompilers and beyond
Hex-Rays
Ilfak Guilfanov

2(c) 2008 Hex-Rays SA

Presentation Outline

- Why do we need decompilers? Complexity must be justified
- Typical decompiler design: there are some misconceptions
- Decompiler based analysis: new analysis types and tools become possible
- Future: “...is bright and sunny”
- Your feedback

An online copy of this presentation is available at http://www.hex-rays.com/idapro/ppt/decompilers_and_beyond.ppt

Disassemblers

- We need disassemblers to analyze binary code
- Simple disassemblers produce a listing with instructions
- Better disassemblers assist in analysis by annotating the code, providing good navigation, etc. You know the difference.
- Even the ideal disassembler stays at a low level: the output is an assembler listing
- The main output of a disassembler is still a one-to-one mapping of opcodes to instruction mnemonics
- No leverage, no abstractions, little insight
- The analyst must mentally map assembly instructions to higher level abstractions and concepts
- A boring and routine task after a while

Disassembler limitations

The output is:
- Boring
- Inhuman
- Repetitive
- Error prone
- Requires special skills
- Did I say repetitive?

Yet some geeks like it?...

Decompilers

The need:
- Software grows like gas
- Time spent on analysis skyrockets
- Malware proliferates and mutates

We need better tools to handle this.
Decompilation is the next logical step, yet a tough one.

Building the ideal decompiler

The answer is clear and easy to give: ideal decompilers do not exist.
It is customary to compare compilers and decompilers:

- Preprocessing
- Lexical analysis
- Syntax analysis
- Code generation
- Optimization

This comparison is correct but superficial

Compilers are privileged

- Strictly defined input language
  - Anything nonconforming: spit out an error message
- Reasonable amount of information on all functions, variables, types, etc.
- The output may be ugly
  - Who will ever read it but some geeks? :)

Machine code decompilers are impossible

- Informal and sometimes hostile input
- Many problems are unsolved or proved to be unsolvable in general
- The output is examined in detail by a human being; any suboptimality is noticed because it annoys the analyst

Conclusion: robust decompilers are impossible

What if we address the common cases? For example, if we cover 90%, will the rest be handled manually?

Easy for humans, hard for computers

- In fact, many (all?) problems encountered during decompilation are hard
- For every problem, there is a naïve solution, which, unfortunately, does not work
- Just a few examples...

Function calls are a problem

Function calls require answering the following questions:
- Where does the function expect its input registers?
- Where does it return the result?
- What registers or memory cells does it spoil?
- How does it change the stack pointer?
- Does it return to the caller or somewhere else?

Function return values are a problem

- Does the function return anything?
- How big is the return value?

Function input arguments are a problem

A register access can serve different purposes:
- Saving the register's value
- Allocating the stack frame
- Passing a function argument

Indirect accesses are a problem

- Pointer aliases
- No precise object boundaries

Indirect jumps are a problem

- Indirect jumps are used for switch idioms and tail calls
- Recognizing them is necessary to build the control flow graph
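To make the switch idiom concrete, here is a minimal sketch of recognizing one common shape: a bounds check followed by a jump through a table. The tuple-based instruction format, the `jmp_table` pseudo-opcode and the function name are assumptions made for this illustration; they are not the decompiler's actual representation, and real compilers emit many more variants of the idiom.

```python
def recover_switch(block, jump_table):
    """Return (case_targets, default_target) if the block ends in the
    pattern  cmp reg, N / ja default / jmp table[reg] , else None.

    block: list of toy instruction tuples (hypothetical format).
    jump_table: list of addresses read from the binary at the table address.
    """
    if len(block) < 3:
        return None
    cmp_insn, ja_insn, jmp_insn = block[-3:]
    if cmp_insn[0] != "cmp" or ja_insn[0] != "ja" or jmp_insn[0] != "jmp_table":
        return None
    _, index_reg, max_case = cmp_insn
    _, default_target = ja_insn
    _, table_reg = jmp_insn
    # The register being bounds-checked must be the one indexing the table.
    if index_reg != table_reg:
        return None
    n_cases = max_case + 1
    if len(jump_table) < n_cases:
        return None
    return jump_table[:n_cases], default_target
```

The recovered targets become ordinary edges of the control flow graph; without them the graph would end at the indirect jump.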

Problems, problems, problems...

- Save-restore (push/pop) pairs
- Partial register accesses (al/ah/ax/eax)
- 64-bit arithmetic
- Compiler idioms
- Variable live ranges (for stack variables)
- Lost type information
- Pointers vs. numbers
- Virtual functions
- Recursive functions
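As an illustration of why partial register accesses complicate data flow analysis, the sketch below models al/ah/ax/eax as bit ranges of one 32-bit register and asks two questions an analysis has to answer: do two accesses touch common bits, and does a write fully redefine what a later read sees? The table and function names are assumptions made for this example.

```python
# Bit ranges (lowest bit, width) inside the 32-bit EAX register on x86.
SUBREGS = {"eax": (0, 32), "ax": (0, 16), "al": (0, 8), "ah": (8, 8)}

def overlaps(a, b):
    """True if the two register names share at least one bit."""
    lo_a, w_a = SUBREGS[a]
    lo_b, w_b = SUBREGS[b]
    return lo_a < lo_b + w_b and lo_b < lo_a + w_a

def kills(write, read):
    """True if a write to `write` redefines every bit read through `read`."""
    lo_w, w_w = SUBREGS[write]
    lo_r, w_r = SUBREGS[read]
    return lo_w <= lo_r and lo_r + w_r <= lo_w + w_w
```

A write to al followed by a read of eax is the awkward case: the read depends on both the new low byte and whatever defined the upper 24 bits earlier.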

Hopeless situation?

- Well, yes and no
- While a fully automatic decompiler capable of handling arbitrary input is impossible, approximate solutions exist
- We could start with a “simple” case:
  - Compiler generated output (no hostile adversary generating increasingly complex input)
  - Only 32-bit code
  - No floating point, exception handling and other fancy stuff

Basic ideas

- Make some configurable assumptions about the input (calling conventions, stack frames, memory model, etc.)
- Use a sound theoretical approach to solvable problems (data flow analysis on registers, peephole optimization within basic blocks, instruction simplification, etc.)
- Use heuristics for unsolvable problems (indirect jumps, function prologs/epilogs, call arguments)
- Prefer to generate ugly but correct output rather than nice but incorrect code
- Let the user guide the decompilation in difficult cases (specify indirect call targets, function prototypes, etc.)
- Interactivity is necessary to achieve good results
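A peephole optimization within a basic block can be sketched in a few lines. The three-tuple instruction format and the specific rules are assumptions for illustration only. The sketch also ignores condition codes: on real x86, `add r, 0` still sets flags, so such a rule is only safe when the flags are known to be unused.

```python
def peephole(block):
    """Simplify a basic block given as a list of (op, dst, src) tuples.

    Assumption: condition codes are dead, so flag side effects are ignored.
    """
    out = []
    for op, dst, src in block:
        if op == "mov" and dst == src:
            continue                      # mov r, r does nothing
        if op in ("add", "sub") and src == 0:
            continue                      # adding or subtracting zero
        if op == "xor" and dst == src:
            out.append(("mov", dst, 0))   # xor r, r is the idiom for r = 0
            continue
        out.append((op, dst, src))
    return out
```

Each rule looks at one instruction only, which is what makes this kind of rewriting a solvable, local problem in contrast to the heuristic cases listed above.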

Decompiler architecture

Overall, it could look like this:

- Disassembler: read input file, decode instructions and divide into functions
- Microgen: translate decoded instructions to microcode; handle all platform specific aspects
- Kernel: decompiler core engine
- Add-ons: decompiler based analysis tools, plugins, visualizers, etc.

Decompilation phases - 1

- Microcode generation: analyze function prolog and epilog, switch idioms, verify the function
- Local optimization: simplify instructions, propagate expressions, determine block types and control graph edges
- Global optimization: globally propagate expressions, delete dead code, resolve memory references, analyze call instructions, determine input/output registers of the function
- Local variable allocation: determine variable live ranges and their sizes, get rid of all stack and register references, schedule instruction combinations, assign simple types to all variables

continued...

Decompilation phases - 2

- Structural analysis: analyze control flow graph and create while/if/switch and other constructs
- Pseudocode generation: based on the microcode and structural analysis results, generate output text
- Pseudocode transformation: massage the output to make it more readable, create for-loops, remove superfluous gotos, create break/continue, add/remove casts, etc.
- Type analysis: analyze pseudocode, build type equations and solve them, modify variable types
- Final touch: rename variables, create va_list, etc.
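The slides do not say what the type equations look like. As a rough illustration of the idea of building constraints and solving them, the sketch below uses the simplest possible kind: "these two variables have the same type" and "this variable has a known type", solved with a union-find structure. The function name, the constraint format and the "unknown" placeholder are all assumptions.

```python
def solve_types(equalities, known):
    """equalities: list of (var_a, var_b) pairs that must share a type.
    known: dict mapping some variables to a type name.
    Returns a dict with a type (or "unknown") for every variable seen."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for a, b in equalities:
        parent[find(a)] = find(b)           # merge the two equivalence classes

    types = {}
    for var, ty in known.items():
        root = find(var)
        if types.setdefault(root, ty) != ty:
            raise ValueError(f"conflicting types for {var}")

    return {v: types.get(find(v), "unknown") for v in list(parent)}
```

For example, if `v1 = v2` and `v2 = v3` appear in the pseudocode and `v3` is passed to a function whose prototype says `char *`, all three variables receive that type. A conflict raises an error here; an interactive tool would more plausibly insert a cast or ask the user.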

Microcode – just generated

- It is very detailed
- Redundant
- One basic block at a time

After preoptimization

After local optimization

- This is much better
- Please note that the condition codes are still present because they might be used by other blocks
- Use-def lists are calculated dynamically

After global optimization

- Condition codes are gone
- The LDX instruction got propagated to jz and all references to eax are gone
- Note that the jz target has changed (@3) since global optimization removed some unused code and blocks
- We are ready for local variable allocation
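The disappearance of unused condition codes can be illustrated with a backward dead code elimination pass over a single straight-line block. This is a minimal sketch under strong assumptions: instructions are `(dst, sources)` pairs with no side effects (no stores, no calls), and the set of values live at the end of the block is given. The names are hypothetical and the real microcode is far richer.

```python
def eliminate_dead(block, live_out):
    """Keep only instructions whose result is used later.

    block: list of (dst, sources) pairs in execution order.
    live_out: names still needed after the block ends.
    """
    live = set(live_out)
    kept = []
    for dst, srcs in reversed(block):
        if dst in live:
            live.discard(dst)     # this definition satisfies the later uses
            live.update(srcs)     # and its own operands become live
            kept.append((dst, srcs))
    kept.reverse()
    return kept
```

After local optimization the live-out set has to be conservative, which is why flags survive; once global analysis shows that no other block reads them, their definitions simply drop out.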

After local variable allocation

- All registers have been replaced by local variables (ecx0, esi1; except ds)
- Use-def lists are useless now but we do not need them anymore
- Now we will perform structural analysis and create pseudocode
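The core idea of replacing registers by variables is that each new definition of a register starts a new live range and therefore a new variable. The sketch below shows this for one straight-line block only; the `(dst, sources)` format and the numbering scheme for names are assumptions, and handling several blocks requires merging live ranges with data flow analysis, which is omitted here.

```python
def rename_registers(block):
    """Give every definition of a register its own variable name.

    block: list of (dst, sources) pairs in execution order.
    Sources never defined in the block (e.g. arguments) keep their names.
    """
    counter = 0
    current = {}   # register -> variable currently holding its value
    out = []
    for dst, srcs in block:
        new_srcs = tuple(current.get(s, s) for s in srcs)
        var = f"{dst}{counter}"
        counter += 1
        current[dst] = var
        out.append((var, new_srcs))
    return out
```

Reusing eax for two unrelated values then yields two unrelated variables, which is what a reader of the pseudocode expects.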

Control graphs

[Figure: original graph view and the corresponding control flow graph]

Graph structure as a tree

- Structural analysis extracts the standard control flow constructs from the CFG
- The result is a tree similar to the one below. It will be used to generate pseudocode
- The structural analysis algorithm is robust and can handle any graphs, including irreducible ones
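A toy version of structural analysis repeatedly collapses small patterns in the CFG until a single tree node remains. The sketch below knows only three patterns (sequence, if-then, self-loop) and, unlike the algorithm described on the slide, gives up on anything else, including irreducible graphs. The graph encoding and the tuple-based tree are assumptions made for this example.

```python
def structure(succ, entry):
    """Collapse a CFG into a nested tuple tree, or return None on failure.

    succ: dict mapping each node to the list of its successors.
    """
    succ = {n: list(s) for n, s in succ.items()}
    tree = {n: n for n in succ}          # node -> subtree built so far

    def preds(n):
        return [p for p in succ if n in succ[p]]

    changed = True
    while changed:
        changed = False
        for a in list(succ):
            if a not in succ:            # already merged into another node
                continue
            out = succ[a]
            if a in out:                 # self-loop: a -> a
                tree[a] = ("loop", tree[a])
                succ[a] = [s for s in out if s != a]
                changed = True
            elif len(out) == 2:          # if-then: a -> b -> c and a -> c
                for b, c in (out, out[::-1]):
                    if b != entry and preds(b) == [a] and succ[b] == [c]:
                        tree[a] = ("if", tree[a], tree[b])
                        succ[a] = [c]
                        del succ[b]
                        changed = True
                        break
            elif len(out) == 1:          # sequence: a -> b, b has no other preds
                b = out[0]
                if b != entry and preds(b) == [a]:
                    tree[a] = ("seq", tree[a], tree[b])
                    succ[a] = succ.pop(b)
                    changed = True
    return tree[entry] if len(succ) == 1 else None
```

The resulting tree maps directly onto nested pseudocode statements; a production algorithm needs many more patterns plus a fallback (such as gotos) for graphs that match none of them.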

Initial pseudocode is ugly

Almost unreadable...

Transformations improve it

Some casts still remain

Interactive operation allows us to fine tune it

Final result after some renamings and type adjustments:
- The initial assembly is too long to be displayed on a slide
- Pseudocode is much shorter and more readable

What decompilation gives us

Obvious benefits:
- Saves time
- Eliminates routine tasks
- Makes source code recovery easier (...)

New things:
- Next abstraction level - closer to application domain
- Data flow based tools (vulnerability scanner, anyone? :)
- Binary translation

Base to build on...

- To be useful and make other tools possible, a decompiler must have a programmable API
  - It already exists but it needs some refinement
  - Microcode is not accessible yet
- Decompiler is retargetable (x86 now, ARM will be next)
- Both interactive and batch modes are possible
- In addition to being a tool to examine binaries, a decompiler could be used for...

...program verification

Well, “verification” won't be strict but it can help to spot interesting locations in the code:

- Missing return value validations (e.g. for NULL pointers)
- Missing input value validations
- Taint analysis
- Insecure code patterns
- Uninitialized variables
- etc.
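To give a flavor of what a taint check over decompiler output could look like, here is a minimal sketch over straight-line pseudocode. Statements are `(dst, func, args)` triples; values returned by "source" functions are tainted, taint flows through any call that consumes a tainted argument, and calls to "sink" functions with tainted arguments are reported. The statement format and the choice of sources and sinks are assumptions; nothing here reflects an existing Hex-Rays API.

```python
def find_tainted_sinks(stmts, sources, sinks):
    """Return indices of statements that pass tainted data to a sink.

    stmts: list of (dst, func, args); dst is None when the result is unused.
    """
    tainted = set()
    hits = []
    for i, (dst, func, args) in enumerate(stmts):
        arg_tainted = any(a in tainted for a in args)
        if func in sinks and arg_tainted:
            hits.append(i)
        if dst is not None:
            if func in sources or arg_tainted:
                tainted.add(dst)
            else:
                tainted.discard(dst)   # overwritten with clean data
    return hits
```

As the slide says, this is not strict verification: the result is a list of places worth a closer look, and both false positives and false negatives are expected.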

...assembly listing improvement

- Hardcore users who prefer to work with assembly instructions can benefit from data flow analysis results
- Hover the mouse over a register or data to get:
  - Its possible values or value ranges
  - Locations where it is defined
  - Locations where it is used
- Highlight definitions or uses of the current register in two different colors
- Show list of indirect call targets, calling conventions, etc.
- Gray out dead instructions
- Determine if a value comes from a system call (ReadFile)
- etc...

...more insight into the application domain

- One could reconstruct data types used by the application
- In fact, serious reverse engineering is impossible without knowing data types (...)
- Fortunately, the API already exposes all necessary information for type handling
- Plenty of work ahead

...more abstract representations

Tools to build more abstract representations:
- Function clustering (think of modules or libraries)
- Global data flow diagrams (functions exposed to tainted data in red)
- Statistical analysis of pseudocode
- C++ template detection, generic code detection

...binary code comparison

You know better than me the possible applications:
- To find code plagiarism
- To detect changes between program versions
- To find library functions (high-gear FLIRT)
- etc... (you know better than me :)
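One simple way to compare two functions at the pseudocode level is to normalize away details that change between builds (variable numbering, constants) and measure how many statements the two versions share. The sketch below uses Jaccard similarity over normalized lines. The `v<number>` variable naming, the normalization rules and the function names are assumptions for illustration; practical tools use much more robust features.

```python
import re

def normalize(line):
    """Replace numbered local variables and numeric literals with placeholders."""
    line = re.sub(r"\bv\d+\b", "VAR", line)
    line = re.sub(r"\b0x[0-9a-fA-F]+\b|\b\d+\b", "NUM", line)
    return " ".join(line.split())

def similarity(func_a, func_b):
    """Jaccard similarity of two functions given as lists of pseudocode lines."""
    a = {normalize(line) for line in func_a}
    b = {normalize(line) for line in func_b}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Comparing at this level sidesteps register allocation and instruction scheduling differences that make raw assembly comparison noisy.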

Back to earth

- The tools and possibilities described on the previous slides do not exist yet
- Yet they become possible thanks to decompilation
- We have a long way to go:
  - More processors and platforms
  - Floating point calculations
  - Exception handling
  - Type recovery
  - Handling hostile code
  - In fact, too many ideas to enumerate them here

The future is bright... is it?...

The “thank you” slide

Thank you for your attention!
Questions?