[IEEE 2011 IEEE International Workshop on Information Forensics and Security (WIFS) - Iguacu Falls,...

Software Code Obfuscation by Hiding Control Flow Information in Stack

Vivek Balachandran, Sabu Emmanuel

School of Computer Engineering, Nanyang Technological University


Abstract—Software code released to the user is at risk of reverse engineering attacks. Obfuscation is a technique in which the software code is transformed into a semantically equivalent form that is harder to reverse engineer. In this paper, we propose an algorithm to obfuscate software programs. The basic idea of our algorithm is to remove vital information, such as jump instructions, from the program code section and hide it in the data section. These instructions are then reconstructed to their original form dynamically at run time, thus making the program semantically equivalent to the original program. Experimental results on programs from the SPECint benchmark suites indicate that the algorithm performs well in introducing instruction disassembly errors and control flow errors without bloating up the size of the program too much.

I. INTRODUCTION

In today’s digital world software is an important asset. Computers and most electronic devices run software programs that control their actions. With this tremendous growth in the software industry there is significant advancement in software analysis tools, which help in improving software quality. Simultaneously, reverse engineering tools are also advancing. Tools and documents on how to reverse engineer and crack software systems are readily available on various Internet websites [9][10]. Theft of an algorithm's logic by competing companies is one threat posed by software reverse engineering. In addition, software reverse engineering may be used to discover vulnerabilities in the software and exploit them. Software that falls victim to reverse engineering might lead to heavy financial loss for the software industry. Thus, software reverse engineering is a major threat to the software industry.

One of the countermeasures taken against reverse engineering is encryption [11]. In an encryption scheme the encrypted instructions need to be decrypted before they are executed. An adversary can therefore obtain the unencrypted software by tapping the instructions after the decryption process.

An alternative approach, which we focus on, is software obfuscation [1], which makes the process of reverse engineering harder.

WIFS'2011, November 29th-December 2nd, 2011, Foz do Iguacu, Brazil. 978-1-4577-1019-3/11/$26.00 ©2011 IEEE.

The basic idea of software obfuscation is that the software is sent through an obfuscator, which transforms it into a functionally equivalent but obscure form that is very hard to understand.

Code obfuscation techniques are broadly classified into layout [13], design [12], data [3][4] and control [5][8] obfuscations [4]. Layout obfuscation [13] refers to obscuring the layout of the software, for instance by deleting comments. Changing the source code formatting, renaming variables and removing debugging information, thereby obfuscating the lexical structure of the program, fall under the category of layout obfuscation. Design obfuscation [12] deals with obscuring the design information of the software. For example, merging and splitting classes and hiding types help in obscuring the design intent of object-oriented programs. Class splitting splits a class into several classes, and class merging merges two or more classes into one. Data obfuscation is deployed to prevent the adversary from extracting information from data. Data obfuscation techniques such as array splitting, variable splitting, converting data to procedures and changing variable lifetimes are discussed in [3] and [4]. Control obfuscation obscures the control flow information of the program. Dynamic obfuscation algorithms based on the self-modifying code concept have also been proposed in [5], [15] and [16]. In the self-modifying code approach, the program code can modify itself during runtime.

In [8] a signal-based approach is proposed. In this algorithm, control flow obfuscation is achieved with the help of signals. Signals are used to carry messages between processes in an operating system. The idea is to replace the control flow instructions, including call, jump and return, with a trap instruction. When the trap instruction raises a signal at execution time, it triggers the programmed signal handler, which invokes a restoring function that transfers control to the original target address.

A self-modifying code based algorithm is proposed in [5]. In this method control flow obfuscation is achieved by camouflaging control flow instructions such as jump as normal instructions such as move. Only the opcode of the instruction is changed, and the target address of the jump instruction remains in the destination field of the move instruction. Modifying instructions that convert the opcode back to jump are added above the camouflaged instruction, thus fixing the code at runtime.

In [15], an obfuscation technique based on dynamic code mutation is proposed. The idea is to mutate the program with the help of an editing engine that runs edit scripts. Portions of the procedures in the program are removed and a stub is placed at the entry point of each procedure. During execution the removed portions are restored: control enters the stub, which executes the editing engine, and the stub is then removed.

Another self-modifying code based approach is proposed in [16]. In this method dummy variables are inserted in the program at fixed intervals. Dummy instructions are modified back to their original form during execution. Restoration instructions are added in a predecessor block that dominates the basic block with dummy variables, i.e., the restoration code is in a block that precedes the basic block of the dummy variable on all execution paths of the program.

A control flow obfuscation method using control flow flattening is proposed in [14]. The objective of this method is to confuse the disassembler about the execution sequence of the basic blocks in the program. The basic technique is to assign all the basic blocks in the program the same predecessor and successor. After a basic block executes, control goes to the successor block, which gives control back to the predecessor, and control then flows from the common predecessor block to the correct basic block.
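The flattening idea can be illustrated with a toy dispatcher loop. The following is a minimal Python sketch and not the construction used in [14]; the block labels and the `state` variable are hypothetical names chosen for illustration. A simple summation loop is rewritten so that every block hands control back to one common dispatcher.

```python
def run_flattened(n):
    """Sum 1..n with all basic blocks routed through a single dispatcher."""
    state = "entry"   # hypothetical label of the next block to run
    acc = 0
    i = 0
    while state != "exit":            # common predecessor and successor of every block
        if state == "entry":          # block 1: initialization
            acc, i = 0, 1
            state = "test"
        elif state == "test":         # block 2: loop condition
            state = "body" if i <= n else "exit"
        elif state == "body":         # block 3: loop body
            acc += i
            i += 1
            state = "test"
    return acc


print(run_flattened(4))
```

In the flattened form the static successor of every block is the dispatcher, so the real ordering of blocks is only visible through the values that `state` takes at run time.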

In the self-modifying code algorithm proposed in [5], even though the control flow instructions are camouflaged as other instructions, the control flow information, that is, the target address, is available in the program code section, in the modifying instructions. It may therefore be revealed to an adversary during disassembly of the code section. Similarly, in the signal-based approach [8], the control flow information is available in the signal handler's code section. If the target addresses are exposed, an adversary can calculate the edges between the basic blocks and thereby understand the control flow of the program. The dynamic mutation method proposed in [15] adds an extra module, the editing engine, to the program. Protecting the editing engine is a concern for this method. Similarly, the signal-based approach in [8] also adds an extra module, the signal handler, to achieve obfuscation. The technique in [16] needs a precise decision on the control flow of the program, and loop operations with dummy variables may lead to chaos.

In this paper we propose an algorithm in which control flow information, such as jump instructions, is camouflaged by storing it in the data area, hidden among the local variables. These instructions are reconstructed at runtime by self-modifying code. Stripping the control flow information from the code section has the advantage that an adversary cannot find the control flow information by analyzing the code section. It is also not trivial to find the control flow information by analyzing the data area, as the values are defined and initialized like ordinary variables. Our method does not add any additional modules to the program, and hence we do not have the overhead of protecting an additional module. Our algorithm works for control flow instructions in any basic block, even if it is part of a loop. Also, in the experimental evaluation, our algorithm yields more control flow errors and instruction disassembly errors than the signal-based [8] and self-modifying code based [5] approaches.

The paper is organized as follows. In Section II, we propose an obfuscation technique to transform a binary program into an obfuscated binary program. Section III explains the implementation details of the algorithm. Experimental evaluation metrics and evaluation results are discussed in Section IV. In Section V we present the conclusion.

II. PROPOSED METHOD

A program consists of a code area and a data area. The data areas are global, local (the stack) and dynamic (the heap). Our method builds on the fact that most reverse engineering tools and methods consider data and code areas separately. Reverse engineers and reverse engineering tools try to extract programming information from the code segments of the software, and extract data values and information about the data structures from the data segments and symbol tables.

The basic idea of our obfuscation is to hide code information such as jump instructions, both conditional and unconditional, in the data area (the stack) along with other data elements, thus obscuring the program code. This is done at obfuscation time. Even with knowledge of the algorithm, the attacker cannot distinguish between an ordinary data element and one used to store instructions. The code information stored in the data area is used to reconstruct the original code at runtime, so the execution of the program is semantically equivalent. This is achieved by inserting reconstruction instructions just above the camouflaged original location, which reconstruct the original instruction at runtime.

A. Selecting Instruction to be Stored in Data Space

The first step of the algorithm is to select the instruction that has to be camouflaged. One possible method is to pick instructions from the code at random. In our algorithm we instead decided to select jmp instructions.

One of the motivations for selecting jmp instructions is that they can be used to obscure the control flow. Instructions that affect the control flow of the program are highly significant for understanding its logic, and hence the absence of control flow instructions confuses the adversary. Another motivation is that it gives scope to add junk bytes in between the code blocks. This increases the instruction errors when the obfuscated program is reverse engineered.

B. Storing Code Information in Data Area

Once we have decided on the instruction to be moved out of the code segment, the second job is to store the information in the data area. Since we know that we are moving a jmp instruction, the opcode of the instruction is known, and hence the address to which the jump happens constitutes the code information. This means that the information stored in the data area is the address location to which the jump happens.
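As an illustration of what such a stored value could look like, the following minimal Python sketch computes the 32-bit displacement of a near relative jmp (opcode 0xE9 followed by a 4-byte displacement measured from the end of the instruction) and shows its little-endian byte layout. It assumes that the value kept in the local variable is this displacement, in line with the "address offset" mentioned in Section II-D; the function names and the example addresses are hypothetical.

```python
import struct

JMP_LEN = 5  # 0xE9 opcode byte + 4-byte displacement (near relative jmp on IA-32)


def jmp_rel32(jmp_addr, target_addr):
    """Displacement of a near relative jmp, measured from the next instruction."""
    return (target_addr - (jmp_addr + JMP_LEN)) & 0xFFFFFFFF


def as_stack_bytes(rel32):
    """Little-endian bytes as they would sit in a 32-bit local variable."""
    return struct.pack("<I", rel32)


# Hypothetical addresses, chosen only for illustration.
offset = jmp_rel32(0x08048400, 0x08048450)
print(hex(offset), as_stack_bytes(offset).hex())
```

A backward jump yields a negative displacement, which the mask stores in two's complement form so that it still fits a 32-bit slot.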

We store the code information in the stack area. Selecting the stack area to store the code information has an advantage over the global data area. When we store the code information in the stack area, it is treated as a local variable definition of the function. The instructions that use the local variable for the jmp instruction look just like other instructions that use local variables of the function. If we instead used the global data area, the code information would be stored globally but used only locally, in the function that has the corresponding camouflaged instruction. An adversary who sees a global variable used exclusively in one function would recognize this as abnormal behavior and sense something amiss.

Figure 1 shows how the jmp instruction's target address is stored in the stack area. Each block represents an instruction, and A1, A2, A3, A4 represent the addresses of the instructions.

Fig. 1. Storing code information in stack

C. Obfuscating the jmp Instructions

As we have stored the code information in the data area, the next step is to remove it from the code area. Now we start transforming the original code into obfuscated code. Instead of removing the instruction from the code area, we replace it with another instruction. The jmp instruction is replaced with the following instruction:

mov eax, 0

This mov instruction replaces the jmp instruction, and thereby the control flow information of that basic block is lost from the code section. An automated reverse engineering tool will therefore believe that after this mov instruction control goes to the next address location. We selected the mov instruction to replace the jmp instruction because it is the most used instruction in a program. One could also use other instructions instead of mov to camouflage the jmp instructions.

D. Reconstructing the jmp Instruction

The second step in the transformation of the original code to obfuscated code is the insertion of reconstruction instructions. The jmp instructions are reconstructed during the runtime of the program. The extra code needed to reconstruct the instruction is added above the camouflaged jmp instruction, as shown in Figure 2.

The first step is to change the opcode of the mov instruction to that of the jmp instruction. The opcode of the jmp instruction is 0xE9 and that of the mov instruction is 0xB8. We add an instruction to XOR the contents of the address location of the mov instruction with 0x00000051 (0xB8 XOR 0x51 = 0xE9). This changes the instruction to jmp offset 0. The next step is to add the address offset stored in the data area to the instruction. We add an instruction that adds the value in the local variable to the address field of the instruction. Now the exact jmp instruction is created at the address location of the mov instruction.
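The two patching steps can be simulated on a byte buffer. The following is a minimal Python sketch, not the instructions emitted by the obfuscator: it assumes that the XOR is applied to the 32-bit word at the address of the mov and that the stored offset is added to the four operand bytes that follow the opcode. The function name and the offset value 0x4B are hypothetical.

```python
import struct

MOV_EAX_0 = bytes([0xB8, 0x00, 0x00, 0x00, 0x00])  # mov eax, 0
XOR_MASK = 0x00000051                               # 0xB8 ^ 0x51 == 0xE9


def reconstruct_jmp(code, at, stored_offset):
    """Simulate the two runtime patches on a mutable code buffer."""
    # Step 1: XOR the 32-bit word at the mov's address; only the opcode byte changes.
    (word,) = struct.unpack_from("<I", code, at)
    struct.pack_into("<I", code, at, word ^ XOR_MASK)
    # Step 2 (assumed layout): add the offset from the local variable to the operand.
    (imm,) = struct.unpack_from("<I", code, at + 1)
    struct.pack_into("<I", code, at + 1, (imm + stored_offset) & 0xFFFFFFFF)
    return code


code = bytearray(MOV_EAX_0)
reconstruct_jmp(code, 0, 0x4B)   # 0x4B is a hypothetical stored offset
print(code.hex())                # e94b000000, i.e. jmp with displacement 0x4B
```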

In Figure 2, the camouflaged jmp instruction, which is the mov instruction, is at A2. The reconstruction instructions that reconstruct the jmp instruction are added in a preceding block. As we can see from Figure 2, the block containing A1 precedes A2, and hence the reconstruction instructions are added in the block containing A1. The XOR operation can be replaced by other logical and arithmetic operations. Thus we can have a set of possible reconstruction instructions, and choosing among them at random during obfuscation makes the method more robust.

E. Re-obfuscating the jmp Instruction

During the execution of the program, once the jmp instruction is executed, it has to be re-obfuscated back to the mov instruction. After reconstruction, the jmp instruction is in its true form, and if an adversary takes a core dump after the reconstruction operation, the jmp instruction will be exposed. To avoid this, we add extra instructions in the succeeding basic blocks to re-obfuscate jmp back to mov. The jmp instruction is XOR-ed again with 0x00000051 to get the instruction

mov eax, 0

According to the control flow of the example shown in Figure 2, the block containing A4 executes after the jmp instruction; hence the instructions that re-obfuscate jmp back to mov dynamically at runtime are introduced in the block containing A4.
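Re-obfuscation relies on XOR being its own inverse. The following minimal Python sketch simulates it on a byte buffer. The text above mentions only the XOR step; the sketch additionally subtracts the stored offset from the operand so that the bytes return exactly to mov eax, 0, which is an assumption made here for illustration. The function name and the offset 0x4B are hypothetical.

```python
import struct

XOR_MASK = 0x00000051  # same constant as used for reconstruction


def reobfuscate(code, at, stored_offset):
    """Turn a reconstructed jmp back into `mov eax, 0` in a mutable buffer."""
    # Assumed step: undo the offset addition so the operand returns to 0.
    (imm,) = struct.unpack_from("<I", code, at + 1)
    struct.pack_into("<I", code, at + 1, (imm - stored_offset) & 0xFFFFFFFF)
    # XOR with the same mask flips the opcode back: 0xE9 ^ 0x51 == 0xB8.
    (word,) = struct.unpack_from("<I", code, at)
    struct.pack_into("<I", code, at, word ^ XOR_MASK)
    return code


code = bytearray.fromhex("e94b000000")  # reconstructed jmp with hypothetical offset 0x4B
reobfuscate(code, 0, 0x4B)
print(code.hex())                        # b800000000, i.e. mov eax, 0
```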

F. Junk Bytes Insertion

The replacement of the jmp instruction with a mov instruction opens space for inserting junk bytes into the code section. Inserting junk bytes introduces more confusion for the disassembler [6], resulting in more errors during the disassembly process.

Fig. 2. Program before and after obfuscation

Since the jmp instruction is replaced by a mov instruction, the disassembler will assume that control flows directly from the mov instruction to the next instruction. This lets us introduce junk bytes after the mov instruction. Partial junk bytes are introduced as discussed in [6] to achieve maximum confusion.

A side effect of inserting junk bytes is that spurious jmp instructions appear in the junk byte region, which confuses the disassembler further.

Figure 3 shows how the junk bytes are introduced in the program. When partial junk bytes are inserted, at disassembly time they are associated with the nearby instruction bytes, thereby increasing the instruction disassembly error.
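The effect of a partial junk byte can be reproduced with a toy linear-sweep disassembler. The following minimal Python sketch is not a real disassembler: it knows only a tiny, assumed subset of one-byte IA-32 opcodes and their instruction lengths, and the byte sequence is a hypothetical example. A single junk byte 0xE9 placed after the camouflaged mov is read as the start of a 5-byte jmp and swallows the first four real instructions.

```python
# Toy instruction-length table: nop, inc eax, ret, mov eax imm32, jmp rel32.
LENGTHS = {0x90: 1, 0x40: 1, 0xC3: 1, 0xB8: 5, 0xE9: 5}


def linear_sweep(code):
    """Offsets that a naive linear-sweep disassembler treats as instruction starts."""
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        pc += LENGTHS.get(code[pc], 1)  # unknown bytes are skipped one at a time
    return starts


mov = bytes.fromhex("b800000000")   # camouflaged jmp (mov eax, 0)
junk = bytes.fromhex("e9")          # partial junk: first byte of a 5-byte jmp
real = bytes.fromhex("40404040c3")  # inc eax (x4); ret

# True instruction starts: the mov, then every byte of `real` (all 1-byte instructions).
true_starts = [0] + [len(mov) + len(junk) + i for i in range(len(real))]
found = linear_sweep(mov + junk + real)
missed = [a for a in true_starts if a not in found]
print(found, missed)  # [0, 5, 10] [6, 7, 8, 9]
```

Without the junk byte the same sweep recovers every instruction start, so in this toy case the misidentified addresses are caused entirely by the one inserted byte.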

G. Conditional Instructions

Conditional jump instructions such as jle also contribute to the control flow of the program. These instructions are camouflaged by the same method used for unconditional jump instructions. The constant used for XOR-ing depends on the opcode of the jump instruction. For example, the mov instruction, whose opcode is 0xB8, is XOR-ed with 0x51 to reconstruct the jmp instruction, giving 0xE9, the opcode of jmp. Similarly, in the case of jle, 0x0000B78E is XOR-ed with 0xB800 to get 0x0F8E, the opcode of jle.
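The XOR constants follow directly from the opcodes involved. The following minimal Python sketch derives them; the helper name `xor_mask` is hypothetical, and the entry for je (near form 0x0F84) is an additional example that does not appear in the text above. The sketch covers only the opcode bytes and says nothing about operand sizes.

```python
MOV_1BYTE = 0xB8     # opcode byte of `mov eax, imm32`
MOV_2BYTES = 0xB800  # first two bytes of `mov eax, 0` (B8 00)

# Near-jump opcodes; je is an extra example not taken from the paper.
OPCODES = {"jmp": 0xE9, "jle": 0x0F8E, "je": 0x0F84}


def xor_mask(camouflage_bytes, target_opcode):
    """Constant that turns the camouflage bytes into the target opcode under XOR."""
    return camouflage_bytes ^ target_opcode


print(hex(xor_mask(MOV_1BYTE, OPCODES["jmp"])))   # 0x51
print(hex(xor_mask(MOV_2BYTES, OPCODES["jle"])))  # 0xb78e
print(hex(xor_mask(MOV_2BYTES, OPCODES["je"])))   # 0xb784
```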

Fig. 3. Junk bytes insertion

III. IMPLEMENTATION

We implement the proposed algorithm at link time. The input to our algorithm is a C/C++ binary program. The output is the obfuscated binary program. PLTO, the Pentium Link-Time Optimizer [7], is the tool we used for implementing our proposed algorithm and generating the obfuscated binary file. The GNU/Linux operating system was used as the development platform, and the input binary files are in the Executable and Linkable Format (ELF).

The control flow graph of the input binary program is obtained using PLTO. Once we have the control flow graph of the program, we look for the jump instructions in the program that can be used for the obfuscation technique. Every basic block of each function in the program is scanned for candidate instructions.

Once we know how many jump instructions are available for modification in each function, the next step is to expand the size of the stack. The local variables of each function are stored on the stack. The activation record of the function is of fixed size and has space for local variables, parameters and the return value. Every time a function is called, this constant space is allocated on the stack for the function. With our obfuscation, more local variables need to be defined, and the number of local variables needed varies from function to function as the number of jump instructions varies.

So, the total number of jump instructions in the function is calculated. Then the size of the activation record of the function is changed by adding the extra space needed. When the function is called, it therefore moves the stack pointer further to accommodate the new local variables.
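The frame expansion amounts to simple per-function bookkeeping. The following minimal Python sketch assumes one 32-bit (4-byte) local slot per hidden jump, which is an assumption and not a figure stated in the text; the function names and frame sizes are hypothetical.

```python
SLOT_BYTES = 4  # assumption: one 32-bit local variable per hidden jump offset


def expanded_frame_size(original_frame_bytes, jump_count):
    """Activation record size after reserving one slot per camouflaged jump."""
    return original_frame_bytes + SLOT_BYTES * jump_count


# Hypothetical functions: name -> (original frame size in bytes, number of jumps).
functions = {"parse": (24, 3), "emit": (16, 0)}
new_sizes = {
    name: expanded_frame_size(size, jumps)
    for name, (size, jumps) in functions.items()
}
print(new_sizes)  # {'parse': 36, 'emit': 16}
```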

In the implementation the reconstruction instructions are added before changing the jump instructions. The exact sequence of implementation is as follows. The mov instruction that stands in for the jump instruction is added just above the jump instruction. The offset address of the jump instruction is calculated and stored in the local variable. Reconstruction instructions are added in the preceding block and re-obfuscation instructions in the succeeding blocks of the jump instruction. Finally, the jump instruction is deleted from the program.

The whole program, which is in the intermediate control flow graph representation, is then compiled to a binary executable by PLTO.

IV. EXPERIMENTAL EVALUATION

In this section we discuss the evaluation of the proposed algorithm and the metrics we used to evaluate its performance.

A. Evaluation Metrics

1) Instruction disassembly errors: We evaluate the instruction disassembly error with a confusion factor. The confusion factor is the fraction of instruction addresses that the disassembler fails to identify [8]. If Ttotal is the total number of actual instruction addresses before obfuscation and Tdisasm is the total number of instruction addresses properly recognized by the disassembler, then the confusion factor is defined as follows:

CFinstr = |Ttotal − Tdisasm|/Ttotal.

Program size overhead: Obfuscation affects the size of the program. Spaceeff measures the effective bloating of the program due to obfuscation.

Spaceeff = (Scode1 + Sdata1)/(Scode0 + Sdata0)

where Scode1 and Scode0 are the sizes of the code after and before obfuscation. Similarly, Sdata1 and Sdata0 are the sizes of the data section after and before obfuscation [8].

2) Control flow disassembly errors: We calculated the number of conditional and unconditional jump instructions in the program before and after the obfuscation. If CFGbefore is the total number of conditional and unconditional jump instructions in the original program and CFGafter is the total number of jump instructions in the obfuscated program, then CFcfg, the confusion factor in the control flow of the program, is

CFcfg = |CFGbefore − CFGafter|/CFGbefore

The ratio gives the control flow confusion caused by the obfuscation.
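The three metrics of this subsection reduce to a few lines of arithmetic. The following minimal Python sketch mirrors the definitions of CFinstr, CFcfg and Spaceeff; the function names are hypothetical and the input numbers are made up for illustration, not taken from the experiments.

```python
def confusion_factor(total, recognized):
    """|total - recognized| / total; used for both CFinstr and CFcfg."""
    return abs(total - recognized) / total


def space_eff(code_before, data_before, code_after, data_after):
    """(Scode1 + Sdata1) / (Scode0 + Sdata0)."""
    return (code_after + data_after) / (code_before + data_before)


# Hypothetical inputs, chosen only for illustration.
cf_instr = confusion_factor(1000, 250)     # Ttotal, Tdisasm
cf_cfg = confusion_factor(200, 80)         # CFGbefore, CFGafter
growth = space_eff(600, 400, 1500, 700)    # Scode0, Sdata0, Scode1, Sdata1
print(cf_instr, cf_cfg, growth)
```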

B. Performance

We evaluated the efficacy of the obfuscation with programs from the SPECint-2006 benchmark suites. The algorithms in [5] and [8] use the same benchmark suite for evaluation. The evaluation results are similar when applied to other C programs. Our evaluation platform is a 2.6 GHz Pentium system with 2 GB of main memory. The evaluation system runs the Ubuntu distribution of GNU/Linux. The compiler used is gcc version 3.4 at optimization level -O3. The disassembly results from IDAPro [9] version 5.2.0.911 are used for the performance evaluation.

The experimental results for the instruction confusion factor, CFinstr, are listed in Table I.

TABLE I
INSTRUCTION DISASSEMBLY ERROR

Program    Ttotal     |Ttotal − Tdisasm|    CFinstr
Bzip2      980149     787157                80.03 %
Hmmer      1608118    1326325               82.47 %
Lbm        870411     640529                73.58 %
Mcf        901763     681301                75.52 %
Sjeng      1105023    964533                87.28 %
Mean                                        79.78 %

The average instruction disassembly error of the SPECint-2006 programs is 79.78%. This means that the disassembler succeeds in recovering only 20.22% of the instructions properly.

The error in identifying the control flow instructions in the program gives us an account of the control flow obfuscation attained by the algorithm. Table II shows the number of control flow instructions in the original program and the obfuscated program and their ratio.

TABLE II
CONTROL FLOW ERRORS

Program    CFGbefore    |CFGbefore − CFGafter|    CFcfg
Bzip2      16883        9516                      56.36 %
Hmmer      24183        17729                     73.31 %
Lbm        14766        7660                      51.87 %
Mcf        15217        8095                      53.19 %
Sjeng      19281        12421                     64.42 %
Mean                                              61.35 %

One advantage of our algorithm is that the program size does not bloat too much after obfuscation. Table III shows the space efficiency of our algorithm.

On average, programs grow to 2.2 times their original size after obfuscation. The increase in size has two causes. First, the reconstruction instructions added to the program contribute to increasing its size. For each jump instruction removed from the program, ten additional instructions are inserted into the program.

The second cause is the insertion of junk bytes to achieve more instruction disassembly errors. Junk bytes are added in the succeeding block of the removed jmp instruction. This also accounts for the increase in the size of the program.

TABLE III
SPACE EFFICIENCY

Program    Spacebefore (in bytes)    Spaceafter (in bytes)    Spaceeff
Bzip2      589489                    1296240                  2.19
Hmmer      862922                    2045808                  2.37
Lbm        527128                    1103728                  2.09
Mcf        533107                    1140592                  2.14
Sjeng      707023                    1562480                  2.21
Mean                                                          2.22

The performance of our algorithm is compared with two algorithms, namely signal-based obfuscation [8] (SBC) and the self-modifying code based algorithm [5] (SMC). Table IV shows the comparison of our algorithm with the other two on the basis of instruction disassembly error, control flow error and space efficiency.

TABLE IV
ALGORITHM PERFORMANCE COMPARISON

Comparison Item        Signal-based Algorithm    SMC Algorithm    Proposed Algorithm
Instr. disas. Error    57.28 %                   75.36 %          79.78 %
Control Flow Error     41.18 %                   52.21 %          61.35 %
Size                   2.39                      2.05             2.22

The instruction disassembly error and control flow error achieved by our algorithm are better than those of the other two algorithms. The size efficiency of our algorithm, that is, the increase in the size of the program after obfuscation, is comparable with that of the other two algorithms.

V. CONCLUSION AND FUTURE WORKS

In this paper we proposed a software obfuscation algorithm to protect binary programs from reverse engineering. We moved some of the vital code information from the code segment, hid it in the data segment and reconstructed it dynamically when needed. The algorithm also uses junk byte insertion to increase the complexity for disassembly techniques. We implemented the algorithm, and the evaluation results show that the technique is effective in confusing disassemblers such as IDAPro. Compared to other algorithms such as the signal-based approach [8], our algorithm has better instruction disassembly error, control flow error and space efficiency. Currently we are trying to extend the algorithm to suit software programs in distributed environments.

REFERENCES

[1] C. Collberg, C. Thomborson, and D. Low, “Manufacturing Cheap, Resilient, and Stealthy Opaque Constructs,” in Proc. of the 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 184-196, 1998.

[2] B. Barak, O. Goldreich, R. Impagliazzo, S. Rudich, A. Sahai, S. P. Vadhan, and K. Yang, “On the (im)possibility of obfuscating programs,” in CRYPTO, pp. 1-18, 2001.

[3] W. F. Zhu, “Concepts and Techniques of Software Watermarking and Obfuscation,” Ph.D. thesis, Department of Computer Science, University of Auckland, New Zealand, August 2007.

[4] C. Collberg, C. Thomborson, and D. Low, “A taxonomy of obfuscating transformations,” Technical Report 14, Department of Computer Science, University of Auckland, pp. 36, 1997.

[5] S. Liang and S. Emmanuel, “Mobile Agent Protection with Function Level Self-Modifying Code Obfuscation,” in Springer Journal of Signal Processing Systems, November 2010.

[6] C. Linn and S. Debray, “Obfuscation of executable code to improve resistance to static disassembly,” in Proc. of the 10th ACM Conference on Computer and Communications Security, pp. 290-299, 2003.

[7] B. Schwarz, S. Debray, G. Andrews, and M. Legendre, “PLTO: A Link-Time Optimizer for the Intel IA-32 Architecture,” in Proc. of Workshop on Binary Translation, Barcelona, Catalunya, Spain, 2001.

[8] I. Popov, S. Debray, and G. Andrews, “Binary obfuscation using signals,” in Proc. of 16th USENIX Security Symposium, USENIX Association, Berkeley, CA, USA, 2007.

[9] (2011) The IDAPro website. [Online]. Available: http://www.datarescue.com/

[10] J. Miecznikowski and L. Hendren, “Decompiling Java using staged encapsulation,” in Proc. of the Eighth Working Conference on Reverse Engineering (WCRE01), IEEE Computer Society, Washington, DC, USA, 2001.

[11] W. Thompson, A. Yasinsac, and J. McDonald, “Semantic Encryption Transformation Scheme,” in Proc. of 2004 International Workshop on Security in Parallel and Distributed Systems, San Francisco, CA. Citeseer, 2004.

[12] M. Sosonkin, G. Naumovich, and N. Memon, “Obfuscation of design intent in object-oriented application,” in DRM03, ACM, Oct. 2003, pp. 142-153.

[13] D. Hachez, “A comparative study of software protection tools suited for e-commerce with contributions to software watermarking and smart cards,” Ph.D. dissertation, Universite Catholique de Louvain, March 2003.

[14] C. Wang, J. Davidson, and H. J. Knight, “Protection of Software-based Survivability Mechanisms,” in Proc. of International Conference of Dependable Systems and Networks, July 2001.

[15] M. Madou, B. Anckaert, P. Moseley, S. Debray, B. De Sutter, and K. De Bosschere, “Software Protection Through Dynamic Code Mutation,” Lecture Notes in Computer Science 3786, 2006, 194.

[16] Y. Kanzaki, A. Monden, M. Nakamura, and K. Matsumoto, “Exploiting self modification mechanism for program protection,” in Proc. of 27th Annual International Computer Software and Applications Conference, pp. 170-179, 2003.
