Potent and Stealthy Control Flow Obfuscation by Stack Based Self-Modifying Code


IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 8, NO. 4, APRIL 2013 669

Potent and Stealthy Control Flow Obfuscation by Stack Based Self-Modifying Code

Vivek Balachandran and Sabu Emmanuel

Abstract—Software code released to the user is at risk of reverse engineering attacks. Software obfuscation techniques can be employed to make the reverse engineering of software programs harder. In this paper, we propose a potent, stealthy, and cost-effective algorithm to obfuscate software programs. The main idea of the algorithm is to remove control flow information from the code area and hide it in the data area. During execution, the affected instructions are reconstructed, thereby preserving the semantics of the program. Experimental results indicate that the algorithm performs well against static and dynamic attacks. The obfuscated program is also hard to differentiate from normal binary programs, demonstrating the obfuscation's good stealth.

Index Terms—Computer security, software engineering, software safety, software security.

I. INTRODUCTION

SOFTWARE, over the years, has evolved from free code given along with the hardware to a valuable asset, automating almost all electronic devices and systems. The growth in software analysis tools has helped software developers to analyze and improve their programs. Unfortunately, the same software analysis technologies [1], [2] are used to reverse engineer software systems with malicious intent, such as stealing the intellectual property of a program or identifying and exploiting its vulnerabilities. Tools and documents on software reverse engineering are readily available on various websites [1], [2]. There have been several software lawsuits involving intellectual property theft that employed reverse engineering techniques. Atari Games v. Nintendo [3] in 1992, Sony v. Connectix [4] in 2000, and Blizzard v. bnetd in 2002 are some lawsuits involving reverse engineering of software programs. Blizzard Entertainment's [5] online multiplayer gaming service, Battle.net, was reverse engineered into the software package bnetd. Blizzard won the United States lawsuit against bnetd's original developers [6].

Another threat of reverse engineering software programs into a higher level abstraction is that it becomes easier for an adversary to identify the vulnerabilities in software programs and exploit them. Adversaries can insert trojans [7], viruses and worms [8], [9] by exploiting the discovered vulnerabilities in the program. The vulnerabilities can also be exploited to mount denial of service attacks [10], [11]. Thus an adversary who successfully reverse engineers a software program poses a threat to the software's users and to the companies where the software is installed.

Manuscript received February 14, 2012; revised February 21, 2013; accepted February 23, 2013. Date of publication March 07, 2013; date of current version March 13, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. C.-C. Jay Kuo.

The authors are with Nanyang Technological University, Singapore (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIFS.2013.2250964

1556-6013/$31.00 © 2013 IEEE

Encryption [12]–[14] is one of the countermeasures taken against software reverse engineering. The program is encrypted and decrypted in parts on demand. The decrypted part is executed and immediately re-encrypted at runtime [13], [15]. Using specialized hardware for encryption [14] is another encryption approach. However, these approaches have the disadvantages of performance overhead, due to the multiple calls to the encryption and decryption routines, and loss of the flexibility to run the software on standard hardware.

An alternative approach, which we focus on, is software obfuscation [16], which makes the process of reverse engineering harder. Software obfuscation is a practical approach in which the software developer obscures the code to such a level that it is harder for the adversary to reverse engineer the program and make sense of the result. Ideally, the obfuscator wants the program to be so obscure that it is more economical for the adversary to develop the program from scratch than to reverse engineer it.

Code obfuscation can be broadly classified into layout [17], design [18], data [19], [20] and control [21], [22] obfuscations. Layout obfuscation [17] refers to obscuring the layout of the program. For example, deleting comments, removing debugging information, renaming variables and changing the source code formatting fall under the category of layout obfuscation. Design obfuscation [18] tries to obscure the design of the software system. For example, in the case of object oriented programs, obfuscations such as splitting classes, hiding type information and merging classes obscure the design intent of the program. Data obfuscation [19], [23] is deployed to prevent the adversary from extracting information from the data used in the program. The data structures and data values used in a program can give away information about the nature of the program. Obfuscation techniques like array splitting, data to procedure conversion, variable splitting and changing variable lifetime, as discussed in [19] and [23], are examples of data obfuscation. Control obfuscation obscures the control flow information of the program. The control flow of a program gives logical meaning to the program. Control flow flattening [24] and opaque predicates [20] are methods that provide control flow obfuscation.

Another classification of obfuscation is based on the software language level at which the obfuscation is carried out. The levels at which obfuscation can be carried out are the source code (high-level language) level [20], [48], the intermediate (assembly language, byte code) level [49], [50] and the binary level [21], [22], [25]. Certain programs are distributed as high-level language or intermediate level programs, such as Java or Java byte-code programs. These programs then need to be obfuscated at either the Java source code level or the Java byte-code level [50]. C/C++ programs are often distributed at the binary level; hence binary level obfuscation techniques are relevant to such programs [25]. Although C/C++ programs are often distributed as binaries, they can be obfuscated at the C/C++ source code level, the intermediate assembly language level or the binary level. Implementing the obfuscation at the source code level or assembly language level is easier than at the binary level [21], [25]. However, the optimization algorithms in the compiler or assembler may remove these obfuscations. Hence, obfuscations carried out at the source code or assembly language level may not be as robust as those carried out at the binary level [21].

Binary level obfuscation obscures the binary program and makes it difficult for an adversary to disassemble the binary into a correct assembly level representation. Reverse engineering of a binary program starts with the disassembly of the binary program into an assembly level representation. Obfuscation at the binary level thus means confusing the disassembler during the disassembly process and making it produce a corrupted assembly program. This in turn corrupts the higher level representations produced in the subsequent steps of reverse engineering [29]–[32].

In the past decade, various algorithms for binary obfuscation have been formulated. Each approach uses a different method, such as signals [21], dynamic code mutation [22], [26], [27] or control flow flattening [24], to obscure the binary program.

In the signals based approach described in [21], the signals used to carry messages between processes in an operating system are used to obtain obfuscation. The control flow instructions, which include call, jump and return, are replaced with trap instructions. A trap raises a signal at execution time, which triggers the programmed signal handler. The signal handler contains code to transfer control to the original target address. One of the disadvantages of this method is that the control flow information is available in the signal handler's code section, and an adversary can find the control flow by analyzing that code section. Protecting the new module, the signal handler, is a concern for this method. In our method no new modules are added, and the extra instructions that support the obfuscation are blended in with the original program. Also, the control flow information is hidden in the data area rather than in the code area, making it harder for an adversary to obtain control flow information by analyzing the code section.

In [22], a self modifying code based algorithm is proposed.

In this method the control flow instructions, like jmp, are camouflaged as normal instructions, like the mov instruction. The opcodes of the jmp instructions are changed and the target addresses are stored in the destination field of the mov instruction. Modifying instructions that change the opcode back to that of the jmp instruction are added at the beginning of the program. A problem with this method is that, even though the control flow instructions are camouflaged as other instructions, the control flow information, that is the target address, is still available in the program's code section, in the modifying instructions. So it may be revealed to an adversary during disassembly of the code section. As explained earlier, our method handles this problem by moving the control flow information completely from the code section to the data area.

Another dynamically mutating method is proposed in [26].

This is based on an editing engine, which runs edit scripts. Certain parts of each procedure are removed and a stub is placed at the entry point of the procedure. When the procedure is executed, control goes to the stub, which invokes the editing engine to put all the removed code back into the procedure. The stub is removed from the procedure after that. The major disadvantage of this method is the addition of an extra module, the editing engine, to the program. The extra module easily draws the attention of an adversary, which is undesirable because it contains the information needed to de-obfuscate the program.

Another method, based on insertion of dummy variables, is explained in [27]. At regular intervals in the program, dummy variables are inserted. During execution the dummy instructions are modified to the original form. The restoration is enabled by inserting restoration instructions in a predecessor basic block, i.e., a basic block that precedes the dummy instruction in all execution paths of the program. A disadvantage of this technique is that it needs precise knowledge of the control flow of the program, and loop operations with dummy variables may lead to chaos. Also, the size overhead of this method is relatively high.

Control flow flattening [24] is another control obfuscation method, which confuses the disassembler about the execution sequence of the procedure. The idea is that all the basic blocks are assigned the same predecessor and successor block. Once a block is executed, control flows to the successor block, then to the predecessor block, and eventually from the predecessor block to the correct next block. One of the advantages of control flow flattening is that it provides very good control obfuscation. On the other hand, the performance overhead in terms of space and time is high for this method. The instruction disassembly error is also low for control flow flattening, i.e., using an automated disassembly tool an adversary can disassemble a majority of the instructions in the binary program.

Another obfuscation technique is obfuscation based on mimimorphism [46], where the binary program is encoded by a mimic function into a different binary executable format. The decoder, which decodes the binary executable to its original form, is also stored in the program. This method is effective against detection based on byte frequency distribution and semantic analysis.

Virtual machine based obfuscation [47] is another dynamic obfuscation technique. The basic idea of this method is to append a virtual machine core to the program. The binary representation of the program is then converted to a byte code representation, which is interpreted by the virtual machine. To increase the complexity of the obfuscation, the program is obfuscated iteratively using various byte code interpretations.

In this paper we propose an algorithm to perform binary level obfuscation, which provides good control flow and instruction obfuscation. Most methods that perform binary level obfuscation [21], [26] introduce a new module into the program to support their obfuscations. In other methods, like [22], [24], [27], which do not use extra modules, the control flow information is available in the code area, where it can be seen during disassembly using tools like IDAPro [1]. This motivated us to develop an algorithm that blends the instructions supporting the obfuscation into the original program instead of adding an extra module, and that does not expose the control flow information when the program is disassembled using an automated disassembly tool.

The basic idea of our method is to camouflage control flow instructions, like jump instructions, and to store the details needed to reconstruct them in the data area. The target address information is thus in the data area and not in the code area as in [21], [22], [26], [27]. During runtime these instructions are reconstructed by the self modifying code inserted at obfuscation time. One advantage of this method is that the control flow information is stripped from the code section, so an adversary cannot find it by analyzing the code area alone. It is also not trivial to find the control flow information by analyzing the data area, as it is defined and initialized like ordinary variables.

Hence, the major contribution of our paper compared to other algorithms is that the target address information is stripped from the code area and stored in the data area. An adversary cannot reconstruct the control flow by analyzing just the code area. Another contribution of our paper is the introduction of junk bytes in the execution path. This facilitates the obfuscation of conditional jump instructions and adds to the adversary's confusion. Another advantage of our method is that no extra modules are added to the program to facilitate dynamic mutation. The self modifying instructions are inserted within the program's procedures. Thus our method does not have the overhead of protecting additional modules.

The paper is organized as follows. Section II provides the preliminaries necessary for understanding the proposed algorithm. The threat model is discussed in Section III. Section IV covers the proposed algorithm in detail. The implementation details are discussed in Section V. Section VI covers the performance evaluation of the obfuscation: the static and dynamic potency of the obfuscation against automated attacks, the obfuscation overheads in terms of execution time and space, the stealth analysis, and the performance of our algorithm compared to other state-of-the-art algorithms.

II. PRELIMINARIES

A. Analysis of Binary Program

Reverse engineering of a binary can be broadly classified into two categories: static analysis based [33], [34] and dynamic analysis based [35]. In a static analysis based scheme, the binary program is analyzed and disassembled statically. It tries to create the assembly representation of the binary program without executing the program. There are various tools, open source [30], [36], [37] and proprietary [1], [38], that help in the process.

In a dynamic scheme, the execution of the program is monitored and analyzed to obtain context information at runtime. As a result it covers only those regions of the program that are executed, and there can be an infinite number of execution paths for a program.

Static analysis is mostly employed because it gives a complete overview of the program, rather than the single execution path given by dynamic analysis. Research works mostly measure the potency of their algorithms against static analysis. We have measured the efficiency of our algorithm against both static and dynamic analysis.

B. Disassembly of Binary Program

The first step in reverse engineering a binary program is disassembly, the process of creating the assembly representation of the binary program. Linear sweep and recursive traversal are the most widely used disassembly algorithms [39]. Linear sweep begins disassembly at the program's first executable byte and sweeps through the program, disassembling instructions sequentially one after another. This method has a weakness: it misinterprets data bytes as instructions if data is embedded between instructions in the code section. Popular tools using this method are GNU objdump [36] and Microsoft's DumpBin [38].

The problem in linear sweep is addressed by the recursive traversal algorithm. Recursive traversal takes the control flow of the program into account. The control flow of a program is the order in which the basic blocks of the program are executed. Recursive traversal follows the execution flow recursively and disassembles the instructions it reaches. However, the assumption that recursive traversal can precisely find the control transfer locations may not hold in the case of conditional jumps and calls. Disassemblers implementing this algorithm include IDAPro [1] and OllyDbg [37]. All these disassemblers are used for static disassembly of binary programs.

The assembly language debugger ald [30] can be used for dynamic disassembly. Using ald, one can disassemble a program instruction by instruction while it executes.
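The difference between the two static algorithms can be illustrated with a minimal sketch. It assumes a hypothetical toy instruction set (the `ISA` table, its opcodes and the function names are invented for illustration and are not x86), in which one data byte is embedded right after an unconditional jump. Linear sweep decodes the data byte as the start of an instruction and swallows the following `nop`, whereas recursive traversal follows the jump and recovers the real instructions.

```python
# Hypothetical toy instruction set (not x86): opcode -> (mnemonic, length in bytes).
ISA = {0x01: ("nop", 1), 0x02: ("jmp", 2), 0x03: ("ret", 1), 0x04: ("movi", 2)}


def decode(code, pc):
    """Decode one instruction at offset pc; unknown bytes are shown as 'db'."""
    name, length = ISA.get(code[pc], ("db", 1))
    operand = code[pc + 1] if length == 2 and pc + 1 < len(code) else None
    return name, length, operand


def linear_sweep(code):
    """Disassemble sequentially from the first byte, ignoring control flow."""
    listing, pc = [], 0
    while pc < len(code):
        name, length, operand = decode(code, pc)
        listing.append((pc, name, operand))
        pc += length
    return listing


def recursive_traversal(code, entry=0):
    """Disassemble only the bytes reachable by following the control flow."""
    seen, worklist, listing = set(), [entry], []
    while worklist:
        pc = worklist.pop()
        while pc < len(code) and pc not in seen:
            seen.add(pc)
            name, length, operand = decode(code, pc)
            listing.append((pc, name, operand))
            if name == "jmp":            # unconditional: continue at the target only
                if operand is not None:
                    worklist.append(operand)
                break
            if name == "ret":            # end of this execution path
                break
            pc += length
    return sorted(listing, key=lambda item: item[0])


# jmp 3 ; one embedded data byte (0x04) ; nop ; ret
CODE = bytes([0x02, 0x03, 0x04, 0x01, 0x03])

print(linear_sweep(CODE))         # the data byte is decoded as 'movi' and hides the nop
print(recursive_traversal(CODE))  # the jump is followed, so the nop at offset 3 is found
```

In this toy case recursive traversal is exact because the only control transfer is a direct unconditional jump; the sketch does not model the conditional jumps and calls for which, as noted above, the target assumption may fail.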

C. Junk Byte Insertion

Junk byte insertion [21] is a method used to confuse automated disassemblers and force them to give wrong disassembly results. The idea is to add junk bytes in areas where the disassembler expects code. The junk bytes are partial instructions added at locations where they will not change the semantics of the program. For instance, junk bytes can be inserted at the beginning of a basic block that immediately follows a block ending with an unconditional jump. This location is in fact unreachable at runtime, but a static disassembler sees the junk as the start of a valid instruction and tries to disassemble the partial instruction bytes. Since the instruction is partial, the disassembler combines it with the bytes of the next valid instruction to form a valid assembly instruction. This corrupts the valid instructions in the basic block.
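The effect can be sketched on a hypothetical toy instruction set. The opcodes and the helper names `sweep`, `run` and `insert_junk_after_jmp` are assumptions made for illustration; this is not x86 and not the tool of [21]. One junk byte, the opcode of a two-byte instruction, is placed directly after an unconditional jump, and the jump target is shifted so that execution skips it.

```python
# Hypothetical toy instruction set (not x86): opcode -> (mnemonic, length in bytes).
ISA = {0x01: ("nop", 1), 0x02: ("jmp", 2), 0x03: ("ret", 1), 0x04: ("movi", 2)}


def sweep(code):
    """Linear-sweep listing as (offset, mnemonic) pairs."""
    listing, pc = [], 0
    while pc < len(code):
        name, length = ISA.get(code[pc], ("db", 1))
        listing.append((pc, name))
        pc += length
    return listing


def run(code):
    """Execute the toy program and return the mnemonics actually executed."""
    trace, pc = [], 0
    while True:
        name, length = ISA[code[pc]]
        trace.append(name)
        if name == "ret":
            return trace
        pc = code[pc + 1] if name == "jmp" else pc + length


def insert_junk_after_jmp(code, jmp_pc, junk=0x04):
    """Insert one junk byte right after the 2-byte jmp located at jmp_pc.

    Absolute jump targets at or beyond the insertion point are shifted by one,
    so the junk byte is never executed. Assumes `code` contains no embedded
    data yet, so a linear sweep finds its true instruction boundaries.
    """
    cut = jmp_pc + 2
    out = bytearray(code)
    for pc, name in sweep(code):
        if name == "jmp" and out[pc + 1] >= cut:
            out[pc + 1] += 1
    out.insert(cut, junk)
    return bytes(out)


ORIGINAL = bytes([0x04, 0x07, 0x02, 0x04, 0x01, 0x03])  # movi 7 ; jmp 4 ; nop ; ret
OBFUSCATED = insert_junk_after_jmp(ORIGINAL, jmp_pc=2)

print(run(ORIGINAL) == run(OBFUSCATED))   # same executed instructions
print(sweep(OBFUSCATED))                  # junk byte merges with the nop into a bogus movi
```

The junk byte here is the first byte of a two-byte `movi`, so the linear sweep consumes the real `nop` as its operand, which mirrors the corruption described above.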

D. Self Modifying Program

A self modifying program is one that modifies itself while executing. This method is used in different forms in different binary obfuscation techniques. The basic idea is that parts of the program are removed or replaced by other instructions, so that statically the program looks different. During runtime the program is transformed back to its original form. Different methods are adopted to achieve this, as presented in [22], [26], [27]. The basis of all these methods is to add extra code modules to the program that know exactly which area of the program is to be modified and when.

The advantage of using self modifying programs is that they obscure the program well and make it difficult for static disassemblers to disassemble the program correctly. Statically, the program looks completely different, and it is fixed dynamically by the self modifying code. Self modifying code can also be used to obfuscate program areas dynamically, so dynamically restored code can be obfuscated again during runtime. The only period in which the code is in its true form is then during its execution. So, even if an adversary decides to run the program, break at some point and dynamically disassemble the program, the chance of getting the program in its true form is low.
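A minimal simulation can make the restore-and-re-obfuscate cycle concrete. It uses a hypothetical toy virtual machine: the `patch` opcode, which overwrites one byte of the program, and all other opcodes are assumptions for illustration rather than any of the schemes in [22], [26], [27]. In the static image the instruction at offset 3 looks like `movi`; at runtime it is patched into a `jmp` just before it executes and patched back afterwards.

```python
# Hypothetical toy VM (not x86): opcode -> (mnemonic, length in bytes).
# 'patch addr val' overwrites the byte at offset addr of the running program.
ISA = {0x01: ("nop", 1), 0x02: ("jmp", 2), 0x03: ("ret", 1),
       0x04: ("movi", 2), 0x06: ("patch", 3)}


def static_listing(image):
    """What a static linear sweep sees in the program image on disk."""
    listing, pc = [], 0
    while pc < len(image):
        name, length = ISA[image[pc]]
        listing.append(name)
        pc += length
    return listing


def run(image):
    """Execute the image; return the executed mnemonics and the final memory."""
    mem, pc, trace = bytearray(image), 0, []
    while True:
        name, length = ISA[mem[pc]]
        trace.append(name)
        if name == "ret":
            return trace, bytes(mem)
        if name == "patch":                 # self-modification step
            mem[mem[pc + 1]] = mem[pc + 2]
        pc = mem[pc + 1] if name == "jmp" else pc + length


IMAGE = bytes([
    0x06, 0x03, 0x02,   # 0: patch 3, 0x02  -> restore the jmp opcode at offset 3
    0x04, 0x06,         # 3: statically 'movi 6', at runtime 'jmp 6'
    0x01,               # 5: nop (skipped at runtime)
    0x06, 0x03, 0x04,   # 6: patch 3, 0x04  -> camouflage offset 3 again
    0x03,               # 9: ret
])

trace, final_memory = run(IMAGE)
print(static_listing(IMAGE))   # no jmp is visible statically
print(trace)                   # a jmp is executed at runtime
print(final_memory == IMAGE)   # the code is camouflaged again after use
```

The jump exists in memory only between the two `patch` instructions, which is the window in which a dynamic snapshot would have to be taken to see the code in its true form.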

III. THREAT MODEL

To design a protection mechanism for software, one should understand the threat that the software faces from the adversary. We assume that the adversary is trying to reverse engineer a binary program into an assembly level representation.

One of the factors to be considered in the threat model is the platform and the access level that the adversary has. Our assumption is that the adversary owns the software program and the program runs on the adversary's computer. We also assume that the adversary has complete control over the system, so that the adversary can analyze the program, modify it and execute it.

Another assumption is that the adversary has access to reverse engineering tools that help in disassembling the binary program into an assembly representation, such as IDAPro [1] and ald [30].

Our protection mechanism uses self modifying code, which mutates the program during runtime. We assume that the adversary knows this and uses dynamic analysis to disassemble the program.

IV. PROPOSED METHOD

A program consists of a code area and a data area. The different data areas are global, local and dynamic. The stack is an example of a local data area and the heap of a dynamic one. Our method is built on the fact that most reverse engineering tools and methods consider the data area and the code area separately. Reverse engineers and reverse engineering tools try to extract programming information from the code segments of the software, and extract data values and information about the data structures from the data segments and symbol tables.

The basic idea of our obfuscation is to hide code information, like jump instructions, in the data area (the stack) among other data elements, thus obscuring the program code. The code information is hidden in the data area at obfuscation time. The information is stored in the stack and hence looks like ordinary variables defined in the function. It is hard for an adversary to distinguish it from ordinary variables by just analyzing the stack. Removing instructions from the code area, or camouflaging them as other instructions, makes the program semantically different. The code information stored in the data area is used to reconstruct the original code at runtime, so that the executed program is semantically equivalent to the original. This is achieved by inserting reconstruction instructions just above the original location, which reconstruct the original instruction at runtime. We now explain our algorithm in detail.
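The overall idea can be sketched on a toy model before the details are given. The sketch below is an assumption-laden illustration and not the paper's implementation: programs are lists of symbolic instructions, the stack is modelled as a separate pre-initialized list, and the `rebuild` pseudo-instruction and the `movi 0` camouflage are hypothetical stand-ins for the reconstruction instructions and the replacement instruction.

```python
def run(code, stack):
    """Execute a toy program; `rebuild` patches the next instruction from the stack."""
    code, pc, acc = list(code), 0, 0       # private copy: the code modifies itself
    while True:
        op, arg = code[pc]
        if op == "ret":
            return acc
        if op == "movi":
            acc = arg
        elif op == "addi":
            acc += arg
        elif op == "rebuild":               # reconstruct the camouflaged jump in place
            code[pc + 1] = ("jmp", stack[arg])
        pc = arg if op == "jmp" else pc + 1


def obfuscate(program):
    """Move every jmp target into the data area and camouflage the jmp itself."""
    # Each jmp gains one rebuild instruction above it, so later code shifts down.
    jumps_before, count = [], 0
    for op, _ in program:
        jumps_before.append(count)
        if op == "jmp":
            count += 1
    code, stack = [], []
    for op, arg in program:
        if op == "jmp":
            stack.append(arg + jumps_before[arg])     # relocated target lives in data
            code.append(("rebuild", len(stack) - 1))  # placed just above the jump
            code.append(("movi", 0))                  # camouflage for the removed jmp
        else:
            code.append((op, arg))
    return code, stack


# movi 1 ; jmp 3 ; addi 100 (skipped) ; addi 5 ; ret
PROGRAM = [("movi", 1), ("jmp", 3), ("addi", 100), ("addi", 5), ("ret", None)]
CODE, STACK = obfuscate(PROGRAM)

print(CODE)                                   # no jmp and no target address in the code
print(STACK)                                  # the target sits among the data instead
print(run(PROGRAM, []) == run(CODE, STACK))   # semantics are preserved at runtime
```

In this model the code list never contains the jump target; it appears only in `STACK`, which corresponds to the claim that the control flow cannot be recovered from the code area alone.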

A. Offline Obfuscation

This is the first phase of our obfuscation algorithm. The binary program is converted to its equivalent assembly program using PLTO (Pentium Link Time Optimizer) [40]. It is then analyzed to find suitable instructions to be obfuscated. Once the obfuscation is done, the assembly program is assembled back to binary.

1) Selecting Instruction to be Obfuscated: The first step of the algorithm is to identify which instructions have to be camouflaged. The trivial method is to pick instructions randomly from the code area. In our method, however, jump instructions are chosen to be camouflaged, for the following reasons.

Jump instructions decide the control flow of a procedure in the program. By obscuring the jump instructions in the procedure we thus obfuscate the control flow of the program. Instructions that give information about the control flow of the program help the adversary to understand the logic of the program easily.

Another motivation for camouflaging jump instructions is the scope it provides for inserting junk bytes in the program. Camouflaging jump instructions obscures the control flow of the program. This leads the disassembly tool to assume a wrong control flow for the program and makes it possible to add junk bytes between code blocks at locations that are unreachable. This increases the errors made when an adversary tries to reverse engineer the binary program.

2) Storing Target Address in the Stack: Once the instructions to be camouflaged are known, the space required in the stack to store their target addresses can be calculated. In the proposed method, one variable is allocated in the stack for each instruction to be camouflaged in a procedure. The number of instructions in the function that are going to be camouflaged is counted and the stack is expanded accordingly.

The expansion of the stack is possible with a small tweak in the assembly program. In the calling convention of ELF (Executable and Linkable Format) programs on x86 platforms, the stack allocation for a function is done by the function itself. All the functions start with the following instructions:

push ebp

mov ebp,esp

sub esp,8
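In this prologue the immediate operand of `sub esp,8` fixes the size of the function's stack frame. As a minimal sketch of how that immediate might be enlarged to make room for the stored target addresses, the hypothetical helper below rewrites the prologue text. It assumes one 4-byte slot per camouflaged jump (a 32-bit address), a decimal immediate, and no alignment requirements; `expand_stack` and `SLOT_SIZE` are illustrative names, not part of the paper's tooling.

```python
import re

SLOT_SIZE = 4  # assumption: one 32-bit target address per camouflaged jump


def expand_stack(prologue, camouflaged_jumps, slot_size=SLOT_SIZE):
    """Return a copy of `prologue` whose first `sub esp,N` reserves extra slots."""
    pattern = re.compile(r"^(\s*sub\s+esp\s*,\s*)(\d+)\s*$")
    patched, done = [], False
    for line in prologue:
        match = pattern.match(line)
        if match and not done:
            new_size = int(match.group(2)) + camouflaged_jumps * slot_size
            line = f"{match.group(1)}{new_size}"
            done = True
        patched.append(line)
    if not done:
        raise ValueError("no 'sub esp,N' instruction found in the prologue")
    return patched


PROLOGUE = ["push ebp", "mov ebp,esp", "sub esp,8"]

# Three jumps to camouflage: the 8-byte frame grows by 3 * 4 bytes.
print(expand_stack(PROLOGUE, camouflaged_jumps=3))
```

A real assembler listing may use hexadecimal immediates or enforce stack alignment, which this sketch deliberately ignores.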

Once the function is called, the base pointer of the caller function is pushed onto the stack. The current stack pointer is then stored as the new base pointer (for the called function). The first two assembly instructions in the code segment do exactly that. The third instruction is where the stack for the particular function is allocated. The size of the stack needed by the function in this particular case is 8 bytes. By modifying the value in the third instruction, the size of the stack for that particular function can be changed.

Once the instructions that are going to be obfuscated and their count are known, the stack is expanded accordingly, as described in the previous paragraph.

Since we are moving jmp instructions, the target address of the jump constitutes the code information. This target address is what we store in the data area.

Selecting the stack area to store the code information has an advantage over the global data area. The code information in the stack area is stored in a way similar to a local variable definition. Self modifying instructions use these variables to reconstruct the control flow. The way the variables are used in the program is similar to manipulating ordinary variables: loading the value from a variable into a register and analyzing the value. The variables of a function are used only by the instructions of that function. If, on the contrary, the global data area were used to store the code information, each function would use only those global variables that hold the control flow information of that particular function. A global variable used exclusively by a single function is suspicious, and an adversary may easily notice it.

Fig. 1 shows how the target address of a jmp instruction is stored in the stack area. The target address xxxx of the jmp instruction in the first block is stored in a stack variable.

Fig. 1. Storing code information in stack.

3) Obfuscating the jmp Instructions: The jmp instruction is ready to be obfuscated once its target address has been stored in the stack. The jmp instructions are replaced with another instruction rather than removed. Each jmp instruction is replaced by the following instruction:

mov eax, 0

The replacement of the jmp instruction with mov results in the loss of control flow information. The new instruction, mov, is an ordinary instruction and has no influence on the control flow of the program. When an automated disassembler tries to disassemble the program, it assumes that control simply flows to the address following the mov.

We chose mov to replace the jmp instructions because it is the most frequently used instruction in a program. It is possible to use other instructions instead of mov to camouflage the jmp instructions; the logic remains the same. Randomizing the choice of replacement instruction will increase the challenge the method poses to an adversary.
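The three compile-time steps of this subsection can be sketched on raw bytes as follows. This is only an illustrative model, not the PLTO-based implementation: the helper name `camouflage_jumps`, the 4-byte slot size, and the choice to store the 32-bit operand of `jmp rel32` as the "code information" are assumptions made for the example. It relies on the fact that `jmp rel32` (opcode 0xE9) and `mov eax, imm32` (opcode 0xB8) are both five bytes long, so the replacement does not shift any other instruction.

```python
import struct

JMP_REL32 = 0xE9      # opcode of jmp rel32
MOV_EAX_IMM32 = 0xB8  # opcode of mov eax, imm32
SLOT_SIZE = 4         # assumed size in bytes of one stack variable


def camouflage_jumps(code, jump_offsets, frame_size):
    """Replace each jmp with mov eax, 0 and collect the removed operands.

    Returns the patched bytes, the values destined for the new stack
    variables, and the enlarged immediate for `sub esp, <frame_size>`.
    """
    patched = bytearray(code)
    stack_slots = []
    for off in jump_offsets:
        if patched[off] != JMP_REL32:
            raise ValueError("no jmp rel32 at offset %d" % off)
        (rel32,) = struct.unpack_from("<i", patched, off + 1)
        stack_slots.append(rel32)
        # Same length as the jmp, so no other instruction moves.
        patched[off:off + 5] = bytes([MOV_EAX_IMM32, 0, 0, 0, 0])
    new_frame = frame_size + SLOT_SIZE * len(stack_slots)
    return bytes(patched), stack_slots, new_frame


# nop; jmp +0x10; nop, in a function that originally executes `sub esp, 8`
patched, slots, new_frame = camouflage_jumps(
    bytes.fromhex("90e91000000090"), [1], 8)
print(patched.hex(), slots, new_frame)
```

In this toy run the function would be emitted with `sub esp, 12` instead of `sub esp, 8`, and the single new stack variable would hold the removed operand.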

B. Runtime Deobfuscation

Camouflaging the instructions in the program as explained in the previous section changes the semantics of the program. Running the program in this state gives erroneous results and most probably crashes it. Hence, the program has to be changed back to its original form before it gets executed. In our method we do this dynamically at runtime with the help of self modifying code.

Reconstruction instructions, which reconstruct the jmp instruction at runtime, are inserted in a block that precedes the jmp instruction. The block should be a dominator block, which means it precedes the jmp instruction on all execution paths. The insertion of reconstruction instructions is shown in Fig. 2.

The first step is to change the opcode of the mov instruction to that of the jmp instruction. The opcode of the jmp instruction is 0xE9 and that of the mov instruction is 0xB8. We insert an instruction to XOR the contents of the address location of the mov instruction with 0x00000051 (0xB8 XOR 0x51 = 0xE9). This changes the instruction to jmp offset 0. The next step is to add the address offset stored in the data area to the instruction. We insert an instruction to add the value in the local variable to the instruction at that address. Now the exact jmp instruction is created at the address location of the mov instruction.

In Fig. 2, the camouflaged jmp instruction is at address location A1 in basic block B1. The jmp instruction is camouflaged into a mov instruction and the reconstruction instructions are added before the camouflaged instruction.
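The two reconstruction steps can be checked on a byte buffer. The sketch below simulates what the inserted instructions would do to memory; the function name `reconstruct_jmp` is hypothetical, and placing the added value in the four operand bytes at A1+1 is an assumption about how "adding the value to the instruction" is realized.

```python
import struct


def reconstruct_jmp(code, a1, stored_offset):
    """Rebuild `jmp rel32` over the `mov eax, 0` at offset a1, in place."""
    # Step 1: xor dword [A1], 0x00000051. Only the opcode byte changes,
    # because the little-endian dword 0x00000051 has its 0x51 in byte A1.
    (dword,) = struct.unpack_from("<I", code, a1)
    struct.pack_into("<I", code, a1, dword ^ 0x00000051)
    # Step 2 (assumed placement): add the stack variable to the operand.
    (operand,) = struct.unpack_from("<I", code, a1 + 1)
    struct.pack_into("<I", code, a1 + 1,
                     (operand + stored_offset) & 0xFFFFFFFF)


# nop; mov eax, 0; nop, with the removed operand 0x10 kept in a variable
block = bytearray.fromhex("90b80000000090")
reconstruct_jmp(block, 1, 0x10)
print(block.hex())
```

After the call the buffer again holds `nop; jmp +0x10; nop`, the form it had before camouflaging.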

C. Runtime Reobfuscation

With the reconstruction instructions in place, the program semantics are restored and the program works correctly. However, the obfuscated instructions are now restored to their original form. An adversary who tracks the image of the program at regular intervals will be able to find the de-obfuscated instructions. A core dump of the image of the program will give the instructions in their true form if it is taken after the reconstruction operations.

One way to address this problem is to reobfuscate the instruction at runtime after its execution. This is achieved by adding extra reobfuscation instructions in the succeeding blocks to change the jmp instruction back to mov. Note that the reobfuscation instructions should be inserted in all the successor blocks, as the execution path is chosen dynamically at runtime. Reobfuscation is done by XOR-ing the jmp instruction with 0x00000051 to get the instruction:

mov eax, 0

According to the control flow of the example in Fig. 2, the basic block B3 follows after the execution of the jmp instruction. The reobfuscation instructions for the program are hence added at the beginning of the basic block B3.
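Because XOR is its own inverse, the same constant that turned 0xB8 into 0xE9 turns 0xE9 back into 0xB8. The following sketch models the reobfuscation on a byte buffer. The text names only the XOR; the extra step that clears the operand by subtracting the stack variable is an assumption added here so that the result is exactly `mov eax, 0`, and `reobfuscate_jmp` is a hypothetical name.

```python
import struct

KEY = 0x00000051  # 0xE9 ^ 0x51 == 0xB8


def reobfuscate_jmp(code, a1, stored_offset):
    """Turn the jmp at offset a1 back into `mov eax, 0`, in place."""
    # Assumed step: remove the operand that reconstruction added at A1+1.
    (operand,) = struct.unpack_from("<I", code, a1 + 1)
    struct.pack_into("<I", code, a1 + 1,
                     (operand - stored_offset) & 0xFFFFFFFF)
    # xor dword [A1], 0x00000051 restores the mov opcode.
    (dword,) = struct.unpack_from("<I", code, a1)
    struct.pack_into("<I", code, a1, dword ^ KEY)


block = bytearray.fromhex("e910000000")  # jmp +0x10, as left by reconstruction
reobfuscate_jmp(block, 0, 0x10)
print(block.hex())
```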

D. Junk Bytes Insertion

The replacement of the jmp instruction with a mov instruction opens space for inserting junk bytes into the code section. Insertion of junk bytes introduces more confusion to the disassembler [41], resulting in more instruction disassembly errors during the disassembly process.

Fig. 2. Obfuscation of jmp instructions.

Since the jmp instruction is replaced by a mov instruction, the disassembler will assume that control flows directly from the mov instruction to the next instruction. This lets us introduce junk bytes after the mov instruction. Partial junk bytes are introduced, as discussed in [41], to achieve maximum confusion.

Another effect of the insertion of junk bytes is that there can be wrong jmp instructions in the junk byte region, which will confuse the disassembler further.

Fig. 3 shows how the junk bytes are introduced in the program. The existence of junk bytes corrupts the disassembly of the original code in the program too, since partial junk bytes of an instruction are added.

In case the jmp instruction is part of a loop, a new basic block is added to the loop edge and the reconstruction instructions are added in that block, as shown in Fig. 4.
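The effect of a partial junk byte placed after the camouflaged jmp can be illustrated with a toy disassembler. The sketch below is a deliberately small model, not IDAPro's algorithm: it knows only the lengths of six real one-byte x86 opcodes, decodes linearly, and treats any other byte as a one-byte instruction. The junk byte 0x68 (the opcode of `push imm32`) is an arbitrary choice for the example.

```python
# Instruction lengths for a handful of real x86 opcodes (32-bit mode).
LENGTHS = {
    0x90: 1,  # nop
    0x40: 1,  # inc eax
    0xC3: 1,  # ret
    0xB8: 5,  # mov eax, imm32
    0xE9: 5,  # jmp rel32
    0x68: 5,  # push imm32
}


def linear_sweep(code):
    """Offsets that a naive fall-through disassembler takes as starts."""
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        pc += LENGTHS.get(code[pc], 1)  # unknown byte: skip one byte
    return starts


def missed_fraction(true_starts, found_starts):
    """Fraction of real instruction addresses the disassembler missed."""
    missed = set(true_starts) - set(found_starts)
    return len(missed) / len(true_starts)


# mov eax, 0 (camouflaged jmp) | junk 0x68 | inc eax x4 | ret
code = bytes.fromhex("b800000000" "68" "40404040" "c3")
true_starts = [0, 6, 7, 8, 9, 10]   # the junk byte at 5 is not real code
found = linear_sweep(code)
print(found, missed_fraction(true_starts, found))
```

Since the mov falls through, the decoder reads the junk byte as the start of a five-byte instruction and swallows the four real instructions that follow it.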

E. Conditional Jump Instructions

Conditional jump instructions such as jle (jump if less than or equal), jge (jump if greater than or equal), jz (jump if zero), and jg (jump if greater than) also contribute to the control flow of the procedures in a program. These instructions can be obfuscated in a way similar to unconditional jump instructions: they can be camouflaged using other ordinary instructions, and the target address can be stored in the stack. The problem is that the insertion of junk bytes, which is responsible for confusing the disassembler and increasing the instruction disassembly error, cannot be done with conditional instructions.

Fig. 3. Junk byte insertion.

Fig. 4. Reobfuscation in loops.

The basic reason junk byte insertion is difficult with conditional instructions is that the instruction following the conditional jump instruction is a valid instruction point. Inserting junk bytes at that point will corrupt the program.

To take care of this condition, our method deals with conditional jumps in a different manner, so as to get better obfuscation.

In this method a junk byte is added just above the conditional jump instruction. This junk byte should be a partial byte of an instruction, as explained in [41]. The junk byte combines with the initial bytes of the conditional jump instruction, corrupting the disassembly of the jump instruction and of a few instructions after it. In the example shown in Fig. 5, 10h is the junk byte added above the jump instruction, and the instruction adc [esi], bh will be seen when the program is disassembled.

Fig. 5. Junk byte addition to obfuscate conditional jumps.

Fig. 6. Obfuscation of conditional jump instructions.

The semantics of the program are changed by this insertion of the junk byte, and that is handled by self modifying code. Reconstruction instructions are added just as in the case of unconditional jumps. In this case, however, the reconstruction instructions convert the junk byte into a nop (no operation) instruction, so the semantics of the program remain the same during runtime. The or instruction in B1 of Fig. 6 converts the junk byte 0x10 to 0x90, the opcode of the nop instruction.

Similar to the case of unconditional jumps, reobfuscation instructions are added in all the successor blocks. In this case, the reobfuscation instructions convert the nop instruction back to the junk byte. The and instructions in B2 and B3 of Fig. 6 convert 0x90, the opcode of the nop instruction, to 0x10.
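The byte arithmetic behind the or and and instructions can be verified directly. The operands of those instructions are not given in the text; the masks 0x80 and 0x7F below are assumptions chosen because they map 0x10 to 0x90 and back, and the function names are hypothetical.

```python
JUNK_BYTE = 0x10   # junk byte from the Fig. 5 example
NOP = 0x90         # opcode of nop
SET_MASK = 0x80    # assumed operand of the or instruction
CLEAR_MASK = 0x7F  # assumed operand of the and instruction


def reconstruct(code, junk_at):
    """or byte [junk_at], 0x80: the junk byte becomes a nop."""
    code[junk_at] |= SET_MASK


def reobfuscate(code, junk_at):
    """and byte [junk_at], 0x7F: the nop becomes the junk byte again."""
    code[junk_at] &= CLEAR_MASK


# junk byte followed by jle +5 (opcode 0x7E, 8-bit displacement)
block = bytearray([JUNK_BYTE, 0x7E, 0x05])
reconstruct(block, 0)
after_reconstruct = bytes(block)
reobfuscate(block, 0)
after_reobfuscate = bytes(block)
print(after_reconstruct.hex(), after_reobfuscate.hex())
```

With these masks both operations leave the byte unchanged when applied a second time, so executing either of them repeatedly inside a loop would be harmless.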

F. Indirect Jump Instructions

Indirect jump instructions also contribute to the control flow of a program. In an indirect jump instruction, the address to which control is transferred is stored in a register or a memory location. For example, jmp eax is an indirect jump instruction, where control is transferred to the address stored in the register eax, as shown in Fig. 7.

Fig. 7. Control flow of indirect jump.

Indirect jump instructions can be obfuscated at compile time by camouflaging them with normal instructions. The camouflaged instructions can be reconstructed by adding reconstruction instructions above the camouflaged instruction. However, we have not considered indirect jump instructions for obfuscation, as it is difficult to reobfuscate them during runtime.

In the proposed algorithm, the reconstructed jump instructions are reobfuscated at runtime after the jump. This is done by adding reobfuscation instructions in the successor blocks. For indirect jumps, the target location of the jump depends on the value residing in the register or memory location used in the instruction and hence can change dynamically, so the successor blocks are not known statically. If we obfuscate an indirect jump instruction by camouflaging it at compile time and reconstructing it during runtime, it will therefore be a one-time obfuscation, as the reconstructed instruction cannot be reobfuscated.

Another problem arises when an indirect jump instruction jumps back, creating a loop. In this case the reconstruction instructions used to convert the camouflaged instruction to a jump instruction get executed again. So, the reconstruction instructions should be chosen in such a way that the indirect jump instruction is not affected when they are executed more than once. This limits the instructions that can be used as reconstruction instructions.
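The restriction on reconstruction instructions in a loop amounts to requiring that they be idempotent. The sketch below illustrates the difference on a single opcode byte. It uses the mov and jmp opcodes from the earlier sections as a stand-in, since the text does not specify an encoding for camouflaged indirect jumps; the function names are hypothetical.

```python
MOV, JMP = 0xB8, 0xE9


def xor_reconstruct(opcode):
    """xor with 0x51: flips between the two opcodes on every execution."""
    return opcode ^ 0x51


def and_or_reconstruct(opcode):
    """and with 0 followed by or with 0xE9: always ends at the jmp opcode."""
    return (opcode & 0x00) | JMP


def run_n_times(step, opcode, n):
    """Apply a reconstruction step n times, as a loop back-edge would."""
    for _ in range(n):
        opcode = step(opcode)
    return opcode


print(hex(run_n_times(xor_reconstruct, MOV, 2)))
print(hex(run_n_times(and_or_reconstruct, MOV, 2)))
```

A second execution of the XOR step without an intervening reobfuscation would camouflage the instruction again just before it runs, whereas the and/or pair yields the same byte however often it executes.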

G. Randomization to Improve Obfuscation

To improve the performance of the algorithm against an intelligent adversary, randomization is used. The XOR instruction is used in the reconstruction and reobfuscation process in the method. Other logical and arithmetic instructions can be used to achieve the same result. An adversary who tries to find the obfuscated points by filtering on a specific instruction, such as XOR in this case, will not succeed if randomly chosen instructions are used to reconstruct and reobfuscate the program.

An equivalent operation, which gives the effect of the XOR (xor (A1), 0x00000051) instruction during dynamic reconstruction of the jmp instruction, is the following:

And(A1), 0x00000000

Add(A1), 0x000000E9

The mov instruction (0x000000B8) at the address location A1 will be converted to 0 by the and instruction, and then adding 0x000000E9 at the address location will reconstruct the jmp instruction. Using an or instruction instead of add will also give the same result:

And(A1),0x00000000

Or(A1),0x000000E9
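The equivalence of the three reconstruction sequences can be checked on the 32-bit value at A1. The sketch below also shows one possible way to pick a sequence at random for each obfuscated jump; the selection scheme, the seed, and the function names are assumptions, not the paper's implementation.

```python
import random

MASK32 = 0xFFFFFFFF
MOV_DWORD = 0x000000B8  # first four bytes of mov eax, 0 (little-endian dword)
JMP_DWORD = 0x000000E9  # first four bytes of jmp with a zero operand


def via_xor(d):
    return d ^ 0x00000051


def via_and_add(d):
    return ((d & 0x00000000) + 0x000000E9) & MASK32


def via_and_or(d):
    return (d & 0x00000000) | 0x000000E9


VARIANTS = [via_xor, via_and_add, via_and_or]


def pick_variant(rng):
    """Choose the reconstruction sequence for one obfuscated jump."""
    return rng.choice(VARIANTS)


rng = random.Random(2013)  # fixed seed only to make the example repeatable
chosen = pick_variant(rng)
print(chosen.__name__, hex(chosen(MOV_DWORD)))
```

Note that the and with zero clears all four bytes at A1, so in this model it has to run before the stored offset is added to the operand.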

Similarly, during junk byte insertion it is better to select a junk byte at random from the set of junk bytes rather than using the same junk byte for all the insertions.

Another randomization that can be done is the selection of the basic blocks in which the reconstruction and reobfuscation instructions are inserted. The only condition on the block in which reconstruction instructions are added is that it should be a dominator [42] of the block containing the obfuscated instruction. Similarly, the reobfuscation instructions should be in a post-dominator [42] block.
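The candidate blocks for reconstruction instructions are the dominators of the block holding the camouflaged instruction. A minimal sketch of the standard iterative dominator computation is given below; the control flow graph, its block names, and the dictionary representation are made up for the example and do not correspond to any figure in the paper.

```python
def dominators(cfg, entry):
    """Map each block to the set of blocks that dominate it.

    cfg maps every block name to the list of its successor blocks.
    """
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if preds[n]:
                new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            else:
                new = {n}  # unreachable block: dominated only by itself
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


# Hypothetical diamond-shaped CFG: B0 branches to B1 and B2, both reach B3.
cfg = {"B0": ["B1", "B2"], "B1": ["B3"], "B2": ["B3"], "B3": []}
dom = dominators(cfg, "B0")
print(sorted(dom["B3"]))
```

For a camouflaged jump in B3, only B0 and B3 itself qualify here, since B1 and B2 each lie on just one of the two paths. Post-dominators can be obtained in the same way by running the function on the reversed graph, starting from the exit block.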

V. IMPLEMENTATION

The proposed obfuscation is carried out at link time of the compilation process. The implementation takes a binary program as input and gives out an obfuscated binary program as output. The development platform used is the GNU/Linux operating system, and the input binary files are expected in the Executable and Linkable Format (ELF). For the implementation of our algorithm at link time, PLTO, the Pentium Link Time Optimizer [40], was used.

The input binary program is fed to PLTO, which creates the control flow graph of the program. The control flow graph thus generated is scanned to find candidate instructions to be obfuscated. Each function of the program is scanned block by block to find the unconditional jump instructions.

Once the count of the jump instructions that are going to be obfuscated is finalized, the size of the stack is expanded. The local variables of each function are stored in the stack. The activation record for each function is of constant size, defined at the beginning of the function. It has the space required for storing local variables, parameters and the return value. Every time a function is called, this constant space in the stack is allotted for the function. Since our method stores the code information as variables in the stack, this stack size has to be expanded. The code in the function which defines the required stack size is modified according to the requirement. With this modification, when the function is called it pushes the stack pointer further, thus incorporating the space for the new local variables used to store the control flow information.

For each function in the program, obfuscation is done in three rounds. In the first round all the unconditional jumps are handled. Junk byte insertions at locations after unconditional jumps are done in the second round. Conditional jumps are handled in the third round of the algorithm. The process repeats for all the functions.

The exact sequence of implementation in the first round is as follows. The target address of each unconditional jump instruction is extracted from the instruction and stored in the local variable. The jmp instruction is then replaced with a mov instruction. The basic blocks in which the reconstruction instructions and reobfuscation instructions have to be inserted are calculated. Reconstruction and reobfuscation instructions, which use the variable where the address is stored, are inserted in the respective basic blocks. The successor block of the jmp instruction is flagged as a candidate block for junk byte insertion.

In the second round, all the basic blocks which are flagged as candidate blocks for junk byte insertion are visited and, from the set of junk bytes, which are partial instructions, a randomly chosen junk byte is added to the beginning of the basic block.

The third round in the implementation is similar to the first round. Each basic block with a conditional jump instruction is visited. The junk byte to be inserted is randomly chosen and is stored in the variable in the stack to be used by the reconstruction and reobfuscation instructions. The junk byte is then inserted just above the conditional jump instruction. The basic blocks in which the reconstruction instructions and reobfuscation instructions have to be inserted are calculated. Instructions which convert the junk bytes to nop instructions are inserted in the basic block for reconstruction instructions. The instructions for converting nop back to the junk byte are inserted in the basic blocks for reobfuscation instructions.

The obfuscated program should have the right permissions for the reconstruction and reobfuscation instructions to modify the program code area. We introduce system calls in the program so that write permissions can be given when necessary. The sys_mprotect system call is called at the beginning of a function, with flags to enable write permission to the necessary program code area. The write permissions are disabled by calling the sys_mprotect system call at the end of the procedure.

Enabling write permissions to the entire code area for self modification may lead to the risk of code injection attacks. Hence, we use the sys_mprotect system call in the program to enable write permissions only for the address locations that need to be modified. But enabling write permissions for just the address locations to be modified would reveal the areas of self modification to the adversary. So, a tradeoff has to be made between giving write permissions to the entire code area and giving them to the exact address locations, which gives the adversary information about the self modifying addresses. Thus, in our current implementation, as a compromise, the sys_mprotect system calls are added at the beginning of a function and at the exit blocks of a function. When a function call is made, the sys_mprotect system call gets executed and enables write permission to the function code area, thereby allowing the reconstruction instructions to write to it. The write permissions are disabled again by the sys_mprotect system call at the exit point of the function, just before the function returns. This makes sure that the write permissions are active only while a function is being executed.

The whole program, which is in the intermediate control flow representation in the PLTO framework, is then recompiled to a binary executable.
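One practical detail of the permission scheme is that mprotect operates on whole pages and requires a page-aligned start address, so making a function's code writable in practice makes every page it touches writable. The sketch below computes the range that would have to be passed for a given function. The 4096-byte page size, the example addresses, and the helper name `mprotect_range` are assumptions for illustration only.

```python
PAGE_SIZE = 4096  # assumed page size; the real value is system dependent


def mprotect_range(func_start, func_size, page_size=PAGE_SIZE):
    """Return (aligned_start, length) covering the function's code pages."""
    start = func_start & ~(page_size - 1)
    end = (func_start + func_size + page_size - 1) & ~(page_size - 1)
    return start, end - start


# A 0x40-byte function that sits inside one page ...
print([hex(v) for v in mprotect_range(0x08048123, 0x40)])
# ... and one that straddles a page boundary.
print([hex(v) for v in mprotect_range(0x08048FF0, 0x40)])
```

The second call covers two pages, which shows that the granularity of the writable region is coarser than the function itself.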

VI. PERFORMANCE EVALUATION

In this section we evaluate the performance of the proposed algorithm against static and dynamic analysis. The three performance measures used are potency, cost and stealth, as explained in [16]. Potency is a measure of the strength of the obfuscation algorithm; it measures how well the obfuscation performs against automatic deobfuscators. The instruction disassembly error and the control flow errors in the assembly program recovered by static analysis give the potency measure against static analysis. The percentage of original instructions that were not obtained by static analysis but were gained through dynamic analysis gives the potency of the algorithm against dynamic analysis. The cost of obfuscation can be measured in terms of the program size overhead and the execution time overhead due to obfuscation. Stealth of obfuscation measures the difficulty of identifying whether the program is obfuscated or not. We measure the stealth of the obfuscation as the Mahalanobis distance between the original program and the obfuscated program [31].

We evaluated the efficacy of the obfuscation with programs from the SPECint-2006 benchmark suite. The major reason for using the SPECint-2006 benchmark programs was to compare with other obfuscation algorithms in the literature: these programs are used by both algorithms considered for comparison (the signal based algorithm and the self modifying code based algorithm). The evaluation results are similar when applied to other C programs. Our evaluation platform is a 2.6 GHz Pentium system with 2 GB of main memory. The operating system used is the Ubuntu distribution of GNU/Linux. The compiler used is gcc version 3.4 at optimization level . The disassembly results from IDAPro [1], version 5.2.0.911, are used for the performance evaluation.

A. Potency Against Static Analysis

Potency measures the performance of the obfuscation algorithm, that is, how well the obfuscation performs against automatic reverse engineering tools. For statically analyzing the binary program we used IDAPro [1], a professional reverse engineering tool. The binary program is disassembled using IDAPro. The program thus disassembled is compared with the original program to find the instruction disassembly error and the control flow error caused by our obfuscation.

1) Instruction Disassembly Errors: We evaluate the instruction disassembly error with the confusion factor. The confusion factor is the fraction of instruction addresses that the disassembler fails to identify [22]. If $I_{\text{total}}$ is the total number of actual instruction addresses before obfuscation and $I_{\text{recognized}}$ is the total number of instruction addresses properly recognized by the disassembler, then the confusion factor $CF$ is defined by

$$CF = \frac{I_{\text{total}} - I_{\text{recognized}}}{I_{\text{total}}} \qquad (1)$$

Table I shows the instruction confusion factor of the SPECint-2006 programs. The average instruction disassembly error of the SPECint-2006 programs is 79.78%. This means that the disassembler succeeds in recovering only 20.22% of the instructions properly, on average.

TABLE I. INSTRUCTION DISASSEMBLY ERROR

2) Control Flow Disassembly Errors: We calculated the number of conditional and unconditional jump instructions in the program before and after the obfuscation. The confusion factor in the control flow of the program is defined from the total number of conditional and unconditional jump instructions in the program and the total number of jump instructions in the obfuscated program:

(2)

The ratio gives the control flow confusion caused by the obfuscation.

TABLE II. CONTROL FLOW ERRORS

The error in identifying the control flow instructions in the program gives us an account of the control flow obfuscation attained by the algorithm. Table II shows the number of control flow instructions in the original program and the obfuscated program and their ratio.

Even after obfuscation, on average 38.65% of jump instructions are available to an adversary for analysis. In some cases, such as indirect jumps, the target address is stored in a register or a memory address. We cannot determine the target address of these jumps statically at obfuscation time; hence these jumps are excluded from obfuscation. The jump instructions in the library functions included in the program are also not modified.

B. Potency Against Dynamic Analysis

In this section we discuss one of the dynamic analyses that can be carried out against our obfuscation technique and how effective our method is against it.

One method of dynamic analysis of a program is to execute the program, break after each instruction execution, and track the instruction. This will definitely give the adversary the correct instructions in their true form, but only those instructions on that execution path. Disassembling just one execution path of the program is the major limitation of this method of dynamic analysis. There can be an infinite number of execution paths in a computer program due to loops, and tracking all the execution paths is really difficult.

Another approach to analyzing a program containing self modifying code is to core dump the image of the program while it is executing. This approach is more realistic than the earlier one. The adversary statically disassembles the program first and then runs the program. The program is stopped at some point and the image of the program is dumped. The dynamic binary image of the program is disassembled by a static disassembler. The statically disassembled binary and the disassembly of the dynamic binary image are then compared to see if there are any differences. If the program has modified itself during execution, the modified regions will be exposed in the dynamic binary image. An adversary can easily point out those changed areas in the code by simply comparing it with the statically disassembled program.

We chose the latter method for dynamic analysis, as our obfuscation method has self modifying code and this method is more realistic. We implemented the dynamic analyzer using Python and ald, the assembly language debugger [30]. Python was used to call ald and give instructions to it. The binary program is disassembled using ald before it starts executing, and the result is stored in a file. The section addresses of the program are available through ald's file secinfo command. The address range of the code area is thus known to the automated software. The Python program randomly picks an address from this area. It then instructs ald to run the binary program and break at the randomly chosen address. Once ald successfully breaks, the image of the program in ald is disassembled again to get the new assembly code. This is stored in another file and compared with the first file for differences.

We tried this method on the SPECint-2006 programs with 5000 to 100,000 breaks. By 5000 breaks we mean that the Python program forces ald to break at 5000 random locations while running, disassembles the program after each break, and compares it with the statically disassembled program. If an obfuscated jump instruction is revealed during this process, it is additional information for the adversary, on top of the static analysis information he already has.

The dynamic analysis results for the SPECint-2006 programs are shown in Fig. 8. The number of breaks during the dynamic analysis and the number of jump instructions revealed form the axes of the graph.

Fig. 8. Jump instructions revealed during dynamic analysis.

The graph gives the absolute number of jump instructions revealed during dynamic analysis of the SPECint-2006 programs. For example, 52 obfuscated jump instructions of sjeng were revealed by dynamic analysis with 100,000 breaks.

Table III gives a relative perspective on the dynamic analysis, unlike the absolute information in Fig. 8. It is clear from Fig. 8 that the number of jump instructions revealed increases as the number of breaks used in the dynamic analysis increases. Table III shows the percentage of obfuscated jump instructions revealed by dynamic analysis with 100,000 breaks, i.e., the number of jump instructions that are revealed out of the total number of jump instructions that are obfuscated.

TABLE III. PERCENTAGE OF JUMP INSTRUCTIONS REVEALED

The percentage of obfuscated jump instructions revealed by dynamic analysis is less than 1%. The dynamic analysis with 100,000 breaks yields only a small amount of additional information over static analysis.

Dynamic analysis is more time consuming than static analysis. Whereas static disassembly of the SPECint-2006 programs takes minutes to complete, dynamic analysis takes hours. Fig. 9 shows the time taken for dynamic disassembly of the SPECint-2006 programs and how it varies with the number of breaks.

Fig. 9. Time taken for dynamic analysis.

The time taken for dynamic analysis is considerably larger than that of static analysis. For instance, in the case of 5000 breaks, ald has to disassemble the program 5000 times. Dynamic analysis of the bzip2 program with 100,000 breaks takes 38.4 hours.
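The comparison step of the dump-based analysis described in this subsection can be sketched as a byte-level diff. This is a simplification: the analyzer in the paper compares disassembly listings produced by ald, whereas the sketch below compares raw code bytes, and the function names and sample images are made up for the example.

```python
def changed_offsets(static_image, dumped_image):
    """Offsets at which the dumped code differs from the static code."""
    return [i for i, (a, b) in enumerate(zip(static_image, dumped_image))
            if a != b]


def revealed_jumps(static_image, dumped_image, camouflaged_offsets):
    """Camouflaged jumps that appear in their jmp form in this dump."""
    return [off for off in camouflaged_offsets
            if static_image[off] == 0xB8 and dumped_image[off] == 0xE9]


# Two camouflaged jumps at offsets 1 and 7; the break happened after the
# first was reconstructed and before it was reobfuscated.
static_image = bytes.fromhex("90b80000000090b800000000")
dumped_image = bytes.fromhex("90e91000000090b800000000")
offsets = [1, 7]

revealed = revealed_jumps(static_image, dumped_image, offsets)
print(changed_offsets(static_image, dumped_image), revealed,
      100.0 * len(revealed) / len(offsets))
```

Accumulating the revealed offsets over many randomly placed breaks gives the kind of count plotted against the number of breaks, and dividing by the number of obfuscated jumps gives a percentage of the kind reported per program.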

C. Cost of Obfuscation

In this section we discuss the overheads caused by the obfuscation. The two overheads are the increase in the size of the program and the increase in its execution time.

1) Program Size Overhead: Obfuscation affects the size of the program. The increase in the size of the program is defined by

(3)

in terms of the sizes of the code section before and after obfuscation and, similarly, the sizes of the data section before and after obfuscation.

TABLE IV. SPACE OVERHEAD

One advantage of our algorithm is that the program size does not bloat too much after obfuscation. Table IV shows the space overhead caused by our algorithm.

The average increase in program size after obfuscation is 2.2 times the original size. The increase in size is due to two reasons. The first is the reconstruction instructions added to the program: for each jump instruction removed from the program, additional instructions are added to reconstruct the instruction at runtime. Similarly, instructions are added to dynamically reobfuscate the de-obfuscated jump instruction.

The second reason is the insertion of junk bytes to achieve more instruction disassembly errors. Junk bytes are added in the block succeeding the removed jmp instruction. This also contributes to the increase in the size of the program.

2) Time Overhead: Obfuscation affects the execution time of the program. With the insertion of new instructions, more instructions are executed at runtime. In this section we discuss the increase in execution time due to obfuscation. We evaluate the effect of obfuscation on execution speed with the measure defined as

(4)

in terms of the execution time of the original file and that of the obfuscated code. Table V shows the time overhead created by our algorithm.

TABLE V. TIME OVERHEAD

D. Stealth Analysis

The stealth of obfuscation measures the difficulty of identifying whether a binary is obfuscated or not. To measure the stealth of our obfuscation we measured the Mahalanobis distance between the obfuscated program and normal binary samples, as discussed in [31]. Mahalanobis distance is a common measure used for statistics based malware detection [43], [44]. It is the dissimilarity measure between two random vectors whose components are scalar-valued on the same probability space [45].

Fig. 10. Comparison of obfuscated and normal binaries.

TABLE VI. MAHALANOBIS DISTANCE OF TEST PROGRAMS

To measure the stealth of our obfuscation we calculated the Mahalanobis distance of the obfuscated binary and 248 normal binaries, which in its standard form is

$$D_M = \sqrt{(x - \mu)^{T}\, \Sigma^{-1}\, (x - \mu)} \qquad (5)$$

where $\mu$ is a vector each component of which is the probability mass function (pmf) of an opcode's mean occurrence in the normal binary programs. The pmf of each opcode's occurrence in the obfuscated binary makes up the vector $x$. $\Sigma$ is the diagonal matrix made up of the variances of the mean values of each opcode for the normal binary programs. The greater the Mahalanobis distance $D_M$, the more abnormal the obfuscated binary. We calculated $D_M$ for the normal binaries and plotted the graph in Fig. 10. The $D_M$ of the obfuscated binaries were then calculated and plotted. We observe that the $D_M$ values of the obfuscated binaries are in the range of those of normal binaries.

Further, we calculated the Mahalanobis distance of binary programs without obfuscation and compared it with that of their equivalent obfuscated binaries. The Mahalanobis distance of the obfuscated binary programs is not far from that of the original binary programs. Table VI shows the Mahalanobis distance of binary programs before and after obfuscation.
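With a diagonal covariance matrix, the Mahalanobis distance reduces to a variance-weighted Euclidean distance between opcode frequency vectors. The sketch below computes it for a toy three-opcode vocabulary. The opcode sample, mean and variance values are invented purely to exercise the arithmetic and are not measurements from the paper; the function names are hypothetical.

```python
import math
from collections import Counter


def opcode_pmf(opcodes, vocabulary):
    """Relative frequency of each vocabulary opcode in one binary."""
    counts = Counter(opcodes)
    total = len(opcodes)
    return [counts[op] / total for op in vocabulary]


def mahalanobis_diag(x, mean, variance):
    """Mahalanobis distance for a diagonal covariance matrix."""
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, variance)))


vocabulary = ["mov", "jmp", "add"]
x = opcode_pmf(["mov", "mov", "jmp", "add"], vocabulary)  # [0.5, 0.25, 0.25]
mean = [0.4, 0.3, 0.3]            # made-up per-opcode means
variance = [0.01, 0.0025, 0.0025]  # made-up per-opcode variances
distance = mahalanobis_diag(x, mean, variance)
print(x, distance)
```

An opcode whose variance over the normal binaries is zero would cause a division by zero in this form, so a real implementation would need to drop such opcodes or regularize the variances.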

E. Comparison With Other Algorithms

The performance of the algorithm is compared with two algorithms, namely signal-based obfuscation [21] (SBC) and the self modifying code based algorithm [22] (SMC). Fig. 11 shows the comparison of our algorithm with the other two on the basis of instruction disassembly error, control flow error, and space and time overhead.

Fig. 11. Algorithm performance comparison.

TABLE VII. FEATURES OF PROPOSED METHOD

The instruction disassembly error and control flow errors achieved by our algorithm are better than those of the other two algorithms. The size and time overheads created by our algorithm are comparable to those of the other two algorithms.

Features of the proposed obfuscation algorithm in comparison with other obfuscation algorithms in the literature are shown in Table VII. The features are compared with signal based obfuscation [21], self modifying code based obfuscation [22] and mimimorphic obfuscation [46].

VII. CONCLUSION

In this paper we proposed a software obfuscation algorithm to increase the difficulty of reverse engineering binary programs. The control flow information of the program is removed from the code area, stored in the data area (stack), and reconstructed dynamically on demand. Junk bytes and randomization are used to make the disassembly process harder. The potency, stealth, and cost overheads of the algorithm are measured against static and dynamic analysis. The evaluation results show that the proposed method is effective in confusing professional disassemblers like IDAPro, is potent against dynamic analysis, and yields stealthy obfuscation. Compared with other algorithms like the signal-based approach [21], the proposed algorithm has better potency and cost effectiveness.

REFERENCES

[1] Data Rescue [Online]. Available: http://www.datarescue.com/ [Last accessed: Feb. 14, 2012]

[2] J. Miecznikowski and L. Hendren, “Decompiling Java using staged encapsulation,” in Proc. Eighth Working Conf. Reverse Engineering, 2001, pp. 368–374.

[3] Digital Law Online, Reverse Engineering [Online]. Available: http://digital-law-online.info/lpdi1.0/treatise25.html [Last accessed: Feb. 14, 2012]

[4] PR Newswire [Online]. Available: http://www.prnewswire.com/news-releases/siia-files-six-new-software-piracy-lawsuits-against-fraudulent-online-vendors-across-the-country-69854267.html [Last accessed: Feb. 14, 2012]

[5] Blizzard [Online]. Available: www.blizzard.com [Last accessed: Feb. 14, 2012]

[6] “Bnetd,” Wikipedia [Online]. Available: http://en.wikipedia.org/wiki/Bnetd [Last accessed: Feb. 14, 2012]

[7] A. Adamov and A. Saprykin, “The problem of Trojan inclusions in software and hardware,” in Proc. Design and Test Symp., 2010, pp. 449–451.

[8] P. Li, M. Salour, and X. Su, “A survey of internet worm detection and containment,” IEEE Commun. Surveys Tutorials, vol. 10, no. 1, pp. 20–35, First Quarter, 2008.

[9] M. Mannan and P. C. van Oorschot, “On instant messaging worms, analysis and countermeasures,” in Proc. ACM Workshop on Rapid Malcode, 2005, pp. 2–11.

[10] C. Li, W. Jiang, and X. Zou, “Botnet: Survey and case study,” in Proc. Innovative Computing, Information and Control (ICICIC), 2009, pp. 1184–1187.

[11] L. Zhang, S. Yu, D. Wu, and P. Watters, “A survey on latest botnet attack and defense,” in Proc. Int. Joint Conf. IEEE TrustCom-11/IEEE ICESS-11/FCST-11, 2011, pp. 53–60.

[12] W. Thompson, A. Yasinsac, and J. McDonald, “Semantic encryption transformation scheme,” in Proc. Int. Workshop on Security in Parallel and Distributed Systems, San Francisco, CA, USA, 2004.

[13] D. Aucsmith, “Tamper resistant software: An implementation,” in Proc. Int. Workshop on Information Hiding, 1996, pp. 317–333.

[14] D. Lie, C. Thekkath, M. Mitchell, P. Lincoln, D. Boneh, J. Mitchell, and M. Horowitz, “Architectural support for copy and tamper resistant software,” ACM SIGPLAN Notices, vol. 35, no. 11, pp. 168–177, Nov. 2000.

[15] J. Cappaert, B. Preneel, B. Anckaert, and M. Madou, “Towards tamper resistant code encryption: Practice and experience,” in Proc. Information Security Practice and Experience Conf., 2008, pp. 86–100.

[16] C. Collberg, C. Thomborson, and D. Low, “Manufacturing cheap, resilient, and stealthy opaque constructs,” in Proc. ACM Symp. Principles of Programming Languages, 1998, vol. 25, pp. 184–196.

[17] D. Hanchez, “A Comparative Study of Software Protection Tools Suited for e-Commerce With Contributions to Software Watermarking and Smart Cards,” Ph.D. Thesis, University of Catholique de Louvain, Louvain, Belgium, 2003.

[18] M. Sosonkin, G. Naumovich, and N. Memon, “Obfuscation of design intent in object-oriented applications,” in Proc. ACM Workshop on Digital Rights Management, 2003, pp. 142–153.

[19] W. F. Zhu, “Concepts and Techniques in Software Watermarking and Obfuscation,” Ph.D. Thesis, University of Auckland, Auckland, New Zealand, 2007.

[20] C. Collberg, C. Thomborson, and D. Low, A Taxonomy of Obfuscating Transformations, University of Auckland, Tech. Rep. 148, 1997 [Online]. Available: http://www.cs.arizona.edu/collberg/Research/Publications/CollbergThomborsonLow97a/LETTER.pdf [Last accessed: Feb. 14, 2012]

[21] I. V. Popov, S. K. Debray, and G. R. Andrews, “Binary obfuscation using signals,” in Proc. USENIX Security Symp., 2007, pp. 1–16.

[22] L. Shan and S. Emmanuel, “Mobile agent protection with self-modifying code,” J. Signal Process. Syst., vol. 65, pp. 105–116, 2010.

[23] R. Parameswaran and D. M. Blough, “Privacy preserving collaborative filtering using data obfuscation,” IEEE Granular Comput., pp. 380–387, 2007.

[24] C. Wang, J. Davidson, J. Hill, and J. Knight, “Protection of software-based survivability mechanisms,” Depend. Syst. Netw., pp. 193–202, 2001.


[25] V. Balachandran and S. Emmanuel, “Software code obfuscation by hiding control flow information in stack,” in Proc. IEEE Workshop on Information Forensics and Security, 2011, pp. 1–6.

[26] M. Madou, B. Anckaert, P. Moseley, S. Debray, B. De Sutter, and K. De Bosschere, “Software protection through dynamic code mutation,” Inf. Security Applicat., pp. 194–206, 2006.

[27] Y. Kanzaki, A. Monden, M. Nakamura, and K. Matsumoto, “Exploiting self-modification mechanism for program protection,” in Proc. Computer Software and Applications Conf., 2003, pp. 170–179.

[28] S. Bhatkar, D. DuVarney, and R. Sekar, “Address obfuscation: An efficient approach to combat a broad range of memory error exploits,” in Proc. USENIX Security Symp., 2003, pp. 105–120.

[29] C. Cifuentes and K. J. Gough, “Decompilation of binary programs,” Software: Practice and Experience, vol. 25, no. 7, pp. 811–829, 1995.

[30] Assembly Language Debugger [Online]. Available: http://ald.sourceforge.net/ [Last accessed: Feb. 14, 2012]

[31] B. Lee, Y. Kim, and J. Kim, “binOb+: A framework for potent and stealthy binary obfuscation,” in Proc. ACM Symp. Information, Computer and Communications Security, 2010, pp. 271–281.

[32] The dcc Decompiler [Online]. Available: http://itee.uq.edu.au/cristina/dcc.html [Last accessed: Feb. 14, 2012]

[33] M. Christodorescu and S. Jha, “Static analysis of executables to detect malicious patterns,” in Proc. USENIX Security Symp., 2003, p. 12-12.

[34] X. Hu, T. Chiueh, and K. G. Shin, “Large-scale malware indexing using function-call graphs,” in Proc. ACM Conf. Computer and Communications Security, 2009, pp. 611–620.

[35] N. Nethercote and J. Seward, “Valgrind: A framework for heavyweight dynamic binary instrumentation,” ACM SIGPLAN Notices, vol. 42, no. 6, pp. 89–100, 2007.

[36] Objdump, GNU Binary Utilities [Online]. Available: http://sourceware.org/binutils/docs/binutils/objdump.html [Last accessed: Feb. 14, 2012]

[37] Yuschuk, OllyDbg [Online]. Available: http://www.ollydbg.de [Last accessed: Feb. 14, 2012]

[38] Dumpbin, Microsoft Corporation [Online]. Available: http://support.microsoft.com/kb/177429 [Last accessed: Feb. 14, 2012]

[39] E. Eilam, Reversing: Secrets of Reverse Engineering. Hoboken, NJ, USA: Wiley, 2005.

[40] B. Schwarz, S. Debray, G. Andrews, and M. Legendre, “PLTO: A link-time optimizer for the Intel IA-32 architecture,” in Proc. Workshop on Binary Translation (WBT-2001), Barcelona, Catalunya, Spain, 2001.

[41] C. Linn and S. Debray, “Obfuscation of executable code to improve resistance to static disassembly,” in Proc. ACM Conf. Computer and Communications Security, 2003, pp. 290–299.

[42] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, 2nd ed. Reading, MA, USA: Addison-Wesley, 2006.

[43] M. Shafiq, S. Khayam, and M. Farooq, “Embedded malware detection using Markov n-grams,” in Proc. Detection of Intrusions and Malware, and Vulnerability Assessment, 2008, pp. 88–107.

[44] S. Stolfo, K. Wang, and W. J. Li, “Towards stealthy malware detection,” Malware Detection, pp. 231–249, 2007.

[45] S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th ed. New York, NY, USA: Academic, 2008, vol. 11, pp. 164–165.

[46] Z. Wu, S. Gianvecchio, M. Xie, and H. Wang, “Mimimorphism: A new approach to binary code obfuscation,” in Proc. ACM Conf. Computer and Communications Security (CCS ’10), 2010, pp. 536–546.

[47] X. Lai, J. Zhou, and H. Li, “Multi-stage binary code obfuscation using improved virtual machine,” Lecture Notes in Computer Science, vol. 7001, pp. 168–181, 2011.

[48] S. Martinez, Source Code Obfuscation by Mean of Evolutionary Algorithms, University of Luxembourg, Internship Report, Aug. 2012 [Online]. Available: http://dumas.ccsd.cnrs.fr/docs/00/72/53/30/PDF/Martinez.pdf [Last accessed: Dec. 4, 2012]

[49] C. LeDoux, M. Sharkey, B. Primeaux, and C. Miles, “Instruction embedding for improved obfuscation,” in Proc. 50th Ann. Southeast Regional Conf. (ACM-SE ’12), 2012, pp. 130–135.

[50] J. M. Memon, Shams-ul-Arfeen, A. Mughal, and F. Memon, “Preventing reverse engineering threat in Java using byte code obfuscation techniques,” in Proc. Int. Conf. Emerging Technologies, 2006, pp. 689–694.

Vivek Balachandran received the B.Tech. degree in computer science and engineering from Government Engineering College, Thrissur, India, in 2007, and the M.Tech. degree from National Institute of Technology, Calicut, India, in 2010. He was with inDSP audio technologies as an embedded software engineer in 2008. He is currently working toward the Ph.D. degree with the School of Computer Engineering, Nanyang Technological University, Singapore.

His current research interests include software security, software obfuscation, and program analysis.

Sabu Emmanuel received the B.E. degree in electronics and communication engineering from Regional Engineering College, Durgapur, India, in 1988, the M.E. degree in electrical communication engineering from the Indian Institute of Science, Bangalore, India, in 1998, and the Ph.D. degree in computer science from the National University of Singapore, Singapore, in 2002.

He is currently an Assistant Professor with the School of Computer Engineering, Nanyang Technological University, Singapore.

His current research interests include multimedia and software security and surveillance video processing.

Dr. Emmanuel has served as the Guest Editor for special issues of several journals and also Reviewer for several journals, such as Springer Multimedia Systems, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE TRANSACTIONS ON MULTIMEDIA, and the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY.