Discovering Similarity of Short Programs by Canonical Form Baohua Wu University of Pennsylvania.

Discovering Similarity of Short Programs by Canonical Form

Baohua Wu

University of Pennsylvania

Scenario

• Given a known malicious program P1 that exploits a security hole and an unknown suspicious program P2, how can we measure the similarity of P2 to P1?

• Given known polymorphic malicious programs P1, P2, …, Pn, how can we identify their common “fingerprints”?

Assumption

• Malicious programs are short in size, for example:
  – Scripts < 500 lines
  – Assembly code < 10 kilobytes

Obfuscation Techniques

• Dead-Code Insertion
  – NOP, CLI, STI, etc.
  – Complicated ones: inc/dec, push/pop

• Code Transposition
  – Add (unconditional) branches
  – Reorder independent instructions

Obfuscation Techniques

• Register Reassignment
  – Replace eax with ebx if ebx is unused in a live range
  – Prologue/epilogue code to swap registers

• Instruction Substitution
  – IA32 instruction set has many equivalent instructions

Obfuscation Techniques

• Data Modification
  – Replace a boolean variable with two integers
    • X becomes a < b

• Encryption
  – Polymorphic engine
  – Variable keys, algorithms, decryptors

Obfuscation Summary

• Changing instructions inside a basic block

• Changing control flows

• Dynamic code generation

• How to solve them?

Objective of Canonical Form of Programs

• Reducing polymorphism

• Identifying tokens for statistical analysis

Canonical Form of Programs

• Compact intermediate instructions
  – No or few alternative instructions

• Simplified programming model
  – Code segment: read only
  – Data segment: heap only (no stack, no registers)
  – No function calls except system calls
  – Conditional and loop instructions are kept

More about Canonical Form

• Encrypted code is processed in advance
  – Multiple phases of compilation
  – Or simply report it as suspicious

• No user-defined function calls
  – Recursive function elimination
  – Inline function expansion

• Code optimization by compiler techniques
  – No dead or useless code
  – No or few redundant common subexpressions
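The "no dead or useless code" goal can be illustrated with a small peephole pass. This is only a sketch under stated assumptions: instructions are modeled as hypothetical tuples `(opcode, operand, ...)`, only `nop` is treated as dead, and adjacent `inc`/`dec` or `push`/`pop` pairs on the same operand are assumed to cancel (side effects on CPU flags are ignored). The names `strip_dead_code`, `DEAD_OPCODES` and `CANCELLING` are illustrative and do not come from the slides.

```python
# Hypothetical instruction model: tuples of (opcode, operand, ...).
DEAD_OPCODES = {"nop"}
# Adjacent pairs assumed to cancel when applied to the same operand.
CANCELLING = {("inc", "dec"), ("dec", "inc"), ("push", "pop")}


def strip_dead_code(code):
    """Drop no-ops and cancelling instruction pairs, including nested ones."""
    out = []
    for instr in code:
        op, *args = instr
        if op in DEAD_OPCODES:
            continue
        if out:
            prev_op, *prev_args = out[-1]
            if (prev_op, op) in CANCELLING and prev_args == args:
                out.pop()  # the pair has no net effect in this model
                continue
        out.append(instr)
    return out


obfuscated = [
    ("mov", "eax", "1"),
    ("nop",),
    ("push", "ebx"),
    ("inc", "ecx"),
    ("dec", "ecx"),
    ("pop", "ebx"),
    ("ret",),
]
print(strip_dead_code(obfuscated))
```

Because the output list acts as a stack, the inner `inc`/`dec` pair is removed first, which then exposes the surrounding `push`/`pop` pair.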

More about Canonical Form

• For assembly programs, treat registers as variables
  – No limit on the number of registers
  – No unnecessary swapping instructions

• Rename variables in some total order (v1, v2, …)
  – Definition position in the program is a total order
    • But it may be changed by polymorphism
  – Main order by data dependency
  – Secondary order by variable type, length, name, def position

• Reorder interchangeable instructions in alphabetical order
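A minimal sketch of the renaming step, assuming a hypothetical tuple encoding of instructions. For simplicity it orders variables by first appearance only, which is the definition-position order the slide warns can be disturbed by polymorphism; the data-dependency ordering is not attempted here. `rename_variables` and `is_variable` are illustrative names.

```python
from typing import List, Tuple

Instr = Tuple[str, ...]  # (opcode, operand, operand, ...)


def is_variable(token: str) -> bool:
    """Assumption: any operand that is not an integer literal is a variable."""
    return not token.lstrip("-").isdigit()


def rename_variables(code: List[Instr]) -> List[Instr]:
    """Rename registers/variables to v1, v2, ... in order of first appearance."""
    mapping = {}
    out = []
    for op, *operands in code:
        renamed = []
        for tok in operands:
            if is_variable(tok):
                if tok not in mapping:
                    mapping[tok] = f"v{len(mapping) + 1}"
                tok = mapping[tok]
            renamed.append(tok)
        out.append((op, *renamed))
    return out


# Two variants that differ only by register reassignment.
variant_a = [("mov", "eax", "5"), ("add", "eax", "ecx")]
variant_b = [("mov", "ebx", "5"), ("add", "ebx", "edx")]
print(rename_variables(variant_a) == rename_variables(variant_b))  # True
```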

What else for polymorphism?

• Changes in algorithm
  – Not in my scope…

• Changes in control flow
  – Unconditional branch insertion
  – Combination of conditional branches
  – Exchanging inner and outer loops
  – Useless branches

Unconditional branch insertion

Original:

    A;
    B;
    C;

Obfuscated:

       goto 3;
    1: C;
       goto 4;
    2: B;
       goto 1;
    3: A;
       goto 2;
    4:
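One way to undo this transformation is to follow the unconditional jumps and emit statements in execution order. The sketch below assumes a hypothetical encoding of the program as `(kind, argument)` pairs, where `kind` is `"stmt"`, `"label"` or `"goto"`; it handles only unconditional branches, and `straighten` is an illustrative name.

```python
def straighten(program):
    """Return statements in execution order, following unconditional gotos."""
    labels = {arg: idx for idx, (kind, arg) in enumerate(program) if kind == "label"}
    out, pc, seen = [], 0, set()
    while pc < len(program):
        if pc in seen:
            raise ValueError("cycle of unconditional jumps")
        seen.add(pc)
        kind, arg = program[pc]
        if kind == "goto":
            pc = labels[arg]  # jump without emitting anything
            continue
        if kind == "stmt":
            out.append(arg)
        pc += 1  # labels and statements fall through
    return out


# The obfuscated program from the slide.
obfuscated = [
    ("goto", 3),
    ("label", 1), ("stmt", "C"), ("goto", 4),
    ("label", 2), ("stmt", "B"), ("goto", 1),
    ("label", 3), ("stmt", "A"), ("goto", 2),
    ("label", 4),
]
print(straighten(obfuscated))  # ['A', 'B', 'C']
```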

Combination of conditional branches

Original:

    If a < b Then A;
    Else B;
    If c < d Then C;
    Else D;

Obfuscated:

    If a < b and c < d
      Then A; C;
    Else if a < b and c >= d
      Then A; D;
    Else if a >= b and c < d
      Then B; C;
    Else B; D;
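Both forms yield the same four linear paths, which is why enumerating paths can remove this kind of polymorphism. The check below is a small sketch that assumes A, B, C and D do not modify a, b, c or d; the function names are illustrative.

```python
from itertools import product


def separate(a, b, c, d):
    """Two independent conditionals, as in the original form."""
    trace = []
    trace.append("A" if a < b else "B")
    trace.append("C" if c < d else "D")
    return trace


def combined(a, b, c, d):
    """One chained conditional, as in the obfuscated form."""
    if a < b and c < d:
        return ["A", "C"]
    elif a < b and c >= d:
        return ["A", "D"]
    elif a >= b and c < d:
        return ["B", "C"]
    else:
        return ["B", "D"]


# Values 0 and 1 are enough to cover every outcome of both comparisons.
inputs = list(product((0, 1), repeat=4))
print(all(separate(*v) == combined(*v) for v in inputs))  # True
```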

Exchanging inner and outer loops

Original:

    Sum(matrix a)
      For (i=0; i<10; i++)
        For (j=0; j<10; j++)
          sum += a[i][j];

Obfuscated:

    Sum(matrix a)
      For (j=0; j<10; j++)
        For (i=0; i<10; i++)
          sum += a[i][j];

Useless branches

Original:

         A;
         B;
         C;
         ...
    End: D;

With a useless branch inserted:

         A;
         If date<1900 Goto End;
         B;
         C;
         ...
    End: D;

Linearizing Control Flow

• So far, no semantics has been lost. Now it is different!

• Remove backward branches
  – Replace them (such as a loop) with repeated conditional statements
  – Number of repetitions is set to N (e.g. 2)

• Remove forward branches by enumerating possible combinations of executed branches

• Further change each path into canonical form

• CPS: Canonical Path Set
  – A Critical Canonical Path in a CPS is a sub-path of an actual execution path that causes damage
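The two linearizing steps can be sketched on a small structured representation. Everything about the encoding is an assumption made for illustration: a program is a list whose items are either a statement string, `("if", cond, then_body, else_body)` or `("loop", cond, body)`, and branch conditions are kept in each path as bracketed guard tokens. `unroll` and `paths` are hypothetical names.

```python
def unroll(nodes, n=2):
    """Replace every loop with n nested conditional copies of its body."""
    out = []
    for node in nodes:
        if isinstance(node, str):
            out.append(node)
        elif node[0] == "if":
            _, cond, then_b, else_b = node
            out.append(("if", cond, unroll(then_b, n), unroll(else_b, n)))
        else:  # ("loop", cond, body)
            _, cond, body = node
            body = unroll(body, n)
            nested = []
            for _ in range(n):
                nested = [("if", cond, body + nested, [])]
            out.extend(nested)
    return out


def paths(nodes):
    """Enumerate linear paths of a loop-free program (run unroll first)."""
    result = [[]]
    for node in nodes:
        if isinstance(node, str):
            result = [p + [node] for p in result]
        else:
            _, cond, then_b, else_b = node
            branches = [[f"[{cond}]"] + t for t in paths(then_b)]
            branches += [[f"[not {cond}]"] + e for e in paths(else_b)]
            result = [p + b for p in result for b in branches]
    return result


program = ["A", ("loop", "i<n", ["B"]), "C"]
for path in paths(unroll(program, n=2)):
    print(path)
```

With N = 2 the loop contributes zero, one or two executions of `B`, so this program has three linear paths. Iterations beyond N are dropped, which is where semantics is lost.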

Similarity of Canonical Programs

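• P1 is a known malicious program

• P2 is an unknown program

• Similarity(P1, P2) combines two directed scores (how they are combined is not legible in the source):

\[ \frac{1}{|CPS(P1)|} \sum_{i \in CPS(P1)} \max_{j \in CPS(P2)} \mathrm{PathSim}(i, j) \]

\[ \frac{1}{|CPS(P2)|} \sum_{j \in CPS(P2)} \max_{i \in CPS(P1)} \mathrm{PathSim}(i, j) \]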
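As a rough sketch, Similarity(P1, P2) can be computed from two directed scores: for each path in one canonical path set, take its best PathSim match in the other set, then average over the first set. Taking the mean of the two directed scores is an assumption made here for illustration, not something the slides confirm. `path_sim` is passed in as a parameter, and `exact_match` is a toy stand-in for it.

```python
def similarity(cps1, cps2, path_sim):
    """Mean of two directed best-match averages (the mean is an assumption)."""
    if not cps1 or not cps2:
        raise ValueError("both canonical path sets must be non-empty")
    # Average, over paths i of P1, of the best match among paths j of P2.
    p1_to_p2 = sum(max(path_sim(i, j) for j in cps2) for i in cps1) / len(cps1)
    # Average, over paths j of P2, of the best match among paths i of P1.
    p2_to_p1 = sum(max(path_sim(i, j) for i in cps1) for j in cps2) / len(cps2)
    return (p1_to_p2 + p2_to_p1) / 2


def exact_match(path_a, path_b):
    """Toy PathSim: 1.0 for identical paths, 0.0 otherwise."""
    return 1.0 if path_a == path_b else 0.0


cps_p1 = [("A", "B"), ("A", "C")]
cps_p2 = [("A", "B")]
print(similarity(cps_p1, cps_p2, exact_match))  # 0.75
```

Here only one of the two P1 paths has a match in P2 (0.5), while the single P2 path is fully matched (1.0), giving 0.75.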

PathSim: Similarity of Canonical Paths

• Recall that canonical paths have:
  – Linear execution
  – No control flow
  – No redundant common subexpressions
  – No useless code
  – No dead code
  – No registers
  – Variables renamed by some total order
  – Independent instructions sorted in alphabetical order

• Similarity algorithms for text documents can be used
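Since a canonical path is just a sequence of instruction tokens, an off-the-shelf sequence matcher is one candidate for PathSim. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; choosing this particular measure is an assumption, as the slides do not name a specific text-similarity algorithm.

```python
from difflib import SequenceMatcher


def path_sim(path_a, path_b):
    """Similarity in [0, 1] between two canonical paths (token sequences)."""
    return SequenceMatcher(None, path_a, path_b, autojunk=False).ratio()


known = ["mov v1 5", "add v1 v2", "syscall"]
unknown = ["mov v1 5", "syscall"]
print(path_sim(known, unknown))
```

`ratio()` returns 2·M/T, where M is the number of matched elements and T is the total length of both sequences, so the two paths above score 2·2/5 = 0.8.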

Identifying Critical Canonical Path (CCP)

• P1, P2, P3, … Pn are known malicious programs

• A CCP must have at least one similar path in all Canonical Path Sets CPS(P1), CPS(P2), … CPS(Pn)

• Statistical algorithms can be applied, e.g. the Gibbs sampler
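The "similar path in every CPS" condition can be illustrated with a brute-force filter. This is a plain threshold check, not a Gibbs sampler, and the threshold value, the Jaccard measure and the function names are all assumptions chosen for the example.

```python
def jaccard(path_a, path_b):
    """Toy PathSim: overlap of the token sets of two paths."""
    set_a, set_b = set(path_a), set(path_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def ccp_candidates(cps_list, path_sim, threshold=0.8):
    """Paths of the first CPS that have a similar path in every other CPS."""
    first, *rest = cps_list
    return [
        p for p in first
        if all(any(path_sim(p, q) >= threshold for q in cps) for cps in rest)
    ]


cps_p1 = [("open", "write", "exec"), ("open", "close")]
cps_p2 = [("open", "write", "exec"), ("read",)]
cps_p3 = [("write", "open", "exec"), ("sleep",)]
print(ccp_candidates([cps_p1, cps_p2, cps_p3], jaccard))
```

Only the `open`/`write`/`exec` path has a close match in all three sets, so it is the single candidate returned.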

Summary

• Assumption: malicious programs are short

• Canonical form for comparison

• Limited number of canonical linear paths

• Similarity problem for text documents

• Statistical methods to identify common fingerprints

Acknowledgement

Thank You All!