
Page 1: Statistical Tools for Linking Engine-generated Malware to its Engine

Edna C. Milgo
M.S. Student in Applied Computer Science
TSYS School of Computer Science
Columbus State University
November 19th, 2009

Page 2: Malware

State of the threat: anti-virus firms must analyze between 15,000 and 20,000 new malware instances a day. [AV Test Lab 2007, McAfee2009]

1.6 million malware instances were detected in 2008. [F-Secure2008]

Professionals are being recruited to make stealthier malware. [ESET2008]

Automation is being used to generate malware. [ESET2008]

Generic detection performs poorly: 630 out of 1,000 new malware instances went undetected. [Team-cymru2008]

Classes: viruses, worms, Trojans.

Malware-generating engines: script kiddies, morphers, metamorphic engines, and virus-generating toolkits.


Page 3: Malware is Hard to Detect as it is…

• Static program analysis: may be imprecise and inefficient (e.g., def-use analysis), and may be challenged by obfuscation.

• Dynamic program analysis: may be challenged by malware that tests the patience of the emulator. [Aycock2005]

[Figure: the program analysis pipeline for a suspect program: extract procedures, disassembly, control flow graphs, signature verification, ending in a Malicious / Benign decision.]

[Figure: an example control flow graph (Entry, Read x, If (x*x*(x+1)*(x+1) % 4 == 0), calls to Proc 1 and Proc 2, Exit) illustrating dead-code insertion.]

…so we want to stop early in the program analysis pipeline.


Page 4: Engine-Generated Malware

[Figure: an ENGINE takes Variant 0 as input (IN) and emits Variants 1, 2, 3, …, n (OUT) toward a MALWARE DETECTOR.]

The engine generates new variants at a high rate. Malware detectors typically store one signature per variant, so too many signatures overwhelm the detector.


Page 5: Proposal: View Engine as Author

Goal 1: Reduce the number of steps required in the program analysis pipeline.
Goal 2: Eliminate the need for one signature per variant.
Goal 3: Remain satisfactorily accurate.

Proposed Model [Chouchane2006]

[Figure: the ENGINE's variants (1, 2, 3, …, n) are checked by the MALWARE DETECTOR against a single Engine Signature. Source: Google Images.]

The proposed approach is inspired by Keselj's work on authorship analysis of natural text produced by humans. [Keselj2003]


Page 6: Feature 1: Instruction Frequency Vector

Example Program P (instruction sequence):
add, push, pop, add, and, jmp, pop, and, mov, jmp, mov, push, jmp, jmp, push, jmp, add, pop, mov, add, mov, push, jmp, mov, mov, jmp, push

IFV(P):
 mov  push   add   and   jmp   pop
   6     5     4     2     7     3

Normalized IFV(P):
 mov  push   add   and   jmp   pop
0.22  0.19  0.15  0.07  0.26  0.11
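As a minimal sketch (not taken from the paper; the function names and the normalization convention are my own assumptions), an IFV for a disassembled opcode sequence could be computed like this:

```python
from collections import Counter

def instruction_frequency_vector(opcodes, mnemonics):
    """Count how often each mnemonic of interest occurs in the opcode sequence."""
    counts = Counter(opcodes)
    return [counts[m] for m in mnemonics]

def normalize(vector):
    """Divide each count by the total so that the entries sum to 1."""
    total = sum(vector)
    return [c / total for c in vector] if total else vector

# The example program P from this slide, as a flat list of mnemonics.
P = ("add,push,pop,add,and,jmp,pop,and,mov,jmp,mov,push,jmp,jmp,push,"
     "jmp,add,pop,mov,add,mov,push,jmp,mov,mov,jmp,push").split(",")
MNEMONICS = ["mov", "push", "add", "and", "jmp", "pop"]

ifv = instruction_frequency_vector(P, MNEMONICS)
print(ifv)                                      # [6, 5, 4, 2, 7, 3]
print([round(x, 2) for x in normalize(ifv)])    # [0.22, 0.19, 0.15, 0.07, 0.26, 0.11]
```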


Page 7: IFV Classification

[Figure: the suspect program's IFV is compared against the malicious and benign training IFVs that fall within a distance threshold.]

STEPS
1. Given a sample of malicious programs and a sample of benign ones, select sets of trainers from each sample.
2. Compute the IFVs of all trainers.
3. Choose a threshold ε.
4. Input: IFV_suspect, where 'suspect' is a program that is not among the trainers.
5. Count the number of malicious training IFVs within ε of IFV_suspect.
6. Count the number of benign training IFVs within ε of IFV_suspect.
7. Output: the family with the highest number of trainers within ε is declared to be that of the suspect program. If there is a tie, pick one at random. (A code sketch of steps 4–7 follows the distance measure below.)


Distance Measure (Euclidean):

d(IFV_x, IFV_y) = \left[ \sum_{i=0}^{n-1} \bigl( IFV_x(i) - IFV_y(i) \bigr)^2 \right]^{1/2}
Page 8: Experimental Setup

Metamorphic malware (from vx.netlux.org): W32.Simile (100 samples)

Benign programs (from download.com, sourceforge.net): 100 samples


Thanks to Jessica Turner for extracting the original variant of W32.Simile.

Page 9: Classifying W32.Simile vs. Benigns

RI is the number of instructions considered in the IFV.
For RI = 4: 0.1 ≤ ε ≤ 0.7, 98% ≤ accuracy ≤ 100%.
For RI = 5: 0.1 ≤ ε ≤ 0.7, 96% ≤ accuracy ≤ 100%.
Very small signatures (4 and 5 doubles per IFV), but this scheme still does not use a single signature per family.


Page 10: Feature 2: N-gram Frequency Vector

Example Program P (instruction sequence):
add, push, call, pop, call, add, push, call, pop, call, add, mov, add, add, mov, add, add, mov, add, push, call, push, call, call, pop, call, push, mov, add, mov, add, push, call, pop, pop, call, pop, call, pop, call, mov, add, mov, add

NFV(P):
addpush  pushcall  callpop  popcall  callmov  movadd
      3         5        4        6        1        7

Normalized NFV(P):
addpush  pushcall  callpop  popcall  callmov  movadd
   0.12      0.19     0.15     0.23     0.04     0.27
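A minimal sketch of how a bigram NFV could be computed over a chosen set of opcode bigrams ("doubles"); the helper names and the toy sequence are my own assumptions, not from the paper:

```python
from collections import Counter

def ngram_frequency_vector(opcodes, selected_ngrams, n=2):
    """Count how often each selected opcode n-gram occurs in the sequence."""
    grams = Counter("".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))
    return [grams[g] for g in selected_ngrams]

def normalize(vector):
    """Scale the counts so that the selected entries sum to 1."""
    total = sum(vector)
    return [c / total for c in vector] if total else vector

# The six bigrams ("doubles") shown on this slide.
BIGRAMS = ["addpush", "pushcall", "callpop", "popcall", "callmov", "movadd"]

opcodes = "add,push,call,pop,call,mov,add".split(",")   # toy sequence
nfv = ngram_frequency_vector(opcodes, BIGRAMS)
print(nfv, normalize(nfv))
```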


Page 11: N-Gram Authorship Attribution (Proposed)

[Figure: Malware Detector. NFV_suspect is compared against one family signature per family (FS_B, FS_S, FS_E, FS_V, FS_N), yielding the distances D_B, D_S, D_E, D_V, D_N.]

STEPS
1. Choose a set of trainers from each of the families.
2. For each family, compute the average of the NFVs of the family's trainers to create a Family Signature (FS) for that family.
3. Input: NFV_suspect, where 'suspect' is a program that is not among the trainers.
4. Compute the distance between each of the FSs and NFV_suspect.
5. Output: the suspect program is classified as a member of the family with the shortest distance. If there are ties, choose one at random. (A code sketch follows the distance measure below.)


Prediction = the family achieving MIN(D_B, D_S, D_E, D_V, D_N).

Distance Measure 2 (n-gram profile dissimilarity, after Keselj):

d(NFV_x, NFV_y) = \left[ \sum_{i} \left( \frac{2\,\bigl(NFV_x(i) - NFV_y(i)\bigr)}{NFV_x(i) + NFV_y(i)} \right)^2 \right]^{1/2}

where the sum ranges over the selected n-gram features.
Page 12: k-nn Classification

STEPS
1. Given a sample of malicious programs and a sample of benign ones, select sets of trainers from each sample.
2. Choose k > 0.
3. Input: NFV_suspect of the suspect program.
4. Find the k closest training NFVs (the neighbors of NFV_suspect).
5. Output: the suspect program is classified as a member of the family with the most neighbors. If there are ties, choose one at random. (A code sketch follows the distance measure below.)

[Figure: training NFVs drawn from the NGVCK, Evol, VCL, Simile, and Benign families, among which the suspect's k nearest neighbors are found.]


Distance Measure 2 (same as on Page 11):

d(NFV_x, NFV_y) = \left[ \sum_{i} \left( \frac{2\,\bigl(NFV_x(i) - NFV_y(i)\bigr)}{NFV_x(i) + NFV_y(i)} \right)^2 \right]^{1/2}
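A minimal k-NN sketch for steps 3–5, reusing the distance above (the data layout and the tie handling are my own assumptions):

```python
from collections import Counter

def knn_classify(suspect_nfv, trainers, k, distance):
    """trainers: list of (nfv, family) pairs.
    Return the majority family among the k training NFVs closest to the suspect
    (ties are broken by whichever family appears first)."""
    nearest = sorted(trainers, key=lambda pair: distance(suspect_nfv, pair[0]))[:k]
    votes = Counter(family for _, family in nearest)
    return votes.most_common(1)[0][0]

# e.g. knn_classify(nfv, trainers, k=5, distance=keselj_distance)
```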

Page 13: Experimental Setup

Metamorphic malware (from vx.netlux.org): W32.Simile + W32.Evol (100 samples each)

Malware generation toolkits (from vx.netlux.org): VCL + NGVCK (100 samples each)

Benign programs (from download.com, sourceforge.net): 100 samples


Thanks to Yasmine Kandissounon for collecting the NGVCK and VCL variants.

Page 14: Ten-fold Cross Validation

Divide each family into a training set of 90 instances and a testing set of 10 instances.

Perform 10-fold cross validation using a new testing set and a new training set each time.

The cross-validation accuracy is the average of the accuracies obtained across the ten folds.
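A minimal sketch of this 90/10 per-family split (the callback-based structure is an assumption, not from the paper):

```python
import random
from collections import defaultdict

def ten_fold_accuracy(samples, train_and_test):
    """samples: list of (nfv, family) pairs, 100 per family as in the experiments.
    train_and_test(train, test) must return the accuracy on the held-out fold."""
    by_family = defaultdict(list)
    for sample in samples:
        by_family[sample[1]].append(sample)
    for members in by_family.values():
        random.shuffle(members)                  # fix one random order per family
    scores = []
    for fold in range(10):
        test, train = [], []
        for members in by_family.values():
            test += members[fold::10]            # 10 held-out instances per family
            train += [s for j, s in enumerate(members) if j % 10 != fold]
        scores.append(train_and_test(train, test))
    return sum(scores) / len(scores)             # average over the ten folds
```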


Page 15: Bigram Selection (Relevant Instructions)

RI is the number of most relevant instructions across the samples used to construct the features. Best accuracy: 85% for RI = 3, RI = 4, and RI = 9.


Page 16: Bigram Selection (Relevant Bigrams)

RB is the number of most relevant bigrams across the samples used to construct the features. Best accuracy: 95% for 17 doubles. Accuracy of 94.8% for 6, 8, and 14 doubles.

Page 17: Successful Evaluation…

A single, small family signature of 17 doubles per family yielded a 95% detection accuracy.

W32.Simile's Engine signature = (0.190, 0.030, 0.155, 0.048, 0.043, 0.057, 0.063, 0.020,0.076, 0.022, 0.0, 0.041, 0.109, 0.0, 0.122, 0.022, 0.0)

W32.Evol's Engine signature = (0.074, 0.026, 0.006, 0.326, 0.208, 0.014, 0.024, 0.073,0.043, 0.048, 0.0, 0.071, 0.042, 0.0, 0.026, 0.019, 0.0)

W32.VCL's Engine signature = (0.111, 0.238, 0.142, 0.027, 0.076, 0.063, 0.063, 0.033,0.009, 0.018, 0.018, 0.054, 0.042, 0.0, 0.040, 0.052, 0.013)

W32.NGVCK's Engine signature = ( 0.132, 0.113, 0.106, 0.048, 0.203, 0.018, 0.055,0.038, 0.022, 0.017, 0.070, 0.122, 0.007, 0.0, 0.007, 0.020, 0.017)

Benign's “Engine signature” = (0.165, 0.173, 0.091, 0.061, 0.052, 0.060, 0.052, 0.046, 0.060,0.028, 0.019, 0.043, 0.024, 0.029, 0.02, 0.031, 0.029)

A single, small family signature of 6 doubles for each family induced a 94.8% detection accuracy.

W32.Simile's Engine signature = (0.362, 0.058, 0.295, 0.093, 0.082, 0.110)

W32.Evol's Engine signature = (0.113, 0.039, 0.010, 0.497, 0.319, 0.021)

W32.VCL's Engine signature = (0.176, 0.358, 0.212, 0.041, 0.115, 0.100)

W32.NGVCK's Engine signature = (0.212, 0.182, 0.171, 0.078, 0.327, 0.029)

Benign's "Engine signature" = (0.265, 0.279, 0.147, 0.102, 0.098, 0.098)
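For illustration only, a hedged sketch that plugs these published 6-double signatures into the attribution step (the distance is the one reconstructed on Page 11; the suspect NFV below is hypothetical):

```python
SIGNATURES = {
    "W32.Simile": [0.362, 0.058, 0.295, 0.093, 0.082, 0.110],
    "W32.Evol":   [0.113, 0.039, 0.010, 0.497, 0.319, 0.021],
    "W32.VCL":    [0.176, 0.358, 0.212, 0.041, 0.115, 0.100],
    "W32.NGVCK":  [0.212, 0.182, 0.171, 0.078, 0.327, 0.029],
    "Benign":     [0.265, 0.279, 0.147, 0.102, 0.098, 0.098],
}

def keselj_distance(u, v):
    """Same n-gram profile dissimilarity as on Pages 11-12."""
    return sum((2.0 * (a - b) / (a + b)) ** 2 for a, b in zip(u, v) if a + b) ** 0.5

suspect_nfv = [0.30, 0.05, 0.28, 0.10, 0.10, 0.17]     # hypothetical 6-double profile
prediction = min(SIGNATURES, key=lambda f: keselj_distance(suspect_nfv, SIGNATURES[f]))
print(prediction)                                       # the closest family signature wins
```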


Page 18: Successful Evaluation cont'd…

[Figure: the MALWARE GENERATING ENGINE emits Variants 1, 2, 3, …, n (OUT), which the MALWARE DETECTOR checks against a single Engine Signature (ES).]

…Re-examining our goals:
Goal 1: Simplified analysis. Analysis involves only the disassembly and signature-verification stages of the program analysis pipeline.
Goal 2: One signature per family (the Family Signature).
Goal 3: Accuracy of 95% using only 17 doubles as a signature.


Page 19: Directions for Future Work

Experiment with other malware instances and families.
Address the scalability issue?
Experiment with other feature-selection methods. Could we do "better" than 95% for a signature of 17 doubles?
Try other classifiers.
Try other distance measures? Try byte NFVs instead of opcode NFVs.
Take into account malware that comes only as a binary.
Import existing forensic-linguistics methods into malware detection.

Page 20: References

A paper documenting this work has been submitted for possible publication to the Journal in Computer Virology.

• E. Milgo. A Fast Approximate Detection of Win32.Simile Malware. Columbus State University Colloquium Series, Feb. '09; Best Paper Award, 2nd place, Master's category, ACM MidSE '08.
• M. R. Chouchane, A. Lakhotia. Using Engine Signature to Detect Metamorphic Malware. WORM '06.
• M. R. Chouchane. Approximate Detection of Machine-morphed Malware. Ph.D. Dissertation, University of Louisiana at Lafayette, '08.
• P. Ször. The Art of Computer Virus Research and Defense. 2005.
• J. D. Aycock. Computer Viruses and Malware. 2005.
• V. Keselj, F. Peng, N. Cercone, and C. Thomas. N-gram-based Author Profiles for Authorship Attribution. PACLING '03.
• T. Abou-Assaleh, N. Cercone, V. Keselj, and R. Sweidan. N-gram-based Detection of New Malicious Code. COMPSAC '04.
• http://www.av-test.org/, 2007.
• http://resources.mcafee.com/content/AvertReportQ109, 2009.
• http://www.eset.com, 2008.
• http://www.f-secure.com/en_US/, 2008.
• http://www.team-cymru.org, 2008.
