A Comprehensive Approach for Malicious Javascript Detection EJ Jung 12/18/09 with Peter Likarish,...

A Comprehensive Approach for Malicious Javascript Detection EJ Jung 12/18/09 with Peter Likarish, Insoon Jo in The 4th International Malicious and Unwanted Software (Malware 2009)


Page 1

A Comprehensive Approach for Malicious Javascript Detection

EJ Jung

12/18/09

with Peter Likarish, Insoon Jo

in The 4th International Malicious and Unwanted Software (Malware 2009)

Page 2

Why javascript?

>60% of Internet attacks target web applications [sans09]
• SQL injection, cross-site scripting (XSS)

XSS is the most prevalent bug on the web
• drive-by downloads, malicious advertisements, …
• take over the user’s browser using JavaScript
• cross-site request forgery (CSRF)
  – forces the user’s browser to execute commands without the user’s consent

Page 3

What has been done before?

Blacklist-based approaches
• profiles from known malicious JavaScript
• domain names and URLs of known bad websites
• most scanners adopt this

Sandbox-based approaches
• run in a virtual machine and check the state change
• honey* approaches to find new malware

Limited-capability approaches
• run with limited function calls
• only allow a subset of JavaScript

Page 4

Limitations

Blacklist-based approaches
• vulnerable to zero-day attacks
• cannot respond to new threats immediately

Sandbox-based approaches
• delay before execution
• an imperfect sandbox might leak

Limited-capability approaches
• compatibility issues

Page 5

Good and bad javascripts

Clue: Obfuscation! >90% in our dataset

Page 6

De-obfuscation?

Why not de-obfuscation then blacklist check?
• complete de-obfuscation is extremely difficult
• we do use partial de-obfuscation for URL extraction
• still vulnerable to 0-day attacks

For detection, we only need to know that obfuscation exists

Benign but obfuscated code?
• copyright, tamper-proofing, protection against reverse-engineering
• other features to reduce false positives

Page 7

Our approach

Comprehensive framework consists of
• a targeted web crawler
• URL extraction & feedback
• JavaScript classifiers

Classifier benefits
• mitigate 0-day vulnerability
• smaller delay
• compatibility with legacy code

Page 8

Preliminaries on classifiers

Classifiers “learn” from a training set how to classify
• is this script benign or malicious?
• probabilistic analysis, decision tree, rule induction, hyperplane, ...

Example classifier: Naive Bayes
• widely used in spam filtering

Page 9

Classifier evaluation

Confusion matrix [thanks to Prof. Press]

Page 10

Precision/NPP

Precision
• if the classifier says malicious, how much can we trust this decision?
• precision = tp/(tp+fp)
• the higher the precision is, the tougher we can be on the positives

Negative Predictive Power (NPP)
• if the classifier says benign, how much can we trust this decision?
• NPP = tn/(tn+fn)
• the higher the NPP is, the less risk we take in letting this script run
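The two definitions above can be computed directly from confusion-matrix counts. The following is a minimal sketch; the function names are illustrative, `recall` uses the standard definition tp/(tp+fn) (it appears in the evaluation tables but is not defined on the slides), and the NPP counts in the usage lines are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    """Of the scripts flagged malicious, the fraction that really are malicious."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


def npp(tn: int, fn: int) -> float:
    """Of the scripts judged benign, the fraction that really are benign."""
    passed = tn + fn
    return tn / passed if passed else 0.0


def recall(tp: int, fn: int) -> float:
    """Of the truly malicious scripts, the fraction the classifier caught."""
    actual = tp + fn
    return tp / actual if actual else 0.0


# 19 scripts flagged, 17 of them truly malicious: tp = 17, fp = 2
print(round(precision(tp=17, fp=2), 3))
# Hypothetical counts for illustration only: tn = 990, fn = 10
print(round(npp(tn=990, fn=10), 3))
```

Returning 0.0 when a denominator is empty is a design choice of this sketch; raising an error or returning NaN would be equally defensible.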

Page 11

How to get good classifiers?

Given a word “stock” in an email, what is the probability of this email being spam?

We can compute these probabilities from the sample set of emails.

The closer the sample set is to the real Internet, the better this classifier gets. -> importance of crawler
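As a worked illustration of the spam question above, Bayes' rule gives P(spam | word) = P(word | spam) P(spam) / P(word), where every term is estimated by counting in the sample set. The sketch below assumes a single-word model; the function name and all counts are hypothetical and are not taken from the talk.

```python
def p_spam_given_word(spam_with_word: int, spam_total: int,
                      ham_with_word: int, ham_total: int) -> float:
    """Estimate P(spam | word) from counts in a labeled sample of emails."""
    total = spam_total + ham_total
    p_spam = spam_total / total                       # prior P(spam)
    p_ham = ham_total / total                         # prior P(not spam)
    p_word_given_spam = spam_with_word / spam_total   # likelihood in spam
    p_word_given_ham = ham_with_word / ham_total      # likelihood in non-spam
    # Total probability of seeing the word at all
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    if p_word == 0:
        return 0.0  # word never seen in the sample; no evidence either way
    return p_word_given_spam * p_spam / p_word


# Hypothetical sample: 400 spam emails (120 contain "stock"),
# 600 non-spam emails (30 contain "stock").
print(round(p_spam_given_word(120, 400, 30, 600), 3))
```

Because every term is a ratio of counts from the sample, a sample that differs from real traffic skews all of them, which is why the quality of the crawl matters.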

Page 12

Targeted Crawls

Based on Heritrix, an open-source crawler

Initial seeds from popular and blacklisted domains

Alexa top 500
• top 500 websites with the most traffic
• may include some malicious scripts but mostly benign

Blacklisted domains
• malekal.com, malwareurl.com

Feedback from newly found malicious scripts
• extract URLs from redirections and downloads
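The feedback step above can be sketched as follows. This is only an illustration under simplifying assumptions: it finds URLs that appear in plain text inside a script, whereas the framework relies on partial de-obfuscation, which is not shown here. The regular expression, function names, and example domains are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Simplistic pattern: an http(s) URL up to the next quote, bracket, or space.
URL_PATTERN = re.compile(r"""https?://[^\s'"<>)]+""")


def extract_domains(script_text: str) -> set:
    """Return the host names of all plain-text URLs found in a script."""
    domains = set()
    for url in URL_PATTERN.findall(script_text):
        try:
            host = urlparse(url).hostname
        except ValueError:
            continue  # skip URLs the parser rejects
        if host:
            domains.add(host)
    return domains


def new_seed_domains(current_seeds: set, malicious_scripts: list) -> set:
    """Domains referenced by malicious scripts that are not yet crawl seeds."""
    found = set()
    for script in malicious_scripts:
        found |= extract_domains(script)
    return found - current_seeds


script = 'window.location="http://bad.example/x.js";'
print(new_seed_domains({"known.example"}, [script]))
```

The returned domains would be added to the crawler's seed list for the next crawl.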

Page 13

Crawled scripts

Dates              Initial seeds             # pages downloaded   # unique scripts
Jan. 26 ~ Feb. 3   Alexa 500                 9,028,469            ~63 million
Jun. 2 ~ 16        827 blacklisted domains   163,938              24,269
Jul. 16 ~ Aug. 1   559 blacklisted domains   79,696               7,602

Training set: 50,000 benign + 66 malicious scripts from Feb~Mar 2009

65 out of 66 obfuscated

Page 14

Is this training set good?

10-fold cross validation by 5,000 increments

Classifier   Precision (stdev)   Recall (stdev)   NPP (stdev)
NaiveBayes   0.808 (0.11)        0.659 (0.18)     0.996 (0.0023)
REPTree      0.884 (0.12)        0.769 (0.17)     0.997 (0.0022)
SVM          0.920 (0.14)        0.742 (0.17)     0.997 (0.0021)
RIPPER       0.882 (0.17)        0.787 (0.21)     0.997 (0.0027)
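For readers unfamiliar with k-fold cross validation, the splitting step can be sketched as below. The slides do not say which toolkit was used or whether the folds were stratified; stratification is an assumption made here because malicious scripts are so rare in the training set that a purely random split could leave a fold with none. Function names are illustrative.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Deal the indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices), one pair per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Small stand-in for the training set: 500 benign (0) and 66 malicious (1).
labels = [0] * 500 + [1] * 66
for train, test in cross_validation_splits(labels):
    malicious_in_test = sum(labels[i] for i in test)
    print(len(train), len(test), malicious_in_test)
```

Each classifier would be trained on `train` and scored on `test` for every fold, and the mean and standard deviation of the per-fold scores reported.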

Page 15

Feature extraction

Identify commonly observed features of malicious JavaScript
• manually added features (obfuscation)
• 50 reserved JavaScript keywords

Important features
• human readability (obfuscation)
  – >70% alphabetical, 20% < vowels < 60%, <15 characters long, <=2 repetitions
• eval
  – obfuscation and hiding malicious code
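The human-readability thresholds above can be turned into a per-word check, sketched below. Several details are assumptions of this sketch rather than statements from the slides: percentages are measured over all characters in the word, "<=2 repetitions" is read as "no character occurs more than twice in a row", and a script is tokenized into runs of identifier characters. Function names are illustrative.

```python
import re

VOWELS = set("aeiou")
TRIPLE_REPEAT = re.compile(r"(.)\1\1")  # same character three times in a row


def is_human_readable(word: str) -> bool:
    """Apply the four readability thresholds to a single word."""
    if not word or len(word) >= 15:
        return False
    alpha_ratio = sum(ch.isalpha() for ch in word) / len(word)
    vowel_ratio = sum(ch.lower() in VOWELS for ch in word) / len(word)
    if alpha_ratio <= 0.70:
        return False
    if not 0.20 < vowel_ratio < 0.60:
        return False
    return TRIPLE_REPEAT.search(word) is None


def readable_fraction(script_text: str) -> float:
    """Fraction of words in a script that pass the readability check."""
    words = re.findall(r"[A-Za-z0-9_$]+", script_text)
    if not words:
        return 0.0
    return sum(is_human_readable(w) for w in words) / len(words)


print(is_human_readable("document"))    # ordinary identifier
print(is_human_readable("x7f9a2b1c0"))  # mostly digits
print(readable_fraction("var total = count"))
```

A low readable fraction is only a hint of obfuscation, which is why it is used as one classifier feature among several rather than as a rule on its own.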

Page 16

Feature evaluation

Scatterplots: good vs. bad = red vs. blue

Page 17

Helpful features

Page 18

Detection in the real world

Test data
• 2 weeks’ data from malwaredomains.com
• 24,269 unique scripts by MD5
• 22 malicious scripts found by classifiers
  – all obfuscated
• 2 found by the latest virus scanner

Classifier                        # found   # mal   Precision
NaiveBayes                        19        17      89.5%
REPTree (decision tree learner)   21        19      90.4%
SVM                               22        19      86.3%
Ripper (inductive rule learner)   28        19      67.9%
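Counting "unique scripts by MD5", as in the test data above, amounts to hashing each script body and keeping one copy per digest. A minimal sketch follows; it assumes scripts are hashed as raw bytes with no normalization, which the slides do not specify, and the function name is illustrative.

```python
import hashlib


def unique_by_md5(scripts):
    """Map MD5 hex digest -> first script body seen with that digest."""
    unique = {}
    for body in scripts:
        digest = hashlib.md5(body).hexdigest()
        unique.setdefault(digest, body)  # keep the first occurrence only
    return unique


crawled = [b"alert(1)", b"alert(2)", b"alert(1)"]
print(len(unique_by_md5(crawled)))
```

MD5 serves here purely as a de-duplication key, so its weakness against deliberate collisions is not a concern for this purpose.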

Page 19

Future work

Correlation among malicious domains
• more effective domain-based blacklist

Language-model classifiers

Resilience testing
• feedback from newly found malicious scripts
• sustain the classifiers’ accuracy

Combine with other features
• HTTP and connection information [Seifert08]

Recall testing with blacklists