A Comprehensive Approach for Malicious Javascript Detection EJ Jung 12/18/09 with Peter Likarish,...

A Comprehensive Approach for Malicious Javascript Detection

EJ Jung

12/18/09

with Peter Likarish, Insoon Jo

in The 4th International Malicious and Unwanted Software (Malware 2009)

http://www.malware2009.org/


Why javascript?

More than 60% of Internet attacks target web applications [sans09]
• SQL injection, cross-site scripting (XSS)

XSS is the most prevalent bug on the web
• drive-by downloads, malicious advertisements, …
• take over the user’s browser using JavaScript
• cross-site request forgery (CSRF)
– forces the browser to execute commands without the user’s consent

What has been done before?

Blacklist-based approaches
• profiles from known malicious JavaScript
• domain names and URLs of known bad websites
• most scanners adopt this

Sandbox-based approaches
• run in a virtual machine and check the state change
• honey* approaches to find new malware

Limited-capability approaches
• run with limited function calls
• only use a subset of JavaScript

Limitations

Blacklist-based approaches
• vulnerable to zero-day attacks
• cannot respond quickly to new threats

Sandbox-based approaches
• delay before execution
• an imperfect sandbox might leak

Limited-capability approaches
• compatibility issues

Good and bad javascripts

Clue: obfuscation! (>90% in our dataset)

De-obfuscation?

Why not de-obfuscate and then check against a blacklist?
• complete de-obfuscation is extremely difficult
• we do use partial de-obfuscation for URL extraction
• still vulnerable to zero-day attacks

For detection, we only need to know that obfuscation is present

Benign but obfuscated code?
• obfuscation is also used for copyright protection, tamper-proofing, and protection against reverse engineering
• we use other features to reduce false positives

Our approach

The comprehensive framework consists of
• a targeted web crawler
• URL extraction and feedback
• JavaScript classifiers

Classifier benefits
• mitigate zero-day vulnerabilities
• smaller delay
• compatibility with legacy code

Preliminaries on classifiers

Classifiers “learn” from a training set how to classify
• is this script benign or malicious?
• probabilistic analysis, decision tree, rule induction, hyperplane, ...

Example classifier: Naive Bayes
• widely used in spam filtering

Classifier evaluation

Confusion matrix [thanks to Prof. Press]

Precision/NPP

Precision
• if the classifier says malicious, how much can we trust this decision?
• precision = tp/(tp+fp)
• the higher the precision, the tougher we can be on the positives

Negative Predictive Power (NPP)
• if the classifier says benign, how much can we trust this decision?
• NPP = tn/(tn+fn)
• the higher the NPP, the less risk we take in letting this script run
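The two measures above can be sketched in a few lines of Python. The confusion-matrix counts used here are hypothetical and are not taken from the talk; they only illustrate the formulas.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of scripts flagged malicious that really are malicious."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def npp(tn: int, fn: int) -> float:
    """Fraction of scripts flagged benign that really are benign."""
    return tn / (tn + fn) if (tn + fn) else 0.0


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    tp, fp, tn, fn = 45, 5, 940, 10
    print(f"precision = {precision(tp, fp):.3f}")
    print(f"NPP       = {npp(tn, fn):.3f}")
```

Note the parentheses around both denominators: without them, tn/tn+fn would evaluate to 1+fn.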

How to get good classifiers?

Given the word “stock” in an email, what is the probability that this email is spam?
• we can compute these probabilities from a sample set of emails
• the closer the sample set is to the real Internet, the better the classifier gets -> importance of the crawler
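As a rough illustration of the spam question above, the sketch below estimates P(spam | word) from a labeled sample using Bayes' rule. The function name, the whitespace tokenization, and the four sample emails are assumptions made for this example, not part of the talk.

```python
def p_spam_given_word(word, emails):
    """Estimate P(spam | word) from a list of (text, is_spam) pairs.

    Uses Bayes' rule: P(spam | w) = P(w | spam) * P(spam) / P(w).
    Returns None when the word never appears in the sample.
    """
    def has_word(text):
        return word in text.lower().split()

    n = len(emails)
    spam_texts = [text for text, is_spam in emails if is_spam]
    p_word = sum(has_word(text) for text, _ in emails) / n
    if p_word == 0:
        return None
    if not spam_texts:
        return 0.0
    p_spam = len(spam_texts) / n
    p_word_given_spam = sum(has_word(t) for t in spam_texts) / len(spam_texts)
    return p_word_given_spam * p_spam / p_word


# Hypothetical sample set; a real one would come from crawled data.
SAMPLE = [
    ("buy this stock now", True),
    ("hot stock tip inside", True),
    ("stock room inventory report", False),
    ("lunch at noon", False),
]

if __name__ == "__main__":
    print(p_spam_given_word("stock", SAMPLE))
```

With a sample this small the estimate is unreliable, which is the point of the slide: the estimate is only as good as the sample it is computed from.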

Targeted Crawls

Based on Heritrix, an open-source crawler

Initial seeds from popular and blacklisted domains

Alexa top 500
• the 500 websites with the most traffic
• may include some malicious scripts but mostly benign

Blacklisted domains
• malekal.com, malwareurl.com

Feedback from newly found malicious scripts
• extract URLs from redirections and downloads
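The feedback step can be sketched as follows. This is a minimal illustration, not the crawler's actual code: the regular expression, the function names, and the choice to undo only percent-encoding are all assumptions, and the example hosts are made up.

```python
import re
from urllib.parse import unquote, urlparse

# Assumption: URLs of interest start with http:// or https://.
URL_RE = re.compile(r"https?://[^\s'\"<>)]+", re.IGNORECASE)


def extract_urls(script):
    """Return URLs found in the raw script and in a percent-decoded copy."""
    decoded = unquote(script)  # partial de-obfuscation: %xx escapes only
    return set(URL_RE.findall(script)) | set(URL_RE.findall(decoded))


def feedback_seeds(malicious_scripts, seen_hosts):
    """Collect hosts referenced by malicious scripts that are not yet crawled."""
    new_hosts = set()
    for script in malicious_scripts:
        for url in extract_urls(script):
            host = urlparse(url).hostname
            if host and host not in seen_hosts:
                new_hosts.add(host)
    return new_hosts


if __name__ == "__main__":
    script = ('document.write(unescape("%68%74%74%70%3A%2F%2Fbad.example%2Fx.js"));'
              'window.location="http://evil.example/go";')
    print(feedback_seeds([script], seen_hosts={"evil.example"}))
```

Hosts returned by `feedback_seeds` would be added to the seed list for the next crawl.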

Crawled scripts

Dates               Initial seeds             # pages downloaded   # unique scripts
Jan. 26 ~ Feb. 3    Alexa 500                 9,028,469            ~63 million
Jun. 2 ~ 16         827 blacklisted domains   163,938              24,269
Jul. 16 ~ Aug. 1    559 blacklisted domains   79,696               7,602

Training set: 50,000 benign + 66 malicious scripts from Feb ~ Mar 2009
• 65 out of 66 obfuscated

Is this training set good?

10-fold cross validation by 5,000 increments

Classifier   Precision (stdev)   Recall (stdev)   NPP (stdev)
NaiveBayes   0.808 (0.11)        0.659 (0.18)     0.996 (0.0023)
REPTree      0.884 (0.12)        0.769 (0.17)     0.997 (0.0022)
SVM          0.920 (0.14)        0.742 (0.17)     0.997 (0.0021)
RIPPER       0.882 (0.17)        0.787 (0.21)     0.997 (0.0027)
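For readers unfamiliar with the procedure, here is a minimal sketch of plain 10-fold cross validation that reports a mean and standard deviation per measure, as in the table. It does not model the "5,000 increments" part, and `train_fn` and `score_fn` are hypothetical callbacks standing in for whatever classifier toolkit is used.

```python
import random
import statistics


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_validate(samples, labels, train_fn, score_fn, k=10):
    """Train on k-1 folds, score on the held-out fold, return (mean, stdev)."""
    scores = []
    for train, test in k_fold_indices(len(samples), k):
        model = train_fn([samples[i] for i in train], [labels[i] for i in train])
        scores.append(score_fn(model,
                               [samples[i] for i in test],
                               [labels[i] for i in test]))
    return statistics.mean(scores), statistics.stdev(scores)
```

With only 66 malicious scripts among roughly 50,000, a random split can leave very few positives in a fold, which may partly explain the larger standard deviations for precision and recall; a stratified split is one common remedy.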

Feature extraction

Identify commonly observed features of malicious JavaScript
• manually added features (obfuscation)
• 50 reserved JavaScript keywords

Important features
• human readability (obfuscation)
– >70% alphabetical, 20% < vowels < 60%, <15 characters long, <=2 repetitions
• eval
– used for obfuscation and for hiding malicious code
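A rough sketch of how the readability and eval features might be computed. Several details are assumptions, since the slide gives only the thresholds: tokens are taken to be runs of word characters, ratios are measured against the full token length, and a "repetition" is read as the same character repeated back to back.

```python
import re

VOWELS = set("aeiou")


def is_readable(word):
    """Apply the four readability thresholds from the slide to one token."""
    if not word or len(word) >= 15:            # <15 characters long
        return False
    lower = word.lower()
    alpha_ratio = sum(c.isalpha() for c in lower) / len(lower)
    vowel_ratio = sum(c in VOWELS for c in lower) / len(lower)
    if alpha_ratio <= 0.70:                    # >70% alphabetical
        return False
    if not 0.20 < vowel_ratio < 0.60:          # 20% < vowels < 60%
        return False
    # Assumption: a run of three or more identical characters
    # counts as more than two repetitions.
    return re.search(r"(.)\1\1", lower) is None


def script_features(script):
    """Return the readable-token fraction and the number of eval( calls."""
    tokens = re.findall(r"\w+", script)        # assumption: word-character runs
    readable = sum(is_readable(t) for t in tokens)
    return {
        "readable_fraction": readable / len(tokens) if tokens else 0.0,
        "eval_count": len(re.findall(r"\beval\s*\(", script)),
    }


if __name__ == "__main__":
    print(script_features("var message = document.title; eval(message);"))
    print(script_features('eval(unescape("%u9090%u9090"))'))
```

The resulting dictionary would be one part of the feature vector, alongside counts for the reserved keywords.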

Feature evaluation

Scatterplots: good vs. bad = red vs. blue

Helpful features

Detection in the real world

Test data
• 2 weeks’ data from malwaredomains.com
• 24,269 unique scripts by MD5
• 22 malicious scripts found by classifiers
– all obfuscated
• 2 found by the latest virus scanner
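Counting unique scripts by MD5 can be sketched as below. The function name and the UTF-8 encoding of script text are assumptions; the talk states only that uniqueness was determined by MD5.

```python
import hashlib


def unique_scripts(scripts):
    """Map MD5 hex digest -> first script seen with that digest."""
    seen = {}
    for script in scripts:
        digest = hashlib.md5(script.encode("utf-8")).hexdigest()
        seen.setdefault(digest, script)  # keep the first copy, drop repeats
    return seen


if __name__ == "__main__":
    crawled = ["a();", "b();", "a();"]
    print(len(unique_scripts(crawled)), "unique scripts")
```

MD5 is used here only to detect byte-identical copies, so two scripts that differ by a single character still count as distinct.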

Classifier                        # found   # malicious   Precision
NaiveBayes                        19        17            89.5%
REPTree (decision tree learner)   21        19            90.4%
SVM                               22        19            86.3%
RIPPER (inductive rule learner)   28        19            67.9%

Future work

Correlation among malicious domains
• more effective domain-based blacklist

Language-model classifiers

Resilience testing
• feedback from newly found malicious scripts
• sustain the classifiers’ accuracy

Combine with other features
• HTTP and connection information [Seifert08]

Recall testing with blacklists
