Design and Evaluation of a Real-Time URL Spam Filtering Service

Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song

IEEE Symposium on Security and Privacy 2011


OUTLINE

Introduction - Monarch
Related Work
System Design
Implementation
Evaluation
Discussion and Conclusion


Spam URL

Advertisement
Harmful content
◦ Phishing, malware, and scams
Use of compromised and fraudulent accounts
◦ Email, web services


Monarch

Spam URL Filtering as a Service

Tens of millions of features


Related Work

“Detecting spammers on Twitter” (2010)
◦ Post frequency, URLs, friends…
“Behind phishing: an examination of phisher modi operandi” (2008)
◦ Lexical characteristics of phishing URLs
“Cantina: a content-based approach to detecting phishing web sites” (2007)
◦ Parse HTML content


System Design

Monarch’s cloud infrastructure
URL Aggregation
◦ Email providers and Twitter’s streaming API
Feature Collection
◦ Visits a URL with web browsers to collect page content


System Design (cont.)

Monarch’s cloud infrastructure
Feature Extraction
◦ Transform the raw data into a sparse feature vector
Classification
◦ Training and testing by distributed logistic regression


Collect Raw Features – Web Browser

“A taxonomy of JavaScript redirection spam” (2007)
Lightweight browser not enough
◦ Poor HTML parsing, lack of JavaScript and plugins
Instrumented version of Firefox
◦ JavaScript enabled
◦ Flash and Java installed
◦ Visits a URL and monitors a number of details


Raw Features

Web Browser
◦ Initial URL and Landing URL, Redirects, Sources and Frames
◦ HTML Content, Page Links
◦ JavaScript Events, Pop-up Windows, Plugins
◦ HTTP Headers
DNS Resolver
◦ Initial, final, and redirect URLs
IP Address Analysis
◦ City, country, ASN
Proxy and Whitelist (200 domains)


Feature Vector

Raw features => sparse feature vector
◦ Canonicalize URLs
◦ Remove obfuscation
Tokenize the text corpus
◦ Splitting on non-alphanumeric characters

http://adl.tw/~dada/dada2.php?a=1&b=3
=> domain feature [adl,tw]
   path feature [dada,dada2,php]
   query parameters feature [a,1,b,3]
=> (1,3,a,adl,b,dada,dada2,php,tw)
=> (…,adl:true,adm:false,…,dada:true,…,tw:true,…)
Total: 49,960,691 features (dimensions)
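The tokenization step can be sketched in a few lines of Python. This is only an illustration of the slide's example, not Monarch's actual extractor: the function names (`tokenize`, `url_features`, `sparse_vector`) are hypothetical, and the sketch assumes tokens from the domain, path, and query are merged into one set, as the slide's example vector suggests.

```python
import re
from urllib.parse import urlsplit


def tokenize(text):
    """Split on runs of non-alphanumeric characters, dropping empty tokens."""
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]


def url_features(url):
    """Tokenize the domain, path, and query string of a URL separately."""
    parts = urlsplit(url)
    return {
        "domain": tokenize(parts.netloc),
        "path": tokenize(parts.path),
        "query": tokenize(parts.query),
    }


def sparse_vector(features):
    """Sorted set of active tokens; every token not listed is implicitly false."""
    return sorted({tok for toks in features.values() for tok in toks})


features = url_features("http://adl.tw/~dada/dada2.php?a=1&b=3")
vector = sparse_vector(features)
print(features)
print(vector)
```

Storing only the active tokens is what keeps a vector with tens of millions of dimensions small: each URL touches a handful of them.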


Distributed Classifier Design

Linear classification
◦ x: feature vector
◦ Determine a weight vector w
A parallel online learner
◦ With regularization to yield a sparse weight vector
Labeled data (x_i, y_i), testing => sign(x · w)
◦ -1 => non-spam site
◦ 1 => spam site
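A minimal sketch of how a linear classifier could score a sparse binary feature vector, assuming the decision is the sign of the dot product between the feature vector and the weight vector. The weights below are made-up values, the absence of a bias term is an assumption, and treating a score of exactly zero as non-spam is an arbitrary choice for the sketch.

```python
def dot(x, w):
    """Dot product of a sparse binary feature list x with a weight dict w."""
    return sum(w.get(feature, 0.0) for feature in x)


def classify(x, w):
    """Return 1 (spam) if x . w is positive, otherwise -1 (non-spam)."""
    return 1 if dot(x, w) > 0 else -1


# Hypothetical weights for illustration only.
weights = {"dada2": 1.5, "php": 0.4, "tw": -0.3}
label = classify(["adl", "tw", "dada", "dada2", "php"], weights)
print(label)
```

Features missing from the weight dictionary contribute nothing, so a sparse weight vector makes each prediction cheap.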


Training the weight vector

Logistic Regression
◦ With subgradient L1-Regularization

$f(w) = \log\left(1 + e^{-y_i (x_i \cdot w)}\right)$

y_i (x_i · w) larger => f(w) smaller
(Classification margin, hyperplane)
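To make the training objective concrete, here is a rough sketch of the per-example logistic loss log(1 + e^{-y (x · w)}) and a single subgradient step with an L1 penalty. It is not the distributed learner from the talk: the function names, the learning rate, and the penalty strength are assumptions chosen for illustration, and a plain subgradient step like this one does not by itself force weights to exactly zero.

```python
import math


def dot(x, w):
    """Dot product of a sparse binary feature list x with a weight dict w."""
    return sum(w.get(feature, 0.0) for feature in x)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_loss(x, y, w):
    """log(1 + exp(-margin)), computed without overflow for large margins."""
    margin = y * dot(x, w)
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def sign(v):
    return (v > 0) - (v < 0)


def subgradient_step(x, y, w, learning_rate=0.1, l1=0.01):
    """One step on loss + l1 * ||w||_1, using subgradient 0 where a weight is 0."""
    margin = y * dot(x, w)
    scale = -y * sigmoid(-margin)  # derivative of the loss w.r.t. each active weight
    active = set(x)
    updated = {}
    for feature in active | set(w):
        value = w.get(feature, 0.0)
        grad = (scale if feature in active else 0.0) + l1 * sign(value)
        new_value = value - learning_rate * grad
        if new_value != 0.0:
            updated[feature] = new_value
    return updated


w0 = {}
x, y = ["a", "b"], 1
w1 = subgradient_step(x, y, w0)
print(logistic_loss(x, y, w0), logistic_loss(x, y, w1))
```

After the step the margin y (x · w) is larger, so the loss on that example is smaller.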


Distributed Classifier Algorithm

[Figure: distributed classifier algorithm listing]


Data Set and Assumption

1.25 million spam email URLs
567,784 spam Twitter URLs
9 million non-spam Twitter URLs
Checking all Twitter URLs against:
◦ Google Safebrowsing, SURBL, URIBL, APWG, Phishtank
◦ A URL is labeled spam if any of its source URLs becomes blacklisted
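The labeling rule can be sketched as a check over every URL a page was reached through. This is a simplified illustration, not the authors' pipeline: `label_url`, `domain_of`, and the example domains are hypothetical, and matching on an exact hostname is an assumption (real blacklists each have their own lookup rules).

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Lower-cased hostname of a URL, or an empty string if it has none."""
    return (urlsplit(url).hostname or "").lower()


def label_url(source_urls, blacklisted_domains):
    """Return 1 (spam) if any source URL is on a blacklisted domain, else -1."""
    hit = any(domain_of(u) in blacklisted_domains for u in source_urls)
    return 1 if hit else -1


# Hypothetical redirect chain and blacklist, for illustration only.
chain = ["http://short.example/abc", "http://bad.example/landing"]
blacklist = {"bad.example"}
print(label_url(chain, blacklist))
```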


Data Set and Assumption (cont.)

On Twitter:
◦ 36% scams, 60% phishing, 4% malware


After regularization


Implementation

Amazon Web Services (AWS) infrastructure
URL Aggregation
◦ A queue that keeps 300,000 URLs
Feature Collection
◦ 20x6 Firefox (4.0b4) on Ubuntu 10.04, with a custom extension
◦ Firefox’s NPAPI, Linux’s “host” command, MaxMind GeoIP library and Route Views
Classifier
◦ Hadoop Distributed File System
◦ On the 50-node cluster


Evaluation – Overall Accuracy

5-fold cross-validation
500,000 spam and non-spam each
Training set size: 400,000 examples
◦ 1:1, 4:1, 10:1
Testing set size: 200,000 examples
◦ 1:1
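For readers unfamiliar with k-fold cross-validation, the generic splitting step might look like the sketch below. It only shows how each fold is held out once for testing; it does not reproduce the fixed training and testing set sizes or the class ratios listed on the slide, and `k_fold_indices` and its parameters are names chosen for this example.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists so that each fold is held out exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5))
for train, test in splits:
    print(len(train), len(test))
```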


Evaluation – Single Feature


Evaluation – Accuracy Over Time

Training once only <-> Retraining every four days


Evaluation – Comparing Email and Tweet Spam

Log odds ratio:

$\left|\log\frac{p_1 q_2}{p_2 q_1}\right|, \quad q_i = 1 - p_i$
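As a worked illustration, the absolute log odds ratio |log(p1 q2 / (p2 q1))| with q_i = 1 - p_i can be computed as below. The sketch assumes p1 and p2 are the fractions of examples in two data sets that exhibit a given feature, which the slide does not spell out, and it requires both values to lie strictly between 0 and 1.

```python
import math


def log_odds_ratio(p1, p2):
    """Absolute log odds ratio of two probabilities strictly between 0 and 1."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    return abs(math.log((p1 * q2) / (p2 * q1)))


print(log_odds_ratio(0.5, 0.5))
print(log_odds_ratio(0.8, 0.2))
```

A value near zero means the feature is about equally common in both sets; larger values mean it is far more typical of one set than the other, and swapping the two arguments gives the same result.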


Evaluation – The Cost

For Twitter, $22,751 per month


Discussion and Conclusion

Evasion
◦ Feature Evasion
◦ Time-based Evasion
◦ Crawler Evasion
Monarch
◦ Real-time system
◦ Spam URL Filtering as a Service
◦ $22,751 a month
