Design and Evaluation of a Real-Time URL Spam Filtering Service
Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song
IEEE Symposium on Security and Privacy 2011

Outline
Introduction - Monarch
Related Work
System Design
Implementation
Evaluation
Discussion and Conclusion

Spam URL
Advertisement
Harmful content
◦ Phishing, malware, and scams
Use of compromised and fraudulent accounts
◦ Email, web services

Monarch
Spam URL Filtering as a Service
Tens of millions of features

Related Work
“Detecting spammers on Twitter” (2010)
◦ Post frequency, URLs, friends…
“Behind phishing: an examination of phisher modi operandi” (2008)
◦ Lexical characteristics of phishing URLs
“Cantina: a content-based approach to detecting phishing web sites” (2007)
◦ Parse HTML content

System Design
Monarch’s cloud infrastructure
URL Aggregation
◦ Email providers and Twitter’s streaming API
Feature Collection
◦ Visits a URL with web browsers to collect page content

System Design (cont.)
Monarch’s cloud infrastructure
Feature Extraction
◦ Transforms the raw data into a sparse feature vector
Classification
◦ Training and testing by distributed logistic regression

Collect Raw Features – Web Browser
“A taxonomy of JavaScript redirection spam” (2007)
Lightweight browser not enough
◦ Poor HTML parsing, lack of JavaScript and plugins
Instrumented version of Firefox
◦ JavaScript enabled
◦ Flash and Java installed
◦ Visits a URL and monitors a number of details

Raw Features
Web Browser
◦ Initial URL and Landing URL, Redirects, Sources and Frames
◦ HTML Content, Page Links
◦ JavaScript Events, Pop-up Windows, Plugins
◦ HTTP Headers
DNS Resolver
◦ Initial, final, and redirect URLs
IP Address Analysis
◦ City, country, ASN
Proxy and Whitelist (200 domains)

Feature Vector
Raw features => sparse feature vector
◦ Canonicalize URLs
◦ Remove obfuscation
Tokenize the text corpus
◦ Split on non-alphanumeric characters
Example: http://adl.tw/~dada/dada2.php?a=1&b=3
=> domain feature [adl,tw]
   path feature [dada,dada2,php]
   query parameters feature [a,1,b,3]
=> dense form (…,adl:true,adm:false,…,dada:true,…,tw:true,…), 49,960,691 features (dimensions) in total
=> sparse form (1,3,a,adl,b,dada,dada2,php,tw)

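The tokenization example on this slide can be reproduced with a short sketch. This is only an illustration of splitting on non-alphanumeric characters, not Monarch's actual extractor: the function names `tokenize`, `url_features`, and `sparse_vector` are hypothetical, and canonicalization and obfuscation removal are left out. Like the slide, it merges tokens from all URL parts into one list; a real system might keep them as separate feature namespaces.

```python
import re
from urllib.parse import urlsplit

# Any run of characters that is not a letter or digit acts as a separator.
TOKEN_SPLIT = re.compile(r"[^0-9A-Za-z]+")


def tokenize(text):
    """Split text on non-alphanumeric characters, dropping empty tokens."""
    return [tok for tok in TOKEN_SPLIT.split(text) if tok]


def url_features(url):
    """Tokenize the domain, path, and query string of a URL separately."""
    parts = urlsplit(url)
    return {
        "domain": tokenize(parts.netloc),
        "path": tokenize(parts.path),
        "query": tokenize(parts.query),
    }


def sparse_vector(url):
    """List only the active (true) features, sorted, as on the slide."""
    feats = url_features(url)
    return sorted({tok for toks in feats.values() for tok in toks})


example = "http://adl.tw/~dada/dada2.php?a=1&b=3"
print(url_features(example))
print(sparse_vector(example))
```

Storing only the active tokens is what keeps a vector with tens of millions of possible dimensions small: each URL activates only a handful of them.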
Distributed Classifier Design
Linear classification
◦ x: feature vector
◦ Determine a weight vector w
A parallel online learner
◦ With regularization to yield a sparse weight vector
Labeled data (x_i, y_i), Testing => sign(x · w)
-1 => non-spam site, 1 => spam site

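As a rough illustration of linear classification over a sparse binary feature vector, the sketch below sums the weights of the active features and takes the sign. The weights are made-up values for demonstration only, and mapping a score of exactly zero to -1 is an assumption of this sketch.

```python
def score(active_features, weights):
    """Dot product x . w when x is binary: add up the weights of active features."""
    return sum(weights.get(f, 0.0) for f in active_features)


def classify(active_features, weights):
    """Return 1 (spam) for a positive score, otherwise -1 (non-spam)."""
    return 1 if score(active_features, weights) > 0 else -1


# Hypothetical weights; a sparse weight vector stores only non-zero entries.
weights = {"php": 0.2, "dada2": 1.1, "tw": -0.4}
x = ["1", "3", "a", "adl", "b", "dada", "dada2", "php", "tw"]

print(score(x, weights), classify(x, weights))
```

Because both the feature vector and the weight vector are sparse, the cost of classifying one URL depends on the number of active features rather than on the full dimensionality.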
Training the weight vector
Logistic Regression
◦ With subgradient L1-Regularization
$f(w) = \sum_i \log\left(1 + e^{-y_i (x_i \cdot w)}\right) + \lambda \lVert w \rVert_1$
y_i (x_i · w) larger => f(w) smaller
(Classification margin, hyperplane)

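To make the objective concrete, here is a minimal single-machine sketch of L1-regularized logistic regression with one subgradient step. It is not the distributed learner from the talk: the names `objective` and `subgradient_step`, the learning rate `eta`, and the regularization strength `lam` are assumptions chosen for illustration, and sparse vectors are represented as plain dictionaries.

```python
import math


def dot(x, w):
    """Sparse dot product between a feature dict x and a weight dict w."""
    return sum(v * w.get(f, 0.0) for f, v in x.items())


def objective(examples, w, lam):
    """Sum of logistic losses plus lam times the L1 norm of w."""
    loss = sum(math.log1p(math.exp(-y * dot(x, w))) for x, y in examples)
    return loss + lam * sum(abs(v) for v in w.values())


def sign(v):
    return (v > 0) - (v < 0)


def subgradient_step(x, y, w, lam, eta):
    """One subgradient update on a single example (x, y), with y in {-1, 1}."""
    margin = y * dot(x, w)
    coeff = -y / (1.0 + math.exp(margin))  # derivative of the logistic loss
    new_w = dict(w)
    for f in set(w) | set(x):
        # sign(0) = 0 is one valid choice of subgradient for |w_f| at zero.
        g = coeff * x.get(f, 0.0) + lam * sign(w.get(f, 0.0))
        new_w[f] = w.get(f, 0.0) - eta * g
    return new_w


examples = [({"a": 1.0}, 1)]
w0 = {}
w1 = subgradient_step(examples[0][0], examples[0][1], w0, lam=0.01, eta=0.1)
print(objective(examples, w0, 0.01), objective(examples, w1, 0.01))
```

The update raises the margin y_i (x_i · w) on the example, which lowers f(w). A plain subgradient step like this rarely drives weights to exactly zero, so practical sparse learners usually add a truncation or proximal step.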
Distributed Classifier Algorithm
[Algorithm figure not recovered; surviving annotations: m, 10, 1, 100, I]

Data Set and Assumption
1.25 million spam email URLs
567,784 spam Twitter URLs
9 million non-spam Twitter URLs
Checking all Twitter URLs against:
◦ Google Safebrowsing, SURBL, URIBL, APWG, Phishtank
◦ Spam if any of its source URLs becomes blacklisted

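The labeling rule above can be sketched as follows. This is only an illustration: a local set of hostnames stands in for the real blacklists, each of which has its own lookup interface, and the names `is_blacklisted` and `label`, as well as the example hosts, are hypothetical.

```python
from urllib.parse import urlsplit


def is_blacklisted(url, blacklisted_hosts):
    """Check a URL's hostname against a set of blacklisted hostnames."""
    host = urlsplit(url).hostname  # already lower-cased by urlsplit
    return host is not None and host in blacklisted_hosts


def label(source_urls, blacklisted_hosts):
    """Return 1 (spam) if any source URL is blacklisted, otherwise -1."""
    hit = any(is_blacklisted(u, blacklisted_hosts) for u in source_urls)
    return 1 if hit else -1


# Hypothetical blacklist and redirect chain.
blacklist = {"spam.example"}
chain = ["http://short.example/abc", "http://SPAM.example/landing"]
print(label(chain, blacklist))
```

Checking every source URL rather than only the first one matters because the URL that appears in a tweet is often a shortener or redirector that is not itself blacklisted.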
Data Set and Assumption (cont.)
On Twitter:
◦ 36% scams, 60% phishing, 4% malware

After regularization

Implementation
Amazon Web Services (AWS) infrastructure
URL Aggregation
◦ A queue that keeps 300,000 URLs
Feature Collection
◦ 20x6 Firefox (4.0b4) on Ubuntu 10.04, with a custom extension
◦ Firefox’s NPAPI, Linux’s “host” command, MaxMind GeoIP library and Route Views
Classifier
◦ Hadoop Distributed File System
◦ On the 50-node cluster

Evaluation – Overall Accuracy
5-fold cross-validation
500,000 spam and 500,000 non-spam examples
Training set size of 400,000 examples
◦ 1:1, 4:1, 10:1
Testing set size of 200,000 examples
◦ 1:1

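The class sizes implied by these ratios follow from simple arithmetic, sketched below. The slide does not say which class comes first in each ratio; this sketch assumes non-spam:spam, and `class_counts` is a hypothetical helper.

```python
def class_counts(total, nonspam_parts, spam_parts):
    """Split `total` examples into (non-spam, spam) counts for a given ratio."""
    spam = round(total * spam_parts / (nonspam_parts + spam_parts))
    return total - spam, spam


for ratio in [(1, 1), (4, 1), (10, 1)]:
    print(ratio, class_counts(400_000, *ratio))
```

Under that assumption, the training sets become progressively more skewed toward non-spam, while the 1:1 testing set stays at 100,000 examples per class.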
Evaluation – Single Feature

Evaluation – Accuracy Over Time
Training once only <-> Retraining every four days

Evaluation – Comparing Email and Tweet Spam
Log odds ratio: $\left|\log(p_1 q_2 / p_2 q_1)\right|$, where $q_i = 1 - p_i$

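A small sketch of the log odds ratio follows. It assumes that p_1 and p_2 are the probabilities that a given feature appears in each of the two data sets (email spam and tweet spam), which the slide does not spell out; `log_odds_ratio` is a hypothetical name, and both probabilities must lie strictly between 0 and 1.

```python
import math


def log_odds_ratio(p1, p2):
    """Absolute log odds ratio |log(p1*q2 / (p2*q1))| with q_i = 1 - p_i."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    return abs(math.log((p1 * q2) / (p2 * q1)))


print(log_odds_ratio(0.5, 0.5))  # equally common in both sets
print(log_odds_ratio(0.8, 0.2))  # much more common in the first set
```

The absolute value makes the measure symmetric: it reports how strongly a feature favors one data set over the other, regardless of which one.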
Evaluation – The Cost
For Twitter, $22,751 per month

Discussion and Conclusion
Evasion
◦ Feature Evasion
◦ Time-based Evasion
◦ Crawler Evasion
Monarch
◦ Real-time system
◦ Spam URL Filtering as a Service
◦ $22,751 a month