Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer...

Learning to Detect Malicious URLs

Justin Ma, Lawrence Saul, Stefan Savage, Geoff VoelkerComputer Science & Engineering

UC San Diego

Presentation for GoogleMay 5, 2010

Malicious Web Sites

Spam-advertised goods

Trojan downloads

Phishing: which one is real?

3

Visiting Malicious Web Sites

Predict what is safe without

committing to risky actions

• Safe URL?

• Malicious download?

• Spam-advertised site?

• Phishing site?

URL = Uniform Resource Locator

http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll

http://fblight.com

http://mail.ru

http://www.cs.ucsd.edu/~jtma/


http://fblight.com/

4

Problem in a Nutshell URL features to identify malicious Web sites

No context, no content

Different classes of URLs Benign, spam, phishing, exploits, scams... For now, distinguish benign vs. malicious

facebook.com fblight.com

5

What we want...

6

How to build this service?http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll

Blacklist

Hand-picked features (properties of URL)

Machine learning-based classifier

7

State of the Practice

Current approaches Blacklists [SORBS,

URIBL, SURBL, Spamhaus, SiteAdvisor, WOT, IronPort, WebSense]

Learning on hand-tuned features [Kan & Thi '05, Garera et al. '07, CANTINA, Guan et al. '09]

Limitations Cannot learn from newest examples quickly Cannot quickly adapt to newest features

Arms race: fast feedback cycle is critical

More automated approach?

8

Contributions

URL classification system Used by large Web mail providers

Practical large-scale machine learning in computer security

Large feature sets + online learning Related work: Whittaker, Ryner, Nazif, “Large-Scale

Automatic Classification of Phishing Pages” (NDSS'10)

9

Today's Talk

Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion

10

Today's Talk


11

Moving Beyond Blacklists

Preliminary study: Do our features work? Batch algorithms for now 104 examples and features

Outline System overview Features ← focus of this segment Experimental results

[Ma, Saul, Savage, Voelker (KDD 2009)]

12

URL Classification System

Label Example Hypothesis

13

Data Sets

Malicious URLs 5,000 from PhishTank (phishing) 15,000 from Spamscatter (spam, phishing, etc)

Benign URLs 15,000 from Yahoo Web directory 15,000 from DMOZ directory

Malicious x Benign → 4 Data Sets 30,000 – 55,000 features per data set

14

Algorithms

Logistic regression w/ L1-norm regularization

Other models Naive Bayes Support vector machines (linear, RBF kernels)

Implicit feature selection Easier to interpret

15

Features

Example

16

Feature Vector Constructionhttp://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll

WHOIS registration: 3/25/2009Hosted from 208.78.240.0/22IP hosted in San MateoConnection speed: T1Has DNS PTR record? YesRegistrant “Chad”...

[ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …]Real-valued Host-based Lexical

17

Features to consider?

1)Blacklists

2)Simple heuristics

3)Domain name registration

4)Host properties

5)Lexical

18

(1) Blacklist Queries List of known malicious sites Providers: SORBS, URIBL, SURBL, Spamhaus Not comprehensive

http://www.bfuduuioo1fp.mobiIn blacklist?

Yes

http://fblight.com

No

In blacklist?

http://www.bfuduuioo1fp.mobi

Blacklist queries as features

........................................

........................................

19

stopgap.cn registered 2 May 2010

(2) Manually-Selected Features

Considered by previous studies IP address in hostname? Number of dots in URL WHOIS (domain name) registration date

[Fette et al., 2007][Zhang et al., 2007][Bergholz et al., 2008]

http://72.23.5.122/www.bankofamerica.com/

http://www.bankofamerica.com.qytrpbcw.stopgap.cn/

20

(3) WHOIS Features

Domain name registration Date of registration, update, expiration Registrant: Who registered domain? Registrar: Who manages registration?

http://sleazysalmon.com

http://angryalbacore.com

http://mangymackerel.com

http://yammeringyellowtail.com

Registered on4 May 2010

By SpamMedia

21

(4) Host-Based Features

Blacklisted? (SORBS, URIBL, SURBL, Spamhaus)

WHOIS: registrar, registrant, dates

IP address: Which ASes/IP prefixes?

DNS: TTL? PTR record exists/resolves?

Geography-related: Locale? Connection speed?

75.102.60.0/2269.63.176.0/20

facebook.com fblight.com

Bad Parts of the Internet

22

(5) Lexical Features

Tokens in URL hostname + path Length of URL Number of dots


[Kolari et al., 2006]

23

Which feature sets?

Blacklist

Manual

WHOIS

Host-based

Lexical

4,000

# Features

13,000

4

7

17,000

More features → Better accuracy

Error rate (%)

24

Which feature sets?

Blacklist

Manual

WHOIS

Host-based

Lexical

Full 96—99% accuracy

4,000

# Features

13,000

4

7

17,000

30,000

Error rate (%)

25

Which feature sets?

Blacklist

Manual

WHOIS

Host-based

Lexical

Full

w/o WHOIS/Blacklist

4,000

# Features

13,000

4

7

17,000

30,000

26,000

Error rate (%)

Blacklists and WHOIS are not comprehensive

26

Beyond Blacklists

Blacklist

Full features

Yah

oo-P

his

hTan

k

Higher detection rate for given false positive rate

27

Summary

Detect malicious URLs with high accuracy Only using URL Diverse feature set helps: 99% w/ 30,000+ features Model analysis (more in KDD'09 paper)

What about scalable and adaptive malicious URL detection system?

28

Today's Talk


29

URL Classification System


30

Live URL Classification System


31

Large-Scale Online Learning

How do we scale to live, large-scale data? Outline

Live training feed Challenges: scale and non-stationarity Need for large, fresh training sets Online learning

[Ma, Saul, Savage, Voelker (ICML 2009)]

32

Live Training Feed

Malicious URLs (spamming and phishing) 6,000—7,500 per day from Web mail provider

Benign URLs From Yahoo Web directory

Total of 20,000 URLs per day Collected Jan – May, 2009

120 days More than two million examples

33

Feature Vector Constructionhttp://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll

WHOIS registration: 3/25/2009Hosted from 208.78.240.0/22IP hosted in San MateoConnection speed: T1Has DNS PTR record? YesRegistrant “Chad”...

[ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …]Real-valued Host-based Lexical

60+ features 1.1 million 1.8 millionGROWING

Day 100

34

New Features Encountered All the Time

Many binary features Enumerating tokens, ISPs, registrants, etc... 2.9 million features after 100 days

35

Live URL Classficiation System

Online learning

36

Practical Challenges of ML in Systems

Industrial concerns Scale: millions of examples, features Non-stationarity: examples change over time (arms

race w/ criminals)

Pivotal decision: batch or online?

37

Batch vs. Online Learning

Batch/offline learning SVM, logistic regression,

decision trees, etc Multiple passes over

data No incremental updates Potentially high memory

and processing overhead

Online learning Perceptron-style

algorithms Single pass over data Incremental updates Low memory and

processing overhead

Online learning addresses scale and non-stationarity

38

Evaluations

Online learning for URL reputation Need for large, fresh training sets Comparing online algorithms Continuous retraining Growing feature vector

39

Need lots of fresh training data?SVM trained once

SVM retrained daily

Fresh data helps

40

Need lots of fresh training data?

Fresh data helps More data helps

SVM trained once on 2 weeks

SVM w/ 2-week sliding window

41

Which online algorithm?

Perceptron

Stochastic Gradient Descent for Logistic Regression

Confidence-Weighted Learning

42

Perceptron

Convergence result:

[Rosenblatt, 1958]

+−+ +

+

+

+−

−

−

−

−

Update on each mistake:

radius

margin

Number of mistakes ≤

43

Logistic Regression with SGD

Log likelihood:

where

[Bottou, 1998]

For every example:

Proportional

44

Confidence-Weighted Learning

Maintain Gaussian distribution over weight vector:

[Dredze et al., 2008] [Crammer et al., 2009]

Constrained problem:

Closed-form update:Update features at different rates

Diagonal covariance matrix

for evals

45

Which online algorithms?

Perceptron

46


LR w/ SGD

Proportional update helps

Perceptron

47


Proportional update helps Per-feature confidence really helps

Confidence-Weighted

LR w/ SGD

Perceptron

48

Batch...

Fresh data helps More data helps

Batch

49

Batch vs. Online

Confidence-Weighted

Batch

Fresh data helps More data helps Online matches batch

50

Why online does well?


Confidence-Weighted

51

Why online does well?

Confidence-Weightedonce-a-day

More data eventually helps Continuous retraining helps


Confidence-Weighted

52

Growing feature vector?

Confidence-Weightedfixed features

53

Growing feature vector?

Confidence-Weightedfixed features

Confidence-Weightedgrowing features

Growing feature vector helps

54

Fixed vs. Variable Features

Perceptronfixed features

Growing feature vector helps

55

Fixed vs. Variable Features

Perceptronfixed features

Growing feature vector helps Growing + CW really helps

Perceptrongrowing features

56

Proof-of-Concept Plugin

57

Examine URL details...

58

It sure looks like phishing...The Real Site

59

Blacklisted later on

60

Summary

Detecting malicious URLs Relevant real-world problem Successful application of online learning

What helps? More, fresher data Continuous retraining Growing feature vector

Confidence-Weighted vs. Batch As accurate More adaptive Fewer resources

61

Today's Talk


62

Impact

Public data set (UCI ML repo) Industrial impact

Mail providers have adopted our approach for classifying URLs in email messages

Project Info + Data Sethttp://sysnet.ucsd.edu/projects/url/

63

Final Thoughts: Systems + ML

Systems: high-impact, large-scale applications ML: Methodical approaches Systems: Embrace real-world constraints ML: More than “plug-and-play” solutions

Systems ↔ Machine Learning

Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer...

Documents

Transcript of Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer...