Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer...

63
Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google May 5, 2010

Transcript of Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer...

Page 1: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

Learning to Detect Malicious URLs

Justin Ma, Lawrence Saul, Stefan Savage, Geoff VoelkerComputer Science & Engineering

UC San Diego

Presentation for GoogleMay 5, 2010

Page 2: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

Malicious Web Sites

Spam-advertised goods

Trojan downloads

Phishing: which one is real?

Page 3: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

3

Visiting Malicious Web Sites

Predict what is safe without

committing to risky actions

• Safe URL?

• Malicious download?

• Spam-advertised site?

• Phishing site?

URL = Uniform Resource Locator

http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll

http://fblight.com

http://mail.ru

http://www.cs.ucsd.edu/~jtma/

Page 4: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

4

Problem in a Nutshell URL features to identify malicious Web sites

No context, no content

Different classes of URLs Benign, spam, phishing, exploits, scams... For now, distinguish benign vs. malicious

facebook.com fblight.com

Page 5: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

5

What we want...

Page 6: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

6

How to build this service?http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll

Blacklist

Hand-picked features (properties of URL)

Machine learning-based classifier

Page 7: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

7

State of the Practice

Current approaches Blacklists [SORBS,

URIBL, SURBL, Spamhaus, SiteAdvisor, WOT, IronPort, WebSense]

Learning on hand-tuned features [Kan & Thi '05, Garera et al. '07, CANTINA, Guan et al. '09]

Limitations Cannot learn from newest examples quickly Cannot quickly adapt to newest features

Arms race: fast feedback cycle is critical

More automated approach?

Page 8: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

8

Contributions

URL classification system Used by large Web mail providers

Practical large-scale machine learning in computer security

Large feature sets + online learning Related work: Whittaker, Ryner, Nazif, “Large-Scale

Automatic Classification of Phishing Pages” (NDSS'10)

Page 9: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

9

Today's Talk

Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion

Page 10: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

10

Today's Talk

Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion

Page 11: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

11

Moving Beyond Blacklists

Preliminary study: Do our features work? Batch algorithms for now 104 examples and features

Outline System overview Features ← focus of this segment Experimental results

[Ma, Saul, Savage, Voelker (KDD 2009)]

Page 12: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

12

URL Classification System

Label Example Hypothesis

Page 13: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

13

Data Sets

Malicious URLs 5,000 from PhishTank (phishing) 15,000 from Spamscatter (spam, phishing, etc)

Benign URLs 15,000 from Yahoo Web directory 15,000 from DMOZ directory

Malicious x Benign → 4 Data Sets 30,000 – 55,000 features per data set

Page 14: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

14

Algorithms

Logistic regression w/ L1-norm regularization

Other models Naive Bayes Support vector machines (linear, RBF kernels)

Implicit feature selection Easier to interpret

Page 15: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

15

Features

Example

Page 16: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

16

Feature Vector Constructionhttp://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll

WHOIS registration: 3/25/2009Hosted from 208.78.240.0/22IP hosted in San MateoConnection speed: T1Has DNS PTR record? YesRegistrant “Chad”...

[ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …]Real-valued Host-based Lexical

Page 17: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

17

Features to consider?

1)Blacklists

2)Simple heuristics

3)Domain name registration

4)Host properties

5)Lexical

Page 18: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

18

(1) Blacklist Queries List of known malicious sites Providers: SORBS, URIBL, SURBL, Spamhaus Not comprehensive

http://www.bfuduuioo1fp.mobiIn blacklist?

Yes

http://fblight.com

No

In blacklist?

http://www.bfuduuioo1fp.mobi

Blacklist queries as features

........................................

........................................

Page 19: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

19

stopgap.cn registered 2 May 2010

(2) Manually-Selected Features

Considered by previous studies IP address in hostname? Number of dots in URL WHOIS (domain name) registration date

[Fette et al., 2007][Zhang et al., 2007][Bergholz et al., 2008]

http://72.23.5.122/www.bankofamerica.com/

http://www.bankofamerica.com.qytrpbcw.stopgap.cn/

Page 20: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

20

(3) WHOIS Features

Domain name registration Date of registration, update, expiration Registrant: Who registered domain? Registrar: Who manages registration?

http://sleazysalmon.com

http://angryalbacore.com

http://mangymackerel.com

http://yammeringyellowtail.com

Registered on4 May 2010

By SpamMedia

Page 21: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

21

(4) Host-Based Features

Blacklisted? (SORBS, URIBL, SURBL, Spamhaus)

WHOIS: registrar, registrant, dates

IP address: Which ASes/IP prefixes?

DNS: TTL? PTR record exists/resolves?

Geography-related: Locale? Connection speed?

75.102.60.0/2269.63.176.0/20

facebook.com fblight.com

Bad Parts of the Internet

Page 22: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

22

(5) Lexical Features

Tokens in URL hostname + path Length of URL Number of dots

http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll

[Kolari et al., 2006]

Page 23: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

23

Which feature sets?

Blacklist

Manual

WHOIS

Host-based

Lexical

4,000

# Features

13,000

4

7

17,000

More features → Better accuracy

Error rate (%)

Page 24: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

24

Which feature sets?

Blacklist

Manual

WHOIS

Host-based

Lexical

Full 96—99% accuracy

4,000

# Features

13,000

4

7

17,000

30,000

Error rate (%)

Page 25: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

25

Which feature sets?

Blacklist

Manual

WHOIS

Host-based

Lexical

Full

w/o WHOIS/Blacklist

4,000

# Features

13,000

4

7

17,000

30,000

26,000

Error rate (%)

Blacklists and WHOIS are not comprehensive

Page 26: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

26

Beyond Blacklists

Blacklist

Full features

Yah

oo-P

his

hTan

k

Higher detection rate for given false positive rate

Page 27: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

27

Summary

Detect malicious URLs with high accuracy Only using URL Diverse feature set helps: 99% w/ 30,000+ features Model analysis (more in KDD'09 paper)

What about scalable and adaptive malicious URL detection system?

Page 28: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

28

Today's Talk

Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion

Page 29: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

29

URL Classification System

Label Example Hypothesis

Page 30: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

30

Live URL Classification System

Label Example Hypothesis

Page 31: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

31

Large-Scale Online Learning

How do we scale to live, large-scale data? Outline

Live training feed Challenges: scale and non-stationarity Need for large, fresh training sets Online learning

[Ma, Saul, Savage, Voelker (ICML 2009)]

Page 32: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

32

Live Training Feed

Malicious URLs (spamming and phishing) 6,000—7,500 per day from Web mail provider

Benign URLs From Yahoo Web directory

Total of 20,000 URLs per day Collected Jan – May, 2009

120 days More than two million examples

Page 33: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

33

Feature Vector Constructionhttp://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll

WHOIS registration: 3/25/2009Hosted from 208.78.240.0/22IP hosted in San MateoConnection speed: T1Has DNS PTR record? YesRegistrant “Chad”...

[ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …]Real-valued Host-based Lexical

60+ features 1.1 million 1.8 millionGROWING

Day 100

Page 34: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

34

New Features Encountered All the Time

Many binary features Enumerating tokens, ISPs, registrants, etc... 2.9 million features after 100 days

Page 35: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

35

Live URL Classficiation System

Online learning

Page 36: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

36

Practical Challenges of ML in Systems

Industrial concerns Scale: millions of examples, features Non-stationarity: examples change over time (arms

race w/ criminals)

Pivotal decision: batch or online?

Page 37: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

37

Batch vs. Online Learning

Batch/offline learning SVM, logistic regression,

decision trees, etc Multiple passes over

data No incremental updates Potentially high memory

and processing overhead

Online learning Perceptron-style

algorithms Single pass over data Incremental updates Low memory and

processing overhead

Online learning addresses scale and non-stationarity

Page 38: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

38

Evaluations

Online learning for URL reputation Need for large, fresh training sets Comparing online algorithms Continuous retraining Growing feature vector

Page 39: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

39

Need lots of fresh training data?SVM trained once

SVM retrained daily

Fresh data helps

Page 40: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

40

Need lots of fresh training data?

Fresh data helps More data helps

SVM trained once on 2 weeks

SVM w/ 2-week sliding window

Page 41: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

41

Which online algorithm?

Perceptron

Stochastic Gradient Descent for Logistic Regression

Confidence-Weighted Learning

Page 42: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

42

Perceptron

Convergence result:

[Rosenblatt, 1958]

+−+ +

+

+

+−

Update on each mistake:

radius

margin

Number of mistakes ≤

Page 43: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

43

Logistic Regression with SGD

Log likelihood:

where

[Bottou, 1998]

For every example:

Proportional

Page 44: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

44

Confidence-Weighted Learning

Maintain Gaussian distribution over weight vector:

[Dredze et al., 2008] [Crammer et al., 2009]

Constrained problem:

Closed-form update:Update features at different rates

Diagonal covariance matrix

for evals

Page 45: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

45

Which online algorithms?

Perceptron

Page 46: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

46

Which online algorithms?

LR w/ SGD

Proportional update helps

Perceptron

Page 47: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

47

Which online algorithms?

Proportional update helps Per-feature confidence really helps

Confidence-Weighted

LR w/ SGD

Perceptron

Page 48: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

48

Batch...

Fresh data helps More data helps

Batch

Page 49: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

49

Batch vs. Online

Confidence-Weighted

Batch

Fresh data helps More data helps Online matches batch

Page 50: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

50

Why online does well?

SVM w/ 2-week sliding window

Confidence-Weighted

Page 51: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

51

Why online does well?

Confidence-Weightedonce-a-day

More data eventually helps Continuous retraining helps

SVM w/ 2-week sliding window

Confidence-Weighted

Page 52: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

52

Growing feature vector?

Confidence-Weightedfixed features

Page 53: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

53

Growing feature vector?

Confidence-Weightedfixed features

Confidence-Weightedgrowing features

Growing feature vector helps

Page 54: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

54

Fixed vs. Variable Features

Perceptronfixed features

Growing feature vector helps

Page 55: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

55

Fixed vs. Variable Features

Perceptronfixed features

Growing feature vector helps Growing + CW really helps

Perceptrongrowing features

Page 56: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

56

Proof-of-Concept Plugin

Page 57: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

57

Examine URL details...

Page 58: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

58

It sure looks like phishing...The Real Site

Page 59: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

59

Blacklisted later on

Page 60: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

60

Summary

Detecting malicious URLs Relevant real-world problem Successful application of online learning

What helps? More, fresher data Continuous retraining Growing feature vector

Confidence-Weighted vs. Batch As accurate More adaptive Fewer resources

Page 61: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

61

Today's Talk

Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion

Page 62: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

62

Impact

Public data set (UCI ML repo) Industrial impact

Mail providers have adopted our approach for classifying URLs in email messages

Project Info + Data Sethttp://sysnet.ucsd.edu/projects/url/

Page 63: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

63

Final Thoughts: Systems + ML

Systems: high-impact, large-scale applications ML: Methodical approaches Systems: Embrace real-world constraints ML: More than “plug-and-play” solutions

Systems ↔ Machine Learning