Malicious url detection using machine learning

Page 1: Malicious url detection using machine learning

Beyond Blacklists: Malicious URL Detection Using Machine Learning

Page 2: Malicious url detection using machine learning

Who am I?
• Information security investigator @ Cisco.
• Completed M.Tech from IIT Jodhpur in 2014.
• Areas of interest include machine learning, computer vision and A.I.
• Email: [email protected]

Page 3: Malicious url detection using machine learning

Malicious websites

Phishing: which one is real?

Page 4: Malicious url detection using machine learning

Visiting Malicious Websites

Page 5: Malicious url detection using machine learning

What do we want?

Page 6: Malicious url detection using machine learning

Problem in a Nutshell
• URL features to identify malicious Web sites
• No context, no content
• Different classes of URLs: benign, spam, phishing, exploits, scams...
• For now, distinguish benign vs. malicious

facebook.com vs. fblight.com

Page 7: Malicious url detection using machine learning

Information about new websites

Page 8: Malicious url detection using machine learning

State of the Practice

Current approaches
• Blacklists [SORBS, URIBL, SURBL, Spamhaus]
• Learning on hand-tuned features [Garera et al., 2007]

Limitations
• Cannot predict unlisted sites
• Cannot account for new features

Arms race: fast feedback cycle is critical
More automated approach?

Page 9: Malicious url detection using machine learning

URL Classification System

[Diagram: Label, Example, Hypothesis]

Page 10: Malicious url detection using machine learning

Data Sets

Malicious URLs
• 5,000 from PhishTank (phishing)
• 15,000 from Spamscatter (spam, phishing, etc.)

Benign URLs
• 15,000 from Yahoo Web directory
• 15,000 from DMOZ directory

Malicious x Benign → 4 data sets
30,000 – 55,000 features per data set
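The "Malicious x Benign → 4 data sets" pairing can be sketched as a cross product of the two source lists. This is only an illustration: the function name is hypothetical, and the "Benign-Malicious" naming is an assumption modelled on the "Yahoo-PhishTank" label used later in the deck.

```python
from itertools import product

# Source names as listed on the slide.
BENIGN_SOURCES = ["Yahoo", "DMOZ"]
MALICIOUS_SOURCES = ["PhishTank", "Spamscatter"]


def pair_data_sets(benign, malicious):
    """Return one data-set name per (benign, malicious) source pair."""
    return [f"{b}-{m}" for b, m in product(benign, malicious)]


data_sets = pair_data_sets(BENIGN_SOURCES, MALICIOUS_SOURCES)
for name in data_sets:
    print(name)
```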

Page 11: Malicious url detection using machine learning

Algorithms

Logistic regression w/ L1-norm regularization
• Implicit feature selection
• Easier to interpret

Other models
• Naive Bayes
• Support vector machines (linear, RBF kernels)
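To illustrate why L1-norm regularization gives implicit feature selection, here is a minimal sketch that trains logistic regression with a proximal-gradient (soft-thresholding) update. The optimizer, the toy data and the token names are all assumptions for illustration; the slides do not say how the model was actually trained.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def train_l1_logistic(rows, labels, lam=0.05, lr=0.1, epochs=500):
    """Batch proximal gradient descent on log-loss + lam * ||w||_1 (no bias term)."""
    n, d = len(rows), len(rows[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for x, y in zip(rows, labels):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j in range(d):
                grad[j] += (p - y) * x[j] / n
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


# Hypothetical binary token features: the third appears equally in both classes.
feature_names = ["token:login", "token:news", "token:www"]
rows = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
labels = [1, 1, 0, 0]  # 1 = malicious, 0 = benign

weights = train_l1_logistic(rows, labels)
for name, weight in zip(feature_names, weights):
    print(name, round(weight, 3))
```

In this toy setup the uninformative token keeps a weight of exactly zero, so it drops out of the model, while the sign of each remaining weight shows which class a token points to.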

Page 12: Malicious url detection using machine learning

Feature vector construction
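One common way to build such feature vectors is to give every distinct feature name its own column and store each URL as a sparse list of (column, value) pairs. The sketch below assumes that representation; the helper names and the example features are hypothetical and not taken from the slides.

```python
def build_vocabulary(feature_dicts):
    """Assign a column index to every feature name, in first-seen order."""
    vocabulary = {}
    for features in feature_dicts:
        for name in features:
            vocabulary.setdefault(name, len(vocabulary))
    return vocabulary


def to_sparse_vector(features, vocabulary):
    """Return sorted (column, value) pairs; unseen feature names are skipped."""
    return sorted(
        (vocabulary[name], float(value))
        for name, value in features.items()
        if name in vocabulary
    )


# Hypothetical per-URL feature dictionaries.
url_features = [
    {"host_token:www": 1, "host_token:mobi": 1, "dots": 2},
    {"host_token:www": 1, "host_token:com": 1, "dots": 2},
]
vocabulary = build_vocabulary(url_features)
vectors = [to_sparse_vector(f, vocabulary) for f in url_features]
print(vocabulary)
print(vectors)
```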

Page 13: Malicious url detection using machine learning

Features to consider?
1) Blacklists
2) Simple heuristics
3) Domain name registration
4) Host properties
5) Lexical

Page 14: Malicious url detection using machine learning

(1) Blacklist Queries
• List of known malicious sites
• Providers: SORBS, URIBL, SURBL, Spamhaus

In blacklist?
http://www.bfuduuioo1fp.mobi → Yes
http://fblight.com → No

Blacklist queries as features
http://www.bfuduuioo1fp.mobi
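Turning blacklist queries into features can be sketched as one binary feature per provider. The sketch below does not contact any real blacklist service: `listings` is a hypothetical local snapshot, and which provider lists which host is made up for illustration.

```python
from urllib.parse import urlparse

# Providers named on the slide.
BLACKLIST_PROVIDERS = ("SORBS", "URIBL", "SURBL", "Spamhaus")


def blacklist_features(url, listings):
    """Return one 0/1 feature per provider: is the URL's hostname listed?"""
    host = urlparse(url).hostname or ""
    return {
        f"blacklist:{provider}": int(host in listings.get(provider, set()))
        for provider in BLACKLIST_PROVIDERS
    }


# Hypothetical snapshot of listed hostnames (illustrative data only).
listings = {
    "SURBL": {"www.bfuduuioo1fp.mobi"},
    "URIBL": {"www.bfuduuioo1fp.mobi"},
}

print(blacklist_features("http://www.bfuduuioo1fp.mobi", listings))
print(blacklist_features("http://fblight.com", listings))
```

A site that no provider lists yields an all-zero vector, which is why blacklist features alone cannot flag unlisted sites.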

Page 15: Malicious url detection using machine learning

(2) Manually-Selected Features
Considered by previous studies
• IP address in hostname?
• Number of dots in URL
• WHOIS (domain name) registration date

http://72.23.5.122/www.bankofamerica.com/
http://www.bankofamerica.com.qytrpbcw.stopgap.cn/
stopgap.cn registered 28 June 2009
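The first two heuristics can be computed from the URL string alone. A minimal sketch, using the two example URLs from the slide; the function names are hypothetical, and counting dots over the whole URL string is one possible reading of "number of dots in URL".

```python
import ipaddress
from urllib.parse import urlparse


def hostname_is_ip(url):
    """True if the hostname part of the URL is a literal IP address."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def count_dots(url):
    """Number of '.' characters in the whole URL string."""
    return url.count(".")


for url in (
    "http://72.23.5.122/www.bankofamerica.com/",
    "http://www.bankofamerica.com.qytrpbcw.stopgap.cn/",
):
    print(url, hostname_is_ip(url), count_dots(url))
```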

Page 16: Malicious url detection using machine learning

(3) WHOIS Features
Domain name registration
• Date of registration, update, expiration
• Registrant: Who registered domain?
• Registrar: Who manages registration?

http://sleazysalmon.com
http://angryalbacore.com
http://mangymackerel.com
http://yammeringyellowtail.com
Registered on 29 June 2009 by SpamMedia
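A WHOIS record can be turned into features by treating registrant and registrar as categorical indicators and dates as ages in days. The sketch below assumes the record has already been fetched and parsed into a dictionary; the field names, the feature naming and the observation date are hypothetical.

```python
from datetime import date


def whois_features(record, observed_on):
    """Map a parsed WHOIS record (a dict) to named features."""
    features = {}
    # Categorical fields become one indicator feature per distinct value.
    for field in ("registrant", "registrar"):
        value = record.get(field)
        if value:
            features[f"whois:{field}={value}"] = 1
    # Date fields become a signed day count relative to the observation date.
    for field in ("registered", "updated", "expires"):
        value = record.get(field)
        if value is not None:
            features[f"whois:{field}_age_days"] = (observed_on - value).days
    return features


# Registration details from the slide; the observation date is made up.
record = {"registrant": "SpamMedia", "registered": date(2009, 6, 29)}
features = whois_features(record, observed_on=date(2009, 7, 9))
print(features)
```

Domains registered in bulk by the same registrant on the same day share these features, which lets a classifier generalize from one known bad domain to its siblings.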

Page 17: Malicious url detection using machine learning

(4) Host-Based Features
• Blacklisted? (SORBS, URIBL, SURBL, Spamhaus)
• WHOIS: registrar, registrant, dates
• IP address: Which ASes/IP prefixes?
• DNS: TTL? PTR record exists/resolves?
• Geography-related: Locale? Connection speed?

75.102.60.0/22    69.63.176.0/20
facebook.com    fblight.com
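The "which IP prefix?" question can be sketched as a membership test against a set of known prefixes. The two prefixes below are the ones shown on the slide; the IP addresses passed in are hypothetical, and resolving a hostname to its IP is assumed to have happened elsewhere.

```python
import ipaddress

# Prefixes shown on the slide.
KNOWN_PREFIXES = [
    ipaddress.ip_network("75.102.60.0/22"),
    ipaddress.ip_network("69.63.176.0/20"),
]


def prefix_features(ip_text, prefixes=KNOWN_PREFIXES):
    """Return one 0/1 feature per known prefix: does the host IP fall inside it?"""
    ip = ipaddress.ip_address(ip_text)
    return {f"prefix:{network}": int(ip in network) for network in prefixes}


# Hypothetical host IPs, chosen only to fall inside each prefix.
print(prefix_features("69.63.176.13"))
print(prefix_features("75.102.63.200"))
```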

Page 18: Malicious url detection using machine learning

(5) Lexical Features
• Tokens in URL hostname + path
• Length of URL
• Entropy of the domain name

http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
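The three lexical features can be sketched with the standard library. The token delimiters, the feature naming and the choice to compute character-level Shannon entropy over the full hostname are assumptions for illustration.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse


def shannon_entropy(text):
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(text).values()
    )


def lexical_features(url):
    """Bag-of-tokens for hostname and path, plus URL length and host entropy."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    features = {"length:url": len(url), "entropy:host": shannon_entropy(host)}
    for token in filter(None, host.split(".")):
        features[f"host_token:{token}"] = 1
    for token in filter(None, re.split(r"[/.?=\-_&]", parsed.path)):
        features[f"path_token:{token}"] = 1
    return features


example = lexical_features("http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll")
for name, value in example.items():
    print(name, value)
```

Keeping hostname and path tokens in separate namespaces lets a model treat "ebayisapi" in a path differently from the same string in a hostname.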

Page 19: Malicious url detection using machine learning

Which feature sets?

Feature set            # Features
Blacklist              4
Manual                 3
WHOIS                  4,000
Host-based             13,000
Lexical                17,000
Full                   30,000
w/o WHOIS/Blacklist    26,000

Page 20: Malicious url detection using machine learning

Beyond Blacklists

[Plot: Blacklist vs. Full features, Yahoo-PhishTank data set]

Higher detection rate for given false positive rate
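"Detection rate for a given false positive rate" can be made concrete: pick the score threshold that flags at most the allowed fraction of benign URLs, then measure the fraction of malicious URLs above it. A minimal sketch with made-up scores; the function name and all numbers are hypothetical, not results from the talk.

```python
def detection_rate_at_fpr(scores, labels, max_fpr):
    """Fraction of malicious URLs flagged when at most max_fpr of benign URLs are."""
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    malicious = [s for s, y in zip(scores, labels) if y == 1]
    allowed = int(max_fpr * len(benign))  # benign URLs we may wrongly flag
    # Flag only scores strictly above the (allowed + 1)-th highest benign score.
    threshold = benign[allowed] if allowed < len(benign) else float("-inf")
    flagged = sum(1 for s in malicious if s > threshold)
    return flagged / len(malicious)


# Made-up classifier scores (higher = more suspicious); 1 = malicious, 0 = benign.
scores = [0.1, 0.2, 0.3, 0.9, 0.4, 0.8, 0.95, 0.25]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

print(detection_rate_at_fpr(scores, labels, max_fpr=0.25))
print(detection_rate_at_fpr(scores, labels, max_fpr=0.0))
```

Comparing two feature sets at the same `max_fpr` is what the plot on this slide does: the better feature set is the one with the higher detection rate at that operating point.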

Page 21: Malicious url detection using machine learning

Limitations

False positives
• Sites hosted in disreputable ISP
• Guilt by association

False negatives
• Compromised sites
• Free hosting sites
• Hosted in reputable ISP

Future work: Web page content

Page 22: Malicious url detection using machine learning

Conclusion
• Detect malicious URLs with high accuracy
• Only using URL
• Diverse feature set helps: 86.5% w/ 18,000+ features
• Proof of concept working in lab

Future work
• Scaling up for deployment

Page 23: Malicious url detection using machine learning

References
Ma, Justin, et al. "Beyond blacklists: learning to detect malicious web sites from suspicious URLs." Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009.

Page 24: Malicious url detection using machine learning

Q & A