Malicious URL Detection Using Machine Learning

Beyond Blacklists: Malicious URL Detection Using Machine Learning

Who am I?

• Information security investigator at Cisco.

• Completed an M.Tech. at IIT Jodhpur in 2014.

• Areas of interest include machine learning, computer vision, and AI.

• Email: [email protected]

Malicious websites

Phishing: which one is real?

Visiting Malicious Websites

What do we want?

Problem in a Nutshell

URL features to identify malicious Web sites

No context, no content

Different classes of URLs

Benign, spam, phishing, exploits, scams...

For now, distinguish benign vs. malicious

facebook.com fblight.com

Information about new websites

State of the Practice

Current approaches

Blacklists [SORBS, URIBL, SURBL, Spamhaus]

Learning on hand-tuned features [Garera et al., 2007]

Limitations

Cannot predict unlisted sites

Cannot account for new features

Arms race: Fast feedback cycle is critical

More automated approach?

URL Classification System

Label Example Hypothesis

Data Sets

Malicious URLs

5,000 from PhishTank (phishing)

15,000 from Spamscatter (spam, phishing, etc)

Benign URLs

15,000 from Yahoo Web directory

15,000 from DMOZ directory

Malicious × Benign → 4 data sets

30,000 – 55,000 features per data set
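The pairing of sources can be sketched in a few lines. This is an illustration only: it assumes each data set is simply one benign source combined with one malicious source and that every URL from both is used, which the slide does not state explicitly. The `Benign-Malicious` naming is also an assumption.

```python
from itertools import product

# Source sizes as listed on the slide.
malicious_sources = {"PhishTank": 5000, "Spamscatter": 15000}
benign_sources = {"Yahoo": 15000, "DMOZ": 15000}

# Assumption: one data set per (benign, malicious) pair, using all URLs.
data_sets = {
    f"{benign}-{malicious}": benign_sources[benign] + malicious_sources[malicious]
    for benign, malicious in product(benign_sources, malicious_sources)
}

for name, size in data_sets.items():
    print(name, size)
```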

Algorithms

Logistic regression w/ L1-norm regularization

Implicit feature selection

Easier to interpret

Other models

Naive Bayes

Support vector machines (linear, RBF kernels)
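To make "implicit feature selection" concrete, here is a minimal pure-Python sketch of L1-regularized logistic regression. It uses proximal gradient descent with soft-thresholding, which is one common way to fit such a model and not necessarily the solver used in the cited work. The toy data, `lam`, `lr` and `steps` are all made-up illustration values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_l1_logistic(X, y, lam=0.1, lr=0.5, steps=500):
    """Minimize mean log-loss + lam * sum(|w_j|); the bias is not penalized."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        # Gradient step on the log-loss, then the L1 proximal (shrinkage) step.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy data: feature 0 determines the label, feature 1 is uninformative.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
weights, bias = fit_l1_logistic(X, y)
print(weights)
```

The uninformative feature keeps a weight of exactly zero while the informative one gets a positive weight, which is also what makes the fitted model easier to read.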

Feature vector construction

Features to consider?

1) Blacklists

2) Simple heuristics

3) Domain name registration

4) Host properties

5) Lexical

(1) Blacklist Queries

List of known malicious sites

Providers: SORBS, URIBL, SURBL, Spamhaus

http://www.bfuduuioo1fp.mobi → In blacklist? Yes

http://fblight.com → In blacklist? No

Blacklist queries as features (e.g. for http://www.bfuduuioo1fp.mobi)
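A rough sketch of turning blacklist lookups into features follows. The `BLACKLISTS` dictionary is a hypothetical in-memory stand-in: real providers are queried over the network, and the list contents below are invented for illustration, not actual listings.

```python
from urllib.parse import urlsplit

# Hypothetical local snapshots; the entries are invented for this example.
BLACKLISTS = {
    "SORBS": {"www.bfuduuioo1fp.mobi"},
    "URIBL": {"www.bfuduuioo1fp.mobi"},
    "SURBL": set(),
    "Spamhaus": set(),
}

def blacklist_features(url):
    """Return one binary feature per blacklist: 1 if the hostname is listed."""
    host = urlsplit(url).hostname or ""
    return {f"blacklist:{name}": int(host in listed)
            for name, listed in BLACKLISTS.items()}

print(blacklist_features("http://www.bfuduuioo1fp.mobi"))
print(blacklist_features("http://fblight.com"))
```

Keeping one feature per provider, instead of a single yes/no, lets the classifier learn how much to trust each list.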

(2) Manually-Selected Features

Considered by previous studies

IP address in hostname? (e.g. http://72.23.5.122/www.bankofamerica.com/)

Number of dots in URL (e.g. http://www.bankofamerica.com.qytrpbcw.stopgap.cn/)

WHOIS (domain name) registration date (e.g. stopgap.cn registered 28 June 2009)
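The two URL-only heuristics can be sketched with the standard library. The function name `manual_features` and the feature names are illustrative choices, and the WHOIS registration date is left out because it needs an external lookup.

```python
import ipaddress
from urllib.parse import urlsplit

def manual_features(url):
    """Two hand-picked heuristics computed from the URL string alone."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError for ordinary names
        ip_in_hostname = 1
    except ValueError:
        ip_in_hostname = 0
    return {"ip_in_hostname": ip_in_hostname, "num_dots": url.count(".")}

print(manual_features("http://72.23.5.122/www.bankofamerica.com/"))
print(manual_features("http://www.bankofamerica.com.qytrpbcw.stopgap.cn/"))
```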

(3) WHOIS Features

Domain name registration

Date of registration, update, expiration

Registrant: Who registered domain?

Registrar: Who manages registration?

http://sleazysalmon.com

http://angryalbacore.com

http://mangymackerel.com

http://yammeringyellowtail.com

Registered on 29 June 2009 by SpamMedia
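One possible encoding of a WHOIS record as features is sketched below. No real WHOIS lookup is performed: the record is written by hand from the slide's example, the registrar name `ExampleRegistrar` and the reference date are hypothetical, and the feature names are illustrative.

```python
from datetime import date

def whois_features(record, today):
    """Encode registrant and registrar as indicator features, plus domain age in days."""
    return {
        f"registrant={record['registrant']}": 1,
        f"registrar={record['registrar']}": 1,
        "days_since_registration": (today - record["registered"]).days,
    }

# Hand-written record; the registrar value is a made-up placeholder.
record = {
    "registrant": "SpamMedia",
    "registrar": "ExampleRegistrar",
    "registered": date(2009, 6, 29),
}
features = whois_features(record, today=date(2009, 7, 1))
print(features)
```

Domains that share a registrant and a registration date, like the four on the slide, end up with identical indicator features, so the classifier can generalize from one to the others.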

(4) Host-Based Features

Blacklisted? (SORBS, URIBL, SURBL, Spamhaus)

WHOIS: registrar, registrant, dates

IP address: Which ASes/IP prefixes?

DNS: TTL? PTR record exists/resolves?

Geography-related: Locale? Connection speed?

facebook.com: 69.63.176.0/20

fblight.com: 75.102.60.0/22
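The IP-prefix feature can be illustrated with Python's `ipaddress` module. The two prefixes are the ones shown on the slide; the addresses passed in are arbitrary examples, and in practice the hosting IP would come from a DNS lookup and the prefix from routing data.

```python
import ipaddress

# Prefixes taken from the slide; a real system would use a full routing table.
KNOWN_PREFIXES = [
    ipaddress.ip_network("69.63.176.0/20"),
    ipaddress.ip_network("75.102.60.0/22"),
]

def prefix_feature(ip):
    """Return a categorical feature naming the known prefix that contains ip."""
    addr = ipaddress.ip_address(ip)
    for net in KNOWN_PREFIXES:
        if addr in net:
            return f"prefix={net}"
    return "prefix=unknown"

print(prefix_feature("69.63.176.13"))
print(prefix_feature("75.102.61.1"))
```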

(5) Lexical Features

Tokens in URL hostname + path

Length of URL

Entropy of the domain name

http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
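A minimal sketch of these lexical features follows. The delimiter sets, the feature names, and the choice to compute character-level Shannon entropy over the full hostname are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit

def shannon_entropy(text):
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    total = len(text)
    if total == 0:
        return 0.0
    counts = Counter(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def lexical_features(url):
    """Bag-of-words tokens from hostname and path, plus length and entropy."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    features = {"url_length": len(url), "host_entropy": shannon_entropy(host)}
    for token in filter(None, re.split(r"[.\-]", host)):
        features[f"host:{token}"] = 1
    for token in filter(None, re.split(r"[/?.=\-_]", parts.path)):
        features[f"path:{token}"] = 1
    return features

url = "http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll"
features = lexical_features(url)
print(sorted(features))
```

Hostname and path tokens get separate prefixes, so a token such as `ebayisapi` in the path is a different feature from the same token in the hostname.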

Which feature sets?

Feature set: # features

Blacklist: 4

Manual: 3

WHOIS: 4,000

Host-based: 13,000

Lexical: 17,000

Full: 30,000

w/o WHOIS/Blacklist: 26,000

Beyond Blacklists

[Figure: blacklist vs. full features on the Yahoo-PhishTank data set]

Higher detection rate for a given false positive rate
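The comparison metric, detection rate at a fixed false positive rate, can be computed as in the sketch below. The scores and labels are made-up toy values, not results from the talk, and `detection_rate_at_fpr` is an illustrative helper name.

```python
def detection_rate_at_fpr(scores, labels, max_fpr):
    """Fraction of malicious URLs (label 1) flagged while keeping the
    false positive rate on benign URLs (label 0) at or below max_fpr."""
    benign = sorted((s for s, l in zip(scores, labels) if l == 0), reverse=True)
    malicious = [s for s, l in zip(scores, labels) if l == 1]
    allowed = int(max_fpr * len(benign))  # benign URLs we may wrongly flag
    # Flag only scores strictly above the threshold, so ties stay unflagged.
    threshold = benign[allowed] if allowed < len(benign) else float("-inf")
    return sum(s > threshold for s in malicious) / len(malicious)

# Toy classifier scores: higher means "more likely malicious".
scores = [0.1, 0.2, 0.3, 0.9, 0.8, 0.95, 0.25, 0.7]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(detection_rate_at_fpr(scores, labels, max_fpr=0.25))
print(detection_rate_at_fpr(scores, labels, max_fpr=0.0))
```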

Limitations

False positives

Sites hosted by a disreputable ISP

Guilt by association

False negatives

Compromised sites

Free hosting sites

Hosted by a reputable ISP

Future work: Web page content

Conclusion

Detect malicious URLs with high accuracy

Only using URL

Diverse feature set helps: 86.5% w/ 18,000+ features

Proof of concept working in the lab

Future work

Scaling up for deployment

References

Ma, Justin, et al. "Beyond blacklists: learning to detect malicious web sites from suspicious URLs." Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009.

Q & A