Malicious URL Detection Using Machine Learning
Uploaded by securityxploded
Who am I ?
• Info security Investigator @ Cisco.
• Completed M.Tech. from IIT Jodhpur in 2014.
• Areas of interest include machine learning, computer vision, and A.I.
• Email : [email protected]
Problem in a Nutshell
URL features to identify malicious Web sites
No context, no content
Different classes of URLs
Benign, spam, phishing, exploits, scams...
For now, distinguish benign vs. malicious
Example: facebook.com (benign) vs. fblight.com (malicious look-alike)
State of the Practice
Current approaches
Blacklists [SORBS, URIBL, SURBL, Spamhaus]
Learning on hand-tuned features [Garera et al., 2007]
Limitations
Cannot predict unlisted sites
Cannot account for new features
Arms race: Fast feedback cycle is critical
More automated approach?
Data Sets
Malicious URLs
5,000 from PhishTank (phishing)
15,000 from Spamscatter (spam, phishing, etc)
Benign URLs
15,000 from Yahoo Web directory
15,000 from DMOZ directory
Malicious × Benign → 4 data sets (one per pairing of a malicious source with a benign source)
30,000 – 55,000 features per data set
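The pairing of sources can be sketched as a cross product. This is a minimal illustration, assuming each data set simply combines all URLs from one benign source with all URLs from one malicious source; the `build_pairings` helper and the `Benign-Malicious` naming are hypothetical, and only the source names and counts come from the slide.

```python
from itertools import product

# Source names and URL counts as listed on the slide.
MALICIOUS = {"PhishTank": 5_000, "Spamscatter": 15_000}
BENIGN = {"Yahoo": 15_000, "DMOZ": 15_000}

def build_pairings(malicious, benign):
    """Return one entry per (benign, malicious) source pair with its URL counts."""
    return {
        f"{b_name}-{m_name}": {
            "benign": b_count,
            "malicious": m_count,
            "total": b_count + m_count,  # assumes every URL from both sources is used
        }
        for (m_name, m_count), (b_name, b_count) in product(
            malicious.items(), benign.items()
        )
    }

pairings = build_pairings(MALICIOUS, BENIGN)
for name, sizes in pairings.items():
    print(name, sizes)
```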
Algorithms
Logistic regression w/ L1-norm regularization
Implicit feature selection
Easier to interpret
Other models
Naive Bayes
Support vector machines (linear, RBF kernels)
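To show why the L1 penalty gives implicit feature selection, here is a small sketch of L1-regularized logistic regression. The solver is an assumption: it uses plain proximal gradient descent with soft-thresholding, chosen for brevity, and is not necessarily what the original study used. The toy data, function names, and hyperparameters are all hypothetical.

```python
import math

def soft_threshold(value, amount):
    """Shrink value toward zero by amount (the proximal step for an L1 penalty)."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_l1_logreg(X, y, l1=0.1, lr=0.5, epochs=500):
    """Fit weights and bias by proximal gradient descent on the mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        # Gradient step, then shrink each weight; the bias is not penalized.
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy data: column 0 tracks the label, column 1 is uninformative.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
weights, bias = train_l1_logreg(X, y)
print(weights, bias)  # the uninformative feature's weight stays exactly 0.0
```

Weights that the penalty holds at zero drop out of the model entirely, which is what makes the surviving features easy to inspect.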
Features to consider?
1) Blacklists
2) Simple heuristics
3) Domain name registration
4) Host properties
5) Lexical
(1) Blacklist Queries
List of known malicious sites
Providers: SORBS, URIBL, SURBL, Spamhaus
http://www.bfuduuioo1fp.mobi → In blacklist? Yes
http://fblight.com → In blacklist? No
Blacklist queries as features
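Using blacklist queries as features can be sketched as one binary feature per provider. This is a minimal illustration: the in-memory sets below are made-up stand-ins for real provider lookups, and which provider lists which domain is invented for the example. The `blacklist_features` helper is hypothetical.

```python
from urllib.parse import urlsplit

# Made-up snapshot of each provider's list, for illustration only.
# A real system would query the providers instead of holding sets in memory.
BLACKLISTS = {
    "SORBS": set(),
    "URIBL": {"bfuduuioo1fp.mobi"},
    "SURBL": {"bfuduuioo1fp.mobi"},
    "Spamhaus": set(),
}

def blacklist_features(url, blacklists):
    """Return one 0/1 feature per provider: is the host or a parent domain listed?"""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Candidates are the host and each parent domain, excluding the bare TLD.
    candidates = {".".join(labels[i:]) for i in range(len(labels) - 1)}
    return {name: int(bool(candidates & listed)) for name, listed in blacklists.items()}

print(blacklist_features("http://www.bfuduuioo1fp.mobi", BLACKLISTS))
print(blacklist_features("http://fblight.com", BLACKLISTS))
```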
(2) Manually-Selected Features
Considered by previous studies
IP address in hostname? Example: http://72.23.5.122/www.bankofamerica.com/
Number of dots in URL. Example: http://www.bankofamerica.com.qytrpbcw.stopgap.cn/
WHOIS (domain name) registration date. Example: stopgap.cn registered 28 June 2009
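The first two hand-picked features can be computed from the URL string alone. The sketch below is an illustration with hypothetical function and feature names; the WHOIS registration date is left out because it needs an external lookup.

```python
import ipaddress
from urllib.parse import urlsplit

def host_is_ip(host):
    """Return True if the hostname parses as a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def manual_features(url):
    """Compute the two URL-only heuristics: IP-as-hostname and dot count."""
    host = urlsplit(url).hostname or ""
    return {
        "ip_in_hostname": int(host_is_ip(host)),
        "num_dots": url.count("."),
    }

print(manual_features("http://72.23.5.122/www.bankofamerica.com/"))
print(manual_features("http://www.bankofamerica.com.qytrpbcw.stopgap.cn/"))
```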
(3) WHOIS Features
Domain name registration
Date of registration, update, expiration
Registrant: Who registered domain?
Registrar: Who manages registration?
http://sleazysalmon.com
http://angryalbacore.com
http://mangymackerel.com
http://yammeringyellowtail.com
All registered on 29 June 2009 by SpamMedia
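One way to turn WHOIS data into model inputs is to encode the registrar and registrant as sparse binary features and the registration date as an age in days. This is a sketch under assumptions: the record layout, the `whois_features` helper, the registrar name, and the observation date are hypothetical; only the registrant and registration date come from the slide.

```python
from datetime import date

def whois_features(record, observed_on):
    """Encode a WHOIS-like record as sparse binary features plus domain age."""
    return {
        f"registrar={record['registrar']}": 1,
        f"registrant={record['registrant']}": 1,
        "days_since_registration": (observed_on - record["registered"]).days,
    }

# Hypothetical record; the registrar name is a placeholder.
record = {
    "registrar": "ExampleRegistrar",
    "registrant": "SpamMedia",
    "registered": date(2009, 6, 29),
}
features = whois_features(record, observed_on=date(2009, 7, 29))
print(features)
```

Domains that share a registrant feature, as the four examples above do, then share a weight in the model, so evidence against one carries over to the others.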
(4) Host-Based Features
Blacklisted? (SORBS, URIBL, SURBL, Spamhaus)
WHOIS: registrar, registrant, dates
IP address: Which ASes/IP prefixes?
DNS: TTL? PTR record exists/resolves?
Geography-related: Locale? Connection speed?
facebook.com: 69.63.176.0/20
fblight.com: 75.102.60.0/22
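IP prefix membership can be sketched with the standard `ipaddress` module. The two prefixes are the ones shown on the slide; the test addresses and the `prefix_features` helper are hypothetical, and resolving a hostname to its IP address is assumed to happen elsewhere.

```python
import ipaddress

# Prefixes taken from the slide.
PREFIXES = [
    ipaddress.ip_network("75.102.60.0/22"),
    ipaddress.ip_network("69.63.176.0/20"),
]

def prefix_features(ip_text, prefixes):
    """Return one 0/1 feature per known prefix: does the address fall inside it?"""
    ip = ipaddress.ip_address(ip_text)
    return {f"in_prefix={net}": int(ip in net) for net in prefixes}

# Hypothetical addresses, one inside each prefix.
print(prefix_features("75.102.61.7", PREFIXES))
print(prefix_features("69.63.180.1", PREFIXES))
```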
(5) Lexical Features
Tokens in URL hostname + path
Length of URL
Entropy of the domain name
http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
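The three lexical features can be sketched as follows. The details are assumptions: tokens are split on common URL delimiters, hostname and path tokens are kept in separate namespaces, and entropy is computed as character-level Shannon entropy over the hostname. The slide does not specify any of these choices, and the function names are hypothetical.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit

def shannon_entropy(text):
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(text).values()
    )

def lexical_features(url):
    """Bag-of-tokens for hostname and path, plus URL length and host entropy."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    host_tokens = [t for t in host.split(".") if t]
    path_tokens = [t for t in re.split(r"[/.?=&_\-]", parts.path) if t]
    features = {f"host:{t}": 1 for t in host_tokens}
    features.update({f"path:{t}": 1 for t in path_tokens})
    features["url_length"] = len(url)
    features["host_entropy"] = shannon_entropy(host)
    return features

features = lexical_features("http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll")
print(features)
```

A token such as `ebayisapi` appearing in the path of a non-eBay host is the kind of signal a bag-of-tokens model can pick up.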
Which feature sets?
Feature set            # Features
Blacklist              4
Manual                 3
WHOIS                  4,000
Host-based             13,000
Lexical                17,000
Full                   30,000
w/o WHOIS/Blacklist    26,000
Beyond Blacklists
[Figure: Blacklist vs. Full features on the Yahoo-PhishTank data set]
Higher detection rate for a given false positive rate
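Comparing detection rate at a given false positive rate can be sketched as a threshold sweep over classifier scores. This is a minimal illustration with made-up scores and labels; the `detection_rate_at_fpr` helper is hypothetical and assumes both classes are present.

```python
def detection_rate_at_fpr(scores, labels, max_fpr):
    """Best true positive rate over thresholds whose false positive rate <= max_fpr."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best = 0.0
    for threshold in sorted(set(scores)):
        flagged = [s >= threshold for s in scores]
        true_pos = sum(f and y == 1 for f, y in zip(flagged, labels))
        false_pos = sum(f and y == 0 for f, y in zip(flagged, labels))
        if false_pos / negatives <= max_fpr:
            best = max(best, true_pos / positives)
    return best

# Made-up scores (higher means more likely malicious) and labels (1 = malicious).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(detection_rate_at_fpr(scores, labels, max_fpr=0.0))
print(detection_rate_at_fpr(scores, labels, max_fpr=0.34))
```

Running the same function on scores from two feature sets, at the same `max_fpr`, gives the kind of comparison the slide describes.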
Limitations
False positives
Sites hosted by a disreputable ISP
Guilt by association
False negatives
Compromised sites
Free hosting sites
Sites hosted by a reputable ISP
Future work: Web page content
Conclusion
Detect malicious URLs with high accuracy
Only using URL
Diverse feature set helps: 86.5% w/ 18,000+ features
Proof of concept working in lab
Future work
Scaling up for deployment
References
Ma, Justin, et al. "Beyond blacklists: learning to detect malicious web sites from suspicious URLs." Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009.