Copyright © The OWASP Foundation. Permission is granted to copy, distribute and/or modify this document under the terms of the OWASP License. The OWASP Foundation http://www.owasp.org
Anti-Crawling Techniques
Ayman Mohammed, IBM
14/06/2014
OWASP Why is data important?
OWASP Web 2.0
OWASP Web 3.0 (semantic web)
OWASP 6 Data Scraping (crawling) Risks Scrapers take for free what a company has spent large sums to develop, resulting in loss of revenue and loss of customer confidence in the brand. This is theft of digital property and an attack on the uniqueness of online brands. Traditional network security devices such as firewalls, intrusion detection and prevention systems, or even application-layer firewalls cannot detect or block scrapers, since sophisticated scraping tools mimic the search patterns of real users.
OWASP Security Triangle
OWASP 8 Anti-Crawling After analyzing the frequency of requests to the server, you can pick one or more of the following techniques based on your analysis.
OWASP 9 IP-address ban The easiest and most common way to detect attempts at website scraping is to analyze the frequency of requests to the server. If requests from a certain IP address arrive too often or in too great a number, the address may be blocked, and the client is often asked to enter a CAPTCHA to unblock it. The most important thing in this protection method is to find the boundary between the normal frequency and volume of requests and actual scraping attempts, so that ordinary users are not blocked. This boundary is commonly determined by analyzing the behavior of typical users.
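A minimal sketch of such a frequency check, in Python. The window size and request threshold here are illustrative assumptions, not values from this talk; a production setup would keep the counters in shared storage (e.g. Redis) and challenge offenders with a CAPTCHA rather than ban outright.

    import time
    from collections import defaultdict, deque

    # Sliding-window request counting per IP; thresholds are illustrative.
    WINDOW_SECONDS = 60
    MAX_REQUESTS_PER_WINDOW = 120  # tune against real user behavior

    _recent = defaultdict(deque)   # ip -> timestamps of recent requests
    _banned = set()

    def should_block(ip):
        """Return True if this IP exceeded the allowed request rate."""
        if ip in _banned:
            return True
        now = time.time()
        window = _recent[ip]
        window.append(now)
        # Drop timestamps that fell out of the sliding window.
        while window and window[0] < now - WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_REQUESTS_PER_WINDOW:
            _banned.add(ip)  # in practice: challenge with a CAPTCHA instead
            return True
        return False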
OWASP 10 Bypass (IP-address ban) One may bypass this protection by using multiple proxies to hide the real IP address of the scraper. Don't use your real IP address, even in the first attempt.
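A minimal proxy-rotation sketch using Python's requests library; the proxy addresses below are placeholders standing in for a real proxy pool.

    import itertools
    import requests

    # Placeholder proxies; a real scraper draws from a pool of paid or
    # public proxies and retires dead ones.
    PROXIES = [
        "http://198.51.100.10:8080",
        "http://198.51.100.11:8080",
        "http://198.51.100.12:8080",
    ]
    _rotation = itertools.cycle(PROXIES)

    def fetch(url):
        proxy = next(_rotation)  # each request leaves through a different IP
        response = requests.get(url, proxies={"http": proxy, "https": proxy},
                                timeout=10)
        response.raise_for_status()
        return response.text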
OWASP 11 CAPTCHA This is also a popular way of protecting data from web scraping. In this case a user is asked to type the CAPTCHA text to gain access to the website. The significant disadvantage of this method is the inconvenience to regular users who are forced to enter CAPTCHAs. Therefore, it is mostly applicable in systems where data is accessed infrequently and through individual requests.
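The gating pattern looks roughly like the Flask sketch below; the trivial arithmetic challenge is only a stand-in for the real CAPTCHA image or puzzle a production site would serve.

    import random
    from flask import Flask, request, session

    app = Flask(__name__)
    app.secret_key = "change-me"  # needed for the session cookie

    @app.route("/challenge")
    def challenge():
        # Stand-in for a real CAPTCHA: store the expected answer server-side.
        a, b = random.randint(1, 9), random.randint(1, 9)
        session["captcha_answer"] = str(a + b)
        return f"What is {a} + {b}?"

    @app.route("/data")
    def data():
        if request.args.get("captcha") != session.get("captcha_answer"):
            return "Solve the challenge first", 403
        return "protected content"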
OWASP 12 Bypass (CAPTCHA) Many web services and browser extensions allow you to bypass CAPTCHAs. Most CAPTCHA-cracking services are commercial.
OWASP 13 Using different accounts With this protection method the data can be accessed by authorized users only. This simplifies monitoring users' behavior and blocking suspicious accounts regardless of the IP address the client is working from. You can't always use this approach, though, since you may lose many customers.
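A sketch of account-level tracking, independent of the client's IP address; the daily page threshold is an illustrative assumption.

    from collections import Counter

    MAX_PAGES_PER_DAY = 500   # illustrative; tune to normal user behavior

    _pages_today = Counter()  # account_id -> pages fetched today
    _suspended = set()

    def record_page_view(account_id):
        """Count a page view; return False if the account should be blocked."""
        if account_id in _suspended:
            return False
        _pages_today[account_id] += 1
        if _pages_today[account_id] > MAX_PAGES_PER_DAY:
            _suspended.add(account_id)  # flag for manual review in practice
            return False
        return True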
OWASP 14 Bypass (Using different accounts) This protection can be bypassed by creating a set of accounts, including automatically generated ones. There are services that sell accounts on well-known social networks. Verifying an account by phone (a so-called PVA, Phone Verified Account) to check its authenticity can make automatic account creation considerably harder, although even that can be bypassed using disposable SIM cards. Alternatively, create your own bulk account generator.
OWASP 15 Usage of complex JavaScript logic In this case the browser sends a special code (or several codes) in its request to the server, and the codes are formed by complex logic written in JavaScript. The code is often obfuscated, and the logic is placed in one or more loadable JavaScript files.
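A toy version of the pattern, assuming Flask: the page ships a small script that computes a token, and the server only answers requests carrying a matching token. Real deployments hide far more convoluted, obfuscated logic behind the same idea.

    from flask import Flask, request

    app = Flask(__name__)
    SEED = 7919  # server-side constant used to validate the token

    PAGE = """
    <script>
      // Client-side logic (heavily obfuscated in real deployments) computes
      // a token that plain HTTP scrapers never produce.
      const token = (1234 * %d) %% 100000;
      fetch('/data?token=' + token).then(r => r.text()).then(console.log);
    </script>
    """ % SEED

    @app.route("/")
    def page():
        return PAGE

    @app.route("/data")
    def data():
        if request.args.get("token") != str((1234 * SEED) % 100000):
            return "missing or invalid token", 403
        return "protected content"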
OWASP 16 Bypass (Usage of complex JavaScript logic) It can be bypassed by scraping with real browsers (for example, using the Selenium or Mechanize libraries). But this gives the method an additional advantage: the scraper will show up in website traffic analytics (e.g. Google Analytics) when executing JavaScript, which lets the webmaster notice immediately that something is going on.
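A minimal Selenium sketch in Python: a real browser executes the page's JavaScript, so any JS-computed tokens are produced exactly as for a human visitor. The target URL is a placeholder, and the headless flag assumes a reasonably recent Chrome.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/protected-page")
        html = driver.page_source  # DOM after all scripts have run
        print(html[:500])
    finally:
        driver.quit()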
OWASP Crawljax Demo
OWASP 18 Frequent update of the page structure One of the most effective ways to protect a website against automatic scraping is to change its structure frequently. This can apply not only to the names of HTML element identifiers and classes, but even to the entire hierarchy. It makes writing a scraper very complicated, although it also bloats the website's code and, sometimes, burdens the entire system as well.
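One hedged way to implement this without hand-editing templates each release: derive element class names from a build identifier, so every deployment silently renames them. The helper below is an illustration, not a known production scheme.

    import hashlib

    BUILD_ID = "2014-06-14-rc3"  # placeholder release tag, changed per deploy

    def obfuscated_class(stable_name):
        """Map a stable template name to a per-release class name."""
        digest = hashlib.md5((BUILD_ID + stable_name).encode()).hexdigest()
        return "c" + digest[:8]

    # Templates emit obfuscated_class("price") instead of class="price",
    # so selectors hard-coded into a scraper break on the next release.
    print(obfuscated_class("price"))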
OWASP 19 Bypass (Frequent update of the page structure) To bypass protection like this, a more flexible and intelligent scraper is required, or the scraper simply has to be corrected manually whenever the changes occur. Selenium will also help in developing such a scraper.
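One way such a flexible scraper can be built is to anchor on visible text instead of class names or hierarchy, as in this BeautifulSoup sketch (the HTML is illustrative):

    import re
    from bs4 import BeautifulSoup

    # Class names change every release, but the visible label does not.
    html = '''
    <div class="c3fa91b2"><span class="x9">Price:</span>
    <span class="k7">19.99 USD</span></div>
    '''

    soup = BeautifulSoup(html, "html.parser")
    label = soup.find(string=re.compile(r"Price:"))          # find by visible text
    price = label.find_next(string=re.compile(r"\d+\.\d+"))  # nearest number after it
    print(price.strip())  # -> 19.99 USD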
OWASP 20 Limitation of the frequency of requests and downloadable data allowance This makes scraping large amounts of data very slow and therefore impractical. At the same time, the restrictions must be applied with the needs of a common user in mind, so that they do not reduce the overall usability of the site.
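A common way to implement such a limit is a token bucket per client; the rates below are illustrative assumptions.

    import time

    class TokenBucket:
        """Allow short bursts while capping the sustained request rate."""

        def __init__(self, rate_per_sec, burst):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = float(burst)
            self.last = time.monotonic()

        def allow(self, cost=1.0):
            now = time.monotonic()
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

    bucket = TokenBucket(rate_per_sec=2.0, burst=10)  # ~2 pages/sec sustained
    if not bucket.allow():
        print("429 Too Many Requests")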
OWASP 21 Bypass (Limitation of the frequency of requests and downloadable data allowance) This can be bypassed by accessing the website from different IP addresses or accounts (multiple-user simulation). Multiple VPS servers will also help.
OWASP 22 Mapping the important data as images This method of content protection makes automatic data collection more complicated while maintaining visual access for common users. Images often replace e-mail addresses and phone numbers, and some websites even manage to replace random letters in the text. Although nothing prevents displaying a website's content in graphic form (e.g. using Flash or HTML5), doing so can significantly hurt its indexing by search engines.
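A minimal sketch of the server side using Pillow; the geometry and default bitmap font are assumptions, and production code would use a proper TrueType font.

    from PIL import Image, ImageDraw

    def text_to_image(text, path):
        """Render sensitive text as an image so it never appears in the HTML."""
        img = Image.new("RGB", (8 * len(text) + 20, 30), "white")
        draw = ImageDraw.Draw(img)
        draw.text((10, 8), text, fill="black")  # Pillow's default bitmap font
        img.save(path)

    text_to_image("contact@example.com", "email.png")
    # The page then embeds <img src="email.png"> instead of the address.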
OWASP 23 Bypass (Mapping the important data as images) This protection is hard to bypass, as some automatic or manual image recognition is required, similar to that used in the CAPTCHA case.
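For automatic recognition, OCR tooling such as pytesseract (a wrapper around the Tesseract engine, which must be installed separately) is one option; accuracy depends heavily on image quality, and real scrapers add thresholding and scaling steps first.

    from PIL import Image
    import pytesseract

    img = Image.open("email.png")
    text = pytesseract.image_to_string(img)
    print(text.strip())  # ideally recovers the rendered address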
OWASP Questions?
OWASP 25 The question is: What's the fastest way to collect Facebook users' info?