Copyright © The OWASP Foundation. Permission is granted to copy, distribute and/or modify this document under the terms of the OWASP License. The OWASP Foundation http://www.owasp.org
Anti-Crawling Techniques
Ayman Mohammed, IBM
14/06/2014
OWASP Why is data important?
OWASP Web 2.0
OWASP Web 3.0 (semantic web)
OWASP 6 Data Scraping (crawling) Risks Scrapers take for free what a company has spent large sums to develop, resulting in loss of revenue and loss of customer confidence in the brand. This is theft of digital property and an attack on the uniqueness of online brands. Traditional network security devices such as firewalls, intrusion detection and prevention systems, or even application-layer firewalls cannot detect or block scrapers, since sophisticated scraping tools mimic the search patterns of real users.
OWASP Security Triangle
OWASP 8 Anti-Crawling After analyzing the frequency of requests to the server, you can pick one or more of the following techniques based on your analysis.
OWASP 9 IP-address ban The easiest and most common way to detect attempts at website scraping is to analyze the frequency of requests to the server. If requests from a certain IP address arrive too often or in too great a number, the address may be blocked, and the client is often asked to enter a CAPTCHA to unblock it. The most important thing in this protection method is to find the boundary between the normal frequency and volume of requests and actual scraping attempts, so that ordinary users are not blocked. This boundary is commonly determined by analyzing the behavior of typical users.
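A minimal sketch of such a frequency check, in Python. The window size and request threshold here are illustrative assumptions, not values from this talk; a production setup would keep the counters in shared storage (e.g. Redis) and challenge offenders with a CAPTCHA rather than ban outright.

    import time
    from collections import defaultdict, deque

    # Sliding-window request counting per IP; thresholds are illustrative.
    WINDOW_SECONDS = 60
    MAX_REQUESTS_PER_WINDOW = 120  # tune against real user behavior

    _recent = defaultdict(deque)   # ip -> timestamps of recent requests
    _banned = set()

    def should_block(ip):
        """Return True if this IP exceeded the allowed request rate."""
        if ip in _banned:
            return True
        now = time.time()
        window = _recent[ip]
        window.append(now)
        # Drop timestamps that fell out of the sliding window.
        while window and window[0] < now - WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_REQUESTS_PER_WINDOW:
            _banned.add(ip)  # in practice: challenge with a CAPTCHA instead
            return True
        return False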
OWASP 10 Bypass (IP-address ban) One may bypass this protection by using multiple proxies to hide the real IP address of the scraper. Don't use your real IP address, even in the first attempt.
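A minimal proxy-rotation sketch using Python's requests library; the proxy addresses below are placeholders standing in for a real proxy pool.

    import itertools
    import requests

    # Placeholder proxies; a real scraper draws from a pool of paid or
    # public proxies and retires dead ones.
    PROXIES = [
        "http://198.51.100.10:8080",
        "http://198.51.100.11:8080",
        "http://198.51.100.12:8080",
    ]
    _rotation = itertools.cycle(PROXIES)

    def fetch(url):
        proxy = next(_rotation)  # each request leaves through a different IP
        response = requests.get(url, proxies={"http": proxy, "https": proxy},
                                timeout=10)
        response.raise_for_status()
        return response.text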
OWASP 11 CAPTCHA This is also a popular way of protecting data from web scraping. In this case a user is asked to type the CAPTCHA text to gain access to the website. The significant disadvantage of this method is the inconvenience to regular users who are forced to enter CAPTCHAs. Therefore, it is mostly applicable in systems where data is accessed infrequently and through individual requests.
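The gating pattern looks roughly like the Flask sketch below; the trivial arithmetic challenge is only a stand-in for the real CAPTCHA image or puzzle a production site would serve.

    import random
    from flask import Flask, request, session

    app = Flask(__name__)
    app.secret_key = "change-me"  # needed for the session cookie

    @app.route("/challenge")
    def challenge():
        # Stand-in for a real CAPTCHA: store the expected answer server-side.
        a, b = random.randint(1, 9), random.randint(1, 9)
        session["captcha_answer"] = str(a + b)
        return f"What is {a} + {b}?"

    @app.route("/data")
    def data():
        if request.args.get("captcha") != session.get("captcha_answer"):
            return "Solve the challenge first", 403
        return "protected content"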
OWASP 12 Bypass (CAPTCHA) Many web services and browser extensions allow you to bypass CAPTCHAs. Most CAPTCHA-cracking services are commercial.
OWASP 13 Using different accounts With this protection method the data can be accessed by authorized users only. This simplifies monitoring users' behavior and blocking suspicious accounts regardless of the IP address the client is working from. You can't always use this approach, though, since you may lose many customers.
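A sketch of account-level tracking, independent of the client's IP address; the daily page threshold is an illustrative assumption.

    from collections import Counter

    MAX_PAGES_PER_DAY = 500   # illustrative; tune to normal user behavior

    _pages_today = Counter()  # account_id -> pages fetched today
    _suspended = set()

    def record_page_view(account_id):
        """Count a page view; return False if the account should be blocked."""
        if account_id in _suspended:
            return False
        _pages_today[account_id] += 1
        if _pages_today[account_id] > MAX_PAGES_PER_DAY:
            _suspended.add(account_id)  # flag for manual review in practice
            return False
        return True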
OWASP 14 Bypass (Using different accounts) This protection can be bypassed by creating a set of accounts, including automatically generated ones. There are services that sell accounts on well-known social networks. Verifying an account by phone (a so-called PVA, Phone Verified Account) to check its authenticity can make automatic account creation considerably harder, although even that can be bypassed using disposable SIM cards. Alternatively, create your own bulk account generator.
OWASP 15 Usage of complex JavaScript logic In this case the browser sends a special code (or several codes) in its request to the server, and the codes are formed by complex logic written in JavaScript. The code is often obfuscated, and the logic is placed in one or more loadable JavaScript files.
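A toy version of the pattern, assuming Flask: the page ships a small script that computes a token, and the server only answers requests carrying a matching token. Real deployments hide far more convoluted, obfuscated logic behind the same idea.

    from flask import Flask, request

    app = Flask(__name__)
    SEED = 7919  # server-side constant used to validate the token

    PAGE = """
    <script>
      // Client-side logic (heavily obfuscated in real deployments) computes
      // a token that plain HTTP scrapers never produce.
      const token = (1234 * %d) %% 100000;
      fetch('/data?token=' + token).then(r => r.text()).then(console.log);
    </script>
    """ % SEED

    @app.route("/")
    def page():
        return PAGE

    @app.route("/data")
    def data():
        if request.args.get("token") != str((1234 * SEED) % 100000):
            return "missing or invalid token", 403
        return "protected content"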
OWASP 16 Bypass (Usage of complex JavaScript logic) It can be bypassed by scraping with real browsers (for example, using the Selenium or Mechanize libraries). But this gives the method an additional advantage: the scraper will show up in website traffic analytics (e.g. Google Analytics) when executing JavaScript, which lets the webmaster notice immediately that something is going on.
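A minimal Selenium sketch in Python: a real browser executes the page's JavaScript, so any JS-computed tokens are produced exactly as for a human visitor. The target URL is a placeholder, and the headless flag assumes a reasonably recent Chrome.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/protected-page")
        html = driver.page_source  # DOM after all scripts have run
        print(html[:500])
    finally:
        driver.quit()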
OWASP Crawljax Demo
OWASP 18 Frequent update of the page structure One of the most effective ways to protect a website against automatic scraping is to change its structure frequently. This can apply not only to the names of HTML element identifiers and classes, but even to the entire hierarchy. It makes writing a scraper very complicated, although it also bloats the website's code and, sometimes, burdens the entire system as well.
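One hedged way to implement this without hand-editing templates each release: derive element class names from a build identifier, so every deployment silently renames them. The helper below is an illustration, not a known production scheme.

    import hashlib

    BUILD_ID = "2014-06-14-rc3"  # placeholder release tag, changed per deploy

    def obfuscated_class(stable_name):
        """Map a stable template name to a per-release class name."""
        digest = hashlib.md5((BUILD_ID + stable_name).encode()).hexdigest()
        return "c" + digest[:8]

    # Templates emit obfuscated_class("price") instead of class="price",
    # so selectors hard-coded into a scraper break on the next release.
    print(obfuscated_class("price"))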
OWASP 19 Bypass (Frequent update of the page structure) To bypass protection like this, a more flexible and intelligent scraper is required, or the scraper simply has to be corrected manually whenever the changes occur. Selenium will also help in developing such a scraper.
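One way such a flexible scraper can be built is to anchor on visible text instead of class names or hierarchy, as in this BeautifulSoup sketch (the HTML is illustrative):

    import re
    from bs4 import BeautifulSoup

    # Class names change every release, but the visible label does not.
    html = '''
    <div class="c3fa91b2"><span class="x9">Price:</span>
    <span class="k7">19.99 USD</span></div>
    '''

    soup = BeautifulSoup(html, "html.parser")
    label = soup.find(string=re.compile(r"Price:"))          # find by visible text
    price = label.find_next(string=re.compile(r"\d+\.\d+"))  # nearest number after it
    print(price.strip())  # -> 19.99 USD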
OWASP 20 Limitation of the frequency of requests and downloadable data allowance This makes scraping large amounts of data very slow and therefore impractical. At the same time, the restrictions must be applied with the needs of a common user in mind, so that they do not reduce the overall usability of the site.
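A common way to implement such a limit is a token bucket per client; the rates below are illustrative assumptions.

    import time

    class TokenBucket:
        """Allow short bursts while capping the sustained request rate."""

        def __init__(self, rate_per_sec, burst):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = float(burst)
            self.last = time.monotonic()

        def allow(self, cost=1.0):
            now = time.monotonic()
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

    bucket = TokenBucket(rate_per_sec=2.0, burst=10)  # ~2 pages/sec sustained
    if not bucket.allow():
        print("429 Too Many Requests")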
OWASP 21 Bypass (Limitation of the frequency of requests and downloadable data allowance) This can be bypassed by accessing the website from different IP addresses or accounts (multiple-user simulation). Multiple VPS servers will also help.
OWASP 22 Mapping the important data as images This method of content protection makes automatic data collection more complicated while maintaining visual access for common users. Images often replace e-mail addresses and phone numbers, and some websites even manage to replace random letters in the text. Although nothing prevents displaying a website's content in graphic form (e.g. using Flash or HTML5), doing so can significantly hurt its indexing by search engines.
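A minimal sketch of the server side using Pillow; the geometry and default bitmap font are assumptions, and production code would use a proper TrueType font.

    from PIL import Image, ImageDraw

    def text_to_image(text, path):
        """Render sensitive text as an image so it never appears in the HTML."""
        img = Image.new("RGB", (8 * len(text) + 20, 30), "white")
        draw = ImageDraw.Draw(img)
        draw.text((10, 8), text, fill="black")  # Pillow's default bitmap font
        img.save(path)

    text_to_image("contact@example.com", "email.png")
    # The page then embeds <img src="email.png"> instead of the address.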
OWASP 23 Bypass (Mapping the important data as images) This protection is hard to bypass, as some automatic or manual image recognition is required, similar to that used in the CAPTCHA case.
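For automatic recognition, OCR tooling such as pytesseract (a wrapper around the Tesseract engine, which must be installed separately) is one option; accuracy depends heavily on image quality, and real scrapers add thresholding and scaling steps first.

    from PIL import Image
    import pytesseract

    img = Image.open("email.png")
    text = pytesseract.image_to_string(img)
    print(text.strip())  # ideally recovers the rendered address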
OWASP Questions?
OWASP 25 The question is: What's the fastest way to collect Facebook users' info?