The Challenges in Crawling the Web


Data crawling and data scraping are as challenging as they are exciting. While the opportunity in web data is large, here are a few practical challenges in data crawling and scraping.

Transcript of The Challenges in Crawling the Web

Page 1: The Challenges in Crawling the Web
Page 2: The Challenges in Crawling the Web

THE CHALLENGES IN CRAWLING THE WEB.

Page 3: The Challenges in Crawling the Web

Extracting data from the web is an ever-evolving field, and it is still a gray area.

No clear ground rules regarding the legality of web scraping exist!

Concern over the privacy implications of collecting data from the Web is growing.

People are wary about how data is or can be used.

Page 4: The Challenges in Crawling the Web

Increasingly, Big Data is being frowned upon.

Its harvesting, even more so!

Yet, undeniably, data crawling is growing exponentially.

As it grows, the Web is gradually becoming more complicated to crawl.

Page 5: The Challenges in Crawling the Web

CHALLENGE I: NON-UNIFORM STRUCTURES

Data formats and structures are inconsistent across the Web.

Norms on how to build an Internet presence are non-existent.

The result?

A lack of uniformity across the vast, ever-changing terrain of the Internet.

The problem?

Collecting data in a machine-readable format becomes difficult.

Page 6: The Challenges in Crawling the Web

Problems multiply as the scale increases!

Especially, when:

a) structured data is needed, and

b) a large number of fields must be extracted from multiple sources (as sketched below).
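
One common workaround for this lack of uniformity is to keep a per-site selector map and normalize every page into a single schema. The following is a minimal sketch of that idea, assuming the third-party BeautifulSoup library; the site names, CSS selectors, and field names are purely illustrative and not taken from the deck.

# Minimal sketch: per-site CSS selector maps that normalize
# non-uniform pages into one machine-readable schema.
# Requires: pip install beautifulsoup4
from bs4 import BeautifulSoup

# Hypothetical selector configuration, one entry per source site.
SITE_SELECTORS = {
    "site-a.example.com": {"title": "h1.product-name", "price": "span.price"},
    "site-b.example.com": {"title": "div#item > h2", "price": "em.cost"},
}

def extract_record(host: str, html: str) -> dict:
    """Map a site-specific page onto a common {title, price} schema."""
    selectors = SITE_SELECTORS[host]
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, css in selectors.items():
        node = soup.select_one(css)
        record[field] = node.get_text(strip=True) if node else None
    return record

Each new source then only needs a new selector entry, while downstream consumers always receive the same structured record.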

Page 7: The Challenges in Crawling the Web

CHALLENGE II: THE OMNIPRESENCE OF AJAX ELEMENTS

AJAX and interactive web components make websites more user-friendly. But not for crawlers!

The result?

Content is rendered dynamically (on the fly) by the browser and is therefore not visible to crawlers that fetch only the raw HTML.

The problem?

To keep the content up-to-date, the crawler needs to be maintained manually on a regular basis.

Even Google’s crawlers find it difficult to extract information!

Page 8: The Challenges in Crawling the Web

Crawlers need a more refined approach to become more efficient and scalable. We have a solution that makes crawling AJAX pages prompt.
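
A common way to cope with AJAX-heavy pages is to render them in a headless browser before parsing. The sketch below is one possible approach using the Playwright library; it is not the deck's (or PromptCloud's) implementation, and the URL and wait condition are illustrative.

# Minimal sketch: render an AJAX-heavy page before scraping it.
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    """Return the HTML after the browser has executed the page's JavaScript."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Wait until network activity settles so dynamically loaded
        # content is present in the DOM before we read it.
        page.wait_for_load_state("networkidle")
        html = page.content()
        browser.close()
    return html

if __name__ == "__main__":
    print(len(fetch_rendered_html("https://example.com")))

The rendered HTML can then be handed to an ordinary parser, at the cost of a full browser per page fetch.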

Page 9: The Challenges in Crawling the Web

CHALLENGE III: THE “REAL” REAL-TIME LATENCY

Acquiring datasets in real time is a huge problem! Real-time data is critical in security and intelligence to predict, report, and enable preemptive action.

The result?

While near-real-time delivery is achievable, true real-time latency remains the Holy Grail.

The problem?

The real problem comes in deciding what is and isn't important in real time.

Page 10: The Challenges in Crawling the Web

CHALLENGE IV: WHO OWNS UGC?

Giants like Craigslist and Yelp claim proprietorship over User-Generated Content (UGC), which is usually out of bounds for commercial crawlers.

The result?

Only 2-3% of sites disallow bots today. The rest believe in data democratization, but they may well follow suit and shut access to this data gold mine!

The problem?

Sites policing web scraping and rejecting bots outright.

Page 11: The Challenges in Crawling the Web

CHALLENGE V: THE RISE OF ANTI-SCRAPING TOOLS

Tools like ScrapeDefender, ScrapeShield, and ScrapeSentry can differentiate bots from humans.

The result?

Restrictions on web crawlers via e-mail obfuscation, real-time monitoring, instant alerts, and the like.

The problem?

Fewer than 1% of sites use such tools today, yet adoption may rise, thanks largely to rogue crawlers whose repeated hits on target servers make a DDoS unavoidable!

Page 12: The Challenges in Crawling the Web

Web data is a vast uncharted territory full of bounty, and having the proper tools helps.

So does knowing how to use them, since a very thin line separates ‘crawlers’ from ‘hackers’.

And this is where the genuine concern for privacy arises.

At PromptCloud, these crawling challenges are met head-on.

Here are two ground rules we recommend every web-crawling solution follow.

Page 13: The Challenges in Crawling the Web

COURTESY

In our experience, a little courtesy goes a long way.

Burdening small servers and causing DDoS on target sites is easy.

Yet it is detrimental to the success of any company, especially small businesses!

Rule #1 is to allow an interval of at least 2 seconds between successive requests.

This helps avoid hitting servers too hard.
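
A minimal sketch of what Rule #1 looks like in practice, using only Python's standard library. The 2-second interval comes from the rule above; the URLs, function, and variable names are illustrative placeholders.

# Minimal sketch of a courteous fetch loop: at least 2 seconds
# between successive requests to the same server, per Rule #1.
import time
import urllib.request

CRAWL_DELAY_SECONDS = 2.0  # the minimum interval recommended above

def polite_fetch(urls):
    """Fetch each URL in turn, sleeping between requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(CRAWL_DELAY_SECONDS)  # never hit the server back-to-back
        with urllib.request.urlopen(url) as resp:
            pages.append(resp.read())
    return pages

if __name__ == "__main__":
    polite_fetch(["https://example.com/page1", "https://example.com/page2"])

In a production crawler the delay is usually enforced per host rather than globally, but the principle is the same.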

Page 14: The Challenges in Crawling the Web

CRAWLABILITY

Most websites restrict what spiders may crawl (specific sections or the entire site) via the robots.txt file.

Rule #2 is to establish the crawlability of the target site(s) first!

It helps greatly to check the site's policy on bots, i.e., whether it allows bots in the target sections from which data is desired.
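
A minimal sketch of that check using Python's built-in robots.txt parser; the user-agent string and URLs are hypothetical placeholders.

# Minimal sketch of Rule #2: consult robots.txt before crawling.
# Uses only the Python standard library; "MyCrawlerBot" and the
# URLs below are placeholders.
from urllib.robotparser import RobotFileParser

def is_crawl_allowed(site: str, path: str, user_agent: str = "MyCrawlerBot") -> bool:
    """Return True if robots.txt permits this user agent to fetch the path."""
    rp = RobotFileParser()
    rp.set_url(f"{site}/robots.txt")
    rp.read()  # download and parse the site's robots.txt
    return rp.can_fetch(user_agent, f"{site}{path}")

if __name__ == "__main__":
    print(is_crawl_allowed("https://example.com", "/listings/"))

The same parser also exposes any Crawl-delay directive via crawl_delay(), which ties back to Rule #1 on courtesy.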

Page 15: The Challenges in Crawling the Web