The Challenges in Crawling the Web


Data crawling and data scraping are as challenging as they are exciting. While the opportunity in web data is large, here are a few practical challenges in data crawling and scraping.

Transcript of The Challenges in Crawling the Web

Page 1: The Challenges in Crawling the Web
Page 2: The Challenges in Crawling the Web

THE CHALLENGES IN CRAWLING THE WEB.

Page 3: The Challenges in Crawling the Web

Extracting data from the web is an ever-evolving field, and it is still a gray area.

No clear ground rules regarding the legality of web scraping exist!

Concern over the privacy implications of collecting data from the Web is growing.

People are wary about how data is or can be used.

Page 4: The Challenges in Crawling the Web

Increasingly, Big Data is being frowned upon.

Its harvesting, even more so!

Yet, undeniably, data crawling is growing exponentially.

As it grows, the Web is gradually becoming more complicated to crawl.

Page 5: The Challenges in Crawling the Web

CHALLENGE I: NON-UNIFORM STRUCTURES

Data formats and structures are inconsistent across the Web.

Norms on how to build an Internet presence are non-existent.

The result?

A lack of uniformity across the vast, ever-changing terrain of the Internet.

The problem?

Collecting data in a machine-readable format becomes difficult.

Page 6: The Challenges in Crawling the Web

Problems multiply as the scale increases!

Especially, when:

a) structured data is needed, and

b) a large number of fields must be extracted from multiple sources (as sketched below).
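
One common workaround for this lack of uniformity is to keep a per-site selector map and normalize every page into a single schema. The following is a minimal sketch of that idea, assuming the third-party BeautifulSoup library; the site names, CSS selectors, and field names are purely illustrative and not taken from the deck.

# Minimal sketch: per-site CSS selector maps that normalize
# non-uniform pages into one machine-readable schema.
# Requires: pip install beautifulsoup4
from bs4 import BeautifulSoup

# Hypothetical selector configuration, one entry per source site.
SITE_SELECTORS = {
    "site-a.example.com": {"title": "h1.product-name", "price": "span.price"},
    "site-b.example.com": {"title": "div#item > h2", "price": "em.cost"},
}

def extract_record(host: str, html: str) -> dict:
    """Map a site-specific page onto a common {title, price} schema."""
    selectors = SITE_SELECTORS[host]
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, css in selectors.items():
        node = soup.select_one(css)
        record[field] = node.get_text(strip=True) if node else None
    return record

Each new source then only needs a new selector entry, while downstream consumers always receive the same structured record.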

Page 7: The Challenges in Crawling the Web

CHALLENGE II: THE OMNIPRESENCE OF AJAX ELEMENTS

AJAX and interactive web components make websites more user-friendly. But not for crawlers!

The result?

Content is rendered dynamically (on the fly) by the browser and is therefore not visible to crawlers that fetch only the raw HTML.

The problem?

To keep the content up-to-date, the crawler needs to be maintained manually on a regular basis.

Even Google’s crawlers find it difficult to extract information!

Page 8: The Challenges in Crawling the Web

Crawlers need a more refined approach to become more efficient and scalable. We have a solution that makes crawling AJAX pages prompt.
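
A common way to cope with AJAX-heavy pages is to render them in a headless browser before parsing. The sketch below is one possible approach using the Playwright library; it is not the deck's (or PromptCloud's) implementation, and the URL and wait condition are illustrative.

# Minimal sketch: render an AJAX-heavy page before scraping it.
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    """Return the HTML after the browser has executed the page's JavaScript."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Wait until network activity settles so dynamically loaded
        # content is present in the DOM before we read it.
        page.wait_for_load_state("networkidle")
        html = page.content()
        browser.close()
    return html

if __name__ == "__main__":
    print(len(fetch_rendered_html("https://example.com")))

The rendered HTML can then be handed to an ordinary parser, at the cost of a full browser per page fetch.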

Page 9: The Challenges in Crawling the Web

CHALLENGE III: THE “REAL” REAL-TIME LATENCY

Acquiring datasets in real time is a huge problem! Real-time data is critical in security and intelligence to predict, report, and enable preemptive action.

The result?

While near-real-time delivery is achievable, true real-time latency remains the Holy Grail.

The problem?

The real problem comes in deciding what is and isn't important in real time.

Page 10: The Challenges in Crawling the Web

CHALLENGE IV: WHO OWNS UGC?

Giants like Craigslist and Yelp claim proprietorship over User-Generated Content (UGC), which is usually out of bounds for commercial crawlers.

The result?

Only 2-3% of sites disallow bots today. The rest believe in data democratization, but they may well follow suit and shut access to this data gold mine!

The problem?

Sites policing web scraping and rejecting bots outright.

Page 11: The Challenges in Crawling the Web

CHALLENGE V: THE RISE OF ANTI-SCRAPING TOOLS

Tools like ScrapeDefender, ScrapeShield, and ScrapeSentry can differentiate bots from humans.

The result?

Restrictions on web crawlers via e-mail obfuscation, real-time monitoring, instant alerts, and the like.

The problem?

Fewer than 1% of sites use such tools today, yet adoption may rise, thanks largely to rogue crawlers whose repeated hits on target servers make a DDoS unavoidable!

Page 12: The Challenges in Crawling the Web

Web data is a vast uncharted territory full of bounty, and having the proper tools helps.

So does knowing how to use them, since a very thin line separates ‘crawlers’ from ‘hackers’.

And this is where the genuine concern for privacy arises.

At PromptCloud, these crawling challenges are met head-on.

Here are two ground rules we recommend every web-crawling solution follow.

Page 13: The Challenges in Crawling the Web

COURTESY

In our experience, a little courtesy goes a long way.

Burdening small servers and causing DDoS on target sites is easy.

Yet it is detrimental to the success of any company, especially small businesses!

Rule #1 is to allow an interval of at least 2 seconds between successive requests.

This helps avoid hitting servers too hard.
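
A minimal sketch of what Rule #1 looks like in practice, using only Python's standard library. The 2-second interval comes from the rule above; the URLs, function, and variable names are illustrative placeholders.

# Minimal sketch of a courteous fetch loop: at least 2 seconds
# between successive requests to the same server, per Rule #1.
import time
import urllib.request

CRAWL_DELAY_SECONDS = 2.0  # the minimum interval recommended above

def polite_fetch(urls):
    """Fetch each URL in turn, sleeping between requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(CRAWL_DELAY_SECONDS)  # never hit the server back-to-back
        with urllib.request.urlopen(url) as resp:
            pages.append(resp.read())
    return pages

if __name__ == "__main__":
    polite_fetch(["https://example.com/page1", "https://example.com/page2"])

In a production crawler the delay is usually enforced per host rather than globally, but the principle is the same.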

Page 14: The Challenges in Crawling the Web

CRAWLABILITY

Most websites restrict what spiders may crawl (specific sections or the entire site) via the robots.txt file.

Rule #2 is to establish the crawlability of the target site(s) first!

It helps greatly to check the site's policy on bots, i.e., whether it allows bots in the target sections from which data is desired.
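
A minimal sketch of that check using Python's built-in robots.txt parser; the user-agent string and URLs are hypothetical placeholders.

# Minimal sketch of Rule #2: consult robots.txt before crawling.
# Uses only the Python standard library; "MyCrawlerBot" and the
# URLs below are placeholders.
from urllib.robotparser import RobotFileParser

def is_crawl_allowed(site: str, path: str, user_agent: str = "MyCrawlerBot") -> bool:
    """Return True if robots.txt permits this user agent to fetch the path."""
    rp = RobotFileParser()
    rp.set_url(f"{site}/robots.txt")
    rp.read()  # download and parse the site's robots.txt
    return rp.can_fetch(user_agent, f"{site}{path}")

if __name__ == "__main__":
    print(is_crawl_allowed("https://example.com", "/listings/"))

The same parser also exposes any Crawl-delay directive via crawl_delay(), which ties back to Rule #1 on courtesy.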

Page 15: The Challenges in Crawling the Web