Webcrawler


Transcript of Webcrawler

Page 1: Webcrawler

Web Crawling

Submitted by: Govind Raj
Registration no: 1001227464
INFORMATION TECHNOLOGY

Page 2: Webcrawler

A key motivation for designing Web crawlers has been to retrieve Web pages and add their representations to a local repository.

Beginning

Page 3: Webcrawler

What is “Web Crawling”?

What are the uses of Web Crawling?

Types of crawling

Web Crawling?

Page 4: Webcrawler

• A Web crawler (also known as a Web spider, Web robot, or, especially in the FOAF community, a Web scutter) is a program or automated script that browses the World Wide Web in a methodical, automated manner.

• Other less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms.

Web Crawling:

Page 5: Webcrawler

Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.

The role of crawlers is to collect Web content.

What crawlers are:

Page 6: Webcrawler

Begin with known “seed” pages
Fetch and parse them
Extract the URLs they point to
Place the extracted URLs on a queue
Fetch each URL on the queue and repeat (see the sketch below)

Basic crawler operation:
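A minimal sketch of this loop in Python, using only the standard library. The seed URL, the page limit, and the names crawl and LinkExtractor are illustrative choices, not part of the original slides.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags while a page is parsed."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=50):
    """Breadth-first crawl: fetch, parse, extract URLs, enqueue, repeat."""
    queue = deque(seeds)      # URLs waiting to be fetched
    seen = set(seeds)         # remember URLs so nothing is fetched twice
    while queue and len(seen) <= max_pages:
        url = queue.popleft()                 # pick up the next URL
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                          # skip pages that fail to download
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)     # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)        # place the extracted URL on the queue
    return seen


if __name__ == "__main__":
    print(crawl(["https://example.com/"], max_pages=10))

The seen set is what keeps the crawler from revisiting pages; it is also where the URL normalization discussed on a later slide would be applied.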

Page 7: Webcrawler


Traditional Web Crawler

Page 8: Webcrawler

The basic algorithm:

{
  Pick up the next URL
  Connect to the server
  GET the URL
  When the page arrives, get its links (optionally do other stuff)
  REPEAT
}

Beginning with a Web crawler:

Page 9: Webcrawler

Complete web search engine

Search Engine = Crawler + Indexer/Searcher (Lucene) + GUI

Find stuff
Gather stuff
Check stuff

Uses for crawling:

Page 10: Webcrawler

• Batch Crawlers- Crawl a snapshot of their crawl space, until reaching a certain size or time limit.

• Incremental Crawlers- Continuously crawl their crawl space, revisiting URLs to ensure freshness.

• Focused Crawlers- Attempt to crawl pages pertaining to some topic/theme, while minimizing the number of off-topic pages that are collected (see the sketch below).

Several Types of Crawlers:
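As a rough illustration of the focused variant, the fragment below scores a fetched page against a set of topic keywords and only queues the URLs extracted from on-topic pages. The keyword set, threshold, and function names are hypothetical; real focused crawlers usually rely on trained classifiers rather than simple keyword counts.

TOPIC_KEYWORDS = {"crawler", "spider", "indexing", "search"}   # hypothetical topic


def is_on_topic(page_text, threshold=2):
    """Count keyword hits in the page text; treat the page as on-topic above a threshold."""
    words = page_text.lower().split()
    hits = sum(1 for word in words if word.strip(".,;:!?") in TOPIC_KEYWORDS)
    return hits >= threshold


def enqueue_if_relevant(queue, url, page_text):
    """Focused crawling: only enqueue links found on pages that match the topic."""
    if is_on_topic(page_text):
        queue.append(url)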

Page 11: Webcrawler

Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once. The term URL normalization refers to the process of modifying and standardizing a URL in a consistent manner.

URL normalization
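A sketch of such normalization with Python's urllib.parse. The particular rules applied here (lower-casing the scheme and host, dropping the default port and the fragment, and collapsing an empty path to "/") are common choices assumed for this example, not an exhaustive list.

from urllib.parse import urlparse, urlunparse


def normalize(url):
    """Rewrite a URL into one consistent form so duplicate spellings compare equal."""
    parts = urlparse(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port and not (scheme == "http" and port == 80) \
            and not (scheme == "https" and port == 443):
        host = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment never reaches the server, so it is dropped.
    return urlunparse((scheme, host, path, parts.params, parts.query, ""))


# Both spellings normalize to the same string:
assert normalize("HTTP://Example.COM:80/index.html#top") == \
       normalize("http://example.com/index.html")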

Page 12: Webcrawler

There are three important characteristics of the Web that make crawling it very difficult:

Its large volume
Its fast rate of change
Dynamic page generation

The challenges of “Web Crawling”:

Page 13: Webcrawler

Yahoo! Slurp: Yahoo Search crawler.
Msnbot: Microsoft's Bing web crawler.
Googlebot: Google's web crawler.
WebCrawler: Used to build the first publicly available full-text index of a subset of the Web.
World Wide Web Worm: Used to build a simple index of document titles and URLs.
WebFountain: Distributed, modular crawler written in C++.
Slug: Semantic web crawler.

Examples of Web crawlers

Page 14: Webcrawler

Web 3.0 defines advanced technologies and new principles for the next generation of search technologies, summarized in:

- Semantic Web
- Website Parse Template concepts

Web 3.0 crawling and indexing technologies will be based on:

- Human-machine clever associations

Web 3.0 Crawling

Page 15: Webcrawler

A distributed computing technique whereby search engines employ many computers to index the Internet via web crawling.

The idea is to spread the required computation and bandwidth across many computers and networks.

Types of distributed web crawling:
1. Dynamic Assignment
2. Static Assignment

Distributed Web Crawling

Page 16: Webcrawler

With this, a central server assigns new URLs to different crawlers dynamically. This allows the central server to dynamically balance the load of each crawler.

Configurations of crawling architectures with dynamic assignments:

• A small crawler configuration, in which there is a central DNS resolver and central queues per Web site, and distributed downloaders.

• A large crawler configuration, in which the DNS resolver and the queues are also distributed.

Dynamic Assignment
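A toy, in-process model of this idea, assuming a central dispatcher that tracks each crawler's current load and hands every new URL to the least-loaded one. The class and method names are invented for illustration; a real system would run the dispatcher as a separate server and talk to the crawlers over the network.

class CentralDispatcher:
    """Central server that dynamically assigns new URLs to crawler processes."""

    def __init__(self, crawler_ids):
        # How many URLs each crawler is currently responsible for.
        self.load = {crawler_id: 0 for crawler_id in crawler_ids}

    def assign(self, url):
        """Give the URL to the least-loaded crawler to balance the work."""
        crawler_id = min(self.load, key=self.load.get)
        self.load[crawler_id] += 1
        return crawler_id, url


dispatcher = CentralDispatcher(["crawler-0", "crawler-1", "crawler-2"])
print(dispatcher.assign("http://example.com/a"))   # goes to crawler-0
print(dispatcher.assign("http://example.com/b"))   # goes to crawler-1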

Page 17: Webcrawler

• Here a fixed rule is stated from the beginning of the crawl that defines how to assign new URLs to the crawlers.

• A hashing function can be used to transform URLs into a number that corresponds to the index of the corresponding crawling process (see the sketch below).

• To reduce the overhead due to the exchange of URLs between crawling processes, when links switch from one website to another, the exchange should be done in batches.

Static Assignment
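A sketch of the hashing rule described above: the host part of the URL is hashed to a stable number and reduced modulo the number of crawling processes, so every process can compute the same assignment without consulting a central server. Hashing the host rather than the full URL (so that a whole website stays with one crawler) and the cluster size of four are assumptions of this sketch.

import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4   # assumed size of the crawling cluster


def assigned_crawler(url, num_crawlers=NUM_CRAWLERS):
    """Static assignment: the same host always maps to the same crawler index."""
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers


# Every process evaluates the same rule, so no coordination is needed:
print(assigned_crawler("http://example.com/page1"))   # same index...
print(assigned_crawler("http://example.com/page2"))   # ...as page1 (same host)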

Page 18: Webcrawler

Web crawlers are an important aspect of search engines.

High-performance Web crawling processes are basic components of various Web services.

It is not a trivial matter to set up such systems:
1. The data manipulated by these crawlers covers a wide area.
2. It is crucial to preserve a good balance between random-access memory and disk accesses.

Conclusion

Page 19: Webcrawler

• http://en.wikipedia.org/wiki/Web_crawling
• www.cs.cmu.edu/~spandey
• www.cs.odu.edu/~fmccown/research/lazy/crawling-policies-ht06.ppt
• http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/
• www.grub.org
• www.filesland.com/companies/Shettysoft-com/web-crawler.html
• www.ciw.cl/recursos/webCrawling.pdf
• www.openldap.org/conf/odd-wien-2003/peter.pdf

References

Page 20: Webcrawler

Thank You For Your Attention