Overview of Web-Crawlers


Page 1: Overview of Web-Crawlers


Overview of Web-Crawlers
Neal Richter & Anthony Arnone

• Nov 30, 2005 – CS Conference Room
• These slides are edited versions of the chapter 2 lecture notes from "Mining the Web" by Soumen Chakrabarti.
• "Programming Spiders, Bots, and Aggregators in Java" by Jeff Heaton is also a good reference.

Page 2: Overview of Web-Crawlers

Crawling the Web

Web pages
• A few thousand characters long
• Served over the Internet using the HyperText Transfer Protocol (HTTP)
• Viewed at the client end using browsers

Crawler
• Fetches the pages to a computer
• At that computer, automatic programs can analyze hypertext documents

Page 3: Overview of Web-Crawlers


HTML (HyperText Markup Language)

Lets the author
• specify layout and typeface
• embed diagrams
• create hyperlinks

A hyperlink is expressed as an anchor tag with an HREF attribute.

HREF names another page using a Uniform Resource Locator (URL); see the example below.
• URL = protocol field ("http") + server hostname ("www.cse.iitb.ac.in") + file path ("/", the root of the published file system)
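As an aside not on the slide, the three URL fields can be pulled apart with Python's standard urllib.parse; the anchor text is just an illustration, and the hostname is the example from the bullet above.

    from urllib.parse import urlparse

    # A hyperlink as it appears in HTML:
    #   <a href="http://www.cse.iitb.ac.in/index.html">CSE at IIT Bombay</a>

    parts = urlparse("http://www.cse.iitb.ac.in/index.html")
    print(parts.scheme)   # protocol field:  'http'
    print(parts.netloc)   # server hostname: 'www.cse.iitb.ac.in'
    print(parts.path)     # file path:       '/index.html'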

Page 4: Overview of Web-Crawlers


HTTP (HyperText Transfer Protocol)

Built on top of the Transmission Control Protocol (TCP)

Steps (from the client end; a sketch follows below):
• Resolve the server host name to an Internet address (IP)
   – use the Domain Name System (DNS)
   – DNS is a distributed database of name-to-IP mappings maintained at a set of known servers
• Contact the server using TCP
   – connect to the default HTTP port (80) on the server
   – send the HTTP request header (e.g.: GET)
   – fetch the response header
      · MIME (Multipurpose Internet Mail Extensions): a meta-data standard for email and Web content transfer
   – fetch the HTML page
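A minimal sketch of these client-side steps in Python, using a raw socket; the host and path are placeholders, and a real crawler would add timeouts and error handling.

    import socket

    host, path = "www.example.com", "/"        # placeholder target

    ip = socket.gethostbyname(host)            # resolve the host name to an IP via DNS
    sock = socket.create_connection((ip, 80))  # TCP connect to the default HTTP port

    # send the HTTP request header (a plain GET)
    sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))

    # fetch the response: header lines (including the MIME type), then the HTML page
    response = b""
    while chunk := sock.recv(4096):
        response += chunk
    sock.close()

    headers, _, html = response.partition(b"\r\n\r\n")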

Page 5: Overview of Web-Crawlers


Crawl "all" Web pages?

Problem: no catalog of all accessible URLs on the Web.

Solution (sketched below):
• start from a given set of URLs
• progressively fetch and scan them for new outlinking URLs
• fetch these pages in turn…
• submit the text in each page to a text indexing system
• and so on…
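A toy version of this fetch-scan-enqueue loop might look as follows; the indexing step is only a placeholder comment, and the naive regular-expression link extraction stands in for a real HTML parser.

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    frontier = deque(["http://www.example.com/"])   # the given set of starting URLs (placeholder)
    seen = set(frontier)

    while frontier:
        url = frontier.popleft()
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue
        # ... submit the text in the page to a text indexing system here ...
        for href in re.findall(r'href="([^"]+)"', page, re.IGNORECASE):
            outlink = urljoin(url, href)            # scan for new outlinking URLs
            if outlink not in seen:
                seen.add(outlink)
                frontier.append(outlink)            # fetch these pages in turn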

Page 6: Overview of Web-Crawlers


Crawling procedure

Simple in principle, but
• a great deal of engineering goes into industry-strength crawlers
• industry crawlers crawl a substantial fraction of the Web
   – e.g.: Alta Vista, Northern Light, Inktomi

No guarantee that all accessible Web pages will be located in this fashion.

The crawler may never halt…
• pages are added continually even as it is running.

Page 7: Overview of Web-Crawlers


Crawling overheads

Delays involved in
• resolving the host name in the URL to an IP address using DNS
• connecting a socket to the server and sending the request
• receiving the requested page in response

Solution: overlap the above delays by
• fetching many pages at the same time

Page 8: Overview of Web-Crawlers


Anatomy of a crawler

Page fetching threads
• start with DNS resolution
• finish when the entire page has been fetched

Each page is
• stored in compressed form to disk/tape
• scanned for outlinks

Work pool of outlinks
• maintains network utilization without overloading it
• dealt with by a load manager

Continue till the crawler has collected a sufficient number of pages.

Page 9: Overview of Web-Crawlers


Typical anatomy of a large-scale crawler.

Page 10: Overview of Web-Crawlers


Large-scale crawlers: performance and reliability considerations

Need to fetch many pages at the same time
• to utilize the network bandwidth
• a single page fetch may involve several seconds of network latency

Highly concurrent and parallelized DNS lookups

Use of asynchronous sockets
• explicit encoding of the state of a fetch context in a data structure
• polling sockets to check for completion of network transfers
• multi-processing or multi-threading at this scale: impractical

Care in URL extraction
• eliminating duplicates to reduce redundant fetches
• avoiding "spider traps"

Page 11: Overview of Web-Crawlers


DNS caching, prefetching and resolution

A customized DNS component with…

1. Custom client for address resolution
   • tailored for concurrent handling of multiple outstanding requests
   • allows issuing many resolution requests together
      – polling at a later time for completion of individual requests
   • facilitates load distribution among many DNS servers

2. Caching server
   • with a large cache, persistent across DNS restarts
   • residing largely in memory if possible

3. Prefetching client
   • parse a page that has just been fetched and extract host names from HREF targets
   • query the DNS cache via UDP
   • don't wait for the reply; check later (when you need it)
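The caching part of this component can be sketched with an in-memory table; this toy version resolves synchronously with the standard library, whereas the custom client described on the slide would issue asynchronous UDP queries and poll for completion.

    import socket

    dns_cache = {}                      # host name -> IP address, held in memory

    def resolve(host):
        """Return a cached address if we have one; otherwise look it up and remember it."""
        if host not in dns_cache:
            dns_cache[host] = socket.gethostbyname(host)
        return dns_cache[host]

    def prefetch(hosts):
        """Warm the cache for host names extracted from HREF targets on a fetched page."""
        for host in hosts:
            try:
                resolve(host)
            except socket.gaierror:
                pass                    # unresolvable hosts are simply skipped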

Page 12: Overview of Web-Crawlers


Multiple concurrent fetches

• Managing multiple concurrent connections
   – a single download may take several seconds
   – open many socket connections to different HTTP servers simultaneously
• Multi-CPU machines not useful
   – crawling performance is limited by network and disk
• Two approaches
   1. using multi-threading
   2. using non-blocking sockets with event handlers

Page 13: Overview of Web-Crawlers


Multi-threading

• Logical threads
   – physical threads of control provided by the operating system (e.g.: pthreads), OR
   – concurrent processes
• Fixed number of threads allocated in advance
• Programming paradigm (sketched below)
   – create a client socket
   – connect the socket to the HTTP service on a server
   – send the HTTP request header
   – read the socket (recv) until no more characters are available
   – close the socket
• Uses blocking system calls
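A sketch of this paradigm with a fixed thread pool and blocking calls; the work item and the thread count are placeholders.

    import socket
    import threading
    from queue import Queue

    work_pool = Queue()
    work_pool.put(("www.example.com", "/"))            # placeholder (host, path) work item

    def fetch_worker():
        while True:
            host, path = work_pool.get()
            try:
                sock = socket.create_connection((host, 80), timeout=10)   # connect (blocks)
                sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))
                page = b""
                while chunk := sock.recv(4096):        # read until no more characters are available
                    page += chunk
                sock.close()
                # hand the page off for storage and link extraction here
            except OSError:
                pass
            finally:
                work_pool.task_done()

    for _ in range(4):                                 # fixed number of threads, allocated in advance
        threading.Thread(target=fetch_worker, daemon=True).start()
    work_pool.join()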

Page 14: Overview of Web-Crawlers


Multi-threading: problems

• Performance penalty
   – mutual exclusion
   – concurrent access to data structures
• Slow disk seeks
   – a great deal of interleaved, random input-output on disk
   – due to concurrent modification of the document repository by multiple threads

Page 15: Overview of Web-Crawlers


Non-blocking sockets and event handlers

• Non-blocking sockets (sketched below)
   – connect, send or recv calls return immediately without waiting for the network operation to complete
   – poll the status of the network operation separately
• "select" system call
   – lets the application suspend until more data can be read from or written to a socket
   – times out after a pre-specified deadline
   – monitors several sockets at the same time
• More efficient memory management
   – code that completes processing is not interrupted by other completions
   – no need for locks and semaphores on the pool
   – only append complete pages to the log
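A compact sketch of the select-based style: one dictionary entry per fetch context holds that socket's pending request and the bytes received so far. The host is a placeholder and error paths are omitted.

    import select
    import socket

    contexts = {}                                     # socket -> state of that fetch

    def start_fetch(host, path="/"):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)                       # connect returns immediately
        sock.connect_ex((host, 80))                   # completion is detected by select() below
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
        contexts[sock] = {"request": request, "data": b""}

    start_fetch("www.example.com")                    # placeholder target

    while contexts:
        readable, writable, _ = select.select(list(contexts), list(contexts), [], 5)
        for sock in writable:
            if contexts[sock]["request"]:             # connection established: send the request once
                sock.sendall(contexts[sock]["request"])
                contexts[sock]["request"] = b""
        for sock in readable:
            chunk = sock.recv(4096)
            if chunk:
                contexts[sock]["data"] += chunk
            else:                                     # transfer complete: append the whole page to the log
                page = contexts.pop(sock)["data"]
                sock.close()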

Page 16: Overview of Web-Crawlers


Link extraction and normalization

• Goal: obtain a canonical form of each URL
• URL processing and filtering
   – avoid multiple fetches of pages known by different URLs
• Relative URLs
   – need to be interpreted w.r.t. a base URL
• Many IP addresses or mirror sites may serve the same page

Page 17: Overview of Web-Crawlers


Canonical URL

Formed by (a sketch follows)
• using a standard string for the protocol
• canonicalizing the host name
• adding an explicit port number
• normalizing and cleaning up the path
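A sketch of those four steps with the standard urllib.parse; the default-port table and the exact normalization rules are choices a real crawler would tune.

    import posixpath
    from urllib.parse import urlsplit, urlunsplit

    def canonical_url(url):
        parts = urlsplit(url)
        scheme = (parts.scheme or "http").lower()                  # standard string for the protocol
        host = (parts.hostname or "").lower()                      # canonical host name
        port = parts.port or {"http": 80, "https": 443}.get(scheme, 80)   # explicit port number
        path = posixpath.normpath(parts.path or "/")               # normalize and clean up the path
        return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

    # canonical_url("HTTP://WWW.CSE.IITB.AC.IN/a/./b/../index.html")
    #   -> "http://www.cse.iitb.ac.in:80/a/index.html"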

Page 18: Overview of Web-Crawlers


Robot exclusion

• Check whether the server prohibits crawling a normalized URL (example below)
• robots.txt file in the HTTP root directory of the server
   – specifies a list of path prefixes which crawlers should not attempt to fetch
   – meant for crawlers only
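Python's standard library ships a parser for exactly this file; a typical check before queueing a URL looks like this (the host and the user-agent name are placeholders).

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")    # robots.txt lives at the server's HTTP root
    rp.read()                                           # fetch and parse the path-prefix rules

    # ask before adding a normalized URL to the work pool
    if rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html"):
        ...                                             # allowed: queue the URL for fetching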

Page 19: Overview of Web-Crawlers


Eliminating already-visited URLs

Checking if a URL has already been fetched
• before adding a new URL to the work pool
• needs to be very quick
• achieved by computing an MD5 hash function on the URL

Exploiting spatio-temporal locality of access: a two-level hash function (sketched below)
• most significant bits (say, 24) derived by hashing the host name plus port
• lower-order bits (say, 40) derived by hashing the path
• the concatenated bits are used as a key in a B-tree
• qualifying URLs are added to the frontier of the crawl
• hash values are added to the B-tree
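One way to realize this two-level key (a sketch; the 24/40 split simply follows the figures quoted above):

    import hashlib
    from urllib.parse import urlsplit

    def url_key(url):
        parts = urlsplit(url)
        host = f"{parts.hostname}:{parts.port or 80}".encode()
        path = (parts.path or "/").encode()
        host_bits = int.from_bytes(hashlib.md5(host).digest(), "big") >> (128 - 24)
        path_bits = int.from_bytes(hashlib.md5(path).digest(), "big") >> (128 - 40)
        return (host_bits << 40) | path_bits     # concatenated bits, used as the B-tree key

    # URLs on the same host share their top 24 bits, so their keys sit close
    # together in the B-tree -- the spatio-temporal locality exploited above.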

Page 20: Overview of Web-Crawlers


Spider traps

Protecting the crawler from crashing on
• ill-formed HTML
   – e.g.: a page with 68 kB of null characters
• misleading sites
   – an indefinite number of pages dynamically generated by CGI scripts
   – paths of arbitrary depth created using soft directory links and path remapping features in the HTTP server

Page 21: Overview of Web-Crawlers


Spider traps: solutions

No automatic technique can be foolproof.

Check for URL length.

Guards (a minimal filtering sketch follows)
• prepare regular crawl statistics
• add dominating sites to a guard module
• disable crawling of active content such as CGI form queries
• eliminate URLs with non-textual data types
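A minimal filtering sketch along these lines; the length limit and the extension list are illustrative assumptions, not values from the slide.

    import re

    MAX_URL_LENGTH = 256                                   # assumed cutoff, a tuning choice
    NON_TEXT = re.compile(r"\.(gif|jpe?g|png|pdf|zip|exe|mp3)$", re.IGNORECASE)

    def passes_guards(url):
        if len(url) > MAX_URL_LENGTH:                      # suspiciously long paths often signal a trap
            return False
        if "/cgi-bin/" in url or "?" in url:               # skip active content such as CGI form queries
            return False
        if NON_TEXT.search(url):                           # skip URLs with non-textual data types
            return False
        return True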

Page 22: Overview of Web-Crawlers


Avoiding repeated expansion of links on duplicate pages

Reduce redundancy in crawls: duplicate detection
• mirrored Web pages and sites

Detecting exact duplicates
• check against MD5 digests of stored pages
• represent a relative link v (relative to aliases u1 and u2) as the tuples (h(u1), v) and (h(u2), v)

Detecting near-duplicates
• even a single altered character will completely change the digest!
   – e.g.: date of update, name and e-mail of the site administrator
• solution: shingling (sketched below)
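A simplified shingling sketch: compare the sets of k-word windows of two pages rather than their digests. Production systems hash the shingles and sample them (e.g. with min-hashing) rather than keeping them all.

    def shingles(text, k=4):
        """All k-word windows ("shingles") occurring in a page's text."""
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def resemblance(a, b, k=4):
        """Jaccard overlap of shingle sets: near-duplicates score close to 1
        even when a date or an administrator's e-mail address differs."""
        sa, sb = shingles(a, k), shingles(b, k)
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0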

Page 23: Overview of Web-Crawlers


Text repository

The crawler's last task: dumping fetched pages into a repository.

Decoupling the crawler from other functions is preferred, for efficiency and reliability.

Page-related information is stored in two parts:
• meta-data
• page contents

Page 24: Overview of Web-Crawlers


Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.

Page 25: Overview of Web-Crawlers


Refreshing crawled pages

• A search engine's index should be fresh.
• A Web-scale crawler never "completes" its job.
• High variance in the rate of page changes.
• The "If-Modified-Since" request header in the HTTP protocol (example below)
   – impractical to use for every page in a large Web crawl
• Solution: at the commencement of a new crawling round, estimate which pages have changed.
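A conditional fetch with this header might look as follows (the URL and timestamp are placeholders); a 304 reply means the stored copy is still current.

    from urllib.request import Request, urlopen
    from urllib.error import HTTPError

    url = "http://www.example.com/page.html"                 # placeholder
    last_crawl = "Wed, 30 Nov 2005 00:00:00 GMT"              # when we last fetched it

    req = Request(url, headers={"If-Modified-Since": last_crawl})
    try:
        page = urlopen(req, timeout=10).read()                # 200: the page changed, re-index it
    except HTTPError as err:
        if err.code == 304:
            page = None                                       # 304 Not Modified: keep the old copy
        else:
            raise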

Page 26: Overview of Web-Crawlers


Determining page changes

• The "Expires" HTTP response header
   – for pages that come with an expiry date
• Otherwise the crawler needs to guess whether revisiting the page will yield a modified version
   – maintain a score reflecting the probability that the page has been modified
   – the crawler fetches URLs in decreasing order of score
• Assumption: the recent past predicts the future

Page 27: Overview of Web-Crawlers


Estimating page change rates

Brewington and Cybenko & Cho
• algorithms for maintaining a crawl in which most pages are fresher than a specified epoch

Prerequisite
• the average interval at which the crawler checks for changes is smaller than the inter-modification time of a page

Small-scale intermediate crawler
• runs to monitor fast-changing sites (e.g.: current news, weather, etc.)
• intermediate indices are patched into the master index

Page 28: Overview of Web-Crawlers

Questions?

Lots of Programming detail missing!!!!