Page 1: A Brief Look at Web Crawlers

A Brief Look at Web Crawlers

Bin Tan

03/15/07

Page 2: A Brief Look at Web Crawlers

Web Crawlers

“… is a program or automated script which browses the World Wide Web in a methodical, automated manner”

Uses:

Create an archive / index from the visited web pages to support offline browsing / search / mining

Automating maintenance tasks on a website

Harvesting specific information from web pages

Page 3: A Brief Look at Web Crawlers

High-level architecture

(Diagram: seed URLs initialize the Frontier, the queue of URLs scheduled for downloading.)
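To make the diagram concrete, here is a minimal sketch of that architecture in Python (mine, not from the slides): seed URLs initialize the frontier, each fetched page is parsed for links, and newly discovered uiuc.edu URLs go back into the frontier. The seed URL is an assumption, and politeness, error handling and parallelism are deliberately left out; the next slides explain why that matters.

# Minimal crawler: Seeds initialize the Frontier; fetched pages yield new URLs for the Frontier.
# Hypothetical seed URL; no politeness, no error handling, no parallelism (see the next slides).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href attributes of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

seeds = ["http://www.uiuc.edu/"]        # assumed seed
frontier = deque(seeds)                 # Frontier: URLs waiting to be downloaded
seen = set(seeds)

while frontier:
    url = frontier.popleft()            # breadth-first order
    html = urlopen(url).read().decode("utf-8", errors="replace")
    extractor = LinkExtractor()
    extractor.feed(html)
    for link in extractor.links:
        absolute = urljoin(url, link)
        host = urlparse(absolute).hostname or ""
        if host.endswith("uiuc.edu") and absolute not in seen:
            seen.add(absolute)
            frontier.append(absolute)   # newly discovered URLs go back into the Frontier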

Page 4: A Brief Look at Web Crawlers

How easy is it to write a program to crawl all uiuc.edu web pages?

Page 5: A Brief Look at Web Crawlers

All sorts of real problems:

Managing multiple download threads is nontrivial

If you make requests to a server at short intervals, you'll overload it

Pages may be missing; servers may be down or sluggish (see the fetch sketch below)

You may be trapped in dynamically generated pages

Web pages may use ill-formed HTML
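A hedged sketch of one way to cope with the missing / down / sluggish cases above, using only the Python standard library; the timeout, retry count and back-off are illustrative assumptions, not values from the slides.

# Robust page fetch: bounded timeout, limited retries, tolerant decoding.
# Timeout/retry values are illustrative assumptions.
import time
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch(url, timeout=10, retries=2):
    for attempt in range(retries + 1):
        try:
            with urlopen(url, timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
        except HTTPError as e:
            if e.code == 404:           # page is simply missing: give up
                return None
        except (URLError, OSError):     # server down, DNS failure, timeout
            pass
        time.sleep(2 ** attempt)        # back off before retrying a sluggish server
    return None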

Page 6: A Brief Look at Web Crawlers

This is only a small-scale crawl…

(Shkapenyuk and Suel, 2002): "While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability."

Page 7: A Brief Look at Web Crawlers

Data characteristics in large-scale crawls

Large volume, fast changes, dynamic page generation: a wide selection of possibly crawlable URLs

Edwards et al: "Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained."

Page 8: A Brief Look at Web Crawlers

Selection policy: which page to download

Need to prioritize according to some page importance metric (see the sketch below):

Depth-first

Breadth-first

Partial PageRank calculation

OPIC (On-line Page Importance Computation)

Length of per-site queues

In focused crawling, prediction of similarity between page text and query

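As an illustration of the selection-policy idea (a generic sketch, not any particular crawler's implementation), the frontier below is a priority queue; score_url is a placeholder where a partial PageRank, OPIC score, per-site queue length, or focused-crawling similarity estimate would plug in.

# Frontier as a priority queue: URLs with higher estimated importance are fetched first.
# score_url() is a stand-in for a real metric (partial PageRank, OPIC, query similarity).
import heapq
import itertools
from urllib.parse import urlparse

counter = itertools.count()          # tie-breaker so heapq never compares URLs directly
frontier = []                        # min-heap of (-score, tie, url)
queue_length = {}                    # per-site queue lengths, one possible signal

def score_url(url):
    host = urlparse(url).hostname or ""
    # Illustrative heuristic only: prefer hosts we have queued few URLs for.
    return 1.0 / (1 + queue_length.get(host, 0))

def push(url):
    host = urlparse(url).hostname or ""
    queue_length[host] = queue_length.get(host, 0) + 1
    heapq.heappush(frontier, (-score_url(url), next(counter), url))

def pop():
    _, _, url = heapq.heappop(frontier)
    host = urlparse(url).hostname or ""
    queue_length[host] -= 1
    return url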

Page 9: A Brief Look at Web Crawlers

Revisit policy: when to check for changes to the pages

Pages are frequently updated, created or deleted

Cost functions to optimize:

Freshness (0 for stale pages, 1 for fresh pages; to be kept high)

Age (amount of time for which a page has been stale; to be kept low)
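For reference, the usual formal definitions behind these two measures (in the style of Cho and Garcia-Molina; the notation here is my own):

% Freshness and age of page p at time t
F_p(t) =
\begin{cases}
  1 & \text{if the local copy of } p \text{ matches the live page at time } t,\\
  0 & \text{otherwise (the copy is stale),}
\end{cases}
\qquad
A_p(t) =
\begin{cases}
  0 & \text{if } p \text{ is fresh at time } t,\\
  t - \text{lastmod}(p) & \text{otherwise,}
\end{cases}

where lastmod(p) is the time at which the live page last changed.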

Page 10: A Brief Look at Web Crawlers

Revisit Policy (cont.)

Uniform policy: revisiting all pages in the collection with the same frequency

Proportional policy: revisiting more often the pages that change more frequently

The optimal method for keeping average freshness high includes ignoring the pages that change too often, and the optimal method for keeping average age low is to use access frequencies that monotonically (and sub-linearly) increase with the rate of change of each page.

Numerical methods are used for calculation based on distribution of page changes
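A small sketch contrasting the two policies; the base interval and per-page change rates are invented for illustration, and this does not implement the optimal sub-linear schedule mentioned above.

# Uniform vs. proportional revisit scheduling (illustrative numbers only).
# change_rate: estimated changes per day for each page (hypothetical values).
change_rate = {
    "http://example.edu/news": 4.0,
    "http://example.edu/about": 0.05,
}

BASE_INTERVAL_DAYS = 1.0

def uniform_interval(url):
    # Uniform policy: every page is revisited at the same fixed interval.
    return BASE_INTERVAL_DAYS

def proportional_interval(url):
    # Proportional policy: pages that change more often are revisited more often.
    return BASE_INTERVAL_DAYS / max(change_rate.get(url, 1.0), 1e-6)

for url in change_rate:
    print(url, uniform_interval(url), proportional_interval(url))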

Page 11: A Brief Look at Web Crawlers

Politeness policy: how to avoid overloading websites

Badly-behaved crawlers can be a nuisance

Robots exclusion protocol (robots.txt), e.g. Google's

Interval/delay between connections (10 sec – 5 min): fixed, or proportional to page downloading time (see the sketch below)
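A hedged sketch of a polite fetch using Python's standard urllib.robotparser; the user-agent name and the fallback delay are assumptions, not values from the slides.

# Polite crawling: honor robots.txt and keep a minimum delay between requests to the same host.
import time
from urllib.parse import urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"      # assumed crawler name
DEFAULT_DELAY = 10                 # seconds; fallback when robots.txt gives no Crawl-delay

robots_cache = {}                  # host -> RobotFileParser
last_access = {}                   # host -> time of last request

def polite_fetch(url):
    host = urlparse(url).netloc
    rp = robots_cache.get(host)
    if rp is None:
        rp = RobotFileParser("http://%s/robots.txt" % host)
        rp.read()
        robots_cache[host] = rp
    if not rp.can_fetch(USER_AGENT, url):
        return None                                # disallowed by robots.txt
    delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY
    wait = last_access.get(host, 0) + delay - time.time()
    if wait > 0:
        time.sleep(wait)                           # space out connections to this host
    last_access[host] = time.time()
    return urlopen(url).read()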

Page 12: A Brief Look at Web Crawlers

Parallelization policy: how to coordinate distributed web crawlers

Nutch: "A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages"
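One common coordination scheme (a generic technique, not necessarily what Nutch does internally) is to partition the URL space by host, so each crawler node owns a disjoint set of sites and per-host politeness stays local. A minimal sketch:

# Partition the URL space among crawler nodes by hashing the host name.
# Each host maps to exactly one node, so per-host politeness stays local to that node.
import hashlib
from urllib.parse import urlparse

NUM_NODES = 4                      # assumed cluster size

def node_for(url):
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

def route(urls, my_id):
    # Keep the URLs this node owns; everything else is forwarded to its owner.
    mine, forward = [], []
    for url in urls:
        (mine if node_for(url) == my_id else forward).append(url)
    return mine, forward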

Page 13: A Brief Look at Web Crawlers

Crawling the deep web

Many web spiders run by popular search engines ignore URLs with a query string

Google’s Sitemap protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling

Also: mod-oai is an Apache module that allows web crawlers to efficiently discover new, modified, and deleted web resources from a web server by using OAI-PMH, a protocol which is widely used in the digital libraries community
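To illustrate the Sitemap side of this, a hedged sketch that reads a site's sitemap.xml and lists the URLs it advertises; it assumes the sitemap lives at /sitemap.xml (real sites may instead point to it from robots.txt).

# Read a Sitemap (sitemaps.org protocol) and collect the advertised URLs.
# Assumes the sitemap is at /sitemap.xml; the example site is hypothetical.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(site):
    with urlopen(site.rstrip("/") + "/sitemap.xml") as response:
        tree = ET.parse(response)
    return [loc.text for loc in tree.iter(SITEMAP_NS + "loc")]

# Example (hypothetical site):
# for url in sitemap_urls("http://www.example.edu"):
#     print(url)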

Page 14: A Brief Look at Web Crawlers

Example Web Crawler Software

Wget

Heritrix

Nutch

others

Page 15: A Brief Look at Web Crawlers

Wget

Command-line tool, non-extensible

Config: recursive downloading

Config: spanning hosts

Breadth-first for HTTP, depth-first for FTP

Config: include/exclude filters

Updates outdated pages based on timestamps

Supports robots.txt protocol

Config: connection delay

Single-threaded

Page 16: A Brief Look at Web Crawlers

Heritrix

Heritrix is the Internet Archive's web crawler, specially designed for web archiving

License: LGPL

Written in Java

Page 17: A Brief Look at Web Crawlers
Page 18: A Brief Look at Web Crawlers

Features

Highly modular; easily extensible

Scales to large data volumes

Implemented selection policies:

Breadth-first with options to throttle activity against particular hosts and to bias towards finishing hosts in progress or cycling among all hosts with pending URLs

Domain sensitive: allows specifying an upper-bound on the number of pages downloaded per site

Adaptive revisiting: repeatedly visit all encountered URLs (wait time between visits configurable)

Implements fixed / proportional connection delay

Detailed documentation

Web-based UI for crawler administration

Page 19: A Brief Look at Web Crawlers
Page 20: A Brief Look at Web Crawlers

Nutch

Nutch is an effort to build an open source search engine based on Lucene for the search and index component.

License: Apache 2.0

Written in Java

Page 21: A Brief Look at Web Crawlers

Features

Modular; extensible

Breadth-first

Includes parsing and indexing components

Implements a MapReduce facility and a distributed file system (Hadoop)

Page 22: A Brief Look at Web Crawlers

Recrawl command lines

# The generate/fetch/update cycle
for ((i = 1; i <= depth; i++))
do
  # Generate a fetch list as a new segment under $segments_dir
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  # The newly created segment is the most recent entry in $segments_dir
  segment=`ls -d $segments_dir/* | tail -1`
  # Fetch the listed pages and fold the results back into the web DB
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done

Page 23: A Brief Look at Web Crawlers

Appendix: Parsers

HTML: lynx -dump, Beautiful Soup (Python), tidylib (C)

PDF: xpdf

Others: Nutch plugins, Office API (Windows)
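As a small example of the Beautiful Soup option (using the modern bs4 package name, which postdates these slides), extracting links and text from ill-formed HTML:

# Extract links and visible text from possibly ill-formed HTML with Beautiful Soup.
# Requires the third-party package: pip install beautifulsoup4
from bs4 import BeautifulSoup

html = "<html><body><p>Hello<a href='/next'>next page</a></body>"   # note the unclosed tags
soup = BeautifulSoup(html, "html.parser")

links = [a.get("href") for a in soup.find_all("a")]    # ['/next']
text = soup.get_text(" ", strip=True)                  # 'Hello next page'
print(links, text)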