WEB CRAWLERS
Presented At: Indies Services
Contents
- What is a web crawler?
- How does it work?
- Why use it?
- Challenges faced
- Coding crawlers
- Possible uses for us
What are crawlers?
- A computer program, also known as ants, automatic indexers, bots, web spiders, web robots, or web scutters
- Searches the web for web pages and the links on those pages
- Covers any type of automated search or listing
- Crawlers identify themselves via the User-Agent header of the HTTP request (see the sketch below)
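A minimal sketch of such identification, assuming an invented crawler name `IndiesCrawler`: the crawler announces itself in the User-Agent header of every HTTP request.

```python
import urllib.request

# Hypothetical crawler name and contact URL -- real crawlers usually
# include a link so site owners can learn about (or block) the bot.
USER_AGENT = "IndiesCrawler/1.0 (+http://www.indies.co.in)"

request = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": USER_AGENT},  # identifies the crawler to the server
)
with urllib.request.urlopen(request, timeout=10) as response:
    html = response.read().decode("utf-8", errors="replace")
```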
How it works
Basic algorithm for a crawler:
1. Remove a URL from the unvisited URL list
2. Determine the IP address of its host name
3. Download the corresponding document
4. Extract any links contained in it
5. If a URL is new, add it to the list of unvisited URLs
6. Process the downloaded document
7. Go back to step 1
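As a sketch, the seven steps above map almost line-for-line onto a small Python crawler. The seed URL, the page limit, and the way links are extracted are illustrative assumptions, not part of the slides.

```python
import socket
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag (step 4)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def process(url, ip, html):
    # Placeholder for step 6 -- e.g. index or store the page.
    print(f"{url} ({ip}): {len(html)} bytes")

def crawl(seed, max_pages=10):
    unvisited = [seed]   # the unvisited URL list
    seen = {seed}        # every URL ever listed, so step 5 skips duplicates
    while unvisited and max_pages > 0:
        url = unvisited.pop(0)                                   # step 1
        try:
            ip = socket.gethostbyname(urlsplit(url).hostname)    # step 2
            with urllib.request.urlopen(url, timeout=10) as resp:  # step 3
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable host or failed download: move on
        extractor = LinkExtractor()
        extractor.feed(html)                                     # step 4
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)                               # step 5
                unvisited.append(absolute)
        process(url, ip, html)                                   # step 6
        max_pages -= 1                                           # back to step 1

crawl("http://www.indies.co.in/")
```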
The Process
[Flowchart: the crawling loop. Initialize the URL list with the starting URLs (seeds). While the list is not empty, pick a URL from the list, parse the page, and add each new URL found to the URL list; when no more URLs remain, the loop ends.]
Uses of crawlers
- Search engines: list out URLs and keep page information up to date; build and maintain the web graph
Uses of crawlers
- Automated maintenance tasks: checking for broken internal links (sketched below), validating HTML code
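A minimal sketch of the broken-link check: issue a lightweight HEAD request for each internal link and report any that fail. The example URL is invented.

```python
import urllib.request
from urllib.error import HTTPError, URLError

def check_links(urls):
    """Report links that respond with an error or not at all."""
    broken = []
    for url in urls:
        request = urllib.request.Request(url, method="HEAD")  # headers only
        try:
            urllib.request.urlopen(request, timeout=10)
        except HTTPError as err:      # server answered with 4xx/5xx
            broken.append((url, err.code))
        except URLError as err:       # DNS failure, refused connection, ...
            broken.append((url, err.reason))
    return broken

# Hypothetical internal links gathered by the crawler:
for url, why in check_links(["http://www.indies.co.in/no-such-page"]):
    print(f"broken: {url} ({why})")
```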
Uses of crawlers
- Linguistics: textual search (e.g., which words are common today)
- Market research: determining trends
- Harvesting certain types of information from the web: email addresses (spamming), images (specialized image searches), meta tag information (see the sketch below)
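As an illustration of harvesting specific information, the following sketch pulls email addresses (with a deliberately simplified regex) and meta tags out of a page's HTML; the sample HTML is invented.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # simplified on purpose

class MetaExtractor(HTMLParser):
    """Collects the name/content pairs of <meta> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs:
                self.meta[attrs["name"]] = attrs.get("content", "")

html = """<html><head>
<meta name="keywords" content="crawler, spider">
</head><body>Contact: info@example.com</body></html>"""

emails = set(EMAIL_RE.findall(html))
parser = MetaExtractor()
parser.feed(html)
print(emails)        # {'info@example.com'}
print(parser.meta)   # {'keywords': 'crawler, spider'}
```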
Challenges faced
- What pages should it download? The web is too large to fetch completely, so downloads must be prioritized.
- How to determine useful and unique links? URLs with GET parameters (internal links), URL normalization (sketched below).
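A sketch of one common normalization scheme (the exact rules vary by crawler): lower-case the scheme and host, drop the fragment, and sort GET parameters so that equivalent URLs compare equal.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url):
    """Return a canonical form so duplicate URLs are detected."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))  # stable GET params
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",      # treat an empty path as the root
        query,
        "",                     # discard the #fragment
    ))

# These all normalize to the same URL:
print(normalize("HTTP://Example.com/?b=2&a=1"))
print(normalize("http://example.com/?a=1&b=2#section"))
```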
Challenges …
Crawling policies:
- Selection policy: download the most relevant pages
- Re-visit policy: when to check for changes in a page
- Politeness policy: honor the robots exclusion protocol (robots.txt); a sketch follows this list
- Parallelization policy: coordinate crawlers so each new URL is listed only once
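The politeness policy can be honored with Python's standard robots.txt parser; a minimal sketch, again assuming the invented `IndiesCrawler` User-Agent:

```python
import urllib.robotparser

robots = urllib.robotparser.RobotFileParser()
robots.set_url("http://www.indies.co.in/robots.txt")
robots.read()   # downloads and parses the site's robots.txt

url = "http://www.indies.co.in/admin/"
if robots.can_fetch("IndiesCrawler", url):   # invented User-Agent
    print("allowed to crawl", url)
else:
    print("robots.txt forbids", url)
```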
Coding Crawlers
Common languages: PHP, Python, Perl, Java, or any other server-side scripting language.
Logic used (the crawler sketch after the basic algorithm above follows these steps):
- Get the URLs
- Search for unique URLs in the list
- Download the page, or get information from any particular page
- Process that information
Possible uses for us
- To maintain coding standards: check for proper code in a page.
- To get rid of unwanted or deprecated data: images or files that are no longer used.
- To provide customized search within any particular site (see the sketch below).
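As a sketch of the customized-search idea: once a crawler has downloaded a site's pages, an inverted index (word → pages containing it) supports simple site-specific queries. The page contents below are invented.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

# Hypothetical pages already fetched by the crawler:
pages = {
    "http://example.com/a": "web crawlers index pages",
    "http://example.com/b": "crawlers follow links",
}
index = build_index(pages)
print(index["crawlers"])   # both URLs contain the word
```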
Thanks
http://www.indies.co.in
http://www.indieswebs.com