Intelligent web crawling

INTELLIGENT WEB CRAWLING
WI-IAT 2013 Tutorial, Atlanta, USA, 20.11.2013, ver 1.7
Denis Shestakov, [email protected]
Department of Media Technology, Aalto University, Finland

Description

Intelligent web crawling
Denis Shestakov, Aalto University
Slides for the tutorial given at WI-IAT'13 in Atlanta, USA on November 20th, 2013
Outline:
- overview of web crawling;
- intelligent web crawling;
- open challenges

Transcript of Intelligent web crawling

Page 1: Intelligent web crawling

INTELLIGENT WEB CRAWLING
WI-IAT 2013 Tutorial

WI-IAT 2013 Tutorial, Atlanta, USA, 20.11.2013, ver 1.7

Denis Shestakov, [email protected]

Department of Media Technology, Aalto University, Finland

Page 2: Intelligent web crawling

Denis Shestakov, Intelligent Web Crawling

WI-IAT’13, Atlanta, USA, 20.11.2013, slide 1/87

Speaker’s Bio

- Postdoc in Web Services Group, Aalto University, Finland
- PhD dissertation on limited coverage of web crawlers
- Over ten years of experience in the area
- Two tutorials on web crawling given at SAC’12 and ICWE’13

Page 3: Intelligent web crawling


Speaker’s Bio

- http://www.linkedin.com/in/dshestakov
- http://www.mendeley.com/profiles/denis-shestakov/
- http://mediatech.aalto.fi/~denis/

Page 4: Intelligent web crawling


TUTORIAL OUTLINE

I. OVERVIEW
- Web crawling in a nutshell
- Web crawling applications
- Web size and web link structure

II. INTELLIGENT WEB CRAWLING
- Architecture of web crawler
- Crawling strategies
- Adaptive crawling approaches

III. OPEN CHALLENGES
- Crawlers in Web ecosystem
- Collaborative web crawling
- Deep Web crawling
- Crawling multimedia content

Page 5: Intelligent web crawling


Links to Tutorial

- Slides:
  - http://goo.gl/woVtQk
  - http://www.slideshare.net/denshe/presentations
- Similar tutorials:
  - Tutorials on web crawling at ICWE’13 and SAC’12
  - Compared with this tutorial, they give a better overview of the topic (parts I and III) but do not cover crawling strategies (part II)
- Supporting materials:
  - http://www.mendeley.com/groups/531771/web-crawling/

Page 6: Intelligent web crawling


PART I: OVERVIEW

Visualization of http://media.tkk.fi/webservices by aharef.info applet

Page 7: Intelligent web crawling


Outline of Part I

Overview of Web Crawling
- Web crawling in a nutshell
- Web crawling applications
- Web size and web link structure

Page 8: Intelligent web crawling


Web Crawling in a Nutshell

- Automatic harvesting of web content
- Done by web crawlers (also known as robots, bots or spiders)
- Follow a link from a set of links (URL queue), download a page, extract all links, eliminate already visited ones, add the rest to the queue
- Then repeat
- A set of policies is involved (like ’ignore links to images’, etc.)

Page 9: Intelligent web crawling


Web Crawling in a Nutshell

Example:

1. Follow http://media.tkk.fi/webservices (visualization of its HTML DOM tree below)
2. Extract URLs inside blue bubbles (designating <a> tags)
3. Remove already visited URLs
4. For each non-visited URL, start at Step 1

Page 10: Intelligent web crawling


Web Crawling in a Nutshell

- In essence: a simple and naive process
- However, a number of ’restrictions’ imposed make it much more complicated
- Most complexities are due to the operating environment (the Web)
- For example, do not overload web servers (challenging, as the distribution of web pages on web servers is non-uniform)
- Or avoiding web spam (not only useless but consumes resources and often spoils the collected content)

Page 11: Intelligent web crawling


Web Crawling in a Nutshell

Crawler Agents

- First in 1993: the Wanderer (written in Perl)
- Over 1100 different crawler signatures (User-Agent string in the HTTP request header) mentioned at http://www.crawltrack.net/crawlerlist.php
- Educated guess on the overall number of different crawlers: at least several thousand
- Write your own in a few dozen lines of code (using libraries for URL fetching and HTML parsing)
- Or use an existing agent: e.g., the wget tool (developed since 1996; http://www.gnu.org/software/wget/)
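The basic loop described above really does fit in a few dozen lines. A stdlib-only Python sketch (the tutorial's own example is in Perl; this is an equivalent, not the slide's code) is shown below. The `fetch` function is injected so the sketch stays independent of any particular HTTP library, and politeness policies (robots.txt, per-host delays) are deliberately omitted:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets of <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed_urls, fetch, max_pages=100):
    """Basic crawl loop: pop a URL, fetch it, extract links, drop
    already-seen URLs, enqueue the rest, repeat.  `fetch` is any
    callable url -> HTML string (e.g. a thin wrapper around
    urllib.request.urlopen); injecting it keeps the sketch
    testable without network access."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # error handling and politeness policies omitted
        visited.append(url)
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

With a real fetcher this would be, e.g., `crawl(["http://media.tkk.fi/webservices"], lambda u: urllib.request.urlopen(u).read().decode())`.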

Page 12: Intelligent web crawling


Web Crawling in a Nutshell

Crawler Agents

- For advanced things, you may modify the code of existing projects in your preferred programming language
- Crawlers play a big role on the Web
- They bring more traffic to certain web sites than human visitors
- They generate a sizeable portion of traffic to any (public) web site
- Crawler traffic is important for emerging web sites

Page 13: Intelligent web crawling


Web Crawling in a Nutshell

Classification

- General/universal crawlers
  - Not so many of them; lots of resources required
  - Big web search engines
- Topical/focused crawlers
  - Pages/sites on a certain topic
  - Crawling everything in one specific (e.g., national) web segment is rather general, though
- Batch crawling
  - One or several (static) snapshots
- Incremental/continuous crawling
  - Re-visiting
  - Resources divided between fetching newly discovered pages and re-downloading previously crawled pages
  - Search engines

Page 14: Intelligent web crawling


Applications of Web Crawling

Web Search Engines

- Google, Microsoft Bing, (Yahoo), Baidu, Naver, Yandex, Ask, ...
- One of three underlying technology stacks

Page 15: Intelligent web crawling


Applications of Web Crawling

Web Search Engines

- One of three underlying technology stacks
- BTW, what are the other two and which is the most ’crucial’?

Page 16: Intelligent web crawling


Applications of Web Crawling

Web Search Engines

- What are the other two and which is the most ’crucial’? The query processor (particularly, ranking)

Page 17: Intelligent web crawling


Applications of Web Crawling

Web Archiving

- Digital preservation
- “Librarian” look at the Web
- The biggest: Internet Archive
- Quite huge collections
- Batch crawls
- Primarily, collections of national web sites: web sites at country-specific TLDs or physically hosted in a country
- There are quite many and some are huge! See the list of Web Archiving Initiatives at Wikipedia

Page 18: Intelligent web crawling


Applications of Web Crawling

Vertical Search Engines

- Aggregating data from many sources on a certain topic
- E.g., apartment search, car search

Page 19: Intelligent web crawling


Applications of Web Crawling

Web Data Mining

- “To get data to be actually mined”
- Usually using focused crawlers
- For example, opinion mining
- Or digests of current happenings on the Web (e.g., what music people listen to now)

Page 20: Intelligent web crawling


Applications of Web Crawling

Web Monitoring

- Monitoring sites/pages for changes and updates

Page 21: Intelligent web crawling


Applications of Web Crawling

Detection of malicious web sites

- Typically a part of an anti-virus, firewall, search engine, etc. service
- Building a list of such web sites and informing users about the potential threat of visiting them

Page 22: Intelligent web crawling


Applications of Web Crawling

Web site/application testing

- Crawl a web site to check navigation through it, the validity of the links, etc.
- Regression/security/... testing of a rich internet application (RIA) via crawling
- Checking different application states by simulating possible user interaction events (e.g., mouse click, time-out)

Page 23: Intelligent web crawling


Applications of Web Crawling

Copyright violation detection

- Crawl to find (media) items under copyright, or links to them
- Regular re-visiting of ’suspicious’ web sites, forums, etc.
- Tasks like finding terrorist chat rooms also go here

Page 24: Intelligent web crawling


Applications of Web Crawling

Web Scraping

- Extracting particular pieces of information from a group of typically similar pages
- Used when an API to the data is not available
- Interestingly, scraping might be preferable even with an API available, as scraped data is often cleaner and more up-to-date than data-via-API

Page 25: Intelligent web crawling


Applications of Web Crawling

Web Mirroring

- Copying of web sites
- Hosting copies on different servers to ensure 24x7 accessibility

Page 26: Intelligent web crawling


Industry vs. Academia Divide

In the web crawling domain

- Huge lag between industrial and academic web crawlers
- Both research-wise and development-wise
- Algorithms, techniques and strategies used in industrial crawlers (namely, those operated by search engines) are poorly known
- Industrial crawlers operate on a web scale
- That is, tens of billions of pages
- Only a few academic crawlers have dealt with more than one billion pages
- Academic scale is rather hundreds of millions

Page 27: Intelligent web crawling


Industry vs. Academia

- Re-crawling
  - Batch crawls in academia
  - Regular re-crawls by industrial crawlers
- Evaluation of crawled data
  - Crucial for corrections/improvements to crawlers
  - Direct evaluation by users of search engines
  - To some extent, artificial evaluation of academic crawls

Page 28: Intelligent web crawling


Web Size and Structure

Some numbers
- The number of pages per host is not uniform: most hosts contain only a few pages, others contain millions
- Roughly 100 links on a page
- According to Google statistics (over 4 billion pages, 2010): fetching a page takes 320KB (textual content plus all embeddings)
- A page has 10-100KB of textual (HTML) content on average
- One trillion URLs known to Google/Yahoo in 2008

Page 29: Intelligent web crawling


Web Size and Structure

Some numbers
- 20 million web pages in 1995 (indexed by AltaVista)
- One trillion (10^12) URLs known to Google/Yahoo in 2008
  - An ’independent’ search engine called Majestic12 (P2P crawling) confirms one trillion items
- This doesn’t mean one trillion indexed pages
- Supposedly, the index has dozens of times fewer pages
- Cool crawler facts: the IRLbot crawler (running on one server) downloaded 6.4 billion pages over 2 months
  - Throughput: 1000-1500 pages per second
  - Over 30 billion discovered URLs

Page 30: Intelligent web crawling


Web Size and Structure

Bow-tie model of the Web

Illustration taken from http://dx.doi.org/doi:10.1038/35012155

Page 31: Intelligent web crawling


PART II: INTELLIGENT WEB CRAWLING

Page 32: Intelligent web crawling


Outline of Part II

Intelligent Web Crawling
- Architecture of web crawler
- Crawling strategies
- Adaptive crawling approaches

Page 33: Intelligent web crawling


Architecture of Web Crawler

Crawler crawls the Web

Diagram: seed URLs initialize the URL frontier; the crawler moves URLs from the frontier to the set of crawled URLs while discovering the uncrawled Web

Page 34: Intelligent web crawling


Architecture of Web Crawler

Typically in a distributed fashion

Diagram: as above, with multiple crawling threads consuming the URL frontier in parallel

Page 35: Intelligent web crawling


Architecture of Web Crawler

URL Frontier
- Includes multiple pages from the same host
- Must avoid trying to fetch them all at the same time
- Must try to keep all crawling threads busy
- Prioritization also helps

Page 36: Intelligent web crawling


Architecture of Web Crawler

Crawler Architecture

Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.

Page 37: Intelligent web crawling


Architecture of Web Crawler

Content seen?

- If the fetched page is already in the base/index, don’t process it
- Document fingerprints (shingles)

Filtering

- Filter out URLs due to ’politeness’ and restrictions on the crawl
- Fetched robots.txt files are cached to avoid fetching them repeatedly

Duplicate URL Elimination

- Check if an extracted+filtered URL has already been passed to the frontier (batch crawling)
- More complicated in continuous crawling (different URL frontier implementation)

Page 38: Intelligent web crawling


Architecture of Web Crawler

Distributed Crawling

- Run multiple crawl threads, under different processes (often at different nodes)
- Nodes can be geographically distributed
- Partition the hosts being crawled across nodes

Page 39: Intelligent web crawling


Architecture of Web Crawler

Host Splitter

Illustration taken from Introduction to Information Retrieval (Cambridge University Press, 2008) by Manning et al.

Page 40: Intelligent web crawling


Architecture of Web Crawler

Implementation (in Perl)

Other popular languages: Java, Python, C/C++

Page 41: Intelligent web crawling


Architecture of Web Crawler

Crawling objectives

- High web coverage
- High page freshness
- High content quality
- High download rate

Internal and External factors

- Amount of hardware (I)
- Network bandwidth (I)
- Rate of web growth (E)
- Rate of web change (E)
- Amount of malicious content (e.g., spam, duplicates) (E)

Page 42: Intelligent web crawling


Crawling Strategies

Download prioritization

- Given a period, only a subset of web pages can be downloaded
- “Important” pages first
- Hence, a need for prioritization
- Ordering the queue of URLs to be visited

Strategies (ordering metrics)

- Breadth-First, Depth-First
- Backlink count
- Best-First
- PageRank
- Shark-Search

Page 43: Intelligent web crawling


Crawling Strategies

Breadth-First, Depth-First

- Breadth-First search
  - Implemented with a QUEUE (FIFO)
  - Pages with shortest paths first
- Depth-First search
  - Implemented with a STACK (LIFO)

Page 44: Intelligent web crawling


Crawling Strategies

Pseudocode for Breadth-First
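The pseudocode on this slide did not survive the transcript; a minimal Python equivalent is given below. It also makes the point of the previous slide concrete: the only difference between Breadth-First and Depth-First is the frontier discipline (FIFO vs LIFO).

```python
from collections import deque

def traversal_order(graph, seed, depth_first=False):
    """Visit order over a link graph (dict: page -> list of outlinks).
    A FIFO frontier (popleft) gives Breadth-First: shortest paths first.
    A LIFO frontier (pop) gives Depth-First."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        url = frontier.pop() if depth_first else frontier.popleft()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```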

Page 45: Intelligent web crawling


Crawling Strategies

Backlink count
- Uses the link graph information
- Count the number of crawled pages that point to a page
- Links with the highest counts first

Page 46: Intelligent web crawling


Crawling Strategies

Best-First
- The best link is selected based on some criterion
- E.g., lexical similarity between the topic’s keywords and the link’s source page
- A similarity score sim(topic, p) is assigned to the outgoing links of page p
- Cosine similarity is often used

where q is a topic, p is a crawled page, and f_kq, f_kp are the frequencies of term k in q and p
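The formula image itself is not in the transcript; the standard cosine similarity over term frequencies, matching the notation above, is sim(q,p) = Σ_k f_kq·f_kp / (√(Σ_k f_kq²)·√(Σ_k f_kp²)), which can be computed as:

```python
import math

def cosine_sim(q_freqs, p_freqs):
    """Cosine similarity between a topic q and a page p, both given
    as term -> frequency dicts (f_kq and f_kp in the slide's notation)."""
    dot = sum(f * p_freqs.get(k, 0) for k, f in q_freqs.items())
    norm_q = math.sqrt(sum(f * f for f in q_freqs.values()))
    norm_p = math.sqrt(sum(f * f for f in p_freqs.values()))
    if norm_q == 0 or norm_p == 0:
        return 0.0  # no shared basis: treat as completely dissimilar
    return dot / (norm_q * norm_p)
```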

Page 47: Intelligent web crawling


Crawling Strategies

Pseudocode for Best-First
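This pseudocode slide is also missing from the transcript. The core idea is a frontier ordered by link score; in the sketch below, `scores` is a precomputed lookup standing in for a real scoring function (such as the cosine similarity just described), which is a simplification for illustration:

```python
import heapq

def best_first_order(graph, scores, seed, max_pages=10):
    """Best-First crawl sketch: the frontier is a priority queue keyed
    by link score.  heapq is a min-heap, so scores are negated to pop
    the highest-scoring link first."""
    frontier = [(-scores.get(seed, 0.0), seed)]
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-scores.get(link, 0.0), link))
    return order
```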

Page 48: Intelligent web crawling


Crawling Strategies

PageRank

- The PageRank of a page is the probability for a random surfer (who follows links randomly) to be on this page at any given time
- A page’s score (rank) is defined by the scores of pages with links to this page

where p is a page, in(p) is the set of pages with links to p, out(d) is the set of links out of d, and γ is the damping factor

- The PageRank of pages is periodically recalculated using a data structure with crawled pages

Page 49: Intelligent web crawling


Crawling Strategies

Pseudocode for PageRank
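The pseudocode is not in the transcript; a compact power-iteration sketch over an in-memory link graph is shown below (real crawlers compute this offline over the crawled graph; γ = 0.85 is a conventional damping value, an assumption here):

```python
def pagerank(graph, gamma=0.85, iterations=50):
    """Iterative PageRank over a link graph (dict: page -> outlinks;
    all pages appear as keys).  Each page redistributes its rank along
    its outlinks, damped by gamma; dangling pages (no outlinks) spread
    their rank uniformly over all pages."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # teleport term: (1 - gamma) / N for every page
        new = {p: (1.0 - gamma) / n for p in pages}
        for p in pages:
            out = graph[p]
            if out:
                share = gamma * rank[p] / len(out)
                for q in out:
                    new[q] += share
            else:
                share = gamma * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank
```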

Page 50: Intelligent web crawling


Crawling Strategies

Shark-Search
- More emphasis on web segments where relevant pages were found
- Penalizing segments yielding few relevant pages
- A link’s score is defined by the link’s anchor text, the text surrounding the link (link context) and the inherited score from ancestor pages (pages pointing to the page with this link)
- Parameters:
  - d: depth bound
  - r: relative importance of inherited score versus link neighbourhood score
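The scoring formula is not shown in the transcript; in the original Shark-Search formulation (Hersovici et al.), a candidate link's potential score blends the two components roughly as follows, with r playing the role named in the parameter list above:

```latex
score(link) = r \cdot inherited(link) + (1 - r) \cdot neighbourhood(link)
```

where neighbourhood(link) itself combines the anchor-text score and the link-context score.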

Page 51: Intelligent web crawling


Crawling Strategies

Pseudocode for Shark-Search

Page 52: Intelligent web crawling


Adaptive Crawling

Static vs. adaptive strategies

- The strategies presented up to this point are static
- They do not adjust in the course of the crawl

Adaptive (intelligent) crawling

- InfoSpiders
- Ant-based crawling

Page 53: Intelligent web crawling


Adaptive Crawling

InfoSpiders

- Independent agents crawling in parallel

Diagram: agent representation, in which an HTML parser, noise word remover and stemmer produce a compact document representation; a keyword vector and neural net (term) weights drive document relevance assessment, link assessment and selection, learning, and reproduction or death

Page 54: Intelligent web crawling


Adaptive Crawling

InfoSpiders

- Independent agents crawling in parallel
- Each agent uses a list of keywords (initialized with the topic keywords)
- A neural network evaluates new links
  - Keywords in the vicinity of a link are used as input
  - More importance (weight) is given to keywords close to the link
  - Maximum weight to words in the anchor text
  - Output is a numerical quality estimate for the link
- The link score is combined with a cosine similarity score (between the agent’s keywords and the page with this link)

Page 55: Intelligent web crawling


Adaptive Crawling

InfoSpiders

- Each agent has an energy level
- An agent moves from the current page to a new page if the Boltzmann function returns true

where δ is the difference between the similarity of the new and the current page to the agent’s keywords

- If the energy level passes some threshold, the agent reproduces
- The offspring gets half of the parent’s frontier
- The offspring’s keywords are mutated (expanded) with the most frequent terms in the parent’s current document
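The function itself is not in the transcript; a common Boltzmann (logistic) acceptance rule, assuming a temperature parameter T that controls how greedy the agent is, would accept the move with probability:

```latex
\Pr(\text{move}) = \frac{1}{1 + e^{-\delta / T}}
```

so moves toward pages more similar to the agent's keywords (δ > 0) are accepted with probability above 1/2, and occasionally worse moves are accepted too.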

Page 56: Intelligent web crawling


Adaptive Crawling

Pseudocode for InfoSpiders

Page 57: Intelligent web crawling


Adaptive Crawling

Pseudocode for InfoSpiders (cont.)

Page 58: Intelligent web crawling


Adaptive Crawling

Ant-based crawling

- Motivation: allow crawling agents to communicate with each other
- Follows a model of social insect collective behaviour
- Ants leave pheromone along the followed path
- Other ants follow such pheromone trails
- A crawler agent follows some path by visiting many URLs
- At some moment, a certain amount of pheromone (weight) can be assigned to the sequence of URLs on the followed path
- The amount can depend on the similarity of the visited pages to a given topic

Page 59: Intelligent web crawling


Adaptive Crawling

Ant-based crawling

- Ants (crawlers) operate in cycles
- During each cycle, agents make a predefined number of moves (visits of pages)
- #moves = constant ∗ #cycle
- At the end of each cycle, pheromone intensity values are updated for the followed path
- Agent-ants return to their starting positions

Page 60: Intelligent web crawling


Adaptive Crawling

Ant-based crawling

- The next link is selected based on a probability defined by the corresponding pheromone intensity
- If there is no pheromone information, an agent-ant moves randomly

Page 61: Intelligent web crawling


Adaptive Crawling

Ant-based crawling

- Probability of selecting a link

where t is the cycle number, τij(t) is the pheromone value between pi and pj, and (i, l) designates the presence of a link from pi to pl

- During a cycle, each ant stores the list of visited URLs
- If pj was already visited, Pij(t) = 0
- At the end of the cycle, the list of visited URLs is emptied
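The probability formula is missing from the transcript; a pheromone-proportional rule consistent with the notation above, normalizing over the outlinks of pi, is:

```latex
P_{ij}(t) = \frac{\tau_{ij}(t)}{\sum_{(i,l)} \tau_{il}(t)}
```

with the sum running over all pages pl linked from pi, and with Pij(t) forced to 0 for already-visited pj as stated above.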

Page 62: Intelligent web crawling


Adaptive Crawling

Implications

- Strategies evaluating links based on their context (text close by) are not directly applicable to large-scale crawling
- E.g., consider crawling 10^9 pages within one month
- Crawl rate: around 400 documents per second
- Around 40000 links per second
- Every second, 10000-30000 “new” links to be evaluated (scored) and added to the frontier
- Too many even for evaluating only the link’s anchor text
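These rates follow from simple arithmetic, assuming a 30-day month and the earlier figure of roughly 100 links per page:

```python
pages = 10**9
seconds = 30 * 24 * 3600                   # seconds in a 30-day month: 2,592,000
pages_per_second = pages / seconds         # ~386, i.e. "around 400 documents per second"
links_per_second = 100 * pages_per_second  # ~38,600, i.e. "around 40000 links per second"
```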

Page 63: Intelligent web crawling


PART III: OPEN CHALLENGES

Page 64: Intelligent web crawling


Outline of Part III

Open Challenges
- Crawlers in Web ecosystem
- Collaborative web crawling
- Deep Web crawling
- Crawling multimedia content

Page 65: Intelligent web crawling


Crawlers in Web ecosystem

Push vs. Pull model

- Web pages are accessed via a pull model (HTTP is a pull protocol)
- That is, a client requests a page from a server
- With push, a server would send a page/info to a client

Why Pull?

- Pull is just easier for both parties
- No ’agreement’ needed between provider and aggregator
- No specific protocols for content providers: serving content is enough
- Perhaps the pull model is the reason why the Web succeeded while earlier hypertext systems failed

Page 66: Intelligent web crawling


Crawlers in Web ecosystem

Why not Push?

- Still, the pull model has several disadvantages
- What are these?

Page 67: Intelligent web crawling


Crawlers in Web ecosystem

Why not Push?

- Still, the pull model has several disadvantages
- Publishing/updating content is easier with push: no need for redundant requests from crawlers
- Better control over the content by providers: no need for crawler politeness

Page 68: Intelligent web crawling


Crawlers in Web ecosystem

Crawler politeness

- Content providers possess some control over crawlers
- Via special protocols to define access to parts of a site
- Via direct banning of agents hitting a site too often

Page 69: Intelligent web crawling


Crawlers in Web ecosystem

Crawler politeness

- Robots.txt says what can(not) be crawled
- Sitemaps is a newer protocol that lets sites tell crawlers about their URLs and related metadata
- Below: no agent should visit any URL starting with “yoursite/notcrawldir”, except an agent called “goodsearcher”

Example:

User-agent: *
Disallow: yoursite/notcrawldir

User-agent: goodsearcher
Disallow:

Page 70: Intelligent web crawling


Collaborative Crawling

Main considerations
- Lots of redundant crawling
- To get data (often on a specific topic) one needs to crawl broadly
  - Often a lack of expertise when a large crawl is required
  - Often, crawl a lot, use only a small subset
- Too many redundant requests for content providers
- Idea: have one crawler do a very broad and intensive crawl, with many parties accessing the crawled data via an API
  - Specify filters to select the required pages
- Crawler as a common service

Page 71: Intelligent web crawling


Collaborative Crawling

Some requirements

- A filter language for specifying conditions
- Efficient filter processing (millions of filters to process)
- Efficient fetching (hundreds of pages per second)
- Support for real-time requests

Page 72: Intelligent web crawling


Collaborative Crawling

New component

- Process a stream of documents against a filter index
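A minimal sketch of such a component (the names and the term-set filter semantics are invented for illustration; the real system supports a richer filter language): an inverted index from terms to filter ids lets each incoming document touch only the filters that share at least one of its terms, instead of checking all of them.

```python
from collections import defaultdict

def build_filter_index(filters):
    """Invert filters (filter id -> set of required terms) into a
    term -> set-of-filter-ids index."""
    index = defaultdict(set)
    for fid, terms in filters.items():
        for term in terms:
            index[term].add(fid)
    return index

def matching_filters(index, filters, doc_terms):
    """Ids of filters whose required terms all occur in the document."""
    candidates = set()
    for term in doc_terms:
        candidates |= index.get(term, set())
    # verify only the candidate filters, not the whole filter set
    return {fid for fid in candidates if filters[fid] <= doc_terms}
```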

Page 73: Intelligent web crawling


Collaborative Crawling

Filter processing architecture

Page 74: Intelligent web crawling


Collaborative Crawling

Filter processing architecture

Page 75: Intelligent web crawling


Collaborative Crawling

- Based on ’The architecture and implementation of an extensible web crawler’ by Hsieh, Gribble, Levy, 2010 (illustrations on slides 61-62 from Hsieh’s slides)
- E.g., 80legs provides similar crawling services
- In a way, this reconsiders the pull/push model of content delivery on the Web

Page 76: Intelligent web crawling


Deep Web Crawling

Visualization of http://amazon.com by aharef.info applet

Page 77: Intelligent web crawling


Deep Web Crawling

In a nutshell

- The problem is in the yellow nodes (designating web form elements)

Page 78: Intelligent web crawling


Deep Web Crawling

See slides on deep Web crawling at http://goo.gl/zwMqU5

Page 79: Intelligent web crawling


Crawling Multimedia Content

- The Web is now a multimedia platform
- Images, video and audio are an integral part of web pages (not just supplementing them)
- Almost all crawlers, however, treat the Web as a textual repository
- One reason: indexing techniques for multimedia haven’t yet reached the maturity required by interesting use cases/applications
- Hence, no real need to harvest multimedia
- But state-of-the-art multimedia retrieval/computer vision techniques already provide adequate search quality
- E.g., searching for images with a cat and a man based on actual image content (not the text around/close to the image)
- In the case of video: a set of frames plus audio (which can be converted to textual form)

Page 80: Intelligent web crawling


Crawling Multimedia Content

Challenges in crawling multimedia

- Bigger load on web sites since the files are bigger
- More apparent copyright issues
- More resources (e.g., bandwidth, storage space) required from a crawler
- More complicated duplicate resolution
- Re-visiting policy

Page 81: Intelligent web crawling


Crawling Multimedia Content

- Scalable Multimedia Web Observatory of the ARCOMEM project (http://www.arcomem.eu)
- Focus on web archiving issues
- Uses several crawlers
  - ’Standard’ crawler for regular web pages
  - API crawler to mine social media sources (e.g., Twitter, Facebook, YouTube, etc.)
  - Deep Web crawler able to extract information from pre-defined web sites
- Data can be exported in WARC (Web ARChive) files and in RDF

Page 82: Intelligent web crawling


Future Directions

- Collaborative crawling, mixed pull-push model
- Scalable adaptive strategies
- Understanding site structure
- Deep Web crawling
- Semantic Web crawling
- Media content crawling
- Social network crawling

Page 83: Intelligent web crawling


References: Crawl Datasets

Use for building your crawls, web graph analysis, web data mining tasks, etc.

ClueWeb09 Dataset:
- http://lemurproject.org/clueweb09.php/
- One billion web pages, in ten languages
- 5TB compressed
- Hosted at several cloud services (free license required), or a copy can be ordered on hard disks (pay for the disks)

ClueWeb12:
- Almost 900 million English web pages

Page 84: Intelligent web crawling


References: Crawl Datasets

Common Crawl Corpus:
- See http://commoncrawl.org/data/accessing-the-data/ and http://aws.amazon.com/datasets/41740
- Around six billion web pages
- Over 100TB uncompressed
- Available as an Amazon Web Services public dataset (pay for processing)

Page 85: Intelligent web crawling


References: Crawl Datasets

Internet Archive:
- See http://blog.archive.org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-for-research/
- Crawl of 2011
- 80TB of WARC files
- 2.7 billion pages
- Includes multimedia data
- Available by request

Page 86: Intelligent web crawling


References: Crawl Datasets

LAW Datasets:
- http://law.dsi.unimi.it/datasets.php
- A variety of web graph datasets (nodes, arcs, etc.) including basic properties of recent Facebook graphs (!)
- Thoroughly studied in a number of publications

ICWSM 2011 Spinn3r Dataset:
- http://www.icwsm.org/data/
- 130 million blog posts and 230 million social media publications
- 2TB compressed

Academic Web Link Database Project:
- http://cybermetrics.wlv.ac.uk/database/
- Crawls of national universities’ web sites

Page 87: Intelligent web crawling


References: Literature

- For beginners: Udacity CS101 course; http://www.udacity.com/overview/Course/cs101
- Intermediate: Chapter 20 of the Introduction to Information Retrieval book by Manning, Raghavan, Schütze; http://nlp.stanford.edu/IR-book/pdf/20crawl.pdf
- Intermediate: the Current Challenges in Web Crawling tutorial at ICWE 2013 by Shestakov; http://www.slideshare.net/denshe/icwe13-tutorial-webcrawling
- Advanced: Web Crawling by Olston and Najork; http://www.nowpublishers.com/product.aspx?product=INR&doi=1500000017

Page 88: Intelligent web crawling


References: Literature

- See relevant publications at Mendeley: http://www.mendeley.com/groups/531771/web-crawling/
- Feel free to join the group!
- Check the ’Deep Web’ group too: http://www.mendeley.com/groups/601801/deep-web/