Augmenting Focused Crawling using Search Engine Queries Wang Xuan 10th Nov 2006.

Augmenting Focused Crawling using Search Engine Queries

Wang Xuan

10th Nov 2006

What is focused crawling

Crawling vs. Focused crawling

Seed Page

Target page

Crawling methods

Web search algorithm:
– Breadth-first (used in standard crawling)
– Best-first (used in focused crawling)
– Both are local-search strategies
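To make the difference between the two strategies concrete, here is a minimal sketch that crawls the same toy link graph with a FIFO frontier (breadth-first) and with a priority-queue frontier (best-first). The graph `LINKS`, the scores in `SCORE`, and both function names are made up for illustration. A real focused crawler would have to estimate a link's score before fetching it, which this sketch skips by assuming the scores are known.

```python
import heapq
from collections import deque

# Toy link graph and relevance scores; both are made-up illustration data.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}
SCORE = {"seed": 0.5, "a": 0.1, "b": 0.9, "c": 0.2, "d": 0.8}


def breadth_first(seed):
    """Visit pages in the order they were discovered (FIFO frontier)."""
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


def best_first(seed):
    """Always visit the highest-scoring page in the frontier next."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    frontier, seen, order = [(-SCORE[seed], seed)], {seed}, []
    while frontier:
        _, page = heapq.heappop(frontier)
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-SCORE[link], link))
    return order


print(breadth_first("seed"))  # ['seed', 'a', 'b', 'c', 'd']
print(best_first("seed"))     # ['seed', 'b', 'd', 'a', 'c']
```

Both variants only ever choose among pages already in the frontier, which is why they count as local-search strategies.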

Web analysis algorithm:
– Content-based web analysis: page text, title, URL, page layout
– Link-based web analysis: hard to analyze a page while the search graph is not yet completely known

Related works

Naïve Bayes Crawler: the relevance score is the cosine similarity between page and topic

IBM focused crawler: introduces a distiller to find topic hubs

CORA crawler: assigns a Q-value according to the number of target pages in the neighborhood

Context focused crawler: introduces a link hierarchy

Automatic Publication Data Gatherer: classifies the web page without the page

PaSE: locates publications using a search engine
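The cosine-similarity relevance score mentioned in this list can be sketched as follows. The slides do not say how pages and topics are turned into vectors, so this sketch assumes whitespace tokenization and raw term counts; the function name `cosine_similarity` is chosen here for illustration.

```python
import math
from collections import Counter


def cosine_similarity(page_text, topic_text):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    page = Counter(page_text.lower().split())
    topic = Counter(topic_text.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(page[t] * topic[t] for t in page.keys() & topic.keys())
    norm = math.sqrt(sum(v * v for v in page.values())) * math.sqrt(
        sum(v * v for v in topic.values())
    )
    # An empty text has a zero vector; treat its similarity as 0.
    return dot / norm if norm else 0.0
```

A page sharing no terms with the topic scores 0, and a page with the same term distribution as the topic scores 1 (up to floating-point rounding).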

General framework

Components: repository, page fetch unit, URL filter, URL extractor, frontier, classifier, feature extractor

Highly dependent on the seed pages

Term Extraction module
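One way the framework components could be wired together is sketched below. Everything here is an assumption for illustration: the in-memory `WEB` dictionary stands in for real page fetching, the classifier is a toy keyword rule, and the slides do not show how the actual modules are implemented or connected.

```python
import re

# Hypothetical in-memory "web": URL -> page text. Real fetching is omitted.
WEB = {
    "http://x/seed": "home page http://x/pubs http://x/news",
    "http://x/pubs": "publication list paper http://x/seed",
    "http://x/news": "news and events",
}


def fetch(url):
    """Page fetch unit: return the page text, or an empty string."""
    return WEB.get(url, "")


def extract_urls(text):
    """URL extractor: find every http:// link in the page text."""
    return re.findall(r"http://\S+", text)


def url_filter(urls, seen):
    """URL filter: keep each URL once, and only if it was never seen."""
    fresh = []
    for url in urls:
        if url not in seen and url not in fresh:
            fresh.append(url)
    return fresh


def extract_features(text):
    """Feature extractor: bag of lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify(features):
    """Classifier: toy keyword rule standing in for a trained model."""
    return bool(features & {"publication", "paper"})


def crawl(seeds):
    """Run the loop until the frontier is empty; return URL -> is_target."""
    frontier, seen, repository = list(seeds), set(seeds), {}
    while frontier:
        url = frontier.pop(0)
        text = fetch(url)
        repository[url] = classify(extract_features(text))
        new_urls = url_filter(extract_urls(text), seen)
        seen.update(new_urls)
        frontier.extend(new_urls)
    return repository
```

Starting from `http://x/seed` the crawl reaches all three pages, while starting from `http://x/news` it never finds the publication page, which illustrates the dependence on seed pages.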

Baseline system

Three stages of the crawling

[Figure: plot with one axis running from 0 to 140 and the other from 0 to 1400; the stages are marked A, B and C]

Framework for upgraded system

Components: repository, Page Fetch Unit, URL Filter, URL Extractor, Feature Extractor, Classifier, Frontier, Term Extraction Module, Search Engine

Diagram labels: target URL; no target OR targets keep decreasing; Search Engine; More Pages

Term Extraction
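The trigger shown in the diagram ("no target OR targets keep decreasing") and the term extraction step could look roughly like the sketch below. The slides do not define how "keep decreasing" is measured or how terms are selected, so the window of three batches, the frequency-based term choice, the stop-word list, and all function names are assumptions. Sending the query to a search engine is left out on purpose.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "of", "and", "a", "in", "for", "to"}


def should_query(targets_per_batch, window=3):
    """Return True if the latest batch found no target, or if the counts
    strictly decreased across the last `window` batches."""
    if not targets_per_batch:
        return False
    if targets_per_batch[-1] == 0:
        return True
    recent = targets_per_batch[-window:]
    return len(recent) == window and all(
        a > b for a, b in zip(recent, recent[1:])
    )


def extract_terms(target_texts, k=3):
    """Pick the k most frequent non-stop-word terms from known target pages."""
    counts = Counter(
        word
        for text in target_texts
        for word in text.lower().split()
        if word not in STOP_WORDS
    )
    return [term for term, _ in counts.most_common(k)]


def build_query(target_texts, k=3):
    """Join the extracted terms into a search engine query string."""
    return " ".join(extract_terms(target_texts, k))
```

For example, `should_query([5, 4, 3])` and `should_query([2, 0])` both return `True`, while `should_query([3, 4, 5])` returns `False`.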

[Figure: baseline vs. improved; one axis running from 0 to 140 and the other from 0 to 1400; the stages are marked A, B and C]

                          Baseline system   Upgraded system
Publication pages found   45                117
Precision                 3.21%             8.36%
Recall                    26.63%            69.23%
F1                        0.057             0.149
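As a check on the reported numbers, the F1 values can be recomputed from precision and recall, assuming F1 here is the usual harmonic mean of the two. The function name `f1` is chosen for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


baseline = f1(0.0321, 0.2663)
upgraded = f1(0.0836, 0.6923)
print(round(baseline, 3), round(upgraded, 3))  # 0.057 0.149
```

Both results match the table. The recall figures are also consistent with roughly 169 publication pages in total (45 / 0.2663 and 117 / 0.6923 both come to about 169), although the slides do not state that number.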
