Augmenting Focused Crawling using Search Engine Queries
Wang Xuan, 10th Nov 2006
Crawling methods
Web search algorithm:
– Breadth-first (used in standard crawling)
– Best-first (used in focused crawling)
– Both are local-search strategies
Web analysis algorithm:
– Content-based web analysis: page text, title, URL, page layout
– Link-based web analysis: hard to analyze a page while the search graph is not yet completely known
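The difference between the two search strategies above can be sketched on a toy link graph. This is only an illustration: the graph `LINKS` and the relevance scores in `SCORE` are hypothetical, and the scores are assumed to be known per URL, whereas a real focused crawler would have to estimate them before fetching a page.

```python
import heapq
from collections import deque

# Hypothetical toy web graph: page -> outgoing links.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}
# Hypothetical relevance scores a classifier might assign (assumption).
SCORE = {"seed": 0.5, "a": 0.2, "b": 0.9, "c": 0.1, "d": 0.8}


def breadth_first(seed):
    """Visit pages in FIFO order, ignoring relevance."""
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


def best_first(seed):
    """Always visit the highest-scoring page in the frontier next."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    frontier, seen, order = [(-SCORE[seed], seed)], {seed}, []
    while frontier:
        _, page = heapq.heappop(frontier)
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-SCORE[link], link))
    return order


print(breadth_first("seed"))
print(best_first("seed"))
```

Both functions only ever expand links reachable from pages already visited, which is the sense in which they are local-search strategies.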
Related works
Naïve Bayes Crawler: the relevance score is the cosine similarity between page and topic
IBM focused crawler: introduces a distiller to find topic hubs
CORA crawler: assigns a Q-value according to the number of target pages in the neighborhood
Context focused crawler: introduces a link hierarchy
Automatic Publication Data Gatherer: classifies the web page without the page
PaSE: locates publications using a search engine
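A relevance score based on cosine similarity between a page and a topic, as mentioned in the first item above, might look like the following minimal sketch. The whitespace tokenizer and the use of raw term frequencies are assumptions made for illustration; the slides do not say how the term vectors are weighted.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term frequencies (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page or topic has no defined direction
    return dot / (norm_a * norm_b)


# Hypothetical topic description and page text.
topic = term_vector("focused crawling search engine")
page = term_vector("a focused crawler that uses a search engine")
print(cosine_similarity(page, topic))
```

The score lies between 0 (no shared terms) and 1 (identical term distributions), so it can be used directly as a frontier priority.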
General framework
repository
Page fetch unit, URL filter, URL extractor
Frontier, Classifier, Feature extractor
Highly dependent on the seed pages
Term Extraction module
Three stages of the crawling
[Figure: plot of the crawl with stages A, B and C marked; x-axis 0 to 1400, y-axis 0 to 140]
Framework for upgraded system
repository
Page fetch unit, URL filter
Frontier, Classifier, Feature extractor
URL extractor
Term extraction module, target URL
no target OR targets keep decreasing
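One way to read the condition "no target OR targets keep decreasing" is as a check on how many target pages each recent crawl window found, followed by a term extraction step that turns the targets found so far into a search engine query. The sketch below is built on that reading and is not the author's implementation: the per-window counts, the `patience` parameter, the stopword list and the frequency-based term selection are all hypothetical, and the search engine call itself is left out.

```python
from collections import Counter


def should_query_search_engine(targets_per_window, patience=3):
    """True if the latest window found no targets, or the count fell
    in each of the last `patience` consecutive windows (assumed rule)."""
    if not targets_per_window:
        return False
    if targets_per_window[-1] == 0:
        return True
    recent = targets_per_window[-(patience + 1):]
    if len(recent) < patience + 1:
        return False
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


def build_query(target_texts, k=3,
                stopwords=frozenset({"the", "a", "of", "and", "in"})):
    """Join the k most frequent non-stopword terms of the target pages."""
    counts = Counter(
        word
        for text in target_texts
        for word in text.lower().split()
        if word not in stopwords
    )
    return " ".join(term for term, _ in counts.most_common(k))


# Hypothetical crawl history: targets found per window of fetched pages.
history = [5, 4, 3, 2]
if should_query_search_engine(history):
    print(build_query(["focused crawler paper", "focused crawler", "focused"], k=2))
```

The URLs returned for such a query would then be added to the frontier, giving the crawler new starting points beyond those reachable from the seed pages.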
[Figure: baseline vs. improved crawler, with stages A, B and C marked; x-axis 0 to 1400, y-axis 0 to 140]