1
Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
Soumen Chakrabarti, IBM Almaden
Joint work with: Martin van Den Berg (Xerox), Byron Dom (IBM), David Gibson (Berkeley)
Funded by Global Web Solutions, IBM Atlanta
2
Portals and portholes
- Popular search portals and directories
  - Useful for generic needs
  - Difficult to do serious research
- Information needs of net-savvy users are getting very sophisticated
  - Relatively little business incentive
  - Need handmade specialty sites: portholes
- Resource discovery must be personalized
3
Quote
The emergence of portholes will be one of the major Internet trends of 1999. As people become more savvy users of the Net, they want things which are better focused on meeting their specific needs. We're going to see a whole lot more of this, and it's going to potentially erode the user base of some of the big portals.
Jim Hake (Founder, Global Information Infrastructure Awards)
4
Quote
The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical, and useful, than trying to cover the entire universe.
Dan Gillmor (Tech Columnist, San Jose Mercury News)
5
Scenario
- Disk drive research group wants to track magnetic surface technologies
- Compiler research group wants to trawl the web for graduate student resumés
- ____ wants to enhance his/her collection of bookmarks about ____ with prominent and relevant links
- Virtual libraries like the Open Directory Project and the Mining Co.
6
Goal
- Automatically construct a focused portal (porthole) containing resources that are
  - Relevant to the user's focus of interest
  - Of high influence and quality
  - Collectively comprehensive
7
Tools at hand
- Keyword search engines
  - Synonymy, polysemy
  - Abundance, lack of quality
- Hand-compiled topic directories
  - Labor intensive, subjective judgements
- Resources automatically located using keyword search and link graph distillation
  - Dependence on large crawls and indices
8
Estimating popularity
- Extensive research on social network theory
  - Wasserman and Faust
- Hyperlink based
  - Large in-degree indicates popularity/authority
  - Not all votes are worth the same
- Several similar ideas and refinements
  - Google (Page and Brin) and HITS (Kleinberg)
  - CLEVER (Chakrabarti et al.)
  - Topic distillation (Bharat and Henzinger)
9
Topic distillation overview
- Given web graph and query
- Search engine selects sub-graph
- Expansion, pruning and edge weights
- Nodes iteratively transfer authority to cited neighbors
[Figure: a query is sent to a search engine, which selects a subgraph of the Web.]
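The iterative transfer of authority to cited neighbors can be illustrated with a minimal HITS-style sketch. It is an illustration under stated assumptions, not the distiller used in this work: edges are unweighted, the expansion and pruning steps are omitted, and the function name `hits` and the toy edge list are hypothetical.

```python
from math import sqrt

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a list of (src, dst) links."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages citing it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # A page's hub score is the sum of the authority scores it cites.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize both vectors so the scores stay bounded.
        for scores in (hub, auth):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Toy subgraph (hypothetical): three hubs cite a1, only one cites a2.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
```

In this toy graph `a1` ends up with the highest authority score and `h1`, which cites both authorities, with the highest hub score.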
10
Preliminary approach
- Use topic distillation for focused crawling
- Each node in topic taxonomy is a query
- Query is refined by trial-and-error
- Topic distillation runs at each node
- E.g.: European airlines
  +swissair +iberia +klm
11
12
Query construction
+"power suppl*" "switch* mode" smps -multiprocessor* "uninterrupt* power suppl*" +ups -parcel*
/Companies/Electronics/Power_Supply
13
Query complexity
- Complex queries (966 trials)
  - Average words: 7.03
  - Average operators (+ * - "): 4.34
- Typical Alta Vista queries are much simpler [Silverstein, Henzinger, Marais and Moricz]
  - Average query words: 2.35
  - Average operators (+ * - "): 0.41
- Forcibly adding a hub or authority node helped in 86% of the queries
14
Problems with preliminary approach
- Difficulty of query construction
- Dependence on large web crawl and index
  - System = crawler + index + distiller
- Unreliability of keyword match
  - Engines differ significantly on a given query due to small overlap [Bharat and Broder]
  - Narrow, arbitrary view of relevant subgraph
- Topic model does not improve over time
- Lack of output sensitivity
15
Output sensitivity
- Say the goal is to find a comprehensive collection of recreational and competitive bicycling sites and pages
- Ideally effort should scale with size of the result
- Time spent crawling and indexing sites unrelated to the topic is wasted
- Likewise, time that does not improve comprehensiveness is wasted
16
Proposed solution
- Resource discovery system that can be customized to crawl for any topic by giving examples
- Hypertext mining algorithms learn to recognize pages and sites about the given topic, and a measure of their goodness
- Crawler has guidance hooks controlled by these two scores
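One way to picture a crawler guided by a score is a best-first frontier. The sketch below is a simplified illustration, not the system's scheduler: it uses only a relevance score (the goodness score is left out), it runs over an in-memory toy web rather than fetching pages, and `focused_crawl`, `TOY_WEB`, and `TOY_RELEVANCE` are hypothetical names.

```python
import heapq

def focused_crawl(seeds, fetch_links, relevance, budget):
    """Visit up to `budget` pages, always expanding the most promising URL."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        score = relevance(url)
        visited.append((url, score))
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                # Unvisited outlinks inherit the relevance of the citing page.
                heapq.heappush(frontier, (-score, link))
    return visited

# Toy link graph and relevance scores (made up for illustration).
TOY_WEB = {
    "seed": ["bike1", "news"],
    "bike1": ["bike2"],
    "news": ["sports", "politics"],
}
TOY_RELEVANCE = {"seed": 0.9, "bike1": 0.8, "bike2": 0.7,
                 "news": 0.1, "sports": 0.2, "politics": 0.0}

order = focused_crawl(["seed"], lambda u: TOY_WEB.get(u, []),
                      TOY_RELEVANCE.get, budget=4)
```

With a budget of four pages, the links found on the irrelevant `news` page are never expanded, because they inherit its low priority.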
17
Advantages
- No need for query formulation: the system learns from examples
- No dependence on global crawls
- Specialized, deep and up-to-date web exploration
- Modest desktop hardware adequate
18
Administration scenario
[Figure: administration interface. Labels: Taxonomy Editor, Current Examples, Suggested Additional Examples, Drag.]
19
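Relevance
[Figure: topic taxonomy tree rooted at All. Node labels: Bus&Econ, Recreation, Arts, Companies, Cycling, Bike Shops, Mt. Biking, Clubs. Legend: path nodes, good nodes, subsumed nodes.]
$$\Pr[d \text{ is good}] = \sum_{c:\ \mathrm{good}(c)} \Pr[c \mid d]$$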
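The relevance score on this slide, the probability that document d is good, is the sum of Pr[c | d] over the classes c marked good. A small sketch of that sum follows; the class names, the choice of good classes, and the posterior values are all hypothetical.

```python
def relevance(posteriors, good_classes):
    """Sum Pr[c | d] over the classes c that are marked good."""
    return sum(p for c, p in posteriors.items() if c in good_classes)

# Hypothetical classifier output Pr[c | d] for one document d.
posteriors = {
    "Recreation/Cycling": 0.5,
    "Recreation/Other": 0.25,
    "Bus&Econ/Companies": 0.125,
    "Arts": 0.125,
}
# Hypothetical user marking of good classes.
good_classes = {"Recreation/Cycling", "Recreation/Other"}

score = relevance(posteriors, good_classes)  # 0.5 + 0.25
```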
20
Classification
- How relevant is a document w.r.t. a class?
  - Supervised learning, filtering, classification, categorization
- Many types of classifiers
  - Bayesian, nearest neighbor, rule-based
- Hypertext
  - Both text and links are class-dependent clues
  - How to model link-based features?
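As an illustration of the Bayesian option, here is a minimal text-only multinomial naive Bayes sketch with add-one smoothing. It is not the hypertext classifier used in this work; the training snippets and the names `train` and `posteriors` are assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import exp, log

def train(examples):
    """Count words per class from (label, text) training pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for label, text in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def posteriors(model, text):
    """Return Pr[c | text] for every class c, using add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = log(doc_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += log((word_counts[label][word] + 1)
                             / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    unnorm = {c: exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Made-up training examples for two classes.
model = train([
    ("cycling", "mountain bike trail race"),
    ("cycling", "road bike club ride"),
    ("finance", "mutual fund stock market"),
    ("finance", "bond fund interest rate"),
])
result = posteriors(model, "bike race club")
```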
21
Exploiting link features
- c = class, t = text, N = neighbors
- Text-only model: Pr[t | c]
- Using neighbors' text to judge my topic: Pr[t, t(N) | c]
- Better model: Pr[t, c(N) | c]
- Non-linear relaxation
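The idea of using neighbors' classes, rather than their text, and iterating until the labels settle can be sketched as below. This is a heavily simplified relaxation under stated assumptions, not the model from this work: each page starts from a text-only posterior, neighbors are treated as independent given the page's class, updates are synchronous, and the compatibility table `compat` (probability of a neighbor's class given my class) is made up.

```python
def relax(text_post, neighbors, compat, rounds=10):
    """Refine per-page class distributions using neighbors' current classes."""
    classes = list(compat)
    current = {page: dict(p) for page, p in text_post.items()}
    for _ in range(rounds):
        updated = {}
        for page, prior in text_post.items():
            scores = {}
            for c in classes:
                s = prior[c]  # evidence from the page's own text
                for nb in neighbors.get(page, []):
                    # Expected compatibility with the neighbor's class estimate.
                    s *= sum(current[nb][c2] * compat[c][c2] for c2 in classes)
                scores[c] = s
            z = sum(scores.values())
            updated[page] = {c: scores[c] / z for c in classes}
        current = updated
    return current

# compat[c][c2]: assumed Pr[neighbor has class c2 | I have class c].
compat = {"bike": {"bike": 0.8, "other": 0.2},
          "other": {"bike": 0.2, "other": 0.8}}
# Text-only posteriors: page x is ambiguous, its neighbors a and b are not.
text_post = {"a": {"bike": 0.9, "other": 0.1},
             "b": {"bike": 0.9, "other": 0.1},
             "x": {"bike": 0.5, "other": 0.5}}
neighbors = {"x": ["a", "b"], "a": ["x"], "b": ["x"]}

labels = relax(text_post, neighbors, compat)
```

Page `x`, whose text is uninformative, is pulled toward the class of its two confidently labeled neighbors.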
22
Exploiting link features
[Figure: classification error (%, 0 to 40) versus percentage of neighborhood known (0 to 100), comparing the Text, Link, and Text+Link models.]
23
Putting it together
[Figure: system architecture. Components: Taxonomy Database, Taxonomy Editor, Example Browser, Crawl Database, Hypertext Classifier (Learn), Topic Models, Hypertext Classifier (Apply), Scheduler, Workers, Topic Distiller. A Feedback label also appears.]
24
Monitoring the crawler
[Figure: relevance versus time. Each point is one URL; a moving average is overlaid.]
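The moving average in the monitoring plot smooths noisy per-URL relevance into a trend. A minimal sketch of a trailing moving average follows; the window size and the sample scores are made up for illustration.

```python
from collections import deque

def moving_average(values, window):
    """Trailing mean over the last `window` values (fewer at the start)."""
    recent = deque(maxlen=window)  # old values fall off automatically
    averages = []
    for value in values:
        recent.append(value)
        averages.append(sum(recent) / len(recent))
    return averages

# Per-URL relevance in fetch order (illustrative values only).
scores = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
smoothed = moving_average(scores, window=2)
```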
25
RDBMS benefits
- Multiple priority controls
- Dynamically changing crawling strategies
- Concurrency and crash recovery
- Effective out-of-core computations
- Ad-hoc crawl monitoring and tweaking
- Synergy of scale
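To make the priority control and ad-hoc monitoring points concrete, the sketch below keeps crawl state in a relational table and expresses both as SQL queries. It uses Python's built-in `sqlite3` with an in-memory database; the table name, columns, and rows are hypothetical and are not the schema used by this system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE crawl (
        url       TEXT PRIMARY KEY,
        priority  REAL,              -- scheduling priority for unvisited URLs
        relevance REAL,              -- classifier score, NULL until visited
        visited   INTEGER DEFAULT 0
    )
""")
conn.executemany(
    "INSERT INTO crawl (url, priority, relevance, visited) VALUES (?, ?, ?, ?)",
    [
        ("http://a.example/", 0.9, 0.75, 1),
        ("http://b.example/", 0.2, 0.25, 1),
        ("http://c.example/", 0.8, None, 0),
        ("http://d.example/", 0.3, None, 0),
        ("http://e.example/", 0.6, None, 0),
    ],
)

def next_urls(conn, limit):
    """Crawl order is just an ORDER BY clause, so it can be changed at will."""
    rows = conn.execute(
        "SELECT url FROM crawl WHERE visited = 0 "
        "ORDER BY priority DESC, url LIMIT ?", (limit,))
    return [row[0] for row in rows]

def average_relevance(conn):
    """Ad-hoc monitoring: mean relevance of the pages visited so far."""
    (avg,) = conn.execute(
        "SELECT AVG(relevance) FROM crawl WHERE visited = 1").fetchone()
    return avg

batch = next_urls(conn, 2)
avg = average_relevance(conn)
```

Changing the crawling strategy then amounts to changing the `ORDER BY` expression or updating the `priority` column, without touching the worker code.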
26
Measures of success
- Harvest rate
  - What fraction of crawled pages are relevant
- Robustness across seed sets
  - Separate crawls with random disjoint samples
  - Measure overlap in URLs and servers crawled
  - Measure agreement in best-rated resources
- Evidence of non-trivial work
  - #Links from start set to the best resources
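The first two measures can be sketched in a few lines. The definitions below are assumptions made for illustration: a page counts as relevant when its score reaches a hypothetical threshold of 0.5, and overlap is taken as the fraction of one crawl's URLs (or servers) that also appear in the other crawl. The example URLs are made up.

```python
from urllib.parse import urlsplit

def harvest_rate(relevance_scores, threshold=0.5):
    """Fraction of crawled pages whose relevance reaches the threshold."""
    if not relevance_scores:
        return 0.0
    hits = sum(1 for r in relevance_scores if r >= threshold)
    return hits / len(relevance_scores)

def overlap(items_a, items_b):
    """Fraction of the items in A that also appear in B."""
    a, b = set(items_a), set(items_b)
    return len(a & b) / len(a) if a else 0.0

def servers(urls):
    """Host part of each URL, used for server-level overlap."""
    return {urlsplit(u).netloc for u in urls}

# Two crawls started from disjoint seed sets (illustrative URLs).
crawl_a = ["http://a.example/1", "http://a.example/2", "http://b.example/"]
crawl_b = ["http://a.example/1", "http://a.example/3", "http://c.example/"]

rate = harvest_rate([0.9, 0.8, 0.1, 0.6])
url_overlap = overlap(crawl_a, crawl_b)
server_overlap = overlap(servers(crawl_a), servers(crawl_b))
```

Server overlap is typically higher than URL overlap, since two crawls can reach the same site through different pages.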
27
Harvest rate
[Figure, two panels. Unfocused: "Harvest Rate (Cycling, Unfocused)", average relevance (0 to 1) versus #URLs fetched (0 to 10000), averaged over 100. Focused: "Harvest Rate (Cycling, Soft Focus)", average relevance (0 to 1) versus #URLs fetched (0 to 6000), averaged over 100 and over 1000.]
28
Crawl robustness
[Figure, two panels titled "Crawl Robustness (Cycling)": URL overlap (0 to 0.9) and server overlap (0 to 1) versus #URLs crawled (0 to 3000), with series Overlap1 and Overlap2 for Crawl A and Crawl B.]
29
Top resources after one hour
- Recreational and competitive cycling
  - http://www.truesport.com/Bike/links.htm
  - http://reality.sgi.com/employees/billh_hampton/jrvs/links.html
  - http://www.acs.ucalgary.ca/~bentley/mark_links.html
- HIV/AIDS research and treatment
  - http://www.stopaids.org/Otherorgs.html
  - http://www.iohk.com/UserPages/mlau/aidsinfo.html
  - http://www.ahandyguide.com/cat1/a/a66.htm
- Purer and better than root set
32
Distance to best resources
[Figure, two panels. "Resource Distance (Mutual Funds)": #servers in top 100 (0 to 35) versus minimum distance from crawl seed in #links (1 to 15). "Resource Distance (Cycling)": #servers in top 100 (0 to 18) versus minimum distance from crawl seed in #links (1 to 12).]
Cycling: cooperative. Mutual funds: competitive.
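The minimum distance from the crawl seed, measured in links, is a shortest-path length in the link graph and can be computed with a breadth-first search from all seeds at once. The sketch below assumes a small in-memory graph; the graph, the seed set, and the list of top resources are hypothetical.

```python
from collections import Counter, deque

def min_link_distance(graph, seeds):
    """Breadth-first search giving each reachable page its distance in links."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target not in distance:  # first visit is the shortest path
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance

# Toy link graph: page -> pages it links to.
graph = {"s1": ["a"], "s2": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"]}
distance = min_link_distance(graph, ["s1", "s2"])

# Histogram of distances for a hypothetical list of top-rated resources.
top_resources = ["a", "c", "d"]
histogram = Counter(distance[r] for r in top_resources)
```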
33
Robustness of resource discovery
- Sample disjoint sets of starting URLs
- Two separate crawls
- Find best authorities
- Order by rank
- Find overlap in the top-rated resources
[Figure: "Resource Robustness (Cycling)", server overlap (0 to 1) versus #top resources (0 to 25), with series Overlap1 and Overlap2.]
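The last two steps, ordering by rank and finding the overlap among the top-rated resources, can be sketched as an overlap-at-k curve. The definition used here (shared servers among the top k of each crawl, divided by k) is one reasonable reading and may differ from the Overlap1 and Overlap2 series in the plot; the ranked lists are made up.

```python
def top_k_overlap(ranked_a, ranked_b, k):
    """Fraction of the top-k entries that the two ranked lists share."""
    if k <= 0:
        return 0.0
    return len(set(ranked_a[:k]) & set(ranked_b[:k])) / k

# Best authorities (by server) from two crawls with disjoint seed sets.
ranked_a = ["s1", "s2", "s3", "s4"]
ranked_b = ["s2", "s1", "s5", "s3"]

# Overlap as a function of how many top resources are compared.
curve = [top_k_overlap(ranked_a, ranked_b, k) for k in range(1, 5)]
```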
34
Future work
- Harvest rate at different levels of taxonomy
  - By definition harvest rate is 1 for root node
- Sociology of citations
  - Build a gigantic citation matrix for web topics
- Further enhance resource finding skills
  - Semi-structured queries
  - Suspicious link neighborhoods, e.g., traffic radar manufacturer and auto insurance company
35
Related work
- WebWatcher, HotList&ColdList
  - Filtering as post-processing, not acquisition
- Fish search, WebCrawler
  - Crawler guided by query keyword matches
- Ahoy!, Cora
  - Hand-crafted to find home pages and papers
- ReferralWeb
  - Social network on the Web
36
Conclusion
- New architecture for example-driven topic-specific web resource discovery
- No dependence on full web crawl and index
- Modest desktop hardware adequate
- Variable radius goal-directed crawling
- High harvest rate
- High quality resources found far from keyword query response nodes
37
References
[email protected]
www.cs.berkeley.edu/~soumen/
  www8focus.pdf
  sigmod98.ps
www.almaden.ibm.com/cs/k53/ir.html