
1

Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery

Soumen Chakrabarti, IBM Almaden

Joint work with:
- Martin van Den Berg (Xerox)
- Byron Dom (IBM)
- David Gibson (Berkeley)

Funded by Global Web Solutions, IBM Atlanta

2

Portals and portholes
- Popular search portals and directories
  - Useful for generic needs
  - Difficult to do serious research
- Information needs of net-savvy users are getting very sophisticated
- Relatively little business incentive
- Need handmade specialty sites: portholes
- Resource discovery must be personalized

3

Quote

"The emergence of portholes will be one of the major Internet trends of 1999. As people become more savvy users of the Net, they want things which are better focused on meeting their specific needs. We're going to see a whole lot more of this, and it's going to potentially erode the user base of some of the big portals."

Jim Hake (Founder, Global Information Infrastructure Awards)

4

Quote

"The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical, and useful, than trying to cover the entire universe."

Dan Gillmor (Tech Columnist, San Jose Mercury News)

5

Scenario
- Disk drive research group wants to track magnetic surface technologies
- Compiler research group wants to trawl the web for graduate student resumés
- ____ wants to enhance his/her collection of bookmarks about ____ with prominent and relevant links
- Virtual libraries like the Open Directory Project and the Mining Co.
  - http://directory.mozilla.org/
  - http://www.miningco.com/

6

Goal
- Automatically construct a focused portal (porthole) containing resources that are
  - Relevant to the user's focus of interest
  - Of high influence and quality
  - Collectively comprehensive

7

Tools at hand
- Keyword search engines
  - Synonymy, polysemy
  - Abundance, lack of quality
- Hand-compiled topic directories
  - Labor intensive, subjective judgements
- Resources automatically located using keyword search and link graph distillation
  - Dependence on large crawls and indices

8

Estimating popularity
- Extensive research on social network theory
  - Wasserman and Faust
- Hyperlink based
  - Large in-degree indicates popularity/authority
  - Not all votes are worth the same
- Several similar ideas and refinements
  - Google (Page and Brin) and HITS (Kleinberg)
  - CLEVER (Chakrabarti et al.)
  - Topic distillation (Bharat and Henzinger)

9

Topic distillation overview
- Given web graph and query
- Search engine selects sub-graph
- Expansion, pruning and edge weights
- Nodes iteratively transfer authority to cited neighbors

[Figure: a query goes to the search engine, which selects a subgraph of the Web]
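
The iterative authority transfer described on this slide can be illustrated with a minimal hub/authority iteration in the style of HITS. This is a sketch under simplifying assumptions: the function name `distill` is hypothetical, all edges have weight 1, and the expansion and pruning steps are left out.

```python
import math

def distill(edges, iterations=50):
    """Hub/authority iteration over a list of (source, target) links."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages citing it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score of a page: sum of the authority scores of pages it cites.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize both score vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Toy subgraph: h1 and h2 both cite a; only h1 cites b.
hub, auth = distill([("h1", "a"), ("h2", "a"), ("h1", "b")])
```

In the toy subgraph, `a` ends up with more authority than `b`, and `h1` is a better hub than `h2`, because votes from stronger hubs count for more.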

10

Preliminary approach
- Use topic distillation for focused crawling
- Each node in topic taxonomy is a query
- Query is refined by trial-and-error
- Topic distillation runs at each node
- E.g.: European airlines +swissair +iberia +klm

12

Query construction

Topic node: /Companies/Electronics/Power_Supply

Query: +"power suppl*" "switch* mode" smps -multiprocessor* "uninterrupt* power suppl*" +ups -parcel*

13

Query complexity
- Complex queries (966 trials)
  - Average words 7.03
  - Average operators (+*-") 4.34
- Typical Alta Vista queries are much simpler [Silverstein, Henzinger, Marais and Moricz]
  - Average query words 2.35
  - Average operators (+*-") 0.41
- Forcibly adding a hub or authority node helped in 86% of the queries

14

Problems with preliminary approach
- Difficulty of query construction
- Dependence on large web crawl and index
  - System = crawler + index + distiller
- Unreliability of keyword match
  - Engines differ significantly on a given query due to small overlap [Bharat and Broder]
  - Narrow, arbitrary view of relevant subgraph
  - Topic model does not improve over time
- Lack of output sensitivity

15

Output sensitivity
- Say the goal is to find a comprehensive collection of recreational and competitive bicycling sites and pages
- Ideally effort should scale with size of the result
- Time spent crawling and indexing sites unrelated to the topic is wasted
- Likewise, time that does not improve comprehensiveness is wasted

16

Proposed solution
- Resource discovery system that can be customized to crawl for any topic by giving examples
- Hypertext mining algorithms learn to recognize pages and sites about the given topic, and a measure of their goodness
- Crawler has guidance hooks controlled by these two scores
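
One way to picture a guidance hook is a crawl frontier ordered by score rather than by discovery order. The sketch below is illustrative only: `focused_crawl`, `fetch_links` and `relevance` are hypothetical names, only the relevance score is used (the goodness score is omitted), and an unvisited link simply inherits the relevance of the page that cited it.

```python
import heapq

def focused_crawl(seeds, fetch_links, relevance, budget=100):
    """Visit pages in order of the relevance of the page that linked to them."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        score = relevance(url)          # hypothetical classifier call
        visited.append((url, score))
        for link in fetch_links(url):   # hypothetical fetch-and-parse call
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited

# Toy web: links out of a low-relevance page wait at the back of the queue.
graph = {"seed": ["bike", "stock"], "bike": ["trail"], "stock": ["bond"],
         "trail": [], "bond": []}
scores = {"seed": 0.9, "bike": 0.8, "stock": 0.1, "trail": 0.9, "bond": 0.0}
order = focused_crawl(["seed"], lambda u: graph[u], lambda u: scores[u], budget=4)
```

With a budget of four fetches, the page reachable only through the off-topic `stock` page is never visited.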

17

Advantages
- No need for query formulation: system learns from examples
- No dependence on global crawls
- Specialized, deep and up-to-date web exploration
- Modest desktop hardware adequate

18

Administration scenario

[Figure: Taxonomy Editor with panes for Current Examples and Suggested Additional Examples; label: Drag]

19

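Relevance

[Figure: topic taxonomy tree rooted at All, with nodes Bus&Econ, Recreation, Arts, Companies, Cycling, Bike Shops, Mt. Biking, Clubs, ...; legend: path nodes, good nodes, subsumed nodes]

$$\Pr[d \text{ is good}] = \sum_{\mathrm{good}(c)} \Pr[c \mid d]$$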
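
As a small worked instance of the relevance formula on this slide (the probability that document d is good is the sum of Pr[c|d] over the classes marked good), the sketch below sums a classifier's posterior over a set of good classes. The class names and probabilities are made-up illustrations, and `relevance` is a hypothetical helper name.

```python
def relevance(posterior, good_classes):
    """Sum of Pr[c|d] over the classes marked good."""
    return sum(p for c, p in posterior.items() if c in good_classes)

# Hypothetical posterior Pr[c|d] for one document over three leaf classes.
posterior = {
    "Cycling/Mt.Biking": 0.5,
    "Cycling/Clubs": 0.25,
    "Companies/Bike Shops": 0.25,
}
print(relevance(posterior, {"Cycling/Mt.Biking", "Cycling/Clubs"}))  # 0.75
```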

20

Classification
- How relevant is a document w.r.t. a class?
  - Supervised learning, filtering, classification, categorization
- Many types of classifiers
  - Bayesian, nearest neighbor, rule-based
- Hypertext
  - Both text and links are class-dependent clues
  - How to model link-based features?
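
To make the Bayesian option concrete, here is a minimal text-only multinomial naive Bayes classifier that returns Pr[c|d] for each class. It is a generic textbook sketch, not the classifier used in the talk; the function names, the Laplace smoothing, and the toy training data are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (class_label, list_of_tokens)."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in examples:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_docs": class_docs, "term_counts": term_counts, "vocab": vocab}

def posterior(model, tokens):
    """Pr[c|d] for every class, with Laplace smoothing."""
    total_docs = sum(model["class_docs"].values())
    vocab_size = len(model["vocab"])
    log_scores = {}
    for label, n_docs in model["class_docs"].items():
        counts = model["term_counts"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n_docs / total_docs)        # class prior
        for tok in tokens:
            if tok in model["vocab"]:                # ignore unseen terms
                score += math.log((counts[tok] + 1) / denom)
        log_scores[label] = score
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

model = train([("cycling", ["bike", "trail", "race"]),
               ("finance", ["fund", "stock", "bond"])])
probs = posterior(model, ["bike", "race"])
```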

21

Exploiting link features
- c = class, t = text, N = neighbors
- Text-only model: Pr[t | c]
- Using neighbors' text to judge my topic: Pr[t, t(N) | c]
- Better model: Pr[t, c(N) | c]
- Non-linear relaxation

[Figure: hyperlinked pages; the page to classify is marked "?"]
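
The "better model" conditions on the classes of the neighbors, which are themselves uncertain, so the estimates have to be refined iteratively. The sketch below shows one simplified form of relaxation labeling under stated assumptions: `relax`, the compatibility table `compat`, and the toy numbers are hypothetical, and this is not claimed to be the exact procedure behind the slide.

```python
def relax(text_prob, neighbors, compat, rounds=10):
    """Re-estimate class beliefs from text evidence and neighbors' beliefs.

    text_prob: {node: {cls: Pr[cls | text]}}
    neighbors: {node: [neighbor nodes]}
    compat:    {(cls, neighbor_cls): Pr[neighbor is neighbor_cls | node is cls]}
    """
    belief = {n: dict(p) for n, p in text_prob.items()}
    for _ in range(rounds):
        updated = {}
        for node, prior in text_prob.items():
            scores = {}
            for cls, p_text in prior.items():
                score = p_text
                for nb in neighbors.get(node, []):
                    # Expected compatibility with the neighbor's current belief.
                    score *= sum(compat[(cls, nc)] * q
                                 for nc, q in belief[nb].items())
                scores[cls] = score
            z = sum(scores.values()) or 1.0
            updated[node] = {c: s / z for c, s in scores.items()}
        belief = updated
    return belief

# Page "a" has ambiguous text but two neighbors that look like cycling pages.
text_prob = {"a": {"bike": 0.5, "fund": 0.5},
             "b": {"bike": 0.9, "fund": 0.1},
             "c": {"bike": 0.9, "fund": 0.1}}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
compat = {("bike", "bike"): 0.8, ("bike", "fund"): 0.2,
          ("fund", "bike"): 0.2, ("fund", "fund"): 0.8}
belief = relax(text_prob, neighbors, compat)
```

The ambiguous page is pulled toward the class of its neighbors, which is the effect the link-based model is meant to capture.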

22

Exploiting link features

[Figure: %Error (0 to 40) against %Neighborhood known (0 to 100) for three models: Text, Link, Text+Link]

23

Putting it together

[Figure: system architecture with components Taxonomy Database, Taxonomy Editor, Example Browser, Crawl Database, Hypertext Classifier (Learn), Topic Models, Hypertext Classifier (Apply), Scheduler, Workers, Topic Distiller, and a Feedback path]

24

Monitoring the crawler

[Figure: relevance against time; each point is one URL, with a moving average curve]

25

RDBMS benefits
- Multiple priority controls
- Dynamically changing crawling strategies
- Concurrency and crash recovery
- Effective out-of-core computations
- Ad-hoc crawl monitoring and tweaking
- Synergy of scale
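
To illustrate why a relational store makes priority controls and strategy changes cheap, the sketch below keeps the frontier in a table and expresses each crawl policy as an ORDER BY clause. The schema, column names, and URLs are invented for illustration, and SQLite stands in for whatever RDBMS the system actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crawl (url TEXT PRIMARY KEY, relevance REAL, "
    "num_tries INTEGER DEFAULT 0, visited INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO crawl (url, relevance, num_tries, visited) VALUES (?, ?, ?, ?)",
    [
        ("http://example.org/bikes", 0.9, 0, 0),
        ("http://example.org/funds", 0.2, 0, 0),
        ("http://example.org/trails", 0.9, 2, 0),
        ("http://example.org/done", 0.95, 0, 1),
    ],
)

def next_urls(conn, order_by, limit=2):
    """Pick the next URLs to fetch; the crawl policy is just an ORDER BY clause."""
    # order_by is developer-supplied SQL here, never untrusted input.
    sql = f"SELECT url FROM crawl WHERE visited = 0 ORDER BY {order_by} LIMIT ?"
    return [row[0] for row in conn.execute(sql, (limit,))]

# Policy A: fewest retries first, then highest relevance.
policy_a = next_urls(conn, "num_tries ASC, relevance DESC")
# Policy B: highest relevance first, regardless of retries.
policy_b = next_urls(conn, "relevance DESC, url ASC")
# Ad-hoc monitoring: average relevance of the pages fetched so far.
avg_relevance = conn.execute(
    "SELECT AVG(relevance) FROM crawl WHERE visited = 1"
).fetchone()[0]
```

Switching strategy mid-crawl amounts to changing the query; the stored frontier is untouched.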

26

Measures of success
- Harvest rate
  - What fraction of crawled pages are relevant
- Robustness across seed sets
  - Separate crawls with random disjoint samples
  - Measure overlap in URLs and servers crawled
  - Measure agreement in best-rated resources
- Evidence of non-trivial work
  - #Links from start set to the best resources
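
The harvest rate can be computed directly from the sequence of relevance scores of fetched pages. The sketch below is one possible reading: the 0.5 threshold for calling a page relevant is an assumption, and the moving average mirrors the idea of averaging over a fixed number of recent fetches.

```python
def harvest_rate(relevances, threshold=0.5):
    """Fraction of crawled pages whose relevance meets the (assumed) threshold."""
    if not relevances:
        return 0.0
    return sum(1 for r in relevances if r >= threshold) / len(relevances)

def moving_average(relevances, window=100):
    """Average relevance over the most recent `window` fetches, one value per fetch."""
    averages, running = [], 0.0
    for i, r in enumerate(relevances):
        running += r
        if i >= window:
            # Drop the score that just fell out of the window.
            running -= relevances[i - window]
        averages.append(running / min(i + 1, window))
    return averages

print(harvest_rate([0.9, 0.1, 0.8, 0.2]))                # 0.5
print(moving_average([1.0, 0.0, 1.0, 0.0], window=2))    # [1.0, 0.5, 0.5, 0.5]
```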

27

Harvest rate

[Figure, two panels plotting average relevance (0 to 1) against #URLs fetched.
 Unfocused: "Harvest Rate (Cycling, Unfocused)", 0 to 10000 URLs fetched; series: Avg over 100.
 Focused: "Harvest Rate (Cycling, Soft Focus)", 0 to 6000 URLs fetched; series: Avg over 100, Avg over 1000.]

28

Crawl robustness

[Figure, two panels titled "Crawl Robustness (Cycling)", comparing Crawl A and Crawl B over 0 to 3000 URLs crawled.
 URL Overlap: URL overlap (0 to 0.9) against #URLs crawled.
 Server Overlap: server overlap (0 to 1) against #URLs crawled; series: Overlap1, Overlap2.]

29

Top resources after one hour
- Recreational and competitive cycling
  - http://www.truesport.com/Bike/links.htm
  - http://reality.sgi.com/employees/billh_hampton/jrvs/links.html
  - http://www.acs.ucalgary.ca/~bentley/mark_links.html
- HIV/AIDS research and treatment
  - http://www.stopaids.org/Otherorgs.html
  - http://www.iohk.com/UserPages/mlau/aidsinfo.html
  - http://www.ahandyguide.com/cat1/a/a66.htm
- Purer and better than root set

32

Distance to best resources

[Figure, two panels plotting #servers in top 100 against minimum distance from crawl seed (#links).
 "Resource Distance (Mutual Funds)": distances 1 to 15, counts 0 to 35.
 "Resource Distance (Cycling)": distances 1 to 12, counts 0 to 18.]

- Cycling: cooperative
- Mutual funds: competitive
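
The minimum link distance from the seed set to each resource is a shortest-path computation on the crawled link graph. A breadth-first search sketch follows; the graph representation and the name `min_link_distance` are assumptions made for illustration.

```python
from collections import Counter, deque

def min_link_distance(graph, seeds):
    """Breadth-first search: fewest links from any seed to each reachable page."""
    distance = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt not in distance:
                distance[nxt] = distance[page] + 1
                queue.append(nxt)
    return distance

# Toy link graph: adjacency lists of outgoing links.
graph = {"s": ["a"], "a": ["b", "c"], "b": ["c"], "c": []}
dist = min_link_distance(graph, ["s"])
# Histogram of how many pages sit at each distance from the seed set.
histogram = Counter(dist.values())
```

Restricting `dist` to the top-rated servers and taking the histogram gives the kind of distribution plotted on this slide.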

33

Robustness of resource discovery
- Sample disjoint sets of starting URLs
- Two separate crawls
- Find best authorities
- Order by rank
- Find overlap in the top-rated resources

[Figure: "Resource Robustness (Cycling)", server overlap (0 to 1) against #top resources (0 to 25); series: Overlap1, Overlap2]
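
The overlap step can be sketched as a comparison of the servers behind the top k resources of two independently seeded crawls. The function names and example URLs below are hypothetical. The measure as written is asymmetric, so calling it in both directions yields two numbers; whether that is what Overlap1 and Overlap2 denote in the figure is an assumption.

```python
from urllib.parse import urlsplit

def server(url):
    """Host name portion of a URL, used as the server identity."""
    return urlsplit(url).hostname

def top_k_server_overlap(ranked_a, ranked_b, k):
    """Fraction of servers in crawl A's top k that also appear in crawl B's top k."""
    top_a = {server(u) for u in ranked_a[:k]}
    top_b = {server(u) for u in ranked_b[:k]}
    return len(top_a & top_b) / len(top_a) if top_a else 0.0

# Ranked authority lists from two crawls started from disjoint seed sets.
crawl_a = ["http://x.example/1", "http://y.example/2", "http://z.example/3"]
crawl_b = ["http://y.example/9", "http://x.example/8", "http://w.example/7"]
overlap_at_2 = top_k_server_overlap(crawl_a, crawl_b, 2)
overlap_at_3 = top_k_server_overlap(crawl_a, crawl_b, 3)
```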

34

Future work
- Harvest rate at different levels of taxonomy
  - By definition harvest rate is 1 for root node
- Sociology of citations
  - Build a gigantic citation matrix for web topics
- Further enhance resource finding skills
  - Semi-structured queries
  - Suspicious link neighborhoods, e.g., traffic radar manufacturer and auto insurance company

35

Related work
- WebWatcher, HotList&ColdList
  - Filtering as post-processing, not acquisition
- Fish search, WebCrawler
  - Crawler guided by query keyword matches
- Ahoy!, Cora
  - Hand-crafted to find home pages and papers
- ReferralWeb
  - Social network on the Web

36

Conclusion
- New architecture for example-driven topic-specific web resource discovery
- No dependence on full web crawl and index
- Modest desktop hardware adequate
- Variable radius goal-directed crawling
- High harvest rate
- High quality resources found far from keyword query response nodes

37

References
- [email protected]
- www.cs.berkeley.edu/~soumen/
  - www8focus.pdf
  - sigmod98.ps
- www.almaden.ibm.com/cs/k53/ir.html