Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery

Page 1: Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery

Soumen Chakrabarti, IBM Almaden

Joint work with:
  Martin van Den Berg (Xerox)
  Byron Dom (IBM)
  David Gibson (Berkeley)

Funded by Global Web Solutions, IBM Atlanta

Page 2: Portals and portholes

  Popular search portals and directories
    Useful for generic needs
    Difficult to do serious research
  Information needs of net-savvy users are getting very sophisticated
  Relatively little business incentive
  Need handmade specialty sites: portholes
  Resource discovery must be personalized

Page 3: Quote

The emergence of portholes will be one of the major Internet trends of 1999. As people become more savvy users of the Net, they want things which are better focused on meeting their specific needs. We're going to see a whole lot more of this, and it's going to potentially erode the user base of some of the big portals.

Jim Hake (Founder, Global Information Infrastructure Awards)

Page 4: Quote

The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical, and useful, than trying to cover the entire universe.

Dan Gillmor (Tech Columnist, San Jose Mercury News)

Page 5: Scenario

  Disk drive research group wants to track magnetic surface technologies
  Compiler research group wants to trawl the web for graduate student resumés
  ____ wants to enhance his/her collection of bookmarks about ____ with prominent and relevant links
  Virtual libraries like the Open Directory Project and the Mining Co.

Page 6: Goal

  Automatically construct a focused portal (porthole) containing resources that are
    Relevant to the user’s focus of interest
    Of high influence and quality
    Collectively comprehensive

Page 7: Tools at hand

  Keyword search engines
    Synonymy, polysemy
    Abundance, lack of quality
  Hand compiled topic directories
    Labor intensive, subjective judgements
  Resources automatically located using keyword search and link graph distillation
    Dependence on large crawls and indices

Page 8: Estimating popularity

  Extensive research on social network theory
    Wasserman and Faust
  Hyperlink based
    Large in-degree indicates popularity/authority
    Not all votes are worth the same
  Several similar ideas and refinements
    Google (Page and Brin) and HITS (Kleinberg)
    CLEVER (Chakrabarti et al.)
    Topic distillation (Bharat and Henzinger)

Page 9: Topic distillation overview

  Given web graph and query
  Search engine selects sub-graph
  Expansion, pruning and edge weights
  Nodes iteratively transfer authority to cited neighbors

[Figure labels: Query; Search Engine; The Web; Selected subgraph]
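
The iterative transfer of authority can be illustrated with a minimal sketch in the style of Kleinberg's HITS, which the deck cites. It assumes the subgraph has already been selected and omits the expansion, pruning, and edge-weighting steps; the function name `hits`, the page names, and the iteration count are illustrative choices, not taken from the slides.

```python
import math

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a list of (citing, cited) pairs."""
    nodes = {page for edge in edges for page in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages citing this page.
        auth = dict.fromkeys(nodes, 0.0)
        for citing, cited in edges:
            auth[cited] += hub[citing]
        # Hub: sum of the authority scores of the pages this page cites.
        hub = dict.fromkeys(nodes, 0.0)
        for citing, cited in edges:
            hub[citing] += auth[cited]
        # Rescale both vectors to unit length so the scores stay bounded.
        for scores in (hub, auth):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, auth

# Toy subgraph (hypothetical page names): three pages cite a1, one also cites a2.
EDGES = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
```

On this toy graph the most-cited page ends up with the highest authority score, and the page that cites both targets ends up as the strongest hub.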

Page 10: Preliminary approach

  Use topic distillation for focused crawling
    Each node in topic taxonomy is a query
    Query is refined by trial-and-error
    Topic distillation runs at each node
  E.g.: European airlines +swissair +iberia +klm

Page 12: Query construction

  +“power suppl*” “switch* mode” smps -multiprocessor* “uninterrupt* power suppl*” +ups -parcel*

  /Companies/Electronics/Power_Supply

Page 13: Query complexity

  Complex queries (966 trials)
    Average words 7.03
    Average operators (+*–") 4.34
  Typical Alta Vista queries are much simpler [Silverstein, Henzinger, Marais and Moricz]
    Average query words 2.35
    Average operators (+*–") 0.41
  Forcibly adding a hub or authority node helped in 86% of the queries

Page 14: Problems with preliminary approach

  Difficulty of query construction
  Dependence on large web crawl and index
    System = crawler + index + distiller
  Unreliability of keyword match
    Engines differ significantly on a given query due to small overlap [Bharat and Broder]
  Narrow, arbitrary view of relevant subgraph
  Topic model does not improve over time
  Lack of output sensitivity

Page 15: Output sensitivity

  Say the goal is to find a comprehensive collection of recreational and competitive bicycling sites and pages
  Ideally effort should scale with size of the result
  Time spent crawling and indexing sites unrelated to the topic is wasted
  Likewise, time that does not improve comprehensiveness is wasted

Page 16: Proposed solution

  Resource discovery system that can be customized to crawl for any topic by giving examples
  Hypertext mining algorithms learn to recognize pages and sites about the given topic, and a measure of their goodness
  Crawler has guidance hooks controlled by these two scores
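
One way to picture a guidance hook is a best-first crawl in which each unvisited link inherits the relevance score of the page that cited it. The sketch below is only an illustration under that assumption: it uses the relevance score alone (not the goodness score), runs over an in-memory toy web instead of real fetches, and the names `focused_crawl`, `TOY_WEB`, and `TOY_RELEVANCE` are hypothetical.

```python
import heapq

def focused_crawl(seeds, fetch_links, relevance, budget):
    """Visit up to `budget` pages, following links from relevant pages first."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        score = relevance(url)  # classifier's estimate for the fetched page
        visited.append((url, score))
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                # An unvisited link is as promising as the page that cites it.
                heapq.heappush(frontier, (-score, link))
    return visited

# Toy stand-ins for the web and the classifier (all values are made up).
TOY_WEB = {"seed": ["bike1", "news"], "bike1": ["bike2"],
           "news": ["sports"], "bike2": [], "sports": []}
TOY_RELEVANCE = {"seed": 0.9, "bike1": 0.8, "news": 0.1,
                 "bike2": 0.9, "sports": 0.0}

crawl = focused_crawl(["seed"], TOY_WEB.__getitem__,
                      TOY_RELEVANCE.__getitem__, budget=4)
```

With a budget of four pages, the link found on the irrelevant page is never fetched, because links found on relevant pages are always ahead of it in the queue.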

Page 17: Advantages

  No need for query formulation: system learns from examples
  No dependence on global crawls
  Specialized, deep and up-to-date web exploration
  Modest desktop hardware adequate

Page 18: Administration scenario

[Figure labels: Taxonomy Editor; Current Examples; Suggested Additional Examples; Drag]

Page 19: Relevance

[Figure: topic taxonomy tree. Nodes shown: All; Bus&Econ; Recreation; Arts; ...; Companies; Cycling; Bike Shops; Mt. Biking; Clubs. Legend: path nodes, good nodes, subsumed nodes.]

\Pr[d \text{ is good}] = \sum_{\mathrm{good}(c)} \Pr[c \mid d]
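
As a small worked instance of the relevance formula, the sketch below sums a document's class probabilities over the classes marked good. The class names and probabilities are invented for illustration, and the function name `relevance` is an assumption.

```python
def relevance(class_probs, good_classes):
    """Pr[d is good]: total probability mass on the classes marked good."""
    return sum(p for c, p in class_probs.items() if c in good_classes)

# Hypothetical classifier output Pr[c | d] for one document d.
DOC_PROBS = {"Mt. Biking": 0.5, "Clubs": 0.25, "Bike Shops": 0.125, "Arts": 0.125}
GOOD = {"Mt. Biking", "Clubs"}
```

For this document the relevance is 0.5 + 0.25 = 0.75.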

Page 20: Classification

  How relevant is a document w.r.t. a class?
    Supervised learning, filtering, classification, categorization
  Many types of classifiers
    Bayesian, nearest neighbor, rule-based
  Hypertext
    Both text and links are class-dependent clues
    How to model link-based features?

Page 21: Exploiting link features

  c = class, t = text, N = neighbors
  Text-only model: Pr[t | c]
  Using neighbors’ text to judge my topic: Pr[t, t(N) | c]
  Better model: Pr[t, c(N) | c]
  Non-linear relaxation

Page 22: Exploiting link features

[Figure: %Error (0 to 40) versus %Neighborhood known (0 to 100) for the Text, Link, and Text+Link models.]
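
The Pr[t, c(N) | c] model can be sketched as a naive Bayes classifier that treats each word and each neighbor's class label as independent evidence given the page's class. That independence assumption, the toy probability tables, and the name `classify` are illustrative and not taken from the slides; the sketch also assumes the neighbors' classes are already known, so it does not show the relaxation step.

```python
import math

def classify(words, neighbor_classes, prior, word_prob, link_prob):
    """Pick the class c maximizing Pr[c] * prod Pr[w | c] * prod Pr[c(n) | c]."""
    best_class, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c])
        score += sum(math.log(word_prob[c][w]) for w in words)
        score += sum(math.log(link_prob[c][n]) for n in neighbor_classes)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Made-up model parameters for two classes.
PRIOR = {"cycling": 0.5, "finance": 0.5}
WORD_PROB = {"cycling": {"bike": 0.6, "fund": 0.1, "tour": 0.3},
             "finance": {"bike": 0.1, "fund": 0.6, "tour": 0.3}}
# LINK_PROB[c][n]: probability that a neighbor of a class-c page has class n.
LINK_PROB = {"cycling": {"cycling": 0.8, "finance": 0.2},
             "finance": {"cycling": 0.2, "finance": 0.8}}
```

In this toy model the word "tour" is equally likely in both classes, so the text alone cannot decide; two neighbors labeled finance tip the decision to finance.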

Page 23: Putting it together

[Figure: system architecture. Components: Taxonomy Database; Taxonomy Editor; Example Browser; Crawl Database; Hypertext Classifier (Learn); Topic Models; Hypertext Classifier (Apply); Scheduler; Workers; Topic Distiller; Feedback]

Page 24: Monitoring the crawler

[Figure: relevance versus time. Legend: One URL; Moving Average]
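
The moving-average curve in such a monitoring plot could be computed as below. This is a generic sliding-window mean, not the system's actual monitoring code; the name `moving_average` and the window size are assumptions.

```python
from collections import deque

def moving_average(values, window):
    """Yield the mean of the most recent `window` values after each new value."""
    recent = deque(maxlen=window)
    total = 0.0
    for value in values:
        if len(recent) == window:
            total -= recent[0]  # this value is about to fall out of the window
        recent.append(value)
        total += value
        yield total / len(recent)

# Relevance of four pages in fetch order (made-up numbers), window of two.
smoothed = list(moving_average([1.0, 0.0, 1.0, 1.0], window=2))
```

Until the window fills, the mean is taken over the values seen so far.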

Page 25: RDBMS benefits

  Multiple priority controls
  Dynamically changing crawling strategies
  Concurrency and crash recovery
  Effective out-of-core computations
  Ad-hoc crawl monitoring and tweaking
  Synergy of scale

Page 26: Measures of success

  Harvest rate
    What fraction of crawled pages are relevant
  Robustness across seed sets
    Separate crawls with random disjoint samples
    Measure overlap in URLs and servers crawled
    Measure agreement in best-rated resources
  Evidence of non-trivial work
    #Links from start set to the best resources
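
The first two measures can be made concrete with a short sketch. The 0.5 relevance threshold, the definition of overlap as the fraction of one crawl's items also found by the other, and the example URLs are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

def harvest_rate(relevance_scores, threshold=0.5):
    """Fraction of crawled pages whose relevance reaches the threshold."""
    if not relevance_scores:
        return 0.0
    relevant = sum(1 for score in relevance_scores if score >= threshold)
    return relevant / len(relevance_scores)

def overlap(crawl_a, crawl_b):
    """Fraction of the distinct items in crawl A that also appear in crawl B."""
    a, b = set(crawl_a), set(crawl_b)
    return len(a & b) / len(a) if a else 0.0

def servers(urls):
    """Reduce a collection of URLs to the set of servers hosting them."""
    return {urlsplit(url).netloc for url in urls}

# Two hypothetical crawls that share both servers but only one URL.
CRAWL_A = ["http://a.example/x", "http://b.example/y"]
CRAWL_B = ["http://a.example/z", "http://b.example/y"]
```

Here the URL overlap is 0.5 while the server overlap is 1.0, which shows why the two are measured separately.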

Page 27: Harvest rate

[Figure: two plots of average relevance (0 to 1) versus #URLs fetched. Unfocused: "Harvest Rate (Cycling, Unfocused)", 0 to 10000 URLs, average over 100. Focused: "Harvest Rate (Cycling, Soft Focus)", 0 to 6000 URLs, averages over 100 and over 1000.]

Page 28: Crawl robustness

[Figure: two plots titled "Crawl Robustness (Cycling)" comparing Crawl A and Crawl B. One shows URL overlap (0 to 0.9) versus #URLs crawled (0 to 3000); the other shows server overlap (0 to 1) versus #URLs crawled (0 to 3000). Legend: Overlap1, Overlap2.]

Page 29: Top resources after one hour

  Recreational and competitive cycling
    http://www.truesport.com/Bike/links.htm
    http://reality.sgi.com/employees/billh_hampton/jrvs/links.html
    http://www.acs.ucalgary.ca/~bentley/mark_links.html
  HIV/AIDS research and treatment
    http://www.stopaids.org/Otherorgs.html
    http://www.iohk.com/UserPages/mlau/aidsinfo.html
    http://www.ahandyguide.com/cat1/a/a66.htm
  Purer and better than root set

Page 30: Distance to best resources

[Figure: two histograms of #servers in top 100 versus min. distance from crawl seed (#links). "Resource Distance (Mutual Funds)": distances 1 to 15, counts 0 to 35. "Resource Distance (Cycling)": distances 1 to 12, counts 0 to 18.]

  Cycling: cooperative
  Mutual funds: competitive
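
The minimum link distance from the crawl seeds to each page can be obtained with a breadth-first search over the crawled link graph. The sketch below assumes the graph is available as an in-memory adjacency mapping; the name `min_link_distance` and the toy graph are hypothetical.

```python
from collections import deque

def min_link_distance(links, seeds):
    """Map each reachable page to its minimum number of links from any seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in distance:  # first visit is the shortest path
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance

# Toy link graph: page c is reachable both via a and via a -> b.
LINKS = {"s": ["a"], "a": ["b", "c"], "b": ["c"], "c": []}
```

Counting how many top-rated servers fall at each distance gives a histogram of the kind described on this slide.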

Page 31: Robustness of resource discovery

  Sample disjoint sets of starting URLs
  Two separate crawls
  Find best authorities
  Order by rank
  Find overlap in the top-rated resources

[Figure: "Resource Robustness (Cycling)": server overlap (0 to 1) versus #top resources (0 to 25). Legend: Overlap1, Overlap2.]
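
The last step, overlap among the top-rated resources, can be sketched as a function of the cutoff k. The slides do not define Overlap1 and Overlap2, so the symmetric definition below (shared servers divided by k) is an assumption, as are the function name and the server names.

```python
def top_k_overlap(ranked_a, ranked_b, k):
    """Fraction of the top-k servers that the two rankings share."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    return len(top_a & top_b) / k

# Best authorities from two hypothetical crawls, ordered by rank.
RANKED_A = ["s1", "s2", "s3", "s4"]
RANKED_B = ["s2", "s1", "s5", "s3"]

# One point per cutoff, as in an overlap-versus-#top-resources plot.
curve = [top_k_overlap(RANKED_A, RANKED_B, k) for k in (1, 2, 4)]
```

The two crawls disagree on the single best server but agree fully on the top two.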

Page 32: Future work

  Harvest rate at different levels of taxonomy
    By definition harvest rate is 1 for root node
  Sociology of citations
    Build a gigantic citation matrix for web topics
  Further enhance resource finding skills
    Semi-structured queries
    Suspicious link neighborhoods, e.g., traffic radar manufacturer and auto insurance company

Page 33: Related work

  WebWatcher, HotList&ColdList
    Filtering as post-processing, not acquisition
  Fish search, WebCrawler
    Crawler guided by query keyword matches
  Ahoy!, Cora
    Hand-crafted to find home pages and papers
  ReferralWeb
    Social network on the Web

Page 34: Conclusion

  New architecture for example-driven topic-specific web resource discovery
  No dependence on full web crawl and index
  Modest desktop hardware adequate
  Variable radius goal-directed crawling
  High harvest rate
  High quality resources found far from keyword query response nodes

Page 35: References

  [email protected]
  www.cs.berkeley.edu/~soumen/
    www8focus.pdf
    sigmod98.ps
  www.almaden.ibm.com/cs/k53/ir.html