December 20, 2002 CUL Metadata WG Meeting 1

Focused Crawling and Collection Synthesis

Donna Bergmark

Cornell Information Systems

Outline

• Crawlers

• Collection Synthesis

• Focused Crawling

• Some Results

• Student Project (Fall 2002)

Definition

Spider = robot = crawler

Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.

Crawlers – some background

• Resource discovery

• Crawlers and internet history

• Crawling and crawlers

• Mercator

Resource Discovery

• Finding info on the Web

– Surfing (random strategy, goal is serendipity)

– Searching (inverted indices; specific info)

– Crawling (“all” the info)

• Uses for crawling

– Find stuff

– Gather stuff

– Check stuff

Crawlers and internet history

• 1991: HTTP

• 1992: 26 servers

• 1993: 60+ servers; self-register; archie

• 1994 (early) – first crawlers

• 1996 – search engines abound

• 1998 – focused crawling

• 1999 – web graph studies

• 2002 – use for digital libraries

Crawling and Crawlers

• Web overlays the internet

• A crawl overlays the web

[Figure label: seed]

Crawler Issues

• The web is so big

• Visit Order

• The URL itself

• Politeness

• Robot Traps

• The hidden web

• System Considerations

Standard for Robot Exclusion

• Martijn Koster (1994)

• http://any-server:80/robots.txt

• Maintained by the webmaster

• Forbid access to pages, directories

• Commonly excluded: /cgi-bin/

• Adherence is voluntary for the crawler
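
A polite crawler can check the exclusion rules before each fetch. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt text and the `example-crawler` user-agent name are made up for illustration, and the rules are parsed from a string so that nothing is fetched from the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents that a webmaster might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in-memory rules, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("example-crawler", "http://any-server:80/cgi-bin/search"))
    print(allowed("example-crawler", "http://any-server:80/index.html"))
```

With these rules the /cgi-bin/ URL is reported as disallowed and the other URL as allowed. Nothing enforces the answer: the crawler has to choose to honor it.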

Robot Traps

• Cycles in the Web graph

• Infinite links on a page

• Traps set out by the Webmaster
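
Two common defenses against these traps are remembering which URLs have already been seen (for cycles) and bounding depth and links per page (for endless link chains). The sketch below is one possible guard and makes no claim about how Mercator does it. The graph is a toy dictionary standing in for the Web, and `max_depth` and `max_links_per_page` are hypothetical parameters.

```python
from collections import deque


def bounded_crawl(graph, seed, max_depth=3, max_links_per_page=100):
    """Breadth-first walk over graph (dict: page -> list of linked pages).

    The seen set stops cycles; the depth and per-page link limits stop
    endless chains of generated links.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # do not expand pages at the depth limit
        for link in graph.get(page, [])[:max_links_per_page]:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


if __name__ == "__main__":
    cyclic = {"a": ["b"], "b": ["c", "a"], "c": ["a"]}
    print(bounded_crawl(cyclic, "a"))
    chain = {i: [i + 1] for i in range(10)}  # stands in for an endless chain
    print(bounded_crawl(chain, 0, max_depth=3))
```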

The Hidden Web

• Dynamic pages increasing

• Subscription pages

• Username and password pages

• Research in progress on how crawlers can “get into” the hidden web

System Issues

• Crawlers are complicated systems

• Efficiency is of utmost importance

• Crawlers are demanding of system and network resources

Mercator Features

• Written in Java

• One file configures a crawl

• Can add your own code

– Extend one or more of Mercator’s base classes

– Add totally new classes called by your own

• Industrial-strength crawler:

– uses its own DNS and java.net package

Collection Synthesis

• The NSDL

– National Science Digital Library

– Educational materials for K-thru-grave

– A collection of digital collections

• Collection (automatically derived)

– 20-50 items on a topic, represented by their URLs, expository in nature, precision trumps recall

Crawler is the Key

• A general search engine is good for precise results, few in number

• A search engine must cover all topics, not just scientific

• For automatic collection assembly, a Web crawler is needed

• A focused crawler is the key

Focused Crawling

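[Figure: two crawl trees rooted at R with numbered pages, one labeled “Breadth-first crawl” and one labeled “Focused crawl”; in the focused crawl, two branches are marked X.]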

Collections and Clusters

• Traditional – document universe is divided into clusters, or collections

• Each collection represented by its centroid

• Web – size of document universe is infinite

• Agglomerative clustering is used instead

• Two aspects:

– Collection descriptor

– Rule for when items belong to that Collection

[Figure: two panels labeled Q = 0.2 and Q = 0.6.]

The Setup

A virtual collection of items about Chebyshev Polynomials

Adding a Centroid

An empty collection of items about Chebyshev Polynomials

Document Vector Space

• Classic information retrieval technique

• Each word is a dimension in N-space

• Each document is a vector in N-space

Example: <0, 0.003, 0, 0, 0.01, 0.984, 0, 0.001>

• Normalize the weights

Both the “centroid” and the downloaded document are term vectors
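
The vector-space idea can be made concrete with a small sketch. It assumes raw word counts as weights and a very simple tokenizer; the function names `term_vector` and `correlation` are made up for illustration. Once both vectors are normalized to unit length, their dot product is the cosine correlation between a document and a centroid.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map each lowercase word to a weight, normalized to unit length."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def correlation(u, v):
    """Cosine correlation of two unit-length sparse vectors (dot product)."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


if __name__ == "__main__":
    centroid = term_vector("chebyshev polynomials orthogonal polynomials")
    page = term_vector("Chebyshev polynomials are orthogonal")
    print(round(correlation(centroid, page), 3))
```

Only the dimensions that are non-zero in both vectors contribute, so sparse dictionaries are enough even when N is large.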

Agglomerate

A collection with 3 items about Chebyshev Polynomials

Where does the Centroid come from?

“Chebyshev Polynomials”

A really good centroid for a collection about C.P.’s

Building a Centroid

1. Google(“Chebyshev Polynomials”) → {url1 … url-n}

2. Let H be a hash (k,v) where k = word, v = freq

3. For each url in {url1 … url-n} do

D ← download(url)

V ← term vector(D)

For each term t in V do

If t not in H, add it with value 0

H(t)++

4. Compute tf-idf weights. C ← top 20 terms.
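
The steps above can be sketched in a few lines. This is a sketch under stated assumptions: the search and download steps are skipped, so `documents` is a hypothetical list of already tokenized pages; the IDF values are assumed to be supplied from elsewhere; and the counter is incremented once per term occurrence, although counting each document once is an equally plausible reading of step 3.

```python
import math
from collections import Counter


def build_centroid(documents, idf, top_k=20):
    """Build a centroid from tokenized documents.

    documents: list of token lists (assumed already downloaded and tokenized).
    idf: dict mapping term -> inverse document frequency (assumed given).
    Returns a unit-length dict of the top_k terms by tf-idf weight.
    """
    freq = Counter()                  # the hash H: word -> frequency
    for tokens in documents:
        freq.update(tokens)           # H(t)++ for every occurrence of t
    weights = {t: f * idf.get(t, 0.0) for t, f in freq.items()}
    # Sort by descending weight; ties are broken alphabetically.
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    norm = math.sqrt(sum(w * w for _, w in top))
    return {t: w / norm for t, w in top} if norm else {}


if __name__ == "__main__":
    docs = [["chebyshev", "polynomial", "the"], ["chebyshev", "the", "the"]]
    idf = {"chebyshev": 3.0, "polynomial": 2.0, "the": 0.1}
    print(build_centroid(docs, idf, top_k=2))
```

In this toy input the very frequent word “the” drops out of the top terms because its IDF is low, which is the point of weighting by tf-idf rather than by raw frequency.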

Dictionary

• Given centroids C1, C2, C3 …

• Dictionary is C1 + C2 + C3 …

– Terms are union of terms in Ci

– Term Frequencies are total frequency in Ci

– Document Frequency is how many C’s have t

– Term IDF is as from Berkeley

• Dictionary is 300-500 terms
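
Merging the centroids into a dictionary amounts to two counters, one for total term frequency and one for the number of centroids containing each term. In the sketch below the function name and the input shape (one term-to-frequency dict per centroid) are assumptions, and the externally sourced IDF values are not reproduced.

```python
from collections import Counter


def build_dictionary(centroid_term_freqs):
    """Merge per-centroid term frequencies into one dictionary.

    centroid_term_freqs: list of dicts, one per centroid, term -> frequency.
    Returns (term_freq, doc_freq): total frequency of each term across the
    centroids, and the number of centroids that contain it.
    """
    term_freq = Counter()
    doc_freq = Counter()
    for tf in centroid_term_freqs:
        term_freq.update(tf)          # adds this centroid's frequencies
        doc_freq.update(tf.keys())    # counts each term once per centroid
    return term_freq, doc_freq


if __name__ == "__main__":
    c1 = {"chebyshev": 5, "polynomial": 3}
    c2 = {"polynomial": 2, "fourier": 4}
    tf, df = build_dictionary([c1, c2])
    print(sorted(tf.items()))
    print(sorted(df.items()))
```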

Focused Crawling

• Recall the cartoon for a focused crawl:

[Figure: the focused-crawl tree rooted at R with numbered pages; two branches are marked X.]

• A simple way to do it is with 2 “knobs”

Focusing the Crawl

• Threshold: page is on-topic if correlation to the closest centroid is above this value

• Cutoff: follow links from pages whose “distance” from closest on-topic ancestor is less than the cutoff
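
The two knobs can be tried out on a toy graph. The sketch below is not the crawler used in the talk. It assumes an on-topic page has distance 0, that each off-topic step adds 1, that distance is measured along the path by which a page was first discovered, and that a page counts as on-topic when its correlation is at or above the threshold. The names `links`, `corr`, and `focused_crawl` are hypothetical.

```python
from collections import deque


def focused_crawl(links, corr, seed, threshold=0.3, cutoff=1):
    """Return the on-topic pages of a toy web graph, in visit order.

    links: dict page -> list of linked pages.
    corr: dict page -> correlation with the closest centroid.
    """
    collected = []
    seen = {seed}
    queue = deque([(seed, 0)])  # (page, distance from on-topic ancestor)
    while queue:
        page, dist = queue.popleft()
        if corr.get(page, 0.0) >= threshold:
            collected.append(page)
            dist = 0            # an on-topic page resets the distance
        if dist >= cutoff:
            continue            # follow links only while distance < cutoff
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return collected


if __name__ == "__main__":
    links = {"R": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": ["e"], "e": []}
    corr = {"R": 0.9, "a": 0.5, "b": 0.1, "c": 0.4, "d": 0.8, "e": 0.7}
    print(focused_crawl(links, corr, "R", cutoff=1))
    print(focused_crawl(links, corr, "R", cutoff=2))
```

With cutoff 1 the crawl stops at the off-topic page b. With cutoff 2 it passes through b and reaches the on-topic pages d and e behind it.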

Illustration

[Figure: crawl tree with pages numbered 1 to 7, drawn for Cutoff = 1; legend: Corr >= threshold.]

Min-avg-max correlation vs. crawl length

[Chart: correlation (0 to 0.8) vs. no. documents downloaded (0 to 120,000); series: Maximum, Average, Minimum; annotations: Closest, Furthest.]

Collection “Evaluation”

• Assume higher correlations are good

• With human relevance assessments, one can also compute a “precision” curve

• Precision P(n) after considering the n most highly ranked items is the number of relevant items among them, divided by n.
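
The precision definition is short enough to write out directly. In this sketch the relevance judgments are a made-up list of booleans in rank order, and `precision_at` is a hypothetical helper name.

```python
def precision_at(relevance, n):
    """Precision after the n most highly ranked items.

    relevance: list of booleans in rank order (True = judged relevant).
    """
    if not 1 <= n <= len(relevance):
        raise ValueError("n must be between 1 and the number of ranked items")
    return sum(relevance[:n]) / n


if __name__ == "__main__":
    judgments = [True, True, False, True]  # made-up human assessments
    for n in range(1, len(judgments) + 1):
        print(n, precision_at(judgments, n))
```

Plotting P(n) against n for each ranked list gives one precision curve per list.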

Cutoff = 0, Threshold = 0.3

Precision vs. Rank

[Chart: precision (0 to 1.2) vs. rank (0 to 60); series: Crawling, Google.]

Tunneling with Cutoff

• Nugget – dud – dud… - dud – nugget

Notation: 0 – X – X … - X – 0

• Fixed cutoff: 0 – X1 – X2 - … Xc

• Adaptive cutoff: 0 – X1 – X2 - … X?

Statistics Collected

• 500,000 documents

• Number of seeds: 4

• Path data for all but seeds

• 6620 completed paths (0-x…x-0)

• 100,000s of incomplete paths (0-x…x..)

Nuggets that are x steps from a nugget

[Chart: # nuggets (0 to 1000) vs. X, the number of links from a nugget (0 to 14).]

Nuggets that are x steps from a seed and/or a nugget

[Chart: # nuggets (0 to 1200) vs. X, the number of links (0 to 14); series: from nugget, from seeds.]

Better parents have better children.

[Chart: number of nodes (axis 0 to 0.25) vs. correlation bracket (1 to 17); series: General Population, children of .45-.5 nodes.]

Using the Empirical Observations

• Use the path history

• Use the page quality - cosine correlation

• Current distance should increase exponentially as you get away from quality nodes

Distance = 0 if this is a nugget; otherwise: 1 or (1 − corr) × exp(2 × parent’s distance / cutoff)

Results

• Details in the ECDL paper

• Smaller frontier → more docs/second

• More documents downloaded in same time

• Higher-scoring documents were downloaded

• Cutoff of 20 averaged 7 steps at the cutoff

Fall 2002 Student Project

[Diagram with components labeled: Query; Mercator; Centroid; Collection Description; Term vectors; Centroids, Dictionary; Collection URLs; Chebyshev P.s; HTML.]

Conclusion

• We’ve covered crawling – history, technology, use

• Focused crawling with tunneling

• Adaptive cutoff with tunneling

• We have a good experimental setup for exploring automatic collection synthesis