Focused Crawling for Vertical Search

Marcelo Mendoza - JCC 2011 - Curicó, Chile - 11.11.11

description

A tutorial for focused crawling presented at the JCC, Curicó, Chile, 2011.

Transcript of Focused Crawling for Vertical Search

Page 1: Focused Crawling for Vertical Search

Focused Crawling for Vertical Search

Marcelo Mendoza

11.11.11

- JCC 2011 - Curico, Chile - 11.11.11 1 / 40

Page 2: Focused Crawling for Vertical Search

Overview

1 Vertical Search

2 Crawling

3 State-of-the-art

4 Conclusion

Page 3: Focused Crawling for Vertical Search

Focused Crawling for Vertical Search Vertical Search

Why Does Web Vertical Search Matter?

Web size: More than 20 billion pages.

Millions of users, millions of queries, millions of needs.

Advantages:
1 Greater precision due to limited scope
2 Leverage domain knowledge (ontologies)

Domains: business, medicine, science, education, ...

Page 4: Focused Crawling for Vertical Search

Science Vertical Search

scienceresearch.com

Page 5: Focused Crawling for Vertical Search

Business Vertical Search

biznar.com

Page 6: Focused Crawling for Vertical Search

Education Vertical Search

contentcompass.cl (Fondef D08I1155)

Page 7: Focused Crawling for Vertical Search

Focused Crawling for Vertical Search Crawling

Hyperlinks among web pages

Page 8: Focused Crawling for Vertical Search

The Web as a graph

web pages

hyperlinks
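As a minimal sketch of this graph view, pages can be stored as nodes and hyperlinks as directed edges in an adjacency list. The page names below are hypothetical and only illustrate the data structure.

```python
from collections import defaultdict

# Hypothetical toy web: each page (node) maps to the pages it links to (edges).
web = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}

def in_degrees(graph):
    """Count incoming hyperlinks for every page in the graph."""
    counts = defaultdict(int)
    for source, targets in graph.items():
        counts[source] += 0  # make sure pages without in-links still appear
        for target in targets:
            counts[target] += 1
    return dict(counts)

degrees = in_degrees(web)
print(degrees)
```

Out-links are read directly from the dictionary, while in-links need a pass over all edges, which is one reason backlink information is expensive for a crawler to obtain.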

Page 9: Focused Crawling for Vertical Search

The Web: Some facts

The size of the Web: 11.5 billion pages (indexable, 2005).

The deep Web: available by querying databases.

Static / dynamic pages.

Graph model: Scale-free network, degree distribution ≈ power law.

The Web structure: Bow-tie model (IN/SCC/OUT/ISLANDS).

Page 10: Focused Crawling for Vertical Search

Crawler architecture

Online resource: C. Castillo, Effective Web Crawling (PhD Thesis) URL

Page 11: Focused Crawling for Vertical Search

Crawling strategies

Breadth-first crawlers: URL frontier implemented as a FIFO queue.

Preferential crawlers: URL frontier implemented as a priority queue.

Priority scores:
1 Topological properties (e.g. indegree of the target page).
2 Content properties (e.g. similarity between a query and the source page).
3 Hybrid measures.
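The two frontier types can be sketched as follows. The class names, the `push`/`pop` interface and the example scores are assumptions made for illustration; the score could be any of the priority measures listed above.

```python
import heapq
from collections import deque

class FifoFrontier:
    """Breadth-first frontier: URLs leave in the order they arrived."""
    def __init__(self):
        self._queue = deque()

    def push(self, url, score=0.0):
        self._queue.append(url)  # the score is ignored on purpose

    def pop(self):
        return self._queue.popleft()

class PriorityFrontier:
    """Preferential frontier: the highest-scoring URL leaves first."""
    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker that keeps insertion order stable

    def push(self, url, score=0.0):
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

candidates = [("u1", 0.2), ("u2", 0.9), ("u3", 0.5)]
fifo, preferential = FifoFrontier(), PriorityFrontier()
for url, score in candidates:
    fifo.push(url, score)
    preferential.push(url, score)

fifo_order = [fifo.pop() for _ in candidates]
priority_order = [preferential.pop() for _ in candidates]
print(fifo_order, priority_order)
```

Only the frontier changes between the two strategies; the fetch-parse-enqueue loop around it can stay the same.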

Page 12: Focused Crawling for Vertical Search

Universal / Focused crawling

Universal crawlers: General purpose.

Challenges:
1 Scalability
2 Coverage / Freshness

Focused crawlers: We may want to crawl pages in certain topics.

Challenges:
1 Coverage / Accuracy

Page 13: Focused Crawling for Vertical Search

Focused Crawling

Breadth-first: depth 1

[Figure: Seed and Target pages]

Page 14: Focused Crawling for Vertical Search

Focused Crawling

Breadth-first: depth 2

[Figure: Seed and Target pages]

Page 15: Focused Crawling for Vertical Search

Focused Crawling

Breadth-first: depth 3

[Figure: Seed and Target pages]

Page 16: Focused Crawling for Vertical Search

Focused Crawling

Breadth-first: unreachable pages, excessive computational costs!

[Figure: Seed and Target pages]
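The cost problem can be illustrated with a small experiment. The sketch below assumes a hypothetical complete tree with a fixed branching factor (real Web graphs are not trees) and counts how many pages a breadth-first crawler fetches before it reaches a target at depth 3.

```python
from collections import deque

def bfs_pages_fetched(graph, seed, target):
    """Return (depth of target or None, number of pages fetched)."""
    visited = {seed}
    queue = deque([(seed, 0)])
    fetched = 0
    while queue:
        page, depth = queue.popleft()
        fetched += 1
        if page == target:
            return depth, fetched
        for nxt in graph.get(page, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return None, fetched  # target not reachable from the seed

def full_tree(branching, depth):
    """Build a complete tree with integer node ids; return (graph, leaves)."""
    graph, next_id, level = {}, 1, [0]
    for _ in range(depth):
        new_level = []
        for node in level:
            children = list(range(next_id, next_id + branching))
            next_id += branching
            graph[node] = children
            new_level.extend(children)
        level = new_level
    return graph, level

graph, leaves = full_tree(branching=3, depth=3)
depth, fetched = bfs_pages_fetched(graph, seed=0, target=leaves[-1])
unreachable = bfs_pages_fetched(graph, seed=0, target="island")
print(depth, fetched, unreachable)
```

With branching factor b, reaching depth d can require on the order of b^d fetches, and a page with no path from the seed is never found however long the crawl runs.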

Page 17: Focused Crawling for Vertical Search

Focused Crawling for Vertical Search State-of-the-art

Early algorithms: Fish search

De Bra, P., and Post, R. (1994)

Query (keywords), source page terms, term-based distance, best-first
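A simplified sketch in the spirit of fish search is shown below; it is not the authors' implementation. It assumes a keyword match as the relevance test, a depth budget that is reset below relevant pages and reduced below irrelevant ones, and children of relevant pages placed at the front of the queue. The graph, page texts and parameter names are hypothetical.

```python
from collections import deque

def fish_search(graph, texts, seed, keywords, max_depth=1):
    """Visit pages fish-search style and return them in visit order."""
    def is_relevant(page):
        words = texts.get(page, "").lower().split()
        return any(k.lower() in words for k in keywords)

    seen = {seed}
    queue = deque([(seed, max_depth)])
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if depth <= 0:
            continue  # depth budget used up: do not expand this page
        relevant = is_relevant(page)
        # Relevant pages reset the budget; irrelevant pages consume it.
        child_depth = max_depth if relevant else depth - 1
        children = []
        for child in graph.get(page, []):
            if child not in seen:
                seen.add(child)
                children.append((child, child_depth))
        if relevant:
            queue.extendleft(reversed(children))  # explore these first
        else:
            queue.extend(children)
    return visited

graph = {"seed": ["a", "b"], "a": ["a1"], "a1": ["a2"], "b": ["b1"]}
texts = {
    "seed": "bach music", "a": "weather report", "a1": "sports news",
    "a2": "bach cantatas", "b": "bach biography", "b1": "contact page",
}
shallow = fish_search(graph, texts, "seed", ["bach"], max_depth=1)
deeper = fish_search(graph, texts, "seed", ["bach"], max_depth=2)
print(shallow, deeper)
```

With a budget of 1 the relevant page `a2` is missed because it sits behind two irrelevant pages; a budget of 2 reaches it at the price of fetching more off-topic pages.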

Page 18: Focused Crawling for Vertical Search

Early algorithms: Shark search

Hersovici et al. (1998)

Query (keywords), anchor text, term-based distance, best-first
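The sketch below illustrates the kind of scoring shark search is usually described with: a child URL inherits a decayed score from its parent and combines it with a score from its anchor text and the text around the anchor. The term-overlap similarity, the parameter names (`decay`, `beta`, `gamma`) and their default values are assumptions chosen for illustration.

```python
def similarity(query, text):
    """Fraction of query terms found in the text (toy stand-in for cosine)."""
    query_terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(query_terms & words) / len(query_terms) if query_terms else 0.0

def inherited_score(parent_sim, parent_inherited, decay=0.5):
    """A relevant parent passes on its own similarity, otherwise what it inherited."""
    base = parent_sim if parent_sim > 0 else parent_inherited
    return decay * base

def potential_score(query, anchor, context, parent_sim, parent_inherited,
                    decay=0.5, beta=0.8, gamma=0.5):
    """Combine the inherited score with anchor and anchor-context evidence."""
    anchor_score = similarity(query, anchor)
    # If the anchor itself matches, the context is treated as fully relevant.
    context_score = 1.0 if anchor_score > 0 else similarity(query, context)
    neighbourhood = beta * anchor_score + (1 - beta) * context_score
    inherited = inherited_score(parent_sim, parent_inherited, decay)
    return gamma * inherited + (1 - gamma) * neighbourhood

query = "focused crawling"
good = potential_score(query, "focused crawling tutorial", "", 0.5, 0.0)
weak = potential_score(query, "click here", "more about crawling", 0.5, 0.0)
none = potential_score(query, "click here", "contact us", 0.0, 0.0)
print(good, weak, none)
```

Unlike a binary relevant/irrelevant test, the continuous score lets the frontier be ordered best-first.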

Page 19: Focused Crawling for Vertical Search

Early algorithms: ARACHNID

Menczer, F. (1997)

Multi-agent, evolutionary-inspired: mutation (new seeds), fitness (score acc.), term-based scores.

Page 20: Focused Crawling for Vertical Search

Context: Link Analysis

The Web graph as an information source (beyond the text)

Kleinberg, J. (1998)

HITS: authoritative pages (OUT), hub pages (IN).
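A minimal sketch of the HITS iteration on a hypothetical toy graph follows. It uses the standard mutual reinforcement: a page's authority score sums the hub scores of the pages linking to it, and its hub score sums the authority scores of the pages it links to. The fixed iteration count is an assumption.

```python
def normalise(scores):
    """Scale scores to unit Euclidean length (all zeros stay zero)."""
    norm = sum(v * v for v in scores.values()) ** 0.5
    return {p: (v / norm if norm else 0.0) for p, v in scores.items()}

def hits(graph, iterations=50):
    """Return (hub, authority) score dictionaries for a link graph."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum hub scores over incoming links.
        auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: sum authority scores over outgoing links.
        hub = {p: sum(auth[t] for t in graph.get(p, [])) for p in pages}
        auth, hub = normalise(auth), normalise(hub)
    return hub, auth

toy = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(toy)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```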

Brin, S. & Page, L. (1998)

PageRank: Random walk over the Web graph, stationary probability vector.
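The stationary vector of the random walk can be approximated by power iteration. The sketch below assumes a damping factor of 0.85, uniform teleportation and uniform redistribution of rank from pages without out-links; the toy graph is hypothetical.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Approximate the stationary distribution of the random surfer."""
    pages = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for source in pages:
            targets = graph.get(source, [])
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

toy_web = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}
ranks = pagerank(toy_web)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The scores sum to one, so each value can be read as the probability of finding the random surfer on that page.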

Page 21: Focused Crawling for Vertical Search

Link-based algorithms

Cho, J., Garcia-Molina, H., and Page, L. (1998)

Link-based scores: Backlinks count, PageRank

Chakrabarti, S., Van den Berg, M., and Dom, B. (1999)

Topic distillation: Text-based classifier over web page examples per category (off-line dataset construction, human labeling, content text, positive and negative examples).
On-line phase: Anchor-based score (ML) + HITS-based score for distillation.

Page 22: Focused Crawling for Vertical Search

Link-based algorithms: Basic assumptions

[Figure: Seed and Target pages]

Davison, B. (2000)

Topical locality: Locality based on anchor text and links.

Page 23: Focused Crawling for Vertical Search

Link-based algorithms: Basic assumptions

Menczer, F. (2004)

Link cluster conjecture: Related pages tend to be linked.

Page 24: Focused Crawling for Vertical Search

Link-based algorithms: Backlink graph

Considering how far away the target is: Layered backlink graph!

Diligenti et al. (2000)

Using the backlink graph for multiclass learning. Greedy approach.

Babaria et al. (2007)

Using the backlink graph for ordinal regression. Greedy approach.
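A layered backlink graph can be sketched as a breadth-first search over reversed links starting from the known target pages: layer 0 holds the targets, layer 1 the pages that link to them, and so on. The function name, the `max_layers` parameter and the toy graph are assumptions; the cited works go further and train a classifier (multiclass or ordinal) to predict the layer of an unseen page.

```python
from collections import deque

def backlink_layers(graph, targets, max_layers=3):
    """Map each page to its link distance from the nearest target page."""
    # Reverse the graph so that backlinks can be followed.
    reverse = {}
    for source, outlinks in graph.items():
        for target in outlinks:
            reverse.setdefault(target, set()).add(source)

    layer_of = {t: 0 for t in targets}
    queue = deque(targets)
    while queue:
        page = queue.popleft()
        if layer_of[page] == max_layers:
            continue  # do not build layers beyond the limit
        for parent in sorted(reverse.get(page, ())):
            if parent not in layer_of:
                layer_of[parent] = layer_of[page] + 1
                queue.append(parent)
    return layer_of

toy = {
    "seed": ["hub"],
    "hub": ["list"],
    "list": ["target"],
    "other": ["target"],
}
layers = backlink_layers(toy, ["target"])
print(layers)
```

A crawler that can estimate the layer of a fetched page can then give priority to links found on pages in the lowest layers.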

Page 25: Focused Crawling for Vertical Search

Off-line learning-based algorithms

Kinds of features

The content of the web pages which are known to link to the candidate URL.

URL tokens from the candidate URL.
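URL tokens can be extracted without fetching the page, as in the sketch below. The stop-token list and the example URL are assumptions for illustration.

```python
import re
from urllib.parse import urlparse

# Assumed list of uninformative tokens; a real system would tune this.
STOP_TOKENS = {"", "www", "http", "https", "html", "htm", "php", "com"}

def url_tokens(url):
    """Split host, path and query of a URL into lowercase word tokens."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc} {parsed.path} {parsed.query}".lower()
    tokens = re.split(r"[^a-z0-9]+", raw)
    return [t for t in tokens if t not in STOP_TOKENS]

tokens = url_tokens("http://www.example.com/music/bach-cantatas.html?lang=en")
print(tokens)
```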

Page 26: Focused Crawling for Vertical Search

Off-line learning-based algorithms

Rennie & McCallum (1999)

1st stage (Off-line): Text-based features (anchor + header + title of the target).
2nd stage (On-line): Candidate URL scoring based on the text classifier (candidate URL (anchor + URL text)).

Li et al. (2005)

1st stage (Off-line): ID3 learning strategy. Anchor text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classifier (candidate URL (anchor)).
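The two-stage pattern (train off-line, score candidate URLs on-line) can be sketched with a small text classifier over anchor text. Naive Bayes with Laplace smoothing is used here only as a stand-in learner; it is not the method of the works cited on this slide (Li et al., for example, use ID3). The training examples are hypothetical.

```python
import math
from collections import Counter

class AnchorTextNB:
    """Tiny two-class multinomial Naive Bayes over anchor-text words."""

    def fit(self, examples):
        """Off-line stage. examples: list of (anchor_text, is_relevant)."""
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter()
        for text, label in examples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(text.lower().split())
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_joint(self, words, label):
        counts = self.word_counts[label]
        total = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
        prior = self.doc_counts[label] / sum(self.doc_counts.values())
        return math.log(prior) + sum(
            math.log((counts[w] + 1) / total) for w in words if w in self.vocab
        )

    def score(self, anchor_text):
        """On-line stage: estimated probability that the link is relevant."""
        words = anchor_text.lower().split()
        pos = self._log_joint(words, True)
        neg = self._log_joint(words, False)
        top = max(pos, neg)  # subtract the maximum for numerical stability
        return math.exp(pos - top) / (math.exp(pos - top) + math.exp(neg - top))

training = [
    ("bach cantatas scores", True),
    ("baroque music archive", True),
    ("cheap flights deals", False),
    ("football results today", False),
]
model = AnchorTextNB().fit(training)
print(model.score("bach music"), model.score("cheap football"))
```

The returned probability can be used directly as the priority of the candidate URL in the frontier.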

Page 27: Focused Crawling for Vertical Search

Off-line learning-based algorithms

Pant & Srinivasan (2006)

1st stage (Off-line): SVM learning strategy. Content text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classifier (candidate URL (surrounding text)).

Feng et al. (2010)

1st stage (Off-line): Term-based weights. Weighted graph construction.
2nd stage (Off-line): PageRank over the weighted graph.
3rd stage (Off-line): Labeling based on PageRank. Term-based learning.
4th stage (On-line): Candidate URL scoring based on the text classifier (candidate URL (anchor)).

Page 28: Focused Crawling for Vertical Search

Machine Learning-based adaptive algorithms

Learning on-the-fly from the context

Page 29: Focused Crawling for Vertical Search

Machine Learning-based adaptive algorithms

Learning on-the-fly from the context

[Figure: a candidate URL and its context, labelled "Bach"]

Aggarwal et al. (2000)

1st stage (Off-line): Crawling for dataset construction. Human labeling (positive examples). Bayes learning strategy. Content text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classifier + feature selection based on interest ratio (candidate URL (anchor)).
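An interest ratio is commonly defined as I(C, E) = P(C | E) / P(C): how much more likely a page is to be relevant (C) when some evidence E is observed, such as a word in the anchor text. The sketch below estimates it from pages crawled so far; the data layout, the function name and the fallback value of 1.0 when there is no evidence are assumptions.

```python
def interest_ratio(pages, feature):
    """Estimate P(relevant | feature) / P(relevant) from crawled pages.

    pages: list of (set_of_features, is_relevant) for pages already fetched.
    """
    total = len(pages)
    relevant = sum(1 for _, ok in pages if ok)
    with_feature = [ok for features, ok in pages if feature in features]
    if total == 0 or relevant == 0 or not with_feature:
        return 1.0  # no evidence either way: treat the feature as neutral
    p_relevant = relevant / total
    p_relevant_given_feature = sum(with_feature) / len(with_feature)
    return p_relevant_given_feature / p_relevant

crawled = [
    ({"bach"}, True),
    ({"bach", "music"}, True),
    ({"news"}, False),
    ({"music"}, False),
]
ratios = {f: interest_ratio(crawled, f) for f in ("bach", "music", "news")}
print(ratios)
```

A ratio above 1 marks a feature that favours relevance, a ratio below 1 marks one that speaks against it, and the estimates can be refreshed as the crawl proceeds.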

Page 30: Focused Crawling for Vertical Search

Machine Learning-based adaptive algorithms

Learning on-the-fly from the context

Chakrabarti et al. (2002)

1st stage (Off-line): Crawling for dataset construction. Human labeling (positive examples). Content text-based features.
2nd stage (On-line): Training from positive examples using fetched pages (more sophisticated features such as DOM tree).
3rd stage (On-line): URL scoring based on the apprentice learner.

Page 31: Focused Crawling for Vertical Search

Machine Learning-based adaptive algorithms

Learning to skip off-topic pages

[Figure: Seed and Target pages]

Page 32: Focused Crawling for Vertical Search

Machine Learning-based adaptive algorithms

Learning to skip off-topic pages

[Figure: paths from Seed to Target through pages annotated with relevance scores between 0.1 and 0.8; a low-scoring page is labelled "Dud"]

Page 33: Focused Crawling for Vertical Search

Machine Learning-based adaptive algorithms

Learning to skip off-topic pages: Tunneling!

Bergmark et al. (2002)

1st stage (Off-line): Crawling for dataset construction. Human labeling (positive examples). Content text-based features.
2nd stage (Off-line): Tunneling module construction. Cutoff threshold learning based on nugget-dud paths.
3rd stage (On-line): Apprentice tunneling learner. Adaptive cutoff based on paths evaluated by using fetched pages.
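The core idea of tunneling can be sketched with a fixed cutoff: the crawler keeps following links through off-topic pages ("duds") as long as the number of consecutive duds on the path stays within the cutoff. The fixed `cutoff`, the relevance `threshold` and the toy data are assumptions; the approach on this slide learns and adapts the cutoff instead.

```python
from collections import deque

def crawl_with_tunneling(graph, relevance, seed, threshold=0.5, cutoff=2):
    """Return (nuggets, fetched): relevant pages found and all pages fetched."""
    seen = {seed}
    queue = deque([(seed, 0)])  # (page, consecutive duds before this page)
    nuggets, fetched = [], []
    while queue:
        page, duds = queue.popleft()
        fetched.append(page)
        if relevance.get(page, 0.0) >= threshold:
            nuggets.append(page)
            duds = 0  # an on-topic page resets the tunnel length
        else:
            duds += 1
            if duds > cutoff:
                continue  # tunnel too long: abandon this path
        for nxt in graph.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, duds))
    return nuggets, fetched

chain = {"seed": ["d1"], "d1": ["d2"], "d2": ["target"]}
scores = {"seed": 0.8, "d1": 0.2, "d2": 0.1, "target": 0.9}
with_tunnel, _ = crawl_with_tunneling(chain, scores, "seed", cutoff=2)
no_tunnel, _ = crawl_with_tunneling(chain, scores, "seed", cutoff=0)
print(with_tunnel, no_tunnel)
```

A cutoff of 0 reproduces a strict focused crawler that never crosses an off-topic page, which is why it misses the target in this example.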

Page 34: Focused Crawling for Vertical Search

Machine Learning-based adaptive algorithms

Agents for path detection: Ants

Gasparetti & Micarelli (2004)

Close in aim to ARACHNID (multi-agent, multi-seed). Back-and-forth trips to relevant resources generate pheromone trails. Shortest paths attract more ants.

Page 35: Focused Crawling for Vertical Search

Ontology driven crawling strategies

Knowledge representation: Ontologies

[Figure: example ontology. Edge legend: sc = SubClassOf, dom = Domain, range = Range, i = InstanceOf, eq = Equivalent, sp = SubPropertyOf. Node labels: stadiums, national teams, coastal_city, city, country, sports, soccer, football, plays_in, Barcelona F.C., Camp Nou, Barcelona, Spain.]

Page 36: Focused Crawling for Vertical Search

Ontology driven crawling strategies

Ontology-based match expansion

Ehrig & Maedche (2003)

Relevance scoring.
1st stage: Concept matching (ontology + lexicon).
2nd stage: Ontology-based expansion.
3rd stage: Summarization.
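The three stages can be illustrated with a very small sketch. The lexicon, the relations between concepts, the one-hop expansion and the `decay` weight are all assumptions made for illustration; they are not taken from the cited work or from the ontology figure.

```python
# Assumed lexicon: surface words mapped to ontology concepts.
LEXICON = {
    "soccer": "soccer",
    "football": "soccer",
    "stadium": "stadiums",
    "sport": "sports",
    "sports": "sports",
}

# Assumed relations: each concept lists the concepts it is related to.
RELATED = {"soccer": ["sports"], "stadiums": ["soccer"]}

def relevance(text, decay=0.5):
    """Return (total score, per-concept scores) for a page text."""
    # Stage 1: concept matching through the lexicon.
    matched = {}
    for word in text.lower().split():
        concept = LEXICON.get(word.strip(".,;:!?"))
        if concept:
            matched[concept] = matched.get(concept, 0.0) + 1.0
    # Stage 2: one-hop expansion to related concepts, with a decayed weight.
    expanded = dict(matched)
    for concept, score in matched.items():
        for neighbour in RELATED.get(concept, []):
            expanded[neighbour] = expanded.get(neighbour, 0.0) + decay * score
    # Stage 3: summarization into a single relevance value.
    return sum(expanded.values()), expanded

total, per_concept = relevance("Football stadium news")
print(total, per_concept)
```

The expansion step lets a page score on the concept `sports` even though the word itself never occurs in the text.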

Page 37: Focused Crawling for Vertical Search

Ontology driven crawling strategies

Ontology-based learning strategy

Zheng et al. (2008)

Relevance scoring for fetched pages.
1st stage: Concept matching (ontology + lexicon), concept distances, document scoring.
2nd stage: ANN training.
3rd stage (On-line): Term-based URL scoring (ANN, anchor as input).

Page 38: Focused Crawling for Vertical Search

More features for unvisited URL scoring

Feng et al. (2010)

On-line PageRank + term scoring (anchor, surrounding)

Patel & Schmidt (2011)

Term scoring based on matching and document structure (structure of the current page).

Page 39: Focused Crawling for Vertical Search

Focused Crawling for Vertical Search Conclusion

Challenges

Precision / Recall trade-off

Benchmarking

Ontology IE for effective crawling

Unbiased seed identification

Efficiency issues (scalability,...)
