Focused Crawling for Vertical Search
Marcelo Mendoza
11.11.11
- JCC 2011 - Curico, Chile - 11.11.11 1 / 40
Overview
1 Vertical Search
2 Crawling
3 State-of-the-art
4 Conclusion
Vertical Search
Why Does Web Vertical Search Matter?
Web size: More than 20 billion pages.
Millions of users, millions of queries, millions of needs.
Advantages:
1 Greater precision due to limited scope
2 Leverage domain knowledge (ontologies)
Domains: business, medicine, science, education, ...
Science Vertical Search
scienceresearch.com
Business Vertical Search
biznar.com
Education Vertical Search
contentcompass.cl (Fondef D08I1155)
Crawling
Hyperlinks among web pages
The Web as a graph
Nodes: web pages
Edges: hyperlinks
The Web: Some facts
The size of the Web: about 11.5 billion indexable pages (2005 estimate).
The deep Web: content available only by querying databases.
Static / dynamic pages.
Graph model: Scale-free network, degree distribution ≈ power law.
The Web structure: Bow-tie model (IN/SCC/OUT/ISLANDS).
Crawler architecture
Online resource: C. Castillo, Effective Web Crawling (PhD Thesis) URL
Crawling strategies
Breadth-first crawlers: URL frontier implemented as a FIFO queue.
Preferential crawlers: URL frontier implemented as a priority queue.
Priority scores:
1 Topological properties (e.g. indegree of the target page).
2 Content properties (e.g. similarity between a query and the source page).
3 Hybrid measures.
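The two frontier disciplines above can be sketched as follows. This is a minimal illustration over a toy in-memory link graph, not a real crawler: the graph, the function names, and the score table are hypothetical, and the score stands in for any of the topological, content, or hybrid measures listed above.

```python
import heapq
from collections import deque

# Toy link graph (hypothetical): page -> pages it links to.
GRAPH = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["target"],
    "c": [],
    "target": [],
}

def crawl_bfs(graph, seed):
    """Breadth-first crawl: the URL frontier is a FIFO queue."""
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

def crawl_preferential(graph, seed, score):
    """Preferential crawl: the URL frontier is a priority queue.

    score(url) returns a priority; higher scores are fetched first.
    """
    counter = 0  # tie-breaker so equal scores keep insertion order
    frontier, seen, order = [(-score(seed), counter, seed)], {seed}, []
    while frontier:
        _, _, page = heapq.heappop(frontier)
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                counter += 1
                heapq.heappush(frontier, (-score(link), counter, link))
    return order

# Hypothetical scores: pretend "b" and "target" look on-topic.
SCORES = {"seed": 1.0, "a": 0.1, "b": 0.9, "c": 0.1, "target": 1.0}

bfs_order = crawl_bfs(GRAPH, "seed")
pref_order = crawl_preferential(GRAPH, "seed", SCORES.get)
print(bfs_order)   # ['seed', 'a', 'b', 'c', 'target']
print(pref_order)  # ['seed', 'b', 'target', 'a', 'c']
```

With these made-up scores the preferential crawler reaches the target on its third fetch, while the breadth-first crawler reaches it last.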
Universal / Focused crawling
Universal crawlers: General purpose.
Challenges:
1 Scalability
2 Coverage / Freshness
Focused crawlers: Crawl only pages on certain topics.
Challenges:
1 Coverage / Accuracy
Focused Crawling
[Figures: breadth-first crawl from the Seed page toward the Target page at depth 1, depth 2, and depth 3]
Breadth-first: unreachable pages, excessive computational costs!
State-of-the-art
Early algorithms: Fish search
De Bra, P., and Post, R. (1994)
Query (keywords), source page terms, term-based distance, best-first
Early algorithms: Shark search
Hersovici et al. (1998)
Query (keywords), anchor text, term-based distance, best-first
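The idea shared by these two early algorithms, scoring an unvisited link from what is known about the page that contains it, can be sketched as below. This is a simplified illustration in the spirit of Shark search, not the published algorithm: the term-overlap similarity is a stand-in for the term-based distance, and the names `term_similarity`, `potential_score`, `decay`, and `gamma` are assumptions introduced here.

```python
def term_similarity(query, text):
    """Fraction of query terms that occur in the text.

    A simple stand-in for a term-based distance between query and text.
    """
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def potential_score(query, parent_text, parent_inherited, anchor_text,
                    decay=0.5, gamma=0.5):
    """Score an unvisited link from its source page and its anchor text.

    decay and gamma are hypothetical weights chosen for illustration.
    """
    parent_sim = term_similarity(query, parent_text)
    # A relevant source page passes on its own similarity; an irrelevant
    # one passes on what it inherited. Both are damped by decay.
    inherited = decay * (parent_sim if parent_sim > 0 else parent_inherited)
    anchor_sim = term_similarity(query, anchor_text)
    return gamma * inherited + (1 - gamma) * anchor_sim

score_on = potential_score("focused crawling",
                           "a survey of focused crawling", 0.0,
                           "focused crawling tools")
score_off = potential_score("focused crawling",
                            "cooking recipes", 0.0, "pasta")
print(score_on, score_off)  # 0.75 0.0
```

In a best-first crawler these scores would be the priorities of the URL frontier.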
Early algorithms: ARACHNID
Menczer, F. (1997)
Multi-agent, evolution-inspired: mutation (new seeds), fitness (score acc.), term-based scores.
Context: Link Analysis
The Web graph as an information source (beyond the text)
Kleinberg, J. (1998)
HITS: authoritative pages (pointed to by good hubs), hub pages (pointing to good authorities).
Brin, S. & Page, L. (1998)
PageRank: Random walk over the Web graph, stationary probability vector.
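The stationary probability vector of the random walk can be approximated by power iteration. The sketch below is a minimal version on a toy graph, assuming every link target also appears as a key of the graph; the damping value 0.85 and the fixed iteration count are conventional illustrative choices, not values taken from the slides.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration for PageRank on a dict: page -> list of out-links.

    Assumes every link target is itself a key of graph.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random jump: every page receives the same base probability.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, links in graph.items():
            if links:
                # Follow a link: split this page's rank among its out-links.
                share = damping * rank[page] / len(links)
                for target in links:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

On this toy graph page c, which is linked from both a and b, ends up with the highest rank.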
Link-based algorithms
Cho, J., Garcia-Molina, H., and Page, L. (1998)
Link-based scores: Backlink count, PageRank
Chakrabarti, S., Van den Berg, M., and Dom, B. (1999)
Topic distillation: Text-based classifier over web page examples per category (off-line dataset construction, human labeling, content text, positive and negative examples).
On-line phase: Anchor-based score (ML) + HITS-based score for distillation.
Link-based algorithms: Basic assumptions
Davison, B. (2000)
Topical locality: Locality based on anchor text and links.
Menczer, F. (2004)
Link cluster conjecture: Related pages tend to be linked.
Link-based algorithms: Backlink graph
Considering how far away the target is: Layered backlink graph!
Diligenti et al. (2000)
Using the backlink graph for multiclass learning. Greedy approach.
Babaria et al. (2007)
Using the backlink graph for ordinal regression. Greedy approach.
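A layered backlink graph can be built by a breadth-first search over reversed links, starting from the target pages. The sketch below only computes the layer of each page on a toy graph; the function name and the graph are hypothetical, and in the approaches above the layer would then serve as the class label (multiclass learning) or as the ordinal value (ordinal regression).

```python
from collections import deque

def backlink_layers(graph, targets):
    """Assign each page its link distance to the nearest target page.

    graph maps page -> list of out-links. Layer 0 holds the targets,
    layer 1 the pages linking directly to a target, and so on.
    Pages with no path to a target get no layer.
    """
    # Reverse the graph: page -> pages that link to it.
    backlinks = {}
    for page, links in graph.items():
        for link in links:
            backlinks.setdefault(link, []).append(page)

    layer = {t: 0 for t in targets}
    queue = deque(targets)
    while queue:
        page = queue.popleft()
        for parent in backlinks.get(page, []):
            if parent not in layer:
                layer[parent] = layer[page] + 1
                queue.append(parent)
    return layer

toy = {"seed": ["hub"], "hub": ["target", "other"], "other": [], "target": []}
layers = backlink_layers(toy, ["target"])
print(layers)  # {'target': 0, 'hub': 1, 'seed': 2}
```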
Off-line learning-based algorithms
Kinds of features
The content of the web pages which are known to link to the candidate URL.
URL tokens from the candidate URL.
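As an illustration of the second kind of feature, URL tokens can be extracted with the standard library alone. The tokenization rule below (split host, path, and query on non-alphanumeric characters) is one plausible choice assumed here, not the rule used by any of the cited papers; the example URL is made up.

```python
import re
from urllib.parse import urlparse

def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens from host, path and query."""
    parts = urlparse(url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text.lower()) if tok]

tokens = url_tokens("http://www.example.org/sports/soccer-news.html?id=7")
print(tokens)
# ['www', 'example', 'org', 'sports', 'soccer', 'news', 'html', 'id', '7']
```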
Rennie & McCallum (1999)
1st stage (Off-line): Text-based features (anchor + header + title of the target).
2nd stage (On-line): Candidate URL scoring based on the text classifier (candidate URL (anchor + URL text)).
Li et al. (2005)
1st stage (Off-line): ID3 learning strategy. Anchor text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classifier (candidate URL (anchor)).
Pant & Srinivasan (2006)
1st stage (Off-line): SVM learning strategy. Content text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classifier (candidate URL (surrounding text)).
Feng et al. (2010)
1st stage (Off-line): Term-based weights. Weighted graph construction.
2nd stage (Off-line): PageRank over the weighted graph.
3rd stage (Off-line): Labeling based on PageRank. Term-based learning.
4th stage (On-line): Candidate URL scoring based on the text classifier (candidate URL (anchor)).
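The common pattern in these approaches, train a text classifier off-line and then score candidate URLs on-line from their anchor or surrounding text, can be sketched as below. The classifier is a small multinomial naive Bayes used as a stand-in; it is not the ID3 or SVM learner of the cited work, and the training examples and function names are made up for illustration. The sketch assumes both classes appear in the training data.

```python
import math
from collections import Counter

def train(examples):
    """Off-line stage: examples is a list of (text, label), label in {0, 1}."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in examples:
        docs[label] += 1
        counts[label].update(text.lower().split())
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab

def score(model, text):
    """On-line stage: log-odds that the text (e.g. anchor text) is on-topic."""
    counts, docs, vocab = model
    log_odds = math.log(docs[1] / docs[0])  # class prior
    totals = {c: sum(counts[c].values()) for c in (0, 1)}
    for word in text.lower().split():
        if word not in vocab:
            continue  # ignore words never seen during training
        # Laplace smoothing avoids zero probabilities.
        p1 = (counts[1][word] + 1) / (totals[1] + len(vocab))
        p0 = (counts[0][word] + 1) / (totals[0] + len(vocab))
        log_odds += math.log(p1 / p0)
    return log_odds

# Hypothetical labeled anchor texts (1 = on-topic, 0 = off-topic).
TRAINING = [
    ("soccer league results", 1),
    ("football match report", 1),
    ("cheap flights deals", 0),
    ("celebrity gossip news", 0),
]
model = train(TRAINING)
on_topic = score(model, "soccer match")
off_topic = score(model, "cheap deals")
print(on_topic > 0, off_topic < 0)  # True True
```

A positive log-odds value would push the candidate URL toward the front of the frontier, a negative one toward the back.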
Machine Learning-based adaptive algorithms
Learning on-the-fly from the context
[Figure: candidate URL with the term "Bach" in its context]
Aggarwal et al. (2000)
1st stage (Off-line): Crawling for dataset construction. Human labeling (positive examples). Bayes learning strategy. Content text-based features.
2nd stage (On-line): Candidate URL scoring based on the text classifier + feature selection based on interest ratio (candidate URL (anchor)).
Chakrabarti et al. (2002)
1st stage (Off-line): Crawling for dataset construction. Human labeling (positive examples). Content text-based features.
2nd stage (On-line): Training from positive examples using fetched pages (more sophisticated features such as DOM tree).
3rd stage (On-line): URL scoring based on the apprentice learner.
Learning to skip off-topic pages
[Figure: graph from Seed to Target with per-page scores between 0.1 and 0.8; one page is labeled Dud]
Learning to skip off-topic pages: Tunneling!
Bergmark et al. (2002)
1st stage (Off-line): Crawling for dataset construction. Human labeling (positive examples). Content text-based features.
2nd stage (Off-line): Tunneling module construction. Cutoff threshold learning based on nugget-dud paths.
3rd stage (On-line): Apprentice tunneling learner. Adaptive cutoff based on paths evaluated by using fetched pages.
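The core of tunneling, continuing through a limited number of consecutive off-topic pages (duds) in the hope of reaching another relevant page (nugget), can be sketched as below. This is a simplified illustration with a fixed cutoff, whereas the approach above learns and adapts the cutoff; the graph, the relevance values, and the function name are hypothetical.

```python
def crawl_with_tunneling(graph, relevance, seed, threshold=0.5, cutoff=2):
    """Crawl that tolerates up to cutoff consecutive off-topic pages.

    Simplification: each page is processed once, with the dud count of
    the first path that reaches it.
    """
    nuggets, visited = [], set()
    stack = [(seed, 0)]  # (page, consecutive duds on the path so far)
    while stack:
        page, duds = stack.pop()
        if page in visited:
            continue
        visited.add(page)
        if relevance[page] >= threshold:
            nuggets.append(page)
            duds = 0  # a nugget resets the dud counter
        else:
            duds += 1
            if duds > cutoff:
                continue  # off-topic for too long: abandon this path
        for link in graph.get(page, []):
            stack.append((link, duds))
    return nuggets

# Toy graph: the target sits behind two duds, "far" behind three.
GRAPH = {"seed": ["d1", "x1"], "d1": ["d2"], "d2": ["target"],
         "x1": ["x2"], "x2": ["x3"], "x3": ["far"]}
RELEVANCE = {"seed": 0.9, "d1": 0.2, "d2": 0.3, "target": 0.8,
             "x1": 0.1, "x2": 0.1, "x3": 0.1, "far": 0.9}

print(crawl_with_tunneling(GRAPH, RELEVANCE, "seed", cutoff=2))  # ['seed', 'target']
print(crawl_with_tunneling(GRAPH, RELEVANCE, "seed", cutoff=0))  # ['seed']
```

With a cutoff of 0 the crawler never leaves the seed; with a cutoff of 2 it tunnels through two duds to the target but still gives up on the longer off-topic path.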
Agents for path detection: Ants
Gasparetti & Micarelli (2004)
Close in aim to ARACHNID (multi-agent, multi-seed). Back-and-forth trips to relevant resources generate pheromone trails. Shortest paths attract more ants.
Ontology driven crawling strategies
Knowledge representation: Ontologies
[Figure: example ontology. Concepts and instances: sports, soccer, football, national teams, stadiums, city, coastal_city, country, Barcelona F.C., Camp Nou, Barcelona, Spain. Property: plays_in.]
Legend: sc = SubClassOf, dom = Domain, range = Range, i = InstanceOf, eq = Equivalent, sp = SubPropertyOf
Ontology-based match expansion
Ehrig & Maedche (2003)
Relevance scoring. 1st stage: Concept matching (ontology + lexicon). 2nd stage: Ontology-based expansion. 3rd stage: Summarization.
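The three stages can be illustrated with a very small sketch. Everything in it is an assumption made for illustration: the lexicon, the relation table, the expansion weight, and the final normalization are hypothetical and far simpler than the cited method.

```python
from collections import Counter

# Hypothetical lexicon: surface term -> ontology concept.
LEXICON = {"soccer": "soccer", "football": "soccer",
           "stadium": "stadiums", "stadiums": "stadiums",
           "sport": "sports", "sports": "sports"}

# Hypothetical ontology relations: concept -> directly related concepts.
RELATED = {"soccer": ["sports", "stadiums"],
           "stadiums": ["soccer"],
           "sports": ["soccer"]}

def relevance(text, focus, expansion_weight=0.5):
    """Score a page text against a focus concept of the ontology."""
    words = text.lower().split()
    # Stage 1: concept matching (ontology + lexicon).
    matched = Counter(LEXICON[w] for w in words if w in LEXICON)
    # Stage 2: ontology-based expansion. Concepts related to the focus
    # also count, with a reduced weight.
    weights = {focus: 1.0}
    for concept in RELATED.get(focus, []):
        weights.setdefault(concept, expansion_weight)
    # Stage 3: summarization into one score, normalized by text length.
    total = sum(weights.get(c, 0.0) * n for c, n in matched.items())
    return total / len(words) if words else 0.0

print(relevance("football stadium tickets", "soccer"))  # 0.5
print(relevance("cheap flights", "soccer"))             # 0.0
```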
Ontology-based learning strategy
Zheng et al. (2008)
Relevance scoring for fetched pages.
1st stage: Concept matching (ontology + lexicon), Concept distances, Doc. scoring.
2nd stage: ANN training.
3rd stage (On-line): term-based URL scoring (ANN, anchor as input).
More features for unvisited URL scoring
Feng et al. (2010)
On-line PageRank + term scoring (anchor, surrounding)
Patel & Schmidt (2011)
Term scoring based on matching and document structure (structure of the current page).
Conclusion
Challenges
Precision / Recall trade-off
Benchmarking
Ontology IE for effective crawling
Unbiased seed identification
Efficiency issues (scalability, ...)
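The precision / recall trade-off in the list above can be made concrete with a small computation. The page sets below are invented for illustration: a narrow crawl fetches few pages, all relevant, while a broad crawl finds every relevant page at the cost of fetching many irrelevant ones.

```python
def precision_recall(crawled, relevant):
    """Precision and recall of a crawl against a known set of relevant pages."""
    crawled, relevant = set(crawled), set(relevant)
    hits = len(crawled & relevant)
    precision = hits / len(crawled) if crawled else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

RELEVANT = {"a", "b", "c", "d"}
narrow = precision_recall({"a", "b"}, RELEVANT)
broad = precision_recall({"a", "b", "c", "d", "w", "x", "y", "z"}, RELEVANT)
print(narrow)  # (1.0, 0.5)
print(broad)   # (0.5, 1.0)
```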