1
Matching DOM Trees to Search Logs for Accurate Webpage Clustering
Deepayan Chakrabarti
Rupesh Mehta
2
Data extraction
Webpages from a site → Website-specific wrappers → Structured DB (product_name, price, rating)
3
Data Extraction
Wrapper 1
Wrapper 2
Building wrappers [Muslea+/98, Crescenzi+/01, Cohen+/02, Hogue+/05, Irmak+/06]
• Cluster pages from the website based on similarity of DOM structure
• Pick a few example pages per cluster
• Manually annotate the DOM nodes which contain the data
• Automatic wrapper induction using these annotations
4
Data Extraction
Clustering affects quality
• Too few clusters: clusters are heterogeneous, leading to imperfect wrappers or even the inability to build wrappers
• Too many clusters: significant editorial effort is required to build wrappers
We want to automatically get a good clustering, for any website
5
Main Idea
“Useful” info on a page
Wrappers extract it
Users search for it
[Figure: DOM tree html → h1 → b, linked to a user's search + click]
search terms match page content
DOM paths repeatedly referenced by search terms are “key” paths
“html h1” and “html h1 b” are key paths
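A path such as “html h1 b” is simply the sequence of tags from the root to a node. The following is a minimal sketch of how such paths and their text might be enumerated with Python's standard `html.parser`. The names `PathCollector`, `dom_paths` and `VOID_TAGS` are illustrative, and the choice to attach text only to the deepest enclosing path is an assumption, not something the slides specify.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (a partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathCollector(HTMLParser):
    """Collects the text found directly under each root-to-node tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, root first
        self.path_text = {}   # "html h1 b" -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = " ".join(self.stack)
            self.path_text.setdefault(path, []).append(text)


def dom_paths(html):
    """Return {path: text} for every path that directly contains text."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return {path: " ".join(parts) for path, parts in parser.path_text.items()}


paths = dom_paths("<html><h1>Canon <b>EOS 5D</b></h1></html>")
# paths maps "html h1" to "Canon" and "html h1 b" to "EOS 5D"
```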
6
Main Idea
Clustering using key paths
• Pre-processing step (for each site)
  • Given a large sample of pages and search logs
  • Identify key paths
• Run-time (for that website)
  • Given a new webpage
  • Find which key paths exist on the page
  • Map page to cluster using its key paths
7
Mapping pages to clusters
• Pages in a cluster should have similar tree structure and, hence, similar paths
• Represent a page by a shingle of its paths [Buttler/04]
• Using key paths:
  • Shingle preferentially picks key paths in the page
  • Requires a global ranking of key paths
8
Mapping pages to clusters
One cluster per shingle
All pages in a cluster share the same k “key” paths
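Since there is one cluster per shingle, the cluster assignment amounts to grouping pages by shingle. A minimal sketch, assuming each page's shingle has already been computed and is treated as an unordered set of paths (the function name and input format are illustrative):

```python
from collections import defaultdict


def cluster_by_shingle(page_shingles):
    """Group pages whose shingles contain exactly the same paths.

    page_shingles: dict mapping a page identifier to an iterable of paths.
    Returns a dict mapping each distinct shingle (as a frozenset) to its pages.
    """
    clusters = defaultdict(list)
    for page, shingle in page_shingles.items():
        clusters[frozenset(shingle)].append(page)
    return dict(clusters)


clusters = cluster_by_shingle({
    "/product/1": ["html h1", "html table td"],
    "/product/2": ["html table td", "html h1"],
    "/help": ["html div p", "html ul li"],
})
# The two product pages share a shingle and land in the same cluster.
```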
9
Main Idea
Clustering using key paths
• Pre-processing step (for each site)
  • Given a large sample of pages and search logs
  • Identify key paths
• Run-time (for that website)
  • Given a new webpage
  • Find which key paths exist on the page
  • Map page to cluster using its key paths
10
Identify key paths
• For every (query, webpage) pair
  • Match query terms to the text of a DOM path
  • This yields precision and recall for every path
• Need to aggregate over all queries and webpages
  • Expected precision and recall of a path
  • High if the path appears on many queried pages, and has high precision/recall in most of them
[Figure: DOM tree with nodes html, h1 and b; node labels: title, price]
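The slides do not give the exact definitions, so the following is only a sketch under stated assumptions: precision of a path is taken as the fraction of the path's terms that appear in the query, recall as the fraction of query terms that appear in the path's text, and the expectation as a plain average over all (query, page) pairs, with a path counting as zero on pages where it is absent. All function names are illustrative.

```python
from collections import defaultdict


def path_precision_recall(query, path_text):
    """Term-overlap precision and recall of one path for one query (assumed definitions)."""
    query_terms = set(query.lower().split())
    path_terms = set(path_text.lower().split())
    if not query_terms or not path_terms:
        return 0.0, 0.0
    overlap = len(query_terms & path_terms)
    return overlap / len(path_terms), overlap / len(query_terms)


def expected_scores(pairs):
    """Average precision and recall per path over all (query, page) pairs.

    pairs: list of (query, {path: text}) tuples, one per clicked page.
    A path missing from a page contributes zero for that pair.
    """
    sums = defaultdict(lambda: [0.0, 0.0])
    for query, page_paths in pairs:
        for path, text in page_paths.items():
            precision, recall = path_precision_recall(query, text)
            sums[path][0] += precision
            sums[path][1] += recall
    n = len(pairs)
    return {path: (p / n, r / n) for path, (p, r) in sums.items()}


scores = expected_scores([
    ("canon eos", {"html h1": "Canon EOS 5D", "html p": "free shipping"}),
    ("nikon d90", {"html h1": "Nikon D90"}),
])
# "html h1" scores well on both pairs; "html p" never matches a query.
```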
11
Identify key paths
How can we combine expected precision and recall into one ranking of key paths?
• F-measure, but
  • Precision is typically more important than recall
  • Precision and recall may be on completely different scales
  • This scaling factor varies among websites
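To see why differing scales are a problem, consider the balanced F-measure (the standard definition; the slides do not say which variant was considered). When recall R is much smaller than precision P, the value is governed almost entirely by recall:

```latex
F = \frac{2PR}{P + R}, \qquad R \ll P \;\Rightarrow\; F \approx \frac{2PR}{P} = 2R
```

In that regime, ranking paths by F is nearly the same as ranking them by recall alone, which runs against the stated preference for precision.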
12
Identify key paths
How can we combine expected precision and recall into one ranking of key paths?
• Borda method [Borda/1781]
  • Create two rankings of paths, one by precision and one by recall
  • Combine rankings into one ranking, using relative importance of precision to recall
  • Immune to varying scales of precision/recall values among websites
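One way to read this is as a weighted Borda count. In the sketch below, each path earns points from its position in each ranking, and the two point totals are mixed with a weight. The weight `precision_weight`, its default of 0.7 and the exact point scheme are assumptions for illustration; the slides only say that the relative importance of precision to recall is used.

```python
def borda_rank(scores, precision_weight=0.7):
    """Combine precision and recall rankings with a weighted Borda count.

    scores: dict mapping path -> (expected_precision, expected_recall).
    Returns the paths ordered from best to worst.
    """
    paths = list(scores)
    n = len(paths)
    by_precision = sorted(paths, key=lambda p: scores[p][0], reverse=True)
    by_recall = sorted(paths, key=lambda p: scores[p][1], reverse=True)

    points = {p: 0.0 for p in paths}
    # The top path in a ranking gets n points, the next n - 1, and so on.
    for position, path in enumerate(by_precision):
        points[path] += precision_weight * (n - position)
    for position, path in enumerate(by_recall):
        points[path] += (1.0 - precision_weight) * (n - position)

    return sorted(paths, key=lambda p: points[p], reverse=True)


# Recall values are on a much smaller scale than precision values here,
# but only the rank positions matter.
example = {"a": (0.9, 0.001), "b": (0.5, 0.003), "c": (0.1, 0.002)}
ranking = borda_rank(example)
```

Because only positions enter the computation, multiplying every recall value by a constant leaves the result unchanged, which is the scale-immunity property noted above.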
13
Main Idea
Clustering using key paths
• Pre-processing step (for each site)
  • Given a large sample of pages and search logs
  • Identify key paths, but key paths can be dependent
• Run-time (for that website)
  • Given a new webpage
  • Find which key paths exist on the page
  • Map page to cluster using its key paths
14
Handling dependent paths
• Consider the following two paths:
  • html body div div table tr td h1 span (“product name”)
  • html body div div table tr td h1
• If one is a key path, probably the other is too
• Shingle can get “swamped”. The shingle of a page becomes
  (product_name, product_name_parent, product_name_ancestor)
  instead of
  (product_name, buy_button, rating)
15
Handling dependent paths
Several sources of dependence
• Multiple paths may have similar content
  • “product name” header and its parent
  • product name mentioned in a header and in the text
• Multiple paths may always co-occur
  • “product name” header and “price”
16
Handling dependent paths
Identify key independent paths
• Build a graph of dependencies between paths
• Pick an independent set of paths, i.e., a set of paths where no one is connected to another
• Computation is weighted strongly towards the top-ranked paths
• Under our weighting scheme, greedily picking an independent set is optimal
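The greedy selection can be sketched as follows: walk the ranked list from the top and keep a path only if it has no dependency edge to a path already kept. How the dependency graph is built is not detailed in the slides; the helper `ancestor_edges` below uses only the ancestor relationship as a stand-in and is purely an assumption, as are the function names.

```python
def ancestor_edges(paths):
    """Hypothetical dependency rule: connect a path to each of its descendants."""
    edges = set()
    for a in paths:
        for b in paths:
            if a != b and b.startswith(a + " "):
                edges.add(frozenset((a, b)))
    return edges


def greedy_independent_paths(ranked_paths, edges):
    """Pick paths in rank order, skipping any path connected to one already chosen.

    ranked_paths: paths ordered best first.
    edges: set of frozenset({path_a, path_b}) dependency edges.
    """
    chosen = []
    for path in ranked_paths:
        if all(frozenset((path, kept)) not in edges for kept in chosen):
            chosen.append(path)
    return chosen


ranked = ["html body h1 span", "html body h1", "html body div b", "html body"]
independent = greedy_independent_paths(ranked, ancestor_edges(ranked))
# "html body h1" and "html body" are dropped as ancestors of chosen paths.
```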
17
Main Idea
Clustering using key paths
• Pre-processing step (for each site)
  • Given a large sample of pages and search logs
  • Identify key paths
• Run-time (for that website)
  • Given a new webpage
  • Find which key paths exist on the page
  • Map page to cluster using its key paths
• Several other optimizations (in paper)
18
Experiments
• 10 major websites
  • Sampled ~20,000 pages each
• Built ground truth
  • Ran an existing clustering algorithm
  • Manually checked results
    • Homogeneous clusters: merge when necessary
    • Heterogeneous clusters: change parameters, repeat
• Small sample of search logs
  • ~5K unique queries per site
  • Far fewer than the number of pages per site
19
Experiments
Compared to clustering using well-known tree-similarity metrics
• Path Shingles: Shingle of DOM paths without using key paths [Buttler/04]
• pq-Grams: Shingle of sub-trees of DOM tree [Augsten+/05]
• m/k Path Shingles: Like path shingles, except only m out of k shingle elements need to match
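The m/k relaxation can be illustrated with a small predicate. This sketch only shows the pairwise test; how matches are turned into clusters for that baseline is not described in the slides, and the function name is illustrative.

```python
def mk_shingles_match(shingle_a, shingle_b, m):
    """Return True if the two shingles share at least m elements."""
    return len(set(shingle_a) & set(shingle_b)) >= m


a = ["html h1", "html table td", "html div span"]
b = ["html h1", "html table td", "html ul li"]
# With k = 3, the shingles differ in one element: they match for m = 2 but not m = 3.
```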
20
Experiments
• Compared clusterings using the Adjusted Rand index (higher is better, 1.0 is perfect)
[Chart: our algorithm vs. [Buttler/04] and [Augsten+/05]]
• Search logs give significant lift, with very low variance
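For reference, the Adjusted Rand index compares a predicted clustering with the ground truth by counting pairs of pages that are grouped together in both, corrected for chance agreement. Below is a small self-contained sketch of the standard formula; it is not the authors' evaluation code, and the handling of degenerate inputs is an assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # assumption: fewer than two items counts as perfect agreement

    # Pairs placed together in both clusterings, in truth only, in prediction only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both clusterings put everything together
    return (together_both - expected) / (maximum - expected)


perfect = adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Cluster labels only need to be consistent within each labeling; the names themselves do not have to agree between the two.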
21
Experiments
Comparison against paths actually used by manually-designed wrappers
[Chart: precision of IndepPaths]
Key Paths correspond to paths used in wrappers
23
Conclusions
• Clusters affect both wrapper quality and the degree of editorial effort
• We use search logs to automatically find good clusters
• Current efforts: combining search features with content features to pick key paths
24
Mapping pages to clusters
Given a ranked list of key paths
Given a shingle size k
For any page P
  Find KP = all key paths in P
  If |KP| < k
    Shingle = KP plus randomly chosen paths from page
  else
    Shingle = top-ranked k paths in KP
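The procedure above might be sketched in Python as follows. Two details are assumptions rather than something the slides state: the random fill tops the shingle up to k elements, and a seeded random generator stands in for whatever selection the authors used, so that the same page always yields the same shingle. The function and parameter names are illustrative.

```python
import random


def key_path_shingle(page_paths, ranked_key_paths, k, rng=None):
    """Build a size-k shingle for a page, preferring top-ranked key paths.

    page_paths: paths that occur on the page.
    ranked_key_paths: site-wide key paths, best first.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed: an assumption, for repeatability
    present = set(page_paths)
    # Key paths found on the page, kept in global rank order.
    kp = [path for path in ranked_key_paths if path in present]
    if len(kp) >= k:
        return tuple(kp[:k])
    # Too few key paths: fill the remainder with other paths from the page.
    others = sorted(present - set(kp))
    fill = rng.sample(others, min(k - len(kp), len(others)))
    return tuple(kp + fill)


ranked_keys = ["html h1", "html div span", "html table td"]
page = ["html table td", "html h1", "html p", "html ul li"]
shingle = key_path_shingle(page, ranked_keys, k=2)
# Both key paths on the page are used, in rank order.
```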