Improving Suffix Tree Clustering
• Base cluster ranking: s(B) = |B| * f(|P|)
  – |B| is the number of documents in base cluster B
  – |P| is the number of words in phrase P that have a non-zero score
  – Zero-score words: stopwords, and words that occur in too few (<3) or too many (>40%) documents
• Tf-Idf is better
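The base cluster ranking above can be sketched as follows. The slide does not define f; the piecewise form used here (single-word phrases penalised, linear up to six words, constant afterwards) follows the usual description of STC, and its constants are assumptions. All function names and the sample counts are hypothetical.

```python
def effective_length(phrase, doc_freq, n_docs, stopwords):
    """Count the words of a phrase that have a non-zero score."""
    count = 0
    for word in phrase:
        df = doc_freq.get(word, 0)
        if word in stopwords:
            continue  # stopwords score zero
        if df < 3:
            continue  # word occurs in too few documents
        if df > 0.4 * n_docs:
            continue  # word occurs in too many documents
        count += 1
    return count


def f(length):
    """Assumed phrase-length function; the slide does not give it."""
    if length <= 0:
        return 0.0
    if length == 1:
        return 0.5  # assumed penalty for single-word phrases
    return float(min(length, 6))  # linear, then constant (assumed cap)


def base_cluster_score(docs, phrase, doc_freq, n_docs, stopwords):
    """s(B) = |B| * f(|P|)."""
    return len(docs) * f(effective_length(phrase, doc_freq, n_docs, stopwords))


# Hypothetical document frequencies for a 100-page result set.
doc_freq = {"jaguar": 30, "car": 20, "the": 95, "rare": 2}
stopwords = {"the"}
score = base_cluster_score(set(range(10)), ["the", "jaguar", "car"],
                           doc_freq, 100, stopwords)
```

Replacing the zero/non-zero word test with a TF-IDF weight, as the last bullet suggests, would only change `effective_length`.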
Improving Suffix Tree Clustering
• Cluster similarity
  – Page overlap
  – Add: cluster label distance (word pair distance)
    • Google normalised distance
    • WikiMiner: wikilink similarity
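As a rough illustration of the word pair distance, the normalised Google distance can be computed from page counts. This is a sketch of the published formula; the function name, argument names, and the counts in the example are hypothetical, and how the slides combine word pair distances into a cluster label distance is not stated.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """NGD from page counts.

    fx, fy: number of pages containing each word
    fxy:    number of pages containing both words
    n:      total number of indexed pages
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf  # the words never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    denom = math.log(n) - min(lx, ly)
    if denom <= 0:
        return 0.0  # both words occur on every page
    return (max(lx, ly) - lxy) / denom


# Hypothetical counts: smaller values mean the words are more related.
d = normalized_google_distance(fx=1000, fy=100, fxy=10, n=100000)
```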
Improving Suffix Tree Clustering
• 3rd step: cluster merging
  – If more than half of the pages overlap, then merge
  – New: HAC (hierarchical agglomerative clustering)
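A minimal sketch of the overlap-based merging step. It reads "more than half" as more than half of the pages of each of the two clusters, which is an assumption, and it merges every chain of linked base clusters into one cluster. The function names and the sample clusters are hypothetical.

```python
def overlaps(a, b):
    """True if the shared pages are more than half of each cluster (assumed rule)."""
    common = len(a & b)
    return common > 0.5 * len(a) and common > 0.5 * len(b)


def merge_base_clusters(clusters):
    """Merge base clusters that are linked, directly or through a chain."""
    n = len(clusters)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if overlaps(clusters[i], clusters[j]):
                parent[find(i)] = find(j)

    merged = {}
    for i, pages in enumerate(clusters):
        merged.setdefault(find(i), set()).update(pages)
    return list(merged.values())


# Page ids are made up; only the first two clusters overlap enough to merge.
result = merge_base_clusters([{1, 2, 3, 4}, {2, 3, 4, 5}, {10, 11}, {4, 5, 6}])
```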
Query Directed Web Page Clustering
Daniel Crabtree, Peter Andreae, Xiaoying Gao
Victoria University of Wellington
Related Work: Web Page Clustering
• All Standard Algorithms
  – partitioning (k-means), hierarchical (agglomerative, divisive), …
• Web Features
  – structure, hyperlinks, colour
• Textual Features
  – STC: phrases, Lingo: latent semantic indexing
• Word Semantics
  – Global document analysis, co-occurrence statistics
• Query is never used
QDC – Query Directed Clustering
1: Find Base Clusters
2: Merge Clusters
3: Split Clusters
4: Select Clusters
5: Clean Clusters
QDC – 1: Find Base Clusters
• Clean Pages
• Identify Base Clusters
• Prune Small Clusters
• Semantic Prune #1
• Semantic Prune #2
Figure: base clusters with sizes: Mac (28), Car (40), Auto (25), Animal (18), OS (12), Atari (22), Game (5), Service (80), Forest (11)
Figure labels: cluster size; distance(cluster, query); Score #1 = …; Score #2 = …
QDC – 1: Query Distance
Figure (query: Jaguar): diagram labels: Car, Home Page, Toyota, Specific, Broad, Ambiguous
QDC – 1: Find Base Clusters
• Removes Many Base Clusters
  – Normally Negative Effect on Performance
But …
• Query Directed Score
  – Reliable Guide to Cluster Quality
  – Removes just Low Quality Clusters
  – Improves Performance
QDC – 2: Merge Clusters
• Merging
Figure: base clusters before merging: Mac (28), Car (40), Auto (25), Animal (18), OS (12), Atari (22)
Figure: merged clusters: Car, Auto (40); Mac, OS (28)
QDC – 2: Merge Clusters
• Single-link Clustering
• Similarity Function
  – Extension (by page overlap)
  – Intension (by description similarity)
    • Global document analysis: co-occurrence frequency relative to expected frequency if independent
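The description similarity idea, co-occurrence frequency relative to the frequency expected under independence, can be sketched as a simple ratio. This is one plausible reading of the bullet, not necessarily the exact QDC measure; the function name and the tiny corpus are hypothetical.

```python
def cooccurrence_ratio(docs, w1, w2):
    """Observed co-occurrence count divided by the count expected if the
    two words occurred independently. Values above 1 suggest related words."""
    n = len(docs)
    n1 = sum(1 for d in docs if w1 in d)
    n2 = sum(1 for d in docs if w2 in d)
    n12 = sum(1 for d in docs if w1 in d and w2 in d)
    if n == 0 or n1 == 0 or n2 == 0:
        return 0.0
    expected = n1 * n2 / n  # expected joint count under independence
    return n12 / expected


# Each document is represented as the set of words it contains (made-up data).
docs = [{"car", "auto"}, {"car", "auto"}, {"car"}, {"cat"}]
related = cooccurrence_ratio(docs, "car", "auto")
unrelated = cooccurrence_ratio(docs, "car", "cat")
```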
QDC – 2: Merge Clusters
• Reducing Page Overlap Threshold
  – Normally Negative Effect on Performance
But …
• Description Similarity
  – More semantically related clusters merge
    • Increasing cluster coverage
  – Fewer semantically unrelated clusters merge
    • Increasing cluster quality
QDC – 3: Split Clusters
• Single Link Merging
  – Cluster Chaining (Drifting)
• Hierarchical Agglomerative
  – Distance Measure: Path Length
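A sketch of how a chained single-link cluster might be split with path length as the distance. The graph has one node per base cluster and an edge wherever two base clusters were merged. Complete linkage and the `max_path` threshold are assumptions; the slide only names hierarchical agglomerative clustering and the path length distance. All names are hypothetical.

```python
from collections import deque


def path_lengths(graph, start):
    """Breadth-first search: number of edges from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def split_chained_cluster(graph, max_path=2):
    """Agglomerate base clusters while the longest path inside a group
    stays within max_path (complete linkage, an assumed choice)."""
    nodes = sorted(graph)
    dist = {n: path_lengths(graph, n) for n in nodes}
    groups = [[n] for n in nodes]

    def linkage(g1, g2):
        return max(dist[a].get(b, float("inf")) for a in g1 for b in g2)

    while len(groups) > 1:
        pairs = [(linkage(groups[i], groups[j]), i, j)
                 for i in range(len(groups))
                 for j in range(i + 1, len(groups))]
        d, i, j = min(pairs)
        if d > max_path:
            break  # merging further would re-create the chain
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups


# A chain A-B-C-D-E: single-link merging puts all five in one cluster.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
parts = split_chained_cluster(chain, max_path=2)
```

In the chain example, A and E are four edges apart even though every neighbouring pair overlaps, which is the drift that the split step is meant to undo.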
QDC – 4: Select Clusters
• ESTC cluster selection algorithm
  – Heuristic-based hill-climbing search with look-ahead and advanced branch and bound pruning
• Original heuristic
  – Page Coverage and Cluster Overlap
• New heuristic
  – Page Coverage and Cluster Overlap
  – Pages Not Covered and Cluster Quality
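To make the selection step concrete, here is a plain greedy hill-climbing sketch without look-ahead or branch and bound pruning. The weighted heuristic and its weights `beta`, `gamma`, and `delta` are assumptions chosen for illustration; they combine the four quantities named above but are not the ESTC or QDC formulas. Cluster names, page ids, and quality values are made up.

```python
def heuristic(selected, clusters, quality, all_pages,
              beta=0.5, gamma=0.2, delta=1.0):
    """Assumed score: coverage, minus overlap, minus pages not covered,
    plus cluster quality."""
    covered = set().union(*(clusters[c] for c in selected))
    overlap = sum(len(clusters[c]) for c in selected) - len(covered)
    not_covered = len(all_pages - covered)
    total_quality = sum(quality[c] for c in selected)
    return (len(covered) - beta * overlap
            - gamma * not_covered + delta * total_quality)


def select_clusters(clusters, quality, all_pages, k=3):
    """Add the best-scoring cluster until nothing improves or k are chosen."""
    selected = []
    best = heuristic(selected, clusters, quality, all_pages)
    while len(selected) < k:
        scored = [(heuristic(selected + [c], clusters, quality, all_pages), c)
                  for c in clusters if c not in selected]
        if not scored:
            break
        score, choice = max(scored)
        if score <= best:
            break  # local optimum reached
        selected.append(choice)
        best = score
    return selected


clusters = {"car": {1, 2, 3, 4}, "auto": {1, 2, 3},
            "animal": {5, 6}, "noise": {4, 5}}
quality = {"car": 0.9, "auto": 0.8, "animal": 0.9, "noise": 0.1}
chosen = select_clusters(clusters, quality, all_pages=set(range(1, 8)))
```

With these numbers the search picks "car" and "animal" and stops: "auto" adds only overlap, and "noise" adds overlap with low quality.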
QDC – 5: Clean Clusters
• Page-Cluster Relevance
  – Based on Base Cluster Membership
  – Cluster Size, Cluster Quality
• Remove Outliers and Erroneous Inclusions
• Sorting improves usability
Evaluation
• Algorithm Efficiency on 250 Documents
  – Ten Times Faster than STC
  – One Hundred Times Faster than K-means
• Algorithm Performance
  – External Evaluation against a rich gold standard
• Real World Usability
  – Informal Usability Comparison with four algorithms
    • K-means, ESTC, Lingo, Vivisimo
Evaluation: Algorithm Performance
• External Evaluation against a rich gold standard
• Four Algorithms
  – STC, ESTC, K-means, Random
• Four Data Sets
  – Salsa, Jaguar, GP, Victoria University
• Eleven Measurements
  – Average and Weighted: Quality, Coverage, Precision, Recall, and Entropy + Mutual Information
• Snippets and Full Page Text
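For orientation, three of the measurements can be sketched with their textbook definitions for a single cluster compared against a single gold standard topic. The exact definitions and averaging used in the evaluation are not given on the slide, and the entropy function assumes one gold topic per page; both points are assumptions. Names and data are hypothetical.

```python
import math
from collections import Counter


def precision_recall(cluster, topic):
    """Precision and recall of a cluster against one gold standard topic."""
    common = len(cluster & topic)
    precision = common / len(cluster) if cluster else 0.0
    recall = common / len(topic) if topic else 0.0
    return precision, recall


def cluster_entropy(cluster, topic_of):
    """Entropy (in bits) of the gold topics inside a cluster; 0 means pure."""
    counts = Counter(topic_of[p] for p in cluster)
    n = sum(counts.values())
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# Made-up page ids and topics.
p, r = precision_recall({1, 2, 3, 4}, {1, 2, 3, 5, 6})
h = cluster_entropy({1, 2, 3, 4}, {1: "car", 2: "car", 3: "animal", 4: "animal"})
```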
Evaluation: Quality and Coverage
Evaluation: Improvement over Random
Evaluation: Precision and Recall
Evaluation: Entropy and Mutual Information
Evaluation: Real World Usability
• QDC finds broader topics
  – Maximizes probability of refinement
  – Simplifies user’s decision process
    • Fewer choices
    • Less chance of multiple relevant choices
• Fewer semantically meaningless clusters
Jaguar Results
Evaluation: Real World Usability
• Performance better than indicated by external evaluation
  – No penalty for overly specific clusters since gold standard included them
• External evaluation shows QDC clusters:
  – Contain fewer irrelevant pages
  – Cover more relevant pages
Conclusion
• QDC: New Web Page Clustering Algorithm
• Key innovations:
  – Query Directed Scoring
  – Merging using cluster descriptions
  – Solve cluster chaining by splitting
  – Improved cluster selection heuristic
• Vastly improved performance over other algorithms
  – External evaluation
  – Informal usability evaluation
Further Extension
• Use Phrases rather than just Words
  – STC, Lingo show large improvement possible
• Use Wiki Link similarity (WikiMiner) instead of GND
• Future work:
  – Improve cluster description similarity merging to consider entire description
  – Common shared phrases as key features, use VSM, build vectors for each cluster, new weighting
  – Formal usability evaluation