Web Document Clustering: A Feasibility Demonstration
![Page 1: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/1.jpg)
Web Document Clustering: A Feasibility Demonstration
Hui Han
CSE dept. PSU
10/15/01
![Page 2: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/2.jpg)
Motivation
Low precision of Web search engines: hard for users to locate expected information quickly…
Solutions:
1. Increase precision – by filtering methods? by advanced pruning options?…
2. Web Document Clustering – cluster documents returned by a search engine in response to a query and re-present them
![Page 3: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/3.jpg)
Key Requirements for Web Document Clustering
Relevance
Browsable Summaries
Overlap
Snippet-tolerance
– “snippet”: a small piece of information or a brief extract
Speed
Incrementality
![Page 4: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/4.jpg)
Suffix Tree Clustering (STC)
STC is a linear-time clustering algorithm based on a suffix tree, which efficiently identifies sets of documents that share common phrases.
STC satisfies the key requirements:
– STC treats a document as a string, making use of proximity information between words.
– STC is a novel, incremental, O(n)-time algorithm.
– STC succinctly summarizes clusters’ contents for users.
– STC is quick because it works on a smaller set of documents and is incremental.
– …
![Page 5: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/5.jpg)
Operating procedure of STC
Step 1: Document “cleaning”
– HTML -> plain text
– Word stemming
– Mark sentence boundaries
– Remove non-word tokens
Step 2: Identifying Base Clusters
Step 3: Combining Base Clusters
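The cleaning step can be sketched roughly as follows. This is only an illustration, not the code used in the talk: the names `clean` and `crude_stem` are made up here, and `crude_stem` is a deliberately naive placeholder for a real stemmer (such as Porter's), which the slides do not specify.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def crude_stem(word):
    # Hypothetical placeholder: strips a few common English endings.
    # A real system would use a proper stemming algorithm instead.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def clean(html):
    """Return a list of sentences, each a list of stemmed word tokens."""
    parser = _TextExtractor()
    parser.feed(html)          # HTML -> plain text
    parser.close()
    text = " ".join(parser.parts)
    cleaned = []
    # Mark sentence boundaries by splitting on end-of-sentence punctuation.
    for sentence in re.split(r"[.!?]+", text):
        # Keep alphabetic tokens only, which drops numbers and punctuation.
        words = [crude_stem(w) for w in re.findall(r"[a-z]+", sentence.lower())]
        if words:
            cleaned.append(words)
    return cleaned


sentences = clean("<p>Cats ate cheese. Mice <b>liked</b> 42 cheeses!</p>")
```

Each inner list is one sentence, so later steps can avoid building phrases that cross a sentence boundary.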
![Page 6: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/6.jpg)
Step 2: Identifying Base Clusters (Suffix Tree)
STC treats a document as a set of strings…
Suffix tree of string S: a compact tree containing all the suffixes of S
– Suffix of a word: lovely
– Suffix of a string: “Friends” is a lovely show.
Precise definition:
– A suffix tree is a rooted, directed tree.
– Each internal node has at least 2 children.
– Each edge is labeled with a non-empty sub-string of S. The label of a node is defined to be the concatenation of the edge-labels on the path from the root to that node.
– No two edges out of the same node can have edge-labels that begin with the same word (this is what makes the tree compact).
![Page 7: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/7.jpg)
Ex. A Suffix Tree of Strings
String 1: “cat ate cheese”
String 2: “mouse ate cheese too”
String 3: “cat ate mouse too”
![Page 8: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/8.jpg)
Base clusters
Base clusters corresponding to the suffix tree nodes
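To make the link between suffix tree nodes and base clusters concrete, here is a naive sketch for the three example strings. It does not build a suffix tree and is not linear time: it enumerates every word sequence and keeps those that would label a branching node of the compact tree (the phrase is followed by at least two different continuations) and that occur in at least two documents. The function name `base_clusters` and the per-document end marker are assumptions made for this illustration.

```python
from collections import defaultdict


def base_clusters(docs):
    """Return {phrase: set of document numbers} for phrases shared by 2+ documents."""
    doc_ids = defaultdict(set)    # phrase -> documents that contain it
    followers = defaultdict(set)  # phrase -> distinct tokens seen right after it
    for d, text in enumerate(docs, start=1):
        words = text.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                phrase = tuple(words[i:j])
                doc_ids[phrase].add(d)
                # A per-document end marker stands in for a unique string terminator.
                nxt = words[j] if j < len(words) else ("<end>", d)
                followers[phrase].add(nxt)
    return {
        " ".join(phrase): ids
        for phrase, ids in doc_ids.items()
        # Branching phrase (a node in the compact tree) shared by 2+ documents.
        if len(ids) >= 2 and len(followers[phrase]) >= 2
    }


clusters = base_clusters([
    "cat ate cheese",
    "mouse ate cheese too",
    "cat ate mouse too",
])
```

For these strings the sketch yields six base clusters: "cat ate" (documents 1, 3), "ate" (1, 2, 3), "cheese" (1, 2), "mouse" (2, 3), "too" (2, 3) and "ate cheese" (1, 2). The phrase "cat" is not among them because it is always followed by "ate", so it sits inside an edge label rather than at a node.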
![Page 9: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/9.jpg)
Cluster score
s(B) = |B| * f(|P|)
– |B| is the number of documents in base cluster B
– |P| is the number of words in phrase P that have a non-zero score
– Zero-score words: stopwords, and words that appear in too few (<3) or too many (>40%) documents
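A small sketch of the scoring formula follows. The slide does not define f, so the `f` below is purely an assumed shape (a penalty for single-word phrases, linear growth up to six words, constant after that), and its constants are illustrative. The names `effective_length` and `score` are also made up for this example.

```python
def effective_length(phrase, doc_freq, n_docs, stopwords):
    """|P|: number of words in the phrase that have a non-zero score."""
    count = 0
    for word in phrase.split():
        df = doc_freq.get(word, 0)  # number of documents containing the word
        if word in stopwords or df < 3 or df > 0.4 * n_docs:
            continue  # zero-score word: stopword, too rare, or too common
        count += 1
    return count


def f(length):
    # ASSUMPTION: the slide gives no definition of f; these values are illustrative.
    if length <= 0:
        return 0.0
    if length == 1:
        return 0.5
    return float(min(length, 6))


def score(cluster_size, phrase_length):
    """s(B) = |B| * f(|P|)."""
    return cluster_size * f(phrase_length)


length = effective_length(
    "the cat ate",
    doc_freq={"the": 50, "cat": 5, "ate": 2},
    n_docs=100,
    stopwords={"the"},
)
```

Here only "cat" counts toward |P|: "the" is a stopword and "ate" appears in fewer than 3 documents.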
![Page 10: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/10.jpg)
Step 3: Combining Base Clusters
Merge base clusters with a high overlap in their document sets
– documents may share multiple phrases.
Similarity of Bm and Bn (0.5 is a parameter):
$$
\mathrm{similarity}(B_m, B_n) =
\begin{cases}
1 & \text{iff } |B_m \cap B_n| / |B_m| > 0.5 \text{ and } |B_m \cap B_n| / |B_n| > 0.5 \\
0 & \text{otherwise}
\end{cases}
$$
![Page 11: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/11.jpg)
Base Cluster Graph
Node: cluster
Edge: similarity between two clusters = 1
What if “ate” is in the stop word list?
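The merge step and the question about "ate" can be explored with a short sketch. It treats two base clusters as connected when their similarity is 1 and takes each connected component of the base cluster graph as a final cluster. The function names are illustrative, and dropping the "ate" base cluster when "ate" is a stopword is an assumption based on its phrase then having a zero score.

```python
def similar(bm, bn, threshold=0.5):
    """Binary similarity: True when the overlap exceeds the threshold on both sides."""
    common = len(bm & bn)
    return common / len(bm) > threshold and common / len(bn) > threshold


def merge_base_clusters(base, threshold=0.5):
    """Return one (phrases, documents) pair per connected component of the graph."""
    names = list(base)
    seen = set()
    merged = []
    for start in names:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first walk over the base cluster graph
            current = stack.pop()
            component.append(current)
            for other in names:
                if other not in seen and similar(base[current], base[other], threshold):
                    seen.add(other)
                    stack.append(other)
        documents = set().union(*(base[name] for name in component))
        merged.append((sorted(component), documents))
    return merged


# Base clusters of the three example strings (phrase -> document numbers).
base = {
    "cat ate": {1, 3},
    "ate": {1, 2, 3},
    "cheese": {1, 2},
    "mouse": {2, 3},
    "too": {2, 3},
    "ate cheese": {1, 2},
}
with_ate = merge_base_clusters(base)

# ASSUMPTION: a stopword-only phrase scores zero, so its base cluster is dropped.
without_ate = merge_base_clusters({p: d for p, d in base.items() if p != "ate"})
```

With "ate" present, every base cluster is linked through it and a single cluster covering all three documents results. Without it, three clusters remain: "cat ate" (documents 1, 3), "cheese" with "ate cheese" (1, 2), and "mouse" with "too" (2, 3).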
![Page 12: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/12.jpg)
STC is Incremental
As each document arrives from the web, we
– “clean” it (linear with collection size)
– Add it to the suffix tree. Each node that is updated/created as a result of this is tagged (linear)
– Update the relevant base clusters and recalculate the similarity of these base clusters to the rest of the k highest scoring base clusters (linear)
– Check any changes to the final clusters (linear)
– Score and sort the final clusters, choose top 10... (linear)
![Page 13: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/13.jpg)
STC allows cluster overlap…
– Why is overlap reasonable? A document often has more than one topic
– STC allows a document to appear in more than one cluster, since documents may share more than one phrase with other documents
– But the clusters must not be so similar that they get merged into one cluster.
![Page 14: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/14.jpg)
Experiments
Cluster the output of a meta search engine, using the STC algorithm
– Representative of Web search engines
– Web clustering, instead of an “IR corpus”
![Page 15: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/15.jpg)
Evaluation: Precision
Precision of different clustering algorithms
![Page 16: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/16.jpg)
Cluster overlap & multi-word phrases are critical to STC’s success
![Page 17: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/17.jpg)
Cluster overlap & multi-word phrases are specifically effective for STC
![Page 18: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/18.jpg)
Why?
Allowing a document to appear in multiple clusters is only advantageous if that document is relevant; placing an irrelevant document in multiple clusters can only hurt cluster quality
![Page 19: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/19.jpg)
Snippets versus Whole Document
![Page 20: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/20.jpg)
Execution time
Incremental – use “free” CPU time when the system is waiting for the search engine results to arrive over the web – speedy
![Page 21: Web Document Clustering: A Feasibility Demonstration](https://reader035.fdocuments.in/reader035/viewer/2022081504/5681351b550346895d9c7620/html5/thumbnails/21.jpg)
Conclusion
The identification of the unique requirements of document clustering of Web search engine results
The definition of STC – an incremental, O(n) time clustering algorithm that satisfies these requirements
The first experimental evaluation of clustering algorithms on Web search engine results