A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results
Kummamuru et al.
Presented by Bei Yu, Sept. 22nd, 2004
Roadmap
Properties of topic hierarchy
Automatic taxonomy generation (ATG)
Monothetic ATG
DisCover algorithm
CAARD algorithm
DSP algorithm
Result comparison
Questions
Generating Topic Hierarchy (Taxonomy)
Desirable properties of a topic hierarchy
  Document coverage
  Compactness (breadth/depth, node number)
  Sibling node distinctiveness
  Node label predictiveness
  General to specific
  Reach time
Monothetic ATG
Automatic Taxonomy Generation (ATG): monothetic vs. polythetic
  Monothetic: cluster assignment based on a single feature
  Polythetic: cluster assignment based on multiple features
Clustering keywords vs. documents vs. both
Top-down vs. bottom-up
Monothetic ATG algorithms
  Subsumption algorithm (Sanderson and Croft, 1999)
  DSP (Lawrie et al., 2001)
  CAARD (Kummamuru and Krishnapuram, 2001)
  DisCover (this paper)
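The monothetic/polythetic distinction can be illustrated with a small sketch. This is not taken from any of the cited algorithms: the function names, the toy data, and the use of Jaccard overlap for the polythetic case are all assumptions made for illustration.

```python
def monothetic_assign(doc_terms, label_terms):
    """Monothetic: a document joins a cluster if it contains that
    cluster's single defining term (hypothetical helper)."""
    return [t for t in label_terms if t in doc_terms]


def polythetic_assign(doc_terms, cluster_terms):
    """Polythetic: a document joins the cluster whose whole term set
    it resembles most; Jaccard overlap is an assumed similarity."""
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0
    return max(cluster_terms, key=lambda name: jaccard(doc_terms, cluster_terms[name]))


doc = {"jaguar", "car", "engine"}
clusters = {
    "autos": {"car", "engine", "dealer"},
    "wildlife": {"jaguar", "cat", "jungle"},
}
print(monothetic_assign(doc, ["car", "animal"]))
print(polythetic_assign(doc, clusters))
```

In the monothetic case the cluster label is the feature itself, which is why monothetic hierarchies tend to be easy to label and browse.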
DisCover
Progressively grow the hierarchy
Coverage and compactness tradeoff
Generate an optimal permuted sequence of the concepts under a node
  Every document is represented as a set of concepts
  "Concepts under the node" means all the other concepts in the documents covered by the node
Select an optimal subset from the concepts with maximal coverage and distinctiveness
Question: preset the child node number?
DisCover
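Score of a candidate concept c_j, given the sequence S_{k-1} of concepts already selected:

$$g(S_{k-1}, c_j) = w_1\, g_c(S_{k-1}, c_j) + w_2\, g_d(S_{k-1}, c_j)$$

Coverage:

$$g_c(S_{k-1}, c_j) = |d(c_j) \setminus d(S_{k-1})|$$

Distinctiveness:

$$g_d(S_{k-1}, c_j) = |t(c_j) \setminus t(S_{k-1})|$$

The k-th concept of the sequence is the candidate in U_{k-1} with the highest score:

$$\arg\max_{c_j \in U_{k-1}} g(S_{k-1}, c_j)$$

(The operator between the two sets in g_c and g_d is not legible in the transcript; set difference is assumed here. d(.) and t(.) appear to denote the documents and the concepts covered, and U_{k-1} the concepts not yet selected.)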
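The greedy ordering step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: coverage is taken as the number of newly covered documents, distinctiveness as the number of newly covered concepts, and the names `greedy_sequence`, `docs_of`, `terms_of`, and the toy corpus are all hypothetical.

```python
def greedy_sequence(docs_of, terms_of, w1=1.0, w2=1.0, max_nodes=None):
    """Order concepts by repeatedly picking the one that maximizes
    g = w1 * (newly covered documents) + w2 * (newly covered concepts)."""
    selected = []
    remaining = sorted(docs_of)  # sorted so that ties break deterministically
    seen_docs, seen_terms = set(), set()

    def g(c):
        coverage = len(docs_of[c] - seen_docs)
        distinctiveness = len(terms_of[c] - seen_terms)
        return w1 * coverage + w2 * distinctiveness

    while remaining and (max_nodes is None or len(selected) < max_nodes):
        best = max(remaining, key=g)
        if g(best) == 0:  # nothing new would be added; stop early
            break
        selected.append(best)
        seen_docs |= docs_of[best]
        seen_terms |= terms_of[best]
        remaining.remove(best)
    return selected


# Toy corpus: each document is a set of concepts.
corpus = {
    1: {"apple", "fruit", "pie"},
    2: {"apple", "computer", "mac"},
    3: {"fruit", "banana"},
    4: {"computer", "linux"},
}
concepts = set().union(*corpus.values())
# Documents covered by each concept.
docs_of = {c: {d for d, cs in corpus.items() if c in cs} for c in concepts}
# "Concepts under the node": all other concepts in the covered documents.
terms_of = {c: set().union(*(corpus[d] for d in docs_of[c])) - {c} for c in concepts}

sequence = greedy_sequence(docs_of, terms_of)
print(sequence)
```

Passing `max_nodes` corresponds to presetting the number of child nodes; leaving it unset lets the loop stop once no candidate adds new documents or concepts.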
CAARD (Kummamuru and Krishnapuram, 2001)
Flow: corpus -> concepts -> top-level Min_subset + rest subset (split using ID_{ij}) -> recursive
Inclusion Degree: $ID_{ij} = |w_i \cap w_j| / |w_i|$
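The inclusion degree is an asymmetric relatedness measure, which a short sketch makes concrete. The assumptions here are that w_i is a set (for example, the documents containing concept i) and that the numerator is the intersection of the two sets; the function name and data are illustrative only.

```python
def inclusion_degree(w_i, w_j):
    """ID_ij = |w_i & w_j| / |w_i|: the fraction of w_i that also lies in w_j."""
    return len(w_i & w_j) / len(w_i) if w_i else 0.0


# Hypothetical document sets for two concepts.
w_python = {1, 2, 3}
w_programming = {1, 2, 3, 4, 5}

print(inclusion_degree(w_python, w_programming))
print(inclusion_degree(w_programming, w_python))
```

Every "python" document is also a "programming" document but not the other way round, so the two directions give different values; that asymmetry is what allows a general concept to sit above a specific one.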
DSP (Lawrie et al., 2001)
Flow: corpus -> topic terms + vocabulary terms -> top-level topic terms
Goal: maximal predictive power and vocabulary coverage
Language model: $\Pr_x(A \mid B)$
  A: topic term; B: vocabulary
  A = B
Recursion
  A <- subtopic term around topic
  B = A?
Evaluation
In general
  Precision
  F-measure
  User study
  Summary evaluation (EMIM compared with TF*IDF)
  Reachability
  Reach time
This paper compares
  Computational complexity
  Coverage and compactness
  Reach time
  User study
Results
Questions
How does the performance change as the number of nodes increases further (beyond 9)?
How exactly is the concept sequence mapped to the tree structure?