A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results

Kummamuru et al.

Presented by Bei Yu, Sept. 22nd, 2004

Roadmap

Properties of topic hierarchy
Automatic taxonomy generation (ATG)
Monothetic ATG
DisCover algorithm
CAARD algorithm
DSP algorithm
Result comparison
Questions

Generating Topic Hierarchy (Taxonomy)

Desirable properties of topic hierarchy:
  Document coverage
  Compactness (breadth/depth, node number)
  Sibling node distinctiveness
  Node label predictiveness
  General to specific
  Reach time

Monothetic ATG

Automatic Taxonomy Generation (ATG): monothetic vs. polythetic
  Monothetic: cluster assignment based on a single feature
  Polythetic: cluster assignment based on multiple features

Clustering of keywords vs. documents vs. both
Top-down vs. bottom-up

Monothetic ATG algorithms:
  Subsumption algorithm (Sanderson and Croft, 1999)
  DSP (Lawrie et al., 2001)
  CAARD (Kummamuru and Krishnapuram, 2001)
  DisCover (this paper)
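
To make the monothetic/polythetic distinction concrete, here is a minimal Python sketch. The toy corpus, the prototype term sets, and the function names are all hypothetical, and Jaccard overlap is just one possible stand-in for a multi-feature similarity; none of this is taken from the paper.

```python
from collections import defaultdict

# Hypothetical toy corpus: each document is a set of features (terms).
docs = {
    "d1": {"jaguar", "car", "engine"},
    "d2": {"jaguar", "cat", "jungle"},
    "d3": {"car", "engine", "dealer"},
}

def monothetic_clusters(docs, features):
    """Monothetic: a document joins a cluster iff it contains that one feature."""
    clusters = defaultdict(set)
    for doc_id, terms in docs.items():
        for feature in features:
            if feature in terms:
                clusters[feature].add(doc_id)
    return dict(clusters)

def jaccard(a, b):
    """Overlap between two term sets, computed over all features at once."""
    return len(a & b) / len(a | b)

def polythetic_assign(docs, prototypes):
    """Polythetic: assign each document to the prototype it overlaps most."""
    return {
        doc_id: max(prototypes, key=lambda name: jaccard(terms, prototypes[name]))
        for doc_id, terms in docs.items()
    }

mono = monothetic_clusters(docs, ["car", "cat"])
poly = polythetic_assign(
    docs,
    {"vehicles": {"car", "engine", "dealer"}, "animals": {"cat", "jungle"}},
)
```

In the monothetic case the cluster label is the feature itself ("car", "cat"), which is why monothetic hierarchies tend to be easy to label.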

DisCover

Progressively grow the hierarchy
Coverage and compactness tradeoff
Generate an optimal permuted sequence of the concepts under a node.
  Every document is represented as a set of concepts; “concepts under the node” means all the other concepts in the documents covered by the node.
Select an optimal subset of the concepts with maximal coverage and distinctiveness

Question: does the number of child nodes have to be preset?
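
The notion of "concepts under the node" can be sketched as follows. This is a minimal illustration that assumes a node covers exactly the documents containing every concept on the path from the root to that node; the corpus and the helper name `concepts_under` are hypothetical.

```python
# Hypothetical toy corpus: each document is a set of concepts.
docs = {
    "d1": {"jaguar", "car", "engine"},
    "d2": {"jaguar", "cat"},
    "d3": {"car", "dealer"},
}

def concepts_under(docs, path):
    """Return (documents covered by the node, candidate child concepts).

    `path` lists the concepts from the root down to the node. Assumption:
    a document is covered iff it contains every concept on the path.
    """
    covered = set(docs)
    for concept in path:
        covered = {d for d in covered if concept in docs[d]}
    # All the other concepts appearing in the covered documents.
    candidates = set().union(*(docs[d] for d in covered)) - set(path)
    return covered, candidates

covered, candidates = concepts_under(docs, ["jaguar"])
```

For the node "jaguar", the covered documents are d1 and d2, and the candidate child concepts are "car", "engine", and "cat".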

DisCover

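$c_k = \arg\max_{c_j \in U_{k-1}} g(S_{k-1}, c_j)$

$g(S_{k-1}, c_j) = w_1 \, g_c(S_{k-1}, c_j) + w_2 \, g_d(S_{k-1}, c_j)$

Coverage: $g_c(S_{k-1}, c_j) = |d(c_j) \setminus d(S_{k-1})|$

Distinctiveness: $g_d(S_{k-1}, c_j) = |t(c_j) \setminus t(S_{k-1})|$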
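
A greedy selection of this kind could be sketched as below, assuming the score g is a weighted sum of a coverage term g_c and a distinctiveness term g_d with weights w_1 and w_2. Several pieces are assumptions rather than the paper's definitions: d(c) is taken to be the set of documents containing c, t(c) is taken to be the set of concepts co-occurring with c (a hypothetical stand-in, since the slide does not define t), both terms count only what is new relative to the concepts already selected, and ties are broken alphabetically.

```python
# Hypothetical toy corpus: each document is a set of concepts.
docs = {
    "d1": {"car", "engine"},
    "d2": {"car", "dealer"},
    "d3": {"cat", "jungle"},
    "d4": {"car", "engine", "dealer"},
}

def select_sequence(docs, candidates, k, w1=0.5, w2=0.5):
    """Greedily pick up to k concepts, maximizing w1*coverage + w2*distinctiveness."""
    def d(concept):
        # Assumed meaning of d(c): documents containing the concept.
        return {i for i, terms in docs.items() if concept in terms}

    def t(concept):
        # Assumed meaning of t(c): concepts co-occurring with the concept.
        return set().union(*(docs[i] for i in d(concept))) - {concept}

    sequence, d_seen, t_seen = [], set(), set()
    remaining = set(candidates)
    while remaining and len(sequence) < k:
        def g(concept):
            coverage_gain = len(d(concept) - d_seen)
            distinct_gain = len(t(concept) - t_seen)
            return w1 * coverage_gain + w2 * distinct_gain

        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=g)
        sequence.append(best)
        d_seen |= d(best)
        t_seen |= t(best)
        remaining.remove(best)
    return sequence

seq = select_sequence(docs, ["car", "cat", "engine"], k=2)
```

On this toy corpus "car" is picked first because it covers three documents; "engine" then adds no new documents, so "cat" is picked second.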

CAARD (Kummamuru and Krishnapuram, 2001)

Flow: corpus -> concepts -> ($ID_{ij}$) -> top-level Min_subset and Rest subset -> recursive

Inclusion Degree: $ID_{ij} = |w_i \cap w_j| / |w_i|$
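
As a rough illustration of the inclusion degree, the sketch below assumes that w_i is the set of documents in which concept i occurs (the slide does not define w_i) and that the degree is the share of w_i that also lies in w_j. The example sets and the function name are hypothetical.

```python
def inclusion_degree(w_i, w_j):
    """Fraction of w_i that is also in w_j; 0.0 if w_i is empty."""
    if not w_i:
        return 0.0
    return len(w_i & w_j) / len(w_i)

# Hypothetical document sets for two concepts.
w_car = {"d1", "d2", "d3", "d4"}
w_engine = {"d1", "d2"}

id_engine_car = inclusion_degree(w_engine, w_car)
id_car_engine = inclusion_degree(w_car, w_engine)
```

The measure is asymmetric: every "engine" document also mentions "car" (degree 1.0), while only half of the "car" documents mention "engine" (degree 0.5), which suggests "car" as the more general concept.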

DSP (Lawrie et al., 2001)

Flow: corpus -> topic terms and vocabulary terms -> top-level topic terms

Top-level topic terms: maximal predictive power and vocabulary coverage

Language model: A: topic term; B: vocabulary; A = B

Recursion: A <- subtopic terms around the topic; B = A?

$\Pr_x(A \mid B)$
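
To give a feel for a conditional probability such as Pr(A | B), the sketch below estimates it from document-level co-occurrence counts. This is a deliberate simplification and not DSP's actual language model: it ignores the subscript x, which the slide does not explain, and the corpus and function name are hypothetical.

```python
# Hypothetical toy corpus: each document is a set of terms.
docs = [
    {"jaguar", "car"},
    {"jaguar", "cat"},
    {"car", "engine"},
    {"car"},
]

def cond_prob(docs, a, b):
    """Estimate Pr(a | b) as the share of documents containing b that also contain a."""
    with_b = [terms for terms in docs if b in terms]
    if not with_b:
        return 0.0
    return sum(a in terms for terms in with_b) / len(with_b)

p_car_given_jaguar = cond_prob(docs, "car", "jaguar")
p_jaguar_given_car = cond_prob(docs, "jaguar", "car")
```

A term A with high Pr(A | B) for many vocabulary terms B is, in this simplified sense, predictive of a large part of the vocabulary.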

Evaluation

In general:
  Precision
  F-measure
  User study
  Summary evaluation (EMIM vs. TF*IDF)
  Reachability
  Reach time

This paper compares:
  Computational complexity
  Coverage and compactness
  Reach time
  User study
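
Two of these measures are simple enough to sketch. Precision, recall, and F-measure below follow their standard definitions for one cluster against one gold-standard class. The coverage function assumes coverage means the fraction of documents reachable from at least one top-level node, which may differ from the paper's exact definition. All data and names are hypothetical.

```python
def precision_recall_f1(cluster, relevant):
    """Standard precision, recall, and F1 of a cluster against a relevant set."""
    true_positives = len(cluster & relevant)
    precision = true_positives / len(cluster) if cluster else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def coverage(all_docs, node_docs):
    """Assumed definition: share of all documents covered by at least one node."""
    covered = set().union(*node_docs)
    return len(covered & all_docs) / len(all_docs)

# Hypothetical example data.
cluster = {"d1", "d2", "d3"}
relevant = {"d2", "d3", "d4", "d5"}
p, r, f1 = precision_recall_f1(cluster, relevant)

all_docs = {"d1", "d2", "d3", "d4", "d5"}
cov = coverage(all_docs, [{"d1", "d2"}, {"d2", "d3"}])
```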

Results

Questions

How does the performance change as the number of nodes increases further (beyond 9)?

How exactly is the concept sequence mapped to the tree structure?
