A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results

Kummamuru et al.

Presented by Bei Yu

Sept. 22nd, 2004

Roadmap

Properties of topic hierarchy
Automatic taxonomy generation (ATG)
Monothetic ATG
DisCover algorithm
CAARD algorithm
DSP algorithm
Result comparison
Questions

Generating Topic Hierarchy (Taxonomy)

Desirable properties of a topic hierarchy:
  Document coverage
  Compactness (breadth/depth, number of nodes)
  Sibling node distinctiveness
  Node label predictiveness
  General to specific
  Reach time

Monothetic ATG

Automatic Taxonomy Generation (ATG): monothetic vs. polythetic
  Monothetic: cluster assignment based on a single feature
  Polythetic: cluster assignment based on multiple features
Clustering keywords vs. documents vs. both
Top-down vs. bottom-up
Monothetic ATG algorithms:
  Subsumption algorithm (Sanderson and Croft, 1999)
  DSP (Lawrie et al., 2001)
  CAARD (Kummamuru and Krishnapuram, 2001)
  DisCover (this paper)
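
To make the monothetic/polythetic distinction concrete, here is a minimal sketch. The toy documents, the function names, and the choice of Jaccard similarity as the polythetic criterion are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def monothetic_clusters(doc_concepts):
    """Monothetic assignment: one cluster per concept, and a document joins
    a cluster solely because it contains that single concept."""
    clusters = defaultdict(set)
    for doc_id, concepts in doc_concepts.items():
        for concept in concepts:
            clusters[concept].add(doc_id)
    return dict(clusters)


def jaccard(a, b):
    """A polythetic criterion (assumed for illustration): similarity is
    computed over all features of both documents at once."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data: each document is a set of concepts.
docs = {
    "d1": {"jaguar", "car"},
    "d2": {"jaguar", "cat"},
    "d3": {"car", "engine"},
}
clusters = monothetic_clusters(docs)
similarity_d1_d2 = jaccard(docs["d1"], docs["d2"])
```

In the monothetic case the cluster label is the defining concept itself, which is why monothetic hierarchies are easy to label.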

DisCover

Progressively grow the hierarchy
Coverage and compactness tradeoff
Generate an optimal permuted sequence of the concepts under a node
  Every document is represented as a set of concepts; “concepts under the node” means all the other concepts in the documents covered by the node
Select an optimal subset of the concepts with maximal coverage and distinctiveness

Question: does the number of child nodes have to be preset?

DisCover

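\[ c_k = \arg\max_{c_j \in U_{k-1}} g(S_{k-1}, c_j) \]

\[ g(S_{k-1}, c_j) = w_1\, g_c(S_{k-1}, c_j) + w_2\, g_d(S_{k-1}, c_j) \]

Coverage:

\[ g_c(S_{k-1}, c_j) = |d(c_j) \setminus d(S_{k-1})| \]

Distinctiveness:

\[ g_d(S_{k-1}, c_j) = |t(c_j) \setminus t(S_{k-1})| \]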
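
The greedy selection on this slide, a weighted sum of a coverage term and a distinctiveness term, can be sketched as below. This is a sketch under stated assumptions, not the authors' implementation: d(c) is taken to be the set of documents containing concept c, t(c) the other concepts in those documents, both gains are counted as set differences against what is already covered, and ties are broken alphabetically. The function name and toy data are hypothetical.

```python
def discover_sequence(doc_concepts, w1=1.0, w2=1.0):
    """Greedily order concepts by w1 * coverage gain + w2 * distinctiveness gain."""
    # d(c): documents that contain concept c.
    d = {}
    for doc_id, concepts in doc_concepts.items():
        for c in concepts:
            d.setdefault(c, set()).add(doc_id)

    # t(c): the other concepts in the documents covered by c (assumed reading).
    t = {}
    for c, members in d.items():
        terms = set()
        for doc_id in members:
            terms |= doc_concepts[doc_id]
        t[c] = terms - {c}

    def gain(c, covered_docs, covered_terms):
        g_c = len(d[c] - covered_docs)    # newly covered documents
        g_d = len(t[c] - covered_terms)   # concepts not yet seen under siblings
        return w1 * g_c + w2 * g_d

    sequence, covered_docs, covered_terms = [], set(), set()
    remaining = set(d)
    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining),
                   key=lambda c: gain(c, covered_docs, covered_terms))
        sequence.append(best)
        covered_docs |= d[best]
        covered_terms |= t[best]
        remaining.remove(best)
    return sequence


# Hypothetical toy corpus: document id -> set of concepts.
docs = {"d1": {"a", "b"}, "d2": {"a", "c"}, "d3": {"d"}}
order = discover_sequence(docs)
coverage_only = discover_sequence(docs, w2=0.0)
```

The sketch only produces the permuted sequence; choosing how many leading concepts become child nodes is a separate step that it does not attempt.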

CAARD (Kummamuru and Krishnapuram, 2001)

[Diagram] corpus -> concepts -> top-level Min_subset and Rest subset -> recursive (steps labelled with ID_{ij})

Inclusion degree:

\[ ID_{ij} = \frac{|w_i \cap w_j|}{|w_i|} \]
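
A minimal sketch of the inclusion degree, assuming w_i denotes the set of documents that contain concept i and that the numerator is the size of the intersection. The function name and example sets are hypothetical.

```python
def inclusion_degree(w_i, w_j):
    """ID_ij = |w_i & w_j| / |w_i|: the fraction of w_i that also lies in w_j."""
    if not w_i:
        return 0.0  # convention chosen here for an empty set
    return len(w_i & w_j) / len(w_i)


# Hypothetical document sets for two concepts.
w_car = {"d1", "d2", "d3", "d4"}
w_engine = {"d2", "d3"}

id_engine_in_car = inclusion_degree(w_engine, w_car)
id_car_in_engine = inclusion_degree(w_car, w_engine)
```

The measure is asymmetric: a narrow concept can be fully included in a broad one without the reverse holding, which is what makes it usable for parent/child decisions.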

DSP (Lawrie et al., 2001)

[Diagram] corpus -> topic terms and vocabulary terms -> top-level topic terms -> recursion

Selection criterion: maximal predictive power and vocabulary coverage
Language model: \( \Pr_x(A \mid B) \), with A: topic term; B: vocabulary; A = B
Recursion: A <- subtopic terms around the topic; B = A?
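
One way to estimate a conditional probability of the form Pr_x(A | B) from text is sketched below. Reading the subscript x as a co-occurrence window measured in tokens is an assumption, and the estimator, function name, and sample sentence are illustrative rather than the DSP authors' language model.

```python
def windowed_cond_prob(tokens, a, b, x):
    """Fraction of occurrences of b that have a within x tokens on either side."""
    positions_b = [i for i, tok in enumerate(tokens) if tok == b]
    if not positions_b:
        return 0.0
    hits = 0
    for i in positions_b:
        # Up to x tokens before and x tokens after position i, excluding i itself.
        window = tokens[max(0, i - x):i] + tokens[i + 1:i + 1 + x]
        if a in window:
            hits += 1
    return hits / len(positions_b)


# Hypothetical sample text.
tokens = "the jaguar car has a fast engine and the jaguar cat hunts".split()
p_car_given_jaguar = windowed_cond_prob(tokens, "car", "jaguar", 2)
p_jaguar_given_car = windowed_cond_prob(tokens, "jaguar", "car", 2)
```

A term A with high Pr_x(A | B) across many vocabulary terms B predicts much of the vocabulary, which is the intuition behind "maximal predictive power and vocabulary coverage".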

Evaluation

In general:
  Precision
  F-measure
  User study
  Summary evaluation (EMIM vs. TF*IDF)
  Reachability
  Reach time

This paper compares:
  Computational complexity
  Coverage and compactness
  Reach time
  User study

Results

Questions

How does the performance change as the number of nodes increases further (beyond 9)?

How exactly is the concept sequence mapped to the tree structure?