A matrix density based algorithm to hierarchically co-cluster documents and words
![Page 1: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/1.jpg)
A matrix density based algorithm to hierarchically co-cluster documents and words
Advisor: Dr. Hsu
Graduate: Keng-Wei Chang
Authors: Bhushan Mandhani, Sachindra Joshi, Krishna Kummamuru
![Page 2: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/2.jpg)
Outline
Motivation
Objective
Introduction
Background
Rowset Partitioning and Submatrix Agglomeration (RPSA)
Experimental results
Conclusions
Personal Opinion
![Page 3: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/3.jpg)
Motivation
With the explosion of unstructured information, it has become increasingly important to organize the information in a comprehensible and navigable manner.
![Page 4: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/4.jpg)
Objective
A hierarchical arrangement of documents is very useful for browsing a document collection, as the popularity of Yahoo and Google shows.
This paper proposes an algorithm to hierarchically cluster documents.
![Page 5: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/5.jpg)
Introduction
1990s: 100 thousand pages; 2002: 2 billion pages
It has become increasingly important to organize the information
Manual organization is accurate, but not always feasible
Tools are needed to automatically arrange documents into labeled hierarchies
The paper proposes RPSA, a two-step partitional-agglomerative algorithm
![Page 6: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/6.jpg)
Background
Vector Model for Documents
Evaluation of Clustering Quality
Evaluation of Hierarchical Clustering
![Page 7: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/7.jpg)
Vector Model for Documents
We have d documents
Document i is represented by a vector $m_i$
$t_{ij}$ is the number of occurrences of word j in document i
Term Frequency, TF
Inverse Document Frequency, IDF
Unitized TF-IDF
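The slide names TF, IDF and unitized TF-IDF, but its formulas did not survive in this transcript. The sketch below shows one common way to build such vectors from the counts $t_{ij}$. The IDF form log(d / df_j), the Euclidean unit normalization, and the function name `unitized_tfidf` are assumptions made for illustration, not necessarily the weighting used in the paper.

```python
import math

def unitized_tfidf(counts):
    """counts[i][j] = t_ij, the number of occurrences of word j in document i."""
    d = len(counts)
    w = len(counts[0])
    # Document frequency: number of documents that contain word j.
    df = [sum(1 for i in range(d) if counts[i][j] > 0) for j in range(w)]
    # Assumed IDF form: log(d / df_j). Words that never occur get weight 0.
    idf = [math.log(d / df[j]) if df[j] else 0.0 for j in range(w)]
    rows = []
    for i in range(d):
        row = [counts[i][j] * idf[j] for j in range(w)]
        # Assumed "unitized" meaning: scale each vector to Euclidean length 1.
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm for x in row] if norm else row)
    return rows
```

Under these assumptions a word that occurs in every document receives weight 0, so it does not influence the clustering.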
![Page 8: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/8.jpg)
Evaluation of Clustering Quality
1. Purity:
2. Entropy: $E_j = -\sum_{i=1}^{g} p_{ij} \log p_{ij}$
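The two measures can be sketched for a single cluster as follows. The sketch assumes that $p_{ij}$ is the fraction of documents in cluster j that belong to class i, and that purity is the largest such fraction. That is the usual definition, but the purity formula is missing from this transcript, so treat it as an assumption. The function name `cluster_purity_entropy` is hypothetical.

```python
import math
from collections import Counter

def cluster_purity_entropy(class_labels):
    """class_labels: true class of each document in one non-empty cluster j."""
    n = len(class_labels)
    # p_ij for every class i that actually occurs in the cluster.
    p = [count / n for count in Counter(class_labels).values()]
    purity = max(p)                             # assumed definition: max_i p_ij
    entropy = -sum(x * math.log(x) for x in p)  # E_j = -sum_i p_ij log p_ij
    return purity, entropy
```

Classes that do not occur in the cluster are simply absent from the sum, which matches the convention that 0 log 0 = 0.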
![Page 9: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/9.jpg)
Evaluation of Hierarchical Clustering
![Page 10: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/10.jpg)
Rowset Partitioning and Submatrix Agglomeration (RPSA)
A two-step partitional-agglomerative algorithm
Step 1: The Partitioning Step
Step 2: The Agglomerative Step
![Page 11: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/11.jpg)
The Partitioning Step
Define the density of submatrices
a row r, a column c
a set R of rows , a set C of columns
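The density formula itself is not preserved in this transcript. As a rough illustration, the sketch below assumes that the density of the submatrix selected by a row set R and a column set C is the mean of its entries. A single row r or column c is then the special case of a one-element set. Both the definition and the function name `density` are assumptions.

```python
def density(matrix, rows, cols):
    """Mean of matrix[r][c] over r in rows and c in cols (assumed definition)."""
    total = sum(matrix[r][c] for r in rows for c in cols)
    return total / (len(rows) * len(cols))
```

For example, with `matrix = [[1, 0, 2], [0, 0, 4], [3, 0, 0]]`, the call `density(matrix, [0, 1], [0, 2])` averages the four entries 1, 2, 0 and 4.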
![Page 12: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/12.jpg)
The Partitioning Step
Generating a Leaf Cluster
![Page 13: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/13.jpg)
The Partitioning Step
Choice of Leader Documents
Length of a document: the sum of the entries of the TF-IDF vector representing that document
Documents with relatively large lengths were observed to be better leader documents for the algorithm above
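A minimal sketch of this heuristic, assuming the length of a document is the sum of its TF-IDF entries and that the leader is simply the longest document among those not yet clustered. The slide only says that relatively long documents work better, so taking the single longest one is an assumption. The name `pick_leader` is hypothetical.

```python
def pick_leader(tfidf_rows, candidates):
    """Return the index of the candidate document with the largest length."""
    # Length of document i = sum of the entries of its TF-IDF vector.
    return max(candidates, key=lambda i: sum(tfidf_rows[i]))
```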
![Page 14: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/14.jpg)
The Partitioning Step
The Complete Partitioning Algorithm
![Page 15: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/15.jpg)
The Partitioning Step
Complexity Analysis
The time complexity is O(mz)
The space complexity is O(z)
![Page 16: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/16.jpg)
The Agglomerative Step
Reduce the number of clusters
The similarity measure between two clusters for merging
Flat Clustering
Hierarchical Clustering
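The general shape of an agglomerative step can be sketched as below: repeatedly merge the most similar pair of clusters. The transcript does not preserve the similarity measure that RPSA uses, so cosine similarity between cluster centroids is swapped in here purely as a stand-in. The names `cosine`, `centroid` and `agglomerate` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors, members):
    """Component-wise mean of the vectors of the documents in one cluster."""
    n = len(members)
    return [sum(vectors[i][j] for i in members) / n
            for j in range(len(vectors[0]))]

def agglomerate(vectors, clusters, k):
    """Merge the most similar pair of clusters until only k remain.

    Returns the final clusters and the list of merges in the order made.
    """
    clusters = [list(c) for c in clusters]
    history = []
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Stand-in similarity, not the measure from the paper.
                s = cosine(centroid(vectors, clusters[a]),
                           centroid(vectors, clusters[b]))
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        history.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, history
```

Stopping at a chosen k gives a flat clustering. Running down to k = 1 and keeping the merge history gives a hierarchy over the leaf clusters.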
![Page 17: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/17.jpg)
The Agglomerative Step
Complexity Analysis
The time complexity is $O(m^2 z)$
The space complexity is $O(m^2)$
![Page 18: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/18.jpg)
Experimental results-Flat Clustering
Data Sets
![Page 19: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/19.jpg)
Experimental results-Flat Clustering
Results
![Page 20: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/20.jpg)
Experimental results-Flat Clustering
![Page 21: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/21.jpg)
Experimental results-Hierarchical Clustering
Data Sets
![Page 22: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/22.jpg)
Experimental results-Hierarchical Clustering
Data Sets
![Page 23: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/23.jpg)
Experimental results-Hierarchical Clustering
Results
![Page 24: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/24.jpg)
Conclusions
RPSA is comparable with or better than the best k-means run
Its performance does not degrade on small data sets
Its purity on hierarchical clustering is acceptable
![Page 25: A matrix density based algorithm to hierarchically co-cluster documents and words](https://reader036.fdocuments.in/reader036/viewer/2022062720/568134f7550346895d9c40af/html5/thumbnails/25.jpg)
Personal Opinion