Post on 25-May-2020
Text Document Clustering and Similarity Measures
By: Pranjal Singh (10511), Mohit Sharma (11434), Department of Computer Science & Engineering
o Problem and the Approach
o Document Representation
o Metrics and Similarity measures
o Clustering Algorithms
o Evaluation
o Work done so far
o Work Remaining
The Problem
The ever-increasing volume of text documents has brought challenges for their effective and efficient organization.
Clustering organizes a large quantity of unordered data into a small number of meaningful and coherent clusters.
No single similarity measure or clustering algorithm outperforms all others in all domains.
Approach at a high level
Similarity measures quantify how similar or different two documents are
In this work we compare clustering algorithms that use five different similarity measures and contrast their effectiveness on different text domains.
We are also evaluating cluster qualities based on purity and entropy measures.
Document Representation
We are using the 'bag of words' model.
Each word corresponds to a dimension in the resulting data space.
Each document then becomes a vector consisting of non-negative values on each dimension.
Document Representation
Let D = {d1,d2,d3,d4 ....} be the set of documents and T = {t1,t2,t3,t4....tm} be the set of unique terms in D.
A document is then represented as an m-dimensional vector, td = (tf(d, t1), . . . , tf(d, tm)).
Pre-processing of documents is necessary to make computations more efficient as well as faster. More on this later.
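As an illustration (our actual pipeline is in MATLAB), the bag-of-words tf-idf representation above can be sketched in Python on a hypothetical two-document corpus; the term lists are invented for the example:

```python
import math

def tfidf_vectors(docs):
    """Build tf-idf vectors over the set T of unique terms in D."""
    terms = sorted({t for d in docs for t in d})               # T = {t1, ..., tm}
    df = {t: sum(1 for d in docs if t in d) for t in terms}    # df(t)
    n = len(docs)                                              # |D|
    vecs = []
    for d in docs:
        # tf(d, t) * log(|D| / df(t)) for each dimension t
        vecs.append([d.count(t) * math.log(n / df[t]) for t in terms])
    return terms, vecs

terms, vecs = tfidf_vectors([["apple", "banana"], ["apple", "cherry"]])
```

Note that a term appearing in every document (here "apple") gets weight 0, since log(|D|/df(t)) = log(1) = 0; this is why pruning very frequent words changes little for tf-idf.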
Metric
Not every distance measure is a metric.
To be a metric, a distance measure d must satisfy the following:
1. d(x, y) >= 0 (non-negativity)
2. d(x, y) = 0 if and only if x = y (identity of indiscernibles)
3. d(x, y) = d(y, x) (symmetry)
4. d(x, z) <= d(x, y) + d(y, z) (triangle inequality)
Euclidean Distance
Given two documents da and db represented by their term vectors ta and tb respectively, the Euclidean distance is given simply as:
D_E(ta, tb) = sqrt( Σ_t (w_{t,a} − w_{t,b})² )
where D_E = distance between the vectors,
w_{t,a} and w_{t,b} are weights as given by tf-idf values, i.e. w_{t,a} = tfidf(da, t),
tfidf(d, t) = tf(d, t) * log(|D| / df(t)),
|D| = number of documents,
df(t) = number of documents in which term t appears.
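A minimal Python sketch of the Euclidean distance between two weight vectors (the 3-4-5 values are just illustrative, not tf-idf output):

```python
import math

def euclidean(w_a, w_b):
    """D_E = sqrt(sum over terms of (w_{t,a} - w_{t,b})^2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(w_a, w_b)))

d = euclidean([3.0, 0.0, 4.0], [0.0, 0.0, 0.0])  # classic 3-4-5 triangle
```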
Cosine Similarity
Quantifies the correlation between vectors ta and tb as the cosine of the angle between them in m-dimensional space:
SIM_C(ta, tb) = (ta · tb) / (|ta| * |tb|)
where SIM_C = cosine similarity, and ta and tb are vectors containing weights corresponding to each dimension.
Bounded between [0, 1] for non-negative weights and independent of document length.
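A sketch of cosine similarity in Python; the example vectors are invented to show the two boundary cases (parallel and orthogonal vectors):

```python
import math

def cosine_sim(w_a, w_b):
    """SIM_C = dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(w_a, w_b))
    na = math.sqrt(sum(x * x for x in w_a))
    nb = math.sqrt(sum(y * y for y in w_b))
    return dot / (na * nb)

s_same = cosine_sim([1.0, 2.0], [2.0, 4.0])   # parallel vectors
s_orth = cosine_sim([1.0, 0.0], [0.0, 1.0])   # no shared terms
```

Length independence is visible in the first call: [2, 4] is twice [1, 2], yet the similarity is maximal.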
Mahalanobis Distance
Differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant:
d_st = sqrt( (xs − xt) C^{-1} (xs − xt)^T )
where
d_st = distance between the vectors,
xs and xt are weight vectors as given by tf-idf values,
C is the covariance matrix.
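A self-contained sketch of the Mahalanobis distance, restricted to two dimensions so the 2x2 covariance inverse can be hand-coded (real term spaces need a numerical library for C^{-1}). With the identity covariance it reduces to Euclidean distance, which the example checks:

```python
def mahalanobis_2d(x, y, cov):
    """d = sqrt((x - y) C^{-1} (x - y)^T) for 2-D points; cov = [[a, b], [b, c]]."""
    a, b = cov[0]
    _, c = cov[1]
    det = a * c - b * b
    inv = [[c / det, -b / det], [-b / det, a / det]]   # closed-form 2x2 inverse
    d0, d1 = x[0] - y[0], x[1] - y[1]
    q = d0 * (inv[0][0] * d0 + inv[0][1] * d1) + d1 * (inv[1][0] * d0 + inv[1][1] * d1)
    return q ** 0.5

# Identity covariance: should equal the Euclidean distance sqrt(3^2 + 4^2) = 5
d = mahalanobis_2d([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```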
Jaccard Coefficient
Measures similarity as the intersection divided by the union of the objects:
SIM_J(ta, tb) = (ta · tb) / (|ta|² + |tb|² − ta · tb)
For text documents, the Jaccard coefficient compares the sum weight of shared terms to the sum weight of terms that are present in either of the two documents but are not shared.
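One common weighted form of the Jaccard coefficient for term vectors (the extended, or Tanimoto, form: dot / (|a|² + |b|² − dot)) can be sketched as:

```python
def jaccard_sim(w_a, w_b):
    """Extended (Tanimoto) Jaccard coefficient on weight vectors."""
    dot = sum(x * y for x, y in zip(w_a, w_b))
    na2 = sum(x * x for x in w_a)
    nb2 = sum(y * y for y in w_b)
    return dot / (na2 + nb2 - dot)

s_id = jaccard_sim([1.0, 2.0], [1.0, 2.0])    # identical documents
s_disj = jaccard_sim([1.0, 0.0], [0.0, 1.0])  # no shared terms
```

For 0/1 term-presence vectors this reduces to the familiar |intersection| / |union|.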
Pearson Correlation
Takes many different forms; we are using the following formula in our work:
SIM_P(ta, tb) = (m Σ_t w_{t,a} w_{t,b} − TF_a TF_b) / sqrt( [m Σ_t w²_{t,a} − TF_a²] [m Σ_t w²_{t,b} − TF_b²] )
where TF_a = Σ_t w_{t,a} and TF_b = Σ_t w_{t,b}.
Ranges from [−1, 1].
In subsequent experiments we use the corresponding distance measure, which is D_P = 1 − SIM_P when SIM_P ≥ 0 and D_P = |SIM_P| when SIM_P < 0.
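A sketch of Pearson correlation, written in the equivalent mean-centered form, together with the piecewise distance D_P described above; the example vectors are invented:

```python
def pearson_sim(w_a, w_b):
    """Pearson correlation of two weight vectors (mean-centered form)."""
    m = len(w_a)
    ma, mb = sum(w_a) / m, sum(w_b) / m
    num = sum((x - ma) * (y - mb) for x, y in zip(w_a, w_b))
    den = (sum((x - ma) ** 2 for x in w_a) * sum((y - mb) ** 2 for y in w_b)) ** 0.5
    return num / den

def pearson_dist(w_a, w_b):
    """D_P = 1 - SIM_P when SIM_P >= 0, and |SIM_P| when SIM_P < 0."""
    s = pearson_sim(w_a, w_b)
    return 1 - s if s >= 0 else abs(s)

s = pearson_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # perfectly correlated
```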
Clustering Algorithms
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters.
Agglomerative: a "bottom up" approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: a "top down" approach in which all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
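The agglomerative variant can be sketched in a few lines of Python; this uses single linkage (closest pair of points between clusters), which is one of several possible linkage choices, and 1-D toy points:

```python
def agglomerative(points, k, dist):
    """Single-linkage agglomerative clustering: start from singletons,
    repeatedly merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
out = agglomerative([(0.0,), (0.1,), (5.0,), (5.2,)], 2, euclid)
```

The O(n³) cost mentioned below is visible here: each merge scans all pairs of clusters, and there are up to n merges.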
Clustering Algorithms
The k-means algorithm aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a representative of the cluster.
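A minimal sketch of k-means (Lloyd's algorithm) in Python, using Euclidean assignment and a fixed iteration budget; the 2-D toy points and seed are illustrative:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: assign each point to its nearest mean, recompute means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                 # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        centers = [[sum(dim) / len(c) for dim in zip(*c)] if c else centers[i]
                   for i, c in enumerate(clusters)]  # keep old center if cluster empties
    return centers, clusters

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centers, clusters = kmeans(pts, 2)
```

Swapping the squared-Euclidean expression in the assignment step for another measure is how the five similarity measures plug into the same loop.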
Comparisons
Hierarchical Algorithms:
Agglomerative: O(n³)
Divisive: O(2^n)
K-means Algorithms:
Various implementations with different
heuristics. All run in polynomial time.
Datasets
20 news: news articles on different topics
Classic: abstracts of scientific papers
Hitech: San Jose newspaper articles
tr41: from TREC collection of articles
wap : web pages collection
webkb : another web page dataset
r0 : standard cluster testing database.
Evaluation
Evaluating and contrasting cluster quality objectively is a difficult task in itself.
In practice, manually assigned category labels are usually used as a baseline criterion for evaluating clusters.
As a result, the clusters, which are generated in an unsupervised way, are compared to the pre-defined category structure, which is normally created by human experts.
This kind of evaluation assumes that the objective of clustering is to replicate human thinking, so a clustering solution is good if the clusters are consistent with the manually created categories.
Evaluation: Purity Measure
Measures the coherence of a cluster, i.e. the degree to which a cluster contains documents from a single category:
Purity(Cj) = (1/|Cj|) max_k n_{jk}
where n_{jk} is the number of documents in cluster Cj that belong to category k.
For an ideal cluster, which only contains documents from a single category, the purity value is 1. In general, the higher the purity value, the better the quality of the cluster.
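Overall purity is usually the size-weighted average of per-cluster purities; a sketch, with invented category labels:

```python
def purity(clusters):
    """clusters: list of lists of true category labels for each cluster's members.
    Per-cluster purity = fraction of members in the dominant category;
    overall purity = size-weighted average over clusters."""
    total = sum(len(c) for c in clusters)
    dominant = sum(max(c.count(lbl) for lbl in set(c)) for c in clusters)
    return dominant / total

# One mixed cluster (dominant: 2 of 3) and one pure cluster (3 of 3)
p = purity([["news", "news", "sport"], ["sport", "sport", "sport"]])
```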
Evaluation: Entropy Measure
The entropy measure evaluates the distribution of categories in a given cluster:
E(Cj) = − Σ_k (n_{jk}/|Cj|) log(n_{jk}/|Cj|)
where n_{jk} is the number of documents in cluster Cj that belong to category k.
The entropy measure is more comprehensive than purity: rather than considering only the number of objects in and not in the dominant category of a cluster, it considers the overall distribution of all the categories in a given cluster.
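A sketch of per-cluster entropy (base-2 logarithm here; the base is a convention and only rescales values):

```python
import math

def cluster_entropy(cluster):
    """Entropy of the category distribution within one cluster.
    0 for a pure cluster; larger when categories are more mixed."""
    n = len(cluster)
    probs = [cluster.count(lbl) / n for lbl in set(cluster)]
    return -sum(p * math.log2(p) for p in probs)

e_pure = cluster_entropy(["news", "news", "news"])   # single category
e_mixed = cluster_entropy(["news", "sport"])         # 50/50 split
```

Lower entropy means better clusters, the opposite orientation from purity.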
Experiments
We plan to use both hierarchical and k-means clustering with all the mentioned similarity measures on datasets from different domains.
We shall then use purity and entropy to measure the quality of the clusters that the two clustering algorithms give with the five similarity measures.
We hope to critique the effectiveness of the similarity measures based on cluster quality.
Past Work and Results
The paper by Anna Huang claims the following results for similar experiments.
We hope to do better owing to the use of stemming and better feature selection by PCA.
Work done
• Decided on and obtained the datasets. Created one ourselves; manual labeling had to be done for some documents.
• Repeatedly pruned the documents before making idf matrices for the datasets: removed words below a certain threshold frequency and also irrelevant high-frequency words.
• Wrote code for creating the tf(d,t) and tfidf(d,t) matrices from documents to .mat files.
Work done
Tried the clustering algorithms on small datasets in MATLAB. Trying to figure out a way to make them run faster on sparse matrices.
Integrating the five similarity measures with the clustering algorithms; the default is Euclidean distance.
Work Remaining
Optimization of code for better running times.
Cluster analysis using the purity and entropy measures.
Quantified results for the effectiveness of the different similarity measures used.
If time permits, we would like to improve upon document representation; stemming and some semantic knowledge look promising for improving cluster coherence.
References
[1] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
[2] J. M. Neuhaus and J. D. Kalbfleisch. Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics, 54(2):638–645, Jun. 1998.
[3] A. Huang. Similarity measures for text document clustering.
[4] B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
Datasets:
[1] CLUTO package : http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download
[2] 20 news : http://qwone.com/~jason/20Newsgroups/