Post on 25-May-2020
Text Document Clustering and Similarity Measures
By: Pranjal Singh (10511), Mohit Sharma (11434), Department of Computer Science & Engineering
o Problem and the Approach
o Document Representation
o Metrics and Similarity measures
o Clustering Algorithms
o Evaluation
o Work done so far
o Work Remaining
The Problem
The ever-increasing volume of text documents has brought challenges for their effective and efficient organization.
Clustering organizes a large quantity of unordered data into a small number of meaningful and coherent clusters.
No single similarity measure or clustering algorithm outperforms all others in all domains.
Approach at a high level
Similarity measures quantify how similar or different two documents are
In this work we compare clustering algorithms that use five different similarity measures and contrast their effectiveness on different text domains.
We are also evaluating cluster qualities based on purity and entropy measures.
Document Representation
We are using the 'bag of words' model.
Each word corresponds to a dimension in the resulting data space.
Each document then becomes a vector consisting of non-negative values on each dimension.
Document Representation
Let D = {d1,d2,d3,d4 ....} be the set of documents and T = {t1,t2,t3,t4....tm} be the set of unique terms in D.
A document is then represented as an m-dimensional vector, td = (tf(d, t1), . . . , tf(d, tm)).
Pre-processing of documents is necessary to make computations more efficient as well as faster. More on this later.
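As an illustration (our actual pipeline is in MATLAB), the bag-of-words tf-idf representation above can be sketched in Python on a hypothetical two-document corpus; the term lists are invented for the example:

```python
import math

def tfidf_vectors(docs):
    """Build tf-idf vectors over the set T of unique terms in D."""
    terms = sorted({t for d in docs for t in d})               # T = {t1, ..., tm}
    df = {t: sum(1 for d in docs if t in d) for t in terms}    # df(t)
    n = len(docs)                                              # |D|
    vecs = []
    for d in docs:
        # tf(d, t) * log(|D| / df(t)) for each dimension t
        vecs.append([d.count(t) * math.log(n / df[t]) for t in terms])
    return terms, vecs

terms, vecs = tfidf_vectors([["apple", "banana"], ["apple", "cherry"]])
```

Note that a term appearing in every document (here "apple") gets weight 0, since log(|D|/df(t)) = log(1) = 0; this is why pruning very frequent words changes little for tf-idf.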
Metric
Not every distance measure is a metric.
To be a metric, a distance measure d must satisfy the following:
1. d(x, y) >= 0 (non-negativity)
2. d(x, y) = 0 if and only if x = y (identity of indiscernibles)
3. d(x, y) = d(y, x) (symmetry)
4. d(x, z) <= d(x, y) + d(y, z) (triangle inequality)
Euclidean Distance
Given two documents da and db represented by their term vectors ta and tb respectively, the Euclidean distance is given simply as:
D_E(ta, tb) = sqrt( Σ_t (w_{t,a} − w_{t,b})² )
where D_E = distance between the vectors,
w_{t,a} and w_{t,b} are weights as given by tf-idf values, i.e. w_{t,a} = tfidf(da, t),
tfidf(d, t) = tf(d, t) * log(|D| / df(t)),
|D| = number of documents,
df(t) = number of documents in which term t appears.
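A minimal Python sketch of the Euclidean distance between two weight vectors (the 3-4-5 values are just illustrative, not tf-idf output):

```python
import math

def euclidean(w_a, w_b):
    """D_E = sqrt(sum over terms of (w_{t,a} - w_{t,b})^2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(w_a, w_b)))

d = euclidean([3.0, 0.0, 4.0], [0.0, 0.0, 0.0])  # classic 3-4-5 triangle
```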
Cosine Similarity
Quantifies the correlation between vectors ta and tb as the cosine of the angle between them in m-dimensional space:
SIM_C(ta, tb) = (ta · tb) / (|ta| * |tb|)
where SIM_C = cosine similarity, and ta and tb are vectors containing weights corresponding to each dimension.
Bounded between [0, 1] for non-negative weights and independent of document length.
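A sketch of cosine similarity in Python; the example vectors are invented to show the two boundary cases (parallel and orthogonal vectors):

```python
import math

def cosine_sim(w_a, w_b):
    """SIM_C = dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(w_a, w_b))
    na = math.sqrt(sum(x * x for x in w_a))
    nb = math.sqrt(sum(y * y for y in w_b))
    return dot / (na * nb)

s_same = cosine_sim([1.0, 2.0], [2.0, 4.0])   # parallel vectors
s_orth = cosine_sim([1.0, 0.0], [0.0, 1.0])   # no shared terms
```

Length independence is visible in the first call: [2, 4] is twice [1, 2], yet the similarity is maximal.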
Mahalanobis Distance
Differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant:
d_st = sqrt( (xs − xt) C^{-1} (xs − xt)^T )
where
d_st = distance between the vectors,
xs and xt are weight vectors as given by tf-idf values,
C is the covariance matrix.
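A self-contained sketch of the Mahalanobis distance, restricted to two dimensions so the 2x2 covariance inverse can be hand-coded (real term spaces need a numerical library for C^{-1}). With the identity covariance it reduces to Euclidean distance, which the example checks:

```python
def mahalanobis_2d(x, y, cov):
    """d = sqrt((x - y) C^{-1} (x - y)^T) for 2-D points; cov = [[a, b], [b, c]]."""
    a, b = cov[0]
    _, c = cov[1]
    det = a * c - b * b
    inv = [[c / det, -b / det], [-b / det, a / det]]   # closed-form 2x2 inverse
    d0, d1 = x[0] - y[0], x[1] - y[1]
    q = d0 * (inv[0][0] * d0 + inv[0][1] * d1) + d1 * (inv[1][0] * d0 + inv[1][1] * d1)
    return q ** 0.5

# Identity covariance: should equal the Euclidean distance sqrt(3^2 + 4^2) = 5
d = mahalanobis_2d([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```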
Jaccard Coefficient
Measures similarity as the intersection divided by the union of the objects:
SIM_J(ta, tb) = (ta · tb) / (|ta|² + |tb|² − ta · tb)
For text documents, the Jaccard coefficient compares the sum weight of shared terms to the sum weight of terms that are present in either of the two documents but are not shared.
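One common weighted form of the Jaccard coefficient for term vectors (the extended, or Tanimoto, form: dot / (|a|² + |b|² − dot)) can be sketched as:

```python
def jaccard_sim(w_a, w_b):
    """Extended (Tanimoto) Jaccard coefficient on weight vectors."""
    dot = sum(x * y for x, y in zip(w_a, w_b))
    na2 = sum(x * x for x in w_a)
    nb2 = sum(y * y for y in w_b)
    return dot / (na2 + nb2 - dot)

s_id = jaccard_sim([1.0, 2.0], [1.0, 2.0])    # identical documents
s_disj = jaccard_sim([1.0, 0.0], [0.0, 1.0])  # no shared terms
```

For 0/1 term-presence vectors this reduces to the familiar |intersection| / |union|.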
Pearson Correlation
Takes many different forms; we are using the following formula in our work:
SIM_P(ta, tb) = (m Σ_t w_{t,a} w_{t,b} − TF_a TF_b) / sqrt( [m Σ_t w²_{t,a} − TF_a²] [m Σ_t w²_{t,b} − TF_b²] )
where TF_a = Σ_t w_{t,a} and TF_b = Σ_t w_{t,b}.
Ranges from [−1, 1].
In subsequent experiments we use the corresponding distance measure, which is D_P = 1 − SIM_P when SIM_P ≥ 0 and D_P = |SIM_P| when SIM_P < 0.
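A sketch of Pearson correlation, written in the equivalent mean-centered form, together with the piecewise distance D_P described above; the example vectors are invented:

```python
def pearson_sim(w_a, w_b):
    """Pearson correlation of two weight vectors (mean-centered form)."""
    m = len(w_a)
    ma, mb = sum(w_a) / m, sum(w_b) / m
    num = sum((x - ma) * (y - mb) for x, y in zip(w_a, w_b))
    den = (sum((x - ma) ** 2 for x in w_a) * sum((y - mb) ** 2 for y in w_b)) ** 0.5
    return num / den

def pearson_dist(w_a, w_b):
    """D_P = 1 - SIM_P when SIM_P >= 0, and |SIM_P| when SIM_P < 0."""
    s = pearson_sim(w_a, w_b)
    return 1 - s if s >= 0 else abs(s)

s = pearson_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # perfectly correlated
```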
Clustering Algorithms
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters.
Agglomerative: a "bottom up" approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: a "top down" approach in which all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
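The agglomerative variant can be sketched in a few lines of Python; this uses single linkage (closest pair of points between clusters), which is one of several possible linkage choices, and 1-D toy points:

```python
def agglomerative(points, k, dist):
    """Single-linkage agglomerative clustering: start from singletons,
    repeatedly merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
out = agglomerative([(0.0,), (0.1,), (5.0,), (5.2,)], 2, euclid)
```

The O(n³) cost mentioned below is visible here: each merge scans all pairs of clusters, and there are up to n merges.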
Clustering Algorithms
The k-means algorithm aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a representative of the cluster.
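A minimal sketch of k-means (Lloyd's algorithm) in Python, using Euclidean assignment and a fixed iteration budget; the 2-D toy points and seed are illustrative:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: assign each point to its nearest mean, recompute means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                 # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        centers = [[sum(dim) / len(c) for dim in zip(*c)] if c else centers[i]
                   for i, c in enumerate(clusters)]  # keep old center if cluster empties
    return centers, clusters

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centers, clusters = kmeans(pts, 2)
```

Swapping the squared-Euclidean expression in the assignment step for another measure is how the five similarity measures plug into the same loop.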
Comparisons
Hierarchical Algorithms:
Agglomerative: O(n³)
Divisive: O(2^n)
K-means Algorithms:
Various implementations with different
heuristics. All run in polynomial time.
Datasets
20 news: news articles on different topics
Classic: abstracts of scientific papers
Hitech: San Jose newspaper articles
tr41: from TREC collection of articles
wap : web pages collection
webkb : another web page dataset
r0 : standard cluster testing database.
Evaluation
Evaluating and contrasting cluster quality objectively is a difficult task in itself.
In practice, manually assigned category labels are usually used as a baseline criterion for evaluating clusters.
As a result, the clusters, which are generated in an unsupervised way, are compared to the pre-defined category structure, which is normally created by human experts.
This kind of evaluation assumes that the objective of clustering is to replicate human thinking, so a clustering solution is good if the clusters are consistent with the manually created categories.
Evaluation: Purity Measure
Measures the coherence of a cluster, i.e. the degree to which a cluster contains documents from a single category:
Purity(Cj) = (1/|Cj|) max_k n_{jk}
where n_{jk} is the number of documents in cluster Cj that belong to category k.
For an ideal cluster, which only contains documents from a single category, the purity value is 1. In general, the higher the purity value, the better the quality of the cluster.
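Overall purity is usually the size-weighted average of per-cluster purities; a sketch, with invented category labels:

```python
def purity(clusters):
    """clusters: list of lists of true category labels for each cluster's members.
    Per-cluster purity = fraction of members in the dominant category;
    overall purity = size-weighted average over clusters."""
    total = sum(len(c) for c in clusters)
    dominant = sum(max(c.count(lbl) for lbl in set(c)) for c in clusters)
    return dominant / total

# One mixed cluster (dominant: 2 of 3) and one pure cluster (3 of 3)
p = purity([["news", "news", "sport"], ["sport", "sport", "sport"]])
```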
Evaluation: Entropy Measure
The entropy measure evaluates the distribution of categories in a given cluster:
E(Cj) = − Σ_k (n_{jk}/|Cj|) log(n_{jk}/|Cj|)
where n_{jk} is the number of documents in cluster Cj that belong to category k.
The entropy measure is more comprehensive than purity: rather than considering only the number of objects in and not in the dominant category of a cluster, it considers the overall distribution of all the categories in a given cluster.
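A sketch of per-cluster entropy (base-2 logarithm here; the base is a convention and only rescales values):

```python
import math

def cluster_entropy(cluster):
    """Entropy of the category distribution within one cluster.
    0 for a pure cluster; larger when categories are more mixed."""
    n = len(cluster)
    probs = [cluster.count(lbl) / n for lbl in set(cluster)]
    return -sum(p * math.log2(p) for p in probs)

e_pure = cluster_entropy(["news", "news", "news"])   # single category
e_mixed = cluster_entropy(["news", "sport"])         # 50/50 split
```

Lower entropy means better clusters, the opposite orientation from purity.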
Experiments
We plan to use both hierarchical and k-means clustering with all the mentioned similarity measures on datasets from different domains.
We shall then use purity and entropy to measure the quality of the clusters that the two clustering algorithms give with the five similarity measures.
We hope to critique the effectiveness of the similarity measures based on cluster quality.
Past Work and Results
The paper by Anna Huang claims the following results for similar experiments.
We hope to do better owing to the use of stemming and better feature selection by PCA.
Work done
• Decided on and obtained the datasets. Created one ourselves; manual labeling had to be done for some documents.
• Repeatedly pruned the documents before making idf matrices for the datasets: removed words below a certain threshold frequency and also irrelevant high-frequency words.
• Wrote code for creating the tf(d,t) and tfidf(d,t) matrices from documents to .mat files.
Work done
Tried the clustering algorithms on small datasets in MATLAB. Trying to figure out a way to make them run faster on sparse matrices.
Integrating the five similarity measures with the clustering algorithms; the default is Euclidean distance.
Work Remaining
Optimization of code for better running times.
Cluster analysis using the purity and entropy measures.
Quantified results for the effectiveness of the different similarity measures used.
If time permits, we would like to improve upon document representation; stemming and some semantic knowledge look promising for improving cluster coherence.
References
[1] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
[2] J. M. Neuhaus and J. D. Kalbfleisch. Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics, 54(2):638–645, Jun. 1998.
[3] A. Huang. Similarity measures for text document clustering.
[4] B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
Datasets:
[1] CLUTO package : http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download
[2] 20 news : http://qwone.com/~jason/20Newsgroups/