Clustering with Multiviewpoint-Based Similarity Measure
Duc Thang Nguyen, Lihui Chen, Senior Member, IEEE, and Chee Keong Chan
Abstract—All clustering methods have to assume some cluster relationship among the data objects that they are applied to. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multiviewpoint-based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many different viewpoints: objects assumed not to be in the same cluster as the two objects being measured. With multiple viewpoints, a more informative assessment of similarity can be achieved. Theoretical analysis and an empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.
Index Terms—Document clustering, text mining, similarity measure.
1 INTRODUCTION
CLUSTERING is one of the most interesting and important topics in data mining. The aim of clustering is to find intrinsic structures in data, and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms are published every year. They can be proposed for very distinct research fields, and developed using totally different techniques and approaches. Nevertheless, according to a recent study [1], more than half a century after it was introduced, the simple algorithm k-means still remains one of the top 10 data mining algorithms today. It is the most frequently used partitional clustering algorithm in practice. Another recent scientific discussion [2] states that k-means is the favorite algorithm that practitioners in the related fields choose to use. Needless to say, k-means has more than a few basic drawbacks, such as sensitivity to initialization and to cluster size, and its performance can be worse than other state-of-the-art algorithms in many domains. In spite of that, its simplicity, understandability, and scalability are the reasons for its tremendous popularity. An algorithm with adequate performance and usability in most application scenarios could be preferable to one with better performance in some cases but limited usage due to high complexity. While offering reasonable results, k-means is fast and easy to combine with other methods in larger systems.
A common approach to the clustering problem is to treat it as an optimization process. An optimal partition is found by optimizing a particular function of similarity (or distance) among the data. Basically, there is an implicit assumption that the true intrinsic structure of the data can be correctly described by the similarity formula defined and embedded in the clustering criterion function. Hence, the effectiveness of clustering algorithms under this approach depends on the appropriateness of the similarity measure to the data at hand. For instance, the original k-means has a sum-of-squared-error objective function that uses euclidean distance. In a very sparse and high-dimensional domain like text documents, spherical k-means, which uses cosine similarity (CS) instead of euclidean distance as the measure, is deemed more suitable [3], [4].
In [5], Banerjee et al. showed that euclidean distance is in fact one particular member of a class of distance measures called Bregman divergences. They proposed the Bregman hard-clustering algorithm, in which any of the Bregman divergences can be applied. Kullback-Leibler divergence, a special case of Bregman divergence, was reported to give good clustering results on document data sets; it is also a good example of a nonsymmetric measure. Also on the topic of capturing dissimilarity in data, Pekalska et al. [6] found that the discriminative power of some distance measures could increase when their noneuclidean and nonmetric attributes were increased. They concluded that noneuclidean and nonmetric measures can be informative for statistical learning of data. In [7], Pelillo even argued that the symmetry and nonnegativity assumptions of similarity measures are actually a limitation of current state-of-the-art clustering approaches. At the same time, clustering still requires more robust dissimilarity or similarity measures; recent works such as [8] illustrate this need.
The work in this paper is motivated by investigations from the above and similar research findings. It appears to us that the nature of the similarity measure plays a very important role in the success or failure of a clustering method. Our first objective is to derive a novel method for measuring similarity between data objects in sparse and high-dimensional domains, particularly text documents. From the proposed similarity measure, we then formulate new clustering criterion functions and introduce their respective clustering algorithms, which are fast and scalable like k-means, but are also capable of providing high-quality and consistent performance.

988 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 6, JUNE 2012

The authors are with the Division of Information Engineering, School of Electrical and Electronic Engineering, Nanyang Technological University, Block S1, Nanyang Avenue, Republic of Singapore 639798. E-mail: [email protected], {elhchen, eckchan}@ntu.edu.sg.

Manuscript received 16 July 2010; revised 25 Feb. 2011; accepted 10 Mar. 2011; published online 22 Mar. 2011. Recommended for acceptance by M. Ester. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2010-07-0398. Digital Object Identifier no. 10.1109/TKDE.2011.86.

1041-4347/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society.
The remainder of this paper is organized as follows: In Section 2, we review related literature on similarity and clustering of documents. We then present our proposal for a document similarity measure in Section 3.1. It is followed by two criterion functions for document clustering and their optimization algorithms in Section 4. Extensive experiments on real-world benchmark data sets are presented and discussed in Sections 5 and 6. Finally, conclusions and potential future work are given in Section 7.
2 RELATED WORK
First of all, Table 1 summarizes the basic notations that will be used extensively throughout this paper to represent documents and related concepts. Each document in a corpus corresponds to an m-dimensional vector d, where m is the total number of terms that the document corpus has. Document vectors are often subjected to some weighting scheme, such as the standard Term Frequency-Inverse Document Frequency (TF-IDF), and normalized to have unit length.
The principal definition of clustering is to arrange data objects into separate clusters such that the intracluster similarity as well as the intercluster dissimilarity is maximized. The problem formulation itself implies that some form of measurement is needed to determine such similarity or dissimilarity. There are many state-of-the-art clustering approaches that do not employ any specific form of measurement, for instance, probabilistic model-based methods [9], nonnegative matrix factorization [10], information-theoretic coclustering [11], and so on. In this paper, though, we primarily focus on methods that do utilize a specific measure. In the literature, euclidean distance is one of the most popular measures
\[ \mathrm{Dist}(d_i, d_j) = \| d_i - d_j \| \quad (1) \]
It is used in the traditional k-means algorithm. The objective of k-means is to minimize the euclidean distance between objects of a cluster and that cluster's centroid:
\[ \min \sum_{r=1}^{k} \sum_{d_i \in S_r} \| d_i - C_r \|^2 \quad (2) \]
However, for data in a sparse and high-dimensional space, such as that in document clustering, cosine similarity is more widely used. It is also a popular similarity score in text mining and information retrieval [12]. Particularly, the similarity of two document vectors d_i and d_j, Sim(d_i, d_j), is defined as the cosine of the angle between them. For unit vectors, this equals their inner product:
\[ \mathrm{Sim}(d_i, d_j) = \cos(d_i, d_j) = d_i^t d_j \quad (3) \]
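As a concreteness check, (3) reduces to a plain dot product once the vectors are normalized to unit length; a minimal sketch (function names are ours, not from the paper):

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cos_sim(di, dj):
    """Cosine similarity of unit vectors: just the inner product (Eq. 3)."""
    return sum(a * b for a, b in zip(di, dj))

di = unit([1.0, 2.0, 0.0])
dj = unit([2.0, 1.0, 0.0])
print(round(cos_sim(di, dj), 3))  # 0.8
```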
The cosine measure is used in a variant of k-means called spherical k-means [3]. While k-means aims to minimize euclidean distance, spherical k-means intends to maximize the cosine similarity between documents in a cluster and that cluster's centroid:
\[ \max \sum_{r=1}^{k} \sum_{d_i \in S_r} \frac{d_i^t C_r}{\| C_r \|} \quad (4) \]
The major difference between euclidean distance and cosine similarity, and therefore between k-means and spherical k-means, is that the former focuses on vector magnitudes, while the latter emphasizes vector directions. Besides direct application in spherical k-means, the cosine of document vectors is also widely used in many other document clustering methods as a core similarity measurement. The min-max cut graph-based spectral method is an example [13]. In the graph partitioning approach, the document corpus is considered as a graph G = (V, E), where each document is a vertex in V and each edge in E has a weight equal to the similarity between a pair of vertices. The min-max cut algorithm tries to minimize the criterion function
\[ \min \sum_{r=1}^{k} \frac{\mathrm{Sim}(S_r, S \setminus S_r)}{\mathrm{Sim}(S_r, S_r)}, \quad \text{where } \mathrm{Sim}(S_q, S_r) = \sum_{d_i \in S_q,\, d_j \in S_r} \mathrm{Sim}(d_i, d_j), \; 1 \le q, r \le k, \quad (5) \]
and when the cosine as in (3) is used, minimizing the criterion in (5) is equivalent to
\[ \min \sum_{r=1}^{k} \frac{D_r^t D}{\| D_r \|^2} \quad (6) \]
There are many other graph partitioning methods withdifferent cutting strategies and criterion functions, such asAverage Weight [14] and Normalized Cut [15], all of whichhave been successfully applied for document clusteringusing cosine as the pairwise similarity score [16], [17]. In[18], an empirical study was conducted to compare avariety of criterion functions for document clustering.
Another popular graph-based clustering technique is implemented in a software package called CLUTO [19]. This method first models the documents with a nearest neighbor graph, and then splits the graph into clusters using a min-cut algorithm. Besides the cosine measure, the extended Jaccard coefficient can also be used in this method to represent similarity between nearest documents. Given nonunit document vectors u_i, u_j (d_i = u_i/||u_i||, d_j = u_j/||u_j||), their extended Jaccard coefficient is
\[ \mathrm{Sim}_{\mathrm{eJacc}}(u_i, u_j) = \frac{u_i^t u_j}{\| u_i \|^2 + \| u_j \|^2 - u_i^t u_j} \quad (7) \]
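Equation (7) transcribes directly into code; a small sketch (the function name is ours):

```python
def ext_jaccard(ui, uj):
    """Extended Jaccard coefficient of two (nonunit) vectors (Eq. 7)."""
    dot = sum(a * b for a, b in zip(ui, uj))
    return dot / (sum(a * a for a in ui) + sum(b * b for b in uj) - dot)

print(ext_jaccard([1.0, 2.0], [1.0, 2.0]))  # identical vectors -> 1.0
```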
TABLE 1: Notations
Compared with euclidean distance and cosine similarity, the extended Jaccard coefficient takes into account both the magnitude and the direction of the document vectors. If the documents are instead represented by their corresponding unit vectors, this measure has the same effect as cosine similarity. In [20], Strehl et al. compared four measures: euclidean, cosine, Pearson correlation, and extended Jaccard, and concluded that cosine and extended Jaccard are the best ones on web documents.
In nearest neighbor graph clustering methods, such as CLUTO's graph method above, the concept of similarity is somewhat different from the previously discussed methods. Two documents may have a certain value of cosine similarity, but if neither of them is in the other one's neighborhood, they have no connection between them. In such a case, some context-based knowledge or relativeness property is already taken into account when considering similarity. Recently, Ahmad and Dey [21] proposed a method to compute distance between two categorical values of an attribute based on their relationship with all other attributes. Subsequently, Ienco et al. [22] introduced a similar context-based distance learning method for categorical data. However, for a given attribute, they only selected a relevant subset of attributes from the whole attribute set to use as the context for calculating the distance between its two values.
More related to text data, there are phrase-based and concept-based document similarities. Lakkaraju et al. [23] employed a conceptual tree-similarity measure to identify similar documents. This method requires representing documents as concept trees with the help of a classifier. For clustering, Chim and Deng [24] proposed a phrase-based document similarity by combining the suffix tree model and the vector space model. They then used a Hierarchical Agglomerative Clustering algorithm to perform the clustering task. However, a drawback of this approach is the high computational complexity due to the need to build the suffix tree and calculate pairwise similarities explicitly before clustering. There are also measures designed specifically for capturing structural similarity among XML documents [25]. They are essentially different from the document-content measures that are discussed in this paper.
In general, cosine similarity still remains the most popular measure because of its simple interpretation and easy computation, though its effectiveness is yet fairly limited. In the following sections, we propose a novel way to evaluate similarity between documents, and consequently formulate new criterion functions for document clustering.
3 MULTIVIEWPOINT-BASED SIMILARITY
3.1 Our Novel Similarity Measure
The cosine similarity in (3) can be expressed in the following form without changing its meaning:
\[ \mathrm{Sim}(d_i, d_j) = \cos(d_i - 0, d_j - 0) = (d_i - 0)^t (d_j - 0), \quad (8) \]
where 0 is the vector that represents the origin point. According to this formula, the measure takes 0 as the one and only reference point. The similarity between two documents d_i and d_j is determined w.r.t. the angle between the two points when looking from the origin.
To construct a new concept of similarity, it is possible to use more than just one point of reference. We may have a more accurate assessment of how close or distant a pair of points is, if we look at them from many different viewpoints. From a third point d_h, the directions and distances to d_i and d_j are indicated, respectively, by the difference vectors (d_i - d_h) and (d_j - d_h). By standing at various reference points d_h to view d_i, d_j and working on their difference vectors, we define the similarity between the two documents as
\[ \mathrm{Sim}(d_i, d_j)_{d_i, d_j \in S_r} = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{Sim}(d_i - d_h, d_j - d_h) \quad (9) \]
As described by the above equation, the similarity of two documents d_i and d_j, given that they are in the same cluster, is defined as the average of similarities measured relatively from the views of all other documents outside that cluster. What is interesting is that the similarity here is defined in close relation to the clustering problem. A presumption of cluster memberships has been made prior to the measure: the two objects to be measured must be in the same cluster, while the points from which to establish this measurement must be outside of the cluster. We call this proposal the Multiviewpoint-based Similarity, or MVS. From this point onwards, we will denote the proposed similarity measure between two document vectors d_i and d_j by MVS(d_i, d_j | d_i, d_j ∈ S_r), or occasionally MVS(d_i, d_j) for short.
The final form of MVS in (9) depends on the particular formulation of the individual similarities within the sum. If the relative similarity is defined by the dot product of the difference vectors, we have
\[
\begin{aligned}
\mathrm{MVS}(d_i, d_j \,|\, d_i, d_j \in S_r)
&= \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t (d_j - d_h) \\
&= \frac{1}{n - n_r} \sum_{d_h} \cos(d_i - d_h, d_j - d_h) \, \| d_i - d_h \| \, \| d_j - d_h \|.
\end{aligned} \quad (10)
\]
The similarity between two points d_i and d_j inside cluster S_r, viewed from a point d_h outside this cluster, is equal to the product of the cosine of the angle between d_i and d_j looking from d_h and the euclidean distances from d_h to these two points. This definition is based on the assumption that d_h is not in the same cluster with d_i and d_j. The smaller the distances ||d_i - d_h|| and ||d_j - d_h|| are, the higher the chance that d_h is in fact in the same cluster with d_i and d_j, and the similarity based on d_h should also be small to reflect this potential. Therefore, through these distances, (10) also provides a measure of intercluster dissimilarity, given that points d_i and d_j belong to cluster S_r, whereas d_h belongs to another cluster. The overall similarity between d_i and d_j is determined by taking the average over all the viewpoints not belonging to cluster S_r. It is possible to argue that while most of these viewpoints are useful, some of them may give misleading information, just as may happen with the origin point. However, given a large enough number of viewpoints and their variety, it is reasonable to assume that the majority of them will be useful. Hence, the effect of misleading viewpoints is constrained and reduced by the averaging step. It can be seen that this method offers a more informative assessment of similarity than the single origin-point-based similarity measure.
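To make (10) concrete, here is a small sketch (function names are ours; viewpoints are assumed to be unit-length document vectors, as in the paper). With unit viewpoints, the average agrees with the closed form of Eq. (11), d_i^t d_j - d_i^t C - d_j^t C + 1, where C is the outer centroid:

```python
def mvs(di, dj, viewpoints):
    """Eq. (10): average dot product of the difference vectors (di - dh)
    and (dj - dh) over viewpoints dh outside the cluster of di and dj."""
    total = 0.0
    for dh in viewpoints:
        total += sum((a - h) * (b - h) for a, b, h in zip(di, dj, dh))
    return total / len(viewpoints)

# Two orthogonal unit documents, two unit viewpoints; the outer centroid
# is C = (0.7, 0.7), so Eq. (11) gives 0 - 0.7 - 0.7 + 1 = -0.4:
di, dj = [1.0, 0.0], [0.0, 1.0]
views = [[0.6, 0.8], [0.8, 0.6]]
print(round(mvs(di, dj, views), 3))  # -0.4
```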
3.2 Analysis and Practical Examples of MVS
In this section, we present an analytical study to show that the proposed MVS could be a very effective similarity measure for data clustering. In order to demonstrate its advantages, MVS is compared with cosine similarity on how well they reflect the true group structure in document collections. First, expanding (10), we have
\[
\begin{aligned}
\mathrm{MVS}(d_i, d_j \,|\, d_i, d_j \in S_r)
&= \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \left( d_i^t d_j - d_i^t d_h - d_j^t d_h + d_h^t d_h \right) \\
&= d_i^t d_j - \frac{1}{n - n_r} d_i^t \sum_{d_h} d_h - \frac{1}{n - n_r} d_j^t \sum_{d_h} d_h + 1, \quad \| d_h \| = 1 \\
&= d_i^t d_j - \frac{1}{n - n_r} d_i^t D_{S \setminus S_r} - \frac{1}{n - n_r} d_j^t D_{S \setminus S_r} + 1 \\
&= d_i^t d_j - d_i^t C_{S \setminus S_r} - d_j^t C_{S \setminus S_r} + 1,
\end{aligned} \quad (11)
\]
where D_{S\S_r} = Σ_{d_h ∈ S\S_r} d_h is the composite vector of all the documents outside cluster r, called the outer composite w.r.t. cluster r, and C_{S\S_r} = D_{S\S_r}/(n - n_r) is the outer centroid w.r.t. cluster r, for all r = 1, ..., k. From (11), when comparing two pairwise similarities MVS(d_i, d_j) and MVS(d_i, d_l), document d_j is more similar to document d_i than the other document d_l is, if and only if
\[
\begin{aligned}
& d_i^t d_j - d_j^t C_{S \setminus S_r} > d_i^t d_l - d_l^t C_{S \setminus S_r} \\
\Leftrightarrow\;& \cos(d_i, d_j) - \cos(d_j, C_{S \setminus S_r}) \| C_{S \setminus S_r} \| > \cos(d_i, d_l) - \cos(d_l, C_{S \setminus S_r}) \| C_{S \setminus S_r} \|.
\end{aligned} \quad (12)
\]
From this condition, it is seen that even when d_l is considered "closer" to d_i in terms of CS, i.e., cos(d_i, d_l) ≥ cos(d_i, d_j), d_l can still possibly be regarded as less similar to d_i based on MVS if, on the contrary, it is sufficiently closer to the outer centroid C_{S\S_r} than d_j is. This is intuitively reasonable, since the "closer" d_l is to C_{S\S_r}, the greater the chance it actually belongs to another cluster rather than S_r and is, therefore, less similar to d_i. For this reason, MVS brings an additional useful measure to the table compared with CS.
To further justify the above proposal and analysis, we carried out a validity test for MVS and CS. The purpose of this test is to check how much a similarity measure coincides with the true class labels. It is based on one principle: if a similarity measure is appropriate for the clustering problem, then for any document in the corpus, the documents closest to it under this measure should be in the same cluster as it.
The validity test is designed as follows. For each type of similarity measure, a similarity matrix A = {a_ij} of size n × n is created. For CS, this is simple, as a_ij = d_i^t d_j. The procedure for building the MVS matrix is described in Fig. 1. First, the outer composite w.r.t. each class is determined. Then, for each row a_i of A, i = 1, ..., n, if a pair of documents d_i and d_j, j = 1, ..., n, are in the same class, a_ij is calculated as in line 10 of Fig. 1. Otherwise, d_j is assumed to be in d_i's class, and a_ij is calculated as in line 12 of Fig. 1. After matrix A is formed, the procedure in Fig. 2 is used to get its validity score. For each document d_i corresponding to row a_i of A, we select the q_r documents closest to d_i. The value of q_r is chosen as a percentage of the size of the class r that contains d_i, where percentage ∈ (0, 1]. Then, the validity w.r.t. d_i is calculated as the fraction of these q_r documents having the same class label as d_i, as in line 12 of Fig. 2. The final validity is determined by averaging over all the rows of A, as in line 14 of Fig. 2. It is clear that the validity score is bounded within 0 and 1. The higher the validity score a similarity measure has, the more suitable it should be for the clustering task.
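The scoring half of the procedure (cf. Fig. 2) can be sketched as follows; the similarity matrix A is assumed to be precomputed, and the names are ours:

```python
def validity_score(A, labels, percentage=1.0):
    """For each document i, rank all other documents by similarity A[i][j],
    keep the top q_r (a percentage of i's class size), and average the
    fraction of them that share i's class label."""
    n = len(labels)
    class_size = {}
    for lab in labels:
        class_size[lab] = class_size.get(lab, 0) + 1
    total = 0.0
    for i in range(n):
        qr = max(1, int(percentage * class_size[labels[i]]))
        # rank the other documents by similarity to i, most similar first
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: A[i][j], reverse=True)
        total += sum(labels[j] == labels[i] for j in ranked[:qr]) / qr
    return total / n
```

On a toy block-structured matrix (high within-class similarity, low across), the score is 1.0 at small percentages, illustrating the "closest documents share the class" principle.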
Fig. 1. Procedure: Build MVS similarity matrix.
Fig. 2. Procedure: Get validity score.

Two real-world document data sets are used as examples in this validity test. The first is reuters7, a subset of the famous collection Reuters-21578 Distribution 1.0 of Reuters newswire articles.1 Reuters-21578 is one of the most widely used test collections for text categorization. In our validity test, we selected 2,500 documents from the largest seven categories: "acq," "crude," "interest," "earn," "money-fx," "ship," and "trade" to form reuters7. Some of the documents may appear in more than one category. The second data set is k1b, a collection of 2,340 webpages from the Yahoo! subject hierarchy, including six topics: "health," "entertainment," "sport," "politics," "tech," and "business." It was created from a past study in information retrieval called WebAce [26], and is now available with the CLUTO toolkit [19].

1. http://www.daviddlewis.com/resources/testcollections/reuters21578/.
The two data sets were preprocessed by stop-word removal and stemming. Moreover, we removed words that appear in fewer than two documents or in more than 99.5 percent of the total number of documents. Finally, the documents were weighted by TF-IDF and normalized to unit vectors. The full characteristics of reuters7 and k1b are presented in Fig. 3.
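The weighting step can be sketched in a few lines. This is a simplified TF-IDF, not necessarily the exact variant used in the experiments, and the names are ours:

```python
import math
from collections import Counter

def tfidf_unit(corpus):
    """TF-IDF weighting of a tokenized corpus, followed by normalization
    to unit length. Returns the sorted vocabulary and one vector per doc."""
    n = len(corpus)
    df = Counter()                       # document frequency per term
    for doc in corpus:
        df.update(set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        v = [tf[t] * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard all-zero rows
        vectors.append([x / norm for x in v])
    return vocab, vectors
```

Note that a term appearing in every document gets zero IDF weight, which is why overly frequent words are removed beforehand.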
Fig. 4 shows the validity scores of CS and MVS on the two data sets relative to the parameter percentage, whose value is set at 0.001, 0.01, 0.05, 0.1, 0.2, ..., 1.0. According to Fig. 4, MVS is clearly better than CS for both data sets in this validity test. For example, with the k1b data set at percentage = 1.0, MVS's validity score is 0.80, while that of CS is only 0.67. This indicates that, on average, when we pick any document and consider its neighborhood of size equal to its true class size, only 67 percent of that document's neighbors based on CS actually belong to its class. Based on MVS, the number of valid neighbors increases to 80 percent. The validity test has illustrated the potential advantage of the new multiviewpoint-based similarity measure compared to the cosine measure.
4 MULTIVIEWPOINT-BASED CLUSTERING
4.1 Two Clustering Criterion Functions I_R and I_V

Having defined our similarity measure, we now formulate our clustering criterion functions. The first function, called I_R, is the cluster size-weighted sum of average pairwise similarities of documents in the same cluster. First, let us express this sum in a general form by the function F:
\[ F = \sum_{r=1}^{k} n_r \left[ \frac{1}{n_r^2} \sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j) \right] \quad (13) \]
We would like to transform this objective function into a suitable form that allows the optimization procedure to be performed in a simple, fast, and effective way. According to (10),
\[
\begin{aligned}
\sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j)
&= \sum_{d_i, d_j \in S_r} \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t (d_j - d_h) \\
&= \frac{1}{n - n_r} \sum_{d_i, d_j} \sum_{d_h} \left( d_i^t d_j - d_i^t d_h - d_j^t d_h + d_h^t d_h \right).
\end{aligned}
\]
Since
\[ \sum_{d_i \in S_r} d_i = \sum_{d_j \in S_r} d_j = D_r, \qquad \sum_{d_h \in S \setminus S_r} d_h = D - D_r \quad \text{and} \quad \| d_h \| = 1, \]
we have
\[
\begin{aligned}
\sum_{d_i, d_j \in S_r} \mathrm{Sim}(d_i, d_j)
&= \sum_{d_i, d_j \in S_r} d_i^t d_j - \frac{2 n_r}{n - n_r} \sum_{d_i \in S_r} d_i^t \sum_{d_h \in S \setminus S_r} d_h + n_r^2 \\
&= D_r^t D_r - \frac{2 n_r}{n - n_r} D_r^t (D - D_r) + n_r^2 \\
&= \frac{n + n_r}{n - n_r} \| D_r \|^2 - \frac{2 n_r}{n - n_r} D_r^t D + n_r^2.
\end{aligned}
\]
Substituting into (13) to get
F ¼Xkr¼1
1
nr
nþ nrn� nr
kDrk2 � nþ nrn� nr
� 1
� �DtrD
� �þ n:
Because n is constant, maximizing F is equivalent to maximizing \tilde{F}:
\[ \tilde{F} = \sum_{r=1}^{k} \frac{1}{n_r} \left[ \frac{n + n_r}{n - n_r} \| D_r \|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right] \quad (14) \]
Fig. 3. Characteristics of reuters7 and k1b data sets.
Fig. 4. CS and MVS validity test.

If we compare \tilde{F} with the min-max cut in (5), both functions contain the two terms ||D_r||^2 (an intracluster similarity measure) and D_r^t D (an intercluster similarity measure). Nonetheless, while the objective of min-max cut is to minimize the inverse ratio between these two terms, our aim here is to maximize their weighted difference. In \tilde{F}, this difference term is determined for each cluster. The terms are weighted by the inverse of the cluster's size, before being summed up over all the clusters. One problem is that this formulation is expected to be quite sensitive to cluster size. From the formulation of COSA [27], a widely known subspace clustering algorithm, we have learned that it is desirable to have a set of weight factors λ = {λ_r}_{r=1}^{k} to regulate the distribution of these cluster sizes in clustering solutions. Hence, we integrate λ into the expression of \tilde{F} to obtain
\[ \tilde{F}_\lambda = \sum_{r=1}^{k} \frac{\lambda_r}{n_r} \left[ \frac{n + n_r}{n - n_r} \| D_r \|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right] \quad (15) \]
In common practice, {λ_r}_{r=1}^{k} are often taken to be simple functions of the respective cluster sizes {n_r}_{r=1}^{k} [28]. Let us use a parameter α, called the regulating factor, which has some constant value (α ∈ [0, 1]), and let λ_r = n_r^α in (15); the final form of our criterion function I_R is
\[ I_R = \sum_{r=1}^{k} \frac{1}{n_r^{1-\alpha}} \left[ \frac{n + n_r}{n - n_r} \| D_r \|^2 - \left( \frac{n + n_r}{n - n_r} - 1 \right) D_r^t D \right] \quad (16) \]
In the empirical study of Section 5.4, it appears that I_R's performance dependency on the value of α is not very critical. The criterion function yields relatively good clustering results for α ∈ (0, 1).

In the formulation of I_R, a cluster's quality is measured by the average pairwise similarity between documents within that cluster. However, such an approach can lead to sensitivity to the size and tightness of the clusters. With CS, for example, the pairwise similarity of documents in a sparse cluster is usually smaller than that in a dense cluster. Though not as clearly as with CS, it is still possible that the same effect may hinder MVS-based clustering if pairwise similarity is used. To prevent this, an alternative approach is to consider the similarity between each document vector and its cluster's centroid instead. This is expressed in the objective function G:
\[
\begin{aligned}
G &= \sum_{r=1}^{k} \sum_{d_i \in S_r} \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{Sim}\!\left( d_i - d_h,\; \frac{C_r}{\| C_r \|} - d_h \right) \\
  &= \sum_{r=1}^{k} \frac{1}{n - n_r} \sum_{d_i \in S_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t \left( \frac{C_r}{\| C_r \|} - d_h \right).
\end{aligned} \quad (17)
\]
Similar to the formulation of I_R, we would like to express this objective in a simple form that we can optimize more easily. Expanding the vector dot product, we get
\[
\begin{aligned}
\sum_{d_i \in S_r} \sum_{d_h \in S \setminus S_r} (d_i - d_h)^t \left( \frac{C_r}{\| C_r \|} - d_h \right)
&= \sum_{d_i} \sum_{d_h} \left( d_i^t \frac{C_r}{\| C_r \|} - d_i^t d_h - d_h^t \frac{C_r}{\| C_r \|} + 1 \right) \\
&= (n - n_r) \frac{D_r^t D_r}{\| D_r \|} - D_r^t (D - D_r) - \frac{n_r (D - D_r)^t D_r}{\| D_r \|} + n_r (n - n_r), \quad \text{since } \frac{C_r}{\| C_r \|} = \frac{D_r}{\| D_r \|} \\
&= (n + \| D_r \|) \| D_r \| - (n_r + \| D_r \|) \frac{D_r^t D}{\| D_r \|} + n_r (n - n_r).
\end{aligned}
\]
Substituting the above into (17), we have
\[ G = \sum_{r=1}^{k} \left[ \frac{n + \| D_r \|}{n - n_r} \| D_r \| - \left( \frac{n + \| D_r \|}{n - n_r} - 1 \right) \frac{D_r^t D}{\| D_r \|} \right] + n. \]
Again, we can eliminate n because it is a constant. Maximizing G is equivalent to maximizing I_V below:
\[ I_V = \sum_{r=1}^{k} \left[ \frac{n + \| D_r \|}{n - n_r} \| D_r \| - \left( \frac{n + \| D_r \|}{n - n_r} - 1 \right) \frac{D_r^t D}{\| D_r \|} \right] \quad (18) \]
I_V calculates the weighted difference between the two terms ||D_r|| and D_r^t D / ||D_r||, which again represent an intracluster similarity measure and an intercluster similarity measure, respectively. The first term is actually equivalent to an element of the sum in the spherical k-means objective function in (4); the second is similar to an element of the sum in the min-max cut criterion in (6), but with ||D_r|| as the scaling factor instead of ||D_r||^2. We have presented our clustering criterion functions I_R and I_V in simple forms. Next, we show how to perform clustering using a greedy algorithm to optimize these functions.
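As a sanity check on (18), I_V can be evaluated directly from the composite vectors; a minimal sketch under our own naming, assuming at least two clusters so that n - n_r > 0:

```python
import math

def criterion_iv(clusters):
    """Evaluate I_V (Eq. 18). `clusters` is a list of clusters, each a list
    of document vectors; D_r is the composite (sum) of cluster r and D the
    composite of the whole corpus."""
    docs = [d for c in clusters for d in c]
    n = len(docs)
    D = [sum(col) for col in zip(*docs)]
    total = 0.0
    for c in clusters:
        nr = len(c)
        Dr = [sum(col) for col in zip(*c)]
        norm_dr = math.sqrt(sum(x * x for x in Dr))
        w = (n + norm_dr) / (n - nr)           # the shared weight in Eq. 18
        dr_dot_d = sum(a * b for a, b in zip(Dr, D))
        total += w * norm_dr - (w - 1.0) * dr_dot_d / norm_dr
    return total
```

On a toy corpus, a partition that groups identical unit documents together scores higher than one that mixes them, which is the behavior the criterion is designed to reward.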
4.2 Optimization Algorithm and Complexity
We denote our clustering framework by MVSC, meaning Clustering with Multiviewpoint-based Similarity. Subsequently, we have MVSC-I_R and MVSC-I_V, which are MVSC with criterion functions I_R and I_V, respectively. The main goal is to perform document clustering by optimizing I_R in (16) and I_V in (18). For this purpose, the incremental k-way algorithm [18], [29], a sequential version of k-means, is employed. Considering that the expression of I_V in (18) depends only on n_r and D_r, r = 1, ..., k, I_V can be written in a general form
\[ I_V = \sum_{r=1}^{k} I_r(n_r, D_r), \quad (19) \]
where I_r(n_r, D_r) corresponds to the objective value of cluster r. The same applies to I_R. With this general form, the incremental optimization algorithm, which has two major steps, Initialization and Refinement, is described in Fig. 5.

At Initialization, k arbitrary documents are selected to be the seeds from which initial partitions are formed. Refinement is a procedure that consists of a number of iterations. During each iteration, the n documents are visited one by one in a totally random order. Each document is checked to see whether moving it to another cluster improves the objective function. If yes, the document is moved to the
cluster that leads to the highest improvement. If no cluster is better than the current cluster, the document is not moved. The clustering process terminates when an iteration completes without any documents being moved to new clusters. Unlike the traditional k-means, this algorithm is a stepwise optimal procedure. While k-means only updates after all n documents have been reassigned, the incremental clustering algorithm updates immediately whenever a document is moved to a new cluster. Since every move, when it happens, increases the objective function value, convergence to a local optimum is guaranteed.
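The refinement loop described above can be sketched as follows. This is a naive version: it re-evaluates the whole objective for each candidate move, whereas the actual algorithm updates n_r and D_r incrementally; the names are ours:

```python
import random

def refine(n_docs, labels_init, k, objective, max_iter=100):
    """Visit documents in random order; move each to the cluster giving the
    largest objective improvement, if any. Stop after a pass with no moves."""
    labels = list(labels_init)
    for _ in range(max_iter):
        moved = False
        for i in random.sample(range(n_docs), n_docs):
            best_r, best_val = labels[i], objective(labels)
            for r in range(k):
                if r == labels[i]:
                    continue
                trial = labels[:]          # candidate reassignment of doc i
                trial[i] = r
                val = objective(trial)
                if val > best_val:
                    best_r, best_val = r, val
            if best_r != labels[i]:
                labels[i] = best_r         # update immediately, per document
                moved = True
        if not moved:
            break
    return labels
```

Because each accepted move strictly increases the objective and the number of partitions is finite, the loop reaches a local optimum, mirroring the convergence argument above.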
During the optimization procedure, in each iteration, the main sources of computational cost are:

. Searching for optimum clusters to move individual documents to: O(nz · k).
. Updating composite vectors as a result of such moves: O(m · k).

where nz is the total number of nonzero entries in all document vectors. Our clustering approach is partitional and incremental; therefore, computing a similarity matrix is not needed at all. If τ denotes the number of iterations the algorithm takes, since nz is often several tens of times larger than m for the document domain, the computational complexity required for clustering with I_R and I_V is O(nz · k · τ).
5 PERFORMANCE EVALUATION OF MVSC
To verify the advantages of our proposed methods, we evaluate their performance in experiments on document data. The objective of this section is to compare MVSC-I_R and MVSC-I_V with existing algorithms that also use specific similarity measures and criterion functions for document clustering. The similarity measures to be compared include euclidean distance, cosine similarity, and the extended Jaccard coefficient.
5.1 Document Collections
The data corpora that we used for experiments consist of 20 benchmark document data sets. Besides reuters7 and k1b, which have been described in detail earlier, we included another 18 text collections so that the examination of the clustering methods is more thorough and exhaustive. Similar to k1b, these data sets are provided together with CLUTO by the toolkit's authors [19]. They had been used for experimental testing in previous papers, and their source and origin had also been described in detail [30], [31]. Table 2 summarizes their characteristics. The corpora present a diversity of size, number of classes, and class balance. They were all preprocessed by standard procedures, including stop-word removal, stemming, removal of too rare as well as too frequent words, TF-IDF weighting, and normalization.
5.2 Experimental Setup and Evaluation
To demonstrate how well the MVSC methods can perform, we compare them with five other clustering methods on the 20 data sets in Table 2. In summary, the seven clustering algorithms are:
. MVSC-I_R: MVSC using criterion function I_R
. MVSC-I_V: MVSC using criterion function I_V
. k-means: standard k-means with euclidean distance
. Spkmeans: spherical k-means with CS
. graphCS: CLUTO's graph method with CS
. graphEJ: CLUTO's graph method with extended Jaccard
. MMC: spectral Min-Max Cut algorithm [13].
Our MVSC-IR and MVSC-IV programs are implemented in Java. The regulating factor λ in I_R is always set at 0.3 during the experiments; we observed that this is one of the most appropriate values. A study of MVSC-IR's performance with different λ values is presented in a later section.
The other algorithms are provided through the C library interface that is available freely with the CLUTO toolkit [19].

994 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 6, JUNE 2012

Fig. 5. Algorithm: Incremental clustering.

TABLE 2: Document Data Sets

For each data set, the cluster number is predefined to equal the number of true classes, i.e., k = c.
None of the above algorithms is guaranteed to find the global optimum, and all of them are initialization dependent. Hence, for each method, we performed clustering a few times with randomly initialized values and chose the best trial in terms of the corresponding objective function value. In all the experiments, each test run consisted of 10 trials. Moreover, the result reported here on each data set by a particular clustering method is the average of 10 test runs.
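The best-of-trials protocol just described can be sketched as follows; `toy_trial` is a hypothetical stand-in for one randomly initialized clustering run, since the actual MVSC implementation is not shown here.

```python
import random

def best_of_trials(run_trial, n_trials=10):
    """Run a randomly initialized clustering n_trials times and keep the
    trial whose objective function value is best (largest), as in Sec. 5.2."""
    best_labels, best_obj = None, float("-inf")
    for seed in range(n_trials):
        labels, objective = run_trial(seed)
        if objective > best_obj:
            best_labels, best_obj = labels, objective
    return best_labels, best_obj

def toy_trial(seed):
    """Hypothetical stand-in for one clustering trial: a seed-dependent
    random assignment paired with a dummy objective value."""
    rng = random.Random(seed)
    labels = [rng.randrange(2) for _ in range(8)]
    return labels, sum(labels)  # dummy "criterion function" value
```

In the paper's setup this selection is itself repeated over 10 test runs and the evaluation scores are averaged.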
After a test run, the clustering solution is evaluated by comparing the documents' assigned labels with their true labels provided by the corpus. Three types of external evaluation metric are used to assess clustering performance: FScore, Normalized Mutual Information (NMI), and Accuracy. FScore is an equally weighted combination of the "precision" (P) and "recall" (R) values used in information retrieval. Given a clustering solution, FScore is determined as
$$\mathrm{FScore} = \sum_{i=1}^{k} \frac{n_i}{n} \max_j \left( F_{i,j} \right), \quad \text{where } F_{i,j} = \frac{2\, P_{i,j}\, R_{i,j}}{P_{i,j} + R_{i,j}}, \quad P_{i,j} = \frac{n_{i,j}}{n_j}, \quad R_{i,j} = \frac{n_{i,j}}{n_i},$$
where n_i denotes the number of documents in class i, n_j the number of documents assigned to cluster j, and n_{i,j} the number of documents shared by class i and cluster j. From another aspect, NMI measures the information that the true class partition and the cluster assignment share: it measures how much knowing about the clusters helps us know about the classes,
$$\mathrm{NMI} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} n_{i,j}\,\log\frac{n \cdot n_{i,j}}{n_i\, n_j}}{\sqrt{\left(\sum_{i=1}^{k} n_i \log\frac{n_i}{n}\right)\left(\sum_{j=1}^{k} n_j \log\frac{n_j}{n}\right)}}.$$
Finally, Accuracy measures the fraction of documents that are correctly labeled, assuming a one-to-one correspondence between true classes and assigned clusters. Let q denote any possible permutation of the index set {1, ..., k}; Accuracy is calculated by
$$\mathrm{Accuracy} = \frac{1}{n}\, \max_q \sum_{i=1}^{k} n_{i,q(i)}.$$
The best mapping q to determine Accuracy can be found by the Hungarian algorithm.2 For all three metrics, the range is from 0 to 1, and a greater value indicates a better clustering solution.
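A minimal sketch of the Accuracy computation; for the small k considered here, a brute-force search over permutations stands in for the Hungarian algorithm mentioned above (labels are assumed to be integers 0..k-1):

```python
from itertools import permutations
from collections import Counter

def accuracy(classes, clusters, k):
    """Clustering Accuracy: fraction of documents correctly labeled under
    the best one-to-one class-to-cluster mapping q. Brute force over all
    permutations, feasible only for small k; the Hungarian algorithm
    solves the same assignment problem efficiently."""
    n = len(classes)
    joint = Counter(zip(classes, clusters))  # n_ij counts
    best = 0
    for q in permutations(range(k)):
        best = max(best, sum(joint[(i, q[i])] for i in range(k)))
    return best / n
```

The permutation search makes the one-to-one correspondence explicit: each true class i is matched to exactly one cluster q(i).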
5.3 Results
Fig. 6 shows the Accuracy of the seven clustering algorithms on the 20 text collections. Presented in a different way, clustering results based on FScore and NMI are reported in Tables 3 and 4, respectively. For each data set in a row, the value in bold and underlined is the best result, while the value in bold only is the second best.
It can be observed that MVSC-IR and MVSC-IV perform consistently well. In Fig. 6, on 19 out of 20 data sets (all except reviews), either both or one of the MVSC approaches are among the top two algorithms. The next most consistent performer is Spkmeans. The other algorithms may work well on certain data sets; for example, graphEJ yields an outstanding result on classic, while graphCS and MMC are good on reviews. But they do not fare very well on the rest of the collections.
To provide statistical justification for the clustering performance comparisons, we also carried out statistical significance tests. Each of MVSC-IR and MVSC-IV was paired up with one of the remaining algorithms for a paired t-test [32]. Given two paired sets X and Y of N measured values, the null hypothesis of the test is that the differences between X and Y come from a population with mean 0. The alternative hypothesis is that the paired sets differ from each other in a significant way. In our experiment, these tests were done based on the evaluation values obtained on the 20 data sets. The typical 5 percent significance level was used. For example, considering the pair (MVSC-IR, k-means), from Table 3 it is seen that MVSC-IR dominates k-means w.r.t. FScore. If the paired t-test returns a p-value smaller than 0.05, we reject the null hypothesis and say that the dominance is significant. Otherwise, the null hypothesis cannot be rejected and the comparison is considered insignificant.
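The test procedure can be sketched as follows, assuming SciPy is available; the score values below are purely illustrative, not results from the paper:

```python
from scipy import stats

def compare_methods(scores_x, scores_y, alpha=0.05):
    """Paired t-test on per-data-set scores of two methods (Sec. 5.3).
    Returns the p-value and whether X's dominance over Y is significant
    at level alpha (two-sided test, with the sign of t checked)."""
    t_stat, p_value = stats.ttest_rel(scores_x, scores_y)
    significant = (p_value < alpha) and (t_stat > 0)
    return p_value, significant

# Hypothetical FScore values of two methods on four data sets
x = [0.80, 0.90, 0.70, 0.85]
y = [0.60, 0.70, 0.65, 0.60]
```

With 20 paired values per comparison, as in the paper, the test has more power than this four-point illustration.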
The outcomes of the paired t-tests are presented in Table 5. As the paired t-tests show, the advantage of MVSC-IR and MVSC-IV over the other methods is statistically significant. A special case is the graphEJ algorithm. On the one hand, MVSC-IR is not significantly better than graphEJ based on FScore or NMI. On the other hand, even when MVSC-IR and MVSC-IV test as clearly better than graphEJ, the p-values can still be considered relatively large, although they are smaller than 0.05. The reason is that, as observed before, graphEJ's results on the classic data set are very different from those of the other algorithms. While interesting, these values can be considered outliers, and including them in the statistical tests would affect the outcomes greatly. Hence, we also report in Table 5 the tests where classic was excluded and only the results on the other 19 data sets were used. Under this circumstance, both MVSC-IR and MVSC-IV outperform graphEJ significantly, with good p-values.
5.4 Effect of λ on MVSC-IR's Performance
It is known that criterion-function-based partitional clustering methods can be sensitive to cluster size and balance. In the formulation of I_R in (16), there exists a parameter λ, called the regulating factor, with λ ∈ [0, 1]. To examine how the choice of λ affects MVSC-IR's performance, we evaluated MVSC-IR with values of λ from 0 to 1 in increments of 0.1. The assessment was done based on the clustering results in NMI, FScore, and Accuracy, each averaged over all 20 given data sets. Since the evaluation metrics on different data sets can differ considerably from each other, simply taking the average over all the data sets would not be very meaningful. Hence, we employed the method used in [18] to transform the metrics into relative metrics before averaging. On a particular document collection S, the relative
NGUYEN ET AL.: CLUSTERING WITH MULTIVIEWPOINT-BASED SIMILARITY MEASURE 995
2. http://en.wikipedia.org/wiki/Hungarian_algorithm.
FScore measure of MVSC-IR with λ = λ_i is determined as follows:

$$\mathrm{relative\_FScore}(I_R; S, \lambda_i) = \frac{\max_{\lambda_j}\left\{\mathrm{FScore}(I_R; S, \lambda_j)\right\}}{\mathrm{FScore}(I_R; S, \lambda_i)},$$
where λ_i, λ_j ∈ {0.0, 0.1, ..., 1.0}, and FScore(I_R; S, λ_i) is the FScore result on data set S obtained by MVSC-IR with λ = λ_i. The same transformation was applied to NMI and Accuracy to yield relative_NMI and relative_Accuracy, respectively. MVSC-IR performs best with a given λ_i if its relative measure has a value of 1; otherwise its relative measure is greater than 1, and the larger this value is, the worse MVSC-IR with λ_i performs in comparison with the other settings of λ. Finally, the average relative measures were calculated over all the data sets to present the overall performance.
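The relative-metric transformation is straightforward to sketch (the dictionary keys and values below are illustrative, not the paper's measurements):

```python
def relative_metric(metric_by_lambda):
    """Transform raw metric values (e.g., FScore) for each lambda setting
    into relative values, as in Sec. 5.4: the best setting maps to 1.0,
    worse settings to values greater than 1.0."""
    best = max(metric_by_lambda.values())
    return {lam: best / score for lam, score in metric_by_lambda.items()}
```

Because every data set's best setting is normalized to 1, relative values are comparable across collections and can be averaged meaningfully.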
Fig. 7 shows the plot of the average relative FScore, NMI, and Accuracy w.r.t. different values of λ. In a broad view, MVSC-IR performs worst at the extreme values of λ (0 and 1), and tends to get better when λ is set to some intermediate value between 0 and 1. Based on our experimental study, MVSC-IR always produces results within 5 percent of the best case, under any type of evaluation metric, with λ from 0.2 to 0.8.
6 MVSC AS REFINEMENT FOR k-MEANS
From the analysis of (12) in Section 3.2, MVS provides an additional criterion for measuring the similarity among documents compared with CS. Alternatively, MVS can be considered a refinement of CS, and consequently the MVSC algorithms refinements of spherical k-means, which uses CS. To further investigate the appropriateness and effectiveness of MVS and its clustering algorithms, we carried out another set of experiments in which solutions obtained by Spkmeans were further optimized by MVSC-IR and MVSC-IV. The rationale for doing so is that if the final solutions by MVSC-IR and MVSC-IV are better than the
Fig. 6. Clustering results in Accuracy. Left-to-right in legend corresponds to left-to-right in the plot.
intermediate ones obtained by Spkmeans, MVS is indeed good for the clustering problem. These experiments would reveal more clearly whether MVS actually improves the clustering performance compared with CS.

In the previous section, the MVSC algorithms were compared against the existing algorithms that are closely related to them, i.e., ones that also employ similarity measures and criterion functions. In this section, we make use of the extended experiments to further compare MVSC with a different type of clustering approach, the NMF methods [10], which do not use any form of explicitly defined similarity measure for documents.
6.1 TDT2 and Reuters-21578 Collections
For variety and thoroughness, in this empirical study we used two new document corpora, described in Table 6: TDT2 and Reuters-21578. The original TDT2 corpus,3 which consists of 11,201 documents in 96 topics (i.e., classes), has been one of the most standard sets for document clustering purposes. We used a subcollection of this corpus which contains 10,021 documents in the largest 56 topics. The Reuters-21578 Distribution 1.0 has been mentioned earlier in this paper. The original corpus consists of 21,578 documents
TABLE 4: Clustering Results in NMI

TABLE 3: Clustering Results in FScore
3. http://www.nist.gov/speech/tests/tdt/tdt98/index.html.
in 135 topics. We used a subcollection having 8,213 documents from the largest 41 topics. The same two document collections were used in the paper on the NMF methods [10]. Documents that appear in two or more topics were removed, and the remaining documents were preprocessed in the same way as in Section 5.1.
6.2 Experiments and Results
The following clustering methods:

- Spkmeans: spherical k-means
- rMVSC-IR: refinement of Spkmeans by MVSC-IR
- rMVSC-IV: refinement of Spkmeans by MVSC-IV
- MVSC-IR: normal MVSC using criterion I_R
- MVSC-IV: normal MVSC using criterion I_V,

and two new document clustering approaches that do not use any particular form of similarity measure:

- NMF: Nonnegative Matrix Factorization method
- NMF-NCW: Normalized Cut Weighted NMF
were involved in the performance comparison. When used as a refinement for Spkmeans, the algorithms rMVSC-IR and rMVSC-IV worked directly on the output solution of Spkmeans. The cluster assignment produced by Spkmeans was used as the initialization for both rMVSC-IR and rMVSC-IV. We also investigated the performance of the original
MVSC-IR and MVSC-IV further on the new data sets. Besides, it would be interesting to see how they and their Spkmeans-initialized versions fare against each other. What is more, two well-known document clustering approaches based on nonnegative matrix factorization, NMF and NMF-NCW [10], are also included for comparison with our algorithms, which use the explicit MVS measure.
During the experiments, each of the two corpora in Table 6 was used to create six different test cases, each of which corresponded to a distinct number of topics used (c = 5, ..., 10). For each test case, c topics were randomly selected from the corpus and their documents were mixed together to form a test set. This selection was repeated 50 times so that each test case had 50 different test sets. The average performance of the clustering algorithms with k = c was calculated over these 50 test sets. This experimental setup is inspired by the similar experiments conducted in the NMF paper [10]. Furthermore, similar to the previous experimental setup in Section 5.2, each algorithm (including NMF and NMF-NCW) actually considered 10 trials on any test set before using the solution with the best obtainable objective function value as its final output.
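The test-set construction can be sketched as follows; `docs_by_topic` is a hypothetical structure mapping each topic label to its documents, since the corpus format itself is not specified here:

```python
import random

def make_test_sets(docs_by_topic, c, n_sets=50, seed=0):
    """Build test sets as in Sec. 6.2: randomly select c topics and pool
    their documents; repeat n_sets times. Each document is tagged with its
    topic so that k = c clustering results can be evaluated later."""
    rng = random.Random(seed)
    topics = sorted(docs_by_topic)
    test_sets = []
    for _ in range(n_sets):
        chosen = rng.sample(topics, c)  # c distinct topics, no repeats
        pooled = [(t, d) for t in chosen for d in docs_by_topic[t]]
        test_sets.append(pooled)
    return test_sets
```

Averaging the evaluation metrics over the 50 pooled sets per test case smooths out the variance introduced by the random topic selection.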
The clustering results on TDT2 and Reuters-21578 are shown in Tables 7 and 8, respectively. For each test case in a column, the value in bold and underlined is the best among the results returned by the algorithms, while the value in bold only is the second best. From the tables, several observations can be made. First, MVSC-IR and MVSC-IV continue to show that they are good clustering algorithms by outperforming the other methods frequently. They are always the best in every test case of TDT2.
Fig. 7. MVSC-IR's performance with respect to λ.

TABLE 6: TDT2 and Reuters-21578 Document Corpora

TABLE 5: Statistical Significance of Comparisons Based on Paired t-Tests with 5 Percent Significance Level
Compared with NMF-NCW, they are better in almost all the cases, the only exception being Reuters-21578 with k = 5, where NMF-NCW is the best based on Accuracy.
The second observation, which is also the main objective of this empirical study, is that by applying MVSC to refine the output of spherical k-means, clustering solutions are improved significantly. Both rMVSC-IR and rMVSC-IV lead to higher NMI and Accuracy than Spkmeans in all the cases. Interestingly, there are many circumstances where Spkmeans' result is worse than that of the NMF clustering methods, but after being refined by the MVSCs, it becomes better. For a more descriptive picture of the improvements, we can refer to the radar charts in Fig. 8. The figure shows details of a particular test case where k = 5. Remember that a test case consists of 50 different test sets. The charts display the result on each test set, including the accuracy result
TABLE 7: Clustering Results on TDT2

TABLE 8: Clustering Results on Reuters-21578
Fig. 8. Accuracies on the 50 test sets (in sorted order of Spkmeans) in the test case k ¼ 5.
obtained by Spkmeans and the results after refinement by MVSC, namely rMVSC-IR and rMVSC-IV. For effective visualization, they are sorted in ascending order of the accuracies obtained by Spkmeans (clockwise). As the patterns in both Figs. 8a and 8b reveal, improvement in accuracy is most likely attainable by rMVSC-IR and rMVSC-IV. Many of the improvements come with a considerably large margin, especially when the original accuracy obtained by Spkmeans is low. There are only a few exceptions where, after refinement, accuracy becomes worse; nevertheless, the decreases in such cases are small.
Finally, it is also interesting to notice from Tables 7 and 8 that MVSC preceded by spherical k-means does not necessarily yield better clustering results than MVSC with random initialization. There are only a small number of cases in the two tables where rMVSC is found to be better than MVSC. This phenomenon, however, is understandable. Given a locally optimal solution returned by spherical k-means, the rMVSC algorithms, as refinement methods, would be constrained by this local optimum itself and, hence, their search space might be restricted. The original MVSC algorithms, on the other hand, are not subject to this constraint and are able to follow the search trajectory of their objective function from the beginning. Hence, while the performance improvement obtained by refining spherical k-means' results with MVSC proves the appropriateness of MVS and its criterion functions for document clustering, this observation in fact only reaffirms their potential.
7 CONCLUSIONS AND FUTURE WORK
In this paper, we propose a Multiviewpoint-based Similarity measuring method, named MVS. Theoretical analysis and empirical examples show that MVS is potentially more suitable for text documents than the popular cosine similarity. Based on MVS, two criterion functions, I_R and I_V, and their respective clustering algorithms, MVSC-IR and MVSC-IV, have been introduced. Compared with other state-of-the-art clustering methods that use different types of similarity measure, on a large number of document data sets and under different evaluation metrics, the proposed algorithms show that they can provide significantly improved clustering performance.
The key contribution of this paper is the fundamental concept of similarity measured from multiple viewpoints. Future methods could make use of the same principle but define alternative forms for the relative similarity in (10), or combine the relative similarities according to the different viewpoints by methods other than averaging. Besides, this paper focuses on partitional clustering of documents. In the future, it would also be possible to apply the proposed criterion functions to hierarchical clustering algorithms. Finally, we have shown the application of MVS and its clustering algorithms to text data. It would be interesting to explore how they work on other types of sparse and high-dimensional data.
REFERENCES
[1] X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z.-H. Zhou, M. Steinbach, D.J. Hand, and D. Steinberg, "Top 10 Algorithms in Data Mining," Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37, 2007.
[2] I. Guyon, U.V. Luxburg, and R.C. Williamson, “Clustering:Science or Art?,” Proc. NIPS Workshop Clustering Theory, 2009.
[3] I. Dhillon and D. Modha, “Concept Decompositions for LargeSparse Text Data Using Clustering,” Machine Learning, vol. 42,nos. 1/2, pp. 143-175, Jan. 2001.
[4] S. Zhong, “Efficient Online Spherical K-means Clustering,” Proc.IEEE Int’l Joint Conf. Neural Networks (IJCNN), pp. 3180-3185, 2005.
[5] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, “Clustering withBregman Divergences,” J. Machine Learning Research, vol. 6,pp. 1705-1749, Oct. 2005.
[6] E. Pekalska, A. Harol, R.P.W. Duin, B. Spillmann, and H. Bunke,“Non-Euclidean or Non-Metric Measures Can Be Informative,”Structural, Syntactic, and Statistical Pattern Recognition, vol. 4109,pp. 871-880, 2006.
[7] M. Pelillo, “What Is a Cluster? Perspectives from Game Theory,”Proc. NIPS Workshop Clustering Theory, 2009.
[8] D. Lee and J. Lee, “Dynamic Dissimilarity Measure for SupportBased Clustering,” IEEE Trans. Knowledge and Data Eng., vol. 22,no. 6, pp. 900-905, June 2010.
[9] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, “Clustering on theUnit Hypersphere Using Von Mises-Fisher Distributions,”J. Machine Learning Research, vol. 6, pp. 1345-1382, Sept. 2005.
[10] W. Xu, X. Liu, and Y. Gong, "Document Clustering Based on Non-Negative Matrix Factorization," Proc. 26th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 267-273, 2003.
[11] I.S. Dhillon, S. Mallela, and D.S. Modha, “Information-TheoreticCo-Clustering,” Proc. Ninth ACM SIGKDD Int’l Conf. KnowledgeDiscovery and Data Mining (KDD), pp. 89-98, 2003.
[12] C.D. Manning, P. Raghavan, and H. Schutze, An Introduction toInformation Retrieval. Cambridge Univ. Press, 2009.
[13] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, “A Min-Max CutAlgorithm for Graph Partitioning and Data Clustering,” Proc.IEEE Int’l Conf. Data Mining (ICDM), pp. 107-114, 2001.
[14] H. Zha, X. He, C. Ding, H. Simon, and M. Gu, “Spectral Relaxationfor K-Means Clustering,” Proc. Neural Info. Processing Systems(NIPS), pp. 1057-1064, 2001.
[15] J. Shi and J. Malik, “Normalized Cuts and Image Segmentation,”IEEE Trans. Pattern Analysis Machine Intelligence, vol. 22, no. 8,pp. 888-905, Aug. 2000.
[16] I.S. Dhillon, “Co-Clustering Documents and Words UsingBipartite Spectral Graph Partitioning,” Proc. Seventh ACMSIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD),pp. 269-274, 2001.
[17] Y. Gong and W. Xu, Machine Learning for Multimedia ContentAnalysis. Springer-Verlag, 2007.
[18] Y. Zhao and G. Karypis, “Empirical and Theoretical Comparisonsof Selected Criterion Functions for Document Clustering,”Machine Learning, vol. 55, no. 3, pp. 311-331, June 2004.
[19] G. Karypis, “CLUTO a Clustering Toolkit,” technical report, Dept.of Computer Science, Univ. of Minnesota, http://glaros.dtc.umn.edu/~gkhome/views/cluto, 2003.
[20] A. Strehl, J. Ghosh, and R. Mooney, “Impact of SimilarityMeasures on Web-Page Clustering,” Proc. 17th Nat’l Conf. ArtificialIntelligence: Workshop of Artificial Intelligence for Web Search (AAAI),pp. 58-64, July 2000.
[21] A. Ahmad and L. Dey, “A Method to Compute Distance BetweenTwo Categorical Values of Same Attribute in UnsupervisedLearning for Categorical Data Set,” Pattern Recognition Letters,vol. 28, no. 1, pp. 110-118, 2007.
[22] D. Ienco, R.G. Pensa, and R. Meo, “Context-Based DistanceLearning for Categorical Data Clustering,” Proc. Eighth Int’l Symp.Intelligent Data Analysis (IDA), pp. 83-94, 2009.
[23] P. Lakkaraju, S. Gauch, and M. Speretta, “Document SimilarityBased on Concept Tree Distance,” Proc. 19th ACM Conf. Hypertextand Hypermedia, pp. 127-132, 2008.
[24] H. Chim and X. Deng, “Efficient Phrase-Based DocumentSimilarity for Clustering,” IEEE Trans. Knowledge and Data Eng.,vol. 20, no. 9, pp. 1217-1229, Sept. 2008.
[25] S. Flesca, G. Manco, E. Masciari, L. Pontieri, and A. Pugliese, "Fast Detection of XML Structural Similarity," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 2, pp. 160-175, Feb. 2005.
[26] E.-H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V.Kumar, B. Mobasher, and J. Moore, “Webace: A Web Agent forDocument Categorization and Exploration,” Proc. Second Int’lConf. Autonomous Agents (AGENTS ’98), pp. 408-415, 1998.
[27] J. Friedman and J. Meulman, “Clustering Objects on Subsets ofAttributes,” J. Royal Statistical Soc. Series B Statistical Methodology,vol. 66, no. 4, pp. 815-839, 2004.
[28] L. Hubert, P. Arabie, and J. Meulman, Combinatorial Data Analysis:Optimization by Dynamic Programming. SIAM, 2001.
[29] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, seconded. John Wiley & Sons, 2001.
[30] S. Zhong and J. Ghosh, “A Comparative Study of GenerativeModels for Document Clustering,” Proc. SIAM Int’l Conf. DataMining Workshop Clustering High Dimensional Data and Its Applica-tions, 2003.
[31] Y. Zhao and G. Karypis, “Criterion Functions for DocumentClustering: Experiments and Analysis,” technical report, Dept. ofComputer Science, Univ. of Minnesota, 2002.
[32] T.M. Mitchell, Machine Learning. McGraw-Hill, 1997.
Duc Thang Nguyen received the BEng degree in electrical and electronic engineering from Nanyang Technological University, Singapore, where he is also working toward the PhD degree in the Division of Information Engineering. Currently, he is an operations planning analyst at PSA Corporation Ltd., Singapore. His research interests include algorithms, information retrieval, data mining, optimization, and operations research.
Lihui Chen received the BEng degree in computer science and engineering from Zhejiang University, China, and the PhD degree in computational science from the University of St. Andrews, United Kingdom. Currently, she is an associate professor in the Division of Information Engineering at Nanyang Technological University, Singapore. Her research interests include machine learning algorithms and applications, data mining, and web intelligence. She has published more than 70 refereed papers in international journals and conferences in these areas. She is a senior member of the IEEE and a member of the IEEE Computational Intelligence Society.
Chee Keong Chan received the BEng degree from the National University of Singapore in electrical and electronic engineering, the MSc degree and DIC in computing from Imperial College, University of London, and the PhD degree from Nanyang Technological University. Upon graduation from NUS, he worked as an R&D engineer at Philips Singapore for several years. Currently, he is an associate professor in the Information Engineering Division, lecturing in subjects related to computer systems, artificial intelligence, software engineering, and cyber security. Through his years at NTU, he has published numerous research papers in conferences, journals, books, and book chapters. He has also provided numerous consultations to industry. His current research interests include data mining (text and solar radiation data mining), evolutionary algorithms (scheduling and games), and renewable energy.
For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.