Clustering algorithms and validity measures
M. Halkidi, Y. Batistakis, M. Vazirgiannis
Department of Informatics
Athens University of Economics & Business
Email: {mhalk, yannis, mvazirg}@aueb.gr
Abstract
Clustering aims at discovering groups and identifying interesting distributions and patterns in data sets. Researchers have extensively studied clustering since it arises in many application domains in engineering and the social sciences. In recent years, the availability of huge transactional and experimental data sets and the resulting requirements of data mining have created the need for clustering algorithms that scale and can be applied in diverse domains. This paper surveys clustering methods and approaches available in the literature in a comparative way. It also presents the basic concepts, principles and assumptions upon which the clustering algorithms are based. Another important issue is the validity of the clustering schemes that result from applying the algorithms, which is also related to the inherent features of the data set under concern. We review and compare clustering validity measures available in the literature. Furthermore, we illustrate the issues that are under-addressed by recent algorithms and point out new research directions.
1. Introduction
Clustering is one of the most useful tasks in the data mining process for discovering groups and identifying interesting distributions and patterns in the underlying data. The clustering problem is about partitioning a given data set into groups (clusters) such that the data points in a cluster are more similar to each other than to points in different clusters [10]. For example, consider a retail database whose records contain the items purchased by customers. A clustering procedure could group the customers in such a way that customers with similar buying patterns are in the same cluster. Thus, the main concern in the clustering process is to reveal the organization of patterns into "sensible" groups, which allow us to discover similarities and differences, as well as to derive useful conclusions about them. This idea is applicable in many fields, such as the life sciences, the medical sciences and engineering. Clustering may be found under different names in different contexts, such as unsupervised learning (in pattern recognition), numerical taxonomy (in biology, ecology), typology (in the social sciences) and partition (in graph theory) [28].
In the clustering process there are no predefined classes and no examples that would show what kind of desirable relations should hold among the data, which is why it is perceived as an unsupervised process [1]. On the other hand, classification is a procedure of assigning a data item to one of a predefined set of categories [8]. Clustering produces the initial categories into which the values of a data set are classified during the classification process. The clustering process may result in different partitionings of a data set, depending on the specific criterion used for clustering. Thus, there is a need for preprocessing before we undertake a clustering task on a data set. The basic steps of the clustering process can be summarized as follows [8]:
♦ Feature selection. The goal is to properly select the features on which clustering is to be performed, so as to encode as much information as possible concerning the task of interest. Thus, preprocessing of the data may be necessary prior to its use in the clustering task.
♦ Clustering algorithm. This step refers to the choice of an algorithm that results in the definition of a good clustering scheme for the data set. A clustering algorithm is mainly characterized by a proximity measure and a clustering criterion, as well as by its efficiency in defining a clustering scheme that fits the data set.
i) Proximity measure. This is a measure that quantifies how "similar" two data points (i.e. feature vectors) are. In most cases we have to ensure that all selected features contribute equally to the computation of the proximity measure and that no features dominate others.
ii) Clustering criterion. In this step, we have to define the clustering criterion, which can be expressed via a cost function or some other type of rule. We should stress that we have to take into account the type of clusters that are expected to occur in the data set. Thus, we may define a "good" clustering criterion, leading to a partitioning that fits the data set well.
♦ Validation of the results. The correctness of a clustering algorithm's results is verified using appropriate criteria and techniques. Since clustering algorithms define clusters that are not known a priori, irrespective of the clustering method, the final partition of the data requires some kind of evaluation in most applications [24].
♦ Interpretation of the results. In many cases, experts in the application area have to integrate the clustering results with other experimental evidence and analysis in order to draw the right conclusions.
Clustering Applications. Cluster analysis is a major tool in a number of applications in many fields of business and science. Here we summarize the basic directions in which clustering is used [28]:
♦ Data reduction. Cluster analysis can contribute to the compression of the information included in the data. In several cases, the amount of available data is very large and its processing becomes very demanding. Clustering can be used to partition the data set into a number of "interesting" clusters. Then, instead of processing the data set as a whole, we adopt the representatives of the defined clusters in our process. Thus, data compression is achieved.
♦ Hypothesis generation. Cluster analysis is used here in order to infer some hypotheses concerning the data. For instance, we may find in a retail database that there are two significant groups of customers based on their age and the time of their purchases. Then, we may infer some hypotheses about the data, that is, "young people go shopping in the evening" and "old people go shopping in the morning".
♦ Hypothesis testing. In this case, cluster analysis is used to verify the validity of a specific hypothesis. For example, consider the following hypothesis: "Young people go shopping in the evening". One way to verify whether this is true is to apply cluster analysis to a representative set of stores. Suppose that each store is represented by its customers' details (age, job, etc.) and the times of its transactions. If, after applying cluster analysis, a cluster that corresponds to "young people buy in the evening" is formed, then the hypothesis is supported by the cluster analysis.
♦ Prediction based on groups. Cluster analysis is applied to the data set and the resulting clusters are characterized by the features of the patterns that belong to them. Then, unknown patterns can be classified into specific clusters based on their similarity to the clusters' features, and useful knowledge related to the data can be extracted. Assume, for example, that cluster analysis is applied to a data set concerning patients infected by the same disease. The result is a number of clusters of patients, grouped according to their reaction to specific drugs. Then, for a new patient, we identify the cluster to which he or she can be assigned, and based on this decision his or her medication can be chosen.
More specifically, some typical applications of clustering are in the following fields [12]:
• Business. In business, clustering may help marketers discover significant groups in their customer databases and characterize them based on purchasing patterns.
• Biology. In biology, it can be used to define taxonomies, categorize genes with similar functionality and gain insights into structures inherent in populations.
• Spatial data analysis. Due to the huge amounts of spatial data that may be obtained from satellite images, medical equipment, Geographical Information Systems (GIS), image database exploration, etc., it is expensive and difficult for users to examine spatial data in detail. Clustering may help to automate the process of analysing and understanding spatial data. It is used to identify and extract interesting characteristics and patterns that may exist in large spatial databases.
• Web mining. In this case, clustering is used to discover significant groups of documents on the Web, a huge collection of semi-structured documents. This classification of Web documents assists in information discovery.
In general terms, clustering may serve as a pre-processing step for other algorithms, such as classification, which would then operate on the detected clusters.
Clustering Algorithms Categories. A multitude of clustering methods has been proposed in the literature. Clustering algorithms can be classified according to:
♦ The type of data input to the algorithm.
♦ The clustering criterion defining the similarity between data points.
♦ The theory and fundamental concepts on which clustering analysis techniques are based (e.g. fuzzy theory, statistics).
Thus, according to the method adopted to define clusters, the algorithms can be broadly classified into the following types [16]:
•• Partitional clustering attempts to directly decompose the data set into a set of disjoint clusters. More specifically, these algorithms attempt to determine an integer number of partitions that optimise a certain criterion function. The criterion function may emphasize the local or the global structure of the data, and its optimisation is an iterative procedure.
•• Hierarchical clustering proceeds successively either by merging smaller clusters into larger ones or by splitting larger clusters. The result of the algorithm is a tree of clusters, called a dendrogram, which shows how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint groups is obtained.
•• Density-based clustering. The key idea of this type of clustering is to group neighbouring objects of a data set into clusters based on density conditions.
•• Grid-based clustering. This type of algorithm is mainly proposed for spatial data mining. Its main characteristic is that it quantises the space into a finite number of cells and then performs all operations on the quantised space.
For each of the above categories there is a wealth of subtypes and different algorithms for finding the clusters. Thus, according to the type of variables allowed in the data set, the algorithms can be categorized into [11, 15, 24]:
•• Statistical, which are based on statistical analysis concepts. They use similarity measures to partition objects and are limited to numeric data.
•• Conceptual, which are used to cluster categorical data. They cluster objects according to the concepts they carry.
Another classification criterion is the way clustering handles uncertainty in terms of cluster overlapping:
•• Fuzzy clustering, which uses fuzzy techniques to cluster data and considers that an object can be assigned to more than one cluster. This type of algorithm leads to clustering schemes that are compatible with everyday-life experience, as they handle the uncertainty of real data. The most important fuzzy clustering algorithm is Fuzzy C-Means [2].
•• Crisp clustering, which considers non-overlapping partitions, meaning that a data point either belongs to a class or it does not. Most clustering algorithms result in crisp clusters, and thus can be categorized as crisp clustering.
•• Kohonen net clustering, which is based on the concepts of neural networks. The Kohonen network has input and output nodes. The input layer (input nodes) has one node for each attribute of a record, and each of these is connected to every output node (output layer). Each connection is associated with a weight, which determines the position of the corresponding output node. Thus, according to an algorithm that changes the weights appropriately, the output nodes move to form clusters.
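The weight-update mechanism described above can be sketched as follows. This is a minimal, illustrative one-dimensional self-organising map step in pure Python; the function names, learning rate and neighbourhood width are our own choices, not from the paper.

```python
import math

def best_matching_unit(weights, x):
    """Index of the output node whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda j: sum((wi - xi) ** 2 for wi, xi in zip(weights[j], x)))

def som_update(weights, x, bmu_index, lr=0.5, sigma=1.0):
    """One Kohonen update: pull each output node's weight vector toward
    input x, scaled by a neighbourhood function centred on the winner."""
    new_weights = []
    for j, w in enumerate(weights):
        # neighbourhood strength decays with grid distance from the winner
        h = math.exp(-((j - bmu_index) ** 2) / (2 * sigma ** 2))
        new_weights.append([wi + lr * h * (xi - wi) for wi, xi in zip(w, x)])
    return new_weights

# 1-D output layer of three nodes, 2-D input records
w = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
bmu = best_matching_unit(w, [0.9, 1.1])   # node 1 wins
w = som_update(w, [0.9, 1.1], bmu)
```

Repeating this step over many inputs moves the output nodes toward dense regions of the data, which is how the clusters form.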
In general terms, clustering algorithms are based on a criterion for assessing the quality of a given partitioning. More specifically, they take as input some parameters (e.g. number of clusters, density of clusters) and attempt to define the best partitioning of a data set for the given parameters. Thus, they define a partitioning of a data set based on certain assumptions, and not necessarily the "best" one that fits the data set.
Since clustering algorithms discover clusters that are not known a priori, the final partition of a data set requires some sort of evaluation in most applications [24]. For instance, questions like "how many clusters are there in the data set?", "does the resulting clustering scheme fit our data set?" and "is there a better partitioning for our data set?" call for validation of the clustering results and are the subject of methods discussed in the literature. These methods aim at the quantitative evaluation of the results of clustering algorithms and are known under the general term cluster validity methods.
The remainder of the paper is organized as follows. In the next section we present the main categories of clustering algorithms that are available in the literature. Then, in Section 3 we discuss the main characteristics of these algorithms in a comparative way. In Section 4 we present the main concepts of clustering validity indices and the techniques proposed in the literature for evaluating clustering results. Moreover, an experimental study based on some of these validity indices is presented, using synthetic and real data sets. We conclude in Section 5 by summarizing and presenting the trends in clustering.
2. Clustering Algorithms
In recent years, a number of clustering algorithms have been proposed and are available in the literature. Some representative algorithms of the above categories follow.
2.1 Partitional Algorithms
In this category, K-Means is a commonly used algorithm [18]. The aim of K-Means clustering is the optimisation of an objective function described by the equation
E = Σ_{i=1}^{c} Σ_{x∈C_i} d(x, m_i)        (1)
In the above equation, m_i is the center of cluster C_i, while d(x, m_i) is the Euclidean distance between a point x and m_i. Thus, the criterion function E attempts to minimize the distance of every point from the center of the cluster to which the point belongs. More specifically, the algorithm begins by initialising a set of c cluster centers. Then, it assigns each object of the data set to the cluster whose center is nearest and re-computes the centers. The process continues until the centers of the clusters stop changing.
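The assign-and-recompute loop just described can be sketched in a few lines of Python. This is an illustrative implementation under the assumption that the initial centers are given; it is not the implementation referenced in [18].

```python
def k_means(points, centers, max_iter=100):
    """Plain k-means: assign each point to its nearest center, then
    recompute each center as the mean of its cluster; stop when stable."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            # nearest center by squared Euclidean distance
            j = min(range(len(centers)),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        new_centers = [
            [sum(c) / len(cl) for c in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:      # centers stopped changing
            break
        centers = new_centers
    return centers, clusters

pts = [[0, 0], [0, 1], [10, 10], [10, 11]]
centers, clusters = k_means(pts, [[0, 0], [10, 10]])
```

On this toy data the centers converge to the means of the two obvious groups after one recomputation.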
Another algorithm of this category is PAM (Partitioning Around Medoids). The objective of PAM is to determine a representative object (medoid) for each cluster, that is, to find the most centrally located object within each cluster. The algorithm begins by selecting an object as the medoid for each of c clusters. Then, each of the non-selected objects is grouped with the medoid to which it is most similar. PAM keeps swapping medoids with non-selected objects as long as the quality of the clustering improves. It is clear that PAM is an expensive algorithm as regards finding the medoids, as it compares each object with the entire data set [22].
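The medoid-swapping idea can be sketched as follows; this is a minimal illustration of a single greedy pass (a full PAM implementation iterates this step until no swap lowers the cost), with the function names and the 1-D distance our own choices.

```python
def total_cost(points, medoids, dist):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_step(points, medoids, dist):
    """Try every (medoid, non-medoid) swap; keep the single swap that
    lowers the total cost most, or return the medoids unchanged."""
    best, best_cost = medoids, total_cost(points, medoids, dist)
    for i in range(len(medoids)):
        for p in points:
            if p in medoids:
                continue
            cand = medoids[:i] + [p] + medoids[i + 1:]
            c = total_cost(points, cand, dist)
            if c < best_cost:
                best, best_cost = cand, c
    return best, best_cost

d1 = lambda a, b: abs(a - b)
pts = [0, 1, 2, 10, 11, 12]
meds, cost = pam_step(pts, [0, 10], d1)
```

The nested loops over every medoid/non-medoid pair are exactly what makes PAM expensive on large data sets.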
CLARA (Clustering Large Applications) is an implementation of PAM on subsets of the data set. It draws multiple samples of the data set, applies PAM to each sample, and then outputs the best clustering from these samples [22].
CLARANS (Clustering Large Applications based on Randomized Search) combines sampling techniques with PAM. The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids. The clustering obtained after replacing a medoid is called a neighbour of the current clustering. CLARANS selects a node and compares it to a user-defined number of its neighbours, searching for a local minimum. If a better neighbour is found (i.e., one having a lower square error), CLARANS moves to the neighbour's node and the process starts again; otherwise the current clustering is a local optimum. When a local optimum is found, CLARANS starts over from a new randomly selected node in search of a new local optimum.
Finally, K-prototypes and K-modes [15] are based on the K-Means algorithm, but they aim at clustering categorical data.
2.2 Hierarchical Algorithms
Hierarchical clustering algorithms can be further divided, according to the method by which they produce clusters, into [28]:
♦ Agglomerative algorithms. These produce a sequence of clustering schemes with a decreasing number of clusters at each step. The clustering scheme produced at each step results from the previous one by merging the two closest clusters into one.
♦ Divisive algorithms. These algorithms produce a sequence of clustering schemes with an increasing number of clusters at each step. Contrary to the agglomerative algorithms, the clustering produced at each step results from the previous one by splitting a cluster into two.
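The agglomerative merge step can be sketched as follows. The single-link rule used here (cluster distance = closest pair of members) is one common choice among several, not something prescribed above; the code is a quadratic illustrative sketch, not an efficient implementation.

```python
def single_link_agglomerative(points, k, dist):
    """Start with one cluster per point; repeatedly merge the two clusters
    whose closest pair of members is nearest, until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link distance between clusters i and j
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

d1 = lambda a, b: abs(a - b)
out = single_link_agglomerative([0, 1, 10, 11], 2, d1)
```

Recording the sequence of merges (rather than stopping at k clusters) yields the dendrogram that hierarchical methods output.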
In the sequel, we describe some representative hierarchical clustering algorithms.
BIRCH [32] uses a hierarchical data structure called a CF-tree for partitioning the incoming data points in an incremental and dynamic way. The CF-tree is a height-balanced tree that stores the clustering features, and it is based on two parameters: the branching factor B and the threshold T, which refers to the diameter of a cluster (the diameter (or radius) of each cluster must be less than T). BIRCH can typically find a good clustering with a single scan of the data and improve the quality further with a few additional scans. It is also the first clustering algorithm to handle noise effectively [32]. However, a node does not always correspond to a natural cluster, since each node in the CF-tree can hold only a limited number of entries due to its size. Moreover, BIRCH is order-sensitive, as it may generate different clusters for different orders of the same input data.
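The clustering feature stored in each CF-tree entry is the triple CF = (N, LS, SS): the number of points, their linear sum and their sum of squared norms. Its usefulness comes from additivity: two subclusters can be merged by adding their CF vectors, without revisiting the raw points. A minimal sketch (for illustration only; a real CF-tree also enforces the B and T constraints):

```python
def cf(points):
    """Clustering feature of a set of d-dimensional points:
    (N, linear sum LS, sum of squared norms SS)."""
    n = len(points)
    ls = [sum(c) for c in zip(*points)]
    ss = sum(x * x for p in points for x in p)
    return n, ls, ss

def cf_merge(a, b):
    """CF vectors are additive, so merging two subclusters is O(d)."""
    return (a[0] + b[0],
            [x + y for x, y in zip(a[1], b[1])],
            a[2] + b[2])

def cf_centroid(c):
    """The centroid is recoverable from the CF alone: LS / N."""
    return [x / c[0] for x in c[1]]

a = cf([[0.0, 0.0], [2.0, 0.0]])
b = cf([[4.0, 4.0]])
m = cf_merge(a, b)
```

Quantities such as the cluster radius and diameter are likewise computable from (N, LS, SS), which is why a single scan suffices to build the tree.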
CURE [10] represents each cluster by a certain number of points that are generated by selecting well-scattered points and then shrinking them toward the cluster centroid by a specified fraction. It uses a combination of random sampling and partitioning to handle large databases.
ROCK [11] is a robust clustering algorithm for Boolean and categorical data. It introduces two new concepts, a point's neighbours and links, and it is based on them in order to measure the similarity/proximity between a pair of data points.
2.3 Density-Based Algorithms
Density-based algorithms typically regard clusters as dense regions of objects in the data space that are separated by regions of low density. A widely known algorithm of this category is DBSCAN [6]. The key idea in DBSCAN is that, for each point of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points. DBSCAN can handle noise (outliers) and discover clusters of arbitrary shape. Moreover, DBSCAN is used as the basis for an incremental clustering algorithm proposed in [7]. Due to its density-based nature, the insertion or deletion of an object affects the current clustering only in the neighbourhood of this object, and thus efficient algorithms based on DBSCAN can be given for incremental insertions and deletions to an existing clustering [7].
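The core-point idea can be sketched directly from the definition above. This is an illustrative quadratic-time version (real implementations use spatial indexes to find neighbourhoods efficiently), with eps and min_pts corresponding to the two input parameters discussed later.

```python
def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN: a point with at least min_pts neighbours within eps
    is a core point; clusters grow by expanding from core points.
    Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]
        if len(neigh) < min_pts:
            labels[i] = -1                 # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neigh if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # border point, reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = [q for q in range(len(points)) if dist(points[j], points[q]) <= eps]
            if len(jn) >= min_pts:         # j is itself a core point: expand
                queue.extend(jn)
    return labels

d1 = lambda a, b: abs(a - b)
labels = dbscan([0, 1, 2, 50, 51, 52, 100], eps=1.5, min_pts=2, dist=d1)
```

Note how the isolated point receives the noise label rather than being forced into a cluster, which is the behaviour partitional algorithms lack.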
In [14] another density-based clustering algorithm, DENCLUE, is proposed. This algorithm introduces a new approach to clustering large multimedia databases. The basic idea of this approach is to model the overall point density analytically as the sum of the influence functions of the data points. The influence function can be seen as a function that describes the impact of a data point within its neighbourhood. Clusters can then be identified by determining density attractors, which are local maxima of the overall density function. In addition, clusters of arbitrary shape can be easily described by a simple equation based on the overall density function. The main advantages of DENCLUE are that it has good clustering properties for data sets with large amounts of noise and that it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets. However, DENCLUE clustering is based on two parameters and, as in most other approaches, the quality of the resulting clustering depends on their choice. These parameters are [14]: i) the parameter σ, which determines the influence of a data point in its neighbourhood, and ii) ξ, which describes whether a density attractor is significant, allowing a reduction of the number of density attractors and helping to improve performance.
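The density model can be illustrated with a Gaussian influence function; the overall density at a location is then simply the sum of the influences of all data points, with sigma playing the role of the smoothing parameter σ discussed above. This is an illustrative one-dimensional sketch, not the DENCLUE implementation.

```python
import math

def density(x, data, sigma):
    """Overall density at x: the sum of Gaussian influence functions
    centred on the data points (the kernel-density view of DENCLUE)."""
    return sum(math.exp(-((x - p) ** 2) / (2 * sigma ** 2)) for p in data)

data = [0.0, 0.2, 5.0]
# The density near the tight pair around 0 exceeds the density near the
# lone point at 5; a density attractor would sit near the pair's centre.
near_pair = density(0.1, data, sigma=0.5)
near_lone = density(5.0, data, sigma=0.5)
```

A hill-climbing step from each point along the gradient of this function is what maps points to their density attractors.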
2.4 Grid-based algorithms
Recently a number of clustering algorithms have beenpresented for spatial data, known as grid-basedalgorithms. These algorithms quantise the space into afinite number of cells and then do all operations on thequantised space.
STING (Statistical Information Grid-based method) [30] is representative of this category. It divides the spatial area into rectangular cells using a hierarchical structure. STING goes through the data set and computes the statistical parameters (such as mean, variance, minimum, maximum and type of distribution) of each numerical feature of the objects within each cell. Then it generates a hierarchical structure of the grid cells so as to represent the clustering information at different levels. Based on this structure, STING enables the use of the clustering information to answer queries and to efficiently assign a new object to the clusters.
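The per-cell summarization step can be sketched as follows. This toy version keeps only a count and a mean for square cells at a single level; STING itself stores richer statistics (variance, minimum, maximum, distribution type) and arranges the cells hierarchically.

```python
def grid_summaries(points, cell_size):
    """Quantise 2-D points into square cells and keep per-cell statistics
    (count and mean), so later queries touch cells instead of raw points."""
    cells = {}
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        n, sx, sy = cells.get(key, (0, 0.0, 0.0))
        cells[key] = (n + 1, sx + x, sy + y)
    # finalize: (count, mean point) per occupied cell
    return {k: (n, (sx / n, sy / n)) for k, (n, sx, sy) in cells.items()}

pts = [(0.5, 0.5), (0.6, 0.9), (3.2, 3.1)]
summary = grid_summaries(pts, cell_size=1.0)
```

Once the summaries exist, dense regions can be found by scanning cells, whose number is independent of the number of data points.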
WaveCluster [25] is the latest grid-based algorithm proposed in the literature. It is based on signal processing techniques (the wavelet transform) to convert the spatial data into the frequency domain. More specifically, it first summarizes the data by imposing a multidimensional grid structure onto the data space [12]. Each grid cell summarizes the information of a group of points that map into the cell. It then uses a wavelet transform to transform the original feature space. In the wavelet transform, convolution with an appropriate function results in a transformed space where the natural clusters in the data become distinguishable; thus, we can identify the clusters by finding the dense regions in the transformed domain. A priori knowledge of the exact number of clusters is not required by WaveCluster.
2.5 Fuzzy Clustering
The algorithms described above result in crisp clusters, meaning that a data point either belongs to a cluster or it does not. The clusters are non-overlapping, and this kind of partitioning is called crisp clustering. The issue of supporting uncertainty in the clustering task leads to the introduction of algorithms that use fuzzy logic concepts in their procedure. A common fuzzy clustering algorithm is Fuzzy C-Means (FCM), an extension of the classical C-Means algorithm for fuzzy applications [2]. FCM attempts to find the most characteristic point in each cluster, which can be considered the "center" of the cluster, and then the grade of membership of each object in the clusters.
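The grade of membership can be sketched as follows. This shows only the membership-update half of one FCM iteration (the other half recomputes the centers as membership-weighted means); the fuzzifier m and the 1-D distance are our own illustrative choices.

```python
def fcm_memberships(points, centers, m=2.0, dist=None):
    """Fuzzy C-Means membership update: u[j][i] is inversely related to the
    distance of point j from center i, normalised over all centers so that
    each point's memberships sum to 1."""
    dist = dist or (lambda a, b: abs(a - b))
    u = []
    for p in points:
        ds = [max(dist(p, c), 1e-12) for c in centers]  # avoid divide-by-zero
        row = [1.0 / sum((di / dj) ** (2.0 / (m - 1.0)) for dj in ds)
               for di in ds]
        u.append(row)
    return u

u = fcm_memberships([0.0, 5.0, 10.0], centers=[0.0, 10.0])
```

The point midway between the two centers receives membership 0.5 in each cluster, which is exactly the overlap that crisp algorithms cannot express.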
Another approach proposed in the literature to solve the problems of crisp clustering is based on probabilistic models. The basis of this type of clustering algorithm is the EM algorithm, which provides a quite general approach to learning in the presence of unobservable variables [20]. A common algorithm is the probabilistic variant of K-Means, which is based on a mixture of Gaussian distributions. This variant of K-Means uses probability density rather than distance to associate records with clusters [1]. More specifically, it regards the centers of the clusters as the means of Gaussian distributions. Then, it estimates the probability that a data point was generated by the j-th Gaussian (i.e., belongs to the j-th cluster). This approach is based on a Gaussian model for extracting clusters and assigns the data points to clusters assuming that they are generated by normal distributions. It is implemented only by algorithms that are based on the EM (Expectation Maximization) algorithm.
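The probability estimate just described is the E-step of EM and can be sketched for one-dimensional data: instead of a hard nearest-center assignment, each point receives a responsibility from every Gaussian component. The means, standard deviations and mixing weights below are illustrative values, not fitted parameters.

```python
import math

def responsibilities(x, means, sigmas, weights):
    """E-step for a 1-D Gaussian mixture: the posterior probability that
    point x was generated by each component (its 'soft' cluster label)."""
    def pdf(v, mu, s):
        return math.exp(-((v - mu) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
    joint = [w * pdf(x, mu, s) for w, mu, s in zip(weights, means, sigmas)]
    z = sum(joint)
    return [j / z for j in joint]

r = responsibilities(0.2, means=[0.0, 5.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
```

The subsequent M-step would re-estimate each mean as a responsibility-weighted average, which reduces to the K-Means center update as the variances shrink.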
3. Comparison of Clustering Algorithms
Clustering is broadly recognized as a useful tool in many applications. Researchers from many disciplines have addressed the clustering problem. However, it is a difficult problem, which combines concepts of diverse scientific fields (such as databases, machine learning, pattern recognition and statistics). Thus, the differences in assumptions and context among different research communities have caused a number of clustering methodologies and algorithms to be defined.
This section offers an overview of the main characteristics of the clustering algorithms presented, in a comparative way. We consider the algorithms categorized into four groups based on their clustering method: partitional, hierarchical, density-based and grid-based algorithms. Tables 1, 2 and 3 summarize the main concepts and characteristics of the most representative algorithms of these clustering categories. More specifically, our study is based on the following features of the algorithms: i) the type of data that an algorithm supports (numerical, categorical), ii) the shape of the clusters, iii) the ability to handle noise and outliers, iv) the clustering criterion and v) the complexity. Moreover, we present the input parameters of the algorithms and we study the influence of these parameters on the clustering results. Finally, we describe the type of each algorithm's results, i.e., the information that the algorithm provides to represent the discovered clusters in a data set.
As Table 1 depicts, partitional algorithms are applicable mainly to numerical data sets. However, there are some variants of K-Means, such as K-mode, which handle categorical data. K-mode is based on the K-Means method to discover clusters, but it adopts new concepts in order to handle categorical data: the cluster centers are replaced by "modes", and a new dissimilarity measure is used to deal with categorical objects. Another characteristic of partitional algorithms is that they are unable to handle noise and outliers, and they are not suitable for discovering clusters with non-convex shapes. Moreover, they are based on certain assumptions in order to partition a data set: they need the number of clusters to be specified in advance, except for CLARANS, which instead needs as input the maximum number of neighbours of a node as well as the number of local minima that will be found in order to define a partitioning of a data set. The result of the clustering process is the set of representative points of the discovered clusters. These points may be the centers or the medoids (most centrally located objects within a cluster) of the clusters, depending on the algorithm. As regards the clustering criteria, the objective of these algorithms is to minimize the distance of the objects within a cluster from the representative point of that cluster. Thus, the criterion of K-Means aims at minimizing the distance of the objects belonging to a cluster from the cluster center, while that of PAM minimizes the distance from the medoid. CLARA and
CLARANS, as mentioned above, are based on the clustering criterion of PAM. However, they consider samples of the data set on which clustering is applied, and as a consequence they can deal with larger data sets than PAM. More specifically, CLARA draws multiple samples of the data set, applies PAM to each sample and returns the best clustering as the output. The problem with this approach is that its efficiency depends on the sample size. Also, the clustering results are produced based only on samples of the data set. Thus, it is clear that if a sample is biased, a good clustering of the samples will not necessarily represent a good clustering of the whole data set. On the other hand, CLARANS is a mixture of PAM and CLARA. A key difference between CLARANS and PAM is that the former searches a subset of the data set in order to define clusters [22]. The subsets are drawn with some randomness at each step of the search, in contrast to CLARA, which has a fixed sample at every stage. This has the benefit of not confining the search to a localized area. In general terms, CLARANS is more efficient and scalable than both CLARA and PAM. The algorithms described above are crisp clustering algorithms, that is, they consider that a data point (object) may belong to one and only one cluster. However, the boundaries of a cluster can hardly be defined in a crisp way if we consider real-life cases. FCM is a representative algorithm of fuzzy clustering, which is based on K-Means concepts in order to partition a data set into clusters. However, it introduces the concept of uncertainty and assigns objects to clusters with an attached degree of belief. Thus, an object may belong to more than one cluster, with a different degree of belief for each.
A summarized view of the characteristics of hierarchical clustering methods is presented in Table 2. The algorithms of this category create a hierarchical decomposition of the database, represented as a dendrogram. They are more efficient than partitional algorithms in handling noise and outliers. However, they break down, due to their non-linear time complexity (typically O(n²), where n is the number of points in the data set) and huge I/O cost, when the number of input data points is large. BIRCH tackles this problem using a hierarchical data structure called a CF-tree for multiphase clustering. In BIRCH, a single scan of the data set yields a good clustering, and one or more additional scans can be used to improve the quality further. However, it handles only numerical data and it is order-sensitive (i.e., it may generate different clusters for different orders of the same input data). Also, BIRCH does not perform well when the clusters do not have uniform size and shape, since it uses only the centroid of a cluster when redistributing the data points in the final phase. On the other hand, CURE employs a combination of random sampling and partitioning to handle large databases. It identifies clusters having non-spherical shapes and wide variances in size by
representing each cluster by multiple points. The representative points of a cluster are generated by selecting well-scattered points from the cluster and shrinking them toward the centre of the cluster by a specified fraction. However, CURE is sensitive to some parameters, such as the number of representative points, the shrink factor used for handling outliers and the number of partitions. Thus, the quality of the clustering results depends on the selection of these parameters. ROCK is a representative hierarchical clustering algorithm for categorical data. It introduces a novel concept called a "link" in order to measure the similarity/proximity between a pair of data points. Thus, the ROCK clustering method extends to non-metric similarity measures that are relevant to categorical data sets. It also exhibits good scalability properties in comparison with traditional algorithms, employing techniques of random sampling. Moreover, it seems to handle successfully data sets with significant differences in the sizes of clusters.
The third category of our study is the density-based clustering algorithms (Table 3). They suitably handle arbitrarily shaped collections of points (e.g. ellipsoidal, spiral, cylindrical) as well as clusters of different sizes. Moreover, they can efficiently separate noise (outliers). Two widely known algorithms of this category, as mentioned above, are DBSCAN and DENCLUE. DBSCAN requires the user to specify the radius of the neighbourhood of a point, Eps, and the minimum number of points in the neighbourhood, MinPts. It is then obvious that DBSCAN is very sensitive to the parameters Eps and MinPts, which are difficult to determine. Similarly, DENCLUE requires careful selection of the values of its input parameters (i.e., σ and ξ), since such parameters may influence the quality of the clustering results. However, the major advantages of DENCLUE in comparison with other clustering algorithms are [12]: i) it has a solid mathematical foundation and generalizes other clustering methods, such as partitional and hierarchical ones, ii) it has good clustering properties for data sets with large amounts of noise, iii) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets, and iv) it uses grid cells and keeps information only about the cells that actually contain points. It manages these cells in a tree-based access structure, and thus it is significantly faster than some influential algorithms, such as DBSCAN. In general terms, the complexity of density-based algorithms is O(n log n). They do not perform any sort of sampling, and thus they could incur substantial I/O costs. Finally, density-based algorithms may fail to use random sampling to reduce the input size, unless the sample size is large, because there may be a substantial difference between the density in the sample's clusters and in the clusters of the whole data set.
Table 1. The main characteristics of the Partitional Clustering Algorithms.

Name | Type of data | Complexity* | Geometry | Outliers, noise | Input parameters | Results | Clustering criterion
K-Means | Numerical | O(n) | non-convex shapes | No | Number of clusters | Center of clusters | min_{v1,v2,…,vk} (E_k)
K-mode | Categorical | O(n) | non-convex shapes | No | Number of clusters | Modes of clusters | min_{Q1,Q2,…,Qk} (E_k), where d(X_l, Q_i) is the distance between categorical object X_l and mode Q_i
PAM | Numerical | O(k(n−k)²) | non-convex shapes | No | Number of clusters | Medoids of clusters | min(TC_ih), TC_ih = Σ_j C_jih
CLARA | Numerical | O(k(40+k)² + k(n−k)) | non-convex shapes | No | Number of clusters | Medoids of clusters | min(TC_ih), TC_ih = Σ_j C_jih (C_jih = the cost of replacing center i with h as far as O_j is concerned)
CLARANS | Numerical | O(kn²) | non-convex shapes | No | Number of clusters, maximum number of neighbors examined | Medoids of clusters | min(TC_ih), TC_ih = Σ_j C_jih
FCM (Fuzzy C-Means) | Numerical | O(n) | non-convex shapes | No | Number of clusters | Center of clusters, beliefs | min_{U,v1,v2,…,vk} (J_m(U, V))

*n is the number of points in the data set and k the number of clusters defined.

The clustering criteria referenced above are:

E_k = Σ_{i=1}^{k} Σ_{l=1}^{n} d(X_l, Q_i)    (K-mode)

E_k = Σ_{i=1}^{k} Σ_{l=1}^{n} d²(x_l, v_i)    (K-Means)

J_m(U, V) = Σ_{i=1}^{k} Σ_{j=1}^{n} u_{ij}^m d²(x_j, v_i)    (FCM)
Table 2. The main characteristics of the Hierarchical Clustering Algorithms.

Name | Type of data | Complexity* | Geometry | Outliers | Input parameters | Results | Clustering criterion
BIRCH | Numerical | O(n) | non-convex shapes | Yes | Radius of clusters, branching factor | CF = (number of points in the cluster N, linear sum of the points in the cluster LS, square sum of the N data points SS) | A point is assigned to the closest node (cluster) according to a chosen distance metric. Also, the definition of the clusters is based on the requirement that the number of points in each cluster must satisfy a threshold.
CURE | Numerical | O(n² log n), O(n) | Arbitrary shapes | Yes | Number of clusters, number of cluster representatives | Assignment of data values to clusters | The clusters with the closest pair of representatives (well-scattered points) are merged at each step.
ROCK | Categorical | O(n² + n·m_m·m_a + n² log n), O(n², n·m_m·m_a), where m_m is the maximum number of neighbors for a point and m_a is the average number of neighbors for a point | Arbitrary shapes | Yes | Number of clusters | Assignment of data values to clusters | max(E_l), where v_i is the center of cluster i and link(p_q, p_r) is the number of common neighbors between p_q and p_r

*n is the number of points in the data set under consideration.

The ROCK clustering criterion is:

E_l = Σ_{i=1}^{k} n_i · Σ_{p_q, p_r ∈ V_i} link(p_q, p_r) / n_i^{1+2f(θ)}
Table 3. The main characteristics of the Density-based Clustering Algorithms.

Name | Type of data | Complexity* | Geometry | Outliers, noise | Input parameters | Results | Clustering criterion
DBSCAN | Numerical | O(n log n) | Arbitrary shapes | Yes | Cluster radius, minimum number of objects | Assignment of data values to clusters | Merge points that are density-reachable into one cluster.
DENCLUE | Numerical | O(log n) | Arbitrary shapes | Yes | Cluster radius σ, minimum number of objects ξ | Assignment of data values to clusters | x* is the density attractor for a point x; if f_Gauss^D(x*) > ξ then x is attached to the cluster belonging to x*.

*n is the number of points in the data set under consideration.

The DENCLUE density function is:

f_Gauss^D(x*) = Σ_{x_i ∈ near(x*)} e^{−d(x*, x_i)² / (2σ²)}

Table 4. The main characteristics of the Grid-based Clustering Algorithms.

Name | Type of data | Complexity* | Geometry | Outliers | Input parameters | Output | Clustering criterion
Wave-Cluster | Spatial data | O(n) | Arbitrary shapes | Yes | Wavelets, the number of grid cells for each dimension, the number of applications of the wavelet transform | Clustered objects | Decompose the feature space applying a wavelet transformation. The average sub-band gives the clusters; the detail sub-bands give the cluster boundaries.
STING | Spatial data | O(K), where K is the number of grid cells at the lowest level | Arbitrary shapes | Yes | Number of objects in a cell | Clustered objects | Divide the spatial area into rectangular cells and employ a hierarchical structure. Each cell at a high level is partitioned into a number of smaller cells at the next lower level.

*n is the number of points in the data set under consideration.
The last category of our study (see Table 4) refers to grid-based algorithms. The basic concept of these algorithms is that they define a grid for the data space and then perform all the operations on the quantised space. In general terms these approaches are very efficient for large databases and are capable of finding arbitrarily shaped clusters and handling outliers. STING is one of the well-known grid-based algorithms. It divides the spatial area into rectangular cells while storing the statistical parameters of the numerical features of the objects within the cells. The grid structure facilitates parallel processing and incremental updating. Since STING goes through the database once to compute the statistical parameters of the cells, it is generally an efficient method for generating clusters; its time complexity is O(n). However, STING uses a multiresolution approach to perform cluster analysis, and thus the quality of its clustering results depends on the granularity of the lowest level of the grid. Moreover, STING does not consider the spatial relationship between the children and their neighbouring cells when constructing a parent cell. The result is that all cluster boundaries are either horizontal or vertical, and thus the quality of the clusters is questionable [25]. On the other hand, WaveCluster efficiently detects arbitrarily shaped clusters at different scales by exploiting well-known signal processing techniques. It does not require the specification of input parameters (e.g. the number of clusters or a neighbourhood radius), though an a-priori estimation of the expected number of clusters helps in selecting the correct resolution of clusters. In experimental studies, WaveCluster was found to outperform BIRCH, CLARANS and DBSCAN in terms of efficiency and clustering quality. However, the same study shows that it is not efficient in high-dimensional spaces [12].
4. Clustering results validity assessment
4.1 Problem Specification
The objective of clustering methods is to discover significant groups present in a data set. In general, they should search for clusters whose members are close to each other (in other words, have a high degree of similarity) and well separated. A problem we face in clustering is deciding the optimal number of clusters that fits a data set.
In most algorithms' experimental evaluations, 2D data sets are used so that the reader is able to visually verify the validity of the results (i.e., how well the clustering algorithm discovered the clusters of the data set). It is clear that visualization of the data set is a crucial verification of the clustering results. In the case of large multidimensional data sets (e.g. more than three dimensions), effective visualization of the data set would be difficult. Moreover, the perception of clusters using available visualization tools is a difficult task for humans who are not accustomed to higher-dimensional spaces.
The various clustering algorithms behave differently depending on: i) the features of the data set (geometry and density distribution of clusters), and ii) the input parameter values.
For instance, consider the data set in Figure 1a. It is obvious that we can discover three clusters in the given data set. However, if we consider a clustering algorithm (e.g. K-Means) with certain parameter values (in the case of K-Means, the number of clusters) so as to partition the data set into four clusters, the result of the clustering process would be the clustering scheme presented in Figure 1b. In our example the clustering algorithm (K-Means) found the best four clusters into which our data set could be partitioned. However, this is not the optimal partitioning for the considered data set. We define here the term "optimal" clustering scheme as the outcome of running a clustering algorithm (i.e., a partitioning) that best fits the inherent partitions of the data set. It is obvious from Figure 1b that the depicted scheme is not the best for our data set, i.e., the clustering scheme presented in Figure 1b does not fit the data set well. The optimal clustering for our data set is a scheme with three clusters.
As a consequence, if the clustering algorithm parameters are assigned improper values, the clustering method may result in a partitioning scheme that is not optimal for the specific data set, leading to wrong decisions. The problems of deciding the number of clusters that best fits a data set, as well as the evaluation of the clustering results, have been the subject of several research efforts [3, 9, 24, 27, 28, 31].
Figure 1. (a) A data set that consists of three clusters, (b) the results from the application of K-means when we ask for four clusters.
In the sequel, we discuss the fundamental concepts of clustering validity and we present the most important criteria in the context of clustering validity assessment.
4.2 Validity Indices
In this section, we discuss methods suitable for the quantitative evaluation of clustering results, known as cluster validity methods. However, we have to mention that these methods give an indication of the quality of the resulting partitioning, and thus they can only be considered as a tool at the disposal of experts in order to evaluate the clustering results. In the sequel, we describe the fundamental criteria for each of the above described cluster validity approaches, as well as their representative indices.
How Monte Carlo is used in Cluster Validity. The goal of using Monte Carlo techniques is the computation of the probability density function of the defined statistic indices. First, we generate a large number of synthetic data sets. For each of these synthetic data sets, called Xi, we compute the value of the defined index, denoted qi. Then, based on the respective values of qi for each of the data sets Xi, we create a scatter-plot. This scatter-plot is an approximation of the probability density function of the index. In Figure 2 we see the three possible cases for the shape of the probability density function of an index q. There are three different possible shapes
depending on the critical interval D_ρ, corresponding to the significance level ρ (a statistical constant). As we can see, the probability density function of a statistic index q, under H0, has a single maximum and the D_ρ region is either a half line or a union of two half lines. Assuming that this shape is right-tailed (Figure 2b) and that we have generated the scatter-plot using r values of the index q, called qi, in order to accept or reject the Null Hypothesis H0 we examine the following conditions [28]:
• We reject (accept) H0 if q's value for our data set is greater (smaller) than (1−ρ)·r of the qi values of the respective synthetic data sets Xi.
• Assuming that the shape is left-tailed (Figure 2c), we reject (accept) H0 if q's value for our data set is smaller (greater) than (1−ρ)·r of the qi values.
• Assuming that the shape is two-tailed (Figure 2a), we accept H0 if q is greater than (ρ/2)·r of the qi values and smaller than (1−ρ/2)·r of the qi values.
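The three acceptance conditions above fit in a few lines of code. This is a minimal sketch under the stated conventions; the function name and the strictness of the comparisons are our own choices:

```python
def reject_h0(q, q_samples, rho=0.05, tail="right"):
    """Decision rule for the Monte Carlo test sketched above.

    q: value of the index on the real data set.
    q_samples: the r values q_i computed on synthetic data sets
    generated under the null hypothesis H0 of random structure.
    """
    r = len(q_samples)
    below = sum(qi < q for qi in q_samples)   # how many q_i lie below q
    if tail == "right":
        # reject H0 if q exceeds (1 - rho) * r of the q_i values
        return below > (1 - rho) * r
    if tail == "left":
        # reject H0 if q falls below (1 - rho) * r of the q_i values
        return (r - below) > (1 - rho) * r
    # two-tailed: accept H0 only if q sits inside the central region
    return not (rho / 2 * r < below < (1 - rho / 2) * r)
```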
External Criteria. Based on the external criteria we can work in two different ways. Firstly, we can evaluate the resulting clustering structure C by comparing it to an independent partition of the data P, built according to our intuition about the clustering structure of the data set. Secondly, we can compare the proximity matrix P to the partition P.
a) Comparison of C with partition P (not for hierarchies of clusterings). Consider C = {C1 … Cm} a clustering structure of a data set X and P = {P1 … Ps} a defined partition of the data, where m ≠ s. We refer to a pair of points (xv, xu) from the data set using the following terms:
 SS: if both points belong to the same cluster of the clustering structure C and to the same group of partition P.
 SD: if both points belong to the same cluster of C and to different groups of P.
 DS: if both points belong to different clusters of C and to the same group of P.
 DD: if both points belong to different clusters of C and to different groups of P.
Assuming now that a, b, c and d are the numbers of SS, SD, DS and DD pairs respectively, then a + b + c + d = M, which is the maximum number of all pairs in the data set (meaning, M = N(N−1)/2, where N is the total number of points in the data set).
Now we can define the following indices to measure the degree of similarity between C and P:
Figure 2. Confidence interval for (a) a two-tailed index, (b) a right-tailed index, (c) a left-tailed index, where q_ρ is the ρ proportion of q under hypothesis H0. [28]
• Rand Statistic: R = (a + d) / M,
• Jaccard Coefficient: J = a / (a + b + c),
The above two indices take values between 0 and 1, and are maximized when m = s. Another index is the:
• Folkes and Mallows index:
FM = √(m1·m2) = √( (a/(a+b)) · (a/(a+c)) )    (2)
where m1 = a/(a + b) and m2 = a/(a + c). For the previous three indices it has been proven that high values indicate great similarity between C and P: the higher the values of these indices, the more similar C and P are. Other indices are:
• Hubert's Γ statistic:
Γ = (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} X(i, j)·Y(i, j)    (3)
High values of this index indicate a strong similarity between X and Y.
• Normalized Γ statistic:
Γ̂ = [ (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (X(i, j) − µ_X)(Y(i, j) − µ_Y) ] / (σ_X·σ_Y)    (4)
where µ_X, µ_Y, σ_X, σ_Y are the respective means and variances of the X and Y matrices. This index takes values between −1 and 1.
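A small sketch may help make the pair-based indices concrete. Assuming C and P are given as label lists (our own encoding, not prescribed by the text), the counts a, b, c, d and the Rand, Jaccard and Folkes-Mallows indices can be computed as:

```python
from itertools import combinations
from math import sqrt

def pair_counts(C, P):
    """Count the SS (a), SD (b), DS (c) and DD (d) pairs between a
    clustering C and a reference partition P, given as label lists."""
    a = b = c = d = 0
    for i, j in combinations(range(len(C)), 2):
        same_c, same_p = C[i] == C[j], P[i] == P[j]
        if same_c and same_p:
            a += 1
        elif same_c:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1
    return a, b, c, d

def external_indices(C, P):
    """Rand, Jaccard and Folkes-Mallows indices from the pair counts."""
    a, b, c, d = pair_counts(C, P)
    M = a + b + c + d                      # M = N(N-1)/2
    rand = (a + d) / M
    jacc = a / (a + b + c)
    fm = sqrt((a / (a + b)) * (a / (a + c)))
    return rand, jacc, fm
```

Note the sketch assumes at least one same-cluster pair exists on each side, so the denominators are nonzero.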
All these statistics have right-tailed probability density functions under the random hypothesis. In order to use these indices in statistical tests we must know their respective probability density function under the Null Hypothesis H0, which is the hypothesis of random structure of our data set. This means that, using statistical tests, if we accept the Null Hypothesis then our data are randomly distributed. However, the computation of the probability density function of these indices is difficult. A solution to this problem is to use Monte Carlo techniques. The procedure is as follows:
For i = 1 to r
o Generate a data set Xi with N vectors (points) in the area of X, which means that the generated vectors have the same dimension as those of the data set X.
o Assign each vector y_{j,i} of Xi to the group that x_j ∈ X belongs to, according to the partition P.
o Run the same clustering algorithm used to produce structure C for each Xi, and let Ci be the resulting clustering structure.
o Compute q(Ci), the value of the defined index q for P and Ci.
End For
Create a scatter-plot of the r validity index values q(Ci) (computed inside the for loop).
After having plotted the approximation of the probability density function of the defined statistic index, we compare its value, let q, to the q(Ci) values, let qi. The indices R, J, FM, Γ defined previously are used as the q index mentioned in the above procedure.
Example: Assume a given data set, X, containing 100 three-dimensional vectors (points). The points of X form four clusters of 25 points each. Each cluster is generated by a normal distribution. The covariance matrices of these distributions are all equal to 0.2I, where I is the 3×3 identity matrix. The mean vectors for the four distributions are [0.2, 0.2, 0.2]T, [0.5, 0.2, 0.8]T, [0.5, 0.8, 0.2]T, and [0.8, 0.8, 0.8]T. We independently group data set X in four groups according to the partition P, for which the first 25 vectors (points) belong to the first group P1, the next 25 belong to the second group P2, the next 25 belong to the third group P3 and the last 25 vectors belong to the fourth group P4. We run the k-means clustering algorithm for k = 4 clusters and we assume that C is the resulting clustering structure. We compute the values of the indices for the clustering C and the partition P, and we get R = 0.91, J = 0.68, FM = 0.81 and Γ = 0.75. Then we follow the steps described above in order to define the probability density function of these four statistics. We generate 100 data sets Xi, i = 1, …, 100, each consisting of 100 random vectors (in 3 dimensions) drawn from the uniform distribution. According to the partition P defined earlier, for each Xi we assign the first 25 of its vectors to P1 and the second, third and fourth groups of 25 vectors to P2, P3 and P4 respectively. Then we run k-means once for each Xi, so as to define the respective clustering structures of the data sets, denoted Ci. For each of them we compute the values of the indices Ri, Ji, FMi, Γi, i = 1, …, 100. We set the significance level ρ = 0.05 and we compare these values to the R, J, FM and Γ values corresponding to X.
We accept or reject the null hypothesis depending on whether (1−ρ)·r = (1 − 0.05)·100 = 95 values of Ri, Ji, FMi, Γi are greater or smaller than the corresponding values of R, J, FM, Γ. In our case the Ri, Ji, FMi, Γi values are all smaller than the corresponding values of R, J, FM, and Γ, which leads us to the conclusion that the null hypothesis H0 is rejected. This is what we expected, because of the predefined clustering structure of data set X.
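The baseline construction of this example can be sketched generically. The function name and the choice to pass the clustering algorithm and the index as callables are our own; any external index and any clustering routine can be plugged in:

```python
import random

def monte_carlo_index_values(q_index, cluster_fn, P, r=100, dim=3, n=100):
    """Generate r uniform data sets, cluster each with the same
    algorithm used on the real data, and collect the index values
    q(C_i) computed against the fixed a-priori partition P.  The
    returned list approximates the pdf of the index under H0."""
    qs = []
    for _ in range(r):
        Xi = [[random.random() for _ in range(dim)] for _ in range(n)]
        Ci = cluster_fn(Xi)           # clustering of the synthetic set
        qs.append(q_index(Ci, P))     # external index of C_i vs. P
    return qs
```

Plotting the returned values (or comparing the real-data index against them, as in the decision rule above) completes the test.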
b) Comparison of P (proximity matrix) with partition P. Partition P can be considered as a mapping g: X → {1, …, nc}. Assuming the matrix Y: Y(i, j) = 1 if g(xi) ≠ g(xj), and 0 otherwise, i, j = 1, …, N, we can compute the Γ (or normalized Γ) statistic using the proximity matrix P and the matrix Y. Based on the index value, we may have an indication of the similarity of the two matrices.
To proceed with the evaluation procedure we use the Monte Carlo techniques mentioned above. In the "Generate" step of the procedure we generate the corresponding mapping gi for every generated Xi data set. Then, in the "Compute" step, we compute the matrix Yi for each Xi in order to find the corresponding Γi statistic index.
4.2.1 Internal Criteria. Using this approach of cluster validity, our goal is to evaluate the clustering result of an algorithm using only quantities and features inherent in the data set. There are two cases in which we apply internal criteria of cluster validity, depending on the clustering structure: a) a hierarchy of clustering schemes, and b) a single clustering scheme.
a) Validating a hierarchy of clustering schemes. A matrix called the cophenetic matrix, Pc, can represent the hierarchy diagram produced by a hierarchical algorithm. We may define a statistical index to measure the degree of similarity between the Pc and P (proximity) matrices. This index is called the Cophenetic Correlation Coefficient and is defined as:
CPCC = [ (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} d_ij·c_ij − µ_P·µ_C ] / √( [ (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} d_ij² − µ_P² ] · [ (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} c_ij² − µ_C² ] ),  −1 ≤ CPCC ≤ 1    (5)
where M = N·(N−1)/2 and N is the number of points in the data set. Also, µ_P and µ_C are the means of matrices P and Pc respectively, and are given by equation (6):
µ_P = (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} P(i, j),   µ_C = (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} Pc(i, j)    (6)
Moreover, d_ij and c_ij are the (i, j) elements of the P and Pc matrices respectively. A value of the index close to 1 is an indication of a significant similarity between the two matrices. The Monte Carlo procedure described above is also used in this case of validation.
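Equation (5) is a Pearson-style correlation between the entries of P and Pc, which makes it straightforward to compute once both matrices are flattened to their M upper-triangular entries. A minimal sketch (the flattening convention and function name are our own):

```python
from math import sqrt

def cpcc(P, Pc):
    """Cophenetic Correlation Coefficient of equation (5).
    P and Pc are the upper-triangular entries d_ij and c_ij of the
    proximity and cophenetic matrices, flattened into two equal-length
    lists of M values each."""
    M = len(P)
    mu_p = sum(P) / M
    mu_c = sum(Pc) / M
    cov = sum(d * c for d, c in zip(P, Pc)) / M - mu_p * mu_c
    var_p = sum(d * d for d in P) / M - mu_p ** 2
    var_c = sum(c * c for c in Pc) / M - mu_c ** 2
    return cov / sqrt(var_p * var_c)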
b) Validating a single clustering scheme. The goal here is to find the degree of agreement between a given clustering scheme C, consisting of nc clusters, and the proximity matrix P. The defined index for this approach is Hubert's Γ statistic (or the normalized Γ statistic). An additional matrix is used for the computation of the index, namely Y(i, j) = 1 if xi and xj belong to different clusters, and 0 otherwise, i, j = 1, …, N.
Here, too, the application of Monte Carlo techniques is the way to test the random hypothesis in a given data set.
4.2.2 Relative Criteria. The basis of the above described validation methods is statistical testing. Thus, the major drawback of techniques based on internal or external criteria is their high computational cost. A different validation approach is discussed in this section. It is based on relative criteria and does not involve statistical tests. The fundamental idea of this approach is to choose the best clustering scheme from a set of defined schemes according to a pre-specified criterion. More specifically, the problem can be stated as follows:
"Let P be the set of parameters associated with a specific clustering algorithm (e.g. the number of clusters nc). Among the clustering schemes Ci, i = 1, …, nc, defined by a specific algorithm for different values of the parameters in P, choose the one that best fits the data set."
Then, we can consider the following cases of the problem:
I) P does not contain the number of clusters, nc, as a parameter. In this case, the choice of the optimal parameter values is made as follows: we run the algorithm for a wide range of its parameters' values and we choose the largest range for which nc remains constant (usually nc << N, the number of tuples). Then we choose as the appropriate values of the P parameters the values that correspond to the middle of this range. Also, this procedure identifies the number of clusters that underlie our data set.
II) P contains nc as a parameter. The procedure for identifying the best clustering scheme is based on a validity index. Selecting a suitable performance index q, we proceed with the following steps:
• We run the clustering algorithm for all values of nc between a minimum nc_min and a maximum nc_max, where the minimum and maximum values have been defined a-priori by the user.
• For each value of nc, we run the algorithm r times, using a different set of values for the other parameters of the algorithm (e.g. different initial conditions).
• We plot the best values of the index q obtained for each nc as a function of nc.
Based on this plot we may identify the best clustering scheme. We have to stress that there are two approaches for defining the best clustering, depending on the behaviour of q with respect to nc. If the validity index does not exhibit an increasing or decreasing trend as nc increases, we seek the maximum (minimum) of the plot. On the other hand, for indices that increase (or decrease) as the number of clusters increases, we search for the values of nc at which a significant local change in the value of the index occurs. This change appears as a "knee" in the plot, and it is an indication of the number of clusters underlying the data set. Moreover, the absence of a knee may be an indication that the data set possesses no clustering structure.
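The two selection rules above (extremum vs. knee) can be sketched as a small search routine. The names and the interface, mapping each candidate nc to an already-computed clustering, are our own illustrative choices:

```python
def best_nc(clusterings, index_fn, monotone=False):
    """Relative-criteria search sketch.  clusterings maps each candidate
    nc to a clustering scheme; index_fn scores a scheme.  For an index
    with no monotone trend we take the extreme value; for a monotone
    index we look for the 'knee', i.e. the largest jump between
    consecutive nc values."""
    scores = {nc: index_fn(c) for nc, c in clusterings.items()}
    ncs = sorted(scores)
    if not monotone:
        return max(scores, key=scores.get)   # e.g. Dunn: seek the maximum
    jumps = {ncs[i]: abs(scores[ncs[i]] - scores[ncs[i - 1]])
             for i in range(1, len(ncs))}
    return max(jumps, key=jumps.get)
```

For a minimizing index such as DB, the same routine can be reused by negating the index inside index_fn.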
In the sequel, some representative validity indices for crisp and fuzzy clustering are presented.
Crisp clustering. This section discusses validity indices suitable for crisp clustering.
a) The modified Hubert Γ statistic. The definition of the modified Hubert Γ statistic is given by the equation

Γ = (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} P(i, j)·Q(i, j)    (7)
where M = N(N−1)/2, P is the proximity matrix of the data set and Q is an N×N matrix whose (i, j) element is equal to the distance between the representative points (v_ci, v_cj) of the clusters to which the objects xi and xj belong.
Similarly, we can define the normalized Hubert Γ statistic (given by equation (4)). If d(v_ci, v_cj) is close to d(xi, xj) for i, j = 1, 2, …, N, then P and Q will be in close agreement and the values of Γ and Γ̂ (normalized Γ) will be high. Thus, a high value of Γ (Γ̂) indicates the existence of compact clusters. In the plot of normalized Γ versus nc, we therefore seek a significant knee that corresponds to a significant increase of normalized Γ. The number of clusters at which the knee occurs is an indication of the number of clusters underlying the data. We note that for nc = 1 and nc = N the index is not defined.
b) Dunn and Dunn-like indices. A cluster validity index for crisp clustering proposed in [5] attempts to identify "compact and well separated clusters". The index is defined by equation (8) for a specific number of clusters:

D_nc = min_{i=1,…,nc} { min_{j=i+1,…,nc} [ d(c_i, c_j) / max_{k=1,…,nc} diam(c_k) ] }    (8)
where d(c_i, c_j) is the dissimilarity function between two clusters c_i and c_j, defined as

d(c_i, c_j) = min_{x∈c_i, y∈c_j} d(x, y)    (9)
and diam(c) is the diameter of a cluster, which may be considered a measure of the dispersion of the cluster. The diameter of a cluster C can be defined as follows:

diam(C) = max_{x,y∈C} d(x, y)    (10)
It is clear that if the data set contains compact and well-separated clusters, the distance between the clusters is expected to be large and the diameter of the clusters is expected to be small. Thus, based on the definition of Dunn's index, we may conclude that large values of the index indicate the presence of compact and well-separated clusters.
The index D_nc does not exhibit any trend with respect to the number of clusters. Thus, the maximum in the plot of D_nc versus the number of clusters can be an indication of the number of clusters that fits the data.
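Equations (8)-(10) translate directly into code. The sketch below is our own brute-force version (quadratic in the cluster sizes, in line with the computational cost noted next), using single-linkage distances for d(c_i, c_j) and the maximum pairwise distance for diam(c):

```python
import numpy as np

def dunn(X, labels):
    """Dunn index (equations (8)-(10)): minimum single-linkage
    separation between clusters divided by the largest cluster
    diameter.  Larger values suggest compact, well-separated clusters."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    clusters = [X[labels == k] for k in np.unique(labels)]
    # diam(c): maximum pairwise distance inside a cluster
    diam = max(max(np.linalg.norm(x - y) for x in c for y in c)
               for c in clusters)
    # d(c_i, c_j): minimum distance between points of different clusters
    sep = min(np.linalg.norm(x - y)
              for i, ci in enumerate(clusters)
              for cj in clusters[i + 1:]
              for x in ci for y in cj)
    return sep / diam
```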
The main drawbacks of the Dunn index are: i) the considerable amount of time required for its computation, and ii) its sensitivity to the presence of noise in data sets, since noise is likely to increase the values of diam(c) (i.e., the denominator of equation (8)).
Three indices that are more robust to the presence of noise are proposed in [23]. They are widely known as Dunn-like indices, since they are based on the Dunn index. The three indices use for their definitions the concepts of the minimum spanning tree (MST), the relative neighbourhood graph (RNG) and the Gabriel graph (GG), respectively [28].
Consider the index based on the MST. Let c_i be a cluster and G_i the complete graph whose vertices correspond to the vectors of c_i. The weight, w_e, of an edge e of this graph equals the distance between its two end points, x and y. Let E_i^MST be the set of edges of the MST of the graph G_i and e_i^MST the edge in E_i^MST with the maximum weight. Then the diameter of c_i is defined as the weight of e_i^MST, and the Dunn-like index based on the concept of the MST is given by the equation
D_nc^MST = min_{i=1,…,nc} { min_{j=i+1,…,nc} [ d(c_i, c_j) / max_{k=1,…,nc} diam^MST(c_k) ] }    (11)
The number of clusters at which D_nc^MST takes its maximum value indicates the number of clusters in the underlying data. Based on similar arguments we may define the Dunn-like indices for the GG and RNG graphs.
c) The Davies-Bouldin (DB) index. A similarity measure R_ij between the clusters C_i and C_j is defined based on a measure s_i of the dispersion of a cluster C_i and a dissimilarity measure d_ij between two clusters. The R_ij index is defined so as to satisfy the following conditions [4]:
1. R_ij ≥ 0
2. R_ij = R_ji
3. If s_i = 0 and s_j = 0 then R_ij = 0
4. If s_j > s_k and d_ij = d_ik then R_ij > R_ik
5. If s_j = s_k and d_ij < d_ik then R_ij > R_ik.
These conditions state that R_ij is nonnegative and symmetric.
A simple choice for R_ij that satisfies the above conditions is [4]:

R_ij = (s_i + s_j)/d_ij    (12)

Then the DB index is defined as

DB_nc = (1/nc) Σ_{i=1}^{nc} R_i,   R_i = max_{j=1,…,nc, j≠i} R_ij,   i = 1, …, nc    (13)
It is clear from the above definition that DB_nc is the average similarity between each cluster c_i, i = 1, …, nc, and its most similar one. It is desirable for the clusters to have
the minimum possible similarity to each other; therefore we seek clusterings that minimize DB. The DB_nc index exhibits no trend with respect to the number of clusters, and thus we seek the minimum value of DB_nc in its plot versus the number of clusters.
Some alternative definitions of the dissimilarity between two clusters, as well as of the dispersion of a cluster c_i, are given in [4].
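With the choice R_ij = (s_i + s_j)/d_ij of equation (12), the DB index of equation (13) can be sketched as follows. Taking s_i as the mean distance of cluster members to the centroid and d_ij as the centroid distance is one common dispersion/dissimilarity choice in the spirit of [4], not the only one:

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB index (equations (12)-(13)) with s_i = mean distance to the
    centroid and d_ij = distance between centroids.  Lower values
    indicate better-separated, more compact clusters."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    s = np.array([np.mean(np.linalg.norm(X[labels == k] - c, axis=1))
                  for k, c in zip(ks, cents)])
    nc = len(ks)
    R = np.zeros(nc)
    for i in range(nc):
        # R_i = similarity to the most similar other cluster
        R[i] = max((s[i] + s[j]) / np.linalg.norm(cents[i] - cents[j])
                   for j in range(nc) if j != i)
    return R.mean()
```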
Moreover, in [23] three variants of the DB_nc index are proposed. They are based on the MST, RNG and GG concepts, similarly to the cases of the Dunn-like indices.
Other validity indices for crisp clustering have been proposed in [3] and [21]. The implementation of most of these indices is computationally very expensive, especially when the number of clusters and the number of objects in the data set grow very large [31]. In [19], an evaluation study of thirty validity indices proposed in the literature is presented. It is based on small data sets (about 50 points each) with well-separated clusters. The results of this study [19] place Calinski and Harabasz (1974), Je(2)/Je(1) (1984), the C-index (1976), Gamma and Beale among the six best indices. However, it is noted that although the results concerning these methods are encouraging, they are likely to be data dependent. Thus, the behaviour of the indices may change if different data structures are used [19]. Also, some indices are based on only a sample of the clustering results. A representative example is Je(2)/Je(1), which is computed based only on the information provided by the items involved in the last cluster merge.
d) RMSSTD, SPR, RS, CD. At this point we give the definitions of four validity indices, which have to be used simultaneously to determine the number of clusters existing in the data set. These four indices can be applied to each step of a hierarchical clustering algorithm, and they are known as [26]:
 Root-mean-square standard deviation (RMSSTD) of the new cluster
 Semi-partial R-squared (SPR)
 R-squared (RS)
 Distance between two clusters (CD).
Getting into a more detailed description of them, we can say that:
RMSSTD of a new clustering scheme defined at a level of the clustering hierarchy is the square root of the pooled sample variance of all the variables (attributes used in the clustering process). This index measures the homogeneity of the formed clusters at each step of the hierarchical algorithm. Since the objective of cluster analysis is to form homogeneous groups, the RMSSTD of a cluster should be as small as possible. If the values of RMSSTD are higher at one step than at the previous one, we have an indication that the new clustering scheme is not homogeneous.
In the following definitions we shall use the notation SS, meaning Sum of Squares, which refers to the equation:

SS = Σ_{i=1}^{n} (X_i − X̄)²

Along with this we shall use some additional notation:
i) SSw, referring to the within-group sum of squares;
ii) SSb, referring to the between-groups sum of squares;
iii) SSt, referring to the total sum of squares of the whole data set.
SPR of the new cluster is the difference between the pooled SSw of the new cluster and the sum of the pooled SSw values of the clusters joined to obtain the new cluster (loss of homogeneity), divided by the pooled SSt for the whole data set. This index measures the loss of homogeneity after merging the two clusters in a single algorithm step. If the index value is zero, the new cluster is obtained by merging two perfectly homogeneous clusters. If its value is high, the new cluster is obtained by merging two heterogeneous clusters.
RS of the new cluster is the ratio of SSb to SSt. SSb is a measure of the difference between groups. Since SSt = SSb + SSw, the greater the SSb the smaller the SSw, and vice versa. As a result, the greater the differences between groups, the more homogeneous each group is, and vice versa. Thus, RS may be considered a measure of the degree of difference between clusters as well as of the degree of homogeneity within groups. The values of RS range between 0 and 1. A value of zero (0) indicates that no difference exists among groups, while a value of 1 indicates a significant difference among groups.
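Since SSt = SSb + SSw, RS can be computed by measuring only SSt and SSw. A minimal sketch over a list-of-rows data set (our own encoding and function name):

```python
def rs_index(X, labels):
    """RS = SSb / SSt = (SSt - SSw) / SSt over all variables, where SSt
    uses the overall per-variable means and SSw the per-cluster means.
    Returns a value in [0, 1]."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    sst = sum((row[j] - mean[j]) ** 2 for row in X for j in range(d))
    ssw = 0.0
    for k in set(labels):
        members = [row for row, l in zip(X, labels) if l == k]
        m = [sum(r[j] for r in members) / len(members) for j in range(d)]
        ssw += sum((r[j] - m[j]) ** 2 for r in members for j in range(d))
    return (sst - ssw) / sst   # = SSb / SSt
```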
The CD index measures the distance between the two clusters that are merged in a given step. This distance depends on the selected representatives for the hierarchical clustering we perform. For instance, in the case of Centroid hierarchical clustering the representatives of the formed clusters are the centers of each cluster, so CD is the distance between the centers of the clusters. If we use single linkage, CD measures the minimum Euclidean distance between all possible pairs of points; in the case of complete linkage, CD is the maximum Euclidean distance between all pairs of data points, and so on.
Using these four indices we determine the number of clusters that exist in our data set, plotting a graph of the values of all these indices for a number of different stages of the clustering algorithm. In this graph we search for the steepest knee, or, in other words, the greatest jump of these indices' values from a higher to a smaller number of clusters.
In the case of nonhierarchical clustering (e.g., K-Means) we may also use some of these indices in order to evaluate the resulting clustering. The indices that are more meaningful to use in this case are RMSSTD and RS. The idea here is to run the algorithm a number of times, for a different number of clusters each time. Then we plot the respective graphs of the validity indices for these clusterings and, as the previous example shows, we search for the significant "knee" in these graphs. The number of clusters at which the "knee" is observed indicates the optimal clustering for our data set. In this case the validity indices described before take the following form:
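To make these definitions concrete, RMSSTD and RS can be computed directly from a (crisp) partition. The following is a minimal pure-Python sketch of equations (14) and (15); the function and variable names are ours, not from the paper, and a Euclidean data representation is assumed:

```python
import math

def _ss(values):
    # sum of squared deviations about the mean
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def rmsstd(points, labels):
    """Pooled standard deviation over all clusters and dimensions, eq. (14)."""
    d = len(points[0])
    num, den = 0.0, 0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        for j in range(d):
            num += _ss([p[j] for p in members])
            den += len(members) - 1
    return math.sqrt(num / den)

def rs(points, labels):
    """R-squared, (SSt - SSw) / SSt, eq. (15)."""
    d = len(points[0])
    sst = sum(_ss([p[j] for p in points]) for j in range(d))
    ssw = sum(_ss([p[j] for p, l in zip(points, labels) if l == c])
              for c in set(labels) for j in range(d))
    return (sst - ssw) / sst

# Two tight, well-separated clusters: RS close to 1, RMSSTD small.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
lbl = [0, 0, 0, 1, 1, 1]
print(rs(pts, lbl) > 0.99)  # True
```

To reproduce the knee search described above, one would evaluate both functions over partitions obtained for k = 2, 3, … and look for the largest jump between consecutive values.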
RMSSTD = [ Σi=1..nc, j=1..d Σk=1..nij (xk - x̄j)^2 / Σi=1..nc, j=1..d (nij - 1) ]^1/2   (14)
RS = [ Σj=1..d Σk=1..nj (xk - x̄j)^2 - Σi=1..nc, j=1..d Σk=1..nij (xk - x̄j)^2 ] / Σj=1..d Σk=1..nj (xk - x̄j)^2   (15)
where nc is the number of clusters, d is the number of variables (data dimension), nj is the number of data values of dimension j, while nij corresponds to the number of data values of dimension j that belong to cluster i. Also, x̄j is the mean of the data values of dimension j.
e) The SD validity index. A more recent clustering validity approach is proposed in [13]. The SD validity index is defined based on the concepts of average scattering for clusters and total separation between clusters. In the sequel, we give the fundamental definitions for this index.
Average scattering for clusters. The average scattering for clusters is defined as
Scat(nc) = (1/nc) Σi=1..nc ||σ(vi)|| / ||σ(X)||   (16)
Total separation between clusters. The definition of total scattering (separation) between clusters is given by equation (17):
Dis(nc) = (Dmax/Dmin) Σk=1..nc ( Σz=1..nc ||vk - vz|| )^-1   (17)
where Dmax = max(||vi - vj||), ∀ i, j ∈ {1, 2, 3, …, nc}, is the maximum distance between cluster centers and Dmin = min(||vi - vj||), ∀ i, j ∈ {1, 2, …, nc}, is the minimum distance between cluster centers. Now we can define a validity index based on equations (16) and (17) as follows:
SD(nc) = a⋅Scat(nc) + Dis(nc)   (18)
where a is a weighting factor equal to Dis(cmax), and cmax is the maximum number of input clusters.
The first term, Scat(nc), defined by equation (16), indicates the average compactness of the clusters (i.e., intra-cluster distance). A small value for this term indicates compact clusters; as the scattering within clusters increases (i.e., they become less compact), the value of Scat(nc) also increases. The second term, Dis(nc), indicates the total separation between the nc clusters (i.e., an indication of inter-cluster distance). Contrary to the first term, Dis(nc) is influenced by the geometry of the cluster centres and increases with the number of clusters. It is obvious from the previous discussion that the two terms of SD have different ranges, thus a weighting factor is needed in order to incorporate both terms in a balanced way. The number of clusters that minimizes the above index can be considered an optimal value for the number of clusters present in the data set. The influence of the maximum number of clusters cmax, related to the weighting factor, on the selection of the optimal clustering scheme is discussed in [13], where it is shown that SD proposes an optimal number of clusters almost irrespectively of cmax. However, the index cannot properly handle arbitrarily shaped clusters; the same applies to all the aforementioned indices.
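The two terms of SD can be computed directly from definitions (16)-(18). Below is a minimal pure-Python sketch; all function names are ours, Euclidean distance is assumed, and labels are taken as integers 0..nc-1 with every cluster non-empty:

```python
import math

def _var_vec(points):
    # per-dimension variance vector sigma(.) of a set of points
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    return [sum((p[j] - means[j]) ** 2 for p in points) / n for j in range(d)]

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sd_index(points, labels, centers, alpha):
    """SD(nc) = alpha * Scat(nc) + Dis(nc), per eqs. (16)-(18)."""
    nc = len(centers)
    # eq. (16): average scattering for clusters
    scat = sum(_norm(_var_vec([p for p, l in zip(points, labels) if l == c]))
               for c in range(nc)) / (nc * _norm(_var_vec(points)))
    # eq. (17): total separation between clusters
    pair = [_dist(centers[k], centers[z])
            for k in range(nc) for z in range(nc) if k != z]
    dmax, dmin = max(pair), min(pair)
    dis = (dmax / dmin) * sum(
        1.0 / sum(_dist(centers[k], centers[z]) for z in range(nc))
        for k in range(nc))
    return alpha * scat + dis

# A compact 2-cluster partition scores lower than a poor one.
pts = [(0.0, 0.0), (0.2, 0.1), (4.0, 4.0), (4.2, 4.1)]
good = sd_index(pts, [0, 0, 1, 1], [(0.1, 0.05), (4.1, 4.05)], alpha=1.0)
bad = sd_index(pts, [0, 1, 0, 1], [(2.0, 2.0), (2.2, 2.1)], alpha=1.0)
print(good < bad)  # True
```

In practice alpha would be set to Dis(cmax) as the text specifies; alpha = 1.0 here is only an illustrative placeholder.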
Fuzzy Clustering. In this section, we present validity indices suitable for fuzzy clustering. The objective is to seek clustering schemes where most of the vectors of the dataset exhibit a high degree of membership in one cluster. We note here that a fuzzy clustering is defined by a matrix U = [uij], where uij denotes the degree of membership of vector xi in cluster j. Also, a set of cluster representatives has been defined. Similarly to the crisp clustering case, we define a validity index, q, and search for the minimum or maximum in the plot of q versus m. Also, in case q exhibits a trend with respect to the number of clusters, we seek a significant knee of decrease (or increase) in the plot of q.
In the sequel, two categories of fuzzy validity indices are discussed. The first category uses only the membership values, uij, of a fuzzy partition of the data. The second involves both the U matrix and the dataset itself.
a) Validity indices involving only the membership values. Bezdek proposed in [2] the partition coefficient, which is defined as
PC = (1/N) Σi=1..N Σj=1..nc uij^2   (19)
The PC index values range in [1/nc, 1], where nc is the number of clusters. The closer to unity the index, the "crisper" the clustering is. In case all membership values of a fuzzy partition are equal, that is, uij = 1/nc, PC obtains its lowest value. Thus, the closer the value of PC is to 1/nc, the fuzzier the clustering is. Furthermore, a value close to 1/nc indicates that there is no clustering tendency in the considered dataset, or that the clustering algorithm failed to reveal it.
The partition entropy coefficient is another index of this category. It is defined as
PE = -(1/N) Σi=1..N Σj=1..nc uij⋅loga(uij)   (20)
where a is the base of the logarithm. The index is computed for values of nc greater than 1, and its values range in [0, loga nc]. The closer the value of PE to 0, the harder (crisper) the clustering is. As in the previous case, values of the index close to the upper bound (i.e., loga nc) indicate the absence of any clustering structure in the dataset, or the inability of the algorithm to extract it.
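Both coefficients are simple aggregates over the membership matrix, as equations (19) and (20) show. The sketch below (names ours; natural logarithm by default) illustrates the two extreme cases discussed in the text:

```python
import math

def pc(U):
    """Partition coefficient, eq. (19): mean of squared memberships."""
    return sum(u * u for row in U for u in row) / len(U)

def pe(U, a=math.e):
    """Partition entropy, eq. (20); terms with u = 0 contribute nothing."""
    return -sum(u * math.log(u, a) for row in U for u in row if u > 0) / len(U)

# A crisp 2-cluster partition versus a maximally fuzzy one (uij = 1/nc).
crisp = [[1.0, 0.0], [0.0, 1.0]]   # PC = 1, PE = 0: "hard" clustering
fuzzy = [[0.5, 0.5], [0.5, 0.5]]   # PC = 1/nc, PE = ln(nc): no structure
print(pc(crisp), pc(fuzzy))  # 1.0 0.5
```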
The drawbacks of these indices are:
i) their monotonous dependency on the number of clusters. Thus, we seek significant knees of increase (for PC) or decrease (for PE) in the plots of the indices versus the number of clusters;
ii) their sensitivity to the fuzzifier, m. More specifically, as m → 1 the indices give the same values for all values of nc, while as m → ∞ both PC and PE exhibit a significant knee at nc = 2;
iii) the lack of direct connection to the geometry of the data [3], since they do not use the data itself.
b) Indices involving the membership values and the dataset. The Xie-Beni index [31], XB, also called the compactness and separation validity function, is a representative index of this category.
Consider a fuzzy partition of the data set X = {xj; j = 1, …, n}, with vi (i = 1, …, nc) the centers of the clusters and uij the membership of data point j belonging to cluster i.
The fuzzy deviation of xj from cluster i, dij, is defined as the distance between xj and the center of cluster i, weighted by the fuzzy membership of data point j belonging to cluster i:
dij = uij⋅||xj - vi||   (21)
Also, for a cluster i, the summation of the squares of the fuzzy deviations of the data points in X, denoted σi, is called the variation of cluster i. The summation of the variations of all clusters, σ, is called the total variation of the data set.
The quantity π = σi/ni is called the compactness of cluster i. Since ni is the number of points belonging to cluster i, π is the average variation in cluster i.
Also, the separation of the fuzzy partitions is defined as the minimum distance between cluster centres, that is
dmin = min ||vi - vj||
Then the XB index is defined as
XB = π / (N⋅dmin)
where N is the number of points in the data set.
It is clear that small values of XB are expected for compact and well-separated clusters. We note, however, that XB is monotonically decreasing when the number of clusters nc gets very large and close to n. One way to eliminate this decreasing tendency of the index is to determine a starting point, cmax, of the monotonic behaviour and to search for the minimum value of XB in the range [2, cmax]. Moreover, the values of the index XB depend on the fuzzifier value: as m → ∞, XB → ∞.
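A small sketch of the index follows. Note that the commonly used form of Xie-Beni normalizes the total variation σ by N⋅dmin² (squared minimum center distance); the code below follows that common form under a Euclidean distance assumption, and all names are ours:

```python
import math

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def xie_beni(points, centers, U):
    """Xie-Beni index: total variation sigma over N * dmin^2.

    U[j][i] is the membership of point j in cluster i; the fuzzy
    deviation d_ij = u_ij * ||x_j - v_i|| follows eq. (21).
    """
    n = len(points)
    sigma = sum((U[j][i] * _dist(points[j], centers[i])) ** 2
                for j in range(n) for i in range(len(centers)))
    dmin = min(_dist(a, b) for k, a in enumerate(centers)
               for z, b in enumerate(centers) if k < z)
    return sigma / (n * dmin ** 2)

# Compact, well-separated clusters give a small XB value.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
centers = [(0.05, 0.0), (5.05, 0.0)]
U = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(round(xie_beni(points, centers, U), 6))  # 0.0001
```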
Another index of this category is the Fukuyama-Sugeno index, which is defined as
FSm = Σi=1..N Σj=1..nc uij^m ( ||xi - vj||A^2 - ||vj - v̄||A^2 )   (22)
where v̄ is the mean vector of X and A is an l×l positive definite, symmetric matrix. When A = I, the above distances become the squared Euclidean distance. It is clear that for compact and well-separated clusters we expect small values for FSm. The first term in the parenthesis measures the compactness of the clusters and the second one measures the distances of the cluster representatives.
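With A = I, equation (22) reduces to plain squared Euclidean distances. A minimal sketch (names ours; v̄ taken as the mean vector of X, as the text specifies):

```python
import math

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fukuyama_sugeno(points, centers, U, m=2.0):
    """Fukuyama-Sugeno index, eq. (22), with A = I.

    First term inside the parenthesis: compactness of the clusters;
    second term: separation of the representatives from vbar.
    U[i][j] is the membership of point i in cluster j.
    """
    n, nc, d = len(points), len(centers), len(points[0])
    vbar = [sum(p[j] for p in points) / n for j in range(d)]
    return sum(U[i][j] ** m * (_dist(points[i], centers[j]) ** 2
                               - _dist(centers[j], vbar) ** 2)
               for i in range(n) for j in range(nc))

# Compact, well-separated clusters -> strongly negative FS value.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
centers = [(0.05, 0.0), (5.05, 0.0)]
U = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(fukuyama_sugeno(points, centers, U) < 0)  # True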
Other fuzzy validity indices are proposed in [9], based on the concepts of hypervolume and density. Let Σj be the fuzzy covariance matrix of the j-th cluster, defined as
Σj = Σi=1..N uij^m (xi - vj)(xi - vj)^T / Σi=1..N uij^m   (23)
Then the total fuzzy hypervolume is given by the equation
FH = Σj=1..nc Vj   (24)
where Vj = [det(Σj)]^1/2 is the volume of cluster j.
Small values of FH indicate the existence of compact clusters.
The average partition density is also an index of this category. It can be defined as
PA = (1/nc) Σj=1..nc Sj/Vj   (25)
where Xj is the set of data points that are within a pre-specified region around vj, and Sj = Σx∈Xj uij is called the sum of the central members of cluster j.
A different measure is the partition density index, which is defined as
PD = S/FH   (26)
where S = Σj=1..nc Sj.
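The fuzzy covariance of equation (23) and the per-cluster volume used in (24)-(25) are straightforward to compute; for 2-D data the determinant is a closed form. A minimal sketch for a single cluster (names ours; Vj taken as the square root of the determinant of Σj, the common hypervolume definition from [9]):

```python
import math

def fuzzy_cov(points, center, memberships, m=2.0):
    """Fuzzy covariance matrix of one cluster, eq. (23), for 2-D data."""
    num = [[0.0, 0.0], [0.0, 0.0]]
    den = 0.0
    for p, u in zip(points, memberships):
        w = u ** m
        dx, dy = p[0] - center[0], p[1] - center[1]
        num[0][0] += w * dx * dx
        num[0][1] += w * dx * dy
        num[1][0] += w * dy * dx
        num[1][1] += w * dy * dy
        den += w
    return [[e / den for e in row] for row in num]

def hyper_volume(cov):
    # V_j = sqrt(det(Sigma_j)); FH of eq. (24) sums these over clusters
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    return math.sqrt(det)

# Unit-variance cluster around (1, 1): Sigma = I, so V = 1.
pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
cov = fuzzy_cov(pts, (1.0, 1.0), [1.0, 1.0, 1.0, 1.0])
print(hyper_volume(cov))  # 1.0
```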
A few other indices are proposed and discussed in [17,24].
4.3 Other approaches for cluster validity
Another approach for finding the best number of clusters of a data set is proposed in [27]. It introduces a practical clustering algorithm based on Monte Carlo cross-validation. More specifically, the algorithm consists of M
cross-validation runs over M chosen train/test partitions of a data set, D. For each partition u, the EM algorithm is used to fit c clusters to the training data, while c is varied from 1 to cmax. Then the log-likelihood Lc^u(D) is calculated for each model with c clusters. It is defined using the probability density function of the data as
Lc^u(D) = Σi=1..N log fc(xi/Φc)   (27)
where fc is the probability density function for the data and Φc denotes the parameters that have been estimated from the data. This is repeated M times, and the M cross-validated estimates are averaged for each number of clusters. Based on these estimates we may define the posterior probability for each value of the number of clusters nc, p(nc/D). If one of the p(nc/D) values is near 1, there is strong evidence that the particular number of clusters is the best for our data set.
The evaluation approach proposed in [27] is based on density functions considered for the data set. Thus, it relies on concepts related to probabilistic models in order to estimate the number of clusters that better fits a data set, and it does not use concepts directly related to the data (i.e., inter-cluster and intra-cluster distances).
4.4 An experimental study of validity indices
In this section we present a comparative experimental evaluation of the important validity measures, aiming at illustrating their advantages and disadvantages. We consider the known relative validity indices proposed in the literature, such as RS-RMSSTD [26], DB [28] and the recent SD [13]. The definitions of these validity indices can be found in the previous subsections.
RMSSTD and RS have to be taken into account simultaneously in order to find the correct number of clusters. The optimal number of clusters is the one at which a significant local change in the values of RS and RMSSTD occurs. As regards DB, an indication of the optimal clustering scheme is the point at which it takes its minimum value.
For our study, we used four synthetic two-dimensional data sets, further referred to as DataSet1, DataSet2, DataSet3 and DataSet4 (see Figure 3a-d), and a real data set, Real_Data1 (Figure 3e), representing a part of the Greek road network [29].
Table 5 summarizes the results of the validity indices (RS, RMSSTD, DB, SD) for different clustering schemes of the above-mentioned data sets, as resulting from a clustering algorithm. For our study, we use the results of the algorithms K-means and CURE with their input value (number of clusters) ranging between 2 and 8. The indices RS and RMSSTD propose the partitioning of DataSet1 into three clusters, while DB selects six clusters as the best partitioning. On the other hand, SD selects four clusters as the best partitioning for DataSet1, which is the correct number of clusters fitting the data set. Moreover, the index DB selects the correct number of clusters (i.e., seven) as the optimal partitioning for DataSet3, while RS-RMSSTD and SD select the clustering schemes of five and six clusters respectively. Also, all indices propose three clusters as the best partitioning for Real_Data1. In the case of DataSet2, DB and SD select three clusters as the optimal scheme, while RS-RMSSTD selects two clusters (i.e., the correct number of clusters fitting the data set).
Figure 3: Datasets used in the experimental study: (a) DataSet1, (b) DataSet2, (c) DataSet3, (d) DataSet4, (e) Real_Data1
Table 5: Optimal number of clusters proposed by validity indices RS-RMSSTD, DB and SD

Index         DataSet1  DataSet2  DataSet3  DataSet4  Real_Data1
RS, RMSSTD       3         2         5         4          3
DB               6         3         7         4          3
SD               4         3         6         3          3
Moreover, SD finds the correct number of clusters (three) for DataSet4, contrary to the RS-RMSSTD and DB indices, which propose four clusters as the best partitioning.
Here, we have to mention that the validity indices are not clustering algorithms themselves but measures to evaluate the results of clustering algorithms and give an indication of a partitioning that best fits a data set. The essence of clustering is not a totally resolved issue and, depending on the application domain, we may consider different aspects as more significant. For instance, for a specific application it may be important to have well-separated clusters, while for another it may be more important to consider the compactness of the clusters. Having an indication of a good partitioning as proposed by the index, the domain experts may further analyse the results of the validation procedure. Thus, they could select some of the partitioning schemes proposed by the indices, and pick the one better fitting their demands for crisp or overlapping clusters. For instance, DataSet2 can be considered as having three clusters, two of which slightly overlap, or as having two well-separated clusters.
5. Conclusions and Trends in Clustering
Cluster analysis is one of the major tasks in various research areas. It may be found under different names in different contexts, such as unsupervised learning in pattern recognition, taxonomy in biology, and partition in graph theory. Clustering aims at identifying and extracting significant groups in the underlying data. Thus, based on a certain clustering criterion, the data are grouped so that data points in a cluster are more similar to each other than to points in different clusters.
Since clustering is applied in many fields, a number of clustering techniques and algorithms have been proposed and are available in the literature. In this paper we presented the main characteristics and applications of clustering algorithms. Moreover, we discussed the different categories into which the algorithms can be classified (i.e., partitional, hierarchical, density-based, grid-based, fuzzy clustering) and we presented representative algorithms of each category. We concluded the discussion on clustering algorithms with a comparative presentation, stressing the pros and cons of each category.
Another important issue that we discussed in this paper is cluster validity. This is related to the inherent features of the data set under concern. The majority of algorithms are based on certain criteria in order to define the clusters into which a data set can be partitioned. Since clustering is an unsupervised method and there is no a-priori indication of the actual number of clusters present in a data set, there is a need for some kind of validation of clustering results. We presented a survey of the best-known validity criteria available in the literature, classified into three categories: external, internal and relative. Moreover, we discussed some representative validity indices of these criteria along with a sample experimental evaluation.
Trends in clustering. Though cluster analysis has been the subject of thorough research for many years and in a variety of disciplines, there are still several open research issues. We summarize some of the most interesting trends in clustering as follows:
i) Discovering and finding representatives of arbitrarily shaped clusters. One of the requirements in clustering is the handling of arbitrarily shaped clusters, and there are some efforts in this context. However, there is no well-established method to describe the structure of arbitrarily shaped clusters as defined by an algorithm. Considering that clustering is a major tool for data reduction, it is important to find the appropriate representatives of the clusters describing their shape. Thus, we may effectively describe the underlying data based on clustering results while achieving a significant compression of the huge amount of stored data (data reduction).
ii) Non-point clustering. The vast majority of algorithms have only considered point objects, though in many cases we have to handle sets of extended objects such as (hyper-)rectangles. Thus, a method that efficiently handles sets of non-point objects and discovers the clusters inherent in them is a subject of further research, with applications in diverse domains (such as spatial databases, medicine, biology).
iii) Handling uncertainty in the clustering process and visualization of results. The majority of clustering techniques assume that the limits of clusters are crisp: each data point may be classified into at most one cluster, and all points classified into a cluster belong to it with the same degree of belief (i.e., all values are treated equally in the clustering process). The result is that, in some cases, "interesting" data points fall outside the cluster limits and are not classified at all. This contrasts with everyday-life experience, where a value may be classified into more than one category. Thus, a further direction of work is taking into account the uncertainty inherent in the data. Another interesting direction is the study of techniques that efficiently visualize multidimensional clusters, also taking uncertainty features into account.
iv) Incremental clustering. The clusters in a data set may change as insertions, updates and deletions occur throughout its life cycle. It is then clear that there is a need to re-evaluate the clustering scheme defined for a data set so as to update it in a timely manner. However, it is important to exploit the information hidden in earlier clustering schemes so as to update them in an incremental way.
v) Constraint-based clustering. Depending on the application domain, we may consider different clustering aspects as more significant. It may be important to stress or ignore some aspects of the data according to the requirements of the considered application. In recent years there has been a trend for cluster analysis to be based on fewer parameters but more constraints. These constraints may exist in the data space or in users' queries. A clustering process then has to be defined so as to take these constraints into account and define the inherent clusters fitting a dataset.
Acknowledgements
This work was supported by the General Secretariat for Research and Technology through the PENED ("99Ε∆ 85") project. We thank C. Amanatidis for his suggestions and his help in the experimental study. Also, we are grateful to C. Rodopoulos for the implementation of the CURE algorithm, as well as to Dr Eui-Hong (Sam) Han for providing information and the source code for the CURE algorithm.
References
[1] Michael J. A. Berry, Gordon Linoff. Data Mining Techniques for Marketing, Sales and Customer Support. John Wiley & Sons, Inc., 1996.
[2] Bezdek J.C., Ehrlich R., Full W. "FCM: Fuzzy C-Means Algorithm", Computers and Geosciences, 1984.
[3] Rajesh N. Dave. "Validating fuzzy partitions obtained through c-shells clustering", Pattern Recognition Letters, Vol. 17, pp. 613-623, 1996.
[4] Davies D.L., Bouldin D.W. "A cluster separation measure". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 1(2), 1979.
[5] J. C. Dunn. "Well separated clusters and optimal fuzzy partitions", J. Cybern., Vol. 4, pp. 95-104, 1974.
[6] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proceedings of 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, pp. 226-231, 1996.
[7] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Michael Wimmer, Xiaowei Xu. "Incremental Clustering for Mining in a Data Warehousing Environment", Proceedings of 24th VLDB Conference, New York, USA, 1998.
[8] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth and Ramasamy Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996.
[9] Gath I., Geva A.B. "Unsupervised optimal fuzzy clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11(7), 1989.
[10] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. "CURE: An Efficient Clustering Algorithm for Large Databases", Proceedings of the ACM SIGMOD Conference, 1998.
[11] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. "ROCK: A Robust Clustering Algorithm for Categorical Attributes", Proceedings of the IEEE Conference on Data Engineering, 1999.
[12] Jiawei Han, Micheline Kamber. Data Mining: Conceptsand Techniques. Morgan Kaufmann Publishers, 2001
[13] M. Halkidi, M. Vazirgiannis, Y. Batistakis. "Qualityscheme assessment in the clustering process", Proceedingsof PKDD, Lyon, France, 2000.
[14] Alexander Hinneburg, Daniel Keim. "An Efficient Approach to Clustering in Large Multimedia Databases with Noise". Proceedings of KDD Conference, 1998.
[15] Zhexue Huang. "A Fast Clustering Algorithm to Clustervery Large Categorical Data Sets in Data Mining", DMKD,1997
[16] A.K. Jain, M.N. Murty, P.J. Flynn. "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, September 1999.
[17] Krishnapuram R., Frigui H., Nasraoui O. “Quadratic shellclustering algorithms and the detection of second-degreecurves”, Pattern Recognition Letters, Vol. 14(7), 1993.
[18] MacQueen, J.B. "Some Methods for Classification and Analysis of Multivariate Observations", In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, Volume I: Statistics, pp. 281-297, 1967.
[19] Milligan, G.W. and Cooper, M.C. "An Examination of Procedures for Determining the Number of Clusters in a Data Set", Psychometrika, Vol. 50, pp. 159-179, 1985.
[20] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[21] Milligan G.W., Soon S.C., Sokol L.M. "The effect of cluster size, dimensionality and the number of clusters on recovery of true cluster structure". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 5, pp. 40-47, 1983.
[22] Raymond Ng, Jiawei Han. "Efficient and Effective Clustering Methods for Spatial Data Mining". Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
[23] Pal N.R., Biswas J. "Cluster Validation using graph theoretic concepts". Pattern Recognition, Vol. 30(6), 1997.
[24] Ramze Rezaee, B.P.F. Lelieveldt, J.H.C. Reiber. "A new cluster validity index for the fuzzy c-mean", Pattern Recognition Letters, Vol. 19, pp. 237-246, 1998.
[25] G. Sheikholeslami, S. Chatterjee, A. Zhang. "WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases". Proceedings of 24th VLDB Conference, New York, USA, 1998.
[26] Sharma S.C. Applied Multivariate Techniques. John Wiley & Sons, 1996.
[27] Padhraic Smyth. "Clustering using Monte Carlo Cross-Validation". Proceedings of KDD Conference, 1996.
[28] S. Theodoridis, K. Koutroumbas. Pattern Recognition, Academic Press, 1999.
[29] Y. Theodoridis. Spatial Datasets: an "unofficial" collection. http://dias.cti.gr/~ytheod/research/datasets/spatial.html
[30] Wei Wang, Jiong Yang and Richard Muntz. "STING: A statistical information grid approach to spatial data mining". Proceedings of 23rd VLDB Conference, 1997.
[31] Xuanli Lisa Xie, Gerardo Beni. "A Validity Measure for Fuzzy Clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 13, No. 4, August 1991.
[32] Tian Zhang, Raghu Ramakrishnan, Miron Livny. "BIRCH: An Efficient Data Clustering Method for Very Large Databases", ACM SIGMOD, Montreal, Canada, 1996.