An Effective Clustering Algorithm for Data Mining
Singh Vijendra and Kelkar Ashwini
Faculty of Engineering and Technology, Mody Institute of Technology and Science
Lakshmangarh, Sikar, Rajasthan, India
Sahoo Laxman
Department of Computer Science and Engineering, Northern India Engineering College
Lucknow, UP, India
Abstract—This paper proposes an effective clustering algorithm for databases that serve as benchmark data sets in data mining applications. We present a Genetic Clustering Algorithm (GCA) that finds a globally optimal partition of a given data set into a specified number of clusters. The algorithm is distance-based and creates centroids. To evaluate the proposed algorithm, we use several artificial data sets and compare the results with those of K-means. Experimental results show that the proposed algorithm performs better and efficiently finds accurate clusters.
Keywords- Clustering; K-means; Genetic algorithm.
I. INTRODUCTION
Clustering is the process of grouping a set of objects into clusters so that objects within a cluster are similar to each other but dissimilar to objects in other clusters [3]. Clustering has been effectively applied in a variety of engineering and scientific disciplines such as psychology, biology, medicine, computer vision, communications, and remote sensing. Cluster analysis organizes data (a set of patterns, where each pattern may be a vector of measurements) by abstracting the underlying structure. The grouping is done such that patterns within a group (cluster) are more similar to each other than patterns belonging to different groups. Thus, organizing data by cluster analysis requires some dissimilarity measure over the set of patterns, defined according to the data under analysis and the purpose of the analysis. Various types of clustering algorithms have been proposed to suit different requirements.

Clustering algorithms can be broadly classified into hierarchical and partitioning algorithms based on the structure of abstraction. Hierarchical clustering algorithms construct a hierarchy of partitions, represented as a dendrogram in which each partition is nested within the partition at the next level of the hierarchy. Partitioning clustering algorithms generate a single partition of the data, with a specified or estimated number of non-overlapping clusters, in an attempt to recover the natural groups present in the data. An important problem in partitioning clustering is to find a partition of the given data, with a specified number of clusters, that minimizes the total within-cluster variation. Unfortunately, in many real-life cases the number of clusters in a data set is not known a priori; under this condition, automatically determining the number of clusters and finding the corresponding partition becomes a challenge.

In this regard, some attempts have been made to use genetic algorithms for automatically clustering data sets [2]. Genetic algorithms (GAs) work on a coding of the parameter set over which the search is performed, rather than on the parameters themselves [1]. These encoded parameters are called solutions or chromosomes, and the objective function value of a solution is the objective function value at the corresponding parameters. GAs solve optimization problems using a population of a fixed number of solutions, called the population size. A solution consists of a string of symbols, typically binary symbols. GAs evolve over generations: during each generation, they produce a new population from the current population by applying the genetic operators, viz. natural selection, crossover, and mutation. Each solution in the population is associated with a figure of merit (fitness value) depending on the value of the function to be optimized. The selection operator selects a solution from the current population for the next population with probability proportional to its fitness value. Crossover operates on two solution strings and produces another two strings; a typical crossover operator exchanges the segments of the selected strings across a crossover point with some probability. The mutation operator toggles each position in a string with a probability called the mutation probability.

Bandyopadhyay and Maulik [6] applied a variable-string-length genetic algorithm, with real encoding of the cluster-center coordinates in the chromosome, to the clustering problem. Tseng and Yang [7] proposed a genetic-algorithm-based approach for the clustering problem; their method consists of two stages, nearest-neighbor clustering and genetic optimization. Lin et al. [4] presented a genetic clustering algorithm based on a binary chromosome representation in which the cluster centers are selected directly from the data set. Lai [8] adopted a hierarchical genetic algorithm to solve the clustering problem; in that method, the chromosome consists of two types of genes, control genes and parametric genes.
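The generational cycle of selection, crossover, and mutation sketched above can be illustrated on a toy problem. The following is a minimal, self-contained GA for the classic OneMax problem (maximizing the number of ones in a binary string); it is not the clustering algorithm of this paper, and all parameter values are arbitrary illustrative choices:

```python
import random

def onemax_ga(n=20, pop_size=30, generations=60, p_mut=0.02, seed=1):
    """Toy GA over binary strings, maximizing the number of ones,
    showing the selection / crossover / mutation cycle."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    fit = lambda s: sum(s)
    for _ in range(generations):
        # fitness-proportional (roulette-wheel) selection
        weights = [fit(s) + 1 for s in pop]
        parents = rng.choices(pop, weights=weights, k=pop_size)
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):
            cut = rng.randrange(1, n)  # one-point crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                # flip each bit with probability p_mut (mutation)
                nxt.append([bit ^ (rng.random() < p_mut) for bit in child])
        pop = nxt
    return max(pop, key=fit)
```

After a few dozen generations the best string is typically close to all ones, far above the random-initialization average of n/2.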
2010 International Conference on Data Storage and Data Engineering
978-0-7695-3958-4/10 $26.00 © 2010 IEEE
DOI 10.1109/DSDE.2010.34
Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY CALICUT. Downloaded on June 08,2010 at 15:35:36 UTC from IEEE Xplore. Restrictions apply
II. CLUSTERING PROBLEM
Clustering is the formal study of algorithms and methods for classifying objects without category labels. A cluster is a set of objects that are alike, while objects from different clusters are unalike. The set of n objects X = {X1, X2, ..., Xn} is to be clustered. Each Xi ∈ R^p is an attribute vector consisting of p real measurements describing the object. The objects are to be clustered into non-overlapping groups C = {C1, C2, ..., Ck} (C is known as a clustering), where k is the number of clusters, C1 ∪ C2 ∪ ... ∪ Ck = X, Ci ≠ ∅, and Ci ∩ Cj = ∅ for i ≠ j.

The objects within each group should be more similar to each other than to objects in any other group, and the value of k may be unknown. If k is known, the problem is referred to as the k-clustering problem. Many methods described in the literature assume that k is given by the user [12]; these methods search for k clusters according to a predefined criterion. In that case, the number of ways of sorting N objects into k clusters is given by Liu [5]:

NW(N, k) = (1/k!) Σ_{i=0}^{k} (−1)^i C(k, i) (k − i)^N    (1)
Thus, there are a large number of possible partitions even for moderate N and k (e.g., NW(25, 5) ≈ 2.5 × 10^15), and the complete enumeration of every possible partition is simply not possible [10]. In other words, it is not easy to find the best partitioning even assuming that k is known, and this is rarely the case in practice. A usual approach is to run a clustering algorithm several times and, based on the obtained results, choose the value of k that provides the most natural clustering. This strategy assumes domain knowledge and usually has the disadvantage of searching for the best solution in a small subset of the search space; consequently, these methods have, in general, a low probability of success. Another alternative involves optimizing k according to numeric criteria. In this case, k is unknown, and the number of ways of grouping N instances into clusters, considering S different scenarios (each one resulting from a different k), is [5]:

NW(N, S) = Σ_{k=1}^{S} NW(N, k)    (2)
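The count NW(N, k) discussed above is the Stirling number of the second kind, so the magnitude quoted for NW(25, 5) can be checked directly (a small sketch; the function name is ours):

```python
from math import comb, factorial

def nw(n, k):
    """Number of ways to sort n objects into k non-empty clusters
    (the Stirling number of the second kind)."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** n
               for i in range(k + 1)) // factorial(k)

print(nw(25, 5))  # 2436684974110751, i.e. roughly 2.5e15 as stated
```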
The problem of finding an optimal partition of N data points into k clusters is NP-complete [11] and, given that the number of distinct partitions of N instances into k clusters grows approximately as k^N / k!, attempting to find a globally optimal solution is usually not computationally feasible. This difficulty has stimulated the search for efficient approximation algorithms. Furthermore, traditional clustering algorithms search a relatively small subset of the solution space (these subsets are defined by the number of clusters, the clustering criterion, and the clustering method); consequently, the probability of success of these methods is small. Algorithms such as single-linkage are deterministic and will repeatedly find the same solution for a given data set, whereas algorithms such as k-means conduct a local search starting from an initial partition. In each case, the solution may be a local optimum, which is not necessarily the global solution; this is exacerbated when the solution space is very large.

Clearly, we need an algorithm with the potential to search large solution spaces effectively. Genetic algorithms have been widely employed for optimization problems in several domains; their success lies in their ability to span a large subset of the search space.
III. GENETIC CLUSTERING ALGORITHM
We propose a genetic algorithm for the k-clustering problem, where the required number of clusters is known. Various adaptations are used to enable the GA to cluster and to enhance its performance; further, the Genetic Clustering Algorithm is tested on databases that are benchmarks for data mining applications, and heuristics are added to enable the GA to cope with larger numbers of objects. Genetic algorithms for the clustering problem involve the following design areas: representation, fitness function, operators, and parameter values.
A. Representation
Genetic representations for clustering or grouping problems are based on an underlying scheme: the objects are represented by gene values, and the positions of these genes signify how the objects are divided among the clusters.

The use of a simple encoding scheme causes problems of redundant codification and context insensitivity [1]. This has led researchers to devise complicated representations and specialized operators for clustering problems [13]. A cluster-label encoding over n genes is simple compared to parameterizing prototype locations, but in such a representation many genotypes translate to a single phenotype, and the notion of cluster labels built into the representation makes little intuitive sense. Such representations have spawned a set of pre-treatment methodologies to make them suitable for genetic operators.
Let us consider a data set formed by N instances. A genotype is then an integer vector of (N+1) positions: each of the first N positions corresponds to an instance, i.e., the i-th position (gene) represents the i-th instance, whereas the last gene represents the number of clusters (k) [9]. Thus, each gene takes a value over the alphabet {1, 2, ..., k}. For instance, in a data set composed of 20 instances, a possible genotype is:

Genotype: 1123245125432533424 5

In this case, three instances {1, 2, 8} form the cluster whose label is 1. The cluster whose label is 2 has five instances {3, 5, 9, 13, 18}, and so on. Standard genetic operators are usually not suitable for clustering problems for several reasons. First, the encoding scheme presented above is naturally redundant, i.e., the encoding is one-to-many; in fact, there are k! different genotypes that represent the same solution.
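Decoding this label-based encoding is straightforward. The following minimal decoder is ours, and the example genotype at the bottom is hypothetical (eight instances, k = 3), not the one from the text:

```python
def decode(genotype):
    """Decode a label-based genotype: one integer label per instance,
    with the last gene giving the number of clusters k."""
    *labels, k = genotype
    clusters = {c: set() for c in range(1, k + 1)}
    for i, label in enumerate(labels, start=1):
        clusters[label].add(i)  # instances are indexed from 1, as in the text
    return clusters

# hypothetical 8-instance genotype with k = 3
print(decode([1, 1, 2, 3, 2, 1, 3, 2, 3]))
```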
Thus, the size of the search space the genetic algorithm has to explore is much larger than the original space of solutions, and this augmented space may reduce the efficiency of the genetic algorithm. In addition, the redundant encoding causes the undesirable effect of casting context-dependent information out of context under standard crossover, i.e., equal parents can produce different offspring.

For this reason, the development of genetic operators specially designed for clustering problems has been investigated [10, 9]. In this context, the Genetic Clustering Algorithm operators proposed in [9] are of particular interest, since they operate on constant-length chromosomes.
B. Fitness Function
Objective functions used for traditional clustering algorithms can act as fitness functions for genetic clustering algorithms. However, if the optimal clustering corresponds to the minimal objective function value, the objective function value must be transformed, since GAs work to maximize their fitness values [1]. In addition, fitness values in a GA need to be positive if fitness-proportional selection is used.
C. Genetic Operators
The operators pass genetic information between subsequent generations of the population. As a result, operators need to be matched with, or designed for, the representation, so that the offspring are valid and inherit characteristics from their parents. The operators used for genetic clustering or grouping include selection, crossover, and mutation methods.
1) Selection
Chromosomes are selected for reproduction based on their relative fitness. Thus the representation is not a factor when choosing an appropriate selection operator, but the fitness function is. If all fitness values are positive, and the maximum fitness value corresponds to the optimal clustering, then fitness-proportional selection may be appropriate; otherwise, a ranking selection method may be used.
In the proposed Genetic Clustering Algorithm, the genotypes of each generation are selected according to the roulette-wheel strategy [1], which does not admit negative objective function values. For this reason, a constant equal to one is added to the objective function before the selection procedure takes place. The highest-fitness genotype is always copied into the succeeding generation.
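As a sketch, this selection step can be written as follows. This is an illustrative implementation under our own naming, assuming a non-negative objective and leaving the genotype structure abstract:

```python
import random

def select_generation(population, objective, rng=random):
    """Roulette-wheel selection as described: fitness is the objective
    value plus one (keeping it positive, assuming a non-negative
    objective), and the best genotype is always copied into the next
    generation (elitism)."""
    fitness = [objective(g) + 1.0 for g in population]
    best = max(population, key=objective)
    next_gen = [best]  # elitist copy of the highest-fitness genotype
    # fill the rest of the generation by fitness-proportional draws
    next_gen += rng.choices(population, weights=fitness,
                            k=len(population) - 1)
    return next_gen
```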
2) Crossover
The crossover operator is designed to transfer genetic material from one generation to the next. The major concerns with this operator are validity and context insensitivity: it may be necessary to check that offspring produced by a given operator are valid and to reject any invalid chromosomes.
The proposed Genetic Clustering Algorithm crossover operator combines clustering solutions coming from different genotypes. It works in the following way. First, two genotypes (G1 and G2) are selected. Then, assuming that G1 represents k1 clusters, the Genetic Clustering Algorithm randomly chooses c ∈ {1, 2, ..., k1} clusters of G1 to copy into G2. The unchanged clusters of G2 are maintained, and the changed ones have their instances allocated to the corresponding nearest clusters (according to their centroids). In this way, an offspring G3 is obtained. The same procedure is employed to get an offspring G4, but now copying the changed clusters of G2 into G1. Note that, although the crossover operator usually produces offspring whose number of clusters lies between those of their parents, it is able to increase or decrease the number of clusters.
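A minimal sketch of this crossover, under a data layout of our own choosing (clusters as sets of instance indices, data as a dict mapping index to feature vector); it produces one offspring from two parents:

```python
import random

def centroid(cluster, data):
    """Mean vector of the instances in a cluster (data: index -> vector)."""
    dim = len(next(iter(data.values())))
    return [sum(data[i][d] for i in cluster) / len(cluster) for d in range(dim)]

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def crossover(c1, c2, data, rng=random):
    """One offspring of the described crossover: c randomly chosen
    clusters of the first parent are copied over; clusters of the second
    parent that lose instances to the copies are dissolved, and their
    remaining instances go to the nearest resulting centroid."""
    c = rng.randint(1, len(c1))
    copied = rng.sample(c1, c)
    moved = set().union(*copied)
    unchanged = [cl for cl in c2 if not (cl & moved)]
    offspring = [set(cl) for cl in copied] + [set(cl) for cl in unchanged]
    cents = [centroid(cl, data) for cl in offspring]
    # reassign instances of the dissolved clusters to the nearest centroid
    leftovers = set().union(*(cl for cl in c2 if cl & moved)) - moved
    for i in leftovers:
        j = min(range(len(offspring)), key=lambda j: dist2(data[i], cents[j]))
        offspring[j].add(i)
    return offspring
```

Whatever clusters are chosen, the offspring is always a valid partition of the instances.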
3) Mutation
Mutation introduces new genetic material into the population; in a clustering context, this corresponds to moving an object from one cluster to another. Two mutation operators are used in the Genetic Clustering Algorithm.
The first operator works only on genotypes that encode more than two clusters. It eliminates a randomly chosen cluster, placing its instances into the nearest remaining clusters (according to their centroids). The second operator divides a randomly selected cluster into two new ones: the first is formed by the instances closer to the original centroid, whereas the other is formed by the instances closer to the instance farthest from that centroid.
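The two mutation operators can be sketched as follows, again under our own data layout (clusters as sets of instance indices, data as a dict of vectors); this is an illustrative reading of the description, not the authors' code:

```python
import random

def centroid(cluster, data):
    """Mean vector of the instances in a cluster (data: index -> vector)."""
    dim = len(next(iter(data.values())))
    return [sum(data[i][d] for i in cluster) / len(cluster) for d in range(dim)]

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mutate_eliminate(clusters, data, rng=random):
    """First operator: remove one random cluster (only when more than two
    exist) and send its instances to the nearest remaining centroid."""
    if len(clusters) <= 2:
        return clusters
    victim = rng.randrange(len(clusters))
    rest = [set(cl) for i, cl in enumerate(clusters) if i != victim]
    cents = [centroid(cl, data) for cl in rest]
    for i in clusters[victim]:
        j = min(range(len(rest)), key=lambda j: dist2(data[i], cents[j]))
        rest[j].add(i)
    return rest

def mutate_split(clusters, data, rng=random):
    """Second operator: split one random cluster in two -- instances nearer
    the old centroid versus instances nearer the point farthest from it."""
    victim = rng.randrange(len(clusters))
    cl = clusters[victim]
    if len(cl) < 2:
        return clusters
    c = centroid(cl, data)
    far = max(cl, key=lambda i: dist2(data[i], c))
    near_c = {i for i in cl if dist2(data[i], c) < dist2(data[i], data[far])}
    rest = [set(x) for j, x in enumerate(clusters) if j != victim]
    return rest + [near_c, cl - near_c]
```

Together the operators let the number of clusters drift down (elimination) or up (splitting) during the search.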
IV. OBJECTIVE FUNCTION
The objective function evaluates the fitness of individual strings. Almost all partition evaluation functions provide some measure of inter-cluster isolation and/or intra-cluster homogeneity; for a good partition, there should be appreciable inter-cluster isolation and intra-cluster homogeneity. The homogeneity within a cluster is calculated by the sum of distances between all pairs of objects within the cluster. We use an objective function based on the Euclidean distance [3]:
d(Xi, Xj) = sqrt((xi1 − xj1)^2 + (xi2 − xj2)^2 + ... + (xin − xjn)^2)    (3)

where Xi = (xi1, xi2, ..., xin) and Xj = (xj1, xj2, ..., xjn) are two n-dimensional data objects. The calculation of distances between instances represents the main computational cost of the Genetic Clustering Algorithm.
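For concreteness, Eq. (3) is the standard Euclidean distance and can be computed as:

```python
from math import sqrt

def euclidean(xi, xj):
    """Euclidean distance between two n-dimensional objects, as in Eq. (3)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

print(euclidean((0, 0), (3, 4)))  # 5.0
```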
V. EXPERIMENTAL RESULTS
To assess the performance of the proposed Genetic Clustering Algorithm, we first applied the method to the Iris data set, whose true classes are known [14]. Performance was measured by accuracy, i.e., the proportion of objects that are correctly grouped together with respect to the true classes. To further investigate the performance, an experimental study was carried out by repeatedly generating artificial data sets and calculating the average performance of the method.

The Iris data set is available in the UCI repository (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/),
which includes 150 instances. There are three classes (Setosa, Versicolour, and Virginica), each represented by 50 instances. The class Setosa is linearly separable from the others, whereas the classes Versicolour and Virginica are not linearly separable. Four attributes (sepal and petal length and width) describe each instance. The sepal and petal areas were used as the attributes (variables): the sepal area is obtained by multiplying the sepal length by the sepal width, and the petal area is calculated analogously. We applied the proposed Genetic Clustering Algorithm and K-means with k = 3 to this data set without using class information. The result of the K-means algorithm is shown in Figure 1; its clustering accuracy is 87.4%, whereas the clustering accuracy of the proposed Genetic Clustering Algorithm is 97%. The Genetic Clustering Algorithm result is shown in Figure 2. Comparing the results, we observed that K-means wrongly grouped objects in two of the classes (Versicolour and Virginica).
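The accuracy measure used above can be computed as follows; this is a brute-force sketch of our own (the function name is ours) that tries every one-to-one relabelling of the predicted clusters, which is feasible for a small number of classes such as the three in Iris:

```python
from itertools import permutations

def clustering_accuracy(pred, true):
    """Proportion of objects correctly grouped under the best one-to-one
    mapping of predicted cluster labels to true class labels."""
    pred_labels = sorted(set(pred))
    best = 0
    for perm in permutations(sorted(set(true)), len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        hits = sum(mapping[p] == t for p, t in zip(pred, true))
        best = max(best, hits)
    return best / len(true)
```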
Figure 1. Clustering using the K-means method
Figure 2. Clustering using the Genetic Clustering Algorithm
VI. CONCLUSIONS
As a fundamental problem and technique for data analysis, clustering has become increasingly important. Many clustering methods require the designer to provide the number of clusters as input. In this paper, we propose a Genetic Clustering Algorithm for data clustering and compare it with K-means. The results of various experiments using artificial data sets show that the proposed algorithm performs better and efficiently finds accurate clusters.
ACKNOWLEDGMENT
The authors would like to thank Prof. Chothmal and Prof. P. K. Das for their thoughtful, constructive comments and suggestions.
REFERENCES
[1] D. E. Goldberg, "Genetic Algorithms in Search, Optimization and Machine Learning," Addison-Wesley, 1989.
[2] A. Banerjee and S. J. Louis, "A recursive clustering methodology using a genetic algorithm," IEEE, 2007.
[3] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann, 2004.
[4] H. J. Lin, F. W. Yang, and Y. T. Kao, "An efficient GA-based clustering technique," Tamkang Journal of Science and Engineering, vol. 8, no. 2, pp. 113-122, 2005.
[5] C. L. Liu, "Introduction to Combinatorial Mathematics," McGraw-Hill, New York, 1968.
[6] S. Bandyopadhyay and U. Maulik, "An evolutionary technique based on K-means algorithm for optimal clustering in R^N," Information Sciences, vol. 146, no. 1-4, pp. 221-237, 2002.
[7] L. Y. Tseng and S. B. Yang, "A genetic approach to the automatic clustering algorithm," Pattern Recognition, vol. 34, no. 2, pp. 415-424, 2001.
[8] C. C. Lai, "A novel clustering approach using hierarchical genetic algorithms," Intelligent Automation and Soft Computing, vol. 11, no. 3, pp. 143-153, 2005.
[9] E. R. Hruschka and N. F. F. Ebecken, "A genetic algorithm for cluster analysis," Intelligent Data Analysis, vol. 7, no. 1, pp. 15-25, 2003.
[10] B. S. Everitt, S. Landau, and M. Leese, "Cluster Analysis," Arnold Publishers, London, 2001.
[11] L. Kaufman and P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis," Wiley Series in Probability and Mathematical Statistics, 1990.
[12] E. R. Hruschka, R. J. G. B. Campello, and L. N. de Castro, "Improving the efficiency of a clustering genetic algorithm," in Advances in Artificial Intelligence, IBERAMIA 2004, vol. 3315 of LNCS, pp. 861-870, 2004.
[13] G. P. Babu and M. N. Murty, "A near-optimal initial seed selection in K-means algorithm using a genetic algorithm," Pattern Recognition Letters, vol. 14, pp. 763-769, 1993.
[14] H.-S. Park and C.-H. Jun, "A simple and fast algorithm for K-medoids clustering," Expert Systems with Applications, pp. 3336-3341, 2009.