
An Effective Clustering Algorithm for Data Mining*

Singh Vijendra and Kelkar Ashwini

Faculty of Engineering and Technology, Mody Institute of Technology and Science

Lakshmangarh, Sikar, Rajasthan, India

[email protected]

Sahoo Laxman 

Department of Computer Science and Engineering, Northern India Engineering College

Lucknow, UP, India

[email protected]

 

Abstract—This paper proposes an effective clustering algorithm for databases that serve as benchmark data sets in data mining applications. We present a Genetic Clustering Algorithm (GCA) that finds a globally optimal partition of a given data set into a specified number of clusters. The algorithm is distance-based and creates centroids. To evaluate the proposed algorithm, we use several artificial data sets and compare the results with those of K-means. Experimental results show that the proposed algorithm performs better and efficiently finds accurate clusters.

 Keywords- Clustering; K-means; Genetic algorithm.

I. INTRODUCTION

Clustering is the process of grouping a set of objects into clusters so that objects within a cluster are similar to each other but are dissimilar to objects in other clusters [3]. Clustering has been effectively applied in a variety of engineering and scientific disciplines such as psychology, biology, medicine, computer vision, communications, and remote sensing. Cluster analysis organizes data (a set of patterns; each pattern could be a vector of measurements) by abstracting the underlying structure. The grouping is done such that patterns within a group (cluster) are more similar to each other than patterns belonging to different groups. Thus, organization of data using cluster analysis employs some dissimilarity measure among the set of patterns. The dissimilarity measure is defined based on the data under analysis and the purpose of the analysis. Various types of clustering algorithms have been proposed to suit different requirements. Clustering algorithms can be broadly classified into hierarchical and partitioning algorithms based on the structure of abstraction. Hierarchical clustering algorithms construct a hierarchy of partitions, represented as a dendrogram, in which each partition is nested within the partition at the next level in the hierarchy. Partitioning clustering algorithms generate a single partition of the data, with a specified or estimated number of non-overlapping clusters, in an attempt to recover the natural groups present in the data. One of the important problems in partitioning clustering is to find a partition of the given data, with a specified number of clusters, that minimizes the total within-cluster variation. Unfortunately, in many real-life cases the number of clusters in a data set is not known a priori. Under this condition, how to automatically determine the number of clusters and find the clustering partition becomes a challenge.

In this regard, some attempts have been made to use genetic algorithms for automatically clustering data sets [2]. Genetic algorithms (GAs) work on a coding of the parameter set over which the search has to be performed, rather than on the parameters themselves [1]. These encoded parameters are called solutions or chromosomes, and the objective function value at a solution is the objective function value at the corresponding parameters. GAs solve optimization problems using a population of a fixed number of solutions, called the population size. A solution consists of a string of symbols, typically binary symbols. GAs evolve over generations. During each generation, they produce a new population from the current population by applying genetic operators, viz., natural selection, crossover, and mutation. Each solution in the population is associated with a figure of merit (fitness value) depending on the value of the function to be optimized. The selection operator selects a solution from the current population for the next population with probability proportional to its fitness value. Crossover operates on two solution strings and results in another two strings. A typical crossover operator exchanges the segments of the selected strings across a crossover point with some probability. The mutation operator toggles each position in a string with a probability called the mutation probability.

Bandyopadhyay and Maulik [6] applied a variable-string-length genetic algorithm, with real encoding of the coordinates of the cluster centers in the chromosome, to the clustering problem. Tseng and Yang [7] proposed a genetic-algorithm-based approach for the clustering problem; their method consists of two stages, nearest-neighbor clustering and genetic optimization. Lin et al. [4] presented a genetic clustering algorithm based on a binary chromosome representation; the proposed method selects the cluster centers directly from the data set. Lai [8] adopted a hierarchical genetic algorithm to solve the clustering problem; in that method, the chromosome consists of two types of genes, control genes and parametric genes.


II. CLUSTERING PROBLEM

Clustering is a formal study of algorithms and methods for classifying objects without category labels. A cluster is a set of objects that are alike, and objects from different clusters are not alike. The set of n objects X = {X1, X2, ..., Xn} is to be clustered. Each Xi ∈ R^p is an attribute vector consisting of p real measurements describing the object. The objects are to be clustered into non-overlapping groups C = {C1, C2, ..., Ck} (C is known as a clustering), where k is the number of clusters, C1 ∪ C2 ∪ ... ∪ Ck = X, Ci ≠ ∅, and Ci ∩ Cj = ∅ for i ≠ j.

The objects within each group should be more similar to each other than to objects in any other group, and the value of k may be unknown. If k is known, the problem is referred to as the k-clustering problem. Many methods described in the literature assume that k is given by the user [12]; these methods search for k clusters according to a predefined criterion. The number of ways of sorting N objects into k clusters is then given by Liu [5]:

    NW(N, k) = (1/k!) Σ_{i=1}^{k} (-1)^(k-i) C(k, i) i^N    (1)

Thus, there are a large number of possible partitions even for moderate N and k (e.g., NW(25, 5) ≈ 2.5×10^15), and the complete enumeration of every possible partition is simply not possible [10]. In other words, it is not easy to find the best partitioning even assuming that k is known, and this is rarely the case in practice. A usual approach is to run a clustering algorithm several times and, based on the obtained results, choose the value of k that provides the most natural clustering. This strategy assumes domain knowledge and usually has the disadvantage of searching for the best solution in a small subset of the search space. Consequently, these methods have, in general, low probabilities of success. Another alternative involves optimizing k according to numeric criteria. In this case, k is unknown, and the number of ways of grouping N instances into k clusters, considering S different scenarios (each one resulting from a different k), is [5]:

    Σ_{k=1}^{S} NW(N, k)    (2)
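These counts are easy to check directly. The following minimal Python sketch (an illustration added here, not part of the original paper) evaluates Eq. (1) and reproduces the NW(25, 5) figure quoted above:

    from math import comb, factorial

    def nw(n, k):
        # Number of ways of sorting n objects into k non-empty clusters,
        # per Eq. (1); this is the Stirling number of the second kind.
        return sum((-1) ** (k - i) * comb(k, i) * i ** n
                   for i in range(1, k + 1)) // factorial(k)

    print(nw(25, 5))  # 2436684974110751, i.e. roughly 2.5 x 10^15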

The problem of finding an optimal solution to the partition of N data into k clusters is NP-complete [11] and, given that the number of distinct partitions of N instances into k clusters increases approximately as k^N/k!, attempting to find a globally optimum solution is usually not computationally feasible. This difficulty has stimulated the search for efficient approximation algorithms. Furthermore, traditional clustering algorithms search a relatively small subset of the solution space (these subsets are defined by the number of clusters, the clustering criteria, and the clustering method). Consequently, the probability of success of these methods is small. Algorithms such as single-linkage are deterministic and will repeatedly find the same solution for a given data set, whereas algorithms such as k-means conduct a local search starting from an initial partition. In each case, the solution may be a local optimum, which is not necessarily the global solution. This is exacerbated when the solution space is very large.

Clearly, we need an algorithm with the potential to search large solution spaces effectively. Genetic algorithms have been widely employed for optimization problems in several domains. Their success lies in their ability to span a large subset of the search space.

III. GENETIC CLUSTERING ALGORITHM

We propose a genetic algorithm for the problem of k-clustering, where the required number of clusters is known. Various adaptations are used to enable the GA to cluster and to enhance its performance. Further, the Genetic Clustering Algorithm is tested on databases that are benchmarks for data mining applications, and heuristics are added to enable the GA to cope with a larger number of objects. The design of a genetic algorithm for the clustering problem falls into the following areas: representation, fitness function, operators, and parameter values.

A. Representation

Genetic representations for clustering or grouping problems are based on an underlying scheme. The scheme represents the objects with gene values, and the positions of these genes signify how the objects are divided amongst the clusters.

The use of a simple encoding scheme causes problems of redundant codification and context insensitivity [1]. This has led researchers to devise complicated representations and specialized operators for clustering problems [13]. Cluster-label-based encoding over n genes is simple compared to parameterization of prototype locations. In such a representation, many genotypes translate to a unique phenotype. The notion of cluster labels built into the representation makes little intuitive sense. Such representations have spawned a set of pre-treatment methodologies to make the representations suitable for genetic operators.

Let us consider a dataset formed by N instances. Then, a genotype is an integer vector of (N+1) positions. Each position corresponds to an instance, i.e., the i-th position (gene) represents the i-th instance, whereas the last gene represents the number of clusters (k) [9]. Thus, each gene has a value over the alphabet {1, 2, 3, ..., k}. For instance, in a dataset composed of 20 instances, a possible genotype is:

Genotype: 1123245125432533424 5

In this case, three instances {1, 2, 8} form the cluster whose label is 1. The cluster whose label is 2 has 5 instances {3, 5, 9, 13, 18}, and so on. Standard genetic operators are usually not suitable for clustering problems for several reasons. First, the encoding scheme presented above is naturally redundant, i.e., the encoding is one-to-many. In fact, there are k! different genotypes that represent the same solution.
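To make the encoding concrete, the following Python sketch (illustrative; the helper name is ours) decodes such a genotype into its clusters, using the digits of the example above exactly as printed:

    from collections import defaultdict

    def decode(genotype):
        # Map a label-based genotype (whose last gene is k) to a dict from
        # cluster label to the set of 1-based instance indices it contains.
        labels, k = genotype[:-1], genotype[-1]
        clusters = defaultdict(set)
        for i, label in enumerate(labels, start=1):
            clusters[label].add(i)
        return dict(clusters), k

    g = [1, 1, 2, 3, 2, 4, 5, 1, 2, 5, 4, 3, 2, 5, 3, 3, 4, 2, 4] + [5]
    clusters, k = decode(g)
    print(clusters[1])  # {1, 2, 8}
    print(clusters[2])  # {3, 5, 9, 13, 18}
    # Relabelling clusters (say, swapping labels 1 and 2 everywhere) gives a
    # different genotype but the identical partition: the k! redundancy.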


Thus, the size of the search space the genetic algorithm has to search is much larger than the original space of solutions. This augmented space may reduce the efficiency of the genetic algorithm. In addition, the redundant encoding causes the undesirable effect of casting context-dependent information out of context under the standard crossover, i.e., equal parents can originate different offspring. For this reason, the development of genetic operators specially designed for clustering problems has been investigated [10, 9]. In this context, the Genetic Clustering Algorithm operators proposed in [9] are of particular interest since they operate on constant-length chromosomes.

B. Fitness Function

Objective functions used for traditional clustering algorithms can act as fitness functions for genetic clustering algorithms. However, if the optimal clustering corresponds to the minimal objective function value, we will need to transform the objective function value, since GAs work to maximize their fitness values [1]. In addition, fitness values in a GA need to be positive if we are using fitness-proportional selection.
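For instance, an objective to be minimized can be turned into a positive, maximizable fitness in several standard ways; one common choice (our assumption, not prescribed by the paper) is:

    def fitness(objective_value):
        # Positive fitness from a non-negative objective to be minimized:
        # lower objective -> higher fitness, and fitness > 0 always.
        return 1.0 / (1.0 + objective_value)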

C. Genetic Operators

The operators pass genetic information between subsequent generations of the population. As a result, operators need to be matched with or designed for the representation, so that the offspring are valid and inherit characteristics from their parents. Operators used for genetic clustering or grouping include the selection, crossover, and mutation methods below.

1) Selection

Chromosomes are selected for reproduction based on their relative fitness. Thus the representation is not a factor when choosing an appropriate selection operator, but the fitness function is. If all fitness values are positive, and the maximum fitness value corresponds to the optimal clustering, then fitness-proportional selection may be appropriate. Otherwise, a ranking selection method may be used.

In the proposed Genetic Clustering Algorithm, the genotypes of each generation are selected according to the roulette wheel strategy [1], which does not admit negative objective function values. For this reason, a constant equal to one is added to the objective function before the selection procedure takes place. The genotype with the highest fitness is always copied into the succeeding generation.
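A minimal sketch of this selection step, assuming the shifted objective value is the quantity to be maximized (the function name and structure are ours):

    import random

    def roulette_select(population, objective_values):
        # Roulette-wheel selection with the +1 shift and elitism described
        # above: the fittest genotype is always carried over, and the rest
        # are drawn with probability proportional to (objective value + 1).
        weights = [v + 1.0 for v in objective_values]
        best = max(range(len(population)), key=lambda i: objective_values[i])
        survivors = [population[best]]  # elitism
        survivors += random.choices(population, weights=weights,
                                    k=len(population) - 1)
        return survivors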

2) Crossover

The crossover operator is designed to transfer genetic material from one generation to the next. The major concerns with this operator are validity and context insensitivity. It may be necessary to check that the offspring produced by a certain operator are valid and to reject any invalid chromosomes.

The proposed Genetic Clustering Algorithm crossover operator combines clustering solutions coming from different genotypes. It works in the following way. First, two genotypes (G1 and G2) are selected. Then, assuming that G1 represents k1 clusters, the Genetic Clustering Algorithm randomly chooses c ∈ {1, 2, ..., k1} clusters to copy into G2. The unchanged clusters of G2 are maintained, and the changed ones have their instances allocated to the corresponding nearest clusters (according to their centroids). In this way, an offspring G3 is obtained. The same procedure is employed to get an offspring G4, but now considering that the changed clusters of G2 are copied into G1. Note that, although the crossover operator usually produces offspring whose number of clusters is neither smaller nor larger than those of their parents, it is able to increase or decrease the number of clusters.
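A sketch of this crossover under our reading of the description, producing one offspring G3 (G4 is obtained symmetrically); the helper names are ours and NumPy is assumed:

    import random
    import numpy as np

    def gca_crossover(g1, g2, data):
        # One offspring (G3). Genotypes are label lists whose last gene is
        # k; data is an (N, p) NumPy array of instances.
        n, k1 = len(data), g1[-1]
        chosen = set(random.sample(range(1, k1 + 1), random.randint(1, k1)))
        # Step 1: instances of the chosen G1 clusters are copied into the child.
        child = [('G1', g1[i]) if g1[i] in chosen else None for i in range(n)]
        # Step 2: G2 clusters that lost no instance in step 1 are kept as-is.
        changed = {g2[i] for i in range(n) if child[i] is not None}
        for i in range(n):
            if child[i] is None and g2[i] not in changed:
                child[i] = ('G2', g2[i])
        # Step 3: instances of the changed G2 clusters are allocated to the
        # nearest remaining cluster, according to the cluster centroids.
        members = {}
        for i, tag in enumerate(child):
            if tag is not None:
                members.setdefault(tag, []).append(i)
        cents = {tag: data[m].mean(axis=0) for tag, m in members.items()}
        for i in range(n):
            if child[i] is None:
                child[i] = min(cents,
                               key=lambda t: np.linalg.norm(data[i] - cents[t]))
        # Renumber the surviving clusters 1..k3 and append k3 as the last gene.
        relabel = {t: j + 1 for j, t in enumerate(sorted(set(child)))}
        return [relabel[t] for t in child] + [len(relabel)]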

3) Mutation

Mutation introduces new genetic material into the population. In a clustering context, this corresponds to moving an object from one cluster to another. Two mutation operators are used in the Genetic Clustering Algorithm.

The first operator works only on genotypes that encode more than two clusters. It eliminates a randomly chosen cluster, placing its instances into the nearest remaining clusters (according to their centroids). The second operator divides a randomly selected cluster into two new ones. The first cluster is formed by the instances closer to the original centroid, whereas the other cluster is formed by those instances closer to the farthest instance from the centroid.
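Sketches of the two mutation operators under this description (the names are ours; NumPy is assumed):

    import random
    import numpy as np

    def eliminate_cluster(genotype, data):
        # First operator (applies only when k > 2): a randomly chosen
        # cluster is removed and its instances are moved to the nearest
        # remaining centroid; labels are then renumbered 1..k-1.
        labels, k = list(genotype[:-1]), genotype[-1]
        if k <= 2:
            return genotype
        victim = random.randint(1, k)
        cents = {c: data[[i for i, l in enumerate(labels) if l == c]].mean(axis=0)
                 for c in range(1, k + 1) if c != victim}
        labels = [min(cents, key=lambda c: np.linalg.norm(data[i] - cents[c]))
                  if l == victim else l
                  for i, l in enumerate(labels)]
        relabel = {c: j + 1 for j, c in enumerate(sorted(set(labels)))}
        return [relabel[l] for l in labels] + [k - 1]

    def split_cluster(genotype, data):
        # Second operator: a randomly selected cluster is split in two; the
        # instances closer to the original centroid stay, the rest form a
        # new cluster seeded by the instance farthest from that centroid.
        labels, k = list(genotype[:-1]), genotype[-1]
        target = random.randint(1, k)
        members = [i for i, l in enumerate(labels) if l == target]
        cent = data[members].mean(axis=0)
        far = max(members, key=lambda i: np.linalg.norm(data[i] - cent))
        for i in members:
            if np.linalg.norm(data[i] - data[far]) < np.linalg.norm(data[i] - cent):
                labels[i] = k + 1  # closer to the far seed: new cluster
        return labels + [k + 1]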

IV. OBJECTIVE FUNCTION

The objective function evaluates the fitness of individual strings. Almost all partition evaluation functions provide some kind of measure of inter-cluster isolation and/or intra-cluster homogeneity. For a good partition, there should be appreciable inter-cluster isolation and intra-cluster homogeneity. The homogeneity within a cluster is calculated as the sum of the distances between all pairs of objects within the cluster. We use an objective function based on the Euclidean distance [3]:

    d(Xi, Xj) = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xin - xjn)^2 )    (3)

where Xi = (xi1, xi2, ..., xin) and Xj = (xj1, xj2, ..., xjn) are two n-dimensional data objects. The calculation of distances between two instances represents the main computational cost of the Genetic Clustering Algorithm.
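Eq. (3) and the homogeneity term translate directly into code; a minimal sketch (NumPy assumed, names ours):

    import numpy as np

    def euclidean(xi, xj):
        # Eq. (3): Euclidean distance between two n-dimensional objects.
        return float(np.sqrt(np.sum((np.asarray(xi) - np.asarray(xj)) ** 2)))

    def homogeneity(data, members):
        # Within-cluster homogeneity as described above: the sum of the
        # distances over all pairs of objects in one cluster.
        return sum(euclidean(data[a], data[b])
                   for idx, a in enumerate(members)
                   for b in members[idx + 1:])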

V. EXPERIMENTAL RESULTS

In order to evaluate the performance of the proposed Genetic Clustering Algorithm, we first applied the method to the Iris data set, whose true classes are known [14]. Performance was measured by accuracy, which is the proportion of objects that are correctly grouped together according to the true classes. To further investigate performance, an experimental study was carried out by repeatedly generating artificial data sets and calculating the average performance of the method.


The Iris data set is available in the UCI repository (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/) and includes 150 instances. There are three classes (Setosa, Versicolour, and Virginica), each represented by 50 instances. The class Setosa is linearly separable from the others, whereas the classes Versicolour and Virginica are not linearly separable. Four attributes (sepal and petal length and width) describe each instance. The sepal and petal areas were used as the attributes (variables): the sepal area is obtained by multiplying the sepal length by the sepal width, and the petal area is calculated in an analogous way. We applied the proposed Genetic Clustering Algorithm and K-means with k=3 to this data set without using class information. The result of the K-means algorithm is shown in Figure 1. The clustering accuracy of K-means is 87.4%, whereas the clustering accuracy of the proposed Genetic Clustering Algorithm is 97%. The Genetic Clustering Algorithm result is shown in Figure 2. Comparing the results, we observe that K-means wrongly grouped objects from two classes (Versicolour and Virginica).

Figure 1. Clustering using the K-means method

Figure 2. Clustering using the Genetic Clustering Algorithm
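The K-means baseline on the two area features is straightforward to reproduce; the sketch below uses scikit-learn (our choice of tooling, not the paper's implementation), and the exact accuracy obtained will depend on initialization:

    import numpy as np
    from itertools import permutations
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    iris = load_iris()
    sepal_area = iris.data[:, 0] * iris.data[:, 1]  # sepal length x width
    petal_area = iris.data[:, 2] * iris.data[:, 3]  # petal length x width
    X = np.column_stack([sepal_area, petal_area])

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Accuracy: best agreement with the true classes over the 3! relabelings.
    accuracy = max(np.mean(np.array([p[l] for l in labels]) == iris.target)
                   for p in permutations(range(3)))
    print(f"K-means accuracy on area features: {accuracy:.3f}")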

VI. CONCLUSIONS

As a fundamental problem and technique for data analysis, clustering has become increasingly important. Many clustering methods require the designer to provide the number of clusters as input. In this paper, we proposed a Genetic Clustering Algorithm for data clustering and compared it with K-means. The results of various experiments using artificial data sets show that the proposed algorithm performs better and efficiently finds accurate clusters.

ACKNOWLEDGMENT

The authors would like to thank Prof. Chothmal and Prof. P. K. Das for their thoughtful, constructive comments and suggestions.

REFERENCES

[1] D. E. Goldberg, "Genetic Algorithms in Search, Optimization and Machine Learning," Addison-Wesley, 1989.
[2] A. Banerjee and S. J. Louis, "A recursive clustering methodology using a genetic algorithm," IEEE Trans., 2007.
[3] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann, 2004.
[4] H. J. Lin, F. W. Yang, and Y. T. Kao, "An efficient GA-based clustering technique," Tamkang Journal of Science and Engineering, vol. 8, no. 2, pp. 113-122, 2005.
[5] C. L. Liu, "Introduction to Combinatorial Mathematics," McGraw-Hill, New York, 1968.
[6] S. Bandyopadhyay and U. Maulik, "An evolutionary technique based on K-means algorithm for optimal clustering in R^N," Information Sciences, vol. 146, no. 1-4, pp. 221-237, 2002.
[7] L. Y. Tseng and S. B. Yang, "A genetic approach to the automatic clustering problem," Pattern Recognition, vol. 34, no. 2, pp. 415-424, 2001.
[8] C. C. Lai, "A novel clustering approach using hierarchical genetic algorithms," Intelligent Automation and Soft Computing, vol. 11, no. 3, pp. 143-153, 2005.
[9] E. R. Hruschka and N. F. F. Ebecken, "A genetic algorithm for cluster analysis," Intelligent Data Analysis, vol. 7, no. 1, pp. 15-25, 2003.
[10] B. S. Everitt, S. Landau, and M. Leese, "Cluster Analysis," Arnold Publishers, London, 2001.
[11] L. Kaufman and P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis," Wiley Series in Probability and Mathematical Statistics, 1990.
[12] E. R. Hruschka, R. J. G. B. Campello, and L. N. de Castro, "Improving the efficiency of a clustering genetic algorithm," in Advances in Artificial Intelligence, IBERAMIA 2004, vol. 3315 of LNCS, pp. 861-870, 2004.
[13] G. P. Babu and M. N. Murty, "A near-optimal initial seed selection in K-means algorithm using a genetic algorithm," Pattern Recognition Letters, vol. 14, pp. 763-769, 1993.
[14] H.-S. Park and C.-H. Jun, "A simple and fast algorithm for K-medoids clustering," Expert Systems with Applications, pp. 3336-3341, 2009.
