
An Effective Clustering Algorithm for Data Mining*

Singh Vijendra and Kelkar Ashwini

Faculty of Engineering and Technology, Mody Institute of Technology and Science

Lakshmangarh, Sikar, Rajasthan, India

[email protected]

Sahoo Laxman 

Department of Computer Science and Engineering, Northern India Engineering College

Lucknow, UP, India

[email protected]

 

Abstract—This paper proposes an effective clustering algorithm for databases that serve as benchmark data sets in data mining applications. We present a Genetic Clustering Algorithm (GCA) that finds a globally optimal partition of a given data set into a specified number of clusters. The algorithm is distance-based and creates centroids. To evaluate the proposed algorithm, we use several artificial data sets and compare the results with those of K-means. Experimental results show that the proposed algorithm performs better and efficiently finds accurate clusters.

 Keywords- Clustering; K-means; Genetic algorithm.

I. INTRODUCTION

Clustering is the process of grouping a set of objects into clusters so that objects within a cluster are similar to each other but are dissimilar to objects in other clusters [3]. Clustering has been effectively applied in a variety of engineering and scientific disciplines such as psychology, biology, medicine, computer vision, communications, and remote sensing. Cluster analysis organizes data (a set of patterns; each pattern could be a vector of measurements) by abstracting the underlying structure. The grouping is done such that patterns within a group (cluster) are more similar to each other than patterns belonging to different groups. Thus, organization of data using cluster analysis employs some dissimilarity measure among the set of patterns. The dissimilarity measure is defined based on the data under analysis and the purpose of the analysis. Various types of clustering algorithms have been proposed to suit different requirements. Clustering algorithms can be broadly classified into hierarchical and partitioning algorithms based on the structure of abstraction. Hierarchical clustering algorithms construct a hierarchy of partitions, represented as a dendrogram, in which each partition is nested within the partition at the next level in the hierarchy. Partitioning clustering algorithms generate a single partition of the data, with a specified or estimated number of non-overlapping clusters, in an attempt to recover the natural groups present in the data. One of the important problems in partitioning clustering is to find a partition of the given data, with a specified number of clusters, that minimizes the total within-cluster variation. Unfortunately, in many real-life cases the number of clusters in a data set is not known a priori. Under this condition, how to automatically determine the number of clusters and find the clustering partition becomes a challenge.

In this regard, some attempts have been made to use genetic algorithms for automatically clustering data sets [2]. Genetic algorithms (GAs) work on a coding of the parameter set over which the search has to be performed, rather than on the parameters themselves [1]. These encoded parameters are called solutions or chromosomes, and the objective function value at a solution is the objective function value at the corresponding parameters. GAs solve optimization problems using a population of a fixed number of solutions, called the population size. A solution consists of a string of symbols, typically binary symbols. GAs evolve over generations. During each generation, they produce a new population from the current population by applying genetic operators, viz., natural selection, crossover, and mutation. Each solution in the population is associated with a figure of merit (fitness value) depending on the value of the function to be optimized. The selection operator selects a solution from the current population for the next population with probability proportional to its fitness value. Crossover operates on two solution strings and results in another two strings. A typical crossover operator exchanges the segments of the selected strings across a crossover point with some probability. The mutation operator toggles each position in a string with a probability called the mutation probability.

Bandyopadhyay and Maulik [6] applied a variable-string-length genetic algorithm, with real encoding of the coordinates of the cluster centers in the chromosome, to the clustering problem. Tseng and Yang [7] proposed a genetic-algorithm-based approach for the clustering problem; their method consists of two stages, nearest-neighbor clustering and genetic optimization. Lin et al. [4] presented a genetic clustering algorithm based on a binary chromosome representation; the proposed method selects the cluster centers directly from the data set. Lai [8] adopted a hierarchical genetic algorithm to solve the clustering problem; in that method, the chromosome consists of two types of genes, control genes and parametric genes.


II. CLUSTERING PROBLEM

Clustering is a formal study of algorithms and methods for classifying objects without category labels. A cluster is a set of objects that are alike, and objects from different clusters are not alike. The set of n objects X = {X1, X2, ..., Xn} is to be clustered. Each Xi ∈ R^p is an attribute vector consisting of p real measurements describing the object. The objects are to be clustered into non-overlapping groups C = {C1, C2, ..., Ck} (C is known as a clustering), where k is the number of clusters, C1 ∪ C2 ∪ ... ∪ Ck = X, Ci ≠ ∅, and Ci ∩ Cj = ∅ for i ≠ j.

The objects within each group should be more similar to each other than to objects in any other group, and the value of k may be unknown. If k is known, the problem is referred to as the k-clustering problem. Many methods described in the literature assume that k is given by the user [12]; these methods search for k clusters according to a predefined criterion. The number of ways of sorting N objects into k clusters is then given by Liu [5]:

    NW(N, k) = (1/k!) Σ_{i=1}^{k} (-1)^(k-i) C(k, i) i^N    (1)

Thus, there are a large number of possible partitions even for moderate N and k (e.g., NW(25, 5) ≈ 2.5×10^15), and the complete enumeration of every possible partition is simply not possible [10]. In other words, it is not easy to find the best partitioning even assuming that k is known, and this is rarely the case in practice. A usual approach is to run a clustering algorithm several times and, based on the obtained results, choose the value of k that provides the most natural clustering. This strategy assumes domain knowledge and usually has the disadvantage of searching for the best solution in a small subset of the search space. Consequently, these methods have, in general, low probabilities of success. Another alternative involves optimizing k according to numeric criteria. In this case, k is unknown, and the number of ways of grouping N instances into k clusters, considering S different scenarios (each one resulting from a different k), is [5]:

    Σ_{k=1}^{S} NW(N, k)    (2)
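These counts are easy to check directly. The following minimal Python sketch (an illustration added here, not part of the original paper) evaluates Eq. (1) and reproduces the NW(25, 5) figure quoted above:

    from math import comb, factorial

    def nw(n, k):
        # Number of ways of sorting n objects into k non-empty clusters,
        # per Eq. (1); this is the Stirling number of the second kind.
        return sum((-1) ** (k - i) * comb(k, i) * i ** n
                   for i in range(1, k + 1)) // factorial(k)

    print(nw(25, 5))  # 2436684974110751, i.e. roughly 2.5 x 10^15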

The problem of finding an optimal solution to the partition of N data into k clusters is NP-complete [11] and, given that the number of distinct partitions of N instances into k clusters increases approximately as k^N/k!, attempting to find a globally optimum solution is usually not computationally feasible. This difficulty has stimulated the search for efficient approximation algorithms. Furthermore, traditional clustering algorithms search a relatively small subset of the solution space (these subsets are defined by the number of clusters, the clustering criteria, and the clustering method). Consequently, the probability of success of these methods is small. Algorithms such as single-linkage are deterministic and will repeatedly find the same solution for a given data set, whereas algorithms such as k-means conduct a local search starting from an initial partition. In each case, the solution may be a local optimum, which is not necessarily the global solution. This is exacerbated when the solution space is very large.

Clearly, we need an algorithm with the potential to search large solution spaces effectively. Genetic algorithms have been widely employed for optimization problems in several domains. Their success lies in their ability to span a large subset of the search space.

III. GENETIC CLUSTERING ALGORITHM

We propose a genetic algorithm for the problem of k-clustering, where the required number of clusters is known. Various adaptations are used to enable the GA to cluster and to enhance its performance. Further, the Genetic Clustering Algorithm is tested on databases that are benchmarks for data mining applications, and heuristics are added to enable the GA to cope with a larger number of objects. The design of a genetic algorithm for the clustering problem falls into the following areas: representation, fitness function, operators, and parameter values.

A. Representation

Genetic representations for clustering or grouping problems are based on an underlying scheme. The scheme represents the objects with gene values, and the positions of these genes signify how the objects are divided amongst the clusters.

The use of a simple encoding scheme causes problems of redundant codification and context insensitivity [1]. This has led researchers to devise complicated representations and specialized operators for clustering problems [13]. Cluster-label-based encoding over n genes is simple compared to parameterization of prototype locations. In such a representation, many genotypes translate to a unique phenotype. The notion of cluster labels built into the representation makes little intuitive sense. Such representations have spawned a set of pre-treatment methodologies to make the representations suitable for genetic operators.

Let us consider a dataset formed by N instances. Then, a genotype is an integer vector of (N+1) positions. Each position corresponds to an instance, i.e., the i-th position (gene) represents the i-th instance, whereas the last gene represents the number of clusters (k) [9]. Thus, each gene has a value over the alphabet {1, 2, 3, ..., k}. For instance, in a dataset composed of 20 instances, a possible genotype is:

Genotype: 1123245125432533424 5

In this case, three instances {1, 2, 8} form the cluster whose label is 1. The cluster whose label is 2 has 5 instances {3, 5, 9, 13, 18}, and so on. Standard genetic operators are usually not suitable for clustering problems for several reasons. First, the encoding scheme presented above is naturally redundant, i.e., the encoding is one-to-many. In fact, there are k! different genotypes that represent the same solution.
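To make the encoding concrete, the following Python sketch (illustrative; the helper name is ours) decodes such a genotype into its clusters, using the digits of the example above exactly as printed:

    from collections import defaultdict

    def decode(genotype):
        # Map a label-based genotype (whose last gene is k) to a dict from
        # cluster label to the set of 1-based instance indices it contains.
        labels, k = genotype[:-1], genotype[-1]
        clusters = defaultdict(set)
        for i, label in enumerate(labels, start=1):
            clusters[label].add(i)
        return dict(clusters), k

    g = [1, 1, 2, 3, 2, 4, 5, 1, 2, 5, 4, 3, 2, 5, 3, 3, 4, 2, 4] + [5]
    clusters, k = decode(g)
    print(clusters[1])  # {1, 2, 8}
    print(clusters[2])  # {3, 5, 9, 13, 18}
    # Relabelling clusters (say, swapping labels 1 and 2 everywhere) gives a
    # different genotype but the identical partition: the k! redundancy.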


Thus, the size of the search space the genetic algorithm has to search is much larger than the original space of solutions. This augmented space may reduce the efficiency of the genetic algorithm. In addition, the redundant encoding causes the undesirable effect of casting context-dependent information out of context under the standard crossover, i.e., equal parents can originate different offspring. For this reason, the development of genetic operators specially designed for clustering problems has been investigated [10, 9]. In this context, the Genetic Clustering Algorithm operators proposed in [9] are of particular interest since they operate on constant-length chromosomes.

B. Fitness Function

Objective functions used for traditional clustering algorithms can act as fitness functions for genetic clustering algorithms. However, if the optimal clustering corresponds to the minimal objective function value, we will need to transform the objective function value, since GAs work to maximize their fitness values [1]. In addition, fitness values in a GA need to be positive if we are using fitness-proportional selection.
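For instance, an objective to be minimized can be turned into a positive, maximizable fitness in several standard ways; one common choice (our assumption, not prescribed by the paper) is:

    def fitness(objective_value):
        # Positive fitness from a non-negative objective to be minimized:
        # lower objective -> higher fitness, and fitness > 0 always.
        return 1.0 / (1.0 + objective_value)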

C. Genetic Operators

The operators pass genetic information between subsequent generations of the population. As a result, operators need to be matched with or designed for the representation, so that the offspring are valid and inherit characteristics from their parents. Operators used for genetic clustering or grouping include the selection, crossover, and mutation methods below.

1) Selection

Chromosomes are selected for reproduction based on their relative fitness. Thus the representation is not a factor when choosing an appropriate selection operator, but the fitness function is. If all fitness values are positive, and the maximum fitness value corresponds to the optimal clustering, then fitness-proportional selection may be appropriate. Otherwise, a ranking selection method may be used.

In the proposed Genetic Clustering Algorithm, the genotypes of each generation are selected according to the roulette wheel strategy [1], which does not admit negative objective function values. For this reason, a constant equal to one is added to the objective function before the selection procedure takes place. The genotype with the highest fitness is always copied into the succeeding generation.
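A minimal sketch of this selection step, assuming the shifted objective value is the quantity to be maximized (the function name and structure are ours):

    import random

    def roulette_select(population, objective_values):
        # Roulette-wheel selection with the +1 shift and elitism described
        # above: the fittest genotype is always carried over, and the rest
        # are drawn with probability proportional to (objective value + 1).
        weights = [v + 1.0 for v in objective_values]
        best = max(range(len(population)), key=lambda i: objective_values[i])
        survivors = [population[best]]  # elitism
        survivors += random.choices(population, weights=weights,
                                    k=len(population) - 1)
        return survivors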

2) Crossover

The crossover operator is designed to transfer genetic material from one generation to the next. The major concerns with this operator are validity and context insensitivity. It may be necessary to check that the offspring produced by a certain operator are valid and to reject any invalid chromosomes.

The proposed Genetic Clustering Algorithm crossover operator combines clustering solutions coming from different genotypes. It works in the following way. First, two genotypes (G1 and G2) are selected. Then, assuming that G1 represents k1 clusters, the Genetic Clustering Algorithm randomly chooses c ∈ {1, 2, ..., k1} clusters to copy into G2. The unchanged clusters of G2 are maintained, and the changed ones have their instances allocated to the corresponding nearest clusters (according to their centroids). In this way, an offspring G3 is obtained. The same procedure is employed to get an offspring G4, but now considering that the changed clusters of G2 are copied into G1. Note that, although the crossover operator usually produces offspring whose number of clusters is neither smaller nor larger than those of their parents, it is able to increase or decrease the number of clusters.
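A sketch of this crossover under our reading of the description, producing one offspring G3 (G4 is obtained symmetrically); the helper names are ours and NumPy is assumed:

    import random
    import numpy as np

    def gca_crossover(g1, g2, data):
        # One offspring (G3). Genotypes are label lists whose last gene is
        # k; data is an (N, p) NumPy array of instances.
        n, k1 = len(data), g1[-1]
        chosen = set(random.sample(range(1, k1 + 1), random.randint(1, k1)))
        # Step 1: instances of the chosen G1 clusters are copied into the child.
        child = [('G1', g1[i]) if g1[i] in chosen else None for i in range(n)]
        # Step 2: G2 clusters that lost no instance in step 1 are kept as-is.
        changed = {g2[i] for i in range(n) if child[i] is not None}
        for i in range(n):
            if child[i] is None and g2[i] not in changed:
                child[i] = ('G2', g2[i])
        # Step 3: instances of the changed G2 clusters are allocated to the
        # nearest remaining cluster, according to the cluster centroids.
        members = {}
        for i, tag in enumerate(child):
            if tag is not None:
                members.setdefault(tag, []).append(i)
        cents = {tag: data[m].mean(axis=0) for tag, m in members.items()}
        for i in range(n):
            if child[i] is None:
                child[i] = min(cents,
                               key=lambda t: np.linalg.norm(data[i] - cents[t]))
        # Renumber the surviving clusters 1..k3 and append k3 as the last gene.
        relabel = {t: j + 1 for j, t in enumerate(sorted(set(child)))}
        return [relabel[t] for t in child] + [len(relabel)]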

3) Mutation

Mutation introduces new genetic material into the population. In a clustering context, this corresponds to moving an object from one cluster to another. Two mutation operators are used in the Genetic Clustering Algorithm.

The first operator works only on genotypes that encode more than two clusters. It eliminates a randomly chosen cluster, placing its instances into the nearest remaining clusters (according to their centroids). The second operator divides a randomly selected cluster into two new ones. The first cluster is formed by the instances closer to the original centroid, whereas the other cluster is formed by those instances closer to the farthest instance from the centroid.
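Sketches of the two mutation operators under this description (the names are ours; NumPy is assumed):

    import random
    import numpy as np

    def eliminate_cluster(genotype, data):
        # First operator (applies only when k > 2): a randomly chosen
        # cluster is removed and its instances are moved to the nearest
        # remaining centroid; labels are then renumbered 1..k-1.
        labels, k = list(genotype[:-1]), genotype[-1]
        if k <= 2:
            return genotype
        victim = random.randint(1, k)
        cents = {c: data[[i for i, l in enumerate(labels) if l == c]].mean(axis=0)
                 for c in range(1, k + 1) if c != victim}
        labels = [min(cents, key=lambda c: np.linalg.norm(data[i] - cents[c]))
                  if l == victim else l
                  for i, l in enumerate(labels)]
        relabel = {c: j + 1 for j, c in enumerate(sorted(set(labels)))}
        return [relabel[l] for l in labels] + [k - 1]

    def split_cluster(genotype, data):
        # Second operator: a randomly selected cluster is split in two; the
        # instances closer to the original centroid stay, the rest form a
        # new cluster seeded by the instance farthest from that centroid.
        labels, k = list(genotype[:-1]), genotype[-1]
        target = random.randint(1, k)
        members = [i for i, l in enumerate(labels) if l == target]
        cent = data[members].mean(axis=0)
        far = max(members, key=lambda i: np.linalg.norm(data[i] - cent))
        for i in members:
            if np.linalg.norm(data[i] - data[far]) < np.linalg.norm(data[i] - cent):
                labels[i] = k + 1  # closer to the far seed: new cluster
        return labels + [k + 1]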

IV. OBJECTIVE FUNCTION

The objective function evaluates the fitness of individual strings. Almost all partition evaluation functions provide some kind of measure of inter-cluster isolation and/or intra-cluster homogeneity. For a good partition, there should be appreciable inter-cluster isolation and intra-cluster homogeneity. The homogeneity within a cluster is calculated as the sum of the distances between all pairs of objects within the cluster. We use an objective function based on the Euclidean distance [3]:

    d(Xi, Xj) = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xin - xjn)^2 )    (3)

where Xi = (xi1, xi2, ..., xin) and Xj = (xj1, xj2, ..., xjn) are two n-dimensional data objects. The calculation of distances between two instances represents the main computational cost of the Genetic Clustering Algorithm.
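Eq. (3) and the homogeneity term translate directly into code; a minimal sketch (NumPy assumed, names ours):

    import numpy as np

    def euclidean(xi, xj):
        # Eq. (3): Euclidean distance between two n-dimensional objects.
        return float(np.sqrt(np.sum((np.asarray(xi) - np.asarray(xj)) ** 2)))

    def homogeneity(data, members):
        # Within-cluster homogeneity as described above: the sum of the
        # distances over all pairs of objects in one cluster.
        return sum(euclidean(data[a], data[b])
                   for idx, a in enumerate(members)
                   for b in members[idx + 1:])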

V. EXPERIMENTAL RESULTS

In order to evaluate the performance of the proposed Genetic Clustering Algorithm, we first applied the method to the Iris data set, whose true classes are known [14]. Performance was measured by accuracy, which is the proportion of objects that are correctly grouped together according to the true classes. To further investigate performance, an experimental study was carried out by repeatedly generating artificial data sets and calculating the average performance of the method.


The Iris data set is available in the UCI repository (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/) and includes 150 instances. There are three classes (Setosa, Versicolour, and Virginica), each represented by 50 instances. The class Setosa is linearly separable from the others, whereas the classes Versicolour and Virginica are not linearly separable. Four attributes (sepal and petal length and width) describe each instance. The sepal and petal areas were used as the attributes (variables): the sepal area is obtained by multiplying the sepal length by the sepal width, and the petal area is calculated in an analogous way. We applied the proposed Genetic Clustering Algorithm and K-means with k=3 to this data set without using class information. The result of the K-means algorithm is shown in Figure 1. The clustering accuracy of K-means is 87.4%, whereas the clustering accuracy of the proposed Genetic Clustering Algorithm is 97%. The Genetic Clustering Algorithm result is shown in Figure 2. Comparing the results, we observe that K-means wrongly grouped objects from two classes (Versicolour and Virginica).

Figure 1. Clustering using the K-means method

Figure 2. Clustering using the Genetic Clustering Algorithm
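The K-means baseline on the two area features is straightforward to reproduce; the sketch below uses scikit-learn (our choice of tooling, not the paper's implementation), and the exact accuracy obtained will depend on initialization:

    import numpy as np
    from itertools import permutations
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    iris = load_iris()
    sepal_area = iris.data[:, 0] * iris.data[:, 1]  # sepal length x width
    petal_area = iris.data[:, 2] * iris.data[:, 3]  # petal length x width
    X = np.column_stack([sepal_area, petal_area])

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Accuracy: best agreement with the true classes over the 3! relabelings.
    accuracy = max(np.mean(np.array([p[l] for l in labels]) == iris.target)
                   for p in permutations(range(3)))
    print(f"K-means accuracy on area features: {accuracy:.3f}")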

VI. CONCLUSIONS

As a fundamental problem and technique for data analysis, clustering has become increasingly important. Many clustering methods require the designer to provide the number of clusters as input. In this paper, we proposed a Genetic Clustering Algorithm for data clustering and compared it with K-means. The results of various experiments using artificial data sets show that the proposed algorithm performs better and efficiently finds accurate clusters.

ACKNOWLEDGMENT

The authors would like to thank Prof. Chothmal and Prof. P. K. Das for their thoughtful, constructive comments and suggestions.

REFERENCES

[1] D. E. Goldberg, "Genetic Algorithms in Search, Optimization and Machine Learning," Addison-Wesley, 1989.
[2] A. Banerjee and S. J. Louis, "A recursive clustering methodology using a genetic algorithm," IEEE Trans., 2007.
[3] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann, 2004.
[4] H. J. Lin, F. W. Yang, and Y. T. Kao, "An efficient GA-based clustering technique," Tamkang Journal of Science and Engineering, vol. 8, no. 2, pp. 113-122, 2005.
[5] C. L. Liu, "Introduction to Combinatorial Mathematics," McGraw-Hill, New York, 1968.
[6] S. Bandyopadhyay and U. Maulik, "An evolutionary technique based on K-means algorithm for optimal clustering in R^N," Information Sciences, vol. 146, no. 1-4, pp. 221-237, 2002.
[7] L. Y. Tseng and S. B. Yang, "A genetic approach to the automatic clustering problem," Pattern Recognition, vol. 34, no. 2, pp. 415-424, 2001.
[8] C. C. Lai, "A novel clustering approach using hierarchical genetic algorithms," Intelligent Automation and Soft Computing, vol. 11, no. 3, pp. 143-153, 2005.
[9] E. R. Hruschka and N. F. F. Ebecken, "A genetic algorithm for cluster analysis," Intelligent Data Analysis, vol. 7, no. 1, pp. 15-25, 2003.
[10] B. S. Everitt, S. Landau, and M. Leese, "Cluster Analysis," Arnold Publishers, London, 2001.
[11] L. Kaufman and P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis," Wiley Series in Probability and Mathematical Statistics, 1990.
[12] E. R. Hruschka, R. J. G. B. Campello, and L. N. de Castro, "Improving the efficiency of a clustering genetic algorithm," in Advances in Artificial Intelligence, IBERAMIA 2004, vol. 3315 of LNCS, pp. 861-870, 2004.
[13] G. P. Babu and M. N. Murty, "A near-optimal initial seed selection in K-means algorithm using a genetic algorithm," Pattern Recognition Letters, vol. 14, pp. 763-769, 1993.
[14] H.-S. Park and C.-H. Jun, "A simple and fast algorithm for K-medoids clustering," Expert Systems with Applications, pp. 3336-3341, 2009.
